Rodolfo M. Raya (rmraya@maxprograms.com)
Chief Technical Officer, Maxprograms
Exchanging glossary data without character corruption and without being locked into proprietary formats is a problem that affects the translation/localization industry. This article presents GlossML, an open XML vocabulary designed specifically to facilitate the exchange of glossaries.
Webster's College dictionary defines glossary as a list of terms in a special subject, field or area of usage, with accompanying definitions. Glossaries are invaluable tools for translators, as they help in the selection of appropriate terms during the translation process.
A bilingual glossary is a list of terms in one language that are defined in a second language or glossed by synonyms (or at least near-synonyms) in that language. This kind of glossary is quite common in the translation/localization industry.
Contracts and technical manuals normally contain terms with specific meanings that pertain to a particular context. When a document is sent to a translator, it is usually accompanied by a glossary and, if preferred translations are included in the glossary, translators are expected to use the translations provided.
Without a good glossary, translators can turn a complex technical manual into a worthless document.
A glossary can be written in any format. A Microsoft Excel spreadsheet with two columns is probably the most common format used for storing bilingual glossaries. Glossaries are normally exchanged in native Excel format (*.xls) or as a CSV (Comma Separated Values) file exported from Excel.
Exchanging glossaries in Excel or CSV format is inconvenient due to portability issues. Excel saves files using character sets that are dependent on the version of Excel and the operating system used, often resulting in character corruption when glossaries are read under a different combination. Excel is only available on Windows and macOS; translators who use Linux may also encounter conversion problems when reading Excel documents using other tools.
CSV files are somewhat better than Excel files when exported using Unicode, but the choice of text delimiters and column separators introduces a new set of problems when reading the exchanged file. Characters used as delimiters must be escaped when they appear in the text, but there is no single standard way of doing this. Some tools escape these characters by doubling them and others put a backslash (\) before them.
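The escaping ambiguity can be reproduced with a small Python sketch based on the standard csv module. The sample row and the helper names write_row and read_row are illustrative assumptions; the point is only that a file written with one convention and read with the other does not return the original terms.

```python
import csv
import io

# A glossary row whose fields contain the quote character (illustrative data).
row = ['12" disk', 'disco de 12"']

def write_row(fields, **options):
    """Serialize one row to CSV text using the given dialect options."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **options).writerow(fields)
    return buffer.getvalue()

def read_row(text, **options):
    """Parse the first row of CSV text using the given dialect options."""
    return next(csv.reader(io.StringIO(text), **options))

# Convention 1: double the quote character (the csv module's default).
doubled_text = write_row(row)

# Convention 2: put a backslash before the quote character.
backslash = {"doublequote": False, "escapechar": "\\"}
backslashed_text = write_row(row, **backslash)

# Reading with the matching convention restores the terms.
print(read_row(doubled_text) == row)
print(read_row(backslashed_text, **backslash) == row)

# Reading the backslash-escaped file with the default convention does not.
print(read_row(backslashed_text) == row)
```

Nothing in the file itself says which convention was used, so the receiving tool has to guess.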
If a glossary is made available to a translator or translation agency, it should be in a useful format. XML solves the problems of Excel (character set declared in the header of the file) and CSV (text clearly delimited in XML elements). However, at this moment, there is no open standard specifically designed for representing glossaries in XML format.
There are several standards for exchanging terminology information, like OLIF (Open Lexicon Interchange Format), MARTIF (Machine-Readable Terminology Interchange Format - ISO 12200) and TBX (TermBase eXchange). A subset of one of them could arguably be used for holding glossaries, but even in their simplest form they are too complex for representing simple bilingual glossaries like the ones normally used by translators.
Some translation tool vendors opted for proprietary solutions for exchanging terminology and glossaries. Translators who use these tools cannot exchange their data with colleagues who work with different software, as the format used for containing glosses and terms is secret.
Other translation tool vendors opted for adapting MARTIF, OLIF or TBX to their particular needs. Unfortunately, most of them did it in their own ways, without documenting their custom changes. As a result, exchange using these open standards is also only possible between users of the same tool.
A solution to the problem of glossary exchange should:
GlossML is an XML-based vocabulary specifically designed for containing glossaries that can be used for storing monolingual and multilingual lists of terms and, optionally, their definitions.
The GlossML specification and related materials (XML Schema and examples) are licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License. This means that anyone can use and distribute the GlossML format without paying royalties of any kind.
A distinctive aspect of the GlossML vocabulary is its extreme simplicity. It only has 6 elements and 4 attributes. This is possible because it focuses solely on holding glossary data. It is not intended for terminology exchange.
Listing 1 below shows you a sample bilingual GlossML file. This example is also available for download in the Resources section.
<?xml version="1.0" encoding="UTF-8"?>
<glossary version="1.0" srclang="en-US"
    xmlns="https://www.maxprograms.com/gml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://www.maxprograms.com/gml GlossML.xsd">
  <comment>Sample bilingual glossary</comment>
  <glossentry>
    <langentry xml:lang="en-US">
      <term>structure</term>
      <definition source="Merriam Webster">the manner in which something is constructed</definition>
    </langentry>
    <langentry xml:lang="es">
      <term>estructura</term>
    </langentry>
  </glossentry>
  <glossentry>
    <comment from="RMR">This entry doesn't refer to statistics as a science</comment>
    <langentry xml:lang="en-US">
      <term>statistic</term>
      <definition source="Merriam Webster">a numerical fact or datum, esp. one computed from a sample</definition>
    </langentry>
    <langentry xml:lang="es">
      <term>estadística</term>
      <definition source="Larousse">cuadro numérico de un hecho que se presta a la estadística</definition>
    </langentry>
  </glossentry>
</glossary>
Listing 1. Sample GlossML File
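As a rough illustration of how little code a consumer of this format needs, the following Python sketch reads the terms of each entry with the standard library's xml.etree.ElementTree. The function name read_entries and the shortened sample are assumptions made for this example; they are not part of the GlossML specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sample file, mapped to an arbitrary local prefix.
GML_NS = {"g": "https://www.maxprograms.com/gml"}
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened version of the sample glossary (definitions and comments omitted).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<glossary version="1.0" srclang="en-US" xmlns="https://www.maxprograms.com/gml">
  <glossentry>
    <langentry xml:lang="en-US"><term>structure</term></langentry>
    <langentry xml:lang="es"><term>estructura</term></langentry>
  </glossentry>
  <glossentry>
    <langentry xml:lang="en-US"><term>statistic</term></langentry>
    <langentry xml:lang="es"><term>estadística</term></langentry>
  </glossentry>
</glossary>
"""

def read_entries(xml_bytes):
    """Return one {language: term} dictionary per <glossentry>."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for glossentry in root.findall("g:glossentry", GML_NS):
        terms = {}
        for langentry in glossentry.findall("g:langentry", GML_NS):
            lang = langentry.get(XML_LANG)
            terms[lang] = langentry.findtext("g:term", namespaces=GML_NS)
        entries.append(terms)
    return entries

# The parser honours the encoding declared in the XML header.
for entry in read_entries(SAMPLE.encode("utf-8")):
    print(entry)
```

Because the encoding is declared in the file and every term sits in its own element, no delimiter or character set has to be guessed.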
As the GlossML grammar is defined with an XML Schema, glossaries in this format can be embedded in other XML vocabularies. For example, a GlossML glossary can be included in an XLIFF (XML Localization Interchange File Format) document in the following manner:
<?xml version="1.0" encoding="UTF-8" ?>
<xliff version="1.2"
    xmlns="urn:oasis:names:tc:xliff:document:1.2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 xliff-core-1.2-transitional.xsd"
    xmlns:gls="https://www.maxprograms.com/gml">
  <file datatype="javalistresourcebundle" original="capi.properties" source-language="en">
    <header>
      <gls:glossary version="1.0" srclang="en">
        ... GlossML data ...
      </gls:glossary>
    </header>
    <body>
      ... XLIFF data ...
    </body>
  </file>
</xliff>
Listing 2. Embedding a GlossML Document in an XLIFF File
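A tool that receives such a file can locate the embedded glossary by its namespace, whatever prefix the XLIFF author chose. The Python sketch below assumes a small hand-written XLIFF document with one glossary entry; the function name embedded_glossaries and the sample content are illustrative only.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the GlossML root element.
GML_GLOSSARY = "{https://www.maxprograms.com/gml}glossary"
GML_TERM = "{https://www.maxprograms.com/gml}term"

# Hypothetical XLIFF file with one embedded glossary entry.
XLIFF_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:gls="https://www.maxprograms.com/gml">
  <file datatype="javalistresourcebundle" original="capi.properties"
        source-language="en">
    <header>
      <gls:glossary version="1.0" srclang="en">
        <gls:glossentry>
          <gls:langentry xml:lang="en"><gls:term>structure</gls:term></gls:langentry>
          <gls:langentry xml:lang="es"><gls:term>estructura</gls:term></gls:langentry>
        </gls:glossentry>
      </gls:glossary>
    </header>
    <body/>
  </file>
</xliff>
"""

def embedded_glossaries(xliff_bytes):
    """Return every GlossML <glossary> element found in an XLIFF document."""
    root = ET.fromstring(xliff_bytes)
    # The lookup uses the namespace URI, so the "gls" prefix is irrelevant.
    return list(root.iter(GML_GLOSSARY))

for glossary in embedded_glossaries(XLIFF_SAMPLE.encode("utf-8")):
    terms = [term.text for term in glossary.iter(GML_TERM)]
    print(glossary.get("srclang"), terms)
```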
The root of a GlossML glossary is the <glossary> element and the main language used in the glossary is declared in the required "srclang" attribute.
A <glossary> element contains one optional <comment> and one or more <glossentry> elements. A <glossentry> element contains a term and its optional translations into one or more languages. Terms are stored in <langentry> elements, which contain the term text in one <term> element and an optional definition in a <definition> element.
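The structure described above is simple enough to generate with a few lines of code. The following Python sketch builds a glossary from plain data using xml.etree.ElementTree; the function name build_glossary and the shape of its input (one dictionary per entry, mapping a language code to a term and an optional definition) are assumptions made for this example, and the output is not validated against the GlossML schema.

```python
import xml.etree.ElementTree as ET

GML_NS = "https://www.maxprograms.com/gml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize GlossML as the default namespace (this setting is process-wide).
ET.register_namespace("", GML_NS)

def gml(tag):
    """Qualify a GlossML element name with its namespace."""
    return f"{{{GML_NS}}}{tag}"

def build_glossary(srclang, entries, comment=None):
    """Build a GlossML document and return it as UTF-8 encoded bytes.

    entries: list of {language: (term, definition or None)} dictionaries.
    """
    root = ET.Element(gml("glossary"), {"version": "1.0", "srclang": srclang})
    if comment:
        ET.SubElement(root, gml("comment")).text = comment
    for entry in entries:
        glossentry = ET.SubElement(root, gml("glossentry"))
        for lang, (term, definition) in entry.items():
            langentry = ET.SubElement(glossentry, gml("langentry"), {XML_LANG: lang})
            ET.SubElement(langentry, gml("term")).text = term
            if definition:
                ET.SubElement(langentry, gml("definition")).text = definition
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)

document = build_glossary(
    "en-US",
    [{"en-US": ("structure", "the manner in which something is constructed"),
      "es": ("estructura", None)}],
    comment="Sample bilingual glossary",
)
print(document.decode("utf-8"))
```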
Figure 1 below indicates the hierarchical relationship of GlossML elements.
A major advantage of GlossML is that, due to its simplicity, conversion from GlossML to other vocabularies like TMX or TBX is easy to achieve using XSL transformations. The Resources section below has links to XSL Stylesheets for doing such conversions.
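To give an idea of the mapping involved, here is a minimal sketch of a GlossML to TMX conversion. It is written in Python with xml.etree.ElementTree rather than XSL, and it is not one of the stylesheets mentioned above. The function name glossml_to_tmx and the header values (creationtool, segtype and so on) are placeholder assumptions, and definitions and comments are simply dropped.

```python
import xml.etree.ElementTree as ET

GML = "{https://www.maxprograms.com/gml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened sample glossary used as input for the sketch.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<glossary version="1.0" srclang="en-US" xmlns="https://www.maxprograms.com/gml">
  <glossentry>
    <langentry xml:lang="en-US"><term>structure</term></langentry>
    <langentry xml:lang="es"><term>estructura</term></langentry>
  </glossentry>
</glossary>
"""

def glossml_to_tmx(gml_bytes):
    """Map each <glossentry> to a TMX <tu> and each <langentry> to a <tuv>."""
    glossary = ET.fromstring(gml_bytes)
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "glossml-sketch",  # placeholder value
        "creationtoolversion": "0.1",      # placeholder value
        "segtype": "phrase",
        "o-tmf": "GlossML",
        "adminlang": "en",
        "srclang": glossary.get("srclang", "*all*"),
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for glossentry in glossary.iter(GML + "glossentry"):
        tu = ET.SubElement(body, "tu")
        for langentry in glossentry.iter(GML + "langentry"):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: langentry.get(XML_LANG)})
            # Only the term text is carried over; definitions are ignored.
            ET.SubElement(tuv, "seg").text = langentry.findtext(GML + "term")
    return ET.tostring(tmx, encoding="UTF-8", xml_declaration=True)

print(glossml_to_tmx(SAMPLE.encode("utf-8")).decode("utf-8"))
```

The one-to-one correspondence between entries and translation units is what keeps the conversion short.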
There is a clear need in the localization/translation industry for a simple XML-based representation of glossaries. GlossML fills the void, providing an open format that can be used by anyone in commercial or open source applications.
Hopefully, one day translation tool vendors will simplify the exchange of glossary data between users of different translation tools. GlossML could be the first step in this direction.
Rodolfo Raya is Maxprograms' CTO (Chief Technical Officer), where he develops multi-platform translation/localisation and content publishing tools using XML and Java technology. He can be reached at rmraya@maxprograms.com.