Introduction to GlossML

Glossary Markup Language: an XML Representation for Glossaries

Rodolfo M. Raya (rmraya@maxprograms.com)
Chief Technical Officer, Maxprograms.

Exchanging glossary data without corrupting characters or being locked into proprietary formats is a persistent problem in the translation/localization industry. This article presents GlossML, an open XML vocabulary designed specifically to facilitate the exchange of glossaries.

What's a glossary?

Webster's College Dictionary defines a glossary as a list of terms in a special subject, field or area of usage, with accompanying definitions. Glossaries are invaluable tools for translators, as they help in the selection of appropriate terms during the translation process.

A bilingual glossary is a list of terms in one language which are defined in a second language or glossed by synonyms (or at least near-synonyms) in another language. This kind of glossary is quite common in the translation/localization industry.

Contracts and technical manuals normally contain terms with specific meanings that pertain to a particular context. When a document is sent to a translator, it is usually accompanied by a glossary and, if preferred translations are included in the glossary, translators are expected to use the translations provided.

Without a good glossary, translators can turn a complex technical manual into a worthless document.

Glossary Exchange Problems

A glossary can be written in any format. A Microsoft Excel spreadsheet with two columns is probably the most common format used for storing bilingual glossaries. Glossaries are normally exchanged in native Excel format (*.xls) or as a CSV (Comma Separated Values) file exported from Excel.

Exchanging glossaries in Excel or CSV format is inconvenient due to portability issues. Excel saves files using character sets that are dependent on the version of Excel and the operating system used, often resulting in character corruption when glossaries are read under a different combination. Excel is only available on Windows and Mac OS X; translators who use Linux may also encounter conversion problems when reading Excel documents using other tools.

CSV files are somewhat better than native Excel files when exported as Unicode, but the choice of text delimiters and column separators introduces a new set of problems when reading the exchanged file. Characters used as delimiters must be escaped when they appear in the text, but there is no single agreed way of doing this: some tools escape such characters by doubling them, while others put a backslash (\) before them.
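The escaping mismatch is easy to reproduce. The following minimal sketch uses Python's standard csv module to write the same glossary row with both conventions and then reads the backslash-escaped text as if quotes had been doubled. The sample row and the helper names write_row and read_row are illustrative assumptions, not taken from any particular translation tool.

```python
import csv
import io

# One glossary row whose first field contains a double quote (illustrative data).
row = ['12" disk', "disco de 12 pulgadas"]


def write_row(fields, **fmt):
    """Serialize one row to CSV text using the given formatting options."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(fields)
    return buf.getvalue()


def read_row(text, **fmt):
    """Parse the first row of CSV text using the given formatting options."""
    return next(csv.reader(io.StringIO(text), **fmt))


# Convention A: escape a quote character by doubling it.
doubled = write_row(row, doublequote=True)
# Convention B: escape a quote character with a backslash.
backslashed = write_row(row, doublequote=False, escapechar="\\")

# Reading with the matching convention recovers the original row.
same_a = read_row(doubled, doublequote=True)
same_b = read_row(backslashed, doublequote=False, escapechar="\\")

# Reading backslash-escaped text as if quotes were doubled does not.
mismatch = read_row(backslashed, doublequote=True)

print(repr(doubled))
print(repr(backslashed))
print(same_a == row, same_b == row, mismatch == row)
```

Both files are legitimate CSV from the point of view of the tool that wrote them; the damage only appears when the reader assumes the other convention.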

If a glossary is made available to a translator or translation agency, it should be in a useful format. XML solves the problems of Excel (character set declared in the header of the file) and CSV (text clearly delimited in XML elements). However, at this moment, there is no open standard specifically designed for representing glossaries in XML format.

There are several standards for exchanging terminology information, like OLIF (Open Lexicon Interchange Format), MARTIF (Machine-Readable Terminology Interchange Format - ISO 12200) and TBX (TermBase eXchange). A subset of one of them could arguably be used for holding glossaries, but even in their simplest form they are too complex for representing simple bilingual glossaries like the ones normally used by translators.

Some translation tool vendors opted for proprietary solutions for exchanging terminology and glossaries. Translators who use these tools cannot exchange their data with colleagues who work with different software, as the format used for containing glosses and terms is secret.

Other translation tool vendors opted for adapting MARTIF, OLIF or TBX to their particular needs. Unfortunately, most of them did it in their own ways, without documenting their custom changes. As a result, exchange using these open standards is also only possible between users of the same tool.

GlossML, an Open Solution

A solution to the problem of glossary exchange should:

  • Be based on XML, to avoid portability problems;
  • Be as simple as possible, to avoid misuse of the vocabulary;
  • Be documented, so everyone knows how to use it;
  • Be convertible to other formats, to facilitate exchange between users of different software;
  • Allow royalty free redistribution and implementation.

GlossML is an XML-based vocabulary specifically designed for containing glossaries that can be used for storing monolingual and multilingual lists of terms and, optionally, their definitions.

The GlossML specification and related materials (XML Schema and examples) are licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License. This means that anyone can use and distribute the GlossML format without paying royalties of any kind.

A distinctive aspect of the GlossML vocabulary is its extreme simplicity. It only has 6 elements and 4 attributes. This is possible because it focuses solely on holding glossary data. It is not intended for terminology exchange.

Listing 1 below shows you a sample bilingual GlossML file. This example is also available for download in the Resources section.

<?xml version="1.0" encoding="UTF-8"?>
<glossary version="1.0" srclang="en-US" 
         xmlns="http://www.maxprograms.com/gml"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.maxprograms.com/gml GlossML.xsd">
   <comment>Sample bilingual glossary</comment>
   <glossentry>
      <langentry xml:lang="en-US">
         <term>structure</term>            
         <definition source="Merriam Webster">the manner in which 
               something is constructed</definition>
      </langentry>
      <langentry xml:lang="es">
         <term>estructura</term>
      </langentry>
   </glossentry>
   <glossentry>
      <comment from="RMR">This entry doesn't refer to statistics as a 
             science</comment>
      <langentry xml:lang="en-US">
         <term>statistic</term>
         <definition source="Merriam Webster">a numerical fact or datum, 
                 esp. one computed from a sample</definition>
      </langentry>
      <langentry xml:lang="es">
         <term>estadística</term>
         <definition source="Larousse">cuadro numérico de un hecho que se 
                 presta a la estadística</definition>
      </langentry>
   </glossentry>
</glossary>
                                 

Listing 1. Sample GlossML File
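A file like Listing 1 can be produced from a plain two-column term list with a few lines of code. The sketch below is one possible way to do it with Python's standard xml.etree.ElementTree module, assuming only the element names, attribute names and namespace visible in Listing 1. The function name build_glossary and the third sample pair are hypothetical and serve only to show that the XML serializer declares the encoding and escapes special characters by itself.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.maxprograms.com/gml"  # namespace used in Listing 1
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize GlossML elements in the default namespace, as Listing 1 does.
ET.register_namespace("", GML_NS)


def q(tag):
    """Return the namespace-qualified name of a GlossML element."""
    return "{%s}%s" % (GML_NS, tag)


def build_glossary(pairs, srclang, tgtlang):
    """Build a bilingual <glossary> element from (source, target) term pairs."""
    glossary = ET.Element(q("glossary"), {"version": "1.0", "srclang": srclang})
    for source_term, target_term in pairs:
        glossentry = ET.SubElement(glossary, q("glossentry"))
        for lang, text in ((srclang, source_term), (tgtlang, target_term)):
            langentry = ET.SubElement(glossentry, q("langentry"), {XML_LANG: lang})
            ET.SubElement(langentry, q("term")).text = text
    return glossary


pairs = [
    ("structure", "estructura"),
    ("statistic", "estadística"),
    # Illustrative pair, not from the article: "&" must be escaped in XML.
    ("research & development", "investigación y desarrollo"),
]

# Requires Python 3.8+ for the xml_declaration argument.
data = ET.tostring(
    build_glossary(pairs, "en-US", "es"), encoding="UTF-8", xml_declaration=True
)
print(data.decode("utf-8"))
```

The output starts with an XML declaration naming the encoding, and the ampersand is written as an entity, so none of the delimiter problems described for CSV arise. The sketch does not validate the result against GlossML.xsd.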

As the GlossML grammar is written using an XML Schema, glossaries in this format can be embedded in other XML vocabularies. For example, a GlossML glossary can be included in an XLIFF (XML Localization Interchange File Format) document in the following manner:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 
             xliff-core-1.2-transitional.xsd"
      xmlns:gls="http://www.maxprograms.com/gml">
   <file datatype="javalistresourcebundle"    
             original="capi.properties" source-language="en">
      <header>
         <gls:glossary version="1.0" srclang="en">
            ... GlossML data ...  
         </gls:glossary>
      </header>
      <body>
            ... XLIFF data ... 
      </body>
   </file>
</xliff>

Listing 2. Embedding a GlossML Document in an XLIFF File
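A consumer of such a file can locate the embedded glossary by namespace. The sketch below shows one way to do this with Python's standard xml.etree.ElementTree module. The inline document is a reduced version of Listing 2 with one sample entry in place of the elided GlossML data. It assumes that the child elements of the glossary carry the same gls prefix as the root, which follows from Listing 1 placing all elements in the GlossML namespace, but check this against GlossML.xsd.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the search paths below.
NS = {
    "x": "urn:oasis:names:tc:xliff:document:1.2",
    "gls": "http://www.maxprograms.com/gml",
}

# Reduced version of Listing 2; the glossary content is an assumed sample.
XLIFF = b"""<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
       xmlns:gls="http://www.maxprograms.com/gml">
   <file datatype="javalistresourcebundle"
         original="capi.properties" source-language="en">
      <header>
         <gls:glossary version="1.0" srclang="en">
            <gls:glossentry>
               <gls:langentry xml:lang="en">
                  <gls:term>structure</gls:term>
               </gls:langentry>
               <gls:langentry xml:lang="es">
                  <gls:term>estructura</gls:term>
               </gls:langentry>
            </gls:glossentry>
         </gls:glossary>
      </header>
      <body/>
   </file>
</xliff>"""

root = ET.fromstring(XLIFF)

# Glossaries live in the <header> of each <file>, as in Listing 2.
glossaries = root.findall("x:file/x:header/gls:glossary", NS)

# Collect every term of every embedded glossary, in document order.
terms = [
    term.text
    for glossary in glossaries
    for term in glossary.findall("gls:glossentry/gls:langentry/gls:term", NS)
]

print(len(glossaries), glossaries[0].get("srclang"), terms)
```

Because the lookup is by namespace URI rather than by prefix, it keeps working if another tool writes the same document with a different prefix.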

GlossML Structure

The root of a GlossML glossary is the <glossary> element and the main language used in the glossary is declared in the required "srclang" attribute. A <glossary> element contains one optional <comment> and one or more <glossentry> elements. A <glossentry> element contains a term and its optional translations into one or more languages. Terms are stored in <langentry> elements, which contain the term text in one <term> element and an optional definition in a <definition> element.
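The structure just described maps directly onto a few nested loops. The following sketch reads a GlossML document with Python's standard xml.etree.ElementTree module and returns, for each entry, the term and optional definition per language. The function name read_glossary and the returned data shape are assumptions made for illustration; the sample is the first entry of Listing 1.

```python
import xml.etree.ElementTree as ET

GML = "{http://www.maxprograms.com/gml}"  # GlossML namespace from Listing 1
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<glossary version="1.0" srclang="en-US"
          xmlns="http://www.maxprograms.com/gml">
   <comment>Sample bilingual glossary</comment>
   <glossentry>
      <langentry xml:lang="en-US">
         <term>structure</term>
         <definition source="Merriam Webster">the manner in which something is constructed</definition>
      </langentry>
      <langentry xml:lang="es">
         <term>estructura</term>
      </langentry>
   </glossentry>
</glossary>"""


def read_glossary(xml_bytes):
    """Return (srclang, entries); each entry maps a language code
    to a (term, definition) tuple, with definition None when absent."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for glossentry in root.findall(GML + "glossentry"):
        entry = {}
        for langentry in glossentry.findall(GML + "langentry"):
            term = langentry.findtext(GML + "term", default="").strip()
            definition = langentry.findtext(GML + "definition")
            entry[langentry.get(XML_LANG)] = (term, definition)
        entries.append(entry)
    return root.get("srclang"), entries


srclang, entries = read_glossary(SAMPLE.encode("utf-8"))
for entry in entries:
    for lang, (term, definition) in entry.items():
        print(lang, term, definition)
```

The sketch ignores <comment> elements and the source attribute of <definition>; both could be collected in the same loops.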

Figure 1 below indicates the hierarchical relationship of GlossML elements.

Figure 1. GlossML Element Tree

A major advantage of GlossML is that, due to its simplicity, conversion from GlossML to other vocabularies like TMX or TBX is easy to achieve using XSL transformations. The Resources section below has links to XSL Stylesheets for doing such conversions.
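The stylesheets themselves are not reproduced in this article. To give an idea of how small the mapping is, the sketch below performs a GlossML to TMX conversion in Python with the standard xml.etree.ElementTree module instead of XSLT. It assumes a simple mapping in which every <glossentry> becomes a <tu> and every <langentry> becomes a <tuv> whose <seg> holds the term; definitions and comments are dropped. The function name glossml_to_tmx and the header values are hypothetical choices, not those of the official stylesheets.

```python
import xml.etree.ElementTree as ET

GML = "{http://www.maxprograms.com/gml}"  # GlossML namespace from Listing 1
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<glossary version="1.0" srclang="en-US"
          xmlns="http://www.maxprograms.com/gml">
   <glossentry>
      <langentry xml:lang="en-US"><term>structure</term></langentry>
      <langentry xml:lang="es"><term>estructura</term></langentry>
   </glossentry>
</glossary>"""


def glossml_to_tmx(glossary):
    """Map a GlossML <glossary> element to a TMX 1.4 element tree."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "glossml-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",      # hypothetical version
        "segtype": "phrase",
        "o-tmf": "GlossML",
        "adminlang": "en",
        "srclang": glossary.get("srclang"),
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for glossentry in glossary.findall(GML + "glossentry"):
        tu = ET.SubElement(body, "tu")  # one translation unit per entry
        for langentry in glossentry.findall(GML + "langentry"):
            lang = langentry.get(XML_LANG, "")
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            seg = ET.SubElement(tuv, "seg")
            seg.text = langentry.findtext(GML + "term", default="").strip()
    return tmx


tmx = glossml_to_tmx(ET.fromstring(SAMPLE))
print(ET.tostring(tmx, encoding="unicode"))
```

An XSL stylesheet would express the same three template rules declaratively, which is why the conversion is easy in either technology.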

Summary

There is a clear need in the localization/translation industry for a simple XML-based representation of glossaries. GlossML fills the void, providing an open format that can be used by anyone in commercial or open source applications.

Hopefully, one day translation tool vendors will simplify the exchange of glossary data between users of different translation tools. GlossML could be the first step in this direction.

Resources

  • Download or view GlossML Specification in PDF format.
  • Download GlossML.xsd, the GlossML grammar in XML Schema format.
  • Create and maintain your own glossaries in GlossML using Anchovy.
  • Experiment with the possibilities that XSL transformations offer for manipulating XML files. Get sample.gls and process it with these stylesheets:

About the author

Rodolfo M. Raya

Rodolfo Raya is Maxprograms' CTO (Chief Technical Officer), where he develops multi-platform translation/localisation and content publishing tools using XML and Java technology. He can be reached at rmraya@maxprograms.com.