The first article in this series briefly explained the most relevant XML standards used in the localisation industry. This second part focuses on XML Localisation Interchange File Format (XLIFF) and explains with practical examples how to use it for translating different kinds of documents. This article presents a step-by-step guide to translating multilingual documents using XLIFF as an intermediary file format, and provides useful resources for localising Java applications.
XLIFF is a format that's used to exchange localisation data between participants in a translation project. This special format enables translators to concentrate on the text to be translated, without worrying about text layout. The XLIFF standard is supported by a large group of localisation service providers and localisation tools providers.
The most important reason for you to use XLIFF when translating documents is that you can use a single file format when translating different kinds of documents.
In my previous article, I described the steps to follow in a localisation project:
The special XML document mentioned in step 2 is an XLIFF document. This article will focus on steps 1, 2, and 4.
I'll now show you a translation process that uses XLIFF files as an intermediary format. Figure 1 illustrates the steps that you will follow.
Figure 1. Translation process
The process can now be reformulated with more detail as follows:
To aid the translator, translatable text is separated from text layout information. Listing 1 shows the code of a simple HTML page. This page contains two translatable sentences and several tags that are irrelevant for a translator.
<html>
  <head>
    <title>A Title</title>
  </head>
  <body>
    <p>One paragraph</p>
  </body>
</html>
Listing 1. Simple HTML page
Special programs called filters separate text and layout. Computer Aided Translation (CAT) tools usually have filters for the most common formats: HTML, RTF, XML, and plain text.
Filters store the non-translatable portions in special files called skeletons. All translatable sentences are replaced by special marks in the skeleton. Listing 2 shows a sample skeleton for the text in Listing 1.
<html>
  <head>
    <title>%%%1%%%</title>
  </head>
  <body>
    <p>%%%2%%%</p>
  </body>
</html>
Listing 2. Skeleton file
Each text fragment is stored in a translation unit element (<trans-unit>) in an XLIFF file. The mark used in the skeleton can be used as an id attribute for the translation unit to simplify the mapping between the skeleton and the XLIFF file.
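To make the idea concrete, here is a minimal sketch of such a filter in Java. It is not the filter from the accompanying source code: the class name `SkeletonSketch` and its `extract` method are hypothetical, and the regular expression assumes very simple HTML with no scripts, comments, or translatable attributes. A production filter would use a real HTML parser.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, deliberately naive HTML filter; not the article's accompanying code. */
final class SkeletonSketch {

    // A run of text between the '>' that closes one tag and the '<' that opens the next.
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)(?=<)");

    /**
     * Returns the skeleton and fills 'units' with mark -> translatable text.
     * Assumption: whitespace around a text run is not significant, so it is trimmed.
     */
    static String extract(String html, Map<String, String> units) {
        Matcher m = TEXT.matcher(html);
        StringBuilder skeleton = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String text = m.group(1);
            if (text.trim().isEmpty()) {
                continue; // whitespace between tags stays in the skeleton untouched
            }
            String mark = "%%%" + (units.size() + 1) + "%%%";
            units.put(mark, text.trim());
            // Copy everything up to the text run, then write the mark in its place.
            skeleton.append(html, last, m.start(1)).append(mark);
            last = m.end(1);
        }
        return skeleton.append(html.substring(last)).toString();
    }
}
```

Called with the page from Listing 1 and an empty `LinkedHashMap`, this sketch returns the skeleton from Listing 2 and leaves two entries in the map, keyed by `%%%1%%%` and `%%%2%%%`. Those keys can then be written out as the id attributes of the translation units.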
The XLIFF format definition as stated in the formal specification is concise, clear, and practical:
XLIFF is an XML application, as such it begins with an XML declaration. After the XML declaration comes the XLIFF document itself, enclosed within the <xliff> element. An XLIFF document is composed of one or more sections, each enclosed within a <file> element. The <file> element consists of a <header> element, which contains metadata about the <file>, and a <body> element, which contains the extracted translatable data from the <file>. The translatable data within <trans-unit> elements is organized into <source> and <target> paired elements. These <trans-unit> elements can be grouped recursively in <group> elements.
<?xml version="1.0" ?>
<xliff version="1.0">
  <file original="sample.html" source-language="en" datatype="HTML Page">
    <header>
      <skl>
        <external-file href="sample.skl"/>
      </skl>
    </header>
    <body>
      <trans-unit id="%%%1%%%">
        <source xml:lang="en">A Title</source>
      </trans-unit>
      <trans-unit id="%%%2%%%">
        <source xml:lang="en">One paragraph</source>
      </trans-unit>
    </body>
  </file>
</xliff>
Listing 3. Sample XLIFF file
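A file like the one in Listing 3 can be read with the standard DOM parser that ships with Java. The following is only a sketch: the class name `XliffReaderSketch` and the method `readSources` are hypothetical, and the code assumes that every translation unit carries an id attribute and a <source> child.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper that lists the translatable text of an XLIFF document. */
final class XliffReaderSketch {

    /** Maps each trans-unit id to the text content of its first source element. */
    static Map<String, String> readSources(String xliff) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xliff)));
        } catch (Exception e) {
            throw new IllegalStateException("Not a well-formed XLIFF document", e);
        }
        Map<String, String> sources = new LinkedHashMap<>();
        NodeList units = doc.getElementsByTagName("trans-unit");
        for (int i = 0; i < units.getLength(); i++) {
            Element unit = (Element) units.item(i);
            NodeList source = unit.getElementsByTagName("source");
            if (source.getLength() > 0) {
                // getTextContent() also returns the text held inside inline elements.
                sources.put(unit.getAttribute("id"), source.item(0).getTextContent());
            }
        }
        return sources;
    }
}
```

The map preserves document order, which keeps the units aligned with the sequence of marks in the skeleton.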
The complexity of a filter program depends on the format that has to be parsed. HTML and XML are well-documented formats. Filter programs use certain specifications as reference to distinguish between translatable text and formatting information. Word processors usually have proprietary file formats. Fortunately, most word processors can export and import documents in RTF, another well-documented format.
The accompanying source code contains a complete Java-language implementation of a filter for Java properties files (see Resources).
Table 1 shows two versions of the same sentence, one in HTML format and the other in RTF.
|HTML|RTF|
|---|---|
|This is <b>bold</b> text|This is \b bold\b0 text|
Table 1. HTML and RTF formatting
Both versions contain the same text -- only the formatting codes are different. Markup code should not be translated. A translator who uses an HTML editor and a word processor has to translate twice -- once for each tool. The translator needs to work with a single tool that can show the relevant text, hide markup information, and allow for reuse of translations despite differences in markup.
Text layout information is stored in special elements called inline elements that the XLIFF standard defines for that purpose. Listing 4 shows the markup of the examples in Table 1 enclosed in the appropriate inline elements.
<!-- HTML version -->
<source>This is <bpt id="1" ctype="bold">&lt;b&gt;</bpt> bold <ept id="1">&lt;/b&gt;</ept> text </source>
<!-- RTF version -->
<source> This is <bpt id="1" ctype="bold">\b</bpt> bold <ept id="1">\b0</ept> text </source>
Listing 4. Markup of Table 1 examples with inline elements
Notice that the elements used to wrap markup are identical; only their content varies. This allows a CAT tool to search for translations in its TM database using the information provided in the attributes. For example, if the database contains a translation for the HTML version, it can be used with the RTF version by automatically adjusting the content of the <bpt> and <ept> elements.
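One way a tool could exploit this is to build a lookup key that ignores the native markup and keeps only the inline elements and their attributes. The sketch below is an assumption about how that might be done, not a description of any particular CAT tool; `InlineKeySketch` and `normalize` are hypothetical names, and the regular expression only handles non-nested <bpt> and <ept> elements.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that makes segments comparable regardless of native markup. */
final class InlineKeySketch {

    // A <bpt> or <ept> element with its attributes (group 2) and whatever it wraps.
    private static final Pattern PAIRED =
            Pattern.compile("<(bpt|ept)([^>]*)>.*?</\\1>", Pattern.DOTALL);

    /** Empties every bpt/ept element but keeps its name and attributes. */
    static String normalize(String sourceContent) {
        return PAIRED.matcher(sourceContent).replaceAll("<$1$2/>");
    }
}
```

With this normalization, the HTML and RTF versions of the sentence in Listing 4 produce the same key, so a translation stored for one can be offered for the other.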
Translators should not see the inline elements as XML when translating with a CAT tool. They must be presented in a friendly way, either as an image or as a text mark that can be inserted easily by clicking a button or using shortcut keys.
Table 2 contains the complete list of inline elements and explanations of their usage.
|Element|Usage|
|---|---|
|<g>|Generic group placeholder|
|<x/>|Generic placeholder|
|<bx/>|Begin paired placeholder|
|<ex/>|End paired placeholder|
|<bpt>|Begin paired tag|
|<ept>|End paired tag|
|<it>|Isolated tag|
Table 2. Inline elements
Once the extracted text is examined and all markup is wrapped with the correct inline elements, a new process called segmentation can be applied to the translation unit. Segmentation is the breaking down of a block of text into smaller translatable segments of text, such as a sentence, a paragraph, or a phrase. It is important to keep translation units as small as possible to maximize the chances of finding usable translations in the TM database.
Usually, a segmenter (the tool that performs segmentation) breaks up text according to these basic rules:
Special care is required when breaking after a period because the period could be part of an abbreviation.
The segmentation rules listed above are not applicable for Asian languages such as Chinese, Japanese, and Korean.
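The abbreviation problem can be illustrated with a small sentence segmenter. This is a sketch under stated assumptions: it breaks after a word that ends in a period, exclamation mark, or question mark, it uses a tiny hypothetical English abbreviation list, and it normalizes whitespace between words. The names `SegmenterSketch` and `segment` are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sentence segmenter for space-delimited languages. */
final class SegmenterSketch {

    // Illustrative list only; a real segmenter needs a maintained list per language.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."));

    static List<String> segment(String text) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean terminal = word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
            // Do not break when the period belongs to a known abbreviation.
            if (terminal && !ABBREVIATIONS.contains(word.toLowerCase(Locale.ROOT))) {
                segments.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString()); // trailing text without a terminator
        }
        return segments;
    }
}
```

For the input "Dr. Smith arrived. He was late!" the sketch returns two segments, "Dr. Smith arrived." and "He was late!". It will still fail when an abbreviation such as "etc." genuinely ends a sentence, which is one reason real segmenters are more elaborate.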
Actual translation must be performed by a human translator. A machine can't substitute for a human here; it doesn't matter how good a translation program is, the translation must be checked by a real person to ensure quality.
It is possible to assist the translator by providing translations of similar, perhaps identical, texts made previously. A professional translator can decide if a given translation is good enough to reuse.
After the original document has been converted to XLIFF, the tool used to perform conversion can iterate over all translation units and search for matching translations in a TM database. Whenever a suitable match is found, an <alt-trans> element is added to the translation unit. This process is usually called pre-translation.
The text found in the database does not need to be identical to the text that needs translation. The example in Listing 5 shows how translations from HTML and RTF documents can be mixed in a translation unit. Notice that the second <alt-trans> element has a match quality of 96% due to the differences in markup format, but the text is good enough to be accepted.
<trans-unit approved="no" id="5" resname="res5" xml:space="default">
  <source xml:lang="en">This is <bpt id="1" ctype="bold">&lt;b&gt;</bpt>bold<ept id="1">&lt;/b&gt;</ept> text</source>
  <target xml:lang="es"/>
  <alt-trans match-quality="100" origin="web" tool="TM Search">
    <source xml:lang="en">This is <bpt id="1" ctype="bold">&lt;b&gt;</bpt>bold<ept id="1">&lt;/b&gt;</ept> text</source>
    <target xml:lang="es">Este es texto <bpt id="1" ctype="bold">&lt;b&gt;</bpt>enfatizado<ept id="1">&lt;/b&gt;</ept></target>
  </alt-trans>
  <alt-trans match-quality="96" origin="rtf" tool="TM Search">
    <source xml:lang="en">This is <bpt id="1" ctype="bold">\b</bpt>bold<ept id="1">\b0</ept> text</source>
    <target xml:lang="es">Este es texto <bpt id="1" ctype="bold">\b</bpt>enfatizado<ept id="1">\b0</ept></target>
  </alt-trans>
</trans-unit>
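How a tool arrives at a match-quality figure is tool-specific and is not defined in this article. As one plausible illustration, the sketch below scores two strings by edit distance (Levenshtein) relative to the longer string. The formula, the class name `MatchQualitySketch`, and its methods are assumptions made for this example, so the result will not necessarily reproduce the 96% shown above.

```java
/** Hypothetical match-quality calculation based on Levenshtein edit distance. */
final class MatchQualitySketch {

    /** Classic dynamic-programming edit distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns a percentage from 0 to 100; identical strings score 100. */
    static int matchQuality(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100;
        }
        return Math.round(100f * (longest - levenshtein(a, b)) / longest);
    }
}
```

A pre-translation routine could call `matchQuality` on the text to translate and on each TM candidate, then add an <alt-trans> element only for candidates above a chosen threshold.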
When the XLIFF file is finally ready, it is sent to a professional translator. The translator then uses an XLIFF-enabled CAT tool to add all the missing translations and to verify the ones provided at the pre-translation stage.
The translated XLIFF must now be merged with the skeleton file to produce a translated document in the desired output format.
A filter is used to read the skeleton and process all special marks sequentially. For each mark found in the skeleton, the corresponding translation unit is looked up in the XLIFF file; if the translation unit has been marked as approved by the translator, the text in the <target> element replaces the mark in the skeleton. If there is no translation for the segment, or if the included translation is not approved, the text in the <source> element is used instead. After all marks have been replaced with the corresponding text from the XLIFF file, the skeleton becomes a translated document and should be saved under a new name.
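The merge rule just described fits in a few lines of Java. In this sketch the `Unit` class is a hypothetical in-memory view of a translation unit (a real filter would populate it from the XLIFF file), and the mark syntax `%%%n%%%` is the one used in Listing 2.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical merge step: skeleton plus translation units gives the translated document. */
final class MergeSketch {

    /** Minimal stand-in for the data held in a trans-unit element. */
    static final class Unit {
        final String source;
        final String target;
        final boolean approved;

        Unit(String source, String target, boolean approved) {
            this.source = source;
            this.target = target;
            this.approved = approved;
        }

        /** Approved, non-empty target wins; otherwise fall back to the source text. */
        String textForOutput() {
            return approved && target != null && !target.isEmpty() ? target : source;
        }
    }

    private static final Pattern MARK = Pattern.compile("%%%\\d+%%%");

    static String merge(String skeleton, Map<String, Unit> units) {
        Matcher m = MARK.matcher(skeleton);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Unit unit = units.get(m.group());
            // An unknown mark is left in place so the problem stays visible.
            String text = unit == null ? m.group() : unit.textForOutput();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

`Matcher.quoteReplacement` matters here: translated text may contain `$` or backslashes (RTF markup, for instance), which `appendReplacement` would otherwise interpret as group references.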
Most document formats require fixes in the layout of the translated document; the XML, HTML, and RTF formats generally require the fewest post-translation adjustments.
One final task remains: extraction of <source>/<target> pairs from the <trans-unit> elements of the XLIFF file. Store these pairs in the TM database for later reuse. These pairs are usually stored in a special XML format called Translation Memory eXchange (TMX), which all important translation tools support.
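As a rough illustration of what such an export looks like, the sketch below writes source/target pairs as a minimal TMX 1.4 document. The class name `TmxSketch`, the header values (tool name, version, segment type, data type), and the restriction to plain text without inline elements are all assumptions made for this example.

```java
import java.util.Map;

/** Hypothetical exporter that writes plain-text translation pairs as TMX. */
final class TmxSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** 'pairs' maps source text to its approved translation. */
    static String toTmx(String srcLang, String tgtLang, Map<String, String> pairs) {
        StringBuilder tmx = new StringBuilder();
        tmx.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        tmx.append("<tmx version=\"1.4\">\n");
        // Header attribute values below are placeholders chosen for this sketch.
        tmx.append("  <header creationtool=\"TmxSketch\" creationtoolversion=\"0.1\"")
           .append(" segtype=\"sentence\" o-tmf=\"xliff\" adminlang=\"en\"")
           .append(" srclang=\"").append(srcLang).append("\" datatype=\"plaintext\"/>\n");
        tmx.append("  <body>\n");
        for (Map.Entry<String, String> pair : pairs.entrySet()) {
            tmx.append("    <tu>\n");
            tmx.append("      <tuv xml:lang=\"").append(srcLang).append("\"><seg>")
               .append(escape(pair.getKey())).append("</seg></tuv>\n");
            tmx.append("      <tuv xml:lang=\"").append(tgtLang).append("\"><seg>")
               .append(escape(pair.getValue())).append("</seg></tuv>\n");
            tmx.append("    </tu>\n");
        }
        tmx.append("  </body>\n");
        tmx.append("</tmx>\n");
        return tmx.toString();
    }
}
```

Segments that contain XLIFF inline elements would need those elements converted to their TMX counterparts before export; this sketch does not attempt that.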
The accompanying Java source code (see Resources) contains two filters:
Properties files are used in Java programs to store the text displayed by the application GUI. These are text files that can have multiple lines, where each line is either a comment, a blank line, or a key=value pair.
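The following sketch shows the extraction half of such a filter using the standard `java.util.Properties` class. It is not the accompanying source code: `PropertiesFilterSketch` and `toTransUnits` are hypothetical names, the property key is used as the translation unit id by assumption, and no skeleton is written. Because `Properties` does not preserve file order, the keys are sorted to make the output predictable.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical extraction step for Java properties files. */
final class PropertiesFilterSketch {

    /** Escapes text for use in XML content or in a double-quoted attribute. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Returns one trans-unit element per property, sorted by key. */
    static String toTransUnits(String propertiesText) {
        Properties props = new Properties();
        try {
            // load() skips comments and blank lines and resolves escapes.
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        StringBuilder body = new StringBuilder();
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            body.append("<trans-unit id=\"").append(escape(key)).append("\">")
                .append("<source>").append(escape(props.getProperty(key))).append("</source>")
                .append("</trans-unit>\n");
        }
        return body.toString();
    }
}
```

A complete filter would also record comments, blank lines, and the original key order in a skeleton so that the translated file can be rebuilt faithfully.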
Segmentation and pre-translation routines are not included in the sample code as they add complexity beyond the scope of this article. Nevertheless, the code contains placeholders where the reader can implement those processes, if desired.
This article has shown how to translate a document using XLIFF as an intermediate file format, explaining all the stages of the process with simple examples. The included source code demonstrates in a practical way how to implement an XLIFF-based solution for Java localisation projects, a solution that can be extended to support other formats.
The next article in the series will explain how a TM database works, emphasizing the relevance of the TMX format for translation exchange.
Rodolfo Raya is Maxprograms' CTO (Chief Technical Officer), where he develops multi-platform translation/localisation and content publishing tools using XML and Java technology. He can be reached at firstname.lastname@example.org.