Java Localization with TMX standard

Nicola Asuni, 2004-10-14 / 2006-06-04

Foreword

One of the main concerns of internationalization is separating the main source code from the texts, labels, messages and all the other objects tied to the specific language in use. This facilitates the translation process, because all the resources related to the local language context are clearly identified and kept apart.

Since JDK 1.1, Java has provided solid support for internationalization (i18n) through several instruments and tools, for example support for Unicode 2.0, multilingual environments and object localization, to mention just a few.
However, these instruments may not be sufficient when we target a global market, where the cost of translating and updating the texts (labels, messages, menu elements and so on) can easily become quite high.
This is where the TMX standard helps: it brings reuse, greater consistency and a shorter production cycle to the translation and management of these texts, with the added bonus of lower development costs.

TMX - Translation Memory eXchange

http://www.lisa.org/tmx/

TMX is an open standard that uses XML to store and exchange Translation Memories (TM). These memories are created with dedicated translation and localization software known as CAT (Computer Aided Translation) tools.
TMX is the result of a project developed by one of the Special Interest Groups of LISA, known as OSCAR (Open Standards for Container/Content Allowing Re-use).

The goal of TMX is to provide a neutral system to exchange data between different translation systems, while minimizing or eliminating the loss of critical data.
The TMX format is supported by the majority of the translation software on the market today.

The TMX specification is freely available on the website http://www.lisa.org/tmx/, together with several related links, documents, articles and software tools.

TMX file example

sample_tmx.xml
<?xml version="1.0" ?>
<tmx version="1.4">
	<header
		creationtool="XYZTool"
		creationtoolversion="1.01-023"
		datatype="PlainText"
		segtype="sentence"
		adminlang="en-us"
		srclang="EN"
		o-tmf="ABCTransMem">
	</header>
	<body>
		<tu tuid="hello" datatype="plaintext">
			<tuv xml:lang="en">
				<seg>hello</seg>
			</tuv>
			<tuv xml:lang="it">
				<seg>ciao</seg>
			</tuv>
		</tu>
		<tu tuid="world" datatype="plaintext">
			<tuv xml:lang="en">
				<seg>world</seg>
			</tuv>
			<tuv xml:lang="it">
				<seg>mondo</seg>
			</tuv>
		</tu>
	</body>
</tmx>
where:
  • tu: translation unit, the parent element of every item to be translated. It can carry a unique identifier (tuid).
  • tuv: translation unit variant, the element that carries the language code of the translation (xml:lang).
  • seg: segment, the element that contains the translated text.
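The relationship between these three elements can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name TmxSegments and its read method are hypothetical and are not part of the TMX Java Bridge project, and the sketch assumes plain-text segments with no inline markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class TmxSegments {

	/** Returns a map of tuid -> seg text for the requested language code. */
	static Map<String, String> read(String tmx, String lang) {
		Document doc;
		try {
			doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new ByteArrayInputStream(tmx.getBytes(StandardCharsets.UTF_8)));
		} catch (Exception e) {
			throw new IllegalArgumentException("Cannot parse TMX content", e);
		}
		Map<String, String> result = new LinkedHashMap<String, String>();
		NodeList units = doc.getElementsByTagName("tu");
		for (int i = 0; i < units.getLength(); i++) {
			Element tu = (Element) units.item(i); // one translation unit
			NodeList variants = tu.getElementsByTagName("tuv");
			for (int j = 0; j < variants.getLength(); j++) {
				Element tuv = (Element) variants.item(j); // one language variant
				if (!lang.equalsIgnoreCase(tuv.getAttribute("xml:lang"))) {
					continue;
				}
				NodeList segs = tuv.getElementsByTagName("seg");
				if (segs.getLength() > 0) {
					// the tuid becomes the key, the segment text the value
					result.put(tu.getAttribute("tuid"), segs.item(0).getTextContent());
				}
			}
		}
		return result;
	}
}
```

Calling TmxSegments.read with the content of sample_tmx.xml and the code "it" should yield the pairs hello/ciao and world/mondo.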

TM - Translation Memory

http://www.opentag.com/tm.htm

A Translation Memory (TM), also known as a Translation Database, is a database in which sentences written in a reference language are linked to their translations in one or more languages. A reference sentence together with its translations is called a translation memory unit (a record of the database). Applications that use TMs are tools that support language translation: they are intended to improve the quality and efficiency of human translation, not to replace it.
Whenever a new sentence is entered into the TM application, the application searches for it among the reference sentences in the database and computes a score for the quality of the match (the matching value).
When the matching value is 100%, meaning an exact match, the translation found in the database is assumed to be correct and is used directly to build the translated text. When the matching value is below 100% but above a certain threshold (a fuzzy match), the translation found in the database is proposed to a human translator, who judges it and corrects it if necessary. For sentences whose score falls below the threshold no translation is proposed, and they have to be translated entirely by hand. New sentences for which a translation has been entered are stored in the database and used in future searches.
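The exact / fuzzy / no-match decision described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular CAT tool: the class TranslationMemorySketch is hypothetical, the matching value is assumed to be a similarity derived from the Levenshtein edit distance, and the threshold is whatever value the caller chooses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a translation-memory lookup.
class TranslationMemorySketch {

	private final Map<String, String> units = new LinkedHashMap<String, String>();
	private final double threshold; // assumed fuzzy-match threshold, e.g. 0.75

	TranslationMemorySketch(double threshold) {
		this.threshold = threshold;
	}

	/** Stores a reference sentence with its translation for future searches. */
	void store(String source, String translation) {
		units.put(source, translation);
	}

	/** Similarity in [0, 1] based on the Levenshtein distance (an assumed metric). */
	static double similarity(String a, String b) {
		int[] prev = new int[b.length() + 1];
		int[] curr = new int[b.length() + 1];
		for (int j = 0; j <= b.length(); j++) {
			prev[j] = j;
		}
		for (int i = 1; i <= a.length(); i++) {
			curr[0] = i;
			for (int j = 1; j <= b.length(); j++) {
				int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
				curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
			}
			int[] tmp = prev;
			prev = curr;
			curr = tmp;
		}
		int longest = Math.max(a.length(), b.length());
		return longest == 0 ? 1.0 : 1.0 - (double) prev[b.length()] / longest;
	}

	/** Returns "EXACT: ...", "FUZZY: ..." (to be reviewed by a translator) or "NONE". */
	String lookup(String sentence) {
		String bestSource = null;
		double bestScore = 0.0;
		for (String source : units.keySet()) {
			double score = similarity(sentence, source);
			if (score > bestScore) {
				bestScore = score;
				bestSource = source;
			}
		}
		if (bestSource == null || bestScore < threshold) {
			return "NONE"; // below the threshold: translate by hand
		}
		String kind = bestScore == 1.0 ? "EXACT" : "FUZZY";
		return kind + ": " + units.get(bestSource);
	}
}
```

With a threshold of 0.75 and the stored pair "hello world" / "ciao mondo", looking up "hello world" gives an exact match, "hello world!" gives a fuzzy match, and "goodbye" gives no proposal.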

Several software houses offer complex commercial products that work along these lines.

LISA - Localization Industry Standards Association

http://www.lisa.org

Founded in 1990, LISA is the premier non-profit worldwide organization for GILT (Globalization, Internationalization, Localization and Translation). Its members include individuals, businesses, associations and organizations involved in languages, language technologies and language standards.

Over 400 leading IT manufacturers and services providers, along with industry professionals representing corporations with an international business focus, have helped establish LISA's best practice guidelines and language-technology standards for enterprise globalization.

LISA serves as a nexus between the many organizations engaged in helping businesses to become global enterprises. This includes customers, governments, technical and industry-specific standards organizations, research and consulting firms, language technology developers and service providers.

LISA offers services in the form of standards initiatives, Special Interest Groups, conferences and training programs to provide GILT support to businesses.

LISA partners and affiliate groups include the International Organization for Standardization (ISO Liaison Category A Members of TC 37 and TC 46), The World Bank, OASIS, IDEAlliance, AIIM, The Advisory Council (TAC), Fort-Ross, ¤TTEC, the Japan Technical Communicators Association, the Society of Automotive Engineers (SAE), the European Union, the Canadian Translation Bureau, TermNet, the American Translators Association (ATA), IWIPS, Fédération Internationale des Traducteurs (FIT), Termium, JETRO, the Institute of Translating and Interpreting (ITI), The Unicode Consortium, OpenI18N, and other professional and trade organizations.

LISA members and co-founders include some of the largest and best-known companies in the world, including Adobe, Avaya, Cisco Systems, CLS Communication, EMC, Hewlett Packard, IBM, Innodata Isogen, Fuji Xerox, Microsoft, Oracle, Nokia, Logitech, SAP, Siebel Systems, Standard Chartered Bank, FileNet, LionBridge Technologies, Lucent, Sun Microsystems, WH&P, PeopleSoft, Philips Medical Systems, Rockwell Automation, The RWS Group, Xerox Corporation and Canon Research, among others.

TMX Java Bridge

With the java.util.ResourceBundle class, Java provides a useful solution for localization. The methods of this class let us extract the textual elements from the original source code and isolate them in a ResourceBundle component, for example a ListResourceBundle class or a properties file.
This solution offers several advantages to the programmer but can become very complicated for the translator, especially in terms of reusability of the translation.

A better option is to store the textual resources in the TMX exchange format (an XML file). This lets translators export and import the translations to and from their preferred translation tools (several are compatible with TMX), completely independently of the programming language used.

As suggested by Masaki Itagaki in his article "Use XML as a Java Localization Solution", the best way to implement the TMX standard in Java applications consists of extending the ResourceBundle class so that it reads data directly from XML files that comply with the TMX standard (Java class ==> TMX file <== translation program).
This allows us to take advantage of all the features of the ResourceBundle class and simplifies the exchange of data with external TMX applications.

The main disadvantages of this technique are the time and the memory needed to load the entire TMX file.
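To give an idea of how the memory cost could be reduced, the sketch below reads a TMX stream with the StAX pull parser (javax.xml.stream) instead of building a full DOM tree, keeping only the segments of one language. This is an alternative technique shown for comparison, not the approach used by TMXResourceBundle. The class StreamingTmxReader is hypothetical, and the sketch assumes plain-text seg elements (inline markup inside seg would make getElementText fail).

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical streaming reader for illustration only.
class StreamingTmxReader {

	/** Returns tuid -> seg text for one language without building a DOM tree. */
	static Map<String, String> read(Reader tmx, String lang) {
		Map<String, String> result = new LinkedHashMap<String, String>();
		try {
			XMLInputFactory factory = XMLInputFactory.newInstance();
			factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
			XMLStreamReader reader = factory.createXMLStreamReader(tmx);
			String tuid = null; // key of the translation unit being read
			boolean wanted = false; // true while inside a <tuv> of the requested language
			while (reader.hasNext()) {
				if (reader.next() != XMLStreamConstants.START_ELEMENT) {
					continue;
				}
				String name = reader.getLocalName();
				if ("tu".equals(name)) {
					tuid = reader.getAttributeValue(null, "tuid");
				} else if ("tuv".equals(name)) {
					// xml:lang lives in the predefined XML namespace
					String tuvLang = reader.getAttributeValue(XMLConstants.XML_NS_URI, "lang");
					wanted = lang.equalsIgnoreCase(tuvLang);
				} else if ("seg".equals(name) && wanted && tuid != null) {
					result.put(tuid, reader.getElementText());
				}
			}
			reader.close();
		} catch (XMLStreamException e) {
			throw new IllegalArgumentException("Cannot read TMX stream", e);
		}
		return result;
	}
}
```

Only the resulting map stays in memory once parsing is finished; the trade-off is that the parsing logic has to track the current tu and tuv by hand.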

To keep the explanation simple, we consider only the TMX elements needed to translate a simple text (see sample_tmx.xml):
  • tu: translation unit, the parent element of every item to be translated. It can carry a unique identifier (tuid).
  • tuv: translation unit variant, the element that carries the language code of the translation (xml:lang).
  • seg: segment, the element that contains the translated text.
For example:
	<tu tuid="hello" datatype="plaintext">
		<tuv xml:lang="en">
			<seg>hello</seg>
		</tuv>
		<tuv xml:lang="it">
			<seg>ciao</seg>
		</tuv>
	</tu>

TMXResourceBundle.java Class

Instantiating the TMXResourceBundle class is similar to instantiating the PropertyResourceBundle class.
The constructor takes the name and path of the TMX file that contains the translations, and the ISO code of the language to load.
Once the class has been instantiated, the parseXmlFile method (which uses the standard XML parser API, javax.xml.parsers.DocumentBuilder.parse) loads the TMX data into an object of type org.w3c.dom.Document (DOM, Document Object Model).
At this point the nodes of the document are examined and the key-value pairs are added to hashcontents, an object of type java.util.Hashtable. Each key is the tuid attribute of a tu element, and each value is the content of the seg node inside the tuv node whose xml:lang attribute matches the specified language.
Extending the ResourceBundle class requires implementing the abstract methods handleGetObject and getKeys, so that the element corresponding to a particular key can be retrieved through the methods inherited from ResourceBundle: getObject(String key), getString(String key) and getStringArray(String key).
The getString(String key, String def) method, an overload of ResourceBundle's getString(String key), returns the string associated with a particular key, or a default value in case of errors.
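The contract described above can be reduced to a minimal sketch. The class MapResourceBundle below is hypothetical and is backed by an in-memory map instead of a TMX file; it only illustrates which methods a ResourceBundle subclass has to supply and how a getString overload with a default value can sit on top of the inherited lookup.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

// Hypothetical bundle backed by a map, for illustration only.
class MapResourceBundle extends ResourceBundle {

	private final Map<String, String> contents;

	MapResourceBundle(Map<String, String> contents) {
		this.contents = new LinkedHashMap<String, String>(contents);
	}

	/** Returning null tells ResourceBundle that the key is missing. */
	@Override
	protected Object handleGetObject(String key) {
		return contents.get(key);
	}

	@Override
	public Enumeration<String> getKeys() {
		return Collections.enumeration(contents.keySet());
	}

	/** Returns the value for the key, or def when it is missing or empty. */
	public String getString(String key, String def) {
		try {
			String value = getString(key); // inherited lookup, may throw
			return value.length() > 0 ? value : def;
		} catch (MissingResourceException e) {
			return def;
		}
	}
}
```

The inherited getString(String key) throws MissingResourceException for an unknown key, while the two-argument overload returns the supplied default instead.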

Source Code

(download the full project from Sourceforge).
package com.tecnick.tmxjavabridge;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.MissingResourceException;
import java.util.ResourceBundle;
import java.util.Vector;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * <p>
 * Reads resource text data directly from a TMX (XML) file.
 * </p>
 * <p>
 * First, the TMXResourceBundle class instantiates itself with two parameters: a
 * TMX file name and a target language name. Then, using a DOM parser, it reads
 * all of a translation unit's properties for the key information and specified
 * language data and populates a hashtable with them.
 * </p>
 * <p>
 * <b>TMX info: </b> http://www.lisa.org/tmx/
 * </p>
 * 
 * <h4>Implementation notes</h4>
 * <p>
 * You instantiate the TMXResourceBundle class in a program to read data from a
 * TMX file. Once the class is instantiated, it reads all the data in a TMX file
 * and loads into a DOM tree. Then it populates a hashtable so the
 * handleGetObject() method can be called to find text information based on a
 * key just as a standard ResourceBundle class does. <br>
 * Instantiating the TMXResourceBundle class is the same as instantiating the
 * PropertyResourceBundle class. First you obtain a system language code (e.g.:
 * from a locale's information). In TMX the value of the attribute must be one
 * of the ISO language identifiers (a two- or three-letter code) or one of the
 * standard locale identifiers (a two- or three-letter language code, a dash,
 * and a two-letter region code).
 * </p>
 * 
 * Copyright (c) 2004-2006 Tecnick.com S.r.l (www.tecnick.com) Via Ugo Foscolo
 * n.19 - 09045 Quartu Sant'Elena (CA) - ITALY www.tecnick.com -
 * info@tecnick.com <br/> Project homepage: <a
 * href="http://tmxjavabridge.sourceforge.net"
 * target="_blank">http://tmxjavabridge.sourceforge.net</a><br/> License:
 * http://www.gnu.org/copyleft/lesser.html LGPL
 * 
 * @author Nicola Asuni [www.tecnick.com].
 * @version 1.1.008
 */

public class TMXResourceBundle extends ResourceBundle implements Serializable {

	/**
	 * Serial Version UID
	 */
	private static final long serialVersionUID = -2098421084432070017L;

	/**
	 * The hashtable that will contain data loaded from XML
	 */
	protected Hashtable hashcontents = null;

	/**
	 * Number of translation units (tu) items
	 */
	protected int numberOfItems = 0;

	/**
	 * Vector to store tu items keys
	 */
	protected Vector vectOfItems;
	
	/**
	 * TMX to Hashtable conversion. Reads the XML and stores the data in a Hashtable.
	 * 
	 * @param xmlfile
	 *            the TMX (XML) file to read, supports also URI resources or JAR
	 *            resources
	 * @param language
	 *            ISO language identifier (a two- or three-letter code)
	 */
	public TMXResourceBundle(String xmlfile, String language) {
		this(xmlfile, language, "");
	}

	/**
	 * Copy object data to this object.
	 * @param obj object to copy.
	 */
	private void copyTMXResourceBundle(TMXResourceBundle obj) {
		this.hashcontents = obj.hashcontents;
		this.numberOfItems = obj.numberOfItems;
		this.vectOfItems = obj.vectOfItems;
	}
	
	/**
	 * TMX to Hashtable conversion. Reads the XML and stores the data in a
	 * Hashtable. NOTE: you must manually delete the cachefile to refresh its
	 * content.
	 * 
	 * @param xmlfile
	 *            the TMX (XML) file to read, supports also URI resources or JAR
	 *            resources
	 * @param language
	 *            ISO language identifier (a two- or three-letter code)
	 * @param cachefile
	 *            name of the file used to store cache data for the specified
	 *            language
	 */
	public TMXResourceBundle(String xmlfile, String language, String cachefile) {
		
		// try to get data from cachefile (if any)
		if (cachefile.length() > 0) {
			try {
				FileInputStream fis = new FileInputStream(cachefile);
				ObjectInputStream in = new ObjectInputStream(fis);
				copyTMXResourceBundle((TMXResourceBundle)in.readObject());
				in.close();
				return;
			} catch (Exception e) {
				System.err.println("Exception:" + e);
			}
		}
		
		String temp_key = null; // store hashtable key names
		String temp_value = null; // store hashtable values
		NamedNodeMap temp_list = null; // list of <tu> attributes
		Attr temp_attr = null; // <tu> attribute
		NodeList listOfTUVs = null; // list of <tuv> elements
		NodeList listOfSEG = null; // list of <seg> elements
		Element SEGElements = null; // <seg> element
		int numberOfTUVs = 0; // number of <tuv> elements

		// Create Document with parser
		Document document = parseXmlFile(xmlfile, false);

		// handle document error
		if (document == null) {
			hashcontents = new Hashtable(); // initialize an empty hashtable
			vectOfItems = new Vector(); // keep getKeys() from failing on a null vector
			return;
		}

		// Make a list of translation units and count the number of items
		NodeList listOfTermUnits = document.getElementsByTagName("tu");
		numberOfItems = listOfTermUnits.getLength();

		// set tu keys vector size
		vectOfItems = new Vector(numberOfItems);

		// set hash size
		hashcontents = new Hashtable(numberOfItems);
		for (int i = 0; i < numberOfItems; i++) {
			temp_value = null;

			// set a key; skip <tu> elements that have no tuid attribute
			temp_list = listOfTermUnits.item(i).getAttributes();
			temp_attr = (Attr) temp_list.getNamedItem("tuid");
			if (temp_attr == null) {
				continue;
			}
			temp_key = temp_attr.getValue();

			vectOfItems.add(temp_key); // store key on vector

			// get a value
			// Make a TUV list => "listOfTUVs"
			Node TUVs = listOfTermUnits.item(i);
			if (TUVs.getNodeType() == Node.ELEMENT_NODE) {
				Element TUVElements = (Element) TUVs;
				listOfTUVs = TUVElements.getElementsByTagName("tuv");
				numberOfTUVs = listOfTUVs.getLength();
			}

			// Check each TUV. If it's a specified lang, then get a SEG value
			for (int j = 0; j < numberOfTUVs; j++) {
				temp_list = listOfTUVs.item(j).getAttributes();
				temp_attr = (Attr) temp_list.getNamedItem("xml:lang");
				if ((temp_attr != null) && temp_attr.getValue().equalsIgnoreCase(language)) {
					// -- Get a SEG value
					SEGElements = (Element) listOfTUVs.item(j);
					listOfSEG = SEGElements.getElementsByTagName("seg");
					try {
						temp_value = listOfSEG.item(0).getFirstChild()
								.getNodeValue();
					} catch (Exception e) {
						// in case of error print error message and set value to
						// void string
						System.err.println(this.getClass().getName() + "(\""
								+ xmlfile + "\", \"" + language + "\") :: "
								+ "Void <seg> value on <tu tuid=\"" + temp_key
								+ "\"> key");
						temp_value = "";
					}
				}
			}

			// Populate hashtable
			if ((temp_key != null) && (temp_value != null)) {
				hashcontents.put(temp_key, temp_value);
			}
		} // for loop
		// try to save this object on cache file
		if (cachefile.length() > 0) {
			try {
				FileOutputStream fos = new FileOutputStream(cachefile);
				ObjectOutputStream out = new ObjectOutputStream(fos);
				out.writeObject(this);
				out.close();
			} catch (Exception e) {
				System.err.println("Exception:" + e);
			}
		}
	}

	/**
	 * Parses an XML file and returns a DOM document.
	 * 
	 * @param filename
	 *            the name of XML file
	 * @param validating
	 *            If true, the content is validated against the DTD specified
	 *            in the file.
	 * @return the parsed document
	 */
	public Document parseXmlFile(String filename, boolean validating) {
		Document doc = null;
		DocumentBuilderFactory factory = null;
		// Create a builder factory
		try {
			factory = DocumentBuilderFactory.newInstance();
		} catch (FactoryConfigurationError e) {
			System.err.println(e);
			return null;
		}
		factory.setValidating(validating);
		// Create the builder and parse the file
		try {
			try {
				// try to get the file from jar
				InputStream instream = getClass().getResourceAsStream(filename);
				doc = factory.newDocumentBuilder().parse(instream);
			} catch (Exception ejar) {
				try {
					// try to get the file as external URI
					doc = factory.newDocumentBuilder().parse(filename);
				} catch (IOException euri) {
					try {
						// try to get the file as local filename
						doc = factory.newDocumentBuilder().parse(new File(filename));
					} catch (IOException efile) {
						try {
							// try to resolve the path as relative to local
							// class folder
							// File.pathSeparator is ";" on Windows and ":" on Unix-like systems
							String[] classPath = System.getProperties().getProperty("java.class.path", ".").split(File.pathSeparator);
							String newpath = classPath[0] + "/" + filename;
							doc = factory.newDocumentBuilder().parse(
									new File(newpath));
						} catch (IOException epath) {
							// unable to get the input file
							System.err.println("IOException:" + epath);
						}
					}
				}
			}
		} catch (ParserConfigurationException e) {
			System.err.println("[" + filename + "] ParserConfigurationException:" + e);
		} catch (SAXException e) {
			System.err.println("[" + filename + "] SAXException:" + e);
		}
		return doc;
	}

	/**
	 * Get key value, return default if void.
	 * 
	 * @param key
	 *            name of key
	 * @param def
	 *            default value
	 * @return parameter value or default
	 */
	public String getString(String key, String def) {
		String param_value = "";
		try {
			param_value = this.getString(key);
			if ((param_value != null) && (param_value.length() > 0)) {
				return param_value;
			}
		} catch (Exception e) {
			// for any exception return the default value
			return def;
		}
		return def;
	}

	/**
	 * handleGetObject implementation
	 * 
	 * @param key
	 *            the resource key
	 * @return the content associated to the specified key
	 * @throws MissingResourceException
	 */
	public final Object handleGetObject(String key) throws MissingResourceException {
		return hashcontents.get(key);
	}

	/**
	 * Returns the number of translation units
	 * 
	 * @return number of Items
	 */
	public int getNumberOfItems() {
		return numberOfItems;
	}

	/**
	 * Define getKeys method
	 * 
	 * @return item elements
	 */
	public Enumeration getKeys() {
		return vectOfItems.elements();
	}

}

Sample Class

This class shows how to instantiate the TMXResourceBundle class with the example file sample_tmx.xml.
In this example the language codes (en = English, it = Italian) are specified explicitly, but they can also be obtained from a locale.
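As a small illustration of the last point, the language code can be derived from a java.util.Locale. The helper class LocaleLanguageCode below is hypothetical; it only shows the two forms of identifier mentioned in the class documentation (a language code alone, or a language code, a dash and a region code).

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class LocaleLanguageCode {

	/** Language code alone, for example "it". */
	static String languageOnly(Locale locale) {
		return locale.getLanguage();
	}

	/** Language, dash and region, for example "en-US"; language alone if no region is set. */
	static String languageAndRegion(Locale locale) {
		String country = locale.getCountry();
		if (country.length() == 0) {
			return locale.getLanguage();
		}
		return locale.getLanguage() + "-" + country;
	}
}
```

The result of LocaleLanguageCode.languageOnly(Locale.getDefault()) could then be passed as the language argument of the TMXResourceBundle constructor, provided the TMX file contains a tuv with a matching xml:lang value.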

Source Code

package com.tecnick.tmxjavabridge.sample;

import com.tecnick.tmxjavabridge.TMXResourceBundle;

/**
 * Sample class for TMXResourceBundle class.
 * <br/><br/>
 * Copyright (c) 2004-2006
 * Tecnick.com S.r.l (www.tecnick.com) 
 * Via Ugo Foscolo n.19 - 09045 Quartu Sant'Elena (CA) - ITALY
 * www.tecnick.com - info@tecnick.com<br/>
 * License: http://www.gnu.org/copyleft/lesser.html LGPL
 * 
 * @author Nicola Asuni [www.tecnick.com].
 * @version 1.1.008
 */
public class TMXJBSample {
	
	/**
	 * loads TMX data
	 */
	final static TMXResourceBundle res_en = new TMXResourceBundle("tmx/sample_tmx.xml", "en");
	// test cache system
	final static TMXResourceBundle res_it = new TMXResourceBundle("tmx/sample_tmx.xml", "it", "src/com/tecnick/tmxjavabridge/test/test_tmx_it.obj");
	
	/**
	 * Prints 2 strings on System.out
	 * @param args String[]
	 */
	public static void main(String[] args) {
		System.out.println(res_en.getString("hello", ""));
		System.out.println(res_en.getString("world", ""));
		System.out.println(res_it.getString("hello", ""));
		System.out.println(res_it.getString("world", ""));
	}
}
