Sunday, May 31, 2009

Stopping the Word Count Insanity

By Andrzej Zydron,
xml-Intl Ltd.

In the localization industry, there is a total lack of consistency among word or character counts, not only between rival products, but even among different versions of the same product. The same can be said for word processing software: word and character counts differ among vendors and versions. An additional problem is that none of this software provides any proper verifiable specification as to how the actual metrics are determined. You have to accept them as they are.

This is effectively the same situation that existed for weights and measures before the French Revolution established a sane and uniform system that everyone could agree upon, one that we still use today (with minor exceptions).

Trying to establish a measure for the size of a given localization task poses a real problem for the professional who is trying to calculate a price. The differences in word and character counts among different translation or word processing tools can be as much as 20 percent. And such a gap can mean the difference between profitability and loss.

Realizing that this problem needed to be addressed by an independent industry body, LISA OSCAR undertook the task, in 2004, of establishing a standard that everyone can agree on and that can be independently verified.

Nearly three years later, we finally have a far-reaching and extensively reviewed approach to this problem. The core of the new standard comes under the umbrella concept of Global Information Management Metrics Exchange, or GMX for short.

We all know that word and character counts are not the only measure of a given localization task. Thus, GMX comprises three standards:

# GMX-V (for volume)
# GMX-Q (for quality)
# GMX-C (for complexity)

GMX-V is the first of the three standards to be completed. Work will commence in 2007 on GMX-Q and GMX-C. Quality (GMX-Q) will deal with the level of quality required for a task. For example, the quality required for the translation of a legal document is much higher than that for technical documentation that will have a relatively small audience. Complexity (GMX-C) will take into consideration the source and format of the original document and its subject matter. For example, a highly technical document dealing with a narrow, specialized domain is far more complex to translate than user instructions for a simple consumer device.

All standards in the GMX family rely on an XML vocabulary for the exchange of metric data. Using the three standards together, it will be possible to have a uniform measure for defining the specific aspects of a localization task, to a point where one can completely automate all the pricing aspects of the task and exchange this data electronically.

GMX-V

GMX-V is designed to fulfill two primary roles:

* Establish a verifiable way of calculating the primary word and character counts for a given electronic document.
* Establish a specific XML vocabulary that enables the automatic exchange of metric data.

As with all good standards, GMX-V is itself based on other well established standards:

* Unicode 5.0 normalized form
* Unicode Technical Report 29 – Text Boundaries
* OASIS XML Localization Interchange File Format (XLIFF) 1.2
* LISA OSCAR Segmentation Rules Exchange (SRX) 2.0

WORDS AND CHARACTERS

GMX-V mandates both word and character counts. Character counts convey the most precise definition of a localization task, whereas word counts are the most commonly used metric in the industry.

OTHER METRICS

The XML exchange notation of GMX-V allows for the exchange of all metrics relating to a given localization task, such as page counts, file counts, screen shot counts, etc.

CANONICAL FORM

One of the main problems with calculating word and character counts is the sheer range of differing proprietary file formats. Trying to establish a standard that addresses all formats is impossible. GMX-V required a canonical form that effectively levels the playing field. Such a common format is available through the OASIS XLIFF standard, which is now supported by all of the localization tool providers.

Within XLIFF, inline codes are interpreted as inline XML elements. The inline elements are not included in the word and character counts, but form a separate inline element count of their own. The frequency of inline elements can have an impact on the translation workload, so a separate count is useful when sizing a job. Punctuation and white space characters are also featured as additional categories.
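
To make the separation of text and inline elements concrete, here is a minimal Python sketch. It assumes XLIFF 1.2 inline element names and treats the content of bpt, ept, ph and it as native code rather than translatable text. The function names are hypothetical, and the exact treatment that GMX-V prescribes for each element should be checked against the specification.

```python
import xml.etree.ElementTree as ET

# Assumption: these XLIFF 1.2 inline elements carry native code, not text.
CODE_TAGS = {"bpt", "ept", "ph", "it", "x", "bx", "ex"}
# Assumption: these inline elements wrap text that is still translatable.
WRAPPER_TAGS = {"g", "mrk"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{urn:...}source'."""
    return tag.rsplit("}", 1)[-1]


def split_text_and_inline(element):
    """Return (translatable_text, inline_element_count) for one element."""
    parts = [element.text or ""]
    inline_count = 0
    for child in element:
        name = local_name(child.tag)
        if name in CODE_TAGS:
            inline_count += 1          # count the element, skip its content
        elif name in WRAPPER_TAGS:
            inline_count += 1          # count the element, keep its text
            text, nested = split_text_and_inline(child)
            parts.append(text)
            inline_count += nested
        parts.append(child.tail or "")  # text that follows the inline element
    return "".join(parts), inline_count


source = ET.fromstring(
    '<source>Click <bpt id="1">&lt;b&gt;</bpt>Save<ept id="1">&lt;/b&gt;</ept>'
    ' to continue<x id="2"/>.</source>'
)
text, inline_count = split_text_and_inline(source)
print(text, inline_count)
```

The word and character counts would then be computed on the returned text only, while the inline element count is reported alongside them.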

GMX-V addresses all issues related to counting words and characters in the XLIFF canonical format. Since the sentence is the commonly accepted atomic unit for translation, it proposes sentence-level granularity for counting purposes within XLIFF.

GMX-V does not preclude producing metrics directly from non-XLIFF files, as long as the format for counting is based on the XLIFF canonical form for each text unit being counted. This can be done dynamically on the fly, and it requires an audit file for verification purposes.

WORDS

GMX-V uses “Unicode Technical Report 29 (TR29-9) – Text Boundaries” to define words and characters. This report provides a clear and unambiguous definition of both word boundaries and character (“grapheme”) boundaries.
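
As a rough illustration of boundary-based counting, the following Python sketch approximates word and character counts using only the standard library. It is not an implementation of TR29: the regular expression is a simplified stand-in for the word boundary rules, and the treatment of combining marks is a simplified stand-in for grapheme boundaries. It also assumes that punctuation and white space are tallied separately from the main character count. A conforming tool would rely on a complete TR29 segmenter.

```python
import re
import unicodedata

# Simplified stand-in for TR29 word rules: runs of word characters, allowing
# an apostrophe or full stop between them ("don't", "3.5").
WORD_RE = re.compile(r"\w+(?:['\u2019.]\w+)*")


def approximate_counts(text):
    """Return approximate word, character, punctuation and white space counts."""
    text = unicodedata.normalize("NFC", text)  # count on a normalized form
    counts = {
        "words": len(WORD_RE.findall(text)),
        "characters": 0,
        "punctuation": 0,
        "white_space": 0,
    }
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # combining mark: belongs to the preceding character
        if ch.isspace():
            counts["white_space"] += 1
        elif category.startswith("P"):
            counts["punctuation"] += 1
        else:
            counts["characters"] += 1
    return counts


print(approximate_counts("Don't panic, it costs 3.5 euros."))
```

Because the rules are written down explicitly, two parties running the same code on the same text obtain the same figures, which is the property the standard is after.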

LOGOGRAPHIC SCRIPTS

Word counts have little relevance for Chinese, Japanese, Korean (CJK) and Thai source text. For these languages, GMX-V recommends using only character counts.
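
A tool could apply this recommendation by selecting the primary metric from the source language. The sketch below is one possible way to do so. The language set and the function name are assumptions made for illustration and are not taken from the GMX-V specification.

```python
# Assumption: primary language subtags for which only character counts are used.
CHARACTER_ONLY_LANGUAGES = {"zh", "ja", "ko", "th"}


def primary_metric(language_tag):
    """Return 'characters' for CJK and Thai source text, otherwise 'words'."""
    primary = language_tag.replace("_", "-").split("-")[0].lower()
    return "characters" if primary in CHARACTER_ONLY_LANGUAGES else "words"


for tag in ("en-US", "zh-CN", "ja", "th_TH"):
    print(tag, primary_metric(tag))
```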

There is a proposal before ISO TC 37, submitted by Professor Sun Maosong, relating to the automatic identification of word boundaries for CJK languages. Should this recommendation become a standard, GMX-V should reference it for the provision of CJK word counts.

QUANTITATIVE AND QUALITATIVE MEASUREMENTS

GMX-V counts fall into two categories: how many and what type. The primary count is unqualified. For example, how many characters and words are in the file? This is the minimal conformance level proposed for GMX-V.

A typical translatable document will contain a variety of text elements. Some of these elements will contain non-translatable text, some will have been matched from translation memory, and some will have been fuzzy matched by the customer. Therefore, it is important to be able to categorize the word and character counts according to type, in order to provide a figure in words and characters for a given localization task. GMX-V also provides an extension mechanism that enables user defined categories.

COUNT CATEGORIES

Apart from the total-word-count and total-character-count values, GMX-V also includes these count categories:

* In-context exact matches – An accumulation of the word and character count for text units that have been matched unambiguously with a prior translation and that require no translator input.

* Leveraged matches – An accumulation of the word and character count for text units that have been matched against a leveraged translation memory database.

* Repetition matches – An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.

* Fuzzy matches – An accumulation of the word and character count for text units that have been fuzzy matched against a leveraged translation memory database.

* Alphanumeric-only text units – An accumulation of the word and character counts for text units that have been identified as containing only alphanumeric words.

* Numeric-only text units – An accumulation of the word and character counts for text units that have been identified as containing only numeric words.

* Punctuation characters – An accumulation of the punctuation characters.

* White Spaces – An accumulation of white space characters.

* Measurement-only – An accumulation of the word and character count from measurement-only text units.

* Other Non-translatable words – An accumulation of other non-translatable word and character counts.

* Automatically treatable text – A count of automatically treatable inline elements, such as date, time, measurements, or simple and complex numeric values.
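
The way counts accumulate into categories can be sketched in Python as follows. The TextUnit class, its field names and the "new" bucket for unmatched text are hypothetical. The only ordering taken from the list above is that repetition takes precedence over fuzzy matching and applies to units not matched in any other form. Ranking in-context exact matches ahead of leveraged matches is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextUnit:
    """Hypothetical record for one text unit and its match status."""
    text: str
    words: int
    characters: int
    in_context_exact: bool = False
    leveraged: bool = False
    fuzzy: bool = False


def categorize(units):
    """Accumulate word and character counts per category."""
    seen = set()
    totals = Counter()
    for unit in units:
        if unit.in_context_exact:
            category = "in_context_exact"
        elif unit.leveraged:
            category = "leveraged"
        elif unit.text in seen:
            category = "repetition"   # checked before fuzzy: it takes precedence
        elif unit.fuzzy:
            category = "fuzzy"
        else:
            category = "new"          # hypothetical bucket for unmatched text
        seen.add(unit.text)
        totals[category + "_words"] += unit.words
        totals[category + "_characters"] += unit.characters
        totals["total_words"] += unit.words
        totals["total_characters"] += unit.characters
    return totals


units = [
    TextUnit("Save the file.", 3, 11, fuzzy=True),
    TextUnit("Save the file.", 3, 11, fuzzy=True),
    TextUnit("Open", 1, 4, leveraged=True),
]
totals = categorize(units)
print(dict(totals))
```

In this example the second occurrence of the fuzzy-matched sentence is counted as a repetition rather than as a second fuzzy match.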

VERIFIABILITY

Any measurement standard must have a reference implementation, as well as an authoritative body that tests and validates the measuring instruments. In the US, this is provided by the National Institute of Standards and Technology. In order to be successful, GMX-V must provide for a certification authority that will (1) maintain reference documents with known metrics and (2) provide an online facility to test given XLIFF documents. In this way, both customers and suppliers can be confident that GMX-V provides an unambiguous and reliable way of quantifying a localization or global-information-management task.

NON-VERIFIABLE METRICS AND EXCHANGE NOTATION

There are many instances where it is not possible to verify the metrics data electronically, such as screen shots, number of pages, etc. GMX-V allows for the annotation and exchange of all relevant metrics for a given localization task.
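
To give a sense of what exchanging verifiable and non-verifiable metrics in one XML document might look like, here is a Python sketch. The element and attribute names (metrics, count, type, value, verifiable) are placeholders invented for this illustration. They are not the GMX-V vocabulary, which is defined in the specification itself.

```python
import xml.etree.ElementTree as ET


def build_metrics_document(verifiable, non_verifiable):
    """Serialize two dictionaries of counts into a small XML document."""
    root = ET.Element("metrics")  # placeholder name, not from the GMX-V schema
    for name, value in verifiable.items():
        ET.SubElement(root, "count",
                      {"type": name, "value": str(value), "verifiable": "yes"})
    for name, value in non_verifiable.items():
        ET.SubElement(root, "count",
                      {"type": name, "value": str(value), "verifiable": "no"})
    return ET.tostring(root, encoding="unicode")


document = build_metrics_document(
    {"total-word-count": 1200, "total-character-count": 6400},
    {"screen-shot-count": 14, "page-count": 32},
)
print(document)
```

The receiving side can parse the same document and feed the figures straight into a quotation or invoicing system.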

SUMMARY

GMX-V has been widely peer reviewed and published for open public comment for eighteen months. Much valuable feedback has been submitted and incorporated into the standard. All major localization tool providers have been consulted, to ensure that there are no obstacles to implementing it. GMX-V also provides a specification that can be used by word processing tool vendors and localization tool suppliers. It provides a consistent and unambiguous common standard for word and character counts.

Further details of GMX-V are available at the following URL: www.lisa.org/standards/gmx

ClientSide News Magazine - http://www.clientsidenews.com/
