Introduction
You may sometimes wonder why the number of words in the source file differs from the number shown in XTM Metrics after project creation. This is because each CAT tool, like any other application, operates according to its own set of rules. Wordcount is no exception and tends to differ across software. For example, one application may count an email address as one word, while another may count it as three. Wordcount in XTM is thus calculated “on its own terms”.
Calculation of XTM wordcount as compared with other software
Algorithm for XTM wordcount
Not all CAT tools share the same basic wordcount mechanism (please note that Microsoft Office is not a CAT tool). In fact, each CAT tool or other application may be based on a different one of the many existing word count algorithms.
XTM uses the GMX-V algorithm (Global Information Management Metrics eXchange - Volume). It recognizes word boundaries based on various factors, which are explained in detail in the sources below that contain the full specification of the algorithm:
XTM wordcount vs Microsoft Office
In the vast majority of cases, clients compare the wordcount reported by XTM with the one reported by MS Office for the same file. As mentioned in the previous section, Microsoft Office is not a CAT tool, so the two counts are not directly comparable. What is more, if a given project uses a filter template, the word count may be higher or lower, because the template may include some elements in translation or exclude them from it.
Nevertheless, given how the aforementioned algorithm works, a few major differences in how each of these applications “recognizes” words need to be laid out:
Compound words that are hyphenated (i.e. contain the hyphen “-” character) are counted differently. In a Word document, such a compound is counted as a single word, whereas in XTM it is counted as two words.
Hidden elements (such as slides, columns, or pages) in MS Office are counted despite being hidden, while XTM does not count them by default unless they are targeted for translation. Targeting them requires a configuration on the back end (or a filter template) prepared by the XTM Support team, or a filter template you create on your own in the XTM UI: Configuration → Filter templates (administrative privileges required!).
When using DOCX (and similar) file formats as source files, remember that this file type often allows so-called alternate content. Such content is usually a string of text contained in the descriptions of charts or images, or in text boxes. If you find duplicated segments and the duplicate is nowhere to be seen in the source file, you are most likely dealing with alternate content.
Inline tags placed within the boundaries of a single word divide it into two words in XTM.
Names of chemical compounds as well as dates are also counted differently in XTM.
Semicolons, commas, and slashes between words are also counted differently in XTM.
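The effect of hyphens and inline tags described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GMX-V algorithm or XTM's actual implementation: the tag pattern and both function names are hypothetical, and the "whitespace only" counter is merely a stand-in for a tool that does not split on these characters.

```python
import re

# Assumed pattern for inline tags such as <b> or </b>; real tag formats vary by file type.
INLINE_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def whitespace_only_count(text: str) -> int:
    """Drop inline tags, then count whitespace-separated tokens."""
    return len(INLINE_TAG.sub("", text).split())


def boundary_aware_count(text: str) -> int:
    """Treat hyphens and inline tags as word boundaries before counting."""
    with_boundaries = INLINE_TAG.sub(" ", text).replace("-", " ")
    return len(with_boundaries.split())


for sample in ["well-known", "Trans<b>lation</b>"]:
    print(sample, whitespace_only_count(sample), boundary_aware_count(sample))
```

For both samples the first counter reports one word and the second reports two, which mirrors the kind of discrepancy listed above.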
EXAMPLES
Slashes | Word boundaries | Commas with no whitespaces
The phrase sentence/text in MS Office is treated as a single word. In XTM the slash "/" character is considered a word separator, and this phrase will thus be counted as two words. | The following example, taken from Unicode TR 29, shows an example of the identification of grapheme boundaries: Extracted words in XTM: | The phrase Skilled,Rocking,Famous in MS Office or Excel is considered a single word (because there is no whitespace after the comma). In XTM the comma "," character is considered a word separator, and this phrase will thus be counted as three words.
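The slash and comma examples can be reproduced with a short Python sketch. It assumes a simplified separator set (whitespace, slash, comma) chosen only to match the two phrases in the table; the names SEPARATORS and split_on_separators are hypothetical, and the real GMX-V rules cover far more cases.

```python
import re

# Assumed separator set for this illustration: whitespace, slash, and comma.
SEPARATORS = re.compile(r"[\s/,]+")


def split_on_separators(text: str) -> list[str]:
    """Split text on the assumed separators and drop empty tokens."""
    return [token for token in SEPARATORS.split(text) if token]


for phrase in ["sentence/text", "Skilled,Rocking,Famous"]:
    tokens = split_on_separators(phrase)
    # A whitespace-only split sees each phrase as a single token.
    print(phrase, len(phrase.split()), len(tokens), tokens)
```

With these assumptions, sentence/text yields two tokens and Skilled,Rocking,Famous yields three, while a whitespace-only split reports one word for each.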
IMPORTANT!
Please keep in mind that the vast majority of elements you wish to have extracted for translation can be set up either by you or by our XTM Support team.
If you want to learn more about the way XTM extracts content for translation from Office files, please read the following article: How does the extraction of segments in MS Office files work (Word, Excel, PowerPoint)?
XTM allows you to create your own filter templates, which are used to identify translatable text in a document of a specified Office format. Please read the guideline: How to create a filter template in XTM UI and apply it to a project.
For any more complex filters, do not hesitate to contact our XTM Support team and provide the details. When requesting our help, make sure you follow the guidelines specified in this article: How to request a new configuration or a change of an existing one?
Additionally, learn about possible configuration levels from this article: How can a source file be processed and what are configuration levels?
XTM wordcount vs other CAT tools
We do not take other CAT tools, or the particular way they treat words, into consideration. As mentioned at the beginning, other CAT tools treat some words, formulas, etc. differently than XTM does. Copied elements, merged cells, semicolons between words, and commas can also be counted differently by each such tool.
XTM wordcount for Asian languages
XTM gives the user full control over the word count for Asian (Chinese, Japanese, Korean) languages. To alter the characters-per-word ratio, go to Configuration → Settings → Translation → Metrics → Metrics calculations (administrative privileges required!).
Here you can set how many characters are converted into one word for metrics calculation: for each Asian language, enter the number of characters that is equivalent to one word. For a mixed-script source (e.g. Japanese + English), the non-Asian text is counted in the usual way, while the Asian script is counted according to the factors set in the above-mentioned section.
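The characters-per-word conversion for a mixed-script source can be sketched as follows. This is a minimal illustration, not XTM's implementation: the function name, the Unicode ranges used to detect Asian characters, and the ratio of 2.0 in the usage line are all assumptions. Since the text above does not say how partial words are rounded, the sketch returns the unrounded value. It also does not handle CJK punctuation.

```python
import re

# Assumed ranges: Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def mixed_word_count(text: str, chars_per_word: float) -> float:
    """Convert Asian characters to words by ratio; count the rest by whitespace."""
    cjk_chars = len(CJK_CHAR.findall(text))
    # Replace Asian characters with spaces so only the other script remains.
    other_words = len(CJK_CHAR.sub(" ", text).split())
    return cjk_chars / chars_per_word + other_words


# Hypothetical ratio: 2 characters are equivalent to one word.
print(mixed_word_count("XTM は翻訳ツールです", 2.0))  # 8 characters / 2 + 1 word = 5.0
```

The sample contains eight Japanese characters and one English word, so a ratio of 2.0 gives 4 + 1 = 5.0 words. Changing the ratio changes only the Asian part of the total.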