How is wordcount calculated in XTM Cloud?
Introduction
You might notice a discrepancy between the number of words in the source file and the figure shown in XTM Cloud Metrics after project creation. This is because every CAT tool, and every other application, counts words according to its own set of rules, so word counts tend to differ across software. For example, one application might count an email address as one word, whereas another might count it as three words. Wordcount in XTM Cloud is therefore calculated “on its own terms”.
Calculation of XTM wordcount as compared with other software
Algorithm for XTM wordcount
Not all CAT tools share the same wordcount mechanism. Many word count algorithms exist, and each CAT tool or other application might be based on a different one. Note that Microsoft Office is not a CAT tool.
XTM Cloud uses the GMX-V algorithm (Global Information Management Metrics eXchange - Volume). It recognizes word boundaries based on various factors, which are explained in detail in the sources below. These contain all the specifications for the algorithm:
XTM wordcount vs. Microsoft Office
In the vast majority of cases, clients compare the wordcount reported by XTM Cloud with the one reported by MS Office. As mentioned in the previous section, Microsoft Office is not a CAT tool, so there is no point in comparing the two. What is more, if a project uses a filter template, the word count might be higher or lower, because the template might exclude some elements from translation or include others.
Nevertheless, given the way that the algorithm mentioned above works, there are some major differences in the ways that each of these applications “recognizes” words:
Compound words that are hyphenated (i.e. contain the hyphen character "-") are counted differently. In a Word document, such a compound is counted as a single word, whereas in XTM Cloud it is counted as two words.
Hidden elements (such as slides, columns, or pages) in MS Office are counted despite being hidden, while XTM Cloud does not count them by default unless they are targeted for translation. Targeting them requires a special configuration, either in the back-end, performed by the XTM International Support team, or in the form of a filter template, which you can create on your own in the XTM Cloud UI: Configuration → Filter templates (administrative privileges required!).
When using source files in DOCX and similar file formats, remember that these file types often allow something called alternate content. This kind of content is usually a string of text contained within chart or image descriptions or within text boxes. If you find duplicated segments, and the duplicate cannot be found within the source file, you are probably dealing with alternate content.
Inline tags which are placed within the boundaries of one word divide it into two words in XTM Cloud.
Names of chemical compounds and dates are also counted differently in XTM Cloud.
Semicolons, commas, and slashes between words are also counted differently in XTM Cloud.
EXAMPLES
Slashes | Word boundaries | Commas with no whitespaces
The phrase sentence/text in MS Office is treated as a single word. In XTM Cloud, the slash "/" character is considered a word separator and this phrase will therefore be counted as two words. | The following example, taken from Unicode TR 29, shows an example of the identification of grapheme boundaries: The quick ("brown") fox can't jump 32.3 feet, right? Extracted words in XTM: | The phrase Skilled,Rocking,Famous in MS Office or Excel is considered a single word (because there is no whitespace after each comma). In XTM Cloud, the comma "," character is considered a word separator, and this phrase will therefore be counted as three words.
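The slash, comma, and hyphen differences described above can be illustrated with a minimal sketch. It is not the GMX-V algorithm and not XTM Cloud's implementation. The two function names and the separator set are assumptions made for illustration, and they only contrast whitespace-based counting with counting that also treats those three characters as word separators.

```python
import re

# Hypothetical separator set, for illustration only: whitespace plus the
# slash, comma, and hyphen discussed in this article. The actual GMX-V
# word-boundary rules are more detailed than this.
SEPARATORS = re.compile(r"[\s/,\-]+")


def count_whitespace_words(text: str) -> int:
    """Count words the way a whitespace-only counter would."""
    return len(text.split())


def count_separator_words(text: str) -> int:
    """Count words when slash, comma, and hyphen also split words."""
    return len([token for token in SEPARATORS.split(text) if token])


for phrase in ("sentence/text", "Skilled,Rocking,Famous", "well-known"):
    print(phrase, count_whitespace_words(phrase), count_separator_words(phrase))
```

In this sketch, each of the three phrases counts as one word under whitespace-only splitting, while the separator-aware count gives two, three, and two words respectively.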
IMPORTANT!
Keep in mind that the vast majority of elements you want to be extracted for translation can, in fact, be set up either by you or our XTM International Support team.
To find out more about how XTM Cloud extracts content from Office files for translation, we recommend that you read this article: How are segments in MS Office files (Word, Excel, PowerPoint) extracted?
In XTM Cloud, you can create your own filter templates. These are used to identify translatable text in a document with a specific Office format. Familiarize yourself with the guideline: How to create a filter template in the XTM Cloud UI and apply it to a project.
If you need more complex filters, do not hesitate to contact our XTM International Support team and provide the details. When requesting our help, ensure that you follow the guidelines in this article: How to request a new configuration or change to an existing one?
Additionally, learn about possible configuration levels from this article: How can a source file be processed and what are configuration levels?
XTM wordcount vs. other CAT tools
XTM Cloud wordcount is not aligned with the way other CAT tools treat words. As mentioned at the beginning, other CAT tools treat some words, formulas, etc. differently than we do in XTM Cloud. Copied elements, merged cells, semicolons between words, and commas can also be counted differently by these tools.
XTM wordcount for Asian languages
XTM Cloud gives a user full control over the word count for Asian languages (Chinese, Japanese, Korean). To alter the character-per-word ratio, go to Configuration → Settings → Translation → Metrics → Metrics calculations (administrative privileges required!).
Here you can set how many characters will be converted into one word for metrics calculation. For each Asian language, enter the number of characters that is equivalent to one word. In a mixed-script source (e.g. Japanese + English), the non-Asian text is counted in the normal way and the Asian-script text is counted according to the factors set in the above-mentioned section.
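As a rough sketch of the conversion described above, the following example estimates a word count for mixed-script text. The factor values, the `estimate_words` function, the simplified Unicode ranges, and the rounding up of partial words are all assumptions made for illustration. The real factors are whatever is entered under Metrics calculations, and XTM Cloud's actual rounding behavior is not stated in this article.

```python
import math
import re

# Hypothetical characters-per-word factors; in XTM Cloud the real values
# are the ones an administrator enters under Metrics calculations.
CHARS_PER_WORD = {"zh": 2.0, "ja": 3.0, "ko": 2.5}

# Simplified range set: CJK ideographs, Hiragana, Katakana, Hangul syllables.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def estimate_words(text: str, lang: str) -> int:
    """Estimate a word count for text that may mix CJK and non-CJK script."""
    cjk_chars = len(CJK_CHAR.findall(text))
    # Blank out CJK characters so the remaining text splits on whitespace.
    other_words = len(CJK_CHAR.sub(" ", text).split())
    # Assumption: partial words are rounded up.
    return other_words + math.ceil(cjk_chars / CHARS_PER_WORD[lang])


print(estimate_words("XTM Cloud は翻訳管理システムです", "ja"))
```

With the assumed factor of 3 characters per word for Japanese, the 11 Japanese characters in the sample convert to 4 words, and the two English words are counted normally, giving 6.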