How are segments in MS Office files (Word, Excel, PowerPoint) extracted?

Introduction

You might sometimes find that a particular piece of content in a Microsoft Office file (for example DOCX, XLSX or PPTX) is not extracted for translation in XTM Workbench. In the vast majority of cases, this is because certain cells in an Excel file or text in a Word file do not have the required format or style.

By default, XTM Cloud extracts only certain content for translation. At your explicit request, however, we can change the configuration so that exactly the content you need is displayed in the editor.

Any change of this kind will affect the total word count for any project in the XTM Cloud UI (see How is wordcount calculated in XTM Cloud? for more information).


What is and can be extracted for a particular MS Office file?

Word – .doc, .docx, .rtf

DoNotTranslate restriction

When processing Word formats, the most important thing to remember is that content is extracted for translation in XTM Cloud as long as the DoNotTranslate style has not been applied to it. Once that style is detected, the text it is applied to is excluded during file analysis in XTM Cloud, so it is not displayed in XTM Workbench.

IMPORTANT!

Pay attention to the exact wording of the style name. XTM Cloud only recognizes the following names:

  • DoNotTranslate;

  • Donttranslate;

  • tw4winExternal.

To apply such a style, proceed as follows:

  1. Right-click on the fragment in question and select Styles, and then Apply Styles.

  2. Enter the name of the style and click New (or Reapply if this style has already been applied elsewhere).

  3. As a result, this particular text will not be displayed in XTM Workbench.
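Before uploading a file, it can be useful to check which text already carries one of these styles. The following is a minimal sketch, not XTM's own logic: it reads the XML of word/document.xml (a DOCX file is a ZIP archive) and lists the runs whose character style matches one of the names above. The function name find_excluded_text is hypothetical, and matching on the style ID of the run (rather than on paragraph styles or on the display name stored in styles.xml) is an assumption.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Style names listed in this article. Comparing them with the style ID
# stored on the run is an assumption made for this sketch.
EXCLUDED_STYLES = {"DoNotTranslate", "Donttranslate", "tw4winExternal"}


def find_excluded_text(document_xml):
    """Return the text of runs whose character style is in EXCLUDED_STYLES."""
    root = ET.fromstring(document_xml)
    excluded = []
    for run in root.iter(f"{{{W_NS}}}r"):
        style = run.find("w:rPr/w:rStyle", NS)
        if style is None:
            continue  # run has no character style applied
        if style.get(f"{{{W_NS}}}val") in EXCLUDED_STYLES:
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            excluded.append(text)
    return excluded
```

To try it on a real file, read the part with zipfile.ZipFile(path).read("word/document.xml") and pass the result to the function. An empty list only means that no run carries the style directly; the style may still be applied at paragraph level.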

Hidden option

There is also another way to exclude particular text from translation in XTM Cloud: use the Hidden option.

  1. Right-click on the fragment in question, and then select Font.

  2. In the Effects section, select Hidden and click OK.

  3. The text will disappear from the document and will not be available in XTM Workbench.

Alternate content

When translating in XTM Workbench, you might wonder why certain text is displayed in the editor twice. This is because of alternate content, which XTM takes for translation by default (unless the XTM International Support team has configured your instance otherwise). Such content is often a string of text contained in the description of a chart or image, or in a text box.

As a consequence, you might occasionally find that a translation (or an updated translation) does not appear in the target file. A likely cause is that you translated or updated the segment that contains the alternate content rather than the segment that contains the actual content.

Take a look at the following screenshot, which presents two segments: the first (70) is the actual content and the second (82) is its alternate content. In this particular example, the user updated only the second segment instead of the first. As a result, they did not see the updated translation in their target file.

IMPORTANT!

Ensure that you know the difference between alternate content and alternative text, as those are two separate functionalities.

Alternative text is mostly used in relation to images for which it provides a description that can then be read aloud for the visually impaired. Unlike alternate content, alternative text is not taken for translation in XTM Cloud!

Text not exposed in the source file

Sometimes you might notice that part of the text has not been extracted for translation in XTM Workbench. In all likelihood, this text is not exposed (that is, it is hidden) in the source file.

In the after-analysis file, such text has its style set to VANISH:ON:

<w:p entryName="word/document.xml" xtm-id="5880" elements-with-default-style="" defaultStyle="[PARAGRAPH_STYLE_NAME:Normal]">
  <w:t xml:space="preserve" xtm-id="5881" style="[BOLD:OFF;COLOR_ELM-VALUE:000000;VANISH:ON]">Clicking </w:t>
  <w:t xtm-id="5884" xml:space="preserve" style="[COLOR_ELM-VALUE:000000;VANISH:ON]">four times</w:t>
  <w:t xml:space="preserve" xtm-id="5887" style="[BOLD:OFF;COLOR_ELM-VALUE:000000;VANISH:ON]"> will change the element back to its original color.</w:t>
</w:p>

By default, such elements are not extracted for translation.

This behavior can be changed via configuration. Alternatively, you can simply adjust the formatting in the source file so that the text is no longer hidden.
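To locate hidden text in the source file itself, you can look for runs that carry the w:vanish property, which is how WordprocessingML stores the Hidden font effect. The sketch below is only an illustration under stated assumptions: the function name find_hidden_text is hypothetical, it works on the XML of word/document.xml, and it checks direct run formatting only, so text hidden through a style definition is not detected.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Attribute values that switch an on/off property such as w:vanish off.
OFF_VALUES = {"false", "0", "off"}


def find_hidden_text(document_xml):
    """Return the text of runs that are directly formatted as hidden."""
    root = ET.fromstring(document_xml)
    hidden = []
    for run in root.iter(f"{{{W_NS}}}r"):
        vanish = run.find("w:rPr/w:vanish", NS)
        if vanish is None:
            continue  # no hidden-text property on this run
        # A missing w:val attribute means the property is switched on.
        if vanish.get(f"{{{W_NS}}}val", "true") in OFF_VALUES:
            continue
        hidden.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return hidden
```

Any text returned here is a candidate for the VANISH:ON style shown above. Clearing the Hidden effect for that text in Word is one way to adjust the formatting in the source file.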

What else is extracted?

Apart from the above, XTM takes these elements for translation:

  • headers & footers,

  • dates,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts,

  • comments,

  • superscripts & subscripts,

  • dropdown values,

  • descriptions of images.

Excel (normal) – .xls, .xlsx, .xlsm

General information

The default Excel filter takes the whole text from a particular Excel cell and places it in a single segment in XTM Workbench. In other words, the rule is one Excel cell = one segment, which is reflected in the following default configuration:

<its:withinTextRule selector="//t" withinText="yes"/>

Take a look at the following example of a single Excel cell where the text is split into two lines:

This is how the aforementioned cell is “viewed” by the XTM filter. According to the default rule, the <t> element is treated as part of the text; therefore, the whole text is displayed as one segment in XTM Workbench.

<t>First segment</t> <t>second segment</t>

In Excel files, each line of text placed in a cell is wrapped in its own <t> element.
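The effect of the one cell = one segment rule can be illustrated with a short sketch. It is not the XTM filter: it only assumes that the cell text lives in xl/sharedStrings.xml, where every <si> item holds the text of one cell, and it joins all <t> elements found inside an item into a single string. The function name cell_segments is hypothetical.

```python
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def cell_segments(shared_strings_xml):
    """Return one string per <si> item, joining all of its <t> elements."""
    root = ET.fromstring(shared_strings_xml)
    segments = []
    for item in root.findall(f"{{{S_NS}}}si"):
        # Collect every <t> below the item, whether it sits directly
        # under <si> or inside a rich-text run.
        parts = [t.text or "" for t in item.iter(f"{{{S_NS}}}t")]
        segments.append("".join(parts))
    return segments
```

For the cell shown above, the two <t> elements end up in one entry of the returned list, which mirrors how the cell appears as one segment in XTM Workbench.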

What cell formats are extracted?

By default, XTM Cloud only takes the following cell formats for translation:

  • General (which is mostly a “text” format),

  • Text.

Other formats

As previously stated, only the General and Text cell formats are pulled in. What about the others?

The remaining formats are:

  • usually non-translatable (number, fraction, percentage, formula, accounting),

  • automatically localized by MS Office, which uses the system language of the user's device (currency, date, time).

For this reason, XTM Cloud does not extract them; otherwise the amount of content displayed in the editor (especially for Excel files) would be enormous.
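If a cell is missing from XTM Workbench, it can help to check which format the cell actually has in the file. The following sketch rests on assumptions: it does not show how XTM Cloud makes its decision, and the function name cells_by_format is hypothetical. It uses the built-in SpreadsheetML number format IDs, where 0 is General and 49 is Text, and resolves the style index of each cell against the cellXfs list in xl/styles.xml.

```python
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"s": S_NS}

# Built-in number format IDs: 0 is General, 49 is Text ("@").
GENERAL_OR_TEXT = {0, 49}


def cells_by_format(styles_xml, sheet_xml):
    """Map each cell reference to True if its format is General or Text."""
    styles = ET.fromstring(styles_xml)
    # A cell's "s" attribute is an index into this list of formats.
    xfs = styles.findall("s:cellXfs/s:xf", NS)
    format_ids = [int(xf.get("numFmtId", "0")) for xf in xfs]

    sheet = ET.fromstring(sheet_xml)
    result = {}
    for cell in sheet.iter(f"{{{S_NS}}}c"):
        index = int(cell.get("s", "0"))  # missing attribute means index 0
        format_id = format_ids[index] if index < len(format_ids) else 0
        result[cell.get("r")] = format_id in GENERAL_OR_TEXT
    return result
```

A cell that maps to False has some other format (for example a date or a number). Changing it to General or Text in Excel is one way to test whether the format is the reason the cell was skipped.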

What else is extracted?

Apart from the above, XTM Cloud takes these elements for translation:

  • sheet names,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts.

PowerPoint – .pptx

What is extracted?

By default, XTM Cloud takes these elements for translation:

  • slides (layout, master),

  • headers & footers,

  • dates,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts,

  • comments,

  • notes,

  • hidden slides and all the content they contain,

  • alternative text.

Custom configuration

It is possible to include or exclude specific data by creating and applying appropriate rules. In XTM Cloud, you can create your own filter templates, which are used to identify translatable text in a document of a specified format. We recommend that you read this article, which contains a step-by-step guide to setting up a configuration of this kind in the UI: How to create a filter template in the XTM Cloud UI and apply it to a project.

If you need more complex filters, do not hesitate to contact our XTM International Support team and provide details about them. When requesting our help, ensure that you follow the guidelines in this article: How to request a new configuration or change to an existing one.

Additionally, learn about possible configuration levels from this article: How can a source file be processed and what are configuration levels.

How to browse through the source file structure to find a particular element

You might sometimes find that one or more elements have been sent for translation in XTM Workbench but are not visible anywhere in the source file. Such a situation is usually caused not by incorrect parsing of the file in XTM Cloud but by an element of this kind being "hard-wired" somewhere in the source file structure.

To track down this element, proceed as follows:

  1. Right-click on the source file, select a file archiving program and then Extract to (NAME_OF_THE_FILE).

  2. A new folder will be created with the file’s name. The folder contains the structural elements of the source file.

  3. Open Notepad++.

  4. Use the keyboard shortcut CTRL + F to display the search dialog.

  5. Select Find in Files. In the dialog, enter your search phrase in the Find what field. Then, select the three dots box next to the Directory field and find the relevant folder. Click Find All.

  6. In the Search results section at the bottom of the window, you will find the exact path to the location of the element that you are interested in.
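The same search can also be scripted. The sketch below is an alternative to the manual steps above, not a replacement for them: it relies on DOCX, XLSX and PPTX files being ZIP archives, opens the file directly without extracting it, and returns the names of the internal parts that contain the search phrase. The function name find_in_office_file is hypothetical.

```python
import zipfile


def find_in_office_file(path_or_file, phrase):
    """Return the names of archive parts whose content contains the phrase."""
    hits = []
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            data = archive.read(name)
            # Binary parts such as images are decoded leniently so that
            # they cannot raise a decoding error.
            text = data.decode("utf-8", errors="ignore")
            if phrase in text:
                hits.append(name)
    return hits
```

Calling find_in_office_file with the path to a source file and a search phrase returns entries such as word/header1.xml, which correspond to the paths shown in the Notepad++ search results. Keep the phrase short: a sentence can be split across several XML elements in the file, in which case a longer phrase will not match.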