How are segments in MS Office files (Word, Excel, PowerPoint) extracted?

Introduction

You might sometimes find that a particular piece of information in a Microsoft Office file (e.g. DOCX, XLSX or PPTX) is not extracted for translation in XTM Workbench. In the vast majority of cases, this is because certain content, such as cells in Excel files or text in Word files, does not have the required format or style.

By default, XTM Cloud extracts only certain types of content for translation. However, at your explicit request, we can configure it so that exactly the content you require is displayed in the editor.

Any change of this kind will affect the total word count for any project in the XTM Cloud UI (see How is wordcount calculated in XTM Cloud? for more information).


What is and can be extracted for a particular MS Office file?

Word – .doc, .docx, .rtf

DoNotTranslate restriction

When processing Word formats, the key point to remember is that content is extracted for translation in XTM Cloud as long as the DoNotTranslate style has not been applied to it. If that style is detected, the text it is applied to is skipped during file analysis in XTM Cloud, so it is not displayed in XTM Workbench.

IMPORTANT!

Note the wording of the style name: XTM Cloud will only recognize the following names:

  • DoNotTranslate;

  • Donttranslate;

  • tw4winExternal.

To apply such a style, proceed as follows:

  1. Right-click on the fragment in question and select Styles, and then Apply Styles.

  2. Enter the name of the style and click New (or Reapply if this style has already been applied elsewhere).

  3. As a result, this particular text will not be displayed in XTM Workbench.
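To check in advance which text in a .docx file carries one of these styles, you can inspect the file's XML directly. The following is a minimal sketch, not XTM Cloud's actual analysis logic. The function name `excluded_text` is hypothetical, the case-insensitive comparison is an assumption, and the sketch assumes that the style ID stored in the XML matches the style name (Word usually derives the ID from the name, but this is not guaranteed).

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
# Style names listed in this article, lower-cased for comparison.
# Case-insensitive matching is an assumption of this sketch.
EXCLUDED = {"donottranslate", "donttranslate", "tw4winexternal"}


def _is_excluded(style_element):
    """True if a w:pStyle or w:rStyle element refers to an excluded style."""
    if style_element is None:
        return False
    return style_element.get(f"{{{W}}}val", "").lower() in EXCLUDED


def excluded_text(docx_file):
    """Return the text of runs whose run or paragraph style is excluded."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    found = []
    for para in root.iter(f"{{{W}}}p"):
        para_hit = _is_excluded(para.find("w:pPr/w:pStyle", NS))
        for run in para.iter(f"{{{W}}}r"):
            run_hit = _is_excluded(run.find("w:rPr/w:rStyle", NS))
            if para_hit or run_hit:
                text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
                if text:
                    found.append(text)
    return found
```

Calling `excluded_text("example.docx")` (the file name is a placeholder) returns a list of text fragments that you would expect to be missing from XTM Workbench. The sketch only reads the main document body, not headers, footers or comments.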

Hidden option

There is also another way to exclude particular text from translation in XTM Cloud: use the Hidden option.

  1. Right-click on the fragment in question, and then select Font.

  2. Under Effects, select Hidden and click OK.

  3. The text will disappear from the document and will not be available in XTM Workbench.
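Hidden text remains in the file even though Word no longer displays it. In a .docx file, directly applied hidden formatting is stored as a `w:vanish` element in the run properties. The sketch below lists such runs. The function name `hidden_text` is hypothetical, and the sketch only detects hidden formatting applied directly to a run, not formatting inherited from a style.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
# Attribute values that switch an on/off property off.
OFF_VALUES = {"false", "0", "off"}


def hidden_text(docx_file):
    """Return the text of runs that carry direct Hidden formatting."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    hidden = []
    for run in root.iter(f"{{{W}}}r"):
        vanish = run.find("w:rPr/w:vanish", NS)
        if vanish is None:
            continue
        # <w:vanish/> without a value means "on".
        if vanish.get(f"{{{W}}}val", "true").lower() in OFF_VALUES:
            continue
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        if text:
            hidden.append(text)
    return hidden
```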

Alternate content

When translating in XTM Workbench, you might wonder why certain text is displayed in the editor twice. This can be caused by so-called alternate content, which XTM Cloud takes for translation by default (unless configured otherwise on your instance by the XTM International Support team). Such content is often a string of text contained in the descriptions of charts or images, or in text boxes.

IMPORTANT!

Ensure that you know the difference between alternate content and alternative text, as those are two separate functionalities.

Alternative text is mostly used in relation to images for which it provides a description that can then be read aloud for the visually impaired. Unlike alternate content, alternative text is not taken for translation in XTM Cloud!

What else is extracted?

Apart from the above, XTM takes these elements for translation:

  • headers & footers,

  • dates,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts,

  • comments,

  • superscripts & subscripts,

  • descriptions of images.

Excel (normal) – .xls, .xlsx, .xlsm

What cell formats are extracted?

By default, XTM Cloud only takes the following cell formats for translation:

  • General (which is mostly a “text” format),

  • Text.

IMPORTANT!

Keep in mind that a cell format is only nominal: it takes effect only if the value entered matches it. If, for example, a cell has the Date format but you enter plain text such as "abcd", Excel treats the cell content as text, and the cell is sent to XTM Workbench for translation.

Always ensure that you have entered your values in a way that matches the format of a particular cell.
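The reason for this behavior is visible in the file itself: an .xlsx file records, for each cell, whether the stored value is a string or a number, independently of the display format applied to the cell. The sketch below lists the cells of one worksheet that are stored as strings. It illustrates how Excel stores values and is not XTM Cloud's extraction rule. The function name `string_cells` is hypothetical, and the default part name `xl/worksheets/sheet1.xml` is an assumption that holds for many, but not all, workbooks.

```python
import zipfile
import xml.etree.ElementTree as ET

S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def string_cells(xlsx_file, sheet_part="xl/worksheets/sheet1.xml"):
    """Map cell references to text for cells that are stored as strings."""
    with zipfile.ZipFile(xlsx_file) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            sst = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in sst.iter(f"{{{S}}}si"):
                shared.append("".join(t.text or "" for t in item.iter(f"{{{S}}}t")))
        sheet = ET.fromstring(archive.read(sheet_part))
    cells = {}
    for cell in sheet.iter(f"{{{S}}}c"):
        kind = cell.get("t", "n")  # "n" (number) is the default cell type
        if kind == "s":
            # The value is an index into the shared strings table.
            index = int(cell.find(f"{{{S}}}v").text)
            cells[cell.get("r")] = shared[index]
        elif kind == "inlineStr":
            cells[cell.get("r")] = "".join(t.text or "" for t in cell.iter(f"{{{S}}}t"))
    return cells
```

A cell that has the Date format but contains "abcd" appears in the result, whereas a genuine date in a cell with the same format does not, because Excel stores dates as numbers.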

Other formats

As stated above, only the General and Text cell formats are extracted. What about the others?

The remaining formats are:

  • usually non-translatable (number, fraction, percentage, formula, accounting),

  • automatically localized by MS Office according to the system language of the user's device (currency, date, time).

XTM Cloud therefore does not extract them; otherwise, the total amount of content displayed in the editor (especially for Excel files) would be enormous.

What else is extracted?

Apart from the above, XTM Cloud takes these elements for translation:

  • sheet names,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts.

PowerPoint – .pptx

What is extracted?

By default, XTM Cloud takes these elements for translation:

  • slides (layout, master),

  • headers & footers,

  • dates,

  • content in embedded Office formats (Excel, Word): charts, graphs,

  • editable PDFs,

  • hyperlink display texts,

  • comments,

  • notes,

  • hidden slides and all the content they contain,

  • alternative text.

Custom configuration

It is possible to include or exclude specific data by creating and setting appropriate rules.

In XTM Cloud, you can create your own filter templates, which are used to identify translatable text in a document of a specified format. We recommend that you read this article, which contains step-by-step instructions for setting up a configuration of this kind in the UI: How to create a filter template in the XTM Cloud UI and apply it to a project.

If you need more complex filters, do not hesitate to contact our XTM International Support team and provide details about them. When requesting our help, ensure that you follow the guidelines in this article: How to request a new configuration or change to an existing one.

Additionally, learn about possible configuration levels from this article: How can a source file be processed and what are configuration levels.

How to browse through the source file structure to find a particular element

You might sometimes find that one or more elements have been sent for translation in XTM Workbench but are not visible anywhere in the source file itself. This is usually not caused by incorrect parsing of the file in XTM Cloud but by the element being "hard-wired" somewhere in the source file structure.

To track down this element, proceed as follows:

  1. Right-click on a source file, select a file archiving program and then Extract to (NAME_OF_THE_FILE).

  2. A new folder will be created with the file’s name. The folder will contain the structural elements of the source file.

  3. Open Notepad++.

  4. Use the keyboard shortcut Ctrl + F to display the search dialog.

  5. Select Find in Files. In the dialog, enter your search phrase in the Find what field. Then, select the three dots box next to the Directory field and find the relevant folder. Click Find All.

  6. In the Search results section at the bottom of the window, you will find the exact path to the location of the element that you are interested in.
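The same search can be scripted, because DOCX, XLSX and PPTX files are ZIP archives. The following is a minimal sketch of the steps above. The function name `find_in_package` is hypothetical, and it does not apply to the legacy .doc and .xls formats, which are not ZIP archives.

```python
import zipfile


def find_in_package(office_file, phrase):
    """Return the names of the parts inside the file that contain phrase."""
    hits = []
    with zipfile.ZipFile(office_file) as archive:
        for name in archive.namelist():
            # Binary parts (images, fonts) are decoded leniently and
            # simply will not match a normal text phrase.
            content = archive.read(name).decode("utf-8", errors="ignore")
            if phrase in content:
                hits.append(name)
    return hits
```

For example, `find_in_package("example.docx", "Acme Ltd")` (both values are placeholders) returns paths such as `docProps/app.xml` if the phrase is stored there. Keep in mind that Office can split visible text across several XML elements and escapes characters such as `&`, so searching for a short, distinctive word is more reliable than searching for a whole sentence.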