Terminology extraction mechanism

Introduction

It has been possible to extract terms during project creation since XTM Cloud version 8.5. To enable this module, first turn it on globally in a particular XTM Cloud instance. To do so, select Configuration → Settings → Translation → Terminology → Terminology options → Run Terminology extraction.

For a step-by-step guide to terminology extraction, see How to extract terminology from projects.

When enabled, XTM Cloud will extract a list of candidate terms as an additional step in the file analysis. The candidate terms are exported to an Excel spreadsheet which can be downloaded.

IMPORTANT!

Extracted terms cannot be downloaded for archived projects.


How does it work?

To find and record relevant term entries, XTM Cloud uses detailed Part of Speech (POS) analysis of the source file to find noun phrases: (ADJ*+NOUN*). The search is performed on the morphologically reduced version of each segment, so all grammatical forms of a particular noun phrase are captured together. XTM Cloud then sorts the noun phrases by frequency, with a cut-off of 3.
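The steps described above can be sketched roughly as follows. This is an illustrative sketch only, not XTM Cloud's implementation. It assumes the segments have already been POS-tagged and morphologically reduced by some other tool, reads the pattern as "zero or more adjectives followed by one or more nouns", and treats the cut-off as "keep phrases that occur at least 3 times". The function name extract_candidates and the token layout are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical pre-tagged token: (surface form, reduced form, POS tag).
Token = tuple[str, str, str]


def extract_candidates(segments: list[list[Token]], cutoff: int = 3):
    """Collect ADJ* NOUN+ runs and group them by their reduced (root) form.

    Assumption: a phrase is kept when its combined frequency is >= cutoff.
    """
    surface_forms: dict[str, Counter] = defaultdict(Counter)

    for segment in segments:
        i = 0
        while i < len(segment):
            start = i
            # Consume any leading adjectives.
            while i < len(segment) and segment[i][2] == "ADJ":
                i += 1
            noun_start = i
            # Consume the nouns that follow.
            while i < len(segment) and segment[i][2] == "NOUN":
                i += 1
            if i > noun_start:
                # At least one noun was found: record the phrase.
                phrase = segment[start:i]
                root = " ".join(reduced for _, reduced, _ in phrase)
                surface = " ".join(form for form, _, _ in phrase)
                surface_forms[root][surface] += 1
            elif i == start:
                # Neither adjective nor noun: skip this token.
                i += 1

    candidates = []
    for root, forms in surface_forms.items():
        total = sum(forms.values())
        if total >= cutoff:
            candidates.append((root, total, dict(forms)))

    # Most frequent first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda c: (-c[1], c[0]))
    return candidates
```

Each returned tuple holds the root noun phrase, the combined frequency, and a per-form count, which loosely mirrors the first three spreadsheet columns described below.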

The result is a generated Excel spreadsheet containing the terms in question and contextual information:

  1. en-GB (Column “A”) → the root noun phrase (the basic form of a term);

  2. Surface Forms (Column “B”) → all the grammatical forms of the noun phrase that occur in the file; when you expand the cell, these forms are displayed along with the number of times each one occurs in the file;

  3. Frequency (Column “C”) → the combined number of occurrences of all the forms in the file;

  4. Ignore (Column “D”) → indication of whether or not the phrase is ignored;

  5. Found in file (Column “E”) → name of the source file from which a particular phrase originates;

  6. Context (Column “F”) → complete example sentences from the source file in which those grammatical forms appear; a maximum of 5 sentences is displayed for any one form, and a maximum of 6 sentences is displayed overall;

  7. Remarks (Column “G”) → the TF–IDF (Term Frequency–Inverse Document Frequency) value, a numerical statistic that indicates how important a phrase is in a document; the TF–IDF value increases in proportion to the number of times the phrase occurs in the document and, in the standard definition, is offset by the number of documents that contain it;

  8. Target language (Column “H”) → columns headed with the appropriate language code, in which you can add translations for the extracted terms; rows with unnecessary entries can be deleted.
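To make the Remarks column (item 7) more concrete, here is a minimal sketch of the textbook TF–IDF calculation. The exact variant that XTM Cloud uses is not stated in this article, so the formula below (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is an assumption, and the function name tf_idf is hypothetical. Each document is represented simply as a list of phrases.

```python
import math


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """Textbook TF-IDF; an assumed variant, not necessarily the one XTM Cloud uses."""
    if not document:
        return 0.0
    # Term frequency: share of the document's phrases that are this term.
    tf = document.count(term) / len(document)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    # Inverse document frequency: rarer across the corpus means a higher weight.
    idf = math.log(len(corpus) / df)
    return tf * idf
```

With this formulation, a phrase that appears in every document scores 0 however often it occurs, while a phrase that is frequent in one document and absent from the others scores highest.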


Good to know!

Terms that already exist in a particular customer's terminology are omitted by the mechanism and are not exported to the Excel sheet.