Accessing machine translation (MT) performance data/calculating costs based on EDC

Introduction

XTM Cloud provides tools and options that help you track the changes a post-editor makes to a machine-translated target segment.

Before reading the article below, we recommend that you first familiarize yourself with the step-by-step guideline to using machine translation: How to enable and set up machine translation (MT).


Generating a report

MT performance information is available in the Preview Extended table report, which you can generate in XTM Cloud (see Preview Extended table – description for more information about the report).

  1. Once the work in a project is finished, in XTM Cloud, select Project Editor → Files → Preview → (click on the “cog” icon) → Extended table…

  2. When generating a report, ensure that you include the columns below:

  • Pre-translated text,

  • Post-edited text,

  • Final text,

  • Edit distance score.

  3. Once the report has been generated, it will contain the columns described below.


Columns – description

Pre-translated text

The column shows the raw text that comes directly from a machine translation engine, before any linguist has edited it. It does not necessarily have to be an MT match: it can be any other existing TM match that has been pre-populated in the target segment, if this is permitted by the project settings. The bold label at the top of the text indicates which match was actually applied.

Post-edited text

The column shows the linguist’s very first manual revision of a text that has been machine-translated or pre-populated from a TM.

  1. Imagine a workflow that consists of three steps: Translate → Correct1 → Correct2 and the segment to be translated from English to Polish: Alice has a cat.

  2. When the project is created, a pre-populated match is inserted in the segment. This segment contains the text: Ala ma chomika.

  3. In the Translate step, the first linguist translates the segment to: Ala ma psa.

  4. Then, in the Correct1 step, the second linguist corrects the previous linguist's translation, by entering another version of it: Ala ma rybki.

  5. Finally, in the Correct2 step, another linguist changes the translation to: Ala ma kota.

As a result, the "Post-edited text" column will only display the translation that was inserted in the Translate step, which is the first manual revision of the pre-populated match. In our example, this is:

  • Ala ma psa.

Final text

The column shows the final translation of a particular segment in XTM Workbench, which is the version that will appear in the target document. Note that the workflow step name in bold indicates the step in which this final version was inserted or in which the segment containing it was confirmed. This does not necessarily have to be the last step in the workflow: a linguist might have made the final change in one of the earlier steps.
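The relationship between the three text columns can be sketched in a few lines of Python. This is only an illustration of the column definitions above, applied to the Alice example: the Revision class and the report_columns function are hypothetical names, not part of XTM Cloud, and the sketch assumes that a segment's history is an ordered list that starts with the pre-populated match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    step: Optional[str]  # None means the text was pre-populated (MT or TM match)
    text: str


def report_columns(history: list[Revision]) -> dict[str, str]:
    """Derive the three text columns from an ordered revision history.

    Assumption: history[0] is the pre-populated match and later entries
    are manual revisions in workflow order.
    """
    pre_translated = history[0]
    manual = [r for r in history if r.step is not None]
    # The first manual revision, or the untouched match if nobody edited it.
    post_edited = manual[0] if manual else pre_translated
    final = history[-1]
    return {
        "Pre-translated text": pre_translated.text,
        "Post-edited text": post_edited.text,
        "Final text": final.text,
        "Final text step": final.step or "",
    }


history = [
    Revision(None, "Ala ma chomika."),
    Revision("Translate", "Ala ma psa."),
    Revision("Correct1", "Ala ma rybki."),
    Revision("Correct2", "Ala ma kota."),
]
columns = report_columns(history)
for name, value in columns.items():
    print(f"{name}: {value}")
```

In this sketch the final text is "Ala ma kota." and is labelled with the Correct2 step, because that is where the last change was made.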

Edit distance score

This column contains a score which you can use to track the number of changes made by a post-editor in a machine-translated target segment. For more information, refer to the Edit distance calculation (EDC) section.


Edit distance calculation (EDC)

Definition

Edit distance calculation (EDC) is based on a metric that corresponds to the number of characters edited in a target machine-translated segment, divided by the total number of characters in that segment. The EDC feature compares all characters, white spaces and special characters in the MT output and final translation output.

Once the segment has been translated, the string from the MT match and the string from the actual translation are prepared for the CharacTER algorithm. All words (or characters, in the case of Asian languages) are then put in alphabetical order and compared against one another using the algorithm. This calculation results in an EDC score, which ranges from 0 to 1. A score of 0 corresponds to an MT match that was approved without any edits, and a score of 1 corresponds to a segment that required complete editing of the machine translation output. A segment with a score of 1 is therefore paid at the maximum possible amount.

EXAMPLE:

For Chinese, if the entire segment consists of only two characters returned by an MT match, and you replace them with two different characters, the final text is completely different from the MT output and the EDC score is equal to 1.
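To make the idea of a character-based score between 0 and 1 more concrete, here is a minimal Python sketch. It is not the CharacTER algorithm that XTM Cloud uses: it swaps in a plain Levenshtein distance and, as an assumption, divides by the length of the longer string so that the result always stays between 0 and 1. The function names are hypothetical, and the scores it produces will not necessarily match the ones in the report.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn a into b (spaces and special characters included)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_edit_distance(mt_output: str, final_text: str) -> float:
    """Edited characters divided by segment length, in the range 0..1.

    Assumption: the divisor is the length of the longer string.
    """
    longest = max(len(mt_output), len(final_text))
    if longest == 0:
        return 0.0
    return levenshtein(mt_output, final_text) / longest


# Two MT characters replaced by two different characters: fully rewritten.
print(normalized_edit_distance("你好", "再见"))  # 1.0
# MT match approved without changes.
print(normalized_edit_distance("Ala ma kota.", "Ala ma kota."))  # 0.0
```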

IMPORTANT!

To obtain a correct value in the EDC score, you must ensure that the following criteria are satisfied:

  1. The appropriate setting needs to be enabled in global workflow settings, as well as in the workflow settings of the project.

  2. The EDC score is only calculated when the linguist edits and/or approves a segment that contains an MT match.

  3. The segment must be approved.

  4. If the MT match is approved and not edited, the EDC score will be 0 in the report.

  5. The EDC feature does not work retroactively, meaning that the score will not be calculated for segments translated earlier (when EDC was disabled), if you decide to enable the feature at a later time.

  6. If a segment is set as “Done” automatically, an edit to the target must be made to calculate an EDC score.

Examples

  1. A segment has just an MT match that was automatically set as “Done”.

  • The linguist can change the status to “Not approved” and then approve it again – EDC will not be calculated.

  2. A segment has an ICE match that was automatically set as “Done”, and an MT match.

  • The linguist can change the status to “Not approved” and then approve it again – EDC will not be calculated.

  3. A segment has an unapproved ICE match that was populated but not set as “Done” automatically, and an MT match.

  • The linguist approves the segment – EDC will be calculated.
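The criteria and examples above can be summarized as a small decision function. The following Python sketch is only one reading of those rules, not XTM Cloud's actual logic: the SegmentState class, its field names, and the edc_is_calculated function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentState:
    edc_enabled: bool           # enabled globally and in the project workflow
    has_mt_match: bool          # the segment contains an MT match
    auto_done: bool             # status was set to "Done" automatically
    target_edited: bool         # the linguist changed the target text
    approved_by_linguist: bool  # the linguist approved the segment


def edc_is_calculated(s: SegmentState) -> bool:
    """Return True when, under this reading of the rules, a score is produced."""
    if not (s.edc_enabled and s.has_mt_match and s.approved_by_linguist):
        return False
    # A segment set to "Done" automatically needs an actual edit to the target.
    return s.target_edited or not s.auto_done


# Examples 1 and 2: auto-"Done" segment, unapproved and approved again, no edit.
example_1 = SegmentState(True, True, auto_done=True, target_edited=False,
                         approved_by_linguist=True)
# Example 3: match populated but not auto-"Done", then approved by the linguist.
example_3 = SegmentState(True, True, auto_done=False, target_edited=False,
                         approved_by_linguist=True)

print(edc_is_calculated(example_1))  # False
print(edc_is_calculated(example_3))  # True
```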

How to enable it?

As a user with the Administrator role, go to Configuration → Settings → Workflow → Workflow options and activate Calculate edit distance. Do not forget to Save the changes.

In the Workflow editor, for a specific project, activate the Calculate edit distance option in any CAT tool workflow step (one which can be opened in XTM Workbench) in which you want to perform post-editing.

Obtaining EDC information via the API

In addition to the standard Excel report, you can use the API to obtain the most important information about a specific project, such as the:

  • Project name,

  • Customer name,

  • Language pair,

  • Segment ID,

  • EDC percentage.

To do so, use the findProjectStatistics method in the REST API (see findProjectStatistics). A sample response:

[ { "targetLanguage": "ja_JP", "usersStatistics": [ { "userDisplayName": "Ling1 Ling1", "stepsStatistics": [ { "workflowStepName": "Translation1", "jobsStatistics": [ { "jobId": 9465222, "sourceStatistics": { "totalSegments": 9, "totalWords": 36, "totalCharacters": 191, "totalWhitespaces": 27, "nonTranslatableSegments": 0, "nonTranslatableWords": 0, "nonTranslatableCharacters": 0, "nonTranslatableWhitespaces": 0, "nonTranslatableTrackedTime": 0, "iceMatchSegments": 0, "iceMatchWords": 0, "iceMatchCharacters": 0, "iceMatchWhitespaces": 0, "iceMatchTrackedTime": 0, "highFuzzyMatchSegments": 0, "highFuzzyMatchWords": 0, "highFuzzyMatchCharacters": 0, "highFuzzyMatchWhitespaces": 0, "highFuzzyMatchTrackedTime": 0, "mediumFuzzyMatchSegments": 0, "mediumFuzzyMatchWords": 0, "mediumFuzzyMatchCharacters": 0, "mediumFuzzyMatchWhitespaces": 0, "mediumFuzzyMatchTrackedTime": 0, "lowFuzzyMatchSegments": 0, "lowFuzzyMatchWords": 0, "lowFuzzyMatchCharacters": 0, "lowFuzzyMatchWhitespaces": 0, "lowFuzzyMatchTrackedTime": 0, "leveragedSegments": 0, "leveragedWords": 0, "leveragedCharacters": 0, "leveragedWhitespaces": 0, "leveragedTrackedTime": 0, "otherNonTranslatableWords": 0, "otherNonTranslatableCharacters": 0, "otherNonTranslatableWhitespaces": 0, "otherNonTranslatableTrackedTime": 0, "repeatsSegments": 0, "repeatsWords": 0, "repeatsCharacters": 0, "repeatsWhitespaces": 0, "repeatsTrackedTime": 0, "highFuzzyRepeatsSegments": 0, "highFuzzyRepeatsWords": 0, "highFuzzyRepeatsCharacters": 0, "highFuzzyRepeatsWhitespaces": 0, "highFuzzyRepeatsTrackedTime": 0, "mediumFuzzyRepeatsSegments": 0, "mediumFuzzyRepeatsWords": 0, "mediumFuzzyRepeatsCharacters": 0, "mediumFuzzyRepeatsWhitespaces": 0, "mediumFuzzyRepeatsTrackedTime": 0, "lowFuzzyRepeatsSegments": 0, "lowFuzzyRepeatsWords": 0, "lowFuzzyRepeatsCharacters": 0, "lowFuzzyRepeatsWhitespaces": 0, "lowFuzzyRepeatsTrackedTime": 0, "machineTranslationCharacters": 130, "machineTranslationSegments": 8, "machineTranslationWords": 25, 
"machineTranslationWhitespaces": 18, "machineTranslationTrackedTime": 0, "noMatchingTrackedTime": 0, "noMatchingSegments": 1, "noMatchingWords": 11, "noMatchingCharacters": 61, "noMatchingWhitespaces": 9, "totalTime": 0 }, "targetStatistics": { "totalSegments": 9, "totalWords": 35, "totalCharacters": 124, "totalWhitespaces": 12, "nonTranslatableSegments": 0, "nonTranslatableWords": 0, "nonTranslatableCharacters": 0, "nonTranslatableWhitespaces": 0, "nonTranslatableTrackedTime": 0, "iceMatchSegments": 0, "iceMatchWords": 0, "iceMatchCharacters": 0, "iceMatchWhitespaces": 0, "iceMatchTrackedTime": 0, "highFuzzyMatchSegments": 0, "highFuzzyMatchWords": 0, "highFuzzyMatchCharacters": 0, "highFuzzyMatchWhitespaces": 0, "highFuzzyMatchTrackedTime": 0, "mediumFuzzyMatchSegments": 0, "mediumFuzzyMatchWords": 0, "mediumFuzzyMatchCharacters": 0, "mediumFuzzyMatchWhitespaces": 0, "mediumFuzzyMatchTrackedTime": 0, "lowFuzzyMatchSegments": 0, "lowFuzzyMatchWords": 0, "lowFuzzyMatchCharacters": 0, "lowFuzzyMatchWhitespaces": 0, "lowFuzzyMatchTrackedTime": 0, "leveragedSegments": 0, "leveragedWords": 0, "leveragedCharacters": 0, "leveragedWhitespaces": 0, "leveragedTrackedTime": 0, "otherNonTranslatableWords": 0, "otherNonTranslatableCharacters": 0, "otherNonTranslatableWhitespaces": 0, "otherNonTranslatableTrackedTime": 0, "repeatsSegments": 0, "repeatsWords": 0, "repeatsCharacters": 0, "repeatsWhitespaces": 0, "repeatsTrackedTime": 0, "highFuzzyRepeatsSegments": 0, "highFuzzyRepeatsWords": 0, "highFuzzyRepeatsCharacters": 0, "highFuzzyRepeatsWhitespaces": 0, "highFuzzyRepeatsTrackedTime": 0, "mediumFuzzyRepeatsSegments": 0, "mediumFuzzyRepeatsWords": 0, "mediumFuzzyRepeatsCharacters": 0, "mediumFuzzyRepeatsWhitespaces": 0, "mediumFuzzyRepeatsTrackedTime": 0, "lowFuzzyRepeatsSegments": 0, "lowFuzzyRepeatsWords": 0, "lowFuzzyRepeatsCharacters": 0, "lowFuzzyRepeatsWhitespaces": 0, "lowFuzzyRepeatsTrackedTime": 0, "machineTranslationCharacters": 65, "machineTranslationSegments": 8, 
"machineTranslationWords": 24, "machineTranslationWhitespaces": 3, "machineTranslationTrackedTime": 0, "noMatchingTrackedTime": 0, "noMatchingSegments": 1, "noMatchingWords": 11, "noMatchingCharacters": 59, "noMatchingWhitespaces": 9, "totalTime": 0 }, "creationDate": 1667986032175, "machineTranslationEDCWords": { "category1": { "discount": 95, "wordCount": 11, "range": { "from": 0.0, "to": 0.0 } }, "category3": { "discount": 80, "wordCount": 11, "range": { "from": 0.05, "to": 0.099 } }, "category9": { "discount": 1, "wordCount": 1, "range": { "from": 0.6, "to": 0.699 } }, "category10": { "discount": 0, "wordCount": 2, "range": { "from": 0.7, "to": 1.0 } } } } ] } ] } ] } ]
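The EDC data in a response like the one above sits several levels deep, under machineTranslationEDCWords in each job. The Python sketch below shows one way to flatten it into rows. It works on a trimmed copy of the sample response rather than calling the API, so no endpoint, authentication, or client code is shown. The edc_rows function is a hypothetical helper, and the field names are taken from the sample only.

```python
import json

# Trimmed copy of the sample response: only the fields used below are kept.
sample = """
[{"targetLanguage": "ja_JP", "usersStatistics": [{"userDisplayName": "Ling1 Ling1",
  "stepsStatistics": [{"workflowStepName": "Translation1",
    "jobsStatistics": [{"jobId": 9465222,
      "machineTranslationEDCWords": {
        "category1": {"discount": 95, "wordCount": 11, "range": {"from": 0.0, "to": 0.0}},
        "category3": {"discount": 80, "wordCount": 11, "range": {"from": 0.05, "to": 0.099}},
        "category9": {"discount": 1, "wordCount": 1, "range": {"from": 0.6, "to": 0.699}},
        "category10": {"discount": 0, "wordCount": 2, "range": {"from": 0.7, "to": 1.0}}
      }}]}]}]}]
"""


def edc_rows(statistics: list) -> list:
    """Flatten the nested statistics into one row per EDC category."""
    rows = []
    for language in statistics:
        for user in language.get("usersStatistics", []):
            for step in user.get("stepsStatistics", []):
                for job in step.get("jobsStatistics", []):
                    categories = job.get("machineTranslationEDCWords", {})
                    for name, category in categories.items():
                        rows.append({
                            "targetLanguage": language["targetLanguage"],
                            "user": user["userDisplayName"],
                            "step": step["workflowStepName"],
                            "jobId": job["jobId"],
                            "category": name,
                            "from": category["range"]["from"],
                            "to": category["range"]["to"],
                            "discount": category["discount"],
                            "wordCount": category["wordCount"],
                        })
    return rows


rows = edc_rows(json.loads(sample))
for row in rows:
    print(row["category"], row["from"], row["to"], row["discount"], row["wordCount"])
```

In the sample, the word counts of the four categories add up to 25, which matches the machineTranslationWords value reported in the source statistics.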

Calculating costs based on EDC

Use the EDC feature to calculate costs for post-edited machine-translated segments. To enable cost calculation based on EDC, go to Configuration → Data → Estimates → Cost settings → Cost settings and activate Calculate cost based on edit distance.

Matrix

When EDC and costs based on EDC are enabled, project costs will be reduced in accordance with the Edit Distance Score. The reduction is based on a matrix located in the back-end, to which only the XTM International Support team has access.

See the default matrix for Edit Distance Score calculation immediately below:

Aggregated Edit Distance Score    Costs Reduction in %

0.0 - 0.024                       60
0.025 - 0.049                     56
0.05 - 0.074                      53
0.075 - 0.099                     49
0.1 - 0.124                       45
0.125 - 0.149                     42
0.15 - 0.174                      38
0.175 - 0.199                     35
0.2 - 0.224                       31
0.225 - 0.249                     27
0.25 - 0.274                      24
0.275 - 0.299                     20
0.3 - 0.324                       19
0.325 - 0.349                     18
0.35 - 0.374                      16
0.375 - 0.399                     15
0.4 - 0.424                       14
0.425 - 0.449                     13
0.45 - 0.474                      12
0.475 - 0.499                     11
0.5 - 0.524                       9
0.525 - 0.549                     8
0.55 - 0.574                      7
0.575 - 0.599                     6
0.6 - 0.624                       5
0.625 - 0.649                     4
0.65 - 0.674                      2
0.675 - 0.699                     1
0.7 - 1                           0

If you would like to modify the matrix, please submit a ticket to the XTM International Support team and include the details of your request.
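For checking report values by hand, the default matrix can be expressed as a simple lookup. The Python sketch below is an illustration only and not XTM Cloud's implementation. It assumes that the bands are contiguous (each band ends just before the next one starts, so the fourth band runs up to 0.099) and that scores are rounded to three decimal places. The cost_reduction_percent function is a hypothetical name.

```python
from bisect import bisect_right

# Lower bound of each band in thousandths: 0, 25, 50, ..., 675, then 700.
LOWER_BOUNDS = list(range(0, 700, 25)) + [700]

# Costs reduction in % for each band, in the same order as the default matrix.
REDUCTIONS = [60, 56, 53, 49, 45, 42, 38, 35, 31, 27, 24, 20, 19, 18, 16,
              15, 14, 13, 12, 11, 9, 8, 7, 6, 5, 4, 2, 1, 0]


def cost_reduction_percent(score: float) -> int:
    """Return the default cost reduction for an aggregated edit distance score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Integer thousandths avoid floating-point surprises at band edges.
    thousandths = round(score * 1000)
    band = bisect_right(LOWER_BOUNDS, thousandths) - 1
    return REDUCTIONS[band]


print(cost_reduction_percent(0.0))    # 60
print(cost_reduction_percent(0.3))    # 19
print(cost_reduction_percent(0.85))   # 0
```

If the Support team has customized the matrix for your account, the two lists would need to be replaced with your own bands and percentages.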

IMPORTANT!

The EDC feature works only for costs based on Statistics (either Statistics source or Statistics target).

Furthermore, costs which are based on Statistics for internal matches, i.e. ICE, Leverage, Fuzzy and Repetitions, will not be reduced by EDC.

EXAMPLE:

  1. With EDC fully enabled and an MT engine set up to match against the segments, create a simple project with a file containing some text.

  2. Open the project in XTM Workbench.

  3. Accept all MT matches without making any changes to the segments.

  4. Go to Project editor → Estimates → Costs, and generate statistics-based costs.

  5. You should see that, because only matches from the MT engine were used and no changes were made to them, the costs were reduced proportionately (by 60%, based on the default matrix), so the Price now contains a lower value.
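The arithmetic behind the last step can be sketched as follows. The rate and the word count are made-up values for illustration, reduced_cost is a hypothetical helper, and the way XTM Cloud aggregates scores and rounds prices internally may differ.

```python
from decimal import Decimal


def reduced_cost(base_cost: Decimal, reduction_percent: int) -> Decimal:
    """Apply a percentage reduction to a statistics-based cost."""
    factor = Decimal(100 - reduction_percent) / Decimal(100)
    return (base_cost * factor).quantize(Decimal("0.01"))


# Hypothetical project: 1000 MT words at a rate of 0.10 per word.
base = Decimal("0.10") * 1000
# All MT matches accepted unchanged: score 0, so a 60% reduction by default.
print(reduced_cost(base, 60))  # 40.00
```

With these assumed figures, a cost of 100.00 drops to 40.00, which is the "lower value" you would expect to see in the Price field.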