How to use machine translation responsibly in government

The Department of Justice recently issued guidance encouraging federal agencies to “communicate with individuals with limited proficiency in English using artificial intelligence and machine translation.” The memo specifically calls for “responsible use” of these technologies and for cost-effective ways to bridge language barriers.

However, the memo does not spell out what “responsible use” means. Nor do Memo M-25-21 on accelerating federal use of AI or the revoked Biden AI executive orders offer much guidance on the responsible use of translation models. That leaves agencies to work out the implementation details themselves.

So, what does responsible AI translation look like?

Responsible use of machine translation means running use-case-specific evaluations, both before and after deployment, and letting the results inform how the tool is used. Here is what an agency should do:

Test on content from your specific use case. Do not rely on general model performance benchmarks. Build a representative sample of documents for each use case. A tool that works well for everyday correspondence can fail on technical documentation.
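As a rough sketch, here is one way to draw a reproducible sample per use case. It assumes a hypothetical layout in which each use case has its own folder of plain-text documents; the folder names and sample size are illustrative, not prescribed.

```python
import random
from pathlib import Path

def sample_documents(corpus_dir: str, per_use_case: int = 50, seed: int = 0) -> dict[str, list[Path]]:
    """Draw a reproducible random sample of documents for each use case.

    Assumes a hypothetical layout where each subfolder of `corpus_dir`
    holds documents for one use case (e.g. benefits letters, technical manuals).
    """
    rng = random.Random(seed)
    samples = {}
    for use_case_dir in sorted(Path(corpus_dir).iterdir()):
        if not use_case_dir.is_dir():
            continue
        docs = sorted(use_case_dir.glob("*.txt"))
        k = min(per_use_case, len(docs))
        samples[use_case_dir.name] = rng.sample(docs, k)
    return samples
```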

Compare all of your options for each use case. Test multiple machine translation systems against human translators, and have raters review the resulting translations. Different use cases may require not only different samples but also different evaluation criteria or cutoff scores. Evaluations show not just how often a model fails, but in what situations it fails, and that information can shape a deployment plan.
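A minimal sketch of what the aggregation step might look like: the system names, 1–5 rating scale and cutoff values below are assumptions for illustration, not a prescribed rubric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rater scores: (use_case, system, score on a 1-5 adequacy scale).
ratings = [
    ("benefits_letters", "vendor_a", 5), ("benefits_letters", "vendor_a", 4),
    ("benefits_letters", "vendor_b", 3), ("benefits_letters", "human", 5),
    ("technical_manuals", "vendor_a", 2), ("technical_manuals", "human", 5),
]

# Assumed per-use-case cutoffs; in practice these come from the requirements of each use case.
CUTOFFS = {"benefits_letters": 4.0, "technical_manuals": 4.5}

def summarize(ratings):
    """Average rater scores per (use case, system) and flag which pass the cutoff."""
    by_key = defaultdict(list)
    for use_case, system, score in ratings:
        by_key[(use_case, system)].append(score)
    for (use_case, system), scores in sorted(by_key.items()):
        avg = mean(scores)
        verdict = "pass" if avg >= CUTOFFS[use_case] else "needs human review/translation"
        print(f"{use_case:18s} {system:10s} mean={avg:.2f} ({verdict})")

summarize(ratings)
```

Tracking results by use case, rather than producing one overall score, is what makes it possible to deploy a tool for some document types while keeping humans in the loop for others.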

Act on the results. Perhaps machine translation works well for some languages but not for others, or perhaps using uncertainty scores to flag unreliable translations for human review works better than an all-or-nothing approach. Confidence scores are trickier to obtain for translation than for some other kinds of models, but they are something you can pursue.
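As an illustrative sketch of that kind of routing: the translate_with_confidence callable below is hypothetical, standing in for whatever system produces a translation plus a document-level confidence score, and the threshold is an assumption that would itself come out of the evaluation.

```python
from dataclasses import dataclass

@dataclass
class Routed:
    text: str
    translation: str
    confidence: float
    needs_human_review: bool

def route(documents, translate_with_confidence, threshold: float = 0.85):
    """Translate each document and flag low-confidence output for human review.

    `translate_with_confidence` is a hypothetical callable returning
    (translation, confidence in [0, 1]); the 0.85 threshold is assumed
    for illustration and should be set from the use-case evaluation.
    """
    routed = []
    for text in documents:
        translation, confidence = translate_with_confidence(text)
        routed.append(Routed(text, translation, confidence, confidence < threshold))
    return routed
```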

Keep evaluating after deployment. Your documents and needs will evolve. Regular re-evaluation on a subset of new documents confirms that the model still meets your requirements.
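A minimal sketch of that ongoing spot-check, with recency approximated by file modification time purely for illustration; the sampled files would then go through the same rater review used before deployment.

```python
import random
from datetime import datetime, timedelta
from pathlib import Path

def spot_check_batch(corpus_dir: str, days: int = 30, k: int = 25, seed: int = 0) -> list[Path]:
    """Pick a random subset of recently added documents for re-evaluation.

    Recency is approximated here by file modification time (an assumption
    for illustration); `days` and `k` control how far back to look and
    how many documents to pull into the review batch.
    """
    cutoff = datetime.now() - timedelta(days=days)
    recent = [p for p in Path(corpus_dir).rglob("*.txt")
              if datetime.fromtimestamp(p.stat().st_mtime) >= cutoff]
    rng = random.Random(seed)
    return rng.sample(recent, min(k, len(recent)))
```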

The real problem: deployment without evaluation

My main concern is that agencies will deploy machine translation tools with minimal testing, relying on vendor claims or general benchmarks that may not apply to their particular use case. The government did not have enough in-house technical expertise before President Donald Trump took office, and the situation has only gotten worse since. And even when contractors do the work, federal civil servants are ultimately responsible for vendor selection, contract language and the deliverables they request.

As an AI/ML engineer with the Department of Homeland Security's AI Corps who drafted a guide to testing and evaluating AI/ML models, I saw significant variation across the government. I expect implementation of the DOJ guidance to be similarly inconsistent: some agencies will conduct thorough evaluations, while others will deploy tools with minimal testing.

More rules are not the answer either

But there is a message for machine translation skeptics, too. Even today, agencies are not always choosing between machine translation and good human translation. In some cases, the alternative is no translation at all: documents may go untranslated, or may not be translated in a timely manner.

For example, at a federal agency I worked with, employees were running somewhat sensitive materials through Google Translate. A secure machine translation tool with basic evaluations addressed the security problem and also provided more information about which kinds of documents were suitable for machine translation and which needed human review.

Furthermore, evaluation criteria should be driven by the requirements of the use case, not by the technology used. The focus on evaluating AI is important, but the same attention to performance standards should apply whether the work is done by humans, machines, or a hybrid approach.

Finally, the lack of specific implementation guidance from both this White House and the previous one is appropriate. More prescriptive requirements would not resolve the underlying capacity issues, but they would be interpreted by some agencies in the most restrictive way possible. For example, a White House memo requiring evaluation of machine translation models could easily be interpreted at a lower level as requiring engineers to complete that evaluation before even installing an open-source language model to try it out. Meanwhile, documents would still go untranslated, or be sent to Google Translate.

So even as an evaluation advocate, I do not see the vagueness of the guidance as the problem. When an agency stumbles in deploying machine translation, it will be because of a lack of institutional capacity, and that is not fixed by being told to run an evaluation.

Written by Abigail Haddad

Abigail Haddad is a former AI/machine learning engineer with the Department of Homeland Security's AI Corps.


