Researchers explore machine learning to ethically automate early modern text transcription

Over the past 20 years, mass digitization has dramatically changed the landscape of academic research. The ability to search digital transcriptions of sources for specific keywords saves valuable time, and scholars who want to look through texts are no longer confined to archives and libraries.

However, the spread of digital transcription raises new concerns about the labor required to make such accessibility possible. In a new article for The Sixteenth Century Journal, researchers propose a way to avoid unethical labor practices while obtaining transcriptions of digitized early modern sources.

In “Unlocking the digitized archives of early modern printing: Automatic transcription of early modern printed books,” authors Serena Strecker and Kimberly Lifton begin with a brief history of two types of software used to create transcriptions. Optical character recognition (OCR) software has proven suitable for transcribing works from the late 19th and 20th centuries, but the irregularities common in early modern printing make OCR insufficient for reliable transcription of these sources.

Instead, early modern scholars have turned to handwritten text recognition (HTR) technology. Transkribus, the leading HTR software, allows users to apply publicly available transcription models or to train their own. Comparing various HTR models tested on a selection of pages from four 16th century Escla collections, Strecker and Lifton highlight how Transkribus lets scholars create dedicated transcription models, tailored to the specifications of their desired sources, in five basic steps.

Using the Transkribus public models, researchers can generate the training data they need to train their own highly accurate models. The authors argue that, with this process, it “is no longer necessary or desirable” to rely on outsourced labor, such as the work of graduate students and workers in the Global South.

“Because the accurate and automated transcription of early modern printing is no longer a goal but a reality, the field of early modern research must consider which combinations of human labor and machine learning techniques it will accept and support, and which will ultimately shape the future of research,” the authors conclude.

“Only by insisting on ethical labor practices can scholars avoid exacerbating inequality within academic hierarchies or perpetuating the inequalities of colonialism.”

More information:
Serena Strecker et al., Unlocking the digitized archives of early modern printing: Automatic transcription of early modern printed books, The Sixteenth Century Journal (2025). DOI: 10.1086/735052

Provided by the University of Chicago

Citation: Researchers explore machine learning to ethically automate early modern text transcription (2025, July 18), retrieved July 18, 2025 from https://phys.org/news/2025-07-explore-machine-automate-automate-modern.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.




