Ethical machine learning tools bring innovation to early modern text transcription



In recent years, digitization efforts have made sources from the 16th and 17th centuries more widely available than ever before. Scholars can now search digital transcriptions for keywords without leaving their desks or visiting physical archives. Still, while access is easy, most digitized material has not been transcribed because of time, labor and funding constraints.

Illuminated manuscript: antiphonary, Santa Chiara (Naples), 16th century. Credit: Yair Haklai / CC BY-SA 4.0

A new article published in The Sixteenth Century Journal by Serena Strecker and Kimberly Lifton addresses both the technical and the ethical aspects of the issue. The authors discuss alternatives to traditional transcription methods, which often relied on outsourced workers, such as graduate students and other workers, to transcribe historical texts by hand.

Optical character recognition (OCR) software is effective at transcribing text from the late 19th and 20th centuries, but it is poorly suited to the inconsistencies common in early modern printing. As a result, early modern scholars have increasingly turned to handwritten text recognition (HTR) technology. The most effective HTR software, Transkribus, lets users apply public transcription models or train their own, providing new solutions to transcription challenges.

Strecker and Lifton conducted case studies using Transkribus on a sample drawn from four 16th-century German model collections. Their experiments showed that even public HTR models can produce highly accurate initial transcriptions of printed text. Furthermore, when scholars use a Transkribus public model to generate training data, they can develop their own models tailored to the source material in a five-step process.
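The article does not say how transcription accuracy was measured. A common metric for HTR output is the character error rate (CER): the edit distance between a hand-checked reference transcription and the model output, divided by the length of the reference. The following is a minimal sketch under that assumption; the function names and the sample strings are invented for illustration and are not taken from the study or from Transkribus.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 16-character line.
reference_line = "Von der Freyheit"
model_output = "Von der Freyhett"
cer = character_error_rate(reference_line, model_output)
```

In this made-up example a single substituted character in a 16-character line gives a CER of 0.0625, or about 6 percent. A lower CER means fewer corrections are needed before the output can serve as training data for a tailored model.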

An example of a letter extracted from the handwritten chronicle of the Zoological Museum, written by Wilhelm Moritz Kefferstein around 1864. Credit: F. Welter-Schultes

This approach not only maximizes transcription accuracy but also addresses ethical concerns. The authors argue that hiring outsourced workers is “not desirable or necessary anymore.” Instead, they encourage a shift towards empowering individual researchers to produce their own transcriptions. This avoids reproducing the long-term effects of academic inequality and colonial labor practices.

Despite the promise of HTR, the authors stress that the early modern scholarly community needs to discuss how this technology can be integrated into research workflows. “With the precise and automated transcription of early modern printing no longer a goal but a reality,” Strecker and Lifton state, “the early modern field of research must consider whether the combination of human labor and machine learning techniques is accepted and endorsed and ultimately shapes the future of research.”

They emphasize that future transcription work should be not only technically efficient but also ethical in its labor practices. “Only by asserting ethical labor practices can scholars avoid exacerbating inequality within academic hierarchies or perpetuating the lasting inequality of colonialism.”

More information: Strecker, S., & Lifton, K. (2025). Unlocking digitized archives of early modern printing: Automatic transcription of early modern printed books. The Sixteenth Century Journal, 56(2), 395–419. doi:10.1086/735052




