August 25, 2025

OCR/HTR Workshop for Under-resourced and Under-represented Languages in Digital Humanities
Main organizer: Alíz Horváth (Central European University)
Co-organizers: Grigor Boykov and Yavuz Köse (University of Vienna), Patrick McAllister (ÖAW)
Student assistant: Saranya Chandran (Central European University)
This workshop will bring together early-career and more senior scholars, as well as technical specialists, who have worked with or developed OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) tools for under-represented languages and scripts (i.e., most languages beyond the diverse forms of English). Participants will discuss challenges, potential solutions, and recommendations for digitizing textual materials in under-represented and under-resourced languages and scripts, with the goal of helping the broader scholarly community achieve tangible results.
Recent advances in OCR and HTR technology have permitted the digitization of vast amounts of textual data as the basis of text analysis. However, these technologies often perform less well, and pose unique challenges for scholars, when applied to under-resourced and under-represented languages and scripts, even though well-functioning OCR/HTR tools are critical for increasing the availability of textual corpora. Text recognition not only provides the foundation for further digital text analysis but also strengthens the representation of under-resourced languages and scripts, ushering in a more linguistically diverse and inclusive future for digital humanities. We will bring together researchers with diverse linguistic backgrounds and technical experts from well-established text recognition projects such as Transkribus and eScriptorium who have developed OCR/HTR tools for multilingual purposes. This will provide a rare opportunity for scholars working with different languages to interact, learn from one another, and find inspiration and potential parallels in each other’s experiences.
The workshop will focus mainly on languages written in non-Latin scripts. Specific guidelines for such scripts are generally lacking, so a relevant workflow could serve as an immensely useful tool for enriching and diversifying research infrastructures. Featured languages and scripts will include Kanbun (literary Sinitic), classical Korean, Chinese, Tibetan, Garshuni Malayalam, Hebrew, Ottoman Turkish, Sanskrit and Newar, ancient Greek, and Devanagari. The workshop will combine practice- and process-focused presentations, followed by discussions, with a collaborative writing sprint. The insights gained will hopefully be useful for researchers, DH practitioners, and language preservation efforts worldwide. By involving technical representatives from text recognition projects and infrastructure providers in discussions with active scholars, the workshop seeks to highlight common issues, foster collaborations, and help accelerate the achievement of tangible results.
This international workshop aims to foster collaboration not only among Central European University, the University of Vienna, the Austrian Academy of Sciences, and other institutions, but also between the two organizations that funded the workshop, the FWF Cluster of Excellence EurAsian Transformations and CLARIAH-AT, while also extending our outreach to Asia and the US.