Yale Library is developing an AI application that could transform research in digitized collections

November 25, 2024

Yale Library has created a prototype for an innovative use of Artificial Intelligence (AI) to elicit information and insights from digitally scanned texts. The new tool, with the working title “Digital Collections AI,” could change the way researchers use the library’s vast and rapidly growing digital collections.   

The application uses AI-powered Large Language Models (LLMs) to rapidly “read,” summarize, and analyze digital texts transcribed with Optical Character Recognition (OCR), and then to answer a set of questions about the content. LLMs are trained to recognize, interpret, and learn from language patterns, and they process text far faster than a person can.

Digital Collections AI originated with Michael Appleby, director of software engineering in Library IT, while he was experimenting with AI for a conference presentation. Now he’s leading the development of the tool, working with materials from Yale Library Digital Collections and seeking help with beta testing from faculty and graduate students.

Unlocking secrets, saving time

Appleby is motivated by his own experience at Yale in the 1990s as a graduate student in Classics, when he often had to decipher secondary sources in multiple European languages in addition to primary sources in Latin and Greek. “I hope that this application will help students and researchers to unlock the secrets of our collections and more easily identify the resources they need,” he said. 

Although the tool is still being tested and refined, Appleby’s library colleagues see enormous potential for its use. “It can allow you to ask questions, say, of a handwritten manuscript with hundreds of pages, which would otherwise take hours of time to read through and decipher,” said Jonathan Manton, director of Digital Special Collections and Access at the Beinecke Rare Book and Manuscript Library. “The tool does not replace that work, but it can give you a head start by identifying, for example, the people, places, subjects, and common terms in the text. It can also provide you with summaries of the text as well as a stylistic analysis, all to help you understand if the resource is helpful and worthy of further investigation.”  

Case study: Travels in 19th-century Europe

In a recent presentation, Appleby demonstrated Digital Collections AI on the handwritten travel journal of Mrs. E. A. Kenah, a British woman who jaunted through Europe in the 1820s with her husband, two friends, and a pet spaniel. The small leatherbound volume is shelved in the Beinecke Library, and the digital version is online in Yale Library Digital Collections.

Digital Collections AI analyzed and summarized the journal’s text, indexed it by assigned categories, and organized it by topic. It then responded to prompts Appleby created, including: “Provide a summary of the text,” “List geographic locations,” “List personal names,” “List literary works,” “List scientific terms,” and others.

The tool will allow users to run the same text through different LLMs (such as Gemini, Claude, ChatGPT, and others) to yield a broader range of information and points of comparison. In Appleby’s experiment, for example, Claude 3.5 Sonnet produced a bulleted summary of key points, while ChatGPT provided a more detailed four-paragraph summary. Claude produced a list of countries visited, while ChatGPT named specific cities. ChatGPT identified the author as “a 19th-century traveler,” while Claude described her as “an English traveler, likely of some means, able to undertake extended tours of Europe.”  
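
For readers curious about the mechanics, the workflow described here (one fixed set of questions, asked of the same text by several models) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's implementation: the function names, the request layout, and the two stand-in "models" are hypothetical, and a real version would replace the stand-ins with calls to each provider's own API. Only the prompt wording comes from Appleby's demonstration.

```python
from typing import Callable, Dict, List

# Prompts quoted from the case study above.
PROMPTS: List[str] = [
    "Provide a summary of the text",
    "List geographic locations",
    "List personal names",
    "List literary works",
    "List scientific terms",
]

# Assumption: a "model" is any function that takes a request string
# and returns the model's reply as a string.
Model = Callable[[str], str]


def build_request(question: str, ocr_text: str) -> str:
    """Combine one question with the OCR transcription (hypothetical layout)."""
    return f"{question}\n\n--- TEXT ---\n{ocr_text}"


def compare_models(
    ocr_text: str,
    models: Dict[str, Model],
    prompts: List[str] = PROMPTS,
) -> Dict[str, Dict[str, str]]:
    """Ask every model every question; return replies keyed by prompt, then model."""
    results: Dict[str, Dict[str, str]] = {}
    for prompt in prompts:
        request = build_request(prompt, ocr_text)
        results[prompt] = {name: ask(request) for name, ask in models.items()}
    return results


# Stand-in models so the sketch runs without network access or API keys.
def stub_model_a(request: str) -> str:
    return f"[model-a] received {len(request)} characters"


def stub_model_b(request: str) -> str:
    return "[model-b] " + request.splitlines()[0]


if __name__ == "__main__":
    # Placeholder text, not a quotation from the journal.
    sample = "placeholder OCR transcription"
    replies = compare_models(sample, {"model-a": stub_model_a, "model-b": stub_model_b})
    for prompt, by_model in replies.items():
        print(prompt)
        for name, reply in by_model.items():
            print(f"  {name}: {reply}")
```

Keying the results by prompt first puts each model's answer to the same question side by side, which is the kind of comparison described above.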

OCR transcriptions are sometimes of poor quality, especially when the originals are hard to read. In these cases the LLMs can often correct the transcription and reconstruct the logic of the text, “much like a human would,” Appleby explained, although their output is not entirely reliable.

The application is currently limited in the languages it can translate and analyze: at this time it primarily reads languages written in the Latin script, with English, French, German, Polish, and Czech among them. Another challenge still to be addressed is the documented tendency of LLMs to “hallucinate” incorrect or fabricated information.

For Yale instructors: Collaboration and funding opportunities

Yale Library is seeking faculty and graduate students who are interested in testing the tool and helping to shape its ongoing development. The library is also offering funding for research assistants to those instructors who want to include the tool in a class assignment or activity.  

“We are envisioning that interested instructors might use the tool to support their students’ engagement with digitized special collections or to help students learn more about how LLMs work,” said Lauren Di Monte, associate university librarian for Research and Teaching. “The funding can be used to research digitized collections and support other integration work.” 

For information about funding or to collaborate on testing and development, Yale instructors should contact Lauren Di Monte.  

Read more about Yale Library resources for AI in research.   

Images: Screen and slides from Michael Appleby’s PowerPoint demonstration of the prototype for Digital Collections AI