The project expects to launch by the end of the month. When it does, researchers and the public will be able to comb through widely reprinted texts identified by mining 41,829 issues of 132 newspapers from the Library of Congress. While this first stage focuses on texts from before the Civil War, the project will eventually cover the later 19th century and expand to magazines and other publications, says Ryan Cordell, an assistant professor of English at Northeastern University and a leader of the project.
Fast-forward a century and a half, and many of these newspapers have been scanned and digitized. Northeastern computer scientist David Smith developed an algorithm that mines this vast trove of text for reprinted items by hunting for clusters of five words that appear in the same sequence in multiple publications (Google uses a similar concept for its Ngram Viewer).
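To make the five-word idea concrete, here is a minimal sketch in Python. It is not Smith's actual algorithm, which the article does not describe in detail; it only illustrates the general concept of finding five-word sequences shared by more than one publication. The function names, the sample newspaper names and the sample texts are all hypothetical, and the sketch ignores complications such as punctuation and scanning errors.

```python
from collections import defaultdict


def five_word_sequences(text, n=5):
    """Return the set of n-word sequences (n=5 by default) found in a text."""
    # Simplifying assumption: lowercase and split on whitespace only.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_sequences(documents, n=5):
    """Map each n-word sequence to the publications it appears in,
    keeping only sequences found in more than one publication."""
    seen_in = defaultdict(set)
    for name, text in documents.items():
        for sequence in five_word_sequences(text, n):
            seen_in[sequence].add(name)
    return {seq: names for seq, names in seen_in.items() if len(names) > 1}


# Hypothetical sample data, invented for illustration only.
papers = {
    "paper_a": "the quick brown fox jumps over the lazy dog today",
    "paper_b": "yesterday the quick brown fox jumps over a fence",
    "paper_c": "nothing in common with the other two papers here",
}

matches = shared_sequences(papers)
for sequence, names in sorted(matches.items()):
    print(" ".join(sequence), "->", sorted(names))
```

In this toy example, the first two papers share two overlapping five-word sequences, which hints that they carry the same passage. A real system working on scanned newspapers would also have to tolerate misread characters and group overlapping matches into whole reprinted texts.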
The project is sponsored by the NULab for Texts, Maps, and Networks at Northeastern and the Office of Digital Humanities at the National Endowment for the Humanities. Cordell says the main goal is to build a resource for other scholars, but he’s already capitalizing on it for his own research, using modern mapping and network analysis tools to explore how things went viral back then.