Making newspaper sources usable for modern research
ULB Saxony-Anhalt digitizes historical newspapers
The demands made on digitization projects in libraries are increasing. Beyond mere preservation, digitization gives patrons the chance to work intensively in digital form with books, manuscripts, prints or magazines.
Anke Berghaus-Sprengel and Benjamin Auberer from the University and State Library (Universitäts- und Landesbibliothek – ULB) of Saxony-Anhalt report on how these demands are being met on the basis of a newspaper digitization project.
The ULB Saxony-Anhalt houses one of Germany’s largest newspaper inventories. Photo: Markus Scholz.
Newspapers are a key source of information for all disciplines dealing with historical matters. They supply important insights into the politics, economy, culture and society of an era. As a legal deposit library, the University and State Library of Saxony-Anhalt (ULB) possesses one of Germany’s largest newspaper repositories: over 1,300 newspaper titles published before 1945, of which roughly 800 are regional titles from Central Germany.
As key examples of these regional newspapers, the General-Anzeiger für Halle und den Saalkreis and the Saale-Zeitung are highly significant for the economic and social history of Central Germany. They first appeared in the second half of the 19th century and by 1900 reached more than half of the population in Halle and the surrounding region.
Numerous inquiries about the newspaper holdings from academics and the wider public demonstrate the lasting interest in this medium. As early as the 1990s, newspapers were microfilmed for preservation, to counteract the decomposition of the paper.
However, accessing newspapers with a microfilm reader is time-consuming and tedious and no longer meets the requirements of modern scholarly practice.
Refining text recognition by means of machine learning
For this reason, a project to digitize the two regional newspapers was launched at the ULB in mid-2019. Sponsored by the German Research Foundation (Deutsche Forschungsgemeinschaft – DFG), nearly a million newspaper pages will be digitized within two years and made freely available online in Open Access under the CC BY-SA 3.0 DE license.
The ULB Saxony-Anhalt is thus facilitating historical engagement with this unique research material. A key criterion here is that content can be found quickly and easily in the full text, as in a Google search.
To achieve this goal, the text recognition (OCR) software Tesseract is used. With the aid of machine learning and a training workflow developed at the ULB, the software learns to distinguish letters that looked similar on newspaper pages of the period and to recognize different typefaces correctly.
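As a rough illustration of how a page image might be passed to Tesseract, the following sketch calls the Tesseract command-line tool from Python. It is not the ULB's actual workflow: the default model `deu` is Tesseract's stock German model, and the file, model and directory names in the usage note are hypothetical placeholders for a custom-trained model.

```python
import subprocess


def build_tesseract_command(image_path, model="deu", tessdata_dir=None, psm=3):
    """Build the argument list for one Tesseract CLI call that writes text to stdout."""
    # "stdout" as the output base makes Tesseract print the recognized text.
    cmd = ["tesseract", str(image_path), "stdout", "-l", model, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Directory that holds the <model>.traineddata file, e.g. a custom-trained model.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_page(image_path, **options):
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return result.stdout
```

With Tesseract installed, a call such as `ocr_page("page_0001.tif", model="ulb_newspapers", tessdata_dir="models")` would return the page text, assuming a file `models/ulb_newspapers.traineddata` exists. Both names are invented for this example.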
In this way the scanned pages can be machine read and the contents searched for any reference word. This full-text search function also enables researchers to pursue innovative, data-driven questions from the Digital Humanities sphere. For example, it enables empirically grounded, corpus-based examinations to be conducted.
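To make the idea of full-text search concrete, here is a minimal sketch of an inverted index over OCR text. It is an assumption for illustration only and says nothing about the search technology the ULB actually uses. The page identifiers and sample texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map every token to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Invented sample texts standing in for OCR output of two pages.
pages = {
    "1920-03-15_p1": "The Kapp-Putsch reaches Halle",
    "1920-03-16_p2": "Advertisements and market reports from Halle",
}
index = build_index(pages)
print(search(index, "Kapp-Putsch"))
```

Because the hyphen is treated as a separator, a query for "Kapp-Putsch" matches pages containing both "kapp" and "putsch". A production system would add phrase matching and tolerance for OCR errors.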
Good OCR results require high image quality
However, the results of text recognition are only as good as the quality of the underlying images. Great importance is thus given to high image quality in the project. Only then can the contents be properly opened up for full-text analysis.
A roll film scanner from Zeutschel, the OM 1800, is being used to digitize the microfilm. The microfilm is scanned in its entirety and then automatically divided into individual pages and double-page spreads using the OM 1800's Quantum Process software. The DFG's practical rules for digitization are followed.
The roll film scanner used offers true optical resolution of at least 470 dpi with reference to an original in A1 format (possible resolution of OM 1800 is up to 600 dpi) and at least a 12-bit gray scale. In this way, the films can be efficiently digitized with the highest level of quality.
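To give a sense of scale, the following sketch works out what 470 dpi means for an A1-sized original. It assumes the ISO 216 A1 sheet size of 594 mm by 841 mm and tightly packed, uncompressed single-channel storage. Actual newspaper page sizes and file formats in the project may differ.

```python
MM_PER_INCH = 25.4
A1_MM = (594, 841)  # ISO 216 A1 sheet: width and height in millimetres


def pixel_dimensions(size_mm, dpi):
    """Pixel width and height of a sheet captured at the given resolution."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in size_mm)


def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of a single-channel image with tightly packed pixels."""
    return width_px * height_px * bits_per_pixel // 8


width, height = pixel_dimensions(A1_MM, dpi=470)
size_mb = raw_size_bytes(width, height, bits_per_pixel=12) / 1e6
print(f"{width} x {height} px, about {size_mb:.0f} MB uncompressed at 12 bit")
```

Under these assumptions a single A1 page comes to roughly 11,000 by 15,600 pixels, which shows why trimming margins matters for storage.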
Individual pages are aligned and the margins reduced to a minimum to optimize the use of storage space. Newspapers which have not been microfilmed are digitized directly from the original.
The microfilms are digitized in accordance with the DFG’s rules with the Zeutschel roll film scanner type OM 1800.
Digitization is intended to stimulate new research
The combination of roll film scanner and text recognition software delivers excellent results. Accuracy is approximately 95 percent, and the goal is to raise this figure further. This makes a big difference to the use of the newspapers.
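The article does not say how the accuracy figure is measured. One common approach, assumed here purely for illustration, is character accuracy: one minus the edit distance between a manually corrected reference text and the OCR output, divided by the length of the reference.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """One minus the character error rate, floored at zero."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

On this definition, an accuracy of 95 percent corresponds to one character error in every 20 characters of reference text, so a higher rate directly reduces the number of search terms that are missed.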
This rate of text recognition will make it possible to search reliably for any term in newspaper articles and advertisements. For example, researchers who enter the term “Kapp-Putsch” will obtain detailed accounts of the “horror days in Halle” from 13 to 23 March 1920, and they will learn that the town was one of the major hubs of political violence in the 1920s.
The stories of companies and individuals featured in Halle’s history can thus be easily traced over the periods in which the newspapers were published – a genuine boon for research and teaching. With this digitization program, the ULB is hoping to go beyond the mere provision of the material and encourage researchers to study innovative, data-driven questions in order to explore the history of Halle and the surrounding areas.
Director of the University and State Library of Saxony-Anhalt.
Specialist Librarian at the University and State Library of Saxony-Anhalt.
University and State Library of Saxony-Anhalt
The University and State Library of Saxony-Anhalt, ULB for short, was set up in 1696, two years after the foundation of Halle University. With an inventory of 5.5 million publications, the ULB is the largest general academic library in Saxony-Anhalt. Besides a large collection of newspapers and magazines, the ULB in Halle houses over 115,000 manuscripts and a valuable collection of prints from the 15th to the 18th century. https://bibliothek.uni-halle.de/