Corpus Assembly as Text Data Integration from Digital Libraries and the Web
Jena University Language & Information Engineering (JULIE) Lab
https://julielab.de/
DFG Graduate School „Romanticism as a Model“
http://modellromantik.uni-jena.de
Friedrich Schiller University Jena, Germany
Jun 3 2019 – Urbana-Champaign, IL – JCDL '19 – Session 1A – Generation and Linking
Udo Hahn & Tinghui Duan
Allgemeine Literatur-Zeitung (1785-1849)
General Literature Gazette, ALZ
Jena/Halle, Germany
Very important historical text source for literary studies in German Romanticism (1790-1830)
Traditional Workflow
Printed Book
• Scan
Scanned Picture
• OCR
Full Text
• Encode
• Assemble
Corpus
• Analyse
Research Result
315 Volumes
≈ 150,000 Pages
≈ 150,000,000 Tokens
Cost- and Time-Consuming
Alternative Workflow
Digital Libraries
Full Text
• Encode
• Assemble
Corpus
• Analyse
Research Result
Scattered Digital Resources of ALZ
USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library
1,200+ Volumes
600,000+ Pages
600,000,000+ Tokens
Proposed Workflow
Digital Libraries and the Web
• Collect
• Correct Metadata
https://archive.org/details/bub_gb_udTjAAAAMAAJ/
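The Collect and Correct Metadata steps could be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the Internet Archive convention that item metadata is served as JSON under /metadata/<identifier>, and the helper names (item_identifier, metadata_url, publication_year) are hypothetical. The year check relies only on ALZ's publication span (1785-1849) stated above.

```python
import re
from urllib.parse import urlparse

# Assumption: Internet Archive serves item metadata as JSON at this endpoint.
METADATA_ENDPOINT = "https://archive.org/metadata/{identifier}"

def item_identifier(details_url: str) -> str:
    """Extract the item identifier from an archive.org 'details' URL."""
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not a details URL: {details_url}")
    return parts[1]

def metadata_url(details_url: str) -> str:
    """Build the metadata URL that corresponds to a 'details' URL."""
    return METADATA_ENDPOINT.format(identifier=item_identifier(details_url))

# Four-digit years starting with 17 or 18; the range check happens below.
YEAR = re.compile(r"\b(1[78]\d\d)\b")

def publication_year(raw_date: str):
    """Return the first year within 1785-1849 in a free-text date field, else None."""
    for match in YEAR.finditer(raw_date):
        year = int(match.group(1))
        if 1785 <= year <= 1849:  # ALZ publication span
            return year
    return None

print(metadata_url("https://archive.org/details/bub_gb_udTjAAAAMAAJ/"))
```

Fetching and parsing the JSON is left out on purpose; the point is only that identifiers and dates need normalizing before volumes from different libraries can be matched.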
Full-Texts
• Evaluate
• Select
14 different full-text versions for this page!
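With several competing full-text versions of one page, Evaluate and Select amount to scoring each version and keeping the best. The slides do not say which quality measure was used, so the sketch below swaps in a simple stand-in: the share of word tokens found in a lexicon. The function names, the toy lexicon, and the source labels are all hypothetical.

```python
import re

# Runs of letters only (Unicode-aware, so umlauts and ß are included).
TOKEN = re.compile(r"[^\W\d_]+")

def lexicon_hit_ratio(text: str, lexicon: set) -> float:
    """Fraction of word tokens found in the lexicon (0.0 for empty text)."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

def select_best(versions: dict, lexicon: set):
    """Return (source, score) for the version with the highest hit ratio."""
    scored = {src: lexicon_hit_ratio(txt, lexicon) for src, txt in versions.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Toy data: two hypothetical OCR versions of the same line.
lexicon = {"allgemeine", "literatur", "zeitung"}
versions = {
    "library_a": "Allgemeine Literatur Zeitung",
    "library_b": "Allgemcine Litcratur Zeitung",
}
print(select_best(versions, lexicon))
```

A lexicon-based score is cheap and needs no ground truth, but it penalizes historical spellings; a real evaluation would likely need a period-appropriate word list or a different measure.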
Best-Quality Full-Texts
• Encode
• Assemble
Target-Corpus
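The Encode and Assemble steps can be pictured as putting the selected pages into one ordered, marked-up document. The slides do not name the encoding scheme, so the element names below (corpus, volume, page) are ad hoc placeholders rather than the project's actual format, and the Page fields are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    volume: int   # hypothetical running volume number
    page: int     # page number within the volume
    source: str   # hypothetical label of the providing library
    text: str     # the selected best-quality full text

def assemble(pages):
    """Build one <corpus> tree, grouped by volume and sorted by page number."""
    corpus = ET.Element("corpus")
    volumes = {}
    for p in sorted(pages, key=lambda p: (p.volume, p.page)):
        vol = volumes.get(p.volume)
        if vol is None:
            vol = ET.SubElement(corpus, "volume", n=str(p.volume))
            volumes[p.volume] = vol
        node = ET.SubElement(vol, "page", n=str(p.page), source=p.source)
        node.text = p.text
    return corpus

# Toy input in arbitrary order, as it might arrive from different libraries.
pages = [
    Page(2, 1, "library_b", "second volume, first page"),
    Page(1, 2, "library_a", "first volume, second page"),
    Page(1, 1, "library_c", "first volume, first page"),
]
print(ET.tostring(assemble(pages), encoding="unicode"))
```

Recording the source on every page keeps provenance visible after pages from different libraries have been merged into a single volume.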
Result
Complete ALZ: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens
Target-Corpus: 261 Volumes, 126,612 Pages, 120,369,005 Tokens
≈ 82% coverage
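The coverage figure can be checked against the counts above. The slide does not say which unit the ≈ 82% refers to, so this small sketch computes the ratio for all three; the variable names are illustrative, and the totals for the complete ALZ are the approximate values from the slides.

```python
# Counts as given on the slides; the complete-ALZ page and token totals are approximate.
complete_alz = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}
target_corpus = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}

def coverage(part: dict, whole: dict) -> dict:
    """Ratio of part to whole for every unit present in 'whole'."""
    return {unit: part[unit] / whole[unit] for unit in whole}

for unit, ratio in coverage(target_corpus, complete_alz).items():
    print(f"{unit}: {ratio:.1%}")
```

Depending on the unit, the ratio comes out between roughly 80% and 84%, which is in the same range as the reported figure.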
The Largest Corpus for German Romanticism
https://github.com/JULIELab/ALZ
Problems
• Restricted Accessibility
• Heterogeneous Digitization Conditions and OCR Quality
Conclusion
• The Largest Corpus for German Romanticism
• Great Potential of Digital Libraries (DLs) for Computational Literary Studies
• More Cooperation Between DLs Desirable
• Better Metadata and OCR Quality Are Desirable
Thank you!