Data science at DBC in 29 slides
![Page 1: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/1.jpg)
Data Science at DBC in 29 slides
Christian Boesgaard
DBC, Team XP
![Page 2: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/2.jpg)
●a few words about DBC
●a very short story about data science at DBC
●some examples of what we do
![Page 3: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/3.jpg)
DBC provides solutions to support the goal of the libraries:
“to encourage enlightenment, education and cultural activities”. (From “lov om biblioteksvirksomhed”, the Danish act on library services)
![Page 4: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/4.jpg)
National Bibliography
Registration of books, music, AV materials, Internet documents, articles and reviews in newspapers and magazines
(metadata production, 50+ persons)
![Page 5: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/5.jpg)
DanBib
The union catalogue of the Danish libraries and the infrastructure for interlibrary loans.
(metadata + usage)
![Page 6: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/6.jpg)
bibliotek.dk
Access to all Danish publications and to the holdings of the Danish libraries.
(metadata usage … and user data production)
![Page 7: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/7.jpg)
What Data?
●registration metadata
●some full text docs
●front covers
●loan data
●search data
(And much more)
![Page 8: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/8.jpg)
How it began
(I have been at DBC 10+ years and have a background in distributed systems, applied cryptography, and philosophy)
![Page 9: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/9.jpg)
![Page 10: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/10.jpg)
Stanford CS221
![Page 11: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/11.jpg)
So... AI is not that magical
… And it works
We should really use this!
![Page 12: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/12.jpg)
Automatic metadata assignment for articles
Training set: 136K articles with subject metadata
22K subject terms (95% used 169 times or less)
![Page 13: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/13.jpg)
“København Zoo beskyldes for at udbrede kreationisme”
+- creationisme
-+ 9 darwinisme
++ 12 evolutionsteori
+- formidling
-+ 6 intelligent design
+- kristendom
+- livets oprindelse
-+ 6 religion
++ 8 skabelsen
+- skilte
+- zoologiske haver
![Page 14: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/14.jpg)
“Copenhagen Zoo is accused of advancing creationism”
+- creationism
-+ 9 darwinism
++ 12 evolution theory
+- dissemination
-+ 6 intelligent design
+- christianity
+- origin of life
-+ 6 religion
++ 8 creation
+- signs
+- zoo
![Page 15: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/15.jpg)
Approach
1. bag-of-words + liblinear
2. bag-of-words + k-nearest neighbors
3. paragraph vectors + k-nn
Works pretty well for assisted indexing and is now an integrated part of the system used for registration.
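The second approach in the list above (bag-of-words + k-nearest neighbors) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the system used at DBC: the whitespace tokenizer, the similarity-weighted voting, the function names, and the toy `TRAINING` data are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_subjects(text, training_set, k=3, top_n=5):
    """Suggest subject terms by similarity-weighted voting among the k nearest articles.

    training_set is a list of (article_text, subject_terms) pairs.
    """
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(doc)), subjects)
              for doc, subjects in training_set]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, subjects in nearest:
        for subject in subjects:
            votes[subject] += similarity
    return [subject for subject, score in votes.most_common(top_n) if score > 0]


# Hypothetical toy training data (assumption, for illustration only).
TRAINING = [
    ("zoo signs explain evolution and darwin", ["evolution theory", "zoo"]),
    ("church debate about creation and religion", ["religion", "creation"]),
    ("election results and democracy in parliament", ["democracy", "elections"]),
]

print(suggest_subjects("the zoo puts up signs about evolution", TRAINING, k=1))
```

With real data one would more likely use a TF-IDF weighting and an indexed nearest-neighbor search (for example via scikit-learn, which the last slide lists), since a linear scan over 136K articles per query is slow.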
![Page 16: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/16.jpg)
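Metadata to Metadata
Sometimes, simple is good:
demokrati (democracy) [930]
politiske_forhold (political conditions) 897
politik (politics) 341
historie (history) 243
islam 234
valg (elections) 155
ytringsfrihed (freedom of speech) 129
menneskerettigheder (human rights) 117
oprør (rebellion) 94
udenrigspolitik (foreign policy) 93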
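One plausible reading of the list above is a plain co-occurrence count: for a given subject term, count how often every other term is assigned to the same record. The slide does not say how the numbers were produced, so the sketch below is an assumption, and the `RECORDS` data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_counts(records):
    """For each subject term, count how often every other term shares a record with it."""
    counts = defaultdict(Counter)
    for subjects in records:
        # set() guards against a term being listed twice on one record.
        for a, b in permutations(set(subjects), 2):
            counts[a][b] += 1
    return counts


def related_terms(counts, term, top_n=10):
    """Return the terms most often assigned together with `term`."""
    return counts[term].most_common(top_n)


# Hypothetical records, each a list of assigned subject terms (assumption).
RECORDS = [
    ["demokrati", "politik", "valg"],
    ["demokrati", "politik"],
    ["demokrati", "islam"],
    ["historie", "politik"],
]
COUNTS = cooccurrence_counts(RECORDS)

print(related_terms(COUNTS, "demokrati", top_n=3))
```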
![Page 17: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/17.jpg)
XP
![Page 18: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/18.jpg)
![Page 19: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/19.jpg)
Recommendations
●content-based (metadata)
●collaborative (item-item, loans)
Foucaults Pendul - Umberto Eco
Dronning Loanas mystiske flamme - Umberto Eco
Baudolino - Umberto Eco
Rosens navn - Umberto Eco
Kirkegården i Prag - Umberto Eco
Judasbrevet - Eric Frattini
Skaberens kort - Emilio Calderón
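Item-item collaborative filtering on loan data can be sketched as below. The slides do not state which similarity measure is used; cosine similarity over the sets of borrowers is an assumption here, and the `LOANS` pairs, borrower ids, and function names are made up for illustration.

```python
import math
from collections import defaultdict


def item_similarities(loans):
    """Compute cosine similarity between items from (borrower, item) loan pairs."""
    borrowers = defaultdict(set)
    for borrower, item in loans:
        borrowers[item].add(borrower)
    sims = defaultdict(dict)
    items = list(borrowers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(borrowers[a] & borrowers[b])
            if shared:
                score = shared / math.sqrt(len(borrowers[a]) * len(borrowers[b]))
                sims[a][b] = score
                sims[b][a] = score
    return sims


def recommend(sims, item, top_n=5):
    """Return the items most similar to `item`, best first."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in ranked[:top_n]]


# Hypothetical loan records (assumption, for illustration only).
LOANS = [
    ("u1", "Rosens navn"), ("u1", "Baudolino"),
    ("u2", "Rosens navn"), ("u2", "Baudolino"), ("u2", "Judasbrevet"),
    ("u3", "Rosens navn"), ("u3", "Judasbrevet"),
    ("u4", "Judasbrevet"),
]
SIMS = item_similarities(LOANS)

print(recommend(SIMS, "Rosens navn"))
```

The all-pairs loop is quadratic in the number of items, which is fine for a sketch but would need an inverted index or sparse matrix product at catalogue scale.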
![Page 20: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/20.jpg)
Ranking
●popularity
●personalized (loans/likes/...)
For search results (or recommendations)
![Page 21: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/21.jpg)
Suggestions
●popularity (loans)
●subjects, creator, etc.
E.g. for completion
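Popularity-based completion could look something like the sketch below: candidate strings (titles, creators, subjects) that start with the typed prefix are ordered by loan count. The `LOAN_COUNTS` numbers and the `complete` function are hypothetical, and a production service would likely use a search index such as Solr rather than a linear scan.

```python
def complete(prefix, loan_counts, top_n=5):
    """Return up to top_n candidates starting with prefix, most-loaned first."""
    prefix = prefix.lower()
    matches = [(term, count) for term, count in loan_counts.items()
               if term.lower().startswith(prefix)]
    # Sort by descending loan count, breaking ties alphabetically.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in matches[:top_n]]


# Hypothetical loan counts per suggestion candidate (assumption).
LOAN_COUNTS = {
    "Umberto Eco": 120,
    "Ulysses": 45,
    "Umbrella": 3,
    "James Joyce": 80,
}

print(complete("u", LOAN_COUNTS, top_n=2))
```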
![Page 22: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/22.jpg)
From Lady Gaga to James Joyce
![Page 23: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/23.jpg)
“Enlightenment” ...
Not guaranteed
But we can recommend “towards” a curated collection
(based on item-item similarity or P(loan(y)|loan(x)))
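A simple way to estimate P(loan(y)|loan(x)) is the share of borrowers of x who also borrowed y; recommending "towards" a curated collection then means scoring only the curated items. The sketch below assumes that estimator, and the `LOANS` data, the `CURATED` list, and the function names are illustrative rather than taken from the slides.

```python
from collections import defaultdict


def loans_by_borrower(loans):
    """Group (borrower, item) pairs into a set of items per borrower."""
    by_borrower = defaultdict(set)
    for borrower, item in loans:
        by_borrower[borrower].add(item)
    return by_borrower


def p_loan_given_loan(by_borrower, x, y):
    """Estimate P(loan(y) | loan(x)) as the share of x's borrowers who also borrowed y."""
    borrowed_x = [items for items in by_borrower.values() if x in items]
    if not borrowed_x:
        return 0.0
    return sum(1 for items in borrowed_x if y in items) / len(borrowed_x)


def recommend_towards(by_borrower, x, curated, top_n=3):
    """Rank only the curated items by P(loan(y) | loan(x)), dropping zero scores."""
    scored = [(p_loan_given_loan(by_borrower, x, y), y) for y in curated if y != x]
    scored = [(p, y) for p, y in scored if p > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [y for _, y in scored[:top_n]]


# Hypothetical loan records and curated collection (assumptions).
LOANS = [
    ("u1", "Born this way"), ("u1", "Fasandræberne"),
    ("u2", "Born this way"), ("u2", "Fasandræberne"), ("u2", "Ulysses"),
    ("u3", "Born this way"),
    ("u4", "Ulysses"),
]
BY_BORROWER = loans_by_borrower(LOANS)
CURATED = ["Ulysses", "Fasandræberne"]

print(recommend_towards(BY_BORROWER, "Born this way", CURATED))
```

Note that this conditional probability is not symmetric, unlike a cosine similarity, so popular items tend to score high as y for almost any x.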
![Page 24: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/24.jpg)
![Page 25: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/25.jpg)
Similarity Paths
Born this way - Lady Gaga (music)
Rasmus Seebach - Rasmus Seebach (music)
In these waters - Mads Langer (music)
De urørlige (movie)
Fasandræberne - Jussi Adler-Olsen (book)
![Page 26: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/26.jpg)
Similarity Paths
Fasandræberne - Jussi Adler-Olsen
Det syvende barn - Erik Valeur
Profeterne i Evighedsfjorden - Kim Leine
Min kamp - Karl Ove Knausgård
På sporet af den tabte tid - Marcel Proust
Fædre og sønner - Ivan Turgenev
Portræt af kunstneren [...] - James Joyce
Ulysses - James Joyce
![Page 27: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/27.jpg)
Similarity Paths
(for the kids...)
Sheik Yerbouti - Frank Zappa
Aladdin Sane - David Bowie
The red shoes - Kate Bush
MDNA world tour - Madonna
Lotus - Christina Aguilera
Born this way - Lady Gaga
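The slides do not say how similarity paths are computed. One way to get such a chain is to treat each item's most similar items as graph edges and search for a shortest path between two items. The sketch below uses breadth-first search under that assumption; the `NEIGHBORS` lists are illustrative edges, not real similarity data.

```python
from collections import deque


def similarity_path(neighbors, start, goal):
    """Breadth-first search for a shortest chain of similar items from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        item = queue.popleft()
        if item == goal:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while item is not None:
                path.append(item)
                item = previous[item]
            return path[::-1]
        for other in neighbors.get(item, []):
            if other not in previous:
                previous[other] = item
                queue.append(other)
    return None  # goal is not reachable from start


# Hypothetical "most similar items" lists (assumption, for illustration only).
NEIGHBORS = {
    "Born this way": ["Rasmus Seebach", "Lotus"],
    "Rasmus Seebach": ["In these waters"],
    "In these waters": ["De urørlige"],
    "De urørlige": ["Fasandræberne"],
    "Lotus": ["MDNA world tour"],
}

print(similarity_path(NEIGHBORS, "Born this way", "Fasandræberne"))
```

If similarity scores are available, a weighted search (for example Dijkstra with cost 1 - similarity) would favor paths through strongly related items instead of simply the fewest hops.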
![Page 29: Data science at DBC in 29 slides](https://reader036.vdocument.in/reader036/viewer/2022070510/58aaef4f1a28abc73a8b5e27/html5/thumbnails/29.jpg)
What we use
(for “data science”)
Python: SciPy stack, scikit-learn, gensim, Tornado.
Kafka, MongoDB, Solr.
(Java, R)