data science at dbc in 29 slides

Data Science at DBC in 29 slides Christian Boesgaard DBC, Team XP

Upload: dansk-bibliotekscenter

Post on 20-Feb-2017



TRANSCRIPT

Page 1: Data science at DBC in 29 slides

Data Science at DBC in 29 slides

Christian Boesgaard
DBC, Team XP

Page 2: Data science at DBC in 29 slides

● a few words about DBC
● a very short story about data science at DBC
● some examples of what we do

Page 3: Data science at DBC in 29 slides

DBC provides solutions to support the goal of the libraries:

“to encourage enlightenment, education and cultural activities”. (From “lov om biblioteksvirksomhed”, the Danish library act)

Page 4: Data science at DBC in 29 slides

National Bibliography

Registration of books, music, AV materials, Internet documents, articles and reviews in newspapers and magazines

(metadata production, 50+ persons)

Page 5: Data science at DBC in 29 slides

DanBib

The union catalogue of the Danish libraries and the infrastructure for interlibrary loans.

(metadata + usage)

Page 6: Data science at DBC in 29 slides

bibliotek.dk

Access to all Danish publications and to the holdings of the Danish libraries.

(metadata usage … and user data production)

Page 7: Data science at DBC in 29 slides

What Data?

● registration metadata
● some full text docs
● front covers
● loan data
● search data

(And much more)

Page 8: Data science at DBC in 29 slides

How it began

(I have been at DBC 10+ years and have a background in distributed systems, applied cryptography, and philosophy)

Page 9: Data science at DBC in 29 slides
Page 10: Data science at DBC in 29 slides

Stanford CS221

Page 11: Data science at DBC in 29 slides

So... AI is not that magical

… And it works

We should really use this!

Page 12: Data science at DBC in 29 slides

Automatic metadata assignment for articles

Training set: 136K articles with subject metadata

22K subject terms (95% used 169 times or less)

Page 13: Data science at DBC in 29 slides

“København Zoo beskyldes for at udbrede kreationisme”

+- creationisme
-+ 9 darwinisme
++ 12 evolutionsteori
+- formidling
-+ 6 intelligent design
+- kristendom
+- livets oprindelse
-+ 6 religion
++ 8 skabelsen
+- skilte
+- zoologiske haver

Page 14: Data science at DBC in 29 slides

“Copenhagen Zoo is accused of advancing creationism”

+- creationism
-+ 9 darwinism
++ 12 evolution theory
+- dissemination
-+ 6 intelligent design
+- christianity
+- origin of life
-+ 6 religion
++ 8 creation
+- signs
+- zoo

Page 15: Data science at DBC in 29 slides

Approach

1. bag-of-words + liblinear
2. bag-of-words + k-nearest neighbors
3. paragraph vectors + k-nn

Works pretty well for assisted indexing and is now an integrated part of the system used for registration.
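As an illustration of approach 2, here is a minimal sketch of bag-of-words + k-nearest neighbors using only the Python standard library. It is not DBC's implementation: the tokenizer, the cosine weighting, the function names and the toy training set are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_subjects(text, training_set, k=3):
    """Suggest subject terms by similarity-weighted voting among the k nearest articles.

    training_set is a list of (article_text, [subject terms]) pairs.
    """
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(article)), subjects)
              for article, subjects in training_set]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, subjects in scored[:k]:
        if similarity > 0:  # neighbours with no shared words do not vote
            for subject in subjects:
                votes[subject] += similarity
    return [subject for subject, _ in votes.most_common()]


# Toy training data, invented for this example.
TRAINING = [
    ("zoo signs explain evolution to visitors", ["evolution theory", "zoo"]),
    ("church debate on creation and religion", ["religion", "creation"]),
    ("election results and party politics", ["politics"]),
]

suggestions = suggest_subjects("new signs at the zoo", TRAINING, k=1)
```

For assisted indexing, the ranked suggestions would be shown to a human indexer rather than assigned automatically.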

Page 16: Data science at DBC in 29 slides

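Metadata to Metadata

Sometimes, simple is good:

demokrati (democracy) [930]
politiske_forhold (political conditions) 897
politik (politics) 341
historie (history) 243
islam 234
valg (elections) 155
ytringsfrihed (freedom of speech) 129
menneskerettigheder (human rights) 117
oprør (rebellion) 94
udenrigspolitik (foreign policy) 93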
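The slide does not say how these numbers were produced. One simple reading, assumed here, is that they are co-occurrence counts: how often each other subject term appears on records that also carry "demokrati". A minimal sketch under that assumption, with a hypothetical function name and invented toy records:

```python
from collections import Counter


def cooccurring_subjects(records, term):
    """Count which subject terms appear together with `term`.

    records is an iterable of subject-term lists, one list per catalogue record.
    Returns (number of records carrying `term`, [(other_term, count), ...]).
    """
    total = 0
    counts = Counter()
    for subjects in records:
        subjects = set(subjects)
        if term in subjects:
            total += 1
            counts.update(subjects - {term})
    return total, counts.most_common()


# Toy records, invented for this example.
RECORDS = [
    ["demokrati", "politik", "valg"],
    ["demokrati", "politik"],
    ["islam", "historie"],
    ["demokrati", "historie"],
]

total, ranked = cooccurring_subjects(RECORDS, "demokrati")
```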

Page 17: Data science at DBC in 29 slides

XP

Page 18: Data science at DBC in 29 slides
Page 19: Data science at DBC in 29 slides

Recommendations

● content-based (metadata)
● collaborative (item-item, loans)

Foucaults Pendul - Umberto Eco
Dronning Loanas mystiske flamme - Umberto Eco
Baudolino - Umberto Eco
Rosens navn - Umberto Eco
Kirkegården i Prag - Umberto Eco
Judasbrevet - Eric Frattini
Skaberens kort - Emilio Calderón
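A minimal sketch of item-item collaborative filtering on loan data, assuming loans are available as (borrower, item) pairs and using cosine similarity over borrower sets. The similarity measure, function names and toy loans are assumptions for illustration, not a description of the production system.

```python
import math
from collections import defaultdict


def borrowers_by_item(loans):
    """Map each item to the set of borrowers who have loaned it."""
    by_item = defaultdict(set)
    for borrower, item in loans:
        by_item[item].add(borrower)
    return dict(by_item)


def similar_items(item, by_item, top_n=5):
    """Rank other items by cosine similarity of their borrower sets."""
    target = by_item.get(item, set())
    scored = []
    for other, users in by_item.items():
        if other == item:
            continue
        shared = len(target & users)
        if shared:
            scored.append((shared / math.sqrt(len(target) * len(users)), other))
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


# Toy loans, invented for this example.
LOANS = [
    ("u1", "Rosens navn"), ("u1", "Baudolino"),
    ("u2", "Rosens navn"), ("u2", "Baudolino"), ("u2", "Judasbrevet"),
    ("u3", "Judasbrevet"), ("u3", "Skaberens kort"),
]

recommended = similar_items("Rosens navn", borrowers_by_item(LOANS))
```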

Page 20: Data science at DBC in 29 slides

Ranking

● popularity
● personalized (loans/likes/...)

For search results (or recommendations)

Page 21: Data science at DBC in 29 slides

Suggestions

● popularity (loans)
● subjects, creator, etc.

E.g. for completion
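One way popularity-based completion could work, sketched under the assumption that each candidate string (a title, creator or subject) comes with a loan count. The function name and the counts below are invented for the example.

```python
def complete(prefix, loan_counts, top_n=5):
    """Return candidates starting with `prefix`, most-loaned first."""
    prefix = prefix.lower()
    matches = [(count, term) for term, count in loan_counts.items()
               if term.lower().startswith(prefix)]
    # Most loans first; ties broken alphabetically.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in matches[:top_n]]


# Hypothetical loan counts.
LOAN_COUNTS = {"Umberto Eco": 120, "Ulysses": 45, "Kate Bush": 80}

completions = complete("u", LOAN_COUNTS)
```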

Page 22: Data science at DBC in 29 slides

From Lady Gaga to James Joyce

Page 23: Data science at DBC in 29 slides

“Enlightenment” ...

Not guaranteed

But we can recommend “towards” a curated collection

(based on item-item similarity or P(loan(y)|loan(x)))
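A minimal sketch of recommending "towards" a curated collection with P(loan(y)|loan(x)), estimated as the share of borrowers of x who also loaned y, with y restricted to the curated items. The estimator, the function names and the toy loans are assumptions for illustration.

```python
from collections import defaultdict


def borrowers_by_item(loans):
    """Map each item to the set of borrowers who have loaned it."""
    by_item = defaultdict(set)
    for borrower, item in loans:
        by_item[item].add(borrower)
    return dict(by_item)


def conditional_loan_probability(x, y, by_item):
    """Estimate P(loan(y) | loan(x)) from borrower sets."""
    x_users = by_item.get(x, set())
    if not x_users:
        return 0.0
    return len(x_users & by_item.get(y, set())) / len(x_users)


def recommend_towards(x, curated, by_item, top_n=3):
    """Recommend only items from the curated collection, ranked by P(loan(y)|loan(x))."""
    scored = [(conditional_loan_probability(x, y, by_item), y)
              for y in curated if y != x]
    scored = [(p, y) for p, y in scored if p > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [y for _, y in scored[:top_n]]


# Toy loans with hypothetical item ids.
LOANS = [
    ("u1", "popular_item"), ("u1", "classic_a"),
    ("u2", "popular_item"), ("u2", "classic_a"),
    ("u3", "popular_item"), ("u3", "other_item"),
    ("u4", "popular_item"), ("u4", "classic_b"),
]
BY_ITEM = borrowers_by_item(LOANS)

picks = recommend_towards("popular_item", {"classic_a", "classic_b"}, BY_ITEM)
```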

Page 24: Data science at DBC in 29 slides
Page 25: Data science at DBC in 29 slides

Similarity Paths

Born this way - Lady Gaga (music)
Rasmus Seebach - Rasmus Seebach (music)
In these waters - Mads Langer (music)
De urørlige (movie)
Fasandræberne - Jussi Adler-Olsen (book)

Page 26: Data science at DBC in 29 slides

Similarity Paths

Fasandræberne - Jussi Adler-Olsen
Det syvende barn - Erik Valeur
Profeterne i Evighedsfjorden - Kim Leine
Min kamp - Karl Ove Knausgård
På sporet af den tabte tid - Marcel Proust
Fædre og sønner - Ivan Turgenev
Portræt af kunstneren [...] - James Joyce
Ulysses - James Joyce

Page 27: Data science at DBC in 29 slides

Similarity Paths

(for the kids...)

Sheik Yerbouti - Frank Zappa
Aladdin Sane - David Bowie
The red shoes - Kate Bush
MDNA world tour - Madonna
Lotus - Christina Aguilera
Born this way - Lady Gaga
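The slides do not say how the similarity paths were computed. One plausible approach, assumed here, is to link every item to its most similar items and search for a shortest path in that graph. The breadth-first search below is a sketch under that assumption, with a hypothetical function name and graph.

```python
from collections import deque


def similarity_path(start, goal, neighbours):
    """Shortest chain of "most similar" hops from start to goal, or None.

    neighbours maps each item to a list of its most similar items.
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the predecessor links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours.get(current, []):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical nearest-neighbour graph.
NEIGHBOURS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}

path = similarity_path("a", "e", NEIGHBOURS)
```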

Page 28: Data science at DBC in 29 slides

The End

Christian Boesgaard

Team XP

Page 29: Data science at DBC in 29 slides

What we use

(for “data science”)

Python: SciPy stack, scikit-learn, gensim, Tornado.

Kafka, MongoDB, Solr.

(Java, R)