Linkage in Haze: Challenges and take-home messages of crowd-sourcing vagueness in musical data
Alessandro Adamou
Listening to music: people, practices and experiences
Sunday, October 25, 2015
What we capture in a listening experience
what they said
where documented
where
who
what
when
how
A well-formed listening experience
On July 19, 2014
Leonie Holmes (a professor of Music in New Zealand)*
was listening to Richard Strauss’ “Don Juan” and “Also sprach Zarathustra” and Sarah Ballard’s “Synergos”
played by the NZSO National Youth Orchestra and Alexander Shelley
using harp and double bass (+others?)
in the Aotea Centre.
(*) plus a general public, which does not pose a problem in the representation.
A worst-case (but more likely) listening experience
One evening between May and September in the late 1950s
a group of war veterans, a reporter and an unknown female
were listening to Chopin and an anthem
played by a string orchestra
in a concert hall in South London.
Goal: to capture both fact sets as structured data
(interconnected or prepared for refinement)
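One way to picture this goal is a single record type that can hold both the well-formed and the worst-case experience, with vagueness living in the values rather than in the structure. The sketch below is only an illustration: the class name, the field names and the `unspecified` helper are assumptions made for this example and are not the LED data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ListeningExperience:
    """Hypothetical record; field names are illustrative, not LED's schema."""
    when: str                      # a formal date or a free-text description
    listeners: list[str]
    works: list[str]
    performers: list[str] = field(default_factory=list)
    instruments: list[str] = field(default_factory=list)
    where: Optional[str] = None

    def unspecified(self) -> list[str]:
        """Names of the fields that the evidence leaves empty."""
        return [name for name, value in vars(self).items() if not value]


well_formed = ListeningExperience(
    when="2014-07-19",
    listeners=["Leonie Holmes"],
    works=["Don Juan", "Also sprach Zarathustra", "Synergos"],
    performers=["NZSO National Youth Orchestra", "Alexander Shelley"],
    instruments=["harp", "double bass"],
    where="Aotea Centre",
)

worst_case = ListeningExperience(
    when="one evening between May and September in the late 1950s",
    listeners=["a group of war veterans", "a reporter", "an unknown female"],
    works=["Chopin (work unknown)", "an anthem"],
    performers=["a string orchestra"],
    where="a concert hall in South London",
)
```

Both records fit the same shape; the second simply carries descriptions where the first carries names, which is what leaves it open to later refinement.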
Other issues
• Source
– unpublished manuscripts
• Unaligned semantic layers
– instrument category | instrument name | brand and model
  e.g. Chords | Electric Guitar | Gibson Les Paul Custom Sunburst
– generic occupations | gender-dependent | personal titles
  e.g. Monarch | King (Queen), Emperor (Empress)… | King of England, fourth Sultan of Zanzibar…
Factors contributing to fuzziness
• Domain knowledge of the evidence author(s)
• Deterioration of evidence
• Crowd-sourcing community issues:
– Misaligned semantics
– Varying scholarly rigour
– Popularity of the domain of interest
Data representation in LED
• Linked [Open] Data http://linkeddata.org
– Formalism for machine-readable and human-readable data
– Object identifiers are URIs
– Standard representation and query languages (RDF, SPARQL…)
– The meaning of links between objects is globally understood.
Identity in Linked Data
• http://musicbrainz.org/artist/f5aca88c-e3c1-4bc2-af33-68a9a9f7b56a#_ – (the band Killing Joke, as in MusicBrainz)
• http://bnb.data.bl.uk/id/agent/DailyMirror – (The Daily Mirror on the British National Bibliography)
• http://dbpedia.org/resource/London – (London, as in Wikipedia/DBpedia)
• http://reference.data.gov.uk/doc/day/2015-10-23 – (last Friday, as in the UK Government Calendar data)
• http://led.kmi.open.ac.uk/term/Medium.Live – (the concept of live music, as in LED)
Easy: these are all named entities…
Goal: to capture both fact sets as linked data
There are no right or wrong ways to do it, only linkable or unlinkable.
Linked Data encourage reuse…
No two things are either distinct or equal until some LD node asserts or implies otherwise.
– e.g. Bono on MusicBrainz and Bono on DBpedia
– Groups too, if it can be demonstrated they are an exact match
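The rule above can be sketched as a small index in which the relation between two identifiers stays "unknown" until someone asserts sameness or difference. This is a minimal sketch, assuming a union-find structure for the transitive closure of equality; the class and method names are hypothetical, and the short labels such as `musicbrainz:Bono` stand in for real URIs that are not given in the slides.

```python
class SameAsIndex:
    """Tracks asserted equality and difference between identifiers."""

    def __init__(self):
        self._parent = {}      # union-find forest over identifiers
        self._different = []   # explicitly asserted (a, b) difference pairs

    def _find(self, uri):
        """Return the representative of the equivalence class of `uri`."""
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps later lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def assert_same(self, a, b):
        """Record that `a` and `b` denote the same thing."""
        self._parent[self._find(a)] = self._find(b)

    def assert_different(self, a, b):
        """Record that `a` and `b` denote distinct things."""
        self._different.append((a, b))

    def relation(self, a, b):
        """Return 'same', 'different' or 'unknown' for the pair."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a == root_b:
            return "same"
        for x, y in self._different:
            if {self._find(x), self._find(y)} == {root_a, root_b}:
                return "different"
        return "unknown"


index = SameAsIndex()
before = index.relation("musicbrainz:Bono", "dbpedia:Bono")   # nothing asserted yet
index.assert_same("musicbrainz:Bono", "dbpedia:Bono")
after = index.relation("musicbrainz:Bono", "dbpedia:Bono")
```

Here `before` is "unknown" and `after` is "same"; equality asserted through a third identifier would propagate in the same way.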
…but fuzzy concepts have caveats
Group entity “Mourners of Felix Mendelssohn” (attending the arrival of his body in Berlin)
See http://led.kmi.open.ac.uk/entity/lexp/1434029100189
• The identifier of the group, http://data.open.ac.uk/led/agent/Mourners+of+Felix+Mendelssohn/1434029100190, should not be reused when modelling an entry about Mendelssohn’s funeral service.
See http://led.kmi.open.ac.uk/entity/lexp/1434029387526
– Identifier of the group is http://data.open.ac.uk/led/person/Mourners+at+the+Funeral+Service+of+Felix+Mendelssohn/1434029247629
Blank nodes
• Fallback mechanism for providing data about objects without having a naming convention for them.
• Reference something not by name, but by description.
• Example:
:performance/Messiah/12345 mo:listener [ a foaf:Group ; dc:description "Foreign ambassadors" ; :occupation dbpedia:Ambassador ] .
• Generally not an advisable solution:
– Cannot perform matching on blank nodes
– Querying or detecting changes in the data is much harder
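The matching problem can be illustrated without any RDF library. The sketch below assumes a parser that hands out a fresh, document-local label for every blank node, which is how blank nodes are scoped; the function names, the plain-tuple triples and the `agent/Foreign+ambassadors/1` identifier are all hypothetical. Loading the same description twice yields no shared triples with blank nodes, while a minted identifier matches exactly.

```python
import itertools

_labels = itertools.count()


def fresh_bnode():
    """Return a new blank node label, valid only within one load."""
    return f"_:b{next(_labels)}"


def describe_group(node):
    """Triples describing the group of listeners, hung off `node`."""
    return {
        ("performance/Messiah/12345", "mo:listener", node),
        (node, "rdf:type", "foaf:Group"),
        (node, "dc:description", "Foreign ambassadors"),
    }


def load_with_bnode():
    """Simulate parsing the data with the group as a blank node."""
    return describe_group(fresh_bnode())


def load_with_uri():
    """Simulate parsing the data with the group as a minted identifier."""
    return describe_group("agent/Foreign+ambassadors/1")  # hypothetical URI


bnode_overlap = load_with_bnode() & load_with_bnode()
uri_overlap = load_with_uri() & load_with_uri()
```

Graph toolkits can still compare such data up to blank-node renaming, but that is a structural comparison of whole graphs rather than a lookup by name, which is why change detection becomes harder.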
Ontological classes
• Model vague objects as formally-specified categories rather than named entities
• e.g. “the class of all people whose occupation is Ambassador and who were at the Royal Albert Hall on May 12, 1876”
• Pros:
– Allows separation of “known” and “generic” entities
– Semantically cleaner and easier to store and manage
• Cons:
– Still need to make URIs for each class
– They have to be instantiated before they can be used in a listening experience
– Harder to apply changes to the data without fixed classes
Countermeasures in LED
• No blank nodes
• For unaligned semantic layers (cf. example on instruments and occupations):
– Use lax model properties
– Enforce reuse of external taxonomies
• ‘rich’ real-time recommendations
Countermeasures in LED
Data reconciliation
Currently with restricted access, but there are plans to open it up to crowd-sourcing
Countermeasures in LED
• Ad-hoc formal models for underspecified data.
• Example: Extended Date/Time Format (standard draft, Library of Congress, 2012)
– Allows formalisation of underspecified points in time and intervals, e.g. “187u-05-uu”
– We extended it to support subjective fuzzy intervals (e.g. early/mid/late) and ranges (from-to)
– Made available in RDF through data.open.ac.uk
• Example 2: GeoSPARQL
– Used to support geospatial queries in Linked Data
– Named entity recognition on arbitrary text for locations (recently)
– We compute location URIs by hashing their descriptions and all the locations extracted from them and related via geosparql:sfIntersects
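Both examples can be sketched in a few lines. The first function below reads a date pattern such as “187u-05-uu”, where `u` marks an unspecified digit, and returns the earliest and latest calendar dates it could denote; it covers only the plain `YYYY-MM-DD` shape and not the early/mid/late or from-to extensions mentioned above. The second function derives a stable identifier from a location description and the locations extracted from it. The function names, the choice of SHA-1, the sorting of extracted locations and the `http://example.org/location/` base are assumptions for illustration, not the scheme LED actually uses.

```python
import calendar
import hashlib
from datetime import date


def _candidates(pattern, low, high):
    """Integers in [low, high] whose zero-padded digits match `pattern`."""
    width = len(pattern)
    return [
        n for n in range(low, high + 1)
        if all(p in ("u", d) for p, d in zip(pattern, str(n).zfill(width)))
    ]


def edtf_bounds(value):
    """Earliest and latest dates matching a 'YYYY-MM-DD' pattern with 'u' digits."""
    year_p, month_p, day_p = value.split("-")
    if (len(year_p), len(month_p), len(day_p)) != (4, 2, 2):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    years = _candidates(year_p, 1, 9999)
    months = _candidates(month_p, 1, 12)
    days = _candidates(day_p, 1, 31)

    def first_valid(ys, ms, ds):
        # The first combination that is a real calendar date, in the order given.
        for y in ys:
            for m in ms:
                for d in ds:
                    if d <= calendar.monthrange(y, m)[1]:
                        return date(y, m, d)
        raise ValueError(f"no calendar date matches {value!r}")

    earliest = first_valid(years, months, days)
    latest = first_valid(years[::-1], months[::-1], days[::-1])
    return earliest, latest


def location_uri(description, extracted, base="http://example.org/location/"):
    """Hash a description plus its extracted locations into a URI (hypothetical scheme)."""
    payload = "\n".join([description.strip()] + sorted(extracted))
    return base + hashlib.sha1(payload.encode("utf-8")).hexdigest()


bounds = edtf_bounds("187u-05-uu")
uri = location_uri("a concert hall in South London",
                   ["http://dbpedia.org/resource/London"])
```

For “187u-05-uu”, `bounds` spans 1 May 1870 to 31 May 1879. Hashing the description together with its extracted locations means the same vague place, described the same way, always receives the same identifier.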
How thick is the mist in LED?
                Named   Vague                          Total
Participants    802     260                            1062
Locations       136     15 (cannot pinpoint)           151*
Times           826     843 (ranges, not qualified)    1669
Musical works   1550    1263                           2813
(*) since database opened to arbitrary experience locations
Figures for LED public dataset
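To read the figures as proportions, the share of vague entities per category can be computed directly from the named and vague counts. The snippet below is a small worked calculation on the numbers in the table; the variable and function names are chosen for this example.

```python
# (named, vague) counts per category, taken from the table above.
counts = {
    "Participants": (802, 260),
    "Locations": (136, 15),
    "Times": (826, 843),
    "Musical works": (1550, 1263),
}


def vague_share(named, vague):
    """Percentage of entities in a category that are vague, to one decimal."""
    return round(100 * vague / (named + vague), 1)


shares = {category: vague_share(*pair) for category, pair in counts.items()}
```

This gives about 24.5% for participants, 9.9% for locations, 50.5% for times and 44.9% for musical works, so time expressions are the haziest category in the public dataset.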
Lessons learnt
• Advantages
– Open-world semantics: minimise risk of ambiguities generated by name clashes, allows for coherent management
– Monotonic: data are refined by addition of facts
– Can be reasoned upon by machine-learning agents working on the native data structure
– Incorporates reuse for the benefit of the whole data cloud.
• Disadvantages
– No reuse entails heavy replication
– Data cleansing may require a large context for detecting entities that can be reconciled
Lessons learnt
• Most, if not all, representational issues with vagueness can be addressed in LD without resorting to blank nodes and without introducing ambiguity.
– Way more powerful than traditional database systems.
• Data providers have yet to reach even a tacit agreement on:
– representational paradigms for entities commonly at risk of underspecification, such as spatio-temporal ones;
– how to name their objects.
• Most are making it easy for themselves when it comes to LD
• The way to go is de facto standards
Where to go next
• Model ontological classes as their instances (equivalence classes?)
• Increase context for fact-based data alignment (opening reconciliation facilities to the public – with voting?)
• Argumentation on every statement in LED
• Dissemination of controlled vocabularies and naming conventions for managed vague entities.
Are Linked Data mature for representing vagueness?
• The technology is.
• The data out there aren’t.
– (but that is the part that can be improved)
Further reading
• Eero Hyvönen, Publishing and Using Cultural Heritage Linked Data on the Semantic Web (Morgan & Claypool, 2012)
• Daniel J. Lewis and Trevor P. Martin, Managing Vagueness with Fuzzy in Hierarchical Big Data. In 2015 INNS Conference on Big Data (Elsevier, 2015), Procedia Computer Science, Vol. 53, pp. 19-28
• Elie Sanchez (ed.), Fuzzy Logic and the Semantic Web (Elsevier, 2006)
Thank you!
Q&A time