Linkage in Haze

Challenges and take-home messages of crowd-sourcing vagueness in musical data

Alessandro Adamou

Listening to music: people, practices and experiences

Sunday, October 25, 2015

What we capture in a listening experience

[Diagram: a listening experience annotated with who, what, when, where and how, plus what they said and where it was documented]

A well-formed listening experience

On July 19, 2014, Leonie Holmes (a professor of Music in New Zealand)*

was listening to Richard Strauss’ “Don Juan” and “Also sprach Zarathustra” and Sarah Ballard’s “Synergos”

played by the NZSO National Youth Orchestra and Alexander Shelley

using harp and double bass (+ others?)

in the Aotea Centre.

(*) plus a generic public, which does not pose a problem in the representation.

A worst-case (but more likely) listening experience

One evening between May and September in the late 1950s, a group of war veterans, a reporter and an unknown female were listening to Chopin and an anthem played by a string orchestra in a concert hall in South London.

Goal: to capture both fact sets as structured data

(interconnected or prepared for refinement)

Other issues

• Source
– unpublished manuscripts
• Unaligned semantic layers
– instrument category | instrument name | brand and model
– generic occupations | gender-dependent | personal titles

Occupations: Monarch → King (Queen), Emperor (Empress)… → King of England, fourth Sultan of Zanzibar…

Instruments: Chords → Electric Guitar → Gibson Les Paul Custom Sunburst

Factors contributing to fuzziness

• Domain knowledge of the evidence author(s)

• Deterioration of evidence

• Crowd-sourcing community issues:
– Misaligned semantics
– Varying scholarly rigour
– Popularity of the domain of interest

Data representation in LED

• Linked [Open] Data (http://linkeddata.org)

– Formalism for machine-readable and human-readable data

– Object identifiers are URIs

– Standard representation and query languages (RDF, SPARQL…)

– The meaning of links between objects is globally understood.


Identity in Linked Data

• http://musicbrainz.org/artist/f5aca88c-e3c1-4bc2-af33-68a9a9f7b56a#_ – (the band Killing Joke, as in MusicBrainz)

• http://bnb.data.bl.uk/id/agent/DailyMirror – (The Daily Mirror on the British National Bibliography)

• http://dbpedia.org/resource/London – (London, as in Wikipedia/DBpedia)

• http://reference.data.gov.uk/doc/day/2015-10-23 – (last Friday, as in the UK Government Calendar data)

• http://led.kmi.open.ac.uk/term/Medium.Live – (the concept of live music, as in LED)


Easy: these are all named entities…










Goal: to capture both fact sets as linked data

There are no right or wrong ways to do it, only linkable or unlinkable.

Linked Data encourage reuse…

No two things are either distinct or equal until some LD node asserts or implies otherwise.

– e.g. Bono on MusicBrainz and Bono on DBpedia
– Groups too, if it can be demonstrated that they are an exact match
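The "neither distinct nor equal until asserted" principle can be illustrated with a minimal sketch. It assumes that equality assertions arrive as pairs of URIs (in the spirit of owl:sameAs links) and that anything not connected by such links is simply unknown. The three URIs below are hypothetical placeholders, not the real MusicBrainz or DBpedia identifiers.

```python
def build_relation(same_as_pairs):
    """Return a function that reports what is known about two URIs."""
    parent = {}

    def find(uri):
        # Union-find lookup with path halving; unseen URIs are their own root.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        parent[find(left)] = find(right)

    def relation(a, b):
        # Open world: the absence of a link means "unknown", not "different".
        return "same" if find(a) == find(b) else "unknown"

    return relation


# Hypothetical identifiers, for illustration only.
BONO_A = "http://example.org/source-a/bono"
BONO_B = "http://example.org/source-b/Bono"
BONO_C = "http://example.org/source-c/bono"
BONO_D = "http://example.org/source-d/bono"

relation = build_relation([(BONO_A, BONO_B), (BONO_B, BONO_C)])
asserted = relation(BONO_A, BONO_B)   # directly asserted
implied = relation(BONO_A, BONO_C)    # implied through BONO_B
unstated = relation(BONO_A, BONO_D)   # nothing links them
```

Here `asserted` and `implied` are both "same", while `unstated` stays "unknown" until some dataset adds a link.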

…but fuzzy concepts have caveats

Group entity “Mourners of Felix Mendelssohn” (attending the arrival of his body in Berlin)

See http://led.kmi.open.ac.uk/entity/lexp/1434029100189

• The identifier of the group, http://data.open.ac.uk/led/agent/Mourners+of+Felix+Mendelssohn/1434029100190, should not be reused when modelling an entry about Mendelssohn’s funeral service.

See http://led.kmi.open.ac.uk/entity/lexp/1434029387526

– Identifier of the group is http://data.open.ac.uk/led/person/Mourners+at+the+Funeral+Service+of+Felix+Mendelssohn/1434029247629


Blank nodes

• Fallback mechanism for providing data about objects without having a naming convention for them.
• Reference something not by name, but by description.
• Example:

    :performance/Messiah/12345 mo:listener [
        a foaf:Group ;
        dc:description "Foreign ambassadors" ;
        :occupation dbpedia:Ambassador
    ] .

• Generally not an advisable solution:
– Cannot perform matching on blank nodes
– Querying or detecting changes in the data is much harder
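The matching problem can be shown with a small sketch. It models triples as plain Python tuples and a blank node as an object that is equal only to itself. This is an illustration of the idea, not of how any particular RDF store works. The `BlankNode` class, the `describe_ambassadors` helper, the `led:occupation` property and the example.org URI are all hypothetical.

```python
import itertools

_labels = itertools.count()


class BlankNode:
    """A node without a global name: it is equal only to itself."""

    def __init__(self):
        self.label = next(_labels)  # local label, meaningless outside this run

    def __repr__(self):
        return f"_:b{self.label}"


def describe_ambassadors(subject):
    """Same three facts every time; only the subject node varies."""
    return {
        (subject, "rdf:type", "foaf:Group"),
        (subject, "dc:description", "Foreign ambassadors"),
        (subject, "led:occupation", "dbpedia:Ambassador"),
    }


# Two contributors describe the same group with blank nodes.
merged_blank = describe_ambassadors(BlankNode()) | describe_ambassadors(BlankNode())

# Two contributors reuse one (hypothetical) URI instead.
GROUP_URI = "http://example.org/group/foreign-ambassadors"
merged_named = describe_ambassadors(GROUP_URI) | describe_ambassadors(GROUP_URI)
```

Merging the blank-node descriptions yields six triples about two unrelated nodes, whereas the URI-based descriptions collapse into three triples about one node.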

Ontological classes

• Model vague objects as formally-specified categories rather than named entities
• e.g. “the class of all people whose occupation is Ambassador and who were at the Royal Albert Hall on May 12, 1876”
• Pros:
– Allows separation of “known” and “generic” entities
– Semantically cleaner and easier to store and manage
• Cons:
– Still need to make URIs for each class
– They have to be instantiated before they can be used in a listening experience
– Harder to apply changes to the data without fixed classes
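A class of this kind can be thought of as a named membership condition. The sketch below is a loose Python analogy rather than an OWL definition: the class still needs its own URI (hypothetical here), and it is only useful once individuals are checked against it. The `Attendance` record and the sample people are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Attendance:
    person: str
    occupation: str
    venue: str
    day: date


# Hypothetical URI: the class needs a name even though its members do not.
AMBASSADORS_CLASS = "http://example.org/class/ambassadors-at-royal-albert-hall-1876-05-12"


def is_member(record):
    """Membership condition taken from the example on the slide."""
    return (
        record.occupation == "Ambassador"
        and record.venue == "Royal Albert Hall"
        and record.day == date(1876, 5, 12)
    )


# Invented records, for illustration only.
records = [
    Attendance("Person A", "Ambassador", "Royal Albert Hall", date(1876, 5, 12)),
    Attendance("Person B", "Reporter", "Royal Albert Hall", date(1876, 5, 12)),
    Attendance("Person C", "Ambassador", "Royal Albert Hall", date(1876, 5, 13)),
]

members = [r.person for r in records if is_member(r)]
```

Only "Person A" satisfies the condition, which mirrors the con listed above: the class is empty until someone instantiates it.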

Countermeasures in LED

• No blank nodes
• For unaligned semantic layers (cf. example on instruments and occupations):
– Use lax model properties
– Enforce reuse of external taxonomies
• ‘Rich’ real-time recommendations

Countermeasures in LED

Data reconciliation

Currently with restricted access, but plans to open to crowd-sourcing

Countermeasures in LED

• Ad-hoc formal models for underspecified data.
• Example: Extended Date/Time Format (draft standard, Library of Congress, 2012)
– Allows formalisation of underspecified points in time and intervals, e.g. “187u-05-uu”
– We extended it to support subjective fuzzy intervals (e.g. early/mid/late) and ranges (from-to)
– Made available in RDF through data.open.ac.uk
• Example 2: GeoSPARQL
– Used to support geospatial queries in Linked Data
– Named entity recognition on arbitrary text for locations (recently)
– We compute location URIs by hashing their descriptions and all the locations extracted from them and related via geosparql:sfIntersects
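Both countermeasures can be sketched in a few lines. The first function reads a date such as “187u-05-uu”, where `u` stands for an unknown digit, and returns the earliest and latest calendar dates it could denote. This is a bounding interval only and does not cover the LED extensions for early/mid/late. The second function derives a stable URI from a location description and the locations extracted from it. The slides do not say which hash or normalisation LED uses, so SHA-1, the lower-casing, and the example.org base URI are assumptions.

```python
import calendar
import hashlib
from datetime import date
from typing import List, Tuple


def _bounds(field: str, low: int, high: int) -> Tuple[int, int]:
    """Smallest and largest values a field with 'u' digits can take, clamped."""
    smallest = max(int(field.replace("u", "0")), low)
    largest = min(int(field.replace("u", "9")), high)
    return smallest, largest


def date_bounds(value: str) -> Tuple[date, date]:
    """Earliest and latest dates matching a YYYY-MM-DD string with 'u' digits."""
    year, month, day = value.split("-")
    y_lo, y_hi = _bounds(year, 1, 9999)
    m_lo, m_hi = _bounds(month, 1, 12)
    d_lo, _ = _bounds(day, 1, 31)
    # The last possible day depends on the latest possible year and month.
    _, d_hi = _bounds(day, 1, calendar.monthrange(y_hi, m_hi)[1])
    return date(y_lo, m_lo, d_lo), date(y_hi, m_hi, d_hi)


def location_uri(description: str, extracted: List[str]) -> str:
    """Hash a description plus its extracted locations into a URI (assumed scheme)."""
    key = "\n".join([description.strip().lower()] + sorted(extracted))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return "http://example.org/location/" + digest  # hypothetical base URI


earliest, latest = date_bounds("187u-05-uu")
uri = location_uri("a concert hall in South London",
                   ["http://dbpedia.org/resource/London"])
```

Sorting the extracted locations before hashing means the same description always yields the same URI regardless of extraction order, which is what makes the identifier reusable across entries.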

How thick is the mist in LED?

Category | Named | Vague | Total
Participants | 802 | 260 | 1062
Locations | 136 | 15 (cannot pinpoint) | 151*
Times | 826 | 843 (ranges, not qualified) | 1669
Musical works | 1550 | 1263 | 2813

(*) since database opened to arbitrary experience locations

Figures for LED public dataset
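To read the figures as proportions, the short calculation below divides the vague count by the total for each category. The numbers are copied from the table above; the variable names are ours.

```python
# (named, vague, total) per category, as reported for the LED public dataset.
figures = {
    "Participants": (802, 260, 1062),
    "Locations": (136, 15, 151),
    "Times": (826, 843, 1669),
    "Musical works": (1550, 1263, 2813),
}

# Percentage of entries in each category that are vague, to one decimal place.
vague_share = {
    category: round(100 * vague / total, 1)
    for category, (named, vague, total) in figures.items()
}

for category, share in vague_share.items():
    print(f"{category}: {share}% vague")
```

About half of the times and close to 45% of the musical works are vague, against roughly a quarter of the participants and a tenth of the locations.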

Lessons learnt

• Advantages
– Open-world semantics: minimise risk of ambiguities generated by name clashes, allows for coherent management
– Monotonic: data are refined by addition of facts
– Can be reasoned upon by machine-learning agents working on the native data structure
– Incorporates reuse for the benefit of the whole data cloud.
• Disadvantages
– No reuse entails heavy replication
– Data cleansing may require a large context for detecting entities that can be reconciled

Lessons learnt

• Most, if not all, representational issues with vagueness can be addressed in LD without resorting to blank nodes and safe from ambiguity.
– Far more powerful than traditional database systems.
• Data providers have yet to reach an agreement, even a tacit one, on:
– representational paradigms for entities commonly at risk of underspecification, such as spatio-temporal ones;
– how to name their objects.
• Most are making it easy for themselves when it comes to LD
• The way to go is de facto standards

Where to go next

• Model ontological classes as their instances (equivalence classes?)
• Increase context for fact-based data alignment (opening reconciliation facilities to the public – with voting?)
• Argumentation on every statement in LED
• Dissemination of controlled vocabularies and naming convention for managed vague entities.

Are Linked Data mature for representing vagueness?

• The technology is.
• The data out there aren’t.

– (but that is the part that can be improved)

Further reading

• Eero Hyvönen, Publishing and Using Cultural Heritage Linked Data on the Semantic Web (Morgan & Claypool, 2012)
• Daniel J. Lewis and Trevor P. Martin, Managing Vagueness with Fuzzy in Hierarchical Big Data. In 2015 INNS Conference on Big Data (Elsevier, 2015), Procedia Computer Science, Vol. 53, pp. 19-28
• Elie Sanchez (ed.), Fuzzy Logic and the Semantic Web (Elsevier, 2006)

Thank you!

Q&A time

