Quantifying the bias in data links Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta

Upload: i-tiddi

Post on 08-Jul-2015

DESCRIPTION

An approach to identify how much a Linked Data dataset is biased, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden

TRANSCRIPT

Page 1: Quantifying the bias in data links

Quantifying the bias in data links

Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta

Page 2: Quantifying the bias in data links

The problem

• Linked Data datasets are biased

• Bias = the information is unevenly distributed

• To detect such a bias, the information distribution in the dataset should be compared to an unbiased one (ground truth), which is not available

• Our proposal is to use information coming from the connected datasets to approximate such a comparison

Page 3: Quantifying the bias in data links

Is bias a problem?

• LMDB is biased towards old movies (i.e., it mostly contains information about old movies)

• A recommender system would therefore produce results biased towards old movies

• There is a need to identify this bias

• to properly assess the results of Linked Data systems and

• to compensate for the bias.

Page 4: Quantifying the bias in data links

Motivation

• Dedalo: using Linked Data to explain patterns

• Pattern

• Students of the Open University enrol in Health&Social Care courses more often around Manchester than in other places

• Explanation

• Health&Social Care courses are popular in Manchester because it is in the Northern Hemisphere

• In DBpedia, the incompleteness of information about place locations is unevenly distributed, i.e. there is a bias

Page 5: Quantifying the bias in data links

Identifying the bias

• Measure how much a dataset is biased when compared to another one

• Use the projection of the dataset into its connecting dataset D

• compare the distribution of property values for the entities in D

• with the one for the entities in S (the dataset projection)

[Figure: the dataset is linked to D through owl:sameAs, rdfs:seeAlso, skos:exactMatch, …; S is the projection of the dataset in D]
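The projection step can be sketched as follows. This is a minimal illustration only: the triples, the URIs and the helper name `projection` are hypothetical, and the set of linking predicates is an assumption based on the ones named on the slide.

```python
# Hypothetical link triples (subject, predicate, object); the URIs are made up.
LINK_PREDICATES = {"owl:sameAs", "rdfs:seeAlso", "skos:exactMatch"}


def projection(links, link_predicates=LINK_PREDICATES):
    """Return S: the entities of D that the dataset points to through a linking predicate."""
    return {obj for _, pred, obj in links if pred in link_predicates}


links = [
    ("lmdb:film/1", "owl:sameAs", "db:Metropolis_(1927_film)"),
    ("lmdb:film/2", "owl:sameAs", "db:Casablanca_(film)"),
    ("lmdb:film/2", "rdfs:label", "Casablanca"),  # not a link, ignored
]

S = projection(links)
```

The property values of the entities in S can then be compared with those of all entities in D.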

Page 6: Quantifying the bias in data links

Example: is LMDB biased?

• Compare dc:subject values for the entities in D and in S

LMDB is biased towards black and white movies

• Same for dbp:released

LMDB is biased towards older movies

Page 7: Quantifying the bias in data links

Bias detection proposition

• Use SPARQL to build pairs of value distributions in S and D

• Given

• two populations (values) and

• the same observation (RDF property)

dc:subject(D) = {dbCat:ScienceFictionMovies, dbCat:Black&WhiteMovies}

dc:subject(S) = {dbCat:Black&WhiteMovies}

• Use statistical t-tests, which are commonly used to compare observations
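One way to turn query results into the two populations is sketched below. It assumes the SPARQL bindings have already been fetched as (entity, value) pairs; the entity names, the helper names and the 0/1 encoding of each value are assumptions made for illustration, not something the slides specify.

```python
from collections import Counter


def value_distribution(rows, entities):
    """Count how often each value of one property occurs among `entities`.

    `rows` holds (entity, value) pairs, e.g. the bindings of a query such as
    SELECT ?e ?v WHERE { ?e dc:subject ?v }.
    """
    return Counter(value for entity, value in rows if entity in entities)


def indicator_samples(rows, entities, value):
    """One 0/1 observation per entity: does the entity carry `value`?"""
    having = {entity for entity, v in rows if v == value}
    return [1 if entity in having else 0 for entity in sorted(entities)]


BW = "dbCat:Black&WhiteMovies"
SCIFI = "dbCat:ScienceFictionMovies"
rows = [("m1", BW), ("m2", BW), ("m3", SCIFI), ("m4", SCIFI)]  # made-up data
D = {"m1", "m2", "m3", "m4"}
S = {"m1", "m2"}

dist_D = value_distribution(rows, D)
dist_S = value_distribution(rows, S)
bw_in_D = indicator_samples(rows, D, BW)
bw_in_S = indicator_samples(rows, S, BW)
```

With this toy data, dc:subject(S) contains only the black and white category while dc:subject(D) contains both, mirroring the example on the slide.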

Page 8: Quantifying the bias in data links

T-tests of statistical significance

• A t-test checks whether there is a significant difference between two populations

• it calculates the probability p of observing such a difference by chance alone

• State a null hypothesis (i.e. the difference is due to chance)

• there is no bias in a property

• and an alternative hypothesis (the one you want to prove)

• there is bias in a property

• If p is below 0.05, one can reject the null hypothesis

• the lower p, the more the property is biased

• Rank the properties according to p to find the most biased ones
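The test-and-rank procedure could look roughly like the sketch below. Two things are assumptions rather than the authors' stated method: it uses Welch's variant of the t-test, and it approximates the two-sided p-value with a standard normal distribution instead of the exact t-distribution, which is only reasonable for large samples. The property dbp:runtime and all numbers are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / se


def approx_p(t):
    """Two-sided p-value using a normal approximation (an assumption, see above)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def rank_by_bias(samples):
    """Map {property: (values_in_S, values_in_D)} to [(property, p)], lowest p first."""
    scored = [(prop, approx_p(welch_t(s, d))) for prop, (s, d) in samples.items()]
    return sorted(scored, key=lambda item: item[1])


# Made-up observations for two properties.
samples = {
    "dbp:runtime": ([90, 100, 110, 95], [92, 101, 108, 97, 99, 94]),
    "dbp:released": ([1931, 1934, 1938, 1940], [1931, 1975, 1999, 2008, 1962, 1940]),
}
ranking = rank_by_bias(samples)
```

Here dbp:released comes first in the ranking because the release years in S differ much more from those in D than the running times do.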

Page 9: Quantifying the bias in data links

Experiments and results

• 30 datasets and 54 pairs from the DataHub [1]

• Varying in the number of entities in S (from approx. 30 to 60,000)

• Varying in domain (multi-domain, biomedical, computer science, education, geography…)

[1] http://datahub.io/

Page 10: Quantifying the bias in data links

When results are expected…

• NLFinland, places in Finland (connected to DBpedia)

class     prop        value                        p
db:Place  dc:subject  db:CitiesAndTownsInFinland   p < 1.00e-15
db:Place  dbp:latd    (average) 40.5               p < 1.00e-15
db:Place  dbp:longd   (average) 24.6               p < 1.00e-15

• NLSpain, Spanish bibliographic data (connected to DBpedia)

class             prop             value       p
db:MusicalArtist  db:birthPlace    db:Spain    p < 1.13e-13
db:Writer         dbp:nationality  db:Spanish  p < 4.64e-03

Page 11: Quantifying the bias in data links

…when results are less expected

• Uniprot, biomedical data (connected to Bio2RDF/BioPax/DrugBank)

class       prop             value           p
up:Protein  up:isolatedFrom  uptissue:Brain  p < 1.33e-04

• RED, writers data (connected to DBpedia)

class     prop           value       p
db:Agent  db:genre       db:Novel    p < 1.00e-15
db:Agent  db:genre       db:Poetry   p < 1.00e-15
db:Agent  db:deathCause  db:Suicide  p < 1.00e-15

Page 12: Quantifying the bias in data links

Conclusions and future work

• The importance of identifying the bias in a dataset

• Approach:

• with information from the connected datasets

• statistical t-tests on the distributions of the values of a property

• ranking properties based on the probability of being biased

• Evaluating Dedalo’s performance on Google Trends

Please participate!

http://linkedu.eu/dedalo/eval/

Page 13: Quantifying the bias in data links

ilaria.tiddi@open.ac.uk

@IlaTiddi

http://linkedu.eu/dedalo/eval/

Thank you for your attention

Questions?

Page 14: Quantifying the bias in data links

Dedalo: explaining clusters with Linked Data

• Linked Data are a graph

• nodes : URIs

• edges : RDF properties

• Some nodes walk to the same node

Walk = a chain of RDF properties

• Walks can be an explanation for the cluster

ExplC = a chain of properties and one final entity
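The idea of items sharing a walk can be sketched as below. This is not Dedalo's implementation: the adjacency-list graph, the URIs and the helper names are hypothetical, and the walk length is simply capped.

```python
from collections import defaultdict


def walks_from(graph, start, max_len):
    """Yield (chain_of_properties, end_node) for every walk of 1..max_len edges."""
    frontier = [((), start)]
    for _ in range(max_len):
        next_frontier = []
        for chain, node in frontier:
            for prop, target in graph.get(node, []):
                step = (chain + (prop,), target)
                yield step
                next_frontier.append(step)
        frontier = next_frontier


def shared_walks(graph, cluster, max_len=2):
    """Map each (chain, end_node) to the cluster items that reach it, if more than one does."""
    reached = defaultdict(set)
    for item in cluster:
        for walk in walks_from(graph, item, max_len):
            reached[walk].add(item)
    return {walk: items for walk, items in reached.items() if len(items) > 1}


# Toy graph: node -> list of (RDF property, target node); all URIs are made up.
graph = {
    "ex:m1": [("dc:subject", "dbCat:SpaceOperaFilms")],
    "ex:m2": [("dc:subject", "dbCat:TimeTravelFilms")],
    "dbCat:SpaceOperaFilms": [("skos:broader", "dbCat:ScienceFictionFilms")],
    "dbCat:TimeTravelFilms": [("skos:broader", "dbCat:ScienceFictionFilms")],
}
shared = shared_walks(graph, ["ex:m1", "ex:m2"])
```

In this toy graph the only shared walk is the chain dc:subject, skos:broader ending at the science fiction category, which is a candidate explanation for the cluster.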

Page 15: Quantifying the bias in data links

Dedalo: explaining clusters with Linked Data

A* iterative search

Entropy to drive the search expanding the graph

Improving the F-score of ExplC at each iteration

ExplC = “movies whose subject is a subcategory of Science Fiction”
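Scoring a candidate explanation against a cluster can be illustrated as follows. The slides do not give the exact formula, so the balanced F1 measure used here is an assumption, and the item names are made up.

```python
def f_score(covered, cluster):
    """F1 of an explanation, where `covered` are the items the explanation applies to."""
    covered, cluster = set(covered), set(cluster)
    hits = len(covered & cluster)
    if hits == 0:
        return 0.0
    precision = hits / len(covered)  # share of covered items that are in the cluster
    recall = hits / len(cluster)     # share of the cluster that is covered
    return 2 * precision * recall / (precision + recall)


cluster = {"m1", "m2", "m3"}
covered_by_explanation = {"m1", "m2", "m4"}  # hypothetical coverage of one ExplC
score = f_score(covered_by_explanation, cluster)
```

A search that keeps the explanation with the highest score after each expansion of the graph would match the "improving the F-score at each iteration" idea in spirit.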

Page 16: Quantifying the bias in data links

Knowledge Discovery

The process of identifying patterns in data [1]

Patterns are usually interpreted by the experts

Linked Data can be used to automatically interpret patterns

open, shared, multi-domain, connected knowledge

[Figure: raw data → clean data → patterns → knowledge]

[1] Fayyad, 1998.

Page 17: Quantifying the bias in data links

Contribution

A recommender system based on DBpedia (any kind of movies)

DBpedia is linked to the Linked Movies Database (1930s movies)

The recommendation might be compromised

Need to identify the bias when producing Linked Data systems

We propose a process to identify and measure the bias based on statistical methods

Page 18: Quantifying the bias in data links

Motivation

• Students are interested in Health&Social Care since they live in the Northern Hemisphere

• What about the other counties?

• are they connected to the “Northern Hemisphere” entity?

• There must be a bias: the information is unevenly distributed

• Solution: weighting properties to rebalance the unevenness
