Quantifying the bias in data links

Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta

The problem

• Linked Data datasets are biased

• Bias = the information is unevenly distributed

• To detect such a bias, the information distribution in the dataset should be compared to an unbiased one (ground truth), which is not available

• Our proposal is to use information coming from the connected datasets to approximate such a comparison

• LMDB is biased towards old movies (i.e., it mostly contains information about old movies)

• A recommender system would therefore produce results biased towards old movies

• This bias needs to be identified

• to properly assess the results of Linked Data systems and

• to compensate for the bias.

Is bias a problem?

Motivation

• Dedalo: using Linked Data to explain patterns

• Pattern

• Students of the Open University enroll in Health&Social Care courses more often around Manchester than in other places

• Explanation

• Health&Social Care courses are popular in Manchester because it is in the Northern Hemisphere

• In DBpedia, location information about places is incomplete, and the incompleteness is unevenly distributed, i.e. there is a bias

• Measure how much a dataset is biased when compared to another one

• Use the projection S of the dataset into its connected dataset D

• compare the distribution of property values for the entities in D

• with that for the entities in S (the dataset projection)

[Figure: the dataset projected as S into the connected dataset D through links such as owl:sameAs, rdfs:seeAlso, skos:exactMatch, …]

Identifying the bias

Dataset

• Compare dc:subject values for the entities in D and in S

LMDB is biased towards black and white movies

• Same for dbp:released

LMDB is biased towards older movies

Example: is LMDB biased?

• Use SPARQL to build pairs of value distributions in S and D

• Given

• two populations (values) and

• the same observation (RDF property)

dc:subject(D) = {dbCat:ScienceFictionMovies,dbCat:Black&WhiteMovies}

dc:subject(S) = {dbCat:Black&WhiteMovies}

• Use statistical t-tests, commonly used to compare observations

Bias detection proposition
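A minimal sketch of how the two value distributions could be built. It is not the authors' code: the triples, the set of linked entities and the helper names (distribution_query, value_distribution) are hypothetical, and the query string assumes that prefixes such as dc: and owl: are declared by the caller.

```python
from collections import Counter

# Hypothetical (entity, property, value) triples standing in for the
# connected dataset D; a real run would retrieve them with SPARQL.
TRIPLES_D = [
    ("db:Metropolis", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:Metropolis", "dc:subject", "dbCat:ScienceFictionMovies"),
    ("db:Casablanca", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:Alien", "dc:subject", "dbCat:ScienceFictionMovies"),
]
# Hypothetical entities of D that the dataset links to (the projection S).
ENTITIES_S = {"db:Metropolis", "db:Casablanca"}


def distribution_query(prop, link_prop=None):
    """Build a SPARQL query string counting entities per value of `prop`.

    With `link_prop` set, only entities that are the target of such a link
    are counted (S); otherwise every entity in D is counted.
    Prefix declarations are assumed to be added by the caller.
    """
    link = f"?source {link_prop} ?entity . " if link_prop else ""
    return (
        "SELECT ?value (COUNT(DISTINCT ?entity) AS ?n) "
        f"WHERE {{ {link}?entity {prop} ?value . }} "
        "GROUP BY ?value"
    )


def value_distribution(triples, prop, entities=None):
    """Count the values of `prop`, optionally restricted to `entities`."""
    counts = Counter()
    for entity, p, value in triples:
        if p == prop and (entities is None or entity in entities):
            counts[value] += 1
    return counts


dist_d = value_distribution(TRIPLES_D, "dc:subject")
dist_s = value_distribution(TRIPLES_D, "dc:subject", ENTITIES_S)
```

In this toy data, science fiction movies are counted twice in D but only once in S, which is the kind of difference the t-tests are meant to assess.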

• Test whether there is a significant difference between two populations

• calculate the probability p that the difference is due to chance

• state a null hypothesis (i.e. the difference is due to chance)

• there is no bias in a property

• and an alternative hypothesis (the one you want to prove)

• there is bias in a property

• if p is below 0.05, the null hypothesis can be rejected

• the lower p, the more biased the property

• Rank the properties according to p to find the most biased ones

T-Tests of statistical significance
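A minimal sketch of the test and the ranking, under stated assumptions. The slides do not say which t-test variant is used, so Welch's unequal-variance form is an assumption here; the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples; each categorical value is encoded as a 0/1 indicator per entity; and all sample data and function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_s, sample_d):
    """Welch's t statistic for two samples (at least two points each)."""
    std_err = math.sqrt(
        variance(sample_s) / len(sample_s) + variance(sample_d) / len(sample_d)
    )
    diff = mean(sample_s) - mean(sample_d)
    if std_err == 0.0:
        # Both samples are constant: no evidence unless the means differ.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / std_err


def approx_p_value(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def rank_by_bias(samples):
    """Return (p, property) pairs, most biased (lowest p) first."""
    return sorted(
        (approx_p_value(welch_t(in_s, in_d)), prop)
        for prop, (in_s, in_d) in samples.items()
    )


# Hypothetical indicators: 1 if the entity has the value, 0 otherwise.
samples = {
    "dc:subject=dbCat:Black&WhiteMovies": ([1] * 45 + [0] * 5, [1] * 20 + [0] * 80),
    "dc:subject=dbCat:ScienceFictionMovies": ([1] * 10 + [0] * 40, [1] * 22 + [0] * 78),
}
ranked = rank_by_bias(samples)
```

With these made-up numbers the black-and-white value falls well below the 0.05 threshold and is ranked first, while the science fiction value does not allow the null hypothesis to be rejected. For small samples, an exact t distribution should replace the normal approximation.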

Experiments and results

• 30 datasets and 54 pairs from the DataHub [1]

• Varying in the number of entities in S (from approx. 30 to 60,000)

• Varying in domain (multi-domain, biomedical, computer science, education, geography…)

[1] http://datahub.io/

When results are expected…

• NLFinland, places in Finland (connected to DBpedia)

class      prop                  value                        p
db:Place   dc:subject            db:CitiesAndTownsInFinland   p < 1.00e-15
db:Place   dbp:latd (average)    40.5                         p < 1.00e-15
db:Place   dbp:longd (average)   24.6                         p < 1.00e-15

• NLSpain, bibliographic Spanish data (connected to DBpedia)

class              prop              value        p
db:MusicalArtist   db:birthPlace     db:Spain     p < 1.13e-13
db:Writer          dbp:nationality   db:Spanish   p < 4.64e-03

…when results are less expected

• Uniprot, biomedical data (connected to Bio2RDF/BioPax/DrugBank)

class        prop              value            p
up:Protein   up:isolatedFrom   uptissue:Brain   p < 1.33e-04

• RED, writers data (connected to DBpedia)

class      prop            value        p
db:Agent   db:genre        db:Novel     p < 1.00e-15
db:Agent   db:genre        db:Poetry    p < 1.00e-15
db:Agent   db:deathCause   db:Suicide   p < 1.00e-15

• The importance of identifying the bias in a dataset

• Approach:

• with information from the connected datasets

• statistical t-tests on the distributions of the values of a property

• ranking properties based on the probability of being biased

• Evaluating Dedalo’s performance on Google Trends

Please participate!

http://linkedu.eu/dedalo/eval/

Conclusions and future work

ilaria.tiddi@open.ac.uk

@IlaTiddi http://linkedu.eu/dedalo/eval/

Thank you for your attention

Questions?

Dedalo: explaining clusters with Linked Data

• Linked Data are a graph

• nodes : URIs

• edges : RDF properties

• Walks from some nodes lead to the same node

Walk = a chain of RDF properties

• Walks can be an explanation for the cluster

ExplC = a chain of properties and one final entity

Dedalo: explaining clusters with Linked Data

A* iterative search

Entropy to drive the search expanding the graph

Improving the F-score of ExplC at each iteration

ExplC = “movies whose subject is a subcategory of Science Fiction”
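A minimal sketch of how a candidate explanation (a walk plus a final entity) could be scored against a cluster with the F-score. This is not Dedalo's implementation: the A* search and the entropy heuristic are not shown, and the toy graph, the property names and the helper functions are hypothetical.

```python
# Hypothetical toy graph: node -> {property: [target nodes]}.
GRAPH = {
    "m:Alien": {"dc:subject": ["cat:SpaceHorror"]},
    "m:Solaris": {"dc:subject": ["cat:SovietSciFi"]},
    "m:Casablanca": {"dc:subject": ["cat:RomanceFilms"]},
    "cat:SpaceHorror": {"skos:broader": ["cat:ScienceFiction"]},
    "cat:SovietSciFi": {"skos:broader": ["cat:ScienceFiction"]},
    "cat:RomanceFilms": {"skos:broader": ["cat:Drama"]},
}


def follow(graph, start, walk):
    """Return the nodes reached from `start` by following the properties in `walk`."""
    frontier = {start}
    for prop in walk:
        frontier = {
            target
            for node in frontier
            for target in graph.get(node, {}).get(prop, [])
        }
    return frontier


def f_score(graph, items, cluster, walk, final_entity):
    """F-score of the explanation (walk, final_entity) for `cluster` within `items`."""
    covered = {i for i in items if final_entity in follow(graph, i, walk)}
    true_positives = len(covered & cluster)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(covered)
    recall = true_positives / len(cluster)
    return 2 * precision * recall / (precision + recall)


items = {"m:Alien", "m:Solaris", "m:Casablanca"}
cluster = {"m:Alien", "m:Solaris"}
short_walk = f_score(GRAPH, items, cluster, ["dc:subject"], "cat:SpaceHorror")
long_walk = f_score(
    GRAPH, items, cluster, ["dc:subject", "skos:broader"], "cat:ScienceFiction"
)
```

In this toy graph the longer walk covers the whole cluster and nothing else, so it scores higher than the one-step walk, which mirrors the idea of improving the F-score as the graph is expanded.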

Knowledge Discovery

The process of identifying patterns in data [1]

Patterns are usually interpreted by the experts

Linked Data can be used to automatically interpret patterns

open, shared, multi-domain, connected knowledge

[Figure: raw data → clean data → patterns → knowledge]

[1] Fayyad, 1998.

The bias needs to be identified when building Linked Data systems

We propose a process to identify and measure the bias based on statistical methods

Contribution

A recommender system based on DBpedia (any kind of movies)

DBpedia is linked to the Linked Movies Database (’30s movies)

The recommendation might be compromised

• Students are interested in Health&Social Care since they live in the Northern Hemisphere

• What about the other counties?

• are they connected to the “Northern Hemisphere” entity?

• There must be a bias: the information is unevenly distributed

• Solution: weighting properties to rebalance the unevenness

Motivation

