A “Quick and Dirty” Website Data Quality Indicator

Irit Askira Gelman, University of Arizona

Anthony L. Barletta, University of Arizona

October 30, 2008 WICOW 2008 2

Information quality on the web: DEBKAfile

An Israeli, Jerusalem-based website (www.debka.co.il) with commentary and analyses on terrorism, intelligence, security, and military and political affairs in the Middle East

According to DEBKAfile, over 1,000,000 viewers a week

Forbes' Best of The Web award: “Debkafile has been ahead of the pack often enough to suggest that the reporting is good.”

However, Forbes decries the fact that "most of the information is attributed to unidentified sources"

Has been criticized as a fringe outfit catering to conspiracy theorists. Some claim that the site relies on sources with an agenda, and that Israeli intelligence does not consider even 10% of the content reliable

The site's operators have claimed that 80% of the content turns out to be true


Information quality on the web: DEBKAfile

Content quality: highly current and up-to-date, but with deficiencies in accuracy, source reliability, and objectivity

Representation quality: spelling errors and various typos, very long sentences, grammatical errors, ..


Critical observation

Information quality deficiencies are often not isolated

Poor information quality control?


Website information quality assessment: Our approach (I)

Look for an easy-to-measure data quality facet

Use it as an indicator of aggregate data quality


Website information quality assessment: Our approach (I)

Focus on spelling errors as an indicator of aggregate data quality

Hypothesis 1: The spelling error rate of a document set is positively related to the aggregate data quality of the set


Related questions (I)

To what extent is a lower aggregate quality detected by the spelling error rate?

To what extent does a higher spelling error rate indicate a lower aggregate quality?

Are there significant variations across different settings?


Our approach (II): A “quick and dirty” indicator

Instead of an exhaustive spelling error check, focus on a minimal set of spelling errors, carefully chosen to fit the target document population

Use the hit count feature of a common search engine (e.g., Google) to assess the rate of the chosen spelling errors in the target population
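As a rough illustration of the query side of this idea, the sketch below builds one search query for a misspelling and one for its correct spelling. The slides do not say how queries were restricted to a target population; the use of the `site:` operator, the exact-phrase quoting, and the helper name `build_queries` are all assumptions made for this example. Retrieving the actual hit counts is left out, since that depends on the search engine's interface.

```python
from typing import Optional, Tuple


def build_queries(misspelling: str, correct: str,
                  site: Optional[str] = None) -> Tuple[str, str]:
    """Return (query for the misspelling, query for the correct spelling).

    Hypothetical helper: quoting each word asks for an exact match, and
    the optional `site:` restriction (an assumption, not from the slides)
    narrows the query to one website or domain.
    """
    def one_query(word: str) -> str:
        phrase = f'"{word}"'
        return f"{phrase} site:{site}" if site else phrase

    return one_query(misspelling), one_query(correct)


# Example: queries for one error/correct pair, restricted to one domain.
error_query, correct_query = build_queries("recieve", "receive", "arizona.edu")
```

The hit counts returned for the two queries would then feed whatever rate measure is chosen for the document set.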


A “quick and dirty” indicator: Initial implementation

10 common English spelling errors selected from the autocorrect word list of MS Office

Targets broad document populations

Google’s hit count


A “quick and dirty” indicator: Initial implementation

Spelling error    Correct spelling
Recieve           Receive
Accomodate        Accommodate
Accross           Across
Truely            Truly
Acheive           Achieve
Affraid           Afraid
Agressive         Aggressive
Appearence        Appearance
Tomorow           Tomorrow
Arguement         Argument


A “quick and dirty” indicator: Initial implementation

Indicator defined by:

\[
\mathrm{ErrorIndex}(e_j, d) = \frac{\mathrm{HitCount}(e_j, d)}{\mathrm{HitCount}(e_j, d) + \mathrm{HitCount}(c_j, d) + 1}
\]

\[
\mathrm{ErrorIndex}(d) = \mathrm{AVERAGE}\{\mathrm{ErrorIndex}(e_j, d) : j = 1, \dots, 10\}
\]

$e_j$, $j = 1, \dots, 10$, denotes the $j$th spelling error

$c_j$ denotes the correct spelling that matches $e_j$

$d$ denotes the document set
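The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the dictionary layout are assumptions, and the hit counts in the usage example are made-up numbers, not measurements.

```python
from statistics import mean
from typing import Dict


def error_index_for_pair(hits_error: int, hits_correct: int) -> float:
    """ErrorIndex(e_j, d): share of hits that go to the misspelling.

    The +1 in the denominator follows the slide's formula and keeps the
    ratio defined when both hit counts are zero.
    """
    return hits_error / (hits_error + hits_correct + 1)


def error_index(hit_counts: Dict[str, int], pairs: Dict[str, str]) -> float:
    """ErrorIndex(d): average of the per-pair indices over all pairs.

    `pairs` maps each misspelling e_j to its correct spelling c_j;
    `hit_counts` maps every word to its hit count in document set d.
    """
    return mean(
        error_index_for_pair(hit_counts[e], hit_counts[c])
        for e, c in pairs.items()
    )


# Usage with two of the ten pairs and invented hit counts.
pairs = {"recieve": "receive", "truely": "truly"}
hit_counts = {"recieve": 10, "receive": 89, "truely": 0, "truly": 50}
index = error_index(hit_counts, pairs)
```

With these invented counts the per-pair values are 10/100 and 0/51, so the index is their average. A larger index means a larger share of misspelled occurrences in the document set.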


Website information quality assessment: Our approach (II)

Hypothesis 2: The proposed indicator is positively related to the aggregate data quality of the document set


Related questions (II)

To what extent is a lower aggregate quality detected by this indicator?

… (see Questions I)

Spelling error set: what spelling errors to include? How many?

Hit count: is it reliable? How valid is it in measuring error rates?


Initial tests & results

We have conducted initial tests of hypothesis 1, hypothesis 2, & related questions

Askira Gelman, I. and Barletta, A. L. Initial Study of a “Quick and Dirty” Website Data Quality Index. ICIQ 2008.


Initial tests & results

To what extent does a higher spelling error rate indicate a lower aggregate quality?

Positive initial results on large websites & web domains (.gov sites, university sites, Wikipedia, and more)

Spelling error set: size can be increased; select the errors carefully to work around the search engine's lack of context sensitivity

Hit count: for higher reliability, conduct a series of measurements and remove outliers
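One way to read the last point is sketched below: take several hit-count readings for the same query, discard readings that sit far from the median, and average the rest. The slides do not name an outlier rule; the median-absolute-deviation (MAD) test, the threshold `k`, and the function name `robust_hit_count` are assumptions chosen for illustration.

```python
from statistics import mean, median
from typing import List


def robust_hit_count(samples: List[int], k: float = 3.0) -> float:
    """Average repeated hit-count readings after dropping outliers.

    A reading is kept if it lies within k median absolute deviations
    (MAD) of the median. This rule is an illustrative assumption.
    """
    if not samples:
        raise ValueError("need at least one hit-count reading")
    med = median(samples)
    mad = median([abs(x - med) for x in samples])
    if mad == 0:
        # More than half of the readings agree exactly; use that value.
        return float(med)
    kept = [x for x in samples if abs(x - med) <= k * mad]
    return mean(kept)


# Five invented readings for one query; the last one is a clear outlier.
readings = [100, 102, 98, 101, 5000]
estimate = robust_hit_count(readings)
```

For the invented readings the median is 101 and the MAD is 1, so the reading of 5000 is dropped and the remaining four are averaged.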