A “Quick and Dirty” Website Data Quality Indicator
Irit Askira Gelman, University of Arizona
Anthony L. Barletta, University of Arizona
October 30, 2008 WICOW 2008 2
Information quality on the web: DEBKAfile
An Israeli, Jerusalem-based website (www.debka.co.il) with commentary and analyses on terrorism, intelligence, security, and military and political affairs in the Middle East
According to DEBKAfile, over 1,000,000 viewers a week
Forbes' Best of The Web award: “Debkafile has been ahead of the pack often enough to suggest that the reporting is good.”
However, Forbes decries the fact that "most of the information is attributed to unidentified sources"
Has been criticized as a fringe outfit catering to conspiracy theorists. Some claim that the site relies on sources with an agenda, and that Israeli intelligence does not consider even 10% of the content reliable
The site's operators have claimed that 80% turns out to be true
Information quality on the web: DEBKAfile
Content quality:
Highly current, up-to-date
But deficiencies in accuracy, source reliability, and objectivity
Representation quality:
Spelling errors and various typos
Very long sentences
Grammatical errors
..
Critical observation
Information quality deficiencies are often not isolated
Poor information quality control?
Website information quality assessment: Our approach (I)
Look for an easy-to-measure data quality facet
Use it as an indicator of aggregate data quality
Website information quality assessment: Our approach (I)
Focus on spelling errors as an indicator of aggregate data quality
Hypothesis 1: The spelling error rate of a document set is inversely related to the aggregate data quality of the set (a higher error rate goes with lower quality)
Related questions (I)
To what extent is a lower aggregate quality detected by the spelling error rate?
To what extent does a higher spelling error rate indicate a lower aggregate quality?
Are there significant variations across different settings?
Our approach (II): A “quick and dirty” indicator
Instead of an exhaustive spelling error check, focus on a minimal set of spelling errors, carefully chosen to fit the target document population
Use the hit count feature of a common search engine (e.g., Google) to assess the rate of the chosen spelling errors in the target population
A “quick and dirty” indicator: Initial implementation
10 common English spelling errors selected from the autocorrect word list of MS Office
target broad document populations
Google’s hit count
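As a rough illustration of how these ingredients might fit together, the sketch below builds the pair of search queries for one spelling error. It assumes that the target population is selected with the search engine's `site:` operator and that each word is searched as a quoted phrase. Both choices, and the helper name `build_queries`, are illustrative assumptions rather than details given in the slides. Reading the hit counts off the result pages is left out, since it depends on the search interface used.

```python
def build_queries(misspelling: str, correct: str, site: str) -> tuple:
    """Return the (error query, correct-spelling query) pair for one target site.

    Hypothetical helper: the quoted-phrase form and the site: restriction
    are assumptions made for this sketch.
    """
    return (f'"{misspelling}" site:{site}', f'"{correct}" site:{site}')


# Example: one error/correction pair restricted to a single large website.
error_query, correct_query = build_queries("recieve", "receive", "wikipedia.org")
```

The two hit counts returned for such a pair are the inputs the indicator needs for that spelling error.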
A “quick and dirty” indicator: Initial implementation
Spelling error    Correct spelling
Recieve           Receive
Accomodate        Accommodate
Accross           Across
Truely            Truly
Acheive           Achieve
Affraid           Afraid
Agressive         Aggressive
Appearence        Appearance
Tomorow           Tomorrow
Arguement         Argument
A “quick and dirty” indicator: Initial implementation
Indicator defined by:
\[
\mathrm{ErrorIndex}(e_j, d) = \frac{\mathrm{HitCount}(e_j, d)}{\mathrm{HitCount}(e_j, d) + \mathrm{HitCount}(c_j, d) + 1}
\]
\[
\mathrm{ErrorIndex}(d) = \mathrm{AVERAGE}\{\mathrm{ErrorIndex}(e_j, d) : j = 1, \ldots, 10\}
\]
$e_j$, $j = 1, \ldots, 10$, denotes the $j$th spelling error
$c_j$ denotes the correct spelling that matches $e_j$
$d$ denotes the document set
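A minimal sketch of this definition in Python follows, with the denominator read as the sum of the two hit counts plus one. It works on hit counts that have already been collected. The function names are hypothetical, only three of the ten pairs are shown, and the numbers in `example_counts` are invented for illustration, not measurements from the study.

```python
from statistics import mean

# (misspelling, correct spelling) pairs; the first three from the slide's list.
PAIRS = [
    ("recieve", "receive"),
    ("accomodate", "accommodate"),
    ("accross", "across"),
]


def error_index(hits_error: int, hits_correct: int) -> float:
    """Per-error index: HitCount(e_j, d) / (HitCount(e_j, d) + HitCount(c_j, d) + 1)."""
    return hits_error / (hits_error + hits_correct + 1)


def error_index_for_set(hit_counts: dict, pairs=PAIRS) -> float:
    """ErrorIndex(d): the average of the per-error indices over all pairs."""
    return mean(error_index(hit_counts[e], hit_counts[c]) for e, c in pairs)


# Invented hit counts for one document set d.
example_counts = {
    "recieve": 10, "receive": 989,
    "accomodate": 5, "accommodate": 494,
    "accross": 0, "across": 1999,
}
example_index = error_index_for_set(example_counts)
```

With these invented counts the three per-error indices are 0.01, 0.01, and 0, so the averaged index is about 0.0067. The added 1 in the denominator keeps the ratio defined when neither spelling occurs in the document set.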
Website information quality assessment: Our approach (II)
Hypothesis 2: The proposed indicator is inversely related to the aggregate data quality of the document set (a higher indicator value goes with lower quality)
Related questions (II)
To what extent is a lower aggregate quality detected by this indicator?
… (see Questions I)
Spelling error set: what spelling errors to include? How many?
Hit count: is it reliable? How valid is it in measuring error rates?
Initial tests & results
We have conducted initial tests of Hypothesis 1, Hypothesis 2, and related questions
Askira Gelman I. and Barletta A.L. Initial Study of a “Quick and Dirty” Website Data Quality Index, ICIQ 2008
Initial tests & results
To what extent does a higher spelling error rate indicate a lower aggregate quality?
Positive initial results on large websites and web domains (.gov sites, university sites, Wikipedia, and more)
Spelling error set: the size can be increased; select errors carefully, because the search engine is not context sensitive
Hit count: for higher reliability, conduct a series of measurements and remove outliers
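One way to act on the hit-count advice is sketched below: repeat the same query several times, drop readings that lie far from the median, and average the rest. The outlier rule used here (distance from the median greater than `k` times the median absolute deviation) is an assumption chosen for illustration; the slides do not say which rule was applied. The function name is hypothetical and the sample readings are invented.

```python
from statistics import mean, median


def robust_hit_count(samples: list, k: float = 3.0) -> float:
    """Average repeated hit counts after dropping readings far from the median."""
    med = median(samples)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(s - med) for s in samples])
    if mad == 0:
        # More than half of the readings agree exactly; use that value.
        return float(med)
    kept = [s for s in samples if abs(s - med) <= k * mad]
    return mean(kept)


# Repeated readings for the same query (invented numbers; one obvious glitch).
measurements = [1010, 990, 1000, 1005, 995, 25000]
stable_estimate = robust_hit_count(measurements)
```

In this invented series the reading of 25000 is discarded and the remaining five average to 1000. A median-based rule is used because a single extreme reading would distort a rule built on the mean and standard deviation.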