![Page 1: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/1.jpg)
AnHai Doan
University of Wisconsin-Madison
Data Quality Challenges in Community Systems
Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton
![Page 2: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/2.jpg)
Numerous Web Communities
Academic domains
– database researchers, bioinformaticians
Infotainment
– movie fans, mountain climbers, fantasy football
Scientific data management
– biomagnetic databank, E. coli community
Business
– enterprise intranets, tech support groups, lawyers
CIA / homeland security
– Intellipedia
![Page 3: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/3.jpg)
Many Efforts to Build Community Portals
Initially taxonomy-based (e.g., Yahoo style)
But now many structured data portals
– capture key entities and relationships of community
No general solution yet on how to build such portals
![Page 4: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/4.jpg)
Cimple Project @ Wisconsin / Yahoo! Research
[Figure: Cimple architecture. Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents. Example entities and relationship: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Other labels: maintain and add more sources; mass collaboration.]
Develops such a general solution using extraction + integration + mass collaboration
![Page 5: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/5.jpg)
Prototype System: DBLife
Integrate data of the DB research community
1164 data sources
Crawled daily, 11000+ pages = 160+ MB / day
![Page 6: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/6.jpg)
Data Extraction
![Page 7: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/7.jpg)
Data Integration
Raghu Ramakrishnan
co-authors = A. Doan, Divesh Srivastava, ...
![Page 8: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/8.jpg)
Resulting ER Graph
[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]
![Page 10: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/10.jpg)
Mass Collaboration: Voting
Picture is removed if enough users vote “no”.
![Page 11: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/11.jpg)
Mass Collaboration via Wiki
![Page 12: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/12.jpg)
Summary: Community Systems
Data integration systems + extraction + Web 2.0
– manage both data and users in a synergistic fashion
In sync with current trends
– manage unstructured data (e.g., text, Web pages)
– get more structure (IE, Semantic Web)
– engage more people (Web 2.0)
– best-effort data integration, data spaces, pay-as-you-go
Numerous potential applications
But raises many difficult data quality challenges
![Page 13: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/13.jpg)
Rest of the Talk
Data quality challenges in
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration
Conclusions & ways forward
![Page 14: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/14.jpg)
1. Source Selection
[Figure: Cimple architecture diagram, repeated from the project overview slide]
![Page 15: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/15.jpg)
Current Solutions vs. Cimple
Current solutions
– find all relevant data sources (e.g., using focused crawling, search engines)
– maximize coverage
– have a lot of noisy sources
Cimple
– starts with a small set of high-quality “core” sources
– incrementally adds more sources
  – only from “high-quality” places
  – or as suggested by users (mass collaboration)
![Page 16: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/16.jpg)
Start with a Small Set of “Core” Sources
Key observation: communities often follow 80-20 rules
– 20% of sources cover 80% of interesting activities
Initial portal over these 20% often is already quite useful
How to select these 20%
– select as many sources as possible
– evaluate and select most relevant ones
![Page 17: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/17.jpg)
Evaluate the Relevancy of Sources
Use PageRank + virtual links across entities + TF/IDF
[Figure: example mentions “... Gerhard Weikum” and “G. Weikum”]
See [VLDB-07a]
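One possible reading of "PageRank + virtual links across entities" is sketched below. It is only an illustration, not the method of [VLDB-07a]: the function names, the rule that two pages mentioning the same entity get a link in both directions, and the damping factor 0.85 are all assumptions, and the TF/IDF component is left out.

```python
from collections import defaultdict

def add_virtual_links(links, mentions):
    """Copy `links` and add a link in both directions between any two pages
    that mention the same entity (assumed meaning of "virtual links")."""
    graph = {page: set(targets) for page, targets in links.items()}
    by_entity = defaultdict(set)
    for page, entities in mentions.items():
        graph.setdefault(page, set())
        for entity in entities:
            by_entity[entity].add(page)
    for pages in by_entity.values():
        for page in pages:
            graph[page] |= pages - {page}
    # Make sure every link target is also a node of the graph.
    for targets in list(graph.values()):
        for target in targets:
            graph.setdefault(target, set())
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; pages without out-links spread
    their rank evenly over all pages."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in nodes if not graph[p])
        new = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for p in nodes:
            for q in graph[p]:
                new[q] += damping * rank[p] / len(graph[p])
        rank = new
    return rank

# Hypothetical toy input: page "a" links to "b"; "b" and "c" both mention G. Weikum.
links = {"a": ["b"], "b": [], "c": []}
mentions = {"b": ["G. Weikum"], "c": ["G. Weikum"]}
ranks = pagerank(add_virtual_links(links, mentions))
```

In this toy input the virtual link lets page "c" receive rank from "b" even though no hyperlink connects them.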
![Page 18: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/18.jpg)
Add More Sources over Time
Key observation: most important sources will eventually be mentioned within the community
– so monitor certain “community channels” to find them
Example message:
Message type: conf. ann.
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007
http://mud.cs.utwente.nl ...
Also allow users to suggest new sources
– e.g., the Silicon Valley Database Society
![Page 19: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/19.jpg)
Summary: Source Selection
Sharp contrast to current work
– start with highly relevant sources
– expand carefully
– minimize “garbage in, garbage out”
Need a notion of source relevance
Need a way to compute this
![Page 20: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/20.jpg)
2. Extraction and Integration
[Figure: Cimple architecture diagram, repeated from the project overview slide]
![Page 21: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/21.jpg)
Extracting Entity Mentions
Key idea: reasonable plan, then patch
Reasonable plan:
– collect person names, e.g., David Smith
– generate variations, e.g., D. Smith, Dr. Smith, etc.
– find occurrences of these variations
[Plan diagram operators: ExtractMbyName, Union, s1 … sn]
Works well, but can’t handle certain difficult spots
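The three steps of the reasonable plan can be sketched roughly as follows. This is a minimal illustration, not the actual ExtractMbyName operator: the function names are hypothetical, and the "Last, F." variation is an assumed member of the "etc." in the list above.

```python
import re

def name_variations(full_name):
    """Generate a few surface forms for a 'First Last' name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        full_name,                 # David Smith
        f"{first[0]}. {last}",     # D. Smith
        f"Dr. {last}",             # Dr. Smith
        f"{last}, {first[0]}.",    # Smith, D. (assumed extra variation)
    }

def extract_mentions_by_name(text, dictionary):
    """Return (start, matched_text, dictionary_name) for every occurrence
    of a variation of a dictionary name in `text`."""
    mentions = []
    for name in dictionary:
        for variant in name_variations(name):
            for m in re.finditer(re.escape(variant), text):
                mentions.append((m.start(), m.group(), name))
    return sorted(mentions)

text = "Talk by Dr. Smith; see also D. Smith and David Smith."
found = extract_mentions_by_name(text, ["David Smith"])
```

Run on the author list "R. Miller, D. Smith, B. Jones" with "David Miller" in the dictionary, this sketch wrongly reports "Miller, D." as a mention, which is the kind of difficult spot a plain dictionary plan cannot handle.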
![Page 22: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/22.jpg)
Handling Difficult Spots
Example
– R. Miller, D. Smith, B. Jones
– if “David Miller” is in the dictionary, the plan will flag “Miller, D.” as a person name
Solution: patch such spots with stricter plans
[Plan diagram operators: ExtractMbyName, Union, s1 … sn, FindPotentialNameLists, ExtractMStrict]
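A patch for name lists might look like the sketch below. The operator names come from the slide, but their behavior here is assumed: the list pattern ("I. Surname" items separated by commas) and the rule that only whole list items may match are illustrative choices, not the Cimple implementation.

```python
import re

# One "I. Surname" item such as "R. Miller" (assumed list format).
ITEM = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{ITEM}(?:, {ITEM})+")

def find_potential_name_lists(text):
    """Return (start, end) spans that look like comma-separated name lists."""
    return [m.span() for m in NAME_LIST.finditer(text)]

def extract_strict(list_text, dictionary):
    """Inside a name list, accept only whole 'I. Surname' items, so a
    surname and the next item's initial are never glued together."""
    by_short = {f"{n.split()[0][0]}. {n.split()[-1]}": n for n in dictionary}
    items = re.findall(ITEM, list_text)
    return [(item, by_short[item]) for item in items if item in by_short]

text = "Authors: R. Miller, D. Smith, B. Jones"
start, end = find_potential_name_lists(text)[0]
strict = extract_strict(text[start:end], ["David Miller", "David Smith"])
```

Here "D. Smith" is matched to David Smith, while the spurious "Miller, D." reading never arises because matching is done item by item.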
![Page 23: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/23.jpg)
Matching Entity Mentions
Key idea: reasonable plan, then patch
Reasonable plan
– if mention names are the same (modulo some variation), then match
– e.g., David Smith and D. Smith
[Plan diagram operators: MatchMbyName, Extract Plan, Union, s1 … sn]
Works well, but can’t handle certain difficult spots
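A name-only matcher in this spirit can be sketched as below. The key (first initial plus last name) is an assumed stand-in for "the same modulo some variation", and the function names are hypothetical.

```python
def name_key(mention):
    """Reduce a mention to (first initial, last name)."""
    tokens = [t.strip(".,") for t in mention.split()]
    tokens = [t for t in tokens if t]
    return (tokens[0][0].upper(), tokens[-1].lower())

def match_by_name(mentions):
    """Group mentions whose name keys agree."""
    groups = {}
    for mention in mentions:
        groups.setdefault(name_key(mention), []).append(mention)
    return list(groups.values())

groups = match_by_name(["David Smith", "D. Smith", "Chen Li"])
```

Every "Chen Li" mention lands in one group under this rule, regardless of whether the mentions refer to the same person.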
![Page 24: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/24.jpg)
Handling Difficult Spots
Estimate the semantic ambiguity of data sources
– use social networking techniques [see ICDE-07a]
Apply stricter matchers to more ambiguous sources
[Plan diagram operators: MatchMbyName over Extract Plan and Union on {s1 … sn} \ DBLP; MatchMStrict over Extract Plan on DBLP]
DBLP: Chen Li
· · ·
41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
· · ·
38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
· · ·
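The slides do not say what MatchMStrict does. One plausible stricter rule, borrowed from the "match using names and co-authors" idea mentioned later in the talk, is to require a shared co-author in addition to the same name. The sketch below assumes that rule; the function name and the third record are hypothetical.

```python
def match_strict(records):
    """Cluster (name, coauthors) records: a record joins a cluster only if
    the name is identical AND it shares at least one co-author with it.
    Greedy single pass; clusters are never merged afterwards."""
    clusters = []  # each cluster: {"name", "coauthors", "members"}
    for index, (name, coauthors) in enumerate(records):
        coauthors = set(coauthors)
        for cluster in clusters:
            if cluster["name"] == name and cluster["coauthors"] & coauthors:
                cluster["coauthors"] |= coauthors
                cluster["members"].append(index)
                break
        else:
            clusters.append({"name": name, "coauthors": coauthors, "members": [index]})
    return [cluster["members"] for cluster in clusters]

records = [
    ("Chen Li", ["Bin Wang", "Xiaochun Yang"]),   # DBLP entry 41 above
    ("Chen Li", ["Ping-Qi Pan", "Jian-Feng Hu"]), # DBLP entry 38 above
    ("Chen Li", ["Bin Wang"]),                    # hypothetical extra record
]
clusters = match_strict(records)
```

With this rule the two DBLP entries stay in separate clusters, whereas a name-only matcher would merge them.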
![Page 25: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/25.jpg)
Going Beyond Sources: Difficult Data Spots Can Cover Any Portion of Data
[Plan diagram operators: MatchMbyName over Extract Plan and Union on {s1 … sn} \ DBLP; MatchMStrict over Extract Plan on DBLP; MatchMStrict2 on mentions that match “J. Han”]
![Page 26: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/26.jpg)
Summary: Extraction and Integration
Most current solutions
– try to find a single good plan, applied to all of data
Cimple solution: reasonable plan, then patch
So the focus shifts to:
– how to find a reasonable plan?
– how to detect problematic data spots?
– how to patch those?
Need a notion of semantic ambiguity
Different from the notion of source relevance
![Page 27: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/27.jpg)
3. Detecting Problems and Providing Feedback
[Figure: Cimple architecture diagram, repeated from the project overview slide]
![Page 28: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/28.jpg)
How to Detect Problems?
After extraction and matching, build services
– e.g., superhomepages
Many such homepages contain minor problems
– e.g., X graduated in 19998
– X chairs SIGMOD-05 and VLDB-05
– X published 5 SIGMOD-03 papers
Intuitively, something is semantically incorrect
To fix this, let's build a Semantic Debugger
– learns what is a normal profile for researcher, paper, etc.
– alerts the builder to potentially buggy superhomepages
– so feedback can be provided
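As a very rough illustration of "learn a normal profile, then alert", the sketch below learns the observed range of each numeric attribute from trusted profiles and flags values far outside it. The attribute names, the range-based profile, and the `slack` margin are all assumptions; the actual Semantic Debugger is not described in these slides.

```python
def learn_profile(trusted):
    """Learn the observed [min, max] range of each numeric attribute."""
    bounds = {}
    for record in trusted:
        for attr, value in record.items():
            low, high = bounds.get(attr, (value, value))
            bounds[attr] = (min(low, value), max(high, value))
    return bounds

def flag_suspicious(record, bounds, slack=0.5):
    """Return the attributes whose value lies well outside the learned range."""
    flagged = []
    for attr, value in record.items():
        if attr not in bounds:
            continue
        low, high = bounds[attr]
        margin = slack * (high - low)
        if value < low - margin or value > high + margin:
            flagged.append(attr)
    return flagged

# Hypothetical trusted profiles.
trusted = [
    {"grad_year": 1985, "sigmod_papers_per_year": 1},
    {"grad_year": 2001, "sigmod_papers_per_year": 2},
    {"grad_year": 1994, "sigmod_papers_per_year": 0},
]
bounds = learn_profile(trusted)
alerts = flag_suspicious({"grad_year": 19998, "sigmod_papers_per_year": 5}, bounds)
```

Both example problems from the slide (graduation year 19998, five SIGMOD papers in one year) fall outside the learned ranges and would be reported to the builder.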
![Page 29: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/29.jpg)
What Types of Feedback?
Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
– e.g., no researcher has ever published 5 SIGMOD papers in a year
Add more data
– e.g., X was advised by Z
– e.g., here is the URL of another data source
Modify the underlying algorithm
– e.g., pull out all data involving X, match using names and co-authors, not just names
![Page 30: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/30.jpg)
How to Make Providing Feedback Very Easy?
“Providing feedback” for the masses
– in sync with current trends of empowering the masses
Extremely crucial in DBLife context
If feedback can be provided easily
– can get more feedback
– can leverage the mass of users
But this turned out to be very difficult
![Page 31: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/31.jpg)
How to Make Providing Feedback Very Easy?
Types of feedback:
– Say that a certain data item Y is wrong
– Provide correct value for Y, e.g., Y = SIGMOD-06
– Add domain knowledge
– Add more data
– Modify the underlying algorithm
Annotations:
– Provide form interfaces
– Provide a Wiki interface
– Critical in our experience, but unsolved
– Unsolved, some recent interest in how to mass customize software
See our IEEE Data Engineering Bulletin paper on user-centric challenges, 2007
![Page 32: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/32.jpg)
What Feedback Would Make the Most Impact?
I have one hour of spare time and would like to “teach” DBLife
– what problems should I work on?
– what feedback should I provide?
Need a Feedback Advisor
– define a notion of system quality Q(s)
– define questions q1, ..., qn that DBLife can ask users
– for each qi, evaluate its expected improvement in Q(s)
– pick question with highest expected quality improvement
Observations
– a precise notion of system quality is now crucial
– this notion should model the expected usage
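The Feedback Advisor loop can be sketched as follows, under strong simplifying assumptions: the system state is a map from facts to the probability that each fact is correct, Q(s) is the average certainty over facts, and each question asks a user to confirm or reject one fact. None of these choices come from the talk; they only make the "expected improvement in Q(s)" step concrete.

```python
def quality(state):
    """Toy Q(s): average certainty over all facts (an assumption)."""
    return sum(max(p, 1 - p) for p in state.values()) / len(state)

def expected_improvement(state, fact):
    """Expected gain in Q(s) if a user confirms or rejects `fact`."""
    p = state[fact]
    base = quality(state)
    gain = 0.0
    # With probability p the user confirms the fact, otherwise rejects it.
    for answer_prob, resolved in ((p, 1.0), (1 - p, 0.0)):
        after = {**state, fact: resolved}
        gain += answer_prob * (quality(after) - base)
    return gain

def pick_question(state):
    """Pick the fact whose verification has the highest expected gain."""
    return max(state, key=lambda fact: expected_improvement(state, fact))

# Hypothetical state: probability that each extracted fact is correct.
state = {"X chairs SIGMOD-05": 0.5, "X graduated in 1998": 0.9}
best = pick_question(state)
```

With this toy Q(s), the advisor prefers asking about the fact it is least sure of. A quality notion that models expected usage, as the observations call for, would weight facts differently.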
![Page 33: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/33.jpg)
Summary: Detection and Feedback
How to detect problems?
– Semantic Debugger
What types of feedback & how to easily provide them?
– critical, largely unsolved
What feedback would make most impact?
– crucial in large-scale systems
– need a Feedback Advisor
– need a precise notion of system quality
![Page 34: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/34.jpg)
4. Mass Collaboration
[Figure: Cimple architecture diagram, repeated from the project overview slide, with the label “Maintenance and expansion”]
![Page 35: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/35.jpg)
Mass Collaboration: Voting
Can be applied to numerous problems
![Page 36: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/36.jpg)
Example: Matching
Hard for machine, but easy for human
Mouse for Dell laptop 200 series ...
Dell X200; mouse at reduced price ...
Dell laptop X200 with mouse ...
![Page 37: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/37.jpg)
Challenges
How to detect and remove noisy users?
– evaluate them using questions with known answers
How to combine user feedback?
– # of yes votes vs. # of no votes
See [ICDE-05a, ICDE-08a]
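Both challenges can be illustrated with a small sketch: users are screened on questions with known answers, and the remaining yes/no votes are combined by simple majority. The 0.8 accuracy threshold, the majority rule, and all names below are assumptions for illustration, not the schemes of [ICDE-05a, ICDE-08a].

```python
def trusted_users(answers, gold, min_accuracy=0.8):
    """Keep users whose accuracy on questions with known answers is high enough."""
    trusted = set()
    for user, votes in answers.items():
        graded = [votes[q] == truth for q, truth in gold.items() if q in votes]
        if graded and sum(graded) / len(graded) >= min_accuracy:
            trusted.add(user)
    return trusted

def combine_votes(answers, question, users):
    """Majority of yes vs. no votes; None on a tie or when nobody voted."""
    yes = sum(1 for u in users if answers[u].get(question) is True)
    no = sum(1 for u in users if answers[u].get(question) is False)
    if yes == no:
        return None
    return yes > no

# Hypothetical votes: g1 and g2 have known answers, q is the open question.
gold = {"g1": True, "g2": False}
answers = {
    "ann": {"g1": True, "g2": False, "q": True},
    "bob": {"g1": True, "g2": False, "q": True},
    "dan": {"g1": True, "g2": False, "q": False},
    "eve": {"g1": False, "g2": True, "q": False},
}
users = trusted_users(answers, gold)
decision = combine_votes(answers, "q", users)
```

In this example "eve" fails both known-answer questions and is excluded, so the open question is decided two votes to one.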
![Page 38: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/38.jpg)
Mass Collaboration: Wiki
Community wikipedia
– built by machine + human
– backed up by a structured database
[Figure labels: Data Sources, G, T, M, V1, V2, V3, W1, W2, W3, u1, V3’, W3’, T3’]
![Page 39: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/39.jpg)
Mass Collaboration: Wiki
[Figure labels: Machine, Machine, Human]
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
David J. DeWitt
Professor
Interests: Parallel Database
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
David J. DeWitt
John P. Morgridge Professor
UW-Madison since 1976
Interests: Parallel Database, Privacy
[Figure labels: Machine, Human]
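The `<# path{attribute}=value #>` tags in the wiki source above tie each displayed value to a slot in the structured database. A minimal sketch of reading and rendering such tags is shown below. The regular expression is inferred from the examples on this slide only, so the real syntax may allow more than this, and the function names are hypothetical.

```python
import re

# <# path{attribute}=value #>, e.g. <# person(id=1){name}=David J. DeWitt #>
TAG = re.compile(r"<#\s*(?P<path>.+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>")

def parse_tags(wiki_text):
    """Return (path, attribute, value) for every tag in the wiki text."""
    return TAG.findall(wiki_text)

def render(wiki_text):
    """Replace each tag by its value, as in the rendered view of the page."""
    return TAG.sub(lambda m: m.group("value"), wiki_text)

source = "<# person(id=1){title}=John P. Morgridge Professor #>"
tags = parse_tags(source)
shown = render(source)
```

Comparing the tags parsed before and after a human edit would show which database slots the edit touched, for example the `title` of `person(id=1)`.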
![Page 40: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/40.jpg)
Sample Data Quality Challenges
How to detect noisy users?
– no clear solution yet
– for now, limit editing to trusted editors
– modify notion of system quality to account for this
How to combine feedback, handle inconsistent data?
– user vs. user
– user vs. machine
How to verify claimed ownership of data portions?
– e.g., this superhomepage is about me
– only I can edit it
See [ICDE-08b]
![Page 41: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/41.jpg)
Summary: Mass Collaboration
What can users contribute?
How to evaluate user quality?
How to reconcile inconsistent data?
![Page 42: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/42.jpg)
Additional Challenges
Dealing with evolving data (e.g., matching)
Iterative code development
Lifelong quality improvement
Querying over inconsistent data
Managing provenance and uncertainty
Generating explanations
Undo
![Page 43: AnHai Doan University of Wisconsin-Madison Data Quality Challenges in Community Systems Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron](https://reader036.vdocument.in/reader036/viewer/2022062517/56649f005503460f94c1694f/html5/thumbnails/43.jpg)
Conclusions
Community systems:
– data integration + IE + Web 2.0
– potentially very useful in numerous domains
Such systems raise myriad data quality challenges
– subsume many current challenges
– suggest new ones
Can provide a unifying context for us to make progress
– building systems has been a key strength of our field
– we need a community effort, as always
See “cimple wisc” for more detail
Let us know if you want code/data