truth finding on the deep web
![Page 1: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/1.jpg)
TRUTH FINDING ON THE DEEP WEB
Xin Luna Dong, Google Inc.
4/2013
![Page 2: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/2.jpg)
Why Was I Motivated 5+ Years Ago?
7/2009
2007
![Page 3: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/3.jpg)
Why Was I Motivated? –Erroneous Info
7/2009
![Page 4: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/4.jpg)
Why Was I Motivated?—Out-Of-Date Info
7/2009
![Page 5: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/5.jpg)
Why Was I Motivated?—Out-Of-Date Info
7/2009
![Page 6: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/6.jpg)
Why Was I Motivated?—Ahead-Of-Time Info
The story, marked “Hold for release – Do not use”, was sent in error to the news service’s thousands of corporate clients.
![Page 7: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/7.jpg)
Why Was I Motivated? Rumors
Maurice Jarre (1924-2009), French Conductor and Composer
“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”
2:29, 30 March 2009
![Page 8: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/8.jpg)
Wrong information can be just as bad as lack of information. The Internet needs a way to help people separate rumor from real science.
– Tim Berners-Lee
![Page 9: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/9.jpg)
ARE DEEP-WEB DATA CONSISTENT & RELIABLE? [PVLDB, 2013]
![Page 10: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/10.jpg)
Study on Two Domains

| Domain | #Sources | Period | #Objects | #Local-attrs | #Global-attrs | Considered items |
|---|---|---|---|---|---|---|
| Stock | 55 | 7/2011 | 1000*20 | 333 | 153 | 16000*20 |
| Flight | 38 | 12/2011 | 1200*31 | 43 | 15 | 7200*31 |

Stock
• Search: “stock price quotes” and “AAPL quotes”
• Sources: 200 (search results), 89 (deep web), 76 (GET method), 55 (no JavaScript)
• 1000 “objects”: a stock with a particular symbol on a particular day; 30 from the Dow Jones Index, 100 from NASDAQ100 (3 overlaps), 873 from the Russell 3000
• Attributes: 333 (local), 153 (global), 21 (provided by > 1/3 of sources), 16 (no change after market close)
Data sets available at lunadong.com/fusionDataSets.htm
![Page 11: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/11.jpg)
Study on Two Domains

| Domain | #Sources | Period | #Objects | #Local-attrs | #Global-attrs | Considered items |
|---|---|---|---|---|---|---|
| Stock | 55 | 7/2011 | 1000*20 | 333 | 153 | 16000*20 |
| Flight | 38 | 12/2011 | 1200*31 | 43 | 15 | 7200*31 |

Flight
• Search: “flight status”
• Sources: 38, consisting of 3 airline websites (AA, UA, Continental), 8 airport websites (SFO, DEN, etc.), and 27 third-party websites (Orbitz, Travelocity, etc.)
• 1200 “objects”: a flight with a particular flight number on a particular day from a particular departure city, departing from or arriving at the hub airports of AA/UA/Continental
• Attributes: 43 (local), 15 (global), 6 (provided by > 1/3 of sources): scheduled dept/arr time, actual dept/arr time, dept/arr gate
Data sets available at lunadong.com/fusionDataSets.htm
![Page 12: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/12.jpg)
Study on Two Domains
Why these two domains?
• Belief of fairly clean data
• Data quality can have big impact on people’s lives
Resolved heterogeneity at schema level and instance level

| Domain | #Sources | Period | #Objects | #Local-attrs | #Global-attrs | Considered items |
|---|---|---|---|---|---|---|
| Stock | 55 | 7/2011 | 1000*20 | 333 | 153 | 16000*20 |
| Flight | 38 | 12/2011 | 1200*31 | 43 | 15 | 7200*31 |
Data sets available at lunadong.com/fusionDataSets.htm
![Page 13: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/13.jpg)
Q1. Are There a Lot of Redundant Data on the Deep Web?
![Page 14: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/14.jpg)
Q2. Are the Data Consistent?
• Inconsistency on 70% of data items
• Tolerance to 1% difference
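To make the 1% tolerance concrete, here is a minimal sketch of a consistency check. The slide does not say how the tolerance is computed, so measuring the gap relative to the largest value is an assumption, and `is_consistent` is a hypothetical helper name. The sample values are 52-week highs of the kind shown on the semantic-ambiguity slide.

```python
def is_consistent(values, tolerance=0.01):
    """Return True when every value lies within `tolerance` (relative) of the largest one.

    Assumption: the gap is measured relative to the value of largest magnitude.
    """
    largest = max(values, key=abs)
    if largest == 0:
        return True
    return all(abs(largest - v) <= tolerance * abs(largest) for v in values)


# Two sources that differ by a cent count as consistent ...
print(is_consistent([95.71, 95.70]))
# ... while 95.71 versus 93.72 differs by about 2% and does not.
print(is_consistent([95.71, 93.72]))
```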
![Page 15: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/15.jpg)
Why Such Inconsistency? I. Semantic Ambiguity
Yahoo! Finance
Nasdaq
Day’s Range: 93.80-95.71
52wk Range: 25.38-95.71
52 Wk: 25.38-93.72
![Page 16: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/16.jpg)
Why Such Inconsistency?— II. Instance Ambiguity
![Page 17: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/17.jpg)
Why Such Inconsistency?— III. Out-of-Date Data
4:05 pm 3:57 pm
![Page 18: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/18.jpg)
Why Such Inconsistency?— IV. Unit Error
76,821,000
76.82B
![Page 19: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/19.jpg)
Why Such Inconsistency?— V. Pure Error
FlightView FlightAware Orbitz
6:15 PM
6:15 PM 6:22 PM
9:40 PM 8:33 PM 9:54 PM
![Page 20: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/20.jpg)
Why Such Inconsistency?
Random sample of 20 data items and 5 items with the largest #values in each domain
![Page 21: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/21.jpg)
Q3. Is Each Source of High Accuracy?
Not high on average: .86 for Stock and .8 for Flight
Gold standard
• Stock: vote on data from Google Finance, Yahoo! Finance, MSN Money, NASDAQ, Bloomberg
• Flight: from airline websites
![Page 22: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/22.jpg)
Q3-2. Are Authoritative Sources of High Accuracy?
• Reasonable but not so high accuracy
• Medium coverage
![Page 23: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/23.jpg)
Q4. Is There Copying or Data Sharing Between Web Sources?
![Page 24: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/24.jpg)
Q4-2. Is Copying or Data Sharing Mainly on Accurate Data?
![Page 25: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/25.jpg)
HOW TO RESOLVE INCONSISTENCY (DATA FUSION)?
![Page 26: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/26.jpg)
Baseline Solution: Voting
• Only 70% of correct values are provided by over half of the sources
• Voting precision: .908 for Stock, i.e., wrong values for 1500 data items; .864 for Flight, i.e., wrong values for 1000 data items
![Page 27: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/27.jpg)
Improvement I. Leveraging Source Accuracy
| | S1 | S2 | S3 |
|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT |
| Dewitt | MSR | MSR | UWisc |
| Bernstein | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA |
| Halevy | Google | Google | UW |
![Page 28: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/28.jpg)
Improvement I. Leveraging Source Accuracy
| | S1 | S2 | S3 |
|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT |
| Dewitt | MSR | MSR | UWisc |
| Bernstein | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA |
| Halevy | Google | Google | UW |

Naïve voting obtains an accuracy of 80%
Higher accuracy; more trustable
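The 80% figure can be checked with a short sketch of naïve voting on the three-source table. Two things here are assumptions made for illustration: S1's values are treated as the ground truth, and a tie is counted as an undecided (hence wrong) answer.

```python
from collections import Counter

# Affiliations reported by S1, S2, S3 (in that order) for each researcher.
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT"],
    "Dewitt": ["MSR", "MSR", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA"],
    "Halevy": ["Google", "Google", "UW"],
}
# Assumption: S1's values serve as the ground truth for this toy example.
TRUTH = {name: values[0] for name, values in CLAIMS.items()}


def majority_vote(values):
    """Return the most frequent value, or None when the top count is tied."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def voting_accuracy(claims, truth):
    """Fraction of data items on which majority voting returns the true value."""
    correct = sum(majority_vote(values) == truth[name] for name, values in claims.items())
    return correct / len(claims)


print(voting_accuracy(CLAIMS, TRUTH))
```

Voting decides four of the five items; Carey is a three-way tie, which is the item that source accuracy is meant to settle.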
![Page 29: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/29.jpg)
Improvement I. Leveraging Source Accuracy
| | S1 | S2 | S3 |
|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT |
| Dewitt | MSR | MSR | UWisc |
| Bernstein | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA |
| Halevy | Google | Google | UW |

Considering accuracy obtains an accuracy of 100%
Higher accuracy; more trustable
Challenges:
1. How to decide source accuracy?
2. How to leverage accuracy in voting?
![Page 30: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/30.jpg)
Computing Source Accuracy
Source accuracy: A(S)

$$A(S) = \underset{v \in \bar{V}(S)}{\mathrm{Avg}}\; P(v)$$

• $\bar{V}(S)$: values provided by S
• $P(v)$: probability of value v being true
How to compute P(v)?
![Page 31: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/31.jpg)
Applying Source Accuracy in Data Fusion
Input:
• Data item D, with Dom(D) = {v0, v1, …, vn}
• Observation Φ on D
Output: Pr(vi true | Φ) for each i = 0, …, n (sum up to 1)
According to the Bayes Rule, we need to know Pr(Φ | vi true)
Assuming independence of sources, we need to know Pr(Φ(S) | vi true)
• If S provides vi: Pr(Φ(S) | vi true) = A(S)
• If S does not provide vi: Pr(Φ(S) | vi true) = (1-A(S))/n
Challenge: How to handle inter-dependence between source accuracy and value probability?
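As a worked step, the two per-source likelihoods above can be combined into a per-value score. This is a sketch that assumes a uniform prior over the n+1 candidate values (an assumption added here, not stated on the slide); $\bar{S}(v_i)$ denotes the sources that provide $v_i$.

$$\Pr(\Phi \mid v_i \text{ true}) = \prod_{S \in \bar{S}(v_i)} A(S) \prod_{S \notin \bar{S}(v_i)} \frac{1 - A(S)}{n}$$

Dividing by the factor $\prod_{S} \frac{1 - A(S)}{n}$, which is the same for every candidate value, gives

$$\Pr(v_i \text{ true} \mid \Phi) \;\propto\; \prod_{S \in \bar{S}(v_i)} \frac{n\,A(S)}{1 - A(S)} = \exp\Big( \sum_{S \in \bar{S}(v_i)} \ln \frac{n\,A(S)}{1 - A(S)} \Big)$$

so each source contributes an additive vote of $\ln \frac{n A(S)}{1 - A(S)}$ to the values it provides, and normalizing over the candidate values yields a probability.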
![Page 32: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/32.jpg)
Data Fusion w. Source Accuracy
Source accuracy:

$$A(S) = \underset{v \in \bar{V}(S)}{\mathrm{Avg}}\; P(v)$$

Source vote count:

$$A'(S) = \ln \frac{n\,A(S)}{1 - A(S)}$$

Value vote count:

$$C(v) = \sum_{S \in \bar{S}(v)} A'(S)$$

Value probability:

$$P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$$

Continue until source accuracy converges
Properties
• A value provided by more accurate sources has a higher probability to be true
• Assuming uniform accuracy, a value provided by more sources has a higher probability to be true
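The iteration over these four quantities can be sketched in a few lines of Python on the three-source affiliation example. The parameter choices are assumptions made for illustration: n = 5 false values per data item, an initial accuracy of 0.5 for every source, and a vote count of 0 for domain values that no source provides. With these settings the first two rounds come out close to the accuracies the example slide reports for Rounds 1 and 2.

```python
import math

# Data item -> {source: value provided}.
CLAIMS = {
    "Stonebraker": {"S1": "MIT", "S2": "Berkeley", "S3": "MIT"},
    "Dewitt": {"S1": "MSR", "S2": "MSR", "S3": "UWisc"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR"},
    "Carey": {"S1": "UCI", "S2": "AT&T", "S3": "BEA"},
    "Halevy": {"S1": "Google", "S2": "Google", "S3": "UW"},
}
N_FALSE = 5             # assumption: number of false values per data item
INITIAL_ACCURACY = 0.5  # assumption: starting accuracy of every source


def value_probabilities(source_votes, source_values):
    """P(v) for one data item, normalized over a domain of N_FALSE + 1 values."""
    counts = {}
    for source, value in source_values.items():
        counts[value] = counts.get(value, 0.0) + source_votes[source]
    # Values that no source provides have vote count 0, so each adds e^0 = 1.
    unobserved = max(0, N_FALSE + 1 - len(counts))
    total = sum(math.exp(c) for c in counts.values()) + unobserved
    return {value: math.exp(c) / total for value, c in counts.items()}


def accu_round(accuracy):
    """One round: source vote counts -> value probabilities -> new source accuracy."""
    source_votes = {s: math.log(N_FALSE * a / (1 - a)) for s, a in accuracy.items()}
    sums = {s: 0.0 for s in accuracy}
    provided = {s: 0 for s in accuracy}
    for source_values in CLAIMS.values():
        probs = value_probabilities(source_votes, source_values)
        for source, value in source_values.items():
            sums[source] += probs[value]
            provided[source] += 1
    return {s: sums[s] / provided[s] for s in accuracy}


accuracy = {s: INITIAL_ACCURACY for s in ("S1", "S2", "S3")}
history = []
for _ in range(8):
    accuracy = accu_round(accuracy)
    history.append(dict(accuracy))

for round_number, acc in enumerate(history, start=1):
    print(round_number, {s: round(a, 2) for s, a in acc.items()})
```

A real implementation would stop when the accuracies change by less than a threshold rather than after a fixed number of rounds; the fixed count simply mirrors the eight rounds shown in the example.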
![Page 33: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/33.jpg)
Example
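| Accuracy | S1 | S2 | S3 |
|---|---|---|---|
| Round 1 | .69 | .57 | .45 |
| Round 2 | .81 | .63 | .41 |
| Round 3 | .87 | .65 | .40 |
| Round 4 | .90 | .64 | .39 |
| Round 5 | .93 | .63 | .40 |
| Round 6 | .95 | .62 | .40 |
| Round 7 | .96 | .62 | .40 |
| Round 8 | .97 | .61 | .40 |

Value vote count for Carey

| | UCI | AT&T | BEA |
|---|---|---|---|
| Round 1 | 1.61 | 1.61 | 1.61 |
| Round 2 | 2.40 | 1.89 | 1.42 |
| Round 3 | 3.05 | 2.16 | 1.26 |
| Round 4 | 3.51 | 2.23 | 1.19 |
| Round 5 | 3.86 | 2.20 | 1.18 |
| Round 6 | 4.17 | 2.15 | 1.19 |
| Round 7 | 4.47 | 2.11 | 1.20 |
| Round 8 | 4.76 | 2.09 | 1.20 |

| | S1 | S2 | S3 |
|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT |
| Dewitt | MSR | MSR | UWisc |
| Bernstein | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA |
| Halevy | Google | Google | UW |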
![Page 34: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/34.jpg)
Results on Stock Data
• Sources ordered by recall (coverage * accuracy)
• Accu obtains a final precision (=recall) of .900, worse than Vote (.908)
• With precise source accuracy as input, Accu obtains final precision of .910
![Page 35: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/35.jpg)
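Data Fusion w. Value Similarity
Source accuracy:

$$A(S) = \underset{v \in \bar{V}(S)}{\mathrm{Avg}}\; P(v)$$

Source vote count:

$$A'(S) = \ln \frac{n\,A(S)}{1 - A(S)}$$

Value vote count:

$$C(v) = \sum_{S \in \bar{S}(v)} A'(S)$$

Consider value similarity (ρ is a weight on the votes borrowed from similar values):

$$C^*(v) = C(v) + \rho \sum_{v' \neq v} C(v') \cdot sim(v, v')$$

Value probability:

$$P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$$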
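A small sketch of the similarity adjustment, where each value borrows a share of the votes of the values it resembles. The `similarity` function, the weight `rho = 0.5`, and the sample vote counts are all hypothetical choices for illustration; the slides do not specify them.

```python
def similarity(x, y, tolerance=0.01):
    """Hypothetical similarity: 1 for equal numbers, falling linearly to 0
    once the relative gap reaches `tolerance`."""
    scale = max(abs(x), abs(y))
    if scale == 0:
        return 1.0
    return max(0.0, 1.0 - abs(x - y) / (tolerance * scale))


def adjust_vote_counts(counts, rho=0.5):
    """Add to each value's vote count a rho-weighted share of similar values' counts."""
    return {
        v: c + rho * sum(c2 * similarity(v, v2) for v2, c2 in counts.items() if v2 != v)
        for v, c in counts.items()
    }


# Hypothetical vote counts for three reported values of one data item.
counts = {95.71: 2.0, 95.70: 1.5, 25.38: 1.0}
adjusted = adjust_vote_counts(counts)
for value, count in adjusted.items():
    print(value, round(count, 3))
```

The two near-identical values reinforce each other, while the outlier keeps its original count because its similarity to the others is zero.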
![Page 36: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/36.jpg)
Results on Stock Data (II)
AccuSim obtains a final precision of .929, higher than Vote (.908)
This translates to 350 more correct values
![Page 37: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/37.jpg)
Results on Stock Data (III)
![Page 38: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/38.jpg)
Results on Flight Data
• Accu/AccuSim obtains a final precision of .831/.833, both lower than Vote (.857)
• With precise source accuracy as input, Accu/AccuSim obtains final recall of .91/.952
• WHY??? What is that magic source?
![Page 39: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/39.jpg)
Copying or Data Sharing Can Happen on Inaccurate Data
![Page 40: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/40.jpg)
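| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |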
Naïve voting works only if data sources are independent.
![Page 41: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/41.jpg)
| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |

Higher accuracy; more trustable
Considering source accuracy can be worse when there is copying
![Page 42: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/42.jpg)
Improvement II. Ignoring Copied Data
It is important to detect copying and ignore copied values in fusion
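| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |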
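The effect of copied values on voting can be illustrated with the five-source table. This sketch assumes, for illustration only, that S4 and S5 are copiers of S3 and that S1's values are the truth; a tie counts as undecided.

```python
from collections import Counter

SOURCES = ["S1", "S2", "S3", "S4", "S5"]
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}
# Assumption: S1's values are the ground truth.
TRUTH = {name: values[0] for name, values in CLAIMS.items()}
# Assumption: S4 and S5 copy from S3.
COPIERS = frozenset({"S4", "S5"})


def majority_vote(values):
    """Return the most frequent value, or None when the top count is tied."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def voting_accuracy(ignored=frozenset()):
    """Voting accuracy after dropping the values of the `ignored` sources."""
    correct = 0
    for name, values in CLAIMS.items():
        kept = [v for source, v in zip(SOURCES, values) if source not in ignored]
        correct += majority_vote(kept) == TRUTH[name]
    return correct / len(CLAIMS)


print("all five sources:", voting_accuracy())
print("copiers ignored: ", voting_accuracy(COPIERS))
```

Dropping a copier entirely is the crudest form of the idea; as the challenges on the next slide point out, a copier may also provide some values independently, which is why the later formulas discount votes instead of discarding sources.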
![Page 43: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/43.jpg)
Challenges in Copy Detection
1. Sharing common data does not in itself imply copying.
2. With only a snapshot it is hard to decide which source is a copier.
3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.

| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |
![Page 44: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/44.jpg)
High-Level Intuitions for Copy Detection
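Intuition I: decide dependence (w/o direction)
For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect value (⊥: independent; ∼: dependent)

$$\Pr(\Phi(S_1) \mid S_1 \sim S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \;\Rightarrow\; S_1 \sim S_2$$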
![Page 45: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/45.jpg)
Copying? Not necessarily
Name: Alice, Score: 5
Answers: 1. A 2. C 3. D 4. C 5. B 6. D 7. B 8. A 9. B 10. C
Name: Bob, Score: 5
Answers: 1. A 2. C 3. D 4. C 5. B 6. D 7. B 8. A 9. B 10. C
![Page 46: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/46.jpg)
Copying? Common Errors: very likely
Name: Mary, Score: 1
Answers: 1. A 2. B 3. B 4. D 5. A 6. C 7. C 8. D 9. E 10. C
Name: John, Score: 1
Answers: 1. A 2. B 3. B 4. D 5. A 6. C 7. C 8. D 9. E 10. B
![Page 47: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/47.jpg)
High-Level Intuitions for Copy Detection
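Intuition I: decide dependence (w/o direction)
For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect data (⊥: independent; ∼: dependent)

$$\Pr(\Phi(S_1) \mid S_1 \sim S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \;\Rightarrow\; S_1 \sim S_2$$

Intuition II: decide copying direction
Let F be a property function of the data (e.g., accuracy of data). If

$$|F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_1) - \Phi(S_2))| > |F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_2) - \Phi(S_1))|$$

then S1 is the more likely copier: the data it shares with S2 looks less like the rest of its own data than like the rest of S2's.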
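Intuition I can be turned into a small Bayesian calculation. The sketch below is a simplified model written for illustration, not necessarily the exact model behind the slides: both sources share one accuracy, a copier copies each item with a fixed probability, and every parameter value (accuracy 0.9, 4 false choices per item, copy rate 0.8, prior 0.2) is hypothetical.

```python
def copy_posterior(k_true, k_false, k_diff,
                   accuracy=0.9, n_false=4, copy_rate=0.8, prior=0.2):
    """Posterior probability that two sources are dependent, given that they
    share k_true true values, share k_false false values, and differ on k_diff items."""
    # Per-item probabilities when the two sources are independent.
    p_true = accuracy ** 2
    p_false = (1 - accuracy) ** 2 / n_false
    p_diff = 1 - p_true - p_false
    # Per-item probabilities when one source copies an item with probability copy_rate
    # and otherwise answers independently.
    q_true = accuracy * copy_rate + p_true * (1 - copy_rate)
    q_false = (1 - accuracy) * copy_rate + p_false * (1 - copy_rate)
    q_diff = p_diff * (1 - copy_rate)

    like_independent = p_true ** k_true * p_false ** k_false * p_diff ** k_diff
    like_copy = q_true ** k_true * q_false ** k_false * q_diff ** k_diff
    return prior * like_copy / (prior * like_copy + (1 - prior) * like_independent)


# Ten shared correct answers: agreement on true values is weak evidence.
same_true = copy_posterior(k_true=10, k_false=0, k_diff=0)
# Nine shared wrong answers and one difference: strong evidence of copying.
same_false = copy_posterior(k_true=0, k_false=9, k_diff=1)
print(round(same_true, 3), round(same_false, 3))
```

Under these assumed parameters, sharing correct answers leaves the posterior below one half, while sharing the same wrong answers pushes it close to one, matching the exam-sheet examples.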
![Page 48: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/48.jpg)
Copying? Different Accuracy: John copies from Alice
Name: Alice, Score: 3
Answers: 1. B 2. B 3. D 4. D 5. B 6. D 7. D 8. A 9. B 10. C
Name: John, Score: 1
Answers: 1. B 2. B 3. D 4. D 5. B 6. C 7. C 8. D 9. E 10. B
![Page 49: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/49.jpg)
Copying? Different Accuracy: Alice copies from John
Name: John, Score: 1
Answers: 1. A 2. B 3. B 4. D 5. A 6. C 7. C 8. D 9. E 10. B
Name: Alice, Score: 3
Answers: 1. A 2. B 3. B 4. D 5. A 6. D 7. B 8. A 9. B 10. C
![Page 50: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/50.jpg)
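Data Fusion w. Copying
Source accuracy:

$$A(S) = \underset{v \in \bar{V}(S)}{\mathrm{Avg}}\; P(v)$$

Source vote count:

$$A'(S) = \ln \frac{n\,A(S)}{1 - A(S)}$$

Value vote count:

$$C(v) = \sum_{S \in \bar{S}(v)} A'(S)$$

Consider dependence, where I(S) is the probability of S independently providing value v:

$$C(v) = \sum_{S \in \bar{S}(v)} A'(S) \cdot I(S)$$

Value probability:

$$P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$$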
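A sketch of the dependence-weighted vote count. It assumes that the providers of a value are processed in a fixed order and that each one may have copied from every earlier provider, with pairwise copy probability 0.99 and per-item copy rate 0.8; these settings echo the "1-.99*.8" annotation in the example slides, but the ordering scheme and the uniform pairwise probability are assumptions for illustration.

```python
def independence_weights(n_providers, p_copy=0.99, copy_rate=0.8):
    """I(S) for each provider of one value, in processing order.

    Assumption: the k-th provider is discounted once for each of the k
    earlier providers it may have copied the value from.
    """
    discount = 1 - copy_rate * p_copy
    return [discount ** k for k in range(n_providers)]


def vote_count(source_votes, weights):
    """C(v): sum over providers of A'(S) * I(S)."""
    return sum(vote * weight for vote, weight in zip(source_votes, weights))


# Three sources provide the same value, each with vote count 1.61.
votes = [1.61, 1.61, 1.61]
naive = vote_count(votes, [1.0, 1.0, 1.0])
weighted = vote_count(votes, independence_weights(3))
print(round(naive, 2), round(weighted, 2))
```

With the discount applied, three mutually dependent providers count for little more than one independent provider.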
![Page 51: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/51.jpg)
Combining Accuracy and Dependence
[Figure: iterative cycle of three steps (Step 1, Step 2, Step 3) linking truth discovery, source-accuracy computation, and copy detection]
Theorem: w/o accuracy, converges
Observation: w. accuracy, converges when #objs >> #srcs
![Page 52: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/52.jpg)
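Example (Cont’d): Round 1

| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |

[Figure: copying-relationship graph over S1 to S5 with edge labels .87, .2, .2, .99, .99, .99; truth discovery for Carey over UCI, AT&T, and BEA, annotated (1-.99*.8=.2) and (.22)]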
![Page 53: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/53.jpg)
Example (Cont’d): Round 2
[Figure: copying-relationship graph over S1 to S5 with edge labels .14, .49, .49, .49, .08, .49, .49, .49; truth discovery for Carey over UCI, AT&T, and BEA]
![Page 54: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/54.jpg)
Example (Cont’d): Round 3
[Figure: copying-relationship graph over S1 to S5 with edge labels .12, .49, .49, .49, .06, .49, .49, .49; truth discovery for Carey over UCI, AT&T, and BEA]
![Page 55: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/55.jpg)
Example (Cont’d): Round 4
[Figure: copying-relationship graph over S1 to S5 with edge labels .10, .48, .49, .50, .05, .49, .48, .50; truth discovery for Carey over UCI, AT&T, and BEA]
![Page 56: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/56.jpg)
Example (Cont’d): Round 5
[Figure: copying-relationship graph over S1 to S5 with edge labels .09, .47, .49, .51, .04, .49, .47, .51; truth discovery for Carey over UCI, AT&T, and BEA]
![Page 57: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/57.jpg)
Example (Cont’d): Round 13
[Figure: copying-relationship graph over S1 to S5 with edge labels .55, .49, .55, .49, .44, .44; truth discovery for Carey over UCI, AT&T, and BEA]
![Page 58: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/58.jpg)
Results on Flight Data
AccuCopy obtains a final precision of .943, much higher than Vote (.864)
This translates to 570 more correct values
![Page 59: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/59.jpg)
Results on Flight Data (II)
![Page 60: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/60.jpg)
SOLOMON: SEEKING THE TRUTH VIA COPY DETECTION
![Page 61: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/61.jpg)
Solomon Project
Copy detection
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]
Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [VLDB’11][EDBT’11]
• Record linkage [VLDB’10b]
Visualization and decision explanation
• Visualization [VLDB’10 demo]
• Decision explanation [WWW’13]
![Page 62: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/62.jpg)
I. Copy Detection
Local Detection, Global Detection [VLDB’10a], Large-Scale Detection
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
• Consider updates [VLDB’09b]
![Page 63: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/63.jpg)
II. Data Fusion
• Consider formatting [VLDB’13a]
• Fusing Pr data
• Evolving values [VLDB’09b]
• Consider source accuracy and copying [VLDB’09a]
• Consider value popularity [VLDB’13b]
![Page 64: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/64.jpg)
II. Data Fusion
Offline Fusion, Online Fusion [VLDB’11]
• Consider formatting [VLDB’13a]
• Fusing Pr data
• Evolving values [VLDB’09b]
• Consider source accuracy and copying [VLDB’09a]
• Consider value popularity [VLDB’13b]
![Page 65: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/65.jpg)
III. Visualization [VLDB Demo’2010]
![Page 66: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/66.jpg)
WHAT’S NEXT?
![Page 67: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/67.jpg)
Why Am I Motivated NOW?
7/2009
2007
2013
![Page 68: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/68.jpg)
Harvesting Knowledge from the Web
The most important Google story this year was the launch of the Knowledge Graph. This marked the shift from a first-generation Google that merely indexed the words and metadata of the Web to a next-generation Google that recognizes discrete things and the relationships between them.
- ReadWrite 12/27/2012
![Page 69: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/69.jpg)
Impact of Google KG on Search
3/31/2013
![Page 70: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/70.jpg)
Where is the Knowledge From?
• Source-specific wrappers
• DOM-tree extractors for Deep Web
• Web tables & Lists
• Free-text extractors
• Crowdsourcing
![Page 71: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/71.jpg)
Challenges in Building the Web-Scale KG
Essentially a large-scale data extraction & integration problem
• Extracting triples: data extraction
• Reconciling entities: record linkage
• Mapping relations: schema mapping
• Resolving conflicts: data fusion
• Detecting malicious sources/users: spam detection
Errors can creep in at every stage
But we require a high precision of knowledge: >99%
![Page 72: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/72.jpg)
New Challenges for Data Fusion
• Handle errors from different stages of data integration
• Fusion for multi-truth data items
• Fusing probabilistic data
• Active learning by crowdsourcing
• Quality diagnosis for contributors (extractors, mappers, etc.)
• Combination of schema mapping, entity resolution, and data fusion
• Etc.
![Page 73: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/73.jpg)
Related Work
Copy detection [VLDB’12 Tutorial]
• Texts, programs, images/videos, structured sources
Data provenance [Buneman et al., PODS’08]
• Focus on effective presentation and retrieval
• Assume knowledge of provenance/lineage
Data fusion [VLDB’09 Tutorial, VLDB’13]
• Web-link based (HUB, AvgLog, Invest, PooledInvest) [Roth et al., 2010-2011]
• IR based (2-Estimates, 3-Estimates, Cosine) [Marian et al., 2010-2011]
• Bayesian based (TruthFinder) [Han, 2007-2008]
![Page 74: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/74.jpg)
Take-Aways
• Web data is not fully trustable and copying is common
• Copying can be detected using statistical approaches
• Leveraging source accuracy, copying relationships, and value similarity can improve fusion results
• Important and more challenging for building Web-scale knowledge bases
![Page 75: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/75.jpg)
Acknowledgements
• Ken Lyons (AT&T Research)
• Divesh Srivastava (AT&T Research)
• Alon Halevy (Google)
• Yifan Hu (AT&T Research)
• Remi Zajac (AT&T Research)
• Songtao Guo (AT&T Interactive)
• Laure Berti-Equille (Institute of Research for Development, France)
• Xuan Liu (Singapore National Univ.)
• Xian Li (SUNY Binghamton)
• Amelie Marian (Rutgers Univ.)
• Anish Das Sarma (Google)
• Beng Chin Ooi (Singapore National Univ.)
![Page 76: Truth Finding on the Deep WEB](https://reader036.vdocument.in/reader036/viewer/2022062316/568168a8550346895ddf4539/html5/thumbnails/76.jpg)
SOLOMON: SEEKING THE TRUTH VIA COPY DETECTION
http://lunadong.com
Fusion data sets: lunadong.com/fusionDataSets.htm