![Page 1: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/1.jpg)
TDD: Topics in Distributed Databases
(Querying and cleaning big data)
11
Wenfei Fan
University of Edinburgh
![Page 2: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/2.jpg)
What is big data?
22
![Page 3: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/3.jpg)
3
Big data: What is it anyway?
Everyone talks about big data. But what is it?
A departure from our familiar data management!
Volume: horrendously large• PB (1015B)• EB (1018B)
Variety: heterogeneous, semi-structured or unstructured• 9:1 ratio of unstructured data vs. structured data• collecting 95% restaurants requires at least 5000 sources
Velocity: dynamic• think of the Web and Facebook, …
Veracity: trust in its quality• real-life data is typically dirty!
cf. Online ordering of overlapping data sources, PVLDB 7(3), 2013, Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotra
3
![Page 4: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/4.jpg)
4
Why is the data so big?
Big data is a relative notion: 1TB is already too big for your laptop
Worldwide information volume is growing annually at a minimum rate of 59%
A single jet engine produces 20TB (1012B) of data per hour
Facebook has 1.38 billion users, 140 billion links, about 300 PB of data
Genome of human: sampling, biochemistry, immunology, imaging, genetic, phenotypic data • 1 person: 1PB (1015B)• 1000 people: 1EB (1018B)• 1 billion people: 1ZB (1024B)
Gartner 2011
4
![Page 5: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/5.jpg)
Why do we care about big data?
55
![Page 6: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/6.jpg)
6
Example: Medicare
A new game: large number of data sources of big volume
Nature, 2009
6
![Page 7: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/7.jpg)
7
Big data is needed everywhere
The world is becoming data-driven, like it or not!
Social media marketing: • 78% of consumers trust peer (friend, colleague and family
member) recommendations – only 14% trust ad• if three close friends of person X like items P and W, and if X
also likes P, then the chances are that X likes W too Social event monitoring:
• Prevent terrorist attack• The Net Project, Shenzhen, China (Audaque)
Scientific research: • A new yet more effective way to develop theory, by exploring
and discovering correlations of seemingly disconnected factors
7
![Page 8: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/8.jpg)
8
The big data market is BIG
Big Data: The next frontier for innovation, competition and productivity
US HEALTH CARE $300 B
Increase industry value per year by $300 B US RETAIL 60+%
Increase net margin by 60+% MANUFACTURING –50%
Decrease development and assembly costs by 50% GLOBAL PERSONAL LOCATION DATA $100 B
Increase service provider revenue by $100 B EUROPE PUBLIC SECTOR ADMIN 250 B Euro
Increase industry value per year by 250 B EuroMcKinsey Global Institute, May 2011
8
![Page 9: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/9.jpg)
Why study big data?
Want to find a job? • Research and development of big data systems:
ETL, distributed systems (eg, Hadoop), visualization tools, data warehouse, OLAP, data integration, data quality control, …
• Big data applications: social marketing, healthcare, …
• Data analysis: to get values out of big datadiscovering and applying patterns, predicative analysis, business intelligence, privacy and security, …
Prepare you for • graduate study: current research and practical issues;• the job market: skills/knowledge in need
Big data = Big $$$
complexity theory, distributed databases, query answering, algorithms, data quality
9
![Page 10: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/10.jpg)
What challenges are introduced by big data?
1010
![Page 11: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/11.jpg)
11
Big data: Through the eyes of computation
Computer science is the topic about
the computation of function f(x)
Big data: the data parameter x is horrendously large: PB or EB
Are these true?
What is the challenge introduced to query answering?
Fallacies: Big data introduces no fundamental problems Big data = MapReduce (Hadoop) Big data = data quantity (scalability)
11
![Page 12: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/12.jpg)
Flashback: Relational queries
Questions: What is a relational schema? A relation? A relational database? What is a query? What is relational algebra? What does relationally completeness mean? What is a conjunctive query?
DB
query answer
store data
DBMS
updates
The bible for database researchers: Foundations of Databases12
![Page 13: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/13.jpg)
13
Traditional database management systems
A database is a collection of data, typically containing the
information about one or more related organizations.
A database management system (DBMS) is a software
package designed to store and manage databases.
Database: local DBMS: centralized; single processor (CPU); managing local
databases (single memory, disk)
DB
query answer
store data
DBMS
updates
![Page 14: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/14.jpg)
14
Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013• friend(pid1, pid2)• person(pid, name, city)• dine(pid, rid, dd, mm, yy)
SQL query (in fact, a conjunctive query, or an SPC query)
select rid
from friend(pid1, pid2), person(pid, name, city),
dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and
pid2 = dine.pid and city = NYC and yy = 2013
Is it feasible on big data?
Facebook : more than 1.38 billion nodes, and over 140 billion links
14
![Page 15: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/15.jpg)
15
Example queries: Graph pattern matching
Input: A pattern graph Q and a graph G
Output: All the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
Applications• pattern recognition• intelligence analysis• transportation network analysis• Web site classification • social position detection• user targeted advertising• knowledge base disambiguation …
a bijective function f on nodes:
(u,u’ ) ∈ Q iff (f(u), f(u’)) ∈ G
What other graph queries do you know? 15
![Page 16: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/16.jpg)
Find all matches of a pattern in a graphFind all matches of a pattern in a graph
Graph pattern matching
Identify suspects in a drug ring
Identify suspects in a drug ring
16“Understanding the structure of drug trafficking organizations”
pattern graph
B
A1 Am
W
W
W
W W
W
WW
33
1
B
A S
W Is this feasible? Facebook : more than 1.38 billion nodes, and over 140 billion links
16
![Page 17: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/17.jpg)
17
Querying big data: New challenges
A departure from classical theory and traditional techniques
Given a query Q and a dataset D, compute Q(D)
What are new challenges introduced by querying big data?
Does querying big data introduce new fundamental problems?
What new methodology do we need to cope with the sheer size of big data D?
DDQ( ) DDQ( )
traditional database big data (PB or EB)
17
Why?
![Page 18: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/18.jpg)
18
The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Polynomial time queries become intractable on big data!
What happens when it comes to big data?
Using SSD of 6G/s, a linear scan of a data set DD would take
• 1.9 days when DD is of 1PB (1015B)
• 5.28 years when DD is of 1EB (1018B)
O(n) time is already beyond reach on big data in practice!
How long does it take?
What query is this?
18
![Page 19: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/19.jpg)
19
Tractability revisited for big data
BD-tractable queries: properly contained in P unless P = NC
NP and beyond
PP
BD-tractablenot
BD-tractable
Parallel polylog time
Yes, querying big data
comes with new and hard
fundamental problems
19
![Page 20: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/20.jpg)
20
Challenges: query evaluation is costly
Already beyond reach in practice when the data is not very big
Graph pattern matching by subgraph isomorphism
• NP-complete to decide whether there exists a match
• possibly exponentially many matches
intractable even in the traditional complexity theory
Membership problem for relational queries
Input: a query Q, a database D, and a tuple t
Question: Is t in Q(D)?
• NP-complete if Q is a conjunctive query (SPC)
• PSPACE-complete if Q is in relational algebra (SQL)
20
What is the complexity?
![Page 21: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/21.jpg)
DB
P
M
DB
P
M
DB
P
M
interconnection network
10,000 processors
21
Using 10000 SSD of 6G/s, a linear scan of DD might take: 1.9 days/10000 = 16 seconds when DD is of 1PB (1015B)5.28 years/10000 = 4.63 days when DD is of 1EB (1018B)
Only ideally!
Is it still feasible to query big data?
Can we do better if we are given more resources?Parallel and distributed query processing – TDD
Yes, parallel query processing. But how?
![Page 22: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/22.jpg)
The two sides of a coin
Data = quantity + quality
When we talk about big data, we typically mean its quantity: What capacity of a system provides to cope with the sheer size
of the data? Is a query feasible on big data within our available resources? How can we make our queries tractable on big data? . . .
The study of data quality is as important as data quantity
Can we trust the answers to our queries?
Dirty data routinely lead to misleading financial reports, strategic
business planning decision loss of revenue, credibility and customers, disastrous consequences
Veracity!
22
![Page 23: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/23.jpg)
Data consistency
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC
Q1: how many employees are in the NY office?
3 may not be the correct answer: the AC and city in the first tuple are inconsistent!
Error rates: 10% - 75% (telecommunication) 23
![Page 24: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/24.jpg)
Information completeness
FN LN address AC city
Mary Smith 2 Small St 908 NYC
Mary Dupont 10 Elm St 610 PHI
Mary Dupont 6 Main St 212 NYC
Bob Luth 8 Cowan St 215 PHI
Robert Luth 6 Drum St 212 NYC Q2: how many distinct employees have first name Marry?
3 may not be the correct answer:
• The first three tuples refer to the same person • The information may be incomplete
“information perceived as being needed for clinical
decisions was unavailable 13.6%--81% of the time” (2005) 24
![Page 25: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/25.jpg)
Data currency
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q3: what is Mary’s current salary?
In a customer file, within two years about 50% of record may become obsolete” (2002)
Mary RobertEntities:
80k In the real world, salary is monotonically increasing
Consistent, complete, and once correct
25
![Page 26: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/26.jpg)
Data fusion
FN LN address salary status
Mary Smith 2 Small St 50k single
Mary Dupont 10 Elm St 50k married
Mary Dupont 6 Main St 80k married
Bob Luth 8 Cowan St 80k married
Robert Luth 6 Drum St 55k married
Q4: what is Mary’s current last name?
Deduce the true values of an entity
Dupont In real life: • Marital status only changes from single married divorced• Tuples with the most current marital status also have the most
current last name
26
![Page 27: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/27.jpg)
Data in real-life is often dirty
Dirty data: inconsistent, inaccurate, incomplete, stale
500,000 dead people
retain active Medicare
cards
Pentagon asked 200+ dead officers to re-enlist
81 million National Insurance numbers but only 60 million eligible citizens
Data error rates in industry: 1% - 30% (Redman, 1998)
98000 deaths each year, caused by errors in medical data
27
![Page 28: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/28.jpg)
Dirty data are costly
Poor data cost US businesses $611 billion annually
Erroneously priced data in retail databases cost US
customers $2.5 billion each year
1/3 of system development projects were forced to delay or
cancel due to poor data quality
30%-80% of the development time and budget for data
warehousing are for data cleaning
CIA dirty data about WMD in Iraq!
The scale of the data quality problem is far worse on big data!
1998
2001
2000
28
Can we trust answers to our queries in dirty data?
![Page 29: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/29.jpg)
What does this course cover?
2929
Big data = quantity + quality
Volume (quantity) Veracity (quality)
![Page 30: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/30.jpg)
30
Basic topic 1: Parallel database management systems
Recall traditional DBMS: Database: “single” memory, disk DBMS: centralized; single processor (CPU);
Can we do better provided with multiple processors?
Parallel DBMS: exploring parallelism Improve performance Reliability and availability
DB
P
M
DB
P
M
DB
P
M
interconnection network MapReduce
![Page 31: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/31.jpg)
31
Basic topic 2: Distributed databases
Data is stored in several sites, each with an independent DBMS Local ownership: physically stored across different sites Increased availability and reliability Performance
network
DB
DBMS
local schema
DB
DBMS
local schema
DB
DBMS
local schema
network
global schema
query answer
Cloud computing
![Page 32: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/32.jpg)
32
Advanced topic 1: MapReduce
A programming model with two primitive functions:
Applications in cloud computing
Connection between MapReduce and parallel query processing
Other parallel programming models
• BSP (Bulk Synchronous Parallel)
• Vertex-centric
• Partial evaluation
Map: <k1, v1> list (k2, v2)
Reduce: <k2, list(v2)> list (k3, v3)
![Page 33: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/33.jpg)
33
Advanced topic 2: Querying big data
Foundations for querying big data
Tractability revised for querying big data
Parallel scalability
Bounded evaluability of queries
Querying big data: theory and practice 33
Techniques for querying big data
Develop parallel algorithms for querying big data
Bounded evaluability and access constraints
Query preserving compression
Query answering using views
Bounded incremental query processing
![Page 34: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/34.jpg)
34
Central issues for data quality Object identification (data fusion): do two objects refer to the same real-world entity? What is the true value of the entity? Data consistency: do our data values have conflicts? Data accuracy: is one value more accurate than another for a real-word entity? Data currency: is our data out of date? Information completeness: does D have enough information to answer our queries?
Make our data consistent, accurate, complete and up to date!
Big data = quantity + quality!
34
TDD: the Veracity of big data
Advanced topic 3: Data quality management
![Page 35: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/35.jpg)
35
Advanced topic 4: Dependencies as data quality rules
Fundamental problems for data quality rules:• consistency: are the data quality rules “dirty” themselves?• implication: can we optimize the rules by removing redundant
ones?
A revision of classical dependencies
Data quality rules:• Conditional (functional and inclusion) dependencies to
capture data inconsistencies • Matching dependencies for record matching Data
consistency: do our data values have conflicts?• There are also quality rules for data accuracy, data currency
and information completeness – in the textbook
A uniform logic framework for improving data quality
![Page 36: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/36.jpg)
36
Discover rules
Reasoning
Detect errors
Repair
Advanced topic 5: Data cleaning
Semi-automated systems for improving data quality
• Discover data quality rules• Validate rules discovered• Detect errors with rules• Repairing data with rules• Certain fixes• Deducing the true values of
entities
![Page 37: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/37.jpg)
37
Putting together
Basic technology Parallel DBMS: architectures, data partition, (intra/inter) operator
parallelism, parallel query processing and optimization Distributed DBMS: architectures, fragmentation, replication
Advanced topics Big data: the Volume
– MapReduce and other parallel programming models– Querying big data: theory and practice
Big data: the Veracity– Central issues for data quality– Dependencies as data quality rules– Cleaning distributed data: rule discovery, rule validation, error
detection, data repairing, certain fixes
Prerequisites
Volume (quantity) Veracity (quality)• Variety (entity resolution, conflict resolution• Velocity (incremental computation)
relational algebra/SQL, query processing, basic complexity and algorithmic
background (e.g., NP, undecidability)
![Page 38: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/38.jpg)
Course format
3838
![Page 39: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/39.jpg)
39
Basic information
Web site:
http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html – Syllabus – Announcements – Lecture notes– deadlines
TA: Chao Tian– [email protected]
Office hours:– Informatics Forum 5.23, 11:00-12:00, Thursday
![Page 40: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/40.jpg)
40
Course format
Seminar course: there will be no exam!– Lectures: background.
http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html– Textbook:
R. Ramakrishnan, J. Gehrke: Database Management Systems. WCB/McGraw-Hill 2003 (3rd edition). Chap 22
Database System Concept, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012 (Chapters 1-4; e-copy available upon request)
– Research papers or chapters related to the topics (3-4 each)• At the end of ln3-ln8
![Page 41: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/41.jpg)
41
Grading
Reviews of research papers (8 in total) 40% Project (report): 45% Project presentation: 15%
Homework:
Four sets of homework, starting from week 4; deadlines:• 9am, Thursday, February 5, week 4• 9am, Thursday, February 19, week 6• 9am, Thursday, March 5, week 8• 9am, Thursday, March 19, week 10
– Papers: choose two each time (two reviews) – not chapters– 5% for each paper, and 10% for each homework
down from 12, 2012
![Page 42: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/42.jpg)
42
Review Evaluation
Pick 2 research papers each time from the lecture note to be covered in next two weeks, starting from Week 4.
Write a one-page review for each of the papers, 10 marksSummary: 2 marks
• A clear problem statement: input, question/output • The need for this line of research: motivation• A summary of key ideas, techniques and contributions
Evaluation: 5 marks – Criteria for the line of research (e.g., expressive power,
complexity, accuracy, scalability, etc)– Evaluation based on your criteria; justify your evaluation
• 3 strong points• 3 weak points
Suggest possible extensions: 3 marks
![Page 43: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/43.jpg)
43
Project – Research and development (recommended)
Research and development:– Topic: pick one from lecture notes (ln3 – ln8)
Example: A MapReduce algorithm for graph simulation
Development– Pick a research paper from the reading list of ln3—ln8– Implement its main algorithms– Conduct its experimental study
Multiple people may work on the same project independently
You are encouraged to come up with your own project – talk to me first
Start early!
![Page 44: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/44.jpg)
44
Grading – design and development
Distribution:– Algorithms: technical depth, performance guarantees 20%– Prove the correctness, complexity analysis and performance
guarantees of your algorithms 15%– Justification (experimental evaluation) 10%
Report: in the form of technical report/research paper – Introduction: problem statement, motivation– Related work: survey– Techniques; algorithms, illustration via intuitive examples– Correctness/complexity/property/proofs– Experimental evaluation– Possible extensions
![Page 45: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/45.jpg)
45
Project – survey
Topic: pick one topic from a lecture note (ln3 – ln8)
Example: techniques for conflict resolution Distribution:
– Select 5-6 representative papers, independently 10%
– Develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria 10%
– Evaluate each of the papers based on your criteria 15%– A table to summarize the assessment, based on your
criteria, draw and justify your conclusion and recommendation for various application 10%
Sample survey: A Brief Survey of Automatic Methods for Author Name Disambiguation
Find and download it from Google
Your understanding of the topic
![Page 46: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/46.jpg)
46
Project report and presentation – 15%
A clear problem statement
Motivation and challenges
Key ideas, techniques/approaches
Key results – what you have got, intuitive examples
Findings/recommendations for different applications
Demonstration: a must if you do a development project
Presentation: question handling (show that you have developed
a good understanding of the line of work)
Learn how to present your work
![Page 47: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/47.jpg)
47
Summary and Review
What is big data?
What is the volume of big data? Variety? Velocity? Veracity?
Why do we care about big data?
Is there any fundamental challenge introduced by querying big
data?
Why study data quality?
What is consistency? Information completeness? Data
currency? Data accuracy? Object identification?
![Page 48: TDD: Topics in Distributed Databases (Querying and cleaning big data) 11 Wenfei Fan University of Edinburgh](https://reader036.vdocument.in/reader036/viewer/2022062407/56649cfa5503460f949cc4cc/html5/thumbnails/48.jpg)
48
Reading list
For next week, parallel databases, before the next lecture– Database Management Systems, 2nd edition, R. Ramakrishnan
and J. Gehrke, Chapter 22.– Database System Concept, 4th edition, A. Silberschatz, H. Korth,
S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
About relational databases: – Foundations of databases, S. Abiteboul, R. Hull, V. VIanu
About big data
– W. Fan and J. Huai. Querying Big Data: Theory and Practice, JCST 2014
http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf