Data Quality, Data Cleaning and Treatment of Noisy Data DIMACS Workshop November 3-4, 2003 Organizer: Tamraparni Dasu, AT&T Labs - Research

Data Quality, Data Cleaning and Treatment of Noisy Data

DIMACS Workshop, November 3-4, 2003

Organizer: Tamraparni Dasu, AT&T Labs - Research

Why DQ?

• Data quality problems are expensive and pervasive

– DQ problems cost hundreds of billions of $$$ each year.

• Lost revenues, credibility, customer retention

– Resolving data quality problems is often the biggest effort in a data mining study.

• 50%-80% of time in data mining projects spent on DQ

– Interest in streamlining business operations databases to increase operational efficiency (e.g. cycle times), reduce costs, conform to legal requirements

The Data Quality Continuum

• Data/information is not static; it flows through a data collection and usage process

– Data gathering

– Data delivery

– Data storage

– Data integration

– Data retrieval

– Data mining/analysis

• Problems can and do arise at all of these stages

• End-to-end, continuous monitoring needed
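One simple form of end-to-end monitoring is to compare record counts between consecutive stages of the flow listed above. The following is a minimal sketch, not a method from the workshop: the function name `flag_stage_losses`, the 1% tolerance, and the sample counts are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def flag_stage_losses(
    stage_counts: Sequence[Tuple[str, int]],
    tolerance: float = 0.01,  # assumed threshold: flag losses above 1%
) -> List[Tuple[str, float]]:
    """Return (stage, loss_fraction) for every stage that holds fewer
    records than the preceding stage by more than `tolerance`."""
    flagged = []
    for (_, prev_n), (name, n) in zip(stage_counts, stage_counts[1:]):
        if prev_n == 0:
            continue  # nothing arrived upstream, so no loss can be computed
        loss = (prev_n - n) / prev_n
        if loss > tolerance:
            flagged.append((name, loss))
    return flagged


# Hypothetical record counts observed at each stage of the continuum.
counts = [
    ("gathering", 10000),
    ("delivery", 9990),
    ("storage", 9990),
    ("integration", 9400),
    ("retrieval", 9400),
]

for stage, loss in flag_stage_losses(counts):
    print(f"{stage}: lost {loss:.1%} of records from the previous stage")
```

A drop at a stage is not necessarily an error (integration may legitimately merge duplicates), so a flag here is a prompt for investigation rather than a verdict.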

Technical Approaches

• Need a multi-disciplinary approach

– No single approach solves all problems

• Process management

– Pertains to data process and flows

– Checks and controls, audits

• Database

– Storage, access, manipulation and retrieval

• Metadata / domain expertise

– Interpretation and understanding

• Analysis – Data Mining, Statistics

– Analysis, diagnosis, model fitting, prediction, decision making …

Meaning of Data Quality –1

• Conventional definitions: completeness, uniqueness, consistency, accuracy, etc.

– Measurable?

• Modernize the definition of DQ with respect to the DQ continuum

• Depends on data paradigms (data gathering, storage)

– Federated, High dimensional, Descriptive, Longitudinal, Streaming, Web (scraped), Numeric, Text data
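Two of the conventional dimensions named above, completeness and uniqueness, can be turned into simple ratios. The sketch below is one possible way to do that, not a definition given in the slides: the function names, the choice of `None` and the empty string as "missing", and the sample records are assumptions made for illustration.

```python
from typing import Any, Mapping, Sequence

# Assumption: these values are treated as "missing".
MISSING = (None, "")


def completeness(records: Sequence[Mapping[str, Any]], field: str) -> float:
    """Fraction of records in which `field` is present and not missing."""
    if not records:
        return 1.0  # convention chosen here: an empty table is complete
    filled = sum(1 for r in records if r.get(field) not in MISSING)
    return filled / len(records)


def uniqueness(records: Sequence[Mapping[str, Any]], key: str) -> float:
    """Fraction of non-missing `key` values that are distinct."""
    values = [r.get(key) for r in records if r.get(key) not in MISSING]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


# Hypothetical customer records with a duplicated id and two missing phones.
records = [
    {"id": 1, "phone": "555-0100"},
    {"id": 2, "phone": ""},
    {"id": 2, "phone": "555-0101"},
    {"id": 3, "phone": None},
]

print(completeness(records, "phone"))  # 0.5
print(uniqueness(records, "id"))       # 0.75
```

Consistency and accuracy are harder to score this way because they need rules or a trusted reference to compare against, which is part of what the "Measurable?" question points at.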

DQ Meaning - 2

• Depends on applications (delivery, integration, analysis)

– Business operations, Aggregate analysis, prediction

– Customer relations …

• Data Interpretation

– Know all the rules used to generate the data

• Data Suitability

– Use of proxy data

– Relevant data is missing

Increased DQ ⇒ increased reliability and usability (directionally correct)

Workshop

• Talks cover different aspects of the complex DQ issue

• Outstanding set of speakers from academia, industrial labs and industry

• Cover theoretical, methodological, applied aspects – case studies!

• From a wide range of disciplines and areas

Welcome!

Renee Miller

• University of Toronto

• Renee is an Associate Professor of Computer Science at the University of Toronto. S.B., Mathematics, MIT. S.B., Cognitive Science, MIT. Ph.D., Computer Science, U. Wisconsin-Madison.

• Heterogeneous databases, data mining, and data warehousing.

• “Managing Inconsistency in Data Exchange and Integration”

Grace Zhang

• Morgan Stanley Institutional Equity Division IT. Master of Philosophy in Computer Science from Columbia University, and a Master's and B.S. in Computer Science from Zhongshan University, China.

• Develop tools to check data quality issues in equity trading data, design and build the standard destination referential data repository.

• “Data Quality in Trading Surveillance”

Ted Johnson

• AT&T Labs – Research

• Database Research department. B.S. in Mathematics, Johns Hopkins University, Ph.D. in Computer Science, New York University, 1990.

• Data warehousing and data mining

• “Bellman - A Data Quality Browser”

Ron Pearson

• Daniel Baugh Institute for Functional Genomics and Computational Biology, Thomas Jefferson University. B.S. in physics from the University of Arkansas at Monticello and M.S.E.E. and Ph.D. in electrical engineering from M.I.T. in 1982.

• Design and analysis of nonlinear digital filters, exploratory data analysis and the validation of analytical results.

• “The Data Cleaning Problem -- Some Key Issues and Practical Approaches”

Dhammika Amaratunga, Javier Cabrera, Nandini Raghavan

• Johnson & Johnson, Rutgers University, Johnson & Johnson

• “Pre-processing of Microarray Data”

S. Muthukrishnan

• Rutgers University, AT&T Labs – Research

• Associate Professor of Computer Science

• Design and analysis of algorithms

• “Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases”

T. Bonates, P. Hammer, A. Kogan, and I. Lozina

• RUTCOR, Rutgers University

• Operations Research

• “Maximum Patterns and Outliers in the Logical Analysis of Data (LAD)”

Jiawei Han

• Professor, Simon Fraser University. Currently at the University of Illinois at Urbana-Champaign. Ph.D. from the University of Wisconsin, Madison in 1985.

• Data mining (knowledge discovery in databases), data warehousing, spatial databases, multimedia databases, deductive and object-oriented databases, and logic programming

• “Data Mining: A Powerful Tool for Data Cleaning”

Jon Hill

• British Telecommunications

• Jon leads a team of information experts to deliver solutions within asset management, process control and billing assurance. Jon uses a wide range of information quality tools within projects and has extensive experience in investigating and solving IQ problems.

• “A $220 Million Success Story”

G. Vesonder, J. Wright & T. Dasu

• AT&T Labs - Research

• Head of Adaptive Systems research

• AI, Knowledge Engineering, Expert Systems

• “Life Cycle Datamining”

Andrew Hume

• AT&T Labs – Research

• Very large data systems, string searching, performance measurement

• Tamed many legacy systems

• “Managing Data Streams”

Bing Liu

• Associate Professor at the National University of Singapore, on leave at the University of Illinois at Chicago

• Data mining and knowledge discovery; web, text and image mining; Bioinformatics

• “Web Page Cleaning for Web Data Mining”

R.K. Pearson and M. Gabbouj

• Collaboration with Moncef Gabbouj from the Tampere University of Technology in Finland.

• “Relational Nonlinear FIR Filters”

Thank you!