
Data Quality, Data Cleaning and Treatment of Noisy Data

DIMACS Workshop, November 3-4, 2003

Organizer: Tamraparni Dasu, AT&T Labs - Research

Why DQ?

• Data quality problems are expensive and pervasive

– DQ problems cost hundreds of billions of $$$ each year.

• Lost revenues, credibility, customer retention

– Resolving data quality problems is often the biggest effort in a data mining study.

• 50%-80% of time in data mining projects spent on DQ

– Interest in streamlining business operations databases to increase operational efficiency (e.g. cycle times), reduce costs, conform to legal requirements

The Data Quality Continuum

• Data/information is not static; it flows in a data collection and usage process

– Data gathering

– Data delivery

– Data storage

– Data integration

– Data retrieval

– Data mining/analysis

• Problems can and do arise at all of these stages

• End-to-end, continuous monitoring needed
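One simple form of end-to-end monitoring is to reconcile record counts between consecutive stages of the continuum. The sketch below is only an illustration: the stage names follow the list above, while the function `find_losses`, the 1% tolerance, and the counts are hypothetical and not taken from the workshop material.

```python
from typing import Dict, List, Tuple

# Stage names follow the slide; the order is the order of the data flow.
STAGES = ["gathering", "delivery", "storage",
          "integration", "retrieval", "analysis"]


def find_losses(counts: Dict[str, int],
                tolerance: float = 0.01) -> List[Tuple[str, str, float]]:
    """Return (from_stage, to_stage, loss_fraction) for every hop
    that loses more than `tolerance` of its records."""
    alerts = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        before, after = counts[prev], counts[curr]
        if before == 0:
            continue  # nothing to lose; skip to avoid division by zero
        loss = (before - after) / before
        if loss > tolerance:
            alerts.append((prev, curr, loss))
    return alerts


# Hypothetical record counts observed at each stage.
counts = {"gathering": 1000, "delivery": 1000, "storage": 950,
          "integration": 948, "retrieval": 948, "analysis": 948}

alerts = find_losses(counts)
for src, dst, loss in alerts:
    print(f"{src} -> {dst}: lost {loss:.1%} of records")
```

Counting records is only one possible check; the same loop could compare checksums, null rates, or value distributions between stages.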

Technical Approaches

• Need a multi-disciplinary approach

– No single approach solves all problems

• Process management

– Pertains to data process and flows

– Checks and controls, audits

• Database

– Storage, access, manipulation and retrieval

• Metadata / domain expertise

– Interpretation and understanding

• Analysis – Data Mining, Statistics

– Analysis, diagnosis, model fitting, prediction, decision making …

Meaning of Data Quality - 1

• Conventional definitions: completeness, uniqueness, consistency, accuracy, etc.

– Measurable?

• Modernize the definition of DQ with respect to the DQ continuum

• Depends on data paradigms (data gathering, storage)

– Federated, high-dimensional, descriptive, longitudinal, streaming, Web (scraped), numeric, text data
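The conventional definitions can be made measurable in simple cases. The sketch below shows one possible way to score completeness and uniqueness as fractions between 0 and 1; the function names, the treatment of `None` and empty strings as missing, and the sample records are all assumptions made for illustration.

```python
from typing import Any, Mapping, Sequence

Record = Mapping[str, Any]
MISSING = (None, "")  # assumption: these values count as "not filled in"


def completeness(records: Sequence[Record], field: str) -> float:
    """Fraction of records in which `field` is present and not missing."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in MISSING)
    return filled / len(records)


def uniqueness(records: Sequence[Record], key: str) -> float:
    """Fraction of distinct values among the non-missing values of `key`."""
    values = [r[key] for r in records if r.get(key) not in MISSING]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


# Hypothetical customer records, made up for this example.
customers = [
    {"id": 1, "phone": "555-0100"},
    {"id": 2, "phone": None},
    {"id": 2, "phone": "555-0101"},  # duplicate id
    {"id": 3, "phone": ""},
]

phone_completeness = completeness(customers, "phone")
id_uniqueness = uniqueness(customers, "id")
print(phone_completeness, id_uniqueness)
```

Consistency and accuracy are harder to score this way, since they need rules or reference data against which the records can be compared.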

DQ Meaning - 2

• Depends on applications (delivery, integration, analysis)

– Business operations, aggregate analysis, prediction

– Customer relations …

• Data Interpretation

– Know all the rules used to generate the data

• Data Suitability

– Use of proxy data

– Relevant data is missing

Increased DQ ⇒ increased reliability and usability (directionally correct)

Workshop

• Talks cover different aspects of the complex DQ issue

• Outstanding set of speakers from academia, industrial labs and industry

• Cover theoretical, methodological, applied aspects – case studies!

• From a wide range of disciplines and areas

Welcome!

Renee Miller

• University of Toronto

• Renee is an Associate Professor of Computer Science at the University of Toronto. S.B., Mathematics, MIT. S.B., Cognitive Science, MIT. Ph.D., Computer Science, U. Wisconsin-Madison.

• Heterogeneous databases, data mining, and data warehousing.

• “Managing Inconsistency in Data Exchange and Integration”

Grace Zhang

• Morgan Stanley Institutional Equity Division IT. Master of Philosophy in Computer Science from Columbia University, and a Master's and B.S. in Computer Science from Zhongshan University, China.

• Develops tools to check data quality issues in equity trading data; designs and builds the standard destination referential data repository.

• “Data Quality in Trading Surveillance”

Ted Johnson

• AT&T Labs – Research

• Database Research department. B.S. in Mathematics, Johns Hopkins University; Ph.D. in Computer Science, New York University, 1990.

• Data warehousing and data mining

• “Bellman - A Data Quality Browser”

Ron Pearson

• Daniel Baugh Institute for Functional Genomics and Computational Biology, Thomas Jefferson University. B.S. in physics from the University of Arkansas at Monticello, and M.S.E.E. and Ph.D. in electrical engineering from M.I.T. in 1982.

• Design and analysis of nonlinear digital filters, exploratory data analysis and the validation of analytical results.

• “The Data Cleaning Problem -- Some Key Issues and Practical Approaches”

Dhammika Amaratunga, Javier Cabrera, Nandini Raghavan

• Johnson & Johnson, Rutgers University, Johnson & Johnson

• “Pre-processing of Microarray Data”

S. Muthukrishnan

• Rutgers University, AT&T Labs – Research

• Associate Professor of Computer Science

• Design and analysis of algorithms

• “Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases”

T. Bonates, P. Hammer, A. Kogan, and I. Lozina

• RUTCOR, Rutgers University

• Operations Research

• “Maximum Patterns and Outliers in the Logical Analysis of Data (LAD)”

Jiawei Han

• Professor, Simon Fraser University. Currently at the University of Illinois at Urbana-Champaign. Ph.D. from the University of Wisconsin-Madison in 1985.

• Data mining (knowledge discovery in databases), data warehousing, spatial databases, multimedia databases, deductive and object-oriented databases, and logic programming

• “Data Mining: A Powerful Tool for Data Cleaning”

Jon Hill

• British Telecommunications

• Jon leads a team of information experts to deliver solutions within asset management, process control and billing assurance. Jon uses a wide range of information quality tools within projects and has extensive experience in investigating and solving IQ problems.

• “A $220 Million Success Story”

G. Vesonder, J. Wright & T. Dasu

• AT&T Labs - Research

• Head of Adaptive Systems research

• AI, Knowledge Engineering, Expert Systems

• “Life Cycle Datamining”

Andrew Hume

• AT&T Labs – Research

• Very large data systems, string searching, performance measurement

• Tamed many legacy systems

• “Managing Data Streams”

Bing Liu

• Associate Professor at the National University of Singapore, on leave at the University of Illinois at Chicago

• Data mining and knowledge discovery; web, text and image mining; Bioinformatics

• Web page cleaning for web data mining

R.K. Pearson and M. Gabbouj

• Collaboration with Moncef Gabbouj from the Tampere University of Technology in Finland.

• “Relational Nonlinear FIR Filters”

Thank you!
