data preprocessing
![Page 1: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/1.jpg)
Data Preprocessing
![Page 2: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/2.jpg)
Content
What is data preprocessing, and why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
PAAS Group
![Page 3: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/3.jpg)
Data preprocessing is a data mining technique that transforms raw data into an understandable format.
![Page 4: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/4.jpg)
Why preprocess the data?
![Page 5: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/5.jpg)
Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values or certain attributes of interest
– noisy: containing errors or outliers
– inconsistent: lacking compatibility or agreement between two or more facts
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data
![Page 6: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/6.jpg)
Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
![Page 7: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/7.jpg)
Data preprocessing techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
![Page 8: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/8.jpg)
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar analytical results
![Page 9: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/9.jpg)
Data Preprocessing
![Page 10: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/10.jpg)
Data Cleaning
![Page 11: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/11.jpg)
Data Cleaning
“Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in real-world data.”
![Page 12: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/12.jpg)
Data Cleaning - Missing Values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value (inferred with, e.g., a decision tree or Bayesian formalism)
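Three of the strategies above (ignore the tuple, use a global constant, use the attribute mean) can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the records, the attribute names `age` and `income`, and the helper function names are all hypothetical.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 35, "income": None},
    {"age": 30, "income": 61000},
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, attr, constant):
    """Use a global constant for every missing value of `attr`."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Use the attribute mean, computed over the observed values only."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]

print(drop_incomplete(rows))
print(fill_constant(rows, "age", -1))
print(fill_mean(rows, "income"))
```

Ignoring tuples shrinks the data set, while a global constant or the mean keeps every tuple at the cost of possibly biasing the attribute's distribution.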
![Page 13: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/13.jpg)
Data Cleaning - Noisy Data
• Binning
• Clustering
• Combined computer and human inspection
• Regression
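Binning, the first technique listed above, can be illustrated with smoothing by bin means. The sketch below assumes equal-frequency bins and made-up price values; other variants (smoothing by bin medians or bin boundaries) are equally valid.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```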
![Page 14: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/14.jpg)
Data Cleaning - Inconsistent Data
• Manually, using external references
• Knowledge engineering tools
![Page 15: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/15.jpg)
A Few Important Terms
• Discrepancy Detection
– Human Error
– Data Decay
– Deliberate Errors
• Metadata
• Unique Rules
• Null Rules
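Unique rules and null rules lend themselves to simple automated checks. The sketch below assumes the usual reading of the two terms (a unique rule requires every value of an attribute to be distinct; a null rule defines which markers count as "no value" and whether they are allowed). The records, attribute names, and null markers are hypothetical.

```python
from collections import Counter

def unique_rule_violations(rows, attr):
    """Return the values of `attr` that occur more than once."""
    counts = Counter(r[attr] for r in rows if r[attr] is not None)
    return sorted(value for value, count in counts.items() if count > 1)

def null_rule_violations(rows, attr, null_markers=(None, "", "?")):
    """Return the indexes of rows where `attr` holds a null marker."""
    return [i for i, r in enumerate(rows) if r[attr] in null_markers]

# Hypothetical customer records.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},
    {"id": 3, "email": None},
]

print(unique_rule_violations(customers, "id"))    # [2]
print(null_rule_violations(customers, "email"))   # [1, 3]
```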
![Page 16: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/16.jpg)
Data Integration
![Page 17: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/17.jpg)
Data Integration
“Data integration combines data from multiple sources into a coherent data store (data warehouse).”
![Page 18: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/18.jpg)
Data Integration - Issues
• Entity identification problem
• Redundancy
• Tuple Duplication
• Detecting data value conflicts
![Page 19: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/19.jpg)
Data Transformation
![Page 20: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/20.jpg)
Data Transformation
“Transforming or consolidating data into a form suitable for mining is known as data transformation.”
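Normalization, named among the major tasks as a transformation, is a typical example. The sketch below shows two common variants, min-max scaling and z-score scaling, on made-up income values. The function names are illustrative, and both functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max].
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center values on the mean and scale by the population standard deviation.
    Assumes the standard deviation is non-zero."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative data only.
incomes = [12000, 35000, 73600, 98000]
print(min_max(incomes))
print(z_score(incomes))
```

Min-max scaling preserves the relative spacing of the values but is sensitive to outliers; z-score scaling is often preferred when the minimum and maximum are unknown or unstable.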
![Page 21: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/21.jpg)
Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
![Page 22: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/22.jpg)
Handling Redundant Data in Data Integration
• Redundant data can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
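For numeric attributes, correlation analysis is commonly done with the Pearson correlation coefficient. The sketch below is one possible check, not a fixed procedure: the attribute names and values are invented, and the 0.9 cut-off for flagging redundancy is an arbitrary assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative data: annual revenue is derived from monthly revenue.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]
store_area = [30, 10, 25, 20, 15]

THRESHOLD = 0.9  # assumed cut-off, chosen for illustration only

for name, column in [("annual_revenue", annual_revenue), ("store_area", store_area)]:
    r = pearson(monthly_revenue, column)
    verdict = "likely redundant" if abs(r) >= THRESHOLD else "keep"
    print(f"monthly_revenue vs {name}: r = {r:.3f} ({verdict})")
```

A coefficient near +1 or -1 suggests that one attribute can be derived from the other, as with the derived annual revenue attribute here.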
![Page 23: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/23.jpg)
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
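Aggregation and generalization often go together: low-level values are mapped up a concept hierarchy and then summarized. The sketch below assumes a hypothetical two-level hierarchy (city to country) and invented sales records.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Mumbai": "India",
    "Delhi": "India",
    "Lyon": "France",
    "Paris": "France",
}

# Illustrative (city, sales amount) records.
sales = [("Mumbai", 120), ("Delhi", 80), ("Paris", 200), ("Lyon", 50), ("Mumbai", 30)]

def roll_up(records, hierarchy):
    """Generalize each city to its country, then aggregate (sum) the amounts."""
    totals = defaultdict(int)
    for city, amount in records:
        totals[hierarchy[city]] += amount
    return dict(totals)

print(roll_up(sales, CITY_TO_COUNTRY))
# {'India': 230, 'France': 250}
```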
![Page 24: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/24.jpg)
Data Reduction
![Page 25: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/25.jpg)
Data Reduction
“Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the base data.”
![Page 26: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/26.jpg)
Data Reduction - Strategies
• Data cube aggregation
• Dimension Reduction
• Data Compression
• Numerosity Reduction
• Discretization and concept hierarchy generation
![Page 27: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/27.jpg)
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Decision tree: the root tests A4; its two branches test A1 and A6; each of those ends in the leaves Class 1 and Class 2.
Reduced attribute set: {A1, A4, A6}
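The idea can be sketched with a small ID3-style learner that grows a tree by information gain and keeps only the attributes the tree actually tests. Everything below is an assumption made for illustration: the synthetic data set (where the class depends on A1 when A4 is 1 and on A6 otherwise), the choice of information gain, and the function names. The tree it grows need not have the same shape as the one on the slide; only the reduced attribute set is compared.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on `attr`."""
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def attributes_used(rows, labels, attrs):
    """Grow an ID3-style tree and return the set of attributes it tests."""
    if len(set(labels)) <= 1 or not attrs:
        return set()
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) <= 1e-12:  # no attribute helps any more
        return set()
    used = {best}
    remaining = [a for a in attrs if a != best]
    for value in set(r[best] for r in rows):
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [lab for _, lab in pairs]
        used |= attributes_used(sub_rows, sub_labels, remaining)
    return used

# Synthetic data: all 64 combinations of six binary attributes.
attrs = ["A1", "A2", "A3", "A4", "A5", "A6"]
rows = [dict(zip(attrs, bits)) for bits in product([0, 1], repeat=6)]
labels = ["Class 1" if (r["A1"] if r["A4"] else r["A6"]) else "Class 2" for r in rows]

print(sorted(attributes_used(rows, labels, attrs)))
# ['A1', 'A4', 'A6']
```

Attributes A2, A3, and A5 never appear in the tree, so they are treated as irrelevant and dropped.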
![Page 28: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/28.jpg)
Histograms
• A popular data reduction technique
• Divide data into buckets and store average (sum) for each bucket
• Can be constructed optimally in one dimension.
• Related to quantization problems.
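A minimal sketch of the bucket idea follows, assuming equal-width buckets (the simplest choice, not the optimal construction mentioned above) and made-up values. Only the per-bucket count, sum, and mean are kept in place of the raw data.

```python
def equal_width_histogram(values, num_buckets):
    """Split the value range into equal-width buckets and summarize each one.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # The maximum value lands on the right edge, so clamp it into the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index].append(v)
    return [
        {
            "low": lo + i * width,
            "high": lo + (i + 1) * width,
            "count": len(b),
            "sum": sum(b),
            "mean": sum(b) / len(b) if b else None,
        }
        for i, b in enumerate(buckets)
    ]

# Illustrative data only.
values = [1, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 25, 30, 31]
for bucket in equal_width_histogram(values, 3):
    print(bucket)
```

Fifteen raw values are reduced to three bucket summaries; queries such as approximate counts or averages over a range can then be answered from the summaries alone.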
![Page 29: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/29.jpg)
Clustering
• Partition the data set into clusters and store only the cluster representation
• Can be very effective if the data is clustered
• Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
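Storing only the cluster representation can be sketched with a one-dimensional k-means. The algorithm choice, the starting centroids, and the data are all assumptions made for illustration; each cluster is reduced to its centroid and member count.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers; returns one summary per cluster.
    `centroids` holds the assumed starting positions; `iterations` must be >= 1."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return [{"centroid": centroids[i], "count": len(c)} for i, c in enumerate(clusters)]

# Illustrative data with three obvious groups.
values = [2, 3, 4, 10, 11, 12, 20, 21, 22]
print(kmeans_1d(values, centroids=[0, 10, 25]))
```

Nine values are replaced by three (centroid, count) pairs, which works well here because the data really does fall into tight groups.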
![Page 30: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/30.jpg)
Sampling
• Allows a large data set to be represented by a much smaller random sample (subset) of the data.
• Let a large data set D contain N tuples.
• Methods to reduce data set D:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
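The four methods can be sketched with Python's standard `random` module. The data set D, the grouping into clusters, the strata, and the function names below are hypothetical; a seeded generator is used only to make the run repeatable.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: each tuple is drawn at most once."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: a tuple may be drawn more than once."""
    return [rng.choice(data) for _ in range(n)]

def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and keep every tuple in them."""
    chosen = rng.sample(clusters, m)
    return [t for cluster in chosen for t in cluster]

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw an SRSWOR of the same fraction from every stratum (at least one tuple each)."""
    strata = {}
    for t in data:
        strata.setdefault(stratum_of(t), []).append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)

# Hypothetical data set D of N = 12 tuples: (id, age group).
D = [(i, "young") for i in range(8)] + [(i, "senior") for i in range(8, 12)]
clusters = [D[0:4], D[4:8], D[8:12]]  # assumed grouping, e.g. by storage page

print(srswor(D, 4, rng))
print(srswr(D, 4, rng))
print(cluster_sample(clusters, 1, rng))
print(stratified_sample(D, lambda t: t[1], 0.25, rng))
```

Stratified sampling guarantees that the small "senior" group is represented, which a simple random sample of the same size does not.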
![Page 31: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/31.jpg)
Sampling
Figure: raw data reduced by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)
![Page 32: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/32.jpg)
Sampling
Figure: raw data and the resulting cluster/stratified sample
![Page 33: Data preprocessing](https://reader033.vdocument.in/reader033/viewer/2022061210/548f186ab4795914668b473d/html5/thumbnails/33.jpg)