Visual Profiling of Large Statistical Datasets
Martijn Tennekes, Edwin de Jonge, Piet Daas
February 10, 2011
Outline
Introduction
Tableplot description
Applications
Implementation in R
Visual Profiling of Large Statistical Datasets 2
Introduction
Large statistical dataset
Administrative sources
Survey data
Quality assessment at a technical level
Step 1: Technical checks (e.g. readability and convertibility)
Step 2: Data profiling
Representation and distribution of values
Strange data patterns
Occurrence of missing values
Introduction
Traditional approach:
Introduction
New approach:
Tableplot description
DATA
v1 v2 .... v12
100,000 records
12 variables
100 row bins: 1, 2, 3, ..., 100
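The binning step can be sketched in base R. This is a minimal illustration, not the tabplot implementation: the simulated data, the variable names and the choice of summaries (bin means for numeric variables, category fractions for categorical ones) are assumptions made for the example. Only the 100,000 records and the 100 row bins come from the slide.

```r
# Simulated stand-in for the 100,000-record dataset (3 of the 12 variables).
set.seed(1)
n <- 100000
nBins <- 100
dat <- data.frame(
  v1 = rlnorm(n),                                            # numeric sort variable
  v2 = rnorm(n),                                             # another numeric variable
  v3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))   # categorical variable
)

# 1. Sort the records by one variable (here v1, descending).
sorted <- dat[order(dat$v1, decreasing = TRUE), ]

# 2. Assign each sorted record to one of nBins equally sized row bins.
bin <- ceiling(seq_len(n) * nBins / n)

# 3. Summarise each bin: means for numeric variables,
#    category fractions for the categorical variable.
mean_v1 <- tapply(sorted$v1, bin, mean)
mean_v2 <- tapply(sorted$v2, bin, mean)
frac_v3 <- prop.table(table(bin, sorted$v3), margin = 1)
```

Each of the 100 bins then holds 1,000 records, and the per-bin summaries are what would be drawn, one column per variable.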
Tableplot description
Quality measures:
1 Smoothness of a data distribution
2 Selectivity of missing values
3 Distribution of correlated variables
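As one possible reading of the second measure, the fraction of missing values can be computed per row bin after sorting on another variable. The sketch below uses simulated data; the variable names (turnover, employees) and the missingness mechanism are hypothetical and chosen only to make the pattern visible.

```r
set.seed(42)
n <- 10000
nBins <- 100

# Hypothetical variables, not taken from the slides.
turnover  <- rlnorm(n, meanlog = 5)
employees <- round(turnover / 10 + runif(n))

# Assumed selective missingness: the smallest 20% of units
# fail to report 'employees' about half of the time.
small <- turnover < quantile(turnover, 0.2)
employees[small & runif(n) < 0.5] <- NA

# Sort on turnover (descending) and compute the missing fraction per row bin.
ord <- order(turnover, decreasing = TRUE)
bin <- ceiling(seq_len(n) * nBins / n)
miss_frac <- tapply(is.na(employees[ord]), bin, mean)
```

Missing values that are spread evenly over the bins would suggest non-selective missingness. Here they are concentrated in the last 20 bins, which is the kind of pattern a tableplot is meant to expose.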
Applications
Structural Business Statistics (SBS)
Large business survey
Circa 50,000 respondents
Data editing and analysis process:
1 Unprocessed data
2 Edited data
3 Data prepared for analysis
Unprocessed data
Edited data
Data prepared for analysis
Comparison with other sources
Comparison:
SBS turnover
VAT turnover
STS turnover
Implementation in R
Package tabplot
Available on CRAN
Functions
tableplot: tableplot(myDataFrame, colNames = myColumnNames, nBins = 100)
num2fac: num2fac(myNumericVector, method = "pretty", n = 5)
tableGUI: tableGUI()
Supports very large datasets (up to 2 · 10^9 records)
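A small usage sketch, under stated assumptions: the example data frame is invented, and the call to tableplot uses only the nBins argument shown on the slide, since other argument names may differ between package versions. The cut/pretty lines are a base-R analogue of the idea behind num2fac with method = "pretty", not the package's own code.

```r
# Hypothetical example data (not from the slides).
set.seed(7)
myDataFrame <- data.frame(
  turnover  = rlnorm(10000, meanlog = 5),
  employees = rpois(10000, lambda = 20),
  sector    = factor(sample(c("A", "B", "C"), 10000, replace = TRUE))
)

# Base-R analogue of turning a numeric vector into about 5 "pretty" classes.
brks <- pretty(myDataFrame$turnover, n = 5)
turnoverClass <- cut(myDataFrame$turnover, breaks = brks, include.lowest = TRUE)

# Draw a tableplot only when the tabplot package is installed.
if (requireNamespace("tabplot", quietly = TRUE)) {
  tabplot::tableplot(myDataFrame, nBins = 100)
}
```

The guard around the tableplot call keeps the sketch runnable on systems where the package is not available.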
Conclusion
Quality assessment
Existing data sources
New data sources
Effective method to support top-down data analysis
Apply tableplot to other sources
Further improve tableplots