From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
Data, data, everywhere
• AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”
• Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available – not free text; not html; not xml
• Common problems: no documentation, evolving formats, huge volume, error-filled ...
Web Logs
Network Monitoring
Billing Info
Router Configs
Call Details
Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
web server common log format
Data, data, everywhere
AT&T phone call provisioning data
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001
649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
Data, data, everywhere
HA00000000START OF TEST CYCLEaA00000001BXYZ U1AB0000040000100B0000004200HE00000005START OF SUMMARYf 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000HF00000007END OF SUMMARYk 00000008LYXW B1KB0000065G0000009900100000001000020000HB00000009END OF TEST CYCLE
www.opradata.com
Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
Goal
Billing Info
Raw Data
ASCII log files Call Detail
XML
CSV
Standard formats & schema
Visual Information
End-user tools
We want to create this arrow
Half-way there: The PADS System 1.0 [FG pldi 05, FMW popl 06, MFWFG popl 07]
“Ad Hoc” Data Source
Analysis Report
XML
PADS Data Description
PADS Compiler
Generated Libraries (Parsing, Printing, Traversal)
PADS Runtime System (I/O, Error Handling)
XML Converter
Data Profiler
Graphing Tool
Query Engine
Custom App
Graph Information
?
generic description-directed programs, coded once
PADS Language Overview
• Rich base type library:
  – integers: Pint8, Puint32, …
  – strings: Pstring('|'), Pstring_FW(3), ...
  – systems data: Pdate, Ptime, Pip, …
• Type constructors describe complex data sources:
  – sequences: Pstruct, Parray
  – choices: Punion, Penum, Pswitch
  – constraints: arbitrary predicates describe expected semantic properties
  – parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
The Last Mile: The PADS System 2.0
Chunking & Tokenization
Structure Discovery
Format Refinement
PADS Data Description
Scoring Function
Raw Data
PADS Compiler
Profiler
XMLifier
Analysis Report
XML
Format Inference Engine
Chunking & Tokenization
Structure Discovery
• Convert raw input into a sequence of “chunks.”
• Supported divisions:
  – Various forms of “newline”
  – File boundaries
• Also possible: user-defined “paragraphs”
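A minimal chunking sketch (mode names are illustrative; a real “paragraph” mode would take a user-supplied splitter):

```python
def chunk(raw: str, mode: str = "newline"):
    """Split raw input into chunks -- the records over which
    structure is later inferred. Only two divisions are sketched."""
    if mode == "newline":
        # normalize Windows and old-Mac line endings, drop empty lines
        lines = raw.replace("\r\n", "\n").replace("\r", "\n").split("\n")
        return [line for line in lines if line]
    if mode == "file":
        return [raw]                  # the whole file is one chunk
    raise ValueError(f"unsupported chunking mode: {mode}")
```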
Chunking Process
Tokenization
• Tokens/base types expressed as regular expressions.
• Basic tokens: integers, white space, punctuation, strings
• Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
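A minimal sketch of such a regex-driven tokenizer. The token names and patterns are illustrative assumptions; the real system generates its lexer from configuration files (see the tokenization slide at the end):

```python
import re

# Ordered token definitions: distinctive tokens (here, just IP) are
# tried before generic ones so that "207.136.97.49" lexes as one IP
# token rather than four integers and three dots.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("INT",    r"-?\d+"),
    ("WHITE",  r"[ \t]+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_.]*"),
    ("PUNCT",  r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPECS))

def tokenize(chunk: str):
    """Turn one chunk (e.g. one log line) into a token sequence."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(chunk)]

toks = tokenize('207.136.97.49 - - 200 30')
```

Ordering the alternation is a crude stand-in for real ambiguity resolution; as the final slide notes, overlapping token definitions are a genuine problem.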
Histograms
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
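One plausible way to implement that comparison, as a sketch (the system's actual similarity measure may differ in its details):

```python
def shape(freq_dist):
    """Column heights sorted by height (descending), normalized."""
    total = sum(freq_dist.values())
    return sorted((v / total for v in freq_dist.values()), reverse=True)

def similar(dist_a, dist_b, tol=0.1):
    """Same shape, within tolerance, once columns are sorted by height."""
    a, b = shape(dist_a), shape(dist_b)
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

# Histograms map occurrences-per-chunk -> number of chunks.
comma = {1: 100}         # ',' appears exactly once in each of 100 chunks
quote = {2: 100}         # '"' appears exactly twice in each chunk
mixed = {1: 60, 2: 40}   # an Integer token with varying counts per chunk
```

Note that sorting by height is what makes comma and quote similar even though they occur a different number of times per chunk.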
Clustering
Cluster 1
Group clusters with similar frequency distributions
Cluster 2
Cluster 3
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
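A sketch of one plausible scoring function; the real system's metric is a tuned combination, and the formula below is an assumption for illustration:

```python
def score(cluster, num_chunks):
    """Score a cluster of token histograms. Each histogram maps
    occurrences-per-chunk -> number of chunks with that count.
    Rewards coverage (token present in many chunks) and narrowness
    (few distinct nonzero occurrence counts); the cluster scores as
    its weakest member."""
    per_token = []
    for hist in cluster:
        present = sum(n for count, n in hist.items() if count > 0)
        coverage = present / num_chunks
        width = sum(1 for count in hist if count > 0)
        per_token.append(coverage / width)
    return min(per_token)

# 100 chunks of the running example:
punct_cluster = [{2: 100}, {1: 100}]       # Quote(2) and Comma: narrow
int_cluster   = [{0: 10, 1: 60, 2: 30}]    # Integer: wide distribution
best = max([punct_cluster, int_cluster], key=lambda c: score(c, 100))
```

Here the punctuation cluster wins: its tokens cover every chunk with a single occurrence count, which is exactly the signature of struct-like structure.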
Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.
Find subcontexts
Tokens in selected cluster:
Quote(2) Comma White
Then Recurse...
Inferred type
Structure Discovery Review
• Compute frequency distribution for each token.
• Cluster tokens with similar frequency distributions.
• Create hypothesis about data structure from cluster distributions:
  – Struct
  – Array
  – Union
  – Basic type (bottom out)
• Partition data according to hypothesis & recurse.
• Once structure discovery is complete, later phases massage & rewrite the candidate description to create the final form.
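The steps above can be sketched as a recursion over tokenized chunks. This is a drastically simplified assumption-laden version: the real algorithm picks split tokens by histogram clustering, while this sketch uses "exactly once per chunk" as a cheap stand-in for the struct hypothesis:

```python
def discover(token_seqs):
    """Infer a candidate description (as nested tuples) from chunks,
    each given as a list of (kind, text) tokens. Illustrative only."""
    if all(len(s) == 0 for s in token_seqs):
        return ("EMPTY",)
    kinds = {t[0] for s in token_seqs for t in s}
    if len(kinds) == 1 and all(len(s) == 1 for s in token_seqs):
        return ("BASE", kinds.pop())          # bottom out at a base type
    # Struct hypothesis: some token kind occurs exactly once per chunk;
    # split every chunk there and recurse on the two subcontexts.
    for kind in sorted(kinds):
        if all(sum(t[0] == kind for t in s) == 1 for s in token_seqs):
            idx = [next(i for i, t in enumerate(s) if t[0] == kind)
                   for s in token_seqs]
            lefts  = [s[:i]     for s, i in zip(token_seqs, idx)]
            rights = [s[i + 1:] for s, i in zip(token_seqs, idx)]
            return ("STRUCT",
                    [discover(lefts), ("BASE", kind), discover(rights)])
    # Union hypothesis: partition chunks by their first token kind.
    groups = {}
    for s in token_seqs:
        groups.setdefault(s[0][0] if s else "EMPTY", []).append(s)
    if len(groups) > 1:
        return ("UNION", [discover(groups[k]) for k in sorted(groups)])
    return ("ARRAY", ("BASE", sorted(kinds)[0]))   # degenerate repetition

# "123, 24" vs "345, begin": an INT, a comma, then INT-or-STR
chunks = [[("INT", "123"), (",", ","), ("INT", "24")],
          [("INT", "345"), (",", ","), ("STR", "begin")],
          [("INT", "574"), (",", ","), ("STR", "end")]]
ty = discover(chunks)
```

On the running example this yields a struct of an integer, a comma, and a union of integer and string, matching the inferred type shown on the earlier slide.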
“123, 24”  “345, begin”  “574, end”  “9378, 56”  “12, middle”  “-12, problem”  …
[Histogram figure: per-chunk frequencies of the Quote ("), WhiteSpace, Comma, Integer, and String tokens; y-axis 0–100]
Testing and Evaluation
• Evaluated overall results qualitatively
  – Compared with Excel -- a manual process with limited facilities for representing hierarchy or variation
  – Compared with hand-written descriptions -- performance varies depending on tokenization choices & complexity
• Evaluated accuracy quantitatively
  – For many formats: 95%+ accuracy from 5% of the available data
• Evaluated performance quantitatively
  – Hours to days to hand-write formats
  – After fixing the format, inference appears to scale linearly with data size
  – <1 min on 300K of data
Technical Summary [www.padsproj.org]
• PADS 1.0 is an effective implementation framework for many data processing tasks
• PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
ASCII log files   Binary Traces   struct { ........ ...... ...........}
XML
CSV
End
Execution Time

Data source              SD (s)   Ref (s)   Tot (s)   HW (h)
1967Transactions.short     0.20      2.32      2.56      4.0
MER_T01_01.csv             0.11      2.82      2.92      0.5
Ai.3000                    1.97     26.35     28.64      1.0
Asl.log                    2.90     52.07     55.26      1.0
Boot.log                   0.11      2.40      2.53      1.0
Crashreporter.log          0.12      3.58      3.73      2.0
Crashreporter.log.mod      0.15      3.83      4.00      2.0
Sirius.1000                2.24      5.69      8.00      1.5
Ls-l.txt                   0.01      0.10      0.11      1.0
Netstat-an                 0.07      0.74      0.82      1.0
Page_log                   0.08      0.55      0.65      0.5
quarterlypersonalincome    0.07      5.11      5.18     48
Railroad.txt               0.06      2.69      2.76      2.0
Scrollkeeper.log           0.13      3.24      3.40      1.0
Windowserver_last.log      0.37      9.65     10.07      1.5
Yum.txt                    0.11      1.91      2.03      5.0

SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
Training Time
Minimum Necessary Training Sizes

Data source              90%   95%
Sirius.1000                5    10
1967Transaction.short      5     5
Ai.3000                    5    10
Asl.log                    5    10
Scrollkeeper.log           5     5
Page_log                   5     5
MER_T01_01.csv             5     5
Crashreporter.log         10    15
Crashreporter.log.mod      5    15
Windowserver_last.log      5    15
Netstat-an                25    35
Yum.txt                   30    45
quarterlypersonalincome   10    10
Boot.log                  45    60
Ls-l.txt                  50    65
Railroad.txt              60    75
Problem: Tokenization
• Technical problem:
  – Different data sources assume different tokenization strategies
  – Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
  – Matching the tokenization of the underlying data source can make a big difference in structure discovery
• Current solution:
  – Parameterize the learning system with customizable configuration files
  – Automatically generate the lexer file & basic token types
• Future solutions:
  – Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  – Incorporate probabilities into a sophisticated back-end rewriting system
    • The back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead