from dirt to shovels: fully automatic tool generation from ascii data david walker pamela dragosh...
Post on 20-Dec-2015
![Page 1: From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data David Walker Pamela Dragosh Mary Fernandez Kathleen Fisher Andrew Forrest Bob Gruber](https://reader035.vdocument.in/reader035/viewer/2022081519/56649d4c5503460f94a29e35/html5/thumbnails/1.jpg)
From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
Data, data, everywhere
• AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”
• Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available – not free text; not HTML; not XML
• Common problems: no documentation, evolving formats, huge volume, error-filled ...
Web Logs
Network Monitoring
Billing Info
Router Configs
Call Details
Data, data, everywhere
Web server common log format:
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
Data, data, everywhere
AT&T phone call provisioning data:
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
Data, data, everywhere
www.opradata.com
HA00000000START OF TEST CYCLE
aA00000001BXYZ U1AB0000040000100B0000004200
HE00000005START OF SUMMARY
f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000
HF00000007END OF SUMMARY
k 00000008LYXW B1KB0000065G0000009900100000001000020000
HB00000009END OF TEST CYCLE
Data, data, everywhere
www.geneontology.org
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
Goal
[Diagram labels: Raw Data (ASCII log files, Call Detail, Billing Info); Standard formats & schema (XML, CSV); End-user tools (Visual Information). Annotation: "We want to create this arrow."]
Half-way there: The PADS System 1.0 [FG pldi 05, FMW popl 06, MFWFG popl 07]
[Diagram labels: "Ad Hoc" Data Source; PADS Data Description; PADS Compiler; Generated Libraries (Parsing, Printing, Traversal); PADS Runtime System (I/O, Error Handling); tools: XML Converter, Data Profiler, Graphing Tool, Query Engine, Custom App; outputs: Analysis Report, XML, Graph Information, ?. Annotation: generic description-directed programs, coded once.]
PADS Language Overview
• Rich base type library:
  – integers: Pint8, Puint32, …
  – strings: Pstring('|'), Pstring_FW(3), ...
  – systems data: Pdate, Ptime, Pip, …
• Type constructors describe complex data sources:
  – sequences: Pstruct, Parray
  – choices: Punion, Penum, Pswitch
  – constraints: arbitrary predicates describe expected semantic properties
  – parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types.
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
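To make the idea of "describing a format with types" concrete, here is a minimal Python sketch. It is not PADS syntax and not the PADS API: the class names `Base`, `Lit`, `Struct` and `Union`, and the `parse` method, are hypothetical stand-ins for a base type with a constraint, a literal separator, a Pstruct-like sequence and a Punion-like choice.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical mini description language; NOT the PADS API.

@dataclass
class Base:
    """Base type: a regular expression, a converter and an optional constraint."""
    pattern: str
    convert: Callable[[str], object] = str
    constraint: Callable[[object], bool] = lambda value: True

    def parse(self, text, pos):
        m = re.compile(self.pattern).match(text, pos)
        if m is None:
            return None
        value = self.convert(m.group())
        # A failed constraint is treated as a parse failure in this sketch.
        return (value, m.end()) if self.constraint(value) else None

@dataclass
class Lit:
    """Literal text such as a separator."""
    text: str

    def parse(self, text, pos):
        if text.startswith(self.text, pos):
            return self.text, pos + len(self.text)
        return None

@dataclass
class Struct:
    """Sequence of (field name or None, description); unnamed fields are dropped."""
    fields: list

    def parse(self, text, pos):
        record = {}
        for name, desc in self.fields:
            result = desc.parse(text, pos)
            if result is None:
                return None
            value, pos = result
            if name is not None:
                record[name] = value
        return record, pos

@dataclass
class Union:
    """Ordered choice: the first branch that parses wins."""
    branches: list

    def parse(self, text, pos):
        for desc in self.branches:
            result = desc.parse(text, pos)
            if result is not None:
                return result
        return None

# Describe the tail of a web log entry: a status code, a space,
# then either a byte count or a dash.
status = Base(r"\d{3}", int, lambda code: 100 <= code < 600)
length = Union([Base(r"\d+", int), Base(r"-")])
entry = Struct([("status", status), (None, Lit(" ")), ("bytes", length)])

print(entry.parse("200 3082", 0))
print(entry.parse("304 -", 0))
```

The same description object drives parsing for every record, which is the sense in which tools can be generic and description-directed.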
The Last Mile: The PADS System 2.0
[Diagram labels: Raw Data; Format Inference Engine (Chunking & Tokenization, Structure Discovery, Format Refinement, Scoring Function); PADS Data Description; PADS Compiler; Profiler (Analysis Report); XMLifier (XML).]
Chunking Process
• Convert raw input into sequence of “chunks.”
• Supported divisions:
  – Various forms of “newline”
  – File boundaries
• Also possible: user-defined “paragraphs”
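A minimal Python sketch of the two supported divisions, assuming the input is already decoded text. The function names `chunk_by_newline` and `chunk_by_file` are hypothetical and are not part of the PADS tools.

```python
import re

def chunk_by_newline(raw: str) -> list[str]:
    """One chunk per line; accepts \\r\\n, \\r and \\n and drops empty lines."""
    return [line for line in re.split(r"\r\n|\r|\n", raw) if line]

def chunk_by_file(files: dict[str, str]) -> list[str]:
    """One chunk per file; `files` maps a file name to its full contents."""
    return list(files.values())

print(chunk_by_newline("a|1\r\nb|2\nc|3\r"))
```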
Tokenization
• Tokens/base types are expressed as regular expressions.
• Basic tokens: integer, white space, punctuation, strings
• Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
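The following Python sketch shows one way to express tokens as regular expressions, with distinctive tokens tried before basic ones. The token names, the patterns and the priority order are assumptions made for illustration; they are not the token definitions used by the PADS inference system.

```python
import re

# (name, pattern) pairs in priority order: distinctive tokens first.
# All names and patterns here are illustrative assumptions.
TOKEN_SPECS = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("Time", r"\d{2}:\d{2}:\d{2}"),
    ("Int", r"-?\d+"),
    ("White", r"[ \t]+"),
    ("String", r"[A-Za-z][A-Za-z0-9_]*"),
    ("Punct", r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk: str) -> list[tuple[str, str]]:
    """Turn one chunk into a list of (token name, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(chunk):
        m = MASTER.match(chunk, pos)
        if m is None:
            # Character not covered by any pattern: keep it as a 1-char token.
            tokens.append(("Other", chunk[pos]))
            pos += 1
        else:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
    return tokens

print(tokenize("207.136.97.49 - 200"))
print(tokenize('"123, 24"'))
```

Because alternatives are tried left to right, moving `Int` ahead of `IP` would split an address into integers and dots, which is a small instance of how tokenization choices change what later phases see.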
Histograms
Clustering
• Group tokens with similar frequency distributions into clusters.
• Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
• Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
[Figure labels: Cluster 1, Cluster 2, Cluster 3]
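The clustering step can be sketched in Python as follows. The similarity test follows the slide (compare column heights after sorting, within a tolerance). The greedy grouping, the default tolerance and the `score` formula are assumptions chosen only to illustrate "high coverage, narrow distribution"; the slides do not give the actual metric.

```python
from collections import Counter

def histogram(token, chunks):
    """Map 'occurrences per chunk' to the fraction of chunks with that count."""
    counts = Counter(chunk.count(token) for chunk in chunks)
    return {k: v / len(chunks) for k, v in counts.items()}

def similar(h1, h2, tolerance=0.1):
    """Same shape when columns are sorted by height, within `tolerance`."""
    s1 = sorted(h1.values(), reverse=True)
    s2 = sorted(h2.values(), reverse=True)
    width = max(len(s1), len(s2))
    s1 += [0.0] * (width - len(s1))
    s2 += [0.0] * (width - len(s2))
    return all(abs(a - b) <= tolerance for a, b in zip(s1, s2))

def cluster(hists, tolerance=0.1):
    """Greedy grouping (an assumption): join the first cluster whose
    first member has a similar histogram, else start a new cluster."""
    clusters = []
    for token, hist in hists.items():
        for group in clusters:
            if similar(hists[group[0]], hist, tolerance):
                group.append(token)
                break
        else:
            clusters.append([token])
    return clusters

def score(group, hists):
    """Hypothetical metric: mean coverage divided by mean histogram width."""
    coverage = sum(1.0 - hists[t].get(0, 0.0) for t in group) / len(group)
    width = sum(len([k for k in hists[t] if k != 0]) for t in group) / len(group)
    return coverage / width if width else 0.0

# Token names for six chunks shaped like "123, 24" and "345, begin".
int_chunk = ["Quote", "Int", "Comma", "White", "Int", "Quote"]
str_chunk = ["Quote", "Int", "Comma", "White", "String", "Quote"]
chunks = [int_chunk, str_chunk, str_chunk, int_chunk, str_chunk, str_chunk]

tokens = list(dict.fromkeys(t for c in chunks for t in c))
hists = {t: histogram(t, chunks) for t in tokens}
clusters = cluster(hists)
best = max(clusters, key=lambda g: score(g, hists))
print(clusters)
print(best)
```

On this toy input the quote, comma and white-space tokens form the winning cluster, since each appears a fixed number of times in every chunk.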
Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.
Find subcontexts
Tokens in selected cluster:
Quote(2) Comma White
Then Recurse...
Inferred type
Structure Discovery Review
• Compute frequency distribution for each token.
• Cluster tokens with similar frequency distributions.
• Create hypothesis about data structure from cluster distributions
  – Struct
  – Array
  – Union
  – Basic type (bottom out)
• Partition data according to hypothesis & recurse
• Once structure discovery is complete, later phases massage & rewrite candidate description to create final form
Example data: “123, 24” “345, begin” “574, end” “9378, 56” “12, middle” “-12, problem” …
[Figure: histogram (axis 0 to 100) of token frequencies for the tokens ", White Space, comma, Integer and String]
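The recursion reviewed above can be sketched in Python under strong simplifications. Instead of clustering histograms, this sketch treats any token that occurs the same number of times in every chunk as a struct token; it has no array hypothesis; and when nothing else applies it partitions on the first token to form a union. The function name `discover` and the tuple-based output are hypothetical, not the PADS implementation.

```python
from collections import defaultdict

def discover(chunks):
    """chunks: sequences of token names. Returns a nested tuple description."""
    chunks = [tuple(c) for c in chunks]
    distinct = set(chunks)
    if distinct == {()}:
        return ("Empty",)
    if len(distinct) == 1 and len(chunks[0]) == 1:
        return ("Base", chunks[0][0])          # basic type: bottom out

    # Struct hypothesis (simplified): same count in every chunk.
    fixed = {n for n in set(chunks[0])
             if len({c.count(n) for c in chunks}) == 1}
    if fixed:
        groups = defaultdict(list)             # order of struct tokens -> rows
        for c in chunks:
            order, contexts, current = [], [], []
            for name in c:
                if name in fixed:
                    order.append(name)
                    contexts.append(current)
                    current = []
                else:
                    current.append(name)
            contexts.append(current)
            groups[tuple(order)].append(contexts)
        branches = []
        for order, rows in groups.items():
            fields = []
            # Each column holds the same subcontext across all chunks.
            for i, column in enumerate(zip(*rows)):
                sub = discover(column)         # recurse on the subcontext
                if sub != ("Empty",):
                    fields.append(sub)
                if i < len(order):
                    fields.append(("Base", order[i]))
            branches.append(("Struct", fields))
        return branches[0] if len(branches) == 1 else ("Union", branches)

    # Union hypothesis (simplified): partition on the first token.
    groups = defaultdict(list)
    for c in chunks:
        groups[c[:1]].append(c)
    if len(groups) == 1:
        return ("Blob",)                       # give up rather than loop
    return ("Union", [discover(g) for g in groups.values()])

int_chunk = ["Quote", "Int", "Comma", "White", "Int", "Quote"]
str_chunk = ["Quote", "Int", "Comma", "White", "String", "Quote"]
description = discover([int_chunk, str_chunk, str_chunk, int_chunk])
print(description)
```

On the example data this yields a struct of quote, integer, comma, white space, a union of integer and string, and a closing quote. Each recursive call sees either strictly shorter token sequences or strictly fewer chunks, so the sketch terminates.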
Testing and Evaluation
• Evaluated overall results qualitatively
  – Compared with Excel: a manual process with limited facilities for representation of hierarchy or variation
  – Compared with hand-written descriptions: performance variable depending on tokenization choices & complexity
• Evaluated accuracy quantitatively
  – For many formats: 95%+ accuracy from 5% of available data
• Evaluated performance quantitatively
  – Hours to days to hand-write formats
  – After fixing the format, appears to scale linearly with data size
  – <1 min on 300K data
Technical Summary [www.padsproj.org]
• PADS 1.0 is an effective implementation framework for many data processing tasks
• PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
[Diagram labels: ASCII log files; Binary Traces; struct { ... }; XML; CSV]
End
Execution Time

| Data source | SD (s) | Ref (s) | Tot (s) | HW (h) |
|---|---|---|---|---|
| 1967Transactions.short | 0.20 | 2.32 | 2.56 | 4.0 |
| MER_T01_01.csv | 0.11 | 2.82 | 2.92 | 0.5 |
| Ai.3000 | 1.97 | 26.35 | 28.64 | 1.0 |
| Asl.log | 2.90 | 52.07 | 55.26 | 1.0 |
| Boot.log | 0.11 | 2.40 | 2.53 | 1.0 |
| Crashreporter.log | 0.12 | 3.58 | 3.73 | 2.0 |
| Crashreporter.log.mod | 0.15 | 3.83 | 4.00 | 2.0 |
| Sirius.1000 | 2.24 | 5.69 | 8.00 | 1.5 |
| Ls-l.txt | 0.01 | 0.10 | 0.11 | 1.0 |
| Netstat-an | 0.07 | 0.74 | 0.82 | 1.0 |
| Page_log | 0.08 | 0.55 | 0.65 | 0.5 |
| quarterlypersonalincome | 0.07 | 5.11 | 5.18 | 48 |
| Railroad.txt | 0.06 | 2.69 | 2.76 | 2.0 |
| Scrollkeeper.log | 0.13 | 3.24 | 3.40 | 1.0 |
| Windowserver_last.log | 0.37 | 9.65 | 10.07 | 1.5 |
| Yum.txt | 0.11 | 1.91 | 2.03 | 5.0 |

SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
Training Time
Minimum Necessary Training Sizes

| Data source | 90% | 95% |
|---|---|---|
| Sirius.1000 | 5 | 10 |
| 1967Transactions.short | 5 | 5 |
| Ai.3000 | 5 | 10 |
| Asl.log | 5 | 10 |
| Scrollkeeper.log | 5 | 5 |
| Page_log | 5 | 5 |
| MER_T01_01.csv | 5 | 5 |
| Crashreporter.log | 10 | 15 |
| Crashreporter.log.mod | 5 | 15 |
| Windowserver_last.log | 5 | 15 |
| Netstat-an | 25 | 35 |
| Yum.txt | 30 | 45 |
| quarterlypersonalincome | 10 | 10 |
| Boot.log | 45 | 60 |
| Ls-l.txt | 50 | 65 |
| Railroad.txt | 60 | 75 |
Problem: Tokenization
• Technical problem:
  – Different data sources assume different tokenization strategies
  – Useful token definitions sometimes overlap, can be ambiguous, aren’t always easily expressed using regular expressions
  – Matching tokenization of underlying data source can make a big difference in structure discovery.
• Current solution:
  – Parameterize learning system with customizable configuration files
  – Automatically generate lexer file & basic token types
• Future solutions:
  – Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  – Incorporate probabilities into sophisticated back-end rewriting system
    • Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without look ahead