From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org
Data, data, everywhere
• AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”
• Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available – not free text; not html; not xml
• Common problems: no documentation, evolving formats, huge volume, error-filled ...
Web Logs
Network Monitoring
Billing Info
Router Configs
Call Details
Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
web server common log format
Data, data, everywhere
AT&T phone call provisioning data
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001
649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
Data, data, everywhere
HA00000000START OF TEST CYCLEaA00000001BXYZ U1AB0000040000100B0000004200HE00000005START OF SUMMARYf 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000HF00000007END OF SUMMARYk 00000008LYXW B1KB0000065G0000009900100000001000020000HB00000009END OF TEST CYCLE
www.opradata.com
Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
Goal
Billing Info
Raw Data
ASCII log files Call Detail
XML
CSV
Standard formats & schema
Visual Information
End-user tools
We want to create this arrow
Half-way there: The PADS System 1.0 [FG pldi 05, FMW popl 06, MFWFG popl 07]
“Ad Hoc” Data Source
Analysis Report
XML
PADS Data Description
PADS Compiler
Generated Libraries (Parsing, Printing, Traversal)
PADS Runtime System (I/O, Error Handling)
XML Converter
Data Profiler
Graphing Tool
Query Engine
Custom App
Graph Information
?
generic description-directed programs, coded once
PADS Language Overview
• Rich base type library:
  – integers: Pint8, Puint32, …
  – strings: Pstring('|'), Pstring_FW(3), ...
  – systems data: Pdate, Ptime, Pip, …
• Type constructors describe complex data sources:
  – sequences: Pstruct, Parray
  – choices: Punion, Penum, Pswitch
  – constraints: arbitrary predicates describe expected semantic properties
  – parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
The Last Mile: The PADS System 2.0
Chunking & Tokenization
Structure Discovery
Format Refinement
PADS Data Description
Scoring Function
Raw Data
PADS Compiler
Profiler
XMLifier
Analysis Report
XML
Format Inference Engine
Chunking & Tokenization
Structure Discovery
• Convert raw input into a sequence of “chunks.”
• Supported divisions:
  – Various forms of “newline”
  – File boundaries
• Also possible: user-defined “paragraphs”
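A minimal chunking sketch (mode names are illustrative; a real “paragraph” mode would take a user-supplied splitter):

```python
def chunk(raw: str, mode: str = "newline"):
    """Split raw input into chunks -- the records over which
    structure is later inferred. Only two divisions are sketched."""
    if mode == "newline":
        # normalize Windows and old-Mac line endings, drop empty lines
        lines = raw.replace("\r\n", "\n").replace("\r", "\n").split("\n")
        return [line for line in lines if line]
    if mode == "file":
        return [raw]                  # the whole file is one chunk
    raise ValueError(f"unsupported chunking mode: {mode}")
```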
Chunking Process
Tokenization
• Tokens/base types expressed as regular expressions.
• Basic tokens: integers, white space, punctuation, strings
• Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
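A minimal sketch of such a regex-driven tokenizer. The token names and patterns are illustrative assumptions; the real system generates its lexer from configuration files (see the tokenization slide at the end):

```python
import re

# Ordered token definitions: distinctive tokens (here, just IP) are
# tried before generic ones so that "207.136.97.49" lexes as one IP
# token rather than four integers and three dots.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    ("INT",    r"-?\d+"),
    ("WHITE",  r"[ \t]+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_.]*"),
    ("PUNCT",  r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPECS))

def tokenize(chunk: str):
    """Turn one chunk (e.g. one log line) into a token sequence."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(chunk)]

toks = tokenize('207.136.97.49 - - 200 30')
```

Ordering the alternation is a crude stand-in for real ambiguity resolution; as the final slide notes, overlapping token definitions are a genuine problem.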
Histograms
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
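One plausible way to implement that comparison, as a sketch (the system's actual similarity measure may differ in its details):

```python
def shape(freq_dist):
    """Column heights sorted by height (descending), normalized."""
    total = sum(freq_dist.values())
    return sorted((v / total for v in freq_dist.values()), reverse=True)

def similar(dist_a, dist_b, tol=0.1):
    """Same shape, within tolerance, once columns are sorted by height."""
    a, b = shape(dist_a), shape(dist_b)
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

# Histograms map occurrences-per-chunk -> number of chunks.
comma = {1: 100}         # ',' appears exactly once in each of 100 chunks
quote = {2: 100}         # '"' appears exactly twice in each chunk
mixed = {1: 60, 2: 40}   # an Integer token with varying counts per chunk
```

Note that sorting by height is what makes comma and quote similar even though they occur a different number of times per chunk.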
Clustering
Cluster 1
Group clusters with similar frequency distributions
Cluster 2
Cluster 3
Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
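A sketch of one plausible scoring function; the real system's metric is a tuned combination, and the formula below is an assumption for illustration:

```python
def score(cluster, num_chunks):
    """Score a cluster of token histograms. Each histogram maps
    occurrences-per-chunk -> number of chunks with that count.
    Rewards coverage (token present in many chunks) and narrowness
    (few distinct nonzero occurrence counts); the cluster scores as
    its weakest member."""
    per_token = []
    for hist in cluster:
        present = sum(n for count, n in hist.items() if count > 0)
        coverage = present / num_chunks
        width = sum(1 for count in hist if count > 0)
        per_token.append(coverage / width)
    return min(per_token)

# 100 chunks of the running example:
punct_cluster = [{2: 100}, {1: 100}]       # Quote(2) and Comma: narrow
int_cluster   = [{0: 10, 1: 60, 2: 30}]    # Integer: wide distribution
best = max([punct_cluster, int_cluster], key=lambda c: score(c, 100))
```

Here the punctuation cluster wins: its tokens cover every chunk with a single occurrence count, which is exactly the signature of struct-like structure.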
Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.
Find subcontexts
Tokens in selected cluster:
Quote(2) Comma White
Then Recurse...
Inferred type
Structure Discovery Review
• Compute frequency distribution for each token.
• Cluster tokens with similar frequency distributions.
• Create hypothesis about data structure from cluster distributions:
  – Struct
  – Array
  – Union
  – Basic type (bottom out)
• Partition data according to hypothesis & recurse.
• Once structure discovery is complete, later phases massage & rewrite the candidate description to create the final form.
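The steps above can be sketched as a recursion over tokenized chunks. This is a drastically simplified assumption-laden version: the real algorithm picks split tokens by histogram clustering, while this sketch uses "exactly once per chunk" as a cheap stand-in for the struct hypothesis:

```python
def discover(token_seqs):
    """Infer a candidate description (as nested tuples) from chunks,
    each given as a list of (kind, text) tokens. Illustrative only."""
    if all(len(s) == 0 for s in token_seqs):
        return ("EMPTY",)
    kinds = {t[0] for s in token_seqs for t in s}
    if len(kinds) == 1 and all(len(s) == 1 for s in token_seqs):
        return ("BASE", kinds.pop())          # bottom out at a base type
    # Struct hypothesis: some token kind occurs exactly once per chunk;
    # split every chunk there and recurse on the two subcontexts.
    for kind in sorted(kinds):
        if all(sum(t[0] == kind for t in s) == 1 for s in token_seqs):
            idx = [next(i for i, t in enumerate(s) if t[0] == kind)
                   for s in token_seqs]
            lefts  = [s[:i]     for s, i in zip(token_seqs, idx)]
            rights = [s[i + 1:] for s, i in zip(token_seqs, idx)]
            return ("STRUCT",
                    [discover(lefts), ("BASE", kind), discover(rights)])
    # Union hypothesis: partition chunks by their first token kind.
    groups = {}
    for s in token_seqs:
        groups.setdefault(s[0][0] if s else "EMPTY", []).append(s)
    if len(groups) > 1:
        return ("UNION", [discover(groups[k]) for k in sorted(groups)])
    return ("ARRAY", ("BASE", sorted(kinds)[0]))   # degenerate repetition

# "123, 24" vs "345, begin": an INT, a comma, then INT-or-STR
chunks = [[("INT", "123"), (",", ","), ("INT", "24")],
          [("INT", "345"), (",", ","), ("STR", "begin")],
          [("INT", "574"), (",", ","), ("STR", "end")]]
ty = discover(chunks)
```

On the running example this yields a struct of an integer, a comma, and a union of integer and string, matching the inferred type shown on the earlier slide.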
“123, 24”  “345, begin”  “574, end”  “9378, 56”  “12, middle”  “-12, problem”  …
[Histogram figure: per-chunk frequencies of the Quote ("), WhiteSpace, Comma, Integer, and String tokens; y-axis 0–100]
Testing and Evaluation
• Evaluated overall results qualitatively
  – Compared with Excel -- a manual process with limited facilities for representing hierarchy or variation
  – Compared with hand-written descriptions -- performance varies depending on tokenization choices & complexity
• Evaluated accuracy quantitatively
  – For many formats: 95%+ accuracy from 5% of the available data
• Evaluated performance quantitatively
  – Hours to days to hand-write formats
  – After fixing the format, inference appears to scale linearly with data size
  – <1 min on 300K of data
Technical Summary [www.padsproj.org]
• PADS 1.0 is an effective implementation framework for many data processing tasks
• PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
ASCII log files   Binary Traces   struct { ........ ...... ...........}
XML
CSV
End
Execution Time

Data source              SD (s)   Ref (s)   Tot (s)   HW (h)
1967Transactions.short     0.20      2.32      2.56      4.0
MER_T01_01.csv             0.11      2.82      2.92      0.5
Ai.3000                    1.97     26.35     28.64      1.0
Asl.log                    2.90     52.07     55.26      1.0
Boot.log                   0.11      2.40      2.53      1.0
Crashreporter.log          0.12      3.58      3.73      2.0
Crashreporter.log.mod      0.15      3.83      4.00      2.0
Sirius.1000                2.24      5.69      8.00      1.5
Ls-l.txt                   0.01      0.10      0.11      1.0
Netstat-an                 0.07      0.74      0.82      1.0
Page_log                   0.08      0.55      0.65      0.5
quarterlypersonalincome    0.07      5.11      5.18     48
Railroad.txt               0.06      2.69      2.76      2.0
Scrollkeeper.log           0.13      3.24      3.40      1.0
Windowserver_last.log      0.37      9.65     10.07      1.5
Yum.txt                    0.11      1.91      2.03      5.0

SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written
Training Time
Minimum Necessary Training Sizes

Data source              90%   95%
Sirius.1000                5    10
1967Transaction.short      5     5
Ai.3000                    5    10
Asl.log                    5    10
Scrollkeeper.log           5     5
Page_log                   5     5
MER_T01_01.csv             5     5
Crashreporter.log         10    15
Crashreporter.log.mod      5    15
Windowserver_last.log      5    15
Netstat-an                25    35
Yum.txt                   30    45
quarterlypersonalincome   10    10
Boot.log                  45    60
Ls-l.txt                  50    65
Railroad.txt              60    75
Problem: Tokenization
• Technical problem:
  – Different data sources assume different tokenization strategies
  – Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
  – Matching the tokenization of the underlying data source can make a big difference in structure discovery
• Current solution:
  – Parameterize the learning system with customizable configuration files
  – Automatically generate the lexer file & basic token types
• Future solutions:
  – Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  – Incorporate probabilities into a sophisticated back-end rewriting system
    • The back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead