pads: a domain-specific language for processing ad hoc data
TRANSCRIPT
![Page 1: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/1.jpg)
PADS: A Domain-Specific Languagefor Processing Ad Hoc Data
talk based on paper byKathleen Fisher1 Robert Gruber2
1AT&T Labs Research 2Google
November 10, 2005
Jan Voung (UCSD) PADS November 10, 2005 1 / 24
![Page 2: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/2.jpg)
Outline
1 Why worry about ad hoc data?
2 Current options for processing ad hoc data
3 What PADS does for you
Jan Voung (UCSD) PADS November 10, 2005 2 / 24
![Page 3: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/3.jpg)
Outline
1 Why worry about ad hoc data?
2 Current options for processing ad hoc data
3 What PADS does for you
Jan Voung (UCSD) PADS November 10, 2005 3 / 24
![Page 4: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/4.jpg)
Data and formatting
Massive amounts of data is generated and collected every day.
Formatted data is easier to understand and process.
Formats can either be standardized, or not standardized.
Jan Voung (UCSD) PADS November 10, 2005 4 / 24
![Page 5: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/5.jpg)
Standard vs nonstandard formats
Boundary not so clear, but...
there is certainly a difference in the size of the expectedproducer/user-base. Standard formats tend to have many applicationsthat generate and process them. Standardized formats should bedesigned to expect a growing user-base.
Examples of data in standard formats:
• webpages in HTML
• pictures in JPEG
• XML
• data in databases
• The billions of lines of code you write every day?
Jan Voung (UCSD) PADS November 10, 2005 5 / 24
![Page 6: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/6.jpg)
Ad hoc data (data in nonstandard formats)
Claim/(fact?): Most data is actually stored in nonstandard formats.Here are some examples:
• AT&T accumulates 250-300GB/day of billing data
• Netflow data arrives at Cisco routers at over a GB/sec.
Jan Voung (UCSD) PADS November 10, 2005 6 / 24
![Page 7: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/7.jpg)
Common Log Format (CLF) for web servers
One record per line, with 7 fields:
The ’-’ character indicates missing data for that field
Jan Voung (UCSD) PADS November 10, 2005 7 / 24
![Page 8: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/8.jpg)
Investment data
Records within another kind of record
Jan Voung (UCSD) PADS November 10, 2005 8 / 24
![Page 9: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/9.jpg)
Ad hoc data in chemistry
Context-free?
Jan Voung (UCSD) PADS November 10, 2005 9 / 24
![Page 10: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/10.jpg)
DNS data
Binary data
Jan Voung (UCSD) PADS November 10, 2005 10 / 24
![Page 11: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/11.jpg)
Small audience, big expectations
Groups that work with ad hoc data, no matter how small, still need toolsthat
• parse the data to load into applications
• visualize the data
• allow queries over the data
• convert the data (maybe to load into a database)
• detect errors in the data
• correct errors
• filter data
• combine data from multiple sources
Jan Voung (UCSD) PADS November 10, 2005 11 / 24
![Page 12: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/12.jpg)
Outline
1 Why worry about ad hoc data?
2 Current options for processing ad hoc data
3 What PADS does for you
Jan Voung (UCSD) PADS November 10, 2005 12 / 24
![Page 13: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/13.jpg)
Perl/AWK/Shell scripts/C
Pros
• flexible
Cons
• time investment• hand-code parser• hand-code error handling (if you even bother)• hand-code other tools (e.g., converters, viewers)
• error-prone
• difficult to maintain in the face of format changes
Jan Voung (UCSD) PADS November 10, 2005 13 / 24
![Page 14: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/14.jpg)
Lex + Yacc
Pros
• ?
Cons
• specify lexer and parser separately
• error handling inflexible
• still need to hand-code other tools (e.g., converters, viewers)
People don’t use Lex + Yacc for ad hoc data.
Jan Voung (UCSD) PADS November 10, 2005 14 / 24
![Page 15: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/15.jpg)
Outline
1 Why worry about ad hoc data?
2 Current options for processing ad hoc data
3 What PADS does for you
Jan Voung (UCSD) PADS November 10, 2005 15 / 24
![Page 16: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/16.jpg)
What we want
• Simple data description language• Parser
• should handle various encodings (ASCII, binary, etc.)• should not halt on errors, and• should track errors (syntactic + semantic)• generate useful in-memory representation
• Other tools• converter to other formats (e.g., XML, CSV)• statistical profiler• query interface• more?
• High performance
Jan Voung (UCSD) PADS November 10, 2005 16 / 24
![Page 17: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/17.jpg)
What we get from PADS
• Simple data description language• Parser
• handles ASCII, EBCDIC, and binary• does not halt on errors, and• tracks errors (syntactic + semantic) in a parse-descriptor struct• generates C structs, unions, etc., for in-memory rep.
• Other tools• convert/print to other formats (XML + XML Schema, CSV)• statistical profilers via accumulator programs• query interface• more in the future
• Faster than Perl programs
• Can stream data, rather than read it all into memory at once
Jan Voung (UCSD) PADS November 10, 2005 17 / 24
![Page 18: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/18.jpg)
PADS architecture
(Not to scale)
Jan Voung (UCSD) PADS November 10, 2005 18 / 24
![Page 19: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/19.jpg)
PADS architecture: a closer look
Note: Use masks to select only the features that are relevant to theapplication ⇒ increased performance
Jan Voung (UCSD) PADS November 10, 2005 19 / 24
![Page 20: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/20.jpg)
Data Description Language
Primitives (Pint32, Pdate, Pstring, Pip).
Complex data (Pstruct, Punion, Parray, Popt, Ptypedefs,Penum).
Parameterized types (e.g., Puint16_FW(:3:)).
Can attach predicates for error checking (see following example).
Jan Voung (UCSD) PADS November 10, 2005 20 / 24
![Page 21: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/21.jpg)
CLF: Checking HTTP version conformance
Jan Voung (UCSD) PADS November 10, 2005 21 / 24
![Page 22: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/22.jpg)
Siruis Data: Check that timestamps increasemonotonically
Jan Voung (UCSD) PADS November 10, 2005 22 / 24
![Page 23: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/23.jpg)
Conclusion
• They evaluate the work by showing how to construct parsers for afew formats and comparing the performance of generated tools tohand-coded perl programs. PADS generated tools are no slower(actually faster in all 6 runs and 2x faster in 3 of the runs).
• Rather than standardize a format (e.g., HTML), then create tools,PADS can automatically generate tools as the format evolves. Theonly cost is the time to update the PADS description.
If you want to try it, go to http://www.padsproj.org
Jan Voung (UCSD) PADS November 10, 2005 23 / 24
![Page 24: PADS: A Domain-Specific Language for Processing Ad Hoc Data](https://reader036.vdocument.in/reader036/viewer/2022071602/613d7862736caf36b75db54e/html5/thumbnails/24.jpg)
Discussion
• What other methods of evaluation would you like to see for work ofthis kind?
• The PADS group plans to add more auxiliary tool generators. Is itreally useful, or do you imagine people building custom tools usingonly the generated parser?
• If people build custom tools, does that also build up data-formatinertia?
• Do you feel that the PADS compiler has more opportunities tooptimize the generated parser and tools than compiledhand-coded tools?
Jan Voung (UCSD) PADS November 10, 2005 24 / 24