lexing and parsing
DESCRIPTION
Beginner’s guide to lexing and parsing for PHP developers, given at ZendCon 2014
TRANSCRIPT
![Page 1: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/1.jpg)
LEXING AND PARSING
THE BEGINNER’S GUIDE
![Page 2: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/2.jpg)
WHY ARE WE DOING THIS?
• bbcode
• html
• xml
• programming language
![Page 3: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/3.jpg)
BUT I CAN JUST REGEX
• sometimes you can
• sometimes you can’t
• is your html well formed? (view source some time)
• it depends!!
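A minimal sketch of the "sometimes you can't" case, written in Python purely for illustration. The bbcode input and the `balanced` helper are made up for this example. A single non-greedy pattern stops at the first closing tag, so nested tags of the same kind come out wrong, while a small amount of state (a depth counter) handles them.

```python
import re

# Hypothetical bbcode input with one [quote] nested inside another.
text = "[quote]outer [quote]inner[/quote] tail[/quote]"

# The non-greedy match ends at the first [/quote], cutting the outer quote short.
pattern = re.compile(r"\[quote\](.*?)\[/quote\]")
first = pattern.search(text).group(1)
print(first)  # outer [quote]inner

def balanced(source):
    """Check nesting with a depth counter instead of one big regex."""
    depth = 0
    for tag in re.findall(r"\[/?quote\]", source):
        depth += -1 if tag.startswith("[/") else 1
        if depth < 0:          # a closing tag appeared before its opener
            return False
    return depth == 0

print(balanced(text))  # True
```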
![Page 4: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/4.jpg)
CHOMSKY HIERARCHY
![Page 5: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/5.jpg)
COMPUTER SCIENCE
WE LIKE ACRONYMS AND WEIRD WORDS
![Page 6: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/6.jpg)
ENGLISH IS HARD!
• tokenizer
• scanner
• lexer
• parser
• lexical analyzer
• syntactic analyzer
• formal grammar
![Page 7: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/7.jpg)
LEXICAL ANALYSIS
BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS
LEXING
![Page 8: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/8.jpg)
SCANNING
• Finite State Machine
• Finds Lexemes
• Might backtrack
![Page 9: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/9.jpg)
FINITE STATE MACHINE
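As a rough illustration of a scanner built as a finite state machine, here is a small Python sketch. The state names and the `scan` function are invented for this example, and it only knows about numbers, `$`-prefixed or alphabetic names, and single-character symbols.

```python
# States of a tiny scanner; the names are invented for this sketch.
START, IN_NUMBER, IN_NAME = "start", "in_number", "in_name"

def scan(text):
    """Split text into lexemes by stepping through explicit states."""
    lexemes, state, current = [], START, ""
    text += " "                 # trailing space flushes the last lexeme
    i = 0
    while i < len(text):
        ch = text[i]
        if state == START:
            if ch.isdigit():
                state, current = IN_NUMBER, ch
            elif ch.isalpha() or ch == "$":
                state, current = IN_NAME, ch
            elif not ch.isspace():
                lexemes.append(ch)   # single-character lexeme such as = or ;
            i += 1
        elif (state == IN_NUMBER and ch.isdigit()) or (state == IN_NAME and ch.isalnum()):
            current += ch
            i += 1
        else:
            lexemes.append(current)
            state = START       # do not advance: re-examine ch from START

    return lexemes

print(scan("$y = 5;"))  # ['$y', '=', '5', ';']
```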
![Page 10: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/10.jpg)
EVALUATOR
• looks at lexeme to get value
• lexeme + value = token
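A small Python sketch of the evaluator step, assuming the scanner has already produced lexemes. The `Token` type, the kind names, and the `evaluate` function are hypothetical and not taken from any real lexer.

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str       # what sort of thing the lexeme is
    value: object   # the value the evaluator worked out for it

def evaluate(lexeme):
    """Turn a raw lexeme into a token by attaching a kind and a value."""
    if lexeme.isdigit():
        return Token("NUMBER", int(lexeme))      # "5" becomes the integer 5
    if lexeme.startswith("$"):
        return Token("VARIABLE", lexeme[1:])     # keep the name without the $
    return Token("SYMBOL", lexeme)

tokens = [evaluate(lexeme) for lexeme in ["$y", "=", "5", ";"]]
print(tokens[2])  # Token(kind='NUMBER', value=5)
```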
![Page 11: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/11.jpg)
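LEXING PHP - $y = 5;
• $y
• [309, '$y', 1]
• =
• = (single-character tokens come back as plain strings)
• 5
• [305, '5', 1]
• 309 == T_VARIABLE
• 305 == T_LNUMBER
• numeric ids differ between PHP versions; token_name() maps an id to its name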
![Page 12: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/12.jpg)
LEXER GENERATORS
DO NOT WRITE THIS BY HAND
Famous
• lex
• flex
• re2c
• ANTLR
• DFASTAR
• jflex
• jlex
• quex
PHP generators
• https://github.com/oliverheins/PHPSimpleLexYacc (lex syntax)
• https://github.com/pear/PHP_LexerGenerator (re2c syntax)
• https://github.com/wez/JLexPHP (jlex syntax)
• token_get_all (see php-parser)
• parse_ini_file/string (combined with parser)
![Page 13: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/13.jpg)
RE2C
![Page 14: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/14.jpg)
IN PHP LAND
![Page 15: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/15.jpg)
SYNTACTIC ANALYSIS
CONSTRUCTING SOMETHING BASED ON A GRAMMAR
PARSING
![Page 16: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/16.jpg)
THE PARSING PROCESS
• Tokens come in
• Magic
• Data structure comes out
• parse tree
• AST
![Page 17: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/17.jpg)
GRAMMAR (FORMAL OF COURSE)
• Brave men run in my family.
• I can't recommend this book too highly.
• Prostitutes Appeal to Pope
• I had had my car for four years before I ever learned to drive it.
![Page 18: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/18.jpg)
TYPES OF PARSERS
• Top Down
  • Recursive descent
  • LL (Left-to-right, Leftmost derivation)
  • Earley parser
• Bottom Up
  • Precedence parser
    • Operator-precedence parser
    • Simple precedence parser
  • BC (bounded context) parsing
  • LR parser (Left-to-right, Rightmost derivation)
    • Simple LR (SLR) parser
    • LALR parser
    • Canonical LR (LR(1)) parser
    • GLR parser
  • CYK parser
  • Recursive ascent parser
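Of the styles listed, recursive descent is the easiest to write by hand: one function per grammar rule, with a one-token peek to decide what to do next. The Python sketch below assumes a made-up arithmetic grammar and builds nested tuples as its tree. The `tokenize` helper and `Parser` class are illustrative only.

```python
import re

def tokenize(text):
    """Very small lexer: integers and the symbols + - * / ( )."""
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (one-token lookahead)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    # expr := term (("+" | "-") term)*
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    # term := factor (("*" | "/") factor)*
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    # factor := NUMBER | "(" expr ")"
    def factor(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected )")
            return node
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

ast = Parser(tokenize("1 + 2 * 3")).expr()
print(ast)  # ('+', 1, ('*', 2, 3))
```

Operator precedence falls out of the rule nesting: `term` binds tighter than `expr` because `expr` calls `term`, not the other way round.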
![Page 19: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/19.jpg)
SENTENCE DIAGRAMMING
• People who live in glass houses shouldn't throw stones.
![Page 20: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/20.jpg)
PARSE TREE
![Page 21: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/21.jpg)
TOP DOWN VS. BOTTOM UP PARSING
![Page 22: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/22.jpg)
PARSE TREES
• Constituency-based parse trees
• Dependency-based parse trees
![Page 23: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/23.jpg)
AST
• Not everything appears
• additional information may be applied
• can “improve” tree nodes
• PHP is getting one!
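To make "not everything appears" concrete, here is a Python sketch that collapses a hand-written parse tree for `1 + (2)` into an AST. The node labels and the `to_ast` function are assumptions made for this example, and it only understands addition.

```python
# Parse tree: every grammar rule and every token shows up, including parentheses.
parse_tree = ("expr", [
    ("term", [("NUMBER", "1")]),
    ("PLUS", "+"),
    ("term", [
        ("LPAREN", "("),
        ("expr", [("term", [("NUMBER", "2")])]),
        ("RPAREN", ")"),
    ]),
])

PUNCTUATION = ("LPAREN", "RPAREN", "PLUS")

def to_ast(node):
    """Drop punctuation and single-child wrapper nodes, keep only the meaning."""
    label, body = node
    if label == "NUMBER":
        return int(body)
    children = [to_ast(child) for child in body if child[0] not in PUNCTUATION]
    if len(children) == 1:
        return children[0]        # unwrap expr -> term -> NUMBER chains
    return ("+", *children)       # this sketch only knows about addition

print(to_ast(parse_tree))  # ('+', 1, 2)
```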
![Page 24: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/24.jpg)
LALR(K)
• Look ahead prevents “ambiguous” parsing
• I have one token, what token comes next?
![Page 25: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/25.jpg)
PARSER GENERATORS
Famous
• bison
• bison
• bison
• bison
• yacc
• lemon
• ANTLR
PHP versions
• https://github.com/wez/lemon-php
• https://github.com/pear/PHP_ParserGenerator (lemon)
• https://github.com/scato/phpeg (peg, as in peg.js)
• https://github.com/jakubkulhan/pacc (yacc)
![Page 26: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/26.jpg)
BISON
• Generates LALR (or GLR) parsers
• Code in C, C++ or Java
• reentrant with %define api.pure set
• used by ALL THE THINGS
  • PHP
  • Ruby
  • PostgreSQL
  • Go
![Page 27: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/27.jpg)
BISON IN C
![Page 28: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/28.jpg)
LEMON
• Generates LALR(1) parser
• reentrant AND thread safe
• non-terminal destructor (leak avoidance)
• push parsing (the tokenizer calls the parser, one token at a time)
• used by SQLite
![Page 29: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/29.jpg)
PHP LEMON
![Page 30: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/30.jpg)
REENTRANT VS THREAD SAFE
• Process
• Thread
• Locking
• Scope
• Reentrant
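The slide only lists keywords, so the following Python sketch is one reading of the distinction rather than a definition: a reentrant reader keeps all of its state with the caller, while a non-reentrant one hides it in a shared global. All names here are invented for the example.

```python
# Non-reentrant: the read position lives in one shared global.
_position = 0

def next_word_global(words):
    global _position
    word = words[_position]
    _position += 1
    return word

# Reentrant: each reader owns its state, so readers cannot disturb each other.
class WordReader:
    def __init__(self, words):
        self.words = words
        self.position = 0

    def next_word(self):
        word = self.words[self.position]
        self.position += 1
        return word

first_input, second_input = ["a1", "a2"], ["b1", "b2"]

# Interleaving two inputs through the global version skips b1.
mixed = [next_word_global(first_input), next_word_global(second_input)]
print(mixed)  # ['a1', 'b2']

reader_a, reader_b = WordReader(first_input), WordReader(second_input)
clean = [reader_a.next_word(), reader_b.next_word()]
print(clean)  # ['a1', 'b1']
```

Thread safety is a separate question: if two threads share one `WordReader`, access to `position` would still need locking.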
![Page 31: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/31.jpg)
COMPILE IT
• transform programming language to computer language
![Page 32: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/32.jpg)
INTERPRET IT
• directly executes programming language
![Page 33: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/33.jpg)
PROFIT
![Page 34: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/34.jpg)
UNDER THE HOOD
WHAT USES THIS STUFF?
![Page 35: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/35.jpg)
PHP
re2c + Bison + these crazy opcodes…
![Page 36: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/36.jpg)
LALR(1) WRITTEN BY HAND
How - pythonic
![Page 37: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/37.jpg)
HHVM
Flex and Bison and JIT – OH MY!
![Page 38: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/38.jpg)
SQLITE
Lemon is tasty!
![Page 39: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/39.jpg)
WRITING PARSERS AND LEXERS
THEORIES OF CODING
![Page 40: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/40.jpg)
STEP 1: THINK SMALL
• Writing a general purpose parser is hard – that’s why you use PHP
• Writing a single purpose parser is much easier
• markup text (markdown)
• configuration or definition files (behat/gherkin syntax)
• complex validation (addresses in multiple formats)
![Page 41: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/41.jpg)
STEP 2: SEPARATE AND UNOPTIMIZED
• premature optimization yada yada
• combine after it’s ready to be used (or not at all if you’ll need to change it later)
• lexer and parser each have unique, well defined goals
• the ability to potentially switch parser styles later will help you!
![Page 42: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/42.jpg)
STEP 3: LEXER
• the lexer's job is to recognize tokens
• it can do this via a giant switch statement of doom
• or maybe a giant loop
• or maybe a list of goto statements
• or maybe a complex class with methods
• …. or you can just use a generator
![Page 43: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/43.jpg)
LET’S BREAK THAT DOWN
1. Define a token format
2. Define grammar format (what are we looking for?)
3. Go over the input data (usually a string) and make matches
   • compare, regex, ctype_*, or however it makes sense
4. Keep track of your current state
5. Have an output format – AST, tree, whatever
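The five steps above can be sketched as a small regex-driven lexer. This is Python for illustration only; the token kinds, the mini-language, and the `lex` function are assumptions, not part of the talk.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):          # step 1: token format
    kind: str
    text: str
    line: int

# Step 2: what are we looking for? (a hypothetical mini-language)
TOKEN_SPEC = [
    ("VARIABLE", r"\$[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("OP", r"[=+\-*/;]"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    tokens, pos, line = [], 0, 1  # step 4: current state
    while pos < len(source):      # step 3: walk the input and make matches
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected {source[pos]!r} on line {line}")
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind != "SKIP":
            tokens.append(Token(kind, match.group(), line))
        pos = match.end()
    return tokens                 # step 5: output format, here a flat list

for token in lex("$y = 5;\n$z = $y + 1;"):
    print(token)
```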
![Page 44: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/44.jpg)
STEP 4: PARSER
• Loop over our tokens
• Look at the values and decide what to do
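A minimal Python sketch of that loop, assuming tokens arrive as `(kind, text)` pairs and the only statement is `VARIABLE = NUMBER ;`. The token kinds, the `parse` function, and the output shape are hypothetical.

```python
def parse(tokens):
    """Loop over tokens and build one node per `VARIABLE = NUMBER ;` statement."""
    statements, pos = [], 0

    def expect(kind):
        # Consume the next token if it has the expected kind, else complain.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        text = tokens[pos][1]
        pos += 1
        return text

    while pos < len(tokens):
        name = expect("VARIABLE")
        expect("ASSIGN")
        value = int(expect("NUMBER"))
        expect("SEMICOLON")
        statements.append({"type": "assign", "target": name, "value": value})
    return statements

tokens = [("VARIABLE", "$y"), ("ASSIGN", "="), ("NUMBER", "5"), ("SEMICOLON", ";")]
print(parse(tokens))  # [{'type': 'assign', 'target': '$y', 'value': 5}]
```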
![Page 45: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/45.jpg)
STEP 5: DO SOMETHING WITH IT!
1. Compile – write out to something that can be run (html)
2. Interpret – run through another program to get output (templates to html)
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)
4. Validate – check for proper “spelling and grammar”
5. ???
6. PROFIT
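Two of those options, sketched in Python over the same tree: interpreting it to get a result, and "compiling" it by writing it back out as text. The nested-tuple tree format and both function names are assumptions made for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def interpret(node):
    """Walk the tree and compute its value directly."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))

def compile_to_text(node):
    """Walk the same tree but emit source text instead of a value."""
    if isinstance(node, (int, float)):
        return str(node)
    op, left, right = node
    return f"({compile_to_text(left)} {op} {compile_to_text(right)})"

ast = ("+", 1, ("*", 2, 3))
print(interpret(ast))        # 7
print(compile_to_text(ast))  # (1 + (2 * 3))
```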
![Page 46: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/46.jpg)
“If you’re not sure how to do a job – ask!”
- silly poster on my laundry room wall
![Page 47: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/47.jpg)
RESOURCES
• http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html
• http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html
• https://github.com/hafriedlander/php-peg
• https://github.com/nikic/PHP-Parser/
• http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
• http://wikipedia.org
![Page 48: Lexing and parsing](https://reader035.vdocument.in/reader035/viewer/2022062220/558fcda71a28ab7d7f8b456b/html5/thumbnails/48.jpg)
CONTACT ME
• auroraeosrose – freenode.net #phpmentoring #phpwomen
• Twitter - @auroraeosrose
• http://github.com/auroraeosrose