Writing Parsers in Python Using Pyparsing
Paul McGuire
APUG - May, 2016
Agenda
• Quick Intro / Demo
• Parsing 'geo:' URLs
• Other Examples of Using Pyparsing
• How to Get Started
Best practices: highlighted in the examples
Paul McGuire
• Mechanical Engineering degree from Rensselaer Polytechnic Institute, Masters in Engineering from Univ of Texas; 30+ years developing planning and control software for electronics and semiconductor manufacturing (Pascal, PL/I, COBOL, Fortran, C/C++, Smalltalk, Java, C#, Python)
• A long-time interest in parser applications, plus work in O-O technologies in Smalltalk and Java, led to the object-based parser construction approach seen in Pyparsing, first released in 2003
• Several articles published in Python Magazine, and an e-book with O'Reilly, "Getting Started with Pyparsing", published in 2007
Quick Intro / Demo
• Parsers are built up using Pyparsing classes
  • Word, Literal (or just string literals)
  • OneOrMore, ZeroOrMore
  • And, Or, MatchFirst, Each
  • with overloaded operators +, ^, |, and &
• Whitespace is implicitly skipped
integer = Word('0123456789')
phone_number = Optional('(' + integer + ')') + integer + '-' + integer
# re.compile(r'(\(\d+\))?\d+-\d+')
greet = Word(alphas) + "," + Word(alphas) + "!"
greet.parseString("Hello, World!")
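The two demo parsers can be tried directly. This is a minimal runnable sketch of the slide's code; the test strings other than "Hello, World!" are made up for illustration, and Word(nums) stands in for Word('0123456789').

```python
from pyparsing import Optional, Word, alphas, nums

# Word(nums) matches one or more digits, like Word('0123456789') on the slide.
integer = Word(nums)
phone_number = Optional('(' + integer + ')') + integer + '-' + integer

greet = Word(alphas) + "," + Word(alphas) + "!"

# Whitespace between tokens is skipped even though the grammar never mentions it.
print(greet.parseString("Hello, World!").asList())
print(greet.parseString("Hello ,   World !").asList())

# The area code is optional; the sample numbers are invented.
print(phone_number.parseString("(512) 555-1212").asList())
print(phone_number.parseString("555-1212").asList())
```

Both greeting strings produce the same four tokens, which is why the parser definition itself does not need to say anything about spaces.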
Best practice: Don't include whitespace in the parser definition
Parsing 'geo:' URLs
• URL for latitude / longitude / altitude values:
  • geo:<latitude>,<longitude>[,<altitude>][;options…]
• options:
  • crs (coordinate reference system) - default = 'wgs84'
  • u (uncertainty) - value in meters
  • other - key = value
Sample 'geo:' URLs
• Samples
geo:27.9878,86.9250,8850;crs=wgs84;u=100
geo:-26.416,27.428,-3900;u=100
geo:17.75,142.5,-11033;crs=wgs84;u=100
geo:36.246944,-116.816944,-85;u=50
geo:30.2644663,-97.7841169;a=100;href=http://www.allure-energy.com/
'geo:' URL specification
• IETF RFC 5870
• from https://tools.ietf.org/html/rfc5870
geo-URI = geo-scheme ":" geo-path
geo-scheme = "geo"
geo-path = coordinates p
coordinates = num "," num [ "," num ]
p = [ crsp ] [ uncp ] [";" other]...
crsp = ";crs=" crslabel
crslabel = "wgs84" / labeltext
uncp = ";u=" uval
other = labeltext "=" val
val = uval / chartext
Best practice: Start with a BNF
Parsers included in Python lib
geo:27.9878,86.9250,8850;crs=wgs84;u=100
• urlparse
ParseResult(scheme='geo', netloc='', path='27.9878,86.9250,8850;crs=wgs84;u=100', params='', query='', fragment='')
• re
patt = (r'geo:(-?\d+(?:\.\d*)?),(-?\d+(?:\.\d*)?)(?:,(-?\d+(?:\.\d*)?))?' +
        r'(?:;(crs=[^;]+))?(?:;(u=\d+(?:\.\d*)?))?')
print(re.compile(patt).match(tests[0]).groups())
('27.9878', '86.9250', '8850', 'crs=wgs84', 'u=100')
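For comparison, here is a standard-library-only sketch that gets from the URL to usable values. The helper name parse_geo, the NUM fragment, and the named groups are assumptions added for illustration and are not part of the talk. It shows that urlparse leaves the whole path unparsed, and that the regex route needs hand-written conversions and silently ignores any 'other' options.

```python
import re
from urllib.parse import urlparse

# Reusable fragment for a signed decimal number (same shape as the slide's regex).
NUM = r'-?\d+(?:\.\d*)?'
GEO_RE = re.compile(
    rf'geo:(?P<lat>{NUM}),(?P<lng>{NUM})(?:,(?P<alt>{NUM}))?'
    rf'(?:;crs=(?P<crs>[^;]+))?(?:;u=(?P<u>{NUM}))?'
)

def parse_geo(url):
    """Hypothetical helper: return the fields of a geo: URL as a dict."""
    m = GEO_RE.match(url)
    if m is None:
        raise ValueError("not a geo: URL: %r" % url)
    fields = m.groupdict()
    # The regex only captures strings; every numeric conversion is done by hand.
    for key in ('lat', 'lng', 'alt', 'u'):
        if fields[key] is not None:
            fields[key] = float(fields[key])
    return fields

url = 'geo:27.9878,86.9250,8850;crs=wgs84;u=100'
print(urlparse(url).path)   # everything after 'geo:' comes back as one string
print(parse_geo(url))
```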
'geo' URL Parsing using Pyparsing
from pyparsing import *
EQ,COMMA = map(Suppress, "=,")
number = Regex(r'-?\d+(\.\d*)?').addParseAction(lambda t: float(t[0]))
geo_coords = Group(number('lat') + COMMA + number('lng') + Optional(COMMA + number('alt')))
crs_arg = Group('crs' + EQ + Word(alphanums))
u_arg = Group('u' + EQ + number)
url_args = Dict(delimitedList(crs_arg | u_arg, ';'))
geo_url = "geo:" + geo_coords('coords') + Optional(';' + url_args('args'))
Best practice: Use parse actions for conversions
Best practice: Use results names
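A smaller sketch of the same two practices, separate from the 'geo:' grammar. The date format and the names year, month, and day are made up for illustration: the parse action converts each field to int at parse time, and the results names let the caller read fields by name instead of by position.

```python
from pyparsing import Suppress, Word, nums

# Parse action: convert the matched digits to an int while parsing.
integer = Word(nums).addParseAction(lambda t: int(t[0]))
DASH = Suppress('-')

# Results names: calling an expression with a string attaches a name to its result.
date = integer('year') + DASH + integer('month') + DASH + integer('day')

result = date.parseString('2016-05-17')
print(result.asList())            # already ints, no post-processing needed
print(result.year, result.month, result.day)
```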
Parsing some samples
tests = """\
geo:36.246944,-116.816944,-85;u=50
geo:30.2644663,-97.7841169;a=100;href=http://www.allure-energy.com/
"""
geo_url.runTests(tests)
assert geo_url.matches("geo:36.246944,-116.816944,-85;u=50")
assert not geo_url.matches("geo:36.246944;u=50")  # longitude is missing, so this does not match
Best practice: runTests() is new in 2.0.4
Best practice: Use matches() for incremental inline validation of your parser elements
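One way to read "incremental inline validation" is to assert on each element right after defining it, before building the next one on top. A sketch using two elements from the 'geo:' parser; the test strings are chosen for illustration.

```python
from pyparsing import Group, Optional, Regex, Suppress

COMMA = Suppress(',')
number = Regex(r'-?\d+(\.\d*)?').addParseAction(lambda t: float(t[0]))

# Validate the smallest element first ...
assert number.matches("86.9250")
assert number.matches("-3900")
assert not number.matches("abc")

geo_coords = Group(number('lat') + COMMA + number('lng')
                   + Optional(COMMA + number('alt')))

# ... then the element built from it.
assert geo_coords.matches("27.9878,86.9250")
assert geo_coords.matches("27.9878,86.9250,8850")
assert not geo_coords.matches("27.9878")    # longitude is required
```

matches() requires the whole string to match by default, so a failing assert points at the element that was just written.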
Parsing some samples - results
geo:36.246944,-116.816944,-85;u=50
['geo:', [36.246944, -116.816944, -85.0], ';', [['u', 50.0]]]
- args: [['u', 50.0]]
  - u: 50.0
- coords: [36.246944, -116.816944, -85.0]
  - alt: -85.0
  - lat: 36.246944
  - lng: -116.816944

geo:30.2644663,-97.7841169;a=100;href=http://www.allure-energy.com/
['geo:', [30.2644663, -97.7841169]]
- coords: [30.2644663, -97.7841169]
  - lat: 30.2644663
  - lng: -97.7841169
'geo' URL – add support for 'other'
from pyparsing import *
EQ,COMMA = map(Suppress, "=,")
number = Regex(r'-?\d+(\.\d*)?').addParseAction(lambda t: float(t[0]))
geo_coords = Group(number('lat') + COMMA + number('lng') + Optional(COMMA + number('alt')))
crs_arg = Group('crs' + EQ + Word(alphanums))
u_arg = Group('u' + EQ + number)
other = Group(Word(alphas) + EQ + CharsNotIn(';'))
url_args = Dict(delimitedList(crs_arg | u_arg | other, ';'))
geo_url = "geo:" + geo_coords('coords') + Optional(';' + url_args('args'))
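The order of the alternatives in crs_arg | u_arg | other matters, because | (MatchFirst) takes the first alternative that matches and 'other' is general enough to match 'u=...' too. A small sketch of the difference; the names specific_first and general_first are made up for illustration.

```python
from pyparsing import CharsNotIn, Group, Regex, Suppress, Word, alphas

EQ = Suppress('=')
number = Regex(r'-?\d+(\.\d*)?').addParseAction(lambda t: float(t[0]))
u_arg = Group('u' + EQ + number)
other = Group(Word(alphas) + EQ + CharsNotIn(';'))

specific_first = u_arg | other    # order used on the slide
general_first = other | u_arg     # 'other' wins before u_arg is ever tried

print(specific_first.parseString("u=50").asList())   # value converted to float
print(general_first.parseString("u=50").asList())    # value left as a string
print(specific_first.parseString("href=http://www.allure-energy.com/").asList())
```

Putting the catch-all last keeps the float conversion for u while still accepting unknown keys.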
Parsing some samples (with 'other')
geo:36.246944,-116.816944,-85;u=50
['geo:', [36.246944, -116.816944, -85.0], ';', [['u', 50.0]]]
- args: [['u', 50.0]]
  - u: 50.0
- coords: [36.246944, -116.816944, -85.0]
  - alt: -85.0
  - lat: 36.246944
  - lng: -116.816944

geo:30.2644663,-97.7841169;a=100;href=http://www.allure-energy.com/
['geo:', [30.2644663, -97.7841169], ';', [['a', '100'], ['href', 'http://www.allure-energy.com/']]]
- args: [['a', '100'], ['href', 'http://www.allure-energy.com/']]
  - a: 100
  - href: http://www.allure-energy.com/
- coords: [30.2644663, -97.7841169]
  - lat: 30.2644663
  - lng: -97.7841169
Using the Pyparsing 'geo:' parser
geo = geo_url.parseString('geo:27.9878,86.9250,8850;crs=wgs84;u=100')
print(geo.dump())
['geo:', [27.9878, 86.925, 8850.0], ';', [['crs', 'wgs84'], ['u', 100.0]]]
- args: [['crs', 'wgs84'], ['u', 100.0]]
  - crs: wgs84
  - u: 100.0
- coords: [27.9878, 86.925, 8850.0]
  - alt: 8850.0
  - lat: 27.9878
  - lng: 86.925
print(geo.coords.alt)
8850.0
print(geo.args.asDict())
{'crs': 'wgs84', 'u': 100.0}
Best practice: dump() is very useful for seeing the structure and names in the parsed results
Best practice: pprint() is useful for seeing the results structure if no results names are defined
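Altitude and the options are optional, so code that consumes the results has to cope with names that are absent. A sketch under stated assumptions: the helper summarize is hypothetical, the grammar is repeated from the slides so the block runs on its own, and the 'wgs84' fallback is the default named on the options slide.

```python
from pyparsing import (CharsNotIn, Dict, Group, Optional, Regex, Suppress,
                       Word, alphanums, alphas, delimitedList)

# Same grammar as in the slides, repeated so the sketch is self-contained.
EQ, COMMA = map(Suppress, "=,")
number = Regex(r'-?\d+(\.\d*)?').addParseAction(lambda t: float(t[0]))
geo_coords = Group(number('lat') + COMMA + number('lng')
                   + Optional(COMMA + number('alt')))
crs_arg = Group('crs' + EQ + Word(alphanums))
u_arg = Group('u' + EQ + number)
other = Group(Word(alphas) + EQ + CharsNotIn(';'))
url_args = Dict(delimitedList(crs_arg | u_arg | other, ';'))
geo_url = "geo:" + geo_coords('coords') + Optional(';' + url_args('args'))

def summarize(url):
    """Hypothetical helper: flatten a parsed geo: URL into a plain dict."""
    geo = geo_url.parseString(url, parseAll=True)
    # 'args' is only present when the URL carried options.
    args = geo.args.asDict() if 'args' in geo else {}
    return {
        'lat': geo.coords.lat,
        'lng': geo.coords.lng,
        'alt': geo.coords.get('alt'),        # None when no altitude was given
        'crs': args.get('crs', 'wgs84'),     # fall back to the stated default
        'u': args.get('u'),
    }

print(summarize('geo:27.9878,86.9250,8850;crs=wgs84;u=100'))
print(summarize('geo:30.2644663,-97.7841169'))
```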
Other Examples of Using Pyparsing
• State model Python code
• SQL SELECT statements
  • also a good starter for "FauxSQL"
• Lucene query
  • (see also https://bitbucket.org/mchaput/whoosh/overview)
• Elastic search query (plasticparser)
TrafficLight = {
    Red -> Green;
    Green -> Yellow;
    Yellow -> Red;
}
DocumentRevision = {
    New -( create )-> Editing;
    Editing -( cancel )-> Deleted;
    Editing -( submit )-> PendingApproval;
    PendingApproval -( reject )-> Editing;
    PendingApproval -( approve )-> Approved;
    Approved -( activate )-> Active;
    Active -( deactivate )-> Approved;
    Approved -( retire )-> Retired;
    Retired -( purge )-> Deleted;
}
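To give a feel for how little grammar the simple TrafficLight form needs, here is a minimal sketch. It is not the state model example from the talk: the element names and the results names machine, source, target, and transitions are assumptions, and it does not handle the named transitions of the DocumentRevision form.

```python
from pyparsing import Group, OneOrMore, Suppress, Word, alphanums, alphas

ARROW, SEMI, LBRACE, RBRACE, EQ = map(Suppress, ["->", ";", "{", "}", "="])
ident = Word(alphas, alphanums + "_")

# One "Source -> Target;" line.
transition = Group(ident("source") + ARROW + ident("target") + SEMI)

# "Name = { transition transition ... }"
state_machine = (ident("machine") + EQ + LBRACE
                 + Group(OneOrMore(transition))("transitions") + RBRACE)

text = "TrafficLight = { Red -> Green; Green -> Yellow; Yellow -> Red; }"
parsed = state_machine.parseString(text, parseAll=True)

# Build a lookup table: current state -> next state.
next_state = {t.source: t.target for t in parsed.transitions}
print(parsed.machine, next_state)
```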
More Pyparsing Usages
• zhpy - Python interpreter with Chinese keywords - Fred Lin
  • https://pypi.python.org/pypi/zhpy/1.7.4
  • http://zh-tw.enc.tfode.com/%E5%91%A8%E8%9F%92
More Pyparsing Usages
• Robot command language
  • https://iusb.edu/computerscience/faculty-and-staff/faculty/jwolfer/005.pdf
rinit                  # initialize communication
rsens Rd               # read robot sensors
rspeed Rright,Rleft    # set robot motor speeds
rspeed $immed,$immed
Crumble
http://redfernelectronics.co.uk
More Pyparsing Usages
• Logo interpreter (in French) - Christophe Vu-Brugier
  • http://www.enodev.fr/
LC
TD 90 AV 60
TD 90 AV 80
BC
REPETE 4 [ TG 90 AV 20 ]
LC
AV 80
BC
REPETE 4 [ AV 20 TG 90 ]
How to Get Started
• Install Pyparsing (if not already available)
  pip install -U pyparsing
• MIT license
• Go to the pyparsing wiki (http://pyparsing.wikispaces.com) and check out the examples page
• Online docs at http://pythonhosted.org/pyparsing/
• Post questions on StackOverflow (use pyparsing tag)
• Paul McGuire - [email protected]
QUIZ!!!
Best Practices Summary
• Start with a BNF
• Don't use explicit whitespace in your parser
• Use parse actions for parse-time conversions
• Use results names to facilitate access to data after parsing
• parser.runTests() makes it easy to run through test cases
• parser.matches(test_string) is a simple test for unit testing
• results.dump() and results.pprint() are good for examining the parsing results
Thank You!