
  • 8/4/2019 Context Sensitive Earley

    1/18

Earley Parsing for Context-Sensitive Grammars

Daniel M. Roberts

    May 4, 2009

    1 Introduction

The field of parsing has received considerable attention over the last 40 years, largely because of its applicability to nearly any problem involving the conversion of data. It has proven to be a beautiful example of how theoretical research, in this case automaton theory, and clever algorithms can lead to vast improvements over naive or brute-force implementations.

String parsing has applications in data retrieval, compilers, and many other fields. For compilers especially, format description languages such as YACC have been developed to define how code in a text file should be interpreted as a tree structure which the compiler can handle directly. While this is good for the specialized domain, there are cases when one might wish to use a lighter-weight all-in-one approach that also includes a small amount of in-line programmability.

On the flip side of the pursuit for expressiveness are efficiency constraints. If the language is quite restricted, the parsing algorithm can be more tightly tailored to fit the problem, and thus can be more efficient. In practice, a simple language that is the subset of a more complicated language often outperforms the more expressive language when the two are given the same simple-language input. For example, the regular expression engines of Java, Python, Ruby, Perl, and certainly others, which include expressive features such as look-aheads, back-references, and boundary conditions, have worst-case exponential performance even on strictly regular regular expressions [1], whereas an implementation specialized for strict regular expressions can achieve worst-case quadratic time, or linear time if we allow preprocessing. The goal of the current research is to explore how to go about creating a parsing language that, in addition to being generally efficient, is in fact maximally efficient on restricted sublanguages with known solutions.

This work is mainly concerned with three formats for specifying the structure of textual documents: (1) regular expressions, which describe the regular languages, (2) context-free grammars (CFGs) with regular right-hand sides, which describe the context-free languages, and (3) what we shall call the context-sensitive grammars (CSGs)¹, which will be the focus

¹CSGs as defined here are not to be confused with the related but different definition of a context-sensitive grammar used in relation to the Chomsky hierarchy.



of this paper. Regular expressions and CFGs have long been staples of parsing and are used for everything from web apps and text search to compilers.

Matching a string of length m to a regular expression of length n is known to be worst-case O(nm) in time if we do not allow preprocessing, and this bound can be achieved with the Thompson algorithm, described below. Parsing a string of length m in accordance with a context-free grammar of size n is known to be worst-case O(m²n) in time, and an algorithm that achieves this time bound is the Earley algorithm, which is based on the same principle as Thompson. Earley can also be extended to parse strings according to a CSG, but because of the wide range of expressiveness allowed by this format, there is no good worst-case time bound. Nevertheless, one key property of the extension is that when the CSG describes a context-free language, it is no less efficient than traditional Earley parsing. Since both regular expressions and CFGs are subsets of CSGs, this augmented grammar specification format offers more flexibility than either, without sacrificing efficiency in the cases where special features are not used.

I shall begin with an overview of the theory for parsing regular expressions and context-free grammars.

    2 Regular Expressions

    Regular expressions have the form

    rexp ::= Eps | Char(char) | Alt(rexp,rexp) | Cat(rexp,rexp) | Star(rexp)

Basic regular expression syntax and matching conditions can be described as follows.

    1. Eps. A string matches () if it is the empty string.

    2. Char(c). A string matches c if it consists of the single character c.

    3. Alt(re1, re2). A string matches re1|re2 if it matches either re1 or re2.

4. Cat(re1, re2). A string matches re1 re2 if it can be split into two strings s1 and s2 such that s1 matches re1 and s2 matches re2.

5. Star(re). A string matches re* if it can be split into k ≥ 0 strings, all of which match re.

In addition, it is conventional to give the operators |, ·, and * binding preferences that are analogous to those of +, ×, and x² in mathematical expressions, thus reducing the need for parentheses. It is also conventional to leave out the explicit symbol ·, just as it is in

mathematical expressions. Thus, for example, the regular expression ab|cd* stands for the more explicit (a·b)|(c·(d*)), and matches either the string ab or any string of the form cddd..., i.e. a single c followed by zero or more ds.

The naive way to match a regular expression is to match its parts recursively, more or less as described above. Although this algorithm has been shown to be exponential in the degenerate case, it is by far the most widely used algorithm in practice because it is conceptually the simplest to implement and it is trivially extensible to include more powerful features such as back-references, which are inherently non-regular in the strict sense. It would be desirable, however, for the mere existence of such language features not to interfere with the computational complexity of parsing against expressions that use only the operations given here. This is the motivation for the present research.
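The matching conditions above transcribe directly into a short recursive matcher. The following is a minimal sketch, not the paper's implementation; the tuple encoding of the constructors is an illustrative assumption.

```python
# Naive recursive regular-expression matching: try every way to split
# the string, as in the matching conditions above. A regex is encoded
# with tuples mirroring the constructors:
#   ("Eps",), ("Char", c), ("Alt", r1, r2), ("Cat", r1, r2), ("Star", r)

def matches(re, s):
    tag = re[0]
    if tag == "Eps":
        return s == ""
    if tag == "Char":
        return s == re[1]
    if tag == "Alt":
        return matches(re[1], s) or matches(re[2], s)
    if tag == "Cat":
        # Try every split point; this nesting is the source of the
        # worst-case exponential behavior mentioned above.
        return any(matches(re[1], s[:i]) and matches(re[2], s[i:])
                   for i in range(len(s) + 1))
    if tag == "Star":
        if s == "":
            return True
        # Peel off a non-empty prefix matching re, then recurse.
        return any(matches(re[1], s[:i]) and matches(re, s[i:])
                   for i in range(1, len(s) + 1))
    raise ValueError(tag)

# ab|cd* from the running example:
ab_or_cds = ("Alt",
             ("Cat", ("Char", "a"), ("Char", "b")),
             ("Cat", ("Char", "c"), ("Star", ("Char", "d"))))

print(matches(ab_or_cds, "ab"))    # True
print(matches(ab_or_cds, "cddd"))  # True
print(matches(ab_or_cds, "abd"))   # False
```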

The polynomial-time algorithm alluded to above is attributed to Ken Thompson, 1968. It works by reading in the characters of the string one at a time, keeping track of all possible parses at once. To assist in keeping track of the partial parses, the algorithm first builds a directed graph representation of the regular expression so that each node represents a parse state, and each transition is labeled with the character that needs to be parsed in order for the edge to be followed. Some edges can be unlabeled, in which case no character is needed to pass from one state to the other. This type of graph is often called a Finite State Automaton (FSA).

Below is an example of a recursive function in pseudocode that returns the Initial node and the Final node. A parse that begins at the Initial node and ends at the Final node is a successful parse.

FSA(re):
    I = fresh_state()
    F = fresh_state()
    switch(re):
        Eps:
            return (F, F)
        Char(c):
            build [I --c--> F]
            return (I, F)
        Alt(re1, re2):
            (I1, F1) = FSA(re1)
            (I2, F2) = FSA(re2)
            build [I --> I1]
            build [I --> I2]
            build [F1 --> F]
            build [F2 --> F]
            return (I, F)
        Cat(re1, re2):
            (I1, F1) = FSA(re1)
            (I2, F2) = FSA(re2)
            build [F1 --> I2]
            return (I1, F2)
        Star(re1):
            (I1, F1) = FSA(re1)
            build [I --> I1]
            build [F1 --> I]
            build [I --> F]
            return (I, F)
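For concreteness, the pseudocode can be rendered in Python. This is a sketch under illustrative assumptions (integer states, a global edge list), not the paper's implementation.

```python
# A direct rendering of the FSA() pseudocode. States are integers;
# edges are (src, label, dst) triples, with label None for unlabeled
# (null) edges. Note that Cat returns (I1, F2): re1's initial state
# and re2's final state.

edges = []
_counter = [0]

def fresh_state():
    _counter[0] += 1
    return _counter[0]

def build(src, label, dst):
    edges.append((src, label, dst))

def FSA(re):
    I, F = fresh_state(), fresh_state()
    tag = re[0]
    if tag == "Eps":
        return (F, F)                      # initial and final coincide
    if tag == "Char":
        build(I, re[1], F)                 # I --c--> F
        return (I, F)
    if tag == "Alt":
        I1, F1 = FSA(re[1]); I2, F2 = FSA(re[2])
        build(I, None, I1); build(I, None, I2)
        build(F1, None, F); build(F2, None, F)
        return (I, F)
    if tag == "Cat":
        I1, F1 = FSA(re[1]); I2, F2 = FSA(re[2])
        build(F1, None, I2)                # glue re1's final to re2's initial
        return (I1, F2)
    if tag == "Star":
        I1, F1 = FSA(re[1])
        build(I, None, I1)                 # enter the loop
        build(F1, None, I)                 # loop back
        build(I, None, F)                  # skip or exit
        return (I, F)
    raise ValueError(tag)

# ab|cd* again:
re_ab_or_cds = ("Alt",
                ("Cat", ("Char", "a"), ("Char", "b")),
                ("Cat", ("Char", "c"), ("Star", ("Char", "d"))))
I, F = FSA(re_ab_or_cds)
print(len(edges))  # 13
```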

The Thompson algorithm to parse a string str against an FSA keeps track of S_k, the set of states that are reachable after parsing the first k characters, starting with k = 0 up to the length of the string. Specifically, for k ≥ 0, S_{k+1} depends only on S_k and str[k]. Here are the rules for moving ahead with Thompson:

Initialization:

    I ∈ S_0

Consumption:

    u ∈ S_k,   u --str[k]--> u'
    ---------------------------
    u' ∈ S_{k+1}

Null Propagation:

    u ∈ S_k,   u --ε--> u'
    ----------------------
    u' ∈ S_k

When F ∈ S_k, the first k characters of the string match the regular expression. There are a number of additions to the syntax of regular expressions that do not affect

    the overall complexity of the grammars they describe. In particular we shall use

    1. re+ is equivalent to re re*, and

    2. re? is equivalent to re|().

It is also convenient, both for ease of expression and for efficiency of parsing, to have character classes. A character class matches any symbol in a given set of symbols: for example, we may wish to have \a match any alphabetic character, have \w match any alphanumeric character, and have a period (.) match any character, etc. Since we can often represent a set of characters as a range in the ASCII character set, it is more efficient for a parser to see whether a character falls within the range than to see whether it is equal to one of some list of characters, because the latter approach requires a linear-time search.
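The three propagation rules above reduce to small set operations at run time. Below is a hedged sketch in Python over a hand-built FSA for the running example ab|cd*; the state numbering is an illustrative assumption.

```python
# Thompson simulation over a hand-built FSA for ab|cd*.
# States are integers with I = 0 and F = 9:
#   ab branch:  0 -e-> 1, 1 -a-> 2, 2 -e-> 3, 3 -b-> 4, 4 -e-> 9
#   cd* branch: 0 -e-> 5, 5 -c-> 6, 6 -e-> 7 (star entry),
#               7 -e-> 10, 10 -d-> 11, 11 -e-> 7 (loop), 7 -e-> 8, 8 -e-> 9
char_edges = {(1, "a"): 2, (3, "b"): 4, (5, "c"): 6, (10, "d"): 11}
null_edges = {0: [1, 5], 2: [3], 4: [9], 6: [7], 7: [10, 8], 11: [7], 8: [9]}

def close(states):
    # Null Propagation: follow unlabeled edges to a fixed point.
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for v in null_edges.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def match(s, I=0, F=9):
    S = close({I})                      # Initialization: I in S_0
    for ch in s:                        # Consumption, then closure again
        S = close({char_edges[(u, ch)] for u in S if (u, ch) in char_edges})
    return F in S                       # F in S_k: first k characters match

print(match("ab"), match("cddd"), match("abd"))  # True True False
```

Each input character is processed in time proportional to the number of states, which is where the O(nm) bound comes from.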


    3 Context Free Grammars

The context-free grammars describe a class of languages that is in some sense infinitely more complex than regular expressions. Among the language features that CFGs can describe but that regular expressions cannot are

    1. Matching parentheses,

    2. Recursive algebraic structures, and

    3. Trees of arbitrary depth.

The idea of recursion is explicitly built into the definition of a CFG. This makes them perfect for describing many formal languages, such as programming language syntax and data formats. As we will see shortly, the regular expression syntax can be described as a CFG, and for that matter so can CFG syntax.

    3.1 CFG Syntax

For the purposes of this paper, a CFG is a context-free grammar with regular right-hand sides. A CFG is a list of rules of the form

    nont = rexp;

where nont is the nonterminal symbol being defined and rexp is a regular expression over terminals and nonterminals. The first nonterminal in the list of definitions is taken as the start nonterminal, which is to say that a string matches the CFG if it matches the first nonterminal. To distinguish nonterminals from terminals, nonterminals are enclosed in curly braces {nont}. Just as character classes can be simulated by the other operations, namely alternation, so too can regular expressions be simulated by the more basic CFG syntax, which allows only concatenation. Allowing regular right-hand sides not only simplifies the notation for the programmer, but also makes parsing more efficient.

In addition to this baseline language, there are two extra syntactic features that do not express conditions on whether a string is accepted, but instead affect how verbose the resulting parse tree is. By default, all nonterminals processed in the course of matching a string are represented as a node in the tree, and all terminals are not represented at all. To hide a nonterminal and let its subtree be subsumed by its parent, the syntax

    .nont = rexp;

is used. To express the characters that appear in a regular expression, use the syntax $(re). This is especially useful when you want the parse tree to keep track of the actual character used to match a word class: $(.), $(\a), etc.; or to get the result of an alternation: $(a|b|c), etc. For grouping that does not save, brackets are used. Below is an example CFG that defines regular expression syntax:


    rexp = {rexp1};

    alt = {rexp2} (\| {rexp2})+;

    cat = {rexp3} {rexp3}+;

    uny = {rexp4} $(\*|\+|\?);

    eps = \(\);

    c = $(\w|\.|\\.);

    .paren = \({rexp1}\);

    .rexp1 = {rexp2} | {alt};

    .rexp2 = {rexp3} | {cat};

    .rexp3 = {rexp4} | {uny};

    .rexp4 = {eps} | {c} | {paren};

Here the four nonterminals rexp1, rexp2, rexp3, and rexp4 represent four different levels of binding. This ensures that a string such as ab|cd* is interpreted not as the regular expression (a(b|c)d)*, etc., but rather as (ab)|(c(d*)).

    3.2 CFG Transducers

The transducers needed to represent a CFG will not be finite state automata in general, because the expressive power of FSAs is equal to that of regular expressions. We will compile one FSA for each regular right-hand side, and use a new type of graph edge to join them together: the call edge. If some node u in nonterminal A has a transition u --{B}--> u' (where A may or may not be the same as B), we build a call edge from u to the start state of B's FSA. To use this kind of transducer, we have to maintain a stack of return addresses: when we follow u --call--> I_B, we push u onto the stack. When we reach F_B, we pop the first item off the stack, u, and transition to some u' where u --{B}--> u'. [3]

Once a transducer has been assembled from a CFG, it can be marshaled and stored as a preprocessed form of the CFG, to be used directly for parsing.

    3.3 Earley Parsing

As with FSAs, these transducers are in general non-deterministic; that is, from any given state there may be null-edges or multiple edges of the same type. Either of these means that during the parse, there will be moments where it is not clear which state we ought to move to next. As with regular expressions, this nondeterminism can in theory be dealt with by recursively trying every possible path until a match is found, but this kind of backtracking leads to poor performance that is worst-case exponential.

The classic polynomial-time solution to this problem was proposed by Jay Earley [2] and has a similar flavor to the Thompson algorithm for regular expressions. The version described below has been modified to work with the CFG transducers described above, and is largely based on a version described by Trevor Jim and Yitzhak Mandelbaum [3].


For every position 0 ≤ j ≤ n in the string to be parsed, the algorithm constructs an Earley set. Just as in Thompson, an Earley set is a set of possible parse states, but in the case of CFGs, a transducer vertex does not fully describe a parse state, because we also need to be able to reconstruct the call stack. One way to do this would be to describe a parse state as a transducer vertex plus the call stack, but this has the downside of being very inefficient: there is no bound on the length of the call stack, and thus these sets of possible states would be able to grow arbitrarily large. A far more compact way to represent the call stack is with a return address i, not to a vertex, but to an Earley set. What this means is that any vertex with a representative in the ith Earley set is a valid return address. Thus, if two parse states in the same Earley set call the same nonterminal, the sub-parse is only done once, rather than once per call. Below are the formal parsing semantics.

Rules carried over from Thompson:

Initialization:

    (I, 0) ∈ S_0

Consumption:

    (u, i) ∈ S_j,   u --str[j]--> v
    -------------------------------
    (v, i) ∈ S_{j+1}

Null Propagation:

    (u, i) ∈ S_j,   u --ε--> v
    --------------------------
    (v, i) ∈ S_j

New Rules:

Call Propagation:

    (u, i) ∈ S_j,   u --call--> v
    -----------------------------
    (v, j) ∈ S_j

Return Propagation:

    (u, i) ∈ S_j,   u = F_A,   (u', i') ∈ S_i,   u' --{A}--> v
    ----------------------------------------------------------
    (v, i') ∈ S_j

The general algorithm is to seed a set S_j with Initialization if j = 0 and Consumption otherwise; repeat the three Propagation rules until S_j does not change; and to recursively apply this to S_{j+1} if j < m. There are many ways to optimize propagation. One case where this can significantly increase efficiency is when a nonterminal is nullable, that is, when it matches the null string. Nullable nonterminals require applying the rules in several rounds until nothing changes, because if a nonterminal doesn't consume any characters,


Return Propagation searches through the items in the same Earley set (i = j above), which is as yet unfinished². This issue, however, does not affect the overall time complexity of the algorithm, which has worst-case time O(n²m) in either case, and this paper does not deal with such optimizations.
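To make the rules concrete, here is a hedged sketch of a recognizer that applies them to a hand-built transducer for the grammar S = a {S} b | ();, i.e. the language aⁿbⁿ (matching parentheses in miniature). The dictionary encoding and vertex numbers are illustrative assumptions, not the paper's implementation.

```python
# Minimal Earley recognizer over a transducer for S = a {S} b | ();
# Vertices: 0 = I_S, 3 = F_S; edges: 0 -a-> 1, 1 -call S-> 0,
# 1 -{S}-> 2, 2 -b-> 3, and a null edge 0 -> 3 for the empty branch.
# An item (v, i) in S_j means vertex v is reachable at position j,
# with the enclosing call of S having begun at position i.
char_edges = {(0, "a"): 1, (2, "b"): 3}
null_edges = {0: [3]}
call_edges = {1: ("S", 0)}        # follow the call edge to I_S = 0
nont_edges = {(1, "S"): 2}        # the {S}-labeled arc crossed on return
finals = {3: "S"}                 # vertex 3 is F_S

def recognize(s, I=0, F=3):
    n = len(s)
    sets = [set() for _ in range(n + 1)]
    sets[0].add((I, 0))                          # Initialization
    for j in range(n + 1):
        changed = True
        while changed:                           # propagate to a fixed point
            changed = False
            for (u, i) in list(sets[j]):
                new = [(v, i) for v in null_edges.get(u, [])]   # Null
                if u in call_edges:                             # Call
                    new.append((call_edges[u][1], j))
                if u in finals:                                 # Return
                    A = finals[u]
                    new += [(nont_edges[(u2, A)], i2)
                            for (u2, i2) in sets[i]
                            if (u2, A) in nont_edges]
                for item in new:
                    if item not in sets[j]:
                        sets[j].add(item)
                        changed = True
        if j < n:                                # Consumption into S_{j+1}
            for (u, i) in sets[j]:
                if (u, s[j]) in char_edges:
                    sets[j + 1].add((char_edges[(u, s[j])], i))
    return (F, 0) in sets[n]

print(recognize("aabb"), recognize(""), recognize("aab"))  # True True False
```

The fixed-point loop is exactly the several-rounds behavior discussed above for nullable nonterminals: when i = j, Return Propagation must rescan the set it is still building.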

    4 Context Sensitive Grammars

The purpose of the present paper is to explore a grammar that has context-sensitive features, but that looks formally quite similar to our formulation of CFGs with regular right-hand sides. In formal language theory, context-sensitivity is often formulated by loosening the restriction that rules have only a single nonterminal on the left-hand side. The present formulation is easier to reason with and incorporates some familiar features of imperative programming, such as the ability to pass arguments to subroutines, to store values in variables, and to reason with and operate on both values and variables.

    4.1 CSG Syntax

The augmentation of CFG syntax to accommodate forms of context-sensitivity can be done entirely by adding new types of expressions to regular expressions, now renamed rhss due to their lack of regularity in the grammar-theory sense.

    var = string

    nont = string

    rhs ::= Eps | Char(char) | Alt(rhs,rhs) | Cat(rhs,rhs) | Star(rhs)

    | Nont(nont, exp) | Assert(exp) | Capture(rhs,var) | Set(var,exp)

    exp ::= ...

    Here is a summary of the additions:

1. A call to a nonterminal may contain an argument, which the nonterminal may use to guide its parse.

    2. An assert statement, for example [len(x) > 3], which matches the empty string ifand only if the expression evaluates to true.

3. A capture statement, which after matching an rhs stores the matched string in a variable, which may be referenced later. For example (.. @ x) matches two characters and stores them in the variable x.

4. We may set and reset variables at any time with a command like (x=len(y)).


The expression language exp may be as expressive as one wishes so long as it does not modify the external environment, though a very simple language that includes integer calculation and comparison, strings, characters, atoi, string-length, variables, and equality testing is enough to allow this class of grammars to describe many practically applicable cases of context-sensitivity.

    4.2 CSG Transducers

As above, once we have reformulated how to build the right-hand sides, joining them to create the transducer is done as described for CFGs, with the small caveat that call edges are parameterized with the argument to be passed.

To build a transducer fragment, we use the method described for regular expressions, also parameterizing nonterminal edges with their argument. An assert edge is labeled with the assertion, and a set edge is labeled with the assignment.

Capture(x,rhs) requires its own mechanism, which is to surround rhs's fragment with an incoming push arrow and an outgoing pop x arrow. The transducer semantics are described below.

    4.3 Augmented Earley Parsing

This algorithm is a minimal modification of the Earley algorithm to include extra state information, namely variable contexts and the capture stack. Thus, an Earley item for CSG parsing is (u, i, E, Π), where E is the context, which has type [(var, exp)], and Π is the capture stack, which has type [int].

Rules carried over from Thompson:

Initialization:

    (I, 0, [], []) ∈ S_0

Consumption:

    (u, i, E, Π) ∈ S_j,   u --str[j]--> v
    -------------------------------------
    (v, i, E, Π) ∈ S_{j+1}

Null Propagation:

    (u, i, E, Π) ∈ S_j,   u --ε--> v
    --------------------------------
    (v, i, E, Π) ∈ S_j

Rules carried over from vanilla Earley:

Call Propagation:

    (u, i, E, Π) ∈ S_j,   u --call(e)--> v
    --------------------------------------
    (v, j, [(arg, e)], []) ∈ S_j

Return Propagation:

    (u, i, E, Π) ∈ S_j,   u = F_A,   (u', i', E', Π') ∈ S_i,   u' --A(E[arg])--> v
    ------------------------------------------------------------------------------
    (v, i', E', Π') ∈ S_j

New Rules:

Assert Propagation:

    (u, i, E, Π) ∈ S_j,   u --assert(e)--> v,   eval(e, E) = Bool(true)
    -------------------------------------------------------------------
    (v, i, E, Π) ∈ S_j

Set Propagation:

    (u, i, E, Π) ∈ S_j,   u --set(x,e)--> v
    ---------------------------------------
    (v, i, ((x, e) :: E), Π) ∈ S_j

Push Propagation:

    (u, i, E, Π) ∈ S_j,   u --push--> v
    -----------------------------------
    (v, i, E, (j :: Π)) ∈ S_j

Pop Propagation:

    (u, i, E, (k :: Π)) ∈ S_j,   u --pop(x)--> v
    --------------------------------------------
    (v, i, ((x, str[k:j]) :: E), Π) ∈ S_j

Two details deserve attention. First, Call Propagation has been modified to pass on a context that contains the special variable arg set to the value of the parameter. Second, Return Propagation has been modified so that a nonterminal arc must match both in nonterminal and in parameter. The capture mechanism works as follows: to start a capture, the input position is pushed onto the capture stack; to finish a capture, pop the start position off the stack, extract the substring of the input that starts there and ends at the current position, and store this string in some variable.
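The push/pop bookkeeping can be seen in isolation in a few lines. This is an illustrative trace, not the actual parser: it mimics what Push and Pop Propagation do to a single item's capture stack and context while the rhs (.. @ x) consumes two characters.

```python
# Capture mechanics in isolation, for the rhs (.. @ x) on input "abcd".
text = "abcd"
stack = []          # the capture stack of input positions
env = []            # the context E, a list of (var, value) pairs

j = 0               # current input position
stack.append(j)     # Push Propagation: record the start position
j += 2              # the body ".." consumes two characters
k = stack.pop()     # Pop Propagation: recover the start position...
env.insert(0, ("x", text[k:j]))   # ...and cons (x, str[k:j]) onto E

print(env)          # [('x', 'ab')]
```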

    5 The Expression Language

In the current implementation, a simple, untyped expression language is used. Here is a BNF outline:

    type exp ::= Var(var) | Unit | Bool(bool) | Int(int) | Char(char) | Str(string)

    | Not(exp) | Equals(exp, exp) | Less(exp, exp)

| Minus(exp) | Sum(exp, exp) | Prod(exp, exp)

    | GetChar(exp, exp) | Len(exp) | Atoi(exp) | Fail

    type var = string

In addition, the symbols ! for Not, = for Equals, < for Less, and - for binary Sum(e1,Minus(e2)) are used for syntactic convenience. The syntax str[i] is used for GetChar(str,i), len(str) for Len(str), and int(str) for Atoi(str).
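An evaluator for this language is a one-screen recursive function. The sketch below covers only a few constructors; the tuple encoding and the front-to-back context lookup (newer bindings shadow older ones) are illustrative assumptions.

```python
# Sketch of an evaluator for the expression language, with expressions
# encoded as tuples: ("Sum", e1, e2), ("Var", "x"), etc.
# E is the context, a list of (var, value) pairs searched front-to-back.

def eval_exp(e, E):
    tag = e[0]
    if tag == "Var":
        return next(v for (x, v) in E if x == e[1])
    if tag in ("Int", "Str", "Bool", "Char"):
        return e[1]
    if tag == "Not":
        return not eval_exp(e[1], E)
    if tag == "Equals":
        return eval_exp(e[1], E) == eval_exp(e[2], E)
    if tag == "Less":
        return eval_exp(e[1], E) < eval_exp(e[2], E)
    if tag == "Sum":
        return eval_exp(e[1], E) + eval_exp(e[2], E)
    if tag == "Len":
        return len(eval_exp(e[1], E))
    if tag == "Atoi":
        return int(eval_exp(e[1], E))
    raise ValueError(tag)

# The assertion [3 < x] with x bound to len("abcd"):
E = [("x", len("abcd"))]
print(eval_exp(("Less", ("Int", 3), ("Var", "x")), E))  # True
```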


    6 Parse Tree Building

To return a parse tree from the algorithm outlined above, it suffices to store for each parse state item a pointer to the item or items that participated in its creation. The scheme used in the current implementation is as follows. Every item stores one of the following parse annotations:

    1. When an item is added by Call Propagation, it stores a PCall tag, with no pointers.

2. When an item is added by Return Propagation, it stores PReturn(u, u', A, E[arg], show), where u, u', A, and E correspond to the variables in Return Propagation as stated above, and show records whether the parse of the nonterminal should be given its own subtree, or whether it should be subsumed by the caller parse; this is expressed syntactically by the omission or inclusion of a period (.) before the name of the nonterminal in the grammar file.

3. When an item is added by anything else, it stores a simple back pointer to the item: PTransEps(item).

Additionally, in order to have some sort of control over what shows up in our parse tree and what is omitted, we can augment the CSG right-hand sides to include a Show(var) constructor. Syntactically this is written ($var). This generates an arc in the transducer that acts just like a null transition, except that if u --show(var)--> v, then v stores a special parse annotation: PShow(exp), where exp is the value of var, E[var]. This lets the tree generator know to include exp as one of the Leafs of the tree. Note that a Leaf(exp) records the expression as an expression value: Unit, Bool(b), Int(i), Char(c), or Str(s), preserving the type.

    7 Examples

To test the efficacy of this system in practice, the following examples have been tested on the current implementation.

7.1 IP Addresses

This example matches an IP address, whose format is N.N.N.N, where 0 ≤ N ≤ 255.

    IP = {N255}\.{N255}\.{N255}\.{N255};

.N255 = (\d+@x)(x=int(x))[x>=0][x<=255];


    7.2 Char-Terminated String

This example uses a parameterized nonterminal. The nonterminal until matches the shortest substring that is terminated by the character passed as an argument, i.e. any string that doesn't contain arg anywhere except as the last character.

    program = _ {until \;} _;

    until = $( ( (. @ x)[x[0]!=arg] )* ) (. @ x)[x[0]=arg];

Note that the $ directive is used to save the value of the matched string, without the terminating character.

    7.3 XML

XML is often thought of as the all-purpose way to represent a tree structure. Ironically, a normal CFG, the all-purpose way to generate tree structures, is actually incapable of describing XML syntax. This is because the open and close tags in XML have to match, and testing for that match requires string comparison of two sections of the parse. CSGs are capable of doing this.

    program = _ {xml} _;

    xml = \< _ $({word} @ head) (\s+ {setting})* _ \>

    {text}?({xml}+ {text})*{xml}*

    \;

    setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");

text = $(((. @ cstr)[cstr!="\<"])*);

    .word = (\a | \_)(\w | \_)*;

This little hack represents the basic XML format, that is, header and parameters in angle brackets, followed by text and other XML tags, followed by a matching end tag.

    7.4 Operator Binding Strength

In an earlier CFG example, we described regular expression syntax as a CFG, using multiple nonterminals to achieve order-of-operations rules, or binding strength. Parameterized nonterminals give us another, potentially neater way to express these rules. Here is a version of the previous example that uses a few of our simple programming features.

    program = _ {rexp} _;

    .rexp = [arg=()] {rexp 1}| [arg


    | [arg


When matched against strings of the form a a2 a3 ... ak a, Python can instantly handle k in the hundreds. The CSG's performance, on the other hand, has the following behavior:

     k     n   CSG time (s)
     8    45   .066
    16   153   1.3
    17   171   1.9
    18   190   2.4
    19   210   3.0
    20   231   4.3
    21   253   5.3
    22   276   6.6

Table 2: Finding repeated words in a a2 a3 ... ak a. n is the length of such a string.

    It seems that the slowness has nothing to do with the linear-time complexity of [x=y];in fact, removing this assertion makes parsing much slower, not faster, so that it takes 16.5seconds to parse when k = 16, instead of 1.3 seconds. A minimal example of how capturescan slow a parse down is (.*@x).*(.*@y), which shows similar behavior on long strings; theanalogous non-capturing regular expression, .*.*.*, matches long strings instantaneously.

The exact reason for the CSG's poor performance in certain circumstances when compared with Python has not been shown. Although it may simply be a combination of the fact that Python's algorithm is optimized for word-boundary assertions and the fact that it may involve preprocessing, it is quite possible that the backtracking algorithm is simply faster in this case. The Earley algorithm does, after all, do every parse simultaneously. When dealing with CFGs, the size of an Earley set is bounded. Now that we are dealing with Earley sets that can grow to arbitrary size, this approach may cause the algorithm to take a significant performance hit in some cases.

    9 Further Work

This line of research needs to be more thoroughly explored and taken to its logical conclusion in a number of ways. What has been outlined here is a minimal example of how to efficiently incorporate programming features into the CFG and regular expression paradigm. Here are a few directions that deserve attention in future research.


    9.1 Determinization

It is often desirable to determinize a transducer in order to avoid repeatedly following the same unnecessary paths. Full determinization produces a transducer in which (1) no null edges exist and (2) no node has two identical outgoing edges. A determinized transducer thus has the advantage that there is no guesswork, and this results in linear-time parsing. In the case of CSG transducers, full determinization must be compromised because of the nature of the problem, but it is possible that some standard techniques for determinizing CFGs may be applicable or partially applicable to CSG parsing. Potential challenges include:

    1. Preserving the behavior of context information and the capture stack

    2. Determinizing calls

3. The essentially nondeterministic nature of situations such as (u --push--> v, u --assert(x=y)--> w), where u has two outgoing edges, neither of which eats a character. This can potentially be managed by combining certain null-consuming edges, such as push-edges, with the edges they point to, so that (u --push--> v, v --c--> w) becomes (u --push; c--> w).

    9.2 Compact Language Features

There are certain syntactic features that programmers are used to that could make CSGs easier to read, some of which would have the added benefit of speeding up parsing. Here are some ideas:

    1. Include if p then re1 else re2 syntax, which could be shorthand for [p]re1|[!p]re2.

2. Include match x with v1 -> re1 or ... or vn -> ren default -> re syntax, which could be shorthand for [x=v1]re1|[x!=v1](... [x=vn]ren([x!=vn]re) ...), but would be more efficient if implemented separately.

3. Guards: [len(re) < 6], [re = exp], shorthand for (re @ x)[len(x) < 6], etc.


    language that includes ways of creating functions, making lists and tuples, etc. Whilethis may be desirable, it may also unnecessarily complicate the language and lead to poorparsing performance.

    9.4 Integrating CFGs into the Right Hand Sides

It is a bit of a theoretical eyesore that all of the rich language features are built directly into the right-hand sides except for the association of nonterminals with their rhss. This could be remedied by removing the top-level CFG structure and replacing it with a new rhs constructor.

    rhs ::= ... | WithNonts(rhs, [(nont,rhs)])

This constructor combines an rhs that has undefined nonterminals and a mapping of nonterminals to rhss. This changes the nature of the language in a number of ways. First, the start nonterminal does not need a name, which means that the language also accepts vanilla regular expressions. (In the present system, one must write S = rexp;.) Also, this allows for modularization, where nonterminals have a scope. For example, matching XML files as in the example above requires a number of nonterminals, but no other nonterminal would ever call them directly. Using this modularized CFG notation, the example for XML might look something like:

    _ {xml} _ : {

    xml = \< _ $({word} @ head) (\s+ {setting})* _ \> {inner}

    \ : {

    inner = {text}?({xml}+ {text})*{xml}*: {

    text = $(((. @ cstr)[cstr!="\"])*);

    };

    setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");.word = (\a | \_)(\w | \_)*;

    };

    }

Unlike some other potential modifications, this would only have the effect of making code easier to read, write, and update. It would not have a significant impact on performance.

    9.5 Explicit Tree Construction

In the present system, tree construction is automatic and based on the way the nonterminals are parsed. The only control we have over tree structure from within the language is the somewhat awkward dot-notation to suppress expression of a nonterminal. There may be cases when we want to set aside parts of an rhs as a subtree without explicitly making a nonterminal for it. To add such control over tree construction, we can add the following constructor:


    rhs ::= ... | Label(rhs, string)

    This could be written {label: rhs}; nonterminal syntax could be replaced with ;{nont} could be a shorthand for {nont: }; and dot-notation could be eliminated.

Thus there would be an explicit way to signal subtree creation, as well as a way to show or not show nonterminal subtrees on a per-case basis.

A second addition would change the language quite dramatically but may be a good way to incorporate the equivalent of YACC's actions. Essentially, we could replace the constructor above with

    rhs ::= ... | Construct(exp)

So long as the expression language is rich enough, we can explicitly build an arbitrary structure and not rely on the parser's tree representation at all. If we are also allowed to include global type statements for the expression language, we can get something like the following code for building regular expressions:

    type rexp = Eps | Char(string) | Alt([rexp]) | Cat([rexp]) | Star(rexp);

    _ {rexp} _ :

    rexp =

    rexp_b = [arg


that only use a simpler subset of the language. This paper shows an instance of three languages in a hierarchy of increasing expressiveness, each of which can be used as a replacement for the languages below it without sacrificing speed. That is, if we use the CSG engine to match a regular expression, it will have a running time of O(nm), and if we use it to match a CFG, it will have a running time of O(n²m) worst case.

Notably, the extensions are all entirely natural, in that the algorithm does not have

to explicitly probe the complexity of the grammar in order to optimize efficiency. Rather, any subsection of a CFG that is regular will be parsed like a regular expression, and any subsection of a CSG that is context-free will be parsed like a context-free grammar, by mere virtue of not using certain parse features.

This minimal revision of CFGs to include context sensitivity shows some performance vulnerabilities that need to be explored further. It may improve performance, for example, to put a preliminary cap on the sizes of the Earley sets, and only do a full parse if the first pass does not bear fruit. Reasoning about the transducer graph and doing optimizations that way may also address some performance issues. The fact remains, however, that parsing the general CSG is an NP-hard problem [1], and that beyond a certain point, solutions will inevitably have to involve (1) optimizing for more special subproblems and (2) using heuristics to help determine parsing order. This framework has already shown how to integrate the solutions to two special subproblems, and in this particular respect, CSGs are superior to and outperform their counterparts in the regex libraries of many scripting languages. Whether the Earley approach can be modified to match Python and Perl in the general context-sensitive cases as well remains to be seen.

    References

[1] Russ Cox. Regular expression matching can be simple and fast, January 2007. http://swtch.com/~rsc/regexp/regexp1.html

[2] Jay Earley. An efficient context-free parsing algorithm. Commun. ACM, 13(2):94-102, 1970.

[3] Trevor Jim and Yitzhak Mandelbaum. Efficient Earley parsing with regular right-hand sides. Proceedings of LDTA 2009, March 2009.
