tools and analyses for ambiguous input streams andrew begel and susan l. graham university of...

85
Tools and Analyses for Ambiguous Input Streams Andrew Begel and Susan L. Graham University of California, Berkeley LDTA Workshop - April 3, 2004

Post on 19-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Tools and Analyses for Ambiguous Input Streams

Andrew Begel and Susan L. Graham

University of California, Berkeley

LDTA Workshop - April 3, 2004

April 3, 2004 LDTA 2004 2

Harmonia:Language-aware Editing Programming by Voice

– Code dictation– Voice-based editing commands

Program Transformations– Transformation actions– Pattern-matching constructs

April 3, 2004 LDTA 2004 3

Harmonia:Language-aware Editing Programming by Voice

– Code dictation– Voice-based editing commands

Program Transformations– Transformation actions– Pattern-matching constructs

Human Speech

April 3, 2004 LDTA 2004 4

Harmonia:Language-aware Editing Programming by Voice

– Code dictation– Voice-based editing commands

Program Transformations– Transformation actions– Pattern-matching constructs

Human Speech

EmbeddedLanguages

April 3, 2004 LDTA 2004 5

Harmonia:Language-aware Editing Programming by Voice

– Code dictation– Voice-based editing commands

Program Transformations– Transformation actions– Pattern-matching constructs

Each kind of input stream ambiguity requires new language analyses

Human Speech

EmbeddedLanguages

April 3, 2004 LDTA 2004 6

Speech Example

for (int i = 0; i < 10; i++ ) {

}

for int i equals zero i less than ten i plus plus

April 3, 2004 LDTA 2004 7

Ambiguities

for (int i = 0; i < 10; i++ ) {

}

4 int eye equals 0 aye less then 10 i plus plus

April 3, 2004 LDTA 2004 8

Ambiguities

for (int i = 0; i < 10; i++ ) {

}

4 int eye equals 0 aye less then 10 i plus plus

KW or #?

ID Spelling?KW or ID?

April 3, 2004 LDTA 2004 9

Another Utterance

for times ate equals zero two plus equals one

April 3, 2004 LDTA 2004 10

Many Valid Parses!

4 * 8 = zero; to += won

for times ate equals zero two plus equals one

for (times; ate == 0; to += 1) {

}

fore.times(8).equalsZero(2, plus == 1)

April 3, 2004 LDTA 2004 11

Embedded Language Example

C and Regexps embedded in Flex

Flex Rule for Identifiers

[_a-zA-Z]([_a-zA-Z0-9])* i++; RETURN_TOKEN(ID);

April 3, 2004 LDTA 2004 12

Embedded Language Example

C and Regexps embedded in Flex

Flex Rule for Identifiers

[_a-zA-Z]([_a-zA-Z0-9])* i++; RETURN_TOKEN(ID);

Why not this interpretation?

[_a-zA-Z]([_a-zA-Z0-9])* i++; RETURN_TOKEN(ID);

April 3, 2004 LDTA 2004 13

Legacy Language Example

Fortran

DO 57 I = 3,10

April 3, 2004 LDTA 2004 14

Legacy Language Example

Fortran

• Do Loop

DO 57 I = 3,10

April 3, 2004 LDTA 2004 15

Legacy Language Example

Fortran

• Do Loop

DO 57 I = 3,10

DO 57 I = 3

April 3, 2004 LDTA 2004 16

Legacy Language Example

Fortran

• Do Loop

DO 57 I = 3,10• Assignment

DO 57 I = 3

April 3, 2004 LDTA 2004 17

Legacy Language Example

Fortran

• Do Loop

DO 57 I = 3,10• Assignment

DO57I = 3

April 3, 2004 LDTA 2004 18

Legacy Language Example

PL/I

• Non-reserved Keywords

IF IF = THEN THEN THEN = ELSE ELSE ELSE = ENDEND

April 3, 2004 LDTA 2004 19

Legacy Language Example

PL/I

• Non-reserved Keywords

IF IF = THEN THEN THEN = ELSE ELSE ELSE = ENDEND

KW

ID

ID

ID

April 3, 2004 LDTA 2004 20

Single SpellingMultiple Spellings

Single Lexical Category

UnambiguousHomophone IDs

Lexical misspellings

Multiple Lexical Categories

Non-reserved keywords

Ambiguous interpretations

Homophones

Input Stream Classification

April 3, 2004 LDTA 2004 21

Single SpellingMultiple Spellings

Single Lexical Category

UnambiguousHomophone IDs

Lexical misspellings

Multiple Lexical Categories

Non-reserved keywords

Ambiguous interpretations

Homophones

Input Stream Classification

Embedded Languages Fall in all Four Categories!

April 3, 2004 LDTA 2004 22

GLR Analysis Architecture

LexerGLR

ParserSemantics

FOR I

FOR I

for (i = 0; i < 10; i++ ) {

}

(

April 3, 2004 LDTA 2004 23

GLR Analysis Architecture

LexerGLR

ParserSemantics

FOR I

FOR I

for (i = 0; i < 10; i++ ) {

}

(

Handles syntactic ambiguities

April 3, 2004 LDTA 2004 24

Our Contribution:XGLR Analysis Architecture

LexerXGLRParser

Semantics

for i equals zero ...

FOR I

FOR I

April 3, 2004 LDTA 2004 25

Our Contribution:XGLR Analysis Architecture

LexerXGLRParser

Semantics

for i equals zero ...

FOR I4 EYE

FOR I

Handles input stream ambiguities

April 3, 2004 LDTA 2004 26

LR Parsing

ID KW #

1 S2 S3 Err

2 R1 S4 Err

3 S9 R3 S7

IID

FORKW

=KW1

Parse Table

0#

Input StreamParse Stack

April 3, 2004 LDTA 2004 27

LR Parsing

ID KW #

1 S2 S3 Err

2 R1 S4 Err

3 S9 R3 S7

IID

FORKW

=KW1

Parse Table

0#

Input StreamParse Stack

April 3, 2004 LDTA 2004 28

LR Parsing

ID KW #

1 S2 S3 Err

2 R1 S4 Err

3 S9 R3 S7

IID

=KW1

Parse Table

0#

Input StreamParse Stack

FORKW

3

April 3, 2004 LDTA 2004 29

GLR Parsing

ID KW #

1 S2S3

R5Err

2R1

R2S4 Err

3 S9 R3 S7

IID

FORKW

=KW

1 Parse Table

0#

Input StreamParse Stack

April 3, 2004 LDTA 2004 30

GLR Parsing

ID KW #

1 S2S3

R5Err

2R1

R2S4 Err

3 S9 R3 S7

IID

FORKW

=KW

1 Parse Table

0#

Input StreamParse Stack

April 3, 2004 LDTA 2004 31

GLR Parsing

ID KW #

1 S2S3

R5Err

2R1

R2S4 Err

3 S9 R3 S7

IID

FORKW

=KW

1 Parse Table

0#

Input StreamParse Stack

25

April 3, 2004 LDTA 2004 32

GLR Parsing

ID KW #

1 S2S3

R5Err

2R1

R2S4 Err

3 S9 R3 S7

IID

=KW

1 Parse Table

0#

Input StreamParse Stack

2

FORKW

3

FORKW

45

April 3, 2004 LDTA 2004 33

Single SpellingMultiple Spellings

Single Lexical Category

Not Shown Example 1

Multiple Lexical Categories

Example 2 Example 1

XGLR in Action

April 3, 2004 LDTA 2004 34

23 FOR

Parsing Homophones

BAR

April 3, 2004 LDTA 2004 35

23 FOR

FORE

4

ID

KW

NUM

XGLR Extension: Multiple Spellings, Single and Multiple Lexical Categories

BAR

FOUR

April 3, 2004 LDTA 2004 36

23 FOR

FORE

4

23

23

ID

KW

NUM

XGLR Extension: Parsers fork due to input ambiguity

BAR

FOUR

April 3, 2004 LDTA 2004 37

23

26

FOR

FORE

4

29

35

23

23

ID

KW

NUM

Each parser shifts its now unambiguous input

BAR

FOUR

April 3, 2004 LDTA 2004 38

23

26

FOR

FORE

4

29

35

BAR

23

23

ID

IDKW

NUM

The next input is lexed unambiguously

FOUR

April 3, 2004 LDTA 2004 39

23

26

FOR

FORE

4

29

35

BAR

23

23

42

49ID

IDKW

NUM

ID is only a valid lookahead for two parsers

FOUR

April 3, 2004 LDTA 2004 40

Parsing Embedded LanguagesExample BNF Grammar

Contains Languages L and W

bL loopL dW ENDL

loopL LOOPL |

dW WHILEW NUMW doW

doW DOW |

L

W

April 3, 2004 LDTA 2004 41

Parsing Embedded LanguagesExample BNF GrammarContains Languages L and W

bL loopL dW ENDL

loopL LOOPL |

dW WHILEW NUMW doW

doW DOW |

LOOP WHILE 34 END WHILE 56 DO END

L

W

April 3, 2004 LDTA 2004 42

April 3, 2004 LDTA 2004 43

April 3, 2004 LDTA 2004 44

April 3, 2004 LDTA 2004 45

April 3, 2004 LDTA 2004 46

S 0

Parsing Embedded Languages

LOOP WHILE 34

April 3, 2004 LDTA 2004 47

S 0 LOOP WHILE 34

Current parse state has ambiguous lexical language

April 3, 2004 LDTA 2004 48

S

0

0

W

L

LOOP WHILE 34

XGLR Extension: Fork parsers, assign one to each lexical language

April 3, 2004 LDTA 2004 49

S

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34

XGLR Extension: Single spelling, Multiple lexical categoriesLex lookahead both in language L and W

April 3, 2004 LDTA 2004 50

S

4L

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34

Only LOOPL is valid lookahead, and is shifted

April 3, 2004 LDTA 2004 51

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34

XGLR Extension: State 4 has lexer lookaheads only in language W

April 3, 2004 LDTA 2004 52

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE

34KW

W

Lex lookahead in language W

April 3, 2004 LDTA 2004 53

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE

34KW

W

1W

REDUCE by rule 2 and GOTO state 1

loopL

April 3, 2004 LDTA 2004 54

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE

34

KW

W1

W

loopL

April 3, 2004 LDTA 2004 55

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE

34

KW

W1

W2

W

Shift into state 2

loopL

April 3, 2004 LDTA 2004 56

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE

34

KW

W1

W2

W

W

NUM

XGLR Extension: Lex lookahead in language W

loopL

April 3, 2004 LDTA 2004 57

S

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34KW

W1

W2

W W

NUM

loopL

April 3, 2004 LDTA 2004 58

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34KW

W1

W2

W W

NUM3

W

Shift into state 3

S

loopL

April 3, 2004 LDTA 2004 59

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34KW

W1

W2

W W

NUM3

W

Shift into state 3, which has ambiguous lexical language

S

loopL

April 3, 2004 LDTA 2004 60

4W

0

0 LOOP

LOOPW

L

W

L

KW

ID

WHILE 34KW

W1

W2

W W

NUM3

W

3L

XGLR Extension: Single spelling, Multiple lexical categoriesFork parsers, assign one to each lexical language

S

loopL

April 3, 2004 LDTA 2004 61

GLR Ambiguity Support

1. Fork parser on shift-reduce conflict

2. Fork parser on reduce-reduce conflict

April 3, 2004 LDTA 2004 62

XGLR Ambiguity Support

1. Fork parser on shift-reduce conflict

2. Fork parser on reduce-reduce conflict

April 3, 2004 LDTA 2004 63

XGLR Ambiguity Support

1. Fork parser on shift-reduce conflict

2. Fork parser on reduce-reduce conflict

3. Fork parsers on ambiguous lexical language Single spelling, Multiple lexical categories

4. Fork parsers on ambiguous lexical lookahead Single/Multiple Spellings, Multiple lexical

categories Shift-shift conflict resolution

April 3, 2004 LDTA 2004 64

XGLR Ambiguities

Many GLR programming language specs have finite, few ambiguities

XGLR language specs also have finite, but slightly more, ambiguities – Lexical ambiguity due to ambiguous input does

result in more ambiguous parse forests

April 3, 2004 LDTA 2004 65

XGLR Ambiguities

Many GLR programming language specs have finite, few ambiguities

XGLR language specs also have finite, but slightly more, ambiguities– Lexical ambiguity due to ambiguous input does

result in more ambiguous parse forests

Ambiguity causes parsers to fork GLR maintains efficiency by merging parsers

when ambiguity is over

April 3, 2004 LDTA 2004 66

Parser Merging

GLR: Parsers merge when in same parse state

DOKW

8

31

DOKW

5

57#

5

April 3, 2004 LDTA 2004 67

Parser Merging

GLR: Parsers merge when in same parse state

DOKW

8

31

DOKW

5 57#

4

5

57#

April 3, 2004 LDTA 2004 68

Parser Merging

XGLR: Parsers merge when in same parse state and same lexical state

DOKW

8

31

DOKW

5

57#

A

A A

A A

W

5A

April 3, 2004 LDTA 2004 69

Parser Merging

XGLR: Parsers merge when in same parse state and same lexical state

DOKW

8

31

DOKW

5

A

A A

A A

W

5A

57#

57#

A

W

April 3, 2004 LDTA 2004 70

Parser Merging

XGLR: Parsers merge when in same parse state and same lexical state

DOKW

8

31

DOKW

5

A

A A

A A

W

5A

57#

57#

A

W4

W

April 3, 2004 LDTA 2004 71

Parser Merging

XGLR: Parsers merge when in same parse state and same lexical state

DOKW

8

31

DOKW

5

A

A A

A A

W

5A

57#

57#

A

W4

A

April 3, 2004 LDTA 2004 72

Parser Merging

XGLR: Parsers merge when in same parse state and same lexical state

DOKW

8

31

DOKW

5

A

A A

A A

W

5A

57#

57#

A

W4

A

April 3, 2004 LDTA 2004 73

Out of Sync Parsers

DO57I=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

April 3, 2004 LDTA 2004 74

Out of Sync Parsers

57I=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W=3

April 3, 2004 LDTA 2004 75

Out of Sync Parsers

57I=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W=3

3A

5W

April 3, 2004 LDTA 2004 76

Out of Sync Parsers

I=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W3

3A

5W

=KW

W

57ID

A

April 3, 2004 LDTA 2004 77

Out of Sync Parsers

I=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W3

3A

5W

=KW

W

57ID

A

6W

4A

April 3, 2004 LDTA 2004 78

Out of Sync Parsers

=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W3

3A

5W

=KW

W

57ID

A

6W

4A

#

W

IID

A

April 3, 2004 LDTA 2004 79

Out of Sync Parsers

=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W3

3A

5W

=KW

W

57ID

A

6W

4A

#

W

IID

A

9W

April 3, 2004 LDTA 2004 80

Out of Sync Parsers

=3

XGLR: Parsers merge when in same parse state and same lexical state and same input position

8

1A

W

5A

DOKW

A

DO57IID

W3

3A

5W

=KW

W

57ID

A

6W

4A

#

W

IID

A

9W

April 3, 2004 LDTA 2004 81

Implementation

Keep map: lookahead parser to use when looking for parsers to merge with

Sort parsers by position of lookahead in the input – Enables pruning of map as parsers move past a

particular input location– Extra memory required is bounded by dynamic

separation between first and last parsers

April 3, 2004 LDTA 2004 82

Related Work GLR Parsing Algorithm

– Tomita [1985]– Farshi [1991]– Rekers [1992]– Johnstone et. al. [2002]

Incremental GLR– Wagner [1997]

GLR Implementations(that I heard of before today)

– ASF+SDF [1993]– Elkhound [2004]– Bison [2003]– DParser [2002]– Aycock and Horspool [1999]

Scannerless Parsing(or Context-Free Scanning)

– Salomon and Cormack [1989]– Visser [1997]

van den Brand [2002] Ambiguous Input Streams

– Aycock and Horspool [2001] Embedded Languages

– ASF+SDF [1997]– Van de Vanter and

Boshernitsan(CodeProcessor) [2000]

April 3, 2004 LDTA 2004 83

Future Work

Semantic Analysis of Embedded Languages

Automated Semantic Disambiguation

April 3, 2004 LDTA 2004 84

Contributions

1. Generalized GLR to handle input stream ambiguities

2. Classified input stream ambiguities into four categories

3. Implemented XGLR algorithm in Harmonia framework

4. Constructed combined lexer and parser generator to support embedded languages and lexical ambiguities at each stage of analysis

5. Enabled analysis of embedded languages, programming by voice, and legacy languages

April 3, 2004 LDTA 2004 85