Data Manipulation using Programming by Examples and Natural Language
Invited Talk @ UPenn, April 2015
Sumit Gulwani



The New Opportunity

Software developers
    Traditional customer for PL technology

End users (non-programmers with access to computers)
    2 orders of magnitude more end users
    Struggle with simple repetitive tasks
    Need domain-specific expert systems

Excel help forums

Typical help-forum interaction

300_w5_aniSh_c1_b -> w5
=MID(B1,5,2)

300_w30_aniSh_c1_b -> w30
=MID(B1,FIND("_",$B:$B)+1,FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)
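The gap between the two forum answers can be seen in a short Python sketch. It is not from the talk: the function names are illustrative, and the code only mirrors what the two Excel formulas compute. The fixed-offset formula fits the first row only, while the FIND-based formula keys on the underscores and so handles both rows.

```python
def fixed_offset(cell: str) -> str:
    """Mirror of =MID(B1,5,2): two characters starting at 1-based position 5."""
    return cell[4:6]


def between_first_two_underscores(cell: str) -> str:
    """Mirror of the FIND/REPLACE formula: text between the first two underscores."""
    first = cell.find("_")
    second = cell.find("_", first + 1)
    if first == -1 or second == -1:
        raise ValueError(f"expected at least two underscores in {cell!r}")
    return cell[first + 1:second]


rows = ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b"]
print([fixed_offset(r) for r in rows])                   # ['w5', 'w3']
print([between_first_two_underscores(r) for r in rows])  # ['w5', 'w30']
```

This is the kind of generalization a by-example tool has to make on the user's behalf once it sees the second row.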

Flash Fill (Excel 2013 feature) demo

Data Manipulation

Data locked up in silos in various formats

Great flexibility in organizing (hierarchical) data for viewing, but challenging to manipulate and reason about the data.

A typical workflow might involve one or more of the following steps:
    Extraction
    Transformation
    Querying
    Formatting

PBE and PBNL can enable delightful data wrangling.

To get Started!
Data Science Class Assignment

FlashExtract

FlashExtract Demo

Architecture
[Architecture diagram: Intent (Inductive Spec), Search Algorithm, Program]

Inductive Specification

Examples: Conjunction of (input state, output state)

Inductive Spec generalizes Examples in 2 ways.

Generalization 1: Conjunction of (input state, output property)
Motivation: Output properties make it easier to specify intent.

Output properties

[Figure: Task]

Subsequence of the output list
Elements not belonging to the output list
Contiguous subsequence of the output list
Prefix of the output list
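The four output properties listed here can be read as predicates over a candidate output list. The following Python sketch is an illustration only; the function names and the `(property, argument)` encoding of a spec are assumptions, not the talk's implementation.

```python
from typing import Callable, List, Sequence, Tuple


def is_subsequence(part: Sequence[str], whole: Sequence[str]) -> bool:
    """`part` occurs in `whole` in order, possibly with gaps."""
    remaining = iter(whole)
    return all(item in remaining for item in part)


def is_contiguous_subsequence(part: Sequence[str], whole: Sequence[str]) -> bool:
    """`part` occurs in `whole` as one unbroken run."""
    n = len(part)
    return any(list(whole[i:i + n]) == list(part)
               for i in range(len(whole) - n + 1))


def is_prefix(part: Sequence[str], whole: Sequence[str]) -> bool:
    """`whole` starts with `part`."""
    return list(whole[:len(part)]) == list(part)


def excludes(negatives: Sequence[str], whole: Sequence[str]) -> bool:
    """None of the negative elements belongs to `whole`."""
    return not any(item in whole for item in negatives)


Property = Callable[[Sequence[str], Sequence[str]], bool]


def satisfies(output: Sequence[str], spec: List[Tuple[Property, Sequence[str]]]) -> bool:
    """A spec here is a conjunction of output properties (hypothetical encoding)."""
    return all(prop(arg, output) for prop, arg in spec)


spec = [(is_prefix, ["Ana", "Bob"]), (excludes, ["Total"])]
print(satisfies(["Ana", "Bob", "Cy", "Dee"], spec))  # True
print(satisfies(["Total", "Ana", "Bob"], spec))      # False
```

The user only states a fragment of the output, and any program whose full output is consistent with that fragment remains a candidate.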

Output properties

[Figure: Task]

Prefix of the output table (seq of records)

We do not require explicit (magenta) record boundaries, in which case the spec is:
Prefixes of projections of the output table

Inductive Specification

Generalization 2: Boolean combination of (input state, output property)
Motivation: Arises internally as part of specification refinement

Architecture
[Architecture diagram: Intent (Inductive Spec), Search Algorithm, DSL, Program]
Challenge 1: Designing an efficient search algorithm.

Consider the tasks:
    1. [String s -> Substring] (arises in FlashFill)
    2. [Long String s -> List of Substrings] (arises in FlashExtract)

A regular expression suffices for both, but is not ideal:
    Difficult to synthesize
    Difficult to explain to the user

We propose abstractions that involve simpler regexes.

DSL for Substring Extraction

DSL for Task 1, i.e., [String s -> Substring] :=
    let p1 = [s -> index] in
    let p2 = [s -> index] in
    SubStr(s, p1, p2)

DSL for [String s -> index] :=
    Constant
  | Pos(s, regex1, regex2, k)   // kth position in s whose left/right side matches regex1/regex2

  | let t = Suffix(s, p1) in [t -> index]

The SubStr Operator

Let w = SubStr(s, p, p') where p = Pos(s, r1, r2, k) and p' = Pos(s, r1', r2', k')

[Figure: s with positions p and p', substring w, and the neighboring segments w1, w2, w1', w2']

r1 matches w1, r2 matches w2
r1' matches w1', r2' matches w2'
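A minimal Python sketch of the Pos and SubStr operators, under stated assumptions: indices are 0-based, k is 1-based and positive, and regexes use Python's `re` syntax. The lowercase names `pos` and `substr` are illustrative and not the talk's implementation.

```python
import re


def pos(s: str, r1: str, r2: str, k: int) -> int:
    """k-th (1-based) index i such that r1 matches a suffix of s[:i]
    and r2 matches a prefix of s[i:]."""
    left = re.compile(f"(?:{r1})\\Z")   # r1 must end exactly at position i
    right = re.compile(r2)              # r2 must start exactly at position i
    hits = [i for i in range(len(s) + 1)
            if left.search(s[:i]) and right.match(s[i:])]
    if not 1 <= k <= len(hits):
        raise ValueError(f"only {len(hits)} matching positions, asked for k={k}")
    return hits[k - 1]


def substr(s: str, p1: int, p2: int) -> str:
    """Substring of s between positions p1 and p2."""
    return s[p1:p2]


def extract(s: str) -> str:
    p1 = pos(s, "_", "[a-z]", 1)   # just after the first "_" that precedes a letter
    p2 = pos(s, "[0-9]", "_", 2)   # second place where a digit is followed by "_"
    return substr(s, p1, p2)


print(extract("300_w5_aniSh_c1_b"))   # w5
print(extract("300_w30_aniSh_c1_b"))  # w30
```

Each position is described by two small regexes around it rather than by one large regex for the whole string, which is what keeps the individual pieces easy to search for and to explain.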

DSL for Task 2, i.e., [String s -> List of substrings] :=
    let L = Filter(Split(s, \n), [Line -> Bool]) in
    Map(L, [String -> Substring])

DSL for [Line t -> Bool] :=
    MatchRegex(t, regex)
  | MatchRegex(t.previous, regex)
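The Filter/Split/Map shape of the Task 2 DSL can be sketched in Python as follows. This is an illustration under assumptions: the helper names are hypothetical, line predicates receive the full list of lines plus an index so that `t.previous` can be looked up, and the [String -> Substring] program is passed in as an ordinary function.

```python
import re
from typing import Callable, List

LinePredicate = Callable[[List[str], int], bool]


def match_regex(regex: str) -> LinePredicate:
    """MatchRegex(t, regex): the line itself matches."""
    return lambda lines, i: re.search(regex, lines[i]) is not None


def match_regex_previous(regex: str) -> LinePredicate:
    """MatchRegex(t.previous, regex): the preceding line matches."""
    return lambda lines, i: i > 0 and re.search(regex, lines[i - 1]) is not None


def extract_list(s: str, keep_line: LinePredicate,
                 line_to_substring: Callable[[str], str]) -> List[str]:
    """let L = Filter(Split(s, \\n), keep_line) in Map(L, line_to_substring)"""
    lines = s.split("\n")
    kept = [lines[i] for i in range(len(lines)) if keep_line(lines, i)]
    return [line_to_substring(line) for line in kept]


text = "Sample: A1\nmass 12.5 mg\nSample: B7\nmass 3.0 mg"


def first_number(line: str) -> str:
    # Stand-in for a learned [String -> Substring] program.
    return re.search(r"[0-9.]+", line).group()


print(extract_list(text, match_regex_previous(r"^Sample:"), first_number))
# ['12.5', '3.0']
```

Splitting the task into "which lines" and "which part of each line" means each sub-program is small and can be learned separately.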

Architecture
[Architecture diagram: Intent (Inductive Spec), Search Algorithm, DSL, Deductive Reasoning Rules for specification refinement, Program]
Challenge 1: Designing an efficient search algorithm.

Deductive Reasoning for Specification Refinement

DSL for [String s -> List of substrings] :
    let L = Filter(Split(s, \n), [Line -> Bool]) in
    Map(L, [String -> Substring])

Spec for [String -> List of substrings]
Spec for [Line -> Bool]
Spec for [String -> Substring]
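One way to picture this refinement step is a function that takes examples for the whole list-extraction task and derives separate specs for the line filter and for the per-line substring program. The Python sketch below is a simplified illustration, not the FlashExtract rules: it assumes each example is a (line index, substring) pair and that the examples form a prefix of the desired output, so lines up to the last example that carry no example are treated as negatives.

```python
from typing import List, Tuple


def refine_list_spec(document: str, examples: List[Tuple[int, str]]):
    """Split a spec for [String -> List of substrings] into
    a spec for [Line -> Bool] and a spec for [String -> Substring]."""
    if not examples:
        return [], []
    lines = document.split("\n")
    for i, sub in examples:
        if sub not in lines[i]:
            raise ValueError(f"{sub!r} does not occur on line {i}")
    example_lines = {i for i, _ in examples}
    last = max(example_lines)
    # Lines up to the last example are selected iff they carry an example;
    # later lines stay unconstrained.
    line_spec = [(i, i in example_lines) for i in range(last + 1)]
    # Each example pins down the substring expected from its own line.
    substring_spec = [(lines[i], sub) for i, sub in examples]
    return line_spec, substring_spec


text = "Sample: A1\nmass 12.5 mg\nSample: B7\nmass 3.0 mg"
line_spec, substring_spec = refine_list_spec(text, [(1, "12.5")])
print(line_spec)       # [(0, False), (1, True)]
print(substring_spec)  # [('mass 12.5 mg', '12.5')]
```

Each derived spec can then be handed to the learner for the corresponding sub-DSL.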

Deductive Reasoning for Specification Refinement

DSL for [String s -> Substring] :=
    let p1 = [s -> index] in
    let p2 = [s -> index] in
    SubStr(s, p1, p2)

[Figure: the example string 01/12/2012, shown once for each spec]

Spec for [String -> Substring]
Spec for p1
Spec for p2

Disjunctions & Conjunctions are handled using union & intersection over program sets (Version Space Algebras)
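The union/intersection idea can be sketched with plain Python sets of position programs. This is an assumption-laden illustration: a real version space algebra represents program sets compactly instead of enumerating them, and the token list and function names below are hypothetical.

```python
import re
from itertools import product

TOKENS = ["", "_", "[0-9]+", "[a-z]+"]  # hypothetical regex tokens of a tiny DSL


def positions(s: str, r1: str, r2: str):
    """All indices where r1 matches on the left and r2 matches on the right."""
    left, right = re.compile(f"(?:{r1})\\Z"), re.compile(r2)
    return [i for i in range(len(s) + 1)
            if left.search(s[:i]) and right.match(s[i:])]


def learn_position_programs(s: str, index: int) -> set:
    """Every program in the tiny DSL that returns `index` on input `s`."""
    programs = {("Constant", index)}
    for r1, r2 in product(TOKENS, repeat=2):
        hits = positions(s, r1, r2)
        if index in hits:
            programs.add(("Pos", r1, r2, hits.index(index) + 1))
    return programs


# Conjunction of two examples: intersect the program sets.
ends_1 = learn_position_programs("300_w5_aniSh_c1_b", 6)
ends_2 = learn_position_programs("300_w30_aniSh_c1_b", 7)
common = ends_1 & ends_2
print(("Constant", 6) in common)            # False
print(("Pos", "[0-9]+", "_", 2) in common)  # True

# Disjunction (the output "1" could start at index 1 or 3 of "a1b1"): take the union.
starts = learn_position_programs("a1b1", 1) | learn_position_programs("a1b1", 3)
print(("Constant", 1) in starts and ("Constant", 3) in starts)  # True
```

Intersection discards programs that fit only one example, such as the constant index, while union keeps every program that explains any admissible reading of an ambiguous example.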

Architecture
[Architecture diagram: Intent (Inductive Spec), Search Algorithm, DSL, Deductive Reasoning Rules for specification refinement, Ranking Function, Program]
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

Synthesize multiple programs & rank them using machine learning.

General Principles for ranking
    Prefer shorter programs.
    Prefer programs with fewer constants.

Ranking Strategies
    Baseline: Pick any minimal-sized program using a minimal number of constants.
    Machine Learning: Score programs using a weighted combination of program features. Weights are learned using training data.
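The two ranking strategies can be contrasted in a small Python sketch. Everything here is assumed for illustration: the `Candidate` record, the two features, and the weight values, which in the learned setting would come from training data instead of being written by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    description: str
    size: int           # number of DSL operators in the program
    num_constants: int  # number of constant arguments


def baseline_pick(candidates: List[Candidate]) -> Candidate:
    """Baseline: a minimal-sized program, ties broken by fewest constants."""
    return min(candidates, key=lambda c: (c.size, c.num_constants))


# Hypothetical weights; negative values penalize size and constants.
WEIGHTS: Dict[str, float] = {"size": -1.0, "num_constants": -2.5}


def score(c: Candidate, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted combination of program features."""
    return weights["size"] * c.size + weights["num_constants"] * c.num_constants


def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("SubStr(s, Constant, Constant)", size=3, num_constants=2),
    Candidate("SubStr(s, Pos, Pos)", size=3, num_constants=0),
    Candidate("SubStr(s, Pos, let t = Suffix in Pos)", size=5, num_constants=0),
]
print(baseline_pick(candidates).description)    # SubStr(s, Pos, Pos)
print([c.description for c in rank(candidates)][0])  # SubStr(s, Pos, Pos)
```

With only two features the strategies agree; a learned scorer earns its keep once more features are added and their relative importance is no longer obvious.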

Ranking

Experimental Comparison of Ranking Strategies

Strategy    Average # of examples required
Baseline    4.17
Learning    1.48

Technical Report: Predicting a correct program in Programming by Example. Rishabh Singh, Sumit Gulwani

FlashFill Ranking Demo

FlashMeta Architecture
[Architecture diagram: Intent (Inductive Spec), Search Algorithm, DSL, Deductive Reasoning Rules for specification refinement, Ranking Function, Program]
Challenge 1: Designing an efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.

Need for a better User Interaction Model!

It's a great concept, but it can also lead to lots of bad data. I think many users will look at a few "flash filled" cells, and just assume that it worked. Be very careful.

most of the extracted data will be fine. But there might be exceptions that you don't notice unless you examine the results very carefully.

Make it easy to inspect output correctness
    User can accordingly provide more examples

Show programs
    in any desired programming language; in English
    Enable effective navigation between programs

Computer initiated interactivity (Active learning)
    Highlight less confident entries in the output.
    Ask directed questions based on distinguishing inputs.
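A distinguishing input is one on which two candidate programs that agree with all examples so far produce different outputs. The Python sketch below shows how such inputs could be found so the tool can ask about them; the function name and the two stand-in programs are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def distinguishing_inputs(
    programs: Dict[str, Callable[[str], str]],
    inputs: List[str],
) -> List[Tuple[str, Dict[str, str]]]:
    """Inputs on which the candidate programs disagree, with each program's output."""
    flagged = []
    for x in inputs:
        outputs = {name: prog(x) for name, prog in programs.items()}
        if len(set(outputs.values())) > 1:
            flagged.append((x, outputs))
    return flagged


# Two programs consistent with the single example 300_w5_aniSh_c1_b -> w5.
programs = {
    "fixed offsets": lambda s: s[4:6],
    "between underscores": lambda s: s.split("_")[1],
}
rows = ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b", "120_w7_kiM_c2_a"]

for row, outputs in distinguishing_inputs(programs, rows):
    print(f"For {row!r}, should the output be one of {sorted(outputs.values())}?")
# For '300_w30_aniSh_c1_b', should the output be one of ['w3', 'w30']?
```

Rows where all candidates agree need no attention, so the user's effort goes only to the entries that can actually change which program is chosen.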

User Interaction Models for Ambiguity Resolution

FlashExtract Demo (User Interaction Models)

PBE/PBNL tools for Data Manipulation

Extraction
    FlashExtract: Extract data from text files, web pages [PLDI 2014; PowerShell convert-from-string API]
    FlashRelate: Extract data from spreadsheets [PLDI 2015]

Transformation
    Flash Fill: Excel feature for Syntactic String Transformations [POPL 2011]
    Semantic String Transformations [VLDB 2012]
    Number Transformations [CAV 2013]

Querying
    NLyze: an Excel programming-by-natural-language add-in [SIGMOD 2014]

Formatting
    Table re-formatting [PLDI 2011]
    FlashFormat: a PowerPoint add-in [AAAI 2014]

FlashMeta Architecture

[Architecture diagram: Intent (Inductive Spec), Search Algorithm, DSL, Deductive Reasoning Rules for specification refinement, Ranking Function, Programs]

The Inductive Synthesis Problem Definition:
Intent x DSL x Ranking function -> Top k-Programs

Solution Strategy: Spec Refinement based on deductive rules

Tech Report: FlashMeta: A Framework for Inductive Program Synthesis. Alex Polozov, Sumit Gulwani

Comparison of FlashMeta with hand-tuned implementations

Project             Lines of Code (K)        Development time (months)
                    Original   FlashMeta     Original   FlashMeta
FlashFill           12         3             9          1
FlashExtractText    7          4             8          1
FlashRelate         5          2             8          1
FlashNormalize      17         2             7          2
FlashExtractWeb     N/A        2.5           N/A        1.5

Running time of FlashMeta implementations varies between 0.5-3x of the corresponding original implementation.
    Faster because of some free optimizations
    Slower because of larger feature sets & a generalized framework

FlashRelate + NLyze Demo

Other Directions

Other application domains.

Integration with existing programming environments.

Multi-modal intent specification using combination of Examples and NL.

SmartSynth: SmartPhone Script Synthesis using NL

MobiSys 2013: SmartSynth: Synthesizing Smartphone Automation Scripts from Natural Language; Vu Le, Sumit Gulwani, Zhendong Su

Collaborators

Vu Le
Dan Barowy
Ted Hart
Maxim Grechkin
Alex Polozov
Dileep Kini
Rishabh Singh
Mikael Mayer
Mark Marron
Gustavo Soares
Ben Zorn

Data Manipulation using PBE/PBNL

Data manipulation is challenging!
    Data scientists spend 80% of their time cleaning data.
    99% of end users are non-programmers.

PBE/PBNL can enable delightful data wrangling!

Cross-disciplinary inspiration
    Theory/Logical Reasoning (Search algo)
    Language Design (DSL)
    Machine Learning (Ranking)
    HCI (User interaction models)