data manipulation using programming by examples and natural language invited talk @ upenn april 2015...
TRANSCRIPT
Data Manipulation using Programming by Examples and Natural Language
Invited Talk @ UPenn, April 2015
Sumit Gulwani
The New Opportunity
End Users (non-programmers with access to computers)
  2 orders of magnitude more end users
  Struggle with simple repetitive tasks
  Need domain-specific expert systems
Software developer
  Traditional customer for PL technology
Excel help forums
Typical help-forum interaction
Input: 300_w5_aniSh_c1_b    Output: w5    Formula: =MID(B1,5,2)
Input: 300_w30_aniSh_c1_b   Output: w30   Formula: =MID(B1,FIND("_",$B:$B)+1,FIND("_",REPLACE($B:$B,1,FIND("_",$B:$B),""))-1)
Flash Fill (Excel 2013 feature) demo
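To make the forum exchange concrete, here is a minimal Python sketch of the two formulas. The helpers `mid` and `find` are hypothetical stand-ins that mimic Excel's 1-based `MID` and `FIND`, and `between_first_two_underscores` is an illustrative name, not something from the talk. It shows why the fixed-offset formula only works when the token after the first underscore happens to be two characters long.

```python
def mid(text: str, start: int, num_chars: int) -> str:
    """Mimic Excel's MID: 1-based start, fixed number of characters."""
    return text[start - 1:start - 1 + num_chars]


def find(needle: str, text: str) -> int:
    """Mimic Excel's FIND: 1-based index of the first occurrence."""
    return text.index(needle) + 1


def between_first_two_underscores(text: str) -> str:
    """Mirror the generalized formula: the text between the 1st and 2nd '_'."""
    start = find("_", text) + 1      # first character after the first '_'
    rest = text[start - 1:]          # plays the role of REPLACE(..., "")
    length = find("_", rest) - 1     # characters before the next '_'
    return mid(text, start, length)


for cell in ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b"]:
    print(cell, "->", mid(cell, 5, 2), "vs", between_first_two_underscores(cell))
```

On the second input the fixed-offset version truncates the token, which is the kind of generalization gap that a by-example tool has to close.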
Data locked up in silos in various formats
Great flexibility in organizing (hierarchical) data for viewing but challenging to manipulate and reason about the data.
A typical workflow might involve one or more of the following steps:
  Extraction
  Transformation
  Querying
  Formatting
PBE and PBNL can enable delightful data wrangling.
Data Manipulation
To get started!
Data Science Class Assignment
FlashExtract
FlashExtract Demo
Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm -> Program]
Inductive Specification
Examples: Conjunction of (input state, output state).
Inductive Spec generalizes Examples in 2 ways.
Generalization 1: Conjunction of (input state, output property).
Motivation: Intent is easier to specify through output properties.
Output properties
[Figure: Task]
Subsequence of the output list
Elements not belonging to the output list
Contiguous subsequence of the output list
Prefix of the output list
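The four list properties above can be read as predicates over a candidate program's output. The following is a rough Python sketch under that reading; all function names, the sample document, and the two candidate programs are made up for illustration and are not taken from FlashExtract.

```python
from typing import Callable, List, Sequence, Tuple

Property = Callable[[Sequence[str]], bool]


def is_subsequence(examples: Sequence[str]) -> Property:
    def check(output: Sequence[str]) -> bool:
        it = iter(output)                      # consumed left to right
        return all(x in it for x in examples)
    return check


def excludes(negatives: Sequence[str]) -> Property:
    return lambda output: not any(x in output for x in negatives)


def is_contiguous_subsequence(examples: Sequence[str]) -> Property:
    n = len(examples)

    def check(output: Sequence[str]) -> bool:
        out = list(output)
        return any(out[i:i + n] == list(examples) for i in range(len(out) - n + 1))
    return check


def is_prefix(examples: Sequence[str]) -> Property:
    return lambda output: list(output[:len(examples)]) == list(examples)


def satisfies(program: Callable[[str], List[str]],
              spec: Sequence[Tuple[str, Property]]) -> bool:
    """Spec = conjunction of (input state, output property) pairs."""
    return all(prop(program(inp)) for inp, prop in spec)


# Hypothetical input and candidate programs.
doc = "ERR a\nOK b\nERR c\nERR d"
errors_only = lambda d: [ln for ln in d.split("\n") if ln.startswith("ERR")]
all_lines = lambda d: d.split("\n")

spec = [
    (doc, is_prefix(["ERR a"])),
    (doc, excludes(["OK b"])),
    (doc, is_subsequence(["ERR a", "ERR d"])),
    (doc, is_contiguous_subsequence(["ERR c", "ERR d"])),
]
print(satisfies(errors_only, spec), satisfies(all_lines, spec))
```

The user never writes out the full output list here; each property constrains it only partially, which is the point of this generalization.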
Output properties
[Figure: Task]
Prefix of the output table (sequence of records)
We do not require explicit (magenta) record boundaries, in which case the spec is: prefixes of projections of the output table.
Examples: Conjunction of (input state, output state).
Inductive Spec generalizes Examples in 2 ways.
Generalization 1: Conjunction of (input state, output property).
Motivation: Intent is easier to specify through output properties.
Generalization 2: Boolean combination of (input state, output property).
Motivation: Arises internally as part of specification refinement.
Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm (with DSL) -> Program]
Challenge 1: Designing efficient search algorithm.
Consider the tasks:
  [String s -> Substring] (arises in FlashFill)
  [Long String s -> List of Substrings] (arises in FlashExtract)
Regular expression suffices for both, but is not ideal:
  Difficult to synthesize
  Difficult to explain to the user
We propose abstractions that involve simpler regexes.
DSL for Substring Extraction
DSL for Task 1, i.e., [String s -> Substring] :=
  let p1 = [s -> index] in
  let p2 = [s -> index] in
  SubStr(s, p1, p2)
DSL for [String s -> index] := Constant
  | Pos(s, regex1, regex2, k)  // kth position in s whose left/right side matches regex1/regex2
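DSL for Substring Extraction (additional alternative)
  | let t = Suffix(s, p1) in [t -> index]
The SubStr Operator
Let w = SubStr(s, p, p'), where p = Pos(s, r1, r2, k) and p' = Pos(s, r1', r2', k').
[Diagram: string s with positions p and p' marking the ends of w; w1 and w2 surround p, and w1' and w2' surround p']
r1 matches w1 and r2 matches w2; likewise r1' matches w1' and r2' matches w2'.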
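As an informal illustration of the `Pos` and `SubStr` operators described above, here is a small Python sketch. It is not the FlashFill implementation: it assumes 0-based positions with Python slice semantics, Python `re` syntax for the regexes, and a positive 1-based `k` only.

```python
import re


def pos(s: str, regex1: str, regex2: str, k: int) -> int:
    """k-th (1-based) position t in s such that regex1 matches text ending
    at t and regex2 matches text starting at t.
    Assumptions: 0-based positions, Python `re` syntax, positive k only."""
    left = re.compile(f"(?:{regex1})\\Z")   # must match a suffix of s[:t]
    right = re.compile(regex2)              # must match a prefix of s[t:]
    matches = [
        t for t in range(len(s) + 1)
        if left.search(s[:t]) and right.match(s[t:])
    ]
    return matches[k - 1]                   # IndexError if fewer than k matches


def substr(s: str, p1: int, p2: int) -> str:
    """SubStr(s, p1, p2): the text between the two positions."""
    return s[p1:p2]


for s in ["300_w5_aniSh_c1_b", "300_w30_aniSh_c1_b"]:
    p1 = pos(s, "_", "", 1)   # just after the first underscore
    p2 = pos(s, "", "_", 2)   # just before the second underscore
    print(s, "->", substr(s, p1, p2))
```

Because both positions are defined by their surrounding context rather than by fixed offsets, the same program handles tokens of different lengths.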
DSL for Task 2, i.e., [String s -> List of substrings] :=
  let L = Filter(Split(s, \n), [Line -> Bool]) in
  Map(L, [String -> Substring])
DSL for [Line t -> Bool] := MatchRegex(t, regex)
  | MatchRegex(t.previous, regex)
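The Split / Filter / Map shape of the list-extraction DSL can be sketched in a few lines of Python. This is only an illustration: `MatchRegex` is assumed here to mean "the regex matches somewhere in the line", and the sample text and the `after_colon` extractor are hypothetical.

```python
import re
from typing import Callable, List

LinePred = Callable[[List[str], int], bool]


def match_regex(regex: str) -> LinePred:
    """MatchRegex(t, regex): the line itself matches (assumed: re.search)."""
    return lambda lines, i: re.search(regex, lines[i]) is not None


def match_regex_previous(regex: str) -> LinePred:
    """MatchRegex(t.previous, regex): the preceding line matches."""
    return lambda lines, i: i > 0 and re.search(regex, lines[i - 1]) is not None


def extract_list(s: str, keep: LinePred, extract: Callable[[str], str]) -> List[str]:
    lines = s.split("\n")                                               # Split(s, \n)
    selected = [lines[i] for i in range(len(lines)) if keep(lines, i)]  # Filter
    return [extract(line) for line in selected]                         # Map


def after_colon(line: str) -> str:
    """Hypothetical [String -> Substring] program: text after ': '."""
    return line.split(": ", 1)[1]


text = "Name: Alice\nPhone: 555-1234\nName: Bob\nPhone: 555-9876"
print(extract_list(text, match_regex(r"^Name:"), after_colon))
print(extract_list(text, match_regex_previous(r"^Name:"), after_colon))
```

The line predicate picks which lines participate, and the per-line substring program is then the same kind of program as in Task 1.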
Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm (with DSL and Deductive Reasoning Rules for specification refinement) -> Program]
Challenge 1: Designing efficient search algorithm.
DSL for [String s -> List of substrings] :=
  let L = Filter(Split(s, \n), [Line -> Bool]) in
  Map(L, [String -> Substring])
Deductive Reasoning for Specification Refinement
[Diagram: Spec for [String -> List of substrings], Spec for [Line -> Bool], Spec for [String -> Substring]]
DSL for [String s -> Substring] :=
  let p1 = [s -> index] in
  let p2 = [s -> index] in
  SubStr(s, p1, p2)
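One way to picture refinement for `SubStr(s, p1, p2)` is the following Python sketch: an example (input string, desired substring) is turned into the set of position pairs that would produce it, which in turn gives a disjunctive spec for `p1` and for `p2`. The function names are hypothetical, positions are assumed 0-based, and while the date string is borrowed from the talk's example, the choice of "12" as the target substring is an assumption.

```python
from typing import List, Set, Tuple


def refine_substr_spec(s: str, w: str) -> List[Tuple[int, int]]:
    """All (start, end) pairs with s[start:end] == w.
    Each pair is one way SubStr(s, p1, p2) could produce w."""
    n = len(w)
    return [(i, i + n) for i in range(len(s) - n + 1) if s[i:i + n] == w]


def position_specs(s: str, w: str) -> Tuple[Set[int], Set[int]]:
    """Disjunctive specs for p1 and p2: any listed start / end is acceptable,
    as long as the two choices come from the same pair."""
    pairs = refine_substr_spec(s, w)
    return {p1 for p1, _ in pairs}, {p2 for _, p2 in pairs}


pairs = refine_substr_spec("01/12/2012", "12")
starts, ends = position_specs("01/12/2012", "12")
print(pairs, starts, ends)
```

The search for `p1` and `p2` can then proceed on these smaller specs instead of on the original substring example.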
Deductive Reasoning for Specification Refinement
[Diagram: the example string 01/12/2012 shown three times, labeled Spec for p1, Spec for p2, and Spec for [String -> Substring]]
Disjunctions & conjunctions are handled using union & intersection over program sets (Version Space Algebras).
Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm (with DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function) -> Program]
Challenge 1: Designing efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.
Synthesize multiple programs & rank them using machine learning.
General principles for ranking:
  Prefer shorter programs.
  Prefer programs with fewer constants.
Ranking strategies:
  Baseline: Pick any minimal-sized program using a minimal number of constants.
  Machine Learning: Score programs using a weighted combination of program features. Weights are learned using training data.
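The two ranking strategies can be contrasted with a toy Python sketch. Everything concrete here is invented for illustration: the `Candidate` type, the two features, the feature values, and especially the weights, which in the talk's setting would be learned from training data rather than written by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    size: int            # hypothetical feature: program size
    num_constants: int   # hypothetical feature: number of constants


def features(c: Candidate) -> Dict[str, float]:
    return {"size": float(c.size), "num_constants": float(c.num_constants)}


def baseline_pick(cands: List[Candidate]) -> Candidate:
    """Baseline: minimal size, ties broken by fewest constants."""
    return min(cands, key=lambda c: (c.size, c.num_constants))


def score(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted combination of program features."""
    return sum(weights[name] * value for name, value in features(c).items())


def rank(cands: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    return sorted(cands, key=lambda c: score(c, weights), reverse=True)


# Made-up candidates and placeholder weights (NOT learned values).
cands = [
    Candidate("A: constant offsets", size=3, num_constants=2),
    Candidate("B: regex-based positions", size=5, num_constants=0),
]
weights = {"size": -1.0, "num_constants": -2.5}

print(baseline_pick(cands).name)
print([c.name for c in rank(cands, weights)])
```

With these placeholder numbers the baseline prefers the shorter program while the weighted score prefers the one with fewer constants, which is the kind of disagreement that learned weights are meant to settle.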
Ranking
Experimental Comparison of Ranking Strategies
  Strategy    Average # of examples required
  Baseline    4.17
  Learning    1.48
Technical Report: Predicting a correct program in Programming by Example; Rishabh Singh, Sumit Gulwani
[Chart: Baseline vs. Learning]
FlashFill Ranking Demo
FlashMeta Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm (with DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function) -> Program]
Challenge 1: Designing efficient search algorithm.
Challenge 2: Ambiguous/under-specified intent may result in unintended programs.
Need for a better User Interaction Model!
It's a great concept, but it can also lead to lots of bad data. I think many users will look at a few "flash filled" cells, and just assume that it worked. Be very careful.
most of the extracted data will be fine. But there might be exceptions that you don't notice unless you examine the results very carefully.
Make it easy to inspect output correctness
  User can accordingly provide more examples
Show programs
  in any desired programming language; in English
  Enable effective navigation between programs
Computer-initiated interactivity (Active learning)
  Highlight less confident entries in the output.
  Ask directed questions based on distinguishing inputs.
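A distinguishing input, in the sense used above, is an input on which the surviving candidate programs disagree. The following Python sketch shows one simple way to find such inputs; the function names and the two candidate programs are hypothetical, and a real system would choose among candidates more carefully than this linear scan.

```python
from typing import Callable, List, Optional, Sequence

Program = Callable[[str], str]


def distinguishing_input(programs: Sequence[Program],
                         inputs: Sequence[str]) -> Optional[str]:
    """First input on which the candidate programs produce different outputs."""
    for x in inputs:
        if len({p(x) for p in programs}) > 1:
            return x
    return None


def low_confidence_inputs(programs: Sequence[Program],
                          inputs: Sequence[str]) -> List[str]:
    """All inputs where candidates disagree; these could be highlighted."""
    return [x for x in inputs if len({p(x) for p in programs}) > 1]


# Two hypothetical candidates consistent with the example 300_w5_aniSh_c1_b -> w5.
fixed_offsets = lambda s: s[4:6]
between_underscores = lambda s: s.split("_")[1]

rows = ["300_w5_aniSh_c1_b", "300_w7_x", "300_w30_aniSh_c1_b"]
print(distinguishing_input([fixed_offsets, between_underscores], rows))
```

Asking the user about that single row is enough to rule out one of the two candidates.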
User Interaction Models for Ambiguity Resolution
FlashExtract Demo (User Interaction Models)
PBE/PBNL tools for Data Manipulation
Extraction
  FlashExtract: Extract data from text files, web pages [PLDI 2014; PowerShell ConvertFrom-String API]
  FlashRelate: Extract data from spreadsheets [PLDI 2015]
Transformation
  Flash Fill: Excel feature for Syntactic String Transformations [POPL 2011]
  Semantic String Transformations [VLDB 2012]
  Number Transformations [CAV 2013]
Querying
  NLyze: an Excel programming-by-natural-language add-in [SIGMOD 2014]
Formatting
  Table re-formatting [PLDI 2011]
  FlashFormat: a PowerPoint add-in [AAAI 2014]
FlashMeta Architecture [diagram: Intent (Inductive Spec) -> Search Algorithm (with DSL, Deductive Reasoning Rules for specification refinement, and Ranking Function) -> Programs]
The Inductive Synthesis Problem Definition: Intent x DSL x Ranking function -> Top-k Programs
Solution Strategy: Spec Refinement based on deductive rules
Tech Report: FlashMeta: A Framework for Inductive Program Synthesis; Alex Polozov, Sumit Gulwani
Comparison of FlashMeta with hand-tuned implementations
  Project            Lines of Code (K)        Development time (months)
                     Original   FlashMeta     Original   FlashMeta
  FlashFill          12         3             9          1
  FlashExtractText   7          4             8          1
  FlashRelate        5          2             8          1
  FlashNormalize     17         2             7          2
  FlashExtractWeb    N/A        2.5           N/A        1.5
Running time of FlashMeta implementations varies between 0.5-3x of the corresponding original implementation.
  Faster because of some free optimizations
  Slower because of larger feature sets & a generalized framework
FlashRelate + NLyze Demo
Other Directions
  Other application domains.
  Integration with existing programming environments.
  Multi-modal intent specification using combination of Examples and NL.
SmartSynth: SmartPhone Script Synthesis using NL
MobiSys 2013: SmartSynth: Synthesizing Smartphone Automation Scripts from Natural Languages; Vu Le, Sumit Gulwani, Zhendong Su
Collaborators
Vu Le
Dan Barowy
Ted Hart
Maxim Grechkin
Alex Polozov
Dileep Kini
Rishabh Singh
Mikael Mayer
Mark Marron
Gustavo Soares
Ben Zorn
Data Manipulation using PBE/PBNL
Data manipulation is challenging!
  Data scientists spend 80% of their time cleaning data.
  99% of end users are non-programmers.
PBE/PBNL can enable delightful data wrangling!
Cross-disciplinary inspiration
  Theory/Logical Reasoning (Search algo)
  Language Design (DSL)
  Machine Learning (Ranking)
  HCI (User interaction models)