september 6, 2017 · evaluation method i benchmark spreadsheets collected i exercise session i...

51
TaCLe Learning constraints in spreadsheets and tabular data Samuel Kolb, Sergey Paramonov, Tias Guns, Luc De Raedt September 6, 2017 1 / 29

Upload: others

Post on 19-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

TaCLeLearning constraints in spreadsheets and tabular data

Samuel Kolb, Sergey Paramonov, Tias Guns, Luc De Raedt

September 6, 2017

1 / 29

Page 2: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 3: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 4: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 5: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 6: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 7: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 8: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 9: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 10: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 11: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 12: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 13: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 14: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 15: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 16: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Inspiration: Flash-fill

2 / 29

Page 17: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

What is Flash-fill [Gulwani et al.]?

I learns string transformation

I classic ML supervised setting

I one positive (or very few) example

I integrated into Excel

Key Questions:I what if we make it unsupervised?I what if we learn constraints rather than functions?

3 / 29

Page 18: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Can we recover formulas from a CSV file?

I What are the formulas here?

I T1[:, 6] = SUM(T1[:, 3:5], row)

I T2[:, 2] = SUMIF(T1[:, 1]=T2[:, 1], T1[:, 6])

4 / 29

Page 19: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Can we recover formulas from a CSV file?

I What are the formulas here?I T1[:, 6] = SUM(T1[:, 3:5], row)

I T2[:, 2] = SUMIF(T1[:, 1]=T2[:, 1], T1[:, 6])

4 / 29

Page 20: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Can we recover formulas from a CSV file?

I What are the formulas here?I T1[:, 6] = SUM(T1[:, 3:5], row)

I T2[:, 2] = SUMIF(T1[:, 1]=T2[:, 1], T1[:, 6])

4 / 29

Page 21: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

What is new here?

I Atypical data settings: no variables in columns and transactionsin rows – everything mixed and position matters

I Multi-relational learning (across multiple tables, e.g., lookup)I Constraints – specific for spreadsheets (e.g., fuzzy lookup,

sumproduct)I Important details:

I None valuesI Semi-structured dataI Numeric and textual dataI Numerical precision

5 / 29

Page 22: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Tables (T)

I n× m matrixI HeaderlessI (Optional: orientation)

6 / 29

Page 23: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Blocks (B)

I Contiguous group of entire rows or entire columns (vectors)I Type-consistentI Fixed orientation

7 / 29

Page 24: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Block containment (B′ v B)

I B′ subblock of BI B superblock of B′

I Supports different granularities for reasoning

8 / 29

Page 25: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Block properties

type a block is type-consistent, so it has one type

table the table that the block belongs to

orientation either row-oriented or column-oriented

size the number of vectors a block contains

length the length of its vectors; as all vectors are from thesame table, they always have the same length;

rows the number of rows in the block; in row-oriented blocksthis is equivalent to the size;

columns the number of columns in the block; in row-orientedblocks this is equivalent to the length.

9 / 29

Page 26: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Constraint templates

I SyntaxI Syntactic form, e.g. ALLDIFFERENT(Bx)I In logic: relation / predicate

I SignatureI Requirements on arguments, e.g. discrete(Bx)I In logic: bias (Sigs)

I DefinitionI Actual definition, e.g. i 6= j: Bx[i] 6= Bx[j]I In logic: ackground knowledge (Defs)

10 / 29

Page 27: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Row-wise sum

I Syntax: Br = SUMrow(Bx)I Signature: Br and Bx are numeric; columns(Bx) ≥ 2; and

rows(Bx) = length(Br)I Definition: Br[i] =

∑columns(Bx)j=1 row(i, Bx)[j]

Lookup

I Syntax: Br = LOOKUP(Bfk, Bpk, Bval)I Signature: Bfk and Bpk are discrete; arguments {Bfk, Br} and{Bpk, Bval} within the same set have the same length, table andorientation; Br and Bval have the same type; andFOREIGNKEY(Bfk, Bpk).

I Definition: Br[i] = Bval[j] where Bpk[j] = Bfk[i]

11 / 29

Page 28: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Formalization

Row-wise sum

I Syntax: Br = SUMrow(Bx)I Signature: Br and Bx are numeric; columns(Bx) ≥ 2; and

rows(Bx) = length(Br)I Definition: Br[i] =

∑columns(Bx)j=1 row(i, Bx)[j]

Lookup

I Syntax: Br = LOOKUP(Bfk, Bpk, Bval)I Signature: Bfk and Bpk are discrete; arguments {Bfk, Br} and{Bpk, Bval} within the same set have the same length, table andorientation; Br and Bval have the same type; andFOREIGNKEY(Bfk, Bpk).

I Definition: Br[i] = Bval[j] where Bpk[j] = Bfk[i]

11 / 29

Page 29: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

High level approach

1. Detect tables (Visual selection)

12 / 29

Page 30: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

High level approach

1. Detect tables (Visual selection)

13 / 29

Page 31: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

High level approach

1. Detect tables (Visual selection)

14 / 29

Page 32: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

High level approach

2. Detect blocks (Automatically, transparant to the user)

15 / 29

Page 33: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

High level approach

3. Learn constraints

SERIES(T1[:, 1])

T1[:, 1] = RANK(T1[:, 5])∗

T1[:, 1] = RANK(T1[:, 6])∗

T1[:, 1] = RANK(T1[:, 10])∗

T1[:, 8] = RANK(T1[:, 7])

T1[:, 8] = RANK(T1[:, 3])∗

T1[:, 8] = RANK(T1[:, 4])∗

T1[:, 7] = SUMrow(T1[:, 3:6])T1[:, 10] = SUMIF(T3[:, 1], T1[:, 2], T3[:, 2])T1[:, 11] = MAXIF(T3[:, 1], T1[:, 2], T3[:, 2])T2[1, :] = SUMcol(T1[:, 3:7])T2[2, :] = AVERAGEcol(T1[:, 3:7])T2[3, :] = MAXcol(T1[:, 3:7]),

T2[4, :] = MINcol(T1[:, 3:7])T4[:, 2] = SUMcol(T1[:, 3:6])T4[:, 4] = PREV(T4[:, 4]) + T4[:, 2] − T4[:, 3]

T5[:, 2] = LOOKUP(T5[:, 3], T1[:, 2], T1[:, 1])∗

T5[:, 3] = LOOKUP(T5[:, 2], T1[:, 1], T1[:, 2])

16 / 29

Page 34: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Approach

Intuition

I Example: Y = SUMcol(X)

I Step 1: Find superblock assignments (i.e. suitable blocks)I Assignments compatible signature, relax where necessaryI e.g. numeric(X), numeric(Y), columns(X) ≥ 2I e.g. rows(X) = length(Y)

I Step 2: Find all constraints over subblocks of the superblockassignments

I Find subassignments that satisfy the signature and the definitionI e.g. find subblocks for X and Y such that Y[i] =

∑column i

17 / 29

Page 35: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Approach

Refinements (similar as in constraint learning - clausaldiscovery)

I Dependencies between constraintsI Redundancies in the output constraintsI Limited precision

18 / 29

Page 36: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Refinements

Dependencies

Reuse learned constraints to learn dependent constraints

19 / 29

Page 37: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Refinements

Redundancies

I Hide constraints that are equivalentI Equivalent constraints can be computed from one anotherI Example: B1 = B2 × B3 and B1 = B3 × B2

I → Use canonical form

Limited precision

I Use the result value to deduce required precisionI Compute formula on input values and round to precision

20 / 29

Page 38: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Evaluation

Evaluation questions

I Recall (Q1)I Precision (Q2)I Speed (Q3)

21 / 29

Page 39: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Evaluation

MethodI Benchmark spreadsheets collected

I Exercise sessionI Online tutorialsI Data spreadsheets (e.g. crime statistics, fincancial data)

I Convert all spreadsheets to CSV filesI Manually specify ground-truth: intended constraints

I Based on original formulas, context, headers (intuition)I Structural constraints are ignored

22 / 29

Page 40: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Evaluation

MethodI Benchmark spreadsheets collected

I Exercise sessionI Online tutorialsI Data spreadsheets (e.g. crime statistics, fincancial data)

I Convert all spreadsheets to CSV files

I Manually specify ground-truth: intended constraintsI Based on original formulas, context, headers (intuition)I Structural constraints are ignored

22 / 29

Page 41: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Evaluation

MethodI Benchmark spreadsheets collected

I Exercise sessionI Online tutorialsI Data spreadsheets (e.g. crime statistics, fincancial data)

I Convert all spreadsheets to CSV filesI Manually specify ground-truth: intended constraints

I Based on original formulas, context, headers (intuition)I Structural constraints are ignored

22 / 29

Page 42: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4)Overall Sheet avg Overall Sheet avg Overall Sheet avg

Tables 19 2.11 48 2.29 4 1Cells 1231 137 1889 90 2320 580IntendedConstraints 34 3.78 52 2.48 6 1.50

23 / 29

Page 43: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

EvaluationBenchmark

Exercises (9) Tutorials (21) Data (4)Overall Sheet avg Overall Sheet avg Overall Sheet avg

Recall 0.85 0.83 0.88 0.87 1.00 1.00RecallSupported 1.00 1.00 1.00 1.00 1.00 1.00

Precision 0.97 0.98 0.70 0.91 1.00 1.00Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

I Q1. How many intended constraints are found by TaCLe?I High recallI All supported constraints always found

I Q2. How precise is TaCLe?I Precise on most spreadsheetsI Duplicates / multiple ways to calculate thwart precision

I Q3. How fast is TaCLe?I Dependencies crucial

24 / 29

Page 44: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

EvaluationBenchmark

Exercises (9) Tutorials (21) Data (4)Overall Sheet avg Overall Sheet avg Overall Sheet avg

Recall 0.85 0.83 0.88 0.87 1.00 1.00RecallSupported 1.00 1.00 1.00 1.00 1.00 1.00

Precision 0.97 0.98 0.70 0.91 1.00 1.00Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

I Q1. How many intended constraints are found by TaCLe?I High recallI All supported constraints always found

I Q2. How precise is TaCLe?I Precise on most spreadsheetsI Duplicates / multiple ways to calculate thwart precision

I Q3. How fast is TaCLe?I Dependencies crucial

24 / 29

Page 45: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

EvaluationBenchmark

Exercises (9) Tutorials (21) Data (4)Overall Sheet avg Overall Sheet avg Overall Sheet avg

Recall 0.85 0.83 0.88 0.87 1.00 1.00RecallSupported 1.00 1.00 1.00 1.00 1.00 1.00

Precision 0.97 0.98 0.70 0.91 1.00 1.00Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

I Q1. How many intended constraints are found by TaCLe?I High recallI All supported constraints always found

I Q2. How precise is TaCLe?I Precise on most spreadsheetsI Duplicates / multiple ways to calculate thwart precision

I Q3. How fast is TaCLe?I Dependencies crucial

24 / 29

Page 46: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

EvaluationBenchmark

Exercises (9) Tutorials (21) Data (4)Overall Sheet avg Overall Sheet avg Overall Sheet avg

Recall 0.85 0.83 0.88 0.87 1.00 1.00RecallSupported 1.00 1.00 1.00 1.00 1.00 1.00

Precision 0.97 0.98 0.70 0.91 1.00 1.00Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

I Q1. How many intended constraints are found by TaCLe?I High recallI All supported constraints always found

I Q2. How precise is TaCLe?I Precise on most spreadsheetsI Duplicates / multiple ways to calculate thwart precision

I Q3. How fast is TaCLe?I Dependencies crucial

24 / 29

Page 47: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Smart importTaCLe + constraint translation + cycle breaking

25 / 29

Page 48: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Auto-completion / dynamic error checkingincremental TaCLe + vector tracking

26 / 29

Page 49: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Conclusion

ConclusionI Approach that learns constraints in spreadsheet

I AccurateI Rather preciseI Efficient

Future work

I Applications (incremental learning, vector tracking, noise)I Nested constraintsI Sub vector-sizeI Post-processing (heuristic / entailment)

27 / 29

Page 50: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

https://unsplash.com/search/photos/cat?photo=Qmox1MkYDnY

28 / 29

Page 51: September 6, 2017 · Evaluation Method I Benchmark spreadsheets collected I Exercise session I Online tutorials I Data spreadsheets (e.g. crime statistics, fincancial data) I Convert

Questions?

29 / 29