Topes: Enabling End-User Programmers to Validate and Reformat Data

Page 1: Topes: Enabling End-User Programmers to Validate and Reformat Data


Christopher Scaffidi

Key collaborators: Brad Myers, Mary Shaw

Carnegie Mellon University

Page 2: Topes: Enabling End-User Programmers to Validate and Reformat Data

Hurricane Katrina “Person Locator” site: Many inputs unvalidated... and error-ful

Introduction Challenges Topes Tools Evaluation Conclusion

Page 3: Topes: Enabling End-User Programmers to Validate and Reformat Data

Data errors reduce the usefulness of data.

Even little typos impede data de-duplication.

Age is not useful for flying my helicopter to come rescue you.

Nor is a “city name” with 1 letter.

Page 4: Topes: Enabling End-User Programmers to Validate and Reformat Data

Hurricane Katrina sites are not alone in lacking input validation.

• E.g.: Google Base web application
  – 13 primary web forms
  – Even numeric fields accept unreasonable inputs (such as a salary of “-45”)

• E.g.: Spreadsheets
  – 40% of cells are non-numeric, non-date textual data
  – Often used to gather/organize textual data for reports

Page 5: Topes: Enabling End-User Programmers to Validate and Reformat Data

Outline

1. Challenges of data validation

2. Topes
  • Model for describing data
  • Tools for creating/using topes

3. Evaluations

4. Conclusion

Page 6: Topes: Enabling End-User Programmers to Validate and Reformat Data

Digging into the details: real user inputs that need validation.

• Sources:
  – Interviews of Hurricane Katrina website creators
  – Survey of Information Week readers
  – Contextual inquiry of information workers who created and used websites
  – Logs of what admin assistants typed into browsers
  – Exploration of the EUSES spreadsheet corpus

• Validating user inputs has 3 primary challenges…

Page 7: Topes: Enabling End-User Programmers to Validate and Reformat Data

1. Inputs don’t always conform well to the simple “binary” validation model.

• Data is sometimes questionable… yet valid.
  – E.g.: a suspiciously long email address
  – In practice, person names and other proper nouns are never validated with regexps… too brittle.
  – Life is full of corner cases and exceptions.

• If code can identify questionable data, then it can double-check the data:
  – Ask an application end user to confirm the input
  – Flag the input for checking by a system administrator
  – Compare the value to a list of known exceptions
  – Call up a server and see if it can confirm the value
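The double-checking idea on this slide can be sketched as a validator with three outcomes instead of two. This is only an illustration: the length threshold, the loose email shape, and the names `check_email` and `accept` are assumptions made for the example, not part of the topes tools.

```python
import re
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    QUESTIONABLE = "questionable"
    INVALID = "invalid"


# Hypothetical threshold: longer addresses are suspicious, not wrong.
SUSPICIOUS_LENGTH = 40

# Deliberately loose shape check: something@something.something
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def check_email(value: str) -> Verdict:
    """Classify an email address instead of giving a yes/no answer."""
    if not EMAIL_SHAPE.fullmatch(value):
        return Verdict.INVALID
    if len(value) > SUSPICIOUS_LENGTH:
        return Verdict.QUESTIONABLE
    return Verdict.VALID


def accept(value: str, known_exceptions: set, confirm) -> bool:
    """Accept valid input, reject invalid input, double-check the rest."""
    verdict = check_email(value)
    if verdict is Verdict.VALID:
        return True
    if verdict is Verdict.INVALID:
        return False
    # Questionable: consult the exception list first, then ask a person.
    return value in known_exceptions or confirm(value)
```

Here `confirm` stands in for any of the follow-ups listed on the slide (asking the end user, flagging for an administrator, or calling a server).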

Page 8: Topes: Enabling End-User Programmers to Validate and Reformat Data

2. User inputs often can occur in multiple different formats.

• Two different strings can be equivalent.
  – How many ways can you write a date?
  – What if an end user types a date in the wrong format?
  – “Jan-1-2007” and “1/1/2007” mean the same thing because of the category that they are in: date.
  – Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation.

• If code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format as needed.
  – Display result so users can check/fix interpretation
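A minimal sketch of transforming among formats, using the two date strings from this slide. It assumes US month/day order for the slash format and picks ISO 8601 as the unambiguous target; both choices, and the function names, are assumptions for illustration. The sketch only reformats and does not check that the day exists in the month.

```python
import re
from typing import Optional, Tuple

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def parse_date(text: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day) for either supported format, else None."""
    named = re.fullmatch(r"([A-Za-z]{3})-([0-9]{1,2})-([0-9]{4})", text)
    if named and named.group(1).title() in MONTHS:
        month = MONTHS.index(named.group(1).title()) + 1
        return int(named.group(3)), month, int(named.group(2))
    # Assumption: slash dates are month/day/year.
    slashed = re.fullmatch(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", text)
    if slashed:
        return int(slashed.group(3)), int(slashed.group(1)), int(slashed.group(2))
    return None


def to_iso(text: str) -> Optional[str]:
    """Rewrite a recognized date as YYYY-MM-DD so it can be shown back to the user."""
    parts = parse_date(text)
    if parts is None:
        return None
    year, month, day = parts
    return f"{year:04d}-{month:02d}-{day:02d}"
```

Because both inputs map to the same output string, equivalent dates become comparable with plain string equality.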

Page 9: Topes: Enabling End-User Programmers to Validate and Reformat Data

3. The meaning of data is often tied to its “parts”, not directly to its characters.

• Data often has parts, each with a meaning.
  – What are the parts of a date, 12/31/2008?
  – Valid data obeys intra- and inter-part constraints.
  – Constraints are usually platform-independent.
  – Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.

• If code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
  – Especially if it were platform-independent!
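To make the parts-and-constraints idea concrete, here is a small sketch for the date 12/31/2008. The month range is an intra-part constraint; the day limit depends on both month and year, so it is an inter-part constraint that a regexp cannot state easily. The names `DateParts`, `split_parts`, and `violations` are hypothetical.

```python
import calendar
import re
from typing import List, NamedTuple, Optional


class DateParts(NamedTuple):
    month: int
    day: int
    year: int


def split_parts(text: str) -> Optional[DateParts]:
    """Split 'M/D/YYYY' into named parts, or return None if the shape is wrong."""
    match = re.fullmatch(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", text)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    return DateParts(month, day, year)


def violations(parts: DateParts) -> List[str]:
    """List the constraints that the parts break (empty list means none)."""
    if parts.year < 1:
        return ["year must be at least 1"]
    if not 1 <= parts.month <= 12:
        # Intra-part constraint: depends on the month alone.
        return ["month must be between 1 and 12"]
    # Inter-part constraint: the day limit depends on month and year.
    days_in_month = calendar.monthrange(parts.year, parts.month)[1]
    if not 1 <= parts.day <= days_in_month:
        return [f"day must be between 1 and {days_in_month}"]
    return []
```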

Page 10: Topes: Enabling End-User Programmers to Validate and Reformat Data

Limitations of existing approaches

• Types do not support questionable values

• Grammars do not, either, nor can they reformat

• Information extraction algorithms rely on grammatical cues that are absent during validation

• Cues, Forms/3, -calculus, Slate, pollution markers, etc., infer numerical constraints but not constraints on strings, nor are they platform-independent

Page 11: Topes: Enabling End-User Programmers to Validate and Reformat Data

Imagine a world where…

• Code can ask an oracle, “Is this a company name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.

• Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.

• People teach the oracle about a new data category by concisely stating its parts and constraints.

Page 12: Topes: Enabling End-User Programmers to Validate and Reformat Data

New Approach: Topes

• A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data

• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

• Validating with topes improves
  – Accuracy of validation
  – Reusability of validation code
  – Consistency of data formatting

Page 13: Topes: Enabling End-User Programmers to Validate and Reformat Data

A tope is a graph. Node = format, edge = transformation

Notional representation for a CMU room number tope…

[Figure: graph of three format nodes linked by transformation edges]

• Formal building name & room number: Elliot Dunlap Smith Hall 225
• Colloquial building name & room number: Smith 225
• Building abbreviation & room number: EDSH 225
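One way to picture the graph is as a dictionary of edges, each holding a transformation function, with a breadth-first search to chain edges when two formats are not directly linked. The slide does not say which edges the real tope has, so the three edges below, the lookup row, and all function names are assumptions for illustration.

```python
from collections import deque

# Hypothetical lookup table: (formal, colloquial, abbreviation) per building.
BUILDINGS = [("Elliot Dunlap Smith Hall", "Smith", "EDSH")]
COLUMN = {"formal": 0, "colloquial": 1, "abbreviation": 2}


def make_trf(src: str, dst: str):
    """Build a transformation that swaps the building name and keeps the room."""
    def trf(text: str) -> str:
        building, _, room = text.rpartition(" ")
        for row in BUILDINGS:
            if row[COLUMN[src]] == building:
                return f"{row[COLUMN[dst]]} {room}"
        raise ValueError(f"unknown building: {building!r}")
    return trf


# Assumed edges: a cycle, so every format is reachable from every other.
EDGES = {
    ("formal", "colloquial"): make_trf("formal", "colloquial"),
    ("colloquial", "abbreviation"): make_trf("colloquial", "abbreviation"),
    ("abbreviation", "formal"): make_trf("abbreviation", "formal"),
}


def transform(text: str, src: str, dst: str) -> str:
    """Follow the shortest chain of edges from format src to format dst."""
    queue = deque([(src, text)])
    seen = {src}
    while queue:
        fmt, value = queue.popleft()
        if fmt == dst:
            return value
        for (start, end), trf in EDGES.items():
            if start == fmt and end not in seen:
                seen.add(end)
                queue.append((end, trf(value)))
    raise ValueError(f"no path from {src} to {dst}")
```

For example, `transform("Elliot Dunlap Smith Hall 225", "formal", "abbreviation")` passes through the colloquial format on its way to `"EDSH 225"`.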

Page 14: Topes: Enabling End-User Programmers to Validate and Reformat Data

A tope is a conceptual abstraction. A tope implementation is code.

• Each tope implementation has executable functions:
  – 1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
  – 0 or more trf: string → string functions linking formats, for transforming values from one format to another

• Validation function: $\phi(str) = \max_f isa_f(str)$, where $f$ ranges over the tope’s formats
  – Valid when $\phi(str) = 1$
  – Invalid when $\phi(str) = 0$
  – Questionable when $0 < \phi(str) < 1$
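The validation function on this slide, the maximum of the per-format isa scores, can be sketched as follows. The two date formats and the 0.5 score for a two-digit year are assumptions chosen for illustration; the slide only fixes the signature of isa and the three outcomes.

```python
import re
from typing import Callable, Dict

Isa = Callable[[str], float]


def isa_slash_date(text: str) -> float:
    if re.fullmatch(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}", text):
        return 1.0
    if re.fullmatch(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}", text):
        return 0.5  # two-digit year: plausible but ambiguous (assumed score)
    return 0.0


def isa_month_name_date(text: str) -> float:
    matched = re.fullmatch(r"[A-Z][a-z]{2}-[0-9]{1,2}-[0-9]{4}", text)
    return 1.0 if matched else 0.0


# A tope implementation reduced to its isa functions, one per format.
DATE_FORMATS: Dict[str, Isa] = {
    "slash": isa_slash_date,
    "month name": isa_month_name_date,
}


def validation_score(formats: Dict[str, Isa], text: str) -> float:
    """Maximum isa score over all formats of the tope."""
    return max(isa(text) for isa in formats.values())


def classify(score: float) -> str:
    if score == 1.0:
        return "valid"
    if score == 0.0:
        return "invalid"
    return "questionable"
```

Taking the maximum means a string only needs to fit one format well to count as valid, which is why adding formats to a tope never makes it stricter.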

Page 15: Topes: Enabling End-User Programmers to Validate and Reformat Data

Common kinds of topes: enumerations and proper nouns

• Multi-format enumerations, e.g.: US states
  – “New York”, “CA”, maybe “Guam”

• Open-set proper nouns, e.g.: company names
  – Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
  – Augmented with a pattern for promising inputs that are not yet on the whitelist
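A rough sketch of the open-set proper noun case: whitelisted spellings are definitely valid, and a loose pattern marks other promising inputs as questionable rather than rejecting them. The pattern, the 0.5 score, and the function names are assumptions for illustration.

```python
import re
from typing import Optional

# Whitelist from the slide: every known spelling maps to one canonical name.
WHITELIST = {
    "Google": "Google",
    "Google Corp": "Google",
    "GOOG": "Google",
}

# Assumed pattern for promising names: one or more capitalized words.
PROMISING = re.compile(r"[A-Z][A-Za-z&.]*( [A-Z][A-Za-z&.]*)*")


def isa_company(text: str) -> float:
    """1.0 for whitelisted names, 0.5 for promising ones, 0.0 otherwise."""
    if text in WHITELIST:
        return 1.0
    if PROMISING.fullmatch(text):
        return 0.5
    return 0.0


def canonical(text: str) -> Optional[str]:
    """Transform any whitelisted spelling to its canonical form."""
    return WHITELIST.get(text)
```

A questionable name that a person later confirms could simply be added to the whitelist, so the tope improves with use.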

Page 16: Topes: Enabling End-User Programmers to Validate and Reformat Data

Two other common kinds of topes: numeric and hierarchical

• Numeric, e.g.: human masses
  – Numeric and in a certain range
  – Values slightly outside range might be questionable
  – Sometimes labeled with an explicit unit
  – Transformation usually by multiplication

• Hierarchical, e.g.: address lines
  – Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
  – Simple isas can be implemented with regexps.
  – Transformations involve permutation of parts, lookup tables, and changes to separators & capitalization.
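The numeric case can be sketched with a range check that has a questionable band around it and a unit conversion done by multiplication. The range limits and margin below are made-up values for illustration; only the pound-to-kilogram factor is a fixed constant.

```python
# Hypothetical bounds for a typical human mass in kilograms.
TYPICAL_MIN_KG = 30.0
TYPICAL_MAX_KG = 200.0
MARGIN_KG = 100.0  # assumed: this far above the range is still plausible

KG_PER_LB = 0.45359237  # exact by definition of the international pound


def isa_mass_kg(text: str) -> float:
    """1.0 inside the typical range, 0.5 slightly outside it, 0.0 otherwise."""
    try:
        mass = float(text)
    except ValueError:
        return 0.0
    if TYPICAL_MIN_KG <= mass <= TYPICAL_MAX_KG:
        return 1.0
    slightly_low = 0 < mass < TYPICAL_MIN_KG
    slightly_high = TYPICAL_MAX_KG < mass <= TYPICAL_MAX_KG + MARGIN_KG
    if slightly_low or slightly_high:
        return 0.5
    return 0.0


def lb_to_kg(text: str) -> str:
    """Transformation between unit formats: multiply by a constant."""
    return f"{float(text) * KG_PER_LB:.1f}"
```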

Page 17: Topes: Enabling End-User Programmers to Validate and Reformat Data

Tope Development Environment (TDE)

• Topei module: infers tope from examples
• Toped module: enables end-user programmers (EUPs) to create/edit topes
• Topeg module: generates context-free grammars and transformations
• Topep module: parses data against grammars, performs transformations
• Repository: stores topes for sharing/reuse
• Plug-ins: read/write program data
  – Robofox: web macros
  – Vegemite/CoScripter: web macros
  – Microsoft Excel: spreadsheets
  – Visual Studio.NET: web applications

Page 18: Topes: Enabling End-User Programmers to Validate and Reformat Data

Toped User Interface

Features
• Format inference
• Format/part names
• Soft constraints
• Value whitelists
• Testing features
• Format reusability

Page 19: Topes: Enabling End-User Programmers to Validate and Reformat Data

Integration with programming platforms

• Microsoft Excel: buttons and menus
• Visual Studio: drag-and-drop code generation

Page 20: Topes: Enabling End-User Programmers to Validate and Reformat Data

Integration with programming platforms

• Recommends tope for the data at hand
• Convenient access to reformatting

Page 21: Topes: Enabling End-User Programmers to Validate and Reformat Data

Other integrations to date: CoScripter, Robofox, XML/HTML library

Page 22: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating accuracy

• Implemented topes for spreadsheet data
  – Grouped 1712 columns of spreadsheet data (from the EUSES spreadsheet corpus) into data categories
  – Created 32 topes for the most common 32 data categories (~70% of the data)
  – Compared validation with topes to validation with regexps or enumerations from the web
  – Tope-based validation was over 3 times as accurate (for 5 formats or regexps per data category)

Page 23: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating reusability

• Reused spreadsheet-based topes on webform data
  – Downloaded data for 8 data categories on Google Base and 5 in Hurricane Katrina website
  – Reused spreadsheet-based topes on the web data
  – Validation was just as accurate (and sometimes even better, as the webform data was from just two sources and therefore less diverse than the spreadsheet data)

Page 24: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating support for data cleaning

• Used topes to put web data into consistent formats
  – Again with the 5 columns in Hurricane Katrina website
  – Used transformation functions to put each string into the most common format for that data category
  – Increased number of duplicate strings found by 10%
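The cleaning step can be illustrated with a toy example: count exact duplicates, put every recognized string into one format, and count again. The state-name data and the `normalize_state` lookup are invented for the sketch; they are not the Hurricane Katrina columns, and the numbers it produces say nothing about the 10% result above.

```python
# Hypothetical transformation: map known spellings to a two-letter code.
STATE_CODES = {
    "la": "LA", "louisiana": "LA",
    "ms": "MS", "mississippi": "MS",
}


def normalize_state(text: str):
    """Return the common format for a recognized value, else None."""
    return STATE_CODES.get(text.strip().lower())


def clean(strings, normalize):
    """Rewrite each string with `normalize`, leaving unrecognized values as-is."""
    cleaned = []
    for text in strings:
        result = normalize(text)
        cleaned.append(text if result is None else result)
    return cleaned


def duplicate_count(strings) -> int:
    """Number of strings that exactly repeat another string in the list."""
    return len(strings) - len(set(strings))


raw = ["LA", "Louisiana", "la", "MS", "Mississippi", "Texas?"]
print(duplicate_count(raw), duplicate_count(clean(raw, normalize_state)))
```

Before cleaning, no two strings are character-for-character equal; afterwards the equivalent values collapse and become detectable as duplicates.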

Page 25: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating usability for data validation

• End users validating data with single-format topes
  – Between-subjects lab study (early version of Toped)
  – 8 users validated spreadsheet data with Toped; for comparison, 8 users validated with Lapis patterns
  – Toped users found twice as many of the typos as Lapis users
  – Topes were 50% more accurate than Lapis patterns
  – Toped gave significantly higher user satisfaction
  – (Comparison to an earlier regular expression study that had similar but not identical tasks: Toped users were faster and more accurate, but the difference was not statistically significant)

Page 26: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating usability for data reformatting

• End users reformatting data with multi-format topes
  – Within-subjects lab study (latest version of Toped)
  – 9 users reformatted spreadsheet data by creating & using topes; for comparison, they then did it manually
  – Effort of creating a tope “pays off” at only 47 strings (further reuse is essentially “free”)
  – Every participant strongly preferred using Toped instead of doing tasks manually

Page 27: Topes: Enabling End-User Programmers to Validate and Reformat Data

Evaluating tope recommendations

• Quickly recommend existing tope for data at hand
  – Supports keyword-based search + search-by-match (e.g.: topes that match “888-555-1212”)
  – Evaluated by searching through topes for the 32 most common data categories in EUSES spreadsheet corpus, using strings from corpus
  – High accuracy: Recall over 80% (result set size = 5)
  – Adequate speed: User is likely to have a few dozen topes on computer, taking under 1 sec to search
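Search-by-match can be approximated by scoring every stored tope against the example strings and returning the best few. The slide does not describe the actual ranking algorithm, so the average-score ranking below, the toy topes, and the names `pattern_tope` and `recommend` are all assumptions.

```python
import re
from typing import Callable, Dict, List

Validator = Callable[[str], float]


def pattern_tope(pattern: str) -> Validator:
    """Toy single-format tope: score 1.0 on a full regexp match, else 0.0."""
    compiled = re.compile(pattern)
    return lambda text: 1.0 if compiled.fullmatch(text) else 0.0


# Hypothetical local collection of topes.
TOPES: Dict[str, Validator] = {
    "phone number": pattern_tope(r"[0-9]{3}-[0-9]{3}-[0-9]{4}"),
    "date": pattern_tope(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"),
    "US ZIP code": pattern_tope(r"[0-9]{5}"),
}


def recommend(topes: Dict[str, Validator], examples: List[str], k: int = 5) -> List[str]:
    """Return up to k tope names, best average match first, skipping non-matches."""
    if not examples:
        return []
    scored = []
    for name, validate in topes.items():
        average = sum(validate(text) for text in examples) / len(examples)
        if average > 0:
            scored.append((average, name))
    # Highest average first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

With the default `k = 5`, the result set has the same size as the one used for the recall figure on this slide; a linear scan like this is adequate for a few dozen topes.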

Page 28: Topes: Enabling End-User Programmers to Validate and Reformat Data

Conclusion: Topes improve data validation

• Validating with topes improves
  – Accuracy of validation
  – Consistency of data formatting
  – Reusability of validation code

• Primary contributions:
  – Support for ambiguous data categories
  – Support for reformatting values
  – Platform-independent, reusable validation

Page 29: Topes: Enabling End-User Programmers to Validate and Reformat Data

Future work: quality control

• Quality control (of topes) within topes repository
  – Indicators of tope reusability
    • E.g.: meaningful names given to parts in formats?
    • E.g.: plenty of test strings that match the tope?
  – Extension of work on identifying reusable web macros

• Quality control (by topes) of data exchange
  – Two modules (components/web services/…) may use the same kind of data, but require different formats.
  – Topes can automatically reformat strings on demand.
  – One step toward a larger goal… helping end users to create, share, and combine their code – ask for details!

Page 30: Topes: Enabling End-User Programmers to Validate and Reformat Data

Thank You…

• For this opportunity to present

• To NSF for funding

Page 31: Topes: Enabling End-User Programmers to Validate and Reformat Data

Professional programmers use lots of tricks to simplify validation code. E.g.: njtransit.com

Split inputs into many easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?

Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!)

I implemented this site in 2003.

Page 32: Topes: Enabling End-User Programmers to Validate and Reformat Data

Even with these tricks, writing validation is still very time-consuming.

Overall, the site had over 1100 lines of JavaScript just for validation… plus equivalent server-side Java code (too bad code isn’t platform-independent).

    // Reject addresses that fail the site's general email check.
    if (!rfcCheckEmail(frm.primaryemail.value))
        return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");
    // Enforce length limits on the parts before and after the '@'.
    var atloc = frm.primaryemail.value.indexOf('@');
    if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
        return messageHelper(frm.primaryemail,
            "Sorry. You may only enter 32 characters or less for your email name\r\n" +
            "and 32 characters or less for your email domain (including @).");

Page 33: Topes: Enabling End-User Programmers to Validate and Reformat Data

That was worst case. Best case: reusable regexps.

• Many IDEs allow the programmer to enter one regular expression for validating each input field.
  – Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
  – So why don’t programmers validate most inputs?