A Lightweight Model for End Users’ Data: Progress and Future Work Christopher Scaffidi Carnegie Mellon University

Post on 20-Dec-2015



A Lightweight Model for End Users’ Data:

Progress and Future Work

Christopher Scaffidi

Carnegie Mellon University

Target users

• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.

• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). [5]

• Both EUs and EUPs will benefit from this research, though the research is mainly aimed at EUPs (including EUs who become EUPs because of the research).

introduction ● topes ● prototype ● future work ● evaluation

Contextual inquiry: What are the problems of EUs and EUPs?

Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each) [3]

[9]

How can EUPs validate web forms if they do not know JavaScript or regexps?

Is the input valid? “EDSH 225”

Is the input nearly valid? “EDXH 225”

Does it just need reformatting? “Smith 225”

Or is it obviously invalid? “412-555-5444”

Other tasks, other data, other problems

• When building a staff roster by merging data sources into a single spreadsheet, one of the EUs:
– Had to scrutinize data to identify questionable values that deserved double-checking (e.g.: A first name with 15 characters might be right)
– Had to manually transform data to a consistent format (e.g.: Put person names in Lastname, Firstname format)

• Contextual inquiries, interviews, and surveys identified other data validation and reuse tasks that are poorly supported by existing tools. [3][4][7][9]

Underlying problem: abstraction mismatch

• Tools support strings, integers, floats, sometimes dates.
• Problem domain involves higher-level categories of data:
– University names: “Carnegie Mellon”, “CMU”
– Person names: “Scaffidi, Christopher”, “Chris Scaffidi”
– CMU phone numbers: “8-1234”, “x8-1234”
– CMU room numbers: “WeH 4623”, “Wean 4623”
• These data categories are:
– Human-readable

– Short (~ 1 input field)

– Multi-format

– Sometimes ambiguous / fuzzy (non-binary scale of validity)

– Often particular to certain groups of people

Related Work

• Regexps / grammars / data detectors recognize data but do not specify how to transform multi-format data

• Types:
– A value is or is not a valid instance of a type (non-fuzzy).

– Typed languages are difficult for EUPs.

• Research on units (e.g.: Slate) and constraint systems (e.g.: Cues) typically applies only to numeric data in certain applications (e.g.: spreadsheets).

• Tools for integrating heterogeneous databases typically require a professional DBA and are specific to database data.

Approach: Create a new abstraction for each category of data

• Like software “libraries,” implementations of these abstractions could be reused in many programs.

• Abstractions would need to include functions for:
– Recognizing instances of the category (“isa”) (for automating data validation)
– Transforming instances among various formats (“trf”) (for automating data reformatting)
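The two kinds of functions named above can be illustrated with a minimal sketch for person names. The regular expressions and function names here are hypothetical; the slides do not show how the prototype implements isa or trf, and this sketch returns only 0.0 or 1.0 rather than a graded score.

```python
import re

# Hypothetical formats for a "person name" category (illustrative patterns only).
LAST_FIRST = re.compile(r"^([A-Z][a-z]+), ([A-Z][a-z]+)$")
FIRST_LAST = re.compile(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$")


def isa_last_first(value: str) -> float:
    """Recognize 'Lastname, Firstname': 1.0 if the value fits, else 0.0."""
    return 1.0 if LAST_FIRST.match(value) else 0.0


def isa_first_last(value: str) -> float:
    """Recognize 'Firstname Lastname': 1.0 if the value fits, else 0.0."""
    return 1.0 if FIRST_LAST.match(value) else 0.0


def trf_first_last_to_last_first(value: str) -> str:
    """Transform 'Firstname Lastname' into 'Lastname, Firstname'."""
    match = FIRST_LAST.match(value)
    if match is None:
        raise ValueError(f"not in 'Firstname Lastname' format: {value!r}")
    first, last = match.groups()
    return f"{last}, {first}"
```

A validation tool would call an isa function on each input, and a reformatting tool would call the trf function on values that the source format's isa accepts.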

Topes

• Tope = an abstraction for a data category
– Greek word for “place,” because each tope corresponds to a data category with a natural place in the problem domain
• Topes in practice:

1. EUPs create new topes by using the basic tope editor (or another language, e.g.: if they happen to know JavaScript)

2. EUPs publish topes on repositories.

3. Other EUs & EUPs download topes to their local cache.

4. Tool plug-ins let EUs & EUPs browse their local cache and associate topes with variables and input fields.

5. Plug-ins get topes from local cache and use them at runtime to validate and transform data.

Example in our prototype format editor: CMU Campus Phone Number

Features:

• Format inference
• Format/part names
• Soft constraints
• “isa” generation
• Testing features
• Format reusability
• EUP tool integration

[1][6]

(Similar UI style for implementing trfs)

Validation by associating a tope with a textbox

• Invalid inputs cause a targeted message to appear.

• Inputs that violate an always or never constraint cannot be submitted to the server.

• Inputs that violate an often constraint cause a warning, which the application user can override.
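The three outcomes above (accept, overridable warning, blocked submission) can be sketched as a small constraint checker. This is an illustration under stated assumptions, not the prototype's code: the `Constraint` class, the `check` function, and the example constraints for first names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    strength: str                  # "always", "never", or "often"
    holds: Callable[[str], bool]   # predicate the strength refers to
    message: str                   # targeted message shown on violation


def check(value: str, constraints: List[Constraint]) -> Tuple[str, List[str]]:
    """Return ("accept" | "warn" | "reject", messages for violated constraints)."""
    outcome = "accept"
    messages: List[str] = []
    for c in constraints:
        satisfied = c.holds(value)
        # A "never" constraint is violated when its predicate holds.
        violated = satisfied if c.strength == "never" else not satisfied
        if not violated:
            continue
        messages.append(c.message)
        if c.strength in ("always", "never"):
            outcome = "reject"      # cannot be submitted
        elif outcome != "reject":
            outcome = "warn"        # user may override
    return outcome, messages


# Hypothetical constraints for a first-name field.
FIRST_NAME = [
    Constraint("always", lambda s: s[:1].isupper(),
               "A first name starts with a capital letter."),
    Constraint("never", lambda s: any(ch.isdigit() for ch in s),
               "A first name does not contain digits."),
    Constraint("often", lambda s: len(s) < 15,
               "First names are usually shorter than 15 characters."),
]
```

For example, `check("Chr1s", FIRST_NAME)` rejects the input, while a 16-character name only triggers the overridable warning.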

Evaluations to date

• Usability:
– Controlled experiment shows that our format editor enables EUPs to validate data more quickly and accurately than with Lapis patterns or with regexps
• Expressiveness:
– We have implemented formats for dozens of kinds of data found in (1) the EUSES spreadsheet corpus and (2) logs of EUPs’ web browsing
• Usefulness:
– We have integrated topes with tools for creating web applications, databases, spreadsheets, and web macros.

Future work

• Implement enhancements to the basic editor
– UI improvements; behind the scenes: new meta-data fields
• Implement repository system
– Plug-ins will have a list of “known” repository servers
– EUPs will be able to publish topes into repository servers
– Repositories will provide various search features

• Search by example (based on [1])

• Search by contextual keywords (based on [2])

• Search by collaborative filtering (similar to Amazon)

• Search by tope reliability (see [8])

• And of course, search by (non-unique) name

Evaluation: Can EUPs create topes?

Claim #1: By representing formats as a series of constrained parts, the basic editor enables EUPs to implement topes for common categories of data.

Evaluation: controlled experiment
– Sample: information workers

– Tasks: create topes for data revealed by previous studies

– Comparison: have users verbally describe the data

– Measures: success, time, match to users’ expectations

(Our usability evaluation only covered isa, not trf.)

Evaluation: Do topes help EUPs?

Claim #2: Extending existing tools with topes enables EUPs to more quickly and correctly validate and reuse data than is possible through currently practiced methods.

Evaluation: controlled experiment
– Sample: information workers

– Tasks: use topes to do work revealed by previous studies

– Measures: time, accuracy, satisfaction

– Comparison: Lapis and manual performance

(Our usability evaluation covered data validation, not reuse.)

Evaluation: Can EUPs share/reuse topes?

Claim #3: Given suitable tools operating on tope meta-information, EUPs can share topes with one another.

Evaluation: field test
– Sample: CMU staff and students

– Tasks: install our tools and use them for several weeks

– Measures: logs of usage, satisfaction surveys

– Comparison: normal way of doing work

Related papers

Conference papers

[1] C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems (ICEIS'07), 2007, to appear.

[2] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, C. Jin. Red Opal: Product-Feature Scoring from Reviews. Proceedings of 8th ACM Conference on Electronic Commerce (ACMEC'07), 2007, to appear

[3] C. Scaffidi, A. Cypher, S. Elbaum, A. Koesnandar, and B. Myers. Scenario-Based Requirements for Web Macro Tools. Submitted for publication, 2007.

[4] C. Scaffidi, A. Ko, B. Myers, M. Shaw. Dimensions Characterizing Programming Feature Usage by Information Workers. VL/HCC'06: Proceedings of the 2006 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 59-62, 2006.

[5] C. Scaffidi, M. Shaw, and B. Myers. Estimating the Numbers of End Users and End User Programmers. VL/HCC'05: Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing , pp. 207-214, 2005.

Other papers

[6] C. Scaffidi, B. Myers, M. Shaw. The Topes Format Editor and Parser, Technical Report CMU-ISRI-07-104, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2007.

[7] C. Scaffidi, B. Myers, and M. Shaw. Trial By Water: Creating Hurricane Katrina "Person Locator" Web Sites. In Leadership at a Distance: Research in Technologically-Supported Work (S. Weisband, ed), Lawrence Erlbaum, pp. 209-222, 2007.

[8] C. Scaffidi, M. Shaw. Toward a Calculus of Confidence. First International Workshop on the Economics of Software and Computation, co-located with ICSE'07, 2007, to appear.

[9] C. Scaffidi, M. Shaw, B. Myers. Games Programs Play: Obstacles to Data Reuse, 2nd Workshop on End User Software Engineering (WEUSE), 2006.

Thank You…

• …to the symposium committee/panel for the opportunity to present

• …to many people for helpful suggestions

• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)

Marwan Abi-Antoun Margaret Burnett Martin Erwig Andy Ko Mary Beth Rosson

Robin Abraham Owen Cheng George Fairbanks Thomas LaToza Mary Shaw

Matt Bass Ciera Christopher Thomas Green Alon Lavie Jeff Stylos

Nels Beckman Michael Coblenz Josh Gross Henry Lieberman Dean Sutherland

Kevin Bierhoff Allen Cypher Greg Hartman Larry Maccherone Steve Tanimoto

Alan Blackwell Uri Dekel Jim Herbsleb Brad Myers Susan Wiedenbeck

Barry Boehm Sebastian Elbaum John Hosking John Pane


Interviews of web site creators: Confirmation of specific problems

• Interviewed 6 people involved in creating “person locator” web sites after Hurricane Katrina [7][9]

• Many omitted data validation on web forms
– Hard to detect that “12 Years old” is an invalid street address (what would the regexp look like?)
• “Aggregator” sites were built to scrape and consolidate data from numerous person locator sites.
– Hard to transform data into a single consistent format
– Hard to identify probable duplicates in the merged data set

Extra slides

Survey of EUPs: Better data-manipulation features needed

• Asked 831 information workers about use of 23 features in 5 tools (e.g.: creating spreadsheet macros, database stored procedures, and web forms) [4][9]
• The most widely used features were related to manipulating linked structures of data (e.g.: database tables) rather than imperative or macro programming
• Yet respondents complained about these features:
– “Not always easy to move sturctured [sic] data or text”
– “Not always integrated a lot of data manipulation redundant”
– “Information entered inconsistently into database fields by different people leaves a lot of database cleaning”

Proposed data model

• 1 tope implementation contains executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
• A lightweight data model…
– Only contains 2 kinds of functions (isa/trf)
– These correspond to the operations that people had to keep performing manually in our studies.

Example tope: Notional representation

• An example tope for CMU room numbers
– 3 isa functions, 4 trf functions
– A tope’s trf functions can be omitted if desired

Formats shown in the diagram, each with an example value:
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Building abbreviation & room number: EDSH 225
– Colloquial building name & room number: Smith 225
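A tope of this shape can be sketched as a lookup table shared by one isa per format and a trf between formats. This is a hypothetical illustration: the table holds only the one building named on the slide, the room-number pattern is an assumption, the sketch offers a trf between every pair of formats (the slide does not say which four its diagram links), and the scores are only 0.0 or 1.0.

```python
import re

# Only the building from the slide; a real tope would list many more.
BUILDINGS = [
    {"formal": "Elliot Dunlap Smith Hall", "abbrev": "EDSH", "colloquial": "Smith"},
]

# Assumed shape: building text, one space, then a short room number.
ROOM = re.compile(r"^(?P<building>.+) (?P<room>\d{1,4}[A-Z]?)$")


def _parse(value: str, fmt: str):
    """Return (building record, room) if value fits format fmt, else None."""
    match = ROOM.match(value)
    if match is None:
        return None
    for building in BUILDINGS:
        if building[fmt] == match.group("building"):
            return building, match.group("room")
    return None


def isa(value: str, fmt: str) -> float:
    """Score value against one format: 1.0 if recognized, else 0.0."""
    return 1.0 if _parse(value, fmt) is not None else 0.0


def trf(value: str, src: str, dst: str) -> str:
    """Rewrite a value from format src into format dst."""
    parsed = _parse(value, src)
    if parsed is None:
        raise ValueError(f"{value!r} is not in the {src!r} format")
    building, room = parsed
    return f"{building[dst]} {room}"
```

With this table, `trf("Smith 225", "colloquial", "abbrev")` yields the abbreviated form from the slide, and `isa("EDXH 225", "abbrev")` scores 0.0 because the sketch has no notion of a nearly valid value.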

Prototype implementation: System block diagram

[System block diagram with components: Spreadsheet, Microsoft Excel, Plug-in, Microsoft Visual Studio.NET, Plug-in, Format editor, Parser, Web application, Validator]

Proposed development environment: Functional decomposition diagram

[Functional decomposition diagram with boxes: Development Environment, Basic Topes Editor, Repository Software, Publishing Tools, Search Tools, Plug-Ins]

EUPs implement topes in basic topes editor (or JavaScript), then publish in repositories. Other EUs and EUPs search for topes, download them, then use them through plug-ins.

Sample task: web form validation (the painful old way)

• Drag widgets and validator onto page, select a regexp, customize if desired.

Sample task: web form validation (results of the painful old way)

• Invalid inputs cause a hard-coded message to appear.

Oops, forgot to enter a message at design-time.

• For valid inputs, no error message appears.

Hm, didn’t realize the area code was optional.

What if I want to allow campus phone numbers?

Sample task: validating person names (customizing constraints in our prototype)

• User can add/edit constraints

Expressiveness evaluation

• Four administrative assistants’ use of a web browser was logged for three weeks, resulting in nearly 6000 sample data values that they typed into web forms.

• Not logged verbatim: characters were generalized
– E.g.: [email protected] → Aa{7}0@a{5}.a{3}
• We manually grouped values into 19 semantic families (e.g.: email address) based on each widget’s HTML name and the words visually near the widget
• Created and tested formats for 14 families (4250 values)
– Omitted: username/passwords and long blocks of “text”
– Inference & testing features were not used during format creation
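The character generalization applied to the logged values can be sketched as follows. The rules are inferred from the single example pattern above and are an assumption: uppercase letters become `A`, lowercase letters `a`, digits `0`, runs longer than one get a `{n}` count, and all other characters are kept verbatim. The function names and the sample address are hypothetical.

```python
from itertools import groupby


def _char_class(ch: str) -> str:
    """Map a character to its generalized class symbol."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return ch  # punctuation and other characters are kept as-is


def generalize(value: str) -> str:
    """Replace letters and digits with class symbols and run-length counts."""
    parts = []
    for cls, run in groupby(value, key=_char_class):
        count = len(list(run))
        if cls in ("A", "a", "0") and count > 1:
            parts.append(f"{cls}{{{count}}}")
        else:
            parts.append(cls * count)
    return "".join(parts)
```

A made-up address such as `Abcdefgh1@abcde.abc` generalizes to the pattern shown on the slide, so the log keeps the shape of a value without its content.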

Expressiveness evaluation results

• 9 families needed 1 format each; 5 needed 2 formats each

• The only error attributable to editor expressiveness:
– 1 of the 4250 test values had a trailing period on a street type (in an address line)
– This particular version of the editor had no way to say that a part could contain a period but only at the end

... And we have recently submitted conference papers discussing a fuller expressiveness evaluation as well as a small usability study.

[6]

Future work: Share/reuse via repositories

• Clients will have a list of “known” repository servers
– Generally pre-configured to include a global server at CMU
– Organizations will configure clients to include the organizational server
– EUs and EUPs will be able to add new servers to their list
• To support publishing/searching, the repository will house meta-information about topes, including…
– a human-visible non-unique name & description
– an internally-used globally unique id (guid) based on the tope’s URL in the repository

Future work: Searching for relevant topes

• Search by keyword:
– Search tope name and description
– And match based on words that are visually near to topes
• Search by groups of people:
– Within an organization, or by author’s email domain
– Within spaces that are “group-private”
• Search by groups of topes:
– “If you liked this tope, you may also like XYZ”
– Similar to Amazon.com’s product recommendations
• Search by example:
– “Find me a tope that recognizes 412-555-1212”
– For efficiency, filter based on “signature” (\d{3}-\d{3}-\d{4})
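Search by example with a signature pre-filter might look like the sketch below. Everything here is an assumption made for illustration: the signature rule (digit runs become `\d{n}`, other characters are kept), the in-memory `INDEX`, and the single hypothetical tope in it. The slides do not describe how the planned repository computes or stores signatures.

```python
import re
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Isa = Callable[[str], float]


def signature(example: str) -> str:
    # Replace each run of digits with a digit-count token; keep other characters.
    parts = []
    for is_digit, run in groupby(example, key=str.isdigit):
        text = "".join(run)
        parts.append("\\d{%d}" % len(text) if is_digit else text)
    return "".join(parts)


def isa_us_phone(value: str) -> float:
    """Hypothetical isa for a ten-digit phone number format."""
    return 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", value) else 0.0


# Hypothetical repository index: signature -> [(tope name, isa function)].
INDEX: Dict[str, List[Tuple[str, Isa]]] = {
    signature("412-555-1212"): [("US phone number (hypothetical tope)", isa_us_phone)],
}


def search_by_example(example: str, index: Dict[str, List[Tuple[str, Isa]]]) -> List[str]:
    """Cheap signature lookup first, then run isa only on the candidates."""
    candidates = index.get(signature(example), [])
    return [name for name, isa in candidates if isa(example) > 0.0]
```

The signature lookup narrows the candidate set before any isa function runs, which is the efficiency gain the slide points to.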

Future work: Searching for reliable topes

Evidence [8] | EUs and EUPs may trust topes: | Search features
Explicit formal roles | Created by their organization’s system administrators. | Search by tope author
Prior performance | From people who have previously supplied good topes. |
Model of motivation | From vendors that care about brand image. |
Group membership | From people who are known to have a similar background. |
Reputation | That earned anonymous votes of confidence. | Search by tope ratings (either anonymous or not)
References | That present a list of high-profile people who like the topes. |
Certification | That are inspected and certified by a third party. |
Social context | That are actively maintained, that is, for which improved versions are regularly available. That are implemented in a familiar language/platform. | Search by tope publication date and execution platform

Future work: Enhancing plug-ins

• Target tools
– Microsoft Excel
– Microsoft Visual Studio.NET
– Robofox
• Operations supported
– Assertions: run isa on selected cells
– Transformation: run trf on selected cells
– De-duplication: run trf on selected cells

• Each will support basic editor topes & JavaScript topes
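De-duplication by way of trf can be sketched as: transform every selected value to one canonical format, then group values that become identical. The helper names and the name-reordering rule are hypothetical, and the slides do not specify how the planned plug-ins will pick the canonical format.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def to_last_first(name: str) -> str:
    """Hypothetical trf into 'Lastname, Firstname'; single-word names pass through."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        return f"{last}, {first}"
    parts = name.strip().rsplit(" ", 1)
    if len(parts) == 1:
        return parts[0]
    first, last = parts
    return f"{last}, {first}"


def find_duplicates(values: Iterable[str],
                    to_canonical: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group values by canonical form; keep only groups with probable duplicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[to_canonical(value)].append(value)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Applied to a merged roster, this flags “Chris Scaffidi” and “Scaffidi, Chris” as one probable duplicate while leaving unique names alone.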

Future work: Recognizing exceptions in plug-ins

• Tope creators might overlook values.
• From the standpoint of a tope format, these “normal” values are exceptional cases that need to be tolerated.
• Simple approach: Record a whitelist of exceptions
• More sophisticated: For each format, record exceptions, infer a format (new isa function), and average this function’s score with the raw function’s score
• Exceptional values can be incorporated into the tope in the local cache and/or, at EUP’s discretion, propagated to the repository of the tope’s master copy
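Both approaches to exceptions can be sketched as wrappers around an existing isa function. The wrapper names and the two phone-number formats are hypothetical, and the inferred format is written by hand here rather than inferred from recorded exceptions.

```python
import re
from typing import Callable, Iterable

Isa = Callable[[str], float]


def with_whitelist(isa: Isa, exceptions: Iterable[str]) -> Isa:
    """Simple approach: recorded exceptions always score 1.0."""
    allowed = set(exceptions)

    def wrapped(value: str) -> float:
        return 1.0 if value in allowed else isa(value)

    return wrapped


def averaged(raw_isa: Isa, inferred_isa: Isa) -> Isa:
    """More sophisticated approach: average the raw and inferred scores."""

    def wrapped(value: str) -> float:
        return (raw_isa(value) + inferred_isa(value)) / 2.0

    return wrapped


def isa_full_phone(value: str) -> float:
    """Hypothetical raw format: phone number with area code."""
    return 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", value) else 0.0


def isa_local_phone(value: str) -> float:
    """Hypothetical format standing in for one inferred from exceptions."""
    return 1.0 if re.fullmatch(r"\d{3}-\d{4}", value) else 0.0
```

With a plain average, a value accepted by only one of the two functions scores 0.5, so values the raw format fully accepts are also pulled down; the slide does not say whether or how the planned plug-ins compensate for that.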
