text mining wksp auvil

Engineering Knowledge for the Humanities Text Mining Workshop April 26, 2008 Loretta Auvil National Center for Supercomputing Applications (NCSA) University of Illinois


Post on 29-Nov-2014


DESCRIPTION

My presentation of our work at the Text Mining Workshop 2008, held in conjunction with the Eighth SIAM International Conference on Data Mining (SDM 2008) in Atlanta, GA on April 26, 2008.

TRANSCRIPT

Page 1: Text Mining Wksp Auvil

Engineering Knowledge for the Humanities

Text Mining Workshop, April 26, 2008

Loretta Auvil, National Center for Supercomputing Applications (NCSA)

University of Illinois

Page 2: Text Mining Wksp Auvil

No Formulas

Page 3: Text Mining Wksp Auvil

More Visualizations

www.visualcomplexity.com

Page 4: Text Mining Wksp Auvil

NoraVis OpenLaszlo

www.noraproject.org

Page 5: Text Mining Wksp Auvil

NoraVis Backend

• Leverages D2K as web service call for predictive modeling
  • Passing parameters for some options
• Known modeling problems:
  • Training on a very sparse set of words, so improvements in modeling can be achieved through semantic additions

Page 6: Text Mining Wksp Auvil

MONK

www.monkproject.org

Page 7: Text Mining Wksp Auvil

Challenges in Humanities Collaboration

• Understanding terminology and text mining capabilities
• Learning their needs
• Creating meaningful ways to display and present results
• Technology innocence
• Bridging different software tools
• Appreciating how long things take to develop
• Working collaboratively as a team

Page 8: Text Mining Wksp Auvil

How To Address these Challenges

• Educate team on data and text mining approaches
• Demonstrate approaches with working examples
• Develop use cases that drive software development
• Create an environment/infrastructure that lets us create data flows that are component based
• Deploy web services for computations
• Develop web application for setting up problem and delivery of results

Page 9: Text Mining Wksp Auvil

SEASR Project Highlights

• SEASR will employ a comprehensive environment that integrates two complementary and revolutionary technical advances, Service Oriented Architecture and Semantic Web, into a single computing architecture: Semantic Enabled Service Oriented Architecture
• SEASR will be enriched with a broad range of knowledge representation and reasoning capabilities
• SEASR addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world

Page 10: Text Mining Wksp Auvil

What does this mean for the Humanities Community?

SEASR will:
• help scholars locate and access documents of interest in the sea of large data stores
• provide scholars with enhanced data synthesis and query analysis
  • from focused data retrieval and data integration
  • to intelligent human-computer interactions for knowledge access
  • to semantic data enrichment
  • to entity and relationship discovery
  • to knowledge discovery and hypothesis generation
• empower collaboration among scholars by enhancing and innovating virtual research environments

Page 11: Text Mining Wksp Auvil

Specific Project Highlights

• Common Services Layer: Provide execution environment and supporting infrastructure that map from the problem solving layer to the resource layer
  • Designed and developed Meandre (semantic, web-driven data flow execution environment)
  • Developed the ability to define extensions for executing components in languages other than Java; extensions have already been created for Python and Common Lisp
• Problem Solving Layer: Visual environments that turn components and web services into a domain-specific problem solving environment
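The idea of a component-based data flow can be illustrated with a minimal sketch. This is not Meandre's API; the names make_flow, split_paragraphs, lowercase and count_words are hypothetical and exist only to show small components being chained so that each one consumes the previous one's output.

```python
from typing import Callable

# Assumption: a component is any callable that takes one value and returns one value.
Component = Callable[[object], object]


def make_flow(*components: Component) -> Component:
    """Chain components so the output of each one feeds the next."""
    def run(value):
        for component in components:
            value = component(value)
        return value
    return run


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def lowercase(paragraphs):
    """Normalize every paragraph to lower case."""
    return [p.lower() for p in paragraphs]


def count_words(paragraphs):
    """Count whitespace-separated words in every paragraph."""
    return [len(p.split()) for p in paragraphs]


# Hypothetical flow: the same components could be rearranged or swapped out.
flow = make_flow(split_paragraphs, lowercase, count_words)
print(flow("One two three.\n\nFour five."))
```

Because every component has the same one-in, one-out shape, a flow can be reassembled without touching the components themselves, which is the property a component-based environment relies on.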

Page 12: Text Mining Wksp Auvil

Workbench

Page 13: Text Mining Wksp Auvil

Community Hub

Page 14: Text Mining Wksp Auvil

Semantically Enabled SOA

Page 15: Text Mining Wksp Auvil

Semantically Enabled SOA 2

Page 16: Text Mining Wksp Auvil

A Problem from the MONK Project

• Analyze the repetition that occurred in “The Making of Americans” by Gertrude Stein

Page 17: Text Mining Wksp Auvil

Repetition in The Making of Americans

Text Source               Making of Americans   Moby Dick   Uncle Tom's Cabin
Total words (tokens)      517,207               220,254     190,906
Unique words (types)      5,329                 17,190      11,730
Average word frequency    97.06                 12.81       16.28
Total pages               ~900                  ~623        ~530
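The average word frequency figures above appear to be total words (tokens) divided by unique words (types). A minimal sketch of that arithmetic follows; the tokenizer in token_stats is an assumption (the slides do not say how words were counted), and both function names are hypothetical.

```python
import re
from collections import Counter


def average_frequency(total_tokens, unique_types):
    """Average number of occurrences per distinct word."""
    return total_tokens / unique_types


def token_stats(text):
    """Return (tokens, types, average frequency) for a text."""
    # Assumption: a word is a run of letters or apostrophes, compared case-insensitively.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0, 0, 0.0
    types = Counter(tokens)
    return len(tokens), len(types), average_frequency(len(tokens), len(types))


# Token and type counts as reported on the slide.
reported = {
    "Making of Americans": (517_207, 5_329),
    "Moby Dick": (220_254, 17_190),
    "Uncle Tom's Cabin": (190_906, 11_730),
}
for title, (tokens, types) in reported.items():
    print(f"{title}: {average_frequency(tokens, types):.2f}")
```

With the reported counts this gives 97.06, 12.81 and 16.28. A word in The Making of Americans is reused, on average, about six to eight times as often as in the other two novels.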

Page 18: Text Mining Wksp Auvil

Visualization Approach from ManyEyes

Many Eyes Website: http://services.alphaworks.ibm.com/manyeyes/view/S4ZIjIsOtha6H~kYwoKjI2~

Page 19: Text Mining Wksp Auvil

Solution… came gradually

• Examine book by comparing each paragraph
• Create feature set based on moving window of n-grams (3-grams) across each paragraph
• Preprocess text
  • To Stem or Not
    – “I will throw the umbrella in the mud”
    – “Martha was throwing the umbrella in the mud”
  • To Keep Punctuation or Not
• Execute the CLOSET algorithm (from Jiawei Han et al.)
  • Providing the following early results:

5:[a description of]:[1085|1087|1084|1082|1086]
4:[men and women]:[1085|1083|1084|1088]
4:[this is now|is now a]:[1087|1082|1086|1088]
3:[a description of|now a description|this is now|is now a]:[1087|1082|1086]
3:[kinds of men|of men and|men and women]:[1085|1083|1088]
3:[this is now|is now a|now a description|a description of]:[1087|1082|1086]

How do we make this meaningful to humanists?
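The steps above can be sketched on a toy scale. This is not the CLOSET algorithm: it finds the same kind of result (closed sets of n-grams shared by several paragraphs) by brute force over groups of paragraphs, which is only feasible for a handful of paragraphs. The tokenizer, the function names and the three sample paragraphs are all assumptions made for illustration; stemming is left out.

```python
import re
from itertools import combinations


def ngrams(paragraph, n=3, keep_punctuation=False):
    """Slide a window of n tokens across one paragraph and return the set of n-grams."""
    # Assumption: lowercase word tokens; punctuation marks become tokens only on request.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, paragraph.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def closed_patterns(paragraphs, min_support=2, n=3):
    """Map each closed set of shared n-grams to the sorted ids of paragraphs containing it.

    Brute force over groups of paragraphs (exponential), so only for toy inputs.
    """
    features = {pid: ngrams(text, n) for pid, text in paragraphs.items()}
    ids = sorted(features)
    closed = {}
    for size in range(min_support, len(ids) + 1):
        for group in combinations(ids, size):
            # Everything a group of paragraphs has in common is a closed set.
            shared = frozenset(set.intersection(*(features[p] for p in group)))
            if shared and shared not in closed:
                closed[shared] = [p for p in ids if shared <= features[p]]
    return closed


def format_pattern(shared, ids):
    """Render one pattern as support:[n-grams]:[paragraph ids]."""
    return f"{len(ids)}:[{'|'.join(sorted(shared))}]:[{'|'.join(map(str, ids))}]"


# Toy paragraphs, made up for illustration (not quoted from the novel).
paragraphs = {
    1: "This is now a description of men and women.",
    2: "This is now a description of kinds of men.",
    3: "Men and women and a description of them.",
}

for shared, ids in sorted(closed_patterns(paragraphs).items(), key=lambda kv: -len(kv[1])):
    print(format_pattern(shared, ids))
```

Each printed line follows the support:[n-grams]:[paragraph ids] layout of the early results. A set is "closed" when no larger set of n-grams occurs in exactly the same paragraphs, which is why overlapping 3-grams from one repeated phrase are reported together.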

Page 20: Text Mining Wksp Auvil

How to visualize… Trying existing tools

Brad Paley, TextArc

M. Wattenberg, Arc Diagrams

TimeSearcher

SpotFire

No context, No reading original text, No scale, No trends…

Page 21: Text Mining Wksp Auvil

Custom Solution - FeatureLens

FeatureLens, an early MONK (Metadata Offer New Knowledge) application, uses the machine learning approach of frequent pattern mining to identify fuzzy repetition patterns in a data collection, with no initial human input.

• Organized into sections (in this case chapters)
• Rank frequent patterns by frequency and length
• Show frequent patterns of n-grams in context
• Rank frequent patterns by distribution trends, per collection and per section
• Compare multiple patterns on the same views: distributions, sections, paragraphs
• Read the text (with highlighting of patterns)
• Some options for handling scale for large data sets (e.g. each line is five paragraphs)
• Search for particular word
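Two of the features listed above, ranking patterns by frequency and length and viewing a pattern's distribution per section, can be sketched in a few lines. This is an illustration, not FeatureLens code: the occurrences data is invented toy data and the function names are hypothetical.

```python
from collections import Counter

# Toy data (invented): each pattern maps to the (section, paragraph) pairs where it occurs.
occurrences = {
    "men and women": [(1, 2), (1, 4), (2, 5), (3, 8)],
    "a description of": [(1, 1), (1, 4), (2, 5), (2, 6), (2, 7)],
    "this is now a": [(1, 1), (2, 5), (2, 6), (3, 8)],
    "kinds of men": [(1, 2), (2, 5), (3, 8)],
}


def rank_patterns(occurrences):
    """Order patterns by number of occurrences, then by length in words (both descending)."""
    return sorted(
        occurrences,
        key=lambda pattern: (len(occurrences[pattern]), len(pattern.split())),
        reverse=True,
    )


def section_distribution(occurrences, pattern):
    """Count how many paragraphs of each section contain the pattern."""
    counts = Counter(section for section, _ in occurrences[pattern])
    return dict(sorted(counts.items()))


for pattern in rank_patterns(occurrences):
    print(pattern, section_distribution(occurrences, pattern))
```

Sorting on the pair (frequency, length) means that among equally frequent patterns the longer one comes first, and the per-section counts are the raw numbers behind a distribution-trend view.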

Page 22: Text Mining Wksp Auvil

FeatureLens: Organized into sections (chapters)

Created by Anthony Don and team at http://www.cs.umd.edu/hcil/textvis/featurelens/.

Page 23: Text Mining Wksp Auvil

FeatureLens: patterns sorted by frequency and length


Page 24: Text Mining Wksp Auvil

FeatureLens: n-gram patterns in context


Page 25: Text Mining Wksp Auvil

FeatureLens: distribution trends


Page 26: Text Mining Wksp Auvil

FeatureLens: multiple patterns


Page 27: Text Mining Wksp Auvil

The New Way to Read

• By visualizing certain patterns in this text (and, it follows, in larger collections in general), by looking at the text “from a distance” through textual analytics and visualizations, one can “read” the novel in ways formerly impossible.
• Franco Moretti has argued that the solution to truly incorporating a more global perspective in our critical literary practices is not to read more of the vast amounts of literature available to us, but to read it differently by employing “distant reading.” “We know how to read texts,” he writes, “now let’s learn how not to read them.”
  Franco Moretti, Conjectures on World Literature. New Left Review, 1 (Jan.-Feb. 2000): 68.

Page 28: Text Mining Wksp Auvil

Stories buried in the repetition…

Page 29: Text Mining Wksp Auvil

Massive Digitization Projects

• What can be done with these large digital text collections?
• How can we use these large digital text collections?
• Justify the use of computers and advanced techniques to process these collections, because we (humans) can’t read this much
• The point is not to save the reader from reading the individual texts or from making an independent judgment of each document's characteristics; rather, the point is to learn from the reader's holistic impression of the text and then, having done so, to show the reader what evidence correlates with these impressions

Page 30: Text Mining Wksp Auvil

Transformational New Research Topics for Humanities

• Track patterns in morphology, syntax, and semantics across large stretches of time, space and culture
• Track topics or terminology across thousands of texts
• Track the social and economic influence of topics
• Study multi-lingual and cultural impacts
• Study literary inheritance
• Study the evolution of ideas
• and a lot more

Page 31: Text Mining Wksp Auvil

Exploratory Analysis Environments

• Provide access to text
• Focus on specific passages
• Allow for comparative reading
• Provide enriched context for text and data analysis

Page 32: Text Mining Wksp Auvil

References

• J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
• Tanya Clement, Anthony Don, Loretta Auvil, Catherine Plaisant, Greg Pape and Vered Goren. ‘Something that is interesting is interesting them’: Using text mining and visualizations to aid interpreting repetition in Gertrude Stein’s The Making of Americans, Digital Humanities 2007.

Page 33: Text Mining Wksp Auvil

Automated Learning Group / SEASR Team

Michael Welge
Bernie Ács
Boris Capitanu
Lily Dong
Peter Groves
Amit Kumar
Xavier Llorà
Chad Olson
Mary Pietrowicz
Duane Searsmith
Kelly Searsmith
David Tcheng