Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections
Danny Dunlavy
Computer Science and Informatics Department (1415), Sandia National Laboratories
July 16, 2008, CSRI Student Seminar Series
SAND2008-4999P


Page 1: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections

Danny Dunlavy
Computer Science and Informatics Department (1415)

Sandia National Laboratories

July 16, 2008
CSRI Student Seminar Series

SAND2008-4999P

Page 2: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Outline

• Introduction
• Motivational Problems
• Data
• Analysis Pipeline
• Transformation, Analysis, and Post-processing
• Hybrid Systems
• Examples
• Conclusions

Page 3: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Introduction

• Knowledge discovery
  – Goal of text analysis
  – Data → information → knowledge

• Challenges
  – Too much information to process manually
  – Data ambiguity
    • Word sense, multilingual, errors, weak signals
  – Heterogeneous data sources
  – Interpretability

• Goals of this talk
  – Exposure to research in text analysis at Sandia
  – Focus on methods based on mathematical principles

Page 4: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 1: Information Retrieval

Problem: ambiguous queries lead to information overload and topic confusion

Solutions: optimization, linear algebra, machine learning, and probabilistic modeling

[Search results for an ambiguous name query: a basketball player ranks first, the mathematician appears at ranks 5, 50, 109, …, and the jazz musician beyond rank 200]

Page 5: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 2: Spam Detection

Sample e-mail with spam-filter headers:

X-TMWD-Spam-Summary: TS=20080714175419; ID=1; SEV=2.3.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.03.0010; ENG=NA; RPDID=7374723D303030312E30413031303230332E34383742393243372E303043312C73733D312C6667733D30; CAT=NONE; CON=NONE; SIG=AAABAMQFAAAAAAAAAAAAAIAIgDkAAAM=
X-TMWD-IP-Reputation: SIP=128.8.128.57; IPRID=303030312E30413039303330322E34383742393243362E30303330; CTCLS=T2; CAT=Unknown
Date: Mon, 14 Jul 2008 13:54:13 -0400
From: Dianne O'Leary <[email protected]>
To: [email protected]
TS=20080714175417; SEV=2.2.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.02.0125; ENG=IBF; RPDID=7374723D303030312E30413031303230332E34383742393243392E303045422C73733D312C6667733D30; CAT=NONE; CON=NONE
X-MMS-Spam-Filter-ID: B2008071415_5.02.0125_4.0-9
X-PMX-Version: 5.4.2.344556, Antispam-Engine: 2.6.0.325393, Antispam-Data: 2008.7.14.174143
X-PerlMx-Spam: Gauge=IIIIIII, Probability=7%, Report='BODY_SIZE_1000_LESS 0, BODY_SIZE_300_399 0, BODY_SIZE_5000_LESS 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __SUBJ_MISSING 0'

Bayesian Statistics
• SpamAssassin
• Cloudmark Authority
• MailSweeper Business Suite

Neural Networks
• SurfControl E-mail Filter
• AntiSpam for SMTP

Signature (Hash) Analysis
• Cloudmark SpamNet
• IM Message Inspector

Lexical Analysis
• Brightmail Anti-Spam
• Tumbleweed Email Firewall

Heuristic Patterns
• McAfee SpamKiller
• Brightmail Anti-Spam

Problem: ambiguity and deceit in term meaning/usage create confusion between spam and good e-mail

Solutions: optimization, linear algebra, machine learning, probabilistic modeling

[S. Ali and Y. Xiang (2007), "Spam Classification Using Adaptive Boosting Algorithm," Proc. ICIS 2007.]

Page 6: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 3: Topic Detection and Association

http://cloud.clusty.com

http://www.kartoo.com

Problem: determine topics in text collections and identify the most important, novel, or significant relationships

Clustering and visualization are key analysis methods

Solutions: optimization, linear algebra, machine learning, probabilistic modeling, and visualization

Page 7: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Data

• Text collection(s)
  – Corpus (corpora)

• Structured
  – Database fielded data

• Semi-structured
  – XML, HTML

• Unstructured
  – Formal
    • Newspaper articles, scientific articles, business reports, …
  – Informal
    • E-mail, chat, code comments, …

• Other characteristics
  – Incomplete, noisy (errors, ambiguity), multilingual

[Diagram: unstructured text processing. Semi-structured data (e-mail, web pages, blogs, etc.) and unstructured data (reports, newswire, etc.) arrive with metadata (raw source index, date collected, source reliability, etc.) and data (e-mail headers: to, from, date, subject, etc.; message body; attachments). Processing produces new metadata (processing tool, parameters used, date processed) and extracted data (named entities, relationships, facts, events) that feed analysis.]

Page 8: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Analysis Pipeline

Ingestion: file readers (ASCII, UTF-8, XML, PDF, …)

Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries

Transformation: data model, dimensionality reduction, feature weighting, feature extraction/selection

Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics

Post-processing: visualization, filtering, summary statistics

Archiving: database, file, web site

Page 9: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Vector Space Model

• Vector Space Model for Text
  – Terms (features): t_1, …, t_m
  – Documents (objects): d_1, …, d_n
  – Term-Document Matrix: A = [a_ij], an m × n matrix
  – a_ij: measure of importance of term i in document j

• Term Examples
  – Sentence: "Danny re-sent $1."
  – Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?]
  – n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, …
  – Named entities (people, orgs, money, etc.): danny, $1

• Document Examples
  – Documents, paragraphs, sentences, fixed-size chunks

[G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM, 18(11), 613–620.]
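As an illustrative sketch (reusing the hurricane/earthquake documents that appear later in the talk), a raw term-count term-document matrix can be built like this; the variable names are mine, not from the slides:

```python
# Build a small term-document matrix for the vector space model:
# a_ij counts how often term i occurs in document j.
docs = [
    "hurricane hurricane catastrophe",    # d1
    "example catastrophe hurricane",      # d2
    "earthquake bad",                     # d3
    "earthquake earthquake catastrophe",  # d4
]
terms = sorted({t for d in docs for t in d.split()})
A = [[d.split().count(t) for d in docs] for t in terms]  # m x n counts
row = dict(zip(terms, A))  # term -> its row of counts across d1..d4
```

Real systems replace the whitespace split with tokenization, stemming, and stopword removal from the pipeline slide.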

Page 10: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Feature Weighting

Term-document matrix scaling
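The scaling formulas on this slide did not survive extraction. One common term-weighting scheme (an assumption here, not necessarily the one presented) is tf-idf, which multiplies each raw count by log(n / df):

```python
import math

# tf-idf weighting of a term-document count matrix (a common choice of
# scaling; the specific scheme on the original slide is unknown).
counts = {                      # term -> counts in d1..d4 (toy data)
    "hurricane":   [2, 1, 0, 0],
    "earthquake":  [0, 0, 1, 2],
    "catastrophe": [1, 1, 0, 1],
}
n_docs = 4

def tfidf(row):
    df = sum(1 for c in row if c > 0)   # document frequency of the term
    idf = math.log(n_docs / df)         # rarer terms get larger weights
    return [c * idf for c in row]

weighted = {t: tfidf(r) for t, r in counts.items()}
```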

Page 11: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Feature Extraction: Dimension Reduction

• Goal: find a new, smaller set of features (dimensions) that best captures variability, correlations, or structure in the data

• Methods
  – Principal component analysis (PCA)
    • Eigenvalue decomposition of the covariance matrix of A
    • Pre-processing: mean of each feature is 0
  – Singular value decomposition of A
  – Local Linear Embedding (LLE)
    • Express points as combinations of neighbors and embed points into a lower-dimensional space (preserving neighbors)
  – Multidimensional scaling (MDS)
    • Preserve pairwise distances in a lower-dimensional space
  – ISOMAP (nonlinear)
    • Extends MDS to use geodesic distances on a weighted graph
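A minimal sketch of the PCA-via-SVD connection from the list above (toy data and names are mine): center each feature, take the SVD, and project onto the leading right singular vectors.

```python
import numpy as np

# PCA via SVD: after centering, the right singular vectors of the data
# matrix are the principal directions, and s^2/(n-1) are the eigenvalues
# of the covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 2] = X[:, 0] * 2.0          # inject correlation (low-rank structure)
Xc = X - X.mean(axis=0)          # pre-processing: mean of each feature is 0
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / (len(X) - 1)        # variance captured by each component
Z = Xc @ Vt[:4].T                # project onto the top-4 components
```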

Page 12: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Analysis Tasks in This Talk

• Information retrieval
  – Goal: find documents most related to a query
  – Challenges: pseudonyms, synonyms, stemming, errors
  – Methods: LSA (later), Boolean search, probabilistic retrieval

• Clustering
  – Goal: find a set of partitions that best separates groups of like objects
  – Challenges: distance metrics, number of clusters, uniqueness
  – Methods: k-means (later), agglomerative, graph-based

• Summarization
  – Goal: find a compact representation of text with the same meaning
  – Challenges: single- vs. multi-document summaries, subjectivity
  – Methods: HMM+QR (later), probabilistic

• Classification
  – Goal: predict labels/categories of data instances (documents)
  – Challenges: data overfitting
  – Methods: HEMLOCK (S. Gilpin, later), decision trees, naïve Bayes, SVM
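As a sketch of the classification task, here is a minimal naïve Bayes text classifier with Laplace smoothing (toy data; this illustrates the generic method from the list above, not HEMLOCK):

```python
import math
from collections import Counter

# Naive Bayes: pick the label maximizing log P(label) + sum log P(word|label).
# Laplace (+1) smoothing avoids zero probabilities for unseen words.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda today", "ham"),
    ("project status meeting", "ham"),
]
vocab = {w for text, _ in train for w in text.split()}
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        logp = math.log(class_counts[label] / len(train))
        for w in text.split():
            logp += math.log((word_counts[label][w] + 1) /
                             (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```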

Page 13: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Other Analysis Tasks

• Machine translation
• Speech recognition
• Cross-language information retrieval
• Word sense disambiguation
  – Determining the sense of ambiguous words from context
• Lexical acquisition
  – Filling in gaps in dictionaries built from text corpora
• Concept drift detection
  – Change in general topics in streaming data
• Association analysis
  – Discovering novel relationships hidden in text

Page 14: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid Systems

• Rules + statistics/probabilities
  – Entity extraction (persons, organizations, locations)
    • Rules: list of common names, capitalization
    • Probabilities: chance a name occurs given a sequence of words

• Any combination of data analytic tools

[Diagram: parser, data modeler, feature extractor, and clustering tool chained into one pipeline; the components are often developed independently]
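A toy sketch of the rules-plus-probabilities idea for person-name extraction; the name list, context cues, scores, and threshold are all made up for illustration:

```python
import re

# Hybrid entity extraction sketch: a rule (capitalization plus a small
# name list) proposes candidates, and a probability-style score from
# nearby context words accepts or rejects them.
KNOWN_FIRST_NAMES = {"Danny", "Dianne", "John"}   # hypothetical name list
CONTEXT_CUES = {"said", "met", "wrote", "professor"}  # hypothetical cues

def extract_person(sentence):
    tokens = sentence.split()
    hits = []
    for i, tok in enumerate(tokens):
        word = re.sub(r"[^\w]", "", tok)
        if not word or not word[0].isupper():
            continue                      # rule: person names capitalized
        score = 0.3 if word in KNOWN_FIRST_NAMES else 0.1
        context = {re.sub(r"[^\w]", "", t).lower()
                   for t in tokens[max(0, i - 2):i + 3]}
        if context & CONTEXT_CUES:
            score += 0.4                  # context raises the estimate
        if score >= 0.6:                  # arbitrary acceptance threshold
            hits.append(word)
    return hits
```

Real extractors replace the hand-set scores with probabilities learned from labeled data.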

Page 15: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid System Development

• Data model
  – Cross-system, cross-platform accessibility
  – Accommodation of multiple data structures

• System
  – Modularized framework (plug-and-play capabilities)
  – Compatible interfaces
  – Multiple user interfaces
    • TITAN: customizable front-ends to analysis pipelines
    • YALE: required parameters vs. complete set of parameters

• Performance, Verification & Validation
  – Tests for independent systems and the overall system
  – Compatible test data and benchmarks
  – Analysis of parameter dependencies across individual systems

Page 16: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid System Example

Query, Cluster, Summarize

Page 17: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Motivation

• Query
  – methods plasma physics

• Retrieval
  – General: Google, 7.8×10^6 of >2.5×10^10 documents
  – Targeted: arXiv, 9,000 of >403,000 documents

• Problems
  – Too much information
  – Redundant information
  – Results: link, title, abstract, snippet (?), etc.
  – Ordering of results (meaning of "best" match?)

Page 18: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Problems to Solve

• QCS (Query, Cluster, Summarize)
  – Unstructured text parsing (common representation)
  – Data fusion (cleaning, assimilating, normalizing)
  – Natural language processing (sentences, POS)
  – Document retrieval (ranking)
  – High-dimensional clustering (data organization)
  – Automatic text summarization (data reduction)
  – Data representation/visualization (multiple perspectives)

Page 19: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Query: Latent Semantic Analysis (LSA)

• SVD: A = U Σ V^T
• Truncated SVD: A_k = U_k Σ_k V_k^T
• Query scores (query as a new "doc"): s = q^T A_k
• LSA Ranking: order documents by the entries of s

[Diagram: the m × n term-document matrix A (terms t_1, …, t_m by documents d_1, …, d_n) factored by the truncated SVD into a terms × concepts matrix, a diagonal matrix of singular values, and a concepts × documents matrix]

[Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41 (6), 391–407.]

Page 20: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSA Example 1

d1: Hurricane. A hurricane is a catastrophe.
d2: An example of a catastrophe is a hurricane.
d3: An earthquake is bad.
d4: Earthquake. An earthquake is a catastrophe.

Term counts (after removing stopwords):

              d1   d2   d3   d4
hurricane      2    1    0    0
earthquake     0    0    1    2
catastrophe    1    1    0    1

Query q: hurricane 1, earthquake 0, catastrophe 0

A (document columns normalized to unit length):

              d1   d2   d3   d4
hurricane    .89  .71    0    0
earthquake     0    0    1  .89
catastrophe  .45  .71    0  .45

A2 (rank-2 approximation of A):

              d1   d2   d3   d4
hurricane    .78  .78  -.11  .11
earthquake  -.03  .02   .96  .92
catastrophe  .59  .60   .15  .30

Scores:
qT A  = (.89, .71, 0, 0)       [normalization only]
qT A2 = (.78, .78, -.11, .11)  [rank-2 approximation captures the link to doc 4]
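The example above can be reproduced numerically; the rank-2 scores recover the latent link between the query and d4 that exact matching misses:

```python
import numpy as np

# LSA Example 1: normalize document columns, truncate the SVD to rank 2,
# and score the query q = "hurricane" against all four documents.
A = np.array([[2., 1., 0., 0.],    # hurricane      (columns d1..d4)
              [0., 0., 1., 2.],    # earthquake
              [1., 1., 0., 1.]])   # catastrophe
A /= np.linalg.norm(A, axis=0)     # unit-norm document columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-2 approximation A2

q = np.array([1., 0., 0.])         # query vector for "hurricane"
exact_scores = q @ A               # no link to d4 (score 0)
lsa_scores = q @ Ak                # d4 now scores above d3
```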

Page 21: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSA Example 2

[Plot: terms (∆) and documents (o) in LSA space. One region groups the terms policy, planning, politics, tomlinson, 1986 with the documents:
o Sport in Society: Policy, Politics and Culture, ed. A. Tomlinson (1990)
o Policy and Politics in Sport, PE and Leisure, eds. S. Fleming, M. Talbot and A. Tomlinson (1995)
o Policy and Planning (II), ed. J. Wilkinson (1986)
o Policy and Planning (I), ed. J. Wilkinson (1986)
o Leisure: Politics, Planning and People, ed. A. Tomlinson (1985)
Another region groups the terms parker, lifestyles, 1989, part with:
o Work, Leisure and Lifestyles (Part 2), ed. S. R. Parker (1989)
o Work, Leisure and Lifestyles (Part 1), ed. S. R. Parker (1989)]

[Leisure Studies of America Data: 97 documents, 335 terms]

Page 22: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Cluster: Generalized Spherical K-Means (gmeans)

• The Players
  – Documents: unit vectors x_1, …, x_n
  – Partition/Disjoint Sets: π_1, …, π_k
  – Concept vectors (centroids): c_1, …, c_k

• The Game
  – Maximize Σ_j Σ_{x ∈ π_j} x^T c_j

• The Rules
  – Adaptive, but bounded k
  – Similarity estimation
  – First variation (stochastic perturbation)

[Dhillon, I. S., et al. (2002). Iterative clustering of high dimensional text data augmented by local search. Proc. IEEE ICDM.]
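A minimal spherical k-means sketch on synthetic unit vectors (the data and initialization are illustrative, and this omits gmeans's adaptive k and first-variation moves):

```python
import numpy as np

# Spherical k-means: documents are unit vectors, each concept vector c_j
# is the normalized mean of its cluster, and the objective is the total
# cosine similarity sum_j sum_{x in pi_j} x . c_j.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm documents
k = 3
C = X[rng.choice(len(X), k, replace=False)].copy()  # initial concepts

for _ in range(20):
    labels = (X @ C.T).argmax(axis=1)       # assign to closest concept
    for j in range(k):
        members = X[labels == j]
        if len(members):
            c = members.sum(axis=0)
            C[j] = c / np.linalg.norm(c)    # renormalized centroid

labels = (X @ C.T).argmax(axis=1)
objective = sum(X[i] @ C[labels[i]] for i in range(len(X)))
```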

Page 23: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Summarize: Hidden Markov Model + Pivoted QR

• Single-Document Summarization
  – Mark summary sentences in training documents
  – Build a probabilistic model

• Markov chain observations
  – log(# subject terms + 1)
    • terms showing up in titles, topics, subject descriptions, etc.
  – log(# topic terms + 1)
    • terms above a threshold using a mutual information statistic

• Hidden Markov Model (HMM)
  – Hidden states: {summary, non-summary}
  – Score sentences in each document
    • Probabilities of a sentence being a summary sentence

[Conroy, J. M., et al. (2001). Text summarization via hidden Markov models and pivoted QR matrix decomposition.]

Page 24: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Summarize: Hidden Markov Model + Pivoted QR

• Multi-document Summarization
  – Goal: generate w-word summaries
  – Use HMM scores to select candidate sentences (~2w words)
  – Terms as sentence features
    • Terms: t_1, …, t_m
    • Sentences: columns of a term-sentence matrix
    • Scaling: each column is weighted by the HMM score of its sentence

• Pivoted QR
  – Choose the column with maximum norm
  – Subtract components along that column from the remaining columns
  – Stop: chosen sentences (columns) total ~w words
  – Removes semantic redundancy
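The pivoted-QR selection step can be sketched as a greedy loop (sizes, HMM scores, and the three-sentence stopping count here are illustrative; the real stop rule counts words):

```python
import numpy as np

# Pivoted-QR-style sentence selection: repeatedly pick the sentence
# (column) with the largest residual norm, then subtract its component
# from the remaining columns so later picks avoid redundant content.
rng = np.random.default_rng(2)
B = np.abs(rng.normal(size=(12, 6)))       # term-by-sentence features
B /= np.linalg.norm(B, axis=0)             # unit columns
B *= np.array([3., 2.9, 1., 2., .5, .4])   # scale columns by HMM scores

R = B.copy()
chosen = []
for _ in range(3):                         # stop rule: pick 3 sentences
    norms = np.linalg.norm(R, axis=0)
    j = int(norms.argmax())                # column with maximum norm
    chosen.append(j)
    qv = R[:, j] / norms[j]
    R -= np.outer(qv, qv @ R)              # remove that direction
```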

Page 25: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections
Page 26: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

QCS: Evaluation

• Document Understanding Conference (DUC)
  – Automatic evaluation of summarizers (ROUGE)
    • Measures how well a system agrees with human summaries
  – Human, QCS, and S-only summaries
  – QCS finds subtopics and outliers

[Figure: ROUGE-2 score vs. summarizers (Humans, QCS, S), for Cluster 1 and Cluster 2]

Page 27: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

QCS: Evaluation

• Document Understanding Conference (DUC)
  – Scoring as a function of QCS cluster size (k)
  – QCS and S-only (---) summaries
  – Best results for different clusters use different k

[Figure: ROUGE-2 scores vs. number of clusters, for Cluster 1 and Cluster 2]

Page 28: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Benefits of QCS

• Dynamic data organization and compression– Subset of documents relevant to a query– Topic clusters, single summary per cluster

• Multiple perspectives (analyses)– Relevance ranking, topic clusters, summaries

• Efficient use of computation– Parsing, term counts, natural language processing, etc.

Page 29: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Other Examples

Page 30: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

ParaText™: Scalable Text Analysis

• ParaText™ Lite
  – Serial client-server text analysis
  – Parser, vector space model, SVD, data similarities/graph creation
  – Built on vtkTextEngine (Titan)
  – Works with ~10K–100K documents

• ParaText™
  – End-to-end scalable text analysis
  – Challenge 1: Parsing [parallel string hashing, hierarchical agglomeration]
  – Challenge 2: Text modeling [initial Trilinos/Titan integration complete]
  – Challenge 3: Load balancing [initial: documents; goal: Isorropia/Zoltan]

• Impact
  – Available in ThreatView 1.2.0+ directly or through the ParaText™ server
  – Plans to interface to LSAView, OverView (1424), Sisyphus (1422)

Page 31: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

ParaText™ Server (PTS)

[Diagram: a ParaText™ client talks XML/HTTP to a master ParaText™ server, which drives PTS instances running a parallel pipeline on an HPC resource (cluster, multicore server, etc.). Each process P0 … Pk runs a Reader → Parser → Matrix → SVD chain; an artifact DB and a matrices DB (1 or 2 DB servers) hold inputs and results.]

Page 32: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSAView: Algorithm Analysis/Development

• LSAView
  – Analysis and exploration of the impact of informatics algorithms on end-user visual analysis of data
  – Aids in the discovery process of optimal algorithm parameters for given data and tasks

• Features
  – Side-by-side comparison of visualizations for two sets of parameters
  – Small-multiple view for analyzing 2+ parameter sets simultaneously
  – Linked document, graph, matrix, and tree data views
  – Interactive, zoomable, hierarchical matrix and matrix-difference views
  – Statistical inference tests used to highlight novel parameter impact

• Impact
  – Used in developing and understanding ParaText™ and LSALIB algorithms

Page 33: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSAView Impact

[Figure: four document-similarity plots as a function of k (x-axis 20–80, y-axis 0–2), one per scaling]

• Document similarities
• Inner product view
• Scaled inner product view

What is the best scaling for document similarity graph generation? The four plots compare: original scaling, no scaling, inverse sqrt, inverse.

[Leisure Studies of America Data: 97 documents, 335 terms]
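The formulas on this slide did not survive extraction. One plausible reading (an assumption on my part) is that the four scalings correspond to powers of the singular values applied to the document coordinates V_k before taking inner products:

```python
import numpy as np

# Document-similarity scalings after a truncated SVD A_k = U_k S_k V_k^T:
# rows of V_k scaled by S_k^p give document coordinates, and the power p
# is the knob the four plots could be varying (p = 1, 0, -1/2, -1).
rng = np.random.default_rng(3)
A = np.abs(rng.normal(size=(20, 8)))       # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3

def doc_similarities(p):
    D = Vt[:k].T * s[:k] ** p              # documents in scaled LSA space
    return D @ D.T                         # pairwise inner products

sims = {p: doc_similarities(p) for p in (1.0, 0.0, -0.5, -1.0)}
```

With p = 1 this reduces to the inner products of the rank-k approximation itself, A_k^T A_k.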

Page 34: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

E-Mail Classification

• LSN Assistant / Sandia Categorization Framework
  – Yucca Mountain: categorize e-mail (Relevant, Federal Record, Privileged)
  – Machine learning library and GUI for document categorization & review
  – For review of existing categorizations, recommendations for new documents
  – Balanced learning
    • Skewed class distributions

• Importance
  – Solved an important, real problem
    • ~400K e-mails incorrectly categorized
  – Foundation for LSN Online Assistant
    • Real-time system for recommendations

• Impact
  – Dong Kim, lead of DOE/OCRWM LSN certification, is "very impressed with the LSN Assistant Tool and the approach to doing the review."
  – Factor of 3 speedup over manual categorization review only

Page 35: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Conclusions

• Text analysis relies heavily upon mathematics
  – Linear algebra, optimization, machine learning, probability theory, statistics, graph theory

• Hybrid system development is a challenge
  – More than just gluing pieces together

• Large-scale analysis is important
  – Storing and processing large amounts of data
  – Scaling algorithms up
  – Developing new algorithms for large data

• Useful across many application domains

Page 36: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Collaborations

• QCS– Dianne O’Leary (Maryland), John Conroy & Judith Schlesinger (IDA/CCS)

• LSALIB– Tammy Kolda (8962)

• ParaText™– Tim Shead & Pat Crossno (1424)

• LSAView– Pat Crossno (1424)

• Sandia Categorization Framework– Justin Basilico (6341) and Steve Verzi (6343)

• HEMLOCK– Sean Gilpin (1415)

Page 37: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Thank You

Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections

Danny Dunlavy

[email protected]

http://www.cs.sandia.gov/~dmdunla