Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections
Danny Dunlavy
Computer Science and Informatics Department (1415), Sandia National Laboratories
July 16, 2008, CSRI Student Seminar Series
SAND2008-4999P


Page 1: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections

Danny Dunlavy
Computer Science and Informatics Department (1415)

Sandia National Laboratories

July 16, 2008
CSRI Student Seminar Series

SAND2008-4999P

Page 2: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Outline

• Introduction
• Motivational Problems
• Data
• Analysis Pipeline
• Transformation, Analysis, and Post-processing
• Hybrid Systems
• Examples
• Conclusions

Page 3: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Introduction

• Knowledge discovery
  – Goal of text analysis
  – Data → information → knowledge

• Challenges
  – Too much information to process manually
  – Data ambiguity
    • Word sense, multilingual, errors, weak signals
  – Heterogeneous data sources
  – Interpretability

• Goals of this talk
  – Exposure to research in text analysis at Sandia
  – Focus on methods based on mathematical principles

Page 4: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 1: Information Retrieval

Problem: ambiguous queries lead to information overload and topic confusion

Solutions: optimization, linear algebra, machine learning, and probabilistic modeling

[Search results for an ambiguous name query: a basketball player ranks first, the mathematician appears at ranks 5, 50, 109, …, and the jazz musician beyond rank 200]

Page 5: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 2: Spam Detection

Sample e-mail with spam-filter headers:

X-TMWD-Spam-Summary: TS=20080714175419; ID=1; SEV=2.3.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.03.0010; ENG=NA; RPDID=7374723D303030312E30413031303230332E34383742393243372E303043312C73733D312C6667733D30; CAT=NONE; CON=NONE; SIG=AAABAMQFAAAAAAAAAAAAAIAIgDkAAAM=
X-TMWD-IP-Reputation: SIP=128.8.128.57; IPRID=303030312E30413039303330322E34383742393243362E30303330; CTCLS=T2; CAT=Unknown
Date: Mon, 14 Jul 2008 13:54:13 -0400
From: Dianne O'Leary <[email protected]>
To: [email protected]
TS=20080714175417; SEV=2.2.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.02.0125; ENG=IBF; RPDID=7374723D303030312E30413031303230332E34383742393243392E303045422C73733D312C6667733D30; CAT=NONE; CON=NONE
X-MMS-Spam-Filter-ID: B2008071415_5.02.0125_4.0-9
X-PMX-Version: 5.4.2.344556, Antispam-Engine: 2.6.0.325393, Antispam-Data: 2008.7.14.174143
X-PerlMx-Spam: Gauge=IIIIIII, Probability=7%, Report='BODY_SIZE_1000_LESS 0, BODY_SIZE_300_399 0, BODY_SIZE_5000_LESS 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __SUBJ_MISSING 0'

Bayesian Statistics
• SpamAssassin
• Cloudmark Authority
• MailSweeper Business Suite

Neural Networks
• SurfControl E-mail Filter
• AntiSpam for SMTP

Signature (Hash) Analysis
• Cloudmark SpamNet
• IM Message Inspector

Lexical Analysis
• Brightmail Anti-Spam
• Tumbleweed Email Firewall

Heuristic Patterns
• McAfee SpamKiller
• Brightmail Anti-Spam

Problem: ambiguity and deceit in term meaning/usage create confusion between spam and good e-mail

Solutions: optimization, linear algebra, machine learning, probabilistic modeling

[S. Ali and Y. Xiang (2007), "Spam Classification Using Adaptive Boosting Algorithm," Proc. ICIS 2007.]

Page 6: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Example 3: Topic Detection and Association

http://cloud.clusty.com

http://www.kartoo.com

Problem: determine topics in text collections and identify the most important, novel, or significant relationships

Clustering and visualization are key analysis methods

Solutions: optimization, linear algebra, machine learning, probabilistic modeling, and visualization

Page 7: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Data

• Text collection(s)
  – Corpus (corpora)

• Structured
  – Database fielded data

• Semi-structured
  – XML, HTML

• Unstructured
  – Formal
    • Newspaper articles, scientific articles, business reports, …
  – Informal
    • E-mail, chat, code comments, …

• Other characteristics
  – Incomplete, noisy (errors, ambiguity), multilingual

[Diagram: unstructured text processing. Semi-structured data (e-mail, web pages, blogs, etc.) and unstructured data (reports, newswire, etc.) arrive with metadata (raw source index, date collected, source reliability, etc.) and data (e-mail headers: to, from, date, subject, etc.; message body; attachments). Processing produces new metadata (processing tool, parameters used, date processed) and extracted data (named entities, relationships, facts, events) that feed analysis.]

Page 8: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Text Analysis Pipeline

Ingestion: file readers (ASCII, UTF-8, XML, PDF, …)

Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries

Transformation: data model, dimensionality reduction, feature weighting, feature extraction/selection

Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics

Post-processing: visualization, filtering, summary statistics

Archiving: database, file, web site

Page 9: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Vector Space Model

• Vector Space Model for Text
  – Terms (features): t_1, …, t_m
  – Documents (objects): d_1, …, d_n
  – Term-Document Matrix: A = [a_ij], an m × n matrix
  – a_ij: measure of importance of term i in document j

• Term Examples
  – Sentence: "Danny re-sent $1."
  – Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?]
  – n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, …
  – Named entities (people, orgs, money, etc.): danny, $1

• Document Examples
  – Documents, paragraphs, sentences, fixed-size chunks

[G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM, 18(11), 613–620.]
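As an illustrative sketch (reusing the hurricane/earthquake documents that appear later in the talk), a raw term-count term-document matrix can be built like this; the variable names are mine, not from the slides:

```python
# Build a small term-document matrix for the vector space model:
# a_ij counts how often term i occurs in document j.
docs = [
    "hurricane hurricane catastrophe",    # d1
    "example catastrophe hurricane",      # d2
    "earthquake bad",                     # d3
    "earthquake earthquake catastrophe",  # d4
]
terms = sorted({t for d in docs for t in d.split()})
A = [[d.split().count(t) for d in docs] for t in terms]  # m x n counts
row = dict(zip(terms, A))  # term -> its row of counts across d1..d4
```

Real systems replace the whitespace split with tokenization, stemming, and stopword removal from the pipeline slide.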

Page 10: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Feature Weighting

Term-document matrix scaling
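The scaling formulas on this slide did not survive extraction. One common term-weighting scheme (an assumption here, not necessarily the one presented) is tf-idf, which multiplies each raw count by log(n / df):

```python
import math

# tf-idf weighting of a term-document count matrix (a common choice of
# scaling; the specific scheme on the original slide is unknown).
counts = {                      # term -> counts in d1..d4 (toy data)
    "hurricane":   [2, 1, 0, 0],
    "earthquake":  [0, 0, 1, 2],
    "catastrophe": [1, 1, 0, 1],
}
n_docs = 4

def tfidf(row):
    df = sum(1 for c in row if c > 0)   # document frequency of the term
    idf = math.log(n_docs / df)         # rarer terms get larger weights
    return [c * idf for c in row]

weighted = {t: tfidf(r) for t, r in counts.items()}
```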

Page 11: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Feature Extraction: Dimension Reduction

• Goal: find a new, smaller set of features (dimensions) that best captures variability, correlations, or structure in the data

• Methods
  – Principal component analysis (PCA)
    • Eigenvalue decomposition of the covariance matrix of A
    • Pre-processing: mean of each feature is 0
  – Singular value decomposition of A
  – Local Linear Embedding (LLE)
    • Express points as combinations of neighbors and embed points into a lower-dimensional space (preserving neighbors)
  – Multidimensional scaling (MDS)
    • Preserve pairwise distances in a lower-dimensional space
  – ISOMAP (nonlinear)
    • Extends MDS to use geodesic distances on a weighted graph
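A minimal sketch of the PCA-via-SVD connection from the list above (toy data and names are mine): center each feature, take the SVD, and project onto the leading right singular vectors.

```python
import numpy as np

# PCA via SVD: after centering, the right singular vectors of the data
# matrix are the principal directions, and s^2/(n-1) are the eigenvalues
# of the covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 2] = X[:, 0] * 2.0          # inject correlation (low-rank structure)
Xc = X - X.mean(axis=0)          # pre-processing: mean of each feature is 0
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / (len(X) - 1)        # variance captured by each component
Z = Xc @ Vt[:4].T                # project onto the top-4 components
```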

Page 12: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Analysis Tasks in This Talk

• Information retrieval
  – Goal: find documents most related to a query
  – Challenges: pseudonyms, synonyms, stemming, errors
  – Methods: LSA (later), Boolean search, probabilistic retrieval

• Clustering
  – Goal: find a set of partitions that best separates groups of like objects
  – Challenges: distance metrics, number of clusters, uniqueness
  – Methods: k-means (later), agglomerative, graph-based

• Summarization
  – Goal: find a compact representation of text with the same meaning
  – Challenges: single- vs. multi-document summaries, subjectivity
  – Methods: HMM+QR (later), probabilistic

• Classification
  – Goal: predict labels/categories of data instances (documents)
  – Challenges: data overfitting
  – Methods: HEMLOCK (S. Gilpin, later), decision trees, naïve Bayes, SVM
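As a sketch of the classification task, here is a minimal naïve Bayes text classifier with Laplace smoothing (toy data; this illustrates the generic method from the list above, not HEMLOCK):

```python
import math
from collections import Counter

# Naive Bayes: pick the label maximizing log P(label) + sum log P(word|label).
# Laplace (+1) smoothing avoids zero probabilities for unseen words.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda today", "ham"),
    ("project status meeting", "ham"),
]
vocab = {w for text, _ in train for w in text.split()}
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        logp = math.log(class_counts[label] / len(train))
        for w in text.split():
            logp += math.log((word_counts[label][w] + 1) /
                             (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```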

Page 13: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Other Analysis Tasks

• Machine translation
• Speech recognition
• Cross-language information retrieval
• Word sense disambiguation
  – Determining the sense of ambiguous words from context
• Lexical acquisition
  – Filling in gaps in dictionaries built from text corpora
• Concept drift detection
  – Change in general topics in streaming data
• Association analysis
  – Discovering novel relationships hidden in text

Page 14: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid Systems

• Rules + statistics/probabilities
  – Entity extraction (persons, organizations, locations)
    • Rules: list of common names, capitalization
    • Probabilities: chance a name occurs given a sequence of words

• Any combination of data analytic tools

[Diagram: parser, data modeler, feature extractor, and clustering tool chained into one pipeline; the components are often developed independently]
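A toy sketch of the rules-plus-probabilities idea for person-name extraction; the name list, context cues, scores, and threshold are all made up for illustration:

```python
import re

# Hybrid entity extraction sketch: a rule (capitalization plus a small
# name list) proposes candidates, and a probability-style score from
# nearby context words accepts or rejects them.
KNOWN_FIRST_NAMES = {"Danny", "Dianne", "John"}   # hypothetical name list
CONTEXT_CUES = {"said", "met", "wrote", "professor"}  # hypothetical cues

def extract_person(sentence):
    tokens = sentence.split()
    hits = []
    for i, tok in enumerate(tokens):
        word = re.sub(r"[^\w]", "", tok)
        if not word or not word[0].isupper():
            continue                      # rule: person names capitalized
        score = 0.3 if word in KNOWN_FIRST_NAMES else 0.1
        context = {re.sub(r"[^\w]", "", t).lower()
                   for t in tokens[max(0, i - 2):i + 3]}
        if context & CONTEXT_CUES:
            score += 0.4                  # context raises the estimate
        if score >= 0.6:                  # arbitrary acceptance threshold
            hits.append(word)
    return hits
```

Real extractors replace the hand-set scores with probabilities learned from labeled data.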

Page 15: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid System Development

• Data model
  – Cross-system, cross-platform accessibility
  – Accommodation of multiple data structures

• System
  – Modularized framework (plug-and-play capabilities)
  – Compatible interfaces
  – Multiple user interfaces
    • TITAN: customizable front-ends to analysis pipelines
    • YALE: required parameters vs. complete set of parameters

• Performance, Verification & Validation
  – Tests for independent systems and the overall system
  – Compatible test data and benchmarks
  – Analysis of parameter dependencies across individual systems

Page 16: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Hybrid System Example

Query, Cluster, Summarize

Page 17: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Motivation

• Query
  – methods plasma physics

• Retrieval
  – General: Google, 7.8×10^6 of >2.5×10^10 documents
  – Targeted: arXiv, 9,000 of >403,000 documents

• Problems
  – Too much information
  – Redundant information
  – Results: link, title, abstract, snippet (?), etc.
  – Ordering of results (meaning of "best" match?)

Page 18: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Problems to Solve

• QCS (Query, Cluster, Summarize)
  – Unstructured text parsing (common representation)
  – Data fusion (cleaning, assimilating, normalizing)
  – Natural language processing (sentences, POS)
  – Document retrieval (ranking)
  – High-dimensional clustering (data organization)
  – Automatic text summarization (data reduction)
  – Data representation/visualization (multiple perspectives)

Page 19: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Query: Latent Semantic Analysis (LSA)

• SVD: A = U Σ V^T
• Truncated SVD: A_k = U_k Σ_k V_k^T
• Query scores (query as a new "doc"): s = q^T A_k
• LSA Ranking: order documents by the entries of s

[Diagram: the m × n term-document matrix A (terms t_1, …, t_m by documents d_1, …, d_n) factored by the truncated SVD into a terms × concepts matrix, a diagonal matrix of singular values, and a concepts × documents matrix]

[Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41 (6), 391–407.]

Page 20: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSA Example 1

d1: Hurricane. A hurricane is a catastrophe.
d2: An example of a catastrophe is a hurricane.
d3: An earthquake is bad.
d4: Earthquake. An earthquake is a catastrophe.

Term counts (after removing stopwords):

              d1   d2   d3   d4
hurricane      2    1    0    0
earthquake     0    0    1    2
catastrophe    1    1    0    1

Query q: hurricane 1, earthquake 0, catastrophe 0

A (document columns normalized to unit length):

              d1   d2   d3   d4
hurricane    .89  .71    0    0
earthquake     0    0    1  .89
catastrophe  .45  .71    0  .45

A2 (rank-2 approximation of A):

              d1   d2   d3   d4
hurricane    .78  .78  -.11  .11
earthquake  -.03  .02   .96  .92
catastrophe  .59  .60   .15  .30

Scores:
qT A  = (.89, .71, 0, 0)       [normalization only]
qT A2 = (.78, .78, -.11, .11)  [rank-2 approximation captures the link to doc 4]
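The example above can be reproduced numerically; the rank-2 scores recover the latent link between the query and d4 that exact matching misses:

```python
import numpy as np

# LSA Example 1: normalize document columns, truncate the SVD to rank 2,
# and score the query q = "hurricane" against all four documents.
A = np.array([[2., 1., 0., 0.],    # hurricane      (columns d1..d4)
              [0., 0., 1., 2.],    # earthquake
              [1., 1., 0., 1.]])   # catastrophe
A /= np.linalg.norm(A, axis=0)     # unit-norm document columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-2 approximation A2

q = np.array([1., 0., 0.])         # query vector for "hurricane"
exact_scores = q @ A               # no link to d4 (score 0)
lsa_scores = q @ Ak                # d4 now scores above d3
```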

Page 21: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSA Example 2

[Plot: terms (∆) and documents (o) in LSA space. One region groups the terms policy, planning, politics, tomlinson, 1986 with the documents:
o Sport in Society: Policy, Politics and Culture, ed. A. Tomlinson (1990)
o Policy and Politics in Sport, PE and Leisure, eds. S. Fleming, M. Talbot and A. Tomlinson (1995)
o Policy and Planning (II), ed. J. Wilkinson (1986)
o Policy and Planning (I), ed. J. Wilkinson (1986)
o Leisure: Politics, Planning and People, ed. A. Tomlinson (1985)
Another region groups the terms parker, lifestyles, 1989, part with:
o Work, Leisure and Lifestyles (Part 2), ed. S. R. Parker (1989)
o Work, Leisure and Lifestyles (Part 1), ed. S. R. Parker (1989)]

[Leisure Studies of America Data: 97 documents, 335 terms]

Page 22: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Cluster: Generalized Spherical K-Means (gmeans)

• The Players
  – Documents: unit vectors x_1, …, x_n
  – Partition/Disjoint Sets: π_1, …, π_k
  – Concept vectors (centroids): c_1, …, c_k

• The Game
  – Maximize Σ_j Σ_{x ∈ π_j} x^T c_j

• The Rules
  – Adaptive, but bounded k
  – Similarity estimation
  – First variation (stochastic perturbation)

[Dhillon, I. S., et al. (2002). Iterative clustering of high dimensional text data augmented by local search. Proc. IEEE ICDM.]
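A minimal spherical k-means sketch on synthetic unit vectors (the data and initialization are illustrative, and this omits gmeans's adaptive k and first-variation moves):

```python
import numpy as np

# Spherical k-means: documents are unit vectors, each concept vector c_j
# is the normalized mean of its cluster, and the objective is the total
# cosine similarity sum_j sum_{x in pi_j} x . c_j.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm documents
k = 3
C = X[rng.choice(len(X), k, replace=False)].copy()  # initial concepts

for _ in range(20):
    labels = (X @ C.T).argmax(axis=1)       # assign to closest concept
    for j in range(k):
        members = X[labels == j]
        if len(members):
            c = members.sum(axis=0)
            C[j] = c / np.linalg.norm(c)    # renormalized centroid

labels = (X @ C.T).argmax(axis=1)
objective = sum(X[i] @ C[labels[i]] for i in range(len(X)))
```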

Page 23: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Summarize: Hidden Markov Model + Pivoted QR

• Single-Document Summarization
  – Mark summary sentences in training documents
  – Build a probabilistic model

• Markov chain observations
  – log(# subject terms + 1)
    • terms showing up in titles, topics, subject descriptions, etc.
  – log(# topic terms + 1)
    • terms above a threshold using a mutual information statistic

• Hidden Markov Model (HMM)
  – Hidden states: {summary, non-summary}
  – Score sentences in each document
    • Probabilities of a sentence being a summary sentence

[Conroy, J. M., et al. (2001). Text summarization via hidden Markov models and pivoted QR matrix decomposition.]

Page 24: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Summarize: Hidden Markov Model + Pivoted QR

• Multi-document Summarization
  – Goal: generate w-word summaries
  – Use HMM scores to select candidate sentences (~2w words)
  – Terms as sentence features
    • Terms: t_1, …, t_m
    • Sentences: columns of a term-sentence matrix
    • Scaling: each column is weighted by the HMM score of its sentence

• Pivoted QR
  – Choose the column with maximum norm
  – Subtract components along that column from the remaining columns
  – Stop: chosen sentences (columns) total ~w words
  – Removes semantic redundancy
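The pivoted-QR selection step can be sketched as a greedy loop (sizes, HMM scores, and the three-sentence stopping count here are illustrative; the real stop rule counts words):

```python
import numpy as np

# Pivoted-QR-style sentence selection: repeatedly pick the sentence
# (column) with the largest residual norm, then subtract its component
# from the remaining columns so later picks avoid redundant content.
rng = np.random.default_rng(2)
B = np.abs(rng.normal(size=(12, 6)))       # term-by-sentence features
B /= np.linalg.norm(B, axis=0)             # unit columns
B *= np.array([3., 2.9, 1., 2., .5, .4])   # scale columns by HMM scores

R = B.copy()
chosen = []
for _ in range(3):                         # stop rule: pick 3 sentences
    norms = np.linalg.norm(R, axis=0)
    j = int(norms.argmax())                # column with maximum norm
    chosen.append(j)
    qv = R[:, j] / norms[j]
    R -= np.outer(qv, qv @ R)              # remove that direction
```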

Page 25: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections
Page 26: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

QCS: Evaluation

• Document Understanding Conference (DUC)
  – Automatic evaluation of summarizers (ROUGE)
    • Measures how well a system agrees with human summaries
  – Human, QCS, and S-only summaries
  – QCS finds subtopics and outliers

[Figure: ROUGE-2 score vs. summarizers (Humans, QCS, S), for Cluster 1 and Cluster 2]

Page 27: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

QCS: Evaluation

• Document Understanding Conference (DUC)
  – Scoring as a function of QCS cluster size (k)
  – QCS and S-only (---) summaries
  – Best results for different clusters use different k

[Figure: ROUGE-2 scores vs. number of clusters, for Cluster 1 and Cluster 2]

Page 28: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Benefits of QCS

• Dynamic data organization and compression– Subset of documents relevant to a query– Topic clusters, single summary per cluster

• Multiple perspectives (analyses)– Relevance ranking, topic clusters, summaries

• Efficient use of computation– Parsing, term counts, natural language processing, etc.

Page 29: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Other Examples

Page 30: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

ParaText™: Scalable Text Analysis

• ParaText™ Lite
  – Serial client-server text analysis
  – Parser, vector space model, SVD, data similarities/graph creation
  – Built on vtkTextEngine (Titan)
  – Works with ~10K–100K documents

• ParaText™
  – End-to-end scalable text analysis
  – Challenge 1: Parsing [parallel string hashing, hierarchical agglomeration]
  – Challenge 2: Text modeling [initial Trilinos/Titan integration complete]
  – Challenge 3: Load balancing [initial: documents; goal: Isorropia/Zoltan]

• Impact
  – Available in ThreatView 1.2.0+ directly or through the ParaText™ server
  – Plans to interface to LSAView, OverView (1424), Sisyphus (1422)

Page 31: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

ParaText™ Server (PTS)

[Diagram: a ParaText™ client talks XML/HTTP to a master ParaText™ server, which drives PTS instances running a parallel pipeline on an HPC resource (cluster, multicore server, etc.). Each process P0 … Pk runs a Reader → Parser → Matrix → SVD chain; an artifact DB and a matrices DB (1 or 2 DB servers) hold inputs and results.]

Page 32: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSAView: Algorithm Analysis/Development

• LSAView
  – Analysis and exploration of the impact of informatics algorithms on end-user visual analysis of data
  – Aids in the discovery process of optimal algorithm parameters for given data and tasks

• Features
  – Side-by-side comparison of visualizations for two sets of parameters
  – Small-multiple view for analyzing 2+ parameter sets simultaneously
  – Linked document, graph, matrix, and tree data views
  – Interactive, zoomable, hierarchical matrix and matrix-difference views
  – Statistical inference tests used to highlight novel parameter impact

• Impact
  – Used in developing and understanding ParaText™ and LSALIB algorithms

Page 33: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

LSAView Impact

[Figure: four document-similarity plots as a function of k (x-axis 20–80, y-axis 0–2), one per scaling]

• Document similarities
• Inner product view
• Scaled inner product view

What is the best scaling for document similarity graph generation? The four plots compare: original scaling, no scaling, inverse sqrt, inverse.

[Leisure Studies of America Data: 97 documents, 335 terms]
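The formulas on this slide did not survive extraction. One plausible reading (an assumption on my part) is that the four scalings correspond to powers of the singular values applied to the document coordinates V_k before taking inner products:

```python
import numpy as np

# Document-similarity scalings after a truncated SVD A_k = U_k S_k V_k^T:
# rows of V_k scaled by S_k^p give document coordinates, and the power p
# is the knob the four plots could be varying (p = 1, 0, -1/2, -1).
rng = np.random.default_rng(3)
A = np.abs(rng.normal(size=(20, 8)))       # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3

def doc_similarities(p):
    D = Vt[:k].T * s[:k] ** p              # documents in scaled LSA space
    return D @ D.T                         # pairwise inner products

sims = {p: doc_similarities(p) for p in (1.0, 0.0, -0.5, -1.0)}
```

With p = 1 this reduces to the inner products of the rank-k approximation itself, A_k^T A_k.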

Page 34: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

E-Mail Classification

• LSN Assistant / Sandia Categorization Framework
  – Yucca Mountain: categorize e-mail (Relevant, Federal Record, Privileged)
  – Machine learning library and GUI for document categorization & review
  – For review of existing categorizations, recommendations for new documents
  – Balanced learning
    • Skewed class distributions

• Importance
  – Solved an important, real problem
    • ~400K e-mails incorrectly categorized
  – Foundation for LSN Online Assistant
    • Real-time system for recommendations

• Impact
  – Dong Kim, lead of DOE/OCRWM LSN certification, is "very impressed with the LSN Assistant Tool and the approach to doing the review."
  – Factor of 3 speedup over manual categorization review only

Page 35: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Conclusions

• Text analysis relies heavily upon mathematics
  – Linear algebra, optimization, machine learning, probability theory, statistics, graph theory

• Hybrid system development is a challenge
  – More than just gluing pieces together

• Large-scale analysis is important
  – Storing and processing large amounts of data
  – Scaling algorithms up
  – Developing new algorithms for large data

• Useful across many application domains

Page 36: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Collaborations

• QCS– Dianne O’Leary (Maryland), John Conroy & Judith Schlesinger (IDA/CCS)

• LSALIB– Tammy Kolda (8962)

• ParaText™– Tim Shead & Pat Crossno (1424)

• LSAView– Pat Crossno (1424)

• Sandia Categorization Framework– Justin Basilico (6341) and Steve Verzi (6343)

• HEMLOCK– Sean Gilpin (1415)

Page 37: Text Analysis: Methods for Searching, Organizing,  Labeling and Summarizing  Document Collections

Thank You

Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections

Danny Dunlavy

[email protected]

http://www.cs.sandia.gov/~dmdunla