text mining three cases. 2 outline federalist papers svdpdf vaers

Text Mining Three Cases

Upload: felicity-knight

Post on 11-Jan-2016






Page 1: Text Mining Three Cases. 2 Outline Federalist Papers SVDPDF VAERS

Text Mining

Three Cases



Outline

– Federalist Papers
– SVDPDF
– VAERS





Federalist Papers



Who wrote The Federalist Papers?

STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.
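A minimal sketch of the stylometric idea, assuming that the rate of a few common marker words serves as the author "fingerprint". The slides do not say which words were used, so the word list and the sample sentence below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical marker words; the slides do not list the words actually used.
MARKER_WORDS = ("upon", "whilst", "while", "on", "by", "to")

def word_rates(text, words=MARKER_WORDS):
    """Return occurrences per 1,000 tokens for each marker word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (1000.0 * counts[w] / total if total else 0.0) for w in words}

# Illustrative sentence, not taken from the essays.
rates = word_rates("Upon reflection, the plan depends upon the people.")
print(rates)
```

Comparing such rate profiles across essays of known authorship is one way to turn "distribution of words" into numeric inputs for a model.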



About the Data

Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at

http://www.yale.edu/lawweb/avalon/federal/fed.htm, or http://www.constitution.org/fed/federa00.htm.

Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)




Corpus

The Federalist Papers corpus is a collection of 85


Terms and Tokens

The Federalist Papers taken as a whole contain over 190,000 tokens and approximately 8,800 unique tokens.
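The distinction between tokens (every word occurrence) and unique tokens (distinct terms) can be sketched as follows. The tokenization rule here is an assumption; SAS Text Miner parses text differently, so this sketch would not reproduce the counts quoted above.

```python
import re

def token_stats(text):
    """Return (total tokens, unique tokens) for a text, ignoring case."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

# Illustrative phrase: "the" and "of" each occur twice.
total, unique = token_stats("The people of the State of New York")
print(total, unique)
```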



The Federalist Papers Diagram

EM Clustering

Logistic Regression

TARGET: 1 – Madison; 0 – Hamilton; missing – unknown



Federalist Papers Clusters

Cluster 1



Cluster 2



These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
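To illustrate how EM clustering groups documents without author labels, here is a toy sketch that fits a two-component Gaussian mixture to a single numeric input. It is not the SAS implementation: the one-dimensional setup, the initialisation, and the scores are all assumptions made for illustration.

```python
import math

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns the component means and, for each point, its pair of
    responsibilities (soft cluster memberships).
    """
    mu = [min(xs), max(xs)]   # crude start: one mean at each extreme
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [
                weight[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens) or 1e-300   # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, 1e-6)   # floor keeps the density finite
    return mu, resp

# Made-up scores on a single text-derived dimension for eight documents.
scores = [0.10, 0.20, 0.15, 0.05, 2.00, 2.10, 1.90, 2.05]
means, resp = em_two_gaussians(scores)
clusters = [0 if r[0] > r[1] else 1 for r in resp]
print(means, clusters)
```

Each document receives a soft membership in both clusters; assigning it to the larger one gives the hard cluster labels.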



Logistic Regression Classification of The Federalist Papers


Text Mining Results

Text mining matches the results of Mosteller and Wallace.

The predictions in the second column from the right show the strength of the decision.

The record with a predicted value of 0.709119 corresponds to essay 56, so the model thinks that this essay has the weakest association with Madison of all of the unknown essays.

Essay 63, with a predicted value of 0.999691, has the strongest association with Madison.

All of the essays in question have a stronger association with Madison than Hamilton, hence the classification into the Madison category.
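The decision rule described above can be sketched as follows, assuming a cutoff of 0.5 on the predicted probability that TARGET = 1 (the slides do not state the cutoff). The two predicted values are the ones quoted above; the helper names are hypothetical.

```python
import math

def logistic(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p_madison, cutoff=0.5):
    """Label an essay from its predicted probability that TARGET = 1."""
    return "Madison" if p_madison > cutoff else "Hamilton"

def log_odds(p):
    """Inverse of the logistic function: the strength of the decision."""
    return math.log(p / (1.0 - p))

# Predicted values quoted in the slides for two of the disputed essays.
predicted = {56: 0.709119, 63: 0.999691}
for essay, p in predicted.items():
    print(essay, classify(p), round(log_odds(p), 2))
```

On the log-odds scale the gap between essays 56 and 63 is far wider than the raw probabilities suggest, which is why essay 56 is described as the weakest association.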




Characteristics of a Document

A document consists of
– letters
– words
– sentences
– paragraphs
– punctuation
– possible structural items: chapters, sections.

The elements of a document can be
– counted (for example, the number of characters, words, or sentences)
– summarized (for example, mean, median, or kurtosis).
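A short sketch of counting and summarizing, assuming simple regular-expression rules for words and sentences and using word length as the summarized quantity. Both choices are illustrative assumptions; kurtosis is computed here as population excess kurtosis.

```python
import re
import statistics

def summarize_document(text):
    """Count basic elements of a text and summarize its word lengths."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(w) for w in words]
    if not lengths:
        raise ValueError("text contains no words")
    mean = statistics.mean(lengths)
    m2 = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    m4 = sum((n - mean) ** 4 for n in lengths) / len(lengths)
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": mean,
        "median_word_length": statistics.median(lengths),
        # Population excess kurtosis; 0.0 when all words have equal length.
        "kurtosis": (m4 / m2 ** 2 - 3.0) if m2 else 0.0,
    }

# Illustrative text, not taken from the essays.
sample = "We the people. Ratify it now!"
summary = summarize_document(sample)
print(summary)
```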



Comparing Two Documents


[Figure: comparison of Doc 1 and Doc 2]



Contingency Table Comparing Essay 1 to Essay 37
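The table itself did not survive extraction. As a hedged illustration of how two documents can be compared through a contingency table of term counts, the sketch below computes a Pearson chi-square statistic. The counts are invented and are not those of Essay 1 and Essay 37; the function assumes every row and column total is positive.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table.

    `table` is a list of rows (one per document), each a list of counts
    (one per term). Every row and column total must be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Invented counts of two terms in two documents.
stat, dof = chi_square([[10, 20], [20, 10]])
print(stat, dof)
```

A large statistic relative to its degrees of freedom suggests that the two documents use the terms at different rates.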





Text Miner Static Analysis



Text Miner Interactive Analysis



SAS Education Course Descriptions

The data represents a collection of 130 course summaries obtained from http://support.sas.com. The original 130 files were PDF files stored in one location on an HTTP server. A SAS DATA step was used to read the files from the server and write them to a local directory. The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set.

The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.



Static Analysis with SAS Text Miner



Text Miner Settings



Interactive Results



Applications of Concept Lists

A company can have specific conceptual goals. For example, are customers concerned about
– brand integrity
– quality
– price
– features, styles, and selection
– availability
– customer support?



Market Research for Quality

What terms are most similar to the term “quality”?
– Find Similar
– Filter

What documents address quality?
– Filter on synonyms and similar terms
– Find similar documents

What secondary concepts reflect information on quality?
– SVD coefficients
– Concept links
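One common way to rank terms by similarity is cosine similarity between their rows in a term-by-document matrix. The sketch below uses that measure as a stand-in; the slides do not describe how Find Similar is computed, and the terms and counts are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(term, term_vectors, top=3):
    """Rank the other terms by cosine similarity to `term`."""
    target = term_vectors[term]
    scored = [
        (other, cosine(target, vec))
        for other, vec in term_vectors.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Hypothetical term-by-document counts (four documents per term).
term_vectors = {
    "quality": [3, 0, 2, 0],
    "reliable": [2, 0, 1, 0],
    "price": [0, 4, 0, 1],
    "support": [1, 1, 1, 1],
}
ranking = most_similar("quality", term_vectors)
print(ranking)
```

Applying the same measure to SVD coefficients instead of raw counts would compare terms in the reduced concept space.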



VAERS

The Vaccine Adverse Event Reporting System (VAERS) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines.

No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious.

VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events.

Department of Health and Human Services, Public Health Service



VAERS

Data was obtained from http://www.vaers.org/. Data was downloaded in September 2002 as a series of CSV files. A SAS DATA step was used to read and process the data. The original data had 131,464 observations and 59 variables. Cleaning and screening reduced the data set to 48,523 observations and 44 variables. The data set has 6 text variables. The original data had 21, but 15 were sparsely populated.
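Screening out sparsely populated variables can be sketched as below. The slides used a SAS DATA step and do not give the screening rule, so the fill-rate threshold, the helper name, and the column OTHER_TEXT are hypothetical.

```python
def drop_sparse_columns(rows, min_fill=0.5):
    """Keep columns whose share of non-empty values is at least min_fill.

    `rows` is a list of dicts with identical keys, as csv.DictReader yields.
    Returns the kept column names and the rows restricted to them.
    """
    if not rows:
        return [], []
    kept = []
    for column in rows[0]:
        filled = sum(1 for row in rows if (row.get(column) or "").strip())
        if filled / len(rows) >= min_fill:
            kept.append(column)
    return kept, [{c: row.get(c, "") for c in kept} for row in rows]

# Toy records; OTHER_TEXT is a hypothetical sparsely populated field.
records = [
    {"SYMPTOM_TEXT": "fever after dose", "SYM01": "FEVER", "OTHER_TEXT": ""},
    {"SYMPTOM_TEXT": "rash on arm", "SYM01": "RASH", "OTHER_TEXT": ""},
    {"SYMPTOM_TEXT": "headache", "SYM01": "", "OTHER_TEXT": ""},
]
kept, cleaned = drop_sparse_columns(records)
print(kept)
```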



VAERS Sample Entries

15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam. DEAF

Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV ASTHMA

Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live.




VAERS Text Fields

SYMPTOM_TEXT: Full text description of the adverse reaction entered by a medical professional

SYM01: Brief description of primary symptom

SYM02-SYM05: Additional symptoms in decreasing importance



VAERS Initial Diagram



Equivalent Terms for Patient
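The list of equivalent terms on this slide did not survive extraction. As an illustration only, the sketch below maps a few hypothetical abbreviations to a single canonical term before counting, which is the effect an equivalent-terms list has on the parsed text.

```python
import re

# Hypothetical equivalents; the actual list on the slide was not recovered.
EQUIVALENTS = {"pt": "patient", "pts": "patient", "patients": "patient"}

def normalize(text, equivalents=EQUIVALENTS):
    """Lowercase, tokenize, and map each token to its canonical term."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [equivalents.get(token, token) for token in tokens]

# Opening words of one of the sample entries.
terms = normalize("Pt experienced chicken pox")
print(terms)
```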



Property Panel for VAERS Text Miner Analysis



Interactive Results



Clusters Window

Why only one term when five were requested?



Cases with Fever

Last 16 entries out of 98



Headache Terms



Headache Documents



Terms Most Similar to Headache



Documents Most Similar to Headache



First 11 out of 65 Documents Filtered by Headache Terms



VAERS Predictive Modeling Diagram



Logistic Regression Model Effects Plot



Logistic Regression Lift Plot
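The plot itself is not reproduced here. For readers unfamiliar with lift, the sketch below shows how one point on a lift curve is computed: the event rate among the highest-scored cases divided by the overall event rate. The scores and outcomes are invented and are not VAERS results.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift in the top `fraction` of cases ranked by predicted score.

    `labels` are 1 for an event and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate if base_rate else float("nan")

# Invented model scores and observed outcomes for ten cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(lift_at(scores, labels, fraction=0.2))
```

A lift above 1 in the top-ranked cases means the model concentrates events there better than random selection would.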