Text Mining
Three Cases
2
Outline
Federalist Papers
SVDPDF
VAERS
http://zlin.ba.ttu.edu/sassrc.rar
http://zlin.ba.ttu.edu/DMTM9.rar
3
Federalist Papers
4
5
Who wrote The Federalist Papers?
Hamilton
STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.
Madison
About the Data
Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at
http://www.yale.edu/lawweb/avalon/federal/fed.htm, or http://www.constitution.org/fed/federa00.htm.
Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)
6
7
Corpus
The Federalist Papers corpus is a collection of 85 essays.
Terms and Tokens
The Federalist Papers taken as a whole contain over 190,000 tokens and approximately 8,800 unique tokens (terms).
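The distinction between tokens (every word occurrence) and terms (unique tokens) can be illustrated with a minimal Python sketch. The tokenization rule (lowercased runs of letters and apostrophes) and the one-line sample text are assumptions made for illustration; SAS Text Miner applies its own parsing rules, so its counts will differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (assumed rule: letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Short sample standing in for the full corpus (assumption for illustration).
sample = "To the People of the State of New York"

tokens = tokenize(sample)          # every occurrence counts
term_counts = Counter(tokens)      # one entry per unique token (term)

n_tokens = len(tokens)
n_terms = len(term_counts)
print("tokens:", n_tokens, "terms:", n_terms)
print("most frequent terms:", term_counts.most_common(2))
```

Applied to all 85 essays, the same two counts correspond to the corpus totals quoted above, up to differences in tokenization.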
8
The Federalist Papers Diagram
EM Clustering
Logistic Regression
TARGET: 1 – Madison; 0 – Hamilton; missing – unknown
9
Federalist Papers Clusters
             Hamilton   Madison   Unknown
Cluster 1          24         1         0
Cluster 2          27        14        11
These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
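The idea behind EM clustering can be sketched in a few lines of Python. This is a toy illustration, not the SAS Text Miner procedure: it fits a two-component Gaussian mixture to a single made-up numeric input, whereas the analysis above used several inputs derived from text mining. The scores, the starting values, and the function names are all assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (responsibilities, means); responsibilities[i][k] is the
    estimated probability that values[i] belongs to cluster k.
    """
    means = [min(values), max(values)]   # simple deterministic start (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: soft-assign each value to the two clusters.
        resp = []
        for x in values:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each cluster from its weighted members.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(spread, 1e-6)   # floor avoids a zero variance
            weights[k] = nk / len(values)
    return resp, means

# Made-up scores for eight documents (no author labels are used).
scores = [0.1, 0.3, 0.2, 0.0, 4.9, 5.1, 5.3, 5.0]
resp, means = em_two_clusters(scores)
labels = [0 if r[0] >= r[1] else 1 for r in resp]
print(labels, [round(m, 3) for m in means])
```

As in the slide, cluster membership is determined only from the numeric inputs; author labels would be cross-tabulated against the clusters afterwards.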
10
Logistic Regression Classification of The Federalist Papers
Text Mining Results
Text mining matches the results of Mosteller and Wallace.
The predictions in the second column from the right show the strength of the decision.
The record with a predicted value of 0.709119 corresponds to essay 56, so the model thinks that this essay has the weakest association with Madison of all of the unknown essays.
Essay 63, with a predicted value of 0.999691, has the strongest association with Madison.
All of the essays in question have a stronger association with Madison than Hamilton, hence the classification into the Madison category.
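The classification step can be sketched as follows, using only the two predicted values quoted above. The 0.5 cutoff is an assumption consistent with "a stronger association with Madison than Hamilton"; the function name is hypothetical, and the predicted values for the other nine essays are not given in the text, so they are not included.

```python
def classify(p_madison, threshold=0.5):
    """Assign an author from the predicted probability of TARGET = 1 (Madison).

    The 0.5 threshold is an assumption, not a setting taken from the slides.
    """
    return "Madison" if p_madison >= threshold else "Hamilton"

# Predicted values quoted in the text (essay number -> P(Madison)).
predicted = {56: 0.709119, 63: 0.999691}

labels = {essay: classify(p) for essay, p in predicted.items()}
weakest = min(predicted, key=predicted.get)     # weakest association with Madison
strongest = max(predicted, key=predicted.get)   # strongest association with Madison

# Odds in favor of Madison, p / (1 - p), as another way to read the strength.
odds = {essay: p / (1.0 - p) for essay, p in predicted.items()}

print(labels, weakest, strongest)
```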
11
12
Characteristics of a Document
A document consists of letters, words, sentences, paragraphs, punctuation, and possible structural items such as chapters and sections.
The elements of a document can be counted (for example, the number of characters, words, or sentences) and summarized (for example, mean, median, or kurtosis).
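Counting and summarizing document elements might look like the sketch below. The splitting rules (words as runs of letters, sentences ending at ".", "!" or "?") and the sample text are simplifying assumptions; kurtosis is computed here as the fourth central moment divided by the squared population variance, which is one of several conventions in use.

```python
import re
import statistics

def describe(text):
    """Count basic elements of a document and summarize its word lengths."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(w) for w in words]
    mean = statistics.mean(lengths)
    var = statistics.pvariance(lengths)
    # Non-excess kurtosis: fourth central moment over squared variance.
    if var > 0:
        kurtosis = sum((x - mean) ** 4 for x in lengths) / len(lengths) / var ** 2
    else:
        kurtosis = float("nan")
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "mean_word_len": mean,
        "median_word_len": statistics.median(lengths),
        "kurtosis_word_len": kurtosis,
    }

summary = describe("We the people. We ratify!")  # made-up sample text
print(summary)
```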
13
Comparing Two Documents
         Word Size | Sentence Size | Paragraph Size | Word Freq | Sentence Freq | Paragraph Freq
Doc 1
Doc 2
14
15
Contingency Table Comparing Essay 1 to Essay 37
continued...
16
Contingency Table Comparing Essay 1 to Essay 37
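The table itself did not survive in this transcript. As a hedged illustration of how two essays can be compared through a contingency table, the sketch below assumes rows are documents and columns are counts of selected words, and computes a Pearson chi-square statistic. The two short texts and the word list are made up; they are not taken from Essay 1 or Essay 37.

```python
import re
from collections import Counter

def word_counts(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[w] for w in vocabulary]

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical texts and word list, for illustration only.
vocabulary = ["upon", "the", "whilst"]
doc_a = "upon the whole upon the matter"
doc_b = "whilst the whole whilst the matter"

table = [word_counts(doc_a, vocabulary), word_counts(doc_b, vocabulary)]
stat = chi_square(table)
print(table, stat)
```

A statistic near zero means the two documents use the selected words in similar proportions; larger values indicate more divergent word distributions.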
17
Text Miner Static Analysis
18
Text Miner Interactive Analysis
SVDPDF
19
20
SAS Education Course Descriptions
The data represents a collection of 130 course summaries obtained from http://support.sas.com.
The original 130 files were PDF files stored in one location on an HTTP server.
A SAS DATA step was used to read the files from the server and write them to a local directory.
The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set.
The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.
21
Static Analysis with SAS Text Miner
22
Text Miner Settings
23
Interactive Results
24
Applications of Concept Lists
A company can have specific conceptual goals. For example, are customers concerned about brand integrity; quality; price; features, styles, and selection; availability; or customer support?
25
Market Research for Quality
What terms are most similar to the term “quality”?
– Find Similar
– Filter
What documents address quality?
– Filter on synonyms and similar terms
– Find similar documents
What secondary concepts reflect information on quality?
– SVD coefficients
– Concept links
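One simple way to think about "terms most similar to quality" is cosine similarity between term vectors. The sketch below is an assumption-laden stand-in, not the Text Miner Find Similar function: it compares raw term-by-document counts rather than SVD coefficients, and the four-document count matrix and term list are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical term-by-document counts (4 documents), for illustration only.
term_doc = {
    "quality":     [2, 1, 0, 0],
    "reliability": [1, 1, 0, 0],
    "price":       [0, 0, 3, 1],
    "support":     [0, 1, 0, 2],
}

def most_similar(term, k=2):
    """Return the k terms whose document profiles are closest to the given term."""
    target = term_doc[term]
    scores = [(other, cosine(target, vec))
              for other, vec in term_doc.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

similar = most_similar("quality")
print(similar)
```

Terms that co-occur with "quality" in the same documents score highest; replacing the raw counts with SVD coefficients would apply the same comparison in a reduced-dimension space.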
26
VAERS
27
VAERS
The Vaccine Adverse Event Reporting System (VAERS) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines.
No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious.
VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events.
Department of Health and Human Services, Public Health Service
28
VAERS
Data was obtained from http://www.vaers.org/.
Data was downloaded in September 2002 as a series of CSV files.
A SAS DATA step was used to read and process the data.
The original data had 131,464 observations and 59 variables.
Cleaning and screening reduced the data set to 48,523 observations and 44 variables.
The data set has 6 text variables. The original data had 21, but 15 were sparsely populated.
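Screening out sparsely populated text variables can be sketched as a fill-rate check. The original processing was done in a SAS DATA step whose rules are not given here, so the 50% cutoff, the function names, the sample rows, and the column NOTE_X are all assumptions for illustration.

```python
def fill_rate(rows, column):
    """Fraction of rows in which the column holds non-blank text."""
    filled = sum(1 for row in rows if (row.get(column) or "").strip())
    return filled / len(rows)

def keep_populated(rows, columns, min_fill=0.5):
    """Keep columns whose fill rate meets the (assumed) minimum."""
    return [c for c in columns if fill_rate(rows, c) >= min_fill]

# Made-up records; NOTE_X is a hypothetical sparsely populated text field.
rows = [
    {"SYMPTOM_TEXT": "fever and rash",   "SYM01": "FEVER",     "NOTE_X": ""},
    {"SYMPTOM_TEXT": "headache",         "SYM01": "HEADACHE",  "NOTE_X": ""},
    {"SYMPTOM_TEXT": "swelling at site", "SYM01": "",          "NOTE_X": "see chart"},
    {"SYMPTOM_TEXT": "dizziness",        "SYM01": "DIZZINESS", "NOTE_X": ""},
]

kept = keep_populated(rows, ["SYMPTOM_TEXT", "SYM01", "NOTE_X"])
print(kept)
```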
29
30
VAERS Sample Entries
15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam. DEAF
Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV ASTHMA
Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live. INFECT
31
VAERS Text Fields
SYMPTOM_TEXT: Full text description of the adverse reaction entered by a medical professional
SYM01: Brief description of primary symptom
SYM02-SYM05: Additional symptoms in decreasing importance
32
VAERS Initial Diagram
33
Equivalent Terms for Patient
34
Property Panel for VAERS Text Miner Analysis
35
Interactive Results
36
Clusters Window
Why only one term when five were requested?
…
37
Cases with Fever
Last 16 entries out of 98
38
Headache Terms
39
Headache Documents
40
Terms Most Similar to Headache
41
Documents Most Similar to Headache
42
First 11 out of 65 Documents Filtered by Headache Terms
43
VAERS Predictive Modeling Diagram
44
Logistic Regression Model Effects Plot
45
Logistic Regression Lift Plot