Text Data Mining: Predictive and Exploratory Analysis of Text
Jaime Arguello [email protected]
January 10, 2018
Outline
Introductions
What is Text Data Mining?
Predictive Analysis of Text: The Big Picture
Exploratory Analysis of Text: The Big Picture
Applications
• Hello, my name is ______.
• I’m in the ______ program.
• I’m taking this course because I’d like to learn how to ______.
Introductions
• The science and practice of building and evaluating computer programs that automatically detect or discover interesting and useful things in collections of natural language text
What is Text Data Mining?
• Machine Learning: developing computer programs that improve their performance with “experience”
• Data Mining: developing methods that discover patterns within large structured datasets
• Statistics: developing methods for the interpretation of data and experimental outcomes in reaching conclusions with a certain degree of confidence
Related Fields
• Predictive Analysis of Text
‣ developing computer programs that automatically recognize or detect a particular concept within a span of text
• Exploratory Analysis of Text:
‣ developing computer programs that automatically discover interesting and useful patterns or trends in text collections
Text Data Mining in this Course
Predictive Analysis example: recognizing triangles
• We could imagine writing a “triangle detector” by hand:
‣ if shape has three sides, then shape = triangle.
‣ otherwise, shape = other
• Alternatively, we could use supervised machine learning!
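The hand-written rule above can be sketched in a few lines of Python. The function name `detect_triangle` and the use of a plain side count as input are assumptions made for illustration; the slides do not prescribe an implementation.

```python
def detect_triangle(num_sides):
    """Hand-written rule: a shape with exactly three sides is a triangle."""
    return "triangle" if num_sides == 3 else "other"


# A triangle, a square, and a circle (treated here as having infinitely many sides).
shapes = [3, 4, float("inf")]
labels = [detect_triangle(n) for n in shapes]
print(labels)
```

A rule like this works only because the concept is trivially defined by one feature. Supervised learning is the alternative when no such rule is known in advance.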
(Diagram) Training: labeled examples → machine learning algorithm → model
(Diagram) Testing: new, unlabeled examples → model → predictions
What is the part that is missing?
HINT: It’s what most of this class will be about!
Predictive Analysis representation: features

| color | size | # sides | equal sides | ... | label |
| --- | --- | --- | --- | --- | --- |
| red | big | 3 | no | ... | yes |
| green | big | 3 | yes | ... | yes |
| blue | small | inf | yes | ... | no |
| blue | small | 4 | yes | ... | no |
| ... | ... | ... | ... | ... | ... |
| red | big | 3 | yes | ... | yes |
Predictive Analysis example: recognizing triangles
(Diagram) Training: labeled examples (feature table with known labels) → machine learning algorithm → model
(Diagram) Testing: new, unlabeled examples (same features, label unknown) → model → predictions (label filled in)

| color | size | sides | equal sides | ... | label (unlabeled input) | label (prediction) |
| --- | --- | --- | --- | --- | --- | --- |
| red | big | 3 | no | ... | ??? | yes |
| green | big | 3 | yes | ... | ??? | yes |
| blue | small | inf | yes | ... | ??? | no |
| blue | small | 4 | yes | ... | ??? | no |
| ... | ... | ... | ... | ... | ??? | ... |
| red | big | 3 | yes | ... | ??? | yes |
Predictive Analysis basic ingredients
1. Training data: a set of examples of the concept we want to automatically recognize
2. Representation: a set of features that we believe are useful in recognizing the desired concept
3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept
Highly influential!
Predictive Analysis basic ingredients
4. Model: a (mathematical) function that describes a predictive relationship between the feature values and the presence/absence of the concept
5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness
6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model
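The six ingredients can be seen together in a minimal sketch. The training rows come from the feature table shown earlier. Everything else is an assumption made for illustration: the choice of `FEATURES`, the hypothetical `TEST` rows, and the use of a simple "one-rule" learner (pick the single feature whose values best predict the label), which is not a method the slides name.

```python
from collections import Counter, defaultdict

# 1. Training data: labeled rows from the feature table.
TRAIN = [
    ({"color": "red", "size": "big", "sides": "3", "equal_sides": "no"}, "yes"),
    ({"color": "green", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
    ({"color": "blue", "size": "small", "sides": "inf", "equal_sides": "yes"}, "no"),
    ({"color": "blue", "size": "small", "sides": "4", "equal_sides": "yes"}, "no"),
    ({"color": "red", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
]

# 2. Representation: the features we believe are useful (an assumption).
FEATURES = ["sides", "equal_sides"]


def majority(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


# 3. Learning algorithm: keep the single feature that best predicts the label.
def learn_one_rule(examples, features):
    default = majority([y for _, y in examples])
    best = None
    for f in features:
        labels_by_value = defaultdict(list)
        for x, y in examples:
            labels_by_value[x[f]].append(y)
        table = {v: majority(ys) for v, ys in labels_by_value.items()}
        correct = sum(table[x[f]] == y for x, y in examples)
        if best is None or correct > best[0]:
            best = (correct, f, table)
    _, feature, table = best
    # 4. Model: a function from feature values to a predicted label.
    return lambda x: table.get(x[feature], default)


# 5. Test data: hypothetical, previously unseen examples.
TEST = [
    ({"color": "blue", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
    ({"color": "red", "size": "small", "sides": "4", "equal_sides": "no"}, "no"),
    ({"color": "green", "size": "big", "sides": "inf", "equal_sides": "yes"}, "no"),
]


# 6. Performance metric: fraction of test examples predicted correctly.
def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)


model = learn_one_rule(TRAIN, FEATURES)
print(accuracy(model, TEST))
```

Changing `FEATURES` (for example, to `["color", "size"]`) changes what the learner can pick up, which is one way to see why representation matters so much.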
Predictive Analysis basic ingredients: the focus in this course
1. Training data: a set of examples of the concept we want to automatically recognize
2. Representation: a set of features that we believe are useful in recognizing the desired concept
3. Learning algorithm: uses the training data to learn a predictive model of the “concept”
4. Model: describes a predictive relationship between feature values and the presence/absence of the concept
5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness
6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model
Predictive Analysis applications
• Topic categorization
• Opinion mining
• Sentiment analysis
• Bias or viewpoint detection
• Discourse analysis (e.g., student retention)
• Forecasting and nowcasting
• Any other ideas?
What Could Possibly Go Wrong?
1. Bad feature representation
2. Bad data + misleading correlations
3. Noisy labels for training and testing
4. Bad learning algorithm
5. Misleading evaluation metric
Training data + Representation what could possibly go wrong?

| color | size | 90 deg. angle | equal sides | ... | label |
| --- | --- | --- | --- | --- | --- |
| red | big | yes | no | ... | yes |
| green | big | no | yes | ... | yes |
| blue | small | no | yes | ... | no |
| blue | small | yes | yes | ... | no |
| ... | ... | ... | ... | ... | ... |
| red | big | no | yes | ... | yes |

1. bad feature representation!

| color | size | # sides | equal sides | ... | label |
| --- | --- | --- | --- | --- | --- |
| blue | big | 3 | no | ... | yes |
| blue | big | 3 | yes | ... | yes |
| red | small | inf | yes | ... | no |
| green | small | 4 | yes | ... | no |
| ... | ... | ... | ... | ... | ... |
| blue | big | 3 | yes | ... | yes |

2. bad data + misleading correlations
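In the table above, every triangle happens to be blue and big, so color and size look just as predictive as the number of sides. A quick check like the following sketch can flag such features. The helper name `perfectly_predictive` and the dictionary layout are assumptions for illustration, and only the visible rows of the table are used.

```python
from collections import defaultdict

ROWS = [
    {"color": "blue", "size": "big", "sides": "3", "equal_sides": "no", "label": "yes"},
    {"color": "blue", "size": "big", "sides": "3", "equal_sides": "yes", "label": "yes"},
    {"color": "red", "size": "small", "sides": "inf", "equal_sides": "yes", "label": "no"},
    {"color": "green", "size": "small", "sides": "4", "equal_sides": "yes", "label": "no"},
    {"color": "blue", "size": "big", "sides": "3", "equal_sides": "yes", "label": "yes"},
]
FEATURES = ["color", "size", "sides", "equal_sides"]


def perfectly_predictive(rows, feature):
    """True if every value of `feature` co-occurs with exactly one label."""
    labels_by_value = defaultdict(set)
    for row in rows:
        labels_by_value[row[feature]].add(row["label"])
    return all(len(labels) == 1 for labels in labels_by_value.values())


suspects = [f for f in FEATURES if perfectly_predictive(ROWS, f)]
print(suspects)
```

On this sample, a learner has no way to tell that `sides` is the meaningful feature and that `color` and `size` are accidents of how the data was collected.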

| color | size | # sides | equal sides | ... | label |
| --- | --- | --- | --- | --- | --- |
| white | big | 3 | no | ... | yes |
| white | big | 3 | no | ... | no |
| white | small | inf | yes | ... | yes |
| white | small | 4 | yes | ... | no |
| ... | ... | ... | ... | ... | ... |
| white | big | 3 | yes | ... | yes |

3. noisy training data!
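One symptom of noisy labels is visible in the table above: two rows have identical feature values but different labels. The sketch below looks for such conflicts. The function name `conflicting_examples` is an assumption for illustration, and the check only catches noise that shows up as a contradiction; it would not flag the "inf sides" row that is labeled "yes".

```python
from collections import defaultdict

# (color, size, sides, equal sides) -> label, copied from the visible table rows.
ROWS = [
    (("white", "big", "3", "no"), "yes"),
    (("white", "big", "3", "no"), "no"),
    (("white", "small", "inf", "yes"), "yes"),
    (("white", "small", "4", "yes"), "no"),
    (("white", "big", "3", "yes"), "yes"),
]


def conflicting_examples(rows):
    """Return feature tuples that appear with more than one label."""
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[features].add(label)
    return [f for f, labels in labels_by_features.items() if len(labels) > 1]


conflicts = conflicting_examples(ROWS)
print(conflicts)
```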
Learning Algorithm + Model what could possibly go wrong?
• Linear classifier

$$
y = \begin{cases} 1 & \text{if } w_0 + \sum_{j=1}^{n} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}
$$

‣ y: predicted value (e.g., 1 = positive, 0 = negative)
‣ w_0, w_1, ..., w_n: parameters learned by the model
test instance: f_1 = 0.5, f_2 = 1, f_3 = 0.2
model parameters: w_0 = 2, w_1 = -5, w_2 = 2, w_3 = 1
output = 2.0 + (0.5 x -5.0) + (1.0 x 2.0) + (0.2 x 1.0)
output = 1.7
output prediction = positive
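The worked example above can be reproduced with a short sketch of the linear classifier's decision rule. The function names `linear_output` and `predict` are assumptions for illustration; the weights and feature values are the ones given on the slide.

```python
def linear_output(weights, features):
    """Return w_0 + sum_j w_j * x_j, where weights[0] is the bias term w_0."""
    bias, coefficients = weights[0], weights[1:]
    return bias + sum(w * x for w, x in zip(coefficients, features))


def predict(weights, features):
    """Return 1 (positive) if the linear output is above zero, else 0 (negative)."""
    return 1 if linear_output(weights, features) > 0 else 0


weights = [2.0, -5.0, 2.0, 1.0]  # w_0, w_1, w_2, w_3
instance = [0.5, 1.0, 0.2]       # f_1, f_2, f_3

score = linear_output(weights, instance)
print(round(score, 2), predict(weights, instance))
```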
(figure source: http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png)
• Would a linear classifier do well on positive (black) and negative (white) data that looks like this?
(Figure: scatter plot of positive and negative points, with axes x1 and x2 each marked at 0.5 and 1.0)
4. Bad learning algorithm!
(Diagram: decision tree. The root tests X1 > 0.5. Its yes and no branches each lead to a test X2 > 0.5, whose yes/no branches end in pos/neg leaves, with the two subtrees assigning the leaves in opposite order.)
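The contrast between the two model types can be sketched as follows. The original figures are not recoverable, so several things here are assumptions: the four `POINTS` (positives in the off-diagonal quadrants), the exact leaf assignment in `tree_predict`, and the coarse weight grid used to search for a linear classifier.

```python
from itertools import product

# Hypothetical data in the spirit of the scatter plot: one point per quadrant.
POINTS = [
    ((0.25, 0.25), "neg"),
    ((0.25, 0.75), "pos"),
    ((0.75, 0.25), "pos"),
    ((0.75, 0.75), "neg"),
]


def tree_predict(x1, x2):
    """Two-level decision tree: test X1 > 0.5 first, then X2 > 0.5."""
    if x1 > 0.5:
        return "neg" if x2 > 0.5 else "pos"
    return "pos" if x2 > 0.5 else "neg"


def linear_predict(w0, w1, w2, x1, x2):
    """Linear classifier: positive if w0 + w1*x1 + w2*x2 > 0."""
    return "pos" if w0 + w1 * x1 + w2 * x2 > 0 else "neg"


def num_correct(predict):
    return sum(predict(x1, x2) == label for (x1, x2), label in POINTS)


# Search a coarse grid of weights from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
best_linear = max(
    num_correct(lambda x1, x2: linear_predict(w0, w1, w2, x1, x2))
    for w0, w1, w2 in product(grid, repeat=3)
)
tree_correct = num_correct(tree_predict)
print(best_linear, tree_correct)
```

No choice of weights gets all four points right, because a single straight line cannot separate this pattern, while the tree handles it with two nested threshold tests.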
Evaluation Metric what could possibly go wrong?
• Most evaluation metrics can be understood using a contingency table

| | true: triangle | true: other |
| --- | --- | --- |
| predicted: triangle | A | B |
| predicted: other | C | D |

• What number(s) do we want to maximize?
• What number(s) do we want to minimize?
• True positives (A): number of triangles correctly predicted as triangles
• False positives (B): number of “other” incorrectly predicted as triangles
• False negatives (C): number of triangles incorrectly predicted as “other”
• True negatives (D): number of “other” correctly predicted as “other”
• Accuracy: percentage of predictions that are correct (i.e., true positives and true negatives)

$$
\text{accuracy} = \frac{A + D}{A + B + C + D}
$$
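The contingency cells and the accuracy formula can be computed directly from paired true and predicted labels, as in this sketch. The function names and the five example labels are assumptions for illustration.

```python
def contingency(true_labels, predicted_labels, positive="triangle"):
    """Return (A, B, C, D) = (true pos., false pos., false neg., true neg.)."""
    a = b = c = d = 0
    for true, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and true == positive:
            a += 1  # triangle correctly predicted as triangle
        elif predicted == positive:
            b += 1  # "other" incorrectly predicted as triangle
        elif true == positive:
            c += 1  # triangle incorrectly predicted as "other"
        else:
            d += 1  # "other" correctly predicted as "other"
    return a, b, c, d


def accuracy(a, b, c, d):
    """Fraction of all predictions that are correct: (A + D) / (A + B + C + D)."""
    return (a + d) / (a + b + c + d)


true_labels = ["triangle", "triangle", "other", "other", "other"]
predicted_labels = ["triangle", "other", "triangle", "other", "other"]

cells = contingency(true_labels, predicted_labels)
print(cells, accuracy(*cells))
```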
• What is the accuracy of this model?
Evaluation Metric what could possibly go wrong?
• Interpreting the value of a metric on a particular data set requires some thinking ...
• On this dataset, what would be the expected accuracy of a model that does NO learning (degenerate baseline)?
5. Misleading interpretation of a metric value!
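One common degenerate baseline always predicts the most frequent class. The sketch below computes its accuracy on a hypothetical, heavily imbalanced label set; the data and the function name `majority_baseline_accuracy` are assumptions, since the dataset on the slide is not recoverable.

```python
from collections import Counter


def majority_baseline_accuracy(true_labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)


# Hypothetical test set: nine "other" shapes and one triangle.
labels = ["other"] * 9 + ["triangle"]
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

On data like this, a model that reports 90% accuracy has not necessarily learned anything, which is why a metric value should be read against the baseline.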
What Could Possibly Go Wrong?
1. Bad feature representation
2. Bad data + misleading correlations
3. Noisy labels for training and testing
4. Bad learning algorithm
5. Misleading evaluation metric
Text Data Mining in this Course
• Predictive Analysis of Text
‣ developing computer programs that automatically recognize a particular concept within a span of text
• Exploratory Analysis of Text:
‣ developing computer programs that automatically discover useful patterns or trends in text collections
Exploratory Analysis example: clustering shapes
(Diagram: unlabeled examples → clustering algorithm)
Exploratory Analysis representation: features

| color | size | # sides | equal sides | ... | shape |
| --- | --- | --- | --- | --- | --- |
| blue | big | 3 | no | ... | triangle |
| blue | big | 3 | yes | ... | triangle |
| red | small | inf | yes | ... | circle |
| green | small | 4 | yes | ... | square |
| ... | ... | ... | ... | ... | ... |
| blue | big | 3 | yes | ... | triangle |
Exploratory Analysis basic ingredients
1. Data: a set of examples that we want to automatically analyze in order to discover interesting trends
2. Representation: a set of features that we believe are useful in describing the data (i.e., its main attributes)
3. Similarity Metric: a measure of similarity between two examples that is based on their feature values
4. Clustering algorithm: an algorithm that assigns items with similar feature values to the same group
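The four ingredients can be sketched on the visible rows of the shape table. The similarity metric (fraction of matching feature values), the greedy "leader" clustering procedure, and the 0.75 threshold are all assumptions chosen for brevity; the slides do not name a specific metric or algorithm.

```python
# 1. Data and 2. Representation: (color, size, sides, equal sides) per shape.
ITEMS = [
    ("blue", "big", "3", "no"),
    ("blue", "big", "3", "yes"),
    ("red", "small", "inf", "yes"),
    ("green", "small", "4", "yes"),
    ("blue", "big", "3", "yes"),
]


# 3. Similarity metric: fraction of features on which two examples agree.
def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)


# 4. Clustering algorithm: join the first cluster whose first member
#    (its "leader") is similar enough, otherwise start a new cluster.
def leader_clustering(items, threshold):
    clusters = []  # each cluster is a list of item indices
    for i, item in enumerate(items):
        for cluster in clusters:
            if similarity(items[cluster[0]], item) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


clusters = leader_clustering(ITEMS, threshold=0.75)
print(clusters)
```

Lowering the threshold to 0.5 merges the circle and the square into one cluster, which shows how much the result depends on the representation and the similarity metric.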
Representation what could possibly go wrong?
Exploratory Analysis basic ingredients: the focus in this course
1. Data: a set of examples that we want to automatically analyze in order to discover interesting trends
2. Representation: a set of features that we believe are useful in describing the data
3. Similarity Metric: a measure of similarity between two examples that is based on their feature values
4. Clustering algorithm: an algorithm that assigns items with similar feature values to the same group
• Topic Categorization
• Opinion Mining
• Sentiment/Affect Analysis
• Bias Detection
• Information Extraction and Relation Learning
• Text-driven Forecasting
• Temporal Summarization
Predictive Analysis of Text examples we’ll cover in class
• Topic Categorization: automatically assigning documents to a set of pre-defined topical categories
Predictive Analysis of Text example applications
• Opinion Mining: automatically detecting whether a span of opinionated text expresses a positive or negative opinion about the item being judged
Predictive Analysis of Text example applications
Opinion Mining movie reviews
• “Great movie! It kept me on the edge of my seat the whole time. I IMAX-ed it and have no regrets.” → positive
• “Waste of time! It sucked!” → negative
• “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can’t hold up.” → negative
• “Trust me, this movie is a masterpiece .... after you’ve seen it 4+ times.” → ???
• Sentiment/Affect Analysis: automatically detecting the emotional state of the author of a span of text (usually from a set of pre-defined emotional states).
Predictive Analysis of Text example applications
63
• “[I] also found out that the radiologist is doing the biopsy, not a breast surgeon. I am more scared now than when I ...”
• “... My radiologist ‘assured’ me my scan was NOT going to be cancer...she was wrong.”
• “ ... My radiologist did my core biopsy. Not a problem and he did a super job of it.”
• “It's pretty standard for the radiologist to do the biopsy so I wouldn't be concerned on that score.”
fear
despair
hope
hope
Sentiment Analysis support group posts
64
• Bias detection: automatically detecting whether the author of a span of text favors a particular viewpoint (usually from a set of pre-defined viewpoints)
Predictive Analysis of Text example applications
65
Bias Detection
• “Coming [up] next, drug addicted pregnant women no longer have anything to fear from the authorities thanks to the Supreme Court. Both sides on this in a moment.” -- Bill O’Reilly
• “Nationalizing businesses, nationalizing banks, is not a solution for the democratic party, it's the objective.” -- Rush Limbaugh
• “If you're keeping score at home, so far our war in Iraq has created a police state in that country and socialism in Spain. So, no democracies yet, but we're really getting close.” -- Jon Stewart
pro-policy (vs. anti-policy)
conservative (vs. liberal)
against war in Iraq (vs. in favor of)
66
• Information extraction: automatically detecting that a short sequence of words belongs to (or is an instance of) a particular entity type, for example:
‣ Person(X)
‣ Location(X)
‣ TennisPlayer(X)
‣ ...
Predictive Analysis of Text example applications
67
• Relation Learning: automatically detecting pairs of entities that share a particular relation, for example:
‣ CEO(<person>,<company>)
‣ Capital(<city>,<country>)
‣ Mother(<person>,<person>)
‣ ConvictedFelon(<person>,<crime>)
‣ ...
Predictive Analysis of Text example applications
68
Relation Learning CEO(<person>,<company>)
<person>, who was named CEO of <company>
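A surface pattern like the one above can be turned into a small extractor. The sketch below is only an illustration under stated assumptions: runs of capitalized words stand in for `<person>` and `<company>`, the names `NAME`, `CEO_PATTERN`, and `extract_ceo` are hypothetical, and the example sentence is made up for the demonstration.

```python
import re

# Assumption: a run of capitalized words approximates an entity mention.
# A real system would use a trained entity recognizer instead.
NAME = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"

# The slide's pattern: <person>, who was named CEO of <company>
CEO_PATTERN = re.compile(rf"({NAME}), who was named CEO of ({NAME})")

def extract_ceo(text):
    """Return (person, company) pairs matched by the surface pattern."""
    return CEO_PATTERN.findall(text)

# Hypothetical sentence written for this example.
sentence = "Tom LaSorda, who was named CEO of Fisker, spoke on Monday."
print(extract_ceo(sentence))
```

A single hand-written pattern finds only the mentions phrased exactly this way, so in practice many patterns, or a learned model, are needed to cover other phrasings of the same relation.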
69
Relation Learning CEO(<person>,<company>)
CEO(Tom LaSorda, Fisker)
CEO(Sean Connolly, Hillshire Brands)
CEO(woman, Gilt Groupe)
CEO(scottish chemist, AstraZeneca)
CEO(Bob Harrison, First Hawaiian Bank)
70
• Text-based Forecasting: monitoring incoming text (e.g., tweets) and making predictions about external, real-world events or trends, for example:
‣ a presidential candidate’s poll rating
‣ a company’s stock value change
‣ a movie’s box office earnings
‣ side-effects for a particular drug
‣ ...
Predictive Analysis of Text example applications
71
[Figure 2: The online updating system observes a buffered stream of documents and makes decisions based on the contents of the input buffer (B_I) and the delivered updates (S̃_{t−1}), as well as sentence- and document-level signals.]
• Temporal Summarization: monitoring incoming text (e.g., tweets) about a news event and predicting whether a sentence should be included in an on-going summary of the event
• Updates to the summary should contain relevant, novel, and accurate information.
Predictive Analysis of Text example applications
(Diagram from Guo et al., ECIR 2013)
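The "novel" requirement can be illustrated with a simple redundancy filter. This is a minimal sketch under stated assumptions, not the system from Guo et al.: it checks novelty only (not relevance or accuracy), uses whitespace tokens and Jaccard overlap as the similarity measure, and the threshold of 0.5, the function names, and the example sentences are all hypothetical.

```python
def tokens(sentence):
    """Lowercased whitespace tokens as a set (a crude text representation)."""
    return set(sentence.lower().split())

def jaccard(a, b):
    """Overlap between two token sets: |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def update_summary(stream, threshold=0.5):
    """Emit a sentence only if it is dissimilar to every delivered update."""
    delivered = []
    for sentence in stream:
        words = tokens(sentence)
        if all(jaccard(words, tokens(s)) < threshold for s in delivered):
            delivered.append(sentence)
    return delivered

# Hypothetical incoming sentences about one event, in arrival order.
stream = [
    "Earthquake strikes the coast",
    "Earthquake strikes the coast today",
    "Rescue teams arrive in the region",
]
print(update_summary(stream))
```

The second sentence shares four of its five distinct words with the first, so it is treated as redundant and skipped, while the third adds new information and is delivered.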
72
• Detecting other interesting properties of text: [insert your crazy idea here], for example, detecting humorous text:
‣ “Beauty is in the eye of the beholder”
‣ “Beauty is in the eye of the beer holder”
Predictive Analysis of Text example applications
not funny
funny
(example from Mihalcea and Pulman, 2007)
73
Outline
Introductions
What is Text Data Mining?
Predictive Analysis of Text: The Big Picture
Exploratory Analysis of Text: The Big Picture
Applications
Course Overview
75
• Predictive Analysis of Text
‣ Supervised machine learning principles
‣ Text representation
‣ Feature selection
‣ Basic machine learning algorithms
‣ Tools for predictive analysis of text
‣ Experimentation and evaluation
• Exploratory Analysis of Text
‣ Clustering
‣ Outlier detection (tentative)
‣ Co-occurrence statistics
Road Map first half of the semester
76
• Applications
‣ Text classification
‣ Opinion mining
‣ Sentiment analysis
‣ Bias detection
‣ Information extraction
‣ Relation learning
‣ Text-based forecasting
‣ Temporal Summarization
• Is there anything that you would like to learn more about?
Road Map second half of the semester
77
Grading
• 30% homework
‣ 10% each
• 20% midterm
• 40% term project
‣ 5% proposal
‣ 10% presentation
‣ 25% paper
• 10% participation
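As a worked example of how these weights combine, the sketch below computes a final percentage from component scores. It assumes each component is itself scored on a 0-100 scale; the names `WEIGHTS` and `final_score` and the example scores are hypothetical.

```python
# Weights from the grading breakdown, in percent (they sum to 100).
WEIGHTS = {"homework": 30, "midterm": 20, "term_project": 40, "participation": 10}

def final_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100

# Hypothetical student: (30*90 + 20*80 + 40*85 + 10*100) / 100
example = {"homework": 90, "midterm": 80, "term_project": 85, "participation": 100}
print(final_score(example))
```

For these example scores the weighted sum is 27 + 16 + 34 + 10 = 87.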
78
Grading for Graduate Students
• H: 95-100%
• P: 80-94%
• L: 60-79%
• F: 0-59%
79
Grading for Undergraduate Students
• A+: 97-100%
• A: 94-96%
• A-: 90-93%
• B+: 87-89%
• B: 84-86%
• B-: 80-83%
• C+: 77-79%
• C: 74-76%
• C-: 70-73%
• D+: 67-69%
• D: 64-66%
• D-: 60-63%
• F: <= 59%
80
General Outline of Homework
• Given a dataset (i.e., a training and test set), run experiments where you try to predict the target class using different feature representations
• Do error analysis
• Report on what worked, what didn’t, and why!
• Answer essay questions about the assignment
‣ These will be associated with the course material
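The experiment loop described above (same training and test sets, different feature representations) might look like the following sketch. Everything here is an assumption made for illustration: the course's actual tools and datasets are not specified, the toy data is made up, and a small multinomial Naive Bayes classifier with add-one smoothing stands in for whatever learner the homework uses.

```python
import math
from collections import Counter, defaultdict

def unigrams(text):
    """Feature representation 1: individual lowercased words."""
    return text.lower().split()

def uni_and_bigrams(text):
    """Feature representation 2: words plus adjacent word pairs."""
    words = text.lower().split()
    return words + [f"{a}_{b}" for a, b in zip(words, words[1:])]

def train(examples, featurize):
    """Collect per-class feature counts for multinomial Naive Bayes."""
    class_counts, feature_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in examples:
        feats = featurize(text)
        class_counts[label] += 1
        feature_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, feature_counts, vocab

def predict(model, featurize, text):
    """Return the class with the highest smoothed log-probability."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(feature_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for feat in featurize(text):
            if feat in vocab:  # ignore features never seen in training
                score += math.log((feature_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(train_set, test_set, featurize):
    """Train on train_set and report the fraction of test_set predicted correctly."""
    model = train(train_set, featurize)
    hits = sum(predict(model, featurize, t) == y for t, y in test_set)
    return hits / len(test_set)

# Hypothetical toy data; a real assignment would supply the dataset.
train_set = [("great movie", "pos"), ("good film", "pos"),
             ("bad movie", "neg"), ("boring film", "neg")]
test_set = [("great film", "pos"), ("bad film", "neg")]

for name, featurize in [("unigrams", unigrams), ("uni+bigrams", uni_and_bigrams)]:
    print(name, accuracy(train_set, test_set, featurize))
```

On data this small both representations behave the same. The point of the sketch is the structure: hold the data and learner fixed, vary only the features, then inspect the misclassified test examples for the error analysis.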
81
Homework vs. Midterm
• The homework will be more challenging than the midterm. It should be: you have more time.
82
• Work hard
• Do the assigned readings
• Do other readings
• Be patient and have reasonable expectations
‣ you’re not supposed to understand everything we cover in class during class
• Seek help sooner rather than later
‣ office hours: by appointment
‣ questions via email
• Remember the golden rule: no pain, no gain
Course Tips
83
Questions?