fusion with sentiment scores for market research
Post on 15-Apr-2017
Fusion with Sentiment Scores for Market Research
Subrata Das
Machine Analytics, Belmont, MA
sdas@machineanalytics.com
Arup Das
Alphaserve Technologies, NY & Machine Analytics, MA
adas@{alphaserveit, machineanalytics}.com
Abstract— The recent surge in electronic and social media
has led to an explosion of sentiment data embedded in public
and private documents, fueling interest in sentiment analysis,
especially as individuals, brands and corporations look to
manage their reputational risk which is directly correlated to
company performance. In this paper, we describe two
approaches to score sentiments from a large unstructured text
corpus¹ to fuse with other relevant structured relational data:
1) a simple but effective and fast lexicon-based approach
where the score of a document is based on the occurrences of
stemmed words representing positive and negative sentiments;
and 2) a supervised machine learning approach where the
score is derived by making use of a kernel-based classification
model created from the training documents. Example
applications of these techniques can be found in our text
analytics tool called aText which can compute sentiment scores
of product reviews from Amazon and TripAdvisor to gain
market insight into products and services. Another example is
the computation of sentiment scores using aText for public and
private companies from credible financial sources which is
further fused with market data (stock price) to create a
composite index for financial analysts and traders.
Keywords—Text Analytics, Sentiment Analysis; Natural
Language Processing; Machine Learning
I. INTRODUCTION
Sentiment analysis (aka opinion mining) refers to the
identification, extraction, and quantification or scoring of
various types of subjective emotions in documents. The
recent surge in electronic and social media has led to an
explosion of data and opinion, fueling interest in sentiment
analysis to aid in market research to promote products. In
the financial domain, market movement is largely driven by
sentiments. The derived sentiment scores can be fused with
other relevant structured relational data for enhanced high-
level fusion [2] for market research and prediction. The
question is how to compute sentiment, given that consumers’ and
analysts’ opinions are expressed in vast amounts of textual
blogs, news articles, social media posts, customer feedback,
and reviews, some of which are openly available while the
rest are company proprietary. Text analytics is a process for
analyzing large text corpora to help discover information that
is strategic to an organization. For example, text analytics
can discover people’s opinions on various blog sites about a
company’s new product, or analyze customers’ sentiment
from text surveys.
¹ A corpus is a set of documents representing news articles,
blogs, emails, reviews, opinions, and such.
A fundamental technology in sentiment analysis
applications is classification, labeling documents in a corpus
with a predefined set of categories. The most common and
primitive labels are “positive” and “negative” but sentiment
can also be labeled in finer levels expressing various types
of emotions, for example. Most approaches in sentiment
analysis use bag-of-words representations [23] where a
predefined set of words (i.e., a lexicon) is used to represent a
category of emotion. Documents are classified according to the
occurrences of these words. The approach is fast but does not
take into account the wider context in which a
word occurs in a document. For contextual consideration, a piece
of text or document is converted into a feature vector or
other representation that exposes its most salient
features. Such feature vectors are used
to build models that help classify documents into the categories of
sentiment.
In this paper, we describe two approaches to score
sentiment from a large unstructured text corpus: 1) a simple but
effective and fast lexicon-based approach where the score of
a document is based on the occurrences of stemmed words
representing positive and negative sentiments; and 2) a
supervised machine learning approach where the score is
derived by making use of a kernel-based classification
model created from labeled training documents. We
demonstrate the machine learning approach with a
classification task into five rating categories that is similar
to aspect prediction. Snyder and Barzilay [31] analyzed
larger reviews in more detail by analyzing the sentiment
of multiple aspects of restaurants, such as food or
atmosphere. Shimada and Endo [30] have proposed a
method based on word variance for seeing several stars.
Pappas and Popescu-Belis [24] have proposed a method
using multiple-instance learning for aspect rating prediction.
All experiments in this paper were carried out using the Java application programming interface (API) of the in-house text analytics tool aText. The tool automatically analyzes text documents in order to extract actionable intelligence. It employs deep linguistics processing, text classification, and information extraction techniques. Given a corpus containing a set of textual documents, aText automatically extracts triples, summarizes documents, performs sentiment and social network analyses, and classifies documents in both supervised and unsupervised manners. The specific supervised classification techniques of aText that we make use of for the proposed machine learning approach to sentiment scoring are the Naive Bayesian Classifier (NBC) [7][20][4][2], k-dependence NBC [27], and Fisher Kernel (FK) [16] algorithms.
In both of the proposed sentiment scoring approaches, documents are first stemmed before classification. Stemming [10] is the conflation of the morphological variants of the same word (e.g., application, applied, applying) into a common stem (apply). In most cases, stemming leads to an improvement in classification performance. aText can also scrape specific web sites (e.g., Amazon and TripAdvisor) by making use of the underlying format of the pages on those sites.
The rest of the paper is organized as follows. Section II describes the lexicon-based approach. Section III describes the supervised classification-based approach. The concluding section touches on sentiment analysis using a richer lexicon and relates sentiment scores to the history of stock prices.
II. LEXICON BASED SENTIMENT SCORING
The lexicon-based sentiment scoring makes use of a
dictionary of words annotated with positive and negative
sentiments. A proprietary algorithm produces a positive and
a negative sentiment score for each document (the two sum to 1.0),
as shown in Figure 1. The corpus of documents in this
example contains Amazon customer reviews on a specific
television brand. The words representing positive (resp.
negative) sentiment are highlighted in green (resp. red).
Figure 2 shows the overall sentiment of all the reviews
downloaded.
Figure 1: Sentiment scoring of individual articles
The top of the split pane on the right in Figure 2 shows the frequencies of the words occurring in the corpus. The two numbers within parentheses corresponding to a word indicate the number of appearances of that word in positive and negative contexts. For example, the word “warranty” (stemmed version is “warranti”) appears about 44% in the positive context and 56% in the negative context. A large negative context will perhaps trigger the manufacturer to look into the item in more detail. The user can then highlight (in cyan) the documents where the term “warranty” is occurring.
As shown in Figure 3, we stem a document before matching its words with the dictionary. Handling negation can be an important concern in sentiment analysis. While the bag-of-words representations of “This is pleasant” and “This is not pleasant” are considered to be very similar by most commonly used similarity measures, the only differing token, the negation term, forces the two sentences into opposite classes. We recognize such negations, highlight them as shown in Figure 3, and weight them appropriately when scoring sentiment.
Figure 2: Overall sentiment score and context
Figure 3: Document stemming
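The scoring algorithm used in aText is proprietary, so the following is only a minimal Python sketch of the general idea described in this section: stem each token, match it against a sentiment dictionary, flip the polarity of a sentiment word that follows a negation term, and normalize so that the two scores sum to 1.0. The tiny lexicons, the `crude_stem` suffix stripper, the two-token negation window, and the 0.5/0.5 score for documents without sentiment words are all assumptions made for illustration and are not taken from aText.

```python
import re

# Hypothetical stemmed lexicons; a real dictionary would be far larger.
POSITIVE = {"pleasant", "great", "good", "lov"}
NEGATIVE = {"terribl", "bad", "poor", "broken"}
NEGATIONS = {"not", "no", "never"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "ly", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def sentiment_scores(text, window=2):
    """Return (positive, negative) scores that sum to 1.0."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z']+", text.lower())]
    pos = neg = 0
    negate_left = 0  # number of upcoming tokens still in a negation's scope
    for tok in tokens:
        if tok in NEGATIONS:
            negate_left = window
            continue
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if polarity and negate_left:
            polarity = -polarity  # "not pleasant" counts as negative
        if polarity > 0:
            pos += 1
        elif polarity < 0:
            neg += 1
        negate_left = max(0, negate_left - 1)
    total = pos + neg
    if total == 0:
        return 0.5, 0.5  # no sentiment words found: treated as neutral here
    return pos / total, neg / total


print(sentiment_scores("This is pleasant"))
print(sentiment_scores("This is not pleasant"))
```

Under these assumptions the two example sentences from the text land in opposite classes, which is the separation a plain bag-of-words comparison would miss.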
III. SUPERVISED CLASSIFICATION TECHNIQUES FOR
SENTIMENT SCORING
Our objective is to develop a supervised clustering and
classification technique to assign a document to one of several
sentiment categories, such as positive vs. negative or a
rating between 1 and 5. Most traditional clustering
techniques, such as feed-forward and supervised neural
networks, rely on carefully crafted data models in terms of
fixed-length vector structures of ordered n-tuples. Each
component in a vector represents some feature of an object
from the underlying problem domain. One of the early
approaches to supervised text classification with successful
applications to information retrieval, Latent Semantic
Analysis (LSA) [9], constructs feature vectors from the
terms occurring in documents. Such vectors become “very
high” dimensional to account for every term occurring in
the corpus. A similarity measure between two vectors
(usually the cosine of their contained angle in the semantic
space) is defined to cluster the vectors representing a text
corpus of documents. LSA, which is based on Singular Value
Decomposition (SVD), attempts to solve the synonymy and
polysemy problems to match documents by taking
advantage of the implicit higher-order structure of the
association of terms with articles to create a multi-
dimensional semantic structure. Other notable developments
for text classification are unsupervised but generative Latent
Dirichlet Allocation (LDA) [1], and probabilistic Latent
Semantic Analysis (pLSA) [15] and its hierarchical
extension [12]. High dimensionality remains a problem for
these techniques. In general, discriminative techniques
perform better than generative ones by learning only
classifier functions, as opposed to learning explicit relations
among variables via joint probability distributions to
facilitate sampling.
In this section, we present a hybrid approach to text document classification that leverages the better performance of discriminative classifiers and the models of generative classifiers, which can be visually inspected and adjusted by human experts. The proposed approach, kNBC/FK, trains a generative graphical probabilistic model, called a k-dependence Naïve Bayesian Classifier (kNBC) [27], and then derives a Fisher Kernel (FK) [16] from the model to incorporate into a discriminative classifier. A kNBC model overcomes the strong conditional independence assumption of simple NBC (k = 0) by capturing relevant feature dependencies that exist in a corpus. We therefore expect a classifier to achieve optimal Bayesian accuracy if the right dependencies are set in the model. The TAN algorithm in [11] for inducing conditional trees generates optimal 1-dependence Bayesian classifiers.
FK with respect to a generative model compares two data points through the directions in which they ‘stretch’ the parameters of the model. This is achieved by comparing the two gradients of the derived score vectors at the two points as a function of the parameters. Thus the derived score vector corresponding to a sample text document explains how much parameters of the kNBC model contribute to generate the example, enabling the kernel approach to compare two documents with different numbers of features via any discriminant classifier. Our approach is suitable for handling a very high-dimensional feature space by discarding irrelevant features based on a mutual information measure during the kNBC model construction process without compromising the classifier quality.
This derivation of FK from a kNBC model seems to be the first in the literature. In principle, FK can be derived for any generative model with a differentiable likelihood function. Shi et al. [28] derived FK for an NBC model. Denoyer and Gallinari [6] developed a specialized Bayesian network for structured document classification. This generative model has been transformed into a discriminant classifier using the method of FK. Sewell [29] has trained generative hidden Markov models on market data to derive a FK for a discriminative SVM. Nicotra et al. [22] extracted FK from a Hidden Tree Markov Model. Holub et al. [14] chose a simplified probabilistic Constellation model to derive FK, showing strong performance improvements for classification tasks over the corresponding generative approach. Dick and Kersting [8] developed FKs for relational data and empirically showed performance improvements over the results achieved without FKs. They used Bayesian logic programs as the relational model. Such models integrate definite logic programs with Bayesian belief networks. Perronnin and Dance [26] proposed a framework for image categorization where the underlying generative model is a Gaussian mixture model approximating the distribution of low-level features in images from a visual vocabulary.
The experimental evaluation is carried out with the well-known TripAdvisor collection. We apply natural language processing techniques to preprocess these collections, including stemming and XML file parsing, before applying the techniques. We show a comparable and sometimes improved performance over the baseline discriminative SVM-based classification.
The rest of the section is organized as follows. Subsection A provides kNBC background. Subsection B details an algorithm for constructing kNBC using a mutual information measure. Subsection C provides FK background and derives the FK for a kNBC model. Subsection D details the nature of the corpus that we have selected for empirical evaluation of the proposed hybrid approach. Subsection E presents the detailed evaluation results and analyses.
A. k-Dependence Naïve Bayesian Classifier (kNBC)
A kNBC [27], as shown in Figure 4, is a Bayesian network [25][18] which contains the structure of the NBC [7][20][4][2] and allows each feature $v_i$ to have a maximum of k feature nodes as parents, where the features $v_j$ are tokens in document d. By varying the value of k one can define models that smoothly move along the spectrum of feature dependence.
Figure 4: Generic structure of a k-NBC (a class variable with states $c_1, \ldots, c_n$ and feature variables $v_1, v_2, v_3, \ldots, v_n$)
Let d be a document that we want to classify and let the given set of classes be $C = \{c_1, \ldots, c_n\}$. We want to compute $p(c_i \mid d)$, for every i:
$$p(c_i \mid d) = \frac{p(c_i)\, p(d \mid c_i)}{p(d)} = \frac{p(c_i) \prod_j p(v_j \mid c_i, \pi(v_j))}{\sum_{k=1}^{n} p(c_k) \prod_j p(v_j \mid c_k, \pi(v_j))}$$
where $\pi(v_j)$ are the parents of $v_j$. Note that the computation of the posterior $p(c_i \mid d)$ after propagation of evidence e involves only a multiplication of the relevant entries from the probability tables, without requiring full belief propagation as in Bayesian networks. One requires the prior and conditional probabilities $p(c_i)$ and $p(v_j \mid c_i)$, which can either be obtained from domain experts or determined based on the keyword frequencies in documents. In NBC, the product of conditional probabilities comes from the assumption that tokens in a document are independent given the document class. This conditional independence assumption of features does not hold in most cases. For example, word co-occurrence is a commonly used feature for text classification.
We don’t need the estimated posterior $p(c_i \mid d)$ to be correct. Instead, we only need
$$\arg\max_{c_i} p(c_i \mid d) = \arg\max_{c_i} p(c_i) \prod_j p(v_j \mid c_i)$$
The score for each class can be expressed in the following tractable form for analytical purposes:
$$\log p(c_i) + \sum_j \log p(v_j \mid c_i)$$
The score is not a probability value, but is sufficient for the purpose of determining the most probable class. It reduces round-off errors due to a product of small fractions caused by a large number of tokens.
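The log-score form can be illustrated with a short Python sketch. The class names, priors and conditional probabilities below are made-up values for illustration only; the function simply adds the log prior to the sum of log conditionals for the tokens of a document and picks the class with the largest score.

```python
import math

# Hypothetical probability tables (not taken from the paper).
PRIORS = {"pos": 0.6, "neg": 0.4}
COND = {
    "pos": {"great": 0.05, "terribl": 0.001},
    "neg": {"great": 0.005, "terribl": 0.04},
}


def nbc_log_scores(tokens, priors, cond):
    """Return {class: log p(c) + sum_j log p(v_j | c)} for the tokens."""
    return {
        c: math.log(prior) + sum(math.log(cond[c][t]) for t in tokens)
        for c, prior in priors.items()
    }


def classify(tokens, priors, cond):
    """Pick the class with the highest log score."""
    scores = nbc_log_scores(tokens, priors, cond)
    return max(scores, key=scores.get)


long_doc = ["great"] * 400
# The plain product of 400 small factors underflows to zero in floating point,
# while the log score stays finite and still ranks the classes.
print(math.prod(COND["pos"][t] for t in long_doc))
print(nbc_log_scores(long_doc, PRIORS, COND)["pos"])
print(classify(long_doc, PRIORS, COND))
```

The long document shows why the log form matters in practice: the raw product can no longer be compared across classes once it underflows, whereas the log scores remain well separated.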
An example kNBC is shown in Figure 5, which is based on a ski-related document corpus of web pages. Some pages are advertisements for “shops”, some are describing “resorts”, and the rest are categorized as “other” containing articles, events, results, etc. The mutually exclusive and exhaustive set of hypotheses is the three classification classes of documents, and each child node of the network corresponds to a keyword as target attribute. In a kNBC
structure of Figure 4, an edge from $v_i$ to $v_j$ implies that the
influence of $v_i$ on the assessment of the class variable also
depends on the value of $v_j$. For example, in Figure 5, the
influence of the attribute “brand” on the class DocType (C) depends on the value of “ski,” while in the equivalent NBC (i.e., without the edges among children) the influence of each attribute on the class variable is independent of other attributes. These additional edges among children in a kNBC affect the classification process in that a value of “brand”
that is typically surprising (i.e., $p(\mathrm{brand} \mid C)$ is low) may
be unsurprising if the value of its correlated attribute, “ski,”
is also unlikely (i.e., $p(\mathrm{brand} \mid C, \mathrm{ski})$ is high). In this
situation, the NBC will overpenalize the probability of the class variable by considering two unlikely observations, while the augmented network of Figure 5 will not.
More concretely, in a suitably constructed corpus with distribution of documents among the three categories shop, resort and other as 60%, 30% and 10%, the posterior probability distribution of the class variable in the equivalent NBC given that a document has only “ski” and “brand” keywords is as follows:
$$\begin{aligned}
p(\mathrm{DocType} = \mathrm{shop} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.91 \\
p(\mathrm{DocType} = \mathrm{resort} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.08 \\
p(\mathrm{DocType} = \mathrm{other} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.01
\end{aligned}$$
Figure 5: k-NBC for document classification (class node DocType (C) with states $c_1$ = shop, $c_2$ = resort, $c_3$ = other; feature nodes “price” ($v_1$), “ski” ($v_2$), “brand” ($v_3$), “slope” ($v_4$))
While computing conditional probabilities from the frequency of occurrences, one would expect $p(\mathrm{brand} \mid \mathrm{shop}, \mathrm{ski})$ to be higher than $p(\mathrm{brand} \mid \mathrm{resort}, \mathrm{ski})$ since a web page for a ski shop is more likely to mention the keyword “brand” than a web page of a ski resort. Similarly, $p(\mathrm{slope} \mid \mathrm{resort}, \mathrm{ski})$ is likely to be higher than $p(\mathrm{slope} \mid \mathrm{shop}, \mathrm{ski})$. These kinds of dependencies are not captured in a NBC. In the kNBC, the probability distribution among the hypotheses is as follows, due to the presence of the keywords ski and brand in a web page but absence of the keyword slope:
$$\begin{aligned}
p(\mathrm{DocType} = \mathrm{shop} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.99 \\
p(\mathrm{DocType} = \mathrm{resort} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.01 \\
p(\mathrm{DocType} = \mathrm{other} \mid \mathrm{ski}, \mathrm{brand}, \neg\mathrm{slope}) &= 0.00
\end{aligned}$$
Note here the enhanced disambiguation in classification as compared to (0.91, 0.08, 0.01) obtained from the NBC presented earlier for the same evidence.
B. Algorithm for Constructing kNBC
The algorithm for constructing kNBC is provided with a set
of input labeled training instances belonging to a class C
and the value of k for the maximum allowable degree of
feature dependence. It outputs a kNBC model with
conditional probability tables determined from the input
data. The structural simplicity of kNBC (and hence NBC)
and the completeness of the input labeled instances avoid
the need for complex algorithms used for learning structure
and parameters in Bayesian networks [13][21]. The
algorithm here makes use of the following mutual information
between two variables X and Y when selecting the order of
child nodes and the k parent nodes of a child.
$$I(X; Y) = \sum_{X, Y} p(X, Y) \log \frac{p(X, Y)}{p(X)\, p(Y)}$$
The probabilities in this formula are determined by counting the number of individual and pair-wise joint occurrences of the variables in the articles.
Algorithm
– Let the used variable list S be empty.
– Let the k-dependence network BN being constructed begin with a single class node C.
– Repeat until S includes all domain features (i.e., the vocabulary containing all the terms):
    – Select feature $X_{max}$ which is not in S and has the largest value $I(X_{max}; C)$.
    – Add a node to BN representing $X_{max}$.
    – Add an arc from C to $X_{max}$ in BN.
    – Add $m = \min(|S|, k)$ arcs from m distinct features $X_j$ in S with the highest value for $I(X_{max}; X_j \mid C)$.
    – Add $X_{max}$ to S.
– Compute the conditional probability tables inferred by the structure of BN by using counts from input instances and output BN.
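The structure-selection part of the algorithm can be sketched in Python as follows. This is an illustrative reading of the steps above rather than the aText implementation: features are assumed to be binary word-presence indicators, probabilities are estimated by simple counting without smoothing, ties are broken alphabetically, and the final step of filling in the conditional probability tables is omitted. The toy corpus is invented for the example.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """I(X;Y) estimated by counting individual and joint occurrences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def cond_mutual_info(xs, ys, cs):
    """I(X;Y|C) computed as sum over classes c of p(c) * I(X;Y | C=c)."""
    n = len(cs)
    total = 0.0
    for c, count in Counter(cs).items():
        idx = [i for i in range(n) if cs[i] == c]
        total += (count / n) * mutual_info(
            [xs[i] for i in idx], [ys[i] for i in idx]
        )
    return total


def build_knbc_structure(features, labels, k):
    """features: {name: 0/1 value per document}. Returns (order, parents).

    The class node is implicitly a parent of every feature; `parents`
    lists only the additional feature-to-feature arcs (at most k each).
    """
    used, parents = [], {}
    remaining = set(features)
    while remaining:
        x_max = max(
            sorted(remaining), key=lambda f: mutual_info(features[f], labels)
        )
        ranked = sorted(
            used,
            key=lambda f: cond_mutual_info(features[x_max], features[f], labels),
            reverse=True,
        )
        parents[x_max] = ranked[: min(len(used), k)]
        used.append(x_max)
        remaining.remove(x_max)
    return used, parents


# Invented toy corpus: 4 "shop" pages followed by 4 "resort" pages.
LABELS = ["shop"] * 4 + ["resort"] * 4
FEATURES = {
    "brand": [1, 1, 1, 1, 0, 0, 0, 0],
    "slope": [0, 0, 0, 1, 1, 1, 1, 1],
    "price": [1, 0, 1, 0, 1, 0, 1, 0],
}
order, parents = build_knbc_structure(FEATURES, LABELS, k=1)
print(order, parents)
```

Features are added in decreasing order of their mutual information with the class, so the most class-informative words enter the network first and later words can only pick parents among them.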
C. Fisher Kernel (FK)
Jaakkola and Haussler [16] first introduced the notion of FK
to enable one to compare two incomplete data items with
different numbers of features via any classical discriminant
classifier. FK with respect to a generative model compares
two data points through the directions in which they
‘stretch’ the parameters of the model. This is achieved by
comparing the two gradients of the derived score vectors at
the two points as a function of the parameters. A
representative score vector of fixed length for each data
item x is first derived as follows.
The log-likelihood of a data item x with respect to a generative model M with parameters $\Theta = (\theta_1, \ldots, \theta_n)$ is defined as
$$\log L_{\Theta_M}(x)$$
The Fisher score of a data item x with respect to a generative model M with parameters $\Theta$ is defined as
$$f(\Theta_M, x) = \nabla_{\Theta} \log L_{\Theta_M}(x) = \left( \frac{\partial}{\partial \theta_i} \log L_{\Theta_M}(x) \right)_{i=1}^{n}$$
where $\frac{\partial}{\partial \theta_i}$ is the gradient operator with respect to the parameter $\theta_i$. Intuitively, the score vector explains how much parameters of the model contribute to generate the example. The Fisher information matrix with respect to a generative model M with parameters $\Theta$ is defined as
$$I_M = E\left[ f(\Theta_M, x)\, f(\Theta_M, x)^T \right]$$
where the expectation is over the generation of the data point x. The Fisher information kernel with respect to a generative model M with parameters $\Theta$ is defined as
$$\kappa(x, y) = f(\Theta_M, x)^T\, I_M^{-1}\, f(\Theta_M, y)$$
This kernel defines a distance between two data points x and y. This kernel function can be used with any kernel-based classifier, such as the support vector machine. We will make use of the practical FK
$$\kappa(x, y) = f(\Theta_M, x)^T f(\Theta_M, y)$$
and the other types of kernels that the SVM package offers when clustering Fisher vectors.
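As a rough illustration of the practical FK, the sketch below approximates the Fisher score of a document by central finite differences of a log-likelihood function and takes the dot product of two score vectors. The generative model is a deliberately simple stand-in (independent word-presence probabilities over a two-word vocabulary), not the kNBC of this paper; the vocabulary, parameter values and function names are all assumptions made for the example.

```python
import math

VOCAB = ["clean", "rude"]  # hypothetical vocabulary
THETA = [0.5, 0.2]         # hypothetical P(word present), one per vocabulary word


def bernoulli_log_lik(theta, doc_tokens):
    """Log-likelihood of a document under independent word-presence parameters."""
    present = set(doc_tokens)
    return sum(
        math.log(t) if w in present else math.log(1.0 - t)
        for w, t in zip(VOCAB, theta)
    )


def fisher_score(log_lik, theta, x, h=1e-6):
    """Approximate the gradient of log_lik(theta, x) with respect to theta."""
    score = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += h
        dn[i] -= h
        score.append((log_lik(up, x) - log_lik(dn, x)) / (2 * h))
    return score


def practical_fisher_kernel(log_lik, theta, x, y):
    """Dot product of the two Fisher score vectors (no information matrix)."""
    fx = fisher_score(log_lik, theta, x)
    fy = fisher_score(log_lik, theta, y)
    return sum(a * b for a, b in zip(fx, fy))


doc_a = ["clean", "clean", "room"]  # three tokens
doc_b = ["rude", "clean"]           # two tokens
print(fisher_score(bernoulli_log_lik, THETA, doc_a))
print(practical_fisher_kernel(bernoulli_log_lik, THETA, doc_a, doc_b))
```

Although the two documents have different numbers of tokens, both map to score vectors of the same length (one entry per model parameter), which is what allows a standard kernel classifier to compare them.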
3.1. Derivation of FK for kNBC
We provide a mathematical derivation of FK of kNBC
models (some intermediate steps are omitted due to space
limitation). Assume that C is the set $\{c_1, \ldots, c_m\}$ of mutually
exclusive and exhaustive classes of the root node and
X is the set $\{x_1, \ldots, x_n\}$ of feature nodes of the k-NBC. We
also assume that an arbitrary combination of parent states of
the node $x_i$ is denoted as $\pi^*(x_i)$ (there are $2^k$ of such if k
is the number of parents of $x_i$). Evidence $e = \{y_1, \ldots, y_n\}$ is
received on nodes $x_1, \ldots, x_n$, where each $y_i$ is categorical
($y_i = x_i$ or $\neg x_i$, i.e., $x_i$ is either true or false). The
derivation below assumes that there is either positive or
negative categorical evidence on every child node $x_i$ of
$x_1, \ldots, x_n$ corresponding to whether the word $x_i$ is present or
absent in the input document representing evidence e. The
derivation is similar in case some of the child nodes are left
uncertain or evidence is non-categorical.
Consider the following derivation of the likelihood in terms of the parameters of a kNBC model $\Theta$.
$$\begin{aligned}
P_M(e \mid \Theta) &= \sum_{i=1}^{m} P_M(e, c_i) = \sum_{i=1}^{m} P_M(c_i)\, P_M(e \mid c_i) \\
&= \sum_{i=1}^{m} P_M(c_i)\, P_M(y_1, \ldots, y_n \mid c_i) \\
&= \sum_{i=1}^{m} P_M(c_i) \prod_{j=1}^{n} P_M(y_j \mid c_i, y_1, \ldots, y_{j-1}) \\
&= \sum_{i=1}^{m} P_M(c_i) \prod_{j=1}^{n} P_M(y_j \mid c_i, \pi_e(x_j))
\end{aligned}$$
The last line follows from the fact that, given a class $c_i$, $x_j$ is independent of the non-parent nodes $\{y_1, \ldots, y_{j-1}\} - \pi_e(x_j)$ and that the parent nodes of $x_j$ are only in $y_1, \ldots, y_{j-1}$ as per the structure of a kNBC model. The parameters $P_M(c_i)$ are the probability distribution of the root node and the $P_M(y_j \mid c_i, \pi_e(x_j))$ are the conditional probabilities of the child nodes. Hence,
$$\sum_{i=1}^{m} P_M(c_i) = 1$$
$$P_M(x_j \mid c_i, \pi_k(x_j)) + P_M(\neg x_j \mid c_i, \pi_k(x_j)) = 1, \quad \text{i.e.,} \quad \sum_{y_j \in \{x_j, \neg x_j\}} P_M(y_j \mid c_i, \pi_k(x_j)) = 1$$
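for all $c_i$. We now compute the partial derivatives of the log likelihood function with respect to each conditional probability $P_M(y_j \mid c_k, \pi^*(x_j))$, where $\pi^*(x_j)$ is an arbitrary combination of the parent states and $y_j$ is a variable with domain $\{x_j, \neg x_j\}$.
$$\frac{\partial \log P_M(e \mid \Theta)}{\partial P_M(y_j \mid c_k, \pi^*(x_j))} = \frac{\partial \log P_M(e \mid \Theta)}{\partial P_M(y_j \mid c_k, \pi_e(x_j))} = \frac{1}{P_M(e \mid \Theta)} \sum_{i=1}^{m} P_M(c_i) \sum_{l=1}^{n} \prod_{p \neq l} P_M(y_p \mid c_i, \pi_e(x_p))\, \delta(c_i, c_k)\, \delta(y_l, y_j)$$
where $\delta(x, y) = 1$ if $x = y$ else 0. Pushing the second summation inside the product, we obtain the result below after a simple rearrangement.
$$\frac{\partial \log P_M(e \mid \Theta)}{\partial P_M(y_j \mid c_k, \pi^*(x_j))} = \frac{P_M(c_k \mid e, \Theta)}{P_M(y_j \mid c_k, \pi_e(x_j))}$$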
Note that $\pi_e(x_j)$ is unique given e and hence we will have only two partial derivatives irrespective of the number of combinations of parents. The partial derivative with respect to each prior probability $P_M(c_i)$ (only $m - 1$ probabilities are independent) of classes is the following:
$$\frac{\partial \log P_M(e \mid \Theta)}{\partial P_M(c_k)} = \frac{P_M(c_k \mid e, \Theta)}{P_M(c_k)} - \frac{P_M(c_1 \mid e, \Theta)}{P_M(c_1)}, \quad k = 2, \ldots, m$$
As mentioned earlier, the computation of the posterior $P_M(c_k \mid e, \Theta)$ in the above expressions after propagation of evidence e is just a multiplication of the relevant entries from the probability tables of the model.
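The two partial-derivative expressions can be turned into a fixed-length score vector. The Python sketch below does this for the simplest case only, k = 0 (plain NBC, so no feature has feature parents), with binary word-presence evidence. The layout of the vector (one slot per class, feature and state, with zero for the unobserved state) and the probability values are assumptions chosen for illustration.

```python
def posterior(priors, cond, evidence):
    """P(c_k | e) by multiplying table entries.

    cond[k][j] is P(x_j = true | c_k); evidence[j] is True if word j is present.
    """
    joint = []
    for k, prior in enumerate(priors):
        p = prior
        for j, present in enumerate(evidence):
            p *= cond[k][j] if present else 1.0 - cond[k][j]
        joint.append(p)
    z = sum(joint)
    return [p / z for p in joint]


def fisher_score_nbc(priors, cond, evidence):
    """Analytic Fisher score for an NBC (k = 0) following the derivation above."""
    post = posterior(priors, cond, evidence)
    score = []
    # Priors: only m - 1 are independent, so one entry for each k = 2..m.
    for k in range(1, len(priors)):
        score.append(post[k] / priors[k] - post[0] / priors[0])
    # Conditionals: posterior of the class divided by the observed parameter;
    # the slot of the unobserved state stays at zero.
    for k in range(len(priors)):
        for j, present in enumerate(evidence):
            p_true = cond[k][j]
            score.append(post[k] / p_true if present else 0.0)
            score.append(0.0 if present else post[k] / (1.0 - p_true))
    return score


def practical_kernel(fx, fy):
    """Dot product of two Fisher score vectors."""
    return sum(a * b for a, b in zip(fx, fy))


# Hypothetical two-class, two-word model.
PRIORS = [0.5, 0.5]
COND = [[0.8, 0.1], [0.2, 0.7]]
fx = fisher_score_nbc(PRIORS, COND, [True, False])
fy = fisher_score_nbc(PRIORS, COND, [False, True])
print(len(fx), practical_kernel(fx, fy))
```

Every document yields a vector of length (m - 1) + 2mn regardless of how many words it contains, so the vectors can be passed directly to a kernel-based classifier.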
D. Experimental Setting
In this section we present details of the collection
TripAdvisor, and various statistics pertaining to the
preprocessed collection. The TripAdvisor corpus was built
by Baccianella et al. and consists of 15,763 hotel reviews
from the TripAdvisor Web site
(http://www.tripadvisor.com), a popular site for reviewing
tourism-related activities. Each review is labeled with a
score of one to five “stars”. Figure 6 shows three samples
with 5, 3 and 1 stars. Note the usage of highly
discriminatory words like “fantastic”, “okay” and “terrible”
in coherence with the degree of ratings.
We have used 10,508 documents for training and 5,255 documents for testing. The training set contains 23,341 unique stemmed words.
Figure 6: Example TripAdvisor Reviews
The distribution of labels is highly skewed, since 44% of all the training articles have a global score of 5 stars, 34.8% a global score of 4 stars, 10% 3 stars, 7.1% 2 stars and only 4.1% 1 star. Test articles have a similar distribution.
Topic (stars)   Number of training articles
5               4630
4               3643
3               1052
2               752
1               431
Table 1: Distribution of TripAdvisor training articles
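The skew in Table 1 can be checked with a few lines of Python. The counts are copied from the table; the "majority baseline" (always predicting 5 stars) is not something the paper reports and is added here only as a rough point of reference for the hit rates discussed later.

```python
# Training counts per star rating, copied from Table 1.
counts = {5: 4630, 4: 3643, 3: 1052, 2: 752, 1: 431}

total = sum(counts.values())
shares = {stars: 100.0 * n / total for stars, n in counts.items()}

# Share of the most frequent label: the accuracy on the training set of a
# trivial classifier that always predicts 5 stars (an added reference point).
majority_baseline = max(shares.values())

for stars in sorted(shares, reverse=True):
    print(f"{stars} stars: {shares[stars]:.1f}%")
print(f"majority baseline: {majority_baseline:.1f}%")
```

The counts add up to the 10,508 training documents, and the percentages recomputed this way may differ in the last digit from those quoted in the text because of rounding.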
This kind of skewed distribution tends to make the
classification task for the least frequent scores difficult.
Figure 7 shows a fragment of the kNBC for the TripAdvisor
corpus showing dependence between stemmed words. As
expected, the word “terrible” has dependence on words such
as “rude” and “worst” and the word “comfort” has
dependence on words such as “clean” and “walk”.
Figure 7: Example kNBC (k = 2) dependencies for the TripAdvisor Corpus (class variable with states 5, 4, 3, 2, 1; feature nodes shown include “rude”, “terribl”, “worst”, “clean”, “walk” and “comfort”)
Figure 8 shows a screenshot for category prediction in aText
using NBC. The corresponding ground truth is shown in the
popup dialog window.
Number of articles (or reviews): 15763
Number of stemmed words: 23341
Number of topics: 5
Total number of training articles: 10508
Total number of test articles: 5255
Sample reviews (review ID, text, star rating):
253154_3638452 Terrible experience The twin room booked turned out to be a single with single bed and camp bed By our third night we had an ant infestation which the management were unwilling to deal with Eventually I had to spray the room and wait till 1 30am before being able to return We left _PROS_Nothing _CONS_Nothing 1
274573_3994017 Perfectly okay The Hotel Suisse was just okay Daniela was very helpful but the other people at the front desk were not We found their attitude less than desirable The rooms were clean and spacious Breakfast delivered to the room was a bit ackward but certainly doable The location to the Spanish steps was quite helpful easy access to the Metro and a taxi stand They also asked for 1 night stay in cash which we thought was odd _PROS_Nothing _CONS_Nothing 3
203223_3024338 Fantastic For the money the location is fantastic It is about 300 yards from a metro station From the hotel you can walk to the colosseum you can practically see it when you leave the hotel Yes the rooms are small but the hotel has great character I would definitely stay there again and would recommend it to others _PROS_Nothing _CONS_Nothing 5
Figure 8: Predicting review category
E. Evaluation Results
The performance analysis of kNBC and FK on the
TripAdvisor corpus is presented in this section. To reduce
the complexity of dealing with thousands of children in a
kNBC model, we have discarded all children sharing very
low mutual information with the root node, using
the formula for $I(X; Y)$ for mutual information presented
earlier. We have shown experimentally that the
discarded children indeed contribute very little or nothing to
improved performance. We then computed the
performances of kNBC and FK by varying the value of the
dependence degree k, and compared them against the
baseline SVM.
As shown in the table below, the performance for kNBC (k = 0) does not degrade with the progressive reduction of the number of child nodes by increasing the mutual information threshold (Hit Rate and F-Measure are formally defined just a little later).
NBC Model
Mutual Information Threshold   0.001   0.002   0.003   0.004   0.005   0.006   0.007
No. of Children                3521    401     237     160     117     79      61
Hit Rate (%)                   57.9    57.1    57.7    57.1    58.2    57.8    57.0
F-Measure (%)                  89.8    89.6    90.5    90.7    91.3    91.7    91.6
Table 2: Performance of kNBC by varying number of children
The table above suggests the performance stabilizes after
the threshold value 0.003. The baseline SVM performance
for this value is as follows:
With a kNBC consisting of 117 children and k = 2, the
confusion matrix is shown in Table 3. From this table, we have
computed the “hit rate”, i.e., the sum of the diagonal
elements as a percentage of the total number
of reviews. This is equal to 58.9%. A plausible explanation
of the low performance on the TripAdvisor corpus is the
blurriness between two consecutive ratings. Many words in
reviews from satisfied customers rating 4 and 5 are likely to
be common; so will be the case with ratings of 1 and 2.
kNBC Model 5 4 3 2 1
5 1662 602 56 25 10
4 535 996 220 61 9
3 49 167 153 83 9
2 25 60 80 138 71
1 12 6 15 71 140
Table 3: TripAdvisor Confusion Matrix
If we now transform the classification into a binary
classification problem with ratings 4 and 5 in the “positive”
class and ratings 1-3 in the “negative” class, performance
becomes acceptable. We then follow the usual precision
and recall definitions and the following definition of F-measure:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
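The hit rate and the binary precision, recall and F-measure can be recomputed from the confusion matrix in Table 3 with the Python sketch below. The paper does not state whether rows are predicted or actual ratings; the sketch assumes rows are predictions and columns are ground truth, which is the orientation under which the recomputed precision and recall come out at roughly 0.909 and 0.922, in line with the reported 90.9% and 92.2%.

```python
# Confusion matrix from Table 3; rows and columns are ordered 5, 4, 3, 2, 1.
LABELS = [5, 4, 3, 2, 1]
MATRIX = [
    [1662, 602, 56, 25, 10],
    [535, 996, 220, 61, 9],
    [49, 167, 153, 83, 9],
    [25, 60, 80, 138, 71],
    [12, 6, 15, 71, 140],
]
POSITIVE = {4, 5}  # ratings merged into the "positive" class


def hit_rate(m):
    """Sum of the diagonal divided by the total number of reviews."""
    return sum(m[i][i] for i in range(len(m))) / sum(map(sum, m))


def binary_metrics(m, labels, positive):
    """Precision, recall and F-measure after merging ratings into two classes."""
    pos = [i for i, lab in enumerate(labels) if lab in positive]
    tp = sum(m[i][j] for i in pos for j in pos)
    predicted_pos = sum(sum(m[i]) for i in pos)          # assumed: rows = predictions
    actual_pos = sum(row[j] for row in m for j in pos)   # assumed: columns = truth
    precision = tp / predicted_pos
    recall = tp / actual_pos
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


print(f"hit rate: {hit_rate(MATRIX):.3f}")
precision, recall, f_measure = binary_metrics(MATRIX, LABELS, POSITIVE)
print(f"precision: {precision:.3f}  recall: {recall:.3f}  F: {f_measure:.3f}")
```

Swapping the assumed orientation exchanges precision and recall but leaves the hit rate and the F-measure unchanged, so the conclusion about the binary task does not depend on that assumption.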
We have now varied the dependencies, and the table below
shows the hit rate and F-measure obtained by varying the value of k
between 0 and 3.
K=0 (NBC) K=1 K=2 K=3
kNBC 83.2% 83.5% 85.1% 85.5%
kNBC/FK 81.9% 86.2% 87.4% 88.9%
Table 4: Comparison of performances between kNBC and FK
It’s clear that k = 2 yields the optimum performance and that
the performance of the hybrid kNBC/FK approach is
comparable to the baseline SVM performance.
IV. CONCLUSIONS
We have presented two different ways of computing
the sentiment score of a document, namely, lexicon-based and
supervised-classification-based. In the financial domain, we
have used a richer lexicon with ten categories, including
positive and negative, as shown in Figure 9.
We have also experimented with about 300 articles written by analysts in 2015 on a particular company. We plotted the sentiment trend over a number of intervals, as shown in the bottom panel of Figure 10 (blue represents positive and red represents negative, and the sum of the two scores at any time point is 1.0), and superimposed it on the stock prices of the period. The correlation between the two graphs is evident in some segments. We have also defined a volatility index of sentiment reflecting a measure of price ups and downs during the period. The numeric sentiment scores over the time period can be incorporated into any time-series regression algorithm predicting the stock price.
Dependency k: 2; Overall: 58.9%; Precision: 90.9%; Recall: 92.2%; F-Measure: 91.5%
Mutual information threshold: 0.003; Reduced number of children nodes: 117; Baseline SVM performance: 60.7%
Figure 9: Finer level of sentiment scoring
Figure 10: Correlating sentiment trend with stock prices
Our future plan is to enhance the scoring techniques via
deep linguistics processing and unsupervised classification
approaches of aText such as LDA and PLSA.
REFERENCES
[1] Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. J. of Machine Learning Research, 3(5):993–1022.
[2] Das, S. (2008). High-Level Data Fusion, Artech House, MA, USA.
[3] Das, S. (2012). “A framework for distributed high-level fusion,” In Net-centric Distributed Fusion, D. Hall, J. Llinas, M. Liggins, C. Chong (eds.), CRC Press/Taylor and Francis.
[4] Das, S. (2014). Computational Business Analytics, Chapman and Hall/CRC Press.
[5] Das, S., Ascano, R., and Macarty, M. (2015). “Distributed Big Data Search for Analyst Queries and Data Fusion,” International Conference on Information Fusion.
[6] Denoyer, L. and Gallinari, P. (2004). “Bayesian network model for semi-structured document classification,” Information Processing and Management, Vol. 40, pp. 807–827.
[7] Duda, R., and Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley, NY.
[8] Dick, U. and Kersting, K. (2006). “Fisher Kernels for Relational Data,” Proc. of the 17th European Conference on Machine Learning, Springer-Verlag, pp. 114–125.
[9] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., and Harshman, R. (1988). “Using latent semantic analysis to improve access to textual information,” Proc. of the Conf. on Human Factors in Computing Systems (CHI), pp. 281–286.
[10] Frakes, W. (1992). “Stemming Algorithms,” In: W.B. Frakes and R. Baeza-Yates (eds), Information Retrieval: Data Structures and Algorithms, Prentice Hall, pp. 131–160.
[11] Friedman, N., Geiger, D., and Goldszmidt, M. (1997). “Building classifiers using Bayesian networks,” Machine Learning, Vol. 29, pp. 131–163.
[12] Gaussier, E., Goutte, C., Popat, K., and Chen, F. (2002). “A Hierarchical Model for Clustering and Categorising Documents,” Adv. in Information Retrieval – Proc. of the 24th BCS-IRSG European Colloquium on IR Research (ECIR).
[13] Heckerman, D. E. (1996). “A tutorial on learning Bayesian networks,” Technical Report: MSR-TR-95-06, Microsoft Corporation, Redmond, WA.
[14] Holub, A., Welling, M., and Perona, P. (2005). “Combining Generative Models and Fisher Kernels for Object Recognition,” Proc. of the IEEE Int. Conf. on Comp. Vision.
[15] Hofmann, T. (1999). “Probabilistic Latent Semantic Analysis,” Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI’99, Stockholm.
[16] Jaakkola, T. and Haussler, D. (1999). “Exploiting Generative Models in Discriminative Classifiers,” Advances in Neural Information Processing Systems 11, Bradford Books. Cambridge, MA: The MIT Press, pp. 487–493.
[17] Joachims, T. (1998). “Text categorization with support vector machines: Learning with many relevant features,” Proc. of the 10th European Conf. on Machine Learning, Springer-Verlag, pp. 137–142.
[18] Jensen, F. V. (2002). Bayesian Networks and Decision Graphs, Springer-Verlag, NY.
[19] LIBSVM. (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[20] Mitchell, T. (1997). Machine Learning. McGraw-Hill, NY.
[21] Neapolitan, R. E. (2003). Learning Bayesian Networks, Prentice Hall, Upper Saddle River, NJ.
[22] Nicotra, L., Micheli, A., and Starita, A. (2004). “Fisher Kernel for Tree Structured Data,” Proceedings of the 2004 IEEE Int. Joint Conference on Neural Networks, Vol. 3, pp. 1917–1922.
[23] Pang, B. and Lee, L. (2008). “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, Vol. 2, No 1-2, pp 1–135.
[24] Pappas, N. and Popescu-Belis, A. (2014). “Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis,” Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing (EMNLP), pp. 455–466.
[25] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann.
[26] Perronnin, F. and Dance, C. (2007). “Fisher Kernels on Visual Vocabularies for Image Categorization,” Computer Vision and Pattern Recognition (CVPR), pp. 1–8.
[27] Sahami, M. (1996). “Learning limited dependence Bayesian classifiers,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 335-338.
[28] Shi, Z., Huang, Y., and Zhang, S. (2005). “Fisher Score Based Naive Bayesian Classifier,” Int. Conf. on Neural Networks and Brain (ICNN&B), Vol. 3(13-15), pp. 1616–1621.
[29] Sewell, M. (2007). “Fisher Kernel,” Department of Computer Science, University College London, April 2007.
[30] Shimada, K. and Tsutomu, E. (2008). “Seeing several stars: A rating inference task for a document containing several evaluation criteria,” Advances in Knowledge Discovery and Data Mining, 12th Pacific-Asia Conference, PAKDD, pp. 1006–1014.
[31] Snyder, B. and Barzilay, R. (2007). “Multiple aspect ranking using the good grief algorithm,” Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 300–307.