COMP8755 – Individual Computing Project
Text Classification: Producing explainable and
performant solutions using hybrid lexicon-based
approaches
Name: Qiutian Chen
University ID: u5789678
Supervisors: Priscilla Kan John
Kerry Tylor
Introduction
Background
Automatically classifying text documents is an interesting and challenging subject in document-oriented information systems from areas such as libraries and social media (Twitter, blogs, Facebook, reviews, etc.). It provides an efficient path for users to browse massive amounts of data with ease (Clos, et al., 2017). Moreover, presenting why a classifier gives a particular result helps humans to understand its mechanism and to better evaluate whether the result is as accurate as expected. Reasonable explanations also make the classifications more convincing to humans. This report describes the research procedure and the current achievement of re-implementing Clos et al.'s (2017) work.
Project Motivation
In real-world cases there is a massive amount of data to be categorized so that its features can be better indicated. Automatic classification therefore helps people find the information they want to retrieve. The collected and categorized data can be useful in business and helpful to most users. For instance, categorized customer information can help companies learn users' preferences and interests, from which they can also give recommendations to attract more consumers. Besides, the features extracted by the classifiers present the reasons for the classification, which helps managers validate whether it is reasonable by human perception. In the future, more real-world industries will require documents to be classified in an intelligent way, so efficient classifiers with extracted features will have a wide range of applications.
Project Aim
In this project, the primary aims are as follows: 1) to study different kinds of classifiers, which can be specialized for distinct input data types (text, table entries, visual information, voice information, etc.); and 2) to implement the classifier (RALAXNET) described in Clos et al.'s work, which is used to automatically categorize target data. The datasets involved cover two particular classification standards, stance and sentiment, and there are separate datasets for each standard. Furthermore, the significant features that have the most influence on classification should be extracted. After the classifier model has been created (trained), the extracted features should be used to mark the input text in a distinct way (colours, fonts, underlines, etc.) wherever those feature words appear in the text.
Literature Review
Classifiers
The construction of a classifier is based upon the algorithm supporting it. Fernández-Delgado et al. (2014) compared hundreds of classifiers on different datasets. They indicate that different types of classifiers have their own characteristics and their own target areas/fields and datasets. Hence it is essential to examine the algorithm to ensure that the selected classifier is suitable for the data to be processed.
Lexicon-based Classifiers
Overview
A lexicon-based classifier can be expressed as a dictionary-like table, which stores the corresponding weight of each word in it. The weights are used to scale, for example, how a word affects a sentence, or in other words, how a term in a sequence influences which class that entity tends to belong to.
Three different techniques are generally used to build a lexicon (Clos, et al., 2017). The first is the traditional hand-crafted lexicon, which is built manually, with all included words rated by different individuals. Some samples of weighted words in such a lexicon are listed in Table 1 below. The features in the table are distinctive in their sentiment; they are extracted from VaderSentiment (Hutto & Gilbert, 2014), which is used in this project as the reference pure lexicon-based classifier.
Word    | Mean score | Standard deviation | Manually rated scores from -4 to 4 by 10 individuals
peace   | 2.5        | 1.0247             | 3  2  1  4  2  3  4  1  3  2
poison  | -2.5       | 0.92195            | -4 -3 -2 -4 -2 -2 -2 -3 -1 -2
secure  | 1.4        | 0.4899             | 1  2  1  1  2  1  1  2  2  1
slavery | -3.8       | 0.4                | -4 -3 -4 -3 -4 -4 -4 -4 -4 -4
aboard  | 0.1        | 0.3                | 0  0  0  0  1  0  0  0  0  0
Table 1
In this table, a positive score indicates that the corresponding word tends towards positive sentiment, and a negative score the opposite. Besides the mean of the ten scores, which is used as the weight, the lexicon also provides the standard deviation of each word's scores to show how much the ten individual ratings fluctuate, which indicates to some extent how reliable the mean score is. The second technique is the ontology-based lexicon, which is generated by propagating over a graph from known seed words through the synonymy, antonymy and hypernymy relationships connected to them (Esuli & Sebastiani, 2006). As this technique is not used in this project, it will not be discussed further. The third technique is the corpus-statistics-based lexicon, which mainly involves conditional probability and pointwise mutual information, both frequently used in statistics. However, this technique has a higher chance of overemphasizing the connection between terms and classes. At a simple statistical level, if a term such as "September" happens to occur very often within one class, even though the term generally has no tendency towards any class, its weight in the lexicon might be misled towards that class.
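To illustrate how such a hand-crafted lexicon is applied, the minimal sketch below sums the Table 1 mean scores over the words of a sentence; the whitespace tokenization and the tiny five-word lexicon are simplifying assumptions and not the actual VaderSentiment implementation.

```python
# Minimal sketch of a hand-crafted lexicon scorer (illustrative only).
# The lexicon holds the mean scores from Table 1; words that are not in
# the lexicon contribute nothing to the sentence score.
lexicon = {
    "peace": 2.5,
    "poison": -2.5,
    "secure": 1.4,
    "slavery": -3.8,
    "aboard": 0.1,
}

def score_sentence(sentence: str) -> float:
    """Sum the lexicon weights of the words in the sentence."""
    tokens = sentence.lower().split()
    return sum(lexicon.get(token, 0.0) for token in tokens)

print(score_sentence("peace feels secure"))  # 2.5 + 1.4 = 3.9, a positive trend
print(score_sentence("slavery is poison"))   # -3.8 - 2.5 = -6.3, a negative trend
```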
Advantages and Limitations
A hand-crafted lexicon can be viewed as an open dictionary in which people are able to check detailed information at any time. In addition, as its foundation is human-built, it can also be easily modified and tuned when required. On the other hand, building a complete lexicon is a huge amount of work: there are more than fifteen thousand English words in current use, many of which have different meanings, so it may consume a lot of time. Even when a lexicon is successfully built, it also takes time and effort to test and evaluate it on different types of data, and its accuracy still cannot be guaranteed.
Classifiers Based on Machine Learning
Overview
Machine learning is a tool that allows a system to automatically learn programs from data (Domingos, 2012). In the last decade, the development of machine learning has significantly accelerated and it has spread across more areas, while the most widely used and mature area involving machine learning is classification. A classifier based on machine learning is generally a system that extracts features from multiple input samples and characterizes them to generate a model, which can be used to predict the class of new incoming inputs.
Advantages and Limitations
The performance of machines such as personal computers and laptops is rapidly increasing nowadays, so machine learning models can be supported more easily. In this case, using a machine to learn from a large amount of data is much more efficient than manually building a lexicon or compiling statistics. Furthermore, it is possible to generate multiple models for distinct types of data without spending much time, which is flexible and practical, rather than building one lexicon that might not be able to fit all the data. However, machine learning requires datasets that are pre-labelled with classes (details will be specified below), so extra effort is required to construct well-formatted datasets to feed the machine learning model. Also, overfitting is a problem that often occurs in machine learning systems: the model fits the training data too closely, so it may fail on new inputs or classify data with low accuracy.
Datasets
As mentioned above, two classification standards, stance and sentiment, are used in the fundamental reference work. Investigating the original datasets involved, the datasets extracted from the Internet Argument Corpus (IAC) and the CREATEDEBATE forum (CD) were used for stance classification, and the Amazon, Yelp and IMDb reviews (AYI) and Amazon user reviews (AMZ) were used for sentiment classification. Considering the size and richness of the datasets, the CD and AYI datasets are applied in this project for stance and sentiment classification respectively.
The CD dataset includes users' forum posts on different topics and issues, each post labelled with the user's stance. The AYI dataset includes reviews from three large online platforms: Amazon (online electronic commerce), Yelp (crowd-sourced reviews about local restaurants and businesses), and IMDb (the Internet Movie Database).
Potential Tasks
In addition to automatic classification, the technique can also be applied or extended to other areas. In this case, there are two potential tasks in this project: 1) emotion detection from text (Bandhakavi, et al., 2017) and 2) re-processing search engine results (Chen & Dumais, 2000).
To apply the classifier to emotion detection, the datasets should first be categorized by emotion. As emotion potentially relates to both stance and sentiment, there is a chance that the classifier also works for emotion detection. Re-processing search engine results requires an interface that works as a medium allowing the user to interact with the results. The agent takes the user's input as the keywords to be retrieved via a search engine, re-processes the returned results with the classifier, and then displays them to the user.
Project Procedure and Changes Made
Initial steps
Pure Lexicon based classifier
As the classifier to be implemented is lexicon-based, it is necessary to study how a lexicon-based classifier estimates its outcome. In consideration of this, a pure lexicon-based classifier named VaderSentiment (Hutto & Gilbert, 2014) is used as a reference. This classifier has a manually built lexicon of more than 7000 words. Each word was rated by 10 individuals, who gave scores from -4 to 4, and the mean of these scores is used as the weight. Given an input text, the classifier reports its tendency to be negative, neutral or positive, together with a compound score. Examples are shown in Figure 1.
Figure 1
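As an aside, a minimal usage sketch of the VaderSentiment package is shown below; the sample sentence is an assumption, and the exact numbers returned depend on the library version.

```python
# Sketch of querying VaderSentiment for the scores described above
# (requires the vaderSentiment package: pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "The movie was surprisingly good, but the ending felt rushed."  # sample input
scores = analyzer.polarity_scores(text)
# 'scores' is a dict with the negative, neutral and positive proportions
# ('neg', 'neu', 'pos') and a normalized 'compound' score in [-1, 1].
print(scores)
```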
In addition, some attempts to modify the original code have been made. To display more detailed weighting information from the classifier, how each word in the text affects the result is also printed, as shown in Figure 2.
Figure 2
Tool Selection
Machine learning is currently supported in many development languages on different platforms. As Python is a friendly object-oriented programming language supported by many machine learning and data mining tools, it is selected as the language for implementing the target classifier in this project. In Python, scikit-learn is a widely recommended tool for designing and implementing machine learning models, with many built-in classifiers and algorithms. Therefore, it is preliminarily considered for use in this project.
Baseline Classifiers Selection
Before implementing the target classifier, several popular and widely used classifiers are used as a starting point. This helps to better understand the mechanism of machine learning and the basic rules that need to be taken care of in the later implementation.
The baseline classifiers selected are: Naïve Bayes, SVM, and Decision Tree.
Naïve Bayes
This is a probabilistic classifier that applies Bayes' theorem. It is commonly used in fields such as email filtering and simple text classification (a few categories). The core approach is to extract word frequencies from the input data as features.
SVM
SVM is the abbreviation of support vector machine. It is a model that represents each sample as a point in space and uses a gap (boundary) to separate the points into different classes.
Decision Tree
A decision tree can be imagined as a model with branches extending like a real tree. It is a flowchart-like graph in which each node is a test leading to branches representing the outcomes of that test, eventually reaching the end of the tree with a consequence (class label). Decision trees are also popular in game theory for analysing strategies and tactics.
Classifier    | Precision | Recall
Naïve Bayes   | 0.801     | 0.800
SVM           | 0.831     | 0.830
Decision Tree | 0.640     | 0.640
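For illustration, the sketch below trains the same three baselines with scikit-learn; the placeholder reviews and labels, the train/test split, and the weighted averaging of precision and recall are assumptions rather than the exact setup that produced the figures above.

```python
# Sketch of the three baseline classifiers on a word-frequency representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# Placeholder reviews standing in for the real dataset (1 = positive, 0 = negative).
texts = ["great product, works well", "love it, highly recommend",
         "terrible quality, broke quickly", "awful, do not buy"]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()                   # term-frequency based features
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("Decision Tree", DecisionTreeClassifier())]:
    clf.fit(X_train_vec, y_train)
    pred = clf.predict(X_test_vec)
    print(name,
          precision_score(y_test, pred, average="weighted", zero_division=0),
          recall_score(y_test, pred, average="weighted", zero_division=0))
```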
Classifier Implementation
Structure description
The target classifier to be implemented is a classifier based on machine learning. Whenever a sequence of terms (a sentence, a review, a post, etc.) is fed into the model, it is split into single terms, and their summed weights (randomly initialized) combined with the scores from the context window are compared with the class label; the result is then used to update the terms' weights, which is a form of regression. After all the inputs have been processed, a lexicon-like model containing the weights is generated, which predicts the class of a new input according to the terms it contains. A diagram of this structure is shown in Figure 3.
Figure 3
Design and implementation
This structure requires a convolutional neural network to configure the layers of the model, which is not supported by the scikit-learn tool. Hence Keras in Python is used as the tool to implement this classifier.
a) Convert the input into vectors:
After loading all the data in the dataset, it is essential to convert it into a form that can be fed into the model in a clear and simplified format. In this case, all words appearing in the input files form a vocabulary in which each word has its own unique index. Each input sequence of terms is then represented using the corresponding indices of those terms. Similarly, the possible modifiers (adverbs and conjunctions) are also converted into a vector of indices. Example vectors are shown in Figure 4, and a small conversion sketch follows the figure.
Figure 4
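A minimal sketch of this conversion using the Keras tokenizer is shown below; the sample sentences, the vocabulary limit of 400 words and the padded sequence length are illustrative assumptions.

```python
# Sketch of converting raw sentences into index vectors, as in step (a).
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["the dog runs quickly", "the cat sleeps"]  # placeholder inputs

tokenizer = Tokenizer(num_words=400)       # keep only the 400 most frequent words
tokenizer.fit_on_texts(sentences)          # build the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=10)  # pad/truncate to a fixed length

print(tokenizer.word_index)  # e.g. {'the': 1, 'dog': 2, ...}
print(padded)                # index vectors ready to feed the model
```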
b) Train the model:
When a sentence is fed into the model, for instance "the dog runs quickly" in Figure 5, the corresponding place in the matrix is set to 1 for each term. The context window has size 3, which means the one word right before the current term and the one word right after it are included. As the word "quickly" is a modifier in this case, the context window for "runs" contains a modifier, namely the word "quickly" itself. A small sketch of extracting such context windows is given after Figure 5.
Figure 5
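The following small helper sketches how a size-3 context window can be extracted for each term; the padding token used at the sentence boundary is an assumption.

```python
# Illustrative extraction of a size-3 context window (one word before the
# current term, the term itself, and one word after).
def context_windows(tokens, size=3):
    half = size // 2
    padded = ["<pad>"] * half + tokens + ["<pad>"] * half
    for i, term in enumerate(tokens):
        yield term, padded[i:i + size]

for term, window in context_windows("the dog runs quickly".split()):
    print(term, window)
# The window for "runs" is ['dog', 'runs', 'quickly'], so the modifier
# "quickly" falls inside the context window of "runs".
```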
c) In machine learning, there are some hyperparameters that can be configured and that may influence the training result. By tuning these hyperparameters, the generated model may achieve higher accuracy on new input. The hyperparameters used to tune the model in this project are listed below:
1. Epochs: one epoch is one pass of the ENTIRE dataset forward and backward through the neural network exactly ONCE. Multiple epochs generally allow the model to fit the training data more closely and reduce the influence of diverse data.
2. Batch size: the total number of training examples present in a single batch. In most cases it is impractical to pass the entire dataset through the neural network at once, so the dataset needs to be divided into several batches of the declared batch size.
3. K-fold cross validation: k-fold cross validation is generally used to prevent overfitting and to select the best model from the same training data. For example, in 10-fold cross validation, the training data is split into 10 groups. With one group held out each time, 10 different models are generated from the remaining nine groups, and each held-out group is used to test the corresponding model, providing an accuracy score. The mean of these 10 scores is used to evaluate the approach, and the model with the highest score is the best model to use later. A sketch of this procedure is shown below.
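The sketch below assumes a build_model function that returns a freshly compiled Keras model (with accuracy as a metric) and NumPy arrays X and y; these names are placeholders rather than the project's actual code.

```python
# Sketch of k-fold cross validation over the training data.
import numpy as np
from sklearn.model_selection import KFold

def evaluate_with_kfold(build_model, X, y, epochs=30, batch_size=10, n_splits=10):
    scores = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kfold.split(X):
        model = build_model()                      # fresh model for every fold
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        _, accuracy = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(accuracy)                    # held-out accuracy of this fold
    return float(np.mean(scores)), scores          # mean score and per-fold scores
```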
Alternative solutions
While attempting the implementation, I tried to configure the convolutional kernel, which acts as the context window in the structure, so that it would consider ONLY the modifiers, but due to the time limitation I was unable to complete this part in time. An alternative solution to this issue is therefore applied: in the convolutional layer, the kernel is set to size 3 rather than manually configured. The final version of the classifier model is therefore as follows:
Layer 1: an embedding layer to map the term indices; the number of features is limited to the top 400 most frequently occurring words, which is set to reduce noise during training and to improve speed.
Layer 2: a convolutional layer with kernel size 3 to apply the context window.
Layer 3: a hidden layer with 250 units of neurons.
Layer 4: an output layer using softmax to decide the class distribution.
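A minimal Keras sketch of these four layers is given below; the embedding dimension, the number of convolution filters and the number of classes are assumptions, and a global max-pooling step is added (not listed above) to flatten the convolution output before the dense layers.

```python
# Sketch of the final classifier structure described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

def build_model(vocab_size=400, embedding_dim=50, num_classes=2):
    model = Sequential([
        Embedding(vocab_size, embedding_dim),                  # layer 1: map term indices
        Conv1D(filters=64, kernel_size=3, activation="relu"),  # layer 2: context window of size 3
        GlobalMaxPooling1D(),                                  # collapse to a fixed-length vector
        Dense(250, activation="relu"),                         # layer 3: hidden layer, 250 units
        Dense(num_classes, activation="softmax"),              # layer 4: softmax output distribution
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```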
Running Result
Figures 6 and 7 display the charts of the running results using 10-fold cross validation, with the following hyperparameters:
Epochs = 30, batch_size = 10, kernel_size = 3, k-fold = 10
Figure 6
Figure 7
One interesting observation is that, when considering one of the 10 folds (Figures 8 and 9), the statistics change over the 30 epochs. As the epochs increase, the model becomes more fitted to the training data, but the loss on the validation data keeps increasing, which indicates that more epochs do not lead to a better model; overfitting is always a problem if the same data keeps being used to train a model.
[Chart: 10-Fold Cross Validation Result – loss and accuracy for each of the 10 folds]
Figure 8
Figure 9
[Chart: Trend changes as epochs progress – train_loss, train_accuracy, validation_loss and validation_accuracy over 30 epochs]
Conclusion
Using machine learning to implement the classifier significantly improves the efficiency of generating a practical model to classify text. The comparison with the three popular baseline classifiers suggests that a fine-tuned model, with its layers configured according to the target datasets, can perform better than other existing, widely used classifiers. However, making the model perfect is currently impossible, because a) tuning a model to its best condition is long-term work, as there can be a huge number of hyperparameter combinations, taking up to O(n^n) time to explore.
Sentiment
Classifier        | Accuracy
Naïve Bayes       | 76.70%
SVM               | 70.60%
Decision Tree     | 67.70%
Target Classifier | 79.33%
Future Works
Configure Convolutional Layer
I have put the unfinished code for the convolutional kernel configuration in the Python file, and completing it would be the next task. By successfully implementing this part, the modifiers' effects on the terms might become more apparent.
For Search Engine Result
The current version of this classifier can already be used to re-process search engine results, which will be tested and evaluated manually. If possible, an attempt would also be made to implement a user interface as an intermediary between the search engine results and the client.
Improve the Output
The output of this classifier is currently displayed in the command window, which is not very presentable. Making it more readable for humans is worth a reasonable effort, and, as described in the project aim, displaying the input sentence with distinct colours to indicate how each term influences the sentence will be done in the future.
References
Bandhakavi, A., Wiratunga, N. & Massie, S., 2017. Lexicon Generation for Emotion Detection from Text. IEEE Intelligent Systems, 13 February, pp. 102-108.
Chen, H. & Dumais, S., 2000. Bringing order to the Web: automatically categorizing search results. CHI '00: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1 April, pp. 145-152.
Clos, J., Wiratunga, N. & Massie, S., 2017. Towards Explainable Text Classification by Jointly Learning Lexicon and Modifier Terms. Aberdeen, United Kingdom: Robert Gordon University.
Domingos, P., 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10), pp. 78-87.
Esuli, A. & Sebastiani, F., 2006. SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In Proceedings of LREC, volume 6, pp. 417-422.
Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D., 2014. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, pp. 3133-3181.
Hutto, C.J. & Gilbert, E.E., 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14).