Published by CityU Student Research & Investment Club · 2019. 12. 12.
© CityU Student Research & Investment Club 2
THE FINAL PAGE OF THIS REPORT CONTAINS A DETAILED DISCLAIMER
The content and opinions in this report are written by a university student from the CityU Student Research &
Investment Club, and thus are for reference only. CityU Student Research & Investment Club is not responsible
for any direct or indirect loss resulting from decisions made based on this report. The opinions in this report
constitute the opinion of the CityU Student Research & Investment Club and do not constitute the opinion of
the City University of Hong Kong nor any governing or student body or department under the University.
Table of Contents
Introduction
Literature Review
  Technical Background
  Financial Background
The Program
  Libraries
  Dataset
  Deploying the Strategy
Results
Limitations
Future developments
Conclusion
Introduction
Computer algorithms are a big part of society. Traffic lights, social media, flight ticket booking, and
net banking are a few of the many services that use algorithms to make their operations more efficient.
Another prevalent use of algorithms that has gained popularity, especially in the past decade, has been
in stock market trading. Algorithmic Trading has become an integral part of our financial systems in
this era of technological advancements.
This report presents a short-only algorithmic trading strategy built upon sentiment analysis of Twitter
data using Natural Language Processing (NLP) techniques. As of Q1’19, Twitter averaged 330 million
active monthly users, a jump from 321 million in Q4’18. The same metric was at a mere 30 million
monthly users at the start of the decade (Q1’10)¹. Growing more than tenfold in less than a decade
shows that Twitter has become a major platform for people to express their opinions. For
this reason, analyzing the sentiment of tweets about certain companies and recording the stock
movement that follows those sentiments can give us an insight into how influential social media can be in
this field.
In this strategy, stocks of 4 companies are chosen: Amazon (AMZN), Facebook (FB), Google
(GOOGL) and Apple (AAPL); this will serve as our portfolio. We chose four of the largest firms in the
world based on our assumption that they will be most susceptible to public sentiment. We then retrieved
one dataset each for the companies, which consists of tweets that mention them for a period of 2 months.
After this, we use a Python library called the Natural Language Toolkit (NLTK) to apply NLP techniques to
each tweet and give it a score ranging from -1 (most negative) to +1 (most positive). We then calculate
the average of the sentiment scores on a daily basis to find the overall sentiment for each day. After
this, we look for days where the overall sentiment has a score of less than 0, i.e. days when the public
reacted negatively towards the company. Under the assumption that the general public reacts more
strongly to negative news and opinions than to positive ones, we expect the stock price to fall in the
following days. Hence, we short sell 100 shares on each of the two days following a day with negative
sentiment. The rationale behind this strategy is that public sentiment is arguably the strongest influence
on the market and plays a huge role in moving the share price of a company. We chose Twitter because,
as mentioned above, it is a major platform for expressing opinions, and because its data is readily
available online, with real-time tweets retrievable through the Twitter API. Additionally, the field of
AI has paved the way for algorithmic trading to reach new levels, and exploring a strategy based on it has
aroused curiosity in the financial world.
1 Statista. “Number of monthly active Twitter users worldwide from 1st quarter 2010 to 1st quarter 2019” [Online]
https://www.statista.com/statistics/282087/number-of-monthly-active-Twitter-users/
The purpose of this report is to present the idea behind a trading strategy that can be
further developed for use in real-world scenarios. We do not present the strategy in practice itself,
but rather an introduction to how algorithmic trading works and the groundwork behind it. It can serve
as a guide for any beginner in the field of algorithmic trading and quantitative finance. In the following
sections, we start by giving a brief description of the theory behind the strategy. We then
walk the reader through the Python code, explaining each function along with code snippets from
the Jupyter Notebook to show how the strategy is simulated. We also describe the libraries
used and explain the dataset. We then evaluate our results, i.e. the total return. Lastly, we
discuss some limitations of the program and explore the potential of turning it into a real-world
algorithmic trading strategy.
Literature Review
This section gives a description of the theoretical concepts behind the strategy. Algorithmic trading is
a perfect example of how the two fields of Computer Science and Finance work together. Therefore,
this section is divided into two parts:
1. Technical: explaining the technical tools used in the program and some concepts at the core of
this strategy.
2. Financial: explaining the financial concepts involved in this strategy.
Technical Background
Natural Language Processing
Natural Language Processing (NLP) is the branch of Artificial Intelligence (AI) that deals with the
synthesis and analysis of text and speech in natural language. Voice assistants on smartphones
such as Siri, chatbots and Google search suggestions are all examples of NLP applications. This
technology is rapidly advancing due to the availability of large amounts of data, more computational
power and, most importantly, a keen interest in human-to-computer communication. The languages
humans speak, such as English and Chinese, are called natural languages. A natural language is an
intricate combination of grammar, semantics, syntax, sounds and symbols, which makes it extremely
complicated and, without further processing, meaningless to a computer. Hence, making a computer
understand human natural language requires a great deal of research, complex algorithms and rigorous
pre-processing of data. As complicated as the process might be, it is equally important since it is a major
step forward in the efficient integration of computers in our society. Although there have been many
breakthroughs in this field, there are still a large number of milestones that researchers are aiming to
reach. NLP has applications in almost every field; this report discusses its potential in finance.
Sentiment Analysis
Sentiment Analysis is one of the many major applications of NLP. As the name suggests, sentiment
analysis is the practice in which a piece of text is processed and categorized as positive, negative
or neutral. Sentiment analysis is important for businesses to understand the general
opinion of the public about their overall brand, as well as the opinions on their products. Through this
analysis, companies can make business decisions accordingly, especially in this age where social media
and online platforms are so prevalent that people can voice their opinions with minimum effort. We
make use of this particular fact in our program.
Financial Background
Algorithmic Trading
Algorithmic trading is a type of security trading in which a computer buys or sells automatically based
on a set of pre-programmed rules. Such strategies typically involve tracking the price of
the security in question and/or certain market conditions and executing a buy or sell order without any
human intervention as soon as a criterion is met. Broadly, algorithmic trading involves 4 steps:
1. Strategy Identification: This step defines the criteria that should be met, i.e. the signal to
execute trades. An example of a very simple strategy is mean reversion, which works under the
assumption that the price of a security will revert to its long-term mean level over the
dataset. A signal in this strategy can be as follows: if the price of a certain stock falls below
the mean by, say, 100 basis points, buy 10 shares. Strategies can be as simple as mean reversion
or as eccentric as one built upon the weather forecast: a paper² suggests that sunshine boosts
people’s mood, which has a positive effect on the stock market.
2. Program Development: After theoretically identifying the idea behind a strategy, the next step
is to develop a computer program that will implement it. It involves using technical tools to
construct a framework that will generate the signal to execute the trades.
3. Backtesting: This is the most important step in the process of algorithmic trading and the step
where strategies pass or fail. “Back” in backtesting points to the past, and “testing” means
quantitatively evaluating the strategy. Once the computer program is ready, the next step is to
execute it on real data from the past to check whether the strategy is profitable enough. For
example, a mean reversion strategy will be tested on stock prices from a few years ago and the
weather forecast strategy will be tested on weather conditions from the past. Backtesting has
become convenient with the abundance of data freely available. Nowadays, there are many
platforms such as Quantopian and MetaStock that provide a convenient way to backtest a
strategy and evaluate it on more metrics than just the return, such as the Sharpe
ratio, median profit per trade and outlier-adjusted profit factor.
4. Implementation: If a strategy is profitable and has an acceptable level of risk based on past
data, it becomes ready to be put into practice in the real world. There are many trading platforms
now that offer implementation of strategies to trade automatically on the user’s behalf.
2 Hirshleifer, D. & Shumway, T. (2001). “Good Day Sunshine: Stock Returns and the Weather”.
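The mean reversion signal used as the example in step 1 can be sketched in a few lines of Python. This is only an illustration of how a rule becomes a signal: the function name `mean_reversion_signal`, its parameters and the sample prices are assumptions made for this sketch and are not part of the program described later in this report.

```python
from statistics import mean


def mean_reversion_signal(prices, threshold_bps=100, order_size=10):
    """Return the number of shares to buy (0 means no trade).

    The latest price is compared with the mean of all earlier prices.
    If it sits more than `threshold_bps` basis points below that mean,
    a buy order of `order_size` shares is signalled.
    """
    if len(prices) < 2:
        return 0  # not enough history to form a mean
    history, latest = prices[:-1], prices[-1]
    long_term_mean = mean(history)
    # 1 basis point = 0.01%, i.e. 0.0001 as a fraction
    trigger_price = long_term_mean * (1 - threshold_bps / 10_000)
    return order_size if latest < trigger_price else 0


# Hypothetical prices: the mean of the history is 100, so the trigger is about 99.
print(mean_reversion_signal([100, 101, 99, 100, 98.5]))  # latest price is below the trigger
print(mean_reversion_signal([100, 101, 99, 100, 99.5]))  # latest price is above the trigger
```

A real strategy would usually compute the mean over a rolling window and also define when to sell; both are left out here to keep the rule readable.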
Algorithmic trading is now widely used by many financial institutions, often in the form of
what is known as High Frequency Trading (HFT). Using HFT, these institutions place tens of
thousands of orders in a matter of seconds and use extremely complicated algorithms. In its early stages,
most algorithms were designed to create trading strategies that involved complex mathematical
formulas and variables such as price and volume. Now, with the advancements in artificial intelligence,
algorithmic trading is reaching new levels.
Short Selling
As mentioned in the introduction, we will explore a short-only trading strategy in this report. Short
selling is an investment strategy that speculates on the decline of the price of a security. This strategy
involves borrowing shares and selling them immediately. At a later stage, when the price of that security
has fallen, the same number of shares is bought back at the lower price and returned to the lender. Since
shares are sold at a higher price and bought back at a lower price, the difference is the profit, minus the
borrowing fee. For instance, suppose you borrow 10 shares of stock ‘A’ from someone when the price is
$10/share and immediately sell them. You now have $100. Later in the day, the price falls to $5/share
and you buy 10 shares for $50 and return these shares to the lender with, say, $5 as a borrowing fee. After
these transactions, you are left with a $45 profit. Short selling is viable because people panic and are
more willing to sell when the price falls and more willing to buy when the price is high (trend-chasing
behavioral bias).
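The arithmetic of the example above can be written as a small function. This is a sketch for illustration; the name `short_sale_profit` and its parameters are assumptions, and the second call uses a made-up price rise to show that the same trade loses money when the price moves the other way.

```python
def short_sale_profit(sell_price, buyback_price, shares, borrowing_fee=0.0):
    """Profit from a short sale: sell borrowed shares, then buy them back."""
    proceeds = sell_price * shares   # cash received when selling the borrowed shares
    cost = buyback_price * shares    # cash paid to buy the shares back
    return proceeds - cost - borrowing_fee


# The example from the text: 10 shares sold at $10, bought back at $5, $5 fee.
print(short_sale_profit(10, 5, 10, borrowing_fee=5))   # 45
# If the price rises to $15 instead, the position makes a loss.
print(short_sale_profit(10, 15, 10, borrowing_fee=5))  # -55
```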
An investor goes short on a security only when they are sure that it is overvalued and that the price is
going to fall. However, short selling comes with a very high risk. Theoretically, one could lose an
infinite amount of money when they go short because the price of the security could rise forever. For
this reason, short selling is typically only practiced by experienced investors. Hedge funds are the most
prevalent short sellers in the market; they use this strategy to hedge their long positions in other
securities and save their portfolios from losses.
The Program
This section will explain the logic behind the program after a brief description of all the libraries used
and the dataset.
The strategy is developed in the Python programming language and the environment used is the Jupyter
Notebook. Python has extensive support for storing, manipulating and processing large datasets quickly with
the pandas library (see below). For this reason among others, it has grown to become one of the most
popular choices for developers in the field of data science and finance.
Libraries
Figure 1 shows all the libraries that need to be imported in order to make the program work. The
first two libraries are the basic ones that every data scientist should be familiar with: pandas and
matplotlib. Pandas is the most commonly used library to store, manipulate and process data, while
matplotlib is the most extensively used library for charting data. In this project, we have used pandas
to store the tweets, stock prices and sentiments (see Deploying the Strategy) and matplotlib for the
visualizations of that data. These two libraries are used in the majority of data-driven Python projects.
The other two libraries, which form the core of this project, are yfinance and
SentimentIntensityAnalyzer.
Library: “yfinance”
This library is used to download stock prices of any company from Yahoo Finance for a specified period
of time. You use the ‘download’ function which takes in 3 parameters: Ticker, start date and end date,
and returns a data frame with 6 columns: Open, High, Low, Close, Adj Close and Volume. In our
project, we only use the Open and Close columns of this data frame.
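A minimal sketch of this step is given below. It assumes the yfinance package is installed and that the returned data frame has “Open” and “Close” columns as described above; the exact column layout can differ between yfinance versions. The helper names `download_prices` and `open_close` are hypothetical and are not the functions of our program.

```python
import pandas as pd


def download_prices(ticker, start, end):
    """Download daily prices from Yahoo Finance (needs network access)."""
    import yfinance as yf  # imported lazily so the helper below also works offline
    return yf.download(ticker, start=start, end=end)


def open_close(prices):
    """Keep only the Open and Close columns used by the strategy."""
    return prices[["Open", "Close"]].copy()


# Usage (not run here because it needs an internet connection):
# prices = download_prices("GOOGL", "2016-03-28", "2016-06-15")
# print(open_close(prices).head())
```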
Figure 1
Source: CURIC
Figure 2 shows the output and the format of the “download” function of the library to retrieve the stock
price data for GOOGL.
Library: “SentimentIntensityAnalyzer”
This library is used for the sentiment analysis of the tweets in our dataset. Although we have imported
it from the Natural Language Toolkit (NLTK) package in our project, SentimentIntensityAnalyzer is also
available as an independent library under the MIT License³. It has comprehensive documentation which
describes its features and provides some examples of its uses; this can be found in the VADER Sentiment
GitHub repository.
The library uses a sentiment analysis technique called VADER (Valence Aware Dictionary and
Sentiment Reasoner). VADER is a lexicon- and rule-based sentiment analysis tool. It is specifically tuned
for social media posts, which makes it well suited to our project. It does not simply classify
text as positive, negative or neutral; it also gives a score (from -1 to +1) to show how positive or negative
the sentiment is. To do this, VADER recognizes and considers common practices seen in social media
posts such as UTF-8 encoded emojis, the use of punctuation (e.g. “good” would receive a less positive
score than “good!!!”), the use of word shape (such as capitalization) to signal emphasis and the use of
slang. We have created the SentimentIntensityAnalyzer object and named it “sia”.
3 Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media
Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Figure 2
Source: CURIC
Figure 3 shows the different scores given to sentences that are slightly modified. As you can
see from the second sentence, adding an exclamation mark at the end increases the positive score
slightly compared to the first sentence. The last two sentences show that capitalizing the word “so”
to show emphasis also has an effect on the sentiment score of the sentence.
The thresholds typically used for the sentiment scores, as described in the GitHub documentation, are:
1. Positive sentiment: score >= 0.05
2. Neutral sentiment: score > -0.05 and score < 0.05
3. Negative sentiment: score <= -0.05
However, for the simplicity of our project, we take any score less than 0 as a negative sentiment and any
score greater than 0 as a positive sentiment.
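The two labelling rules can be compared side by side. The sketch below assumes that the negative class includes a score of exactly -0.05, as in the VADER documentation; the function names are hypothetical and the sample scores are made up, apart from -0.142, which is the daily score discussed later in this report.

```python
def label_documented(score, threshold=0.05):
    """Three-way label using the thresholds from the VADER documentation."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def label_simplified(score):
    """Two-way label used in this report: the sign of the score decides."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "zero"  # scores of exactly 0.0 are dropped later in the program


for s in (0.42, 0.03, -0.03, -0.142):
    print(s, label_documented(s), label_simplified(s))
```

Scores close to zero, such as 0.03 or -0.03, are where the two rules disagree: the documented thresholds treat them as neutral, while the simplified rule counts them as positive or negative.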
Dataset
A dataset is required to test any kind of algorithmic trading strategy, and the data required for this
project was a collection of tweets for each company chosen. The data was retrieved from Twitter Dataset
NASDAQ 100. This resource has a collection of all the tweets for each company listed on the NASDAQ
stock exchange for a period of 79 days, from March 28th, 2016 to June 15th, 2016, available to download
for free. There are six Excel files for each company and the tweets are contained in the “stream” sheet
of the “dashboard” file.
Figure 3
Source: CURIC
This figure gives an overview of the data we will be dealing with. We read the file containing tweets
about Google using the function “read_excel()” and store it in a pandas data frame that we name googl.
We see that over a period of 79 days, this data has over 29,000 tweets mentioning Google (including
retweets). The second cell shows the 26 columns that are originally included in the file and the third
cell shows the first two rows of the actual data. For our project, we only need two columns:
Date and Tweet content. For this purpose, we have written a function called “clean_data” which returns
only the columns necessary for us to start our sentiment analysis (Figure 5).
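Since the figure itself is not reproduced here, the following is a hedged sketch of what a function like “clean_data” could look like. It is a reconstruction, not the original code: the column names “Date” and “Tweet content” come from the text, while the “Followers” column, the sample rows and the file path in the comment are placeholders.

```python
import pandas as pd


def clean_data(tweets, columns=("Date", "Tweet content")):
    """Return a copy of the tweet data with only the columns needed later."""
    missing = [c for c in columns if c not in tweets.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return tweets[list(columns)].copy()


# Tiny stand-in for the spreadsheet; the real file has 26 columns.
raw = pd.DataFrame({
    "Date": ["2016-06-05", "2016-06-05"],
    "Tweet content": ["placeholder tweet one", "placeholder tweet two"],
    "Followers": [120, 4500],  # hypothetical extra column
})
print(clean_data(raw).columns.tolist())  # ['Date', 'Tweet content']

# Reading the real data would look roughly like this (the path is a placeholder):
# googl = clean_data(pd.read_excel("path/to/dashboard.xlsx", sheet_name="stream"))
```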
Figure 5
Source: CURIC
Figure 4
Source: CURIC
Now that we know the groundwork for our project, i.e. the libraries used, how the sentiment analysis
works, what the dataset looks like and where to find it, we can jump right in on deploying the strategy
and explore the results.
Deploying the Strategy
Our program consists of 8 helper functions, one of which has already been mentioned in the previous
section. After storing the tweets dataset with only the date and tweet content, we loop through the data
frame and apply the SentimentIntensityAnalyzer’s polarity_scores function to each tweet to get its
sentiment score, the same way we did in Figure 3. For simplicity, we also omit the tweets
which have a sentiment score of 0.0. We do all this by using the get_sentiment_score function of our
program. Let’s demonstrate this on our dataset.
Figure 6 shows that the length of the dataset has reduced from around 29.3k to 14.5k after removing all
the tweets with a sentiment score of 0.0.
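The scoring and filtering step can be sketched as follows. This is an assumed reconstruction of a function like get_sentiment_score rather than the original code. To keep the example runnable without NLTK, the scorer is passed in as the hypothetical parameter `score_fn` and a stand-in scorer with made-up scores is used; the comment shows how VADER’s compound score could be plugged in instead.

```python
import pandas as pd


def get_sentiment_score(tweets, score_fn, text_col="Tweet content"):
    """Add a 'sentiment' column and drop tweets scored exactly 0.0."""
    scored = tweets.copy()
    scored["sentiment"] = scored[text_col].apply(score_fn)
    return scored[scored["sentiment"] != 0.0].reset_index(drop=True)


# With NLTK installed and the 'vader_lexicon' resource downloaded, the
# scorer could be built like this:
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   sia = SentimentIntensityAnalyzer()
#   score_fn = lambda text: sia.polarity_scores(text)["compound"]

# Stand-in scorer with made-up scores so the sketch runs without NLTK.
FAKE_SCORES = {"great": 0.6, "meh": 0.0, "awful": -0.5}


def fake_score(text):
    return FAKE_SCORES[text]


demo = pd.DataFrame({
    "Date": ["2016-06-04", "2016-06-04", "2016-06-05"],
    "Tweet content": ["great", "meh", "awful"],
})
print(get_sentiment_score(demo, fake_score))  # the "meh" row is removed
```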
Figure 6
Source: CURIC
There may be strategies in place which make trading decisions based on single tweets. However, our
strategy focuses on the overall sentiment on a daily basis. For this reason, we now find the mean of the
sentiment scores grouped by date. In this step, we can also get rid of the tweet content column since we
do not need the actual tweets anymore. We do this by using the “get_daily_sentiment” function of our program.
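Grouping by date and averaging is a short operation in pandas. The sketch below is an assumption about how a function like “get_daily_sentiment” might work, using made-up scores; the column names `Date` and `sentiment` are illustrative defaults.

```python
import pandas as pd


def get_daily_sentiment(scored, date_col="Date", score_col="sentiment"):
    """Average the per-tweet scores for each calendar day."""
    days = pd.to_datetime(scored[date_col]).dt.normalize()  # strip any time of day
    daily = scored[score_col].groupby(days).mean()
    return daily.rename("sentiment").to_frame()


scored = pd.DataFrame({
    "Date": ["2016-06-04 09:00", "2016-06-04 17:30", "2016-06-05 11:15"],
    "sentiment": [0.5, 0.1, -0.2],  # made-up scores
})
print(get_daily_sentiment(scored))
```

The result has one row per day, so the tweet text is dropped as a side effect of the aggregation.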
Figure 7 shows us the daily sentiment scores. This gives us an insight into how the public reacted to the
company in question on Twitter. We observe that the sentiment has been positive, i.e. a score above 0,
for most days, except on June 5th 2016, which has a sentiment score of -0.142. According to the Daily
Mail online archive for June 4th 2016, there was a news article with the headline, “Google removes
racist Chrome extension that was used by neo-nazis to target Jewish people online”. The article can be
found in the Daily Mail online archive. Such news can cause a stir on social media, and
this piece of news could have contributed to the negative opinions about Google the next day.
Now that we have the daily sentiments from Twitter, following our strategy, we short sell 100 shares
on the next two consecutive days, the 6th and 7th of June. This means that we borrow 100 shares and sell
them as soon as the market opens the next day, i.e. at the Open price. According to our theory, the share
price should fall during the day, so the Close price will be less than the Open price; hence, we buy back
100 shares right before the market closes and return them to the lender. We repeat the same
process for another day. In order to calculate the return and test whether our strategy actually works, we
first need to get the stock price of the company for the days we have the sentiment scores for. We do this by
calling the “get_stock_prices” function; the output is similar to Figure 2 and we store it in a dataframe
called “prices”. For our program, we are only concerned with the open and the close price. Therefore,
we call the “match_open_close” function and add these two columns to our dataframe which contains
the sentiment scores per day.
Figure 7
Source: CURIC
As Figure 8 shows, two columns have been added, namely “short” and “return”. The former contains
a value of 0 or 1; a value of 1 represents a signal that the computer needs to execute a short-sell
trade on that day. The latter holds the return earned on that day. Since we sell 100 shares at the
Open price and buy them back at the Close price, the return is calculated as: Return = 100*(Open – Close). We
generate the signal by calling the “short” function and calculate the return by calling the
“calculate_return” function. The results are shown below when we do it on our Google data frame.
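The signal and return columns can be sketched as below. This is an assumed reconstruction, not the original functions: it treats consecutive rows as consecutive trading days, flags the two rows after any negative day, and computes the short-sale return as shares sold at the Open minus shares bought back at the Close, so a falling price gives a positive number. The prices and scores are made up.

```python
import pandas as pd

SHARES = 100


def short(daily, hold_days=2):
    """Flag the `hold_days` rows that follow any row with negative sentiment."""
    negative = (daily["sentiment"] < 0).astype(int)
    # Count how many of the previous `hold_days` rows were negative.
    hits = sum(negative.shift(lag, fill_value=0) for lag in range(1, hold_days + 1))
    out = daily.copy()
    out["short"] = (hits > 0).astype(int)
    return out


def calculate_return(daily, shares=SHARES):
    """Sell at the Open and buy back at the Close on flagged days."""
    out = daily.copy()
    out["return"] = out["short"] * shares * (out["Open"] - out["Close"])
    return out


demo_days = pd.DataFrame({
    "sentiment": [0.2, -0.1, 0.3, 0.1, 0.2],
    "Open": [10.0, 10.0, 10.0, 9.0, 8.0],
    "Close": [10.0, 10.0, 9.0, 8.5, 8.0],
})
result = calculate_return(short(demo_days))
print(result)
print("total return:", result["return"].sum())
```

In this toy data only the third and fourth rows are traded, because they follow the single negative day in the second row.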
Figure 8
Source: CURIC
Figure 9 shows the final result of our strategy when applied to the Google stock. Our theory that the
stock price will fall after a day of negative opinions on Twitter held in this case, as we
can see from rows 68 and 69. Prices fell on the third day too; however, shorting for 3 consecutive days
would be too risky for an already high-risk strategy.
After shorting 100 shares for two days, we earn a total return of $1061. Assuming a borrowing fee
of $50 per day, this leaves us with a profit of $961 ($100 borrowing fee for two days).
This section showed a step by step process of how we go from having raw Twitter data to theoretically
earning a profit without any significant capital requirement. We apply the exact same steps to all the
other stocks in our portfolio and calculate the total return in the next section.
Figure 9
Source: CURIC
Results
After following the steps shown in the previous section on the stocks of Amazon, Facebook and Apple,
we arrive at the following results:
Ticker   Days Short        Per Day Profit/Loss   Total Return   Borrower’s Fee ($50 per day)   Net Return
AMZN     25th June 2016    $959 Loss             $491           $150                           $341 Profit
         26th June 2016    $928 Profit
         27th June 2016    $522 Profit
FB       10th May 2016     $87 Loss              $2             $100                           $98 Loss
         11th May 2016     $89 Profit
AAPL     5th May 2016      $76 Profit            $141           $100                           $41 Profit
         6th May 2016      $65 Profit
No trading strategy, technical or fundamental, can guarantee a profit in every scenario. As we can
see, we incur a loss when deploying our strategy on the Facebook stock.
If we add all the net returns, including the $961 from Google, we end up with a net profit of $1245.
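The total can be checked with a few lines of Python using only the figures reported in this report. The Google result comes from the previous section; the variable names are chosen for this sketch.

```python
FEE_PER_DAY = 50  # borrowing fee in dollars per day short

# ticker: (total return over all short days in dollars, number of days short)
results = {
    "GOOGL": (1061, 2),
    "AMZN": (-959 + 928 + 522, 3),
    "FB": (-87 + 89, 2),
    "AAPL": (76 + 65, 2),
}

net = {ticker: total - FEE_PER_DAY * days for ticker, (total, days) in results.items()}
print(net)                # {'GOOGL': 961, 'AMZN': 341, 'FB': -98, 'AAPL': 41}
print(sum(net.values()))  # 1245
```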
Limitations
As mentioned in the introduction, the purpose of this report is to present the idea behind a trading
strategy. A number of variables, evaluation metrics and extra requirements are involved
before a strategy can be put into practice. Hence, if the strategy presented in this report had been
used in real life in the same timeframe, the net return would most likely have been different.
Figure 10
Source: CURIC
This report only tests the strategy on four stocks, for a short period of time. To see whether it is
a viable strategy, more rigorous testing is needed, on a greater number of stocks and over a much longer
timeframe. This was not done in this project because of the shortage of data currently available online.
Rigorous testing would require collecting data on our own; however, we did not have adequate
time or resources to do so. We also only chose 4 large firms under the assumption
that they would be most susceptible to Twitter sentiment.
Future developments
The strategy presented in this report is fairly simple and is used for demonstration purposes only. Real
strategies use extremely complicated algorithms to trade securities. Using our strategy as a base,
various modifications can be made to the program in order to make it more accurate. For example, we
can consider the number of followers of the author of each tweet when calculating our sentiment score.
We would expect a tweet written by an account with 10k followers to have more effect on the
sentiment of that day than a tweet by an account with 100 followers, since the former will reach more
people. Similarly, we can tweak our sentiment analysis to take other such factors into account.
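One possible way to do this, shown purely as an illustration, is to replace the plain daily mean with a follower-weighted mean. Nothing below is part of the program in this report: the function name, the `followers` column and the sample values are assumptions, and weighting directly by follower count is only one of several reasonable choices.

```python
import pandas as pd


def weighted_daily_sentiment(tweets):
    """Follower-weighted mean sentiment per day."""
    weighted = tweets["sentiment"] * tweets["followers"]
    totals = weighted.groupby(tweets["Date"]).sum()
    weights = tweets["followers"].groupby(tweets["Date"]).sum()
    # A day whose authors have no followers at all would give 0/0 (NaN).
    return (totals / weights).rename("weighted_sentiment")


tweets = pd.DataFrame({
    "Date": ["2016-06-05", "2016-06-05"],
    "sentiment": [-0.5, 0.5],    # made-up scores
    "followers": [10_000, 100],  # made-up follower counts
})
print(weighted_daily_sentiment(tweets))
```

In this toy case the plain mean of the two scores is 0, while the weighted mean is close to -0.49 because the negative tweet comes from the larger account.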
Since our strategy is based on Twitter data, the first step in putting the strategy into practice would be to
track and store tweets in real time. This is possible with the Twitter API. By creating an account for the
Twitter API, a user obtains tokens that can be used with various Python libraries. One such library is
Tweepy, which establishes a connection and retrieves tweets in real time. The details of how to do this are
beyond the scope of this report.
Conclusion
In this report, we have seen what algorithmic trading is and how it works. As a demonstration, we
developed a short-only strategy based on sentiment analysis of Twitter data. In conclusion, it should be
noted that although the strategy presented is highly risky and in practice not the most viable strategy to
adopt, the concepts presented in this report, such as natural language processing, sentiment analysis and
algorithmic trading, are state-of-the-art practices and have great potential in the financial
world, especially with the recent technological advancements. Therefore, knowledge of these
concepts is a bonus for any individual aspiring to a career in this field.
DISCLAIMER
This report is produced by university student members of CityU Student Research & Investment Club (the Club).
All material presented in this report, unless otherwise specified, is under copyright of the Club. None of the
material, nor its content, nor any copy of it, may be altered in any way without the prior express written permission
and approval of the Club. All trademarks, service marks, and logos used in this report are trademarks or service
marks of the Club. The information, tools and materials presented in this report are for information purposes only
and should not be used or considered as an offer or a solicitation of an offer to sell or buy or subscribe to securities
or other financial instruments. The Club has not taken any measures to ensure that the opinions in the report are
suitable for any particular investor. This report does not constitute any form of legal, investment, taxation, or
accounting advice, nor does this report constitute a personal recommendation to you. Information and opinions
presented in this report have been obtained from or derived from sources which the Club believes to be reliable
and appropriate but the Club makes no representation as to their accuracy or completeness. The Club accepts no
liability for loss arising from the use of the material presented in this report. Due attention should be given to the
fact that this report is written by university students. This report is not to be relied upon in substitution for the
exercise of independent judgement. The Club may have issued in the past, and may issue in the future, other
communications and reports which are inconsistent with, and reach different conclusions from, the information
presented in this report. Such communications and reports represent the different assumptions, views, and
analytical methods of the analysts who prepared them. The Club is not under an obligation to ensure that such
communications and reports are brought to the attention of any recipient of this report. This report, and all other
publications by the Club do not constitute the opinion of the City University of Hong Kong, nor any governing or
student body or department under the University aside from the Club itself. This report may provide the addresses
of, or contain hyperlinks to, websites. Except to the extent to which the report refers to website material of the
Club, the Club has not reviewed any such website and takes no responsibility for the content contained therein.
Such addresses or hyperlinks (including addresses or hyperlinks to the Club’s own website material) are provided
solely for your own convenience and information and the content of any such website does not in any way form
part of this Report. Accessing such website or following such link through this report shall be at your own risk.