published by cityu student research & investment club · 2019. 12. 12. · after this, we use a...

Published by CityU Student Research & Investment Club

http://cityuinvestmentclub.com/

https://www.linkedin.com/company/cityu-investment-club/?viewAsMember=true

© CityU Student Research & Investment Club

THE FINAL PAGE OF THIS REPORT CONTAINS A DETAILED DISCLAIMER

The content and opinions in this report are written by a university student from the CityU Student Research &

Investment Club, and thus are for reference only. CityU Student Research & Investment Club is not responsible

for any direct or indirect loss resulting from decisions made based on this report. The opinions in this report

constitute the opinion of the CityU Student Research & Investment Club and do not constitute the opinion of

the City University of Hong Kong nor any governing or student body or department under the University.


Table of Contents

Introduction

Literature Review

  Technical Background

  Financial Background

The Program

  Libraries

  Dataset

  Deploying the Strategy

Results

Limitations

Future developments

Conclusion


Introduction

Computer algorithms are a big part of society. Traffic lights, social media, flight ticket booking, and

net banking are a few of the many services that use algorithms to make their operations more efficient.

Another prevalent use of algorithms that has gained popularity, especially in the past decade, has been

in stock market trading. Algorithmic Trading has become an integral part of our financial systems in

this era of technological advancements.

This report presents a short-only algorithmic trading strategy built upon sentiment analysis of Twitter

data using Natural Language Processing (NLP) techniques. As of Q1’19, Twitter averaged 330 million

active monthly users, a jump from 321 million in Q4’18. The same metric was at a mere 30 million

monthly users at the start of the decade (Q1’10)1. Growing more than tenfold in less than a decade

exemplifies the fact that Twitter has become a major platform for people to express their opinions. For

this reason, analyzing the sentiment of tweets about certain companies and recording the stock

movement based on those sentiments can give us an insight on how influential social media can be in

this field.

In this strategy, stocks of 4 companies are chosen: Amazon (AMZN), Facebook (FB), Google

(GOOGL) and Apple (AAPL); this will serve as our portfolio. We chose four of the largest firms in the

world based on our assumption that they will be most susceptible to public sentiment. We then retrieved

one dataset each for the companies, which consists of tweets that mention them for a period of 2 months.

After this, we use a Python library called natural language toolkit (NLTK) to apply NLP techniques on

each tweet and give it a score ranging from -1 (most negative) to +1 (most positive). We then calculate

the average of the sentiment scores on a daily basis to find the overall sentiment for each day. After

this, we look for days where the overall sentiment has a score of less than 0, i.e. days when the public

reacted negatively towards the company. Under the assumption that the general public reacts more

strongly to negative news/opinions than positive, we expect the stock price to fall in the following days.

Hence, we short sell 100 shares for two days following the day with a negative sentiment. The rationale

behind this strategy is that public sentiment is arguably the strongest market influencer and plays a huge

role in moving the share price of a company. We chose Twitter because, as mentioned above, it is a

major platform for expressing opinions, and because its data is readily available online, with real-time

retrieval of tweets made possible through the Twitter API. Additionally, the field of AI has paved the

way for algorithmic trading to reach new levels, and exploring a strategy based on it has aroused

curiosity in the financial world.

1 Statista. “Number of monthly active Twitter users worldwide from 1st quarter 2010 to 1st quarter 2019” [Online]

https://www.statista.com/statistics/282087/number-of-monthly-active-Twitter-users/


The purpose of this report is to theoretically present the idea behind a trading strategy which can be

further developed to be used in real world scenarios. We do not present the strategy in practice itself,

but rather an introduction to how algorithmic trading works and the groundwork behind it. It can serve

as a guide to any beginner in the field of algorithmic trading and quantitative finance. In the following

sections, we start by giving a brief description of the theory behind the strategy. We then

walk the reader through the Python code, explaining each function along with code snippets from

the Jupyter Notebook to show how the strategy is simulated. We also describe the libraries

used and explain the dataset. We then evaluate our results, i.e. the total return. Lastly, we

discuss some limitations of the program and explore the potential of turning it into a real-world

algorithmic trading strategy.

Literature Review

This section gives a description of the theoretical concepts behind the strategy. Algorithmic trading is

a perfect example of how the two fields of Computer Science and Finance work together. Therefore,

this section is divided into two parts:

1. Technical: explaining the technical tools used in the program and some concepts at the core of

this strategy.

2. Financial: explaining the financial concepts involved in this strategy.

Technical Background

Natural Language Processing

Natural Language Processing (NLP) is the branch of Artificial Intelligence (AI) that deals with the

synthesis and analysis of text data and speech in natural language. Voice assistants in smartphones

such as Siri, chatbots and Google search suggestions are all examples of NLP applications. This

technology is rapidly advancing due to the availability of large amounts of data, more computational

power and, most importantly, a keen interest in human-to-computer communication. The language

humans speak, such as English and Chinese, is called the natural language. The natural language is an

intense combination of grammar, semantics, syntax, sounds and symbols which makes it extremely

complicated and senseless to a computer. Hence, making a computer understand the human natural

language is something that requires a great deal of research, complex algorithms and rigorous pre-

processing of data. As complicated as the process might be, it is equally important since it is a major

step forward in the efficient integration of computers in our society. Although there have been many


breakthroughs in this field, there are still a large number of milestones that researchers are aiming to

reach. NLP has unique applications in almost every field; this report discusses its potential in finance.

Sentiment Analysis

Sentiment Analysis is one of the many major implementations of NLP. As the name suggests, sentiment

analysis is the practice in which a piece of text is processed and categorized

as positive, negative or neutral. Sentiment analysis is important for businesses to understand the general

opinion of the public about their overall brand, as well as the opinions on their products. Through this

analysis, companies can make business decisions accordingly, especially in this age where social media

and online platforms are so prevalent that people can voice their opinions with minimum effort. We

make use of this particular fact in our program.

Financial Background

Algorithmic Trading

A type of security trading in which a computer buys or sells automatically based on a set of pre-

programmed rules is called algorithmic trading. Such strategies typically involve tracking the price of

the security in question and/or certain market conditions and executing a buy or sell order without any

human intervention as soon as some criteria are met. Broadly, algorithmic trading involves four steps:

1. Strategy Identification: This step defines the criteria that should be met i.e. the signal to

execute trades. An example of a very simple strategy is mean reversion, which works under the

assumption that the price of a security will revert to its long-term mean level over the entire

dataset. A signal in this strategy can be as follows: if the price of a certain stock falls below

the mean by, say, 100 basis points, buy 10 shares. Strategies can be as simple as mean reversion

or as eccentric as one built upon the weather forecast: a paper2 suggests that sunshine boosts

people’s mood, which has a positive effect on the stock market.

2. Program Development: After theoretically identifying the idea behind a strategy, the next step

is to develop a computer program that will implement it. It involves using technical tools to

construct a framework that will generate the signal to execute the trades.

3. Backtesting: This is the most important step in the process of algorithmic trading and the step

where strategies pass or fail. “Back” in backtesting points to the past, and “testing” means

2 D. Hirshleifer, T. Shumway (2001) “Good Day Sunshine: Stock Returns and the Weather”.

http://www-personal.umich.edu/~shumway/papers.dir/weather.pdf


quantitatively evaluating the strategy. Once the computer program is ready, the next step is to

execute it on real data from the past to check if the strategy is profitable enough or not. For

example, a mean reversion strategy will be tested on stock prices from a few years ago and the

weather forecast strategy will be tested on weather conditions from the past. Backtesting has

become convenient with the abundance of data freely available. Nowadays, there are many

platforms such as Quantopian and MetaStock that provide a convenient way to backtest a

strategy and evaluate it on more metrics than just the return, such as the Sharpe

ratio, median profit per trade and outlier-adjusted profit factor.

4. Implementation: If a strategy is profitable and has an acceptable level of risk based on past

data, it becomes ready to be put into practice in the real world. There are many trading platforms

now that offer implementation of strategies to trade automatically on the user’s behalf.
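The mean reversion signal described in step 1 can be sketched in a few lines. This is a minimal illustration rather than part of the report's program: the function name, the sample prices and the reading of "100 basis points below the mean" as a 1% deviation from the mean of the whole series are all assumptions.

```python
from statistics import mean

def mean_reversion_signal(prices, threshold_bps=100, shares=10):
    """Return an (action, shares) pair for the most recent price in `prices`."""
    long_run_mean = mean(prices)  # mean over the entire dataset
    latest = prices[-1]
    # Deviation of the latest price from the mean in basis points (1 bp = 0.01%).
    deviation_bps = (latest - long_run_mean) / long_run_mean * 10_000
    if deviation_bps <= -threshold_bps:
        return ("buy", shares)  # price is far enough below the mean
    return ("hold", 0)

# Hypothetical closing prices; the last one is well below the average.
signal = mean_reversion_signal([100.0, 101.0, 99.0, 100.0, 95.0])
print(signal)
```

A backtest, as in step 3, would run such a function over historical prices day by day and record the trades it would have made.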

Algorithmic trading is now widely used in many financial institutions, often in the form of

what is known as High Frequency Trading (HFT). Using HFT, these institutions place tens of

thousands of orders in a matter of seconds and use extremely complicated algorithms. In its early stages,

most algorithms were designed to create trading strategies that involved complex mathematical

formulas and variables such as price and volume. Now, with the advancements in artificial intelligence,

algorithmic trading is reaching new levels.

Short Selling

As mentioned in the introduction, we will explore a short only trading strategy in this report. Short

selling is an investment strategy that speculates on the decline of the price of a security. This strategy

involves borrowing shares and selling them immediately. At a later stage when the price of that security

falls, the same number of shares are bought back at the lower price and returned to the lender. Since

shares are sold at a higher price and bought back at a lower price, the difference is the profit, minus the

borrowing fee. For instance, you borrow 10 shares of stock ‘A’ from someone when the price is

$10/share and immediately sell them. You now have $100. Later in the day, the price falls to $5 /share

and you buy 10 shares for $50 and return these shares to the lender with say, $5 as borrowing fee. After

these transactions, you are left with a $45 profit. Short selling is viable because people panic and are

more willing to sell when the price falls and willing to buy when the price is high (Trend chasing

behavioral bias).
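The arithmetic of the example above can be written as a small function. This is only a sketch; the function and parameter names are illustrative and do not come from the report's program.

```python
def short_sale_profit(shares, sell_price, buyback_price, borrowing_fee=0.0):
    """Profit from borrowing `shares`, selling them, and buying them back later."""
    proceeds = shares * sell_price   # cash received when the borrowed shares are sold
    cost = shares * buyback_price    # cash paid to buy the shares back
    return proceeds - cost - borrowing_fee

# The example from the text: 10 shares sold at $10, bought back at $5, $5 fee.
profit = short_sale_profit(10, 10.0, 5.0, borrowing_fee=5.0)

# If the price had risen to $15 instead, the same trade would lose money.
loss = short_sale_profit(10, 10.0, 15.0, borrowing_fee=5.0)
print(profit, loss)
```

The second call illustrates the risk discussed next: the loss grows without bound as the buyback price rises.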

An investor goes short on a security only when they are sure that it is overvalued and that the price is

going to fall. However, short selling comes with a very high risk. Theoretically, one could lose an

infinite amount of money when they go short because the price of the security could rise forever. For

this reason, short selling is typically only practiced by experienced investors. Hedge funds are the most


prevalent short sellers in the market; they use this strategy to hedge their long positions in other

securities and save their portfolios from losses.

The Program

This section will explain the logic behind the program after a brief description of all the libraries used

and the dataset.

The strategy is developed in the Python programming language, and the environment used is the Jupyter

Notebook. Python has extensive support to store, manipulate and process large datasets quickly with

the pandas library (see below). For this reason, among others, it has become the number

one choice for developers in the field of data science and finance.

Libraries

Figure 1 shows all the libraries that need to be imported for the program to work. The

first two libraries are the basic ones that every data scientist should be familiar with: pandas and

matplotlib. Pandas is the most commonly used library to store, manipulate and process data, while

matplotlib is the most extensively used library for charting data. In this project, we have used Pandas

to store the tweets, stock prices and sentiments (see section 3.3) and matplotlib for the visualizations of

that data. These two libraries are used in the majority of data-driven Python projects.

The other two libraries, which form the core of this project, are yfinance and

SentimentIntensityAnalyzer.

Library: “yfinance”

This library is used to download stock prices of any company from Yahoo Finance for a specified period

of time. You use the ‘download’ function which takes in 3 parameters: Ticker, start date and end date,

and returns a data frame with 6 columns: Open, High, Low, Close, Adj Close and Volume. In our

project, we only use the Open and Close columns of this data frame.

Figure 1

Source: CURIC


Figure 2 shows the output and the format of the “download” function of the library to retrieve the stock

price data for GOOGL.

Library: “SentimentIntensityAnalyzer”

This library is used for the sentiment analysis of the tweets in our dataset. Although we have imported

it from the Natural Language Toolkit (NLTK) package in our project, SentimentIntensityAnalyzer is an

independent library under the MIT License3. It has comprehensive documentation which describes its

features and provides some examples of its uses; this can be found in the VADER Sentiment GitHub

repository.

The library uses a sentiment analysis technique called VADER (Valence Aware Dictionary and

Sentiment Reasoner). VADER is a lexicon- and rule-based sentiment analysis tool. It is specifically tuned

for social media posts, which makes it a good fit for our project. It does not simply classify

text as positive, negative or neutral; it also gives a score (from -1 to +1) to show how positive or negative

the sentiment is. To do this, VADER recognizes and considers common practices seen in social media

posts such as UTF-8 encoded emojis, the use of punctuation (e.g. “good” would receive a less positive

score than “good!!!”), the use of word shape to signal emphasis, and the use of slang. We have created the

SentimentIntensityAnalyzer object and named it “sia”.

3 Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media

Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Figure 2

Source: CURIC

https://github.com/cjhutto/vaderSentiment


Figure 3 shows the different scores given to slightly modified versions of a sentence. As you can

see from the second sentence, adding an exclamation mark at the end increases the positive score

slightly, as compared to the first sentence. The last two sentences show that capitalizing the word “so”

to show emphasis also has an effect on the sentiment score of the sentence.

The thresholds typically used for the sentiment scores, as described in the GitHub documentation, are:

1. Positive sentiment: score >= 0.05

2. Neutral sentiment: score > -0.05 and score < 0.05

3. Negative sentiment: score <= -0.05

However, for simplicity in our project, we take any score less than 0 as a negative sentiment and any

score greater than 0 as a positive sentiment.
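The two labelling rules can be compared side by side. The sketch below assumes the score has already been computed (for example, the compound value returned by VADER) and only shows the thresholding; both function names are hypothetical.

```python
def label_with_thresholds(score, threshold=0.05):
    """Three-way labelling with a neutral band around zero."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

def label_simplified(score):
    """Rule used in this project: the sign of the score decides the label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # a score of exactly 0.0 carries no sentiment

for s in (0.6, 0.03, 0.0, -0.03, -0.142):
    print(s, label_with_thresholds(s), label_simplified(s))
```

Weak scores such as 0.03 or -0.03 are neutral under the first rule but count as positive or negative under the simplified rule, which makes the simplified rule more sensitive to mildly worded tweets.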

Dataset

A dataset is required to test any kind of algorithmic trading strategy, and the data required for this project was

the collection of tweets for each company chosen. The data was retrieved from the Twitter Dataset

NASDAQ 100. This resource has a collection of all the tweets for each company in the NASDAQ 100

index for a period of 79 days, from March 28th, 2016 to June 15th, 2016, available to download

for free. There are six Excel files for each company, and the tweets are contained in the “stream” sheet

of the “dashboard” file.

Figure 3

Source: CURIC

http://followthehashtag.com/datasets/nasdaq-100-companies-free-twitter-dataset/


Figure 4 gives an overview of the data we will be dealing with. We read the file containing tweets

about Google using the function “read_excel()” and store it in a pandas data frame that we name googl.

We see that in a period of 79 days, this data has over 29,000 tweets mentioning Google (including

retweets). The second cell shows the 26 columns that are originally included in the file and the third

cell shows the first two rows of the actual data. For our project, we only need two columns:

Date and Tweet content. For this purpose, we have written a function called “clean_data”, which returns

only the columns needed to start our sentiment analysis (Figure 5).
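The idea behind “clean_data” can be sketched without pandas by treating each tweet as a dictionary. This is an illustration only, not the report's implementation (which works on a pandas data frame); apart from Date and Tweet content, the column names and sample rows below are invented.

```python
def clean_data(rows, keep=("Date", "Tweet content")):
    """Return copies of the rows that contain only the columns in `keep`."""
    return [{column: row[column] for column in keep} for row in rows]

# Two made-up rows with a few extra columns standing in for the original 26.
raw_rows = [
    {"Date": "2016-06-05", "Tweet content": "Not happy with $GOOGL today",
     "Followers": 120, "Is a RT": False},
    {"Date": "2016-06-05", "Tweet content": "Great results from $GOOGL",
     "Followers": 10500, "Is a RT": True},
]

cleaned = clean_data(raw_rows)
print(cleaned[0])
```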

Figure 5

Source: CURIC

Figure 4

Source: CURIC


Now that we know the groundwork for our project, i.e. the libraries used, how the sentiment analysis

works, what the dataset looks like and where to find it, we can jump right in on deploying the strategy

and explore the results.

Deploying the Strategy

Our program consists of 8 helper functions, one of which has already been mentioned in the previous

section. After storing the tweets dataset with only the date and tweet content, we loop through the data

frame and apply the SentimentIntensityAnalyzer’s polarity_scores function on each tweet to get its

sentiment score, the same way we did when demonstrating the library in the Libraries section. For simplicity, we also omit the tweets

that have a sentiment score of 0.0. We do all this by using the get_sentiment_score function of our

program. Let’s demonstrate this on our dataset.

Figure 6 shows that the length of the dataset has reduced from around 29.3k to 14.5k after removing all

the tweets with a sentiment score of 0.0.
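The scoring and filtering step can be sketched as follows. To keep the example runnable without downloading the VADER lexicon, the scorer is passed in as a function; `toy_scorer` is a made-up stand-in, and in the actual program its place would be taken by something like the compound value of `sia.polarity_scores(text)`. The body of `get_sentiment_score` here is an assumption about how such a helper might look, not the report's code.

```python
def get_sentiment_score(rows, scorer):
    """Score every tweet and drop those whose score is exactly 0.0."""
    scored = []
    for row in rows:
        score = scorer(row["Tweet content"])
        if score != 0.0:  # neutral tweets are omitted for simplicity
            scored.append({**row, "sentiment": score})
    return scored

def toy_scorer(text):
    """Stand-in scorer: averages the values of a tiny, invented word list."""
    lexicon = {"great": 0.6, "terrible": -0.6}
    hits = [lexicon[word] for word in text.lower().split() if word in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

tweets = [
    {"Date": "2016-06-05", "Tweet content": "Terrible news about Google"},
    {"Date": "2016-06-05", "Tweet content": "Google opens a new office"},
    {"Date": "2016-06-06", "Tweet content": "Great quarter for Google"},
]
scored_tweets = get_sentiment_score(tweets, toy_scorer)
print(len(tweets), "->", len(scored_tweets))
```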

There may be strategies in place which make trading decisions based on single tweets. However, our

strategy focuses on the overall sentiment on a daily basis. For this reason, we now find the mean of the

Figure 6

Source: CURIC


sentiment scores grouped by date. In this step, we can also get rid of the tweet content column since we

do not need the actual tweets anymore. We do this by using the “get_daily_sentiment” function of our program.
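Grouping by date and averaging can be sketched with the standard library alone. The report does this on a pandas data frame; the version below is a dependency-free illustration with invented scores, and the function body is an assumption.

```python
from collections import defaultdict
from statistics import mean

def get_daily_sentiment(scored_rows):
    """Return {date: mean sentiment of that day's tweets}, ordered by date."""
    scores_by_date = defaultdict(list)
    for row in scored_rows:
        scores_by_date[row["Date"]].append(row["sentiment"])
    # The tweet text is no longer needed; only the date and the average remain.
    return {date: mean(scores) for date, scores in sorted(scores_by_date.items())}

# Invented scores; dates are ISO strings so that sorting is chronological.
scored_rows = [
    {"Date": "2016-06-05", "sentiment": -0.4},
    {"Date": "2016-06-04", "sentiment": 0.5},
    {"Date": "2016-06-05", "sentiment": 0.2},
    {"Date": "2016-06-04", "sentiment": 0.1},
    {"Date": "2016-06-05", "sentiment": -0.3},
]
daily = get_daily_sentiment(scored_rows)
negative_days = [date for date, score in daily.items() if score < 0]
print(daily, negative_days)
```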

Figure 7 shows the daily sentiment scores. This gives us an insight into how the public reacted to the

company in question on Twitter. We observe that the sentiment has been positive, i.e. a score above 0,

for most days, except on June 5th 2016, which has a sentiment score of -0.142. According to the Daily

Mail online archive for June 4th 2016, there was a news article with the headline, “Google removes

racist Chrome extension that was used by neo-nazis to target Jewish people online”. The article can be

found in the Daily Mail online archive. Such news can cause a stir on social media, and

this piece of news could have contributed to the negative opinions about Google the next day.

Now that we have the daily sentiments from Twitter, following our strategy, we short sell 100 shares

on the next two consecutive days - 6th and 7th June. This means that we borrow 100 shares and sell them

as soon as the market opens the next day, i.e. at the Open price. According to our theory, the share price

should fall during the day, so the Close price will be less than the Open price. Hence, we buy back 100

shares right before the market is supposed to close and return them to the lender. We repeat the same

process for another day. In order to calculate the return and test if our strategy actually works, we first

need to get the stock price of the company for the days we have the sentiment scores for. We do this by

calling the “get_stock_prices” function; the output is similar to Figure 2 and we store it in a dataframe

called “prices”. For our program, we are only concerned with the open and the close price. Therefore,

we call the “match_open_close” function and add these two columns to our dataframe which contains

the sentiment scores per day.
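Attaching the prices to the daily sentiment is essentially a join on the date. The sketch below shows one way this could work with plain dictionaries; it is not the report's “match_open_close”. The prices and the June 6th score are invented, and keeping non-trading days with empty prices (rather than dropping them) is an assumption made so that a negative day falling on a weekend is not lost.

```python
def match_open_close(daily_sentiment, prices):
    """Attach Open and Close to each sentiment day (None when the market was closed)."""
    merged = []
    for date in sorted(daily_sentiment):
        quote = prices.get(date, {})  # no quote on weekends and holidays
        merged.append({
            "Date": date,
            "sentiment": daily_sentiment[date],
            "Open": quote.get("Open"),
            "Close": quote.get("Close"),
        })
    return merged

daily_sentiment = {"2016-06-05": -0.142, "2016-06-06": 0.10}
prices = {"2016-06-06": {"Open": 50.0, "Close": 48.0}}  # made-up prices
table = match_open_close(daily_sentiment, prices)
print(table)
```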

Figure 7

Source: CURIC

https://www.dailymail.co.uk/news/article-3625316/Google-removes-racist-Chrome-extension-used-neo-Nazis-target-Jewish-people-online.html


As Figure 8 shows, two columns have been added, namely “short” and “return”. The former will contain

a value of 0 or 1; a value of 1 represents a signal that the computer needs to execute a short-sell

trade on that day. The latter will hold the return earned on that day. Since we are short

selling 100 shares in our strategy, selling at the Open and buying back at the Close, the return is

calculated as: Return = 100*(Open - Close). We generate a signal by calling the “short” function and

calculate the return by calling the “calculate_return” function. The results are shown below when we

do it on our Google data frame.
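The signal and return steps can be sketched together. A short position gains when the Close is below the Open, so the sketch computes the daily return as 100 × (Open − Close). The function bodies, the sample prices and the rule that the two short days are the next two rows with a price (that is, trading days) are assumptions for illustration; only the -0.142 score comes from the report.

```python
SHARES = 100

def short(rows, hold_days=2):
    """Flag the next `hold_days` trading days after each negative-sentiment day."""
    flagged = [{**row, "short": 0} for row in rows]
    for i, row in enumerate(flagged):
        if row["sentiment"] < 0:
            remaining = hold_days
            for later in flagged[i + 1:]:
                if remaining == 0:
                    break
                if later["Open"] is not None:  # skip days without prices
                    later["short"] = 1
                    remaining -= 1
    return flagged

def calculate_return(rows, shares=SHARES):
    """Sell at the Open, buy back at the Close: return = shares * (Open - Close)."""
    return [
        {**row, "return": shares * (row["Open"] - row["Close"]) if row["short"] else 0.0}
        for row in rows
    ]

rows = [  # made-up prices; None marks a weekend
    {"Date": "2016-06-03", "sentiment": 0.12, "Open": 51.0, "Close": 50.5},
    {"Date": "2016-06-04", "sentiment": 0.05, "Open": None, "Close": None},
    {"Date": "2016-06-05", "sentiment": -0.142, "Open": None, "Close": None},
    {"Date": "2016-06-06", "sentiment": 0.10, "Open": 50.0, "Close": 48.0},
    {"Date": "2016-06-07", "sentiment": 0.08, "Open": 48.0, "Close": 47.5},
    {"Date": "2016-06-08", "sentiment": 0.20, "Open": 47.5, "Close": 47.0},
]
result = calculate_return(short(rows))
total_return = sum(row["return"] for row in result)
print([row["short"] for row in result], total_return)
```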

Figure 8

Source: CURIC


Figure 9 shows the final result of our strategy when applied to the Google stock. Our theory that the

stock price will fall after a day of negative opinions on Twitter has proven to be true in this case, as we

can see from rows 68 and 69. Prices fell on the third day too; however, shorting for three consecutive days

would be too risky for an already high-risk strategy.

After shorting 100 shares for two days, we earn a total return of $1061. If the borrowing fee

is $50 per day, it leaves us with a profit of $961 (a $100 borrowing fee for two days).

This section showed a step by step process of how we go from having raw Twitter data to theoretically

earning a profit without any significant capital requirement. We apply the exact same steps to all the

other stocks in our portfolio and calculate the total return in the next section.

Figure 9

Source: CURIC


Results

After following the steps shown in the previous section on the stocks of Amazon, Facebook and Apple,

we arrive at the following results:

Ticker | Days Short | Per Day Profit/Loss | Total Return | Borrower’s Fee ($50 per day) | Net Return

AMZN | 25th June 2016, 26th June 2016, 27th June 2016 | $959 Loss, $928 Profit, $522 Profit | $491 | $150 | $341 Profit

FB | 10th May 2016, 11th May 2016 | $87 Loss, $89 Profit | $2 | $100 | $98 Loss

AAPL | 5th May 2016, 6th May 2016 | $76 Profit, $65 Profit | $141 | $100 | $41 Profit

No trading strategy, technical or fundamental, can guarantee a profit in every scenario. As we can

see, we incur a loss in deploying our strategy for the Facebook stock.

If we add all the net returns, including the $961 from Google, we end up with a net profit of $1245.
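The net profit can be checked by recomputing it from the figures reported for each stock. The sketch below uses only numbers stated in this report: the Google total of $1061 over two days, and the per-day results for the other three stocks. The variable names are illustrative.

```python
FEE_PER_DAY = 50  # borrowing fee assumed in the report, in dollars

# Gross return per ticker; losses are entered as negative numbers.
gross = {
    "GOOGL": 1061,
    "AMZN": -959 + 928 + 522,
    "FB": -87 + 89,
    "AAPL": 76 + 65,
}
days_short = {"GOOGL": 2, "AMZN": 3, "FB": 2, "AAPL": 2}

# Net return = gross return minus the borrowing fee for every day short.
net = {ticker: gross[ticker] - FEE_PER_DAY * days_short[ticker] for ticker in gross}
total_net = sum(net.values())
print(net, total_net)
```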

Limitations

As mentioned in the introduction, the purpose of this report is to present an idea behind a trading

strategy. There are a number of variables, evaluation metrics and extra requirements that are involved

for a strategy to be put into practice. Hence, if the strategy presented in this report had been

used in real life in the same timeframe, the net return would most likely be different.

Figure 10

Source: CURIC


This report only tests the strategy on four stocks, for a short period of time. In order to really see if it is

a viable strategy, more rigorous testing is needed, on a greater number of stocks and on a much larger

timeframe. The reason it was not done in this project was the shortage of data currently available online.

Rigorous testing would require collecting data on our own; however, we did not have an adequate

amount of time or resources to do so. We also chose only four large firms, under the assumption

that they will be most susceptible to Twitter sentiment.

Future developments

The strategy presented in this report is fairly simple and is used for demonstration purposes only. Real

strategies use extremely complicated algorithms to trade securities. By using our strategy as a base,

there can be various modifications to the program in order to make it more accurate. For example, we

can consider the number of followers for the writer of each tweet while calculating our sentiment score.

We can expect that a tweet written by an account with 10k followers will have more effect on the

sentiment of that day than a tweet by an account with 100 followers, since the former will reach more

people. Similarly, we can tweak our sentiment analysis taking such factors into account.
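One possible way to take followers into account is a weighted average, where each tweet's score is weighted by the author's follower count. This is only one hypothetical weighting scheme, not something tested in this report; the function name and sample values are invented, and a real implementation might prefer a dampened weight (such as the logarithm of the follower count) so that a single large account does not dominate.

```python
def weighted_daily_sentiment(tweets):
    """Average sentiment for one day, weighted by each author's follower count."""
    total_weight = sum(tweet["followers"] for tweet in tweets)
    if total_weight == 0:
        return 0.0  # no reach at all: treat the day as neutral
    weighted_sum = sum(tweet["sentiment"] * tweet["followers"] for tweet in tweets)
    return weighted_sum / total_weight

# Two invented tweets with opposite scores, as in the 10k versus 100 example.
day = [
    {"sentiment": -0.5, "followers": 10_000},
    {"sentiment": 0.5, "followers": 100},
]
simple_mean = sum(tweet["sentiment"] for tweet in day) / len(day)
weighted_mean = weighted_daily_sentiment(day)
print(simple_mean, weighted_mean)
```

With a simple mean the two tweets cancel out, while the weighted mean is clearly negative because the larger account dominates.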

Since our strategy is based on Twitter data, the first step to put the strategy into practice would be to

track and store tweets in real time. This is possible with the Twitter API. By making an account with the Twitter

API, every user gets tokens that can be used with various Python libraries. One such library is

Tweepy, which establishes a connection and retrieves tweets in real time. The details of how to do this are

beyond the scope of this report.


Conclusion

In this report, we have seen what algorithmic trading is and how it works. As a demonstration, we

developed a short only strategy based on sentiment analysis of Twitter data. In conclusion, it should be

noted that although the strategy presented is highly risky and practically not the most viable strategy to

adopt, the concepts presented in this report such as natural language processing, sentiment analysis and

algorithmic trading are state-of-the-art practices with great potential in the financial

world, especially with the recent technological advancements. Therefore, having knowledge of these

concepts is a bonus for any individual aspiring to have a career in this field.


DISCLAIMER

This report is produced by university student members of CityU Student Research & Investment Club (the Club).

All material presented in this report, unless otherwise specified, is under copyright of the Club. None of the

material, nor its content, nor any copy of it, may be altered in any way without the prior express written permission

and approval of the Club. All trademarks, service marks, and logos used in this report are trademarks or service

marks of the Club. The information, tools and materials presented in this report are for information purposes only

and should not be used or considered as an offer or a solicitation of an offer to sell or buy or subscribe to securities

or other financial instruments. The Club has not taken any measures to ensure that the opinions in the report are

suitable for any particular investor. This report does not constitute any form of legal, investment, taxation, or

accounting advice, nor does this report constitute a personal recommendation to you. Information and opinions

presented in this report have been obtained from or derived from sources which the Club believes to be reliable

and appropriate but the Club makes no representation as to their accuracy or completeness. The Club accepts no

liability for loss arising from the use of the material presented in this report. Due attention should be given to the

fact that this report is written by university students. This report is not to be relied upon in substitution for the

exercise of independent judgement. The Club may have issued in the past, and may issue in the future, other

communications and reports which are inconsistent with, and reach different conclusions from, the information

presented in this report. Such communications and reports represent the different assumptions, views, and

analytical methods of the analysts who prepared them. The Club is not under an obligation to ensure that such

communications and reports are brought to the attention to any recipient of this report. This report, and all other

publications by the Club do not constitute the opinion of the City University of Hong Kong, nor any governing or

student body or department under the University aside from the Club itself. This report may provide the addresses

of, or contain hyperlinks to, websites. Except to the extent to which the report refers to website material of the

Club, the Club has not reviewed any such website and takes no responsibility for the content contained therein.

Such addresses or hyperlinks (including addresses or hyperlinks to the Club’s own website material) are provided

solely for your own convenience and information and the content of any such website does not in any way form

part of this Report. Accessing such website or following such link through this report shall be at your own risk.
