Individual Investors, Social Media and Chinese Stock Market: a Correlation Study
By
Yonghui Wu
B.E., Shanghai Jiao Tong University, 2007
M.E., Shanghai Jiao Tong University, 2010
SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN MANAGEMENT STUDIES
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUNE 2016
© 2016 Yonghui Wu. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic
copies of this thesis document in whole or in part in any medium now known or hereafter created.
Signature of Author: Signature redacted
MIT Sloan School of Management
May 6, 2016
Certified by: Signature redacted
Erik Brynjolfsson
Schussel Family Professor
Thesis Supervisor
Accepted by: Signature redacted
Rodrigo S. Verdi
Associate Professor of Accounting
Program Director, M.S. in Management Studies Program
MIT Sloan School of Management
Individual Investors, Social Media and Chinese Stock Market: a Correlation Study
By
Yonghui Wu
Submitted to MIT Sloan School of Management on May 6, 2016 in Partial fulfillment of the
requirements for the Degree of Master of Science in Management Studies.
ABSTRACT
The Chinese stock market is a unique financial market with heavy involvement of individual investors. This article explores how the sentiment expressed on social media is correlated with the stock market in China. Textual analysis of posts from one of the most popular social media platforms in China is conducted using HowNet and NTUSD, the two most commonly used Chinese sentiment dictionaries.
The correlation matrices and regressions between sentiment ratios and returns over 9 holding periods for all 30 sample securities reveal that correlation exists between investor sentiment on social media and the future returns of the Chinese stock market. In addition, I find that the negative sentiment ratio is superior to the positive sentiment ratio, and that the correlation of the sentiment ratio with returns persists in future holding periods. Also, by comparing different stocks and indices, I find that a well-established market index is better correlated with social media sentiment than individual stocks, and that well-known 'star' stocks are better correlated with social media than other stocks. However, I test the VAR model on the Shanghai Composite Index and find that the model is stable but shows no Granger causality. Better data and improved analysis are needed to predict the stock market with social media.
Thesis Supervisor: Erik Brynjolfsson
Title: Schussel Family Professor
Acknowledgements
I feel grateful and privileged to have worked with my thesis advisor Professor Erik
Brynjolfsson. I would like to thank him for his guidance in helping me navigate the
thesis process, and for his prompt feedback and suggestions regarding the directions and the
resources of this study.
I would also like to thank Professor Marshall Van Alstyne and other fellow students for
their valuable comments and encouragement on this research in the class of Economics of
Digitalization.
This study is very new and challenging for me because I have little prior experience in
programming. This thesis would not have been possible without the help of my friend Lerith
Tian. Lerith has helped me tremendously with Python programming and textual analysis. I
am very grateful for his help and have learned a lot from his patient guidance.
I also benefited a lot from my other friends. Shuyi Yu has provided me with many
valuable suggestions on statistical analysis. Shan Huang has helped me narrow down the
research scope at the very beginning. Alora Chen, Jin Jing Liu and Liam O'Dea have
greatly supported me during my preparation for this thesis. I am indebted to these dear
friends of mine.
Last but not least, I would like to thank my parents Shunfeng Wu and Ganying Deng
as well as my sister Yonghong Wu. Thank you for always believing in me and standing
behind all my endeavors.
Contents
1 Introduction
  1.1 Literature review on investor sentiment and the stock market
  1.2 Social media and stock markets in China
  1.3 Literature review on Chinese NLP
  1.4 Summary
2 Data
  2.1 Guba
    2.1.1 Guba as a social media in China
    2.1.2 Posts and samples
  2.2 Financial data
3 Method
  3.1 Dictionaries and word list
  3.2 Segmenting and parsing the posts
  3.3 Quantifying the positive and negative sentiment
  3.4 Regression methods
4 Results
  4.1 Correlation Analysis
    4.1.1 Positive ratio vs. negative ratio
    4.1.2 Different holding periods and securities
  4.2 Regression Analysis
    4.2.1 Positive ratio vs. negative ratio
    4.2.2 Difference between stocks
    4.2.3 Difference between stocks and indices
  4.3 Time-series Analysis
    4.3.1 Lag selection
    4.3.2 VAR Model
    4.3.3 Stability test
List of Figures
1 Social network penetration in China from 2012 to 2018
2 Domestic market capitalization of stock exchanges in the world in 2014
3 Accumulated Number of Individual A Share Accounts in China
4 Trading Volume of Different Investor Types in China (2011, 2012)
5 Timespan and Posts Under the Selected 30 Sections
6 Company Information of the 28 Selected Stocks
7 Summary of Method
8 List of Positive & Negative Words for Stock Market
9 Denotations of Returns
10 Correlation of Negative Ratio and Returns of Sample Stocks and Indices
11 Correlation of Positive Ratio to Returns of Sample Stocks and Indices
12 Average of Correlation Coefficients for Positive and Negative Ratio
13 Negative Correlations with Different Returns for Sample Securities
14 Positive Correlation with Different Holding Period Return
15 Sample Securities Ranked by Posts per Day
16 Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks
17 Positive Ratio and Negative Ratio are Both Significant for 17 Stocks
18 Outliers in Regression Coefficients
19 Coefficients of Positive Ratio and Negative Ratio for Stocks
20 Regression Results for Indices
21 Coefficients for Positive Ratio: Indices vs. Stocks
22 Coefficients for Negative Ratio: Indices vs. Stocks
23 Lag Length Selection
24 VAR Model
25 Unit Root Check
26 Granger Causality Test
1 Introduction
1.1 Literature review on investor sentiment and the stock market
Behavioral science tells us that emotions can influence people's decisions. In finance,
many researchers in behavioral finance have found that stock performance is affected by
investor behavior and sentiment. In contrast to the standard finance model, in which
unemotional investors always force capital market prices to equal the rational present
value of expected future cash flows, behavioral finance has grown substantially in the past
decade to augment the standard model. The first dimension of behavioral finance concerns
behavior patterns. Many behavior patterns and biases have been discovered. For example,
M. Seasholes and N. Zhu (2010) find that individuals tilt their portfolios towards
locally headquartered firms, and that this local bias does not bring them superior returns.
J. Engelberg and C. Parsons (2011) identify the causal effect of local media on
trading behavior: all else equal, local press coverage increases the daily trading volume of
local retail investors. The second dimension of behavioral finance relates to sentiment.
One of the most important assumptions in behavioral finance is that investors are subject to
sentiment. Investor sentiment, defined broadly, is a belief about future cash flows and
investment risks that is not justified by the facts at hand (Baker 2007). The question
is no longer whether investor sentiment affects stock prices, but rather how to measure
investor sentiment and quantify its effects. Many measures have been developed, such as
investor surveys (Qiu and Welch, 2004), investor mood (Kamstra, Kramer and Levi, 2003),
retail investor trades (Barber, Odean, and Zhu, 2003), IPO first-day returns, and
option-implied volatility (the Market Volatility Index, or VIX, which measures the implied
volatility of options on the Standard & Poor's 100 stock index). However, those measures are
only proxies for investor sentiment, not direct measurements. In addition, data
availability narrows these options considerably, because data for some measures is costly and
sometimes subjective.
A direct way to measure investor sentiment is to quantify the language used in the media. By
quantifying language, researchers can examine and judge the directional impact of a
limitless variety of events. Tetlock (2008) analyzes the negative words in the Wall Street
Journal and concludes that negative words in firm-specific stories leading up to earnings
announcements significantly contribute to a useful measure of firms' fundamentals. Feng Li (2009)
uses a Naive Bayesian algorithm to detect tone in Management's Discussion and Analysis
of Financial Condition and Results of Operations (MD&A) and finds that the tone of
forward-looking statements is positively correlated with future performance and has
explanatory power incremental to other variables. Loughran and McDonald (2010) also
compare the widely used Harvard-IV-4 TagNeg (H4N) file with five other word lists. Those
studies have explored the quantification of news in journals and of public information released
by companies.
As the Internet has come to play a major role in business and in people's everyday lives,
social media has become one of the most important venues where individual investors share
their opinions on financial securities. The content on social media, diverse in
quality, huge in quantity and different from traditional media, provides a direct source of
data for measuring investor sentiment. A growing body of literature has addressed the
relationship between user-generated content on social media and the stock market in the United
States. Bollen, Mao and Zeng (2011) find that the mood expressed on Twitter among
financial investors can predict daily stock returns. Karabulut (2013) finds that the National
Happiness Index issued by Facebook is correlated with daily stock returns and trading
volume. Gilbert and Karahalios construct an Anxiety Index based on Twitter to predict
the stock market. Chen and De (2014) conduct research on Seeking Alpha and find
evidence that views expressed on Seeking Alpha can predict future stock returns. With
its ever-growing amount of user-generated content, social media has become an important
confluence of investment sentiment and has exerted its influence on financial markets.
However, little English-language literature addresses the correlation between social media and
the stock market in countries other than the United States. In this study, I explore the
correlation between social media and the stock market in China.
Figure 1: Social network penetration in China from 2012 to 2018
1.2 Social media and stock markets in China
The past decade has witnessed the boom of social media in China. In addition to having
the world's biggest Internet user base (513 million people, more than double the 245
million users in the United States), China also has the world's most active environment for
social media. More than 300 million people use social media, from blogs and social-networking
sites to microblogs and other online communities. That is roughly equivalent to the combined
population of France, Germany, Italy, Spain, and the United Kingdom. In addition, China's
online users spend more than 40 percent of their time online on social media, a figure that
continues to rise rapidly. Statistics show that Chinese people spend an average of 3 hours
daily on social media, 1 hour more than on TV. WeChat, the popular application that
combines the best of Facebook and WhatsApp, had 600 million monthly active users as of
June 2015. With such a high penetration of social media in China, how social media has
affected the stock market becomes an interesting question.
Figure 2: Domestic market capitalization of stock exchanges in the world in 2014 (in billion U.S. dollars)
In addition, the Chinese stock markets have been growing very fast in the past two
decades, though they are still young by global standards. Taken together, China's two stock
markets ranked second in the world in market capitalization in 2014, behind the New York
Stock Exchange. As of December 31, 2015, the total number of A share accounts had
reached over 200 million. In addition, trading in the Chinese stock markets is very active
compared with many other stock markets, which makes the Chinese stock markets
among the most dynamic in the world. Volatility is also high in stock markets
in China. For example, the Chinese stock market suffered a major crash in 2015. The
crash began with the popping of the stock market bubble on 12 June 2015. A third of
the value of A-shares on the Shanghai Stock Exchange was lost within one month of the
event. Against such a dynamic and volatile backdrop, studying the effects of investor sentiment
on Chinese stock markets has the potential to help us understand this market better.
For example, what role does investor sentiment play in this volatile market? Do the
findings on investor sentiment in the United States apply to Chinese stock markets?
Figure 3: Accumulated Number of Individual A Share Accounts in China from 2001 to 2015
Before answering those questions, the information and market characteristics of the
Chinese stock markets need to be considered. According to Fama's (1970)
EMH theory, efficient stock markets accurately reflect all available information at all times.
Weak-form efficiency implies that current prices reflect all historical price information, and
the semi-strong form implies that all public information is fully reflected in market
prices. In the strongest form of the theory, even private (insider) information is
already incorporated into market prices. Many studies have been conducted to identify
which form the Chinese stock market belongs to, and the results are mixed. Amelie and
Olivier found that B shares on Chinese stock exchanges do not follow the random walk
hypothesis and therefore are significantly inefficient, whereas A shares are more efficient.
Nisar adopted three methods to test the random walk theory and concluded that the
Chinese stock market is inefficient. Malkiel argued that the Chinese stock market was
broad-form efficient because semi-strong and weak forms both existed in it. As many
economists suggest, a well-functioning financial system should be supported by a strong
legal system and by proper corporate governance, but China lacks both. This fact may
partially explain the mixed results in testing the EMH theory. Moreover, behavioral finance
also sheds new light on resolving the contradictions in applying EMH theory to the Chinese
market.
Figure 4: Trading Volume of Different Investor Types in China (2011, 2012)
One feature that sets the Chinese stock market apart from other markets in the
world is that individual investors are very active in China. Individual investors, who
hold merely 26% of total market capitalization, account for 78% of daily trading volume.
China's approximately 200 million retail investors trade more often than investors in any
other country: 81 percent of individual investors trade at least once a month, compared
with 53 percent in the U.S., according to a survey by State Street in 2015. Moreover,
individual investors in China are less educated than those in other countries: as a survey by
Bloomberg shows, more than two-thirds of the most recent batch of new investors did not even
graduate from high school. Much of the literature has found that individual investors are
prone to many biases. De Bondt (1998) portrays individual investors as people who discover naive
patterns in past price movements, share popular models of value, and trade in suboptimal
ways. Subsequent researchers have discovered many biases from which individual investors
suffer, such as overconfidence, availability, framing and accounting biases. Barber
(2008) found that individual investors are prone to invest in attention-grabbing targets
because of their limited energy and time to search for investments. With all those portraits
and biases documented in the literature, it is natural to wonder whether those biases are
present in investors' social media postings. The large base of individual investors and the heavy
involvement of individuals in daily stock trading in China make the Chinese stock market
a unique place to observe how social media has affected people's investment behavior.
1.3 Literature review on Chinese NLP
Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written. Human languages, usually referred to as natural
languages, are dynamic sets of symbols and corresponding rules for communication. NLP can be
considered a technique for realizing linguistic theory to facilitate real-world
applications, such as online content analysis and machine learning, and it has grown very fast
as a component of artificial intelligence (AI) since its inception in the 1950s. In this regard,
Chinese NLP is no exception. However, the lack of clear delimiters between words in Chinese
sets Chinese NLP apart from NLP for Western languages. Unlike English text, in which words
are delimited by white space, Chinese text represents sentences as strings of
Chinese characters (hanzi) without similar natural delimiters between them. For this reason,
automatic word segmentation, the major step in Chinese morphological analysis, lays
the foundation of any modern Chinese information system (Wong, Li, Xu, Zhang, 2010).
The problem of Chinese word segmentation has been studied by researchers for many
years. Several different algorithms have been proposed, which, generally speaking, can
be classified into character-based approaches and word-based approaches (Wong, Li, Xu,
Zhang, 2010). The character-based approach is used mainly to process classical Chinese
texts. It is simple and easy to use, which brings the further advantages of reduced costs and
minimal overhead in the indexing and querying process. On the other hand, word-based
approaches are gaining popularity because of the increasing computing power of
computers. Word-based approaches, as the name implies, attempt to extract complete words
from sentences. They can be further categorized as statistics-based, dictionary-based,
comprehension-based, and learning-based approaches. Because much of the English
literature on social media and stock markets uses lexicons and word lists, in
this study I follow suit and choose a dictionary-based approach to analyze online content
on Chinese social media.
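To make the dictionary-based idea concrete, the following is a minimal sketch of forward maximum matching, one classic dictionary-based segmentation technique. It is shown only for illustration and is not the segmenter used in this study; the function name, the `max_len` parameter and the toy lexicon are assumptions introduced here.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest lexicon entry (up to max_len
    characters) that matches; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon for illustration only (hypothetical, not from HowNet or NTUSD).
toy_lexicon = {"股票", "市场", "股票市场", "上涨"}
segmented = forward_max_match("股票市场上涨了", toy_lexicon)
print(segmented)
```

The greedy rule prefers the longest entry, so a compound word in the lexicon is kept whole rather than split into its parts; characters not covered by the lexicon come out as single-character tokens.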
1.4 Summary
One major assumption of behavioral finance is that investors are affected by sentiment, and
this has been supported by many studies in past decades. Now, in the era of social media,
a burgeoning body of literature has begun to address the effects of social media on the stock
market. Researchers have investigated the explanatory power of social media in the United
States, such as Twitter, Facebook and Seeking Alpha, and positive evidence has been found. As
China is now the largest market for social media and has the second largest stock markets
in the world, the effect of social media on Chinese stock markets becomes an interesting
question. In addition, the development of Chinese NLP techniques has provided ample
lexicons and tools for textual analysis in Chinese. Therefore, in this study I explore the
correlation between social media and stock markets in China.
Specifically, I want to explore whether a correlation between social media and the Chinese
stock market exists, and how the correlations of the negative sentiment ratio and the
positive sentiment ratio differ. The correlations of different stocks and market indices will
also be compared to see which "tags" or topics connect social media and the Chinese stock
market most closely. Following prior literature, I will also test the correlation of social
media sentiment with future returns over different periods to see how the correlation persists.
2 Data
This study uses post data from Guba, a popular investment-themed social media site in China.
Financial price data are downloaded from iFind. The sample period varies because of
data availability on Guba.
2.1 Guba
2.1.1 Guba as a social media in China
Guba (http://guba.com.cn/) is the most popular investment-related online community in
China. It is part of Eastmoney.com, which is the largest online financial media site in
China. In the Alexa ranking system, Eastmoney ranks 772 globally, well above
Seeking Alpha (1,480). According to iResearch, during January 2015 the Daily
Unique Visitors to Eastmoney.com was 15,210,00, 5.9% of all netizens in China. Guba's
mobile application is also widely downloaded among smartphone users. In addition, the
Weekly Effective Viewing Duration on Eastmoney during January 2015 reached 21,220,000
hours, which indicates strong user loyalty. It is quite easy to post
on Guba, and it does not even require registration before posting. The ease of posting
increases the popularity of Guba among individual investors, especially less educated
investors.
Guba is a topic-based, forum-style social media site. Each topic is named after one stock
or index, so posts are naturally categorized under different stocks or indices of the stock
market. There are over 2200 topics, covering almost every individual stock and market
index. In this study, I download all the posts under selected stocks and indices, and all the
posts under one stock or index are analyzed to extract the investor sentiment for this stock
or index. Here I use the stock name to denote the discussions under one specific topic.
2.1.2 Posts and samples
I download posts under 30 sections in total. The sections and post information are shown in
Figure 5. Among the 30 sections, two are market indices: Shanghai Composite Index and
China Stock Index Futures. Both are typical indices for the overall stock market in China.
The China Stock Index Futures are known to be very responsive to market information and
more liquid than equities.
I also download posts of 28 individual stocks. My selection criteria for these stocks are
(1) representativeness of the A share market in China; (2) relative popularity among all sections;
and (3) heterogeneity. Because those stocks are all constituents of the CSI 300, a
capitalization-weighted stock market index designed to replicate the performance of 300
stocks traded on the Shanghai and Shenzhen stock exchanges, they are very representative
of Chinese stock markets. They are also of different sizes and from different industries.
Figure 6 shows additional information on those stocks. The 28 stocks comprise
a mixture of industries, such as media, financial services, retail, construction and
transportation. Market capitalization varies from 19,205 million CNY to 1,124,975
million CNY. The PE ratio also varies a lot, from 6 at the lowest to 188 at the highest. Companies
like China Vanke and OCT Group have been listed since the 1990s, while some companies only
became public in recent years. In this regard, the companies are very heterogeneous.
The time spans of the posts downloaded in this study vary across sections.
One reason is that the companies were listed at different times, and posts about a company's
shares are possible only after its listing date. Another reason for the variation in periods is
technical: for very popular sections such as Shanghai Composite, a huge amount of
data exists, and some data may have been lost on the server or in the process of downloading.
In order to ensure data quality, I discard periods that contain consecutive
short blank intervals.
Category  Ticker  English Name            Beginning Date  Ending Date  Total Posts
Indices   000001  Shanghai Composite      11/18/2013      11/15/2015   1,129,503
Indices   000300  China Stock Future      11/6/2014       11/18/2015   85,094
Stocks    000002  China Vanke             11/10/2013      12/10/2015   76,914
Stocks    000069  OCT Group               7/18/2008       12/28/2015   68,198
Stocks    000156  WASU Media              10/19/2012      12/31/2015   28,099
Stocks    000157  Zoomlion                10/9/2008       12/31/2015   126,625
Stocks    000333  Midea Group             9/19/2013       12/31/2015   44,541
Stocks    000338  Weichai Power           6/20/2008       12/31/2015   115,473
Stocks    000651  Gree Electrical         9/26/2008       12/31/2015   44,821
Stocks    000712  Golden Dragon           12/23/2008      12/30/2015   6,455
Stocks    000725  BOE Tech                4/3/2009        12/31/2015   63,236
Stocks    000728  Guoyuan Securities      7/1/2008        12/31/2015   75,854
Stocks    002008  Han's Laser             2/8/2013        12/29/2015   51,425
Stocks    002024  Suning                  4/25/2013       12/31/2015   72,274
Stocks    600038  Avicopter               7/17/2008       12/8/2015    15,441
Stocks    601618  MCC                     9/10/2009       12/31/2015   62,570
Stocks    601669  Power Construction      9/25/2011       12/31/2015   132,530
Stocks    601688  Huatai Securities       2/9/2010        12/31/2015   53,592
Stocks    601699  Lu'an Envir. Energy     7/8/2008        12/31/2015   48,943
Stocks    601718  Jihua Group             8/24/2010       12/23/2015   16,791
Stocks    601766  CRRC Corp.              7/31/2008       12/31/2015   556,721
Stocks    601872  CMES                    6/2/2008        12/31/2015   75,511
Stocks    601888  CITS                    9/22/2009       12/31/2015   30,882
Stocks    601898  China Coal Energy       6/1/2008        12/31/2015   83,052
Stocks    601899  Zijin Mining            6/6/2008        12/31/2015   157,243
Stocks    601901  Founder Securities      7/31/2011       12/31/2015   147,603
Stocks    601919  COSCO                   6/2/2008        12/31/2015   124,354
Stocks    601928  Phoenix Publ. & Media   11/17/2011      12/31/2015   83,290
Stocks    601939  China Construction Bank 6/6/2008        12/31/2015   101,991
Stocks    603993  China Molybdenum        9/10/2012       12/31/2015   55,400
Total                                                                  3,734,426
Figure 5: Timespan and Posts Under the Selected 30 Sections
Ticker     Name                     Market Value (Dec 31 2015)  PE (Dec 31 2015)  Date of Listing  Industry
                                    Millions CNY
000002.SZ  China Vanke              263,094                     17                1991-01-29       Real Estate
000069.SZ  OCT Group                64,715                      14                1997-09-10       Real Estate
000156.SZ  WASU Media               47,014                      123               2000-09-06       Media & Entertainment
000157.SZ  Zoomlion                 36,937                      69                2000-10-12       Heavy Machinery
000333.SZ  Midea Group              140,001                     13                2013-09-18       Electrical Manufacturing
000338.SZ  Weichai Power            36,225                      8                 2007-04-30       Automobile
000651.SZ  Gree Electrical          134,452                     9                 1996-11-18       Electrical Manufacturing
000712.SZ  Golden Dragon            26,092                      67                1997-04-15       Financial Service
000728.SZ  Guoyuan Securities       44,369                      32                1997-06-16       Financial Service
000725.SZ  BOE Tech                 103,151                     41                2001-01-12       Electrical Manufacturing
002008.SZ  Han's Laser              27,339                      39                2004-06-25       High-tech Manufacturing
002024.SZ  Suning                   99,302                      115               2004-07-21       Retail
600038.SH  Avicopter                31,083                      94                2000-12-18       Aircraft
601618.SH  MCC                      103,387                     29                2009-09-21       Construction
601669.SH  Power Construction       110,450                     23                2011-10-18       Construction
601688.SH  Huatai Securities        133,389                     31                2010-02-26       Financial Service
601699.SH  Lu'an Envir. Energy      19,205                      20                2006-09-22       Mining
601718.SH  Jihua Group              44,240                      38                2010-08-16       Textile
601766.SH  CRRC Corp.               329,574                     66                2008-08-18       Railway Equipment
601872.SH  CMES                     37,573                      188               2006-12-01       Transportation
601888.SH  CITS                     57,901                      39                2009-10-15       Travelling
601898.SH  China Coal Energy        65,588                      105               2008-02-01       Mining
601899.SH  Zijin Mining             65,441                      32                2008-04-25       Mining
601901.SH  Founder Securities       79,028                      44                2011-08-10       Financial Service
601919.SH  COSCO                    76,484                      254               2007-06-26       Transportation
601928.SH  Phoenix Publ. & Media    40,540                      34                2011-11-30       Media & Entertainment
601939.SH  China Construction Bank  1,124,975                   6                 2007-09-25       Financial Service
603993.SH  China Molybdenum         62,552                      41                2012-10-09       Mining
Figure 6: Company Information of the 28 Selected Stocks
2.2 Financial data
I download all financial data from Wind Financial Terminal, a leading financial
data provider in China that covers stocks, bonds, funds, indices, warrants, commodity
futures, foreign exchange, and the macro industry. Daily closing prices are downloaded
for the Shanghai Composite Index, China Stock Index Futures and all the individual
stocks. Following prior literature (Chen, De, et al. 2011), I compute returns for different
holding periods: R1 denotes the return from holding the stock for 1 day, i.e. buying at the
closing price of day t - 1 and selling at the closing price of day t. R2 denotes the return
from a 2-day holding period, and R3 the return from a 3-day holding period. The same goes
for R4, R5, R7, R10, R15 and R30.
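The holding-period returns can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: the function name and the toy prices are hypothetical, and it assumes that the k-day return for day t is bought at the close of day t - 1 and sold at the close of day t + k - 1, which matches the definition of R1 above but is only an assumed reading for longer periods.

```python
# The nine holding periods named in the text.
HOLDING_PERIODS = (1, 2, 3, 4, 5, 7, 10, 15, 30)


def holding_period_returns(closes, k):
    """Return a list aligned with `closes`.

    Entry t is closes[t + k - 1] / closes[t - 1] - 1 (assumed convention),
    or None where the required prices fall outside the sample.
    """
    returns = []
    for t in range(len(closes)):
        start, end = t - 1, t + k - 1
        if start < 0 or end >= len(closes):
            returns.append(None)
        else:
            returns.append(closes[end] / closes[start] - 1.0)
    return returns


# Toy closing prices (illustrative only).
closes = [10.0, 11.0, 12.1, 11.0]
r1 = holding_period_returns(closes, 1)
r2 = holding_period_returns(closes, 2)
all_returns = {k: holding_period_returns(closes, k) for k in HOLDING_PERIODS}
```

Days at the edges of the sample have no defined return and are marked `None`, so every series stays aligned with the price series by date.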
3 Method
In this study I use the dictionary-based approach to Chinese natural language processing for
the textual analysis. The first step is to pick a suitable lexicon. In order to make the textual
analysis reflect the sentiment of Guba posts accurately, the existing dictionaries also need
to be augmented. I select the most popular words used in stock discussions among
Chinese netizens and add them to the positive and negative dictionaries of HowNet and NTUSD.
Some of these words are catchwords and some are market jargon. The second step is to segment all
the posts into words. To do this, the Jieba parser is used in this study to parse all the posts
from Guba. After parsing, I compute the total counts of negative and positive words,
and the ratio of each to the total word count for each day. With those ratios and the price
data, I conduct correlation and regression analysis.
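The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function name, the toy posts and the toy word sets are hypothetical, and the posts are assumed to be already segmented into tokens (for example by a segmenter such as Jieba).

```python
from collections import defaultdict


def daily_sentiment_ratios(posts, positive_words, negative_words):
    """Compute per-day sentiment ratios.

    posts: iterable of (date, tokens) pairs, where tokens is a list of
    already-segmented words. Returns {date: (positive_ratio, negative_ratio)},
    each ratio being the day's sentiment word count over its total word count.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [positive, negative, total]
    for date, tokens in posts:
        day = counts[date]
        for token in tokens:
            day[2] += 1
            if token in positive_words:
                day[0] += 1
            if token in negative_words:
                day[1] += 1
    return {d: (p / n, m / n) for d, (p, m, n) in counts.items() if n > 0}


# Toy inputs for illustration only (hypothetical words and dates).
positive = {"大涨"}
negative = {"暴跌", "割肉"}
posts = [
    ("2015-06-12", ["今天", "大涨", "开心"]),
    ("2015-06-12", ["暴跌", "了"]),
    ("2015-06-15", ["暴跌", "暴跌", "割肉", "了"]),
]
ratios = daily_sentiment_ratios(posts, positive, negative)
```

All posts from the same day are pooled before dividing, so a day's ratio is weighted by word count rather than averaged post by post; that pooling is an assumption of this sketch.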
3.1 Dictionaries and word list
Many lexicons have been developed for textual analysis in NLP. Of all these lexicons,
HowNet and NTUSD are the two most popular dictionaries. HowNet is an on-line common-sense
knowledge base unveiling inter-conceptual relationships and inter-attribute
relationships of concepts as connoted in lexicons of Chinese and their English equivalents.
Data Collecting    Text Analysis              Statistical Analysis
1. Posts           1. Dictionary              1. Correlation Analysis
2. Prices          2. Parsing                 2. Linear Regression
                   3. # of Sentiment Words    3. Time Series Regression
                   4. Sentiment Ratio
Figure 7: Summary of Method
NTUSD (National Taiwan University Sentiment Dictionary) is built on the Chinese language
and is independent of other languages. Xu, Zhao, Qiu and Hu (2010) compare these
dictionaries and find that they are insufficient: many sentiment words are not included
in the current Chinese sentiment dictionaries. For example, HowNet contains 3969 positive
words and 3755 negative words; NTUSD contains 2648 positive words and 7742 negative
words. Among them, only 669 positive words and 877 negative words are shared.
In order to better measure the sentiment behind the posts, I combine the lexicons of
Hownet and NTUSD as the base dictionary for this study. In addition, I manually build a
word list designed specifically for the stock market. I first use the Chinese parser to
divide all the posts into words, and extract the 300 most frequent words from them. Then
I distribute the list of 300 words to three experienced individual investors, who are
also heavy users of Guba. They mark the words which they think express positive or
negative emotions. In the meantime, in order to eliminate parsing errors, I also read
over 400 pages of posts to pick out positive and negative words. Based on the two lists,
I build a list of 53 positive words and 70 negative words (Figure 8).
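The dictionary construction can be illustrated with plain Python sets. This is only a
sketch: the helper names are hypothetical, the toy entries are English stand-ins for the
real Chinese words, and the size calculation assumes that neither HowNet nor NTUSD
contains internal duplicates.

```python
def merge_lexicons(base_a, base_b, extra):
    """Union of two base dictionaries plus a hand-built stock-market word list."""
    return set(base_a) | set(base_b) | set(extra)


def union_size(size_a: int, size_b: int, shared: int) -> int:
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return size_a + size_b - shared


# Sizes of the combined base dictionary implied by the counts quoted above,
# before the stock-market word list is added.
base_positive = union_size(3969, 2648, 669)   # HowNet + NTUSD positive words
base_negative = union_size(3755, 7742, 877)   # HowNet + NTUSD negative words

# Toy example with made-up entries, for illustration only.
toy = merge_lexicons({"good", "rise"}, {"rise", "strong"}, {"limit-up"})
```

Under that assumption the combined base dictionary holds 5948 positive and 10620 negative
words, so the hand-built stock-market list is a small addition by comparison.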
3.2 Segmenting and parsing the posts
Chinese is very different from English in its syntactic features. For example, Chinese
makes less use of function words and morphology than English, and verbs appear in a
unique form with few supporting function words. Also, subject pro-drop, which is the null
realization of uncontrolled pronominal subjects, is widespread in Chinese but rare in
English. Therefore, Natural Language Processing (NLP) tools are very different for the
two languages. There are several widely used open-source Chinese NLP tools, such as
Jieba, BosonNLP, NLPIR, LTP-Cloud, etc. The Stanford NLP Group and the Berkeley NLP Group
also provide Chinese segmenters and parsers.
Figure 8: List of Positive & Negative Words for Stock Market (panels: Negative words (88
words); Positive words (58 words))
I adopt the Jieba NLP tool package (https://github.com/fxsjy/jieba), because it is widely
used in the social media industry and is considered stable and accurate in Chinese
parsing. Jieba is a free open-source tool written in Python. Its algorithm is based on a
trie structure, so Jieba is able to find all the possible word combinations in a sentence
and arrive at the most probable path through dynamic programming. Moreover, Jieba adopts
a Hidden Markov Model and the Viterbi algorithm to detect new words. With these features,
Jieba has become one of the most popular segmenters and parsers among Chinese users.
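A segmentation step along these lines might look as follows. This is a hedged sketch, not
the code used in the study: `segment_posts` and its parameters are hypothetical names,
while `jieba.lcut` and `jieba.add_word` are the package's own functions for cutting a
string into a list of words and for registering a custom word.

```python
from typing import Callable, Iterable, List, Optional


def segment_posts(
    posts: Iterable[str],
    custom_words: Iterable[str] = (),
    cut: Optional[Callable[[str], List[str]]] = None,
) -> List[List[str]]:
    """Split each post into a list of words.

    By default the segmenter is jieba.lcut; `cut` lets a caller plug in any
    other tokenizer that maps a string to a list of strings.
    """
    if cut is None:
        import jieba  # third-party package: pip install jieba

        # Register domain words so the segmenter keeps them in one piece.
        for word in custom_words:
            jieba.add_word(word)
        cut = jieba.lcut
    # Drop whitespace-only tokens, which carry no sentiment.
    return [[tok for tok in cut(post) if tok.strip()] for post in posts]
```

With Jieba installed, a call such as `segment_posts(["今天大盘涨停"], custom_words=["涨停"])`
returns one word list per post; the sample sentence and the custom word are illustrative
only.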
3.3 Quantifying the positive and negative sentiment
Prior literature has noted that positive words in English are of limited use in measuring
sentiment, because they are frequently subject to negation, and corporate communications
rarely convey positive news using negated words (Loughran and McDonald 2011). However,
this may not be the case in social media, where people are less mindful of their
language. Moreover, many positive words in Chinese are verbs; if a Chinese investor wants
to express the opposite opinion, he would simply use the opposite verb. Negating the
original verb is not the same as using the opposite verb. Some studies on Weibo use
positive sentiment in their tests and find that positive measures are also useful (Pang,
Li, et al.). Based on these considerations, I include positive measures in this study.
I calculate the total word count per day, the total count of positive words per day, and
the total count of negative words per day for every section. Formally, I use the
frequency of positive words and negative words as the measures of positive and negative
sentiment.
\[ NegRatio = \frac{\text{No. of Negative Words}}{\text{Total Word Count}} \]
\[ PosRatio = \frac{\text{No. of Positive Words}}{\text{Total Word Count}} \]
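The two ratios translate directly into code. The sketch below assumes the posts of one
day have already been segmented into a flat list of words; the function name and the toy
word sets are hypothetical, and returning zero ratios for a day without any words is an
assumption rather than something the study specifies.

```python
from typing import Iterable, Set, Tuple


def sentiment_ratios(
    tokens: Iterable[str], positive: Set[str], negative: Set[str]
) -> Tuple[float, float]:
    """Return (NegRatio, PosRatio) for one day's worth of segmented words."""
    total = pos = neg = 0
    for tok in tokens:
        total += 1
        if tok in positive:
            pos += 1
        if tok in negative:
            neg += 1
    if total == 0:
        return 0.0, 0.0  # assumption: a day without words gets zero ratios
    return neg / total, pos / total


# Toy day with four words: one positive, two negative.
neg_ratio, pos_ratio = sentiment_ratios(
    ["up", "down", "down", "the"], positive={"up"}, negative={"down"}
)
```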
3.4 Regression methods
I first test the correlation of the sentiment ratios at day t with the returns of 9
different holding periods (R1, R2, R3, R4, R5, R7, R10, R15 and R30). In this study, I
use Ri to denote the return of the holding period from day t to day t + i. For example,
R1 is the same-day return at day t, because the holding period is just 1 day; R2 is the
return from day t to day t + 2; and R30 is the return from holding the security for one
month from day t. Figure 9 shows the relationship of these returns.
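The correlation between a daily sentiment ratio and a return series is the ordinary
Pearson coefficient. The following is a minimal pure-Python sketch with hypothetical
function names; it also gives the usual t statistic with n - 2 degrees of freedom, which
is one common way to attach a significance level to a correlation (the study does not
state which test was used).

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic for H0: zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```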
After the correlation analysis, I test the linear relationship between the returns of
different holding periods and the sentiment measures from Guba with the following
regression.
\[ R_{i,t,n} = \alpha_{i,t} + \beta_{i,1} NegRatio_{i,t} + \beta_{i,2} PosRatio_{i,t} + \epsilon_{i,t} \]
The dependent variable is the return $R_{i,t,n}$, where i indexes sections, t denotes the
day on which posts are posted on Guba, and n denotes the number of days in the holding
period. Specifically, I choose 9 holding periods to study: R1, R2, R3, R4, R5, R7, R10,
R15 and R30. So I conduct 9 regressions for each stock and index to see the relationship
of the sentiment ratios to the different returns.
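For one security and one holding period, the regression has an intercept and two
regressors, so the least-squares estimates can be written in closed form. The sketch
below is illustrative only: it solves the two-regressor normal equations on centered
data, uses a hypothetical function name and made-up numbers, and does not compute
standard errors.

```python
from typing import Sequence, Tuple


def ols_two_regressors(
    y: Sequence[float], neg: Sequence[float], pos: Sequence[float]
) -> Tuple[float, float, float]:
    """Least-squares fit of y = a + b1 * neg + b2 * pos; returns (a, b1, b2)."""
    n = len(y)
    if not (n == len(neg) == len(pos)) or n < 3:
        raise ValueError("need three series of equal length >= 3")
    my, m1, m2 = sum(y) / n, sum(neg) / n, sum(pos) / n
    s11 = sum((a - m1) ** 2 for a in neg)
    s22 = sum((b - m2) ** 2 for b in pos)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(neg, pos))
    s1y = sum((a - m1) * (c - my) for a, c in zip(neg, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(pos, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("regressors are collinear")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


# Made-up ratios and returns that satisfy y = 0.01 - 2 * neg + 1 * pos exactly.
neg = [0.01, 0.02, 0.03, 0.015]
pos = [0.02, 0.01, 0.03, 0.025]
y = [0.01 - 2 * a + b for a, b in zip(neg, pos)]
a_hat, b1_hat, b2_hat = ols_two_regressors(y, neg, pos)
```

Running the same fit once per holding period gives the 9 regressions per security
described above.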
After the linear regressions with future returns, I also conduct time series analysis and
construct vector autoregression (VAR) models for selected stocks and indices using
different lags of the positive ratio and negative ratio. Granger causality tests are also
conducted to assess the predictive power of these models.
Figure 9: Denotations of Returns (timeline marked t+2, t+3, t+4, t+5, t+7, t+10, t+15 and
t+30; negative ratio and positive ratio at day t)
4 Results
4.1 Correlation Analysis
4.1.1 Positive ratio v.s. Negative ratio
Figure 10 shows the correlation coefficients between negative ratio and different returns.
Figure 11 shows the correlation coefficients between positive ratio and different returns.
First of all, we can see that the results for negative ratio are much more statistically
significant than those for positive ratio. Of all the 30 sample stocks and indices, 29
securities have very significant correlation between negative ratio and different returns
(most of them significant at the 1% level), while only 16 securities have significant
correlation between positive ratio and different returns. In this sense, the negative
ratio has a more substantial correlation with stock returns.
Secondly, the correlation coefficients of negative ratio are much larger (in absolute
value) than the correlation coefficients of positive ratio. For every stock and index,
the negative ratio has a much higher correlation with returns. The only exception is the
Shanghai Composite, whose positive and negative ratios have almost the same level of
correlation. Most of the positive correlation coefficients are much lower than those of
the negative ratio. Figure 12 shows
Ticker Ratio R1 R2 R3 R4 R5 R7 R10 R15 R30
(11100)2 negative ratio -0.223*** -0.280' .313"* -0283'" 4210*'* -0281*** 0.24* -0266"* -0.195"'
00(81618 negative ratio -0- 1* 0.68* -0.155*** -0. 138** 134* ~0.137*** -0 141** -0.1254*** 0855"*
000156 negative ratio -00258 -0.0268 00205 -0 0106 0,000432 0.00640 A.0171 0,0449 0 157***
000157 negative ratio -0 175* * -0. 1901* -0184** -168"-O 0.155** -0.117*** 410990* -9082**
000333 negative ratio -009761 -0J39 0.13* -0131" -0.122* -0.0678 -0:0278 -0,0634 -00925'
000338 negaive ratio 07131** -0.142+1 -0,134*** -0 130** 1 0153* -0.142** 0130* -0.12"* .0864"'
000651 negative ratio -0,04* -0108" -0.126*** -0 126**-OI22** -0t23"' -0.135"' -0.114++ -0.127*
000712 negative ratio -0.181 .158** 4.143* -0,107' -0. 0* -0.0935 -A0645 -0.0338 -oOso3
000725 ncgative ratio -011 0"'* -3114** -0.0976** -00946* -0.0701* -0.0348 -00379 410304 -0.0169
000728 negative ratio -094*** -0.212* -0.211"' -0,216' -0.218"* -0.1910** -0.194** -0. 157"' -0.0665
002008 nogativeratio -0.255'** -. 268- -0.286"* -0233*** -0.187* -0.158** -0.196-" .236"* -0.199"
002024 negtive ratio -0 "* -0.192*** -0.174** 4- 179*** -0-171' -0.171" 4150"' -0.142* -0122"*
6o )38 negative ratio -0 143*** -0.117* -0.117* -0.1151' -0.1321 -0.127W*4 -0.133"' .y44'** -0.0886*
601618 negative ratio -0.0534 -0.0851* -0.0956" -0.1956" -0.0870' -0.07204 -0.0750' - .0418 0,00468
601669 negative ratio -024,i' -0.235"' -0.253** -0.237"' -0 22* -0.230"' -0226"' .225"* -0.24*
601688 negative ratio -0164*' -076*** -0.144"** -0115"' -0136 * -0.112** -00837' -0.108" 4.08310601699 negative ratio 033* -0-173"* -0.080.' -0.181* -0.181"* -0.168" -0171"' -0.191' -0184"*'
601718 ne-gaye ratio -095* -0.285** -0.287*' -0.268"* -0,166* -0,116 -0,124 -0.118 -).164*
601766 negative ratio -0.2"2*** 4219"' -0.214"'0 -0 197"* -0.195** -0.180*** -0164*"* -0.148"' -0138*"'
601871 negative ratio -050** -0.163 -0.145"' -0.133*** -0.133"' 4.143'" 4138.' -, 138"' 4112*"601888 Iegat ve, ratio O 08** -0. i I1"' -0.111*"' 4.106'" -0AW93*** -00859"' -0.0794"w -0.0814** -0.0748**
601898 negative rtio -0104*** -0.111*** -0.105*** -0. 101'** -0.0933*** -0.08061" -00848*** -0.0716** -0.0231
601899 negative ratio 00903" -0.0925* -0.0924** -0.0799* -0.0982" -D103" -0.132*0" -0.120"' -0,103"
601901 ncgativc ratio -0,2* -0.192 -0.195" -0. 1802" -0 l "' -0. 138*** -0.114** -0.080" -0.0546
601919 nrgativ ratio .0157*"* -0.192" -. 192"* -1. 08" -0-" 0182*** -0,184" -015** -Y4117w*'
601928 negative ratio -0-163*** -0.155' -0,141*** -0 120'* -0.14** -0.126" -0.132** -0. 106** -0.0769*
601939 negative ratio -0101*** 407 -013' -080971* 093 * 00924*** -0.0433#" -. 082* -0.0896" -0.0563
601993 negative ratio -0 119*.* -3-0" A0, 113" 5** -0.0939" -0.0587 -0,0558 -0.1000 4).112*
negative ratio -0 23 -0.179"* -0.216**' 0 4 222"' -0240*' 0.279f -0.338'** -0426"'" 41441"
SbanghaCbpoia negative ratio -0361"'* -0.466"' 42*** -0488** -0.499** -0454** -0.538*'* ) 0.w0" -0516"*
* p<0.05, ** p<0.01, *** p<0.001
Figure 10: Correlation of Negative Ratio and Returns of Sample Stocks and Indices
Ticker Ratio R1 R2 R3 R4 R5 R7 R10 R15 R30
00(032 positiv ratio 0.0814 0.0862 0049* 0.109' 0 1 0059 0 007(5 00148 40654
5%"'69 )0s0iVCno t .0600* 0045" 0.058 7 0565' o0567* 0.0505k 60.675'" 045 00416
000156 roeotive reo 0.0142 o,0257 0.0252 0.0240 0.024 o03(Oi7 0,0330 0 O066 (11210*
0001 ; iive man A0712"* 0190" 011768** 00827** 0.079** 0.0616' 00621' 0.0782" 0,01'2
000333 positive ratio 0.00676 0.240 0.04' 0.0627 0,0504 O 4O 0.0400 00721 it 108'
14(33A posiaive ia't U.0417 0.0343 0,0477 0150.0 t.'622* 0.0622* 0.062 2' 0 .0022* 113"*
0051 pxitive rato 0.0847' Q. I I" 0,082" 0 0514* 0.0718 0i4 0 077 00374 ) 0253
000712 positivrerdai 00200 0.305 00.344 0.0314 0.0212 -0 (W!21 00143 0.0512 0,-06
000725 poiive Mr6o 0.0721' 0.0599 a0474 0.0353 0.0269 0. 051 7 0.0338 0557 0.02W8
000729 posltive ratio 0 028 0.0279 0-0303 0.094 0.0339 00456 0.0512* 00558' 0.8753"0021)0 poia, rat ( 0135"' Q34*"' 0103" 0.,094* 0077 00491 0.0824 0047 0.0114r
002024 pomitiv ratc 0 151"** 0154-' .142*** 0.123 0.9501 00755 0.C659 00597 f01359
60833 pOitic Vrwio 0.0438 ',0493 0.0624 0,0297 0.0223 0020 U.018 0,0320 0.0133
601611 postve ro 0 110* 00827* 0.0875* 0.0928" 0.063" 0 107" 0.0975* 0106* 00524
601669 pontiac i I 0. 115, 103" 0.104"0 115"* 0.1U0S" 0113" 0.120400 1315" 0,17*"
11614 i 0.0929" O18" (,104" 0.13**' 0,0957" 00859" 0.0974" 00875" 0 t17#.
601699 poestiv tao 0123" 11105"'* 0.0833"' 0.07%* 00" 00604' 0.0477 0 015 "5 )'327
601719 poiivc rtO 0.0626 0117 0.174* 0. 18* 0 * 0.250" 0,276** 0-21" 113*
601766 postiv tiratc 0128*" 0J24** 01124** 0 111" 0.10'*** 0.)894** 0.0759** 0764" 00684*
601871 poitiv Muo't 0O625* 061(3*" 0.0683" 0.0459* 0.0334 0.0256 0.0329 0005 A002126
601801i positive ratio 3 013' 0+867** 0.0743" 0.0575* 0.0593* 003-il 0.0591 0.0779"* 014
601S91 pt10 iv9ra.o 0.040* Q,1677-- A0583* 0.0500' 0.0459 0.07 0.426 006133 01670"
60189') positia rato 009C? + 026*** 01 0.125*** 0.129*** 0. 1200" 0.104+ 0.0878' 0,0589
60101 positive 'MO .116"' 0132*** 0149** 0153*" 0.140"' 1I40"' 0132*** .147" 0147*"*
601919 pie t-c 00779' 00301 00793** 0.952* 0.098*" 0031" 0.0373" 0.078301 0.0722
60t72a pimitiv rat - 1 0M.1 0106*1, 0083"' 0.0713' 0.056 0(0F5 0),441 00281 (0.0.59
601939 " itivr rtxl 00766" ' 4f7' 0.0554* 0.0661 * 49' 1 0. 101" U10** 0.108*' 0 tItI,*
V)t941 Pt idvl rato 00415 0910 00339 0.0359 0.43 0 - 0.0105 00108 0.0242
Inde Fatie p tiivL rt-o U.0471 U0464 0 100-14 -4. 019 -Cii)9 -C 02 -0 0643 101Y31 .0.70
ponstirveattit 40"" 0400"' 0466*** 0,426"* 0.410"' 0164"' 0320"* 0229"' .1"'5**
* p<0.05, ** p<0.01, *** p<0.001
Figure 11: Correlation of Positive Ratio to Returns of Sample Stocks and Indices
                                        R1       R2       R3       R4       R5       R7       R10      R15      R30
Positive ratio   Average corr. coeff.   0.085    0.093    0.092    0.087    0.083    0.076    0.071    0.071    0.066
                 Variance               0.0051   0.0073   0.0065   0.0059   0.0063   0.0053   0.0054   0.0036   0.0040
Negative ratio   Average corr. coeff.   -0.156   -0.173   -0.173   -0.163   -0.155   -0.146   -0.143   -0.141   -0.118
                 Variance               0.0043   0.0066   0.0077   0.0076   0.0076   0.0091   0.0107   0.0128   0.0146
Figure 12: Average of Correlation Coefficients for Positive and Negative Ratio
the average correlation coefficients of positive ratio and negative ratio. We can see that
the correlation coefficients of negative ratio are, in absolute value, nearly twice those
of positive ratio.
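The "nearly twice" comparison can be checked directly against the averages in Figure 12.
The short sketch below divides the absolute average coefficient of negative ratio by the
average coefficient of positive ratio for each holding period; the variable names are
hypothetical and the numbers are the averages as read from Figure 12.

```python
PERIODS = ["R1", "R2", "R3", "R4", "R5", "R7", "R10", "R15", "R30"]

# Average correlation coefficients, as read from Figure 12.
AVG_POS = [0.085, 0.093, 0.092, 0.087, 0.083, 0.076, 0.071, 0.071, 0.066]
AVG_NEG = [-0.156, -0.173, -0.173, -0.163, -0.155, -0.146, -0.143, -0.141, -0.118]

# |negative| / positive for each holding period.
ratios = {p: abs(n) / q for p, n, q in zip(PERIODS, AVG_NEG, AVG_POS)}

smallest = min(ratios, key=ratios.get)
largest = max(ratios, key=ratios.get)
```

On these figures the ratio stays between roughly 1.8 and 2.0 for every holding period,
with the smallest value at R30 and the largest at R10.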
Thus, both the magnitude and the statistical significance are better when we use negative
ratio. This result is in accordance with studies on English text. Positive words are
subject to negation, so a positive word may be expressing either a positive or a negative
tone. At the beginning, I wondered whether this would not be the case in Chinese, but the
results of the correlation analysis show that it also holds true in Chinese. I think the
major reason lies in the dictionaries I used. Many positive words in Hownet and NTUSD are
positive adjectives, which are more often subject to negation. Although I add a word list
with many positive verbs and nouns to augment the dictionary, it is only a small part of
the lexicon. Therefore, negative ratio is better than positive ratio in terms of both the
degree and the significance of correlation.
4.1.2 Different Holding Periods and Securities
Figure 13 shows the correlation coefficients for different stocks' returns over different
holding periods. There is no clear sign that the correlation with future returns
decreases as the holding period gets longer. However, the variance between securities is
very large.
Figure 13: Negative Correlations with Different Returns for Sample Securities (chart
title: Negative Ratio Correlation with Returns of Sample Stocks and Indices)
Figure 14: Positive Correlation with Different Holding Period Return (chart title:
Positive Ratio Correlation with Returns of Sample Stocks and Indices)
First, the Shanghai Composite Index has the largest correlation coefficients for positive
and negative ratio with all holding periods' returns. The China Stock Index Future also
has a very high correlation with negative ratio. Individual stocks also vary a lot in
their correlation level. Some have relatively large correlation coefficients (in absolute
value), such as 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power
Construction), 601766 (CRRC Corp) and 000157 (Zoomlion).
Why do these securities have higher correlation coefficients? One possible reason is that
they have more data. Figure 15 shows that the Shanghai Composite Index (szzs) has 1554
posts per day, about 7 times as many as the China Stock Index Future (gzqh) with 226, and
far more than the individual stocks. In addition, 000002 (China Vanke), 002008 (Han's
Laser), 601669 (China Power Construction), 601766 (CRRC Corp) and 000157 (Zoomlion) all
have frequent posts per day.
There is an outlier, 601718 (Jihua Group). Although it has very few posts per day, its
positive correlation coefficients are relatively larger than those of many other stocks.
Considering the low significance level of 601718's correlation coefficients in Figure 10
and Figure 11, and its low frequency of posts (rank 28 in the sample), I think 601718 is
simply an outlier that can be ignored. Another outlier is 000156 (WASU Media), because it
has large positive correlation coefficients for negative ratio, which is against common
sense. One reason for 000156's abnormality may lie in its relatively infrequent posts.
4.2 Regression Analysis
4.2.1 Positive ratio v.s. negative ratio
Similar to the correlation results, the regression results for positive ratio are not as
good as those for negative ratio in terms of significance. As Figure 16 shows, for 11
stocks the coefficients for positive ratio are not significant across the 9 kinds of
returns. However, the coefficients for negative ratio are almost all significant for all
sample stocks and returns (see Figure 16 and Figure 17).
Rank   ticker   post total   days    posts per day
1      szzs     1,129,503    727     1554
2      gzqh     85,094       377     226
3      601766   556,721      2,709   206
4      000002   76,914       760     101
5      601901   147,603      1,614   91
6      601669   132,530      1,558   85
7      002024   72,274       980     74
8      601899   157,243      2,764   57
9      601928   83,290       1,505   55
10     000333   44,541       833     53
11     002008   51,425       1,044   49
12     000157   126,625      2,639   48
13     601993   55,400       1,207   46
14     601919   124,354      2,768   45
15     000338   115,473      2,750   42
16     601939   101,991      2,764   37
17     601898   83,052       2,769   30
18     000728   75,854       2,739   28
19     601871   75,511       2,768   27
20     601618   62,570       2,303   27
21     000725   63,236       2,463   26
22     000069   68,198       2,719   25
23     601688   53,592       2,151   25
24     000156   28,099       1,168   24
25     601699   48,943       2,732   18
26     000651   44,821       2,652   17
27     601888   30,882       2,291   13
28     601718   16,791       1,947   9
29     600038   15,441       2,700   6
30     000712   6,455        2,563   3
Figure 15: Sample Securities Ranked by Posts per Day
In addition, it is very obvious in Figure 16 and Figure 17 that, for the same stock, the
coefficients of negative ratio are generally larger (in absolute value) than those of
positive ratio. This fact shows that in the regression model
\[ R_{i,t,n} = \alpha_{i,t} + \beta_{i,1} NegRatio_{i,t} + \beta_{i,2} PosRatio_{i,t} + \epsilon_{i,t} \]
$\beta_{i,1}$ is larger than $\beta_{i,2}$ in absolute value, i.e. negative ratio has
better correlation with the stock returns. As discussed above, this is because positive
words are subject to negation. This study shows that the same holds for textual analysis
in Chinese, especially when the dictionaries used are comprised of many adjectives.
Therefore, we can conclude that negative ratio is a better measurement of investor
sentiment, because it has a stable and significant correlation with returns of different
holding periods.
4.2.2 Difference between stocks
There is a large divergence of correlation between different stocks (Figure 18). The
coefficients for $NegRatio_{i,t}$ range from -3 to +15 for individual stocks, and the
coefficients
VARIABLES R1 R2 R3 R4 R5 R7 R10 R15 R30
000156 positive ratio 0.387 1.058 1.248 1.362 1.586 2.743 2.668 6.156* 14.16***
negative ratio -0.653 -0.972 -0.876 -0.486 0.0824 0.496 1.287 3.781 15.01*4*
000002 positive ratio 0.224 0.322 0.425 0.597* 0.685* 0.262 -0.176 -0.282 -1.406*
negative ratio -0.683*** -I.255*** -L691*** -.7Q9*** -1.890*** -2.170*** -2.379*** -2.750*** -2.902***
positive-ratio -0.0524 -0,0286 0.0883 0.278 0.216 0.348 0.414 0.871 1.847**
negative ratio -0.300** -0.613*** -0.734*** -0.762*** -0.800*** -0.486 -0.192 -0.576 -1.190*
positive-ratio 0.0662 0.0661 0.145 0.214* 0.258* 0.313* 0.385** 0.480** 1.460***
neative ratio -0.297*** -0.477*** -0.547*** -0.607*** -0.806*** -0.880** -0.957** -1109-** 1004***
000712 positive-ratio -0.00593 0.0117 0.0263 0.036 0.017 -0.0431 0.0179 0.206 0.2
negative ratio -0.214*** -0.299*** -0.334*** -0.292** -0.309** -0.356* -0.286 -0.148 -0.535
positive-ratio 0.0538** 0.0629 0.058 0.0448 0.0383 0.107 0.0769 0.156* 0.104000725 negative-ratio -0.0980*** -0.151*** -0.155*** -0.171*** -0.141** -0.0708 -0.0955 -0.0796 -0.0664
000728 positive ratio 0.0426 0.0573 0.102 0.128 0.116 0.204 0.272* 0.392* 0.991***
0-0728 negativeratio -0.483*** -0.774*** -0.941*** -1.103*** -1.240*** -1.244*** -1.468*** -1.500*** -0.914**
positiveratio 0.0696 0.162 0.292** 0.292* 0.249 0.292 0.214 0.129 0.348
601928 negative ratio -0.342*** -0.463*** -0.506*** -0,508*** -0.503*** -0.723*** -0.906*** -0.864*** -0.855**
601993 positive ratio 0.152 0.15 0.238 0.276 0.341 0.231 0.142 0.218 0.483
negative ratio -0.312*** -0.487*** -0.602*** -0.616*** -0.567*** -0.411* -0.443 -0.925*** -1.380***
600038 positive ratio 0.0325 0.0751 0.135 0.0354 -0.00429 0.0168 -0.0278 0.0396 -0.0204
negative ratio -0.226*** -0.267*** -0.324*** -0.389*** -0.498*** -0.544*** -0.677*** .-0.805*** -0.708**
601718 positive ratio 0.0888 0.288 0.660* 0.852* 1.517** 2.311*** 3.266*** 3.275*** 4.430**
negative ratio -0.453** -O.872*** -1.039*** -1.121*** -0.724* -0.591 -0.708 -0.983 -2.752*
Figure 16: Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks
for $PosRatio_{i,t}$ range from -1.4 to 14.6. First of all, I think the large coefficient
of +15 for negative ratio and the large coefficient of 14.6 for positive ratio (both of
which come from 000156) are outliers and should be ignored. As discussed in the
correlation analysis, 000156's correlation is not significant at any acceptable
significance level, and its posting frequency is also very low. Considering this, I will
not consider 000156's data in the following discussion.
After removing the outliers, we still see large variance in coefficient values across
stocks. For example, 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power
Construction), 601699 (Lu'an Environment Energy), 601766 (CRRC Corp), and 601919 (COSCO)
have larger coefficients for negative ratio than other stocks (as negative ratio is
superior to positive ratio, I will focus on discussing negative ratio). This may also be
related to the relative number of posts per day, because 000002 (China Vanke), 002008
(Han's Laser), 601669 (China Power Construction) and 601766 (CRRC Corp) have many more
posts per day than other stocks.
VARIABLES R1 R2 R3 R4 R5 R7 R10 R15 R30
601766 positive ratio 0.829*** L176*** 1382"* 0.943 0.865 0,377 0.512 0.645 -2.055
negative rato -04715*** -1.388* ' 1.817 * -2. 196*" - 5 " -6.34 - 2
poitverai 09.-A6 83** 265** -6,3
601872 poitiveratio 0.111*** 0.174*** 0.218*** 0170* 0.141 0.13 096' 015- 0006
negative ratio -0.224*** -0360*** -0,396** 0 41"-0 ' 0.1"' 0.74"'
--------. -0410 -4 ** *- -0.641*** 077*
0positiveratio 0.0650* 0162*** 0.168*** 0. 147* 09 2" 034
negative ratio -0.144"* -0.216*** -0.261' -Q.282* -0291*** -0.288*** -03090" -0.369* -0451
poSitive ratio 0. 142** 0.211" 0.257" 0.364" 0.435' 0.451 0601919
0.601*0 0.68** 098
negative ratio -0.338*** -0.621"* -0,769"' -0.95*** -1.065' L-501'
000069 positive ratio 0.0969" 0 136** .-14:9 0. 0-- 0.1-----88 0,321" *
i ratio-0.283"*** -0.460*** -0.515"* -0.522** -0,57 *** -0.691*** -0.82** 0. " ,23
positive ratio 0.139** 0.267*** 0282** 0..6 0.256 0.1410
000651 eaieai 16 -02,760,14 021 015
negativeratio -0 36*' -0.227** -0.321' -0.370*** -0.395*** -0.462*** -0.600*** -0.594** -0.939***
002008 positive. ratio 0.42"3*** 0.604** 0513" 0,549** 0.451 0.325 0. . 13
negative ratio -. 635*** -. 971"' -1264*** 1.157 - -0 35 -. 4 " - 0,376
positive ratio 0160" 0.214' 0.262' 0.3*6" 0..4413" 0. 5" -. 462*
601669 -- 022 .36* .4** 0-519**0 0.705"** 1 i1* -- -*
negative ratio -0.463*** -0.694*** -954 1046 * * * .6 , -129*** -l8**-67*-2.093*** -2 99***
positive ratio 0.268'* 0402*** 0465*** 0.446" 0354 01.
002024 . 121 94 0311 0332
negative ratio -0276 -0398*** -0451*** -0.556*** -0.61'7*** -0.761*** -0.813"' -0.946"' -1,44**"
601939 positive ratio 0.132*** 0.166"' 160** 0214** 0.3 "** 4"* 0.562' '6.0
negative ratio -0144*** -0.208*** -0.231" -0.257"'* -0289"' 0, * 0-5 ' 0662*** L 043*
601699 positive ratio 0.328"'* 0.4 1 ]" 0.39 .5*** 0380 ' 0 443 "' 0 . * -0.343*** 0 "*** 0443*
negativo ratio -0.418** -0.621"** -0.801*** -0.930** -1 045"'* *11 -1,424*" 0-.0 " -. 4
601688 positive ratio 0.207' 0.397*** 0462*** 0.563*** 0.544*** 046064* 2862' 0.951 2.5714
negative ratio -0.388*** -0.630*** -0648*** ..602 -820"' -0 0.7 * -. 29"* -. 1.060 0 -2**-.3** 07 75** -129* 1
601618 positive ratio 0.171*** 0.173** 0.235** 0.297 " 0.360*** 0.508 "' 0.5 " 082 2"' 610**
negative ratio -0.0463 -0.114** -066*** -0.196*** -0.2* -0211 0.284** 082*** 0.624
601899 positive ratio 0.180** 0.355*** 0.442*** 0.503"' 0.571 0.640"' 0.599** 0.617' 0.509
negative ratio -0 177** -0.251' -0.308* -0.286 -0.429" .6-587" 1.167"' -1.45"'
000157 positive ratio 0.197"** 0.288*** 0.342*** 0.425*** ..4 " 0.424" 0.508" 0764*** 0.33
negative ratio -0.366*** -0.557*** -0.705*** -0.787** -0801* -0.4 60"' -0.784"' -. 799"** -0,9.33
6 0 1 9 0 1 p o sitiv e ratio 0 .3 3 6 * * * 0 .5 75 * * * 0 .8 17 * * " 0 9 9 7 "' 7 . 8 " 10 "' 2 . 2 1 8 "' - .9 0 7 * * *
negative ratio -0.409"' -0.682"' -0.868*** -0950*** -0.970"' -1.0 *** -1 *4" - 26"' .5976
601898 positive ratio 0.0772" 0.159*" 0.169* 0.168" 0174" 03
negative ratio -0.78*** -0.279*** -0.325*** -0.362*** -0.376" -0.384"' -0.40*" -0.4 0.*224
Figure 17: Positive Ratio and Negative Ratio are Both Significant for 17 stocks
Figure 18: Outliers in Regression Coefficients
On the other hand, their larger coefficients also indicate greater attention from
investors for those stocks. For example, CRRC Corp has been a buzzword in China since
2014, because the Chinese government encourages the export of railway equipment under the
"Belt and Road" policy raised by Chinese President Xi Jinping. Also, CRRC Corp is one of
the largest companies in the Chinese stock market, and has the capacity to attract
millions of individual investors. The mania for CRRC Corp stock made the stock price
increase 400% from Dec 2014 to April 2015. After that, CRRC Corp dived drastically back
to its original 2014 price level. Such drastic changes in price inevitably attract
investors to talk about the stock. Thus the more people talk about it on social media,
the better the posts reflect investor sentiment, and the better the regression
coefficients. Apart from 601766 (CRRC Corp), 000002 (China Vanke) is the No.1 real estate
developer in China and its CEO is a very famous public figure in China. 002008 (Han's
Laser) is a world-leading laser equipment producer and has been regarded as a major
player in the 3D printing industry. 601669 (China Power Construction) is widely regarded
as a beneficiary of the "Belt and Road" policy raised by the Chinese government. All
those stocks have attractive stories that draw people into discussing them on social
media. Compared with these stocks, other stocks are not so attention-grabbing, so their
coefficients are much smaller.
Figure 19: Coefficients of Positive Ratio and Negative Ratio for Stocks
4.2.3 Difference between stocks and indices
The regression results in Figure 20 show that the Shanghai Composite Index has very good
regression results: the coefficients for negative ratio and positive ratio are almost all
significant at the 1% significance level, and the R-squared values for all holding-period
returns are above 20%, much higher than in any regression for individual stocks. An
interesting result is that the coefficients of negative ratio get larger (in absolute
value) as the holding period gets longer. This indicates that future return is very
dependent on the lag of investor sentiment. (To analyze this, I also conduct time series
analysis and build a VAR model for the Shanghai Composite Index.)
Things are very different for the China Stock Index Future. The coefficients of positive
ratio are not significant for any holding-period return, whereas the coefficients of
negative ratio are all significant at the 1% significance level. Also, all the
coefficients are much smaller than those of the Shanghai Composite Index. The failure of
positive ratio to explain the China Stock Index Future, and the smaller coefficients, may
lie in its much lower number of posts per day compared with the Shanghai Composite Index.
Another underlying reason is the relatively small size of China's futures market. In
addition, individual investors are required to deposit 500,000 RMB before they start
trading index futures, a rule that also limits the involvement of individual investors in
the futures market. Therefore, although index futures are largely perceived as good
indicators of stock markets worldwide, investor sentiment on index futures on social
media may not be a good indicator of the stock market, because index futures are not
widely traded among individual investors and usually have a higher entry threshold for
investment.
                     VARIABLES        r1           r2          r3          r4          r5          r7          r10         r15         r30
Shanghai Composite   positive ratio   1.626***     2.967***    3.171***    3.115***    3.415**     3.050***    2.735***    1.103       -0.993
                     negative ratio   -0.536***    -1.065***   -1.413***   -1.713***   -2.030***   -2.741***   -3.589***   -4.748***   -7.429***
                     Constant         -0.0631***   -0.11*      -0.107***   -0.0907**   -0.0925**   -0.0424     0.0114      0.149***    0.383***
                     Observations     487          487         487         487         487         487         487         487         487
                     R-squared        0.22         0.348       0.335       0.314       0.319       0.313       0.31        0.294       0.267
Index Future         positive ratio   0.269        0.415       0.114       -0.0825     -0.326      -0.221      -0.819      -1.498      -1.824
                     negative ratio   -0.393**     -0.846***   -1.181***   -1.324***   -1.610***   -2.288***   --          -5.138***   -8.209***
                     Constant         0.00568      0.0203      0.0520*     0.0701**    0.0970**    0.126***    0.210***    0.331***    0.512***
                     Observations     253          253         253         253         253         253         253         253         253
                     R-squared        0.018        0.035       0.047       0.049       0.058       0.078       0.117       0.187       0.198
Figure 20: Regression Results for Indices
Comparing the regression results of indices and stocks (Figure 21 and Figure 22), we can
see that for negative ratio, the Shanghai Composite Index has the largest coefficients
for almost all returns, which suggests that investor sentiment on the Shanghai Composite
Index is the most correlated with the stock market. However, for positive ratio, the
coefficients of the Shanghai Composite Index are not the largest. Other stocks such as
601901 (Founder Securities) and 601718 (Jihua Group) also have large positive ratio
coefficients in their regressions. As discussed before, many coefficients of positive
ratio are not significant, so their values are not as useful as those of negative ratio.
4.3 Time-series Analysis
As the Shanghai Composite Index's case shows, lagged values of the independent variables
matter greatly. In technical terminology, the regression is called a vector
autoregression (VAR). So I construct a VAR model with lagged variables to better capture
the relationship.
\[ R_{i,t} = \alpha_{i,t} + \beta_{i,0} NegRatio_{i,t} + \beta_{i,1} NegRatio_{i,t-1} + \dots + \gamma_{i,0} PosRatio_{i,t} + \gamma_{i,1} PosRatio_{i,t-1} + \dots + \epsilon_{i,t} \]
Figure 21: Coefficients for Positive Ratio: Indices v.s. Stocks
Figure 22: Coefficients for Negative Ratio: Indices v.s. Stocks
Selection-order criteria                Number of obs. 483
Lag   LL        LR       df   p      AIC       HQIC      SBIC
0     4977.96                        -20.60    -20.59    -20.57
1     5431.98   908.03   9    0.00   -22.44    -22.40    -22.34
2     5482.19   100.41   9    0.00   -22.61    -22.54*   -22.43*
3     5496.17   27.98    9    0.00   -22.63    -22.53    -22.37
4     5510.93   29.52*   9    0.00   -22.66*   -22.53    -22.32
Figure 23: Lag Length Selection
Among all the sample stocks and indices, the Shanghai Composite Index has the largest
significant coefficients and the largest R-squared, so I use it to build the VAR model
and run the Granger causality test.
4.3.1 Lag selection
In order to construct a good VAR model, the first step is to find the optimal lag length
for the return of the Shanghai Composite Index. I use the Akaike Information Criterion
(AIC) to determine the lag length, because for monthly and daily VAR models the AIC tends
to produce the most accurate structural and semi-structural impulse response estimates
for realistic sample sizes (Ivanov and Kilian, 2005).
As Figure 23 shows, AIC reaches its minimum of -22.6581 at lag 4. Therefore, I choose 4
days as the optimal lag length for the VAR model.
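The selection rule is simply "pick the lag with the smallest criterion value". The sketch
below applies it to the rounded values shown in Figure 23; the dictionary layout and the
helper name are hypothetical.

```python
# Information criteria by lag length, as read from Figure 23 (rounded values).
CRITERIA = {
    "AIC":  {0: -20.60, 1: -22.44, 2: -22.61, 3: -22.63, 4: -22.66},
    "HQIC": {0: -20.59, 1: -22.40, 2: -22.54, 3: -22.53, 4: -22.53},
    "SBIC": {0: -20.57, 1: -22.34, 2: -22.43, 3: -22.37, 4: -22.32},
}


def best_lag(values):
    """Lag length that minimizes an information criterion."""
    return min(values, key=values.get)


selected = {name: best_lag(vals) for name, vals in CRITERIA.items()}
```

On these values AIC points to lag 4, while HQIC and SBIC, which penalize extra parameters
more heavily, both point to lag 2; this matches the starred entries in Figure 23.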
4.3.2 VAR Model
Figure 24 shows the details of this VAR model.
4.3.3 Stability test
A VAR model is stable if and only if all roots of its characteristic polynomial lie outside
the unit circle or, equivalently, all eigenvalues of the companion matrix (the reciprocals
of those roots) lie inside the unit circle. Figure 25 shows that all eigenvalues of the
companion matrix lie inside the unit circle, so the VAR model is stable.
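The stability condition can be checked numerically by stacking the lag coefficient matrices
into the companion matrix and inspecting the moduli of its eigenvalues. The sketch below
does this with NumPy. The coefficient matrices are hypothetical two-variable examples chosen
for illustration, since only the return equation of the estimated model is reported here;
`companion_matrix` and `is_stable` are illustrative names.

```python
import numpy as np

def companion_matrix(coef_mats):
    """Stack lag matrices A_1..A_p of a VAR(p) into the (K*p x K*p) companion matrix."""
    p = len(coef_mats)
    k = coef_mats[0].shape[0]
    top = np.hstack(coef_mats)                      # [A_1 A_2 ... A_p]
    if p == 1:
        return top
    # Identity block shifts each lagged state vector down by one position.
    bottom = np.hstack([np.eye(k * (p - 1)), np.zeros((k * (p - 1), k))])
    return np.vstack([top, bottom])

def is_stable(coef_mats):
    """True when every eigenvalue of the companion matrix has modulus below one."""
    eigvals = np.linalg.eigvals(companion_matrix(coef_mats))
    return bool(np.all(np.abs(eigvals) < 1.0))

# Hypothetical VAR(2) coefficients (not estimates from this study).
a1 = np.array([[0.5, 0.1], [0.0, 0.3]])
a2 = np.array([[0.2, 0.0], [0.1, 0.1]])
stable_example = is_stable([a1, a2])

# Hypothetical VAR(1) with an explosive own-lag coefficient of 1.1.
unstable_example = is_stable([np.array([[1.1, 0.0], [0.0, 0.5]])])
```

For the estimated model, the same check would be applied to the four 3x3 lag matrices of
the full system rather than to these illustrative values.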
Vector Autoregression Results
Equation          Parms   RMSE   R-sq   chi2      P>chi2   No. of Obs.
ri                13      0.02   0.07   35.63     0.00     483
positive_ratio    13      0.00   0.30   205.51    0.00     483
negative_ratio    13      0.00   0.77   1644.86   0.00     483

Equation: ri
Variable              Coef.   Std. Err.   z       P>|z|
ri L1.                0.13    0.06        2.27    0.02
ri L2.                -0.11   0.06        -1.82   0.07
ri L3.                -0.01   0.06        -0.22   0.83
ri L4.                0.18    0.05        3.41    0.00
positive_ratio L1.    0.17    0.32        0.54    0.59
positive_ratio L2.    0.00    0.34        0.00    1.00
positive_ratio L3.    -0.33   0.34        -0.98   0.33
positive_ratio L4.    -0.30   0.31        -1.00   0.32
negative_ratio L1.    -0.09   0.24        -0.38   0.70
negative_ratio L2.    0.17    0.27        0.66    0.51
negative_ratio L3.    -0.27   0.26        -1.01   0.32
negative_ratio L4.    0.08    0.23        0.34    0.74
_cons                 0.03    0.02        1.42    0.16
Figure 24: VAR Model
Figure 25: Unit Root Check (roots of the companion matrix)
Granger causality Wald tests
Equation          Excluded          chi2    df   Prob>chi2
ri                positive_ratio    3.56    4    0.47
ri                negative_ratio    1.98    4    0.74
ri                ALL               5.68    8    0.68
positive_ratio    ri                24.05   4    0.00
positive_ratio    negative_ratio    7.38    4    0.12
positive_ratio    ALL               39.66   8    0.00
negative_ratio    ri                36.53   4    0.00
negative_ratio    positive_ratio    15.78   4    0.00
negative_ratio    ALL               58.07   8    0.00
Figure 26: Granger Causality Test
4.3.4 Granger causality analysis
Figure 26 shows the results of the Granger causality test. In the return equation, the
p-values for excluding the positive ratio (0.47) and the negative ratio (0.74) are well
above conventional significance levels, so the null hypothesis that the sentiment ratios
do not Granger-cause returns cannot be rejected. In other words, we cannot assert that the
positive and negative sentiment ratios for the Shanghai Composite Index are predictive of
the market return. Although the Shanghai Composite Index has highly significant linear
regressions on the social media sentiment ratios, it remains challenging to use past
sentiment ratios to predict future returns.
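The p-values in Figure 26 follow from the Wald statistics and their degrees of freedom. As
a worked check, the sketch below uses the closed-form upper tail of the chi-square
distribution, which is available when the degrees of freedom are even (as they are here,
4 and 8). The helper name `chi2_sf_even_df` and the variable labels are illustrative; the
statistics are copied from a subset of the rows in Figure 26.

```python
import math

def chi2_sf_even_df(stat, df):
    """Upper-tail probability P(X > stat) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = stat / 2.0
    # For df = 2m: P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * series

# (equation, excluded variable, chi2, df) rows copied from Figure 26.
wald_rows = [
    ("ri", "positive_ratio", 3.56, 4),
    ("ri", "negative_ratio", 1.98, 4),
    ("ri", "ALL", 5.68, 8),
    ("positive_ratio", "ri", 24.05, 4),
    ("negative_ratio", "ri", 36.53, 4),
]

p_values = {(eq, ex): chi2_sf_even_df(stat, df) for eq, ex, stat, df in wald_rows}
rejected_at_5pct = [key for key, p in p_values.items() if p < 0.05]
```

In this subset, none of the tests in the return equation falls below the 5% level, whereas
the two rows in which the return is the excluded variable do, consistent with the 0.00
entries reported for those rows in Figure 26.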
5 Conclusion
Since the 1980s, behavioral finance has posited that investors are driven by sentiment, and
this assumption has been supported by many studies of traditional media, such as newspapers,
journals and annual reports. However, sentiment extracted from traditional media may
represent the opinions of only a small fraction of investors. In the era of the internet
and big data, everyone is connected by social media, which has become an important
confluence where investors' sentiments are gathered and maintained. Many researchers are
therefore investigating how to extract sentiment from crowds of people to better predict
the financial market. In this spirit, I study the relationship between social media and the
Chinese stock market, given that little of the literature covers Chinese social media and
the Chinese stock market. I choose guba.com.cn, one of the most popular stock-related
social media sites in China, for this study, and download 3,734,426 posts from 2008 to 2015
under the sections of 30 sample securities, including the Shanghai Composite Index, the
China Stock Index Future and 28 individual stocks. I conduct textual analysis of those
posts and extract sentiment ratios. I then analyze the correlation matrices and regressions
between sentiment ratios and returns over 9 holding periods for all 30 sample securities,
and arrive at the following findings.
1. The negative sentiment ratio is superior to the positive sentiment ratio. I have found
that many researchers in the Chinese literature use both sentiment measures, while most of
the English literature uses the negative ratio. Language differences could be one reason
for this difference. I therefore deliberately compare the two kinds of sentiment ratios and
find that the negative ratio is superior. First, the correlation coefficients of the
positive ratio with the returns are all smaller than those of the negative ratio for all 30
samples. The same holds for the regression coefficients. In addition, more than half of the
coefficients of the positive ratio are not statistically significant. I therefore conclude
that the negative sentiment ratio is better than the positive sentiment ratio, especially
when dictionary-based textual analysis is used.
2. The correlation of the sentiment ratio with returns is persistent over future holding
periods. I compute the correlation coefficients and regression coefficients of the
sentiment ratios for returns over 9 different holding periods. There is no sign that the
correlation decreases as the holding period becomes longer. On the contrary, in the case of
the Shanghai Composite Index, the coefficients of the negative ratio become larger in
absolute value as the holding period increases from 1 day to 30 days. For the other
samples, the absolute values of the coefficients change irregularly and no clear pattern
can be found. One thing is clear, however: the correlation does not decrease as the holding
period gets longer. This result indicates that lagged terms may help predict stock returns;
however, since no sign of decreasing correlation is identified, finding the optimal lag
length could be difficult.
3. A well-established market index has better correlation with social media than individual
stocks, and well-known 'star' stocks have better correlation with social media than other
stocks. In this study, the Shanghai Composite Index has the best correlation results: the
largest coefficients of all samples, all significant at the 1% significance level, and
R-squared values above 20% in all 9 regressions. One reason is that the discussion about
the Shanghai Composite Index is the most active: 1,554 posts are submitted every day.
Although individual investors may each focus on only a few individual stocks, they all care
about the market index. That is why the Shanghai Composite Index has the most frequent
postings and the best correlation with social media. Among individual stocks, 601766 (CRRC
Corp), 000002 (China Vanke), 002008 (Han's Laser) and 601669 (China Power Construction)
have relatively better correlation with social media sentiment ratios, because they are all
popular stocks among individual investors and have more posts per day than other stocks.
4. Better data and improved analysis are needed to predict the stock market with social
media. I test the VAR model on the Shanghai Composite Index and find that the model is
stable but shows no Granger causality from the sentiment ratios to returns. I think the
reasons lie in the quality of the data and the textual analysis method. The quality of the
data may not be good enough, especially because many posts on Guba are rumors and unfounded
speculation (Guba is regarded by many investors as full of rumors), so the information they
carry may be distorted. One remedy could be to improve the textual analysis method, for
example by adapting the dictionary to financial markets or improving the parsing tools.
All in all, this study confirms that correlation exists between investor sentiment on
social media and the returns of the Chinese stock market, and that the negative sentiment
ratio is a better indicator of this correlation. However, in order to predict the stock
market, the quantity and quality of content on social media matter greatly: the more
content there is, the larger the correlation. As social media attracts more and more users
to generate content, the explanatory and predictive power of social media will be greater
in the future.
References
[1] Charles, Amélie and Olivier Darné. 2009. "The Random Walk Hypothesis for Chinese Stock
Markets: Evidence from Variance Ratio Tests." Economic Systems 33(2):117-26.
[2] Baker, H. K. 2010. "Individual Investor Trading." Pp. 1-26 in Behavioral Finance:
Investors, Corporations, and Markets.
[3] Baker, Malcolm and Jeffrey Wurgler. 2007. "Investor Sentiment in the Stock Market."
Journal of Economic Perspectives 21(2):129-52.
[4] Biemann, Chris. 2006. "Chinese Whispers - an Efficient Graph Clustering Algorithm
and Its Application to Natural Language Processing Problems." In Proceedings of
Workshop on TextGraphs, at HLT-NAACL 2006, New York 73-80.
[5] Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. "Twitter Mood Predicts the Stock
Market." Journal of Computational Science 2(1):1-8.
[6] De Bondt, Werner F. M. 1998. "A Portrait of the Individual Investor." European
Economic Review 42(3-5):831-44.
[7] Chen, Hailiang, Prabuddha De, Yu (Jeffrey) Hu, and Byoung-Hyoun Hwang. 2014. "Wisdom
of Crowds: The Value of Stock Opinions Transmitted Through Social Media." Review of
Financial Studies 27(5):hhu001.
[8] Chen, Hailiang, Prabuddha De, Yu Hu, and Byoung Hyoun Hwang. 2011. "Sentiment
Revealed in Social Media and Its Effect on the Stock Market." IEEE Workshop on
Statistical Signal Processing Proceedings 25-28.
[9] Engelberg, Joseph E. and Christopher A. Parsons. 2011. "The Causal Impact of Media
in Financial Markets." The Journal of Finance 66(1):67-97.
[10] Epstein, Marc J. and Martin Freedman. 1994. "Social Disclosure and the Individual
Investor." Accounting, Auditing & Accountability Journal 7(4):94-109.
[11] Gilbert, Eric and Karrie Karahalios. 2009. "Predicting Tie Strength with Social
Media." CHI 2009.
[12] Gilbert, Eric and Karrie Karahalios. 2010. "Widespread Worry and the Stock Market."
Proceedings of the 4th International AAAI Conference on Weblogs and Social Media
58-65.
[13] Tetlock, Paul C. 2015. "Giving Content to Investor Sentiment: The Role of Media in
the Stock Market." Journal of Finance 62(3):1139-68.
[14] Karabulut, Yigitcan. 2013. "Can Facebook Predict Stock Market Activity?" American
Finance Association 2013 Meetings 49(0):60.
[15] Kr, Roman. 2015. "Media, Sentiment and Market Performance in the Long Run."
(July).
[16] Loughran, Tim and Bill McDonald. 2010. "When Is a Liability Not a Liability?
Textual Analysis, Dictionaries, and 10-Ks." Journal of Finance, Forthcoming LXVI(1):46.
[17] Maertens, Annemie, A. V. Chari, and David R. Just. 2014. "Why Farmers Sometimes
Love Risks: Evidence from India." Economic Development and Cultural Change
62(2):239-74.
[18] Malkiel, Burton G. 2007. "The Efficiency of the Chinese Stock Markets." (154).
[19] Li, Feng. 2006. "Annual Report Readability, Current Earnings, and Earnings
Persistence." Social Sciences (September).
[20] Sheng-, Zhou and S. H. I. Xun-. 2013. "Stock Market Time-Series Prediction Based
on Weibo Search and SVM." 22-26.
[21] Tetlock, Paul C., Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. "More than
Words: Quantifying Language to Measure Firms' Fundamentals." Journal of Finance
63(3):1437-67.
[22] Weston, Jason et al. 2011. "Natural Language Processing (Almost) from Scratch."
Journal of Machine Learning Research 12:2461-2505.
[23] Yang, Sy, Syk Mo, and Xiaodi Zhu. 2013. "An Empirical Study of the Financial
Community Network on Twitter."
[24] Zhou, Xiaolin, Zheng Ye, Him Cheung, and Hsuan-Chih Chen. 2009. "Processing
the Chinese Language: An Introduction." Language and Cognitive Processes
24(7-8):929-46.
[25] Seasholes, Mark S. and Ning Zhu. 2010. "Individual Investors and Local Bias."
Journal of Finance 65(5):1987-2010.
[26] Li, Feng. 2008. "The Determinants and Information Content of the Forward-Looking
Statements in Corporate Filings - A Naïve Bayesian Machine Learning Approach."
Journal of Accounting Research 1001.
[27] Wong, Kam-Fai, Wenji Li, Ruifeng Xu, and Zheng-sheng Zhang. Introduction to Chinese
Natural Language Processing. Morgan & Claypool Publishers.
[28] Xu, Hongzhi, Kai Zhao, Likun Qiu, and Changjian Hu. 2010. "Expanding Chinese
Sentiment Dictionaries from Large Scale Unlabeled Corpus." Proceedings of the
PACLIC 24 301-10.
[29] Ivanov, Ventzislav and Lutz Kilian. 2005. "A Practitioner's Guide to Lag Order
Selection For VAR Impulse Response Analysis." Studies in Nonlinear Dynamics &
Econometrics 9(1):1-36.