

Individual Investors, Social Media and Chinese Stock Market: a Correlation Study

By

Yonghui Wu

B.E., Shanghai Jiao Tong University, 2007
M.E., Shanghai Jiao Tong University, 2010

SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE IN MANAGEMENT STUDIES
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2016

© 2016 Yonghui Wu. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Signature redacted
MIT Sloan School of Management
May 6, 2016

Certified by: Signature redacted
Erik Brynjolfsson
Schussel Family Professor
Thesis Supervisor

Accepted by: Signature redacted
Rodrigo S. Verdi
Associate Professor of Accounting
Program Director, M.S. in Management Studies Program
MIT Sloan School of Management

Individual Investors, Social Media and Chinese Stock Market: a Correlation Study

By

Yonghui Wu

Submitted to MIT Sloan School of Management on May 6, 2016 in partial fulfillment of the requirements for the Degree of Master of Science in Management Studies.

ABSTRACT

The Chinese stock market is a unique financial market with heavy involvement of individual investors. This article explores how the sentiment expressed on social media is correlated with the stock market in China. Textual analysis of posts from one of the most popular social media platforms in China is conducted based on Hownet and NTUSD, the two most commonly used Chinese sentiment dictionaries.

The correlation matrices and regressions between sentiment ratios and returns over 9 holding periods for all 30 sample securities reveal that correlation exists between investor sentiment on social media and the future returns of the Chinese stock market. In addition, I find that the negative sentiment ratio is superior to the positive sentiment ratio, and that the correlation of sentiment ratio to return is persistent in future holding periods. Also, by comparing different stocks and indices, I find that a well-established market index has better correlation with social media sentiment than individual stocks, and that well-known 'star' stocks have better correlation with social media than other stocks. However, I test the VAR model on the Shanghai Composite Index and find that the model is stable but shows no Granger causality. Better data and improved analysis are needed to predict the stock market with social media.

Thesis Supervisor: Erik Brynjolfsson
Title: Schussel Family Professor


Acknowledgements

I feel grateful and privileged to have worked with my thesis advisor, Professor Erik Brynjolfsson. I would like to thank him for his guidance in helping me navigate the thesis process, and for his prompt feedback and suggestions regarding the direction and resources of this study.

I would also like to thank Professor Marshall Van Alstyne and my fellow students for their valuable comments and encouragement on this research in the class Economics of Digitalization.

This study was very new and challenging for me because I had little prior experience in programming. This thesis would not have been possible without the help of my friend Lerith Tian. Lerith has helped me tremendously with Python programming and textual analysis. I am very grateful for his help and learned a lot from his patient guidance.

I also benefited a lot from my other friends. Shuyi Yu provided me with many valuable suggestions on statistical analysis. Shan Huang helped me narrow down the research scope at the very beginning. Alora Chen, Jin Jing Liu and Liam O'Dea greatly supported me during my preparation for this thesis. I am indebted to these dear friends of mine.

Last but not least, I would like to thank my parents, Shunfeng Wu and Ganying Deng, as well as my sister, Yonghong Wu. Thank you for always believing in me and standing behind all my endeavors.


Contents

1 Introduction
    1.1 Literature review on investor sentiment and the stock market
    1.2 Social media and stock markets in China
    1.3 Literature review on Chinese NLP
    1.4 Summary

2 Data
    2.1 Guba
        2.1.1 Guba as a social media in China
        2.1.2 Posts and samples
    2.2 Financial data

3 Method
    3.1 Dictionaries and word list
    3.2 Segmenting and parsing the posts
    3.3 Quantifying the positive and negative sentiment
    3.4 Regression methods

4 Results
    4.1 Correlation Analysis
        4.1.1 Positive ratio vs. negative ratio
        4.1.2 Different holding periods and securities
    4.2 Regression Analysis
        4.2.1 Positive ratio vs. negative ratio
        4.2.2 Difference between stocks
        4.2.3 Difference between stocks and indices
    4.3 Time-series Analysis
        4.3.1 Lag selection
        4.3.2 VAR Model
        4.3.3 Stability test
        4.3.4 Granger causality analysis

5 Conclusion

List of Figures

1 Social network penetration in China from 2012 to 2018
2 Domestic market capitalization of stock exchanges in the world in 2014
3 Accumulated Number of Individual A Share Accounts in China
4 Trading Volume of Different Investor Types in China (2011, 2012)
5 Timespan and Posts Under the Selected 30 Sections
6 Company Information of the 28 Selected Stocks
7 Summary of Method
8 List of Positive & Negative Words for Stock Market
9 Denotations of Returns
10 Correlation of Negative Ratio and Returns of Sample Stocks and Indices
11 Correlation of Positive Ratio to Returns of Sample Stocks and Indices
12 Average of Correlation Coefficients for Positive and Negative Ratio
13 Negative Correlations with Different Returns for Sample Securities
14 Positive Correlation with Different Holding Period Return
15 Sample Securities Ranked by Posts per Day
16 Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks
17 Positive Ratio and Negative Ratio are Both Significant for 17 Stocks
18 Outliers in Regression Coefficients
19 Coefficients of Positive Ratio and Negative Ratio for Stocks
20 Regression Results for Indices
21 Coefficients for Positive Ratio: Indices vs. Stocks
22 Coefficients for Negative Ratio: Indices vs. Stocks
23 Lag Length Selection
24 VAR Model
25 Unit Root Check
26 Granger Causality Test

1 Introduction

1.1 Literature review on investor sentiment and the stock market

Behavioral science tells us that emotions can influence people's decisions. In finance, many researchers in behavioral finance have found that stock performance is affected by investor behavior and sentiment. In the standard finance model, unemotional investors always force capital market prices to equal the rational present value of expected future cash flows; behavioral finance has grown substantially in the past decade to augment this standard model. The first dimension of behavioral finance concerns behavior patterns. Many behavior patterns and biases have been discovered. For example, M. Seasholes and N. Zhu (2010) find that individuals tilt their portfolios towards locally headquartered firms, and that this local bias does not bring them superior returns. J. Engelberg and C. Parsons (2011) identify a causal effect of local media on trading behavior: all else equal, local press coverage increases the daily trading volume of local retail investors. The second dimension of behavioral finance is related to sentiment. One of the most important assumptions in behavioral finance is that investors are subject to sentiment. Investor sentiment, defined broadly, is a belief about future cash flows and investment risks that is not justified by the facts at hand (Baker and Wurgler, 2007). The question is no longer whether investor sentiment affects stock prices, but rather how to measure investor sentiment and quantify its effects. Many measures have been developed, such as investor surveys (Qiu and Welch, 2004), investor mood (Kamstra, Kramer and Levi, 2003), retail investor trades (Barber, Odean, and Zhu, 2003), IPO first-day returns, and option-implied volatility (the Market Volatility Index, or VIX, which measures the implied volatility of options on the Standard and Poor's 100 stock index). However, those measures are only proxies for investor sentiment, not direct measurements. In addition, data availability narrows these options considerably, because data for some measures is costly and sometimes subjective.

A direct way to measure investor sentiment is to quantify language in the media. By quantifying language, researchers can examine and judge the directional impact of a limitless variety of events. Tetlock (2008) analyzes the negative words in the Wall Street Journal and concludes that negative words in firm-specific stories leading up to earnings announcements contribute significantly to a useful measure of firms' fundamentals. Feng Li (2009) uses a naive Bayesian algorithm to detect tone in Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A) and finds that the tone of the forward-looking statements is positively correlated with future performance and has explanatory power incremental to other variables. Loughran and McDonald (2010) compare the widely used Harvard-IV-4 TagNeg (H4N) file with five other word lists. Those studies have explored the quantification of news in journals and of public information released by companies.

As the internet begins to play a major role in businesses and people's everyday lives, social media has become one of the most important venues where individual investors share their opinions on financial securities. The content on social media, diverse in quality, huge in quantity and different from traditional media, provides a direct source of data for measuring investor sentiment. A growing body of literature has addressed the relationship between user-generated content on social media and the stock market in the United States. Bollen, Mao and Zeng (2011) find that the mood expressed on Twitter among financial investors can predict daily stock returns. Karabulut (2013) finds that the National Happiness Index issued by Facebook is correlated with daily stock returns and trading volume. Gilbert and Karahalios construct an Anxiety Index based on Twitter to predict the stock market. Chen and De (2014) conduct research on Seeking Alpha and find evidence that views expressed on Seeking Alpha can predict future stock returns. With its ever-growing amount of user-generated content, social media has become an important confluence of investment sentiment and has exerted its influence on financial markets.

However, little English-language literature addresses the correlation between social media and the stock market in countries other than the United States. In this study, I explore the correlation between social media and the stock market in China.

Figure 1: Social network penetration in China from 2012 to 2018

1.2 Social media and stock markets in China

The past decade has witnessed the boom of social media in China. In addition to having the world's biggest Internet user base (513 million people, more than double the 245 million users in the United States), China also has the world's most active environment for social media. More than 300 million people use it, from blogs to social-networking sites to microblogs and other online communities. That is roughly equivalent to the combined population of France, Germany, Italy, Spain, and the United Kingdom. In addition, China's online users spend more than 40 percent of their time online on social media, a figure that continues to rise rapidly. Statistics show that Chinese people spend an average of 3 hours daily on social media, 1 hour more than on TV. WeChat, the popular application that combines the best of Facebook and WhatsApp, had 600 million monthly active users as of June 2015. With such a high penetration of social media in China, how social media has affected the stock market becomes an interesting question.

Figure 2: Domestic market capitalization of stock exchanges in the world in 2014 (in billion U.S. dollars)

In addition, the Chinese stock markets have been growing very fast in the past two decades, though they are still young by global standards. Taken together, China's two stock markets ranked second in the world in market capitalization in 2014, behind the New York Stock Exchange. As of December 31, 2015, the total number of A share accounts had reached over 200 million. Trading in the Chinese stock markets is also very active compared with many other stock markets, which makes them among the most dynamic stock markets in the world. Volatility is high as well. For example, the Chinese stock market suffered a major crash in 2015. The crash began with the popping of the stock market bubble on 12 June 2015. A third of the value of A-shares on the Shanghai Stock Exchange was lost within one month of the event. Against such a dynamic and volatile backdrop, studying the effects of investor sentiment on Chinese stock markets has the potential to help us understand this market better. For example, what role does investor sentiment play in this volatile market? Do the findings on investor sentiment in the United States apply to Chinese stock markets?

Figure 3: Accumulated Number of Individual A Share Accounts in China from 2001 to 2015

Before answering those questions, the information and market characteristics of the Chinese stock markets need to be considered. According to the efficient market hypothesis (EMH) of Fama (1970), efficient stock markets accurately reflect all available information at all times. Weak-form efficiency implies that current prices reflect all historical price information, and the semi-strong form implies that all public information is fully reflected in market prices. In the strongest form of the theory, even private (insider) information is already incorporated into market prices. Much research has been conducted to identify which form the Chinese stock market belongs to, and the results are mixed. Amelie and Olivier found that B shares on Chinese stock exchanges do not follow the random walk hypothesis and are therefore significantly inefficient, whereas A shares are more efficient. Nisar adopted three methods to test the random walk theory and concluded that the Chinese stock market is inefficient. Malkiel argued that the Chinese stock market was broad-form efficient because semi-strong and weak forms both existed in it. As many economists suggest, a well-functioning financial system should be supported by a strong legal system and by proper corporate governance, but China has neither. This fact may partially explain the mixed results in testing the EMH. Moreover, behavioral finance also sheds new light on resolving the contradictions in applying the EMH to the Chinese market.

One feature that sets the Chinese stock market apart from other markets in the world is that individual investors are very active in China. Individual investors, who hold merely 26% of total market capitalization, account for 78% of daily trading volume. China's approximately 200 million retail investors trade more often than investors in any other country: 81 percent of individual investors trade at least once a month, compared with 53 percent in the U.S., according to a survey by State Street in 2015. Moreover, individual investors in China are not as well educated as those in other countries: as a survey by Bloomberg shows, more than two-thirds of the most recent batch of new investors did not even graduate from high school. Much literature has found that individual investors are prone to many biases. De Bondt (1998) portrays individual investors as people who discover naive patterns in past price movements, share popular models of value, and trade in suboptimal ways. Subsequent researchers have discovered many biases that individual investors suffer from, such as overconfidence, availability, framing and accounting biases. Barber and Odean (2008) found that individual investors are prone to invest in attention-grabbing targets because of their limited energy and time to search for investments. With all those portraits and biases documented in the literature, it is natural to wonder whether those biases are present in investors' social media postings. The large base of individual investors and the heavy involvement of individuals in daily stock trading make the Chinese stock market a unique place to observe how social media has affected people's investment behavior.

Figure 4: Trading Volume of Different Investor Types in China (2011, 2012)

1.3 Literature review on Chinese NLP

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. Human languages, usually referred to as natural languages, are dynamic sets of symbols and corresponding rules for communication. NLP can be considered a technique for realizing linguistic theory in real-world applications, such as online content analysis and machine learning, and it has grown very fast as a component of artificial intelligence (AI) since its inception in the 1950s. In this regard, Chinese NLP is no exception. However, the lack of clear delimiters between words in Chinese sets Chinese NLP apart from NLP for Western languages. Unlike English text, in which words are delimited by white space, Chinese text represents sentences as strings of Chinese characters (hanzi) without similar natural delimiters between them. For this reason, automatic word segmentation, the major step in Chinese morphological analysis, lays the foundation of any modern Chinese information system (Wong, Li, Xu, Zhang, 2010).

The problem of Chinese word segmentation has been studied by researchers for many years. Several different algorithms have been proposed, which, generally speaking, can be classified into character-based approaches and word-based approaches (Wong, Li, Xu, Zhang, 2010). The character-based approach is mainly used to process classical Chinese texts. It is simple and easy to use, which brings the further advantages of reduced costs and minimal overhead in the indexing and querying process. On the other hand, word-based approaches are gaining popularity because of the increasing computational power of computers. Word-based approaches, as the name implies, attempt to extract complete words from sentences. They can be further categorized as statistics-based, dictionary-based, comprehension-based, and learning-based approaches. Since much of the English-language literature on social media and stock markets uses lexicons and word lists, in this study I follow suit and choose a dictionary-based approach to analyze online content on Chinese social media.
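To make the idea of dictionary-based segmentation concrete, the sketch below implements forward maximum matching, one of the simplest dictionary-based methods. It is an illustration only and is not the segmenter used in this study; the function name, the `max_len` parameter and the toy lexicon are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, taking the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would load a full dictionary.
toy_lexicon = {"今天", "股票", "大涨", "市场"}
print(forward_max_match("今天股票大涨", toy_lexicon))  # ['今天', '股票', '大涨']
```

Greedy matching of this kind is easy to follow but can mis-segment ambiguous strings, which is one reason practical segmenters combine a dictionary with statistical models.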


1.4 Summary

One major assumption of behavioral finance is that investors are affected by sentiment, and this has been supported by many studies in past decades. Now, in the era of social media, a burgeoning literature has begun to address the effects of social media on the stock market. Researchers have investigated the explanatory power of social media in the United States, such as Twitter, Facebook and Seeking Alpha, and positive evidence has been found. Since China is now the largest market for social media and has the second largest stock markets in the world, the effects of social media on Chinese stock markets become an interesting question. In addition, the development of Chinese NLP techniques has provided ample lexicons and tools for textual analysis in Chinese. Therefore, in this study I explore the correlation between social media and stock markets in China.

Specifically, I explore whether a correlation between social media and the Chinese stock market exists, and how the correlation of the negative sentiment ratio differs from that of the positive sentiment ratio. I also compare the correlations for different stocks and market indices to see which "tags" or topics connect social media most closely to the Chinese stock market. Following prior literature, I also test the correlation of social media sentiment with returns over different future periods to see how the correlation persists.


2 Data

This study uses post data from Guba, a popular investment-themed social media site in China. Financial price data are downloaded from iFind. The sample period varies across securities because of data availability on Guba.

2.1 Guba

2.1.1 Guba as a social media in China

Guba (http://guba.com.cn/) is the most popular investment-related online community in China. It is part of Eastmoney.com, the largest online financial media site in China. In the Alexa ranking system, Eastmoney ranks 772 globally, well above Seeking Alpha (1,480). According to iResearch, during January 2015 the number of daily unique visitors to Eastmoney.com was 15,210,00, or 5.9% of all netizens in China. Guba's mobile application is also widely downloaded among smartphone users. In addition, the weekly effective viewing duration on Eastmoney during January 2015 reached 21,220,000 hours, which indicates strong user loyalty. It is quite easy to post on Guba, which does not even require registration before posting. This ease of posting increases the popularity of Guba among individual investors, especially less educated investors.

Guba is a topic-based, forum-style social media site. Each topic is named after one stock or index, so posts are naturally categorized under different stocks or indices of the stock market. There are over 2200 topics, covering almost every individual stock and market index. In this study, I download all the posts under the selected stocks and indices, and all the posts under one stock or index are analyzed to extract the investor sentiment for that stock or index. Here I use the stock name to denote the discussions under one specific topic.


2.1.2 Posts and samples

I download posts under 30 sections in total. The sections and post information are shown in Figure 5. Among the 30 sections, two are market indices: the Shanghai Composite Index and China Stock Index Futures. Both are typical indices for the overall stock market in China. The China Stock Index Futures are known to be very responsive to market information and more liquid than equities.

I also download posts for 28 individual stocks. My selection criteria for these stocks are (1) representativeness of the A share market in China; (2) relative popularity among all sections; and (3) heterogeneity. All of these stocks are components of the CSI 300, a capitalization-weighted stock market index designed to replicate the performance of 300 stocks traded on the Shanghai and Shenzhen stock exchanges, so they are very representative of Chinese stock markets. They are also of different sizes and from different industries. Figure 6 shows additional information on those stocks. The 28 stocks comprise a mixture of industries, such as media, financial services, retail, construction and transportation. Market capitalization varies from 19,205 million CNY to 1,124,975 million CNY. The PE ratio also varies a lot, from a low of 6 to a high of 188. Companies like China Vanke and OCT Group have been listed since the 1990s, while some companies only became public in recent years. In this regard, the companies are very heterogeneous.

The time spans of the posts downloaded in this study vary across sections. One reason is that the companies were listed at different times, and posts about a company's shares are only possible after its listing date. Another reason is technical: for very popular sections such as Shanghai Composite, a huge amount of data exists, and some data may have been lost on the server or in the process of downloading. To ensure data quality, I discard periods that contain consecutive blank sub-periods.
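One way such a filter could be implemented is sketched below: given the dates on which posts exist for a section, keep the longest span that contains no gap longer than a chosen threshold. The exact rule used in this study is not specified, so the function name and the `max_gap_days` threshold are assumptions made for illustration.

```python
from datetime import date


def longest_gap_free_span(post_dates, max_gap_days=3):
    """Return (start, end) of the longest run of dates with no gap above max_gap_days."""
    days = sorted(set(post_dates))
    if not days:
        return None
    best_start = best_end = run_start = days[0]
    prev = days[0]
    for day in days[1:]:
        # A gap longer than the threshold starts a new candidate run.
        if (day - prev).days > max_gap_days:
            run_start = day
        # Keep the run that covers the longest calendar span so far.
        if (day - run_start) > (best_end - best_start):
            best_start, best_end = run_start, day
        prev = day
    return best_start, best_end


# Hypothetical posting dates with a long blank period in the middle.
sample_dates = [date(2015, 1, d) for d in (1, 2, 3, 20, 21, 22, 23)]
print(longest_gap_free_span(sample_dates))
```

The threshold trades sample length against data quality: a smaller `max_gap_days` discards more data but leaves fewer days without any posts.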

Category  Ticker  English Name             Beginning Date  Ending Date  Total Posts
Indices   000001  Shanghai Composite       11/18/2013      11/15/2015     1,129,503
Indices   000300  China Stock Future       11/6/2014       11/18/2015        85,094
Stocks    000002  China Vanke              11/10/2013      12/10/2015        76,914
Stocks    000069  OCT Group                7/18/2008       12/28/2015        68,198
Stocks    000156  WASU Media               10/19/2012      12/31/2015        28,099
Stocks    000157  Zoomlion                 10/9/2008       12/31/2015       126,625
Stocks    000333  Midea Group              9/19/2013       12/31/2015        44,541
Stocks    000338  Weichai Power            6/20/2008       12/31/2015       115,473
Stocks    000651  Gree Electrical          9/26/2008       12/31/2015        44,821
Stocks    000712  Golden Dragon            12/23/2008      12/30/2015         6,455
Stocks    000725  BOE Tech                 4/3/2009        12/31/2015        63,236
Stocks    000728  Guoyuan Securities       7/1/2008        12/31/2015        75,854
Stocks    002008  Han's Laser              2/?/2013        12/29/2015        51,425
Stocks    002024  Suning                   4/25/2013       12/31/2015        72,274
Stocks    600038  Avicopter                7/17/2008       12/8/2015         15,441
Stocks    601618  MCC                      9/10/2009       12/31/2015        62,570
Stocks    601669  Power Construction       9/25/2011       12/31/2015       132,530
Stocks    601688  Huatai Securities        2/9/2010        12/31/2015        53,592
Stocks    601699  Lu'an Envir. Energy      7/8/2008        12/31/2015        48,943
Stocks    601718  Jihua Group              8/24/2010       12/23/2015        16,791
Stocks    601766  CRRC Corp.               7/31/2008       12/31/2015       556,721
Stocks    601872  CMES                     6/2/2008        12/31/2015        75,511
Stocks    601888  CITS                     9/22/2009       12/31/2015        30,882
Stocks    601898  China Coal Energy        6/1/2008        12/31/2015        83,052
Stocks    601899  Zijin Mining             6/6/2008        12/31/2015       157,243
Stocks    601901  Founder Securities       7/31/2011       12/31/2015       147,603
Stocks    601919  COSCO                    6/2/2008        12/31/2015       124,354
Stocks    601928  Phoenix Publ. & Media    11/17/2011      12/31/2015        83,290
Stocks    601939  China Construction Bank  6/6/2008        12/31/2015       101,991
Stocks    603993  China Molybdenum         9/10/2012       12/31/2015        55,400
Total                                                                     3,734,426

Figure 5: Timespan and Posts Under the Selected 30 Sections

Market value (millions of CNY) and PE are as of Dec 31, 2015.

Ticker     Name                     Market Value   PE   Date of Listing  Industry
000002.SZ  China Vanke                   263,094   17   1991-01-29       Real Estate
000069.SZ  OCT Group                      64,715   14   1997-09-10       Real Estate
000156.SZ  WASU Media                     47,014  123   2000-09-06       Media & Entertainment
000157.SZ  Zoomlion                       36,937   69   2000-10-12       Heavy Machinery
000333.SZ  Midea Group                   140,001   13   2013-09-18       Electrical Manufacturing
000338.SZ  Weichai Power                  36,225    8   2007-04-30       Automobile
000651.SZ  Gree Electrical               134,452    9   1996-11-18       Electrical Manufacturing
000712.SZ  Golden Dragon                  26,092   67   1997-04-15       Financial Service
000728.SZ  Guoyuan Securities             44,369   32   1997-06-16       Financial Service
000725.SZ  BOE Tech                      103,151   41   2001-01-12       Electrical Manufacturing
002008.SZ  Han's Laser                    27,339   39   2004-06-25       High-tech Manufacturing
002024.SZ  Suning                         99,302  115   2004-07-21       Retail
600038.SH  Avicopter                      31,083   94   2000-12-18       Aircraft
601618.SH  MCC                           103,387   29   2009-09-21       Construction
601669.SH  Power Construction            110,450   23   2011-10-18       Construction
601688.SH  Huatai Securities             133,389   31   2010-02-26       Financial Service
601699.SH  Lu'an Envir. Energy            19,205   20   2006-09-22       Mining
601718.SH  Jihua Group                    44,240   38   2010-08-16       Textile
601766.SH  CRRC Corp.                    329,574   66   2008-08-18       Railway Equipment
601872.SH  CMES                           37,573  188   2006-12-01       Transportation
601888.SH  CITS                           57,901   39   2009-10-15       Travelling
601898.SH  China Coal Energy              65,588  105   2008-02-01       Mining
601899.SH  Zijin Mining                   65,441   32   2008-04-25       Mining
601901.SH  Founder Securities             79,028   44   2011-08-10       Financial Service
601919.SH  COSCO                          76,484  254   2007-06-26       Transportation
601928.SH  Phoenix Publ. & Media          40,540   34   2011-11-30       Media & Entertainment
601939.SH  China Construction Bank     1,124,975    6   2007-09-25       Financial Service
603993.SH  China Molybdenum               62,552   41   2012-10-09       Mining

Figure 6: Company Information of the 28 Selected Stocks


2.2 Financial data

I download all financial data from the Wind Financial Terminal, a leading financial data provider in China that covers stocks, bonds, funds, indices, warrants, commodity futures, foreign exchange, and macroeconomic and industry data. Daily closing prices are downloaded for the Shanghai Composite Index, China Stock Index Futures and all the individual stocks. Following prior literature (Chen, De, et al. 2011), I compute returns for different holding periods: R1 denotes the return from holding the stock for 1 day, i.e. buying at the closing price of day t - 1 and selling at the closing price of day t. R2 denotes the return from a 2-day holding period, and R3 the return from a 3-day holding period. The same goes for R4, R5, R7, R10, R15 and R30.
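The holding-period returns can be sketched in a few lines of Python. This is only an illustration under one assumption: the n-day return at day t is taken as buying at the close of day t - 1 and selling at the close of day t - 1 + n, which matches the 1-day definition given here. The function name and the sample prices are hypothetical.

```python
def holding_period_returns(closes, horizons=(1, 2, 3, 4, 5, 7, 10, 15, 30)):
    """Return {n: list of n-day returns per day t}, None where undefined.

    Assumed convention: buy at the close of day t-1, sell at the close of
    day t-1+n, so n=1 is the same-day return of day t.
    """
    out = {}
    for n in horizons:
        series = []
        for t in range(len(closes)):
            buy, sell = t - 1, t - 1 + n
            if buy < 0 or sell >= len(closes):
                series.append(None)  # not enough price history or future
            else:
                series.append(closes[sell] / closes[buy] - 1.0)
        out[n] = series
    return out


# Hypothetical closing prices for four trading days.
closes = [100.0, 102.0, 101.0, 105.0]
returns = holding_period_returns(closes, horizons=(1, 2))
```

With these made-up prices, the 1-day return on the second day is 102/100 - 1 and the 2-day return on the same day is 101/100 - 1.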

3 Method

In this study I use a dictionary-based approach to Chinese Natural Language Processing for the textual analysis. The first step is to pick a suitable lexicon. For the textual analysis to reflect the sentiment of Guba posts accurately, the existing dictionaries also need to be augmented. I select the most popular words used in stock discussions among Chinese netizens and add them to the positive and negative dictionaries of HowNet and NTUSD. Some of these words are catchwords and some are market jargon. The second step is to segment all the posts into words. To do this, I use the Jieba parser to parse all the posts from Guba. After parsing, I compute the daily counts of negative and positive words, and the ratio of each to the total word count for each day. With those ratios and the price data, I conduct correlation and regression analysis.

3.1 Dictionaries and word list

Many lexicons have been developed for textual analysis in NLP. Of these, HowNet and NTUSD are the two most popular dictionaries. HowNet is an online common-sense knowledge base that unveils inter-conceptual and inter-attribute relationships of concepts as connoted in the lexicons of Chinese and their English equivalents.

Data Collecting: 1. Posts; 2. Prices

Text Analysis: 1. Dictionary; 2. Parsing; 3. # of Sentiment Words; 4. Sentiment Ratio

Statistical Analysis: 1. Correlation Analysis; 2. Linear Regression; 3. Time Series Regression

Figure 7: Summary of Method

NTUSD (National Taiwan University Sentiment Dictionary) is built on the Chinese language and is independent of other languages. Xu, Zhao, Qiu and Hu (2010) compare these dictionaries and find that they are insufficient: many sentiment words are not included in the current Chinese sentiment dictionaries. For example, HowNet contains 3969 positive words and 3755 negative words; NTUSD contains 2648 positive words and 7742 negative words. Among them, only 669 positive words and 877 negative words are shared.

In order to better measure the sentiment behind the posts, I combine the lexicons of HowNet and NTUSD as the base dictionary for this study. In addition, I manually compile a word list designed specifically for the stock market. I first use the Chinese parser to divide all the posts into words and extract the 300 most frequent words. I then distribute this list to three experienced individual investors who are also heavy users of Guba, and they mark the words that they think express positive or negative emotions. In the meantime, to eliminate parsing errors, I also read over 400 pages of posts to pick out positive and negative words. Based on the two lists, I build a list of 53 positive words and 70 negative words (Figure 8).
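The dictionary construction described above can be sketched as follows. The helper names and the sample words are hypothetical and stand in for the real lexicons and posts; only the word counts passed to `union_size` come from the figures reported in this section.

```python
from collections import Counter


def top_frequent_words(tokenized_posts, n=300):
    """Return the n most frequent tokens across all segmented posts."""
    counts = Counter(word for post in tokenized_posts for word in post)
    return [word for word, _ in counts.most_common(n)]


def merge_lexicons(*word_lists):
    """Union several word lists into one set, dropping duplicates."""
    merged = set()
    for words in word_lists:
        merged.update(words)
    return merged


def union_size(size_a, size_b, shared):
    """Size of the union of two lexicons that share `shared` words."""
    return size_a + size_b - shared


# Sizes of the combined base dictionary implied by the reported counts.
positive_base = union_size(3969, 2648, 669)
negative_base = union_size(3755, 7742, 877)

# Hypothetical segmented posts and word lists, for illustration only.
posts = [["大盘", "涨停", "利好"], ["大盘", "跌停"], ["大盘", "涨停"]]
candidates = top_frequent_words(posts, n=2)
positive_words = merge_lexicons({"上涨", "利好"}, {"利好", "看好"}, {"涨停"})
```

Under this reading, combining the two dictionaries gives 5948 distinct positive words and 10620 distinct negative words before the stock-market word list is added.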

3.2 Segmenting and parsing the posts

The Chinese language differs greatly from English in its syntactic features. For example, Chinese makes less use of function words and morphology than English, and verbs appear in a single form with few supporting function words. In addition, subject pro-drop, which is the null

[Figure 8 panels: Negative words; Positive words]

Figure 8: List of Positive & Negative Words for Stock Market

realization of uncontrolled pronominal subjects, is widespread in Chinese but rare in English. Therefore, Natural Language Processing (NLP) tools are very different for the two languages. There are several widely used open-source Chinese NLP tools, such as Jieba, BosonNLP, NLPIR and LTP-Cloud. The Stanford NLP Group and the Berkeley NLP Group also provide Chinese segmenters and parsers.

I adopt the Jieba NLP package (https://github.com/fxsjy/jieba) because it is widely used in the social media industry and is considered stable and accurate for Chinese parsing. Jieba is a free, open-source tool written in Python. Because its algorithm is based on a trie structure, Jieba can find all the possible word combinations in a sentence and arrive at the most probable path through dynamic programming. Moreover, Jieba uses a Hidden Markov Model with the Viterbi algorithm to detect new words. With these features, Jieba has become one of the most popular segmenters and parsers among Chinese users.
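The idea of enumerating candidate words and choosing the most probable path by dynamic programming can be illustrated with a toy segmenter. This is not Jieba's code: the dictionary, the word frequencies and the function names are all hypothetical, and the Hidden Markov Model step for new words is left out.

```python
import math


def build_dag(text, freq):
    """For each start index, list the end indices of dictionary words."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in freq]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag


def segment(text, freq):
    """Pick the segmentation with the highest total log-probability."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(text, freq)
    n = len(text)
    best = {n: (0.0, n)}  # best[i] = (score of text[i:], end of first word)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(text[i:j], 1)) - log_total + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# Hypothetical word frequencies; a real dictionary is far larger.
toy_freq = {"今天": 50, "大盘": 40, "大涨": 30, "大": 20,
            "盘": 5, "涨": 5, "今": 5, "天": 5}
tokens = segment("今天大盘大涨", toy_freq)
```

In practice the posts would be passed to the library itself (for example through `jieba.cut`) rather than to a hand-written routine like this one.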

3.3 Quantifying the positive and negative sentiment

Prior literature has noted that positive words in English are of limited use in measuring sentiment, because they are frequently subject to negation, whereas corporate communications rarely convey positive news using negated negative words (Loughran and McDonald 2011). However, this may not be the case in social media, where people are less careful about their language. Moreover, many of the positive words in Chinese are verbs; if a Chinese investor wants to express the opposite opinion, he would simply use the opposite verb. Negation of the original verb is not the same as using the opposite verb. Some studies of Weibo use positive sentiment in their tests and find that positive measures are also useful (Pang, Li, et al.). Based on these considerations, I include positive measures in this study.

I calculate the total word count per day, the count of positive words per day, and the count of negative words per day for every section. Formally, I use the frequencies of positive and negative words as the measures of positive and negative sentiment.

$$\mathrm{NegRatio} = \frac{\text{No. of Negative Words}}{\text{Total Word Count}}$$

$$\mathrm{PosRatio} = \frac{\text{No. of Positive Words}}{\text{Total Word Count}}$$
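As a minimal sketch, the two ratios can be computed from one day's segmented posts as below. The function name, the token list and the two word sets are hypothetical placeholders for the real daily data and dictionaries.

```python
def sentiment_ratios(tokens, positive_words, negative_words):
    """Return (PosRatio, NegRatio) for one day's list of tokens."""
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0  # no posts that day
    pos = sum(1 for word in tokens if word in positive_words)
    neg = sum(1 for word in tokens if word in negative_words)
    return pos / total, neg / total


# Hypothetical tokens from one day's posts in one section.
day_tokens = ["今天", "大涨", "利好", "明天", "暴跌"]
pos_ratio, neg_ratio = sentiment_ratios(
    day_tokens, positive_words={"大涨", "利好"}, negative_words={"暴跌"}
)
```

Here two of the five tokens are positive and one is negative, so the positive ratio is 0.4 and the negative ratio is 0.2.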

3.4 Regression methods

I first test the correlation of the sentiment ratios at day t with the returns of 9 different holding periods (R1, R2, R3, R4, R5, R7, R10, R15 and R30). In this study, I use Ri to denote the return of the holding period from day t to day t + i. For example, R1 is the same-day return at day t, because the holding period is just 1 day; R2 is the return from day t to day t + 2; and R30 is the return from holding the security for one month from day t. Figure 9 shows the relationship of these returns.
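The correlation test can be sketched with a plain Pearson coefficient and its t-statistic. The function names are hypothetical, days with a missing ratio or return are simply skipped, and converting the t-statistic into a p-value (and hence significance stars) is left to a statistics package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation, skipping pairs where either value is None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for testing a correlation of r from n observations."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical daily negative ratios and same-day returns.
neg_ratio = [0.10, 0.20, 0.30, None]
ret = [0.03, 0.02, 0.01, 0.05]
r_neg = pearson(neg_ratio, ret)
```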

After the correlation analysis, I run linear regressions of the returns for different holding periods on the sentiment measures from Guba, using the following specification:

$$R_{i,t,n} = \alpha_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t} + \beta_{i,2}\,\mathrm{PosRatio}_{i,t} + \epsilon_{i,t}$$

The dependent variable is the holding-period return $R_{i,t,n}$, where $i$ indexes sections, $t$ denotes the day on which the posts are posted on Guba, and $n$ denotes the number of days in the holding period. Specifically, I choose 9 holding periods to study: R1, R2, R3, R4, R5, R7, R10, R15 and R30. So I run 9 regressions for each stock and index to examine the relationship between the sentiment ratios and the different returns.
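A regression of this form can be sketched with ordinary least squares solved through the normal equations. This is an illustration only: the function name and the data are hypothetical, the intercept is treated as a single constant, and standard errors and significance tests are omitted.

```python
def ols_two_regressors(y, neg, pos):
    """Fit y = a + b1*neg + b2*pos by least squares; return [a, b1, b2]."""
    rows = [(1.0, n, p) for n, p in zip(neg, pos)]
    # Normal equations: (X'X) coef = X'y, stored as an augmented 3x4 matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(3):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Hypothetical ratios, with returns generated from known coefficients.
neg = [0.10, 0.20, 0.15, 0.30, 0.25]
pos = [0.05, 0.10, 0.20, 0.10, 0.15]
y = [0.01 - 0.5 * n + 0.2 * p for n, p in zip(neg, pos)]
alpha, beta_neg, beta_pos = ols_two_regressors(y, neg, pos)
```

Running the same fit once per holding period gives the 9 regressions per security described above.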

After the linear regressions with future returns, I also conduct a time series analysis and construct vector autoregression (VAR) models for selected stocks and indices using different lags of positive ratio and negative ratio. Granger causality tests are also conducted to assess the predictive power of these models.

[Figure 9 timeline: negative ratio and positive ratio at day t; return horizons marked at t+2, t+3, t+4, t+5, t+7, t+10, t+15 and t+30]

Figure 9: Denotations of Returns

4 Results

4.1 Correlation Analysis

4.1.1 Positive ratio v.s. Negative ratio

Figure 10 shows the correlation coefficients between negative ratio and different returns.

Figure 11 shows the correlation coefficients between positive ratio and different returns.

First of all, we can see that the results for negative ratio have much higher statistical significance than those for positive ratio. Of the 30 sample stocks and indices, 29 securities have a significant correlation between negative ratio and the different returns (most of them at the 1% significance level), while only 16 securities have a significant correlation between positive ratio and the different returns. In this sense, negative ratio has a more substantial correlation with stock returns.

Secondly, the correlation coefficients of negative ratio are much larger in absolute value than those of positive ratio. For almost every stock and index, negative ratio has a much higher correlation with returns. The only exception is the Shanghai Composite, whose positive and negative ratios have almost the same level of correlation. Most of the positive-ratio correlation coefficients are much lower than those of negative ratio. Figure 12 shows


Ticker Ratio R1 R2 R3 R4 R5 R7 R10 R15 R30

(11100)2 negative ratio -0.223*** -0.280' .313"* -0283'" 4210*'* -0281*** 0.24* -0266"* -0.195"'

00(81618 negative ratio -0- 1* 0.68* -0.155*** -0. 138** 134* ~0.137*** -0 141** -0.1254*** 0855"*

000156 negative ratio -00258 -0.0268 00205 -0 0106 0,000432 0.00640 A.0171 0,0449 0 157***

000157 negative ratio -0 175* * -0. 1901* -0184** -168"-O 0.155** -0.117*** 410990* -9082**

000333 negative ratio -009761 -0J39 0.13* -0131" -0.122* -0.0678 -0:0278 -0,0634 -00925'

000338 negaive ratio 07131** -0.142+1 -0,134*** -0 130** 1 0153* -0.142** 0130* -0.12"* .0864"'

000651 negative ratio -0,04* -0108" -0.126*** -0 126**-OI22** -0t23"' -0.135"' -0.114++ -0.127*

000712 negative ratio -0.181 .158** 4.143* -0,107' -0. 0* -0.0935 -A0645 -0.0338 -oOso3

000725 ncgative ratio -011 0"'* -3114** -0.0976** -00946* -0.0701* -0.0348 -00379 410304 -0.0169

000728 negative ratio -094*** -0.212* -0.211"' -0,216' -0.218"* -0.1910** -0.194** -0. 157"' -0.0665

002008 nogativeratio -0.255'** -. 268- -0.286"* -0233*** -0.187* -0.158** -0.196-" .236"* -0.199"

002024 negtive ratio -0 "* -0.192*** -0.174** 4- 179*** -0-171' -0.171" 4150"' -0.142* -0122"*

6o )38 negative ratio -0 143*** -0.117* -0.117* -0.1151' -0.1321 -0.127W*4 -0.133"' .y44'** -0.0886*

601618 negative ratio -0.0534 -0.0851* -0.0956" -0.1956" -0.0870' -0.07204 -0.0750' - .0418 0,00468

601669 negative ratio -024,i' -0.235"' -0.253** -0.237"' -0 22* -0.230"' -0226"' .225"* -0.24*

601688 negative ratio -0164*' -076*** -0.144"** -0115"' -0136 * -0.112** -00837' -0.108" 4.08310601699 negative ratio 033* -0-173"* -0.080.' -0.181* -0.181"* -0.168" -0171"' -0.191' -0184"*'

601718 ne-gaye ratio -095* -0.285** -0.287*' -0.268"* -0,166* -0,116 -0,124 -0.118 -).164*

601766 negative ratio -0.2"2*** 4219"' -0.214"'0 -0 197"* -0.195** -0.180*** -0164*"* -0.148"' -0138*"'

601871 negative ratio -050** -0.163 -0.145"' -0.133*** -0.133"' 4.143'" 4138.' -, 138"' 4112*"601888 Iegat ve, ratio O 08** -0. i I1"' -0.111*"' 4.106'" -0AW93*** -00859"' -0.0794"w -0.0814** -0.0748**

601898 negative rtio -0104*** -0.111*** -0.105*** -0. 101'** -0.0933*** -0.08061" -00848*** -0.0716** -0.0231

601899 negative ratio 00903" -0.0925* -0.0924** -0.0799* -0.0982" -D103" -0.132*0" -0.120"' -0,103"

601901 ncgativc ratio -0,2* -0.192 -0.195" -0. 1802" -0 l "' -0. 138*** -0.114** -0.080" -0.0546

601919 nrgativ ratio .0157*"* -0.192" -. 192"* -1. 08" -0-" 0182*** -0,184" -015** -Y4117w*'

601928 negative ratio -0-163*** -0.155' -0,141*** -0 120'* -0.14** -0.126" -0.132** -0. 106** -0.0769*

601939 negative ratio -0101*** 407 -013' -080971* 093 * 00924*** -0.0433#" -. 082* -0.0896" -0.0563

601993 negative ratio -0 119*.* -3-0" A0, 113" 5** -0.0939" -0.0587 -0,0558 -0.1000 4).112*

negative ratio -0 23 -0.179"* -0.216**' 0 4 222"' -0240*' 0.279f -0.338'** -0426"'" 41441"

SbanghaCbpoia negative ratio -0361"'* -0.466"' 42*** -0488** -0.499** -0454** -0.538*'* ) 0.w0" -0516"*

* p<0.05, ** p<0.01, *** p<0.001

Figure 10: Correlation of Negative Ratio and Returns of Sample Stocks and Indices

Ticker Ratio R1 R2 R3 R4 R5 R7 R10 R15 R30

00(032 positiv ratio 0.0814 0.0862 0049* 0.109' 0 1 0059 0 007(5 00148 40654

5%"'69 )0s0iVCno t .0600* 0045" 0.058 7 0565' o0567* 0.0505k 60.675'" 045 00416

000156 roeotive reo 0.0142 o,0257 0.0252 0.0240 0.024 o03(Oi7 0,0330 0 O066 (11210*

0001 ; iive man A0712"* 0190" 011768** 00827** 0.079** 0.0616' 00621' 0.0782" 0,01'2

000333 positive ratio 0.00676 0.240 0.04' 0.0627 0,0504 O 4O 0.0400 00721 it 108'

14(33A posiaive ia't U.0417 0.0343 0,0477 0150.0 t.'622* 0.0622* 0.062 2' 0 .0022* 113"*

0051 pxitive rato 0.0847' Q. I I" 0,082" 0 0514* 0.0718 0i4 0 077 00374 ) 0253

000712 positivrerdai 00200 0.305 00.344 0.0314 0.0212 -0 (W!21 00143 0.0512 0,-06

000725 poiive Mr6o 0.0721' 0.0599 a0474 0.0353 0.0269 0. 051 7 0.0338 0557 0.02W8

000729 posltive ratio 0 028 0.0279 0-0303 0.094 0.0339 00456 0.0512* 00558' 0.8753"0021)0 poia, rat ( 0135"' Q34*"' 0103" 0.,094* 0077 00491 0.0824 0047 0.0114r

002024 pomitiv ratc 0 151"** 0154-' .142*** 0.123 0.9501 00755 0.C659 00597 f01359

60833 pOitic Vrwio 0.0438 ',0493 0.0624 0,0297 0.0223 0020 U.018 0,0320 0.0133

601611 postve ro 0 110* 00827* 0.0875* 0.0928" 0.063" 0 107" 0.0975* 0106* 00524

601669 pontiac i I 0. 115, 103" 0.104"0 115"* 0.1U0S" 0113" 0.120400 1315" 0,17*"

11614 i 0.0929" O18" (,104" 0.13**' 0,0957" 00859" 0.0974" 00875" 0 t17#.

601699 poestiv tao 0123" 11105"'* 0.0833"' 0.07%* 00" 00604' 0.0477 0 015 "5 )'327

601719 poiivc rtO 0.0626 0117 0.174* 0. 18* 0 * 0.250" 0,276** 0-21" 113*

601766 postiv tiratc 0128*" 0J24** 01124** 0 111" 0.10'*** 0.)894** 0.0759** 0764" 00684*

601871 poitiv Muo't 0O625* 061(3*" 0.0683" 0.0459* 0.0334 0.0256 0.0329 0005 A002126

601801i positive ratio 3 013' 0+867** 0.0743" 0.0575* 0.0593* 003-il 0.0591 0.0779"* 014

601S91 pt10 iv9ra.o 0.040* Q,1677-- A0583* 0.0500' 0.0459 0.07 0.426 006133 01670"

60189') positia rato 009C? + 026*** 01 0.125*** 0.129*** 0. 1200" 0.104+ 0.0878' 0,0589

60101 positive 'MO .116"' 0132*** 0149** 0153*" 0.140"' 1I40"' 0132*** .147" 0147*"*

601919 pie t-c 00779' 00301 00793** 0.952* 0.098*" 0031" 0.0373" 0.078301 0.0722

60t72a pimitiv rat - 1 0M.1 0106*1, 0083"' 0.0713' 0.056 0(0F5 0),441 00281 (0.0.59

601939 " itivr rtxl 00766" ' 4f7' 0.0554* 0.0661 * 49' 1 0. 101" U10** 0.108*' 0 tItI,*

V)t941 Pt idvl rato 00415 0910 00339 0.0359 0.43 0 - 0.0105 00108 0.0242

Inde Fatie p tiivL rt-o U.0471 U0464 0 100-14 -4. 019 -Cii)9 -C 02 -0 0643 101Y31 .0.70

ponstirveattit 40"" 0400"' 0466*** 0,426"* 0.410"' 0164"' 0320"* 0229"' .1"'5**

* p<0.05, ** p<0.01, *** p<0.001

Figure 11: Correlation of Positive Ratio to Returns of Sample Stocks and Indices


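Ratio | Statistic | R1 | R2 | R3 | R4 | R5 | R7 | R10 | R15 | R30
Positive ratio | Corr. coeff., average | 0.085 | 0.093 | 0.092 | 0.087 | 0.083 | 0.076 | 0.071 | 0.071 | 0.066
Positive ratio | Corr. coeff., variance | 0.0051 | 0.0073 | 0.0065 | 0.0059 | 0.0063 | 0.0053 | 0.0054 | 0.0036 | 0.0040
Negative ratio | Corr. coeff., average | -0.156 | -0.173 | -0.173 | -0.163 | -0.155 | -0.146 | -0.143 | -0.141 | -0.118
Negative ratio | Corr. coeff., variance | 0.0043 | 0.0066 | 0.0077 | 0.0076 | 0.0076 | 0.0091 | 0.0107 | 0.0128 | 0.0146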

Figure 12: Average of Correlation Coefficients for Positive and Negative Ratio

the average correlation coefficients of positive ratio and negative ratio. We can see that the correlation coefficients of negative ratio are nearly twice as large, in absolute value, as those of positive ratio.
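The "nearly twice" comparison can be checked directly from the averages in Figure 12. The short script below is only an arithmetic check; it assumes the table cells printed as "0085" and "0087" are 0.085 and 0.087.

```python
horizons = ["R1", "R2", "R3", "R4", "R5", "R7", "R10", "R15", "R30"]

# Average correlation coefficients as read from Figure 12.
pos_avg = [0.085, 0.093, 0.092, 0.087, 0.083, 0.076, 0.071, 0.071, 0.066]
neg_avg = [-0.156, -0.173, -0.173, -0.163, -0.155,
           -0.146, -0.143, -0.141, -0.118]

# Ratio of magnitudes: |negative| / positive for each holding period.
magnitude_ratio = {
    h: abs(n) / p for h, n, p in zip(horizons, neg_avg, pos_avg)
}

for h in horizons:
    print(f"{h}: {magnitude_ratio[h]:.2f}")
```

Every ratio falls between roughly 1.8 and 2.0, which is consistent with the statement above.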

Thus, both the magnitude and the statistical significance are better when we use negative ratio. This result is in accordance with the studies in English. Positive words are subject to negation, so when positive words are used, they may express either a positive or a negative tone. At the beginning, I wondered whether this would not be the case in Chinese, but the results of the correlation analysis show that it also holds true in Chinese. I think the major reason lies in the dictionaries I used. Many positive words in HowNet and NTUSD are positive adjectives, which are more often subject to negation. Although I add a word list with many positive verbs and nouns to augment the dictionary, it is only a small part of the lexicon. Therefore, negative ratio is better than positive ratio in terms of both the degree and the significance of correlation.

4.1.2 Different Holding Periods and Securities

Figure 13 shows the correlation coefficients of the different holding-period returns for different stocks. There is no clear sign that the correlation with future returns decreases as the holding period gets longer. However, the variation across securities is very large.

Negative Ratio Correlation with Returns of Sample Stocks and Indices

Figure 13: Negative Correlations with Different Returns for Sample Securities

Positive Ratio Correlation with Returns of Sample Stocks and Indices

Figure 14: Positive Correlation with Different Holding Period Return

First, the Shanghai Composite Index has the largest correlation coefficients for both positive and negative ratio with the returns of all holding periods. The China Stock Index Futures also has a very high correlation with negative ratio. Individual stocks vary a lot in their correlation levels. Some have relatively large correlation coefficients (in absolute value), such as 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction), 601766 (CRRC Corp) and 000157 (Zoomlion).

Why do these securities have higher correlation coefficients? One possible reason is that they have more data. Figure 15 shows that the Shanghai Composite Index (szzs) has 1554 posts per day, more than five times the figure for the China Stock Index Futures (gzqh) and far more than any individual stock. In addition, 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction), 601766 (CRRC Corp) and 000157 (Zoomlion) all have frequent posts per day.

One outlier is 601718 (Jihua Group). Although it has very few posts per day, its positive-ratio correlation coefficients are larger than those of many other stocks. Considering the low significance level of 601718's correlation coefficients in Figure 10 and Figure 11, and its low posting frequency (ranked 28th in the sample), I think 601718 is simply an outlier that can be ignored. Another outlier is 000156 (WASU Media), which has large positive correlation coefficients for negative ratio, contrary to common sense. One reason for 000156's abnormality may lie in its relatively infrequent posts.

4.2 Regression Analysis

4.2.1 Positive ratio v.s. negative ratio

Similar to the results for the correlation coefficients, the regression results for positive ratio are not as good as those for negative ratio in terms of significance. As Figure 16 shows, for 11 stocks the coefficients of positive ratio are not consistently significant across the 9 kinds of returns. However, the coefficients of negative ratio are significant for almost all sample stocks and returns (see Figure 16 and Figure 17).


Rank | Ticker | Posts | Total days | Posts per day
1 | szzs | 1,129,503 | 727 | 1554
2 | gzqh | 85,094 | 377 | 226
3 | 601766 | 556,721 | 2,709 | 206
4 | 000002 | 76,914 | 760 | 101
5 | 601901 | 147,603 | 1,614 | 91
6 | 601669 | 132,530 | 1,558 | 85
7 | 002024 | 72,274 | 980 | 74
8 | 601899 | 157,243 | 2,764 | 57
9 | 601928 | 83,290 | 1,505 | 55
10 | 000333 | 44,541 | 833 | 53
11 | 002008 | 51,425 | 1,044 | 49
12 | 000157 | 126,625 | 2,639 | 48
13 | 601993 | 55,400 | 1,207 | 46
14 | 601919 | 124,354 | 2,768 | 45
15 | 000338 | 115,473 | 2,750 | 42
16 | 601939 | 101,991 | 2,764 | 37
17 | 601898 | 83,052 | 2,769 | 30
18 | 000728 | 75,854 | 2,739 | 28
19 | 601871 | 75,511 | 2,768 | 27
20 | 601618 | 62,570 | 2,303 | 27
21 | 000725 | 63,236 | 2,463 | 26
22 | 000069 | 68,198 | 2,719 | 25
23 | 601688 | 53,592 | 2,151 | 25
24 | 000156 | 28,099 | 1,168 | 24
25 | 601699 | 48,943 | 2,732 | 18
26 | 000651 | 44,821 | 2,652 | 17
27 | 601888 | 30,882 | 2,291 | 13
28 | 601718 | 16,791 | 1,947 | 9
29 | 600038 | 15,441 | 2,700 | 6
30 | 000712 | 6,455 | 2,563 | 3

Figure 15: Sample Securities Ranked by Posts per Day

In addition, Figure 16 and Figure 17 make it obvious that, for the same stock, the coefficients of negative ratio are almost always larger in absolute value than those of positive ratio. This shows that in the regression model

$$R_{i,t,n} = \alpha_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t} + \beta_{i,2}\,\mathrm{PosRatio}_{i,t} + \epsilon_{i,t}$$

$\beta_{i,1}$ is larger than $\beta_{i,2}$ in absolute value, i.e. negative ratio has a stronger relationship with stock returns. As discussed above, this is because positive words are subject to negation. This study shows that the same holds for textual analysis in Chinese, especially when the dictionaries used consist largely of adjectives. Therefore, we can conclude that negative ratio is a better measure of investor sentiment, because it has a stable and significant correlation with the returns of different holding periods.

4.2.2 Difference between stocks

There is a large divergence in coefficients across stocks (Figure 18). The coefficients for $\mathrm{NegRatio}_{i,t}$ range from -3 to +15 for individual stocks, and the coefficients


Ticker | Variable | R1 | R2 | R3 | R4 | R5 | R7 | R10 | R15 | R30
000156 | positive ratio | 0.387 | 1.058 | 1.248 | 1.362 | 1.586 | 2.743 | 2.668 | 6.156* | 14.16***
000156 | negative ratio | -0.653 | -0.972 | -0.876 | -0.486 | 0.0824 | 0.496 | 1.287 | 3.781 | 15.01***
000002 | positive ratio | 0.224 | 0.322 | 0.425 | 0.597* | 0.685* | 0.262 | -0.176 | -0.282 | -1.406*
000002 | negative ratio | -0.683*** | -1.255*** | -1.691*** | [illegible] | -1.890*** | -2.170*** | -2.379*** | -2.750*** | -2.902***
[illegible] | positive ratio | -0.0524 | -0.0286 | 0.0883 | 0.278 | 0.216 | 0.348 | 0.414 | 0.871 | 1.847**
[illegible] | negative ratio | -0.300** | -0.613*** | -0.734*** | -0.762*** | -0.800*** | -0.486 | -0.192 | -0.576 | -1.190*
[illegible] | positive ratio | 0.0662 | 0.0661 | 0.145 | 0.214* | 0.258* | 0.313* | 0.385** | 0.480** | 1.460***
[illegible] | negative ratio | -0.297*** | -0.477*** | -0.547*** | -0.607*** | -0.806*** | -0.880** | -0.957** | [illegible] | -1.004***
000712 | positive ratio | -0.00593 | 0.0117 | 0.0263 | 0.036 | 0.017 | -0.0431 | 0.0179 | 0.206 | 0.2
000712 | negative ratio | -0.214*** | -0.299*** | -0.334*** | -0.292** | -0.309** | -0.356* | -0.286 | -0.148 | -0.535
000725 | positive ratio | 0.0538** | 0.0629 | 0.058 | 0.0448 | 0.0383 | 0.107 | 0.0769 | 0.156* | 0.104
000725 | negative ratio | -0.0980*** | -0.151*** | -0.155*** | -0.171*** | -0.141** | -0.0708 | -0.0955 | -0.0796 | -0.0664
000728 | positive ratio | 0.0426 | 0.0573 | 0.102 | 0.128 | 0.116 | 0.204 | 0.272* | 0.392* | 0.991***
000728 | negative ratio | -0.483*** | -0.774*** | -0.941*** | -1.103*** | -1.240*** | -1.244*** | -1.468*** | -1.500*** | -0.914**
601928 | positive ratio | 0.0696 | 0.162 | 0.292** | 0.292* | 0.249 | 0.292 | 0.214 | 0.129 | 0.348
601928 | negative ratio | -0.342*** | -0.463*** | -0.506*** | -0.508*** | -0.503*** | -0.723*** | -0.906*** | -0.864*** | -0.855**
601993 | positive ratio | 0.152 | 0.15 | 0.238 | 0.276 | 0.341 | 0.231 | 0.142 | 0.218 | 0.483
601993 | negative ratio | -0.312*** | -0.487*** | -0.602*** | -0.616*** | -0.567*** | -0.411* | -0.443 | -0.925*** | -1.380***
600038 | positive ratio | 0.0325 | 0.0751 | 0.135 | 0.0354 | -0.00429 | 0.0168 | -0.0278 | 0.0396 | -0.0204
600038 | negative ratio | -0.226*** | -0.267*** | -0.324*** | -0.389*** | -0.498*** | -0.544*** | -0.677*** | -0.805*** | -0.708**
601718 | positive ratio | 0.0888 | 0.288 | 0.660* | 0.852* | 1.517** | 2.311*** | 3.266*** | 3.275*** | 4.430**
601718 | negative ratio | -0.453** | -0.872*** | -1.039*** | -1.121*** | -0.724* | -0.591 | -0.708 | -0.983 | -2.752*

Figure 16: Insignificant Positive Ratio and Significant Negative Ratio for 11 Stocks

for $\mathrm{PosRatio}_{i,t}$ range from -1.4 to +14.6. First of all, I think the large coefficient of +15 for negative ratio and the large coefficient of 14.6 for positive ratio (both from 000156) are outliers and should be ignored. As discussed in the correlation analysis, 000156's correlation is not significant at any acceptable significance level, and its posting frequency is also very low. Considering this, I will not consider 000156's data in the following discussion.

After removing the outliers, we still see large variation in coefficient values across stocks. For example, 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction), 601699 (Lu'an Environment Energy), 601766 (CRRC Corp), and 601919 (COSCO) have larger coefficients for negative ratio than other stocks (as negative ratio is superior to positive ratio, I will focus on discussing negative ratio). This may also be related to the number of posts per day, because 000002 (China Vanke), 002008 (Han's Laser), 601669 (China Power Construction) and 601766 (CRRC Corp) have many more posts per day than other stocks.


VARIABLES R1 R2 R3 R4 R5 R7 R10 R15 R30

601766 positive ratio 0.829*** L176*** 1382"* 0.943 0.865 0,377 0.512 0.645 -2.055

negative rato -04715*** -1.388* ' 1.817 * -2. 196*" - 5 " -6.34 - 2

poitverai 09.-A6 83** 265** -6,3

601872 poitiveratio 0.111*** 0.174*** 0.218*** 0170* 0.141 0.13 096' 015- 0006

negative ratio -0.224*** -0360*** -0,396** 0 41"-0 ' 0.1"' 0.74"'

--------. -0410 -4 ** *- -0.641*** 077*

0positiveratio 0.0650* 0162*** 0.168*** 0. 147* 09 2" 034

negative ratio -0.144"* -0.216*** -0.261' -Q.282* -0291*** -0.288*** -03090" -0.369* -0451

poSitive ratio 0. 142** 0.211" 0.257" 0.364" 0.435' 0.451 0601919

0.601*0 0.68** 098

negative ratio -0.338*** -0.621"* -0,769"' -0.95*** -1.065' L-501'

000069 positive ratio 0.0969" 0 136** .-14:9 0. 0-- 0.1-----88 0,321" *

i ratio-0.283"*** -0.460*** -0.515"* -0.522** -0,57 *** -0.691*** -0.82** 0. " ,23

positive ratio 0.139** 0.267*** 0282** 0..6 0.256 0.1410

000651 eaieai 16 -02,760,14 021 015

negativeratio -0 36*' -0.227** -0.321' -0.370*** -0.395*** -0.462*** -0.600*** -0.594** -0.939***

002008 positive. ratio 0.42"3*** 0.604** 0513" 0,549** 0.451 0.325 0. . 13

negative ratio -. 635*** -. 971"' -1264*** 1.157 - -0 35 -. 4 " - 0,376

positive ratio 0160" 0.214' 0.262' 0.3*6" 0..4413" 0. 5" -. 462*

601669 -- 022 .36* .4** 0-519**0 0.705"** 1 i1* -- -*

negative ratio -0.463*** -0.694*** -954 1046 * * * .6 , -129*** -l8**-67*-2.093*** -2 99***

positive ratio 0.268'* 0402*** 0465*** 0.446" 0354 01.

002024 . 121 94 0311 0332

negative ratio -0276 -0398*** -0451*** -0.556*** -0.61'7*** -0.761*** -0.813"' -0.946"' -1,44**"

601939 positive ratio 0.132*** 0.166"' 160** 0214** 0.3 "** 4"* 0.562' '6.0

negative ratio -0144*** -0.208*** -0.231" -0.257"'* -0289"' 0, * 0-5 ' 0662*** L 043*

601699 positive ratio 0.328"'* 0.4 1 ]" 0.39 .5*** 0380 ' 0 443 "' 0 . * -0.343*** 0 "*** 0443*

negativo ratio -0.418** -0.621"** -0.801*** -0.930** -1 045"'* *11 -1,424*" 0-.0 " -. 4

601688 positive ratio 0.207' 0.397*** 0462*** 0.563*** 0.544*** 046064* 2862' 0.951 2.5714

negative ratio -0.388*** -0.630*** -0648*** ..602 -820"' -0 0.7 * -. 29"* -. 1.060 0 -2**-.3** 07 75** -129* 1

601618 positive ratio 0.171*** 0.173** 0.235** 0.297 " 0.360*** 0.508 "' 0.5 " 082 2"' 610**

negative ratio -0.0463 -0.114** -066*** -0.196*** -0.2* -0211 0.284** 082*** 0.624

601899 positive ratio 0.180** 0.355*** 0.442*** 0.503"' 0.571 0.640"' 0.599** 0.617' 0.509

negative ratio -0 177** -0.251' -0.308* -0.286 -0.429" .6-587" 1.167"' -1.45"'

000157 positive ratio 0.197"** 0.288*** 0.342*** 0.425*** ..4 " 0.424" 0.508" 0764*** 0.33

negative ratio -0.366*** -0.557*** -0.705*** -0.787** -0801* -0.4 60"' -0.784"' -. 799"** -0,9.33

6 0 1 9 0 1 p o sitiv e ratio 0 .3 3 6 * * * 0 .5 75 * * * 0 .8 17 * * " 0 9 9 7 "' 7 . 8 " 10 "' 2 . 2 1 8 "' - .9 0 7 * * *

negative ratio -0.409"' -0.682"' -0.868*** -0950*** -0.970"' -1.0 *** -1 *4" - 26"' .5976

601898 positive ratio 0.0772" 0.159*" 0.169* 0.168" 0174" 03

negative ratio -0.78*** -0.279*** -0.325*** -0.362*** -0.376" -0.384"' -0.40*" -0.4 0.*224

Figure 17: Positive Ratio and Negative Ratio are Both Significant for 17 stocks


Figure 18: Outliers in Regression Coefficients

On the other hand, their larger coefficients also indicate greater attention from investors to those stocks. For example, CRRC Corp has been a buzzword in China since 2014, because the Chinese government encourages the export of railway equipment under the "Belt and Road" policy proposed by Chinese President Xi Jinping. Also, CRRC Corp is one of the largest companies in the Chinese stock market and has the capacity to attract millions of individual investors. The mania for CRRC Corp stock made its price increase 400% from Dec 2014 to April 2015. After that, CRRC Corp fell drastically back to its original 2014 price level. Such drastic price changes inevitably attract investors to talk about it. Thus, the more people talk about a stock on social media, the better the posts reflect investor sentiment, and the better the regression coefficients. Apart from 601766 (CRRC Corp), 000002 (China Vanke) is the No. 1 real estate developer in China and its CEO is a very famous public figure in China. 002008 (Han's Laser) is a world-leading laser equipment producer and has been regarded as a major player in the 3D printing industry. 601669 (China Power Construction) is widely expected to benefit from the "Belt and Road" policy proposed by the Chinese government. All those stocks have attractive stories that draw people into discussing them on social media. Compared with these stocks, other stocks are not so attention-grabbing, so their coefficients are much smaller.


Figure 19: Coefficients of Positive Ratio and Negative Ratio for Stocks

4.2.3 Difference between stocks and indices

The regression results in Figure 20 show that the Shanghai Composite Index has very good regression results: the coefficients for negative ratio and positive ratio are almost all significant at the 1% significance level, and the R-squared values for all holding-period returns are above 20%, much higher than in any regression for individual stocks. An interesting result is that the coefficient of negative ratio gets larger (in absolute value) as the holding period gets longer. This indicates that future returns depend heavily on lagged investor sentiment. (To analyze this, I also conduct a time series analysis and build a VAR model for the Shanghai Composite Index.)

Things are very different for the China Stock Index Futures. The coefficients of positive ratio are not significant for any holding-period return, whereas the coefficients of negative ratio are all significant at the 1% significance level. Also, the R-squared values are much smaller than those of the Shanghai Composite Index. The failure of positive ratio to explain the China Stock Index Futures and the weaker fit may lie in its much lower number of posts per day compared with the Shanghai Composite Index. Another underlying reason is the relatively small size of China's futures market. In addition, individual investors are required to deposit 500,000 RMB before they start trading index futures, a rule that also limits the involvement of individual investors in the futures market. Therefore, although index futures are largely perceived as good indicators of stock markets worldwide, investor sentiment about index futures on social media may not be a good indicator of the stock market, because index futures are not widely traded among individual investors and usually have a higher entry threshold for investment.


Security | Variable | R1 | R2 | R3 | R4 | R5 | R7 | R10 | R15 | R30
Shanghai Composite | positive ratio | 1.626*** | 2.967*** | 3.171*** | 3.115*** | 3.415** | 3.050*** | 2.735*** | 1.103 | -0.993
Shanghai Composite | negative ratio | -0.536*** | -1.065*** | -1.413*** | -1.713*** | -2.030*** | -2.741*** | -3.589*** | -4.748*** | -7.429***
Shanghai Composite | Constant | -0.0631*** | -0.11* | -0.107*** | -0.0907** | -0.0925** | -0.0424 | 0.0114 | 0.149*** | 0.383***
Shanghai Composite | Observations | 487 | 487 | 487 | 487 | 487 | 487 | 487 | 487 | 487
Shanghai Composite | R-squared | 0.22 | 0.348 | 0.335 | 0.314 | 0.319 | 0.313 | 0.31 | 0.294 | 0.267
Index Future | positive ratio | 0.269 | 0.415 | 0.114 | -0.0825 | -0.326 | -0.221 | -0.819 | -1.498 | -1.824
Index Future | negative ratio | -0.393** | -0.846*** | -1.181*** | -1.324*** | -1.610*** | -2.288*** | [illegible] | -5.138*** | -8.209***
Index Future | Constant | 0.00568 | 0.0203 | 0.0520* | 0.0701** | 0.0970** | 0.126*** | 0.210*** | 0.331*** | [illegible]
Index Future | Observations | 253 | 253 | 253 | 253 | 253 | 253 | 253 | 253 | 253
Index Future | R-squared | 0.018 | 0.035 | 0.047 | 0.049 | 0.058 | 0.078 | 0.117 | 0.187 | 0.198

Figure 20: Regression Results for Indices

Comparing the regression results of indices and stocks (Figure 21 and Figure 22), we can see that for negative ratio, the Shanghai Composite Index has the largest coefficients (in absolute value) for almost all returns, which suggests that investor sentiment on the Shanghai Composite Index is the most strongly correlated with the stock market. For positive ratio, however, the coefficients of the Shanghai Composite Index are not the largest. Other stocks, such as 601901 (Founder Securities) and 601718 (Jihua Group), also have large positive ratio coefficients in their regressions. As discussed before, many coefficients of positive ratio are not significant, so their values are less informative than those of negative ratio.
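As an illustration of how per-security regressions like those compared above could be run, the following Python sketch regresses a return series on the two sentiment ratios and reports R-squared. It is only a minimal sketch under stated assumptions: the function name `ols_with_r2` and the synthetic data are introduced here for illustration, and they are not the thesis data or the software originally used.

```python
import numpy as np

def ols_with_r2(y, X):
    """Fit y = const + X @ beta by least squares; return (coefficients, R-squared)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total_ss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / total_ss
    return coef, r2

# Synthetic illustration only (hypothetical numbers, not the thesis data).
rng = np.random.default_rng(0)
n = 487
pos = rng.uniform(0.01, 0.05, n)   # stand-in for the positive ratio
neg = rng.uniform(0.01, 0.05, n)   # stand-in for the negative ratio
ret = 0.01 + 1.5 * pos - 2.0 * neg + rng.normal(0.0, 0.01, n)

coef, r2 = ols_with_r2(ret, np.column_stack([pos, neg]))
print("const, positive_ratio, negative_ratio:", coef)
print("R-squared:", r2)
```

Running the same function once per security and per holding period would give a grid of coefficients that can be compared across indices and stocks. Significance tests are deliberately left out of this sketch.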

4.3 Time-series Analysis

In the case of the Shanghai Composite Index, lagged values of the independent variables clearly matter. A regression system in which each variable is explained by lagged values of itself and of the other variables is called a vector autoregression (VAR). I therefore construct a VAR model with lagged variables to better capture this relationship.

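$$R_{i,t} = \alpha_i + \beta_{i,0}\,\mathrm{NegRatio}_{i,t} + \beta_{i,1}\,\mathrm{NegRatio}_{i,t-1} + \cdots + \gamma_{i,0}\,\mathrm{PosRatio}_{i,t} + \gamma_{i,1}\,\mathrm{PosRatio}_{i,t-1} + \cdots + \varepsilon_{i,t}$$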
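A regression of this form, with current and lagged sentiment ratios on the right-hand side, can be estimated by stacking the lags into a design matrix. The Python sketch below is a hedged illustration: the helper names `lagged_design` and `fit_distributed_lag` and the column ordering are assumptions made here, not the author's implementation.

```python
import numpy as np

def lagged_design(neg, pos, n_lags):
    """Build the regressor matrix for days t = n_lags .. T-1.

    Columns: constant, neg[t], neg[t-1], ..., neg[t-n_lags],
             pos[t], pos[t-1], ..., pos[t-n_lags].
    """
    neg = np.asarray(neg, dtype=float)
    pos = np.asarray(pos, dtype=float)
    T = len(neg)
    cols = [np.ones(T - n_lags)]
    for series in (neg, pos):
        for lag in range(n_lags + 1):
            # Element j of this slice is series[t - lag] with t = n_lags + j.
            cols.append(series[n_lags - lag : T - lag])
    return np.column_stack(cols)

def fit_distributed_lag(ret, neg, pos, n_lags):
    """Least-squares estimate of the coefficients, in the column order above."""
    X = lagged_design(neg, pos, n_lags)
    y = np.asarray(ret, dtype=float)[n_lags:]  # drop days without full lag history
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

With `n_lags = 4`, the design matrix has 1 + 5 + 5 = 11 columns, and the first four days are dropped because their lagged values are not available.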


Figure 21: Coefficients for Positive Ratio: Indices vs. Stocks

Figure 22: Coefficients for Negative Ratio: Indices vs. Stocks


Selection-order criteria                              Number of obs = 483
Lag   LL        LR        df   p      AIC       HQIC      SBIC
0     4977.96                         -20.60    -20.59    -20.57
1     5431.98   908.03    9    0.00   -22.44    -22.40    -22.34
2     5482.19   100.41    9    0.00   -22.61    -22.54*   -22.43*
3     5496.17   27.98     9    0.00   -22.63    -22.53    -22.37
4     5510.93   29.52*    9    0.00   -22.66*   -22.53    -22.32

Figure 23: Lag Length Selection

Because the Shanghai Composite Index has the largest significant coefficients and the largest R-squared among all the sample stocks and indices, I use it to build the VAR model and run the Granger causality test.

4.3.1 Lag selection

In order to construct a good VAR model, the first step is to find the optimal lag length for the return of the Shanghai Composite Index. I use the Akaike Information Criterion (AIC) to determine the lag length, because for monthly and daily VAR models the AIC tends to produce the most accurate structural and semi-structural impulse response estimates for realistic sample sizes (Ivanov and Kilian, 2005).

As Figure 23 shows, AIC reaches its minimum of -22.6581 at lag 4. Therefore, I choose a lag length of 4 days for the VAR model.
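As a worked check of this selection, the AIC values can be recomputed from the log-likelihoods reported in Figure 23. The sketch below assumes the per-observation convention AIC = (-2 LL + 2k) / T, where k counts every coefficient in the three-equation system. This convention is an assumption made here; it reproduces the reported values, but the thesis does not state the formula.

```python
# Log-likelihoods by lag order, copied from Figure 23.
log_lik = {0: 4977.96, 1: 5431.98, 2: 5482.19, 3: 5496.17, 4: 5510.93}
n_obs = 483   # number of observations reported in Figure 23
n_vars = 3    # return, positive ratio, negative ratio

def aic(ll, lag):
    # Assumed convention: AIC = (-2*LL + 2*k) / T, where each of the n_vars
    # equations has n_vars*lag slope coefficients plus one constant.
    k = n_vars * (n_vars * lag + 1)
    return (-2.0 * ll + 2.0 * k) / n_obs

aic_by_lag = {lag: aic(ll, lag) for lag, ll in log_lik.items()}
best_lag = min(aic_by_lag, key=aic_by_lag.get)

for lag in sorted(aic_by_lag):
    print(f"lag {lag}: AIC = {aic_by_lag[lag]:.4f}")
print("lag with minimum AIC:", best_lag)  # prints 4
```

Under this convention the lag 4 value is -22.6581, matching the minimum quoted above. The penalty term grows by 18/483 per additional lag, because each extra lag adds nine coefficients.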

4.3.2 VAR Model

Figure 24 shows the details of this VAR model.

4.3.3 Stability test

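The necessary and sufficient condition for stability is that all roots of the characteristic polynomial lie outside the unit circle or, equivalently, that all eigenvalues of the companion matrix (the reciprocals of those roots) lie inside the unit circle. Figure 25 shows that all eigenvalues of the companion matrix lie within the unit circle, so the VAR model is stable.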
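The stability condition can be checked numerically by building the companion matrix from the lag coefficient matrices and inspecting the moduli of its eigenvalues. The following Python sketch uses small hypothetical coefficient matrices `A1` and `A2`, because the full estimated coefficient matrices are not reported in the text; the function names are likewise assumptions made for illustration.

```python
import numpy as np

def companion_matrix(coef_mats):
    """Stack lag matrices A_1..A_p (each k x k) into the (k*p) x (k*p) companion matrix."""
    mats = [np.asarray(a, dtype=float) for a in coef_mats]
    p = len(mats)
    k = mats[0].shape[0]
    top = np.hstack(mats)                      # [A_1  A_2  ...  A_p]
    if p == 1:
        return top
    # Identity block shifts lagged values down; last k columns are zero.
    bottom = np.hstack([np.eye(k * (p - 1)), np.zeros((k * (p - 1), k))])
    return np.vstack([top, bottom])

def is_stable(coef_mats):
    """Return (stable?, eigenvalue moduli); stable means every modulus is below 1."""
    moduli = np.abs(np.linalg.eigvals(companion_matrix(coef_mats)))
    return bool(np.all(moduli < 1.0)), moduli

# Hypothetical two-variable VAR(2), for illustration only.
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
A2 = np.array([[0.2, 0.0], [0.0, 0.1]])
stable, moduli = is_stable([A1, A2])
print("stable:", stable)
print("largest modulus:", moduli.max())
```

For the three-variable VAR with four lags used here, the companion matrix would be 12 x 12, so a plot like Figure 25 would show twelve eigenvalues.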


Vector Autoregression Results

Equation         Parms   RMSE   R-sq   chi2      p>chi2   No. of Obs.
r1               13      0.02   0.07   35.63     0.00     483
positive_ratio   13      0.00   0.30   205.51    0.00     483
negative_ratio   13      0.00   0.77   1644.86   0.00     483

Equation: r1     Coef.   Std. Err.   z       P>|z|
r1
  L1.            0.13    0.06        2.27    0.02
  L2.            -0.11   0.06        -1.82   0.07
  L3.            -0.01   0.06        -0.22   0.83
  L4.            0.18    0.05        3.41    0.00
positive_ratio
  L1.            0.17    0.32        0.54    0.59
  L2.            0.00    0.34        0.00    1.00
  L3.            -0.33   0.34        -0.98   0.33
  L4.            -0.30   0.31        -1.00   0.32
negative_ratio
  L1.            -0.09   0.24        -0.38   0.70
  L2.            0.17    0.27        0.66    0.51
  L3.            -0.27   0.26        -1.01   0.32
  L4.            0.08    0.23        0.34    0.74
_cons            0.03    0.02        1.42    0.16

Figure 24: VAR Model

[Plot: Roots of the companion matrix]

Figure 25: Unit Root Check


Granger causality Wald tests

Equation         Excluded         chi2    df   Prob>chi2
r1               positive_ratio   3.56    4    0.47
r1               negative_ratio   1.98    4    0.74
r1               ALL              5.68    8    0.68
positive_ratio   r1               24.05   4    0.00
positive_ratio   negative_ratio   7.38    4    0.12
positive_ratio   ALL              39.66   8    0.00
negative_ratio   r1               36.53   4    0.00
negative_ratio   positive_ratio   15.78   4    0.00
negative_ratio   ALL              58.07   8    0.00

Figure 26: Granger Causality Test

4.3.4 Granger causality analysis

Figure 26 shows the results of the Granger causality test. In the return equation, the p-values for excluding the positive ratio (0.47), the negative ratio (0.74), and both together (0.68) are all well above conventional significance levels, so the null hypothesis that the sentiment ratios do not Granger-cause returns cannot be rejected. In other words, we cannot assert that the positive and negative sentiment ratios on the Shanghai Composite Index are predictive of market returns. Although the Shanghai Composite Index has very significant linear regressions on social media sentiment ratios, it is still challenging to use past sentiment ratios to predict future returns.
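The p-values in Figure 26 follow from the chi-squared statistics and their degrees of freedom. As a worked check, the Python sketch below recomputes them for the return equation using the closed-form upper-tail probability of the chi-squared distribution, which is valid for even degrees of freedom only. The function name `chi2_sf_even_df` and the 0.05 threshold are assumptions chosen for illustration; the text does not name a specific significance level.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with even df.

    Uses exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!, which holds only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

# (excluded variable, chi2, df) for the return equation, copied from Figure 26.
return_equation_tests = [
    ("positive_ratio", 3.56, 4),
    ("negative_ratio", 1.98, 4),
    ("ALL", 5.68, 8),
]
alpha = 0.05  # assumed significance level, for illustration only

for excluded, stat, df in return_equation_tests:
    p = chi2_sf_even_df(stat, df)
    verdict = "reject" if p < alpha else "cannot reject"
    print(f"excluding {excluded}: p = {p:.2f} -> {verdict} the null of no Granger causality")
```

The recomputed values round to 0.47, 0.74 and 0.68, in line with Figure 26. Applying the same function to the rows where the return is the excluded variable (chi2 = 24.05 and 36.53 with df = 4) gives p-values below 0.001.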


5 Conclusion

Since the 1980s, behavioral finance has posited that investors are driven by sentiment, and this assumption has been supported by many studies of traditional media, such as newspapers, journals and annual reports. However, sentiment extracted from traditional media may represent the opinions of only a small fraction of investors. In the era of the internet and big data, everyone is connected by social media, which has become an important confluence where investors' sentiments are gathered and preserved. Many researchers are therefore investigating how to extract sentiment from crowds of people to better predict the financial market. In this spirit, I study the relationship between social media and the Chinese stock market, given that little of the literature covers Chinese social media and the Chinese stock market. I choose guba.com.cn, one of the most popular stock-related social media sites in China, for this study, and download 3,734,426 posts from 2008 to 2015 under the sections of 30 sample securities, including the Shanghai Composite Index, the China Stock Index Future and 28 individual stocks. I conduct textual analysis of those posts and extract sentiment ratios. I then analyze the correlation matrices and regressions between sentiment ratios and returns over 9 holding periods for all 30 sample securities, and arrive at the following findings.

1. Negative sentiment ratio is superior to positive sentiment ratio. I have found that many researchers in the Chinese literature use both sentiment measures, while most of the English literature uses negative ratio. Language differences could be one reason for this difference. Therefore, I deliberately compare the two kinds of sentiment ratios and find that negative ratio is superior. First, the correlation coefficients of positive ratio with returns are all smaller than those of negative ratio for all 30 samples. The same holds for the regression coefficients. In addition, more than half of the coefficients of positive ratio are not statistically significant. Therefore, I conclude that negative sentiment ratio is better than positive sentiment ratio, especially when dictionary-based textual analysis is used in the research.

2. Correlation of sentiment ratio to return is persistent in future holding periods. I compute the correlation coefficients and regression coefficients of sentiment ratios against returns over 9 different holding periods. There is no sign that the correlation decreases as the holding period becomes longer. On the contrary, in the case of the Shanghai Composite Index, the coefficients of negative ratio become larger in absolute value as the holding period increases from 1 day to 30 days. For the other samples, the absolute values of the coefficients change irregularly and no clear pattern can be found. One thing is clear, however: the correlation does not decrease as the holding period gets longer. This result indicates that lagged terms may help predict stock returns; however, since no sign of decreasing correlation is identified, finding the optimal lag length could be difficult.

3. A well-established market index has better correlation with social media than individual stocks, and well-known 'star' stocks have better correlation with social media than other stocks. In this study, the Shanghai Composite Index has the best correlation results: the largest coefficients of all samples, all significant at the 1% significance level, and R-squared values above 20% in all 9 regressions. One reason for this is that the discussion about the Shanghai Composite Index is the most active: 1554 posts are submitted every day. Although individual investors may each focus on only a few individual stocks, they all care about the market index. That is why the Shanghai Composite Index has the most frequent postings and the best correlation with social media. Among individual stocks, 601766 (CRRC Corp), 000002 (China Vanke), 002008 (Han's Laser) and 601669 (China Power Construction) have relatively better correlation with social media sentiment ratios, because they are all popular stocks among individual investors and have more posts per day than other stocks.

4. Better data and improved analysis are needed to predict the stock market with social media. I test the VAR model on the Shanghai Composite Index, and find that the model is stable but shows no Granger causality from sentiment ratios to returns. I think this result stems from the quality of the data and the textual analysis method. The data may not be of sufficient quality: many posts on Guba are rumors and unfounded speculation (Guba is regarded by many investors as full of rumors), so the information they carry may be distorted. The textual analysis method could also be improved, for example by adapting the dictionary to financial markets and by improving the parsing tools.

All in all, this study confirms that correlation exists between investor sentiment on social media and the returns of the Chinese stock market, and that negative sentiment ratio is a better indicator of this correlation. However, in order to predict the stock market, the quantity and quality of content on social media matter greatly: the more content there is, the larger the correlation. As social media attracts more and more users to generate content, the explanatory and predictive power of social media will be greater in the future.


References

[1] Charles, Amélie and Olivier Darné. 2009. "The Random Walk Hypothesis for Chinese Stock Markets: Evidence from Variance Ratio Tests." Economic Systems 33(2):117-26.

[2] Baker, HK. 2010. "Individual Investor Trading." Pp. 1-26 in Behavioral Finance: Investors, Corporations, and Markets.

[3] Baker, Malcolm and Jeffrey Wurgler. 2007. "Investor Sentiment in the Stock Market." Journal of Economic Perspectives 21(2):129-52.

[4] Biemann, Chris. 2006. "Chinese Whispers - an Efficient Graph Clustering Algorithm and Its Application to Natural Language Processing Problems." In Proceedings of Workshop on TextGraphs, at HLT-NAACL 2006, New York 73-80.

[5] Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. "Twitter Mood Predicts the Stock Market." Journal of Computational Science 2(1):1-8.

[6] De Bondt, Werner F. M. 1998. "A Portrait of the Individual Investor." European Economic Review 42(3-5):831-44.

[7] Chen, Hailiang, Prabuddha De, Yu (Jeffrey) Hu, and Byoung-Hyoun Hwang. 2014. "Wisdom of Crowds: The Value of Stock Opinions Transmitted Through Social Media." Review of Financial Studies 27(5):hhu001.

[8] Chen, Hailiang, Prabuddha De, Yu Hu, and Byoung-Hyoun Hwang. 2011. "Sentiment Revealed in Social Media and Its Effect on the Stock Market." IEEE Workshop on Statistical Signal Processing Proceedings 25-28.

[9] Engelberg, Joseph E. and Christopher A. Parsons. 2011. "The Causal Impact of Media in Financial Markets." The Journal of Finance 66(1):67-97.

[10] Epstein, Marc J. and Martin Freedman. 1994. "Social Disclosure and the Individual Investor." Accounting, Auditing & Accountability Journal 7(4):94-109.


[11] Gilbert, Eric and Karrie Karahalios. 2009. "Predicting Tie Strength with Social Media." CHI 2009.

[12] Gilbert, Eric and Karrie Karahalios. 2010. "Widespread Worry and the Stock Market." Proceedings of the 4th International AAAI Conference on Weblogs and Social Media 58-65.

[13] Tetlock, Paul C. 2007. "Giving Content to Investor Sentiment: The Role of Media in the Stock Market." Journal of Finance 62(3):1139-68.

[14] Karabulut, Yigitcan. 2013. "Can Facebook Predict Stock Market Activity?" American Finance Association 2013 Meetings 49(0):60.

[15] Kr, Roman. 2015. "Media, Sentiment and Market Performance in the Long Run." (July).

[16] Loughran, Tim and Bill McDonald. 2010. "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks." Journal of Finance, Forthcoming LXVI(1):46.

[17] Maertens, Annemie, A. V. Chari, and David R. Just. 2014. "Why Farmers Sometimes Love Risks: Evidence from India." Economic Development and Cultural Change 62(2):239-74.

[18] Malkiel, Burton G. 2007. "The Efficiency of the Chinese Stock Markets." (154).

[19] Li, Feng. 2006. "Annual Report Readability, Current Earnings, and Earnings Persistence." Social Sciences (September).

[20] Sheng-, Zhou and S. H. I. Xun-. 2013. "Stock Market Time-Series Prediction Based on Weibo Search and SVM." 22-26.

[21] Tetlock, Paul C., Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. "More than Words: Quantifying Language to Measure Firms' Fundamentals." Journal of Finance 63(3):1437-67.


[22] Weston, Jason et al. 2011. "Natural Language Processing (Almost) from Scratch." Journal of Machine Learning Research 12:2461-2505.

[23] Yang, Sy, Syk Mo, and Xiaodi Zhu. 2013. "An Empirical Study of the Financial Community Network on Twitter."

[24] Zhou, Xiaolin, Zheng Ye, Him Cheung, and Hsuan-Chih Chen. 2009. "Processing the Chinese Language: An Introduction." Language and Cognitive Processes 24(7-8):929-46.

[25] Seasholes, Mark S. and Ning Zhu. 2010. "Individual Investors and Local Bias." Journal of Finance 65(5):1987-2010.

[26] Li, Feng. 2008. "The Determinants and Information Content of the Forward-Looking Statements in Corporate Filings - a Naïve Bayesian Machine Learning Approach." Journal of Accounting Research 1001.

[27] Wong, Kam-Fai, Wenji Li, Ruifeng Xu, and Zheng-sheng Zhang. "Introduction to Chinese Natural Language Processing." Morgan & Claypool Publishers.

[28] Xu, Hongzhi, Kai Zhao, Likun Qiu, and Changjian Hu. 2010. "Expanding Chinese Sentiment Dictionaries from Large Scale Unlabeled Corpus." Proceedings of the PACLIC 24 301-10.

[29] Ivanov, Ventzislav and Lutz Kilian. 2005. "A Practitioner's Guide to Lag Order Selection for VAR Impulse Response Analysis." Studies in Nonlinear Dynamics & Econometrics 9(1):1-36.
