text mining and topic modeling on compendium …docs.trb.org/prp/16-3009.pdf1 text mining and topic...

16
Text Mining and Topic Modeling on Compendium Papers from 1 Transportation Research Board Annual Meetings 2 3 4 Subasish Das, Ph.D. 1 5 Associate Transportation Researcher 6 Texas A&M Transportation Institute 7 Texas A&M University System 8 3135 TAMU 9 College Station, TX 77843-3135 10 Email: [email protected] 11 Phone: 979-845-9958 12 Fax: 979-845-6006 13 14 15 Xiaoduan Sun, Ph.D. & PE 16 Professor 17 Civil Engineering Department 18 University of Louisiana 19 Lafayette, LA 70504 20 Email: [email protected] 21 Phone: 337-739-6732 22 Fax: 337-739-6688 23 24 25 Anandi Dutta 26 Ph.D. Candidate 27 Center for Advanced Computer Studies 28 University of Louisiana at Lafayette 29 Lafayette, LA 70504 30 Email: [email protected] 31 32 33 Word Count: 7,000 including 11 figures and 4 tables 34 35 36 1 Corresponding Author 37 38 39 Submitted to the 95th TRB Annual Meetings for Presentation and Publication under Committee 40 Library and Information Science for Transportation (ABG40) 41

Upload: vutruc

Post on 22-Apr-2018

221 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

Text Mining and Topic Modeling on Compendium Papers from 1

Transportation Research Board Annual Meetings 2 3 4

Subasish Das, Ph.D.1 5 Associate Transportation Researcher 6

Texas A&M Transportation Institute 7 Texas A&M University System 8

3135 TAMU 9 College Station, TX 77843-3135 10

Email: [email protected] 11 Phone: 979-845-9958 12 Fax: 979-845-6006 13

14 15

Xiaoduan Sun, Ph.D. & PE 16 Professor 17

Civil Engineering Department 18 University of Louisiana 19

Lafayette, LA 70504 20

Email: [email protected] 21

Phone: 337-739-6732 22 Fax: 337-739-6688 23

24

25

Anandi Dutta 26 Ph.D. Candidate 27

Center for Advanced Computer Studies 28 University of Louisiana at Lafayette 29

Lafayette, LA 70504 30 Email: [email protected] 31

32

33

Word Count: 7,000 including 11 figures and 4 tables 34

35

36

1 Corresponding Author 37

38

39

Submitted to the 95th TRB Annual Meetings for Presentation and Publication under Committee 40 Library and Information Science for Transportation (ABG40) 41

Page 2: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

2

ABSTRACT 1 2 The collective knowledge system is advancing rapidly in the age of the Internet. The digitalization of 3 information in many online medias- like blogs, social media, articles, webpages, images, audios, and 4 videos- provides an unprecedented opportunity for us to extract and identify new knowledge. Prominent 5 journal and conference proceedings usually contain extensive amounts of textual data that can be used to 6 examine the research trends on various interested topics and to understand how these researches have 7 helped the advancement of a particular subject such as transportation engineering. The exploration of the 8 unstructured contents in journal or conference papers requires sophisticated knowledge extraction 9 algorithms. Topic models are algorithms designed to discover the main theme or trend from massive 10 collections of unstructured documents. This paper utilizes text mining techniques to analyze compendium 11 papers published by Transportation Research Board (TRB) Annual Meetings, the largest and most 12 comprehensive transportation conference in the world. A total of 15,357 compendium papers from seven 13 years (2008-2014) of TRB Annual Meeting conference proceedings were analyzed by using a popular 14 topic modeling, named latent Dirichlet allocation (LDA), to reveal research trends and interesting 15 research development histories. 16 17 18 Key words: Topic modeling, latent Dirichlet allocation, text mining. 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51

Page 3: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

3

INTRODUCTION 1 2 The rise of the Internet and digital gadgets has evolved publication media into an abundant 3 source of information in digital format as part of a collective knowledge system, which provides 4

unprecedented opportunities for us to extract, identify and investigate new knowledge and its 5 applications. The prospect and correct use of digitalized information is an evolving issue in 6 theory and practice. The amount of information in digital form has been growing exponentially. 7 Sophisticated tools and models are needed for researchers to analyze this vast amount of 8 information so that the new knowledge can be disseminated effectively. 9

The rapid development in machine learning and natural language processing (NLP) has 10

offered a probabilistic framework for term frequency occurrence models, named as topic models, 11

in documents based upon the idea that documents are mixtures of topics where a topic is a 12 probability distribution over words. Among different topic models, latent Dirichlet allocation 13 (LDA) models are widely used. An LDA topic model is a Bayesian mixture model for discrete 14 data where topics are uncorrelated. 15

The Transportation Research Board (TRB) organizes the largest and most comprehensive 16 annual transportation conference in the world. Established in 1920 as the National Advisory 17

Board on Highway Research, TRB provides a mechanism for the exchange of information and 18 research results about every aspect of transportation with a focus on highway transportation 19 research and development. The mission of TRB is to promote innovation and progress in 20

transportation through research. In an objective and interdisciplinary setting, TRB facilitates the 21

sharing of information on transportation practice and policy by researchers and practitioners; 22 stimulates research and offers research management services that promote technical excellence; 23 provides expert advice on transportation policy and programs; and disseminates research results 24

broadly and encourages their implementation. 25 The research papers published by the TRB annual meeting compendiums are peer 26

reviewed by hundreds of TRB committees and a small percentage of compendium papers are 27 accepted for publication by the TRB journal. It is interesting to investigate research trends and 28 applications by analyzing the TRB compendium papers with currently evolving data mining 29

tools such as text mining and topic modeling. In this paper, we focused on a topic discovery 30 system to reveal the implicit knowledge present in TRB compendium papers. 31

32 33

LITERATURE REVIEW 34 35

Text mining, the process of deriving high-quality information from text, is having a wider range 36 of applications in many fields. With the increasing power of computers and programming 37 software, text mining can now explore any massive amount of textual data for easy-to-38 understand knowledge. As a newer branch in scientific data analysis, text mining is growing 39 quickly. There are two good literature review papers on text mining that explore both theoretical 40

aspect and application-oriented methodologies [1-2]. Semantic analysis of the textual data was 41 widely used to facilitate many applications such as user interest modeling [3], sentiment analysis 42

[4], content exploration [5-7], event tracking [8], citizen-government relations [9-11], news 43 retrieving [12], prediction of stock market variations [13], the management of natural disasters 44 [14], the understanding of epidemical diseases [15] and the characterization of electoral 45 processes [16]. 46

Page 4: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

4

Topic modeling is a type of statistical model for discovering the unstructured topics that 1 occur in a collection of documents. Blei wrote a general introductory article on topic modeling 2 with an emphasis on latent Dirichlet allocation (LDA) [17]. Research trend analysis by using 3 topic models has been conducted in several studies. In 2008, Hall et al. performed a study to 4

investigate the development of ideas in the scientific field by using LDA [18]. Paul and Girju 5 used LDA to develop a novel classifier to classify research papers based on topic and language 6 [19]. Recently, Cui et al. used topic model to explore trends of cancer research [20]. The 7 researchers of the current study compiled detailed bibliographies (with abstracts of the papers) 8 on text mining and topic modeling in two webpages [21-22]. Interested readers can surf through 9

these two web pages for more detailed reviews on text mining and topic modeling. 10

11

12 METHODOLOGY 13 TRB’s various activities regularly engage more than 7,000 engineers, scientists, and other 14 transportation researchers and practitioners from the public and private sectors and academia. 15

State transportation departments and federal agencies, including the component administrations 16 of the U.S. Department of Transportation, support this program. More than 5,000 presentations 17

in nearly 750 sessions and workshops were made covering a board area of transportation. The 18 objective of this research is to investigate how data mining can be helpful in retracting and 19 identifying new knowledge and applications of that knowledge from TRB Annual Meeting 20

publications. As a first step, the data from 15,357 compendium papers from 2008-2014 TRB 21

Annual Meetings were collected as shown in Table 1. It is seen that the number of accepted 22 compendium papers increased over the years; for example, a 46% increase in number of 23 compendium papers is seen from 2009 to 2014. Accomplishing the research goal, two data 24

mining methods were used in this study: text mining and topic modeling. 25

26

TABLE 1. Number of TRB Papers by Year 27 28

Year Number of Compendium Papers

2008 697

2009 1,993

2010 2,143

2011 2,312

2012 2,539

2013 2,758

2014 2,915

29 30

Text Mining 31 Text mining is an applied method that originated from a more generic scientific branch called 32 data mining or knowledge discovery. Knowledge discovery (KD) is the scientific process of 33

identifying valid, original, important, and ultimately interpretable patterns in unstructured data. 34 Knowledge discovery in text (KDT) or text mining can be viewed as a multi-stage process that 35 comprises all activities from document collection to interpretable knowledge extraction. KDT 36 uses methods like data mining, information retrieval, supervised and unsupervised machine 37 learning, and computational semantics. Extraction of useful information from databases (in this 38

Page 5: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

5

case, TRB papers) through pattern recognition helps identify contributing factors or trends in 1 associated tasks. Text mining mainly deals with the collections of unstructured textual data rather 2 than from structured databases. 3

In text mining methods, it is assumed that keywords represent compact information in 4

documents. Keyword extraction uses a natural language processing method to identify particular 5 word/term tags that is combined with various machine learning algorithms. Another particular 6 interest is seen in co-occurrence of particular phrases and terms. This co-occurrence would be a 7 point of interest in many studies. For example, high frequency of the term congestion would 8 indicate the nature of the document’s particular interest. If the occurrence of congestion with 9

another term minimal is high, it would indicate a rather different nature of the document’s 10

interest. In text mining, a corpus represents a collection of text documents. A corpus is an 11

abstract concept, and they can be several implementations in parallel. After developing a corpus, 12 users can clean the textual contents by removing redundant words, numbers, punctuations, etc. 13

The main information on TRB compendium papers collected from TRB authority 14 contains the following attributes: 15

- Publication year 16 - Title 17

- Abstract 18 - Author name 19 - First author affiliation 20

- Review committee’s code 21

- Review committee’s name 22 Table 2 lists the top ten review committees, which clearly reveals the contemporary 23

concerned issues and highlighting areas in transportation. The research also investigated the 24

demographics of the research community. Figure 1 shows the top ten first author affiliations in 25 TRB compendium paper publications. It is interesting to note that even though approximately 26

90% of TRB annual meeting attendees are from the U.S., two of the top ten first author 27 affiliations are outside of the country. 28

29

TABLE 2. Top Ten Review Committees 30

No. Reviewing Committee's Name TRB Code Papers Appeared

in Compendium

1 Traveler Behavior and Values ADB10 512

2 Traffic Flow Theory and Characteristics AHB45 403

3 Safety Data, Analysis and Evaluation ANB20 331

4 Transportation Demand Forecasting ADB40 309

5 Traffic Signal Systems AHB25 280

6 Pedestrians ANF10 247

7 Transportation and Air Quality ADC20 243

8 Transportation in the Developing Countries ABE90 243

9 Transportation Network Modeling AFK50 242

10 Bicycle Transportation ANF20 232

31 32 33 34

Page 6: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

6

1 FIGURE 1. Top ten first author affiliations. 2

3 The text corpus, collection of texts, was created based on the annual compendium papers. 4

Seven years of TRB compendium papers were stored in seven documents named TRB2008, 5 TRB2009, TRB2010, TRB2011, TRB2012, TRB2013, and TRB2014 respectively. In conducting 6

text mining, two specific data groups were considered: Paper Title and Paper Abstract. For 7

paper abstracts, 3,000 representative abstracts were randomly selected for analysis to mitigate 8 computational delay. For the group of paper titles, all 15,357 paper tiles were used in the 9 analysis. 10

Table 3 lists the basic outputs generated from these two groups (paper titles and paper 11 abstracts). For both of the groups, the sparsity is around 62%. The study removed sparse terms, 12

i.e., the terms occurring only in very few documents, to narrow down the matrix of the terms 13 dramatically without losing significant relations inherent to the matrix. The number of terms 14 reduced to 1,084 and 2,293 for the groups of paper titles and paper abstracts, respectively, after 15

the removal of sparse terms. 16

17 TABLE 3. Analysis of Before and After Sparse Term Removal 18

Paper Titles Sample of Abstracts

Analyzed Papers 15,357 3,000

All Terms

Non-sparse/Sparse entries 30,371/52,334 52,337/86,228

Terms 11,815 19,795

Sparsity 63% 62%

Maximal Term length 34 45

After Removing Sparse Terms (14%)

Non-sparse/Sparse entries 7,588/0 16,051/0

Terms 1,084 2,293

Sparsity 0% 0%

Maximal Term length 22 21

Page 7: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

7

One of the most common tasks in text mining is to find out the most frequently cited 1 terms in the corpus. Figure 2 illustrates the most frequently cited terms found in the combined 2 corpus from all paper titles. It is interesting to note that the top five most frequently cited terms 3 are: model/models/modeling, traffic, analysis, pavement, and evaluations/evaluations/evaluating. 4

Figure 3 shows the heatmap of the top 20 most frequently cited terms in paper titles. The darker 5 color indicates a higher percentage of usage while the lighter color indicates a lower percentage. 6 It is not surprising to see some terms remain popular over the seven years such as 7 models/modeling, evaluation/evaluating and traffic, and usages of some terms do change over 8 time such as urban, pavement, concrete, and safety. Particularly, pavement related research 9

shows a clear decline in the recent years. 10

11 FIGURE 2. Most frequently used terms in paper titles. 12

13

14 FIGURE 3. Heatmap of most frequently used terms in paper titles. 15

Page 8: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

8

Figure 4 illustrates the most frequently cited terms found in the combined corpus from 1 the sampled paper abstracts. The top five most frequently cited terms are: 2 model/models/modeling, data/dataset/database, traffic, travel/travels/travelers, and 3 vehicle/vehicles/vehicular. Figure 5 shows the heatmap of the yearly weightage of the top 20 4

most frequently cited terms in the randomly sampled paper abstracts. Usages of some terms do 5 change over time such as transit, information, asphalt, and crash. 6

7

8 FIGURE 4. Most frequently used terms in paper abstracts. 9

10

11 FIGURE 5. Heatmap of most frequently used terms in paper abstracts. 12

Page 9: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

9

There are slight differences in most cited terms between the groups of paper titles and 1 abstracts. For example, comparing with the group of paper titles, data and anything related to 2 data (dataset, database and etc.) is ranked number two in the group of abstracts, pushing the 3 word traffic into number three. Model or modeling is still the most frequently used word in both 4

groups. This is easily interpreted as most paper abstracts contain a brief introduction of the used 5 data. 6

Wordcloud is another way of visualizing the most frequent terms in unstructured 7 documents. In performing a general wordcloud, this study developed comparison wordclouds to 8 visualize the research trends over the years. If pa,b is the rate at which word a occurs in document 9

b, and pb is the average across documents (b

bap , /n), where n is the number of documents. In 10

comparison clouds, the size of each word is mapped to its maximum deviation (maxa(pa,b − pb)), 11 and its angular position is determined by the document where that maximum occurs. For this 12

particular analysis, we excluded TRB 2008 papers for better visualization. Three groups were 13 formed: TRB papers in 2009-2010, TRB papers in 2011-2012, and TRB papers 2013-2014. 14 Figure 6 illustrates comparison clouds for TRB paper titles from 2009 to 2014 starting with 15 TRB09-10 papers and working forward, where size of the word indicates the usage (bigger size 16

words infer more frequent usage). Figure 7 shows comparison clouds for paper review 17 committees. 18

19

20

21 FIGURE 6. Comparison wordclouds of paper title terms. 22

23 It seems that the popular word or words may vary year by year. For example, the most 24 significant terms are planning, pavement, and modeling for TRB 2009-2010, TRB 2011-2012, 25

and TRB 2013-2014, respectively. The newer research trend in social media is visible by the 26 term social in TRB 2013-2014 papers. In Figure 6, the most significant terms are bituminous, 27

management, and characteristics for TRB 2009-2010, TRB 2011-2012, and TRB 2013-2014, 28 respectively. 29

Page 10: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

10

1

2 FIGURE 7. Comparison wordclouds of review committees. 3

4 Table 4 shows the correlation between the words of the top four most frequently cited 5

terms in both paper title and sampled abstracts. A more detailed dataset is provided in a webpage 6

dedicated for interested readers [23]. 7 8 TABLE 4. Correlations Between the Terms 9 10

Paper Titles Sampled Abstracts

Model Model

Development 1 Capacity 0.99

Vehicles 1 Effects 0.99

Dynamic 0.98 Results 0.99

Traffic Traffic

Evaluating 1 Differences 1

Analysis 0.99 Quantified 1

Performance 0.99 Identified 1

Improved 0.99

Analysis Analysis

Evaluating 1 Bridge 0.99

Traffic 0.99 Ensure 0.99

Networks 0.98

Pavement Pavement

Preservation 0.98 Increases 0.99

Deformation 0.98 Possible 0.99

Overlays 0.97

Sensors 0.96

Page 11: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

11

It is not surprising to see the word model is completely associated with development, or 1 that the word traffic always appears with evaluating in paper titles. In sampled abstracts, the 2 word model is highly correlated with capacity, effects and results. 3

4 Topic Modeling 5 Latent Dirichlet allocation (LDA) 6 Latent Dirichlet allocation (LDA) is an autonomous way of discovering topics in unstructured 7 documents. Several authors have used LDA in topic model development. The study by Blei et al. 8 is excellent in describing the theoretical development of LDA [24]. 9

It is assumed that documents are represented as random mixtures over latent topics, 10

where each topic is characterized by a distribution over words. We can consider a document as a 11

sequence of N words denoted by t = (t1, t2, ..., tN), where tn is the nth word in the sequence. A 12 word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., 13 V}. We represent words using unit-basis vectors that have a single component equal to all other 14 components equal to zero. Thus, using superscripts to denote components, the v-th word in the 15

vocabulary is represented by a V-vector w such that tv = 1 and t

u = 0 for u≠v. A corpus is a 16

collection of M documents denoted by T = {t1, t2, ..., tM}. 17

For this study we developed an LDA model application with the following steps for each 18 document w in a corpus T: 19

1. Consider the number of the multinomial, N ~ Poisson(φ) 20

2. Consider the parameter of class distribution, θ ~ Dirichlet(α). Here, α is the parameter of 21

Dirichlet Distribution (DD) over the hidden classes. 22 3. For each of N words tn: 23

- Consider a topic zn ∼ Multinomial(θ). 24 - Choose a word tn from p(tn |zn), a multinomial probability conditioned on the topic zn. 25 Several assumptions were made in the study: First the dimensionality k of the DD is 26

assumed as known and fixed, and then the word probabilities are parameterized by a k ×V matrix 27 β where βij = p(tj = 1|zi = 1) that is treated as a fixed quantity needing to be estimated. N is 28

independent of all the other data generating variables (θ and z). Thus, it is an ancillary variable 29 and randomness is not further taken care of. 30

A k-dimensional Dirichlet random variable θ for a specific α can be written as: 31

32

11

1

1

1 ....

)(

)(

)|( 1

k

kk

i

i

k

i

i

P

(1) 33

Here, the parameter α is a k-vector with components αi > 0, and where )(x is the 34

Gamma function. Given the parameters α and β, the joint distribution of a latent class mixture θ, 35

a set of N latent classes z, and a set of N features t is given by: 36

N

n

nnn ztPzPPtzP1

),|()|()|(),|,,( (2) 37

Here, p(zn|θ) is θ for each unique i such that i

nz equals to 1. After, integrating over θ and 38

summing over z: 39

Page 12: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

12

dztPzPPtPN

n z

nnn

n

1

),|()|()|(),|( (3) 1

After taking the product of the marginal probabilities of single documents, we finally get 2

the probability of a corpus: 3

M

t

t

N

n z

tntntntt dztPzPPTPt

nt1 1

),|()|()|(),|( (4) 4

For developing the topic models for two different groups (titles and abstracts) of text 5

documents, we used open source R software package topicmodels [25]. Here all of the TRB 6

papers were considered as a single document per group. Topic extraction from text corpus is the 7

fundamental of many topic analysis tasks. In this analysis, latent topic models are extended to 8

find the underlying structure of time series in an unsupervised manner (Figure 9 and Figure 11). 9

LDA using bag-of-patterns representation automatically discovered the clusters of topics that are 10

in the unstructured form in the document groups. The generated topics from both of the 11

document groups are illustrated in Figure 8-11. 12

13

FIGURE 8. Top six topics from paper titles. 14

As shown in Figure 8, topic 1 includes terms like traffic, systems, models, and data. It 15

clearly indicates traffic data modeling. The second topic infers performance modeling. The third 16

topic indicates research related to pavement analysis. Topic 4 indicates travel data modeling 17

research. Topic 5 and Topic 6 indicate research on traffic design and traffic application analysis, 18

respectively. The trend of the topics over the years is shown in Figure 9. 19

20

Page 13: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

13

1

FIGURE 9. Trend of top six topics for paper titles over the years. 2

As shown in Figure 10, topic 1 includes behavioral research on drivers. The second topic 3

and third topic covers research on traffic crash data analysis, and traffic network modeling 4

research, respectively. The fourth topic indicates research on project management. The fifth topic 5

indicates research related to non-motorized mobility options. The sixth topic infers pavement 6

performance analysis. Topic 7 and topic 8 indicate research on environmental impact, and modal 7

analysis, respectively. The trend of these eight topics over the years is shown in Figure 11. 8

9

FIGURE 10. Top eight topics from paper abstracts. 10

Page 14: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

14

1 FIGURE 11. Trend of top eight topics for paper abstracts over the years. 2

3

CONCLUSION 4 5 By exploring text mining applications in research papers published by TRB Annual Meetings, 6 the most comprehensive transportation conference in the world, this research took advantage of 7

rapidly advancing data analysis techniques to examine the research trend and possible emerging 8 knowledge in the area of transportation. The preliminary results have clearly shown that 9

transportation research is truly a dynamic area. The top topics from the paper abstracts show the 10 recent prominence in behavioral research and safety prediction models. As human society moves 11

forward, the critical issues change over time from emphasizing mobility only to a more inclusive 12

perspective for problem solutions. The research trend from the current study will help the TRB 13 community to identify the exploration of transportation related research ideas over time. 14

The current research should be expanded in scope and content. A larger historical dataset 15

containing all TRB papers in digital format should be analyzed to examine the research trends 16 and unseen connections as text mining and topic modeling are becoming ever more efficient with 17 increased computing power. TRB publications are a robust laboratory for such research. 18 Many potential improvements have been identified by the research team; such as developing 19 better interactive visualization on the research findings. The researchers are currently developing 20

an interactive web tool on the findings of this research. 21

This exploratory research demonstrates the feasibility of using the text mining and topic 22

modeling techniques for synthesis study with massive historical textual data. After performing 23 the initial analysis, a more detailed framework for further studies is in progress. 24 25 26

Page 15: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

15

ACKNOWLEDGEMENTS 1 2 TRB Technical Activities Division Director Mr. Mark R. Norman, and TRB Program Officer Michael 3 DeCarmine provided the data in spreadsheet form containing information of TRB Annual Meeting 4 Compendium papers to us. Their interest in our idea and support is deeply appreciated. 5 6 REFERENCES 7

8 1. Berry, M., and Castellanos, M. Survey of Text Mining: Clustering, Classification, and 9 Retrieval. Springer, New York. 2004. 10 2. Berry, M., and Castellanos, M. Survey of Text Mining II. Springer, New York. 2008. 11

3. Pennacchiotti, M., and Gurumurthy, S. Investigating topic models for social media user 12 recommendation. In Proceedings of the 20th international conference companion on World Wide 13 Web WWW ’11, pp. 101–102, New York, 2011. 14

4. Lin, C., and He, Y. Joint sentiment/topic model for sentiment analysis. Proceedings of the 18th 15 ACM conference on Information and knowledge management CIKM ’09, pp. 375–384, New 16 York, 2009. 17 5. Duan, J., and Zeng, J. Web objectionable text content detection using topic modeling 18

technique. Expert Systems with Applications, 40, pp. 6094–6104, 2013. 19 6. Martinez-Romo, J., and Araujo, L. Detecting malicious tweets in trending topics using a 20

statistical analysis of language. Expert Systems with Applications, 40, pp. 2992–3000, 2013. 21

7. Waters, R. and Jamal, J. Tweet, tweet, tweet: A content analysis of nonprofit organizations’ 22

Twitter updates. Public Relations Review 37, pp. 321– 324, 2011. 23 8. Lee, C. Unsupervised and supervised learning to evaluate event relatedness based on content 24 mining from social-media streams. Expert Systems with Applications, 39, 13338–13356, 2012. 25

9. Panagiotopoulos , P., Bigdeli, A., and Sams, S. Citizen–government collaboration on social 26 media: The case of Twitter in the 2011 riots in England. Government Information Quarterly, in 27

press, 2014. 28 10. Sobaci, M., and Karkin, N. The use of twitter by mayors in Turkey: Tweets for better public 29 services? Government Information Quarterly, Vol. 30, pp. 417–425, 2013. 30

11. Chatfield, A., Scholl, H., and Brajawidagda, U. Tsunami early warnings via Twitter in 31 government: Net-savvy citizens' co-production of time-critical public information services. 32

Government Information Quarterly, Vol. 30, pp. 377–386, 2013. 33 12. Hong, S. Online news on Twitter: Newspapers’ social media adoption and their online 34

readership. Information Economics and Policy, Vol. 24, pp. 69–74, 2012. 35 13. Bollen, J., Mao, H., and Zeng, X. Twitter mood predicts the stock market. Journal of 36 Computer Science, Vol. 2 (1), pp. 1–8, 2011. 37 14. Sakaki, T., Okazaki, M., and Matsuo, Y. Earthquake shakes twitter users: real-time event 38 detection by social sensors. In: Proceedings of the 19th International Conference on World Wide 39

Web. WWW’10. ACM, New York, pp. 851–860, 2010. 40 15. Culotta, A. Towards detecting influenza epidemics by analyzing twitter messages. 41 Proceedings of the First Workshop on Social Media Analytics.SOMA’10. ACM, New York, pp. 42

115–122, 2010. 43 16. Borondo, J., Morales, A., Losada, J., and Benito, R. Characterizing and modeling an electoral 44

campaign in the context of twitter: 2011 Spanish presidential election as a case study. Chaos, 45 Vol. 22 (2), 2012. 46 17. Blei, D. Probabilistic Topic Models. Communications of the ACM. Vol. 55, No. 4. 2012. 47

Page 16: Text Mining and Topic Modeling on Compendium …docs.trb.org/prp/16-3009.pdf1 Text Mining and Topic Modeling on Compendium Papers from 2 Transportation Research Board Annual Meetings

16

18. Hall, D., Jurafsky, D., and Manning, C., Studying the History of Ideas Using Topic Models. 1 Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2 October, 2008. 3 19. Paul, M., and Girju, R. Topic Modeling of Research Fields: An Interdisciplinary Perspective. 4

International Conference RANLP, Borovets, Bulgaria, pp. 337–342, 2009. 5 20. Cui, M., Liang, Y., Li, Y., and Guan, R. Exploring Trends of Cancer Research Based on 6 Topic Model. 1st International Workshop on Semantic Technologies (IWOST), Changchun, 7 China, March 9-12, 2015. 8 21. Bibliography of social media research. http://subasish.github.io/pages/TRB2016/textm.html. 9

Last retrieved on July 17, 2015. 10

22. Bibliography of topic mining research. 11

http://subasish.github.io/pages/TRB2016/topicm.html. Last retrieved on July 17, 2015. 12 23. Webpage on Topic Modeling Paper. 13 http://subasish.github.io/pages/TRB2016/topic_pap.html. Last retrieved on July 17, 2015. 14 24. Blei, D., Ng, A., and Jordan, M. J Latent Dirichlet Allocation. Journal of Machine Learning 15

Research. Vol. 3, pp. 993-1022, 2003. 16 25. Grun. B., and Hornik, K. topicmodels: An R Package for fitting Topic Models. Journal of 17

Statistical Software. Vol. 40 (13), pp.1-30, 2011. 18