Indian Cities Ranking Based on Twitter Feeds Using Advanced Analytics

INTRODUCTION

With the development of new smart technologies, the world is going digital. The growing reach of the web and the large volume of electronic data accumulating across it have prompted efforts to uncover the information hidden in text content. Finding precise, relevant information and extracting it from the web has become a time-consuming task. Many techniques are used for web information extraction, and text mining is one of them. Twitter is one of the best-known social platforms, with 316 M users worldwide and 500 M tweets sent per day. In India, Twitter has 22.2 M users (source: http://www.huffingtonpost.in). With such a vast amount of data available, Twitter has been used as a source of unstructured data for a variety of data analytics work.

Every year, well-known magazines publish a "most livable cities in the world" list, and each city wants to be the most livable in order to attract business and investment and to boost its local economy and real estate market. Here I focus on text mining techniques, the k-means algorithm, and classification to identify livable cities in India by categorizing tweets and their sentiments using criteria that include social and economic circumstances for residents, public health, infrastructure, and the ease and availability of local transport.

WHAT ARE THE PRIMARY OBJECTIVE, SCOPE AND LIMITATIONS?

The research question: ‘How livable is a city, based on the comments and views on Twitter, with the use of text mining?’

Objective: To provide a dynamic algorithm which can label the Twitter feeds and reduce the complexity of ranking a city in different categories based on Twitter views.

Scope and Limitations:

The scope of this project is the Twitter views on Indian cities.

The limitation of the dataset is that the feeds are from a single day, 25/08/2015.

THE FLOW: THE SEMMA APPROACH

1. SAMPLE: Extract of Twitter feeds in .json file format, converted to .csv

2. EXPLORE: Removal of duplicate texts; language filter (English); Pareto of top cities

3. MODIFY: Loading the corpus; tokenization; stop-word removal; DTM creation

4. MODEL: Loading the corpus; term analysis; K-means clustering; labelling based on the clusters; classification of tweets based on labels; result

5. ASSESS: Results exploration; city ranking results; conclusions

What are the Data Attributes?

The dataset obtained after conversion from .JSON file format to .CSV has 93762 records with 24 attributes related to each Twitter feed.

From the list of 24 attributes, 11 attributes were selected to proceed with the project.

S.no  Attribute Name        S.no  Attribute Name
1     links                 13    user_name
2     text                  14    sentiment_type
3     topics                15    reach
4     application_rating    16    user_city
5     application_store     17    user_language
6     created_time          18    device
7     city                  19    application_version
8     user_id               20    keyword
9     sentiment             21    language
10    application           22    country
11    engagement            23    uri
12    source                24    user_country

S.no  Attributes Selected   S.no  Attributes Selected
1     text                  6     keyword
2     created_time          7     language
3     sentiment             8     application
4     sentiment_type        9     user_country
5     reach                 10    device
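The conversion and attribute selection described above could be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes the .json extract is an array of flat records keyed by the attribute names in the table, and the two sample records are hypothetical.

```python
import csv
import io
import json

# The ten attribute names listed in the selection table.
SELECTED = ["text", "created_time", "sentiment", "sentiment_type", "reach",
            "keyword", "language", "application", "user_country", "device"]

def json_to_csv(json_text, columns=SELECTED):
    """Convert a JSON array of feed records to CSV text, keeping only `columns`."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Attributes missing from a record become empty cells.
        writer.writerow({c: rec.get(c, "") for c in columns})
    return out.getvalue()

# Hypothetical two-record sample (assumed layout, not real data).
sample = json.dumps([
    {"text": "Metro in Delhi is great", "city": "New Delhi",
     "sentiment": 1, "language": "en"},
    {"text": "Traffic jam again", "city": "Bangalore",
     "sentiment": -1, "language": "en"},
])
csv_text = json_to_csv(sample)
print(csv_text)
```

Attributes outside the selected list (here, city) are simply dropped from the output.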

What did the Explore Stage Yield?

After cleaning up the data, a Pareto analysis was applied: the top 30 cities contributed 95% of the volume.
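A rough sketch of the cleaning and Pareto step, under stated assumptions: records are dictionaries with text, language and city fields, English is coded as "en", and the sample counts are invented for illustration. The 95% threshold is the one quoted above.

```python
from collections import Counter

def clean(records):
    """Keep English records and drop repeated tweet texts (first one wins)."""
    seen, kept = set(), []
    for rec in records:
        if rec["language"] != "en" or rec["text"] in seen:
            continue
        seen.add(rec["text"])
        kept.append(rec)
    return kept

def pareto_top(cities, share=0.95):
    """Return the most frequent cities that together cover `share` of records."""
    counts = Counter(cities)
    total = sum(counts.values())
    selected, covered = [], 0
    for city, n in counts.most_common():
        if covered >= share * total:
            break
        selected.append(city)
        covered += n
    return selected, covered / total

# Hypothetical volumes per city (not the project's data).
volumes = [("New Delhi", 50), ("Mumbai", 30), ("Bangalore", 16),
           ("Salem", 2), ("Pune", 2)]
records = [{"text": f"{city} tweet {i}", "language": "en", "city": city}
           for city, n in volumes for i in range(n)]
records.append(dict(records[0]))  # a duplicate text
records.append({"text": "namaste", "language": "hi", "city": "Pune"})

cleaned = clean(records)
top, coverage = pareto_top([r["city"] for r in cleaned])
print(len(cleaned), top, coverage)
```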

On further exploration (after data processing), the top terms and their frequencies were found.
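The term frequencies come out of the tokenization, stop-word removal and DTM steps of the flow. A minimal sketch of those steps is below; the stop-word list and the three tweets are illustrative assumptions, and the original tooling is not stated in the deck.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "for", "on"}

def tokenize(text):
    """Lower-case a tweet, split into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    counted = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counted))
    return vocab, [[c[t] for t in vocab] for c in counted]

# Hypothetical tweets.
tweets = [
    "Traffic in the city is bad",
    "Jobs and career growth in the city",
    "The metro eases traffic",
]
vocab, dtm = document_term_matrix(tweets)

# Column sums of the DTM give the overall frequency of each term.
term_freq = Counter(dict(zip(vocab, map(sum, zip(*dtm)))))
print(term_freq.most_common(2))
```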

Model Phase Results

There are various methods of clustering, and K-means is one of the most efficient. Given a set of n data points, the K-means algorithm partitions them into k clusters, each characterized by a unique centroid (mean). The elements belonging to one cluster are close to the centroid of that cluster and dissimilar to the elements belonging to the other clusters. Clustering was done to identify clusters of terms, which in turn enabled us to label the data.

With this K-means clustering exercise using Euclidean distance, 25 clusters (best fit) were obtained.
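The assignment and update steps of K-means with Euclidean distance can be sketched in plain Python. This is a generic illustration of the algorithm (Lloyd's iteration), not the project's code; the term vectors, k=2 and the seed are hypothetical, whereas the project settled on 25 clusters.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with Euclidean distance.

    Returns (centroids, assignments), where assignments[i] is the
    cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assignments

# Hypothetical term vectors: counts of each term across three documents.
terms = {"traffic": (3, 0, 1), "metro": (2, 0, 1),
         "jobs": (0, 4, 0), "career": (0, 3, 1)}
centroids, assignments = kmeans(list(terms.values()), k=2)
clusters = dict(zip(terms, assignments))
print(clusters)
```

Terms that occur in the same documents end up in the same cluster, which is what makes the later manual labelling of clusters possible.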

Model Phase Results (Contd.)

The 25 clusters, each with its list of terms, are extracted to a .csv file.

The terms in each cluster are reviewed manually and a label is given to each cluster.

With this step, 6 labels for the clusters are identified:

LifeStyle

Law & Order

Infrastructure

Education

Crime

Career

[Figure: chart of labelled tweets by class. Color shows the class and size shows the sum of number of records; ‘Others’ is excluded. Classes shown: Career, Crime, Education, Environment & Health, Infrastructure, Law & Order, LifeStyle.]

Once the labels are assigned to each cluster, all the labels (with their terms) are separated into individual text documents for the purpose of classification. A number is assigned to each document and a data frame is created. A union of the 2 lists (the label lists and the feed lists) is done. After the union, each tweet is bound to a label. The result is that every tweet is labelled with the category identified.
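One simple way to picture the binding of tweets to labels is a term-overlap rule: a tweet takes the label whose term list it shares the most words with, and falls into ‘Others’ when nothing matches. This is a swapped-in simplification for illustration, not the union-and-bind procedure itself, and the term lists below are hypothetical.

```python
import re

# Hypothetical label term lists; the real ones came from the reviewed clusters.
LABEL_TERMS = {
    "Career": {"job", "jobs", "career", "hiring"},
    "Infrastructure": {"metro", "road", "roads", "traffic"},
    "LifeStyle": {"food", "movie", "shopping"},
}

def label_tweet(text, label_terms=LABEL_TERMS):
    """Return the label with the largest term overlap, else 'Others'."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_label, best_hits = "Others", 0
    for label, terms in label_terms.items():
        hits = len(tokens & terms)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label

# Hypothetical tweets.
tweets = [
    "Hiring for new jobs in Pune",
    "Metro work blocks the road",
    "Nice weather today",
]
labelled = [(t, label_tweet(t)) for t in tweets]

# Tweets that bind to no label are excluded, as in the final extract.
kept = [(t, lab) for t, lab in labelled if lab != "Others"]
print(kept)
```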

Conclusions: The Final Results

The resulting extract has each tweet labelled. There are 25180 records in the final extract.

A few tweets were labelled as ‘Others’ as they did not bind with any of the labels. The ‘Others’ label is excluded.

After the above step, there are 25140 records left.

By city, the count of labelled tweets: the top city is New Delhi, followed by Mumbai and Bangalore.


It can be seen that the maximum number of tweets were on Lifestyle, Career and Infrastructure, and that they are mostly neutral in nature.

The above tree map depicts the label versus the user reach. The highest is for the label ‘Lifestyle’ and the lowest is for ‘Law & Order’.

Conclusions: The Final Results

Out of the obtained set of cities, ranking was performed for the top 10 cities based on the Pareto rule.

An analysis of city versus sentiment score and category is performed. The outputs below explain the ranking of the cities:

New Delhi has the highest positive and negative scores


OVERALL: ‘GURGAON’ is the most livable city and ‘AHMEDABAD’ is the least livable.


CITY        CAREER   CRIME  EDUCATION  ENVIRONMENT & HEALTH  INFRASTRUCTURE  LAW & ORDER  LIFESTYLE
Ahmedabad    -2.08  -15.19       2.33                 -2.13           -3.00       -26.81      12.19
Bangalore    71.85  -14.39       8.09                 -0.48           -1.37        -6.63      82.81
Chennai      43.00  -12.56       1.82                  8.00            8.24         0.33      56.50
Gurgaon     162.71   -1.83       4.14                  9.55           35.49        -1.13      60.90
Hyderabad    25.47  -18.52      11.31                  1.95           -6.61         0.85      20.05
Jaipur        2.09  -13.18      -1.60                  1.33            2.15         3.38      40.54
Mumbai       40.02  -60.49      11.44                 14.43           15.89       -10.01      54.42
New Delhi   114.23  -90.98       9.09                 17.83           47.63        -3.44     127.07
Pune         17.89   -4.00       6.93                  4.38            8.25        -0.65      46.63
Salem        14.25   -3.01       1.75                  2.76            2.73        -5.30      82.86

RANKING OF CITIES BY LABEL

The above table gives a snapshot of the scores of each city by label.

For CAREER: Gurgaon scores highest and Ahmedabad lowest

For CRIME: New Delhi is highly prone to crime whereas Gurgaon is least prone

For EDUCATION: Mumbai and Hyderabad score highest whereas Jaipur scores lowest

For ENVIRONMENT & HEALTH: New Delhi scores highest and Ahmedabad lowest

For INFRASTRUCTURE: New Delhi scores highest, and Ahmedabad and Hyderabad score lowest

For LAW & ORDER: Jaipur scores highest whereas Ahmedabad scores lowest

For LIFESTYLE: New Delhi is the most talked about and Ahmedabad the least
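The per-label readings can be checked directly against the score table. The sketch below copies the table values and picks the highest and lowest city per label. The overall ranking is an assumption: the deck does not state how the overall result was derived, and a plain sum of the seven label scores is used here because it reproduces the stated outcome.

```python
LABELS = ["Career", "Crime", "Education", "Environment & Health",
          "Infrastructure", "Law & Order", "LifeStyle"]

# Scores copied from the table, in the column order of LABELS.
SCORES = {
    "Ahmedabad": [-2.08, -15.19, 2.33, -2.13, -3.00, -26.81, 12.19],
    "Bangalore": [71.85, -14.39, 8.09, -0.48, -1.37, -6.63, 82.81],
    "Chennai":   [43.00, -12.56, 1.82, 8.00, 8.24, 0.33, 56.50],
    "Gurgaon":   [162.71, -1.83, 4.14, 9.55, 35.49, -1.13, 60.90],
    "Hyderabad": [25.47, -18.52, 11.31, 1.95, -6.61, 0.85, 20.05],
    "Jaipur":    [2.09, -13.18, -1.60, 1.33, 2.15, 3.38, 40.54],
    "Mumbai":    [40.02, -60.49, 11.44, 14.43, 15.89, -10.01, 54.42],
    "New Delhi": [114.23, -90.98, 9.09, 17.83, 47.63, -3.44, 127.07],
    "Pune":      [17.89, -4.00, 6.93, 4.38, 8.25, -0.65, 46.63],
    "Salem":     [14.25, -3.01, 1.75, 2.76, 2.73, -5.30, 82.86],
}

def extremes(label):
    """Return (highest-scoring city, lowest-scoring city) for one label."""
    i = LABELS.index(label)
    ranked = sorted(SCORES, key=lambda city: SCORES[city][i])
    return ranked[-1], ranked[0]

# Assumption: overall score = plain sum of the seven label scores.
overall = sorted(SCORES, key=lambda city: sum(SCORES[city]), reverse=True)

for label in LABELS:
    print(label, extremes(label))
print("Overall:", overall[0], "to", overall[-1])
```

For Crime the scores are negative, so the lowest value (New Delhi) marks the city most associated with crime and the highest (Gurgaon) the least.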
