Indian Cities Ranking Based on Twitter Feeds Using Advanced Analytics
INTRODUCTION
With the development of new smart technologies, the world is going digital. The growing reach of the web and the large amount of electronic data accumulating on it have prompted efforts to uncover the information hidden in its text content. Finding precise, relevant information and extracting it from the web has become a time-consuming task. Many techniques are used for web information extraction, and text mining is one of them. Twitter is one of the best-known social platforms, with 316 M users worldwide and 500 M tweets sent per day. In India, Twitter has 22.2 M users (source: http://www.huffingtonpost.in). With such a vast amount of data available, Twitter has been used as a source of unstructured data for a variety of data analytics work.
Every year, well-known magazines publish "most livable cities in the world" lists, and each city wants to be the most livable in order to attract business and investment and to boost its local economy and real estate market. Here I focus on text mining techniques, the k-means algorithm and classification to identify livable cities in India by categorizing tweets and their sentiments against different criteria, including social and economic circumstances for residents, public health, infrastructure, and the ease and availability of local transport.
WHAT ARE THE PRIMARY OBJECTIVE, SCOPE AND LIMITATION?
The research question: 'How livable is a city, based on the comments and views on Twitter, with the use of text mining?'
Objective: to provide a dynamic algorithm which can label the Twitter feeds and reduce the complexity of ranking a city in different categories based on Twitter views.
Scope and limitations:
The scope of this project is the Twitter views on Indian cities.
The limitation of the dataset is that the feeds are from a single day, '25/08/2015'.
THE FLOW: THE SEMMA APPROACH
1. SAMPLE: Extract of Twitter feeds; .json file format; converted to .csv
2. EXPLORE: Removal of duplicate texts; language: English; Pareto of top cities
3. MODIFY: Loading the corpus; DTM creation; stop-word removal; tokenization
4. MODEL: Loading the corpus; term analysis; K-means clustering; labelling based on the clusters; classification of tweets based on labels; result
5. ASSESS: Results exploration; city ranking results; conclusions
What Are the Data Attributes?
After conversion from the .json file format to .csv, the dataset has 93762 records with 24 attributes related to each Twitter feed.
From the list of 24 attributes, 11 attributes were selected to proceed with the project.
S.no  Attribute Name      S.no  Attribute Name
1     links               13    user_name
2     text                14    sentiment_type
3     topics              15    reach
4     application_rating  16    user_city
5     application_store   17    user_language
6     created_time        18    device
7     city                19    application_version
8     user_id             20    keyword
9     sentiment           21    language
10    application         22    country
11    engagement          23    uri
12    source              24    user_country

S.no  Attributes Selected  S.no  Attributes Selected
1     text                 6     keyword
2     created_time         7     language
3     sentiment            8     application
4     sentiment_type       9     user_country
5     reach                10    device
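The Sample and Explore steps (conversion from .json to .csv, removal of duplicate texts, English-only filter) can be sketched roughly as below. This is only an illustration under assumptions: it assumes one JSON object per line, that English feeds carry the language code "en", and it keeps a small hypothetical subset of the attributes; none of this is confirmed by the slides.

```python
import csv
import io
import json

# Hypothetical subset of the attribute names listed above (assumption).
SELECTED = ["text", "created_time", "sentiment", "language", "city"]

def json_to_rows(json_lines):
    """Parse one JSON object per line, keep English feeds, drop duplicate texts."""
    seen = set()
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: English feeds are marked with the code "en".
        if record.get("language") != "en":
            continue
        # Duplicates are detected on the normalised tweet text.
        key = record.get("text", "").strip().lower()
        if key in seen:
            continue
        seen.add(key)
        rows.append({name: record.get(name, "") for name in SELECTED})
    return rows

def rows_to_csv(rows):
    """Serialise the cleaned rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SELECTED, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample feeds, for illustration only.
sample = [
    '{"text": "Mumbai traffic is bad", "language": "en", "city": "Mumbai"}',
    '{"text": "mumbai traffic is bad ", "language": "en", "city": "Mumbai"}',
    '{"text": "Dilli ki sadak", "language": "hi", "city": "New Delhi"}',
    '{"text": "Jobs in Gurgaon", "language": "en", "city": "Gurgaon"}',
]
rows = json_to_rows(sample)
csv_text = rows_to_csv(rows)
```

Here the second feed is dropped as a duplicate of the first and the third is dropped by the language filter, leaving two rows.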
What Did the Explore Stage Yield?
After cleaning up the data, a Pareto analysis was applied. The resulting top 30 cities contributed 95% of the volume.
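A Pareto cut of this kind can be sketched as follows: count feeds per city, sort in descending order, and keep cities until the cumulative share reaches the threshold. The function name, the 0.95 default and the sample counts are illustrative assumptions, not the project's actual code or data.

```python
from collections import Counter

def pareto_cities(cities, threshold=0.95):
    """Return the most frequent cities that together reach the given share of volume."""
    counts = Counter(cities)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    for city, count in counts.most_common():
        # Stop once the cities already selected cover the threshold.
        if cumulative / total >= threshold:
            break
        selected.append(city)
        cumulative += count
    return selected

# Made-up city column, for illustration only.
feed_cities = (["New Delhi"] * 50 + ["Mumbai"] * 30 + ["Bangalore"] * 16
               + ["Salem"] * 2 + ["Pune"] * 2)
top_cities = pareto_cities(feed_cities)
```

With these sample counts the first three cities cover 96 of 100 feeds, so the remaining cities are cut.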
Further exploration (after data processing) found the top terms with their frequencies.
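The Modify steps (tokenization, stop-word removal, DTM creation) and the term frequencies can be illustrated with a small standard-library sketch. The stop-word list, the regular-expression tokenizer and the sample tweets are assumptions made for the example; the slides do not say which tools or word lists were used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"the", "is", "in", "a", "of", "and", "to", "for"}

def tokenize(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_dtm(documents):
    """Build a document-term matrix: one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Made-up tweets, for illustration only.
tweets = ["Traffic in Mumbai is bad", "Mumbai traffic jam", "Jobs in Gurgaon"]
vocabulary, matrix = build_dtm(tweets)

# Term frequency = column sum of the document-term matrix.
term_frequency = {term: sum(row[i] for row in matrix)
                  for i, term in enumerate(vocabulary)}
```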
Model Phase Results
There are various clustering methods, and K-means is one of the most efficient. Given a set of n data points, the K-means algorithm partitions them into k clusters, each characterized by a unique centroid (mean). The elements of a cluster are close to that cluster's centroid and dissimilar to the elements of the other clusters. Clustering was done to identify clusters of terms, which in turn enabled us to label the data.
This K-means clustering exercise, using Euclidean distance, produced 25 clusters (best fit).
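A minimal K-means sketch with Euclidean distance is given below to make the assign-and-update loop concrete. It is a toy under stated assumptions: the initial centroids are simply the first k points (a deterministic choice made for the example), k is 2 rather than 25, and the term vectors are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    """Plain K-means; the first k points serve as initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up term vectors (one row per term, one column per tweet).
terms = ["traffic", "job", "road", "salary", "metro", "office"]
vectors = [
    [1, 1, 0, 0],  # traffic
    [0, 0, 1, 1],  # job
    [1, 0, 0, 0],  # road
    [0, 0, 0, 1],  # salary
    [0, 1, 0, 0],  # metro
    [0, 0, 1, 0],  # office
]
assignment, centroids = kmeans(vectors, k=2)
clusters = {c: [t for t, a in zip(terms, assignment) if a == c] for c in range(2)}
```

On this toy input the transport-related terms and the job-related terms end up in separate clusters, which is the kind of grouping a reviewer could then label by hand.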
The 25 clusters, each with its list of terms, are extracted to a .csv file.
The terms in each cluster are reviewed manually and a label is given to each cluster. With this step, 6 labels for the clusters are identified:
LifeStyle
Law & Order
Infrastructure
Education
Crime
Career
[Figure: number of records by Class (Career, Crime, Education, Environment & Health, Infrastructure, Law & Order, LifeStyle); the view excludes Others.]
Once the labels are assigned to each cluster, all the labels (with their terms) are separated into individual text documents for the purpose of classification. A number is assigned to each document and a data frame is created. A union of the 2 lists (the label lists and the feed lists) is done. After the union, each tweet is bound to a label. The result is that every tweet is labelled with the category identified.
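One simple way to picture this labelling step is term overlap: a tweet receives the label whose term list shares the most words with it, and falls back to 'Others' when nothing matches. This is a stand-in technique chosen for illustration, not necessarily the union-and-bind procedure used in the project, and the term lists below are hypothetical.

```python
import re

# Hypothetical term lists per label (assumption, not the project's clusters).
LABEL_TERMS = {
    "Career": {"job", "salary", "hiring"},
    "Infrastructure": {"road", "metro", "traffic"},
    "Crime": {"theft", "murder"},
}

def label_tweet(text, label_terms=LABEL_TERMS):
    """Return the label with the largest term overlap, or 'Others' if none match."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_label, best_hits = "Others", 0
    for label, label_words in label_terms.items():
        hits = len(tokens & label_words)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label

# Made-up tweets, for illustration only.
tweets = [
    "New metro and road work in Pune",
    "Hiring for a job in Gurgaon",
    "Good morning Chennai",
]
labelled = [(tweet, label_tweet(tweet)) for tweet in tweets]

# Tweets that match no label are excluded from the ranking step.
kept = [(tweet, label) for tweet, label in labelled if label != "Others"]
```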
Conclusions: the final results
The resulting extract has each tweet labelled. There are 25180 records in the final extract. A few tweets were labelled 'Others' because they did not bind with any of the labels. The 'Others' label is excluded. After this step, 25140 records are left.
Count of labelled tweets by city: the top city is New Delhi, followed by Mumbai and Bangalore.
It can be seen that the maximum number of tweets were on Lifestyle, Career and Infrastructure, and that they are mostly neutral in nature.
The tree map above depicts the label versus the user reach. The highest is for the label 'Lifestyle' and the lowest is for 'Law & Order'.
Out of the obtained set of cities, ranking was performed for the top 10 cities based on the Pareto rule.
An analysis of city versus sentiment score and category was performed. The outputs below explain the ranking of the cities:
New Delhi has the highest positive and negative scores.
Overall, 'GURGAON' is the most livable city and 'AHMEDABAD' is the least livable.
CITY       CAREER  CRIME   EDUCATION  ENVIRONMENT & HEALTH  INFRASTRUCTURE  LAW & ORDER  LIFESTYLE
Ahmedabad  -2.08   -15.19  2.33       -2.13                 -3.00           -26.81       12.19
Bangalore  71.85   -14.39  8.09       -0.48                 -1.37           -6.63        82.81
Chennai    43.00   -12.56  1.82       8.00                  8.24            0.33         56.50
Gurgaon    162.71  -1.83   4.14       9.55                  35.49           -1.13        60.90
Hyderabad  25.47   -18.52  11.31      1.95                  -6.61           0.85         20.05
Jaipur     2.09    -13.18  -1.60      1.33                  2.15            3.38         40.54
Mumbai     40.02   -60.49  11.44      14.43                 15.89           -10.01       54.42
New Delhi  114.23  -90.98  9.09       17.83                 47.63           -3.44        127.07
Pune       17.89   -4.00   6.93       4.38                  8.25            -0.65        46.63
Salem      14.25   -3.01   1.75       2.76                  2.73            -5.30        82.86
RANKING OF CITIES BY LABEL
The table above gives a snapshot of the scores of each city by label.
For CAREER – Gurgaon scores highest and Ahmedabad scores lowest
For CRIME – New Delhi is highly prone to crime whereas Gurgaon is least prone
For EDUCATION – Mumbai and Hyderabad score highest whereas Jaipur scores lowest
For ENVIRONMENT & HEALTH – New Delhi scores highest and Ahmedabad scores lowest
For INFRASTRUCTURE – New Delhi scores highest whereas Ahmedabad and Hyderabad score lowest
For LAW & ORDER – Jaipur is highest on Law & Order whereas Ahmedabad is the lowest
For LIFESTYLE – New Delhi is the most talked about for Lifestyle and Ahmedabad the least
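The per-label readings above can be checked directly against the score table with a short sketch. The per-label extremes use only the table values. The overall ordering is a different matter: the slides do not say how the overall result was derived, so summing the seven label scores per city is purely an assumption made here, although it happens to place Gurgaon first and Ahmedabad last.

```python
LABELS = ["Career", "Crime", "Education", "Environment & Health",
          "Infrastructure", "Law & Order", "Lifestyle"]

# Scores copied from the table above, in the order given by LABELS.
SCORES = {
    "Ahmedabad": [-2.08, -15.19, 2.33, -2.13, -3.00, -26.81, 12.19],
    "Bangalore": [71.85, -14.39, 8.09, -0.48, -1.37, -6.63, 82.81],
    "Chennai":   [43.00, -12.56, 1.82, 8.00, 8.24, 0.33, 56.50],
    "Gurgaon":   [162.71, -1.83, 4.14, 9.55, 35.49, -1.13, 60.90],
    "Hyderabad": [25.47, -18.52, 11.31, 1.95, -6.61, 0.85, 20.05],
    "Jaipur":    [2.09, -13.18, -1.60, 1.33, 2.15, 3.38, 40.54],
    "Mumbai":    [40.02, -60.49, 11.44, 14.43, 15.89, -10.01, 54.42],
    "New Delhi": [114.23, -90.98, 9.09, 17.83, 47.63, -3.44, 127.07],
    "Pune":      [17.89, -4.00, 6.93, 4.38, 8.25, -0.65, 46.63],
    "Salem":     [14.25, -3.01, 1.75, 2.76, 2.73, -5.30, 82.86],
}

def extremes(label):
    """Return (highest-scoring city, lowest-scoring city) for one label."""
    column = LABELS.index(label)
    ordered = sorted(SCORES, key=lambda city: SCORES[city][column])
    return ordered[-1], ordered[0]

# Assumption: overall score = plain sum of the seven label scores.
overall = sorted(SCORES, key=lambda city: sum(SCORES[city]), reverse=True)
```

For Crime the scores are negative, so the lowest score (New Delhi) corresponds to the city most prone to crime and the highest (Gurgaon) to the least prone.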