Network Analytics meets Text Mining for Social Media Analysis
Dr. Rosaria Silipo
Social Media Data: Water, Water Everywhere, and not a drop to drink
What companies do with it:
• Download and keep
• Topic [Shift] Detection (email content routing, detect market interest shift, clinical studies, query non-structured DBs, ...)
• Sentiment Analysis (marketing, polls, elections, ...)
• Connection Analysis (influencers, risk analysis, ...)
• ...
Social Media Data: Water, Water Everywhere, and not a drop to drink
The Analysis Tools:
• Web Crawlers
• Visual Exploration
• Topic Detection (Text Mining, NLP, Ontologies)
• Sentiment Score (Text Mining, NLP)
• Influence Score (Network Analytics)
• Find Groups (Predictive Analytics)
Case Study Example: Slashdot Data
Basic Numbers:
• 24532 users
• 491 threads with 15 – 843 responses and 12 – 507 users
• 113505 posts
• 60 main topics
• Selected Topic: Politics
[Figure labels: Post, Comments]
Case Study Example: Slashdot
• Very rich data sources about customers!
• We want to establish:
  – How users feel about the discussed topic (Sentiment Analysis)
  – Whether it matters how users feel (Network Analytics)
  – A more general abstraction of the results (Clustering)
Sentiment Analysis
• Remove anonymous users, group by PostID
• Words Tagging: positive words, negative words (MPQA Corpus)
• BoW, Entity Filter, Word Frequency, Attitude Calculation by Document
• User Bins
• Word cloud for selected users
• Total Attitude by User
Slashdot – Text Mining: Most Negative User pNutz
Slashdot – Text Mining: Most Positive User dada21
Slashdot – Sentiment Analysis
• 16016 positive users
• 7107 negative users
• Most positive user: dada21 (2838 positive/1725 negative words)
• Most negative user: pNutz (43 positive/109 negative words)
• Which topics do positive users have in common?
  – Government
  – People
  – Law/s
  – Money
  – Market
  – Parties
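The lexicon-based scoring described in these slides can be sketched in a few lines of Python. This is only an illustration, not the KNIME workflow itself: the tiny word lists stand in for the MPQA Subjectivity Lexicon, and the attitude formula (positive word count minus negative word count per post, averaged per user) is an assumption, since the slides do not spell it out. The function names are hypothetical.

```python
from collections import defaultdict

# Stand-in word lists (assumption); the talk tags words with the MPQA lexicon.
POSITIVE = {"good", "great", "fair", "support"}
NEGATIVE = {"bad", "corrupt", "fail", "waste"}


def attitude(text):
    """Return (positive_count, negative_count, attitude) for one post.

    Attitude is assumed to be positive count minus negative count.
    No sentence splitting and no negation handling is attempted.
    """
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg, pos - neg


def attitude_by_user(posts):
    """posts: iterable of (user, text). Returns {user: mean attitude per post}."""
    scores = defaultdict(list)
    for user, text in posts:
        scores[user].append(attitude(text)[2])
    return {user: sum(vals) / len(vals) for user, vals in scores.items()}


def user_bin(score):
    """Bin a user as positive, negative, or neutral by the sign of the score."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under the same assumed formula, the word counts reported above would give dada21 a total of 2838 - 1725 = 1113 and pNutz a total of 43 - 109 = -66.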
Network Creation
[Figure: example network connecting User1, User2, User3, User4, User5, User6]
Topic Graphs
Topic Graph: NASA
Topic Graph: Sci-Fi
Hubs & Authorities
• Hubs = Followers
• Authorities = Leaders
Filtering anonymous users and creating network
Centrality index to define hub weight and authority weight
Users with hub and authority weights and other features
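Hub and authority weights are commonly computed with the HITS power iteration. The sketch below is a minimal pure-Python version, offered as an illustration rather than as the centrality node used in the talk. It assumes a directed edge runs from a commenter to the author being answered, which is not stated in the slides; the function name and the example edges are hypothetical.

```python
def hits(edges, iterations=50):
    """Compute hub and authority weights with the HITS power iteration.

    edges: iterable of (source, target) pairs, assumed here to mean
    "source commented on a post by target".
    Returns (hub, authority) dicts, each normalized to unit length.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    def normalize(scores):
        norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # Authority weight: sum of hub weights of the nodes pointing to it.
        new_auth = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_auth[target] += hub[source]
        auth = normalize(new_auth)
        # Hub weight: sum of authority weights of the nodes it points to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = normalize(new_hub)
    return hub, auth


# Three followers answer User1; User1 answers User5 once.
example_edges = [("User2", "User1"), ("User3", "User1"),
                 ("User4", "User1"), ("User1", "User5")]
hub, auth = hits(example_edges)
```

In this toy network User1 ends up with the largest authority weight (a "leader"), while User2, User3, and User4 share the largest hub weight ("followers").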
Hubs & Authorities
[Figure labels: dada21, Doc Ruby, Carl Bialik from the WSJ, pNutz, 99BottlesOfBeerInMyF, Tube Steak]
KNIME: Bringing it all together
• Network Analysis: users with hub and authority weights and other features
• Text Analysis: user bins (positive, negative, neutral)
[Figure labels: Carl Bialik from the WSJ, dada21, Doc Ruby, 99BottlesOfBeerInMyF, Web Hosting Guy, pNutz, Tube Steak, Catbeller]
What we have found ...
- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users
What identifies each group?
How do I identify a new user?
How do I handle each user?
Why Clustering?
- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required
k-Means algorithm
Re-sampling the Training Set
k = 10
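As a reminder of what k-Means does, here is a minimal pure-Python sketch of the standard assignment/update loop (Lloyd's algorithm). It is not the KNIME node: the three user features (hub weight, authority weight, attitude) and the toy values are assumptions for illustration, and k = 2 is used only to keep the example small, whereas the talk uses k = 10.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical users as (hub weight, authority weight, attitude).
users = [(0.0, 0.0, -1.0), (0.1, 0.0, -0.8), (0.9, 1.0, 1.0), (1.0, 0.9, 0.8)]
centroids, labels = kmeans(users, k=2)
```

Once the centroids are extracted, `nearest(new_user, centroids)` assigns a previously unseen user to an existing cluster, which is one simple way to answer "How do I identify a new user?".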
The k-Means Clusters
[Figure labels: Superfans, Negative users, Neutral users, Fans]
Additional Discoveries
• There are only very few real leaders! Authority and hub scores identify active participants rather than leaders.
• Superfans can be found in cluster_3.
• Negative and (sigh!) active users are collected in cluster_1.
• Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8).
• Positive users with different degrees of activity are scattered across the remaining clusters.
The operational Workflow
• Pre-processing
• Cluster Extraction
• Assignment of new data
Notes
• MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)
• User Characterization is Sum -> Mean
• NLP: No sentence splitting, no negation identification.
• For a more refined syntax-based sentiment analysis -> "External Tool" node
External Tool Node
The "External Tool" node executes any external program from the command line:
1. Writes the input data to an input file
2. Calls the tool to run on the input file with the given command line options and to write its results to an output file
3. Reads the output file and presents the data at the output port
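The three steps above follow a general file-in/file-out pattern that can be sketched outside KNIME as well. The Python function below is a hypothetical illustration of that pattern, not the node's implementation: the CSV format, the `{in}`/`{out}` placeholders, and the stand-in tool (a one-line Python script that uppercases its input) are all assumptions.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(rows, command):
    """Run a command line tool on tabular data via temporary files.

    rows: list of lists of strings (the input table).
    command: argument list; '{in}' and '{out}' are replaced by file paths.
    Returns the rows read back from the tool's output file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input.csv"
        out_path = Path(tmp) / "output.csv"
        # 1. Write the input data to an input file.
        with in_path.open("w", newline="") as f:
            csv.writer(f).writerows(rows)
        # 2. Call the tool on the input file; it must write the output file.
        args = [a.replace("{in}", str(in_path)).replace("{out}", str(out_path))
                for a in command]
        subprocess.run(args, check=True)
        # 3. Read the output file and return its content.
        with out_path.open(newline="") as f:
            return list(csv.reader(f))


# Stand-in "tool" (assumption): copies the input file to the output in uppercase.
UPPERCASE_TOOL = [
    sys.executable, "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
    "{in}", "{out}",
]
```

Calling `run_external_tool([["dada21", "good"]], UPPERCASE_TOOL)` returns `[["DADA21", "GOOD"]]`. A real sentiment tool would take the place of the stand-in command, provided it can run non-interactively.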
Alternative Sentiment Analysis
Free, non-interactive command line tools for sentiment analysis: not found
SentiStrength v2.2 (still interactive)
External Tool and Generic Web Service Client
Web Crawling Workflow
Community Web Crawler Node
XML Parsing Nodes
Next Steps
- Integrate topic information
- Integrate user demographic and behavioural information
- Discover [time series] patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments
Where do I find more?
Whitepaper: [email protected]
Complete Workflows + Data: www.knime.com
- text mining
- network mining
- combined analysis
(note: the above 3 process huge data and require 16G memory)
- clustering
Open Source Software: KNIME www.knime.com
Next Appointment
User Day US Boston (free)
October 22nd 2013, 10:00 - 17:00
Microsoft New England R&D Center (NERD), One Memorial Drive, Suite 100, Cambridge
http://www.knime.com/user-day-boston-2013
Hands-on Session
2. Install Extensions
Help -> Install New Software
Select:
• KNIME & Extensions
In KNIME Labs Extensions, select:
• KNIME Network Mining
• KNIME Textprocessing
Hands-on Session
3. Get workflows and Slashdot data
• Get workflows from USB stick (KNIMEBoston2013.zip)
  – Text Mining
  – Network Analytics
  – Text and Network Mining
  – Social Media Clustering
• Slashdot Raw Data is included in the downloaded workflows
• A smaller set of data is available, Slashdot Reduced Data, for lower memory requirements
• Both data sets are available from USB stick
Hands-on Session
4. Import Workflows
Hands-on Session
Memory Increase in knime.ini
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
-vmargs
-Xmx2G
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64
Hands-on Session
5. Improve Workflows: Text Mining
Data Reading
Data Preprocessing
Reading Tag Corpus
Tagging Words
BoW
Scoring and Tag Cloud
Hands-on Session
6. Improve Workflows: Network Analytics
Data Reading and preprocessing
Create Network Object
Clean up Network
Visualize Network
[Figure labels: zoomba, nahdude812]