DSS Lab NTUA AICCSA 2014
Evmorfia Biliri, Michael Petychakis, Iosif Alvertis, Fenareti Lampathaki, Sotirios Koussouris, Dimitrios Askounis (National Technical University of Athens – NTUA, DSSLab)
Infusing Social Data Analytics into Future Internet applications for Manufacturing
About me and the lab
• Me
– PhD Student
– API Developer
– Semantic Web enthusiast
• DSS Lab
– Research in ICT, including:
  • Future Internet Applications and Systems for Enterprises and Public Administrations
  • Big, Open and Linked Data and Analytics
  • APIs, Social Media Publishing and Analytics
  • eGovernance and Policy Modeling
  • Enterprise and Government Interoperability
  • ICT for Manufacturing
  • Software Services and Cloud Infrastructures
The problem (I)
• A new age of engagement and collaboration has emerged with the proliferation of user-generated content
• The quantity of information in the world is soaring, with businesses, governments and society only starting to tap its potential
• Harnessing collective intelligence represents a challenge for any manufacturing industry:
– To understand what is discussed online about any topic of interest, getting an instant picture of the market
– To identify sentiments about products and brands early, thus preventing potential damage to the corporate reputation
– To detect user trends in time for them to be incorporated in product design
The problem (II)
What we have
Many users
Many social platforms
What we want
• Aggregated sentiment per product or product feature
• Brand mentions analytics
• Trendy words identification and analysis
Problem (III)
Loooved it!!!lol
imho
#cinema
@mpetyx Have a look at http://... #greatdesign
TTYN
#omgfacts
#fail
#thingsilove
lol haha FTW yea right
Liz has finally managed to achieve what seems to have been her goal... to release an album that could have just as easily been made by anybody else.
Problem (IV)
IRONY IDENTIFICATION IS HARD…
Amazon.com (1 star)
“It took a couple of goes to get into it, but once the story hooked me, I found it difficult to put the book down -- except for those moments when I had to stop and shriek at my friends, "SPARKLY VAMPIRES!" or "VAMPIRE BASEBALL!" or "WHY IS BELLA SO STUPID?" These moments came increasingly often as I reached the climactic chapters, until I simply reached the point where I had to stop and flail around laughing.”
What if we did not have the rating (ground truth)?
Algorithm
• Lexicon-based
– Pre-defined lists of words and phrases associated with a sentiment
– LIWC, SentiWordNet
– Difficult to apply the same list in different contexts
– Semi-automatic construction of lexicons
• Machine learning based
– Naïve Bayes, SVM, Maximum entropy
– Supervised classification
– Adaptation to domain/context
– Labeled data difficult to find
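The contrast between the two families can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon, the tiny training set and every name below are hypothetical, and the Naive Bayes classifier is a textbook multinomial variant with add-one smoothing, not the implementation used in the presented system.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon; real lexicons such as SentiWordNet are far larger.
TOY_LEXICON = {"love": 1, "great": 1, "good": 1, "hate": -1, "terrible": -1, "bad": -1}


def lexicon_polarity(tokens, lexicon=TOY_LEXICON):
    """Lexicon-based: sum the polarity of every token found in the word list."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class NaiveBayes:
    """Machine-learning-based: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {tok for counts in self.word_counts.values() for tok in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled data from a furniture domain where "sick" is praise.
train_docs = [["sick", "design"], ["love", "this", "sofa"],
              ["cheap", "fabric"], ["hate", "this", "sofa"]]
train_labels = ["positive", "positive", "negative", "negative"]
classifier = NaiveBayes().fit(train_docs, train_labels)

print(lexicon_polarity(["sick", "sofa"]))    # neutral: neither word is in the lexicon
print(classifier.predict(["sick", "sofa"]))  # positive: learned from the labeled data
```

The example mirrors the trade-off in the list above: the fixed word list does not transfer to the slang of a new domain, while the supervised model adapts but only because labeled examples were available.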
NLP
• Tokenization
– What about punctuation?
• Conversion to lowercase
– Do capital words indicate sentiment (anger, excitement)?
• Emoticons detection and replacement
– Commonly used as “ground-truth” polarity labels in the automatic creation of testing datasets.
• Stop words filtering
• Repeated characters removal
– Could be indicative of feeling? Remove based on lexicon usage?
• Stemming
– Use aggressive stemmer? Possible loss of information?
• N-gram creation
– n=? (usually between 1 and 3)
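Most of the steps listed above can be sketched with the Python standard library alone. The emoticon table, the stop-word list, the placeholder names `EMO_POS`/`EMO_NEG` and the sample text are assumptions made for this illustration; stemming is left out because it would need an external stemmer.

```python
import re

# Hypothetical, deliberately tiny resources; real lists are much longer.
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":D": "EMO_POS",
             ":(": "EMO_NEG", ":-(": "EMO_NEG"}
STOP_WORDS = {"a", "an", "the", "it", "is", "at", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character
TOKEN_RE = re.compile(r"#?\w+")       # words, keeping a leading hashtag sign


def preprocess(text):
    """Clean a short social media text and return a list of tokens."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    # Replace emoticons before punctuation is discarded by the tokenizer.
    for emoticon, placeholder in EMOTICONS.items():
        text = text.replace(emoticon, f" {placeholder} ")
    # Reduce runs of repeated characters to two, keeping a hint of emphasis.
    text = REPEAT_RE.sub(r"\1\1", text)
    tokens = TOKEN_RE.findall(text)
    # Lowercase everything except the emoticon placeholders.
    tokens = [t if t.startswith("EMO_") else t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("@mpetyx Loooved it!!! :) http://example.com #greatdesign")
print(tokens)             # ['looved', 'EMO_POS', '#greatdesign']
print(ngrams(tokens, 2))  # ['looved EMO_POS', 'EMO_POS #greatdesign']
```

Reducing repeated characters to two rather than one is one possible answer to the question raised above: the elongated word stays distinguishable from its normal spelling, so the emphasis is not lost entirely.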
NLP (II)
• Part of speech tagging
– Study effect of adverbs, adjectives and POS structure of sentence
• Negation detection
– Replacement with one word?
• Hashtags
– Are they more valuable? Could be used to map to preconfigured subjects and improve accuracy…
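Two of these ideas can be sketched briefly. The negation word list, the `NOT_` prefix convention and the hashtag-to-subject mapping below are assumptions chosen for illustration, not the rules used by the presented system. Marking the words that follow a negation is one common way to turn "not good" into a single distinct token.

```python
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ",", ";", "!", "?"}


def mark_negation(tokens):
    """Prefix tokens that follow a negation word with NOT_ until the clause ends."""
    marked, negating = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negating = False
            marked.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            negating = True
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
    return marked


# Hypothetical mapping from hashtags to preconfigured subjects.
HASHTAG_SUBJECTS = {"#greatdesign": "design", "#fail": "quality"}


def subjects_from_hashtags(tokens, mapping=HASHTAG_SUBJECTS):
    """Return the sorted set of subjects referenced by known hashtags."""
    return sorted({mapping[t.lower()] for t in tokens if t.lower() in mapping})


sentence = ["i", "do", "not", "like", "this", "chair", ",", "but", "love", "the", "table"]
print(mark_negation(sentence))
# ['i', 'do', 'not', 'NOT_like', 'NOT_this', 'NOT_chair', ',', 'but', 'love', 'the', 'table']
print(subjects_from_hashtags(["#GreatDesign", "sofa", "#fail"]))  # ['design', 'quality']
```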
Our approach
The FITMAN “Unstructured and Social Data Analytics” Specific Enabler (FITMAN-Anlzer) extracts unstructured data from selected web resources and social data from selected social networks and turns such user-generated content into knowledge to be used for the benefit of manufacturers.
A web infrastructure to…
Collect, Store, Process, Visualize, Interact with
Cloud-based, customizable, domain-independent solution with a user-friendly interface
Design goals
• Domain-independent
– Optimizations are usually very specific and cannot be applied across different industries
• No coding skills or formal query language required
– People who train and use the system are typically not IT staff
• Flexibility
– System can be trained to meet the needs of a specific domain and even a specific department, e.g. a promotional tweet
• Real-time streaming and scalability
– Important news goes viral within hours…
• Report history
– Store and keep statistics and charts for future reference
The described design goals were decided in collaboration with, and validated by, the FITMAN trials with respect to real-life applications in the manufacturing domain
Functionalities Overview
Keyword- & Account-based Information Acquisition
Information Filtering
Sentiment Analysis
Trend Analysis
Added-value User Generated Content (UGC)
[Figure: text processing steps. Labels: Repeated Characters Removal; Username Removal; Tokenization; Conversion to lowercase; Stop-words Filtering; …; SVM Training; + Light Stemming; + Term Frequency; Emoticons Identification; URL Removal; Snowball Stemming]
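The "Term Frequency" and "SVM Training" labels can be illustrated with a small, self-contained sketch. The presented system relies on RapidMiner's SVM implementation; the code below swaps in a plain linear SVM without a bias term, trained by sub-gradient descent on the L2-regularized hinge loss, purely to show the idea. Function names, hyperparameters and the sample documents are assumptions.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to its relative frequency within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def score(weights, features):
    """Linear decision value w . x for a sparse feature dictionary."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items())


def train_linear_svm(samples, labels, epochs=50, learning_rate=0.1, regularization=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (labels are +1 / -1)."""
    weights = {}
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            margin = label * score(weights, features)
            # Regularization step: shrink every weight slightly.
            for term in weights:
                weights[term] *= 1.0 - learning_rate * regularization
            # Hinge loss step: only samples inside the margin move the weights.
            if margin < 1.0:
                for term, value in features.items():
                    weights[term] = weights.get(term, 0.0) + learning_rate * label * value
    return weights


def predict(weights, tokens):
    """Return +1 (positive) or -1 (negative) for a tokenized document."""
    return 1 if score(weights, term_frequencies(tokens)) >= 0 else -1


# Hypothetical labeled documents, already preprocessed into tokens.
documents = [["love", "this", "chair"], ["great", "design"],
             ["terrible", "quality"], ["hate", "this", "table"]]
labels = [1, 1, -1, -1]
weights = train_linear_svm([term_frequencies(d) for d in documents], labels)

print(predict(weights, ["great", "chair"]))     # 1
print(predict(weights, ["terrible", "table"]))  # -1
```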
Technology Stack & Interactions
Trend & Sentiment Analysis Engine
Processing /Querying Engine
Visualization & Report Creator Engine
Data Connectors
Storage System
Scalable… Transferable… Extensible… Open-source…
Charts
Data retrieval
– Streaming API
Low latency access to Twitter’s global stream of Tweet data. Suitable for following specific users or topics.
– Graph API
Primary way to get data in and out of Facebook's social graph. It's a low-level HTTP-based API that you can use to query data.
Example: /{page-id}/posts to get the posts that were published by this page
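As a rough sketch, the `/{page-id}/posts` example translates into an HTTP GET request whose URL can be assembled as follows. The helper name, the page ID and the token value are placeholders invented for this illustration. A valid access token is required in practice, and current Graph API releases usually expect a version prefix in the path, so treat this as an outline rather than a ready-made connector.

```python
from urllib.parse import urlencode

GRAPH_API_BASE = "https://graph.facebook.com"


def page_posts_url(page_id, access_token, limit=25):
    """Build the request URL for the posts published by a page."""
    query = urlencode({"access_token": access_token, "limit": limit})
    return f"{GRAPH_API_BASE}/{page_id}/posts?{query}"


# Placeholder values; no request is sent here.
print(page_posts_url("123", "TOKEN", limit=10))
# https://graph.facebook.com/123/posts?access_token=TOKEN&limit=10
```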
Storage System
Couchbase Server
NoSQL database
Dynamic schema design
Flexibility, Scalability
Free Enterprise edition
Source: http://www.couchbase.com/nosql-resources/what-is-no-sql
Indexing
Elasticsearch
• Flexible and powerful open source, distributed, real-time search and analytics engine.
• Easy to use
• Construction of structured queries also in JSON
• RESTful API for configuration/management
• Suitable for JSON documents.
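To illustrate what a structured JSON query looks like, the sketch below builds a search request body as a Python dictionary and serializes it. The field names `text`, `created_at` and `sentiment` are hypothetical and depend on how the collected documents are indexed. The `bool`, `match`, `range` and `terms` constructs belong to the Elasticsearch Query DSL as found in recent versions; older releases may need a different filter syntax.

```python
import json


def sentiment_report_query(search_term, since):
    """Count matching documents per sentiment label since a given date."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": search_term}}],
                "filter": [{"range": {"created_at": {"gte": since}}}],
            }
        },
        # Aggregate the hits by sentiment instead of returning the documents.
        "aggs": {"by_sentiment": {"terms": {"field": "sentiment"}}},
        "size": 0,
    }


query = sentiment_report_query("sofa", "2014-01-01")
body = json.dumps(query, indent=2)
print(body)
```

A body like this would typically be sent to the `_search` endpoint of the index holding the collected documents.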
Sentiment analysis engine
RapidMiner Studio
• Easy-to-use visual environment for predictive analytics
• No programming required
• Available implementation for SVM
• Powerful text processing plugin
Visualization
• Kibana
– No code required
– Real-time analysis of streaming data
– Highly scalable
– Open source, community driven
– Seamless integration with Elasticsearch
• Google Charts
– Great variety of charts
Scenario: The User Perspective
Step 1: I want to monitor trends for furniture, so I access the FITMAN Unstructured & Social Data Analytics SE
Step 2: I create a new project for the domain I am interested in
Scenario: The User Perspective
Step 3: I provide the necessary training material
[Diagram labels: Users, Publish UGC, Connectors, Automatically Collect data]
Step 3: Training
• Download a CSV file with the 1000 most recently collected documents, edit it and upload it to train the system
• Upload your own CSV file
• One training file per language
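The slides do not specify the layout of the training file, so the following sketch assumes a hypothetical CSV with a `text` column and a `label` column holding one of three polarity tags. It shows how an edited file could be read back and how rows left unlabeled might be skipped.

```python
import csv
import io

# Assumed polarity tags; the actual tag set of the system is not given here.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def load_training_rows(csv_text):
    """Return (text, label) pairs from CSV content, skipping unusable rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip().lower()
        if not text or label not in ALLOWED_LABELS:
            continue  # row was left unlabeled or carries an unknown tag
        rows.append((text, label))
    return rows


sample = (
    "text,label\n"
    '"Love the new sofa, great design",positive\n'
    "Delivery was late,NEGATIVE\n"
    "Just arrived,\n"
)
print(load_training_rows(sample))
# [('Love the new sofa, great design', 'positive'), ('Delivery was late', 'negative')]
```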
Scenario: The User Perspective
Step 4: I select search terms to generate data reports
Scenario: The User Perspective
Step 5: I view the collected data, navigate to the analysis and refine the results
Future steps
• Evaluation in the scope of real-life business cases in various domains
• Measure and evaluate the effect of domain-specific training
• Experiment with other NLP techniques (e.g. include POS-tagger in text preprocessing)
• Extend polarity tags (detect more sentiments)
• Try other machine learning algorithms
• Subjective-objective sentence identification, as a prior step to the sentiment analysis process
• Conversation-level analysis of Facebook comments
• Influencer detection
• Allow for even more flexible queries with Elasticsearch