Sharknado Social Media Analysis with SAP HANA and Predictive Analysis
TRANSCRIPT
Mining social media data for customer feedback is perhaps one of the greatest untapped opportunities
for customer analysis in many organizations today. Social media
data is freely available and allows organizations to personally
identify and interact directly with customers to resolve any
potential dissatisfaction. In today’s blog post, I’ll discuss using SAP
Data Services, SAP HANA, and SAP Predictive Analysis to collect,
process, visualize, and analyze social media data related to the
recent social media phenomenon Sharknado.
Collecting Social Media Data with SAP Data Services
While I’ll be focusing primarily on the analysis of social media data
in this blog post, social media data can be collected from any source
with an open API by using Python scripting within a User-Defined
Transform. In this example, I’ve collected Twitter data using the
basic outline provided by SAP in the Data Services Text Data
Processing Blueprints available on the SAP Community Network,
and updated it for the REST version 1.1 Twitter API. This process
consists of two dataflows: the first tracks search terms and constructs (Get_Search_Tasks transform) and
executes (Search_Twitter transform) a Twitter search query to store the data pictured below. In
addition to the raw text of the tweet, some metadata is available, including user name, time, and
location information (if the user has made it publicly available).
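As a rough illustration of what the Search_Twitter step does under the hood, the Python sketch below builds a REST version 1.1 search request and flattens each returned status into a row for loading. The endpoint and JSON field names come from the version 1.1 search API, but the function names and row layout are my own, and OAuth request signing is omitted for brevity.

```python
import urllib.parse

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_url(term, count=100, since_id=None):
    """Construct the search request URL for one tracked search term."""
    params = {"q": term, "count": count}
    if since_id is not None:
        params["since_id"] = since_id  # only fetch tweets newer than last run
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

def parse_statuses(payload):
    """Flatten a decoded search response into rows of text plus metadata."""
    rows = []
    for status in payload.get("statuses", []):
        user = status.get("user", {})
        rows.append({
            "tweet_id": status.get("id_str"),
            "text": status.get("text"),
            "created_at": status.get("created_at"),
            "user_name": user.get("screen_name"),
            "location": user.get("location"),  # blank unless shared publicly
        })
    return rows
```

In the actual dataflow, the equivalent of `parse_statuses` runs inside the User-Defined Transform and its rows feed the target table directly.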
Once the raw tweet data has been collected, I can use either the Text Data Processing transform in SAP
Data Services or the Voice of Customer text analysis process in SAP HANA. While both processes give
the same result, SAP Data Services is also able to perform preliminary summarization and
transformations on the parsed data within the same dataflow. In this case, I will run text analysis in SAP
HANA by running the command below in SAP HANA Studio.
CREATE FULLTEXT INDEX "VOC" ON <table name>(<tweet text column name>)
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';
This results in a table called $TA_VOC in the same schema as the source table, as shown below.
In this table, the TA_TOKEN—called SOURCE_FORM in SAP Data Services TDP—is the extracted entity or
element from the tweet (for example, an identifiable person, place, topic, organization, or sentiment),
while TA_TYPE (called TYPE in SAP Data Services TDP) is the category the entity falls under. These are the two
main text analysis elements used to extract information from Twitter data.
For a more in-depth explanation on Text Data Processing and social media analysis using SAP Data
Services, refer to the Decision First Summer EIM Expert Series webinar on Twitter data collection and
social media sentiment analysis by Nicholas Hohman.
Once the Twitter data was loaded into SAP HANA and text analysis had been performed, I created an
Analytic View and several Calculation Views to allow for visualization and analysis.
In the first Analytic View pictured above, I’ve cleaned up the TYPE categories a bit further,
consolidating them into top-level categories (for example, combining all types of Organizations into a
single Organization category), and assigned a numeric sentiment value to each sentiment-type entity as
shown in the table below, ranging from 0 (strong negative sentiment) to 1 (strong positive sentiment).
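As an illustration, the snippet below shows one way such a mapping could be expressed in Python. The specific sentiment type names and the intermediate values here are assumptions for the sketch; the actual assignments are those shown in the table.

```python
# Illustrative mapping from text-analysis sentiment types to numeric scores
# on the 0 (strong negative) to 1 (strong positive) scale used in the view.
# Type names and in-between values are assumptions, not the actual table.
SENTIMENT_SCORES = {
    "StrongPositiveSentiment": 1.0,
    "WeakPositiveSentiment":   0.75,
    "NeutralSentiment":        0.5,
    "WeakNegativeSentiment":   0.25,
    "MinorProblem":            0.25,
    "StrongNegativeSentiment": 0.0,
    "MajorProblem":            0.0,
}

def sentiment_score(ta_type):
    """Return the numeric score for a sentiment-type entity, else None."""
    return SENTIMENT_SCORES.get(ta_type)
```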
I then created a calculation view that aggregates data to the tweet-level and calculates tweet-level flags
for analysis, including flags to indicate whether key types of entities are found in each tweet (location,
topic, Twitter hashtag, retweet, sentiment, etc.). This also aggregates the average sentiment based on
any sentiments found within the tweet. I’ll use these aggregated metrics later for visualization and
predictive analysis of the Twitter data.
The final outputs of the SAP HANA modeling process are two analysis sets:
1.) A tweet-level analysis set with aggregated flags and values summarizing the tweet, including
tweet length, number of extracted entities within the tweet, and the metadata collected with
the tweet, such as location, time, and the user information.
2.) An entity-level analysis set with tweet-level metadata joined back to the individual entities to
allow analysis at the entity level.
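The tweet-level aggregation behind the first analysis set can be sketched in Python along these lines; the field and type names are illustrative stand-ins rather than the actual calculation view definitions.

```python
# Minimal sketch of the tweet-level rollup: per-tweet flags for key entity
# types plus the average of any numeric sentiment scores found in the tweet.
def aggregate_tweet(entities):
    """entities: list of dicts with 'ta_type' and an optional 'score'."""
    types = {e["ta_type"] for e in entities}
    scores = [e["score"] for e in entities if e.get("score") is not None]
    return {
        "entity_count": len(entities),
        "has_topic": "Topic" in types,
        "has_location": "LOCALITY" in types,  # assumed type name
        "has_sentiment": bool(scores),
        "avg_sentiment": sum(scores) / len(scores) if scores else None,
    }
```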
While these analysis sets could be created using an SAP Data Services ETL process, the SAP HANA
Information Views have the advantage of being calculated on the fly rather than as a batch process, so if
we are continuously monitoring and collecting Twitter data, users will have real-time access to social
media trends and insights without having to wait for an overnight or batch process to finish.
Visualization and Analysis of #Sharknado Data
For this analysis, I collected over 33,000 tweets related to the topic “sharknado” over a period of days.
After Text Analysis was performed, over 200,000 individual entities were extracted from these tweets.
A natural first step is generating descriptive charts to explain the nature of these extracted entities and
tweets. The figure below shows an area chart of all the entities extracted from the tweets by category.
Twitter hashtags were the most commonly identified entities, followed by sentiments, Twitter users,
topics, and organizations. The depth of color indicates the tweet-level average sentiment. This shows
that tweets with topic entities have the highest (most positive) overall sentiment, while tweets with
hashtags are much less positive.
A few other fast facts on the Sharknado tweets:
38% of the tweets collected include a retweet from another user
41% of tweets have a topic entity extracted from the text
7.5% of tweets have a location entity within the tweet text
45% of tweets have a sentiment entity identified in the text
54.5% of tweets have 5 or more entities extracted from the text
The chart below shows a histogram of tweets by the length of the tweet text—tweets are most
commonly right around the 140-character limit, with about 25% of tweets at 135 characters and
above.
Now, we can start to examine the individual entities extracted from the tweets and sentiments
associated with each entity. For example, we can pull the Person entities identified by the text analysis
in a word cloud, shown below. This word cloud shows the most common entities (larger size) and the
sentiment associated with the person entities (depth of color).
This shows that Tara Reid, Cary Grant, Tatiana Maslany, Ian Ziering, and Steve Sanders were the most
commonly identified person entities, with Tatiana Maslany and Tara Reid appearing in tweets with
higher average sentiments. Tara Reid and Ian Ziering are actors that appeared in Sharknado, and Steve
Sanders was Ian Ziering’s character in Beverly Hills, 90210, but I was confused by the appearance of Cary
Grant, whom Wikipedia identifies as an English actor with “debonair demeanor” who died in 1986, and
Tatiana Maslany, a lesser-known Canadian actress, neither of whom appeared in Sharknado. Further
filtering the tweet text for these particular entities, I find an extremely high retweet frequency for two
influential tweets:
@TVMcGee: #Sharknado is even more impressive when you realize Tatiana
Maslany played all the different sharks.
@RichardDreyfuss: People don't talk about it much in Hollywood (omertà
and everything) but Cary Grant actually died in a #sharknado
The entity “impressive” was strongly positive for Tatiana Maslany, while “n’t talk” was considered a
minor problem for the Cary Grant tweet. Further analysis can be done to identify popular characters
and portions of the movie, which the Sharknado filmmakers can mine to identify the characters, plots,
or topics to revisit in the already-approved sequel to Sharknado (coming Summer 2014).
Similarly, investigating location entities shown in the word cloud below, we can see the most common
references are to Texas and Hollywood, with tweets about Texas being more positive than Hollywood.
Organizations identified by Text Analysis show that SyFy (the channel that brought you Sharknado) and
the phrase Public Service Announcement, as well as Lego and Nova, were common in tweets, as shown in
the word cloud below.
The SyFy and public service announcement phrases were found in a frequently retweeted tweet about a
re-airing of the movie:
@Syfy: Public Service
Announcement: #Sharknado
will be rebroadcast on
Thurs, July 18, at 7pm.
Please retweet this
important information.
Nova was a character in the movie who
may have met an untimely end, which
apparently did not elicit positive
sentiments. The Lego topic/organization
was also in a commonly re-tweeted
tweet of a picture of a sharknado made
of Legos.
@Syfy: OMG OMG OMG someone
made #Sharknado out of
LEGOs!!!
http://t.co/0ORVv6w2uf
http://t.co/lbjJ6DDvzU
Predictive Analysis on #Sharknado Data
After summarizing and visualizing the data, I can leverage SAP Predictive Analysis’s Predict pane to
evaluate the models using predictive algorithms. We can further summarize tweet data across multiple
numeric characteristics using a clustering algorithm. Clustering is an unsupervised learning algorithm
and one of the most popular segmentation methods; it creates groups of similar observations based on
numeric characteristics. In this case, the numeric characteristics available are: length of tweet, # of
entities extracted from the tweet, and the presence of a topic or a sentiment flag. While binary
variables are not technically appropriate to use in a clustering model, we’re including them here to
increase the complexity of our model and make the results more interesting.
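For readers who want to see the mechanics, here is a tiny pure-Python sketch of the k-means-style clustering idea. SAP Predictive Analysis handles this internally; the implementation below is only an illustration, with feature tuples standing in for characteristics like tweet length and entity count.

```python
import random

def _dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and
    centroid recomputation for a fixed number of iterations."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: _dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = _mean(members)
    return labels, centroids
```

On well-separated data (for example, long entity-rich tweets versus short sparse ones), the assignments settle quickly into distinct groups, mirroring the size-based clusters described below.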
The clustering model results show three groups of tweets, roughly separated by size, with Cluster 3 being
the short tweets, Cluster 1 the longer tweets, and Cluster 2 in between. This clustering model
does show us that longer tweets were more likely to have more entities identified by the text analysis
and were more likely to have a sentiment and a topic within the tweet.
While this is an extremely simple example, with additional descriptive statistics we could cluster tweets
according to sentiment and occurrences of key phrases or words; if the organization could link these
tweet segments to customer satisfaction or other key metrics (such as referrals generated through
social media buzz or calls to a customer service center), monitoring the frequency of tweets by segment
would be a great, nearly real-time leading indicator of viral buzz, customer complaints, or referral
business.
Another potential application for predictive models would be attempting to estimate the impact of
tweet characteristics on the sentiment value of the tweet. In this case, I’ve arbitrarily determined that a
tweet with an average sentiment of 0.4 or higher is “Positive”. I can then use the R-CNR Decision Tree
algorithm or a custom R function for Logistic Regression (see this previous blog on Custom R Modules)
to predict which elements are most indicative of positive tweets. In order to compare these models, I
use a filter transform to filter out tweets without sentiments. Then, I configure the Logistic Regression
and R-CNR Tree modules to include all my descriptive data, including tweet length, number of entities
extracted, and presence of location and topic entities.
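The target definition and filter step can be sketched in Python as follows; the field names are illustrative, and the 0.4 cutoff is the arbitrary threshold described above.

```python
# Sketch of the model-input preparation: drop tweets without any sentiment
# entities (the filter transform), then label the rest "Positive" when the
# average sentiment is at least the chosen threshold.
POSITIVE_THRESHOLD = 0.4

def prepare_model_inputs(tweets):
    """tweets: dicts of tweet-level attributes; returns model-ready rows."""
    rows = []
    for t in tweets:
        if t.get("avg_sentiment") is None:  # no sentiment found: filter out
            continue
        rows.append({
            "length": t["length"],
            "entity_count": t["entity_count"],
            "has_location": int(t.get("has_location", False)),
            "has_topic": int(t.get("has_topic", False)),
            "positive": int(t["avg_sentiment"] >= POSITIVE_THRESHOLD),
        })
    return rows
```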
Once this predictive workflow has been run, I can review results for the logistic regression and decision
tree results.
Logistic Regression results
These model output charts show that the logistic regression model is not terribly predictive, showing an
AUC (area under the ROC curve) of only 0.598 (AUC varies from 0 to 1, with a baseline of 0.5 and values
closer to 1 indicating more accurate predictions).
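For reference, AUC can be computed directly from labels and predicted scores using the rank-sum (Mann–Whitney) identity; this small Python sketch is not part of the SAP workflow, just an illustration of the metric being reported.

```python
def auc(labels, scores):
    """AUC via the rank-sum identity: the probability that a randomly chosen
    positive outscores a randomly chosen negative, counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every positive above every negative scores 1.0, while scoring everything identically gives the 0.5 baseline mentioned above.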
This chart shows that there is a slight increase in predicted average sentiment (red line) across the
actual average tweet sentiment (x axis). Blue bars represent tweet volume for each level of average
sentiment. Ideally, the red line would be approximately diagonal from bottom left to top right.
Decision Tree results
The Decision tree shows that the model is able to identify large pockets of tweets that are much more
likely to be positive.
Pockets of highly-positive tweets
In summary, the models show potential to distinguish tweet positivity based on tweet content
characteristics. These models could be further tuned for accuracy with more Sharknado-related
characteristics, such as whether the tweet mentioned specific plot points, emotions, or characters. In
these preliminary models, results suggest that having a location entity, longer tweet length, and
presence of a retweet contribute to positive sentiments. Perhaps this suggests that people are more
likely to retweet positive tweets than negative ones?
Adding the presence of key terms like “chainsaw” or “shark” or specific character names as input
predictors would let us see the impact of those specific terms on sentiment positivity.
Developers of the Sharknado sequel could determine which specific aspects of the film were most
positively and negatively received by the audience and incorporate these concepts into the sequel.
Tips for Social Media Data Collection and Analysis
Based on this experiment, I have a few recommendations for approaching a similar problem going
forward.
Implement custom data dictionaries and custom categorizations: Using custom data
dictionaries, we could have the text data processing step immediately identify key terms that
are related to our particular topic. In this case, we could have created a custom dictionary with
character names, plot points, or key terms like “chainsaw” or “shark”. These terms might not be
recognized by the “standard” text analysis dictionaries, but they will help us automatically pull
out and identify entities that are important in our particular scenario.
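A custom dictionary behaves roughly like the sketch below; the categories and the character name here are hypothetical examples, not the actual dictionary format used by Text Data Processing.

```python
import re

# Hypothetical stand-in for a custom text-analysis dictionary: map domain
# terms and character names to our own entity categories so they are tagged
# even when the standard dictionaries would miss them.
CUSTOM_DICTIONARY = {
    "chainsaw": "PROP",
    "shark": "CREATURE",
    "sharknado": "EVENT",
    "fin": "CHARACTER",  # illustrative character-name entry
}

def tag_custom_entities(text):
    """Return (term, category) pairs for dictionary terms found in the text."""
    hits = []
    for term, category in CUSTOM_DICTIONARY.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append((term, category))
    return hits
```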
Scrub profanity and irrelevant tweets immediately: One thing I noticed when pulling in
Sharknado-related tweets was an abundance of profanity and Twitter spam. Scrubbing out
profanity is important if the tweet data is going to be included in Business Intelligence reports or
shared with others within the organization. Similarly, setting up policies to eliminate or avoid
spam-related Twitter accounts may help keep the feedback data more pure. I noticed accounts
that would tweet a message like “Get 500 followers free” and include the top 5 hashtags
trending on Twitter at the time. These tweets made up a huge portion of the data I collected,
and should have been immediately discarded based on the repetitive text so as not to influence
frequency and sentiment analysis.
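A simple version of this scrubbing step might look like the following Python sketch; the blocked-phrase list is a stand-in for a real profanity and spam dictionary.

```python
# Drop tweets containing blocked phrases (spam or profanity) and collapse
# exact repeats of the same text so repetitive spam cannot dominate
# frequency and sentiment counts.
BLOCKED_PHRASES = ["get 500 followers"]  # extend with a profanity list

def scrub(tweets):
    """Return the tweets that survive the blocked-phrase and repeat checks."""
    seen = set()
    kept = []
    for text in tweets:
        norm = " ".join(text.lower().split())
        if any(p in norm for p in BLOCKED_PHRASES):
            continue
        if norm in seen:  # repetitive text: keep the first occurrence only
            continue
        seen.add(norm)
        kept.append(text)
    return kept
```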
Construct descriptive attributes: Probably the most important part of this process is
constructing descriptive attributes for each of the tweets. These may include flags to indicate
whether the tweet included a key entity or category, length fields, or perhaps user information
that can be collected about the poster. These attributes might be related to the custom data
dictionaries relevant to the topic.
Identify and treat retweets differently: While the re-tweeted data is valuable in gauging
influence and frequency of the social media buzz, it can bias the sentiment analysis by
overwhelming the average sentiment with copies of the same information. Therefore, flagging
tweets that contain retweeted information and excluding those from some sentiment analysis
might eliminate sentiment bias of a single opinion or phrase that was retweeted many, many
times.
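A minimal retweet flag along these lines might look like this; the "RT @" heuristic is an assumption about how retweets appear in the raw text, not the exact rule used in the views.

```python
# Flag tweets that start with (or quote) "RT @user" as retweets, and
# exclude them when averaging sentiment so one widely retweeted opinion
# cannot overwhelm the average.
def is_retweet(text):
    return text.startswith("RT @") or " RT @" in text

def average_original_sentiment(tweets):
    """tweets: (text, sentiment) pairs; average only original tweets."""
    vals = [s for t, s in tweets if s is not None and not is_retweet(t)]
    return sum(vals) / len(vals) if vals else None
```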
Implementation of Sentiment Analysis Data
While the Sharknado example is a fun pop culture phenomenon, how does this become relevant to a
real-world organization? Collecting Twitter data relevant to an organization could provide nearly free
focus group-like feedback directly from customers who are most likely to influence their peers. For
example, a hotel chain could collect Twitter data not only from users that mention its brand name, but
also from users mentioning competitors’ names or just talking about hotels in the general sense. It can
then get an idea of what contributes to positive and negative sentiments about hotels. Do negative
sentiments most commonly accompany comments about cleanliness? Noise? Wait to check in? Staff?
Do positive sentiments stem from amenities like the pool or gym? What is the general sentiment for
customers of your hotel chain versus competitors? And are there particularly negative sentiments for
users of one particular location that might indicate a serious problem?
Furthermore, having this type of feedback available in a nearly real-time environment allows
organizations to monitor, respond to, and leverage social media buzz to increase audience or revenue
for the organization. For example, when SyFy executives saw the volume of social media posts and the
response to the initial Sharknado airing, SyFy was able to quickly schedule subsequent showings,
commit to a sequel, and arrange for the film to make its theatrical debut, disseminating this
information via Twitter while the topic was still trending. This equates to increasing awareness and
future audience at a very low cost. If SyFy had missed this window, they would have had to expend
significant marketing funds to re-generate this level of buzz. In fact, by leveraging this strong social
media buzz around the initial airing of Sharknado, SyFy actually garnered higher viewership with the re-
airing than it experienced during the initial premiere.
This type of feedback can give insight not only to what users might think about your organization’s
brand overall, but also could give an idea of the importance that specific product aspects hold in a user’s
experience. Understanding how the consumer values these factors could guide investment decisions or
marketing strategies by highlighting the features that customers care about and those that are not
meaningful.
Hillary Bliss, Analytics Practice Lead
Decision First Technologies
twitter @HillaryBlissDFT
Hillary Bliss is the Analytics Practice Lead at Decision First Technologies, and specializes in data
warehouse design, ETL development, statistical analysis, and predictive modeling. She works with clients
and vendors to integrate business analysis and predictive modeling solutions into the organizational
data warehouse and business intelligence environments based on their specific operational and
strategic business needs. She has a master’s degree in statistics and an MBA from Georgia Tech.