Sharknado Social Media Analysis with SAP HANA and Predictive Analysis
TRANSCRIPT
Mining social media data for customer feedback is perhaps one of the greatest untapped opportunities
for customer analysis in many organizations today. Social media
data is freely available and allows organizations to personally
identify and interact directly with customers to resolve any
potential dissatisfaction. In today’s blog post, I’ll discuss using SAP
Data Services, SAP HANA, and SAP Predictive Analysis to collect,
process, visualize, and analyze social media data related to the
recent social media phenomenon Sharknado.
Collecting Social Media Data with SAP Data Services
While I’ll be focusing primarily on the analysis of social media data
in this blog post, social media data can be collected from any source
with an open API by using Python scripting within a User-Defined
Transform. In this example, I’ve collected Twitter data using the
basic outline provided by SAP in the Data Services Text Data
Processing Blueprints available on the SAP Community Network,
and updated it for the REST version 1.1 Twitter API. This process
consists of two dataflows: the first tracks search terms and constructs (Get_Search_Tasks transform) and
executes (Search_Twitter transform) a Twitter search query to store the data pictured below. In
addition to the raw text of the tweet, some metadata is available, including user name, time, and
location information (if the user has made it publicly available).
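As a rough illustration of what the Search_Twitter step does under the hood, the Python sketch below builds a REST version 1.1 search request and flattens each returned status into a row for loading. The endpoint and JSON field names come from the version 1.1 search API, but the function names and row layout are my own, and OAuth request signing is omitted for brevity.

```python
import urllib.parse

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_url(term, count=100, since_id=None):
    """Construct the search request URL for one tracked search term."""
    params = {"q": term, "count": count}
    if since_id is not None:
        params["since_id"] = since_id  # only fetch tweets newer than last run
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

def parse_statuses(payload):
    """Flatten a decoded search response into rows of text plus metadata."""
    rows = []
    for status in payload.get("statuses", []):
        user = status.get("user", {})
        rows.append({
            "tweet_id": status.get("id_str"),
            "text": status.get("text"),
            "created_at": status.get("created_at"),
            "user_name": user.get("screen_name"),
            "location": user.get("location"),  # blank unless shared publicly
        })
    return rows
```

In the actual dataflow, the equivalent of `parse_statuses` runs inside the User-Defined Transform and its rows feed the target table directly.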
Once the raw tweet data has been collected, I can use either the Text Data Processing transform in SAP
Data Services or the Voice of Customer text analysis process in SAP HANA. While both processes give
the same result, SAP Data Services is also able to perform preliminary summarization and
transformations on the parsed data within the same dataflow. In this case, I will run text analysis in SAP
HANA by running the command below in SAP HANA Studio.
CREATE FULLTEXT INDEX "VOC" ON <table name>(<tweet text column name>)
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';
This results in a table called $TA_VOC in the same schema as the source table, as shown below.
In this table, the TA_TOKEN—called SOURCE_FORM in SAP Data Services TDP—is the extracted entity or
element from the tweet (for example, an identifiable person, place, topic, organization, or sentiment),
while TA_TYPE (called TYPE in SAP Data Services TDP) is the category the entity falls under. These are the two
main text analysis elements used to extract information from Twitter data.
For a more in-depth explanation on Text Data Processing and social media analysis using SAP Data
Services, refer to the Decision First Summer EIM Expert Series webinar on Twitter data collection and
social media sentiment analysis by Nicholas Hohman.
Once the Twitter data was loaded into SAP HANA and text analysis had been performed, I created an
Analytic View and several Calculation Views to allow for visualization and analysis.
In the first Analytic View pictured above, I’ve cleaned up the TYPE categories a bit further,
consolidating them into top-level categories (for example, combining all types of Organizations into a
single Organization category), and assigned a numeric sentiment value to each sentiment-type entity as
shown in the table below, ranging from 0 (strong negative sentiment) to 1 (strong positive sentiment).
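As an illustration, the snippet below shows one way such a mapping could be expressed in Python. The specific sentiment type names and the intermediate values here are assumptions for the sketch; the actual assignments are those shown in the table.

```python
# Illustrative mapping from text-analysis sentiment types to numeric scores
# on the 0 (strong negative) to 1 (strong positive) scale used in the view.
# Type names and in-between values are assumptions, not the actual table.
SENTIMENT_SCORES = {
    "StrongPositiveSentiment": 1.0,
    "WeakPositiveSentiment":   0.75,
    "NeutralSentiment":        0.5,
    "WeakNegativeSentiment":   0.25,
    "MinorProblem":            0.25,
    "StrongNegativeSentiment": 0.0,
    "MajorProblem":            0.0,
}

def sentiment_score(ta_type):
    """Return the numeric score for a sentiment-type entity, else None."""
    return SENTIMENT_SCORES.get(ta_type)
```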
I then created a calculation view that aggregates data to the tweet-level and calculates tweet-level flags
for analysis, including flags to indicate whether key types of entities are found in each tweet (location,
topic, Twitter hashtag, retweet, sentiment, etc.). This also aggregates the average sentiment based on
any sentiments found within the tweet. I’ll use these aggregated metrics later for visualization and
predictive analysis of the Twitter data.
The final outputs of the SAP HANA modeling process are two analysis sets:
1.) A tweet-level analysis set with aggregated flags and values summarizing the tweet, including
tweet length, number of extracted entities within the tweet, and the metadata collected with
the tweet, such as location, time, and the user information.
2.) An entity-level analysis set with tweet-level metadata joined back to the individual entities to
allow analysis at the entity level.
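The tweet-level aggregation behind the first analysis set can be sketched in Python along these lines; the field and type names are illustrative stand-ins rather than the actual calculation view definitions.

```python
# Minimal sketch of the tweet-level rollup: per-tweet flags for key entity
# types plus the average of any numeric sentiment scores found in the tweet.
def aggregate_tweet(entities):
    """entities: list of dicts with 'ta_type' and an optional 'score'."""
    types = {e["ta_type"] for e in entities}
    scores = [e["score"] for e in entities if e.get("score") is not None]
    return {
        "entity_count": len(entities),
        "has_topic": "Topic" in types,
        "has_location": "LOCALITY" in types,  # assumed type name
        "has_sentiment": bool(scores),
        "avg_sentiment": sum(scores) / len(scores) if scores else None,
    }
```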
While these analysis sets could be created using an SAP Data Services ETL process, the SAP HANA
Information Views have the advantage of being calculated on the fly rather than as a batch process, so if
we are continuously monitoring and collecting Twitter data, users will have real-time access to social
media trends and insights without having to wait for an overnight or batch process to finish.
Visualization and Analysis of #Sharknado Data
For this analysis, I collected over 33,000 tweets related to the topic “sharknado” over a period of days.
After Text Analysis was performed, over 200,000 individual entities were extracted from these tweets.
A natural first step is generating descriptive charts to explain the nature of these extracted entities and
tweets. The figure below shows an area chart of all the entities extracted from the tweets by category.
Twitter hashtags were the most commonly identified entities, followed by sentiments, Twitter users,
topics, and organizations. The depth of color indicates the tweet-level average sentiment. This shows
that tweets with topic entities have the highest (most positive) overall sentiment, while tweets with
hashtags are much less positive.
A few other fast facts on the Sharknado tweets:
38% of the tweets collected include a retweet from another user
41% of tweets have a topic entity extracted from the text
7.5% of tweets have a location entity within the tweet text
45% of tweets have a sentiment entity identified in the text
54.5% of tweets have 5 or more entities extracted from the text
The chart below shows a histogram of tweets by the length of the tweet text—tweets are most
commonly right around the 140-character limit, with about 25% of tweets at 135 characters and
above.
Now, we can start to examine the individual entities extracted from the tweets and sentiments
associated with each entity. For example, we can pull the Person entities identified by the text analysis
in a word cloud, shown below. This word cloud shows the most common entities (larger size) and the
sentiment associated with the person entities (depth of color).
This shows that Tara Reid, Cary Grant, Tatiana Maslany, Ian Ziering, and Steve Sanders were the most
commonly identified person entities, with Tatiana Maslany and Tara Reid appearing in tweets with
higher average sentiments. Tara Reid and Ian Ziering are actors that appeared in Sharknado, and Steve
Sanders was Ian Ziering’s character in Beverly Hills, 90210, but I was confused by the appearance of Cary
Grant, whom Wikipedia identifies as an English actor with “debonair demeanor” who died in 1986, and
Tatiana Maslany, a lesser-known Canadian actress, neither of whom appeared in Sharknado. Further
filtering the tweet text for these particular entities, I find an extremely high retweet frequency for two
influential tweets:
@TVMcGee: #Sharknado is even more impressive when you realize Tatiana
Maslany played all the different sharks.
@RichardDreyfuss: People don't talk about it much in Hollywood (omertà
and everything) but Cary Grant actually died in a #sharknado
The entity “impressive” was strongly positive for Tatiana Maslany, while “n’t talk” was considered a
minor problem for the Cary Grant tweet. Further analysis can be done to identify popular characters
and portions of the movie, which the Sharknado filmmakers can mine to identify the characters, plots,
or topics to revisit in the already-approved sequel to Sharknado (coming Summer 2014).
Similarly, investigating location entities shown in the word cloud below, we can see the most common
references are to Texas and Hollywood, with tweets about Texas being more positive than Hollywood.
Organizations identified by Text Analysis show that SyFy (the channel that brought you Sharknado) and
the phrase Public Service Announcement, as well as Lego and Nova, were common in tweets, as shown in
the word cloud below.
The SyFy and public service announcement phrases were found in a frequently retweeted tweet about a
re-airing of the movie:
@Syfy: Public Service
Announcement: #Sharknado
will be rebroadcast on
Thurs, July 18, at 7pm.
Please retweet this
important information.
Nova was a character in the movie who
may have met an untimely end, which
apparently did not elicit positive
sentiments. The Lego topic/organization
was also in a commonly re-tweeted
tweet of a picture of a sharknado made
of Legos.
@Syfy: OMG OMG OMG someone
made #Sharknado out of
LEGOs!!!
http://t.co/0ORVv6w2uf
http://t.co/lbjJ6DDvzU
Predictive Analysis on #Sharknado Data
After summarizing and visualizing the data, I can leverage SAP Predictive Analysis’s Predict pane to
evaluate the models using predictive algorithms. We can further summarize tweet data across multiple
numeric characteristics using a clustering algorithm. Clustering is an unsupervised learning algorithm
and one of the most popular segmentation methods; it creates groups of similar observations based on
numeric characteristics. In this case, the numeric characteristics available are: length of tweet, # of
entities extracted from the tweet, and the presence of a topic or a sentiment flag. While binary
variables are not technically appropriate to use in a clustering model, we’re including them here to
increase the complexity of our model and make the results more interesting.
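For readers who want to see the mechanics, here is a tiny pure-Python sketch of the k-means-style clustering idea. SAP Predictive Analysis handles this internally; the implementation below is only an illustration, with feature tuples standing in for characteristics like tweet length and entity count.

```python
import random

def _dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and
    centroid recomputation for a fixed number of iterations."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: _dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = _mean(members)
    return labels, centroids
```

On well-separated data (for example, long entity-rich tweets versus short sparse ones), the assignments settle quickly into distinct groups, mirroring the size-based clusters described below.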
The clustering model results show three groups of tweets, roughly separated by size, with Cluster 3 being
the short tweets, Cluster 1 the longer tweets, and Cluster 2 in between. This clustering model
does show us that longer tweets were more likely to have more entities identified by the text analysis
and were more likely to have a sentiment and a topic within the tweet.
While this is an extremely simple example, with additional descriptive statistics we could cluster tweets
according to sentiment and occurrences of key phrases or words; if the organization could link these
tweet segments to customer satisfaction or other key metrics (such as referrals generated through
social media buzz or calls to a customer service center), monitoring the frequency of tweets by segment
would be a great, nearly real-time leading indicator of viral buzz, customer complaints, or referral
business.
Another potential application for predictive models would be attempting to estimate the impact of
tweet characteristics on the sentiment value of the tweet. In this case, I’ve arbitrarily determined that a
tweet with an average sentiment of 0.4 or higher is “Positive”. I can then use the R-CNR Decision Tree
algorithm or a custom R function for Logistic Regression (see this previous blog on Custom R Modules)
to predict which elements are most indicative of positive tweets. In order to compare these models, I
use a filter transform to filter out tweets without sentiments. Then, I configure the Logistic Regression
and R-CNR Tree modules to include all my descriptive data, including tweet length, number of entities
extracted, and presence of location and topic entities.
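The target definition and filter step can be sketched in Python as follows; the field names are illustrative, and the 0.4 cutoff is the arbitrary threshold described above.

```python
# Sketch of the model-input preparation: drop tweets without any sentiment
# entities (the filter transform), then label the rest "Positive" when the
# average sentiment is at least the chosen threshold.
POSITIVE_THRESHOLD = 0.4

def prepare_model_inputs(tweets):
    """tweets: dicts of tweet-level attributes; returns model-ready rows."""
    rows = []
    for t in tweets:
        if t.get("avg_sentiment") is None:  # no sentiment found: filter out
            continue
        rows.append({
            "length": t["length"],
            "entity_count": t["entity_count"],
            "has_location": int(t.get("has_location", False)),
            "has_topic": int(t.get("has_topic", False)),
            "positive": int(t["avg_sentiment"] >= POSITIVE_THRESHOLD),
        })
    return rows
```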
Once this predictive workflow has been run, I can review results for the logistic regression and decision
tree results.
Logistic Regression results
These model output charts show that the logistic regression model is not terribly predictive, showing an
AUC (area under the ROC curve) of only 0.598 (AUC varies from 0 to 1, with a baseline of 0.5 and values
closer to 1 indicating more accurate predictions).
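For reference, AUC can be computed directly from labels and predicted scores using the rank-sum (Mann–Whitney) identity; this small Python sketch is not part of the SAP workflow, just an illustration of the metric being reported.

```python
def auc(labels, scores):
    """AUC via the rank-sum identity: the probability that a randomly chosen
    positive outscores a randomly chosen negative, counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every positive above every negative scores 1.0, while scoring everything identically gives the 0.5 baseline mentioned above.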
This chart shows that there is a slight increase in predicted average sentiment (red line) across the
actual average tweet sentiment (x axis). Blue bars represent tweet volume for each level of average
sentiment. Ideally, the red line would be approximately diagonal from bottom left to top right.
Decision Tree results
The Decision tree shows that the model is able to identify large pockets of tweets that are much more
likely to be positive.
Pockets of highly-positive tweets
In summary, the models show potential to distinguish tweet positivity based on tweet content
characteristics. These models could be further tuned for accuracy with more Sharknado-related
characteristics, such as whether the tweet mentioned specific plot points, emotions, or characters. In
these preliminary models, results suggest that having a location entity, longer tweet length, and
presence of a retweet contribute to positive sentiments. Perhaps this suggests that people are more
likely to retweet positive tweets than negative ones?
Adding the presence of key terms like “chainsaw” or “shark” or specific character names as input
predictors would let us see the impact of those specific terms on sentiment positivity.
Developers of the Sharknado sequel could determine which specific aspects of the film were most
positively and negatively received by the audience and incorporate these concepts into the sequel.
Tips for Social Media Data Collection and Analysis
Based on this experiment, I have a few recommendations for approaching a similar problem going
forward.
Implement custom data dictionaries and custom categorizations: Using custom data
dictionaries, we could have the text data processing step immediately identify key terms that
are related to our particular topic. In this case, we could have created a custom dictionary with
character names, plot points, or key terms like “chainsaw” or “shark”. These terms might not be
recognized by the “standard” text analysis dictionaries, but they will help us automatically pull
out and identify entities that are important in our particular scenario.
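A custom dictionary behaves roughly like the sketch below; the categories and the character name here are hypothetical examples, not the actual dictionary format used by Text Data Processing.

```python
import re

# Hypothetical stand-in for a custom text-analysis dictionary: map domain
# terms and character names to our own entity categories so they are tagged
# even when the standard dictionaries would miss them.
CUSTOM_DICTIONARY = {
    "chainsaw": "PROP",
    "shark": "CREATURE",
    "sharknado": "EVENT",
    "fin": "CHARACTER",  # illustrative character-name entry
}

def tag_custom_entities(text):
    """Return (term, category) pairs for dictionary terms found in the text."""
    hits = []
    for term, category in CUSTOM_DICTIONARY.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append((term, category))
    return hits
```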
Scrub profanity and irrelevant tweets immediately: One thing I noticed when pulling in
Sharknado-related tweets was an abundance of profanity and Twitter spam. Scrubbing out
profanity is important if the tweet data is going to be included in Business Intelligence reports or
shared with others within the organization. Similarly, setting up policies to eliminate or avoid
spam-related Twitter accounts may help keep the feedback data more pure. I noticed accounts
that would tweet a message like “Get 500 followers free” and include the top 5 hashtags
trending on Twitter at the time. These tweets made up a huge portion of the data I collected,
and should have been immediately discarded based on the repetitive text so as not to influence
frequency and sentiment analysis.
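A simple version of this scrubbing step might look like the following Python sketch; the blocked-phrase list is a stand-in for a real profanity and spam dictionary.

```python
# Drop tweets containing blocked phrases (spam or profanity) and collapse
# exact repeats of the same text so repetitive spam cannot dominate
# frequency and sentiment counts.
BLOCKED_PHRASES = ["get 500 followers"]  # extend with a profanity list

def scrub(tweets):
    """Return the tweets that survive the blocked-phrase and repeat checks."""
    seen = set()
    kept = []
    for text in tweets:
        norm = " ".join(text.lower().split())
        if any(p in norm for p in BLOCKED_PHRASES):
            continue
        if norm in seen:  # repetitive text: keep the first occurrence only
            continue
        seen.add(norm)
        kept.append(text)
    return kept
```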
Construct descriptive attributes: Probably the most important part of this process is
constructing descriptive attributes for each of the tweets. These may include flags to indicate
whether the tweet included a key entity or category, length fields, or perhaps user information
that can be collected about the poster. These attributes might be related to the custom data
dictionaries relevant to the topic.
Identify and treat retweets differently: While the re-tweeted data is valuable in gauging
influence and frequency of the social media buzz, it can bias the sentiment analysis by
overwhelming the average sentiment with copies of the same information. Therefore, flagging
tweets that contain retweeted information and excluding those from some sentiment analysis
might eliminate sentiment bias of a single opinion or phrase that was retweeted many, many
times.
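A minimal retweet flag along these lines might look like this; the "RT @" heuristic is an assumption about how retweets appear in the raw text, not the exact rule used in the views.

```python
# Flag tweets that start with (or quote) "RT @user" as retweets, and
# exclude them when averaging sentiment so one widely retweeted opinion
# cannot overwhelm the average.
def is_retweet(text):
    return text.startswith("RT @") or " RT @" in text

def average_original_sentiment(tweets):
    """tweets: (text, sentiment) pairs; average only original tweets."""
    vals = [s for t, s in tweets if s is not None and not is_retweet(t)]
    return sum(vals) / len(vals) if vals else None
```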
Implementation of Sentiment Analysis Data
While the Sharknado example is a fun pop culture phenomenon, how does this become relevant to a
real-world organization? Collecting Twitter data relevant to an organization could provide nearly free
focus group-like feedback directly from customers who are most likely to influence their peers. For
example, a hotel chain could collect Twitter data not only from users that mention its brand name, but
also from users mentioning competitors’ names or just talking about hotels in the general sense. It can
then get an idea of what contributes to positive and negative sentiments about hotels. Do negative
sentiments most commonly accompany comments about cleanliness? Noise? Wait to check in? Staff?
Do positive sentiments stem from amenities like the pool or gym? What is the general sentiment for
customers of your hotel chain versus competitors? And are there particularly negative sentiments for
users of one particular location that might indicate a serious problem?
Furthermore, having this type of feedback available in a nearly real-time environment allows
organizations to monitor, respond to, and leverage social media buzz to increase audience or revenue
for the organization. For example, when SyFy executives saw the volume of social media posts and the
response to the initial Sharknado airing, SyFy was able to quickly schedule subsequent showings,
commit to a sequel, and arrange for the film to make its theatrical debut, disseminating this
information via Twitter while the topic was still trending. This equates to increasing awareness and
future audience at a very low cost. If SyFy had missed this window, they would have had to expend
significant marketing funds to re-generate this level of buzz. In fact, by leveraging this strong social
media buzz around the initial airing of Sharknado, SyFy actually garnered higher viewership with the re-
airing than it experienced during the initial premiere.
This type of feedback can give insight not only to what users might think about your organization’s
brand overall, but also could give an idea of the importance that specific product aspects hold in a user’s
experience. Understanding how the consumer values these factors could guide investment decisions or
marketing strategies by highlighting the features that customers care about and those that are not
meaningful.
Hillary Bliss, Analytics Practice Lead
Decision First Technologies
twitter @HillaryBlissDFT
Hillary Bliss is the Analytics Practice Lead at Decision First Technologies, and specializes in data
warehouse design, ETL development, statistical analysis, and predictive modeling. She works with clients
and vendors to integrate business analysis and predictive modeling solutions into the organizational
data warehouse and business intelligence environments based on their specific operational and
strategic business needs. She has a master’s degree in statistics and an MBA from Georgia Tech.