©2015 IBM Corporation Bluemix + Next-generation Analytics

Upload: adela-nicholson

Post on 13-Dec-2015


TRANSCRIPT

Page 1: ©2015 IBM Corporation Bluemix + Next- generation Analytics

©2015 IBM Corporation

Bluemix + Next-generation Analytics

Page 2: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Agenda

• Introductions
• Round table
• Introduction to Spark
• Set up development environment and create the hello world application
• Notebook walk-through
• Break
• Use case discussion
• Introduction to Spark Streaming
• Build an application with Spark Streaming: Sentiment analysis with Twitter and Watson Tone Analyzer

Page 3: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Introductions

Page 4: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Introductions

Our mission: We are here to help developers realize their most ambitious projects.

Goals for today's session:
• Set up a local development environment with Scala IDE for Eclipse.
• Write a hello world Scala project that runs on Spark, and build a custom library.
• Run locally on Spark.
• Deploy to a Jupyter notebook and Apache Spark on Bluemix.

Page 5: ©2015 IBM Corporation Bluemix + Next- generation Analytics

What is our motivation?
• Local or cloud development and deployment

Advantages of local development
• Rapid development
• Productivity
• Excellent for proof of concept

Disadvantages of local development
• Time-consuming to reproduce at a larger scale
• Difficult to share quickly
• Demanding on hardware resources

Page 6: ©2015 IBM Corporation Bluemix + Next- generation Analytics

What is Spark?

Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes.

Page 7: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Spark Core Libraries

• Spark Core: general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework

Page 8: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Key reasons for interest in Spark

Open Source
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc.)

Fast
• In-memory storage greatly reduces disk I/O
• Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk

Distributed data processing
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: batch, streaming, interactive

Productive
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX

Web Scale
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem

Page 9: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Ecosystem of IBM Analytics for Apache Spark as a service

Page 10: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Notebook walkthrough

‣https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/

‣Sign up on Bluemix https://console.ng.bluemix.net/registration/

‣Create an Apache Starter boilerplate application

‣Create notebooks in Python, Scala, or both

‣Run basic commands and get familiar with notebooks

Page 11: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Page 12: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Set up local development environment

• http://velocityconf.com/devops-web-performance-ny-2015/public/schedule/detail/45890
• Prerequisites
- Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html
- Homebrew http://brew.sh/
- Scala sbt http://www.scala-sbt.org/download.html
- Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz

Page 13: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Set up local development environment (contd.)

• Create a Scala project using sbt
• Create directories to start from scratch:

mkdir helloSpark && cd helloSpark
mkdir -p src/main/scala
mkdir -p src/main/java
mkdir -p src/main/resources

• Create the package subdirectory under the src/main/scala directory (the path must match the package com.ibm.cds.spark.samples):

mkdir -p src/main/scala/com/ibm/cds/spark/samples

• GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples

Page 14: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Set up local development environment (contd.)

• Create HelloSpark.scala using an IDE or a text editor
• Copy and paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    // main method invoked when running as a standalone Spark application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark)
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1)
        println(">>>>>>>Variance: " + stats._2)
        spark.stop()
    }

    // Library method that can be invoked from a Jupyter notebook
    def computeStatsForCollection(spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
        // Multiply as Long and cap at Int.MaxValue so the element count cannot overflow an Int
        val totalNumber = math.min(countPerPartitions.toLong * partitions, Int.MaxValue).toInt
        val rdd = spark.parallelize(1 until totalNumber, partitions)
        (rdd.mean(), rdd.variance())
    }
}
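As a sanity check on what HelloSpark should print, the same statistics can be worked out without Spark. The sketch below is plain Python and is not part of the original deck; the function name compute_stats_for_collection is a Python rename, and it assumes the default arguments (100000 elements per partition, 5 partitions) and that RDD.variance() returns the population variance.

```python
def compute_stats_for_collection(count_per_partition=100000, partitions=5):
    """Plain-Python stand-in for the Scala library method (no Spark involved)."""
    total_number = count_per_partition * partitions
    # Scala's `1 until totalNumber` excludes the upper bound, like range() here
    values = range(1, total_number)
    n = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    mean = total / n
    # Population variance, computed with exact integers before the final division
    variance = (n * sum_of_squares - total * total) / (n * n)
    return mean, variance


mean, variance = compute_stats_for_collection()
print("Mean:", mean)
print("Variance:", variance)
```

If the Spark job reports noticeably different numbers for the default arguments, the collection bounds or the partition count are the first things to check.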

Page 15: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Set up local development environment (contd.)

• Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory, run:

Download all dependencies
$ sbt update

Compile
$ sbt compile

Package an application jar file
$ sbt package

• Check for the packaged jar, helloSpark_2.10-1.0.jar, under the project root directory (sbt normally writes it to target/scala-2.10/)

Page 16: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Hello World application on Bluemix Apache Starter

Page 17: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Break

Join us in 15 minutes

Page 18: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Use-cases

• Customer Behavior Analytics (Retail & Merchandising)
- Refine strategy based on customer behaviour data
- …
• Churn Reduction (Telco, Cable, Schools)
- Predict customer drop-offs/drop-outs
• Cyber Security (IT, any industry)
- Network intrusion detection
- Fraud detection
- …
• Predictive Maintenance (IoT) (Update..)
- Predict system failure before it happens
• Network Performance Optimization (IT, any industry)
- Diagnose real-time device issues
- …

Page 19: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Use-cases

‣SETI use case for astronomers, data scientists, mathematicians, and algorithm designers.

Page 20: ©2015 IBM Corporation Bluemix + Next- generation Analytics

IBM Spark @ SETI - Application Architecture

Spark@SETI GitHub repository
• Python code modules for data access and analytics
• Jupyter notebooks
• Documentation and links to other relevant GitHub repos
• Standard GitHub collaboration functions

Import of signal data from SETI radio telescope data archives (~10 years)

Shared repository of SETI data in Object Store
• 200M rows of signal event data
• 15M binary recordings of "signals of interest"

Collaborative environment for project team data scientists (NASA, SETI Institute, Penn State, IBM Research)

Actively analyzing over 4TB of signal data. Results have already been used by SETI to re-program the radio telescope observation sequence to include "new targets of interest"

Page 21: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Spark Streaming

‣"Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams" (http://spark.apache.org/docs/latest/streaming-programming-guide.html)

‣Breaks the streaming data down into smaller pieces (micro-batches), which are then sent to the Spark engine
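The micro-batch idea can be illustrated without Spark. The sketch below is plain Python, not Spark Streaming code: the micro_batches function, the event tuples, and the one-second interval are all hypothetical, chosen only to show how a continuous stream is cut into fixed-duration pieces.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive batches of fixed duration."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Integer division assigns each event to the interval it falls in
        batches[int(timestamp // batch_interval)].append(payload)
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (seconds since start, tweet text)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
for batch in micro_batches(events, batch_interval=1.0):
    print(batch)
```

One simplification to keep in mind: this sketch skips intervals in which nothing arrived, whereas a streaming engine still produces a (possibly empty) batch for every interval.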

Page 22: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Spark Streaming

‣Provides connectors for multiple data sources:
- Kafka
- Flume
- Twitter
- MQTT
- ZeroMQ

‣Provides an API to create custom connectors. Lots of examples are available on GitHub and spark-packages.org

Page 23: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣Section 1: Set up the dev environment
1. Create a new Scala project
2. Configure the sbt dependencies
3. Create the application boilerplate code
4. Run the code
   a. Using an Eclipse launch configuration
   b. Using the spark-submit command line

Page 24: ©2015 IBM Corporation Bluemix + Next- generation Analytics

A Word about the Scala Programming Language

‣Scala is object-oriented but also supports a functional programming style
‣Bi-directional interoperability with Java
‣Resources:
• Official web site: http://scala-lang.org
• Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html
• Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o

Page 25: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.1: Create a new Scala project

‣Refer to the "Set up development environment" section earlier in this presentation

‣Resource: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/

Page 26: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.2: Configure the sbt dependencies

Spark dependencies resolved with sbt update

Extra dependencies needed by this app

Page 27: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.2: Configure the sbt dependencies

‣Run "sbt update" to resolve all the dependencies and download them into your local Apache Ivy repository (in <home>/.ivy2/cache)

‣Optional: If you are using Scala IDE for Eclipse, run "sbt eclipse" to generate the Eclipse project and associated classpath that reflects the project dependencies

‣Run "sbt assembly" to generate an uber jar that contains your code and all the required dependencies as defined in build.sbt

Page 28: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.3: Create the Application boilerplate code

• Boilerplate code that creates a Twitter stream

Page 29: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.4.a: Run the code using an Eclipse launch configuration

SparkSubmit is the Main class that runs this job

Tell SparkSubmit which class to run

Page 30: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 1.4.b: Run the code using spark-submit command line

‣Package the code as a jar: "sbt assembly"
- Generates

‣Run the job using the spark-submit script available in the Spark distribution:
- $SPARK_HOME/bin/spark-submit --class com.ibm.cds.spark.samples.StreamingTwitter --jars <path>/tutorial-streaming-twitter-watson-assembly-1.0.jar

Page 31: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣Section 2: Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix

Page 32: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 2.1: Configure OAuth credentials for Twitter

‣You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter

Page 33: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 2.2: Create a Watson Tone Analyzer Service on Bluemix

‣You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix

Page 34: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣Section 3: Work with Twitter data
1. Create a Twitter Stream
2. Enrich the data with sentiment analysis from Watson Tone Analyzer
3. Aggregate data into RDD with enriched Data model
4. Create SparkSQL DataFrame and register Table

Page 35: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.1: Create a Twitter Stream

Create a map that stores the credentials for the Twitter and Watson services:

// Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4j requires credentials to be stored in System properties:

// Copy the twitter4j entries into System properties; skip unset values, since System.setProperty rejects null
config.foreach( (t:(String,String)) => if ( t._1.startsWith( "twitter4j") && t._2 != null ) System.setProperty( t._1, t._2 ))
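For readers who will work mostly in the notebook, the same configuration pattern can be sketched in Python. This is an illustration under assumptions, not the application's code: load_config and missing_keys are hypothetical helpers, and reading from environment variables stands in for Java System properties. Only the key names come from the slide.

```python
import os

TWITTER_KEYS = [
    "twitter4j.oauth.consumerKey",
    "twitter4j.oauth.consumerSecret",
    "twitter4j.oauth.accessToken",
    "twitter4j.oauth.accessTokenSecret",
]
WATSON_KEYS = ["watson.tone.url", "watson.tone.username", "watson.tone.password"]


def load_config(source=os.environ):
    """Build the key/value map; unset credentials become None, like orNull in Scala."""
    config = {key: source.get(key) for key in TWITTER_KEYS + WATSON_KEYS}
    # tweets.key falls back to an empty string, like getOrElse("") in Scala
    config["tweets.key"] = source.get("tweets.key", "")
    return config


def missing_keys(config):
    """List the credentials that still need a value before the stream can start."""
    return sorted(key for key, value in config.items() if value is None)


config = load_config({"twitter4j.oauth.consumerKey": "abc"})
print(missing_keys(config))
```

Checking for missing keys up front gives a clearer failure than an authentication error raised later from inside the stream.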

Page 36: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.1: Create a Twitter Stream

// Filter the tweets to keep only the ones with English as the language
// twitterStream is a discretized stream (DStream) of twitter4j Status objects
// CharMatcher comes from Guava (com.google.common.base.CharMatcher)
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] {
            u => Option(u.getLang)
        }.getOrElse("").startsWith("en") &&                            // Allow only tweets that use "en" as the language
        CharMatcher.ASCII.matchesAllOf(status.getText) &&              // Only pick text that is ASCII
        ( keys.isEmpty || keys.exists{status.getText.contains(_)})     // If the user specified #hashtags to monitor
    }

Initial DStream of Status Objects
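The three conditions in the filter above can be tried out on their own. The sketch below restates them in plain Python; keep_tweet is a hypothetical helper, and lang, text, and keys stand in for the user's language, the tweet text, and the list of monitored hashtags.

```python
def keep_tweet(lang, text, keys):
    """Return True when a tweet passes the three checks from the Scala filter."""
    is_english = (lang or "").startswith("en")           # a missing language counts as ""
    is_ascii = all(ord(ch) < 128 for ch in text)         # same range as an ASCII matcher
    matches_keys = not keys or any(key in text for key in keys)
    return is_english and is_ascii and matches_keys


print(keep_tweet("en", "Nice #SuperBloodMoon tonight", ["#SuperBloodMoon"]))
print(keep_tweet("fr", "Nice #SuperBloodMoon tonight", []))
print(keep_tweet("en-gb", "caf\u00e9 time", []))
```

An empty keys list keeps every English, ASCII-only tweet, which matches the keys.isEmpty branch in the Scala code.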

Page 37: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.2: Enrich the data with sentiment analysis from Watson Tone Analyzer

// Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)

val rowTweets = twitterStream.map(status=> {
    lazy val client = PooledHttp1Client()
    val sentiment = callToneAnalyzer(client, status,
        broadcastVar.value.get("watson.tone.url").get,
        broadcastVar.value.get("watson.tone.username").get,
        broadcastVar.value.get("watson.tone.password").get
    )
    …
}

Initial DStream of Status Objects

Page 38: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.2: Enrich the data with sentiment analysis from Watson Tone Analyzer

Initial DStream of Status Objects

Data Model
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: integer (nullable = true)
 |-- long: integer (nullable = true)
 |-- Cheerfulness: double (nullable = true)
 |-- Negative: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Analytical: double (nullable = true)
 |-- Confident: double (nullable = true)
 |-- Tentative: double (nullable = true)
 |-- Openness: double (nullable = true)
 |-- Agreeableness: double (nullable = true)
 |-- Conscientiousness: double (nullable = true)

DStream of key,value pairs

Page 39: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.3: Aggregate data into RDD with enriched Data model

…..
// Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})

[Diagram: each micro-batch of the initial DStream yields a RowTweets RDD; its rows (Row 1 … Row n) are added to workingRDD, which uses the data model from Section 3.2 (author, date, lang, text, lat, long, and the nine tone scores)]
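The aggregation loop above can be mimicked with plain lists to see what ends up in the working collection. This is a sketch, not Spark code: aggregate is a hypothetical function, each batch is a list of (row, extra) pairs standing in for one micro-batch RDD, and list concatenation stands in for union.

```python
def aggregate(batches):
    """Fold a sequence of micro-batches into one working list of rows."""
    working = []
    for batch in batches:
        if len(batch) > 0:                         # mirrors rdd.count() > 0
            rows = [pair[0] for pair in batch]     # mirrors rdd.map( t => t._1 )
            working = rows + working               # mirrors newRows.union( workingRDD )
    return working


# Hypothetical micro-batches; the second one is empty and gets skipped
batches = [[("row1", "x")], [], [("row2", "x"), ("row3", "x")]]
print(aggregate(batches))
```

The working collection only ever grows, so for a long-running stream it is worth deciding when to persist or truncate it.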

Page 40: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 3.4: Create SparkSQL DataFrame and register Table

// Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )

// Register a temporary table using the name "tweets"
df.registerTempTable("tweets")

println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()

(sqlContext, df)

workingRDD (Row 1, Row 2, Row 3, Row 4, …, Row n) becomes the relational SparkSQL table:

author       date                 lang   Cheerfulness   Negative   …   Conscientiousness
John Smith   10/11/2015 – 20:18   en     0.0            65.8       …   25.5
Alfred       …                    en     34.5           0.0        …   100.0
…            …                    …      …              …          …   …
Chris        …                    en     85.3           22.9       …   0.0

Page 41: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣Section 4: IPython Notebook analysis
1. Load the data into an IPython Notebook
2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
3. Analytic 2: Compute the top 10 hashtags contained in the tweets
4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags

Page 42: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Introduction to Notebooks

‣Notebooks allow creation of interactive executable documents that include rich text with Markdown, executable code with Scala, Python or R, and graphics with matplotlib

‣Apache Spark provides APIs in multiple languages that can be executed with a REPL shell: Scala, Python (PySpark), R

‣Multiple open-source implementations available:
- Jupyter: https://jupyter.org
- Apache Zeppelin: http://zeppelin-project.org

Page 43: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.1: Load the data into an IPython Notebook

‣ You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb

Create a SQLContext from a SparkContext

Load from parquet file and create a DataFrame

Create a SQL table and start executing SQL queries

Page 44: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

# Create an array that will hold the count for each sentiment
sentimentDistribution = [0] * 9
# For each sentiment, run a SQL query that counts the number of tweets for which the sentiment score is greater than 60%
# Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i] = sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount
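The per-sentiment counting loop can be tried without a Spark cluster. The sketch below swaps in Python's built-in sqlite3 module for Spark SQL, which is an assumption made purely for illustration; the table holds three made-up rows and only three of the nine tone columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author TEXT, Cheerfulness REAL, Negative REAL, Anger REAL)")
# Made-up sample rows, loosely modelled on the table shown in Section 3.4
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [("John Smith", 0.0, 65.8, 75.0), ("Alfred", 34.5, 0.0, 0.0), ("Chris", 85.3, 22.9, 61.0)],
)

sentiments = ["Cheerfulness", "Negative", "Anger"]
sentiment_distribution = [0] * len(sentiments)
for i, sentiment in enumerate(sentiments):
    # Column names come from a fixed list, so building the query by concatenation is safe here
    query = "SELECT count(*) FROM tweets WHERE " + sentiment + " > 60"
    sentiment_distribution[i] = conn.execute(query).fetchone()[0]

print(dict(zip(sentiments, sentiment_distribution)))
```

One query per column is easy to read but scans the table once per sentiment; a single query with one conditional sum per column would do the same work in one pass.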

Page 45: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

Use matplotlib to create a bar chart

Page 46: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.2: Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

Bar Chart Visualization

Page 47: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.3: Analytic 2: Compute the top 10 hashtags contained in the tweets

Initial Tweets → flatMap → RDD → filter → Filter hashtags → map → Key, value pair RDD → reduceByKey → Reduced map with counts → sortByKey → Sorted map by key
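The chain of transformations above can be followed step by step on a small list. This is a plain-Python sketch with made-up tweets, not the notebook's PySpark code: top_hashtags is a hypothetical name, and collections.Counter stands in for the map and reduceByKey steps.

```python
from collections import Counter


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    words = [word for text in tweets for word in text.split()]     # flatMap: tweets to words
    hashtags = [word for word in words if word.startswith("#")]    # filter: keep hashtags
    counts = Counter(hashtags)                                     # map + reduceByKey: count per tag
    return counts.most_common(n)                                   # sort by count, descending


# Hypothetical tweet texts
tweets = ["#a #b hello", "#a world", "#b #a"]
print(top_hashtags(tweets))
```

The result has the same shape as the top10tags list used in Section 4.4, where each entry is indexed as i[0] for the tag.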

Page 48: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.3: Analytic 2: Compute the top 10 hashtags contained in the tweets

Page 49: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.3: Analytic 2: Compute the top 10 hashtags contained in the tweets

Page 50: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

‣Problem:
- Compute the mean of all the emotion scores for each of the top 10 hashtags
- Format the data in a way that can be consumed by the plot script

Page 51: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )

tweets (Type: DataFrame)

author          …   Cheerfulness
Jake            …   0.0
Scrad           …   23.5
Nittya Indika   …   84.0
…               …   …
Madison         …   93.0

tagsRDD (Type: RDD)

Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u' Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
Row(author=u' Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)

Page 52: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )

tagsRDD before the filter:

Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u' Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
Row(author=u' Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)

tagsRDD after the filter:

Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)

Page 53: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 3: Create a flatMap using the expand function defined here; this will be used to collect all the scores
# for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = []
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )

tagsRDD before the flatMap:

Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)

FlatMap of encoded values:

u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
u'#ALDUBThisMustBeLove-Analytical:85.0'
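The expand function on this slide targets Python 2 (it calls unicode). A Python 3 rendition that runs without Spark might look like the sketch below; the Tweet namedtuple, the two-tone column list, and the tag counts are made up for illustration, and a nested list comprehension plays the role of flatMap.

```python
from collections import namedtuple

# Hypothetical row type, reduced to the text and two tone columns
Tweet = namedtuple("Tweet", ["text", "Cheerfulness", "Negative"])
cols = ["Cheerfulness", "Negative"]
# (tag, count) pairs; the counts are made up
top10tags = [("#SuperBloodMoon", 2), ("#ALDUBThisMustBeLove", 1)]


def expand(t):
    """Emit one 'Tag-Tone:Score' string per tone for every top tag found in the tweet."""
    ret = []
    for tag in [pair[0] for pair in top10tags]:
        if tag in t.text:
            for tone in cols:
                ret.append(tag + "-" + tone + ":" + str(getattr(t, tone)))
    return ret


tweets = [
    Tweet("Explains my disappointment #SuperBloodMoon", 0.0, 100.0),
    Tweet("no tags here", 50.0, 50.0),
]
encoded = [item for t in tweets for item in expand(t)]   # plain-Python flatMap
print(encoded)
```

A tweet that mentions several top tags contributes its scores to each of them, which is why a flatMap rather than a map is needed.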

Page 54: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))

Before the map:

u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
u'#ALDUBThisMustBeLove-Analytical:85.0'

After the map (key, value):

u'#SuperBloodMoon-Cheerfulness'      0.0
u'#SuperBloodMoon-Negative'          100.0
u'#SuperBloodMoon-Negative'          23.5
u'#ALDUBThisMustBeLove-Analytical'   85.0

Page 55: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 5: Call combineByKey to format the data as follows
# Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
                  (lambda x, y: (x[0] + y, x[1] + 1)),
                  (lambda x, y: (x[0] + y[0], x[1] + y[1])))

createCombiner: creates the initial (sum, count) tuple from the first value seen for a key
mergeValue: called for each new value; adds it to the (sum, count) tuple
mergeCombiners: the reduce part; merges two (sum, count) combiners

Before combineByKey (key, value):

u'#SuperBloodMoon-Cheerfulness'      0.0
u'#SuperBloodMoon-Negative'          100.0
u'#SuperBloodMoon-Negative'          23.5
u'#ALDUBThisMustBeLove-Analytical'   85.0

After combineByKey (key, (sum, count)):

u'#Supermoon-Confident'                (0.0, 3)
u'#HajjStampede-Tentative'             (0.0, 3)
u'#KiligKapamilya-Conscientiousness'   (290.0, 6)
u'#LunarEclipse-Tentative'             (92.0, 4)
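To see how the three functions cooperate, combineByKey can be emulated over plain lists. The sketch below is an illustration, not Spark's implementation: combine_by_key is a hypothetical helper, and the two inner lists play the role of two partitions, so mergeValue runs inside a partition and mergeCombiners runs across partitions.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate combineByKey over a list of partitions, each a list of (key, value) pairs."""
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)   # key already seen here
            else:
                combiners[key] = create_combiner(value)               # first value for this key
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            merged[key] = merge_combiners(merged[key], combiner) if key in merged else combiner
    return merged


# Two hypothetical partitions of (Tag-Tone, score) pairs
partitions = [
    [("#SuperBloodMoon-Negative", 100.0), ("#SuperBloodMoon-Cheerfulness", 0.0)],
    [("#SuperBloodMoon-Negative", 23.5)],
]
result = combine_by_key(
    partitions,
    lambda x: (x, 1),                              # createCombiner: (sum, count)
    lambda acc, y: (acc[0] + y, acc[1] + 1),       # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),       # mergeCombiners
)
print(result)
```

Carrying (sum, count) rather than a running average is what makes the merge step correct: two averages cannot be combined without knowing how many values each one covers.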

Page 56: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 6: Re-index the map so that the key is the Tag and the value is a (Tone, Average_score) tuple
# Key=Tag
# Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1], 2))))

Before the map (key, (sum, count)):

u'#Supermoon-Confident'                (0.0, 3)
u'#HajjStampede-Tentative'             (0.0, 3)
u'#KiligKapamilya-Conscientiousness'   (290.0, 6)
u'#LunarEclipse-Tentative'             (92.0, 4)

After the map (key, (tone, average)):

u'#Supermoon'        (u'Confident', 0.0)
u'#HajjStampede'     (u'Tentative', 0.0)
u'#KiligKapamilya'   (u'Conscientiousness', 48.33)
u'#LunarEclipse'     (u'Tentative', 23.0)
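The lambda in Step 6 unpacks a tuple in its parameter list, which only works in Python 2. On Python 3 the same re-indexing can be written as a named function, as in the sketch below; reindex is a hypothetical name and the sample pairs are taken from the slide.

```python
def reindex(pair):
    """Turn ('Tag-Tone', (sum, count)) into ('Tag', ('Tone', average))."""
    key, (total, count) = pair
    tag, tone = key.split("-", 1)
    return tag, (tone, round(total / count, 2))


combined = [
    ("#KiligKapamilya-Conscientiousness", (290.0, 6)),
    ("#LunarEclipse-Tentative", (92.0, 4)),
]
for pair in combined:
    print(reindex(pair))
```

With Spark this function would be passed to map directly in place of the lambda.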

Page 57: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 7: Reduce the map on the Tag key; the value becomes a list of (Tone, average_score) tuples
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )

Before reduceByKey:

u'#Supermoon'        (u'Confident', 0.0)
u'#HajjStampede'     (u'Tentative', 0.0)
u'#KiligKapamilya'   (u'Conscientiousness', 48.33)
u'#LunarEclipse'     (u'Tentative', 23.0)

After reduceByKey:

u'#HajjStampede'     [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon'        [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon'        [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'   [(u'Conscientiousness', 48.33), (u'Anger', 0.0), ..., (u'Agreeableness', 10.83)]
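The slides call makeList without showing its definition. One plausible version, offered here as an assumption rather than the notebook's actual helper, wraps a bare tuple in a list and leaves an existing list alone. The sketch pairs it with a small reduce_by_key emulation (also hypothetical) so the behaviour can be checked without Spark.

```python
def make_list(value):
    """Assumed helper: wrap a single (tone, score) tuple in a list; pass lists through."""
    return value if isinstance(value, list) else [value]


def reduce_by_key(pairs, func):
    """Emulate reduceByKey: fold the values of each key together with func."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    return grouped


pairs = [
    ("#Supermoon", ("Confident", 0.0)),
    ("#Supermoon", ("Openness", 91.0)),
    ("#bloodmoon", ("Anger", 0.0)),
]
reduced = reduce_by_key(pairs, lambda x, y: make_list(x) + make_list(y))
print(reduced)
```

Note the edge case this exposes: the reducing function is never called for a key with a single value, so that value stays a bare tuple instead of a one-element list. In this analysis every tag carries nine tones, so the case does not arise.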

Page 58: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 8: Sort the (Tone, average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )

Before the sort:

u'#HajjStampede'     [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon'        [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon'        [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'   [(u'Conscientiousness', 48.33), (u'Anger', 0.0), ..., (u'Agreeableness', 10.83)]

After the sort:

u'#HajjStampede'     [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon'        [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)]
u'#bloodmoon'        [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'   [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]

Page 59: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 9: Format the data as expected by the plotting code in the next cell.
# Map the values to a tuple as follows: ([list of tones], [list of average scores])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )

Before mapValues:

u'#HajjStampede'     [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon'        [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)]
u'#bloodmoon'        [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'   [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]

After mapValues (the value is a tuple of 2 arrays: tones, scores):

u'#HajjStampede'     ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0])
u'#Supermoon'        ([u'Agreeableness', u'Confident', ..., u'Openness'], [20.33, 0.0, …, 91.0])
u'#bloodmoon'        ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0])
u'#KiligKapamilya'   ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...])
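Steps 8 and 9 both act on the value side of each pair, so they can be tried together on a single list. The sketch below is plain Python with a hypothetical function name, sort_and_split, and sample scores taken from the slides.

```python
def sort_and_split(tone_scores):
    """Sort (tone, score) tuples by tone, then split them into two parallel lists."""
    ordered = sorted(tone_scores)                 # Step 8: tuples compare on the tone name first
    tones = [tone for tone, _ in ordered]         # Step 9: first array holds the tones
    scores = [score for _, score in ordered]      # Step 9: second array holds the averages
    return tones, scores


value = [("Tentative", 0.0), ("Agreeableness", 3.67), ("Cheerfulness", 100.0)]
print(sort_and_split(value))
```

Sorting first matters for the chart: every hashtag then lists its tones in the same order, so the bars line up across hashtags.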

Page 60: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

# Step 10: Use a custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)

Before the sort:

u'#HajjStampede'     ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0])
u'#Supermoon'        ([u'Agreeableness', u'Confident', ..., u'Openness'], [20.33, 0.0, …, 91.0])
u'#bloodmoon'        ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0])
u'#KiligKapamilya'   ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...])

After the sort:

u'#Superbloodmon'         ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [33.97, 19.38, …, 12.85])
u'#BBWLA'                 ([u'Agreeableness', u'Confident', ..., u'Openness'], [38.33, 12.34, …, 21.43])
u'#ALDUBThisMustBeLove'   ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 62.0])
u'#Newmusic'              ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [0.0, 0.0, 68.33, ...])
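The custom sort works because customCompare maps each tag to its count in top10tags, and sorting on that count in descending order reproduces the top-10 ranking. The plain-Python sketch below shows the same idea with sorted; the counts and the placeholder values are made up.

```python
# (tag, count) pairs in ranking order; the counts are made up
top10tags = [("#Superbloodmon", 50), ("#BBWLA", 30), ("#Newmusic", 10)]


def custom_compare(key):
    """Return the tag's count from top10tags, or 0 for an unknown tag."""
    for tag, count in top10tags:
        if tag == key:
            return count
    return 0


# Entries arrive in arbitrary order; the second element stands in for the (tones, scores) tuple
entries = [("#Newmusic", "..."), ("#Superbloodmon", "..."), ("#BBWLA", "...")]
ordered = sorted(entries, key=lambda entry: custom_compare(entry[0]), reverse=True)
print([tag for tag, _ in ordered])
```

Tags that tie on count keep their incoming relative order here, because Python's sort is stable; that detail is specific to this emulation.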

Page 61: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

Page 62: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Section 4.4: Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

Page 63: ©2015 IBM Corporation Bluemix + Next- generation Analytics

Possible Improvements to the Twitter + Watson App

‣ Leverage the tweet geolocation to show emotion trends by location on a map

‣ Get the Twitter historical stream from Gnip Decahose or IBM Insight for Twitter Service

‣ Add scalability and robustness by using Kafka message hub

Page 64: ©2015 IBM Corporation Bluemix + Next- generation Analytics


Thank You