Post on 28-May-2020

Page 1: HPCC Systems in Educationcdn.hpccsystems.com/presentations/HPC_Meetup-HPCC... · Compute Intensive (HPC) Programs described at very low level Specify detailed control of processing

Page 1 HPCC Systems - http://hpccsystems.com Risk Solutions Page 1

HPCC Systems in Education

See, Save, Skip: Sentiment Analysis using HPCC One Click Thor on AWS

Edin Muharemagic, Ph.D.

Architect and Data Scientist HPCC Systems

Overview

HPCC Systems Update: The Best Open Source Data Intensive Super Computing Platform

Machine Learning Library has been released!

HPCC in Education

One-Click Thor on AWS

Sentiment Analysis using HPCC

LexisNexis Risk Solutions and HPCC Systems

LexisNexis:

30-year history with rich tradition in legal and academic markets

LexisNexis Risk Solutions:

New division with 10-year history

Products and services assess risk, verify identity, detect fraud, and help customers answer questions like “who are you?”, “how much risk is associated with you?”, “what type of network do you have?”

Customers: banks, insurance carriers, health care organizations, law enforcement, Federal Government

Built “Big Data” solutions for 10 years:

Data Refinery (Thor)

Data Delivery (Roxie)

ECL – High Level Parallel Programming Language

HPCC Systems: Open Source Data Intensive Super Computing Platform

One Platform End-to-End: Simple

Consistent and elegant HW&SW architecture across the complete platform

Data-Driven World

Science: Databases from astronomy, genomics, natural languages, seismic modeling, …

Humanities: Scanned books, historic documents, …

Commerce: Corporate sales, stock market transactions, census, airline traffic, …

Entertainment: Internet images, Hollywood movies, MP3 files, …

Medicine: MRI & CT scans, patient records, …

Science Paradigms

eScience: Jim Gray http://research.microsoft.com/~Gray

A thousand years ago: science was empirical, describing natural phenomena

Last few hundred years: a theoretical branch, using models and generalizations

Last few decades: a computational branch, simulating complex phenomena

Today: data exploration (eScience), unifying theory, experiment, and simulation

Data captured by instruments or generated by simulator

Processed by software

Information/knowledge stored in computer

Scientist analyzes database / files using data management and statistics

$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$

Data-Intensive Applications

Rely on large, ever-changing data sets: collecting and maintaining data represents a major effort

Have complex computational requirements: from simple queries to large-scale analyses

Require parallel processing

Program at an abstract level

HPCC, a DISC (Data Intensive Super Computing) platform, is a perfect platform for the data-intensive application domain

Parallel Processing Classification

Compute Intensive (HPC)

Compute-bound applications

Performance measured in xFLOPS (x=tera, peta…)

Involves parallelizing algorithms (i.e. decompose application into separate tasks)

Functional (Control) Parallelism

Data Intensive (HPCC)

I/O bound applications

Performance measured in xORPS (x=B as in billion)

Involves subdividing data into segments, using the same application to process segments in parallel, and reassembling results at the end of processing

Data Parallelism
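
The data-parallel pattern described above (subdivide the data, run the same application on every segment, reassemble the results) can be sketched in a few lines. This is a minimal illustration in Python, not HPCC or ECL code: the function names, the word-count task, and the use of a thread pool are all assumptions made for the example. A real data-intensive platform would distribute the segments across cluster nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(records, n_segments):
    """Subdivide the data into roughly equal contiguous segments."""
    size = max(1, -(-len(records) // n_segments))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(segment):
    """The 'same application', applied independently to each segment."""
    return sum(len(line.split()) for line in segment)


def parallel_word_count(records, n_segments=4):
    """Process all segments in parallel, then reassemble the partial results."""
    segments = split_into_segments(records, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partial_counts = list(pool.map(count_words, segments))
    return sum(partial_counts)  # reassembly step
```

The reassembly step here is a plain sum because word counts combine associatively; other tasks would need a different merge step.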

Programming Models

Compute Intensive (HPC)

Programs described at very low level

Specify detailed control of processing & communications

Rely on small number of software packages

Written by specialists

Limits classes of problems & solution methods

Data Intensive (HPCC)

Application programs written in terms of high-level operations on data

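Runtime system controls scheduling, load balancing, …

[Diagram: HPC stack, layers as listed on the slide: Hardware, Machine-Dependent Programming Model, Software Packages, Application Programs]

[Diagram: data-intensive stack, layers as listed on the slide: Hardware, Machine-Independent Programming Model, Runtime System, Application Programs]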

Machine Learning Library

ML Documentation

Andrew Ng

Presenter notes: Octave, MATLAB, Weka: high-quality ML libraries already exist. So what is the big deal about the HPCC ML library? An FAU professor using Weka for his research is ‘hungry’ for computing cycles; it takes 6 months to get some results. The HPCC ML library is a library of fully parallel algorithms.

Housing price prediction.

[Figure: plot of price ($, in 1000’s, axis from 0 to 400) against size (feet², axis from 0 to 2500)]

Regression: Predict continuous valued output (price)

Supervised Learning

“right answers” given
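
As a rough illustration of regression as supervised learning, the sketch below fits a straight line to (size, price) pairs with ordinary least squares and uses it to predict a continuous price. It is a plain-Python sketch, not the HPCC ML library; the function names are hypothetical, and any data passed in would be the "right answers" supplied for training.

```python
def fit_line(sizes, prices):
    """Ordinary least-squares fit of price = slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(size, slope, intercept):
    """Predict a continuous-valued output (price) for a given size."""
    return slope * size + intercept
```

A single-feature line is the simplest possible model; it is used here only to show the shape of the fit-then-predict workflow.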

Supervised Learning

[Figure: plot with axes x1 and x2]

Unsupervised Learning

[Figure: plot with axes x1 and x2]

Organize computing clusters

Social network analysis

Astronomical data analysis

Market segmentation

Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)

Open Data Intensive Computing course at FAU

Expose students and faculty to newest technology

Help faculty & PhD researchers concentrate on addressing real problems (e.g. ML experiments do not need to take 6 months to produce results)

Get smart people working together

University is an open forum for free exchange of ideas

Build HPCC following and community

Harness that Open Source community power to keep improving HPCC and stay relevant

Open Data Intensive Computing course at FAU

How to make it interesting, interactive, entertaining?

A number of Universities already have similar courses, based on Hadoop

Browsing those offerings turned up an interesting approach:

http://www.youtube.com/watch?v=kO8x8eoU3L4

Open Data Intensive Computing course at FAU

Q: What is the best DISC? A: HPCC!

Decided against it

Instead, created a hands-on, interactive course, covering:

Thor Architecture (Cluster components and their purpose)

Thor Configuration (Let’s build the Cluster)

ECL Programming (Let’s get the cluster busy)

Roxie Architecture (Let’s deliver)

ML with HPCC

Had 15 students (4 undergraduate and 11 graduate)

Building an HPCC

FAU Cloud: VMware vSphere Hypervisor

College of Engineering IT: Serge and Mahesh allocated 32 nodes

Students used those nodes to build HPCC clusters

Configuration process is well documented:

http://hpccsystems.com/community/docs/installing-running-hpcc-platform

Initial Setup – Single Node

Configuring Multi node System

Starting and Stopping

One-Click Thor on AWS

https://aws.hpccsystems.com

Mining the Web for Feelings

Computers are good at crunching numbers! Can they do feelings?

Emerging field: Sentiment Analysis! Translate human emotions into hard data

Cultural factors and language nuances make it difficult to deduce pro or con sentiment (e.g. sinful & chocolate cake)

Becoming a standard feature of search engines: fine-tune results based on sentiment (e.g. best hotel in Boca)

Business: “online opinion represents virtual currency that makes or breaks a product in the marketplace”

Casual Web surfer: Tweetfeel, Twendz and Twitrratr

TV watcher: “See, Save, Skip – Aspect-Based Sentiment Analysis using HPCC”

SEE SAVE SKIP: ASPECT-BASED SENTIMENT ANALYSIS

Charlene Gilbert

Florida Atlantic University

[email protected]

Intro. to Sentiment Analysis

a.k.a. Sentiment Classification or Opinion Mining

Given text, determine polarity:

Positive

Negative

Neutral

See Save Skip: Television

“So many channels and nothing to watch!”

Not only TV but DVR, Netflix, Hulu, etc.

Decide which shows to:

See (Live)

Save (for Later)

Skip

Twitter

140 characters

Hashtags, @mentions, Search

200 million tweets/day

Embraced by television shows

Twitter Tickers

GetGlue

Twitter

Application Programming Interface (API)

Search API: Keyword, Location, Date, Language

Streaming API: Real Time, 400 Keywords

Keyword Based Sentiment

Lists of Affective Words

Count words in tweet

Classify sentiment with most words

Other ways… Naïve Bayes

Table 2: Sample Emoticon/Abbreviation List

Positive Negative

>:] :’(

:-) :(

:) T_T

:o) :c

8) :<

:D <.<

XD WTF

FTW FML

LOL FTL

Stop Words

Common words to filter out

Pre-Existing List
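
Stop-word filtering amounts to a set-membership test on each token. The sketch below is in Python rather than ECL, and the small `STOP_WORDS` set is a hypothetical stand-in for the pre-existing list mentioned on the slide, whose contents are not given here.

```python
# Hypothetical stand-in for the pre-existing stop-word list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop common words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]
```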

Sentiment Classification

A Tweet is Split into Tokens

Joined with 3 Affective Word Lists

Score(Positive Words) := 1

Score(Negative Words) := -1

Score(Neither) := 0

Sum of Token Scores

Sum > 0 := 1 (Positive)

Sum < 0 := -1 (Negative)

Sum = 0 := 0 (Neutral)

ECL DEMO
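
The ECL demo itself is not part of this transcript. As a stand-in, here is a minimal Python sketch of the scoring rule on this slide: each token scores 1, -1, or 0, and the sign of the sum gives the tweet's class. The word lists are tiny hypothetical examples, whitespace tokenization is an assumption, and the slide's "join" against word lists is approximated with set lookups.

```python
# Hypothetical affective word lists; the real lists are not in the transcript.
POSITIVE_WORDS = {"great", "love", ":)", "lol"}
NEGATIVE_WORDS = {"boring", "hate", ":(", "fml"}


def token_score(token):
    """Score(Positive) := 1, Score(Negative) := -1, Score(Neither) := 0."""
    t = token.lower()
    if t in POSITIVE_WORDS:
        return 1
    if t in NEGATIVE_WORDS:
        return -1
    return 0


def classify_tweet(text):
    """Return 1 (positive), -1 (negative), or 0 (neutral) from the token sum."""
    total = sum(token_score(tok) for tok in text.split())
    return (total > 0) - (total < 0)  # sign of the sum
```

For example, `classify_tweet("love this show :)")` sums to 2 and is classified as positive, while a tweet with one positive and one negative word sums to 0 and is neutral.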

See Save Skip Classification

Get all non-neutral tweets

Get percentage of positive tweets

See – 100%-80% Positive

Save – 80%-60% Positive

Skip – Below 60%

Somewhat arbitrary; might classify on a bell curve
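
The thresholds above translate directly into a small function. This Python sketch takes per-tweet labels (1, -1, 0), drops the neutral ones, and maps the positive percentage to a verdict. The slide does not say which class owns the exact 80% and 60% boundaries, so treating them as inclusive lower bounds is an assumption, as is returning `None` when no non-neutral tweets exist.

```python
def see_save_skip(labels):
    """Map tweet labels (1 positive, -1 negative, 0 neutral) to a verdict."""
    non_neutral = [label for label in labels if label != 0]
    if not non_neutral:
        return None  # assumption: no verdict without non-neutral tweets
    positive = sum(1 for label in non_neutral if label > 0)
    pct_positive = 100.0 * positive / len(non_neutral)
    if pct_positive >= 80:  # assumption: boundary values round up
        return "See"
    if pct_positive >= 60:
        return "Save"
    return "Skip"
```

Under these assumptions, a show with 89% positive non-neutral tweets maps to "See" and one with 70% maps to "Save".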

Some Results

Sentiment: The Sing Off

Positive 89%

Negative 11%

See Save Skip Classification: See!

Sentiment: Two Broke Girls

Positive 70%

Negative 30%

See Save Skip Classification: Save