Machine Learning with Spark and R


Post on 06-Aug-2015



Big Data Analytics: Some challenges (R and Spark)

Armando Vieira, Data Scientist @Stratified Medical & @dataAI


Summary

Challenges of big data analytics

Paradigm shift

Why R?

R + Spark = SparkR

Some case studies


Challenges of Big Data


What’s New?

• Ability to process huge quantities of data:
  – Hadoop & NoSQL vs relational databases

• A mindset shift:
  – DATA is no longer static
  – DATA can be reused
  – DATA can reveal secrets

• Correlation versus causality

• No sampling: N = ALL


How Big is Big Data?

• Exponential growth
• Google processes 24 petabytes per day (1 petabyte = 10^15 bytes)
• Facebook receives 10M photo uploads per hour and 3 billion “likes” per day
• YouTube adds one hour of video every second
• ….
• In 2000, 25% of data were digital
• In 2007, 300 exabytes of data were stored, equivalent to 300 billion compressed digital films; digital data represented 93% of all data
• In 2012, 1200 exabytes were stored, with digital data representing 98% of all data: like 5 piles of CDs reaching the moon
• Every person on Earth now has 320 times the information estimated to have been stored in the Library of Alexandria
• In Gutenberg’s time it took 50 years to double the amount of information; now it takes 3 years


3 main shifts

• MORE Data
  – We can now process almost ALL the data we want
  – Using ALL data lets us see details we could not see when we were limited

• MESSY Data
  – Having ALL data available, we can forgive some imperfections in them
  – Removing the sampling error allows for some measurement error
  – The loss in accuracy at the micro level is compensated by the insight at the macro level

• From causality to CORRELATION
  – Big Data tells us the “WHAT”, not the “WHY”
  – From validation of our hypotheses to observing connections we never thought about
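The MESSY Data trade-off can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population (mean 50, spread 10), the measurement noise (standard deviation 2), the sample size of 100 and the function name `compare_sample_vs_all` are all hypothetical choices, not figures from the slides.

```python
import random
import statistics

def compare_sample_vs_all(pop_size=100_000, sample_size=100, trials=200,
                          noise_sd=2.0, seed=0):
    """Compare a small clean sample against the full but noisy data set."""
    rng = random.Random(seed)
    # Hypothetical population of true values (assumed mean 50, spread 10).
    population = [rng.gauss(50.0, 10.0) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    # "N = ALL": every record is used, but each carries measurement error.
    noisy_all = [x + rng.gauss(0.0, noise_sd) for x in population]
    error_all = abs(statistics.fmean(noisy_all) - true_mean)

    # Traditional approach: exact measurements, but only a small random sample.
    squared_errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        squared_errors.append((statistics.fmean(sample) - true_mean) ** 2)
    rmse_sample = statistics.fmean(squared_errors) ** 0.5
    return error_all, rmse_sample

error_all, rmse_sample = compare_sample_vs_all()
print(f"error of the mean using all noisy records: {error_all:.4f}")
print(f"typical error of the mean from a clean sample: {rmse_sample:.4f}")
```

Under these assumptions the estimate from all noisy records lands much closer to the true mean than a typical clean sample does, because the measurement errors average out over many records while the sampling error does not.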


Datafication

• Taking information on everything and making it analyzable opens the door to new uses for the data

• Data is the OIL of the “Information Economy” and will soon move onto the balance sheets of companies

• Subject-matter experts will become less relevant, replaced by data scientists


Risks

• Moving from human-driven decisions (based on small datasets) to machine-based decisions (based on huge datasets containing OUR data) has implications

• Who regulates the algorithms?
• How do we preserve the “sanctity” of individual volition?

• Examples:
  – Data predict you will have a heart attack soon. Insurance asks you to pay more
  – Data predict you will default on a mortgage. Mortgage is denied
  – Data predict you will commit a crime. Should you be arrested?


23andMe

• For $100 they (used to) analyze your DNA to reveal traits that make you more likely to develop certain heart problems and cancers

• But they only sequence a small portion of your DNA: the part covering the markers they know

• So, if a new marker is discovered – they would need to sequence you again

• So, working with a subset only answers the questions you considered in advance


Data in medicine

• Steve Jobs had his entire DNA sequenced (3B base pairs)

• In choosing medications, doctors normally hope for similarities between what they know of their patient’s DNA and that of the people who participated in the drug’s trial

• In Jobs’s case, they could precisely select drugs according to their efficacy given his genetic make-up

• They kept changing the treatment as the cancer mutated

• This did not save Steve’s life, but extended it by many years


Machine recommendations

• Nobody knows WHY a customer who bought book A also wants to buy book B
• But one third of Amazon’s sales result from this system
• 75% of orders for Netflix come from this system
• It is like the merchandise placed close to the cashiers, but it analyses your cart in real time and puts the right merchandise in the basket

• Professional skills and subject-matter expertise have no impact on those sales processes

• Knowing what, not why, is good enough

• Correlation cannot foretell the future, but by identifying a really good proxy for a phenomenon, it can predict it with a certain likelihood
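A “customers who bought A also bought B” recommender can be sketched with plain co-occurrence counts. This is a toy illustration of the idea, not a description of how Amazon or Netflix actually implement their systems; the order history and the helper names `build_cooccurrence` and `recommend` are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(orders):
    """Count, for every item, how often each other item shares an order with it."""
    co_counts = defaultdict(Counter)
    for order in orders:
        # Each ordered pair (a, b) of distinct items in one basket counts once.
        for a, b in permutations(set(order), 2):
            co_counts[a][b] += 1
    return co_counts

def recommend(co_counts, item, k=1):
    """Return the k items most often bought together with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

# Toy order history, invented purely for illustration.
orders = [
    ["book A", "book B"],
    ["book A", "book B", "book C"],
    ["book A", "book C"],
    ["book A", "book B"],
    ["book D"],
]
co_counts = build_cooccurrence(orders)
print(recommend(co_counts, "book A"))  # ['book B']
```

The counts say nothing about why the two books go together; they only record that they do, which is the WHAT-not-WHY point of the slide.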


Don’t make hypotheses, be data-driven

• Walmart, the largest retailer in the world, crossed its historical sales data with weather reports. It discovered that before every hurricane, people rushed to buy… Pop-Tarts, a sugary snack. Now they know, and they stock them next to the hurricane supplies

• Nobody could have made that hypothesis

• The traditional approach was to make hypotheses and validate them through tests: slow, cumbersome and influenced by our biases

• Let sophisticated computational analysis identify the optimal proxy

• No need to know which search terms are correlated with flu
• No need to know the rules the airlines use to compute prices
• No need to know the tastes of Walmart buyers
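Letting the computation pick the proxy can be sketched as a scan over candidate series, ranked by the strength of their correlation with the signal of interest. The weekly figures, product names and the function `best_proxy` below are invented for illustration and are not Walmart data; a real analysis would also have to guard against spurious correlations when scanning many candidates.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_proxy(candidates, target):
    """Score every candidate against the target; return the strongest and all scores."""
    scores = {name: pearson(series, target) for name, series in candidates.items()}
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores

# Toy weekly figures, invented for illustration only.
hurricane_warning = [0, 0, 1, 0, 1, 0, 0, 1]  # 1 = a warning was issued that week
weekly_sales = {
    "pop_tarts": [10, 11, 30, 9, 28, 10, 12, 31],
    "milk": [50, 52, 51, 49, 50, 53, 48, 51],
    "bottled_water": [5, 7, 8, 6, 7, 5, 6, 9],
}
best, scores = best_proxy(weekly_sales, hurricane_warning)
print(best)  # pop_tarts
```

No hypothesis about snacks is needed up front: every product is scored the same way and the data decide which one stands out.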


Credit Risk

• Theory: the higher the leverage, the higher the risk.

• What we found: the higher the leverage, the lower the risk.

• WHY? Because we were dealing with the subprime market


Why R?

• Powerful data manipulation
• Easy to learn
• Community driven
• Over 4000 packages and growing fast
• Powerful graphical capabilities: ggplot2 and D3
• Interactivity (through the Shiny package)
• FREE


Where does R stand in the ecosystem?

• SAS: SAS has been the undisputed market leader in the commercial analytics space. The software offers a huge array of statistical functions, has a good GUI (Enterprise Guide & Miner) that lets people learn quickly, and provides strong technical support. However, it ends up being the most expensive option and is not always enriched with the latest statistical functions.

• R: R is the open-source counterpart of SAS and has traditionally been used in academia and research. Because of its open-source nature, the latest techniques get released quickly. There is a lot of documentation available on the internet, and it is a very cost-effective option.

• Python: Originating as an open-source scripting language, Python has grown in usage over time. Today it sports libraries (numpy, scipy and matplotlib) and functions for almost any statistical operation or model building you may want to do. Since the introduction of pandas, it has become very strong in operations on structured data.


However…

• R is not fit to work in parallel

• Revolution Analytics

• Data has to fit in memory

• Data has to be structured
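The usual way around the in-memory and single-process limits is to split the data into partitions, reduce each partition to a small summary, and combine the summaries. The sketch below shows that pattern in plain Python; it does not use the Spark or SparkR API, and the function names and the synthetic data source are hypothetical.

```python
from functools import reduce

def partial_stats(partition):
    """Reduce one partition to a small (sum, count) summary."""
    total, count = 0.0, 0
    for value in partition:  # only one value is held at a time
        total += value
        count += 1
    return total, count

def combine(a, b):
    """Merge two (sum, count) summaries; associative, so the merge order is free."""
    return a[0] + b[0], a[1] + b[1]

def partitioned_mean(partitions):
    """Mean over all partitions without materialising the full data set."""
    total, count = reduce(combine, map(partial_stats, partitions), (0.0, 0))
    return total / count

def make_partitions(n_values, partition_size):
    """Hypothetical data source: yields lazy partitions of the values 1..n_values."""
    for start in range(1, n_values + 1, partition_size):
        yield range(start, min(start + partition_size, n_values + 1))

print(partitioned_mean(make_partitions(100_000, 10_000)))  # 50000.5
```

Because `combine` is associative, the per-partition summaries could be computed on different cores or machines and merged in any order, which is the property distributed engines rely on.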


Spark


Spark vs Hadoop


SparkR


Benchmarks: sorting data

Source: Databricks


Visualizations

• R + ggplot2
• D3 & DC for interactive visualizations
• R + Google motion charts (see armandoanalytics.blogspot.pt)
• R + Shiny for interactive plots
• R + Leaflet graphs (see farquasar.shinyaaps.io/shiny_pop)


Applications / case studies

• Social Care UK


Ontologies: knowledge representation


Objective: Map the entire biomedical knowledge of humanity into a knowledge graph


Deep Learning: a revolution


Our Facial Emotion Recognition

Labels: 0 angry, 1 disgust, 2 fear, 3 happy, 4 sad, 5 surprise, 6 neutral

2,0,2,4,4,6,4,3,3,5
0,6,6,6,3,5,3,2,0,6
6,2,0,4,3,3,5,3,3,5
3,6,3,6,3,6,6,6,0,3
0,3,2,0,6,2,3,6,6,2
2,5,5,6,4,2,0,3,6,2
6,5,3,4,3,0,6,3,0,2
4,4,2,2,0,6,0,0,5,0
3,5,3,4,4,4,4,6,5,4
6,6,4,0,6,6,2,3,6,3

Online test on: http://miguelpedroso.com/?page_id=3624


Bimodal deep learning

[Diagram labels: Text Input, Image Input; Shared Representation (“concepts”); Text Reconstruction, Image Reconstruction]


Internet of Things


Case studies


Thank you!

• Armando.lidinwise.com

• StratifiedMedical.com

• @lidinwise
