python for data science - tdc 2015

PYTHON FOR DATA SCIENCE Gabriel Moreira Machine Learning Engineer @gspmoreira 2015

Upload: gabriel-moreira

Post on 12-Aug-2015



Why so much buzz?

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Big Data

ONLINE PERSONALIZATION

WHAT IS DATA SCIENCE

http://drewconway.com

WHAT IS A DATA SCIENTIST

A Data Scientist is someone with deliberate dual personality who can first build a curious business case defined with a telescopic vision and can then dive deep with microscopic lens to sift through DATA to reach the goal while defining and executing all the intermittent tasks.

http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist

Data Science MetroMap Curriculum

http://nirvacana.com/thoughts/becoming-a-data-scientist/

TYPES OF ANALYTICS

Investigative Analytics (Consumers: Humans)

Operational Analytics (Consumers: Machines)

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/

[Hilary Mason, Data Scientist]

Inquire, Obtain, Scrub, Explore, Model, iNterpret

DATA SCIENCE IS IOSEMN


PYTHON IS IOSEMN


ANALYTICS CASE: CORPORATE SOCIAL NETWORKS

Full Data Analysis demo available in IPython Notebook: bit.ly/python4ds_nb

Investigative Analytics (Consumers: Humans)


INQUIRE

1. Which communities are more popular?
2. Is user engagement increasing?
3. What is the distribution of publishing time?
4. What is the distribution of user interactions?
5. Is there a relationship between publishing hour and number of interactions?


OBTAIN

• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)

READING INTERACTIONS FROM CSV
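The code shown on this slide is not part of the transcript. A minimal sketch of reading interaction records from CSV with the standard library might look like the following; the sample data and column names are hypothetical, and the demo notebook may well use a different tool.

```python
import csv
import io

# Hypothetical sample standing in for the interactions file; the column
# names are assumptions made for illustration.
SAMPLE_CSV = """post_id,user_id,interaction_type,timestamp
p1,u1,like,2015-03-02T09:15:00
p1,u2,comment,2015-03-02T10:40:00
p2,u1,like,2015-03-03T14:05:00
"""

def read_interactions(file_obj):
    """Parse CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(file_obj))

# With a real file, pass open("interactions.csv", newline="") instead.
interactions = read_interactions(io.StringIO(SAMPLE_CSV))
print(len(interactions), interactions[0]["interaction_type"])
```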

READING POSTS FROM JSON LINES
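JSON Lines stores one JSON object per line, so a file can be parsed line by line. The sketch below assumes hypothetical field names (`post_id`, `community`, `text`); the actual schema of the posts is not given in the transcript.

```python
import io
import json

# Hypothetical sample: one JSON object per line.
SAMPLE_JSONL = """{"post_id": "p1", "community": "architecture", "text": "JMeter tips"}
{"post_id": "p2", "community": "python", "text": "Intro to data analysis"}
"""

def read_json_lines(file_obj):
    """Return one dict per non-empty line of a JSON Lines stream."""
    posts = []
    for line in file_obj:
        line = line.strip()
        if line:  # tolerate blank lines
            posts.append(json.loads(line))
    return posts

posts = read_json_lines(io.StringIO(SAMPLE_JSONL))
print([p["community"] for p in posts])
```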


SCRUB


Dealing with nulls
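One common way to deal with nulls is to drop records that lack an essential field and fill the remaining gaps with defaults. The sketch below illustrates that idea only; the field names and the choice of defaults are assumptions, not the approach shown on the slide.

```python
def clean_records(records, required, defaults):
    """Drop records missing a required field; fill optional gaps with defaults."""
    cleaned = []
    for rec in records:
        # A value counts as null when it is None or an empty string.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        filled = dict(rec)
        for field, default in defaults.items():
            if filled.get(field) in (None, ""):
                filled[field] = default
        cleaned.append(filled)
    return cleaned

# Hypothetical records for illustration.
raw = [
    {"post_id": "p1", "community": "python", "likes": 3},
    {"post_id": "p2", "community": None, "likes": None},
    {"post_id": None, "community": "python", "likes": 1},
]
clean = clean_records(raw, required=["post_id"],
                      defaults={"community": "unknown", "likes": 0})
print(clean)
```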


EXPLORE

1 - WHICH COMMUNITIES ARE MORE POPULAR?
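As a rough illustration, popularity can be approximated by counting posts per community. This is a sketch under that assumption; the sample posts are hypothetical, and the slide may measure popularity differently (for example, by interactions).

```python
from collections import Counter

# Hypothetical posts; only the community field matters here.
posts = [
    {"post_id": "p1", "community": "python"},
    {"post_id": "p2", "community": "architecture"},
    {"post_id": "p3", "community": "python"},
    {"post_id": "p4", "community": "agile"},
    {"post_id": "p5", "community": "python"},
]

# Count posts per community and keep the two largest.
popularity = Counter(post["community"] for post in posts)
top_communities = popularity.most_common(2)
print(top_communities)
```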


2 - IS USER ENGAGEMENT INCREASING?
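One simple way to look at this question is to count interactions per calendar month and compare the totals over time. The timestamps below are hypothetical, and monthly bucketing is an assumption made for the sketch.

```python
from collections import Counter
from datetime import datetime

# Hypothetical interaction timestamps (ISO 8601).
timestamps = [
    "2015-01-10T09:00:00",
    "2015-01-22T16:30:00",
    "2015-02-03T11:00:00",
    "2015-02-14T10:10:00",
    "2015-02-27T08:45:00",
]

def interactions_per_month(timestamps):
    """Return (YYYY-MM, count) pairs in chronological order."""
    counts = Counter(
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m")
        for ts in timestamps
    )
    return sorted(counts.items())

monthly = interactions_per_month(timestamps)
print(monthly)
```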


3 - WHAT IS THE DISTRIBUTION OF PUBLISHING TIME?
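A possible sketch for this question groups posts by hour of day and prints a text histogram. The timestamps are hypothetical and the hourly granularity is an assumption.

```python
from collections import Counter
from datetime import datetime

# Hypothetical publishing timestamps (ISO 8601), for illustration only.
published_at = [
    "2015-03-02T09:15:00",
    "2015-03-02T09:50:00",
    "2015-03-03T14:05:00",
    "2015-03-04T09:30:00",
    "2015-03-04T18:20:00",
]

def posts_per_hour(timestamps):
    """Count posts by hour of day (0-23)."""
    hours = (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").hour for ts in timestamps)
    return Counter(hours)

hour_counts = posts_per_hour(published_at)
for hour in sorted(hour_counts):
    # One '#' per post published in that hour.
    print(f"{hour:02d}h {'#' * hour_counts[hour]}")
```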

4 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
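Interaction counts are often skewed, with a few posts attracting most of the activity, so comparing the mean with the median is a quick first check. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical number of interactions received by each post.
interactions_per_post = [0, 0, 1, 1, 2, 3, 5, 40]

summary = {
    "mean": statistics.mean(interactions_per_post),
    "median": statistics.median(interactions_per_post),
    "max": max(interactions_per_post),
}
# A mean well above the median suggests a long right tail.
print(summary)
```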


5 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
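A first, rough check for such a relationship is the Pearson correlation between publishing hour and interaction count. This is only a sketch with invented numbers; because hour of day is cyclic, a linear correlation can miss patterns that a per-hour average would reveal.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: publishing hour and interactions received per post.
hours = [8, 9, 10, 14, 20]
interactions = [12, 15, 9, 7, 3]
r = pearson(hours, interactions)
print(round(r, 2))
```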


http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/

Operational Analytics (Consumers: Machines)

MODEL

Operational Analytics Tasks example

1. Discover the most relevant words in the posts
2. Find related posts with similar content

Find Related Posts

1 - RELEVANT WORDS IN A POST

TF-IDF: the most "relevant" terms in a document are those that are frequent in that document and rare in other documents.
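The definition above can be sketched in a few lines of plain Python. This is the textbook form (term frequency times the log of inverse document frequency), written for illustration; it is not necessarily the variant used in the talk, and libraries such as scikit-learn apply smoothed versions by default.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized posts.
docs = [
    ["jmeter", "tests", "ruby"],
    ["jmeter", "performance"],
    ["python", "data"],
]
weights = tf_idf(docs)
# "tests" appears in one post only, so it outweighs the shared term "jmeter".
print(sorted(weights[0], key=weights[0].get, reverse=True))
```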


BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]

2 - SIMILAR POSTS

Cosine similarity: a measure of similarity between two vectors, defined as the cosine of the angle between them.
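For two term-weight vectors a and b, the cosine of the angle between them is their dot product divided by the product of their lengths. A minimal sketch, assuming each post is represented as a sparse `{term: weight}` dict (an assumption for this example):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Hypothetical term weights for three posts.
post_a = {"jmeter": 0.5, "ruby": 0.8}
post_b = {"jmeter": 0.4, "performance": 0.9}
post_c = {"python": 1.0}

print(cosine_similarity(post_a, post_b) > cosine_similarity(post_a, post_c))
```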


2 - SIMILAR POSTS

Original post:
Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture. modulo 13 arch definition architecture validation | academia de arquitetura

Most similar post (cosine similarity = 0.30), translated from Portuguese:
Some how-tos related to performance testing have been made available on the Enterprise Architecture site, in the performance Knowledge Base section. Among them: how to define requirements (throughput, calculating threads for JMeter, etc.), using JMeter, generating test data, and monitoring. planning and executing performance testing | enterprise architecture - how to identify performance acceptance criteria | enterprise architecture - how to geracao de massa de dados | enterprise architecture - how to jmeter | enterprise architecture - how to monitoramento | enterprise architecture

SIMILAR PEOPLE!


INTERPRET

• Drawing conclusions from your data
• Evaluating what your results mean
• Communicating your results

DATA PRODUCTS

“If information has context and the context is interactive, insights are not predictable."

[Agile Data Science, O’Reilly, 2014]

SENTIMENT ANALYSIS

bit.ly/eleicoes2014debatesbt

Analytical Dashboard


NETWORK ANALYSIS

https://linkedjazz.org/network/js

What about Python for Big Data?

PYTHON ON HADOOP

• Hadoop Streaming
• Hadoopy
• Pig UDFs in Jython

HADOOP STREAMING

Hadoop Streaming allows MapReduce jobs to be written as any executable script, including Python.
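In Hadoop Streaming, the mapper and reducer exchange tab-separated key/value lines over standard input and output, and the framework sorts the mapper output by key before it reaches the reducer. The word-count sketch below models both stages as plain functions so it can be run locally; in a real job each function would sit in its own script, reading `sys.stdin` and printing its results. The function names and sample input are illustrative.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() stands in for Hadoop's shuffle-and-sort phase.
lines = ["big data", "Big Python"]
shuffled = sorted(mapper(lines))
word_counts = list(reducer(shuffled))
print(word_counts)
```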

HADOOP STREAMING

http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/

K-Means with Python on MapReduce

140,000 lightning strikes on 28/02/2014, in 137 data files

Running on Amazon Elastic MapReduce
• Instances: 10 m1.small
• Time (k=10): 10 iterations => 32 minutes
• Time (k=50): 50 iterations => 164 minutes
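Each K-Means iteration maps naturally onto MapReduce: the map step assigns every point to its nearest centroid, and the reduce step averages the points assigned to each centroid. The following is a single-process sketch of that idea, not the implementation from the linked post; the function names and sample points are hypothetical.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def kmeans_map(points, centroids):
    """Map step: emit (centroid index, point) pairs."""
    for point in points:
        yield nearest(point, centroids), point

def kmeans_reduce(assignments, centroids):
    """Reduce step: new centroid = mean of the points assigned to it."""
    groups = defaultdict(list)
    for index, point in assignments:
        groups[index].append(point)
    updated = list(centroids)  # centroids with no points stay unchanged
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def kmeans(points, centroids, iterations):
    """Run a fixed number of map/reduce rounds."""
    for _ in range(iterations):
        centroids = kmeans_reduce(kmeans_map(points, centroids), centroids)
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = kmeans(points, [(0, 0), (10, 10)], iterations=1)
print(centroids)
```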

IS DATA SCIENTIST THE NEW WEBMASTER?

[Doing Data Science, O’Reilly, 2014]

DATA SCIENCE COURSES

• Introduction to Data Science (Univ. of Washington)

• Data Science specialization (Johns Hopkins)

• Intro to Hadoop and MapReduce (Cloudera)

• Machine Learning (Stanford)

• Statistical Learning (Stanford)

• Mining Massive Datasets (Stanford)

• Scalable Machine Learning (Berkeley)

http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/

BOOKS

Happy data geeking!

Gabriel Moreira

@gspmoreira

http://about.me/gspmoreira

Thank you!


Slides: http://bit.ly/python4ds_tdc