practical data analysis in python

Practical Data Analysis in Python, Hilary Mason, @hmason, www.hilarymason.com

Upload: hilary-mason

Post on 06-Dec-2014


DESCRIPTION

These are the slides from my presentation to the NYC Python Meetup on July 28, 2009. The presentation was an overview of data analysis techniques and various Python tools and libraries, along with a practical example (with code and algorithms) of a Twitter spam filter implemented with NLTK.

TRANSCRIPT

Page 1: Practical Data Analysis in Python

Practical Data Analysis in Python

Hilary Mason

@hmason

Page 2: Practical Data Analysis in Python

Data is ubiquitous.

The ability and tools to use it are not.

Page 3: Practical Data Analysis in Python

(Focused) Data == Intelligence

Page 4: Practical Data Analysis in Python

Data Analysis on the Web

Data items change rapidly.

Data items are not independent.

There’s a lot of semi-structured data around.

There’s a LOT of data around.

== Too many problems, few tools, and few experts.

Page 5: Practical Data Analysis in Python

Entity Disambiguation

This is important.

Page 6: Practical Data Analysis in Python

ME

UGLY HAG

Page 7: Practical Data Analysis in Python

Entity Disambiguation

This is important.

Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?

This is a hard problem.
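A naive baseline helps show why this is hard. The sketch below assumes a hand-built alias table and a short list of corporate suffixes (`ALIASES`, `SUFFIXES`, and `canonical_name` are hypothetical names, not from the talk). It handles the three spellings on this slide only because "MS" was added to the table by hand, which does not scale to real data.

```python
import re

# Hypothetical alias table; real systems need far more than a lookup.
ALIASES = {"ms": "microsoft"}
# Assumed list of corporate suffixes to strip before comparing names.
SUFFIXES = re.compile(r"\b(corporation|corp|inc|ltd)\b\.?", re.IGNORECASE)


def canonical_name(name):
    """Lowercase, drop common corporate suffixes, then apply known aliases."""
    cleaned = SUFFIXES.sub("", name).lower()
    cleaned = re.sub(r"[^a-z0-9 ]", "", cleaned)
    cleaned = " ".join(cleaned.split())
    return ALIASES.get(cleaned, cleaned)


names = ["Microsoft", "Microsoft Corporation", "MS"]
canonical = {canonical_name(n) for n in names}
print(canonical)
```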

Page 8: Practical Data Analysis in Python

SPAM sucks

Page 9: Practical Data Analysis in Python

Classification

Document classification.

Image recognition.

Topic recognition.

Page 10: Practical Data Analysis in Python

Text Parsing

Page 11: Practical Data Analysis in Python

Recommendation Systems

Product recommendations.

Disease predictions.

Behavior analysis.

Page 12: Practical Data Analysis in Python

IEEE Tag Clustering

immunity

ultrasound

medical imaging

medical devices

thermoelectric devices

fault-tolerant circuits

low power devices

Page 13: Practical Data Analysis in Python

Python for Data Analysis

import why_python_is_awesome

Python is readable.

Easy to transition from Matlab or R.

Numerical computing support.

Growing set of machine learning libraries.

Page 14: Practical Data Analysis in Python

Libraries

NLTK (Natural Language Toolkit) – www.nltk.org

mlpy (Machine Learning PY) – mlpy.fbk.eu

numpy & scipy – scipy.org

Page 15: Practical Data Analysis in Python

An EC2 AMI provisioned with all of the toys you need:

http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/

MachetEC2

Page 16: Practical Data Analysis in Python

Demo: Classifying Tweets

Page 17: Practical Data Analysis in Python

Supervised Classification

[Diagram: Training Data → Feature Extractor → Trained Classifier; Text → Feature Extractor → Trained Classifier → Spam / Not Spam]

Page 18: Practical Data Analysis in Python

Data: Tweets

Hand-classified. For example, some spam:

| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |

| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |

| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |

| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |

| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |

Page 19: Practical Data Analysis in Python

Features

    def document_features(self, document):
        document_words = set(document)
        features = {}
        for word in self.word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features

Break tweets into lists of relevant words.
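The method above depends on a surrounding class that the slide does not show. Here is a standalone sketch of the same idea, assuming a deliberately naive whitespace tokenizer and a hand-picked vocabulary. Both `tokenize` and `WORD_FEATURES` are hypothetical stand-ins; the slides do not say how `word_features` was actually chosen.

```python
# Hypothetical vocabulary; the slides do not say how word_features was built.
WORD_FEATURES = ["followers", "more", "http", "friend"]


def tokenize(tweet):
    """Lowercase and split on whitespace (an assumed, naive tokenizer)."""
    return tweet.lower().split()


def document_features(document, word_features=WORD_FEATURES):
    """Map a list of tokens to boolean 'contains(word)' features."""
    document_words = set(document)
    return {"contains(%s)" % word: (word in document_words) for word in word_features}


features = document_features(tokenize("oh yay more new followers"))
print(features)
```

Each tweet becomes a dictionary of boolean features, which is the input shape NLTK classifiers expect when paired with a label.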

Page 20: Practical Data Analysis in Python

Naïve Bayesian Classifier

P(A|B) = the conditional probability of A given B

http://yudkowsky.net/rational/bayes

http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/

classifier = nltk.NaiveBayesClassifier.train(train_set)
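To make the one-line training call less of a black box, here is a minimal from-scratch sketch of a naive Bayes classifier over boolean features. It is not NLTK's implementation: the class name `TinyNaiveBayes`, the add-one smoothing, and the toy training set are all assumptions made for illustration. The only thing it shares with the call above is the input shape, a list of (features, label) pairs.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Naive Bayes over {feature_name: value} dicts with add-one smoothing."""

    def __init__(self, train_set):
        # train_set is a list of (features_dict, label) pairs.
        self.label_counts = Counter(label for _, label in train_set)
        self.value_counts = defaultdict(Counter)  # (label, name) -> value counts
        self.values = defaultdict(set)  # name -> set of values seen in training
        for features, label in train_set:
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def log_score(self, features, label):
        # log P(label) + sum over features of log P(value | label)
        total = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total)
        for name, value in features.items():
            count = self.value_counts[(label, name)][value] + 1
            denom = self.label_counts[label] + len(self.values[name])
            score += math.log(count / denom)
        return score

    def classify(self, features):
        # Pick the label with the highest posterior score.
        return max(self.label_counts, key=lambda label: self.log_score(features, label))


# Toy training data, made up for illustration.
train_set = [
    ({"contains(followers)": True, "contains(http)": True}, "spam"),
    ({"contains(followers)": True, "contains(http)": True}, "spam"),
    ({"contains(followers)": False, "contains(http)": True}, "not_s"),
    ({"contains(followers)": False, "contains(http)": False}, "not_s"),
]
classifier = TinyNaiveBayes(train_set)
label = classifier.classify({"contains(followers)": True, "contains(http)": True})
print(label)
```

The "naive" part is the sum inside `log_score`: each feature contributes independently given the label, which is rarely true of words in a tweet but works well enough in practice.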

Page 21: Practical Data Analysis in Python

Classifier Accuracy

Use a hand-classified test set to see the accuracy of the classifier:

nltk.classify.accuracy(classifier, test_set)
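Accuracy here is the fraction of hand-labelled test items whose predicted label matches the hand-assigned one. A minimal sketch of that calculation follows, assuming a stand-in `RuleClassifier` (hypothetical, used only so the example runs without a trained model) and a made-up test set.

```python
class RuleClassifier:
    """Hypothetical stand-in: flags a tweet as spam if it mentions followers."""

    def classify(self, features):
        return "spam" if features.get("contains(followers)") else "not_s"


def accuracy(classifier, test_set):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    correct = sum(
        1 for features, label in test_set if classifier.classify(features) == label
    )
    return correct / len(test_set)


# Toy hand-labelled test set, made up for illustration.
test_set = [
    ({"contains(followers)": True}, "spam"),
    ({"contains(followers)": False}, "not_s"),
    ({"contains(followers)": True}, "not_s"),
    ({"contains(followers)": False}, "not_s"),
]
score = accuracy(RuleClassifier(), test_set)
print(score)
```

The test set should be kept separate from the training data; scoring on tweets the classifier has already seen overstates how well it will do on new ones.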

Page 22: Practical Data Analysis in Python

Feature Relevance

contains(') = True           not_s : spam   = 53.6 : 1.4
contains(") = True           not_s : spam   = 32.2 : 1.1
contains(#) = True           not_s : spam   = 22.0 : 1.0
contains(!) = True           not_s : spam   = 10.8 : 1.0
contains(*) = True           spam : not_s   = 7.4 : 1.0
contains(=) = True           not_s : spam   = 5.5 : 1.0
contains(i) = False          spam : not_s   = 5.2 : 1.0
contains(?) = True           not_s : spam   = 2.4 : 1.0
contains(:) = True           spam : not_s   = 2.3 : 1.0
contains(&) = True           not_s : spam   = 1.8 : 1.0
contains(;) = True           not_s : spam   = 1.6 : 1.0
contains($) = True           spam : not_s   = 1.5 : 1.0
contains(u) = True           spam : not_s   = 1.5 : 1.0
contains(2.0) = False        not_s : spam   = 1.4 : 1.0
contains(saw) = False        not_s : spam   = 1.4 : 1.0
contains(noble) = False      not_s : spam   = 1.4 : 1.0
contains(sound) = False      not_s : spam   = 1.3 : 1.0
contains(approach) = False   not_s : spam   = 1.3 : 1.0
contains(finally) = False    not_s : spam   = 1.3 : 1.0
contains(more) = False       spam : not_s   = 1.3 : 1.0
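Each row reads as a likelihood ratio: how much more often a feature takes that value under one label than under the other. A rough sketch of the calculation from raw counts is below, assuming plain relative frequencies with no smoothing. NLTK works from smoothed probability estimates, so its numbers will differ, and the counts here are invented for illustration.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """P(feature value | label a) / P(feature value | label b), unsmoothed."""
    return (count_a / total_a) / (count_b / total_b)


# Invented counts: the feature value appears in 30 of 100 spam tweets
# and in 4 of 100 non-spam tweets.
ratio = likelihood_ratio(30, 100, 4, 100)
print("spam : not_s = %.1f : 1.0" % ratio)
```

Ratios far from 1 mark the features that separate the classes most strongly, which is why punctuation dominates the top of the list above.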

Page 23: Practical Data Analysis in Python

Kitchen Sink

wash, rinse, repeat

Page 24: Practical Data Analysis in Python

Results

90% accuracy on spam tweets – not bad!

Other possibilities:

categorization – what do you tweet about?

human vs bot?

which celebrity tweeter are you?

Page 25: Practical Data Analysis in Python

<3 Data

Thank you!