Practical Data Analysis in Python
DESCRIPTION
These are the slides from my presentation to the NYC Python Meetup on July 28, 2009. The presentation was an overview of data analysis techniques and various Python tools and libraries, along with a practical example (with code and algorithms) of a Twitter spam filter implemented with NLTK.
TRANSCRIPT
Data is ubiquitous.
The ability and tools to use it are not.
(Focused) Data == Intelligence
Data Analysis on the Web
Data items change rapidly.
Data items are not independent.
There’s a lot of semi-structured data around.
There’s a LOT of data around.

== Too many problems, few tools, and few experts.
Entity Disambiguation
This is important.
Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
This is a hard problem.
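One naive first cut at the problem can be sketched with string similarity alone: normalize away corporate suffixes, then compare the remaining names. The suffix list and the 0.85 cutoff below are illustrative assumptions, and the "MS" case shows why pure string matching isn't enough.

```python
# A minimal sketch of naive company-name matching, assuming simple
# normalization plus a string-similarity ratio (not a full solution).
from difflib import SequenceMatcher

SUFFIXES = {"corporation", "corp", "inc", "ltd", "llc", "co"}

def normalize(name):
    words = name.lower().replace(".", "").replace(",", "").split()
    return " ".join(w for w in words if w not in SUFFIXES)

def similar(a, b, cutoff=0.85):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= cutoff

print(similar("Microsoft", "Microsoft Corporation"))  # True
print(similar("Microsoft", "MS"))  # False -- abbreviations defeat string matching
```

Abbreviations, shared names, and context-dependent references are exactly what make the problem hard: they require knowledge beyond the strings themselves.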
SPAM sucks
Classification
Document classification.
Image recognition.
Topic recognition.
Text Parsing
Recommendation Systems
Product recommendations.
Disease predictions.
Behavior analysis.
IEEE Tag Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Python for Data Analysis
import why_python_is_awesome
Python is readable.
Easy to transition from Matlab or R.
Numerical computing support.
Growing set of machine learning libraries.
Libraries
NLTK (Natural Language Toolkit) – www.nltk.org
mlpy (Machine Learning PY) – mlpy.fbk.eu
numpy & scipy – scipy.org
An EC2 AMI provisioned with all of the toys you need:
http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
MachetEC2
Demo: Classifying Tweets
Supervised Classification

[Diagram: Training Data → Feature Extractor → Trained Classifier; new tweets → Text Feature Extractor → Trained Classifier → Spam / Not Spam]
Data: Tweets
Hand-classified. For example, some spam:
| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
Features

def document_features(self, document):
    document_words = set(document)
    features = {}
    for word in self.word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Break tweets into lists of relevant words.
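The tokenizing step can be approximated in a few lines. The regex below is an illustrative assumption (the talk used NLTK's tokenizers); it keeps URLs whole, lowercases words, and emits punctuation marks as separate tokens, since the feature-relevance table later shows characters like `!` and `#` carry signal.

```python
# A minimal sketch of tweet tokenization, assuming a simple regex
# (not NLTK's actual tokenizer): URLs, words, or single punctuation marks.
import re

def tokenize(tweet):
    return re.findall(r"http\S+|[a-z0-9']+|[!?#*=&;:$\"]", tweet.lower())

print(tokenize("oh yay more new followers!!! http://xyyx.eu/a1hb"))
# → ['oh', 'yay', 'more', 'new', 'followers', '!', '!', '!', 'http://xyyx.eu/a1hb']
```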
Naïve Bayesian Classifier
P(A|B) = the conditional probability of A given B
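A worked example makes the formula concrete. The counts below are made up for illustration; the computation is Bayes' theorem applied to one feature word:

```python
# P(spam | "followers") = P("followers" | spam) * P(spam) / P("followers")
# All counts here are hypothetical.
p_spam = 40 / 100            # 40 of 100 training tweets are spam
p_word_given_spam = 30 / 40  # "followers" appears in 30 of the 40 spam tweets
p_word_given_ham = 6 / 60    # ...and in 6 of the 60 non-spam tweets

# total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # → 0.833
```

The classifier is "naïve" because it multiplies such per-feature probabilities together as if the features were independent.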
http://yudkowsky.net/rational/bayes
http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
classifier = nltk.NaiveBayesClassifier.train(train_set)
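NLTK does the heavy lifting in that one call, but the underlying idea fits in a short class. The sketch below is not NLTK's implementation; it is a simplified from-scratch version assuming boolean features and add-one smoothing, with made-up training tweets:

```python
# A from-scratch sketch of what NaiveBayesClassifier.train does,
# simplified to boolean features with add-one smoothing.
import math
from collections import defaultdict

class TinyNaiveBayes:
    def __init__(self, label_counts, feature_counts):
        self.label_counts = label_counts      # label -> number of examples
        self.feature_counts = feature_counts  # (label, name, value) -> count

    @classmethod
    def train(cls, train_set):
        label_counts = defaultdict(int)
        feature_counts = defaultdict(int)
        for features, label in train_set:
            label_counts[label] += 1
            for name, value in features.items():
                feature_counts[(label, name, value)] += 1
        return cls(label_counts, feature_counts)

    def classify(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in features.items():
                hits = self.feature_counts[(label, name, value)] + 1
                score += math.log(hits / (count + 2))  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labeled training data.
train_set = [
    ({"contains(followers)": True, "contains(http)": True},  "spam"),
    ({"contains(followers)": True, "contains(http)": True},  "spam"),
    ({"contains(lunch)":     True, "contains(http)": False}, "not_spam"),
    ({"contains(meeting)":   True, "contains(http)": False}, "not_spam"),
]
classifier = TinyNaiveBayes.train(train_set)
print(classifier.classify({"contains(followers)": True, "contains(http)": True}))
# → spam
```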
Classifier Accuracy
Use a hand-classified test set to see the accuracy of the classifier:
nltk.classify.accuracy(classifier, test_set)
Feature Relevance

contains(')        = True     not_s : spam  =  53.6 : 1.4
contains(")        = True     not_s : spam  =  32.2 : 1.1
contains(#)        = True     not_s : spam  =  22.0 : 1.0
contains(!)        = True     not_s : spam  =  10.8 : 1.0
contains(*)        = True     spam : not_s  =   7.4 : 1.0
contains(=)        = True     not_s : spam  =   5.5 : 1.0
contains(i)        = False    spam : not_s  =   5.2 : 1.0
contains(?)        = True     not_s : spam  =   2.4 : 1.0
contains(:)        = True     spam : not_s  =   2.3 : 1.0
contains(&)        = True     not_s : spam  =   1.8 : 1.0
contains(;)        = True     not_s : spam  =   1.6 : 1.0
contains($)        = True     spam : not_s  =   1.5 : 1.0
contains(u)        = True     spam : not_s  =   1.5 : 1.0
contains(2.0)      = False    not_s : spam  =   1.4 : 1.0
contains(saw)      = False    not_s : spam  =   1.4 : 1.0
contains(noble)    = False    not_s : spam  =   1.4 : 1.0
contains(sound)    = False    not_s : spam  =   1.3 : 1.0
contains(approach) = False    not_s : spam  =   1.3 : 1.0
contains(finally)  = False    not_s : spam  =   1.3 : 1.0
contains(more)     = False    spam : not_s  =   1.3 : 1.0
Kitchen Sink
wash, rinse, repeat
Results
90% accuracy on spam tweets – not bad!
Other possibilities:
categorization – what do you tweet about?
human vs. bot?
which celebrity tweeter are you?
<3 Data
Thank you!