Have data? What now?!
DESCRIPTION
A brief overview of common data analysis problems and algorithms.
TRANSCRIPT
![Page 1: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/1.jpg)
Have Data? What now?!
Hilary Mason
@hmason
![Page 2: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/2.jpg)
(Focused) Data == Intelligence
![Page 3: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/3.jpg)
Common Problems
Gathering data
Parsing, Entity Extraction and Disambiguation
Clustering
Document classification
NLP
![Page 4: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/4.jpg)
Text is MESSY
![Page 5: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/5.jpg)
Do you need to parse it?
Parsing unstructured data is hard. (We’ll get to this.)
CHEAT.
Open Calais (www.opencalais.com) currently supports:
Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
![Page 6: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/6.jpg)
Entity Disambiguation
This is important.
![Page 7: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/7.jpg)
ME
UGLY HAG
![Page 8: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/8.jpg)
Entity Disambiguation
This is important.
Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
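One simple starting point for the question above is to normalize names before comparing them. The following is a minimal sketch, not the method used in the talk: the alias table, the suffix list, and the function name `canonical_company` are all hypothetical, and a real table would come from human review or an external data source.

```python
import re

# Hypothetical alias table: in practice this would be curated by people
# or filled in from a data API.
ALIASES = {
    "microsoft": "Microsoft",
    "ms": "Microsoft",
}

# Common legal suffixes to strip before lookup (illustrative, incomplete).
SUFFIXES = {"corporation", "corp", "inc", "incorporated", "llc", "ltd", "co"}


def canonical_company(name):
    """Map a raw company string to a canonical name, or None if unknown."""
    # Lowercase and split on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]
    # Drop trailing legal suffixes such as "Corporation" or "Inc.".
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return ALIASES.get(" ".join(tokens))


for raw in ["Microsoft", "Microsoft Corporation", "MS", "Volt"]:
    print(raw, "->", canonical_company(raw))
```

A lookup like this only handles spellings someone has already seen; ambiguous short names such as "MS" still need context to resolve safely.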
![Page 9: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/9.jpg)
A Practical Approach – Path101
Human classification
Data APIs
Automatic classification model
Example: Company Name
External data from Open Calais, Freebase
Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
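The idea of using context fields to pick between candidates can be sketched with a hand-written matcher. This is a stand-in for illustration, not the Path101 classification model: the profile values below are invented placeholders, and `disambiguate` is a hypothetical helper that simply counts matching fields.

```python
# Hypothetical candidate profiles. The field values are invented
# placeholders for illustration, not data from the talk.
CANDIDATES = {
    "Microsoft": {
        "industry": "software",
        "location": "Redmond, WA",
        "job_type": "contract",
    },
    "Volt Information Sciences, Inc.": {
        "industry": "staffing",
        "location": "New York, NY",
        "job_type": "full-time",
    },
}


def disambiguate(record, candidates=CANDIDATES):
    """Return the candidate whose profile matches the most context fields.

    Returns None when no field matches any candidate.
    """
    def score(profile):
        return sum(1 for field, value in profile.items()
                   if record.get(field) == value)

    best = max(candidates, key=lambda name: score(candidates[name]))
    return best if score(candidates[best]) > 0 else None


print(disambiguate({"industry": "staffing", "location": "New York, NY"}))
print(disambiguate({"industry": "software"}))
```

A trained classifier would learn weights for these fields from human-labeled examples instead of counting exact matches.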
![Page 10: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/10.jpg)
SPAM sucks
![Page 11: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/11.jpg)
Supervised Classification
[Diagram]
Training Data → Feature Extractor → Trained Classifier
Text → Feature Extractor → Trained Classifier → Cats / Dogs / Fire
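The flow in the diagram (extract features from labeled training data, train a classifier, then run new text through the same feature extractor) can be sketched in plain Python. This is a simplified naive Bayes-style scorer written for illustration; the training sentences are invented, and only the labels Cats, Dogs, and Fire come from the slide.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count, per label, how many training documents contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
    return label_counts, word_counts


def classify(text, model):
    """Pick the label with the highest add-one-smoothed log score."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)
        for word in extract_features(text):
            # Add-one smoothing keeps unseen words from ruling out a label.
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training data using the labels from the diagram.
TRAINING_DATA = [
    ("the kitten purrs and meows", "Cats"),
    ("a cat chased the mouse", "Cats"),
    ("the puppy barks and fetches", "Dogs"),
    ("a dog wagged its tail", "Dogs"),
    ("smoke and flames filled the room", "Fire"),
    ("the blaze burned all night", "Fire"),
]

model = train(TRAINING_DATA)
print(classify("the dog barks", model))
```

The key design point is that training and prediction share one feature extractor, so the classifier always sees new text in the same form as its training data.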
![Page 12: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/12.jpg)
Classification Example: Movie Reviews!
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
…tagged ‘positive’ and ‘negative’.
![Page 13: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/13.jpg)
#!/usr/bin/env python
# encoding: utf-8
"""
classification_example.py
"""
# Requires NLTK and its movie_reviews corpus: nltk.download('movie_reviews')

import random

import nltk
from nltk.corpus import movie_reviews


def document_features(document, word_features):
    # One boolean feature per vocabulary word: does the document contain it?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def main():
    # Use the 2000 most frequent words in the corpus as the feature vocabulary.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [word for word, count in all_words.most_common(2000)]

    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    # Shuffle so the 100 held-out test documents mix both categories.
    random.shuffle(documents)

    featuresets = [(document_features(d, word_features), c) for (d, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(20)


if __name__ == '__main__':
    main()
![Page 14: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/14.jpg)
Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
![Page 15: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/15.jpg)
Hierarchical Clustering
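As a rough illustration of hierarchical (agglomerative) clustering, the sketch below repeatedly merges the two closest clusters of phrases. The choices here are assumptions, not taken from the talk: distance is Jaccard distance over the words in each phrase, linkage is single linkage, and the function names and threshold are hypothetical. The phrases are the example terms from the clustering slide.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the words of two non-empty phrases."""
    words_a, words_b = set(a.split()), set(b.split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(jaccard_distance(a, b) for a in c1 for b in c2)


def agglomerate(items, threshold):
    """Merge the two closest clusters until none are closer than threshold.

    Returns the final clusters and the list of merges in order, which is
    the information a dendrogram would draw.
    """
    clusters = [[item] for item in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist >= threshold:
            break
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


TERMS = [
    "immunity",
    "ultrasound",
    "medical imaging",
    "medical devices",
    "thermoelectric devices",
    "fault-tolerant circuits",
    "low power devices",
]

# A threshold of 1.0 stops before merging phrases that share no words.
clusters, merges = agglomerate(TERMS, threshold=1.0)
for cluster in clusters:
    print(cluster)
```

With word overlap as the only signal, phrases that share "medical" or "devices" group together while the rest stay separate; richer features would be needed to relate terms such as "ultrasound" and "medical imaging".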
![Page 16: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/16.jpg)
![Page 17: Have data? What now?!](https://reader034.vdocument.in/reader034/viewer/2022051515/54c764734a7959154e8b4575/html5/thumbnails/17.jpg)
<3 Data
Thank you!