
Page 1: My work in Information Retrieval, Machine Learning …scf.usc.edu/~tnarayan/files/SummaryPresentationToChris...My work in Information Retrieval, Machine Learning and NLP 1 + I’m

Thamme Gowda @ Stanford University, Nov 3rd, 2016

My work in Information Retrieval, Machine Learning and NLP

1

HELLO!

+ I’m Thamme Gowda
+ University of Southern California, Los Angeles - MSCS
+ NASA Jet Propulsion Laboratory - Intern
+ Apache Software Foundation - Volunteer
+ Datoin - Co-Founder
+ You can find me online: @thammegowda

2

OVERVIEW

+ In this presentation
  + USC IRDS - DARPA Memex
  + NASA Jet Propulsion Lab
  + Datoin
+ Research
  + Clustering Web Pages
  + Mars Target Encyclopedia
+ Research Interests and motivations

3

USC INFORMATION RETRIEVAL AND DATA SCIENCE

+ Dr. Chris Mattmann’s group
+ Contributions to Free and Open Source Software
  + Top Apache Projects:
    + Apache Tika, Nutch, Joshua (Incubating)
    + Sparkler - A web crawler on Apache Spark
+ Involvement in DARPA Memex program

4

NASA JET PROPULSION LABORATORY

+ Summer Intern (full time), Fall Co-Op (part time)
+ Continued involvement with DARPA MEMEX
+ DARPA Data Driven Discovery of Models (D3M)
+ Mars Target Encyclopedia
+ Mars Landmarks Classification

5

MEMEX

Collaborators: Dr. Chris Mattmann, Paul Ramirez, Kyle Hundman, et al.

6

DARPA MEMEX

+ Web Crawling, Information Retrieval
  + Apache Solr based Search Index
+ Information Extraction
  + Names of people, organizations, locations
  + From location names to GPS coordinates
+ Object Recognition - Models from ImageNet dataset

7

TIKA-1787: NAMED ENTITY RECOGNITION

+ NER on Memex Dataset
+ Added NER support to Apache Tika [1]
  + MaxEnt Classifier from Apache OpenNLP (default) [2]
  + CRF Classifier from Stanford CoreNLP [3]
  + MITIE - IE toolkit from MIT LL

[1] https://wiki.apache.org/tika/TikaAndNER
[2] https://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
[3] http://nlp.stanford.edu/software/CRF-NER.shtml

8

TIKA-1993: OBJECT RECOGNITION

+ Image Recognition support
+ Integrated TensorFlow’s Inception-V3 model
+ Evaluated multiple ways of integration
  + Command Line Invocation → S-l-o-w as a turtle
  + Java Native Interface → Transitive dep. issues
  + gRPC Client Server → Dependency version issues
  + REST Client Server → Works best, please use this!
+ https://wiki.apache.org/tika/TikaAndVision

10

TIKA-1993 DEMO

<meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.36203)"/>
<meta name="OBJECT" content="military uniform (0.13061)"/>

* Photo Credits - Wikimedia.org

11

LABELING THE WEAPONS DATASET

+ 1.3 Million Images of DARPA MEMEX dataset
+ Detected objects in the images
+ Two Experiments
  + 1st time - top 2 objects
  + 2nd time - top 2 objects + confidence threshold of 0.3
+ Improved the efficiency for large jobs
  + 1.3 million images took ~ 36 hours on 32 CPU cores
  + Improvements are upstreamed to Apache Tika
+ Pushed the results back to Imagecatdev Solr: http://imagecat.dyndns.org/weapons/imagespace-dev/
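The two labeling experiments can be illustrated with a small post-processing sketch. It assumes the recognizer output looks like the `<meta name="OBJECT" .../>` tags shown in the TIKA-1993 demo; the regular expression and the function names `parse_objects` and `select_labels` are hypothetical helpers for illustration, not part of Apache Tika.

```python
import re

# Assumed output shape: <meta name="OBJECT" content="label text (0.12345)"/>
META_OBJECT = re.compile(r'<meta name="OBJECT" content="(.+?) \(([0-9.]+)\)"\s*/>')


def parse_objects(xhtml):
    """Return (label, confidence) pairs found in the XHTML text."""
    return [(label, float(conf)) for label, conf in META_OBJECT.findall(xhtml)]


def select_labels(objects, top_n=2, min_confidence=0.0):
    """Keep the top_n most confident objects that reach min_confidence."""
    ranked = sorted(objects, key=lambda pair: pair[1], reverse=True)
    return [(label, conf) for label, conf in ranked[:top_n] if conf >= min_confidence]


sample = (
    '<meta name="OBJECT" content="German shepherd, German shepherd dog, '
    'German police dog, alsatian (0.36203)"/>'
    '<meta name="OBJECT" content="military uniform (0.13061)"/>'
)
objects = parse_objects(sample)
first_run = select_labels(objects, top_n=2)                       # experiment 1
second_run = select_labels(objects, top_n=2, min_confidence=0.3)  # experiment 2
```

On the demo output, the first setting keeps both labels, while the 0.3 threshold keeps only the German shepherd label.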

12

MEMEX CP1: HT CLASSIFIER

+ Using SVM
+ Created custom vectors
+ Stanford CoreNLP for tokenization
+ Features:
  + Unigrams
  + Selected Bigrams
+ All grams are lemmatized
+ Classification is done at the cluster level
+ https://github.com/USCDataScience/svm-classifier-memex

* Photo credits http://scikit-learn.org/

14


MEMEX CP1: EVAL. DATASET AU ROC

81.7% AU-ROC for 487 Clusters (Next best result: 65%)
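For readers unfamiliar with the metric, AU-ROC can be read as the probability that a randomly chosen positive cluster is scored above a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the toy labels and scores are made up for illustration and are not the CP1 evaluation data.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for positive, 0 for negative; scores: higher means more positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy cluster scores, invented for this example only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = au_roc(toy_labels, toy_scores)
```

The quadratic loop is fine for a few hundred clusters; a sort-based rank computation would be the usual choice for larger sets.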

15

MEMEX CP1: Sample Features

+ Cluster classification instead of individual documents
+ Lemmatization
+ Selected Bigrams and N-grams:
  + Adjectives and nouns - together
  + Adverbs and verbs - together
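A rough sketch of this feature scheme is below. It assumes the input is already tokenized, lemmatized and POS-tagged with Penn Treebank tags (for example by Stanford CoreNLP), and that "together" means an adjective directly followed by a noun, or an adverb directly followed by a verb. The function names are hypothetical, and the resulting counts would still need to be turned into vectors for the SVM.

```python
from collections import Counter

# Penn Treebank tag prefixes: JJ = adjective, NN = noun, RB = adverb, VB = verb.
# Assumption: only these two adjacent tag patterns count as "selected" bigrams.
SELECTED_PAIRS = {("JJ", "NN"), ("RB", "VB")}


def document_features(tagged_lemmas):
    """Count lemma unigrams plus adjective+noun and adverb+verb bigrams.

    tagged_lemmas: list of (lemma, pos_tag) pairs.
    """
    features = Counter(lemma for lemma, _ in tagged_lemmas)
    for (left, left_tag), (right, right_tag) in zip(tagged_lemmas, tagged_lemmas[1:]):
        if (left_tag[:2], right_tag[:2]) in SELECTED_PAIRS:
            features[left + " " + right] += 1
    return features


def cluster_features(documents):
    """Sum document features so that the whole cluster becomes one example."""
    total = Counter()
    for doc in documents:
        total.update(document_features(doc))
    return total


doc_a = [("new", "JJ"), ("listing", "NN"), ("quickly", "RB"), ("sell", "VBD")]
doc_b = [("new", "JJ"), ("listing", "NN")]
features = cluster_features([doc_a, doc_b])
```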

16

SPARKLER

Collaborators: Karanjeet Singh

17

SPARKLER [1] (aka Spark-Crawler)

+ Redesigned and reimagined crawler
  + Taking the best parts of Apache Nutch
  + Combined with recent advancements in distributed computing
  + Crawler database is redesigned → indexed store using Apache Lucene/Solr
  + Crawler pipeline is designed → CrawldbRDD

18

[1] https://github.com/uscdataScience/sparkler

SPARKLER - ROADMAP

+ Partitioning for Fair Fetching
+ Apache Solr Backend for crawldb
+ Stores Data on FS
+ OSGI based Plugin Framework (Apache Felix)
+ Regex URL Filter, JavaScript Engine
+ Admin Dashboard
+ Apache Kafka Integration
+ ApacheCon EU 2016
+ TODO: More Plugins from Nutch
+ TODO: Apache Incubator

Quick Start: https://github.com/USCDataScience/sparkler/wiki/sparkler-0.1
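The "Partitioning for Fair Fetching" item can be pictured with a small standalone sketch. It assumes that fairness means grouping URLs by host and visiting hosts in round-robin order so that no single site receives a long burst of requests. This is an illustration in plain Python, not Sparkler's actual implementation, and the function names and example URLs are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse


def partition_by_host(urls):
    """Group URLs so that each host ends up in exactly one partition."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)


def interleave(groups):
    """Round-robin over hosts: one URL per host per pass."""
    order = []
    for batch in zip_longest(*groups.values()):
        order.extend(url for url in batch if url is not None)
    return order


urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
groups = partition_by_host(urls)
fetch_order = interleave(groups)
```

In a distributed crawler each host group would typically be handled by one worker, which also makes per-host politeness delays easy to enforce.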

19

SPARKLER PIPELINE [1]

[1] https://github.com/USCDataScience/sparkler/wiki/Sparkler-Internals

20


Co-Founder

21

DATOIN - DATA TO INFORMATION [1]

+ A platform for data flow pipelines
+ Drag-drop-connect the components
+ SDK to build reusable components
+ Machine Learning components are our interest
+ All round experience - idea, design, implement, test, deploy, collaborate

[1] http://datoin.com

22


Demo: http://datoin.com/applications/extraction?type=custom

Pipeline: http://datoin.com/pipeline/viewPipeline/pipeline-9d3bc507-4549-4139-8f0f-dc38a0adf354

*Photo from http://datoin.com

23

RESEARCH EXPERIENCE

Clustering Web Pages based on Structure and Style
Object Recognition - Landmarks Classification
Named Entity Recognition - Mars Target Encyclopedia

24


Clustering Web Pages Based On Structure And Style Similarity

Thamme Gowda and Chris Mattmann, IEEE IRI 2016

USC Information Retrieval and Data Science Group (irds.usc.edu)

25

AUTO EXTRACTOR [1]

+ Unsupervised Learning for Information Extraction
+ First Step - Clustering based on structure and style
+ Kernels
  + Structural Similarity using Tree Edit Distance between DOM trees
  + Style Similarity of CSS class names
+ Shared Near Neighbor Clustering
+ Distributable on Apache Spark

[1] https://github.com/thammegowda/autoextractor

26

SAMPLE WEB PAGES

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

27


METHOD OVERVIEW

CLUSTERING

METHOD: STEP #1

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STRUCTURAL SIMILARITY

MINIMUM TREE EDIT DISTANCE

● 3 Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
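To make the three edit operations concrete, here is a small sketch of ordered tree edit distance with unit costs for insert, delete and relabel. It uses the plain memoized forest recursion rather than the Zhang-Shasha algorithm cited on the slide, so it computes the same kind of distance but is only practical for small trees. The tuple representation and the normalization by the combined node count are assumptions made for this example.

```python
from functools import lru_cache

# A tree is (label, children); children is a tuple of trees (use () for a leaf).


def size(tree):
    _label, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)   # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything in f1
    (label1, kids1), rest1 = f1[-1], f1[:-1]
    (label2, kids2), rest2 = f2[-1], f2[:-1]
    delete = forest_distance(rest1 + kids1, f2) + 1
    insert = forest_distance(f1, rest2 + kids2) + 1
    relabel = (forest_distance(kids1, kids2)
               + forest_distance(rest1, rest2)
               + (0 if label1 == label2 else 1))
    return min(delete, insert, relabel)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def normalized_distance(t1, t2):
    """Assumed normalization: divide by the total number of nodes."""
    return tree_edit_distance(t1, t2) / (size(t1) + size(t2))


dom_a = ("html", (("body", (("div", ()), ("p", ()))),))
dom_b = ("html", (("body", (("div", ()),)),))
distance = tree_edit_distance(dom_a, dom_b)   # one delete: the <p> node
```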

33

METHOD: STEP #2

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STYLE SIMILARITY

STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard Similarity on CSS class names
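A minimal sketch of this measure follows. It collects class names with Python's standard `html.parser` instead of evaluating the XPath expression, and it assumes that multi-valued class attributes are split on whitespace before comparison; the helper names and the two sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collect CSS class names from every element that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.class_names.update(value.split())


def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    return collector.class_names


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))


page_a = '<div class="post title"><span class="price">10</span></div>'
page_b = '<div class="post body"><span class="price">12</span></div>'
similarity = style_similarity(page_a, page_b)
```

Here the pages share `post` and `price` out of four distinct class names, so the similarity is 0.5.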

35


METHOD : STEP #3

AGGREGATED SIMILARITY

AGGREGATE

METHOD: STEP #4

SIMILARITY MATRIX → SHARED NEAR NEIGHBOR CLUSTERING → CLUSTERS
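One common formulation of shared near neighbor clustering is the Jarvis-Patrick scheme, sketched below on a precomputed similarity matrix. Whether the project uses exactly this linking rule is an assumption; the parameters `k` and `min_shared` and the toy matrix are hypothetical.

```python
def shared_near_neighbor_clusters(similarity, k, min_shared):
    """Jarvis-Patrick style clustering on a square similarity matrix.

    Two items are linked when each is among the other's k nearest
    neighbours and their neighbour sets share at least min_shared items.
    Clusters are the connected components of those links.
    """
    n = len(similarity)
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        neighbours.append(set(others[:k]))

    parent = list(range(n))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy matrix: items 0-2 resemble each other, as do items 3-5.
group = [0, 0, 0, 1, 1, 1]
toy_similarity = [
    [1.0 if i == j else (0.9 if group[i] == group[j] else 0.1) for j in range(6)]
    for i in range(6)
]
clusters = shared_near_neighbor_clusters(toy_similarity, k=2, min_shared=1)
```

A useful property of this scheme is that the number of clusters does not have to be chosen in advance.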

CHALLENGES

• TED very expensive
• Zhang-Shasha’s TED
  • O(|T1| x |T2| x Min{depth(T1), leaves(T1)} x Min{depth(T2), leaves(T2)})
  • That’s O(n^4)
  • Approx. 1000 HTML Tags
  • That’s O(10^12)

[Plot: Time Complexity vs. Number of HTML Tags]

38

LANDMARKS CLASSIFICATION AND MARS TARGET ENCYCLOPEDIA

Jet Propulsion Laboratory, California Institute of Technology

Contributors: Dr. Kiri Wagstaff, Dr. Raymond Francis

39

LANDMARKS CLASSIFICATION

+ Goal: Classify Landmark images from High Resolution Imaging Science Experiment (HiRISE)
  + Classes: Crater, Dark Dune, Bright Dune, Streak etc.
+ Trained a deep neural net for image classification
+ Compared with the results from Caffe based classifier

40

LANDMARKS CLASSIFICATION

+ Challenges:
  + Too little training data
  + Demands lots of CPU power
  + Labels are not precise
+ Solution: Transfer Learning
  + Start with Inception-V3 Net using state-of-the-art model
  + Erase the weights of last layer
  + Retrain the network for new classes

41


INCEPTION-V3 ARCHITECTURE

* Photo Credits - Google Research

This Network has 5.64% top-5 error on ILSVRC 2012 validation dataset

42

LANDMARKS CLASSIFICATION

Model Evaluation

Judge↓\TFlo→     streak   other   dark_dune   bright_dune   crater   [TFlo.Tot]
streak                1      55           1             0        1           58
other                 1    1562         143             2       60         1768
dark_dune             0      18         471             0        1          490
bright_dune           0       6           1             0       16           23
crater                0     225           0             0      158          383
[Judge.Total]         2    1866         616             2      236       [2722]
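The confusion matrix above can be summarized with overall accuracy and per-class precision and recall. The sketch below assumes that rows are the judge's (human) labels and columns are the classifier's predictions, as the "Judge↓\TFlo→" header suggests; reading "TFlo" as the TensorFlow model is also an assumption.

```python
CLASSES = ["streak", "other", "dark_dune", "bright_dune", "crater"]

# Assumed orientation: rows = judge label, columns = model prediction.
CONFUSION = [
    [1, 55, 1, 0, 1],
    [1, 1562, 143, 2, 60],
    [0, 18, 471, 0, 1],
    [0, 6, 1, 0, 16],
    [0, 225, 0, 0, 158],
]


def accuracy(matrix):
    """Fraction of examples on the diagonal (prediction equals judge label)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class(matrix, classes):
    """Return {class: (precision, recall)}; None where the denominator is zero."""
    report = {}
    for i, name in enumerate(classes):
        predicted = sum(row[i] for row in matrix)   # column total
        actual = sum(matrix[i])                     # row total
        precision = matrix[i][i] / predicted if predicted else None
        recall = matrix[i][i] / actual if actual else None
        report[name] = (precision, recall)
    return report


overall = accuracy(CONFUSION)
report = per_class(CONFUSION, CLASSES)
```

Under that reading, 2192 of the 2722 images lie on the diagonal (about 0.805 accuracy), and the rare streak and bright_dune classes are almost never predicted correctly.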

43

MARS TARGET ENCYCLOPEDIA (MTE)

+ Goal: Build a search engine for research articles related to planetary science.
  + Minerals, Elements, Targets etc.
+ Contributions: parser and indexer tools
  + Apache Tika to extract text
  + Grobid parser to extract title, authors, affiliations etc.
  + Stanford CoreNLP for NER
  + Apache Lucene/Solr inverted index
+ https://github.com/USCDataScience/parser-indexer-py

44

INFORMATION EXTRACTION [4]

+ Build custom NER model for planetary science
+ Entities include ELEMENTS, MINERALS, TARGETS, etc.
+ Annotated the documents published in Lunar and Planetary Science Conference [1] (LPSC) 2015 using brat [2]
+ Trained a model for Stanford CoreNLP’s CRFClassifier [3]

1. http://www.hou.usra.edu/meetings/lpsc2015/
2. http://brat.nlplab.org/
3. http://nlp.stanford.edu/software/CRF-NER.shtml
4. https://github.com/USCDataScience/parser-indexer-py/tree/master/src/corenlp
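One step between annotating in brat and training a CRF is converting standoff annotations into a two-column token/label layout, which is the kind of tab-separated input a CRF sequence tagger is usually trained on. The sketch below is a simplified illustration, not the project's converter: it tokenizes on whitespace (a real pipeline would use the CoreNLP tokenizer), skips discontinuous spans, and uses an invented sentence and hypothetical label names.

```python
import re


def read_brat_spans(ann_text):
    """Parse text-bound annotations (lines starting with 'T') from a brat .ann file."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        _id, info, _surface = line.split("\t")
        if ";" in info:          # discontinuous span, skipped in this sketch
            continue
        label, start, end = info.split()
        spans.append((int(start), int(end), label))
    return spans


def to_token_rows(text, spans, outside="O"):
    """Label each whitespace-delimited token with the span it falls inside."""
    rows = []
    for match in re.finditer(r"\S+", text):
        label = outside
        for start, end, name in spans:
            if match.start() >= start and match.end() <= end:
                label = name
                break
        rows.append((match.group(), label))
    return rows


# Invented sentence and annotations, for illustration only.
text = "Rockface contains hematite and iron"
ann = (
    "T1\tTARGET 0 8\tRockface\n"
    "T2\tMINERAL 18 26\themamite-placeholder\n"
    "T3\tELEMENT 31 35\tiron\n"
)
rows = to_token_rows(text, read_brat_spans(ann))
tsv = "\n".join(f"{token}\t{label}" for token, label in rows)
```

Only the character offsets are used for labeling, so the surface text in the third column of the `.ann` line does not affect the result.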

45


Gasda, P. J., et al. "Potential Link Between High-Silica Diagenetic Features in Both Eolian and Lacustrine Rock Units Measured in Gale Crater with MSL." Lunar and Planetary Science Conference. Vol. 47. 2016.

46

NER MODEL EVALUATION

● 8 Test documents from LPSC 2015
● 100 training documents from LPSC 2015 and LPSC 2016

47

RESEARCH MOTIVATIONS AND INTERESTS

+ There is so much to learn!
+ Research in Question Answering (AI) is fascinating
+ Narrowing down:
  + Natural Language Understanding
  + Question Answering
  + Information Extraction
  + Knowledge Representation

48

QUESTION ANSWERING

+ Question Answering is an interface, not a single application: any human-interfacing system that has input and output can be converted to a sort of question-answering system
+ Natural Language Understanding
  + Identifying the entities (nouns)
  + Resolving the references (pronouns, context)
  + Updating the states (adjectives) of entities based on the actions (verbs)

49

KNOWLEDGE REPRESENTATION

+ Capture the mutable aspects of knowledge
+ A formal language to do (math) reasoning (induction, deduction)
+ A graphical language to visualize and explain
+ Storage and Retrieval

50

TIMELINE KR

+ Every entity has a timeline
+ Timelines intersect with each other
+ For example: this meeting
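One possible reading of this idea as a data structure is sketched below: each entity keeps a chronologically ordered list of events, and two timelines intersect wherever they share an event. The class and function names, and the earlier event, are hypothetical; only the meeting date comes from the title slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    time: str          # ISO date, so string order is chronological order
    description: str


@dataclass
class Entity:
    name: str
    timeline: list = field(default_factory=list)

    def add(self, event):
        """Insert an event and keep the timeline sorted by time."""
        self.timeline.append(event)
        self.timeline.sort(key=lambda e: e.time)


def intersections(a, b):
    """Events that appear on both timelines, in chronological order."""
    shared = set(a.timeline) & set(b.timeline)
    return sorted(shared, key=lambda e: e.time)


meeting = Event("2016-11-03", "meeting at Stanford University")
speaker = Entity("speaker")
audience = Entity("audience")
speaker.add(meeting)
speaker.add(Event("2016-01-01", "hypothetical earlier event"))
audience.add(meeting)
shared = intersections(speaker, audience)
```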

51


TIMELINE KR : Example

52


QUESTIONS ?

Thanks

53