Document Classification with Neo4j

Upload: kenny-bastani

Post on 28-Nov-2014

DESCRIPTION

Graphs are a perfect solution to organize information and to determine the relatedness of content. Neo4j Developer Evangelist Kenny Bastani will discuss using Neo4j to perform document classification and text classification using a graph database. Kenny will demonstrate how to build a scalable architecture for classifying natural language text using a graph-based algorithm called Hierarchical Pattern Recognition. This approach encompasses a set of techniques familiar to Deep Learning practitioners.

TRANSCRIPT

Page 1: Document Classification with Neo4j

(graphs)-[:are]->(everywhere)

Document Classification with Neo4j

© All Rights Reserved 2014 | Neo Technology, Inc.

@kennybastani

Neo4j Developer Evangelist

Page 2: Document Classification with Neo4j

Agenda

• Introduction to Neo4j

• Introduction to Graph-based Document Classification

• Graph-based Hierarchical Pattern Recognition

• Generating a Vector Space Model for Recommendations

• Graphify for Neo4j

• U.S. Presidential Speech Transcript Analysis

Page 3: Document Classification with Neo4j

Introduction to Neo4j

Page 4: Document Classification with Neo4j

The Property Graph Data Model

Page 5: Document Classification with Neo4j

[Figure: a property graph with three nodes (John, Sally, and the book Graph Databases), two Friend Of relationships, and two Has Read relationships.]

Page 6: Document Classification with Neo4j

Nodes:

name: John, age: 27

name: Sally, age: 32

title: Graph Databases, authors: Ian Robinson, Jim Webber

Relationships:

FRIEND_OF, since: 01/09/2013

FRIEND_OF, since: 01/09/2013

HAS_READ, on: 2/03/2013, rating: 5

HAS_READ, on: 02/09/2013, rating: 4
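As a rough illustration of the property graph model, the example above can be sketched in plain Python. The `Node` and `Relationship` classes are hypothetical stand-ins, not a Neo4j API. The direction of each relationship and the pairing of readers with ratings are assumptions, since the slide text does not preserve them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node holds arbitrary key/value properties."""
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed relationship that also holds properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


john = Node({"name": "John", "age": 27})
sally = Node({"name": "Sally", "age": 32})
book = Node({"title": "Graph Databases",
             "authors": ["Ian Robinson", "Jim Webber"]})

# Assumption: directions and the reader/rating pairing are illustrative only.
relationships = [
    Relationship(john, "FRIEND_OF", sally, {"since": "01/09/2013"}),
    Relationship(sally, "FRIEND_OF", john, {"since": "01/09/2013"}),
    Relationship(john, "HAS_READ", book, {"on": "2/03/2013", "rating": 5}),
    Relationship(sally, "HAS_READ", book, {"on": "02/09/2013", "rating": 4}),
]


def readers_of(target, rels):
    """Follow incoming HAS_READ relationships to find who read a book."""
    return [r.start.properties["name"]
            for r in rels
            if r.type == "HAS_READ" and r.end is target]


print(readers_of(book, relationships))
```

Both nodes and relationships carry their own properties, which is the feature that distinguishes a property graph from a plain node-and-edge graph.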

Page 7: Document Classification with Neo4j

The Relational Table Model

Page 8: Document Classification with Neo4j

Customers

143 Alice

Accounts

326 $100

725 $632

981 $212

Customer_Accounts

143 981

143 725

143 326
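To see why the relational model needs a link table here, the rows above can be loaded into an in-memory SQLite database. This is only a sketch: the column names (`id`, `name`, `balance`, `customer_id`, `account_id`) are assumptions, since the slide shows only table names and values, and the balance of account 725 is read as $632.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE Customer_Accounts (customer_id INTEGER, account_id INTEGER);
""")
conn.execute("INSERT INTO Customers VALUES (143, 'Alice')")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                 [(326, 100), (725, 632), (981, 212)])
conn.executemany("INSERT INTO Customer_Accounts VALUES (?, ?)",
                 [(143, 981), (143, 725), (143, 326)])

# Finding Alice's accounts requires two joins through the link table.
rows = conn.execute("""
    SELECT a.id, a.balance
    FROM Customers c
    JOIN Customer_Accounts ca ON ca.customer_id = c.id
    JOIN Accounts a ON a.id = ca.account_id
    WHERE c.name = 'Alice'
    ORDER BY a.id
""").fetchall()

print(rows)
```

In a property graph the same question is a traversal from the customer node to its account nodes, with no link table in between.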

Page 9: Document Classification with Neo4j

The Neo4j Browser

Page 10: Document Classification with Neo4j

http://localhost:7474/

Neo4j Browser - finding help

Page 11: Document Classification with Neo4j

Execute Cypher, Visualize

Page 12: Document Classification with Neo4j

Introduction to Document Classification

Page 13: Document Classification with Neo4j

Document Classification

Automatically assign a document to one or more classes

Documents may be classified according to their subjects or according to other attributes

Automatically classify unlabeled documents to a set of relevant classes using labeled training data

Page 14: Document Classification with Neo4j

Example Use Cases for Document Classification

Page 15: Document Classification with Neo4j

Sentiment Analysis for Movie Reviews

Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie.

Problem: The user reviews are unstructured text.

How do I automatically generate a score indicating whether the review was positive or negative?

Solution: Train a natural language parsing model on a dataset of previous reviews that have been labeled as either positive or negative.

Page 16: Document Classification with Neo4j

Recommend Relevant Tags

Scenario: A Q/A website allows users to submit questions and receive answers from other users.

Problem: Users sometimes do not know which tags to apply to their questions to make them easier to discover and answer.

Solution: Automatically recommend the most relevant tags for a question by classifying its text with a model trained on previous questions.

Page 17: Document Classification with Neo4j

Recommend Similar Articles

Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics.

Problem: The site needs to increase user engagement and time spent on the site.

Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page.

Page 18: Document Classification with Neo4j

How Automated Document Classification Works

Page 19: Document Classification with Neo4j

Step 1: Create a Training Dataset

Supervised Learning

Assign a set of labels that describes the document’s text

[Figure: documents linked to the labels X, Y, and Z.]

Page 20: Document Classification with Neo4j

Step 2: Train a Natural Language Parsing Model

Deep Learning

State machines represent predicates that evaluate to 0 or 1 for a text match

Deep feature representations are selected and learned using an evolutionary algorithm

State machines map to classes of document labels that matched text during training

[Figure: a hierarchy of state machines (p) mapped to Class X, Class Y, and Class Z.]
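The idea of predicates that evaluate to 0 or 1 for a text match can be sketched with regular expressions, which are one common way to express finite state machines. The patterns and their class mappings below are made-up examples, not features learned by the algorithm described in this talk.

```python
import re

# Hypothetical patterns, each standing in for one state machine (p).
# Each maps to the classes whose training documents it matched.
FEATURES = {
    r"\bgreat\b": {"positive"},
    r"\bwaste of\b": {"negative"},
    r"\bthe\s+\w+\s+was\b": {"positive", "negative"},
}


def evaluate(pattern, text):
    """Return 1 if the pattern matches anywhere in the text, else 0."""
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0


def matched_classes(text):
    """Collect the classes of every predicate that fires on the text."""
    classes = set()
    for pattern, labels in FEATURES.items():
        if evaluate(pattern, text):
            classes |= labels
    return classes


text = "The acting was great"
print([evaluate(p, text) for p in FEATURES])
print(sorted(matched_classes(text)))
```

Each predicate contributes a single 0 or 1, so a document can be summarized by which predicates fired and which classes those predicates point to.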

Page 21: Document Classification with Neo4j

Step 3: Classify Unlabeled Documents

The natural language parsing model is used to classify other unlabeled documents

[Figure: an unlabeled document compared with Class X, Class Y, and Class Z by cos(θ), with scores 0.99, 0.67, and 0.01.]

Page 22: Document Classification with Neo4j

Hierarchical Pattern Recognition (HPR)

Page 23: Document Classification with Neo4j

What is Hierarchical Pattern Recognition (HPR)?

HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time. It performs graph-based traversals using a hierarchy of finite state machines (FSMs).

Designed for scalable performance in P time:

Page 24: Document Classification with Neo4j

Influences & Inspirations

Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition

[Figure: a hierarchy of pattern nodes (p) mapped to X, Y, and Z.]

Page 25: Document Classification with Neo4j

How does feature extraction work?

Hierarchical Pattern Recognition

HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations.

“Deep” feature representations are learned and associated with the labels of the documents in which each feature was discovered.

The feature hierarchy is translated into a Vector Space Model, so that feature vectors generated from unlabeled text can be classified.

[Figure: a hierarchy of pattern nodes (p) mapped to X, Y, and Z.]

Page 26: Document Classification with Neo4j

Graph-based feature learning

Page 27: Document Classification with Neo4j

Learning new features from matches on training data

Page 28: Document Classification with Neo4j

Cost Function for the Generations of Features

Reproduction occurs after a threshold of matches has been exceeded for a feature.

After replication the cost function is applied to increase that threshold every time the feature reproduces.

is the current threshold on the feature node.

is the minimum threshold, which I chose as 5 for new features.

Cost function:
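The reproduction rule described above can be sketched as follows. The formula itself is not given in the text, so the update used here (adding the minimum threshold to the current threshold) is purely an assumption chosen to make the sketch runnable. The `Feature` class and `record_match` function are hypothetical names as well.

```python
from dataclasses import dataclass

MIN_THRESHOLD = 5  # minimum threshold for new features, as stated above


def assumed_cost(current, minimum):
    """ASSUMPTION: a placeholder update, not the formula from the slide."""
    return current + minimum


@dataclass
class Feature:
    pattern: str
    matches: int = 0
    threshold: int = MIN_THRESHOLD


def record_match(feature, cost=assumed_cost):
    """Count a match; reproduce once the threshold is exceeded.

    Returns True when the feature reproduces, after which the cost
    function raises the threshold for its next reproduction.
    """
    feature.matches += 1
    if feature.matches > feature.threshold:
        feature.threshold = cost(feature.threshold, MIN_THRESHOLD)
        return True
    return False


f = Feature("hypothetical pattern")
events = [record_match(f) for _ in range(12)]
print(events.count(True), f.threshold)
```

Passing the cost function as a parameter keeps the reproduction logic separate from the particular formula, so a different update rule can be substituted without touching the rest.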

Page 29: Document Classification with Neo4j

Page 30: Document Classification with Neo4j

Vector Space Model

Page 31: Document Classification with Neo4j

Generating Feature Vectors

The natural language parsing model created during training can be turned into a global feature index.

This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy.

Using that global feature index, a multi-dimensional vector space is created whose number of dimensions equals the number of features in the hierarchy.
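A minimal sketch of this step, assuming the global feature index is an ordered list of feature IDs. The ID values are invented, and using match counts as vector components is an assumption, since the text does not say how components are weighted.

```python
# Hypothetical Neo4j internal IDs for the features in the hierarchy.
global_feature_index = [12, 17, 23, 42, 57]

# Position of each feature ID within the vector.
position = {feature_id: i for i, feature_id in enumerate(global_feature_index)}


def feature_vector(matched_ids):
    """Build a vector with one dimension per feature in the index."""
    vector = [0] * len(global_feature_index)
    for feature_id in matched_ids:
        vector[position[feature_id]] += 1
    return vector


# Feature IDs matched in some text (17 matched twice).
print(feature_vector([17, 42, 17]))
```

Because every vector is built against the same index, vectors for different documents and classes share the same dimensions and can be compared directly.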

Page 32: Document Classification with Neo4j

Relevance Rankings

“Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” - Wikipedia

Page 33: Document Classification with Neo4j

Vector-based Cosine Similarity Measure

In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:
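For two vectors a and b, the cosine of the angle between them is the dot product divided by the product of their lengths: cos(θ) = (a · b) / (‖a‖ ‖b‖). A minimal Python sketch follows. The class vectors and the `rank_classes` helper are invented for illustration and are not taken from Graphify.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_classes(document_vector, class_vectors):
    """Sort classes by similarity to the document, most similar first."""
    scores = {name: cosine_similarity(document_vector, vec)
              for name, vec in class_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature vectors for three classes and one document.
class_vectors = {"X": [1, 1, 0], "Y": [1, 0, 1], "Z": [0, 0, 1]}
ranking = rank_classes([1, 1, 0], class_vectors)
print(ranking)
```

With non-negative components such as match counts, the score falls between 0 and 1, and the highest-ranked class is the closest match for the document.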

Page 34: Document Classification with Neo4j

Cosine Similarity & Vector Space Model

Page 35: Document Classification with Neo4j

Vector-based Cosine Similarity Measure

“The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.”

via Wikipedia

Page 36: Document Classification with Neo4j

Graphify for Neo4j

Page 37: Document Classification with Neo4j

Graphify for Neo4j

Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.

https://github.com/kbastani/graphify

Page 38: Document Classification with Neo4j

Example Project

Head over to the GitHub project page and clone it to your local machine.

Follow the directions listed in the README.md to install the extension.

Navigate to the /examples directory of the project.

Run:

examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java

Page 39: Document Classification with Neo4j

U.S. Presidential Speech Transcript Analysis

Page 40: Document Classification with Neo4j

Identify the Political Affiliation of a Presidential Speech

During the training phase, this example ingests a set of presidential speech transcripts, each labeled with the author of that speech. After the training models are built, unlabeled presidential speeches are classified in the test phase.

Page 41: Document Classification with Neo4j

The Presidents

• Ronald Reagan

• labels: liberal, republican, ronald-reagan

• George H.W. Bush

• labels: conservative, republican, bush41

• Bill Clinton

• labels: liberal, democrat, bill-clinton

• George W. Bush

• labels: conservative, republican, bush43

• Barack Obama

• labels: liberal, democrat, barack-obama

Page 42: Document Classification with Neo4j

Training

Each of the presidents in the example has 6 speeches to analyze.

4 of the speeches are used to build a natural language parsing model.

2 of the speeches are used to test the validity of that model.
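The 4/2 split can be sketched as below, using the labels listed for each president. The speech identifiers are hypothetical placeholders, and taking the first four speeches for training is an assumption, since the text does not say which speeches are held out.

```python
# Labels as listed for each president in this example.
LABELS = {
    "Ronald Reagan": ["liberal", "republican", "ronald-reagan"],
    "George H.W. Bush": ["conservative", "republican", "bush41"],
    "Bill Clinton": ["liberal", "democrat", "bill-clinton"],
    "George W. Bush": ["conservative", "republican", "bush43"],
    "Barack Obama": ["liberal", "democrat", "barack-obama"],
}


def split_speeches(speeches, train_count=4):
    """Split one president's speeches into training and test sets."""
    return speeches[:train_count], speeches[train_count:]


training, testing = [], []
for president, labels in LABELS.items():
    # Hypothetical identifiers; the real example reads speech transcripts.
    speeches = [f"{labels[-1]}-speech-{n}" for n in range(1, 7)]
    train, test = split_speeches(speeches)
    training += [(speech, labels) for speech in train]
    testing += test  # labels are withheld for the test phase

print(len(training), len(testing))
```

Only the training speeches carry labels, so the held-out speeches can be classified and the predictions checked against the known author.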

Page 43: Document Classification with Neo4j

Get Similar Labels/Classes

Page 44: Document Classification with Neo4j

Ronald Reagan

Class Similarity

republican 0.7182046285385341

liberal 0.644281223102398

democrat 0.4854114595950056

conservative 0.4133639188595147

bill-clinton 0.4057969121945167

barack-obama 0.323947855372623

bush41 0.3222644898334092

bush43 0.3161309849153592

Page 45: Document Classification with Neo4j

George H.W. Bush

Class Similarity

conservative 0.7032274806766954

republican 0.6047256274615608

liberal 0.4439742461594541

democrat 0.39114918238853674

bill-clinton 0.3234223107986785

ronald-reagan 0.3222644898334092

barack-obama 0.2929260544514002

bush43 0.29106733975087984

Page 46: Document Classification with Neo4j

Bill Clinton

Class Similarity

democrat 0.8375678825642422

liberal 0.7847858060182163

republican 0.5561860529059708

conservative 0.45365774896422445

barack-obama 0.4507676679770066

ronald-reagan 0.4057969121945167

bush43 0.365042482383354

bush41 0.3234223107986785

Page 47: Document Classification with Neo4j

George W. Bush

Class Similarity

conservative 0.820636570272315

republican 0.7056890956512284

liberal 0.5075788396061254

democrat 0.4505424322086937

bill-clinton 0.365042482383354

barack-obama 0.33801949243378965

ronald-reagan 0.3161309849153592

bush41 0.29106733975087984

Page 48: Document Classification with Neo4j

Barack Obama

Class Similarity

democrat 0.7668017370739147

liberal 0.7184792203867296

republican 0.4847680475425114

bill-clinton 0.4507676679770066

conservative 0.4149264161292232

bush43 0.33801949243378965

ronald-reagan 0.323947855372623

bush41 0.2929260544514002

Page 49: Document Classification with Neo4j

Get involved in the Neo4j community

Page 50: Document Classification with Neo4j

http://stackoverflow.com/questions/tagged/neo4j

Page 51: Document Classification with Neo4j

http://groups.google.com/group/neo4j

Page 52: Document Classification with Neo4j

https://github.com/neo4j/neo4j/issues

Page 53: Document Classification with Neo4j

http://neo4j.meetup.com/

Page 54: Document Classification with Neo4j

(Thank You)

Page 55: Document Classification with Neo4j

Get in touch

Twitter www.twitter.com/kennybastani

LinkedIn www.linkedin.com/in/kennybastani

GitHub www.github.com/kbastani