Document Classification with Neo4j

Post on 28-Nov-2014

Category:

Technology

DESCRIPTION

Graphs are a perfect solution to organize information and to determine the relatedness of content. Neo4j Developer Evangelist Kenny Bastani will discuss using Neo4j to perform document classification and text classification using a graph database. Kenny will demonstrate how to build a scalable architecture for classifying natural language text using a graph-based algorithm called Hierarchical Pattern Recognition. This approach encompasses a set of techniques familiar to Deep Learning practitioners.

TRANSCRIPT

(graphs)-[:are]->(everywhere)

Document Classification with Neo4j

© All Rights Reserved 2014 | Neo Technology, Inc.

@kennybastani

Neo4j Developer Evangelist

Agenda

• Introduction to Neo4j

• Introduction to Graph-based Document Classification

• Graph-based Hierarchical Pattern Recognition

• Generating a Vector Space Model for Recommendations

• Graphify for Neo4j

• U.S. Presidential Speech Transcript Analysis

Introduction to Neo4j

The Property Graph Data Model

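[Figure: a graph with the nodes John, Sally, and Graph Databases Book. Friend Of relationships connect John and Sally, and Has Read relationships connect each of them to the book.]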

Nodes
• name: John, age: 27
• name: Sally, age: 32
• title: Graph Databases, authors: Ian Robinson, Jim Webber

Relationships
• FRIEND_OF, since: 01/09/2013
• FRIEND_OF, since: 01/09/2013
• HAS_READ, on: 2/03/2013, rating: 5
• HAS_READ, on: 02/09/2013, rating: 4
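The property graph on this slide can be sketched with plain Python data structures. This is only an illustration of the data model, not how Neo4j stores data. The node keys, the helper names `neighbors` and `read_by_friends`, and the assignment of each HAS_READ rating to a particular reader are assumptions, since the transcript does not say which relationship belongs to whom.

```python
# Nodes: an id mapped to a property map (property values taken from the slide).
nodes = {
    "john": {"name": "John", "age": 27},
    "sally": {"name": "Sally", "age": 32},
    "book": {"title": "Graph Databases",
             "authors": ["Ian Robinson", "Jim Webber"]},
}

# Relationships: (start node, type, end node, properties).
# Assumption: which reader gave which rating is not recoverable from the slide.
relationships = [
    ("john", "FRIEND_OF", "sally", {"since": "01/09/2013"}),
    ("sally", "FRIEND_OF", "john", {"since": "01/09/2013"}),
    ("john", "HAS_READ", "book", {"on": "2/03/2013", "rating": 5}),
    ("sally", "HAS_READ", "book", {"on": "02/09/2013", "rating": 4}),
]


def neighbors(node_id, rel_type):
    """Return ids of nodes reached from node_id over outgoing rel_type relationships."""
    return [end for start, rtype, end, _props in relationships
            if start == node_id and rtype == rel_type]


def read_by_friends(node_id):
    """Titles of books read by friends of node_id (a two-hop traversal)."""
    titles = []
    for friend in neighbors(node_id, "FRIEND_OF"):
        for book in neighbors(friend, "HAS_READ"):
            titles.append(nodes[book]["title"])
    return titles
```

Calling `read_by_friends("john")` follows FRIEND_OF and then HAS_READ, which is the kind of relationship-first query the property graph model is built for.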

The Relational Table Model

Customers
• 143 Alice

Accounts
• 326 $100
• 725 $632
• 981 $212

Customer_Accounts
• 143 981
• 143 725
• 143 326
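For contrast with the graph model, the three tables can be loaded into an in-memory SQLite database to show that finding a customer's accounts takes two joins through the Customer_Accounts table. This is a rough sketch: the column names (`id`, `name`, `balance`, `customer_id`, `account_id`) and the helper `balances_for` are assumptions, and the balance for account 725 is read from a line that is broken in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE customer_accounts (customer_id INTEGER, account_id INTEGER);
    INSERT INTO customers VALUES (143, 'Alice');
    -- The 725 balance is an assumption (the transcript line is broken).
    INSERT INTO accounts VALUES (326, 100), (725, 632), (981, 212);
    INSERT INTO customer_accounts VALUES (143, 981), (143, 725), (143, 326);
""")


def balances_for(name):
    """Return (account id, balance) rows for a customer, via the join table."""
    query = """
        SELECT a.id, a.balance
        FROM customers c
        JOIN customer_accounts ca ON ca.customer_id = c.id
        JOIN accounts a ON a.id = ca.account_id
        WHERE c.name = ?
        ORDER BY a.id
    """
    return conn.execute(query, (name,)).fetchall()
```

In a graph, the same question would be a single hop from the customer node to its account nodes, with no join table in between.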

The Neo4j Browser

http://localhost:7474/

Neo4j Browser - finding help

Execute Cypher, Visualize

Introduction to Document Classification

Document Classification

Automatically assign a document to one or more classes

Documents may be classified according to their subjects or according to other attributes

Automatically classify unlabeled documents to a set of relevant classes using labeled training data

Example Use Cases for Document Classification

Sentiment Analysis for Movie Reviews

Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie.

Problem: The user reviews are unstructured text.

How do I automatically generate a score indicating whether the review was positive or negative?

Solution: Train a natural language parsing model on a dataset of previous reviews that have been labeled as either positive or negative.

Recommend Relevant Tags

Scenario: A Q/A website allows users to submit questions and receive answers from other users.

Problem: Users sometimes do not know which tags to apply to their questions to make them easier to discover and answer.

Solution: Automatically recommend the most relevant tags for a question by classifying its text with a model trained on previous questions.

Recommend Similar Articles

Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics.

Problem: The site needs to increase user engagement and time spent on the site.

Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page.

How Automated Document Classification Works

Step 1: Create a Training Dataset

Supervised Learning

Assign a set of labels that describes the document’s text

[Figure: documents mapped to the labels X, Y, and Z]

Step 2: Train a Natural Language Parsing Model

Deep Learning

State machines represent predicates that evaluate to 0 or 1 for a text match

Deep feature representations are selected and learned using an evolutionary algorithm

State machines map to classes of document labels that matched text during training

[Figure: a hierarchy of state machines (p) mapped to the classes X, Y, and Z]
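The idea of state machines as 0-or-1 predicates mapped to classes can be illustrated with regular expressions, each of which compiles to a small state machine. This is a simplified stand-in, not the HPR implementation: the patterns, the class names, and the helpers `match_vector` and `matched_classes` are all hypothetical, and in the talk's approach the features are learned rather than written by hand.

```python
import re

# Hypothetical features: (pattern, classes the pattern matched during training).
# In HPR these would be learned; here they are hard-coded for illustration.
FEATURES = [
    (re.compile(r"\bloved\b"), {"positive"}),
    (re.compile(r"\bboring\b"), {"negative"}),
    (re.compile(r"\bloved\s+the\s+\w+\b"), {"positive"}),  # a "deeper" feature
]


def match_vector(text):
    """Evaluate every predicate against the text: 1 for a match, 0 otherwise."""
    lowered = text.lower()
    return [1 if pattern.search(lowered) else 0 for pattern, _ in FEATURES]


def matched_classes(text):
    """Count, per class, how many matching features point to that class."""
    counts = {}
    for (_pattern, classes), hit in zip(FEATURES, match_vector(text)):
        if hit:
            for label in classes:
                counts[label] = counts.get(label, 0) + 1
    return counts
```

For the text "I loved the ending", the first and third predicates fire, so both matches count toward the hypothetical "positive" class.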

Step 3: Classify Unlabeled Documents

The natural language parsing model is used to classify other unlabeled documents

[Figure: an unlabeled document compared with the classes X, Y, and Z by cos(θ); the scores shown are 0.99, 0.67, and 0.01]

Hierarchical Pattern Recognition (HPR)

What is Hierarchical Pattern Recognition (HPR)?

HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time.

I created the algorithm to do graph-based traversals using a hierarchy of finite state machines (FSM).

Designed for scalable performance in P time:

Influences & Inspirations

Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition

[Figure: a hierarchy of patterns (p) mapped to the labels X, Y, and Z]

How does feature extraction work?

Hierarchical Pattern Recognition

“Deep” feature representations are learned and associated with the labels of the documents in which each feature was discovered.

The feature hierarchy is translated into a Vector Space Model for classification on feature vectors generated from unlabeled text.

HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations.

[Figure: a hierarchy of patterns (p) mapped to the labels X, Y, and Z]

Graph-based feature learning

Learning new features from matches on training data

Cost Function for the Generations of Features

Reproduction occurs after a threshold of matches has been exceeded for a feature.

After replication the cost function is applied to increase that threshold every time the feature reproduces.

is the current threshold on the feature node.

is the minimum threshold, which I chose as 5 for new features.

Cost function:

Vector Space Model

Generating Feature Vectors

The natural language parsing model created during training can be turned into a global feature index.

This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy.

Using that global feature index, a multi-dimensional vector space is created whose number of dimensions equals the number of features in the hierarchy.
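A minimal sketch of turning a global feature index into feature vectors might look like the following. The function names `build_feature_index` and `feature_vector` are hypothetical, the ids stand in for Neo4j internal IDs, and counting repeated matches (rather than recording 0 or 1) is an assumption, not something the slides specify.

```python
def build_feature_index(feature_ids):
    """Map each feature id to a fixed position in the vector space."""
    return {fid: pos for pos, fid in enumerate(sorted(feature_ids))}


def feature_vector(matched_ids, index):
    """Build a vector with one dimension per feature in the global index.

    Each entry counts how often that feature matched the text
    (assumption: counts rather than binary flags).
    """
    vector = [0] * len(index)
    for fid in matched_ids:
        if fid in index:  # ignore ids that are not part of the index
            vector[index[fid]] += 1
    return vector
```

With an index built from the ids 42, 7, and 19, every document gets a three-dimensional vector, so any two documents can be compared position by position.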

Relevance Rankings

“Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” - Wikipedia

Vector-based Cosine Similarity Measure

In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:
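The standard definition is cos(θ) = (A · B) / (‖A‖ ‖B‖): the dot product of the two vectors divided by the product of their lengths. A small Python sketch of that formula follows, together with a helper that ranks classes by similarity to a document vector. The names `cosine_similarity` and `rank_classes` are illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not something taken from the talk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_classes(doc_vector, class_vectors):
    """Return (label, similarity) pairs, most similar class first."""
    scored = [(label, cosine_similarity(doc_vector, vec))
              for label, vec in class_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because feature vectors built from match counts have no negative entries, similarities computed this way fall between 0 and 1.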

Cosine Similarity & Vector Space Model

Vector-based Cosine Similarity Measure

“The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.”

via Wikipedia

Graphify for Neo4j

Graphify for Neo4j

Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.

https://github.com/kbastani/graphify

Example Project

Head over to the GitHub project page and clone it to your local machine.

Follow the directions listed in the README.md to install the extension.

Navigate to the /examples directory of the project.

Run:

examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java

U.S. Presidential Speech Transcript Analysis

Identify the Political Affiliation of a Presidential Speech

This example ingests a set of presidential speech transcripts in the training phase, each labeled with the author of that speech. After the training models are built, unlabeled presidential speeches are classified in the test phase.

The Presidents

• Ronald Reagan

• labels: liberal, republican, ronald-reagan

• George H.W. Bush

• labels: conservative, republican, bush41

• Bill Clinton

• labels: liberal, democrat, bill-clinton

• George W. Bush

• labels: conservative, republican, bush43

• Barack Obama

• labels: liberal, democrat, barack-obama

Training

Each of the presidents in the example has 6 speeches to analyze.

4 of the speeches are used to build a natural language parsing model.

2 of the speeches are used to test the validity of that model.
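The 4/2 split can be sketched as below. The function name `split_speeches` and the choice to take the first four speeches of each president for training are assumptions; the slides do not say how the speeches were selected.

```python
def split_speeches(speeches_by_president, train_count=4):
    """Split each president's speeches into training and test sets.

    Assumption: the first train_count speeches are used for training and
    the remainder for testing.
    """
    train, test = [], []
    for president, speeches in speeches_by_president.items():
        for position, text in enumerate(speeches):
            target = train if position < train_count else test
            target.append((president, text))
    return train, test
```

With 5 presidents and 6 speeches each, this yields 20 training speeches and 10 test speeches.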

Get Similar Labels/Classes

Ronald Reagan

Class Similarity

republican 0.7182046285385341

liberal 0.644281223102398

democrat 0.4854114595950056

conservative 0.4133639188595147

bill-clinton 0.4057969121945167

barack-obama 0.323947855372623

bush41 0.3222644898334092

bush43 0.3161309849153592

George H.W. Bush

Class Similarity

conservative 0.7032274806766954

republican 0.6047256274615608

liberal 0.4439742461594541

democrat 0.39114918238853674

bill-clinton 0.3234223107986785

ronald-reagan 0.3222644898334092

barack-obama 0.2929260544514002

bush43 0.29106733975087984

Bill Clinton

Class Similarity

democrat 0.8375678825642422

liberal 0.7847858060182163

republican 0.5561860529059708

conservative 0.45365774896422445

barack-obama 0.4507676679770066

ronald-reagan 0.4057969121945167

bush43 0.365042482383354

bush41 0.3234223107986785

George W. Bush

Class Similarity

conservative 0.820636570272315

republican 0.7056890956512284

liberal 0.5075788396061254

democrat 0.4505424322086937

bill-clinton 0.365042482383354

barack-obama 0.33801949243378965

ronald-reagan 0.3161309849153592

bush41 0.29106733975087984

Barack Obama

Class Similarity

democrat 0.7668017370739147

liberal 0.7184792203867296

republican 0.4847680475425114

bill-clinton 0.4507676679770066

conservative 0.4149264161292232

bush43 0.33801949243378965

ronald-reagan 0.323947855372623

bush41 0.2929260544514002

Get involved in the Neo4j community

http://stackoverflow.com/questions/tagged/neo4j

http://groups.google.com/group/neo4j

https://github.com/neo4j/neo4j/issues

http://neo4j.meetup.com/

(Thank You)

Get in touch

Twitter www.twitter.com/kennybastani

LinkedIn www.linkedin.com/in/kennybastani

GitHub www.github.com/kbastani
