Music recommendations API with Neo4j


Upload: boris-guarisma

Post on 16-Jan-2017


Page 1: Music recommendations API with Neo4j

6/24/2015

Boris Guarisma, Big Data Freelance

BIG DATA / NoSQL

MUSIC RECOMMENDATIONS PROOF OF CONCEPT WITH NEO4J

Page 2: Music recommendations API with Neo4j

MUSIC RECOMMENDATIONS POC WITH NEO4J

• The Musicovery API provides data to generate music recommendations and playlists of all types: from a mood, an artist, a track, a genre/style, a theme, a period/year, ...

• The response, a list of tracks/artists, can be filtered and personalized by several factors: popularity, listener country, similarity type.

• Playlists and recommendations can be dynamically personalized for a specific user.


Page 5: Music recommendations API with Neo4j

INTRODUCTION

• The success of the Musicovery API and its constantly growing music catalog are expected to bring Big Data-related problems, mostly in terms of volume and velocity.

• Issues were specifically identified in API performance and system scalability.

• The following Proof of Concept (PoC) addresses only the API performance issue.

Page 6: Music recommendations API with Neo4j

CONTENTS

1. Challenge
2. PoC solution design
3. Neo4j
4. PoC steps
5. Traversal
6. Unmanaged extensions
7. Cloud
8. Results
9. Next steps
10. References
11. Contact

Page 7: Music recommendations API with Neo4j

1. CHALLENGE

• Objective: enhance the customer experience with real-time (fast!) responses to HTTP requests

• Issue: high-latency responses (> 10 sec)

• Need: lower the latency considerably while applying the same domain-specific rules as the current API, so that the music recommendations remain equivalent

Page 8: Music recommendations API with Neo4j

2. POC SOLUTION DESIGN (1/3)

• Scope: track recommendations only, not artist or genre recommendations

• Database: use a NoSQL graph database instead of the current relational MySQL database

• “There are no tables and columns any more, like in relational databases, or keys and values, like in other NoSQL technologies”

• “There are no SQL-based select and join commands”

• The secret of Neo4j’s speed is in the data structure: nodes and relationships, each of which can have properties.

• Pre-requisites:
    • break the current SQL query down into several modules (or problems) so that the solution focuses only on the track recommendation problem
    • define the minimum number and types of nodes and relationships necessary to perform track recommendation
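To make the "nodes and relationships, each with properties" idea concrete, here is a minimal in-memory sketch of a property graph. It is not Neo4j code and only illustrates the data structure; the class names and the `user_id`, `track_id` and `liked` property names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                        # e.g. "User" or "Track"
    properties: dict = field(default_factory=dict)    # arbitrary key/value pairs


@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str                                     # e.g. "HAS_REF_TRACK"
    properties: dict = field(default_factory=dict)    # relationships carry properties too


# Hypothetical data: one user, one track, and a typed, directed link between them.
user = Node("User", {"user_id": 1})
track = Node("Track", {"track_id": 42})
liked = Relationship(user, track, "HAS_REF_TRACK", {"liked": True})
```

Following a relationship here is a direct object reference rather than a join on keys, which is the intuition behind the speed claim above.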

Page 9: Music recommendations API with Neo4j

2. POC SOLUTION DESIGN (2/3)

• Musicovery collects user actions such as liking or banning a track. Based on other, time-specific actions such as a radio launch, tracks can also be flagged as “burnt” or “interested”.

• An ETL process was implemented in R for data wrangling and for computing the domain-specific rules prior to loading the data into the NoSQL database.

• Reference tracks: only a user’s liked and interested tracks (flagged using relationship properties) are loaded into the NoSQL database.
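The reference-track selection can be sketched as a simple filter. The PoC's ETL was written in R; the Python below is only an illustration, and the input layout (tuples of user, track, status) and the output field names are assumptions.

```python
def select_reference_tracks(actions):
    """Keep liked and interested tracks; drop banned and burnt ones."""
    kept_statuses = {"liked", "interested"}
    rows = []
    for user_id, track_id, status in actions:
        if status in kept_statuses:
            rows.append({
                "user_id": user_id,
                "track_id": track_id,
                # Flags that would become relationship properties in the graph.
                "liked": status == "liked",
                "interested": status == "interested",
            })
    return rows


# Hypothetical user actions after the rules have assigned one status per track.
actions = [
    (1, 10, "liked"),
    (1, 11, "banned"),
    (2, 10, "interested"),
    (2, 12, "burnt"),
]
reference_rows = select_reference_tracks(actions)
```

Only the first and third actions survive the filter, each carrying the flag that distinguishes a liked track from an interested one.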

Page 10: Music recommendations API with Neo4j

2. POC SOLUTION DESIGN (3/3)

• Musicovery also stores track similarity data. Similarities between tracks are precomputed and represented by several distances based on specific criteria, e.g. artist, genre, user co-occurrences, “mood”, date, etc.

• Only the global distance between tracks (a weighted sum of the different distances) is loaded into the NoSQL database.

• For the PoC we only need:
    • two node types: user and track
    • two relationships: user’s reference track and track’s similar track
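The global distance as a weighted sum can be sketched as follows. The criteria names come from the list above, but the weights and distance values are invented for illustration; the deck does not give the real ones.

```python
def global_distance(distances, weights):
    """Weighted sum of the per-criterion distances between two tracks."""
    return sum(weights[name] * value for name, value in distances.items())


# Hypothetical weights and per-criterion distances for one pair of tracks.
weights = {"artist": 0.5, "genre": 0.25, "mood": 0.25}
distances = {"artist": 0.2, "genre": 0.4, "mood": 0.8}

score = global_distance(distances, weights)  # 0.5*0.2 + 0.25*0.4 + 0.25*0.8
```

With these made-up numbers the score comes to about 0.4; only this single value per track pair would be stored on the similarity relationship.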

Page 11: Music recommendations API with Neo4j

3. NEO4J

• http://neo4j.com/
• Enterprise 2.2.0 M02 free trial
• Used the Batch Import Tool for CSV files

• PoC:
    • Number of User nodes: 897 478
    • Number of Track nodes: 167 655
    • Number of HAS_REF_TRACK rel: 6 827 294
    • Number of HAS_SIM_TRACK rel: 83 251 998

• Size: 6.2 GB
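As a rough idea of what the CSV input might look like, the sketch below writes one node file and one relationship file per type. It assumes the header conventions documented for Neo4j's import tool (`:ID`, `:START_ID`, `:END_ID`, `:TYPE`); the property names `userId`, `trackId` and `distance` are hypothetical, and the deck does not show the actual files.

```python
import csv
import io


def write_csv(header, rows):
    """Render a header and rows as CSV text (in memory, for illustration)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Node files: one ID column per node type (property names are assumptions).
users_csv = write_csv(["userId:ID(User)"], [[1], [2]])
tracks_csv = write_csv(["trackId:ID(Track)"], [[10], [11]])

# Relationship files: start node, end node, optional properties, type.
ref_csv = write_csv(
    [":START_ID(User)", ":END_ID(Track)", ":TYPE"],
    [[1, 10, "HAS_REF_TRACK"]],
)
sim_csv = write_csv(
    [":START_ID(Track)", ":END_ID(Track)", "distance:float", ":TYPE"],
    [[10, 11, 0.4, "HAS_SIM_TRACK"]],
)
```

Keeping users and tracks in separate ID spaces lets the same numeric ID appear in both files without clashing.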

Page 12: Music recommendations API with Neo4j

3. NEO4J

[Figure: example graph. Labels: user; one of the user’s reference tracks; three tracks similar to the user’s reference track]

Page 13: Music recommendations API with Neo4j

4. POC STEPS

• Step 1: as Java application 1
    • compare Neo4j “all” track recommendation results to those from the MySQL database without domain-specific rules

• Step 2: as Java application 2
    • compare Neo4j “filtered” track recommendation results to those from the MySQL database with domain-specific rules applied (see “Traversal”, next slide)

• Step 3: as an unmanaged extension on the cloud
    • implement step 2 as an unmanaged extension on the cloud, e.g. AWS EC2
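The comparison in steps 1 and 2 could be done in many ways; the deck does not say which was used. One simple possibility, sketched here with hypothetical track IDs, is to compare the two result sets and report the differences.

```python
def compare_recommendations(neo4j_tracks, mysql_tracks):
    """Report whether two recommendation lists contain the same tracks."""
    neo4j_set, mysql_set = set(neo4j_tracks), set(mysql_tracks)
    return {
        "identical": neo4j_set == mysql_set,
        "missing_in_neo4j": sorted(mysql_set - neo4j_set),
        "extra_in_neo4j": sorted(neo4j_set - mysql_set),
    }


# Hypothetical results for one user from each system.
report = compare_recommendations([10, 11, 12], [10, 12, 13])
```

A set comparison ignores ranking; if the order of recommendations matters, the lists would have to be compared position by position instead.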

Page 14: Music recommendations API with Neo4j

5. TRAVERSAL

“Graph traversal is the process of visiting nodes in the graph by following the relationships between nodes in a particular manner”.

• “particular manner” = a depth-first traversal applying track recommendation rules.

• Depth evaluations determine whether to keep the current node in the result (include) or to discard it (exclude). Example:
    • depth 1 evaluation: consider only seed tracks, e.g. user reference tracks with a lower number of (sampled) liked tracks.
    • depth 2 evaluation: apply a track popularity filter based on the user’s country.

• Other complex domain-specific rules were applied before and after the traversal, and were coded in Java as well.
    • example: only consider similar tracks whose number of occurrences is greater than a minimum threshold calculated per user, …
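The depth evaluations can be illustrated with a small depth-first walk over an adjacency list. The PoC implemented this in Java against Neo4j; the Python below only mirrors the include/exclude logic, and the graph, the seed rule and the popularity rule are all hypothetical stand-ins.

```python
def recommend(graph, user, is_seed, is_popular):
    """Depth-first walk: user (depth 0) -> reference tracks (1) -> similar tracks (2)."""
    results, seen = [], set()
    stack = [(user, 0)]                 # explicit stack gives depth-first order
    while stack:
        node, depth = stack.pop()
        if depth == 1 and not is_seed(node):
            continue                    # exclude this reference track and prune its branch
        if depth == 2:
            if is_popular(node) and node not in seen:
                seen.add(node)
                results.append(node)    # include: passed the popularity filter
            continue                    # never expand beyond depth 2
        # Push neighbours in reverse so they are visited in their listed order.
        for neighbour in reversed(graph.get(node, [])):
            stack.append((neighbour, depth + 1))
    return results


# Hypothetical toy graph: u1's reference tracks and their similar tracks.
graph = {
    "u1": ["t1", "t2", "t6"],
    "t1": ["t3", "t4"],
    "t2": ["t4", "t5"],
    "t6": ["t7"],
}
seed_tracks = {"t1", "t2"}              # t6 fails the depth-1 rule
popular_in_country = {"t3", "t5", "t7"}

recommendations = recommend(
    graph, "u1",
    is_seed=lambda t: t in seed_tracks,
    is_popular=lambda t: t in popular_in_country,
)
```

Here t7 is popular but never reached, because its only path runs through t6, which the depth 1 evaluation excludes. Pruning early in this way keeps the traversal from expanding the large similar-track fan-out of tracks that will not contribute.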


Page 16: Music recommendations API with Neo4j

6. UNMANAGED EXTENSION

“Unmanaged extensions essentially allow you to define your own domain-specific REST API”.

• Objective: send HTTP GET requests via cURL and receive JSON responses that include the response time and the track recommendations for the corresponding user.
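The shape of such a JSON response can be sketched independently of Neo4j. A real unmanaged extension is Java code deployed inside the Neo4j server; the Python below only shows how a response carrying both the timing and the recommendations might be assembled. The field names and the stub recommender are assumptions.

```python
import json
import time


def build_response(user_id, recommender):
    """Run the recommender, time it, and wrap both in a JSON document."""
    start = time.perf_counter()
    tracks = recommender(user_id)
    elapsed = time.perf_counter() - start
    return json.dumps({
        "userId": user_id,                          # hypothetical field names
        "responseTimeSeconds": round(elapsed, 6),
        "recommendations": tracks,
    })


# Stub recommender standing in for the graph traversal.
body = build_response(1, lambda user_id: [10, 11])
```

Measuring the time inside the handler reports server-side processing only; network latency seen by the cURL client would come on top of it.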

Page 17: Music recommendations API with Neo4j

7. CLOUD

• Used an m3.xlarge AWS EC2 instance running Ubuntu 14.04 LTS

Page 18: Music recommendations API with Neo4j

7. CLOUD

[Figure labels: unmanaged extension; HTTP GET request]

Page 19: Music recommendations API with Neo4j

8. RESULTS

• In blue: the average response time in seconds
• All PoC average response times are under 1 second!
• Thus, response time is divided by 100 on average

[Chart axis: time in seconds]

Page 20: Music recommendations API with Neo4j

9. NEXT STEPS

• Musicovery is now confident that a fast REST API for music recommendations can be implemented using NoSQL graph database technology.

• Music recommendation is only one of the features that the Musicovery API can offer. Discussions about testing other API functions are still ongoing.

• In addition to the ETL process (slide 5), other modules that support music recommendations, such as real-time track/artist “distance” calculations, are being tested as PoCs based on large-scale data processing frameworks such as Spark and Kafka.

Note: the Neo4j Enterprise free trial used for this PoC has expired.

Page 21: Music recommendations API with Neo4j

10. REFERENCES

• Vukotic A., Watt N., Neo4j in Action, Manning Publications, ISBN 9781617290763

• Robinson I., Webber J., Eifrem E., Graph Databases, O’Reilly, 2nd edition, ISBN 9781491930892

Page 22: Music recommendations API with Neo4j

11. CONTACT

• Boris Guarisma
• Freelance, Big Data and Data Science consultancy
• [email protected]
• https://fr.linkedin.com/in/borisguarisma