
Empirical Network Analysis Leonardo Soto Matamala

Empirical Network Analysis

Title: Exploring an Artist Co-Occurrence Network Derived from Implicit User Feedback in Digital Music.

Class: Social Network Analysis
Professor: Lada Adamic

Student: Leonardo Soto Matamala
Date: Nov 24, 2014

Project Description

The growth and adoption of internet radio and similar services are producing a massive amount of implicit feedback that can be incorporated into those products to improve the user experience. Traditionally, collaborative filtering is used together with content-based filtering, i.e. filtering on metadata about songs and artists (such as music genre). However, it is hard to obtain high-quality metadata at scale. The goal of this project is to explore to what extent network analysis over implicit user feedback can be used to learn or improve artist-artist associations. Such associations could be used to identify metadata issues and/or to enhance the quality of recommendations, via an improved recall set and ranking function.


Dataset

The dataset was obtained from a Graphlab tutorial on music recommendation:

http://graphlab.com/learn/gallery/notebooks/recsys_rank_10K_song.html

It is a preprocessed and subsampled file from the Million Song Dataset: http://labrosa.ee.columbia.edu/millionsong/

The “Taste Profile” part of the dataset contains real user play counts from undisclosed partners. The downloaded dataset contained 2 million user play counts and 10K songs. This data was further filtered and aggregated in order to generate an artist co-occurrence network.

To process the data efficiently, it was loaded into a Postgres database and pre-processed to generate an aggregated count of plays per artist for each user. Another important step was subsampling the data to obtain a dataset size that was manageable in R and Gephi.

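Filtering

The following filters were applied:
- Only songs from the year 2000 or later were considered.
- Only user-song play counts of 10 or more were kept (the count > 9 condition in Appendix 2).
- Only users with 5 or more artists were kept.
- Only artists with 3 or more users were kept.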

The generated dataset consists of a list of tuples: <user_id, artist_name, count>. The final dataset contains 2223 users and 1021 artists.
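The two user/artist filters can be illustrated with a minimal sketch. It is written in Python (standard library only) rather than the SQL actually used in Appendix 2, and the function name `filter_user_artist` and its parameters are hypothetical. It assumes the rows are already aggregated to one tuple per user and artist.

```python
from collections import defaultdict

def filter_user_artist(rows, min_artists_per_user=5, min_users_per_artist=3):
    """Apply the user filter, then the artist filter, to (user_id, artist_name, count) rows.

    Hypothetical helper for illustration; thresholds default to the values
    stated in the report.
    """
    rows = list(rows)

    # Step 1: keep only users who listened to enough distinct artists.
    artists_by_user = defaultdict(set)
    for user, artist, _ in rows:
        artists_by_user[user].add(artist)
    kept_users = {u for u, a in artists_by_user.items()
                  if len(a) >= min_artists_per_user}
    rows = [r for r in rows if r[0] in kept_users]

    # Step 2: among the remaining rows, keep only artists with enough distinct users.
    users_by_artist = defaultdict(set)
    for user, artist, _ in rows:
        users_by_artist[artist].add(user)
    kept_artists = {a for a, u in users_by_artist.items()
                    if len(u) >= min_users_per_artist}
    return [r for r in rows if r[1] in kept_artists]
```

As in the SQL of Appendix 2, the user filter is not re-checked after the artist filter, so a user may end up with fewer than 5 artists in the final data.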

Please see Appendixes 1 and 2, for details on downloading the data and pre-processing it in SQL.

Building the Artist Co-Occurrence Network

The dataset described in the previous section contains 15639 tuples <user_id, artist_name, count>. It was loaded into R in order to build the network.

#Loading the data
cleaned_user_artist_data <- read.csv("~/sna/graphlab/cleaned_user_artist_data.csv", sep=";", stringsAsFactors=FALSE)

head(cleaned_user_artist_data)
                                   user_id              artist_name count nbr_songs
1 0031572620fa7f18487d3ea22935eb28410ecc4c                 Coldplay   182        12
2 0031572620fa7f18487d3ea22935eb28410ecc4c                  Incubus    49         4
3 0031572620fa7f18487d3ea22935eb28410ecc4c             Kate Winslet    22         1
4 0031572620fa7f18487d3ea22935eb28410ecc4c Rage Against The Machine    11         1
5 0031572620fa7f18487d3ea22935eb28410ecc4c               Slim Dusty    21         1


This data was used for building a matrix of artist to users (indicator function, i.e. 0 or 1)

#Creating a matrix of artists (rows) to users (columns)
s_artist_user_matrix <- as.matrix(ifelse(table(cleaned_user_artist_data$artist_name, cleaned_user_artist_data$user_id) > 0, 1, 0))

This matrix is then used for building an artist-artist matrix:

#Multiplying this matrix by its transpose in order to generate an artist-artist
#matrix. This contains the co-occurrence of artists in user play history.
artist2artist <- s_artist_user_matrix %*% t(s_artist_user_matrix)
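To make the matrix product concrete, here is a small sketch in Python (standard library only, not the R code used in the report). The function name `cooccurrence` is hypothetical. Entry (i, j) of the result counts the users who played both artist i and artist j, and the diagonal counts the users of each artist.

```python
def cooccurrence(pairs):
    """Build the artist-artist co-occurrence matrix C = M * M^T.

    pairs: iterable of (user_id, artist_name).
    Returns (artists, C) where C[i][j] is the number of users who played
    both artists[i] and artists[j]. Hypothetical helper for illustration.
    """
    pairs = list(pairs)
    artists = sorted({a for _, a in pairs})
    users = sorted({u for u, _ in pairs})
    a_idx = {a: i for i, a in enumerate(artists)}
    u_idx = {u: j for j, u in enumerate(users)}

    # Indicator matrix M: M[i][j] = 1 if user j played artist i, else 0.
    m = [[0] * len(users) for _ in artists]
    for u, a in pairs:
        m[a_idx[a]][u_idx[u]] = 1

    # C = M * M^T, written out as explicit dot products of rows of M.
    n = len(artists)
    c = [[sum(m[i][k] * m[j][k] for k in range(len(users)))
          for j in range(n)]
         for i in range(n)]
    return artists, c

artists, c = cooccurrence([
    ("u1", "Coldplay"), ("u1", "Incubus"),
    ("u2", "Coldplay"), ("u2", "Incubus"),
    ("u3", "Coldplay"),
])
print(artists)  # ['Coldplay', 'Incubus']
print(c)        # [[3, 2], [2, 2]]
```

In this toy input two users played both artists, so the off-diagonal entry is 2, while the diagonal holds the listener count of each artist (3 and 2).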

This matrix (artist-artist) is then used as the adjacency matrix for our (weighted) network:

#Using artist2artist matrix as adjacency matrix in order to create the graph,
#using co-occurrence values for weights.
library(igraph)
ws_s <- graph.adjacency(artist2artist, weighted=TRUE, mode="undirected")

Lastly, weights are normalized, label is set and the network is saved as a graphml file:

#Normalizing weights with respect to maximum weight.
E(ws_s)$Weight <- E(ws_s)$weight / max(E(ws_s)$weight)
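The normalization step can be sketched as follows, again in Python rather than R. The helper `normalized_edges` is hypothetical. One assumption differs from the R call above and should be noted: the sketch skips the diagonal of the co-occurrence matrix, so it produces no self-loops, whereas the report passes the full matrix to igraph.

```python
def normalized_edges(artists, c):
    """Turn a symmetric co-occurrence matrix into a weighted edge list.

    Returns {(artist_a, artist_b): weight}, with weights divided by the
    maximum weight so that they fall in (0, 1]. Assumption of this sketch:
    diagonal entries (self co-occurrence) are ignored.
    """
    edges = {}
    n = len(artists)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: undirected graph
            if c[i][j] > 0:
                edges[(artists[i], artists[j])] = c[i][j]
    if not edges:
        return {}
    w_max = max(edges.values())
    return {pair: w / w_max for pair, w in edges.items()}
```

For example, a matrix with off-diagonal counts 2 (A, B) and 1 (A, C) yields weights 1.0 and 0.5.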

#Assigning label to artist name for visualization in Gephi
V(ws_s)$Label <- V(ws_s)$name

#Exporting the graph:
write.graph(ws_s, 'weighted_co_artist_nx.graphml', format="graphml")

#NOTE: it is necessary to remove special characters from the file in order to load
#it into Gephi.
cat weighted_co_artist_nx.graphml | tr -cd '\11\12\15\40-\176' > weighted_co_artist_nx_for_gephi.graphml
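For readers unfamiliar with the octal ranges in the tr command, the following Python sketch shows the same byte filter: it keeps tab (9), line feed (10), carriage return (13) and printable ASCII (32 to 126) and deletes everything else. The function name `strip_for_gephi` is hypothetical.

```python
def strip_for_gephi(data: bytes) -> bytes:
    """Keep only tab, LF, CR and printable ASCII bytes, like tr -cd '\\11\\12\\15\\40-\\176'."""
    keep = {9, 10, 13} | set(range(32, 127))
    return bytes(b for b in data if b in keep)

print(strip_for_gephi("Björk\n".encode("utf-8")))  # b'Bjrk\n'
```

Because every byte of a multi-byte UTF-8 character is removed, accented letters disappear entirely, which is consistent with spellings such as "Bjrk" in the rankings further on.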


Data Analysis

The network was then loaded into Gephi for analysis. One of the most interesting findings came from community detection. The modularity report is shown below:


Network statistics and tools run:

Avg. Degree: 74.58
Modularity: 0.21
PageRank
Connected Components: 1
Avg. Clustering Coefficient: 0.45
Avg. Path Length: 2
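As a reminder of what two of these statistics measure, here is a minimal Python sketch of average degree and average clustering coefficient. It assumes an unweighted, undirected graph without self-loops, stored as a dictionary from node to neighbor set. The function names are hypothetical, and whether Gephi treats weights and self-loops the same way is not stated in this report.

```python
def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_degree(toy))  # 2.0
```

In the toy graph, a and b have clustering 1, c has 1/3 (only one of its three neighbor pairs is linked) and d has 0, so the average is 7/12.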

Visualization

First, the nodes were color-coded by PageRank and also by degree (the latter is the one shown below). ForceAtlas 2 was used for layout, together with “Label Adjust”.

In order to study each community separately, a filter was used (Library → Attributes → Partition). This makes it possible to isolate each community by clicking on the corresponding partition. A screenshot for each community follows, together with some of its representative artists (selected by degree and restricted to artists that I know).

Mod Class 0 (34.18%)
The Killers, Florence + The Machine, The Black Keys, Snow Patrol, MGMT

Mod Class 5 (23.41%)
Linkin Park, Eminem, Evanescence, 3 Doors Down, Rise Against, Foo Fighters

Mod Class 3 (22.33%)
One Republic, Coldplay, Train, Nickelback, Taylor Swift, Lady Gaga, Black Eyed Peas

Mod Class 2 (11.17%)
Jack Johnson, Regina Spektor, Red Hot Chili Peppers, Stone Temple Pilots, Weezer

Mod Class 1 (8.33%)
Kings Of Leon, Bjork, Tub Ring, Angels and Airwaves, The Prodigy

Mod Class 4 (0.59%)

It is really interesting to see the artists form communities around different genres. A tricky aspect of musical genres is how to create a compatibility matrix for a given genre taxonomy. In this case, compatibility emerges naturally from the interactions of users and music services.
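The modularity score that underlies this kind of community detection can be sketched in a few lines of Python. The sketch uses the standard definition for a weighted undirected graph without self-loops, Q = sum over communities of (L_c / m) - (d_c / 2m)^2, where m is the total edge weight, L_c the weight of edges inside community c, and d_c the summed weighted degree of its nodes. The function name `modularity` is hypothetical, and this is not Gephi's implementation, which also searches for the partition.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a given partition.

    edges: {(u, v): weight} for an undirected graph without self-loops.
    community: {node: community_label}.
    """
    m = sum(edges.values())
    intra = defaultdict(float)   # L_c: weight of edges inside each community
    degree = defaultdict(float)  # d_c: summed weighted degree per community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            intra[community[u]] += w
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge, split into their natural communities.
pairs = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
edges = {p: 1.0 for p in pairs}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, split), 4))  # 0.3571
```

Putting all six nodes into one community gives Q = 0, which is why a higher score indicates a partition with more within-community weight than expected by chance.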

I also loaded the network into the MapEquation website and used its clustering tools to generate clusters and visualize them.

The resulting clusters are quite relevant, especially the Karaoke cluster, which can be challenging to identify when reliable metadata is not available.

Another interesting exercise was ranking artists (globally) by different measures of centrality.

Top 10 by Degree

Rank  Artist                    Degree
1     Kings Of Leon             611
2     OneRepublic               538
3     Coldplay                  510
4     Bjrk                      453
5     The Killers               451
6     Linkin Park               427
7     Florence + The Machine    418
8     The Black Keys            405
9     Rise Against              383
10    Jack Johnson              358

Top 10 by Betweenness

Rank  Artist                    Betweenness
1     Kings Of Leon             28061.4113149
2     Coldplay                  17102.773762
3     OneRepublic               17081.1948947
4     Bjrk                      12593.1831725
5     Linkin Park               11117.778936
6     The Killers               10121.7606571
7     Florence + The Machine    9130.73668653
8     The Black Keys            8710.33946082
9     Rise Against              7851.85934978
10    Gorillaz                  7778.30652734

Top 10 by Clustering Coefficient

Rank  Artist                         Clustering Coefficient
1     All That Remains               0.8482758621
2     Enrique Iglesias / Lil Wayne   0.825
3     Plain White T S                0.8225806452
4     Madonna                        0.8092307692
5     Tracy Chapman                  0.803030303
6     So Many Dynamos                0.7707509881
7     Jonas Steur                    0.7692307692
8     Kid Cudi / Ratatat             0.7619047619
9     Ennio Morricone                0.7582417582
10    Mudvayne                       0.7579831933

High clustering for some of these artists makes a lot of sense, for instance for Ennio Morricone, who has collaborated with several artists and created many popular soundtracks.
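For reference, betweenness centrality can be computed with Brandes' algorithm. The Python sketch below handles the unweighted, undirected case, with the graph stored as a dictionary from node to neighbor set. The function name `betweenness` is hypothetical, and it is an assumption that the report's Gephi run used comparable settings (unweighted shortest paths, no normalization).

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes) for an undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:              # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = betweenness(path)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:2])
```

On the four-node path a-b-c-d, the two inner nodes each lie on two shortest paths between other pairs and score 2, while the endpoints score 0. Sorting the resulting dictionary by value gives a ranking of the same kind as the tables above.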

Conclusions

Network analysis applied to artists co-occurring in user play histories gathered from music services provides opportunities for new insights. Despite the limitations of the dataset (sampling and filtering), interesting insights were found. It would be interesting to see the graph built from the complete Million Song Dataset. Community detection was particularly insightful: it surfaces artist compatibility, which emerges naturally from the interactions of users and music services. In addition, degree and PageRank can be used for ranking popular artists inside communities, another useful piece of information for designing or improving music recommendation systems. Similarly, it is possible to build a song co-occurrence network, which would be interesting to study and compare with the artist network, especially for artists whose works span very different genres.


Appendix 1: Dataset

Getting the data

The first file contains the user id, the song id and the number of times the user has played that song.

wget http://s3.amazonaws.com/GraphLab-Datasets/millionsong/10000.txt

head 10000.txt

b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1

b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2

b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1

b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1

...

The second file contains song id and associated metadata (song title, artist name, year).

wget http://s3.amazonaws.com/GraphLab-Datasets/millionsong/song_data.csv

head song_data.csv

song_id,title,release,artist_name,year

SOQMMHC12AB0180CB8,"Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003

SOVFVAK12A8C1350D9,"Tanssi vaan","Karkuteillä",Karkkiautomaatti,1995

SOGTUKN12AB017F4F1,"No One Could Ever",Butter,"Hudson Mohawke",2006

...

Appendix 2: Data Preprocessing

Creating Tables in Postgresql

CREATE TABLE music_samples (
    user_id text,
    song_id text,
    count int
);
CREATE INDEX music_samples_song_idx ON music_samples (song_id);

CREATE TABLE song_data (
    song_id text,
    title text,
    release text,
    artist_name text,
    year int
);


Loading Data into Tables

Loading play history for 10K songs:

cat 10000.txt | psql -c "COPY music_samples FROM stdin"

Loading metadata:

awk 'NR> 1 { print $0}' song_data.csv | psql -c "COPY song_data FROM stdin WITH csv"

SQL CODE

CREATE TEMP TABLE cleaned_song_data AS
SELECT distinct song_id, artist_name
FROM song_data
WHERE year > 1999;
--307936 rows
CREATE INDEX cleaned_song_data_song_idx ON cleaned_song_data(song_id);

DROP TABLE IF EXISTS user_artist_song_count;
CREATE TABLE user_artist_song_count AS
SELECT user_id, artist_name, count, s.song_id
FROM music_samples s, cleaned_song_data m
WHERE s.song_id = m.song_id AND count > 9;
--60929 rows
CREATE INDEX user_artist_song_count_user_artist_idx ON user_artist_song_count (user_id, artist_name);

DROP TABLE IF EXISTS counts_by_user_artist;
CREATE TABLE counts_by_user_artist AS
SELECT user_id, artist_name, SUM(count) as count, COUNT(distinct song_id) as nbr_songs
FROM user_artist_song_count
GROUP BY user_id, artist_name;
--50870
CREATE INDEX counts_by_user_artist_user_idx ON counts_by_user_artist(user_id);

--only keep users with 5 artists or more
DROP TABLE IF EXISTS nbr_artist_for_users;
CREATE TEMP TABLE nbr_artist_for_users AS
SELECT user_id, count(distinct artist_name) as nbr_unique_artists
FROM counts_by_user_artist
GROUP BY user_id
HAVING count(distinct artist_name) > 4;
--2223 users
CREATE INDEX nbr_artist_for_users_user_idx ON nbr_artist_for_users(user_id);

--just keep users that have at least 5 artists
CREATE TEMP TABLE filtered1 AS
SELECT c.*
FROM counts_by_user_artist c, nbr_artist_for_users n
WHERE c.user_id = n.user_id;
--16320

CREATE TEMP TABLE number_users_by_artist AS
SELECT artist_name, count(distinct user_id)
FROM filtered1
GROUP BY artist_name
HAVING count(distinct user_id) > 2
ORDER BY count DESC;
--1021

DROP TABLE IF EXISTS filtered2;
CREATE TEMP TABLE filtered2 AS
SELECT c.*
FROM filtered1 c, number_users_by_artist a
WHERE c.artist_name = a.artist_name;
--15638

SELECT c.user_id, '"' || regexp_replace(c.artist_name, '"', ' ') || '"' as artist_name, c.count, nbr_songs
FROM filtered1 c, number_users_by_artist a
WHERE c.artist_name = a.artist_name;

References

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

http://graphlab.com/learn/gallery/notebooks/recsys_rank_10K_song.html
http://labrosa.ee.columbia.edu/millionsong/