lastfm crawler
Mini-project result presentation in class
last.fm crawler
RW vs RWRW
Mário Almeida, Gilani, Arinto Murdopo
Outline
● Parameters
● Methodology
● Results
● Challenges
● Conclusion
Parameters
1. Playcounts
2. Playlists
3. Ages
4. IDs
5. Number of friends (degrees)
For each parameter, compare the average estimated using RW with the one estimated using RWRW!
Methodology
Utilized lastfm APIs to obtain
● user info
● number of friends (degree)
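The crawl described above can be pictured as a simple random walk over the friendship graph. The sketch below is only an illustration under stated assumptions: `get_friends` is a hypothetical stand-in for whatever lastfm API call returns a user's friend list, and `TOY_GRAPH` is made-up data, not crawl output.

```python
import random

def random_walk(get_friends, seed_user, num_samples, rng=None):
    """Simple random walk: from the current user, move to a uniformly chosen friend.

    get_friends is a hypothetical callable returning the friend list of a user.
    Returns a list of (user, degree) pairs in visiting order.
    """
    rng = rng or random.Random()
    current = seed_user
    samples = []
    while len(samples) < num_samples:
        friends = get_friends(current)
        if not friends:
            # A user with 0 friends is a dead end: the walk cannot continue.
            break
        samples.append((current, len(friends)))
        current = rng.choice(friends)
    return samples

# Toy undirected friendship graph (assumption, for illustration only).
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}

walk = random_walk(lambda user: TOY_GRAPH[user], "a", 1000, random.Random(0))
```

Each sample keeps the degree next to the user, since the degree is needed later as the RWRW weight.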
RW with UIS-WR
On-the-fly, we apply the RW formula:
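The formula from the slide is not preserved in this transcript. Assuming the RW estimate is the plain (unweighted) mean of the sampled values, an on-the-fly update could look like the sketch below; `RunningMean` is a hypothetical helper name, not taken from the project code.

```python
class RunningMean:
    """Unweighted sample mean, updated one sample at a time (assumed RW estimate)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental form of mean = (x_1 + ... + x_n) / n.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

rw_age = RunningMean()
for age in [20, 30, 25]:  # made-up ages, for illustration only
    rw_age.update(age)
```

Updating incrementally avoids storing all samples, which matters for a crawl that runs for hours.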
Methodology
For RWRW, we apply:
The weight Wv is set to the number of friends (degree) of user v.
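The RWRW formula is also missing from the transcript. A common choice for re-weighting a random walk is the Hansen-Hurwitz style estimator, estimate = Σ(x_v / W_v) / Σ(1 / W_v), with W_v equal to the degree. The sketch below assumes that form; `ReweightedMean` is a hypothetical name.

```python
class ReweightedMean:
    """Re-weighted mean: sum(x_v / W_v) / sum(1 / W_v), with W_v = degree (assumed form)."""

    def __init__(self):
        self.weighted_sum = 0.0  # running sum of x_v / W_v
        self.norm = 0.0          # running sum of 1 / W_v

    def update(self, value, degree):
        if degree <= 0:
            raise ValueError("degree must be positive to be used as a weight")
        self.weighted_sum += value / degree
        self.norm += 1.0 / degree
        return self.weighted_sum / self.norm

rwrw = ReweightedMean()
# Made-up (value, degree) samples: the high-degree user counts for less.
for value, degree in [(10, 1), (40, 4)]:
    estimate = rwrw.update(value, degree)
```

With these two samples the plain mean would be 25, while the re-weighted estimate is 16, because the sample with degree 4 is down-weighted by a factor of 4.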
Results
Crawled for ~10 hours
Number of samples: 48000
Number of age samples: 36363 (not all users show their age)
Results - Ages
After about 25k samples, the average age estimate stabilizes.
RW estimates a lower average age than RWRW. This points to a strong negative correlation between age and degree: users with more friends tend to be younger.
Results - Playlists
Most users do not have playlists.
RW estimates a higher average number of playlists than RWRW. Users with higher degrees tend to have more playlists.
Results - Playcounts
We found some users with playcounts on the order of millions.
RW estimates a higher average playcount than RWRW. Users with higher degrees tend to have higher playcounts.
Results - IDs
RW estimates a lower average ID than RWRW. A user with a lower ID generally has a higher degree.
The estimate is not yet stable.
Results - Degrees
RWRW reduces the bias toward nodes that are more likely to be visited because of their high degree. The RWRW estimate is indeed close to the expected degree value.
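The degree bias can be worked out exactly on a small example. The sketch below assumes the walk has reached its stationary distribution on an undirected graph, where a node is visited with probability proportional to its degree. The degree list is made up, and `limiting_estimates` is a hypothetical helper.

```python
from fractions import Fraction

def limiting_estimates(values, degrees):
    """Limits of the RW and RWRW averages when nodes are sampled
    with probability proportional to degree (stationary random walk)."""
    total_degree = sum(degrees)
    visit_prob = [Fraction(d, total_degree) for d in degrees]
    true_mean = Fraction(sum(values), len(values))
    # Plain RW average: each value weighted by its visit probability.
    rw_limit = sum(p * x for p, x in zip(visit_prob, values))
    # RWRW average: sum(x / d) / sum(1 / d) under the same visit probabilities.
    numerator = sum(p * Fraction(x, d) for p, x, d in zip(visit_prob, values, degrees))
    denominator = sum(p * Fraction(1, d) for p, d in zip(visit_prob, degrees))
    return true_mean, rw_limit, numerator / denominator

degrees = [3, 1, 2, 2]  # toy graph degrees (assumption)
true_mean, rw_limit, rwrw_limit = limiting_estimates(degrees, degrees)
```

For this toy graph the true mean degree is 2, the plain RW average tends to 9/4, and the re-weighted average recovers 2.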
Conclusion
● A simple random walk in a social network generally results in biased averages.
  ○ A node with a higher degree has a higher probability of being discovered.
● RWRW normalizes the averages.
  ○ High variations do not abruptly impact the estimation.
  ○ RWRW reduces the biases of RW.
● Low variance means a smaller difference between RW and RWRW.
● Crawling lastfm poses many challenges
  ○ e.g.: 0 degree, banned user, huge playcounts
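One way to cope with incomplete samples (hidden ages, users with 0 degree) is to skip them per attribute when computing the re-weighted average. This is a sketch under assumptions, not the project's actual handling: the sample layout (dicts with a `degree` key) and the name `rwrw_mean` are hypothetical, and skipping hidden ages means the result describes only users who disclose their age.

```python
def rwrw_mean(samples, attribute):
    """Re-weighted mean of one attribute, skipping unusable samples.

    A sample is skipped when the attribute is missing (None) or when its
    degree is not positive, since the degree is used as the weight.
    Returns (estimate, number_of_samples_used); estimate is None if none were usable.
    """
    weighted_sum = 0.0
    norm = 0.0
    used = 0
    for sample in samples:
        value = sample.get(attribute)
        degree = sample.get("degree", 0)
        if value is None or degree <= 0:
            continue
        weighted_sum += value / degree
        norm += 1.0 / degree
        used += 1
    if used == 0:
        return None, 0
    return weighted_sum / norm, used

# Made-up samples: one hidden age and one user with 0 friends.
samples = [
    {"degree": 2, "age": 20},
    {"degree": 4, "age": None},
    {"degree": 0, "age": 30},
    {"degree": 1, "age": 26},
]
age_estimate, age_samples_used = rwrw_mean(samples, "age")
```

Returning the number of usable samples alongside the estimate makes it easy to report a separate count per attribute, as done for the age samples in the results.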
Questions
Check the code in:
● http://code.google.com/p/lastfm-rwrw/