Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks
Cong Ding, Yang Chen*, and Xiaoming Fu
University of Göttingen, *Duke University
Significance of social network data crawling
•Understanding user behaviors
•Improving SNS architectures
•Handling privacy/security issues
•and so on...
Current data collection methods (1)
•ISP-based measurement [Schneider IMC’09]
Only ISP companies can do that
Current data collection methods (2)
•Cooperate with SNS companies [Yang IMC’11]
Most research groups do not have this opportunity
Current data collection methods (3)
•Crawling by a single group, which then shares the data with others [Gjoka INFOCOM’10]
Suffers from request rate limiting
Shortcomings of crawling by a single group
•Wastes computing and network resources
•Introduces overhead for service providers (and may lead to stricter rate limiting)
•Lack of ground truth for the research community
A new thought
Why not collect data collaboratively?
System overview
Coordinator
Crawlers
System design
•Fetching UIDs (BFS, etc.)
•Handling crawling failure (timeout)
•Bypassing request rate limiting (a large pool of IP addresses)
•Data fidelity (redundant crawling)
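The four design points above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Coordinator` class, its method names, the `timeout` default, and the `majority` helper are all hypothetical. It assumes each crawler runs from its own IP address (which is what spreads requests across rate limits), that a crawler reports the friend UIDs it found for a user, and that redundant reports for the same user are reconciled by a strict majority vote.

```python
import time
from collections import Counter, deque


class Coordinator:
    """Hypothetical coordinator: hands out UIDs in BFS order and collects results."""

    def __init__(self, seeds, timeout=60.0, clock=time.monotonic):
        self.frontier = deque(seeds)  # BFS queue of UIDs waiting for a crawler
        self.seen = set(seeds)        # every UID that has ever been queued
        self.pending = {}             # uid -> (crawler_id, deadline)
        self.done = {}                # uid -> tuple of friend UIDs
        self.timeout = timeout        # seconds before a task is reassigned
        self.clock = clock            # injectable so the timeout can be tested

    def next_task(self, crawler_id):
        """Return the next UID for this crawler, or None if nothing is queued."""
        self._requeue_expired()
        if not self.frontier:
            return None
        uid = self.frontier.popleft()
        self.pending[uid] = (crawler_id, self.clock() + self.timeout)
        return uid

    def submit(self, crawler_id, uid, friends):
        """Store a result and queue newly discovered UIDs; reject stale reports."""
        owner = self.pending.get(uid)
        if owner is None or owner[0] != crawler_id:
            return False  # task was reassigned or never handed to this crawler
        del self.pending[uid]
        self.done[uid] = tuple(friends)
        for friend in friends:
            if friend not in self.seen:
                self.seen.add(friend)
                self.frontier.append(friend)
        return True

    def _requeue_expired(self):
        """Handle crawling failure: put timed-out UIDs back at the queue head."""
        now = self.clock()
        for uid, (_, deadline) in list(self.pending.items()):
            if now >= deadline:
                del self.pending[uid]
                self.frontier.appendleft(uid)


def majority(reports):
    """Data fidelity: return the value a strict majority of crawlers agree on.

    `reports` is a list of hashable values (for example frozensets of friend
    UIDs) gathered by crawling the same user redundantly. Returns None when
    no value is reported by more than half of the crawlers.
    """
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count * 2 > len(reports) else None
```

A crawler would loop on `next_task`, fetch the user's data from the social network, and call `submit`. If a crawler never reports back, the next call to `next_task` after the deadline hands the same UID to another crawler.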
Implementation
•A proof-of-concept prototype (without the data fidelity part) for crawling Weibo
•472 PlanetLab servers as crawlers
Evaluation
•In 24 hours, we crawled the data of 2.22M Weibo users, including user profiles, all posts, and all social connections
•Comparison:
•Fu et al. (PLOS ONE 2013) collected 30K users’ data in 6 days
•Guo et al. (PAM 2013) collected 1M users’ data in 1 month

| | Crowd Crawling | Fu et al. | Guo et al. |
|---|---|---|---|
| #UIDs/day | 2.22M | 5K | 33K |
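The per-day figures in the comparison follow from simple division, which the short sketch below reproduces. The function and variable names are illustrative. Two assumptions are not stated on the slide: "1 month" is taken as 30 days, and the per-crawler figure assumes all 472 PlanetLab servers took part in the 24-hour run with the load spread evenly.

```python
def per_day(users, days):
    """Average number of users crawled per day."""
    return users / days


crowd = per_day(2_220_000, 1)   # Crowd Crawling: 2.22M users in 24 hours
fu = per_day(30_000, 6)         # Fu et al.: 30K users in 6 days
guo = per_day(1_000_000, 30)    # Guo et al.: 1M users in 1 month (assumed 30 days)

speedup_vs_fu = crowd / fu      # about 444x
speedup_vs_guo = crowd / guo    # about 66.6x

# Assumption: all 472 crawlers were active and evenly loaded.
per_crawler = crowd / 472       # about 4.7K users per crawler per day

print(f"Fu et al.: {fu:,.0f}/day, Guo et al.: {guo:,.0f}/day")
print(f"Speedup: {speedup_vs_fu:.0f}x and {speedup_vs_guo:.1f}x")
print(f"Per crawler: {per_crawler:,.0f}/day")
```

Under those assumptions, each crawler handles roughly as many users per day as the single-group crawl of Fu et al., which suggests the gain comes from the number of participating crawlers rather than from a faster individual crawler.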
Conclusion and Discussion
•Data sharing may violate some providers’ terms of service
  o Twitter does not allow data sharing (even for research)
  o Weibo allows data sharing among researchers
•Unrestricted data sharing might raise ethical issues
  o The data should be anonymized
•We will publish the data crawled in the evaluation