DSSN Tweet Corpus

Robert Rößling

University of Leipzig

2. February 2015

Robert Rößling (University of Leipzig) DSSN Tweet Corpus 2. February 2015 1 / 12

1 Task

2 Decision

3 Current state

4 Functioning

5 Demonstration

6 Current Problems

7 Solutions, Improvements and Discussion

Task

Creating a text corpus

  - Data should be real
  - Data should contain meta information
  - It should be possible to assign one user a certain set of data

Should also contain social media specific attributes

  - Usernames
  - ID
  - Friends
  - Followers

Decision

Using the Twitter REST API [1]

Get real data from there

Create User objects for each user

Create Tweet objects for each tweet

Give each User and Tweet a unique ID (preferably the same one as on Twitter)

Each User object initializes one replay agent

Possibility to create an output file via the native etree and the BeautifulSoup4 [2] library

[1] https://dev.twitter.com/rest/public
[2] http://crummy.com/software/BeautifulSoup

Current State: DSSNTweetCorpus [4]

Creates a text corpus of real data

Data comes from Twitter

Access through the Twitter REST API via the Python Twitter API tool [3]

Programming language is Python

Every user and every tweet is its own object

Friend requests have a unique ID

An output file can be created

[3] http://mike.verdone.ca/twitter/
[4] https://github.com/DSSN-Practical/DSSNTweetCorpus

Modules

Handler:

  - Contains the main method
  - When started, prompts for the screen name of the initial user
  - Then prompts for the number of iterations

Corpus:

  - Contains the array of all users
  - Contains an additional array of all user IDs
  - Holds the authorization for the Twitter API

Friender:

  - Because tweets do not include a proper "friend request" tweet, this module is a workaround attempt
  - It generates a random subset of users for each user and sets them as "friends"
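The Friender workaround could look roughly like the sketch below. The function name `assign_random_friends`, the `max_friends` parameter and the optional seeded `random.Random` are assumptions made for illustration; they are not taken from the DSSNTweetCorpus code. The sketch also leaves each user out of their own candidate pool, which the module as described does not necessarily do.

```python
import random


def assign_random_friends(user_ids, max_friends, rng=None):
    """Map every user ID to a random subset of the other user IDs.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = rng or random.Random()
    friends = {}
    for uid in user_ids:
        # Leave the user itself out so nobody "befriends" themselves.
        candidates = [other for other in user_ids if other != uid]
        if not candidates:
            friends[uid] = []
            continue
        # Pick how many friends to draw, then sample without repetition.
        count = rng.randint(0, min(max_friends, len(candidates)))
        friends[uid] = rng.sample(candidates, count)
    return friends


if __name__ == "__main__":
    demo = assign_random_friends([101, 102, 103, 104], max_friends=2,
                                 rng=random.Random(42))
    for uid, friend_ids in sorted(demo.items()):
        print(uid, friend_ids)
```

Passing a seeded generator makes the generated "friendships" reproducible between runs, which can help when comparing corpora.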

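class User(object):
    """A Twitter user together with its collected tweets and friends."""

    def __init__(self):
        self.uid = 0
        self.screen_name = ''
        self.name = ''
        self.createdAt = ''
        self.description = ''
        self.nrFriends = 0
        self.nrFollowers = 0
        # Per-instance lists; class-level lists would be shared by all users
        self.tweets = []
        self.friends = []
        # only used for GET timeline
        self.protected = False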
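As a rough illustration of how this metadata might be filled in, the sketch below copies fields from a user dictionary in the shape returned by the Twitter REST API v1.1 (`id`, `screen_name`, `name`, `created_at`, `description`, `friends_count`, `followers_count`, `protected`). The helper `user_from_api` and the sample values are hypothetical, and the `User` class is repeated in compact form only to keep the example self-contained.

```python
class User(object):
    """Compact stand-in for the User class from the slide."""

    def __init__(self):
        self.uid = 0
        self.screen_name = ''
        self.name = ''
        self.createdAt = ''
        self.description = ''
        self.nrFriends = 0
        self.nrFollowers = 0
        self.tweets = []
        self.friends = []
        self.protected = False


def user_from_api(data):
    """Build a User from an API-style user dictionary (hypothetical helper)."""
    user = User()
    user.uid = data["id"]
    user.screen_name = data["screen_name"]
    user.name = data.get("name", "")
    user.createdAt = data.get("created_at", "")
    # The description may be missing or None; store an empty string then.
    user.description = data.get("description") or ""
    user.nrFriends = data.get("friends_count", 0)
    user.nrFollowers = data.get("followers_count", 0)
    user.protected = bool(data.get("protected", False))
    return user


if __name__ == "__main__":
    # Made-up sample data, not a real account.
    sample = {"id": 12345, "screen_name": "example_user",
              "friends_count": 10, "followers_count": 200}
    user = user_from_api(sample)
    print(user.uid, user.screen_name, user.nrFollowers)
```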

class Tweet(object):
    """A single tweet and the metadata kept for it."""

    def __init__(self):
        self.tid = 0
        self.text = ''
        self.createdAt = ''
        self.isReply = False
        self.replyTo = None
        self.retweeted = False
        self.isFriendRequest = False
        self.FriendRequestToId = 0
        # causes errors for now
        self.hashtags = []
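One possible way to populate such an object from a status dictionary of the Twitter REST API v1.1 is sketched below. The keys `id`, `text`, `created_at`, `in_reply_to_status_id`, `retweeted` and `entities.hashtags[].text` exist in that API; how the original project maps them onto its fields is an assumption here, and `tweet_from_api` is a hypothetical helper. The friend-request fields are left at their defaults.

```python
class Tweet(object):
    """Compact stand-in for the Tweet class from the slide."""

    def __init__(self):
        self.tid = 0
        self.text = ''
        self.createdAt = ''
        self.isReply = False
        self.replyTo = None
        self.retweeted = False
        self.hashtags = []


def tweet_from_api(status):
    """Build a Tweet from an API-style status dictionary (hypothetical helper)."""
    tweet = Tweet()
    tweet.tid = status["id"]
    tweet.text = status.get("text", "")
    tweet.createdAt = status.get("created_at", "")
    # A reply carries the ID of the status it answers; otherwise the key is None.
    tweet.replyTo = status.get("in_reply_to_status_id")
    tweet.isReply = tweet.replyTo is not None
    tweet.retweeted = bool(status.get("retweeted", False))
    # Hashtags are nested under "entities"; tolerate statuses without them.
    entities = status.get("entities") or {}
    tweet.hashtags = [tag["text"] for tag in entities.get("hashtags", [])]
    return tweet


if __name__ == "__main__":
    # Made-up sample status.
    status = {"id": 1, "text": "Hello #dssn", "in_reply_to_status_id": None,
              "entities": {"hashtags": [{"text": "dssn"}]}}
    tweet = tweet_from_api(status)
    print(tweet.tid, tweet.isReply, tweet.hashtags)
```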

Handler methods

startHandling

  - General method that starts handling all incoming data
  - Only this method needs to be called

addUsers

  - Looks up the last 200 followers of a user and extracts their metadata
  - Creates a User object and appends it to the corpus array

addUserTweets

  - Looks up the timeline of a given user
  - Creates a Tweet object for every tweet

createUserEntry and createOutputFile

  - Create an HTML/XML-based string
  - Save it as an .xml file with all the user metadata and tweets
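The last two methods could be sketched with the standard library's `xml.etree.ElementTree` as follows. The element names (`corpus`, `user`, `tweet`, ...), the snake_case function names and the demo data are assumptions for illustration; the actual output schema of DSSNTweetCorpus is not given on the slides.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from types import SimpleNamespace


def create_user_entry(user):
    """Return a <user> element holding the user's metadata and tweets."""
    entry = ET.Element("user", {"id": str(user.uid)})
    ET.SubElement(entry, "screen_name").text = user.screen_name
    ET.SubElement(entry, "followers").text = str(user.nrFollowers)
    tweets = ET.SubElement(entry, "tweets")
    for tweet in user.tweets:
        node = ET.SubElement(tweets, "tweet", {"id": str(tweet.tid)})
        # ElementTree escapes characters such as & and < when serializing.
        node.text = tweet.text
    return entry


def create_output_file(users, path):
    """Write all user entries below one <corpus> root element."""
    root = ET.Element("corpus")
    for user in users:
        root.append(create_user_entry(user))
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Made-up demo objects standing in for User and Tweet instances.
    tweet = SimpleNamespace(tid=1, text="Hello & welcome")
    user = SimpleNamespace(uid=42, screen_name="example_user",
                           nrFollowers=3, tweets=[tweet])
    path = os.path.join(tempfile.mkdtemp(), "corpus.xml")
    create_output_file([user], path)
    print(open(path, encoding="utf-8").read())
```

Building the whole tree before writing keeps every entry in memory at once, so for a large corpus this approach would share the RAM concern raised for the current implementation.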

Demonstration

Current Problems

Friend request data is not "real"

A user can send a friend request to themselves

Creating the output file results in high RAM usage

Creating the output file on Windows results in a crash

The halt method slows down the computation drastically

Solutions, Improvements and Discussion

Use the lxml [5] library for high-performance XML parsing

Return a User object to the replay agent directly for easy handling

Generate a more "real" subset from each user's actual number of friends

Parallel halt method:

  - Assumption: Friends = Followers, since only friend requests are used
  - While the rate limit for GET followers/list is exceeded, use GET friends/list
  - While both rate limits are exceeded, already start reading out the user timelines
  - Problem: the number of users grows faster than the follower/friend requests, which grow only linearly. This results in several thousand user timeline requests for only 20+ users, so the waiting time for the user timelines can hardly be reduced

[5] http://lxml.de/
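The endpoint selection behind the parallel halt idea might be sketched as below. `RateWindow` and `next_endpoint` are hypothetical names, and the limits in the demo (15, 15 and 180 calls per 900-second window) are assumed values for illustration; the real limits should be taken from the API documentation. Instead of sleeping as soon as one limit is hit, the scheduler falls through to the next endpoint that still has budget.

```python
import time


class RateWindow(object):
    """Tracks how many calls to one endpoint remain in the current window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def _roll(self):
        # Start a fresh window once the old one has elapsed.
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0

    def available(self):
        self._roll()
        return self.used < self.limit

    def consume(self):
        if not self.available():
            raise RuntimeError("rate limit exceeded")
        self.used += 1

    def seconds_until_reset(self):
        self._roll()
        return max(0.0, self.window_start + self.window_seconds - self.clock())


def next_endpoint(windows, order):
    """Return the first endpoint in `order` with budget left, else None."""
    for name in order:
        if windows[name].available():
            return name
    return None


if __name__ == "__main__":
    # Assumed limits per 15-minute window, for illustration only.
    windows = {
        "followers/list": RateWindow(15, 900),
        "friends/list": RateWindow(15, 900),
        "statuses/user_timeline": RateWindow(180, 900),
    }
    order = ["followers/list", "friends/list", "statuses/user_timeline"]
    for _ in range(32):
        name = next_endpoint(windows, order)
        windows[name].consume()
    print("last endpoint used:", name)
```

Only when `next_endpoint` returns `None` does the caller need to wait, for example for the smallest `seconds_until_reset()` among the windows. This does not remove the bottleneck named on the slide: the timeline budget is still consumed one request per user.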
