Mining the Social Web
Asmelash Teka Hadgu (teka@L3S.de)
L3S Research Center
April 24, 2018
Outline
● Introduction: The big picture
● Deep Learning
● Selected Papers
Definition from Web Science Conference1
Web Science is the emergent science of the people, organizations, applications, and policies that shape and are shaped by the Web.
Studying the online world to understand the offline world
What is Web Science?
http://websci16.org
Web 2.0 describes World Wide Web sites that emphasize user-generated content, usability, and interoperability.
The social web is a set of social relations that link people through the World Wide Web.
The Social Web
https://makeawebsitehub.com/social-media-sites
Problem: How can we automatically construct user profiles?
User Identification
Authoritative user extraction - discovering expert users for a target topic.
Personalized web search - personalized retrieval of social media posts.
User recommendation - suggesting new interesting users to a target user.
User identification is also a preprocessing step for many interesting downstream analyses.
User Identification Applications
Challenges:
Labels - generating labelled data to train models
Feature engineering - what are good features?
Modeling - mostly off-the-shelf predictive models
Machine Learning for User Identification [1,2,3]
Feature engineering is key in a machine learning task.
● Profile features
● Tweeting behaviour features
● Linguistic content features
● Social network features
Feature Engineering
Baseline
Bag-of-words (BOW) model with weighting (e.g., TF-IDF)
Each user is represented as a bag of the words in their tweets.
These words are then used as features to train a classifier.
Feature Engineering
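As a rough sketch, this baseline could be implemented with scikit-learn as follows (the toy user documents, labels, and class names are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One document per user: all of the user's tweets concatenated.
user_docs = ["just pushed a fix to my parser repo #nlp",
             "match day!! what a goal #football"]
labels = ["researcher", "non-researcher"]  # hypothetical classes

# TF-IDF-weighted bag-of-words features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(user_docs, labels)
print(clf.predict(["reading a paper on topic models"]))
```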
Ideally, profile information should be sufficient. In reality, it does not contain enough quality information to be used directly for user classification.
- Length of name, number of alphanumeric characters
- Capitalization forms in the user name
- Use of an avatar picture
- Number of followers and friends
- Regular expression matches in the bio, e.g.:
(I |i)(m|am| m) [0-9]+ (yo|year old) (man|woman|boy|girl)
Profile Features
A set of statistics capturing the way users interact with the micro-blogging service.
- Number of tweets of a user
- Number and fraction of retweets of a user
- Average number of hashtags and URLs per tweet
- Average time between tweets and its standard deviation
Tweeting Behaviour Features
Linguistic content contains the user’s lexical usage and the main topics of interest to the user.
- List of socio-linguistic words such as emoticons, ellipses, etc.
- Prototypical words, hashtags, etc. Given n classes, each class c_i containing seed users S_i, each word w from seed users is assigned a score:

score(w, c_i) = |w, S_i| / Σ_{j=1..n} |w, S_j|    (1)

where |w, S_i| is the number of times the word w is issued by all users of class c_i.
- Latent Dirichlet Allocation (LDA) to generate topics used as features for classification
- Sentiment words
- Word embeddings
Linguistic Content Features
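A minimal sketch of this scoring, under the normalized-frequency reading of Eq. (1) above (the toy tweets and class names are made up):

```python
from collections import Counter

def proto_scores(class_tweets):
    """class_tweets: dict mapping class name -> list of tweets from seed users.
    Returns class name -> {word: score}, where a word's score is its frequency
    in that class divided by its frequency across all classes."""
    counts = {c: Counter(w for t in tweets for w in t.lower().split())
              for c, tweets in class_tweets.items()}
    total = Counter()
    for c in counts:
        total.update(counts[c])
    return {c: {w: counts[c][w] / total[w] for w in counts[c]} for c in counts}

scores = proto_scores({"researcher": ["new paper accepted", "paper deadline"],
                       "other": ["new phone", "deadline at work"]})
print(scores["researcher"]["paper"])  # 1.0: 'paper' appears only in this class
```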
These features capture the social connections between a user and the accounts they follow, reply to, or retweet.
Prototypical 'friend' accounts are generated by exploring the social network of users in the training set, bootstrapping as in Eq. (1).
- Number of prototypical friends, percentage of prototypical friends
- Prototypical replied-to users, prototypical retweeted users
Social Network Features
Manually building ground-truth data and validating results is not easy.
● Combine multiple data sources, e.g., scraping websites that contain users in each class.
● Use key terms in user profiles. This is prone to systematic bias, and care should be taken with features.
● Use crowdsourcing services such as Amazon Mechanical Turk1 and CrowdFlower2.
Ground-Truth and Validation
1. https://www.mturk.com
2. http://www.crowdflower.com/
Identifying Computer Science researchers on Twitter [3]
A Complete Pipeline
● Current State-of-the-art methods for classification are based on deep learning.
● The key difference is representation learning - instead of hand-crafting features, layers of neural networks learn representations (features) from raw inputs.
● Bag of words -> words in semantic spaces through word embeddings, e.g., word2vec.
Deep Learning
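As a sketch of moving from bags of words to embeddings (assuming gensim >= 4; the toy corpus is made up):

```python
from gensim.models import Word2Vec

# Each tweet is a list of tokens; a real corpus would be millions of tweets.
tweets = [["deep", "learning", "for", "user", "profiles"],
          ["word", "embeddings", "replace", "bag", "of", "words"]]

model = Word2Vec(sentences=tweets, vector_size=50, window=3, min_count=1)
print(model.wv["learning"].shape)  # (50,): a dense vector instead of sparse BOW
```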
Intro Deep Learning
Applications:
Face Recognition
Object Detection
Neural Style Transfer
From Feedforward to Convolutional Neural Networks
1. Deep Learning on large images
Consider an RGB image of 1000x1000 pixels: a fully connected layer with 1000 hidden units already needs 3 billion parameters.
- Need lots of data to prevent overfitting
- Computational/memory requirements are huge
2. Preserve the spatial relationships of input images.
The Big Picture (What’s inside the box?)
[Figure: image -> Convolutional Neural Network -> "cat"]
Why Convolutions?
Main advantages of convolutional layers over fully connected layers:
● Sparse interactions
● Parameter sharing
● Translation invariance
Hierarchy of Feature Representations
Earlier layers of deep neural networks detect low level features such as edges and later layers detect high level features.
Convolution Operation:
● Apply a dot product of a filter over the image.
● Convolutional networks are neural networks that use convolution in at least one of their layers.
Image (5x5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Filter/kernel (3x3):
1 0 1
0 1 0
1 0 1
Animation Source: https://www.coursera.org/learn/convolutional-neural-networks
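As a sketch, the valid convolution of the 5x5 image above with the 3x3 filter can be computed directly with NumPy (using the cross-correlation form common in deep learning, i.e., without flipping the kernel):

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

n, f = image.shape[0], kernel.shape[0]
out = np.zeros((n - f + 1, n - f + 1))  # valid convolution: 3x3 output
for i in range(n - f + 1):
    for j in range(n - f + 1):
        out[i, j] = (image[i:i + f, j:j + f] * kernel).sum()
print(out)  # [[4. 3. 4.]
            #  [2. 4. 3.]
            #  [2. 3. 4.]]
```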
Padding
Problems with the basic Convolution operation:
- Shrinking output
- Throwing away information from the edges
E.g., an n x n image convolved with an f x f filter becomes (n - f + 1) x (n - f + 1).
To solve these problems we can pad the image before applying the convolution.
Given: an n x n input convolved with an f x f filter with padding p gives (n + 2p - f + 1) x (n + 2p - f + 1).
Valid and Same Convolutions
Valid convolution: convolution with no padding.
Input image: n x n becomes (n - f + 1) x (n - f + 1)
Same convolution: pad so that the output size is the same as the input size.
Input image: n x n; output feature map: n x n
n + 2p - f + 1 = n => p = (f - 1)/2
Common filter sizes for Convolution: 1x1, 3x3, 5x5, 7x7
Strided Convolution
Strided Convolution is another important basic building block.
Instead of shifting the filter by one, we jump by a stride of s steps.
Given an n x n image convolved with an f x f filter with padding p and stride s, the output size is floor((n + 2p - f)/s) + 1 by floor((n + 2p - f)/s) + 1.
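These size rules are easy to sanity-check with a small helper (a sketch; the function name is my own):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Spatial output size for an n x n input, f x f filter, padding p, stride s."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # valid: 6 - 3 + 1 = 4
print(conv_output_size(6, 3, p=1))       # same: p = (f-1)/2 keeps size 6
print(conv_output_size(7, 3, p=1, s=2))  # strided: floor((7 + 2 - 3)/2) + 1 = 4
```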
Convolutions over Volumes
● Allow us to operate directly on multi-dimensional input e.g., RGB images.
● Multiple filters stacked together allow us to detect multiple features.
● The resulting output is a volume, on which we can in turn perform convolutions to form a convolutional network.
Video Source: https://www.coursera.org/learn/convolutional-neural-networks
Convolution Computation [Dimensions]
Input: 5x5x3 image
2 filters 3x3 with Stride = 2 and padding = 1
Output: (5 + 2·1 - 3)/2 + 1 = 3, i.e., 3x3x2
We add a bias and apply a nonlinearity, e.g., ReLU, to obtain the output and complete one step of a convolution layer.
Video Source: http://cs231n.github.io/convolutional-networks/
One Convolutional Layer [Parameters]
Consider the image on the right is 64x64x3. If you apply 10 filters that are 3x3 in the first CONV layer of neural network, how many parameters does that layer have?
CONV
3x3x3 = 27 params + 1 bias => 28 params per filter; 28 * 10 filters => 280 params
Note: no matter how big the input image is, the number of parameters stays the same. This makes ConvNets less prone to overfitting.
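This count can be verified directly (a sketch assuming PyTorch is available):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 10*(3*3*3) + 10 biases = 280
```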
Types of Layers in a Convolutional Neural Network
- Convolution (CONV)
- Pooling (POOL)
- Fully Connected (FC)
A typical pattern: CONV -> POOL -> CONV -> POOL -> CONV -> FC -> FC
Pooling Layer
- Reduce the size of the representation
- Speed up computation
- Make features more robust
No parameters to learn!
Hyperparameters: filter size, stride, max or average pooling; mostly no padding.
Pooling operation on a 4x4 input, using a 2x2 filter with stride 2:

3 4 0 1
2 0 2 0
5 3 2 1
1 0 6 7

Max pooling: take the max of the values in each window:
4 2
5 7

Similarly, average pooling: take the average of the values in each window:
2.25 0.75
2.25 4
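A minimal NumPy sketch reproducing both poolings on this input:

```python
import numpy as np

x = np.array([[3, 4, 0, 1],
              [2, 0, 2, 0],
              [5, 3, 2, 1],
              [1, 0, 6, 7]], dtype=float)

# Split into 2x2 windows with stride 2, then reduce each window.
blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)
print(blocks.max(axis=(2, 3)))   # [[4. 2.]  [5. 7.]]
print(blocks.mean(axis=(2, 3)))  # [[2.25 0.75]  [2.25 4.  ]]
```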
ConvNet Example
Input: 64x64x3
CONV (f=5x5, s=1, #f=8, same padding) -> 64x64x8
MAXPOOL (f=2x2, s=2) -> 32x32x8
CONV (f=5x5, s=1, #f=16) -> 28x28x16
MAXPOOL (f=2x2, s=2) -> 14x14x16
CONV (f=3x3, s=1, #f=32) -> 12x12x32
FC -> FC
This shows how the activation dimensions change through the network.
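A sketch of this stack in PyTorch (the padding values are inferred from the activation sizes above; the FC sizes and 10-way output are hypothetical, since the slide does not give them):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, stride=1, padding=2),  # 64x64x3 -> 64x64x8
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                            # -> 32x32x8
    nn.Conv2d(8, 16, kernel_size=5, stride=1),            # -> 28x28x16
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),                            # -> 14x14x16
    nn.Conv2d(16, 32, kernel_size=3, stride=1),           # -> 12x12x32
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(12 * 12 * 32, 128),                         # FC (hypothetical size)
    nn.Linear(128, 10),                                   # FC (hypothetical classes)
)
print(model(torch.zeros(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```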
Why Convolutions?
Main advantages of convolutional layers over fully connected layers:
● Sparse interactions: In each layer, each output value depends only on a small number of inputs.
● Parameter sharing: a feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.
● Translation invariance: An image shifted a few pixels should result in similar features and should be assigned the same output label.
Training ConvNets
Training examples: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))
Cost: J = (1/m) Σ_i L(ŷ^(i), y^(i))
Maximum likelihood provides a guide for how to design a good cost function.
Use gradient descent to optimize the parameters to reduce J.
Summary Part-One
We have seen basic building blocks
● Convolution operation, padding, and strided convolution
● Convolution layer
● Pooling layer
How to put these things together?
Case Studies: Deep ConvNet Architectures
- Start with what's already used, or modify it for your own purpose
- Gain intuition from successful architectures
- An architecture that works for one task often works well for other tasks
ImageNet Classification
1.2 million high-resolution images classified into 1000 different classes.
LeNet-5
Architecture: CONV-POOL-CONV-POOL-CONV-FC-FC
● Relatively small network, ~60k parameters
● Notice the pattern of one or more CONVs followed by a POOL, and finally FCs
● As you go deeper in the network, height and width decrease (32 => 28 => 14 => 10 => 5) while the number of channels increases (1 => 6 => 16)
LeCun et al., 1998, Gradient-Based Learning Applied to Document Recognition
Input: 32x32 grayscale image
VALID CONV: 6 filters 5x5, stride 1
AVG POOL: 2x2, stride 2
VALID CONV: 16 filters 5x5, stride 1
AVG POOL: 2x2, stride 2
FC: 120 nodes
FC: 84 nodes
AlexNet
● Similar to LeNet, but much bigger: ~60M parameters instead of ~60k
● Used the ReLU activation function
Krizhevsky et al., 2012, ImageNet Classification with Deep Convolutional Neural Networks
Input: 227x227x3
SAME CONV: 96 filters 11x11, s=4
MAX-POOL: 3x3, s=2
SAME CONV: 256 filters 5x5
MAX-POOL: 3x3, s=2
SAME CONV: 384 filters 3x3
SAME CONV: 384 filters 3x3
SAME CONV: 256 filters 3x3
MAX-POOL: 3x3, s=2
FC: 4096
FC: 4096
Softmax: 1000
VGG-16
● CONV = 3x3 filter, s=1, same padding
● MAX-POOL = 2x2, s=2
● From 8 to 16 layers
Input: 224x224x3
[CONV 64] x 2
POOL
[CONV 128] x 2
POOL
[CONV 256] x 3
POOL
[CONV 512] x 3
POOL
[CONV 512] x 3
POOL
FC: 4096
FC: 4096
Softmax: 1000
Simonyan and Zisserman, 2015, Very Deep Convolutional Networks for Large-Scale Image Recognition
Notable for simple and uniform architecture.
ResNet
He et al., 2015, Deep residual learning for image recognition
Very deep networks are difficult to train because of vanishing and exploding gradients.
Skip connection: allows taking the activation from one layer and feeding it directly to a layer much deeper in the network.
[Figure: plain connection vs. residual block]
ResNet
ResNets are formed by stacking residual blocks together to form a deep network.
ResNets do well because it is easy for a residual block to learn the identity function.
[Figure: plain network vs. residual network]
He et al., 2015, Deep residual learning for image recognition
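A minimal residual block sketch in PyTorch (an illustrative identity-shortcut block, not the exact block from the paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """out = ReLU(F(x) + x): the shortcut makes the identity easy to learn."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection

x = torch.randn(1, 16, 14, 14)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 14, 14])
```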
ResNet
● Able to train very deep networks without performance degrading (152 layers on ImageNet, 1202 on CIFAR)
● Swept 1st place in all ILSVRC and COCO 2015 competitions
He et al., 2015, Deep residual learning for image recognition
GoogLeNet
When designing a layer for a ConvNet you might ask:
1x1 CONV? 3x3 CONV? 5x5 CONV? MAX POOL?
The Inception network says: why not use them all!
Szegedy et al. 2014, Going deeper with convolutions
GoogLeNet
Szegedy et al. 2014, Going deeper with convolutions
- 22 layers
- Efficient 'Inception' module
- No FC layers
- Only 5 million params, 12x fewer than AlexNet
- ILSVRC'14 classification winner (6.7% top-5 error)
Summary Part Two
Use Cases and Architectural Choices
Classic Examples:
LeNet-5, AlexNet, VGG
Recent Advances:
ResNet, GoogLeNet (Inception)
References
Convolutional Neural Networks On Coursera
https://www.coursera.org/specializations/deep-learning
CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.stanford.edu/
Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville http://www.deeplearningbook.org/
Applications:
Speech Recognition
Machine Translation
Named Entity Recognition
Image Captioning
Why Not a standard feed-forward Network?
● Inputs and outputs can be of different lengths in different examples
● A feed-forward network doesn't share features learned across time steps
● Sharing parameters across time also greatly reduces their number
Why Not CNNs?
CNN - specialized for processing a grid of values, such as an image
RNN - specialized for processing a sequence of values x^(1), …, x^(τ)
CNN - can readily scale to images with large width and height
RNN - can scale to much longer sequences
Types of RNNs (RNN Architectures)
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
One-to-one: image classification
One-to-many: music generation, image captioning
Many-to-one: sentiment classification
Many-to-many: named entity recognition, video activity recognition, machine translation
What are Recurrent Neural Networks (RNNs)?
x_t - the input at time step t.
s_t - the hidden state at time step t, the "memory" of the network. It is calculated from the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t-1}), where f is usually a nonlinearity such as tanh or ReLU.
o_t - the output at step t.
Training the RNN (learning the parameters U, V, W): backpropagation with the Backpropagation Through Time (BPTT) algorithm.
Standard Recurrent Neural Networks
The repeating module in a standard RNN contains a single neural network layer.
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
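A single forward pass of this recurrence, as a NumPy sketch (the dimensions are toy choices):

```python
import numpy as np

hidden, n_in = 4, 3
U = np.random.randn(hidden, n_in)     # input-to-hidden weights
W = np.random.randn(hidden, hidden)   # hidden-to-hidden weights
V = np.random.randn(2, hidden)        # hidden-to-output weights (2 classes)

s = np.zeros(hidden)                  # initial hidden state
for x_t in np.random.randn(5, n_in):  # a sequence of 5 input vectors
    s = np.tanh(U @ x_t + W @ s)      # s_t = f(U x_t + W s_{t-1})

o = np.exp(V @ s); o /= o.sum()       # softmax output at the last step
print(o)
```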
LSTMs
LSTMs are a special type of RNN capable of learning long-term dependencies.
Selected Papers
Gender Shades
● Computer vision is being utilized in high-stakes sectors such as health-care and law enforcement.
● Machine learning algorithms can discriminate based on classes like race and gender.
● An approach to evaluate bias present in automated facial analysis algorithms and datasets.
Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. 2018.
Gender Shades
● Members of parliament from Africa: Rwanda, Senegal, South Africa and Europe: Iceland, Finland, Sweden
● Public figures with known identities and photos available under non-restrictive licenses
● Gender labeling based on name, gendered title, prefixes such as Mr. or Ms., and the appearance of the photo
Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. 2018.
Pilot Parliaments Benchmark (PPB) dataset
Gender Shades
● In current benchmark datasets, IJB-A and Adience, the majority of subjects (79.6% for IJB-A and 86.2% for Adience) are lighter-skinned
● Evaluation of commercial gender classification:
○ Microsoft - Microsoft's Cognitive Services Face API
○ IBM - IBM's Watson Visual Recognition API
○ Face++ - publicly available demonstration of Face++ from China
Darker-skinned females are the most misclassified group (with error rates of up to 34.7%).
Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. 2018.
CheXNet
● Deep learning is being applied to help with improved medical diagnoses, disease analyses, and pharmaceutical development.
● Chest X-rays are the most common imaging examination tool for diagnosing Pneumonia and other diseases.
● Automated detection of diseases from chest X-rays at the level of expert radiologists would help in clinical settings, and is also important for delivering health care to populations with inadequate access to diagnostic imaging specialists.
Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).
CheXNet
● An algorithm that detects pneumonia from chest X-rays at a level exceeding that of professional radiologists.
● CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, a publicly available chest X-ray dataset.
● It uses dense connections and batch normalization to make the optimization of such a deep network tractable.
Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).
CheXNet
● Heatmaps visualize the areas of the image most indicative of the disease using class activation maps (CAMs).
● Let f_k be the k-th feature map and w_{c,k} the weight in the final classification layer for feature map k leading to pathology c; compute the map M_c = Σ_k w_{c,k} f_k.
● The heatmap is generated by upscaling M_c to the dimensions of the image and overlaying it.
Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).
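As a sketch of the map computation (the shapes are toy assumptions):

```python
import numpy as np

K, H, W = 1024, 7, 7                  # feature maps from the last conv layer
fmaps = np.random.rand(K, H, W)       # f_k
w = np.random.rand(K)                 # w_{c,k} for one pathology c

M_c = np.tensordot(w, fmaps, axes=1)  # M_c = sum_k w_{c,k} * f_k
print(M_c.shape)                      # (7, 7); upscaled to image size for overlay
```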
DeepMoji
● NLP tasks are often limited by scarcity of manually annotated data.
● Researchers typically use emoticons and specific hashtags as forms of distant supervision.
● Learn emotion, sentiment, and sarcasm detection with a single model, through emoji prediction on a dataset of 1,246 million tweets containing one of 64 common emojis.
Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).
Top five most likely emojis for each text
DeepMoji
Pretraining on the classification task of predicting which emoji was initially part of a text.
● English tweets without URLs
● Word tokens; repeated characters shortened
● A single tweet used for pretraining per unique emoji type, regardless of the number of emojis in the tweet
● The dataset is balanced to avoid over-representation of the most frequent emojis
Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).
[Figure: DeepMoji model; S is the text length and C the number of classes]
DeepMoji
Transfer learning: fine-tuning the pretrained model on a target task.
- Using the model as a feature extractor, with all layers frozen
- Using the model as an initialization, with the full model unfrozen
- Sequentially unfreezing and fine-tuning a single layer at a time ("chain-thaw")
Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).
Chain-thaw transfer learning approach
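A sketch of the chain-thaw idea in PyTorch (train_one_epoch is a hypothetical placeholder for one training pass; the procedure in the paper differs in details):

```python
import torch.nn as nn

def chain_thaw(model: nn.Sequential, train_one_epoch):
    # Fine-tune one layer at a time, then the full model.
    for layer in model.children():
        for p in model.parameters():
            p.requires_grad = False   # freeze everything
        for p in layer.parameters():
            p.requires_grad = True    # thaw only the current layer
        train_one_epoch(model)        # hypothetical training step
    for p in model.parameters():      # finally fine-tune the whole model
        p.requires_grad = True
    train_one_epoch(model)
```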
References
1. Pennacchiotti, Marco, and Ana-Maria Popescu. "Democrats, republicans and starbucks afficionados: user classification in twitter." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
2. Rao, Delip, et al. "Classifying latent user attributes in twitter." Proceedings of the 2nd international workshop on Search and mining user-generated contents. ACM, 2010.
3. Hadgu, Asmelash Teka, and Robert Jäschke. "Identifying and analyzing researchers on twitter." Proceedings of the 2014 ACM conference on Web science. ACM, 2014.
4. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
5. Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).
6. Felbo, Bjarke, et al. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm." arXiv preprint arXiv:1708.00524 (2017).
7. Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. 2018.