Isabelle Augenstein, Andreas Vlachos, Kalina Bontcheva [email protected], {a.vlachos | k.bontcheva}@sheffield.ac.uk
USFD at SemEval-2016 Task 6: Any-Target Stance Detection on Twitter with Autoencoders
Stance Detection (Subtask B)
Classify the attitude of a tweet towards a target as “favor”, “against”, or “none”.
Tweet: “No more Hillary Clinton”
Target: Donald Trump
Stance: FAVOR
Subtask A training targets: Climate Change is a Real Concern, Feminist Movement, Atheism, Legalization of Abortion, Hillary Clinton
Subtask B testing target: Donald Trump
Challenges
• Labelled data is not available for the test target
• Manual labelling of training data is not allowed
• The target does not always appear in the tweet
Feature Extraction
• Aut-twe: auto-encoded tweet, 100d feature vector
• targetInTweet (inTwe): whether the (shortened) target is contained in the tweet; a good indicator of non-neutral stance
• Other features tested (not used for final run): WordNet-Affect gazetteers, emoticon detection
• Baselines: bag of words, word2vec (trained on the same data as the autoencoder)
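The targetInTweet feature can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the shortened target strings in `SHORT_TARGETS` are assumptions, since the poster does not list the exact strings used.

```python
# Hypothetical shortened target forms; the poster does not specify them.
SHORT_TARGETS = {"Donald Trump": "trump", "Hillary Clinton": "hillary"}


def target_in_tweet(tweet: str, short_target: str) -> int:
    """Return 1 if the shortened target occurs in the tweet (case-insensitive), else 0."""
    return int(short_target.lower() in tweet.lower())


def build_features(aut_twe, tweet, target):
    """Append the binary inTwe flag to an autoencoder feature vector (Aut-twe+inTwe)."""
    return list(aut_twe) + [target_in_tweet(tweet, SHORT_TARGETS[target])]
```

For the poster's example, “No more Hillary Clinton” with target Donald Trump yields an inTwe value of 0, since the target is not mentioned.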
Results
[Figure: two bar charts, “Model Comparison (Hillary Clinton, dev)” and “Model Comparison (Donald Trump, test)”. Y-axis: Macro F1 (0 to 0.45). Models: BoW, BoW+inTwe, Word2Vec, Aut-twe, Aut-twe+inTwe.]
Conclusions
• It is important to detect if the target is mentioned in the tweet
• Hillary Clinton: 0.4538 F1 (inTwe) vs 0.3243 F1 (not inTwe)
• Donald Trump: 0.3745 F1 (inTwe) vs 0.2377 F1 (not inTwe)
• An autoencoder can help to detect stance towards unseen targets
• Developing a method for new targets without labelled training data is challenging: there are discrepancies between what works on the dev set and on the test set
• Future work: better incorporate the target for stance detection
Acknowledgements
This work was partially supported by the European Union, grant agreement No. 611233 PHEME (http://www.pheme.eu)
Data
• 5 628 labelled train tweets about Subtask A targets
• 1 278 about Hillary Clinton, used for dev
• 278 013 unlabelled Donald Trump tweets
• 395 212 collected unlabelled tweets about all targets
• Keywords: hillary, clinton, trump, climate, femini, aborti
• 707 Donald Trump test tweets
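The truncated keywords (“femini”, “aborti”) suggest prefix or substring matching, so that one keyword covers “feminism”, “feminist”, “abortion” and so on. A minimal sketch of such a filter follows, assuming case-insensitive substring matching; the poster does not state how the keywords were actually applied, and `matches_keywords` is a hypothetical name.

```python
# Keywords as listed on the poster.
KEYWORDS = ("hillary", "clinton", "trump", "climate", "femini", "aborti")


def matches_keywords(tweet: str, keywords=KEYWORDS) -> bool:
    """Return True if any keyword occurs as a substring of the lower-cased tweet."""
    text = tweet.lower()
    return any(keyword in text for keyword in keywords)
```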
Preprocessing
• Phrase detection: train a phrase detection model on unlabelled+labelled tweets, e.g. “donald”, “trump” → “donald trump”
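Phrase detection in the style of Mikolov et al. (2013) scores each bigram as (count(a b) − δ) / (count(a) × count(b)) and merges bigrams whose score exceeds a threshold. The sketch below is a simplified pure-Python illustration; the `delta` and `threshold` values and the function names are assumptions, not the settings used for the submission.

```python
from collections import Counter


def learn_phrases(tokenised_tweets, delta=1.0, threshold=0.1):
    """Collect bigrams whose score (count(ab) - delta) / (count(a) * count(b)) exceeds threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenised_tweets:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_phrases(tokens, phrases):
    """Join detected bigrams into single tokens, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

The discount δ keeps bigrams that occur only once from being merged, which matters on noisy tweet data.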
Autoencoder
• Bag-of-words autoencoder, using the 50 000 most frequent words
• Trained on unlabelled+labelled tweets
• Input vector: dimensionality 50 000. For each word in the vocabulary, does the tweet contain the word or not
• One hidden layer (size 100), output size 100
• The trained encoder is applied to labelled train and test data to obtain 100d features; the decoder is not used
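The input encoding and the encoder pass can be sketched as follows. This only illustrates the shapes involved: training is omitted, the weights from `random_encoder` are random placeholders rather than a trained model, and the sigmoid activation is an assumption the poster does not confirm. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def build_vocab(tokenised_tweets, max_size=50_000):
    """Map the max_size most frequent tokens to vector indices."""
    counts = Counter(tok for tokens in tokenised_tweets for tok in tokens)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def bow_vector(tokens, vocab):
    """Binary bag-of-words vector: 1 if the tweet contains the word, else 0."""
    vec = [0] * len(vocab)
    for tok in set(tokens):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


def random_encoder(input_size, hidden_size=100, seed=0):
    """Placeholder (untrained) encoder weights, for illustration only."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_size)]
               for _ in range(hidden_size)]
    return weights, [0.0] * hidden_size


def encode(vec, weights, biases):
    """Map a bag-of-words vector to hidden-layer features (sigmoid assumed)."""
    features = []
    for row, bias in zip(weights, biases):
        z = bias + sum(w * x for w, x in zip(row, vec))
        features.append(1.0 / (1.0 + math.exp(-z)))
    return features
```

With trained weights in place of the placeholders, the output of `encode` would correspond to the 100d Aut-twe features.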
Model                      Macro F1
Majority class (official)  0.2972
SVM n-grams (official)     0.2843
BoW                        0.3453
Aut-twe (submitted)        0.3307
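For reference, a minimal sketch of how a macro F1 score of this kind can be computed. It assumes the SemEval-2016 Task 6 convention of averaging F1 over the FAVOR and AGAINST classes only; the poster itself labels the metric just “MacroF1”, and the function names are hypothetical.

```python
def f1_for_label(gold, pred, label):
    """F1 for one class, computed from true positives, false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=("FAVOR", "AGAINST")):
    """Unweighted mean of per-class F1 over the given labels (NONE excluded by assumption)."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)
```

Under this convention, NONE predictions still affect the score indirectly, because they change the false positives and false negatives of the other two classes.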
References
• Code: https://github.com/sheffieldnlp/stance-semeval2016
• Phrases: Mikolov et al. (2013). Distributed Representations of Words and Phrases and Their Compositionality. NIPS.
Tweets
(“No more Hillary Clinton”, “Donald Trump”, “FAVOR”)
Preprocessing
[“No”, “more”, “Hillary_Clinton”]
Autoencoder Training
50 000d input: [america: 0, …, Hillary_Clinton: 1]
100d hidden layer: [0, 0, …, 1]
100d output layer: [0, 1, …, 1]
Feature Extraction
Autoencoder: [0, 1, …, 1]
inTwe: 0
Logistic Regression Model
Predictions
(“#voteTrump (…)”, “Donald Trump”, “FAVOR”)
(“youre fired (…)”, “Donald Trump”, “AGAINST”)