![Page 1: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/1.jpg)
Authorship Attribution Using Probabilistic Context-Free Grammars
Sindhu Raghavan, Adriana Kovashka, Raymond Mooney
The University of Texas at Austin
![Page 2: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/2.jpg)
Authorship Attribution
• Task of identifying the author of a document
• Applications
– Forensics (Luyckx and Daelemans, 2008)
– Cyber crime investigation (Zheng et al., 2009)
– Automatic plagiarism detection (Stamatatos, 2009)
– The Federalist papers study (Mosteller and Wallace, 1984)
  – The Federalist papers are a set of essays on the US Constitution
  – Authorship of these papers was unknown at the time of publication
  – Statistical analysis was used to find the authors of these documents
![Page 3: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/3.jpg)
Existing Approaches
• Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008)
• Character-level n-grams (Peng et al., 2003)
• Syntactic features from parse trees (Baayen et al., 1996)
• Limitations
– Capture mostly lexical information
– Do not necessarily capture the author’s syntactic style
![Page 4: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/4.jpg)
Our Approach
• Using probabilistic context-free grammar (PCFG) to capture the syntactic style of the author
• Construct a PCFG based on the documents written by the author and use it as a language model for classification
– Requires annotated parse trees of the documents
How do we obtain these annotated parse trees?
![Page 5: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/5.jpg)
Algorithm – Step 1
Treebank each document using a statistical parser trained on a generic corpus
– Stanford parser (Klein and Manning, 2003)
– WSJ or Brown corpus from Penn Treebank (http://www.cis.upenn.edu/~treebank)
[Figure: training documents for each author (Alice, Bob, Mary, John)]
![Page 6: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/6.jpg)
Algorithm – Step 2
Train a PCFG for each author using the treebanked documents from Step 1

Probabilistic Context-Free Grammars

| Rule | Alice | Bob | Mary | John |
|---|---|---|---|---|
| S → NP VP | .8 | .7 | .9 | .5 |
| S → VP | .2 | .3 | .1 | .5 |
| NP → Det A N | .4 | .6 | .3 | .8 |
| NP → NP PP | .35 | .25 | .5 | .1 |
| NP → PropN | .25 | .15 | .2 | .1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
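
One common way to train a PCFG from treebanked sentences is relative-frequency estimation: count every production in the author's parse trees and divide by the count of its left-hand side. The slides do not state which estimator was used, so the following is only a minimal sketch under that assumption. The nested-tuple tree format, the function names `productions` and `train_pcfg`, and the toy trees are all hypothetical.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, *children = tree
    # A child is either a subtree (tuple) or a word (str).
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

def train_pcfg(trees):
    """Relative frequency: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(p for t in trees for p in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical treebanked sentences for one author (not from the data sets).
alice_trees = [
    ("S", ("NP", ("PropN", "Alice")), ("VP", ("V", "writes"))),
    ("S", ("NP", ("PropN", "Bob")), ("VP", ("V", "reads"))),
    ("S", ("VP", ("V", "stop"))),
]
alice_pcfg = train_pcfg(alice_trees)
print(alice_pcfg[("S", ("NP", "VP"))])  # two of the three S nodes expand to NP VP
```

Running the same function once per author yields one grammar per author, which is the input Step 3 needs.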
![Page 9: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/9.jpg)
Algorithm – Step 3
Test document, parsed with each author's PCFG from Step 2

| Author | Score |
|---|---|
| Alice | .6 |
| Bob | .5 |
| Mary | .33 |
| John | .75 |

Multiply the probability of the top parse for each sentence in the test document
Label for the test document: the author whose PCFG gives the highest score (John, .75)
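
The scoring rule on this slide can be sketched as follows. The sketch assumes the top-parse probability of each test sentence under each author's PCFG has already been computed by a parser; the numbers below are made up for illustration, and `document_log_score` and `attribute` are hypothetical names. Summing logarithms instead of multiplying raw probabilities is a design choice of this sketch (it avoids numerical underflow on long documents) and gives the same ranking as the product.

```python
import math

def document_log_score(top_parse_probs):
    """Sum of log top-parse probabilities, i.e. the log of their product."""
    return sum(math.log(p) for p in top_parse_probs)

def attribute(per_author_probs):
    """Return (label, scores): the author whose PCFG scores the document highest."""
    scores = {a: document_log_score(ps) for a, ps in per_author_probs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical top-parse probabilities for a three-sentence test document,
# one list per author's PCFG.
per_author_probs = {
    "Alice": [1e-4, 2e-5, 3e-4],
    "Bob": [5e-5, 1e-5, 1e-4],
    "Mary": [2e-5, 1e-5, 5e-5],
    "John": [3e-4, 4e-5, 6e-4],
}
label, scores = attribute(per_author_probs)
print(label)
```

A sentence that a grammar cannot parse at all would have probability zero and break the logarithm, which is one reason smoothing matters when an author has little training data.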
![Page 10: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/10.jpg)
Experimental Evaluation
![Page 11: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/11.jpg)
Data

| Data set | Type | # Authors | Approx # Words/author | Approx # Sentences/author |
|---|---|---|---|---|
| Football | News articles | 3 | 14374 | 786 |
| Business | News articles | 6 | 11215 | 543 |
| Travel | News articles | 4 | 23765 | 1086 |
| Cricket | News articles | 4 | 23357 | 1189 |
| Poetry | Literary works | 6 | 7261 | 329 |

Data sets available at www.cs.utexas.edu/users/sindhu/acl2010
![Page 12: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/12.jpg)
Methodology
• Bag-of-words model (baseline)
– Naïve Bayes, MaxEnt
• N-gram models (baseline)
– N = 1, 2, 3
• Basic PCFG model
• PCFG-I (Interpolation)
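
As an illustration of the N-gram baseline with N = 2, the sketch below trains one bigram language model per author and labels a test document with the author whose model gives it the highest log-probability. The slides do not say how the N-gram models were smoothed; add-one smoothing is assumed here purely to keep the example short, and the training sentences and function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram counts, context counts, and the vocabulary for one author."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def log_prob(model, sentences):
    """Add-one smoothed log-probability of the sentences under one model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen words
    total = 0.0
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total

# Hypothetical tokenized training sentences per author.
training = {
    "Alice": [["the", "match", "was", "dull"], ["the", "match", "was", "long"]],
    "Bob": [["markets", "fell", "sharply"], ["markets", "rose", "again"]],
}
models = {a: train_bigram(s) for a, s in training.items()}
test_doc = [["the", "match", "was", "long"]]
label = max(models, key=lambda a: log_prob(models[a], test_doc))
print(label)
```

The PCFG models use the same decision rule (pick the highest-scoring author) but score a document by its parse probabilities rather than its word sequences.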
![Page 14: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/14.jpg)
Basic PCFG
• Train PCFG based only on the documents written by the author
• Poor performance when few documents are available for training
– Increase the number of documents in the training set
– Forensics - Do not always have access to a number of documents written by the same author
– Need for alternate techniques when few documents are available for training
![Page 15: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/15.jpg)
PCFG-I
• Uses the method of interpolation for smoothing
• Augment the training data by adding sections of WSJ/Brown corpus
• Up-sample data for the author
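
The slide names the ingredients of PCFG-I but not the exact recipe, so the sketch below is one plausible reading rather than the authors' implementation: repeat the author's rule counts `upsample` times, add rule counts from a generic corpus, and re-estimate by relative frequency. The function name, the `upsample` parameter, and all counts are hypothetical.

```python
from collections import Counter

def interpolated_pcfg(author_counts, generic_counts, upsample=3):
    """Relative-frequency PCFG over up-sampled author counts plus generic counts."""
    combined = Counter()
    for rule, n in author_counts.items():
        combined[rule] += upsample * n  # author data repeated `upsample` times
    for rule, n in generic_counts.items():
        combined[rule] += n
    lhs_totals = Counter()
    for (lhs, _), n in combined.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in combined.items()}

# Hypothetical rule counts, keyed by (lhs, rhs).
author_counts = Counter({("S", ("NP", "VP")): 8, ("S", ("VP",)): 2})
generic_counts = Counter({
    ("S", ("NP", "VP")): 60,
    ("S", ("VP",)): 10,
    ("S", ("S", "CC", "S")): 30,
})
pcfg_i = interpolated_pcfg(author_counts, generic_counts, upsample=3)
for rule, p in sorted(pcfg_i.items()):
    print(rule, round(p, 3))
```

For each left-hand side this is a linear interpolation of the author's estimate and the generic estimate, with the author's weight equal to the up-sampled author count divided by the combined count. A rule the author never used, such as S → S CC S here, still receives non-zero probability, and a larger `upsample` pulls the estimates back toward the author's own style.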
![Page 16: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/16.jpg)
Results
![Page 17: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/17.jpg)
Performance of Baseline Models
Inconsistent performance for baseline models – the same model does not necessarily perform poorly on all data sets
![Page 18: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/18.jpg)
Performance of PCFG and PCFG-I
PCFG-I performs better than the basic PCFG model on most data sets
![Page 19: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/19.jpg)
PCFG Models vs. Baseline Models
Best PCFG model outperforms the worst baseline for all data sets, but does not outperform the best baseline for all data sets
![Page 20: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/20.jpg)
PCFG-E
• PCFG models do not always outperform N-gram models
• Lexical features from N-gram models useful for distinguishing between authors
• PCFG-E (Ensemble)
– PCFG-I (best PCFG model)
– Bigram model (best N-gram model)
– MaxEnt based bag-of-words (discriminative classifier)
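
The slide lists the three members of the ensemble but not how their outputs are combined. The sketch below assumes one simple scheme: normalize each model's per-author scores so they sum to 1, then take a weighted average (equal weights by default). The combination rule, the function names, and every score shown are assumptions made for illustration.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

def ensemble_label(model_scores, weights=None):
    """Weighted average of normalized per-model scores; returns (label, combined)."""
    weights = weights or {m: 1.0 for m in model_scores}
    weight_sum = sum(weights.values())
    combined = {}
    for model, scores in model_scores.items():
        for author, s in normalize(scores).items():
            share = weights[model] * s / weight_sum
            combined[author] = combined.get(author, 0.0) + share
    return max(combined, key=combined.get), combined

# Hypothetical per-author confidence scores from the three component models.
model_scores = {
    "PCFG-I": {"Alice": 0.6, "Bob": 0.5, "Mary": 0.33, "John": 0.75},
    "Bigram": {"Alice": 0.4, "Bob": 0.3, "Mary": 0.2, "John": 0.1},
    "MaxEnt": {"Alice": 0.5, "Bob": 0.2, "Mary": 0.2, "John": 0.1},
}
label, combined = ensemble_label(model_scores)
print(label)
```

In this toy case PCFG-I alone would pick John, but the two lexical models both favor Alice strongly enough to change the ensemble's label, which shows how syntactic and lexical evidence can be traded off.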
![Page 21: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/21.jpg)
Performance of PCFG-E
PCFG-E outperforms or matches the best baseline on all data sets
![Page 22: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/22.jpg)
Significance of PCFG
Drop in performance on removing PCFG-I from PCFG-E (PCFG-E – PCFG-I) on most data sets
![Page 23: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/23.jpg)
Conclusions
• PCFGs are useful for capturing the author’s syntactic style
• Novel approach for authorship attribution using PCFGs
• Both syntactic and lexical information are necessary to capture the author’s writing style
![Page 24: Authorship Attribution Using Probabilistic Context-Free Grammars](https://reader036.vdocument.in/reader036/viewer/2022062309/568137de550346895d9f80e1/html5/thumbnails/24.jpg)
Thank You