Patrick Juola
Duquesne University
www.jgaap.com
Authorship Attribution and Stylometry (lecture 5)
Some Housekeeping
• I’m having trouble with network connectivity to Duquesne
• Watch www.mathcs.duq.edu/~juola
• Watch www.jgaap.com
• Will be posting new developments as they occur
• (Will also post NG corpus as requested.)
ESSLLI material
• The Personae corpus is freely available
• BUT the one we’ve developed is not
• If you’re willing to have your essays and information published, contact me
  • juola@mathcs.duq.edu
• I will collate and publish via the web
JGAAP material
• JGAAP is freeware; use and enjoy
• New developments to JGAAP are always welcome, subject to licensure (i.e. GPL).
• Wiki at www.jgaap.com is open for
  • Feature requests
  • Bug reports
  • Comments
  • New developers
Interest in a volume?
• Depending upon public interest (i.e. yours), should we pursue the idea of an edited collection of JGAAP-related papers?
• There are a lot of publishers at this summer school
• Contact me if you’re interested
So, now what?
• JGAAP seems to work, but needs more development
• More corpora (and more specialist corpora) are needed
• But if you have an authorship problem to solve NOW…
Top/bottom methods
• Sorry, still having network troubles 8-(
• Best canonicizers: unify case, normalize whitespace
  • Stripping punctuation hinders
• Best events: word bigrams
  • Worst: word lengths
• Best analysis: KL-distance, cosine distance
  • Worst: LZW
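As a rough illustration of how the three stages on this slide fit together, the sketch below chains a canonicizer (unify case, normalize whitespace, keep punctuation), a word-bigram event set, and cosine distance for nearest-author attribution. It is not JGAAP code: the function names (`canonicize`, `word_bigrams`, `cosine_distance`, `attribute`) and the sample texts are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Unify case and normalize whitespace; punctuation is deliberately kept."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Count adjacent word pairs in the canonicized text."""
    words = canonicize(text).split(" ")
    return Counter(zip(words, words[1:]))


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two event-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no events to compare: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def attribute(unknown, known):
    """Return the candidate author whose sample is nearest to the unknown text."""
    target = word_bigrams(unknown)
    return min(
        known,
        key=lambda author: cosine_distance(target, word_bigrams(known[author])),
    )


# Hypothetical toy samples, far too short for real attribution.
known_samples = {
    "A": "the cat sat on the mat. the cat sat on the rug.",
    "B": "stocks fell sharply as markets closed lower today",
}
guess = attribute("The  cat sat on the sofa.", known_samples)
```

Real documents would need much longer samples; the toy texts only show the data flow from canonicizer to events to distance.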
But....
• (Show spreadsheet, stupid!)
Testing transference
• 8 AAAC problems are “English”
• 5 are “foreign” (French [x2], Dutch, Latin, Serbian/Slavonic)
• Does the English score reflect the “foreign” score?
  • If so, we have evidence that best practices in English are also best practices in a novel language.
• N.b. evidence is not proof!
2008/9 AAAC data
• 281 different analyses, generally better than AAAC submissions.
• Correlation: r = 0.6680 (cf. 0.594)
• Significance: p < 0.0001 (cf. 0.05)
• Coefficient of determination (r²)
  • 45% of variation explained by algorithm performance alone (rather than other factors)
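The 45% figure follows from squaring the reported correlation. The sketch below shows that arithmetic, plus a plain Pearson correlation over paired per-method scores. The `english` and `foreign` lists are invented placeholders, not the AAAC results, and `pearson_r` is a helper name assumed for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of five methods (NOT the AAAC data).
english = [0.50, 0.62, 0.71, 0.80, 0.45]
foreign = [0.40, 0.55, 0.60, 0.75, 0.50]
r = pearson_r(english, foreign)
r_squared = r ** 2  # share of variance in one score explained by the other

# The figure reported on the slide: 0.6680 squared is about 0.446, i.e. roughly 45%.
reported_r = 0.6680
reported_r_squared = reported_r ** 2
```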
Transference
• Best practices transfer: a best practice in one environment is likely to be a “good” practice in another
  • Turn it around: Do we really expect something terrible in English to magically improve in Polish?
• Caveat: No predictions about “absolute” error rates
• Caveat (2): Assumes language agnosticism
Some other findings
• OCR errors do not materially impact accuracy (Noecker, et al.)
• Asymmetry is a significant factor in distance-based attribution methods (Ryan and Juola)
• Algorithm performance dominates language or data size effects (Juola)
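One way to see why asymmetry can matter: a divergence such as KL gives a different value depending on which document is treated as the reference. The sketch below only illustrates that property and is not a reconstruction of the cited study. The add-alpha smoothing, the `kl_divergence` name, and the toy word counts are all assumptions of this example.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """D(P || Q) over the union vocabulary, with add-alpha smoothing (an assumption)."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    total = 0.0
    for event in vocab:
        p = (p_counts.get(event, 0) + alpha) / p_total
        q = (q_counts.get(event, 0) + alpha) / q_total
        total += p * math.log(p / q)
    return total


# Hypothetical word counts for two tiny "documents".
p = Counter("the cat sat on the mat the cat".split())
q = Counter("the dog ran".split())

forward = kl_divergence(p, q)   # unknown text measured against the known sample
backward = kl_divergence(q, p)  # known sample measured against the unknown text
```

Here `forward` and `backward` differ, so which text plays the role of the reference is a real design decision for any distance-based method built on such a measure.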
Other findings (2)
• Cosine distance on large numbers of words outperforms higher-overhead methods on fewer words (Noecker & Juola)
• Characters trump words for Chinese at current word segmentation technology (Zhao & Juola)
• Mosteller-Wallace’s function words are overtuned (in preparation)
Best practices for now
• “Mixture of experts” improves accuracy
• Run multiple analyses, mixing event types (character and word n-grams)
• Cosine distance and KL-distance work well on large event sets
• SVM works well on small event sets
• Current leader : KL-distance (max) on word bigrams
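A minimal sketch of the "mixture of experts" idea, assuming the simplest possible combination rule: each expert (one event type paired with cosine distance) names its nearest author, and the majority wins. The voting rule, the function names, and the toy samples are assumptions of this example, not a description of how JGAAP combines analyses.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts after unifying case and normalizing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Word n-gram counts after unifying case."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two event-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def nearest_author(unknown, known, extract):
    """One expert: nearest known sample under a single event type."""
    target = extract(unknown)
    return min(known, key=lambda a: cosine_distance(target, extract(known[a])))


def mixture_of_experts(unknown, known, extractors):
    """Majority vote over experts; ties go to the author voted for first."""
    votes = Counter(nearest_author(unknown, known, ex) for ex in extractors)
    return votes.most_common(1)[0][0]


# Mixed event types: character 3- and 4-grams, word unigrams and bigrams.
EXPERTS = [
    lambda t: char_ngrams(t, 3),
    lambda t: char_ngrams(t, 4),
    lambda t: word_ngrams(t, 1),
    lambda t: word_ngrams(t, 2),
]

# Hypothetical toy samples.
known_samples = {
    "A": "the cat sat on the mat. the cat sat on the rug.",
    "B": "stocks fell sharply as markets closed lower today",
}
winner = mixture_of_experts("the cat sat on the sofa", known_samples, EXPERTS)
```

Other combination rules (weighted votes, averaging ranks) would slot into `mixture_of_experts` without changing the individual experts.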
Future extensions
• AAAC corpus too small to distinguish among 20,000 methods (testing continuing, though)
• Add more methods to JGAAP, hopefully solicited from community
• Continue to develop/publish “best practices”
• Merci
• Arigato
• Спасибо
• Danke
• Gracias
• Teşekkür ederim
• Dank U
Tak!