ainl 2016: kuznetsova

Antiplagiat Research Rita Kuznetsova, Oleg Bakhteev, Alexey Romanov 12.11.2016 AINL FRUCT’16 1 / 29


Antiplagiat Research

Rita Kuznetsova, Oleg Bakhteev, Alexey Romanov

12.11.2016 AINL FRUCT’16 1 / 29

Outline

Intro

Cross-Language Plagiarism Detection

Machine-Generated Text Detection

Intrinsic Plagiarism Detection

Collaboration


What’s Anti-Plagiat JSC

Anti-Plagiat System
• Detects text reuse in any language and for any popular file type
• Discovers cheating

Few numbers
• Over 500 universities
• 140 M sources in search databases
• 25 M texts checked per year


What’s Antiplagiat Research?

Antiplagiat Research tackles the most challenging problems in the area of natural language processing and plagiarism detection.

• Development of advanced technology
• Propagation of scientific thought
• Bringing together young talent from leading institutions
— Moscow Phystech (MIPT)
— Computing Centre of RAS
— Moscow State University


History of the Project

• Oct ’14: launch of the project by Antiplagiat JSC
• Aug ’15: first conference participation
• Nov ’15: comprehensive study on machine-generated text detection in real-world data
• Apr ’16: PAN 2016 participation (Top-1 in 2 tracks of Author Diarization task)
• Jul ’16: development of cross-language plagiarism detection tool powered by state-of-the-art techniques

. . . and great growth opportunities


Areas of Interest

• Cross-Language Plagiarism
• Paraphrase Detection
• Machine-Generated Text Detection
• Automatic Text Categorization
• Intelligent Search and Topic Search
• Author Diarization
• Smart Evaluation of Research Papers


Problems in Focus


Types of Text Reuse

Text reuse can be classified into several categories:
• copying text “as is”
• text reuse with paraphrasing
— Mr. Dursley always sat with his back to the window in his office on the ninth floor.
— Mr. Dursley always propped his back on the glass window on the ninth floor of the office.
• cross-language plagiarism
— A cat was sitting on the table.
— На столе сидела кошка.
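To make the three categories concrete, here is a minimal sketch that scores the example pairs from this slide with Jaccard similarity over word bigrams. It is not the Antiplagiat system's method: the function names `shingles` and `jaccard` and the choice of bigrams are assumptions made purely for illustration.

```python
import re


def shingles(text, n=2):
    """Return the set of lowercase word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = ("Mr. Dursley always sat with his back to the window "
            "in his office on the ninth floor.")
paraphrase = ("Mr. Dursley always propped his back on the glass window "
              "on the ninth floor of the office.")
english = "A cat was sitting on the table."
russian = "На столе сидела кошка."

copy_score = jaccard(original, original)          # verbatim copy
paraphrase_score = jaccard(original, paraphrase)  # partial overlap
cross_score = jaccard(english, russian)           # no shared surface form

print(copy_score, paraphrase_score, cross_score)
```

A verbatim copy scores 1, the paraphrase keeps only part of the overlap, and the translated pair shares no bigrams at all, which is why cross-language reuse cannot be found by surface matching alone.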


Cross-Language Plagiarism Problem

The problem has ancient origins and still remains topical...


Cross-Language Plagiarism Problem

Problem
• A large proportion of texts contain fragments reused from another language.
• Cross-lingual textual similarity for language pairs that include Russian is poorly studied.
• Most methods that involve a machine translation stage generate texts that differ too much from the sources of plagiarism.

Our goal
Develop a method for cross-lingual (Russian and English) text reuse detection that is based on the monolingual approach.


Cross-Language Plagiarism Detection Tool

• Explicit Semantic Analysis for Cross-Language Retrieval in Case of Russian-English Translation — RuSSIR 2015
• A Monolingual Approach to Detection of Text Reuse in Russian-English Collection — AINL-ISMW FRUCT 2015
• Candidate Document Retrieval for Cross-Lingual Plagiarism Detection — IDP 2016


Cross-Language Plagiarism Detection - main stages

• Given: an English document collection and a suspicious Russian document
• The first stage:
— Find candidate documents in the collection that possibly contain text reused from the suspicious document.
— Rank these documents according to their relevance values.
• The second stage:
— Split the suspicious document and the candidate documents into segments.
— Compare the segments with each other.
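The two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool described in the talk: it assumes the suspicious document has already been brought into English by some means (the slides do not say how), and it uses bag-of-words cosine similarity, sentence-level segments, and the names `retrieve_candidates` and `compare_segments` as hypothetical placeholders.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_candidates(suspicious, collection, top_k=2):
    """Stage 1: rank collection documents by relevance to the suspicious one."""
    query = Counter(tokens(suspicious))
    scored = [(cosine(query, Counter(tokens(doc))), doc_id)
              for doc_id, doc in collection.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


def split_segments(text):
    """Naive segmenter: split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def compare_segments(suspicious, candidate, threshold=0.6):
    """Stage 2: return segment pairs whose similarity reaches the threshold."""
    matches = []
    for s in split_segments(suspicious):
        for c in split_segments(candidate):
            sim = cosine(Counter(tokens(s)), Counter(tokens(c)))
            if sim >= threshold:
                matches.append((s, c, sim))
    return matches


# Toy English collection (hypothetical data).
collection = {
    "doc_cats": "A cat was sitting on the table. It watched the birds outside.",
    "doc_trains": "The train leaves the station at noon. Tickets are sold online.",
}
# Suspicious document, assumed to be already mapped from Russian to English.
suspicious_en = "Yesterday it rained. A cat was sitting on the table."

candidates = retrieve_candidates(suspicious_en, collection)
matches = compare_segments(suspicious_en, collection[candidates[0]])
print(candidates, matches)
```

Splitting the work this way keeps the expensive pairwise segment comparison confined to a short ranked list instead of the whole collection.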


Machine-Generated Text Detection Problem

• The problem is not new: tools for paper generation have been available for 10 years already
• Past research on generated papers discovered a hundred of them in IEEE, Elsevier, Springer journals (2009 and later)

Task
Distinguish machine-generated papers from authentic documents automatically.

Key assumption
Most papers are generated with one of several popular tools.


Machine-Generated Text Detection Problem

Today you can write a paper on a given topic with one click!

SCIgen - An Automatic CS Paper Generator


Mathgen: Randomly generated math papers


Machine-Generated Text Detection in Real-World Data

Automatic detection of gibberish papers should:
• deal with big data (millions of papers in real-world collections),
• be applicable to the Russian language,
• capture texts prepared with various generation tools,
• also detect machine-translated text chunks containing grammatical errors.

Our findings:
• A Study of the eLIBRARY.RU Collection for Artificial and Non-Scientific Texts (in Russian) — SCIENCE ONLINE 2016


eLIBRARY.RU

• Search the eLIBRARY.RU collection of scientific papers for machine-generated and non-scientific papers
• Classification task
— Machine-generated vs. human-written texts
— Scientific papers vs. fiction texts
• Text features: syntactic and lexical
• Results
— We didn’t find any machine-generated texts like «Korchevatel» in the collection of eLIBRARY.RU
— We found: anniversary congratulations, business news, interviews, bibliographies, memorials, etc.


“Fly, pie, to the oven”. Non-scientific paper in a scientific journal on baking bread


Machine-Translated Text Detection

• Recent advances in the field of statistical machine translation (SMT) have led to high availability of SMT systems on the Web.
• Student reports, term papers and theses lack proper analysis by their tutors.
• It is very tempting to find relevant information in English, automatically translate it into Russian, and paste it into the paper “as is”!
• Machine-translated texts often contain grammatical errors or inappropriate words:
— First individuals in the system take the maximum number of contacts for any parameter combination.
— Первые лица в системе взять максимальное количество контактов для любой комбинации параметров. (Machine-translated Russian output; the verb «взять» is left in the infinitive.)


Solution design for MT detection

• Let’s estimate the likelihood that a sentence is machine-translated, according to several language models (LMs). . .
— Lexical 2,3-gram LMs trained on authentic texts
— Lexical 2,3-gram LMs trained on machine-translated texts
— POS tag 2,3-gram LMs trained on authentic texts
— POS tag 2,3-gram LMs trained on machine-translated texts
— word2vec (skip-gram and CBOW) models trained on authentic texts
• . . . and use these estimates as features for the classification task. 2 * 4 + 2 = 10 features in total
• The classifier is trained on a mixed labeled sample of authentic and machine-translated sentences.

Our findings:
• Machine-Translated Text Detection in a Collection of Russian Scientific Papers — Dialogue 2016
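The n-gram part of this feature scheme can be sketched as below. The sketch covers only the 2 * 4 = 8 n-gram features (2-gram and 3-gram orders for each of the four LM types); the two word2vec features and the classifier itself are left out. Add-one smoothing, the class name `NgramLM`, the helper names, and the hand-written toy sentences with their POS tags are all assumptions for illustration, not details taken from the talk.

```python
from collections import Counter
from math import log


class NgramLM:
    """Add-one smoothed n-gram language model over token sequences."""

    def __init__(self, n, sequences):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for seq in sequences:
            padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
            self.vocab.update(padded)
            for i in range(n - 1, len(padded)):
                context = tuple(padded[i - n + 1:i])
                self.ngrams[context + (padded[i],)] += 1
                self.contexts[context] += 1

    def avg_logprob(self, seq):
        """Average per-token log-probability of seq under the model."""
        padded = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen tokens
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            count = self.ngrams[context + (padded[i],)] + 1
            total += log(count / (self.contexts[context] + v))
        return total / (len(padded) - self.n + 1)


def build_models(authentic, translated):
    """Train 2- and 3-gram LMs per (layer, corpus) pair: 2 * 4 = 8 models."""
    models = []
    for layer in ("words", "pos"):
        for corpus in (authentic, translated):
            for n in (2, 3):
                sequences = [sentence[layer] for sentence in corpus]
                models.append((layer, NgramLM(n, sequences)))
    return models


def lm_features(sentence, models):
    """One feature per model: the sentence's average log-probability."""
    return [lm.avg_logprob(sentence[layer]) for layer, lm in models]


# Toy training data with hand-written (hypothetical) POS tags.
authentic = [{
    "words": "the first users take the maximum number of contacts".split(),
    "pos": "DET ADJ NOUN VERB DET ADJ NOUN ADP NOUN".split(),
}]
translated = [{
    "words": "first persons in system to take maximum quantity of contacts".split(),
    "pos": "ADJ NOUN ADP NOUN PART VERB ADJ NOUN ADP NOUN".split(),
}]

models = build_models(authentic, translated)
features = lm_features(authentic[0], models)
print(len(features), features)
```

Each sentence becomes a fixed-length vector of LM scores, so any standard classifier can be trained on labeled authentic and machine-translated sentences; a sentence that scores higher under the "authentic" models than under the "machine-translated" ones pulls the decision toward authentic.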


Intrinsic Plagiarism Detection Problem

IPD Task
Detecting the plagiarized parts of a given document by analyzing the writing style.

Main Challenges
• No external collection
• No further possibilities to uncover plagiarism besides detecting suspicious text parts which significantly differ from the rest of the document
• Even if suspicious text parts are found, there is still no guarantee that these parts are truly plagiarized


PAN @ CLEF 2016

PAN: Uncovering Plagiarism, Authorship and Social Software Misuse

• Held since 2007
• Offers:
— Large-scale corpora for external (EPD) and intrinsic (IPD) plagiarism detection algorithms
— Performance measure scheme


PAN Tasks

1. Intrinsic plagiarism detection.
1.1 There exists one main author who wrote at least 70% of the text.
1.2 Up to the other 30% may be written by other authors.
2. Diarization with a given number (n) of authors.
2.1 There are n authors, no main author.
2.2 Each author may have contributed to an arbitrary extent.
3. Diarization with an unknown number of authors.
3.1 No information about how many authors contributed to the document.


Solving the Problem

The common scheme involves several stages:
• text segmentation (sentences, blocks, paragraphs etc.),
• mapping each segment to the feature space,
• outlier detection (or clustering for author diarization).

• Methods for Intrinsic Plagiarism Detection and Author Diarization — Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Evora, Portugal, September 2016. CEUR-WS.org. ISSN 1613-0073.
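The three stages of the scheme can be illustrated with a small sketch. It is not the PAN 2016 submission: paragraph-level segments, the two stylometric features (average word length and average sentence length), the z-score rule, and the function names are assumptions chosen to keep the example short.

```python
import re
from statistics import mean, pstdev


def segment(text):
    """Stage 1: split the document into paragraph segments on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def style_features(seg):
    """Stage 2: map a (non-empty) segment to a small stylometric vector."""
    words = re.findall(r"\w+", seg)
    sentences = [s for s in re.split(r"[.!?]+", seg) if s.strip()]
    return [
        mean(len(w) for w in words),   # average word length
        len(words) / len(sentences),   # average sentence length in words
    ]


def find_outliers(segments, z_threshold=1.5):
    """Stage 3: flag segments whose features deviate from the document mean."""
    vectors = [style_features(s) for s in segments]
    scores = [0.0] * len(segments)
    for column in zip(*vectors):
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # this feature does not vary, so it cannot separate anything
        for i, value in enumerate(column):
            scores[i] = max(scores[i], abs(value - mu) / sigma)
    return [i for i, z in enumerate(scores) if z > z_threshold]


paragraphs = [
    "The cat sat on the mat. The dog ran to the door.",
    "We had tea at noon. Then we went for a walk.",
    "Notwithstanding considerable methodological heterogeneity, longitudinal "
    "investigations consistently demonstrate statistically significant "
    "correlations between socioeconomic characteristics and educational attainment.",
    "The sun was hot. The sky was blue and clear.",
    "It was a good day. We all felt glad to be out.",
]
document = "\n\n".join(paragraphs)

suspicious = find_outliers(segment(document))
print(suspicious)
```

For author diarization, the outlier step would be replaced by clustering the segment vectors, with the number of clusters either given or estimated. As the challenges above note, a flagged segment is only stylistically unusual, which is not proof that it is plagiarized.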


Collaboration Opportunities


Research Collaboration

Opportunities for research collaboration include:
• Joint non-profit studies
• Custom research
• Consulting and mentorship
• Joint laboratories (joint & grant financing)
• Internship opportunities
• Thesis research


Dialogue Evaluation’17 - Plagiarism Detection

The PlagEvalRus workshop
Focused on evaluation of Russian-specific plagiarism detection algorithms. The workshop emphasizes external plagiarism detection in scientific texts (academic plagiarism).

With support of:
• PAN
• Dialogue conference
• CyberLeninka

www.dialog-21.ru/evaluation/2017/plageval/


Thanks for your attention!

Questions / Comments?
