ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · a template-based approach to reada...

288
اِ ٰ ْ اﻟﺮِ اﻟﻠﻪِ ْ ِ ِ ْ ِ ﻟﺮ

Upload: others

Post on 25-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

ن احم حيم بسم الله الر لر

Page 2: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Research Methodology in I.T.

Lecture 01 –

Ms. Iqra MuneerDr. Rao Muhammad Adeel Nawab

Dr. Rao Muhammad Adeel Nawab

Authors:

Introduction to Text Reuse and Plagiarism Detection

Instructor:

Page 3: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 3

!اے ��ان 

، � �  � �، � �  � � � �  � �

� � � � � �، � � � � � � � �� � � � � �، � �  � �

� � د� � � د� ،� � د� ،

.د� � �ے � � د� �� � 

رح�ت �� �� � �� 

Page 4: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 4

How to Work?

.�م ��

. �� �� �م ��

.اهللا � �� � ��� �� �م ��﮵تت﮴ :آ ك � ��

�ك نعبد وا �� �

تعني ا س�. � اهللا � �ى � �دت �� �: ��

اور � � � � د �� � 

Page 5: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 5

. � � � �ر�

.د�� �ں � �� � � � �� �

�ط :د� تقمي رص �ط ٱٱلمس� مٱٱهد� ٱٱلرص �ن ٱ�نعمت �لهي ٱٱ��. � �� راه د� ان ��ں � راه � � � � ا�م �: ��

Power of Effort & Dua

Page 6: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 6

�هم� خر يل وا�رت يل الل�متنا ال� ما �ل

�ب�انك ال �مل لنا ا س�

�ك ٱ�نت العلمي الحكمي ن�ا

ح يل صدري رب ارش يل امري و�رس

وا�لل عقدة من لساين یفقهوا قويل

Dua – Take Help from Allah Before Starting Any Task

Page 7: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 7

Dr. Rao Muhammad Adeel Nawab – About Me

University of Sheffield, UKPhD

COMSATS University Islamabad, Lahore CampusAssistant Professor

NLP Group, CUI, Lahore CampusGroup Lead

Page 8: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 8

Publications

Research Supervisions 10 PhD Studentsunder supervision (in different countries)

40+ MPhil StudentsSupervised

36 International Publications15 Impact Factor Journal Papers

21 International Conference/Workshop Papers

Citations & H-Index258 CitationsH-Index = 10

Page 9: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 9

About ME: Research Profile on Google Scholar

https://scholar.google.com/citations?user=fUn6DXYAAAAJ&hl=en

Page 10: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 10

Dedication

This course is dedicated to mymost beloved and respectablePhD supervisor

Dr. Mark StevensonDepartment of Computer Science,University of Sheffield, UK

Page 11: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 11

Course Details - Instructor

Dr. Rao Muhammad Adeel Nawab

[email protected]

https://lahore.comsats.edu.pk/Employees/74

Instructor

Email

Office

CUI Profile

Room no. 04, Faculty Block

Page 12: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 12

Course Details – Classes

Lec 01 & Lec 02

Tuesday

(17:30 – 20:30)

D-110 (D Block)

Page 13: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 13

Mainly get Excellence in two things

Course Focus

Become a great Human Being

Become a great Researcher

Page 14: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 14

Five Types of Training

PoliceEliteRangersArmyCommando

Page 15: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 15

Main Goal of a Course - Commando Training

Commando

Commando is a man of character and (s)he should safeguard his character

Page 16: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 16

Main Goal of a Course - Commando Training

Two Main Qualities of a Commando

.� � � �ر�

.د�� �ں � �� � � � �� �

100% Effort with Sincerity

ادب  +  ��� ا�د  اور  وا��

1

2

Page 17: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 17

Main Goal of a Course - Commando Training

Summary of Qualities in a Commando

��ى

Page 18: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 18

ادب

� ادب � �، � ادب � �

 �� �� � � �، وہ �� � �

� � �وہ �� � �م،� �� � آ � �

Page 19: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 19

ا� � �

 � � �� �ا� � �� را� 

Page 20: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 20

ا� � �

� �وم � وہ ادب� 

� �وم ��ا� � ه�ل ا�� رو� (

ض اهللا عن ي

ض)ر�

Page 21: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 21

Course Aims

To introduce essential concepts required to become agreat human being and a great researcher

To develop skills to systematically teach any concept

To develop skills to carry out research using a template-based approach

To develop internet searching skills, both general andresearch specific

Page 22: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 22

Course Aims

To develop skills to systematically search, read andsummarize a research paper / thesis

To develop skills to systematically make a template-basedoutline of a research paper / thesis and then write it

To develop skills to systematically design an experiment

To develop skills to carry out research in such a way thatstudents who carry out research become a Commando in life

Page 23: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 23

Course Learning Outcomes (CLOs)

Understand how to systematically learn any concept

Understand how to search internet (both generic andresearch specific) to satisfy their information needs

By the end of this course, the students should be able to

Understand what daily tasks are important tohave a balanced personality

Page 24: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 24

Course Learning Outcomes (CLOs)

Read, write and design an experiment for a researchpaper / thesis using a template-based approach

Tell a coherent and connected story in a researchpaper / thesis

Carry out research in such a way that it enhancesthe self-learning abilities of students and createability in them to cope up with the challenges of life

Page 25: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 25

Useful Material

A Template-based Approach to Read a Research PaperDownload Link: https://ilmoirfan.com/a-template-based-approach-to-read-a-research-paper/

A Template-based Approach to Write a Research PaperDownload Link: https://ilmoirfan.com/a-template-based-approach-to-write-a-research-paper/

A Template-based Approach to Design an ExperimentDownload Link: https://ilmoirfan.com/a-template-based-approach-to-design-an-experiment/

Page 26: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 26

Useful Material Cont…

A Step by Step Approach to Learn Latex for ScientificWritingDownload Link: https://ilmoirfan.com/authors-page/latex/

A Template-based Approach to Write an EmailDownload Link: https://ilmoirfan.com/authors-page/a-template-based-approach-to-write-an-email/

Page 27: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 27

Little Efforts Daily Will Make You the GreatestTo systematically learn and get excellence in any concept /subject

Importance of completing tasks on daily basis���یں کام روز کا روز

ا ا س�تت ا جج ھاتی � ں ��یں �ن ک دن ��ی ا اتی ھاتن � کا ے �ن ں،اک ���ی ��ی

ں �ن ک دن ��ی کام اتی کا ے �ن ک ���ی ی اتی �ے �ہ �ی اا و س�تت �ہ

گا ے

ں آ� ��یس �ن ا اب وا�پ �تی � د�� �ے جپ �� زتن و دن آپ �ج

ے ا �ہ وس�تت ی �ہ کام آج �ہ کا آج

ر د� و�ہ ا �ج تن و ا�پتے � ما �ہ دان �ج ں، آج �تی ��ی

ا �ن تت کا �پ ے وا�ے دن ن

ں، آ� ��یا �ن ا وہ آتن �تی ��ر و ھاؤ�ج

Page 28: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 28

Instructions – To Do Tasks on Daily Basis

In order to facilitate the Course Instructor to monitor

your work progress on daily basis, every student must

follow the following steps:

Page 29: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 29

Instructions – To Do Tasks on Daily Basis (Cont.)

Name of Folder should be: Name – Registration Number

Create a folder on your Google Drive and share it’slink with me on [email protected]

Create a separate sub-folder for each lecture / task

For example: Muhammad Adeel – SP20-RCS-007

Put your files in appropriate sub-folders

For example: Lecture 01 – Introduction to Text Reuseand Plagiarism

Page 30: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 30

Instructions – To Do Tasks on Daily Basis (Cont.)

Update your files in Google Drive sub-folders on dailybasis

Very Important and Mandatory

The tasks mentioned in Your Turn slide(s)must be completed before the next lecture

Page 31: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 31

Course Outline – Key TopicsIntroduction to Text Reuse and PlagiarismLearn How to LearnLearning is a Searching ProblemSearching Offline and Online Sources of Knowledge and SkillsA Template-based Approach to Analyze, Summarize andDocument Search Results v1A Template-based Approach to Read a Research Paper v1A Template-based Approach to Design an ExperimentA Template-based Approach to Write a Research Paper

A Template-based Approach to Write a Research Thesis

A Template-based Approach to Write a Research Thesis Proposal

Page 32: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 32

Lecture Outline1 Basics of Text Reuse

2 Basics of Plagiarism3 Data Annotation for Text Reuse Detection4 Methods for Text Reuse and Plagiarism Detection5 Evaluation Measures6 Treating the Problem of Text Reuse / Plagiarism

Detection as Machine Learning Problem – A Step by Step Example

Page 33: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Basics of Text Reuse

Page 34: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 34

Text Reuse

The text which is used to create new textOriginal Text (or Source Text)

The text created by reusing the original text(s)Derived text (or Reused Text)

DefinitionThe process of creating a new text (or document) using

the existing one(s)

Page 35: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 35

Text Reuse

1

2

Idea

Image

ConceptReuse can also be of

3

4

5

Movies

Features etc.

Page 36: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 36

Text Reuse - Acceptable vs Non-Acceptable

1 Journalism

2 Plagiarism

Text reuse is a common practiceNewspapers use text(s) provided by News Agenciesto write newspaper articles

Unacknowledged text reuse is not acceptable

Page 37: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 37

Text Reuse in Journalism

1 News Agency

2 Text Reuse in JournalismNewspapers use articles provided by News Agenciesto write newspaper stories (or news articles)Text reuse is a common and legitimate practice in thedomain of journalism

An organization that collects news items anddistributes them to newspapers and broadcasters

Page 38: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 38

Two Levels of Rewrite in Journalism

Derived vs Non-Derived

The Newspaper story was created borrowing thetext(s) from News Agencies

Derived

The Newspaper story is written independentlyand doesn't borrow any text from News Agencies

Non-Derived

Page 39: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 39

Three Levels of Re-write in Journalism

Derived Category can be further divided into

News Agency text is the only source for the reusedNewspaper text, which means it is a verbatim (orexact) copy of the News Agency textIn this case, most of the reused text is word-to wordcopy of the source text

The Newspaper text has been either derived frommore than one News Agency or most of the text isparaphrased by the editor when rewriting fromNews Agency text source

Page 40: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 40

Three Levels of Re-write in Journalism (Cont.)

The News Agency text has not been used in theproduction of the Newspaper text (though wordsmay still co-occur in both documents), it hascompletely different facts and figures or is heavilyparaphrased from the News Agency’s copy

Page 41: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 41

Text Reuse – Importance

Large digital repositories are readily available, making iteasier to text reuse and hard to detect it

Powerful text editors are making it easier to rewrite /modify text

Automatic text altering tools are making it easier toquickly modify text for reuse

Freely available Machine Translation systems are helpingpeople to easily even reuse text written in language thatthey don’t know

Page 42: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 42

Detecting unacknowledged reuse of text particularly inacademia

For example, removing duplicate or near-duplicatedocuments from the set of documents returned by a SearchEngine (or Information Retrieval System) against a userquery

Text Reuse - Applications

1 Plagiarism Detection

2 Duplicate (or Near-duplicate) Document Detection

3 Copyright infringement detection

Page 43: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 43

Text Reuse Detection - Task

A text pair, Text 1 and Text 2 (input)

Given

FindHow much text has been reused fromOriginal (Text 1) to create Text 2(output) i.e. goal is to identify thelevel of text reuse

Page 44: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 44

Text Reuse – Input and Output

For two levels of text reuse

For three levels of text reuse

DerivedNon-Derived

Wholly DerivedPartially DerivedNon-Derived

Input

Output

Text Pair (Text 1 and Text 2)

Goal Identify the level of text reuse

Page 45: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 45

Text Reuse - Granularity

Word Level1

Phrasal Level2

Sentence Level3

Passage / Paragraph Level4

Document Level5

Text reuse may occur at five levels

Page 46: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 46

Output

Input

Example 01 – Text Reuse at Word Level

Text 1Meal

Text 2Food

Derived

Page 47: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 47

Output

Input

Example 02 – Text Reuse at Word Level

Text 1Meal

Text 2Butter

Non-Derived

Page 48: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 48

Output

Input

Example 03 – Text Reuse at Word Level

Text 1Like

Text 2Nice

Derived

Page 49: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 49

Output

Input

Example 04 – Text Reuse at Word Level

Text 1People

Text 2Audience

Derived

Page 50: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 50

Output

Input

Example 05 – Text Reuse at Word Level

Text 1Dinner

Text 2Gathering

Non-Derived

Page 51: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 51

Output

Input

Example 06 – Text Reuse at Word Level

Text 1Dinner

Text 2Food

Derived

Page 52: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 52

Example 01 – Text Reuse at Phrasal Level

Output

Input

Example 01

Text 1A story as old as time

Text 2A tale as old as time

Derived

Page 53: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 53

Example 02 – Text Reuse at Phrasal Level

Output

Input

Example 02

Text 1A story as old as time

Text 2This is the story of old times

Non-Derived

Page 54: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 54

Example 03 – Text Reuse at Phrasal Level

Output

Input

Example 03

Text 1Reading a book

Text 2Reading an article

Non-Derived

Page 55: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 55

Example 04 – Text Reuse at Phrasal Level

Output

Input

Example 04

Text 1Ambling in the rain

Text 2Walking in the rain

Derived

Page 56: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 56

Example 05 – Text Reuse at Phrasal Level

Output

Input

Example 05

Text 1The love of my life

Text 2The crush of my life

Derived

Page 57: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 57

Example 06 – Text Reuse at Phrasal Level

Output

Input

Example 06

Text 1The love of my life

Text 2The journey of love isvery long

Non-Derived

Page 58: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 58

Example 01 – Text Reuse at Sentence Level

DerivedOutput

Input

Text 1What is your age?

Text 2How old are you?

Page 59: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 59

Example 02 – Text Reuse at Sentence Level

Non-DerivedOutput

Input

Text 1Will it snow tomorrow?

Text 2The weather prediction is quitestorming, what do you think about snowin the upcoming days?

Page 60: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 60

Example 03 – Text Reuse at Sentence Level

DerivedOutput

Input

Text 1Your car is nice

Text 2I like your car

Page 61: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 61

Example 04 – Text Reuse at Sentence Level

DerivedOutput

Input

Text 1He said that sit-ins have caused a hugeloss to national economy and thenation is depressed

Text 2Prime minister said, “Sit-ins have causeda huge loss to national economy and thenation is depressed”

Page 62: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 62

Example 05 – Text Reuse at Sentence Level

Non-DerivedOutput

Input

Text 1Balochistan successfully holds 3rd roundof LG elections

Text 2Plots in the municipal elections, thePML-N won the third stage

Page 63: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 63

Example 06 – Text Reuse at Sentence Level

DerivedOutput

Input

Text 1His body was handed over to the heirsafter legal formalities

Text 2The body was handed over to the heirs

Page 64: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 64

Example 07 – Text Reuse at Sentence Level

Non-DerivedOutput

Input

Text 1NAB Chairman determined to root outcorruption from society

Text 2Honest people will send representativesin Parliament will help eliminatecorruption: NAB

Page 65: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 65

Example 01 - Text Reuse at Passage Level

Cognizant of the need to accordgreater attention towardsprotection of the vulnerable andmarginalized segments of society,the government is committed tomake every possible effort to put inplace effective legal, economic andsocial frameworks so as to ensureprotection of human rights," he saidin his message on the occasion ofInternational Human Rights Day(December 10)

Input (Text 1)On the occasion of Human RightsDay, the Constitution of Pakistan,he said in his message to thecitizens based on race, color orrace, regardless of the guarantees.To ensure the protection of humanrights as possible to provideeffective legal, economic and socialframework is determined. He is thecelebration of Human Rights Day,on every level, promotion andprotection of human rights is oursolid commitment

Input (Text 2)

Derived

Output

Page 66: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 66

Example 02 - Text Reuse at Passage Level

Addressing a ceremonyheld in honour of thewinners of Nobel PeacePrize-2014 and televisedfrom Oslo, Norway, sheexpressed the resolve tocontinue her struggle forbringing all girls and boysin the education net and tofight for their rights

Input (Text 1)Pakistan and India live inpeace, I am sure that we stopthe progress. But if we do notsucceed one another so thatno country would be able tomove ahead, both the issuesof poverty, lack of educationof children and women aredenied basic rights, and wehave these problemstogether to solve

Input (Text 2)

Non-Derived

Output

Page 67: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 67

Example 03 - Text Reuse at Passage Level

He said he had held very goodmeetings and talks with IranianMinister of Economy and Finance AliTayebnia during his visit to Pakistan.He said that there were agreementswith Chinese CNPC oil companyrecently for laying 700 kms gaspipeline from Gwadar port, 70 kmsfrom Iran border, and Nawabshah. Hesaid Chinese company envoy arrivedon Tuesday to arrange preliminariesfor the project, announcing that thecompany will complete the 700kmpipeline in 24 months

Input (Text 1)LNG terminal at the port in the firstphase while the second phase will beheld from Gwadar to Nawabshah 700kilometers of 42-inch diameterpipeline will be laid. He said thatbecause of international sanctionsimposed on Iran Pakistan to fulfill itspart of the project completed Despitethe government's efforts could relateto this project Bank, InternationalContractors and Equipment Suppliersare not willing to work. Is expectedto work on the project would bestarted soon

Input (Text 2)

Derived

Output

Page 68: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 68

Example 04 - Text Reuse at Passage Level

We are contacting the PTIleadership for starting thedialogue," he added. Welcomingthe PTI's decision to resumetalks, Dar said there was noimpediment in this regard as PTIchief Imran Khan had backedout from the unconstitutionaland illegal demand for theresignation of Prime Minister orfor going on a one-month leave

Input (Text 1)'Movement and the government ofrigging the kumbynh judicial commissioninvestigating judge who will not berecognized, ISI or IB representatives ofthe Commission only the Commission canadd your own, we would not demandanything, "Imran Mohib The evidence ofPakistani and sit-ins and rallies to protestthe talks ended. If they do not agree thenthe negotiations' unconstitutionaldemand the resignation of the Minister ofJustice to withdraw any unconstitutionalis welcome, we will not discuss the talksand hope that justice will not demand anyunconstitutional

Input (Text 2)

Non-Derived

Output

Page 69: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 69

Example 05 - Text Reuse at Passage Level

The participants, Malala said she wasproud to be the youngest-ever NobelPeace Prize recipient. She thanked herparents for providing all kind ofsupport in getting education, saying "Ithank them for not clipping my wingsand letting me fly." Stressing the needto make joint efforts for impartingquality education to girls and boyswithout fear and discrimination,Malala vowed to work for protectionof children's rights not only inPakistan but across the world withmore vigour and dedication

Input (Text 1)After completing his studies at the primeminister's wish, to promote educationseriously consider a global leader. Theinternational community to pay specialattention to education. Why can not givearms to provide easy and scripture. Whypowerful countries are weak in peace. Weare living in contemporary art, nothing isimpossible, and I thank all my fans. I amgrateful to my teachers and parents whogave me a chance to excel and education.This is a very happy day for me, I'm theyoungest Pakistani and Pashtun girl whowon it. Kailash Satyarthi champions arefighting for the rights of children. I amglad that we can work together

Input (Text 2)

Derived

Output

Page 70: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 70

Example 06 - Text Reuse at Passage Level

According to senior policeofficials, the reason behind themurder has not been ascertainedyet. Several police teams led bysenior police officials wereconducting raids at various placesfor the arrest of killers.Meanwhile, MQM has lodgedprotest in Sialkot city against themurder and blocked traffic onvarious roads of Sialkot. They weredemanding early arrest of thekillers

Input (Text 1)Praltaf Anwar Hussain said theBOA was the senior partner andsincere testimony of theirorganization is a senior fellowlost. The tortured bodies of ourworkers have poured into thestreets of the patient isrecommended. Altaf Hussain haswarned that if not stopped killingour workers in the province ofPunjab, including the primeminister will not enter into anyMinister Sindh

Input (Text 2)

Non-Derived

Output

Page 71: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 71

Example 01 – Text Reuse at Document Level

Input (Document 1)

She expressed the resolve to continue her struggle forbringing all girls and boys in the education net and to fightfor their rights. She also recalled her struggle for gettingeducation in Swat and Taliban's terror in the valley, whothreatened girls to stop getting education

Page 72: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 72

Example 01 – Text Reuse at Document Level (Cont.)Input (Document 2)

But, she decided to stand up against them and succeeded, she added."Terrorists failed in their nefarious designs," Malala said adding that wasnot alone as she was the voice of 66 million girls. Taliban, she said, blew upschools in Swat with bombs and rockets and they misused the name of Islamwhich was a religion peace, tolerance, brotherhood and humanity, urgingthe followers to get knowledge, education and discover new Praltaf AnwarHussain said the BOA was the senior partner and sincere testimony of theirorganization is a senior fellow lost. The tortured bodies of our workers havepoured into the streets of the patient is recommended. Altaf Hussain haswarned that if not stopped killing our workers in the province of Punjab,including the prime minister will not enter into any Minister Sindh.

Page 73: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 73

Example 01 – Text Reuse at Document Level (Cont.)

DerivedOutput

Page 74: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 74

Example 02 – Text Reuse at Document Level

Input (Document 1)

Chairman Norwegian Nobel Peace Committee ThirdbornJagland awarded the winners with gold medals and prizes in awidely televised-ceremony from Oslo, Norway. He highlightedefforts of Malala and Kailash for protecting children's rightsand bringing all girls and boys in the education net. He saidMalala faced Taliban in Swat, who were threatening to keep heraway from education and even made an attempt on her life.She, however exhibited great courage and continued studies,besides advocating for girls' education

Page 75: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 75

Example 02 – Text Reuse at Document Level (Cont.)Input (Document 2)

It is time that education should take place, then do not raise any action against education. I wantpeace in every corner of the world, education is a key component of basic life henna on theirhands, the formula used to calculate. I want that women be given equal rights, the award is forfrightened children who want peace. Our Prophet Mohammad is the messenger of peace, Idecided to speak out against the Taliban, hundreds of schools were destroyed by militants inSwat, once a tourist paradise of Swat was killed by terrorists. Girls' education was stopped inSwat, militants tried to stop us, me and my friends were attacked, our voice has been comparedto the Taliban, the Taliban's ideology not only won their shots prevail so, this story is not just meso many other girls, deprived of education stand to hear children's voices, this time will not beafraid and do virtually anything. Swat was always eager to learn and inventions. It is time thateducation should take place, then do not raise any action against education. I want peace in everycorner of the world, education is a key component of basic life henna on their hands, the formulaused to calculate. I want that women be given equal rights, the award is for frightened childrenwho want peace. Our Prophet Mohammad is the messenger of peace, I decided to speak outagainst the Taliban, hundreds of schools were destroyed by militants in Swat, once a touristparadise of Swat was killed by terrorists

Page 76: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 76

Example 02 – Text Reuse at Document Level (Cont.)

Non-DerivedOutput

Page 77: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 77

Example 03 – Text Reuse at Document Level

Input (Document 1)

Around 500 family members of victims of Indian staterepression along with human rights activists and membersof Dal Khalsa during a march in Amritsar said that humanrights abuses committed in Punjab and Kashmir were notrandom but were carried out as a matter of Indian statepolicy. They said that they would approach Barack Obamaduring his forthcoming visit to India on January 26, nextyear,KMS reported

Page 78: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 78

Example 03 – Text Reuse at Document Level

Input (Document 2)

The Indian state of Kashmir my dyasrus about five hundredfamilies of victims of terrorism on human rights activistsand members of Dal Khalsa accompanied the rally inAmritsar Punjab and Kashmir protesters said regularhuman rights violations Indian state policy being.

Page 79: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 79

Example 03 – Text Reuse at Document Level (Cont.)

DerivedOutput

Page 80: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 80

Example 04 – Text Reuse at Document Level

Input (Document 1)

In the decision it was stated that the counsel for thepetitioner was confronted with the maintainability of thesepetitions in the light of the restrictions contained in article225 of the Constitution on throwing a challenge to theelection results other than by way of election petition beforethe Election Tribunal and further whether the results of theentire general elections for the national and the provincialassemblies could be annulled under any provision of theConstitution or law

Page 81: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 81

Example 04 – Text Reuse at Document Level

Input (Document 2)

The decision that the applicants lawyer has no way to hearelection petitions, but tribunals do not give satisfactoryanswers to the question whether the context of Article 225and whether national and annulled the results of theAssembly can be given

Page 82: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 82

Example 04 – Text Reuse at Document Level (Cont.)

Non-DerivedOutput

Page 83: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 83

Text Reuse - Types

When amount of text reused is detected atsentence / passage level

When amount of text reused is detected at document level

Local Text Reuse vs Global Text Reuse

1 Local Text Reuse

2 Global Text Reuse

Page 84: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 84

Text Reuse - Types

VS

When both the original andthe reused text are in thesame language

When the original text isin one language and thereused text is in anotherlanguageCross-lingual text reusecan be carried out using

Mono-lingual Text Reuse Cross-lingual Text Reuse

Automatic Translation (e.g.Google Translator, BIgnTranslator etc.)Manual Translation

Page 85: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 85

Example - Mono-lingual Text Reuse

DerivedOutput

Input

Text 1A dog bites a man

Text 2A hound bites a person

Both source and reused texts are in the samelanguage

Page 86: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 86

Example - Cross-lingual Text Reuse

DerivedOutput

Input

Text 1When source and reused texts are indifferent languages

Text 2�� �ا� � ا� � �

Source: A dog bites a man

Both source and reused texts are in thedifferent languages

Page 87: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 87

Example - Verbatim Copy / Exact Copy (Mono-lingual Settings)

Text 2

Text 1

Example

He said that sit-ins havecaused a huge loss to nationaleconomy and the nation isdepressed

He said that sit-ins havecaused a huge loss to nationaleconomy and the nation isdepressed

Page 88: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 88

Example - Paraphrased Copy (Mono-lingual Settings)

Text 2

Text 1

Example

Prime Minister Nawaz Sharif recalledthat in start of December he hadannounced reduction of 2.32 rupeesper unit in prices of electricity, butthe announcement was deferredbecause of Peshawar school tragedy

Prime Minister said that the priceof electricity has been decreasedby 2 rupees 32 paisas per unit. Inthe coming days [we] are trying tomake electricity more affordable.

Page 89: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 89

Example - Independently Written (Mono-lingual Settings)

Text 2

Text 1

Example

Prime Minister Muhammad NawazSharif Wednesday announcedfurther reduction in the prices ofpetroleum products, which will beeffective from January 1, 2015

Prime Minister Nawaz Sharif hasannounced reduction in prices ofpetroleum products up to Rs. 14per liter. Petrol 6.25, diesel 7.86,kerosene oil 11.26 rupees per litercheaper [than before]

Page 90: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 90

Example - Verbatim Copy / Exact Copy (Cross-lingual Settings)

Text 2

Text 1

Example

The chief minister said he wouldpersonally monitor the programmedof repair and construction of roadsin rural areas and review the pace ofprogress on fortnightly basis

ا� � � � د� ��ں � � و وز�  �ں � � ذا� �ر � �ا�  �وں � �� � �و�ام � � ��ہ �ں روز � ��  �اور � �رہ 

Page 91: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 91

Example - Paraphrased Copy (Cross-lingual Settings)

Text 2

Text 1

Example

Prime Minister Nawaz Sharif recalledthat in start of December he hadannounced reduction of 2.32 rupeesper unit in prices of electricity, butthe announcement was deferredbecause of Peshawar school tragedy

32رو� 2وز�ا� � � � � � � �د. � � �� � � � � �ں � � آ�ہ 

�� � �� � �ش � ر� �

Page 92: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 92

Example - Independently Written (Cross-lingual Settings)

Text 2

Text 1

Example

Prime Minister Muhammad NawazSharif Wednesday announcedfurther reduction in the prices ofpetroleum products, which will beeffective from January 1, 2015

� �ں وز�ا� �از�� � �و� ��ت� ا�ن �14�  �ول . د�رو� � � � � 

� � , 7.86ڈ�ل , 6.25 رو� � �11.26� 

Page 93: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 93

Your Turn

Write at least 12 examples for each of thefollowing: 6 examples for mono-lingual text reuseand 6 examples for cross-lingual text reuse (2Wholly Derived Examples, 2 Partially DerivedExamples and 2 Non-Derived Examples)

Text Reuse at Word LevelText Reuse at Phrasal LevelText Reuse at Sentence LevelText Reuse at Passage / Paragraph LevelText Reuse at Document Level

Page 94: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 94

Summary - Basics of Text Reuse

Text Reuse is the process of creating a new text (ordocument) using the existing one(s) and it is reported tobe on rise in recent years due to easy access to largeonline digital repositories

Given a text pair (Text 1 and Text 2), Text 2 is said to beDerived from Text 1, if it is created using text from Text 1.On the other hand, Text 2 is said to be Non-Derived fromText 1, if it is interpedently written i.e. did not borrow textfrom Text 1

Page 95: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 95

Summary - Basics of Text Reuse (Cont.)

Text Reuse may occur at five levels:

WordLevel

PhrasalLevel

SentenceLevel

Passage /ParagraphLevel

DocumentLevel

Page 96: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 96

Summary - Basics of Text Reuse

Two main types of text reuse are: (1) Local Text Reuse -when amount of text reused is detected atsentence/passage level and (2) Global Text Reuse - whenamount of text reused is detected at document levelText Reuse can be: (1) Mono-lingual Text Reuse - whenboth the original and the reused text are in the samelanguage and (2) Cross-lingual Text Reuse - when theoriginal text is in one language and the reused text is inanother language

Page 97: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

It's Poetry Time

Page 98: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 98

�ا� د�ا� � دف �ى � �� � 

�ى � دف �ت او�د � 

او�د � دف �� � 

اور �� � دف ��ى � �� � �ق ا� ��

� � � �ى � � �� --دف �ر�  

Importance of Poetry

Page 99: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 99

� وا�ں � �ں � � ذوق  �ں 

� � د� � � � � وا� �

 �

Page 100: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Stay Motivated

Page 101: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 101

. � �ڑى �� � د�ں � � ا�ر � �� � �

.�ڑى ا� �وڑ � � اور �ول � � � � � �

Motivation is the Fuel of Life

Importance of Motivation

Page 102: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 102

• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life

• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp

• On Daily Basis, read and enjoy• Poetry• Jokes

Tips - To Stay Motivated and become a Great Researcher

Page 103: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 103

Title: Guru ki har bat gur hoti hai Gur ko nhe Guru ko pakar

Speaker: Dr. Rao Muhammad Adeel Nawab

Downlaod Link: Video + Slides https://drive.google.com/open?id=1SQ5yUroaPT5dRcY-8K7IVC5SQrQy8Lsh

Task Summarize main points of the motivational seminar

Motivational Seminar

Page 104: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Basics of Plagiarism

Page 105: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 105

Plagiarism

Plagiarism is defined as the unacknowledgedreuse of text

Plagiarism

Copying another person's work exactly andpresenting it as your own (withoutattributing it to the original author)

Formal Definition

Page 106: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 106

Plagiarism (Cont.)

The document suspected tocontain plagiarism

The document(s) whichwere used to create theplagiarized document

Suspicious Document Source Document(s)

Note that a suspiciousdocument may or may notcontain plagiarism

Page 107: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 107

Plagiarism – Importance

In recent years, plagiarism has been reported tobe on rise particularly in academia

Plagiarism detection systems are routinelyused in universities to check students workfor plagiarism

Page 108: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 108

Levels of Plagiarism

VerbatimThe original text is reused as verbatim(word to word copy) or with minormodifications to create the plagiarizeddocument

Page 109: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 109

Levels of Plagiarism (Cont.)

Paraphrased Plagiarism

The original text is heavily altered (orparaphrased) to create the plagiarizeddocumentParaphrasing can be as:

Light RevisionSource text is slightly paraphrased

Heavy RevisionSource text is heavily paraphrased

Page 110: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 110

Levels of Plagiarism (Cont.)

Plagiarism of Idea

The idea of the original text is reusedwithout dependence on the words orform of the source

Page 111: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 111

Example - Verbatim Plagiarism

Text 2

Text 1

Example

In fact, of innumerable creaturespredestined from the creation of theworld to lay up a store of wealth for theBritish farmer, and a store of quiteanother sort for an immaculateRepublican government

Here lived innumerable creaturespredestined from the creation of theworld to lay up a store of wealth for theBritish farmer, and a store of quiteanother sort for an immaculateRepublican government

Page 112: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 112

Example - Paraphrased Plagiarism

Text 2

Text 1

Example

The number of foreign and domestic touristsin the Netherlands rose above 42 million in2017, an increase of 9% and the sharpestgrowth rate since 2006, the nationalstatistics office CBS reported on Wednesday(DutchNews.nl, 2018)

According to the national statistics office, theNetherlands experienced dramatic growth intourist numbers in 2017. More than 42 milliontourists travelled to or within the Netherlandsthat year, representing a 9% increase – thesteepest in 12 years (DutchNews.nl, 2018)

Page 113: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 113

Example - Plagiarism of Idea

Text 2

Text 1

From a class perspective this put them [highwaymen] in anambivalent position. In aspiring to that proud, if temporary,status of Gentleman of the Road, they did not question theinegalitarian hierarchy of their society. Yet their boldness ofact and deed, in putting them outside the law as rebelliousfugitives, revivified the animal spirits of capitalism andbecame an essential part of the oppositional culture ofworking-class London, a serious obstacle to the formation ofa tractable, obedient labour force. Therefore, it was notenough to hang them – the values they espoused orrepresented had to be challenged.

Peter Linebaugh argues that highwaymen represented apowerful challenge to the mores of capitalist society andinspired the rebelliousness of London’s working class.

Page 114: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 114

Plagiarism Detection – Task

A suspicious text (input)

Given

Identify

The source(s) of plagiarism

Page 115: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 115

Plagiarism Detection – Input and Output

Suspicious Text

Input

Output

Plagiarized / Non-Plagiarized

Page 116: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 116

Plagiarism Detection - Two Levels of Rewrite

When any type of plagiarism isoccurred between documents theywere called plagiarized

Plagiarized

Non-Plagiarized

When no type of plagiarism isoccurred between documents theywere called non plagiarized

Page 117: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 117

Plagiarism Detection - Four Levels of Rewrite

When suspicious text is created by simply copyingand pasting text from source document(s)

1 - Near Copy

2 - Light RevisionWhen suspicious text is created by applyingsmall modification like synonyms replacementand altering grammatical structure

The Plagiarized cases can be further categorized into three categories

Page 118: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 118

Plagiarism Detection - Four Levels of Rewrite (Cont.)

When suspicious text is created by rephrasing thetext to generate the meaning

3 - Heavy Revision

4 - Non-Plagiarized

When suspicious text is written independently

It may include breaking source sentence into morethan one sentences, merging two or more sentencesinto one, replacing words with appropriate synonymsor phrases, changing voice, changing tense etc.

Page 119: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 119

Important Note

Documents that are independently writtenon the same topic are expected to have

Around 50% content overlap

Page 120: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 120

Example - Near Copy

Output

Input

Example

Text 1A dog bites a man

Text 2A dog bites a man

Near Copy

Page 121: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 121

Example - Light Revision

Output

Input

Example

Text 1A dog bites a man

Text 2A hound bites by a person

Light Revision

Page 122: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 122

Example - Heavy Revision

Output

Input

Example

Text 1A dog bites a man

Text 2A man was bitten by a dog

Heavy Revision

Page 123: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 123

Example - Non-Plagiarized

Output

Input

Example

Text 1A dog bites a man

Text 2A man was injured while runningon road followed by a dog

Non-Plagiarized

Page 124: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 124

Types of Plagiarism Cases

Artificial cases of plagiarism are generated by using

Automatic Text Altering tools to obfuscate the

source text for plagiarism

Artificial

There are three main types of plagiarism cases

Page 125: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 125

Types of Plagiarism Cases (Cont.)

None ObfuscationAutomatic TextAltering toolsimply copy andpastes text fromsource to createplagiarizeddocument

Low Obfuscation Automatic TextAltering toollightly rephrasessource textautomaticallybefore it is used tocreate plagiarizeddocument

High ObfuscationAutomatic TextAltering toolheavily rephrasessource textautomaticallybefore it is used tocreate plagiarizeddocument

Three levels of rewrite

Page 126: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 126

Type of Plagiarism Cases (Cont.)

Simulated / Manual

The original text is paraphrased by humans to createthe cases of plagiarism

For example, Karl-Theodor zu Guttenberg (GermanDefense Minister) PhD thesis proved plagiarizedURL:https://www.theguardian.com/world/2011/mar/01/german-defence-minister-resigns-plagiarism

Real

Real cases of plagiarism are those which occurred inthe real world

Page 127: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 127

Source

Example – Artificial (None Obfuscation)

The first agrarian movement after theenactment of lex Licinia took place in theyear 338, after the battle of Veseris inwhich the Latini and their allies werecompletely conquered

Suspicious

The first agrarian movement after theenactment of lex Licinia took place in theyear 338, after the battle of Veseris inwhich the Latini and their allies werecompletely conquered

Page 128: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 128

Source

Example – Manual (Simulated Obfuscation)The emigrants who sailed with Gilbert were betterfitted for a crusade than a colony, and, disappointedat not at once finding mines of gold and silver, manydeserted; and soon there were not enough sailors toman all the four ships

Suspicious

The people who left their countries and sailed withGilbert were more suited for fighting the crusadesthan for leading a settled life in the colonies. Theywere bitterly disappointed as it was not the Americathat they had expected. Since they did notimmediately find gold and silver mines, manydeserted. At one stage, there were not even enoughman to help sail the four ships

Page 129: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 129

Real

Due to copyright issues it is impossible to

have example of real case of plagiarism

Page 130: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 130

Types of Plagiarism Detection

Checking that the entire document (or all thepassages) were written by one single author

Intrinsic Plagiarism Detection

Searching for the source(s)(or original text(s)) that werereused to create thesuspicious document

Extrinsic Plagiarism Detection

In case of intrinsic plagiarism detection, thefocus is on identifying portion(s) of textwhose writing style significantly differs fromthe remaining text in the suspiciousdocument, which means that the entiredocument is not written by one single authorand contains text written by other author(s).

Mainly involves comparisonof the suspicious documentwith potential sourcedocuments

Two main types of plagiarism detection

Page 131: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 131

Intrinsic Plagiarism Detection – Task

Page 132: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 132

Intrinsic Plagiarism Detection – Task

Identify

Given

Task

A suspicious document (input)

Portion(s) of text whosewriting style is significantlydifferent form the remainingtext (output)

Page 133: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 133

Intrinsic Plagiarism Detection – Input and Output

Portion(s) of text whose writing style issignificantly different form the remaining textOutput

Input A Suspicious Text

If whose writing style is one or more portion(s) of text issignificantly different form the remaining text then thesuspicious document is plagiarized otherwise non-plagiarized

Note

Page 134: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 134

Example – Intrinsic Plagiarism Detection

Rasheed is my best friend. He lives in Lahore. He had got goodeducation. He earned his PhD degree from one of the mostprestigious, well reputed and renowned institution of the worldi.e. MIT, U.S.A. He is humble and nice. Rasheed always try tohelp others

Suspicious Document is Plagiarized

Given (Suspicious Document)

Output

Portion of text whose writing style is significantly different fromremaining text

He earned his PhD degree from one of the most prestigious, wellreputed and renowned institution of the world i.e. MIT, U.S.A.

Page 135: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 135

Extrinsic Plagiarism Detection - Task

Page 136: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 136

Extrinsic Plagiarism Detection – Task (Cont.)

Page 137: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 137

Example – Extrinsic Plagiarism Detection

Containing two suspicious documents

Suspicious Collection

Source Collection

Containing five source documents

Given

Page 138: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 138

Example – Extrinsic Plagiarism Detection (Cont.)

suspicious-document-01 Rashed is my best friend. He lives in Lahore. He had got good education. He earned hisPhD degree from one of the most prestigious, well reputed and renowned instructionsof the world i.e. MIT, U.S.A. He is humble and nice. Rashed always try to help others.

Given a user query as input, Information Retrieval system aims toretrieve document(s) which are relevant to the user query to satisfyuser's information need.

Documents in Suspicious Collection

Page 139: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 139

MachineLearning abranch of AI.It is a hotresearchtopic in theworld

MIT is one ofthe mostprestigious,well reputedandrenownedinstructionsof the worldlocated inU.S.A.

My fathername is RaoNawabAkhtar. He isnice andhumble. Healways try tohelp others

Understandingis deeper thanLove. A largenumber ofpeople loveyou but only afewunderstandyou

Example – Extrinsic Plagiarism Detection (Cont.)

Documents in Source Collection

NaturalLanguageProcessing isa branch ofAI. It hasmanypotentialapplicationsin the realworld

source-document 1

source-document 2

source-document 3

source-document 4

source-document 5

Page 140: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 140

Example – Extrinsic Plagiarism Detection (Cont.)

Candidate Document Retrieval

Extrinsic Plagiarism Detection - Two Step Process

AimIdentify potential source(s) of plagiarism

Potential Technique(s)Use an Information Retrieval (IR) based approach

Detailed AnalysisAim

Identify fragment(s) of source text(s) that were used tocreate the corresponding plagiarized fragments of text(s)

Possible Technique(s)Use a Pairwise Comparison Approach

Page 141: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 141

Example – Extrinsic Plagiarism Detection (Cont.)

Candidate Document Retrieval

GoalIdentify top K source documents from thesource collection which are potential sourcesof plagiarism

Here K = 3

Candidate Document Retrieval SystemVector Space Model

Page 142: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 142

Example – Extrinsic Plagiarism Detection (Cont.)

QueryTwo Separate Queries

suspicious-document-01 suspicious-document-02

Static Collection of DocumentsSource Collection (containing 5 documents)

Given

Page 143: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 143

Example – Extrinsic Plagiarism Detection (Cont.)

Output of Candidate Document Retrieval Systemsuspicious-document-01 - Potential Candidate Source Document(s)

source-document-02source-document-03source-document-05

suspicious-document-02 - Potential Candidate Source Document(s)source-document-01source-document-04source-document-05

Page 144: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 144

Example – Extrinsic Plagiarism Detection (Cont.)

Detailed Analysis

GoalIdentify fragment(s) of source text(s) thatwere used to create the correspondingplagiarized fragments of text(s)

Detailed Analysis TechniqueGreedy String Tiling

Page 145: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 145

Example – Extrinsic Plagiarism Detection (Cont.)

Considering suspicious-document-01Potential Candidate Source Documents

source-document-02 source-document-03 source-document-05

Considering suspicious-document-02Potential Candidate Source Documents

source-document-01 source-document-04 source-document-05

Given

Page 146: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 146

Example – Extrinsic Plagiarism Detection (Cont.)

Output - suspicious-document-01suspicious-document-01 is Plagiarized

Source(s) of Plagiarism

Output – Detailed Analysis

suspicious-document-02 suspicious-document-03

Page 147: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 147

Example – Extrinsic Plagiarism Detection (Cont.)

one of the most prestigious, well reputed andrenowned instructions of the world i.e. MIT, U.S.A.

Fragment Pair 1 - suspicious-document-01, source-document-01

MIT is one of the most prestigious, well reputed andrenowned instructions of the world located in U.S.A.

Suspicious-Source Fragment Pairs

suspicious-document-01

Source-document-02

Page 148: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 148

Example – Extrinsic Plagiarism Detection (Cont.)

He is humble and nice. Rasheed always try to helpothers.

Fragment Pair 2 - suspicious-document-01, source-document-03

He is nice and humble. He always tries to helpothers.

suspicious-document-01

Source-document-03

Page 149: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 149

Example – Extrinsic Plagiarism Detection (Cont.)

Output - suspicious-document-02

suspicious-document-01 is Non -Plagiarized

Output – Detailed Analysis

Page 150: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 150

Shared Tasks on Text Reuse and Plagiarism

PAN is a series of scientific events and shared

tasks on digital text forensics and stylometryPAN

https://pan.webis.de/

Page 151: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 151

Main Shared Tasks on Natural Language Processing

CoNLL

URL:

https://www.conll.

org/2019

SemEval

URL:

http://alt.qcri.org/s

emeval2019/

Page 152: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 152

Your Turn

Write at least 8 examples (at sentence level) foreach of the following levels of rewrite inplagiarism

Near CopyLight RevisionHeavy RevisionNon-Plagiarized

Page 153: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 153

Summary - Basics of Plagiarism

Plagiarism is defined as the unacknowledged reuse of text andin recent years it has been reported to be on rise.Consequently, plagiarism detection systems are routinely usedby higher educational institutions to check students work forplagiarismGiven a text pair (source text and suspicious text), suspicioustext is said to be Plagiarized from source text, if it is createdusing text from source text. On the other hand, suspicious textis said to be Non-Plagiarized from source text, if it isinterpedently written i.e. did not borrow text from source text

Page 154: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 154

Summary - Basics of Plagiarism (Cont.)

There are three levels of Plagiarism: (1) Verbatim - the originaltext is reused as verbatim (word to word copy) or with minormodifications to create the plagiarized document, (2)Paraphrased Plagiarism - the original text is heavily altered (orparaphrased) to create the plagiarized document and (3)Plagiarism of Idea - the idea of the original text is reusedwithout dependence on the words or form of the sourceThere are three main types of Plagiarism Cases: (1) ArtificialCases of Plagiarism - are generated by using Automatic TextAltering tools to obfuscate the source text for plagiarism,

Page 155: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 155

Summary - Basics of Plagiarism (Cont.)

(2) Simulated / Manual Cases of Plagiarism - the original text isparaphrased by humans to create the cases of plagiarism and(3) Real Cases of Plagiarism - are those which occurred in thereal worldTwo main types of plagiarism detection are: (1) IntrinsicPlagiarism Detection - checking that the entire document (orall the passages) were written by one single author and (2)Extrinsic Plagiarism Detection - searching for the source(s) (ororiginal text(s)) that were reused to create the suspiciousdocument

Page 156: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 156

It's Poetry Time

Page 157: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 157

�ا� د�ا� � دف �ى � �� � 

�ى � دف �ت او�د � 

او�د � دف �� � 

اور �� � دف ��ى � �� � �ق ا� ��

� � � �ى � � �� --دف �ر�  

Importance of Poetry

Page 158: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 158

    �  �  ر�ے �� � �� �ض ئ دل آ�ي

دل� � �� دل �ى � �  � ��  �

 �

Page 159: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 159

Stay Motivated

Page 160: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 160

. � �ڑى �� � د�ں � � ا�ر � �� � �

.�ڑى ا� �وڑ � � اور �ول � � � � � �

Motivation is the Fuel of Life

Importance of Motivation

Page 161: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 161

• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life

• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp

• On Daily Basis, read and enjoy• Poetry• Jokes

Tips - To Stay Motivated and become a Great Researcher

Page 162: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 162

Title: Kamal Kiya Hai - Hussan Ho Aur Nazakat Na Ho

Speaker: Dr. Rao Muhammad Adeel Nawab

Downlaod Link: Video + Slides https://drive.google.com/open?id=1RZOgjND7LiQSe0onFW3fo-GHjZsieTVA

Task Summarize main points of the motivational seminar

Motivational Seminar

Page 163: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Data Annotation for Text Reuse Detection

Page 164: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 164

Structured Data

Data is stored,processed, andmanipulated in atraditional RelationalDatabase ManagementSystem (RDBMS)

Unstructured Data

Data that iscommonly generatedfrom human activitiesand doesn’t fit into astructured databaseformat

Semi-structured Data

Data doesn’t fit into astructured database system,but is none-the-lessstructured by tags that areuseful for creating a formof order and hierarchy inthe data

Data vs Information

Varieties Of Data

Raw Facts and Figures

Varieties of DataPAN

Page 165: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 165

Data vs Information (Cont.)

Forms Of Data

Text Image Video Audio

Main Forms Of Data

Processed form of DataINFORMATION

Page 166: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 166

Data Annotation

a.k.a. data labeling / data tagging

Data Annotation

is the process of labeling data to make it usable for machine learning?is performed by domain experts (humans – a.k.a. annotators / taggers / raters)requires a lot of effort, time and cost

Page 167: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 167

Example 01 - Data AnnotationRaw Data

iPhone7 is a good mobileBattery of this phone is bad

I am using iphone7

Data Annotation – Sentiment AnalysisComment / Review Sentiment

iPhone7 is a good mobile PositiveBattery of this phone is

badNegative

I am using iphone7 Neutral

Data Annotation – Gender IdentificationComment / Review Gender

iPhone7 is a good mobile MaleBattery of this phone is

badMale

I am using iphone7 Female

Page 168: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 168

Main Steps to Create Benchmark Annotated DatasetRaw Data Collection

Corpus Standardization

Data Source(s)Cleaning of DataPre-processing of Data

Annotation ProcessPreparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement

Page 169: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 169

Raw Data Collection Two Sources of Data

News Agencies articles01Associated Press of Pakistan (APP)

Newspapers stories02

Nawa e Waqt

Example – Data Annotation for Text Reuse Detection

Independent News Agency (INP)

Express NewspaperJang Newspaper

Multiple Newspapers can reuse text from one NewsAgency to produce their news storiesNote

Page 170: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 170

Raw Data Collection Raw Data Statistics

Collected 4 News Agency articles and 6 Newspaperstories

Example – Data Annotation for Text Reuse Detection (Cont.)

Cleaning of Raw DataRemove HTML tags, hyperlinks, foreign languagecharacters etc.

Pre-processing of Raw DataNo pre-processing was performed

Page 171: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 171

Below are cleaned and pre-processed documents (4News Agency articles and 6 Newspaper stories)

News Agency Article Newspaper Story

A dog bites a man A dog bites a personA dog bites a man A person was badly bitten by a dog

A dog bites a manA person is badly injured while running at roadfollowed by a hound, Independently written

I like your car Your car is niceThis is my country I live in Lahore, PakistanI like your car I like your car

Example – Data Annotation for Text Reuse Detection (Cont.)

Page 172: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 172

Example – Extrinsic Plagiarism Detection (Cont.)

Annotation Guidelines

Annotation Process

Read text pair

Assign the label (from pre-defined set of labels) which maps to the most dominating label

Page 173: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 173

Example – Extrinsic Plagiarism Detection (Cont.)

AnnotationsStandard practice in doing annotations is to have three annotators (A, B and C)

Can have more than three annotatorsCharacteristics of Annotators

Annotators must be domain expertsAnnotators should be expert and / or native Speaker in the language in which text is writtenAnnotators should be of good qualification

Page 174: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 174

Example – Data Annotation for Text Reuse Detection (Cont.)

Generally, annotations are carried out in three steps

Step 1

Step 2

Step 3

Annotators A and B annotate a subset of the datasetDiscuss conflicting and agreed text pairs and refinethe annotation guidelines to reduce the conflicts

Annotators A and B annotate the remainingdataset based on revised annotation guidelines

Annotator C annotates conflictingdocuments

Page 175: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 175

AnnotationsSeparately Give Text Pairs to Annotators A & B

Annotations by Annotator ANews Agency

Article Newspaper Story Annotator A

A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially DerivedA dog bites a man A person is badly injured while running at road

followed by a hound, Independently writtenNon - Derived

I like your car Your car is nice Non - DerivedThis is my country I live in Lahore, Pakistan Partially DerivedI like your car I like your car Wholly Derived

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Page 176: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 176

AnnotationsSeparately Give Text Pairs to Annotators A & B

Annotations by Annotator BNews Agency

Article Newspaper Story Annotator A

A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially DerivedA dog bites a man A person is badly injured while running at road

followed by a hound, Independently writtenNon - Derived

I like your car Your car is nice Partially DerivedThis is my country I live in Lahore, Pakistan Non - DerivedI like your car I like your car Wholly Derived

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Page 177: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 177

Inter-Annotator AgreementInter-Annotator Agreement (IAA) is a measure of howwell two (or more) annotators can make the sameannotation decision for a certain category

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Inter-Annotator Agreement

𝑰𝑰𝑰𝑰𝑰𝑰 =𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄 𝒄𝒄𝒕𝒕𝒕𝒕 𝒄𝒄𝒄𝒄𝒏𝒏𝒏𝒏𝒕𝒕𝒏𝒏 𝒄𝒄𝒐𝒐 𝒏𝒏𝒓𝒓𝒄𝒄𝒓𝒓𝒄𝒄𝒓𝒓𝒓𝒓 𝒓𝒓𝒄𝒄 𝒓𝒓𝒓𝒓𝒏𝒏𝒕𝒕𝒕𝒕𝒏𝒏𝒕𝒕𝒄𝒄𝒄𝒄

𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄 𝒄𝒄𝒕𝒕𝒕𝒕 𝒄𝒄𝒄𝒄𝒄𝒄𝒓𝒓𝒕𝒕 𝒄𝒄𝒄𝒄𝒏𝒏𝒏𝒏𝒕𝒕𝒏𝒏 𝒄𝒄𝒐𝒐 𝒏𝒏𝒓𝒓𝒄𝒄𝒓𝒓𝒄𝒄𝒓𝒓𝒓𝒓

01

Page 178: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 178

News Agency Article Newspaper Story Annotator

A Annotator BA dog bites a man A dog bites a person Wholly Derived Wholly DerivedA dog bites a man A person was badly bitten by a

dogPartiallyDerived

Partially Derived

A dog bites a man A person is badly injured whilerunning at road followed by ahound, Independently written

Non - Derived Non - Derived

I like your car Your car is nice Non - Derived Partially DerivedThis is my country I live in Lahore, Pakistan Partially

DerivedNon - Derived

I like your car I like your car Wholly Derived Wholly Derived

Example – Data Annotation for Text Reuse Detection (Cont.)

02 IAA is computed to drive two thingsHow easy was it to clearly delineate the category?How trustworthy is the annotation?

Page 179: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 179

Inter-Annotator Agreement =𝟒𝟒𝟔𝟔

= 0.667

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Combine Annotations of A & B to ComputeInter Annotator Agreement (IAA)

Page 180: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 180

Conflict ResolutionGive only conflicted pairs to Annotator C

News Agency Article

Newspaper Story

Annotator A

Annotator B

AnnotatorC

I like your car Your car is nice Partially Derived Non Derived Partially Derived

This is my country I live in Lahore, Pakistan Non Derived Partially

Derived Non Derived

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Page 181: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 181

Final Gold Standard Benchmark Corpus

News Agency Article Newspaper Story Label

A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially Derived

A dog bites a man A person is badly injured while running at roadfollowed by a hound, Independently written Non Derived

I like your car Your car is nice Non Derived

This is my country I live in Lahore, Pakistan Partially Derived

I like your car I like your car Wholly Derived

Example – Data Annotation for Text Reuse Detection (Cont.)

Annotation Process

Page 182: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 182

Example – Corpus Standardization

Two Main Formats to Standardize Corpus

XMLCSV

Page 183: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 183

Corpus Standardization in CSV Format

Example – Corpus Standardization

Page 184: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 184

Corpus Standardization in XML Format

Example – Corpus Standardization

Page 185: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 185

METER Corpus

Paper Title: The METER corpus: A corpus for analyzing journalistic text reuse

SAC

Short Answer Corpus.Paper Title: Developing a corpus of plagiarized short answers

PAN Corpora

Can be downloaded from the PAN website: pan.webis.de

CLEU

Cross lingual English Urdu Corpus.Paper Title: Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Benchmark Corpora for Text Reuse and Plagiarism Detection

USTRC

Paper Title: USTRC: Urdu Short Text Reuse Corpus

Corpus

Page 186: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 186

Your Turn

Take a toy dataset of 12 examples and annotatethem with four levels of rewrite by followingthe steps discussed in the lecture

Near CopyLight RevisionHeavy RevisionNon-Plagiarized

Page 187: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 187

Data Annotation

Summary - Data Annotation for Text Reuse Detection

is the process of labeling data to make it usable for machine learning?

is performed by domain experts (humans – a.k.a. annotators / taggers / raters)

requires a lot of effort, time and cost

Page 188: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 188

01 Raw Data Collection

02 Simulated / Manual Cases of Plagiarism

03 Corpus Standardization

Summary - Data Annotation for Text Reuse Detection

Data Source(s)Cleaning of DataPre-processing of Data

Preparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement

Main Steps to Create Benchmark Annotated Dataset

Page 189: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 189

It's Poetry Time

Page 190: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 190

�ا� د�ا� � دف �ى � �� � 

�ى � دف �ت او�د � 

او�د � دف �� � 

اور �� � دف ��ى � �� � �ق ا� ��

� � � �ى � � �� --دف �ر�  

Importance of Poetry

Page 191: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 191

�ل ر� � در�ں � �م � � ��ں � 

وہ � � �ف �م � � � � �د آج � 

ا� وہ �� �ے وہ � �� � دن � � ر

م �وہ �� �� � ��وں � �ں � �

�� � � � وہ��دل � �وں �ب 

��م��ٴ وہ � �اب�آرزوؤں وہ

�ل

Page 192: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 192

�ؤ �ن�ان � � �ؤ � � ��ے

��م � ان � �ں وہ � �ں � �� �

� �ؤں � �� � � � اب � � � ا� 

� �روں �� � �ے � � �رت �م �

� � � �� � ا��را� � ر�ں �

�ز�ں � �م ل� � � ���ے� � 

� ر�ى

…�ل

Page 193: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Methods for Text Reuse and Plagiarism Detection

Page 194: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 194

Methods for Text Reuse and Plagiarism Detection

A text pair (Text 1 and Text 2)

Given

Goal

Quantify the degree of similaritybetween text pair

Note – High similarity score indicates that Text 2was created using Text 1

Page 195: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 195

Example – N-gram Overlap Approach for Text Reuse and Plagiarism Detection

An n-gram is a contiguous sequence of

n items from a given sample of textDefinition

N-gram can be Word basedCharacter based

N represents the length of N-gram

Page 196: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 196

Example – N-gram Generation from Input TextInput Text

R u coming?Word Uni-grams (N = 1)

12

Rucoming3

4 ?

Tokenized Text

R u coming ?Set of Word Uni-grams

Page 197: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 197

Example – N-gram Generation from Input TextInput Text

R u coming?Word Bi-grams (N = 2)

12

Rucoming3

4 ?

Tokenized Text

Set of Word Bi-grams R u U coming Coming ?

Page 198: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 198

Example – N-gram Generation from Input TextInput Text

R u coming?Word Tri-grams (N = 3)

12

Rucoming3

4 ?

Tokenized Text

Set of Word Tri-grams R u coming u coming ?

Page 199: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 199

Example – N-gram Generation from Input TextInput Text

R u coming?Character Tri-grams (N = 3)

R

R U C O M I N ing ?

Tokenized Text:Note that space is also a character

r u u co com omi min ing gn?Set of character Tri-grams

r u coming?

Page 200: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 200

A range of measures have been proposed including

Similarity Measures to Compute N-gram Overlap

Dice Similarity Co-efficient

Containment Similarity Co-efficient

Overlap Similarity Co-efficient

Jaccard Similarity Co-efficient

Page 201: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 201

Steps – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

A Text Pair (Text 1 and Text 2)Given

Steps - Commuting Similarity

Pre-process Input TextLower casePunctuation Marks Removal

Convert Text 1 and Text 2 into Sets of N-grams

Page 202: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 202

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Similarity Co-efficient i.e. Quantify the Degree of Similarity

Where S (Text 1, n) and S (Text 1, n) represent sets of N-grams of length n for Text 1 and Text 2 respectively

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|

Compute Similarity Between Sets of N-grams usingOverlap

Page 203: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 203

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Quantify the degree for similarity betweentext pair using N-gram Overlap Approach

Here

Goal

N-grams are word basedN = 1

A Text PairGiven

Text 1: A dog bites a man. Text 2: A hound bites a man.

Page 204: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 204

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Pre-process Input Text

Steps - Commuting Similarity

Lower casePunctuation Marks Removal

Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.

Page 205: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 205

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man

Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Unigrams (S (Text 1, 1)) = {a, dog, bites, a, man}Text 2 Word Unigrams (S (Text 2, 1)) = {a, hound, bites, a, man}

Page 206: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 206

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|

Overlap Similarity = |{a, dog, bites, a, man}| ∩|{a, hound,bites, a, man}| / min (|({a, dog, bites, a, man}|, (|{a, hound,bites, a, man}|)))

Page 207: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 207

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

= 0.80

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐚𝐚,𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛,𝐚𝐚,𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟓𝟓,𝟓𝟓)

= 𝟐𝟐𝟒𝟒

Page 208: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 208

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Quantify the degree for similarity betweentext pair using N-gram Overlap Approach

Here

Goal

N-grams are word basedN = 2

A Text PairGiven

Text 1: A dog bites a man. Text 2: A hound bites a man.

Page 209: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 209

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Pre-process Input Text

Steps - Commuting Similarity

Lower casePunctuation Marks Removal

Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.

Page 210: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 210

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man

Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Bigrams (S (Text 1, 2)) = {a dog, dog bites, bites a, a man}Text 2 Word Bigrams (S (Text 1, 2)) = {a, hound, bites, a, man}

Page 211: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 211

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|

Overlap Similarity = |{a dog, dog bites, bites a, a man}| ∩|{a hound , hound bites , bites a, a man}| / min (|{a dog,dog bites, bites a, a man}|, |{a hound , hound bites , bitesa, a man}|)))

Page 212: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 212

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

= 0.5

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛 𝐚𝐚, 𝒓𝒓𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟒𝟒,𝟒𝟒)

= 𝟐𝟐𝟒𝟒

Page 213: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 213

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Quantify the degree for similarity betweentext pair using N-gram Overlap Approach

Here

Goal

N-grams are word basedN = 3

A Text PairGiven

Text 1: A dog bites a man. Text 2: A hound bites a man.

Page 214: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 214

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Pre-process Input Text

Steps - Commuting Similarity

Lower casePunctuation Marks Removal

Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.

Page 215: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 215

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man

Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Bigrams (S (Text 1, 3)) = {a dog bites, dog bites a, bites a man}Text 2 Word Bigrams (S (Text 1, 3)) = {a hound bites, hound bites a , bites a man}

Page 216: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 216

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|

Overlap Similarity = |{a dog, dog bites, bites a, a man}| ∩|{a hound , hound bites , bites a, a man}| / min (|{a dog,dog bites, bites a, a man}|, |{a hound , hound bites , bitesa, a man}|)))

Page 217: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 217

Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient

= 0.33

Overlap Similarity

𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛 𝐚𝐚 𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟑𝟑,𝟑𝟑)

= 𝟏𝟏𝟑𝟑

Page 218: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 218

Methods for Mono-lingual Text Reuse and Plagiarism Detection

Methods based on content

Word n-grams overlapVector Space Model

Methods based on string and sequence alignment

Longest common subsequenceGreedy String-TilingGlobal AlignmentLocal Alignment

Page 219: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 219

Methods for Mono-lingual Text Reuse and Plagiarism Detection

Methods based on structure

Stop-words based n-grams overlap

Methods based on style

Type token ratioToken ratioSentence ratio

Page 220: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 220

Methods for Cross-lingual Text Reuse and Plagiarism Detection

Methods based on Syntax

Methods based on Dictionaries

Cross-Language Character N-Gram

Cross-Language Vector Space MethodCross-Language Conceptual Thesaurus based SimilarityCross-Language Knowledge Graph Analysis

Page 221: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 221

Methods for Cross-lingual Text Reuse and Plagiarism Detection

Methods based on Parallel Corpora

Methods based on Comparable Corpora

Cross-Language Alignment based Similarity AnalysisCross-Language Latent Semantic Indexing

Cross-Language Explicit Semantic Analysis

Cross-Language Kernel Canonical Correlation Analysis

Methods based on Machine Translation

Translation + Monolingual Analysis

Page 222: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 222

Methods for Cross-lingual Text Reuse and Plagiarism Detection

Methods based on Word Embeddings

Cross-Language Conceptual Thesaurus based Similarityusing Word EmbeddingsCross-Language Word Embeddings based SimilarityCross-Language Word Embedding based Syntax Similarity

Methods based on Deep Learning

Page 223: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 223

Your TurnConsidering the following example

Text 1: a dog bites a man

Using Longest Common Subsequence (LCS) Approach theLCS between two text pairs is:

A bites a man

ComputingSimilarity

Your Turnis to take at least three text pairs and compute similarityscore between then using the Longest CommonSubsequence (LCS) Approach

Similarity Score =

Similarity Score = =

Text 2: a hound bites a man

Page 224: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 224

Summary- Methods for Text Reuse and Plagiarism Detection

Methods for Mono-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based onContent, (2) Methods based on Structure and (3) Methodsbased on StyleMethods for Cross-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based on Syntax,(2) Cross-Language Character N-Grams, (3) Methods based onDictionaries, (4) Methods based on Parallel Corpora, (5)Methods based on Comparable Corpora, (6) Methods based onWord Embedding's and (7) Methods based on Deep Learning

Page 225: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 225

It's Poetry Time

Page 226: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 226

�ا� د�ا� � دف �ى � �� � 

�ى � دف �ت او�د � 

او�د � دف �� � 

اور �� � دف ��ى � �� � �ق ا� ��

� � � �ى � � �� --دف �ر�  

Importance of Poetry

Page 227: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 227

� � ر�  �  �  � �  آ�  وا�  �ا 

� � ��ار �  �  ��اداس 

 �

Page 228: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 228

Stay Motivated

Page 229: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 229

. � �ڑى �� � د�ں � � ا�ر � �� � �

.�ڑى ا� �وڑ � � اور �ول � � � � � �

Motivation is the Fuel of Life

Importance of Motivation

Page 230: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 230

• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life

• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp

• On Daily Basis, read and enjoy• Poetry• Jokes

Tips - To Stay Motivated and become a Great Researcher

Page 231: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 231

Title: Power Of Appreciation

Speaker: Dr. Rao Muhammad Adeel Nawab

Downlaod Link: Video + Slides https://drive.google.com/open?id=1XnwhNjrjHPBSv0VHYuvgS5m4A7QwNDTf

Task Summarize main points of the motivational seminar

Motivational Seminar

Page 232: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Evaluation Measures

Page 233: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 233

Evaluation Measures

P = 𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻+𝑭𝑭𝑻𝑻

Precision

Precision (P) of a text reuse detection system is the

proportion of the predicted positive cases that were

correct

Precision

Page 234: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 234

Evaluation Measures

Recall (R) of a text reuse detection system is defined

as the proportion of positive cases that were

correctly identified

Recall

R = 𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭

Recall

Page 235: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 235

Evaluation Measures

F₁ measure is a specific relationship (harmonic mean)

between precision (P) and recall (R)

F₁ measure

F₁ = 𝟐𝟐∗𝑻𝑻∗𝑹𝑹𝑻𝑻+𝑹𝑹

F₁ measure

Page 236: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 236

Summary- Evaluation Measures

Evaluation of Text Reuse / Plagiarism DetectionSystems is carried out using Precision, Recall and F₁measuresNote that in research papers (or thesis) mostly weightaverage Precision, Recall and F₁ scores are reported

Page 237: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 237

It's Poetry Time

Page 238: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 238

�ا� د�ا� � دف �ى � �� � 

�ى � دف �ت او�د � 

او�د � دف �� � 

اور �� � دف ��ى � �� � �ق ا� ��

� � � �ى � � �� --دف �ر�  

Importance of Poetry

Page 239: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 239

ر� � � دل � د�� � � آ 

آ آ � � � �ڑ � �� � �

ر� �م� �ار � �ے � �

� آ� � � �� � � � � 

� � �� � � � � �ا� 

� آ� �� � د�رہو ر� 

�ل

Page 240: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 240

� � � � �ا� ��� � �

ز�� � � آ�� � � � � 

اك � � �ں �ت �� � � �وم 

اے را� �ں � � ر�� � � آ

اب � دل �ش � � � � � ا�� 

� آ � �� � آ�ى � � ا� �از

…�ل

Page 241: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Treating the Problem of Text Reuse/ Plagiarism Detection as MachineLearning Problem – A Step by StepExample

Page 242: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 242

Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem

For Ternary Classification

For Binary Classification

DerivedNon-Derived

Wholly DerivedPartially DerivedNon-Derived

Problem

Output

Text Reuse Detection

Input A Text Pair

Page 243: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 243

Treating the Problem of Text Reuse Detection as Machine Learning Problem

How to Transform Text Reuse DetectionProblem to Supervised Text ClassificationTask?

Question?

For Supervised Text Classification TaskOutput must be associated with the input i.e. dataset must beannotated

Page 244: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 244

Steps - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Pre-process input (i.e. text pair)

Feature Extraction from Input

Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)

Use features extracted in Step 2 to train and testMachine Learning algorithms

Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task

Page 245: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 245

Experimental Setup

Ternary ClassificationBinary ClassificationDiscriminate between two classes i.e.Derived vs Non-derived

Discriminate between three classesi.e. Wholly Derived / PartiallyDerived / Non-derived

Problem of Text Reuse Detection is Treated as a SupervisedText Classification Task

Two Versions of Classification

DatasetFile containing dataset in CSV format is called data.csv15 instances (input + output)

InputText Pair

OutputLabel

Page 246: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 246

Text 1 Text 2 LabelA dog bites a man A dog bites a person Wholly Derived A dog bites a man A person was badly bitten by a dog Partially Derived

A dog bites a man A person is badly injured while running at road followed by a hound,Independently written Non Derived

I like your car Your car is nice Non DerivedThis is my country I live in Lahore, Pakistan Partially DerivedI like your car I like your car Wholly DerivedMy favorite subject is NLP My favorite subject is NLP Wholly DerivedMy favorite subject is NLP I like NLP Partially DerivedI like NLP I am studying many course but nlp an be top ranked Non Derived

Allama iqbal is our national hero it was iqbal who awoke the muslim with his poetry Non Derived

Balochistan successfully holds 3rd round of LG elections Plots in the municipal elections, the PML-N won the third stage Partially Derived

and he was once looking forward to it, he reiterated. World Cup has never refused: Shoaib Malik Partially Derived

Syed Sultan Shah termed it a national tragedy.

World Blind Cricket Council chairman Syed Sultan Shah said that Peshawar is anational tragedy. Wholly Derived

He said that increase in the tax is unconstitutional and violation of article 77.

Raza Rabbani said the Supreme Court's decision in the light of Article 77 is a violation of the decision. Wholly Derived

LHC-death LHC dismisses appeals against death sentence

5 guilty of the death penalty rejected pleas for mercy Reporter Karachi to Islamabad? President's death convicted criminals 5 rejected pleas for mercy. Non Derived

Page 247: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 247

Experimental Setup

Technique (for Feature Extraction)

N-gram OverlapN-grams are word basedN = 1 - 5

Evaluation Measure

PrecisionRecallF1

Page 248: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 248

Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem

Main Steps to Treat Text Reuse Detection as Supervised TextClassification Task

Step 1: Pre-process input (i.e. text pair)

Page 249: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 249

Example - Treating the Problem of Text Reuse Detection as Machine Learning Text 1 Text 2A dog bites a man A dog bites a personA dog bites a man A person was badly bitten by a dogA dog bites a man A person is badly injured while running at road followed by a hound,

Independently writtenI like your car Your car is niceThis is my country I live in Lahore, PakistanI like your car I like your carMy favorite subject is NLP My favorite subject is NLPMy favorite subject is NLP I like NLPI like NLP I am studying many course but nlp an be top rankedAllama iqbal is our national hero it was iqbal who awoke the muslim with his poetryBalochistan successfully holds 3rd round of LG elections Plots in the municipal elections, the PML-N won the third stage

and he was once looking forward to it, he reiterated. World Cup has never refused: Shoaib Malik

Syed Sultan Shah termed it a national tragedy. World Blind Cricket Council chairman Syed Sultan Shah said thatPeshawar is a national tragedy.

He said that increase in the tax is unconstitutional and violation of article 77.

Raza Rabbani said the Supreme Court's decision in the light of Article 77 is a violation of the decision.

LHC-death LHC dismisses appeals against death sentence 5 guilty of the death penalty rejected pleas for mercy Reporter Karachi to Islamabad? President's death convicted criminals 5 rejected pleas for mercy.

Page 250: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 250

Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem

Step 2: Feature Extraction from InputMain goal is to quantify the degree of similarity between textpairs (input) i.e. convert text pairs into similarity scores(numeric values) so that Machine Learning algorithms canunderstand them

We applied N-gram Overlap approach to compute similarityscores and transformed data.csv file into features.csv

Note that Machine Learning algorithms can understandnumber values

Page 251: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 251

Example - Treating the Problem of Text Reuse Detection as Machine Learning

Uni-gram-Scores Bi-gram-Score Tri-gram-score Four-gram-score Five-gram-Score

0.6 0.5 0.33 0 00.6 0.25 0 0 00.4 0 0 0 00.5 0.33 0 0 00 0 0 0 01 1 1 1 01 1 1 1 1

0.33 0 0 0 00.67 0 0 0 00.17 0 0 0 00.12 0 0 0 0

0 0 0 0 00.75 0.57 0.33 0 00.57 0.31 0.08 0 00.25 0 0 0 0

Page 252: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 252

Example - Treating the Problem of Text Reuse Detection as Machine Learning

Step 3: Represent extracted features and output / label into aformat that Machine Learning algorithms can understand(normally saved as CSV file) Feature.csv (WITH LABEL)Uni-gram-Scores Bi-gram-Score Tri-gram-score Four-gram-score Five-gram-Score Label

0.6 0.5 0.33 0 0 WD0.6 0.25 0 0 0 PD 0.4 0 0 0 0 ND0.5 0.33 0 0 0 ND0 0 0 0 0 PD1 1 1 1 0 WD1 1 1 1 1 WD

0.33 0 0 0 0 PD0.67 0 0 0 0 ND0.17 0 0 0 0 ND0.12 0 0 0 0 PD

0 0 0 0 0 PD0.75 0.57 0.33 0 0 WD0.57 0.31 0.08 0 0 WD0.25 0 0 0 0 ND

Step 4: Use features.csv file to train and test Machine Learningalgorithms See Next Slides

Page 253: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 253

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKALoad features.csv

Click open

Page 254: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 254

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem (Cont.)

Ternary Classification using WEKALoad features.csv

Select file by Giving Path

Page 255: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 255

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKALoad features.csv

Click Label and see Number of Classes

Page 256: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 256

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKALoad features.csv

Click Edit and see Data View in Weka

Page 257: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 257

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKASelect J48

Page 258: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 258

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKARun J48 with Spilt ratio 70

Page 259: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 259

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKASelect Naive Bayes

Page 260: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 260

Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem

Ternary Classification using WEKARun Naive Bayes with Spilt Ratio 70

Page 261: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 261

Results for Ternary ClassificationResults are reported for weighted average Precision,Recall and F₁ scores

Machine Learning Algorithms Results

Precision Recall F₁-Measure

Naïve Bayes 0.500 0.500 0.500

J48 0.875 0.800 0.780

Page 262: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 262

Your Turn

Considering the toy dataset given in this lecture.Apply Longest Common Subsequence Approach toextract features (similarity scores between textpairs) from dataset. Convert the file into ARFF / CSVformat. Run Naïve Bayes, J48 and two otherMachine Learning algorithms from WEKA. ReportWeighted Average Precision, Recall, and F₁ scoresfor all four Machine Learning algorithms in theform of table.

Page 263: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 263

Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem

Binary Classification using WEKALoad features.csv

Click Open

Page 264: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 264

Example – (Cont.)Binary Classification using WEKA

Load features.csvSelect file by Giving Path

Page 265: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 265

Example – (Cont.)Binary Classification using WEKA

Load features.csvClick Label and see Number of Classes

Page 266: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 266

Example – (Cont.)Binary Classification using WEKA

Load features.csvClick Edit and see Data View in WEKA

Page 267: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 267

Example – (Cont.)Binary Classification using WEKA

Select J48

Page 268: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 268

Example – (Cont.)Binary Classification using WEKA

Run J48 with Spilt ratio 70

Page 269: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 269

Example – (Cont.)Binary Classification using WEKA

Select Naive Bayes

Page 270: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 270

Example – (Cont.)Binary Classification using WEKA

Run Naive Bayes with Spilt Ratio 70

Page 271: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 271

Results for Binary ClassificationResults are reported for weighted average Precision,Recall and F₁ scores

Machine Learning Algorithms Results

Precision Recall F₁-Measure

Naïve Bayes 0.833 0.500 0.500

J48 0.875 0.750 0.767

Page 272: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 272

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

To treat text reuse and plagiarism detection problem as aSupervised Text Classification task, we need to know followingmain things

For supervised text classification task, dataset mustbe annotated

Dataset

To extract features from text pairs (input)For text reuse and plagiarism detection the (featureextraction) techniques mostly aim to compute the similaritybetween text pairs i.e. features are similarity scores

Techniques(s)

Page 273: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 273

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

Mostly weighted average Precision, Recall and F₁ scores are usedto evaluate the performance of text reuse and plagiarismdetection systems

Evaluation Measures

A Machine Learning Toolkit is mainly a collection of MachineLearning algorithmsTwo popular and widely used Machine Learning Toolkits are

Machine Learning Toolkit(s)

WEKA (Java Programming Language)Scikit-Learn (Python Programming Language)

Page 274: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 274

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

Machine Learning Algorithms that will be trained / tested onfeatures (similarity scores) extracted from the datasetFor Supervised Text Classification task some of the MachineLearning Algorithms which have proven to be effective are

Machine Learning Algorithms

ML Algorithms

Naïve Bayes

Logistic Regression

Multi-Layer Perceptron

Random Forest

Support Vector Machine

AdaBoost

Page 275: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 275

Summary: Treating the Problem of Text Reuse / Plagiarism Detectionas Machine Learning Problem – A Step by Step Example (Cont.)

Pre-process input (i.e. text pair)

Feature Extraction from Input

Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)

Use features extracted in Step 2 to train and testMachine Learning algorithms

Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task

Page 276: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 276

Summary – Introduction to Text Reuse and Plagiarism

Text Reuse is the process of creating a new text (or document)using the existing one(s) and it is reported to be on rise inrecent years due to easy access to large online digitalrepositoriesGiven a text pair (Text 1 and Text 2), Text 2 is said to be"Derived" from Text 1, if it is created using text from Text 1. Onthe other hand, Text 2 is said to be "Non-Derived" from Text 1,if it is interpedently written i.e. did not borrow text from Text 1

Text Reuse

Page 277: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 277

Summary – Introduction to Text Reuse and Plagiarism

Text Reuse may occur at five levels: (1) Word level, (2) Phrasallevel, (3) Sentence level, (4) Passage / Paragraph level and (5)Document levelTwo main types of text reuse are: (1) Local Text Reuse - whenamount of text reused is detected at sentence/passage leveland (2) Global Text Reuse - when amount of text reused isdetected at document levelTwo main types of text reuse are: (1) Local Text Reuse - whenamount of text reused is detected at sentence/passage leveland (2) Global Text Reuse - when amount of text reused isdetected at document level

Page 278: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 278

Summary – Introduction to Text Reuse and Plagiarism

Plagiarism is defined as the unacknowledged reuse of text andin recent years it has been reported to be on rise.Consequently, plagiarism detection systems are routinely usedby higher educational institutions to check students work forplagiarismGiven a text pair (source text and suspicious text), suspicioustext is said to be Plagiarized from source text, if it is createdusing text from source text. On the other hand, suspicious textis said to be Non-Plagiarized from source text, if it isinterpedently written i.e. did not borrow text from source text

Plagiarism

Page 279: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 279

Summary – Introduction to Text Reuse and Plagiarism

There are three levels of Plagiarism: (1) Verbatim - the originaltext is reused as verbatim (word to word copy) or with minormodifications to create the plagiarized document, (2)Paraphrased Plagiarism - the original text is heavily altered (orparaphrased) to create the plagiarized document and (3)Plagiarism of Idea - the idea of the original text is reusedwithout dependence on the words or form of the sourceThere are three main types of Plagiarism Cases: (1) ArtificialCases of Plagiarism - are generated by using Automatic TextAltering tools to obfuscate the source text for plagiarism,

Page 280: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 280

Summary – Introduction to Text Reuse and Plagiarism

(2) Simulated / Manual Cases of Plagiarism - the original text isparaphrased by humans to create the cases of plagiarism and(3) Real Cases of Plagiarism - are those which occurred in thereal worldTwo main types of plagiarism detection are: (1) IntrinsicPlagiarism Detection - checking that the entire document (orall the passages) were written by one single author and (2)Extrinsic Plagiarism Detection - searching for the source(s) (ororiginal text(s)) that were reused to create the suspiciousdocument

Page 281: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 281

Summary – Introduction to Text Reuse and Plagiarism

Data Annotation

is the process of labeling data to make it usable for machine learning?is performed by domain experts (humans – a.k.a. annotators / taggers / raters)requires a lot of effort, time and cost

Page 282: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 282

Summary – Introduction to Text Reuse and PlagiarismMain Steps to Create Benchmark Annotated DatasetRaw Data Collection

Corpus Standardization

Data Source(s)Cleaning of DataPre-processing of Data

Annotation ProcessPreparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement

Page 283: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 283

Summary – Introduction to Text Reuse and Plagiarism

Methods for Mono-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based onContent, (2) Methods based on Structure and (3) Methodsbased on StyleMethods for Cross-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based on Syntax,(2) Cross-Language Character N-Grams, (3) Methods based onDictionaries, (4) Methods based on Parallel Corpora, (5)Methods based on Comparable Corpora, (6) Methods based onWord Embedding's and (7) Methods based on Deep Learning

Methods for Text Reuse and Plagiarism Detection

Page 284: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 284

Summary – Introduction to Text Reuse and Plagiarism

Evaluation Measures

Evaluation of Text Reuse / Plagiarism DetectionSystems is carried out using Precision, Recall and F₁measuresNote that in research papers (or thesis) mostly weightaverage Precision, Recall and F₁ scores are reported

Page 285: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 285

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

To treat text reuse and plagiarism detection problem as aSupervised Text Classification task, we need to know followingmain things

For supervised text classification task, dataset mustbe annotated

Dataset

To extract features from text pairs (input)For text reuse and plagiarism detection the (featureextraction) techniques mostly aim to compute the similaritybetween text pairs i.e. features are similarity scores

Techniques(s)

Page 286: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 286

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

Mostly weighted average Precision, Recall and F₁ scores are usedto evaluate the performance of text reuse and plagiarismdetection systems

Evaluation Measures

A Machine Learning Toolkit is mainly a collection of MachineLearning algorithmsTwo popular and widely used Machine Learning Toolkits are

Machine Learning Toolkit(s)

WEKA (Java Programming Language)Scikit-Learn (Python Programming Language)

Page 287: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 287

Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example

Machine Learning Algorithms that will be trained / tested onfeatures (similarity scores) extracted from the datasetFor Supervised Text Classification task some of the MachineLearning Algorithms which have proven to be effective are

Machine Learning Algorithms

ML Algorithms

Naïve Bayes

Logistic Regression

Multi-Layer Perceptron

Random Forest

Support Vector Machine

AdaBoost

Page 288: ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · A Template-based Approach to Reada Research Paperv1. A Template-based Approach to Design an Experiment. A Template-based

Dr. Rao Muhammad Adeel Nawab 288

Summary: Treating the Problem of Text Reuse / Plagiarism Detectionas Machine Learning Problem – A Step by Step Example (Cont.)

Pre-process input (i.e. text pair)

Feature Extraction from Input

Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)

Use features extracted in Step 2 to train and testMachine Learning algorithms

Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task