ﻢِ ﺴﺑِ ا ﻦِﻤﺣﺮﻟا ﻪﻠﻟا ِﻢِ · a template-based approach to reada...
TRANSCRIPT
ن احم حيم بسم الله الر لر
Research Methodology in I.T.
Lecture 01 –
Ms. Iqra MuneerDr. Rao Muhammad Adeel Nawab
Dr. Rao Muhammad Adeel Nawab
Authors:
Introduction to Text Reuse and Plagiarism Detection
Instructor:
Dr. Rao Muhammad Adeel Nawab 3
!اے ��ان
، � � � �، � � � � � � � �
� � � � � �، � � � � � � � �� � � � � �، � � � �
� � د� � � د� ،� � د� ،
.د� � �ے � � د� �� �
رح�ت �� �� � ��
Dr. Rao Muhammad Adeel Nawab 4
How to Work?
.�م ��
. �� �� �م ��
.اهللا � �� � ��� �� �م ��﮵تت﮴ :آ ك � ��
�ك نعبد وا �� �
تعني ا س�. � اهللا � �ى � �دت �� �: ��
اور � � � � د �� �
Dr. Rao Muhammad Adeel Nawab 5
. � � � �ر�
.د�� �ں � �� � � � �� �
�ط :د� تقمي رص �ط ٱٱلمس� مٱٱهد� ٱٱلرص �ن ٱ�نعمت �لهي ٱٱ��. � �� راه د� ان ��ں � راه � � � � ا�م �: ��
Power of Effort & Dua
Dr. Rao Muhammad Adeel Nawab 6
�هم� خر يل وا�رت يل الل�متنا ال� ما �ل
�ب�انك ال �مل لنا ا س�
�ك ٱ�نت العلمي الحكمي ن�ا
ح يل صدري رب ارش يل امري و�رس
وا�لل عقدة من لساين یفقهوا قويل
Dua – Take Help from Allah Before Starting Any Task
Dr. Rao Muhammad Adeel Nawab 7
Dr. Rao Muhammad Adeel Nawab – About Me
University of Sheffield, UKPhD
COMSATS University Islamabad, Lahore CampusAssistant Professor
NLP Group, CUI, Lahore CampusGroup Lead
Dr. Rao Muhammad Adeel Nawab 8
Publications
Research Supervisions 10 PhD Studentsunder supervision (in different countries)
40+ MPhil StudentsSupervised
36 International Publications15 Impact Factor Journal Papers
21 International Conference/Workshop Papers
Citations & H-Index258 CitationsH-Index = 10
Dr. Rao Muhammad Adeel Nawab 9
About ME: Research Profile on Google Scholar
https://scholar.google.com/citations?user=fUn6DXYAAAAJ&hl=en
Dr. Rao Muhammad Adeel Nawab 10
Dedication
This course is dedicated to mymost beloved and respectablePhD supervisor
Dr. Mark StevensonDepartment of Computer Science,University of Sheffield, UK
Dr. Rao Muhammad Adeel Nawab 11
Course Details - Instructor
Dr. Rao Muhammad Adeel Nawab
https://lahore.comsats.edu.pk/Employees/74
Instructor
Office
CUI Profile
Room no. 04, Faculty Block
Dr. Rao Muhammad Adeel Nawab 12
Course Details – Classes
Lec 01 & Lec 02
Tuesday
(17:30 – 20:30)
D-110 (D Block)
Dr. Rao Muhammad Adeel Nawab 13
Mainly get Excellence in two things
Course Focus
Become a great Human Being
Become a great Researcher
Dr. Rao Muhammad Adeel Nawab 14
Five Types of Training
PoliceEliteRangersArmyCommando
Dr. Rao Muhammad Adeel Nawab 15
Main Goal of a Course - Commando Training
Commando
Commando is a man of character and (s)he should safeguard his character
Dr. Rao Muhammad Adeel Nawab 16
Main Goal of a Course - Commando Training
Two Main Qualities of a Commando
.� � � �ر�
.د�� �ں � �� � � � �� �
100% Effort with Sincerity
ادب + ��� ا�د اور وا��
1
2
Dr. Rao Muhammad Adeel Nawab 17
Main Goal of a Course - Commando Training
Summary of Qualities in a Commando
��ى
Dr. Rao Muhammad Adeel Nawab 18
ادب
� ادب � �، � ادب � �
�� �� � � �، وہ �� � �
� � �وہ �� � �م،� �� � آ � �
Dr. Rao Muhammad Adeel Nawab 19
ا� � �
� � �� �ا� � �� را�
Dr. Rao Muhammad Adeel Nawab 20
ا� � �
� �وم � وہ ادب�
� �وم ��ا� � ه�ل ا�� رو� (
ض اهللا عن ي
ض)ر�
Dr. Rao Muhammad Adeel Nawab 21
Course Aims
To introduce essential concepts required to become agreat human being and a great researcher
To develop skills to systematically teach any concept
To develop skills to carry out research using a template-based approach
To develop internet searching skills, both general andresearch specific
Dr. Rao Muhammad Adeel Nawab 22
Course Aims
To develop skills to systematically search, read andsummarize a research paper / thesis
To develop skills to systematically make a template-basedoutline of a research paper / thesis and then write it
To develop skills to systematically design an experiment
To develop skills to carry out research in such a way thatstudents who carry out research become a Commando in life
Dr. Rao Muhammad Adeel Nawab 23
Course Learning Outcomes (CLOs)
Understand how to systematically learn any concept
Understand how to search internet (both generic andresearch specific) to satisfy their information needs
By the end of this course, the students should be able to
Understand what daily tasks are important tohave a balanced personality
Dr. Rao Muhammad Adeel Nawab 24
Course Learning Outcomes (CLOs)
Read, write and design an experiment for a researchpaper / thesis using a template-based approach
Tell a coherent and connected story in a researchpaper / thesis
Carry out research in such a way that it enhancesthe self-learning abilities of students and createability in them to cope up with the challenges of life
Dr. Rao Muhammad Adeel Nawab 25
Useful Material
A Template-based Approach to Read a Research PaperDownload Link: https://ilmoirfan.com/a-template-based-approach-to-read-a-research-paper/
A Template-based Approach to Write a Research PaperDownload Link: https://ilmoirfan.com/a-template-based-approach-to-write-a-research-paper/
A Template-based Approach to Design an ExperimentDownload Link: https://ilmoirfan.com/a-template-based-approach-to-design-an-experiment/
Dr. Rao Muhammad Adeel Nawab 26
Useful Material Cont…
A Step by Step Approach to Learn Latex for ScientificWritingDownload Link: https://ilmoirfan.com/authors-page/latex/
A Template-based Approach to Write an EmailDownload Link: https://ilmoirfan.com/authors-page/a-template-based-approach-to-write-an-email/
Dr. Rao Muhammad Adeel Nawab 27
Little Efforts Daily Will Make You the GreatestTo systematically learn and get excellence in any concept /subject
Importance of completing tasks on daily basis���یں کام روز کا روز
ا ا س�تت ا جج ھاتی � ں ��یں �ن ک دن ��ی ا اتی ھاتن � کا ے �ن ں،اک ���ی ��ی
ں �ن ک دن ��ی کام اتی کا ے �ن ک ���ی ی اتی �ے �ہ �ی اا و س�تت �ہ
گا ے
ں آ� ��یس �ن ا اب وا�پ �تی � د�� �ے جپ �� زتن و دن آپ �ج
ے ا �ہ وس�تت ی �ہ کام آج �ہ کا آج
ر د� و�ہ ا �ج تن و ا�پتے � ما �ہ دان �ج ں، آج �تی ��ی
ا �ن تت کا �پ ے وا�ے دن ن
ں، آ� ��یا �ن ا وہ آتن �تی ��ر و ھاؤ�ج
Dr. Rao Muhammad Adeel Nawab 28
Instructions – To Do Tasks on Daily Basis
In order to facilitate the Course Instructor to monitor
your work progress on daily basis, every student must
follow the following steps:
Dr. Rao Muhammad Adeel Nawab 29
Instructions – To Do Tasks on Daily Basis (Cont.)
Name of Folder should be: Name – Registration Number
Create a folder on your Google Drive and share it’slink with me on [email protected]
Create a separate sub-folder for each lecture / task
For example: Muhammad Adeel – SP20-RCS-007
Put your files in appropriate sub-folders
For example: Lecture 01 – Introduction to Text Reuseand Plagiarism
Dr. Rao Muhammad Adeel Nawab 30
Instructions – To Do Tasks on Daily Basis (Cont.)
Update your files in Google Drive sub-folders on dailybasis
Very Important and Mandatory
The tasks mentioned in Your Turn slide(s)must be completed before the next lecture
Dr. Rao Muhammad Adeel Nawab 31
Course Outline – Key TopicsIntroduction to Text Reuse and PlagiarismLearn How to LearnLearning is a Searching ProblemSearching Offline and Online Sources of Knowledge and SkillsA Template-based Approach to Analyze, Summarize andDocument Search Results v1A Template-based Approach to Read a Research Paper v1A Template-based Approach to Design an ExperimentA Template-based Approach to Write a Research Paper
A Template-based Approach to Write a Research Thesis
A Template-based Approach to Write a Research Thesis Proposal
Dr. Rao Muhammad Adeel Nawab 32
Lecture Outline1 Basics of Text Reuse
2 Basics of Plagiarism3 Data Annotation for Text Reuse Detection4 Methods for Text Reuse and Plagiarism Detection5 Evaluation Measures6 Treating the Problem of Text Reuse / Plagiarism
Detection as Machine Learning Problem – A Step by Step Example
Basics of Text Reuse
Dr. Rao Muhammad Adeel Nawab 34
Text Reuse
The text which is used to create new textOriginal Text (or Source Text)
The text created by reusing the original text(s)Derived text (or Reused Text)
DefinitionThe process of creating a new text (or document) using
the existing one(s)
Dr. Rao Muhammad Adeel Nawab 35
Text Reuse
1
2
Idea
Image
ConceptReuse can also be of
3
4
5
Movies
Features etc.
Dr. Rao Muhammad Adeel Nawab 36
Text Reuse - Acceptable vs Non-Acceptable
1 Journalism
2 Plagiarism
Text reuse is a common practiceNewspapers use text(s) provided by News Agenciesto write newspaper articles
Unacknowledged text reuse is not acceptable
Dr. Rao Muhammad Adeel Nawab 37
Text Reuse in Journalism
1 News Agency
2 Text Reuse in JournalismNewspapers use articles provided by News Agenciesto write newspaper stories (or news articles)Text reuse is a common and legitimate practice in thedomain of journalism
An organization that collects news items anddistributes them to newspapers and broadcasters
Dr. Rao Muhammad Adeel Nawab 38
Two Levels of Rewrite in Journalism
Derived vs Non-Derived
The Newspaper story was created borrowing thetext(s) from News Agencies
Derived
The Newspaper story is written independentlyand doesn't borrow any text from News Agencies
Non-Derived
Dr. Rao Muhammad Adeel Nawab 39
Three Levels of Re-write in Journalism
Derived Category can be further divided into
News Agency text is the only source for the reusedNewspaper text, which means it is a verbatim (orexact) copy of the News Agency textIn this case, most of the reused text is word-to wordcopy of the source text
The Newspaper text has been either derived frommore than one News Agency or most of the text isparaphrased by the editor when rewriting fromNews Agency text source
Dr. Rao Muhammad Adeel Nawab 40
Three Levels of Re-write in Journalism (Cont.)
The News Agency text has not been used in theproduction of the Newspaper text (though wordsmay still co-occur in both documents), it hascompletely different facts and figures or is heavilyparaphrased from the News Agency’s copy
Dr. Rao Muhammad Adeel Nawab 41
Text Reuse – Importance
Large digital repositories are readily available, making iteasier to text reuse and hard to detect it
Powerful text editors are making it easier to rewrite /modify text
Automatic text altering tools are making it easier toquickly modify text for reuse
Freely available Machine Translation systems are helpingpeople to easily even reuse text written in language thatthey don’t know
Dr. Rao Muhammad Adeel Nawab 42
Detecting unacknowledged reuse of text particularly inacademia
For example, removing duplicate or near-duplicatedocuments from the set of documents returned by a SearchEngine (or Information Retrieval System) against a userquery
Text Reuse - Applications
1 Plagiarism Detection
2 Duplicate (or Near-duplicate) Document Detection
3 Copyright infringement detection
Dr. Rao Muhammad Adeel Nawab 43
Text Reuse Detection - Task
A text pair, Text 1 and Text 2 (input)
Given
FindHow much text has been reused fromOriginal (Text 1) to create Text 2(output) i.e. goal is to identify thelevel of text reuse
Dr. Rao Muhammad Adeel Nawab 44
Text Reuse – Input and Output
For two levels of text reuse
For three levels of text reuse
DerivedNon-Derived
Wholly DerivedPartially DerivedNon-Derived
Input
Output
Text Pair (Text 1 and Text 2)
Goal Identify the level of text reuse
Dr. Rao Muhammad Adeel Nawab 45
Text Reuse - Granularity
Word Level1
Phrasal Level2
Sentence Level3
Passage / Paragraph Level4
Document Level5
Text reuse may occur at five levels
Dr. Rao Muhammad Adeel Nawab 46
Output
Input
Example 01 – Text Reuse at Word Level
Text 1Meal
Text 2Food
Derived
Dr. Rao Muhammad Adeel Nawab 47
Output
Input
Example 02 – Text Reuse at Word Level
Text 1Meal
Text 2Butter
Non-Derived
Dr. Rao Muhammad Adeel Nawab 48
Output
Input
Example 03 – Text Reuse at Word Level
Text 1Like
Text 2Nice
Derived
Dr. Rao Muhammad Adeel Nawab 49
Output
Input
Example 04 – Text Reuse at Word Level
Text 1People
Text 2Audience
Derived
Dr. Rao Muhammad Adeel Nawab 50
Output
Input
Example 05 – Text Reuse at Word Level
Text 1Dinner
Text 2Gathering
Non-Derived
Dr. Rao Muhammad Adeel Nawab 51
Output
Input
Example 06 – Text Reuse at Word Level
Text 1Dinner
Text 2Food
Derived
Dr. Rao Muhammad Adeel Nawab 52
Example 01 – Text Reuse at Phrasal Level
Output
Input
Example 01
Text 1A story as old as time
Text 2A tale as old as time
Derived
Dr. Rao Muhammad Adeel Nawab 53
Example 02 – Text Reuse at Phrasal Level
Output
Input
Example 02
Text 1A story as old as time
Text 2This is the story of old times
Non-Derived
Dr. Rao Muhammad Adeel Nawab 54
Example 03 – Text Reuse at Phrasal Level
Output
Input
Example 03
Text 1Reading a book
Text 2Reading an article
Non-Derived
Dr. Rao Muhammad Adeel Nawab 55
Example 04 – Text Reuse at Phrasal Level
Output
Input
Example 04
Text 1Ambling in the rain
Text 2Walking in the rain
Derived
Dr. Rao Muhammad Adeel Nawab 56
Example 05 – Text Reuse at Phrasal Level
Output
Input
Example 05
Text 1The love of my life
Text 2The crush of my life
Derived
Dr. Rao Muhammad Adeel Nawab 57
Example 06 – Text Reuse at Phrasal Level
Output
Input
Example 06
Text 1The love of my life
Text 2The journey of love isvery long
Non-Derived
Dr. Rao Muhammad Adeel Nawab 58
Example 01 – Text Reuse at Sentence Level
DerivedOutput
Input
Text 1What is your age?
Text 2How old are you?
Dr. Rao Muhammad Adeel Nawab 59
Example 02 – Text Reuse at Sentence Level
Non-DerivedOutput
Input
Text 1Will it snow tomorrow?
Text 2The weather prediction is quitestorming, what do you think about snowin the upcoming days?
Dr. Rao Muhammad Adeel Nawab 60
Example 03 – Text Reuse at Sentence Level
DerivedOutput
Input
Text 1Your car is nice
Text 2I like your car
Dr. Rao Muhammad Adeel Nawab 61
Example 04 – Text Reuse at Sentence Level
DerivedOutput
Input
Text 1He said that sit-ins have caused a hugeloss to national economy and thenation is depressed
Text 2Prime minister said, “Sit-ins have causeda huge loss to national economy and thenation is depressed”
Dr. Rao Muhammad Adeel Nawab 62
Example 05 – Text Reuse at Sentence Level
Non-DerivedOutput
Input
Text 1Balochistan successfully holds 3rd roundof LG elections
Text 2Plots in the municipal elections, thePML-N won the third stage
Dr. Rao Muhammad Adeel Nawab 63
Example 06 – Text Reuse at Sentence Level
DerivedOutput
Input
Text 1His body was handed over to the heirsafter legal formalities
Text 2The body was handed over to the heirs
Dr. Rao Muhammad Adeel Nawab 64
Example 07 – Text Reuse at Sentence Level
Non-DerivedOutput
Input
Text 1NAB Chairman determined to root outcorruption from society
Text 2Honest people will send representativesin Parliament will help eliminatecorruption: NAB
Dr. Rao Muhammad Adeel Nawab 65
Example 01 - Text Reuse at Passage Level
Cognizant of the need to accordgreater attention towardsprotection of the vulnerable andmarginalized segments of society,the government is committed tomake every possible effort to put inplace effective legal, economic andsocial frameworks so as to ensureprotection of human rights," he saidin his message on the occasion ofInternational Human Rights Day(December 10)
Input (Text 1)On the occasion of Human RightsDay, the Constitution of Pakistan,he said in his message to thecitizens based on race, color orrace, regardless of the guarantees.To ensure the protection of humanrights as possible to provideeffective legal, economic and socialframework is determined. He is thecelebration of Human Rights Day,on every level, promotion andprotection of human rights is oursolid commitment
Input (Text 2)
Derived
Output
Dr. Rao Muhammad Adeel Nawab 66
Example 02 - Text Reuse at Passage Level
Addressing a ceremonyheld in honour of thewinners of Nobel PeacePrize-2014 and televisedfrom Oslo, Norway, sheexpressed the resolve tocontinue her struggle forbringing all girls and boysin the education net and tofight for their rights
Input (Text 1)Pakistan and India live inpeace, I am sure that we stopthe progress. But if we do notsucceed one another so thatno country would be able tomove ahead, both the issuesof poverty, lack of educationof children and women aredenied basic rights, and wehave these problemstogether to solve
Input (Text 2)
Non-Derived
Output
Dr. Rao Muhammad Adeel Nawab 67
Example 03 - Text Reuse at Passage Level
He said he had held very goodmeetings and talks with IranianMinister of Economy and Finance AliTayebnia during his visit to Pakistan.He said that there were agreementswith Chinese CNPC oil companyrecently for laying 700 kms gaspipeline from Gwadar port, 70 kmsfrom Iran border, and Nawabshah. Hesaid Chinese company envoy arrivedon Tuesday to arrange preliminariesfor the project, announcing that thecompany will complete the 700kmpipeline in 24 months
Input (Text 1)LNG terminal at the port in the firstphase while the second phase will beheld from Gwadar to Nawabshah 700kilometers of 42-inch diameterpipeline will be laid. He said thatbecause of international sanctionsimposed on Iran Pakistan to fulfill itspart of the project completed Despitethe government's efforts could relateto this project Bank, InternationalContractors and Equipment Suppliersare not willing to work. Is expectedto work on the project would bestarted soon
Input (Text 2)
Derived
Output
Dr. Rao Muhammad Adeel Nawab 68
Example 04 - Text Reuse at Passage Level
We are contacting the PTIleadership for starting thedialogue," he added. Welcomingthe PTI's decision to resumetalks, Dar said there was noimpediment in this regard as PTIchief Imran Khan had backedout from the unconstitutionaland illegal demand for theresignation of Prime Minister orfor going on a one-month leave
Input (Text 1)'Movement and the government ofrigging the kumbynh judicial commissioninvestigating judge who will not berecognized, ISI or IB representatives ofthe Commission only the Commission canadd your own, we would not demandanything, "Imran Mohib The evidence ofPakistani and sit-ins and rallies to protestthe talks ended. If they do not agree thenthe negotiations' unconstitutionaldemand the resignation of the Minister ofJustice to withdraw any unconstitutionalis welcome, we will not discuss the talksand hope that justice will not demand anyunconstitutional
Input (Text 2)
Non-Derived
Output
Dr. Rao Muhammad Adeel Nawab 69
Example 05 - Text Reuse at Passage Level
The participants, Malala said she wasproud to be the youngest-ever NobelPeace Prize recipient. She thanked herparents for providing all kind ofsupport in getting education, saying "Ithank them for not clipping my wingsand letting me fly." Stressing the needto make joint efforts for impartingquality education to girls and boyswithout fear and discrimination,Malala vowed to work for protectionof children's rights not only inPakistan but across the world withmore vigour and dedication
Input (Text 1)After completing his studies at the primeminister's wish, to promote educationseriously consider a global leader. Theinternational community to pay specialattention to education. Why can not givearms to provide easy and scripture. Whypowerful countries are weak in peace. Weare living in contemporary art, nothing isimpossible, and I thank all my fans. I amgrateful to my teachers and parents whogave me a chance to excel and education.This is a very happy day for me, I'm theyoungest Pakistani and Pashtun girl whowon it. Kailash Satyarthi champions arefighting for the rights of children. I amglad that we can work together
Input (Text 2)
Derived
Output
Dr. Rao Muhammad Adeel Nawab 70
Example 06 - Text Reuse at Passage Level
According to senior policeofficials, the reason behind themurder has not been ascertainedyet. Several police teams led bysenior police officials wereconducting raids at various placesfor the arrest of killers.Meanwhile, MQM has lodgedprotest in Sialkot city against themurder and blocked traffic onvarious roads of Sialkot. They weredemanding early arrest of thekillers
Input (Text 1)Praltaf Anwar Hussain said theBOA was the senior partner andsincere testimony of theirorganization is a senior fellowlost. The tortured bodies of ourworkers have poured into thestreets of the patient isrecommended. Altaf Hussain haswarned that if not stopped killingour workers in the province ofPunjab, including the primeminister will not enter into anyMinister Sindh
Input (Text 2)
Non-Derived
Output
Dr. Rao Muhammad Adeel Nawab 71
Example 01 – Text Reuse at Document Level
Input (Document 1)
She expressed the resolve to continue her struggle forbringing all girls and boys in the education net and to fightfor their rights. She also recalled her struggle for gettingeducation in Swat and Taliban's terror in the valley, whothreatened girls to stop getting education
Dr. Rao Muhammad Adeel Nawab 72
Example 01 – Text Reuse at Document Level (Cont.)Input (Document 2)
But, she decided to stand up against them and succeeded, she added."Terrorists failed in their nefarious designs," Malala said adding that wasnot alone as she was the voice of 66 million girls. Taliban, she said, blew upschools in Swat with bombs and rockets and they misused the name of Islamwhich was a religion peace, tolerance, brotherhood and humanity, urgingthe followers to get knowledge, education and discover new Praltaf AnwarHussain said the BOA was the senior partner and sincere testimony of theirorganization is a senior fellow lost. The tortured bodies of our workers havepoured into the streets of the patient is recommended. Altaf Hussain haswarned that if not stopped killing our workers in the province of Punjab,including the prime minister will not enter into any Minister Sindh.
Dr. Rao Muhammad Adeel Nawab 73
Example 01 – Text Reuse at Document Level (Cont.)
DerivedOutput
Dr. Rao Muhammad Adeel Nawab 74
Example 02 – Text Reuse at Document Level
Input (Document 1)
Chairman Norwegian Nobel Peace Committee ThirdbornJagland awarded the winners with gold medals and prizes in awidely televised-ceremony from Oslo, Norway. He highlightedefforts of Malala and Kailash for protecting children's rightsand bringing all girls and boys in the education net. He saidMalala faced Taliban in Swat, who were threatening to keep heraway from education and even made an attempt on her life.She, however exhibited great courage and continued studies,besides advocating for girls' education
Dr. Rao Muhammad Adeel Nawab 75
Example 02 – Text Reuse at Document Level (Cont.)Input (Document 2)
It is time that education should take place, then do not raise any action against education. I wantpeace in every corner of the world, education is a key component of basic life henna on theirhands, the formula used to calculate. I want that women be given equal rights, the award is forfrightened children who want peace. Our Prophet Mohammad is the messenger of peace, Idecided to speak out against the Taliban, hundreds of schools were destroyed by militants inSwat, once a tourist paradise of Swat was killed by terrorists. Girls' education was stopped inSwat, militants tried to stop us, me and my friends were attacked, our voice has been comparedto the Taliban, the Taliban's ideology not only won their shots prevail so, this story is not just meso many other girls, deprived of education stand to hear children's voices, this time will not beafraid and do virtually anything. Swat was always eager to learn and inventions. It is time thateducation should take place, then do not raise any action against education. I want peace in everycorner of the world, education is a key component of basic life henna on their hands, the formulaused to calculate. I want that women be given equal rights, the award is for frightened childrenwho want peace. Our Prophet Mohammad is the messenger of peace, I decided to speak outagainst the Taliban, hundreds of schools were destroyed by militants in Swat, once a touristparadise of Swat was killed by terrorists
Dr. Rao Muhammad Adeel Nawab 76
Example 02 – Text Reuse at Document Level (Cont.)
Non-DerivedOutput
Dr. Rao Muhammad Adeel Nawab 77
Example 03 – Text Reuse at Document Level
Input (Document 1)
Around 500 family members of victims of Indian staterepression along with human rights activists and membersof Dal Khalsa during a march in Amritsar said that humanrights abuses committed in Punjab and Kashmir were notrandom but were carried out as a matter of Indian statepolicy. They said that they would approach Barack Obamaduring his forthcoming visit to India on January 26, nextyear,KMS reported
Dr. Rao Muhammad Adeel Nawab 78
Example 03 – Text Reuse at Document Level
Input (Document 2)
The Indian state of Kashmir my dyasrus about five hundredfamilies of victims of terrorism on human rights activistsand members of Dal Khalsa accompanied the rally inAmritsar Punjab and Kashmir protesters said regularhuman rights violations Indian state policy being.
Dr. Rao Muhammad Adeel Nawab 79
Example 03 – Text Reuse at Document Level (Cont.)
DerivedOutput
Dr. Rao Muhammad Adeel Nawab 80
Example 04 – Text Reuse at Document Level
Input (Document 1)
In the decision it was stated that the counsel for thepetitioner was confronted with the maintainability of thesepetitions in the light of the restrictions contained in article225 of the Constitution on throwing a challenge to theelection results other than by way of election petition beforethe Election Tribunal and further whether the results of theentire general elections for the national and the provincialassemblies could be annulled under any provision of theConstitution or law
Dr. Rao Muhammad Adeel Nawab 81
Example 04 – Text Reuse at Document Level
Input (Document 2)
The decision that the applicants lawyer has no way to hearelection petitions, but tribunals do not give satisfactoryanswers to the question whether the context of Article 225and whether national and annulled the results of theAssembly can be given
Dr. Rao Muhammad Adeel Nawab 82
Example 04 – Text Reuse at Document Level (Cont.)
Non-DerivedOutput
Dr. Rao Muhammad Adeel Nawab 83
Text Reuse - Types
When amount of text reused is detected atsentence / passage level
When amount of text reused is detected at document level
Local Text Reuse vs Global Text Reuse
1 Local Text Reuse
2 Global Text Reuse
Dr. Rao Muhammad Adeel Nawab 84
Text Reuse - Types
VS
When both the original andthe reused text are in thesame language
When the original text isin one language and thereused text is in anotherlanguageCross-lingual text reusecan be carried out using
Mono-lingual Text Reuse Cross-lingual Text Reuse
Automatic Translation (e.g.Google Translator, BIgnTranslator etc.)Manual Translation
Dr. Rao Muhammad Adeel Nawab 85
Example - Mono-lingual Text Reuse
DerivedOutput
Input
Text 1A dog bites a man
Text 2A hound bites a person
Both source and reused texts are in the samelanguage
Dr. Rao Muhammad Adeel Nawab 86
Example - Cross-lingual Text Reuse
DerivedOutput
Input
Text 1When source and reused texts are indifferent languages
Text 2�� �ا� � ا� � �
Source: A dog bites a man
Both source and reused texts are in thedifferent languages
Dr. Rao Muhammad Adeel Nawab 87
Example - Verbatim Copy / Exact Copy (Mono-lingual Settings)
Text 2
Text 1
Example
He said that sit-ins havecaused a huge loss to nationaleconomy and the nation isdepressed
He said that sit-ins havecaused a huge loss to nationaleconomy and the nation isdepressed
Dr. Rao Muhammad Adeel Nawab 88
Example - Paraphrased Copy (Mono-lingual Settings)
Text 2
Text 1
Example
Prime Minister Nawaz Sharif recalledthat in start of December he hadannounced reduction of 2.32 rupeesper unit in prices of electricity, butthe announcement was deferredbecause of Peshawar school tragedy
Prime Minister said that the priceof electricity has been decreasedby 2 rupees 32 paisas per unit. Inthe coming days [we] are trying tomake electricity more affordable.
Dr. Rao Muhammad Adeel Nawab 89
Example - Independently Written (Mono-lingual Settings)
Text 2
Text 1
Example
Prime Minister Muhammad NawazSharif Wednesday announcedfurther reduction in the prices ofpetroleum products, which will beeffective from January 1, 2015
Prime Minister Nawaz Sharif hasannounced reduction in prices ofpetroleum products up to Rs. 14per liter. Petrol 6.25, diesel 7.86,kerosene oil 11.26 rupees per litercheaper [than before]
Dr. Rao Muhammad Adeel Nawab 90
Example - Verbatim Copy / Exact Copy (Cross-lingual Settings)
Text 2
Text 1
Example
The chief minister said he wouldpersonally monitor the programmedof repair and construction of roadsin rural areas and review the pace ofprogress on fortnightly basis
ا� � � � د� ��ں � � و وز� �ں � � ذا� �ر � �ا� �وں � �� � �و�ام � � ��ہ �ں روز � �� �اور � �رہ
Dr. Rao Muhammad Adeel Nawab 91
Example - Paraphrased Copy (Cross-lingual Settings)
Text 2
Text 1
Example
Prime Minister Nawaz Sharif recalledthat in start of December he hadannounced reduction of 2.32 rupeesper unit in prices of electricity, butthe announcement was deferredbecause of Peshawar school tragedy
32رو� 2وز�ا� � � � � � � �د. � � �� � � � � �ں � � آ�ہ
�� � �� � �ش � ر� �
Dr. Rao Muhammad Adeel Nawab 92
Example - Independently Written (Cross-lingual Settings)
Text 2
Text 1
Example
Prime Minister Muhammad NawazSharif Wednesday announcedfurther reduction in the prices ofpetroleum products, which will beeffective from January 1, 2015
� �ں وز�ا� �از�� � �و� ��ت� ا�ن �14� �ول . د�رو� � � � �
� � , 7.86ڈ�ل , 6.25 رو� � �11.26�
Dr. Rao Muhammad Adeel Nawab 93
Your Turn
Write at least 12 examples for each of thefollowing: 6 examples for mono-lingual text reuseand 6 examples for cross-lingual text reuse (2Wholly Derived Examples, 2 Partially DerivedExamples and 2 Non-Derived Examples)
Text Reuse at Word LevelText Reuse at Phrasal LevelText Reuse at Sentence LevelText Reuse at Passage / Paragraph LevelText Reuse at Document Level
Dr. Rao Muhammad Adeel Nawab 94
Summary - Basics of Text Reuse
Text Reuse is the process of creating a new text (ordocument) using the existing one(s) and it is reported tobe on rise in recent years due to easy access to largeonline digital repositories
Given a text pair (Text 1 and Text 2), Text 2 is said to beDerived from Text 1, if it is created using text from Text 1.On the other hand, Text 2 is said to be Non-Derived fromText 1, if it is interpedently written i.e. did not borrow textfrom Text 1
Dr. Rao Muhammad Adeel Nawab 95
Summary - Basics of Text Reuse (Cont.)
Text Reuse may occur at five levels:
WordLevel
PhrasalLevel
SentenceLevel
Passage /ParagraphLevel
DocumentLevel
Dr. Rao Muhammad Adeel Nawab 96
Summary - Basics of Text Reuse
Two main types of text reuse are: (1) Local Text Reuse -when amount of text reused is detected atsentence/passage level and (2) Global Text Reuse - whenamount of text reused is detected at document levelText Reuse can be: (1) Mono-lingual Text Reuse - whenboth the original and the reused text are in the samelanguage and (2) Cross-lingual Text Reuse - when theoriginal text is in one language and the reused text is inanother language
It's Poetry Time
Dr. Rao Muhammad Adeel Nawab 98
�ا� د�ا� � دف �ى � �� �
�ى � دف �ت او�د �
او�د � دف �� �
اور �� � دف ��ى � �� � �ق ا� ��
� � � �ى � � �� --دف �ر�
Importance of Poetry
Dr. Rao Muhammad Adeel Nawab 99
� وا�ں � �ں � � ذوق �ں
� � د� � � � � وا� �
�
Stay Motivated
Dr. Rao Muhammad Adeel Nawab 101
. � �ڑى �� � د�ں � � ا�ر � �� � �
.�ڑى ا� �وڑ � � اور �ول � � � � � �
Motivation is the Fuel of Life
Importance of Motivation
Dr. Rao Muhammad Adeel Nawab 102
• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life
• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp
• On Daily Basis, read and enjoy• Poetry• Jokes
Tips - To Stay Motivated and become a Great Researcher
Dr. Rao Muhammad Adeel Nawab 103
Title: Guru ki har bat gur hoti hai Gur ko nhe Guru ko pakar
Speaker: Dr. Rao Muhammad Adeel Nawab
Downlaod Link: Video + Slides https://drive.google.com/open?id=1SQ5yUroaPT5dRcY-8K7IVC5SQrQy8Lsh
Task Summarize main points of the motivational seminar
Motivational Seminar
Basics of Plagiarism
Dr. Rao Muhammad Adeel Nawab 105
Plagiarism
Plagiarism is defined as the unacknowledgedreuse of text
Plagiarism
Copying another person's work exactly andpresenting it as your own (withoutattributing it to the original author)
Formal Definition
Dr. Rao Muhammad Adeel Nawab 106
Plagiarism (Cont.)
The document suspected tocontain plagiarism
The document(s) whichwere used to create theplagiarized document
Suspicious Document Source Document(s)
Note that a suspiciousdocument may or may notcontain plagiarism
Dr. Rao Muhammad Adeel Nawab 107
Plagiarism – Importance
In recent years, plagiarism has been reported tobe on rise particularly in academia
Plagiarism detection systems are routinelyused in universities to check students workfor plagiarism
Dr. Rao Muhammad Adeel Nawab 108
Levels of Plagiarism
VerbatimThe original text is reused as verbatim(word to word copy) or with minormodifications to create the plagiarizeddocument
Dr. Rao Muhammad Adeel Nawab 109
Levels of Plagiarism (Cont.)
Paraphrased Plagiarism
The original text is heavily altered (orparaphrased) to create the plagiarizeddocumentParaphrasing can be as:
Light RevisionSource text is slightly paraphrased
Heavy RevisionSource text is heavily paraphrased
Dr. Rao Muhammad Adeel Nawab 110
Levels of Plagiarism (Cont.)
Plagiarism of Idea
The idea of the original text is reusedwithout dependence on the words orform of the source
Dr. Rao Muhammad Adeel Nawab 111
Example - Verbatim Plagiarism
Text 2
Text 1
Example
In fact, of innumerable creaturespredestined from the creation of theworld to lay up a store of wealth for theBritish farmer, and a store of quiteanother sort for an immaculateRepublican government
Here lived innumerable creaturespredestined from the creation of theworld to lay up a store of wealth for theBritish farmer, and a store of quiteanother sort for an immaculateRepublican government
Dr. Rao Muhammad Adeel Nawab 112
Example - Paraphrased Plagiarism
Text 2
Text 1
Example
The number of foreign and domestic touristsin the Netherlands rose above 42 million in2017, an increase of 9% and the sharpestgrowth rate since 2006, the nationalstatistics office CBS reported on Wednesday(DutchNews.nl, 2018)
According to the national statistics office, theNetherlands experienced dramatic growth intourist numbers in 2017. More than 42 milliontourists travelled to or within the Netherlandsthat year, representing a 9% increase – thesteepest in 12 years (DutchNews.nl, 2018)
Dr. Rao Muhammad Adeel Nawab 113
Example - Plagiarism of Idea
Text 2
Text 1
From a class perspective this put them [highwaymen] in anambivalent position. In aspiring to that proud, if temporary,status of Gentleman of the Road, they did not question theinegalitarian hierarchy of their society. Yet their boldness ofact and deed, in putting them outside the law as rebelliousfugitives, revivified the animal spirits of capitalism andbecame an essential part of the oppositional culture ofworking-class London, a serious obstacle to the formation ofa tractable, obedient labour force. Therefore, it was notenough to hang them – the values they espoused orrepresented had to be challenged.
Peter Linebaugh argues that highwaymen represented apowerful challenge to the mores of capitalist society andinspired the rebelliousness of London’s working class.
Dr. Rao Muhammad Adeel Nawab 114
Plagiarism Detection – Task
A suspicious text (input)
Given
Identify
The source(s) of plagiarism
Dr. Rao Muhammad Adeel Nawab 115
Plagiarism Detection – Input and Output
Suspicious Text
Input
Output
Plagiarized / Non-Plagiarized
Dr. Rao Muhammad Adeel Nawab 116
Plagiarism Detection - Two Levels of Rewrite
When any type of plagiarism isoccurred between documents theywere called plagiarized
Plagiarized
Non-Plagiarized
When no type of plagiarism isoccurred between documents theywere called non plagiarized
Dr. Rao Muhammad Adeel Nawab 117
Plagiarism Detection - Four Levels of Rewrite
When suspicious text is created by simply copyingand pasting text from source document(s)
1 - Near Copy
2 - Light RevisionWhen suspicious text is created by applyingsmall modification like synonyms replacementand altering grammatical structure
The Plagiarized cases can be further categorized into three categories
Dr. Rao Muhammad Adeel Nawab 118
Plagiarism Detection - Four Levels of Rewrite (Cont.)
When suspicious text is created by rephrasing thetext to generate the meaning
3 - Heavy Revision
4 - Non-Plagiarized
When suspicious text is written independently
It may include breaking source sentence into morethan one sentences, merging two or more sentencesinto one, replacing words with appropriate synonymsor phrases, changing voice, changing tense etc.
Dr. Rao Muhammad Adeel Nawab 119
Important Note
Documents that are independently writtenon the same topic are expected to have
Around 50% content overlap
Dr. Rao Muhammad Adeel Nawab 120
Example - Near Copy
Output
Input
Example
Text 1A dog bites a man
Text 2A dog bites a man
Near Copy
Dr. Rao Muhammad Adeel Nawab 121
Example - Light Revision
Output
Input
Example
Text 1A dog bites a man
Text 2A hound bites by a person
Light Revision
Dr. Rao Muhammad Adeel Nawab 122
Example - Heavy Revision
Output
Input
Example
Text 1A dog bites a man
Text 2A man was bitten by a dog
Heavy Revision
Dr. Rao Muhammad Adeel Nawab 123
Example - Non-Plagiarized
Output
Input
Example
Text 1A dog bites a man
Text 2A man was injured while runningon road followed by a dog
Non-Plagiarized
Dr. Rao Muhammad Adeel Nawab 124
Types of Plagiarism Cases
Artificial cases of plagiarism are generated by using
Automatic Text Altering tools to obfuscate the
source text for plagiarism
Artificial
There are three main types of plagiarism cases
Dr. Rao Muhammad Adeel Nawab 125
Types of Plagiarism Cases (Cont.)
None ObfuscationAutomatic TextAltering toolsimply copy andpastes text fromsource to createplagiarizeddocument
Low Obfuscation Automatic TextAltering toollightly rephrasessource textautomaticallybefore it is used tocreate plagiarizeddocument
High ObfuscationAutomatic TextAltering toolheavily rephrasessource textautomaticallybefore it is used tocreate plagiarizeddocument
Three levels of rewrite
Dr. Rao Muhammad Adeel Nawab 126
Type of Plagiarism Cases (Cont.)
Simulated / Manual
The original text is paraphrased by humans to createthe cases of plagiarism
For example, Karl-Theodor zu Guttenberg (GermanDefense Minister) PhD thesis proved plagiarizedURL:https://www.theguardian.com/world/2011/mar/01/german-defence-minister-resigns-plagiarism
Real
Real cases of plagiarism are those which occurred inthe real world
Dr. Rao Muhammad Adeel Nawab 127
Source
Example – Artificial (None Obfuscation)
The first agrarian movement after theenactment of lex Licinia took place in theyear 338, after the battle of Veseris inwhich the Latini and their allies werecompletely conquered
Suspicious
The first agrarian movement after theenactment of lex Licinia took place in theyear 338, after the battle of Veseris inwhich the Latini and their allies werecompletely conquered
Dr. Rao Muhammad Adeel Nawab 128
Source
Example – Manual (Simulated Obfuscation)The emigrants who sailed with Gilbert were betterfitted for a crusade than a colony, and, disappointedat not at once finding mines of gold and silver, manydeserted; and soon there were not enough sailors toman all the four ships
Suspicious
The people who left their countries and sailed withGilbert were more suited for fighting the crusadesthan for leading a settled life in the colonies. Theywere bitterly disappointed as it was not the Americathat they had expected. Since they did notimmediately find gold and silver mines, manydeserted. At one stage, there were not even enoughman to help sail the four ships
Dr. Rao Muhammad Adeel Nawab 129
Real
Due to copyright issues it is impossible to
have example of real case of plagiarism
Dr. Rao Muhammad Adeel Nawab 130
Types of Plagiarism Detection
Checking that the entire document (or all thepassages) were written by one single author
Intrinsic Plagiarism Detection
Searching for the source(s)(or original text(s)) that werereused to create thesuspicious document
Extrinsic Plagiarism Detection
In case of intrinsic plagiarism detection, thefocus is on identifying portion(s) of textwhose writing style significantly differs fromthe remaining text in the suspiciousdocument, which means that the entiredocument is not written by one single authorand contains text written by other author(s).
Mainly involves comparisonof the suspicious documentwith potential sourcedocuments
Two main types of plagiarism detection
Dr. Rao Muhammad Adeel Nawab 131
Intrinsic Plagiarism Detection – Task
Dr. Rao Muhammad Adeel Nawab 132
Intrinsic Plagiarism Detection – Task
Identify
Given
Task
A suspicious document (input)
Portion(s) of text whosewriting style is significantlydifferent form the remainingtext (output)
Dr. Rao Muhammad Adeel Nawab 133
Intrinsic Plagiarism Detection – Input and Output
Portion(s) of text whose writing style issignificantly different form the remaining textOutput
Input A Suspicious Text
If whose writing style is one or more portion(s) of text issignificantly different form the remaining text then thesuspicious document is plagiarized otherwise non-plagiarized
Note
Dr. Rao Muhammad Adeel Nawab 134
Example – Intrinsic Plagiarism Detection
Rasheed is my best friend. He lives in Lahore. He had got goodeducation. He earned his PhD degree from one of the mostprestigious, well reputed and renowned institution of the worldi.e. MIT, U.S.A. He is humble and nice. Rasheed always try tohelp others
Suspicious Document is Plagiarized
Given (Suspicious Document)
Output
Portion of text whose writing style is significantly different fromremaining text
He earned his PhD degree from one of the most prestigious, wellreputed and renowned institution of the world i.e. MIT, U.S.A.
Dr. Rao Muhammad Adeel Nawab 135
Extrinsic Plagiarism Detection - Task
Dr. Rao Muhammad Adeel Nawab 136
Extrinsic Plagiarism Detection – Task (Cont.)
Dr. Rao Muhammad Adeel Nawab 137
Example – Extrinsic Plagiarism Detection
Containing two suspicious documents
Suspicious Collection
Source Collection
Containing five source documents
Given
Dr. Rao Muhammad Adeel Nawab 138
Example – Extrinsic Plagiarism Detection (Cont.)
suspicious-document-01 Rashed is my best friend. He lives in Lahore. He had got good education. He earned hisPhD degree from one of the most prestigious, well reputed and renowned instructionsof the world i.e. MIT, U.S.A. He is humble and nice. Rashed always try to help others.
Given a user query as input, Information Retrieval system aims toretrieve document(s) which are relevant to the user query to satisfyuser's information need.
Documents in Suspicious Collection
Dr. Rao Muhammad Adeel Nawab 139
MachineLearning abranch of AI.It is a hotresearchtopic in theworld
MIT is one ofthe mostprestigious,well reputedandrenownedinstructionsof the worldlocated inU.S.A.
My fathername is RaoNawabAkhtar. He isnice andhumble. Healways try tohelp others
Understandingis deeper thanLove. A largenumber ofpeople loveyou but only afewunderstandyou
Example – Extrinsic Plagiarism Detection (Cont.)
Documents in Source Collection
NaturalLanguageProcessing isa branch ofAI. It hasmanypotentialapplicationsin the realworld
source-document 1
source-document 2
source-document 3
source-document 4
source-document 5
Dr. Rao Muhammad Adeel Nawab 140
Example – Extrinsic Plagiarism Detection (Cont.)
Candidate Document Retrieval
Extrinsic Plagiarism Detection - Two Step Process
AimIdentify potential source(s) of plagiarism
Potential Technique(s)Use an Information Retrieval (IR) based approach
Detailed AnalysisAim
Identify fragment(s) of source text(s) that were used tocreate the corresponding plagiarized fragments of text(s)
Possible Technique(s)Use a Pairwise Comparison Approach
Dr. Rao Muhammad Adeel Nawab 141
Example – Extrinsic Plagiarism Detection (Cont.)
Candidate Document Retrieval
GoalIdentify top K source documents from thesource collection which are potential sourcesof plagiarism
Here K = 3
Candidate Document Retrieval SystemVector Space Model
Dr. Rao Muhammad Adeel Nawab 142
Example – Extrinsic Plagiarism Detection (Cont.)
QueryTwo Separate Queries
suspicious-document-01 suspicious-document-02
Static Collection of DocumentsSource Collection (containing 5 documents)
Given
Dr. Rao Muhammad Adeel Nawab 143
Example – Extrinsic Plagiarism Detection (Cont.)
Output of Candidate Document Retrieval Systemsuspicious-document-01 - Potential Candidate Source Document(s)
source-document-02source-document-03source-document-05
suspicious-document-02 - Potential Candidate Source Document(s)source-document-01source-document-04source-document-05
Dr. Rao Muhammad Adeel Nawab 144
Example – Extrinsic Plagiarism Detection (Cont.)
Detailed Analysis
GoalIdentify fragment(s) of source text(s) thatwere used to create the correspondingplagiarized fragments of text(s)
Detailed Analysis TechniqueGreedy String Tiling
Dr. Rao Muhammad Adeel Nawab 145
Example – Extrinsic Plagiarism Detection (Cont.)
Considering suspicious-document-01Potential Candidate Source Documents
source-document-02 source-document-03 source-document-05
Considering suspicious-document-02Potential Candidate Source Documents
source-document-01 source-document-04 source-document-05
Given
Dr. Rao Muhammad Adeel Nawab 146
Example – Extrinsic Plagiarism Detection (Cont.)
Output - suspicious-document-01suspicious-document-01 is Plagiarized
Source(s) of Plagiarism
Output – Detailed Analysis
suspicious-document-02 suspicious-document-03
Dr. Rao Muhammad Adeel Nawab 147
Example – Extrinsic Plagiarism Detection (Cont.)
one of the most prestigious, well reputed andrenowned instructions of the world i.e. MIT, U.S.A.
Fragment Pair 1 - suspicious-document-01, source-document-01
MIT is one of the most prestigious, well reputed andrenowned instructions of the world located in U.S.A.
Suspicious-Source Fragment Pairs
suspicious-document-01
Source-document-02
Dr. Rao Muhammad Adeel Nawab 148
Example – Extrinsic Plagiarism Detection (Cont.)
He is humble and nice. Rasheed always try to helpothers.
Fragment Pair 2 - suspicious-document-01, source-document-03
He is nice and humble. He always tries to helpothers.
suspicious-document-01
Source-document-03
Dr. Rao Muhammad Adeel Nawab 149
Example – Extrinsic Plagiarism Detection (Cont.)
Output - suspicious-document-02
suspicious-document-01 is Non -Plagiarized
Output – Detailed Analysis
Dr. Rao Muhammad Adeel Nawab 150
Shared Tasks on Text Reuse and Plagiarism
PAN is a series of scientific events and shared
tasks on digital text forensics and stylometryPAN
https://pan.webis.de/
Dr. Rao Muhammad Adeel Nawab 151
Main Shared Tasks on Natural Language Processing
CoNLL
URL:
https://www.conll.
org/2019
SemEval
URL:
http://alt.qcri.org/s
emeval2019/
Dr. Rao Muhammad Adeel Nawab 152
Your Turn
Write at least 8 examples (at sentence level) foreach of the following levels of rewrite inplagiarism
Near CopyLight RevisionHeavy RevisionNon-Plagiarized
Dr. Rao Muhammad Adeel Nawab 153
Summary - Basics of Plagiarism
Plagiarism is defined as the unacknowledged reuse of text andin recent years it has been reported to be on rise.Consequently, plagiarism detection systems are routinely usedby higher educational institutions to check students work forplagiarismGiven a text pair (source text and suspicious text), suspicioustext is said to be Plagiarized from source text, if it is createdusing text from source text. On the other hand, suspicious textis said to be Non-Plagiarized from source text, if it isinterpedently written i.e. did not borrow text from source text
Dr. Rao Muhammad Adeel Nawab 154
Summary - Basics of Plagiarism (Cont.)
There are three levels of Plagiarism: (1) Verbatim - the originaltext is reused as verbatim (word to word copy) or with minormodifications to create the plagiarized document, (2)Paraphrased Plagiarism - the original text is heavily altered (orparaphrased) to create the plagiarized document and (3)Plagiarism of Idea - the idea of the original text is reusedwithout dependence on the words or form of the sourceThere are three main types of Plagiarism Cases: (1) ArtificialCases of Plagiarism - are generated by using Automatic TextAltering tools to obfuscate the source text for plagiarism,
Dr. Rao Muhammad Adeel Nawab 155
Summary - Basics of Plagiarism (Cont.)
(2) Simulated / Manual Cases of Plagiarism - the original text isparaphrased by humans to create the cases of plagiarism and(3) Real Cases of Plagiarism - are those which occurred in thereal worldTwo main types of plagiarism detection are: (1) IntrinsicPlagiarism Detection - checking that the entire document (orall the passages) were written by one single author and (2)Extrinsic Plagiarism Detection - searching for the source(s) (ororiginal text(s)) that were reused to create the suspiciousdocument
Dr. Rao Muhammad Adeel Nawab 156
It's Poetry Time
Dr. Rao Muhammad Adeel Nawab 157
�ا� د�ا� � دف �ى � �� �
�ى � دف �ت او�د �
او�د � دف �� �
اور �� � دف ��ى � �� � �ق ا� ��
� � � �ى � � �� --دف �ر�
Importance of Poetry
Dr. Rao Muhammad Adeel Nawab 158
� � ر�ے �� � �� �ض ئ دل آ�ي
دل� � �� دل �ى � � � �� �
�
Dr. Rao Muhammad Adeel Nawab 159
Stay Motivated
Dr. Rao Muhammad Adeel Nawab 160
. � �ڑى �� � د�ں � � ا�ر � �� � �
.�ڑى ا� �وڑ � � اور �ول � � � � � �
Motivation is the Fuel of Life
Importance of Motivation
Dr. Rao Muhammad Adeel Nawab 161
• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life
• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp
• On Daily Basis, read and enjoy• Poetry• Jokes
Tips - To Stay Motivated and become a Great Researcher
Dr. Rao Muhammad Adeel Nawab 162
Title: Kamal Kiya Hai - Hussan Ho Aur Nazakat Na Ho
Speaker: Dr. Rao Muhammad Adeel Nawab
Downlaod Link: Video + Slides https://drive.google.com/open?id=1RZOgjND7LiQSe0onFW3fo-GHjZsieTVA
Task Summarize main points of the motivational seminar
Motivational Seminar
Data Annotation for Text Reuse Detection
Dr. Rao Muhammad Adeel Nawab 164
Structured Data
Data is stored,processed, andmanipulated in atraditional RelationalDatabase ManagementSystem (RDBMS)
Unstructured Data
Data that iscommonly generatedfrom human activitiesand doesn’t fit into astructured databaseformat
Semi-structured Data
Data doesn’t fit into astructured database system,but is none-the-lessstructured by tags that areuseful for creating a formof order and hierarchy inthe data
Data vs Information
Varieties Of Data
Raw Facts and Figures
Varieties of DataPAN
Dr. Rao Muhammad Adeel Nawab 165
Data vs Information (Cont.)
Forms Of Data
Text Image Video Audio
Main Forms Of Data
Processed form of DataINFORMATION
Dr. Rao Muhammad Adeel Nawab 166
Data Annotation
a.k.a. data labeling / data tagging
Data Annotation
is the process of labeling data to make it usable for machine learning?is performed by domain experts (humans – a.k.a. annotators / taggers / raters)requires a lot of effort, time and cost
Dr. Rao Muhammad Adeel Nawab 167
Example 01 - Data AnnotationRaw Data
iPhone7 is a good mobileBattery of this phone is bad
I am using iphone7
Data Annotation – Sentiment AnalysisComment / Review Sentiment
iPhone7 is a good mobile PositiveBattery of this phone is
badNegative
I am using iphone7 Neutral
Data Annotation – Gender IdentificationComment / Review Gender
iPhone7 is a good mobile MaleBattery of this phone is
badMale
I am using iphone7 Female
Dr. Rao Muhammad Adeel Nawab 168
Main Steps to Create Benchmark Annotated DatasetRaw Data Collection
Corpus Standardization
Data Source(s)Cleaning of DataPre-processing of Data
Annotation ProcessPreparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement
Dr. Rao Muhammad Adeel Nawab 169
Raw Data Collection Two Sources of Data
News Agencies articles01Associated Press of Pakistan (APP)
Newspapers stories02
Nawa e Waqt
Example – Data Annotation for Text Reuse Detection
Independent News Agency (INP)
Express NewspaperJang Newspaper
Multiple Newspapers can reuse text from one NewsAgency to produce their news storiesNote
Dr. Rao Muhammad Adeel Nawab 170
Raw Data Collection Raw Data Statistics
Collected 4 News Agency articles and 6 Newspaperstories
Example – Data Annotation for Text Reuse Detection (Cont.)
Cleaning of Raw DataRemove HTML tags, hyperlinks, foreign languagecharacters etc.
Pre-processing of Raw DataNo pre-processing was performed
Dr. Rao Muhammad Adeel Nawab 171
Below are cleaned and pre-processed documents (4News Agency articles and 6 Newspaper stories)
News Agency Article Newspaper Story
A dog bites a man A dog bites a personA dog bites a man A person was badly bitten by a dog
A dog bites a manA person is badly injured while running at roadfollowed by a hound, Independently written
I like your car Your car is niceThis is my country I live in Lahore, PakistanI like your car I like your car
Example – Data Annotation for Text Reuse Detection (Cont.)
Dr. Rao Muhammad Adeel Nawab 172
Example – Extrinsic Plagiarism Detection (Cont.)
Annotation Guidelines
Annotation Process
Read text pair
Assign the label (from pre-defined set of labels) which maps to the most dominating label
Dr. Rao Muhammad Adeel Nawab 173
Example – Extrinsic Plagiarism Detection (Cont.)
AnnotationsStandard practice in doing annotations is to have three annotators (A, B and C)
Can have more than three annotatorsCharacteristics of Annotators
Annotators must be domain expertsAnnotators should be expert and / or native Speaker in the language in which text is writtenAnnotators should be of good qualification
Dr. Rao Muhammad Adeel Nawab 174
Example – Data Annotation for Text Reuse Detection (Cont.)
Generally, annotations are carried out in three steps
Step 1
Step 2
Step 3
Annotators A and B annotate a subset of the datasetDiscuss conflicting and agreed text pairs and refinethe annotation guidelines to reduce the conflicts
Annotators A and B annotate the remainingdataset based on revised annotation guidelines
Annotator C annotates conflictingdocuments
Dr. Rao Muhammad Adeel Nawab 175
AnnotationsSeparately Give Text Pairs to Annotators A & B
Annotations by Annotator ANews Agency
Article Newspaper Story Annotator A
A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially DerivedA dog bites a man A person is badly injured while running at road
followed by a hound, Independently writtenNon - Derived
I like your car Your car is nice Non - DerivedThis is my country I live in Lahore, Pakistan Partially DerivedI like your car I like your car Wholly Derived
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Dr. Rao Muhammad Adeel Nawab 176
AnnotationsSeparately Give Text Pairs to Annotators A & B
Annotations by Annotator BNews Agency
Article Newspaper Story Annotator A
A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially DerivedA dog bites a man A person is badly injured while running at road
followed by a hound, Independently writtenNon - Derived
I like your car Your car is nice Partially DerivedThis is my country I live in Lahore, Pakistan Non - DerivedI like your car I like your car Wholly Derived
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Dr. Rao Muhammad Adeel Nawab 177
Inter-Annotator AgreementInter-Annotator Agreement (IAA) is a measure of howwell two (or more) annotators can make the sameannotation decision for a certain category
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Inter-Annotator Agreement
𝑰𝑰𝑰𝑰𝑰𝑰 =𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄 𝒄𝒄𝒕𝒕𝒕𝒕 𝒄𝒄𝒄𝒄𝒏𝒏𝒏𝒏𝒕𝒕𝒏𝒏 𝒄𝒄𝒐𝒐 𝒏𝒏𝒓𝒓𝒄𝒄𝒓𝒓𝒄𝒄𝒓𝒓𝒓𝒓 𝒓𝒓𝒄𝒄 𝒓𝒓𝒓𝒓𝒏𝒏𝒕𝒕𝒕𝒕𝒏𝒏𝒕𝒕𝒄𝒄𝒄𝒄
𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄 𝒄𝒄𝒕𝒕𝒕𝒕 𝒄𝒄𝒄𝒄𝒄𝒄𝒓𝒓𝒕𝒕 𝒄𝒄𝒄𝒄𝒏𝒏𝒏𝒏𝒕𝒕𝒏𝒏 𝒄𝒄𝒐𝒐 𝒏𝒏𝒓𝒓𝒄𝒄𝒓𝒓𝒄𝒄𝒓𝒓𝒓𝒓
01
Dr. Rao Muhammad Adeel Nawab 178
News Agency Article Newspaper Story Annotator
A Annotator BA dog bites a man A dog bites a person Wholly Derived Wholly DerivedA dog bites a man A person was badly bitten by a
dogPartiallyDerived
Partially Derived
A dog bites a man A person is badly injured whilerunning at road followed by ahound, Independently written
Non - Derived Non - Derived
I like your car Your car is nice Non - Derived Partially DerivedThis is my country I live in Lahore, Pakistan Partially
DerivedNon - Derived
I like your car I like your car Wholly Derived Wholly Derived
Example – Data Annotation for Text Reuse Detection (Cont.)
02 IAA is computed to drive two thingsHow easy was it to clearly delineate the category?How trustworthy is the annotation?
Dr. Rao Muhammad Adeel Nawab 179
Inter-Annotator Agreement =𝟒𝟒𝟔𝟔
= 0.667
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Combine Annotations of A & B to ComputeInter Annotator Agreement (IAA)
Dr. Rao Muhammad Adeel Nawab 180
Conflict ResolutionGive only conflicted pairs to Annotator C
News Agency Article
Newspaper Story
Annotator A
Annotator B
AnnotatorC
I like your car Your car is nice Partially Derived Non Derived Partially Derived
This is my country I live in Lahore, Pakistan Non Derived Partially
Derived Non Derived
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Dr. Rao Muhammad Adeel Nawab 181
Final Gold Standard Benchmark Corpus
News Agency Article Newspaper Story Label
A dog bites a man A dog bites a person Wholly DerivedA dog bites a man A person was badly bitten by a dog Partially Derived
A dog bites a man A person is badly injured while running at roadfollowed by a hound, Independently written Non Derived
I like your car Your car is nice Non Derived
This is my country I live in Lahore, Pakistan Partially Derived
I like your car I like your car Wholly Derived
Example – Data Annotation for Text Reuse Detection (Cont.)
Annotation Process
Dr. Rao Muhammad Adeel Nawab 182
Example – Corpus Standardization
Two Main Formats to Standardize Corpus
XMLCSV
Dr. Rao Muhammad Adeel Nawab 183
Corpus Standardization in CSV Format
Example – Corpus Standardization
Dr. Rao Muhammad Adeel Nawab 184
Corpus Standardization in XML Format
Example – Corpus Standardization
Dr. Rao Muhammad Adeel Nawab 185
METER Corpus
Paper Title: The METER corpus: A corpus for analyzing journalistic text reuse
SAC
Short Answer Corpus.Paper Title: Developing a corpus of plagiarized short answers
PAN Corpora
Can be downloaded from the PAN website: pan.webis.de
CLEU
Cross lingual English Urdu Corpus.Paper Title: Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair
Benchmark Corpora for Text Reuse and Plagiarism Detection
USTRC
Paper Title: USTRC: Urdu Short Text Reuse Corpus
Corpus
Dr. Rao Muhammad Adeel Nawab 186
Your Turn
Take a toy dataset of 12 examples and annotatethem with four levels of rewrite by followingthe steps discussed in the lecture
Near CopyLight RevisionHeavy RevisionNon-Plagiarized
Dr. Rao Muhammad Adeel Nawab 187
Data Annotation
Summary - Data Annotation for Text Reuse Detection
is the process of labeling data to make it usable for machine learning?
is performed by domain experts (humans – a.k.a. annotators / taggers / raters)
requires a lot of effort, time and cost
Dr. Rao Muhammad Adeel Nawab 188
01 Raw Data Collection
02 Simulated / Manual Cases of Plagiarism
03 Corpus Standardization
Summary - Data Annotation for Text Reuse Detection
Data Source(s)Cleaning of DataPre-processing of Data
Preparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement
Main Steps to Create Benchmark Annotated Dataset
Dr. Rao Muhammad Adeel Nawab 189
It's Poetry Time
Dr. Rao Muhammad Adeel Nawab 190
�ا� د�ا� � دف �ى � �� �
�ى � دف �ت او�د �
او�د � دف �� �
اور �� � دف ��ى � �� � �ق ا� ��
� � � �ى � � �� --دف �ر�
Importance of Poetry
Dr. Rao Muhammad Adeel Nawab 191
�ل ر� � در�ں � �م � � ��ں �
وہ � � �ف �م � � � � �د آج �
ا� وہ �� �ے وہ � �� � دن � � ر
م �وہ �� �� � ��وں � �ں � �
�� � � � وہ��دل � �وں �ب
��م��ٴ وہ � �اب�آرزوؤں وہ
�ل
Dr. Rao Muhammad Adeel Nawab 192
�ؤ �ن�ان � � �ؤ � � ��ے
��م � ان � �ں وہ � �ں � �� �
� �ؤں � �� � � � اب � � � ا�
� �روں �� � �ے � � �رت �م �
� � � �� � ا��را� � ر�ں �
�ز�ں � �م ل� � � ���ے� �
� ر�ى
…�ل
Methods for Text Reuse and Plagiarism Detection
Dr. Rao Muhammad Adeel Nawab 194
Methods for Text Reuse and Plagiarism Detection
A text pair (Text 1 and Text 2)
Given
Goal
Quantify the degree of similaritybetween text pair
Note – High similarity score indicates that Text 2was created using Text 1
Dr. Rao Muhammad Adeel Nawab 195
Example – N-gram Overlap Approach for Text Reuse and Plagiarism Detection
An n-gram is a contiguous sequence of
n items from a given sample of textDefinition
N-gram can be Word basedCharacter based
N represents the length of N-gram
Dr. Rao Muhammad Adeel Nawab 196
Example – N-gram Generation from Input TextInput Text
R u coming?Word Uni-grams (N = 1)
12
Rucoming3
4 ?
Tokenized Text
R u coming ?Set of Word Uni-grams
Dr. Rao Muhammad Adeel Nawab 197
Example – N-gram Generation from Input TextInput Text
R u coming?Word Bi-grams (N = 2)
12
Rucoming3
4 ?
Tokenized Text
Set of Word Bi-grams R u U coming Coming ?
Dr. Rao Muhammad Adeel Nawab 198
Example – N-gram Generation from Input TextInput Text
R u coming?Word Tri-grams (N = 3)
12
Rucoming3
4 ?
Tokenized Text
Set of Word Tri-grams R u coming u coming ?
Dr. Rao Muhammad Adeel Nawab 199
Example – N-gram Generation from Input TextInput Text
R u coming?Character Tri-grams (N = 3)
R
R U C O M I N ing ?
Tokenized Text:Note that space is also a character
r u u co com omi min ing gn?Set of character Tri-grams
r u coming?
Dr. Rao Muhammad Adeel Nawab 200
A range of measures have been proposed including
Similarity Measures to Compute N-gram Overlap
Dice Similarity Co-efficient
Containment Similarity Co-efficient
Overlap Similarity Co-efficient
Jaccard Similarity Co-efficient
Dr. Rao Muhammad Adeel Nawab 201
Steps – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
A Text Pair (Text 1 and Text 2)Given
Steps - Commuting Similarity
Pre-process Input TextLower casePunctuation Marks Removal
Convert Text 1 and Text 2 into Sets of N-grams
Dr. Rao Muhammad Adeel Nawab 202
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Similarity Co-efficient i.e. Quantify the Degree of Similarity
Where S (Text 1, n) and S (Text 1, n) represent sets of N-grams of length n for Text 1 and Text 2 respectively
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|
Compute Similarity Between Sets of N-grams usingOverlap
Dr. Rao Muhammad Adeel Nawab 203
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Quantify the degree for similarity betweentext pair using N-gram Overlap Approach
Here
Goal
N-grams are word basedN = 1
A Text PairGiven
Text 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 204
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Pre-process Input Text
Steps - Commuting Similarity
Lower casePunctuation Marks Removal
Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 205
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man
Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Unigrams (S (Text 1, 1)) = {a, dog, bites, a, man}Text 2 Word Unigrams (S (Text 2, 1)) = {a, hound, bites, a, man}
Dr. Rao Muhammad Adeel Nawab 206
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|
Overlap Similarity = |{a, dog, bites, a, man}| ∩|{a, hound,bites, a, man}| / min (|({a, dog, bites, a, man}|, (|{a, hound,bites, a, man}|)))
Dr. Rao Muhammad Adeel Nawab 207
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
= 0.80
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐚𝐚,𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛,𝐚𝐚,𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟓𝟓,𝟓𝟓)
= 𝟐𝟐𝟒𝟒
Dr. Rao Muhammad Adeel Nawab 208
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Quantify the degree for similarity betweentext pair using N-gram Overlap Approach
Here
Goal
N-grams are word basedN = 2
A Text PairGiven
Text 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 209
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Pre-process Input Text
Steps - Commuting Similarity
Lower casePunctuation Marks Removal
Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 210
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man
Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Bigrams (S (Text 1, 2)) = {a dog, dog bites, bites a, a man}Text 2 Word Bigrams (S (Text 1, 2)) = {a, hound, bites, a, man}
Dr. Rao Muhammad Adeel Nawab 211
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|
Overlap Similarity = |{a dog, dog bites, bites a, a man}| ∩|{a hound , hound bites , bites a, a man}| / min (|{a dog,dog bites, bites a, a man}|, |{a hound , hound bites , bitesa, a man}|)))
Dr. Rao Muhammad Adeel Nawab 212
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
= 0.5
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛 𝐚𝐚, 𝒓𝒓𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟒𝟒,𝟒𝟒)
= 𝟐𝟐𝟒𝟒
Dr. Rao Muhammad Adeel Nawab 213
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Quantify the degree for similarity betweentext pair using N-gram Overlap Approach
Here
Goal
N-grams are word basedN = 3
A Text PairGiven
Text 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 214
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Pre-process Input Text
Steps - Commuting Similarity
Lower casePunctuation Marks Removal
Text Pair Before Pre-processingText 1: A dog bites a man. Text 2: A hound bites a man.
Dr. Rao Muhammad Adeel Nawab 215
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Text Pair After Pre-processingText 1: a dog bites a manText 2: a hound bites a man
Convert Text 1 and Text 2 into Sets of N-gramsText 1 Word Bigrams (S (Text 1, 3)) = {a dog bites, dog bites a, bites a man}Text 2 Word Bigrams (S (Text 1, 3)) = {a hound bites, hound bites a , bites a man}
Dr. Rao Muhammad Adeel Nawab 216
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
Compute Similarity Between Sets of N-grams usingOverlapSimilarity Co-efficient i.e. Quantify the Degree of Similarity
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = 𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟏𝟏,𝐧𝐧 ∩𝐒𝐒 𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧𝐦𝐦𝐦𝐦𝐧𝐧( 𝐒𝐒 𝑻𝑻𝒕𝒕𝑻𝑻𝒄𝒄 𝟏𝟏,𝐧𝐧 ,|𝐒𝐒(𝐓𝐓𝐓𝐓𝐓𝐓𝐓𝐓 𝟐𝟐,𝐧𝐧)|
Overlap Similarity = |{a dog, dog bites, bites a, a man}| ∩|{a hound , hound bites , bites a, a man}| / min (|{a dog,dog bites, bites a, a man}|, |{a hound , hound bites , bitesa, a man}|)))
Dr. Rao Muhammad Adeel Nawab 217
Example – Computing Similarity Between Sets of N-grams using Overlap Similarity Co-efficient
= 0.33
Overlap Similarity
𝑶𝑶𝑶𝑶𝒕𝒕𝒏𝒏𝒕𝒕𝒓𝒓𝑶𝑶 𝑺𝑺𝒓𝒓𝒏𝒏𝒓𝒓𝒕𝒕𝒓𝒓𝒏𝒏𝒓𝒓𝒄𝒄𝑺𝑺 = { 𝐛𝐛𝐦𝐦𝐓𝐓𝐓𝐓𝐛𝐛 𝐚𝐚 𝐦𝐦𝐚𝐚𝐧𝐧}𝐦𝐦𝐦𝐦𝐧𝐧 (𝟑𝟑,𝟑𝟑)
= 𝟏𝟏𝟑𝟑
Dr. Rao Muhammad Adeel Nawab 218
Methods for Mono-lingual Text Reuse and Plagiarism Detection
Methods based on content
Word n-grams overlapVector Space Model
Methods based on string and sequence alignment
Longest common subsequenceGreedy String-TilingGlobal AlignmentLocal Alignment
Dr. Rao Muhammad Adeel Nawab 219
Methods for Mono-lingual Text Reuse and Plagiarism Detection
Methods based on structure
Stop-words based n-grams overlap
Methods based on style
Type token ratioToken ratioSentence ratio
Dr. Rao Muhammad Adeel Nawab 220
Methods for Cross-lingual Text Reuse and Plagiarism Detection
Methods based on Syntax
Methods based on Dictionaries
Cross-Language Character N-Gram
Cross-Language Vector Space MethodCross-Language Conceptual Thesaurus based SimilarityCross-Language Knowledge Graph Analysis
Dr. Rao Muhammad Adeel Nawab 221
Methods for Cross-lingual Text Reuse and Plagiarism Detection
Methods based on Parallel Corpora
Methods based on Comparable Corpora
Cross-Language Alignment based Similarity AnalysisCross-Language Latent Semantic Indexing
Cross-Language Explicit Semantic Analysis
Cross-Language Kernel Canonical Correlation Analysis
Methods based on Machine Translation
Translation + Monolingual Analysis
Dr. Rao Muhammad Adeel Nawab 222
Methods for Cross-lingual Text Reuse and Plagiarism Detection
Methods based on Word Embeddings
Cross-Language Conceptual Thesaurus based Similarityusing Word EmbeddingsCross-Language Word Embeddings based SimilarityCross-Language Word Embedding based Syntax Similarity
Methods based on Deep Learning
Dr. Rao Muhammad Adeel Nawab 223
Your TurnConsidering the following example
Text 1: a dog bites a man
Using Longest Common Subsequence (LCS) Approach theLCS between two text pairs is:
A bites a man
ComputingSimilarity
Your Turnis to take at least three text pairs and compute similarityscore between then using the Longest CommonSubsequence (LCS) Approach
Similarity Score =
Similarity Score = =
Text 2: a hound bites a man
Dr. Rao Muhammad Adeel Nawab 224
Summary- Methods for Text Reuse and Plagiarism Detection
Methods for Mono-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based onContent, (2) Methods based on Structure and (3) Methodsbased on StyleMethods for Cross-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based on Syntax,(2) Cross-Language Character N-Grams, (3) Methods based onDictionaries, (4) Methods based on Parallel Corpora, (5)Methods based on Comparable Corpora, (6) Methods based onWord Embedding's and (7) Methods based on Deep Learning
Dr. Rao Muhammad Adeel Nawab 225
It's Poetry Time
Dr. Rao Muhammad Adeel Nawab 226
�ا� د�ا� � دف �ى � �� �
�ى � دف �ت او�د �
او�د � دف �� �
اور �� � دف ��ى � �� � �ق ا� ��
� � � �ى � � �� --دف �ر�
Importance of Poetry
Dr. Rao Muhammad Adeel Nawab 227
� � ر� � � � � آ� وا� �ا
� � ��ار � � ��اداس
�
Dr. Rao Muhammad Adeel Nawab 228
Stay Motivated
Dr. Rao Muhammad Adeel Nawab 229
. � �ڑى �� � د�ں � � ا�ر � �� � �
.�ڑى ا� �وڑ � � اور �ول � � � � � �
Motivation is the Fuel of Life
Importance of Motivation
Dr. Rao Muhammad Adeel Nawab 230
• Make a Schedule of 24 Hours and Live a Live a Balance Life• Watch Seminar - Balanced Life is Ideal Life
• Download Link: https://drive.google.com/open?id=1jet6r1QOtAB16Glpgiq18EeXaCkeVJUp
• On Daily Basis, read and enjoy• Poetry• Jokes
Tips - To Stay Motivated and become a Great Researcher
Dr. Rao Muhammad Adeel Nawab 231
Title: Power Of Appreciation
Speaker: Dr. Rao Muhammad Adeel Nawab
Downlaod Link: Video + Slides https://drive.google.com/open?id=1XnwhNjrjHPBSv0VHYuvgS5m4A7QwNDTf
Task Summarize main points of the motivational seminar
Motivational Seminar
Evaluation Measures
Dr. Rao Muhammad Adeel Nawab 233
Evaluation Measures
P = 𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻+𝑭𝑭𝑻𝑻
Precision
Precision (P) of a text reuse detection system is the
proportion of the predicted positive cases that were
correct
Precision
Dr. Rao Muhammad Adeel Nawab 234
Evaluation Measures
Recall (R) of a text reuse detection system is defined
as the proportion of positive cases that were
correctly identified
Recall
R = 𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭
Recall
Dr. Rao Muhammad Adeel Nawab 235
Evaluation Measures
F₁ measure is a specific relationship (harmonic mean)
between precision (P) and recall (R)
F₁ measure
F₁ = 𝟐𝟐∗𝑻𝑻∗𝑹𝑹𝑻𝑻+𝑹𝑹
F₁ measure
Dr. Rao Muhammad Adeel Nawab 236
Summary- Evaluation Measures
Evaluation of Text Reuse / Plagiarism DetectionSystems is carried out using Precision, Recall and F₁measuresNote that in research papers (or thesis) mostly weightaverage Precision, Recall and F₁ scores are reported
Dr. Rao Muhammad Adeel Nawab 237
It's Poetry Time
Dr. Rao Muhammad Adeel Nawab 238
�ا� د�ا� � دف �ى � �� �
�ى � دف �ت او�د �
او�د � دف �� �
اور �� � دف ��ى � �� � �ق ا� ��
� � � �ى � � �� --دف �ر�
Importance of Poetry
Dr. Rao Muhammad Adeel Nawab 239
ر� � � دل � د�� � � آ
آ آ � � � �ڑ � �� � �
ر� �م� �ار � �ے � �
� آ� � � �� � � � �
� � �� � � � � �ا�
� آ� �� � د�رہو ر�
�ل
Dr. Rao Muhammad Adeel Nawab 240
� � � � �ا� ��� � �
ز�� � � آ�� � � � �
اك � � �ں �ت �� � � �وم
اے را� �ں � � ر�� � � آ
اب � دل �ش � � � � � ا��
� آ � �� � آ�ى � � ا� �از
…�ل
Treating the Problem of Text Reuse/ Plagiarism Detection as MachineLearning Problem – A Step by StepExample
Dr. Rao Muhammad Adeel Nawab 242
Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem
For Ternary Classification
For Binary Classification
DerivedNon-Derived
Wholly DerivedPartially DerivedNon-Derived
Problem
Output
Text Reuse Detection
Input A Text Pair
Dr. Rao Muhammad Adeel Nawab 243
Treating the Problem of Text Reuse Detection as Machine Learning Problem
How to Transform Text Reuse DetectionProblem to Supervised Text ClassificationTask?
Question?
For Supervised Text Classification TaskOutput must be associated with the input i.e. dataset must beannotated
Dr. Rao Muhammad Adeel Nawab 244
Steps - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Pre-process input (i.e. text pair)
Feature Extraction from Input
Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)
Use features extracted in Step 2 to train and testMachine Learning algorithms
Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task
Dr. Rao Muhammad Adeel Nawab 245
Experimental Setup
Ternary ClassificationBinary ClassificationDiscriminate between two classes i.e.Derived vs Non-derived
Discriminate between three classesi.e. Wholly Derived / PartiallyDerived / Non-derived
Problem of Text Reuse Detection is Treated as a SupervisedText Classification Task
Two Versions of Classification
DatasetFile containing dataset in CSV format is called data.csv15 instances (input + output)
InputText Pair
OutputLabel
Dr. Rao Muhammad Adeel Nawab 246
Text 1 Text 2 LabelA dog bites a man A dog bites a person Wholly Derived A dog bites a man A person was badly bitten by a dog Partially Derived
A dog bites a man A person is badly injured while running at road followed by a hound,Independently written Non Derived
I like your car Your car is nice Non DerivedThis is my country I live in Lahore, Pakistan Partially DerivedI like your car I like your car Wholly DerivedMy favorite subject is NLP My favorite subject is NLP Wholly DerivedMy favorite subject is NLP I like NLP Partially DerivedI like NLP I am studying many course but nlp an be top ranked Non Derived
Allama iqbal is our national hero it was iqbal who awoke the muslim with his poetry Non Derived
Balochistan successfully holds 3rd round of LG elections Plots in the municipal elections, the PML-N won the third stage Partially Derived
and he was once looking forward to it, he reiterated. World Cup has never refused: Shoaib Malik Partially Derived
Syed Sultan Shah termed it a national tragedy.
World Blind Cricket Council chairman Syed Sultan Shah said that Peshawar is anational tragedy. Wholly Derived
He said that increase in the tax is unconstitutional and violation of article 77.
Raza Rabbani said the Supreme Court's decision in the light of Article 77 is a violation of the decision. Wholly Derived
LHC-death LHC dismisses appeals against death sentence
5 guilty of the death penalty rejected pleas for mercy Reporter Karachi to Islamabad? President's death convicted criminals 5 rejected pleas for mercy. Non Derived
Dr. Rao Muhammad Adeel Nawab 247
Experimental Setup
Technique (for Feature Extraction)
N-gram OverlapN-grams are word basedN = 1 - 5
Evaluation Measure
PrecisionRecallF1
Dr. Rao Muhammad Adeel Nawab 248
Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem
Main Steps to Treat Text Reuse Detection as Supervised TextClassification Task
Step 1: Pre-process input (i.e. text pair)
Dr. Rao Muhammad Adeel Nawab 249
Example - Treating the Problem of Text Reuse Detection as Machine Learning Text 1 Text 2A dog bites a man A dog bites a personA dog bites a man A person was badly bitten by a dogA dog bites a man A person is badly injured while running at road followed by a hound,
Independently writtenI like your car Your car is niceThis is my country I live in Lahore, PakistanI like your car I like your carMy favorite subject is NLP My favorite subject is NLPMy favorite subject is NLP I like NLPI like NLP I am studying many course but nlp an be top rankedAllama iqbal is our national hero it was iqbal who awoke the muslim with his poetryBalochistan successfully holds 3rd round of LG elections Plots in the municipal elections, the PML-N won the third stage
and he was once looking forward to it, he reiterated. World Cup has never refused: Shoaib Malik
Syed Sultan Shah termed it a national tragedy. World Blind Cricket Council chairman Syed Sultan Shah said thatPeshawar is a national tragedy.
He said that increase in the tax is unconstitutional and violation of article 77.
Raza Rabbani said the Supreme Court's decision in the light of Article 77 is a violation of the decision.
LHC-death LHC dismisses appeals against death sentence 5 guilty of the death penalty rejected pleas for mercy Reporter Karachi to Islamabad? President's death convicted criminals 5 rejected pleas for mercy.
Dr. Rao Muhammad Adeel Nawab 250
Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem
Step 2: Feature Extraction from InputMain goal is to quantify the degree of similarity between textpairs (input) i.e. convert text pairs into similarity scores(numeric values) so that Machine Learning algorithms canunderstand them
We applied N-gram Overlap approach to compute similarityscores and transformed data.csv file into features.csv
Note that Machine Learning algorithms can understandnumber values
Dr. Rao Muhammad Adeel Nawab 251
Example - Treating the Problem of Text Reuse Detection as Machine Learning
Uni-gram-Scores Bi-gram-Score Tri-gram-score Four-gram-score Five-gram-Score
0.6 0.5 0.33 0 00.6 0.25 0 0 00.4 0 0 0 00.5 0.33 0 0 00 0 0 0 01 1 1 1 01 1 1 1 1
0.33 0 0 0 00.67 0 0 0 00.17 0 0 0 00.12 0 0 0 0
0 0 0 0 00.75 0.57 0.33 0 00.57 0.31 0.08 0 00.25 0 0 0 0
Dr. Rao Muhammad Adeel Nawab 252
Example - Treating the Problem of Text Reuse Detection as Machine Learning
Step 3: Represent extracted features and output / label into aformat that Machine Learning algorithms can understand(normally saved as CSV file) Feature.csv (WITH LABEL)Uni-gram-Scores Bi-gram-Score Tri-gram-score Four-gram-score Five-gram-Score Label
0.6 0.5 0.33 0 0 WD0.6 0.25 0 0 0 PD 0.4 0 0 0 0 ND0.5 0.33 0 0 0 ND0 0 0 0 0 PD1 1 1 1 0 WD1 1 1 1 1 WD
0.33 0 0 0 0 PD0.67 0 0 0 0 ND0.17 0 0 0 0 ND0.12 0 0 0 0 PD
0 0 0 0 0 PD0.75 0.57 0.33 0 0 WD0.57 0.31 0.08 0 0 WD0.25 0 0 0 0 ND
Step 4: Use features.csv file to train and test Machine Learningalgorithms See Next Slides
Dr. Rao Muhammad Adeel Nawab 253
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKALoad features.csv
Click open
Dr. Rao Muhammad Adeel Nawab 254
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem (Cont.)
Ternary Classification using WEKALoad features.csv
Select file by Giving Path
Dr. Rao Muhammad Adeel Nawab 255
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKALoad features.csv
Click Label and see Number of Classes
Dr. Rao Muhammad Adeel Nawab 256
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKALoad features.csv
Click Edit and see Data View in Weka
Dr. Rao Muhammad Adeel Nawab 257
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKASelect J48
Dr. Rao Muhammad Adeel Nawab 258
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKARun J48 with Spilt ratio 70
Dr. Rao Muhammad Adeel Nawab 259
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKASelect Naive Bayes
Dr. Rao Muhammad Adeel Nawab 260
Example - Treating the Problem of Text Reuse Detection as Machine Learning Problem
Ternary Classification using WEKARun Naive Bayes with Spilt Ratio 70
Dr. Rao Muhammad Adeel Nawab 261
Results for Ternary ClassificationResults are reported for weighted average Precision,Recall and F₁ scores
Machine Learning Algorithms Results
Precision Recall F₁-Measure
Naïve Bayes 0.500 0.500 0.500
J48 0.875 0.800 0.780
Dr. Rao Muhammad Adeel Nawab 262
Your Turn
Considering the toy dataset given in this lecture.Apply Longest Common Subsequence Approach toextract features (similarity scores between textpairs) from dataset. Convert the file into ARFF / CSVformat. Run Naïve Bayes, J48 and two otherMachine Learning algorithms from WEKA. ReportWeighted Average Precision, Recall, and F₁ scoresfor all four Machine Learning algorithms in theform of table.
Dr. Rao Muhammad Adeel Nawab 263
Example-Treating the Problem of Text Reuse Detection as Machine Learning Problem
Binary Classification using WEKALoad features.csv
Click Open
Dr. Rao Muhammad Adeel Nawab 264
Example – (Cont.)Binary Classification using WEKA
Load features.csvSelect file by Giving Path
Dr. Rao Muhammad Adeel Nawab 265
Example – (Cont.)Binary Classification using WEKA
Load features.csvClick Label and see Number of Classes
Dr. Rao Muhammad Adeel Nawab 266
Example – (Cont.)Binary Classification using WEKA
Load features.csvClick Edit and see Data View in WEKA
Dr. Rao Muhammad Adeel Nawab 267
Example – (Cont.)Binary Classification using WEKA
Select J48
Dr. Rao Muhammad Adeel Nawab 268
Example – (Cont.)Binary Classification using WEKA
Run J48 with Spilt ratio 70
Dr. Rao Muhammad Adeel Nawab 269
Example – (Cont.)Binary Classification using WEKA
Select Naive Bayes
Dr. Rao Muhammad Adeel Nawab 270
Example – (Cont.)Binary Classification using WEKA
Run Naive Bayes with Spilt Ratio 70
Dr. Rao Muhammad Adeel Nawab 271
Results for Binary ClassificationResults are reported for weighted average Precision,Recall and F₁ scores
Machine Learning Algorithms Results
Precision Recall F₁-Measure
Naïve Bayes 0.833 0.500 0.500
J48 0.875 0.750 0.767
Dr. Rao Muhammad Adeel Nawab 272
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
To treat text reuse and plagiarism detection problem as aSupervised Text Classification task, we need to know followingmain things
For supervised text classification task, dataset mustbe annotated
Dataset
To extract features from text pairs (input)For text reuse and plagiarism detection the (featureextraction) techniques mostly aim to compute the similaritybetween text pairs i.e. features are similarity scores
Techniques(s)
Dr. Rao Muhammad Adeel Nawab 273
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
Mostly weighted average Precision, Recall and F₁ scores are usedto evaluate the performance of text reuse and plagiarismdetection systems
Evaluation Measures
A Machine Learning Toolkit is mainly a collection of MachineLearning algorithmsTwo popular and widely used Machine Learning Toolkits are
Machine Learning Toolkit(s)
WEKA (Java Programming Language)Scikit-Learn (Python Programming Language)
Dr. Rao Muhammad Adeel Nawab 274
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
Machine Learning Algorithms that will be trained / tested onfeatures (similarity scores) extracted from the datasetFor Supervised Text Classification task some of the MachineLearning Algorithms which have proven to be effective are
Machine Learning Algorithms
ML Algorithms
Naïve Bayes
Logistic Regression
Multi-Layer Perceptron
Random Forest
Support Vector Machine
AdaBoost
Dr. Rao Muhammad Adeel Nawab 275
Summary: Treating the Problem of Text Reuse / Plagiarism Detectionas Machine Learning Problem – A Step by Step Example (Cont.)
Pre-process input (i.e. text pair)
Feature Extraction from Input
Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)
Use features extracted in Step 2 to train and testMachine Learning algorithms
Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task
Dr. Rao Muhammad Adeel Nawab 276
Summary – Introduction to Text Reuse and Plagiarism
Text Reuse is the process of creating a new text (or document)using the existing one(s) and it is reported to be on rise inrecent years due to easy access to large online digitalrepositoriesGiven a text pair (Text 1 and Text 2), Text 2 is said to be"Derived" from Text 1, if it is created using text from Text 1. Onthe other hand, Text 2 is said to be "Non-Derived" from Text 1,if it is interpedently written i.e. did not borrow text from Text 1
Text Reuse
Dr. Rao Muhammad Adeel Nawab 277
Summary – Introduction to Text Reuse and Plagiarism
Text Reuse may occur at five levels: (1) Word level, (2) Phrasallevel, (3) Sentence level, (4) Passage / Paragraph level and (5)Document levelTwo main types of text reuse are: (1) Local Text Reuse - whenamount of text reused is detected at sentence/passage leveland (2) Global Text Reuse - when amount of text reused isdetected at document levelTwo main types of text reuse are: (1) Local Text Reuse - whenamount of text reused is detected at sentence/passage leveland (2) Global Text Reuse - when amount of text reused isdetected at document level
Dr. Rao Muhammad Adeel Nawab 278
Summary – Introduction to Text Reuse and Plagiarism
Plagiarism is defined as the unacknowledged reuse of text andin recent years it has been reported to be on rise.Consequently, plagiarism detection systems are routinely usedby higher educational institutions to check students work forplagiarismGiven a text pair (source text and suspicious text), suspicioustext is said to be Plagiarized from source text, if it is createdusing text from source text. On the other hand, suspicious textis said to be Non-Plagiarized from source text, if it isinterpedently written i.e. did not borrow text from source text
Plagiarism
Dr. Rao Muhammad Adeel Nawab 279
Summary – Introduction to Text Reuse and Plagiarism
There are three levels of Plagiarism: (1) Verbatim - the originaltext is reused as verbatim (word to word copy) or with minormodifications to create the plagiarized document, (2)Paraphrased Plagiarism - the original text is heavily altered (orparaphrased) to create the plagiarized document and (3)Plagiarism of Idea - the idea of the original text is reusedwithout dependence on the words or form of the sourceThere are three main types of Plagiarism Cases: (1) ArtificialCases of Plagiarism - are generated by using Automatic TextAltering tools to obfuscate the source text for plagiarism,
Dr. Rao Muhammad Adeel Nawab 280
Summary – Introduction to Text Reuse and Plagiarism
(2) Simulated / Manual Cases of Plagiarism - the original text isparaphrased by humans to create the cases of plagiarism and(3) Real Cases of Plagiarism - are those which occurred in thereal worldTwo main types of plagiarism detection are: (1) IntrinsicPlagiarism Detection - checking that the entire document (orall the passages) were written by one single author and (2)Extrinsic Plagiarism Detection - searching for the source(s) (ororiginal text(s)) that were reused to create the suspiciousdocument
Dr. Rao Muhammad Adeel Nawab 281
Summary – Introduction to Text Reuse and Plagiarism
Data Annotation
is the process of labeling data to make it usable for machine learning?is performed by domain experts (humans – a.k.a. annotators / taggers / raters)requires a lot of effort, time and cost
Dr. Rao Muhammad Adeel Nawab 282
Summary – Introduction to Text Reuse and PlagiarismMain Steps to Create Benchmark Annotated DatasetRaw Data Collection
Corpus Standardization
Data Source(s)Cleaning of DataPre-processing of Data
Annotation ProcessPreparation of Annotation GuidelinesAnnotationsComputing Inter-Annotator Agreement
Dr. Rao Muhammad Adeel Nawab 283
Summary – Introduction to Text Reuse and Plagiarism
Methods for Mono-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based onContent, (2) Methods based on Structure and (3) Methodsbased on StyleMethods for Cross-lingual Text Reuse and Plagiarism Detectionand be broadly categorized into: (1) Methods based on Syntax,(2) Cross-Language Character N-Grams, (3) Methods based onDictionaries, (4) Methods based on Parallel Corpora, (5)Methods based on Comparable Corpora, (6) Methods based onWord Embedding's and (7) Methods based on Deep Learning
Methods for Text Reuse and Plagiarism Detection
Dr. Rao Muhammad Adeel Nawab 284
Summary – Introduction to Text Reuse and Plagiarism
Evaluation Measures
Evaluation of Text Reuse / Plagiarism DetectionSystems is carried out using Precision, Recall and F₁measuresNote that in research papers (or thesis) mostly weightaverage Precision, Recall and F₁ scores are reported
Dr. Rao Muhammad Adeel Nawab 285
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
To treat text reuse and plagiarism detection problem as aSupervised Text Classification task, we need to know followingmain things
For supervised text classification task, dataset mustbe annotated
Dataset
To extract features from text pairs (input)For text reuse and plagiarism detection the (featureextraction) techniques mostly aim to compute the similaritybetween text pairs i.e. features are similarity scores
Techniques(s)
Dr. Rao Muhammad Adeel Nawab 286
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
Mostly weighted average Precision, Recall and F₁ scores are usedto evaluate the performance of text reuse and plagiarismdetection systems
Evaluation Measures
A Machine Learning Toolkit is mainly a collection of MachineLearning algorithmsTwo popular and widely used Machine Learning Toolkits are
Machine Learning Toolkit(s)
WEKA (Java Programming Language)Scikit-Learn (Python Programming Language)
Dr. Rao Muhammad Adeel Nawab 287
Summary: Treating the Problem of Text Reuse / Plagiarism Detection as Machine Learning Problem – A Step by Step Example
Machine Learning Algorithms that will be trained / tested onfeatures (similarity scores) extracted from the datasetFor Supervised Text Classification task some of the MachineLearning Algorithms which have proven to be effective are
Machine Learning Algorithms
ML Algorithms
Naïve Bayes
Logistic Regression
Multi-Layer Perceptron
Random Forest
Support Vector Machine
AdaBoost
Dr. Rao Muhammad Adeel Nawab 288
Summary: Treating the Problem of Text Reuse / Plagiarism Detectionas Machine Learning Problem – A Step by Step Example (Cont.)
Pre-process input (i.e. text pair)
Feature Extraction from Input
Represent extracted features and output / labelinto a format that Machine Learning algorithmscan understand (normally saved as CSV file)
Use features extracted in Step 2 to train and testMachine Learning algorithms
Main Steps to Treat Text Reuse Detection as SupervisedText Classification Task