Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan
www.hljit.edu.cn
PAN@CLEF2013
Approaches for Source Retrieval and Text Alignment of Plagiarism Detection
Who are we?
Heilongjiang Institute of Technology
Harbin, Heilongjiang Province, China
Our University
Heilongjiang Institute of Technology, Kong Leilei, PAN@CLEF2013
Index
Approaches for Source Retrieval
Approaches for Text Alignment
Further work
Source Retrieval
[Figure: plagiarism detection pipeline. Labels: Suspicious document, Query keywords, Source Retrieval, Internet Resource, Candidate Documents, Document Set, Text Alignment, Suspicious plagiarism text]
Two core problems of source retrieval
• The retrieval source is millions of documents from the Internet. This part of the work was done by PAN.
• The query keywords of the suspicious document that are used for retrieval are not specified. How to extract query keywords is one of the important issues of our work.
Query Keywords Extraction
Query Keywords Extraction Based on TF-IDF
Query Keywords Extraction Based on Weighted TF-IDF
Adjacent Query Keywords Extraction by PatTree
Combination of Queries and Execution of Retrieval
Keywords Based on TF-IDF
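TF: term frequency; $tf_{i,j}$ denotes the frequency of term i in document j
IDF: inverse document frequency
$idf_i = \log_2(N / df_i)$, where N is the number of documents and $df_i$ is the number of documents that contain term i
TF-IDF of term i in document j is: $tfidf_{i,j} = tf_{i,j} \times idf_i$
Tip: we found that the top 10 terms with the highest TF-IDF give good results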
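The TF-IDF ranking above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `top_tfidf_terms`, the token-list input format, and the alphabetical tie-breaking are assumptions made for the example, and the document is assumed to be part of the corpus used to compute document frequencies.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, corpus, k=10):
    """Return the k terms of doc_tokens with the highest TF-IDF score.

    corpus is a list of token lists (hypothetical input format);
    doc_tokens is assumed to be one of its members.
    """
    n_docs = len(corpus)

    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))

    # tf[t] = raw frequency of term t in the document
    tf = Counter(doc_tokens)

    # tf * log2(N / df); terms never seen in the corpus are skipped
    scores = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf if df[t] > 0}

    # Highest score first; ties are broken alphabetically (an assumption)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    ["plagiarism", "detection", "plagiarism", "the"],
    ["the", "cat"],
    ["the", "dog", "detection"],
]
print(top_tfidf_terms(corpus[0], corpus, k=2))
```

With the default `k=10`, the call returns the ten highest-scoring terms, which matches the tip on the slide.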
Keywords Based on Weighted TF-IDF
Weighted TF-IDF
where weight is a weighting parameter; we calculate the weight of term i according to its location in the document
Tip: keyword extraction based on the weighted TF-IDF is sometimes useful and sometimes not.
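As a rough illustration of location-based weighting, the sketch below multiplies the TF-IDF score by a position weight. Both the multiplicative form and the weighting rule are assumptions: the slide does not preserve the formula or say how location maps to a weight, so `position_weight`, `head_fraction` and `head_boost` are hypothetical names and values chosen only for the example.

```python
import math
from collections import Counter


def position_weight(first_index, doc_len, head_fraction=0.2, head_boost=2.0):
    """Hypothetical rule: boost terms that first occur near the start."""
    return head_boost if first_index < head_fraction * doc_len else 1.0


def weighted_tfidf(doc_tokens, corpus):
    """Score each term as weight * tf * log2(N / df) (assumed form)."""
    n_docs = len(corpus)

    # df[t] = number of documents that contain term t
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))

    tf = Counter(doc_tokens)

    # Position of the first occurrence of each term in the document
    first_seen = {}
    for idx, tok in enumerate(doc_tokens):
        first_seen.setdefault(tok, idx)

    return {
        t: position_weight(first_seen[t], len(doc_tokens))
        * tf[t]
        * math.log2(n_docs / df[t])
        for t in tf
        if df[t] > 0
    }


doc = ["alpha", "beta", "gamma", "beta", "delta"]
print(weighted_tfidf(doc, [doc, ["beta", "x"]]))
```

Any other location rule (for example, one based on titles or paragraph openings) could be swapped in by replacing `position_weight`.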
Adjacent Query Keywords Extraction by PatTree
An adjacent string (a sequence of adjacent words) that occurs with high frequency is more important than a single word
We use a PatTree, an efficient data structure, to get the adjacent strings and their frequencies
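To show what "adjacent strings and their frequencies" means in practice, here is a small sketch that counts word n-grams. It deliberately swaps the PatTree for a plain `collections.Counter`, which gives the same counts on small inputs but none of the PatTree's efficiency; the function name `frequent_ngrams` and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Return (n-gram, count) pairs for word n-grams seen at least min_count times.

    A Counter stands in for the PatTree here (a simplification).
    """
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # most_common() lists the highest counts first
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_count]


tokens = "source retrieval for plagiarism detection and source retrieval for text".split()
print(frequent_ngrams(tokens, 2))
print(frequent_ngrams(tokens, 3))
```

The repeated n-grams returned this way are the candidates for the 2-gram to 5-gram queries.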
[Figure: PatTree example]
Combination of Queries
Query   Query Keywords
1       Top 1 to 5 query keywords based on TF-IDF
2       Top 2 to 10 query keywords based on TF-IDF
3       2-Gram query keywords based on PatTree
4       3-Gram query keywords based on PatTree
5       4-Gram query keywords based on PatTree
6       4-Gram query keywords based on PatTree
7       Top 1 to 5 query keywords based on weighted TF-IDF
8       Top 6 to 10 query keywords based on weighted TF-IDF
9       5-Gram query keywords based on PatTree
Table 1: Query Combination and Group
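One way to assemble such a group of queries from the ranked keyword lists and the frequent n-grams is sketched below. This is an illustration only: the function `build_queries`, the default rank ranges, the choice of one n-gram per length, and the removal of duplicate queries are all assumptions, not details taken from the slides.

```python
def build_queries(ranked_lists, ngrams_by_n, rank_ranges=((1, 5), (6, 10))):
    """Build an ordered list of query strings.

    ranked_lists: ranked term lists, e.g. [tfidf_terms, weighted_terms]
    ngrams_by_n:  {n: [n-gram strings, most frequent first]}
    rank_ranges:  1-based inclusive rank ranges (illustrative defaults)
    """
    candidates = []

    # One query per (ranked list, rank range) combination
    for terms in ranked_lists:
        for first, last in rank_ranges:
            candidates.append(" ".join(terms[first - 1:last]))

    # One query per n-gram length, using the most frequent n-gram
    for n in sorted(ngrams_by_n):
        candidates.extend(ngrams_by_n[n][:1])

    # Drop empty and duplicate queries while keeping the order
    seen, queries = set(), []
    for q in candidates:
        if q and q not in seen:
            seen.add(q)
            queries.append(q)
    return queries


tfidf_terms = ["a", "b", "c", "d", "e", "f", "g"]
weighted_terms = ["a", "b", "c", "d", "e", "x"]
ngrams = {2: ["source retrieval"], 3: ["source retrieval task"]}
print(build_queries([tfidf_terms, weighted_terms], ngrams))
```

Each resulting string would then be submitted to the search engine as a separate query.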
Results: Source Retrieval subtask
Measure                  Value
Workload
  Queries                48.50
  Downloads              5691.47
Time to 1st Detection
  Queries                2.46
  Downloads              285.66
Retrieved Performance
  Precision              0.01
  Recall                 0.65
No Detection             3
Table 2: Results of the PAN@CLEF2013 Source Retrieval subtask
Text Alignment
[Figure: the four stages of text alignment: Seeding, Match Merging, Extraction, Filtering]
• Bilateral Alternating Merging Algorithm
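For readers unfamiliar with the seeding and match merging stages, the sketch below shows a generic version of the idea: exact word n-gram matches serve as seeds, and seeds that lie close together in both documents are merged into passage pairs. This is not the Bilateral Alternating Merging Algorithm, whose details are not preserved in these slides; the names `find_seeds` and `merge_seeds`, the n-gram length and the `max_gap` threshold are all assumptions.

```python
def find_seeds(susp_tokens, src_tokens, n=3):
    """Seeding: exact word n-gram matches as (susp_pos, src_pos) pairs."""
    index = {}
    for j in range(len(src_tokens) - n + 1):
        index.setdefault(tuple(src_tokens[j:j + n]), []).append(j)

    seeds = []
    for i in range(len(susp_tokens) - n + 1):
        for j in index.get(tuple(susp_tokens[i:i + n]), []):
            seeds.append((i, j))
    return seeds


def merge_seeds(seeds, n=3, max_gap=5):
    """Merge nearby seeds into (susp_start, susp_end, src_start, src_end).

    End offsets are exclusive. Each seed is compared only with the most
    recent passage, which is a simplification.
    """
    passages = []
    for i, j in sorted(seeds):
        if passages:
            last = passages[-1]
            # Close enough to the last passage on both sides: extend it
            if i <= last[1] + max_gap and last[2] <= j <= last[3] + max_gap:
                last[1] = max(last[1], i + n)
                last[3] = max(last[3], j + n)
                continue
        passages.append([i, i + n, j, j + n])
    return [tuple(p) for p in passages]


susp = "a b c d e x y f g h i".split()
src = "q a b c d e z f g h i q".split()
seeds = find_seeds(susp, src)
print(merge_seeds(seeds, max_gap=5))
print(merge_seeds(seeds, max_gap=1))
```

A larger `max_gap` joins matches across obfuscated stretches at the cost of precision, which is the trade-off any merging step has to balance.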
Performance on the PAN2012 test corpus
Table 3: Overall evaluation results for the final test corpus
Table 4: Results for the 02-no-obfuscation sub-corpus
Table 5: Results for the 03-random-obfuscation
Table 6: Results for the 04-translation-obfuscation
Table 7: Evaluation results for the 05-summary-obfuscation
Further work
Use different methods for different types of plagiarism to obtain better performance
Query keyword extraction and ranking
Thank you for your attention!