approaches for source retrieval and text ... - pan.webis.de · pan@clef2013 heilongjiang institute...

49
Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan www.hljit.edu.cn PAN@CLEF2013 Approaches for Source Retrieval and Text Alignment of Plagiarism Detection 1

Upload: others

Post on 05-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan

www.hljit.edu.cnPAN@CLEF2013

Approaches for Source Retrieval and Text Alignment of Plagiarism Detection

1

Page 2: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Who are we?

2

Page 3: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Who are we?

2

Page 4: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Who are we?

2

Page 5: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Who are we?

Heilongjiang Institute of TechnologyHarbin, Heilongjiang Province,

China

2

Page 6: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 7: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 8: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 9: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 10: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 11: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Our University

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20133

Page 12: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Index

Approaches for Source Retrieval

Approaches for Text Alignment

Further works

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF20134

Page 13: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

13

Source Retrieval

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Candidate Documents

Suspicious plagiarism textDocument

Set

Internet Resource

Text Alignment

SourceRetrieval

Suspicious document

Query keywords

Page 14: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

14

Source Retrieval

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Candidate Documents

Suspicious plagiarism textDocument

Set

Internet Resource

Text Alignment

SourceRetrieval

Suspicious document

Query keywords

Page 15: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

2 problmes of source retrieval

Tow core problem of source retrieval Retrieval source is millions of documents

from the Internet This work was done by PAN

The query keywords of suspicious document which would be used for retrieval are not specified How to extract query keyword is one of important

issues of our work

6

Page 16: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

16

Query Keywords Extraction

Query Keywords Extraction Based on TF-IDF

Query Keywords Extraction Based on Weighted TF-IDF

Adjacent Query Keywords Extraction by PatTree

Combination of Queries and Execution of Retrieval

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 17: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

17

Keywords Based on TF-IDF

TF - term frequency, denotes the frequency of term i in document j

IDF - inverse document frequency

IDF =log2 (N/ df j)

TF-IDF of term i is:

Tips: we found that the top 10 terms with the highest TF-IDF can obtain a good results

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 18: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

18

Keywords Based on Weighted TF-IDF

Weighted TF-IDF

Where weight is a weighted parameter, and we calculate the weight of term i according to its location

Tips: the keywords extraction based on the weighted TF-IDF sometimes is useful, sometimes useless.

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 19: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

19

Adjacent Query Keywords Extraction by PatTree

The adjacent string with high frequency is more important than a single word

We use PatTree - an efficient data structure– to get the adjacent strings and their frequency

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

example

Page 20: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

20

Combination of Queries

Query Query Keywords

1

2

3

4

5

6

7

8

9

Top 1 to 5 query keywords based on TF-IDF

Top 2 to 10 query keywords based on TF-IDF

2-Gram query keywords based on PatTree

3-Gram query keywords based on PatTree

4-Gram query keywords based on PatTree

4-Gram query keywords based on PatTree

Top 1 to 5 query keywords based on weighted TF-IDF

Top 6 to 10 query keywords based on weighted TF-IDF

5-Gram query keywords based on PatTree

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Table 1: Query Combination and Group

Page 21: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

21

Results Source Retrieval subtask

Workload

Queries 48.50

Downloads 5691.47

Time to 1st Detection

Queries 2.46

Downloads 285.66

Retrieved Performance

Precision 0.01

Recall 0.65

No Detection 3

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Table2:Results Source of PAN@CLEF2013 Retrieval subtask

Page 22: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

22

Results Source Retrieval subtask

Workload

Queries 48.50

Downloads 5691.47

Time to 1st Detection

Queries 2.46

Downloads 285.66

Retrieved Performance

Precision 0.01

Recall 0.65

No Detection 3

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Table2:Results Source of PAN@CLEF2013 Retrieval subtask

Page 23: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

23

Results Source Retrieval subtask

Workload

Queries 48.50

Downloads 5691.47

Time to 1st Detection

Queries 2.46

Downloads 285.66

Retrieved Performance

Precision 0.01

Recall 0.65

No Detection 3

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Table2:Results Source of PAN@CLEF2013 Retrieval subtask

Page 24: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

24

Candidate Documents

Suspicious plagiarism textDocument

Set

Internet Resource

Text Alignment

Text Alignment

Source Retrieval

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Suspicious document

Query keywords

Page 25: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

25

Candidate Documents

Suspicious plagiarism textDocument

Set

Internet Resource

Text Alignment

Text Alignment

Source Retrieval

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Suspicious document

Query keywords

Page 26: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

26

Text Alignment

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 27: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

27

Text Alignment

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 28: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

28

Text Alignment

Seeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 29: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

29

Text Alignment

Seeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 30: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

30

Match Merging

Text Alignment

Seeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 31: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

31

Match Merging

Text Alignment

Seeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 32: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

32

Match Merging

Text Alignment

Extraction FilteringSeeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 33: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

33

Match Merging

Text Alignment

Extraction FilteringSeeding

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 34: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

34

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 35: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

35

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 36: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

36

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 37: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

37

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 38: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

38

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 39: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

39

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

•Bilateral Alternating

Merging Algorithm

Page 40: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

40

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 41: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

41

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 42: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

42

Seeding Match Merging

Text Alignment

Extraction Filtering

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 43: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

43

Performance on the PAN2012 test corpus

Table 3: Overall evaluation results for the final test corpus

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 44: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

44

Performance on the PAN2012 test corpus

Table 4: Results for the 02-no-obfuscation sub-corpus

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 45: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

45

Performance on the PAN2012 test corpus

Table 5: Results for the 03-random-obfuscation

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 46: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

46

Performance on the PAN2012 test corpus

Table 6: Results for the 04-translation-obfuscation

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 47: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

47

Performance on the PAN2012 test corpus

Table 7: Evaluation results for the 05-summary-obfuscation

Heilongjiang Institute of Technology, Kong LeileiPAN@CLEF2013

Page 48: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

48

Further work

Use different methods to deal with different plagiarism problems to obtain a better performance

Query keywords extraction and ranking

Page 49: Approaches for Source Retrieval and Text ... - pan.webis.de · PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei Table 1: Query Combination and Group . 21 Results Source

Thank you for your attention!