Abstract

Since the internet is vast and most of its content is public, it is very hard to trace where a piece of information originally came from. Many websites publish news articles, so people and organizations easily lose track of where their articles are reused with or without their permission. This thesis presents a plagiarism detection algorithm that quickly compares online news articles with a collection of personal news articles and detects plagiarized passages with the same quality as a human. The algorithm uses a basic shingle index and a Signature Tree as a more advanced pre-filtering step to narrow down the candidate documents for a query. The algorithm achieves 0.96 precision and 0.94 recall, but is too resource-intensive to be considered scalable. When only the pre-filtering step is used, it achieves 0.85 precision and recall, yielding a speedup of nearly one order of magnitude.

Introduction

When you publish a document on your website, how do you know which websites are presenting your information as their own? Since the internet is vast and most of its content is public, it is very hard to trace where a piece of information originally came from. Companies such as Google try to detect near-duplicate documents on the internet to save storage space and to prevent the same information from showing up multiple times. Near-duplicates can be shared content between different sources or altered copies, such as redacted or quoted text. When near-duplicates are made from copyrighted material and the original source is not mentioned, this is called plagiarism. Document-by-document comparison is sufficient to detect documents that are copied from someone else and adjusted to look like original work. Documents can be compared with each other based on their content and structure [1,2,3]. To improve the efficiency of this process, plagiarism detection can be viewed as a similarity self-join over the corpus of interest, or as a similarity join between a database of copyrighted material and the corpus of interest. For this thesis, the corpus of interest is the collection of online news articles together with a database collection of (plain text) news articles.
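To make the similarity join concrete, the sketch below compares every crawled page against a database of original articles using word 4-gram shingles and Jaccard similarity. This is a minimal brute-force illustration, not the algorithm developed in this thesis; the shingle size and the 0.7 threshold are assumptions chosen only for the example.

    # Brute-force similarity join sketch (illustrative parameters, not the thesis algorithm).
    def shingles(text, k=4):
        """Split a text into the set of overlapping word k-grams (shingles)."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Jaccard similarity |A ∩ B| / |A ∪ B| between two shingle sets."""
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def similarity_join(originals, crawled, threshold=0.7):
        """Yield (original id, page URL, score) for every pair above the threshold."""
        original_shingles = {doc_id: shingles(text) for doc_id, text in originals.items()}
        for url, page_text in crawled.items():
            page_shingles = shingles(page_text)
            for doc_id, orig in original_shingles.items():
                score = jaccard(orig, page_shingles)
                if score >= threshold:
                    yield doc_id, url, score

Comparing every page against every original in this way gives high quality but grows quadratically, which is exactly why a pre-filtering step is needed.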

This thesis focuses on detecting plagiarized online news articles. Its goal is to detect near-duplicate news articles on various websites. The program should be able to report the URLs of web pages that contain altered or copied articles against a known personal database of news articles, so that the publishers can be informed about the plagiarized content. The number of articles that can be processed per day should exceed the number of articles that are crawled (downloaded) per day. Because so many websites publish news articles, people and organizations easily lose track of the locations where their articles are reused with or without their permission. Some websites have an agreement with the original source to use a certain number of articles per day; other websites do not. Keeping track of one's own online articles and their reuse frequency therefore requires a near-duplicate detection application. TEEZIR, an information retrieval company located in Utrecht, The Netherlands, is looking for such an application. TEEZIR specializes in collecting, analyzing and searching (online) documents, so this graduation project is carried out in cooperation with TEEZIR and Delft University of Technology.

Defining plagiarism

When someone deliberately duplicates copyrighted material to present it as their own, they will try to cover this up by changing the original text: adding, removing, editing, or reordering passages. These modifications make plagiarism more difficult to detect. In order to detect near-duplicate content, plagiarism must be defined and the similarity between online articles and the original articles must be measurable. A dictionary defines plagiarism as follows:

'The copying of another person's ideas, text, or other creative work, and presenting it as one's own, especially without permission'

Besides the dictionary, Dutch copyright law and jurisprudence define plagiarism, but none of these provides exact numbers or thresholds, which makes a mathematical definition very difficult to give. Humans, on the other hand, are quite good at detecting plagiarism in small corpora, but this is a tedious, time-consuming, expensive and ultimately impossible task when the collection is large (which is the case when crawling the internet). This thesis presents a Signature Tree pre-filter combined with a shingle index to detect near-duplicate passages from news articles on any type of website as well as a human can, while keeping the computational time low enough to handle large and growing document collections. What is unique about this method is that it detects a copy of the text within a larger web page.
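The Signature Tree itself is beyond the scope of this summary, but the simpler half of the approach, the shingle index used to pre-filter candidate documents, can be sketched as follows. The shingle size of 4 and the minimum of 3 shared shingles are illustrative assumptions, not the parameters used in the thesis.

    # Shingle-index pre-filter sketch (assumed parameters; the Signature Tree is not shown).
    from collections import defaultdict

    class ShingleIndex:
        """Inverted index from word k-gram shingles to the ids of the original articles."""

        def __init__(self, k=4):
            self.k = k
            self.index = defaultdict(set)   # shingle -> ids of originals containing it

        def _shingles(self, text):
            words = text.lower().split()
            return {" ".join(words[i:i + self.k])
                    for i in range(len(words) - self.k + 1)}

        def add(self, doc_id, text):
            for shingle in self._shingles(text):
                self.index[shingle].add(doc_id)

        def candidates(self, page_text, min_shared=3):
            """Ids of originals sharing at least min_shared shingles with the page."""
            counts = defaultdict(int)
            for shingle in self._shingles(page_text):
                for doc_id in self.index.get(shingle, ()):
                    counts[doc_id] += 1
            return {doc_id for doc_id, n in counts.items() if n >= min_shared}

Only the candidates returned by such an index need to be compared in detail, which is where the speedup of the pre-filtering step comes from.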

Problem statement

The problem statement for this thesis is:

'How to detect plagiarized online news articles within acceptable time and quality by using similarity join?'

This can be divided into smaller sub-questions:

  1. 'What is plagiarism?'
  2. 'Which similarity join functions meet the requirements?'
  3. 'What is the acceptable time?'
  4. 'How can we measure quality?'

The main challenge is to achieve high quality without excessive computational time on large and growing document collections; the focus is therefore on scalability, speed and quality. The solution needs to be scalable because more internet articles will be published per day in the future and new articles will be added to the personal document collection over time. Quality and speed are both important: a computer should detect near-duplicate news articles as well as humans can, but must also do so within acceptable time. This may lead to a trade-off between quality and speed.
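Quality, as reported in the abstract, is measured with precision and recall over the detected plagiarism cases. The sketch below computes both; representing detections as (original id, URL) pairs is an assumption made only for the sake of illustration.

    # Precision and recall over detected plagiarism cases (illustrative sketch).
    def precision_recall(detected, actual):
        """Precision = correct detections / all detections;
        recall = correct detections / all actual plagiarism cases."""
        true_positives = len(detected & actual)
        precision = true_positives / len(detected) if detected else 0.0
        recall = true_positives / len(actual) if actual else 0.0
        return precision, recall

    # Example: 4 of 5 detections are correct, and 4 of the 6 actual cases are found.
    detected = {("a", "u1"), ("a", "u2"), ("b", "u3"), ("c", "u4"), ("c", "u5")}
    actual = {("a", "u1"), ("a", "u2"), ("b", "u3"), ("c", "u4"), ("d", "u6"), ("e", "u7")}
    print(precision_recall(detected, actual))  # (0.8, 0.666...)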


References

[1] Timothy C. Hoad and Justin Zobel. Methods for identifying versioned and plagiarized documents. J. Am. Soc. Inf. Sci. Technol., 54(3):203-215, 2003.
[2] Benno Stein, Sven Meyer zu Eissen, and Martin Potthast. Strategies for retrieving plagiarized documents. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 825-826, New York, NY, USA, 2007. ACM.
[3] Daniel R. White and Mike S. Joy. Sentence-based natural language plagiarism detection. J. Educ. Resour. Comput., 4(4):2, 2004.


Author: Rolf Schellenberger
Student id: 1047779
Email: R.Schellenberger@student.tudelft.nl

Media and Knowledge Engineering Research Group
Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
msc.its.tudelft.nl/mke/

TEEZIR
Kanaalweg 17L-E
Utrecht, the Netherlands
www.teezir.nl

Thesis Committee:
Prof. Dr. Ir. A.P. de Vries, Faculty EEMCS, TU Delft
Ir. S.T.J. de Bruijn, TEEZIR, Utrecht
Dr. E.A. Hendriks, Faculty EEMCS, TU Delft
Dr. P. Serdyukov, Faculty EEMCS, TU Delft
Dr. P. Cimiano, Faculty EEMCS, TU Delft
© 2009 Rolf Schellenberger