User:Bappelman3/sandbox

Plagiarism Detection

Plagiarism detection is the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays or reports. However, plagiarism can be found in virtually any field, including novels, scientific papers, art designs, and source code.[1]

Detection of plagiarism can be either manual or software-assisted. Manual detection requires substantial effort and excellent memory, and is impractical in cases where too many documents must be compared, or original documents are not available for comparison. Software-assisted detection allows vast collections of documents to be compared to each other, making successful detection much more likely.[1]

A plagiarism detection system must be fast, cost-efficient, and, most importantly, accurate. Meeting all three requirements at once is difficult, which is why many plagiarism checkers fail despite being launched with the best of intentions.[2]

The practice of plagiarizing by using enough word substitutions to elude detection software is known as rogeting.[2]

Software-assisted detection

Computer-assisted plagiarism detection (CaPD) is an information retrieval (IR) task supported by specialized IR systems, referred to as plagiarism detection systems (PDS).[3]

In text documents

Systems for text-plagiarism detection implement one of two generic detection approaches, one being external, the other being intrinsic.[4] External detection systems compare a suspicious document with a reference collection, which is a set of documents assumed to be genuine.[5] Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents that contain text that is similar to a degree above a chosen threshold to text in the suspicious document.[6] Intrinsic PDS solely analyze the text to be evaluated without performing comparisons to external documents. This approach aims to recognize changes in the unique writing style of an author as an indicator for potential plagiarism.[7] PDS are not capable of reliably identifying plagiarism without human judgment. Similarities are computed with the help of predefined document models and might represent false positives.[8][9]

Academic institutions are finding that they need a proactive anti-plagiarism policy, one that treats plagiarism as a serious issue and as unacceptable academic behavior.[10]

Effectiveness in higher education settings

Self-efficacy plays a large part in how students approach schoolwork and learning activities. Self-efficacy for learning refers to students' belief in their own capabilities, and perceived academic self-efficacy is students' belief that they have the skills necessary for successful learning.[12] A study was conducted to test the effectiveness of plagiarism detection software in a higher education setting. One portion of the study assigned a group of students to write a paper. These students were first educated about plagiarism and informed that their work was to be run through a plagiarism detection system. A second group of students was assigned to write a paper without any information about plagiarism. The researchers expected to find lower rates of plagiarism in the first group but found roughly the same rates in both groups.[13]

Approaches

The figure below classifies the detection approaches currently in use for computer-assisted plagiarism detection, each of which is described in the sections that follow. The approaches are characterized by the type of similarity assessment they undertake: global or local. Global similarity assessment approaches use characteristics taken from larger parts of the text, or from the document as a whole, to compute similarity, while local methods examine only pre-selected text segments as input.

Figure: Classification of computer-assisted plagiarism detection methods

Fingerprinting

Fingerprinting is currently the most widely applied approach to plagiarism detection. This method forms representative digests of documents by selecting a set of multiple substrings (n-grams) from them. Fingerprinting can be implemented in many different ways, but the principle is always the same: a mathematical process converts a lengthy work, such as a file or a large amount of text, into a short characteristic string known as a fingerprint.[2][14] The sets represent the fingerprints and their elements are called minutiae. A suspicious document is checked for plagiarism by computing its fingerprint and querying its minutiae against a precomputed index of fingerprints for all documents of a reference collection. Minutiae that match those of other documents indicate shared text segments and suggest potential plagiarism if they exceed a chosen similarity threshold. Computational resources and time are limiting factors for fingerprinting, which is why this method typically compares only a subset of minutiae to speed up the computation and allow for checks against very large collections, such as the Internet.[14]
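
The selection and matching of minutiae described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function names, the use of word 3-grams, MD5 as the hash, and the "keep hashes divisible by modulus" selection rule are all assumptions made for the example.

```python
import hashlib
import re


def fingerprint(text, n=3, modulus=4):
    """Return a set of selected n-gram hashes (the 'minutiae') for a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    minutiae = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        # Stable hash of the n-gram (the built-in hash() is salted per process).
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
        # Assumed selection rule: keep only a subset of hashes to save time and space.
        if h % modulus == 0:
            minutiae.add(h)
    return minutiae


def overlap(suspicious, source):
    """Fraction of the suspicious document's minutiae also found in the source."""
    if not suspicious:
        return 0.0
    return len(suspicious & source) / len(suspicious)
```

With `modulus=1` every n-gram is kept; larger values trade recall for a smaller index. A document would be reported when `overlap` exceeds a chosen threshold.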

String matching

String matching is a prevalent approach in computer science: it is the problem of finding occurrences of a pattern string within a text. Applied to plagiarism detection, the idea is straightforward: take a string of text from one document, ranging from a few words to a dozen or more, and try to find that same string in other documents; then repeat the process with another document or another string.[2] Documents are thus compared for verbatim text overlaps. Numerous methods have been proposed to tackle this task, some of which have been adapted to external plagiarism detection. String matching has also been used as a tool in software metrics, and parameterized string matching is able to detect plagiarism in software code.[15] Checking a suspicious document in this setting requires computing and storing efficiently comparable representations for all documents in the reference collection so that they can be compared pairwise. Generally, suffix document models, such as suffix trees or suffix vectors, have been used for this task. Nonetheless, substring matching remains computationally expensive, which makes it a non-viable solution for checking large collections of documents.[15]
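
As a rough illustration of verbatim overlap detection, the sketch below finds the longest run of consecutive words shared by two documents. It uses a simple dynamic-programming table rather than the suffix trees mentioned above, and the function name is hypothetical; its running time grows with the product of the two document lengths, which hints at why pairwise substring matching is costly on large collections.

```python
def longest_common_run(a_words, b_words):
    """Return the longest run of consecutive words shared by both lists."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a_words[i-2] and b_words[j-1]
    prev = [0] * (len(b_words) + 1)
    for i in range(1, len(a_words) + 1):
        curr = [0] * (len(b_words) + 1)
        for j in range(1, len(b_words) + 1):
            if a_words[i - 1] == b_words[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a_words[best_end - best_len:best_end]
```

For example, comparing "the cat sat on the mat today" with "yesterday the dog sat on the mat" (both split into words) yields the shared run "sat on the mat".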

Bag of Words

Bag of words analysis represents the adoption of vector space retrieval, a traditional IR concept, to the domain of plagiarism detection. Documents are represented as one or multiple vectors, e.g. for different document parts, which are used for pairwise similarity computations. Similarity computation may then rely on the traditional cosine similarity measure, or on more sophisticated similarity measures. The vector space model assigns higher weights to terms that occur infrequently in the dataset.[16]
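
A minimal sketch of this idea, assuming a smoothed TF-IDF weighting (one common way to give rare terms more weight) and cosine similarity on sparse dictionaries. The function names and the particular weighting formula are choices made for the example, not taken from the cited work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed IDF: rare terms get higher weight, common terms stay positive.
        vectors.append({t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                        for t, tf in c.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because word order is discarded, two documents that share vocabulary score as similar even when the sentences have been rearranged.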

Citation analysis

Citation-based plagiarism detection (CbPD)[17] relies on citation analysis, and is the only approach to plagiarism detection that does not rely on textual similarity.[18] CbPD examines the citation and reference information in texts to identify similar patterns in the citation sequences. As such, this approach is suitable for scientific texts, or other academic documents that contain citations. Citation analysis to detect plagiarism is a relatively young concept. It has not been adopted by commercial software, but a first prototype of a citation-based plagiarism detection system exists.[19] Similar order and proximity of citations in the examined documents are the main criteria used to compute citation pattern similarities. Citation patterns represent subsequences non-exclusively containing citations shared by the documents compared.[18][20] Other factors, including the absolute number or relative fraction of shared citations in the pattern, as well as the probability that citations co-occur in a document, are also considered to quantify the patterns’ degree of similarity.[18][20][21][22] Citation analysis can also be used to develop subject guides. It is commonly regarded as an aspect of bibliometrics, although others hold that citation analysis deals with the evaluation of the resources cited in a particular work or group of works.[23]
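
One way to compare citation order is a longest common subsequence over the two documents' citation lists, in the spirit of the "longest common citation sequence" named in the title of reference [20]. The sketch below is a generic textbook LCS, not the authors' implementation, and the citation keys in the example are hypothetical.

```python
def longest_common_citation_sequence(cites_a, cites_b):
    """Longest sequence of citations appearing in the same order in both lists."""
    rows, cols = len(cites_a), len(cites_b)
    # table[i][j] = LCS length of cites_a[i:] and cites_b[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if cites_a[i] == cites_b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one longest common sequence.
    sequence, i, j = [], 0, 0
    while i < rows and j < cols:
        if cites_a[i] == cites_b[j]:
            sequence.append(cites_a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return sequence


# Hypothetical citation keys in order of appearance in two documents.
doc_a = ["Smith2001", "Lee2005", "Kim2010", "Cho2012"]
doc_b = ["Lee2005", "Park1999", "Kim2010", "Cho2012", "Smith2001"]
shared = longest_common_citation_sequence(doc_a, doc_b)
```

Because only citation identifiers are compared, the result is unaffected by paraphrasing or translation of the surrounding text.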

Stylometry

Stylometry subsumes statistical methods for quantifying an author’s unique writing style and is mainly used for authorship attribution or intrinsic CaPD. By constructing and comparing stylometric models for different text segments, passages that are stylistically different from others, and hence potentially plagiarized, can be detected. Stylometric detection takes two main forms, intrinsic and external. Intrinsic detection identifies plagiarized passages by looking only at the analyzed document, using the author's writing style as the basis for comparison to check whether all parts of the material were written by the same author. External detection compares the document with other existing documents in a database and identifies pairs of similar documents.[24][25][26]
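
The intrinsic variant can be sketched as follows: compute a few style markers per segment and flag segments that deviate strongly from the rest of the document. The two features used here (mean word length and mean sentence length), the z-score rule, and the threshold of 1.5 are illustrative assumptions; practical stylometric models use many more features and need far longer texts.

```python
import re
import statistics


def style_features(segment):
    """Two simple style markers: mean word length and mean sentence length."""
    sentences = [s for s in re.split(r"[.!?]+", segment) if s.strip()]
    words = re.findall(r"[A-Za-z']+", segment)
    if not words or not sentences:
        return (0.0, 0.0)
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sentence_len = len(words) / len(sentences)
    return (mean_word_len, mean_sentence_len)


def outlier_segments(segments, threshold=1.5):
    """Indices of segments whose features deviate strongly from the document mean."""
    features = [style_features(s) for s in segments]
    flagged = set()
    for dim in range(2):
        values = [f[dim] for f in features]
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        if spread == 0:
            continue  # every segment is identical on this feature
        for idx, value in enumerate(values):
            if abs(value - mean) / spread > threshold:
                flagged.add(idx)
    return sorted(flagged)
```

A flagged segment is only a hint that the writing style changed; it still requires human judgment.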

Performance

Comparative evaluations of plagiarism detection systems[27][28][29][30][31][32] indicate that their performance depends on the type of plagiarism present (see figure). Except for citation pattern analysis, all detection approaches rely on textual similarity. It is therefore symptomatic that detection accuracy decreases the more plagiarism cases are obfuscated.

Figure: Detection performance of CaPD approaches depending on the type of plagiarism present

Literal copies, also known as copy and paste (c&p) plagiarism, and modestly disguised plagiarism cases can be detected with high accuracy by current external PDS if the source is accessible to the software. Substring matching procedures in particular perform well for c&p plagiarism, since they commonly use lossless document models, such as suffix trees. The performance of systems using fingerprinting or bag of words analysis in detecting copies depends on the information loss incurred by the document model used. By applying flexible chunking and selection strategies, they are better able to detect moderate forms of disguised plagiarism than substring matching procedures.

Intrinsic plagiarism detection using stylometry can overcome the boundaries of textual similarity to some extent by comparing linguistic similarity. Given that the stylistic differences between plagiarized and original segments are significant and can be identified reliably, stylometry can help in identifying disguised and paraphrased plagiarism. Stylometric comparisons are likely to fail in cases where segments are strongly paraphrased to the point where they more closely resemble the personal writing style of the plagiarist or if a text was compiled by multiple authors. The results of the International Competitions on Plagiarism Detection held in 2009, 2010 and 2011,[27][31][32] as well as experiments performed by Stein,[33] indicate that stylometric analysis seems to work reliably only for document lengths of several thousand or tens of thousands of words, which limits the applicability of the method to CaPD settings.

An increasing amount of research is being performed on methods and systems capable of detecting translated plagiarism. Currently, cross-language plagiarism detection (CLPD) is not viewed as a mature technology,[34] and the respective systems have not been able to achieve satisfactory detection results in practice.[30]

Citation-based plagiarism detection using citation pattern analysis is capable of identifying stronger paraphrases and translations with higher success rates when compared to other detection approaches, because it is independent of textual characteristics.[35][36] However, since citation-pattern analysis depends on the availability of sufficient citation information, it is limited to academic texts. It remains inferior to text-based approaches in detecting shorter plagiarized passages, which are typical for cases of copy-and-paste or shake-and-paste plagiarism; the latter refers to mixing slightly altered fragments from different sources.

Software

The design of plagiarism detection software for use with text documents is characterized by a number of factors:[citation needed]

Factor | Description and alternatives
Scope of search | The public internet (using search engines), institutional databases, or a local, system-specific database.[citation needed]
Analysis time | Delay between the time a document is submitted and the time when results are made available.[citation needed]
Document capacity / Batch processing | Number of documents the system can process per unit of time.[citation needed]
Check intensity | How often, and for which types of document fragments (paragraphs, sentences, fixed-length word sequences), the system queries external resources, such as search engines.
Comparison algorithm type | The algorithms the system uses to compare documents against each other.[citation needed]
Precision and recall | Precision is the number of documents correctly flagged as plagiarized relative to the total number of flagged documents; recall is the same number relative to the total number of documents that were actually plagiarized. High precision means that few false positives were reported, and high recall means that few plagiarized documents went undetected.[citation needed]
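
The precision and recall factor can be made concrete with a short calculation. The sketch below assumes that both the flagged documents and the truly plagiarized documents are available as sets of identifiers; the function name and the identifiers are hypothetical.

```python
def precision_recall(flagged, plagiarized):
    """Compute precision and recall from two sets of document identifiers."""
    true_positives = len(flagged & plagiarized)
    # Precision: share of flagged documents that were really plagiarized.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of plagiarized documents that the system flagged.
    recall = true_positives / len(plagiarized) if plagiarized else 0.0
    return precision, recall


# Four documents flagged, three actually plagiarized, two in common.
precision, recall = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
```

In this example two of the four flagged documents are correct (precision 0.5) and two of the three plagiarized documents were found (recall of about 0.67).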

Most large-scale plagiarism detection systems use large, internal databases (in addition to other resources) that grow with each additional document submitted for analysis. However, this feature is considered by some as a violation of student copyright.[citation needed]

In source code

Plagiarism in computer source code is also frequent, and requires different tools than those used for text comparisons in documents. Significant research has been dedicated to academic source-code plagiarism.[37]

A distinctive aspect of source-code plagiarism is that there are no essay mills of the kind found in traditional plagiarism. Since most programming assignments expect students to write programs with very specific requirements, it is very difficult to find existing programs that already meet them. Since integrating external code is often harder than writing it from scratch, most plagiarizing students choose to copy from their peers.

According to Roy and Cordy,[38] source-code similarity detection algorithms can be classified as based on either

  • Strings – look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers.
  • Tokens – as with strings, but using a lexer to convert the program into tokens first. This discards whitespace, comments, and identifier names, making the system more robust to simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences.
  • Parse Trees – build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements, and detect equivalent constructs as similar to each other.
  • Program Dependency Graphs (PDGs) – a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time.
  • Metrics – metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals", or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things.
  • Hybrid approaches – for instance, parse trees + suffix trees can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure.
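
The token-level approach from the list above can be sketched with Python's standard `tokenize` module. This is an illustration under stated assumptions rather than how any particular detection system works: it handles Python source only, replaces every identifier and literal with a placeholder, and uses `difflib.SequenceMatcher` as a stand-in for the similarity measure. The function names are hypothetical.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry no program logic for this comparison.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Lex Python source and replace identifiers and literals with placeholders."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")  # identifier names are discarded
        elif tok.type == tokenize.NUMBER:
            result.append("NUM")
        elif tok.type == tokenize.STRING:
            result.append("STR")
        elif tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            result.append(tokenize.tok_name[tok.type])  # keep block structure
        else:
            result.append(tok.string)  # keywords and operators stay as written
    return result


def token_similarity(source_a, source_b):
    """Similarity ratio in [0, 1] between the two normalized token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(source_a), normalized_tokens(source_b), autojunk=False)
    return matcher.ratio()
```

Two submissions that differ only in variable names, comments, and spacing produce identical token sequences, which is what makes this level more robust than plain string comparison.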

The previous classification was developed for code refactoring, and not for academic plagiarism detection (an important goal of refactoring is to avoid duplicate code, referred to as code clones in the literature). The above approaches are effective against different levels of similarity; low-level similarity refers to identical text, while high-level similarity can be due to similar specifications. In an academic setting, when all students are expected to code to the same specifications, functionally equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered as proof of cheating.

  1. ^ a b Grove, Jack (15 July 2015). "Sinister buttocks? Roget would blush at the crafty cheek: Middlesex lecturer gets to the bottom of meaningless phrases found while marking essays". Times Higher Education.
  2. ^ a b c d e "How Plagiarism Detection Works - Plagiarism Today". Plagiarism Today. 2016-05-03. Retrieved 2018-10-18.
  3. ^ "Download Limit Exceeded". citeseerx.ist.psu.edu. CiteSeerX 10.1.1.231.9889. Retrieved 2018-10-18.
  4. ^ Stein, Benno; Koppel, Moshe; Stamatatos, Efstathios (2007-12-01). "Plagiarism analysis, authorship identification, and near-duplicate detection PAN'07". ACM SIGIR Forum. 41 (2): 68. doi:10.1145/1328964.1328976. ISSN 0163-5840. S2CID 6379659.
  5. ^ Stein, Benno; Potthast, Martin; Rosso, Paolo; Barrón-Cedeño, Alberto; Stamatatos, Efstathios; Koppel, Moshe (2011-05-24). "Fourth international workshop on uncovering plagiarism, authorship, and social software misuse". ACM SIGIR Forum. 45 (1): 45. doi:10.1145/1988852.1988860. hdl:10251/44259. ISSN 0163-5840. S2CID 15553223.
  6. ^ Stein, Benno; zu Eissen, Sven Meyer; Potthast, Martin (2007). "Strategies for retrieving plagiarized documents". Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '07. New York, New York, USA: ACM Press: 825. doi:10.1145/1277741.1277928. ISBN 9781595935977. S2CID 3898511.
  7. ^ Lalmas, Mounia (2006). Advances in Information Retrieval: 28th European Conference on IR Research, ECIR 2006, London, UK, April 10-12, 2006. Proceedings. Springer Berlin Heidelberg. ISBN 9783540333487. OCLC 827361283.
  8. ^ Malcom, James (2006). "Text similarity in academic conference papers". www.plagiarism.org.
  9. ^ Yuan Lin (August 2006). "Multithreaded programming challenges, current practice, and languages/tools support". 2006 IEEE Hot Chips 18 Symposium (HCS). IEEE: 1–134. doi:10.1109/hotchips.2006.7477737. ISBN 9781467388672. S2CID 42390698.
  10. ^ Culwin, Fintan; Lancaster, Thomas (June 2001). "Plagiarism issues for higher education". VINE. 31 (2): 36–41. doi:10.1108/03055720010804005. ISSN 0305-5728.
  11. ^ Youmans, Robert J. (November 2011). "Does the adoption of plagiarism-detection software in higher education reduce plagiarism?". Studies in Higher Education. 36 (7): 749–761. doi:10.1080/03075079.2010.523457. S2CID 144143548.
  12. ^ "EBSCO Publishing Service Selection Page". eds.a.ebscohost.com. Retrieved 2018-10-10.
  13. ^ Youmans, Robert J. (November 2011). "Does the adoption of plagiarism-detection software in higher education reduce plagiarism?". Studies in Higher Education. 36 (7): 749–761. doi:10.1080/03075079.2010.523457. S2CID 144143548.
  14. ^ a b Hoad, Timothy C.; Zobel, Justin (2003). "Methods for identifying versioned and plagiarized documents". Journal of the American Society for Information Science and Technology. 54 (3): 203–215. doi:10.1002/asi.10170. ISSN 1532-2882.
  15. ^ a b Pandey, Kusum Lata; Agarwal, Suneeta; Misra, Sanjay; Prasad, Rajesh (2012-06-18). Plagiarism detection in software using efficient string matching. Springer-Verlag. pp. 147–156. doi:10.1007/978-3-642-31128-4_11. ISBN 9783642311277.
  16. ^ Ragel, Roshan G. (Fall 2018). "Plagiarism Detection on Electronic Text based Assignments using Vector Space Model" (PDF). Plagiarism Detection on Electronic Text Based Assignments Using Vector Space Model: 2. arXiv:1412.7782.
  17. ^ Gipp, Bela (2014), Citation-based Plagiarism Detection, Springer Vieweg Research, ISBN 978-3-658-06393-1
  18. ^ a b c Gipp, Bela; Beel, Jöran (June 2010), "Citation Based Plagiarism Detection - A New Approach to Identifying Plagiarized Work Language Independently", Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT'10) (PDF), ACM, pp. 273–274, doi:10.1145/1810617.1810671, ISBN 978-1-4503-0041-4
  19. ^ Gipp, Bela; Meuschke, Norman; Breitinger, Corinna; Lipinski, Mario; Nürnberger, Andreas (28 July 2013), "Demonstration of Citation Pattern Analysis for Plagiarism Detection", Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (PDF), ACM, doi:10.1145/2484028.2484214
  20. ^ a b Gipp, Bela; Meuschke, Norman (September 2011), "Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence", Proceedings of the 11th ACM Symposium on Document Engineering (DocEng2011) (PDF), ACM, pp. 249–258, doi:10.1145/2034691.2034741, ISBN 978-1-4503-0863-2
  21. ^ Gipp, Bela; Meuschke, Norman; Beel, Jöran (June 2011), "Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag", Proceedings of 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’11) (PDF), ACM, pp. 255–258, doi:10.1145/1998076.1998124, ISBN 978-1-4503-0744-4
  22. ^ Gipp, Bela; Beel, Jöran (July 2009), "Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis", Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09) (PDF), International Society for Scientometrics and Informetrics, pp. 571–575, ISSN 2175-1935
  23. ^ "Internet Chinese Librarians Club" (PDF). www.iclc.us. Retrieved 2018-10-10.
  24. ^ Zurini, Madalina (March 2015). "Stylometry Metrics Selection for Creating a Model for Evaluating the Writing Style of Authors According to Their Cultural Orientation" (PDF). IE Paper Templates.
  25. ^ Holmes, D. I. (1998-09-01). "The Evolution of Stylometry in Humanities Scholarship". Literary and Linguistic Computing. 13 (3): 111–117. doi:10.1093/llc/13.3.111. ISSN 0268-1145.
  26. ^ Juola, Patrick (2007). "Authorship Attribution". Foundations and Trends® in Information Retrieval. 1 (3): 233–334. doi:10.1561/1500000005. ISSN 1554-0669.
  27. ^ a b Potthast, Martin; Stein, Benno; Eiselt, Andreas; Barrón-Cedeño, Alberto; Rosso, Paolo (2009), "Overview of the 1st International Competition on Plagiarism Detection", PAN09 - 3rd Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse and 1st International Competition on Plagiarism Detection (PDF), CEUR Workshop Proceedings, 502, pp. 1–9, ISSN 1613-0073, archived from the original (PDF) on 2 April 2012
  28. ^ Portal Plagiat - Softwaretest 2004 (in German), HTW University of Applied Sciences Berlin, retrieved 6 October 2011
  29. ^ Portal Plagiat - Softwaretest 2008 (in German), HTW University of Applied Sciences Berlin, retrieved 6 October 2011
  30. ^ a b Cite error: The named reference HTW10 was invoked but never defined (see the help page).
  31. ^ a b Potthast, Martin; Barrón-Cedeño, Alberto; Eiselt, Andreas; Stein, Benno; Rosso, Paolo (2010), "Overview of the 2nd International Competition on Plagiarism Detection", Notebook Papers of CLEF 2010 LABs and Workshops, 22–23 September, Padua, Italy (PDF)
  32. ^ a b Potthast, Martin; Eiselt, Andreas; Barrón-Cedeño, Alberto; Stein, Benno; Rosso, Paolo (2011), "Overview of the 3rd International Competition on Plagiarism Detection", Notebook Papers of CLEF 2011 LABs and Workshops, 19–22 September, Amsterdam, Netherlands (PDF)
  33. ^ Stein, Benno; Lipka, Nedim; Prettenhofer, Peter (2011), "Intrinsic Plagiarism Analysis" (PDF), Language Resources and Evaluation, 45 (1): 63–82, doi:10.1007/s10579-010-9115-y, ISSN 1574-020X
  34. ^ Potthast, Martin; Barrón-Cedeño, Alberto; Stein, Benno; Rosso, Paolo (2011), "Cross-Language Plagiarism Detection" (PDF), Language Resources and Evaluation, 45 (1): 45–62, doi:10.1007/s10579-009-9114-z, ISSN 1574-020X
  35. ^ Gipp, Bela; Beel, Jöran (June 2010), "Citation Based Plagiarism Detection - A New Approach to Identifying Plagiarized Work Language Independently", Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT'10) (PDF), ACM, pp. 273–274, doi:10.1145/1810617.1810671, ISBN 978-1-4503-0041-4
  36. ^ Gipp, Bela; Meuschke, Norman; Beel, Jöran (June 2011), "Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag", Proceedings of 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’11) (PDF), ACM, pp. 255–258, doi:10.1145/1998076.1998124, ISBN 978-1-4503-0744-4
  37. ^ "Plagiarism Prevention and Detection - On-line Resources on Source Code Plagiarism" Archived 15 November 2012 at the Wayback Machine. Higher Education Academy, University of Ulster.
  38. ^ Roy, Chanchal Kumar; Cordy, James R. (26 September 2007). "A Survey on Software Clone Detection Research". School of Computing, Queen's University, Canada.