How To Fool A Plagiarism Detector

By Neuroskeptic | April 17, 2014 4:25 am

Should you trust plagiarism detection software?

In my view, no – we should never treat an automated plagiarism report as definitive evidence, whether positive (as proof of plagiarism) or negative (as proof of innocence). These tools are useful for rapidly screening texts to raise red flags, but once a suspicion is raised, only old-fashioned manual checking can determine originality or otherwise.

In this post I’ll explain why – but first, a little backstory.

Five months ago, I argued that certain materials published by a new British ‘research ethics organization’, called PIE, contained similarities to other, uncited sources. For more on PIE, see these posts.

Shortly after I posted, PIE put up a Disclaimer. In the past week they’ve gone on the defensive again with a blog post which, while not naming me, is clearly aimed in my direction. (The comment thread is quite entertaining.)

This is where those plagiarism detectors come in. In their “Disclaimer”, echoed in the blog post, PIE report that all of their text is rated as original by two automated plagiarism checkers: Grammarly and iThenticate.

I have no doubt that that’s true, but it doesn’t impress me much. Most of these detectors rely on spotting strings of text that are identical between two sources. So they can pick up naked copy and pasting, but they can be fooled quite easily.

All a hypothetical plagiarist needs to do, to evade such software, is to make sure that no run of more than, say, three or four consecutive words is identical to the source. So they can copy and paste, so long as they change the word order a bit, add or remove some filler words like ‘the’, ‘and’, ‘but’, and replace a few words with synonyms. I call this text laundering.
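
Here’s a toy sketch of that idea in Python. To be clear, this is not how Grammarly or iThenticate actually work (their internals aren’t described here); it’s just a minimal matcher that flags a text if it shares a run of at least `min_run` consecutive words with a source. The function names, the five-word threshold and the sample sentences are all invented for illustration.

```python
import re


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def longest_shared_run(a, b):
    """Length of the longest run of consecutive words found in both texts."""
    wa, wb = words(a), words(b)
    best = 0
    # prev[j] = length of the shared run ending at the previous word of `a`
    # and the j-th word of `b` (classic longest-common-substring table).
    prev = [0] * (len(wb) + 1)
    for x in wa:
        cur = [0] * (len(wb) + 1)
        for j, y in enumerate(wb, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_flagged(text, source, min_run=5):
    """Hypothetical rule: flag only if min_run or more words match in a row."""
    return longest_shared_run(text, source) >= min_run


# Made-up sample sentences, not taken from any real document.
source = "The quick brown fox jumps over the lazy dog"
laundered = "The quick brown fox then jumps over a lazy dog"

print(is_flagged(source, source))     # True: a verbatim copy
print(is_flagged(laundered, source))  # False: two small edits break the runs
```

A verbatim copy shares the full nine-word run and gets flagged, while the lightly edited version never shares more than four words in a row, so under this assumed rule it passes.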

To show how easily text could hypothetically be laundered, I took some of PIE’s own text (from here):

PIE Original: You are invited to join the Publication Integrity and Ethics (herein referred to as PIE) as one of its founding members. PIE, a not-for profit organisation, offers free membership to all interested individuals. Please join us and become part of this exciting new movement in the world of publishing ethics; it is the professional home for authors, reviewers, editorial board members and editors-in-chief.

Now let’s copy, paste, wash and rinse…

Neuroskeptic: You are invited to join Publication Integrity and Ethics (herein referred to as PIE) and become one of its founding members. PIE, a not-for profit organisation, offers interested individuals free membership. Please join this exciting new movement in the publishing ethics world; PIE is the professional home for reviewers, editorial board members, authors, and editors-in-chief.

If that’s not plagiarism, I don’t know what is. But Grammarly’s verdict? “The text in this document is original.”

[Image: Grammarly]

Importantly, Grammarly does ring the plagiarism alarm if you enter PIE’s original text. This proves that PIE’s website is part of Grammarly’s database of sources. So the software should have detected my ‘plagiarism’. But it didn’t. This is why I always take these tools with a pinch of salt, and why I’m not impressed by PIE’s Disclaimer (although please note – I have never accused PIE of ‘plagiarism’. They introduced that word into this discussion, not I. I just talk about similarities.)

There are other lessons to learn from this saga. Consider, for instance, that a few days ago, PIE released a new bit of their disclaimer, “Examined Documents”. They now say that

Our authors have examined several documents at the time of writing the contents of the Publication Integrity & Ethics [PIE] website and its guidelines. Hence, it is natural that we include the list of these documents as our references. Please see the list.

It is indeed ‘natural’ for authors to reference their sources, but it seems that for the first few months of the site’s existence, they didn’t do so. Which I guess made them… unnatural?

Anyway, the list vindicates what I said in my very first PIE post: I said that some of PIE’s content was similar to the Australian Press Council’s newspaper guidelines, and publisher Elsevier’s editorial policies – and they now reference both of those sources. If they’d only done that from the start, I wouldn’t have written my post.

The lesson here? Acknowledge your sources from the start. Because the longer you leave it, the worse your eventual climbdown will look.

But something is conspicuous by its absence from PIE’s reference list: any mention of the Committee on Publication Ethics (COPE) guidelines. Yet as I said previously, several areas of PIE’s work appear similar to COPE’s.

Consider the PIE Peer Reviewer Guidelines. The last bit of PIE’s document (parts 7.1-8.5) consists of 16 points. In my estimation, the same ideas all appear in a section of COPE’s Guidelines for Peer Reviewers. The wording differs somewhat (though in many cases, only slightly), but the content is essentially the same.

Crucially, the 16 ideas appear in exactly the same order in both documents – despite the fact that the COPE document also contains additional statements with no PIE equivalent, interspersed among the ones that are similar. If we designate the PIE statements in order as A-P, we find that the COPE equivalents also appear in the order A-P.

[Figure: reviewer_structure]

How odd. Perhaps it’s just a coincidence. How much of a coincidence? Well, to calculate the number of possible ways to order a given number of items (permutations), we need the mathematical factorial function, written X!: there are X! ways to order X items. 16! ≈ 2.09*10^13, so there are roughly 21 trillion unique orderings of those 16 items.
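
That figure is easy to check with a couple of lines of Python. The variable names are mine, and the last line assumes every ordering is equally likely, which is a simplification rather than anything established about how such guidelines get written.

```python
import math

n_points = 16                          # the 16 statements, A to P
orderings = math.factorial(n_points)   # number of ways to order them

print(orderings)                       # 20922789888000
print(f"{orderings:.2e}")              # 2.09e+13

# Chance of hitting one particular order if all orderings were equally
# likely (an assumption made purely for illustration).
print(1 / orderings)
```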

So it’s quite a big coincidence, then. Why might PIE not want to credit COPE? We can but speculate. Perhaps the fact that PIE seems to be in direct competition with COPE might be relevant: they’re both organizations with “Publication” and “Ethics” in the title, who offer a set of best-practice guidelines for academics and academic publishers. Although COPE has been around for 17 years, not 5 months.

It might be embarrassing to admit a debt to one’s rivals… but it’s more embarrassing not to admit it. So this is the final lesson here: there’s nothing wrong with being influenced by your predecessors, and no-one will care if you’re transparent about it.

CATEGORIZED UNDER: ethics, funny, PIE, select, Top Posts, Uncategorized
  • Enrico Glerean

    Funny coincidence: Grammarly is the advert at the top of this page :)

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Well, not so much of a coincidence – it picks the adverts based on the content.

      They need a smarter algorithm so they only match products to posts praising those products…

    • NastyGash

      What? You don’t have an ad blocker?

  • ohwilleke

    The point re the deficiencies of the software is well taken, but the 16! point gilds the lily and isn’t necessarily very meaningful, because while there may be 16! possible ways of ordering statements in a document, a variety of considerations of logical presentation and stylistic convention almost always produce a profoundly shorter list of likely arrangements.

    For example, while there are all sorts of ways that you can organize headings in an appellate brief, it is not at all unusual for parties operating independently to address the same issues in the same order.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      I agree with your point in general, but I disagree that this is such a case.

      While it may be true that there are fewer than 16! ‘logical’ ways of ordering these statements, there are many more than 1.

      For instance, points C, D, E, and F all deal with situations in which a reviewer should decline to review. I can see no reason why they couldn’t appear in another order like D, C, F, E – indeed there are 4! permutations of that ‘block’ alone.

      Likewise I, J and K, 3!

      Likewise L, M, N, O and P, 5!

      3! * 4! * 5! makes 17280 permutations. But this is an underestimate because it grants that a chronological ordering of the peer review process is the only logical way to arrange these statements. However it would also be possible to adopt another approach, let us say a ‘thematic’ one in which they were grouped under headings such as:

      “Integrity” – B C D E G J L
      “Professionalism” – A F H I K O
      “Commitment” – M N P

      And in fact this is how the PIE reviewer guidelines are organized in general, with headings such as “Conflicts of Interest” and “Confidentiality”. The 16 points in question here are merely the end of this document, headed “Expectations”.

  • John Mashey

    Indeed, copy-paste-edit can fool mechanical checkers and some manual analysis is often needed. For example, glance at the examples in pp.15-32 in See No Evil. The cyan-shaded text is word-for-word in order identical, the yellow shows trivial edits, with published text at left, antecedents at right.
    Many of these seem like:
    copy
    then make enough trivial edits to obscure it from some checker

    What I’ve found in doing hundreds of pages of this is that highlighting the cyan identical text lets one ignore that, and then the other issues become more apparent.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      That looks like great work. Highlighting is a favourite approach of mine. In this case, I highlit based on conceptual similarity and then bolded based on textual closeness.

  • http://www.paraphrasingmatters.com/ Chris Gayle

    To use someone else’s work and just minutely substitute words here and there is not deemed appropriate in the majority of academic institutions. The misuse may range from minute borrowing to extended borrowing. http://paraphrasing.co.uk

  • http://designerwebs-uk.co.uk Designerwebs-UK

    This really interests me being a blogger, but some of the new spinning software can get over just the changing of a few words by inserting code which Google cannot detect. Also what a boring job it must be to rewrite articles changing a few words etc

Neuroskeptic

No brain. No gain.

About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.
