Welcome to Science with Shrike! Happy New Year! In light of the recent high-profile plagiarism allegations against ex-Harvard president Claudine Gay, we will cover some of the tools used to detect plagiarism in academia. This is not a new problem, and the tools have come a long way in the last 20-30 years.
Plagiarism Overview
Plagiarism is one subset of ‘research misconduct’. Plagiarism is passing off prior work, usually (but not always) done by others, as your current, new work. Plagiarism takes slightly different forms between the Humanities and the Sciences, due to the nature of the work. Science relies on data and reproduction, so most research misconduct is faking data instead of copy/pasting essays and ideas. However, copy/paste and passing off others’ work as your own still happen in the sciences. In the Humanities, where there is minimal data, taking ideas without credit and paraphrasing without attribution happen far more often. Given Shrike’s biomedical background, we’ll focus on biomedical sciences. The tools will still help for the Humanities.
We will cover four tools, two free and two paid:
Retraction Watch
PubPeer
iThenticate
Image Twin
Retraction Watch
Retraction Watch tracks retractions, editorial resignations, hijacked journals and other misconduct associated with publications. This is your first stop to see if potential plagiarism or other misconduct has already led to retractions, and to see which authors already have retracted papers. They also maintain a database, so you can check for retracted papers from your least favorite authors or institutions. Retraction Watch does journalism, so they do not weigh in on whether retractions should happen. They report on retractions that have happened.
PubPeer
A lot of the data duplication/research misconduct/plagiarism discussions occur on PubPeer. This is billed as an ‘online journal club’, which any scientist knows means ‘bond with strangers by trashing others’ articles online’. One strength of this site is that you can search it to see if flags have already been raised about you or your favorite author.
You can also post your own concerns about the papers in question. For example, if you search for David Sabatini, the first hit comes up about a Molecular Cell paper. The linked thread shows what people look for with Western blots, and how authors address these concerns. Concerns about declaring conflicts of interest may also be posted.
For a new concern, enter the PubMed ID into the search box to access the paper. Then start commenting away, along with the evidence. If the paper doesn’t have a PubMed ID, and it’s in the biomedical sciences, why waste your time? bioRxiv and medRxiv are pre-print servers, so any duplication will be cleaned up before submission to a journal. The authors will thank you for catching their mistake. Maybe it was an honest mistake. Maybe not. But they’ll fix it before publication.
iThenticate
The workhorse of plagiarism detection is iThenticate. This is a paid solution that compares (and typically adds) a submitted document against iThenticate’s database of documents. It will also let you compare several documents against a “master” document without adding them to the database. Many universities have iThenticate integrated into their online Learning Management System (e.g. Blackboard or Canvas), so that assignments are automatically scanned and added to the database.
Many journals also use iThenticate. Some run the manuscript at submission, some later. Shrike had one journal ask Shrike to run the manuscript and submit the report along with the manuscript.
iThenticate will give a percent similarity report, and list all the documents that are similar by section. In practice, setting a cut-off that is considered acceptable (e.g. 20% similarity) is needed when handling lots of documents. A score above the threshold usually triggers manual review. Depending on what was the same, it may be ok, or it may have the potential for plagiarism. One area that triggers false positives is the methods section of a paper. There are only so many ways you can say “HEK cells were grown in DMEM supplemented with 10% fetal bovine serum at 37 C and 5% CO2”. And if you’re publishing multiple papers with the same technique, you may sit down and write it the exact same way. But if you’re ripping off one source, and it’s not yours, that will come out.
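To make the idea of a percent similarity score and a review cut-off concrete, here is a minimal sketch. It is not how iThenticate works internally (that is not described here); it assumes a simple stand-in technique, overlap of five-word sequences ("shingles"), and the function names and the 20% default are illustrative only.

```python
import re

def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in text."""
    words = re.findall(r"[a-z0-9%]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def percent_similarity(submission, source, n=5):
    """Percent of the submission's n-word sequences that also appear in source."""
    sub = shingles(submission, n)
    if not sub:
        return 0.0  # too short to compare
    return 100.0 * len(sub & shingles(source, n)) / len(sub)

def needs_review(submission, sources, threshold=20.0, n=5):
    """Return (source_name, percent) pairs at or above threshold, highest first.

    sources is a dict of hypothetical source names to their text.
    """
    hits = [(name, percent_similarity(submission, text, n))
            for name, text in sources.items()]
    flagged = [h for h in hits if h[1] >= threshold]
    return sorted(flagged, key=lambda h: -h[1])

methods = ("HEK cells were grown in DMEM supplemented with 10% fetal "
           "bovine serum at 37 C and 5% CO2")
sources = {
    "own earlier paper": methods,
    "unrelated paper": "Mice were housed under a twelve hour light and dark cycle",
}
for name, pct in needs_review(methods, sources):
    print(name, round(pct, 1))
```

The methods sentence matches the author's own earlier paper completely, which is exactly the kind of hit a manual reviewer would wave through. The score only tells you where to look.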
On the other side, iThenticate will underestimate plagiarism if there is a lot of extraneous text that is different.
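The dilution effect is simple arithmetic. As a rough sketch, assume the whole-document score behaves like copied words divided by total words (a simplification for illustration, not iThenticate's actual formula). The same 500 copied words then look very different depending on how much original text surrounds them:

```python
def whole_document_percent(copied_words, total_words):
    """Simplified score: share of the document's words that were copied."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * copied_words / total_words

# The same 500 copied words inside documents of increasing length.
for total in (500, 2500, 10000):
    print(total, whole_document_percent(500, total))
```

Under this assumption, a lifted 500-word passage sits right at a 20% cut-off in a 2,500-word paper and falls well under it in a 10,000-word dissertation chapter, which is why the per-source breakdown matters more than the headline number.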
Image Twin
As mentioned in the PubPeer section, in the sciences, the greatest potential for misconduct lies in the data, often presented as images. One paid solution to this is Image Twin. Image Twin does plagiarism detection for images, along with other unethical manipulations.
If you’re too cheap for Image Twin, you can try comparing images one by one with various other online tools.
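For a sense of what a one-by-one image comparison does under the hood, here is a minimal sketch of one common technique, an "average hash". This is an assumption for illustration and not Image Twin's method. It works on a small grid of grayscale values (0-255); in practice you would first load each image and shrink it to a small grid with an image library, which is left out here. The pixel grids below are made up.

```python
def average_hash(pixels):
    """Turn a grid of grayscale values into bits: 1 if brighter than the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]

def hamming(a, b):
    """Count the positions where two equal-length hashes differ."""
    if len(a) != len(b):
        raise ValueError("hashes must be the same length")
    return sum(x != y for x, y in zip(a, b))

# Made-up 4x4 grids: an original, the same image with small pixel noise
# (the sort of change compression introduces), and a different image.
original = [[10, 10, 200, 200], [10, 10, 200, 200],
            [10, 200, 200, 10], [10, 10, 10, 10]]
recompressed = [[12, 8, 197, 203], [9, 13, 202, 198],
                [11, 198, 201, 12], [8, 10, 13, 9]]
different = [[200, 200, 10, 10], [200, 200, 10, 10],
             [10, 10, 10, 10], [10, 10, 200, 200]]

print(hamming(average_hash(original), average_hash(recompressed)))
print(hamming(average_hash(original), average_hash(different)))
```

A small distance suggests the same underlying image even after recompression, and a large one suggests different images. A hash like this will not catch rotated, cropped or spliced panels, so treat a match as a reason to look closer and not as proof.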
Caveats on Plagiarism Hunting
Plagiarism is a serious and important charge. Please do not make accusations without strong evidence. Remember that you will be judged on your weakest piece of evidence, not your strongest. For formal complaints, it is a good idea to see if the threshold of evidence is ‘beyond reasonable doubt’ or ‘preponderance of evidence’ and then prepare accordingly.
From experience, proving plagiarism can be challenging. Institutions often give every possible beneficial reading to the accused. International students may claim it is ‘acceptable in their culture’. Authors can claim ‘unintentional mistake’.
Public accusations of plagiarism can land you in defamation lawsuit territory, especially if your claims are weak and the accused is powerful. Notifying the institution and/or funding agency may be the most effective route. Notifying the authors gives them time to cover their tracks.
There are also some things that you might be convinced are plagiarism, but do not meet the required threshold.
First, most Open Access licenses allow reproduction and modification of the work, as long as the original work is cited.
Second, most journals allow reproduction of published journal articles in student theses, and the data in student theses do not count as ‘prior publication’. The thesis will likely contain additional information, but pasting a journal article in the thesis as a chapter is standard practice. There should be a note that the chapter was published, and a journal citation, but if the chapter was published after the dissertation was submitted, you might not find that.
Third, some may overinterpret image differences due to compression artifacts and other changes that occur on an image’s journey from a file on the author’s computer to the final, published product. Related to this is ‘acceptable practice at the time’: image handling that was standard when a paper was published may not meet today’s standards.
This article in Science covers the story of uncovering and dealing with high-profile misconduct. The stakes can be quite high, so please be careful. If you cannot convince a skeptical ally that it is plagiarism, it may be better to gather more data.
Two final notes. Most universities now have electronic dissertations that are searchable. The current crop of presidents, etc. are too old to have submitted their dissertations electronically, but some of these have been digitized. If not, a trip to the school’s library, and some time scanning + OCR will get you going. Or pay a student at that school to do it for you.
Sci-Hub (link constantly changes) has most of those journal PDFs that remain behind a paywall. PubMed Central has ones that are free to all now.
Happy hunting!