Adrian Barnett, a statistician at Australia’s Queensland University of Technology, pointed out some familiar faces as he scrolled through an online image dataset. Sylvester Stallone appears as Rambo, and again on the red carpet. “This is absolutely ridiculous,” Barnett said. George Clooney, Angelina Jolie, and Daniel Craig all appear multiple times, often in the same image. “As you can see, this is a comically bad dataset,” Barnett said.
This particular dataset, collected in a folder titled “droopy” and hosted on an open source repository called Kaggle, was used in a Scientific Reports paper, not as a celebrity-spotting game but as a training set for a predictive clinical model for early stroke detection.1
This paper is the latest example of a broader issue that Barnett and his Ph.D. student Alexander Gibson have documented on Kaggle. Kaggle is owned by Google and hosts user-uploaded datasets that researchers and machine learning practitioners can use to build predictive models. By examining two other Kaggle datasets on stroke and diabetes, both containing tabular patient data, Gibson and Barnett tracked how the data makes its way through the scientific literature and, in some cases, into clinical use. Their work, described in a preprint posted to medRxiv in February, has already led to the retraction of several papers that used these questionable datasets.2
After looking through so many dubious datasets for the work that led to the preprint, Gibson said the Scientific Reports paper was easy to find. “I just searched for ‘Kaggle’ and ‘stroke’ on Google Scholar,” Gibson said. “This was just one of the first things that came up.” The paper, published in December, uses two datasets purporting to show images of people who have had strokes to train a model that detects strokes in real time and facilitates “rapid clinical intervention.” One of the datasets has since been removed from Kaggle.
In the “droopy” dataset, which remains online, Barnett and Gibson found through reverse image searches that many images depicted Bell’s palsy, alongside images of children and toddlers (and celebrities). Despite the obvious overlap, the dataset’s Kaggle page claims it contains 1,024 images of “various patients” and says it is for educational purposes. “This is clearly not suitable for serious research. It is ethically and scientifically inappropriate,” Barnett said. “Given the basic checks, there is no reason why this should be used.”
After we contacted Springer Nature, the journal added an editor’s note to the paper, alerting readers to concerns about the reliability of the data in the paper and warning that further editorial action may be taken following the investigation. The paper’s corresponding author, Alaa Mohamed of Mansoura University in Egypt, did not respond to our requests in time for publication.
Kaggle has faced intense scrutiny before over the reliability of its data. In December, The Transmitter reported that Springer Nature had taken action on approximately 40 publications that used datasets of children’s faces to train models without consent or validation.
For researchers, this latest discovery is just one example of a problem that could extend to thousands of papers across multiple online data repositories. Gibson first encountered questionable data while searching for clinical predictive model datasets for his Ph.D. He quickly discovered Kaggle and the many datasets hosted there. “Then I thought, ‘Where did they come from?’” he said. “And I kept searching, I kept searching, and I couldn’t find any information.”
To illustrate this problem, Gibson and Barnett focused on two datasets, one on stroke and one on diabetes, and identified 124 papers that built models based on them. Both datasets failed a checklist covering the who, when, where, and why of data provenance in clinical prediction models, the researchers reported in the medRxiv preprint.
Anyone doing a basic check on these datasets will quickly see that they don’t look like real data, Gibson says. An April news article in Nature about their findings detailed how the dataset contained thousands of duplicate patient observations and had very few missing values, which is unlikely for a dataset containing real-world patient data.
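The kind of basic check Gibson describes can be sketched in a few lines of Python. This is only an illustration, not the method used in the preprint: the function names, the list of values treated as missing, and the toy records are all assumptions made for the example.

```python
from collections import Counter

# Values treated as "missing" here are an assumption; real files vary.
MISSING = {None, "", "NA", "N/A", "nan"}


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def missing_fraction(rows):
    """Return the share of cells whose value is in MISSING."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 0.0
    return sum(value in MISSING for value in cells) / len(cells)


# Hypothetical toy records, not drawn from any real dataset.
toy_rows = [
    {"age": 67, "bmi": 36.6, "stroke": 1},
    {"age": 67, "bmi": 36.6, "stroke": 1},  # exact repeat of the first row
    {"age": 49, "bmi": None, "stroke": 0},
    {"age": 80, "bmi": 32.5, "stroke": 1},
]

print(count_duplicate_rows(toy_rows))
print(missing_fraction(toy_rows))
```

A large number of exact duplicates together with a missing-value share close to zero would not prove that a dataset is synthetic, but it is the kind of pattern that should prompt questions about where the data came from.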
When Gibson and Barnett raised these concerns on PubPeer, an author of one of the papers based on Kaggle data responded by citing 25 other articles that used the same dataset. “The continued presence of this substance in the current literature indicates that it continues to be widely accepted as a resource for experimental evaluation in this field of research,” wrote corresponding author Naeem Ramzan.
The paper, published in Scientific Reports, was retracted in April because the authors failed to provide information about the origin or accuracy of the data, according to the notice.3 “I don’t have much sympathy for people who used this data thinking it was real, because they didn’t do the basics,” Gibson said.
The majority of studies flagged in the preprint made practical recommendations for the use of the model in patients, and most did not include an ethical statement. At least two models have publicly available websites, one linked to medical device patents filed at the California Institute of Technology and the University of Southern California. One article stated that the model described would be used in an Indonesian hospital, another claimed that the model was successful in diagnosing stroke, and yet another said the authors were implementing the model in a local cardiac clinic.4-6
Some of the papers tried to determine where the stroke data came from. Two referred to clinics in Bangladesh, one referred to “authoritative medical institutions” such as AIMS and the WHO, one referred to clinical volunteers, and one referred to McKinsey & Company’s electronic health records. Barnett said most of them are “obviously false” because “they say the data sets come from different sources.” One paper acknowledged the lack of provenance information but nevertheless made clinical recommendations.
Ben Van Calster, a biostatistician at the University of Leuven who helped develop the guidelines on data provenance, said the results were not surprising. “This paper explains the problem very clearly and in detail,” he told Retraction Watch. Van Calster’s research has documented similar problems in predictive models for COVID-19, finding that most have a high risk of bias, with image-based models in particular having the worst problems with data quality.7
Eleven of the papers using the problematic datasets were published by Springer Nature. Three of them, in Scientific Reports, were retracted because the authors failed to provide information about the origin or accuracy of the data. Three others published in the journal are under investigation.
Tim Kersjes, Springer Nature’s head of research integrity, said the investigation is ongoing. “We will take further editorial action as necessary on a case-by-case basis,” he said in a statement, adding that authors should have sufficient time to address concerns. A spokeswoman for Elsevier, which published nine of the papers in question, said it would look into the matter. MDPI said it was aware of the issue and that an investigation into the paper was ongoing.
Barnett and Gibson said all online tools based on these datasets should be taken down until the data’s origins can be verified, and that all 124 articles should carry an expression of concern.
“Of course, the repository can’t really control whether or not everyone uses this data the way it’s supposed to be used,” Van Calster says. “So I think repositories need to improve their documentation.”
A Kaggle spokesperson said the platform relies on community self-reporting regarding metadata and provenance. They said that while the use of synthetic data on Kaggle is “perfectly legal,” “these datasets are intended for benchmarking and development purposes, and not as primary evidence for medical research or decision-making.” The spokesperson said the datasets in Barnett and Gibson’s preprint do not violate Kaggle’s terms of use, and that any that did would be removed.
When Gibson brought his concerns about misuse of the dataset to Kaggle, a representative said they were working on ways to better highlight synthetic data. In a chat exchange we reviewed, the representative added: “It’s unclear what kind of response you’re looking for to this request.”
Barnett and Gibson found that the articles they identified were cited in 86 review articles. “There’s a certain cleaning effect there,” Barnett said. Once these papers are included in a meta-analysis, “few people bother to look back.”
Barnett and Gibson say Kaggle’s incentive structure and the pressure to publish both help explain the proliferation of these questionable models. Barnett said the platform awards rankings and badges to users who upload popular datasets, which they can then list on their resumes.
Gibson said the list of questionable data sets continues to grow. “It’s very simple,” Gibson said. “Are you thinking about the patient or the paper?”
This article was written by Lori Youmshajekian and first published by Retraction Watch.
