Can Public Code Smells Datasets Be Trusted?

Published online: Dec 15, 2025
DOI: https://doi.org/10.24138/jcomss-2025-0131
Authors:
Ruchin Gupta, Jitendra Kumar Seth, Anupama Sharma, Abhishek Goyal

Abstract

Code smells signal potential issues in a codebase and indicate technical debt. Early detection is crucial for maintaining code quality. Researchers often rely on public datasets to automate and enhance smell detection, but their trustworthiness is frequently assumed rather than verified. While these datasets are valuable for developing detection tools, key questions arise: Can they be fully trusted? Are the labels accurate? Do they reflect real-world software development? Recent studies reveal inconsistencies, biases, and misclassifications, raising concerns about their reliability. This paper examines the integrity of two widely used sets of public code smell datasets, referred to as the Group A and Group B datasets, by assessing their internal consistency and their alignment with established facts. Through this investigation, we aim to determine whether these datasets can be confidently used in research and practical applications, or whether their inherent issues undermine the validity of the results they produce. The Group A datasets are smaller, balanced, and factually aligned but lack industry relevance, while the Group B datasets deviate from known facts. The study acknowledges academic–industry differences, viewing divergence as a reflection of real-world variability rather than a flaw, and emphasizes the need for rigorous validation of public datasets to ensure reliable research outcomes.

Keywords

Code smells, code smell datasets, validation
Creative Commons License 4.0
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.