Leveraging XLM-RoBERTa with CNN and BiLSTM for Hinglish Toxicity Detection

Singhal, Nikita; Yadav, Avadhesh; Ankush; Singh, Giriraj; Kumar, Ronak

DOI: https://doi.org/10.24138/jcomss-2025-0133

Published online: Oct 20, 2025

Abstract

Toxicity in online communication, particularly in code-mixed languages like Hinglish, is a growing concern across social media platforms. Hinglish, a blend of Hindi and English, is widely used in informal online conversations, which makes it difficult for traditional toxicity detection models to identify harmful content accurately. The problem is compounded by the limited availability of resources and models trained specifically on Hinglish. This work presents the XLM-RoBERTa-CNN-BiLSTM (XCB) model, a novel architecture for toxicity detection in Hinglish across social media platforms. The XCB model is compared with the SOTA models mBERT, XLM-RoBERTa (XLM-R), and Indic-BERT on three publicly available datasets: Constraint, Facebook, and HASOC. It achieved macro F1 scores of 0.81, 0.73, and 0.82 and inference times of 0.24 s, 0.48 s, and 0.22 s on the Constraint, Facebook, and HASOC datasets, respectively. XCB not only outperforms existing romanized Hinglish models but also matches the macro F1 scores of existing SOTA multilingual models while requiring only half the training time and much lower inference times than those models, making it a much more efficient candidate for large-scale, real-time toxicity detection in Hinglish.
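
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1 score is computed: the F1 score is calculated separately for each class and then averaged without weighting, so a rare toxic class counts as much as a frequent non-toxic one. The function name, the label strings, and the toy predictions are illustrative assumptions and are not taken from the paper or its datasets.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not drawn from Constraint, Facebook, or HASOC.
y_true = ["toxic", "toxic", "non-toxic", "non-toxic", "non-toxic", "non-toxic"]
y_pred = ["toxic", "non-toxic", "non-toxic", "non-toxic", "non-toxic", "toxic"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

In this toy case the toxic class has an F1 of 0.5 and the non-toxic class 0.75, so the macro average is lower than the plain accuracy of 4/6, which illustrates why macro F1 is commonly preferred on imbalanced toxicity data.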

Keywords

Toxicity Detection, Hinglish, Code-Mixed Language, XCB Model, Real-Time Moderation, Multi-Lingual Models, Efficiency

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.