MemeCLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Memotion Analysis

Published online: Jun 1, 2026
DOI: https://doi.org/10.24138/jcomss-2025-0150
Authors:
Vaishali Ganganwar, Gaurav Singh Chauhan, Jangveer Singh, Shashvat Khajuria, Vivek Battan

Abstract

Memes, which commonly spread humor, ideas, or even harmful material such as hate and propaganda, are a significant part of Internet culture. A meme consists of an image and supporting text. Memotion Analysis, or meme emotion analysis, is the automatic processing of memes using artificial intelligence. Unimodal approaches are increasingly being replaced by multimodal ones that rely on feature concatenation, weighted fusion, or the Gated Multimodal Unit (GMU) for better Memotion Analysis. In this work, we propose two deep-learning-based multimodal models for meme emotion classification. The first model encodes the image with ResNet and the text with DeBERTa separately and then fuses the two representations. The second model, ‘MemeCLIP’, combines an integrated CLIP-based representation with a GMU, whose gating mechanism adaptively fuses visual and textual features. In contrast to simple concatenation, the GMU is better able to extract the fine-grained emotional cues embedded in memes. On Task 8 (Memotion Analysis) of the SemEval-2020 competition, the CLIP-based ‘MemeCLIP’ model achieved an F1-score of 0.65, closely followed by the ResNet+DeBERTa model at 0.64, compared with the SemEval baseline of 0.5118. These findings demonstrate the strength of selectively regulating the contribution of each modality.
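
To make the gated fusion idea concrete, the following is a minimal sketch of a bimodal GMU in plain Python. It follows the commonly used GMU formulation, in which a sigmoid gate computed from both inputs blends the tanh-activated image and text projections. It is not the authors' implementation: the abstract does not give layer sizes or training details, so the dimensions, the random weights, and the function names (`gmu_fuse`, `random_matrix`, `matvec`) are illustrative assumptions, and the input vectors merely stand in for CLIP image and text embeddings.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def sigmoid(value):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def gmu_fuse(x_img, x_txt, w_img, w_txt, w_gate):
    """Bimodal gated fusion: h = z * h_img + (1 - z) * h_txt."""
    # Project each modality into a shared hidden space.
    h_img = [math.tanh(v) for v in matvec(w_img, x_img)]
    h_txt = [math.tanh(v) for v in matvec(w_txt, x_txt)]
    # The gate sees both modalities and emits one weight per hidden unit.
    z = [sigmoid(v) for v in matvec(w_gate, x_img + x_txt)]
    fused = [g * a + (1.0 - g) * b for g, a, b in zip(z, h_img, h_txt)]
    return fused, z, h_img, h_txt


# Toy sizes chosen for illustration only (assumption, not from the paper).
IMG_DIM, TXT_DIM, HIDDEN = 8, 6, 4
rng = random.Random(0)

# Stand-ins for image and text embeddings from a pretrained encoder.
x_img = [rng.uniform(-1.0, 1.0) for _ in range(IMG_DIM)]
x_txt = [rng.uniform(-1.0, 1.0) for _ in range(TXT_DIM)]

w_img = random_matrix(HIDDEN, IMG_DIM, rng)
w_txt = random_matrix(HIDDEN, TXT_DIM, rng)
w_gate = random_matrix(HIDDEN, IMG_DIM + TXT_DIM, rng)

fused, gate, h_img, h_txt = gmu_fuse(x_img, x_txt, w_img, w_txt, w_gate)
```

Because each gate value lies strictly between 0 and 1, every fused feature is a convex combination of the corresponding image and text features. This is the sense in which the gate regulates how much each modality contributes, whereas plain concatenation passes both through unweighted.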

Keywords

Meme Classification, Memotion Analysis, CLIP, DeBERTa, ResNet
Creative Commons License 4.0
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.