International Conference on Recent Progresses in Science, Engineering and Technology

Professor Dr. M. Moshiul Hoque

Biography

Department of Computer Science and Engineering, CUET, Bangladesh
&
Chair, IEEE Bangladesh Section

Title of the Invited Talk: A Multimodal Framework to Detect Target Aware Aggression in Memes

Abstract: Internet memes have become a powerful means for individuals to express emotions, thoughts, and perspectives on social media. While often considered a source of humor and entertainment, memes can also disseminate hateful content targeting individuals or communities. Most existing research focuses on the negative aspects of memes in high-resource languages, overlooking the distinctive challenges of low-resource languages such as Bengali (also known as Bangla). Furthermore, previous work on Bengali memes has concentrated on detecting hateful memes, with no attention to detecting the entities they target. To bridge this gap and facilitate research, we introduce two novel multimodal Bengali datasets: BHM (Bengali Hateful Memes) and MIMOSA (MultIMOdal aggreSsion dAtaset). BHM consists of 7,148 memes with Bengali and code-mixed captions, annotated for two tasks: (i) detecting hateful memes and (ii) detecting the social entities they target (i.e., Individual, Organization, Community, and Society). MIMOSA comprises 4,848 annotated memes across five aggression target categories: Political, Gender, Religious, Others, and Non-aggressive. This talk covers two multimodal aggressive-meme classification tasks and presents their outcomes on these two datasets. For the first task, we present DORA (Dual cO-attention fRAmework), a multimodal deep neural network that systematically extracts salient features from each modality of a meme and jointly evaluates them with modality-specific features to better capture context. Our experiments show that DORA is effective on our low-resource hateful meme dataset and outperforms several state-of-the-art baselines. For the second task, we introduce MAF (Multimodal Attentive Fusion), a simple yet effective approach that uses multimodal context to detect aggression targets. MAF captures selective modality-specific features of the input meme and jointly evaluates them with individual modality features. Experiments on MIMOSA show that the proposed method outperforms several state-of-the-art approaches.
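
To give a concrete sense of the general idea behind such cross-modal co-attention fusion, the sketch below shows a minimal, hypothetical PyTorch model in which text and image features attend to each other and the attended features are fused with the original modality-specific features for classification. It is an illustrative assumption-laden example (feature extractors, dimensions, pooling, and class counts are all made up), not the authors' DORA or MAF implementation.

```python
# Minimal sketch (NOT the authors' code) of a dual co-attention fusion classifier:
# each modality queries the other, then attended and original features are fused.
import torch
import torch.nn as nn

class DualCoAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=5):
        super().__init__()
        # Project both modalities into a shared hidden space (dims are assumptions)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Cross-modal attention in both directions (text -> image, image -> text)
        self.text_attends_image = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, text_len, text_dim), e.g. token embeddings of the caption
        # image_feats: (batch, num_regions, image_dim), e.g. visual region features
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)
        # Co-attention: each modality attends over the other
        t_attended, _ = self.text_attends_image(query=t, key=v, value=v)
        v_attended, _ = self.image_attends_text(query=v, key=t, value=t)
        # Pool over the sequence/region dimension and fuse attended + original features
        fused = torch.cat([
            t.mean(dim=1), v.mean(dim=1),
            t_attended.mean(dim=1), v_attended.mean(dim=1),
        ], dim=-1)
        return self.classifier(fused)

# Example with random tensors standing in for caption and image encoders
model = DualCoAttentionFusion()
logits = model(torch.randn(2, 32, 768), torch.randn(2, 49, 512))
print(logits.shape)  # torch.Size([2, 5])
```

The five output classes here simply mirror the five aggression target categories of MIMOSA; in practice the same fusion backbone could be given a different classification head for the hateful-meme and target-entity tasks of BHM.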