Abstract
Offensive language degrades social media environments. Existing research predominantly assumes monolingual expression, overlooking the prevalent practice of code-switching (CS). To address this gap, this study identifies and empirically validates the distinct stylometric characteristics of code-switched (CSed) offensive language. We also develop methods to construct the first social media dataset specifically for CSed offensive content. Our analysis of this dataset shows that CSed offensive language exhibits unique stylometric characteristics and that these characteristics vary between the language segments involved in the CS. Furthermore, incorporating these features significantly enhances the performance of offensive language detection models. These findings have research and practical implications for social media researchers, platforms, moderators, and users.