Title | Hate Speech Detection is not as Easy as you may Think: A Closer Look at Model Validation (Extended Version) |
Authors | Ayme Arango, Jorge Pérez, Bárbara Poblete |
Publication date | 2021 |
Abstract |
Hate speech is an important problem that is seriously affecting the dynamics and usefulness of online social communities. Large-scale social platforms are currently investing important resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance but only within specific datasets, most of them in the English language. In this work, we analyze this apparent contradiction between existing literature and actual applications. We study closely the experimental methodology used in prior work and their generalizability to other datasets. Our findings evidence methodological issues, as well as an important dataset bias. As a consequence, performance claims of the current state-of-the-art have become significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-conduct experiments to give a more accurate picture of the current state-of-the-art methods. Moreover, we design some baseline approaches to perform cross-lingual experiments, using English and Spanish datasets. |
Volume | 105 |
Journal name | Information Systems |
Publisher | Elsevier Science (Amsterdam, The Netherlands) |