Utilizing Translation to Enhance NLP Models in Offensive Language and Hate Speech Identification


Sandy Kurniawan
Indra Budi

Abstract

The number of social media users in Indonesia has grown in recent years, and with it the amount of offensive language on these platforms. Because offensive language can trigger conflict between users, its use on social media needs to be identified. This study focuses on identifying offensive language, hate speech, and hate speech targets on Twitter. The data come from previous research on offensive language and hate speech identification. Because the amount of training data strongly influences classification performance, this study adds data through translation. Classical machine learning algorithms (SVM and others) and deep learning algorithms (BiLSTM, CNN, and LSTM) serve as classifiers, with word n-grams and word embeddings as features. Three scenarios were run, differing in the training data used to build the classification model. The results show that scenario 3, which uses translation for data augmentation, can improve the classification model’s performance by 5%.
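The two ingredients named in the abstract, translation-based data augmentation and word n-gram features, can be illustrated with a minimal sketch. It is not the authors' pipeline. The abstract does not say what was translated or in which direction, so the sketch assumes labelled examples in another language are translated into Indonesian and appended to the original training set. The names `word_ngrams`, `augment_with_translation`, `toy_translate`, and `TOY_LEXICON`, along with the example sentences and labels, are hypothetical, and the word-by-word lookup is only a stand-in for a real machine-translation system.

```python
from collections import Counter
from typing import Callable, Iterable

Example = tuple[str, str]  # (text, label)


def word_ngrams(text: str, n_max: int = 2) -> Counter:
    """Count word n-grams of length 1..n_max in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def augment_with_translation(
    original: Iterable[Example],
    foreign: Iterable[Example],
    translate: Callable[[str], str],
) -> list[Example]:
    """Return the original examples plus translated copies of the foreign ones.

    Labels are carried over unchanged, which assumes translation preserves them.
    """
    augmented = list(original)
    for text, label in foreign:
        augmented.append((translate(text), label))
    return augmented


# Hypothetical toy lexicon standing in for a machine-translation system.
TOY_LEXICON = {
    "you": "kamu",
    "are": "",
    "stupid": "bodoh",
    "good": "selamat",
    "morning": "pagi",
}


def toy_translate(text: str) -> str:
    """Translate word by word with TOY_LEXICON, dropping words mapped to ''."""
    words = (TOY_LEXICON.get(w, w) for w in text.lower().split())
    return " ".join(w for w in words if w)


# Invented examples, for illustration only.
indonesian_train = [
    ("dasar kamu bodoh", "offensive"),
    ("selamat pagi semua", "not_offensive"),
]
english_extra = [
    ("you are stupid", "offensive"),
    ("good morning", "not_offensive"),
]

train = augment_with_translation(indonesian_train, english_extra, toy_translate)
features = [word_ngrams(text) for text, _ in train]
```

The resulting n-gram counts could then be vectorized and passed to a classifier such as an SVM. Carrying labels across translation unchanged is itself an assumption, since offensiveness does not always survive translation.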





 

Article Details

How to Cite
Kurniawan, S., & Budi, I. (2024). Utilizing Translation to Enhance NLP Models in Offensive Language and Hate Speech Identification. Jurnal Improsci, 1(4), 182–197. https://doi.org/10.62885/improsci.v1i4.187
Author Biographies

Sandy Kurniawan, Universitas Diponegoro

 

 

Indra Budi, Universitas Indonesia

 



 
