Utilizing Translation to Enhance NLP Models in Offensive Language and Hate Speech Identification
Abstract
The number of social media users in Indonesia has increased in recent years, and this growth has brought more offensive language to these platforms. Because offensive language can trigger conflicts between users, its use on social media needs to be identified. This study focuses on identifying offensive language, hate speech, and hate speech targets on Twitter. The data were obtained from previous research on identifying offensive language and hate speech. Because the amount of training data strongly influences classification performance, this study adds data through translation. Classical machine learning algorithms (SVM and others) and deep learning algorithms (BiLSTM, CNN, and LSTM) are used as classifiers, with word n-grams and word embeddings as features. Three scenarios were evaluated, distinguished by the training data used to develop the classification model. The results show that scenario 3, which uses translation for data augmentation, can improve the classification model's performance by 5%.
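The two ideas named in the abstract, adding training data through translation and using word n-grams as features, can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names, the label strings, and the lookup-table `toy_translate` (a stand-in for a real machine translation system) are all hypothetical, and the translation direction shown is an assumption.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_with_translation(
    data: List[Example], translate: Callable[[str], str]
) -> List[Example]:
    """Return the original examples plus one translated copy of each.

    The label of a translated example is copied from its source example.
    Empty or duplicate translations are skipped.
    """
    augmented = list(data)
    seen = {text for text, _ in data}
    for text, label in data:
        new_text = translate(text)
        if new_text and new_text not in seen:
            augmented.append((new_text, label))
            seen.add(new_text)
    return augmented


def word_ngrams(text: str, n: int) -> List[str]:
    """Split on whitespace and return the lowercased word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical stand-in for a translation system: a fixed lookup table.
TOY_TRANSLATIONS = {
    "you are rude": "kamu kasar",
    "have a nice day": "semoga harimu menyenangkan",
}


def toy_translate(text: str) -> str:
    """Return a canned translation, or an empty string if none is known."""
    return TOY_TRANSLATIONS.get(text.lower(), "")


if __name__ == "__main__":
    train = [("you are rude", "offensive"), ("have a nice day", "not_offensive")]
    for text, label in augment_with_translation(train, toy_translate):
        print(label, word_ngrams(text, 2))
```

In practice the n-gram lists would be turned into count or TF-IDF vectors before being passed to a classifier such as an SVM; that step is omitted here to keep the sketch dependency-free.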
This work is licensed under a Creative Commons Attribution 4.0 International License.