Evaluating Ensemble-Based Classifiers for Khasi Named Entity Recognition with Data Imbalance Constraints
DOI:
https://doi.org/10.22232/stj.2026.2
Keywords:
Named Entity Recognition, Low-Resource Language, Khasi Language, Transformer-based Models, Ensemble Learning
Abstract
Named Entity Recognition (NER), a core task in natural language processing (NLP), has been studied extensively for resource-rich languages, but little work exists for low-resource languages such as Khasi. Our previous work on Khasi NER applied the transformer models RoBERTa, XLM-RoBERTa, and BERT-base-multilingual-cased using transfer learning (TL) and achieved an F1 score of 91.25% with RoBERTa. In this paper, we investigate ensemble methods to improve performance further. For both the TL and ensemble experiments, we used a novel Khasi NER dataset that was annotated in a span-based format and converted to the IOB2 format for token-level prediction. The ensemble methods employed were majority voting, weighted voting, and stacking with logistic regression and XGBoost. Two settings were tested: one with two models (RoBERTa and XLM-RoBERTa) and one with three models, adding BERT-base-multilingual-cased, with the transformer models acting as the base learners. The three-model ensemble with XGBoost achieved the highest F1 score, 94.06%, an improvement of 2.81 percentage points over the previous best. It also achieved significant improvements on rare entity types such as WORK_OF_ART, LANGUAGE, ORDINAL, and EVENT, even though the dataset was skewed toward the PERSON class. The results show how ensemble methods can improve NER performance on imbalanced datasets and for low-resource languages such as Khasi.
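As an illustration of the two voting schemes named in the abstract, the following is a minimal sketch of token-level majority and weighted voting over IOB2 tags. It is not the authors' implementation: the function names, the example tag sequences, the tie-breaking rule (earliest-listed model wins), and the weights are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Pick the most frequent tag per token across models.

    Assumption: ties are broken in favour of the earliest-listed model.
    """
    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # First tag, in model order, that reaches the top count.
        voted.append(next(tag for tag in token_tags if counts[tag] == best))
    return voted


def weighted_vote(
    predictions: Sequence[Sequence[str]], weights: Sequence[float]
) -> List[str]:
    """Pick the tag with the largest summed model weight per token."""
    voted = []
    for token_tags in zip(*predictions):
        scores: Dict[str, float] = defaultdict(float)
        for tag, weight in zip(token_tags, weights):
            scores[tag] += weight
        voted.append(max(scores, key=scores.get))
    return voted


# Hypothetical per-token predictions from three base models for one sentence.
roberta = ["B-PERSON", "I-PERSON", "O", "B-EVENT"]
xlm_roberta = ["B-PERSON", "O", "O", "B-EVENT"]
mbert = ["B-PERSON", "I-PERSON", "O", "O"]

models = [roberta, xlm_roberta, mbert]
print(majority_vote(models))
# Hypothetical weights, e.g. derived from each model's validation F1.
print(weighted_vote(models, weights=[1, 3, 1]))
```

Because each token is voted on independently, the combined sequence is not guaranteed to be a valid IOB2 sequence (for example, an I- tag may follow O), so a practical pipeline would likely need a repair step after voting. Stacking differs in that a second-stage classifier, such as logistic regression or XGBoost, is trained on the base models' outputs instead of applying a fixed voting rule.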
License
Copyright (c) 2026 Amitabha Nath, Ransly Hoojon, Saralin Lyngdoh

This work is licensed under a Creative Commons Attribution 4.0 International License.
© The Author(s) 2025. Published by the Science & Technology Journal (STJ), Mizoram University.
Articles published in this journal are open access and distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are properly credited.
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the CC BY 4.0 license.