Evaluating Ensemble-Based Classifiers for Khasi Named Entity Recognition with Data Imbalance Constraints

Authors

  • Amitabha Nath, North-Eastern Hill University
  • Ransly Hoojon
  • Saralin Lyngdoh

DOI:

https://doi.org/10.22232/stj.2026.2

Keywords:

Named Entity Recognition, Low-Resource Language, Khasi Language, Transformer-based Models, Ensemble Learning

Abstract

Named Entity Recognition (NER), a core task of NLP, has been studied extensively for resource-rich languages, but little has been attempted for low-resource languages such as Khasi. Our previous work on Khasi NER applied the transformer models RoBERTa, XLM-RoBERTa, and BERT-base-multilingual-cased with transfer learning (TL) and achieved an F1 score of 91.25% with RoBERTa. In this paper, we investigate ensemble methods to improve performance further. For both the TL and ensemble experiments, we used a novel Khasi NER dataset that was annotated with span-based labels and converted to IOB2 format for token-level prediction. The ensemble methods were majority voting, weighted voting, and stacking with logistic regression and XGBoost. Two settings were tested: one with two base learners (RoBERTa and XLM-RoBERTa) and one with three, adding BERT-base-multilingual-cased. The three-model ensemble with XGBoost achieved the highest F1 score of 94.06%, an improvement of 2.81 percentage points over the previous best. It also achieved significant improvements on rare entity classes such as WORK_OF_ART, LANGUAGE, ORDINAL, and EVENT, even though the dataset was skewed toward the PERSON class. The results show that ensemble methods can improve NER performance on imbalanced datasets and for low-resource languages such as Khasi.
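
As an illustration of the two voting schemes named in the abstract, the following is a minimal sketch of token-level majority and weighted voting over IOB2 label sequences. It is not the authors' implementation: the function names, the example predictions, the tie-breaking rule (the earliest model in the list wins), and the weights are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Combine per-model IOB2 label sequences by unweighted voting.

    predictions: list of label sequences, one per model, all of equal length.
    Ties are broken in favour of the earliest model in the list (an assumption).
    """
    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        voted.append(next(l for l in token_labels if counts[l] == best))
    return voted


def weighted_vote(predictions, weights):
    """Combine per-model label sequences, each model's vote scaled by its weight.

    weights: one non-negative number per model, for example a validation F1
    score (hypothetical choice; the abstract does not state how weights are set).
    """
    voted = []
    for token_labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(token_labels, weights):
            scores[label] += weight
        best = max(scores.values())
        voted.append(next(l for l in token_labels if scores[l] == best))
    return voted


# Hypothetical predictions for a four-token sentence from three base learners.
roberta = ["B-PERSON", "I-PERSON", "O", "B-LANGUAGE"]
xlm_roberta = ["B-PERSON", "O", "O", "B-LANGUAGE"]
mbert = ["B-PERSON", "O", "O", "O"]

models = [roberta, xlm_roberta, mbert]
majority = majority_vote(models)
weighted = weighted_vote(models, weights=[0.6, 0.2, 0.2])  # illustrative weights

print(majority)
print(weighted)
```

With these illustrative inputs the two schemes disagree on the second token: two of the three models vote O, but the first model's larger weight carries I-PERSON under weighted voting. Stacking replaces such fixed rules with a trained classifier (logistic regression or XGBoost in the abstract) that learns how to combine the base learners' outputs.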

Published

2026-05-27

How to Cite

Nath, A., Hoojon, R., & Lyngdoh, S. (2026). Evaluating Ensemble-Based Classifiers for Khasi Named Entity Recognition with Data Imbalance Constraints. Science & Technology Journal, 14(Online First). https://doi.org/10.22232/stj.2026.2