LFGA–ADENN Hybrid Feature Extraction and Oversampling for Type 2 Diabetes Mellitus Detection in Mizoram
A Genomic Data–Driven Study in the Mizoram Population
DOI:
https://doi.org/10.22232/stj.2026.1
Keywords:
T2DM, Genomic Sequencing, Hybrid Feature Extraction, Leaf Fusion, Genetic Algorithm, Hybrid Oversampling
Abstract
Early detection of Type 2 Diabetes Mellitus (T2DM) from genomic features is challenging because of high dimensionality, extensive missing values, and severe class imbalance. Advances in bioinformatics have made complex biological data easier to analyse, and the growing availability of such data has made machine learning techniques increasingly useful. This study proposes a hybrid framework for the Mizoram population that integrates preprocessing, hybrid oversampling, and feature extraction. Data curation comprises missing-value filtering, K-Nearest Neighbours imputation, and categorical encoding. Class imbalance is addressed by combining ADASYN with Edited Nearest Neighbours (ADENN). Feature extraction (LFGA) fuses leaf indices from Decision Tree, Random Forest, and XGBoost models and then applies a genetic algorithm to select the most useful representations. Across the machine learning models evaluated, the hybrid LFGA–ADENN pipeline consistently improves minority-class detection, reaching up to 0.921 F1-score and 0.920 MCC with XGBoost. Compared with no oversampling or single-method approaches, LFGA–ADENN improves recall while preserving precision, showing strong performance in imbalanced genomic T2DM prediction.
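The leaf-fusion and genetic-algorithm steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data are synthetic; scikit-learn's GradientBoostingClassifier stands in for XGBoost; and the helper names (`leaf_features`, `fitness`, `run_ga`), the GA settings, and the shallow decision tree used as the fitness model are all hypothetical choices made for illustration. The ADASYN and Edited Nearest Neighbours resampling step is left out here; the imbalanced-learn package provides `ADASYN` and `EditedNearestNeighbours` classes that could be used for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the genomic feature table (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Three tree-based models; gradient boosting is a stand-in for XGBoost.
models = [
    DecisionTreeClassifier(max_depth=4, random_state=0),
    RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0),
    GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0),
]
for model in models:
    model.fit(X_fit, y_fit)

def leaf_features(fitted_models, data):
    """Fuse leaf indices: one column per tree, concatenated across models."""
    blocks = [m.apply(data).reshape(len(data), -1) for m in fitted_models]
    return np.hstack(blocks)

L_fit = leaf_features(models, X_fit)   # 1 + 10 + 10 = 21 leaf-index columns
L_val = leaf_features(models, X_val)

rng = np.random.default_rng(0)

def fitness(mask):
    """MCC on the validation split for the leaf columns selected by mask."""
    if not mask.any():
        return -1.0  # an empty selection gets the worst possible score
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(L_fit[:, mask], y_fit)
    return matthews_corrcoef(y_val, clf.predict(L_val[:, mask]))

def run_ga(n_genes, pop_size=12, generations=5, p_mut=0.1):
    """Tiny GA: truncation selection, uniform crossover, bit-flip mutation."""
    pop = rng.random((pop_size, n_genes)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]
        children = [parents[0].copy()]  # elitism: keep the current best
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n_genes) < 0.5, a, b)
            child ^= rng.random(n_genes) < p_mut
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], float(scores.max())

best_mask, best_mcc = run_ga(L_fit.shape[1])
print(f"selected {int(best_mask.sum())} of {best_mask.size} leaf columns, "
      f"validation MCC = {best_mcc:.3f}")
```

Because the same validation split both guides the search and reports the score, the printed MCC is optimistic. A fair estimate would need a separate held-out test set or nested cross-validation.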
License
Copyright (c) 2026 Vanlalawmpuia Ralte, Lalhmingliana; Senthil, Brindha, Freda, John, Vanlalhruaii

This work is licensed under a Creative Commons Attribution 4.0 International License.
© The Author(s) 2025. Published by the Science & Technology Journal (STJ), Mizoram University.
Articles published in this journal are open access and distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are properly credited.
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the CC BY 4.0 license.