LFGA–ADENN Hybrid Feature Extraction and Oversampling for Type 2 Diabetes Mellitus Detection in Mizoram

A Genomic Data–Driven Study in the Mizoram Population

Authors

  • Vanlalawmpuia Ralte, Mizoram University
  • Lalhmingliana
  • Senthil
  • Brindha
  • Freda
  • John
  • Vanlalhruaii

DOI:

https://doi.org/10.22232/stj.2026.1

Keywords:

T2DM, Genomic Sequencing, Hybrid Feature Extraction, Leaf Fusion, Genetic Algorithm, Hybrid Oversampling

Abstract

Early detection of Type 2 Diabetes Mellitus (T2DM) from genomic characteristics is challenging because of high dimensionality, extensive missing values, and severe class imbalance. Advances in bioinformatics have made complex biological data easier to analyse, and the growing availability of such data has made machine learning techniques increasingly useful. This study proposes a hybrid system that integrates preprocessing, hybrid oversampling, and feature extraction for the Mizoram population. The data are curated through missing-value filtering, K-Nearest Neighbours imputation, and categorical encoding. Class imbalance is addressed by combining ADASYN with Edited Nearest Neighbours. Feature extraction fuses leaf indices from Decision Tree, Random Forest, and XGBoost models and then applies a genetic algorithm to select useful representations. The hybrid LFGA–ADENN pipeline consistently improves minority-class detection across the machine learning models evaluated, reaching up to 0.921 F1-score and 0.920 MCC with XGBoost. LFGA–ADENN performs strongly on imbalanced genomic T2DM prediction, improving recall while preserving precision compared with no oversampling or single-method approaches.
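
The hybrid oversampling step named in the abstract (ADASYN followed by Edited Nearest Neighbours) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-feature data, the neighbourhood size k=3, and the choice to let the cleaning step remove samples from both classes are all assumptions made here for illustration, and only the Python standard library is used.

```python
import math
import random
from collections import Counter


def nearest(i, X, k, pool):
    """Return the k indices from `pool` whose points lie closest to X[i]."""
    others = [j for j in pool if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def adasyn(X, y, minority=1, k=3, seed=0):
    """ADASYN-style oversampling for two classes (illustrative, hypothetical helper)."""
    rng = random.Random(seed)
    everyone = range(len(X))
    min_idx = [i for i in everyone if y[i] == minority]
    n_needed = len(X) - 2 * len(min_idx)  # synthetic samples required for balance
    # Hardness: fraction of majority labels among each minority point's k neighbours.
    hardness = [sum(y[j] != minority for j in nearest(i, X, k, everyone)) / k
                for i in min_idx]
    total = sum(hardness)
    X_out, y_out = list(X), list(y)
    if len(min_idx) < 2 or n_needed <= 0 or total == 0:
        return X_out, y_out  # nothing to generate
    for i, h in zip(min_idx, hardness):
        # Harder minority points receive a larger share of the synthetic samples.
        for _ in range(round(h / total * n_needed)):
            j = rng.choice(nearest(i, X, k, min_idx))  # random minority neighbour
            lam = rng.random()  # interpolation weight in [0, 1)
            X_out.append(tuple(a + lam * (b - a) for a, b in zip(X[i], X[j])))
            y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Drop each sample whose label disagrees with the vote of its k neighbours.

    Assumes two classes and an odd k, so the vote cannot tie.
    """
    everyone = range(len(X))
    keep = [i for i in everyone
            if Counter(y[j] for j in nearest(i, X, k, everyone)).most_common(1)[0][0] == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): eight majority samples and three minority samples.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (0.5, 0.5), (0.1, 0.4), (0.6, 0.6), (0.9, 0.9), (1.0, 0.8)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = adasyn(X, y)                                  # oversample first
X_clean, y_clean = edited_nearest_neighbours(X_bal, y_bal)   # then clean noisy points
print(Counter(y), Counter(y_bal), Counter(y_clean))
```

Running the oversampler before the cleaning rule means that synthetic points landing deep inside the other class can be removed again, which is one plausible reason for pairing the two methods. In practice a maintained library such as imbalanced-learn would normally be preferred over hand-written code like this.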

Published

2026-05-27

How to Cite

Ralte, V., Lalhmingliana, Senthil Kumar, N., Senthil Kumar, B., Lalrohlui, F., Zohmingthanga, J., & Vanlalhruaii. (2026). LFGA–ADENN Hybrid Feature Extraction and Oversampling for Type 2 Diabetes Mellitus Detection in Mizoram: A Genomic Data–Driven Study in the Mizoram Population. Science & Technology Journal, 14(Online First). https://doi.org/10.22232/stj.2026.1