PLSCLUSTER: A Hybrid Approach Combining Partial Least Squares and Graph Clustering to Address Multicollinearity in High-Dimensional Data
Abstract
Multicollinearity remains a significant challenge in statistical modelling, especially in high-dimensional datasets where predictors exhibit strong linear dependencies. Traditional approaches such as ridge regression, principal components regression (PCR), and partial least squares (PLS) each address the problem in part, but they struggle to balance predictive performance with interpretability. This study presents and evaluates PLSCLUSTER, a hybrid technique that integrates graph-based clustering of correlated predictors with supervised dimension reduction through PLS regression. The method first partitions highly correlated variables into clusters using a pairwise correlation network, then selects representative variables or latent cluster components, and finally applies PLS to derive stable, interpretable models. In extensive simulations varying the sample size (n = 20, 50, 100, 500), the number of predictors (p = 5, 10, 15, 20), and the correlation strength (ρ = 0.5, 0.7, 0.9), PLSCLUSTER is benchmarked against Ridge, Lasso, and PCA/PCR using RMSE, MAE, R², AIC, BIC, and stability diagnostics such as the variance inflation factor (VIF) and the condition number. The results show that PLSCLUSTER consistently outperforms the competing methods under moderate to strong multicollinearity (ρ ≥ 0.7), achieving lower prediction errors and greater coefficient stability while retaining interpretability through cluster representatives. The method is robust across the dimensions considered and benefits from larger sample sizes. Practical implementation details (the SAS macros used in the thesis) and recommended hyperparameter choices (cluster threshold, number of PLS components) are provided to guide replication and adoption.
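The three steps named in the abstract (correlation-network clustering, choice of cluster representatives, PLS on the reduced predictor set) can be illustrated with a minimal sketch. This is not the authors' SAS implementation. It is a standard-library Python illustration under several assumptions that the abstract does not confirm: clusters are taken as connected components of the graph whose edges join predictors with |r| at or above a threshold, the representative of each cluster is the member most correlated with the response, and PLS is a single-response NIPALS fit that returns fitted values only. The function names, the threshold of 0.7, and the simulated data are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [x - mb for x in b]
    return dot(da, db) / math.sqrt(dot(da, da) * dot(db, db))

def correlation_clusters(columns, threshold):
    """Assumed rule: connected components of the graph with an edge when |r| >= threshold."""
    p = len(columns)
    adj = {j: set() for j in range(p)}
    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, clusters = set(), []
    for start in range(p):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over the correlation graph
            node = stack.pop()
            component.append(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                stack.append(nb)
        clusters.append(sorted(component))
    return clusters

def pick_representatives(columns, y, clusters):
    """Assumed rule: keep the cluster member with the largest |correlation| with y."""
    return [max(c, key=lambda j: abs(pearson(columns[j], y))) for c in clusters]

def pls1_fitted(columns, y, n_components):
    """Fitted values from single-response PLS (NIPALS with deflation)."""
    n = len(y)
    X = []
    for col in columns:  # centre each predictor column
        m = sum(col) / n
        X.append([v - m for v in col])
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    for _ in range(n_components):
        w = [dot(col, resid) for col in X]  # weights proportional to X'y
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break
        w = [v / norm for v in w]
        t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(n)]  # scores
        tt = dot(t, t)
        q = dot(resid, t) / tt
        loadings = [dot(col, t) / tt for col in X]
        # Deflate predictors and response by the extracted component.
        X = [[col[i] - t[i] * lj for i in range(n)] for col, lj in zip(X, loadings)]
        resid = [r - q * ti for r, ti in zip(resid, t)]
    return [yi - r for yi, r in zip(y, resid)]

# Hypothetical data: two blocks of near-duplicate predictors plus one independent one.
random.seed(0)
n = 200
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
columns = (
    [[z + random.gauss(0, 0.1) for z in z1] for _ in range(3)]
    + [[z + random.gauss(0, 0.1) for z in z2] for _ in range(2)]
    + [[random.gauss(0, 1) for _ in range(n)]]
)
y = [2 * a - b + random.gauss(0, 0.5) for a, b in zip(z1, z2)]

clusters = correlation_clusters(columns, threshold=0.7)
reps = pick_representatives(columns, y, clusters)
fitted = pls1_fitted([columns[j] for j in reps], y, n_components=2)
rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, fitted)) / n)
print(clusters, reps, round(rmse, 3))
```

Keeping one observed variable per cluster is what lets the final model be read in terms of original predictors. The alternative mentioned in the abstract, a latent component per cluster, would replace `pick_representatives` with a within-cluster summary and is not shown here.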