Accurate river flow prediction is one of the key challenges in sustainable water resources management and in assessing hydrological hazards such as floods and droughts. In the semi-arid mountainous regions of the Zagros range, pronounced spatial and temporal variability in precipitation, temperature, and evaporation results in complex and nonlinear relationships among runoff-generating factors. Consequently, the performance of classical conceptual and physically based models such as HBV and SWAT tends to decline under high uncertainty or data-scarce conditions. In recent years, artificial intelligence (AI) approaches based on machine learning, particularly gradient boosting models, have emerged as powerful data-driven tools for hydrological modeling. This study aims to develop and evaluate a multi-objective Gradient Boosting Regressor (GBR) model for river discharge prediction in the Kashkan watershed, located in southwestern Iran. By integrating the GBR model with the Harmony Search (HS) metaheuristic algorithm for optimal feature selection, and employing k-fold cross-validation for model validation, this research seeks to enhance model accuracy, stability, and generalizability under data-limited conditions. The main objective is to propose an efficient framework for identifying key hydrometric and meteorological predictors of river flow and improving the performance of data-driven models in regions with limited observational records. Materials and Methods The study area is the Kashkan River Basin in Lorestan Province, covering approximately 9,500 km². The dataset includes observations from three hydrometric stations (Kakarza, Kashkan–Afrineh, and Kashkan–Poldokhtar) and nine meteorological stations over a regular statistical period. After removing incomplete data, a total of 633 concurrent records were used for model training and testing. In the first step, the Gradient Boosting Regressor (GBR) model was developed using the Scikit-Learn library in Python. The model iteratively combines multiple weak decision trees to progressively minimize prediction errors and identify nonlinear relationships between inputs and targets through ensemble learning. Key hyperparameters, including the learning rate, number of trees, and tree depth, were tuned using k-fold cross-validation. Next, optimal and multi-objective feature selection was performed using the Harmony Search (HS) algorithm. The objective function combined prediction accuracy (based on Mean Squared Error) and the number of selected features, while a weighting factor (β) controlled the balance between these two objectives. The input features included lagged time series of discharge from neighboring stations, meteorological variables with time delays, and temporal components such as month and year. Model performance was evaluated using several statistical metrics, including the Coefficient of Determination (R²), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Nash–Sutcliffe Efficiency (NSE). Qualitative analyses were also performed using the correlation matrix, error distribution histograms, SHAP (Shapley Additive Explanations) plots, and error boxplots to assess the influence of key stations and the model’s predictive behavior under varying conditions. Results and Discussion The results demonstrated that the Gradient Boosting Regressor, coupled with the optimized feature subset from the HS algorithm, achieved high accuracy in flow prediction. When the discharge data from the Kakarza station were included as an input, the model achieved R² values of 0.92 and 0.96 for the test data at the Kashkan–Poldokhtar and Kashkan–Afrineh stations, respectively, with corresponding RMSE values of 15.51 and 8.54 L/s. In contrast, excluding Kakarza station data led to a marked decrease in model performance: R² dropped to 0.78 and 0.75, while RMSE increased to 26.98 and 19.25 L/s, respectively. The correlation matrix revealed that the Kakarza discharge exhibited a very strong correlation (>0.95) with the target variable, highlighting its crucial role in improving model stability and identifying flow relationships. SHAP feature importance analysis confirmed that removing this station changed the model behavior, increasing its reliance on meteorological and temporal inputs. When hydrometric data were included, the model effectively learned seasonal flow patterns and long-term trends. The error histograms and boxplots further supported the robustness of the model configuration that included Kakarza data, showing reduced error dispersion and mean errors close to zero. Conversely, excluding the key station resulted in positive error skewness and more outliers, indicating a tendency toward underestimation. These findings underscore the importance of multi-source data integration and intelligent feature selection for achieving robust and generalizable performance in data-driven hydrological modeling. Conclusions and Recommendations This study demonstrates that integrating the Gradient Boosting Regressor (GBR) model with the Harmony Search (HS) metaheuristic algorithm and employing multi-fold cross-validation constitutes an effective approach for predicting river flows in data-scarce basins. The results emphasize the critical importance of including highly correlated hydrometric stations to improve model accuracy and stability, as their removal substantially reduces R² and increases prediction error. The developed model successfully captured nonlinear relationships between meteorological and hydrometric variables and prevented overfitting through intelligent feature selection. Future studies are encouraged to extend this framework by incorporating advanced boosting algorithms such as XGBoost or LightGBM, and by exploring hybrid metaheuristic algorithms (e.g., HS–GA). Expanding this approach to real-time discharge or rainfall–runoff prediction in other Zagros catchments could significantly contribute to the integrated management of the country’s water resources. |