In conducting this investigation, we followed the conceptual framework shown in Fig. 1. The primary components of the study are the collection of twenty-meter sprint performance data, preprocessing of the collected data, division of the data into training and testing samples, selection and reduction of relevant features, construction of regression models using traditional machine learning (ML) algorithms, and assessment of the predictive models' performance.

### Participants

This study included 282 participants aged between 6 and 11 years who were attending the first level of primary education: 130 males (age: 8.26 ± 1.84 years, height: 136.51 ± 16.74 cm, weight: 36.61 ± 14.16 kg, BMI: 17.77 ± 3.28 kg/m^{2}) and 152 females (age: 8.74 ± 1.83 years, height: 135.26 ± 12.83 cm, weight: 32.73 ± 10.32 kg, BMI: 17.48 ± 2.99 kg/m^{2}). We focused on participants with normal development. Participants reported that they did not have any anxiety or insomnia during the test. G-Power (version 3.1.9.7, IBM, Düsseldorf) was used to determine the minimum sample size^{19}. According to this analysis, with α err prob = 0.05, a minimum effect size of 0.30, and power (1−β err prob) = 0.80, at least 270 participants were required, giving an actual power of 80.0%.

Volunteers between the ages of 6 and 11 who were studying at the first level of primary education and showed normal physical, cognitive and affective development were included. Participants were selected from sedentary individuals who did not actively compete in any sport and whose only organized activity was physical education and sports lessons. Participants who had developmental disorders, needed special education through inclusive education, had chronic diseases, had any disability, or were taking growth hormone, glucocorticoids, antipsychotic or corticosteroid drugs that could change body structure were not included in the study. Participants who had difficulty understanding the researcher's instructions during the tests or who persistently refused to perform the tests were excluded from the study.

All participants, parents and teachers were informed about the purpose, rationale and contribution of the research to the literature. Informed consent forms were signed by all researchers and participants, and participants were informed that they could withdraw from the study at any time. The necessary permissions were obtained from the Health Sciences Non-Interventional Research Ethics Committee (Approval Number: 2024/4892). Data collection and all test procedures were performed in accordance with the principles set out in the Declaration of Helsinki. The proposed model can be observed in Fig. 1.

### Data collection

#### Anthropometric measurements

Weight was measured with an electronic scale (Tanita BC 420 SMA, Tanita Europe GmbH, Sindelfingen, Germany) to the nearest 0.1 kg while the children wore only underwear and a T-shirt. Height was measured barefoot with a telescopic height-measuring device (Seca 225 stadiometer, Birmingham, UK) to the nearest 0.1 cm. Skinfold thickness (mm) was measured twice on the right side of the body to the nearest 0.2 mm with a skinfold caliper (Holtain, Holtain Ltd, Pembrokeshire, United Kingdom, range 0–40 mm). Skinfold measurements were taken at the following sites: (1) triceps, between the acromion and the olecranon process on the back of the arm; (2) biceps, at the same level as the triceps skinfold, just above the center of the cubital fossa; (3) subscapular, approximately 20 mm below the tip of the scapula, at a 45° angle to the lateral side of the body; (4) suprailiac, approximately 20 mm above the iliac crest and 20 mm toward the medial line; (5) abdominal, midway between the spina iliaca anterior superior and the umbilicus; (6) quadriceps, a vertical fold at the superior third of the quadriceps muscle; (7) gastrocnemius, at the midpoint of the medial side of the muscle.
Circumferences and lengths were measured with a tape measure to the nearest 1 cm at the following sites: (1) head circumference, around the frontal and occipital regions; (2) neck circumference, just below the larynx; (3) shoulder circumference, just below the acromion where the deltoid is most prominent, at the end of expiration; (4) chest circumference, at the end of expiration, at the level of the 4th rib in front and the 6th rib on the side; (5) abdominal circumference, at the level of the umbilicus in front and the subcostal level on the sides; (6) thigh circumference, at mid-thigh; (7) gastrocnemius circumference, where the gastrocnemius muscle is most prominent; (8) fathom length (arm span), between the fingertips while standing against the wall with the arms in 90 degrees of abduction; (9) leg length, between the spina iliaca anterior superior and the medial malleolus; (10) thigh length, between the spina iliaca anterior superior and the medial condyle of the femur; (11) foot length, between the posterior calcaneus and the 2nd phalanx.

#### 20-meter sprint performance test

Before the speed tests, two pairs of photocells (Smart Speed, Fusion Equipment, AUS) were placed along the running track at the 0 m and 20 m marks. Participants started on their own from a semi-crouched position 0.3 m behind the starting line. The sprint tests were performed on an indoor running track to avoid the influence of weather conditions; the temperature of the area was 22 degrees Celsius. After a familiarization phase, each participant was given two trials. Participants were tested in groups of 10: once every participant in the group had completed the first trial, the second trials began again with the first participant. A rest interval of at least 5 min was ensured between the two trials, and the best trial was recorded^{20}.

#### Dataset preprocessing and exploration

The subsequent phase involves the preparation of the dataset, which contains 282 records in total. Table 1 presents a comprehensive statistical summary of the combined dataset, including the number of observations (N), mean, standard deviation, minimum value, 1st quartile, median, 3rd quartile, and maximum value. The variation of the individual variables is illustrated by scatterplots of the target variable (Sprint Performance) against each input variable, as shown in Fig. 2. To facilitate the analysis of the data records, the entries were normalized using the z-score method, so that the normalized values are centered around zero and have a standard deviation of one. The z-score of a value \(x\) of a random variable with mean \(M\) and standard deviation \(sd\) is given by Eq. (1).

$$Z\text{-score} = \frac{x - M}{sd}$$

(1)
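The normalization in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual code: the function name `z_scores` and the example heights are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used.

```python
import math

def z_scores(values):
    """Standardize a list of numbers as (x - M) / sd, following Eq. (1)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]

# Hypothetical heights in cm, for illustration only (not study data).
heights = [120.0, 130.0, 140.0, 150.0, 160.0]
z = z_scores(heights)
```

When data are later split into training and testing samples, a common practice is to compute M and sd on the training portion only and reuse them for the testing portion; the text does not state which approach was taken here.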

The effectiveness of forecasting models is significantly affected by the relevance of the input features. The Pearson correlation coefficient is a widely used way to assess the relationship between variables and to determine how strongly the outcome is related to each feature. Figure 3 shows the correlation matrix between Sprint performance (the output variable) and the other variables (the input variables). Pearson's correlation coefficient is the standard statistical method for analyzing the linear association between two random variables. The correlation score (CS) of two vectors indicates how strongly they depend linearly on one another. For any two vectors \(x_{1}\) and \(x_{2}\), the correlation coefficient is given by Eq. (2), where \(\mathrm{cov}\left(x_{1}, x_{2}\right)\) is the covariance between \(x_{1}\) and \(x_{2}\), and \(\sigma\left(x_{1}\right)\) and \(\sigma\left(x_{2}\right)\) are their variances.

$$CS = \frac{\mathrm{cov}\left(x_{1}, x_{2}\right)}{\sqrt{\sigma\left(x_{1}\right)\,\sigma\left(x_{2}\right)}}$$

(2)

The \(CS\) takes a value between −1 and +1, depending on the direction and strength of the linear relationship between \(x_{1}\) and \(x_{2}\). If the two variables are linearly unrelated, it equals zero. As depicted in Fig. 3, the response variable (Sprint performance) exhibits a stronger correlation with the following input variables: age, height, waist circumference, hip circumference, leg length, thigh length, and foot length.
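As a hedged illustration of Eq. (2), the sketch below computes the correlation score between each input variable and the target. The function name `pearson` and all numbers are hypothetical and are not taken from the study data.

```python
import math

def pearson(x1, x2):
    """Pearson correlation: cov(x1, x2) / sqrt(var(x1) * var(x2)), as in Eq. (2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # The common 1/(n - 1) factor cancels between numerator and denominator,
    # so plain sums of products and squares are sufficient.
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    return cov / math.sqrt(var1 * var2)

# Hypothetical values for illustration only (not study data).
age = [6, 7, 8, 9, 10, 11]
sprint_time = [5.2, 4.9, 4.7, 4.4, 4.2, 4.0]  # seconds
noise = [3, 1, 4, 1, 5, 9]

features = {"age": age, "noise": noise}
scores = {name: pearson(col, sprint_time) for name, col in features.items()}
```

In this toy example the score for `age` is strongly negative because sprint time falls as age rises, which shows that the sign of the score carries the direction of the relationship.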

#### Dataset splitting

The dataset was partitioned into training and testing samples by random allocation, with 80% of the records used for training and 20% for testing. In a second stage, the experiments were repeated using 5-fold cross-validation (k = 5). The training samples are used to construct the prediction models, whereas the testing samples are used to evaluate the accuracy of their predictions. The training and testing samples are subsequently passed to a feature selection step in order to identify the features that could affect the accuracy of the prediction.
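The two splitting schemes described above could be sketched as follows. This is an illustrative sketch only: the helper names `train_test_split` and `k_fold_indices` and the fixed seed are hypothetical, and whether cross-validation was applied to the full dataset or only to the training portion is not stated in the text (the sketch uses all 282 records).

```python
import random

def train_test_split(n_samples, test_ratio=0.2, seed=0):
    """Randomly allocate sample indices to training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

train_idx, test_idx = train_test_split(282)  # 282 records, as in the dataset
folds = list(k_fold_indices(282, k=5))
```

With 282 records, the 80/20 split yields 226 training and 56 testing samples, and each cross-validation fold holds out 56 or 57 records.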

#### Feature space

Three distinct experiments were undertaken to determine the optimal technique. In the first experiment, the entire feature space was used to train the machine learning prediction models. In the second experiment, only significant features were used to train and test the regression models. The significant features are those identified by correlation analysis, namely those with a correlation score (CS) greater than or equal to 0.4, which have a higher Pearson correlation coefficient with the outcome. In the third experiment, Principal Component Analysis (PCA) was employed to reduce the feature space, so that only the principal vectors deemed significant were used to train and test the prediction models. PCA lowers the dimensionality of the feature space by identifying and extracting the patterns that capture most of the information contained in the input features. The result of PCA is the set of principal components of the feature space.
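The second and third experiments can be sketched with NumPy as below. Several points are assumptions made for illustration: the helper names are hypothetical, the threshold is applied to the absolute value of the correlation (the text does not say how negative correlations were treated), the number of retained components is arbitrary, and the data are synthetic.

```python
import numpy as np

def select_by_correlation(X, y, threshold=0.4):
    """Return column indices whose |Pearson r| with y is >= threshold."""
    # Assumption: absolute value is used so that strong negative
    # correlations are retained as well.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.flatnonzero(np.abs(r) >= threshold)

def pca_fit_transform(X, n_components):
    """Project centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # share of variance per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Synthetic data for illustration only (not study data).
rng = np.random.default_rng(0)
informative = rng.normal(size=200)
X = np.column_stack([
    informative,                               # strongly related to y
    informative + 0.1 * rng.normal(size=200),  # near-duplicate of column 0
    rng.normal(size=200),                      # unrelated noise
])
y = -informative + 0.1 * rng.normal(size=200)

kept = select_by_correlation(X, y)                      # second experiment
scores, ratio = pca_fit_transform(X, n_components=2)    # third experiment
```

Because the principal components are ordered by explained variance, a small number of them can stand in for a larger set of correlated measurements such as the anthropometric variables.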

#### ML prediction model construction and evaluation

Regression is a widely used supervised machine learning methodology for forecasting continuous quantitative outcomes. In regression analysis, the relationship between an outcome (the response variable) and a number of input variables (the predictors) is estimated from a labelled dataset. Different types of regression analysis can be chosen based on factors such as the properties of the variables, the target variables being examined, or the shape of the regression curve that represents the relationship between the dependent and independent variables. Linear regression, stepwise regression, decision trees, support vector machines, ensembles, and Gaussian process regressors are examples of conventional machine learning regression methods. A regression model that fits well produces predicted values that closely match the actual data values. The mean model, which uses the mean value as every predicted outcome, is typically used when no informative predictor variables are available. A proposed regression model should therefore fit better than the mean model.

Several performance indicators can be used to assess the accuracy of forecasting models, including the coefficient of determination (\(R^{2}\)), the Mean Squared Error (MSE), and the Root-Mean-Square Error (RMSE). \(R^{2}\) quantifies the degree to which the forecasting model captures the variability observed in the results and is computed according to Eq. (3). The MSE, given by Eq. (4), is the average of the squared differences between the observed and predicted values of the target variable. A lower MSE indicates better model performance, since it corresponds to a smaller average difference between the predicted and actual values. The RMSE, given by Eq. (5), is the square root of the MSE and expresses the error in the same units as the target variable. In Eqs. (3) to (5), \(k\) represents the sample size, \(x_{i}\) denotes the actual values, \(\tilde{x}_{i}\) represents the forecasted values, and \(\bar{x}\) is the mean of the actual values.

$$R^{2} = 1 - \frac{\sum_{i = 1}^{k} \left( x_{i} - \tilde{x}_{i} \right)^{2}}{\sum_{i = 1}^{k} \left( x_{i} - \bar{x} \right)^{2}}$$

(3)

$$MSE = \frac{1}{k} \sum_{i = 1}^{k} \left( x_{i} - \tilde{x}_{i} \right)^{2}$$

(4)

$$RMSE = \sqrt{\frac{1}{k} \sum_{i = 1}^{k} \left( x_{i} - \tilde{x}_{i} \right)^{2}}$$

(5)
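A minimal sketch of Eqs. (3) to (5) is given below, together with the mean model used as a baseline. The function names and the sprint times are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def mse(actual, predicted):
    """Mean squared error, Eq. (4)."""
    k = len(actual)
    return sum((x - p) ** 2 for x, p in zip(actual, predicted)) / k

def rmse(actual, predicted):
    """Root-mean-square error, Eq. (5): the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

def r_squared(actual, predicted):
    """Coefficient of determination, Eq. (3)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((x - p) ** 2 for x, p in zip(actual, predicted))
    ss_tot = sum((x - mean) ** 2 for x in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sprint times in seconds, for illustration only.
actual = [4.0, 4.5, 5.0, 5.5]
predicted = [4.1, 4.4, 5.2, 5.3]

# The mean model predicts the mean of the actual values for every sample.
mean_model = [sum(actual) / len(actual)] * len(actual)
```

For the mean model the numerator and denominator of Eq. (3) coincide, so its \(R^{2}\) is zero; any useful regression model should therefore score above zero on this metric.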

### Ethics approval and consent to participate

The study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Ethics Committee of the Institute of Health Sciences of Inonu University under registration number 2024/4892.