This section outlines the methodology adopted to investigate the classification of CRP concentrations in municipal wastewater samples using signals measured by UV–Vis spectrometry together with machine learning techniques. It begins with a description of the dataset, which consists of real influent samples spiked with known CRP concentrations. The spectral characteristics of these samples are then discussed, followed by a summary of the classification approach, including the selection of spectral markers. Finally, the machine learning models employed for multi-class classification are introduced, detailing their configurations and relevance to the study’s objectives. Together, these subsections provide a comprehensive overview of the experimental and computational framework that underpins the analysis.
Dataset
In this study, a dataset comprising 840 distinct wastewater samples, each exhibiting varying concentrations of CRP (including samples without CRP), serves as the foundational basis for analysis. These samples encapsulate a diverse spectrum of CRP levels, ranging from negligible concentrations to notable quantities, reflecting the dynamic composition inherent to wastewater matrices. The dataset was constructed from measurements carried out with a spectrophotometer (NanoDrop ND-1000, Thermo Fisher Scientific Inc., Waltham, MA, USA). The methodology, sample description and measurement procedure can be found elsewhere20.
Although the dataset comprises 840 individual samples, it is important to highlight that all were derived from real municipal wastewater. Specifically, 14 composite influent samples were collected between August and October 2023 from the Gdynia-Dębogórze Wastewater Treatment Plant (WWTP), located along the Baltic Sea coast in northern Poland. This WWTP is the second-largest facility of its kind in the region and serves both the city of Gdynia and neighboring municipalities. A 24-h composite, flow-proportional sampling procedure was employed to represent the daily composition of raw municipal wastewater. The influent stream is composed predominantly of domestic wastewater, with industrial and hospital sources contributing approximately 1% and 0.1%, respectively. During the sampling campaign, the plant operated at a hydraulic load of roughly 450,000 population equivalents (PE), with an average daily influent flow of 61,886.1 ± 3760.2 \(\hbox {m}^3\)/day. The treatment process employed is mechanical-biological, including advanced nutrient removal and occasional chemical phosphorus precipitation.
The physicochemical characteristics of the collected influent samples reflected typical raw wastewater complexity. On average, samples showed a chemical oxygen demand (COD) of 1268.3 ± 203.6 mg \(\hbox {O}_2\)/L, biochemical oxygen demand (\(\hbox {BOD}_5\)) of 614.2 ± 149.2 mg \(\hbox {O}_2\)/L, and total suspended solids (TSS) of 561.7 ± 90.0 mg/L. Nitrogen and phosphorus levels were also characteristic of high-load influents, with total nitrogen (TN) at 97.2 ± 5.2 mg N/L, ammonium nitrogen (N–\(\hbox {NH}_4^+\)) at 68.8 ± 2.7 mg N/L, total phosphorus (TP) at 12.3 ± 2.3 mg P/L and orthophosphates (P–\(\hbox {PO}_4^{3-}\)) at 5.9 ± 0.1 mg P/L. Other measured parameters included pH of 8.0 ± 0.1, and conductivity of 936.0 ± 73.7 \(\upmu\)S/cm. These values confirm that the experimental conditions were grounded in the complex and variable matrix of real influent wastewater, as encountered in full-scale operational facilities.
To enable ML classification of CRP levels, controlled spiking of CRP into the real wastewater samples was performed. This approach preserved the authentic background variability and interferences of raw municipal wastewater, while allowing for reliable class labeling and model evaluation. As no synthetic matrices or laboratory-prepared waters were used, the dataset represents a controlled experimental design built upon genuine environmental samples. However, while the complexity of the matrix strengthens the ecological relevance of the study, it is acknowledged that the generalizability of the models could be further enhanced by validating them on wastewater from other geographical locations or treatment configurations. Such external validation would confirm the robustness of the models under different operational and environmental conditions.
All wastewater samples utilized in this study were obtained from the influent stream of a municipal WWTP and collected over multiple days, thereby inherently reflecting the natural temporal and compositional variability characteristic of real wastewater matrices. These samples comprised complex mixtures of organic, inorganic, and colloidal constituents, including variable concentrations of nitrogen and phosphorus species, organic carbon compounds, and suspended solids, without artificial filtration or significant matrix alteration beyond minimal preprocessing for spectrophotometric analysis. Controlled additions of CRP were performed solely to establish known concentration classes for model training and evaluation. Consequently, the dataset preserves the authentic physicochemical heterogeneity and spectral interferences present in operational wastewater treatment environments. This approach ensures that the reported classification performance accounts for the challenges associated with real-world sample complexity, enhancing the ecological validity and practical relevance of the developed machine learning models.
The size and breadth of this dataset facilitate robust statistical analyses and ML model training across the classification tasks outlined in the research. By encompassing a wide array of CRP concentrations, the dataset affords a comprehensive understanding of CRP distribution patterns within wastewater, laying the groundwork for nuanced insights into the interplay between CRP levels and environmental dynamics.
Classification
For the classification task, 176 markers were considered, each corresponding to a specific point of the UV–Vis absorption spectrum. These 176 distinct points along the wavelength axis provide a comprehensive dataset that captures the spectral characteristics necessary for distinguishing between different classes of wastewater samples. The wavelength range of each spectrum was 220–750 nm with an accuracy of 1 nm.
Additionally, for classification over the restricted spectral range (400–720 nm), 116 markers were chosen. By leveraging this detailed spectral information, the classification model can effectively identify subtle variations in the absorption profiles, which are indicative of the presence and concentration of CRP in the wastewater. This approach ensures a robust analysis by utilizing either the full breadth or a restricted range of the absorption spectrum, enhancing the accuracy and reliability of the classification results. A representation of the UV–Vis absorption spectrum is shown in Fig. 1.
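For illustration, this marker extraction can be viewed as a subsampling of the measured spectrum. The minimal Python/NumPy sketch below assumes an approximately even marker spacing and uses a placeholder spectrum; it does not reproduce the exact marker grid used in the study.

```python
# Minimal sketch of spectral-marker selection (assumed even spacing and a
# placeholder spectrum; not the exact marker grid used in the study).
import numpy as np

wavelengths = np.arange(220, 751)            # 220-750 nm, assumed 1 nm grid (531 points)
spectrum = np.random.rand(wavelengths.size)  # placeholder absorbance values

def select_markers(wl, spec, lo, hi, n_markers):
    """Pick n_markers approximately evenly spaced points within [lo, hi] nm."""
    idx = np.flatnonzero((wl >= lo) & (wl <= hi))
    picked = idx[np.linspace(0, idx.size - 1, n_markers).round().astype(int)]
    return wl[picked], spec[picked]

wl_full, x_full = select_markers(wavelengths, spectrum, 220, 750, 176)  # full-range features
wl_vis, x_vis = select_markers(wavelengths, spectrum, 400, 720, 116)    # restricted-range features
```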
Feature engineering in the present study was primarily conducted through explicit marker selection based on spectral resolution, encompassing 176 markers across the full UV–Vis wavelength range of 220–750 nm and 116 markers within the restricted range of 400–720 nm. This comprehensive selection of spectral points provided a detailed dataset capturing the absorption characteristics relevant for CRP classification in wastewater samples. The dataset used in this study includes five distinct CRP concentration classes, distributed as follows: wastewater with no detectable CRP (176 samples), CRP at \(10^{-4}\) \(\upmu\)g/ml (166 samples), \(10^{-3}\) \(\upmu\)g/ml (169 samples), \(10^{-2}\) \(\upmu\)g/ml (165 samples), and \(10^{-1}\) \(\upmu\)g/ml (164 samples). This relatively balanced sample distribution across all classes minimizes potential bias and supports robust and fair model training and evaluation.

No manual spectral transformations, such as derivatives, integrals, or spectral band ratios, were applied. Instead, higher-order interactions among spectral features were implicitly modeled via the cubic polynomial kernel of the Cubic Support Vector Machine (CSVM) within the Error-Correcting Output Codes (ECOC) framework. This model-driven approach enabled the capture of complex, non-linear relationships between absorption spectra and CRP concentration classes without handcrafted feature engineering.

Although detailed analysis of spectral region importance was not the primary focus, comparable classification accuracies exceeding 65% were obtained using both the full spectral range and the restricted 400–720 nm region, suggesting that the visible spectrum alone contains a substantial portion of the discriminative information. This finding also indicates the potential for hardware simplification in future sensor development. Further internal analyses revealed that wavelengths corresponding to amide and aromatic ring absorption zones (approximately 400–500 nm), as well as protein- and lipid-associated shoulders (approximately 600–700 nm), were significant contributors to classification confidence. These observations are consistent with established absorbance characteristics of CRP and its interactions within wastewater matrices. Future work will incorporate advanced explainability methods, such as SHAP and permutation feature importance, to more precisely quantify wavelength-specific contributions, thus facilitating the optimization of optical sensor design and improving interpretability of spectral signatures related to CRP presence.
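To make the class structure concrete, the sketch below assembles placeholder feature vectors with the class sizes listed above and performs a stratified split; the random features and the 80/20 split ratio are illustrative assumptions and do not represent the study’s actual validation scheme.

```python
# Placeholder dataset with the reported class sizes (840 samples x 176 markers)
# and a stratified split; random features and the 80/20 ratio are illustrative.
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
class_sizes = {0: 176, 1: 166, 2: 169, 3: 165, 4: 164}   # labels 0-4 as defined above
y = np.concatenate([np.full(n, label) for label, n in class_sizes.items()])
X = rng.random((y.size, 176))                            # stand-in for the marker features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(Counter(y_train), Counter(y_test))                 # class balance is preserved
```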

Fig. 1. Absorption spectrum signal and markers from UV–Vis spectrometry.
Machine learning models
In the realm of classification, several ML models can be leveraged to enhance decision-making processes and support the proposed wastewater classification approach42,43. Models such as Support Vector Machines (SVM), Neural Networks (NN), K-Nearest Neighbors (KNN), Ensemble models, Decision Trees, Discriminants, and Naive Bayes were explored for their potential in handling the complexities of wastewater data. Among these, the cubic Support Vector Machine (CSVM) demonstrated the highest performance in both classification tasks, as shown in the results. Consequently, this subsection focuses on CSVM, providing a detailed explanation of its methodologies and advantages in effectively addressing wastewater classification challenges44,45.
SVMs are a powerful set of supervised learning methods used primarily for classification, though they can also be applied to regression and outlier detection tasks46. The fundamental concept behind SVMs is to find the optimal hyperplane that separates data points of different classes with the maximum margin, acting as a decision boundary. In a two-dimensional space, this hyperplane is a line; in three dimensions, it is a plane; and in higher dimensions, it is a hyperplane. The goal is to ensure that data points from different classes are on opposite sides of this hyperplane, achieving the best possible separation.
A key feature of SVMs is the margin, the distance between the hyperplane and the nearest data points from each class, known as support vectors. The margin should be as wide as possible because a larger margin implies better generalization and a lower risk of misclassification on new, unseen data. The optimal hyperplane is the one that maximizes this margin, making SVMs highly effective at ensuring that the classifier is robust and performs well on unseen data. An example of a binary classification optimal hyperplane is presented in Fig. 2.
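Formally, for training pairs \((\mathbf{x}_i, y_i)\) with labels \(y_i \in \{-1, +1\}\), the maximum-margin hyperplane is obtained by solving

\[
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} \quad \text{subject to} \quad y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, n,
\]

where the resulting margin width equals \(2/\lVert \mathbf{w} \rVert\); in practice, a soft-margin variant with slack variables and a penalty parameter \(C\) is used to tolerate partially overlapping classes.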

Fig. 2. Hyperplane separation for a binary SVM classifier with two classes represented by circles and squares.
One of the strengths of SVMs is their effectiveness in high-dimensional spaces. They are particularly useful when the number of dimensions exceeds the number of samples, a scenario that can be challenging for many other algorithms. SVMs handle this by transforming the input data into a higher-dimensional space where it becomes easier to segregate classes that are not linearly separable in the original space. This transformation is done using kernel functions, which map the input data into a higher-dimensional feature space. Kernel functions are crucial to the power of SVMs. Commonly used kernels include the linear kernel, polynomial kernel, and radial basis function (RBF) kernel. These kernels allow SVMs to create complex decision boundaries that can handle a variety of classification problems, even when the data is not linearly separable in the original feature space. By applying these kernel functions, SVMs can effectively capture the underlying structure of the data, making them a versatile and powerful tool for a wide range of classification and regression tasks.
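For reference, the kernels mentioned above take the standard forms

\[
K_{\text{lin}}(\mathbf{x}, \mathbf{z}) = \mathbf{x}^{\top}\mathbf{z}, \qquad
K_{\text{poly}}(\mathbf{x}, \mathbf{z}) = \left(\gamma\, \mathbf{x}^{\top}\mathbf{z} + c\right)^{d}, \qquad
K_{\text{RBF}}(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^{2}\right),
\]

where \(\gamma\) and \(c\) are kernel hyperparameters and \(d = 3\) corresponds to the cubic kernel used by the CSVM learners described below.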
In this work, the proposed ML model designed to handle the multi-class classification problem uses the Error-Correcting Output Codes (ECOC) method. ECOC decomposes a multi-class problem into multiple binary classification problems, each solved by a binary classifier, and the results from these binary classifiers are combined to make the final multi-class prediction. In this specific model, the response variable being predicted is the CRP concentration level, and the model aims to classify instances into one of five classes labeled 0, 1, 2, 3, and 4 (no CRP, \(10^{-4}\, \upmu\)g/ml, \(10^{-3}\, \upmu\)g/ml, \(10^{-2}\,\upmu\)g/ml and \(10^{-1}\, \upmu\)g/ml CRP concentration levels). The model employs a one-vs-one coding design, in which a binary classifier is trained for every pair of classes; with five classes, this yields the ten binary learners specified in the model.
Each binary learner in this model is a CSVM, which uses a polynomial kernel function of degree three. This means that the decision boundary is a cubic polynomial function of the input features, allowing the model to capture more complex relationships than a linear SVM. Importantly, since each CSVM is a separate model, each binary learner has its own bias term. This bias is a characteristic of the specific classifier and influences how it separates the pair of classes it is trained on. During the prediction phase, each of the ten binary classifiers makes a prediction, and the final class is determined by combining these results and selecting the class with the highest aggregated score across the binary classifiers.
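A minimal scikit-learn analogue of this one-vs-one ECOC design with cubic-kernel SVM binary learners is sketched below; the standardization step, the penalty parameter \(C\), and the scoring call are illustrative assumptions rather than the exact configuration used in the study. The variables X_train, X_test, y_train and y_test are those from the stratified-split sketch above.

```python
# Hypothetical scikit-learn analogue of the one-vs-one ECOC design with
# cubic-kernel SVM binary learners (hyperparameters are illustrative).
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cubic_svm = make_pipeline(
    StandardScaler(),                        # scale each spectral marker (assumed preprocessing)
    SVC(kernel="poly", degree=3, C=1.0))     # cubic polynomial kernel

model = OneVsOneClassifier(cubic_svm)        # 5 classes -> 10 pairwise binary learners
model.fit(X_train, y_train)                  # data from the stratified-split sketch above
print(model.score(X_test, y_test))           # held-out accuracy
```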
Table 1 summarizes the configurations of various ML models evaluated for the classification task. Each model type is listed alongside its key hyperparameters and learning settings, providing insight into the diversity of algorithms tested and the range of configurations applied. This breadth of modeling approaches helps ensure a robust assessment of which types of algorithms are most suitable for predicting CRP concentration categories in wastewater samples.
Ensemble models combine multiple base learners to produce a more robust and accurate classifier. By aggregating the predictions of several weaker models, they reduce the risk of overfitting and improve generalization. Techniques like bagging (e.g., Bagged Trees) reduce variance by averaging over models trained on different data subsets, while boosting (e.g., RUSBoosted Trees) focuses sequentially on correcting the errors of previous learners, improving performance particularly on harder-to-classify instances. Subspace ensembles further enhance diversity by training each learner on a random subset of features. These models are well-suited to complex, noisy datasets and often deliver strong performance in multi-class tasks.
Decision trees are hierarchical models that split the data based on feature values to form a tree-like structure of decision rules. Each node in the tree represents a feature, and each branch corresponds to a decision based on that feature’s value. They are simple to interpret and fast to train. However, standalone trees can be sensitive to small data fluctuations, leading to overfitting, especially with deep trees (e.g., Fine Tree). Simpler trees (e.g., Coarse Tree) offer greater generalization but may underfit. Decision trees form the foundation of many ensemble methods and serve as a baseline for understanding feature importance and interactions.
KNN models are instance-based, non-parametric classifiers that predict the label of a data point based on the majority class among its k nearest neighbors in the training set. The choice of distance metric (e.g., Euclidean, cosine, Minkowski) and the number of neighbors significantly affects performance. Weighted versions further refine predictions by giving more influence to closer neighbors. KNN is simple and effective in low-dimensional problems but can become computationally expensive and less accurate in high-dimensional or imbalanced datasets. Despite these challenges, it can perform well when local structure in the data is informative.
Neural networks (NNs) are composed of interconnected layers of artificial neurons that transform inputs through weighted connections and non-linear activation functions. They are capable of modeling complex, non-linear relationships in data. In this study, both shallow (single- or two-layer) and deep architectures were explored, with varying layer sizes and activation functions like ReLU and Softmax. While neural networks require more data and computational power, they can outperform traditional models when sufficient training data is available and proper regularization is applied. Their flexibility makes them ideal for capturing nuanced patterns in complex datasets like spectrometry data.
Naive Bayes classifiers are probabilistic models based on Bayes’ Theorem, with the simplifying assumption that features are conditionally independent given the class label. Despite this assumption often being violated in practice, Naive Bayes models tend to perform surprisingly well, especially on high-dimensional or noisy datasets. Gaussian Naive Bayes assumes normally distributed features, while Kernel Naive Bayes uses non-parametric density estimation to handle more flexible distributions. These models are fast, require little training data, and are useful as a strong baseline or in ensemble combinations.
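In formula form, the prediction under this conditional-independence assumption can be written as

\[
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{p} P(x_i \mid y),
\]

where \(p\) is the number of spectral markers, and \(P(x_i \mid y)\) is modeled as a Gaussian density in Gaussian Naive Bayes or estimated non-parametrically in Kernel Naive Bayes.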
Discriminant analysis models classify data by modeling the probability distributions of each class and using Bayes’ Rule to assign labels. Linear Discriminant Analysis (LDA) assumes that classes share a common covariance structure, resulting in linear decision boundaries. Quadratic Discriminant Analysis (QDA) relaxes this assumption and allows each class to have its own covariance, yielding more flexible (quadratic) boundaries. These models are efficient and interpretable, especially when the data is approximately Gaussian and the class structure is well-separated.
Kernel models implicitly map input data into higher-dimensional spaces using a kernel function, allowing linear algorithms to learn non-linear relationships. While this is typically associated with SVMs, some other models (like kernel-based Naive Bayes or custom SVM implementations) also use this approach. Kernel learning is especially powerful for classification tasks with complex decision boundaries, offering a balance between model complexity and interpretability.
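To illustrate how such a model comparison can be organized in code, the sketch below cross-validates one representative, hypothetical configuration per family using scikit-learn; these settings are placeholders and do not reproduce the exact hyperparameters listed in Table 1. The variables X and y are those defined in the stratified-split sketch above.

```python
# Hypothetical comparison of one representative configuration per model family;
# these settings are placeholders, not the exact Table 1 configurations.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "Bagged Trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=30),
    "Fine Tree": DecisionTreeClassifier(),
    "Weighted KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
    "Shallow NN": MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000),
    "Gaussian NB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
}

for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=5)   # X, y from the stratified-split sketch above
    print(f"{name:>12s}: {scores.mean():.3f} +/- {scores.std():.3f}")
```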

