Knowledge embedding and interpretable machine learning optimizes comprehensive benefits for water treatment

Aggregation dynamics and knowledge embedding

There is a quantitative relationship between flocculants and particles. In aggregation dynamics, collisions between particles arise from two sources. (1) Collision and aggregation of particles caused by Brownian motion is called perikinetic aggregation (Text S1). (2) Collision and aggregation of particles caused by fluid motion from hydraulic or mechanical agitation is called orthokinetic aggregation (Text S2). The detailed processes are explained in Text S1 and Text S2. Perikinetic aggregation itself is not affected by particle size, but the effect of Brownian motion weakens as particle size increases, so orthokinetic aggregation is needed to further promote collision and aggregation of larger particles. In actual water bodies, where perikinetic and orthokinetic aggregation coexist, the ratio of the two collision types is uncertain. Furthermore, particle size ($d$) and collision efficiency ($\eta$) also depend on changes in water quality (Text S1, Text S2). Therefore, although the exact dosage cannot be expressed in a quantitative formula, the problem is well suited to learning nonlinear relationships via ML. The data collected should include water quality indicators for the different sections of the complete process.
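The size dependence of the two collision sources can be illustrated with the classical Smoluchowski rate constants for Brownian-motion-driven and fluid-shear-driven collisions (perikinetic and orthokinetic aggregation in the standard terminology). This is a minimal sketch, not the formulation given in Text S1 and Text S2; the temperature, viscosity, shear rate, and the way the collision efficiency `eta` enters as a simple multiplier are all assumptions chosen for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def perikinetic_kernel(d_i, d_j, temp_k=293.15, mu=1.0e-3, eta=1.0):
    """Brownian (perikinetic) collision rate constant, m^3/s.

    temp_k (K), mu (Pa*s) and eta are assumed illustrative values.
    """
    return eta * (2.0 * K_B * temp_k / (3.0 * mu)) * (d_i + d_j) ** 2 / (d_i * d_j)


def orthokinetic_kernel(d_i, d_j, shear_rate=50.0, eta=1.0):
    """Shear-induced (orthokinetic) collision rate constant, m^3/s.

    shear_rate (1/s) is an assumed velocity gradient for the mixing zone.
    """
    return eta * (shear_rate / 6.0) * (d_i + d_j) ** 3


# Equal-sized particles: the Brownian term does not change with diameter,
# while the shear term grows with the cube of the diameter.
for d in (0.1e-6, 1.0e-6, 10.0e-6):  # diameters in metres
    peri = perikinetic_kernel(d, d)
    ortho = orthokinetic_kernel(d, d)
    print(f"d = {d:.1e} m  perikinetic = {peri:.2e}  orthokinetic = {ortho:.2e}")
```

For equal-sized particles the Brownian rate constant is independent of diameter, whereas the shear rate constant scales with the cube of the diameter, which is why shear-driven collisions dominate once particles have grown.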

How to embed environmental knowledge in models and increase their interpretability is key to this research. This study also examined economic benefits and energy consumption. The logic chain of the coagulation process is: change in the flocculant dosing rate ➝ change in particle aggregation ➝ change in water quality in the coagulation section. Therefore, seven economy-related indicators (four power indicators and three economic indicators recorded by the central server) were used as constraints rather than as independent variables to optimize the model parameters in the training phase. Indicators before the coagulation section at time_n and indicators after the coagulation section at time_n-1 were used as independent variables, whereas indicators after the coagulation section at time_n were used as constraints. The change-value features are calculated in real time by the central server, along with the results of online monitoring.

$$\begin{array}{c}\begin{array}{cc}{train}: & {pac}={ml}\left({wibc}\right)*{mlp}\end{array}\\ {mlp}\leftarrow {a}_{1}{loss}({wibc})+{a}_{2}{loss}({wtiac})+{a}_{3}{loss}({ptiac})+{a}_{4}{loss}({etiac})\\ \begin{array}{cc}{test}: & {pac}={ml}\left({wibc}_{test}\right)*{mlp}\end{array}\end{array}$$

(1)

${ml}$: various machine learning models

${mlp}$: parameters of the machine learning models

${wibc}$: water quality indicators before coagulation

${wtiac}$: water quality threshold indicators after coagulation

${ptiac}$: power threshold indicators after coagulation

${etiac}$: economic threshold indicators after coagulation
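The parameter update in Eq. (1) combines four loss terms with weights $a_1$ to $a_4$. The following is a minimal sketch of such a weighted sum. How each term is actually computed is not specified here, so the squared-error form, the distance-to-threshold penalty, the default weights, and all function names are hypothetical choices for illustration.

```python
def mean_squared_error(pred, true):
    """Prediction loss on the PAC dosage (assumed form of the first term)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def threshold_gap(values, threshold):
    """Mean squared distance of post-coagulation values from their threshold
    (assumed form of the constraint terms)."""
    return sum((v - threshold) ** 2 for v in values) / len(values)


def composite_loss(pac_pred, pac_true, water_after, power_after, econ_after,
                   thresholds, weights=(1.0, 0.5, 0.25, 0.25)):
    """Weighted sum a1*loss + a2*loss + a3*loss + a4*loss, mirroring Eq. (1)."""
    a1, a2, a3, a4 = weights  # hypothetical weights, not values from the study
    return (a1 * mean_squared_error(pac_pred, pac_true)
            + a2 * threshold_gap(water_after, thresholds["water"])
            + a3 * threshold_gap(power_after, thresholds["power"])
            + a4 * threshold_gap(econ_after, thresholds["economic"]))


# Toy numbers only; none of these come from the plant data.
loss = composite_loss(
    pac_pred=[10.0, 12.0], pac_true=[11.0, 12.0],
    water_after=[0.7, 0.9], power_after=[100.0], econ_after=[5.0],
    thresholds={"water": 0.8, "power": 100.0, "economic": 4.0},
)
print(loss)
```

Keeping the constraint terms in the loss rather than in the input vector means they shape the fitted parameters during training but are not needed at test time, which matches the train and test lines of Eq. (1).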

The distinction between independent variables and constraints, together with the environmental science knowledge built into the model training process, clearly frames the problem, streamlines computation, and improves the efficiency and accuracy of the model. For threshold control after the coagulation section, after the disinfection section, and before water leaves the plant, the goal is water quality as close to the threshold as possible rather than as far below it as possible, because water quality, economic cost, and environmental benefit must be balanced comprehensively. Finally, an interpretability analysis of the constructed model is performed along with application validation.
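The split between independent variables and constraints can be sketched as a small sample-building routine. It follows one reading of the time indexing in this section: measurements before coagulation at time n and after coagulation at time n-1 form the inputs, and measurements after coagulation at time n are kept aside for the loss. The function name, dictionary keys, and toy values are hypothetical.

```python
def build_samples(before, after, pac):
    """Pair each time step n >= 1 with its inputs, constraints and target.

    before[n]: indicators measured before coagulation at time n
    after[n]:  indicators measured after coagulation at time n
    pac[n]:    recorded PAC dosage at time n
    """
    samples = []
    for n in range(1, len(before)):
        samples.append({
            # Independent variables: current inflow state plus previous outcome.
            "inputs": list(before[n]) + list(after[n - 1]),
            # Constraints: current outcome, used only inside the training loss.
            "constraints": list(after[n]),
            "target": pac[n],
        })
    return samples


# Toy series with two pre-coagulation indicators and one post-coagulation indicator.
before = [[5.1, 7.2], [5.6, 7.1], [6.0, 7.3]]
after = [[0.75], [0.78], [0.72]]
pac = [10.0, 11.0, 12.0]
for sample in build_samples(before, after, pac):
    print(sample)
```

The first time step is dropped because it has no preceding post-coagulation measurement.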

Treatment Process and Data Collection

The DWTP from which the data were obtained is located in Guangzhou, China. The design water supply capacity is 800,000 tons per day, the actual water supply capacity is 450,000 tons per day, and the total construction land area is approximately 180,000 m². This plant employs a chlorine disinfection process before chlorination, tri-salin sodium water intake, a disinfectant contact pool, and a previous water supply pump station (Fig. 1a,b). The coagulant is PAC (polyaluminum chloride); the main parameters of PAC are given in Text S3. The current standard for water leaving the plant is GB 5749-2022, but the plant implements stricter internal control standards: turbidity is controlled below 0.8 NTU after the sedimentation tank and below 0.5 NTU after the disinfection tank, and residual chlorine is above 0.3 mg/L. 0.8 mg/L.

To improve the feasibility of subsequent deployment, this study selected 38 indicators that can be measured online at most water plants (Fig. 1A). These include seven raw water indicators, five inflow water indicators, four pre-chlorination indicators, three sedimentation tank indicators, three disinfection indicators, four outlet water indicators, four power indicators, three economic indicators recorded by the central server, four change-value indicators, and the PAC dosage (Table S1). The equipment configuration, installation location, quantity, and category are given in Table S2. The dataset is divided into training and test sets in a ratio of 8:2.

ML Principles

Eight ML algorithms were selected for the study based on the feasibility of the ML approach^29,30,31 (Text S4): ridge regression (Ridge), support vector regression (SVR), random forest (RF), extreme gradient boosting (XG), deep neural network (DNN), recurrent neural network (RNN), long short-term memory network (LSTM), and transformer (TF). Ridge serves as the baseline model.

Ridge improves the stability and predictive power of linear regression by adding a regularization penalty term (Fig. 1C). SVR fits linear and nonlinear data using the support vector machine framework (Fig. 1D)^32,33. RF is an ensemble learning method based on bagging, in which the subtrees are independent and do not affect each other (Fig. 1E)^34,35,36. XG is also an ensemble learning method, but it is based on boosting, and its subtrees depend on one another (Fig. 1F)^37,38,39. DNNs are made up of multiple layers of neurons and are the simplest deep learning networks (Fig. 1G)^40,41,42. The RNN captures the temporal dependence of time-series data through recurrent connections, but may suffer from vanishing gradients during training (Fig. 1H)^43. LSTM controls the flow of information through a gating mechanism that allows the network to selectively retain important information and forget irrelevant information (Fig. 1I)^44,45,46. TF captures global information through a self-attention mechanism in which each input element is computed in relation to all other elements in the sequence (Fig. 1J)^47,48,49.

Interpretability of the model

SHapley Additive exPlanations (SHAP) is a game theory-based explanation method for measuring the contribution of each feature to the predictions of an ML model^50,51. Based on the concept of Shapley values, the contribution of each feature is derived by calculating its marginal contribution to the prediction across the various feature combinations^52,53. The advantage of SHAP is that it makes model predictions transparent, elucidates the specific impact of each feature on the outcome, and helps in understanding the decision-making process of complex models^40,54. During model development, SHAP helps developers identify potential errors or unfair factors and avoid model bias. In this study, SHAP is used to understand and validate the ML dosing model.
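The marginal-contribution calculation behind Shapley values can be shown with a brute-force enumeration over feature subsets. This is a minimal sketch, not the SHAP library and not the study's model: features outside a subset are replaced by a single baseline vector (a simplification of averaging over a background dataset), and the toy dosing function, its coefficients, and the feature values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all feature subsets.

    Features not in a subset take their value from `baseline`.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


# Hypothetical dosing function of (turbidity, pH, flow) with an interaction term.
def toy_dose(features):
    turbidity, ph, flow = features
    return 2.0 * turbidity + 0.5 * flow + 0.1 * turbidity * flow - 1.0 * ph


x = [12.0, 7.2, 30.0]
baseline = [8.0, 7.0, 25.0]
phi = shapley_values(toy_dose, x, baseline)
print(phi)
# The contributions should add up to the prediction difference from the baseline.
print(sum(phi), toy_dose(x) - toy_dose(baseline))
```

Exact enumeration costs on the order of 2^n model evaluations per feature, so it is only practical for a handful of features; with 38 indicators an approximation method would be needed.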

The depth of the subtrees in random forest (RF) can aid interpretation and understanding of the model. Subtree depth directly affects the complexity, generalization ability, and computational efficiency of the model. Deeper trees capture complex patterns but tend to overfit, while shallower trees are simpler and easier to interpret. Adjusting the depth helps balance model performance against interpretability and diagnose overfitting or underfitting. Furthermore, subtree depth affects feature importance analysis and the transparency of decision paths, making the model's decision process easier to understand and communicate.

Verification of application feasibility

The DWTP employs folded-plate coagulation tanks. The coagulation area is divided into two blocks with eight groups in total, and the length, width, and height of each group are 13.75 m, 14.40 m, and 7.1 m, respectively. Each group of coagulation tanks is divided into three rows and two columns (six zones), with a total of 38 folded plates. The models constructed for the verification were scaled down by a factor of 25, with the length, width, and height of one group being 55 cm, 57.6 cm, and 28.4 cm, respectively. The designed hydraulic retention time was 15.1 min for the coagulation area and 102 min for the sedimentation area.
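The reported model dimensions follow directly from dividing the full-scale dimensions by 25, which can be checked with a few lines of arithmetic. The volume ratio shown is simply the cube of the linear factor and is derived here, not stated in the text.

```python
SCALE = 25  # linear scale-down factor stated in the text

# Full-scale dimensions of one coagulation tank group, in metres.
full_size_m = {"length": 13.75, "width": 14.40, "height": 7.1}

# Scaled model dimensions, converted to centimetres.
model_size_cm = {name: value * 100 / SCALE for name, value in full_size_m.items()}
for name, value in model_size_cm.items():
    print(f"{name}: {value:.1f} cm")

# A linear factor of 25 shrinks the enclosed volume by 25 cubed.
volume_ratio = SCALE ** 3
print(f"volume ratio: 1:{volume_ratio}")
```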

The verification took place over a total of 10 days, from February 4, 2025 to February 13, 2025. Samples were taken four times daily, at 10:00, 12:00, 14:00, and 16:00. For each run, water was drawn directly from upstream of coagulation, the dosage was calculated according to both the plant's original logic and the ML-driven dosing scheme, and water and chemicals were mixed via pumps. The water was then fed into two identical scaled-down folded-plate coagulation reactors and left to stand for 102 min, after which the coagulated water was sampled to test the indicators. Given the differences between the reactors and the actual water plant, indicators from the actual plant during reactor verification were also collected online and used for comparison. Since all tanks are arranged in pairs, separate control groups were run to verify the transferability of the method.

Techno-economic analysis (TEA) assesses the economic feasibility, benefits, and risks of a technology through quantitative and qualitative analysis and provides a scientific basis for decision-making. This study considers the cost of chemical consumption, which mainly comprises flocculant (PAC), disinfectant (sodium hypochlorite), and other chemicals. The other chemicals are mainly hydrochloric acid and sodium hydroxide, which are used for backwashing.

Monte Carlo simulation is an effective numerical method for addressing uncertainty and offers advantages such as simplicity of implementation, strong reproducibility, and applicability to perturbation analysis of complex systems. This study uses the method to assess the robustness of the model under missing-data conditions. Specifically, for each specified missing-data percentage, the corresponding percentage of feature values is randomly selected from the original input data and set to 0, simulating real monitoring anomalies such as sensor failures, communication interruptions, and missing data records. The model is then run 100 times on the perturbed data, and the error from each simulation is recorded. Statistical analysis of the results from the repeated simulations quantifies the variability and stability of the model's performance under different missing-data ratios.
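The missing-data procedure can be sketched as follows. This is a minimal illustration under stated assumptions: masking is applied to individual cells of the input table (the text does not say whether whole features or single values are dropped), the error metric is mean absolute error, and `toy_model` with its coefficients stands in for a trained dosing model.

```python
import random
import statistics


def mask_features(rows, missing_fraction, rng):
    """Return a copy of rows with a random fraction of values set to 0."""
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_missing = round(missing_fraction * len(cells))
    masked = [list(row) for row in rows]
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = 0.0
    return masked


def monte_carlo_robustness(predict, rows, targets, missing_fraction,
                           n_runs=100, seed=0):
    """Mean and standard deviation of the MAE over repeated random masking."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_runs):
        masked = mask_features(rows, missing_fraction, rng)
        preds = [predict(row) for row in masked]
        mae = sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)
        errors.append(mae)
    return statistics.mean(errors), statistics.stdev(errors)


# Hypothetical stand-in for a trained dosing model.
def toy_model(row):
    return 1.5 * row[0] + 0.2 * row[1]


rows = [[10.0 + i, 50.0 - i] for i in range(20)]
targets = [toy_model(row) for row in rows]
for fraction in (0.0, 0.1, 0.3):
    mean_err, std_err = monte_carlo_robustness(toy_model, rows, targets, fraction)
    print(f"missing {fraction:.0%}: MAE mean = {mean_err:.3f}, std = {std_err:.3f}")
```

Fixing the seed keeps the 100 runs reproducible, and the spread of the per-run errors gives the variability measure described above.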
