Machine learning algorithms form the backbone of artificial intelligence research, playing a pivotal role in predictive analytics, pattern classification, image recognition, and system forecasting. Over the past two decades, there has been sustained and growing interest in neural network-based approaches for tackling pattern recognition and regression tasks, with applications expanding across a wide spectrum of domains. Foundational work by Knerr et al.1 and LeCun et al.2 introduced single-layer learning rules as a means of dividing complex tasks into manageable subtasks. The LIRA model, a dynamic neural architecture, is widely used in automated medical diagnostics and autonomous vehicle systems for image identification. Its layered architecture achieves fast training convergence and computational efficiency, making it valuable for real-time traffic flow management and precision agriculture. New activation functions enhance ML models’ performance in both general and domain-specific applications. Kussul and Baidyk3 proposed the Limited Receptive Area (LIRA) classifier, a neural classifier developed specifically for image recognition problems. The LIRA framework is organized into three distinct layers: sensor, associative, and output. The sensor layer feeds into the associative layer through fixed, randomly initialized connections, while the associative layer projects to the output layer via learnable weights. Neural-network approaches are particularly attractive here because they combine powerful learning dynamics, seamless scalability to large datasets, high-fidelity approximation of complex functions, and inherently parallel architectures.
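For concreteness, the sketch below gives a minimal NumPy rendering of this layered design, assuming binary threshold neurons in the associative layer, a perceptron-style update confined to the output layer, and illustrative layer sizes; it is a conceptual illustration of the fixed-random versus learnable split, not the exact LIRA training rule of Kussul and Baidyk3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 28x28 binary image, a large associative layer, 10 classes.
n_sensor, n_assoc, n_classes = 784, 2000, 10

# Sensor -> associative connections: fixed, random, never trained.
W_fixed = rng.standard_normal((n_sensor, n_assoc))
thresholds = rng.uniform(0.5, 1.5, size=n_assoc)

# Associative -> output weights: the only trainable parameters.
W_out = np.zeros((n_assoc, n_classes))

def associative_features(x):
    """Binary features from the fixed random projection (threshold neurons)."""
    return (x @ W_fixed > thresholds).astype(float)

def train_step(x, label, lr=1.0):
    """Perceptron-style update applied to the output layer only."""
    a = associative_features(x)
    pred = int(np.argmax(a @ W_out))
    if pred != label:                  # reinforce the true class, penalize the wrong one
        W_out[:, label] += lr * a
        W_out[:, pred] -= lr * a
```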
The hybridization of neural networks with the principles of machine learning has given rise to hybrid approaches4. In a comparative performance study, Kumar and Bhattacharya5 presented Artificial Neural Network (ANN) and Linear Discriminant Analysis (LDA) models. Using a fully connected backpropagation ANN architecture with three neuron layers, they showed that ANNs outperform LDA models on both training and test datasets. ANNs also proved more robust than LDA models when handling missing data. In this vein, Abe et al.6 developed a new approach that used objective indexes to evaluate rules when post-processing mined data. Yao7 reviewed applications of evolutionary algorithms (EAs) to learning and evolution in ANNs, considering combinations such as evolving ANN connection weights, topologies, learning rules, and input features with EAs. The review concluded that such combinations often lead to more successful intelligent systems than those using either ANNs or EAs in isolation. Zhang et al.8 developed MNNs for adaptive control of strict-feedback nonlinear systems, with the effectiveness of the approach verified by simulation experiments. To address these problems and transcend some limitations of traditional neural networks, Huang et al.9,10,11 and Bai et al.12 proposed the Extreme Learning Machine (ELM) and related methods. ELM offers an intrinsically overfitting-resistant solution that is less affected by outliers, thanks to the random initialization of the input weights combined with the analytical optimization of the output weights. However, as the number of hidden-layer neurons increases, the physical structure of the network grows more complex, which is a serious shortcoming of ELM. Researchers have improved the efficiency and robustness of ELM in many ways. Barreto and Barros13, Sze et al.14, Horata et al.15, Man et al.16, and Zhang and Luo17 developed modifications to make the basic ELM more robust against outliers. Unfortunately, most of those developments rely heavily on hidden-layer neurons, making the physical structure of the network excessively large.
In addressing this challenge, Man et al.16 devised an optimal weight learning machine for handwritten image recognition that strategically employs fewer hidden nodes, thereby expediting the learning process in industrial applications. Das et al.18 proposed a backward-forward ELM approach for input weight enhancement, taking an orthogonal matrix with ideal input weights and generating half of the weights randomly to avoid the risk of overfitting. The backward-forward ELM algorithm thus outperformed traditional ELM models in both accuracy and computational economy across different types of activation functions. Ensemble learning has grown in popularity because it combines numerous expert classifiers to increase accuracy. Khellal et al.19 tackled object recognition by integrating a convolutional neural network with a stacking ensemble of Extreme Learning Machine (ELM) classifiers. Earlier, Cao et al.20 showed that a voting-based ELM employing a sigmoid activation function outstripped the original ELM’s performance. By amalgamating the complementary strengths of multiple learners, ensemble classifiers generally surpass individual models in accuracy. In line with this, recent investigations have made noteworthy progress in refining ELM architectures through both ensemble strategies and advanced optimization techniques. For example, Lan et al.21 and Mansoori & Sara22 demonstrated competitive accuracy on the Satimage dataset by combining multiple ELMs using traditional activation functions such as sigmoid and RBF. In a broader context, Kiani et al.23 and Palomino-Echeverria24 conducted a comprehensive survey of ELM-based approaches for outlier detection, identifying key developments in robust loss functions, data preprocessing, and ensemble training frameworks. Expanding on these advances, Tang et al.25 introduced a two-stage ensemble ELM architecture optimized via the Sparrow Search Algorithm for software defect prediction, demonstrating how metaheuristic parameter tuning can substantially boost accuracy on real-world datasets. Similarly, Sumathi et al.26,27,28 developed a unified, hybrid metaheuristic-optimized intrusion-detection framework, combining Harris Hawks Optimization and Particle Swarm Optimization (PSO) with Grey Wolf Optimization (GWO) to select and tune features and parameters across Backpropagation, Multilayer Perceptron (MLP), Self-Organizing Map, and SVM classifiers, validating it on the NSL-KDD and UNSW-NB15 datasets to achieve superior distributed denial-of-service detection accuracy, F1 scores, and minimal false-alarm rates.
Despite these advancements, no prior work has focused on developing activation functions for ELMs based on M-estimation theory. Our study is the first to explore the integration of redescending M-estimator-based ψ-functions as activation functions within an ELM ensemble. These activation functions preserve the key mathematical characteristics of conventional activations while offering enhanced resilience to noise and outliers, resulting in a more adaptable and noise-tolerant learning framework for classification tasks. Building on the work of Khan et al.29, we extend the concept by exploring more robust activation functions based on M-estimation. The ψ-function, chosen for its flexible and distinctive non-linear characteristics, is used to produce a better-learned feature space for the final classification. As summarized in Tables 1 and 2, a core objective of this study is to formally introduce and systematize the integration of redescending M-estimation ψ-functions as activation mechanisms within the Extreme Learning Machine framework. These ψ-functions, owing to their inherent robustness and flexibility, are introduced not only as alternatives to traditional activation functions but as a means to enhance learning stability and generalization in the presence of noise and outliers. Moreover, to fully exploit these activation functions, we propose a resilient ensemble classification framework in which multiple base ELMs, each utilizing a different ψ-based activation, are judiciously combined via a least-squares fusion scheme. This architecture preserves model diversity while mitigating instability, thereby enhancing both the accuracy and robustness of the composite classifier. This work therefore establishes a novel theoretical and computational foundation for integrating robust statistical principles into neural architectures for complex classification tasks.
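As a concrete illustration, the sketch below implements two classical redescending M-estimator ψ-functions, Tukey’s bisquare and Andrews’ sine, as candidate element-wise activations; the tuning constants shown are the conventional values from robust statistics, and the specific ψ-functions adopted in this study are those detailed in Tables 1 and 2, which these two merely exemplify.

```python
import numpy as np

def psi_tukey(u, c=4.685):
    """Tukey's bisquare psi-function: near-linear around zero, redescending to 0 for |u| > c."""
    return np.where(np.abs(u) <= c, u * (1.0 - (u / c) ** 2) ** 2, 0.0)

def psi_andrews(u, c=1.339):
    """Andrews' sine psi-function: redescending to 0 for |u| > c * pi."""
    return np.where(np.abs(u) <= c * np.pi, np.sin(u / c), 0.0)

# Used exactly like a conventional activation on the hidden pre-activation, e.g.
# H = psi_tukey(X @ W + b)
```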
Novelty and significance of the study
This paper presents a robust and efficient ensemble learning framework based on the Extreme Learning Machine that offers a new paradigm for tackling some long-standing problems in ML, such as data contamination, instability due to random weight initialization, and inconsistent classifier performance. The present work is further strengthened by newly adopted activation functions inspired by M-estimation techniques and re-descending M-estimators. These functions improve robustness and capability in unstructured, high-uncertainty, data-laden environments, yielding substantial accuracy gains over baseline models within stacking-generalization frameworks and producing stable, state-of-the-art ensembles from an architecture that alleviates the shortcomings of single, classical, and previously superior ELM architectures.
The significance of this study is not confined to algorithmic contributions alone but extends to real-world practice in domains where accuracy and reliability are critical. For instance:
- Cybersecurity: The framework enhances anomaly detection and incident response in Software-Defined Networking (SDN), where there is a growing need for timely and precise detection of security threats.
- Healthcare: The ensemble demonstrates excellent performance on diagnostic tasks such as lung nodule detection and automated analysis of medical imaging data.
- Financial Systems: In fraud detection, its strong classification capabilities are highly effective at identifying rogue transactions in real time, helping to maintain financial stability.
- Autonomous Systems: Applications to real-time decision-making in autonomous driving and precision agriculture demonstrate its adaptability and efficiency in high-stakes scenarios.
This research bridges the gap between theoretical innovation and practical applicability, overcoming some of the critical challenges traditional ML algorithms face. Unlike standard ELM frameworks, which suffer from instability and sensitivity to initialization, the proposed architecture ensures consistent performance. Extensive validation against state-of-the-art methods shows its superiority in classification accuracy, with lower variance and better adaptation to different datasets and application domains. The study further contributes to the growing area of ethics and trust in AI through an algorithmic framework that is transparent, interpretable, and scalable. This aligns with the current demand that machine learning systems not only perform well but also be trusted for their reliability and fairness. By combining theoretical advances with practical solutions to pressing, real-world challenges, this work sets a new bar for ensemble learning methodologies and is poised to make a serious impact on both academic research and industrial practice.
Enhancing classification accuracy
We propose an effective ensemble of Extreme Learning Machines (ELMs) designed to extract a variety of significant information from data in order to increase accuracy and dependability. The ensemble exploits the idea of diversity by using different initial weights drawn from a predetermined distribution and by employing a novel activation function. Using the least-squares technique, the outputs of the base classifiers are combined to determine the final prediction. Section 3 contains comprehensive details regarding the proposed ensemble. Before outlining the methodology, we provide a brief summary of the state-of-the-art models currently in use, such as ELM, BFELM, and several ELM ensemble techniques from the literature. The purpose of this discussion is to set the context and highlight the progress made by our proposed methodology.
The extreme learning machine (ELM) algorithm
Algorithm
Input: the dataset X with corresponding target values T, the number of hidden nodes N, and the activation function \(G(\cdot)\).
Output: the ELM parameters (input weights, biases, and output weights).
Steps:
1. Initialize the input weights (W) and biases (b) at random.
2. Compute the hidden-layer feature space matrix H using \(H=G\left(XW+b\right)\), where \(G(\cdot)\) is the activation function.
3. Calculate the output weights \(\beta\) by minimizing the error between the predicted and actual target values. The analytical solution, via the Moore-Penrose pseudo-inverse, is \(\hat{\beta}={\left({H}^{T}H\right)}^{-1}{H}^{T}T\).
4. During the testing step, evaluate the model on new data using the determined ELM parameters (input weights W, biases b, and output weights \(\beta\)).
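A minimal NumPy sketch of these four steps is given below; the sigmoid activation and the use of np.linalg.pinv (which computes the same Moore-Penrose solution more stably than the explicit normal-equations form) are conventional choices, not prescriptions of the original algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, g=sigmoid, seed=0):
    """Steps 1-3: random input weights and biases, hidden features, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # input weights, fixed after initialization
    b = rng.standard_normal(n_hidden)                 # biases, fixed after initialization
    H = g(X @ W + b)                                  # hidden-layer feature matrix H = G(XW + b)
    beta = np.linalg.pinv(H) @ T                      # Moore-Penrose solution of min ||H beta - T||
    return W, b, beta

def elm_predict(X, W, b, beta, g=sigmoid):
    """Step 4: apply the learned parameters to new data."""
    return g(X @ W + b) @ beta
```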
Voting-based ELM20
Misclassification rates often surge when outputs fall near decision boundaries. To address this and secure accurate labeling of previously unseen instances, Cao et al.20 proposed an ensemble of Extreme Learning Machines that employs a voting-based fusion strategy. This method increases prediction reliability by using the trained classifiers to collectively determine the class of an unknown item. Diversity was an important factor in the ensemble model’s construction. However, in Cao et al.20, every base classifier employed an identical activation function, a fixed number of hidden neurons, and initial weights sampled from the same continuous distribution. Additionally, to further minimize misclassification risk, the ensemble parameter k was set in advance.
The process for the Voting-based ELM is as follows:
1. Initialize base classifiers: Randomly generate input weights and biases for each base classifier using the same activation function.
2. Train classifiers: Each base classifier is trained on the supplied dataset using a varying number of hidden neurons. Subsequently, within the voting framework, each trained model casts a class prediction for every unlabeled sample50. The final class is determined by a majority vote among the ensemble’s classifiers. Pre-fixing k (the number of distinct classifiers) reduces variability and potential misclassification around decision boundaries.
This ensemble voting method shows that aggregating predictions from multiple models can improve overall classification accuracy, especially for data near challenging decision boundaries.
Extreme learning machine (ELM) with a voting-based input scheme
Given: a training dataset \(\{(x_i, t_i) \mid x_i \in \mathbb{R}^{p},\; t_i \in \mathbb{R}^{c},\; i = 1, 2, \dots, N\}\), with a specified number of hidden nodes and an activation function \(G(\cdot)\). Let \(M\) denote the number of independent classifiers, and initialize the vote-count vector \(S \in \mathbb{R}^{c}\) to zero.
Training Phase:
1. Set \(m = 1\).
2. While \(m \le M\) do:
(a) Randomly generate the weights and biases \(({w}_{i}^{m}, {b}^{m})\).
(b) Compute the hidden-layer matrix \({H}^{m} = G(X{W}^{m} + {b}^{m})\).
(c) Compute the output weight \({\beta}^{m}\) using the formula:
$${\beta}^{m}={\left({\left({H}^{m}\right)}^{T}{H}^{m}\right)}^{-1}{\left({H}^{m}\right)}^{T}T$$
(d) Increment \(m\) by 1.
End While.
Testing Phase:
For any test sample \(x^{test}\), perform the following.
1. Set \(m = 1\). While \(m \le M\) do:
(a) Employ the parameters \(({w}_{i}^{m}, {b}^{m}, {\beta}^{m})\) to predict the class label \(i\) for \(x^{test}\).
(b) Update the vote count: \(S\left(i\right)=S\left(i\right)+1\).
(c) Increment \(m\) by 1.
End While.
2. Determine the final predicted class by selecting the index corresponding to the maximum value in \(S\), i.e.,
$$\text{class}\left({x}^{test}\right)=\underset{i\in\left\{1,\dots,c\right\}}{\arg\max}\;S\left(i\right)$$
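The following sketch illustrates this voting scheme, reusing the elm_train and elm_predict helpers from the earlier ELM sketch; it assumes one-hot target encoding, with each classifier’s predicted label taken as the argmax of its output.

```python
import numpy as np

def voting_elm_train(X, T, n_hidden, M):
    """Train M independent ELMs differing only in their random weights and biases."""
    return [elm_train(X, T, n_hidden, seed=m) for m in range(M)]

def voting_elm_predict(X_test, models, n_classes):
    """Each trained ELM casts one vote per sample; the majority class wins."""
    S = np.zeros((X_test.shape[0], n_classes))      # vote counts, one row per sample
    for W, b, beta in models:
        i = np.argmax(elm_predict(X_test, W, b, beta), axis=1)
        S[np.arange(X_test.shape[0]), i] += 1       # S(i) = S(i) + 1
    return np.argmax(S, axis=1)                     # class = arg max_i S(i)
```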
ELM-based ensemble for classification
Khellal et al.19 introduced an ensemble of extreme learning machines, with each base classifier trained using a sigmoid activation function, to capture the nonlinear patterns within data for a classification problem (refer to Fig. 1). They adopted the ordinary least-squares method to optimize the contribution of each base classifier. Their algorithm’s pseudo-code is as follows:
Algorithm: training procedure for the ELM-based ensemble for classification
Input: \(\{X, T, M, N\}\), where \(X\) is the dataset, \(T\) the target, \(M\) the number of individual models, and \(N\) the number of hidden nodes.
Output: parameters of the ELM-based ensemble.
Procedure:
1. For each model \(m = 1\) to \(M\):
(a) Randomly generate the input weights \({W}^{m}\) and biases \({b}^{m}\).
(b) Compute the hidden layer matrix:
$${H}^{m} = G(X\,{W}^{m} + {b}^{m})$$
(c) Determine the output weights using the pseudoinverse of \({H}^{m}\):
$${\beta}^{m}={\left({H}^{m}\right)}^{\dagger}T$$
(d) Calculate the model output:
$${O}^{m}={H}^{m}{\beta}^{m}$$
2. Form the global hidden matrix by concatenating the outputs of all individual models:
$${H}_{g}=\left[\,{O}^{\left(1\right)}\;\;{O}^{\left(2\right)}\;\;\cdots\;\;{O}^{\left(M\right)}\,\right]$$
3. Compute the fusion parameters by applying the pseudoinverse of \({H}_{g}\):
$$F={\left({H}_{g}\right)}^{\dagger}T$$
4. Return the ensemble parameters: the complete set comprises \(\{{W}^{\left(m\right)}, {b}^{\left(m\right)}, {\beta}^{\left(m\right)}\}\) for \(m = 1, 2, \dots, M\), along with the fusion parameters \(F\). Here, \({O}^{\left(m\right)}\) denotes the output of the \(m\)-th model, and \({H}_{g}\) represents the global hidden matrix. The uniqueness of the ELM-based ensemble is inherently determined by these parameters.
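To make the fusion step concrete, the sketch below mirrors this procedure, again reusing the elm_train and elm_predict helpers defined earlier; here each base model’s output plays the role of \({O}^{\left(m\right)}\), and the fusion weights \(F\) are obtained with the same pseudoinverse used for the base models.

```python
import numpy as np

def ensemble_train(X, T, n_hidden, M):
    """Train M base ELMs, stack their outputs O^(m), and solve for the fusion weights F."""
    models = [elm_train(X, T, n_hidden, seed=m) for m in range(M)]
    H_g = np.hstack([elm_predict(X, W, b, beta) for W, b, beta in models])  # global hidden matrix
    F = np.linalg.pinv(H_g) @ T                     # fusion parameters F = H_g^dagger T
    return models, F

def ensemble_predict(X_test, models, F):
    """Fuse the base-model outputs with the learned least-squares weights."""
    H_g = np.hstack([elm_predict(X_test, W, b, beta) for W, b, beta in models])
    return H_g @ F
```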

Fig. 1. ELM-based ensemble topological structure19.