The classification of B-type and hot subdwarf stars presents several technical challenges. First, the spectra of these stars share overlapping features, making accurate differentiation difficult. Effective baseline correction is therefore crucial; we used Asymmetric Least Squares (ALS) to correct the baseline and enhance signal quality. Identifying the most relevant features in the spectra is another significant challenge, which we addressed by employing the Pan-Core concept to identify 500 unique patterns essential for classification.
Traditional machine learning methods have been explored extensively for this task. In this paper, we implement the Pan-Core concept, based on K-means clustering, for training data acquisition, together with a Support Vector Machine (SVM) for classifying the star data. The Pan-Core concept uses K-means to identify and select representative samples from the available training data, aiming to construct a robust classification model.
Model selection and parameter tuning significantly affect classification performance. We evaluated three SVM kernels (linear, polynomial, and radial basis) and used cross-validation for optimal tuning, examining how each kernel function affects the accuracy of star classification. The choice of kernel plays a crucial role in capturing and separating the underlying patterns in the data.
SVMs offer several advantages over other methods such as decision trees, ensemble learning, and neural networks/deep learning for this spectral classification problem. SVMs are less prone to overfitting compared to decision trees and perform well in high-dimensional spaces, which is crucial for spectral data. They maximize the margin between classes, aiding in the distinction between overlapping spectral features of B-type and hot subdwarf stars. The use of kernel functions allows SVMs to handle non-linear relationships effectively. Additionally, SVMs are computationally more efficient and require less data than deep learning models, with simpler model interpretation and fewer hyperparameters to tune. These characteristics make SVMs particularly well-suited for our classification task.
Additionally, the data imbalance between the more numerous B-type stars and the fewer hot subdwarf stars can bias the model. We mitigated this by ensuring balanced training through appropriate sampling techniques. The model can effectively classify new star data using SVM based on the learned patterns from the training samples.
Figure 2 presents a flow chart that visually summarizes the adopted methodology. It outlines the sequential steps involved in acquiring the training data, training the model using K-means and SVM, and ultimately classifying the star data. By implementing this approach and analyzing the impact of the kernel function, the study aims to enhance the accuracy and efficiency of star classification, contributing to a deeper understanding of celestial objects and their characteristics.

Flow chart of the adopted methodology.
Pre-processing
In stellar spectroscopy, the spectra of stars are typically composed of absorption or emission lines superimposed on a continuum. These spectral features arise from various physical processes occurring within the star and provide crucial information about its composition, temperature, and other fundamental properties. Deviations of the intensity baseline from the expected smooth behaviour are a consequence of these absorption or emission lines and their interaction with the continuum emission.
To address these distortions and accurately interpret the spectral features, baseline correction procedures are utilized. These procedures aim to mitigate systematic variations in the intensity baseline, thereby improving the clarity of the spectral information. However, the effectiveness of baseline correction procedures depends on tuning parameters that need to be carefully selected.
In this study, instead of relying on subjective approaches, we adopted an objective procedure for choosing the baseline correction method37. This method outlines an optimal and systematic approach to selecting the most suitable baseline correction technique for the given star spectroscopy data.
By employing this objective procedure, our goal is to eliminate potential biases and ensure the selection of a baseline correction method that aligns best with the specific characteristics of the star spectra being analyzed. This objective approach enhances the reliability and reproducibility of the baseline correction process, leading to a more accurate and meaningful interpretation of the star spectroscopic data. It also provides a standardized methodology that can be applied consistently across different datasets, improving the overall quality and comparability of the results obtained.
The algorithm proceeds as follows:
1. For each baseline correction algorithm, determine the levels at which all parameters will be tested.
2. Correct the baseline at each parameter level using the corresponding algorithm.
3. Use the baseline-corrected spectral data to model responses related to the physical characteristics of the stars, employing Partial Least Squares (PLS) regression.
4. Validate the model’s prediction capability to assess its accuracy in forecasting the relevant spectral features.
5. Determine the optimal parameter levels for each baseline correction algorithm.
6. Select the baseline correction algorithm with the best prediction capability as the optimal choice among all the algorithms considered.
The evaluation of prediction capability typically involves assessing cross-validated accuracy. This process includes performing cross-validation, where the data is divided into subsets for both training and testing the model. This division allows for an estimation of the model’s predictive performance.
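To make steps 1–6 concrete, the following Python sketch grid-searches the parameter levels of one candidate baseline correction (here a simple polynomial fit, standing in for any algorithm such as the ALS routine sketched in the next subsection), corrects each spectrum, models a physical response with PLS regression, and ranks the settings by cross-validated R²; the synthetic spectra, the response, and the parameter grid are hypothetical placeholders rather than the data or settings used in the study.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins: synthetic flux spectra with a drifting baseline and
# a synthetic "physical" response that the PLS model should predict.
rng = np.random.default_rng(0)
spectra = rng.normal(1.0, 0.05, size=(60, 300)) + np.linspace(0.0, 0.5, 300)
y_phys = spectra[:, 100:150].mean(axis=1)

def poly_baseline(spectrum, degree):
    """Placeholder baseline correction: a low-order polynomial fit, standing in
    for any candidate algorithm (e.g. the ALS routine sketched below)."""
    x = np.linspace(-1.0, 1.0, len(spectrum))
    return np.polyval(np.polyfit(x, spectrum, degree), x)

scores = {}
for degree in (1, 2, 3, 4):                                     # step 1: parameter levels
    corrected = np.array([s - poly_baseline(s, degree) for s in spectra])  # step 2
    pls = PLSRegression(n_components=5)                          # step 3: PLS response model
    scores[degree] = cross_val_score(pls, corrected, y_phys,
                                     cv=5, scoring="r2").mean()  # step 4: cross-validation
best_degree = max(scores, key=scores.get)                        # steps 5-6: best setting wins
print(scores, "best:", best_degree)
```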
The selected Asymmetric Least Squares (ALS) method is briefly described below.
Asymmetric least squares (ALS)
The Asymmetric Least Squares (ALS)23,24 method is a least-squares-based approach that effectively handles predictor variables with significant errors. By assigning appropriate weights, the ALS method downplays the influence of variables with substantial errors while still accounting for their impact on the analysis.
To achieve a smooth and accurate representation of the data, the ALS method incorporates a second-derivative penalty within its smoothing process. This constraint balances the trade-off between achieving smoothness and preserving the relevant features present in the dataset.
The ALS method is mathematically expressed as:
$$\begin{aligned} S = \sum _i w_i (x_i - b_i)^2 + \lambda \sum _i (\Delta ^2 b_i)^2 \end{aligned}$$
(1)
Here, \(x_i\) represents the original spectrum, \(b_i\) denotes the estimated baseline, \(w_i\) corresponds to the asymmetric residual weights, and \(\Delta ^2\) represents the second derivative of the estimated baseline. ALS aims to minimize the value of the expression \(S\) by adjusting the baseline estimates.
To fine-tune the ALS algorithm, there are two adjustable parameters: \(\lambda\), the smoothing parameter, which controls the degree of smoothness applied to the estimated baseline; and \(w\), the weight assigned to the asymmetric residuals, which allows for flexibility in handling different degrees of error in the predictor variables.
By appropriately adjusting these parameters, we customize the behavior of the ALS method according to the specific characteristics of our data. This flexibility enhances the ALS algorithm’s adaptability and improves its performance in accurately estimating baselines and revealing meaningful patterns in various analytical scenarios.
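A minimal Python sketch of the ALS estimator in Eq. (1) is given below, assuming the usual iterative reweighting scheme in which the asymmetry of the residual weights \(w_i\) is controlled by a parameter p (points above the current baseline receive weight p, points below receive 1 − p); the default values of \(\lambda\), p, and the iteration count are illustrative, not the tuned values from the study.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def als_baseline(x, lam=1e5, p=0.01, n_iter=10):
    """Estimate the baseline b_i of a spectrum x_i by asymmetric least squares (Eq. 1).

    lam    : smoothing parameter (lambda), weight of the second-difference penalty
    p      : asymmetry parameter; residuals above the baseline get weight p,
             residuals below it get weight 1 - p (illustrative default)
    n_iter : number of reweighting iterations
    """
    L = len(x)
    # Second-difference operator, so that lam * D @ D.T penalises (Delta^2 b)^2.
    D = sparse.diags([1.0, -2.0, 1.0], [0, -1, -2], shape=(L, L - 2))
    P = lam * (D @ D.T)
    w = np.ones(L)
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, L, L)
        b = spsolve((W + P).tocsc(), w * x)
        # Asymmetric reweighting: points lying above the current baseline are
        # down-weighted so the estimate tracks the continuum rather than the lines.
        w = p * (x > b) + (1.0 - p) * (x < b)
    return b

# Usage: corrected_flux = flux - als_baseline(flux, lam=1e5, p=0.01)
```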
Pan-core spectrum training data acquisition
The study involves a substantial dataset of hot subdwarf and B-type star spectra. To overcome the challenge of training a model on such a vast dataset, we employed the pan-core concept, originally developed in genomics38, as the basis for training data acquisition. The pan-core concept involves the following steps:
1. Utilizing K-means clustering with a large value of \(K\) within each class.
2. Employing the nearest neighborhood method to extract the \(s\) samples that are closest to each centroid obtained from K-means clustering.
Through these steps, we curated a set of \(K \times s\) samples for each class, ensuring that diverse spectral representations are included in our training dataset. Note that the input to K-means clustering consisted of the pre-processed flux spectra, with the input dimension explicitly defined. This approach efficiently captures the intrinsic characteristics of the star spectra, allowing the model to learn from a manageable yet informative subset of samples.
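A minimal sketch of these two pan-core steps for a single class, using scikit-learn, is shown below; the values of K and s are placeholders rather than the ones adopted in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def pan_core_select(flux, K=50, s=10, random_state=0):
    """Return indices of (up to) K*s representative spectra for one class.

    flux : (n_spectra, n_wavelengths) pre-processed flux array for the class
    K    : number of K-means clusters, chosen large (placeholder value)
    s    : number of nearest spectra retained around each centroid (placeholder)
    """
    # Step 1: K-means clustering within the class.
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(flux)
    # Step 2: the s spectra nearest to each centroid form the pan-core set.
    nn = NearestNeighbors(n_neighbors=s).fit(flux)
    _, idx = nn.kneighbors(km.cluster_centers_)
    return np.unique(idx.ravel())

# Usage (per class): core_idx = pan_core_select(flux_B, K=50, s=10)
#                    X_train_B = flux_B[core_idx]
```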
The integration of the pan-core concept in star spectra analysis significantly reduces the size of the training set while retaining pivotal features and ensuring a comprehensive portrayal of spectral diversity within each class, as depicted in Fig. 3. This allows the model to learn from a discerning subset of samples, improving its accuracy and generalization in spectral classification tasks.

K-means clustering applied to synthetic spectral data, demonstrating the partitioning of data points into clusters represented by different colors. Red crosses denote the centroids of each cluster.
Support vector machines
Support vector machine (SVM) is a powerful supervised machine learning algorithm initially introduced by Cortes and Vapnik30. It is widely utilized for both classification and regression tasks due to its ability to handle various types of data through the use of different kernels. SVM offers flexibility in choosing kernels such as linear (L-SVM), polynomial (P-SVM), and radial basis (R-SVM)31,32,33, allowing for effective modeling of complex relationships within the data.
SVMs are well-suited for spectral classification problems due to several characteristics. They are effective in high-dimensional spaces, which is important given the complexity and dimensionality of spectral data. They are robust to overfitting, especially in cases where the number of features exceeds the number of samples, as often seen in spectral datasets. SVMs also perform well with clear margin separation, which helps distinguish between B-type and hot subdwarf stars with overlapping spectral features. Additionally, SVMs can utilize different kernel functions to handle non-linear relationships in the data, enhancing classification accuracy. Cross-validation for parameter tuning ensures optimal model performance, making SVMs a reliable choice for this classification task. These characteristics make SVMs particularly suited for the spectral classification of stars.
Several variables are introduced to elucidate the workings of Support Vector Machines (SVM) in classifying star spectra data. These variables include \(J\), representing the total number of training samples; \(x_j\) and \(y_j\), denoting the features and labels of each sample, respectively; \(x\) and \(y\), representing the feature space and class labels, with \(y\) taking values of either -1 or 1; \(w\), signifying the coefficient vector; \(b\), representing the bias term; \(\alpha _j\), indicating the Lagrange multipliers associated with each training sample; \(\theta (x)\), denoting the feature mapping function; \(S(w, x)\), signifying the inner product between \(w\) and \(x\); and \(C\), representing the regularization parameter in soft-margin SVM. Each variable plays a crucial role in the formulation and optimization of the SVM algorithm, contributing to its effectiveness in accurately classifying star spectra data.
In SVM, the training data consist of pairs \((x_j, y_j)\), \(j = 1, \ldots, J\), with \(x_j \in R^d\) and \(y_j \in \{-1,1\}\), and the Lagrange multipliers satisfy the constraint \(\sum _{j=1}^{J} y_j \alpha _j = 0\). The aim of SVM is to find a linear classifier in a (possibly infinite-dimensional) feature space, given by:
$$\begin{aligned} f(x) = \text{sign}(w \cdot \theta (x) + b) \end{aligned}$$
(2)
Here, \(w \cdot \theta (x) = S(w, x)\) denotes the inner product between the coefficient vector \(w\) and the input sample \(x\).
SVM’s strength lies in its ability to separate data points by defining a decision boundary while maximizing the margin between different classes. The choice of a kernel function determines the transformation of the input data into a higher-dimensional space, enabling effective separation of classes that may not be linearly separable in the original feature space.
By utilizing SVM with different kernels, we explore diverse strategies to classify the star spectra data effectively. The linearity of L-SVM, the flexibility of P-SVM, and the radial basis function of R-SVM provide distinct approaches for capturing the underlying patterns and relationships within the data. This versatility allows for a comprehensive analysis of the star spectra and enhances the model’s capability to make accurate classifications.
In our study, the soft-margin Support Vector Machine (SVM)39 formulation is essential for effectively classifying star spectra data. The key variables involved in the soft-margin SVM include \(\theta ^*_\text{soft}(w)\), representing the optimized parameter; C, signifying the regularization parameter controlling the trade-off between achieving a smooth decision boundary and accurately classifying training data points; \(\xi _j\), indicating the slack variables that allow for misclassifications in the optimization process; \(L_\text{soft}(w, b, \alpha , \xi )\), denoting the soft-margin SVM objective function; and \(W_\text{soft}(\alpha )\), representing the dual cost function for soft-margin SVM. By fine-tuning the C parameter, we aim to strike the right balance between maximizing the margin between classes and minimizing misclassifications, ensuring optimal classification performance for star spectra data analysis.
For soft-margin SVM, optimization is given by:
$$\begin{aligned} \theta ^*_\text{soft}(w) = \text{argmin}_{w, \xi } \frac{1}{2} \Vert w\Vert ^2 + C\sum _{j=1}^{J} \xi _j \end{aligned}$$
(3)
such that,
$$\begin{aligned} y_j (w \cdot \theta (x_j) + b) \ge 1 - \xi _j \end{aligned}$$
(4)
$$\begin{aligned} \xi _j \ge 0 \end{aligned}$$
$$\begin{aligned} L_\text{soft}(w, b, \alpha , \xi ) = \frac{1}{2} \Vert w\Vert ^2 + C \sum _{j=1}^{J} \xi _j - \sum _{j=1}^{J} \alpha _j \left( y_j (w \cdot \theta (x_j) + b) - 1 + \xi _j\right) - \sum _{j=1}^{J} \mu _j \xi _j \end{aligned}$$
(5)
The objective of the soft-margin SVM optimization in Eq. (3) is minimized with respect to the coefficients \(w\), the bias \(b\), and the slack variables \(\xi _j\), where \(C\) controls the trade-off between margin maximization and error minimization. The constraints (4) are incorporated into the Lagrangian \(L_\text{soft}\) of Eq. (5) through the multipliers \(\alpha _j \ge 0\), while the multipliers \(\mu _j \ge 0\) enforce \(\xi _j \ge 0\).
The stationary conditions are,
$$\begin{aligned} \frac{\partial L_\text{soft}}{\partial w} = w - \sum _{j=1}^{J} y_j \alpha _j \theta (x_j) = 0 \end{aligned}$$
(6)
$$\begin{aligned} \frac{\partial L_\text{soft}}{\partial b} = \sum _{j=1}^{J} y_j \alpha _j = 0 \end{aligned}$$
(7)
$$\begin{aligned} \frac{\partial L_\text{soft}}{\partial \xi _j} = C - \alpha _j - \mu _j = 0 \end{aligned}$$
(8)
$$\begin{aligned} \alpha _j \left( y_j (w \cdot \theta (x_j) + b) - 1 + \xi _j\right) = 0 \end{aligned}$$
(9)
Equations (6)–(8) are the stationarity conditions of the Lagrangian \(L_\text{soft}\), obtained by setting its partial derivatives with respect to \(w\), \(b\), and \(\xi _j\) to zero; Eq. (9) is the accompanying complementary slackness condition. Together with \(\mu _j \ge 0\), Eq. (8) implies the box constraint \(0 \le \alpha _j \le C\) used below.
So the weight vector is a linear combination of the data points:
$$\begin{aligned} w = \sum _{j=1}^{J} y_j \alpha _j \theta (x_j) \end{aligned}$$
(10)
The weight vector \(w\) is thus a linear combination of the (mapped) support vectors \(x_j\), weighted by the corresponding Lagrange multipliers \(\alpha _j\) and labels \(y_j\); samples with \(\alpha _j = 0\) do not contribute.
Then the classifier is:
$$\begin{aligned} f_\text{soft}(x) = \text{sign}\left( \sum _{j=1}^{J} y_j \alpha _j \, \theta (x_j) \cdot \theta (x) + b\right) \end{aligned}$$
(11)
$$\begin{aligned} = \text{sign}\left( \sum _{j=1}^{J} y_j \alpha _j \, S(x_j,x) + b\right) \end{aligned}$$
(12)
The soft-margin classifier \(f_\text{soft}(x)\) is determined by the sign of the kernel evaluations \(S(x_j, x)\) between the support vectors \(x_j\) and the input sample \(x\), weighted by the Lagrange multipliers \(\alpha _j\) and labels \(y_j\), plus the bias term \(b\).
Substituting into the Lagrangian gives the dual cost function for soft-margin SVM:
$$\begin{aligned} W_\text{soft}(\alpha ) = \sum _{j=1}^{J} \alpha _j - \frac{1}{2} \sum _{j,i} y_j y_i \alpha _j \alpha _i S(x_j,x_i) \end{aligned}$$
(13)
The dual cost function \(W_\text{soft}\) captures the trade-off between maximizing the margin and minimizing classification errors, where \(\alpha _j\) are the Lagrange multipliers associated with each support vector.
The optimization for soft-margin SVM is now:
$$\begin{aligned} {\hat{\alpha }}_\text{soft} = \arg \max _\alpha W_\text{soft}(\alpha ) \end{aligned}$$
(14)
such that,
$$\begin{aligned} 0 \le \alpha _j \le C \end{aligned}$$
(15)
The optimal Lagrange multipliers \({\hat{\alpha }}_\text{soft}\) are obtained by maximizing the dual cost function \(W_\text{soft}\) subject to the constraints \(0 \le \alpha _j \le C\), ensuring that the Lagrange multipliers are within a feasible range.
$$\begin{aligned} f_\text{soft}(x) = \text{sign}\left( \sum _{j=1}^{J} y_j \alpha _j S(x_j,x) + b\right) \end{aligned}$$
(16)
$$\begin{aligned} = \text{sign}\left( \sum _{j: y_j = +1} \alpha _j S(x_j,x) - \sum _{i: y_i = -1} \alpha _i S(x_i,x) + b\right) \end{aligned}$$
(17)
$$\begin{aligned} f_\text{soft}(x) = \text{sign}\left( h_+(x) - h_-(x) + b\right) \end{aligned}$$
(18)
The final soft-margin classifier \(f_\text{soft}(x)\) predicts the class label of an input sample \(x\) based on the sign of the decision function \(h_+(x) – h_-(x) + b\), where \(h_+(x)\) and \(h_-(x)\) are the contributions from positive and negative support vectors, respectively, to the decision function.
In support vector machines (SVM), the parameter \(C\) plays a crucial role as the regularization parameter, influencing the balance between achieving a smooth decision boundary and accurately classifying training data points. A smaller value of \(C\) promotes a broader margin, allowing for a more generalizable model but potentially compromising on fitting the training data precisely. Conversely, a larger \(C\) value results in a narrower margin, potentially fitting the training data more closely but risking overfitting and reduced generalization to unseen data. Fine-tuning the \(C\) parameter is essential to find the right balance for SVM, ensuring effective classification while avoiding underfitting or overfitting issues in various applications, including our star spectra data analysis.
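To illustrate the role of \(C\), the sketch below fits a linear soft-margin SVM at a small and a large \(C\) on synthetic two-class data and reports the number of support vectors, which shrinks as the margin narrows; this is an illustrative scikit-learn example, not the configuration used for the star spectra.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for two partially overlapping spectral classes.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           class_sep=0.8, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates more margin violations (wider margin, more support
    # vectors); a large C penalises violations heavily (narrower margin).
    print(f"C={C:g}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.3f}")
```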
Linear kernel SVM (L-SVM)
The Linear kernel is a fundamental kernel function designed for linearly separable data. Rather than transforming the data points into a higher-dimensional space, it measures their similarity directly in the original feature space.
The mathematical formula for the Linear kernel is given by:
$$\begin{aligned} F(x_j) \cdot F(x_k) = x_j \cdot x_k \end{aligned}$$
(19)
This equation states that the inner product of the (identity-mapped) feature vectors \(F(x_j)\) and \(F(x_k)\) is simply the dot product of the original data points \(x_j\) and \(x_k\).
In a simplified form, the expression for the Linear kernel can be represented as:
$$\begin{aligned} F(x_j, x_k)= x_j \cdot x_k + c \end{aligned}$$
(20)
Here, \(c\) represents a constant term. This formulation enables the calculation of the dot product between the input vectors \(x_j\) and \(x_k\), with the addition of the constant term \(c\).
While the Linear kernel in Support Vector Machines (SVM) offers simplicity and computational efficiency, it is crucial to delve into its inherent characteristics for effective utilization. The Linear kernel is particularly adept at handling linearly separable data by defining a decision boundary in the original feature space. Unlike its counterparts, such as the Polynomial or Radial Basis Function (RBF) kernels, the Linear kernel doesn’t involve complex transformations into higher-dimensional spaces. This simplicity not only contributes to computational efficiency but also provides transparency in understanding the decision-making process. Additionally, the absence of kernel-specific parameters in the Linear SVM simplifies the tuning process, making it more straightforward for practitioners. Despite its simplicity, the Linear kernel remains a powerful tool, especially when dealing with large-scale datasets, where its efficiency and interpretability become advantageous in various applications.
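As a sanity check on the linear kernel (Eq. (20) with \(c = 0\)), the sketch below compares an SVC using the built-in linear kernel with one fed the explicit Gram matrix \(x_j \cdot x_k\) as a precomputed kernel; the synthetic data are for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Built-in linear kernel.
lin = SVC(kernel="linear", C=1.0).fit(X, y)

# Same model fed the explicit linear Gram matrix K(x_j, x_k) = x_j . x_k.
gram = X @ X.T
pre = SVC(kernel="precomputed", C=1.0).fit(gram, y)

# Predictions should agree (up to floating-point tie cases).
agreement = (lin.predict(X) == pre.predict(gram)).mean()
print(f"prediction agreement: {agreement:.3f}")
```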
Polynomial kernel SVM (P-SVM)
The Polynomial kernel is a non-stationary kernel that can be applied to both hard-margin and soft-margin classification scenarios. It is particularly well-suited for problems where all the training data has been normalized, ensuring consistency across the dataset.
The mathematical representation of the Polynomial kernel is as follows:
$$\begin{aligned} \ F(x_j, x_k) = (\alpha x_j \cdot x_k + c)^d \ \end{aligned}$$
(21)
In this equation, \(F(x_j, x_k)\) represents the transformed feature vectors obtained by raising the dot product of the input vectors \(x_j\) and \(x_k\) to the power of the polynomial degree \(d\). The parameters \(\alpha\), \(c\), and \(d\) are adjustable and play significant roles in shaping the behavior and performance of the Polynomial kernel.
By adjusting these parameters, we can control the complexity and flexibility of the kernel function, allowing it to adapt to different types of data and classification problems. The parameter \(\alpha\) determines the influence of the dot product term, \(c\) represents a constant offset, and \(d\) determines the degree of the polynomial transformation. Fine-tuning these parameters is essential to achieve optimal performance and generalization in Polynomial kernel-based SVM models.
The flexibility of the Polynomial kernel makes it a valuable tool for handling data sets with complex relationships and non-linear decision boundaries. By leveraging the adjustable parameters, researchers can effectively explore the trade-off between model complexity and generalization, ensuring that the Polynomial kernel captures the underlying patterns in the data accurately and provides robust classification results.
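For reference, scikit-learn's SVC exposes the polynomial kernel of Eq. (21) as \((\gamma \, x_j \cdot x_k + r)^d\), so the parameters \(\alpha\), \(c\), and \(d\) correspond to its gamma, coef0, and degree arguments; the values below are purely illustrative.

```python
from sklearn.svm import SVC

# Eq. (21): F(x_j, x_k) = (alpha * x_j . x_k + c)^d
# scikit-learn:           (gamma * <x_j, x_k> + coef0)^degree
p_svm = SVC(kernel="poly",
            gamma=0.5,    # alpha: scale of the dot-product term (illustrative)
            coef0=1.0,    # c: constant offset (illustrative)
            degree=3,     # d: polynomial degree (illustrative)
            C=1.0)        # soft-margin regularisation parameter
# p_svm.fit(X_train, y_train); p_svm.predict(X_test)
```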
Radial kernel SVM (R-SVM)
In cases where prior knowledge about the data is lacking, the radial basis function (RBF) kernel is commonly employed to transform the data. The RBF kernel introduces two critical parameters, namely C and \(\gamma\), which require careful consideration. The C parameter, commonly referred to as the regularization parameter, is shared among all SVM kernels and influences their behavior. A lower value for C promotes a smoother decision surface, while a higher value aims to classify all training sets accurately.
The \(\gamma\) parameter, also known as the kernel coefficient, determines the influence of each training example on the decision boundary. It is related to the \(\sigma\) parameter, the standard deviation of the RBF kernel, through \(\gamma = 1/(2\sigma ^2)\), so a larger \(\sigma\) (smaller \(\gamma\)) widens the kernel and smooths the decision boundary. Choosing appropriate values for C and \(\gamma\) (equivalently \(\sigma\)) is crucial, as they significantly impact the performance of the SVM model, and these parameters must be carefully tuned to achieve optimal results.
The mathematical expression for the RBF kernel is as follows:
$$\begin{aligned} F(x_j, x_k) = \frac{1}{\sigma \sqrt{2\pi }} \exp \left( -\frac{\Vert x_j-x_k\Vert ^2}{2\sigma ^2} \right) \end{aligned}$$
(22)
This expression computes the exponential of the squared Euclidean distance between the input vectors \(x_j\) and \(x_k\), divided by \(2\sigma ^2\). The term \(\frac{1}{\sigma \sqrt{2\pi }}\) serves as a normalization factor.
Choosing suitable values for the \(C\), \(\gamma\), and \(\sigma\) parameters is critical in achieving optimal SVM performance. Careful parameter tuning enables the RBF kernel to capture complex relationships and non-linear patterns in the data, ultimately leading to improved classification results and better generalization.
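A minimal sketch of tuning \(C\) and \(\gamma\) (equivalently \(\sigma\)) for the RBF kernel by cross-validated grid search is given below; the synthetic, class-imbalanced data, the parameter grids, and the scoring choice are illustrative assumptions, not the study's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data standing in for the B-type / hot-subdwarf split.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)

# class_weight="balanced" is one illustrative way to counter class imbalance.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],             # regularisation strength
    "svc__gamma": [1e-4, 1e-3, 1e-2, 1e-1],  # gamma = 1 / (2 * sigma**2)
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1").fit(X, y)
print(search.best_params_, f"cv F1 = {search.best_score_:.3f}")
```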