Transforming label-efficient decoding of healthcare wearables with self-supervised learning and “embedded” medical domain expertise



Overview of the proposed method

Overall framework

We developed a domain-knowledge-guided SSL framework to bridge medical domain expertise (embedded in “old-school” domain feature engineering) with advanced deep learning techniques (“new-school” SSCL), for label-efficient wearable data interpretation. The overall framework consists of four stages: (a) offline domain feature extraction and clustering, (b) data augmentation and deep feature extraction, (c) domain-guided instance-level contrast, and (d) domain-guided prototype-level contrast, as illustrated in Fig. 2.

(a) Offline Domain Feature Extraction and Clustering. Before training, we extract handcrafted features from all input training samples using domain-specific signal processing tools. These features are computed using open-source domain feature engineering pipelines and reflect established clinical heuristics, such as heart rate variability or waveform amplitude for ECG. The extracted features are then clustered using offline k-means, assigning each sample to a prototype group based on domain-informed similarity. This clustering provides a coarse-grained organization of the training set, reflecting broader semantic groupings.
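Stage (a) can be sketched with scikit-learn's k-means; the toy feature matrix and the number of clusters below are illustrative, not the paper's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_domain_features(domain_feats, n_clusters=10, seed=0):
    """Assign each training sample to a prototype group via offline k-means.

    domain_feats: (n_samples, n_features) array of handcrafted features
    (e.g., HRV statistics for ECG). Returns integer cluster assignments.
    """
    # z-normalize so no single heuristic dominates the Euclidean distance
    mu, sigma = domain_feats.mean(axis=0), domain_feats.std(axis=0) + 1e-8
    z = (domain_feats - mu) / sigma
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(z)

# toy example: 200 samples, 12 handcrafted features
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 12))
assignments = cluster_domain_features(feats, n_clusters=5)
```

The assignments are computed once before training and reused throughout, which keeps the clustering cost off the training loop.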

(b) Augmentation and Deep Feature Extraction. During training, each input time series undergoes two random augmentations (e.g., jittering, scaling, warping), producing two distinct but related views of the same signal. Both views are passed through a shared encoder network to extract latent representations. This follows the standard SSCL setup, where the model learns by contrasting pairs of representations across the batch.
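The three example augmentations named above can be sketched in numpy; the parameter values are illustrative, and the paper's exact augmentation set may differ:

```python
import numpy as np

def jitter(x, sigma=0.03, rng=None):
    """Add small Gaussian noise to the signal."""
    rng = rng if rng is not None else np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1, rng=None):
    """Rescale the whole signal by a random factor around 1."""
    rng = rng if rng is not None else np.random.default_rng()
    return x * rng.normal(1.0, sigma)

def warp(x, sigma=0.2, rng=None):
    """Distort the time axis with a random monotone resampling."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(x)
    steps = np.abs(rng.normal(1.0, sigma, size=n)) + 1e-3  # strictly increasing
    t = np.cumsum(steps)
    t = (t - t[0]) / (t[-1] - t[0]) * (n - 1)
    return np.interp(t, np.arange(n), x)

def two_views(x, rng):
    """Produce two independently augmented views of the same signal."""
    aug = lambda s: warp(scale(jitter(s, rng=rng), rng=rng), rng=rng)
    return aug(x), aug(x)

sig = np.sin(np.linspace(0, 8 * np.pi, 500))   # toy stand-in waveform
v1, v2 = two_views(sig, np.random.default_rng(0))
```

Both views keep the original length, so they can be batched and passed through the shared encoder unchanged.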

(c) Domain-Guided Instance-Level Contrast. Conventional instance-level SSCL treats only the augmentations of the same signal as positive pairs. In addition, we identify extra positive pairs using the similarity of domain features. Specifically, within each batch, the nearest neighbor of the anchor signal is selected based on similarity in the domain feature space. The two augmented views of this neighbor are treated as additional positives, reinforcing the relationship between signals that are semantically similar but are not augmented views of the same instance. This mitigates the risk of treating semantically similar signals as negatives, a limitation of conventional instance-level SSCL.
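The in-batch neighbor selection can be sketched with a pairwise distance matrix over the batch's domain features; a minimal numpy illustration:

```python
import numpy as np

def nearest_neighbor_positives(domain_feats):
    """For each anchor in the batch, return the index of its nearest neighbor
    in domain-feature space (excluding itself); that neighbor's two augmented
    views serve as additional positives."""
    d = np.linalg.norm(domain_feats[:, None, :] - domain_feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # an anchor cannot be its own neighbor
    return d.argmin(axis=1)

# toy batch of 4 samples with 2-d domain features: two similar pairs
batch_feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
nn = nearest_neighbor_positives(batch_feats)
# → [1, 0, 3, 2]: each sample pairs with its domain-feature neighbor
```

In the full method the loss would then include these neighbor views alongside the anchor's own augmentations as positives.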

(d) Domain-Guided Prototype-Level Contrast. To incorporate global structure, we use the offline cluster assignments from step (a) as a reference for prototype-level contrast. For each cluster, we maintain a prototype representation, which is updated during training using the exponential moving average (EMA) of its members’ latent features. During training, each sample is encouraged to align with its assigned prototype in the latent space. This ensures that the model preserves broader structure in the data distribution, helping to stabilize training and improve generalization across diverse physiological conditions.
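The prototype EMA update in stage (d) might look like the following sketch (the momentum value and array shapes are illustrative):

```python
import numpy as np

def ema_update(prototypes, latents, assignments, momentum=0.9):
    """Update each cluster prototype as an exponential moving average (EMA)
    of the mean latent feature of its current members.

    prototypes: (n_clusters, dim); latents: (batch, dim);
    assignments: (batch,) offline cluster ids from stage (a).
    """
    protos = prototypes.copy()
    for k in range(len(protos)):
        members = latents[assignments == k]
        if len(members):
            protos[k] = momentum * protos[k] + (1.0 - momentum) * members.mean(axis=0)
    return protos

protos = np.zeros((2, 3))
latents = np.ones((4, 3))
new_protos = ema_update(protos, latents, np.array([0, 0, 1, 1]), momentum=0.9)
# each prototype moves 10% of the way toward its members' mean
```

Because the cluster assignments are fixed offline, only the prototype vectors drift during training, which is what keeps the global structure stable.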

For technical implementation details, including loss formulations, feature extraction procedures, and training protocols, please refer to the Methods Section.

Experimental setup

Datasets

To assess the feasibility and generalizability of our framework, we conducted experiments across four types of wearable sensing modalities: ECG, EEG, IMU (focusing on 3-Axis Accelerometer), and PPG. All downstream tasks are framed as classification problems, aligning with common healthcare applications associated with each modality, such as physical activity monitoring, cardiac event detection, and sleep stage assessment (Table 1).

Table 1 Summary of datasets, modalities, and corresponding real-world applications

The CinC17 dataset contains short ECG recordings collected from portable chest patches, annotated into four categories, including normal sinus rhythm and atrial fibrillation29. The CPSC dataset provides 12-lead ECG recordings annotated by cardiologists for nine different cardiac abnormalities, though we used only lead II to simulate wearable scenarios30. The MIMIC-III-WDB31 dataset, consisting of ICU ECG recordings from bedside monitors, was primarily used for self-supervised training due to the absence of segment-level annotations32. For EEG, we used the SleepEDF dataset (two channels, Fpz-Cz and Pz-Oz), which includes long-term recordings segmented into 30-s windows and annotated with one of five sleep stages8. For IMU, we employed the Capture24 dataset (xyz accelerometer), containing wrist-worn accelerometer recordings segmented into 10-s windows for human activity recognition24. Finally, for PPG, we used the Simband dataset33,34, which comprises 10-s signal segments annotated with cardiac conditions, including normal sinus rhythm, atrial fibrillation, premature atrial/ventricular contractions, and noise. Across all datasets, basic preprocessing was applied, including resampling, band-pass filtering, segmentation, and z-normalization for consistency. All datasets were randomly split into training, validation, and test subsets, in a 6:2:2 ratio, ensuring no subject overlap. They are used to train the models, tune hyperparameters, and evaluate the performance, respectively.
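A subject-disjoint 6:2:2 split can be sketched with scikit-learn's GroupShuffleSplit; this is an illustrative stand-in for whatever splitting code was actually used, and the ratios are approximate because splitting happens at the subject level:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def subject_disjoint_split(subject_ids, seed=0):
    """Split sample indices roughly 6:2:2 by subject, so that no subject
    contributes samples to more than one subset."""
    subject_ids = np.asarray(subject_ids)
    idx = np.arange(len(subject_ids))
    # first carve off ~40% of subjects for validation + test
    gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=seed)
    train, rest = next(gss.split(idx, groups=subject_ids))
    # then split that 40% in half
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(gss2.split(rest, groups=subject_ids[rest]))
    return train, rest[val_rel], rest[test_rel]

subjects = np.repeat(np.arange(10), 5)   # 10 subjects, 5 segments each
tr, va, te = subject_disjoint_split(subjects)
```

Splitting by subject rather than by segment is what prevents leakage from the same person appearing in both training and test sets.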

Domain knowledge-informed features

Domain-knowledge-informed features were extracted from each modality using established toolboxes and following clinical guidelines. For ECG, 30 features were derived following previous established work22, including RR intervals and morphological characteristics, which are both key indicators of cardiovascular health, as well as signal quality indices that assess the informativeness and quality of wearable signals35. For EEG, 74 features (37 per channel) were computed from both temporal and spectral domains, as per the settings from previous work23, focusing on spectral power across frequency bands to capture sleep stages27. For IMU, 40 features24 were extracted, including temporal, spectral, and angular measurements, which are essential for activity recognition tasks36. For PPG, 102 features were extracted by pyPPG25, encompassing biomarkers related to peak intervals, component amplitudes, area under the curve, and other clinically relevant metrics. All features were z-normalized to zero mean and unit standard deviation to ensure consistency. A detailed description of these features is provided in the Supplementary Material Section S2.

Evaluation protocol

We employed three evaluation protocols to assess the performance and generalizability of the proposed framework: SSL, semi-supervised learning, and transfer learning. In the self-supervised setting, we separately applied K-nearest neighbors (KNN) and linear classifiers (linear probing9) to the learned representations for downstream classification across the different datasets. These two simple classifiers involve no further feature abstraction, making them well suited for directly assessing the discriminative power of the learned representations. In the semi-supervised setup, we randomly sampled a proportion (5%, 10%, and 20%) of the full training datasets (CinC17 and CPSC) and fine-tuned the entire model using a supervised loss applied only to the available labeled data. This protocol assessed the model’s performance when only a limited amount of labeled data is available. For transfer learning, the model was pretrained on the larger MIMIC-III-WDB dataset and fine-tuned on the fully annotated CinC17 and CPSC datasets to evaluate its ability to generalize across datasets and domains. Across all experiments, performance was measured using the class-average F1 score, AUROC, precision, recall, and accuracy; class-averaging reports the model’s mean performance across all classes, so minority classes are weighted equally.
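The self-supervised protocol (frozen embeddings, two simple probes, class-average F1) can be sketched as follows; the synthetic blobs stand in for learned representations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_representations(train_emb, train_y, test_emb, test_y):
    """Score frozen representations with two simple probes: a KNN classifier
    (n = 10) and a linear classifier, using macro (class-average) F1 so that
    every class counts equally."""
    knn = KNeighborsClassifier(n_neighbors=10).fit(train_emb, train_y)
    lin = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
    knn_f1 = f1_score(test_y, knn.predict(test_emb), average="macro")
    lin_f1 = f1_score(test_y, lin.predict(test_emb), average="macro")
    return knn_f1, lin_f1

# stand-in for learned embeddings: three well-separated clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
knn_f1, lin_f1 = probe_representations(X[:200], y[:200], X[200:], y[200:])
```

Because neither probe adds feature abstraction, any gap between methods reflects the embeddings themselves rather than downstream model capacity.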

To highlight the superiority of our proposed framework, we compared it against a comprehensive suite of existing SSL methods. These include general-purpose contrastive learning frameworks such as SimCLR9, BYOL37, MoCo38, NNCLR39, TS40, SwAV41, and AMCL42, all adapted for time series data. We also evaluated domain-specific baselines, including CLOCS10, originally developed for ECG-based SSL, and TFC15 and SoftIns43, both designed for general time series modalities. Additionally, RNC49 was employed to rank time series representations based on domain features, providing a comparative approach for evaluating how domain knowledge can be incorporated into SSL. For fair comparison, unless explicitly mentioned otherwise, all methods were implemented with a ResNet1844 (1-D version) backbone to derive deep features.

Furthermore, we incorporated an additional group of comparisons within the broader self-supervised learning landscape. Recent advances in general-purpose time series foundation models45 have demonstrated the potential to handle diverse temporal signals through large-scale pretraining on heterogeneous datasets46,47. To benchmark our framework against this paradigm, we evaluated the representation quality extracted from two representative encoder-decoder-based foundation models: MOMENT47 and Chronos46. These models provide pretrained encoders capable of generating embeddings across a wide range of time series modalities. Implementation details and adaptation procedures are described in the Supplementary Material Section S3.4.

Overall self-supervised learning performance

Quantitative analysis of comparison against self-supervised learning methods

To benchmark the performance of our framework across various healthcare scenarios, we evaluated the discriminativeness of features derived from SSL using the class-average F1 score on the test sets (Table 2). Simple classifiers, a linear classifier and a KNN classifier (n = 10), were applied to directly assess the discriminative power of the learned features; because these classifiers add no further feature abstraction, they evaluate the quality of the representations themselves rather than obscuring it behind additional layers. We compared our method against a series of SSCL methods, domain feature-based models (Domain Feat.), and fully supervised learning models (Fully Sup.) with randomly initialized backbones, all of which shared the same architecture as our method.

Table 2 Comparison results based on self-supervised learning

Our method demonstrates superior performance compared to other SSL methods and even outperforms fully supervised models in certain cases, particularly with IMU data, where it achieves a class-average F1 score of 0.526 compared to 0.484. This improvement is likely attributable to our domain knowledge-guided approach, which helps mitigate the effects of inter-subject variability, especially in physical activity data, where individual movement patterns can differ significantly24. Further qualitative analysis is presented in Fig. 5. These results suggest that integrating domain knowledge, even in an SSL context, helps reduce biases and enhances the overall robustness of the model.

In addition to class-average F1, we evaluated the best-performing (in terms of class-average F1 score) models on additional metrics, including precision, recall, area under the receiver operating characteristic curve (AUC), and accuracy (as shown in Fig. 3). Our method consistently outperformed other SSL approaches across almost all these metrics, further demonstrating its generalizability and effectiveness in learning robust representations from wearable health data.

Fig. 3: Performance comparison in terms of class-average F1, Accuracy (Acc), Area Under the Receiver Operating Characteristic Curve (AUC), Recall, and Precision (Prec).
figure 3

We compared the performance on the test subset, based on the model with the best F1 score on the validation subset. The values are normalized by the deviation from the best performance for each respective metric.

Compared to state-of-the-art SSCL methods, our approach benefits from the incorporation of domain-knowledge-driven features, which enhances the selection of contrastive pairs and ensures that the model captures clinically meaningful patterns. This finding highlights a promising direction for advancing SSL in time series data, particularly in healthcare applications where domain knowledge plays a critical role in improving model performance.

Quantitative analysis of comparison against time series foundation models

In terms of general-purpose time series foundation models, both MOMENT and Chronos were pretrained on large, heterogeneous collections of non-medical time series (e.g., industrial, financial, environmental), and thus lack exposure to high-frequency, high-dimensional physiological signals. While MOMENT outperforms Chronos (e.g., ECG: KNN F1 0.435 vs. 0.325; Linear Probing F1 0.514 vs. 0.301), neither matches our domain-guided SSCL on clinical data. Foundation models capture broad temporal patterns but miss the fine-grained, medically relevant distinctions in ECG, EEG, and PPG. The one exception appears in the IMU KNN evaluation, where MOMENT leads with 0.517 compared to our 0.465, suggesting that large, heterogeneous pretraining can sometimes encode general motion dynamics very effectively. These results underscore the distinct nature of healthcare wearable waveforms and show that pretraining on domain-aligned data, potentially combined with medical knowledge-informed contrastive learning, yields more robust, generalizable representations. Please refer to Supplementary Material S4.1 for additional results.

Qualitative analysis of the waveform attention map

We employed Grad-CAM48 to visually interpret the learned features of different methods by generating attention maps superimposed onto the original waveforms, as shown in Fig. 4. Grad-CAM highlights key regions of the input data that are most influential in model predictions, allowing for a direct assessment of feature importance. We combined a fully-trained classifier with the learned backbone to produce these attention maps.

Fig. 4: Grad-CAM-based waveform saliency map visualizations for two representative samples.
figure 4

A fully trained classifier, appended to each self-supervised learning backbone, was used to generate the saliency maps. These maps highlight the most important waveform segments that contribute to the model’s predictions. The saliency values, normalized between 0 and 1 for each sample, allow for improved visual comparison. Our method assigns higher saliency to clinically relevant regions, such as ST-segment elevation, thereby improving both interpretability and prediction accuracy.

For the atrial fibrillation case in Fig. 4, our method effectively captures clinically relevant morphological abnormalities, such as the absence of P waves (a key indicator of atrial fibrillation), demonstrating its ability to focus on diagnostically critical features. Figure 4 further presents a representative sample of ST-segment elevation. Here, our approach consistently assigns higher saliency to waveform subsegments that directly reflect ST elevation, while effectively ignoring irrelevant or noisy sections. This selective attention to clinically meaningful subsegments not only enhances the interpretability of the model’s decision-making but also boosts its overall prediction accuracy (Table 2).
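Grad-CAM in general requires backpropagating the class score through the trained network, but for the common 1-D conv backbone ending in global average pooling plus a linear head, the gradient-derived channel weights reduce exactly to the head's class weights, so a saliency map like those in Fig. 4 can be sketched without autograd. This is an illustrative special case, not the paper's exact pipeline:

```python
import numpy as np

def saliency_1d(activations, head_weights, target_class):
    """Class-activation map over time for a backbone ending in global average
    pooling and a linear head; in that architecture Grad-CAM's channel weights
    equal the head weights of the target class.

    activations: (channels, time) feature maps from the last conv layer
    head_weights: (n_classes, channels) linear classifier weights
    """
    w = head_weights[target_class]
    cam = np.maximum((w[:, None] * activations).sum(axis=0), 0.0)  # ReLU
    if cam.max() > 0:            # normalize to [0, 1] per sample, as in Fig. 4
        cam = cam / cam.max()
    return cam

acts = np.zeros((2, 6))
acts[0, 2] = 3.0                 # channel 0 fires strongly at t = 2
acts[1, 5] = 1.0
weights = np.array([[1.0, 0.0],  # class 0 reads channel 0
                    [0.0, 1.0]])
cam = saliency_1d(acts, weights, target_class=0)
```

The resulting per-timestep weights can then be rendered as a color overlay on the original waveform.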

Qualitative analysis of feature distribution

Furthermore, we included a t-SNE visualization (Fig. 5) comparing domain features, features from end-to-end supervised learning, features learned by SimCLR, and features learned by our method, computed on the full Capture24 dataset. The first column shows that domain knowledge-based features are distributed evenly across training/validation/test subjects. In contrast, features learned via end-to-end supervised training (second column) exhibit clear clustering by training versus validation/test split. This suggests that the model may have overfitted to subject-specific characteristics, such as individual wearing patterns or sensor noise, which are not directly related to activity-relevant semantic information.

Fig. 5: t-SNE visualizations of domain features, features extracted from end-to-end supervised learning, from SimCLR, and from our method, respectively.
figure 5

Each point corresponds to a sample from the Capture24 dataset. Each column shows the same embedding, with points colored in two ways: by data split (train/validation/test) in the top row, and by class label in the bottom row.

As shown in the third and fourth columns (SimCLR and ours), both approaches produce feature representations that are more evenly distributed across the training, validation, and test sets. SimCLR performs instance-level contrast, which implicitly pushes apart the representations of instances from the same subject, helping the model avoid learning subject-specific characteristics as biases. However, its learned representations are not sufficiently discriminative semantically: in the second-row, third-column plot, points from different classes are more intermixed than in the other plots of that row. Our method goes a step further by using domain features to guide positive pair selection, which helps the model focus on semantically meaningful patterns rather than incidental differences between individuals.
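Projections like those in Fig. 5 can be reproduced with scikit-learn's t-SNE; a minimal sketch with synthetic stand-in embeddings (the perplexity value is illustrative and must stay below the sample count):

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in for learned embeddings: 100 samples, 16-d, two loose clusters
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (50, 16)),
                   rng.normal(4, 1, (50, 16))])

# project to 2-D for scatter plots colored by split or by class label
emb2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
```

Plotting the same 2-D coordinates twice, once colored by data split and once by class label, reproduces the two-row layout of Fig. 5.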

Semi-supervised with a small proportion of annotations

To further test the efficacy of our method, we conducted experiments in semi-supervised learning settings. We fine-tuned the backbone network using varying fractions of labeled data, with label ratios of 10%, 20%, 50%, and 100%; fine-tuning simply applied a supervised loss to the labeled subset only. The results for class-average F1 scores are presented in Fig. 6.
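The labeled subsets at each ratio can be drawn with a sampler like the following sketch; the class-stratified choice is an assumption for illustration (the text states only that sampling is random):

```python
import numpy as np

def sample_label_subset(y, fraction, seed=0):
    """Pick a class-stratified subset of indices to keep labeled; the
    supervised fine-tuning loss is then computed on these indices only."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n = max(1, int(round(fraction * len(idx))))
        keep.extend(rng.choice(idx, size=n, replace=False))
    return np.sort(np.array(keep))

y = np.array([0] * 40 + [1] * 60)       # toy label vector
labeled_idx = sample_label_subset(y, fraction=0.1)
```

Stratification guarantees that every class appears in the labeled subset even at the smallest ratios, which matters for class-average F1.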

Fig. 6: Performance comparison of pretrained models fine-tuned using varying proportions of labeled data under a semi-supervised learning setting on ECG datasets.
figure 6

Fine-tuning was performed using a supervised loss computed only on the labeled samples. Model performance is reported in terms of class-average F1 score.

Our method consistently outperforms others, particularly when labeled data is scarce. For instance, with only 10% of labeled data, our approach demonstrates consistent improvements over competing methods. Additionally, our method is less sensitive to the proportion of labeled data compared to other models, maintaining strong performance even with small amounts of annotated data. This confirms the label efficiency of our proposed framework, as the features learned by SSL already show strong discriminativeness, reducing the reliance on large labeled datasets for effective downstream performance.

Transfer learning with a large-scale dataset

To demonstrate the transferability of our learned representations across different datasets within the same modality, we utilized the large-scale MIMIC-III-WDB dataset for pretraining. After pretraining, we fine-tuned the entire network on the labeled CPSC and CinC17 datasets. The results, presented in Table 3, show that self-supervised pretraining on the diverse and extensive MIMIC-III-WDB dataset improves performance across all methods, an improvement likely attributable to the dataset’s size and variability.

Table 3 Comparison results based on transfer learning on ECG datasets

Among all the methods evaluated, our proposed approach consistently demonstrates superior performance in nearly all cases. This highlights the effectiveness of our method in both small-scale and large-scale data scenarios, suggesting that the SSL pipeline is capable of learning transferable representations that generalize well to other datasets. Our method not only excels in extracting meaningful features for downstream tasks but also proves to be robust across varied healthcare datasets, providing large improvements in transfer learning scenarios.

Impact of domain knowledge-informed features

Sensitivity to domain features

Recognizing the critical role of domain features within our integrated framework, we conducted a comprehensive ablation study on these features in the ECG domain. Specifically, we categorized the features into distinct groups (AF-related features, morphological features, RR interval-related features, and beat similarity-related features) and systematically ablated each group individually. Our results, shown in Fig. 7, illustrate that a traditional logistic regression model trained directly on domain features (in blue) exhibits evident drops in F1 score when any group is removed, reflecting a strong dependency on the presence of specific clinical descriptors. By contrast, our SSCL approach (in green) remains robust, consistently outperforming direct regression even with reduced or non-optimized feature sets. For example, the weakest feature group under SSCL still achieved an F1 score of 0.545, exceeding the SimCLR baseline of 0.532. This indicates that our framework can effectively leverage domain signals, even when they are coarse or weak, without requiring highly discriminative feature design.

Fig. 7: Comparison of feature group importance for direct logistic regression versus our framework.
figure 7

The heatmap on the left illustrates which combinations of features (AF-Feature, Morphological, RR Interval, and Similarity) were used, where purple cells indicate that the corresponding feature group was included. The bar plot on the right shows the class-average F1 score for each feature combination, with direct logistic regression represented in blue (hatched) and our self-supervised contrastive learning method in green.

This robustness is clinically meaningful. In real-world wearable settings, data quality often varies, and key features—such as clean RR intervals or stable waveform segments—may be degraded by noise, motion artifacts, or incomplete recording. Moreover, the extracted features, while widely used and clinically grounded, may not be tailored to any specific disease. Our framework’s ability to tolerate such imperfections and still extract meaningful representations makes it particularly well-suited for practical deployment in diverse and uncontrolled environments.

Comparison of different strategies for using domain features

We further compared several approaches to incorporating domain knowledge into SSL, with results shown in Table 4. These include (i) Deep Regression, which directly maps deep features to domain features via a multilayer perceptron module trained with an L2 regression loss; (ii) Ranking, where samples are ranked by Euclidean distance in the domain feature space and the ranking then guides contrastive learning as in RNC49; and (iii) Nearest Neighbour Only, which applies domain-guided instance-level contrast without prototypes. As outlined in Table 4, direct feature regression leads to suboptimal performance. While domain-knowledge-informed features encode semantically meaningful information, they often lack sufficient separability; relying solely on them reduces overall performance, reinforcing the importance of integrating them into an SSL framework to enhance both semantic understanding and discriminative capability.
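For intuition, baseline (i) reduces to minimizing a squared error between predicted and handcrafted features; a toy sketch with a single linear map standing in for the MLP head:

```python
import numpy as np

def l2_regression_loss(deep_feats, domain_feats, W, b):
    """Baseline (i): predict domain features from deep features with a linear
    layer (an illustrative stand-in for the MLP) and penalize squared error."""
    pred = deep_feats @ W + b
    return float(np.mean((pred - domain_feats) ** 2))

deep = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.eye(2)                       # identity mapping for the toy check
b = np.zeros(2)
loss = l2_regression_loss(deep, deep, W, b)   # perfect prediction → 0
```

Minimizing this loss forces the encoder to reproduce the handcrafted features, which caps its discriminative power at theirs; the contrastive variants avoid that ceiling by using the features only as guidance.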

Table 4 Comparison of domain feature usage strategies on the CPSC (ECG) dataset

On the other hand, domain-feature-guided ranking achieves performance comparable to nearest neighbour search only, suggesting that domain features offer useful guidance during representation learning. However, both approaches fall short of our full framework, which combines instance-level and prototype-level contrast. This combination leverages both local and global domain-informed relationships, leading to the most robust and discriminative feature representations.

Model-agnostic nature of our framework

To highlight the flexibility of our framework, we tested its performance across various backbone architectures, including ResNet18, ResNet34, ResNet50, and MSDNN50, under SSCL conditions. As shown in Table 5, our method consistently achieves superior performance across the different backbones, confirming its adaptability and robustness. These results show that our approach is model-agnostic, capable of strong performance regardless of the underlying architecture, making it versatile for diverse applications in healthcare wearables.

Table 5 Comparison results of different backbones (ResNet18, ResNet34, ResNet50, and MSDNN) on CPSC (ECG) dataset


