Sleep staging is a critical task in sleep medicine, involving the classification of sleep stages based on the analysis of various physiological signals such as electroencephalography (EEG), electromyography (EMG), and electrooculography (EOG) [1]. These signals are typically recorded using multiple sensors placed on the scalp, face, and limbs, collectively forming a polysomnogram (PSG) used for diagnosing and treating sleep disorders [2]. Rechtschaffen and Kales (R&K) [3] and the American Academy of Sleep Medicine (AASM) [4] provide guidelines for sleep staging. The two criteria differ [3–5] chiefly in the mapping of the R&K stages (S1, S2, S3, S4) to the AASM stages (N1, N2, and N3): in the AASM scheme, deep sleep (N3) merges the R&K stages S3 and S4, and, notably, the AASM criteria do not include the movement time stage.

The EEG serves as a fundamental signal for monitoring brain activity and diagnosing sleep disorders [6]. The electrical activity detected by electrodes positioned on the scalp primarily reflects the combined excitatory and inhibitory postsynaptic potentials in the apical dendrites of pyramidal neurons in the superficial cortical layers; substantial synchronous activation of extensive cortical areas is required to produce detectable changes in the potential registered by scalp electrodes [6,7]. Each stage of sleep corresponds to distinct patterns of brainwave activity and neuronal behavior [6]. Brainwaves are commonly divided into five frequency bands: Delta (1–4 Hz), Theta (4–8 Hz), Alpha (8–13 Hz), Beta (13–30 Hz), and Gamma (30–100 Hz) [8]. Sleep EEG recordings can therefore be assessed by analyzing these frequency bands (a minimal band-power computation is sketched below), along with sleep-related waveforms such as K-complexes and sleep spindles, enabling the determination of sleep stages [6,9].

EOG is a valuable complementary tool in sleep stage classification, especially during rapid eye movement (REM) sleep. EOG signals aid in distinguishing between REM and non-REM (NREM) sleep, and their analysis, combined with EEG waveforms such as K-complexes and spindles, contributes to the delineation of the various sleep stages [10,11].
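To make the band definitions above concrete, the following is a minimal sketch, assuming a 30-s single-channel EEG epoch sampled at 256 Hz (an illustrative rate chosen so that the Gamma band lies below the Nyquist frequency), of how relative band power can be estimated from Welch's power spectral density. The function and variable names are illustrative, not taken from any cited method.

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import simpson

# The five canonical EEG frequency bands (Hz) named in the text
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 100)}

def relative_band_powers(epoch, fs=256.0):
    """Relative power per band for one EEG epoch (1-D array)."""
    # Welch PSD with 4-s segments gives ~0.25 Hz frequency resolution
    freqs, psd = welch(epoch, fs=fs, nperseg=int(4 * fs))
    total = simpson(psd, x=freqs)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = simpson(psd[mask], x=freqs[mask]) / total
    return powers

# Example on a synthetic 30-s epoch (the standard scoring window)
rng = np.random.default_rng(0)
print(relative_band_powers(rng.standard_normal(30 * 256)))
```

Per-band powers computed this way are among the simplest descriptors used to separate, for example, Delta-dominated deep sleep from Alpha-rich relaxed wakefulness.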
Manual analysis, in which an expert visually classifies each segment, is labor-intensive and susceptible to human error [12–14]. These challenges have spurred researchers to devise methods and algorithms for automating sleep-related analyses; such automated approaches offer both increased speed and enhanced accuracy in sleep stage classification [15].
Numerous techniques for automating sleep studies have been proposed, evolving significantly over the years. Traditional methods typically extract hand-crafted features from biosignals such as EEG, EOG, and EMG and feed these features into conventional classifiers. For instance, Pascualvaca et al. [16] presented an enhanced sleep stage classification model incorporating EOG and EMG signals and evaluating different SVM multi-classification techniques, demonstrating significant improvements in classification accuracy. Ghimatgar et al. [17] introduced a single-channel EEG-based scoring method utilizing Modified Graph Clustering Ant Colony Optimization (MGCACO) for feature selection, Random Forest (RF) for classification, and a Hidden Markov Model (HMM) for post-processing, achieving high accuracy and robustness on multiple datasets. Shen et al. [18] proposed an algorithm employing Improved Model-Based Essence Features (IMBEFs) and Bagged Trees for classification, demonstrating high accuracy on the Sleep-EDF and DREAMS datasets. More recently, Pei et al. [19] developed a method using Mel-frequency cepstral coefficient (MFCC) features and a deep CNN-LSTM model, demonstrating effectiveness and computational efficiency on the SHHS and UCDDB datasets.
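As a schematic of this classical feature-plus-classifier pipeline (and not a reproduction of any of the cited feature sets), the sketch below pairs a few generic per-epoch time-domain features with an off-the-shelf Random Forest; the feature choices, data shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def epoch_features(epoch):
    """A few generic time-domain descriptors for one 30-s epoch."""
    return [epoch.mean(), epoch.std(), skew(epoch), kurtosis(epoch),
            np.mean(np.abs(np.diff(epoch)))]  # mean absolute first difference

# Placeholder data: 200 epochs of 3000 samples (30 s at 100 Hz) with
# integer stage labels 0..4; a real pipeline would load PSG epochs here.
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((200, 3000))
y = rng.integers(0, 5, size=200)

X = np.array([epoch_features(e) for e in X_raw])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~chance on random data
```

Spectral descriptors such as the band powers shown earlier are typically appended to such time-domain features before classification.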
Deep learning-based models have revolutionized the field by automatically extracting features from raw data, showing significant improvements in classification accuracy. Supratak et al. [20] introduced DeepSleepNet, a model combining CNNs for feature extraction and Bi-LSTMs for capturing temporal dependencies, setting a benchmark on the MASS and Sleep-EDF datasets. Niroshana et al. [21] developed a CNN-GRU model using EEG and EOG signals, achieving high accuracy for both patients and healthy subjects. Jiang et al. [22] proposed a robust classification method using multimodal signal decomposition and an HMM-based refinement process, demonstrating superior performance on the Sleep-EDF and MASS databases. Huang et al. [23] introduced a single-channel EEG sleep staging method using a transition-optimized Hidden Markov Model (TO-HMM), combining power spectral density (PSD)-based feature selection with a Gaussian mixture model-hidden Markov model (GMM-HMM). Eldele et al. [24] proposed AttnSleep, featuring a multi-resolution CNN and an adaptive feature recalibration (AFR) module, significantly enhancing feature extraction and classification accuracy. Jia et al. [25] introduced GraphSleepNet, an adaptive spatial-temporal graph convolutional network that dynamically learns intrinsic EEG channel connections, achieving superior performance on the MASS dataset. Olesen et al. [26] developed a deep residual network for sleep stage classification using 15,684 PSG studies from five large-scale cohorts, employing a modified ResNet-50 architecture combined with a bidirectional gated recurrent unit (GRU). Guillot and Thorey [15] presented RobustSleepNet, a transfer learning model leveraging an attention mechanism and multi-head recombination, achieving high robustness and generalization across diverse datasets. You et al. [27] developed a method combining time-frequency and fractional Fourier transform (FRFT) domain features with a bidirectional LSTM network, demonstrating high accuracy with fewer parameters.

Other models have continued to push the boundaries of sleep stage classification. BootstrapNet [28] employs contrastive learning on raw single-channel EEG data, enhancing feature representation and improving generalization across datasets, though it faces increased computational demands and a dependency on large annotated datasets. CCRRSleepNet [29] integrates convolutional neural networks with relational inductive biases, capturing complex data relationships and achieving high accuracy for single-channel EEG data; however, the added complexity and computational requirements pose challenges for scalability. SingleChannelNet [30] focuses on efficiency with fewer parameters, making it suitable for resource-constrained devices, though its simplicity may limit generalization without further fine-tuning. SleePyCo [31] leverages a feature pyramid network and contrastive learning to capture multi-scale features, enhancing accuracy but increasing computational requirements and training complexity. TinySleepNet [32] balances efficiency and performance with fewer parameters and data augmentation techniques, making it suitable for embedded devices, though this may limit the depth of feature extraction.
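The CNN-plus-recurrent design that recurs throughout these works (a convolutional per-epoch encoder followed by a bidirectional recurrent layer over the epoch sequence) can be summarized in a compact PyTorch sketch. The layer sizes and kernel parameters below are placeholders; this is a schematic of the design pattern, not the published DeepSleepNet architecture.

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    """Schematic epoch encoder (CNN) + sequence model (Bi-LSTM)."""
    def __init__(self, n_classes=5, feat_dim=64):
        super().__init__()
        # Per-epoch feature extractor over raw single-channel EEG
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=50, stride=6), nn.ReLU(),
            nn.MaxPool1d(8),
            nn.Conv1d(32, feat_dim, kernel_size=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # -> (batch*seq, feat_dim, 1)
        )
        # Bi-LSTM over the sequence of epoch features captures transitions
        self.rnn = nn.LSTM(feat_dim, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                     # x: (batch, seq_len, n_samples)
        b, s, n = x.shape
        f = self.cnn(x.reshape(b * s, 1, n)).squeeze(-1)  # (b*s, feat_dim)
        h, _ = self.rnn(f.reshape(b, s, -1))              # (b, s, 128)
        return self.head(h)                               # per-epoch logits

logits = CnnBiLstm()(torch.randn(2, 10, 3000))  # 10 epochs of 30 s at 100 Hz
print(logits.shape)                             # torch.Size([2, 10, 5])
```

The key design choice is that the recurrent layer sees one feature vector per 30-s epoch, so temporal context is modeled across epochs rather than within them.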
Advanced models such as 3DSleepNet [33] integrate multiple convolutional layers and an attention mechanism to capture spatial-temporal and spatial-spectral features from multi-channel biosignals, combining 3D-CNN and LSTM layers for enhanced sleep stage classification of both healthy subjects and those with sleep disorders, and demonstrating high accuracy and robust performance. Similarly, MixSleepNet [34] leverages 3D-CNNs and Graph Convolutional Networks (GCNs) to capture complex spatial-temporal features, dynamically updating the adjacency matrix to reflect changing brain region connections and achieving superior performance across datasets. EEGSNet [35] converts raw EEG signals into spectrograms using the Fourier transform, applying CNNs for local feature extraction and Bi-LSTMs for capturing temporal dependencies, excelling particularly in handling unbalanced data and demonstrating strong generalization across subjects and datasets. The Multilevel Temporal Context Network (MLTCN) [36] integrates Temporal Convolutional Networks (TCN) and Hidden Markov Models (HMM) to capture multilevel temporal features, significantly improving classification accuracy. The Neural Architecture Search (NAS) framework [37] automates the design of optimal neural network architectures for sleep stage classification, utilizing differentiable architecture search (DARTS) and progressive differentiable architecture search (P-DARTS) to optimize feature extraction and classification performance, showcasing high accuracy and generalization. The LSTM-Ladder Networks (LLN) model [38] combines the Ladder Network's reconstruction capabilities with the LSTM's sequential data handling to enhance classification performance, especially by leveraging both labeled and unlabeled data. Multi-View Spatial-Temporal Graph Convolutional Networks (MSTGCN) [39] use functional and spatial distance-based brain graphs to perform spatial-temporal graph convolutions, capturing both spatial and temporal features effectively and including adversarial domain generalization to enhance robustness across subjects. CTCNet [40] integrates CNN, Transformer, and Capsule Network technologies, excelling in handling class-imbalanced datasets and significantly improving classification accuracy, particularly for the N1 sleep stage. Jumping Knowledge-Based Spatial-Temporal Graph Convolutional Networks (JK-STGCN) [41] combine adaptive graph learning and jumping knowledge networks to capture spatial and temporal dependencies, yielding high classification accuracy and efficiency suitable for real-time applications. Finally, the 1D-ResNet-SE-LSTM model [42] integrates one-dimensional residual convolutional neural networks with squeeze-and-excitation modules (sketched after this overview) and LSTM networks, effectively recalibrating feature responses and capturing both spatial and temporal dependencies, demonstrating robustness and high performance across different datasets.

In the broader context of time series classification, some models designed for general applications beyond sleep staging also offer valuable insights. Xiao et al. [43] introduced the Robust Temporal Feature Network (RTFN) for time series classification, combining a Temporal Feature Network (TFN) with an LSTM-based attention network (LSTMaN) to capture both local and global temporal features. Ji et al. [44] proposed a space-embedding strategy for anomaly detection in multivariate time series (SES-AD), using LSTM models to process dissimilarity vectors derived from projected time series data, achieving robust anomaly detection. Although these general time series models are not specifically designed for sleep stage classification, they offer methodologies applicable to a variety of time-dependent data analysis tasks.
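As a concrete illustration of one of these building blocks, the squeeze-and-excitation (SE) recalibration described for the 1D-ResNet-SE-LSTM model can be written as a short one-dimensional module. The sketch below is a generic SE block under assumed shapes and a conventional reduction ratio, not the published implementation.

```python
import torch
import torch.nn as nn

class SE1d(nn.Module):
    """Squeeze-and-excitation for 1-D feature maps (channel recalibration)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, length)
        w = x.mean(dim=-1)                # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1)      # excitation: per-channel weights
        return x * w                      # recalibrate feature responses

x = torch.randn(4, 64, 500)               # e.g. 64 feature maps of length 500
print(SE1d(64)(x).shape)                   # torch.Size([4, 64, 500])
```

Inserted after each residual block, such a module lets the network emphasize informative feature channels at negligible parameter cost.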
Recent advancements in sleep stage classification have led to the development of several state-of-the-art (SOTA) models that leverage deep learning techniques. These models have significantly improved the accuracy and robustness of sleep stage classification by integrating various biosignals, such as EEG, EMG, and EOG. For instance, models like 3DSleepNet [33] and MixSleepNet [34] have demonstrated the efficacy of multi-channel integration and of attention mechanisms that enhance focus on critical temporal and frequency information. The inclusion of pseudo-3D convolutions in these models effectively reduces computational complexity while maintaining their ability to extract complex features. However, the computational demands and complex architectures of these models pose significant challenges for real-time applications and implementation: they often require powerful hardware and optimized software environments, making them less accessible for widespread clinical or at-home use.

Other models, such as EEGSNet [35] and the Multilevel Temporal Context Network (MLTCN) [36], focus on spectrogram-based feature extraction and temporal convolutional networks to capture intra-epoch and inter-epoch temporal features. These approaches, while effective in improving classification performance, substantially increase the computational load due to the high dimensionality of spectrogram images and the multiple stages of temporal learning required. Furthermore, these models may struggle to generalize across different datasets, necessitating large amounts of data for training and validation to avoid overfitting and ensure robust performance.

Incorporating Long Short-Term Memory (LSTM) networks, models such as the LSTM-Ladder Network [38] and hybrids such as the 1D-ResNet-SE-LSTM [42] excel at capturing temporal dependencies and transition rules among sleep stages. However, these models often involve sophisticated layers and loss functions to handle class imbalances (a minimal weighted-loss baseline is sketched below), which complicates the training process. Additionally, the reliance on high-quality labeled data can limit their applicability, as acquiring such data is often time-consuming and expensive.

Graph convolutional networks (GCNs) have also been adapted for sleep stage classification, as in the Multi-View Spatial-Temporal Graph Convolutional Networks (MSTGCN) [39] and the Jumping Knowledge-Based Spatial-Temporal Graph Convolutional Networks (JK-STGCN) [41]. These models effectively capture spatial and temporal dependencies by constructing functional and spatial brain graphs and employing advanced graph learning techniques. Despite these benefits, the high computational demand and implementation complexity associated with these methods present substantial barriers: they often require extensive preprocessing and optimization, which can be resource-intensive and may limit their practicality in real-time or resource-constrained environments.

Lastly, the CTCNet [40] model, which integrates CNN, Transformer, and Capsule Network technologies, showcases the potential for capturing local features, global temporal context, and spatial relationships within EEG data. It achieves high classification accuracy, particularly on class-imbalanced datasets, but its complexity and resource requirements remain significant challenges.
The integration of multiple advanced techniques in a single model leads to increased computational overhead, making it difficult to deploy such models in environments with limited computational power.
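In its simplest form, the class-imbalance handling mentioned above amounts to re-weighting the training loss by inverse class frequency, so that errors on rare stages such as N1 cost more. The sketch below shows this common baseline in PyTorch; the per-stage epoch counts are hypothetical, and this is not the specific loss of any cited model.

```python
import torch
import torch.nn as nn

# Hypothetical per-stage epoch counts (W, N1, N2, N3, REM); N1 is rare
counts = torch.tensor([800.0, 120.0, 1500.0, 400.0, 600.0])
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(32, 5)                      # model outputs for a batch
labels = torch.randint(0, 5, (32,))
print(criterion(logits, labels))                 # N1 errors are up-weighted
```

More elaborate schemes, such as focal losses or class-balanced sampling, build on this same idea.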
Overall, while these state-of-the-art models demonstrate remarkable advancements in sleep stage classification by leveraging various deep learning techniques, their high computational demands, complex architectures, and reliance on extensive datasets highlight the need for further research. Optimization for practical, real-time applications remains a critical challenge. Addressing these issues will be crucial for the broader adoption of these models in clinical and home settings, ensuring that the benefits of these advancements can be effectively translated into practical tools for sleep analysis.
Despite significant advancements, several challenges continue to impede the development of highly accurate, robust, and generalizable sleep stage classification models. Variability in data quality is a major issue, as EEG signals are prone to noise and artifacts arising from different acquisition methods; this variability complicates the creation of universal models that perform consistently across datasets and settings. Class imbalance within sleep stage datasets, where certain stages are far more prevalent than others, biases the training process and degrades performance on minority classes, making it difficult for models to accurately classify less common stages. Models like CoSleepNet [46] address class imbalance in EEG-EOG datasets but face difficulties generalizing across varied datasets.

Another significant challenge is domain shift, where differences in data distributions across datasets cause models trained on one dataset to perform poorly on others. Techniques such as ADAST [45] and RobustSleepNet [15] have been developed to address this issue, and personalized approaches like LightSleepNet [47] aim to mitigate it further, but domain adaptation remains a significant hurdle and requires further refinement. Accurately capturing temporal dependencies in sleep data is also crucial yet challenging: models like DeepSleepNet [20] and CNN-GRU [21] leverage recurrent neural networks for this purpose, but fully capturing the complex temporal patterns in sleep stages remains difficult. Effective feature extraction is likewise critical, yet both traditional and deep learning methods often require extensive labeled data and significant computational resources, which can be prohibitive. The high computational demands of deep learning models further limit their use in resource-constrained environments, such as home-based monitoring systems, where lightweight and efficient models are needed.

Inconsistent evaluation metrics and protocols across studies also hinder direct performance comparisons, underscoring the need for standardized benchmarks in the field (a minimal example of commonly reported metrics is sketched below). Ensuring model interpretability and clinical relevance is essential, as many models function as black boxes, making it difficult for clinicians to trust and understand their decisions; this lack of transparency can be a barrier to clinical adoption. Further challenges include ensuring scalability and robustness in real-time applications: handling large-scale sleep datasets efficiently and processing data in real time are crucial for applications like home monitoring and wearable devices. Integrating data from various biosignals and other sources, such as heart rate and respiration, can enhance classification accuracy but also increases model complexity and computational requirements. Advanced techniques such as transfer learning and domain adaptation offer promising solutions but still face challenges related to overfitting and generalization. Models must also be robust to noise, particularly in non-clinical environments where various noise sources can degrade performance, and interoperability between systems, along with standardization of data formats and protocols, is critical for the broader application and comparison of models.
Lastly, ensuring explainability and trust in model decisions is vital, not only for interpretability itself but also for providing accurate and meaningful explanations in a clinical context.
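Regarding the inconsistent evaluation protocols noted above, most sleep staging studies report at least overall accuracy, per-class and macro-averaged F1 scores, and Cohen's kappa, all of which can be computed uniformly with scikit-learn. In the short sketch below, the labels are random placeholders standing in for an expert hypnogram and a model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

stages = ["W", "N1", "N2", "N3", "REM"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=1000)  # placeholder expert hypnogram
y_pred = rng.integers(0, 5, size=1000)  # placeholder model predictions

print("accuracy:     ", accuracy_score(y_true, y_pred))
print("macro F1:     ", f1_score(y_true, y_pred, average="macro"))
print("per-class F1: ",
      dict(zip(stages, f1_score(y_true, y_pred, average=None).round(3))))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
```

Reporting per-class F1 alongside accuracy matters because, on imbalanced hypnograms, overall accuracy can appear high even when minority stages such as N1 are classified poorly.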
The rest of the paper is organized as follows: Section 2 details the materials and methods, describing our dataset and the techniques we employed. Section 3 presents the comparison metrics we used together with the results of our experiments. Section 4 provides a detailed discussion of our findings. Finally, Section 5 concludes the paper, summarizing the key takeaways and implications of our research.