How Smart Rings Calculate Sleep Stages (and Why They're Mostly Guessing)

TL;DR

Your smart ring does not measure sleep stages. It measures heart rate variability and movement, then runs a classifier trained on people who were simultaneously wired to EEG machines in sleep labs. The ring is guessing. That guess is sometimes useful and often wrong, and the industry has spent a decade pretending otherwise. Pulsyn makes the same guess on your phone, but we tell you it is a guess.

What polysomnography actually measures

The gold standard for sleep staging is polysomnography, or PSG. If you have ever done a sleep study, you know the setup: twenty-plus electrodes glued to your scalp, face, and chest; wires running to a wall-mounted amplifier; a technician watching the waveforms in real time from another room. PSG records EEG for brain activity, EOG for eye movements, and EMG for muscle tone. A trained sleep technologist scores the data in 30-second epochs, assigning each epoch to wake, REM, light N1/N2, or deep N3 sleep based on the actual electrical signals your brain and body produce.

The distinction between light and deep sleep is not a vibe. Deep sleep (N3) is defined by slow delta waves on EEG, which must fill at least 20 percent of an epoch. REM sleep is defined by a wake-like, low-amplitude EEG pattern combined with rapid eye movements and a near-complete loss of muscle tone on EMG. These are physiological facts. There is no substitute for measuring them.

A consumer smart ring has none of these sensors. It has a three-axis accelerometer and a PPG sensor: a set of green LEDs shining into your finger and a photodiode reading the reflection. Apart from the occasional temperature sensor, that is the entire input set. The question is not whether a ring can measure sleep stages as well as PSG. The question is whether it can measure them at all.

[Image: A medical sleep study setup with monitoring equipment and sensors, the actual gold standard that smart rings try to approximate without the electrodes]

How rings actually guess

Every mainstream sleep wearable (Oura, Whoop, RingConn, the Apple Watch in sleep mode, and every Fitbit with an SpO2 sensor) uses the same fundamental pipeline, whether it sits on your finger or your wrist. It is not a secret. The patent literature and academic papers are public.

First, the ring collects inter-beat intervals from the PPG sensor. The green LED illuminates the capillary bed in your finger, and the photodiode measures the variation in reflected light that corresponds to blood volume changes with each heartbeat. The firmware detects the peak of each pulse wave and records the time between peaks. That is your inter-beat interval series, and from it the algorithm derives heart rate variability.
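The peak-to-interval step can be sketched in a few lines. This is a minimal illustration, not any manufacturer's firmware: the function names, the 300 ms refractory gap, and the synthetic cosine "pulse" are all assumptions made for the example. A real PPG trace would need filtering and motion-artifact rejection before anything this simple would work.

```python
import math

def detect_peaks(samples, min_gap):
    """Return indices of local maxima above the signal mean, at least min_gap samples apart."""
    mean = sum(samples) / len(samples)
    peaks = []
    for i in range(1, len(samples) - 1):
        is_local_max = samples[i - 1] < samples[i] >= samples[i + 1]
        if is_local_max and samples[i] > mean:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return peaks

def inter_beat_intervals_ms(samples, sample_rate_hz):
    """Convert peak positions into inter-beat intervals in milliseconds."""
    # Assumed refractory period: ignore candidate peaks closer than 300 ms.
    min_gap = int(0.3 * sample_rate_hz)
    peaks = detect_peaks(samples, min_gap)
    return [(b - a) * 1000.0 / sample_rate_hz for a, b in zip(peaks, peaks[1:])]

# Synthetic stand-in for a PPG trace: a clean 1 Hz pulse (60 bpm) sampled at 50 Hz for 10 s.
SAMPLE_RATE_HZ = 50
ppg = [math.cos(2 * math.pi * 1.0 * n / SAMPLE_RATE_HZ) for n in range(10 * SAMPLE_RATE_HZ)]
ibis = inter_beat_intervals_ms(ppg, SAMPLE_RATE_HZ)
print(ibis)  # eight intervals of 1000.0 ms each
```

Everything downstream (HRV, features, stage labels) inherits whatever errors this step makes, which is one reason a loose ring or a cold finger degrades the whole night's output.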

Heart rate variability during sleep is not random. Deep sleep tends to produce a slow, steady heart rate whose variability is dominated by regular, breathing-linked oscillations driven by the parasympathetic nervous system. The body is in a state of relative autonomic stability. REM sleep produces more irregular variation because your autonomic balance is unstable and your heart rate can surge and dip abruptly. You can see this on a raw HRV plot: deep sleep looks like a slow sine wave, REM looks like jitter. Wakefulness produces the most chaotic signal of all, especially if you are rolling over or adjusting the blanket.

Second, the accelerometer records movement. The ring samples the three axes at somewhere between 25 and 100 Hz, depending on the manufacturer, then applies a threshold to distinguish motion epochs from stillness. Large movements are almost always wake. Small movements can happen during light sleep or REM. The ring uses this as a secondary input, not a primary one, because motion alone is a terrible sleep stage predictor. You can lie completely still while awake (sleep-onset insomnia) and move during REM (many people do).
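A thresholded motion detector of this kind might look like the sketch below. The 30-second epoch, the 0.05 g threshold, and the function name are assumptions chosen for illustration; actual products tune these values and usually filter the signal first.

```python
import math

def motion_epochs(accel_xyz, sample_rate_hz, epoch_seconds=30, threshold_g=0.05):
    """Flag each epoch as motion (True) or still (False).

    An epoch counts as motion when the acceleration magnitude deviates from
    1 g (gravity alone) by more than threshold_g in any sample.
    """
    samples_per_epoch = sample_rate_hz * epoch_seconds
    flags = []
    for start in range(0, len(accel_xyz) - samples_per_epoch + 1, samples_per_epoch):
        epoch = accel_xyz[start:start + samples_per_epoch]
        peak_deviation = max(abs(math.sqrt(x * x + y * y + z * z) - 1.0) for x, y, z in epoch)
        flags.append(peak_deviation > threshold_g)
    return flags

RATE_HZ = 25
still = [(0.0, 0.0, 1.0)] * (RATE_HZ * 30)            # hand at rest, gravity on the z axis
rolling_over = still[:100] + [(0.3, 0.1, 1.1)] * 50 + still[150:]
flags = motion_epochs(still + rolling_over + still, RATE_HZ)
print(flags)  # [False, True, False]
```

Note what the detector cannot tell you: the two still epochs could be deep sleep or someone lying awake with their eyes open. The flag is identical either way.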

Third, the ring extracts features from these raw signals. The typical feature set includes time-domain HRV statistics (mean heart rate, RMSSD, pNN50), frequency-domain power in the low-frequency and high-frequency bands, and movement-derived metrics like the number of motion epochs per minute. Some manufacturers add SpO2 or skin temperature as additional features, though these are more useful for detecting sleep apnea or circadian rhythm shifts than for stage classification.
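The time-domain features named above have standard definitions, and they are short enough to write out. The function name and the sample intervals below are made up for the example; the frequency-domain features are omitted because they need resampling and a spectral estimate.

```python
import math

def hrv_features(ibis_ms):
    """Time-domain HRV features from a list of inter-beat intervals (ms)."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    mean_ibi = sum(ibis_ms) / len(ibis_ms)
    return {
        # Mean heart rate, approximated from the mean interval.
        "mean_hr_bpm": 60000.0 / mean_ibi,
        # RMSSD: root mean square of successive differences.
        "rmssd_ms": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        # pNN50: percentage of successive differences larger than 50 ms.
        "pnn50_pct": 100.0 * sum(1 for d in diffs if abs(d) > 50) / len(diffs),
    }

features = hrv_features([1000, 1060, 1000, 1060, 1000])
print(features)
```

For this toy series every successive difference is 60 ms, so RMSSD comes out at 60 ms and pNN50 at 100 percent. Three numbers like these, plus a motion count, are most of what the classifier gets to look at.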

The feature extraction window is usually one to five minutes. The algorithm looks at the distribution of HRV values in that window, not at individual beats. This is necessary because a single beat can be noisy, but it also means the classifier has no temporal resolution finer than the window size. If you transition from deep sleep to REM over thirty seconds, the ring might not register the change until the next window. The hypnogram looks smooth because it has been smoothed.
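The smoothing effect is easy to demonstrate. In the sketch below, a synthetic series switches abruptly from 1000 ms to 750 ms intervals, and a five-minute window sliding in one-minute steps turns that step into a ramp. The window and step sizes, the function name, and the data are illustrative assumptions.

```python
def windowed_means(ibis_ms, window_s, step_s):
    """Mean inter-beat interval in sliding windows over beat timestamps."""
    # Timestamp each beat by the cumulative sum of the intervals up to it.
    times_s, elapsed = [], 0.0
    for ibi in ibis_ms:
        elapsed += ibi / 1000.0
        times_s.append(elapsed)
    results = []
    start = 0.0
    while start + window_s <= times_s[-1]:
        in_window = [ibi for t, ibi in zip(times_s, ibis_ms) if start < t <= start + window_s]
        results.append((start, sum(in_window) / len(in_window)))
        start += step_s
    return results

# Five minutes at 1000 ms per beat, then an instant switch to 750 ms for five minutes.
ibis = [1000] * 300 + [750] * 400
windows = windowed_means(ibis, window_s=300, step_s=60)
for start, mean_ibi in windows:
    print(f"window starting at {start:5.0f} s: mean IBI {mean_ibi:.1f} ms")
```

The first window reports 1000 ms, the last reports 750 ms, and the four in between report blends of the two. The underlying change took one beat; the windowed view takes five minutes to finish registering it.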

Fourth, the ring feeds these features into a machine learning classifier. The classifier was trained on a dataset where some number of volunteers wore both the ring and full PSG equipment simultaneously. The algorithm learns to map the ring's limited sensor input to the PSG-assigned sleep stage. This is called supervised learning, and the critical thing about it is that the algorithm is only as good as the training data and the overlap between the features it can see and the biology it is trying to infer.
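To make the supervised-learning step concrete, here is a deliberately tiny stand-in: a nearest-centroid classifier over two features. Production models are almost certainly more elaborate, and every number in the training table is invented for the example rather than taken from any dataset or physiological reference.

```python
import math

def train_centroids(examples):
    """Average the feature vectors for each PSG-assigned stage label."""
    sums, counts = {}, {}
    for features, stage in examples:
        acc = sums.setdefault(stage, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[stage] = counts.get(stage, 0) + 1
    return {stage: [v / counts[stage] for v in acc] for stage, acc in sums.items()}

def classify(centroids, features):
    """Return the stage whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda stage: math.dist(centroids[stage], features))

# Invented training rows: (mean heart rate in bpm, RMSSD in ms), PSG label.
training = [
    ((52.0, 60.0), "deep"), ((54.0, 64.0), "deep"),
    ((60.0, 45.0), "light"), ((62.0, 41.0), "light"),
    ((66.0, 30.0), "rem"), ((68.0, 26.0), "rem"),
    ((74.0, 20.0), "wake"), ((78.0, 16.0), "wake"),
]
centroids = train_centroids(training)
print(classify(centroids, (53.0, 61.0)))  # sits among the "deep" examples
print(classify(centroids, (95.0, 90.0)))  # unlike anything in training, still gets a label
```

The second call is the important one. The input resembles nothing the model was trained on, and the model answers anyway, because picking the nearest label is all it knows how to do.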

The ring does not detect deep sleep. It detects a pattern of heart rate and motion that, in the training dataset, correlated with deep sleep. When your physiology diverges from the training population (and it does), the correlation breaks.

[Image: An abstract visualization of brain waves and neural activity, representing the EEG data that smart rings fundamentally cannot access]

The training data problem

Most ring manufacturers do not publish their training datasets. The ones that do reveal a familiar problem: the data comes from healthy adults in controlled sleep lab environments. The average age is often under forty. The sample size is typically in the low hundreds. The sleep is uninterrupted. There is no alcohol, no caffeine after noon, no SSRIs, no sleep apnea, no restless legs syndrome, no third-trimester pregnancy, no newborn in the next room, and no overnight flight from Phoenix to London.

Your sleep is not like the training data. My sleep is not like the training data. When the algorithm encounters a pattern it has never seen, it does not say "I do not know." It says "REM" or "deep" or "light" with the same confidence it gives every other epoch. The output is a label. The classifier's confidence in that label is not shown to the user.

This is a specific problem with the way consumer sleep staging is presented. The ring outputs a discrete stage label for every thirty-second or sixty-second epoch. There is no "uncertain" state. There is no "this pattern does not match any training example." The app renders a clean hypnogram with sharp boundaries between colors, and the user assumes the boundaries are real. They are not. They are the output of a classifier that was forced to choose.
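The difference between a forced choice and an honest one is a few lines of code. The sketch below assumes a classifier that emits one raw score per stage; the scores, the 0.6 threshold, and the function names are hypothetical. A softmax probability is also not a calibrated confidence, so treat this as the shape of the idea rather than a solution.

```python
import math

def softmax(scores):
    """Turn raw classifier scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {stage: math.exp(s - peak) for stage, s in scores.items()}
    total = sum(exps.values())
    return {stage: e / total for stage, e in exps.items()}

def forced_label(probs):
    """What a clean hypnogram shows: the top stage, however weak the margin."""
    return max(probs, key=probs.get)

def label_or_uncertain(probs, min_confidence=0.6):
    """Same decision, but refuses to pick when the top probability is low."""
    best = forced_label(probs)
    return best if probs[best] >= min_confidence else "uncertain"

ambiguous = softmax({"wake": 1.0, "light": 1.1, "deep": 0.2, "rem": 0.9})
clear = softmax({"wake": 0.0, "light": 0.5, "deep": 4.0, "rem": 0.0})

print(forced_label(ambiguous), label_or_uncertain(ambiguous))  # light uncertain
print(forced_label(clear), label_or_uncertain(clear))          # deep deep
```

In the ambiguous epoch, "light" wins with roughly a third of the probability mass, barely ahead of "wake" and "rem". The hypnogram would still paint that epoch a single solid color.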

Why the industry pretends otherwise

Oura calls its output "sleep stages." Whoop calls it "sleep performance." Apple Watch calls it "time in REM" and "time in deep." The language is deliberate. These are not "estimated sleep stages" or "predicted sleep stages." They are presented as direct measurements, as if the ring were a miniaturized sleep lab wrapped around your finger.

The reason is commercial. A ring that measures your sleep stages sounds like a medical device. A ring that guesses your sleep stages based on heart rate patterns sounds like a wellness gadget. The former justifies $349 plus $72 per year. The latter does not.

There is a subtler reason too. The user wants a narrative. "I got two hours of deep sleep last night" is a satisfying sentence. "The algorithm assigned a deep-sleep label to 7.3 percent of last night's epochs based on HRV patterns that resembled the training set" is not. The industry optimizes for the satisfying sentence. The accuracy of the underlying inference is secondary.

I do not think this is malicious. I think it is the default path of least resistance when you are a hardware company with a PPG sensor and a quarterly revenue target. But the default path is not the only path.

What the ring actually gets right

The inference is not useless. It is just bounded. The ring is reasonably good at detecting the coarse boundary between sleep and wake, especially if you are a healthy adult with regular sleep patterns. Total sleep time estimates from actigraphy and PPG tend to agree with PSG within ten to twenty minutes for the average user. The ring is also decent at identifying the overall architecture of the night: the first half tends to have more deep sleep, the second half more REM. These are population-level patterns that the classifier can exploit.

Where it falls apart is in the details. A single epoch of REM misclassified as light sleep does not matter to the user. Thirty minutes of wakefulness misclassified as light sleep because the user was lying still does matter. Deep sleep is the hardest stage to identify correctly because the HRV signature of deep sleep overlaps with the HRV signature of quiet wakefulness in some people. REM is easier in some ways and harder in others: the heart rate variability is distinctive, but the ring cannot see the eye movements or the loss of muscle tone that define the stage, so a stretch of REM with an unremarkable heart rate pattern looks like light sleep from the ring's perspective.

The error is not random. It is systematic. The ring overestimates light sleep and underestimates wake time. This is well documented in the consumer sleep literature. If you have ever woken up, checked your app, and seen that you were supposedly in "light sleep" for the forty minutes you spent staring at the ceiling, you have experienced this bias directly.
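If you had epoch-by-epoch labels from both a ring and a PSG recording of the same night, the bias could be quantified as below. The data here is a made-up one-hour example built to mirror the ceiling-staring scenario, not a result from any study, and the function names are illustrative.

```python
def stage_minutes(labels, epoch_seconds=30):
    """Total minutes spent in each stage, given one label per epoch."""
    minutes = {}
    for label in labels:
        minutes[label] = minutes.get(label, 0.0) + epoch_seconds / 60.0
    return minutes

def epoch_agreement(reference, estimate):
    """Fraction of epochs where the two label sequences agree."""
    matches = sum(1 for r, e in zip(reference, estimate) if r == e)
    return matches / len(reference)

# Made-up hour: PSG scores 40 minutes awake, the ring catches only the first 10.
psg = ["wake"] * 80 + ["light"] * 40
ring = ["wake"] * 20 + ["light"] * 100

psg_minutes = stage_minutes(psg)
ring_minutes = stage_minutes(ring)
wake_bias = ring_minutes["wake"] - psg_minutes["wake"]  # negative means wake is underestimated

print(f"wake bias: {wake_bias:+.0f} min, agreement: {epoch_agreement(psg, ring):.0%}")
# wake bias: -30 min, agreement: 50%
```

A random error would push the wake total up on some nights and down on others. A systematic one, like this, has the same sign every night, which is why averaging over more nights does not fix it.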

[Image: A dark bedroom at night, the environment where smart rings make their guesses while you lie still or toss and turn]

What Pulsyn does differently

Pulsyn runs the same kind of classifier. I am not going to claim we invented new physics. We have a PPG sensor and an accelerometer, and we use them to infer sleep stages the same way every other ring does. The difference is where the inference happens and what we do with the uncertainty.

First, the inference runs on your phone, not in a cloud data center. The PPG data stays on the device. The classifier is a local model. This matters because your sleep data is sensitive, however accurate or inaccurate the stage labels are. If you are going to accept an imperfect measurement, the minimum acceptable tradeoff is that the imperfect measurement stays private. Oura uploads every heartbeat. We do not.

Second, we do not pretend the output is a medical measurement. The Pulsyn app labels the sleep stage graph as "estimated sleep stages." The sleep score weights total sleep time and sleep efficiency more heavily than stage breakdown because total sleep time is a measurement and stage breakdown is an inference. We are building a product that tells you what it knows and what it guesses.

Third, we are exploring a more honest representation of the classifier output. Instead of a hard boundary between REM and light, we could render a probability gradient: "70 percent REM, 30 percent light" for a given epoch. This is technically trivial and UX-awkward, which is why nobody does it. But the awkward truth is closer to reality than the clean hypnogram.
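The "technically trivial" half of that claim is easy to back up. Assuming the classifier already exposes per-stage probabilities for each epoch, turning them into a mixed label is a one-function job. The function name and the 10 percent display cutoff are hypothetical choices for this sketch, not a description of the shipped app.

```python
def describe_epoch(probs, min_share=0.10):
    """Render one epoch's stage probabilities as a mix, largest share first."""
    shares = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    parts = [f"{round(p * 100)} percent {stage}" for stage, p in shares if p >= min_share]
    return ", ".join(parts)

print(describe_epoch({"rem": 0.70, "light": 0.30, "deep": 0.0, "wake": 0.0}))
# 70 percent rem, 30 percent light
```

The hard part is the other half: deciding how to draw eight hours of these mixes on a phone screen without turning the hypnogram into mud.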

Fourth, we are looking at open-sourcing the sleep staging model itself. The weights can stay proprietary, but the architecture and the training methodology should not be. If a user wants to know how we map inter-beat intervals to stage labels, they should be able to read it. The only reason to keep this opaque is competitive paranoia, and I do not think that is a good enough reason.

Fifth, we are deliberately not including a "sleep score" that penalizes low REM or deep sleep. The Pulsyn score is built from metrics the sensor can actually measure: total sleep time, sleep efficiency, resting heart rate relative to your baseline, and HRV trend. We do not subtract points for "insufficient deep sleep" because we do not know how much deep sleep you got. We only know what the classifier guessed.
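To show what a stage-free score can look like, here is a purely hypothetical formula. It is not the Pulsyn score: the weights, the 480-minute target, and the heart rate penalty are invented for the example, and the HRV trend term is left out to keep it short. The only standard piece is sleep efficiency, which is total sleep time divided by time in bed.

```python
def sleep_efficiency(total_sleep_min, time_in_bed_min):
    """Share of time in bed spent asleep, as a percentage."""
    return 100.0 * total_sleep_min / time_in_bed_min

def illustrative_score(total_sleep_min, time_in_bed_min, resting_hr, baseline_hr,
                       target_sleep_min=480.0):
    """Hypothetical 0-100 score that takes no stage labels as input."""
    duration = min(total_sleep_min / target_sleep_min, 1.0)
    efficiency = sleep_efficiency(total_sleep_min, time_in_bed_min) / 100.0
    # Invented penalty: 2 percent per bpm of resting heart rate above baseline, capped at 20 percent.
    hr_penalty = min(max(resting_hr - baseline_hr, 0.0) * 0.02, 0.20)
    return round(100.0 * (0.5 * duration + 0.5 * efficiency) * (1.0 - hr_penalty))

score = illustrative_score(total_sleep_min=420, time_in_bed_min=480,
                           resting_hr=56, baseline_hr=54)
print(score)  # 84
```

The point is the function signature, not the arithmetic. There is no argument for deep sleep minutes or REM minutes, so no guessed quantity can move the number.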

What I do not know yet

I am not sure if the probability-gradient approach will work in the app. Users might find it more confusing than helpful. We might A/B test it after launch. I am also not sure if on-device training (personalizing the classifier to your individual HRV patterns over the first two weeks) will improve accuracy enough to matter. The literature on personalization is mixed: it helps for some users and hurts for others by overfitting to noisy early data. We are testing both.

I am also uncertain about whether we should publish our own validation data. The honest answer is that we do not have a sleep lab, and a validation study against PSG is expensive and slow. I could commission one, but a small-sample study with a friendly population would not tell you much more than the existing literature already does. PPG-based sleep staging is an estimate. The exact percentage of misclassified epochs depends on the population, the night, and the individual. I do not want to wave a white paper around as if it changes the physics.

What the user should actually do with this information

If you are using a smart ring to optimize your sleep, the actionable metrics are the ones the ring measures directly: total sleep time, consistency of bedtime and wake time, resting heart rate trend, and HRV trend. These are physiological signals that the PPG sensor captures accurately. The sleep stage breakdown is a curiosity. It is not a target. You cannot train yourself to get more REM sleep by willpower, and chasing a "deep sleep" number is a fast path to sleep anxiety.

If you suspect you have a sleep disorder, get a sleep study. A ring is not a diagnostic tool. The FDA agrees, which is why every ring manufacturer carefully classifies its product as a "general wellness" device and not a medical device. I covered this distinction in a previous post. The short version is: the ring is a toy, a useful toy, but a toy. The sleep lab is the tool.

The real value of sleep staging in a consumer ring is trend tracking, not absolute accuracy. If your REM estimate goes from 18 percent to 12 percent over three months, that might mean something. If your REM estimate on a single night is 18 percent instead of 22 percent, that means nothing. The signal is in the long-term trajectory, not the nightly precision. Most apps do not emphasize this because a trend graph is less exciting than a color-coded hypnogram.
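One simple way to look at the trajectory instead of the night is a least-squares slope over weekly averages. The sketch below uses made-up numbers matching the 18-to-12 percent example over roughly three months; the function name is illustrative, and a real implementation would also want some estimate of the noise around the line.

```python
def linear_slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

# Made-up weekly averages of the nightly REM estimate (percent), over 13 weeks.
weekly_rem_pct = [18.0, 17.5, 17.0, 16.5, 16.0, 15.5, 15.0,
                  14.5, 14.0, 13.5, 13.0, 12.5, 12.0]
slope = linear_slope(weekly_rem_pct)
print(f"{slope:+.2f} percentage points per week")  # -0.50 percentage points per week
```

A steady half-point weekly decline sustained for a quarter is the kind of signal worth noticing, even if every individual night's estimate is off by several points. The classifier's systematic bias mostly cancels out when you compare the ring against itself.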

[Image: A person awake at night in dim light, representing the insomnia and uncertainty that smart rings often misclassify as light sleep]

The bottom line

A smart ring cannot measure your sleep stages because it cannot measure your brain. It measures your heart and your motion, then guesses. The guess is informed by machine learning and validated against sleep lab data, but it is still a guess. The companies selling these guesses as measurements are not lying about the algorithm. They are lying about the certainty.

Pulsyn guesses too. We just guess on your phone, with your data, and we tell you we are guessing. That is not a feature you can market with a green checkmark. It is the minimum standard for building a health product that respects the user.

About the author

James Hoffmann is the founder of Pulsyn. He has been reverse-engineering BLE health devices and reading sleep physiology papers for two years.

References

AASM Scoring Manual, version 3.0. American Academy of Sleep Medicine, 2023.
Berry, R.B. et al. The AASM Manual for the Scoring of Sleep and Associated Events. American Academy of Sleep Medicine, 2020.
Various ring manufacturers' published patent filings for sleep stage classification (USPTO, publicly available).