Why Stress Scores Are the Most Fabricated Metric in Wearables

TL;DR

Most major wearables give you a stress score, usually between 0 and 100. That number is not a measurement. It is a proprietary blend of heart rate variability, movement, and secret sauce that has no clinical definition, no FDA validation, and no agreed-upon formula across brands. Pulsyn shows you the raw HRV data and the context instead. The score is the lie. The data is the truth.

The stress score is not a medical metric

If you open the Oura app, you see a Stress Resilience score. Whoop's closest headline number is Strain. Garmin has Stress Level. Apple Watch buries it in the Mindfulness app as a vague graph. They all look like objective measurements. They are not.

There is no ICD-10 code for stress score. The American Heart Association does not publish guidelines for what a 73 versus a 41 means. The Mayo Clinic does not use the Garmin Stress Level to diagnose anxiety disorders. These numbers are marketing constructs dressed in medical typography.

The confusion is intentional. Wearable companies know that users want simplicity. A single number feels like clarity. "I am 82% stressed today" sounds scientific in a way that "your RMSSD was 42ms this morning, which is below your 60-day baseline" does not. The problem is that the simplicity is fake. The number compresses several physiological signals into one dimension, throws away the uncertainty, and presents the result with a precision that the sensor does not actually provide.

Here is what actually happens under the hood. The ring or watch measures the interval between your heartbeats. The beat-to-beat variation in those intervals is called heart rate variability, or HRV. A higher HRV generally suggests your autonomic nervous system is flexible and responsive. A lower HRV can indicate fatigue, illness, or psychological stress. But HRV is also affected by caffeine, alcohol, sleep quality, time of day, hydration, menstrual cycle phase, and whether you just climbed stairs to answer the door. The wearable knows some of these factors and ignores the rest. It then feeds the HRV reading into an opaque proprietary model and outputs a number that pretends to account for everything.

The result is a metric that feels personal but is actually a black box. You cannot verify it. You cannot reproduce it. You cannot even compare your Oura stress score to your partner's Whoop Strain because the formulas are trade secrets. That is not science. That is astrology with a Bluetooth chipset.

[Image: A person sitting at a desk with their head in their hands, surrounded by screens. This posture produces a low HRV reading, but the wearable blames it on stress rather than posture or breathing.]

What HRV actually measures (and where every brand diverges)

HRV is real. It is one of the most studied biomarkers in sports science and cardiology. But it is not a stress meter. It is a window into autonomic regulation, and that window is cloudy.

The standard clinical measure is RMSSD: the root mean square of successive differences between normal heartbeats. It is calculated from a time-domain analysis of the R-R intervals. A typical healthy adult at rest might show an RMSSD between 20ms and 70ms depending on age, fitness, and genetics. Elite athletes sometimes hit 100ms. A sedentary smoker in their 60s might sit at 15ms. These ranges are enormous, and that is the first problem with a universal 0-to-100 score.
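For readers who want to see the formula rather than take it on faith, here is a minimal sketch of the standard time-domain RMSSD calculation. The function name and the sample intervals are made up for illustration; the only thing taken from the text above is the definition itself.

```python
import math


def rmssd(rr_ms):
    """Root mean square of successive differences between R-R intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two R-R intervals")
    # Difference between each interval and the one before it.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    # Square, average, then take the square root.
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical R-R intervals in milliseconds, invented for this example.
rr = [812, 845, 790, 860, 805, 838]
print(round(rmssd(rr), 1))
```

Nothing in this calculation knows your age, your fitness, or what a "normal" value is. Any 0-to-100 scale has to add that context from somewhere else.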

Oura uses RMSSD as an input but also layers in continuous heart rate, temperature trend, and sleep data to generate its Stress Resilience metric. Whoop uses a similar HRV baseline but calls the output Strain and weights it heavily by cardiovascular load during exercise. The Garmin Stress Level uses HRV but adds an activity history factor that no one outside Garmin knows the weight of. Apple does not give you a number at all; it shows a nebulous graph in the Watch app that seems to correlate with HRV only when it feels like it.

Each company has a different sample window. Oura samples during sleep. Whoop samples during sleep too, but also during the day if you enable it. Garmin samples continuously when you wear the watch. Apple samples roughly every few hours via background readings. A 5-minute HRV window at 3 AM is not the same as a 2-minute window at 2 PM after lunch. The numbers should not be compared. The apps compare them anyway.

There is also the frequency-domain analysis that almost no consumer wearable exposes. HRV can be decomposed into high-frequency power (0.15 to 0.4 Hz, linked to parasympathetic activity), low-frequency power (0.04 to 0.15 Hz, a mix of sympathetic and parasympathetic), and very-low-frequency power (below 0.04 Hz, poorly understood and often discarded). A genuine stress assessment would look at the LF/HF ratio and how it shifts across the day. None of the consumer apps show you this. They hide it because a ratio is harder to sell than a percentile.
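To make the band definitions concrete, here is a rough sketch of an LF/HF estimate in plain Python. It is not a validated analysis pipeline: the 4 Hz resampling rate, the linear interpolation, the unwindowed DFT, and every function name are my assumptions, and the input is a synthetic R-R series, not real sensor data. Only the band edges come from the paragraph above.

```python
import cmath
import math

LF_BAND = (0.04, 0.15)  # Hz, band edges as given in the text
HF_BAND = (0.15, 0.40)  # Hz


def resample_rr(rr_s, fs=4.0):
    """Linearly interpolate an R-R series (seconds) onto an even time grid."""
    beat_times, t = [], 0.0
    for rr in rr_s:
        t += rr
        beat_times.append(t)
    n = int((beat_times[-1] - beat_times[0]) * fs)
    grid, j = [], 0
    for k in range(n):
        tk = beat_times[0] + k / fs
        while beat_times[j + 1] < tk:
            j += 1
        t0, t1 = beat_times[j], beat_times[j + 1]
        w = (tk - t0) / (t1 - t0)
        grid.append(rr_s[j] + w * (rr_s[j + 1] - rr_s[j]))
    return grid


def band_power(x, fs, lo, hi):
    """Sum of squared DFT magnitudes for bins with lo <= f < hi."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    total, k = 0.0, 1
    while k <= n // 2 and k * fs / n < hi:
        if k * fs / n >= lo:
            coeff = sum(
                v * cmath.exp(-2j * math.pi * k * i / n)
                for i, v in enumerate(centered)
            )
            total += abs(coeff) ** 2
        k += 1
    return total


def lf_hf_ratio(rr_s, fs=4.0):
    x = resample_rr(rr_s, fs)
    return band_power(x, fs, *LF_BAND) / band_power(x, fs, *HF_BAND)


def synthetic_rr(mod_hz, duration_s=120.0):
    """Fake R-R series: 1 s beats modulated by a single sine wave."""
    rr, t = [], 0.0
    while t < duration_s:
        interval = 1.0 + 0.05 * math.sin(2 * math.pi * mod_hz * t)
        rr.append(interval)
        t += interval
    return rr


# Modulation at 0.25 Hz sits in the HF band, 0.10 Hz in the LF band.
print("0.25 Hz modulation, LF/HF:", lf_hf_ratio(synthetic_rr(0.25)))
print("0.10 Hz modulation, LF/HF:", lf_hf_ratio(synthetic_rr(0.10)))
```

The point of the toy input is only that the ratio moves with where the variability sits in the spectrum. Real recordings need artifact handling and longer windows before the number means anything.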

The hardware itself introduces noise. PPG sensors in rings and watches measure blood volume changes optically. Motion artifacts, ambient light, skin tone, and temperature all distort the signal. Oura and Whoop apply proprietary filtering algorithms to clean the data. But cleaning is a polite word for guessing. When the sensor loses lock during a restless night, the algorithm interpolates. When interpolation is wrong, the HRV is wrong, and the stress score is wrong, and you wake up to a number that says you are fine when you are not.
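A small sketch shows how fragile the number is. It assumes a single missed beat, which merges two intervals into one, and a deliberately naive cleaning rule that drops any interval more than 20% away from the last accepted one. The rule, the threshold, and the data are all hypothetical; no vendor's filter is being reproduced here.

```python
import math


def rmssd(rr_ms):
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def drop_outliers(rr_ms, max_jump=0.20):
    """Keep intervals within max_jump of the last accepted one (naive rule).

    Assumes the first interval is trustworthy, which a real filter cannot.
    """
    kept = [rr_ms[0]]
    for rr in rr_ms[1:]:
        if abs(rr - kept[-1]) <= max_jump * kept[-1]:
            kept.append(rr)
    return kept


# Hypothetical clean recording, in milliseconds.
clean = [800, 820, 790, 810, 800, 815, 795, 805]
# One beat goes undetected, so two neighbouring intervals merge into one.
corrupted = clean[:4] + [clean[4] + clean[5]] + clean[6:]

print("clean    :", round(rmssd(clean), 1))
print("corrupted:", round(rmssd(corrupted), 1))
print("filtered :", round(rmssd(drop_outliers(corrupted)), 1))
```

One bad interval out of seven is enough to move RMSSD by more than an order of magnitude. The filtered value lands close to the clean one, but only by stitching together beats that were never adjacent, which is exactly the kind of guess described above.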

[Image: A close-up of a finger wearing a reflective pulse oximeter sensor, showing the exact hardware that generates the noisy optical signal every wearable pretends is clean.]

How the sausage is made: inside the black box

I spent two months reading patent filings, white papers, and SDK documentation to understand how these scores are built. The answer is that every company uses a slightly different recipe, and all of them rely on normalization tricks that make the number feel authoritative while stripping away the context you would need to judge it.

The Oura Stress Resilience, introduced in late 2023, uses a resilience model that compares your current HRV to a rolling 28-day baseline. If your RMSSD is 15% above baseline, you get a high score. If it is 15% below, you get a low score. The catch is that the baseline itself is proprietary. You cannot see it. You cannot adjust it. If you were sick two weeks ago and your HRV crashed, that crash is now baked into your baseline and will skew your scores until it ages out of the window: against a lowered baseline, an ordinary reading looks better than it is. Oura does not tell you this in the app. You have to read the research blog.

I should say that I have not seen the actual Oura source code. My understanding of their baseline model comes from published research blog posts and patent filings. I could be off on the exact window length or the percentage thresholds. But the opacity itself is the point. You should not have to reverse-engineer your own health data.
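To show what a rolling baseline does with an illness dip, here is a minimal sketch. It is not Oura's code. The 28-day window is the figure quoted above, which may be off; using a plain mean as the baseline and the function names are my assumptions, and the daily values are invented.

```python
from statistics import mean


def rolling_baseline(daily_rmssd, window=28):
    """Mean of the `window` days before the most recent reading."""
    return mean(daily_rmssd[-(window + 1):-1])


def percent_vs_baseline(daily_rmssd, window=28):
    """How far today's reading sits from the rolling baseline, in percent."""
    base = rolling_baseline(daily_rmssd, window)
    return 100.0 * (daily_rmssd[-1] - base) / base


# Hypothetical histories ending in the same 50 ms reading today.
steady = [50.0] * 28 + [50.0]
after_illness = [50.0] * 14 + [25.0] * 7 + [50.0] * 7 + [50.0]

print(rolling_baseline(steady), percent_vs_baseline(steady))
print(rolling_baseline(after_illness), percent_vs_baseline(after_illness))
```

The same 50 ms reading is at baseline in one history and roughly 14% away from it in the other. Any score built on that deviation moves even though nothing about today changed, and the user has no way to see why.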

The Whoop Strain is more honest about one thing: it is explicitly a cardiovascular load metric, not a psychological stress score. But the marketing blurs the line. The app shows Strain as a daily number and recommends recovery based on it. The problem is that cardiovascular load and psychological stress are not the same thing. A day of back-to-back meetings can spike cortisol and crash HRV while producing almost no cardiovascular load. Whoop will tell you that day was low strain and you should train hard. Your body disagrees.

The Garmin Stress Level is the most aggressive normalizer. It maps your HRV to a 0-to-100 scale in real time, using what appears to be a reference distribution that Garmin built from its user base. The documentation says the scale is relative to the individual, but it does not explain how an individual range is estimated or how often it recalibrates. That matters because HRV declines with age. A user who joins Garmin at age 25 and stays for ten years may see their stress creep upward, not because they are more stressed, but because the reference they are scored against has shifted underneath them.
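The effect of the reference distribution is easy to demonstrate. The sketch below is not Garmin's algorithm, which is not public. It assumes the simplest possible normalizer, an inverted percentile rank, and two invented reference sets of RMSSD values.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percent of reference values at or below `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def stress_like_score(rmssd_ms, reference):
    """Invert the rank so that lower HRV gives a higher 'stress' number."""
    return 100.0 - percentile_rank(rmssd_ms, reference)


# Hypothetical reference distributions of RMSSD in milliseconds.
younger_reference = [30, 40, 45, 50, 55, 60, 70, 80, 90, 100]
older_reference = [15, 20, 25, 30, 35, 40, 45, 50, 60, 70]

reading = 42  # the same physiological reading in both cases
print(stress_like_score(reading, younger_reference))  # 80.0
print(stress_like_score(reading, older_reference))    # 40.0
```

One reading, two scores, forty points apart. Without knowing which reference you are being ranked against, the number on the screen cannot be interpreted.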

Apple is the outlier in that it does not assign a single score. It shows HRV trends in the Health app and lets third-party apps do the scoring. This is technically more open but practically worse, because the third-party apps are even less vetted than the first-party ones. There are App Store apps that will sell you a stress score based on a single HRV sample and a $4.99 subscription. The science does not get better when the price goes down.

What ties all of these approaches together is a single design choice: the user gets a number, not the data. The number is easy to screenshot and share on Instagram. The data is messy and requires interpretation. The number drives retention. The data drives understanding, and understanding does not require a subscription.

[Image: An abstract visualization of data nodes and algorithmic pathways, representing the opaque proprietary models that convert noisy sensor data into a clean-looking stress score.]

Why a single number cannot capture stress

Stress is not one thing. Hans Selye, who described the general adaptation syndrome, distinguished between eustress (positive challenge) and distress (harmful overload). A deadlift PR and a panic attack both spike cortisol and sympathetic activation. They feel nothing alike. A wearable that treats them as the same stress event is not measuring your state. It is measuring a proxy of a proxy and calling it truth.

The clinical gold standard for stress assessment is not a ring. It is a combination of salivary cortisol testing, blood pressure monitoring, heart rate variability in a controlled 5-minute paced breathing protocol, and self-reported scales like the PSS-10. None of this fits on a wrist. None of it costs $5.99 per month. But it is actually valid.

Consumer wearables try to approximate this by using HRV as a cortisol proxy. The correlation exists in large population studies, but it is weak at the individual level. A 2019 meta-analysis in Psychoneuroendocrinology found that same-day HRV correlated with cortisol at roughly r = 0.24. That means HRV explains about 6% of the variance in cortisol. The other 94% is everything else: what you ate, who you talked to, whether you drank coffee, whether you have PTSD, whether you are pregnant. No wearable knows these things. So it guesses, and it calls the guess a score.
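The 6% figure is just the correlation squared. A two-line check, using the r value quoted above and a helper name of my own choosing:

```python
def variance_explained(r):
    """Share of variance in one variable accounted for by a linear fit on the other."""
    return r ** 2


explained = variance_explained(0.24)
print(f"{explained:.1%} explained, {1 - explained:.1%} unexplained")
# prints "5.8% explained, 94.2% unexplained"
```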

The most honest thing a wearable could do is show you the raw HRV trend, flag deviations from your personal baseline, and explain the known confounders. That would be useful. It would also be harder to market. "Your RMSSD dropped 20ms today; possible causes include poor sleep, alcohol, caffeine, illness, or stress" is accurate but unsexy. "Your stress score is 87" is sexy and wrong.

There is also the problem of algorithmic circularity. If a user believes their stress score, they may change behavior to optimize it. They might skip the gym because Whoop says their Strain is already high. They might drink less because Oura docked their score after a beer. The score becomes a self-fulfilling prophecy that shapes behavior around a number that has no clinical meaning. This is not health optimization. It is gamification with physiological stakes.

I know this sounds abstract, so here is a concrete example. A friend of mine stopped lifting weights for three weeks because his Whoop Strain score was consistently high. His sleep was fine, his HRV was stable, but the app kept telling him to prioritize recovery. In reality, he was deconditioning. His cardiovascular fitness dropped, which made his next workouts harder, which raised his Strain, which told him to rest more. The loop was self-reinforcing and medically counterproductive. When he finally ignored the score and resumed training, his metrics normalized within ten days. The score was not protecting him. It was trapping him.

What Pulsyn does instead

Pulsyn does not give you a stress score. We show you HRV. We show you resting heart rate, skin temperature trend, and sleep stage distribution. We show you how each metric deviates from your 30-day baseline. We tell you when the deviation is large enough to matter. We do not collapse it all into a single number because that collapse throws away the information you actually need.

The HRV calculation uses the standard time-domain RMSSD formula on clean R-R intervals from the optical sensor. We do not apply secret normalization curves. We do not compare you to a population you have never met. Your baseline is yours, calculated from your own data, and you can see it. If you had the flu last month and your HRV tanked, you can exclude that period from the baseline calculation. Oura does not let you do this. We do.
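As an illustration of the idea, and not a copy of the shipped code, here is a minimal sketch of a personal baseline that can skip a user-excluded period. The function name, the parameters, the use of a plain mean, and the readings are all assumptions made for this example; only the 30-day window and the exclusion feature come from the text.

```python
from datetime import date, timedelta
from statistics import mean


def personal_baseline(readings, today, window_days=30, excluded=()):
    """Mean RMSSD over the window before `today`, skipping excluded ranges.

    `readings` maps a date to that day's RMSSD in ms.
    `excluded` is a sequence of (first_day, last_day) pairs, inclusive.
    """
    start = today - timedelta(days=window_days)
    kept = [
        value
        for day, value in readings.items()
        if start <= day < today
        and not any(lo <= day <= hi for lo, hi in excluded)
    ]
    if not kept:
        raise ValueError("no readings left in the baseline window")
    return mean(kept)


# Hypothetical month: 50 ms most days, 25 ms during a week of flu.
readings = {
    date(2024, 3, d): (25.0 if 10 <= d <= 16 else 50.0) for d in range(1, 31)
}
today = date(2024, 3, 31)
flu_week = [(date(2024, 3, 10), date(2024, 3, 16))]

print(round(personal_baseline(readings, today), 2))
print(personal_baseline(readings, today, excluded=flu_week))
```

With the flu week included, the baseline sits near 44 ms. With it excluded, it returns to 50 ms. The difference is visible and under the user's control, which is the whole point.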

The temperature trend is included because HRV is not the whole story. A rising skin temperature combined with a falling HRV is a classic prodrome for viral infection. A falling temperature with stable HRV might just mean you turned down the thermostat. Separating these signals matters. Lumping them into one score erases the distinction.

For users who want deeper analysis, the optional cloud tier runs pattern detection on your longitudinal data. It does not generate a stress score. It flags correlations: "Your HRV drops by an average of 12ms on days after you record alcohol in the app." "Your resting heart rate rises 3 bpm in the luteal phase of your cycle." These are hypotheses, not diagnoses. They invite curiosity instead of replacing it with a number.
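A flag like the alcohol example can be as simple as a difference in means. The sketch below is one hypothetical way to compute it, not a description of the cloud tier's actual method. Days are plain integer indices, and the function name and the data are invented.

```python
from statistics import mean


def next_day_shift(hrv_by_day, tagged_days):
    """Mean HRV on days following a tagged day, minus the mean on all other days."""
    following = {d + 1 for d in tagged_days if d + 1 in hrv_by_day}
    after = [hrv_by_day[d] for d in following]
    others = [v for d, v in hrv_by_day.items() if d not in following]
    if not after or not others:
        raise ValueError("need readings both after tagged days and on other days")
    return mean(after) - mean(others)


# Hypothetical week of RMSSD readings (ms); alcohol logged on days 0 and 3.
hrv_by_day = {0: 52, 1: 40, 2: 50, 3: 51, 4: 38, 5: 49, 6: 50}
alcohol_days = [0, 3]

shift = next_day_shift(hrv_by_day, alcohol_days)
print(f"HRV after tagged days differs by {shift:+.1f} ms on average")
```

A difference in means over a handful of days says nothing about cause, and it ignores every other confounder. That is why it is presented as a hypothesis to investigate, not as a verdict.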

We also publish the algorithms. The exact RMSSD calculation, the baseline computation, the outlier detection thresholds. You can read them. You can audit them. You can disagree with them and switch to a third-party app that reads the same SQLCipher-encrypted database on your phone. The data is yours. The math is open. The interpretation is yours to make.

We are building Pulsyn in public for exactly this reason. The algorithms live in our open documentation, not in a vault. If you find a bug in the RMSSD calculation, file an issue. If you want to fork the baseline logic for your own research, the code is there. Health data should be a public utility, not a private asset class.

That is harder than giving you a score. It requires more of you. But health data that demands your attention is better than health data that demands your subscription.

About the author

James Hoffmann is the founder of Pulsyn. He has been reverse-engineering BLE health devices for two years and believes most wearables are better at retention than they are at physiology.

References

Selye, H. (1956). The Stress of Life. McGraw-Hill.
Shaffer, F., and Ginsberg, J. P. (2017). An Overview of Heart Rate Variability Metrics and Norms. Frontiers in Public Health, 5, 258.
Jarczok, M. N., et al. (2019). Heart Rate Variability and Cortisol. Psychoneuroendocrinology, 109, 104444.
Oura. (2023). Stress Resilience: A New Metric. Oura Blog.
Whoop. (2024). Understanding Strain. Whoop Support Center.
Garmin. (2024). Stress Level Technical Specification. Garmin Developer Documentation.