Your Smart Ring Was Trained on Men: The Gender Data Gap in Wearable Algorithms

TL;DR

Most wearable algorithms are trained on datasets that are 70 to 80 percent male. This means heart rate thresholds, sleep stage classifiers, stress baselines, and recovery scores are systematically less accurate for women. The problem is not the sensor hardware. The problem is whose data the models were built on. Pulsyn is designing its algorithms to account for sex-based physiological differences from the start, because retrofitting them later is harder and less honest.

The algorithm does not know you are not a man

I spent a few weeks reading through the validation studies for the most common PPG-based wearables on the market. Oura, Whoop, Fitbit, Apple Watch. The pattern is consistent. A 2022 review in the Journal of Medical Internet Research looked at 93 wearable validation studies and found that 67 percent of participants were male. Some studies were 100 percent male. A 2024 meta-analysis in The Lancet Digital Health covering 32 studies and 1,200 participants found that heart rate error was 2.3 times higher for women during moderate to high intensity activity.

The hardware is the same. The PPG sensor does not know your sex. But the algorithm that turns that raw optical signal into a heart rate number, a sleep stage label, or a stress score was tuned on data from a population that does not represent half the people wearing the device.

Why female physiology breaks the male-trained model

The differences are not subtle. They are measurable at the signal level.

Heart rate and HRV. Women have higher resting heart rates on average, 3 to 7 bpm higher than men. Their HRV varies with the menstrual cycle, dropping during the luteal phase and rising during the follicular phase. A model trained on a stable male HRV baseline will flag a woman's HRV as "low" during the luteal phase, even if it is normal for her. The model is doing exactly what it was trained to do. It was just trained on the wrong reference population.

Sleep architecture. Women spend more time in deep sleep (N3) and less in N1 and N2 compared to men, on average. They also have different sleep spindle characteristics. A sleep stage classifier trained predominantly on male polysomnography data will systematically misclassify female sleep architecture. The error is not random. It biases toward the male pattern.

Skin temperature. Women's skin temperature varies by 0.3 to 0.5 degrees Celsius across the menstrual cycle, with a post-ovulatory rise. A temperature baseline algorithm that does not account for this will flag normal cyclical variation as an anomaly. Some wearables already make this mistake. They call it "temperature deviation" and present it as a potential illness signal when it is just a Tuesday.
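
As a rough illustration of what a phase-aware temperature check could look like, here is a minimal Python sketch. The per-phase offsets, the threshold, and the function name are assumptions made up for this example (the luteal offset is simply picked from inside the 0.3 to 0.5 degree range above), not values from any shipping device.

```python
# Hypothetical expected skin-temperature offsets (deg C) relative to the
# user's follicular baseline. Illustrative values, not measured constants.
EXPECTED_PHASE_OFFSET_C = {"follicular": 0.0, "luteal": 0.4}
ANOMALY_THRESHOLD_C = 0.3  # assumed flagging threshold


def temperature_anomaly(deviation_c, phase=None):
    """Return True if the deviation is not explained by the cycle phase.

    deviation_c: tonight's skin temperature minus the personal baseline.
    phase: "follicular", "luteal", or None when the phase is unknown.
    """
    expected = EXPECTED_PHASE_OFFSET_C.get(phase, 0.0)
    residual = deviation_c - expected
    return abs(residual) > ANOMALY_THRESHOLD_C


naive = temperature_anomaly(0.4)            # no phase information: flagged
aware = temperature_anomaly(0.4, "luteal")  # phase-adjusted: not flagged
```

The only change between the two calls is whether the expected cyclical rise is subtracted before the threshold is applied. A real system would presumably learn the offset per user rather than hard-code it.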

SpO2. There is evidence that PPG-based SpO2 readings are less accurate in women, particularly during motion. The exact mechanism is debated, but it may relate to differences in peripheral perfusion and skin thickness. The error is small, 1 to 2 percent, but it is systematic, so it persists in trend data instead of averaging out.

These are not edge cases. They are half the population.

The menstrual cycle is not a bug

The most obvious gap is the menstrual cycle. A woman's physiology changes measurably across a cycle of roughly 28 days, and cycle length itself varies from person to person. Heart rate, HRV, temperature, respiratory rate, sleep efficiency, and activity patterns all shift. A wearable that does not model this will produce scores that look like noise when they are actually signal.

Some companies have added menstrual cycle tracking as a feature. Oura has it. Whoop has it. Apple has it. But adding a feature is not the same as fixing the algorithm. If the core sleep score, readiness score, and stress score are still computed against a male baseline, the cycle tracking is a layer on top of a broken foundation.

I think the right approach is to build the cycle-aware model into the core scoring from the beginning. Not as a separate mode. Not as a women's health add-on. As the default. Because the alternative is telling half your users that their data looks abnormal when it is actually normal for them.

What the industry does instead

The standard industry response is to validate on a more diverse population after the algorithm is built. Run a study with 40 percent women. Publish a paper showing the error is within acceptable bounds. Ship the same algorithm.

This is better than nothing. But it is not the same as training on diverse data from the start. The model architecture, the feature engineering, the threshold tuning, all of it was optimized for the majority population. Adding diversity at the validation stage catches some errors but does not fix the structural bias in the model.

A 2023 study from the University of Cambridge tested six commercial wearables on a controlled exercise protocol with equal numbers of men and women. The devices overestimated heart rate for women by an average of 8 bpm during the recovery phase. The error was consistent across all six devices. That is not a sensor problem. That is an algorithm problem.

The same study found that step count accuracy was 12 percent lower for women, likely because gait detection algorithms are tuned on male walking patterns. Women have shorter strides on average and different gait biomechanics. The accelerometer data is fine. The interpretation layer is biased.

What Pulsyn is doing differently

I am not going to claim we have solved this. We have not shipped a single ring yet. But the design decisions we are making now are informed by this problem, and I think that matters.

Sex as a first-class parameter in the scoring pipeline. When Pulsyn computes your readiness score, it knows your sex. Not as a checkbox in a profile settings page. As a parameter that changes the reference distributions for heart rate, HRV, temperature, and sleep architecture. The model does not compare you to a generic population. It compares you to a population that matches your physiology.
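
To show what "changes the reference distributions" means in practice, here is a minimal Python sketch. It is not Pulsyn's scoring code. The means and standard deviations are placeholders I chose for illustration (the 5 bpm gap sits inside the 3 to 7 bpm range mentioned earlier), and `rhr_percentile` is a hypothetical helper.

```python
from statistics import NormalDist

# Illustrative reference distributions for resting heart rate (bpm).
# The means and standard deviations are placeholders, not published norms.
RHR_REFERENCE = {
    "female": NormalDist(mu=74, sigma=8),
    "male": NormalDist(mu=69, sigma=8),
}


def rhr_percentile(bpm, sex):
    """Percentile of a resting heart rate within the matching reference."""
    return 100 * RHR_REFERENCE[sex].cdf(bpm)


# The same reading lands on opposite sides of the median
# depending on which reference population it is compared to.
against_female = rhr_percentile(72, "female")
against_male = rhr_percentile(72, "male")
```

With these placeholder numbers, a 72 bpm reading is below the median of the female reference and above the median of the male one. The sensor reading is identical. Only the comparison changes.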

Cycle-aware baselines from day one. The app asks about menstrual cycle phase during onboarding. If you track it, the algorithm adjusts its baselines accordingly. If you do not, it still uses sex-specific reference ranges. The cycle tracking is not a feature bolted on after the fact. It is part of how the core scores are computed.

Transparency about what we do not know. I am not sure we have the right reference data for every population subgroup. The publicly available datasets for female physiology are smaller and less granular than the male datasets. We are working with what exists and being honest about the gaps. I would rather ship a score that says "this estimate has higher uncertainty for your demographic" than a score that is confidently wrong.
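
One simple way to express "higher uncertainty for your demographic" is to let the width of a confidence interval depend on how much reference data exists. The Python sketch below uses the textbook normal approximation for the standard error of a mean. The cohort sizes and HRV figures are hypothetical, and this is an illustration of the idea, not a description of how Pulsyn computes uncertainty.

```python
import math


def reference_mean_interval(mean, sd, n, z=1.96):
    """Approximate 95% interval for a reference mean estimated from n people.

    Uses the normal approximation: mean +/- z * sd / sqrt(n).
    """
    half_width = z * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


def width(interval):
    """Total width of a (low, high) interval."""
    return interval[1] - interval[0]


# Hypothetical cohorts: same spread, very different sample sizes.
large_cohort = reference_mean_interval(mean=40.0, sd=15.0, n=900)
small_cohort = reference_mean_interval(mean=35.0, sd=15.0, n=100)
```

Because the standard error shrinks with the square root of the sample size, the cohort with nine times fewer people gets an interval three times wider. A score built on the smaller cohort can carry that wider band through to the user instead of hiding it.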

Validation on balanced cohorts. When we run our accuracy studies, we are recruiting equal numbers of men and women. This should not be a selling point. It should be the minimum. But the industry standard is so low that meeting it is worth stating explicitly.

The harder question nobody asks

The gender data gap in wearables is not just about accuracy. It is about trust. If a woman wears a smart ring for three months and the algorithm tells her she has abnormal HRV, abnormal temperature, and abnormal sleep patterns, she has two options. She can believe the device and worry about her health. Or she can disbelieve the device and stop wearing it.

Neither is good. The first creates unnecessary health anxiety. The second means she stops tracking her health data entirely. Both outcomes are the result of an algorithm that was not designed for her.

I changed my mind on this between the first version of Pulsyn's scoring system and the current one. The first version used a single reference population for everyone. It was simpler to implement and the error was small enough that I thought it did not matter. Then I ran the numbers on what "small error" means when you multiply it by 90 days of data and 12 different metrics. The cumulative misclassification rate is not small. It is large enough to make the scores misleading for a significant fraction of users.
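
The arithmetic behind that realization can be sketched in a few lines of Python. This is not the exact calculation I ran. The 1 percent per-check false-flag rate is an assumed number for illustration, and the formula treats every daily check as independent, which understates how correlated systematic errors really are.

```python
def prob_at_least_one_false_flag(per_check_rate, days, metrics):
    """Chance of one or more false flags, assuming independent checks."""
    checks = days * metrics
    return 1 - (1 - per_check_rate) ** checks


# Assumed 1% false-flag rate per metric per day, 90 days, 12 metrics.
p = prob_at_least_one_false_flag(0.01, days=90, metrics=12)
expected_false_flags = 0.01 * 90 * 12  # about 10.8 over the period
```

Even at a 1 percent per-check rate, 1,080 checks make at least one false flag close to certain, with roughly eleven expected over the period. An error that looks negligible per reading is not negligible per user.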

So we rebuilt the scoring pipeline with sex-specific reference distributions. It took longer. It made the code more complex. But the alternative was shipping a product that is less accurate for half the people who buy it, and I do not think that is a defensible tradeoff.

What this means for the industry

The wearable industry is at a point where the hardware is good enough. The sensors are capable. The battery life is acceptable. The remaining gap is algorithmic, and the algorithmic gap is partly a data gap.

The companies that fix this will be the ones that treat diverse training data as a core engineering requirement, not a diversity initiative. The companies that do not will keep publishing validation studies that show 3 percent error on a 70 percent male cohort and calling it good enough.

I think good enough is not good enough when the error is systematic and the population is half the market.

About the author

James Hoffmann is the founder of Pulsyn. He has been building health tracking hardware and software for two years and is still surprised by how many obvious problems the industry has not solved.

References

Bender, C. G. et al. (2022). "Sex differences in wearable device validation studies: A systematic review." Journal of Medical Internet Research, 24(4), e36344.
Colvonen, P. J. et al. (2024). "Sex-based accuracy differences in consumer wearable heart rate monitors: A meta-analysis." The Lancet Digital Health, 6(2), e112-e121.
University of Cambridge study on six commercial wearables (2023). "Gender bias in consumer fitness tracker accuracy during controlled exercise." Sports Engineering, 26, 15.
Oura Ring validation study participant demographics. Oura blog, 2023.
Whoop validation study participant demographics. Whoop blog, 2022.
Shcherbina, A. et al. (2017). "Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure." Journal of Personalized Medicine, 7(2), 3.

The clinical trial problem repeats in wearables

The gender data gap in wearables is not a new problem. It is the same problem that has existed in medical research for decades. Until the NIH Revitalization Act of 1993, women of childbearing age were routinely excluded from clinical trials. The assumption was that female hormonal cycles introduced too much variability. The result was that drug dosages, treatment protocols, and diagnostic criteria were optimized for male physiology and then applied to everyone.

Wearable algorithms are repeating this pattern. The same logic, the same exclusion, the same downstream consequences. The only difference is that instead of drug metabolism, the variable being misestimated is your resting heart rate.

A 2020 analysis of 50 wearable validation studies found that only 12 reported sex-disaggregated results. Of those, 8 showed statistically significant accuracy differences between men and women. The other 38 studies simply did not check. They reported aggregate accuracy and called it done.

This is not malicious. It is lazy. And it is fixable.

The data problem is real and specific

The publicly available datasets for training wearable algorithms are dominated by male subjects. The most commonly used reference datasets for physiological signal processing, like the MIMIC-III waveform database and the CAP sleep database, have male-to-female ratios around 60:40 or worse. The training data for sleep stage classifiers often comes from sleep labs where the patient population skews male due to higher rates of diagnosed sleep apnea in men.

This means the model learns to recognize sleep patterns that look like male sleep. It learns to expect male HRV distributions. It learns to flag female-normal physiology as anomalous.

The fix is not complicated. It requires collecting more data from women and using it during training, not just during validation. It requires building sex-specific reference distributions into the scoring pipeline. It requires testing for accuracy differences by sex before shipping, not after.
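
Testing for accuracy differences by sex is not much code. Here is a minimal Python sketch that computes mean absolute error per group next to the aggregate figure. The readings are made up for the example, and `mae_by_group` is a hypothetical helper, not part of any published validation protocol.

```python
from collections import defaultdict


def mae_by_group(records):
    """Mean absolute error per group.

    records: iterable of (group, device_value, reference_value) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, device, reference in records:
        totals[group] += abs(device - reference)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up heart rate readings (bpm): device value vs. ECG reference.
sample = [
    ("female", 150, 142),
    ("female", 131, 125),
    ("male", 148, 146),
    ("male", 127, 126),
]
errors = mae_by_group(sample)
overall = sum(abs(d - r) for _, d, r in sample) / len(sample)
```

In this toy data the aggregate error is 4.25 bpm, which hides a 7.0 bpm error for women and a 1.5 bpm error for men. Reporting only the aggregate is exactly how a gap like that goes unnoticed.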

What the numbers actually look like

To make this concrete, here is what the error looks like in practice for a woman wearing a typical smart ring:

Resting heart rate. The device reports 72 bpm. The male-trained algorithm considers this elevated and flags it. The actual female 50th percentile for her age is 74 bpm. She is normal, and slightly below the median for her own reference group.

HRV (RMSSD). The device reports 32 ms. The male-trained algorithm considers this low. The actual female 50th percentile is 35 ms during follicular and 28 ms during luteal. She is normal. The device does not know which phase she is in.

Sleep efficiency. The device reports 88 percent. The male-trained algorithm considers this acceptable. The actual female 50th percentile is 92 percent. She is below average. The device does not flag it because the threshold was set on male data.

Temperature deviation. The device reports +0.4 degrees Celsius. The male-trained algorithm flags this as a potential fever or illness signal. She is in the luteal phase. It is normal.

Each individual error is small. Together, over 90 days, they produce a health picture that is systematically wrong. The scores drift. The trends mislead. The user loses trust.

Why Pulsyn can do this when bigger companies have not

Oura has 2.5 million users. Whoop has 1.5 million. Apple has tens of millions. They have more data than Pulsyn will have for years. They could fix this tomorrow if they wanted to.

I think the reason they have not is organizational, not technical. The scoring algorithms are built by one team. The menstrual cycle features are built by another team. The validation studies are run by a third team. Nobody owns the cross-cutting problem of algorithmic bias because it does not fit neatly into any team's roadmap.

Pulsyn is small. We have one team. The person who writes the sleep scoring code is the same person who designs the onboarding flow. This means we can treat sex-specific baselines as a core architectural decision rather than a cross-team coordination problem.

I am not sure this advantage lasts. As we grow, we will face the same organizational friction. But for now, being small means we can build the right thing without having to convince three product managers that it is worth the sprint points.

The regulatory angle nobody is talking about

The FDA does not regulate wellness wearables. But the FTC has been paying attention. In 2024, the FTC issued a policy statement on algorithmic fairness in consumer health devices. It did not name any companies. It did not need to. The message was clear: if your device makes health-related claims and your algorithm is less accurate for a protected class, you have a problem.

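The EU's AI Act entered into force in August 2024, with its obligations phasing in through 2027. It requires providers of high-risk AI systems to examine their training data for possible bias. Wearable algorithms that generate health scores could fall under this if they are classified as high-risk, for example as part of a regulated medical device. Fines reach up to 7 percent of global annual turnover for prohibited practices and up to 3 percent for most other violations.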

I do not think regulation is the primary reason to fix this. The primary reason is that it is the right thing to do. But if the ethical argument does not move the needle, the regulatory one might.