John Hernandez, PhD, MPP

John Hernandez, PhD, MPP

Director, Head of Clinical Research Center of Excellence and Health Impact team
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
A personal health large language model for sleep and fitness coaching
Anastasiya Belyaeva
Zhun Yang
Nick Furlotte
Chace Lee
Erik Schenck
Yojan Patel
Jian Cui
Logan Schneider
Robby Bryant
Ryan Gomes
Allen Jiang
Roy Lee
Javier Perez
Jamie Rogers
Cathy Speed
Shyam Tailor
Megan Walker
Jeffrey Yu
Tim Althoff
Conor Heneghan
Mark Malhotra
Leor Stern
Shwetak Patel
Shravya Shetty
Jiening Zhan
Daniel McDuff
Nature Medicine (2025)
Preview abstract Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to effectively contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research. View details
Smartphone use in a large US adult population: Temporal associations between objective measures of usage and mental well-being
Ari Winbush
Benjamin Nelson
Nicholas Allen
Andrew Barakat
Daniel McDuff
Conor Heneghan
Allen Jiang
PNAS (2025)
Preview abstract Smartphones are a vital tool for most people. They facilitate many everyday tasks and as a result they have become ubiquitous and indispensable. There are concerns about how the use of these devices may impact mental health and wellbeing. Yet, there are few studies that have reported objective data about phone usage from large and diverse cohorts and studies have found low correlations between subjective and objective smartphone use. In order to better elucidate these complex interactions, it is important to understand and characterize what resembles “normative” smartphone use behavior. In this paper, we present normative patterns of objectively measured phone usage from a large prospective observational study. We analyze a quarter of a million days of phone usage data from 10,099 adult subjects that provides objective longitudinal data over a four week period in the US general population. Contrary to popular belief, our model shows little support for the conclusion that smartphone use predicts mood the following week or that mood predicts smartphone use the following week, with some results differing depending on whether the effects are within-person or between-person. Lastly, while some findings are statistically significant, the effect sizes of these results are minimal, suggesting little to no impact in real-world settings and therefore a lack of clinical significance. View details
The Anatomy of a Personal Health Agent
Ahmed Metwally
Ken Gu
Jiening Zhan
Kumar Ayush
Hong Yu
Akshay Paruchuri
Amy Lee
Qian He
Zhihan Zhang
Isaac Galatzer-Levy
Xavi Prieto
Andrew Barakat
Ben Graef
Yuzhe Yang
Daniel McDuff
Brent Winslow
Shwetak Patel
Girish Narayanswamy
Conor Heneghan
Max Xu
Jacqueline Shreibati
Mark Malhotra
Orson Xu
Tim Althoff
Tony Faranesh
Nova Hammerquist
Vidya Srinivas
arXiv (2025)
Preview abstract Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the solution to fulfill diverse needs from individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health assistant that is able to reason about multimodal data from everyday consumer devices and personal health records. To understand end users’ needs when interacting with such an assistant, we conducted an in-depth analysis of query data from users, alongside qualitative insights from users and experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist subagent: (1) a data science agent that analyzes both personal and population-level time-series wearable and health record data to provide numerical health insights, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights based on medical and contextual user knowledge, and (3) a health coach agent that synthesizes data insights, drives multi-turn user interactions and interactive goal setting, guiding users using a specified psychological strategy and tracking users’ progress. Furthermore, we propose and develop a multi-agent framework, Personal Health Insight Agent Team (PHIAT), that enables dynamic, personalized interactions to address individual health needs. To evaluate these individual agents and the multi-agent system, we develop a set of N benchmark tasks and conduct both automated and human evaluations, involving 100’s of hours of evaluation from health experts, and 100’s of hours of evaluation from end-users. Our work establishes a strong foundation towards the vision of a personal health assistant accessible to everyone in the future and represents the most comprehensive evaluation of a consumer AI health agent to date. View details
Capturing Real-World Habitual Sleep Patterns with a Novel User-centric Algorithm to Pre-Process Fitbit Data in the All of Us Research Program: Retrospective observational longitudinal study
Hiral Master
Jeffrey Annis
Karla Gleichauf
Lide Han
Peyton Coleman
Kelsie Full
Neil Zheng
Doug Ruderfer
Logan Schneider
Evan Brittain
Journal of Medical Internet Research (2025)
Preview abstract Background: Commercial wearables such as Fitbit quantify sleep metrics using fixed calendar times as default measurement periods, which may not adequately account for individual variations in sleep patterns. To address this limitation, experts in sleep medicine and wearable technology developed a user-centric algorithm designed to more accurately reflect actual sleep behaviors and improve the validity of wearable-derived sleep metrics. Objective: This study aims to describe the development of a new user-centric algorithm, compare its performance with the default calendar-relative algorithm, and provide a practical guide for analyzing All of Us Fitbit sleep data on a cloud-based platform. Methods: The default and user-centric algorithms were implemented to preprocess and compute sleep metrics related to schedule, duration, and disturbances using high-resolution Fitbit sleep data from 8563 participants (median age 58.1 years, 6002/8341, 71.96%, female) in the All of Us Research Program (version 7 Controlled Tier). Variations in typical sleep patterns were calculated by examining the differences in the mean number of primary sleep logs classified by each algorithm. Linear mixed-effects models were used to compare differences in sleep metrics across quartiles of variation in typical sleep patterns. Results: Out of 8,452,630 total sleep logs collected over a median of 4.2 years of Fitbit monitoring, 401,777 (4.75%) nonprimary sleep logs identified by the default algorithm were reclassified as primary sleep by the user-centric algorithm. Variation in typical sleep patterns ranged from –0.08 to 1. Among participants with the greatest variation in typical sleep patterns, the user-centric algorithm identified significantly more total sleep time (by 17.6 minutes; P<.001), more wake after sleep onset (by 13.9 minutes; P<.001), and lower sleep efficiency (by 2.0%; P<.001), on average. Differences in sleep stage metrics between the 2 algorithms were modest. Conclusions: The user-centric algorithm captures the natural variability in sleep schedules, providing an alternative approach to preprocess and evaluate sleep metrics related to schedule, duration, and disturbances. A publicly available R package facilitates the implementation of this algorithm for clinical and translational research. View details
Research Protocol for the Google Health Digital Wellbeing Study
Ari Winbush
Nicholas Allen
Andrew Barakat
Felicia Cordeiro
Daniel McDuff
Allen Jiang
Ryann Crowley
JMIR Research Protocols (2024)
Preview abstract The impact of digital device use on health and wellbeing is a pressing question to which individuals, families, schools, policy makers, legislators, and digital designers are all demanding answers. However, the scientific literature on this topic to date is marred by small and/or unrepresentative samples, poor measurement of core constructs (e.g., device use, smartphone addiction), and a limited ability to address the psychological and behavioral mechanisms that may underlie the relationships between device use and wellbeing. A number of recent authoritative reviews have made urgent calls for future research projects to address these limitations. The critical role of research is to identify which patterns of use are associated with benefits versus risks, and who is more vulnerable to harmful versus beneficial outcomes, so that we can pursue evidence-based product design, education, and regulation aimed at maximizing benefits and minimizing risks of smartphones and other digital devices. We describe a protocol for a Digital Wellbeing Study (DWB) to help answer these questions. View details
Preview abstract This Op-ed is by leaders from the American Heart Association, Digital Medicine Society and Google involved in a Digital Medicine Society-sponsored project on digital measures for physical activity. The Op-ed summarizes evidence that the technology exists today to digitally measure physical activity in the broad population – and, by measuring it the right way, we can embrace it as the ‘6th vital sign’ and enter a new era of healthcare centered on proactive patient care. View details
Evaluating PEARL - a Personalized Exercise Assistant using Reinforcement Learning (Walkmate Study)
Hulya Emir-Farinas
Martin Seneviratne
Amy Lee
Jim Taylor
Sriram Lakshminarasimhan
Emily Rosenzweig
OSF Registries (2024)
Preview abstract Study protocol for PEARL Study: To evaluate the impact of two personalized nudging strategies delivered as pop-up notifications via the Fitbit app on user step count. Specifically, to personalize the following parameters of the pop-up notification system: message content, and timing (hr of the day) View details
Evidence of Differences in Diurnal Electrodermal Patterns by Mental Health Status in Free-Living Data
Daniel McDuff
Isaac Galatzer-Levy
Andrew Barakat
Conor Heneghan
Samy Abdel-Ghaffar
Jake Sunshine
Lindsey Sunden
Allen Jiang
Ari Winbush
Benjamin Nelson
Nicholas Allen
medRxiv (2024)
Preview abstract Electrodermal activity (EDA) is a standardized measure of sympathetic arousal that has been linked to depression in laboratory experiments. However, the inability to measure EDA passively over time and in the real-world has limited conclusions that can be drawn about EDA as an indicator of mental health status outside of a controlled setting. Recent smartwatches have begun to incorporate wrist-worn continuous EDA sensors that enable longitudinal measurement in every-day life. This work presents the first example of passively collected, diurnal variations in EDA present in people with depression, anxiety and perceived stress. Subjects who were depressed had higher tonic EDA and heart rate, despite not engaging in greater physical activity, compared to those that were not depressed. EDA measurements showed differences between groups that were most prominent during the early morning. We did not observe amplitude or phase differences in the diurnal patterns. View details
Sleep patterns and risk of chronic disease as measured by long-term monitoring with commercial wearable devices in the All of Us Research Program
Neil S. Zheng
Jeffrey Annis
Hiral Master
Lide Han
Karla Gleichauf
Melody Nasser
Peyton Coleman
Stacy Desine
Douglas M. Ruderfer
Logan D. Schneider
Evan L. Brittain
Nature Medicine (2024)
Preview abstract Poor sleep health is associated with increased all-cause mortality and incidence of many chronic conditions. Previous studies have relied on cross-sectional and self-reported survey data or polysomnograms, which have limitations with respect to data granularity, sample size and longitudinal information. Here, using objectively measured, longitudinal sleep data from commercial wearable devices linked to electronic health record data from the All of Us Research Program, we show that sleep patterns, including sleep stages, duration and regularity, are associated with chronic disease incidence. Of the 6,785 participants included in this study, 71% were female, 84% self-identified as white and 71% had a college degree; the median age was 50.2 years (interquartile range = 35.7, 61.5) and the median sleep monitoring period was 4.5 years (2.5, 6.5). We found that rapid eye movement sleep and deep sleep were inversely associated with the odds of incident atrial fibrillation and that increased sleep irregularity was associated with increased odds of incident obesity, hyperlipidemia, hypertension, major depressive disorder and generalized anxiety disorder. Moreover, J-shaped associations were observed between average daily sleep duration and hypertension, major depressive disorder and generalized anxiety disorder. These findings show that sleep stages, duration and regularity are all important factors associated with chronic disease development and may inform evidence-based recommendations on healthy sleeping habits. View details
Towards a Personal Health Large Language Model
Anastasiya Belyaeva
Nick Furlotte
Zhun Yang
Chace Lee
Erik Schenck
Yojan Patel
Jian Cui
Logan Schneider
Robby Bryant
Ryan Gomes
Allen Jiang
Roy Lee
Javier Perez
Jamie Rogers
Cathy Speed
Shyam Tailor
Megan Walker
Jeffrey Yu
Tim Althoff
Conor Heneghan
Mark Malhotra
Leor Stern
Shwetak Patel
Shravya Shetty
Jiening Zhan
Yeswanth Subramanian
Daniel McDuff
arXiv (2024)
Preview abstract Large language models (LLMs) can retrieve, reason over, and make inferences about a wide range of information. In health, most LLM efforts to date have focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into clinical tasks, provide a rich, continuous, and longitudinal source of data relevant for personal health monitoring. Here we present a new model, Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data for applications in sleep and fitness. To systematically evaluate PH-LLM, we created and curated three novel benchmark datasets that test 1) production of personalized insights and recommendations from measured sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep quality outcomes. For the insights and recommendations tasks we created 857 case studies in sleep and fitness. These case studies, designed in collaboration with domain experts, represent real-world scenarios and highlight the model’s capabilities in understanding and coaching. Through comprehensive human and automatic evaluation of domain-specific rubrics, we observed that both Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. To further assess expert domain knowledge, we evaluated PH-LLM performance on multiple choice question examinations in sleep medicine and fitness. PH-LLM achieved 79% on sleep (N=629 questions) and 88% on fitness (N=99 questions), both of which exceed average scores from a sample of human experts as well as benchmarks for receiving continuing credit in those domains. To enable PH-LLM to predict self-reported assessments of sleep quality, we trained the model to predict self-reported sleep disruption and sleep impairment outcomes from textual and multimodal encoding representations of wearable sensor data. We demonstrate that multimodal encoding is both necessary and sufficient to match performance of a suite of discriminative models to predict these outcomes. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM. View details
×