Claire Cui

Claire is currently a Google Fellow on the Google Brain team, leading a team of researchers who push the forefront of machine learning research and apply it to high-impact problems that improve people's lives.

During her tenure at Google, Claire co-initiated and launched many successful and sustainable projects. She was a founding engineer for Google’s AdSense for Content product, one of the first Google products to use machine learning technologies. She later helped co-found Google Health Research and Medical Brain to work on machine learning for improving people's health.

Claire’s current research interests center on Deep Generalist Learning, including:
- Large Scale General Language Model (e.g., GLaM)
- Self-Supervised Multimodal Learning
- Universal Representation Learning
- Model Uncertainty and Interpretability

Claire has a PhD in Computer Science from Stanford University and a Bachelor of Science degree in Computer Science from Tsinghua University. She has two daughters, 16 and 12 years old. They share a love of playing volleyball.

Authored Publications
MaMMUT: A Simple Vision-Encoder Text-Decoder Architecture for MultiModal Tasks
Xiyang Luo
Wei Li
Abhijit Ogale
Luowei Zhou
Andrew Dai
Zhifeng Chen
Transactions on Machine Learning Research (2023)
The development of language models has moved from encoder-decoder to decoder-only designs. In addition, common knowledge has it that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in one architecture, and further need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
Sparsely Activated Language Models are Efficient In-Context Learners
Andrew Dai
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kathy Meier-Hellstern
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yanping Huang
Yanqi Zhou
Yonghui Wu
Yuanzhong Xu
Zhifeng Chen
Zongwei Zhou
(2022)
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of the energy consumption needed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM gets 75.0% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over 68.0% obtained by GPT-3.
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Noam Shazeer
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yanping Huang
Yanqi Zhou
Yuanzhong Xu
Zhifeng Chen
arXiv (2022)
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
Mind's Eye: Grounded Language Model Reasoning through Simulation
Ruibo Liu
Jason Wei
Shixiang Shane Gu
Soroush Vosoughi
Denny Zhou
Andrew Dai
ICLR 2023 (2022)
Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world; their failure to relate language to the physical world leads to misrepresented knowledge and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind’s MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
In recent years, various deep neural network (DNN) models have led to stellar performance in various domains. However, ML practitioners and researchers have observed severe reproducibility issues with DNN models. That is, a set of DNN models trained on the same data with exactly the same architecture may lead to quite different predictions. A common remedy is to use the ensemble method to quantify the prediction variations and improve model reproducibility. However, the ensemble method makes multiple predictions given an input, and is computationally expensive, especially when serving web-scale traffic at inference time. In this paper, we seek to advance our understanding of prediction variation. We demonstrate that we are able to use neuron activation strength to infer prediction variation. Through empirical experiments on two widely used benchmark datasets, MovieLens and Criteo, we observed that prediction variations do come from various different sources of randomness, including training data shuffling, and model and embedding parameter random initialization. By adding more randomness sources into model training, we noticed that the ensemble method tends to produce more accurate predictions with higher prediction variations. Last but not least, we demonstrate that neuron activation strength has strong predictive power to infer the ensemble prediction variation. Our approach provides a cheap and simple way to estimate prediction variation, which sets up the foundation and opens up new opportunities for future work in many interesting areas (e.g., model-based reinforcement learning and active learning) without having to rely on expensive ensemble models.
Predicting inpatient medication orders from electronic health record data
Kathryn Rough
Andrew M. Dai
Kun Zhang
Emily Xue
Atul J. Butte
Alvin Rajkomar
Clinical Pharmacology and Therapeutics (2020)
In a general inpatient population, we predicted patient-specific medication orders based on structured information in the electronic health record (EHR). Data on over three million medication orders from an academic medical center were used to train two machine-learning models: a deep learning sequence model and a logistic regression model. Both were compared with a baseline that ranked the most frequently ordered medications based on a patient’s discharge hospital service and amount of time since admission. Models were trained to predict from 990 possible medications at the time of order entry. Fifty-five percent of medications ordered by physicians were ranked in the sequence model’s top-10 predictions (logistic model: 49%) and 75% ranked in the top-25 (logistic model: 69%). Ninety-three percent of the sequence model’s top-10 prediction sets contained at least one medication that physicians ordered within the next day. These findings demonstrate that medication orders can be predicted from information present in the EHR.
Deep State-Space Generative Model For Correlated Time-to-Event Predictions
Yuan Xue
Denny Zhou
Nan Du
Andrew Mingbo Dai
Zhen Xu
Kun Zhang
ACM KDD 2020
Capturing the inter-dependencies among multiple types of clinically-critical events is critical not only to accurate future event prediction, but also to better treatment planning. In this work, we propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events (e.g., kidney failure, mortality) by explicitly modeling the temporal dynamics of patients' latent states. Based on these learned patient states, we further develop a new general discrete-time formulation of the hazard rate function to estimate the survival distribution of patients with significantly improved accuracy. Extensive evaluations over real EMR data show that our proposed model compares favorably to various state-of-the-art baselines. Furthermore, our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
Introduction: Auto-charting -- the creation of structured sections of clinical notes generated directly from a patient-doctor encounter -- holds promise to lift documentation burden from physicians. However, clinicians exercise professional judgement in what and how to document, and it is unknown if a machine learning (ML) model could assist with these tasks. Objective: Build an ML model to extract symptoms and status (i.e. experienced, not-experienced, not relevant for note) from transcripts of patient-doctor encounters and assess performance on common symptoms and conversations in which a human interpreterscribe is not used. Methods: We generated an ML model to auto-generate a review of systems (ROS) from transcripts of 90,000 de-identified medical encounters. 2950 transcripts were labeled by medical scribes to identify 171 common symptoms. Model accuracy was stratified by how clearly a symptom was mentioned in conversation for 800 snippets, which was assessed by a formal rating system termed conversational clarity. The model was also qualitatively assessed in a variety of conversational motifs. Results: Overall, the model had a sensitivity of 0.71 of matching the exact symptom labeled by a human with a positive predictive value of 0.69. Model sensitivity was associated with the clarity of a conversation (p<0.0001). 39.5% (316/800) of snippets of common symptoms contained symptoms mentioned with high clarity, and in this group, the sensitivity of the model was 0.91. The model was robust to a variety of conversational motifs (e.g. detecting symptoms mentioned in colloquial ways). Conclusions: Auto-generating a review of systems is feasible across a wide range of symptoms that are commonly discussed in doctor-patient encounters.
Scalable and accurate deep learning for electronic health records
Alvin Rishi Rajkomar
Eyal Oren
Andrew Dai
Nissan Hajaj
Mila Hardt
Peter J. Liu
Xiaobing Liu
Jake Marcus
Patrik Per Sundberg
Kun Zhang
Yi Zhang
Gerardo Flores
Gavin Duggan
Jamie Irvine
Kurt Litsch
Alex Mossin
Justin Jesada Tansuwan
De Wang
Dana Ludwig
Samuel Volchenboum
Kat Chou
Michael Pearson
Srinivasan Madabushi
Nigam Shah
Atul Butte
npj Digital Medicine (2018)
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart.