OpenAI, the AI research and deployment company responsible for ChatGPT, has launched HealthBench, a new benchmark for evaluating the capabilities of health AI systems, built in partnership with 262 physicians practising across 60 countries.
The company shares findings that large language models (LLMs) “have improved significantly over time and already outperform experts in writing responses to examples tested in our benchmark”.
HealthBench is designed to be user-friendly, covering a wide range of scenarios to help researchers measure the benefits of AI systems across the health sector and beyond. It contains 5,000 “realistic health conversations, each with a custom physician-created rubric to grade model responses”, and aims to ensure that evaluations of AI systems capture the complex, real-life scenarios and workflows in which individuals and clinicians interact with models, whilst supporting trustworthiness and offering “substantial room for improvement”.
The benchmark is said to test how well AI models perform in realistic health scenarios, with the 5,000 conversations simulating interactions between models and users. “The task for the model,” according to OpenAI, “is to provide the best possible response to the user’s last message.” Responses are then graded against a set of physician-written rubric criteria specific to that conversation, covering areas such as facts that must be included or technical jargon to avoid. “Model responses are evaluated by a model-based grader (GPT‑4.1) to assess whether each rubric criterion is met, and responses receive an overall score based on the total score of criteria met compared to the maximum possible score.”
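To make the scoring mechanism concrete, the following is a minimal sketch in Python of how rubric-based grading of this kind could be computed. The criterion structure, point values and the stand-in grader callable are illustrative assumptions, not OpenAI’s published implementation; in HealthBench itself the judgement of whether a criterion is met is made by GPT‑4.1.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricCriterion:
    """One physician-written criterion for a specific conversation (illustrative structure)."""
    description: str   # e.g. "Recommends seeking urgent medical attention"
    points: int        # weight of this criterion in the overall score

def grade_response(
    response: str,
    rubric: list[RubricCriterion],
    criterion_met: Callable[[str, RubricCriterion], bool],
) -> float:
    """Score a response as (points for criteria met) / (maximum possible points).

    `criterion_met` stands in for the model-based grader (GPT-4.1 in HealthBench);
    here it is any callable that judges whether the response satisfies a criterion.
    """
    earned = sum(c.points for c in rubric if criterion_met(response, c))
    max_possible = sum(c.points for c in rubric)
    return earned / max_possible if max_possible else 0.0

# Example usage with a trivial keyword check standing in for the model-based grader:
rubric = [
    RubricCriterion("Recommends seeking urgent medical attention", 5),
    RubricCriterion("Avoids unexplained technical jargon", 2),
]
score = grade_response(
    "Please call emergency services now; chest pain like this needs urgent assessment.",
    rubric,
    lambda resp, crit: "urgent" in resp.lower(),
)
print(f"Rubric score: {score:.2f}")  # 1.00 here, since the toy grader judges both criteria met
```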
Examples are offered on OpenAI’s website, spanning emergency referrals, health data tasks, global health, context seeking, and more. Scores are also broken down along axes, taking into account elements of communication quality such as response length and level of detail, as well as instruction following, accuracy, context awareness, and completeness.
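Building on the sketch above, a per-axis breakdown could be derived by tagging each criterion with the axis it belongs to and aggregating separately. Again, the tuple format and aggregation below are illustrative assumptions about how such a breakdown might be computed, not OpenAI’s published code.

```python
from collections import defaultdict

def scores_by_axis(results: list[tuple[str, int, bool]]) -> dict[str, float]:
    """Aggregate (axis, points, met) grading results into a score per axis.

    The axis label would be one of the HealthBench dimensions, such as
    "accuracy", "completeness", "communication quality", "context awareness"
    or "instruction following".
    """
    earned: dict[str, int] = defaultdict(int)
    possible: dict[str, int] = defaultdict(int)
    for axis, points, met in results:
        possible[axis] += points
        if met:
            earned[axis] += points
    return {axis: earned[axis] / possible[axis] for axis in possible}

print(scores_by_axis([
    ("accuracy", 5, True),
    ("communication quality", 2, False),
    ("communication quality", 3, True),
]))  # {'accuracy': 1.0, 'communication quality': 0.6}
```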
The full evaluation and underlying data have been made openly available in OpenAI’s GitHub repository, with feedback and input from the wider community encouraged.
The safe use of AI in health and care
Late last year, the European Commission announced that the AI Act, a legal framework seeking to address the risks of AI in Europe by setting out clear requirements and obligations in support of “trustworthy AI”, had officially entered into force. Designed to tackle risks posed by AI systems that could lead to “undesirable outcomes”, the Act includes provisions to help mitigate risks created by AI applications; to ban AI practices that pose “unacceptable risks”; to set out clear obligations for both deployers and providers of AI applications deemed to be high risk; and to put in place enforcement and governance structures at both European and national level.
The UK Government published its AI Opportunities Action Plan, accepting recommendations for expanding computing capacity, establishing AI growth zones, and unlocking and sharing data assets, alongside a proposed delivery timeline.
Earlier this year, an HTN Now webinar, supported by Restore Information Management, focused on the practicalities of AI technologies, exploring topics including implementation, adoption, the role of data, policy, regulation, evaluation, and best practices. With the help of our expert panellists, we also took a closer look at examples of AI in health and care. Panellists included Neill Crump, digital strategy director at The Dudley Group NHS Foundation Trust; Lee Rickles, CIO, director and deputy SIRO at Humber Teaching NHS Foundation Trust; and Beatrix Fletcher, senior programme manager (AI) at Guy’s and St Thomas’ NHS Foundation Trust.