
Your Next Doctor is a Chatbot? Language Models, Google Researchers, & MedPaLM-2

Full Transcript for EP 114

Harry Glorikian: Hello. I’m Harry Glorikian, and this is The Harry Glorikian Show, where we explore how technology is changing everything we know about healthcare. 

I’ve been saying for years that it’s time to stop thinking of AI as a fantasy that’s far removed from our everyday lives.  

And since November of 2022, when OpenAI released ChatGPT and put the power of large language models into the hands of everyday consumers, we’ve been seeing how true that is. 

Large language models and the applications and interfaces being built around them are already changing the business of search. 

But it goes way beyond that. 

This new form of AI is also changing the way software companies write code and the way biotech companies screen for new drug candidates. 

And now they’re about to change the practice of medicine.  

My guests today, Vivek Natarajan and Shek Azizi, are both AI researchers on the Health AI team at Google. 

I wanted to have them on the show because Google, as much as any other company, is pushing the boundaries of what large language models can achieve in specialized domains like  health and medicine.  

This spring Google announced it would start rolling out a new large language model called Med-PaLM 2 that’s designed to answer medical questions with high accuracy. 

How accurate?  

Well, when Google gave Med-PaLM 2 the U.S. Medical Licensing Exam, which is the test all doctors have to take before they’re allowed to practice, it scored 85 percent, which is in the “expert” range. 

To create Med-PaLM 2, Google started with a general large language model called PaLM, then fine-tuned it using data and feedback from actual clinicians.  

But the interesting thing, according to Vivek, is that Google didn’t have to feed a lot of extra training data into Med-PaLM 2 to make it smart about medicine.  

Its ability to answer medical questions came almost “for free,” in his words, because PaLM itself was trained on such a large body of data from the Internet. 

Vivek says, quote: “When you start accumulating data trawled from all these different sources, it invariably contains the information you need.” Unquote.  

What PaLM, ChatGPT, Bard, Bing, and all the other new large language models have in common is that they’re all  trained on hundreds of billions and possibly trillions of data points.  

Which gives them the ability to scan and remix almost the entirety of human knowledge.  

In medicine in particular, it seems clear that consulting with AI is going to become an indispensable part of every medical journey — whether you’re a patient searching for information about your symptoms, or a doctor looking for an expert second opinion, or a drug developer looking to prevent the next pandemic. 

That’s a trend I could already see coming three years ago when I wrote my book The Future You.  

But we’re still just starting to feel its impact.  

And now that it’s almost here, the world Vivek and Shek are helping to create at Google feels both exciting and a little bit scary. 

But without further ado, let’s dive into the interview.  

Harry Glorikian: Shek, Vivek, it’s great to have you both on the show. I’ve been really looking forward to this discussion and sort of diving deep into this area, so it’s great to have you guys here.  

Vivek Natarajan: Glad to be here, Harry. 

Shek Azizi: Yeah. Great to be here.  

Harry Glorikian: So before we jump into the meat and talk about MedPaLM 2, for the non-technical people in the audience, the people who don’t follow the news about AI or read about it every 20 minutes like I do: can you guys start out with a brief description of the underlying large language models, PaLM and PaLM 2?

Vivek Natarajan: The key idea behind some of these large language models, I like to think about it in three ways. The concept of language models that can predict the probability distribution of sequences has been around for a long period of time. I think it might even go back to the 1970s, and we’ve had different iterations and different techniques for how to build out these language models, and different kinds of applications, ranging from speech recognition to machine translation systems. I would say the predominant idea, even as recently as 2015, 2016, was these n-gram language models, where you basically have a very large corpus of text, sometimes even trillions of tokens, and then you go through those sequences, measure counts, and use those counts to estimate the probabilities of sequences. That was an idea that worked reasonably well when you had short context lengths. But as soon as you wanted to model text over longer sequences, it started to fail and break. Then around 2017, as we know, there was this big breakthrough with transformers, and transformers have been shown to work across a bunch of different applications since then. 
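[Editor’s note: for readers who want to see the count-based approach Vivek describes, here is a minimal Python sketch of a bigram model, the simplest n-gram model, with a one-word context. The corpus and the resulting probabilities are toy values invented for illustration.]

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count adjacent word pairs, then turn counts into probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # P(next | prev) = count(prev, next) / count(prev, anything)
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

model = train_bigram_model([
    "the patient has a headache",
    "the patient has a fever",
])
print(model["a"])  # {'headache': 0.5, 'fever': 0.5}
```

Longer contexts require exponentially more counts, which is exactly the failure mode Vivek mentions.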

Vivek Natarajan: And the key ideas were first demonstrated in machine translation, but then language modeling emerged as the breakthrough application. There are different variants of transformers that I’m not going to go through, but the big one, which came with the advent of GPT, was the decoder-only transformer model. The GPT-3 paper basically showed three different ideas. One was using a decoder-only transformer model. The second was training this transformer model on a very giant corpus of text, Internet-scale text. And the objective itself is very simple at a high level: you download a bunch of data from the Internet and train the model to predict the next word in the sequence. But if you look under the hood, to do this really well at Internet scale, the model not only needs to understand syntax but also semantics, and it has to develop a basic understanding of the different topics or subjects you might encounter on the Internet. That could be physics, chemistry, biology, and in our case, it’s medicine. So that’s the second idea: decoder-only transformer models trained on Internet-scale data. 
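[Editor’s note: a minimal PyTorch sketch of the next-word-prediction objective Vivek describes. Here `model` is a stand-in for any decoder-only transformer that maps a token sequence to next-token logits; this illustrates the training objective only, not Google’s code.]

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Language-modeling objective: predict token t+1 from tokens up to t."""
    inputs = token_ids[:, :-1]   # all tokens except the last
    targets = token_ids[:, 1:]   # the same sequence shifted left by one
    logits = model(inputs)       # shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```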

Vivek Natarajan: And then the third thing is alignment strategies. That wasn’t explicitly part of the GPT-3 paper, but since then, people have shown with a few different models, InstructGPT, ChatGPT, etcetera, that you can not only train these models to do well on language modeling tasks, but you can also control their behavior and ensure that their outputs are safe and aligned and fun. And I think that has been the key breakthrough, because there have been previous iterations of language models. Some people will remember Microsoft Tay, which was released on Twitter back in 2016 but went crazy within a day or two and had to be taken down. ChatGPT has been around for months, GPT-4 has been around for a couple of months, and Bard has also been around. And while these models do make errors, people relate to them better. They’re not going crazy, and more often than not people are just delighted. So I think these three breakthroughs, the transformer, decoder-only transformers that allow training on Internet-scale data, and alignment strategies, are what have led to models like GPT-4 and PaLM and PaLM 2. 
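[Editor’s note: the interview doesn’t detail how alignment works, but InstructGPT-style alignment typically begins by training a reward model on pairs of responses ranked by human raters. Below is a generic sketch of that pairwise loss, assuming a hypothetical `reward_model` that scores a token sequence; it is not any lab’s actual implementation.]

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, preferred_ids, rejected_ids):
    """Bradley-Terry pairwise loss: push the human-preferred response to
    score higher than the rejected one. Step one of RLHF-style alignment."""
    r_preferred = reward_model(preferred_ids)  # one scalar score per sequence
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

The trained reward model is then used to steer the language model itself, for example with reinforcement learning.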

Harry Glorikian: So around the time of Google I/O, you guys released a video, which, Vivek, I think you were in, where you stated that medical question answering has been a research grand challenge for several decades. And I think that phrase, medical question answering, has a very specific meaning in the worlds of AI and medical informatics. So I’m wondering if you could unpack that a little for me. What forms has medical question answering technology taken over the decades? It was interesting because when I first heard it, I realized I wasn’t very familiar with that particular term. 

Shek Azizi: To put it simply, it’s question answering as we deal with it on a daily basis, but for medical purposes. And it can cover different types of questions that have medical context. For example, when you go to your doctor, you raise some concern, and you frame that concern in the form of a question. You can also ask that question of a search engine and go from there. So, for example, you can say: I have a headache. What medicine do I need to take to cure my headache, or do I need to see a doctor for this type of headache? These are the sorts of medical question answering that we are dealing with. But one form that has been a grand challenge for a long time is U.S. Medical Licensing Exam questions. Those are questions for professionals: when doctors want to get licensed, they go through multiple rounds of exams. 

Shek Azizi: So the USMLE exam has been a challenge for a long time, and the reason is that answering these questions demands a lot of reasoning from language models: retrieving different kinds of information, piecing them together to reach a conclusion, and providing an answer that is grounded and valid and not biased. That challenge has been around for multiple decades, around three decades, and the performance of most previous methods capped out at around 50%. With the introduction of the MedPaLM model, we for the first time passed the exam’s passing mark, which is 60% correct answers. And that was a really groundbreaking moment for us. We saw that, okay, we can retrieve information in a safe fashion and answer these questions, providing answers that are not biased and that are grounded.  

Harry Glorikian: I want to step back to understand how and why you guys at Google decided, hey, let’s make a specialized version, a medical version, of PaLM. I’m assuming that somebody had a hypothesis that PaLM might actually be good at medical question answering, especially if you tune a version to specifically do that, and I’m assuming that MedPaLM 2, which you guys released, I believe, last week, if I remember the timing correctly, is a result of that experiment. So why did you do it, or was that always in the plan?

Vivek Natarajan: Google has always been at the forefront of large language models. Transformers were invented at Google, and we’ve since had various versions, BERT and PaLM and everything. And looking at their performance on various natural language understanding benchmarks, we know that at scale, with a large enough number of parameters, these models not only start encoding the knowledge used to train them but also develop reasoning capabilities. So there is the substrate, if I may use that word, of an intelligent agent in there. You can argue whether that is similar to a human or not. But in terms of specialization: we humans have basic learning capabilities, learning and adapting to new situations and environments, right? But if we want to do something like medicine, which is a very specialized topic, you need to go to medical school, you need to understand medical texts. You need to understand everything from molecular biology and medicinal chemistry all the way up to clinical outcomes, population health, health equity and everything. And that requires specialized training that people do for ten years. So you can’t expect a model, even a very intelligent model trained on the whole Internet, to do this out of the box.

Vivek Natarajan: It could be intelligent about medicine. It could understand certain topics, like I do, even though I have no formal training in medicine. But for me to really go and practice medicine, I would need that specialized training. And so that is why we wanted to do this. The overall project emerged as a Google Brain moonshot, which is basically an umbrella at Google for people who are keen about a particular topic and decide that a thing should exist in the world. They just come together, bottom up, propose it and work on it. That’s how this project came about. The tagline for the project was sending these foundation models and LLMs to medical school. The idea was to train these models on data you would typically see in medical school, medical books, literature and everything, allow them to absorb that knowledge, learn from expert demonstrations, and then maybe put them in real-world settings where they can actually learn from real-world interactions. So that’s the specialization step, and we believe you need that step. Even if you have a very powerful general-purpose large language model, if you don’t have this expert specialization step, we don’t think those models are ready for use in the medical domain yet. 

Vivek Natarajan: And in the MedPaLM paper we showed that. We took the out-of-the-box model, which was plain PaLM, and that was state of the art on many of these natural language processing benchmarks. But as soon as we started doing more rigorous evaluation with clinicians, and even with laypeople, big gaps emerged between the plain PaLM model’s responses and the responses from expert physicians. But we also showed that if you do alignment, and we have some very data- and compute-efficient techniques to do that, which we showcased in the paper, you can very quickly start bridging this gap. So that’s what we showed in the MedPaLM paper: that specialization step, that fine-tuning step, that alignment step is extremely critical, and you don’t necessarily see its value when you are evaluating on these exam-style benchmarks. The value becomes more apparent when you are doing rigorous human evaluation that tries to mimic real-world clinical applications and workflows. And since then, in the MedPaLM 2 paper, which is going to come out very soon, we are taking that to the next level, doing even more rigorous studies.

Harry Glorikian: Yeah, I’ll be looking forward to that next paper, because I’m trying to keep up and I feel like there are not enough hours in the day. I don’t know if I should just give up sleep to keep up with the rate of change. You know, it’s funny, you’re using words like reasoning and so forth, and I’m always trying to pull myself back from applying words meant for a human to a machine, essentially. So I’m going to use this word: how “smart” is MedPaLM 2? I mean, can you share some benchmarks on how it performed? I know it got an 85% score, which puts it at the expert level. But I guess the question is, on its own, what does that level of performance mean? Does being able to score well on a licensing exam mean that a model can actually give safe and useful medical advice in a real clinical situation?

Shek Azizi: That’s a really important question, and I think this question, and the concern we have around it, was the most valuable part of the MedPaLM paper and the work that we are doing. We believe that this number on its own, just reported accuracy, is not enough. The reason is that there are multiple other human values we want to consider before making a decision and putting this model out there for decision making or even user interaction. These models can be biased, and they can hallucinate, so they can produce answers that are harmful. One of the things we consider in the paper, and in MedPaLM 2, is a rigorous set of axes of performance measurement for human evaluation. These consider the harmfulness of answers and whether there is any bias in the answer; we measure the precision of the answer and whether the answer provided by the model is aligned with scientific consensus. With these measurements, we have a way to make sure the answers are actually aligned with the values we want, and we can guarantee some level of safety for users to play with these models and use them. It’s natural that when we push the accuracy number, precision improves, but we should really make sure that alignment is there and safety is also there.
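[Editor’s note: to make the evaluation axes Shek lists concrete, here is a hypothetical rating record a human evaluator might fill out for each answer. The field names are illustrative, not the actual rubric from the MedPaLM papers.]

```python
from dataclasses import dataclass

@dataclass
class AnswerRating:
    """One human rating of one model answer, along axes like those Shek
    describes. Illustrative schema only."""
    answer_id: str
    possible_harm: str          # e.g., "none", "minor", "severe"
    shows_bias: bool            # biased for or against any patient group?
    matches_consensus: bool     # aligned with current scientific consensus?
    precise_and_complete: bool  # accurate, with no important omissions?

def acceptable(r: AnswerRating) -> bool:
    # An illustrative gate: an answer passes only if every axis is clean.
    return (r.possible_harm == "none" and not r.shows_bias
            and r.matches_consensus and r.precise_and_complete)
```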

Harry Glorikian: So you’ve got MedPaLM and you’ve got MedPaLM 2. What is the difference? Maybe in terms of quantities: is it the number of parameters it was trained on, or is it just its performance on the medical exam? I’m sure you guys have made multiple changes to it or trained it further. How did you make the model better, or, I’m going to use that word again, smarter?

Vivek Natarajan: I’d say MedPaLM 2 is a combination of three things. The underlying base LLM in MedPaLM was the plain PaLM model, and we switched that over to PaLM 2, which has been shown to be significantly stronger on a lot of these natural language benchmarks. So we decided to build on top of that. Then, in the MedPaLM paper, because we wanted an alignment technique that was compute- and data-efficient, we did not fine-tune the entire model; not all the weights of the model were updated, because that model was big, 540 billion parameters, so it would take a lot of time. Instead we came up with a technique known as instruction prompt tuning, where you have a small set of additional parameters, which we call soft prompt parameters, and you use expert demonstrations from clinicians we had access to, to learn those additional prompt parameters using standard gradient descent methods. What that does, essentially, is not impart any net new knowledge to the model. The knowledge that is already encoded in the weights of this giant 540-billion-parameter model stays there. What it essentially says is: oh, now you are in this medical domain, and there is a lot of knowledge encoded in your weights, but you don’t necessarily have to use all of it.
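[Editor’s note: a generic PyTorch sketch of the soft-prompt idea Vivek describes: the base model’s weights are frozen, and only a small block of prompt embeddings is learned by gradient descent. It assumes a base model that accepts input embeddings directly, and it is a simplification of the instruction prompt tuning used for MedPaLM, not the actual implementation.]

```python
import torch

class SoftPromptModel(torch.nn.Module):
    """Prepend learnable 'soft prompt' embeddings to every input; the base
    model stays frozen, so only the prompt parameters are trained."""
    def __init__(self, base_model, embed_dim, num_prompt_tokens=20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the giant LLM is never updated
        self.soft_prompt = torch.nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.01
        )

    def forward(self, input_embeds):  # (batch, seq_len, embed_dim)
        batch_size = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))
```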

Vivek Natarajan: So it kind of shines a light on a specific section. If you imagine all the knowledge encoded in this model as a giant library, the medical section is maybe one specific set of shelves. This conditioning with the soft prompt parameters is like shining a torch and saying: oh, the answer should be there, look over there. The second thing it does is teach the model the style of this domain. In the medical domain, when you are giving responses to consumer questions, you need to be more conservative, express your uncertainty better, maybe show empathy in your responses depending on the situation. From expert demonstrations, we can even teach the model those sorts of things, but we haven’t done rigorous studies on that front, so I would not claim anything there. The key message I wanted to pass on is that instruction prompt tuning is very lightweight: it does not update the entire model, just a small set of additional parameters. With MedPaLM 2, though, we decided to do end-to-end fine-tuning, so all the weights of the model were updated. In the paper that we have coming out, we detail exactly what datasets we used, and all of them are public datasets. That helped us get further improvements. And then the third thing I would mention is a very simple prompting strategy that enables the model to reason better, which we call ensemble refinement. The idea is that you don’t want the LLM to produce an answer in a one-shot setting; rather, when it considers a question, you want the LLM to produce multiple different answers, then collectively weigh their pros and cons and come up with a final answer. It’s a two-stage process. What we see is that this ensemble refinement strategy enables the LLM to reason better and refine its answers. So it’s a combination of three things: switching the base from PaLM to PaLM 2; end-to-end fine-tuning, updating all the weights of the model; and this new prompting strategy. That’s what enabled us to get to this expert-level score on these benchmarks. 
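[Editor’s note: an illustrative sketch of the two-stage ensemble refinement prompting Vivek describes. `llm` is a hypothetical text-in, text-out callable; the temperatures and prompt wording are assumptions, not details from the paper.]

```python
def ensemble_refinement(llm, question, k=5):
    """Stage 1: sample several diverse candidate answers.
    Stage 2: ask the model to weigh them and produce one refined answer."""
    candidates = [llm(question, temperature=0.7) for _ in range(k)]
    refine_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(
            f"Candidate answer {i + 1}: {c}" for i, c in enumerate(candidates)
        )
        + "\n\nWeigh the pros and cons of the candidate answers above "
          "and give a single final answer."
    )
    return llm(refine_prompt, temperature=0.0)
```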

Harry Glorikian: So essentially you’re having it think like we do, which is what I always tell my kids: don’t say the first thing that comes into your head; think through a few thoughts before you go. Which I think most people in the world might want to learn how to do these days. But let me see if I’ve got this correct. You took the large model and you shined a light on a specific area. Just to be clear, did you or did you not add any more training data to that? 

Vivek Natarajan: In both versions of the model, we did use expert demonstrations and training data. The key difference is whether you update the entire model or only a small part of it.

Harry Glorikian: Right, a small part. And then you fine-tuned it with clinicians from, I think it was, the US, UK and India.

Vivek Natarajan: That’s right. 

Harry Glorikian: What kind of judgments did you ask clinicians to make? I mean, how did their input help improve the model? 

Shek Azizi: Yeah, I can explain a bit. As I mentioned previously, we had these axes: we want to make sure that our models are safe, that they provide unbiased answers, and so on and so forth. The way we did it was to ask clinicians to provide some examples. Examples are basically cases: we provide some questions to them, and they write answers to those questions. But we ask the physicians to provide answers that are comprehensive, that consider the different aspects we care about, safety and so on. When the model sees those examples, it basically tries to imitate the behavior of a physician and the good answers that we expect. That’s the way it happens. And we had 40-something examples, I don’t recall the exact number, from different clinicians for MedPaLM 1. 

Harry Glorikian: Okay. So, my background being in the biological sciences, I’m curious: how do you deal with the heterogeneity of the domain? If you think about an area like finance, where there’s more or less a universal language, I can go in anywhere and profit/loss is pretty standard, no matter where I go, no matter what company I go into. But in an area like medicine, if I walk into a different department, different words are used to define the same thing. So how did you guys tackle that part?

Vivek Natarajan: A lot of it, actually, we get for free, because these models are trained on very large-scale Internet corpora, and I would say we should not underestimate the power of the Internet. There can be ridiculous things in it. Even if you have something you think is a very rare condition, it’s very likely that someone else has mentioned it in some forum on some social media site. So when you start accumulating data trawled from all these different sources, it invariably ends up containing the information you need. Whether that’s accurate or fake is, I think, a totally different question. But just the scale of the Internet means that there is a vast amount of knowledge encoded in these models, and I think that helps us quite a bit. Then, when we’re doing this expert fine-tuning step, we’re relying on demonstrations, and that’s admittedly from a very small pool of clinicians. I think we need more work there in terms of expanding the pool and the considerations we’re making, because clinicians with different backgrounds, different contexts, different lived experiences, and also the patient context, need to be taken into account. All of this is very early stage; we haven’t done much work over there. But the fact that we’re building on these Internet-scale corpora gives us a massive advantage. It gives us a great start.

Harry Glorikian: Okay, interesting. So these large language models, as opposed to, I mean, even the narrow models are still large. They’re not as large as the whole Internet, but they’re still pretty large. So I’m trying to think. You’re saying that by using these huge, maybe that’s a better word for it, huge language models, as opposed to a large language model that’s narrowly focused on a particular space, you seem to be getting more out of something that’s chewed on everything than out of something confined to a specific space.

Vivek Natarajan: Yeah, I think that’s right. And I think it’s more to do with the data that you use to train these models and less about the number of parameters of the model itself. DeepMind has come up with something known as the Chinchilla scaling laws, which you can use to derive the optimal size of the model you need to train if your data is a certain number of tokens. So I think we need to decouple those two things. But you’re right in the sense that what matters here is the training data itself, and the Internet is just so vast and so huge that I think we tend to underestimate what is in there. I’ve sometimes been really surprised when I interact with these models, both the ones at Google and outside ones like GPT-4, by the range of information encoded in them and the kinds of things they’re able to do, which is, I think, vastly more than any human that has ever lived.
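[Editor’s note: the Chinchilla paper (Hoffmann et al., 2022) found that compute-optimal training uses roughly 20 tokens per model parameter; the exact ratio depends on the compute budget. A back-of-the-envelope helper:]

```python
def chinchilla_optimal_params(num_training_tokens, tokens_per_param=20):
    """Rule of thumb from the Chinchilla scaling laws: compute-optimal
    training uses roughly 20 tokens per model parameter."""
    return num_training_tokens / tokens_per_param

# Chinchilla itself: ~1.4 trillion training tokens -> ~70 billion parameters.
print(f"{chinchilla_optimal_params(1.4e12):,.0f} parameters")  # 70,000,000,000
```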

Harry Glorikian: Yeah. And there are a few things I wish weren’t in there, but that would probably make it better. So, I’m reflecting on Microsoft’s BioGPT paper, which was underwhelming, right? Because it really wasn’t a bio GPT. It did four things well, because they had designed it to do those four things well. When I think about MedPaLM 2, is there a specialized task that you’ve worked on getting it to do, or is it more of a broad-based system? I’m trying to understand the level of depth the system gets to in any particular area, if that makes sense.

Vivek Natarajan: As I mentioned before, we are building on top of PaLM and PaLM 2, and those models have been shown to be useful in many different natural language generation applications. These are not just typical natural language benchmarks, but also generating code, doing mathematical reasoning, even solving certain scientific challenges. So that base knowledge is already encoded in the model, and we are building on top of it. The fine-tuning we are doing is with medical reasoning and medical questions. The benchmark we are fine-tuning on spans exam-style questions, professional medicine questions, but it also includes questions from PubMed, so more medical research and scientific reasoning questions. And the final one is consumer medical question answering: when a person goes online looking for medical information, what kinds of questions do they ask, what do they search for? That depends on the medical information need they have. 

Vivek Natarajan: So that’s what we have fine-tuned on. But I would stress that these models, again because they are trained on the Internet, just seem to be generally capable of a lot more tasks than we can even fathom. So think about these models as platforms, platforms that can enable many different applications. These could involve retrieval, encoding knowledge and retrieving it given the right context in the prompt; summarization; generative tasks in the medical domain, such as workflow tasks like generating prior-auth letters or insurance letters; or something that involves medical reasoning, where you should not be using it standalone, but maybe as an aid to a clinician who wants help from one of these large language models with a medical diagnosis or just getting more information. And finally, this could be directly in the hands of consumers as well, but again with the right set of guardrails, where we can help serve their medical information needs at scale.

[musical interlude]  

Harry Glorikian: Let’s pause the conversation for a minute to talk about one small but important thing you can do, to help keep the podcast going. And that’s leave a rating and a review for the show on Apple Podcasts. 

All you have to do is open the Apple Podcasts app on your smartphone, search for The Harry Glorikian Show, and scroll down to the Ratings & Reviews section. Tap the stars to rate the show, and then tap the link that says Write a Review to leave your comments.  

It’ll only take a minute, but you’ll be doing a lot to help other listeners discover the show. 

And one more thing. If you like the interviews we do here on the show, I know you’ll like my new book, The Future You: How Artificial Intelligence Can Help You Get Healthier, Stress Less, and Live Longer. 

It’s a friendly and accessible tour of all the ways today’s information technologies are helping us diagnose diseases faster, treat them more precisely, and create personalized diet and exercise programs to prevent them in the first place. 

The book is now available in print and ebook formats. Just go to Amazon or Barnes & Noble and search for The Future You by Harry Glorikian. 

And now, back to the show. 

[musical interlude] 

Harry Glorikian: You started talking about how people would use these. So: how do you think these are going to be used, or most useful, say, for the medical community, and I’ll use that word broadly, in the near future and maybe even the distant future? Although things are moving so quickly, I don’t know what distant future means anymore; if I say five years, that’s a lifetime these days. Is it distilling insights from medical literature, as you said? Is it answering complex medical questions? Is it searching long or unstructured medical texts? I’m trying to understand what sorts of things you guys envision. And I know at some point you might have to stop the list, because it might be too long.

Shek Azizi: For the use cases of this model, especially in the medical domain, I think we have two sorts of users. One is professional users, and the other is consumers, people who want to use this model on a daily basis. In my opinion, one way these models can be really useful for consumers is in countries where people don’t have easy access to medical professionals; if we can add multilingual ability, these models could be very useful there. And in places like North America or other parts of the world, they can be used by patients. If we can provide answers that are really grounded and safe, patients can use them on a daily basis to meet some of the basic daily needs they bring to doctors, helping them decide, if they have a worry, whether to see a doctor or resolve it another way. But I think that is going to be a long way from now, and it needs a lot of safety guards for those use cases. The other part is for professionals, for example doctors: how can they use large language models? You mentioned a few use cases already. They could be used in hospitals for triaging patients, for summarization of notes, and going even further, up to, I don’t know, knowledge discovery, if you really want to expand the use cases that large language models have for professionals. Vivek, do you want to add anything? 

Vivek Natarajan: Yeah, maybe just to quickly mention a few points. I think there’s a spectrum of use cases, and for some of them, where the risk is inherently low and you have a human in the loop, we are going to see applications fairly soon. These could be things like workflow in clinical settings and environments: generating documentation, which I think is a huge cause of pain and burnout today for the clinician community. I think LLMs can be a massive step change in improving clinicians’ overall experience, so that they can actually focus on doing what they want to do, which is interfacing with patients and providing actual care, rather than typing out documents. So I think we’re going to see very rapid adoption on those fronts, where there’s still a human in the loop who can verify the answer, accept it, and ensure that everything looks right. Generating the document itself takes a long time, so we can cut away several minutes; if a task takes ten minutes, with these LLMs we can probably bring it down to a minute or two. That’s a big step change, and I see rapid adoption on that front. The other thing I would quickly point out is medical education. 

Vivek Natarajan: When GPT-4 came out, and the same with MedPaLM, we knew that pretty much every medical school in this country had a grand rounds session talking about use cases of GPT-4 and large language models in medicine. And we know that students and trainees across the country are trying to use GPT-4 to better understand clinical terms and concepts, to make sense of data, and to use it for charting and learning. I think that is going to be a big change as well. You can imagine a personal medical tutor for all the trainees and people in medical school today. That, I think, is going to be a pretty big change, and it’s going to uplevel pretty much all the doctors who are going to come out, because used the right way, with an intelligent person in the loop, these models are going to be very useful. So I see medical education improving significantly as well. But I would also say, and this is my personal opinion, that for a lot of us on the team building out MedPaLM 2, we care a lot about access to health care. I personally grew up in India, where, for many people in the nearby towns and villages, going to see a doctor would mean walking 30 miles in extreme heat and giving up a day’s wages. That was simply unpalatable to most people, and so many people go their entire lifetimes without seeing a doctor. That means adverse health events accumulate over time, and lower life expectancy and everything. With these models powering systems, you can imagine world-class medical expertise that we can put directly in the pockets of billions of people worldwide. That’s the North Star goal. But that, I think, will take a lot of work, a lot of rigorous validation studies, and we don’t have the answers to how to do validation of these systems just yet, because of the broad range of capabilities here. So a lot of things need to fall into place, but simply the fact that we can even think about that future is, I think, immensely exciting. 

Harry Glorikian: Honestly, this is hugely exciting. When I talk to people about everything that’s going on in this space, and I’d say this has been true for the past five years, most people I talk to are not even close to being in tune with where things are. And with these advances, something as simple as having the system prepare the physician to communicate with a patient the right way, that it can role-play that way, I think is very cool, and a little scary, but very cool. But Shek, you mentioned something earlier and you said it again just a few minutes ago: guarding against potential problems. I’ve seen this in other LLMs: if you work at it a little bit, you can actually break them, right? And get them to say some very interesting things. How do you manage incorrect answers, or harmful answers, or, as you said, answers that don’t reflect scientific consensus? Hallucination? Bias in the model? What I’m struggling with is: is there truly a technical solution to this? Because once you train the base model, these things are so big. How do you adjust them to not do something?

Shek Azizi: Yeah, it’s a very good question, and it actually has two parts. One part is that this is ongoing research, and we are going to see more and more breakthroughs coming out, because we need this type of grounding and safety net not just in the medical domain but even for daily uses. There is a lot of research around the safety of these models and how to prevent hallucinations or harmful answers and so on, so there is a lot happening there. One way people are approaching this is with human feedback: basically collecting human feedback and feeding it back into the training loop so the models can avoid these problems. That could be very useful for medical purposes too, and it is going to help us ground these models and provide some safety. But I think the type of feedback we need for medical purposes is going to be very different from general use cases. If you interact with some of the chatbots that are out there, you can see that you have thumbs up, thumbs down, and you say: okay, I like this answer, I don’t like this answer. But for medical purposes, the range of human feedback that you need is broader. You need feedback on the safety of these models. You need feedback on the basic medical concepts these models use, and on whether the answers they provide are aligned with the medical knowledge that is out there. We know that medical knowledge and recommendations change from one country to another, and even in different states you can have different recommendations. And also bias.

Shek Azizi: The other way to control these models and train them is through adversarial testing. You need to test the model against different kinds of attacks that can come at it, and against outlier questions that you didn’t expect but that can cause harm. That’s happening, and when you put those questions back into your training loop, you push these models to provide answers that are more grounded. But I also believe that at this point there is a lot we don’t know, and we should admit that: there is a lot we don’t know and don’t yet know how to control. But while we have a lot of bad actors who may use these models in harmful ways, we also have a lot of knowledgeable researchers trying to ground these models and make them safe. So let’s see what comes up in the next few months, given the scale of the things we’re talking about right now.

Harry Glorikian: Yeah. I mean, as I said, just trying to keep up with the literature is, I don’t want to say impossible, but incredibly challenging. Can you guys talk a little bit about MultiMedQA? This is, I believe, a benchmarking system you guys introduced for LLMs in the medical domain, right? What kinds of standards or questions go into MultiMedQA? How are you guys sharing or spreading this new benchmark? I’m assuming you’re letting other groups use it. If you could talk about it a little, that would be great.

Shek Azizi: I can talk about that. MultiMedQA is basically a benchmark that we put together from seven different kinds of datasets. They include long-form answers, short-form answers, some open-ended questions, and yes/no and multiple-choice questions. A big chunk of these datasets is already open source and out there; what we did was put them together in an accessible format and make it a benchmark. One nice addition we made to those existing datasets was consumer questions from health search: the questions that consumers ask search engines, the medical concerns that they have. We have around 3,000 such questions, and we are going to release them along with the paper when it’s published, so you can wait for that. And we are hoping other people will use this benchmark as we are.
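[Editor’s note: for the multiple-choice portions of a benchmark like this, accuracy is typically computed with a harness along these lines. The data format and `answer_fn` are illustrative, not the actual MultiMedQA tooling.]

```python
def multiple_choice_accuracy(answer_fn, dataset):
    """Fraction of questions where the model picks the keyed option.
    Each example: {"question": str, "options": dict, "answer": "C"};
    `answer_fn(question, options)` returns an option letter."""
    correct = sum(
        answer_fn(ex["question"], ex["options"]) == ex["answer"]
        for ex in dataset
    )
    return correct / len(dataset)
```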

Harry Glorikian: Okay. So what are the next steps in the rollout of MedPaLM 2? I’m assuming certain Google Cloud customers are going to gain access to MedPaLM 2 first. How are you guys thinking about that program? What kind of feedback are you hoping to get from your partners? I’m assuming you’re going to use that to improve the model. And when does the average person get access to play with it? When might that happen?

Vivek Natarajan: Yeah, as you mentioned, we have been piloting this trusted tester program with some very large and established companies out there: providers, pharma companies, payers across the health care and life sciences spectrum. What we are hoping to get out of that is feedback on what use cases they envision, because, as I mentioned, language models are a platform technology, and in the hands of different people they could mean very different things and very different use cases. So we want to understand what use cases people are thinking about with these language models, and maybe the value associated with them. The second thing is how well the models are performing today. Once we get a good sense of that, we can prioritize how to improve the model on some of these specific use cases, and also use that feedback data to generally improve these models, make them safer and simply better at reasoning and solving tasks and questions and so on. But we also have a few applications and use cases in mind ourselves. Google is a search company and has various apps and services with billions of users, so you can imagine this model being surfaced in the future through one of those services. 

Vivek Natarajan: This is my personal opinion again, but with models such as MedPaLM, until we are very certain about the validation and the safety, we are probably not going to immediately release them and put them in the hands of the general public. But we’re very open to research partnerships, and the trusted tester program is just going to expand. So if you are a big company with well-defined use cases, or even if you just want to explore, the trusted tester program is a great opportunity. And if you’re thinking about research on MedPaLM, we are very open to collaboration and to opening up the model to explore specific use cases in academic and clinical workflow settings. But I don’t see a future, say, even in three months, where this model is just going to be opened up and we put out a UI where anyone can interact with it. It’s just not safe yet. We want to do this well, definitely, and we want to move as fast as possible given how the field is moving, but we also want to do this safely, because if we do this safely, I think we will have the impact that we want in a reasonable timeframe.

Harry Glorikian: Yeah. I mean, in one way, right, the race is on, and it’s scary, but still, the race is on. Everybody’s running at 100 miles an hour.

Vivek Natarajan: Hopefully responsibly.

Harry Glorikian: I’m on board for responsibility, right. But I think in one of the videos you mentioned that you wanted to make the model multimodal. Maybe you could explain that a little more. What do you mean by that? What new scenarios or capabilities might be possible? Will doctors and nurses be able to upload a medical record, a medical image, genomic data? I think those are some of the things you mentioned in the video.

Vivek Natarajan: If you look at medicine as an endeavor, it is a multimodal endeavor. There are so many different modalities that characterize a given individual: medical records, but also images, labs, genomics data, protein sequences, everything. In order to really do well in medicine and understand particular individuals, so that you can solve those tasks better, you need to be able to integrate data from multiple different modalities. That’s what our clinicians do. They don’t just look at a piece of data in isolation; they have all the context about everything that’s happened. Even a radiologist knows so much other information, so they’re not just looking at the image in front of them on the screen in a vacuum; they have all the other context. We want to give our models that capability as well, to be able to integrate that data. Individually, this is going to enable a lot of different new applications, but fundamentally, as these models become more generally capable of interpreting and integrating data at scale, that’s going to open up a new set of applications in the setting of providing care. You can imagine doing things better than we were previously trying to do, in medical imaging, for example. Previously we had models, and a lot of them are getting through regulatory approvals, but these are all narrow-domain models that produce probability estimates for the presence or absence of a given disease, or something like that. Now imagine a new class of models with the ability to express themselves, interact and collaborate.

Vivek Natarajan: Being able to have a conversation about a given case with a clinician: you can imagine the power of such a system. That’s going to enable so much better human-AI collaboration. Imagine a radiologist talking to an AI and collaboratively solving a difficult case, or a general practitioner talking to it and learning more about a medical image because they want to understand in depth what’s happening. I don’t know who exactly said this, but we talk about reports a lot in the medical setting, and the goal of the medical report itself is to enable a conversation. Unfortunately, because of how the entire ecosystem works today, those conversations don’t happen. But with multimodal AI that can interact and express itself, I think we can enable those conversations to happen more, and that in turn is going to enable our doctors to provide better care to people. The second thing I would say is that as you start integrating information at scale, that is also going to fundamentally enable new knowledge discovery in the biomedical domain, because you’re going to see all sorts of new information that you did not necessarily put together before. These models, because of their ability to process so much rich multimodal data, more than any human ever could, are going to be able to shine a light on new phenomena. And that is going to improve our understanding of diseases, maybe enable the discovery of new therapeutics and discovery mechanisms, and so on. So as we go multimodal, the kinds of things we are going to be able to do are just incredibly exciting. 

Harry Glorikian: Yeah. People wonder why, and I tell them, I wouldn’t want to work in any other space. This is so much fun on a daily basis, right? You’re not only getting to work on cool stuff, you’re actually improving the human condition. So I totally agree with you, although I think for a few doctors who are listening, the idea that the computer is actually going to talk back to them is going to freak them out. I cut you off; I thought you were going to say something. I’m sorry.

Shek Azizi: No, no. The explanation Vivek gave, I think that’s a vision we could have for a few years from now. It’s a really sweet vision. Let’s see how things move. Yeah.

Harry Glorikian: Yeah, it might not even take a few years, the way things are going. But before I let you go: you guys are deep in this space, you’re actually working on these systems, you’re getting them to work. I’m sure you have your moments of going, “Oh, I didn’t realize it could do that,” because if I’m having them, I’m sure you guys are having them, except you’re having them sooner than I am. But why are these LLMs so good? I don’t want to keep using the word good, but why are they so good at, say, medical question answering and so many other different areas? You keep poking at one, and it seems to do a pretty decent job, and then sometimes you ask it something and it does something you didn’t expect it to be able to do, because it was never told to do that particular function. Can you guys speculate on that? Maybe you know the answer and you could share it with all the rest of us.

Shek Azizi: Mostly, the thing you’re describing is hallucination, or the models combining things and producing new stuff out of that. That usually depends on the underlying parameters you use for these models, and I think it comes naturally from the way these models get trained, which, to make it really simple, is prediction of the next word. So you can see a way that they can combine things to produce new stuff, and that’s the place all of these new combinations that you didn’t expect come from. That’s my idea, but I’m not sure if Vivek wants to add anything…

Vivek Natarajan: Yeah, to me there are two parts to this question. I think we can explain some of what we are seeing through scale and the next-word prediction objective: training these models on Internet-scale data, combined with sufficient depth of the model, enables intelligent compression, and that in turn leads to the emergence of reasoning algorithms and the kinds of things we are seeing these models capable of doing. The second thing I would say is that this is also a very different form of intelligence from what we are used to encountering in humans. In many ways it’s complementary; in some ways it’s not. And that’s where I think the friction generally is. What is happening is this very unique field where we are building something and studying it at the same time. It’s not like physics or chemistry or biology, where the thing already exists in nature and we are trying to shine a torch on it and understand it better; here, all of it is happening at once. So I would say there are some things we understand about how these models are able to do certain things, based on how we train them today, as Shek mentioned. But there are also a lot of things we don’t understand, and the scale of these models makes it incredibly tricky. Who knows, maybe it’s going to take an AI to help us shine a light on what’s happening under the hood; maybe us humans are not capable. But jokes aside, I think there’s a lot more work that needs to happen on the interpretability front.

Harry Glorikian: No, I agree. This thing is moving so quickly. And it’s funny, because people say things like, how do I know it made the right decision? And I’m like, how do you know that the doctor made the right decision? It’s so interesting that we expect the machine to be 100 percent, but if you barely passed medical school, last time I checked, you’re still called a doctor. It was great having you, and maybe we’ll have you back. I’m sure there’ll be a MedPaLM 3, or maybe the multimodal MedPaLM, and then we’ll have you back on the show. It just seems like a never-ending process. I’m sure you guys have beds under your desks, because there’s no reason to ever go home.

Vivek Natarajan: It’s fun. It’s an honor and privilege. And I think we have to just respect the opportunity that is in front of us. So hopefully we can do something big and good over here. 

Harry Glorikian: Excellent. I can only wish you guys incredible success. 

Shek Azizi: Thank you. 

Vivek Natarajan: Thank you so much, Harry.  

Harry Glorikian: That’s it for this week’s episode.   

You can find a full transcript of this episode as well as the full archive of episodes of The Harry Glorikian Show and MoneyBall Medicine at our website.   

Just go to glorikian.com and click on the tab Podcasts. 

I’d like to thank our listeners for boosting The Harry Glorikian Show into the top two and a half percent of global podcasts.

To make sure you’ll never miss an episode, just open Apple Podcasts or your favorite podcast player and hit follow or subscribe.  

And don’t forget to leave us a rating and review on Apple Podcasts.  

We always love to hear from listeners on Twitter, where you can find me at hglorikian. 

Thanks for listening, stay healthy, and be sure to tune in two weeks from now for our next interview. 


FAQs about Language Models and Healthcare Chatbots

What are language models?

Language models are artificial intelligence systems designed to understand and generate human-like text based on the patterns and structures of language. They are trained on large amounts of textual data, such as books, articles, websites, and other sources, to learn the statistical relationships between words and phrases.

A language model’s primary function is to predict the probability of a sequence of words or the next word in a given context. It accomplishes this by analyzing the patterns and dependencies in the training data. For example, if the model is given the phrase “I want to eat,” it can predict that the next word is likely to be something related to food, like “pizza” or “sushi,” based on the patterns it has learned from the training data.
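As a toy illustration of the “I want to eat” example above (the probabilities here are made up; a real model scores every word in its vocabulary):

```python
# Hypothetical next-word distribution for the context "I want to eat".
next_word_probs = {"pizza": 0.22, "sushi": 0.18, "dinner": 0.15, "stone": 0.001}
print(max(next_word_probs, key=next_word_probs.get))  # pizza
```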

Language models have various applications, including text completion, grammar correction, machine translation, question answering, and chatbots. They can generate coherent and contextually relevant responses, making them useful for natural language understanding and generation tasks.

Recent advancements, such as OpenAI’s GPT-3, have led to the development of highly sophisticated language models that can understand and generate human-like text across a wide range of topics and contexts. These models have been trained on massive amounts of data and can provide impressive responses to complex queries or prompts.

What are examples of language models?

There are several notable examples of language models, each with its own unique capabilities and use cases. Here are a few prominent ones:

1. GPT-3 (Generative Pre-trained Transformer 3): Developed by OpenAI, GPT-3 is one of the largest and most powerful language models to date. It has 175 billion parameters and can generate highly coherent and contextually relevant text across a wide range of topics. GPT-3 has been used for tasks like text completion, translation, question answering, and even creating conversational agents.

2. BERT (Bidirectional Encoder Representations from Transformers): BERT, developed by Google, is a pre-trained language model that has achieved significant success in natural language understanding tasks. It is trained on a large corpus of text and can understand the context and meaning of words and sentences. BERT has been widely used in tasks like sentiment analysis, named entity recognition, and text classification.

3. Transformer-XL: This language model is designed to handle long-range dependencies and generate coherent text over long contexts. It overcomes the limitation of the vanilla Transformer model by using a segment-level recurrence mechanism, allowing it to capture longer patterns in text.

4. GPT-2 (Generative Pre-trained Transformer 2): Released by OpenAI prior to GPT-3, GPT-2 gained attention for its ability to generate human-like text. Though smaller in size compared to GPT-3, it still exhibits impressive language generation capabilities and has been used for various tasks like text completion, summarization, and story writing.

5. ELMo (Embeddings from Language Models): ELMo is a deep contextualized word representation model developed by researchers at the Allen Institute for Artificial Intelligence (AI2). It generates word embeddings that capture the meaning of words in the context in which they appear. ELMo has been effective in tasks like sentiment analysis, named entity recognition, and textual entailment.

These are just a few examples of the many language models that have been developed. Each model has its own strengths and limitations, and researchers continue to explore new approaches to improve language understanding and generation.

Can large language models become doctors?

Large language models have the potential to assist in certain aspects of medical care and support healthcare professionals, but they cannot replace the expertise and knowledge of trained doctors. While language models can process and analyze vast amounts of medical data, provide information on symptoms, treatments, and research findings, and even generate potential diagnoses based on input, they lack the clinical experience, judgment, and intuition that physicians possess.

Language models can serve as valuable tools for medical professionals by helping them access and interpret medical literature, assisting in medical research, and providing decision support based on available evidence. They can aid in automating administrative tasks, extracting relevant information from medical records, and even assist in triaging patients or generating preliminary reports. However, the ultimate responsibility for diagnosing and treating patients lies with qualified healthcare professionals who consider multiple factors, including a patient’s medical history, physical examination, laboratory tests, and personal circumstances.

It’s important to view language models as powerful tools that can augment medical practice rather than replace human expertise. Collaborations between AI systems and healthcare professionals can lead to more efficient and accurate diagnoses, improved patient care, and enhanced medical research and knowledge.