
How ActiveLoop Is Building the Back End for Generative AI

The Harry Glorikian Show 

David Buniatyan, CEO, ActiveLoop 

For March 26, 2024 

Final Transcript 

Harry Glorikian: Hello. Welcome to The Harry Glorikian Show, where we dive into the tech-driven future of healthcare. 

It seems clear that generative AI is going to change how we do things across the entire economy, including the fields we talk about here on the show, namely healthcare delivery, drug discovery, and drug development. 

But we’re still just starting to figure out exactly how it’s going to change things. 

For example, AI is already speeding up the process of discovering new biological targets for drugs and designing molecules to hit those targets. 

But whether that will actually lead to better medicines, or create a new generation of AI-driven pharmaceutical companies, is still an open question. 

One thing that’s for sure is that generative AI isn’t magic.  

You can’t just sprinkle it like pixie dust over an existing project or dataset and expect wonderful things to happen automatically.
In fact, just to put the data you already have to work in a generative model, you may have to invest a lot in new infrastructure and tools. 

And that’s the part of the puzzle we’re going to focus on today. 

My guest is David Buniatyan. 

He’s the founder of a company called ActiveLoop, which is trying to address the need for infrastructure capable of handling large-scale data for AI applications. 

David has a background in neuroscience from Princeton University, where he was part of a team working on reconstructing neural connectivity in mouse brains using petabyte-scale imaging data. 

At ActiveLoop, David has led the development of Deep Lake, a database optimized for AI and deep learning models trained on equally large datasets. 

Deep Lake manages data in a tensor-native format, allowing for faster iterations when training generative models. 

David says the company’s goal is to take over the boring stuff. 

That means removing the burden of data management from scientists and engineers, so they can focus on the bigger questions, like making sure their models are training on the right data, and ultimately innovate faster. 

There’s a lot of technical detail in this conversation, but sometimes it can be really helpful  to get down into the weeds, and that was definitely the case with David. 

So here’s our full interview. 

Harry Glorikian: David, welcome to the show. 

David Buniatyan: Um, thanks for having me, Harry. Great to have the conversation. Looking forward to it. 

Harry Glorikian: So, um, I’m excited to have you on here. I mean, compared to the vast majority of people I talk to, you’re way more on the data science side than, say, the patient care side or actually working on the drugs. But I think it’s an important topic, because the one without the other doesn’t get you to that end point. And I want to start off and just sort of set the stage, because I was trying to read up as much as I could on the company and what you’re doing. And one of the points you guys make, or one of your contentions, is that data scientists who want to use machine learning should be free to do what they do best, meaning training models, pushing them into production, shipping new features, and solving core business problems, rather than, say, spending a lot of time building the data infrastructure, or data stack, right? And I think that’s a great insight. And before we go into what it takes to build a data infrastructure, or how you do that at ActiveLoop, maybe we could talk about what it would mean if a data scientist did have access to all this infrastructure, especially in the world of healthcare and medicine. So what kinds of problems could healthcare companies, say, solve if they could apply machine learning to their discovery and development challenges much more easily? Maybe we could run through some of them, because I think it would be helpful for the listeners to really understand what that means. And maybe we could talk about the different applications of deep learning or other forms of machine learning that you think are most interesting: drug discovery, cancer detection, brain imaging, surgical assistants, eye disease. I can think of tens of examples of areas where this might be applied. So, I said a lot there. I’ll let you start from there. 

David Buniatyan: Before getting in and sharing what the company is doing. Let me tell you the story, how it actually began before… 

Harry Glorikian: Okay. 

David Buniatyan: Before I started the company, I was doing a PhD at Princeton University, in a neuroscience lab, actually. So while we will talk more about data and AI and so on, there’s some relation to how this all started. It actually came from biomedical applications at the Princeton Neuroscience Institute. I was in a neuroscience lab where we were trying to reconstruct the connectivity of neurons. This field is called connectomics. It’s a new branch of neuroscience that tries to map all the neurons and how they connect to each other, so that you can come up with more biologically inspired algorithms and better understand how learning happens inside the brain. And what we were doing is taking a one-cubic-millimeter volume of a mouse brain and cutting it into very thin slices. Each slice was 100,000 by 100,000 pixels, imaged with electron microscopy, and we had 20,000 of these slices. So the data was getting to petabyte scale. To imagine what petabyte scale is: it’s about 1,000 times the storage you have on your laptop, just to store one cubic millimeter of mouse brain data. 

David Buniatyan: And then once those images were collected and stored in the cloud, the problem was to be able to see how each pixel, or voxel, is connected to another voxel. Then you can anatomically reconstruct the 3D model of each neuron at the synapse level, build the graph, and also bring in priors: while taking these images, we were also doing experiments where the mouse was seeing visual cues, and we were recording calcium imaging as well. Then you connect the neural activity with how the neurons are connected, so you can deduce patterns of how each neuron affects the other neurons inside the brain. Historically, neuroscience was focused on understanding single-cell behavior, and then you also have psychology, which works at a more functional level of how we make decisions. But there was a big gap between going from a single cell to how we make decisions, like, should we start this company or not? Anyway, this was one of the biggest projects funded by IARPA, which is like DARPA, but under the intelligence community rather than defense. 

David Buniatyan: And it was not only our lab; there were also teams from MIT, from Harvard, from the Allen Institute, from other places as well, and many of them had different approaches to tackling this problem. However, when you’re dealing with petabyte-scale data and building really high-precision AI algorithms to process that amount of data, you actually need infrastructure. And what we faced is that most of the existing tools at that time, and this was about 6 or 7 years ago, couldn’t scale to our needs. We had to rethink how the distribution of compute processing should happen on the cloud. We had to rethink how the data should be stored so that we could move it around. A very basic example: for moving petabyte-scale data from, let’s say, Princeton to the West Coast, to San Francisco, it would actually be much cheaper to hire a truck, physically load the data onto the truck, and drive it there than to move it over the internet. That’s just one example of how things break at that scale. 

David Buniatyan: And a lot of tooling, especially in AI and ML, got inspired by biomedical use cases. And that’s how we started ActiveLoop later, realizing that processing petabyte-scale data on the cloud costs something like a million dollars. How can we reduce that by five times, by rethinking how the data should be stored, how it should be streamed from the storage to the compute machines, whether CPUs or GPUs should be used, and what kind of models to use? But the key insight I realized is that this infrastructure actually helps you cycle through the data much faster, and what you need is a lot of iterations to come up with a quality of output that can later be used, let’s say, for neuroscience research. Because if you imagine, one pixel-level mistake of correspondence can have a butterfly effect on how one neuron connects to another, and then you have a failure, or a misinterpretation of how this neuron was actually connected to another neuron, a false positive. So we spent a lot of research coming up with models that can not only predict or segment the neuron connectivity, but can also do error correction. 

David Buniatyan: And it was a big lab; it was not just me. There were about 30 team members, led by my advisor, Sebastian Seung. We also had annotators in the lab, and I remember one of our PhD students, Kisuk Lee, actually came up with one of the first convolutional neural models that could beat humans at annotation. So basically, if you have an expert-level human who, given electron microscopy images, can segment, here’s neuron one, here’s neuron two, we could get the first neural models that achieved superhuman accuracy, which means that on average they performed better than a human expert at labeling this data. And what this means is that the 20 or 30 annotators we had in our lab transitioned from annotating all the data we collected to becoming error correctors. We took their time and sent them only the data the model was not confident about, and they did those annotations. Later, we also trained another model that does the error correction itself, and then the human annotators became the correctors of the error corrector. So essentially you took the expertise of the people who had gotten very good at labeling which neuron each pixel corresponds to, and as the automation went on under the hood, you brought these people to a higher and higher level. 

David Buniatyan: And the realization is that you really need a couple of components for this to work well. You need, of course, the GPU compute infrastructure. You need the data, or the data labeled by the human experts or annotators. And you also need the researchers to be able to come up with these models. And what I saw there is that this infrastructure is so critical for coming up with these AI systems that, within five or ten years, all companies will require it. And what happened a year, a year and a half ago, is that OpenAI came out with GPT-4 and showed that if you scale the compute, if you scale the data, if you scale the expertise of the humans working on these models, you can get models that are superhuman, or at least can perform at the same accuracy as humans in some fields. So that’s just a brief background, but let me know if you have any questions. Happy to deep dive from there. 
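The annotation workflow David describes, where humans only review the samples the model is unsure about, is essentially uncertainty-based routing. Below is a minimal sketch of the idea in PyTorch; the model, the batch, and the 0.9 confidence threshold are illustrative placeholders, not the lab’s actual pipeline.

```python
import torch
import torch.nn.functional as F

def route_for_review(model, images, threshold=0.9):
    """Split a batch into samples the model handles confidently and samples
    that should go back to human annotators for error correction.
    (Hypothetical helper; the threshold and shapes are illustrative.)"""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(images), dim=-1)   # (N, num_classes)
        confidence = probs.max(dim=-1).values      # top-class probability per sample
    needs_review = confidence < threshold
    # Confident samples are accepted automatically; the rest go to the experts.
    return images[needs_review], images[~needs_review]
```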

Harry Glorikian: I just want to get back to thinking about the world that most of the listeners probably think about, you know, drug discovery, cancer detection, et cetera. When you’re thinking about the kinds of problems these companies are solving in these worlds, what do you see as examples of what you’ve encountered? What kinds of problems are the clients that are utilizing your platform trying to solve, without disclosing anything, you know, super confidential? 

David Buniatyan: So from our side, when we started the company about five years ago, we didn’t actually focus on biomedical use cases. We said, okay, we are building a horizontal data platform where we can help any company with managing their data and connecting it to AI models. And the biggest difference, what we realized, is that you have all these awesome databases, data warehouses, data lakes, lakehouses specialized for analytical workloads, but you don’t have one for deep learning and AI applications. And we said, okay, why don’t we look into these deep learning frameworks, which are under the hood of all these foundation models and large language models, and ask, what is the best way to store the data? Later, while we were building all this infrastructure and starting to onboard customers, we realized why this resonates so well with the biomedical industry, including life sciences, pharma, healthcare, and medtech: no one had actually come up with a specialized data management layer that these companies could use. And the reason is that there are so many formats for how, let’s say, a CT scan is stored, how the MRI is, how the electron microscopy data is; every vendor that makes these machines came up with its own format. 

David Buniatyan: It’s a big, big jungle that no other database company has actually tapped into. And while it didn’t make sense before to unify all this infrastructure, now it’s at a tipping point because you can actually have machines do the analysis. And the analysis is not just for the sake of, okay, let’s get some insights from the data. You can actually come up with deep learning models that can do better cancer detection from, let’s say, chest X-rays than a human can. If you scale that to every specific medical use case, this can have huge impact across the whole healthcare industry. But you need the data as one of the biggest components to connect to the models, and that’s what we’re working with large pharma companies to help them with. 

Harry Glorikian: So I’m asking the question of, what does it look like when these deep learning models, say, fail in these areas? And what does it look like when they succeed? I’m trying to get clear about why having a solid data infrastructure before you jump into a machine learning project is so important, and what researchers, let’s say, could accomplish if they had more power over their data sets. 

David Buniatyan: So let me give you an example. One of our customers came to us; they’re a Fortune 500 medtech company. They had a lot of AI projects internally, but they were not unified at all, and they told us, hey, we looked into all the available tools, including pretty well-known ones, Databricks, et cetera, but no one can actually manage and unify our data storage, especially for radiology imaging use cases, where we have many DICOM files or NIfTI files and we have to connect them to the models to come up with better AI algorithms that can train on this. And one of the reasons deep learning works is the amount of data it can consume, finding the patterns and then building the recognition models you can later use in production. And the value we created for them, this is in their own words: after the successful production pilot and getting them onboarded, we asked, okay, can you tell us in your own words what we are doing for you? And what they said was very interesting: we act like a magical closet for them. They can throw the data at us, and somehow, they don’t know how, we’ll organize this data so that they can very efficiently retrieve it and connect it to the models. 

David Buniatyan: And then we asked, okay, why do you need this? And the problem, as I mentioned, is that they have so many different teams working on AI models and trying to come up with specific things that two of them even ended up buying the same data set from a third-party vendor twice. The third-party vendor knew the organization already had a license to use this data set and still sold it a second time, which is, I mean, sort of unethical on the vendor’s side. But regardless, the problem is that within an organization you really need to be able to access, let’s say, a very specific data set of cancer in chest X-ray images, and then you want to train a model to solve that highly precise problem, and you need to be able to access that data. And the reason data is so critical is very similar to self-driving cars. If you’ve seen recently, there was the Tesla self-driving demo by Musk, driving across Palo Alto, where they demonstrated for the first time that you can have an end-to-end neural network making the driving decisions. And you could see that the Tesla was not actually stopping at the stop sign. 

David Buniatyan: And then there was a question: why does it not stop? It’s basically a rule, right? You have to wait three seconds, count one to three, and then you go. And the reason is that they trained this model across maybe hundreds of thousands of drivers driving around Palo Alto, and it appears that only very few stop at the stop sign, like all of us as drivers. So the neural network just learns how drivers actually behave. 

Harry Glorikian: So, being born and raised in California, that’s called a California stop, right? You sort of slow down, and yeah, if there’s somebody else there, maybe you stop, but otherwise you keep rolling along. So yes, I’m very familiar with that activity. 

David Buniatyan: So that demonstrates the difference between what the average behavior is and how the model is actually expected to behave. And the way they tackled the problem is that they took all the examples where the driver actually stopped at the stop sign. Instead of adding an explicit rule to the neural network saying, hey, if you see a stop sign, you should stop here, they went into the data and upsampled those examples of drivers who do stop at a stop sign and then count, I don’t know, three seconds. And once they upsampled, when the model was being trained, it started to see more examples of such cases than the average you would expect from a human driver, and then the model could behave. And this demonstrates that if you’re training a neural network and you want to put it into a highly sensitive or critical application, you actually need to take care of exactly what data is getting into the model. And this becomes way more important. Of course driving cars is a super important case, you don’t want to make any mistake, but some minor mistakes are fine. Especially in patient care, if you are making a diagnosis that a surgical operation should happen because of this decision, you want to be highly, highly careful and accurate. Your false positives should be super low, and you have to take care of all the edge cases. And the way you do that is you make decisions about what data you feed into these models. 
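Upsampling the rare examples, rather than hard-coding a rule, is something you can do in an ordinary training loop. Here is a minimal sketch using PyTorch’s WeightedRandomSampler; the data, the edge-case flags, and the 10x weight are made-up illustrations of the idea, not Tesla’s or any customer’s actual setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical flags: 1 marks a rare edge case (a full stop at a stop sign,
# an unusual tumor presentation), 0 marks an ordinary example.
features = torch.randn(8, 3)
is_edge_case = torch.tensor([0, 0, 1, 0, 1, 0, 0, 0])

# Weight edge cases 10x, so the sampler draws them far more often than
# their natural frequency in the data set.
weights = is_edge_case.float() * 9.0 + 1.0
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

loader = DataLoader(TensorDataset(features, is_edge_case), batch_size=4, sampler=sampler)
for x, y in loader:
    pass  # batches are now biased toward the upsampled edge cases
```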

Harry Glorikian: Yeah, I would love to have the discussion about how accurate these models are versus most humans doing the same thing. I think if we actually published the data on the error rate of humans, the machine would be beating them by at least a… But we don’t have those discussions openly, which is an interesting discussion. So you guys have built a database at ActiveLoop that you call Deep Lake, if I’m not mistaken, and it’s a form of a data lake, right? So as a way into talking about Deep Lake, can you explain what a data lake is, and then we can sort of juxtapose what Deep Lake is? 

David Buniatyan: So typically a data lake is centralized storage where you throw your data. That was the original form of what people started to refer to as a data lake: people can just store a bunch of data they collect from different sensors or different data sources and dump it somewhere, so that later, whenever the time comes for business analytics to bring the insights, the organization can access this data and start processing it. What happened is that people started overusing data lakes, and the data lakes became data swamps, because it becomes a huge mess to organize: how to organize the data, where to find it, how to even map it to the need, whether the data is correct or has errors. It became a mess to manage for a lot of organizations, not only in healthcare but across the board. And people then came up with a second generation of data lakes, which tried to organize it into a form that you can later query, version control, and iterate upon. 

Harry Glorikian: Let me ask you a question there, though. When you say that, how is a data lake different from a regular database, then? 

David Buniatyan: If you get into technical terms, a database needs to be able to store the data in memory. There are different types of databases, first of all. Secondly, most databases store the data in memory on a machine, and they are highly transactional, which means you can give them requests and they will immediately bring back the data. They keep the data in hot memory, which is very expensive, but that’s what you need, because you need to minimize the latency to access it. And another big difference between databases and data lakes is that a database is mostly structured, which means the data should be in table form, and there’s a language, SQL. If you want to store images, nobody ever recommends storing them inside the database. What you likely do is put them on a data lake and then put a pointer in the database saying, hey, my data is on, let’s say, S3 on AWS, and then bring it back whenever you try to access it. And what you end up with is this huge separation: all your unstructured data, which is more than 95% of the volume, sitting on a data lake, and then maybe 5%, or maybe 1%, stored in a database or some table that kind of acts as metadata across the board. And that becomes super inefficient when you are trying to train AI and machine learning models by consuming this data. The reason is that data scientists actually go and run a query on this database, which could be Postgres or Snowflake, create a view, then go link by link and fetch the images onto a machine, so that they can connect it to the model and start the AI learning process. 

David Buniatyan: And then, as in the self-driving car example I brought up, let’s say you want to go back and upsample the edge cases you really care about for your medical application. Then you have to do this iteration again: access the database, do all of that. It takes a lot of cycles for the data scientist to work with this. So I gave you a trivial example, but Fortune 500 or 1000 companies don’t have just one data product; they have on the order of 200 data products across the board. That data is stored in different databases, different locations, different data lakes; the mess, or the complexity, gets multiplied as you increase the scale of the organization. And especially pharma companies, they have actually been collecting data for something like 200 years; they are among the oldest organizations in the world. They have collected this data, and some of them are actually storing papers in archives that nobody has ever looked at again after a few years. Imagine if you could connect all this data, all the clinical trials, all the experiments you have done in the past, to help you or enable you to make your next decision on which drug to double down on or which research to start. All this information can actually help and enable life science companies to make better decisions. 
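To make the two-system pattern David is criticizing concrete, here is a minimal sketch: metadata lives in a relational database, the images live in object storage, and the two are joined only by URL strings fetched link by link. The table, columns, and bucket name are hypothetical; the point is the access pattern, not any real system.

```python
import sqlite3
import boto3

s3 = boto3.client("s3")
conn = sqlite3.connect("metadata.db")

# 1) Query the structured metadata for pointers to the raw scans.
rows = conn.execute(
    "SELECT s3_key FROM chest_xrays WHERE finding = 'cancer'"
).fetchall()

# 2) Fetch each sample link by link from blob storage before training can start.
images = []
for (key,) in rows:
    obj = s3.get_object(Bucket="hospital-imaging-lake", Key=key)
    images.append(obj["Body"].read())
```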

Harry Glorikian: So I just want to drill down into this a little more so that everybody gets a picture of it. What are the shortcomings of these traditional data lakes you were talking about when it comes to handling the data needed, say, to train and operate these deep learning models? Are there specific varieties of data that data lakes aren’t good at storing or moving around? Or is it just the sheer volume you were talking about that needs to be moved around that’s the problem? Or are data lakes just bad at storing, and I’m picking on these three things, images, video and audio, because I know that’s something you guys focus on, or other types of data, to train a deep learning model? What are the big issues with using the traditional data lake approach? 

David Buniatyan: Yeah, thanks for bringing it up. Basically, it’s about the abstractions. And what I mean by abstractions is that data lakes are so good because they don’t make any assumptions about the data. And if you don’t make any assumptions about the data, you treat everything as a blob: yeah, here’s one box, another box, another box, and you don’t really care what’s inside the box. When you don’t care what’s inside the box, it becomes very easy to throw everything in, but then knowing what’s there and being able to access what you need becomes a problem. And that’s where Deep Lake comes in. It’s not only saying, hey, give me every type of box that you have. It says, hey, now I actually know the shape of the box, I know how to take this box and split it into smaller pieces or combine it into larger boxes and then store it in the closet, I know how to connect this to the models, I know how to query the data. So the second generation of data lakes, and especially Deep Lake, comes with a big differentiation: it can store not only tabular data, but also images, video, audio, text, DICOM files, NIfTI files, DNA sequences, all this information that’s super critical for the healthcare industry, in an AI-native format that becomes trivial to connect both to AI models and to humans, so that they can visualize and see what the data structure is. That’s one way to get out of these data swamps: by being able not just to treat the data as blobs, but to know its form, its shape, and how to actually maneuver or manipulate that data to connect it to the models. And that’s where actually 80% of a data scientist’s time is spent: building these data pipelines to connect the data to the models. That’s what we take a big cut out of. 
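As a rough illustration of what declaring the shape of the box looks like in practice, here is a sketch that assumes the open-source Deep Lake v3 Python API (deeplake.empty, create_tensor, append); the library’s API has changed across major versions, so treat the exact calls, the local path, and the toy array as illustrative rather than a recipe.

```python
import deeplake
import numpy as np

# Create a dataset locally; in practice the path could point at S3, GCS, etc.
ds = deeplake.empty("./chest_xray_demo")

# Each column is a typed tensor rather than an opaque blob.
ds.create_tensor("images", htype="image", sample_compression="png")
ds.create_tensor("labels", htype="class_label")

# Append one stand-in "scan" (a blank 512x512 array) with its label.
ds.append({
    "images": np.zeros((512, 512), dtype=np.uint8),
    "labels": 0,
})

# In v3 the same dataset can be streamed straight into a training loop,
# e.g. via its PyTorch integration (ds.pytorch()).
```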

Harry Glorikian: So does Deep Lake use a different storage format from other databases, or from what I’ll call older forms of a data lake? I was reading that tensor data is considered native to Deep Lake in a way that it wouldn’t be on, say, an older storage system. Am I thinking about that correctly? 

David Buniatyan: Yeah, totally. So if you look into how deep learning models operate, especially, let’s say, GPT-4, the large language models or the foundation models, they don’t really care whether you’re giving them an image or a video or an audio file. What they operate on is tensors. And what are tensors? They are n-dimensional arrays. So now, instead of having a data lake that can store any blob or any binary data, you can actually shape the data into these n-dimensional boxes I was mentioning, so that it becomes very easy for an AI model to comprehend. You can take that box and ship it directly to the model, and the model doesn’t have to do any additional transformation to consume it. That gives us the biggest differentiator: the way we have approached this is, okay, let us understand how this data is going to be consumed. And there’s no free lunch in databases. If you optimize for AI workloads, it becomes sort of bad for analytical workloads; if you optimize for analytical workloads, it becomes bad for AI workloads as well. If you look into the history of databases, every era of new innovation creates new types of databases. In the dotcom era, where you got a lot of JSON files for web development, you got companies, or databases, like MongoDB, so you could store all your documents and easily retrieve them, and it became a huge company. Then during the mobile era, you got a lot of analytical data, like events you collect to understand each user’s behavior, let’s say what to recommend to this user, and that’s why you got data warehouses, for example Snowflake, that can process this huge amount of data. And the way they operate on top of the data is also totally different. 
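For readers who haven’t met the term: a tensor is just an n-dimensional array, and image data naturally stacks into one. A tiny NumPy example, with illustrative shapes:

```python
import numpy as np

# Three grayscale 512x512 scans stacked into a single 3-D tensor.
scans = np.stack([np.zeros((512, 512), dtype=np.uint8) for _ in range(3)])
print(scans.shape)            # (3, 512, 512)

# Add a channel axis and you have the 4-D (N, C, H, W) layout that deep
# learning frameworks such as PyTorch expect to consume directly.
batch = scans[:, np.newaxis]
print(batch.shape)            # (3, 1, 512, 512)
```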

Harry Glorikian: So let me ask a specific question. I’m trying to see if you can give me a couple of examples of why, say, queries are easier to run on this tensor-formatted data. Can you give a specific example, let’s say from health or biomedicine? 

David Buniatyan: Yes. So let’s say you take the traditional data lakes; usually there’s this famous, well-known format called Parquet. You put into Parquet not only the metadata you collect from, let’s say, the electronic record files for the patient, but now you’re also saying, okay, I have this bunch of DICOM files, what if I put them into the Parquet file format? And Parquet is great for analytical workloads; on top of it you have these famous data lake formats like Iceberg and Delta Lake that connect to your ecosystem for queries and so on, and you can run pretty cost-efficient queries on top. But they’re mostly good for questions like, okay, what was my sales activity for the last three months, or how many patients did I have last week with this specific, let’s say, cancer? But then once you say, okay, now I want to take all the DICOM images from the CT scans or the MRI scans collected so far, and I want to look into them, if you store this data separately, then you’re doing what people have been doing so far: just linking from the data lake to another blob storage file. Or, and there were companies who actually decided this, you say, we’re going to create a new format, we’re going to extend the Parquet files and put the DICOM files inside the Parquet files. 

David Buniatyan: But then these Parquet files became so big that even running the queries became very inefficient. And not only that: when you train, let’s say, an AI model, you actually need a special type of access pattern, which is being able to shuffle the data randomly before you feed it to the model, so your model is not biased toward the first examples it has seen and can train on randomly ordered data. What Deep Lake actually did is come up with a tensor storage format to satisfy these two constraints. First, we can store both the metadata and the data itself within the same thing we call data sets; think of them as tables. And then we have column isolation, which means each column is stored separately. And instead of having one-dimensional columns, where technically you can do that but it blows up the memory, we now have n-dimensional columns. So your CT scans or MRI scans can actually be stacked together into a single column with their corresponding dimensions and efficiently stored on top of the blob storage. And why this is important is that once your AI model accesses this, it gets the data in the shape it expects to consume to train the model itself, but at the same time you can run a query, shuffle the data randomly, and bring it back. For example, the traditional analytical formats don’t have random access, so you can’t jump to the middle of the data; you have to read from the first item all the way to the middle one to get it. But if you’re going to randomly shuffle and then access data, that becomes a big bottleneck for training models on top. And that’s where Deep Lake becomes very efficient for your AI workloads, to be able both to store and train on this data and connect it to the models. 
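The chunked, n-dimensional column layout he describes is what makes the shuffled access pattern cheap. Here is a toy sketch of the idea in plain NumPy, with made-up chunk sizes and shapes, showing how fixed-size chunks let you jump straight to any sample instead of scanning from the beginning:

```python
import numpy as np

CHUNK = 16  # samples per chunk (illustrative)

# A toy "n-dimensional column": 100 chunks, each holding 16 samples of 64x64.
column = [np.zeros((CHUNK, 64, 64), dtype=np.uint8) for _ in range(100)]

def fetch(i):
    # Random access: compute which chunk holds sample i and read only that,
    # instead of scanning from the first row all the way to the middle.
    chunk, offset = divmod(i, CHUNK)
    return column[chunk][offset]

# The shuffled access pattern a training loop needs.
rng = np.random.default_rng(0)
order = rng.permutation(100 * CHUNK)
first_batch = np.stack([fetch(i) for i in order[:32]])   # (32, 64, 64)
```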

[musical interlude] 

Harry Glorikian: Let’s pause the conversation for a minute to talk about one small but important thing you can do, to help keep the podcast going. And that’s leave a rating and a review for the show on your favorite podcast player. 

On Apple Podcasts, all you have to do is open the app on your smartphone, search for The Harry Glorikian Show, and scroll down to the Ratings & Reviews section.  

Tap the stars to rate the show, and then tap the link that says Write a Review to leave your comments.  

On Spotify, the process is similar. Search for The Harry Glorikian Show, click on the three dots, then click “Rate Show.” 

It’ll only take a minute, but you’ll be doing us a big favor, since positive ratings and reviews help other listeners discover the show. 

Thanks. And now, back to the show. 

[musical interlude] 

Harry Glorikian: So I was reading one of the white papers I saw on Deep Lake, and you seem to focus on several features that you think are particularly important. Maybe you could walk us through why they’re important and why they work better in this Deep Lake format. You talk about version control, which I get, visualization, querying, materialization, and then streaming. Those seem to be the five areas you’ve highlighted in the white paper. Why are they important, and why do they work better in your approach? 

David Buniatyan: Well, I think the part of the diagram you’re referring to is basically this whole active learning loop, where training a model one time doesn’t really make much sense for a company. If you are building your own AI initiative, you really care about faster AI iterations. And to fully iterate, you need to be able to version control the data so you can version control the experiments later. You want to be able to query the data so that you can subselect the necessary information and run queries on top. Then there’s materialization: materialize is the step where, if your query result is sparse, you might decide to copy it and compact the data set, so you can very efficiently stream it to the training process. That’s about half of the AI loop you have to take care of. Then there are other parts: you have to train, evaluate, and deploy these models, and you also have the annotation side of things. So what Deep Lake takes care of is all the data-management-related activities, so you don’t have to worry about them, because this is the most boring piece of the whole set of steps. Everyone wants to train a model, everyone is super excited to try GPT-4 and see how it goes with their specific data, but nobody really wants to take care of how to properly store and manage the data sets. 

David Buniatyan: And that’s our philosophy: you guys focus on the most important thing, give us the boring job, and we’ll take care of the data management layer. And what we have seen across our customers is that if you just work directly on top of a data lake, an engineer or data scientist comes up with custom scripts to extract the data from a traditional data lake and ship it to the model. After two months, not only can they not share that script with another person so that person can reproduce the experiment; they themselves can no longer understand what the script was about, because the AI changed, the model changed, the experiments changed. And I think one of the critical infrastructure pieces Deep Lake provides is this version control, which gives you full data lineage, from how the data was originally collected to how it was transformed before being fed to the model. Unfortunately, I know there have been a lot of attempts, but there’s no database, data lake, or lakehouse infrastructure that fully captures that part, to really track the experiments. And this becomes very critical down the road. You know better than me: when you go through clinical trials, especially for the models, this is going to be one of the most critical inspection pieces. 

David Buniatyan: Not only do you need the model and the accuracy metrics, but also what data the model was trained on, and what biases and analytics you can run on top of that. And one of the key ways we took the architecture differently than any other database is that it’s serverless and embedded. For some applications, or for technical people, that sounds cool, but what’s actually really useful is that the data is stored on our customer’s cloud or on premise. Obviously we don’t have access to this data. But more than that, the visualization, the streaming, and all the functionality run embedded on the customer side, which means that, let’s say, if you open your browser, go to Activeloop, and connect to your data set, the data set streams directly from the storage you own to your client browser. It doesn’t fly through our servers. And the same thing happens when you train a model: the data streams directly from the storage to the GPU machines. This enabled us to get through so many infosec reviews, because the data is stored within our customer’s premises, and whenever their users or scientists access this data, it’s their client that gets authorized through us but then fetches the data from their storage to their premises. And this is also one of the critical pieces that’s overlooked, especially when you manage the data. 
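The dataset version control David describes works much like git for code. A short sketch, again assuming the Deep Lake v3 API (checkout, commit, log) and continuing the hypothetical local dataset from the earlier example; the branch name and commit message are made up.

```python
import deeplake

ds = deeplake.load("./chest_xray_demo")

# Branch before changing anything, like a feature branch in git.
ds.checkout("add-q1-2024-batch", create=True)
# ... append new scans or fix labels here ...
commit_id = ds.commit("Added Q1 2024 chest X-rays, relabeled 40 edge cases")

# Later, anyone can reproduce exactly the data a model was trained on
# and inspect the lineage that led to it.
ds.checkout(commit_id)
ds.log()   # prints the commit history for this dataset
```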

Harry Glorikian: So let’s jump back to the company just for a minute. Right. Um, you’re one of the founders. Who else was involved in this? I guess the founders and designers of Deep Lake. And what backgrounds did they come from? 

David Buniatyan: We actually have a great team that I’m super proud of and excited to be working with. We have our head of product, Ivo [Stranic], who did a PhD at Stanford in mechanical engineering, then worked a few years at Tesla, then started his own computer vision company, and that’s where he realized all these data problems are a big bottleneck to succeeding. So he joined us to head product. We have Tatevik [Hakobyan], our chief operating officer, who oversees all the operations within the company. Previously she was chief data officer at a gaming company, managing over 500 employees, and she also did math at the University of Zurich; data and math are super correlated with each other. Then we have Mikayel [Harutyunyan] on the marketing front, although he has the highest number of papers published in Nature, especially on the psychometric side and how fake news spreads over networks. Before joining the company he was at UNC Chapel Hill, and before that he also did an internship at Google, helping launch YouTube Music there. And we have a super bright engineering team from Armenia building this database, with experience at chip design companies like Xilinx and Synopsys, where they learned all the low-level details; basically, if you build a database, you have to understand how it works at the hardware, or metal, level and then stack up from there. We also have team members who previously worked at Oracle and companies like that, and on widely used open-source database infrastructure pieces. All of this helps us understand how the hardware works, how traditional databases worked, and how AI needs are actually evolving moving forward. 

Harry Glorikian: So you guys, if I’m not mistaken, you guys also participated in Y Combinator back in 2018, right? 

David Buniatyan: Correct. Yes. We actually got formed during YC; that’s when we started the company. We thought, hey, let’s try to apply to YC and see what happens. And we thought of it as an internship; I talked to my advisor, who said, yeah, just try it out, if it doesn’t work, you’ll be back. So we went to YC, and I’ve stayed with the company for the last six years. What happened is that three years later I got an email from my advisor that just said, “any decision?” I was clueless about what was going on, and I opened it up, and the forwarded thread was our graduate school pinging the computer science department asking whether I’m coming back or not. They were playing this ping-pong: we’re not sure, who knows, let’s ask his advisor. So they pinged Sebastian, who was on sabbatical at the time, and he said, yeah, just ask him directly. And this is one of the most regretful decisions I’ve ever made; I’ve thought a lot about it. I really wanted to finish my PhD, I really wanted to build a kind of academic path or career. But I also talked to other professors at Princeton, and one key piece of advice was, yes, you can do both at the same time, but you won’t be good at it, so just pick one, double down on it, and make it happen; and if it doesn’t work out, you can always come back. And I think that “any decision” email actually helped me to retrospectively review what I was doing. I figured out that I can have a much bigger impact on computer science and all the adjacent industries working at this company than going back and finishing my PhD. 

Harry Glorikian: Yeah. Don’t worry. Every once in a while I threaten my family that I’m going to go back and get a PhD, just in case everything else doesn’t work out or I get bored or something. But let me ask you a question: back in 2018, was the need for a deep-learning-optimized database clear? 

David Buniatyan: Not at all. We were talking to so many people: hey, we are building a database for AI, do you need one? And it was, why do I need another database? Why do I care? And things literally changed a year ago, just because of OpenAI releasing GPT-4, all the hype and the wave. People now realize, oh, wait a second, we have new data to process, including vectors and embeddings. Where are we going to store this? The traditional databases did not support it, even though later everyone added support for vectors. The data lakes don’t support this at all. What are we going to do? And you can see, if you look at our download charts, that in one day our downloads increased 7 or 8 times, just because genAI developers, which is another new term, you had web developers, now you have genAI developers, started using Deep Lake as a vector store to process the data. Now the market dynamics have totally changed as well. Looking back in hindsight, our bet about five years ago was that AI is going to be big. We knew that; I mean, everyone has known that for the last 70 years, but nobody knew when it would happen. The missing piece was that there was actually no data storage for AI, and the way data scientists worked was exactly the same as opening your laptop and moving folders around. That hasn’t been the case in, say, web development, where you have all the infrastructure you need to manage this. And that’s where we said, okay, why don’t we actually go and build a database for AI, before it was cool. It takes ten years to build a database, and we are still in the middle of it. 

Harry Glorikian: The funny part is, I always say something like, you know, luck is a very important thing when you’re in a startup, that moment of timing and things you have no control over. Like, you couldn’t have changed OpenAI doing whatever they’re doing and causing this big shift. But you know what? You were right at the right place, and it just happened to come together at the right time. 

David Buniatyan: Maybe it’s too romantic or naive to think of it this way, but in expectation, on average, no startup, including us, should survive; by default you’re going to die. That’s the default of starting a company. And then there’s this one percentile of luck or chance that you’re going to succeed. And your role as a founder, or as a company, is that you bet on that one percent. But you don’t only bet; you also do everything to bend that one percent. And I think that’s what differentiates it: it’s not just an uncontrollable process, you are inside that one-percentile bet and you have to make sure the path gets chosen so that you get there. 

Harry Glorikian: Yeah. You only hear about the ones that succeed; you never hear about the other 99% that fail. So that’s why it’s so romantic to start a company. But okay, so I was reading the blog on the website about the team at Google Brain that developed a convolutional neural network called EfficientNet, which they used to analyze retinal images for evidence of diabetic retinopathy. Can you walk us through that story and how Deep Lake helps with this kind of application? 

David Buniatyan: Yeah, so the key issue there, once you collect all the images you have to store and then connect to the models, is that you have two parts, right? You have to construct the architecture of the model, choose, say, a ResNet or a DenseNet, or you can also choose more traditional ones. By the way, the whole deep learning revolution, at least the second wave, started because of AlexNet, and then there were multiple iterations, and that’s where EfficientNet comes into the picture. Once AlexNet was created, Google actually brought Alex Krizhevsky, Ilya Sutskever, who later was the chief scientist of OpenAI, and Geoffrey Hinton, one of the founding fathers of deep learning, into Google Brain. That’s where a lot of things started to happen, in 2012, and I was doing my PhD in 2016, so about four years after that single event. And when you start collecting a lot of data for training those deep learning models, the key thing is to be able to structure it in a form you can connect to PyTorch or TensorFlow to do the training, but also so you can start experimenting with what kind of model architecture to choose. And what people figured out with EfficientNet is that instead of taking a convolutional neural network and naively scaling it up, you can do it in a sparse, constrained way, where the network preserves only the most important connections, so that you can classify, in a highly accurate manner, what you see in the retinal images. And what later happened is this new architecture called Transformers, which is essentially an attention-based model that, given the data, can decide where to focus and then retrieve that information. So EfficientNet is kind of a predecessor of the Transformers in that regard. And it’s very similar for biomedical imaging use cases: you have other types of models such as U-Net or DenseNet, sorry, U-Net, and there are also variations like 3D U-Nets, so they can consume a lot of biomedical data. 

David Buniatyan: And the way they do that is they basically take different layers of information, downsample to preserve the most compact, high-level information, keep it as a hierarchy, and then expand back, with skip connections from the early, high-resolution layers to the upsampled layers, preserving both the high-level details and the low-level details. Let me take a step back. There’s this theory that you can take any two-layer neural network and approximate any function. But that’s not helpful, because you actually want to embed a model of the world into the model itself. And why the data is important is that you have the data set, you don’t know the best architecture, but you need to be able to quickly iterate across all the different variations of the experiments you run. That’s the role of the AI researcher: to come up with the next best model. That’s what happened with AlexNet, with Transformers, and so on. And for you to be able to do those fast cycles, you need good data infrastructure to store the data. Yeah, within that use case, I remember Deep Lake’s importance was being able to structure the data correctly in the form an AI model will consume, and then take it from there. 
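The skip-connection idea he sketches, concatenating early high-resolution features back in after upsampling, is the core of U-Net-style segmentation models. Below is a deliberately tiny PyTorch illustration, not any real biomedical model; layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySkipNet(nn.Module):
    """A minimal encoder/decoder with one skip connection, illustrating how
    high-resolution detail from an early layer is concatenated back in
    after upsampling (not a real U-Net)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # high-res features
        self.down = nn.MaxPool2d(2)                             # downsample
        self.mid = nn.Conv2d(8, 16, kernel_size=3, padding=1)   # low-res, high-level
        self.dec = nn.Conv2d(16 + 8, 1, kernel_size=3, padding=1)

    def forward(self, x):
        e = F.relu(self.enc(x))                     # keep for the skip connection
        m = F.relu(self.mid(self.down(e)))
        up = F.interpolate(m, scale_factor=2)       # back to input resolution
        return self.dec(torch.cat([up, e], dim=1))  # skip connection

out = TinySkipNet()(torch.zeros(1, 1, 64, 64))      # out.shape == (1, 1, 64, 64)
```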

Harry Glorikian: So how does the business model work? I mean, you touched on it, you were talking about it being on premises and so forth, but I’m trying to figure out: is it a SaaS model? Do you sell it on a subscription basis? How are you monetizing what you’re doing? 

David Buniatyan: It’s based on usage. It’s usage-based pricing: basically, how much data you store with us and we manage for you. That’s how we monetize; it’s as simple as that. There are much more complicated models for database companies, based on how many queries or how much compute you consume. We take a very simple approach that our customers appreciate, and it depends on the volume of the data. 

Harry Glorikian: Okay. And then, I don’t know, who do you think of as a competitor? I mean, Netflix says its main competitor is sleep. So who would your competitors be? Old-fashioned databases? Or some other types of organizations, like maybe Google, Amazon, Microsoft? I don’t know. 

David Buniatyan: No, they are more like our partners at this moment, and other databases as well are like our partners; they’re not really competitors, especially not in this field. When we talk to customers and ask them about the alternatives, they’re mostly fighting with files. The way most of these organizations currently run their data management, you would be surprised: a huge pharma company, think of it, keeping data in a Dropbox as a bunch of DICOM files. And I mean, Dropbox is a great product, but it’s for consumers, for storing your personal family photos and accessing them later. It’s not designed for this, or, I don’t know, putting it in a Google Drive. That’s what surprises us time and time again, seeing how these people have managed it. And you can’t blame them, either: it just worked, okay, great, we move forward, let’s go solve our next bottleneck. That’s what we have seen as the biggest thing: we have to go and change the way they think about their data and how they consume it. 

Harry Glorikian: So what’s the intellectual or technological moat, the special sauce or IP protection, that you guys have that gives you this, you know, special place? 

David Buniatyan: Obviously we do have patents. We published a paper at a top database conference. We have our own abstractions at the tensor storage level. We have one of the best data loaders, which can stream the data from storage to the GPUs as if the data were local to the machine; think of it like Netflix for datasets. We have a proprietary visualization engine that streams the data directly from storage to the browser, which is unusual; you typically have another backend that does the rendering. So we have all these technological pieces. But myself, I don’t believe in technology moats; sooner or later people are going to catch up on that front. Starting from, like, 100 years ago, when people were building airplanes: you come up with the next best engine and people just switch to you. And you’re seeing the same thing with AI models as well. Every day there’s a new model launched with 1% more accuracy and the graphs change; okay, everyone switches from this one to that one. I think what we really focus on is creating value for our customers and taking away the boring job they don’t want to do. That’s where it becomes super attractive for us to help them with managing their data sets. And the way the architecture works helps as well. 

David Buniatyan: The architecture passes all the necessary security requirements for large medtech or life science companies, with PII information stored inside Deep Lake in production. And this gives us a big differentiator compared with other available tools. Aside from that, yes, we can also store the DICOM files with CT scans or MRI scans and connect them to your radiology use cases, or we can, in a matter of hours, index all the PubMed data, store it inside a separate Deep Lake data set, run very fast queries on top, and bring that back to an application you’re building, like a knowledge base across your site. Those are, in a very simple way, the engineer or data scientist things. I think the key thing is that we have proven, with many top Fortune 500 companies, both pharma and medtech in the healthcare industry, that this actually provides a lot of value for managing and organizing the data, so that you can be way more efficient bringing AI into coming up with the next drug, or the next piece of equipment that will be deployed in the hospital, which is really where most of the value is. The data and the AI are just tools, for now, to make it more efficient to get there. 

Harry Glorikian: David, it’s been great having you on the show. Hopefully I didn’t pelt you with too many different questions trying to get to the bottom of things, but I wish you guys success. The most important thing for me, as always, is seeing these products get to market and then move faster, because I’m not getting any younger, and the more innovation that comes into the world, the healthier and happier I’ll be. 

David Buniatyan: Thank you very much for having me. And all the questions were pretty good; it actually took me a while to think of how to answer some of them. So great to be here, and thanks for having me. 

Harry Glorikian: Thank you. 

Harry Glorikian: That’s it for this week’s episode.  

You can find a full transcript of this episode as well as the full archive of episodes of The Harry Glorikian Show and MoneyBall Medicine at our website. Just go to glorikian.com and click on the tab Podcasts. 

I’d like to thank our listeners for boosting The Harry Glorikian Show into the top two and a half percent of podcasts globally. 

To make sure you’ll never miss an episode, just find the show in your favorite podcast player and hit follow or subscribe.  

Don’t forget to leave us a rating and review. 

And we always love to hear from listeners on X, where you can find me at hglorikian. 

Thanks for listening, stay healthy, and be sure to tune in two weeks from now for our next interview.