State of Cybercrime

Computational Biologist and Founder of Protocols.io, Lenny Teytelman (Part two)

Episode Summary

We continue our conversation with Protocols.io founder Lenny Teytelman. In part two, we learn more about his company and the use cases that made it possible. We also learn about the pros and cons of mindless data collection, what to do when data isn’t leading you in the right direction, and his experience as a scientist amassing enormous amounts of data.

Episode Notes

Reminder: it's not "your data".

It's the patients' data
It's the taxpayers' data
It's the funder's data
-----------------
If you're in industry or self-fund the research & don't publish, then you have the right not to share your data. Otherwise, it's not your data.
— Lenny Teytelman (@lteytelman) July 16, 2018

Transcript

Lenny Teytelman: I am Lenny Teytelman, and I am a geneticist and computational biologist by training. I did graduate school at Berkeley and then postdoctoral research out at MIT. And since 2012, I have been the co-founder and CEO of Protocols.io, which is a GitHub- or Wikipedia-like central repository of research recipes: science methods detailing exactly what scientists have done.

Cindy Ng: Welcome Lenny. Why don't you tell us a little bit more about what you do at Protocols and some of the goals and use cases?

Lenny Teytelman: So I had no entrepreneurial ambitions whatsoever. Actually, I was on a straight academic path as a yeast geneticist, driven just by curiosity in the projects that I was participating in. And my experience out at MIT as a postdoc was that literally the first year and a half of my project went into fixing just one step of the research recipe, of the protocol that I was using. Instead of a microliter of a chemical, it needed five. Instead of an incubation for 15 minutes, it needed an hour. And the insane part is that, at the end of the day, that's not a new technique. I can't publish an article on it because it's just a correction of something that's previously published, and there is no good infrastructure. There's no GitHub of science methods. There's no good infrastructure for updating and sharing such corrections and optimizations.

So the end result of that year and a half was that I get no credit for this because I can't publish it, and everybody else who is using the same recipe is either getting completely misleading results or has to spend a year or two rediscovering what I know, what I would love to share, but can't.

It led to this obsession with creating a central open access place that makes it easy for the scientist to detail precisely what the research steps were, what are the recipes, and then after they've published, giving them the space to keep this current by sharing the corrections and optimizations and making that knowledge discoverable.

Cindy Ng: There's a hole in the process and you're connecting what you can potentially do now with what you did previously and not lose all the work. That's brilliant.

Lenny Teytelman: I shouldn't take too much credit for it because a lot of people have had this same idea over the last 20 years and there have been several attempts to create a central place. One of the hard things is that this isn't just about technology and building a website and creating a good UI, UX for people to share.

One of the hard things is that it's a culture change, right? As scientists, we are used to publishing brief methods that have things like "contact author for details," or "we roughly followed the same procedure as reported in another paper," and then good luck figuring out what that "roughly" means and what the slight modifications are. So one of the hard things is the culture change and getting scientists to adopt platforms like this.

Cindy Ng: So it sounds like the scientists prior who wanted to create something like Protocols, they were ahead of their time.

Lenny Teytelman: I think yes. I know of a number of efforts to create exactly what we've done. Some of the people from those have actually been huge supporters and advisors, partners helping us avoid the mistakes and helping us succeed. So, it's a long quest, a long journey towards this, but a lot of them I give them credit for the same idea and it's exactly what you said, being ahead of your time.

Cindy Ng: You're a scientist and have a lot of expertise collecting enormous amounts of data. A lot of companies nowadays, because data's the new oil, think, "Oh, we should just collect everything. We might be able to solve a new business problem, or we might be able to use it much later on." But research has actually been done on that, showing that it's not a good idea because you end up solving really silly problems. What is your approach?

Lenny Teytelman: There are sort of two different camps. One argues that you should be very targeted with the data that you collect. You should have a hypothesis, you should have a research question that's guiding you towards an experiment and towards the data that you're collecting. And another one is, let's be more descriptive. Let's just get data and then look inside and see what pops out. See what is surprising.

There are two camps and I know both types of scientists. I was more in one camp than the other, but there is value to both. The tricky part in science is when you are not aware of the statistics and p-hacking and just what it means to go fishing in large datasets, particularly in genomics, particularly now with a lot of the new technology that we have for generating massive datasets across different conditions, across different organisms, right? You can sort of drown in data, and then, if you're not careful, you start looking for signal.

If you're not thinking of the statistics, if you're not thinking of multiple-testing correction, you can get these false positives in science where something looks unusual, but it really is just by chance. It's because you're running a lot of tests and slicing data in 100 different ways, and one out of 100 times, just by chance, you're getting something that looks like an outlier, that looks very puzzling or interesting, but it's actually chance.
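The point about multiple testing can be illustrated with a small simulation. The sketch below is not from the conversation; it is a minimal example under stated assumptions: 100 comparisons where both groups come from the same distribution (so every "hit" is a false positive), a simple z-test with known unit variance, and a Bonferroni correction as one common form of multiple-testing correction. The function names, sample sizes, and the 0.05 threshold are all illustrative choices.

```python
import math
import random


def two_sided_p_value(sample_a, sample_b):
    """Two-sided z-test p-value for equal means, assuming unit variance in both groups."""
    mean_diff = sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)
    standard_error = math.sqrt(1.0 / len(sample_a) + 1.0 / len(sample_b))
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_null_tests(num_tests=100, sample_size=30, seed=0):
    """Run num_tests comparisons in which there is no real difference to find."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(num_tests):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        p_values.append(two_sided_p_value(group_a, group_b))
    return p_values


def count_significant(p_values, alpha=0.05, bonferroni=False):
    """Count p-values below the threshold, optionally Bonferroni-corrected."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return sum(p < threshold for p in p_values)


if __name__ == "__main__":
    p_values = simulate_null_tests()
    print("uncorrected hits:", count_significant(p_values))
    print("Bonferroni hits: ", count_significant(p_values, bonferroni=True))
```

With an uncorrected threshold of 0.05, roughly 5 of 100 null comparisons are expected to look "significant" on average, purely by chance. Dividing the threshold by the number of tests makes such chance hits much rarer, at the cost of also making real effects harder to detect.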

So, I don't know about industry in particular. It seems to me that if you're a business and you are just trying to grab everything, feeling that something useful will come out of it, if you're not in the business of doing science but in the business of actual business, then, intuitively, you will become very distracted, and it's probably not the best use of your time or resources. But in science, both approaches are valuable. You just have to be really careful if you are analyzing data without a particular question and you're trying to see what is there that's interesting.

Cindy Ng: If you're collecting everything, do you have a team or a group of people that you're working with to suss out the wrong ideas?

Lenny Teytelman: I see more and more journals, I see more and more academics becoming aware that, "Oh, I need to learn something about statistics, or I need to collaborate with biostatisticians who can help me to be careful about this." There are journals that have started statistics reviews. So it might be a biology paper, but depending on the data and the statistics that are in it, it might need to go to an expert statistician to review to make sure that you've used the appropriate methods and you've thought through the pitfalls that I'm discussing, but there's a lot more to do on this side.

And again, there is the spread…there are teams that are collaborating. And you know they have data scientists or computational biologists and statisticians who are more used to thinking about data. Then you also have people like me who used to do both. And I wasn't a great computational biologist and I wasn't a great geneticist, but my strength was the ability to do both. So, again, it's all over the map and there's a lot of training, a lot of education that still needs to happen to improve how we handle the large data sets.

Cindy Ng: Do you think that with data, it's about getting the numbers right and working with statisticians, or is it the more qualitative side of things, where even if the data is showing one thing, your, let's say, experience says otherwise?

Lenny Teytelman: Oh, I've been misled nonstop by data that I've generated or had access to. As a scientist, I've given talks on things that I thought were exciting and that turned out to be an artifact of how I was doing the analysis, and I've experienced that many times. I think at the end of the day, whether you try to be careful or not, as scientists we always will make mistakes. And that's why I particularly feel that it's so essential for us to share the data. We think we're doing things correctly, but reviewers and other scientists who are reading your papers really can't tell unless they have access to the data that you've used and can run the analysis themselves or use different tools to analyze it. That's where problems come up, that's where mistakes are identified.

So I think science can really improve more through the sharing and less through demanding perfection from the people who are generating the data and publishing the stories. I think both are important, but I think there's more opportunity for ensuring reproducibility and that mistakes get fixed by sharing the data.

Cindy Ng: Yeah. And when you're solving really complicated and hard problems, it helps to have many people work on it too. Even though it might seem like there are too many chefs in the kitchen, it can only help, I imagine.

Lenny Teytelman: Absolutely. That's what peer review is for. It's getting eyeballs with people who have not been listening to you give this presentation evolving over time for the last five years. It's people who don't necessarily trust you the same way or have different strengths. So it does help to have people from the outside take a look.

But even reviewers, they are not going to be rerunning all of your analyses. They're not going to be spending years digging into your data. They're going to read the paper and mostly try to tell: Is it clear? Do I trust what they're saying? Have they done the controls? At the end of the day, figuring out which papers are correct and which hypotheses and conclusions stand the test of time, it really does require time. And that's where sharing the data shortens the time to see what is and isn't true.