State of Cybercrime

Geneticist and Founder of Protocols.io, Lenny Teytelman (Part one)

Episode Summary

A few months ago, I came across Protocols.io founder Lenny Teytelman’s tweet on data ownership. Since we’re in the business of protecting data, I was curious what inspired Lenny to tweet out his value statement, and I wanted to learn how academics and science-based businesses approach data analysis and data ownership. We’re in for a real treat because it’s rare that we get to hear what scientists think about data when in search of discoveries and innovations.

Episode Notes

Reminder: it's not "your data".

It's the patients' data
It's the taxpayers' data
It's the funder's data
-----------------
If you're in industry or self-fund the research & don't publish, then you have the right not to share your data. Otherwise, it's not your data.
— Lenny Teytelman (@lteytelman) July 16, 2018

Transcript

Lenny Teytelman: I am Lenny Teytelman and I'm a geneticist and computational biologist by training. I did graduate school in Berkeley and then post-doctoral research out at MIT. And since 2012, I have been the Co-founder and CEO of Protocols.io, which is a GitHub- and Wikipedia-like central repository of research recipes, so for science methods detailing what exactly scientists have found.

Cindy Ng: Welcome, Lenny. We first connected on Twitter through a tweet of yours, and I'm going to read it. It says, "Reminder: it's not 'your data.' It's the patients' data. It's the taxpayers' data. It's the funder's data. If you're in industry or self-fund the research and don't publish, then you have the right not to share your data. Otherwise, it's not your data." So can you tell us a little bit more about your point of view, your ideas about data ownership, and what inspired you to tweet out your value statement?

Lenny Teytelman: Thank you, Cindy. So this is something that comes up periodically, more so, particularly in the past 5, 10 years, in the research community as different funders and publishers are paying more and more attention to reproducibility challenges in published research, and including guidelines and policies that encourage or require the sharing of data as a prerequisite for publication or as a condition of getting funding. So we're seeing more and more of that, and I think the vast majority of the research community, of the scientists, are in favor of those. They understand that it's important, they understand that it's one of the pillars of science to be able to reproduce and verify and validate other people's results and not just to take them at their word. We all make mistakes, right?

But there is a minority that is upset about these kinds of requirements and, periodically, either in person or on Twitter, someone will say, "Hey, I've spent so long sailing the oceans and collecting the data. I don't want to just give it away. I want to spend the next 5, 10 years publishing, and it's my data." And so that's the part that I'm reacting to. There are some scientists that forget who's funding them and who actually has the rights to the data.

Cindy Ng: Why do they feel like it's their data rather than the patients' data or the taxpayers' data or the funder's data?

Lenny Teytelman: So it's understandable, particularly when the data generation takes a long time. So imagine you go on ocean expeditions, two, three months away from family, sampling bacteria in oceans or digging in the desert, and it can take a really long time to get the samples, to get the data, and you start to feel ownership. And it's also your career: the more publications you get on a given dataset, the stronger your resume, the higher the chances of getting fellowships, faculty positions, and so on. People become a little bit possessive and take ownership of the data. They feel like, "I put so much into it. It's mine."

Cindy Ng: Prior to digitalizing our data, who owned the data?

Lenny Teytelman: Well, I guess, universities can also lay some claim to the intellectual property rights. I'm not an attorney so it's tricky. But I think there was always the understanding in the science world that you should be able to provide the tables, the datasets that you're publishing on request. But when we had paper journals, there really just wasn't space to make all of that available. And we're now in a different environment where we have repositories, there's GitHub for code, there are many repositories for the data to be shared. And so, with the web, we're no longer in that "contact author for details" mode, and we're now in a place where journals can say, "If you want to publish in our journal, you have to make the data available." And there are some that have put in very stringent data requirement policies.

Cindy Ng: Who sets those parameters in terms of the kind of data you publish and the stringency behind it? Do a bunch of academics, chairmen, and scientists come together and decide best practices, or do they vary from publication to publication?

Lenny Teytelman: Both. So it depends on the community. There are some communities, for example, the genomics community, back when the human genome was being sequenced, there were a lot of...and I mean before that, there were a lot of meetings of the leaders in the field sort of agreeing on what are the best practices, and depositing the DNA sequences in the central repository GenBank run by the U.S. government became sort of expected in the community and from the journals. And so, that really was community-led best practices, but more recently, I also see just funders putting out mandates, and when you agree to getting funding, you agree to the data-sharing policies of the foundation. And same thing for journals. Now, journals, more and more of them are putting in statements requiring data, but it doesn't mean that they're necessarily enforcing it, so requirements are one thing, enforcement is another.

Cindy Ng: What is the difference between academic scientific research and science-based companies? Because a lot of, for instance, pharmaceutical companies hire a lot of PhDs, and they must have a close connection between one another.

Lenny Teytelman: So there is certainly overlap. You're right that, I think, in biomedicine particularly, most of the people who get PhDs actually don't stay in academia and end up outside of it. Not all of them are in industry. They go into a broad spectrum of different careers, but a lot do end up in industry. There is some overlap where you will have industry funding some of the research. So, Novartis could give a grant to UC Berkeley, or British Petroleum could be doing ecological research, and those tend to be very interesting because there may be a push from the industry side to keep the data private, like you can imagine tobacco companies sponsoring something.

So there's some conflict of interest, and usually universities try to frame these in a way that gives the researchers the right to publish regardless of what the results are, and to make it available so that the funder does not have a yea or nay vote. So that's the collaborative side, when there's some funding coming in from industry. But, in general, there is basic science, there is academic science, and there is an expectation there that you're publishing and making the results open, and then there is the industry side. And, of course, I'm broadly generalizing. There are things you will keep private in academia, there's competitiveness in academia as well, you're afraid of getting scooped. But broadly speaking, academia tends to publish and be very open, and your reputation and your career prospects are really tied to your publications.

And on the industry side, it's not so much about the publications as about the actual company bottom line and the vaccines, drug targets, right, molecules that you're discovering, and those you're not necessarily sharing, so there's a lot of research that happens in industry. And my understanding is that the vast majority of it is actually not published.

Cindy Ng: I think even though they have different goals, the thread between all of them really, is the data because regardless of what industry you're in, I hate this phrase, "data is the new oil," but it's considered one of the most valuable assets around. I'm wondering is there a philosophy around how much you share amongst scientists regardless of the industry?

Lenny Teytelman: In academia, it tends to be all over the place. So I think in industry, they're very careful about the security, they're very, very concerned about breach and somebody getting access to the trials, to the molecules they're considering. The competition is very intense and they take the intellectual property and security very seriously. On the academic side, it really varies, and there are groups that, even long before they're ready to publish, are into open science. They generate data and they feel like, "We've done the sequencing of these species or of these tissues from patients, and we're going to anonymize the patient names and release the information and the sequences of the data that we have as soon as we've generated it, even before the story is finished, so other people can use it."

There are some academic projects that are funded as resources where you are expected to share the data as they come online. There might be requests that you don't publish from the data before the producers do, if they're the ones producing it, so there can be community standards. But there are many examples in academia where the data are shared simply as they're produced, even before publication. And then you also have groups that are extremely secretive. Until they're ready to publish, no one else has access to the data, and sometimes even after they publish, they try to prevent other people from getting access to the data.

Cindy Ng: So it's back to the possessiveness aspect of it.

Lenny Teytelman: My feeling just anecdotally from the 13 years that I was at the bench, as a student, post-doc, is that the vast majority of scientists are open and are collaborative in academia and that it's a tiny minority that try to hoard the data, but I'm sure that that does vary by field.

Cindy Ng: In the healthcare industry, it's been shown that people try to anonymize data and release it for researchers to do research on, but then there are also a few security and privacy pros who have said that you can re-identify the anonymized data. Has there been a problem?

Lenny Teytelman: Yes, this is something that comes up a lot in discussions. When you're working with patient data, everyone does go through a concerted effort to anonymize the information, but usually, when people opt in to participating in these studies and these types of projects, the disclaimers do warn the patients, do warn the people participating that, yes, we'll go through anonymizing steps, but it is possible to re-identify, as you said, the anonymized data and figure out who it really is no matter how hard you try. So there are a lot of conversations in academia about this and it is important to be very clear with patients about it. There are concerns, but I don't know actual examples of people re-identifying for any kind of malicious purpose. There might be space and opportunity for doing that, and I'm not saying the concerns are not valid, but I don't know of examples where this has happened with genomic data, DNA sequencing of individuals.

Cindy Ng: What about Henrietta Lacks where she was being treated for...I can't remember what problem she had, and then it was a hospital...

Lenny Teytelman: Yes, that's a major...there's a book on this, right, there's a movie. That's a major fiasco and a learning opportunity for the research community where there was no consent.

Cindy Ng: Did you ever see this movie called "Three Identical Strangers" about triplets who found each other?

Lenny Teytelman: No, I haven't.

Cindy Ng: And then they found that all three of those triplets were adopted, and then they thought, "Hmm, that's really strange." So then they had a wonderful reunion and then, later down the line, they realized that they were being used as a study. There were researchers that went in every single week to their homes, to the adoptees' homes, to do research on the kids, and knew that they were all brothers, but neglected to tell the families until they found each other by chance. And then they realized they were part of a study, and the researchers refused to release the data. And so, I found the Henrietta Lacks story and this new movie that came out just really fascinating. I mean, I guess that's why they have regulations, so that you don't have scenarios like these happen, where you find out as an adult that you're a part of a strange experiment.

Lenny Teytelman: That's fascinating. So I don't know this movie, but on a related note, I'm thinking back…I don't remember the names, but I'm thinking back on the recent serial killer that was identified, not through his own DNA being in the database, but the relatives participating in ancestry sequencing, right, submitting personal genomics, submitting their cells for genotyping, and the police having access, tracing the serial killer through that. There certainly are implications of the data that we are sharing. I don't know what the biggest concerns are, but there are a lot of fascinating issues that the scientific community, patients, and regulators have to grapple with.

Cindy Ng: So, since you're a geneticist, what do you think about the latest DNA testing companies working with pharmaceuticals in potentially finding cures with a lot of privacy alarms coming up for advocates?

Lenny Teytelman: Yeah, so it has to be done ethically. You do have to think about these issues. My personal feeling is that there's a lot for the world and humans to gain from sharing the DNA information and personal information. The positives outweigh the risks. That's a very vague statement, so I do, you know, I think about the opportunity to do studies where a drug is not just tested for whether it works or not, but, depending on the DNA of the people, you can figure out what are the populations, what are the types of people that will have adverse reactions to it, who are the ones who are unlikely to benefit from it. So there is such a powerful opportunity for good use of this. Obviously, we can't dismiss the privacy risks and the potential for abuse and misuse, but it would be a real shame if we just backed away from the research and from the opportunity that this offers altogether, instead of carefully thinking through the implications and trying to do this in an ethical way.