Sunday, August 17, 2014

Another approach to having data available, standardized and accessible: who cares?

I once went to a talk by someone who spent most of their seminar talking about a platform they had created for integrating and processing data of all different kinds (primarily microarray). After the talk, a Very Wise Colleague of mine and I were chatting with the speaker, and I said something to the effect of “Yeah, it’s so crazy how much effort it takes to deal with other people’s datasets”, and both the speaker and I nodded vigorously while Very Wise Colleague smiled a little. Then he said, “Well, you know, another approach to this problem is to just not care.” Now, Very Wise Colleague has forgotten more about this field than I’ve ever learned (times 10), so I have spent the last several years pondering this statement. And as time has gone on and I’ve become at least somewhat less unwise, I think I largely agree with Very Wise Colleague.

I realize this is a less-than-fashionable point of view these days, especially amongst the “open everything” crowd (which overlaps heavily with the genomics crowd). I think this largely stems from some very particular aspects of genomics data that are dangerous to generalize to the broader scientific community. So let’s start with a very important exception to my argument and then work from there: the human genome. Our lab uses the human genome on pretty much a daily basis. Mouse genome as well. As such, it is super handy that the data is available and easy to access and manipulate, because we need it as a resource of specific, important information that does not change or (substantially) improve with time or technology.

I think this is only true of a very small subset of research, though, and it leads to the following bigger question: when The Man is paying for research, what are they paying for? In the case of the genome, I think the idea is that they are paying for a valuable set of data that is reasonably finalized and important to the broader scientific endeavor. Same could be said for genomes of other species, or for measuring the melting points of various metals, crystal structures, motions of celestial bodies, etc.–basically anything in which the data yields a reasonably final value of interest. For most other research, though, The Man is paying us to generate scientific insight, not data. Think about virtually every important result in biomedical science from the past however long. Like how mutations to certain genes cause cells to proliferate uncontrollably (i.e., genes cause cancer). Do we really need the original data for any reason? At this point, no. Would anyone at the time have needed the original data for any reason? Maybe a few people who wanted to trade notes on a thing or two, but that’s about it. The main point of the work is the scientific insight one gains from it, which will hopefully stand the test of time. Standing the test of time, by the way, means independent verification of your conclusions (not data) in other labs and in other systems. Whether or not you make your data standardized and easily accessible makes no real difference in this context.

I think it’s also really important, before building any infrastructure, to first think pretty carefully about the “reasonably final” part of “reasonably final value of interest.” The genome, minor caveats aside, passes this bar. I mean, once you have a person’s genome, you have their sequence, end of story. No better technology will give them a radically better version of the sequence. Such situations in biology are relatively rare, though. Most of the time, technology will march along so fast that by the time you build the infrastructure, the whole field has moved on to something new. I saw so many of those microarray transcriptome profile compendiums and databases that came out just before RNA-seq started to catch on–were those efforts really worthwhile? Given that experience, is it worth doing the same thing now with RNA-seq? Even right now, although I can look up the HeLa transcriptome in online repositories, do I really trust that it’s going to give me the same results that I would get on my HeLa cells growing in my incubator in my lab? I’d probably just sequence it myself as a control anyway. And by the time someone figures this whole mess out, will some new tech have come along, making the whole effort seem hopelessly quaint?
Incidentally, I think the same sort of thinking is a pretty strong argument that if a line of research is not going to give a reasonably final value of interest for something, then you better try and get some scientific insight out of it, because purely as data, the whole thing will likely be obsolete in a few years.

Now, of course, making data available and easily shared with others via standards is certainly a laudable goal, and in the absence of any other factors, sure, why not, even for scientific insight-oriented studies. But there are other factors. Chief amongst them is that most researchers I know maintain all sorts of different types of data, often custom to the specific study, and sharing them means having to, in effect, write a standard for each type of data. That’s a lot of work, and likely useless, as the standards will almost certainly change over time. In areas where the rationale for interoperable data is very strong, researchers in the field will typically step up to the task with formats and databases, as is the case with genomes and protein structures, etc. For everything else, I feel like it’s probably more efficient to handle it the old-fashioned way by just, you know, sending an e-mail–I think personal engagement with the data is more productive than just randomly downloading it anyway. (Along those lines, I think DrugMonkey was right on with this post about PLOS’s new and completely inane data availability policy.) I think the question really is this: if someone for some reason wants to do a meta-analysis of my work, is the onus on me or on them to wrangle the data to make it comparable with other people’s studies? I think it’s far more efficient for the meta-analyzer to wrangle the data from the studies they are interested in than to make everyone go to a lot of trouble to prepare their data in pseudo-standard formats for meta-analyses that will likely never happen.

All this said, I do personally think that making data generally available and accessible is a good thing, and it’s something that we’ve done for a bunch of our papers. We have even released a software package for image analysis that hopefully someone somewhere will find useful outside the confines of our lab. Or not. I guess the point is that if someone else doesn’t want to use our software, well, that’s fine, too.
