Google and Genetics Research
A recent article (http://www.timesonline.co.uk/article/0,,2095-1892323,00.html) highlights Google's possible entry into the field of genetics data mining.
The article correctly points out that in genetics and biology there is a very real "islands of information" problem that makes just accessing data painful. Google apparently wants to bring its search technology to bear so that people can search on one or more biological terms and get useful results back from these various sources. While this is a decent first step, I see the problem as being much more complicated than this simplistic approach suggests.
Biological data comes in many forms. There is the Gene Ontology, which is hierarchical. There is the KEGG pathway database, which is more akin to nodes and links; there is the sequence database, which lends itself to custom chromosome viewer graphics; and the list goes on and on. Much of this data lends itself to specific visual/graphical views that may be distinct to the data type. But these various data sources represent atoms of information that are not independent...they are related to each other. Sequence information denotes the underpinnings of genes, genes produce proteins/enzymes that take part in pathways, and genes are also characterized by function in the Gene Ontology. The point I am making is that for this information to be manageable and digestible to the researcher, a seamless graphical interface has to be layered on top of the data. This interface must let users move easily between these disparate sources and piece together the puzzle of what a SNP/Gene/Protein is doing. I don't see Google solving this problem soon...the graphics are challenging, as is the data integration problem. There are well over 100 significant biological databases that could be utilized in the manner described above.
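To make the "related atoms" idea concrete, here is a minimal Python sketch of joining three differently shaped sources around one gene. Everything in it is an assumption for illustration: the identifiers (GENE_A, the toy: terms, chrToy1), the dictionaries, and the function names are made up and do not come from the Gene Ontology, KEGG, or any sequence database.

```python
# Toy, made-up records standing in for three separate databases.
# None of these identifiers are real; they only mimic the shape of the data.
GO_PARENTS = {  # hierarchical: term -> parent term
    "toy:kinase_activity": "toy:catalytic_activity",
    "toy:catalytic_activity": "toy:molecular_function",
}
GENE_TO_GO = {"GENE_A": ["toy:kinase_activity"]}
PATHWAY_EDGES = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]  # nodes and links
SEQUENCE_LOCI = {"GENE_A": ("chrToy1", 1000, 2500)}  # (chromosome, start, end)


def go_ancestors(term):
    """Walk up the hierarchy from a term and return its ancestors in order."""
    chain = []
    while term in GO_PARENTS:
        term = GO_PARENTS[term]
        chain.append(term)
    return chain


def pathway_neighbors(gene):
    """Return genes directly linked to `gene` in the pathway graph."""
    linked = {b for a, b in PATHWAY_EDGES if a == gene}
    linked |= {a for a, b in PATHWAY_EDGES if b == gene}
    return sorted(linked)


def gene_summary(gene):
    """Pull one gene's view out of all three sources into a single record."""
    return {
        "gene": gene,
        "locus": SEQUENCE_LOCI.get(gene),  # None if the sequence source lacks it
        "functions": {t: go_ancestors(t) for t in GENE_TO_GO.get(gene, [])},
        "pathway_neighbors": pathway_neighbors(gene),
    }


if __name__ == "__main__":
    print(gene_summary("GENE_A"))
```

The hard part in practice is not this join but agreeing on identifiers across 100+ databases, and then putting a usable graphical view on top of each data shape.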
The problem gets harder when you realize that almost all of the real data out there that is useful to researchers is stored in research papers. I can tell you that this is not a task for your mother's NLP system to address. No truly competent language processor has yet been able to comb through thousands of research papers and determine what each paper contains. Until this information is successfully mined, researchers will continue to be constrained by this bottleneck. I am rather dismayed that no standards body has stepped forward to mandate an XML-type standard that would accompany every new paper and allow for easy mining of its content. Perhaps I should undertake to do this :)
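As a rough illustration of what such an XML companion file might buy us, here is a small Python sketch that reads findings out of one. The schema (paper, finding, subject, relation, object) is purely hypothetical, invented for this example; no such standard exists as far as this post is concerned, and the gene names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical companion file for a paper. The element names and the
# gene identifiers are invented for illustration only.
SIDECAR = """<paper id="toy-0001">
  <finding>
    <subject type="gene">GENE_A</subject>
    <relation>upregulates</relation>
    <object type="gene">GENE_B</object>
  </finding>
  <finding>
    <subject type="gene">GENE_B</subject>
    <relation>in_pathway</relation>
    <object type="pathway">PATHWAY_P</object>
  </finding>
</paper>"""


def extract_findings(xml_text):
    """Return the paper id and a list of (subject, relation, object) triples."""
    root = ET.fromstring(xml_text)
    triples = [
        (f.findtext("subject"), f.findtext("relation"), f.findtext("object"))
        for f in root.findall("finding")
    ]
    return root.get("id"), triples


if __name__ == "__main__":
    paper_id, triples = extract_findings(SIDECAR)
    print(paper_id, triples)
```

With something like this attached to every paper, mining would be a parsing job rather than a natural language understanding job.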
Last but not least, what is really needed is to take all this aggregated information and apply some real AI to it. We don't need superhuman, or even human-level, AI here...we just need a very good and accurate inference system. The depth and breadth of data that resides out "there" tells me that there are discoveries to be made by looking collectively at these many sources and inferring relations that weren't previously known. A human could do the same, but the sheer volume of data, in many different places and in many different forms, makes it laborious at best and intractable at worst. This is a perfect application for a competent AI inference system.
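To give a flavor of the kind of inference meant here, below is a minimal Python sketch of forward chaining over (subject, relation, object) triples gathered from different sources. It is not the design mentioned in this post. The facts, the relation names, and the single toy rule are all assumptions, and the rule is deliberately simplistic rather than biologically sound.

```python
def infer(facts):
    """Forward-chain a toy rule until no new triples appear.

    Toy rule (an assumption, not established biology):
      if X upregulates Y, and Y is in pathway P or may influence P,
      then X may influence P.
    Returns only the newly derived triples.
    """
    known = set(facts)
    while True:
        derived = set()
        for (x, rel, y) in known:
            if rel != "upregulates":
                continue
            for (y2, rel2, p) in known:
                if y2 == y and rel2 in ("in_pathway", "may_influence"):
                    derived.add((x, "may_influence", p))
        if derived <= known:  # fixed point reached: nothing new was derived
            return known - set(facts)
        known |= derived


if __name__ == "__main__":
    # Made-up facts, imagined as coming from three different sources.
    facts = {
        ("GENE_A", "upregulates", "GENE_B"),
        ("GENE_B", "upregulates", "GENE_C"),
        ("GENE_C", "in_pathway", "PATHWAY_P"),
    }
    for triple in sorted(infer(facts)):
        print(triple)
```

No single source above says anything about GENE_A and PATHWAY_P; the link only shows up when the facts are looked at together, which is the whole point.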
I have sketched out a design to do all of the above, and even more. The problem is resources: time and money. I hope someday to have both so that I can implement this system...I firmly believe that it would have a major impact on research in the biological field.
Cheers,
Kevin Cramer
