Tuesday, November 29, 2005

Google and Genetics Research

A recent article ( http://www.timesonline.co.uk/article/0,,2095-1892323,00.html ) highlights Google's possible entry into the field of genetics data mining.

The article correctly points out that in genetics and biology there is a very real "islands of information" problem that makes just accessing data painful. Google apparently wants to bring its search technology to bear so that people can search on biological terms and get useful results back from these various sources. While this is a decent first step, I see the problem as much more complicated than this simple approach suggests.

Biological data comes in many forms. There is the Gene Ontology, which is hierarchical. There is the KEGG pathway database, which is more akin to nodes and links; there is the sequence database, which lends itself to custom chromosome viewer graphics; and the list goes on and on. Much of this data lends itself to specific visual/graphical views that may be distinct to the data type. But these various data sources represent atoms of information that are not independent...they are related to each other. Sequence information denotes the underpinnings of genes, genes produce proteins/enzymes that take part in pathways, and genes are also characterized by function in the Gene Ontology. The point I am making is that for this information to be manageable and digestible to the researcher, a seamless graphical interface has to be layered on top of the data. This interface must let users move easily between these disparate sources and piece together the puzzle of what a SNP/gene/protein is doing. I don't see Google solving this problem soon...the graphics are challenging, as is the data integration problem. There are well over 100 significant biological databases that could be utilized in the manner described above.
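To make the "related atoms of information" idea a bit more concrete, here is a minimal sketch of one way cross-database links could be held in a single graph and walked outward from a starting record. Everything in it is an assumption for illustration: the `LinkGraph` class, the source labels, and the placeholder identifiers (`SNP_1`, `GENE_A`, and so on) are hypothetical and do not come from any real database.

```python
from collections import defaultdict, deque


class LinkGraph:
    """Undirected graph whose nodes are (source, identifier) pairs."""

    def __init__(self):
        self.neighbors = defaultdict(set)

    def link(self, a, b):
        """Record that two records from (possibly different) sources are related."""
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def reachable(self, start, max_hops):
        """Breadth-first search: return {node: hops} for nodes within max_hops."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand past the hop limit
            for nxt in self.neighbors[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        return seen


# Hypothetical placeholder records, one per kind of source.
graph = LinkGraph()
graph.link(("snp", "SNP_1"), ("gene", "GENE_A"))
graph.link(("gene", "GENE_A"), ("protein", "PROT_A"))
graph.link(("protein", "PROT_A"), ("pathway", "PATHWAY_X"))
graph.link(("gene", "GENE_A"), ("go_term", "GO_TERM_Y"))

# Everything within three links of the SNP, with its distance.
hops = graph.reachable(("snp", "SNP_1"), max_hops=3)
```

A viewer built on something like this could start from a SNP and offer the gene, protein, pathway, and ontology views as neighboring screens. The hard parts (loading 100+ real databases and drawing each data type well) are exactly what this toy leaves out.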

The problem gets harder when you realize that almost all of the real data out there that is useful to researchers is stored in research papers. I can tell you that this is not a task for your mother's NLP system. No language processor has yet been able to comb through thousands of research papers and reliably determine what each paper contains. Until this information is successfully mined, researchers will continue to be constrained by this bottleneck. I am rather dismayed that no standards body has stepped forward to mandate an XML-type standard that would accompany every new paper and allow for easy mining of its content. Perhaps I should undertake to do this :)
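As a rough illustration of what a machine-readable companion file for a paper might look like, here is a small sketch. The element and attribute names (`paper-annotations`, `finding`, `subject`, `object`, `relation`, `evidence`) are invented for this example and are not an existing standard; the identifiers are placeholders. Parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file that might accompany a paper.
SAMPLE = """\
<paper-annotations>
  <finding relation="encodes" evidence="experimental">
    <subject type="gene">GENE_A</subject>
    <object type="protein">PROT_A</object>
  </finding>
  <finding relation="participates_in" evidence="inferred">
    <subject type="protein">PROT_A</subject>
    <object type="pathway">PATHWAY_X</object>
  </finding>
</paper-annotations>
"""


def extract_findings(xml_text):
    """Return each <finding> as a dict of subject, relation, object, evidence."""
    root = ET.fromstring(xml_text)
    findings = []
    for finding in root.findall("finding"):
        subject = finding.find("subject")
        obj = finding.find("object")
        findings.append({
            "subject": (subject.get("type"), subject.text.strip()),
            "relation": finding.get("relation"),
            "object": (obj.get("type"), obj.text.strip()),
            "evidence": finding.get("evidence"),
        })
    return findings


findings = extract_findings(SAMPLE)
```

With something along these lines, mining a paper's main claims would be a parsing job rather than an NLP job. Getting authors and journals to agree on the vocabulary is the real work.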

Last but not least, what is really needed is to take all this aggregated information and apply some real AI to it. We don't need superhuman, or even human-level, AI here...we just need a very good and accurate inference system. The depth and breadth of data that resides out "there" tells me that there are discoveries to be made by looking collectively at these many sources and inferring relations that weren't previously known. A human could do the same, but the sheer volume of data, in many different places and in many different forms, makes it laborious at best and intractable at worst. This is a perfect application for a competent AI inference system.
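For a sense of what "inferring relations that weren't previously known" can mean at its very simplest, here is a toy forward-chaining sketch. It is only an illustration under stated assumptions: the relation names, the two rules in `RULES`, and the placeholder identifiers are all hypothetical, and a real system would need uncertainty handling and far richer rules.

```python
# Each rule (r1, r2, r_out) reads: if A r1 B and B r2 C, then A r_out C.
# These rules are hypothetical examples, not curated biology.
RULES = [
    ("encodes", "participates_in", "involved_in"),
    ("involved_in", "part_of", "involved_in"),
]


def infer(facts, rules=RULES):
    """Apply the rules repeatedly until no new facts appear.

    Returns only the derived facts, not the ones supplied as input.
    """
    known = set(facts)
    derived = set()
    while True:
        new_facts = set()
        for (a, r1, b) in known:
            for (c, r2, d) in known:
                if b != c:
                    continue  # the two facts do not chain
                for (first, second, result) in rules:
                    if r1 == first and r2 == second:
                        candidate = (a, result, d)
                        if candidate not in known:
                            new_facts.add(candidate)
        if not new_facts:
            return derived
        known |= new_facts
        derived |= new_facts


# Facts that might have come from three different sources.
facts = [
    ("GENE_A", "encodes", "PROT_A"),
    ("PROT_A", "participates_in", "PATHWAY_X"),
    ("PATHWAY_X", "part_of", "PROCESS_Z"),
]

derived = infer(facts)
```

No single source states that `GENE_A` is involved in `PROCESS_Z`; it only falls out once the three facts sit side by side, which is the point of aggregating first and inferring second.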

I have sketched out a design to do all of the above, and even more. The problem is resources, time, and money... I hope someday to have the time and money to implement this system. I firmly believe that it would have a major impact on research in the biological field.

Cheers,
Kevin Cramer

6 Comments:

Ben said...

Kevin -- As you know, my colleagues and I also have a pretty detailed design for doing all this, developed as part of our work for Biomind (www.biomind.com).

Frustratingly, however, the bioinformatics market is in the toilet, and there is basically no business funding for projects such as integrative bioinformatics.

Bioinformatics today is all about open-source software; but the open-source community has not really mobilized itself to create anything except some fairly simple and standard tools.

And the NIH is big into funding bio research, but their informatics research funding programme is not terribly adventurous or inspired.

Eventually the time for this software will come, but it's pretty frustrating to see that this sort of stuff is possible to do RIGHT NOW yet will not be achieved for years to come due to lack of vision on the part of the folks holding the purse strings (and, in large part, lack of vision on the part of the biology research community itself; biology lends itself to a kind of micro-narrow-focus which doesn't lend itself to integrative thinking; the systems biology movement is trying to counter this, but is focusing on simulation rather than data analysis and integration.)

6:58 PM  
Michael Anissimov said...

Nice new blog, Ben + friends. I'll be sure to tune in with my handy-dandy RSS reader.

1:26 AM  
michael vassar said...

Wouldn't cyc fit the bill for the AI you are saying would help biotech?

10:54 AM  
Ben said...

Michael Vassar: Cyc does not deal effectively with either natural language or quantitative data. It deals with formalized logical knowledge -- and attempts to use this approach to deal with commonsense everyday human knowledge, which is basically useless for understanding molecular genetics. A Cyc database oriented toward molecular genetics rather than everyday commonsense human life would perhaps be of some use, but the problem of connecting it with quantitative bio data and natural language bio research papers would still remain. In short, even though an AGI is not necessarily needed to aid biology, a better integrative AI framework than anything now existing (including Cyc) is required. -- Ben Goertzel

5:51 PM  
Maitri said...

Ben,

You are correct in what you say about selling into the bioinformatics market. I am also aware of some of your ideas on integrating biodatabases...although I question the efficacy of the design from an inference standpoint...and it did not fundamentally address the interface portion of the application, which is every bit as challenging as the inference mechanisms IMO. I think there would be a market for such a product. This product is not an analytic tool, which biologists seem reluctant to ante up money for, but data integration+presentation+inference, which is a different animal. If done correctly, it would be nearly impossible to replicate through some open-source group, yet its value to research would be immense. So I guess I am disagreeing with you on the market for such a thing. I actually saw a company spring up 6 months ago that was selling a pre-built box with integrated DBs on it (BLAST, GO, dbSNP, etc.) and a Google-type interface. The box sold for ~$25,000 a pop, if I recall correctly. Their application was very basic...no text-mined docs, no graphical UI, no inference...so it will be interesting to see if they find a market. I recall they had venture funding though...

8:00 PM  
Ben said...

Maitri,

My current opinion is that biomedical scientists are pretty reluctant to ante up money for ANY software application, be it bioinformatics or data visualization or bio-text-mining or whatever.

The biomedical research community seems to be oriented very strongly toward publicly available free databases and open-source tools, these days.

Thus, if a database like the one you describe is going to become popular in the biomed research community, it will most likely come about via the NIH (or perhaps some foreign equivalent agency?) releasing it on one of its websites.

But who knows ... markets can change as we all know, and sometimes they can change fast...

-- Ben G

5:59 PM  
