Thursday, August 21, 2008 - 01:05
One of the things that drives me crazy on occasion is nomenclature. Well, maybe not nomenclature itself. It's really the continual changes in the nomenclature, and the time it takes for those changes to ripple through various databases and get reconciled with other kinds of information. And the realization that sometimes this reconciliation may never happen.
One of the projects that I've been working on during the past couple of years has involved developing educational materials that use bioinformatics tools to look at the isozymes that metabolize alcohol. As part of this project, I've been collecting three-dimensional structures of the enzymes and annotating polymorphic amino acids. That part is very straightforward. The complicated part of the project is figuring out how the structures correspond to the genes, genetic data, and association studies with diverse polymorphisms.
In retrospect, having sorted through all these things, it seems like compiling that information should be straightforward, too. But in practice, as someone who put all the information together by reading accessible papers and searching databases, it's not. I ran up against several confusing moments where I ended up banging my head against the wall trying to sort out which polymorphism correlates with which structural change, what it was called 8 years ago, what it's called now, and how the changes are tied to the structures.
The human versions of the alcohol and aldehyde dehydrogenase genes stand out as an example of a gene family whose members have all had multiple names and confusing polymorphisms. At one time, it seems that there were seven human genes, named ADH1-7. About eight years ago, the names all changed. The genes that were formerly called ADH1, ADH2, ADH3, and ADH4-7 became ADH1A, ADH1B, ADH1C, and ADH4-7. And the isozymes that were named alpha, beta, gamma, chi, mu, sigma, and phi changed names also. Even though I could find some of this information in the Entrez Gene database, it still took work to figure out how those names were tied to the genes now, especially when the 3D structures have names like Human Alcohol Dehydrogenase Beta-1-Beta-1 Isoform or ADH chi chi. Naturally, the structure database entries don't tell you anything about the most recent name of the gene.
Is beta 1 the same thing as the beta polypeptide? Does this mean there's a beta 2?
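The gene renaming described above is small enough to write down as a lookup table. Here is a minimal sketch in Python that contains only the old-to-new gene symbols listed in this post. The names `OLD_TO_NEW`, `to_current_symbol`, and `split_allele` are made up for illustration, and the table deliberately leaves out the Greek isozyme names, since those mappings aren't spelled out here.

```python
# Old (pre-renaming) gene symbols mapped to current ones, exactly as listed
# in this post. ADH4-7 kept their symbols. Hypothetical helper names.
OLD_TO_NEW = {
    "ADH1": "ADH1A",
    "ADH2": "ADH1B",
    "ADH3": "ADH1C",
    "ADH4": "ADH4",
    "ADH5": "ADH5",
    "ADH6": "ADH6",
    "ADH7": "ADH7",
}


def to_current_symbol(old_symbol: str) -> str:
    """Translate an old-style gene symbol into the current symbol."""
    key = old_symbol.strip().upper()
    if key not in OLD_TO_NEW:
        raise KeyError(f"no old-style gene symbol recorded for {old_symbol!r}")
    return OLD_TO_NEW[key]


def split_allele(old_name: str) -> tuple[str, str]:
    """Split a name like 'ADH2*2' into (current gene symbol, allele label).

    Only the gene part is translated; the allele label is passed through
    unchanged, because the post gives no rule for renaming alleles.
    """
    gene, _, allele = old_name.partition("*")
    return to_current_symbol(gene), allele


if __name__ == "__main__":
    print(to_current_symbol("ADH2"))
    print(split_allele("ADH2*2"))
```

A table like this only answers the easy half of the question. It says which gene an old name points to, not which SNP or which structure goes with it.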
Even more confusing, many of the polymorphisms have names like ADH2*1, ADH2*2, ADH2*3. That would be fine, but these names aren't in dbSNP. There's nothing in dbSNP that directly ties the SNPs to these polymorphisms. Nor are these names used consistently in the literature. It seems like every paper that tries to find an association between a phenotype and a genotype uses a different name for the genetic variations they're genotyping.
Even worse, some of the places where you might expect to find current information are out of date. I read through some of the publications at the National Institute on Alcohol Abuse and Alcoholism site including the strategic plan, naively thinking that the current names might be there.
Nope.
There was a nice table based on papers from 1998 and 2002, but the table had all old names for the polymorphisms and none of the current names or SNP references. It was also missing all the odd Greek subunit names that were assigned to various isozymes like beta, alpha, sigma, and chi.
At least this table had a date, though. Part of what made it hard to put the new and old information together was not knowing which information was current and which was not. Making a spreadsheet to keep track didn't get easier until I found a paper with a kind of ADH Rosetta stone, saying that the names changed in 2001 and how.
OMIM was the best database for figuring this stuff out, once I knew the current names of all the ADH genes. Unlike some of the other databases, I was able to look at OMIM and see when the last updates had happened. In this case, sadly, the last updates to the ADH genes were made in October 2007 by the late Victor McKusick. I knew I could trust those.
OMIM was still confusing though, since the entries would often use multiple names, in successive paragraphs, to refer to the same SNP, without giving any warning to the reader. In the ADH1B reference for example, one paragraph calls a polymorphism the typical and atypical forms of ADH, ADH2*1 and ADH2*2, the next paragraph calls it ADH1B*47his, and the next paragraph calls it ADH1B arg47-to-his polymorphism (rs1229984).
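The tables I ended up filling in amount to an alias list: every name a source uses for a polymorphism, pointing back to one stable identifier. Here is a minimal Python sketch using only the names quoted from the ADH1B entry above. The names `SNP_ALIASES`, `normalize`, and `resolve_snp` are hypothetical, and the table maps names to the SNP only, not to a specific allele, since that isn't stated here.

```python
import re

# Every name quoted above for the same polymorphism, keyed by its dbSNP ID.
# Illustrative only: a real table would need one row per SNP, filled in by hand.
SNP_ALIASES = {
    "rs1229984": [
        "ADH2*1",
        "ADH2*2",
        "ADH1B*47his",
        "ADH1B arg47-to-his",
    ],
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return re.sub(r"\s+", " ", name.strip().lower())


# Invert the table: normalized alias -> rsID. The rsID also resolves to itself.
ALIAS_TO_RSID = {}
for rsid, aliases in SNP_ALIASES.items():
    ALIAS_TO_RSID[normalize(rsid)] = rsid
    for alias in aliases:
        ALIAS_TO_RSID[normalize(alias)] = rsid


def resolve_snp(name: str):
    """Return the rsID for a known alias, or None if the name is not in the table."""
    return ALIAS_TO_RSID.get(normalize(name))


if __name__ == "__main__":
    for name in ["ADH1B*47His", "adh2*2", "ADH2*3"]:
        print(name, "->", resolve_snp(name))
```

Names that aren't in the table come back as `None` rather than a guess, which is the point: the lookup is only as good as the reading that went into building it.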
It took some time searching several databases, and filling in tables, but I finally did sort out the connections between the old names of the isozymes, SNPs, and genes, and the new names. I even tried using some other tools, like NextBio, for doing literature searches only to have them tell me that ADH is vasopressin and not give me any information on the genes I wanted.
What can I conclude from this activity?
There's still a place in the world for people (like me) who actually read papers instead of simply trusting database records.
What's next?
The question that I'm left with is how to articulate what I did and how to describe the most efficient path so that I can teach this sort of thing to students. I battled through the databases and conflicting names and sorted them all out because I'm motivated and don't mind reading the papers. I worry that most undergraduates will just get disgusted and give up.
Some people will say that the research I did in sorting out the genes, proteins, and mutations doesn't belong in a bioinformatics course. They'll say it's not really bioinformatics at all, even though it involves trying to reconcile biological information from multiple databases.
But if it's not bioinformatics, what is it? Reading? Annotating?
Others will say that sorting out this kind of stuff is a trivial problem. All the information was in the database, right?
Well, if this kind of work is trivial, why would it take someone with a Ph.D. and several years of experience three days to figure this out?
Others will say this kind of work isn't something we should bother trying to teach at all. We just need better search tools, right? Won't the semantic web solve all of this?
No, I don't think reconciling database records and nomenclature is a solved problem.
As far as bioinformatics goes, I think combining the information from genes, proteins, polymorphisms, structures, and genetics is the hardest thing to do and absolutely the hardest thing to teach.
If it were easy, there would be a database that did it.