Oct 12 2011

Walking on Eggshells

After a self-declared coding holiday, I was back at things this weekend, working on the Pathscrubber module of the Datapunk platform. A vexing problem had recently cropped up, and it turned out to be two problems intertwined. If you used Pathscrubber (PS) and clicked on any gene/protein node, PS would query Entrez Gene for the descriptive text, pull together a bunch of theory and clinical information, and send it all out as a pop-up window. For some reason the response time (on their end) had become unbearably slow. The second problem was a change to the interface between NCBI and the OMIM database. OMIM is run by Johns Hopkins, and suddenly one day the NCBI query tool that PS uses to get the OMIM entry on any gene stopped working. It was certainly a problem on their end, since NCBI’s own links did not work either. However, I discovered that OMIM was now available for download (something like 200 megabytes total).

Gotta love having an email address that ends in ‘.edu’!

Datapunk Logo.


However, there were problems with the data files, beyond the fact that they were incredibly huge. They are not in a common data file format, where each record is delimited by a carriage return (‘enter’) and each field in the record is delimited by a tab, comma or pipe (|) character. The OMIM gene records are a weird blend of lines that contain data and lines that name fields, all of which vary in length and appearance. I’ve dealt with files like these before (some KEGG files have this format), and you have to work really hard to code a way for Perl (the computer language I typically use) to tease out what you need. Fortunately, Perl has a vibrant community of programmers who produce ‘modules’ that expand Perl’s capabilities, so you do not have to reinvent the wheel if someone has already done it.
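To give a flavor of what teasing data out of such a file looks like, here is a minimal Perl sketch. The layout is an assumption made for illustration: a `*RECORD*` line starts each record, and a `*FIELD* XX` line names the field whose text follows on the next lines. The sample data, the field codes and the `parse_records` name are all made up, and the real files may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical sample in the assumed layout: marker lines name a field,
# and the lines that follow hold that field's text.
my $sample = <<'END';
*RECORD*
*FIELD* NO
100001
*FIELD* TI
EXAMPLE GENE ONE; EG1
*FIELD* TX
First line of descriptive text.
Second line of descriptive text.
*RECORD*
*FIELD* NO
100002
*FIELD* TI
EXAMPLE GENE TWO; EG2
END

# Split the text into records, each a hash of field name => field text.
sub parse_records {
    my ($text) = @_;
    my (@records, $current, $field);
    for my $line (split /\n/, $text) {
        if ($line eq '*RECORD*') {
            $current = {};              # start a new record
            push @records, $current;
            $field = undef;
        }
        elsif ($line =~ /^\*FIELD\*\s+(\w+)/) {
            $field = $1;                # following lines belong to this field
        }
        elsif ($current && defined $field) {
            # Join multi-line field text with single spaces.
            $current->{$field} = defined $current->{$field}
                ? "$current->{$field} $line"
                : $line;
        }
    }
    return \@records;
}

my $records = parse_records($sample);
print "$_->{NO}\t$_->{TI}\n" for @$records;
```

The point of the sketch is that the parser has to remember which record and which field it is currently inside, because no single line carries that information on its own.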

One collection of modules I use a lot is BioPerl. It has lots of cool interfaces and tools, including one that parses (reads) OMIM gene files. Normally that would be the end of the story. However, that BioPerl module, while doing a good job, was too slow, so I developed a work-around: I used the module to tease out specific data, which was then reorganized and written to new data files indexed by the OMIM gene ID number. By the time I was done, I had four different data files that the program could query quickly.
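The idea of a data file indexed by gene ID can be sketched roughly as follows. This is not the actual Pathscrubber code: the tab-delimited layout, the sample IDs and titles, and the `load_index` helper are assumptions for illustration, and a temporary file stands in for the real data files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical parsed data: gene ID => title (stand-ins for parser output).
my %titles = (
    100001 => 'EXAMPLE GENE ONE; EG1',
    100002 => 'EXAMPLE GENE TWO; EG2',
);

# Write one tab-delimited line per gene: the ID, then the value.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "$_\t$titles{$_}\n" for sort keys %titles;
close $out or die "Cannot close $path: $!";

# Load the file back into a hash so a lookup by ID is a single step.
sub load_index {
    my ($file) = @_;
    my %index;
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($id, $value) = split /\t/, $line, 2;
        $index{$id} = $value;
    }
    close $in;
    return \%index;
}

my $index = load_index($path);
print "$index->{100002}\n";
```

The slow parsing happens once, when the simple files are written; after that, each query is only a hash lookup on a flat file.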

One problem I encountered doing this was the exceedingly complex nature of the data returned from the BioPerl parser. Much of it was nested inside a series of ‘hash arrays.’ In the computer world, an array is a place to store data, much like an egg carton stores eggs: once the eggs are in the carton, you can specify which egg you want by naming its column and row number. Easy enough, but in the computer world, in addition to an egg (or an empty space), any location in our egg carton can also contain the location of another egg carton!
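In Perl terms, the "location of another egg carton" is a reference. Here is a small sketch with made-up data (the `@carton` contents are purely illustrative and not what the parser actually returns):

```perl
use strict;
use warnings;

# A hypothetical nested structure: each slot holds either a plain value
# (an egg) or a reference to another container (another carton).
my @carton = (
    'plain egg',
    {                                   # a reference to a hash
        id       => 100001,
        synonyms => [ 'EG1', 'EGONE' ], # a reference to an array
    },
);

# Follow the references step by step: slot 1, key 'synonyms', item 0.
my $first_synonym = $carton[1]{synonyms}[0];
print "$first_synonym\n";

# ref() reports what kind of thing a slot holds, which helps when exploring.
for my $slot (@carton) {
    my $kind = ref($slot) || 'plain value';
    print "$kind\n";
}
```

When a structure is several cartons deep, checking each level with `ref()` (or dumping it with the core Data::Dumper module) is one way to find out where the eggs actually are.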

This is how data often gains meaning from organization.



2 Responses to “Walking on Eggshells”

  1. Gillian says:

    Why do scrambled eggs enter my brain, followed by thoughts of Mork throwing an egg in the air while exclaiming, “Fly my little friend!”

  2. Todd LePine, MD says:

    Hi Peter,
    I’m a friend of David Brady….he just showed me your website. Would love to meet you. I too am interested in Visualizing Data, Edward Tufte, Systems Biology, Complexity Theory, Etc. Look forward to meeting and talking with you some day.

    Best, Todd