Posts tagged: genomics

Benchmarking the Cloud for Genomics

By Gary Stiehr, November 22, 2009 2:06 am

Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.

– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)

Page numbers and table references provided below are the page numbers within the provisional PDF, which was the available format at the time of writing this post.

As of late, I’ve seen a number of papers or announcements regarding running genomics and other bioinformatics applications in the cloud (usually meaning using Amazon’s EC2 and S3 services).  These projects are providing valuable empirical data about costs and  run times in the cloud.  To help in that effort, I thought I’d take a look at the latest example of such published results.  “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134) and it describes the use of the Crossbow package (requiring Hadoop) on local, non-cloud resource as well as using Amazon’s cloud resources.

So how did the application perform in the cloud versus on the local computing resources?  Within the paper, we find that the 320-CPU cluster mentioned above is comprised of 40 “Extra Large High CPU Instances” (virtual machines each with eight cores (approximately 2.5-2.8 GHz) and 7 GB of memory and 1690 GB of “instance storage”) upon which Hadoop was running (p. 7).  The running (wall) time of the application with the given data set was  2 hours and 53 minutes (Table 3).  Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results).  The application was run locally in the authors’ lab via Hadoop using a cluster of ten nodes each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeons cores (40 cores total).  The running (wall) time on the local cluster was “about” one day (p. 7).

A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):

  • The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
  • The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
  • The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
  • Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
    • does the increased disk space for available for Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
    • is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
    • is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
  • The Amazon wall clock timings were run once each and it is not specified about the number of timings on the local cluster.
    • also, a timing on the local cluster of “about” one day is a little vague.
  • The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
    • the terminology related to time (e.g., wall vs. CPU) was not necessarily traditional.  Perhaps they use “CPU time” to generically mean time on a computer rather than the time the processor spent executing instructions on behalf of the application?.  In any case, the statement is really a distortion of the timings and/or terminology:
      • The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall_hours
      • The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall_hours
      • The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall_hours
      • The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall_hours

Aside from the above, the data gives some interesting starting points surrounding timings (including data transfer timing).  As for the pricing, the $85 mentioned is for computation only and there is another $35.50 in data transfer-related fees and it is not clear if S3 fees came into play during the tests.  Further, one should note the price (a premium of $20.96 per hour (360%)) of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than using only 80 EC2 cores. Perhaps I’ll address that in a future post.

Illumina Offers $48,000 Personal Genome Sequencing–How Will Data be Handled?

By Gary Stiehr, June 12, 2009 11:03 pm
dna_transcription

A depiction of the structure of DNA

Illumina will offer a service to sequence a person’s genome for $48,000 (a doctor’s prescription is required).  Note that this is only the sequencing and not the actual analysis of that sequence data.  The consumer must choose from a few different providers to do the actual analysis of the genome sequence data.  Currently, the representation of a human genome as Illumina is proposing (30-fold coverage of your DNA sequence) would require the transfer of terabytes of data to the company doing the analysis.  Of course, there are various parts to “analysis” so depending on where Illumina stops and the other companies take over, this actually could be a lot less data (e.g., gigabytes).

So this raises at least a couple of possible challenges for Illumina:

  • How will the data be transferred?
  • How will the data be secured?

Transferring the Data

One could see that data transfer of on the order terabytes of data would not be a problem if the turnaround time is long enough.  Although if the service becomes more and more popular, scaling may be a problem (or at least synchronizing network abilities with analysis providers).  Nevertheless, will Illumina establish encrypted network connections with the consumer’s/doctor’s chosen analysis provider?  Will they transfer the data encrypted on external hard drives?  If on external hard drives, how will tracking of the multiple pieces be tracked?

Securing the Data

I’m assuming the security/encryption questions may have answers based off of current electronic health records implementations although I’m not sure if electronic patient information systems are typically interconnected between different health care organizations.  That is, aren’t these systems usually secured/confined within the network of a particular health care organization?  If it is placed on external hard drives and shipped, would the encryption of terabytes of data per patient be challenging?

A Human Genome Per Day? The Genome Center at Washington University Scales Up on Illumina Sequencers

By Gary Stiehr, February 6, 2009 2:47 pm

We at The Genome Center at Washington University were happy to get official word that we will be adding an additional 21 Illumina Genome Analyzers to our portfolio of sequencing technology.  That enables us to sequence enough DNA to be equivalent to an entire human genome per day (at 25x coverage).  There is a lot of excitement about the potential such capacity brings.  The Genome Center’s director had this to say:

“Our intention to substantially scale-up with this technology reflects our commitment to large-scale sequencing projects that aim to uncover the underlying genetic basis of various human diseases. With the rapid decline in the cost of whole-genome sequencing, we believe now is the time to embark on initiatives which were previously not possible,” said Richard K. Wilson, Ph.D., Professor of Genetics and Director of the Genome Center at Washington University. “We are confident that we can further reduce the cost and accelerate the rate of human genome sequencing.”

A scale up of sequencing capacity brings a scale up in IT capacity.  We’ll be watching our internal network, disk and HPC resources and scaling as appropriate.  It will be likely that these sequencers alone will generate upwards of 20 TB of data per day, which needs further processing on The Genome Center’s computational resources.  I’m excited about the possibilities that this scale up will bring!

Panorama theme by Themocracy