Benchmarking the Cloud for Genomics
Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.
– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)
Page numbers and table references provided below are the page numbers within the provisional PDF, which was the available format at the time of writing this post.
Lately, I’ve seen a number of papers and announcements about running genomics and other bioinformatics applications in the cloud (usually meaning Amazon’s EC2 and S3 services). These projects are providing valuable empirical data about costs and run times in the cloud. To help in that effort, I thought I’d take a look at the latest example of such published results. “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134), and it describes the use of the Crossbow package (which requires Hadoop) on a local, non-cloud resource as well as on Amazon’s cloud resources.
So how did the application perform in the cloud versus on the local computing resources? Within the paper, we find that the 320-CPU cluster mentioned above consists of 40 “Extra Large High CPU Instances” upon which Hadoop was running (p. 7). Each instance is a virtual machine with eight cores (approximately 2.5-2.8 GHz), 7 GB of memory and 1690 GB of “instance storage”. The running (wall) time of the application with the given data set was 2 hours and 53 minutes (Table 3). Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results). The application was also run locally in the authors’ lab via Hadoop, using a cluster of ten nodes, each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeon cores (40 cores total). The running (wall) time on the local cluster was “about” one day (p. 7).
A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):
- The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
- The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
- The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
- Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
  - does the increased disk space available for Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
  - is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
  - is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
- The Amazon wall clock timings were run once each, and the paper does not specify how many timings were taken on the local cluster.
  - also, a timing on the local cluster of “about” one day is a little vague.
- The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
  - the terminology related to time (e.g., wall vs. CPU) was not necessarily traditional. Perhaps they use “CPU time” to generically mean time on a computer rather than the time the processor spent executing instructions on behalf of the application? In any case, the statement is really a distortion of the timings and/or terminology:
    - The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall-hours
    - The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall-hours
    - The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall-hours
    - The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall-hours
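The core-wall-hour arithmetic above is easy to reproduce. Here is a minimal Python sketch using only the wall times quoted in this post; the names `runs`, `core_wall_hours` and `relative_efficiency` are my own, and the efficiency ratio at the end is an extra illustration rather than something the paper reports.

```python
# Wall-clock times quoted in this post, as (cores, wall hours).
runs = {
    "EC2, 80 cores": (80, 6.5),
    "EC2, 160 cores": (160, 4.55),
    "EC2, 320 cores": (320, 2.88),
    "local, 40 cores": (40, 24.0),  # "about" one day, assumed to be 24 hours
}


def core_wall_hours(cores, wall_hours):
    """Total core-hours occupied by a run, measured in wall-clock time."""
    return cores * wall_hours


def relative_efficiency(base, other):
    """Core-wall-hours of `base` divided by those of `other`.

    A value of 1.0 would mean `other` scaled perfectly relative to `base`.
    """
    return core_wall_hours(*base) / core_wall_hours(*other)


totals = {name: core_wall_hours(c, h) for name, (c, h) in runs.items()}
for name, total in totals.items():
    print(f"{name}: {total:.0f} core-wall-hours")

# Hypothetical extra metric: how much of the 80-core throughput per core
# survives when scaling out to 320 cores.
eff = relative_efficiency(runs["EC2, 80 cores"], runs["EC2, 320 cores"])
print(f"320-core run relative to 80-core run: {eff:.0%}")
```

Seen this way, quadrupling the EC2 core count cuts the wall time by a bit more than half, while the total core-hours billed nearly double.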
Aside from the above, the data give some useful starting points for estimating timings (including data transfer timing). As for the pricing, the $85 mentioned is for computation only; there is another $35.50 in data transfer-related fees, and it is not clear whether S3 fees came into play during the tests. Further, one should note the price of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than with only 80 EC2 cores: a premium of $20.96 per hour (360%). Perhaps I’ll address that in a future post.
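The time saved by the 320-core run can be checked from the two wall times quoted above (6 hours 30 minutes and 2 hours 53 minutes). This is a small sketch with variable names of my own choosing. It does not try to reproduce the dollar premium, because the 80-core price is not quoted in this post.

```python
from datetime import timedelta

# Wall times quoted in this post for the 80-core and 320-core EC2 runs.
t80 = timedelta(hours=6, minutes=30)
t320 = timedelta(hours=2, minutes=53)

saved = t80 - t320        # wall time saved by using 320 cores
fraction = saved / t80    # dividing two timedeltas yields a float

print(f"time saved: {saved}")
print(f"fraction of the 80-core wall time saved: {fraction:.1%}")
```

The result lands between 55% and 56%, so the 55% figure above is that fraction rounded down.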