Posts tagged: hadoop

Benchmarking the Cloud for Genomics

By Gary Stiehr, November 22, 2009 2:06 am

Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.

– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)

Page numbers and table references provided below are the page numbers within the provisional PDF, which was the available format at the time of writing this post.

As of late, I’ve seen a number of papers or announcements regarding running genomics and other bioinformatics applications in the cloud (usually meaning using Amazon’s EC2 and S3 services).  These projects are providing valuable empirical data about costs and  run times in the cloud.  To help in that effort, I thought I’d take a look at the latest example of such published results.  “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134) and it describes the use of the Crossbow package (requiring Hadoop) on local, non-cloud resource as well as using Amazon’s cloud resources.

So how did the application perform in the cloud versus on the local computing resources?  Within the paper, we find that the 320-CPU cluster mentioned above is comprised of 40 “Extra Large High CPU Instances” (virtual machines each with eight cores (approximately 2.5-2.8 GHz) and 7 GB of memory and 1690 GB of “instance storage”) upon which Hadoop was running (p. 7).  The running (wall) time of the application with the given data set was  2 hours and 53 minutes (Table 3).  Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results).  The application was run locally in the authors’ lab via Hadoop using a cluster of ten nodes each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeons cores (40 cores total).  The running (wall) time on the local cluster was “about” one day (p. 7).

A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):

  • The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
  • The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
  • The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
  • Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
    • does the increased disk space for available for Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
    • is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
    • is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
  • The Amazon wall clock timings were run once each and it is not specified about the number of timings on the local cluster.
    • also, a timing on the local cluster of “about” one day is a little vague.
  • The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
    • the terminology related to time (e.g., wall vs. CPU) was not necessarily traditional.  Perhaps they use “CPU time” to generically mean time on a computer rather than the time the processor spent executing instructions on behalf of the application?.  In any case, the statement is really a distortion of the timings and/or terminology:
      • The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall_hours
      • The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall_hours
      • The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall_hours
      • The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall_hours

Aside from the above, the data gives some interesting starting points surrounding timings (including data transfer timing).  As for the pricing, the $85 mentioned is for computation only and there is another $35.50 in data transfer-related fees and it is not clear if S3 fees came into play during the tests.  Further, one should note the price (a premium of $20.96 per hour (360%)) of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than using only 80 EC2 cores. Perhaps I’ll address that in a future post.

A Look at the Yahoo Systems Used to Sort a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

By Gary Stiehr, July 2, 2009 10:01 pm

The Yahoo! Developer Network Blog reported that Yahoo! was able to sort a petabyte of data in 16.25 hours and a terabyte in 62 seconds using Apache Hadoop.  I was curious what type of hardware they needed to pull this off.  From their post, they mentioned using this hardware (their Hammer cluster):

  • approximately 3800 nodes (in such a large cluster, nodes are always down)
  • 2 quad core Xeons @ 2.5ghz per node
  • 4 SATA disks per node
  • 8G RAM per node (upgraded to 16GB before the petabyte sort)
  • 1 gigabit ethernet on each node
  • 40 nodes per rack
  • 8 gigabit ethernet uplinks from each rack to the core

Here are my observations/inferences and questions about this:

  1. They used 95 racks of equipment dedicated to this test (3800 nodes with 40 nodes per rack)
  2. An 8 Gb/s uplink must consist of eight trunked gigabit connections
  3. From 1&2 above, we see they’d need 95 * 8 = 760 Gigabit Ethernet ports at their core.
    • What type of switches do you think they have at the core?
  4. Assuming 90% efficiency on the GigE link, 760 Gb/s over 16.25 hours could push about 5 PB of data.
    • How much data do you think is transferred when sorting 1 PB of data?
  5. 3800 nodes were used each with two quad core Xeon processors means 30,400 cores were used for this (?!)
    • What percentage of CPU time do you think they used to sort 1 PB of data on 30,400 (?!) cores?
  6. 3800 nodes were used each with 16 GB of memory means 60.8 TB of memory were used to sort 1 PB.
    • How much of the 60.8 TB of memory do you think was required to do the sort of 1 PB of data?
    • Are these Nehalems?  If not, if they were using Nehalem-based systems, do you think the run times would have changed due to increased memory bandwidth?
  7. Assuming “40 nodes per rack” is implying 40 1U servers, I think it is safe to assume that each rack (with 320 cores in 40 systems) uses around 10 kW of power.  With 95 racks like this, it seems this system would require just under 1 megawatt to operate.  Of course the various switches also contribute here.  Assuming $0.04 per kWh, it would have cost about $650 in electricity to do this sort.  This cost may double given the electricity needed to provide the required cooling.

Ok, so is this right?  Perhaps they meant to say 3800 cores instead of 3800 nodes?  Does Yahoo have a 30,400 core cluster sitting around to do these benchmarks?  I’m assuming this a cluster that they also use in their day-to-day operations?

    Panorama theme by Themocracy