Posts tagged: cloud

Benchmarking the Cloud for Genomics

By Gary Stiehr, November 22, 2009 2:06 am

Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.

– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)

Page numbers and table references below refer to the provisional PDF, which was the available format at the time of writing this post.

Lately, I’ve seen a number of papers and announcements about running genomics and other bioinformatics applications in the cloud (usually meaning Amazon’s EC2 and S3 services).  These projects are providing valuable empirical data about costs and run times in the cloud.  To help in that effort, I thought I’d take a look at the latest example of such published results.  “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134), and it describes the use of the Crossbow package (which requires Hadoop) on local, non-cloud resources as well as on Amazon’s cloud resources.

So how did the application perform in the cloud versus on the local computing resources?  Within the paper, we find that the 320-CPU cluster mentioned above consists of 40 “Extra Large High CPU Instances” upon which Hadoop was running (p. 7); each is a virtual machine with eight cores (approximately 2.5-2.8 GHz), 7 GB of memory and 1690 GB of “instance storage.”  The running (wall) time of the application with the given data set was 2 hours and 53 minutes (Table 3).  Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results).  The application was also run locally in the authors’ lab via Hadoop using a cluster of ten nodes, each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeon cores (40 cores total).  The running (wall) time on the local cluster was “about” one day (p. 7).

A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):

  • The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
  • The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
  • The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
  • Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
    • does the greater disk space available to Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
    • is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
    • is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
  • The Amazon wall clock timings were each run once, and the paper does not specify how many timings were taken on the local cluster.
    • also, a timing on the local cluster of “about” one day is a little vague.
  • The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
    • the terminology related to time (e.g., wall vs. CPU) was not used in the traditional sense.  Perhaps they use “CPU time” to generically mean time on a computer rather than the time the processor spent executing instructions on behalf of the application?  In any case, the statement is really a distortion of the timings and/or terminology:
      • The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall_hours
      • The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall_hours
      • The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall_hours
      • The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall_hours
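The core-hour totals in the list above are simple products of wall time and core count. As a quick check, here is a minimal Python sketch that recomputes them from the figures quoted in this post. The 24-hour value for the local run is the same assumption made above, and the names `runs` and `core_wall_hours` are only illustrative.

```python
# Wall-clock hours and core counts as quoted in the list above.
runs = {
    "EC2, 80 cores": (6.5, 80),
    "EC2, 160 cores": (4.55, 160),
    "EC2, 320 cores": (2.88, 320),
    "local, 40 cores": (24.0, 40),  # "about" one day, assumed to be 24 hours
}


def core_wall_hours(wall_hours, cores):
    """Wall-clock hours multiplied by the number of cores held for the run."""
    return wall_hours * cores


totals = {name: core_wall_hours(hours, cores) for name, (hours, cores) in runs.items()}

for name, total in totals.items():
    print(f"{name}: {total:.0f} core-wall hours")
```

Note that this counts the time the cores were reserved, not the CPU time the application actually consumed, which is the distinction raised above.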

Aside from the above, the data give some interesting starting points surrounding timings (including data transfer timing).  As for the pricing, the $85 mentioned is for computation only; there is another $35.50 in data transfer-related fees, and it is not clear whether S3 fees came into play during the tests.  Further, one should note the price of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than using only 80 EC2 cores: a premium of $20.96 per hour (360%). Perhaps I’ll address that in a future post.
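The time saving and the stated fees can be checked with a few lines of Python. This is only a sketch using the figures as quoted in this post (6 hours 30 minutes on 80 cores, 2 hours 53 minutes on 320 cores, $85 for computation and $35.50 for data transfer). It does not try to reproduce the per-hour premium, since the instance prices are not given here, and the variable names are illustrative.

```python
from datetime import timedelta

# Wall times quoted in this post for the 80-core and 320-core EC2 runs.
wall_80 = timedelta(hours=6, minutes=30)
wall_320 = timedelta(hours=2, minutes=53)

saved = wall_80 - wall_320
saved_fraction = saved / wall_80  # timedelta / timedelta gives a float

# Fees quoted in this post; any S3 fees are unknown and therefore excluded.
compute_usd = 85.00
transfer_usd = 35.50
stated_total_usd = compute_usd + transfer_usd

print(f"time saved: {saved} ({saved_fraction:.1%} of the 80-core run)")
print(f"stated fees for the 320-core run: ${stated_total_usd:.2f}")
```

The saving comes out at 3 hours 37 minutes, a little over 55% of the 80-core wall time, and the stated fees add up to $120.50 before any S3 charges.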

Penguin Computing POD: HPC in the Cloud

By Gary Stiehr, August 12, 2009 1:00 am

Penguin Computing is offering a service called “Penguin on Demand” or “POD” where customers can ssh into Penguin’s clusters and pay for CPU-core-hours.  Citing performance concerns, they are forgoing the use of virtualization, which is often a key component of a cloud computing offering.  As of mid-2009, I/O does take a hit in virtualized environments, but upcoming PCI-IOV standards may help; see HPC Trends and Virtualization starting at slide 11.  Initially, their cluster will consist of a modest 1,000 cores, but at least some will have access to GPUs as well.

For delivering your data to the cluster, you have the option of overnighting “2 TB hot-swappable drives” to them if it is not feasible to transfer over the Internet.  It’s not clear whether they provide specific drives, which would require you to have a compatible system, or whether you can use your own eSATA or other external drive for this.  While that sounds easy enough, I think loading data onto these drives and handling file systems between the source and destination OS can introduce plenty of delays, especially with poorly chosen external drive interfaces.  Of course, the feasibility of just transferring over the network should be investigated first.
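As a rough way to investigate that feasibility, the sketch below estimates how long 2 TB would take to move over a network link. Everything other than the 2 TB figure is an assumption made for illustration: the link speeds, the 70% effective-throughput factor, and the function name `transfer_hours` are not from Penguin’s material.

```python
def transfer_hours(size_tb, link_gbps, efficiency=0.7):
    """Estimate hours to move size_tb terabytes (10**12 bytes each) over a link.

    link_gbps is the nominal link speed in gigabits per second, and
    efficiency is an assumed fraction of that speed actually achieved.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical link speeds, in Gb/s, for comparison against overnight shipping.
for gbps in (0.1, 1, 10):
    print(f"{gbps} Gb/s: {transfer_hours(2, gbps):.1f} hours for 2 TB")
```

If the estimate for your own link is well under a day, shipping drives overnight is unlikely to be faster once the time to load and unload them is counted.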

From what I’ve been able to read, you’d have an account on Penguin’s cluster and use their queuing system.  It’s also not clear how your OS will be deployed to the cluster (presumably you’d provide an image to them and they’d deploy using functionality in Scyld’s ClusterWare software).  At this point, it is not clear that you’d be doing much more than using someone else’s cluster and getting billed for it…something that several places have done for some time (e.g., NCSA’s Private Sector Program, R Systems, etc.).

What is needed is a programmatic interface to interact with their cluster with strong auth/authz technologies to allow an organization to seamlessly flex their HPC infrastructure and manage jobs with local apps.  For some disciplines, transferring large data sets may continue to be a barrier to seamless extensions of their infrastructure…or perhaps not with Darkstrand’s growing presence providing 10-40 Gb/s commercial connectivity to HPC resources.  Amazon provides more web services to move closer to this programmatic interface but as Penguin points out, EC2 is not very well geared for many types of HPC applications (although it is for some).

Penguin does not give many details yet about the storage environment associated with their POD service–only that it is “high-speed”.  NetApp is mentioned in one article about POD.  It seems all nodes have Gigabit Ethernet and/or DDR Infiniband network fabrics, but it is not clear how far a job can scale before its nodes are spread across multiple hops to reach each other for inter-process communication.  Nor is it mentioned over which fabric the storage is accessible.

Have you read any additional articles providing any of the missing details?

Not All Apps Are Fit for the Cloud | The Intelligent Enterprise Blog

By Gary Stiehr, January 13, 2009 2:00 am

Here’s a notable quote from Not All Apps Are Fit for the Cloud | The Intelligent Enterprise Blog:

With cloud computing the trick is not to follow the hype and the crowd, but to understand your own issues and applications first. From there you can make an educated call as to what applications make sense to outsource to a good cloud computing platform, and what applications to keep local. Keep in mind that this should be an evolving process, and you can always relocate applications as the cloud computing resources improve, and clearly they will.

Which applications have you put in the cloud?

Redundancy in the Cloud

The article Forecast Mostly Sunny for Company Opting for Cloud Computing describes some of the common cloud computing benefits and potential drawbacks. One of the common drawbacks listed is that you are completely reliant on your cloud computing vendor (in this case Amazon). Near the end of the article, two comments from Stevie Clifton, Animoto’s CTO, match two thoughts I’ve often had when hearing the drawback about vendor reliance:

  1. It’s no different than using any other web or application hosting as companies have done for some time.
  2. At some point, you could use server instances across more than one vendor’s cloud to limit your risk of reliance on one vendor.

Do you see using services in the cloud as being any different from utilizing other web/application hosting, server co-location or managed services in terms of reliance on other vendors?
