Posts tagged: HPC

Benchmarking the Cloud for Genomics

By Gary Stiehr, November 22, 2009 2:06 am

Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.

– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)

Page numbers and table references provided below are the page numbers within the provisional PDF, which was the available format at the time of writing this post.

As of late, I’ve seen a number of papers and announcements about running genomics and other bioinformatics applications in the cloud (usually meaning Amazon’s EC2 and S3 services). These projects are providing valuable empirical data about costs and run times in the cloud. To help in that effort, I thought I’d take a look at the latest example of such published results. “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134), and it describes the use of the Crossbow package (which requires Hadoop) on local, non-cloud resources as well as on Amazon’s cloud resources.

So how did the application perform in the cloud versus on the local computing resources? Within the paper, we find that the 320-CPU cluster mentioned above consists of 40 “Extra Large High CPU Instances” upon which Hadoop was running. Each instance is a virtual machine with eight cores (approximately 2.5-2.8 GHz), 7 GB of memory and 1690 GB of “instance storage” (p. 7). The running (wall) time of the application with the given data set was 2 hours and 53 minutes (Table 3). Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results). The application was also run locally in the authors’ lab via Hadoop, using a cluster of ten nodes, each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeon cores (40 cores total). The running (wall) time on the local cluster was “about” one day (p. 7).

A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):

  • The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
  • The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
  • The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
  • Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
    • does the increased disk space available for Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
    • is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
    • is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
  • The Amazon wall clock timings were run once each, and the paper does not say how many timings were run on the local cluster.
    • also, a timing on the local cluster of “about” one day is a little vague.
  • The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
    • the terminology related to time (e.g., wall vs. CPU) was not necessarily traditional. Perhaps they use “CPU time” to mean time on a computer generically, rather than the time the processor spent executing instructions on behalf of the application? In any case, the statement distorts the timings and/or the terminology:
      • The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall_hours
      • The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall_hours
      • The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall_hours
      • The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall_hours
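
For anyone who wants to redo the arithmetic in the list above, here is a rough Python sketch. The core counts and wall times are the ones quoted in this post; the 24-hour local figure is my assumption for “about” one day, and the function names are just ones I made up for illustration. It also shows what perfectly linear scaling from the local run would predict for 80 cores, which is the comparison behind my “does not quite add up” remark.

```python
# Core counts and wall-clock hours as quoted in this post (Table 3 and p. 7 of the paper).
# ASSUMPTION: "about one day" on the local cluster is taken to be 24 hours.
runs = {
    "EC2 80 cores": (80, 6.5),
    "EC2 160 cores": (160, 4.55),
    "EC2 320 cores": (320, 2.88),
    "local 40 cores": (40, 24.0),
}


def core_wall_hours(cores, wall_hours):
    """Cores multiplied by wall-clock hours (not true CPU time)."""
    return cores * wall_hours


def ideal_wall_hours(base_cores, base_hours, cores):
    """Wall time predicted by perfectly linear scaling from a baseline run."""
    return base_hours * base_cores / cores


totals = {name: core_wall_hours(c, h) for name, (c, h) in runs.items()}
for name, total in totals.items():
    print(f"{name}: {total:.0f} core-wall-hours")

# Linear scaling from the local run (40 cores, 24 hours) to 80 cores,
# to set against the 6.5 hours measured on 80 EC2 cores.
local_scaled_to_80 = ideal_wall_hours(40, 24.0, 80)
print(f"local run scaled to 80 cores: {local_scaled_to_80:.1f} hours")
```

Keep in mind this treats every core as equivalent, which the different clock speeds and the virtualization on EC2 make doubtful.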

Aside from the above, the data gives some interesting starting points for timings (including data transfer timing). As for the pricing, the $85 mentioned is for computation only; there is another $35.50 in data transfer-related fees, and it is not clear whether S3 fees came into play during the tests. Further, one should note the price of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than using only 80 EC2 cores: a premium of $20.96 per hour (360%). Perhaps I’ll address that in a future post.

Penguin Computing POD: HPC in the Cloud

By Gary Stiehr, August 12, 2009 1:00 am

Penguin Computing is offering a service, “Penguin on Demand” or “POD,” where customers can ssh into Penguin’s clusters and pay for CPU-core-hours. Citing performance concerns, they are forgoing the use of virtualization, which is often a key component of a cloud computing offering. As of mid-2009, I/O does take a hit in virtualized environments, but upcoming PCI-IOV standards may help–see HPC Trends and Virtualization starting at slide 11. Initially, their cluster will consist of a modest 1,000 cores, but at least some will have access to GPUs as well.

For delivering your data to the cluster, you have the option of overnighting “2 TB hot-swappable drives” to them if it is not feasible to transfer over the Internet. It’s not clear whether they provide specific drives, for which you would then need a compatible system, or whether you can use your own eSATA or other external drive. While that sounds easy enough, I think loading data onto these drives and handling file systems between the source and destination OS can introduce plenty of delays, especially with a poorly chosen external drive interface. Of course, the feasibility of just transferring over the network should be investigated first.

From what I’ve been able to read, you’d have an account on Penguin’s cluster and use their queuing system. It’s also not clear how your OS will be deployed to the cluster (presumably you’d provide an image to them and they’d deploy using functionality in Scyld’s ClusterWare software). At this point, it is not clear that you’d be doing much more than using someone else’s cluster and getting billed for it…something that several places have done for some time (e.g., NCSA’s Private Sector Program, R Systems, etc.).

What is needed is a programmatic interface to interact with their cluster with strong auth/authz technologies to allow an organization to seamlessly flex their HPC infrastructure and manage jobs with local apps.  For some disciplines, transferring large data sets may continue to be a barrier to seamless extensions of their infrastructure…or perhaps not with Darkstrand’s growing presence providing 10-40 Gb/s commercial connectivity to HPC resources.  Amazon provides more web services to move closer to this programmatic interface but as Penguin points out, EC2 is not very well geared for many types of HPC applications (although it is for some).

Penguin does not give many details yet about the storage environment associated with their POD service–only that it is “high-speed”. NetApp is mentioned in one article about POD. It seems all nodes have Gigabit Ethernet and/or DDR Infiniband network fabrics, but it is not clear how far a job can scale before its nodes are spread across multiple hops to reach each other for inter-process communication. Nor is it mentioned over which fabric the storage is accessible.

Have you read any additional articles providing any of the missing details?

The Fastest Supercomputer Became Faster Than the Top 500 Combined Four Years Prior

By Gary Stiehr, June 25, 2009 12:11 am

TOP500 Performance over time

After reading a perspective of the latest TOP500 Supercomputer List from @Chris P_Intel I took another look at the progress of the systems on the list shown above.  The June 2009 list just released puts the RoadRunner supercomputer in the number one spot with 1105 TFLOPS.  In June 2004, just five years ago, all 500 supercomputers combined summed to 813 TFLOPS, with the most powerful single system being the Earth-Simulator at 36 TFLOPS.  So in just five years, a single supercomputer has become more powerful than the 500 most powerful supercomputers from June 2004.

Upon taking a closer look, I saw that RoadRunner was actually in the number one spot in June 2008 at 1026 TFLOPS.  So the top supercomputer on the list in June 2008  was actually faster than all of the top 500 supercomputers combined from four years prior!
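
To put rough ratios on those comparisons, here is a small Python sketch using only the TFLOPS figures quoted in this post (I have not re-checked them against the published lists, and the variable names are my own):

```python
# TFLOPS figures as quoted in this post; not re-verified against the TOP500 lists.
roadrunner_tflops = {"June 2008": 1026, "June 2009": 1105}
all_500_june_2004 = 813      # combined total of the June 2004 list
top_system_june_2004 = 36    # Earth-Simulator

# Ratio of the #1 system to the combined June 2004 list, per list date.
ratios = {when: tflops / all_500_june_2004 for when, tflops in roadrunner_tflops.items()}
for when, ratio in ratios.items():
    print(f"{when}: #1 system is {ratio:.2f}x the combined June 2004 list")

# Growth of the #1 spot itself over the five years.
growth = roadrunner_tflops["June 2009"] / top_system_june_2004
print(f"#1 in June 2009 vs. #1 in June 2004: {growth:.1f}x")
```

Any ratio above 1.0 means the single top system outran the whole earlier list.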

Ok, and going back to November 2005, it seems that the #1 system may have been more powerful than the sum of the top 500 supercomputers in November 2002.  So perhaps we are down to three years…I haven’t verified exact numbers though.  Has anyone officially tracked the record for how quickly the #1 supercomputer on the TOP500 list had achieved the performance of all of the supercomputers on a previous TOP500 list?

147,000 Processors Used for Atom-by-Atom Simulation of Nanoscale Transistor

By Gary Stiehr, June 23, 2009 1:29 am

Using 147,000 processors for 15 minutes from the Jaguar system (a Cray XT5) at the Oak Ridge Leadership Computing Facility, “a simulation of electrical current moving through a futuristic electronic transistor has been modeled atom-by-atom in less than 15 minutes by Purdue University researchers.”

“Professor Klimeck and his colleague have demonstrated the unique transformational scientific opportunity that comes from scaling a science application to fully exploit the capabilities of petascale systems like the Cray XT5 at the Oak Ridge Leadership Computing Facility,” Kothe says.

Freely available nanoelectronics software (OMEN) from nanoHUB.org was used to do this simulation. I am curious about how else this could be applied. What other nanostructures might we be able to simulate in this way?

For more information, see the source article.

Using HPC to Understand Swine Flu

By Gary Stiehr, June 8, 2009 11:29 pm

Here is another great example of how an HPC site can function as a versatile resource for a wide variety of problem domains. A priority queue was set up on TACC’s Ranger cluster to provide 2,000 to 3,000 processors for two weeks to allow a team to assess the way in which the underlying structure of the Swine Flu virus (A/H1N1) has mutated or could mutate and lead to drug resistance. With this data, “they believe it will be possible to intelligently design a drug or vaccine that can’t be resisted.”

This still from a Quicktime movie represents a view of the drug buried in the binding pocket of the A/H1N1 neuraminidase protein. The animation also shows a 3D surface view of a neuraminidase protein and footage from the actual drug binding simulation.

From the article cited below:

Supercomputers routinely assist in emergency weather forecasting, earthquake predictions, and epidemiological research. Now, says Schulten, they are proving their usefulness in biomedical crises.

“It’s a historic moment,” he said. “For the first time these supercomputers are being used for emergency situations that require a close look with a computational tool in order to shape our strategy.”

Find more details at Inside the Swine Flu Virus (found via this HPCwire article).

Sequoia: 20 Petaflops, 1.6 million cores, 1.6 Petabytes RAM, 6 Megawatts

By Gary Stiehr, February 5, 2009 11:50 pm

IBM has won a contract to build a supercomputer, called Sequoia, for the DOE’s NNSA. It is expected to be installed and brought online during 2011 and 2012. It will have 1.6 million cores (from potentially 16-core chips) within 96 racks (in about 3,400 sq. ft.). It will have around 1.6 Petabytes of memory and achieve about 20 Petaflops. It will require about 6 million watts of power to operate, which works out to around 3.3 billion operations per second per watt–very impressive. I wonder if that includes the power needed for the cooling system. And is that when the processors are at 100% or when the system is idle?
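
The operations-per-watt figure is just the quoted performance divided by the quoted power. A quick Python check, using only the approximate numbers above (the variable names are mine):

```python
peak_ops_per_second = 20e15   # about 20 Petaflops, as quoted above
power_watts = 6e6             # about 6 million watts, as quoted above

ops_per_second_per_watt = peak_ops_per_second / power_watts
print(f"{ops_per_second_per_watt / 1e9:.2f} billion operations per second per watt")
```

Since both inputs are announced estimates, the result is only as good as they are.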

At 1.6 PB of memory for 1.6 million cores, that is a relatively low amount of memory per core (about 1 GB). If the memory is doubled, for example, the system may require a few more megawatts of power. This is based on very rough estimates of the power needed per GB of memory in some recent commodity clusters. Do you have any hard numbers on power per GB of memory today? Any information on the type of memory that might be used in Sequoia?
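
Here is a minimal Python sketch of that estimate. The memory and core counts come from the announcement; the watts-per-GB value is purely a hypothetical placeholder of mine (not a measured number, and not anything IBM has stated), so swap in a real figure if you have one.

```python
memory_gb = 1.6e6   # 1.6 PB expressed in decimal GB
cores = 1.6e6       # 1.6 million cores

gb_per_core = memory_gb / cores
print(f"{gb_per_core:.1f} GB of memory per core")


def extra_power_megawatts(added_gb, watts_per_gb):
    """Additional power in MW for added memory, given an assumed W/GB figure."""
    return added_gb * watts_per_gb / 1e6


# HYPOTHETICAL placeholder, for illustration only: 2 W per GB of memory.
assumed_watts_per_gb = 2.0

# Doubling the memory means adding another 1.6 PB.
extra_mw = extra_power_megawatts(memory_gb, assumed_watts_per_gb)
print(f"doubling memory adds about {extra_mw:.1f} MW at {assumed_watts_per_gb} W/GB")
```

The result scales linearly with the W/GB assumption, so halving the placeholder halves the extra megawatts.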

For more information, see IBM to send blazing fast supercomputer to Energy Dept. and/or U.S. taps IBM for 20 petaflops computer.

Video Analytics Appliance Clusters for Business Intelligence

By Gary Stiehr, January 14, 2009 9:51 pm

LightHaus Logic offers video analytics appliances aimed at analyzing store security camera video to help assess customer behavior in stores. The appliances are multi-processor nodes that can process five to twenty video streams. Adding more appliances into a cluster can increase capacity. The end result is supposed to be “to condense hundreds or thousands of hours of video into actionable intelligence.” The HPCwire article has more information.

Two other recent posts talked about using HPC for other types of video processing:

CultureVis: HPC in the Humanities

By Gary Stiehr, January 13, 2009 2:28 am

CultureVis is a growing number of projects using information visualization to graph cultural patterns, relationships, and dynamics.  According to this article, they have been awarded 330,000 hours of time “to explore the full potential of cultural analytics in a project on ‘Visualizing Patterns in Databases of Cultural Images and Video.’ The grant is one of three inaugural awards from a new Humanities High Performance Computing Program established jointly by DOE and NEH.”

It is an interesting new application of HPC.  Do you know of other humanities projects utilizing HPC?

New ATLAS cluster at the Max Planck Institute for Gravitational Physics

By Gary Stiehr, July 3, 2008 1:20 am

HPCwire’s article Gravity Attracts a GigE HPC Cluster describes some of the features of the new ATLAS cluster at the Max Planck Institute for Gravitational Physics. The 144 10-GigE port non-blocking switch from Woven was a technical feature that stood out. Additionally, it would be interesting to find out what file system is being used on the 42 storage nodes.

Also, the article mentions “An additional 500 GB of direct-connected storage is provided on each compute node. The CPU on any server can access the local disk storage on any other server as well as the central storage nodes.” I wonder in what way that local disk space is made available to the other servers.

Can we better understand creativity using HPC?

By Gary Stiehr, June 16, 2008 11:27 pm

A recent blog entry about The Neuroscience of Creativity lists some interesting questions, such as: what does the brain look like when it’s being creative? Or when it’s listening to music?

I looked around and found a few sites, such as the NeuroGrid site, that describe how brain activity data can be collected and the analysis of that data then distributed across a computing grid.

Perhaps the questions asked above about creativity will be answered using HPC?
