Is a 50 Petaflop Supercomputer Coming Soon?

June 5, 2011 6:10 pm

With their recently announced XK6 system, Cray states that they can scale their system to 500,000 scalar cores (16-core 64-bit AMD Opteron 6200 Series processors) and to 50 petaflops of peak hybrid performance when coupled with NVIDIA Tesla X2090 GPUs.  Each cabinet will require 45-54 kW of power depending on the configuration (and usage), with cooling provided by either 3,000 cfm of air or Cray’s ECOphlex liquid cooling.  The configuration seems to allow for up to one GPU per CPU with up to 96 CPUs and 96 GPUs per cabinet.  Each cabinet will perform at 70+ teraflops according to the XK6 technical specifications.  There will be one Gemini routing and communications ASIC per two compute nodes with 48 switch ports per Gemini chip (160 GB/s internal switching capacity per chip) to enable a 3-D torus topology.

So what would it take to build a system operating at 50 petaflops based on the Cray XK6?  At 70 teraflops per cabinet, we’d need about 714 cabinets and about 38.6 megawatts of power to achieve 50 petaflops.  But Cray’s XK6 specifications page states 70+ teraflops per cabinet, so perhaps fewer cabinets and less energy would be needed.  For comparison, scaling the numbers by 2.5 for the 1.6 million-core Sequoia supercomputer, a 50 petaflop machine made from Sequoia’s hardware would need 240 racks and consume 15 megawatts of power.
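
The cabinet and power arithmetic above can be sketched in a few lines of Python; the 54 kW figure assumes the upper end of the quoted per-cabinet power range, so this is illustrative rather than vendor-verified:

```python
# Back-of-the-envelope sizing of a 50-petaflop Cray XK6 system using the
# figures quoted above; illustrative only, not vendor-verified.
import math

TARGET_PFLOPS = 50
TFLOPS_PER_CABINET = 70   # "70+ teraflops" per the XK6 specifications
KW_PER_CABINET = 54       # upper end of the 45-54 kW range

cabinets = math.ceil(TARGET_PFLOPS * 1000 / TFLOPS_PER_CABINET)
power_mw = cabinets * KW_PER_CABINET / 1000
print(f"{cabinets} cabinets, {power_mw:.1f} MW")  # 715 cabinets, 38.6 MW
```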

Oak Ridge National Lab (ORNL) will be building a 20 petaflop machine, called Titan, based on the Cray XK6.  HPCWire discusses Titan and how its design compares to the 20 petaflop Sequoia design.

Carbon-14 Decay Rate mystery understood using 30 million CPU hours on Jaguar supercomputer

May 27, 2011 11:33 pm

Carbon dating is used to help determine the age of artifacts, dinosaur bones and so on.  Carbon-14’s nearly 6,000-year half-life is a key reason why this method of calculating age is thought to be accurate.  Until recently, however, scientists didn’t understand why carbon-14’s half-life was so long when other light atomic nuclei can have half-lives of minutes or seconds.  Nuclear physicists at Iowa State University used the Jaguar supercomputer at ORNL to solve this mystery.  According to an article describing the work:

The reason involves the strong three-nucleon forces (a nucleon is either a neutron or a proton) within each carbon-14 nucleus. It’s all about the simultaneous interactions among any three nucleons and the resulting influence on the decay of carbon-14. And it’s no easy task to simulate those interactions.

This simulation required 30,000,000 CPU hours on the Jaguar supercomputer, which has a peak performance of 2.3 petaflops.  The calculation involved a 30 billion x 30 billion matrix with 30 trillion non-zero elements.  The article mentions that the code needed to be adapted to scale properly, and that this was “six months of work pressed into three months of time.”  At 30,000,000 CPU hours over three months, that would require 13,889 processors utilized at 100% efficiency.
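
The processor-count figure is easy to check (a sketch that assumes a 30-day month and round-the-clock utilization):

```python
# How many processors does 30 million CPU-hours over three months imply,
# assuming 100% utilization around the clock?
cpu_hours = 30_000_000
hours_per_month = 30 * 24   # approximate a month as 30 days
processors = cpu_hours / (3 * hours_per_month)
print(round(processors))    # 13889
```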

A related previous post about Jaguar: 147,000 Processors Used for Atom-by-Atom Simulation of Nanoscale Transistor

Amazon’s HPC cloud offering: “Cluster Compute Quadruple Extra Large” instances

July 13, 2010 11:54 pm

In August 2009, Penguin Computing announced “Penguin on Demand” (POD), which they deemed to be HPC in the cloud.  It amounted to remotely accessing their pre-installed cluster and submitting your jobs; virtual machine images were not an option with POD at the time.  Today, Amazon announced its Cluster Compute Instances (CCIs) for EC2, which offer the ability to boot Linux-based VMs on a new “Cluster Compute Quadruple Extra Large” instance type to form virtual clusters with better performance characteristics than previously available EC2 instance types.

The main differences from other EC2 instances:

  • if you boot multiple CCIs, the instances will be more closely linked together to offer lower inter-node network latencies with full-bisection 10 Gb/s bandwidth.
  • you will be able to identify the processor architecture so your code can be tuned appropriately

From Amazon’s HPC applications page:

The Cluster Compute instance family currently contains a single instance type, the Cluster Compute Quadruple Extra Large with the following specifications:

23 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cc1.4xlarge

As of the announcement on July 13:

  • the cost per instance was USD$1.60 per hour (if purchased on-demand per hour) or USD$0.56 per hour if a 1-year or 3-year reserved instance is purchased.
  • only Linux VMs are supported on these instances
  • by default, up to eight of these instances (for a 64-core cluster) are available without needing to fill out a special request form.
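
At those rates, the cost of the default maximum cluster is easy to tally (a sketch using only the compute prices quoted above; actual bills would add storage and data transfer fees):

```python
# Hourly cost of the default-limit cluster: 8 cc1.4xlarge instances (64 cores).
ON_DEMAND_RATE = 1.60  # USD per instance-hour, on demand
RESERVED_RATE = 0.56   # USD per instance-hour with a reserved instance
INSTANCES = 8

print(f"on-demand: ${ON_DEMAND_RATE * INSTANCES:.2f}/hour")  # $12.80/hour
print(f"reserved:  ${RESERVED_RATE * INSTANCES:.2f}/hour")   # $4.48/hour
```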

Here are a couple of articles that point to some cluster management providers specifically geared toward provisioning/interfacing with EC2-based virtual clusters:

It is interesting that the instances are listed as providing 33.5 EC2 Compute Units.  I wonder what method Amazon uses to establish these measurements.  Since the new Cluster Compute Quadruple Extra Large instances let you know the processor architecture, you may realize a higher effective number of Compute Units depending on the extent to which your code benefits from compiler optimizations targeting the specific Nehalem cores.

If you’ve run across other helpful articles with more details, please do leave a comment with the info.

Exploring Cloud Computing for Genomics

March 31, 2010 6:48 am

From March 31 to April 1, the National Human Genome Research Institute (NHGRI) is holding a meeting to explore the use of cloud computing for the storage, management, and analysis of genomic data, including the computing issues and the implementation for biological analyses.  I wanted to make this site available for those who will not be in attendance but who want to contribute (post comments here or via Twitter (@garystiehr)).

The agenda contains presentations and discussion surrounding some specific topics:

  • an overview of various cloud offerings as well as a survey of some genomics-specific cloud computing pilot projects
  • an overview of some of the associated technical questions:
    • transmitting large genomic data sets
    • computational architecture of cloud environments
    • cloud security and relevant NIH data privacy policies

Further, a general discussion session will be held regarding supporting genomic analyses using cloud computing:

  • What are the key challenges for genomic analyses within a cloud?
  • Under what conditions are cloud environments better than local clusters of computers?
  • What types of analyses of large genomic datasets are clouds appropriate for currently and in the near future?
  • What would be needed for cloud environments to work better for genomic analyses?
  • What features of a cloud environment are appropriate for genome repositories to use?
  • What alternatives to cloud environments exist or may be developed, and under what conditions would these be better than clouds?
  • Are data standards needed – if so, what areas are they needed?
  • What best practices are needed?

Please join in the discussion on the topics above and help identify whether there are principles from within the cloud computing arena that can be used to advance the state of genomics research.  Post comments here or via Twitter (@garystiehr).

Benchmarking the Cloud for Genomics

November 22, 2009 2:06 am

Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.

– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)

Page numbers and table references provided below are the page numbers within the provisional PDF, which was the available format at the time of writing this post.

Lately, I’ve seen a number of papers and announcements regarding running genomics and other bioinformatics applications in the cloud (usually meaning Amazon’s EC2 and S3 services).  These projects are providing valuable empirical data about costs and run times in the cloud.  To help in that effort, I thought I’d take a look at the latest example of such published results.  “Searching for SNPs with cloud computing” was published on November 20 (Genome Biology 2009, 10:R134) and describes the use of the Crossbow package (requiring Hadoop) on local, non-cloud resources as well as on Amazon’s cloud resources.

So how did the application perform in the cloud versus on the local computing resources?  Within the paper, we find that the 320-CPU cluster mentioned above comprises 40 “Extra Large High CPU Instances” (virtual machines, each with eight cores (approximately 2.5-2.8 GHz), 7 GB of memory and 1690 GB of “instance storage”) upon which Hadoop was running (p. 7).  The running (wall) time of the application with the given data set was 2 hours and 53 minutes (Table 3).  Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results).  The application was also run locally in the authors’ lab via Hadoop using a cluster of ten nodes, each with 4 GB of memory, 366 GB of local disk space and four 3.2 GHz Intel Xeon cores (40 cores total).  The running (wall) time on the local cluster was “about” one day (p. 7).

A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):

  • The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
  • The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
  • The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
  • Having 40 non-virtualized 3.2 GHz cores take one day to run but 80 virtualized “2.5 to 2.8 GHz” cores take 6 hours and 30 minutes for the application using the same version of Hadoop does not quite add up, which implies differences in the architecture of the infrastructure used for testing:
    • does the increased disk space available for Hadoop’s HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of “instance storage” per EC2 node)
    • is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
    • is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
  • The Amazon wall-clock timings were run once each, and the number of timings on the local cluster is not specified.
    • also, a timing on the local cluster of “about” one day is a little vague.
  • The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
    • the terminology related to time (e.g., wall vs. CPU) was not necessarily traditional.  Perhaps they use “CPU time” to generically mean time on a computer rather than the time the processor spent executing instructions on behalf of the application?  In any case, the statement distorts the timings and/or terminology:
      • The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall_hours
      • The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall_hours
      • The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall_hours
      • The local 40-core run took “about” one day (I’ll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall_hours
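
The core-hour tallies above are simply wall-clock time multiplied by core count; as a sketch (treating “about one day” as exactly 24 hours):

```python
# Core-hours consumed by each Crossbow run: wall-clock hours * cores.
runs = {
    "EC2 80-core":   (6.5, 80),
    "EC2 160-core":  (4.55, 160),
    "EC2 320-core":  (2.88, 320),
    "local 40-core": (24.0, 40),   # "about" one day assumed to be 24 h
}
for name, (wall_hours, cores) in runs.items():
    print(f"{name}: {round(wall_hours * cores)} core-hours")
```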

Aside from the above, the data gives some interesting starting points surrounding timings (including data transfer timing).  As for the pricing, the $85 mentioned covers computation only; there is another $35.50 in data transfer-related fees, and it is not clear whether S3 fees came into play during the tests.  Further, note the premium ($20.96 per hour, or 360%) paid to finish 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than only 80.  Perhaps I’ll address that in a future post.

Duty Cycle of SATA Drives

September 4, 2009 12:24 am

One definition of duty cycle specific to disk drives is this excerpt from a patent description (designed to enforce certain duty cycles):

The disk drive duty cycle can be expressed as a ratio=Ta/t for a given time period t where Ta is the amount of time the disk drive is actively processing read/write commands during the time period. For example, if during a time period t=60 seconds, the disk drive was actively processing read/write commands for a collective total of Ta=15 seconds, then the average duty cycle for that time period would be 25%.

Some SATA drives out there claim a duty cycle of 100% (24×7 usage) but others are rated at less than 100%.  What happens if you exceed the duty cycle?  Drives fail more often than the stated MTBF/MTTF would suggest.  But at what rate?

How can you tell, though, to what extent you are using your drives in terms of a duty cycle?  For example, suppose you have a general-purpose HPC cluster that accesses data through a clustered file system across multiple arrays of hundreds of disks.  I/O patterns will vary.  Some weeks you may pull 2 GB/s from your disk array while other weeks you may only pull 750 MB/s.  On the other hand, some weeks you may be pulling 40,000 IOPS from your disk arrays while other weeks you are at 20,000 IOPS.

So to determine your actual utilization, would you add up the total IOPS capacity of the SATA drives in your disk array and then track the IOPS actually performed by that array?  You could take the duty cycle ratio to be the IOPS performed divided by the maximum possible IOPS and see whether that ratio is within the threshold for the type of SATA drives you have in your array.  Perhaps you take a monthly average?  Or would you need to consider the amount of time spent reading/writing; that is, would a single I/O reading/writing more data sequentially count more against the duty cycle than a smaller single I/O?
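
A minimal sketch of that IOPS-based approach (the per-drive IOPS figure is an assumption, and this ignores the sequential-versus-random question just raised):

```python
def iops_duty_cycle(observed_iops, max_iops):
    """Approximate duty cycle as the fraction of available IOPS consumed."""
    return observed_iops / max_iops

# Hypothetical array: 300 SATA drives at ~80 random IOPS each (an assumed
# figure for 7200 rpm drives), observed averaging 20,000 IOPS.
max_array_iops = 300 * 80
print(f"{iops_duty_cycle(20_000, max_array_iops):.0%}")  # 83%
```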

On the other hand, some sources have indicated that drive reliability differences (FC vs. SATA, for example) are not so much due to duty cycles related to drive mechanics but to the possibility of bit errors during rebuilds of RAID sets.  Some FC drives take more measures to correct errors compared to SATA drives, making bit errors perhaps 100 times more likely on SATA drives than on FC drives (1 in 10^14 bits vs. 1 in 10^16 bits).

Here is some other info I’m taking a look at in regard to drive reliability issues:

Penguin Computing POD: HPC in the Cloud

August 12, 2009 1:00 am

Penguin Computing is offering a service called “Penguin on Demand” or “POD” where customers can ssh into Penguin’s clusters and pay for CPU-core-hours.  Citing performance concerns, they are forgoing the use of virtualization, which is often a key component of a cloud computing offering.  As of mid-2009, I/O does take a hit in virtualized environments, but upcoming PCI-IOV standards may help (see HPC Trends and Virtualization starting at slide 11).  Initially, their cluster will consist of a modest 1,000 cores, but at least some will have access to GPUs as well.

For delivering your data to the cluster, you have the option of overnighting “2 TB hot-swappable drives” to them if it is not feasible to transfer over the Internet.  It’s not clear whether they provide specific drives for which you must have a compatible system, or whether you can use your own eSATA or other external drive.  While that sounds easy enough, I think the loading of data onto these drives and the handling of file systems between the source and destination OS can introduce plenty of delays, especially with poorly chosen external drive interfaces.  Of course, the feasibility of simply transferring over the network should be investigated first.
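
A quick feasibility check along those lines (the link speeds and the 50% effective-throughput factor are assumptions, not POD specifications):

```python
# How long would moving 2 TB take over various WAN links, versus
# overnight-shipping a drive (roughly 24 hours door to door)?
DATA_BITS = 2 * 1e12 * 8   # 2 TB expressed in bits
EFFICIENCY = 0.5           # assume half of line rate is achievable end to end

for mbps in (10, 100, 1000):
    hours = DATA_BITS / (mbps * 1e6 * EFFICIENCY) / 3600
    print(f"{mbps:>4} Mb/s: {hours:,.0f} hours")
```

Under those assumptions, only a well-utilized gigabit link clearly beats the courier.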

From what I’ve been able to read, you’d have an account on Penguin’s cluster and use their queuing system.  It’s also not clear how your OS will be deployed to the cluster (presumably you’d provide an image to them and they’d deploy it using functionality in Scyld’s ClusterWare software).  At this point, it is not clear that you’d be doing much more than using someone else’s cluster and getting billed for it…something that several places have done for some time (e.g., NCSA’s Private Sector Program, R Systems, etc.).

What is needed is a programmatic interface to interact with their cluster with strong auth/authz technologies to allow an organization to seamlessly flex their HPC infrastructure and manage jobs with local apps.  For some disciplines, transferring large data sets may continue to be a barrier to seamless extensions of their infrastructure…or perhaps not with Darkstrand‘s growing presence providing 10-40 Gb/s commercial connectivity to HPC resources.  Amazon provides more web services to move closer to this programmatic interface but as Penguin points out, EC2 is not very well geared for many types of HPC applications (although it is for some).

Penguin does not yet give many details about the storage environment associated with their POD service, only that it is “high-speed”.  NetApp is mentioned in one article about POD.  It seems all nodes have Gigabit Ethernet and/or DDR InfiniBand network fabrics, but the scalability achievable before nodes must be spread across multiple hops to reach each other for inter-process communication is unclear.  It is also not mentioned over which fabric the storage is accessible.

Have you read any additional articles providing any of the missing details?

A Look at the Yahoo Systems Used to Sort a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

July 2, 2009 10:01 pm

The Yahoo! Developer Network Blog reported that Yahoo! was able to sort a petabyte of data in 16.25 hours and a terabyte in 62 seconds using Apache Hadoop.  I was curious what type of hardware they needed to pull this off.  From their post, they mentioned using this hardware (their Hammer cluster):

  • approximately 3800 nodes (in such a large cluster, nodes are always down)
  • 2 quad-core Xeons @ 2.5 GHz per node
  • 4 SATA disks per node
  • 8 GB RAM per node (upgraded to 16 GB before the petabyte sort)
  • 1 gigabit ethernet on each node
  • 40 nodes per rack
  • 8 gigabit ethernet uplinks from each rack to the core

Here are my observations/inferences and questions about this:

  1. They used 95 racks of equipment dedicated to this test (3800 nodes with 40 nodes per rack)
  2. An 8 Gb/s uplink must consist of eight trunked gigabit connections
  3. From 1&2 above, we see they’d need 95 * 8 = 760 Gigabit Ethernet ports at their core.
    • What type of switches do you think they have at the core?
  4. Assuming 90% efficiency on the GigE link, 760 Gb/s over 16.25 hours could push about 5 PB of data.
    • How much data do you think is transferred when sorting 1 PB of data?
  5. 3800 nodes, each with two quad-core Xeon processors, means 30,400 cores were used for this (?!)
    • What percentage of CPU time do you think they used to sort 1 PB of data on 30,400 (?!) cores?
  6. 3800 nodes, each with 16 GB of memory, means 60.8 TB of memory were used to sort 1 PB.
    • How much of the 60.8 TB of memory do you think was required to do the sort of 1 PB of data?
    • Are these Nehalems?  If not, if they were using Nehalem-based systems, do you think the run times would have changed due to increased memory bandwidth?
  7. Assuming “40 nodes per rack” is implying 40 1U servers, I think it is safe to assume that each rack (with 320 cores in 40 systems) uses around 10 kW of power.  With 95 racks like this, it seems this system would require just under 1 megawatt to operate.  Of course the various switches also contribute here.  Assuming $0.04 per kWh, it would have cost about $650 in electricity to do this sort.  This cost may double given the electricity needed to provide the required cooling.
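
The arithmetic behind observations 1, 3, 4 and 7 can be sketched as follows (the 90% link efficiency, ~10 kW per rack and $0.04/kWh figures are the assumptions stated above):

```python
nodes, nodes_per_rack = 3800, 40
racks = nodes // nodes_per_rack             # observation 1: 95 racks
core_ports = racks * 8                      # observation 3: 760 GigE ports

effective_gbps = core_ports * 0.90          # 90% efficient GigE links
data_pb = effective_gbps / 8 * 3600 * 16.25 / 1e6  # GB/s over 16.25 h -> PB

power_kw = racks * 10                       # ~10 kW per rack (assumption)
cost_usd = power_kw * 16.25 * 0.04          # kWh * $/kWh

print(racks, core_ports, round(data_pb, 1), round(cost_usd))
# the ~$650 above comes from rounding the load up to 1 MW for the switches
```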

Ok, so is this right?  Perhaps they meant to say 3800 cores instead of 3800 nodes?  Does Yahoo have a 30,400-core cluster sitting around to do these benchmarks?  I’m assuming this is a cluster that they also use in their day-to-day operations?

DARPA Challenge: billion-way parallelism, 1 PFLOPS system in one rack, 57 kW max power

June 28, 2009 2:37 pm

DOE's Roadrunner Supercomputer

DARPA has issued an RFI (pdf) to help enable what they are calling Ubiquitous High Performance Computing (UHPC).  According to the June 2009 TOP500 supercomputer list, the fastest supercomputer available, Roadrunner, runs at just over 1 PFLOPS.  It uses around 2.5 million watts of electricity and requires around 278 racks of equipment [1].

DARPA would like to fit the same computational power into one air-cooled rack and use no more than 57,000 watts (including cooling).  That’s 100% of Roadrunner’s computational power in 0.4% of the space using 2% of the electrical power.  Also, while the most energy efficient system now achieves 536 Mflops/watt [2], DARPA is looking for 50 Gflops/watt.
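
Those ratios check out; a quick recomputation using the figures above:

```python
# DARPA UHPC target versus Roadrunner, per the numbers above.
roadrunner_racks, roadrunner_watts = 278, 2_500_000
target_racks, target_watts = 1, 57_000

print(f"space: {target_racks / roadrunner_racks:.1%}")   # 0.4%
print(f"power: {target_watts / roadrunner_watts:.1%}")   # 2.3%
print(f"efficiency gain: {50_000 / 536:.0f}x")           # 536 MF/W -> 50 GF/W
```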

What’s more, they would like to minimize the overhead associated with thousand-way to billion-way parallelism.  Why billion-way parallelism?  I suppose this implies an anticipation of systems containing billions of execution units.  This may not be unreasonable.  For example, take a look at the proposed Sequoia supercomputer, which is slated to include over one million cores.

Beyond these astounding requirements, there are also requests for a “Self Aware OS” that is introspective, goal-oriented, adaptive, self-healing and approximate.  I’d recommend reading page eight of the RFI above for more details.  The hope is that the system will be able to continue operations in the face of failures and “attack” (see page 4 of RFI).

Well, while the OS and application capabilities will be huge challenges, the restrictions placed on the physical aspects of the hardware are also challenging.  With GPUs and Cell processors leading to increased computations per watt, perhaps we may be able to significantly improve overall system power efficiencies.  In addition, DARPA is looking for this to take place within potentially nine years (proposals are due by July 27, 2009) if it is feasible.  With top supercomputers sometimes becoming more powerful than the 500 most powerful supercomputers combined from four years prior, we can definitely see overall computational ability increasing quickly, but this doesn’t necessarily translate into the required density and energy efficiencies.

Aside from the RFI above, you can read more here or here.  Also, thanks to @HPC_Guru from whom I first heard about this RFI.

The Fastest Supercomputer Became Faster Than the Top 500 Combined Four Years Prior

June 25, 2009 12:11 am

TOP500 Performance over time

After reading a perspective on the latest TOP500 Supercomputer List from @ChrisP_Intel, I took another look at the progress of the systems on the list shown above.  The June 2009 list just released puts the Roadrunner supercomputer in the number one spot with 1105 TFLOPS.  In June 2004, just five years ago, all 500 supercomputers combined summed to 813 TFLOPS, with the most powerful single system being the Earth Simulator at 36 TFLOPS.  So in just five years, a single supercomputer has become more powerful than the 500 most powerful supercomputers from June 2004 combined.

Upon taking a closer look, I saw that Roadrunner was actually in the number one spot in June 2008 at 1026 TFLOPS.  So the top supercomputer on the list in June 2008 was actually faster than all of the top 500 supercomputers combined from four years prior!

Ok, and going back to November 2005, it seems that the #1 system may have been more powerful than the sum of the top 500 supercomputers in November 2002.  So perhaps we are down to three years…I haven’t verified exact numbers though.  Has anyone officially tracked the record for how quickly the #1 supercomputer on the TOP500 list achieved the performance of all of the supercomputers on a previous TOP500 list?
