A Look at the Yahoo Systems Used to Sort a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

By Gary Stiehr, July 2, 2009 10:01 pm

The Yahoo! Developer Network Blog reported that Yahoo! was able to sort a petabyte of data in 16.25 hours and a terabyte in 62 seconds using Apache Hadoop.  I was curious what type of hardware they needed to pull this off.  From their post, they mentioned using this hardware (their Hammer cluster):

  • approximately 3800 nodes (in such a large cluster, nodes are always down)
  • 2 quad core Xeons @ 2.5ghz per node
  • 4 SATA disks per node
  • 8G RAM per node (upgraded to 16GB before the petabyte sort)
  • 1 gigabit ethernet on each node
  • 40 nodes per rack
  • 8 gigabit ethernet uplinks from each rack to the core

Here are my observations/inferences and questions about this:

  1. They used 95 racks of equipment dedicated to this test (3800 nodes with 40 nodes per rack)
  2. An 8 Gb/s uplink must consist of eight trunked gigabit connections
  3. From 1&2 above, we see they’d need 95 * 8 = 760 Gigabit Ethernet ports at their core.
    • What type of switches do you think they have at the core?
  4. Assuming 90% efficiency on the GigE link, 760 Gb/s over 16.25 hours could push about 5 PB of data.
    • How much data do you think is transferred when sorting 1 PB of data?
  5. 3800 nodes were used each with two quad core Xeon processors means 30,400 cores were used for this (?!)
    • What percentage of CPU time do you think they used to sort 1 PB of data on 30,400 (?!) cores?
  6. 3800 nodes were used each with 16 GB of memory means 60.8 TB of memory were used to sort 1 PB.
    • How much of the 60.8 TB of memory do you think was required to do the sort of 1 PB of data?
    • Are these Nehalems?  If not, if they were using Nehalem-based systems, do you think the run times would have changed due to increased memory bandwidth?
  7. Assuming “40 nodes per rack” is implying 40 1U servers, I think it is safe to assume that each rack (with 320 cores in 40 systems) uses around 10 kW of power.  With 95 racks like this, it seems this system would require just under 1 megawatt to operate.  Of course the various switches also contribute here.  Assuming $0.04 per kWh, it would have cost about $650 in electricity to do this sort.  This cost may double given the electricity needed to provide the required cooling.

Ok, so is this right?  Perhaps they meant to say 3800 cores instead of 3800 nodes?  Does Yahoo have a 30,400 core cluster sitting around to do these benchmarks?  I’m assuming this a cluster that they also use in their day-to-day operations?

    DARPA Challenge: billion-way parallelism, 1 PFLOPS system in one rack, 57 kW max power

    By Gary Stiehr, June 28, 2009 2:37 pm
    DOE's Roadrunner Supercomputer

    DOE's Roadrunner Supercomputer

    DARPA has issued an RFI (pdf) to help enable what they are calling Ubiquitous High Performance Computing (UHPC).  According to the June 2009 TOP500 supercomputer list, the fastest supercomputer available, Roadrunner, runs at just over 1 PFLOP.  It uses around 2.5 million watts of electricity and requires around 278 racks of equipment [1].

    DARPA would like to fit the same computational power into one air-cooled rack and use no more than 57,000 watts (including cooling).  That’s 100% of Roadrunner’s computational power in 0.4% of the space using 2% of the electrical power. Also, while the most energy efficient system now achieves 536 Mflops/watt [2], DARPA is looking for 50 Gflops/watt.

    What’s more, is that they would like to minimize the overhead associated with thousand-way to billion-way parallelism.  Why billion-way parallelism?  I suppose this implies an anticipation of systems containing billions of execution units.  This  may not be unreasonable.  For example, take a look at the proposed Sequoia supercomputer, which is proposed to include one million cores.

    Beyond these astounding requirements, there are also requests for a “Self Aware OS” that is introspective, goal-oriented, adaptive, self-healing and approximate.  I’d recommend reading page eight of the RFI above for more details.  The hope is that the system will be able to continue operations in the face of failures and “attack” (see page 4 of RFI).

    Well, while the OS and application capabilities will be huge challenges, the restrictions put on the physical aspects of the hardware are also challenging.  With GPUs and Cell processors leading to increased computations per watt, perhaps we may be able to significantly improve overall system power efficiencies.  In addition, DARPA is looking for this to take place potentially in 9 years (proposals are due by July 27, 2009) if it is feasible.  With top supercomputers sometimes becoming more powerful than the 500 most powerful supercomputers combined from four years prior, we can definitely see overall computational ability increase quickly but this doesn’t necessarily translate into the density and energy efficiencies.

    Aside from the RFI above, you can read more here or here.  Also, thanks to @HPC_Guru from whom I first heard about this RFI.

    The Fastest Supercomputer Became Faster Than the Top 500 Combined Four Years Prior

    By Gary Stiehr, June 25, 2009 12:11 am

    TOP500 Performance over time

    After reading a perspective of the latest TOP500 Supercomputer List from @Chris P_Intel I took another look at the progress of the systems on the list shown above.  The June 2009 list just released puts the RoadRunner supercomputer in the number one spot with 1105 TFLOPS.  In June 2004, just five years ago, all 500 supercomputers combined summed to 813 TFLOPS, with the most powerful single system being the Earth-Simulator at 36 TFLOPS.  So in just five years, a single supercomputer has become more powerful than the 500 most powerful supercomputers from June 2004.

    Upon taking a closer look, I saw that RoadRunner was actually in the number one spot in June 2008 at 1026 TFLOPS.  So the top supercomputer on the list in June 2008  was actually faster than all of the top 500 supercomputers combined from four years prior!

    Ok, and going back to November 2005, it seems that the #1 system may have been more powerful than the sum of the top 500 supercomputers in November 2002.  So perhaps we are down to three years…I haven’t verified exact numbers though.  Has anyone officially tracked the record for how quickly the #1 supercomputer on the TOP500 list had achieved the performance of all of the supercomputers on a previous TOP500 list?

    147,000 Processors Used for Atom-by-Atom Simulation of Nanoscale Transistor

    By Gary Stiehr, June 23, 2009 1:29 am

    Using 147,000 processors for 15 minutes from the Jaguar system (a Cray XT5) at the Oak Ridge Leadership Computing Facility, “a simulation of electrical current moving through a futuristic electronic transistor has been modeled atom-by-atom in less than 15 minutes by Purdue University researchers.”

    “Professor Klimeck and his colleague have demonstrated the unique transformational scientific opportunity that comes from scaling a science application to fully exploit the capabilities of petascale systems like the Cray XT5 at the Oak Ridge Leadership Computing Facility,” Kothe says.

    Freely available nanoelectrics software (OMEN) was used from nanoHUB.org to do this simulation.  I am curious about how else this could be applied.  What other nanostructures might we be able to simulate in this way?

    For more information, see the source article.

    Illumina Offers $48,000 Personal Genome Sequencing–How Will Data be Handled?

    By Gary Stiehr, June 12, 2009 11:03 pm
    dna_transcription

    A depiction of the structure of DNA

    Illumina will offer a service to sequence a person’s genome for $48,000 (a doctor’s prescription is required).  Note that this is only the sequencing and not the actual analysis of that sequence data.  The consumer must choose from a few different providers to do the actual analysis of the genome sequence data.  Currently, the representation of a human genome as Illumina is proposing (30-fold coverage of your DNA sequence) would require the transfer of terabytes of data to the company doing the analysis.  Of course, there are various parts to “analysis” so depending on where Illumina stops and the other companies take over, this actually could be a lot less data (e.g., gigabytes).

    So this raises at least a couple of possible challenges for Illumina:

    • How will the data be transferred?
    • How will the data be secured?

    Transferring the Data

    One could see that data transfer of on the order terabytes of data would not be a problem if the turnaround time is long enough.  Although if the service becomes more and more popular, scaling may be a problem (or at least synchronizing network abilities with analysis providers).  Nevertheless, will Illumina establish encrypted network connections with the consumer’s/doctor’s chosen analysis provider?  Will they transfer the data encrypted on external hard drives?  If on external hard drives, how will tracking of the multiple pieces be tracked?

    Securing the Data

    I’m assuming the security/encryption questions may have answers based off of current electronic health records implementations although I’m not sure if electronic patient information systems are typically interconnected between different health care organizations.  That is, aren’t these systems usually secured/confined within the network of a particular health care organization?  If it is placed on external hard drives and shipped, would the encryption of terabytes of data per patient be challenging?

    Using HPC to Understand Swine Flu

    By Gary Stiehr, June 8, 2009 11:29 pm

    Here is another great example of how an HPC site can function as a versatile resource for a wide variety of problem domains.  A priority queue was setup on TACC’s Ranger cluster to provide 2,000 to 3,000 processors for two weeks to allow a team to assess the way in which the underlying structure of the Swine Flue virus (H1N1A) has or could mutate and lead to drug resistance.  With this data, “they believe it will be possible to intelligently design a drug or vaccine that can’t be resisted.”

    This still from a Quicktime movie represents a view of the drug buried in the binding pocket of the A/H1N1 neuraminidase protein. The animation also shows a 3D surface view of a neuraminidase protein and footage from the actual drug binding simulation.

    This still from a Quicktime movie represents a view of the drug buried in the binding pocket of the A/H1N1 neuraminidase protein. The animation also shows a 3D surface view of a neuraminidase protein and footage from the actual drug binding simulation.

    From the article cited below:

    Supercomputers routinely assist in emergency weather forecasting, earthquake predictions, and epidemiological research. Now, says Schulten, they are proving their usefulness in biomedical crises.

    “It’s a historic moment,” he said. “For the first time these supercomputers are being used for emergency situations that require a close look with a computational tool in order to shape our strategy.”

    Find more details at Inside the Swine Flu Virus (found via this HPCwire article).

    Scalable Staging of Large Datasets to Many Compute Nodes

    By Gary Stiehr, April 9, 2009 11:46 pm

    At The Genome Center at Washington University, we are seeing an ever increasing need to align against various reference sequences.  In many cases, hundreds of nodes at time need to access the same input file (e.g., the appropriate reference sequence database).  The size of the file varies depending on the organism and the aligner being used but, in aggregate for hundreds of copies, a terabyte or more might be requested at the same instant.  At startup, all of the jobs grab the same input file at once, which can put a significant toll on our NFS servers and the other unrelated jobs also using the NFS servers.  In some instances, we wanted to copy the input dataset permanently to the local disk on the computational nodes.  However, we can not do that for all possible inputs.

    In the past, I had used a tool called rgang (doesn’t seem to be available for download anymore) to distribute files using a distribution tree (e.g., one node would transfer to five others, which in turn would each transfer to five each and so on).  Alternatives to that were other peer-to-peer distribution methods that could ease the burden on the centralized NFS servers while better leveraging the bandwidth available in the cluster’s network switches.

    When hearing peer-to-peer many people thought of using the bittorrent protocol so I decided to take a look to see if anyone had applied that to staging large datasets to many compute nodes.  I found that this had been studied in several cases for some years.  See some of the bittorrent links I ran across, especially ones related to data distribution in clusters.  While I had seen bittorrent used in some versions of ROCKS and SystemImager for OS deployment to cluster nodes, I hadn’t seen it used directly for distributing large datasets to compute nodes.  We’ll continue to look into using bittorrent to see if we might be able to decrease the I/O wait time associated with many nodes needing the same input file at the same time.

    Open Innovation in Cancer Research

    By Gary Stiehr, February 18, 2009 2:43 am

    Scientific research often benefits from open innovation.  While there are many examples, I am particularly excited to see what happens in the area of cancer genomics. The Genome Center at Washington University published the results of sequencing the first cancer genome back in November 2008.  Internally, there was collaboration between departments in the School of Medicine resulting in innovative analyses and leading to more discoveries.  Since then I’ve read and heard about a number of similar or follow up projects at varioius institutions.  As data is shared amongst researchers across the world, new collaborations will be formed.  The innovations resulting from these collaborations will hopefully result in better treatments for cancer.

    A Human Genome Per Day? The Genome Center at Washington University Scales Up on Illumina Sequencers

    By Gary Stiehr, February 6, 2009 2:47 pm

    We at The Genome Center at Washington University were happy to get official word that we will be adding an additional 21 Illumina Genome Analyzers to our portfolio of sequencing technology.  That enables us to sequence enough DNA to be equivalent to an entire human genome per day (at 25x coverage).  There is a lot of excitement about the potential such capacity brings.  The Genome Center’s director had this to say:

    “Our intention to substantially scale-up with this technology reflects our commitment to large-scale sequencing projects that aim to uncover the underlying genetic basis of various human diseases. With the rapid decline in the cost of whole-genome sequencing, we believe now is the time to embark on initiatives which were previously not possible,” said Richard K. Wilson, Ph.D., Professor of Genetics and Director of the Genome Center at Washington University. “We are confident that we can further reduce the cost and accelerate the rate of human genome sequencing.”

    A scale up of sequencing capacity brings a scale up in IT capacity.  We’ll be watching our internal network, disk and HPC resources and scaling as appropriate.  It will be likely that these sequencers alone will generate upwards of 20 TB of data per day, which needs further processing on The Genome Center’s computational resources.  I’m excited about the possibilities that this scale up will bring!

    Sequoia: 20 Petaflops, 1.6 million cores, 1.6 Petabytes RAM, 6 Megawatts

    By Gary Stiehr, February 5, 2009 11:50 pm

    IBM has won a contract to build a supercomputer, called Sequoia, for the DOE’s NNSA.  It is estimated to be installed and brought online in 2011 and 2012.  It will have 1.6 million cores (from potentially 16-core chips) within 96 racks (in about 3,400 sq. ft.).  It will have around 1.6 Petabytes of memory and achieve about 20 Petaflops.  It will require about 6 million watts of power to operate, which is around 3.3 billion operations per second per watt–very impressive.  I wonder if that includes the power needed for the cooling system.  And is that when the processors are at 100% or when the system is idle?

    At 1.6 PB of memory for 1.6 million cores, that is a relatively low amount of memory per core.  If the memory is doubled, for example, the system may require a few more megawatts of power.  This is based off of very rough estimates of power needed per GB of memory based off of some recent commodity clusters.  Do you have any hard numbers on power per GB of memory today?  Any information on the type of memory that might be used in Sequoia?

    For more information, see IBM to send blazing fast supercomputer to Energy Dept. and/or U.S. taps IBM for 20 petaflops computer.

    Panorama theme by Themocracy