The Yahoo! Developer Network Blog reported that Yahoo! was able to sort a petabyte of data in 16.25 hours and a terabyte in 62 seconds using Apache Hadoop. I was curious what type of hardware they needed to pull this off. From their post, they mentioned using this hardware (their Hammer cluster):
- approximately 3800 nodes (in such a large cluster, some nodes are always down)
- 2 quad-core Xeons @ 2.5 GHz per node
- 4 SATA disks per node
- 8 GB RAM per node (upgraded to 16 GB before the petabyte sort)
- 1 gigabit ethernet on each node
- 40 nodes per rack
- 8 gigabit ethernet uplinks from each rack to the core
Here are my observations/inferences and questions about this:
- They used 95 racks of equipment dedicated to this test (3800 nodes with 40 nodes per rack)
- An 8 Gb/s uplink must consist of eight trunked gigabit connections
- From the two points above, we see they’d need 95 * 8 = 760 Gigabit Ethernet ports at their core.
- What type of switches do you think they have at the core?
- Assuming 90% efficiency on the GigE link, 760 Gb/s over 16.25 hours could push about 5 PB of data.
- How much data do you think is transferred when sorting 1 PB of data?
- 3800 nodes, each with two quad-core Xeon processors, means 30,400 cores were used for this (?!)
- What percentage of CPU time do you think they used to sort 1 PB of data on 30,400 (?!) cores?
- 3800 nodes, each with 16 GB of memory, means 60.8 TB of memory was used to sort 1 PB.
- How much of the 60.8 TB of memory do you think was required to do the sort of 1 PB of data?
- Are these Nehalems? If not, do you think the run times would have changed on Nehalem-based systems due to the increased memory bandwidth?
- Assuming “40 nodes per rack” means 40 1U servers, I think it is safe to assume that each rack (with 320 cores in 40 systems) uses around 10 kW of power. With 95 racks like this, the system would require just under 1 megawatt (about 950 kW) to operate, before counting the various switches. Assuming $0.04 per kWh and rounding up to 1 MW, the 16.25-hour sort would have cost about $650 in electricity. This cost may double given the electricity needed to provide the required cooling.
OK, so is this right? Perhaps they meant to say 3800 cores instead of 3800 nodes? Does Yahoo! have a 30,400-core cluster sitting around to do these benchmarks? I’m assuming this is a cluster that they also use in their day-to-day operations.
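As a rough check, the estimates in the list above can be reproduced with a few lines of Python. This is only a sketch: the 90% link efficiency, 10 kW per rack, and $0.04 per kWh are the guesses stated above rather than measured values, decimal units are assumed throughout, and the function and constant names are made up for illustration.

```python
# Figures quoted from the Yahoo! post
NODES = 3800
NODES_PER_RACK = 40
UPLINKS_PER_RACK = 8        # 1 Gb/s each
CORES_PER_NODE = 2 * 4      # two quad-core Xeons
RAM_GB_PER_NODE = 16
SORT_HOURS = 16.25

# Assumptions made in this post, not measured values
LINK_EFFICIENCY = 0.90
KW_PER_RACK = 10
USD_PER_KWH = 0.04


def hammer_estimates():
    """Return the back-of-envelope figures as a dictionary."""
    racks = NODES // NODES_PER_RACK
    core_ports = racks * UPLINKS_PER_RACK
    usable_gbps = core_ports * LINK_EFFICIENCY
    seconds = SORT_HOURS * 3600
    # Gb/s -> GB/s (divide by 8), times seconds gives GB, then GB -> PB
    data_pb = usable_gbps / 8 * seconds / 1e6
    power_kw = racks * KW_PER_RACK
    energy_kwh = power_kw * SORT_HOURS
    return {
        "racks": racks,
        "core_ports": core_ports,
        "data_pb": data_pb,
        "cores": NODES * CORES_PER_NODE,
        "memory_tb": NODES * RAM_GB_PER_NODE / 1000,
        "power_kw": power_kw,
        "electricity_usd": energy_kwh * USD_PER_KWH,
    }


if __name__ == "__main__":
    for name, value in hammer_estimates().items():
        print(f"{name}: {value:,.2f}")
```

At exactly 950 kW the electricity comes to roughly $617; the $650 figure corresponds to rounding the load up to a full megawatt.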
At The Genome Center at Washington University, we are seeing an ever-increasing need to align against various reference sequences. In many cases, hundreds of nodes at a time need to access the same input file (e.g., the appropriate reference sequence database). The size of the file varies depending on the organism and the aligner being used but, in aggregate for hundreds of copies, a terabyte or more might be requested at the same instant. At startup, all of the jobs grab the same input file at once, which can take a significant toll on our NFS servers and on the unrelated jobs that also use those servers. In some instances, we wanted to copy the input dataset permanently to the local disk on the computational nodes. However, we cannot do that for all possible inputs.
In the past, I had used a tool called rgang (doesn’t seem to be available for download anymore) to distribute files using a distribution tree (e.g., one node would transfer to five others, which in turn would each transfer to five more, and so on). Alternatives to that were other peer-to-peer distribution methods that could ease the burden on the centralized NFS servers while better leveraging the bandwidth available in the cluster’s network switches.
When hearing peer-to-peer, many people think of the BitTorrent protocol, so I decided to take a look to see if anyone had applied it to staging large datasets to many compute nodes. I found that this had been studied in several cases for some years. See some of the BitTorrent links I ran across, especially ones related to data distribution in clusters. While I had seen BitTorrent used in some versions of ROCKS and SystemImager for OS deployment to cluster nodes, I hadn’t seen it used directly for distributing large datasets to compute nodes. We’ll continue to look into using BitTorrent to see if we might be able to decrease the I/O wait time associated with many nodes needing the same input file at the same time.
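To make the distribution-tree idea mentioned above concrete, here is a minimal sketch of how such a schedule could be planned. It is not rgang's actual implementation: the function name `tree_schedule`, the host names, and the strict "only the newest receivers send next" rule are assumptions for illustration. It only plans who sends to whom; the copy itself (scp, rsync, or similar) is left out.

```python
def tree_schedule(hosts, fanout=5):
    """Plan a fan-out copy from hosts[0] to every other host.

    Returns a list of rounds; each round is a list of (sender, receiver)
    pairs. In each round, only the hosts that received the file in the
    previous round send, each to at most `fanout` new hosts.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if not hosts:
        return []

    rounds = []
    senders = [hosts[0]]   # the seed host already has the file
    next_index = 1         # index of the next host still waiting for it
    while next_index < len(hosts):
        this_round = []
        new_senders = []
        for sender in senders:
            for _ in range(fanout):
                if next_index >= len(hosts):
                    break
                receiver = hosts[next_index]
                next_index += 1
                this_round.append((sender, receiver))
                new_senders.append(receiver)
        rounds.append(this_round)
        senders = new_senders
    return rounds


if __name__ == "__main__":
    # Hypothetical host names
    plan = tree_schedule([f"node{i:03d}" for i in range(500)], fanout=5)
    for number, transfers in enumerate(plan, start=1):
        print(f"round {number}: {len(transfers)} transfers")
```

With a fan-out of five, 500 nodes are covered in four rounds, and the original source is read only five times instead of 499.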
IBM has won a contract to build a supercomputer, called Sequoia, for the DOE’s NNSA. It is expected to be installed and brought online in 2011 and 2012. It will have 1.6 million cores (from potentially 16-core chips) within 96 racks (in about 3,400 sq. ft.). It will have around 1.6 petabytes of memory and achieve about 20 petaflops. It will require about 6 million watts of power to operate, which works out to around 3.3 billion operations per second per watt. That is very impressive. I wonder if that includes the power needed for the cooling system. And is that when the processors are at 100% or when the system is idle?
At 1.6 PB of memory for 1.6 million cores, that is about 1 GB per core, a relatively low amount. If the memory is doubled, for example, the system may require a few more megawatts of power. This is a very rough estimate, based on the power needed per GB of memory in some recent commodity clusters. Do you have any hard numbers on power per GB of memory today? Any information on the type of memory that might be used in Sequoia?
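The ratios above are easy to recompute. The sketch below uses only the announced figures, in decimal units; the names are made up for illustration. The watts-per-GB value passed to `extra_memory_power_mw` is deliberately left as a parameter because I don't have a hard number for it, and the 1.0 used in the example is a placeholder, not a measurement.

```python
# Announced (approximate) Sequoia figures
CORES = 1.6e6
RACKS = 96
MEMORY_GB = 1.6e6           # 1.6 PB expressed in GB
OPS_PER_SECOND = 20e15      # 20 petaflops
POWER_WATTS = 6e6


def sequoia_ratios():
    """Derived ratios from the announced figures."""
    return {
        "gigaops_per_watt": OPS_PER_SECOND / POWER_WATTS / 1e9,
        "gb_per_core": MEMORY_GB / CORES,
        "cores_per_rack": CORES / RACKS,
        "kw_per_rack": POWER_WATTS / RACKS / 1e3,
        "chips_if_16_core": CORES / 16,
    }


def extra_memory_power_mw(added_gb, watts_per_gb):
    """Extra power in MW for added memory, given an assumed watts-per-GB."""
    return added_gb * watts_per_gb / 1e6


if __name__ == "__main__":
    for name, value in sequoia_ratios().items():
        print(f"{name}: {value:,.2f}")
    # Doubling memory adds another 1.6 PB; 1.0 W/GB is only a placeholder.
    print(f"extra MW at 1 W/GB: {extra_memory_power_mw(MEMORY_GB, 1.0):.1f}")
```

Every watt per GB assumed adds about 1.6 MW when the memory is doubled, so "a few more megawatts" corresponds to somewhere between one and a few watts per GB.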
For more information, see IBM to send blazing fast supercomputer to Energy Dept. and/or U.S. taps IBM for 20 petaflops computer.
HPCwire’s article Gravity Attracts a GigE HPC Cluster describes some of the features of the new ATLAS cluster at the Max Planck Institute for Gravitational Physics. The 144-port, non-blocking 10-GigE switch from Woven was a technical feature that stood out. Additionally, it would be interesting to find out what file system is being used on the 42 storage nodes.
Also, the article mentions “An additional 500 GB of direct-connected storage is provided on each compute node. The CPU on any server can access the local disk storage on any other server as well as the central storage nodes.” I wonder in what way that local disk space is made available to the other servers.
As an update to a previous post, TACC has updated the description for their Ranger cluster. Some highlights: 62,976 cores, Lustre over 72 I/O servers, 82 racks of equipment…
TACC’s HPC Systems page mentions their new Ranger cluster will eventually have 52,608 cores (13,152 AMD “Barcelona” quad-core processors) across 3,288 nodes (mentioned here as four-socket blades) with a total of 105 TB of memory and 1.7 PB of disk. If the total memory is equally distributed, each blade has about 32 GB of memory (that’s 2 GB per core). According to one article, the nodes are interconnected with a 3,456-port InfiniBand system (over three switches?) with a total bandwidth of 110 Tb/s (I wonder how this was calculated?). As for the disk space, it was mentioned that some StorageTek disk was used. I wonder if all 1.7 PB is StorageTek disk or if it also counts the disk internal to each blade (if any)? And how is the StorageTek disk presented to the blades?
Some other things I am wondering:
- What Linux distribution are they using?
- Are these nodes diskless? If not, is the OS installed locally or are they network booted?
- Are they using Sun’s LOM for out-of-band management?
- I’d estimate around 69 racks were needed to house the blades (48 blades per rack). How many racks are used for storage and switches?
- How many staff are dedicated to system administration of this system?
- What batch system or job scheduler is used? MyCluster?
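The per-blade memory and the rack estimate above follow from the published totals. A quick sketch, assuming decimal terabytes, memory spread evenly across blades, and my guess of 48 blades per rack (the names are made up for illustration):

```python
import math

# Figures from TACC's description
CORES = 52_608
CORES_PER_PROCESSOR = 4      # quad-core "Barcelona"
PROCESSORS_PER_BLADE = 4     # four-socket blades
MEMORY_TB = 105

# My assumption, not a published figure
BLADES_PER_RACK = 48


def ranger_estimates():
    """Derive processor, blade, memory, and rack counts from the totals."""
    processors = CORES // CORES_PER_PROCESSOR
    blades = processors // PROCESSORS_PER_BLADE
    memory_gb = MEMORY_TB * 1000
    return {
        "processors": processors,
        "blades": blades,
        "gb_per_blade": memory_gb / blades,
        "gb_per_core": memory_gb / CORES,
        "compute_racks": math.ceil(blades / BLADES_PER_RACK),
    }


if __name__ == "__main__":
    for name, value in ranger_estimates().items():
        print(f"{name}: {value:,.2f}")
```

This gives just under 32 GB per blade, which suggests the 105 TB total is a rounded figure for 32 GB on each of 3,288 blades, and 68.5 racks of blades, rounded up to 69.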
In any case, the system sounds impressive and I look forward to hearing about the first production applications to run on it.