In August 2009, Penguin Computing announced “Penguin on Demand” (POD), which they billed as HPC in the cloud. It amounted to remotely accessing their pre-installed cluster and submitting your jobs; virtual machine images were not an option with POD at the time. Today, Amazon announced Cluster Compute Instances (CCIs) for EC2, which let you boot Linux-based VMs on a new “Cluster Compute Quadruple Extra Large” instance type and link them into virtual clusters with better performance characteristics than previously available EC2 instance types.
The main differences from other EC2 instances:
- if you boot multiple CCIs, the instances will be placed more closely together, offering lower inter-node network latencies and full-bisection 10 Gigabit/s bandwidth (a launch sketch follows this list)
- you will be able to identify the processor architecture so your code can be tuned appropriately
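As a concrete illustration of what launching such a cluster looks like, here is a minimal sketch using the boto3 Python library (which postdates this announcement); the AMI ID and placement-group name are hypothetical:

```python
# Minimal sketch (assumptions: boto3, a hypothetical AMI ID and
# placement-group name; not Amazon's own example).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "cluster" placement group co-locates instances on the low-latency,
# full-bisection 10 GigE fabric described above.
ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # hypothetical Linux AMI
    InstanceType="cc1.4xlarge",
    MinCount=8,                  # the default limit: eight instances,
    MaxCount=8,                  # i.e. a 64-core virtual cluster
    Placement={"GroupName": "hpc-demo"},
)
```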
From Amazon’s HPC applications page:
The Cluster Compute instance family currently contains a single instance type, the Cluster Compute Quadruple Extra Large, with the following specifications:
- 23 GB of memory
- 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
- 1690 GB of instance storage
- I/O Performance: Very High (10 Gigabit Ethernet)
- API name: cc1.4xlarge
As of the announcement on July 13:
- the cost per instance was USD $1.60 per hour on-demand, or USD $0.56 per hour with a 1-year or 3-year reserved instance.
- only Linux VMs are supported on these instances
- a default limit of eight of these instances applies (a 64-core cluster) before a special request form must be filled out.
Here are a couple of articles that point to some cluster management providers specifically geared toward provisioning/interfacing with EC2-based virtual clusters:
It is interesting that the instances are listed as providing 33.5 EC2 Compute Units. I wonder what method Amazon uses to establish these measurements. Since the new Cluster Compute Quadruple Extra Large instances let you know the processor architecture, you may realize a higher effective number of Compute Units depending on how much your code benefits from compiler optimizations targeting the specific Nehalem cores.
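One quick, hedged example of putting that architecture visibility to use: before picking build flags (e.g., gcc’s -march=native), a job could sanity-check the advertised CPUs by reading /proc/cpuinfo:

```python
# Minimal sketch: confirm the advertised Nehalem (Xeon X5570) CPUs from
# inside an instance before choosing compiler flags. Uses only the
# standard Linux /proc interface.
def cpu_model() -> str:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    # Expect something like "Intel(R) Xeon(R) CPU X5570 @ 2.93GHz"
    print(cpu_model())
```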
If you’ve run across other helpful articles with more details, please do leave a comment with the info.
[Image: a depiction of the structure of DNA]
Illumina will offer a service to sequence a person’s genome for $48,000 (a doctor’s prescription is required). Note that this covers only the sequencing, not the analysis of the sequence data; the consumer must choose from a few different providers for the actual analysis. As Illumina proposes to deliver it (30-fold coverage of your DNA sequence), a human genome would require the transfer of terabytes of data to the company doing the analysis. Of course, “analysis” has various stages, so depending on where Illumina stops and the other companies take over, this could be far less data (e.g., gigabytes).
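To put rough numbers on that, here is a back-of-envelope sketch; the assumptions (genome size, bytes per base call) are mine, not Illumina’s:

```python
# Back-of-envelope estimate of sequence data volume at 30x coverage.
# Assumptions: ~3 billion bases per human genome, two bytes per base
# call in FASTQ-like form (base + quality score).
genome_bases = 3e9
coverage = 30

bases_read = genome_bases * coverage        # ~9e10 base calls
fastq_bytes = bases_read * 2                # base + quality byte

print(f"called reads: ~{fastq_bytes / 1e9:.0f} GB")   # ~180 GB

# Raw instrument output (image/intensity files) can be an order of
# magnitude or more larger -- hence "terabytes" -- while a finished
# consensus sequence alone would be only ~3 GB.
```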
So this raises at least a couple of possible challenges for Illumina:
- How will the data be transferred?
- How will the data be secured?
Transferring the Data
One could see that transferring on the order of terabytes of data would not be a problem if the turnaround time is long enough, although if the service becomes more popular, scaling may be a problem (or at least synchronizing network capacity with analysis providers). Nevertheless, will Illumina establish encrypted network connections with the consumer’s/doctor’s chosen analysis provider? Will they ship the data encrypted on external hard drives? If on external hard drives, how will the multiple pieces be tracked?
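On the tracking question, one purely hypothetical scheme would be a checksum manifest written before shipping and verified on arrival; the paths below are illustrative only:

```python
# Minimal sketch (a hypothetical scheme, not anything Illumina has
# described): build a SHA-256 manifest of the files on a drive before
# shipping, so the receiving analysis provider can verify that every
# piece arrived intact.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(root: Path, manifest: Path) -> None:
    with manifest.open("w") as out:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                out.write(f"{sha256_of(p)}  {p.relative_to(root)}\n")

# write_manifest(Path("/mnt/drive1"), Path("manifest.sha256"))
```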
Securing the Data
I’m assuming the security/encryption questions may have answers based on current electronic health record implementations, although I’m not sure electronic patient information systems are typically interconnected between different health care organizations. That is, aren’t these systems usually secured/confined within the network of a particular health care organization? And if the data is placed on external hard drives and shipped, would encrypting terabytes of data per patient be challenging?
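For what it’s worth, bulk symmetric encryption at terabyte scale is mostly an I/O problem rather than a CPU problem. A minimal streaming sketch using the third-party Python cryptography package (my choice of tool, purely illustrative):

```python
# Minimal sketch of streaming AES-CTR encryption of a large file, using
# the third-party "cryptography" package (an assumption on my part; any
# comparable tool such as GPG would do). Encrypting in fixed chunks
# keeps memory use flat regardless of file size. A real scheme would
# also authenticate the data (e.g., AES-GCM) and manage keys carefully.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_file(src: str, dst: str, key: bytes) -> None:
    nonce = os.urandom(16)                     # unique per file
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(nonce)                      # prepend nonce for decryption
        while chunk := fin.read(1 << 20):      # 1 MiB at a time
            fout.write(enc.update(chunk))
        fout.write(enc.finalize())

# encrypt_file("genome.fastq", "genome.fastq.enc", key=os.urandom(32))
```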
HPCwire’s article Gravity Attracts a GigE HPC Cluster describes some of the features of the new ATLAS cluster at the Max Planck Institute for Gravitational Physics. The 144-port non-blocking 10 GigE switch from Woven was a technical feature that stood out. Additionally, it would be interesting to find out what file system is being used on the 42 storage nodes.
Also, the article mentions “An additional 500 GB of direct-connected storage is provided on each compute node. The CPU on any server can access the local disk storage on any other server as well as the central storage nodes.” I wonder in what way that local disk space is made available to the other servers.
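The article doesn’t say, but one common mechanism for that kind of cross-node disk visibility is an NFS cross-mount; a purely hypothetical sketch (node names and paths are mine):

```
# On each compute node, /etc/exports could expose the local disk:
/local  10.0.0.0/16(ro,async,no_subtree_check)
# and every node mounts its peers' exports, e.g.:
#   mount -t nfs node042:/local /cluster/node042/local
```

A parallel file system served from the storage nodes would be another possibility, which circles back to the file system question above.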