From March 31 to April 1, the National Human Genome Research Institute (NHGRI) is holding a meeting to explore the use of cloud computing for the storage, management, and analysis of genomic data, including the associated computing issues and how cloud environments could be implemented for biological analyses. I wanted to make this site available for those who will not be in attendance but who want to contribute (post comments here or via Twitter (@garystiehr)).
The agenda contains presentations and discussion surrounding some specific topics:
- an overview of various cloud offerings as well as a survey of some genomics-specific cloud computing pilot projects
- an overview of some of the associated technical questions:
  - transmitting large genomic data sets
  - computational architecture of cloud environments
  - cloud security and relevant NIH data privacy policies
Further, a general discussion session will be held regarding supporting genomic analyses using cloud computing:
- What are the key challenges for genomic analyses within a cloud?
- Under what conditions are cloud environments better than local clusters of computers?
- What types of analyses of large genomic datasets are clouds appropriate for currently and in the near future?
- What would be needed for cloud environments to work better for genomic analyses?
- What features of a cloud environment are appropriate for genome repositories to use?
- What alternatives to cloud environments exist or may be developed, and under what conditions would these be better than clouds?
- Are data standards needed and, if so, in what areas are they needed?
- What best practices are needed?
Please join in the discussion on the topics above and help identify whether there are principles from within the cloud computing arena that can be used to advance the state of genomics research. Post comments here or via Twitter (@garystiehr).
Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85.
– “Searching for SNPs with cloud computing” (Genome Biology 2009, 10:R134)
Page and table references provided below refer to the provisional PDF, which was the available format at the time of writing this post.
Lately, I've seen a number of papers and announcements regarding running genomics and other bioinformatics applications in the cloud (usually meaning using Amazon's EC2 and S3 services). These projects are providing valuable empirical data about costs and run times in the cloud. To help in that effort, I thought I'd take a look at the latest example of such published results. "Searching for SNPs with cloud computing" was published on November 20 (Genome Biology 2009, 10:R134) and describes the use of the Crossbow package (which requires Hadoop) on a local, non-cloud resource as well as on Amazon's cloud resources.
So how did the application perform in the cloud versus on the local computing resources? Within the paper, we find that the 320-CPU cluster mentioned above comprises 40 "Extra Large High CPU Instances" (virtual machines, each with eight cores (approximately 2.5-2.8 GHz), 7 GB of memory, and 1690 GB of "instance storage") upon which Hadoop was running (p. 7). The running (wall) time of the application with the given data set was 2 hours and 53 minutes (Table 3). Two other scenarios (80 and 160 EC2 cores) were also run (see Table 3 of the paper for timing results). The application was also run locally in the authors' lab via Hadoop on a cluster of ten nodes, each with 4 GB of memory, 366 GB of local disk space, and four 3.2 GHz Intel Xeon cores (40 cores total). The running (wall) time on the local cluster was "about" one day (p. 7).
A few things stood out in terms of comparing times between the local and cloud resources (not that that was the intent of the paper):
- The local cluster consisted of only 40 cores but the runs on EC2 were with 80, 160 and 320 cores (and the cores were of different speeds)–why not also test with 40 EC2 cores?
- The network topology of the local 40-core cluster was not discussed (aside from the use of Gigabit Ethernet), which could have implications during different stages of the map/reduce flow.
- The local storage infrastructure was not described, which could have implications in the transfer of data to the cluster nodes.
- It does not quite add up that the application took one day on 40 non-virtualized 3.2 GHz cores but only 6 hours and 30 minutes on 80 virtualized "2.5 to 2.8 GHz" cores using the same version of Hadoop, which implies differences in the architecture of the infrastructure used for testing:
  - does the increased disk space available for Hadoop's HDFS make a difference for data caching? (366 GB per local node versus 1690 GB of "instance storage" per EC2 node)
  - is there a network bottleneck between the local nodes that does not exist with the EC2 nodes?
  - is there a bottleneck getting to the storage in the local cluster that does not exist between the EC2 nodes and S3?
- The Amazon wall clock timings each come from a single run, and the number of timing runs on the local cluster is not specified.
  - also, a timing on the local cluster of "about" one day is a little vague.
- The paper’s conclusion (p. 14) states that when run on “conventional computers” this type of analysis requires “weeks of CPU time” but the same analysis with Crossbow can finish “in less than 3 hours on a 320-core cluster.”
  - the terminology related to time (e.g., wall vs. CPU) was not used in the traditional sense. Perhaps they use "CPU time" generically to mean time on a computer rather than the time the processor spent executing instructions on behalf of the application? In any case, the statement is really a distortion of the timings and/or terminology:
    - The 80-core EC2 run consumed 6.5 hours * 80 cores = 520 core-wall-hours
    - The 160-core EC2 run consumed 4.55 hours * 160 cores = 728 core-wall-hours
    - The 320-core EC2 run consumed 2.88 hours * 320 cores = 922 core-wall-hours
    - The local 40-core run took "about" one day (I'll assume 24 hours) and so consumed 24 hours * 40 cores = 960 core-wall-hours
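As a quick sanity check on those figures, here is a minimal sketch of the arithmetic in Python (the EC2 hours and core counts come from Table 3; the 24-hour local figure is my assumption standing in for the paper's "about" one day):

```python
# Core-wall-hours for each run: wall-clock hours multiplied by cores used.
runs = {
    "EC2, 80 cores": (6.50, 80),
    "EC2, 160 cores": (4.55, 160),
    "EC2, 320 cores": (2.88, 320),
    "local, 40 cores": (24.0, 40),  # assumed; the paper says "about" one day
}

for name, (wall_hours, cores) in runs.items():
    print(f"{name}: {wall_hours * cores:.0f} core-wall-hours")
```

So even taking the vague local timing at face value, the 320-core EC2 run consumed nearly as many core-wall-hours as the local run did.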
Aside from the above, the data provide some interesting starting points surrounding timings (including data transfer timing). As for pricing, the $85 mentioned covers computation only; there was another $35.50 in data transfer-related fees, and it is not clear whether S3 fees came into play during the tests. Further, one should note the price (a premium of $20.96 per hour (360%)) of finishing 3 hours and 37 minutes (55%) earlier with 320 EC2 cores rather than using only 80 EC2 cores. Perhaps I'll address that in a future post.
One definition of duty cycle specific to disk drives is this excerpt from a patent description (designed to enforce certain duty cycles):
The disk drive duty cycle can be expressed as a ratio=Ta/t for a given time period t where Ta is the amount of time the disk drive is actively processing read/write commands during the time period. For example, if during a time period t=60 seconds, the disk drive was actively processing read/write commands for a collective total of Ta=15 seconds, then the average duty cycle for that time period would be 25%.
Some SATA drives out there claim a duty cycle of 100% (24×7 usage) but others claim less than 100%. What happens if you exceed the duty cycle? Presumably drives fail more often than the stated MTBF/MTTF would suggest. But at what rate?
How can you tell, though, to what extent you are using your drives in terms of a duty cycle? For example, suppose you have a general-purpose HPC cluster that accesses data through a clustered file system spread across multiple arrays of hundreds of disks. I/O patterns will vary: some weeks you may pull 2 GB/s from your disk array while other weeks you may only pull 750 MB/s. Likewise, some weeks you may be pulling 40,000 IOPS from your disk arrays while other weeks you are at 20,000 IOPS.
So, to determine your actual utilization, would you add up the maximum IOPS of the SATA drives in your disk array, track the IOPS actually performed by that array, take the duty cycle ratio to be the IOPS performed divided by the maximum possible IOPS, and then check whether that ratio is within the threshold for the type of SATA drives in your array? Perhaps you take a monthly average? Or would you need to consider the amount of time spent reading/writing–i.e., would a single IO reading/writing more data sequentially count more against the duty cycle than a smaller single IO?
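To make that idea concrete, here is a minimal sketch of such an IOPS-based estimate, assuming you can periodically sample the IOPS performed by the array; the drive count, per-drive IOPS rating, and sample values are all hypothetical:

```python
# Rough duty-cycle estimate in the spirit of the patent's ratio Ta/t:
# treat IOPS performed relative to the array's maximum as "active" time.
# All numbers below are hypothetical; note this ignores IO size, which is
# exactly the open question raised above about sequential vs. small IOs.

drive_count = 600          # SATA drives in the array (assumed)
iops_per_drive = 80        # rated random IOPS per drive (assumed)
max_iops = drive_count * iops_per_drive

# Average IOPS observed at the array in successive samples (assumed).
iops_samples = [40_000, 20_000, 31_000, 27_500]

duty_cycles = [io / max_iops for io in iops_samples]
avg = sum(duty_cycles) / len(duty_cycles)
print("per-sample duty cycle:", [f"{d:.0%}" for d in duty_cycles])
print(f"average duty cycle: {avg:.0%}")
```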
On the other hand, some sources have indicated that drive reliability differences (e.g., FC vs. SATA) are not so much due to duty cycles related to drive mechanics as to the possibility of bit errors during rebuilds of RAID sets. Some FC drives take more measures to correct errors than SATA drives do, making bit errors perhaps 100 times more likely on SATA drives than on FC drives (1 in 10^14 bits vs. 1 in 10^16 bits).
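To see why those rates matter during a rebuild, here is a back-of-the-envelope sketch: treating the quoted figures as independent per-bit probabilities of an unrecoverable read error, the chance of hitting at least one error while reading a rebuild's worth of data diverges sharply between the two drive classes. The 10 TB rebuild size is an assumption for illustration:

```python
import math

# Probability of at least one unrecoverable read error while reading
# `rebuild_bytes` of data, assuming independent per-bit errors at rate `ber`.
def p_error(rebuild_bytes: float, ber: float) -> float:
    bits = rebuild_bytes * 8
    # 1 - (1 - ber)**bits, computed stably for tiny ber and huge bit counts
    return 1.0 - math.exp(bits * math.log1p(-ber))

rebuild_bytes = 10e12  # 10 TB read during a RAID rebuild (assumed)
for name, ber in (("SATA, 1 in 10^14", 1e-14), ("FC, 1 in 10^16", 1e-16)):
    print(f"{name}: {p_error(rebuild_bytes, ber):.1%} chance of a bit error")
```

Under those assumptions, the SATA array has roughly a 55% chance of hitting a bit error during the rebuild versus under 1% for the FC array.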
Here is some other info I’m taking a look at in regard to drive reliability issues:
Illumina will offer a service to sequence a person's genome for $48,000 (a doctor's prescription is required). Note that this covers only the sequencing and not the actual analysis of that sequence data. The consumer must choose from a few different providers to do the actual analysis of the genome sequence data. Currently, a human genome represented the way Illumina proposes (30-fold coverage of your DNA sequence) would require transferring terabytes of data to the company doing the analysis. Of course, "analysis" has various parts, so depending on where Illumina stops and the other companies take over, this could actually be a lot less data (e.g., gigabytes).
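To see why the handoff point matters so much, here is a rough sketch of the data volumes at different stages; the per-base sizes are order-of-magnitude assumptions for illustration, not Illumina's actual figures:

```python
# Rough data volumes at different handoff points for a 30x human genome.
genome_bases = 3.2e9
coverage = 30
bases = genome_bases * coverage  # ~9.6e10 sequenced bases

# Assumed bytes per base at each stage of "analysis" (all hypothetical):
stages = {
    "raw image/intensity data": 100,
    "reads with quality scores": 2,
    "final variant calls": 0.001,
}
for stage, bytes_per_base in stages.items():
    print(f"{stage}: ~{bases * bytes_per_base / 1e12:.3f} TB")
```

Under those assumptions, the same genome is roughly 10 TB of raw instrument data but only a couple hundred gigabytes of reads, which is the difference between "terabytes" and "gigabytes" above.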
So this raises at least a couple of possible challenges for Illumina:
- How will the data be transferred?
- How will the data be secured?
Transferring the Data
One could see that transferring on the order of terabytes of data would not be a problem if the turnaround time is long enough, although if the service becomes more and more popular, scaling may be a problem (or at least synchronizing network capabilities with analysis providers). Nevertheless, will Illumina establish encrypted network connections with the consumer's/doctor's chosen analysis provider? Will they transfer the data encrypted on external hard drives? If on external hard drives, how will the multiple pieces be tracked?
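For a rough sense of the network option, here is a minimal sketch of the transfer times involved; the 2 TB per-genome payload and the link speeds are assumptions for illustration:

```python
# Back-of-the-envelope network transfer times for one genome's raw data.
payload_bytes = 2 * 10**12  # ~2 TB per genome (assumed)

links_mbps = {"T3 (45 Mb/s)": 45, "100 Mb/s": 100, "1 Gb/s": 1000}

for name, mbps in links_mbps.items():
    seconds = payload_bytes * 8 / (mbps * 10**6)
    print(f"{name}: {seconds / 86_400:.1f} days")
```

At those assumed rates, a single genome ties up a T3 for about four days, which is why shipping encrypted hard drives stays on the table.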
Securing the Data
I'm assuming the security/encryption questions may have answers based on current electronic health records implementations, although I'm not sure if electronic patient information systems are typically interconnected between different health care organizations. That is, aren't these systems usually secured/confined within the network of a particular health care organization? If the data is placed on external hard drives and shipped, would encrypting terabytes of data per patient be challenging?
Here is another great example of how an HPC site can function as a versatile resource for a wide variety of problem domains. A priority queue was set up on TACC's Ranger cluster to provide 2,000 to 3,000 processors for two weeks, allowing a team to assess the ways in which the underlying structure of the Swine Flu virus (H1N1A) has mutated or could mutate and lead to drug resistance. With this data, "they believe it will be possible to intelligently design a drug or vaccine that can't be resisted."
This still from a Quicktime movie represents a view of the drug buried in the binding pocket of the A/H1N1 neuraminidase protein. The animation also shows a 3D surface view of a neuraminidase protein and footage from the actual drug binding simulation.
From the article cited below:
Supercomputers routinely assist in emergency weather forecasting, earthquake predictions, and epidemiological research. Now, says Schulten, they are proving their usefulness in biomedical crises.
“It’s a historic moment,” he said. “For the first time these supercomputers are being used for emergency situations that require a close look with a computational tool in order to shape our strategy.”
Find more details at Inside the Swine Flu Virus (found via this HPCwire article).
Scientific research often benefits from open innovation. While there are many examples, I am particularly excited to see what happens in the area of cancer genomics. The Genome Center at Washington University published the results of sequencing the first cancer genome back in November 2008. Internally, there was collaboration between departments in the School of Medicine, resulting in innovative analyses and leading to more discoveries. Since then I've read and heard about a number of similar or follow-up projects at various institutions. As data is shared amongst researchers across the world, new collaborations will be formed. The innovations resulting from these collaborations will hopefully lead to better treatments for cancer.
We at The Genome Center at Washington University were happy to get official word that we will be adding 21 more Illumina Genome Analyzers to our portfolio of sequencing technology. That enables us to sequence the equivalent of an entire human genome (at 25x coverage) per day. There is a lot of excitement about the potential such capacity brings. The Genome Center's director had this to say:
“Our intention to substantially scale-up with this technology reflects our commitment to large-scale sequencing projects that aim to uncover the underlying genetic basis of various human diseases. With the rapid decline in the cost of whole-genome sequencing, we believe now is the time to embark on initiatives which were previously not possible,” said Richard K. Wilson, Ph.D., Professor of Genetics and Director of the Genome Center at Washington University. “We are confident that we can further reduce the cost and accelerate the rate of human genome sequencing.”
A scale-up of sequencing capacity brings a scale-up in IT capacity. We'll be watching our internal network, disk, and HPC resources and scaling as appropriate. These sequencers alone will likely generate upwards of 20 TB of data per day, which will need further processing on The Genome Center's computational resources. I'm excited about the possibilities that this scale-up will bring!
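For a sense of what 20 TB per day means for the network, here is a minimal back-of-the-envelope sketch; treating the output as a steady 24-hour stream is my simplifying assumption:

```python
# Sustained network bandwidth implied by 20 TB/day of sequencer output,
# assuming (simplistically) the data flows evenly over 24 hours.
tb_per_day = 20
bits_per_day = tb_per_day * 10**12 * 8
sustained_gbps = bits_per_day / 86_400 / 10**9
print(f"{tb_per_day} TB/day is about {sustained_gbps:.2f} Gb/s sustained")
```

That works out to nearly 2 Gb/s around the clock before counting any reprocessing or replication traffic, which is why the network is on the watch list alongside disk and HPC.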
From Blue Data Center Will Be Powered by the Tides (found via @tkunau/@ecogeek):
At first, tidal power will only cover one-fifth of the data center’s needs, but Atlantis hopes that if the first phase is successful, they can expand the tidal array to make up the remaining wattage.
Sun's Colorado Consolidation Saves Millions describes how Sun used Liebert's XD rack cooling, clear vinyl cold aisle curtains, and flywheels to increase the density of its data center while also reducing energy consumption. They consolidated 165,000 square feet of data center space into 700 square feet while reducing their monthly power usage by one million kilowatt-hours.
When we considered the XD cooling units, there were two options: chilled water or refrigerant. With chilled water, there was the question of potential water leaks from these rack-attached units. With the refrigerant option, there were questions about the increase in the number of condensers, where they would be placed, and how much more maintenance would be needed. With either option, there is also an increased need for maintenance inside the server room amongst the servers, storage, switches, etc. The obvious benefit of the XD units is that they can provide enough cooling for up to 30 kW in a single rack. Although, if I recall correctly, there is a limit to the total number of racks with the refrigerant-based version due to limits on the maximum pressure or capacity of the refrigerant in a single system.
As for the vinyl curtains, the usual objection is to their aesthetics. Personally, I would like to see them installed to help keep the cold air completely contained in the cold aisle, where it is intended to be, especially in raised-floor environments with high-velocity air flow, where the cold air might be pushed outside the confines of the cold aisle without such containment.
One question about Sun’s use of the flywheel: How large are your flywheels? Flywheels generally supply on the order of ten seconds or so of power, which is usually enough time for generators to kick on but cuts it very close. What type of services run out of Sun’s Colorado facility?
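For context on that ten-second figure, here is a trivial sketch of the ride-through arithmetic; the stored energy and the load are hypothetical values for illustration:

```python
# Flywheel ride-through time: usable stored energy divided by the load.
# Both values below are assumptions, not Sun's actual specifications.
stored_kj = 4_000   # usable energy in one flywheel cabinet, kJ (assumed)
load_kw = 250       # critical load carried by the UPS, kW (assumed)
print(f"ride-through: {stored_kj / load_kw:.0f} seconds")
```

Whether roughly 15 seconds is comfortable depends entirely on how reliably the generators start on the first attempt.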
Out of necessity, we have some file systems that are 25 TB. As of 2009, we consider 25 TB a large file system and we are concerned about the potential downtime that may result if an fsck is needed.
Some storage vendors advertise that they can have single file systems that are hundreds of terabytes or even a petabyte. Often, however, there is no mention of when or if fsck or similar operations would be needed and how long they take.
ZFS claims to eliminate the need for fsck and Chunkfs (ext2 enhancements from around 2006) claims to reduce fsck times by splitting the repair domain. Further, "journaling file systems only speed fsck time in the case of a system crash, disconnected disk, or other interruptions in the middle of file system updates. They do not speed recovery in the case [of] 'real' metadata corruption" (see third paragraph here).
1.) What do you consider a large file system? (What file system do you use for them?)
2.) Are you concerned about fsck times? (Why or why not?)
3.) Can you predict fsck times based on some parameters (e.g., inodes used, disk size, etc.)?
4.) Any special cases related to fsck or similar operations for clustered file systems?