Using 147,000 processors of the Jaguar system (a Cray XT5) at the Oak Ridge Leadership Computing Facility, “a simulation of electrical current moving through a futuristic electronic transistor has been modeled atom-by-atom in less than 15 minutes by Purdue University researchers.”
“Professor Klimeck and his colleague have demonstrated the unique transformational scientific opportunity that comes from scaling a science application to fully exploit the capabilities of petascale systems like the Cray XT5 at the Oak Ridge Leadership Computing Facility,” Kothe says.
Freely available nanoelectronics software (OMEN) from nanoHUB.org was used to perform the simulation. I am curious how else this could be applied. What other nanostructures might we be able to simulate this way?
For more information, see the source article.
[Image: a depiction of the structure of DNA]
Illumina will offer a service to sequence a person’s genome for $48,000 (a doctor’s prescription is required). Note that this covers only the sequencing, not the analysis of the sequence data; the consumer must choose from a few different providers to analyze the genome sequence data. Currently, a human genome represented as Illumina proposes (30-fold coverage of your DNA sequence) would require transferring terabytes of data to the company doing the analysis. Of course, “analysis” has various stages, so depending on where Illumina stops and the other companies take over, this could be much less data (e.g., gigabytes).
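As a rough back-of-envelope check (my numbers, not Illumina’s): the human genome is about 3 billion bases, so 30-fold coverage is roughly 90 billion bases of raw sequence, and the byte count depends heavily on what is shipped.

```python
# Back-of-envelope estimate of 30x human genome sequencing output.
# All constants are illustrative assumptions, not Illumina's figures.
GENOME_BASES = 3.0e9          # approximate haploid human genome size
COVERAGE = 30                 # 30-fold coverage, per the announcement
BYTES_PER_BASE_CALL = 2       # ~1 byte for the base + ~1 byte quality score

raw_bases = GENOME_BASES * COVERAGE
fastq_bytes = raw_bases * BYTES_PER_BASE_CALL

print(f"{raw_bases / 1e9:.0f} Gbases of raw sequence")
print(f"~{fastq_bytes / 1e12:.2f} TB as base calls plus qualities")
# Raw intensity/image files from the instrument can be an order of
# magnitude larger, which is how a transfer reaches multiple terabytes.
```

This is consistent with the gigabytes-vs-terabytes point above: processed base calls are hundreds of gigabytes, while the raw instrument output pushes into terabytes.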
So this raises at least two challenges for Illumina:
- How will the data be transferred?
- How will the data be secured?
Transferring the Data
Transferring data on the order of terabytes would not be a problem if the turnaround time is long enough, although if the service becomes more popular, scaling may become one (or at least keeping network capacity in sync with the analysis providers). Nevertheless, will Illumina establish encrypted network connections with the consumer’s/doctor’s chosen analysis provider? Will they ship the data encrypted on external hard drives? If on external hard drives, how will the multiple drives be tracked?
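To put “long enough turnaround” in perspective, here is a rough transfer-time estimate; the dataset size and link speeds are assumptions for illustration.

```python
# Rough transfer-time estimate for a multi-terabyte genome dataset.
# Dataset size and link speeds are illustrative assumptions.
DATASET_TB = 2.0

for label, mbit_per_s in [("100 Mb/s", 100), ("1 Gb/s", 1000), ("10 Gb/s", 10000)]:
    seconds = DATASET_TB * 1e12 * 8 / (mbit_per_s * 1e6)
    print(f"{label}: {seconds / 3600:.1f} hours")
```

At a typical 100 Mb/s office link, 2 TB takes nearly two days of sustained transfer, which is why shipping encrypted hard drives remains a serious alternative.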
Securing the Data
I’m assuming the security/encryption questions may have answers based on current electronic health records implementations, although I’m not sure electronic patient information systems are typically interconnected between different health care organizations. That is, aren’t these systems usually secured and confined within the network of a particular health care organization? And if the data are placed on external hard drives and shipped, would encrypting terabytes of data per patient be challenging?
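On that last question: encrypting terabytes is mostly a throughput problem, and streaming symmetric encryption handles it in constant memory. A minimal sketch, assuming the third-party `cryptography` package and AES-256-CTR; the file names, chunk size, and key handling are my choices, not anyone’s actual pipeline:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_file(src_path, dst_path, key):
    """Stream-encrypt a large file with AES-256-CTR in fixed-size chunks."""
    nonce = os.urandom(16)                           # unique per file
    encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(nonce)                             # prepend nonce for decryption
        while chunk := src.read(64 * 1024 * 1024):   # 64 MB at a time
            dst.write(encryptor.update(chunk))
        dst.write(encryptor.finalize())

# key = os.urandom(32)  # 256-bit key; in practice, managed per patient
# encrypt_file("genome.fastq", "genome.fastq.enc", key)
```

At typical AES throughput of hundreds of MB/s per core, a terabyte encrypts in roughly an hour, so the harder problems are key management and chain of custody for the drives, not the cryptography itself.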
At The Genome Center at Washington University, we are seeing an ever-increasing need to align against various reference sequences. In many cases, hundreds of nodes at a time need to access the same input file (e.g., the appropriate reference sequence database). The size of the file varies with the organism and the aligner being used, but in aggregate, across hundreds of copies, a terabyte or more might be requested at the same instant. At startup, all of the jobs grab the same input file at once, which takes a significant toll on our NFS servers and on the unrelated jobs also using them. In some instances we have wanted to copy the input dataset permanently to the local disk on the computational nodes, but we cannot do that for all possible inputs.
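A quick estimate shows why the startup spike hurts; the node count, file size, and server link speed below are illustrative assumptions, not our exact numbers.

```python
# Why hundreds of nodes reading one file at startup saturates an NFS server.
# All constants are illustrative assumptions.
NODES = 500
FILE_GB = 2.0                 # e.g., a reference sequence database
NFS_LINK_GBIT = 10            # assumed server uplink

total_bits = NODES * FILE_GB * 1e9 * 8
seconds = total_bits / (NFS_LINK_GBIT * 1e9)
print(f"{NODES * FILE_GB / 1000:.1f} TB demanded at once")
print(f"~{seconds / 60:.0f} minutes of a saturated {NFS_LINK_GBIT} Gb/s link")
```

Thirteen-plus minutes of a fully saturated server link is wasted node time for the waiting jobs, and misery for every other job that touches the same NFS server.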
In the past, I had used a tool called rgang (it no longer seems to be available for download) to distribute files via a distribution tree (e.g., one node transfers to five others, each of which transfers to five more, and so on). Alternatives were other peer-to-peer distribution methods that ease the burden on the centralized NFS servers while better leveraging the bandwidth available in the cluster’s network switches.
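To illustrate the idea, here is a minimal sketch that builds a fan-out-of-five distribution plan; the hostnames and the copy command are hypothetical placeholders, and rgang’s actual mechanics differed in detail.

```python
# Build an rgang-style distribution plan: each node that has the file
# pushes it to up to FANOUT nodes that don't, in successive rounds.
FANOUT = 5

def distribution_rounds(nodes):
    """Return a list of rounds; each round is a list of (source, dest) copies."""
    have = [nodes[0]]            # the first node starts with the file
    need = list(nodes[1:])
    rounds = []
    while need:
        round_copies = []
        for src in list(have):   # snapshot: only nodes seeded before this round send
            for _ in range(FANOUT):
                if not need:
                    break
                dst = need.pop(0)
                round_copies.append((src, dst))
                have.append(dst)
        rounds.append(round_copies)
    return rounds

hosts = [f"node{i:03d}" for i in range(1, 501)]   # hypothetical hostnames
rounds = distribution_rounds(hosts)
print(f"{len(hosts)} nodes covered in {len(rounds)} rounds")
# Each (src, dst) pair would then drive something like an scp from src to dst.
```

Because the number of senders multiplies each round, 500 nodes are covered in four rounds, so total time grows logarithmically with node count instead of linearly as it does when one NFS server feeds everyone.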
When hearing “peer-to-peer,” many people think of the bittorrent protocol, so I decided to see whether anyone had applied it to staging large datasets on many compute nodes. It turns out this has been studied in several contexts for some years; see some of the bittorrent links I ran across, especially those related to data distribution in clusters. While I had seen bittorrent used in some versions of ROCKS and SystemImager for OS deployment to cluster nodes, I hadn’t seen it used directly for distributing large datasets to compute nodes. We’ll continue to look into bittorrent to see if we can decrease the I/O wait time associated with many nodes needing the same input file at the same time.
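As a starting point for those experiments, here is a minimal per-node sketch using the libtorrent Python bindings; the availability of the bindings on our nodes, the torrent file, and all paths are assumptions, not a description of our setup.

```python
import time
import libtorrent as lt   # assumes the python-libtorrent bindings are installed

# Each compute node downloads the dataset and, once complete, keeps
# seeding it to its peers, spreading the load off the central server.
ses = lt.session()
info = lt.torrent_info("/tmp/reference.torrent")   # hypothetical torrent file
handle = ses.add_torrent({"ti": info, "save_path": "/local/scratch"})

while not handle.status().is_seeding:
    s = handle.status()
    print(f"{s.progress * 100:.1f}% done, "
          f"{s.download_rate / 1e6:.1f} MB/s down, {s.num_peers} peers")
    time.sleep(5)
print("complete; this node is now seeding to its peers")
```

The appeal over a fixed fan-out tree is that bittorrent adapts to stragglers automatically: every node with even a partial copy serves pieces to the others, so the aggregate bandwidth of the cluster switches is used instead of the NFS server’s single uplink.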