Titan (supercomputer)

From Wikipedia, the free encyclopedia
This article is about the Titan supercomputer. For the Atlas 2 prototype computer, see Titan (computer).
Titan (supercomputer)
An image of the cabinets that make up Titan.
Active Became operational October 29, 2012
Sponsors US DOE and NOAA (<10%)
Operators Cray Inc.
Location Oak Ridge National Laboratory
Architecture 18,688 AMD Opteron 6274 16-core CPUs
18,688 Nvidia Tesla K20X GPUs
Power 8.2 MW
Operating system Cray Linux Environment
Space 404 m² (4,352 ft²)
Memory 693.5 TiB (584 TiB CPU and 109.5 TiB GPU)
Storage 40 PB, 1.4 TB/s IO Lustre filesystem
Speed 17.59 petaFLOPS (LINPACK)
27 petaFLOPS theoretical peak
Cost $97 million
Ranking TOP500: #2, June 2013[1]
Purpose Scientific research
Legacy Ranked 1 on TOP500 when built.
First GPU based supercomputer to perform over 10 petaFLOPS
Web site www.olcf.ornl.gov/titan/

Titan is a supercomputer built by Cray at Oak Ridge National Laboratory for use in a variety of science projects. Titan is an upgrade of Jaguar, a previous supercomputer at Oak Ridge, that uses graphics processing units (GPUs) in addition to conventional central processing units (CPUs). It is the first such hybrid to perform over 10 petaFLOPS. The upgrade began in October 2011, stability testing commenced in October 2012, and the machine became available to researchers in early 2013. The initial cost of the upgrade was US$60 million, funded primarily by the United States Department of Energy.

Titan employs AMD Opteron CPUs in conjunction with Nvidia Tesla GPUs to improve energy efficiency while providing an order of magnitude increase in computational power over Jaguar. It uses 18,688 CPUs paired with an equal number of GPUs to perform at a theoretical peak of 27 petaFLOPS; in the LINPACK benchmark used to rank supercomputers' speed, it performed at 17.59 petaFLOPS. This was enough to take first place in the November 2012 list by the TOP500 organization, but Tianhe-2 overtook it on the June 2013 list.

Titan is available for any scientific purpose; access depends on the importance of the project and its potential to exploit the hybrid architecture. Any selected code must also be executable on other supercomputers to avoid sole dependence on Titan. Six vanguard codes were the first selected. They dealt mostly with molecular scale physics or climate models, while 25 others queued behind them. The inclusion of GPUs compelled authors to alter their codes. The modifications typically increased the degree of parallelism, given that GPUs offer many more simultaneous threads than CPUs. The changes often yield greater performance even on CPU-only machines.

History

A rendering of the Titan supercomputer

Plans to create a supercomputer capable of 20 petaFLOPS at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) originated as far back as 2005, when Jaguar was built.[2] Titan will itself be replaced by an approximately 200 petaFLOPS system in 2016 as part of ORNL's plan to operate an exascale (1,000 petaFLOPS, or 1 exaFLOPS) machine by 2020.[2][3][4] The initial plan to build a new 15,000 square meter (160,000 ft²) building for Titan was discarded in favor of using Jaguar's existing infrastructure.[5] The precise system architecture was not finalized until 2010, although a deal with Nvidia to supply the GPUs was signed in 2009.[6] Titan was first announced at the private ACM/IEEE Supercomputing Conference (SC10) on November 16, 2010, and was publicly announced on October 11, 2011, as the first phase of the Titan upgrade began.[3][7]

Jaguar had received various upgrades since its creation. It began with the Cray XT3 platform that yielded 25 teraFLOPS.[8] By 2008, Jaguar had been expanded with more cabinets and upgraded to the XT4 platform, reaching 263 teraFLOPS.[8] In 2009, it was upgraded to the XT5 platform, hitting 1.4 petaFLOPS.[8] Its final upgrades brought Jaguar to 1.76 petaFLOPS.[9]

A Cray technician upgrading Jaguar to Titan.

Titan was funded primarily by the US Department of Energy through ORNL. Funding was sufficient to purchase the CPUs but not all of the GPUs so the National Oceanic and Atmospheric Administration agreed to fund the remaining nodes in return for computing time.[10][11] ORNL scientific computing chief Jeff Nichols noted that Titan cost approximately $60 million upfront, of which the NOAA contribution was less than $10 million, but precise figures were covered by non-disclosure agreements.[10][12] The full term of the contract with Cray included $97 million, excluding potential upgrades.[12]

The yearlong conversion began on October 9, 2011.[13][14] Between October and December, 96 of Jaguar's 200 cabinets, each containing 24 XT5 blades (two 6-core CPUs per node, four nodes per blade), were upgraded to XK7 blades (one 16-core CPU per node, four nodes per blade) while the remainder of the machine remained in use.[13] In December, computation was moved to the 96 XK7 cabinets while the remaining 104 cabinets were upgraded to XK7 blades.[13] The system interconnect (the network over which CPUs communicate with each other) was updated, and ORNL's external ESnet connection was upgraded from 10 Gbit/s to 100 Gbit/s.[13][15] The system memory was doubled to 584 TiB.[14] 960 of the XK7 nodes (10 cabinets) were fitted with a Fermi-based GPU, as Kepler GPUs were not then available; these 960 nodes were referred to as TitanDev and used to test code.[13][14] This first phase of the upgrade increased the peak performance of Jaguar to 3.3 petaFLOPS.[14] Beginning on September 13, 2012, Nvidia K20X GPUs were fitted to all of Jaguar's XK7 compute blades, including the 960 TitanDev nodes.[13][16][17] In October, the task was completed and the computer was finally christened Titan.[13]
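The cabinet, blade and node figures in the paragraph above can be cross-checked with a little arithmetic. The following is only an illustrative sketch: the function and variable names are invented for this example, and the inputs are the counts quoted in the text.

```python
# Per-cabinet arithmetic for the Jaguar -> Titan blade swap, using only
# figures quoted in the text. All names here are hypothetical.
BLADES_PER_CABINET = 24
NODES_PER_BLADE = 4


def cores_per_cabinet(cpus_per_node: int, cores_per_cpu: int) -> int:
    """CPU cores in one cabinet for a given node configuration."""
    return BLADES_PER_CABINET * NODES_PER_BLADE * cpus_per_node * cores_per_cpu


# XT5 blades: two 6-core CPUs per node. XK7 blades: one 16-core CPU per node.
xt5_cores = cores_per_cabinet(cpus_per_node=2, cores_per_cpu=6)
xk7_cores = cores_per_cabinet(cpus_per_node=1, cores_per_cpu=16)

nodes_per_cabinet = BLADES_PER_CABINET * NODES_PER_BLADE
titandev_nodes = 10 * nodes_per_cabinet      # the 10 TitanDev cabinets
first_batch_nodes = 96 * nodes_per_cabinet   # cabinets upgraded by December

print("XT5 cores per cabinet:", xt5_cores)
print("XK7 cores per cabinet:", xk7_cores)
print("TitanDev nodes:", titandev_nodes)
print("Nodes in first 96 cabinets:", first_batch_nodes)
```

Ten cabinets of 96 nodes each give the 960 TitanDev nodes stated above, and each XK7 cabinet carries a third more CPU cores than the XT5 cabinet it replaced, before counting any GPUs.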

In March 2013, Nvidia launched the GTX Titan, a consumer graphics card that uses the same GPU die as the K20X GPUs in Titan.[18] Titan underwent acceptance testing in early 2013 but only completed 92% of the tests, short of the required 95%.[13][19] The problem was discovered to be excess gold in the female edge connectors of the motherboards' PCIe slots causing cracks in the motherboards' solder.[20] The cost of repair was borne by Cray and between 12 and 16 cabinets were repaired each week.[20] Throughout the repairs users were given access to the available CPUs.[20] On March 11, they gained access to 8,972 GPUs.[21] ORNL announced on April 8 that the repairs were complete[22] and acceptance test completion was announced on June 11, 2013.[23]

Titan's hardware has a theoretical peak performance of 27 petaFLOPS with "perfect" software.[24] On November 12, 2012, the TOP500 organization, which ranks the world's supercomputers by LINPACK performance, ranked Titan first at 17.59 petaFLOPS, displacing IBM Sequoia.[25][26] Titan also ranked third on the Green500, the same 500 supercomputers ranked in terms of energy efficiency.[27] In the June 2013 TOP500 ranking, Titan fell to second place behind Tianhe-2 and to twenty-ninth on the Green500 list.[1][28] Titan did not re-test for the June 2013 ranking,[1] because it would still have ranked second even at its theoretical peak of 27 petaFLOPS.[29]
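The gap between the measured and theoretical figures can be expressed as a LINPACK efficiency. The sketch below uses only numbers quoted in this article (the Sequoia figure comes from the records box at the end); the variable names are invented for the example.

```python
# Figures quoted in the article; names are illustrative only.
RMAX_PFLOPS = 17.59      # measured LINPACK result
RPEAK_PFLOPS = 27.0      # theoretical peak
SEQUOIA_PFLOPS = 16.325  # IBM Sequoia, from the records box

# Fraction of the theoretical peak actually achieved on LINPACK.
linpack_efficiency = RMAX_PFLOPS / RPEAK_PFLOPS

# Relative margin by which Titan displaced Sequoia in November 2012.
lead_over_sequoia = RMAX_PFLOPS / SEQUOIA_PFLOPS - 1.0

print(f"LINPACK efficiency: {linpack_efficiency:.1%}")
print(f"Lead over Sequoia:  {lead_over_sequoia:.1%}")
```

On these figures Titan reached roughly 65% of its theoretical peak and led Sequoia by a little under 8%.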

Hardware

EVEREST allows researchers to visualize the data that Titan outputs in 3D on a 10 by 3 meter (33 by 10 ft) wall.

Titan uses Jaguar's 200 cabinets, covering 404 square meters (4,352 ft²), with replaced internals and upgraded networking.[30][31] Reusing Jaguar's power and cooling systems saved approximately $20 million.[32] Power is provided to each cabinet at 480 V. This requires thinner cables than the US standard 208 V, saving $1 million in copper.[33] At its peak, Titan draws 8.2 MW,[34] 1.2 MW more than Jaguar, but runs almost ten times as fast in terms of floating point calculations.[30][33] In the event of a power failure, carbon fiber flywheel power storage can keep the networking and storage infrastructure running for up to 16 seconds.[35] After 2 seconds without power, diesel generators fire up, taking approximately 7 seconds to reach full power. They can provide power indefinitely.[35] The generators are designed only to keep the networking and storage components powered so that a reboot is much quicker; the generators are not capable of powering the processing infrastructure.[35]
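The power and speed comparison with Jaguar can be made concrete as performance per watt. This is a rough sketch: it assumes that the 17.59 and 1.76 petaFLOPS figures quoted in this article are comparable benchmark results and that both machines are measured at peak draw, neither of which the article states outright. The function name is invented for the example.

```python
# Figures quoted in the article; the pairing of each speed with a
# peak power draw is an assumption made for illustration.
TITAN_PFLOPS = 17.59
TITAN_MW = 8.2
JAGUAR_PFLOPS = 1.76
JAGUAR_MW = TITAN_MW - 1.2  # Titan draws 1.2 MW more than Jaguar


def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert petaFLOPS and megawatts into gigaFLOPS per watt."""
    flops = pflops * 1e15
    watts = megawatts * 1e6
    return flops / watts / 1e9


titan_eff = gflops_per_watt(TITAN_PFLOPS, TITAN_MW)
jaguar_eff = gflops_per_watt(JAGUAR_PFLOPS, JAGUAR_MW)
speedup = TITAN_PFLOPS / JAGUAR_PFLOPS
efficiency_gain = titan_eff / jaguar_eff

print(f"Titan:  {titan_eff:.2f} GFLOPS/W")
print(f"Jaguar: {jaguar_eff:.2f} GFLOPS/W")
print(f"Speedup: {speedup:.1f}x, efficiency gain: {efficiency_gain:.1f}x")
```

Under these assumptions the speedup comes out just under ten, consistent with "almost ten times as fast", while the gain in performance per watt is somewhat smaller because of the extra 1.2 MW.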

Titan has 18,688 nodes (4 nodes per blade, 24 blades per cabinet),[36] each containing a 16-core AMD Opteron 6274 CPU with 32 GiB of DDR3 ECC memory and an Nvidia Tesla K20X GPU with 6 GiB GDDR5 ECC memory.[37] There are a total of 299,008 processor cores, and a total of 693.5 TiB of CPU and GPU RAM.[33]
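The system-wide totals follow directly from the per-node figures. As a quick check, with variable names chosen only for this example:

```python
# Per-node figures quoted in the text; names are illustrative.
NODES = 18_688
CORES_PER_CPU = 16
CPU_MEM_GIB = 32   # DDR3 per node
GPU_MEM_GIB = 6    # GDDR5 per node
GIB_PER_TIB = 1024

total_cores = NODES * CORES_PER_CPU
cpu_mem_tib = NODES * CPU_MEM_GIB / GIB_PER_TIB
gpu_mem_tib = NODES * GPU_MEM_GIB / GIB_PER_TIB
total_mem_tib = cpu_mem_tib + gpu_mem_tib

print("CPU cores:", total_cores)
print("CPU memory (TiB):", cpu_mem_tib)
print("GPU memory (TiB):", gpu_mem_tib)
print("Total memory (TiB):", total_mem_tib)
```

The result matches the breakdown given in the infobox: 584 TiB of CPU memory plus 109.5 TiB of GPU memory.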

Initially, Titan used Jaguar's 10 PB of Lustre storage with a transfer speed of 240 GB/s,[33][38] but in April 2013, the storage was upgraded to 40 PB with a transfer rate of 1.4 TB/s.[39] GPUs were selected for their vastly higher parallel processing efficiency over CPUs.[37] Although the GPUs have a slower clock speed than the CPUs, each GPU contains 2,688 CUDA cores at 732 MHz,[40] resulting in a faster overall system.[31][41] Consequently, the CPUs' cores are used to allocate tasks to the GPUs rather than directly processing the data as in conventional supercomputers.[37]
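To give a sense of scale for these figures, the sketch below totals the CUDA cores across all GPUs and estimates how long it would take to write the machine's entire memory to the file system at the quoted transfer rates. It is an idealized lower bound, not a measured result: it assumes decimal units for GB and TB, binary units for TiB, the 693.5 TiB memory figure and 18,688 GPU count from the infobox, and sustained peak bandwidth with no overhead.

```python
# Figures quoted in the article; the unit conventions are assumptions.
GPUS = 18_688
CUDA_CORES_PER_GPU = 2_688
TOTAL_MEMORY_TIB = 693.5
OLD_BANDWIDTH_BPS = 240e9   # 240 GB/s, taking 1 GB as 10**9 bytes
NEW_BANDWIDTH_BPS = 1.4e12  # 1.4 TB/s, taking 1 TB as 10**12 bytes

total_cuda_cores = GPUS * CUDA_CORES_PER_GPU
memory_bytes = TOTAL_MEMORY_TIB * 2**40

# Idealized time to drain all memory to storage at peak bandwidth.
old_dump_seconds = memory_bytes / OLD_BANDWIDTH_BPS
new_dump_seconds = memory_bytes / NEW_BANDWIDTH_BPS

print("Total CUDA cores:", total_cuda_cores)
print(f"Full-memory write at 240 GB/s: {old_dump_seconds / 60:.0f} min")
print(f"Full-memory write at 1.4 TB/s: {new_dump_seconds / 60:.0f} min")
```

Under these assumptions the storage upgrade shortens a hypothetical full-memory write from nearly an hour to under ten minutes.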

Titan runs the Cray Linux Environment, a full version of Linux on the login nodes that users directly access, but a smaller, more efficient version on the compute nodes.[42]

Titan's components are air-cooled by heatsinks, but the air is chilled before being pumped through the cabinets.[43] Fan noise is so loud that hearing protection is required for people spending more than 15 minutes in the machine room.[44] The system has a cooling capacity of 23.2 MW (6600 tons) and works by chilling water to 5.5 °C (42 °F), which in turn cools recirculated air.[43]
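The unit conversions in the cooling figures can be verified as follows. One ton of refrigeration is defined as 12,000 BTU per hour, or about 3.517 kW; the variable names are invented for this example.

```python
# Standard conversion factor: 1 ton of refrigeration = 12,000 BTU/h.
KW_PER_TON = 3.5168525

COOLING_TONS = 6600
CHILLED_WATER_C = 5.5
TITAN_PEAK_MW = 8.2  # peak draw quoted earlier in the section

cooling_mw = COOLING_TONS * KW_PER_TON / 1000
chilled_water_f = CHILLED_WATER_C * 9 / 5 + 32

# Ratio of cooling capacity to Titan's own peak electrical draw.
capacity_ratio = cooling_mw / TITAN_PEAK_MW

print(f"Cooling capacity: {cooling_mw:.1f} MW")
print(f"Chilled water: {chilled_water_f:.1f} °F")
print(f"Capacity / peak draw: {capacity_ratio:.2f}")
```

The 6600-ton figure converts to about 23.2 MW, as stated, which is well above Titan's 8.2 MW peak draw.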

Researchers also have access to EVEREST (Exploratory Visualization Environment for Research and Technology) to better understand the data that Titan outputs. EVEREST is a visualization room with a 10 by 3 meter (33 by 10 ft) screen and a smaller, secondary screen. The screens are 37 and 33 megapixels respectively with stereoscopic 3D capability.[45]

Projects

A VERA simulation of a light water reactor's core. This image was rendered on Jaguar but the project will continue with greater detail on Titan

In 2009, the Oak Ridge Leadership Computing Facility that manages Titan narrowed the fifty applications for first use of the supercomputer down to six "vanguard" codes chosen for the importance of the research and for their ability to fully utilize the system.[31][46] The six vanguard projects to use Titan were:

VERA is a light water reactor simulation written at the Consortium for Advanced Simulation of Light Water Reactors (CASL) on Jaguar. VERA allows engineers to monitor the performance and status of any part of a reactor core throughout the lifetime of the reactor to identify points of interest.[50] Although not one of the first six projects, VERA was planned to run on Titan after optimization with assistance from CAAR and testing on TitanDev. Computer scientist Tom Evans found that adapting the code to Titan's hybrid architecture was more difficult than adapting it to previous CPU-based supercomputers. He aimed to simulate an entire reactor fuel cycle, an eighteen to thirty-six month-long process, in one week on Titan.[50]

In 2013, thirty-one codes were planned to run on Titan, typically four or five at any one time.[44][51]

Code modifications

See also: GPGPU

The code of many projects has to be modified to suit the GPU processing of Titan, but each code is required to be executable on CPU-based systems so that projects do not become solely dependent on Titan.[46] OLCF formed the Center for Accelerated Application Readiness (CAAR) to aid with the adaptation process. It holds developer workshops at Nvidia headquarters to educate users about the architecture, compilers and applications on Titan.[52][53] CAAR has been working on compilers with Nvidia and code vendors to integrate directives for GPUs into their programming languages.[52] Researchers can thus express parallelism in their code with their existing programming language, typically Fortran, C or C++, and the compiler can express it to the GPUs.[52] Dr. Bronson Messer, a computational astrophysicist, said of the task: "...an application using Titan to the utmost must also find a way to keep the GPU busy, remembering all the while that the GPU is fast, but less flexible than the CPU."[52] Moab Cluster Suite is used to prioritize jobs to nodes to keep utilization high; it improved efficiency from 70% to approximately 95% in the tested software.[54][55] Some projects found that the changes increased efficiency of their code on non-GPU machines; the performance of Denovo doubled on CPU-based machines.[46]

The amount of code alteration required to run on the GPUs varies by project. According to Dr. Messer of NRDF, only a small percentage of his code runs on GPUs because the calculations are relatively simple but processed repeatedly and in parallel.[56] NRDF is written in CUDA Fortran, a version of Fortran with CUDA extensions for the GPUs.[56] Chimera's third "head" was the first to run on the GPUs as the nuclear burning could most easily be simulated by GPU architecture. Other aspects of the code were planned to be modified in time.[49] On Jaguar, the project modeled 14 or 15 nuclear species but Messer anticipated simulating up to 200 species, allowing far greater precision when comparing the simulation to empirical observation.[49]

See also

References

  1. "June 2013". TOP500. Archived from the original on July 2, 2013. Retrieved July 2, 2013.
  2. "Discussing the ORNL Titan Supercomputer with ORNL’s Jack Wells." The Exascale Report. November 2012. Archived from the original on March 26, 2013. Retrieved December 19, 2012.
  3. Bland, Buddy (November 16, 2010). "Where do we go from here?". Oak Ridge National Laboratory. Archived from the original on March 3, 2012. Retrieved December 18, 2012.
  4. Goldman, David (October 29, 2012). "Top U.S. supercomputer guns for fastest in world". CNN. Archived from the original on March 2, 2013. Retrieved March 31, 2013.
  5. Munger, Frank (March 7, 2011). "Oak Ridge lab to add titanic supercomputer". Knox News. Archived from the original on July 4, 2012. Retrieved December 19, 2012.
  6. Morgan, Timothy Prickett (October 1, 2009). "Oak Ridge goes gaga for Nvidia GPUs". The Register. Archived from the original on November 9, 2012. Retrieved December 19, 2012.
  7. Levy, Dawn (October 11, 2011). "ORNL awards contract to Cray for Titan supercomputer". Oak Ridge National Laboratory. Archived from the original on February 26, 2013. Retrieved December 19, 2012.
  8. "Jaguar: Oak ridge National Laboratory". TOP500. Archived from the original on March 17, 2013. Retrieved December 18, 2012.
  9. "TOP500 List November 2011". TOP500. Archived from the original on January 21, 2013. Retrieved December 18, 2012.
  10. Munger, Frank (November 26, 2012). "The ORNL and NOAA relationship". Knox News. Archived from the original on March 26, 2013. Retrieved December 20, 2012.
  11. Munger, Frank (November 18, 2012). "The cost of Titan". Knox News. Archived from the original on March 26, 2013. Retrieved December 20, 2012.
  12. Feldman, Michael (October 11, 2011). "GPUs Will Morph ORNL's Jaguar Into 20-Petaflop Titan". HPC Wire. Archived from the original on July 27, 2012. Retrieved October 29, 2012.
  13. "Titan Project Timeline". Oak Ridge Leadership Computing Facility. Archived from the original on June 18, 2012. Retrieved December 18, 2012.
  14. Brouner, Jennifer; McCorkle, Morgan; Pearce, Jim; Williams, Leo (2012). "ORNL Review Vol. 45". Oak Ridge National Laboratory. Archived from the original on March 4, 2013. Retrieved November 2, 2012.
  15. "Superfast Titan, Superfast Network". Oak Ridge Leadership Computing Facility. December 17, 2012. Archived from the original on March 26, 2013. Retrieved December 18, 2012.
  16. Poeter, Damon (October 11, 2011). "Cray's Titan Supercomputer for ORNL Could Be World's Fastest". PC Magazine. Archived from the original on June 5, 2012. Retrieved October 29, 2012.
  17. Jones, Gregory Scott (September 17, 2012). "Final Upgrade Underway". Oak Ridge Leadership Computing Facility. Archived from the original on March 26, 2013. Retrieved November 16, 2012.
  18. Smith, Ryan (February 21, 2013). "Nvidia's GeForce GTX Titan Review, Part 2: Titan's Performance Unveiled". Anandtech. Archived from the original on February 23, 2013. Retrieved March 26, 2013.
  19. Munger, Frank (February 20, 2013). "No. 1 Titan not yet living up to potential". Knox News. Archived from the original on March 26, 2013. Retrieved March 26, 2013.
  20. Huotari, John (March 13, 2013). "Cray re-soldering Titan’s connectors, supercomputer testing could be done in April". Oak Ridge Today. Archived from the original on March 26, 2013. Retrieved March 26, 2013.
  21. Jones, Scott (March 26, 2013). "Titan Users Now Have Access to GPUs". Oak Ridge Leadership Computing Facility. Archived from the original on March 13, 2013. Retrieved March 26, 2013.
  22. Huotari, John (April 8, 2013). "Titan repairs complete, ORNL preparing for second round of supercomputer testing". Oak Ridge Today. Archived from the original on April 8, 2013. Retrieved April 8, 2013.
  23. Munger, Frank (June 11, 2013). "Titan passes acceptance test, seals ORNL's supercomputer deal with Cray". Knox News. Archived from the original on July 2, 2013. Retrieved July 2, 2013.
  24. Jones, Gregory Scott (November 12, 2012). "ORNL Supercomputer Named World’s Most Powerful". Oak Ridge National Laboratory. Archived from the original on February 22, 2013. Retrieved December 14, 2012.
  25. "Oak Ridge Claims No. 1 Position on Latest TOP500 List with Titan". TOP500. November 12, 2012. Archived from the original on January 21, 2013. Retrieved November 15, 2012.
  26. "US Titan supercomputer clocked as world's fastest". BBC. November 12, 2012. Archived from the original on February 3, 2013. Retrieved November 12, 2012.
  27. Williams, Leo (November 14, 2012). "Titan is Also a Green Powerhouse". Oak Ridge Leadership Computing Facility. Archived from the original on February 16, 2013. Retrieved November 15, 2012.
  28. "The Green500 List - June 2013". Green500. June 28, 2013. Archived from the original on July 2, 2013. Retrieved July 2, 2013.
  29. Munger, Frank (June 12, 2013). "Titan didn't re-test for TOP500, keeping last year's benchmark; ORNL's Jeff Nichols explains why". Knox News. Archived from the original on July 2, 2013. Retrieved July 2, 2013.
  30. Tibken, Shara (October 29, 2012). "Titan supercomputer debuts for open scientific research". CNET. Archived from the original on December 15, 2012. Retrieved October 29, 2012.
  31. "Introducing Titan". Oak Ridge Leadership Computing Facility. Archived from the original on February 22, 2013. Retrieved October 29, 2012.
  32. Munger, Frank (October 29, 2012). "Titan's ready to roll; ORNL supercomputer may become world's No. 1". Knox News. Archived from the original on March 26, 2013. Retrieved October 29, 2012.
  33. Lal Shimpi, Anand (October 31, 2012). "Inside the Titan Supercomputer". Anandtech. p. 1. Archived from the original on January 25, 2013. Retrieved November 2, 2012.
  34. "Heterogeneous Systems Re-Claim Green500 List Dominance". Green500. November 14, 2012. Archived from the original on February 5, 2013. Retrieved November 15, 2012.
  35. Bland, Buddy; Lal Shimpi, Anand (October 30, 2012). "Oak Ridge National Laboratory Tour – Backup Power" (YouTube). Anandtech. Retrieved November 2, 2012.
  36. Morgan, Timothy Prickett (October 11, 2011). "Oak Ridge changes Jaguar's spots from CPUs to GPUs". The Register. Archived from the original on October 15, 2012. Retrieved December 21, 2012.
  37. "ORNL Debuts Titan Supercomputer". Oak Ridge Leadership Computing Facility. Archived from the original on February 26, 2013. Retrieved October 29, 2012.
  38. Lal Shimpi, Anand (October 31, 2012). "Titan's storage array". Anandtech. Archived from the original on March 26, 2013. Retrieved December 18, 2012.
  39. Santos, Alexis (April 16, 2013). "Titan supercomputer to be loaded with 'world's fastest' storage system". Engadget. Archived from the original on April 16, 2013. Retrieved April 16, 2013.
  40. Smith, Ryan (November 12, 2012). "NVIDIA Launches Tesla K20 & K20X: GK110 Arrives At Last". Anandtech. Archived from the original on January 24, 2013. Retrieved December 21, 2012.
  41. Feldman, Michael (October 29, 2012). "Titan Sets High Water Mark for GPU Supercomputing". HPC Wire. Archived from the original on March 26, 2013. Retrieved October 30, 2012.
  42. "Titan System Overview". Oak Ridge Leadership Computing Facility. Archived from the original on March 26, 2013. Retrieved December 21, 2012.
  43. Bland, Buddy; Lal Shimpi, Anand (October 30, 2012). "Oak Ridge National Laboratory Tour – Cooling Requirements" (YouTube). Anandtech. Retrieved November 2, 2012.
  44. Pavlus, John (October 29, 2012). "Building Titan: The ‘world’s fastest’ supercomputer". BBC. Archived from the original on January 20, 2013. Retrieved January 8, 2013.
  45. Munger, Frank (January 1, 2013). "ORNL visualization lab gets $2.5M makeover, adds 3D". Knox News. Archived from the original on March 26, 2013. Retrieved January 2013.
  46. "TITAN: Built for Science". Oak Ridge Leadership Computing Facility. Archived from the original on February 26, 2013. Retrieved October 29, 2012.
  47. "Nuclear Energy – Supercomputer speeds path forward". Consortium for Advanced Simulation of LWRs. Archived from the original on February 14, 2013. Retrieved December 14, 2012.
  48. Zybin, Sergey. "LAMMPS Molecular Dynamics Simulator". Sandia National Laboratories. Archived from the original on February 16, 2013. Retrieved October 29, 2012.
  49. Messer, Bronson (October 30, 2012). "Using Titan to Model Supernovae" (YouTube). Anandtech. Retrieved November 15, 2012.
  50. Pearce, Jim. "VERA analyzes nuclear reactor designs in unprecedented detail". Oak Ridge National Laboratory. Archived from the original on February 15, 2013. Retrieved December 18, 2012.
  51. "2013 INCITE Awards". US Department of Energy. Archived from the original on March 26, 2013. Retrieved January 17, 2013.
  52. Williams, Leo. "Preparing users for Titan". Oak Ridge National Laboratory. Archived from the original on March 1, 2013. Retrieved November 19, 2012.
  53. Rumsey, Jeremy (December 17, 2012). "Titan Trainers Take Road Trip". Oak Ridge Leadership Computing Facility. Archived from the original on March 26, 2013. Retrieved December 18, 2012.
  54. "Supercomputing Leaders Choose Adaptive Computing to Complement Latest HPC Systems". Business Wire. January 30, 2013. Archived from the original on March 26, 2013. Retrieved January 31, 2013.
  55. DuBois, Shelley (January 30, 2013). "The next revolution in cloud computing". Fortune Magazine. Archived from the original on March 26, 2013. Retrieved January 31, 2013.
  56. Lal Shimpi, Anand (October 31, 2012). "Inside the Titan Supercomputer". Anandtech. p. 3. Archived from the original on March 26, 2013. Retrieved November 15, 2012.

External links

Records
World's most powerful supercomputer, November 2012 – June 2013
Preceded by IBM Sequoia (16.325 petaflops)
Succeeded by Tianhe-2 (33.9 petaflops)