Mar 6, 2009

New Earth Simulator (ES2) and New Plasma Simulator Go into Operation

According to recent Japanese news releases, two Japanese supercomputer sites have announced the start of operation of new systems. One is the 131 TFLOPS new Earth Simulator (ES2) at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC); the other is the 77 TFLOPS new Plasma Simulator at the National Institute for Fusion Science (NIFS) in Toki, Japan.

ES2, the successor of the well-known Earth Simulator, is an NEC SX-9/E vector supercomputer, and the new Plasma Simulator is a Hitachi SR16000 model L2, a POWER6-based supercomputer.

Both supercomputers use architectures that differ from the general-purpose, commodity-based cluster machines that are rapidly becoming dominant even in HPC, as the Top500 supercomputer list shows.

JAMSTEC and NIFS are able to specify their real requirements for a successor clearly from the application point of view: not only performance but also memory capacity, memory bandwidth, and so on. They may also have considered continuity of their major application codes and programming know-how. Hence, I imagine that the NEC SX-9/E and the Hitachi SR16000 were evaluated as the best fit for the JAMSTEC and NIFS computational environments, respectively.

The following is a summary of the two systems, based on the specifications listed below.
(For comparison purposes, some ES2 and SR16000 data are filled in from SX-9 model A and public IBM POWER6 material, respectively, which should be a reasonable assumption. A value marked with ** is such a complemented value.)

- The peak FLOPS per core of SX-9 is about five times that of POWER6 (this is not surprising, because an SX-9 CPU includes 8 vector units and a scalar unit).

- SX-9 provides large and steady memory bandwidth. The memory transfer per FLOP, 2.5 Byte/FLOP, is very large and stable because of the vector architecture (no cache).

- The Byte/FLOP of POWER6 varies between 0.21 and 4, depending on where the data resides, i.e., in memory or in cache. Hence, cache-aware programming is desirable.

- SX-9 is expensive and not green (low MFLOPS/W), probably because of the rich hardware needed to deliver the highest vector performance.

- Peak performance per node is of the same order in the two simulators: approx. 820 GFLOPS in ES2 and approx. 600 GFLOPS in the Plasma Simulator (SR16000), both less than 1 TFLOPS.

- SR16000 delivers an energy efficiency of 102 MFLOPS/W, more than three times better than SX-9. (According to the Green500, the top entry is the IBM QS22 using the PowerXCell 8i processor at 500 MFLOPS/W, followed by Blue Gene/P at 372 MFLOPS/W. Even a recent quad-core Xeon server can provide 200+ MFLOPS/W.)

- SX-9 uses traditional air cooling. In contrast, SR16000 adopts an efficient direct water cooling system.
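
The ratios quoted in the bullets above can be re-derived from the figures in the specification lists in this post. The Python sketch below does that arithmetic. The variable names are my own, one SX-9 CPU is treated as one "core" as in the comparison above, and the values marked as complemented (**) carry the same caveat here as in the lists.

```python
# Re-derive the comparison ratios from the figures listed in this post.
# Names are illustrative. "complemented" marks the ** values, which are
# assumptions taken from SX-9 model A and public POWER6 material.

SX9_CPU_GFLOPS = 102.4               # vector peak per SX-9 CPU
POWER6_CHIP_GFLOPS = 37.6            # peak per dual-core POWER6 chip
POWER6_CORES_PER_CHIP = 2

ES2_NODE_GFLOPS = 819.2              # peak per ES2 node
ES2_NODE_MEM_BW_GBS = 2048.0         # complemented
SR_NODE_GFLOPS = 601.6               # peak per SR16000 node
SR_NODE_MEM_BW_GBS = (128.0, 160.0)  # complemented, low and high estimate

ES2_MFLOPS_PER_W = 27.3              # complemented
SR_MFLOPS_PER_W = 102.1


def bytes_per_flop(bandwidth_gbs: float, peak_gflops: float) -> float:
    """Memory bandwidth divided by peak rate, in Byte/FLOP."""
    return bandwidth_gbs / peak_gflops


# Per-core peak: one SX-9 CPU versus one POWER6 core.
power6_core_gflops = POWER6_CHIP_GFLOPS / POWER6_CORES_PER_CHIP
per_core_ratio = SX9_CPU_GFLOPS / power6_core_gflops

# Byte/FLOP when the data comes from main memory.
es2_bpf = bytes_per_flop(ES2_NODE_MEM_BW_GBS, ES2_NODE_GFLOPS)
sr_bpf_low = bytes_per_flop(SR_NODE_MEM_BW_GBS[0], SR_NODE_GFLOPS)
sr_bpf_high = bytes_per_flop(SR_NODE_MEM_BW_GBS[1], SR_NODE_GFLOPS)

# Energy efficiency of SR16000 relative to SX-9.
efficiency_ratio = SR_MFLOPS_PER_W / ES2_MFLOPS_PER_W

print(f"SX-9 CPU / POWER6 core peak: {per_core_ratio:.2f}x")
print(f"ES2 Byte/FLOP (memory):      {es2_bpf:.2f}")
print(f"SR16000 Byte/FLOP (memory):  {sr_bpf_low:.2f} to {sr_bpf_high:.2f}")
print(f"SR16000 / SX-9 MFLOPS/W:     {efficiency_ratio:.2f}x")
```

The upper Byte/FLOP figure of 4 for POWER6 refers to data in L2 cache; the post gives no cache bandwidth, so the sketch only covers the main-memory case.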

The following are the characteristics of ES2 and the new Plasma Simulator.

● ES2 (NEC SX-9/E)
- 131 TFLOPS vector peak performance, 20 TB memory, Fat-tree Network
- Number of nodes: 160
- Air Cooling
- Peak performance/Power: **about 27.3 MFLOPS/W (819.2 GFLOPS/30 kVA)
- Construction and 6-year lease fee: about 18.9 B yen (about $192M)

(CPU)
- Vector peak performance: 102.4 GFLOPS (3.2 GHz clock)
- 8 Vector units + 1 Scalar unit
- 32 port-memory port crossbar
- 65 nm CMOS, 11 copper layers

(Node)
- Vector peak performance: 819.2 GFLOPS
- CPU/node: 8
- Memory/node: 128 GB (SMP)
- Memory bandwidth: **2,048 GB/s (8 CPU)
- Byte/FLOP: **2.5
- Inter node transfer: 128 GB/s (8 GB/s x 8 x 2)
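
As a consistency check, the ES2 system-level figures follow from the per-CPU and per-node values above. The sketch below is only that arithmetic. The names are mine, and it assumes binary units for memory (1 TB = 1024 GB) and treats 1 VA as roughly 1 W, which is the assumption implicit in the 27.3 MFLOPS/W figure.

```python
# Roll up the ES2 figures listed above from CPU to node to system.
# Assumptions (not stated in the post): 1 TB = 1024 GB, and 1 VA ~ 1 W.

NODES = 160
CPUS_PER_NODE = 8
CPU_PEAK_GFLOPS = 102.4
CPU_CLOCK_GHZ = 3.2
MEM_PER_NODE_GB = 128
NODE_POWER_KVA = 30.0
LINK_GBS, LINKS, DIRECTIONS = 8, 8, 2   # from "8 GB/s x 8 x 2"

# Floating-point operations per clock cycle implied by peak and clock.
flops_per_cycle = CPU_PEAK_GFLOPS / CPU_CLOCK_GHZ

node_peak_gflops = CPUS_PER_NODE * CPU_PEAK_GFLOPS
system_peak_tflops = NODES * node_peak_gflops / 1000
system_mem_tb = NODES * MEM_PER_NODE_GB / 1024
internode_gbs = LINK_GBS * LINKS * DIRECTIONS

# Peak MFLOPS per VA of node power, read as MFLOPS/W.
mflops_per_w = (node_peak_gflops * 1000) / (NODE_POWER_KVA * 1000)

print(f"FLOPs per cycle per CPU: {flops_per_cycle:.0f}")
print(f"Node peak:               {node_peak_gflops:.1f} GFLOPS")
print(f"System peak:             {system_peak_tflops:.1f} TFLOPS")
print(f"System memory:           {system_mem_tb:.0f} TB")
print(f"Inter-node transfer:     {internode_gbs} GB/s")
print(f"Peak per power:          {mflops_per_w:.1f} MFLOPS/W")
```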

● New Plasma Simulator (Phase 1: Hitachi SR16000 model L2)

- 77 TFLOPS peak performance, 16 TB memory, InfiniBand Fat Tree Network
- Number of nodes: 128
- External storage: 0.5 PB
- Direct water cooling
- Peak performance/Power: 102.1 MFLOPS/W
- Contract price: about 5.4 B yen (about $55M)
(In Phase 2 (2012/10~2015/3), it will be upgraded to 315 TFLOPS.)

(CPU chip)
- Chip peak performance: 37.6 GFLOPS (Dual core)
- Dual core POWER6 processor (4.7 GHz clock)
- 32 MB L3 cache
- 8-channel memory controller (DDR2/DDR3)
- 65 nm CMOS, copper + SOI

(Node)
- Peak performance: 601.6 GFLOPS
- CPU core/node: 32
- Memory/node: 128 GB (**32 way cc-NUMA SMP)
- Memory bandwidth: **128~160 GB/s (**4~5 GB/s x 32 core)
(When the data resides in the L2 or L3 cache, the bandwidth becomes significantly larger. This memory bandwidth behavior differs from that of a vector processor, which shows a steady value.)
- Byte/FLOP: **0.21 (data on memory)~4 (data on L2 cache)
- Inter node transfer: 32 GB/s (bidirectional)
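
The same roll-up can be done for the Plasma Simulator. The sketch below also derives two numbers that the post does not state: the total power implied by the 102.1 MFLOPS/W figure and the Phase 2 upgrade factor. Both are my own inferences, and the power estimate assumes that the efficiency figure refers to whole-system peak performance. Memory again uses 1 TB = 1024 GB.

```python
# Roll up the Plasma Simulator (SR16000 model L2) figures listed above.
# The implied power and the Phase 2 factor are inferred here, not quoted
# from the post; they assume MFLOPS/W is based on whole-system peak.

NODES = 128
CORES_PER_NODE = 32
CORES_PER_CHIP = 2
CHIP_PEAK_GFLOPS = 37.6
CLOCK_GHZ = 4.7
MEM_PER_NODE_GB = 128
MFLOPS_PER_W = 102.1
PHASE2_TFLOPS = 315.0

chips_per_node = CORES_PER_NODE // CORES_PER_CHIP
core_peak_gflops = CHIP_PEAK_GFLOPS / CORES_PER_CHIP

# Floating-point operations per clock cycle per core.
flops_per_cycle = core_peak_gflops / CLOCK_GHZ

node_peak_gflops = chips_per_node * CHIP_PEAK_GFLOPS
system_peak_tflops = NODES * node_peak_gflops / 1000
system_mem_tb = NODES * MEM_PER_NODE_GB / 1024

# Power implied by peak performance and the quoted efficiency.
implied_power_kw = (system_peak_tflops * 1e6) / MFLOPS_PER_W / 1000

# Growth from Phase 1 peak to the announced Phase 2 peak.
phase2_factor = PHASE2_TFLOPS / system_peak_tflops

print(f"FLOPs per cycle per core: {flops_per_cycle:.0f}")
print(f"Node peak:                {node_peak_gflops:.1f} GFLOPS")
print(f"System peak:              {system_peak_tflops:.1f} TFLOPS")
print(f"System memory:            {system_mem_tb:.0f} TB")
print(f"Implied power:            {implied_power_kw:.0f} kW")
print(f"Phase 2 / Phase 1 peak:   {phase2_factor:.1f}x")
```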

NCAR's 76.4 TFLOPS BLUEFIRE is an IBM p575 with almost the same peak performance as the new Plasma Simulator.

If we simply measure the value of supercomputers on a performance/price scale, commodity-based clusters may come out as the favorite. However, it is essential in HPC that the scale depends on the supercomputer site, and there should be systems of different architectures that users can choose among by their own scale, which includes memory, reliability, power, and space in addition to peak performance and price.

Based on my hard experience competing against Japanese vector supercomputers at IBM Japan from around 1985 to 1995, I believe the advantages of vector supercomputers should be rediscovered through successive innovative challenges, including energy efficiency and price, if possible.
