
A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator

Satoru Shingu1, Hiroshi Takahara2, Hiromitsu Fuchigami3, Masayuki Yamada3, Yoshinori Tsuda1, Wataru Ohfuchi1, Yuji Sasaki3, Kazuo Kobayashi3,

Takashi Hagiwara2, Shin-ichi Habata2, Mitsuo Yokokawa4, Hiroyuki Itoh5

and Kiyoshi Otsuka1

1 Earth Simulator Center, Japan Marine Science and Technology Center, 3173-25 Showa-machi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan. [email protected]

2 NEC Corporation, 1-10, Nisshin-cho, Fuchu-shi, Tokyo 183-8501, Japan

3 NEC Informatec Systems, Ltd., 3-2-1, Sakado, Takatsu-ku, Kawasaki, Kanagawa 213-0012, Japan

4 National Institute of Advanced Industrial Science and Technology, Tsukuba Central 2, Umezono 1-1-1, Tsukuba, Ibaraki 305-8568, Japan

5 National Space Development Agency of Japan, 2-4-1, Hamamatsu-cho, Minato-ku, Tokyo 105-8060, Japan

Abstract

A spectral atmospheric general circulation model called AFES (AGCM for the Earth Simulator) was developed and optimized for the architecture of the Earth Simulator (ES). The ES is a massively parallel vector supercomputer that consists of 640 processor nodes interconnected by a single-stage crossbar network, with a total peak performance of 40.96 Tflops. A sustained performance of 26.58 Tflops was achieved for a high-resolution simulation (T1279L96) with AFES by utilizing the full 640-node configuration of the ES. The resulting computing efficiency is 64.9% of the peak performance, well surpassing that of conventional weather/climate applications, which reach only 25-50% efficiency even on vector-parallel computers. This remarkable performance demonstrates the effectiveness of the ES as a viable means for practical applications.

1 Introduction

More realistic and accurate simulations are expected to provide a better understanding of phenomena such as global warming and El Niño, and to support the preparation of measures against their possible impacts on social and economic activities. For such numerical simulations, there are increasing demands for more powerful computers, driven by the growing sophistication and complexity of the models as well as by increasing spatial resolution. Large climate simulations are possible only with the combination of a powerful computer and an efficient climate model.

The Earth Simulator (ES) project was launched to meet these requirements and to aim at a comprehensive understanding of global change. Development of the ES started in 1997 as a joint project of the National Space Development Agency of Japan (NASDA), the Japan Atomic Energy Research Institute (JAERI), and the Japan Marine Science and Technology Center (JAMSTEC). The ES was successfully completed and put into operational use at JAMSTEC's Yokohama Institute for Earth Sciences (YES) in March 2002, and it is currently ranked number 1 with 35.86 Tflops on the Linpack performance benchmark. In parallel, the development of an atmospheric general circulation model (AGCM), now called AFES (AGCM for the Earth Simulator), has been under way since 1995.

One of the ultimate goals of the ES project was to simulate the global atmospheric general circulation on a 10-km mesh at a sustained speed of 5 Tflops. By fully harnessing the capabilities of the ES, the authors carried out such high-resolution climate modeling for the first time and achieved a record-breaking sustained performance of 26.58 Tflops. With the advent of the ES, the powerful combination of a computing system and highly optimized application software is expected to spearhead the era of tera- to ten-teraflops computing.

This paper describes an overview of the ES and AFES, with a particular focus on the techniques used for optimization and performance analysis. Some preliminary results of simulations are also discussed.

2 Overview of the Earth Simulator

The ES is a massively parallel vector system built from shared-memory nodes: it consists of 640 processor nodes interconnected by a 640 × 640 single-stage crossbar switch. Each node contains 8 vector processors with a peak performance of 8 Gflops each, a 16-GB shared memory system, a remote access control unit, and an I/O processor. This gives the machine a theoretical peak performance of 40.96 Tflops with a total of 5120 processors and 10 TB of memory [1, 2]. The system runs an extension of NEC's UNIX-based SUPER-UX operating system. Each processor is equipped with a vector operation unit, a 4-way super-scalar operation unit, and a main memory access control unit, all implemented on a single-chip LSI.

2.1 Memory System

The memory system of each node is shared equally by the 8 processors and is configured with 32 main memory units, each of which has one memory port and is connected through a crossbar switch. Every memory unit has 64 memory banks, for a total of 2048 banks and a memory capacity of 16 GB per node. Each processor within a node can access all 32 memory ports when vector load/store instructions are issued. Each processor has a data transfer rate of 32 GB/s to the memory devices, which results in an aggregate throughput of 256 GB/s per node. Access to contiguous memory addresses yields the maximum performance, whereas performance degrades for constant-stride access with an even stride, because the number of memory ports available for concurrent access decreases. Access with a stride of 32 results in the lowest efficiency of 1 GB/s. Even so, the user can readily obtain high performance on parallel vector systems without the complicated cache-tailored programming that is often required on scalar computers.
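The effect of the access pattern can be illustrated with a small Fortran sketch (our illustration, not code from the paper): the first loop below streams through contiguous addresses and can use all 32 memory ports, while the second uses the stride of 32 that the text identifies as the worst case.

    program stride_demo
      implicit none
      integer, parameter :: n = 1024*1024
      real(8) :: a(n), b(n)
      integer :: i
      a = 0.0d0
      b = 1.0d0
      ! Unit-stride loop: contiguous addresses are spread over all 32 memory ports.
      do i = 1, n
         a(i) = a(i) + b(i)
      end do
      ! Stride-32 loop: repeatedly hits the same memory port; as stated above,
      ! this pattern falls to about 1 GB/s on the ES memory system.
      do i = 1, n, 32
         a(i) = a(i) + b(i)
      end do
      print *, a(1), a(n)
    end program stride_demo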

2.2 Interconnection Network

The single-stage crossbar interconnection network consists of two units: the inter-node crossbar control unit, which coordinates switch operations, and the inter-node crossbar switch, which provides the actual data paths. All pairs of nodes and switches are connected by electric cables. The theoretical data transfer rate between any two nodes is 12.3 GB/s in each direction (12.3 GB/s × 2).

A message-passing programming model with Message Passing Interface (MPI) libraries is supported both within a node and among nodes. The performance of the MPI_Put function was measured [3]; the maximum throughput and latency of MPI_Put are 11.63 GB/s and 6.63 µsec, respectively. It should be noted that the time for barrier synchronization is only 3.3 µsec, thanks to dedicated hardware for barrier synchronization among nodes.

2.3 Processor

The super-scalar operation unit has a 64 KB instruction cache, a 64 KB data cache, and 128 general-purpose scalar registers, and supports branch prediction, data prefetching, and out-of-order instruction execution. The vector operation unit consists of 8 logical sets of vector pipelines, vector registers, and mask registers.

Vector pipelines in a vector unit can handle six types of operations: addition/shift, multiplication, division, logical operation, masking, and load/store. In total there are 8 vector registers and 64 vector data registers, each having 256 vector elements.

The peak performance is achieved when addition and multiplication are carried out concurrently, and the theoretical peak can be computed for various operation patterns. Each processor can carry out 16 floating-point operations within one machine cycle of 2 nsec by fully utilizing a set of pipelines, provided that both the addition and multiplication pipelines perform 8 operations each. This corresponds to the peak performance of 8 Gflops, but a sufficiently long vector length and a high data transfer rate between the processor and the memory unit are essential to reach it.

Here one element of an array occupies 8 bytes; therefore, 8 Gflops of performance would require a data transfer rate of 64 GB/s, while the actual rate between a processor and the memory unit is 32 GB/s. This implies that an application program must be compute-intensive, keeping the frequency of memory access low, for efficient computation. The maximum theoretical performance F can be derived from the number of instructions within a DO loop, assuming a sufficiently long vector length, by the following formula:

$$F = \frac{4\,(\mathrm{ADD} + \mathrm{MUL})}{\max(\mathrm{ADD},\ \mathrm{MUL},\ \mathrm{VLD} + \mathrm{VST})}\ \ \mathrm{Gflops}, \qquad (1)$$

where ADD, MUL, VLD, and VST are the numbers of additions, multiplications, vector loads, and vector stores, respectively. Table 1 summarizes how the sustained performance is computed for different operation patterns. Here the index of the innermost DO loop is denoted n and that of its outer loop j. Note that access to the arrays C and D is treated as a scalar load with respect to n and is therefore not included in the VLD count.
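For example, applying Eq. (1) to operation pattern No. 2 of Table 1 (a worked check using the counts listed there, not additional material from the paper):

$$F = \frac{4\,(1+1)}{\max(1,\ 1,\ 3+1)} = \frac{8}{4} = 2.00\ \mathrm{Gflops},$$

which is 25% of the 8 Gflops single-processor peak, in agreement with the table.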

The highest effective sustained performance is obtained for operation No. 4 in Table 1, which corresponds to the coding pattern of the inverse Legendre transform. In the Legendre transform, the computations for the real parts (X in this case) and the imaginary parts (Y) are identical in form and both use the associated Legendre function (P); computing them in the same DO loop therefore halves the number of VLD instructions. Unrolling of DO loops is also effective in reducing the number of VLD and VST instructions, contributing to the enhanced performance. The same technique can be applied to other parts of the code to gain computational efficiency.

Table 1: Estimation of vector performance.

No.  DO loop body                                      VLD  VST  ADD  MUL  Gflops        Peak ratio
 1   X(n) = X(n) + A(n)                                 2    1    1    0   4/3 = 1.33       17%
 2   X(n) = X(n) + A(n)*B(n)                            3    1    1    1   8/4 = 2.00       25%
 3   X(n) = X(n) + P(n,j)*C(j)                          2    1    1    1   8/3 = 2.67       33%
 4   X(n) = X(n) + P(n,j  )*C(j  ) + P(n,j+1)*C(j+1)    6    2    8    8   64/8 = 8.00     100%
                 + P(n,j+2)*C(j+2) + P(n,j+3)*C(j+3)
     Y(n) = Y(n) + P(n,j  )*D(j  ) + P(n,j+1)*D(j+1)
                 + P(n,j+2)*D(j+2) + P(n,j+3)*D(j+3)

3 AFES Climate Simulation Code

As the predecessor of AFES, NJR-SAGCM [4] was developed by the Research Organization for Information Science and Technology under funding by NASDA from 1995 to 2001. Its original code was adapted from AGCM version 5.4 [5, 6, 7, 8], jointly developed by the Center for Climate System Research of the University of Tokyo and the Japanese National Institute for Environmental Studies. NJR-SAGCM was parallelized only with MPI for generic machine platforms, and as a result its computing performance on the ES was not satisfactory. AFES has been optimized for the ES by the Earth Simulator Research and Development Center (ESRDC) under funding by NASDA. Since the start of operational use of the ES in March 2002, AFES has been maintained by the Earth Simulator Center (ESC), the successor of the former ESRDC.


3.1 Coding Style

The source code of AFES is written in Fortran90 and comprises approximately 30,000 lines. In order to make effective use of the memory capacity, selected Fortran90 features such as automatic arrays and dynamic allocation of arrays are adopted. The MODULE feature is also used to enhance the maintainability and readability of the code and to substitute for legacy Fortran constructs such as ENTRY, COMMON, and INCLUDE statements. On the other hand, newer features such as derived types, pointers, and array expressions were not used because of possible performance degradation.

3.2 Description of Model

The model is based on the global hydrostatic primitive equations on a sphere. It uses the spectral transform method [11, 12] in the horizontal directions on a Gaussian grid, and the finite-difference method in the vertical direction with sigma coordinates. It predicts variables such as the horizontal winds, temperature, ground surface pressure, specific humidity, and cloud water.

The primitive equation system is employed for AFES for the following reason. Our target resolution is 10 km, which is arguably very close to the lower limit at which the hydrostatic approximation is applicable. One of the main reasons to use a non-hydrostatic equation system instead would be to simulate convective motions explicitly, but the 10-km mesh is clearly too coarse for that purpose. On the other hand, a finer resolution, such as a 1-km mesh capable of resolving convection, is not practical at this moment for global simulations because of the enormous amount of computation required. In addition, AFES is also intended for use at lower spatial resolutions, such as 20 km or 40 km, where the hydrostatic approximation is fully valid.

The computation of AFES mainly falls into one of two categories, dynamics or physics. In dynamics, the spectral transform method is used and variables in the physical domain are mapped onto the Fourier and spectral domains. In physics, parameterizations of several phenomena are taken into account. These parameterizations include a sophisticated radiation scheme with a two-stream k-distribution method; several variants of cumulus convective parameterization (simplified Arakawa-Schubert, Kuo, and Manabe's moist convective adjustment schemes); large-scale condensation with a prognostic cloud water scheme; the Mellor-Yamada level-2 turbulence closure scheme with simple cloud effects; orographic gravity wave drag; and a simple land-surface submodel. AFES is being upgraded for further sophistication; possible enhancements include the adoption of a semi-Lagrangian advection scheme for tracer variables such as specific humidity and cloud water.

3.3 Spectral Transform Method

Many algorithms for parallelizing the spectral transform method have been proposed and examined on various computers [12, 13, 14]. Here the outline of the Legendre transform (LT), which makes up the core of the spectral transform, is given following Foster and Worley [12].

In the spectral transform method, fields are transformed at each timestep between the physical domain, where the physical forces are calculated, and the spectral domain, where the horizontal terms of the differential equations are evaluated. The spectral representation of a field variable $\psi$ on a given vertical layer above the surface of the sphere is defined by a truncated expansion in terms of the spherical harmonic functions $\bar{P}^m_n(\mu)\,e^{im\lambda}$:

$$\psi(\lambda,\mu) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N(m)} \psi^m_n\, \bar{P}^m_n(\mu)\, e^{im\lambda}, \qquad (2)$$

where

$$\psi^m_n = \int_{-1}^{1} \left[ \frac{1}{2\pi} \int_{0}^{2\pi} \psi(\lambda,\mu)\, e^{-im\lambda}\, d\lambda \right] \bar{P}^m_n(\mu)\, d\mu \qquad (3)$$

$$\phantom{\psi^m_n} = \int_{-1}^{1} \psi^m(\mu)\, \bar{P}^m_n(\mu)\, d\mu. \qquad (4)$$

Here $i=\sqrt{-1}$, $\mu = \sin\theta$, $\theta$ is latitude, $\lambda$ is longitude, $m$ is the zonal wavenumber, $n$ is the total wavenumber, and $\bar{P}^m_n(\mu)$ is the associated Legendre function. In the truncated expansion, $M$ is the highest wavenumber, while $N(m)$ is the highest degree of the associated Legendre function along lines of constant longitude. Because of the triangular spectral truncation used here, $N(m) = M$.

In each vertical layer of the physical domain, fields are discretized on an $I \times J$ longitude-latitude grid, where the $I$ lines of constant longitude are evenly spaced and the $J$ lines of constant latitude are placed at the Gaussian quadrature points. The transform from physical to spectral coordinates involves a Fourier transform along each line of constant latitude, generating the values $\{\psi^m(\mu_j)\}$ in an $M \times J$ wavenumber-latitude space. Each wavenumber is then integrated with $J$-point Gaussian quadrature to obtain the spectral coefficients by the forward Legendre transform,

$$\psi^m_n = \sum_{j=1}^{J} \psi^m(\mu_j)\, \bar{P}^m_n(\mu_j)\, w_j. \qquad (5)$$

Here $w_j$ is the Gaussian quadrature weight corresponding to the Gaussian latitude $\mu_j$. The point values are recovered from the spectral coefficients by the inverse Legendre transform

$$\psi^m(\mu_j) = \sum_{n=|m|}^{N(m)} \psi^m_n\, \bar{P}^m_n(\mu_j) \qquad (6)$$

for each $m$, followed by the inverse Fourier transform to calculate $\psi(\lambda_i,\mu_j)$. The Fast Fourier Transform (FFT) algorithm used in AFES is the self-sorting mixed-radix FFT of Temperton [9, 10], which supports the factors 2, 3, 4, and 5.
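To make the transform pair concrete, the following Fortran sketch implements Eqs. (5) and (6) for a single zonal wavenumber m in a straightforward, unoptimized form. It is our illustration rather than code from AFES, and it treats the coefficients as a single real array even though the Fourier coefficients are actually complex.

    ! Simplified sketch of the forward (Eq. 5) and inverse (Eq. 6) Legendre
    ! transforms for one zonal wavenumber m. psi_lat holds psi^m(mu_j),
    ! psi_spec holds psi^m_n, Pmn(j,n) holds Pbar^m_n(mu_j), w(j) are the
    ! Gaussian weights; with triangular truncation, n runs from m to M.
    subroutine legendre_pair(m, M, J, Pmn, w, psi_lat, psi_spec)
      implicit none
      integer, intent(in)    :: m, M, J
      real(8), intent(in)    :: Pmn(J, 0:M), w(J)
      real(8), intent(inout) :: psi_lat(J), psi_spec(0:M)
      integer :: j, n
      ! Forward LT (Eq. 5): Gaussian quadrature over latitude.
      do n = m, M
         psi_spec(n) = 0.0d0
         do j = 1, J
            psi_spec(n) = psi_spec(n) + psi_lat(j)*Pmn(j,n)*w(j)
         end do
      end do
      ! Inverse LT (Eq. 6): synthesis back onto the Gaussian latitudes.
      psi_lat = 0.0d0
      do n = m, M
         do j = 1, J
            psi_lat(j) = psi_lat(j) + psi_spec(n)*Pmn(j,n)
         end do
      end do
    end subroutine legendre_pair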

The notation "TMLK" is used to specify the spatial resolution of a model with a triangular spectral truncation at total wavenumber $M$ and $K$ vertical layers, where T stands for the triangular truncation. Let $I$ and $J$ be the numbers of grid points in the longitudinal and latitudinal directions; then the relations $J = I/2$ and $M = \lfloor (I-1)/3 \rfloor$ hold. The amount of computation for the LT is estimated to be $O(M^3)$, while the operation counts for the FFT and for the remaining parts are $O(M^2 \log_2 M)$ and $O(M^2)$, respectively. With increasing horizontal resolution, the computational cost of the LT therefore becomes more and more dominant compared with the other parts, so the optimization of the LT is crucial to enhancing the total performance.

4 Performance Optimization

4.1 Three-level Parallelism on the ES

In order to pursue the best possible performance on the ES, it is crucial to fully exploit the three levels of parallelism available: inter-node parallel processing for the distributed memory architecture, intra-node parallel processing for the shared memory architecture within a node, and vector processing on a single processor. The maximum numbers of parallel decompositions are 640, 8, and 256, corresponding to the number of nodes, the number of processors within a node, and the number of vector register elements, respectively. It is also important to choose a spatial resolution suited to the algorithmic nature and data structure of the code, as well as to fully utilize the arithmetic units. In this study, MPI-based parallelism is adopted for inter-node operations, microtasking for intra-node operations, and vector processing for a single processor.

4.2 Target Spatial Resolution

One of the ultimate goals originally set in developing the ES-tailored AFES code was to achieve high sustained performance for an atmospheric model with a horizontal resolution of 10 km and around 100 vertical layers. Taking into account the hardware configuration and the parallel efficiency of the ES, the spatial resolution T1279L96 was selected. This model has 3840 grid points along each line of constant latitude and 1920 points along each line of constant longitude, together with 96 vertical layers. Since the triangular truncation is used for the spectral wavenumbers to eliminate aliasing error, the number of grid points with respect to wavenumber is M+1 = 1279+1 = 1280, including wavenumber zero. In the following sections, performance enhancement strategies are described with a focus on the T1279L96 resolution model.
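As a quick consistency check against the relations of Section 3.3 (a worked example added here for clarity, not additional material from the paper):

$$I = 3840,\qquad J = \frac{I}{2} = 1920,\qquad M = \left\lfloor \frac{I-1}{3} \right\rfloor = \left\lfloor \frac{3839}{3} \right\rfloor = 1279,$$

and $M+1 = 1280 = 2 \times 640$, so the full 640-node configuration can be used with exactly two zonal wavenumbers per node (see Section 4.3.1).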


4.3 Parallelism of AFES

Three kinds of variable spaces are used in the spectral transform method: physical grid space (I × J × K), Fourier grid space (M × J × K), and spectral grid space (N × M × K). These grid spaces are shown in Figure 1 for one vertical layer. In physics there is column-wise parallelism, while sequential processing is required for the vertical integration of variables; physics can be processed only in the physical grid space. In dynamics, the Fourier transform has interdependency among variables in the zonal direction (along constant latitude), while it can be processed independently in both the meridional (along constant longitude) and vertical directions. The Legendre transform, on the other hand, must be processed sequentially in the meridional direction but has no dependency in terms of Fourier wavenumber or vertical level. The semi-implicit time integration has a dependency only in the vertical direction and is independent of the wavenumbers m and n. These algorithmic properties are taken into account in the parallel programming of AFES. It will be shown that T1279 is the minimum horizontal resolution that can efficiently utilize the full 640-node configuration of the ES.

4.3.1 Inter-node Parallel Processing

First, the inter-node parallel optimization is described (Figure 1). Here each MPI process is mapped onto a node. For MPI-based parallelism, the entire computational domain is decomposed into sub-domains of constant latitude j so that the Fourier transforms are computed concurrently, while the decomposition is made in terms of wavenumber m in the spectral grid space; this results in all-to-all type communications due to the data transpose between the FFT and the Legendre transform. Figure 2 shows the array sizes of the decomposed variables and the data flow.

One bottleneck that hampers efficiency is the load imbalance among MPI processes arising from the use of triangular truncation for the wavenumbers. A straightforward allocation of MPI processes to each wavenumber m incurs a serious imbalance in the amount of computation, since the DO loop length over n decreases with increasing wavenumber m. In order to reduce the performance degradation due to this load imbalance, the mapping of computation to MPI processes was rearranged as illustrated in Figure 1. This method is called "zigzag cyclic mapping" in this paper. In this method, each MPI process must deal with at least two wavenumbers m for the purpose of this balancing; a sketch of one such pairing is given below.
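The paper does not give the mapping formula explicitly, so the following Fortran sketch shows one plausible realization of such a pairing (our assumption, including the routine name): each of the 640 processes receives one low and one high wavenumber, so that the short n-loops of high wavenumbers are balanced by the long n-loops of low ones.

    ! Hypothetical sketch of a zigzag-type pairing of zonal wavenumbers to MPI
    ! processes for the case of exactly two wavenumbers per process
    ! (T1279 on 640 nodes: (M+1)/2 = 640 processes).
    subroutine zigzag_map(M, myrank, m_local)
      implicit none
      integer, intent(in)  :: M, myrank        ! truncation wavenumber and MPI rank
      integer, intent(out) :: m_local(2)       ! the two wavenumbers of this rank
      ! Pair a low wavenumber (long loop over n) with a high one (short loop),
      ! so every rank gets the same total work: (M-m1+1) + (M-m2+1) = M+2.
      m_local(1) = myrank
      m_local(2) = M - myrank
    end subroutine zigzag_map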

Therefore, the maximum possible number of divisions for the Legendre transform becomes 640 (= (M+1)/2), which corresponds to the total number of nodes of the ES. This implies that the parallelism with respect to wavenumber m has been exploited to its maximum, and there is no room for further decomposition in m. In this case, the loop length with respect to latitude assigned to each process is at most 3 (= J/P, with P = 640).

4.3.2 Main Flow for Inter-node Parallel Processing

Figure 3 shows the main processing flow of AFES within a timestep loop. In dynamics, two cycles of spectral transforms are performed at each timestep. One cycle is composed of forward and inverse transforms, as shown in Figure 1. The forward transform consists of the forward FFT, the data transpose (DT), and the forward LT, and maps variables from physical grid space to spectral grid space. The inverse transform is composed of the inverse LT, DT, and FFT, and maps variables from spectral grid space back to physical grid space. The number of variables transformed between spaces varies from one transform cycle to another and includes both 3-D and 2-D variables; the former include the wind velocity components, temperature, and so on, while the latter are variables associated with the ground surface pressure.

In the first stage of dynamics (Prep), one cycle of transforms is conducted: the forward spectral transform (Fwd1) for five 3-D variables and two 2-D variables, followed by the inverse transform (Inv1) for two 3-D variables and two 2-D variables. In the second stage (Tendency), which computes the time tendency terms, the forward spectral transform (Fwd2) is conducted for twelve 3-D variables and one 2-D variable. The variables are thus transformed into the spectral grid space, where the semi-implicit time integration is applied to the gravity-related terms. Finally, Update transforms the updated variables back into the physical grid space; here the inverse spectral transform (Inv2) is conducted for nine 3-D variables and one 2-D variable.

In Figure 3, the number of variables handled in each stage is indicated as 5K+2, 2K+2, 12K+1, and 9K+1, respectively, where K is the number of vertical layers. In the FFT and DT, these variables are packed into a single large array.


Figure 1: Image of inter-node parallelism, zigzag cyclic mapping to MPI processes, and communication due to the data transpose in each vertical layer (physical grid, Fourier grid, and spectral grid).

                Fwd-FFT              Fwd-DT               Fwd-LT
(I, J/P, K)  <----------->  (M, J/P, K)  <----------->  (J, M/P, K)  <----------->  (N, M/P, K)
                Inv-FFT              Inv-DT               Inv-LT

Figure 2: Array sizes at each stage of the spectral transform method. Fwd: forward Legendre transform; Inv: inverse Legendre transform; DT: data transpose; P: number of MPI processes.

Time-step loop
  Dynamics
    Prep           (physical grid)   Fwd1: FFT, DT, LT    5K+2
                   (spectral grid)   Inv1: LT, DT, FFT    2K+2
    Tendency       (physical grid)   Fwd2: FFT, DT, LT    12K+1
    Semi-implicit  (spectral grid)
    Update                           Inv2: LT, DT, FFT    9K+1
  Physics          (physical grid)

Figure 3: Main flow of processing within a timestep loop.


Such packing increases the vector loop lengths in the FFT and the sizes of the MPI messages, thus reducing the frequency of communications in the DT.

4.3.3 Intra-node Parallel Processing

Figure 4: Three-level parallelization in physics.

The second approach is intra-node parallel processing for the shared memory architecture within each node, which is realized with microtasking, a kind of thread programming. The third approach is vector processing on a single processor. In order to share the parallelism at the DO loop level, these two approaches need to be combined effectively; each microtask is mapped onto a processor. After the division of the computational domain into sub-regions with MPI-based parallelization, the computation for each sub-region is handled with the combination of vector processing and microtasking. The LT has parallelism in terms of both wavenumber m and the vertical direction k. Since the parallelism associated with wavenumber has already been used for inter-node parallelization, the vertical direction was chosen as the target of microtasking. In order to share the computational load equally among the 8 microtasking threads, the number of vertical layers must be a multiple of 8.

As for physics, column-wise parallelization can easily be implemented because of its algorithmic nature, which has data dependency only in the vertical direction. However, the parallel decomposition is made with respect to latitude to maximize performance by avoiding the data transfer that a re-allocation of data between dynamics and physics would require. Because of the MPI-based parallel decomposition in terms of latitude, similar to dynamics, the maximum possible number of microtasking processes would be only 3 (= J/P). In order to make use of the 8 processors within a node, microtasking is instead applied to collapsed DO loops having a loop length of 1440 (= I*J/(P*8)), with the remaining loops processed by the vector units (Figure 4). Similar methods are used for the FFT and the semi-implicit time integration to capitalize on the effective combination of microtasking and vector processing. Such optimization for microtasking was done mostly by manual parallelization, in combination with automatic compiler parallelization.
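The following Fortran sketch (ours, not AFES code) illustrates this column-wise three-level split; an OpenMP directive stands in for NEC's proprietary microtasking, the routine and variable names are hypothetical, and the column update is a dummy operation with a vertical dependency. With I = 3840, J = 1920, and P = 640, each node holds I*J/P = 11520 columns, i.e., 8 chunks of 1440.

    ! Sketch of the three-level split in physics: the collapsed column index is
    ! divided among 8 threads (1440 columns each); inside a thread, the vertical
    ! loop is sequential and the column loop is the vector loop.
    subroutine physics_columns(ncol, K, t)
      implicit none
      integer, intent(in)    :: ncol, K        ! ncol = I*J/P (11520 in the T1279L96 case)
      real(8), intent(inout) :: t(ncol, K)     ! a generic per-column variable
      integer :: chunk, c0, c1, ic, kk
      integer, parameter :: nthread = 8        ! processors per node
      !$omp parallel do private(c0, c1, ic, kk)
      do chunk = 1, nthread                    ! microtasking: one chunk per processor
         c0 = (chunk - 1)*(ncol/nthread) + 1   ! assumes ncol is a multiple of 8
         c1 = chunk*(ncol/nthread)             ! chunk length 1440 = I*J/(P*8)
         do kk = 2, K                          ! vertical direction: sequential (dependency)
            do ic = c0, c1                     ! vector loop over 1440 collapsed columns
               t(ic, kk) = t(ic, kk) + 0.5d0*t(ic, kk-1)   ! dummy column-wise update
            end do
         end do
      end do
      !$omp end parallel do
    end subroutine physics_columns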

The above discussion of inter-node and intra-node parallelization is summarized in Table 2, which shows the parallel granularity of the three-level parallelization. As described in Figure 3, the arrays for the data transpose and the FFT are packed into large arrays; Table 3 indicates the sizes (KSQ) of these arrays along the vertical direction.

Table 2: Parallel granularity for three-level parallelization.

                 MPI      microtask           vector
FFT              J        KSQ*J/P             (KSQ*J/P)/8
Fwd-LT           M+1      K                   J/2 and M+1
Inv-LT           M+1      K                   J/2
Semi-implicit    M+1      (N+1)*(M+1)/P       ((N+1)*(M+1)/P)/8
Physics          J        I*J/P               (I*J/P)/8

Table 3: Sizes of combined arrays for FFT and data transpose.

Transform    Size of KSQ    L96 (K=96)
Fwd1         5K+2           482
Inv1         2K+2           194
Fwd2         12K+1          1153
Inv2         9K+1           865

5 Optimization for the Legendre Transform

Figure 5: Performance of the Legendre transform kernel (Gflops versus triangular truncation resolution) for (a) the forward LT and (b) the inverse LT. The inner product and outer product patterns are coded in Fortran; results for assembler coding are also shown.

Although the LT kernel is similar to a vector-matrix multiplication, no libraries such as BLAS were used; the following optimization approach was adopted instead to allow flexible performance tuning. In order to evaluate the performance of the LT, an LT kernel program was developed. This kernel includes only the forward and inverse LTs; it is not parallelized and computes only one variable. Figure 5 shows the sustained performance on one processor as a function of the triangular truncation resolution for the forward and inverse LT kernels. Different programming techniques were examined to gain enhanced performance. Here the outer product type is represented by an operation such as V1 = V1 + V2*V3 (V1, V2, V3: vector variables), and the inner product type by S = S + V2*V3 (S: scalar variable). The calculation of the inner product is composed of two stages: in the first stage an outer-product-type calculation is carried out, and in the second stage the summation over the vector elements is done. The second stage does not vectorize in a straightforward way because it involves inter-pipeline computation. In fact, the performance obtained for the inner product is low, at less than 50% of peak even for the T1279 resolution. In contrast, the outer product type is advantageous in gaining higher performance, with 6.61 Gflops for the forward LT and 7.85 Gflops for the inverse LT, which are 82.6% and 98.1% of the theoretical peak performance of a single vector processor, respectively. Assembler coding was attempted for further optimization; the resulting performance is 6.86 Gflops for the forward LT and 7.88 Gflops for the inverse LT, i.e., 85.8% and 98.5% of the theoretical peak. The highly optimized LTs are one of the major contributors to the record-breaking performance obtained with AFES. Since the performance increment due to assembler coding is quite small, it is reasonable to say that this level of performance optimization can be accomplished with Fortran90 alone. Both coding patterns are sketched below.
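As an illustration of the two coding patterns (ours, not the AFES source), the same forward-LT-like sum X(n) = sum over j of P(n,j)*C(j) can be written in either form:

    subroutine lt_two_codings(nmax, jmax, P, C, X_inner, X_outer)
      implicit none
      integer, intent(in)  :: nmax, jmax
      real(8), intent(in)  :: P(nmax, jmax), C(jmax)
      real(8), intent(out) :: X_inner(nmax), X_outer(nmax)
      integer :: n, j
      real(8) :: s
      ! (1) Inner-product form (S = S + V2*V3): the j loop is a reduction whose
      !     final summation across vector elements does not vectorize well.
      do n = 1, nmax
         s = 0.0d0
         do j = 1, jmax
            s = s + P(n, j)*C(j)
         end do
         X_inner(n) = s
      end do
      ! (2) Outer-product form (V1 = V1 + V2*V3, operation No. 3 in Table 1):
      !     the vectorized n loop accumulates into X with no reduction; this is
      !     the pattern that approaches peak performance in Figure 5.
      X_outer = 0.0d0
      do j = 1, jmax
         do n = 1, nmax
            X_outer(n) = X_outer(n) + P(n, j)*C(j)
         end do
      end do
    end subroutine lt_two_codings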

Figure 6 compares the elapsed time of the LT before and after applying the zigzag cyclic mapping in the MPI programming. Note that the horizontal axis represents the spectral wavenumber and the MPI process number for these two cases, respectively. It can be seen from (a), (b), (c), and (d) that the zigzag cyclic mapping is effective in balancing the flop count and the elapsed time for the outer product coding pattern, resulting in enhanced sustained performance. To gain an even better balance among MPI processes, a hybrid programming approach was introduced that combines the outer and inner product coding patterns. As shown in (e), the elapsed time decreases with increasing spectral wavenumber. Each method shows a performance decline where the vector loop length (more specifically, the remainder of the loop length after division by 256) becomes small, but the elapsed time decreases in a different way for the two coding methods. The most balanced elapsed time (f) was realized by choosing, for each wavenumber, whichever coding pattern gives the smaller elapsed time. Such a coding technique may seem somewhat tricky, but it was found to be effective in maximizing the sustained performance.

6 Performance Analyses

Performance analyses were made for 1-day simulations and 10-timestep simulations with the T1279L96 model resolution.

6.1 Performance Measurement Tool

The number of floating-point operations can be counted with the performance analysis tool FTRACE available on the ES. The ES hardware has built-in counters for measures such as the numbers of floating-point operations and vector instructions, clock counts, vector loop lengths, and the time spent on delays due to out-of-cache operations. The use of FTRACE is specified with a compiler option, and the routine embedded in the Fortran code collects this information from the built-in counters for specified subroutines or code segments. Figure 7 shows a sample output for the T1279L96 model simulation.

6.2 Input Data to Simulation

A five-year simulation was conducted in advance with a coarse resolution of T106L20 to prepare numerically stable initial conditions for the T1279L96 runs. Climatologically averaged data were used for boundary conditions such as sea surface temperature and the distribution of sea ice. The ETOPO05 topographical data were linearly interpolated to represent the topography for the T1279L96 fine-resolution model. The time integration step is determined automatically by taking into account the numerical stability conditions at all grid points. In the 1-day simulation described below, the time integration step was 30 sec at every timestep, so 2880 steps were needed in total to accomplish the 1-day simulation.


Figure 6: Effects of zigzag cyclic mapping and hybrid vector programming for the Legendre kernel. Left: for each wavenumber; right: for each MPI process after zigzag cyclic mapping to 640 MPI processes. (a), (b) Flop count; (c), (d) elapsed time for the outer product pattern only; (e), (f) elapsed time for the hybrid pattern of outer and inner products.


*--------------------------*
 FLOW TRACE ANALYSIS LIST
*--------------------------*

Execution : Thu May 2 18:26:04 2002
Total CPU : 2:32'36"700

PROG.UNIT             FREQUENCY  EXCLUSIVE        AVER.TIME    MOPS   MFLOPS  V.OP   AVER.  I-CACHE  O-CACHE    BANK
                                 TIME[sec](  % )     [msec]                   RATIO  V.LEN     MISS     MISS    CONF
msdupdt_lt.sdupdt_lt      23040  2255.198(24.6)     97.882   9951.9   7863.3  99.57  240.0   0.2939   1.1362  0.0055
 -micro1                   2880   282.081( 3.1)     97.945   9945.6   7858.3  99.57  240.0   0.0263   0.1583  0.0012
 -micro2                   2880   281.909( 3.1)     97.885   9951.6   7863.0  99.57  240.0   0.0293   0.1361  0.0006
 -micro3                   2880   282.010( 3.1)     97.920   9948.0   7860.2  99.57  240.0   0.0287   0.1390  0.0006
 -micro4                   2880   281.779( 3.1)     97.840   9956.2   7866.7  99.57  240.0   0.0503   0.1433  0.0006
 -micro5                   2880   281.467( 3.1)     97.732   9967.2   7875.4  99.57  240.0   0.0730   0.1454  0.0006
 -micro6                   2880   281.919( 3.1)     97.889   9951.3   7862.8  99.57  240.0   0.0299   0.1364  0.0006
 -micro7                   2880   282.020( 3.1)     97.924   9947.7   7859.9  99.57  240.0   0.0278   0.1388  0.0006
 -micro8                   2880   282.013( 3.1)     97.921   9947.9   7860.1  99.57  240.0   0.0285   0.1389  0.0006

Figure 7: Sample output of the FTRACE performance analysis tool for a 1-day simulation. T1279L96, 5120 processors (640 MPI processes × 8 microtasks), MPI process number = 0.

6.3 Performance of a 1-day Simulation

Table 4: Overall performance of a 1-day simulation with the T1279L96 resolution.

N: Number of processors   T: Elapsed time (sec)   Tflops (peak ratio)   S: Speedup   ε: Parallel efficiency
 640 ( 80×8)                    9782.9               3.71 (72.5%)           —              —
1280 (160×8)                    5054.3               7.20 (70.3%)          1.936          96.8%
2560 (320×8)                    2751.7              13.20 (64.4%)          3.555          88.9%
5120 (640×8)                    1517.8              23.93 (58.4%)          6.445          80.6%

6.3.1 Total Performance of a 1-day Simulation

The overall performance of a 1-day simulation with the T1279L96 model is shown in Table 4 for various processor configurations, ranging from 80 nodes with 8 intra-node processors up to 640 nodes with 8 processors, the full ES system configuration. Here the flops values were determined on the basis of the performance information output for each MPI process: each value is derived as the total number of floating-point operations over all processors divided by the maximum elapsed time. AFES achieved an excellent overall sustained performance of 23.93 Tflops, which is 58.4% of the theoretical peak performance of the ES. High sustained performance was obtained over a wide range of system configurations, at around 60-70% of the peak performance ratio.

6.3.2 Detailed Performance for a 1-day Simulation

Table 5 summarizes the elapsed time of each major process and its ratio to the overall time spent on the AFES model simulation for different system configurations. The actual sustained performance in Gflops and its peak ratio, obtained from the FTRACE tool, are shown in Table 6 for the processes involving floating-point operations. The data shown are for a particular MPI process (MPI process number 0). The elapsed time data for each latitudinal zone are shown in Figure 8.


Figure 8: Load imbalance in dynamics (1-day simulation). T1279L96, 5120 processors (640 MPI processes × 8 microtasks). Elapsed time (sec) per MPI process for Update, Tendency, Prep, Fwd-DT, Inv-DT, Semi-implicit, and the sum Prep + Tendency + Semi-implicit + Inv-DT.

The Legendre transform (Prep and Tendency) is fastest for MPI process number 0, resulting in the highest peak ratio (95.2%) of the Legendre transform when the model runs on the full 5120 processors.

As can be seen in Table 5, the elapsed time of both dynamics and physics decreases with the increasing number of processors. The sustained peak performance ratio of the Legendre transform is quite high (more than 90%) over a broad range of processor configurations, while its elapsed time makes up around 50-65% of the total. In general, the cost of the Legendre transform tends to dominate the total cost because its number of floating-point operations grows in proportion to the cube of the wavenumber; however, the excellent sustained performance of the Legendre transform offsets this increased elapsed time. The sustained performance of the FFT is lower than that of the LT and gradually decreases to around 50% with the increasing number of processors, since it is not as compute-intensive as the LT and has rather shorter vector loops (such as Inv1 in Table 3). The low peak performance ratio of the semi-implicit time integration loop is mainly attributable to the IF statements defining the target areas in wavenumber space, but its influence on the overall performance is not significant. Physics accounts for 12.0 to 14.6% of the total cost, depending on the processor configuration. Its peak performance ratio ranges from 30 to 50% for the different sub-processes in physics, with little dependence on the number of processors used; this is due to the stable vector performance resulting from the long vector loop lengths I*J/P, where I (J) is the number of grid points along lines of constant latitude (longitude) and P is the number of MPI processes. MPI communications become dominant as the number of processors increases because of the all-to-all type communications required by the data transpose. The output routine also shows a slight increase in cost because the computational results from all the nodes are collected onto the node of MPI process number 0.


Table 5: Elapsed time (sec) (%) for a 1-day simulation with the T1279L96 resolution.

                           number of processors (#nodes × #processors)
                           640 (80×8)        1280 (160×8)      2560 (320×8)      5120 (640×8)
LT (3-D)                   6404.9 (65.3%)    3191.8 (63.1%)    1607.1 (58.5%)     783.8 (51.8%)
FFT                         537.1 ( 5.5%)     277.0 ( 5.5%)     152.5 ( 5.6%)      75.8 ( 5.0%)
LT (2-D) + others           339.2 ( 3.5%)     172.7 ( 3.4%)      97.4 ( 3.6%)      44.4 ( 2.9%)
Semi-implicit               273.2 ( 2.8%)     137.0 ( 2.7%)      68.9 ( 2.5%)      35.0 ( 2.3%)
Dynamics total             7554.4 (77.0%)    3778.5 (74.7%)    1925.9 (70.1%)     939.0 (62.1%)
Large-scale Condensation    345.8 ( 3.5%)     178.5 ( 3.5%)      85.6 ( 3.1%)      43.2 ( 2.9%)
Vertical diffusion          316.8 ( 3.2%)     159.2 ( 3.1%)      80.6 ( 2.9%)      40.0 ( 2.7%)
Radiation (each 1 hour)     255.2 ( 2.6%)     128.7 ( 2.5%)      65.3 ( 2.4%)      32.9 ( 2.2%)
Physics (others)            236.4 ( 2.4%)     113.4 ( 2.2%)      60.4 ( 2.2%)      31.6 ( 2.1%)
Cumulus (Kuo)               144.3 ( 1.5%)      80.6 ( 1.6%)      35.3 ( 1.3%)      16.9 ( 1.1%)
Gravity wave drag           137.8 ( 1.4%)      69.2 ( 1.4%)      35.0 ( 1.3%)      17.4 ( 1.2%)
Physics total              1436.3 (14.6%)     729.5 (14.4%)     362.3 (13.2%)     181.9 (12.0%)
Transpose                   438.9 ( 4.5%)     267.0 ( 5.3%)     229.2 ( 8.3%)     189.9 (12.6%)
MPI (others)                 22.4 ( 0.2%)      19.1 ( 0.4%)      17.2 ( 0.6%)      16.1 ( 1.1%)
MPI total                   461.3 ( 4.7%)     286.1 ( 5.6%)     246.4 ( 9.0%)     206.0 (13.6%)
Subtotal                   9452.0 (96.4%)    4797.2 (94.8%)    2534.5 (92.3%)    1326.9 (87.8%)
Initial                     223.8 ( 2.3%)     149.1 ( 3.0%)     111.7 ( 4.1%)      93.2 ( 6.2%)
Output                      105.4 ( 1.1%)      97.6 ( 1.9%)      93.3 ( 3.4%)      88.7 ( 5.9%)
Input                        26.8 ( 0.3%)      16.5 ( 0.3%)       6.9 ( 0.2%)       3.2 ( 0.2%)
Total                      9808.0 (100.0%)   5057.5 (100.0%)   2746.3 (100.0%)   1512.0 (100.0%)

Table 6: Gflops (peak ratio %) for a 1-day simulation (T1279L96). (Peak performance: 64 Gflops per node.)

                           number of processors (#nodes × #processors)
                           640 (80×8)       1280 (160×8)     2560 (320×8)     5120 (640×8)
LT (3-D)                   59.65 (93.2%)    59.85 (93.5%)    59.43 (92.9%)    60.93 (95.2%)
FFT                        34.12 (53.3%)    33.08 (51.7%)    30.04 (47.0%)    30.23 (47.2%)
LT (2-D) + others          20.19 (31.6%)    19.83 (31.0%)    17.58 (27.5%)    19.28 (30.1%)
Semi-implicit              17.47 (27.3%)    17.42 (27.2%)    17.32 (27.1%)    17.03 (26.6%)
Large-scale Condensation   31.74 (49.6%)    31.64 (49.4%)    31.49 (49.2%)    31.20 (48.8%)
Vertical diffusion         30.03 (46.9%)    29.87 (46.7%)    29.50 (46.1%)    29.72 (46.4%)
Radiation (each 1 hour)    32.63 (51.0%)    32.36 (50.6%)    31.91 (49.9%)    31.68 (49.5%)
Physics (others)           23.58 (36.9%)    24.56 (38.4%)    23.09 (36.1%)    22.10 (34.5%)
Cumulus (Kuo)              19.43 (30.4%)    19.68 (30.8%)    19.25 (30.1%)    19.33 (30.2%)
Gravity wave drag          30.48 (47.6%)    30.36 (47.4%)    29.98 (46.8%)    30.18 (47.2%)
Initial                     0.07 ( 0.1%)     0.10 ( 0.2%)     0.12 ( 0.2%)     0.14 ( 0.2%)


Figure 9: Throughput (GB/s) versus message size (KB) in the data transpose for a 1-day simulation (T1279L96), for Fwd-DT and Inv-DT. Each point on the lines represents the case for 640, 320, 160, and 80 nodes, from left to right.

6.3.3 Performance Analysis of Data Transpose for a 1-day Simulation

Most of the communication arises from the data transpose associated with the spectral transform method, the kernel of dynamics. As shown in Figure 1, there are four stages of communication in connection with the spectral transform method and the associated Legendre transform. Among the available communication primitives, the one-sided communication MPI_PUT, a function newly added in MPI-2, was found to be the most efficient. For synchronization, MPI_WIN_FENCE was used. In addition, the global memory feature was adopted to skip the redundant copy of data to an OS-controlled buffer area, which is otherwise performed automatically, and to enable direct data transfer between nodes. As shown in Figure 3, the volume of data to be transferred varies among the four stages of communication. Figure 9 shows the relationship between message size and throughput for a 1-day simulation, both measured with the FTRACE tool. In this figure, each line represents the sum of the throughputs of the two communication stages associated with the forward LT and with the inverse LT, respectively. Since MPI_WIN_FENCE is used to reduce the load imbalance resulting from the LT, the measured elapsed time for communication includes the overhead cost of synchronization.
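A minimal Fortran sketch of this one-sided communication pattern is shown below. It is our illustration of the standard MPI-2 calls rather than code from AFES; the routine and buffer names are hypothetical, the window is created and freed on every call for brevity, and the ES-specific global memory feature is not shown.

    ! Sketch of a one-sided data transpose with MPI_PUT and MPI_WIN_FENCE:
    ! each process exposes its receive buffer as a window and writes its block
    ! directly into every other process's window.
    subroutine transpose_put(sendbuf, recvbuf, blklen, nprocs, myrank, comm)
      use mpi
      implicit none
      integer, intent(in)    :: blklen, nprocs, myrank, comm
      real(8), intent(in)    :: sendbuf(blklen, nprocs)
      real(8), intent(inout) :: recvbuf(blklen, nprocs)
      integer :: win, ierr, p
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      winsize = 8_MPI_ADDRESS_KIND * blklen * nprocs
      call MPI_Win_create(recvbuf, winsize, 8, MPI_INFO_NULL, comm, win, ierr)
      call MPI_Win_fence(0, win, ierr)          ! open the exposure epoch
      do p = 0, nprocs - 1
         disp = int(myrank, MPI_ADDRESS_KIND) * blklen
         call MPI_Put(sendbuf(1, p+1), blklen, MPI_REAL8, p, disp, blklen, &
                      MPI_REAL8, win, ierr)
      end do
      call MPI_Win_fence(0, win, ierr)          ! synchronize; data now visible
      call MPI_Win_free(win, ierr)
    end subroutine transpose_put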

In Figure 9, the rightmost point of each line represents the case with 80 nodes × 8 processors (640 processors in total), while the leftmost point corresponds to 640 nodes × 8 processors (5120 processors in total). As can be seen, the throughput approaches the peak rate of 12.3 GB/s for the 80-node runs, an excellent result. As the number of nodes increases toward 640, the throughput decreases; this decline in communication throughput is a major cause of the significant increase in communication overhead, particularly for the simulation on the full 640 nodes. Even so, a throughput degradation to 1/2.5 may be considered acceptable given that the message size decreases to 1/52.8 between the 80-node and 640-node runs.

6.4 Performance Analysis of 10-timestep Simulation

The measurements for a 1-day simulation include processes whose execution frequency depends strongly on the time integration span, such as atmospheric radiation. Initial calculations and post-processing are also conducted at particular timesteps, thus affecting the overall performance of the simulation code depending on the span of time integration. Here the performance is therefore shown in Table 7 for the first 20-step segment of the main time integration loop, excluding its initial 10-step segment, which shows fluctuating performance due to the initial calculations.

The full T1279L96 model requires at least 0.8 TB of memory, which becomes available with a configuration of 80 nodes. Therefore, the performance evaluation was made relative to the performance figures obtained for the 80-node configuration.


Figure 10: Parallel scalability of AFES (T1279L96): peak and sustained performance (Tflops) versus the number of processors, with sustained values of 3.86, 7.61, 14.50, and 26.58 Tflops.

Table 7: Overall performance for 10-timestep simulations.

N: Number of processors   T: Elapsed time (sec)   Tflops (peak ratio)   S: Speedup   ε: Parallel efficiency
  80 ( 80×1)                    238.037              0.52 (81.1%)          1.000           —
 160 (160×1)                    119.260              1.04 (81.0%)          1.996         99.798%
 320 (320×1)                     60.516              2.04 (79.8%)          3.933         98.336%
 640 ( 80×8)                     32.061              3.86 (75.3%)          7.425         92.806%
1280 (160×8)                     16.240              7.61 (74.3%)         14.657         91.609%
2560 (320×8)                      8.524             14.50 (70.8%)         27.926         87.267%
5120 (640×8)                      4.651             26.58 (64.9%)         51.180         79.968%

The simulations with 80, 160, and 320 processors are based on parallelization with MPI only, while those with 640, 1280, 2560, and 5120 processors combine MPI-based parallelization with 8 intra-node microtasking processes. In order to identify the major routines affecting the performance characteristics of AFES, the cost estimation was focused on the computationally intensive core of the time integration loop, and the performance evaluation was done for 10 time integration steps.

Table 7 shows the scalable performance of AFES with the T1279L96 resolution for different node configurations. As can be seen, AFES shows excellent sustained performance and scalability across configurations. Figure 10 shows the speedup with respect to the number of processors for the T1279L96 resolution. AFES running on the ES attains remarkably high sustained performance over a wide range of processor configurations, reaching 26.58 Tflops on the full 640 nodes (5120 processors) of the ES. It also achieved 23.93 Tflops for a 1-day simulation using the entire ES system. These figures correspond to 64.9% and 58.4% of the theoretical peak performance, respectively, surpassing the computational efficiency of currently available weather and climate simulation models (typically 25-50%). Considering that these figures include even file I/O and some initial settings that are not computationally intensive, such results are a significant achievement for a real application program.


7 Preliminary Results of Simulations

Figure 11 shows precipitation after three and a half days of model integration with the T106L24, T213L48, and T1279L96 resolutions. The T106 resolution is currently used for high-resolution climate simulations, while the T213 resolution is typically used for daily weather forecasts. These simulations were conducted with an initial field taken from a quasi-equilibrated instantaneous state for early January in a 5-year model simulation with the T106L20 resolution (equivalent to a grid size of approximately 125 km at the equator). As is evident in Figure 11, meso-scale features that were unresolved in the initial state emerge in the T1279 simulation, and they appear to be fairly realistic. For example, each of the cyclones developing and migrating eastward over the mid-latitude oceans is characterized by a moist and precipitating area with a distinctive T-bone shape. The right panels show a tropical cyclone near Madagascar. To the best of our knowledge, no other global climate model has reproduced a tropical cyclone with such realism.

It is possible to conduct a few years of AFES climate simulation with the 10-km mesh within a few months of processing time on the ES. It is therefore now possible to investigate the climate, or statistics, of meso-scale phenomena with super-high-resolution simulations. This is a breakthrough in the interdisciplinary area between meso-scale meteorology and climate dynamics. Such high-resolution models allow scientists to study, for example, changes in the fine structure of the Baiu front due to global warming.

The high performance achieved with AFES on the ES is mainly attributed to the following optimization efforts. First, the careful selection of an effective parallel decomposition, together with the increased model resolution, accounts for much of the performance. Second, significant performance enhancement was obtained for the dynamics kernel by fully exploiting the hardware architecture of the ES through DO-loop optimization and parallel decomposition. The effective combination of vectorization and parallelization also contributed to the high performance. From the viewpoint of performance, the Kuo cloud physics scheme was found to be advantageous over the widely used Arakawa-Schubert (A-S) scheme: the simple formulation of the Kuo scheme reduces the computing load and the load imbalance inherent to cloud physics, contributing to the dominance of the vector-friendly, efficient dynamical processes. In comparison, the number of floating-point operations in the A-S scheme increases with the square of the number of vertical grid points, which easily causes load imbalance.

The advent of high-end computers such as the ES is significant in paving the way for more accurate simulations. Of course, sophisticated modeling is also essential for more realistic simulations; it spans a wide spectrum of issues, including more precise topographical data, improved physical parameterizations tailored to high spatial resolution, and the assimilation of observational data. From the computational aspect, feasible strategies must also be established to cope efficiently with the volume of data produced by such number crunching, including high-end visualization, data extraction techniques, and data archiving. Load balancing is also a crucial issue in maintaining high sustained performance. Even so, the results obtained here are expected to serve as a basis for broadening the horizon of simulation-driven science and to give deeper insight into the complicated mechanisms of global change.

8 Summary

The developments of the ES and AFES were successfully completed, with a high sustained performance of 26.58 Tflops (64.9% of peak performance) for a very high-resolution global climate model using the full 640 nodes of the ES. This encouraging performance can act as a driving force for the comprehensive understanding of challenging issues such as the impacts of anthropogenic emissions of greenhouse gases on the climate system. Improved computing capabilities are central to the scientific understanding and policy making required for the sustainable development of human societies in harmony with planet Earth. The authors expect that this innovative system can serve as an intermediary between the geosciences and the computational sciences.

Acknowledgements

This paper is dedicated to the late Mr. Hajime Miyoshi, the former director of the Earth Simulator Research and Development Center. Without his outstanding leadership, patient guidance, and enthusiastic encouragement, this project would not have been possible.


Figure 11: Precipitation (mm/hour).

Top: T106L24 (125.1 km); middle: T213L48 (62.5 km); bottom: T1279L96 (10.4 km) (horizontal grid interval at the equator in parentheses). Left: global field; right: a tropical cyclone near Madagascar. The area shown extends from 2°N to 23°S and from 35°E to 77°E.


References

[1] M. Yokokawa, S. Shingu, S. Kawai, K. Tani, and H. Miyoshi, 1998: "Performance Estimation of the Earth Simulator", Proceedings of the 8th ECMWF Workshop, World Scientific, pp. 34-53.

[2] K. Yoshida and S. Shingu, 2000: "Research and Development of the Earth Simulator", Proceedings of the 9th ECMWF Workshop, World Scientific, pp. 1-13.

[3] H. Uehara, M. Tamura, and M. Yokokawa, 2002: "An MPI Benchmark Program Library and Its Application to the Earth Simulator", LNCS 2327.

[4] Y. Tanaka, N. Goto, M. Kakei, T. Inoue, Y. Yamagishi, M. Kanazawa, and H. Nakamura, 1999: "Parallel Computational Design of NJR Global Climate Models", High Performance Computing: ISHPC'99 Proceedings, Springer, pp. 281-291.

[5] A. Numaguti, S. Sugata, M. Takahashi, T. Nakajima, and A. Sumi, 1997: "Study on the Climate System and Mass Transport by a Climate Model", CGER's Supercomputer Monograph Report, 3, Center for Global Environmental Research, National Institute for Environmental Studies.

[6] Cubasch, U., G. A. Meehl, G. J. Boer, R. J. Stouffer, M. Dix, A. Noda, C. A. Senior, S. Raper, and K. S. Yap, 2001: Projections of future climate. In "Climate Change 2001: The Scientific Basis. Contribution of Working Group I to the Third Assessment Report of the Intergovernmental Panel on Climate Change" (Houghton, J. T., Y. Ding, D. J. Griggs, M. Noguer, P. J. van der Linden, X. Dai, K. Maskell, and C. A. Johnson, eds.), Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 525-582.

[7] Gates, W. L., J. Boyle, C. Covey, C. Dease, C. Doutriaux, R. Drach, M. Fiorino, P. Gleckler, J. Hnilo, S. Marlais, T. Phillips, G. Potter, B. Santer, K. Sperber, K. Taylor, and D. Williams, 1998: An Overview of the Results of the Atmospheric Model Intercomparison Project (AMIP I), Bulletin of the American Meteorological Society, 73, pp. 1962-1970.

[8] Shen, X., M. Kimoto, A. Sumi, A. Numaguti, and J. Matsumoto, 2001: Simulation of the 1998 East Asian Summer Monsoon by the CCSR/NIES AGCM. J. Meteorol. Soc. Japan, 79, pp. 741-757.

[9] C. Temperton, 1983: "Self-Sorting Mixed-Radix Fast Fourier Transforms", J. Comput. Phys., 52, pp. 1-23.

[10] C. Temperton, 1983: "Fast Mixed-Radix Real Fourier Transforms", J. Comput. Phys., 52, pp. 340-350.

[11] T. N. Krishnamurti, H. S. Bedi, and V. M. Hardiker, 1998: "An Introduction to Global Spectral Modeling", Oxford University Press.

[12] I. T. Foster and P. H. Worley, 1997: "Parallel Algorithms for the Spectral Transform Method", SIAM J. Sci. Comput., 18(3), pp. 806-837.

[13] J. Drake, I. Foster, J. Michalakes, B. Toonen, and P. Worley, 1995: "Design and Performance of a Scalable Parallel Community Climate Model", Parallel Computing, 21, pp. 1571-1591.

[14] S. W. Hammond, R. D. Loft, J. M. Dennis, and R. K. Sato, 1995: "Implementation and Performance Issues of a Massively Parallel Atmospheric Model", Parallel Computing, 21, pp. 1593-1619.
