APPENDIX D
MODEL PERFORMANCE

The two-dimensional simulations presented in this report were run on the University of Oklahoma ECAS Cray J-90 series computer, which has 8 processors and 256 megawords of main memory. Model performance was measured using the -perf compiler option of the Cray FORTRAN 77 compiler. The performance test used a domain of 64 x 64 x 53 grid points in the x, y, and z directions and was run on one processor using the -Zv compiler option. The large and small time steps were 20 and 4 seconds, respectively. The grid spacing was dx = dy = 2000 m and dz = 250 m, and the base state wind was a constant 10 m/s. The vertically implicit solution technique and the upper radiation condition were implemented. Table D-1 presents a compilation of the performance statistics by subroutine for a three-dimensional mountain wave simulation. The overall code rating for this test is 95.6 MFLOPS.

The percentage of the total time spent in the small time step solver is a function of the ratio of the large time step to the small time step. For a large ratio of large to small time step, the small time step solver requires a larger portion of the total CPU time. Figure D.1 presents a pie chart of the most significant contributors to the model's total CPU time. Approximately 35% of the time is spent in the small time step solvers dwpim3d and tri3d. The subroutine arpi3d also contains small time step calculations for u and v.
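The dependence of the small time step share on the time step ratio can be illustrated with a simple cost model. The sketch below is not taken from ARPI3D: the function name and the unit costs are hypothetical, and it assumes that each large time step performs a fixed amount of work once plus one fixed-cost small step per acoustic substep.

```python
def small_step_fraction(dt_large, dt_small, cost_large, cost_small):
    """Fraction of CPU time spent in the small-time-step solver.

    cost_large: cost of the work done once per large step (arbitrary units)
    cost_small: cost of one small time step (same units)
    Both costs are hypothetical inputs, not measured ARPI3D values.
    """
    n_small = round(dt_large / dt_small)  # small steps per large step
    small_total = n_small * cost_small
    return small_total / (cost_large + small_total)


# A 20 s large step with a 4 s small step gives 5 small steps per large step.
# The unit costs (10 and 1) are placeholders chosen only for illustration.
for dt_small in (10.0, 4.0, 2.0):
    frac = small_step_fraction(20.0, dt_small, cost_large=10.0, cost_small=1.0)
    print(f"dt_small = {dt_small:4.1f} s -> small-step share = {frac:.2f}")
```

Under this model the small time step share grows monotonically as the small time step is shortened relative to the large one, which is the behavior described above.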
 

 

 
Table D-1. Performance statistics for ARPI3D three-dimensional mountain wave simulation on a Cray J-90 series computer using a single processor and the vector compiler option.
 
 
 

 

Figure D.1. Pie chart of CPU time requirements for a three-dimensional mountain wave simulation using ARPI3D and the ECAS Cray J-90 computer.

 

Tests conducted during the initial model development phase measured the efficiency of different terrain transformations and pressure equation formulations. The simple chain rule terrain formulation was found to be significantly faster (>33%, without turbulence) than the strongly conservative form used in a number of models, including the ARPS. This is primarily due to the computationally intensive floating-point divisions in the strong conservation formulation. Recasting the system of equations from pressure to non-dimensional pressure also improves the computational efficiency of the code, as does the use of the advective form of the equations. In the dimensional pressure system of equations, the additional term in the buoyancy relation, which arises from a power series approximation, is computed on the small time step. The cost of this term was not explicitly measured but is estimated to be on the order of a few percent of the total solution time.

Another method of estimating computational efficiency is to compare the model with other established mesoscale numerical models. A rough comparison of ARPI3D with ARPS Version 4.0 was made for a number of simple tests; the results of only two comparisons are presented here. In 2-D mode, ARPI3D is on the order of 12-15 times more efficient (in CPU seconds) than a similarly configured ARPS simulation. In defense of the ARPS, this is primarily because the ARPS has only a pseudo 2-dimensional option: its 2-dimensional mode computes 4 vertical slices, due to boundary condition requirements, while ARPI3D's 2-D mode computes only 1 vertical slice. A more realistic test involves a 3-dimensional cold bubble dropped over a symmetric mountain. Both models were run without moisture since ARPI3D currently uses a dry formulation. The simulation time on a Cray J-90 computer for ARPS is approximately 3 times greater than that required by ARPI3D. Such a large discrepancy is likely due to ARPI3D's use of a simple coordinate transformation (chain rule), the equivalent advective form of the advection terms, the solution of non-dimensional pressure, and the absence of operator subroutines. The memory requirements of the two models are of the same order, with ARPI3D requiring approximately half that of the ARPS.

The three-dimensional experiments presented in this report were performed on the Pittsburgh Supercomputing Center's Cray T3D and T3E massively parallel computers and the University of Oklahoma Hitachi SR2201C parallel supercomputer. During the winter of 1996, the source code was upgraded to include Message Passing Interface (MPI) subroutine calls. MPI was chosen over the Parallel Virtual Machine (PVM) message passing technique because it is more efficient in passing similarly sized packets. Message passing allows the code to be run on massively parallel computer platforms. The advantage of this method is the removal of the memory limitation that exists on the Cray J-90 and other symmetric multi-processor (SMP) platforms. Tests were conducted on the T3D in which the per-processor model grid arrays remained constant and the number of processors increased, so that the domain size increases with the number of processors. This experiment tests the scalability of ARPI3D on a specific machine type. A perfect code implemented on an infinitely fast computer would register the same wall clock times regardless of the number of processors. The relation between the number of grid points per processor and the global domain size is:

 

(D.1)

(D.2)

Here nx and ny are the number of grid points for each processor in the x and y directions, and gnx and gny are the number of global grid points in the x and y directions. The choice of the per-processor domain in (D.1) and (D.2) is based on the desire to eliminate message passing of intermediate variables associated with fourth order spatial derivatives. In the present configuration no intermediate variable passing is required. Other models (e.g. ARPS) use a smaller, more memory-efficient per-processor domain (nx-3 and ny-3) but are required to pass intermediate results. Intermediate variables arise in the turbulence terms and in the fourth order advection and turbulent mixing terms. The disadvantage of the method applied in the present model is a slight increase in the number of grid points per processor. This redundancy is balanced by a more efficient message-passing configuration. Figure D.2 presents a chart of the scalability of ARPI3D through a range of processor configurations on the PSC T3D computer. The values are normalized by the 16-processor test simulation. The results indicate that as the configuration is expanded from 16 to 512 processors the code is 80% efficient. Personal communication with PSC consultants indicates that this efficiency rating is very good, exceeding that of a large fraction of current MPP applications. The code performance was measured on the T3D using the Apprentice performance monitoring software. ARPI3D is rated at approximately 10 MFLOPS on the T3D. This is approximately 9 times slower than simulations performed on a single J-90 processor and 6.5 times slower than a single Hitachi SR2201C node. Attempts were made to improve the code performance on the T3D, but the gains were minimal due to the small data cache on the DEC Alpha processor. ARPS has a similar MFLOPS rating and is equally difficult to optimize on the T3D.
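The trade-off between per-processor grid size and overlap can be sketched with a generic one-dimensional decomposition. This is an illustration only and does not reproduce (D.1) or (D.2): the function names and the `overlap` parameter are hypothetical, and the value of 3 used in the example is one reading of the "nx-3" convention quoted above for ARPS. The overlap actually used by ARPI3D is not stated here.

```python
def global_points(n_local, n_procs, overlap):
    """Global grid points along one direction when n_procs subdomains of
    n_local points each share `overlap` points with their neighbours."""
    return n_procs * (n_local - overlap) + overlap


def local_points(n_global, n_procs, overlap):
    """Inverse relation: per-processor points for a given global size."""
    interior = n_global - overlap
    if interior % n_procs != 0:
        raise ValueError("global grid does not divide evenly among processors")
    return interior // n_procs + overlap


# Hypothetical example: 4 processors in x, 20 points each, overlap of 3.
gnx = global_points(20, 4, 3)
print(gnx, local_points(gnx, 4, 3))
```

A larger overlap increases the number of redundant points each processor carries, which is the memory cost noted above, in exchange for fewer quantities that must be exchanged between neighbours.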

 

Figure D.2. Plot of the normalized wall clock time for simulations with a 20x12x115 per-processor grid as a function of processor configuration. Tests were conducted on the PSC Cray T3D computer.
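For a fixed per-processor grid (a weak scaling test), efficiency is commonly taken as the baseline wall clock time divided by the wall clock time at a larger processor count. The sketch below shows that calculation. The function name is hypothetical and the timings are placeholders, not the measured T3D values; they were chosen only so that the 512-processor case comes out at the 80% level quoted in the text.

```python
def weak_scaling_efficiency(wall_times, baseline_procs):
    """Return {processor count: efficiency} relative to a baseline run.

    wall_times maps processor count to wall clock seconds for runs in
    which the per-processor grid is held fixed.
    """
    t_base = wall_times[baseline_procs]
    return {p: t_base / t for p, t in sorted(wall_times.items())}


# Placeholder timings in seconds (NOT measurements from this report).
timings = {16: 100.0, 64: 110.0, 512: 125.0}
for procs, eff in weak_scaling_efficiency(timings, 16).items():
    print(f"{procs:4d} processors: efficiency = {eff:.2f}")
```

The normalized wall clock time plotted in a figure of this kind is the reciprocal of this efficiency, so a perfectly scaling code would give a flat line at 1.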

 
 
 
APPENDIX E
SOUNDING PROFILES

Sounding data for Wangara Day 33 simulations.

Sounding filename = wang.snd

1-D Sounding Input for ARPI3D

Sounding Data collected at Wangara Surface Experiment,

34.5 South 144.93 East, Australia

Date: 9:00am August 16, 1967

Sounding obtained from Yamada and Mellor, 1975.

Surface Height = 0.0 m, Surface Pressure = 102,300 Pa

Number of Levels = 23

Pressure   Temp.      Qv       U      V
   15000   -65.0  .00000   35.00  00.00
   35000   -40.0  .00023   30.00  00.00
   48000   -15.0  .00023   25.00  00.00
   62300    -5.0  .00026   15.00  00.00
   72300    -1.5  .00031    7.00  00.00
   79900    -0.2  .00060     .50   1.10
   82000     1.4  .00070    -.70   1.72
   84000     1.7  .00080   -1.19    .26
   86100     2.0  .00080   -1.45    .07
   88300     2.3  .00150   -1.93   -.90
   89000     2.6  .00180   -2.29  -1.41
   90500     2.5  .00200   -2.55  -1.16
   91600     2.9  .00220   -2.28   -.76
   92800     3.5  .00250   -2.45   -.48
   93900     3.8  .00290   -2.43   -.35
   95100     4.7  .00320   -2.79   -.26
   96300     5.8  .00330   -2.49   -.37
   97400     6.8  .00330   -3.20   -.47
   98600     7.4  .00370   -3.12   -.51
   99800     7.5  .00380   -2.79   -.57
  101100     5.4  .00380   -2.92   -.38
  101700     5.1  .00370   -2.84    .03
  102300     5.5  .00420     0.0   0.00
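As an illustration of how a listing in this layout might be read, the sketch below parses the five-column data rows and converts temperature to dry potential temperature. It is not the ARPI3D input routine: the function names are hypothetical, the second column is assumed to be temperature in degrees Celsius and the first pressure in Pa, and the constants (reference pressure 100000 Pa, Rd = 287 and cp = 1004 J/(kg K)) are conventional values rather than values taken from the model.

```python
P0 = 100000.0           # assumed reference pressure (Pa)
KAPPA = 287.0 / 1004.0  # assumed Rd/cp for dry air


def parse_sounding_rows(text):
    """Return (pressure, temp, qv, u, v) tuples from a sounding listing.

    Lines that do not consist of exactly five numeric fields (titles,
    column headers, blank lines) are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            rows.append(tuple(float(f) for f in fields))
        except ValueError:
            continue  # non-numeric line such as the column header
    return rows


def potential_temperature(pressure_pa, temp_c):
    """Dry potential temperature (K) from pressure (Pa) and temperature (C)."""
    return (temp_c + 273.15) * (P0 / pressure_pa) ** KAPPA


# Three rows copied from the Wangara listing above.
SAMPLE = """Pressure Temp. Qv U V
15000 -65.0 .00000 35.00 00.00
101700 5.1 .00370 -2.84 .03
102300 5.5 .00420 0.0 0.00
"""

for p, t, qv, u, v in parse_sounding_rows(SAMPLE):
    print(f"p = {p:8.0f} Pa  theta = {potential_temperature(p, t):7.2f} K")
```

The conversion applies only to listings whose second column is temperature; a listing whose second column is headed Pt already holds potential temperature and would be used directly.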

 

Sounding data for the January 11, 1972 Boulder, Colorado windstorm simulations.

Sounding filename = bld2.snd

1-D Sounding Input for ARPI3D

Sounding Data collected at Grand Junction, Colorado

Date: 12Z Jan. 11, 1972

Sounding estimated from Figure 10 Durran and Klemp (1983)

The top two layers were taken from Peltier and Clark (1979)

Surface Height = 0.0 m, Surface Pressure = 82000 Pa

Number of Levels = 13

   Pressure         Pt       Qv      U     V
  100.00000  1481.0000  0.00000  20.00  0.00
 1000.00000  764.00000  0.00000  20.00  0.00
11000.00000  388.00000  0.00000  20.00  0.00
16000.00000  350.00000  0.00000  22.00  0.00
18500.00000  328.50000  0.00000  31.00  0.00
22000.00000  321.50000  0.00000  44.00  0.00
24000.00000  319.50000  0.00000  53.00  0.00
30000.00000  317.00000  0.00000  46.00  0.00
40000.00000  313.00000  0.00000  38.50  0.00
53000.00000  308.50000  0.00000  31.00  0.00
62500.00000  296.50000  0.00000  20.00  0.00
68000.00000  293.00000  0.00000  17.00  0.00
82000.00000  293.00000  0.00000   9.00  0.00

 

Sounding data for the January 9, 1989 Boulder, Colorado 2305 UTC simulations.

Sounding filename = cl2d.snd

1-D Sounding Input for ARPI3D taken from Clark et al. (1994)

Data collected at Craig, Colorado

Date: 15Z January 9, 1989

Surface Height = 0.0 m, Surface Pressure = 100000 Pa

Number of Levels = 20

   Pressure      Temp.       Qv      U     V
  500.00000  -55.70000  0.00000  30.00  0.00
 2500.00000  -55.70000  0.00000  30.00  0.00
 5000.00000  -55.70000  0.00000  30.00  0.00
 9810.00000  -55.70000  0.00000  30.00  0.00
11880.00000  -55.80000  0.00000  31.09  0.00
15090.00000  -56.90000  0.00000  31.26  0.00
19980.00000  -60.90000  0.00000  40.57  0.00
24970.00000  -57.10000  0.00000  39.28  0.00
29920.00000  -47.20000  0.00000  34.74  0.00
35000.00000  -41.90000  0.00000  29.77  0.00
40030.00000  -35.00000  0.00000  29.07  0.00
45000.00000  -28.80000  0.00000  27.14  0.00
50170.00000  -22.60000  0.00000  26.11  0.00
55210.00000  -20.30000  0.00000  27.99  0.00
60290.00000  -15.30000  0.00000  25.50  0.00
69460.00000  -11.90000  0.00000  23.26  0.00
70220.00000  -11.00000  0.00000  13.34  0.00
75420.00000   -6.80000  0.00000   9.96  0.00
81160.00000   -6.00000  0.00000   3.75  0.00
100000.0000   -6.00000  0.00000   3.75  0.00