Parallel Textbook Multigrid Efficiency


Transcript of Parallel Textbook Multigrid Efficiency

Page 1: Parallel Textbook Multigrid Efficiency

Parallel Textbook Multigrid Efficiency - Uli Rüde

Parallel Textbook Multigrid Efficiency

Lehrstuhl für Simulation FAU Erlangen-Nürnberg

U. Rüde (FAU Erlangen, [email protected]) joint work with B. Gmeiner, M. Huber, H. Stengel, H. Köstler (FAU)

C. Waluga, M. Huber, L. John, B. Wohlmuth (TUM) M. Mohr, S. Bauer, J. Weismüller, P. Bunge (LMU)


FOURTEENTH COPPER MOUNTAIN CONFERENCE ON

ITERATIVE METHODS

March 20 - March 25, 2016


Page 2: Parallel Textbook Multigrid Efficiency


Two questions: What is the minimal cost of solving a PDE? (such as Poisson's or Stokes' equation in 3D)

Asymptotic results of the form

Cost ≦ C · N^p (flops)

with an unspecified constant C are inadequate to predict real performance;

flops as a metric become questionable.

How do we quantify true cost? (i.e., the resources needed)

Number of flops? Memory consumption? Memory bandwidth (aggregated)? Communication bandwidth? Communication latency? Power consumption?


Page 3: Parallel Textbook Multigrid Efficiency


Two Multi-PetaFlops Supercomputers: JUQUEEN and SuperMUC

JUQUEEN: Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops peak, 448 TByte memory, TOP500: #11

SuperMUC: Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops peak, TOP500: #23

Page 4: Parallel Textbook Multigrid Efficiency


How big a PDE problem can we solve?

400 TByte of main memory = 4·10^14 bytes ⇒ 5 vectors, each with 10^13 elements (8 bytes = double precision). Even with a sparse matrix format, storing a matrix of dimension 10^13 is not possible on JUQUEEN

⇒ a matrix-free implementation is necessary. Which algorithm?

Multigrid: asymptotically optimal complexity, Cost = C · N with a "moderate" constant C.

Does it parallelize well? What overhead?


(equal, not merely „≦" as in the bound on slide 2)
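A rough back-of-the-envelope check (my arithmetic, assuming a 15-point stencil stored in a CRS-like format with one 8-byte value and one 4-byte column index per nonzero) supports the claim that even a sparse matrix of dimension 10^13 cannot be stored:

10^13 rows × 15 nonzeros/row × 12 bytes/nonzero ≈ 1.8 × 10^15 bytes = 1.8 PByte ≫ 448 TByte.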

Page 5: Parallel Textbook Multigrid Efficiency


Energy

computer generation            gigascale (10^9 FLOPS)   terascale (10^12 FLOPS)        petascale (10^15 FLOPS)              exascale (10^18 FLOPS)
desired problem size DoF = N   10^6                     10^9                           10^12                                10^15
energy estimate: 1 nJoule × N^2
(all-to-all communication)     0.278 Wh                 278 kWh                        278 GWh                              278 PWh
                               (10 min of LED light)    (2 weeks of blow-drying hair)  (1 month of electricity for Denver)  (100 years of world electricity production)
TerraNeo prototype             0.13 Wh                  0.03 kWh                       27 kWh                               ?
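As a plausibility check of the table's scaling (an illustrative calculation, not taken from the slides), consider the terascale column with N = 10^9:

E = 1 nJ × N^2 = 10^{-9} J × 10^{18} = 10^9 J = 10^9 / (3.6 × 10^6) kWh ≈ 278 kWh.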

Only optimal algorithms qualify

At extreme scale: optimal complexity is a must!

Page 6: Parallel Textbook Multigrid Efficiency


Hierarchical Hybrid Grids (HHG)

Parallelize „plain vanilla“ multigrid for tetrahedral finite elements

partition the domain; parallelize all operations on all grids; use clever data structures; matrix-free implementation

Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependency in the grid hierarchy?

Elliptic problems always require global communication. This cannot be accomplished by

local relaxation, Krylov space acceleration, or domain decomposition without a coarse grid.

Bey‘s Tetrahedral Refinement

B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: "Is 1.7×10^10 unknowns the largest finite element system that can be solved today?", SuperComputing 2005. (ISC Award 2006)


Page 7: Parallel Textbook Multigrid Efficiency


HHG: A modern architecture for FE computations

Geometrical Hierarchy: Volume, Face, Edge, Vertex
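A minimal sketch (hypothetical type and field names, not the actual HHG data structures) of how such a primitive hierarchy can be organized: each macro primitive owns the regularly refined unknowns in its interior and keeps ghost copies of the neighboring primitives' boundary values.

#include <stddef.h>

/* Hypothetical sketch (not the HHG implementation): mesh primitives of the
   macro mesh, each holding its owned unknowns plus a ghost layer. */
typedef struct {
    int     level;    /* refinement level inside the grid hierarchy   */
    size_t  n_owned;  /* unknowns owned by this primitive             */
    size_t  n_ghost;  /* ghost copies of neighboring boundary values  */
    double *u;        /* solution values, owned + ghost, linearized   */
    double *f;        /* right-hand side values                       */
} Primitive;

typedef struct {
    Primitive *volumes;   /* interior unknowns of the macro tetrahedra */
    Primitive *faces;     /* unknowns on shared triangular faces       */
    Primitive *edges;     /* unknowns on shared edges                  */
    Primitive *vertices;  /* unknowns at macro vertices                */
    size_t n_volumes, n_faces, n_edges, n_vertices;
} MacroMesh;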


Page 8: Parallel Textbook Multigrid Efficiency

Copying to update ghost points: Vertex, Edge, Interior

(Figure: local numbering of the vertex, edge, and interior unknowns and the corresponding linearized memory representation; ghost points are filled by copying from the neighboring primitives.)

MPI message to update ghost points

linearization & memory representation


Page 9: Parallel Textbook Multigrid Efficiency


Towards a holistic performance engineering methodology: Parallel Textbook Multigrid Efficiency


Brandt, A. (1998). Barriers to achieving textbook multigrid efficiency (TME) in CFD.
Thomas, J. L., Diskin, B., & Brandt, A. (2003). Textbook multigrid efficiency for fluid simulations. Annual Review of Fluid Mechanics, 35(1), 317-340.
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Towards textbook efficiency for parallel multigrid. Journal of Numerical Mathematics: Theory, Methods and Applications.

Page 10: Parallel Textbook Multigrid Efficiency


Textbook Multigrid Efficiency (TME)

„Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 98]

work unit (WU) = one elementary relaxation sweep

classical algorithmic TME factor: (operations for the solution) / (operations for one work unit)

new parallel TME factor: quantifies algorithmic efficiency combined with implementation scalability


Page 11: Parallel Textbook Multigrid Efficiency


Textbook Multigrid Efficiency (TME)

Full multigrid method (FMG) with linear tetrahedral elements: one V(2,2)-cycle with SOR smoother on each new level reduces the algebraic error to 70% of the discretization error. Textbook efficiency for the 3D Poisson problem: 6.5 WU.
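One plausible bookkeeping sketch (a generic estimate that assumes the residual, restriction, and prolongation work per level amounts to roughly one additional WU; not the paper's exact operation count): in 3D each coarser level holds about 1/8 of the unknowns, so

W_{V(2,2)} ≈ (2 + 2 + 1) · Σ_{l≥0} 8^{-l} = (8/7) · 5 ≈ 5.7 WU,   W_{FMG} ≈ (8/7) · W_{V(2,2)} ≈ 6.5 WU,

since FMG runs one such cycle on every level of the hierarchy, which adds another geometric factor of 8/7.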


For simplicity, the computational domain in all our computations is chosen as the unit cube Ω = (0,1)^3 if not mentioned otherwise. For scaling and performance studies on more complex domains, we refer to our work [18], where spherical shells and complex networks of channels were considered using the same version of the HHG software framework. For such geometries the performance was observed to deteriorate by some ten percent compared to the cube domain.

Our first experiment investigates the performance of different multigrid settings, varying the number of smoothing steps, the type of cycle, and using over-relaxation in the smoother. We first consider the scalar case (2.1) with a constant coefficient, i.e., k = 1, and an exact solution given as

u(x) = \sin(2 x_1)\,\sin(4 x_2)\,\sin(16 x_3).   (CC)

The boundary data is chosen according to the exact solution and the source term f on the right-hand side is constructed by evaluating the strong form of the operator on the left-hand side of (2.1). In Figure 4, we display errors and residuals as functions of computational

Figure 4: Error (left) and residual (right) for (CC) at a resolution of 8.6·10^9 grid points, plotted against work units for V(1,1), V(3,3), V(3,3) (SOR), and W(3,3) (SOR) cycles.

cost measured in work units. We note that in the pre-asymptotic stage, the W(3,3) cycle with an over-relaxation factor of ω = 1.3 produces the most efficient error reduction, but interestingly this is not the most efficient method when considering the convergence of the residual. In this respect, the V(3,3) cycles with a mesh-independent over-relaxation factor of again ω = 1.3 in (2.3) are found to perform best. All methods require about 30-50 operator evaluations (WU) to converge to the truncation error of approximately 10^{-6}. Clearly, we have not yet reached the 10 WU required for TME, but since we solve systems as large as 8.6·10^9 unknowns, we believe that this is already in an acceptable range.

To achieve full TME, we next consider an FMG algorithm. To evaluate different variants of FMG, Table 1 lists the ratio

\delta_l = \|u - \bar u_l\| / \|u - u_l\|,   (3.1)

between the total error obtained by the FMG, \|u - \bar u_l\|, and the discretization error \|u - u_l\|, where \bar u_l denotes the result obtained with FMG. For the first row of Table 1 there is only one

Residual reduction for multigrid V- and W-cycles vs. work units (WU)

Yavneh, I. (1996). On red-black SOR smoothing in multigrid. SIAM Journal on Scientific Computing, 17(1), 180-192. Gmeiner, B., Huber, M., John, L., Rüde, U., & Wohlmuth, B. (2015). A quantitative performance analysis for Stokes solvers at the extreme scale. submitted. arXiv preprint arXiv:1511.02134.

Page 12: Parallel Textbook Multigrid Efficiency


extended TME paradigm for parallel performance analysis

Analyse the cost of an elementary relaxation: micro-kernel benchmarks for the smoother

• optimize node-level performance • justify/remove any performance degradation

The aggregate performance of the whole parallel system defines a WU.

Measure the parallel solver performance and compute the TME factor as an overall measure of efficiency

• analyse discrepancies • identify possible improvements

Optimize TME factor


Page 13: Parallel Textbook Multigrid Efficiency

Parallel TME

E_ParTME(N, U) = T(N, U) / T_WU(N, U) = T(N, U) · U · μ_sm / N,   where   T_WU(N, U) = N / (U · μ_sm)

μ_sm: number of elementary relaxation steps per second on a single core
U: number of cores
U · μ_sm: aggregate peak relaxation performance
T(N, U): time to solution for N unknowns on U cores
T_WU(N, U): idealized time for one work unit

The parallel textbook efficiency factor combines algorithmic and implementation efficiency.
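As a small illustration (hypothetical helper and made-up numbers of roughly the right magnitude, not HHG code or measured data), the factor can be evaluated directly from the quantities defined above:

#include <stdio.h>

/* Hypothetical helper (not part of HHG): parallel TME factor
   E_ParTME = T(N,U) / T_WU(N,U), with T_WU(N,U) = N / (U * mu_sm). */
double e_par_tme(double t_solve, /* measured time to solution in seconds     */
                 double n,       /* number of unknowns N                     */
                 double cores,   /* number of cores U                        */
                 double mu_sm)   /* relaxation updates per second, one core  */
{
    double t_wu = n / (cores * mu_sm);  /* idealized time for one work unit */
    return t_solve / t_wu;
}

int main(void)
{
    /* illustrative numbers only: 2e11 unknowns, 16384 cores,
       176e6 updates/s per core, 1.5 s time to solution */
    printf("E_ParTME = %.1f\n", e_par_tme(1.5, 2e11, 16384.0, 176e6));
    return 0;
}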

Page 14: Parallel Textbook Multigrid Efficiency


TME Efficiency Analysis: RB-GS Smoother


/* inner loop of the red-black Gauss-Seidel smoother: stride-2 update of one
   row with the 15-point stencil; mp_/tp_/bp_ and _br/_mr/_tr are precomputed
   base offsets of neighboring rows in the middle, top, and bottom planes */
for (int i = 1; i < (tsize - j - k - 1); i = i + 2) {
    u[mp_mr+i] = c[0] * ( - c[1] *u[mp_mr+i+1] - c[2] *u[mp_tr+i-1]
                          - c[3] *u[mp_tr+i]   - c[4] *u[tp_br+i]
                          - c[5] *u[tp_br+i+1] - c[6] *u[tp_mr+i-1]
                          - c[7] *u[tp_mr+i]   - c[8] *u[bp_mr+i]
                          - c[9] *u[bp_mr+i+1] - c[10]*u[bp_tr+i-1]
                          - c[11]*u[bp_tr+i]   - c[12]*u[mp_br+i]
                          - c[13]*u[mp_br+i+1] - c[14]*u[mp_mr+i-1]
                          + f[mp_mr+i] );
}

This loop should execute on a single SuperMUC core at 720 M updates/sec (in theory, at peak performance), but achieves 176 M updates/sec in practice (memory access bottleneck; the red-black ordering prohibits vector loads).

Thus the whole SuperMUC should perform 147,456 × 176 M ≈ 26 T updates/sec.
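For orientation, these rates can be reproduced by a rough count (my arithmetic, assuming about 15 multiplications and 15 additions per update and 8 double-precision flops per cycle on a 2.7 GHz Sandy Bridge core):

(2.7 × 10^9 cycles/s × 8 flops/cycle) / (30 flops/update) = 720 × 10^6 updates/s,   and   147,456 × 176 × 10^6 ≈ 2.6 × 10^13 ≈ 26 T updates/s.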

(Figure: the base pointers tp_*, mp_*, bp_* with row offsets _br, _mr, _tr used in the loop above address the neighboring rows of the 15-point stencil; μ_sm and U·μ_sm label the single-core and aggregate update rates.)

Page 15: Parallel Textbook Multigrid Efficiency


Execution-Cache-Memory Model (ECM)

ECM model for the 15-point stencil on SNB core.

Arrow indicates a 64 Byte cache line transfer.

Run-times represent 8 elementary updates.


Hager et al. Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 2013.


Figure 5: ECM model for the 15-point stencil with variable coefficients (left) and constant coefficients (right) on an SNB core. An arrow indicates a 64-byte cache line transfer. Run-times represent 8 elementary updates.

Since the performance predictions obtained as above by the roofline model are unsatisfactory, we must proceed to develop a more accurate analysis. To this end, a more careful inspection of the code reveals the following: The vectorization report by the Intel compiler confirms that the innermost loop of the stencil code is optimally executed. However, the roofline model in its simplest setting does not account for possible non-optimal instruction code scheduling. Also the run-time contributions of transfers inside the cache hierarchy are not taken into account. To include these effects, we employ the Execution-Cache-Memory (ECM) model of [22, 33], and we refer to [18] for a thorough description of the modeling process for this kernel.

An overview of the input parameters to the model is displayed in Figure 5. Different assumptions about which operations can be overlapped lead to a corridor of performance predictions of 159-189 MLup/s for a single core. The obtained measurements are within this range, indicating that the micro-architecture is capable of overlapping at least some of the operations (cf. Figure 6). We also note that the (CC) kernel cannot saturate the maximal memory bandwidth of 42 GB/s with 8 cores. A comparison of the different run-time contributions shows that the single-core performance is in fact dominated by code execution and that, in contrast to the initial analysis, instruction throughput and not memory bandwidth is the critical resource. The poor predictive results that were obtained with the simple model are mainly caused by an over-optimistic assumption for the peak performance value that was based on FLOPS throughput alone. For the (CC) smoother, additional instruction overhead has to be considered. We remark that this is caused by the red-black pattern of updates, which requires a high number of logistic machine instructions before vectorized operations can be used. In such cases, a thorough understanding of the performance depends on a careful code analysis. As demonstrated, this leads to a realistic prediction interval for the single-core performance.
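In general terms (a simplified sketch of how such a corridor arises, not necessarily the exact ECM variant used in [22, 33]): with T_core the in-core execution time and T_L1L2, T_L2L3, T_L3Mem the cache line transfer times between the levels of the memory hierarchy, a pessimistic prediction assumes no overlap and an optimistic one full overlap,

T_pess = T_core + T_L1L2 + T_L2L3 + T_L3Mem,   T_opt = max(T_core, T_L1L2 + T_L2L3 + T_L3Mem),

and converting 8 updates per cache line at these two bounds yields a prediction interval such as the 159-189 MLup/s quoted above.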

We can additionally conclude that code optimization strategies that try to speed up the execution of a WU by reducing the memory bandwidth requirements, in particular cache


Page 16: Parallel Textbook Multigrid Efficiency


TME and Parallel TME results


Setting/Measure     E_TME   E_SerTME   E_NodeTME   E_ParTME1   E_ParTME2
Grid points         -       2·10^6     3·10^7      9·10^9      2·10^11
Processor cores U   -       1          16          4 096       16 384
(CC) - FMG(2,2)     6.5     15         22          26          22
(VC) - FMG(2,2)     6.5     11         13          15          13
(SF) - FMG(2,1)     31      64         100         118         -

Table 4: TME factors for different problem sizes and discretizations.

Table 4 lists the efficiencies for the (CC), (VC), as well as the (SF) case. In the latter case, we employ the FMG-4CG-1V(2,1) pressure correction cycle. All cycles are chosen such that they are as cheap as possible but still keep the ratio δ_l ≈ 2. We observe a larger gap between the E_TME and the new E_ParTME for (CC) and (SF) compared to the previous cases. We also observe the effect that traversing the coarser grids of the multigrid hierarchy is less efficient than processing the large grids on the finest level. This is particularly true because of communication overhead and the deteriorating surface-to-volume ratio on coarser grids, but also because of loop startup overhead. When there are fewer operations, the loop startup time is less well amortized.

The case (VC) exhibits a better relative performance, since time is here spent in the more complex smoothing and residual computations. However, for the quality of an implementation it is crucial to show that the WU is implemented as time-efficiently as possible for a given problem. We remark that a good parallel efficiency by itself has limited value when the serial performance of the algorithm is slow. In this sense, multigrid may exhibit a worse parallel efficiency than other solvers while it is superior in terms of absolute execution time for a given problem and a given computer architecture.

E_SerTME and E_NodeTME are measured using only one core (U = 1) and one node (U = 16), respectively. We observe that the transition from one core to one node causes a similar overhead as the transition from one node to 256 nodes (E_ParTME1). For the (CC) case, roughly a factor of two lies between the TME and the serial implementation, and a further factor of two between the serial and the parallel execution. To interpret these results, recall that TME and ParTME would only coincide if there were no inter-grid transfers, if the loops on coarser grids were executed as efficiently as on the finest mesh, if there were no communication overhead, and if no additional copy operations or other program overhead occurred.

For the last setup, E_ParTME2, we increase the number of grid points per core to the maximal capacity of the main memory. This also provides a better (lower) surface-to-volume ratio and consequently decreases the E_ParTME even for more compute cores.

The HHG multigrid implementation is already designed to minimize these overheads so that we can solve very large finite element systems. The largest Stokes system presented in Table 4 is solved on 4096 cores and has 3.6×10^10 degrees of freedom. Note that though this compares favorably with some of the largest FE computations to date, it is already achieved on a quite small partition of the SuperMUC cluster, which has in total 155,656 cores. For scaling results for even larger computations with more than 10^12 degrees of freedom we refer to [18, 19].

Three model problems: scalar with constant coefficients (CC), scalar with variable coefficients (VC), and Stokes solved via a Schur complement (SF).

Full multigrid with the number of iterations chosen such that asymptotic optimality is maintained. TME = 6.5 (algorithmically, for the scalar cases); ParTME is around 20 for the scalar PDEs and ≥ 100 for Stokes.

Page 17: Parallel Textbook Multigrid Efficiency


Application to Earth Mantle Convection Models


Ghattas, O., et al.: An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle. Gordon Bell Prize 2015.

Weismüller, J., Gmeiner, B., Ghelichkhan, S., Huber, M., John, L., Wohlmuth, B., Rüde, U., & Bunge, H.-P. (2015). Fast asthenosphere motion in high-resolution global mantle flow models. Geophysical Research Letters, 42(18), 7429-7435.

Page 18: Parallel Textbook Multigrid Efficiency


HHG Solver for Stokes System Motivated by Earth Mantle convection problem


Scale up to ~10^12 nodes/DoFs ⇒ resolve the whole Earth mantle globally with 1 km resolution

Solution of the Stokes equations

Boussinesq model for mantle convection problems

derived from the equations for the balance of forces and the conservation of mass and energy:

-\nabla \cdot (2\eta\,\epsilon(u)) + \nabla p = \rho(T)\,g,

\nabla \cdot u = 0,

\partial T/\partial t + u \cdot \nabla T - \nabla \cdot (\kappa \nabla T) = \gamma.

u: velocity, p: dynamic pressure, T: temperature, η: viscosity of the material,
ε(u) = ½(∇u + (∇u)^T): strain rate tensor, ρ: density,
κ: thermal conductivity, γ: heat sources, g: gravity vector.

Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.

Page 19: Parallel Textbook Multigrid Efficiency


Coupled Flow-Transport Problem


• 6.5×10^9 DoF • 10,000 time steps • run time 7 days • mid-size cluster: 288 compute cores in 9 nodes of the LSS cluster at FAU

Lehrstuhl für Numerische Mathematik. 1 Mantle convection model: A simplified model.

For the sake of presentation, let us simplify the model by neglecting compressibility, nonlinearities, and inhomogeneities in the material.

In dimensionless form, we obtain:

-\Delta u + \nabla p = -\mathrm{Ra}\,T\,\hat r,

\mathrm{div}\,u = 0,

\partial_t T + u \cdot \nabla T = \mathrm{Pe}^{-1}\,\Delta T.

Notation: u velocity, T temperature, p pressure, \hat r radial unit vector.

Dimensionless numbers:

• Rayleigh number (Ra): determines the presence and strength of convection

• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by the model equations in the interior of the spherical shell together with suitable initial/boundary conditions.
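A minimal sketch (hypothetical skeleton, not the TerraNeo/HHG implementation) of how such a coupled flow-transport problem is advanced in time: solve the Stokes system for the current temperature, then transport the temperature with the resulting velocity.

#include <stddef.h>

/* Hypothetical skeleton: the field type and solver calls are stand-ins that
   show only the structure of the coupled time loop, not the real code. */
typedef struct { double *data; size_t n; } Field;

static void solve_stokes(const Field *T, Field *u, Field *p)
{
    /* multigrid Stokes solve with buoyancy term -Ra * T * r_hat */
    (void)T; (void)u; (void)p;
}

static void advance_temperature(Field *T, const Field *u, double dt)
{
    /* advection-diffusion step for the energy equation */
    (void)T; (void)u; (void)dt;
}

void run_mantle_convection(Field *T, Field *u, Field *p, double t_end, double dt)
{
    for (double t = 0.0; t < t_end; t += dt) {
        solve_stokes(T, u, p);           /* velocity/pressure from current T  */
        advance_temperature(T, u, dt);   /* transport T with the new velocity */
    }
}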


Page 20: Parallel Textbook Multigrid Efficiency


Comparison of Stokes Solvers. Three solvers (sketched below):

Schur complement CG (pressure correction) with MG; MINRES with an MG preconditioner for the vector Laplacian; MG for the saddle point problem with an Uzawa-type smoother
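For orientation (a generic sketch of the discrete system and of one common form of an Uzawa-type smoother step, not necessarily the exact variants used in HHG): the discretized Stokes problem has the saddle-point form

\begin{pmatrix} A & B^T \\ B & 0 \end{pmatrix} \begin{pmatrix} u \\ p \end{pmatrix} = \begin{pmatrix} f \\ g \end{pmatrix}.

The Schur complement solver applies CG to S p = B A^{-1} f - g with S = B A^{-1} B^T, using multigrid for the inner solves with A, while an Uzawa-type smoother relaxes the velocity block and then corrects the pressure with a damped residual of the divergence constraint,

u \leftarrow \mathrm{smooth}(A,\,u,\,f - B^T p), \qquad p \leftarrow p + \omega\,\hat S^{-1}(B u - g),

where \hat S is a cheap approximation of the Schur complement.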

Time to solution without the coarse grid solver, for a residual reduction by 10^{-8}; with the coarse grid solver included, the difference shrinks.


L   DoFs        SCG    MINRES   Uzawa
4   1.4·10^4    1.30   1.08     1.78
5   1.2·10^5    1.40   1.52     1.81
6   1.0·10^6    1.62   2.00     1.95
7   8.2·10^6    1.67   2.08     2.10
8   6.6·10^7    1.56   2.17     2.18
9   5.3·10^8    -      -        2.25

Table 4. Factor of the time to solution for the Laplace and D operator.

evaluations in case of the Uzawa method stay exactly the same. This behaviour is also reflected by the time to solution, cf. Tab. 3.

For a comparison of the Laplace and D operators we also include Tab. 4 for the serial test cases. Here, we present the ratios of the times to solution for the Laplace and D operator. While this factor is approximately 1.5 for the Schur complement CG, it is around 2 for the MINRES and Uzawa-type multigrid methods. The reason for this can be explained as follows ...


5.3. Parallel performance. In the following we present numerical results for the parallel case, illustrating excellent scalability. Besides the comparison of the different solvers, the focus is also on the different formulations of the Stokes equations. Scaling results are presented for up to 786,432 threads and 10^13 degrees of freedom. We consider 32 threads per node and ?? tetrahedra of the initial mesh per thread. Let us also mention that the current computational level is obtained by a 7 times refined initial mesh. Unless specified otherwise, the settings for the solvers remain as described above.

                                  SCG                 MINRES              Uzawa
Nodes    Threads    DoFs          iter      time      iter      time      iter     time
1        24         6.6·10^7      28 (3)    137.26 s  65 (3)    116.35 s  7 (1)    65.71 s
6        192        5.3·10^8      26 (5)    135.85 s  64 (7)    133.12 s  7 (3)    80.42 s
48       1 536      4.3·10^9      26 (15)   139.64 s  62 (15)   133.29 s  7 (5)    85.28 s
384      12 288     3.4·10^10     26 (30)   142.92 s  62 (35)   138.26 s  7 (10)   93.19 s
3 072    98 304     2.7·10^11     26 (60)   154.73 s  62 (71)   156.86 s  7 (20)   114.36 s
24 576   786 432    2.2·10^12     28 (120)  187.42 s  64 (144)  184.22 s  8 (40)   177.69 s

Table 5. Weak scaling results for the Laplace operator (coarse grid iterations in brackets).

                                  SCG              MINRES           Uzawa
Nodes    Threads    DoFs          iter   time      iter   time      iter   time
1        24         6.6·10^7      28     136.30 s  65     115.05 s  7      58.80 s
6        192        5.3·10^8      26     134.26 s  64     130.56 s  7      64.40 s
48       1 536      4.3·10^9      26     135.06 s  62     128.34 s  7      65.03 s
384      12 288     3.4·10^10     26     135.41 s  62     128.34 s  7      64.96 s
3 072    98 304     2.7·10^11     26     139.55 s  62     133.30 s  7      66.08 s
24 576   786 432    2.2·10^12     28     154.06 s  64     139.52 s  8      78.24 s

Table 6. Weak scaling results for the Laplace operator without coarse grid time.

In Tab. 5 we present weak scaling results for the individual solvers in the case of the Laplace operator. Additionally, we also give the coarse grid iteration numbers in brackets, which in the case of the Uzawa solver represent the iterations of the CG used as a preconditioner, cf. Sec. 4.2. We observe excellent scalability of all three solvers, where the Schur complement CG and MINRES have


Fig. 1.2: Operator counts for the individual blocks, n_op(A), n_op(B), n_op(C), n_op(M), for the SCG and UMG methods in percent, where 100% corresponds to the total number of operator counts for the SCG.

requires only half of the total number of operator evaluations in comparison to the SCG algorithm. This is also reflected within the time-to-solution of the UMG.

1.4.2 Scalability

In the following, we present numerical examples illustrating the robustness and scalability of the Uzawa-type multigrid solver. For simplicity we consider the homogeneous Dirichlet boundary value problem. As a computational domain, we consider the spherical shell

\Omega = \{x \in \mathbb{R}^3 : 0.55 < \|x\|_2 < 1\}.

We present weak scaling results, where the initial mesh T_{-2} consists of 240 tetrahedra for the case of 5 nodes and 80 threads on 80 compute cores, cf. Tab. 1.3. This mesh is constructed using an icosahedron to provide an initial approximation of the sphere, see e.g. [19, 41], where we subdivide each resulting spherical prismatoid into three tetrahedra. The same approach is used for all other experiments below that employ a spherical shell. We now consider 16 compute cores per node (one thread per core) and assign 3 tetrahedra of the initial mesh per compute core. The coarse grid T_0 is a twice refined initial grid, where in one uniform refinement step each tetrahedron is decomposed into 8 new ones. Thus the number of degrees of freedom grows approximately from 10^3 to 10^6 when we perform a weak scaling experiment starting on the coarse grid T_0.

The right-hand side f = 0 and a constant viscosity ν = 1 are considered. Due to the constant viscosity it is also possible to use the simplification of the Stokes equations using Laplace operators. In the following numerical experiments, we shall compare this formulation with the generalized form that employs the ε-operator. For

Page 21: Parallel Textbook Multigrid Efficiency


Exploring the Limits …


typically appear in simulations for molecules, quantum mechanics, or geophysics. The initial mesh T_{-2} consists of 240 tetrahedra for the case of 5 nodes and 80 threads. The number of degrees of freedom on the coarse grid T_0 grows from 9.0·10^3 to 4.1·10^7 in the weak scaling. We consider the Stokes system with the Laplace-operator formulation. The relative accuracies for the coarse grid solvers (PMINRES and CG algorithm) are set to 10^{-3} and 10^{-4}, respectively. All other parameters for the solver remain as previously described.

nodes     threads    DoFs         iter   time     time w.c.g.   time c.g. in %
5         80         2.7·10^9     10     685.88   678.77        1.04
40        640        2.1·10^10    10     703.69   686.24        2.48
320       5 120      1.2·10^11    10     741.86   709.88        4.31
2 560     40 960     1.7·10^12    9      720.24   671.63        6.75
20 480    327 680    1.1·10^13    9      776.09   681.91        12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.

Numerical results with up to 10^13 degrees of freedom are presented in Tab. 10, where we observe robustness with respect to the problem size and excellent scalability. Besides the time-to-solution (time), we also present the time excluding the coarse grid (time w.c.g.) and the share in % that is needed to solve the coarse grid. For this particular setup, this fraction does not exceed 12%. Due to 8 refinement levels, instead of 7 previously, and the reduction of threads per node from 32 to 16, longer computation times (time-to-solution) are expected compared to the results in Sec. 4.3. In order to evaluate the performance, we compute the factor t·n_c·n^{-1}, where t denotes the time-to-solution (including the coarse grid), n_c the number of threads used, and n the degrees of freedom. This factor is a measure for the compute time per degree of freedom, weighted with the number of threads, under the assumption of perfect scalability. For 1.1·10^13 DoFs, this factor takes the value of approx. 2.3·10^{-5}, and for the case of 2.2·10^12 DoFs on the unit cube (Tab. 5) approx. 6.0·10^{-5}, which is of the same order. Thus, in both scaling experiments the time-to-solution for one DoF is comparable. The reason why the ratio is even smaller for the extreme case of 1.1·10^13 DoFs is the deeper multilevel hierarchy. Recall also that the computational domain is different in both cases.

The computation with 10^13 degrees of freedom is close to the limits given by the shared memory of each node. By (8), we obtain a theoretical total memory consumption of 274.22 TB, and of 14.72 GB on one node. Though 16 GB of shared memory per node is available, we employ one further optimization step and do not allocate the right-hand side on the finest grid level. The right-hand side vector is replaced by an assembly on-the-fly, i.e., the right-hand side values are evaluated and integrated locally when needed. By applying this on-the-fly assembly, the theoretical

Multigrid with Uzawa smoother, optimized for minimal memory consumption.

10^13 unknowns correspond to 80 TByte for the solution vector alone; JUQUEEN has 450 TByte of memory ⇒ a matrix-free implementation is essential.

Gmeiner, B., Huber, M., John, L., Rüde, U., Wohlmuth, B.: A quantitative performance analysis for Stokes solvers at the extreme scale, arXiv:1511.02134, submitted.

Page 22: Parallel Textbook Multigrid Efficiency


Conclusions and Outlook

Resilience with multigrid. Multigrid scales to peta-scale and beyond. HHG: a lean and mean implementation with excellent time to solution. Reaching 10^13 DoF.
