In Modern HPC Parallel Authored by: Daniel Oliveira ... · Intel Xeon Phi and Nvidia Kepler GPU....

Post on 25-Jun-2020


Radiation-Induced Error Criticality In Modern HPC Parallel Accelerators

Presented by: Christopher Boggs, Clayton Connors on 09/26/2018

Authored by: Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio Lunardi, José María Cela, Philippe Navaux, Luigi Carro, Paolo Rech

Outline

● Background
● Motivation
● Radiation-Induced Effects
● Error Criticality
● Procedure
● Results
● SDCs for HPC Applications
● Discussion

High Performance Computing (HPC)

● Parallel processing for advanced application programs
● Performance above one teraflop (10^12 floating-point operations per second)
● Of interest to businesses of all sizes
    ○ Transaction processing
    ○ Data warehouses
    ○ Complex models
    ○ Etc.

An Accelerator?

● “Accelerate” a computation with massive parallelization
● Numerous shared resources
● Work best with many algebra-heavy operations
● Intel Xeon Phi and Nvidia Kepler GPU

Parallel Accelerators Offer:

● Lower cost
● Flexibility
● High efficiency
● High computational power
● Massive amount of resources
● What about reliability?

With Titan

● 18,688 GPUs
● GPU corruption is common
● Uncorrectable errors: MTBF ~44 hours

https://www.kisspng.com/png-top5-cray-xk7-oak-ridge-leadership-computing-facil-6045373/

Radiation-Induced Effects

● High-energy neutrons are generated by cosmic rays interacting with the atmosphere
● Their interaction with a device can cause Soft Errors
    ○ Bit-flips
    ○ Logic errors
● Can cause a crash (hits in the instruction cache, bus controller, etc.)
● Could cause Silent Data Corruption (SDC)

Silent Data Corruption (SDC)

● A Soft Error hits but DOESN’T cause a crash
    ○ Data cache
    ○ Logic gates (ALU)
    ○ Register files
    ○ etc.
● Especially harmful in HPC
    ○ Fault on a shared resource or scheduler
    ○ Affects several threads, many elements

So What?

● Errors can be small
    ○ Within a certain range, so not seen as errors
    ○ In the xth bit of a float
● Not all errors are critical
    ○ Within a certain range, so not seen as errors
● Quantify and qualify SDCs in Intel Xeon Phi and Nvidia K40
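To make "the xth bit of a float" concrete, here is a minimal sketch that flips one bit of an IEEE 754 double and reports the resulting relative error. The helper names `flip_bit` and `relative_error_percent` are illustrative and do not come from the paper or the slides.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def relative_error_percent(read: float, expected: float) -> float:
    """Relative error of the read value against the expected one, in percent."""
    return abs(read - expected) / abs(expected) * 100.0


expected = 1.0
for bit in (0, 30, 51, 52, 63):
    read = flip_bit(expected, bit)
    print(f"bit {bit:2d}: read={read!r}, "
          f"relative error={relative_error_percent(read, expected):.3g}%")
```

A flip in a low mantissa bit changes the value by far less than 2%, while a flip in the top mantissa bit, the exponent, or the sign bit changes it by 50% or more. The same physical event can therefore be negligible or critical depending on where it lands.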

http://ena.support.keysight.com/e5061b/manuals/webhelp/eng/programming/remote_control/reading-writing_measurement_data/data_transfer_format.htm

Parallel Accelerators

https://techgage.com/article/a-look-at-nvidias-kepler-based-tesla-k-series-gpu-accelerators/

https://www.software.intel.com

How Reliable?

● K40
    ○ Error rate rises with input size
    ○ Threads' data is shared in the register file
● Xeon Phi
    ○ Error rate roughly constant with input size
    ○ Other areas for errors
● The metric should be workload between failures!
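The argument for a workload-based metric can be sketched as follows. This assumes the standard definition of FIT (expected failures per 10^9 device-hours) and a simple "work per run times runs between failures" formula; the paper's exact definition may differ, and every number below is hypothetical.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT = expected failures per 10^9 device-hours


def mtbf_hours(fit: float) -> float:
    """Mean time between failures implied by a FIT rate."""
    return HOURS_PER_FIT_UNIT / fit


def workload_between_failures(work_per_run: float,
                              run_time_hours: float,
                              fit: float) -> float:
    """Amount of work completed, on average, between two failures."""
    runs_between_failures = mtbf_hours(fit) / run_time_hours
    return work_per_run * runs_between_failures


# Hypothetical device: the larger input has a higher FIT rate, but it also
# gets more work done per hour because the hardware is better utilised.
small_input = workload_between_failures(work_per_run=1e9,
                                        run_time_hours=0.001, fit=100.0)
large_input = workload_between_failures(work_per_run=8e9,
                                        run_time_hours=0.004, fit=150.0)
print(f"small input: {small_input:.3g} operations between failures")
print(f"large input: {large_input:.3g} operations between failures")
```

With these made-up numbers the larger input fails more often per hour yet completes more work between failures, which is why a raw error rate alone can mislead when comparing configurations.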

Errors

● Relative Error
    ○ Read = observed value
    ○ Mean of Relative Errors
● Masked Errors
    ○ < 2% RE is tolerable
● Spatial Locality of Errors
    ○ Line, square, etc.
    ○ Share a resource
    ○ Correct error types differently
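The three ideas on this slide can be sketched together: compute the relative error of each element against a golden output, drop anything at or below the 2% tolerance, and classify the pattern of what remains. The classification rules are a simplified assumption ("square" here means the corrupted cells exactly fill their bounding box); the paper's own definitions may differ, and all function names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Position = Tuple[int, int]


def relative_error_percent(read: float, expected: float) -> float:
    """Relative error of the read (observed) value, in percent."""
    if expected == 0.0:
        return 0.0 if read == 0.0 else float("inf")
    return abs(read - expected) / abs(expected) * 100.0


def find_corrupted(read: List[List[float]],
                   expected: List[List[float]],
                   tolerance_percent: float = 2.0) -> Dict[Position, float]:
    """Map (row, col) to relative error for elements above the tolerance."""
    corrupted = {}
    for i, (read_row, expected_row) in enumerate(zip(read, expected)):
        for j, (r, e) in enumerate(zip(read_row, expected_row)):
            error = relative_error_percent(r, e)
            if error > tolerance_percent:
                corrupted[(i, j)] = error
    return corrupted


def classify_locality(positions: Iterable[Position]) -> str:
    """Classify a set of corrupted positions (simplified, assumed rules)."""
    cells = set(positions)
    if not cells:
        return "none"
    if len(cells) == 1:
        return "single"
    rows = {i for i, _ in cells}
    cols = {j for _, j in cells}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return "square" if len(cells) == box_area else "random"


expected = [[10.0] * 4 for _ in range(4)]
read = [row[:] for row in expected]
read[1][1] = 10.1  # about 1% off: below the tolerance, so masked
read[2][0] = 20.0  # 100% off
read[2][3] = 0.0   # 100% off

corrupted = find_corrupted(read, expected)
mean_relative_error = sum(corrupted.values()) / len(corrupted)
print(sorted(corrupted), mean_relative_error, classify_locality(corrupted))
```

Separating "how wrong" (relative error) from "where" (locality) matters because the two call for different remedies: small errors can be tolerated, while line or square patterns hint at which shared resource was hit.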

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

“Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, WMC_2017_Rio_Daniel

Testing

● Each architecture tested for 800 hours
● Equivalent to ~91,000 years of natural radiation
● Algorithms chosen to:
    ○ Simulate different resources
    ○ Represent HPC applications
    ○ Minimize error masking

Algorithms

● DGEMM
    ○ Matrix multiplication
● LavaMD
    ○ Calculates interactions of particles
● HotSpot
    ○ Simulates energy dissipation
● CLAMR
    ○ Fluid dynamics application

DGEMM

Mean relative error and number of corrupted elements are both lower for K40

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

DGEMM

>2% filter removes most random errors on K40

ABFT corrects single and line errors in linear time

FIT less dependent on input size for Xeon Phi
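As an illustration of how ABFT can correct single and line errors, the sketch below uses the classic checksum scheme for matrix multiplication (Huang and Abraham): a checksum row is appended to A and a checksum column to B, so the product carries row and column sums that locate and repair corrupted elements. This is a simplified stand-in and not necessarily the ABFT variant evaluated in the paper; the function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def checksum_matmul(a, b):
    """Multiply a by b and return the product with checksum row and column."""
    a_with_column_sums = a + [[sum(col) for col in zip(*a)]]
    b_with_row_sums = [row + [sum(row)] for row in b]
    return matmul(a_with_column_sums, b_with_row_sums)


def check_and_correct(cf, tol=1e-9):
    """Repair a single error or one corrupted row/column line in place.

    Assumes the checksum row and column themselves are intact.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    row_diff = [sum(cf[i][:m]) - cf[i][m] for i in range(n)]
    col_diff = [sum(cf[i][j] for i in range(n)) - cf[n][j] for j in range(m)]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and bad_cols:
        # All errors sit in one row: each column checksum gives one offset.
        for j in bad_cols:
            cf[bad_rows[0]][j] -= col_diff[j]
        return "single" if len(bad_cols) == 1 else "line"
    if len(bad_cols) == 1 and bad_rows:
        # All errors sit in one column: each row checksum gives one offset.
        for i in bad_rows:
            cf[i][bad_cols[0]] -= row_diff[i]
        return "line"
    return "uncorrectable"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = checksum_matmul(a, b)
cf[1][0] += 5  # inject a line error: two corrupted elements in row 1
cf[1][1] -= 3
print(check_and_correct(cf), [row[:2] for row in cf[:2]])
```

Verification only sums rows and columns of the result, so its cost grows linearly with the number of matrix elements, which is small next to the multiplication itself. Patterns that touch several rows and several columns at once are reported as uncorrectable by this sketch.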

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

DGEMM

● FIT correlates with input size on K40 but not on Xeon Phi
    ○ NVIDIA devices have a dedicated scheduler
    ○ K40 keeps active thread data on device

Source: Rech, Pilla, Navaux, Carro. “Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability,” DSN, Atlanta, USA, 2014.

LavaMD

Number of corrupted elements lower for K40

Mean relative error lower for Xeon Phi

Exponentiation may cause large deviations

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

LavaMD

Xeon Phi: cubic and square errors from the larger shared cache

Weaker FIT correlation with input size on K40: local memory use limits thread count

K40 locality vs input size: errors are less likely to be “shared” for larger inputs

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

HotSpot

Number of corrupted elements lower for K40

Mean relative error appears lower for K40 (not stated in paper)

Errors “dissipate”
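Why errors "dissipate" in a stencil code can be seen with a toy model. The sketch below is not the HotSpot kernel: it assumes a simple five-point averaging stencil on a uniform grid, injects one corrupted cell, and tracks how many cells differ from a fault-free run and how large the worst relative error is. All names and values are illustrative.

```python
def stencil_step(grid):
    """One averaging step: each interior cell becomes the mean of itself
    and its four neighbours; boundary cells are held fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return new


def track_error(size=9, steps=6, fault_value=200.0):
    """Return (affected cells, max relative error in %) after each step."""
    golden = [[100.0] * size for _ in range(size)]
    faulty = [row[:] for row in golden]
    faulty[size // 2][size // 2] = fault_value  # inject a single corrupted cell
    history = []
    for _ in range(steps + 1):
        errors = [abs(f - g) / abs(g) * 100.0
                  for f_row, g_row in zip(faulty, golden)
                  for f, g in zip(f_row, g_row) if f != g]
        history.append((len(errors), max(errors, default=0.0)))
        golden, faulty = stencil_step(golden), stencil_step(faulty)
    return history


for step, (affected, worst) in enumerate(track_error()):
    print(f"step {step}: {affected:2d} cells affected, max error {worst:.2f}%")
```

In this toy model the corruption reaches more cells at every step while the worst-case error shrinks, which is the spread-and-attenuate behavior described for HotSpot.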

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

HotSpot

>2% threshold removes most errors on both devices

Runtime error checking can affect performance

[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]

CLAMR

Only tested on Xeon Phi

All errors were >2%

[Figure: Xeon Phi locality map (for a single execution). Source: De Oliveira et al., HPCA 2017.]

CLAMR

● (Related work) Runtime error checking showed fault coverage of 82%

Source: Atkinson, Debardeleben, Guan, Robey, Jones. “Fault injection experiments with the clamr hydrodynamics mini-app,” ISSREW, 2014.

Conclusion

● DGEMM more resilient on K40
    ○ GPUs have shortened pipelines
● LavaMD more resilient on Xeon Phi
    ○ Transcendental function unit more prone to corruption in K40?
● HotSpot spreads errors
    ○ This behavior may hold for all stencil applications
● CLAMR spreads errors without attenuating them
● Xeon Phi keeps corrupted elements around for longer

Future Work

● Determine sources of most critical errors

Discussion Questions

● Does the provided data allow for anything beyond comparing the two tested devices?

● Would it be tolerable for manufacturers to target “lower relative error” at the expense of having a higher total number of errors?

● Is it fair to irradiate the chips but not the DRAM?