Radiation-Induced Error Criticality In Modern HPC Parallel Accelerators

Presented by: Christopher Boggs, Clayton Connors on 09/26/2018

Authored by: Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio Lunardi, José María Cela, Philippe Navaux, Luigi Carro, Paolo Rech

Outline

● Background
● Motivation
● Radiation-Induced Effects
● Error Criticality
● Procedure
● Results
● SDCs for HPC Applications
● Discussion

High Performance Computing (HPC)

● Parallel processing for advanced application programs
● Performance above a teraflop (10^12 floating-point operations per second)
● Of interest to businesses of all sizes
  ○ Transaction processing
  ○ Data warehouses
  ○ Complex models
  ○ Etc.

An Accelerator?

● “Accelerates” a computation through massive parallelization
● Numerous shared resources
● Works best on workloads heavy in algebraic operations
● Examples: Intel Xeon Phi and NVIDIA Kepler GPU

Parallel Accelerators Offer:

● Lower cost
● Flexibility
● High efficiency
● High computational power
● Massive amount of resources
● What about reliability?

With Titan

● 18,688 GPUs
● GPU corruption is common
● Uncorrectable errors: MTBF of ~44 hours
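As a rough back-of-the-envelope sketch, the system-level MTBF can be turned into a per-GPU estimate. This assumes (the slides do not say so) that all 18,688 GPUs fail independently at the same constant rate, so the per-device MTBF is the system MTBF multiplied by the device count. FIT is the usual unit of failures per 10^9 device-hours; the function names are illustrative only.

```python
# Back-of-the-envelope sketch: per-GPU reliability implied by Titan's numbers.
# Assumption (not from the slides): GPUs fail independently at one constant rate.

HOURS_PER_YEAR = 24 * 365.25

def per_device_mtbf_hours(system_mtbf_hours: float, n_devices: int) -> float:
    """With n identical independent devices, system rate = n * device rate."""
    return system_mtbf_hours * n_devices

def fit_from_mtbf(mtbf_hours: float) -> float:
    """FIT = expected failures per 10^9 device-hours."""
    return 1e9 / mtbf_hours

gpu_mtbf = per_device_mtbf_hours(44.0, 18_688)
print(f"per-GPU MTBF: {gpu_mtbf:,.0f} h (~{gpu_mtbf / HOURS_PER_YEAR:.0f} years)")
print(f"per-GPU FIT : {fit_from_mtbf(gpu_mtbf):,.0f}")
```

Under that assumption a single GPU looks very reliable on its own; the short system-level MTBF comes from the sheer number of devices.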

https://www.kisspng.com/png-top5-cray-xk7-oak-ridge-leadership-computing-facil-6045373/

Radiation-Induced Effects

● A number of high-energy neutrons are generated
● Their interaction with the device can cause soft errors
  ○ Bit-flips
  ○ Logic errors
● Can cause a crash when they hit the instruction cache, bus controller, etc.
● Could cause Silent Data Corruption (SDC)

Silent Data Corruption (SDC)

● A soft error hits but DOESN’T cause a crash
  ○ Data cache
  ○ Logic gates (ALU)
  ○ Register files
  ○ Etc.
● Especially harmful in HPC
  ○ Fault on a shared resource or the scheduler
  ○ Affects several threads and many elements

So What?

● Errors can be small
  ○ Within a certain range, so not seen as errors
  ○ In the xth bit of a float
● Not all errors are critical
  ○ Within a certain range, so not seen as errors
● Quantify and qualify SDCs on the Intel Xeon Phi and NVIDIA K40
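To make the "xth bit of a float" point concrete, here is a minimal sketch that flips one bit of an IEEE 754 double and reports the resulting relative error. The helper names `flip_bit` and `relative_error` are hypothetical and only illustrate the idea; they are not taken from the paper.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped

def relative_error(read: float, expected: float) -> float:
    """|read - expected| / |expected|; assumes a nonzero expected value."""
    return abs(read - expected) / abs(expected)

expected = 1.0
for bit in (0, 30, 51, 52, 62):
    read = flip_bit(expected, bit)
    print(f"bit {bit:2d}: read={read!r:>24}  "
          f"relative error={relative_error(read, expected):.3e}")
```

A flip in a low mantissa bit changes the value by far less than any practical tolerance, while a flip in a high mantissa bit or in the exponent changes it drastically. This is why the position of the corrupted bit matters as much as the fact that a bit flipped.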

http://ena.support.keysight.com/e5061b/manuals/webhelp/eng/programming/remote_control/reading-writing_measurement_data/data_transfer_format.htm

Parallel Accelerators

https://techgage.com/article/a-look-at-nvidias-kepler-based-tesla-k-series-gpu-accelerators/

https://www.software.intel.com

How Reliable?

● K40
  ○ Error rate rises with input size
  ○ Thread data is shared in the register file
● Xeon Phi
  ○ Error rate stays constant with input size
  ○ Errors come from other areas
● The metric must be workload between failures!

Errors

● Relative Error
  ○ Read = observed value
  ○ Mean of relative errors
● Masked Errors
  ○ < 2% RE is tolerable
● Spatial Locality of Errors
  ○ Line, square, etc.
  ○ Corrupted elements share a resource
  ○ Different error types are corrected differently
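The three ideas on this slide can be sketched together in a few lines. The relative-error formula used here, |read - expected| / |expected| in percent, is the standard definition and is assumed rather than quoted from the slides. The locality labels and the rules in `classify_locality` are simplified, hypothetical stand-ins for the categories the slide mentions.

```python
from statistics import mean

TOLERANCE_PCT = 2.0  # from the slide: relative errors below 2% are tolerable

def relative_error_pct(read, expected):
    """Relative error in percent; `read` is the observed value (expected != 0)."""
    return abs(read - expected) / abs(expected) * 100.0

def corrupted_elements(read, expected, tolerance_pct=0.0):
    """Map (row, col) -> relative error (%) for mismatches above the tolerance."""
    errors = {}
    for i, (read_row, exp_row) in enumerate(zip(read, expected)):
        for j, (r, e) in enumerate(zip(read_row, exp_row)):
            if r != e:
                re_pct = relative_error_pct(r, e)
                if re_pct > tolerance_pct:
                    errors[(i, j)] = re_pct
    return errors

def classify_locality(positions):
    """Simplified 2D locality label (hypothetical rules, for illustration)."""
    if not positions:
        return "none"
    if len(positions) == 1:
        return "single"
    rows = {i for i, _ in positions}
    cols = {j for _, j in positions}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    # Dense block: every cell of a contiguous bounding box is corrupted.
    contiguous = (max(rows) - min(rows) + 1 == len(rows)
                  and max(cols) - min(cols) + 1 == len(cols))
    if contiguous and len(positions) == len(rows) * len(cols):
        return "square"
    return "random"

expected = [[10.0] * 4 for _ in range(4)]
read = [row[:] for row in expected]
read[1][0] = 10.1   # about 1% off: masked by the 2% filter
read[1][2] = 15.0   # 50% off
read[1][3] = 20.0   # 100% off

all_errors = corrupted_elements(read, expected)
critical = corrupted_elements(read, expected, TOLERANCE_PCT)
print(len(all_errors), classify_locality(all_errors))
print(len(critical), classify_locality(critical), mean(critical.values()))
```

Applying the tolerance before classifying can change both the count and the locality pattern, which is why the results slides report the raw and the filtered views separately.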

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

“Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, WMC_2017_Rio_Daniel

Testing

● Each architecture was tested for 800 hours
● Equivalent to ~91,000 years of natural radiation
● Algorithms were chosen to
  ○ Stimulate (exercise) different resources
  ○ Represent HPC applications
  ○ Minimize error masking
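The two exposure figures imply an acceleration factor for the beam test. The short calculation below is only a sanity check on the slide's numbers; it assumes both figures refer to the same per-architecture exposure and uses an average year of 365.25 days.

```python
HOURS_PER_YEAR = 24 * 365.25

beam_hours = 800          # per architecture, from the slide
natural_years = 91_000    # equivalent natural exposure, from the slide

# How many hours of natural exposure one hour under the beam stands for.
acceleration = natural_years * HOURS_PER_YEAR / beam_hours
print(f"implied acceleration factor: {acceleration:.2e}")
print(f"one beam hour ~ {acceleration / HOURS_PER_YEAR:.2f} years of natural exposure")
```

The implied factor is on the order of a million, so each hour under the beam stands in for more than a century of natural exposure.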

Algorithms

● DGEMM
  ○ Matrix multiplication
● LavaMD
  ○ Calculates interactions of particles
● Hotspot
  ○ Simulates energy dissipation
● CLAMR
  ○ Fluid dynamics application

DGEMM

Mean relative error and number of corrupted elements are both lower for K40

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

DGEMM

>2% filter removes most random errors on K40

ABFT corrects single and line errors in linear time

FIT is less dependent on input size for Xeon Phi
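For readers unfamiliar with ABFT (algorithm-based fault tolerance), the sketch below shows the classic checksum idea for matrix multiplication: row and column sums of the product are predicted from the inputs, and a mismatch in exactly one row and one column pinpoints a single corrupted element. This is a minimal pure-Python illustration with hypothetical function names, not the implementation evaluated in the paper, and it handles only the single-element case (line errors need additional logic).

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def abft_checksums(a, b):
    """Expected row and column sums of A*B, computed from the inputs only."""
    b_row_sums = [sum(row) for row in b]           # B * 1
    a_col_sums = [sum(col) for col in zip(*a)]     # 1^T * A
    row_sums = [sum(x * s for x, s in zip(row, b_row_sums)) for row in a]
    col_sums = [sum(s * b[k][j] for k, s in enumerate(a_col_sums))
                for j in range(len(b[0]))]
    return row_sums, col_sums

def abft_correct_single(c, row_sums, col_sums, tol=1e-9):
    """Locate and fix one corrupted element of C in place.

    Returns the (row, col) that was repaired, or None if C passes the check.
    Raises ValueError when the pattern is not a single-element error.
    """
    bad_rows = [i for i, row in enumerate(c)
                if abs(sum(row) - row_sums[i]) > tol]
    bad_cols = [j for j, col in enumerate(zip(*c))
                if abs(sum(col) - col_sums[j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-element error")
    i, j = bad_rows[0], bad_cols[0]
    c[i][j] -= sum(c[i]) - row_sums[i]   # subtract the row's checksum residual
    return i, j

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
c = matmul(a, b)
row_sums, col_sums = abft_checksums(a, b)
c[2][1] += 64.0                          # inject a fault into one output element
print(abft_correct_single(c, row_sums, col_sums), c == matmul(a, b))
```

The check itself only sums the rows and columns of the result, so its cost grows with the number of output elements rather than with the cost of the multiplication. With real floating-point data the tolerance `tol` has to absorb rounding differences between the two ways of computing the sums.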

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

DGEMM

● FIT correlates with input size on K40 but not on Xeon Phi
  ○ NVIDIA devices have a dedicated scheduler
  ○ K40 keeps active thread data on the device

Source: Rech, Pilla, Navaux, Carro. “Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability,” DSN, Atlanta, USA, 2014.

LavaMD

Number of corrupted elements lower for K40

Relative mean error lower for Xeon Phi

Exponentiation may cause large deviations

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

LavaMD

Xeon Phi: cubic and square errors come from the larger shared cache

Weaker K40 FIT correlation with input size: local memory use limits thread count

K40 locality vs. input size: errors are less likely to be “shared” for larger inputs

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

HotSpot

Number of corrupted elements lower for K40

Relative mean error appears lower for K40 (not stated in paper)

Errors “dissipate”
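The "dissipation" effect can be illustrated with a toy example. The sketch below uses a 1D three-point averaging stencil as a stand-in (an assumption made for illustration; it is not the HotSpot kernel), injects one corrupted cell, and tracks how the error spreads to neighbours while its magnitude stops growing and then shrinks.

```python
def stencil_step(grid):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new

def compare(faulty, golden, eps=1e-12):
    """Return (number of differing cells, largest relative error in percent)."""
    rel = [abs(f - g) / abs(g) * 100.0 for f, g in zip(faulty, golden)]
    return sum(r > eps for r in rel), max(rel)

golden = [300.0] * 21          # uniform field (arbitrary toy values)
faulty = golden[:]
faulty[10] += 150.0            # inject a 50% error into one cell

history = []
for step in range(6):
    history.append(compare(faulty, golden))
    golden, faulty = stencil_step(golden), stencil_step(faulty)

for step, (count, worst) in enumerate(history):
    print(f"step {step}: corrupted cells={count:2d}  "
          f"max relative error={worst:6.2f}%")
```

Each sweep widens the corrupted region by one cell on either side, while the worst-case relative error never increases. With enough iterations many of the corrupted cells drop below a tolerance such as 2%, which is consistent with the slide's remark.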

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

HotSpot

>2% threshold removes most errors on both devices

Runtime error checking can affect performance

Figure panels: K40, Xeon Phi

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

CLAMR

Only tested on Xeon Phi

All errors were >2%

Figure: Xeon Phi locality map (for a single execution)

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

CLAMR

● (Related work) Runtime error checking showed fault coverage of 82%

Source: Atkinson, DeBardeleben, Guan, Robey, Jones. “Fault injection experiments with the CLAMR hydrodynamics mini-app,” ISSREW, 2014.

Conclusion

● DGEMM more resilient on K40
  ○ GPUs have shortened pipelines
● LavaMD more resilient on Xeon Phi
  ○ Transcendental function unit more prone to corruption in K40?
● HotSpot spreads errors
  ○ This behavior may hold for all stencil applications
● CLAMR spreads errors without attenuating them
● Xeon Phi keeps corrupted elements around for longer

Future Work

● Determine sources of most critical errors

Discussion Questions

● Does the provided data allow for anything beyond comparing the two tested devices?

● Would it be tolerable for manufacturers to target “lower relative error” at the expense of having a higher total number of errors?

● Is it fair to irradiate the chips but not the DRAM?