
ESCAPE-2 workshop on fault tolerant algorithms and resilient approaches for exascale computing

MOX - Modelling and Scientific Computing, Dipartimento di Matematica, Politecnico di Milano, 23-24 January 2019

Workshop description

With the growing number of processors on the way to exascale, hardware- and software-related failures in operational weather prediction are bound to become a regular occurrence. A range of computational approaches is available to supply models with fault-tolerance and data recovery capabilities. Channeling these efforts into a resilience framework will improve parallel performance and sustainability for the building blocks of existing and future atmospheric models.

Workshop goals

Through presentations by leading investigators in the field, the workshop surveys the main avenues of current research and fosters discussion with project participants. Specifically, the workshop aims to:

  • Introduce domain scientists to concepts used in the literature on fault-tolerant methods in computational science;
  • Match the properties of resilience approaches with the requirements of mathematical models and algorithms used in atmospheric applications;
  • Identify the most promising strategies for future exascale numerical weather prediction codes.

Attendance

The workshop presentations are public. The discussion sessions are by invitation only.

Workshop Result Summary

The workshop consisted of a first day of seminars by experts in systems resilience and fault-tolerant numerical algorithms, followed by a second day of scientific discussions between the same experts and project participants. The presentations gave a detailed picture of the state of the art in the field and established connections with operational workflows and numerical algorithms used in atmospheric applications.

During the discussion sessions, participants explored in more detail how to complement existing numerical weather prediction (NWP) and climate prediction models with resilience and fault-tolerance techniques. Specific recommendations included benchmarking NWP data volume and operational requirements, pairing fault-tolerant algorithms with system resilience in consistent workflows, coordinating with vendors to provide detailed hardware fault information, and embedding fault-tolerance in domain-specific language programming paradigms.

The conclusions of the workshop will feature in a white paper to be submitted as an ESCAPE-2 project deliverable, and will inform the investigation of hardware and software resiliency tools within existing and future ESCAPE-2 project dwarfs.

Agenda and Presentations

Wednesday, 23rd January 2019
Time | Title | Presenter
12:30 - 13:45 | Welcome & Lunch
13:45 - 14:00 | Introductory Remarks
14:00 - 15:00 | Dealing with unreliable computing platforms at extreme scale | Luc Giraud, INRIA
15:00 - 15:45 | Fault-tolerance for linear solvers with a focus on multigrid | Mirco Altenbernd, University of Stuttgart
15:45 - 16:30 | Coffee Break
16:30 - 17:15 | Exascale resilience strategies for transient solvers | Chris Cantwell, Imperial College
17:15 - 18:00 | Local Failure Local Recovery: Toward Scalable Resilient Parallel Programming Model | Keita Teranishi, Sandia National Laboratories
18:00 - 18:45 | A hands-on approach to secure weather and climate models against hardware faults | Peter Düben, ECMWF
18:45 | Closing Remarks

Thursday, 24th January 2019
Time | Title | Presenter
09:30 - 09:45 | Introduction to the discussion sessions
09:45 - 11:15 | Discussion: Software resilience
11:15 - 11:45 | Coffee Break
11:45 - 13:15 | Discussion: Algorithms and solvers resilience
13:15 | Closing Remarks and Lunch


Organising committee

Luca Bonaventura, Tommaso Benacchio

Location