WSL/SLF GitLab Repository


Clean OpenMPI exit

Moved from SVN. Adrien Michel, Jan 16, 2019

When an exception is raised in only one MPI worker, the other workers are not killed. As a result, the application continues to run and hangs (the remaining workers wait indefinitely on the killed worker). This consumes CPU hours, which is highly problematic on HPC facilities, where we pay for core hours and/or have only a limited allocation. It would be important to find a way to cleanly kill everything. This means catching all exceptions from snowpack or meteoio; to avoid slowing things down, the catch blocks should be really large.
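The "really large catch block" idea could be sketched as below: the whole worker body runs inside a single guard, and any escaping exception is turned into a job-wide abort. The names `run_guarded` and `AbortHook` are hypothetical and not part of Alpine3D, snowpack or meteoio. In an MPI build the hook would typically wrap `MPI_Abort(MPI_COMM_WORLD, code)`; here it is a plain callback so that the sketch compiles without MPI.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical hook type: in an MPI build this would wrap
// MPI_Abort(MPI_COMM_WORLD, code) so that every rank is terminated.
using AbortHook = std::function<void(int)>;

// Hypothetical helper: runs the whole worker body inside one large
// try/catch, so the guard is set up once per run rather than per call.
// Returns 0 on success, otherwise 1 after invoking the abort hook.
int run_guarded(const std::function<void()>& body, const AbortHook& abort_all)
{
    try {
        body();
        return 0;
    } catch (const std::exception& e) {
        std::cerr << "[E] worker failed: " << e.what() << std::endl;
    } catch (...) {
        std::cerr << "[E] worker failed with an unknown exception" << std::endl;
    }
    abort_all(1); // a real MPI_Abort-based hook would not return from here
    return 1;
}
```

A worker would then call something like `run_guarded(simulate, abort_hook)` once from its entry point, where `simulate` and `abort_hook` are again placeholder names.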

Comment 1 by Adrien Michel, Nov 26, 2020

Actually this is caused by AlpineMain, which catches all exceptions and then calls "exit(1)". But exit() is apparently not seen by the other MPI workers, so if only the master dies (e.g. due to an input reading error), the other workers continue to wait. The solution is to change all the "exit(1)" calls to "throw" in AlpineMain. Then we exit with an uncaught exception (which is printed) and all the MPI workers are cleanly killed, which saves a lot of node hours on clusters. I'll commit the fix soon.
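A minimal sketch of the change described in this comment, assuming a top-level handler shaped roughly like the one in AlpineMain. The functions `read_inputs` and `run_model` are hypothetical stand-ins, not the actual Alpine3D code. The only point illustrated is replacing `exit(1)` with a rethrow, so that the exception escapes and the process ends abnormally through `std::terminate()` instead of exiting quietly.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the model setup; throws on bad input.
void read_inputs(const std::string& path)
{
    if (path.empty()) throw std::invalid_argument("no input file given");
}

// Hypothetical stand-in for the top-level handler.
// Before: the catch block ended with exit(1), which ends only this process.
// After: it rethrows; if nothing further up catches the exception, it
// leaves main() uncaught and the runtime calls std::terminate().
void run_model(const std::string& path)
{
    try {
        read_inputs(path);
    } catch (const std::exception& e) {
        std::cerr << "[E] " << e.what() << std::endl;
        throw; // was: exit(1);
    }
}
```

Calling `run_model` directly from `main()` without another try/catch around it is what lets the rethrown exception go uncaught.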

Status: Started

Comment 2 by Adrien Michel, Jan 6, 2021

The commit is now in the git version and will soon be moved to the SLF GitLab. However, with this implementation cleanDestroyAll() is not called on all the MPI instances. Some more work is required for a clean implementation.
