Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
I have installed Dalton 2015 on a Linux Ubuntu 12.04 system with GCC 4.6, OpenMPI 1.8.1, and MKL 11.2.
Configured with
./setup --fc=mpif90 --cc=mpicc --cxx=mpicxx --mpi -D BLAS_TYPE=MKL -D LAPACK_TYPE=MKL build_gcc
All tests pass when I run on one node, but I have now tried running on more than one node. This should work, since I configured for MPI, correct?
However, the run crashes immediately and I have no idea why.
I have attached the output.
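For context, a minimal sketch of the build steps implied by that configure line, assuming the usual setup/CMake workflow where setup generates the named build directory (the parallel make width is an arbitrary choice):

# configure as above, then build inside the generated directory
./setup --fc=mpif90 --cc=mpicc --cxx=mpicxx --mpi -D BLAS_TYPE=MKL -D LAPACK_TYPE=MKL build_gcc
cd build_gcc
make -j4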
- Attachments
-
- slurm-2807514.out
- (1.99 KiB) Downloaded 505 times
-
- Posts: 82
- Joined: 27 Aug 2013, 22:05
- First name(s): Janus
- Middle name(s): Juul
- Last name(s): Eriksen
- Affiliation: Institut für Physikalische Chemie, Johannes Gutenberg-Universität Mainz
- Country: Germany
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
What is your value of DALTON_NUM_MPI_PROCS? To run ctest with MPI, it should be > 1 (export DALTON_NUM_MPI_PROCS=N).
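A minimal sketch of what that looks like in practice, assuming the tests are run from the build directory and using 8 ranks purely as an example:

cd build_gcc
export DALTON_NUM_MPI_PROCS=8   # number of MPI processes the test runner starts
ctest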
-
- Posts: 1210
- Joined: 26 Aug 2013, 13:22
- First name(s): Radovan
- Last name(s): Bast
- Affiliation: none
- Country: Germany
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
i think it is 8. but normally if it works on one node and fails on more than one, then the question is whether the machine
accesses a global scratch or whether it uses local scratch disks. if the latter, there might be a problem with the distribution
of input files; in other words, one of the nodes then cannot read the files.
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
Yes, I have the value for DALTON_NUM_MPI_PROCS set to 8.
Each node has its own /scratch space, so it could definitely be the cause of the error!
I am currently just trying to run one of the test cases, so I created a directory containing the files for it and started it with
/home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/dalton -N 8 -t /home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/shared-test dft_b3lyp_nosym.dal H2O_intgrl.mol
since pfs is a shared filespace, accessible to all nodes during batch runs. However, am I correct that the "-t" flag just creates a subdirectory in the /scratch on the local node? In that case I probably need to set
DALTON_TMPDIR
instead? I will try that.
-
- Posts: 1210
- Joined: 26 Aug 2013, 13:22
- First name(s): Radovan
- Last name(s): Bast
- Affiliation: none
- Country: Germany
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
if you have local scratch space then i recommend using it in production.
i admit that the tests are not trivial to run across nodes with local scratch.
what i would do is to run the test set on one node. if this looks ok, there is no need
to test again across nodes. then i would calibrate a proper calculation (meaning
not the test scripts) for several nodes. in this case use dalton -noappend
and make sure that the input files get broadcast to all nodes before the calculation starts.
typically the machine has mechanisms for that. for instance on NSC there is sbcast:
https://www.nsc.liu.se/systems/triolith ... alton.html
something like that ...
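Roughly, the pattern described above might look like this in a Slurm job script; the node and task counts and all paths are placeholders, and the NSC page linked above shows a site-specific version:

#!/bin/bash
#SBATCH --nodes=2        # placeholder node count
#SBATCH --ntasks=8       # placeholder MPI rank count

# broadcast the input files to every allocated node's local scratch
# before the calculation starts (placeholder destination path)
sbcast dft_b3lyp_nosym.dal /scratch/$SLURM_JOB_ID/dft_b3lyp_nosym.dal
sbcast H2O_intgrl.mol /scratch/$SLURM_JOB_ID/H2O_intgrl.mol

# point Dalton's scratch at the same node-local directory; -noappend keeps
# the scratch directory name predictable so every node uses the same path
export DALTON_TMPDIR=/scratch/$SLURM_JOB_ID
/path/to/build_gcc/dalton -N 8 -noappend dft_b3lyp_nosym.dal H2O_intgrl.mol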
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
Unfortunately, the way sbcast is set up on our system, the files got copied to the root of /scratch on each node, instead of to the directory that is created for the job.
However, it looks like it runs when I do something like this:
../dalton -N 8 -noappend -t `pwd`/tmp dft_b3lyp_nosym.dal H2O_intgrl.mol
It throws some warnings, but I believe they are benign, as the calculation runs and produces output as it should (job out and dalton out attached).
Thanks for figuring out what was going wrong
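For completeness, a sketch of that working pattern inside a Slurm batch script; the #SBATCH values and the install path are placeholders:

#!/bin/bash
#SBATCH --nodes=2        # placeholder node count
#SBATCH --ntasks=8       # placeholder MPI rank count

# run from the submission directory on the shared pfs space so the
# scratch directory created by -t is visible to every node
cd "$SLURM_SUBMIT_DIR"
/path/to/build_gcc_default_noscalapack_mkl_32_akka/dalton \
    -N 8 -noappend -t "$(pwd)/tmp" dft_b3lyp_nosym.dal H2O_intgrl.mol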

- Attachments
-
- dft_b3lyp_nosym_H2O_intgrl.out
- (35.92 KiB) Downloaded 513 times
-
- slurm-2808003.out
- (6 KiB) Downloaded 531 times
-
- Posts: 116
- Joined: 27 Aug 2013, 16:38
- First name(s): Joanna
- Last name(s): Kauczor
- Affiliation: Helsinki
- Country: Finland
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
Do you specify and export DALTON_TMPDIR?
This should help.
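A minimal sketch of that suggestion; the scratch path is a placeholder, and whether it should point at shared or node-local space is exactly what this thread is about:

# point Dalton's scratch directory via the environment instead of -t
export DALTON_TMPDIR=/path/on/shared/pfs/dalton_scratch   # placeholder path
mkdir -p "$DALTON_TMPDIR"
/path/to/build_gcc/dalton -N 8 -noappend dft_b3lyp_nosym.dal H2O_intgrl.mol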
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
After some tests, I found that making these changes in the dalton and lsdalton files worked:
$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE
$MPIEXEC -np $optn $MPI >> srun $MPI
and then run with
dalton -N <procs> -noappend -t <some tempdir> ....
in the job script.
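Spelled out, the edit replaces the MPI launch lines inside the generated dalton and lsdalton run scripts roughly as follows (a sketch of the intent, not a patch against a specific version):

# before, in the run scripts:
#   $MPIEXEC -np $optn $DALTON_EXECUTABLE
#   $MPIEXEC -np $optn $MPI
# after: let Slurm's srun start the ranks, taking the process count
# from the job allocation rather than from -np $optn
srun $DALTON_EXECUTABLE
srun $MPI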
-
- Posts: 600
- Joined: 15 Oct 2013, 05:37
- First name(s): Peter
- Middle name(s): Robert
- Last name(s): Taylor
- Affiliation: Tianjin University
- Country: China
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
I take it your use of the symbol ">>" in your posting does not imply the UNIX/Linux convention of "append to file"? That is, you are asserting that what comes before ">>" in your posting should be replaced by what comes after?
Best regards
Pete
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
No, it does not mean "append to". It was supposed to be "->", meaning "change to", so yes: what comes before should be replaced by what comes after. I don't know why I wrote ">>" instead.
-
- Posts: 1210
- Joined: 26 Aug 2013, 13:22
- First name(s): Radovan
- Last name(s): Bast
- Affiliation: none
- Country: Germany
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
bbrydsoe wrote:
After some tests, I found that making these changes in the dalton and lsdalton files worked:
$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE
$MPIEXEC -np $optn $MPI >> srun $MPI
and then run with
dalton -N <procs> -noappend -t <some tempdir> ....
in the job script.
thanks! and setting
export DALTON_LAUNCHER="srun"
should have the same effect.
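In job-script form the suggestion amounts to something like this; the tempdir is a placeholder, and how -N interacts with a custom launcher is not spelled out in this thread:

export DALTON_LAUNCHER="srun"   # intended to replace the mpirun launch line in the run script
dalton -noappend -t /path/on/shared/pfs/tmp dft_b3lyp_nosym.dal H2O_intgrl.mol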
-
- Posts: 16
- Joined: 17 Apr 2015, 13:57
- First name(s): Birgitte
- Last name(s): Brydso
- Affiliation: HPC2N, Umeå University
- Country: Sweden
Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node
No, setting DALTON_LAUNCHER doesn't work; the tests fail in all but a few cases. If I instead do as I described, all tests pass.
It looks like it ought to be the same whether you set DALTON_LAUNCHER or make the changes I did, but for some reason it is not. Maybe it is due to some weirdness with our batch system.
