Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Posted: 28 Apr 2015, 16:44
by bbrydsoe
I have installed Dalton 2015 on an Ubuntu 12.04 Linux system, with GCC 4.6, OpenMPI 1.8.1, and MKL 11.2.

Configured with

./setup --fc=mpif90 --cc=mpicc --cxx=mpicxx --mpi -D BLAS_TYPE=MKL -D LAPACK_TYPE=MKL build_gcc

All tests pass when I run on one node, but now I tested running on more than one node. This should work, since I configured for MPI, correct?

However, the run crashes immediately. I have no idea why.

I have attached the output.

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 08:13
by janus
What is your value of DALTON_NUM_MPI_PROCS? To use ctest with mpi, this should be > 1 (export DALTON_NUM_MPI_PROCS=N)

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 08:19
by bast
i think it is 8. but normally if it works on one node and fails on more than one, then the question is whether the machine
accesses a global scratch or whether it uses local scratch disks. if the latter, then there might be a problem with the distribution
of input files, in other words one of the nodes cannot read the files.
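A minimal sketch of how one might check whether a scratch directory is global or node-local. The function name check_shared_scratch is hypothetical, and the default launcher srun is an assumption that only applies on Slurm systems; substitute whatever launcher the batch system provides. One process creates a marker file, then every launched task tests whether it can see it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dalton): returns 0 if every task started by
# the launcher can see a marker file created in the given directory.
#   $1 = directory to check
#   $2 = launcher command (assumption: "srun" on Slurm; may include options)
check_shared_scratch() {
    dir=$1
    launcher=${2:-srun}
    marker="$dir/.shared_check_$$"

    # Create the marker from the process that runs this function.
    touch "$marker" || return 2

    # Each task runs "test -e"; the launcher is left unquoted on purpose so
    # that a value such as "srun -N 2" is split into command and options.
    if $launcher test -e "$marker"; then
        rc=0
    else
        rc=1
    fi

    rm -f "$marker"
    return $rc
}
```

Inside a batch job this could be called as `check_shared_scratch /scratch` and `check_shared_scratch "$(pwd)"`. A non-zero result for a directory suggests that at least one task could not see the marker file, so input files placed there by one node would not be readable by the others.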

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 11:47
by bbrydsoe
Yes, I have the value for DALTON_NUM_MPI_PROCS set to 8.

Each node has its own /scratch space, so it could definitely be the cause of the error!

I am currently just trying to run one of the test cases, so I created a directory containing the files for it and started it with

/home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/dalton -N 8 -t /home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/shared-test dft_b3lyp_nosym.dal H2O_intgrl.mol

since pfs is a shared filespace, accessible to all nodes during batch runs. However, am I correct that the "-t" flag just creates a subdirectory in the /scratch on the local node? In that case I probably need to set

DALTON_TMPDIR

instead? I will try that.

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 11:53
by bast
if you have local scratch space then i recommend to use it in production.
i admit that the tests are not trivial to run across nodes with local scratch.
what i would do is to run the test set on one node. if this looks ok, it is not needed
to test again across nodes. then i would calibrate a proper calculation (meaning
not the test scripts) for several nodes. in this case use dalton -noappend
and make sure that input files get broadcast to all nodes before the calc starts.
typically the machine has mechanisms for that. for instance on NSC there is sbcast:
https://www.nsc.liu.se/systems/triolith ... alton.html
something like that ...

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 13:12
by bbrydsoe
Unfortunately, the way sbcast is set up on our system, the files got copied to the root of /scratch on each node, instead of to the directory that is created for the job.

However, it looks like it runs when I do something like this:

../dalton -N 8 -noappend -t `pwd`/tmp dft_b3lyp_nosym.dal H2O_intgrl.mol

It throws some warnings, but I believe they are benign, as the calculation runs and produces output as it should (job out and dalton out attached).

Thanks for figuring out what was going wrong :)

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 29 Apr 2015, 13:29
by Joanna
Do you specify and export DALTON_TMPDIR?
This should help.
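A minimal sketch of what specifying and exporting the variable could look like in a job script. The location `tmp` under the current directory is only an illustrative assumption (it mirrors the `pwd`/tmp used earlier in the thread); any directory that all nodes can reach would serve.

```shell
#!/bin/sh
# Assumption: the current directory is on a filesystem shared by all nodes,
# and "tmp" is just an example name for the scratch directory.
DALTON_TMPDIR="$(pwd)/tmp"
export DALTON_TMPDIR

# Make sure the directory exists before the calculation starts.
mkdir -p "$DALTON_TMPDIR"
echo "scratch directory: $DALTON_TMPDIR"
```

Exporting matters here because a plain assignment is not passed on to child processes such as the dalton script.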

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 04 May 2015, 14:32
by bbrydsoe
After some tests, I found that making these changes in the dalton and lsdalton files worked:

$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE

$MPIEXEC -np $optn $MPI >> srun $MPI


and then run with

dalton -N <procs> -noappend -t <some tempdir> ....

in the job script.
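A small sketch of the substitution described in this post, reading each pair as "replace the left-hand side with the right-hand side". It works on a throwaway file that contains only the two fragments quoted above, not the real dalton or lsdalton scripts, whose surrounding lines may differ; the sed pattern is therefore an assumption to adapt to the actual file.

```shell
#!/bin/sh
# Throwaway stand-in for the launcher lines; these are only the fragments
# quoted in the post, not the full contents of the real scripts.
script=$(mktemp)
cat > "$script" <<'EOF'
$MPIEXEC -np $optn $DALTON_EXECUTABLE
$MPIEXEC -np $optn $MPI
EOF

# Replace the "$MPIEXEC -np $optn " prefix with "srun ". The dollar signs are
# escaped so that sed treats them as literal characters.
sed 's/\$MPIEXEC -np \$optn /srun /' "$script" > "$script.new"

# Show the result for inspection before touching any real file.
cat "$script.new"
```

Writing to a separate file and inspecting the result first keeps the original script untouched until the change has been checked.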

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 04 May 2015, 15:55
by taylor
I take it your use of the symbol ">>" in your posting does not imply the UNIX/Linux convention of "append to file"? That is, you are asserting that what comes before ">>" in your posting should be replaced by what comes after?

Best regards
Pete

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 04 May 2015, 15:59
by bbrydsoe
No, it does not mean append to. It was supposed to be -> to mean "change to", so yes replace by what comes after. I don't know why I wrote >> instead.

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 07 May 2015, 23:03
by bast
bbrydsoe wrote: After some tests, I found that making these changes in the dalton and lsdalton files worked:

$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE

$MPIEXEC -np $optn $MPI >> srun $MPI


and then run with

dalton -N <procs> -noappend -t <some tempdir> ....

in the job script.

thanks! and setting

export DALTON_LAUNCHER="srun"

did not work?

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one n

Posted: 08 May 2015, 11:33
by bbrydsoe
No, setting DALTON_LAUNCHER doesn't work. The tests fail in all but a few cases. If I instead do as I described, all tests pass.

It looks like it ought to be the same whether you set DALTON_LAUNCHER or make the changes I did, but for some reason it is not. It may be due to some weirdness with our batch system, I suppose :)