Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Problems with Dalton installation? Find answers or ask for help here
bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 28 Apr 2015, 16:44

I have installed Dalton 2015 on an Ubuntu 12.04 Linux system with GCC 4.6, OpenMPI 1.8.1, and MKL 11.2.

Configured with

./setup --fc=mpif90 --cc=mpicc --cxx=mpicxx --mpi -D BLAS_TYPE=MKL -D LAPACK_TYPE=MKL build_gcc

All tests pass when I run on one node, but now I have tested running on more than one node. That should work, since I configured with MPI, correct?

However, the run crashes immediately. I have no idea why.

I have attached the output.
Attachments
slurm-2807514.out (1.99 KiB)

janus
Posts: 82
Joined: 27 Aug 2013, 22:05
First name(s): Janus
Middle name(s): Juul
Last name(s): Eriksen
Affiliation: Institut für Physikalische Chemie, Johannes Gutenberg-Universität Mainz
Country: Germany

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by janus » 29 Apr 2015, 08:13

What is your value of DALTON_NUM_MPI_PROCS? To use ctest with MPI, it should be set to a value > 1 (export DALTON_NUM_MPI_PROCS=N).
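
For example, a minimal sketch (the process count 8 is just an example):

# run the test suite with 8 MPI processes per test
export DALTON_NUM_MPI_PROCS=8
ctest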

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bast » 29 Apr 2015, 08:19

i think it is 8. but normally, if it works on one node and fails on more than one, the question is whether the machine
accesses a global scratch or uses local scratch disks. if the latter, there might be a problem with the distribution
of input files, IOW one of the nodes cannot read the files.

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 29 Apr 2015, 11:47

Yes, I have the value for DALTON_NUM_MPI_PROCS set to 8.

Each node has its own /scratch space, so it could definitely be the cause of the error!

I am currently just trying to run one of the test cases, so I created a directory containing its files and started it with

/home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/dalton -N 8 -t /home/b/bbrydsoe/pfs/Dalton2015/DALTON-Source/build_gcc_default_noscalapack_mkl_32_akka/shared-test dft_b3lyp_nosym.dal H2O_intgrl.mol

since pfs is a shared filespace, accessible to all nodes during batch runs. However, am I correct that the "-t" flag just creates a subdirectory in /scratch on the local node? In that case I probably need to set

DALTON_TMPDIR

instead? I will try that.
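
Something like this in the job script, I mean (a sketch; the node-local path is an assumption about how our batch system names the job's scratch directory):

# point Dalton's scratch at the node-local directory created for the job
export DALTON_TMPDIR=/scratch/$SLURM_JOB_ID
../dalton -N 8 dft_b3lyp_nosym.dal H2O_intgrl.mol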

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bast » 29 Apr 2015, 11:53

if you have local scratch space then i recommend using it in production.
i admit that the tests are not trivial to run across nodes with local scratch.
what i would do is run the test set on one node. if that looks ok, there is no need
to test again across nodes. then i would calibrate a proper calculation (meaning
not the test scripts) on several nodes. in that case use dalton -noappend
and make sure that the input files get broadcast to all nodes before the calculation starts.
typically the machine has mechanisms for that. for instance on NSC there is sbcast:
https://www.nsc.liu.se/systems/triolith ... alton.html
something like that ...
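
e.g. a job script along these lines (a sketch only; the scratch path, node counts, and file names are assumptions, not a tested recipe):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=16

# node-local scratch created for the job (the exact path is cluster-specific)
export DALTON_TMPDIR=/scratch/$SLURM_JOB_ID

# broadcast the input files to the local scratch on every allocated node,
# so that all nodes can read them before the calculation starts
sbcast dft_b3lyp_nosym.dal $DALTON_TMPDIR/dft_b3lyp_nosym.dal
sbcast H2O_intgrl.mol $DALTON_TMPDIR/H2O_intgrl.mol

dalton -N 16 -noappend dft_b3lyp_nosym.dal H2O_intgrl.mol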

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 29 Apr 2015, 13:12

Unfortunately, the way sbcast is set up on our system, the files got copied to the root of /scratch on each node, instead of to the directory created for the job.

However, it looks like it runs when I do something like this:

../dalton -N 8 -noappend -t `pwd`/tmp dft_b3lyp_nosym.dal H2O_intgrl.mol

It throws some warnings, but I believe they are benign, as the calculation runs and produces output as it should (job output and Dalton output attached).

Thanks for figuring out what was going wrong :)
Attachments
dft_b3lyp_nosym_H2O_intgrl.out (35.92 KiB)
slurm-2808003.out (6 KiB)

Joanna
Posts: 116
Joined: 27 Aug 2013, 16:38
First name(s): Joanna
Last name(s): Kauczor
Affiliation: Helsinki
Country: Finland

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by Joanna » 29 Apr 2015, 13:29

Do you specify and export DALTON_TMPDIR?
This should help.

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 04 May 2015, 14:32

After some tests, I found that making the following changes in the dalton and lsdalton run scripts worked:

$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE

$MPIEXEC -np $optn $MPI >> srun $MPI


and then run with

dalton -N <procs> -noappend -t <some tempdir> ....

in the job script.
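
Concretely, the change looks roughly like this (a sketch; the surrounding context in the scripts may differ between versions):

# original launch line in the dalton (and lsdalton) run script:
#   $MPIEXEC -np $optn $DALTON_EXECUTABLE
# changed so that Slurm places the processes across the nodes:
srun $DALTON_EXECUTABLE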

taylor
Posts: 576
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by taylor » 04 May 2015, 15:55

I take it your use of the symbol ">>" in your posting does not imply the UNIX/Linux convention of "append to file"? That is, you are asserting that what comes before ">>" in your posting should be replaced by what comes after?

Best regards
Pete

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 04 May 2015, 15:59

No, it does not mean "append to". It was supposed to be "->", meaning "change to": so yes, what comes before should be replaced by what comes after. I don't know why I wrote ">>" instead.

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bast » 07 May 2015, 23:03

bbrydsoe wrote: After some tests, I found that making the following changes in the dalton and lsdalton run scripts worked:

$MPIEXEC -np $optn $DALTON_EXECUTABLE >> srun $DALTON_EXECUTABLE

$MPIEXEC -np $optn $MPI >> srun $MPI


and then run with

dalton -N <procs> -noappend -t <some tempdir> ....

in the job script.
thanks! and setting

Code:

export DALTON_LAUNCHER="srun"
did not work?
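
i.e. a sketch of what i mean (assuming DALTON_LAUNCHER replaces the whole "$MPIEXEC -np $optn" prefix, so no script edits are needed):

# in the job script, instead of editing the run scripts
export DALTON_LAUNCHER="srun"
dalton -noappend -t /some/tmpdir dft_b3lyp_nosym.dal H2O_intgrl.mol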

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeå University
Country: Sweden

Re: Dalton 2015 (OpenMPI, MKL, GCC) fails on more than one node

Post by bbrydsoe » 08 May 2015, 11:33

No, setting DALTON_LAUNCHER does not work: the tests fail in all but a few cases. If I instead make the changes I described, all tests pass.

It looks like it ought to be the same whether you set DALTON_LAUNCHER or make the changes I did, but for some reason it is not. It may be due to some weirdness with our batch system, I suppose :)
