Parallel run error

Find answers or ask questions regarding Dalton calculations.
Please upload an output file showing the problem, if applicable.
(It is not necessary to upload input files; they can be found in the output file.)

avelon
Posts: 8
Joined: 13 Nov 2013, 06:55
First name(s): Alexander
Middle name(s): I
Last name(s): Petrov
Affiliation: Siberian Federal University
Country: Russian Federation

Parallel run error

Post by avelon » 13 Nov 2013, 08:14

Dear colleagues,
System information: gcc 4.7.3; OpenMPI 1.7.2; CMake 2.8.12.1; Python 2.7.5; GPFS; InfiniBand; SUSE Linux Enterprise Server (64-bit); no PBS.

Install script:
./setup --mpi --blas=none --lapack=none --explicit-libs="-L/gpfs/CLUSTERUSERS/neilparker2/INSTALL/ATLAS/lib -llapack -lf77blas -lcblas -latlas -lptcblas -lptf77blas"
cd build
make -j8 >&make.log

Installation succeeded and the tests passed (ctest -j8). The output of ldd dalton.x looks right. In .bashrc I added: /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build
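Presumably this means the build directory was added to the shell search path; a minimal sketch of such a .bashrc line, assuming that intent:

Code:

# hypothetical .bashrc entry putting the dalton launcher on PATH
export PATH=/gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build:$PATH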
The program runs fine on a single core,

export DALTON_TMPDIR=/gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON/temp/
export WRKMEM=2000000000
dalton -N 1 optimize_CCSD_T SnCys

but in parallel mode
export DALTON_TMPDIR=/gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON/temp/
dalton -N 8 -nb 15000 optimize_CCSD_T SnCys

gives errors:

*****************************************
**** OUTPUT FROM DALTON SHELL SCRIPT ****
*****************************************

DALTON release 2013.0

Invocation: /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton -N 8 -nb 15000 optimize_CCSD_T SnCys

Wed Nov 13 13:02:28 KRAT 2013

Calculation: optimize_CCSD_T_SnCys (input files: optimize_CCSD_T.dal and SnCys.mol)
PID : 30304
Input dir : /gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON
Scratch dir: /gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON/temp/DALTON_scratch_neilparker2/optimize_CCSD_T_SnCys


output from the communication group generator:
1 intra-node group has been built.


DALTON: default work memory size used. 64000000

DALTON: user specified work memory size used,
environment variable NODE_WRKMEM = "1920000000"


Work memory size (LMWORK+2): 64000002 = 488.28 megabytes; node 0
Work memory size (LMWORK+2): 1920000002 = 14.305 gigabytes; node 1

0: Directories for basis set searches:
/gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON:/gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/basis

1: Directories for basis set searches:
/gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON:/gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/basis
[node-05-01:30339] *** An error occurred in MPI_Bcast
[node-05-01:30339] *** reported by process [47194641530881,140733193388036]
[node-05-01:30339] *** on communicator MPI_COMM_WORLD
[node-05-01:30339] *** MPI_ERR_TRUNCATE: message truncated
[node-05-01:30339] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-05-01:30339] *** and potentially your MPI job)
[node-05-01:30334] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node-05-01:30334] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error in /gpfs/CLUSTERUSERS/neilparker2/INSTALL/OPENMPI/bin/mpiexec -np 8 /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton.x, exit code 1

The input and output files are in the posted attachments.
Please help!
Thanks.
Attachments
optimize_CCSD_T_SnCys.out
SnCys.mol
optimize_CCSD_T.dal

arnfinn
Posts: 231
Joined: 28 Aug 2013, 08:02
First name(s): Arnfinn
Middle name(s): Hykkerud
Last name(s): Steindal
Affiliation: UiT
Country: Norway
Location: UiT The Arctic University of Norway

Re: Parallel run error

Post by arnfinn » 13 Nov 2013, 08:47

CC is not supposed to work in parallel in Dalton2013.

You may get some speedup if your math library is OpenMP-parallelized and you run with

Code:

dalton -omp 8 optimize_CCSD_T SnCys

avelon

Re: Parallel run error

Post by avelon » 13 Nov 2013, 09:05

Changing the *.dal file to:
**DALTON INPUT
.RUN WAVE FUNCTIONS
**WAVE FUNCTIONS
.DFT
B3LYP
**END OF DALTON INPUT
gives the same error in parallel mode.

magnus
Posts: 524
Joined: 27 Jun 2013, 16:32
First name(s): Jógvan Magnus
Middle name(s): Haugaard
Last name(s): Olsen
Affiliation: Aarhus University
Country: Denmark

Re: Parallel run error

Post by magnus » 13 Nov 2013, 09:36

I don't know if it is related to the error you get, but I noticed that you used "-nb" to specify memory. That option sets the maximum memory on the "slave" nodes and is meant to be used together with "-mb", which sets the memory on the master node (and on all nodes if "-nb" is not given). Since "-mb" was not specified, Dalton used the default on the master. You can see this in your output:
DALTON: default work memory size used. 64000000

DALTON: user specified work memory size used,
environment variable NODE_WRKMEM = "1920000000"


Work memory size (LMWORK+2): 64000002 = 488.28 megabytes; node 0
Work memory size (LMWORK+2): 1920000002 = 14.305 gigabytes; node 1
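If the intent was to give every process the same work memory, a possible corrected invocation, following the option semantics described above (the 2000 MB figure is only an illustrative value, not a recommendation):

Code:

# -mb sets the master's memory and, when -nb is absent, the memory of all nodes
dalton -N 8 -mb 2000 optimize_CCSD_T SnCys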

avelon

Re: Parallel run error

Post by avelon » 13 Nov 2013, 09:52

Changing "-nb" to "-mb", and also running without the option, gives the same error.
The script runs /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton.x instead of $SCRATCHDIR/dalton_mpi_copy.x. Could this be an error in the dalton script?

arnfinn

Re: Parallel run error

Post by arnfinn » 13 Nov 2013, 10:00

I am not able to reproduce the error you are getting. Can you try to run some test cases in parallel and see if they fail? For instance:

Code:

export DALTON_NUM_MPI_PROCS=8
ctest -L short

Please be aware that with "-nb" and "-mb" set to 15000, dalton will reserve 15000 MB of memory on EACH core: with 8 cores that is 8 × 15000 MB = 120000 MB, almost 120 GB in total on your node.

avelon

Re: Parallel run error

Post by avelon » 13 Nov 2013, 10:24

All the tests failed.

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Parallel run error

Post by bast » 13 Nov 2013, 10:37

Hi,

I suggest taking a simple dal file:

Code:

**DALTON INPUT
.RUN WAVE FUNCTIONS
**WAVE FUNCTIONS
.DFT
B3LYP
**END OF DALTON INPUT
and a simple mol file:

Code:

BASIS
cc-pVDZ


C   1
       10.    1
Ne   0.0  0.0  0.0

Run this with MPI with default memory (do not specify any memory). This calculation "has to work", and it takes one second. Does it fail with your installation?
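For example, assuming the two files above are saved as b3lyp.dal and Ne.mol (hypothetical names), the run could look like:

Code:

# run the minimal B3LYP/Ne test on 2 MPI processes with default memory
dalton -N 2 b3lyp Ne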

avelon

Re: Parallel run error

Post by avelon » 13 Nov 2013, 10:56

The calculation stops at "Starting in Integral Section" and gives the same error.

bast

Re: Parallel run error

Post by bast » 13 Nov 2013, 12:13

avelon wrote: The calculation stops at "Starting in Integral Section" and gives the same error.
Have a look at the dalton script:
/gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton
Look for the MPIRUN= line at the beginning of the script.
Does it point to the right mpirun?
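A quick way to check, assuming the script defines MPIRUN near the top as described:

Code:

# show which mpirun the dalton script will call
grep -m1 'MPIRUN=' /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton
# compare with the OpenMPI installation that was used for the build
which mpirun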

bast

Re: Parallel run error

Post by bast » 15 Nov 2013, 09:55

bast wrote:
avelon wrote: The calculation stops at "Starting in Integral Section" and gives the same error.
Have a look at the dalton script: /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton. Look for the MPIRUN= line at the beginning of the script. Does it point to the right mpirun?
Were you able to solve the problem?

avelon

Re: Parallel run error

Post by avelon » 24 Nov 2013, 17:42

No.
What version of GLIBC is required?

bast

Re: Parallel run error

Post by bast » 25 Nov 2013, 10:13

avelon wrote: No. What version of GLIBC is required?
I cannot answer that question.
Is it possible that there is something wrong with the OpenMPI installation? Are you able to compile and run other programs with it? The MPI error that you see means that either sending and receiving calls are not matched (a message has a surprising length) or a collective call is not executed by all ranks. Normally we would see such errors too, but I cannot reproduce them. To rule out the MPI installation itself, you could compile and run a trivial MPI program, as in the sketch below.
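A minimal sanity check of the OpenMPI installation (the mpiexec path is taken from your output; the mpicc path next to it and the file name are assumptions):

Code:

# compile and run a trivial MPI program with the same OpenMPI build
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
/gpfs/CLUSTERUSERS/neilparker2/INSTALL/OPENMPI/bin/mpicc hello_mpi.c -o hello_mpi
/gpfs/CLUSTERUSERS/neilparker2/INSTALL/OPENMPI/bin/mpiexec -np 8 ./hello_mpi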
I would also try to run the simple test input I sent with 2 processes and default memory (no -mw or -mb flags to the dalton script).
If you suspect an error in the dalton script, rename your input files to DALTON.INP and MOLECULE.INP and launch dalton.x directly with mpirun/mpiexec, along the lines of the sketch below.
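A rough sketch, with paths taken from your earlier output (note that dalton.x may also expect the scratch directory and basis-set path to be set up the way the wrapper script does it):

Code:

# bypass the dalton wrapper and launch the binary directly
cd /gpfs/CLUSTERUSERS/neilparker2/QCHEM/DALTON
cp optimize_CCSD_T.dal DALTON.INP
cp SnCys.mol MOLECULE.INP
/gpfs/CLUSTERUSERS/neilparker2/INSTALL/OPENMPI/bin/mpiexec -np 2 \
    /gpfs/CLUSTERUSERS/neilparker2/soft/DALTON-2013.0-Source/build/dalton.x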
