Dalton with OpenMPI, GCC, ACML - testing fails.

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 20 Apr 2015, 11:33

I have compiled Dalton 2015 with OpenMPI 1.6.5, GCC 4.6.3, and ACML 5.3.1. OS is Linux Ubuntu 12.04.

I configure with

Code:

export MATH_ROOT='/lap/acml/5.3.1/gfortran64_fma4' 

./setup --fc=mpif90 --cc=mpicc --cxx=mpicxx --mpi -D BLAS_TYPE=ACML -D LAPACK_TYPE=ACML build_gcc_noscalapack_acml_32
It compiles without errors. I then run the tests (as a batch job, since I am running on a cluster):

Code:

#!/bin/bash
#SBATCH -J DALTON-test
#SBATCH -n 4
#SBATCH --time=20:30:00

module load openmpi/gcc/1.6.5
module load acml/gcc/5.3.1
module load libint/gcc 

#srun ./TEST -keep -reftest -benchmark 

export DALTON_TMPDIR=/scratch
export DALTON_LAUNCHER="mpirun -np 4"
export LSDALTON_LAUNCHER="mpirun -np 4"
export DALTON_NUM_MPI_PROCS=4

srun make test
I get many errors: 352 of them, actually. I have attached a list.

What could be the reason? Is there a combination of compilers and libraries that is recommended?
Attachments
LastTestsFailed.log
(8.48 KiB)

taylor
Posts: 600
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by taylor » 20 Apr 2015, 13:36

Although you supplied a list of the failed tests, you do not appear to have posted any actual error messages explaining why the tests failed. Do the tests apparently run to completion but then give an error because the final result(s) are incorrect? Or do the test jobs die at some point (or perhaps do not even start)?

The fact that some tests run correctly suggests that there cannot be much wrong with your build. I do not use the software stack you do, but as far as I know it would be expected to work, unless one of the other developers has a different opinion.

I think it would be helpful to post at least some of the failed tests with more information about what has actually taken place.

Best regards
Pete

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bast » 20 Apr 2015, 13:44

dear Birgitte,

we do not have an official recommended set of compilers and libraries.

however, i will point you to
this site that can be useful: https://testboard.org/cdash/index.php?project=Dalton
here we aggregate nightly test results from a number of machines and compilers/libraries.
scroll to the "Release-2015" section which tests the Dalton2015 release branch.
there you can see what combinations the developers typically use and test and how well
the test set performs on these. we typically try to make the test set pass on most/all of "our"
platforms but even that is not trivial as you can see from some of the red (failed) tests.

good luck,
radovan

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 20 Apr 2015, 14:05

taylor wrote:Although you supplied a list of the failed tests, you do not appear to have posted any actual error messages explaining why the tests failed. Do the tests apparently run to completion but then give an error because the final result(s) are incorrect? Or do the test jobs die at some point (or perhaps do not even start)?

The fact that some tests run correctly suggests that there cannot be much wrong with your build. I do not use the software stack you do, but as far as I know it would be expected to work, unless one of the other developers has a different opinion.

I think it would be helpful to post at least some of the failed tests with more information about what has actually taken place.

Best regards
Pete
I am unsure where the logs for the failed tests are. All I can find in test/ are the tests that succeeded. There are also files of the type "test_walk_vibave2_failed_2015-04-17T15_41-21885", but they appear to be binary files.

There is a little bit of info in the main "LastTest.log" (too big to attach), and there are a number of directories named "testruns/2015-04-17T21_37-testjob-pid-XXXXX". Each of those contains some files, including TESTLOG; that is the only one I can find that seems to have useful information. I have attached one of these files (renamed to TESTLOG.txt, as I wasn't allowed to attach a file without an extension, apparently).

Thanks for your help!
Attachments
TESTLOG.txt
(3.42 KiB)

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 20 Apr 2015, 14:09

bast wrote:dear Birgitte,

we do not have an official recommended set of compilers and libraries.

however, i will point you to
this site that can be useful: https://testboard.org/cdash/index.php?project=Dalton
here we aggregate nightly test results from a number of machines and compilers/libraries.
scroll to the "Release-2015" section which tests the Dalton2015 release branch.
there you can see what combinations the developers typically use and test and how well
the test set performs on these. we typically try to make the test set pass on most/all of "our"
platforms but even that is not trivial as you can see from some of the red (failed) tests.

good luck,
radovan
Thank you! I will take a look at it! I see that the compiler combo I used isn't there, but GCC 4.9.1 is, and I think we have that installed as a test, so I can try with that.

I will probably try to compile with no optimization at all and see if that helps!

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bast » 20 Apr 2015, 14:13

bbrydsoe wrote:I am unsure where the logs for the failed tests are. All I can find in test/ are the tests that succeeded. There are also files of the type "test_walk_vibave2_failed_2015-04-17T15_41-21885", but they appear to be binary files.

There is a little bit of info in the main "LastTest.log" (too big to attach), and there are a number of directories named "testruns/2015-04-17T21_37-testjob-pid-XXXXX". Each of those contains some files, including TESTLOG; that is the only one I can find that seems to have useful information. I have attached one of these files (renamed to TESTLOG.txt, as I wasn't allowed to attach a file without an extension, apparently).

Thanks for your help!
the LastTest.log is probably the most interesting file. in there search for "failed".

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 20 Apr 2015, 14:38

bast wrote:
bbrydsoe wrote:I am unsure where the logs for the failed tests are. All I can find in test/ are the tests that succeeded. There are also files of the type "test_walk_vibave2_failed_2015-04-17T15_41-21885", but they appear to be binary files.

There is a little bit of info in the main "LastTest.log" (too big to attach), and there are a number of directories named "testruns/2015-04-17T21_37-testjob-pid-XXXXX". Each of those contains some files, including TESTLOG; that is the only one I can find that seems to have useful information. I have attached one of these files (renamed to TESTLOG.txt, as I wasn't allowed to attach a file without an extension, apparently).

Thanks for your help!
the LastTest.log is probably the most interesting file. in there search for "failed".
Okay, I randomly picked out four tests that failed.
Attachments
gen1int_fluorobenzene_cart_test.log
(5.12 KiB)
energy_corehole_test.log
(1.4 KiB)
geoopt_exci2_log.txt
(6.44 KiB)
rsp_mnf_log.txt
(4.69 KiB)

taylor
Posts: 600
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by taylor » 20 Apr 2015, 14:44

One question: you are running with the scratch directory DALTON_TMPDIR set to /scratch. I can readily imagine that /scratch is your scratch filesystem, but on most platforms it is more common to have directories like /scratch/taylor underneath which particular runs set up their scratch directories (e.g., something like /scratch/taylor/prop_vibave or similar). The failed test you posted clearly is unable to open a file in the designated scratch directory, so I was just wondering whether this might be a permissions issue with your scratch directory?

I should also add that unless you are likely to be the only person using your cluster, or at least the only person doing quantum chemistry, it is probably unwise to use /scratch as the TMPDIR. This is because sooner or later you are going to run a calculation named "water_VDZ", or similar, and the odds are someone else will want to use the same name, or even accidentally/unintentionally delete the directory if they have enough permissions!
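As an illustration of this suggestion (the paths here are hypothetical; adapt them to your cluster's layout), a batch script could create a per-user, per-job scratch directory before launching Dalton:

```shell
# Hypothetical per-user, per-job scratch directory. SLURM sets
# $SLURM_JOB_ID inside a batch job, which keeps concurrent runs
# and different users from colliding under /scratch.
export DALTON_TMPDIR=/scratch/$USER/dalton-$SLURM_JOB_ID
mkdir -p "$DALTON_TMPDIR"
```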

Best regards
Pete

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 20 Apr 2015, 15:56

taylor wrote:One question: you are running with the scratch directory DALTON_TMPDIR set to /scratch. I can readily imagine that /scratch is your scratch filesystem, but on most platforms it is more common to have directories like /scratch/taylor underneath which particular runs set up their scratch directories (e.g., something like /scratch/taylor/prop_vibave or similar). The failed test you posted clearly is unable to open a file in the designated scratch directory, so I was just wondering whether this might be a permissions issue with your scratch directory?

I should also add that unless you are likely to be the only person using your cluster, or at least the only person doing quantum chemistry, it is probably unwise to use /scratch as the TMPDIR. This is because sooner or later you are going to run a calculation named "water_VDZ", or similar, and the odds are someone else will want to use the same name, or even accidentally/unintentionally delete the directory if they have enough permissions!

Best regards
Pete
/scratch should have the right permissions for the job to write there. I have just restarted the job and logged into the node, and I can see the files being created in /scratch, so that should not be a problem. Unless they become so big that they use all the /scratch space on the compute node, but I doubt that as there is currently about 804G free.

Neither I nor other users will run into problems with more than one person using the same directory. When jobs are allocated nodes to run on, a temporary directory in /scratch on each node is created per user, and everything runs under that. If anything, I probably do not need to set this variable at all, as I can see that it ended up creating the following directory on the node:

/scratch/slurm.3066265.0/scratch/DALTON_scratch_bbrydsoe

where the second /scratch is (I think) due to the variable I have set.

But I agree, it does look like some sort of permission problem. I really have no idea why at the moment, but I will try changing it to write to a different directory and see what happens.

Thanks!

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 22 Apr 2015, 13:31

I tried compiling without any optimizations. No change. After some more tests I found that the problem seems to have something to do with running certain tests on more than one core. The tests only fail when I do

mpirun -np <cores> make test

or

srun make test (with more than one core allocated)

If I just run "make test", everything passes. Same for ctest. If I run it via mpirun or srun, roughly half the tests fail.

Changing the batch job I used for testing to:

Code:

#!/bin/bash
#SBATCH -J DALTON-test
#SBATCH -n 8
#SBATCH --time=20:30:00

module load openmpi/gcc/1.6.5
module load acml/gcc/5.3.1
module load libint/gcc 

export DALTON_LAUNCHER="mpirun -np 8"
export LSDALTON_LAUNCHER="mpirun -np 8"
export DALTON_NUM_MPI_PROCS=8

make test
means that only one test fails. This one:

energy_localize_selected

Errors from LastTest.log attached.
Attachments
energy_localize_selected.log
(4.28 KiB)

taylor
Posts: 600
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by taylor » 22 Apr 2015, 13:47

I can't comment much on testing on more than one core, although I imagine there are other people who may have suggestions. But at VLSCI (the computing centre I used to run in Melbourne) we had a great many user problems with the srun command in SLURM, and we strongly discouraged its use. I am not sure this is directly relevant, because I see you have problems using mpirun as well.

All that said, it seems that essentially all of the tests run correctly at least on a single core. At this point what may make the most sense is to create some sort of parallel job yourself (use the .PARALLEL option in the .dal file) for methods that Dalton (not LSDalton at this point) implements in parallel. This includes integral-direct SCF and DFT, as well as a considerable variety of response properties. If this runs in serial but not in parallel, I would be inclined to suspect something about your OpenMPI or SLURM setup. (By "your" I mean this could be your computer centre, not necessarily you yourself.) Again, at VLSCI we had some difficulties with very recent OpenMPI releases and in fact discouraged their use until very recently.
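As a sketch of this suggestion (the section placement of the keywords is my assumption and should be checked against the Dalton manual), a minimal integral-direct SCF input using the .PARALLEL option might look like:

```
**DALTON INPUT
.RUN WAVE FUNCTIONS
.DIRECT
.PARALLEL
**WAVE FUNCTIONS
.HF
**END OF DALTON INPUT
```

A matching .mol file with the molecule and basis set is needed alongside this input.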

I cannot help with any of the gcc-based aspects. Normally I only use the Intel compilers (but again avoid very recent releases...).

Best regards
Pete

Joanna
Posts: 116
Joined: 27 Aug 2013, 16:38
First name(s): Joanna
Last name(s): Kauczor
Affiliation: Helsinki
Country: Finland

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by Joanna » 22 Apr 2015, 13:56

You can try running dalton with the -noappend flag. Maybe you have some problems with shared memory?
Are you able to log in to a node and actually see the number of processes running?

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 22 Apr 2015, 14:06

Good idea checking what happened on the node!

It looks like it is actually running on 8 cores (possibly because I set export DALTON_NUM_MPI_PROCS=8 in my submit script). I didn't think it would, since I didn't use srun or mpirun.

I will try a run with the -noappend flag set and see what happens.

I am currently in the process of compiling for Intel 15.0/impi 5.0.1/MKL 11.2.0. I will try a run with that as well later.

magnus
Posts: 524
Joined: 27 Jun 2013, 16:32
First name(s): Jógvan Magnus
Middle name(s): Haugaard
Last name(s): Olsen
Affiliation: Aarhus University
Country: Denmark

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by magnus » 22 Apr 2015, 14:07

I don't know about srun, but "make test" uses the dalton script, which in turn executes mpirun when DALTON_NUM_MPI_PROCS is larger than one. Therefore it makes sense that "mpirun make test" doesn't work. I suspect something similar happens with srun, although I don't know how it works.
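To spell out the failure mode described here (my reading of it, not verified against the dalton script itself): the launcher belongs in the environment variables the script reads, not wrapped around make:

```shell
# "mpirun -np 4 make test" starts 4 independent copies of make; each
# test then calls the dalton script, which launches its own mpirun,
# and the nested runs collide. Instead, run make serially and let the
# dalton script handle the MPI launch:
export DALTON_NUM_MPI_PROCS=4
make test
```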

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bast » 22 Apr 2015, 21:37

hi, if you set the env variable DALTON_LAUNCHER or LSDALTON_LAUNCHER
then the underlying dalton/lsdalton scripts will use this launcher and ignore
DALTON_NUM_MPI_PROCS, so you don't need to set the latter in addition.
DALTON_LAUNCHER and LSDALTON_LAUNCHER allow launching
with launchers other than mpirun.
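for example (the srun arguments here are an assumption; adapt them to your site's SLURM setup), to launch the tests through srun instead of mpirun:

```shell
# use srun as the launcher for every test job; with a launcher set,
# DALTON_NUM_MPI_PROCS should not be needed (but see the correction
# in the next post about the actual precedence between the two).
export DALTON_LAUNCHER="srun -n 8"
export LSDALTON_LAUNCHER="srun -n 8"
make test
```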

bast
Posts: 1210
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bast » 22 Apr 2015, 21:45

bast wrote:hi, if you set the env variable DALTON_LAUNCHER or LSDALTON_LAUNCHER
then the underlying dalton/lsdalton scripts will use this launcher and ignore
DALTON_NUM_MPI_PROCS, so you don't need to set the latter in addition.
DALTON_LAUNCHER and LSDALTON_LAUNCHER allow launching
with launchers other than mpirun.
Magnus just informed me that this is not the case (DALTON_NUM_MPI_PROCS overrides
[LS]DALTON_LAUNCHER). i will create an issue for this. so please ignore what i wrote above.

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 23 Apr 2015, 10:32

So if I just set DALTON_NUM_MPI_PROCS, then that should be enough?

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 23 Apr 2015, 10:36

magnus wrote:I don't know about srun, but "make test" uses the dalton script, which in turn executes mpirun when DALTON_NUM_MPI_PROCS is larger than one. Therefore it makes sense that "mpirun make test" doesn't work. I suspect something similar happens with srun, although I don't know how it works.
That does explain why I saw it running on 8 cores, yes. Thanks!

bbrydsoe
Posts: 16
Joined: 17 Apr 2015, 13:57
First name(s): Birgitte
Last name(s): Brydso
Affiliation: HPC2N, Umeaa University
Country: Sweden

Re: Dalton with OpenMPI, GCC, ACML - testing fails.

Post by bbrydsoe » 24 Apr 2015, 14:23

I ran some more tests.

I could not get it to compile with Intel and IMPI, since it triggered a compiler bug, but I found that using MKL instead of ACML (otherwise still compiling with GCC 4.6.3 and OpenMPI 1.8.1) works. All tests pass now :)

Thanks for your help, guys!
