Performance of MPI vs MPI/OpenMP hybrid code

vivanic
Posts: 4
Joined: 03 Mar 2016, 14:02
First name(s): Vedran
Last name(s): Ivanic
Affiliation: University of Split
Country: Croatia

Performance of MPI vs MPI/OpenMP hybrid code

Post by vivanic » 07 Mar 2016, 11:42

Dear all,

I have been testing the performance of pure MPI versus the MPI/OpenMP hybrid code. The test is an SCF run with DFT for an organic dye; details can be seen in the attached output files.

MPI (20 processes)
>>> CPU Time used in LSDALTON is 4 hours 38 minutes 1 second
>>> wall Time used in LSDALTON is 4 hours 56 minutes 2 seconds
MPI/OMP (1 process / 20 threads)
>>> CPU Time used in LSDALTON is 101 hours 19 minutes 28 seconds
>>> wall Time used in LSDALTON is 7 hours 36 minutes 36 seconds
Both calculations were run on a node with 20 physical cores. The math library is ATLAS; its installation/optimization is out of my control, so I used the math libraries "as is". The MPI version was built with ./setup --mpi and the MPI/OMP version with ./setup --mpi --omp.
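
For reference, a small hybrid probe like the one below (a generic sketch, not part of LSDalton; the mpicc/mpirun lines in the comment are only typical examples and depend on the local MPI installation) can be used to confirm that each build really runs as 20 ranks x 1 thread or 1 rank x 20 threads on the node.

/* Minimal hybrid MPI/OpenMP probe -- a generic sketch, not part of LSDalton.
 * It reports the number of MPI ranks and OpenMP threads per rank, so the
 * "20 ranks x 1 thread" and "1 rank x 20 threads" setups can be verified.
 * Typical build and run lines (exact flags depend on the MPI installation):
 *   mpicc -fopenmp probe.c -o probe
 *   OMP_NUM_THREADS=1  mpirun -np 20 ./probe
 *   OMP_NUM_THREADS=20 mpirun -np 1  ./probe
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (rank == 0)
        printf("MPI ranks: %d, OpenMP threads per rank: %d\n",
               nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}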

As "LSDALTON is designed as a MPI/OpenMP hybrid code" I am puzzled with much larger time for MPI/OMP hybrid than for MPI. Am I doing something wrong or results are fine?

Regards,
Vedran
Attachments
mpi_omp.out
(159.27 KiB) Downloaded 249 times
mpi.out
(158.8 KiB) Downloaded 238 times

tkjaer
Posts: 300
Joined: 27 Aug 2013, 20:35
First name(s): Thomas
Last name(s): Kjaergaard

Re: Performance of MPI vs MPI/OpenMP hybrid code

Post by tkjaer » 07 Mar 2016, 12:42

Hi

While LSDalton is designed as an MPI/OpenMP code, in principle both of your runs should take approximately the same amount of time (as they use the same resources).

Personally I am a little surprised.

As far as I can see:

1. The Lapack calls take the same amount of time in both runs, as expected.
2. The formation of the Coulomb matrix and the exchange-correlation matrix seems to take the same time.
3. The difference comes from the construction of the exchange matrix. Here the pure MPI parallelization seems to perform better than the OpenMP parallelization. I am not sure why, and I will put this on my to-do list to investigate further.

I should mention that the integral code that constructs the exchange matrix was written when the usual number of cores was 2-8, so the code has not been profiled with 20 cores. I suspect the problem arises from an inefficiency in the exchange driver when many OpenMP threads are used.
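
One common pattern that produces this kind of behaviour, purely as an illustration and not a claim about the actual exchange code, is thread-level load imbalance: if the work items (for example batches of shell pairs) have very uneven cost, a static OpenMP schedule can leave most of the 20 threads waiting while a few finish the expensive items. The toy C example below, with made-up work sizes, shows the effect and how a dynamic schedule mitigates it.

/* Toy illustration of OpenMP load imbalance -- NOT LSDalton's exchange driver.
 * Work items have strongly increasing cost; with schedule(static) the thread
 * that owns the last block of iterations gets far more than its fair share of
 * work and the other threads wait at the implicit barrier, while
 * schedule(dynamic) hands out items on demand and balances the load.
 * Typical build line: cc -O2 -fopenmp imbalance.c -o imbalance
 */
#include <omp.h>
#include <stdio.h>

/* Dummy kernel whose cost grows linearly with the item index. */
static double work(int i)
{
    double s = 0.0;
    for (long k = 0; k < (long)(i + 1) * 100000L; ++k)
        s += 1.0 / (double)(k + 1);
    return s;
}

int main(void)
{
    const int nitems = 200;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < nitems; ++i)
        sum += work(i);
    printf("static  schedule: %6.2f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (int i = 0; i < nitems; ++i)
        sum += work(i);
    printf("dynamic schedule: %6.2f s\n", omp_get_wtime() - t0);

    printf("checksum: %.6f\n", sum);  /* keep the compiler from discarding the loops */
    return 0;
}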

Best Regards
Thomas Kjærgaard
