Performance: MPI (OpenMPI) vs. threaded (OpenMP)

ejberquist
Posts: 12
Joined: 26 Feb 2014, 22:05
First name(s): Eric
Last name(s): Berquist
Affiliation: University of Pittsburgh
Country: United States

Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by ejberquist » 16 Mar 2015, 19:43

In general, if one has the choice between an MPI-parallel and an OpenMP-parallel implementation, what is the optimal choice, performance-wise? I assume it depends on which parts of the package are used most often. For example, I mostly perform SCF calculations followed by linear response, and will probably never touch the higher-order response implementation.

tkjaer
Posts: 300
Joined: 27 Aug 2013, 20:35
First name(s): Thomas
Last name(s): Kjaergaard

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by tkjaer » 16 Mar 2015, 19:53

LSDalton has been designed as an MPI/OpenMP hybrid code, so if you only have one node with several cores, you should use OpenMP.

Dalton is designed as an MPI code in which, as far as I know, only the LAPACK calls are OpenMP-parallel. So here you need to try what is best.

Note that OpenMPI is one implementation of MPI, whereas OpenMP is a coding standard/language, so the title of the thread is a little strange.

taylor
Posts: 532
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by taylor » 17 Mar 2015, 12:58

Let me add my thoughts to Thomas's. As he observes, first, OpenMPI is one implementation (there are others, including an optimized proprietary one from Intel as part of Intel Cluster Studio, and older legacy ones like MPICH and LAM) of the MPI standard. This is designed for distributed-memory coarse-grained parallelism, and is the only form of parallelism explicitly (see below) supported by Dalton. LSDalton was developed much more recently and has vastly more elaborate support for parallelism both at the coarse-grained MPI level, including middleware like Scalapack, and fine-grained thread level, using OpenMP (which as Thomas pointed out is not a "product", like OpenMPI, but a standard, like MPI itself). In general, if the calculations one wants to do can be accomplished with LSDalton, and one or two limitations, such as no explicit treatment of symmetry, are not an issue, then it is almost always the case that LSDalton is to be preferred: it is much more modern code and even in a strictly sequential version will almost certainly out-perform Dalton.

All that said, while Dalton itself is parallelized explicitly only in certain modules and only then with MPI, there are other ways of exploiting parallelism in hardware. A considerable amount of the Dalton functionality is dependent on manipulation of matrices, and the matrix routines such as matrix-vector or matrix-matrix multiply provided in (e.g.) the Intel compiler suite are guaranteed thread-safe and indeed threaded by default if the user, or runscripts like the Dalton one, don't specifically intervene by changing environment variables like OMP_NUM_THREADS or MKL_NUM_THREADS. In particular, coupled-cluster calculations are very heavily dependent on matrix multiplication and will benefit from this thread-level parallelism, whereas the CC routines are not coarse-grained (MPI-type) parallelized and probably never will be, given the Divide-Expand-Consolidate parallel approach gradually being introduced into the release version of LSDalton. Response calculations of any order are also heavily dependent on matrix multiplication and will also benefit from simply letting the libraries sort out the threading.
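To make the library-threading point concrete, here is a minimal sketch of what I mean, using Python/NumPy as a stand-in for any program linked against a threaded BLAS such as MKL (it is not Dalton itself, just an illustration); the thread counts must be set in the environment before the program starts:

import os
import time
import numpy as np   # assumed to be linked against a threaded BLAS (MKL or OpenBLAS)

n = 4000                               # large enough that DGEMM dominates the wall time
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b                              # dispatched to DGEMM in the underlying BLAS
wall = time.perf_counter() - t0

print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "unset"))
print("MKL_NUM_THREADS =", os.environ.get("MKL_NUM_THREADS", "unset"))
print("wall time for one %dx%d double-precision matrix multiply: %.2f s" % (n, n, wall))

Running it once with OMP_NUM_THREADS=1 and once with the variable set to the number of cores on the node shows the library threading the multiply without any change to the calling program.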

Note that the MKL libraries are available separately and independent of the Intel compiler suite, and that thanks to the gcc/gfortran mob adopting strict Intel guidelines about binary compatibility it is perfectly possible (although I admit this is not what I do: I'm Intel through and through...) to compile with gfortran/gcc and then link against the Intel libraries. In any event, as you can see and as Thomas indicated, this is all very much a "your mileage may vary" situation, or as we used to say, "how long is a piece of string?"!

If you are running relatively small systems (say, fewer than 500-600 basis functions) and if symmetry --- being able to identify specific excited states, say --- is important for your work, I would try Dalton, linked against the MKL libraries and letting the number of threads default to, or be explicitly set to, the number of cores on your machine or node. If you are indeed running SCF and then linear response I doubt you will see a huge benefit from running MPI-parallel for such systems, but you may get some benefit. If you are running larger calculations and you are not concerned about explicitly controlling the symmetry, I suggest you try LSDalton. What the balance between coarse- and fine-grained parallelism should be in this case is something one has to play with, but (again, as Thomas says) starting with something like as many threads as there are cores on a node and then using coarser-grained parallelism (MPI, ScaLAPACK, etc.) across nodes should be a good start.

Best regards
Pete

ejberquist
Posts: 12
Joined: 26 Feb 2014, 22:05
First name(s): Eric
Last name(s): Berquist
Affiliation: University of Pittsburgh
Country: United States

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by ejberquist » 17 Mar 2015, 18:24

Great, thank you both for your responses.

Perhaps I should have been clearer. I asked because I have access to a sizeable supercomputer with both fast individual nodes and fast interconnects. I only mention OpenMPI because that's what we use (the national computing centers seem to prefer MPICH), and OpenMP as opposed to running on a GPU with OpenACC or something similar. I always use Intel/MKL where possible.

It sounds like the non-MPI version is probably the way to go, and eventually testing out LSDalton. You wrote:

"What the balance between coarse- and fine-grained parallelism should be in this case is something one has to play with, but (again, as Thomas says) starting with something like as many threads as there are cores on a node and then using coarser-grained parallelism (MPI, ScaLAPACK, etc.) across nodes should be a good start."

I don't think this is clear in the manual, but this would imply I can compile/run Dalton for hybrid parallelism, with both MPI and OpenMP. Is that correct?

tschwabe
Posts: 2
Joined: 23 Apr 2015, 08:13
First name(s): Tobias
Last name(s): Schwabe
Affiliation: University of Hamburg
Country: Germany

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by tschwabe » 23 Apr 2015, 10:52

I would like to add to this thread. I just installed DALTON2015 myself, was confronted with similar performance considerations, and finally decided on a serial compilation with MKL=parallel. Now I have done some tests for coupled cluster response calculations (with and without direct computation of the AO integrals) and found the following: when I allow parallel threading of the MKL routines, I see a much larger CPU time than wall time in the output (as expected for a parallel run), and I also observe a CPU usage that corresponds to parallel use of the MKL routines (i.e. about 800% CPU for 8 parallel threads). The strange thing: the serial run with only 1(!) thread is slightly faster than the one with 8 threads. I have tested different systems and different basis sets...

My question: when can I expect a parallel run to give a real speedup (in wall time) compared to a serial run?

Joanna
Posts: 116
Joined: 27 Aug 2013, 16:38
First name(s): Joanna
Last name(s): Kauczor
Affiliation: Helsinki
Country: Finland

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by Joanna » 23 Apr 2015, 10:57

The coupled cluster code is serial, so I would not expect any speedup in real time (at least not in the foreseeable future).

taylor
Posts: 532
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by taylor » 23 Apr 2015, 12:11

While Joanna is correct, I am not sure this is the answer to the question the user is asking. (Also, I think this should perhaps be in a separate thread, but that's up to someone else to decide.)

My understanding is that the user was posing the following question. Suppose one compiles Dalton with no MPI-level parallelism but links against the threaded MKL library (MKL threading is very similar to OpenMP threading but relieves the user of any programming burden, since it is all done in the library). Watching a CC calculation, which should be dominated by matrix multiplications and thus readily threaded by the MKL library, one does indeed see a "speedup", in the sense that, for example, 'top' will show Dalton running at almost 800% CPU on 8 cores. Reiterating: the library threading is using all 8 cores and the overall CPU time will be (almost) 8 times the walltime. However, the user is asserting that the walltime for the calculation is the same irrespective of whether 8 threads or 1 thread is used. That is, although more CPU time is consumed in a given walltime, the overall calculation takes the same walltime independent of the number of threads.

Hence the user is asking (I think): "If the library uses 8 threads and 8 times the overall CPU time, why is the walltime the same? Why is there no improvement in throughput when apparently the library matrix multiply should be running eight times faster?" If this is indeed the observation, I am also puzzled! But I almost never run Dalton CC calculations, so I can't offer an immediate comment.

Best regards
Pete

RikaK
Posts: 9
Joined: 09 Jun 2014, 03:40
First name(s): Rika
Last name(s): Kobayashi
Affiliation: Australian National University
Country: Australia

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by RikaK » 23 Apr 2015, 14:02

"Hence the user is asking (I think) "if the library uses 8 threads and 8 times the overall CPU time, why is the walltime the same?" "Why is there no improvement in throughput when apparently the library matrix multiply should be running eight times faster?" And if this is indeed the observation, I am also puzzled! But I almost never run Dalton CC calculations, so I can't offer an immediate comment."

This puzzles me as well, but I do have a different perspective. I was looking into exactly this question (not necessarily this case) for one of our users' jobs and I couldn't spot any time when it was clocking >100% CPU - this was just by eye. I did a quick browse through the CC routines I thought the job was going through and did spot a lot of DGEMMs, but the size of the job made me think that the matrix sizes involved in this particular case were such that the parallel workload was not dominant and we were being hit by Amdahl's law. I decided that I couldn't really be certain without doing some profiling, and felt I needed a case which was definitely dominated by matrix multiplications. I would like to see the case which clocked the 800% CPU.
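To get a rough feel for what Amdahl's law implies here, a small illustrative calculation (a Python sketch with made-up parallel fractions, not numbers from any particular Dalton run):

def amdahl_speedup(f, p):
    # wall-time speedup on p threads if a fraction f of the runtime is parallel
    return 1.0 / ((1.0 - f) + f / p)

for f in (0.05, 0.25, 0.50, 0.90):
    print("parallel fraction %.2f -> speedup on 8 threads: %.2fx" % (f, amdahl_speedup(f, 8)))

Unless the DGEMMs really dominate the runtime (a parallel fraction close to 1), the wall-time speedup on 8 threads stays modest even though all cores look busy during the multiplications.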

tschwabe
Posts: 2
Joined: 23 Apr 2015, 08:13
First name(s): Tobias
Last name(s): Schwabe
Affiliation: University of Hamburg
Country: Germany

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by tschwabe » 24 Apr 2015, 14:18

So, first of all, thanks for considering my problem. And yes, my question is indeed "Why is there no improvement in throughput when apparently the library matrix multiply should be running eight times faster?" I wouldn't care too much if most of the code were simply serial and not much could be gained from parallel MKL routines, but that should show up as a poor speedup when comparing CPU time and wall time: for example, I allow 8 parallel threads but the speedup is only 1.25 or so. What I actually get from the DALTON output, and also from watching the CPU usage, is an almost perfect speedup (say 7.5 with 8 parallel threads, which is quite okay). And still, compared to a serial run of the same problem, there is no net speedup in wall time. Here is some example output from a CCSDR(3)/aug-cc-pVDZ excitation energy and oscillator strength computation:

with OMP_NUM_THREADS=8:

>>>> Total CPU time used in DALTON: 14 hours 52 minutes 9 seconds
>>>> Total wall time used in DALTON: 1 hour 52 minutes 35 seconds

with OMP_NUM_THREADS=1:

>>>> Total CPU time used in DALTON: 1 hour 56 minutes 3 seconds
>>>> Total wall time used in DALTON: 1 hour 56 minutes 29 seconds

(Just to be sure: this should not be an artifact of DALTON's internal timings; I have also checked the run times reported by our job scheduler.)
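As a quick sanity check on those numbers, here is my own back-of-the-envelope arithmetic (a short Python snippet, nothing printed by DALTON itself):

cpu_8, wall_8 = 14*60 + 52 + 9/60.0, 1*60 + 52 + 35/60.0   # minutes, OMP_NUM_THREADS=8
cpu_1, wall_1 = 1*60 + 56 + 3/60.0, 1*60 + 56 + 29/60.0    # minutes, OMP_NUM_THREADS=1

print("wall-time speedup going from 1 to 8 threads: %.2fx" % (wall_1 / wall_8))   # ~1.03x
print("CPU/wall ratio with 8 threads: %.1f" % (cpu_8 / wall_8))                   # ~7.9

So the wall time improves by only about 3% even though the CPU/wall ratio is close to 8; the extra CPU time is not buying a shorter run.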

So, is there something wrong with my setup? Or what is going on here... Any hints are welcome.

bast
Posts: 1197
Joined: 26 Aug 2013, 13:22
First name(s): Radovan
Last name(s): Bast
Affiliation: none
Country: Germany

Re: Performance: MPI (OpenMPI) vs. threaded (OpenMP)

Post by bast » 24 Apr 2015, 18:52

Hi Tobias, I will for the moment not discuss whether the calculation is expected to scale with OpenMP or not, but rather the findings that you see. First of all, "busy" CPUs do not mean that the calculation scales: the CPUs can be busy for a myriad of reasons (waiting for data, synchronizing, or something else), in other words a "busy wait", so I would not give the CPU load too much meaning. Then, the fact that you see the "Total CPU time" basically multiplied by the number of threads is IMO because the routines used within the code to measure CPU time are not OpenMP-aware. I am not sure about this, but if I am right, it is because they were written long before OpenMP arrived and were never changed; ideally the code should use proper OpenMP-intrinsic timing routines. So, long story short, I don't think anything is wrong with your setup.
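A minimal illustration of that distinction, again with Python/NumPy standing in for an MKL-linked program rather than the DALTON timing routines themselves: process CPU time accumulates over all threads of the process, while wall time does not.

import time
import numpy as np   # assumed to be linked against a threaded BLAS

a = np.random.rand(3000, 3000)

wall0, cpu0 = time.perf_counter(), time.process_time()
for _ in range(5):
    a @ a                              # threaded DGEMM, result discarded
wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0       # summed over every thread of the process

print("wall %.2f s, CPU %.2f s, CPU/wall (roughly the number of busy threads): %.1f"
      % (wall, cpu, cpu / wall))

The same inflated ratio shows up whether the threads do useful work or merely busy-wait, which is why the "Total CPU time" line on its own cannot tell you whether the run actually scales.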
