What is the LSDalton Fock matrix construction doing apart from computing J and K?

elias
Posts: 5
Joined: 02 Oct 2017, 09:59
First name(s): Elias
Last name(s): Rudberg
Affiliation: Uppsala University
Country: Sweden

What is the LSDalton Fock matrix construction doing apart from computing J and K?

Post by elias » 04 Oct 2017, 18:52

Hello!

I am running some HF/3-21G calculations for water clusters using LSDalton, and trying to understand where the time goes.

The background is that I am working on some new Fock matrix construction algorithms in the context of the Ergo program (see http://www.ergoscf.org/) and would like to compare to the performance of Fock matrix construction in LSDalton. (The water cluster geometry files used for the tests are at http://www.ergoscf.org/xyz/h2o.php)

Attached is the output from one such calculation. This was run on 16 nodes at the Rackham cluster at UPPMAX in Uppsala. Right now I am mostly interested in how the Fock matrix construction part behaves, which I suppose is the FCK_FO timings. The wall time output for one SCF iteration looks like this:

>>> wall Time used in reg-Jengine is 1 minute 15 seconds
>>> wall Time used in LINK-Kbuild is 56.25 seconds
>>> wall Time used in FCK_FO is 4 minutes 16 seconds
>>> wall Time used in G_GRAD is 2 minutes 9 seconds
>>> wall Time used in AVERAG is 0.49 seconds
>>> wall Time used in G_DENS is 21 minutes 22 seconds
>>> wall Time used in SCF iteration is 27 minutes 47 seconds

If I understand this correctly, it says that the Fock matrix construction FCK_FO took 4 minutes 16 seconds, and of that J took 1 minute 15 seconds and K took 56 seconds.
In seconds, that is 256 seconds for Fock matrix construction, of which 75 seconds for J and 56 seconds for K.
That leaves 125 seconds unaccounted for -- what is LSDalton doing during that time?

The Fock matrix is usually computed as
F = J + K + H_core
but since H_core is precomputed, the main work lies in computing J and K; the H_core term is then simply added.

So I was expecting the J and K parts to dominate the FCK_FO time completely. That is also what happens for smaller water cluster sizes, but as the size gets larger something else starts to take more and more time. I also tried larger cases than the one attached; there the extra time is even larger, so that the time for J and K becomes only a small part of the Fock matrix construction time.

What could be the reason for this extra FCK_FO time in addition to the J and K times?

Are there some other input options I could use to make the Fock matrix construction work better and/or get some more information about what is taking time?

/ Elias
Attachments
stdout.txt
(11.49 KiB) Downloaded 123 times
LSDALTON.OUT
(333.53 KiB) Downloaded 148 times

simensr
Posts: 182
Joined: 28 Aug 2013, 09:54
First name(s): Simen
Middle name(s): Sommerfelt
Last name(s): Reine
Affiliation: University of Oslo
Country: Norway

Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?

Post by simensr » 05 Oct 2017, 14:18

Dear Elias,

Clearly this is not the typical behaviour, and looking at your output it is clear that the PBLAS/ScaLapack interface is not working as it should for your run, since the CPU and wall timings of the matrix operations (in G_GRAD, AVERAG and G_DENS) are identical. So the first step would be to make PBLAS/ScaLapack work properly, and I suggest confirming this for a smaller system.

To answer your question though, the three additional steps not included in the timings of J and K construction are:

1. transformation to the Grand Canonical (GC) basis, associated with the default ATOMS (or TRILEVEL) start guess
2. addition of the exchange matrix to the Fock matrix
3. two matrix dot products for the energy evaluation

The GC transformation (step 1) is the main cause of the overhead here. It consists of four matrix multiplications: two to transform the density matrix from the GC to the AO basis and two to transform the Fock matrix back to the GC basis, since the integral evaluation is carried out in the AO basis. This overhead should be reduced by proper installation/usage of PBLAS/ScaLapack.
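Schematically (my notation rather than LSDalton's -- here T denotes the GC-to-AO transformation matrix, and the exact placement of the transposes depends on the convention used):

D_AO = T * D_GC * T^T    (two multiplications)
F_GC = T^T * F_AO * T    (two multiplications)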

You can, however, avoid these transformation steps by using the core Hamiltonian starting guess, which should work well in your case. You could possibly also go for RH/DIIS here, and I suggest using VanLenthe DIIS. Although the diagonalization step (using ScaLapack) does not scale as well with the number of MPI tasks as the matrix multiplications (using PBLAS), I believe this will overall be much faster than the matrix multiplications performed in the default Augmented Roothaan-Hall optimizer. You can include all these suggestions with:

**GENERAL
.NOGCBASIS
**WAVE FUNCTIONS
*DENSOPT
.VanLenthe
.START
H1DIAG

PS: You can of course try this out before solving the PBLAS/ScaLapack issues, to get an idea of how it will work.

Hope this is of help.

Cheers,
Simen

elias
Posts: 5
Joined: 02 Oct 2017, 09:59
First name(s): Elias
Last name(s): Rudberg
Affiliation: Uppsala University
Country: Sweden

Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?

Post by elias » 06 Oct 2017, 10:56

Dear Simen,

Thanks for your kind reply, it was very helpful!

You are right that the PBLAS/ScaLapack I was using was not working well. It turned out that the system default BLAS was being used; I was able to fix that so that OpenBLAS is used instead, and those operations are now much faster. For example, the G_GRAD part is about 5 times faster.

However, the "CPU Time" and "wall Time" numbers in the output are still almost identical for the G_GRAD, AVERAG and G_DENS parts, although I do get a parallelization speedup when using several MPI processes. To get the best possible performance on a cluster with multicore nodes, I guess ScaLapack should be used together with a threaded BLAS library -- is that right? So far I have used a serial BLAS library.

I also tried your suggestions to use .NOGCBASIS, the H1DIAG starting guess, and .VanLenthe DIIS. This seems to work for the smallest cases, but for a 96-atom water cluster it fails to converge. See the attached output file.

What do you think, are there some other options that could make the VanLenthe approach work better?

/ Elias
Attachments
LSDALTON.OUT
(139.21 KiB) Downloaded 116 times

simensr
Posts: 182
Joined: 28 Aug 2013, 09:54
First name(s): Simen
Middle name(s): Sommerfelt
Last name(s): Reine
Affiliation: University of Oslo
Country: Norway

Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?

Post by simensr » 06 Oct 2017, 12:30

Dear Elias,

Great to hear, and yes, ScaLapack should be used together with a threaded BLAS.
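With OpenBLAS the thread count is typically controlled through environment variables; a sketch only, as the launcher, rank counts, and exact variables depend on your MPI stack and BLAS build:

# hypothetical: 16-core nodes, 4 MPI ranks per node, 4 BLAS threads per rank
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
mpirun -np 64 <lsdalton executable>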

You can in fact also use the default ATOMS start guess together with .NOGCBASIS. It converges nicely with VanLenthe DIIS in 8 iterations for your 96-atom water cluster. So either replace H1DIAG with ATOMS, or simply remove the .START and H1DIAG lines; the input below shows the first variant.
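For concreteness (assuming ATOMS is accepted as an explicit argument to .START in the same way as H1DIAG; otherwise just drop the last two lines to fall back on the default):

**GENERAL
.NOGCBASIS
**WAVE FUNCTIONS
*DENSOPT
.VanLenthe
.START
ATOMS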

Simen

