Hello!
I am running some HF/3-21G calculations for water clusters using LSDalton, and trying to understand where the time goes.
The background is that I am working on some new Fock matrix construction algorithms in the context of the Ergo program (see http://www.ergoscf.org/) and would like to compare to the performance of Fock matrix construction in LSDalton. (The water cluster geometry files used for the tests are at http://www.ergoscf.org/xyz/h2o.php)
Attached is the output from one such calculation. This was run on 16 nodes at the Rackham cluster at UPPMAX in Uppsala. Right now I am mostly interested in how the Fock matrix construction part behaves, which I suppose is the FCK_FO timings. The wall time output for one SCF iteration looks like this:
>>> wall Time used in regJengine is 1 minute 15 seconds
>>> wall Time used in LINKKbuild is 56.25 seconds
>>> wall Time used in FCK_FO is 4 minutes 16 seconds
>>> wall Time used in G_GRAD is 2 minutes 9 seconds
>>> wall Time used in AVERAG is 0.49 seconds
>>> wall Time used in G_DENS is 21 minutes 22 seconds
>>> wall Time used in SCF iteration is 27 minutes 47 seconds
If I understand this correctly, it says that the Fock matrix construction FCK_FO took 4 minutes 16 seconds, and of that J took 1 minute 15 seconds and K took 56 seconds.
In seconds, that is 256 seconds for Fock matrix construction, of which 75 seconds for J and 56 seconds for K.
That leaves 125 seconds unaccounted for. What is LSDalton doing during that time?
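The arithmetic above can be reproduced directly from the timing lines. The following is only a small sketch: the regular expressions are guesses based on the few lines quoted in this post (the "hour" unit in particular is an assumption, since no such line appears here), and the function name `parse_wall_times` is made up for illustration.

```python
import re

# Timing lines copied from the post above.
LOG = """\
>>> wall Time used in regJengine is 1 minute 15 seconds
>>> wall Time used in LINKKbuild is 56.25 seconds
>>> wall Time used in FCK_FO is 4 minutes 16 seconds
"""

# Patterns inferred from the quoted lines only; other formats are not handled.
LINE = re.compile(r">>>\s+wall Time used in (.+?) is (.+)")
PART = re.compile(r"([\d.]+)\s+(hour|minute|second)s?")
UNIT = {"hour": 3600.0, "minute": 60.0, "second": 1.0}


def parse_wall_times(text):
    """Map each timing label to its wall time in seconds."""
    times = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            label, duration = match.groups()
            times[label] = sum(float(v) * UNIT[u] for v, u in PART.findall(duration))
    return times


times = parse_wall_times(LOG)
# Time inside FCK_FO that is covered by neither the J nor the K timer.
residual = times["FCK_FO"] - times["regJengine"] - times["LINKKbuild"]
print(f"unaccounted: {residual:.2f} s")
```

With the unrounded K time of 56.25 seconds the residual comes out as 124.75 seconds, which is the 125 seconds quoted above.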
The Fock matrix is usually computed as
F = J + K + H_core
but since H_core is precomputed the main work is computing J and K, then the H_core term is just added.
So I was expecting the J and K parts to dominate the FCK_FO time completely. That is also what happens for smaller water clusters, but as the size grows something else starts to take more and more time. I also tried larger cases than the attached one; there the extra time is even larger, so that J and K account for only a small part of the Fock matrix construction time.
What could be the reason for this extra FCK_FO time in addition to the J and K times?
Are there some other input options I could use to make the Fock matrix construction work better and/or get some more information about what is taking time?
/ Elias
What is the LSDalton Fock matrix construction doing apart from computing J and K?

 Posts: 5
 Joined: 02 Oct 2017, 09:59
 First name(s): Elias
 Last name(s): Rudberg
 Affiliation: Uppsala University
 Country: Sweden
 Attachments

 stdout.txt
 (11.49 KiB) Downloaded 111 times

 LSDALTON.OUT
 (333.53 KiB) Downloaded 141 times

 Posts: 182
 Joined: 28 Aug 2013, 09:54
 First name(s): Simen
 Middle name(s): Sommerfelt
 Last name(s): Reine
 Affiliation: University of Oslo
 Country: Norway
Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?
Dear Elias,
This is clearly not the typical behaviour. Looking at your output, the PBLAS/ScaLapack interface is evidently not working as it should for your run, since the CPU and wall timings of the matrix operations (in G_GRAD, AVERAG and G_DENS) are identical. So the first step would be to get PBLAS/ScaLapack working properly, and I suggest confirming this on a smaller system.
To answer your question though, the three additional steps not included in the timings of J and K construction are:
1. Transformation to the Grand Canonical (GC) basis, associated with the default ATOMS (or TRILEVEL) starting guess
2. Adding the exchange matrix to the Fock matrix
3. Two matrix dot products for the energy evaluation
The GC transformation (step 1) is the main cause of the overhead here. It consists of four matrix multiplications: two to transform the density matrix from the GC basis to the AO basis, and two to transform the Fock matrix back to the GC basis, since the integration is carried out in the AO basis. This overhead should be reduced by proper installation/usage of PBLAS/ScaLapack.
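To make the four multiplications concrete, here is a rough pure-Python sketch. It assumes a transformation matrix `C` (a name chosen here) and the usual convention that the density transforms as C D C^T and the Fock matrix as C^T F C; LSDalton's actual internal conventions are not stated in this thread, and all helper names and the small matrices are made up for illustration.

```python
def matmul(a, b):
    """Plain dense matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def to_ao_density(c, d_gc):
    """D_AO = C D_GC C^T (assumed convention): two multiplications."""
    return matmul(matmul(c, d_gc), transpose(c))


def to_gc_fock(c, f_ao):
    """F_GC = C^T F_AO C (assumed convention): two multiplications."""
    return matmul(matmul(transpose(c), f_ao), c)


def dot(a, b):
    """Matrix dot product: sum of elementwise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def gc_overhead_flops(n):
    """Rough cost of four dense n x n products at about 2*n**3 flops each."""
    return 4 * 2 * n ** 3


# Tiny made-up matrices, only to exercise the functions.
c = [[1, 2], [0, 1]]
d_gc = [[2, 1], [1, 3]]
f_ao = [[1, 4], [4, 5]]

d_ao = to_ao_density(c, d_gc)
f_gc = to_gc_fock(c, f_ao)
# The dot product of density and Fock matrix agrees in both bases.
print(dot(d_ao, f_ao), dot(d_gc, f_gc))
```

Under these assumptions the dense products grow cubically with the number of basis functions, so doubling the cluster size multiplies `gc_overhead_flops` by eight.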
You can however avoid these transformation steps by using the core Hamiltonian starting guess, which should work well in your case. You could possibly also go for RH/DIIS here, and I suggest using VanLenthe DIIS. Although the diagonalization step (using ScaLapack) does not scale as well with the number of MPI tasks as the matrix multiplications (using PBLAS), I believe this will overall be much faster than the matrix multiplications performed in the default Augmented Roothaan-Hall optimizer. You can apply all these suggestions by including:
**GENERAL
.NOGCBASIS
**WAVE FUNCTIONS
*DENSOPT
.VanLenthe
.START
H1DIAG
PS: and you can of course try this out before solving the PBLAS/ScaLapack issues to get an idea of how this will work.
Hope this is of help.
Cheers,
Simen

 Posts: 5
 Joined: 02 Oct 2017, 09:59
 First name(s): Elias
 Last name(s): Rudberg
 Affiliation: Uppsala University
 Country: Sweden
Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?
Dear Simen,
Thanks for your kind reply, it was very helpful!
You are right that the PBLAS/ScaLapack I was using was not working well. It turned out that the system default BLAS was being used. I have now fixed that so that OpenBLAS is used instead, and those operations are much faster. For example, the G_GRAD part is about 5 times faster.
However, the "CPU Time" and "wall Time" numbers in the output are still almost identical for the G_GRAD, AVERAG and G_DENS parts. But I do get a parallelization speedup when using several MPI processes. To get the best possible performance on a cluster with multicore nodes, I guess ScaLapack should be used together with a threaded BLAS library, is that right? So far I have used a serial BLAS library.
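As a rough way to read these numbers: if the reported CPU time is summed over all threads of a process (an assumption about how the timings are collected, not something confirmed in this thread), then CPU time divided by wall time estimates how many cores each process keeps busy. The function name and the example numbers below are hypothetical.

```python
def busy_cores(cpu_seconds, wall_seconds):
    """Estimate the average number of busy cores in one process as CPU / wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return cpu_seconds / wall_seconds


# Hypothetical numbers: identical CPU and wall time means one busy core.
print(busy_cores(129.0, 129.0))
# Hypothetical numbers: CPU time 16 times the wall time means about 16 busy cores.
print(busy_cores(2064.0, 129.0))
```

A ratio close to 1 is what one would expect with a serial BLAS library, even when adding MPI processes still reduces the wall time.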
I also tried your suggestions to use .NOGCBASIS, the H1DIAG starting guess and .VanLenthe DIIS. That seems to work for the smallest cases, but for a 96-atom water cluster it fails to converge. See the attached output file.
What do you think, are there some other options that could make the VanLenthe approach work better?
/ Elias
 Attachments

 LSDALTON.OUT
 (139.21 KiB) Downloaded 107 times

 Posts: 182
 Joined: 28 Aug 2013, 09:54
 First name(s): Simen
 Middle name(s): Sommerfelt
 Last name(s): Reine
 Affiliation: University of Oslo
 Country: Norway
Re: What is the LSDalton Fock matrix construction doing apart from computing J and K?
Dear Elias,
Great to hear, and yes, ScaLapack should be used together with threaded BLAS.
You can in fact also use the default ATOMS starting guess together with .NOGCBASIS. It converges nicely with VanLenthe DIIS in 8 iterations for your 96-atom water cluster. So either replace H1DIAG with ATOMS, or simply remove the .START keyword and its argument, since ATOMS is the default.
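For reference, the first variant would presumably look like the fragment below. It reuses only the keywords already quoted in this thread, with ATOMS in place of H1DIAG; it is a sketch and has not been checked against the LSDalton manual.

```
**GENERAL
.NOGCBASIS
**WAVE FUNCTIONS
*DENSOPT
.VanLenthe
.START
ATOMS
```

Dropping the last two lines should give the same starting guess, given that ATOMS is the default.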
Simen