Memory optimization in TPA running

peppelicari
Posts: 23
Joined: 05 Feb 2016, 16:26
First name(s): Giuseppe
Last name(s): Licari
Affiliation: University of Geneva
Country: Switzerland

Memory optimization in TPA running

Post by peppelicari » 12 Feb 2016, 21:46

Dear all,
I am a new user of Dalton and I find it a really great piece of software, especially for the possibility of calculating quadratic and cubic response.
I'm struggling a bit to get the TPA cross section of relatively large systems because apparently I run out of working memory. I'm running the calculation over three nodes of 16 cores (so 48 processors in total) with an assigned memory of 60GB each (so 180 GB of RAM in total, if I'm not wrong). You can have a look at the output file and the error file attached below.

My question is: is there a way to optimize the use of memory? Am I doing something wrong in the input or in the submission of the job? I submit the job on a cluster using the Slurm utility, and the maximum allowed memory per node is 64GB.

I would really appreciate any answers, and I wish you a nice day!
Regards,
Giuseppe Licari
Attachments
TPA_ACN.txt
Error file from the Slurm Utility
(5.42 KiB) Downloaded 197 times
TPA_ACN.out
Output containing the molecule specification and Dalton input
(275.04 KiB) Downloaded 203 times

taylor
Posts: 589
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Memory optimization in TPA running

Post by taylor » 13 Feb 2016, 10:44

Please post your SLURM script. The error file you posted suggests that WRKMEM is not being parsed correctly and it is certain that the program is not requesting the memory you hoped for at invocation, but it is not possible to say more without seeing the runscript. Further, it is not possible to deduce from what you have provided whether you are trying to run 48 MPI processes, or (say) 3 MPI processes each of which is supposed to be using the threaded libraries. Certainly if it is the latter you will need to explicitly set OMP_NUM_THREADS or MKL_NUM_THREADS.
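
To illustrate what I mean by setting these explicitly, here is a minimal sketch (placed in the SLURM script before Dalton is launched; the variable names are the standard OpenMP and MKL ones, and which block applies depends on which of the two layouts you actually intend):

# Layout 1: 48 single-threaded MPI tasks, one per core
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Layout 2: 3 MPI tasks, each using the threaded libraries across the 16 cores of its node
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16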

Now, if you are running 48 MPI processes and you have 180GB total each process will have a little less than 4GB, and one way or another it should be possible to obtain and address this amount of memory straightforwardly. But if you have 3 MPI processes and wish each one to access 60GB, then you will need to build with 64-bit addressing (you can address at most 16GB using 32-bit addressing in Fortran and this has been commented on in other posts). 64-bit addressing is not recommended unless it is absolutely needed, in part because to use it correctly every single level of the software stack needs to have been built to be 64-bit aware. This includes for example the MPI library: if you're using the IntelMPI supplied with Intel Cluster Studio then this is fine, but some care is needed to build a 64-bit aware OpenMPI.

Best regards
Pete

peppelicari
Posts: 23
Joined: 05 Feb 2016, 16:26
First name(s): Giuseppe
Last name(s): Licari
Affiliation: University of Geneva
Country: Switzerland

Re: Memory optimization in TPA running

Post by peppelicari » 13 Feb 2016, 15:25

Dear Peter Taylor,
thank you for answering right away.

I attached the slurm file which I use for submitting Dalton jobs.

I think I'm making some mistake in submitting the job; in our department we have only just started to use Dalton, so we don't have much experience yet. The program was recently installed on the large cluster, which uses OpenMPI for parallel computing.
Since I'm not totally sure how to submit the job with the right variables, how would you suggest modifying the SLURM script and the *.dal file in order to set the memory correctly?

Thank you very much for the help, and have a nice weekend!
Best regards,
Giuseppe Licari
Attachments
slurm.dal
Slurm file for job submission in queuing system
(1.37 KiB) Downloaded 216 times

taylor
Posts: 589
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Memory optimization in TPA running

Post by taylor » 14 Feb 2016, 12:14

I see at least one immediate problem, and perhaps a second. First, the memory request on the Dalton invocation line (I am assuming your "dalton.srun" has made no significant changes to the runscript that comes with the program) is for a single instantiation of Dalton. That is, each Dalton task will (try to) create a workspace of this dimension. I doubt you want to be trying to create 60GB scratch work memory for each of your MPI tasks! You should set this to the appropriate number based on your previous post (as I recall it was a bit less than 4GB per task). Unless you are trying to take advantage of integral storage in memory or want to run huge CASSCF or RASSCF calculations (which are anyway not parallelized currently) you should not need more than a few GB anyway. So the first thing to do would be to reduce your memory request with the -mb option.
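
To make that concrete, something along these lines (a sketch only; the 3500 is simply the "a little less than 4GB per task" estimate from my earlier post expressed in MB, and I am assuming your dalton.srun accepts the same options as the standard runscript):

dalton.srun -mb 3500 -N 48 file.dal   # -mb is the work memory per MPI task, in MB, not the total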

The second issue is that if you build with (mostly) default options Dalton will be built with 32-bit integers. This has virtually no consequences, other than perhaps for very large coupled-cluster calculations, provided you do not want to address more than 16GB of memory per task. This is obviously relevant if you actually do want to assign 60GB per task, although this seems unlikely. The address limit in Fortran is the (signed) integer value 2^31 - 1, or "2 gig", but since Fortran addresses in words and the Dalton scratch work area is allocated in 8-byte reals, this corresponds to 2 gigawords, or 16 GB. This is a hard limit and you can get around it only by building with 64-bit integers. This is possible but is not recommended for several reasons. One is that there are still some unresolved issues (a fancy term for bugs...) in some parts of the code with 64-bit ints. But much more importantly, if you build with 64-bit ints the entire software stack must also be built that way. For instance, if you use OpenMPI as the MPI layer, that must be built entirely with 64-bit ints. So must ATLAS if that is your BLAS/LAPACK/etc library. These parts of the software stack are not always under user control!
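
(For the record, the arithmetic behind that 16 GB figure: 2^31 words × 8 bytes per word = 2^34 bytes = 16 GB of addressable workspace with 32-bit integers.)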

Finally, thank you for your wishes for a good weekend, which it has been so far. It is completely irrelevant to your issue, but it happens that today is my birthday (2^6, a nice round number, if anyone's wondering) and my partner started the day by opening a bottle of Veuve Clicquot: she may be decadent, but she's got good taste, at least in champagnes... So we have started the day well, and I hope your weekend is going well and the above is some help!

Best regards
Pete

hansh
Posts: 8
Joined: 02 Mar 2014, 15:36
First name(s): Hans
Last name(s): Hogreve
Affiliation: IFISR
Country: United States

Re: Memory optimization in TPA running / Pete's birthday

Post by hansh » 14 Feb 2016, 13:31

Happy Birthday, dear Pete,

As a regular reader of your great posts, I am pretty sure that not only I myself but many other members of the Dalton community are very grateful for all your substantial contributions to advancing quantum chemistry and for your kind help to those in need of assistance. With best wishes, and in the hope that you continue in this way - at least until we can celebrate your 3^4th birthday,

Hans

peppelicari
Posts: 23
Joined: 05 Feb 2016, 16:26
First name(s): Giuseppe
Last name(s): Licari
Affiliation: University of Geneva
Country: Switzerland

Re: Memory optimization in TPA running

Post by peppelicari » 14 Feb 2016, 18:50

Dear Peter,
happy birthday! I wish you all the best on this special day (starting it with a classy champagne is a great way to do it :D ); in two days it is my birthday as well, and you have given me an idea of how to celebrate it properly.

Thank you for the previous comment. I was convinced that the value to put after "-mb" in the submission was the total requested memory. I tried decreasing the memory (say "-mb 2000") and the output file now reports the correct work memory size. So indeed "-mb" specifies the memory allocation for each processor.

Just to be sure: if I want to run the calculation on several nodes (say 3 nodes with 16 processors each), is the specification below in the SLURM file correct?
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks=48
(and then submit with "$DALTON_INSTALL_DIR/dalton.srun -mb 2000 -N 48 file.dal")

It is still not clear to me whether I'm running 48 MPI processes or 3 MPI processes, and whether I need to set the OMP_NUM_THREADS or MKL_NUM_THREADS variables. Does this come from the SLURM file or from the fact that we are using OpenMPI?

Anyway, thanks a lot for the help, and enjoy the rest of your birthday!
Regards,
Giuseppe

taylor
Posts: 589
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Memory optimization in TPA running

Post by taylor » 15 Feb 2016, 11:21

Thanks for this, and a happy birthday to you for tomorrow! A couple of comments on your SLURM script and on running Dalton in parallel:

Your instructions to SLURM and options to the Dalton runscript are fine and should indeed have the calculation run as 48 tasks, across 3 nodes with 16 cores each. But there is an element of redundancy in the SLURM options: if you ask for 3 nodes and 16 tasks per node, additionally telling SLURM that you have 48 tasks is redundant. This is not an issue as long as the specifications are consistent, as they are here. But if one were to make a mistake and, for example, leave the number of tasks as 48 but request 4 nodes (still with 16 tasks per node), I am not sure what SLURM does. Does it let you actually use 64 cores, or are you held to 48 total, so using only 12 of the possible 16 cores on each node? I don't know the answer to this, and it is part of the reason that, at the computer centre I used to run in Melbourne, the systems group disabled the "ntasks" SLURM option, or at least configured SLURM to ignore it and use only "nodes" and "tasks-per-node" to figure out the total number of tasks.
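
To put it another way, something like the following is sufficient on its own and removes the possibility of such an inconsistency in the first place (a sketch, naturally; whether your site's SLURM honours or ignores a conflicting --ntasks is something your systems people would have to confirm):

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=16
# 3 x 16 = 48 tasks in total; no separate --ntasks line that could disagree with the above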

The issue relating to OMP_NUM_THREADS/MKL_NUM_THREADS is a vexed one and there has in the past been discussion among the Dalton developers about it. For your purposes, running an MPI task on each core, setting the number of threads to 1 is the right thing to do --- in principle the calculation should be making the best use it can of each core and increasing the threading cannot help and in fact is likely to make performance worse. This is why the script sets the number of threads to 1 when Dalton (but see below for LSDalton) runs. The vexed question I referred to earlier is related to the fact that this approach overrides the Intel compiler/library default and thus is different from what a user might expect from reading the Intel documentation. In the latter it is stated that if a user does nothing explicit to define the threading environment variables, then the compiled code/libraries default to as many threads as there are cores on the chip. This is not what one usually wants in MPI-parallel Dalton runs, as I said above, and so the Dalton script overrides the Intel default. But I repeat that it's a vexed question, whether to override default behaviour in such a way. In general with Dalton if the functionality is parallelized (SCF, DFT, response properties at these levels) it is better to use only task-parallelism and ignore threading. If the functionality (coupled-cluster calculations and CC response) is not explicitly parallelized there may be an advantage to running threaded, on a single node or socket.

Having said all that, the situation with LSDalton (which I should say I have not played a developer's role in) is different. LSDalton was written much later and with much more experience of recent advances in the Fortran language. For example, it does allocation of memory dynamically, rather than allocating a static workspace upfront like Dalton does. This means LSDalton is not limited to 16GB of working memory even with a 32-bit compilation: it is only the case that no individual array can be larger than 16GB. Further, the parallelism is implemented in different ways and can be invoked through, say, Scalapack at a coarse-grained level, but because most of the operations are programmed as matrix functions it can also, simultaneously, benefit from thread-level parallelism. So one could run, say, 16 MPI tasks on 16 nodes, and if these nodes have 16 cores each run with 16 threads on each node. Of course, one can imagine everything in this case from 256 MPI tasks each single-threaded up to non-MPI with 256 threads, although even if a compiler/library allowed it going off-node with threads would have a terrible performance penalty from ensuring cache coherence. For LSDalton the optimum is likely to be one task per node or perhaps per socket and threading across the cores on that node, although this is a best guess, not a given. One advantage of using threading is that the memory is shared between threads --- if you have four cores and 16GB then each MPI task, assuming they're clones of each other, can only request 4GB, whereas one MPI task invoking four threads has access to the full, shared, 16GB. Let me repeat finally that what I have said in this paragraph applies only to LSDalton, not Dalton! A downside for some applications is that LSDalton does not explicitly exploit molecular symmetry, but particularly for larger systems, even symmetrical ones, the newer, cleaner, and more efficient code in LSDalton means that it usually outperforms Dalton for a given functionality. But the functionality of LSDalton is not yet as broad as Dalton, so the latter can do some things that LSDalton cannot (yet).
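
To make that hybrid MPI-plus-threads layout concrete, here is the kind of SLURM specification I have in mind for the 16-node, 16-cores-per-node LSDalton example above (a sketch under those assumptions, not a recommendation; as I said, whether one task per node or one per socket is optimal is a best guess):

#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=16   # one MPI task per node, 16 threads sharing that node's memory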

Best regards
Pete

peppelicari
Posts: 23
Joined: 05 Feb 2016, 16:26
First name(s): Giuseppe
Last name(s): Licari
Affiliation: University of Geneva
Country: Switzerland

Re: Memory optimization in TPA running

Post by peppelicari » 15 Feb 2016, 18:11

Thank you very much, Pete, I have a much clearer idea now. I hope I won't run into this kind of issue any more.

I wish you a nice working week!
Best regards,
Giuseppe

taylor
Posts: 589
Joined: 15 Oct 2013, 05:37
First name(s): Peter
Middle name(s): Robert
Last name(s): Taylor
Affiliation: Tianjin University
Country: China

Re: Memory optimization in TPA running

Post by taylor » 15 Feb 2016, 20:46

Let me just round this off by saying, in time for your birthday: please do not feel that "I hope I won't run into this kind of issue any more". While I won't go as far as to say "quite the reverse, I hope you run into more!", because of course I don't, one should recognize that it is only through people experiencing difficulties of one sort or another and then reporting them, clearly, through this forum, that the Dalton developers become aware of things we otherwise (perhaps because we have too much experience, or perhaps because we've never used SLURM, etc.) would not have known. In my own case, my partner (the champagne girl, who happens also to be a theoretical chemist) and I experienced some considerable difficulties and mystifications when she was trying to run very large calculations on the facility I directed in Melbourne. We could not understand what was happening: my systems people could not understand it initially either, and it turned out that (a) the SLURM documentation back then was flat-out wrong, and (b) IBM, the vendor of our machines, had decided to set aside 12GB of memory on each node as cache for the parallel file system, which meant that some memory requests were honoured by the operating system but could not be accessed in practice. In that case the problem was not with Dalton/LSDalton but with other parts of the software stack, but irrespective of where the fault is, things only get fixed when people bring them to our attention. So good luck with your future calculations, but if things go awry, do not hesitate to post! Especially when people such as yourself go to the trouble of following up and thanking us, the Dalton/LSDalton mob are all happy to do what we can.

Best regards
Pete
