Parallel computing on a cluster

Ruonan · October 19, 2021, 4:35pm

Dear developers,

Please could you help me with these questions?

Recently I compiled the latest GitHub version on my university’s cluster. This time I didn’t use the GUI. What I did was: copy the mesh, XML (generated by the GUI on my desktop PC) and src files to the corresponding folders on the cluster, edit the /DATA/run.cfg file to change the node and CPU numbers, then run the “code_Saturne run” command in the DATA folder. The case runs, but I have three questions:

1. On the cluster, I have two nodes, and each node has 20 CPUs. In the run.cfg file, I wrote “n_procs: 2 n_threads: 20”. Is this correct? Does “n_threads” mean the total number of CPUs or the number of CPUs per node?
2. Is there anything else I need to specify in the run.cfg file? For example the input/output method, MPI rank step, etc.?
3. When I want to stop the case and save the results, what command should I run in the terminal? In the GUI I can click “Stop now”, but in the terminal I don’t know how to stop it.

Many thanks and best regards,
Ruonan

Yvan_Fournier · October 19, 2021, 5:08pm

Hello,

n_procs is the number of MPI processes used, and n_threads the number of OpenMP threads per MPI process.

I do not recommend more than 2 threads per MPI process, as OpenMP is not used everywhere, so (n_procs = 40, n_threads = 1) or (n_procs = 20, n_threads = 2) are the recommended options.
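
As a rough illustration of this rule of thumb, the short Python sketch below lists the (n_procs, n_threads) pairs that fill a given number of cores while keeping at most 2 threads per MPI process. The function name and the 40-core figure (2 nodes of 20 cores, as in the question) are only illustrative; this helper is not part of code_saturne.

```python
def candidate_layouts(total_cores, max_threads=2):
    """Return (n_procs, n_threads) pairs with n_procs * n_threads == total_cores."""
    layouts = []
    for n_threads in range(1, max_threads + 1):
        # Keep only thread counts that divide the core count evenly.
        if total_cores % n_threads == 0:
            layouts.append((total_cores // n_threads, n_threads))
    return layouts


# 2 nodes x 20 cores, as described in the question.
for n_procs, n_threads in candidate_layouts(40):
    print(f"n_procs: {n_procs}  n_threads: {n_threads}")
```

For 40 cores this prints the two options quoted above; for an odd core count such as 27, only the single-threaded layout remains.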

There have already been similar questions on this forum, regarding performance, so you should find more info by searching.

You will find detailed documentation on the run.cfg here: https://www.code-saturne.org/documentation/7.0/doxygen/src/cs_ug_case_structure.html#sec_prg_run_cfg

To stop the code, look here: https://www.code-saturne.org/documentation/7.0/doxygen/src/cs_ug_run_computation.html#sec_prg_control_file

Regards,

Yvan

Ruonan · October 20, 2021, 2:43pm

Hi Yvan,

Thanks for your reply! It is very helpful.

Regarding stopping the code, I tried but failed. I created a “control_file”, added the line “<time_step_number>1000” to it, and put this file in the DATA folder (I also tried the SRC folder). At that moment, the case had already run more than 1000 time steps, so I expected the calculation to stop immediately after the current time step. But the calculation didn’t stop; nothing happened. Could you please tell me what I did wrong?

Best regards,
Ruonan

Yvan_Fournier · October 20, 2021, 11:05pm

Hello,

The control_file must be placed in the execution folder (RESU/<run_id>) to be used.

If placed in DATA, it will be copied to RESU/<run_id> for every subsequent run (probably not what you want).

Regards,

Yvan

Ruonan · October 21, 2021, 10:15am

Hi Yvan,

Thanks for your reply! I tried, but when I put the control_file in the RESU/<run_id> folder, it is deleted immediately, and the calculation does not stop. Did I write the control_file incorrectly? I only have one line in the control_file:

<time_step_number>100

Thanks for checking!

Best regards,
Ruonan

Ruonan · November 15, 2021, 12:44pm

Hi Yvan,

Sorry, I had written the control_file incorrectly. Writing just the number “1” in the control_file and copying it to the results folder works.
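
For readers who want to script this, here is a minimal Python sketch of what worked in this thread: a control_file containing a single number, placed in the execution folder (RESU/<run_id>) of the running case. The function name is hypothetical, and reading the number as a maximum time step number is an assumption here; the linked control_file documentation is the authority on the accepted syntax.

```python
import os
import tempfile
from pathlib import Path


def request_stop(run_dir, max_time_step=1):
    """Write a one-line control_file into the execution folder of a running case."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        raise FileNotFoundError(f"execution folder not found: {run_dir}")
    # Write to a temporary file first, then rename, so the solver
    # never picks up a half-written control_file.
    fd, tmp_name = tempfile.mkstemp(dir=run_dir, prefix=".control_file.")
    with os.fdopen(fd, "w") as f:
        f.write(f"{max_time_step}\n")
    target = run_dir / "control_file"
    os.replace(tmp_name, target)
    return target


# Example call; replace the placeholder with the actual run folder:
# request_stop("RESU/<run_id>")
```

The file disappearing shortly after it is copied, as described earlier in the thread, is consistent with the solver having read it.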

Many thanks,
Ruonan

Ruonan · November 16, 2021, 2:20pm

Hello Yvan,

Could you please help me with this error? I ran a test case in parallel on the cluster, using the settings you recommended, but it still fails.

I have 1 node with 27 CPUs. I set “n_procs: 27, n_threads: 1” in run.cfg, but the calculation fails to start. The error is shown below; the run_solver.log file and two error files are attached.

  ----------------------------------------------------------
 Composing periodicities

 Halo construction with standard neighborhood
 ============================================

 Face interfaces creation
 Definition of periodic vertices
 Vertex interfaces creation
 Halo creation
 Halo definition
    Local halo definition
    Distant halo creation
SIGINT signal (Control+C or equivalent) received.
--> computation interrupted by user.

Call stack:
   1: 0x7fdbf92d9296 <PMPIDI_CH3I_Progress+0x1146>    (libmpi.so.12)
   2: 0x7fdbf93e4c29 <MPIC_Wait+0x39>                 (libmpi.so.12)
   3: 0x7fdbf93e526a <MPIC_Recv+0xea>                 (libmpi.so.12)
   4: 0x7fdbf92bdeef <MPIR_Barrier_intra+0x2ff>       (libmpi.so.12)
   5: 0x7fdbf92bd875 <I_MPIR_Barrier_intra+0x125>     (libmpi.so.12)
   6: 0x7fdbf92bd6cc <MPIR_Barrier+0xc>               (libmpi.so.12)
   7: 0x7fdbf92bd5fc <MPIR_Barrier_impl+0x4c>         (libmpi.so.12)
   8: 0x7fdbf92bf482 <PMPI_Barrier+0x1c2>             (libmpi.so.12)
   9: 0x7fdbfb56bf5f <+0x5f4f5f>                      (libsaturne-7.1.so)
  10: 0x7fdbfb56e229 <cs_mesh_halo_define+0x1139>     (libsaturne-7.1.so)
  11: 0x7fdbfb52e817 <cs_mesh_init_halo+0x1cd7>       (libsaturne-7.1.so)
  12: 0x7fdbfb106aa0 <cs_preprocess_mesh+0x370>       (libsaturne-7.1.so)
  13: 0x7fdbfc156b96 <main+0x2d6>                     (libcs_solver-7.1.so)
  14: 0x7fdbf89e6c05 <__libc_start_main+0xf5>         (libc.so.6)
  15: 0x401879     <>                               (cs_solver)
End of stack

What is strange is that when I decreased the number of processes, using “n_procs: 8, n_threads: 1”, the calculation ran with no error. I also ran this case on my desktop PC with no error. So I think the case setup is fine, and the error is related to running in parallel.

(I am using the master version from GitHub. When I compiled the code on cluster, I used the semi-automatic installation method. PT-Scotch and ParMETIS were installed with no errors.)

Could you please guide me on what is wrong here?

Many thanks,
Ruonan
run_solver.log (18.3 KB)
error_r09.log (648 Bytes)
error.log (1.2 KB)

Ruonan · December 1, 2021, 10:10pm

Hi Yvan,

Thank you! I still have no idea what to do about the previous error, but I tried other nodes with different features, and all the errors disappeared. I think some nodes on my cluster are not compatible with Saturne, or perhaps need special settings, for some reason.

I tested the parallel performance on the cluster, using 2 to 56 cores. Based on your experience, could you tell me whether this parallel performance is good or not?

Please see the two graphs below. The speedup ratio is calculated by (time using 1 core)/(time using n cores). The parallel efficiency is calculated by (speedup ratio)/(core number). I got a parallel efficiency of about 40%. I followed the suggestion: putting 20000~80000 cells per core. This suggested region is highlighted in green.
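
To make the 20000~80000 cells-per-core guideline concrete, the sketch below converts a mesh size into the corresponding range of core counts. The mesh size used in the example (1,000,000 cells) is a made-up placeholder, not the size of the case discussed here, and the function name is hypothetical.

```python
import math


def core_range(n_cells, min_cells_per_core=20_000, max_cells_per_core=80_000):
    """Return (fewest, most) cores that keep the cells per core within the guideline."""
    # Fewest cores: each core holds at most max_cells_per_core cells.
    fewest = max(1, math.ceil(n_cells / max_cells_per_core))
    # Most cores: each core still holds at least min_cells_per_core cells.
    most = max(fewest, n_cells // min_cells_per_core)
    return fewest, most


# Placeholder mesh size, for illustration only.
print(core_range(1_000_000))  # (13, 50)
```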

I can see you are very experienced in parallel optimization, with many papers published. I really appreciate your help.

Best regards,
Ruonan

Yvan_Fournier · December 1, 2021, 10:57pm

Hello,

What type of cluster do you have (processor type, network type, … and even MPI library install)? From one of your older logs I would say Intel(R) Xeon(R) Gold 5120, but that does not tell me whether you have a fast network (InfiniBand for example) or something with higher latency (Gigabit Ethernet maybe), or whether the MPI drivers make the best of the network (the compilers and system seem old).

Even on our own clusters (the latest is the one described here https://top500.org/system/179899/), we can observe a factor of 2 on the performance depending on the compilers and especially MPI library configuration used (between optimized libraries and a “generic” workstation-type configuration).

Partitioning quality may play a major role, as well as load balance. Do you have the performance.log files for some of your runs on different numbers of processes?

Also, are other codes running on the same nodes, or do you have exclusive access (such as when using SLURM’s --exclusive option, or whatever equivalent option LSF, Torque, or the scheduler/resource manager you use may have)?

And finally, some specific models might not scale as well as the commonly-used ones. The info in timer_stats.csv and performance.log can provide precious feedback, to help see where things are slower.

All these factors can be important.

Best regards,

Yvan

Ruonan · December 2, 2021, 5:03pm

Hello Yvan,

Thanks a lot for your very useful comments. I have attached the performance.log files for the 2-core, 28-core and 56-core runs. I would really appreciate it if you could take a look.

Here are some details of my cluster:
Processor type: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Network type: op
MPI version: 3.0 (MPICH 3.1.2)

I have “gold-5120” nodes as well, but these nodes give me errors described in the previous post. So I can only use “e5-2660” nodes now.

Yes, I use the “--exclusive” option.

Best regards,
Ruonan
2cores-performance.log (37.1 KB)
28cores-performance.log (37.1 KB)
56cores-performance.log (37.1 KB)

Ruonan · December 2, 2021, 5:11pm

Hello Yvan,

Here follow the timer_stats.csv files. They may help. Thank you!

Best regards,
Ruonan
2cores-timer_stats.csv (414 KB)
56cores-timer_stats.csv (270 KB)
28cores-timer_stats.csv (126 KB)

Yvan_Fournier · December 3, 2021, 7:34pm

Hello,

Comparison is a bit tricky, since you have 2633 time steps in the 56 cores case, and 1224 in the 28 cores case.

Looking at the averages, I find 0.848 s/time step for the 56-core case, 1.64 s/time step for the 28-core case, and 9.34 s/time step for the 2-core case, so efficiency seems to match your curve. No specific operation stands out as slower (we have about 2/3 of the time in the linear solvers and gradients for the 2-process case, and 3/4 for the 56-process case), so I see no obvious issue here.
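
The arithmetic behind this comparison can be sketched in a few lines of Python. The timings are the averages quoted in this post; since no 1-core timing appears in the thread, the 2-core run is used as the baseline, which is an assumption that differs slightly from the 1-core definition given earlier. The function name is illustrative.

```python
def relative_efficiency(t_ref, n_ref, t_n, n):
    """Parallel efficiency of an n-core run relative to an n_ref-core baseline."""
    # Ideal scaling keeps (time per step) * (core count) constant.
    return (t_ref * n_ref) / (t_n * n)


# Average wall-clock seconds per time step, as quoted in this post.
seconds_per_step = {2: 9.34, 28: 1.64, 56: 0.848}

for n in (28, 56):
    eff = relative_efficiency(seconds_per_step[2], 2, seconds_per_step[n], n)
    print(f"{n} cores: efficiency relative to 2 cores = {eff:.2f}")

# Going from 28 to 56 cores on its own.
step = relative_efficiency(seconds_per_step[28], 28, seconds_per_step[56], 56)
print(f"28 -> 56 cores: incremental efficiency = {step:.2f}")
```

This gives about 0.41 at 28 cores and 0.39 at 56 cores relative to the 2-core run, in line with the roughly 40% reported, while the step from 28 to 56 cores alone is close to ideal (about 0.97). Most of the loss therefore occurs between 2 and 28 cores.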

If you are on a single node, memory bandwidth saturation does not seem to be the issue either, because in that case you would have very little additional speedup from 28 to 56. So I would guess some MPI or network driver aspect comes into play here. Do you have “vanilla” MPICH 3.1 on the machine, or some version with optimized drivers for OmniPath? That could be the cause of the performance loss.

Best regards,

Yvan

Ruonan · December 5, 2021, 5:08pm

Hello Yvan,

Thanks a lot for your comments! I really appreciate your help!

Sorry for not running the same timesteps for each case. I also use the average time for each step, the same as your method.

Actually, not all the cases are on a single node. I have 28 cores per node, so the 2-core and 28-core cases use only one node, but the 56-core case uses two nodes. Could “memory bandwidth saturation” be a problem after all?

I will check the MPI and network driver question with my IT support and get back to you soon. As you said, if I can increase the parallel performance by a factor of 2, that will be wonderful.

Best regards,
Ruonan

Yvan_Fournier · December 5, 2021, 6:30pm

Hello,

In that case, the drop in performance may be due to saturating the node’s memory bandwidth when you move to 28 ranks on a single node.

A test to confirm this would be to try running 14 ranks on a single node, and 28 ranks on 2 nodes, and check the performance in those configurations.
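
The comparison suggested above can be laid out as in the following Python sketch. The four configurations come from this thread (the two proposed runs plus the 28-rank and 56-rank runs already done); the helper name is hypothetical, and no timings are included because none have been reported for the new runs.

```python
def ranks_per_node(n_ranks, n_nodes):
    """Number of MPI ranks placed on each node, assuming an even distribution."""
    if n_ranks % n_nodes != 0:
        raise ValueError("ranks do not divide evenly across nodes")
    return n_ranks // n_nodes


# (total ranks, nodes): the two proposed runs plus the two existing ones.
configs = [(14, 1), (28, 1), (28, 2), (56, 2)]

for n_ranks, n_nodes in configs:
    per_node = ranks_per_node(n_ranks, n_nodes)
    print(f"{n_ranks} ranks on {n_nodes} node(s): {per_node} ranks per node")
```

If 28 ranks spread over 2 nodes (14 per node) turn out clearly faster per time step than 28 ranks packed on one node, that would point to memory bandwidth on the node as the limiting factor.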

Best regards,

Yvan