Different pressure values on different machines!

Mohammad · August 26, 2020, 7:20am

Hello,

I ran the same calculation on two different machines with the same number of cores, same mesh and exactly same files and same versions of the code (6.0.4) (Started from the same checkpoint file).
The calculated velocities are the close to each other on both machines while the pressure and total pressure are quite different!
The pressure on the first machine is:

field minimum maximum set mean spatial mean
v Pressure -5.6154e+08 -5.6152e+08 -5.6153e+08 -5.6153e+08

And on the second machine is:

field minimum maximum set mean spatial mean
v Pressure -70045 74732 5001.5 -2.1064e-07

As you can see, the pressure field on the first machine is 10000 times bigger than the second machine!
The correct values are for the second machine.

The weird thing is that I have version 5.0 on both machines and test it with that version and this problem did not occur in that version!

What is wrong there?

Regards,
Mohammad

Yvan_Fournier · August 26, 2020, 7:48am

Hello,

Do you have a small test case (same setup, but small mesh if possible) with which we could try to reproduce this ?

Or in any case at least the xml and user files.

Regards,

Yvan

Mohammad · August 30, 2020, 6:13pm

Hello Yvan and thank you.

I don’t know what happend, but the problem went away automatically after 1 day!
I will send the files if this problem happens again!

Regards,
Mohammad

Mohammad · October 24, 2020, 11:00pm

Hello,

The problem happened again, I noticed it only happens if I use Neumann B.C for the pressure at the outlet boundary.

    icodcl(ifac,ipr) = 3 !Neumann Boundary Condition for the Pressure
    rcodcl(ifac,ipr,1) = 0.d0
    rcodcl(ifac,ipr,2) = rinfin
    rcodcl(ifac,ipr,3) = 0.d0

When I add these codes to the outlet B.C (isolib) in cs_user_boundary_conditions.f90, I get different results for the pressure on different machines (my PC and cluster node). The first machine which is my PC gives correct results for the pressure while the second machine which is a cluster node, gives very big numbers for the pressure and the number of iterations for the pressure solver increases with the time (from 30 to about 320 after a few hours), while my PC gives correct results for the pressure! (The results for the PC and the cluster node are the same when I use ver. 5.0.9 and the pressure is correct on both of them!)

I uploaded the files of the case with small mesh. You can download it here:

I also attached the setup.log files of PC and cluster node.
setup_PC.log (22.8 KB)
setup_Cluster.log (22.5 KB)
Regards,

Mohammad

Yvan_Fournier · October 25, 2020, 10:27pm

Hello,

Ok thanks for the info. This is probably related to partitioning. Using the default outlet boundary conditions, there is a specific treatment to ensure the choice of the reference pressure point is independent of mesh partitioning (and without this, we observed similar convergence dependence on partitioning). This is based on setting the default outlet type, not simply Neumann on the pressure.

I suspect there is a similar issue here. What boundary condition type code do you use for the outlet (not just pressure) ?

Regards,

Yvan

Mohammad · October 26, 2020, 8:46am

Hello,

Thanks for your reply.

I tested the case on another PC (with ver. 6.0.4 installed) with my checkpoint files, there is no problem on that PC, too! The problem just occurs on the cluster and only on version 6.

[For your information, I start the calculations with a previous simulation checkpoint file.]

I use isolib type of B.C for the outlet, but I changed the velocity to convective outlet and pressure to Neumann as you can see below:

double precision courant
double precision, dimension(:,:), pointer :: previous_vel
double precision, dimension(:), pointer :: cflnum
call field_get_val_prev_v(ivarfl(iu), previous_vel)
call field_get_val_s(icour, cflnum) 

! SOME CODES ...

call getfbr('OUTLET', nlelt, lstelt)
do ilelt = 1, nlelt
   ifac = lstelt(ilelt)
   itypfb(ifac) = isolib   

    iel = ifabor(ifac)
    courant = cflnum(iel)
    icodcl(ifac,iu) = 2
    rcodcl(ifac,iu,1) = previous_vel(1,iel)
    rcodcl(ifac,iu,2) = courant

    icodcl(ifac,iv) = 2
    rcodcl(ifac,iv,1) = previous_vel(2,iel)
    rcodcl(ifac,iv,2) = courant

    icodcl(ifac,iw) = 2
    rcodcl(ifac,iw,1) = previous_vel(3,iel)
    rcodcl(ifac,iw,2) = courant

    icodcl(ifac,ipr) = 3 !Neumann Boundary Condition for Pressure
    rcodcl(ifac,ipr,1) = 0.d0
    rcodcl(ifac,ipr,2) = rinfin
    rcodcl(ifac,ipr,3) = 0.d0
enddo

Regards,
Mohammad

Mohammad · November 10, 2020, 1:45pm

Hello,

I noticed that MPI is not installed on the cluster and it seems that it does not apply the partitioning on the mesh!

I don’t see any messages about partitioning in the listing file on the cluster, while on my own PC I see the following messages on all runs:

No “partition_input/domain_number_4” file available;

Partitioning by space-filling curve: Morton (in bounding box).
Number of cells per domain (histogramm):
[ 788417 ; 788417 ] = 4

Partitioning finished (27.9 s)

There’s no such message in listing file on the cluster.

You can see the header of listing file on both systems for the same case:
listing_PC.txt (30.3 KB)
listing_CLUSTER_6.txt (21.8 KB)
I tried to force it to use METIS by using cs_user_performance_tuning-partition.c file, but it does not make any difference.

I also tried to perform mesh partitioning on my own PC and then import that on cluster using domain.partition_input in cs_user_scripts.py, but again, it does not even read the partitions on the cluster!

I don’t have privilege to install MPI on the cluster. And you said that this problem is maybe due to the partitioning.

The weird thing is that there is no problem on the same cluster with code_saturne version 5.0 with the same settings and same files! The problem only occurs if I use version 6.0.

I also attached the header of the listing file on the cluster for version 5. Hope it will be useful!
listing_CLUSTER_5.txt (21.1 KB)
Regards,
Mohammad

Yvan_Fournier · November 10, 2020, 5:32pm

Hello,

Good you noticed. It means you are running the same computation multiple times on the cluster.

Perhaps MPI is already installed on the cluster, and you only need to detect it correctly and reinstall code_saturne to use it.
Check the cluster’s documentation to check which MPI libraries are installed, and load environment modules if needed. Once “mpicc” and “mpicxx” scripts are found in your path (test using “mpicc -show”), you can use CC=mpixcc CXX=mpicxx in the code_saturne “configure” command (you do not need MPI for Fortran, as it is not used directly from Fortran in code_saturne).

Partitioning is only useful if you are running wth MPI (or to prepare a partitionning for another computation with MPI). You can run in parallel (MPI) without Metis (with the default partitioner), but without MPI, you are limited to OpenMP, which is used only in some parts of the code and is much more limited (and less tested/possibly buggy).

Regards,

Yvan

Mohammad · November 12, 2020, 7:35pm

Hello,

Thank you very much dear Yvan, I found mpi on the cluster.
But now I get this error in compile.log file when I run a job:

/state/partition1/intel/impi/5.0.1.035/intel64/bin/mpiicc: line 590: icc: command not found

/state/partition1/intel/impi/5.0.1.035/intel64/bin/mpiicc -DHAVE_CONFIG_H -I/share/users/mohammad/usr/Code_Saturne/6.0.0/arch/prod_mpi/include -I/share/users/mohammad/usr/Code_Saturne/6.0.0/arch/prod_mpi/include/code_saturne -I/share/users/mohammad/usr/share/hdf5-1.10.5//include -I/usr/include -DMPICH_SKIP_MPICXX -D_POSIX_C_SOURCE=200809L -DNDEBUG -std=c99 -restrict -funsigned-char -Wall -Wcheck -Wshadow -Wpointer-arith -Wmissing-prototypes -Wuninitialized -Wunused -wd981 -openmp -O2 -c /share/users/mohammad/JOB/RESU/20201112-2259/src/cs_user_parameters-fields.c

Regards,
Mohammad

Yvan_Fournier · November 12, 2020, 10:26pm

Hello,

Maybe we should move this to the “install” section of the forum…

I guess you do not have the same environment when running or installing (perhaps not the same environment on the compute nodes).

Did you do the post-install (code_saturne.cfg) to indicate the batch system ? If yes, did you try “code_saturne submit” instead of “code_saturne run” . It compiles the code on the front-end, so may have a more complete environment.

Otherwise, dif you load any modules or need settings to find mpicc ?

Best regards,

Yvan

Mohammad · November 19, 2020, 11:57am

Thank you dear Yvan,

A linux expert guy solved it for me!

Regards,
Mohammad

Mohammad · November 22, 2020, 10:23pm

Hello,

I noticed that the problem only occurs if I use OMP_NUM_THREADS>1 (bigger than 1) and Neumann boundary condition for the pressure at the outlet(as I mentioned before), this problem occurs and the pressure field overshoots and values for the pressure keep getting bigger and bigger after each iteration.
But if I use OMP_NUM_THREADS=1, this problem won’t occur! However, the code gets much slower on my case.
Is this a bug? Is there any fix?
I think it’s because of threaded numbering can I set the numbering type to default?

Regards,
Mohammad

Yvan_Fournier · November 23, 2020, 3:49pm

Hello,

This is interesting. It is probably due to a variable which is shared but should be private on the OpenMP code,
Unfortunately, there are no good tools to debug this, except for a LLVM-based tool (Archer) which requires a local LLVM/clang install, so the usual way to debug this is by bisection: replace hald of #pragma omp (in C) and !omp (in Fortran) with a non-recognized pragma/directive (such as nomp), see if the bug occurs, and switch to the other half, or refine that half… which takes some time.

In your case, if I remember correctly, since it relates to the pressure outlet, the volume of code to check is not too large, so I will take a look, but maybe only in a few days (I have a deadline for a presentation).

In any case, you may try calling cs_renumber_set_algorithm (in cs_user_performance_tuning.c) with b_face_cells_numbering set to CS_RENUMBER_B_FACES_NONE. Chances that this will avoid the bug are not very big, but it can provide additional info.

Best regards,

Yvan

Mohammad · November 24, 2020, 9:11pm

Hello,

Thank you dear Yvan, I appreciate your helps.
The CS_RENUMBER_B_FACES_NONE did not help.

But since the bug is not fixed yet, can I implement the homogeneous Neuman B.C for the pressure at the outlet as Dirichlet B.C?
I mean for the 1st order Neumann B.C we have (after discretizing):
P[boundary]=P[last cell]
So I can set the pressure at the boundary as the pressure of the last cell which is connected to the boundary. As codes we have:

double precision, dimension(:), pointer :: cflnum, per
call field_get_val_s(ivarfl(ipr), per)
...
do ilelt = 1, nlelt
   ifac = lstelt(ilelt)
   itypfb(ifac) = isolib                                                                                                                     
   iel = ifabor(ifac)
   icodcl(ifac,ipr) = 1
   rcodcl(ifac,ipr,1) = per(iel)+p0
   rcodcl(ifac,ipr,2) = rinfin
   rcodcl(ifac,ipr,3) = 0.d0
enddo

If it is OK, then the question is that should I use the relative pressure(p) or total pressure(p+p0) for the boundary?

Regards,
Mohammad

Yvan_Fournier · November 25, 2020, 3:01am

Hello,

I am not sure about this. You can already impose a constant Dirichlet value, but if you wan to impose a local value based on the current/previous time step value, I would expect this not to be robust as regards the time step.

If found a possible OpenMP error in the turbomachinery module, and another with the volume average of the pressure when not boundary face has a Dirichlet value (which might be the problem).

Could you test adding the two attached files (or at least the C file if not using the turbo-machinery module) ?

Best regards,

Yvan

Mohammad · November 25, 2020, 5:50am

Hello,

Thanks for your reply.

I don’t see any attached files in your reply and I’m not using turbomachinery module.

Today I figured out that another problem is happening!
Few days ago, I just stopped using more than 1 threads and ran the code with OMP_NUM_THREADS = 1 to solve the problem with pressure values.
I noticed that when I use hemogenous Neumann B.C for the pressure at the outlet, whether I use OpenMP or not, the residuals for the pressure equation solver and the value of the outlet mass flow keeps growing up and same happens to the number of iterations for the pressure solver which makes the computation very very time consuming. It starts from 10 iterations for the pressure solver and after 2 or 3 days of simulation, it needs 300 iterations to solve the pressure equation for each time step (and still keeps increasing)! Which makes the simulation impossible. There’s no problem with the velocity solver and the solution does not seem to diverge because I checked the calculated values. It just keeps getting slower and slower.

This problem does not happen with Dirichlet B.C for the outlet pressure.
But as I need to compare my results with a paper, I need Neumann B.C to have the same boundaty conditions.

Regards,
Mohammad

Yvan_Fournier · November 25, 2020, 1:50pm

Hello,

Sorry, I missed a step for the attachment. Here is the one which may affect pressure values.

Regarding your other remark, yes, we have much more robustness/stability issues with free outlet pressure BC’s than with Dirichlet BC’s. I am not sure it is actually “pure” Neumann BC’s, but it is mostly that.

If your mesh has non-orthogonal elements, extruding it near the outlet may help. Also, if this i not a pipe flow, but a domain open on many sides, some other Pressure BC’s may be better (such as free inlet/outlet).

Also, the “improved pressure interpolation” (iphydr = 1) option may help in some cases with head loss or stratification (more with results quality than with robustness, and probably with more feedback for RANS than LES)

Best regards,

Yvan
cs_field_operator.c (37.1 KB)

Mohammad · November 30, 2020, 8:24pm

Hello,

Thank you dear Yvan,
It seems that your attached file solved the problem!

Please keep in mind to fix this problem in newer versions.

Regards,
Mohammad

Yvan_Fournier · November 30, 2020, 9:34pm

Hello,

The fix is already commited in the development versions. As a rule, we always try to fix the development versions and merge the fixes in older maintenance branches, so as to avoid old bugs which “reappear” in newer versions.

Glad to confirm the fix works.

Best regards,

Yvan