Example Error in 6.01.00.2755
Hi again,
I am now testing 6.01.00.2755 without the DELWAQ module.
While running the example files I noticed the following:
    MPI process number 000 has host unknown and is running on processor master
    MPI process number 001 has host unknown and is running on processor master
    MPI process number 002 has host unknown and is running on processor master
What is the meaning of this, and how can I resolve it? It seems that the processes are running only on my master and not on the nodes.
All other test cases seem to be working.
Regards,
Dirk
Adri Mourits, modified 7 Years ago.
RE: Example Error in 6.01.00.2755
Hi Dirk,
On Linux, a file named "machinefile" is used; see ".../examples/01_standard/run_flow2d3d_parallel.sh". This file must contain the exact names of the machines on which the partitions must be started.
If you know the machine names in advance, you can create the machinefile manually.
If you are using some queueing mechanism, you do not know the machine names in advance. In that case you have to find out how your queueing tool publishes the allocated machines; see ".../examples/01_standard/run_flow2d3d_parallel_sge.sh" for an example when SGE is used.
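For illustration, a minimal machinefile simply lists one machine name per line (the node names below are hypothetical):

    node01
    node02
    node03

Under SGE, for example, the allocated hosts are published in the file pointed to by $PE_HOSTFILE, whose first column holds the host names, so a machinefile can be derived along these lines (a sketch, not a quote from the example script):

    # keep only the host name column of the SGE host file
    awk '{print $1}' $PE_HOSTFILE > machinefile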
Regards,
Adri
Hi Adri,
I have set up the machinefile, as you can see in the output.
If I put an mpdtrace into the script, I get the machines it plans to use.
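(For reference, that check is just this line added to the run script; mpdtrace is part of MPICH2's mpd process manager:)

    mpdtrace -l    # list the hosts in the mpd ring, including ports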
Regards,
Dirk
Adri Mourits, modified 7 Years ago.
RE: Example Error in 6.01.00.2755
Hi Dirk,
It is difficult to find out from here what goes wrong on your cluster. Here are some suggestions:
Example script ".../examples/01_standard/run_flow2d3d_parallel.sh" contains the line:

    mpd &

You should remove this line when using a cluster. When using a queueing system, the queueing system should take care of starting mpd. When this line is in the script, mpd will be started on the master machine.
The machinefile specifies the available hardware; the command "mpdboot" in your script distributes the processes over that hardware. I do not know which MPI tool and version you are using, but the manual of that mpdboot command may help you. The mpdboot line in the example script reads:

    mpdboot -n $NHOSTS -f $(pwd)/machinefile --ncpus=2

If you use exactly this line: print the value of the parameter $NHOSTS just before executing mpdboot (see the sketch below). Does it contain the expected value?
"--ncpus=2" puts 2 partitions on the first node, then 2 partitions on the second node, and so on. What happens if you leave it out?
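A minimal sketch of that check (assuming NHOSTS is meant to be the number of lines in the machinefile):

    NHOSTS=$(wc -l < machinefile)   # one machine name per line
    echo "NHOSTS = $NHOSTS"         # verify the value before booting the ring
    mpdboot -n $NHOSTS -f $(pwd)/machinefile --ncpus=2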
Regards,
Adri
Hi Adri,
Sorry for the delay.
Normally we use Torque as a scheduler, but since I am troubleshooting right now, the scheduler is my last concern.
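(For later, once the scheduler comes back into play: under Torque the allocated nodes are published in the file pointed to by $PBS_NODEFILE, one line per allocated core, so a machinefile could be derived with something like:)

    sort -u $PBS_NODEFILE > machinefile   # unique node names only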
The MPICH2 version on our cluster is 1.4.1p1 right now.
The NHOSTS value is correct, and removing "--ncpus=2" has no effect.
I still get the "MPI process number 000 has host unknown and is running on processor master.cluster" line.
Regards,
Dirk
Adri Mourits, modified 7 Years ago.
RE: Example Error in 6.01.00.2755
Hi Dirk,
The calculation itself runs fine in parallel, so it has to do with one of the commands "mpd" or "mpdboot", or with the machinefile.
Maybe the full names of the machines must be used in the machinefile: not "node03" but "node03.mycompany.com".
Does your system administrator have suggestions?
Regards,
Adri
Hi Adri,
I am the admin responsible for our cluster calculations, so no, I have not ;-)
The machinefile is the same one we use with our other MPICH2 programs, so THAT should not be an issue.
But even with the full hostnames it does not work.
Could it be an error in the compilation? I noticed that there is no path to the MPI library in the Makefile.
Regards,
Dirk
Adri Mourits, modified 7 Years ago.
RE: Example Error in 6.01.00.2755
Hi Dirk,
Sometimes there are problems related to MPI; I collected some information in the FAQ.
But since your model does run in parallel, I expect that the compilation was fine. The problem you have is that the processes are not distributed correctly.
I have two suggestions left (a way to check them is sketched below the list):
- The MPI version used during compilation is different from the one used at runtime.
- The MPI version used was compiled with a different compiler than the one used for the Delft3D source code.
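A quick way to check for such a mismatch could look like this (the executable path is hypothetical; adjust it to your installation):

    which mpd mpdboot                  # which MPICH2 tools are on the PATH
    mpich2version                      # version of that MPICH2 installation
    D3D_EXE=/opt/delft3d/bin/deltares_hydro.exe   # hypothetical path, adjust
    ldd $D3D_EXE | grep -i mpi         # MPI libraries the binary actually loads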
Hope that helps.
Regards,
Adri
Hi Adri,
I noticed that the program would not start a process remotely on the nodes.
After some more testing I found an error in my run file AND in my MPICH2 compilation (not the same flags).
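(For anyone running into the same: the point is to build MPICH2 with the same compilers and flags as Delft3D. For MPICH2 1.4.1p1 with the Intel compilers that would be along these lines; the prefix is hypothetical:)

    ./configure --prefix=/opt/mpich2-1.4.1p1 CC=icc CXX=icpc F77=ifort FC=ifort
    make && make install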
Now the program is distributed to my nodes as planned, but the error message still shows in my console output:

    MPI process number 004 has host unknown and is running on processor node01.cluster

I get that message for every process that is started on the nodes.
But since it at last runs in parallel mode on my nodes, I can move on to my other problems (compiling it with Intel 12, distribution by Torque, ...).
Regards,
Dirk
Adri Mourits, modified 7 Years ago.
RE: Example Error in 6.01.00.2755
Hi Dirk,
If you have useful information that could help others avoid running into the same problem, please post it in this forum.
Thanks.
Adri