Open MPI Error Message - MPI Process Communication Problem

[user-name@node2 ~]$ mpirun -np 48 --hostfile host_Name toy-prg
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[63363,1],40]) is on host: node2.host.com
  Process 2 ([[63363,1],0]) is on host: node1
  BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[node2.host.com:7601] *** An error occurred in MPI_Init
[node2.host.com:7601] *** on a NULL communicator
[node2.host.com:7601] *** Unknown error
[node2.host.com:7601] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: node2.host.com
  PID:        7601
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 40 with PID 7601 on
node node2.host.com exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node2.host.com:07599] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[node2.host.com:07599] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node2.host.com:07599] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[node2.host.com:07599] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[node2.host.com:07599] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed

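Before applying the fix below, the suggestions printed in the log above can be run on each node to confirm which BTL components are available and how they are being selected. A minimal diagnostic sketch, assuming the Open MPI binaries are on the PATH and reusing the hostfile (host_Name) and program (toy-prg) from the run above:

    # list the BTL components this Open MPI build can load
    ompi_info | grep btl

    # check that every node in the hostfile launches the same Open MPI runtime
    mpirun --hostfile host_Name -np 2 which orted

    # re-run with verbose BTL selection to see which plugins are considered
    # or discarded on each host
    mpirun --hostfile host_Name -np 48 --mca btl_base_verbose 100 toy-prg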
SOLUTION:

The Open MPI library path was not exported properly: the library paths exported on the master were not exported on node1 and node2.
In addition, the exported Open MPI library architecture was not uniform between the master and the nodes. This is the root cause of the problem.
Once the MPI library path is configured correctly on the master, node1 and node2, the problem will be resolved.
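For example (a sketch only, using /usr/anuj/openmpi-1.5.4 as a placeholder for the actual install prefix), the same two export lines should appear in ~/.bashrc or ~/.bash_profile on the master, node1 and node2, and the result can be spot-checked from the master:

    # put the same two lines on the master, node1 and node2
    # (/usr/anuj/openmpi-1.5.4 is a placeholder; point it at the one Open MPI
    #  build, of the same version and architecture, installed on every machine)
    export PATH=/usr/anuj/openmpi-1.5.4/bin:$PATH
    export LD_LIBRARY_PATH=/usr/anuj/openmpi-1.5.4/lib:$LD_LIBRARY_PATH

    # spot-check from the master that each node resolves the same binaries
    ssh node1 which mpirun
    ssh node2 which mpirun

Note that the exports must also take effect for non-interactive SSH shells (which is how mpirun launches its remote daemons), so in ~/.bashrc they should generally be placed before any early return for non-interactive sessions.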

4 comments:

  1. Do you mean master, node1 and node2 should have the same .bash_profile/.bashrc?

  2. With the same entries on all three nodes:

    export PATH=/usr/anuj/openmpi-1.5.4/bin:$PATH

    export LD_LIBRARY_PATH=/usr/anuj/openmpi-1.5.4/lib:$LD_LIBRARY_PATH

    What else should go in .bash_profile/.bashrc?

  3. Is there someone who has solved this problem?
    When I execute MPI using
    mpiexec -n 2 --host a,b --mca btl ^openib --mca btl openib,sm,self ./a.out

    I receive the error posted by HPC HCL.

    I tested the execution using
    mpiexec -n 2 --host a,b --mca btl ^openib --mca btl tcp,self ./a.out

    But no flag works for me.

    Please help!

    Thanks,
    Hamid Saeed

  4. https://stackoverflow.com/questions/36156822/error-when-starting-open-mpi-in-mpi-init-via-python
