Training Error about mpi #125

Open
Zhixi-Li opened this issue Jan 24, 2024 · 0 comments


Zhixi-Li commented Jan 24, 2024

System: Ubuntu 18.04.6
I followed these instructions to install Open MPI:

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.1.tar.gz
tar -zxvf openmpi-5.0.1.tar.gz
cd openmpi-5.0.1
./configure --prefix=$HOME/openmpi CC=gcc CXX=g++ --disable-mpi-fortran --disable-mca-dso
make
make install

Then I added two lines at the bottom of ~/.bashrc to update the environment variables:

export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH

Then I installed mpi4py:

conda install mpi4py

When I run the train.py script, it fails with this error:

ImportError: libmpi.so.12: cannot open shared object file: No such file or directory
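The missing file is specifically `libmpi.so.12`, so one thing worth checking is which `libmpi` versions actually exist in the directories on `LD_LIBRARY_PATH`. Below is a rough sketch for that check. `find_libmpi` is a hypothetical helper name, not part of mpi4py or Open MPI, and the script only lists files; it does not fix anything.

```python
import os
import re
from pathlib import Path

# Matches libmpi.so, libmpi.so.12, libmpi.so.40.40.1, and similar names.
_LIBMPI = re.compile(r"^libmpi\.so(\.\d+)*$")


def find_libmpi(search_path=None):
    """Return sorted paths of libmpi shared objects in the given directories.

    search_path is a colon-separated directory list; it defaults to the
    current LD_LIBRARY_PATH (hypothetical helper, for diagnosis only).
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        directory = Path(entry)
        # Skip empty entries and directories that do not exist.
        if not entry or not directory.is_dir():
            continue
        for candidate in directory.iterdir():
            if _LIBMPI.match(candidate.name):
                hits.append(str(candidate))
    return sorted(hits)


if __name__ == "__main__":
    for path in find_libmpi():
        print(path)
```

If the list contains a `libmpi.so` with a different version number but no `libmpi.so.12`, the conda mpi4py package was probably built against a different MPI library than the one compiled from source above. That is only a guess and has not been verified on this machine.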

Following a suggestion I found online, I then installed openmpi through conda:

conda install openmpi

Now the error is:

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

 plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[718c7e141fd5:75498] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(327)
[718c7e141fd5:75498] *** Process received signal ***
[718c7e141fd5:75498] Signal: Segmentation fault (11)
[718c7e141fd5:75498] Signal code: Address not mapped (1)
[718c7e141fd5:75498] Failing at address: (nil)
[718c7e141fd5:75498] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f82eb125980]
[718c7e141fd5:75498] *** End of error message ***
[718c7e141fd5:75416] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[718c7e141fd5:75416] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 orte_ess_init failed
 --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 ompi_mpi_init: ompi_rte_init failed
 --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[718c7e141fd5:75416] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
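The first message in the log says `plm_rsh_agent: ssh : rsh` points to a path that could not be found, which suggests that neither `ssh` nor `rsh` is on `PATH` in this environment. A minimal sketch to confirm that is below. `missing_launch_agents` is a hypothetical helper name; `OMPI_MCA_plm_rsh_agent` follows Open MPI's `OMPI_MCA_<parameter>` environment-variable convention for MCA parameters, and whether changing it resolves this particular crash is untested here.

```python
import os
import shutil


def missing_launch_agents(agents=("ssh", "rsh"), path=None):
    """Return the agents that cannot be found as executables on PATH.

    path=None means the current PATH is searched (hypothetical helper).
    """
    return [a for a in agents if shutil.which(a, path=path) is None]


if __name__ == "__main__":
    missing = missing_launch_agents()
    print("not on PATH:", missing or "none")
    # Shows whether the MCA parameter is already overridden in the environment.
    print("OMPI_MCA_plm_rsh_agent =", os.environ.get("OMPI_MCA_plm_rsh_agent"))
```

If both agents are reported missing, installing an SSH client or unsetting the parameter, as the error text itself suggests, would be the next things to try.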

I have tried many approaches since then but still could not solve the problem.
Could anybody help me? Thanks a lot!
