MPI.jl with cray-mpich on the Cray XC50 is not running #616
Comments
Can you try writing a C reproducer? I am not sure this is a Julia issue. What happens if you run this under |
I got the result below when I ran the MPI job with 'aprun'.
user1@login:/proj/user1/test:> cat hello_world.mpi.jl
#examples/01-hello.jl
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)
user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> cat parallel_job.julia.sh
#!/bin/sh
#PBS -N mpi_test
#PBS -q normal
#PBS -l select=2:ncpus=3:mpiprocs=3:ompthreads=1
#PBS -l walltime=00:20:00
#PBS -j oe
cd $PBS_O_WORKDIR
module unload PrgEnv-intel
module unload PrgEnv-cray
module unload PrgEnv-gnu
module load PrgEnv-gnu
module load cmake
module load julia/1.7.3
aprun -n 6 julia hello_world.mpi.jl
user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> qsub parallel_job.julia.sh
588693.sdb
user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> cat mpi_test.o588693
elogin version login.. loading modules
signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
eval at ./boot.jl:373 [inlined]
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
include_string at ./loading.jl:1196
include_string at ./loading.jl:1196
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
include at ./Base.jl:418
include at ./Base.jl:418
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
_pmiu_daemon(SIGCHLD): [NID 00381] [c1-0c2s15n1] [Tue Jul 12 10:51:52 2022] PE RANK 4 exit signal Segmentation fault
[NID 00381] 2022-07-12 10:51:52 Apid 1303379: initiated application termination
Application 1303379 exit codes: 139
Application 1303379 resources: utime ~1s, stime ~2s, Rss ~179684, inblocks ~1816, outblocks ~0
8<---------------- PBS Pro Epilogue ------------------
Job id : 588693.sdb
User : user1
Group : normal
Jobname : mpi_test
Session : 3349
Resource limits : arch=XT,ncpus=6,place=scatter,walltime=00:20:00
Resources used : cpupercent=25,cput=00:00:03,mem=11124kb,ncpus=6,vmem=138176kb,walltime=00:00:18
Queue : normal
Account : null
Exit Status : 139
Directory : /mnt/lustre/home/user1
ALPS ResId : 705284
Hostname : mom1
user1@login:/proj/user1/test:> |
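The exit status 139 reported by ALPS and the PBS epilogue in the log above is consistent with the "signal (11)" lines, assuming the common shell convention that a process killed by signal N is reported as 128 + N. A minimal sketch of that arithmetic follows; `signal_from_exit_status` is a hypothetical helper written for illustration, not part of Julia or MPI.jl.

```julia
# Hypothetical helper (not part of Julia or MPI.jl): recover the signal number
# from a shell-style exit status, assuming "killed by signal N" is encoded as 128 + N.
function signal_from_exit_status(status::Integer)
    # Statuses up to 128 are treated here as ordinary exit codes, not signals.
    return status > 128 ? status - 128 : nothing
end

status = 139                               # "Exit Status : 139" from the PBS epilogue
sig = signal_from_exit_status(status)      # 139 - 128
println("exit status $status -> signal $sig")
```

Under that assumption the status decodes to signal 11, the segmentation fault that each rank reports at line 3 of the script (`MPI.Init()`).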
I have also experienced the same issue.
My test example is as follows:
using MPI
function do_hello()
comm = MPI.COMM_WORLD
println("Hello world, I am $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)
end
function main()
println("Hello world before MPI.Init\n")
MPI.Init()
do_hello()
MPI.Finalize()
end
main()
All ranks print "Hello world before MPI.Init". |
Since this is a Cray machine you should be compiling with According to their installation instructions https://juliaparallel.org/MPI.jl/v0.10/installation/ what is the value for |
Thanks for your advice. I tried building MPI.jl again using the environment below (with both MPI.jl v0.19.2 and v0.18.0):
The build completed without errors, and I submitted the job with aprun.
Our system still shows the symptom above when the job runs.
|
Hi @w21085 (and @wons6554 ?). (Thanks @shahzebsiddiqui for bringing this issue to my attention.) I think there are several things to unpack here: (1) you're building your own version of Julia; (2) you're building MPI.jl. Nothing in Julia uses MPI, so I don't think the segmentation fault you see should be related to (1). So I will address (2) first -- as I think this should work regardless of how you build Julia. This is how we build MPI.jl at NERSC:
using Pkg
# We use a shared global environment -- you might be doing something similar, but this next line is not necessary for single-user repos.
# Pkg.activate("globalenv", shared=true)
ENV["JULIA_MPI_BINARY"] = "system"
ENV["JULIA_MPI_PATH"] = ENV["CRAY_MPICH_DIR"]
ENV["JULIA_MPIEXEC"] = "srun"
Pkg.add("MPI")
Pkg.build("MPI"; verbose=true)
That's it. No secret sauce. No specifying the compiler directly, or pointing to libmpich. Please give this a try (using
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
name = gethostname()
println("Hello world, I am rank $(rank) of $(size) on $(name)")
MPI.Barrier(comm)
MPI.Finalize()
The fact that your test fails without
RE point (1) -- building Julia from source: The use of the compiler wrappers is not necessary to build MPI.jl. I have built Julia from source (using the Makefile and also Spack) with varying degrees of success. There are situations where one would need the compiler wrappers (e.g. to pick up MKL). I think what @shahzebsiddiqui meant was specifying
# No BinaryBuilder for dependencies
USE_BINARYBUILDER = 0
# Cannot use cray `ftn` to build libblas and liblapack because:
# ftn-78 crayftn: ERROR in command line
# The -f option has an invalid argument, "default-integer-8".
# from OpenBLAS => use system BLAS and LAPACK (which makes sense anyway)
USE_SYSTEM_BLAS = 1
LIBBLAS = -lopenblas
LIBBLASNAME = libopenblas
USE_SYSTEM_LAPACK = 1
LIBLAPACK = -llapack
LIBLAPACKNAME = liblapack
# LLVM doesn't compile with `cc` => use the system one (I think it's the
# right thing to do anyway)
USE_SYSTEM_LLVM = 1
# Force the use of the Cray compiler wrappers
override FC := $(shell which ftn)
override CXX := $(shell which CC)
override CC := $(shell which cc)
# Some things still require the GCC libraries => patching those back into the
# compiler config
LDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++
CXXLDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++
# Rig PKG_CONFIG -- the Make.inc overwrites this, so we're making sure that the
# system configurations are in there
SYS_PKG_CONFIG_PATH := $(PKG_CONFIG_PATH):/opt/cray/xpmem/2.2.40-7.0.1.0_2.3__g1d7a24d.shasta/lib64/pkgconfig/
SYS_PKG_CONFIG_LIBDIR := $(PKG_CONFIG_LIBDIR)
override PKG_CONFIG_PATH = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_PATH)
override PKG_CONFIG_LIBDIR = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_LIBDIR)
Please note that this is full of jiggery-pokery to get something to work on an unstable platform. The specifics are tailored to the state of Perlmutter from 6 months ago. Use with caution. Also looking forward to what @vchuravy has to say on these topics. |
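The MPI.jl build recipe from the NERSC comment above depends on `CRAY_MPICH_DIR` being set in the environment. A minimal sketch of a pre-flight check, assuming the same three `JULIA_MPI*` variables as that recipe: `cray_mpi_build_env` is a hypothetical helper (not part of MPI.jl), the path in the example is a placeholder, and using `aprun` as the launcher on the XC50 is an assumption based on the job script earlier in the thread rather than a confirmed fix.

```julia
# Hypothetical helper, not part of MPI.jl: assemble the settings used in the
# NERSC recipe, failing early if the Cray MPICH location is not in the environment.
function cray_mpi_build_env(env = ENV; launcher::AbstractString = "srun")
    haskey(env, "CRAY_MPICH_DIR") ||
        error("CRAY_MPICH_DIR is not set; load the cray-mpich module before building MPI.jl")
    return Dict(
        "JULIA_MPI_BINARY" => "system",                # use the system MPI, as in the recipe
        "JULIA_MPI_PATH"   => env["CRAY_MPICH_DIR"],   # where the system MPI lives
        "JULIA_MPIEXEC"    => String(launcher),        # "srun" at NERSC; "aprun" is an assumption for ALPS
    )
end

# Example with a stand-in environment; the path is a placeholder, not a real install.
settings = cray_mpi_build_env(Dict("CRAY_MPICH_DIR" => "/path/to/cray-mpich"); launcher = "aprun")
for key in sort(collect(keys(settings)))
    println(key, " = ", settings[key])
end
```

Calling `cray_mpi_build_env()` with no arguments reads the real `ENV`; the returned pairs could then be copied into `ENV` before running `Pkg.build("MPI"; verbose=true)`, so that a missing module shows up as a clear error at build time instead of a segmentation fault at `MPI.Init()`.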
Hi JBlaschke, thank you so much for your advice. However, I am now running into a problem when building Julia, as shown below.
$ cat Make.user
USE_BINARYBUILDER = 0
USE_SYSTEM_LLVM = 1
override FC := ftn
LDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++
SYS_PKG_CONFIG_PATH := $(PKG_CONFIG_PATH):/opt/cray/xpmem/2.2.15-6.0.7.1_5.7__g7549d06.ari/lib64/pkgconfig/
SYS_PKG_CONFIG_LIBDIR := $(CRAY_LD_LIBRARY_PATH)
override PKG_CONFIG_PATH =
$ make -C deps -j 20
--snip--snip-- |
The admin installed Julia as a module in the system directory of the Cray XC50.
I have since been trying to set up MPI.jl under my account, but MPI.jl gives me the errors as follows.
Could I get some clue about the problem?