Summary of recent UCX TPCxBB tests and intermittent failures #670
Comments
It's worth noting that one needs to use a different set of patches on 1.9.0 as opposed to 1.8.0. They are more similar to the patches we would use on 1.10.0. More details in this discussion ( https://github.com/rapidsai/ucx-split-feedstock/pull/50#issuecomment-763067547 ). Mentioning in case this hasn't already come up elsewhere. |
Thanks @jakirkham, I think we have the right patch:
|
The CUDA alloc patch yes. The IB patch shouldn’t be there though (not sure if I’m understanding the quoted text correctly). |
It's a dirty remnant from the dockerfile -- the patch is being written to a name /tmp/ib_reg... |
The patch quoted in #670 (comment) is the correct one for 1.9, there's only one patch needed. IIRC, the old patches from 1.8 will not apply to 1.9. |
This is the UCX build we're using (for the third setup with UCX 1.9), extracted from a Dockerfile.
Note that we removed |
Ok thanks for clarifying Nick! Yeah that looks right. |
For context on |
My only suggestion at this point would be to double check that we are linking against the libraries in the OFED version we expect as opposed to a different version (or non-OFED libraries). This is one place where we have seen issues crop up before (though not the only place). |
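One way to do that check is to inspect what the dynamic loader resolves for the UCX shared objects. Below is a minimal sketch that parses `ldd` output and keeps only the RDMA-related entries. The helper names (`parse_ldd`, `rdma_libs`) and the prefix list are assumptions made for illustration and are not part of UCX or OFED; adjust the prefixes to whatever the OFED install on the cluster actually ships.

```python
import subprocess
import sys


def parse_ldd(output):
    """Map library name -> resolved path (or None) from `ldd` output text."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # skip entries such as linux-vdso or the loader itself
        name, _, rest = line.partition("=>")
        parts = rest.split()
        # "not found" entries have no absolute path after the arrow
        path = parts[0] if parts and parts[0].startswith("/") else None
        libs[name.strip()] = path
    return libs


def rdma_libs(output, prefixes=("libibverbs", "librdmacm", "libibcm", "libmlx")):
    """Keep only libraries whose name starts with an RDMA-related prefix.

    The default prefixes are an assumption, not an exhaustive OFED list.
    """
    return {n: p for n, p in parse_ldd(output).items() if n.startswith(prefixes)}


if __name__ == "__main__":
    # Usage: python check_links.py <path to a UCX shared object>
    result = subprocess.run(["ldd", sys.argv[1]], capture_output=True, text=True)
    for name, path in sorted(rdma_libs(result.stdout).items()):
        print(f"{name} -> {path}")
```

If the printed paths point somewhere other than the expected OFED install, or show `None` (not found), that is worth chasing before looking at UCX itself.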
When using the second setup listed above, query 3 consistently fails with UCX issues using the latest dask/distributed (note: the latest dask/distributed resolved issues around BlockwiseIO).
|
Any chance we could run the same query using RAPIDS 0.17? The error from #670 (comment) is consistent with what I remember from the many issues we had last year with IB, and I hadn't seen it since. |
With both UCX 1.8 / 1.9 and RAPIDS 0.17 (probably other configurations above) we are seeing the following error in the scheduler
We think this is due to the scheduler opening too many file descriptors and maybe some ulimit settings? We are still investigating... |
Correction: We know this is due to the scheduler opening too many file descriptors. |
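For anyone wanting to check this on their own scheduler node, here is a rough sketch using Python's standard `resource` module to compare the number of open descriptors against the `RLIMIT_NOFILE` limits. The function names are made up for illustration, the `/proc/self/fd` count is Linux-only and approximate, and this only inspects the process it runs in, so it would need to be run inside the scheduler process (or adapted) to be meaningful.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc/self/fd is unavailable (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    # Listing the directory briefly opens one extra descriptor, so the
    # count is approximate.
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    return open_fds, soft, hard


def raise_soft_limit(target):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # nothing to raise
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

Raising the soft limit only helps up to the hard limit; beyond that, the limit has to be changed outside the process (for example via `ulimit -n` in the launching shell or the cluster's limits configuration).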
Update: We've been testing UCX 1.10 with RDMACM but we are running into other issues (also observed with ucp_client_server):
We are working with UCX devs to diagnose and resolve. |
@beckernick it would be good to have GPU-BDB tested again with UCX 1.11, we believe that issues here have been resolved. |
We've been doing some UCX TPCxBB tests on a Slurm cluster. Across multiple configurations, we've run into intermittent and as-yet-unexplained failures using UCX and InfiniBand. We have been using UCX 1.8 and 1.9 rather than UCX 1.10 due to the already discussed issues (see #668 and associated issues/PRs). This issue summarizes several of the configurations we've recently tested and with which we've seen failures.
The setup includes a manual QA check of the mappings between GPUs, MLNX NICs, and NIC interfaces. The specific failures are being triaged and may result in their own issues with more details, which can be crosslinked for tracking.
Initial Setup
With this setup, we are able to run a few queries successfully. However, we experienced intermittent segfaults that were not consistently reproducible.
We also saw the following warning related to libibcm, which we are triaging but which may resolve itself with Ubuntu 20.04. Others (including @pentschev) have suggested that we may simply no longer need libibcm.
Second Setup
The only change in this setup was to use OpenUCX 1.9. With this setup, we were also able to run a few queries successfully. However, we again experienced intermittent failures. Failing queries included both large and small queries, suggesting that the failures were not driven by out-of-memory errors but by something else.
Third Setup (pending, may succeed -- will edit as appropriate)
After additional discussions, we upgraded from Ubuntu 18.04 to Ubuntu 20.04. In this test, we also removed --with-cm from the UCX build process. We now consistently see compute occurring, followed shortly afterwards by a hang.
@quasiben please feel free to edit/correct me if I've misstated anything.