-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Restoring the TensorAttr.fully_specify
method.
#10012
base: master
Are you sure you want to change the base?
Conversation
@drivanov Could you share the full error message and env details? I somehow cannot reproduce it locally in master. |
Sorry for the breaking change. I was able to reproduce this in master: $ python examples/multi_gpu/ogbn_train_cugraph.py --dataset ogbn-products
/home/aki/.miniconda3/envs/cugraph/lib/python3.10/site-packages/ogb/nodeproppred/dataset_pyg.py:69: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
self.data, self.slices = torch.load(self.processed_paths[0])
Training ogbn-products with GCN model.
Data = Data(num_nodes=2449029, edge_index=[2, 123718152], x=[2449029, 100], y=[2449029])
Let's use 1 GPUs!
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/aki/work/github.com/pyg-team/pytorch_geometric/examples/multi_gpu/ogbn_train_cugraph.py", line 372, in <module>
[rank0]: run_train(0, args, data, world_size, cugraph_id, model, split_idx,
[rank0]: File "/home/aki/work/github.com/pyg-team/pytorch_geometric/examples/multi_gpu/ogbn_train_cugraph.py", line 182, in run_train
[rank0]: feature_store['node', 'x'] = data.x
[rank0]: File "/home/aki/work/github.com/pyg-team/pytorch_geometric/torch_geometric/data/feature_store.py", line 476, in __setitem__
[rank0]: assert key.is_fully_specified()
[rank0]: AssertionError
[rank0]:[W211 12:49:11.477280342 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) A quick workaround is applying this patch to the example script: diff --git a/examples/multi_gpu/ogbn_train_cugraph.py b/examples/multi_gpu/ogbn_train_cugraph.py
index 098ff9be4..a4d969293 100644
--- a/examples/multi_gpu/ogbn_train_cugraph.py
+++ b/examples/multi_gpu/ogbn_train_cugraph.py
@@ -179,8 +179,8 @@ def run_train(rank, args, data, world_size, cugraph_id, model, split_idx,
)] = ixr
feature_store = TensorDictFeatureStore()
- feature_store['node', 'x'] = data.x
- feature_store['node', 'y'] = data.y
+ feature_store['node', 'x', None] = data.x
+ feature_store['node', 'y', None] = data.y
dist.barrier() |
Thank you, @akihironitta! I've just created PR#10017 with your fix. I believe we can close that one. |
This PR implements the fix proposed by @akihironitta in [PR#10012](#10012) to address the issue that arose in `ogbn_train_cugraph` after the merge of [PR#9782](#9782).
I'll fix the issue here as |
@akihironitta checking in if there is anything needed from our end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it's on my todo list, but this pr still needs test and updates to examples.
The
FeatureStore.__setitem__
method was modified in PR#9782 Previously, it usedkey.fully_specify()
:but now it calls
assert key.is_fully_specified()
instead, which consistently triggers an assertion failure inexamples/multi_gpu/ogbn_train_cugraph.py
test.This PR provides a fix for the issue.