[Bug] Restoring the `TensorAttr.fully_specify` method. #10012

drivanov · 2025-02-10T21:17:03Z

The `FeatureStore.__setitem__` method was modified in PR#9782. Previously, it called `key.fully_specify()`:

            def __setitem__(self, key: TensorAttr, value: FeatureTensorType):
                key = self._tensor_attr_cls.cast(key)
                # We need to fully-specify the key for __setitem__ as it does not make
                # sense to work with a view here:
                key.fully_specify()
                self.put_tensor(value, key)

but now it asserts `key.is_fully_specified()` instead, which consistently triggers an assertion failure when running the `examples/multi_gpu/ogbn_train_cugraph.py` example.

This PR provides a fix for the issue.
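To illustrate the difference between the two calls, here is a minimal sketch. `SketchTensorAttr` and the `UNSET` sentinel are hypothetical stand-ins written for this example, not the actual `torch_geometric` code; the sketch only assumes that `fully_specify()` fills unset fields with `None`, while `is_fully_specified()` merely reports whether every field has been set.

```python
from dataclasses import dataclass, fields
from typing import Any

# Hypothetical sentinel meaning "this field was never provided".
UNSET = object()


@dataclass
class SketchTensorAttr:
    """Simplified stand-in for a tensor attribute key (assumption)."""
    group_name: Any = UNSET
    attr_name: Any = UNSET
    index: Any = UNSET

    def is_fully_specified(self) -> bool:
        # Read-only check: True only if no field is still UNSET.
        return all(getattr(self, f.name) is not UNSET for f in fields(self))

    def fully_specify(self) -> "SketchTensorAttr":
        # Mutating helper: replace every UNSET field with None.
        for f in fields(self):
            if getattr(self, f.name) is UNSET:
                setattr(self, f.name, None)
        return self


key = SketchTensorAttr("node", "x")  # index is left unset
before = key.is_fully_specified()    # an assert on this value would fail
key.fully_specify()                  # the old behavior: fill in the gaps
after = key.is_fully_specified()
```

Under these assumptions, a two-element key fails the check until `fully_specify()` is called, which matches the assertion failure described above.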

akihironitta · 2025-02-11T12:32:30Z

@drivanov Could you share the full error message and env details? I somehow cannot reproduce it locally in master.

akihironitta · 2025-02-11T12:51:54Z

Sorry for the breaking change. I was able to reproduce this in master:

$ python examples/multi_gpu/ogbn_train_cugraph.py --dataset ogbn-products
/home/aki/.miniconda3/envs/cugraph/lib/python3.10/site-packages/ogb/nodeproppred/dataset_pyg.py:69: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  self.data, self.slices = torch.load(self.processed_paths[0])
Training ogbn-products with GCN model.
Data = Data(num_nodes=2449029, edge_index=[2, 123718152], x=[2449029, 100], y=[2449029])
Let's use 1 GPUs!
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/aki/work/github.com/pyg-team/pytorch_geometric/examples/multi_gpu/ogbn_train_cugraph.py", line 372, in <module>
[rank0]:     run_train(0, args, data, world_size, cugraph_id, model, split_idx,
[rank0]:   File "/home/aki/work/github.com/pyg-team/pytorch_geometric/examples/multi_gpu/ogbn_train_cugraph.py", line 182, in run_train
[rank0]:     feature_store['node', 'x'] = data.x
[rank0]:   File "/home/aki/work/github.com/pyg-team/pytorch_geometric/torch_geometric/data/feature_store.py", line 476, in __setitem__
[rank0]:     assert key.is_fully_specified()
[rank0]: AssertionError
[rank0]:[W211 12:49:11.477280342 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

A quick workaround is applying this patch to the example script:

diff --git a/examples/multi_gpu/ogbn_train_cugraph.py b/examples/multi_gpu/ogbn_train_cugraph.py
index 098ff9be4..a4d969293 100644
--- a/examples/multi_gpu/ogbn_train_cugraph.py
+++ b/examples/multi_gpu/ogbn_train_cugraph.py
@@ -179,8 +179,8 @@ def run_train(rank, args, data, world_size, cugraph_id, model, split_idx,
     )] = ixr

     feature_store = TensorDictFeatureStore()
-    feature_store['node', 'x'] = data.x
-    feature_store['node', 'y'] = data.y
+    feature_store['node', 'x', None] = data.x
+    feature_store['node', 'y', None] = data.y

     dist.barrier()
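The patch presumably works because a tuple key is cast to the attribute class positionally, so a third element sets `index` explicitly. The sketch below shows that idea with a hypothetical `SketchTensorAttr` class and `UNSET` sentinel; it is an illustration under that assumption, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sentinel meaning "this field was never provided".
UNSET = object()


@dataclass
class SketchTensorAttr:
    """Simplified stand-in for a tensor attribute key (assumption)."""
    group_name: Any = UNSET
    attr_name: Any = UNSET
    index: Any = UNSET

    @classmethod
    def cast(cls, key: Any) -> "SketchTensorAttr":
        # Tuples are unpacked positionally into the fields.
        if isinstance(key, cls):
            return key
        if isinstance(key, tuple):
            return cls(*key)
        return cls(key)


short_key = SketchTensorAttr.cast(("node", "x"))        # index stays UNSET
long_key = SketchTensorAttr.cast(("node", "x", None))   # index is set to None
```

Here `None` counts as a value that was provided, so only the three-element key would pass a check that every field is set.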

drivanov · 2025-02-11T20:55:30Z

Thank you, @akihironitta! I've just created PR#10017 with your fix. I believe we can close this one (#10012).

@akihironitta

This PR implements the fix proposed by @akihironitta in [PR#10012](#10012) to address the issue that arose in `ogbn_train_cugraph` after the merge of [PR#9782](#9782).

akihironitta · 2025-02-12T11:13:28Z

I'll fix the issue here as feature_store[..., ..., None] = data.x doesn't look super user-friendly! Sorry again for the breaking change!

puririshi98 · 2025-02-20T22:01:39Z

@akihironitta checking in if there is anything needed from our end

akihironitta

Sorry, it's on my todo list, but this PR still needs tests and updates to the examples.

Restoring the TensorAttr.fully_specify method.

c72d116

drivanov requested a review from mananshah99 as a code owner February 10, 2025 21:17

akihironitta self-assigned this Feb 11, 2025

akihironitta added bug 0 - Priority P0 data labels Feb 11, 2025

drivanov mentioned this pull request Feb 11, 2025

[Bug] Fixing bug in ogbn_train_cugraph example. #10017

Merged

drivanov closed this Feb 11, 2025

akihironitta reopened this Feb 12, 2025

drivanov and others added 3 commits February 12, 2025 08:33

Merge branch 'master' into fully_specify

4ea0a7d

Merge branch 'master' into fully_specify

2d0d4a5

Merge branch 'master' into fully_specify

20e1ad9

akihironitta reviewed Feb 21, 2025
