Skip to content

Commit 8df9f24

Browse files
authored
Improve pytorch data / torcharrow / torchdata integration (#1463)
Signed-off-by: Tao He <[email protected]>
1 parent 5b248ce commit 8df9f24

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

63 files changed

+1352
-766
lines changed

.github/workflows/build-test.yml

+12-51
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,7 @@ jobs:
5555
submodules: true
5656

5757
- name: Free disk space
58+
if: false
5859
run: |
5960
bash .github/free-disk-space.sh
6061
@@ -325,112 +326,72 @@ jobs:
325326
if: false
326327
uses: mxschmitt/action-tmate@v3
327328

328-
- name: Run tests
329+
- name: Run cpp tests
329330
run: |
330331
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
331-
332332
export VINEYARD_DATA_DIR=`pwd`/gstest
333333
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
334334
335335
rm -rf default.etcd
336336
rm -rf /dev/shm/etcd*
337-
python3 test/runner.py $RUNNER_ARGS --with-cpp --with-deployment
337+
python3 test/runner.py $RUNNER_ARGS --with-cpp
338338
339-
- name: Run tests with mimalloc
340-
if: ${{ steps.changes.outputs.mimalloc == 'true' }}
339+
- name: Run python tests
341340
run: |
342341
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
343-
344342
export VINEYARD_DATA_DIR=`pwd`/gstest
345343
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
346344
347345
rm -rf default.etcd
348346
rm -rf /dev/shm/etcd*
349-
python3 test/runner.py $RUNNER_ARGS --allocator=mimalloc --with-cpp --with-deployment
350-
351-
- name: Setup tmate session
352-
if: false
353-
uses: mxschmitt/action-tmate@v3
347+
python3 test/runner.py $RUNNER_ARGS --with-python
354348
355-
- name: Run python tests
349+
- name: Run deployment tests
356350
run: |
357351
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
358-
359-
# enable coredump for debugging
360-
ulimit -c unlimited
361-
362-
export VINEYARD_DEVELOP=TRUE
363-
364352
export VINEYARD_DATA_DIR=`pwd`/gstest
365353
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
366354
367355
rm -rf default.etcd
368356
rm -rf /dev/shm/etcd*
369-
python3 test/runner.py $RUNNER_ARGS --with-python --with-migration --with-deployment
357+
python3 test/runner.py $RUNNER_ARGS --with-deployment --with-migration
370358
371-
if [[ ! "$RUNNER_ARGS" =~ .*"$redis".* ]]; then
372-
python3 test/runner.py $RUNNER_ARGS --with-contrib
373-
fi
374-
375-
- name: Run python tests with mimalloc
376-
if: ${{ steps.changes.outputs.mimalloc == 'true' }}
359+
- name: Run contrib/thirdparty tests
360+
if: false # lack of heavy dependencies like torch/tensorflow in CI
377361
run: |
378362
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
379-
380-
# enable coredump for debugging
381-
ulimit -c unlimited
382-
383-
export VINEYARD_DEVELOP=TRUE
384-
385363
export VINEYARD_DATA_DIR=`pwd`/gstest
386364
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
387365
388366
rm -rf default.etcd
389367
rm -rf /dev/shm/etcd*
390-
python3 test/runner.py $RUNNER_ARGS --allocator=mimalloc --with-python --with-migration --with-deployment
391-
392-
if [[ ! "$RUNNER_ARGS" =~ .*"$redis".* ]]; then
393-
python3 test/runner.py $RUNNER_ARGS --allocator=mimalloc --with-contrib
394-
fi
395-
396-
- name: Setup tmate session
397-
if: false
398-
uses: mxschmitt/action-tmate@v3
368+
python3 test/runner.py $RUNNER_ARGS --with-contrib
399369
400370
- name: Run I/O tests
401371
run: |
402372
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
403-
404-
export VINEYARD_DEVELOP=TRUE
405-
406373
export VINEYARD_DATA_DIR=`pwd`/gstest
407374
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
408375
409376
rm -rf default.etcd
410377
rm -rf /dev/shm/etcd*
411-
python3 test/runner.py $RUNNER_ARGS --with-io --with-migration
378+
python3 test/runner.py $RUNNER_ARGS --with-io
412379
413380
- name: Run FUSE tests
414381
run: |
415382
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64:/usr/local/lib/x86_64-linux-gnu
416-
417-
export VINEYARD_DEVELOP=TRUE
418383
export VINEYARD_DATA_DIR=`pwd`/gstest
419384
export TMPDIR="${TMPDIR:-$(dirname $(mktemp))}"
420385
421386
rm -rf default.etcd
422387
rm -rf /dev/shm/etcd*
423388
python3 test/runner.py $RUNNER_ARGS --with-fuse
424389
425-
- name: Find vineyard using CMake
390+
- name: Run CMake tests
426391
run: |
427392
cmake -S test/vineyard-cmake-example -B build/vineyard-cmake-example
428393
cmake --build build/vineyard-cmake-example
429394
430-
- name: Setup tmate session
431-
if: false
432-
uses: mxschmitt/action-tmate@v3
433-
434395
- name: Upload Coverage
435396
if: runner.os == 'Linux'
436397
uses: codecov/codecov-action@v3

.pylintrc

+2-1
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,8 @@
33
# A comma-separated list of package or module names from where C extensions may
44
# be loaded. Extensions are loading into the active Python interpreter and may
55
# run arbitrary code.
6-
extension-pkg-whitelist=vineyard
6+
extension-pkg-whitelist=numpy,
7+
vineyard
78

89
# Add files or directories to the blacklist. They should be base names, not
910
# paths.

docs/notes/developers/troubleshooting.rst

-29
Original file line numberDiff line numberDiff line change
@@ -47,32 +47,3 @@ Vineyard Issues on Kubernetes
4747
cgroups. In such cases, users should increase the CPU resources allocated to the etcd pod. For
4848
more information on etcd tuning, please refer to the `Hardware recommendations
4949
<https://etcd.io/docs/v3.4.0/op-guide/hardware/>`_ section in the etcd documentation.
50-
51-
Python SDK Error Scenarios
52-
--------------------------
53-
54-
1. Unexpected Behavior with PyArrow
55-
56-
Vineyard's Python SDK relies on libarrow-dev. When this module is imported
57-
alongside PyArrow, numerous DLL conflict issues may arise, as detailed in https://issues.apache.org/jira/browse/ARROW-10599.
58-
59-
To maintain consistency between Apache Arrow and Vineyard installations:
60-
61-
+ For users, simply install PyArrow and Vineyard using the :code:`pip` package
62-
manager should suffice. You can install them with the following command:
63-
64-
.. code::
65-
66-
pip3 install pyarrow vineyard
67-
68-
+ For Vineyard developers, the locally installed PyArrow should be built
69-
from scratch, using the system-wide libarrow-dev. This can be achieved by:
70-
71-
.. code::
72-
73-
pip3 install --no-binary pyarrow pyarrow
74-
75-
Additionally, you may encounter unexpected crashes or runtime exceptions if you :code:`import`
76-
PyArrow before Vineyard. To avoid this, adjust the import order by :code:`import`ing Vineyard
77-
before PyArrow. If you encounter any shared library-related issues, please set the environment
78-
variable :code:`VINEYARD_DEVELOP=TRUE`.

docs/notes/integration/dask.rst

+2-5
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ In this step, we load the mnist data and duplicate it to simulate the parallel p
2929
.. code:: python
3030
3131
from vineyard.core.builder import builder_context
32-
from vineyard.contrib.dask.dask import register_dask_types
32+
from vineyard.contrib.dask.dask import dask_context
3333
3434
def dask_preprocess(dask_scheduler):
3535
def get_mnist():
@@ -40,10 +40,7 @@ In this step, we load the mnist data and duplicate it to simulate the parallel p
4040
y_train = y_train.astype(np.int64)
4141
return pd.DataFrame({'x': list(x_train), 'y': y_train})
4242
43-
with builder_context() as builder:
44-
# register the builder for dask.dataframe to the builder_context
45-
register_dask_builder(builder, None)
46-
43+
with dask_context():
4744
datasets = [delayed(get_mnist)() for i in range(20)]
4845
dfs = [dd.from_delayed(ds) for ds in datasets]
4946
gdf = dd.concat(dfs)

docs/notes/key-concepts/io-drivers.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ registration examples.
2222
>>> def open(path, *args, **kwargs):
2323
>>> scheme = urlparse(path).scheme
2424
25-
>>> for reader in open.__factory[scheme][::-1]:
25+
>>> for reader in open._factory[scheme][::-1]:
2626
>>> r = reader(path, *args, **kwargs)
2727
>>> if r is not None:
2828
>>> return r

python/vineyard/_C.pyi

+4
Original file line numberDiff line numberDiff line change
@@ -376,6 +376,10 @@ def connect(
376376
*, username: str = "", password: str = ""
377377
) -> Union[IPCClient, RPCClient]: ...
378378
@overload
379+
def connect(
380+
target: None = None, *, username: str = "", password: str = ""
381+
) -> IPCClient: ...
382+
@overload
379383
def connect(
380384
socket: str = "", *, username: str = "", password: str = ""
381385
) -> IPCClient: ...

python/vineyard/__main__.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616
# limitations under the License.
1717
#
1818

19-
from .deploy.__main__ import main
19+
from vineyard.deploy.__main__ import main
2020

2121
if __name__ == '__main__':
2222
main()

python/vineyard/contrib/dask/dask.py

+10-1
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
# limitations under the License.
1717
#
1818

19+
import contextlib
1920
import json
2021

2122
import numpy as np
@@ -27,6 +28,7 @@
2728
from dask.distributed import Client # pylint: disable=no-name-in-module
2829

2930
import vineyard
31+
from vineyard.core import context
3032
from vineyard.data.dataframe import make_global_dataframe
3133
from vineyard.data.tensor import make_global_tensor
3234

@@ -133,11 +135,18 @@ def get_partition(obj_id):
133135
return dd.concat(dfs, axis=0)
134136

135137

136-
def register_dask_types(builder_ctx, resolver_ctx):
138+
def register_dask_types(builder_ctx=None, resolver_ctx=None):
137139
if builder_ctx is not None:
138140
builder_ctx.register(dask.array.Array, dask_array_builder)
139141
builder_ctx.register(dask.dataframe.DataFrame, dask_dataframe_builder)
140142

141143
if resolver_ctx is not None:
142144
resolver_ctx.register('vineyard::GlobalTensor', dask_array_resolver)
143145
resolver_ctx.register('vineyard::GlobalDataFrame', dask_dataframe_resolver)
146+
147+
148+
@contextlib.contextmanager
149+
def dask_context():
150+
with context() as (builder_ctx, resolver_ctx):
151+
register_dask_types(builder_ctx, resolver_ctx)
152+
yield builder_ctx, resolver_ctx

python/vineyard/contrib/dask/tests/test_dask.py

+3-7
Original file line numberDiff line numberDiff line change
@@ -28,20 +28,16 @@
2828
import pytest
2929

3030
import vineyard
31-
from vineyard.contrib.dask.dask import register_dask_types
32-
from vineyard.core.builder import builder_context
33-
from vineyard.core.resolver import resolver_context
31+
from vineyard.contrib.dask.dask import dask_context
3432
from vineyard.data.dataframe import make_global_dataframe
3533
from vineyard.data.tensor import make_global_tensor
3634
from vineyard.deploy.utils import start_program
3735

3836

3937
@pytest.fixture(scope="module", autouse=True)
4038
def vineyard_for_dask():
41-
with builder_context() as builder:
42-
with resolver_context() as resolver:
43-
register_dask_types(builder, resolver)
44-
yield builder, resolver
39+
with dask_context():
40+
yield
4541

4642

4743
@contextlib.contextmanager

python/vineyard/contrib/kedro/io/dataset.py

+2-4
Original file line numberDiff line numberDiff line change
@@ -20,11 +20,9 @@
2020
stored in vineyard.
2121
"""
2222

23-
import functools
24-
import logging
25-
import time
2623
from typing import Any
2724
from typing import Dict
25+
from typing import Optional
2826

2927
from kedro.io.core import AbstractDataSet
3028

@@ -55,7 +53,7 @@ class VineyardDataSet(AbstractDataSet):
5553
5654
"""
5755

58-
def __init__(self, ds_name: str, vineyard_ipc_socket: str = None):
56+
def __init__(self, ds_name: str, vineyard_ipc_socket: Optional[str] = None):
5957
"""Creates a new instance of ``MemoryDataSet`` pointing to the
6058
provided Python object.
6159

python/vineyard/contrib/kedro/plugins/argo.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
from kedro.framework.project import pipelines
2121
from kedro.framework.startup import bootstrap_project
2222

23-
from .cli import vineyard as vineyard_cli
23+
from vineyard.contrib.kedro.plugins.cli import vineyard as vineyard_cli
2424

2525
TEMPLATE_FILE = "argo_spec.tmpl"
2626
TEMPLATE_PATH = Path("templates")

python/vineyard/contrib/kedro/plugins/catalog.py

+2-2
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@
2424
from kedro.framework.project import settings
2525
from kedro.framework.startup import ProjectMetadata
2626

27-
from .cli import vineyard as vineyard_cli
27+
from vineyard.contrib.kedro.plugins.cli import vineyard as vineyard_cli
2828

2929

3030
@vineyard_cli.group()
@@ -147,7 +147,7 @@ def create_catalog(metadata: ProjectMetadata, pipeline_name, env):
147147

148148

149149
def _add_missing_datasets_to_catalog(missing_ds, catalog_path):
150-
from ..io import VineyardDataSet
150+
from vineyard.contrib.kedro.io import VineyardDataSet
151151

152152
if catalog_path.is_file():
153153
catalog_config = yaml.safe_load(catalog_path.read_text()) or {}

python/vineyard/contrib/kedro/plugins/docker.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@
2323
from jinja2 import FileSystemLoader
2424
from kedro.framework.cli.utils import call
2525

26-
from .cli import vineyard as vineyard_cli
26+
from vineyard.contrib.kedro.plugins.cli import vineyard as vineyard_cli
2727

2828
TEMPLATE_PATH = Path("templates")
2929
DOCKER_FILE_TMPL = "Dockerfile.tmpl"

python/vineyard/contrib/kedro/plugins/hook.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@
2323
from kedro.io import DataSetNotFoundError
2424
from kedro.pipeline import Pipeline
2525

26-
from ..io import VineyardDataSet
26+
from vineyard.contrib.kedro.io import VineyardDataSet
2727

2828
logger = logging.getLogger('vineyard')
2929

python/vineyard/contrib/kedro/runner/parallel_runner.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020

2121
from kedro.runner import ParallelRunner as KedroParallelRunner
2222

23-
from ..io import VineyardDataSet
23+
from vineyard.contrib.kedro.io import VineyardDataSet
2424

2525

2626
class ParallelRunner(KedroParallelRunner):

python/vineyard/contrib/kedro/runner/sequential_runner.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818

1919
from kedro.runner import SequentialRunner as KedroSequentialRunner
2020

21-
from ..io import VineyardDataSet
21+
from vineyard.contrib.kedro.io import VineyardDataSet
2222

2323

2424
class SequentialRunner(KedroSequentialRunner):

python/vineyard/contrib/kedro/runner/thread_runner.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818

1919
from kedro.runner import ThreadRunner as KedroThreadRunner
2020

21-
from ..io import VineyardDataSet
21+
from vineyard.contrib.kedro.io import VineyardDataSet
2222

2323

2424
class ThreadRunner(KedroThreadRunner):

0 commit comments

Comments
 (0)