Commit 1eeacc3

Initial packaging effort
0 parents  commit 1eeacc3

File tree

10 files changed

+911
-0
lines changed

.gitignore

+115
@@ -0,0 +1,115 @@
1+
# Dask
2+
dask-worker-space
3+
4+
# Byte-compiled / optimized / DLL files
5+
__pycache__/
6+
*.py[cod]
7+
*$py.class
8+
9+
# C extensions
10+
*.so
11+
12+
# Distribution / packaging
13+
.Python
14+
env/
15+
build/
16+
develop-eggs/
17+
dist/
18+
downloads/
19+
eggs/
20+
.eggs/
21+
lib/
22+
lib64/
23+
parts/
24+
sdist/
25+
var/
26+
wheels/
27+
*.egg-info/
28+
.installed.cfg
29+
*.egg
30+
pip-wheel-metadata/
31+
32+
# PyInstaller
33+
# Usually these files are written by a python script from a template
34+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
35+
*.manifest
36+
*.spec
37+
38+
# Installer logs
39+
pip-log.txt
40+
pip-delete-this-directory.txt
41+
42+
# Unit test / coverage reports
43+
htmlcov/
44+
.tox/
45+
.coverage
46+
.coverage.*
47+
.cache
48+
nosetests.xml
49+
coverage.xml
50+
*.cover
51+
.hypothesis/
52+
.pytest_cache/
53+
54+
# Translations
55+
*.mo
56+
*.pot
57+
58+
# Django stuff:
59+
*.log
60+
local_settings.py
61+
62+
# Flask stuff:
63+
instance/
64+
.webassets-cache
65+
66+
# Scrapy stuff:
67+
.scrapy
68+
69+
# Sphinx documentation
70+
docs/_build/
71+
72+
# PyBuilder
73+
target/
74+
75+
# Jupyter Notebook
76+
.ipynb_checkpoints
77+
78+
# pyenv
79+
.python-version
80+
81+
# celery beat schedule file
82+
celerybeat-schedule
83+
84+
# SageMath parsed files
85+
*.sage.py
86+
87+
# dotenv
88+
.env
89+
90+
# virtualenv
91+
.venv
92+
venv/
93+
ENV/
94+
95+
# Spyder project settings
96+
.spyderproject
97+
.spyproject
98+
99+
# Rope project settings
100+
.ropeproject
101+
102+
# mkdocs documentation
103+
/site
104+
105+
# mypy
106+
.mypy_cache/
107+
108+
# jetbrains ide stuff
109+
*.iml
110+
.idea/
111+
112+
# vscode ide stuff
113+
*.code-workspace
114+
.history
115+
.vscode

LICENSE.txt

+7
@@ -0,0 +1,7 @@
1+
Copyright 2020 Robin Kåveland
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4+
5+
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6+
7+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MANIFEST.in

+2
@@ -0,0 +1,2 @@
1+
include *.txt
2+
include *.md

README.md

+87
@@ -0,0 +1,87 @@
1+
pyarrowfs-adlgen2
2+
==
3+
4+
pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.
5+
6+
It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.
7+
8+
Reading datasets
9+
--
10+
11+
Example usage with a pandas DataFrame:
12+
13+
```python
14+
import azure.identity
15+
import pandas as pd
16+
import pyarrow.fs
17+
import pyarrowfs_adlgen2
18+
19+
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
20+
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
21+
fs = pyarrow.fs.PyFileSystem(handler)
22+
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
23+
```
24+
25+
Example usage with arrow tables:
26+
27+
```python
28+
import azure.identity
29+
import pyarrow.dataset
30+
import pyarrow.fs
31+
import pyarrowfs_adlgen2
32+
33+
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
34+
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
35+
fs = pyarrow.fs.PyFileSystem(handler)
36+
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
37+
table = ds.to_table()
38+
```
39+
40+
Writing datasets
41+
--
42+
43+
As of pyarrow version 1.0.1, `pyarrow.parquet.ParquetWriter` does not support `pyarrow.fs.PyFileSystem`, but data can be written to open files:
44+
45+
```python
46+
with fs.open_output_stream('container/out.parq') as out:
47+
    df.to_parquet(out)
48+
```
49+
50+
Or with arrow tables:
51+
52+
```python
53+
import pyarrow.parquet
54+
55+
with fs.open_output_stream('container/out.parq') as out:
56+
    pyarrow.parquet.write_table(table, out)
57+
```
58+
59+
Accessing only a single container/file-system
60+
--
61+
62+
If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use `pyarrowfs_adlgen2.FilesystemHandler` to view a single file system within an account:
63+
64+
```python
65+
import azure.identity
66+
import pyarrowfs_adlgen2
67+
68+
handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
69+
    "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
70+
```
71+
72+
All access is done through the file system within the storage account.
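
The `AccountHandler` examples in this README address files as `container/dataset.parq`, which suggests that the first path component names the file system (container) and the rest is the path inside it. Assuming that convention holds, the sketch below shows how an account-wide path would map to a path relative to a single file system. `split_account_path` is a hypothetical helper written for illustration; it is not part of pyarrowfs_adlgen2.

```python
from typing import Tuple


def split_account_path(path: str) -> Tuple[str, str]:
    """Split 'container/dir/file' into ('container', 'dir/file').

    Hypothetical helper for illustration only; it assumes the first path
    component is the file system (container) name.
    """
    # Drop leading/trailing slashes, then cut at the first remaining slash.
    container, _, rest = path.strip("/").partition("/")
    if not container:
        raise ValueError("path must start with a container name")
    return container, rest


# An account-wide path, as used in the AccountHandler examples.
container, relative = split_account_path("container/dataset.parq")
print(container)  # container
print(relative)   # dataset.parq
```

Under the same assumption, a handler bound to one file system would be given only the `relative` part.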
73+
74+
Running tests
75+
--
76+
77+
To run the tests, you need:
78+
79+
- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake Gen2 account)
80+
- A configured Azure login (e.g. run `az login`)
81+
- pytest installed, e.g. `pip install pytest`
82+
83+
**NB! All data in the storage account is deleted during testing. USE AN EMPTY ACCOUNT.**
84+
85+
```
86+
AZUREARROWFS_TEST_ACT=thestorageaccount pytest
87+
```
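
The test suite is not part of what is shown here, so how it reads `AZUREARROWFS_TEST_ACT` is an assumption. Because the tests wipe the account, it can be useful to check the variable before launching them. A minimal sketch, where `storage_account_from_env` is a hypothetical helper and not part of the package:

```python
import os
from typing import Mapping, Optional

# Variable name taken from the pytest command above.
ENV_VAR = "AZUREARROWFS_TEST_ACT"


def storage_account_from_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the configured storage account name, or None if unset or blank.

    Hypothetical helper for illustration; the real tests may read the
    variable differently.
    """
    if environ is None:
        environ = os.environ
    name = environ.get(ENV_VAR, "").strip()
    return name or None


if __name__ == "__main__":
    account = storage_account_from_env()
    if account is None:
        print(f"{ENV_VAR} is not set; there is no account to test against")
    else:
        print(f"Tests would run against (and empty) account: {account}")
```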

pyarrowfs_adlgen2/__init__.py

+2
@@ -0,0 +1,2 @@
1+
from .core import FilesystemHandler, AccountHandler
2+
