Replies: 1 comment
Please see new discussion "Add support for sparse and enhance support for variable-length data in HDF5" #3257
Is your feature request related to a problem? Please describe.
Variable-length strings take a lot of space on disk.
Previous discussion:
vlen strings lead to larger files
Describe the solution you'd like
I would like these strings to take up less space.
My understanding is that variable-length strings carry a large 32-byte overhead per string, and that this overhead does not seem to get compressed. This seems to check out with the sizes of the files generated.
My naive assumption is that wherever this overhead is stored, it could also be compressed.
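To make the concern concrete, here is a rough back-of-the-envelope sketch in plain Python. It takes the 32-byte per-string figure above at face value (it is my understanding, not a confirmed number) and ignores chunking, compression, and any other container overhead. The function names and the sample data are made up for illustration.

```python
def vlen_size_estimate(strings, per_string_overhead=32):
    """Return (payload_bytes, overhead_bytes) for variable-length storage.

    per_string_overhead is an ASSUMED figure taken from the post,
    not a value read from the HDF5 library.
    """
    payload = sum(len(s.encode("utf-8")) for s in strings)
    overhead = per_string_overhead * len(strings)
    return payload, overhead


def fixed_size_estimate(strings):
    """Bytes needed if every string is padded to the longest one."""
    width = max((len(s.encode("utf-8")) for s in strings), default=0)
    return width * len(strings)


# Hypothetical data: 1000 short strings.
strings = ["AAACCTG", "TTGCA", "GATTACA", "CG"] * 250

payload, overhead = vlen_size_estimate(strings)
fixed = fixed_size_estimate(strings)

print(f"vlen payload bytes:  {payload}")
print(f"vlen overhead bytes: {overhead}")
print(f"fixed-length bytes:  {fixed}")
```

Under these assumptions the overhead is several times larger than the string data itself for short strings, which is why compressing only the payload would not help much.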
Describe alternatives you've considered
Using fixed length instead:
Different storage format
In the issue for h5py, it was suggested that we use a different storage format. HDF5 is particularly useful to us because so many languages can read it. We do also use Zarr, but it would be nice to limit the downsides of HDF5.
Some other change to HDF5's variable-length encoding
Some other solution (mentioned more below) could also be good. I would imagine that adding compression to the current format would be an easier lift, though.
Additional context
The HDF5 docs do have an RFC about inefficiencies of variable-length data types, but there does not seem to have been any progress towards implementation.