refactor: xr.Dataset as primary data structure #62
Comments
Related to #39. This is a good idea and not hard to implement, but it is a large API break.
Any thoughts on how we should structure the fundamental dataset? Going off the xarray example, they structure their data like this: [image not preserved] For us, an abstract example could be something like this: [image not preserved] Do we name each data_vars entry by its scan name? [image not preserved] And does moving to a dataset inherently solve #39, or do we need to be careful in how we structure the dataset to avoid those same issues?
I would make scan_id and related things like edge, polarization, temperature, etc. coordinates of the dataset. The data variables would then be a standard set of terms like scattering intensity, intensity uncertainty, incident intensity i0, transmitted intensity it, sample drain current, and possibly (where supported) instrument-specific terms or secondary measurements. The major change is that the primary structure goes from a single scan or single experiment (thermal anneal, shear series) to a whole experiment as a series of samples. This is sort of like loadSeries in SST1RSoXSDB. The major challenge here will be performance, I believe. Dask might help offset that somewhat.
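A minimal sketch of the layout described above, to make the proposal concrete. The variable names (`intensity`, `intensity_uncertainty`, `i0`), the pixel dimensions, and all values are placeholders chosen for illustration, not an agreed schema:

```python
import numpy as np
import xarray as xr

n_scans, n_y, n_x = 3, 4, 5
rng = np.random.default_rng(0)

# Hypothetical layout: scan_id is the only "experiment" dimension, and
# per-scan metadata (edge, polarization, temperature) rides along as
# non-index coordinates on that dimension.
ds = xr.Dataset(
    data_vars={
        "intensity": (("scan_id", "pix_y", "pix_x"), rng.random((n_scans, n_y, n_x))),
        "intensity_uncertainty": (("scan_id", "pix_y", "pix_x"), rng.random((n_scans, n_y, n_x))),
        "i0": (("scan_id",), np.ones(n_scans)),
    },
    coords={
        "scan_id": [1001, 1002, 1003],  # made-up scan numbers
        "edge": ("scan_id", ["C", "C", "N"]),
        "polarization": ("scan_id", [0.0, 90.0, 0.0]),
        "temperature": ("scan_id", [25.0, 50.0, 25.0]),
    },
)

# Select every scan taken at the carbon edge by masking on the metadata.
carbon = ds.isel(scan_id=np.flatnonzero(ds["edge"].values == "C"))
```

With this shape, a series of samples is one object, and selecting by any metadata field is a boolean mask along `scan_id` rather than a MultiIndex lookup.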
One bit of context that @martintb understands better than I do: I think something throwaway like scan_id can be the dimension, and variables like edge/temperature/etc. can be promoted to or demoted from being coordinates at will, in a fairly performant way (compared with MultiIndexes).
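As a rough illustration of that promote/demote idea, here is a sketch using xarray's `set_coords`, `reset_coords`, and `swap_dims`. The tiny dataset and its values are invented for the example:

```python
import xarray as xr

# Toy dataset: temperature starts life as an ordinary data variable.
ds = xr.Dataset(
    {
        "intensity": ("scan_id", [1.0, 2.0, 3.0]),
        "temperature": ("scan_id", [25.0, 50.0, 75.0]),
    },
    coords={"scan_id": [1, 2, 3]},
)

# Promote temperature to a (non-index) coordinate ...
promoted = ds.set_coords("temperature")

# ... and demote it back to a data variable.
demoted = promoted.reset_coords("temperature")

# Make temperature the indexing dimension so label-based selection works.
by_temp = promoted.swap_dims({"scan_id": "temperature"})
hot = by_temp.sel(temperature=50.0)
```

None of these calls reshape the underlying data arrays, which is the contrast with stacking and unstacking a MultiIndex.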
#52 is an example of something like this, where data can be labeled with pix_x and pix_y and with q_x and q_y at the same time. In that example we pop the data back out of being a Dataset after that coordinate swap, but if the data were already a Dataset we wouldn't have to.
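A sketch of what carrying both labelings at once could look like. This is not the code from #52; the linear pixel-to-q mapping below is made up purely so the example runs:

```python
import numpy as np
import xarray as xr

pix_x = np.arange(4)
pix_y = np.arange(3)

# Hypothetical pixel -> q mapping (a real one would come from the geometry).
q_x = 0.01 * (pix_x - 1)
q_y = 0.01 * (pix_y - 1)

ds = xr.Dataset(
    {"intensity": (("pix_y", "pix_x"), np.arange(12.0).reshape(3, 4))},
    coords={
        "pix_x": pix_x,
        "pix_y": pix_y,
        "q_x": ("pix_x", q_x),  # second label on the same axis
        "q_y": ("pix_y", q_y),
    },
)

# Switch the indexing coordinates from pixels to q without copying the data
# out of the Dataset; pix_x and pix_y stay attached as plain coordinates.
by_q = ds.swap_dims({"pix_x": "q_x", "pix_y": "q_y"})
center = by_q["intensity"].sel(q_x=0.0, q_y=0.0, method="nearest")
```

Swapping back is the same call with the mapping reversed, so neither labeling is ever lost.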
Many of the pain points of working with multi-indices can be mitigated by reworking pyhyper to use xr.Datasets rather than xr.DataArrays. One example would be avoiding memory-intensive unstack() commands.
Datasets would also allow easy switching between different coordinates and storage of non-coordinate data associated with the experiment.
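To illustrate the unstack() cost, here is a small sketch with invented energies and polarizations: unstacking a sparse MultiIndex pads out the full grid, while a flat `scan_id` dimension with metadata coordinates keeps only the scans that exist.

```python
import numpy as np
import xarray as xr

# Four scans that do not fill the full energy x polarization grid.
energies = [270.0, 285.0, 285.0, 300.0]
pols = [0.0, 0.0, 90.0, 90.0]
values = np.arange(4.0)

# MultiIndex route: stack energy and polarization into one index ...
da = xr.DataArray(
    values,
    dims="system",
    coords={"energy": ("system", energies), "polarization": ("system", pols)},
)
stacked = da.set_index(system=["energy", "polarization"])

# ... and unstacking allocates every combination, filling gaps with NaN.
dense = stacked.unstack("system")

# Dataset route: keep a flat scan_id dimension with metadata as coordinates.
flat = xr.Dataset(
    {"intensity": ("scan_id", values)},
    coords={
        "scan_id": [1, 2, 3, 4],
        "energy": ("scan_id", energies),
        "polarization": ("scan_id", pols),
    },
)
vertical = flat.isel(scan_id=np.flatnonzero(flat["polarization"].values == 90.0))
```

Here the dense array has six cells for four measurements; with real detector images along each scan, that padding is multiplied by the image size.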
More details and examples to follow.