refactor: xr.Dataset as primary data structure #62

martintb · 2023-01-21T00:49:59Z

Many of the pain points of working with multi-indices can be mitigated by reworking pyhyper to use xr.Datasets rather than xr.DataArrays. One example would be avoiding memory-intensive unstack() calls.
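A minimal sketch of the unstack() cost mentioned above, using made-up data. The dimension name `system`, the variable name `intensity`, and the energy/polarization values are illustrative assumptions, not the library's actual layout. Unstacking a MultiIndex allocates a slot for every combination of index levels, while a Dataset with a flat `scan_id` dimension can be filtered directly.

```python
import numpy as np
import xarray as xr

# Three hypothetical scans, each taken at one (energy, polarization) pair.
energy = [270.0, 285.0, 300.0]
polarization = [0, 90, 0]
data = np.ones((3, 4))  # 3 scans x 4 q points

# DataArray route: the scan dimension carries a MultiIndex.
da = xr.DataArray(
    data,
    dims=("system", "q"),
    coords={
        "q": np.arange(4),
        "energy": ("system", energy),
        "polarization": ("system", polarization),
    },
).set_index(system=["energy", "polarization"])

# unstack() builds the full energy x polarization grid and pads the
# combinations that were never measured with NaN.
dense = da.unstack("system")
n_missing = int(dense.isnull().sum())

# Dataset route: a flat scan_id dimension with metadata as plain coordinates.
ds = xr.Dataset(
    {"intensity": (("scan_id", "q"), data)},
    coords={
        "scan_id": [0, 1, 2],
        "q": np.arange(4),
        "energy": ("scan_id", energy),
        "polarization": ("scan_id", polarization),
    },
)

# Select scans by metadata without reshaping the data.
p0 = ds.isel(scan_id=(ds["polarization"] == 0).values)
```

Here the unstacked array holds 24 values for 12 measured ones; the gap grows with every additional index level.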

Datasets would also allow easy switching between different coordinates and storage of non-coordinate data associated with the experiment.

More details and examples to follow.

pbeaucage · 2023-01-22T17:08:19Z

Related to #39.

This is a good idea and not hard to implement, but it is a large API break.

pdudenas · 2023-07-21T21:32:38Z

Any thoughts on how we should structure the fundamental dataset? Going off the xarray example, they structure their data like this:

For us an abstract example could be something like this:

Do we name each data_vars entry by their scan name?

And does moving to a dataset inherently solve #39 or do we need to be careful in how we structure the dataset to avoid those same issues?

pbeaucage · 2023-07-22T12:50:53Z

I would make scan_id and related things like edge, polarization, temperature, etc. coordinates of the dataset.

The data variables would then be a standard set of terms like scattering intensity, intensity uncertainty, incident intensity i0, transmitted intensity it, sample drain current, and possibly (where supported) instrument specific terms or secondary measurements.

The major change is that the primary structure would shift from a single scan or single experiment (thermal anneal, shear series) to a whole experiment as a series of samples. This is sort of like loadSeries in SST1RSoXSDB. The major challenge here will be performance, I believe. Dask might help offset that some.
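A rough sketch of what such a layout could look like, assuming a reduced intensity-vs-q profile per scan. All variable names (`intensity`, `intensity_uncertainty`, `i0`) and all values are placeholders chosen for illustration, not an agreed schema.

```python
import numpy as np
import xarray as xr

n_scans, n_q = 3, 5
rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={
        # Standard measured quantities, one row per scan.
        "intensity": (("scan_id", "q"), rng.random((n_scans, n_q))),
        "intensity_uncertainty": (("scan_id", "q"), 0.01 * np.ones((n_scans, n_q))),
        "i0": ("scan_id", np.array([1.0, 2.0, 4.0])),
    },
    coords={
        "scan_id": [1001, 1002, 1003],
        "q": np.linspace(0.01, 0.1, n_q),
        # Per-scan metadata attached to the scan_id dimension.
        "edge": ("scan_id", ["C K", "C K", "N K"]),
        "polarization": ("scan_id", [0.0, 90.0, 0.0]),
        "temperature": ("scan_id", [25.0, 25.0, 80.0]),
    },
)

# Data variables that share scan_id broadcast against each other by name.
normalized = ds["intensity"] / ds["i0"]

# Pick out a subset of the experiment by metadata.
carbon = ds.isel(scan_id=(ds["edge"] == "C K").values)
```

Because every variable shares the `scan_id` dimension, normalization and subsetting stay aligned without any manual bookkeeping.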

pbeaucage · 2023-07-22T12:53:09Z

One bit of context that @martintb understands better than I do is that I think something throwaway like scan_id can be the dimension, and data variables like edge/temperature/etc can be promoted or demoted from being coordinates at will, in a fairly performant way (compared with MultiIndexes).
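A small sketch of that promote/demote idea, using xarray's `set_coords`, `reset_coords`, and `swap_dims`. The dataset contents are made up for illustration.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "intensity": (("scan_id", "q"), np.ones((3, 4))),
        # Starts life as an ordinary data variable.
        "temperature": ("scan_id", [25.0, 50.0, 80.0]),
    },
    coords={"scan_id": [1, 2, 3], "q": np.linspace(0.01, 0.04, 4)},
)

# Promote temperature to a coordinate, then demote it again.
promoted = ds.set_coords("temperature")
demoted = promoted.reset_coords("temperature")

# Once it is a coordinate, it can temporarily replace scan_id as the
# dimension coordinate, which enables label-based selection.
by_temperature = promoted.swap_dims({"scan_id": "temperature"})
hot = by_temperature.sel(temperature=80.0)
```

The `scan_id` values are kept as a coordinate after the swap, so swapping back is always possible.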

pbeaucage · 2023-07-22T12:55:24Z

#52 is an example of something like this, where data can be labeled with pix_x and pix_y and with q_x and q_y at the same time. In that example we pop the data back out of being a Dataset after that coordinate swap, but if the data were already a Dataset we wouldn't have to.
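To illustrate the dual labeling, here is a sketch in which one image carries both pixel and q coordinates. The linear pixel-to-q mapping is a placeholder assumption; the real conversion depends on the instrument geometry and is not what #52 implements.

```python
import numpy as np
import xarray as xr

ny, nx = 3, 4
ds = xr.Dataset(
    {"intensity": (("pix_y", "pix_x"), np.arange(12.0).reshape(ny, nx))},
    coords={
        "pix_y": np.arange(ny),
        "pix_x": np.arange(nx),
        # Placeholder linear mapping from pixel index to q.
        "q_y": ("pix_y", 0.01 * np.arange(ny)),
        "q_x": ("pix_x", 0.01 * np.arange(nx)),
    },
)

# Switch the indexing coordinates to q; the pixel labels ride along.
by_q = ds.swap_dims({"pix_x": "q_x", "pix_y": "q_y"})

# Select in q space and read back the matching pixel row.
row = by_q.sel(q_y=0.01, method="nearest")
```

The original `ds` is untouched by `swap_dims`, so both views can coexist.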

pdudenas · 2023-07-23T00:59:45Z

So you'd do something more like this?

Repeatedly stacking scans like that could be a source of performance issues, like you mentioned.
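One way to limit that cost, sketched under assumptions: collect the per-scan Datasets in a list and call `xr.concat` once, rather than concatenating onto a growing Dataset inside the loop. `load_scan` here is a hypothetical stand-in that returns synthetic data, not an existing loader function.

```python
import numpy as np
import xarray as xr


def load_scan(scan_id, n_q=4):
    """Hypothetical loader: returns one scan as a Dataset with a length-1 scan_id dim."""
    ds = xr.Dataset(
        {"intensity": ("q", np.full(n_q, float(scan_id)))},
        coords={"q": np.linspace(0.01, 0.04, n_q)},
    )
    # Add scan_id as a new leading dimension with a single label.
    return ds.expand_dims(scan_id=[scan_id])


scan_ids = [1, 2, 3]

# Build the list first, then concatenate a single time.
scans = [load_scan(s) for s in scan_ids]
combined = xr.concat(scans, dim="scan_id")
```

Concatenating inside the loop would copy the accumulated data on every iteration; a single `xr.concat` copies each scan once.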
