Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Versioning dependancies for users loading as python package #2497

Closed

Conversation

nelsonauner
Copy link
Contributor

@nelsonauner nelsonauner commented Apr 5, 2023

PR Overview

This small PR attempts to add in version control to dependancies for users installing pudl via conda or mamba.
As a side-benefit, it also solves immediate issue #2426 for Apple Silicon computers by installing h3 from pip, instead of from conda-forge. Conda-forge has an earlier version of h3 that does not support apple silicon.

I would love feedback on these aspects:

  1. Does this roughly look like the right direction?
  2. Do we want to support environment.yml? Or just conda-lock? or both?
  3. If this conversation gets long, is there a small part of this we should merge in earlier to fix the current bug while punting on larger versioning discussions?
  4. Should documentation in the README.md be changed? There wasn't a whole lot there on pure python installs, anyways. In what situations are non-developers installing pudl locally via conda/pip?

Workflow:

Using pure mamba and environment.yml

# mamba does not support creating from an environment.yml file, so create and then update:
mamba env create --name test
mamba env update -n test --file  environment.yml

Using conda-lock to create lockfiles

conda-lock -f environment.yml -p osx-64 -p linux-64 -p osx-arm64

Using conda-lock.yml to install from scratch 😄

(base) ➜  pudl git:(main) ✗ conda-lock install --name fromlock conda-lock.yml
WARNING:conda_lock.conda_lock:WARNING: installation of pip dependencies is only supported by the 'conda-lock install' command. Other tools may silently ignore them. For portability, we recommend using the newer unified lockfile format (i.e. removing the --kind=explicit argument.
INFO:root:Downloading and Extracting Packages
INFO:root:Downloading and Extracting Packages
INFO:root:Preparing transaction: ...working... done
INFO:root:Verifying transaction: ...working... done
INFO:root:Executing transaction: ...working... done
INFO:root:Collecting h3@ https://files.pythonhosted.org/packages/51/4b/31ed29f5dca2093f6e29707dc21b0cf7071de4623ade7491bd2ee191617d/h3-3.7.6-cp310-cp310-macosx_11_0_arm64.whl#sha256=8bf1e080b9a47774754834e7f10155f3d2e3542bf895488a0519b2ae7d5b15db
INFO:root:  Using cached h3-3.7.6-cp310-cp310-macosx_11_0_arm64.whl (904 kB)
INFO:root:Installing collected packages: h3
INFO:root:  Attempting uninstall: h3
INFO:root:    Found existing installation: h3 3.7.4
INFO:root:    Uninstalling h3-3.7.4:
INFO:root:      Successfully uninstalled h3-3.7.4
INFO:root:Successfully installed h3-3.7.6
INFO:root:
(base) ➜  pudl git:(main) ✗ conda activate fromlock
(fromlock) ➜  pudl git:(main) ✗ python
Python 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:12:31) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pudl
/Users/nelsonauner/mambaforge/envs/fromlock/lib/python3.10/site-packages/pudl/analysis/spatial.py:7: UserWarning: Shapely 2.0 is installed, but because PyGEOS is also installed, GeoPandas will still use PyGEOS by default for now. To force to use and test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. You can do this before starting the Python process, or in your code before importing geopandas:

import os
os.environ['USE_PYGEOS'] = '0'
import geopandas

In a future release, GeoPandas will switch to using Shapely by default. If you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
  import geopandas as gpd
>>>  #works! no segfault

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans
Copy link
Member

I've been using mamba to create my local development environment from scratch like:

$ mamba env create --name pudl-dev --file devtools/environment.yml

How does your mamba not like that kind of thing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holy bujeezus, 13,345 lines. That is a lot of dependencies. I guess it's for 4 separate platforms but wow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but think of how epic it will make everyone's contributions statistics look! 😕

- python=3.10
- pip
- pip:
- h3==3.7.6 # h3-py 3.7.4 on conda-forge fails, see https://github.com/scikit-build/scikit-build/pull/901
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this version pin here won't help folks who are just doing

mamba install catalystcoop.pudl

on Apple silicon though, which I think is the urgent near-term problem. Could we branch off of the commit that we used for that old release, add this version pin to our dependencies in setup.py and do a new release to PyPI & conda-forge with just that small change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'd like to pin all of our dependencies and rely on conda-lock to do that work for us, how do we need to change the way we think about creating our software environment?

The conda-lock.yml file gets checked into source control and is supposed to represent a set of platform-specific packages which exact versions that, if installed, is known to work (since we'll have run all our tests within that environment and gotten the ✅ right?

How would this affect our development environment? Right now we typically do development within the environment defined by devtools/environment.yml. Would we switch to doing development in the environment defined by conda-lock.yml plus some additional packages? In development we can only install catalystcoop.pudl using pip -e right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we:

  • avoid the need to specify our dependencies in more than one place.
  • ensure that we have a fully specified environment that doesn't drift over time.
  • periodically update all our dependencies intentionally and check that everything is still working (and if is, save the new set of dependencies as the default going forward).
  • use the same environment for published packages, deployment (e.g. in CI, local tests with Tox, and our nightly builds) and also in the environment that we're developing in day to day?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful review!

  1. Short-term: Yes, I'll take a stab at branching off of current commit to solve apple silicon problem while we discuss this. (I totally missed devtools/environment.yml folder, sorry!)
  2. Yes, you are totally correct that PUDL would move to using conda-lock as its source of truth for dependancies. So the test runners should boot up their environment using conda-lock, the prod system should run using conda-lock, etc.
  3. Development environment: I'm less puritanical here and I think the process would be installing the environment with conda-lock and then the package with pip -e. The important part is that after you're done developing, you know that any tests and production will be done on the conda-lock.yml deps. So if your development work requires updating a package, or adding a new one, that has to be frozen into the lockfile. This will be enforced automatically when the test suite installs from the lockfile and then tries to run your code.

I think one question you alluded to is "Do we need both an environmental.yml and a conda-lock.yml?!
I come from Pipfile world where every project has a Pipfile and a Pipfile.lock so I don't have a strong reaction to two files - one that's human readable, and another (lock file) that's really pinning down specific version dependancies. People get used to it pretty quickly.

Now, using your bullet points as a ruberic:

Goal Does conda-lock help?
Dependancies in only one place Kind of. For python packages conda-lock is superior since it is pinning version numbers
Environment doesn't drift Definitely
Periodically update dependancies and test that it works Yes, definitely. This can be done by volunteers or an automated job runner that updates dependancies, sees if tests works, and then opens a PR. Updating deps is a code change and should deserve the same rigor of testing
Use same env for published packages, deployment, and dev environment I think so. Everyone should use the environment created by conda-lock and then pip -e for local development of the pudl code.

@zaneselvans
Copy link
Member

I think we've mostly avoided directing to install the software directly because it is so tightly linked to the data. If you have just the software it's not that useful. You also need to have the appropriate data. With the software you can generate the data, but that's more involved than we expect anyone other than us or open source contributors to get. So we've tried to tell people how to access the data directly, or the data+software together.

@codecov
Copy link

codecov bot commented Apr 5, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.1 ⚠️

Comparison is base (140627f) 87.2% compared to head (f5a049c) 87.2%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2497     +/-   ##
=======================================
- Coverage   87.2%   87.2%   -0.1%     
=======================================
  Files         81      81             
  Lines       9511    9511             
=======================================
- Hits        8302    8299      -3     
- Misses      1209    1212      +3     

see 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@nelsonauner nelsonauner changed the title Nelson fix #2426 Versioning dependancies for users loading as python package #2426 Apr 6, 2023
@nelsonauner nelsonauner changed the title Versioning dependancies for users loading as python package #2426 Versioning dependancies for users loading as python package Apr 6, 2023
@zaneselvans zaneselvans deleted the branch catalyst-cooperative:dev January 5, 2024 04:14
@zaneselvans zaneselvans closed this Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants