Deployment of image processing, ideally somewhere maintained by others! #47

Open
metazool opened this issue Nov 11, 2024 · 5 comments

Comments

@metazool
Collaborator

metazool commented Nov 11, 2024

Range of options on this

[ ] luigid running on a development VM in the on-prem cloud with direct read access to NAS
[ ] chromadb also running locally on the same machine
[ ] Object store API in Posit or Datalabs (how do apps then authenticate?)

[ ] luigid in a container on e.g. kubernetes in the on-prem cloud, but with a means of mounting data from the NAS
[ ] object store API in a container too, it has a Dockerfile
[ ] Data from the NAS going unprocessed to an object store, and the pipelines reading from there, obviating the need to connect applications to local storage
[ ] tasks running in e.g. Airflow or Argo Workflows rather than within Luigi

draw.io diagram

@metazool
Collaborator Author

https://github.com/NERC-CEH/plankton_ml/blob/main/PIPELINES.md - this has the walkthrough of setting up the Luigi-based workflow (from NAS to object store).

@metazool
Collaborator Author

metazool commented Nov 20, 2024

https://luigi.readthedocs.io/en/stable/central_scheduler.html - docs for the central scheduler. It's more of a task manager and UI, and reminds me of Celery Flower. The expectation with Luigi is that you use cron or similar to trigger tasks.
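A minimal sketch of the cron-triggering approach, assuming a luigid instance is already listening on the default port. The module name `my_pipeline` and task name `UploadToStore` are hypothetical placeholders, not names from this project.

```python
import shlex


def luigi_cron_line(schedule, module, task, host="localhost", port=8082):
    """Build one crontab line that submits a Luigi task to a running luigid.

    `module` and `task` are placeholders for whatever the pipeline defines.
    """
    command = [
        "luigi",
        "--module", module,
        task,
        "--scheduler-host", host,
        "--scheduler-port", str(port),
    ]
    # shlex.join quotes any argument that needs it for the shell
    return f"{schedule} {shlex.join(command)}"


if __name__ == "__main__":
    # Hypothetical example: run the upload task at 02:00 every night
    print(luigi_cron_line("0 2 * * *", "my_pipeline", "UploadToStore"))
```

The cron environment would also need the pipeline module on its `PYTHONPATH`, which this sketch does not handle.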

While testing this, my connection to the VM died. luigid stayed running in the background but lost its memory of the tasks run in my session. You can configure it to use a SQLAlchemy connection to preserve that history, and I've got a Postgres instance available here. I'd still like to look at alternatives to chromadb (#44) for storing vector types in a more typical SQL database.
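A rough sketch of what that configuration could look like, written as a small Python helper that generates the `luigi.cfg` text. The option names (`record_task_history` under `[scheduler]`, `db_connection` under `[task_history]`) are as I understand them from the Luigi configuration docs; the database URL is a made-up placeholder.

```python
import configparser
import io


def task_history_config(db_url):
    """Return luigi.cfg text that turns on task history in a database.

    `db_url` is any SQLAlchemy connection URL (placeholder in the example).
    """
    # interpolation=None so a '%' in a password is not treated specially
    config = configparser.ConfigParser(interpolation=None)
    config["scheduler"] = {"record_task_history": "true"}
    config["task_history"] = {"db_connection": db_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical Postgres URL; replace with real credentials and host
    print(task_history_config("postgresql://luigi:CHANGE_ME@db.example.org/luigi"))
```

luigid would need to be restarted with this file in its working directory, or pointed at it via `LUIGI_CONFIG_PATH`.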

@metazool
Collaborator Author

metazool commented Nov 20, 2024

I set up a luigi.cfg with sqlite backend and immediately ran into this - spotify/luigi#3227

This (should be?) a small change and I might try to contribute it right now, but I'm worried for our Luigi usage that it's been a known issue for over a year and pinning SQLAlchemy to pre-2.0 is still the suggested workaround.

edit ... now at Luigi's equivalent of this issue with tox4, and slightly regretting life choices https://github.com/python/mypy/pull/14578/files

edit ... seeing there's already an unmerged PR with the same set of changes I was considering, I'll try to leave a helpful comment there and then just pin sqlalchemy. spotify/luigi#3267
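For the pin itself, a requirement line such as `sqlalchemy<2.0` would do it. A small sketch of a guard that checks the installed version before starting the scheduler; the helper names here are hypothetical, not part of Luigi or this project.

```python
from importlib import metadata


def is_pre_2(version):
    """True if a version string such as '1.4.52' has a major version below 2."""
    major = int(version.split(".")[0])
    return major < 2


def check_sqlalchemy():
    """Return True/False for the installed SQLAlchemy, or None if it is absent."""
    try:
        installed = metadata.version("sqlalchemy")
    except metadata.PackageNotFoundError:
        return None
    return is_pre_2(installed)


if __name__ == "__main__":
    result = check_sqlalchemy()
    if result is None:
        print("sqlalchemy is not installed")
    elif result:
        print("sqlalchemy is pinned below 2.0")
    else:
        print("sqlalchemy 2.x found; the task history backend may fail")
```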

update - there's now a work in progress change to drop 3.6 support in Luigi so the above change can be eligible for merging, which is nice to see!

@metazool
Collaborator Author

metazool commented Dec 2, 2024

@rodscott @dolegi tagging you on this for the description above - our range of options for deploying a simple pipeline that reads data from the NAS, applies some processing steps and uploads the results to object storage via an API.

If other projects are testing Argo Workflows for this then I'd be well up for trying. The envisaged issues are:

  • giving Kubernetes a PersistentVolumeClaim to an area on the NAS that's currently managed with user role permissions (e.g. in the present setup I had to request directory access from a project coordinator and then request it be mounted on a VM)
  • uncertainty about where and when an on-prem Kubernetes cluster could be used (for testing purposes, EDS maintains one? would the planned work by IT support be suitable for testing purposes anyway, or is it more for longer term maintenance?)

@metazool
Collaborator Author

I'm rethinking this after having seen @Kzra's recent work on https://github.com/NERC-CEH/cyto-ML (the labelling application, originally RShiny, now successfully ported to Label Studio)

It uses just the image processing parts of this project (decollage plus, I hope, EXIF tagging), wraps the rest up in shell scripts, and uses s3cmd to transfer data to JASMIN object storage. Its simplicity means it would work well as stages in a DVC pipeline.
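To make the transfer step concrete, a minimal sketch of wrapping `s3cmd put --recursive` from Python. The bucket and prefix names are hypothetical, and it assumes the JASMIN endpoint and credentials are already set up in `~/.s3cfg`; the actual scripts in cyto-ML may do this differently.

```python
import subprocess


def s3cmd_put_command(local_dir, bucket, prefix):
    """Build the argument list for a recursive s3cmd upload of one directory."""
    source = local_dir.rstrip("/") + "/"
    target = f"s3://{bucket}/{prefix.strip('/')}/"
    return ["s3cmd", "put", "--recursive", source, target]


def upload(local_dir, bucket, prefix, dry_run=True):
    """Run the upload, or only print the command when dry_run is True."""
    command = s3cmd_put_command(local_dir, bucket, prefix)
    if dry_run:
        print(" ".join(command))
        return 0
    # check=True raises CalledProcessError if s3cmd exits non-zero
    return subprocess.run(command, check=True).returncode


if __name__ == "__main__":
    # Hypothetical directory, bucket and prefix
    upload("decollaged_images", "example-bucket", "plankton/2024")
```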

Given that we

  • Presently accept that running on a single virtual machine is our best option
  • Are limited (by design) to individual-based credentials for both reading from the NAS and uploading to JASMIN

Then contributing a DVC pipeline definition to that project (in essence a YAML file that says "run these scripts sequentially, with the option to pass data between them, and skip re-running a stage if its input hasn't changed"), along with an Ansible playbook for setting it to run on a schedule, is probably the most useful small step forward.
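To illustrate the "skip if the input hasn't changed" idea, a plain-Python sketch of the mechanism. This is not DVC and not how DVC is implemented; it only mimics the behaviour of recording dependency hashes in a lock file and comparing them on the next run. The stage name and file names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, used to detect changed inputs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_stage(name, deps, action, lock_path):
    """Run `action` only if any file in `deps` changed since the last run.

    Returns True if the stage ran and False if it was skipped.
    """
    lock_file = Path(lock_path)
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    digests = {str(dep): file_digest(dep) for dep in deps}
    if lock.get(name) == digests:
        return False  # inputs unchanged, nothing to do
    action()
    # Record the hashes only after the action succeeded
    lock[name] = digests
    lock_file.write_text(json.dumps(lock, indent=2))
    return True
```

In DVC itself the equivalent is an entry under `stages:` in `dvc.yaml` with `cmd`, `deps` and `outs`, re-run with `dvc repro`, which keeps the recorded hashes in `dvc.lock`.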

Luigi has been good to explore and was great for rapid prototyping. The object store API is a useful standalone, and for container-based workflows it definitely has its place, but here it adds complexity...

Labels
None yet
Projects
Status: In Progress
Development

No branches or pull requests

1 participant