Improve general docker interoperability #2078

Open
torbjorn opened this issue Jan 16, 2025 · 9 comments

Comments

@torbjorn

torbjorn commented Jan 16, 2025

When setting up R in containers I repeatedly end up implementing elaborate hacks to make renv integrate seamlessly and efficiently. These are typically related to:

  • where do I put the library
  • handle cache efficiently
  • binary incompatibilities between ubuntu packages and renv cache
  • preinstall renv in the expected location during docker build
  • doing restore in build or in post-build scripts?
  • renv::restore never succeeds when rebuilding an image a year later

The renv project has been very forthcoming when I have asked for specific changes that make a big difference to running renv in a container, but it is in a way a never-ending story.

Perhaps a "docker task force" could be useful: a group that would maintain recommended reference implementations of renv functionality in Docker-based projects. They would stay informed by (or come from) core renv development and ideally advise on future renv development from a Docker perspective. Things they'd maintain would typically include:

  • rootless or rootful docker
  • do or do not use renv cache to speed up restore
  • renv cache inside the docker context or outside of it
  • restore in build or post-build
  • run as root or a local dockeruser
  • pak - with and without
  • more...

EDIT:

There is already this article, but in many ways it only scratches the surface of these issues:
https://rstudio.github.io/renv/articles/docker.html

@kevinushey
Collaborator

I'd definitely welcome any updates or additions to the Docker vignette at https://rstudio.github.io/renv/articles/docker.html; that's where I want to collect this sort of advice. I can brain-dump some of my thoughts...

> where do I put the library

I think the general answer is "it depends", but I think you usually want the R library in the Docker container, with the cache potentially mounted externally from somewhere.

> handle cache efficiently
> preinstall renv in the expected location during docker build
> doing restore in build or in post-build scripts?

Can you elaborate for these points? Some suggestions are enumerated in the vignette, but they could surely be improved.

> binary incompatibilities between ubuntu packages and renv cache

Can you elaborate here as well? You can use RENV_PATHS_PREFIX_AUTO = TRUE to ensure that renv places an OS identifier in the library paths, e.g.

> .libPaths()
[1] "/home/kevin/scratch/example/renv/library/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu" 
[2] "/home/kevin/.cache/R/renv/sandbox/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu/9a444a72"

and this is also the default behavior for R 4.4 and newer.

@kevinushey
Collaborator

> renv::restore never succeeds when rebuilding an image a year later

It would help to know what kinds of failures you're seeing; examples would be very helpful.

@torbjorn
Author

torbjorn commented Feb 7, 2025

> I'd definitely welcome any updates or additions to the Docker vignette at https://rstudio.github.io/renv/articles/docker.html; that's where I want to collect this sort of advice. I can brain-dump some of my thoughts...

> where do I put the library

> I think the general answer is "it depends", but I think you usually want the R library in the Docker container, with the cache potentially mounted externally from somewhere.

Library in the container, I agree, this feels cleaner. (As opposed to being mounted from the project directory.) Though not ruling out special cases where this may not be the case.

> handle cache efficiently (1. below)
> preinstall renv in the expected location during docker build (2. below)
> doing restore in build or in post-build scripts? (3. below)

> Can you elaborate for these points? Some suggestions are enumerated in the vignette, but they could surely be improved.

Here are some problems:

  1. If you do renv::restore during docker build, packages that are downloaded and installed are also cached, as normal, but the cache is now inside the image. You would have to actively take steps to export it to a location outside of the image; otherwise the cache is lost and serves no purpose. (This is not a problem that I think renv should solve, but recommended workarounds or general commentary would be nice, so developers don't waste time.)

  2. If you start R and renv is not in the project library, R will spend 30 seconds downloading and installing renv as the first thing it does, i.e.:

    $ R
    # Bootstrapping renv 1.1.0 ---------------------------------------------------
    - Downloading renv ... OK
    - Installing renv  ... OK

    - Project '~/somewhere' loaded. [renv 1.1.0]

This will happen every time you run the image and is annoying enough that it should be prevented. To avoid this I need to know what the full path of the library will be once renv is activated, and then install renv in that location. Currently the best way I see is to have something like this in my Dockerfile:

RUN <<EOR
R --vanilla -e "install.packages('remotes', repos='${repo}')
remotes::install_version('renv', version = '${RENV_VERSION}', repos='${repo}')
dir.create(renv::paths\$library(), recursive=TRUE)
file.copy(find.package('renv'), renv::paths\$library(), recursive = TRUE)"
EOR

I run R with --vanilla because otherwise the system renv will not be available. The rest works because although I run R without activating renv, renv::paths$library() still works as if renv had been activated if I load renv in a directory that has a renv.lock (it seems). This is undocumented behaviour as far as I can tell, which works for the time being. So to be clear - the challenge is to predetermine the library path eg. /project-library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu before renv is activated so I can put renv there during docker build. (RENV_PATHS_LIBRARY is only good for the first part of that path)

EDIT: RUN R -e 1 is another option here, ie just running R the first time in the docker build and have it install renv (the drawback being that it installs from scratch one time more)

See related discussion here: #1668

  3. I want to have cache available when I restore. Just tidyverse alone is a 10-15 minute operation. With cache it is a 0.1 sec operation. Cache is the way to go. One way is to mount the renv cache, as you point out, when you run docker run ... and then do renv::restore as the very first thing you do. It's doable, both for devcontainers and project containers, but not really for application containers. But for production application containers a clean install without old cache is likely also preferable. At any rate, in practice for this to work, it means you need a script that does this so that all docker runs become something like: docker run ...options... myimage bash -c 'script/that/runs/restore.sh && bash'. Or you have to make sure that your R code always starts with a renv::restore().
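A minimal sketch of what such a restore script could look like, assuming Rscript is on the PATH inside the image and the container starts in the project directory (the one containing renv.lock). The function name restore_then_run and the RSCRIPT override are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Hypothetical restore.sh: restore the renv library, then run a command.
# Assumes the working directory is the renv project (contains renv.lock).

restore_then_run() {
    # RSCRIPT may be overridden (e.g. for a dry run); defaults to Rscript.
    # prompt = FALSE keeps restore non-interactive inside a container.
    "${RSCRIPT:-Rscript}" -e 'renv::restore(prompt = FALSE)' || return 1
    # Run whatever command was passed, e.g. `bash` or `Rscript app.R`.
    "$@"
}
```

Appending `restore_then_run "$@"` as the last line would turn this into a script that can be called directly, for example as the image entrypoint.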

A viable alternative is to keep the renv cache in the docker context (i.e. in the same project) and then COPY it during docker build. Unless the renv cache runs into tens of GB (it won't), it's a decent solution. This also means you need functionality to export the cache from the image to that folder in the docker context if you restore packages during docker build without cache.
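One way the export step could be sketched, using docker create, docker cp and docker rm. It assumes the image was built with RENV_PATHS_CACHE=/renv/cache; that path, the default destination ./renv-cache, the function name and the DOCKER override are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: copy the renv cache out of a built image into the
# docker context, so a later build can COPY it back in.
# Assumes the cache lives at /renv/cache inside the image.

export_renv_cache() {
    image="$1"                    # image to extract the cache from
    dest="${2:-./renv-cache}"     # directory inside the docker context
    docker="${DOCKER:-docker}"    # overridable, e.g. for a dry run

    # Create a stopped container so its filesystem can be read.
    cid=$("$docker" create "$image") || return 1
    mkdir -p "$dest"
    # The trailing "/." copies the directory contents, not the directory itself.
    "$docker" cp "$cid:/renv/cache/." "$dest"
    rc=$?
    # Always remove the temporary container.
    "$docker" rm "$cid" >/dev/null
    return "$rc"
}
```

Usage would be something like `export_renv_cache myimage ./renv-cache` after a build that installed packages not yet in the cache.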

> binary incompatibilities between ubuntu packages and renv cache

> Can you elaborate here as well? You can use RENV_PATHS_PREFIX_AUTO = TRUE to ensure that renv places an OS identifier in the library paths, e.g.
>
>     .libPaths()
>     [1] "/home/kevin/scratch/example/renv/library/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu"
>     [2] "/home/kevin/.cache/R/renv/sandbox/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu/9a444a72"
>
> and this is also the default behavior for R 4.4 and newer.

The problem is the cache, not the library. To elaborate - I just now ran renv::init() in a fresh new project using renv 1.1.1 and R 4.4.2. The renv::paths$cache() was:

/home/tlindahl/.cache/R/renv/cache/v5/R-4.4/x86_64-pc-linux-gnu

It does not contain my distro's version. That path will be the same if I run Ubuntu 24, 22 or 20, likely also Debian. Each of those distros will have different versions of the C libraries used by R packages. If I install a package in one of them, it links to those local libraries. That package then gets copied to a cache path which, as shown, is the same for many distros and versions. If I then restore on another distro from that same cache, things will either fail during install or later when I use the package.

This problem is solvable with the env variable RENV_PATHS_PREFIX, but now you're forcing me to do really tedious bookkeeping that I'm not well equipped to do.

> renv::restore never succeeds when rebuilding an image a year later

> It would help to know what kinds of failures you're seeing; examples would be very helpful.

This one is easier - it's already discussed here: #1893

R packages that depend on system packages (e.g. Ubuntu packages) become uninstallable (and unrestorable) once those packages are updated. Often you don't really enforce the R version either, and these tend to get rolling updates during the lifespan of a distro.


I'm working on a GitHub project with reference implementations that will highlight these problems and more or less hacky solutions to them. It may make discussion easier.

EDIT:
I see you were already all over those other issues I linked to as well, sorry, I should have acknowledged (and noticed!) that.

@torbjorn
Author

torbjorn commented Feb 9, 2025

> binary incompatibilities between ubuntu packages and renv cache

> Can you elaborate here as well? You can use RENV_PATHS_PREFIX_AUTO = TRUE to ensure that renv places an OS identifier in the library paths, e.g.
>
>     .libPaths()
>     [1] "/home/kevin/scratch/example/renv/library/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu"
>     [2] "/home/kevin/.cache/R/renv/sandbox/linux-ubuntu-noble/R-4.3/aarch64-unknown-linux-gnu/9a444a72"
>
> and this is also the default behavior for R 4.4 and newer.

> The problem is the cache, not the library. To elaborate - I just now ran renv::init() in a fresh new project using renv 1.1.1 and R 4.4.2. The renv::paths$cache() was:
>
>     /home/tlindahl/.cache/R/renv/cache/v5/R-4.4/x86_64-pc-linux-gnu
>
> It does not contain my distro's version. That path will be the same if I run Ubuntu 24, 22 or 20, likely also Debian. Each of those distros will have different versions of the C libraries used by R packages. If I install a package in one of them, it links to those local libraries. That package then gets copied to a cache path which, as shown, is the same for many distros and versions. If I then restore on another distro from that same cache, things will either fail during install or later when I use the package.
>
> This problem is solvable with the env variable RENV_PATHS_PREFIX, but now you're forcing me to do really tedious bookkeeping that I'm not well equipped to do.

I re-ran this example, and my initial claim was wrong. The cache path does contain an OS-specific component, in my case the Ubuntu distro name. This issue can probably be removed from the list. (I had RENV_PATHS_PREFIX set to an empty string when I tried it.)

@torbjorn
Author

torbjorn commented Feb 9, 2025

I wrote up some examples of making cache work in docker here: https://github.com/torbjorn/renv-docker

There are essentially two variants there:

  1. Copy cache into the image during docker build and restore from it.
  2. Run restore at first docker run, like you describe in the article.

Operations can be tucked away in scripts, yes, but I kept them in the Dockerfile for clarity.

@kevinushey, in your vignette you mention multistage builds. How exactly does that help? Surely the user will barely notice how layers are composed before restore is run? Whether or not renv::restore() has to be re-executed seems to be the main issue (to me at least), and multistage builds don't really change that?

The way I preinstall renv in the project library feels like it could have been done more elegantly, perhaps by a function in renv itself? (So e.g. renv::deploy())

I mentioned problems with rebuilding images "a year later", and linked to a discussion. That was mainly about different R versions, so not really renv/docker related. As long as you stick to the same source image, rebuilding should probably work well enough, so I may have to backpedal on that point too.

@kevinushey
Collaborator

kevinushey commented Feb 11, 2025

> If you do renv::restore during docker build, packages that are downloaded and installed are also cached, as normal, but the cache is now inside the image. You would have to actively take steps to export it to a location outside of the image; otherwise the cache is lost and serves no purpose. (This is not a problem that I think renv should solve, but recommended workarounds or general commentary would be nice, so developers don't waste time.)

This is what I was trying to get at in https://rstudio.github.io/renv/articles/docker.html#dynamically-provisioning-r-libraries-with-renv; in that the "best" solution in these scenarios is to have the renv cache on a mounted drive that is used and updated by containers when they are started.
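For reference, a sketch of running a container with the cache mounted from the host, along the lines of what the vignette describes. The default host path, the container path /renv/cache, the image name and the function name are placeholders chosen for this example; the DOCKER override exists only so the command can be dry-run.

```shell
#!/bin/sh
# Sketch: run a container with the host's renv cache mounted into it.
# Host path, container path and image name are placeholders.

run_with_renv_cache() {
    image="$1"; shift
    host_cache="${RENV_PATHS_CACHE_HOST:-$HOME/.cache/R/renv/cache}"
    container_cache="/renv/cache"

    # RENV_PATHS_CACHE tells renv inside the container where the cache is;
    # -v makes the host directory appear at that location.
    "${DOCKER:-docker}" run --rm \
        -e "RENV_PATHS_CACHE=$container_cache" \
        -v "$host_cache:$container_cache" \
        "$image" "$@"
}
```

For example, `run_with_renv_cache myimage R -s -e 'renv::restore()'` would restore against the mounted cache, so packages installed by one container are available to the next.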

> EDIT: RUN R -e 1 is another option here, ie just running R the first time in the docker build and have it install renv (the drawback being that it installs from scratch one time more)

Wouldn't it also suffice to run R -e "renv::init()"? That would initialize the project, the autoloader, and also ensure that renv is installed into the private library. That is, something like:

WORKDIR /project
RUN R -e "renv::init()"

> I want to have cache available when I restore. Just tidyverse alone is a 10-15 minute operation. With cache it is a 0.1 sec operation. Cache is the way to go. One way is to mount the renv cache, as you point out, when you run docker run ... and then do renv::restore as the very first thing you do. It's doable, both for devcontainers and project containers, but not really for application containers. But for production application containers a clean install without old cache is likely also preferable. At any rate, in practice for this to work, it means you need a script that does this so that all docker runs become something like: docker run ...options... myimage bash -c 'script/that/runs/restore.sh && bash'. Or you have to make sure that your R code always starts with a renv::restore().

IMHO the most straightforward solution is indeed to ensure you're calling renv::restore() in your R code; either as part of the final Dockerfile RUN statement, or just manually in some appropriate part of your runtime. That's the approach I'm describing in https://rstudio.github.io/renv/articles/docker.html#dynamically-provisioning-r-libraries-with-renv, at least.

> This problem is solvable with the env variable RENV_PATHS_PREFIX, but now you're forcing me to do really tedious bookkeeping that I'm not well equipped to do.

The simpler solution is to set:

RENV_PATHS_PREFIX_AUTO = TRUE

This will force renv to include a platform component in the library + cache paths it uses. This is the default for R 4.4 and newer, but unfortunately not for older versions of R. I didn't want to change this for older versions of R since I would've potentially been changing the cache path in already-existing applications using renv. Maybe I can revisit this in some future release of renv...

@kevinushey
Collaborator

> @kevinushey, in your vignette you mention multistage builds. How exactly does that help? Surely the user will barely notice how layers are composed before restore is run? Re-executing renv::restore() or not seems to be the main issue (to me at least), and multistage builds don't really change that?

The bit on multi-stage builds was contributed by other users, so I'm less well-equipped to comment on it. e82f56c

> The way I preinstall renv in the project library feels like it could have been done more elegantly, perhaps by a function in renv itself? (So eg. renv::deploy())

I think the right solution here is (or should be) renv::init(), but there are some internal tools like renv:::imbue() which can be used to install a particular version of renv into a project's library.

@torbjorn
Author

> EDIT: RUN R -e 1 is another option here, ie just running R the first time in the docker build and have it install renv (the drawback being that it installs from scratch one time more)

> Wouldn't it also suffice to run R -e "renv::init()"? That would initialize the project, the autoloader, and also ensure that renv is installed into the private library. That is, something like:
>
>     WORKDIR /project
>     RUN R -e "renv::init()"

Your RUN R -e "renv::init()" would be intercepted by .Rprofile -> renv/activate.R, which would trigger a fresh install of renv into the project library, after which the renv::init() doesn't really make a difference, I believe? So just running the R code 1 should be enough to trigger that install?

> I want to have cache available when I restore. Just tidyverse alone is a 10-15 minute operation. With cache it is a 0.1 sec operation. Cache is the way to go. One way is to mount the renv cache, as you point out, when you run docker run ... and then do renv::restore as the very first thing you do. It's doable, both for devcontainers and project containers, but not really for application containers. But for production application containers a clean install without old cache is likely also preferable. At any rate, in practice for this to work, it means you need a script that does this so that all docker runs become something like: docker run ...options... myimage bash -c 'script/that/runs/restore.sh && bash'. Or you have to make sure that your R code always starts with a renv::restore().

> IMHO the most straightforward solution is indeed to ensure you're calling renv::restore() in your R code; either as part of the final Dockerfile RUN statement, or just manually in some appropriate part of your runtime. That's the approach I'm describing in https://rstudio.github.io/renv/articles/docker.html#dynamically-provisioning-r-libraries-with-renv, at least.

Those are the only two solutions I have meant to outline so far, either renv::restore() during build, in a RUN statement towards the end, or as the first thing being done when you run the container, using mounted cache.

> This problem is solvable with the env variable RENV_PATHS_PREFIX, but now you're forcing me to do really tedious bookkeeping that I'm not well equipped to do.

> The simpler solution is to set:
>
>     RENV_PATHS_PREFIX_AUTO = TRUE
>
> This will force renv to include a platform component in the library + cache paths it uses. This is the default for R 4.4 and newer, but unfortunately not for older versions of R. I didn't want to change this for older versions of R since I would've potentially been changing the cache path in already-existing applications using renv. Maybe I can revisit this in some future release of renv...

I agree with this; this seems fixed in 4.4. I demonstrated that this works above as well, or at least I meant to!

@torbjorn
Author

I agree 100% that the two main strategies for doing restore with functional cache are either:

  1. maintain a project local cache and copy it during docker build to speed up renv::restore()
  2. mount a cache volume (or dir) with docker run, and have renv::restore() be the first thing you do in R

Point 1 feels more elegant: you come out of docker build with a docker image with renv packages restored. Point 2 puts extra burden on the user to remember to do restore() before running R code that uses packages. The con with 1 is of course that the user must be aware of the inner workings of the build and export the cache if un-cached packages were installed during the build. A skilled developer will know which one to use, but in its current form the vignette perhaps doesn't really focus on guiding the user towards the two options above?
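To make option 1 concrete, here is a rough sketch of a Dockerfile for it, written out by a small shell function. The base image rocker/r-ver:4.4.2, the context directory renv-cache/, the in-image path /renv/cache and the function name are all assumptions for illustration, not something prescribed by renv or the vignette.

```shell
#!/bin/sh
# Sketch of option 1: write a Dockerfile that COPYs a project-local cache
# into the image and restores from it during `docker build`.
# Base image, directory names and paths are assumptions.

write_cached_dockerfile() {
    target="${1:-Dockerfile}"
    cat > "$target" <<'EOF'
FROM rocker/r-ver:4.4.2
ENV RENV_PATHS_CACHE=/renv/cache
WORKDIR /project
# Seed the in-image cache from the docker context (renv-cache/ must exist).
COPY renv-cache/ /renv/cache/
COPY renv.lock .Rprofile ./
COPY renv/activate.R renv/activate.R
# Packages found in the cache are reused; the rest are installed and cached.
RUN R -s -e "renv::restore()"
EOF
}
```

After `write_cached_dockerfile` the image would be built as usual with `docker build -t myimage .`; any packages that were not in renv-cache/ at build time end up only in the image's cache and would still need to be exported back to the context.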

The section on multistage builds, I believe, should be removed, or moved to a section that deals with other optimizations not related to efficient handling of the renv cache for renv::restore().


The point related to init(), imbue() or my 'R -e 1' is not related to restore; here I just wanted to avoid downloading and installing the renv package again (after first having installed it system-wide). But a better solution is simply not to install renv system-wide and have renv imbue itself. That should result in just a single install of renv during build, rendering that point of mine moot.
