Publish Observability SRE images to internal container registry #17401

Open
wants to merge 6 commits into
base: feature/fedramp-high-8.x
from

Conversation

donoghuc
Member

@donoghuc donoghuc commented Mar 24, 2025

Release notes

[rn:skip]

What does this PR do?

Add pipelines for building and shipping the observability SRE image. Specifically, this adds a step to the DRA pipeline that largely follows the existing patterns and steps for artifact publishing. The notable difference is that, for now, we run docker build and docker push directly instead of using the existing build/staging workflow with the release manager. A design goal for this iteration is that the new step must not interfere with any existing DRA steps. To that end, the step runs only after DRA is published, and it is marked with the soft_fail option so that a failure does not interrupt artifact publishing while we stabilize and iterate on this workflow.

Why is it important/What is the impact to the user?

N/A

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files (and/or docker env variables)
  • [ ] I have added tests that prove my fix is effective or that my feature works

Related issues

@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from c27ea34 to 1426454 Compare March 24, 2025 19:33
This commit adds a step to the pull_request_pipeline buildkite definition to
push a docker image to the elastic container registry. It is added here to show
that we have the proper creds etc in CI to push the container where it needs to
go. We will likely move this into the DRA pipeline once we are confident it is
pushing to the correct place with a naming convention that works for all
consumers/producers.

The general idea is to build the container with our gradle task, then once we
have that image we can tag it with the git sha and a "latest" identifier. This
would allow consumers to choose between an exact sha for a stream like 8.19.0 or
the "latest". I will also need to factor in the case where we have the tag
*without* the sha suffix. Obviously we will want to fold this into the existing DRA
pipeline for building/staging images but for now it seems reasonable to handle
this separately.
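
The tagging described in this commit message could be sketched roughly as follows. This is a minimal illustration, not the script in this PR: the `run` and `tag_and_push` helpers, the `DRY_RUN` switch, and the assumption that the gradle task has already produced a local image are all hypothetical.

```sh
#!/bin/sh
# Hypothetical sketch: tag a locally built image with "<version>-<sha>" and
# "<version>-latest", then push both tags.
# With DRY_RUN=1 (the default here) the docker commands are only printed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: tag_and_push <local-image> <registry-path> <version> <git-sha>
tag_and_push() {
  local_image=$1 registry_path=$2 version=$3 sha=$4
  for suffix in "$sha" latest; do
    target="${registry_path}:${version}-${suffix}"
    run docker tag "$local_image" "$target"
    run docker push "$target"
  done
}
```

Calling `tag_and_push <local-image> docker.elastic.co/logstash/logstash-observability-sre 8.19.0-SNAPSHOT 297226b1d` in dry-run mode prints the four docker commands, which makes the resulting tag names easy to review before anything is pushed.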
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from 1426454 to 297226b Compare March 27, 2025 17:10
@donoghuc
Member Author

At this time the POC is pushing successfully (see the passing build in Buildkite).

I can verify that, with a proper docker login, I can pull the image:

➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest
8.19.0-SNAPSHOT-latest: Pulling from logstash/logstash-observability-sre
da71d451b228: Download complete
a6fd2b9c3a1e: Download complete
ed251f38bc33: Download complete
5be1afb3effe: Download complete
ab3031146b38: Download complete
4f4fb700ef54: Already exists
b11d0ffe7078: Download complete
90f82fcef27d: Download complete
2fa0fa18264e: Download complete
d0fae614e212: Download complete
7a4bc1d79b54: Download complete
905324ee5b05: Download complete
93abb63d3447: Download complete
b9ce1cf57560: Download complete
096ba33a94ad: Download complete
758d2b3b74cc: Download complete
Digest: sha256:80fa3e883dea6fcfe454de5dfbd4068cbc47f3f6aca5fffef27c349763224bc6
Status: Downloaded newer image for docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest

Open questions:

  1. Where should I slot in this workflow? My first inclination was somewhere in the DRA pipeline, but that would likely require this feature branch to be merged into 8.x.
  2. Currently CI is building for x86. Do we also need to build for ARM?
  3. How should we do versioning? Currently, each time this action runs we tag with [version]-[sha] and [version]-latest (docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1d, docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest). I assume we would want any non-snapshot version to not have a sha.

@jsvd
Member

jsvd commented Mar 28, 2025

Where should I slot in this workflow? My first inclination was somewhere in the DRA pipeline, but that would likely require this feature branch to be merged into 8.x.

If possible, a separate set of jobs (one for the daily snapshot, another for the release) would be best, since this is not tied to DRA and we may not want this new flow to impact DRA if something goes wrong.

Currently CI is building for x86. Do we also need to build for ARM?

My first reaction was to build only what's necessary for SRE, but having tested this locally myself, I found it painful to use the x64 image when our laptops are aarch64, so I'd suggest building both.

How should we do versioning? Currently, each time this action runs we tag with [version]-[sha] and [version]-latest (docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1d, docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest). I assume we would want any non-snapshot version to not have a sha.

For the current non-FIPS snapshot images we do:

docker.elastic.co/<namespace>/<image-name>:9.0.0-387e2120-SNAPSHOT
docker.elastic.co/<namespace>/<image-name>:9.0.0-SNAPSHOT
docker.elastic.co/<namespace>/<image-name>:9.0-SNAPSHOT

So I suggest we use the same naming scheme, and then for staging images:

docker.elastic.co/staging/<image-name>:9.0.0-34f198f6

So we'd do:

  • Snapshot:
    docker.elastic.co/logstash/logstash-observability-sre:8.19.0-297226b1d-SNAPSHOT

  • Staging:
    docker.elastic.co/logstash/logstash-observability-sre:8.19.0-387e2120
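
To make the suggested scheme concrete, here is a small sketch that composes the two tag shapes from their parts. The `image_tag` function and its argument order are illustrative assumptions, not part of any existing script in the repository.

```sh
#!/bin/sh
# Hypothetical helper: compose an image reference following the convention
# suggested above. Snapshot tags end in -SNAPSHOT; staging tags do not.
# usage: image_tag <snapshot|staging> <registry-path> <version> <sha>
image_tag() {
  workflow=$1 registry_path=$2 version=$3 sha=$4
  case "$workflow" in
    snapshot) echo "${registry_path}:${version}-${sha}-SNAPSHOT" ;;
    staging)  echo "${registry_path}:${version}-${sha}" ;;
    *) echo "unknown workflow type: $workflow" >&2; return 1 ;;
  esac
}

# Reproduces the two example tags listed above.
image_tag snapshot docker.elastic.co/logstash/logstash-observability-sre 8.19.0 297226b1d
image_tag staging  docker.elastic.co/logstash/logstash-observability-sre 8.19.0 387e2120
```

Keeping the sha between the version and the SNAPSHOT qualifier means the same version and sha inputs serve both workflows; only the suffix differs.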

@donoghuc donoghuc marked this pull request as ready for review March 28, 2025 21:42
@donoghuc donoghuc changed the title POC for publishing observability SRE images Publish Observability SRE images to internal container registry Mar 28, 2025
@donoghuc
Member Author

@jsvd thanks for the review!

After starting to make this a separate workflow, I realized just how much duplication that would add. Instead, I opted to add this to the DRA pipeline with the explicit guiding principle of "do not interfere with existing artifact generation/publishing". I ended up adding a step at the very end of the pipeline that is configured to "soft fail". This should ensure that we don't interfere with existing publishing while we iterate on this.

As far as the naming... I'm having trouble understanding the pattern we expect. What I came up with is this:

For snapshot:
We will push a tag like ${REGISTRY_PATH}:${QUALIFIED_VERSION}-${SHA} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1df)

For staging:
We will push a tag like ${REGISTRY_PATH}:${QUALIFIED_VERSION} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0) AND ${REGISTRY_PATH}:${QUALIFIED_VERSION}-${SHA} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-297226b1df)

The idea there being that for "snapshot" we will run frequently and not want to overwrite images, hence giving them a sha tag. For staging we run this infrequently and ultimately want only one image per version there. I decided to include the sha as well so we can track the history of images pushed there. The main thing I'm thinking of here is that as a consumer of the image I would want to use a tag like docker.elastic.co/logstash/logstash-observability-sre:8.19.0 in my deployment. But maybe I'm not following exactly what you were getting at with naming.

This commit takes the POC from the pull_request_pipeline and adds it to the DRA
pipeline. Notably, we take care to not disrupt anything about the existing DRA
pipeline by making this wait until after the artifacts are published and we set
a soft_fail. While this is being introduced and stabilized we want to ensure the
existing DRA pipeline continues to work without interruption. As we get more
stability we can look at a tighter integration.
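
For readers unfamiliar with the two Buildkite attributes mentioned here, the following sketch prints a step definition that waits on an earlier step and soft-fails. The step keys (`dra-publish`, `observability-sre-push`), the label, and the script path are placeholders, not the names used in this PR; `depends_on` and `soft_fail` are standard Buildkite step attributes.

```sh
#!/bin/sh
# Hypothetical sketch: emit a Buildkite command step as YAML.
# The step runs only after <upstream-key> has finished, and a failure of the
# step does not fail the build (soft_fail).
# usage: sre_push_step <key-of-step-to-wait-for>
sre_push_step() {
  upstream_key=$1
  cat <<EOF
steps:
  - label: "Publish observability SRE image"
    key: "observability-sre-push"
    command: ".buildkite/scripts/hypothetical-sre-push.sh"
    depends_on: "${upstream_key}"
    soft_fail: true
EOF
}

sre_push_step dra-publish
```

A pipeline that is generated dynamically could pipe this output to `buildkite-agent pipeline upload`; how the DRA pipeline in this repository actually assembles its steps is not shown in this conversation.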
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from 457aba0 to 8cc2b90 Compare March 28, 2025 22:08
@donoghuc
Member Author

donoghuc commented Mar 28, 2025

DRA snapshot build: https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2555 (note I omitted the depends_on for testing so I don't have to wait an hour for publishing to happen).

Eventually we will want to do proper annotations with manifests but for now
just add arch to the tag.
@donoghuc
Member Author

Regarding architecture... Currently I'm building and pushing for each architecture; however, this creates a race condition where the slowest job seems to overwrite the image in the registry. To get around this for now, I've added the architecture to the image tag. I'm not sure how the release manager handles this. One option would be to explicitly publish manifest information after the images exist, but I'm not sure how to coordinate that across machines at this point.

I figured that for now just requiring architecture in the name would solve our immediate needs.
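
The workaround can be sketched as a helper that appends the build machine's architecture to the tag, so the two jobs never write to the same tag. The function name `arch_tag` is hypothetical, and the sketch assumes `uname -m` reports `x86_64` or `aarch64` on the CI workers.

```sh
#!/bin/sh
# Hypothetical helper: append the machine architecture to an image reference
# so that per-architecture jobs push to distinct tags.
# usage: arch_tag <image-ref> [arch]
arch_tag() {
  ref=$1
  # Default to the architecture of the machine running the job.
  arch=${2:-$(uname -m)}
  echo "${ref}-${arch}"
}

arch_tag docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1df x86_64
arch_tag docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1df aarch64
```

The trade-off is that consumers must then know which architecture suffix to request, which is the gap a manifest list would close.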

@jsvd
Member

jsvd commented Mar 31, 2025

This should ensure that we don't interfere with existing publishing while we iterate on this.

I'm good with that, that's exactly the tradeoff I was expecting we'd analyze and make an informed call on.

The idea there being that for "snapshot" we will run frequently and not want to overwrite images, hence giving them a sha tag.

I believe we want both. For example, the convenience of a non-sha label is tremendous: docker run --rm docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT runs the latest snapshot without requiring us to know what the latest sha was.
But we also want the sha-tagged version, especially for debugging/investigation, to ensure we're running the right build.

The main thing I'm thinking of here is that as a consumer of the image I would want to use a tag like docker.elastic.co/logstash/logstash-observability-sre:8.19.0 in my deployment. But maybe I'm not following exactly what you were getting at with naming.

Currently our non-FIPS images go through three steps:

  1. snapshot daily builds
  2. staging builds (release candidates)
  3. release candidate promotion to GA

For snapshot daily builds I'd expect the tag to have -SNAPSHOT at the end and, as mentioned above, both a sha and a non-sha label, for convenience:

docker.elastic.co/elastic/logstash-observability-sre:9.0.0-387e2120-SNAPSHOT # unique
docker.elastic.co/elastic/logstash-observability-sre:9.0.0-SNAPSHOT # rewritten every day

For staging builds I'd expect a -sha, non-SNAPSHOT tag:

docker.elastic.co/staging/logstash-observability-sre:9.0.0-34f198f6 # unique

For promotion of RCs to GA, I'd expect a final job would add an extra non-sha'ed label to the latest staging build:

docker.elastic.co/staging/logstash-observability-sre:9.0.0 # same image as "9.0.0-34f198f6"
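
One way such a promotion job might add the non-sha label is sketched below, purely as an illustration. The `promote_command` helper is hypothetical and only prints the command; it assumes `docker buildx imagetools create --tag`, which creates a new tag in the registry from an existing image or manifest list, is available on the CI worker.

```sh
#!/bin/sh
# Hypothetical sketch: print the command that would give an existing staging
# image an additional tag without the trailing "-<sha>".
# usage: promote_command <staged-image-ref>
promote_command() {
  staged=$1
  # Strip the shortest trailing "-<something>", i.e. the sha suffix.
  released=${staged%-*}
  echo "docker buildx imagetools create --tag ${released} ${staged}"
}

promote_command docker.elastic.co/staging/logstash-observability-sre:9.0.0-34f198f6
```

Because the new tag is created from the existing reference, both tags would point at the same image, matching the "same image" note in the example above.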

Regarding architecture... Currently I'm building and pushing for each architecture; however, this creates a race condition where the slowest job seems to overwrite the image in the registry. To get around this for now, I've added the architecture to the image tag. I'm not sure how the release manager handles this. One option would be to explicitly publish manifest information after the images exist, but I'm not sure how to coordinate that across machines at this point.

We should be able to push the individual arch-named images upstream and then create a manifest for the generic label encompassing the arch-named images, as described in https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/, in the section "The hard way with docker manifest":

To begin our journey, we’ll first need to build and push the images for each architecture to the Docker Hub. We will then combine all these images in a manifest list referenced by a tag.

# AMD64
$ docker build -t your-username/multiarch-example:manifest-amd64 --build-arg ARCH=amd64/ .
$ docker push your-username/multiarch-example:manifest-amd64

# ARM32V7
$ docker build -t your-username/multiarch-example:manifest-arm32v7 --build-arg ARCH=arm32v7/ .
$ docker push your-username/multiarch-example:manifest-arm32v7

# ARM64V8
$ docker build -t your-username/multiarch-example:manifest-arm64v8 --build-arg ARCH=arm64v8/ .
$ docker push your-username/multiarch-example:manifest-arm64v8

Now that we have built our images and pushed them, we are able to reference them all in a manifest list using the docker manifest command.

$ docker manifest create \
your-username/multiarch-example:manifest-latest \
--amend your-username/multiarch-example:manifest-amd64 \
--amend your-username/multiarch-example:manifest-arm32v7 \
--amend your-username/multiarch-example:manifest-arm64v8

Once the manifest list has been created, we can push it to Docker Hub.

$ docker manifest push your-username/multiarch-example:manifest-latest

If you now go to Docker Hub, you’ll be able to see the new tag referencing the images

This commit refactors the POC pipeline for pushing observability SRE containers
to handle conflicts for tags based on target architectures. Cells with
respective architectures build containers and push to the container registry
with a unique identifier. Once those exist we introduce a separate step to use
the docker manifest command to annotate those images such that a container
client can download the correct image based on architecture. As a result for
every artifact there will be 2 images pushed (one for each arch) and N manifests
pushed. The manifests will handle the final naming that the consumer would
expect.
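
The manifest step described in this commit message could look roughly like the sketch below. It is not the script from this PR: the `manifest_commands` helper is hypothetical and only prints the `docker manifest` commands for review instead of running them.

```sh
#!/bin/sh
# Hypothetical sketch: print the commands that combine per-architecture images
# into one manifest list and push it.
# usage: manifest_commands <manifest-tag> <arch-image>...
manifest_commands() {
  list=$1
  shift
  # --amend lets the command update a manifest list that already exists locally.
  printf 'docker manifest create --amend %s' "$list"
  for image in "$@"; do
    printf ' %s' "$image"
  done
  printf '\n'
  echo "docker manifest push ${list}"
}

base=docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
manifest_commands "$base" "${base}-aarch64" "${base}-x86_64"
```

Running the helper once per final tag (with and without the sha) yields the "N manifests" mentioned above, all referencing the same two architecture-specific images.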
@donoghuc
Member Author

I refactored the workflow to build and push architecture-specific images, then use docker manifest to do the final tagging.

So for example in this build https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2579

The following containers are built and pushed:

  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b-aarch64
  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b-x86_64

Once those jobs are done a new step is added which adds the following manifests:

  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT
  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b

I verified locally:

➜  ~ docker pull docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
8.19.0-SNAPSHOT-bbfa8d66b: Pulling from logstash/logstash-observability-sre
Digest: sha256:f54c60bad4546a474efcaebc879d27af2af7b0c6a4d6503dc71ddc29589a9b4c
Status: Image is up to date for docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
➜  ~ docker manifest inspect docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66
no such manifest: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66
➜  ~ docker manifest inspect docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3658,
         "digest": "sha256:f1e41783578a77d82e2c4cb049e5225f4f519b5f1cccf098b7578566c3fad589",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3658,
         "digest": "sha256:ec09410dbd36347f025e381934364a39bf10b5c9cea7a42c0fd60414762ae56b",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      }
   ]
}

@donoghuc
Member Author

Regarding the naming... I'm not quite following why the sha would come before snapshot. I'm using the qualified version script to create the version. It seems like that is pretty standard and it is responsible for adding the SNAPSHOT from what I can tell.

So for the stage where we actually build and push a container we get a unique ID with [VERSION][SHA][ARCH]

Then in the manifest stage we construct the unique [VERSION][SHA] and [VERSION] (which will be the de-facto "latest").

@jsvd
Member

jsvd commented Apr 1, 2025

Regarding the naming... I'm not quite following why the sha would come before snapshot. I'm using the qualified version script to create the version. It seems like that is pretty standard and it is responsible for adding the SNAPSHOT from what I can tell.

I was just following the current naming convention we use for the snapshot and staging builds, as seen in https://artifacts-snapshot.elastic.co/logstash/8.18.0-15c9af3b/summary-8.18.0-SNAPSHOT.html, for example:

Docker Image Tags
For snapshots docker images are available with these tags:

docker pull docker.elastic.co/<namespace>/<image-name>:8.18.0-15c9af3b-SNAPSHOT
docker pull docker.elastic.co/<namespace>/<image-name>:8.18.0-SNAPSHOT
docker pull docker.elastic.co/<namespace>/<image-name>:8.18-SNAPSHOT

Except For cloud, build candidates docker images are available in the staging namespace (authentication required) with these tags:

docker pull docker.elastic.co/staging/<image-name>:8.18.0-15c9af3b

We can see in these examples:

❯ docker manifest inspect docker.elastic.co/logstash/logstash:8.18.0-14175896-SNAPSHOT | jq '.manifests | map(.digest)'
[
  "sha256:d4f403fed4027eafc81b86aacadf4cec4a7a4cb4454faaf3654738f92a5bd9f6",
  "sha256:7dbf5baf8867109ecfc145c98c89eb339488182fa3f5d3eb4182fb50f3ef9979"
]

❯ docker manifest inspect docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT | jq '.manifests | map(.digest)'
[
  "sha256:d4f403fed4027eafc81b86aacadf4cec4a7a4cb4454faaf3654738f92a5bd9f6",
  "sha256:7dbf5baf8867109ecfc145c98c89eb339488182fa3f5d3eb4182fb50f3ef9979"
]

@donoghuc
Member Author

donoghuc commented Apr 1, 2025

I'm really struggling to understand this... I'll write out what I'm looking at and maybe you can help me wrap my head around it...

So, for a snapshot build (let's stick with 8.18) I see https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2580

This was triggered as a scheduled build for logstash branch 8.18 on sha 9f651f (9f651f6). In Buildkite there is a "Summary link" provided: https://artifacts-snapshot.elastic.co/logstash/8.18.0-15c9af3b/summary-8.18.0-SNAPSHOT.html I don't understand the identifier in that link (15c9af3b), and I don't see it anywhere in the job logs.

According to that link there should be the following images:

  1. docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT
  2. docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
  3. docker.elastic.co/logstash/logstash:8.18-SNAPSHOT

➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT
Error response from daemon: failed to resolve reference "docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT": docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT: not found
➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
8.18.0-SNAPSHOT: Pulling from logstash/logstash
Digest: sha256:d66e89e068457f424650d77a4339deda865df9aac88f775ff5c3f946ac7ab5b5
Status: Image is up to date for docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18-SNAPSHOT
Error response from daemon: failed to resolve reference "docker.elastic.co/logstash/logstash:8.18-SNAPSHOT": docker.elastic.co/logstash/logstash:8.18-SNAPSHOT: not found

I do not understand where the 15c9af3b is coming from, and I don't see the images that are reported to be available.

I want to be able to follow a pattern for this, but I'm just not understanding the pattern to follow 😅 . As mentioned and implemented so far, the pattern I was proposing is to create two tags:

  1. The "unique" image: the qualified version with the logstash sha it was built from appended
  2. The "latest", which is just the qualified version.

The "qualified version" comes from the shared https://github.com/elastic/logstash/blob/main/.buildkite/scripts/common/qualified-version.sh script and the unique identifier is the sha of the commit the container is built from.

In order to follow more closely the existing tagging scheme this commit
refactors the naming for images to include the build sha BEFORE the SNAPSHOT
identifier. While this does not exactly follow the whole system that exists
today for container images in DRA it follows a pattern that is more similar.
Ideally we can iterate to fold handling of this container into DRA and in that
case consumers would not need to update their patterns for identifying images.
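
The renaming described in this commit message amounts to moving the sha in front of the SNAPSHOT qualifier. A minimal sketch follows, assuming the qualified version arrives as a string such as `8.19.0-SNAPSHOT` or `8.19.0`; the `insert_sha` function is hypothetical and is not part of qualified-version.sh.

```sh
#!/bin/sh
# Hypothetical helper: place the build sha BEFORE a trailing -SNAPSHOT
# qualifier, or simply append it when the version has no such qualifier.
# usage: insert_sha <qualified-version> <sha>
insert_sha() {
  version=$1 sha=$2
  case "$version" in
    *-SNAPSHOT) echo "${version%-SNAPSHOT}-${sha}-SNAPSHOT" ;;
    *)          echo "${version}-${sha}" ;;
  esac
}

insert_sha 8.19.0-SNAPSHOT 676bf86   # prints 8.19.0-676bf86-SNAPSHOT
insert_sha 8.19.0 676bf86            # prints 8.19.0-676bf86
```

Handling both cases in one place would let the snapshot and staging workflows share the helper.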
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from d14ba95 to 676bf86 Compare April 2, 2025 01:31

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@elasticmachine
Collaborator

💚 Build Succeeded

History

@donoghuc
Member Author

donoghuc commented Apr 2, 2025

https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2588 shows a build with the updated naming as discussed in slack.

3 participants