
Publish Observability SRE images to internal container registry #17401

Merged

Conversation

Member

@donoghuc donoghuc commented Mar 24, 2025

Release notes

[rn:skip]

What does this PR do?

Add pipelines for building and shipping the observability SRE image. Specifically, this adds a step to the DRA pipeline that largely follows the existing patterns and steps for artifact publishing. The notable difference is that we currently do a direct docker build and docker push workflow instead of the existing build/staging workflow with the release manager. A design goal for this iteration is to ensure that this new step does not interfere with any existing DRA steps. As such, we ensure this happens after DRA is published, and failures are marked with a soft_fail option so as not to interrupt any artifact publishing while we stabilize and iterate on this workflow.
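The step described above might look roughly like the following Buildkite fragment. This is a hypothetical sketch only: the label, the depends_on key, and the script path are illustrative, not the actual pipeline definition.

```yaml
  - label: "Publish Observability SRE image"
    depends_on: "dra-publish"   # run only after DRA artifacts are published
    soft_fail: true             # a failure here must not block DRA publishing
    command: ".buildkite/scripts/dra/publish-observability-sre-image.sh"
```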

Why is it important/What is the impact to the user?

N/A

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files (and/or docker env variables)
  • [ ] I have added tests that prove my fix is effective or that my feature works

Related issues

@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from c27ea34 to 1426454 Compare March 24, 2025 19:33
This commit adds a step to the pull_request_pipeline buildkite definition to
push a docker image to the elastic container registry. It is added here to show
that we have the proper creds etc in CI to push the container where it needs to
go. We will likely move this into the DRA pipeline once we are confident it is
pushing to the correct place with a naming convention that works for all
consumers/producers.

The general idea is to build the container with our gradle task; once we
have that image we can tag it with the git sha and a "latest" identifier. This
would allow consumers to choose between an exact sha for a stream like 8.19.0 or
the "latest". I will also need to factor in the case where we have the tag
*without* the sha postfix. Obviously we will want to fold this into the existing DRA
pipeline for building/staging images, but for now it seems reasonable to handle
this separately.
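The tagging described above can be sketched in shell. The registry path, version, and short sha here are illustrative values, and the docker commands are echoed rather than executed, since this is only a sketch of the scheme, not the pipeline's actual script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values, not the pipeline's real inputs
REGISTRY_PATH="docker.elastic.co/logstash/logstash-observability-sre"
VERSION="8.19.0-SNAPSHOT"
GIT_SHA="297226b1d"   # short sha of the commit being built

# One exact tag per build, plus a rolling "latest" for the stream
sha_tag="${REGISTRY_PATH}:${VERSION}-${GIT_SHA}"
latest_tag="${REGISTRY_PATH}:${VERSION}-latest"

# The pipeline would then push both tags (echoed here):
echo docker push "${sha_tag}"
echo docker push "${latest_tag}"
```

A consumer can then pin to the sha tag for reproducibility or track the latest tag for convenience.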
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from 1426454 to 297226b Compare March 27, 2025 17:10
@donoghuc
Member Author

At this time the POC is successfully pushing (see passing build in buildkite)

I can verify that with a proper docker login I can pull the image:

➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest
8.19.0-SNAPSHOT-latest: Pulling from logstash/logstash-observability-sre
da71d451b228: Download complete
a6fd2b9c3a1e: Download complete
ed251f38bc33: Download complete
5be1afb3effe: Download complete
ab3031146b38: Download complete
4f4fb700ef54: Already exists
b11d0ffe7078: Download complete
90f82fcef27d: Download complete
2fa0fa18264e: Download complete
d0fae614e212: Download complete
7a4bc1d79b54: Download complete
905324ee5b05: Download complete
93abb63d3447: Download complete
b9ce1cf57560: Download complete
096ba33a94ad: Download complete
758d2b3b74cc: Download complete
Digest: sha256:80fa3e883dea6fcfe454de5dfbd4068cbc47f3f6aca5fffef27c349763224bc6
Status: Downloaded newer image for docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest

Open questions:

  1. Where should I slot in this workflow? My first inclination was somewhere in the DRA pipeline, but that would likely require this feature branch to be merged into 8.x
  2. Currently CI is building for x86, do we also need to build for ARM?
  3. How should we do versioning? Currently each time this action runs we tag with [version]:[sha] and [version]:[latest] (docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1d, docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest). I assume we would want any non-snapshot version to not have a sha.

@jsvd
Member

jsvd commented Mar 28, 2025

Where should I slot in this workflow? My first inclination was somewhere in the DRA pipeline, but that would likely require this feature branch to be merged into 8.x

If possible, a separate set of jobs (one for the daily snapshot, another for the release) would be best, since this is not tied to DRA and we may not want to allow this new flow to impact DRA if something goes wrong.

Currently CI is building for x86, do we also need to build for ARM?

My first reaction was to only build what's necessary for SRE, but having tested this locally myself, it's painful to use the x64 image when our laptops are aarch64, so I'd suggest building both.

How should we do versioning? Currently each time this action runs we tag with [version]:[sha] and [version]:[latest] (docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1d, docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-latest). I assume we would want any non snapshot version to not have a sha.

for the current non-fips snapshot images we do:

docker.elastic.co/<namespace>/<image-name>:9.0.0-387e2120-SNAPSHOT
docker.elastic.co/<namespace>/<image-name>:9.0.0-SNAPSHOT
docker.elastic.co/<namespace>/<image-name>:9.0-SNAPSHOT

So I suggest we use the same naming scheme, and then for staging images:

docker.elastic.co/staging/<image-name>:9.0.0-34f198f6

So we'd do:

  • Snapshot:
    docker.elastic.co/logstash/logstash-observability-sre:8.19.0-297226b1d-SNAPSHOT

  • Staging:
    docker.elastic.co/logstash/logstash-observability-sre:8.19.0-387e2120

@donoghuc donoghuc marked this pull request as ready for review March 28, 2025 21:42
@donoghuc donoghuc changed the title POC for publishing observability SRE images Publish Observability SRE images to internal container registry Mar 28, 2025
@donoghuc
Member Author

@jsvd thanks for the review!

After starting to make this a separate workflow I realized just how much duplication that would add. Instead I opted to add this to the DRA pipeline with the explicit guiding principle of "do not interfere with existing artifact generation/publishing". I ended up adding a step at the very end of the pipeline that is configured to "soft fail". This should ensure that we don't interfere with existing publishing while we iterate on this.

As far as the naming... I'm having trouble understanding the pattern we expect. What I came up with is this:

For snapshot:
We will push a tag like ${REGISTRY_PATH}:${QUALIFIED_VERSION}-${SHA} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1df)

For staging:
We will push a tag like ${REGISTRY_PATH}:${QUALIFIED_VERSION} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0) AND ${REGISTRY_PATH}:${QUALIFIED_VERSION}-${SHA} (example: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-297226b1df)

The idea there being that for "snapshot" we will run frequently and do not want to overwrite images, hence giving them a sha tag. For staging we run this infrequently and ultimately want only one image per version there. I decided to also include the sha so we can track the history of images pushed there. The main thing I'm thinking of here is that, as a consumer of the image, I would want to use a tag like docker.elastic.co/logstash/logstash-observability-sre:8.19.0 in my deployment. But maybe I'm not following exactly what you were getting at with naming.
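The snapshot vs. staging split described above can be sketched as a small shell helper. compute_tags is a hypothetical function for illustration, not a script that exists in the repo:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the tags to push for a given registry path, qualified version, and sha.
compute_tags() {
  local registry="$1" version="$2" sha="$3"
  if [[ "$version" == *-SNAPSHOT ]]; then
    # snapshot: runs frequently, so always qualify with the sha
    echo "${registry}:${version}-${sha}"
  else
    # staging: one canonical tag per version, plus a sha tag for history
    echo "${registry}:${version}"
    echo "${registry}:${version}-${sha}"
  fi
}

compute_tags docker.elastic.co/logstash/logstash-observability-sre 8.19.0 297226b1df
```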

This commit takes the POC from the pull_request_pipeline and adds it to the DRA
pipeline. Notably, we take care not to disrupt anything about the existing DRA
pipeline by making this wait until after the artifacts are published and setting
a soft_fail. While this is being introduced and stabilized we want to ensure the
existing DRA pipeline continues to work without interruption. As we get more
stability we can look at a tighter integration.
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from 457aba0 to 8cc2b90 Compare March 28, 2025 22:08
@donoghuc
Member Author

donoghuc commented Mar 28, 2025

DRA snapshot build: https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2555 (note I omitted the depends_on for testing so I don't have to wait an hour for publishing to happen).

Eventually we will want to do proper annotations with manifests but for now
just add arch to the tag.
@donoghuc
Member Author

Regarding architecture... Currently I'm building and pushing for each architecture; however, this creates a race condition where the slowest job overwrites the image in the registry. To get around this for now, I've added the architecture to the name of the image tag. I'm not sure how the release manager handles this. One option would be to explicitly publish manifest information after the images exist, but I'm not sure how to coordinate that across machines at this point.

I figured that for now just requiring architecture in the name would solve our immediate needs.
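The interim approach above amounts to each build agent suffixing its own architecture onto the tag, so concurrent pushes no longer collide. A minimal sketch, with an illustrative tag layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Each agent derives its own architecture suffix, e.g. x86_64 or aarch64
ARCH="$(uname -m)"
TAG="docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-297226b1d-${ARCH}"

# Each agent pushes only its arch-suffixed tag, so the slower job can no
# longer overwrite the faster one (echoed here, since this is only a sketch):
echo docker push "${TAG}"
```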

@jsvd
Member

jsvd commented Mar 31, 2025

This should ensure that we dont interfere with existing publishing while we iterate on this.

I'm good with that, that's exactly the tradeoff I was expecting we'd analyze and make an informed call on.

The idea there being that for "snapshot" we will run frequently and not want to overwrite images, hence giving them a sha tag.

I believe we want both. e.g. the non-sha label convenience of doing docker run --rm docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT to run the latest snapshot is tremendous, not requiring us to know what the latest sha was.
But we also want to have the sha-tagged version too, especially for debugging/investigation, to ensure we're running the right build.

The main thing i'm thinking of here is that as a consumer of the image I would want to use a tag like docker.elastic.co/logstash/logstash-observability-sre:8.19.0 in my deployment. But maybe i'm not following exactly what you were getting at with naming.

Currently our non fips images go through three steps:

  1. snapshot daily builds
  2. staging builds (release candidates)
  3. release candidate promotion to GA

For snapshot daily builds I'd expect the tag to have -SNAPSHOT at the end and, as mentioned above, both a sha and a non-sha label, for convenience:

docker.elastic.co/elastic/logstash-observability-sre:9.0.0-387e2120-SNAPSHOT # unique
docker.elastic.co/elastic/logstash-observability-sre:9.0.0-SNAPSHOT # rewritten every day

For staging builds I'd expect a -sha, non-SNAPSHOT tag:

docker.elastic.co/staging/logstash-observability-sre:9.0.0-34f198f6 # unique

For promotion of RCs to GA, I'd expect a final job to add an extra non-sha label to the latest staging build:

docker.elastic.co/staging/logstash-observability-sre:9.0.0 # same image as "9.0.0-34f198f6"

Regarding architecture... Currently I'm building and pushing for each architecture; however, this creates a race condition where the slowest job overwrites the image in the registry. To get around this for now, I've added the architecture to the name of the image tag. I'm not sure how the release manager handles this. One option would be to explicitly publish manifest information after the images exist, but I'm not sure how to coordinate that across machines at this point.

We should be able to push the individual arch-named images upstream and then create a manifest for the generic label encompassing the arch-named images, as described in https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/, in the section "The hard way with docker manifest":

To begin our journey, we’ll first need to build and push the images for each architecture to the Docker Hub. We will then combine all these images in a manifest list referenced by a tag.

# AMD64
$ docker build -t your-username/multiarch-example:manifest-amd64 --build-arg ARCH=amd64/ .
$ docker push your-username/multiarch-example:manifest-amd64

# ARM32V7
$ docker build -t your-username/multiarch-example:manifest-arm32v7 --build-arg ARCH=arm32v7/ .
$ docker push your-username/multiarch-example:manifest-arm32v7

# ARM64V8
$ docker build -t your-username/multiarch-example:manifest-arm64v8 --build-arg ARCH=arm64v8/ .
$ docker push your-username/multiarch-example:manifest-arm64v8

Now that we have built our images and pushed them, we are able to reference them all in a manifest list using the docker manifest command.

$ docker manifest create \
your-username/multiarch-example:manifest-latest \
--amend your-username/multiarch-example:manifest-amd64 \
--amend your-username/multiarch-example:manifest-arm32v7 \
--amend your-username/multiarch-example:manifest-arm64v8

Once the manifest list has been created, we can push it to Docker Hub.

$ docker manifest push your-username/multiarch-example:manifest-latest

If you now go to Docker Hub, you’ll be able to see the new tag referencing the images

This commit refactors the POC pipeline for pushing observability SRE containers
to handle conflicts for tags based on target architectures. Cells with
respective architectures build containers and push to the container registry
with a unique identifier. Once those exist we introduce a separate step to use
the docker manifest command to annotate those images such that a container
client can download the correct image based on architecture. As a result, for
every artifact there will be two images pushed (one per arch) and N manifests
pushed. The manifests will handle the final naming that the consumer would
expect.
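The manifest step described above can be sketched as follows. The docker commands are echoed rather than executed, since this is only an illustration of the sequence (the image name and sha match the example build discussed below, but the script itself is hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

IMAGE="docker.elastic.co/logstash/logstash-observability-sre"
VERSION="8.19.0-SNAPSHOT"
SHA="bbfa8d66b"

# The arch-specific images are assumed to already exist in the registry.
# For each consumer-facing tag, stitch the per-arch images into one manifest.
for manifest_tag in "${IMAGE}:${VERSION}" "${IMAGE}:${VERSION}-${SHA}"; do
  echo docker manifest create "${manifest_tag}" \
    --amend "${IMAGE}:${VERSION}-${SHA}-aarch64" \
    --amend "${IMAGE}:${VERSION}-${SHA}-x86_64"
  echo docker manifest push "${manifest_tag}"
done
```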
@donoghuc
Member Author

I refactored the workflow to build and push architecture specific images then use docker manifest to do the final tagging.

So for example in this build https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2579

The following containers are built and pushed:

  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b-aarch64
  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b-x86_64

Once those jobs are done a new step is added which adds the following manifests:

  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT
  • docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b

I verified locally:

➜  ~ docker pull docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
8.19.0-SNAPSHOT-bbfa8d66b: Pulling from logstash/logstash-observability-sre
Digest: sha256:f54c60bad4546a474efcaebc879d27af2af7b0c6a4d6503dc71ddc29589a9b4c
Status: Image is up to date for docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
➜  ~ docker manifest inspect docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66
no such manifest: docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66
➜  ~ docker manifest inspect docker.elastic.co/logstash/logstash-observability-sre:8.19.0-SNAPSHOT-bbfa8d66b
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3658,
         "digest": "sha256:f1e41783578a77d82e2c4cb049e5225f4f519b5f1cccf098b7578566c3fad589",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 3658,
         "digest": "sha256:ec09410dbd36347f025e381934364a39bf10b5c9cea7a42c0fd60414762ae56b",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      }
   ]
}

@donoghuc
Member Author

Regarding the naming... I'm not quite following why the sha would come before SNAPSHOT. I'm using the qualified version script to create the version. It seems like that is pretty standard, and it is responsible for adding SNAPSHOT from what I can tell.

So for the stage where we actually build and push a container we get a unique ID of [VERSION]-[SHA]-[ARCH].

Then in the manifest stage we construct the unique [VERSION]-[SHA] and [VERSION] (which will be the de facto "latest").

@jsvd
Member

jsvd commented Apr 1, 2025

Regarding the naming... I'm not quite following why the sha would come before snapshot. I'm using the qualified version script to create the version. It seems like that is pretty standard and it is responsible for adding the SNAPSHOT from what i can tell.

I was just following the current naming convention we use for the snapshot and staging builds, as seen in https://artifacts-snapshot.elastic.co/logstash/8.18.0-15c9af3b/summary-8.18.0-SNAPSHOT.html, for example:

Docker Image Tags
For snapshots docker images are available with these tags:

docker pull docker.elastic.co/<namespace>/<image-name>:8.18.0-15c9af3b-SNAPSHOT
docker pull docker.elastic.co/<namespace>/<image-name>:8.18.0-SNAPSHOT
docker pull docker.elastic.co/<namespace>/<image-name>:8.18-SNAPSHOT

Except for cloud, build candidate docker images are available in the staging namespace (authentication required) with these tags:

docker pull docker.elastic.co/staging/<image-name>:8.18.0-15c9af3b

We can see in these examples:

❯ docker manifest inspect docker.elastic.co/logstash/logstash:8.18.0-14175896-SNAPSHOT | jq '.manifests | map(.digest)'
[
  "sha256:d4f403fed4027eafc81b86aacadf4cec4a7a4cb4454faaf3654738f92a5bd9f6",
  "sha256:7dbf5baf8867109ecfc145c98c89eb339488182fa3f5d3eb4182fb50f3ef9979"
]

❯ docker manifest inspect docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT | jq '.manifests | map(.digest)'
[
  "sha256:d4f403fed4027eafc81b86aacadf4cec4a7a4cb4454faaf3654738f92a5bd9f6",
  "sha256:7dbf5baf8867109ecfc145c98c89eb339488182fa3f5d3eb4182fb50f3ef9979"
]

@donoghuc
Member Author

donoghuc commented Apr 1, 2025

I'm really struggling to understand this... I'll write out what I'm looking at and maybe you can help me wrap my head around it...

So, for a snapshot build (let's stick with 8.18) I see https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2580

This was triggered as a scheduled build for logstash branch 8.18 on sha 9f651f (9f651f6). In buildkite there is a "Summary link" provided: https://artifacts-snapshot.elastic.co/logstash/8.18.0-15c9af3b/summary-8.18.0-SNAPSHOT.html I don't understand the identifier in that link (15c9af3b; I don't see it anywhere in the job logs).

According to that link there should be the following images:

  1. docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT
  2. docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
  3. docker.elastic.co/logstash/logstash:8.18-SNAPSHOT
➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT
Error response from daemon: failed to resolve reference "docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT": docker.elastic.co/logstash/logstash:8.18.0-15c9af3b-SNAPSHOT: not found
➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
8.18.0-SNAPSHOT: Pulling from logstash/logstash
Digest: sha256:d66e89e068457f424650d77a4339deda865df9aac88f775ff5c3f946ac7ab5b5
Status: Image is up to date for docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
docker.elastic.co/logstash/logstash:8.18.0-SNAPSHOT
➜  logstash git:(POC-observability-sre-push) ✗ docker pull docker.elastic.co/logstash/logstash:8.18-SNAPSHOT
Error response from daemon: failed to resolve reference "docker.elastic.co/logstash/logstash:8.18-SNAPSHOT": docker.elastic.co/logstash/logstash:8.18-SNAPSHOT: not found

I do not understand where the 15c9af3b is coming from, and I don't see the images that are reported to be available.

I want to be able to follow a pattern for this, but I'm just not understanding the pattern to follow 😅 . As mentioned and implemented so far, the pattern I was proposing is to create two tags:

  1. The "unique" image: the qualified version with the logstash sha it was built from appended
  2. The "latest": just the qualified version.

The "qualified version" comes from the shared https://github.com/elastic/logstash/blob/main/.buildkite/scripts/common/qualified-version.sh script and the unique identifier is the sha of the commit the container is built from.
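A rough sketch of how the two proposed tags derive from the qualified version and build sha. The real logic lives in the qualified-version.sh script linked above; the variable names and the RELEASE flag here are purely illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

VERSION="8.19.0"
RELEASE="${RELEASE:-false}"   # snapshot builds append -SNAPSHOT
BUILD_SHA="297226b1d"         # sha of the commit the container is built from

QUALIFIED_VERSION="$VERSION"
if [ "$RELEASE" = "false" ]; then
  QUALIFIED_VERSION="${VERSION}-SNAPSHOT"
fi

echo "unique: logstash-observability-sre:${QUALIFIED_VERSION}-${BUILD_SHA}"
echo "latest: logstash-observability-sre:${QUALIFIED_VERSION}"
```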

In order to follow the existing tagging scheme more closely, this commit
refactors the naming for images to include the build sha BEFORE the SNAPSHOT
identifier. While this does not exactly follow the whole system that exists
today for container images in DRA, it follows a pattern that is more similar.
Ideally we can iterate to fold handling of this container into DRA, in which
case consumers would not need to update their patterns for identifying images.
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from d14ba95 to 676bf86 Compare April 2, 2025 01:31
@donoghuc
Member Author

donoghuc commented Apr 2, 2025

https://buildkite.com/elastic/logstash-dra-snapshot-pipeline/builds/2588 shows a build with the updated naming as discussed in slack.

Rename INCLUDE_SHA to INCLUDE_COMMIT_ID in qualified-version script.
Confine use of this argument to individual invocations instead of at the top level in scripts.
@donoghuc donoghuc force-pushed the POC-observability-sre-push branch from 97a3325 to bc03bde Compare April 3, 2025 19:31
This gates build/push for observability SRE containers on success of DRA pipeline.
@donoghuc
Member Author

donoghuc commented Apr 3, 2025

Refactored based on review and uncommented the depends_on constraint to publish to the registry only after successful publishing of the other DRA artifacts.


Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@elasticmachine
Collaborator

💚 Build Succeeded


@donoghuc donoghuc requested a review from yaauie April 7, 2025 19:01
Member

@yaauie yaauie left a comment


LGTM 👍🏼

@donoghuc donoghuc merged commit 5d8d640 into elastic:feature/fedramp-high-8.x Apr 7, 2025
7 checks passed
donoghuc added a commit that referenced this pull request Apr 10, 2025
…17541)

* Provision automatic test runs for ruby/java unit tests and integration tests with fips mode (#17029)

* Run ruby unit tests under FIPS mode

This commit shows a proposed pattern for running automated tests for logstash in
FIPS mode. It uses a new identifier in gradle for conditionally setting
properties to configure fips mode. The tests are run in a container
representative of the base image the final artifacts will be built from.

* Move everything from qa/fips -> x-pack

This commit moves test setup/config under x-pack dir.

* Extend test pipelines for fips mode to java unit tests and integration

* Add git to container for gradle

* move fips-mode gradle hooks to x-pack

* Skip license check for now

---------

Co-authored-by: Ry Biesemeyer <[email protected]>

* Split fips integration tests into two steps (#17038)

* Split fips integration tests into two steps

The integration tests suite takes about 40 minutes. This is far too slow for
reasonable feedback on a PR. This commit follows the pattern for the non-fips
integration tests whereby the tests are split into two sections that can run in
parallel across two steps. This should halve the feedback time.

The logic for getting a list of specs files to run has been extracted to a
shared shell script for use here and in the integration tests shell script.

* Use shared function for splitting integration tests

The logic for getting a list of specs to run has been extracted so that it can
be shared across fips and non fips integration test modes. This commit updates
the non fips integration tests to use the shared function.

* fix typo in helper name (kebab case, not snake)

* Escape $ so buildkite upload does not try to interpolate

* Wrap integration tests in shell script to avoid BK interpolation

* Move entrypoint for running integration tests inside docker

* Skip offline pack manager tests when running in fips mode (#17160)

This commit introduces a pattern for skipping tests we do not want to run in
fips mode. In this case the plugin manager tests rely on using
bundler/net-http/openssl which is not configured to be run with bouncycastle
fips providers.

* Get tests running in FIPS environment (#17096)

* Modify FIPS test runner environment for integration tests

This commit makes two small changes to the dockerfile used to define the fips
test environment. Specifically it adds curl (which is required by integration
tests), make (which is required by test setup), adds a c compiler (gcc and glibc
for integration tests which compile a small c program) and turns off debug ssl
logging as it is extremely noisy in logs and breaking some assumptions in
tests about logfile content.

Closes elastic/ingest-dev#5074

* Do not run test env as root

The elastic stack is not meant to be run as root. This commit updates the test
environment to provision a non-root user and have the container context execute
under that provisioned user.

Closes elastic/ingest-dev#5088

* Skip unit tests that reach out to rubygems for fips mode

The `update` test setup reaches out to rubygems with net/http which is
incompatible with our use of openssl in fips mode. This commit skips those tests
when running under fips.

See elastic/ingest-dev#5071

* Work around random data request limits in BCFIPS

This commit changes test setup to make chunked calls to random data generation
in order to work around a limit in fips mode.

See elastic/ingest-dev#5072 for details.

* Skip tests validating openssl defaults

Openssl will not be used when running under FIPS mode. The test setup and tests
themselves were failing when running in FIPS mode. This commit skips the tests
that are covering behavior that will be disabled.

See elastic/ingest-dev#5069

* Skip tests that require pluginmanager to install plugins

This commit skips tests that rely on using the pluginmanager to install plugins
during tests which require reaching out to rubygems.

See elastic/ingest-dev#5108

* Skip prepare offline pack integration tests in fips mode

The offline pack tests rely on the pluginmanager using the net-http library for
resolving deps. This will not operate under fips mode, so we skip them when
running in fips mode.

See elastic/ingest-dev#5109

* Ensure a gem executable is on path for test setup

This commit modifies the generate-gems script to ensure that a `gem` executable
is on the path. If there is not one on the test runner, then use the one bundled
with vendored jruby.

* Skip webserver specs when running in FIPS mode

This commit skips the existing webserver tests. We have some options and need to
understand some requirements for the webserver functionality for fips mode. The
 elastic/ingest-dev#5110 issue has a ton of details.

* Skip cli `remove` integration tests for FIPS

This commit skips tests that are running `remove` action for the pluginmanager.
These require reaching out to rubygems which is not available in FIPS mode.
These tests were added post initial integration tests scoping work but are
clearly requiring skips for FIPS mode.

* Add openssl package to FIPS testing env container

The setup script for filebeats requires an openssl executable. This commit
updates the testing container with this tool.

See elastic/ingest-dev#5107

* Re-introduce retries for FIPS tests now that we are in a passing state

* Backport 17203 and 17267 fedramp8x (#17271)

* Pluginmanager clean after mutate (#17203)

* pluginmanager: always clean after mutate

* pluginmanager: don't skip updating plugins installed with --version

* pr feedback

(cherry picked from commit 8c96913)

* Pluginmanager install preserve (#17267)

* tests: integration tests for pluginmanager install --preserve

* fix regression where pluginmanager's install --preserve flag didn't

* Add :skip_fips to update_spec.rb

* Run x-pack tests under FIPS mode (#17254)

This commit adds two new CI cells to cover x-pack tests running in FIPS mode.
This ensures we have coverage of these features when running existing x-pack
tests.

* observabilitySRE: docker rake tasks (#17272)

* observabilitySRE: docker rake tasks

* Apply suggestions from code review

Co-authored-by: Cas Donoghue <[email protected]>

* Update rakelib/plugin.rake

* Update rakelib/plugin.rake

* Update docker/Makefile

Co-authored-by: Cas Donoghue <[email protected]>

---------

Co-authored-by: Cas Donoghue <[email protected]>

* Ensure env2yaml dep is properly expressed in observabilitySRE task (#17305)

The `build-from-local-observability-sre-artifacts` task depends on the `env2yaml`
task. This was easy to miss in local development if other images had been built.
This commit updates the makefile to properly define that dependency.

* Add a smoke test for observability SRE container (#17298)

* Add a smoke test for observability SRE container

Add a CI cell to ensure the observability container is building successfully. In
order to show success, run a quick smoke test to point out any glaring issues.

This adds some general, low risk plugins for doing quick testing. This will help
developers in debugging as we work on this image.

* Show what is happening when rake fails

* Debug deeper in the stack

Show the stdout/stderr when shelling out fails.

* Debug layers of build tooling

Open3 is not capturing stdout for some reason. Capture it and print to see what is wrong in CI.

* Actually run ls command in docker container 🤦

* Update safe_system based on code review suggestion

* Dynamically generate version for container invocation

Co-authored-by: Ry Biesemeyer <[email protected]>

* Refactor smoke test setup to script

Avoid interpolation backflips with buildkite by extracting to a script.

* Split out message surfacing improvement to separate PR.

Moved to: #17310

* Extract version qualifier into standalone script

* Wait for version-qualifier.sh script to land upstream

Use  #17311 once it lands and gets
backported to 8.x. For now just hard code version.

---------

Co-authored-by: Ry Biesemeyer <[email protected]>

* Configure observability SRE container for FIPS (#17297)

This commit establishes a pattern for configuring the container to run in fips mode.

- Use chainguard-fips
- Copy over java properties from ls tar archive
- Convert default jks to BC keystore
- Configure logstash to use java properties and FIPS config

NOTE: this assumes bouncycastle jars are in the tarball. The
elastic/ingest-dev#5049 ticket will address that.

* Exclude plugin manager and keystore cli from observabilitySRE artifact (#17375)

* Conditionally install bcfips jars when building/testing observabilitySRE (#17359)

* Conditionally install bcfips jars when building for observabilitySRE

This commit implements a pattern for performing specific gradle tasks based on a
newly named "fedrampHighMode" option. This option is used to configure tests to
run with additional configuration specific to the observabilitySRE use case.
Similarly the additional jar dependencies for bouncycastle fips providers are
conditionally installed gated on the "fedrampHighMode" option.

In order to ensure that the "fedrampHighMode" option persists through the layers
of sub-processes spawned between gradle and rake, we store and respect an
environment variable FEDRAMP_HIGH_MODE. This may be useful generally in building
the docker image.

Try codereview suggestion

* Use gradle pattern for setting properties with env vars

Gradle has a mechanism for setting properties with environment variables
prefixed with `ORG_GRADLE_PROJECT`. This commit updates the gradle tasks to use
that pattern.

See
https://docs.gradle.org/current/userguide/build_environment.html#setting_a_project_property
for details.
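As a minimal sketch of that convention (the property name `fedrampHighMode` is the one described above; the gradle invocation itself is illustrative):

```shell
# Gradle maps environment variables named ORG_GRADLE_PROJECT_<property>
# onto project properties, equivalent to passing -P<property> on the CLI.
export ORG_GRADLE_PROJECT_fedrampHighMode=true

# A task launched now would see project property fedrampHighMode=true, e.g.:
# ./gradlew assemble

echo "fedrampHighMode=${ORG_GRADLE_PROJECT_fedrampHighMode}"
```

This is what lets the setting survive the gradle-to-rake sub-process layering without every script re-plumbing a flag.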

* Pull in latest commits from 8.x and update based on new patterns (#17385)

* Fix empty node stats pipelines (#17185) (#17197)

Fixed an issue where the `/_node/stats` API displayed empty pipeline metrics
when X-Pack monitoring was enabled

(cherry picked from commit 8678581)

Co-authored-by: kaisecheng <[email protected]>

* Update z_rubycheck.rake to no longer inject Xmx1g (#17211)

This allows the environment variable JRUBY_OPTS to be used for setting properties like Xmx
original pr: #16420

(cherry picked from commit f562f37)

Co-authored-by: kaisecheng <[email protected]>

* Improve warning for insufficient file resources for PQ max_bytes (#16656) (#17222)

This commit refactors the `PersistedQueueConfigValidator` class to provide a
more detailed, accurate and actionable warning when pipeline's PQ configs are at
risk of running out of disk space. See
#14839 for design considerations. The
highlights of the changes include accurately determining the free resources on a
filesystem disk and then providing a breakdown of the usage for each of the
paths configured for a queue.

(cherry picked from commit 0621544)

Co-authored-by: Cas Donoghue <[email protected]>

* gradle task migrate to the new artifacts-api (#17232) (#17236)

This commit migrates gradle task to the new artifacts-api

- remove dependency on staging artifacts
- all builds use snapshot artifacts
- resolve version from current branch, major.x, previous minor,
   with priority given in that order.

Co-authored-by: Andrea Selva <[email protected]>
(cherry picked from commit 0a74568)

Co-authored-by: kaisecheng <[email protected]>

* tests: ls2ls delay checking until events have been processed (#17167) (#17252)

* tests: ls2ls delay checking until events have been processed

* Make sure upstream sends expected number of events before checking the expectation with downstream. Remove unnecessary or duplicated logics from the spec.

* Add exception handling in `wait_for_rest_api` to make wait for LS REST API retriable.

---------

Co-authored-by: Mashhur <[email protected]>
Co-authored-by: Mashhur <[email protected]>
(cherry picked from commit 73ffa24)

Co-authored-by: Ry Biesemeyer <[email protected]>

* Additional cleanify changes to ls2ls integ tests (#17246) (#17255)

* Additional cleanify changes to ls2ls integ tests: replace heartbeat-input with reload option, set queue drain to get consistent result.

(cherry picked from commit 1e06eea)

Co-authored-by: Mashhur <[email protected]>

* [8.x] Reimplement LogStash::Numeric setting in Java (backport #17127) (#17273)

This is an automatic backport of pull request #17127 done by [Mergify](https://mergify.com).

----

* Reimplement LogStash::Numeric setting in Java (#17127)

Reimplements `LogStash::Setting::Numeric` Ruby setting class into the `org.logstash.settings.NumericSetting` and exposes it through `java_import` as `LogStash::Setting::NumericSetting`.
Updates the rspec tests:
- verifies that `java.lang.IllegalArgumentException` is thrown instead of `ArgumentError`, because that is the kind of exception thrown by the Java code during verification.

(cherry picked from commit 07a3c8e)

* Fixed reference of SettingNumeric class (on main modules were removed)

---------

Co-authored-by: Andrea Selva <[email protected]>

* [CI] Health report integration tests use the new artifacts-api (#17274) (#17277)

migrate to the new artifacts-api

(cherry picked from commit feb2b92)

Co-authored-by: kaisecheng <[email protected]>

* Backport 17203 and 17267 8.x (#17270)

* Pluginmanager clean after mutate (#17203)

* pluginmanager: always clean after mutate

* pluginmanager: don't skip updating plugins installed with --version

* pr feedback

(cherry picked from commit 8c96913)

* Pluginmanager install preserve (#17267)

* tests: integration tests for pluginmanager install --preserve

* fix regression where pluginmanager's install --preserve flag didn't

* [Backport 8.x] benchmark script (#17283)

This commit cherry-picked the missing benchmark script PRs.
The deprecated artifacts-api is removed.

[CI] benchmark uses the new artifacts-api (#17224)
[CI] benchmark readme (#16783)
Introduce a new flag to explicitly permit legacy monitoring (#16586) (Only take the benchmark script)
[ci] fix wrong queue type in benchmark marathon (#16465)
[CI] fix benchmark marathon (#16447)
[CI] benchmark dashboard and pipeline for testing against multiple versions (#16421)

* Fix pqcheck and pqrepair on Windows (#17210) (#17259)

A recent change to pqcheck attempted to address an issue where pqcheck
would not run on Windows machines when located in a folder containing
a space, such as "C:\program files\elastic\logstash". While this fixed an
issue with spaces in folders, it introduced a new issue related to Java options,
and the pqcheck was still unable to run on Windows.

This PR attempts to address the issue by removing the quotes around the Java options,
which caused the option parsing to fail, and instead removes the explicit setting of
the classpath - the use of `set CLASSPATH=` in the `:concat` function is sufficient
to set the classpath, and should also fix the spaces issue

Fixes: #17209
(cherry picked from commit ba5f215)

Co-authored-by: Rob Bavey <[email protected]>

* Shareable function for partitioning integration tests (#17223) (#17303)

For the fedramp high work https://github.com/elastic/logstash/pull/17038/files a
use case for multiple scripts consuming the partitioning functionality emerged.
As we look to more advanced partitioning we want to ensure that the
functionality will be consumable from multiple scripts.

See #17219 (comment)

(cherry picked from commit d916972)

Co-authored-by: Cas Donoghue <[email protected]>

* [8.x] Surface failures from nested rake/shell tasks (backport #17310) (#17317)

* Surface failures from nested rake/shell tasks (#17310)

Previously when rake would shell out the output would be lost. This
made debugging CI logs difficult. This commit updates the stack with
improved message surfacing on error.

(cherry picked from commit 0d931a5)

# Conflicts:
#	rubyUtils.gradle

* Extend ruby linting tasks to handle file inputs (#16660)

This commit extends the gradle and rake tasks to pass through a list of files
for rubocop to lint. This allows more specificity and fine grained control for
linting when the consumer of the tasks only wishes to lint a select few files.

* Ensure shellwords library is loaded

Without this depending on task load order `Shellwords` may not be available.

---------

Co-authored-by: Cas Donoghue <[email protected]>

* Forward Port of Release notes for `8.16.5` and `8.17.3` (#17187), (#17188) (#17266) (#17321)

* Forward Port of Release notes for 8.17.3 (#17187)

* Update release notes for 8.17.3

---------

Co-authored-by: logstashmachine <[email protected]>
Co-authored-by: Rob Bavey <[email protected]>

* Forward Port of Release notes for 8.16.5 (#17188)

* Update release notes for 8.16.5

---------

Co-authored-by: logstashmachine <[email protected]>
Co-authored-by: Rob Bavey <[email protected]>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: logstashmachine <[email protected]>
(cherry picked from commit 63e8fd1)

Co-authored-by: Rob Bavey <[email protected]>

* Add Deprecation tag to arcsight module (#17331)

* [8.x] Upgrade elasticsearch-ruby client. (backport #17161) (#17306)

* Upgrade elasticsearch-ruby client. (#17161)

* Fix Faraday removed basic auth option and apply the ES client module name change.

(cherry picked from commit e748488)

* Apply the required changes in elasticsearch_client.rb after upgrading the elasticsearch-ruby client to 8.x

* Swallow the exception and make a non-connectable client when the ES client raises a connection-refused exception.

---------

Co-authored-by: Mashhur <[email protected]>
Co-authored-by: Mashhur <[email protected]>

* Removed unused configHash computation that can be replaced by PipelineConfig.configHash() (#17336) (#17345)

Removed the unused configHash computation in AbstractPipeline (used only in tests), replacing it with a PipelineConfig.configHash() invocation

(cherry picked from commit 787fd2c)

Co-authored-by: Andrea Selva <[email protected]>

* Use org.logstash.common.Util to hashing by default to SHA256 (#17346) (#17352)

Removes the usage of Apache Commons Codec MessageDigest in favor of the internal Util class, which provides the hashing methods.

(cherry picked from commit 9c0e50f)

Co-authored-by: Andrea Selva <[email protected]>

* Added test to verify the int overflow happen (#17353) (#17354)

Use long instead of int type to keep the length of the first token.

The size limit validation requires summing two integers: the length of the chars accumulated so far plus the length of the next fragment's head part. If either of the two sizes is close to the max integer value, the sum overflows and could make the test fail: https://github.com/elastic/logstash/blob/9c0e50faacc4700da3dc84a3ba729b84bff860a8/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java#L123.

To fall into this case, sizeLimit must be bigger than 2^31 bytes (2GB), and data fragments without any line delimiter must be pushed to the tokenizer with a total size close to 2^31 bytes.

(cherry picked from commit afde43f)

Co-authored-by: Andrea Selva <[email protected]>

* [8.x] add ci shared qualified-version script (backport #17311) (#17348)

* add ci shared qualified-version script (#17311)

* ci: add shareable script for generating qualified version

* ci: use shared script to generate qualified version

(cherry picked from commit 10b5a84)

# Conflicts:
#	.buildkite/scripts/dra/build_docker.sh

* resolve merge conflict

---------

Co-authored-by: Rye Biesemeyer <[email protected]>

* tests: make integration split quantity configurable (#17219) (#17367)

* tests: make integration split quantity configurable

Refactors shared splitter bash function to take a list of files on stdin
and split into a configurable number of partitions, emitting only those from
the currently-selected partition to stdout.

Also refactors the only caller in the integration_tests launcher script to
accept an optional partition_count parameter (defaulting to `2` for backward-
compatibility), to provide the list of specs to the function's stdin, and to
output relevant information about the quantity of partition splits and which
was selected.
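A stdin-based splitter along those lines can be sketched as follows (the function name and modulo strategy here are illustrative, not necessarily the exact logic of the shared CI helper):

```shell
# Read a list of spec files on stdin; emit only the lines belonging to the
# selected partition (1-based), out of partition_count round-robin partitions.
partition() {
  local partition_count="$1" selected="$2"
  awk -v n="$partition_count" -v k="$selected" '(NR - 1) % n == (k - 1)'
}

# Example: 5 specs split into 3 partitions; partition 1 gets lines 1 and 4.
printf 'spec_a\nspec_b\nspec_c\nspec_d\nspec_e\n' | partition 3 1
# → spec_a
# → spec_d
```

Because the file list arrives on stdin and the partition count is a parameter, any caller (integration tests, FIPS runners, etc.) can reuse the same function.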

* ci: run integration tests in 3 parts

(cherry picked from commit 3e0f488)

Co-authored-by: Rye Biesemeyer <[email protected]>

* Update buildkite with new patterns from 8.x

This commit updates the buildkite definitions to be compatible with the
upstream 8.x branch. Specifically:
 - Split integration tests for fips into 3 runners.
 - Use the new shared bash helper for computing QUALIFIED_VERSION

It also continues standardization of using a "fedrampHighMode" option for indicating
that the tests should run in the context of our custom image for the SRE team.

* Bug fix: Actually use shared integration_tests.sh file

After refactoring to use the same script, I forgot to actually use it
in the buildkite definition...

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kaisecheng <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Mashhur <[email protected]>
Co-authored-by: Andrea Selva <[email protected]>
Co-authored-by: Rob Bavey <[email protected]>
Co-authored-by: Mashhur <[email protected]>

* Pin rubocop-ast development gem due to new dep on prism (#17407) (#17433)

The rubocop-ast gem just introduced a new dependency on prism.
 - https://rubygems.org/gems/rubocop-ast/versions/1.43.0

In our install default gems rake task we are seeing issues trying to build native
extensions. Upstream JRuby is seeing a similar problem (at least it is the same
failure mode: jruby/jruby#8415).

This commit pins rubocop-ast to 1.42.0, which is the last version that did not
have an explicit prism dependency.

(cherry picked from commit 6de59f2)

Co-authored-by: Cas Donoghue <[email protected]>

* Add age filter fedramp (#17434)

* net-zero-change refactor

* add logstash-filter-age to observabilitySRE artifact

* Add licenses for bouncycastle fips jars (#17406)

This commit adds licences for bouncycastle jars that are added for the
observability SRE container artifact. It re-enables the previously disabled
license check and adds a new one running in fips mode.

* Publish Observability SRE images to internal container registry (#17401)

* POC for publishing observability SRE images

This commit adds a step to the pull_request_pipeline buildkite definition to
push a docker image to the elastic container registry. It is added here to show
that we have the proper creds etc in CI to push the container where it needs to
go. We will likely move this into the DRA pipeline once we are confident it is
pushing to the correct place with a naming convention that works for all
consumers/producers.

The general idea is to build the container with our gradle task, then once we
have that image we can tag it with the git sha and a "latest" identifier. This
would allow consumers to choose between an exact sha for a stream like 8.19.0 or
the "latest". I will also need to factor in the case where we have the tag
*without* the sha postfix. Obviously we will want to fold this in to the existing DRA
pipeline for building/staging images but for now it seems reasonable to handle
this separately.
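The tagging idea can be sketched like this (the registry path and image name below are assumptions for illustration, not the exact values the pipeline uses):

```shell
# Compose a commit-pinned tag and a "latest" tag for the same image.
REGISTRY="docker.elastic.co/logstash"       # hypothetical registry path
IMAGE="logstash-observability-sre"          # hypothetical image name
VERSION="8.19.0"
GIT_SHA="abc1234"                           # normally: git rev-parse --short HEAD

SHA_TAG="${REGISTRY}/${IMAGE}:${VERSION}-${GIT_SHA}"
LATEST_TAG="${REGISTRY}/${IMAGE}:latest"

echo "$SHA_TAG"
echo "$LATEST_TAG"
# docker tag "$IMAGE" "$SHA_TAG" && docker push "$SHA_TAG"   # needs registry creds
```

Consumers pinning a stream like 8.19.0 use the sha-qualified tag; those tracking tip use `latest`.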

* check variable resolution

* Move POC code into DRA pipeline

This commit takes the POC from the pull_request_pipeline and adds it to the DRA
pipeline. Notably, we take care not to disrupt anything about the existing DRA
pipeline by making this wait until after the artifacts are published and by setting
a soft_fail. While this is being introduced and stabilized we want to ensure the
existing DRA pipeline continues to work without interruption. As we gain more
stability we can look at a tighter integration.

* Disambiguate architectures

Eventually we will want to do proper annotations with manifests but for now
just add arch to the tag.

* Use docker manifest for multi-architecture builds

This commit refactors the POC pipeline for pushing observability SRE containers
to handle conflicts for tags based on target architectures. Cells with
respective architectures build containers and push to the container registry
with a unique identifier. Once those exist we introduce a separate step to use
the docker manifest command to annotate those images such that a container
client can download the correct image based on architecture. As a result for
every artifact there will be 2 images pushed (one for each arch) and N manifests
pushed. The manifests will handle the final naming that the consumer would
expect.
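The flow described above could look roughly like this (tag names are illustrative; the real docker push/manifest commands, commented out here, require registry credentials):

```shell
# Each cell pushes an arch-suffixed image; a follow-up step stitches them
# into one manifest list under the consumer-facing tag.
BASE="docker.elastic.co/logstash/logstash-observability-sre:8.19.0-abc1234-SNAPSHOT"

for arch in amd64 arm64; do
  echo "would push: ${BASE}-${arch}"
  # docker push "${BASE}-${arch}"
done

# docker manifest create "$BASE" "${BASE}-amd64" "${BASE}-arm64"
# docker manifest annotate "$BASE" "${BASE}-arm64" --arch arm64
# docker manifest push "$BASE"
```

A client pulling `$BASE` then receives the image matching its own architecture, without needing to know the arch-suffixed tags exist.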

* Refactor docker naming scheme

In order to follow the existing tagging scheme more closely, this commit
refactors the naming for images to include the build sha BEFORE the SNAPSHOT
identifier. While this does not exactly follow the whole system that exists
today for container images in DRA, it follows a pattern that is more similar.
Ideally we can iterate to fold handling of this container into DRA, in which
case consumers would not need to update their patterns for identifying images.
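The resulting ordering can be shown in a couple of lines (the version and commit id are placeholder values; the real ones come from the qualified-version script):

```shell
# Commit id sits between the version and the SNAPSHOT qualifier.
VERSION="8.19.0"
COMMIT_ID="abc1234"
QUALIFIED="${VERSION}-${COMMIT_ID}-SNAPSHOT"
echo "$QUALIFIED"   # → 8.19.0-abc1234-SNAPSHOT
```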

* Code review refactor

Rename INCLUDE_SHA to INCLUDE_COMMIT_ID in qualified-version script.
Confine use of this argument to individual invocations instead at top level in scripts.

* Build observabilitySRE containers after DRA is published

This gates build/push for observability SRE containers on success of DRA pipeline.

* x-pack: add fips validation plugin from x-pack (#16940)

* x-pack: add fips_validation plugin to be included in fips builds

The `logstash-integration-fips_validation` plugin provides no runtime
pipeline plugins, but instead provides hooks to ensure that the logstash
process is correctly configured for compliance with FIPS 140-3.

It is installed while building the observabilitySRE artifacts.

* fips validation: ensure BCFIPS,BCJSSE,SUN are first 3 security providers

* remove re-injection of BCFIPS jars

* Update lib/bootstrap/rubygems.rb

* add integration spec for fips_validation plugin

* add missing logstash_plugin helper

* fixup

* skip non-fips spec on fips-configured artifact, add spec details

* Improve smoke tests for observability SRE image (#17486)

* Improve smoke tests for observability SRE image

This commit adds a new rspec test to run the observability SRE container in a
docker compose network with filebeat and elasticsearch. It uses some simple test
data through a pipeline with plugins we expect to be used in production. The
rspec tests will ensure the test data is flowing from filebeat to logstash to
elasticsearch by querying elasticsearch for expected transformed data.

* REVERT ME: debug what's going on in CI :(

* Run filebeat container as root

* Work around strict file ownership perms for filebeat

We add the filebeat config in a volume; the permissions checks fail due to the
test runner not being a root user. This commit disables that check in filebeat,
as seems to be the consensus solution online, for example: https://event-driven.io/en/tricks_on_how_to_set_up_related_docker_images/

* Dynamically generate PKI instead of checking it in

Instead of checking in PKI, dynamically generate it with a gradle task when
starting containers and running the tests. This improvement avoids GitHub
warnings about checked-in keys and avoids expiration headaches. Generation is very
fast and does not add any significant overhead to test setup.
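Throwaway test certificates of this kind can be generated with a single openssl invocation (file names and subject below are illustrative, not the gradle task's actual output):

```shell
# Generate a short-lived self-signed cert + key for test-only TLS.
# -nodes leaves the key unencrypted so containers can read it without a passphrase.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout test-key.pem -out test-cert.pem \
  -days 1 -subj "/CN=localhost" 2>/dev/null

ls test-cert.pem test-key.pem
```

Because the material is regenerated per run, nothing sensitive lands in git and expiry can be as short as a day.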

* Remove use of "should" in rspec docstrings

see https://github.com/rubocop/rspec-style-guide?tab=readme-ov-file#should-in-example-docstrings

* Ensure permissions readable for volume

Now that certs are dynamically generated, ensure they are able to be read in container

* Use elasticsearch-fips image for smoke testing

* Add git ignore for temp certs

* Fix naming convention for integration tests

Co-authored-by: Rye Biesemeyer <[email protected]>

* Use parameter expansion for FEDRAMP_HIGH_MODE

Co-authored-by: Rye Biesemeyer <[email protected]>

* Use parameter expansion for FEDRAMP_HIGH_MODE

Co-authored-by: Rye Biesemeyer <[email protected]>

* Use parameter expansion for FEDRAMP_HIGH_MODE

Co-authored-by: Rye Biesemeyer <[email protected]>
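The parameter-expansion change referred to in these review commits follows the common default-value idiom (the variable name is from this PR; the `false` default shown is an assumption for illustration):

```shell
# "${VAR:-default}" substitutes a default when VAR is unset or empty,
# which keeps CI scripts safe under `set -u`.
unset FEDRAMP_HIGH_MODE
echo "mode=${FEDRAMP_HIGH_MODE:-false}"   # → mode=false

FEDRAMP_HIGH_MODE=true
echo "mode=${FEDRAMP_HIGH_MODE:-false}"   # → mode=true
```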

---------

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: kaisecheng <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mashhur <[email protected]>
Co-authored-by: Andrea Selva <[email protected]>
Co-authored-by: Rob Bavey <[email protected]>
Co-authored-by: Mashhur <[email protected]>

NOTE: we decided to squash these commits as the feature branch had cherry-picks (and squashed change sets 182f15e ) from 8.x which would potentially make the commit history confusing. We determined that the benefit of having individual commits from the feature branch was outweighed by the potentially confusing git history. This will also make porting this work to other streams simpler.