-
Notifications
You must be signed in to change notification settings - Fork 457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add px/agent_status_diagnostics pxl script #2064
Merged
ddelnano
merged 1 commit into
pixie-io:main
from
ddelnano:ddelnano/add-agent-status-diagnostics-script
Jan 6, 2025
Merged
Add px/agent_status_diagnostics pxl script #2064
ddelnano
merged 1 commit into
pixie-io:main
from
ddelnano:ddelnano/add-agent-status-diagnostics-script
Jan 6, 2025
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Dom Del Nano <[email protected]>
8 tasks
@pixie-io/vizier this is ready for review when you have the chance! |
aimichelle
approved these changes
Jan 6, 2025
ddelnano
added a commit
that referenced
this pull request
Jan 6, 2025
…ing kernel headers (#2065) Summary: Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers This PR leverages the script added in #2064 to detect missing kernel headers during cli deploys and `px collect-logs` commands. This solves 2/3 of the use cases I was hoping to identify for #2051 (the last being helm installs). A recent example of this problem is #1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught. Relevant Issues: #2051 Type of change: /kind feature Test Plan: Verified the following scenarios <details><summary>Test cases</summary> - [x] `px collect-logs` works against a cloud that doesn't have a `px/agent_status_diagnostics` script ``` $ bazel run -c opt --stamp src/pixie_cli:px -- collect-logs WARN[0006] healthcheck script detected the following warnings: error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes." Logs written to pixie_logs_20241223165214.zip # zip file contains px/agent_status output $ cat px_agent_diagnostics.txt {"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true} ``` - [x] `px collect-logs` works against a cloud that does have a `px/agent_status_diagnostics` script ``` $ bazel run src/pixie_cli:px -- collect-logs INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured). INFO: Found 1 target... Target //src/pixie_cli:px up-to-date: bazel-bin/src/pixie_cli/px_/px INFO: Elapsed time: 4.240s, Critical Path: 3.89s INFO: 3 processes: 1 internal, 2 linux-sandbox. INFO: Build completed successfully, 3 total actions INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs Pixie CLI ******************************* * ENV VARS * PX_CLOUD_ADDR=testing.getcosmic.ai:443 ******************************* Logs written to pixie_logs_20241218164734.zip $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":1} ``` - [x] `px collect-logs` identifies when kernel headers are missing when `px/agent_status_diagnostics` present ``` $ Logs written to pixie_logs_20241223165214.zip $ bazel run -c opt --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs [ ... ] WARN[0012] healthcheck script detected the following warnings: error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes." $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":0.5} ``` - [x] Artificially forcing context deadline (timeout) results in an error ``` $ git diff diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go index 7d3b7e008..c957b8943 100644 --- a/src/pixie_cli/pkg/vizier/script.go +++ b/src/pixie_cli/pkg/vizier/script.go @@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus execScript = br.MustGetScript(script.AgentStatusScript) } - ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) $ bazel run src/pixie_cli:px -- collect-logs WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script error="context deadline exceeded" Logs written to pixie_logs_20241218165033.zip ``` - [x] `px collect-logs` prompts auth flow when credentials don't match current cloud ``` $ PX_CLOUD_ADDR=new-cloud bazel run src/pixie_cli:px -- collect-logs ******************************* * ENV VARS * PX_CLOUD_ADDR=new-cloud ******************************* Failed to authenticate. Please retry `px auth login`. ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. [ ...] ``` - [x] `px deploy` on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing ``` $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0 [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. ``` </details> Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs` commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers. Signed-off-by: Dom Del Nano <[email protected]>
ddelnano
added a commit
to ddelnano/pixie
that referenced
this pull request
Jan 17, 2025
…ing kernel headers (pixie-io#2065) Summary: Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers This PR leverages the script added in pixie-io#2064 to detect missing kernel headers during cli deploys and `px collect-logs` commands. This solves 2/3 of the use cases I was hoping to identify for pixie-io#2051 (the last being helm installs). A recent example of this problem is pixie-io#1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught. Relevant Issues: pixie-io#2051 Type of change: /kind feature Test Plan: Verified the following scenarios <details><summary>Test cases</summary> - [x] `px collect-logs` works against a cloud that doesn't have a `px/agent_status_diagnostics` script ``` $ bazel run -c opt --stamp src/pixie_cli:px -- collect-logs WARN[0006] healthcheck script detected the following warnings: error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes." Logs written to pixie_logs_20241223165214.zip # zip file contains px/agent_status output $ cat px_agent_diagnostics.txt {"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true} ``` - [x] `px collect-logs` works against a cloud that does have a `px/agent_status_diagnostics` script ``` $ bazel run src/pixie_cli:px -- collect-logs INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured). INFO: Found 1 target... Target //src/pixie_cli:px up-to-date: bazel-bin/src/pixie_cli/px_/px INFO: Elapsed time: 4.240s, Critical Path: 3.89s INFO: 3 processes: 1 internal, 2 linux-sandbox. INFO: Build completed successfully, 3 total actions INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs Pixie CLI ******************************* * ENV VARS * PX_CLOUD_ADDR=testing.getcosmic.ai:443 ******************************* Logs written to pixie_logs_20241218164734.zip $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":1} ``` - [x] `px collect-logs` identifies when kernel headers are missing when `px/agent_status_diagnostics` present ``` $ Logs written to pixie_logs_20241223165214.zip $ bazel run -c opt --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs [ ... ] WARN[0012] healthcheck script detected the following warnings: error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes." $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":0.5} ``` - [x] Artificially forcing context deadline (timeout) results in an error ``` $ git diff diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go index 7d3b7e008..c957b8943 100644 --- a/src/pixie_cli/pkg/vizier/script.go +++ b/src/pixie_cli/pkg/vizier/script.go @@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus execScript = br.MustGetScript(script.AgentStatusScript) } - ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) $ bazel run src/pixie_cli:px -- collect-logs WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script error="context deadline exceeded" Logs written to pixie_logs_20241218165033.zip ``` - [x] `px collect-logs` prompts auth flow when credentials don't match current cloud ``` $ PX_CLOUD_ADDR=new-cloud bazel run src/pixie_cli:px -- collect-logs ******************************* * ENV VARS * PX_CLOUD_ADDR=new-cloud ******************************* Failed to authenticate. Please retry `px auth login`. ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. [ ...] ``` - [x] `px deploy` on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing ``` $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0 [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. ``` </details> Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs` commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers. Signed-off-by: Dom Del Nano <[email protected]> (cherry picked from commit 3c9c4bd)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary: Add px/agent_status_diagnostics pxl script
This PR adds a new pxl script that helps identify if linux headers are missing on any PEMs. This can be extended in the future, but the first use cause will be to execute the script during
px deploy
andpx collect-logs
.Relevant Issues: #2051
Type of change: /kind feature
Test Plan: Ran the script in the UI and tested it as part of the validation for #2065
Changelog Message: Add
px/agent_status_diagnostics
pxl script for checking common issues