Go TLS tracing not working on Amazon Linux 2023 #1986
Comments
@SamuraiPrinciple thanks for reporting this. I'm aiming to try to reproduce the issue later this week and will report back when I have more information. |
Hi @ddelnano - many thanks for the quick reply, it really is appreciated. Do let me know if I could be of any help (more logs, screenshare, ...). Take care and wish you a great day. |
This is still on my radar. I'm expecting to have time to look into this at the end of the week. |
That's great, thanks for the update. I've tried with the latest pixie version (0.14.11) and it's still not working. |
Hi @ddelnano hope all is well :) Wondering if you've managed to reproduce the issue? Happy to provide more details (EKS AMIs, pixie deployment config and sample go app) in case you didn't manage to get a repro... Many thanks, |
@SamuraiPrinciple sorry for failing to follow up on this! I'll be giving this some attention tomorrow or Thursday. |
No, no, no problem at all - just let me know if you're struggling to reproduce it, very happy to make it as easy as possible for you. Thanks for the reply and your hard work! |
Since our existing Go TLS tracing test hasn't been updated for Go 1.22 yet, I wanted to rule out any obvious incompatibility. We will eventually have that version under test, but until that is upstream I've created this branch which adds coverage for Go 1.22. The test passes on that branch, which is a good sign. My next step is to reproduce the issue on EKS w/ AL2023. @SamuraiPrinciple while I haven't tried the EKS steps yet, are you able to share an example Go application that exhibits the problem, along with the compiler version? |
Hi, thanks for picking this up! I will provide you with the sample app later, but just as quick note - if I rotate EKS nodes so that they use AL2 AMIs, the tracing then works with everything else being exactly the same. |
No problem! Looking forward to seeing the example app as I didn't have luck with the reproduction on EKS. |
The repo contains a simple go program that polls https://example.com as well as a Dockerfile that can be used to build an image. There is also a very simple k8s deployment manifest that runs two containers (one for HTTP 1.1 and the other for HTTP/2). |
Thanks for the quick response. Does the HTTP/2 case work on AL2? Pixie's HTTP/2 tracing should only work for GRPC and when Go binaries have debug symbols (docs). As for the HTTP 1.1 case, can you explain more on the success case (when running on AL2)? I crafted a sample application very similar to yours and didn't see any traffic for AL2 or AL2023. I didn't consider my case a reproduction since I thought it was more similar to #899, which was never fully understood. I'll be able to continue debugging this on my end, but I'd still like to understand the AL2 case to help narrow down where the problem might be. |
Hi, mega-thanks for the quick reply! To be honest, we're not using HTTP/2 (i.e. we're actively disabling it - long story, but it's problematic for various reasons, unrelated to pixie), so I've only included that 'just in case'. I can confirm that on AL2 I can indeed see payloads for outbound TLS requests (HTTP 1.1). |
For your AL2 environment, can you run the PEM with You can add that arg to the DaemonSet even though the PID won't match on the other instances. Just make sure to grab the logs from the instance that has the application with the PID. |
I'm deploying pixie using a helm chart - do you happen to know what the quickest way is to pass an argument to pem daemonset? or do I have to manually update the pem daemonset? |
I think I've got it... |
Ok, I've tracked down the issue. The get_goid function is broken on AL23. I made the following changes to print out all the variables and here is the output that I'm seeing.
|
@SamuraiPrinciple can you try installing the I was able to verify that the issue above is because Pixie's prepackaged headers (which are used if kernel headers aren't available on the host) are causing this thread lookup to fail. The fsbase is essential for locating the goid, so incorrectly reading that value breaks tracing. Essentially, AL23 has enough kernel differences (backports, etc.) compared to vanilla Linux that Pixie's prepackaged headers aren't compatible. Our docs mention that installing Linux headers is highly recommended, although we haven't had a good way to surface this to end users. Unfortunately, these kernel header incompatibilities manifest in strange ways, so it's usually time-consuming to identify the problem. Coincidentally, I've been working on #2051 and it is getting close to making it into a release. I intend to make that check part of the Pixie cli and helm install process in addition to the |
Hi! Many thanks for the investigation and the info. I can confirm that installing linux headers package indeed resolves the issue. The HTTP 1.1 tls connections can be traced (I have noticed that gzipped responses are not rendered in UI - "resp_body: <removed: non-text content-type>,"). Happy to close this issue. |
No problem, sorry it took some time for me to dig into this and greatly appreciate your help throughout the process! I'm hopeful that surfacing the lack of linux headers in Pixie's diagnostic tools will help uncover and fix these problems before they become month long bugs! That tooling should be making its way in a release in the next few weeks (across vizier, operator, cli, etc). |
Glad I could be useful :) and many, many thanks for your help with this! |
…ing kernel headers (#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers

This PR leverages the script added in #2064 to detect missing kernel headers during cli deploys and `px collect-logs` commands. This solves 2/3 of the use cases I was hoping to identify for #2051 (the last being helm installs).

A recent example of this problem is #1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught.

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Verified the following scenarios

<details><summary>Test cases</summary>

- [x] `px collect-logs` works against a cloud that doesn't have a `px/agent_status_diagnostics` script
```
$ bazel run -c opt --stamp src/pixie_cli:px -- collect-logs
WARN[0006] healthcheck script detected the following warnings: error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes."
Logs written to pixie_logs_20241223165214.zip

# zip file contains px/agent_status output
$ cat px_agent_diagnostics.txt
{"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true}
```
- [x] `px collect-logs` works against a cloud that does have a `px/agent_status_diagnostics` script
```
$ bazel run src/pixie_cli:px -- collect-logs
INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //src/pixie_cli:px up-to-date:
  bazel-bin/src/pixie_cli/px_/px
INFO: Elapsed time: 4.240s, Critical Path: 3.89s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs
Pixie CLI
*******************************
* ENV VARS
* PX_CLOUD_ADDR=testing.getcosmic.ai:443
*******************************
Logs written to pixie_logs_20241218164734.zip

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":1}
```
- [x] `px collect-logs` identifies when kernel headers are missing when `px/agent_status_diagnostics` present
```
$ Logs written to pixie_logs_20241223165214.zip
$ bazel run -c opt --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs
[ ... ]
WARN[0012] healthcheck script detected the following warnings: error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes."

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":0.5}
```
- [x] Artificially forcing context deadline (timeout) results in an error
```
$ git diff
diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go
index 7d3b7e008..c957b8943 100644
--- a/src/pixie_cli/pkg/vizier/script.go
+++ b/src/pixie_cli/pkg/vizier/script.go
@@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus
                execScript = br.MustGetScript(script.AgentStatusScript)
        }

-       ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)

$ bazel run src/pixie_cli:px -- collect-logs
WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script error="context deadline exceeded"
Logs written to pixie_logs_20241218165033.zip
```
- [x] `px collect-logs` prompts auth flow when credentials don't match current cloud
```
$ PX_CLOUD_ADDR=new-cloud bazel run src/pixie_cli:px -- collect-logs
*******************************
* ENV VARS
* PX_CLOUD_ADDR=new-cloud
*******************************
Failed to authenticate. Please retry `px auth login`.
```
- [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
```
- [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
[ ... ]
Waiting for Pixie to pass healthcheck
 ✔ Wait for PEMs/Kelvin
 ✔ Wait for PEMs/Kelvin
 ✕ Wait for healthcheck ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
[ ...]
```
- [x] `px deploy` on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing
```
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0
[ ... ]
Waiting for Pixie to pass healthcheck
 ✔ Wait for PEMs/Kelvin
 ✕ Wait for healthcheck ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
```

</details>

Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs` commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers.

Signed-off-by: Dom Del Nano <[email protected]>
Describe the bug
Go TLS tracing does not work on Amazon Linux 2023.
To Reproduce
Provision EKS cluster (v1.30) with two node groups, one running AL2, one running AL2023. Deploy pixie. Run a Go (1.21 or 1.22) app (pod/deployment) that's making outbound HTTPS requests (HTTP 1.1). Go to px/http_data and try and observe said outbound traffic. Only traffic initiated by the pods running on AL2 node group would appear.
Expected behavior
Be able to see traffic initiated by pods scheduled on node group running AL2023.
Logs
pem-AL2.log
pem-al2023.log