Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mgs] API for ingesting ereports from SPs #7903

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Apr 1, 2025

oxidecomputer/management-gateway-service#370 adds code to the
gateway-messages and gateway-sp-comms crates to implement the MGS
side of the ereport ingestion protocol. For more information on the
protocol itself, refer to the following RFDs:

This branch integrates the changes from those crates into the actual
MGS application, as well as adding simulated ereports to the SP
simulator. I've added some simple tests based on this.

In addition, this branch restructures the initial implementation of
the control plane ereport API I added in #7833. That branch proposed
a single dropshot API that would be implemented by both sled-agent and
MGS. This was possible because the initial design would have indexed all
ereport producers (reporters) by a UUID. However, per recent
conversations with @cbiffle and @jgallagher, we've determined that Nexus
will instead request ereports from service processors indexed by SP
physical topology (e.g. type and slot), like the rest of the MGS HTTP
API. Therefore, we can no longer have a single HTTP API for ereporters
that's implemented by both MGS and sled-agents, and instead, SP ereport
ingestion should be a new endpoint on the MGS API.

This branch does that, moving the ereport query params into
ereport-types, eliminating the separate ereport-api and
ereport-client crates, and adding an ereport-ingestion-by-SP-location
endpoint to the management gateway API.

Furthermore, there are some terminology changes. The ereport
protocol has a value which we've variously referred to as an "instance
ID", a "generation ID", and a "restart nonce", all of which have
unfortunate name collisions that are potentially confusing or just
unpleasant. We've agreed to refer to this value everywhere as a
"restart ID", so this commit also changes that.

hawkw added 4 commits April 1, 2025 10:15
Currently, the initial ereport ingestion API I added in #7833 proposed
a single dropshot API that would be implemented by both sled-agent and
MGS. This was possible because the initial design would have indexed all
ereport producers (reporters) by a UUID. However, per recent
conversations with @cbiffle and @jgallagher, we've determined that Nexus
will instead request ereports from service processors indexed by SP
physical topology (e.g. type and slot), like the rest of the MGS HTTP
API. Therefore, we can no longer have a single HTTP API for ereporters
that's implemented by both MGS and sled-agents, and instead, SP ereport
ingestion should be a new endpoint on the MGS API.

This commit does that, moving the ereport query params into
`ereport-types`, eliminating the separate `ereport-api` and
`ereport-client` crates, and adding an ereport-ingestion-by-SP-location endpoint to the management gateway API.
@hawkw hawkw requested a review from jgallagher April 1, 2025 21:43
@hawkw hawkw self-assigned this Apr 1, 2025
@hawkw
Copy link
Member Author

hawkw commented Apr 1, 2025

I'm not 100% sure what our disposition on merging this ought to be, as it does add an API to the MGS http_entrypoints that's currently unimplemented and always returns an error. I figured it was good to at least go ahead and open the PR so that other changes can be based upon it...

@hawkw hawkw marked this pull request as draft April 2, 2025 19:19
@hawkw
Copy link
Member Author

hawkw commented Apr 2, 2025

I'm turning this into a draft as I'm going to keep using this branch to hack up the ereport protocol types a bit more.

hawkw added a commit that referenced this pull request Apr 4, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
  `MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not temperatures in degrees Celcius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
  interface to read SP task dumps over the network. Since this adds
  methods to the `sp_impl::SpHandler` trait, the SP simulator
  implementations need to be updated, or else they will no longer
  compile. For now, I've just made these `unimplemented!()`, as we're
  not currently actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.
hawkw added a commit that referenced this pull request Apr 8, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
`MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not
temperatures in degrees Celcius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
interface to read SP task dumps over the network. Since this adds
methods to the `sp_impl::SpHandler` trait, the SP simulator
implementations need to be updated, or else they will no longer compile.
For now, I've just made these `unimplemented!()`, as we're not currently
actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.

---------

Co-authored-by: John Gallagher <[email protected]>
@hawkw
Copy link
Member Author

hawkw commented Apr 11, 2025

this deploy failure looks like NTP badness

@hawkw hawkw changed the title [ereport] index ereport ingestion by SP type/slot [mgs] API for ingesting ereports from SPs Apr 12, 2025
@hawkw hawkw marked this pull request as ready for review April 12, 2025 17:59
@hawkw
Copy link
Member Author

hawkw commented Apr 12, 2025

Note that while this PR is broadly ready for review, we will want to merge oxidecomputer/management-gateway-service#370 first, and change this to depend on the main branch of MGS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant