[nexus] webhooks #7277

Open: wants to merge 200 commits into main

Conversation

hawkw
Member

@hawkw hawkw commented Dec 18, 2024

This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:

  • webhook-related external API endpoints (as described in RFD 538)
  • database tables for storing webhook receiver configurations and webhook events, and for tracking
    their delivery status
  • background tasks for actually delivering webhook events to receivers

The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.

Future Work

Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):

  • Garbage collection for old records in the webhook_delivery, webhook_delivery_attempt, and webhook_event CRDB tables (need to figure out a good retention policy for events)
  • omdb db webhooks commands for actually looking at the webhook database tables
  • Oximeter metrics tracking webhook delivery attempt outcomes and latencies

Not currently planned, but possible future work:

  • Actually record webhook events when stuff happens :)
  • Some mechanism for communicating JSON schemas for webhook event payloads (either via OpenAPI 3.1, by sticking JSON schemas in the /v1/webhooks/event-classes endpoints, or both)
  • Allow webhook receivers to have roles with more restrictive permissions than fleet.viewer (see RFD 538 Appendix B.3); probably requires service accounts
  • Track receiver liveness and alert when a receiver has gone away (see RFD 538 Appendix B.4)

@hawkw hawkw force-pushed the eliza/webhook-models branch from 51f7f8e to 139cfe6 Compare December 18, 2024 21:10
@hawkw hawkw changed the base branch from eliza/webhook-api to main December 18, 2024 21:11
@hawkw hawkw requested a review from augustuswm December 18, 2024 21:11
@hawkw hawkw force-pushed the eliza/webhook-models branch 2 times, most recently from 140aea4 to 0b80c8f Compare January 8, 2025 17:28
@hawkw hawkw changed the title [nexus] Webhook DB models [nexus] webhooks Jan 11, 2025
@hawkw hawkw force-pushed the eliza/webhook-models branch from 41cf0b0 to 2bc5925 Compare January 17, 2025 19:20
@hawkw
Member Author

hawkw commented Jan 24, 2025

I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.

@hawkw
Member Author

hawkw commented Jan 24, 2025

I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that.

Glob subscription entries in webhook_rx_event_glob should capture the schema version when they're created, so that we can trigger reprocessing (generating the exact event class subscriptions for those globs) if the schema has changed. It's probably fine for nexus to do glob reprocessing on startup rather than in a bg task, although online update might invalidate that assumption.
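A rough sketch of the glob reprocessing idea, in case it helps make the proposal concrete. Everything here is an assumption for illustration: the `GlobSubscription` struct, its field names, the integer schema version, the event class names in the usage note, and the matching rule (`*` matches exactly one dot-separated segment) are not taken from the branch or from RFD 538.

```rust
/// Hypothetical in-memory stand-in for a row in `webhook_rx_event_glob`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct GlobSubscription {
    pattern: String,
    /// Schema version captured when the exact subscriptions were last generated.
    schema_version: u32,
}

/// Assumed matching rule: `*` matches exactly one dot-separated segment.
fn glob_matches(pattern: &str, class: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut segs = class.split('.');
    loop {
        match (pat.next(), segs.next()) {
            (None, None) => return true,
            (Some(p), Some(s)) if p == "*" || p == s => continue,
            _ => return false,
        }
    }
}

/// If the glob was processed under a different schema version, regenerate the
/// exact event class subscriptions and record the new version. Returns `None`
/// when the glob is already up to date.
fn reprocess(
    glob: &mut GlobSubscription,
    current_version: u32,
    all_classes: &[&str],
) -> Option<Vec<String>> {
    if glob.schema_version == current_version {
        return None;
    }
    let exact: Vec<String> = all_classes
        .iter()
        .filter(|class| glob_matches(&glob.pattern, class))
        .map(|class| class.to_string())
        .collect();
    glob.schema_version = current_version;
    Some(exact)
}
```

With this rule, a glob `test.*` stored at version 1 and reprocessed at version 2 against the classes `test.foo`, `test.bar`, `test.foo.baz` and `probe.added` would regenerate exact subscriptions for `test.foo` and `test.bar` only.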

@hawkw
Member Author

hawkw commented Jan 24, 2025

As far as GCing old events from the event table, dispatching an event should probably add a count of the number of receivers it was dispatched to, and then when we successfully deliver the event, we increment a count of successes. That way, we would not consider an event entry eligible to be deleted unless the two counts are equal; we want to hang onto events that weren't successfully delivered so any failed deliveries can be re-triggered.

GCing an event would also clean up any child delivery attempt records.
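The eligibility rule described above can be sketched as follows. This is a minimal in-memory model, not the actual schema: the `EventCounts` type and its field names are hypothetical, and treating a not-yet-dispatched event as ineligible is an added assumption.

```rust
/// Hypothetical per-event bookkeeping for garbage collection.
#[derive(Debug, Clone, PartialEq, Eq)]
struct EventCounts {
    /// Number of receivers the event was dispatched to.
    /// `None` until dispatch has happened (an assumption of this sketch).
    receivers_dispatched: Option<u32>,
    /// Number of receivers the event has been successfully delivered to.
    deliveries_succeeded: u32,
}

impl EventCounts {
    fn new() -> Self {
        EventCounts { receivers_dispatched: None, deliveries_succeeded: 0 }
    }

    /// Record how many receivers the event was dispatched to.
    fn record_dispatch(&mut self, receivers: u32) {
        self.receivers_dispatched = Some(receivers);
    }

    /// Record one successful delivery.
    fn record_success(&mut self) {
        self.deliveries_succeeded += 1;
    }

    /// An event may be deleted only once every dispatched delivery has
    /// succeeded, so that failed deliveries can still be re-triggered.
    fn is_gc_eligible(&self) -> bool {
        matches!(
            self.receivers_dispatched,
            Some(n) if n == self.deliveries_succeeded
        )
    }
}
```

Under this model, an event dispatched to two receivers stays ineligible after the first success and becomes eligible after the second. An event dispatched to zero receivers is immediately eligible, which may or may not be the desired behavior.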

hawkw added 2 commits March 26, 2025 11:13
i thought `cargo check --all` would also check tests? weird...
@hawkw hawkw force-pushed the eliza/webhook-models branch from 67a041b to 1d09bfc Compare March 26, 2025 18:23
@hawkw hawkw requested a review from smklein March 26, 2025 18:34
@hawkw
Member Author

hawkw commented Mar 26, 2025

@smklein I believe I've either addressed or replied to all your review comments, I'd love another look when you have the time!

@hawkw hawkw force-pushed the eliza/webhook-models branch from fa359f5 to f846125 Compare April 8, 2025 19:55
@hawkw hawkw requested a review from smklein April 8, 2025 21:25
@hawkw
Member Author

hawkw commented Apr 8, 2025

AGH MY BAD i forgot that seanmonstar/reqwest#2623 hasn't actually been merged yet, oops. gotta take a git dep for now.

Comment on lines +1165 to +1171
// If we are configured to only bind external TCP connections on a specific interface, do so.
#[cfg(any(
    target_os = "linux",
    target_os = "macos",
    target_os = "illumos",
))]
{
Collaborator

This seems like it's trying to resolve #7277 (comment), correct?

  1. Is this actually getting set in our production config? I see that ExternalHttpClientConfig has a default value of None for interface, and I cannot see where this would otherwise be set.
  2. Is it possible for us to have a test here?

Member Author

Whoops, I started writing up a comment about this but I think I accidentally closed the tab without posting it. 😅

What I meant to say was that I still need to write a test for this, and that I think it's going to be kind of a pain. To wit:

  • the test needs to run in a production-like network (which I think means a live test)
  • the test will need to stand up some kind of underlay network service that it can assert DIDN'T get a webhook request sent to it
  • the test will need a way of triggering a webhook on a running control plane (which we don't have presently but probably should add for this sort of thing)
  • we should also be testing that a webhook request does get sent to a service external to the rack network in a prod-like network config, which... I'm not sure how possible that is with the current live test framework? might require additional work on that, or manual verification for now?
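As a purely local approximation of the "assert it DIDN'T get a request" piece, something like the helper below could watch a listener for a bounded window. This is only a sketch using the standard library: `saw_connection_within` is a hypothetical name, the polling interval is arbitrary, and a real check would still need the prod-like network and live test setup described above.

```rust
use std::io::ErrorKind;
use std::net::TcpListener;
use std::thread;
use std::time::{Duration, Instant};

/// Polls `listener` for up to `window` and reports whether any client
/// connected. The listener is switched to non-blocking mode as a side effect.
fn saw_connection_within(
    listener: &TcpListener,
    window: Duration,
) -> std::io::Result<bool> {
    listener.set_nonblocking(true)?;
    let deadline = Instant::now() + window;
    loop {
        match listener.accept() {
            // Something connected: the "no request" assertion should fail.
            Ok(_) => return Ok(true),
            // Nothing pending yet: keep polling until the window closes.
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                if Instant::now() >= deadline {
                    return Ok(false);
                }
                thread::sleep(Duration::from_millis(10));
            }
            Err(e) => return Err(e),
        }
    }
}
```

A test would bind the listener on the address that must stay unreachable, trigger the webhook, and then assert that `saw_connection_within` returns `Ok(false)`. A bounded window can only show that nothing arrived during that window, so the timeout has to comfortably exceed the delivery latency.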

Member Author

To answer the other question, the OPTE interface name gets set here by sled-agent when constructing the DeploymentConfig for a Nexus zone: 0cf8154#diff-b8a6f13742cae29f44d095f6b9e8c2febc712e0ff86f01f3c8ec9d4e5d2db396

It's unset otherwise, so that the integration tests for webhooks work outside of prod-like networks.
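To illustrate the "unset by default" behavior discussed in this thread, here is a cut-down sketch. The struct below is a stand-in that models only the optional `interface` setting (the real `ExternalHttpClientConfig` may look different), `interface_to_bind` is a hypothetical helper, and the interface name in the usage note is made up.

```rust
/// Stand-in for the config type discussed above; only the optional
/// interface name is modeled here.
#[derive(Debug, Default, Clone)]
struct ExternalHttpClientConfig {
    /// Interface to bind external TCP connections on. `None` (the default)
    /// means no interface binding is requested.
    interface: Option<String>,
}

/// Returns the interface to bind on, if one is configured and the target OS
/// is one of those named in the `cfg` gate under review.
fn interface_to_bind(config: &ExternalHttpClientConfig) -> Option<&str> {
    if cfg!(any(
        target_os = "linux",
        target_os = "macos",
        target_os = "illumos",
    )) {
        config.interface.as_deref()
    } else {
        None
    }
}
```

With `ExternalHttpClientConfig::default()` the helper returns `None`, which corresponds to the integration-test case where nothing sets the interface. With `interface: Some("opte0".to_string())` it returns `Some("opte0")` on the three gated platforms.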

4 participants