Define TR caching comms in ATD #353

Merged: 4 commits, Mar 7, 2025
70 changes: 69 additions & 1 deletion semgrep_output_v1.atd
@@ -555,6 +555,8 @@ type sca_match = {
(* Note that in addition to "reachable" there are also the notions of
* "vulnerable" and "exploitable".
* coupling: see also SCA_match.ml
* TODO? have a Direct of xxx and Transitive of sca_transitive_match_kind?
* better so can be reused in other types such as tr_cache_result?
*)
type sca_match_kind = [
(* This is used for "parity" or "upgrade-only" rules. transitivity
@@ -1839,6 +1841,68 @@ type scan_config = {
?ci_config_from_cloud: ci_config_from_cloud option;
}

(* ------------------------------------------ *)
(* Transitive reachability (TR) caching comms *)
(* ------------------------------------------ *)
(* Essentially, we want to cache semgrep computations on third-party packages
* so that we can quickly look up
* (rule_id x package_version) -> sca_transitive_match_kind
* and avoid downloading and recomputing the same thing each time.
*)

(* The "key".
* The rule_id and resolved_url should form a valid key for our TR cache
* database table. Indeed, semgrep should always return the same result when
* using the same rule and same resolved_url package. The content at the
* URL should hopefully not change (we could md5sum it just in case) and
* the content of the rule_id should also not change (could md5sum it maybe too).
* I've added engine_version below just in case we want to invalidate past
* cached entries (e.g., the semgrep engine itself changed enough that
* some past cached results might be wrong and should be recomputed).
*)
type tr_cache_key = {
rule_id: rule_id;
(* this can be the checksum of the content of the rule (JSON or YAML form) *)
rule_version: string;
(* does not have to match the Semgrep CLI version; can be bumped only
* when we think the match should be recomputed
* TODO: to be set in Transitive_reachability.ml tr_version constant
*)
engine_version: int;
(* ex: http://some-website/hello-world.0.1.2.tgz like in found_dependency
* 'resolved_url' field, but could be anything to describe a particular
* package. We could rely on https://github.com/package-url/purl-spec
*)
package_url: string;
(* extra key just in case (e.g., "prod" vs "dev") *)
extra: string;
}
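
As a rough illustration of how a client might build such a key, the Python sketch below mirrors the `tr_cache_key` fields and derives `rule_version` from a checksum of the rule content. The names `TrCacheKey`, `rule_checksum`, and `make_key`, the choice of SHA-256 over a canonical JSON dump, and the default values are all assumptions made for this example; none of them come from this PR.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrCacheKey:
    """Hypothetical Python mirror of the tr_cache_key ATD record."""
    rule_id: str
    rule_version: str    # checksum of the rule content
    engine_version: int  # bumped only to invalidate old cached results
    package_url: str
    extra: str


def rule_checksum(rule: dict) -> str:
    """Hash the rule's JSON form; sorted keys make the hash order-independent."""
    canonical = json.dumps(rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_key(rule: dict, package_url: str,
             engine_version: int = 1, extra: str = "") -> TrCacheKey:
    """Build a cache key from a rule (assumed to carry an "id" field)."""
    return TrCacheKey(
        rule_id=rule["id"],
        rule_version=rule_checksum(rule),
        engine_version=engine_version,
        package_url=package_url,
        extra=extra,
    )
```

Because the dataclass is frozen, two keys built from the same rule content and package compare equal and can be used directly as dictionary keys, while bumping `engine_version` yields a distinct key and therefore a cache miss.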

(* The "value" *)
type tr_cache_match_result = {
(* alt: cache just sca_match? or sca_match_kind? or even define a separate
* sca_transitive_match type? which would be smaller than storing
* the whole set of matches
* alt: cache the whole cli_output? (which also contains the errors)
*)
matches: cli_match list;
}

(* Sent by the CLI to the POST /api/???? *)
type tr_query_cache_request = {
entries: tr_cache_key list;
}

(* Response by the backend to the POST /api/???? *)
type tr_query_cache_response = {
cached: (tr_cache_key * tr_cache_match_result) list;
}

(* Sent by the CLI to the POST /api/??? *)
type tr_add_cache_request = {
new_entries: (tr_cache_key * tr_cache_match_result) list;
}
(* TODO: tr_add_cache_response: string result (Ok | Error) *)
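
The query-then-add exchange described by these request and response types could be sketched as follows. This is only an in-memory model under stated assumptions: `InMemoryTrCache` stands in for the backend (whose endpoints are still unspecified above), keys are plain tuples rather than the ATD record, and `analyze_with_cache` and `compute` are hypothetical names for the CLI-side logic.

```python
from typing import Callable, Dict, List, Tuple

# (rule_id, rule_version, engine_version, package_url, extra)
Key = Tuple[str, str, int, str, str]
Result = dict  # stands in for tr_cache_match_result, e.g. {"matches": [...]}


class InMemoryTrCache:
    """Hypothetical stand-in for the backend cache table."""

    def __init__(self) -> None:
        self._table: Dict[Key, Result] = {}

    def query(self, entries: List[Key]) -> List[Tuple[Key, Result]]:
        # Models tr_query_cache_request -> tr_query_cache_response:
        # only the keys that are already cached come back.
        return [(k, self._table[k]) for k in entries if k in self._table]

    def add(self, new_entries: List[Tuple[Key, Result]]) -> None:
        # Models tr_add_cache_request.
        for key, value in new_entries:
            self._table[key] = value


def analyze_with_cache(cache: InMemoryTrCache, keys: List[Key],
                       compute: Callable[[Key], Result]) -> Dict[Key, Result]:
    """Query the cache, compute only the misses, then upload the new results."""
    results = dict(cache.query(keys))
    fresh = [(k, compute(k)) for k in keys if k not in results]
    if fresh:
        cache.add(fresh)
    results.update(fresh)
    return results
```

Batching all keys into one query and one add call mirrors the list-valued `entries` and `new_entries` fields, so a scan would need at most two round trips regardless of how many (rule, package) pairs it covers.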

(* ----------------------------- *)
(* TODO a better CI config from cloud *)
(* ----------------------------- *)
@@ -2407,6 +2471,10 @@ type resolution_result = [
| ResolutionError of resolution_error_kind list
]

(* ----------------------------- *)
(* SCA transitive reachability *)
(* ----------------------------- *)

type transitive_finding = {
(* the important part is the sca_match in core_match_extra that
* we need to adjust and especially the sca_match_kind.
@@ -2424,7 +2492,7 @@ type transitive_reachability_filter_params = {
}

(* ----------------------------- *)
- (* SCA part 4: Symbol analysis *)
+ (* Symbol analysis *)
(* ----------------------------- *)

(* "Symbol analysis" is about determining the third-party functions which
69 changes: 69 additions & 0 deletions semgrep_output_v1.jsonschema


26 changes: 25 additions & 1 deletion semgrep_output_v1.proto
