feat: detect and extract json from log-lines #1248
base: main
Conversation
Walkthrough

This pull request adds a new instruction in multiple Dockerfiles to copy the `resources` directory into the image, and introduces detection and extraction of JSON from log lines during ingestion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client
    participant H as HTTP Ingest Handler
    participant U as Ingest Utils
    participant E as KnownSchema Processor
    C->>H: Send HTTP request with headers & payload
    H->>U: Call flatten_and_push_logs(json, stream, log_source, extract_log)
    alt Custom Log Source
        U->>E: Invoke extract_from_inline_log(json, source, extract_log)
        E-->>U: Return processed JSON or error
    end
    U-->>H: Return updated log data
    H-->>C: Respond with success/error status
```
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/handlers/http/ingest.rs (1)
69-73: Introducing `extract_log` from headers

Extracting the optional `extract_log` from request headers yields more flexibility to process inline logs. Consider bounding or validating the header string length if logs are large.

src/event/format/known_schema.rs (1)
38-103: `SchemaDefinition` with `check_or_extract`

This method systematically checks for existing fields before performing regex extraction. It's efficient, but consider providing more detailed error info for partial matches or missing groups.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- Dockerfile (1 hunks)
- Dockerfile.debug (1 hunks)
- Dockerfile.kafka (1 hunks)
- resources/formats.json (1 hunks)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (1 hunks)
- src/handlers/http/ingest.rs (9 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (3 hunks)
- src/handlers/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (105:110)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (22)
Dockerfile.kafka (1)
45-45: LGTM - Adding resources directory to Docker image

This change ensures the `resources` directory is available in the Kafka-enabled Docker image, which is needed for the new JSON extraction feature.

Dockerfile.debug (1)

32-32: LGTM - Adding resources directory to debug Docker image

This change ensures the `resources` directory is available in the debug Docker image, consistent with changes to other Dockerfiles.

Dockerfile (1)

33-33: LGTM - Adding resources directory to main Docker image

This change ensures the `resources` directory is available in the main Docker image, consistent with changes to other Dockerfiles.

src/handlers/mod.rs (1)

25-25: New header constant for log extraction feature

This constant defines a header key that will be used to control the JSON extraction functionality from HTTP requests. The naming convention is consistent with other header constants in this file.
You might want to verify all places where this constant is being used:

```bash
# Find all usages of the new EXTRACT_LOG_KEY constant
rg "EXTRACT_LOG_KEY" --type rust
```

src/event/format/mod.rs (1)
42-42: New module for JSON extraction from log lines

This addition exposes the new `known_schema` module which implements the core functionality for detecting and extracting JSON data from log lines.

Ensure the module implementation includes proper error handling and testing:
```bash
# Check if tests are included for the new module
fd "known_schema" --type file | grep test
```

resources/formats.json (1)
1-554: ❓ Verification inconclusive

Check for potential catastrophic backtracking in regex patterns

All patterns appear well-defined, but indefinite quantifiers such as `.*` may lead to performance overhead or catastrophic backtracking in certain edge cases. It's recommended to test or verify them under load to ensure performance is acceptable, especially if logs are large or patterns are processed frequently.

🏁 Script executed:

```bash
# Searching for suspicious indefinite quantifiers in patterns.
rg -A2 '\.\*' resources/formats.json
```

Length of output: 28115
Performance Alert: Validate Regex Load Handling in formats.json

The regex patterns include several instances of unbounded quantifiers (e.g., `(?<body>.*)`). While these are consistently anchored (typically at the end of each pattern), which generally minimizes backtracking risks, their behavior under heavy or unusually large log entries remains uncertain. Please consider stress-testing these patterns with real-world log loads to confirm that they do not incur performance issues or catastrophic backtracking in production.
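To make the stress test concrete, here is a minimal, hypothetical benchmark sketch using the Rust `regex` crate; the pattern shown is illustrative, not one taken verbatim from formats.json:

```rust
use regex::Regex;
use std::time::Instant;

fn main() {
    // Anchored pattern with an unbounded tail capture, mirroring the
    // `(?<body>.*)` style used by the patterns under discussion.
    let re = Regex::new(r"^(?P<timestamp>\S+) (?P<level>\w+) (?P<body>.*)$").unwrap();

    // Simulate an unusually large log line to observe matching cost.
    let line = format!("2024-01-01T00:00:00Z INFO {}", "x".repeat(1_000_000));

    let start = Instant::now();
    let caps = re.captures(&line).expect("pattern should match");
    println!(
        "matched {} byte body in {:?}",
        caps["body"].len(),
        start.elapsed()
    );
}
```

Note that Rust's `regex` crate guarantees linear-time matching, so true catastrophic backtracking cannot occur here; the residual concern is the constant cost of scanning very large lines.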
src/handlers/http/modal/utils/ingest_utils.rs (3)

26-26: Import of `KNOWN_SCHEMA_LIST`

No issues with the new import. It consolidates all known schemas in a single place, which is a good approach for maintainability.
38-41: New parameter `extract_log`

These changes introduce an `Option<&str>` to accommodate the extraction of raw log data from a specific field. This approach is flexible and aligns well with optional header usage.
72-76: Custom log extraction branch

Calling `KNOWN_SCHEMA_LIST.extract_from_inline_log` for `LogSource::Custom` is a solid extension. Any malformed or unknown format can gracefully fall back with the `_` arm (see the sketch below).
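As a rough sketch of the branch described here (names are taken from this review; the exact signatures in the PR may differ):

```rust
use serde_json::Value;

// Hypothetical helper mirroring the described branch; `LogSource`,
// `KNOWN_SCHEMA_LIST`, and `Unacceptable` are the project types quoted above.
fn maybe_extract(
    json: &mut Value,
    log_source: &LogSource,
    extract_log: Option<&str>,
) -> Result<(), Unacceptable> {
    if let LogSource::Custom(src) = log_source {
        // Regex-based field extraction for recognized custom formats;
        // unrecognized formats surface as an `Unacceptable` error.
        KNOWN_SCHEMA_LIST.extract_from_inline_log(json, src, extract_log)?;
    }
    // All other log sources pass through unchanged (the `_` arm).
    Ok(())
}
```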
src/handlers/http/ingest.rs (7)

31-33: Importing `Unacceptable` and new header keys

Bringing in `Unacceptable` for error handling and `EXTRACT_LOG_KEY` constants is clear. This decouples the logic for recognized known formats from the rest of the code.
90-90: Invoking `flatten_and_push_logs`

Passing `extract_log` ensures custom inline log extraction for the default ingest path. Implementation is consistent with the newly introduced logic.
153-153: No extraction for OTEL paths

These lines pass `None` to `flatten_and_push_logs`, skipping inline log extraction. This is correct for OTEL ingestion flows.

Also applies to: 191-191, 231-231
273-277: `extract_log` usage in `post_event`

Mirroring the earlier approach for `ingest`, which keeps the logic consistent across ingestion endpoints.
285-285: Invoking `flatten_and_push_logs`

Same approach as in `ingest`, passing the optional `extract_log`. The code remains consistent.
351-353: Introducing `KnownFormat` variant

Attaching `Unacceptable` to `PostError` as `KnownFormat` clarifies error handling for unrecognized or invalid logs. This approach fosters clarity in user-facing error messages.
379-379: HTTP status for KnownFormat errors

Mapping `KnownFormat` to `BAD_REQUEST` is appropriate, as the log data doesn't match a recognized pattern. Ensure detailed messages guide users to correct their log format.
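Putting the two comments together, the error wiring plausibly looks like this sketch (assuming `thiserror` and actix-web; only the variant names are taken from this review):

```rust
use actix_web::http::StatusCode;

#[derive(Debug, thiserror::Error)]
pub enum PostError {
    // `#[from]` gives the automatic conversion from the extraction error.
    #[error("Unacceptable log format: {0}")]
    KnownFormat(#[from] Unacceptable),
    // ... other variants elided
}

impl PostError {
    fn status_code(&self) -> StatusCode {
        match self {
            // Log data that matches no known pattern is a client error.
            PostError::KnownFormat(_) => StatusCode::BAD_REQUEST,
        }
    }
}
```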
src/event/format/known_schema.rs (6)

1-17: License header updated

No issues found.
19-33: Use statements and static initialization

Declaring a lazy static `KNOWN_SCHEMA_LIST` ensures we only parse and compile the known formats once, improving performance. The approach is straightforward and maintainable.
34-36: Custom error `Unacceptable`

Having a specialized error type clarifies when the log content fails known format checks.
105-125: Data structures for known formats

Using `Format` and `Pattern` to store multiple regex definitions per format is flexible. The `EventProcessor` organizes them well.
127-202: Implementation of `EventProcessor::new` and `extract_from_inline_log`

Compiling regex patterns once ensures we avoid performance overhead at runtime. The approach to skipping unknown formats with a warning is practical, though consider if a fallback or partial extraction is desired.
205-474: Comprehensive test suite

The tests cover multiple scenarios, including edge cases like missing fields and partial matches. This broad coverage is commendable, ensuring confidence in the new extraction logic.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/event/format/known_schema.rs (3)
126-159: Consider handling regex compilation errors more robustly.

Currently, regex patterns that fail to compile are silently ignored with only a log message. This could lead to situations where expected patterns are not applied without a clear indication to the user.

Consider tracking failed patterns and potentially surfacing this information through a method that can report on schema health, such as:
```diff
 pub struct EventProcessor {
     /// Map of format names to their corresponding schema definitions
     pub schema_definitions: HashMap<String, SchemaDefinition>,
+    /// Patterns that failed to compile
+    compilation_errors: HashMap<String, Vec<String>>,
 }

 impl EventProcessor {
     /// Parses given formats from JSON text and stores them in-memory
     fn new(json_text: &str) -> Self {
         let mut processor = EventProcessor {
             schema_definitions: HashMap::new(),
+            compilation_errors: HashMap::new(),
         };

         // ... existing code ...

         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
-                    error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
+                    error!("Error compiling regex pattern: {err}; Pattern: {pattern}");
+                    processor.compilation_errors
+                        .entry(format.name.clone())
+                        .or_insert_with(Vec::new)
+                        .push(format!("{}: {}", pattern, err));
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
         }

         processor
     }
+
+    /// Returns information about patterns that failed to compile
+    pub fn get_compilation_errors(&self) -> &HashMap<String, Vec<String>> {
+        &self.compilation_errors
+    }
 }
```
168-171: Incorrect error description comment.

The documentation comment for the `Err(Unacceptable)` return value incorrectly states "JSON provided is acceptable for the known format" when it should indicate the opposite.

```diff
 /// # Returns
 /// * `Ok` - The original JSON will now contain extracted fields
-/// * `Err(Unacceptable)` - JSON provided is acceptable for the known format
+/// * `Err(Unacceptable)` - JSON provided is not acceptable for the known format
```
145-154: Consider verifying if any patterns were successfully compiled.

If all regex patterns for a format fail to compile, you'll end up with a schema that has field mappings but no patterns. This could lead to silent failures where extraction is attempted but can never succeed.

Consider adding a validation step after processing all patterns:

```diff
 for format in formats {
+    let mut has_valid_patterns = false;
     for regex in format.regex {
         let schema = processor
             .schema_definitions
             .entry(format.name.clone())
             .or_insert_with(SchemaDefinition::default);

         schema.field_mappings.push(regex.fields.clone());

         // Compile the regex pattern if present
         // NOTE: we only warn if the pattern doesn't compile
         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
+            has_valid_patterns = true;
         }
     }
+
+    // If no patterns were valid, log a warning
+    if !has_valid_patterns && processor.schema_definitions.contains_key(&format.name) {
+        warn!("Format '{}' has no valid regex patterns, extraction will always fail", format.name);
+    }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/event/format/mod.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (11)
src/handlers/http/ingest.rs (8)
31-33: Imports added for new JSON extraction functionality.

The imports of the `Unacceptable` error type and `KNOWN_SCHEMA_LIST` indicate the integration with the new JSON extraction feature from log lines.
69-73: Well-implemented extraction of optional log value from headers.

This code properly extracts an optional log format identifier from request headers, using the chain of `and_then` and `to_str().ok()` for safe conversion.
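A minimal sketch of that chain, assuming actix-web (which these handlers use); the `EXTRACT_LOG_KEY` constant lives in src/handlers/mod.rs and its value is not shown in this review:

```rust
use actix_web::HttpRequest;

// Looks up the optional header naming the raw-log field to extract from.
fn extract_log_field<'a>(req: &'a HttpRequest, header_key: &str) -> Option<&'a str> {
    req.headers()
        .get(header_key)
        // Headers with non-UTF-8 values are treated as absent rather than erroring.
        .and_then(|value| value.to_str().ok())
}
```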
90-90: Header extraction passed to flattening function.

The extracted log value is now passed to the `flatten_and_push_logs` function, allowing for JSON extraction from log lines during ingestion.
153-153: Consistent use of `None` for OTEL endpoints.

You correctly pass `None` for the `extract_log` parameter in OTEL-specific handlers, indicating that no JSON extraction should occur for these specialized formats.

Also applies to: 191-191, 231-231
273-277: Consistent implementation of log extraction in `post_event`.

The same pattern for extracting optional log values from headers is implemented here, maintaining consistency across all ingestion endpoints.
285-285: Header extraction correctly passed to flattening function.

Similar to line 90, the extracted log value is passed to `flatten_and_push_logs` to enable JSON extraction during processing.
351-352: New error variant for schema validation.

A new `KnownFormat` error variant has been added to the `PostError` enum to handle cases where log format extraction fails, with automatic conversion from the `Unacceptable` error type.
379-379: Appropriate error status code for format validation failures.

The new error variant correctly returns a `BAD_REQUEST` status code, which is appropriate for client-provided data that doesn't match the expected format.

src/event/format/known_schema.rs (3)
27-32: Good implementation of global lazy-loaded schema definitions.

Using `Lazy` for the global instance ensures the JSON parsing only happens once at first access, which is efficient. The embedded JSON file through `include_str!` is a good approach for packaging the format definitions with the application.
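For reference, a lazy global of this shape typically looks like the following sketch (assuming the `once_cell` crate; the relative resource path is inferred from the Dockerfile changes, not quoted from the PR):

```rust
use once_cell::sync::Lazy;

// Parse the embedded format definitions exactly once, on first access.
pub static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> = Lazy::new(|| {
    // include_str! embeds resources/formats.json into the binary at compile time.
    EventProcessor::new(include_str!("../../../resources/formats.json"))
});
```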
34-36: Simple and focused error type.

The `Unacceptable` error type provides a clear message that includes the name of the format that failed to match, which will be helpful for debugging and user feedback.
60-102: Well-implemented field extraction with performance optimizations.

The `check_or_extract` method is well-designed with multiple efficiency considerations:

- Early return if fields already exist
- Graceful handling of missing log field
- Iterative pattern matching with early success return

This approach minimizes unnecessary regex processing which can be performance-intensive; see the sketch after this list.
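A condensed sketch of that flow (field names and the exact signature are assumptions based on this review, not the PR's code):

```rust
use regex::Regex;
use serde_json::{Map, Value};

// Returns true if the schema fields are present, either already or after extraction.
fn check_or_extract(
    patterns: &[Regex],
    known_fields: &[String],
    obj: &mut Map<String, Value>,
    log_field: &str,
) -> bool {
    // 1. Early return: nothing to extract if the schema fields already exist.
    if !known_fields.is_empty() && known_fields.iter().all(|f| obj.contains_key(f)) {
        return true;
    }

    // 2. Gracefully handle events that don't carry the raw log field.
    let Some(Value::String(line)) = obj.get(log_field) else {
        return false;
    };
    let line = line.clone();

    // 3. Try each compiled pattern, returning on the first successful match.
    for re in patterns {
        if let Some(caps) = re.captures(&line) {
            for name in re.capture_names().flatten() {
                if let Some(m) = caps.name(name) {
                    obj.insert(name.to_owned(), Value::String(m.as_str().to_owned()));
                }
            }
            return true;
        }
    }
    false
}
```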
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)
211-211: ⚠️ Potential issue

Avoid using `unreachable!()` for input validation.

Using `unreachable!()` for validating inputs is risky, as it will cause the application to panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

```diff
-        _ => unreachable!("We don't accept events of the form: {json}"),
+        _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
```
🧹 Nitpick comments (1)
src/handlers/http/ingest.rs (1)
289-297: Structured extraction for post_event handler.

The implementation mirrors the pattern in the `ingest` function but is structured differently. This section could potentially be refactored to reduce duplication and share code between the handlers.

Consider extracting the common log source handling logic into a separate function to reduce duplication between `ingest` and `post_event`:

```diff
-    match &log_source {
-        LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-            return Err(PostError::OtelNotSupported)
-        }
-        LogSource::Custom(src) => {
-            KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
-        }
-        _ => {}
-    }
+    handle_log_extraction(&mut json, &log_source, extract_log)?;

 // Add this helper function elsewhere in the file:
 // fn handle_log_extraction(json: &mut Value, log_source: &LogSource, extract_log: Option<&str>) -> Result<(), PostError> {
 //     match log_source {
 //         LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
 //             return Err(PostError::OtelNotSupported)
 //         }
 //         LogSource::Custom(src) => {
 //             KNOWN_SCHEMA_LIST.extract_from_inline_log(json, src, extract_log)?;
 //         }
 //         _ => {}
 //     }
 //     Ok(())
 // }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/event/format/known_schema.rs (1 hunks)
- src/handlers/http/ingest.rs (7 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/handlers/http/modal/utils/ingest_utils.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
src/event/format/mod.rs (2): `new` (106-111), `from` (70-80)
- src/event/format/known_schema.rs (1): `new` (147-167)
- src/metadata.rs (1): `new` (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (17)
src/handlers/http/ingest.rs (10)
31-31: New import adds JSON schema extraction functionality.

The import includes the `Unacceptable` error type and `KNOWN_SCHEMA_LIST` for validating and extracting structured data from logs according to predefined schemas.
33-33: New `EXTRACT_LOG_KEY` constant integration.

This import refers to a constant that specifies the HTTP header key used to identify the field containing the raw log data for extraction.
52-52: Function signature modification adds mutable JSON handling.

The parameter is now marked as `mut json` to allow modification of the JSON during log extraction.
69-73: New header extraction logic for log field identification.

This code extracts an optional field name from request headers that identifies which field in the JSON contains the raw log text to be parsed.
81-90: Enhanced log source handling with structured extraction.

The code now handles different log sources, particularly focusing on extracting structured data from custom logs using regex patterns. For OTel formats, it properly returns an error, while for custom sources it attempts to extract fields based on the defined schema.
91-91: Improved log source metadata with extracted fields.

This creates a `LogSourceEntry` that includes both the log source type and the set of fields that were successfully extracted, which enhances the metadata for the stream.
253-253: Mutable JSON parameter for in-place field extraction.

The JSON parameter is now mutable to allow the extraction process to modify it by adding structured fields parsed from log lines.
284-288: Consistent header extraction logic across handlers.

This implements the same header extraction logic as in the `ingest` function, ensuring consistent behavior when extracting the log field name.
365-366: New error variant for schema validation failures.

A new `KnownFormat` error type is added to handle cases where the log doesn't match the expected format according to the predefined schema.
393-393: Proper HTTP status code for format validation errors.

Returns a `BAD_REQUEST` status code when a log doesn't match the expected format, which is appropriate for client-side errors related to input validation.

src/event/format/known_schema.rs (7)
27-32: Global schema definitions from embedded JSON resource.

The code loads predefined log format definitions from an embedded JSON file and initializes a global singleton using the `Lazy` pattern, which ensures thread-safe initialization on first access.
34-36: Clear error type for format validation failures.

The `Unacceptable` error type provides meaningful error messages when logs don't match the expected format, improving diagnostics.
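The type is small enough to quote a plausible shape (a sketch assuming `thiserror`; the real message text in the PR may differ):

```rust
// Carries the name of the format that failed to match.
#[derive(Debug, thiserror::Error)]
#[error("log event is not acceptable for known format: {0}")]
pub struct Unacceptable(pub String);
```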
40-51: Resilient regex deserializer ignores invalid patterns.

The deserializer handles malformed regex patterns gracefully by emitting warnings rather than failing completely. This prevents a single bad pattern from breaking the entire schema.
64-67: Schema definition structure for pattern collections.

The `SchemaDefinition` struct provides a clean abstraction for managing multiple regex patterns for a single log format.
83-129: Efficient field extraction with preemptive checking.

The `check_or_extract` method first checks if all required fields are already present before attempting extraction, which optimizes performance by avoiding unnecessary regex operations.
179-215: Comprehensive JSON extraction with multi-format support.

The `extract_from_inline_log` method handles both single JSON objects and arrays of objects, providing flexibility for different log formats.
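Conceptually, the shape handling reduces to a dispatch like this sketch (per-object extraction elided; the scalar arm reflects the error-instead-of-panic fix suggested above for the `unreachable!()` arm):

```rust
use serde_json::Value;

// Applies `process` to each event object, whether the payload is one object
// or a batch array; scalars are rejected with an error instead of a panic.
fn for_each_object(json: &mut Value, mut process: impl FnMut(&mut Value)) -> Result<(), String> {
    match json {
        // A batch of events: process each element in place.
        Value::Array(events) => events.iter_mut().for_each(&mut process),
        // A single event object.
        obj @ Value::Object(_) => process(obj),
        // Reject unsupported shapes gracefully.
        other => return Err(format!("unsupported JSON shape: {other}")),
    }
    Ok(())
}
```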
218-497: Comprehensive test suite for extraction functionality.

The tests cover various scenarios including successful extraction, handling of pre-existing fields, failed matches, and error cases, which ensures the robustness of the implementation.
```rust
fn new(json_text: &str) -> Self {
    let mut processor = EventProcessor {
        schema_definitions: HashMap::new(),
    };

    let formats: Vec<Format> =
        serde_json::from_str(json_text).expect("Known formats are stored as JSON text");

    for format in formats {
        for regex in format.regex {
            let schema = processor
                .schema_definitions
                .entry(format.name.clone())
                .or_insert_with(SchemaDefinition::default);

            schema.patterns.push(regex);
        }
    }

    processor
}
```
🛠️ Refactor suggestion

Potential panic during initialization.

The `new` method uses `expect` when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.

Consider handling JSON parsing errors more gracefully:
```diff
-        let formats: Vec<Format> =
-            serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+        let formats: Vec<Format> = match serde_json::from_str(json_text) {
+            Ok(formats) => formats,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                Vec::new() // Return empty formats to allow application to start with reduced functionality
+            }
+        };
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
fn new(json_text: &str) -> Self {
    let mut processor = EventProcessor {
        schema_definitions: HashMap::new(),
    };

    let formats: Vec<Format> = match serde_json::from_str(json_text) {
        Ok(formats) => formats,
        Err(e) => {
            error!("Failed to parse format definitions: {}", e);
            Vec::new() // Return empty formats to allow application to start with reduced functionality
        }
    };

    for format in formats {
        for regex in format.regex {
            let schema = processor
                .schema_definitions
                .entry(format.name.clone())
                .or_insert_with(SchemaDefinition::default);

            schema.patterns.push(regex);
        }
    }

    processor
}
```
Fixes #XXXX.
Description
This PR has:
Summary by CodeRabbit

- New Features
  - Detect and extract JSON fields from log lines during ingestion, controlled via a request header.
- Bug Fixes