
feat: detect and extract json from log-lines #1248

Open · wants to merge 10 commits into main

Conversation

de-sh (Contributor) commented Mar 18, 2025

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Container builds now include the resources directory, which the new log-format definitions require.
    • Enhanced log ingestion capabilities with support for custom header extraction, flexible log format configuration, and better error handling.
    • Introduced a new JSON file for defining various log formats, improving log parsing flexibility.
  • Bug Fixes

    • Improved handling of log source mappings to ensure accurate processing of log entries.


coderabbitai bot commented Mar 18, 2025

Walkthrough

This pull request adds a new instruction in multiple Dockerfiles to copy the resources directory into the Docker image. A new JSON file providing log format definitions has been introduced. Additionally, a new module for processing and extracting structured log data from inline logs has been created, along with unit tests. The HTTP ingestion handlers are updated to read an optional log-extraction field name from request headers, with corresponding changes in function signatures and error handling. A new constant for the header key is also added.

Changes

  • Dockerfile, Dockerfile.debug, Dockerfile.kafka: Added COPY resources ./resources after COPY src ./src to include the resources directory during image build.
  • resources/formats.json: Introduced a new JSON file defining multiple log format schemas with names, regex patterns, and capture fields.
  • src/event/format/known_schema.rs: Created a new module for structured log processing, adding SchemaDefinition, EventProcessor, extraction methods, and tests.
  • src/event/format/mod.rs: Added public module export: pub mod known_schema;
  • src/handlers/http/ingest.rs: Updated HTTP ingestion flows to extract an optional log value using a new header; function signatures and error handling (new PostError::KnownFormat variant) were enhanced.
  • src/handlers/http/modal/utils/ingest_utils.rs: Simplified control flow in the flatten_and_push_logs function without altering functionality.
  • src/handlers/mod.rs: Added new constant: EXTRACT_LOG_KEY with value "x-p-extract-log".

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant H as HTTP Ingest Handler
    participant U as Ingest Utils
    participant E as KnownSchema Processor

    C->>H: Send HTTP request with headers & payload
    H->>U: Call flatten_and_push_logs(json, stream, log_source, extract_log)
    alt Custom Log Source
      U->>E: Invoke extract_from_inline_log(json, source, extract_log)
      E-->>U: Return processed JSON or error
    end
    U-->>H: Return updated log data
    H-->>C: Respond with success/error status

Suggested reviewers

  • nikhilsinhaparseable

Poem

Hoppin' through the lines of code so bright,
I found new logs in Docker's light.
Resources and headers join in the dance,
As schemas extract logs with a prance.
With a twitch of my ear and a joyful hop,
I cheer these changes that just won't stop! 🐇🌟

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true in the settings.
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

de-sh marked this pull request as ready for review on March 18, 2025 at 16:20
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/handlers/http/ingest.rs (1)

69-73: Introducing extract_log from headers
Extracting the optional extract_log from request headers yields more flexibility to process inline logs. Consider bounding or validating the header string length if logs are large.

src/event/format/known_schema.rs (1)

38-103: SchemaDefinition with check_or_extract
This method systematically checks for existing fields before performing regex extraction. It's efficient, but consider providing more detailed error info for partial matches or missing groups.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160dec4 and 231cc0d.

📒 Files selected for processing (9)
  • Dockerfile (1 hunks)
  • Dockerfile.debug (1 hunks)
  • Dockerfile.kafka (1 hunks)
  • resources/formats.json (1 hunks)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (3 hunks)
  • src/handlers/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1)
  • new (105:110)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (22)
Dockerfile.kafka (1)

45-45: LGTM - Adding resources directory to Docker image

This change ensures the resources directory is available in the Kafka-enabled Docker image, which is needed for the new JSON extraction feature.

Dockerfile.debug (1)

32-32: LGTM - Adding resources directory to debug Docker image

This change ensures the resources directory is available in the debug Docker image, consistent with changes to other Dockerfiles.

Dockerfile (1)

33-33: LGTM - Adding resources directory to main Docker image

This change ensures the resources directory is available in the main Docker image, consistent with changes to other Dockerfiles.

src/handlers/mod.rs (1)

25-25: New header constant for log extraction feature

This constant defines a header key that will be used to control the JSON extraction functionality from HTTP requests. The naming convention is consistent with other header constants in this file.

You might want to verify all places where this constant is being used:

#!/bin/bash
# Find all usages of the new EXTRACT_LOG_KEY constant
rg "EXTRACT_LOG_KEY" --type rust
src/event/format/mod.rs (1)

42-42: New module for JSON extraction from log lines

This addition exposes the new known_schema module which implements the core functionality for detecting and extracting JSON data from log lines.

Ensure the module implementation includes proper error handling and testing:

#!/bin/bash
# Check if tests are included for the new module
fd "known_schema" --type file | grep test
resources/formats.json (1)

1-554:

❓ Verification inconclusive

Check for potential catastrophic backtracking in regex patterns
All patterns appear well-defined, but indefinite quantifiers such as .* may lead to performance overhead or catastrophic backtracking in certain edge cases. It's recommended to test or verify them under load to ensure performance is acceptable, especially if logs are large or patterns are processed frequently.


🏁 Script executed:

#!/bin/bash
# Searching for suspicious indefinite quantifiers in patterns.
rg -A2 '\.\*' resources/formats.json

Length of output: 28115


Performance Alert: Validate Regex Load Handling in formats.json
The regex patterns include several instances of unbounded quantifiers (e.g., (?<body>.*)). While these are consistently anchored (typically at the end of each pattern), which generally minimizes backtracking risks, their behavior under heavy or unusually large log entries remains uncertain. Please consider stress-testing these patterns with real-world log loads to confirm that they do not incur performance issues or catastrophic backtracking in production.
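One rough way to follow up on this, sketched with the regex crate and an illustrative pattern (not taken verbatim from formats.json), is to time a match against an oversized log line:

use regex::Regex;
use std::time::Instant;

fn main() {
    // Shape resembles the formats.json entries: anchored prefix, unbounded trailing capture.
    let re = Regex::new(r"^(?P<level>[A-Z]+)\s+(?P<body>.*)$").unwrap();

    // An unusually large single log line to probe worst-case match time.
    let line = format!("ERROR {}", "x".repeat(1_000_000));

    let start = Instant::now();
    let matched = re.captures(&line).is_some();
    println!("matched={matched}, elapsed={:?}", start.elapsed());
}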

src/handlers/http/modal/utils/ingest_utils.rs (3)

26-26: Import of KNOWN_SCHEMA_LIST
No issues with the new import. It consolidates all known schemas in a single place, which is a good approach for maintainability.


38-41: New parameter extract_log
These changes introduce an Option<&str> to accommodate the extraction of raw log data from a specific field. This approach is flexible and aligns well with optional header usage.


72-76: Custom log extraction branch
Calling KNOWN_SCHEMA_LIST.extract_from_inline_log for LogSource::Custom is a solid extension. Any malformed or unknown format can gracefully fall back with the _ arm.

src/handlers/http/ingest.rs (7)

31-33: Importing Unacceptable and new header keys
Bringing in Unacceptable for error handling and EXTRACT_LOG_KEY constants is clear. This decouples the logic for recognized known formats from the rest of the code.


90-90: Invoking flatten_and_push_logs
Passing extract_log ensures custom inline log extraction for the default ingest path. Implementation is consistent with the newly introduced logic.


153-153: No extraction for OTEL paths
These lines pass None to flatten_and_push_logs, skipping inline log extraction. This is correct for OTEL ingestion flows.

Also applies to: 191-191, 231-231


273-277: extract_log usage in post_event
Mirroring the earlier approach for ingest, which keeps the logic consistent across ingestion endpoints.


285-285: Invoking flatten_and_push_logs
Same approach as in ingest, passing the optional extract_log. The code remains consistent.


351-353: Introducing KnownFormat variant
Attaching Unacceptable to PostError as KnownFormat clarifies error handling for unrecognized or invalid logs. This approach fosters clarity in user-facing error messages.


379-379: HTTP status for KnownFormat errors
Mapping KnownFormat to BAD_REQUEST is appropriate, as the log data doesn't match a recognized pattern. Ensure detailed messages guide users to correct their log format.
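A hedged sketch of such a variant and status mapping, assuming thiserror and actix-web; the PR's actual PostError enum has many more variants and this is not copied from it:

use actix_web::http::StatusCode;
use thiserror::Error;

// Illustrative error carrying the name of the format that failed to match.
#[derive(Debug, Error)]
#[error("Event is not in the expected text/JSON format for {0}")]
pub struct Unacceptable(pub String);

#[derive(Debug, Error)]
pub enum PostError {
    #[error("{0}")]
    KnownFormat(#[from] Unacceptable),
    // ... other variants elided
}

impl PostError {
    pub fn status_code(&self) -> StatusCode {
        match self {
            // Unrecognized log formats are a client-side problem.
            PostError::KnownFormat(_) => StatusCode::BAD_REQUEST,
        }
    }
}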

src/event/format/known_schema.rs (6)

1-17: License header updated
No issues found.


19-33: Use statements and static initialization
Declaring a lazy static KNOWN_SCHEMA_LIST ensures we only parse and compile the known formats once, improving performance. The approach is straightforward and maintainable.


34-36: Custom error Unacceptable
Having a specialized error type clarifies when the log content fails known format checks.


105-125: Data structures for known formats
Using Format and Pattern to store multiple regex definitions per format is flexible. The EventProcessor organizes them well.


127-202: Implementation of EventProcessor::new and extract_from_inline_log
Compiling regex patterns once ensures we avoid performance overhead at runtime. The approach to skipping unknown formats with a warning is practical, though consider if a fallback or partial extraction is desired.


205-474: Comprehensive test suite
The tests cover multiple scenarios, including edge cases like missing fields and partial matches. This broad coverage is commendable, ensuring confidence in the new extraction logic.

coderabbitai bot previously approved these changes Mar 18, 2025
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/event/format/known_schema.rs (3)

126-159: Consider handling regex compilation errors more robustly.

Currently, regex patterns that fail to compile are silently ignored with only a log message. This could lead to situations where expected patterns are not applied without a clear indication to the user.

Consider tracking failed patterns and potentially surfacing this information through a method that can report on schema health, such as:

 pub struct EventProcessor {
     /// Map of format names to their corresponding schema definitions
     pub schema_definitions: HashMap<String, SchemaDefinition>,
+    /// Patterns that failed to compile
+    compilation_errors: HashMap<String, Vec<String>>,
 }

 impl EventProcessor {
     /// Parses given formats from JSON text and stores them in-memory
     fn new(json_text: &str) -> Self {
         let mut processor = EventProcessor {
             schema_definitions: HashMap::new(),
+            compilation_errors: HashMap::new(),
         };

         // ... existing code ...

         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
+                    processor.compilation_errors
+                        .entry(format.name.clone())
+                        .or_insert_with(Vec::new)
+                        .push(format!("{}: {}", pattern, err));
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
         }

         processor
     }
+
+    /// Returns information about patterns that failed to compile
+    pub fn get_compilation_errors(&self) -> &HashMap<String, Vec<String>> {
+        &self.compilation_errors
+    }
 }

168-171: Incorrect error description comment.

The documentation comment for the Err(Unacceptable) return value incorrectly states "JSON provided is acceptable for the known format" when it should indicate the opposite.

 /// # Returns
 /// * `Ok` - The original JSON will now contain extracted fields
-/// * `Err(Unacceptable)` - JSON provided is acceptable for the known format  
+/// * `Err(Unacceptable)` - JSON provided is not acceptable for the known format

145-154: Consider verifying if any patterns were successfully compiled.

If all regex patterns for a format fail to compile, you'll end up with a schema that has field mappings but no patterns. This could lead to silent failures where extraction is attempted but can never succeed.

Consider adding a validation step after processing all patterns:

 for format in formats {
+    let mut has_valid_patterns = false;
     for regex in format.regex {
         let schema = processor
             .schema_definitions
             .entry(format.name.clone())
             .or_insert_with(SchemaDefinition::default);

         schema.field_mappings.push(regex.fields.clone());
         // Compile the regex pattern if present
         // NOTE: we only warn if the pattern doesn't compile
         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
+            has_valid_patterns = true;
         }
     }
+    
+    // If no patterns were valid, log a warning
+    if !has_valid_patterns && processor.schema_definitions.contains_key(&format.name) {
+        warn!("Format '{}' has no valid regex patterns, extraction will always fail", format.name);
+    }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 231cc0d and 186a586.

📒 Files selected for processing (3)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (2 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/event/format/mod.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1)
  • new (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (11)
src/handlers/http/ingest.rs (8)

31-33: Imports added for new JSON extraction functionality.

The imports of Unacceptable and EXTRACT_LOG_KEY indicate the integration with the new JSON extraction feature from log lines.


69-73: Well-implemented extraction of optional log value from headers.

This code properly extracts an optional log format identifier from request headers, using the chain of and_then and to_str().ok() for safe conversion.


90-90: Header extraction passed to flattening function.

The extracted log value is now passed to the flatten_and_push_logs function, allowing for JSON extraction from log lines during ingestion.


153-153: Consistent use of None for OTEL endpoints.

You correctly pass None for the extract_log parameter in OTEL-specific handlers, indicating that no JSON extraction should occur for these specialized formats.

Also applies to: 191-191, 231-231


273-277: Consistent implementation of log extraction in post_event.

The same pattern for extracting optional log values from headers is implemented here, maintaining consistency across all ingestion endpoints.


285-285: Header extraction correctly passed to flattening function.

Similar to line 90, the extracted log value is passed to flatten_and_push_logs to enable JSON extraction during processing.


351-352: New error variant for schema validation.

A new KnownFormat error variant has been added to the PostError enum to handle cases where log format extraction fails, with automatic conversion from the Unacceptable error type.


379-379: Appropriate error status code for format validation failures.

The new error variant correctly returns a BAD_REQUEST status code, which is appropriate for client-provided data that doesn't match the expected format.

src/event/format/known_schema.rs (3)

27-32: Good implementation of global lazy-loaded schema definitions.

Using Lazy for the global instance ensures the JSON parsing only happens once at first access, which is efficient. The embedded JSON file through include_str! is a good approach for packaging the format definitions with the application.


34-36: Simple and focused error type.

The Unacceptable error type provides a clear message that includes the name of the format that failed to match, which will be helpful for debugging and user feedback.


60-102: Well-implemented field extraction with performance optimizations.

The check_or_extract method is well-designed with multiple efficiency considerations:

  1. Early return if fields already exist
  2. Graceful handling of missing log field
  3. Iterative pattern matching with early success return

This approach minimizes unnecessary regex processing which can be performance-intensive.
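A simplified sketch of that flow; the signature, tuple layout, and names below are assumptions based on this description rather than the PR's exact code:

use regex::Regex;
use serde_json::{Map, Value};

/// Each tuple pairs a compiled pattern with the capture-field names it should yield.
fn check_or_extract(
    patterns: &[(Regex, Vec<String>)],
    obj: &mut Map<String, Value>,
    extract_log: Option<&str>,
) -> bool {
    // 1. Early return when some pattern's fields are already present on the event.
    if patterns
        .iter()
        .any(|(_, fields)| !fields.is_empty() && fields.iter().all(|f| obj.contains_key(f)))
    {
        return true;
    }

    // 2. Gracefully handle a missing or non-string log field.
    let Some(line) = extract_log
        .and_then(|key| obj.get(key))
        .and_then(Value::as_str)
        .map(str::to_owned)
    else {
        return false;
    };

    // 3. Try each pattern in turn, stopping at the first successful match.
    for (re, fields) in patterns {
        if let Some(caps) = re.captures(&line) {
            for field in fields {
                if let Some(m) = caps.name(field) {
                    obj.insert(field.clone(), Value::from(m.as_str()));
                }
            }
            return true;
        }
    }
    false
}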

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)

211-211: ⚠️ Potential issue

Avoid using unreachable!() for input validation.

Using unreachable!() for validating inputs is risky, as it will cause the application to panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

-           _ => unreachable!("We don't accept events of the form: {json}"),
+           _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
🧹 Nitpick comments (1)
src/handlers/http/ingest.rs (1)

289-297: Structured extraction for post_event handler.

The implementation mirrors the pattern in the ingest function but is structured differently. This section could potentially be refactored to reduce duplication and share code between the handlers.

Consider extracting the common log source handling logic into a separate function to reduce duplication between ingest and post_event:

- match &log_source {
-     LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-         return Err(PostError::OtelNotSupported)
-     }
-     LogSource::Custom(src) => {
-         KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
-     }
-     _ => {}
- }
+ handle_log_extraction(&mut json, &log_source, extract_log)?;

// Add this helper function elsewhere in the file:
// fn handle_log_extraction(json: &mut Value, log_source: &LogSource, extract_log: Option<&str>) -> Result<(), PostError> {
//     match log_source {
//         LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
//             return Err(PostError::OtelNotSupported)
//         }
//         LogSource::Custom(src) => {
//             KNOWN_SCHEMA_LIST.extract_from_inline_log(json, src, extract_log)?;
//         }
//         _ => {}
//     }
//     Ok(())
// }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 186a586 and a37d368.

📒 Files selected for processing (3)
  • src/event/format/known_schema.rs (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/handlers/http/modal/utils/ingest_utils.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
src/event/format/mod.rs (2)
  • new (106-111)
  • from (70-80)
src/event/format/known_schema.rs (1)
  • new (147-167)
src/metadata.rs (1)
  • new (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (17)
src/handlers/http/ingest.rs (10)

31-31: New import adds JSON schema extraction functionality.

The import includes Unacceptable error type and KNOWN_SCHEMA_LIST for validating and extracting structured data from logs according to predefined schemas.


33-33: New EXTRACT_LOG_KEY constant integration.

This import refers to a constant that specifies the HTTP header key used to identify the field containing the raw log data for extraction.


52-52: Function signature modification adds mutable JSON handling.

The parameter is now marked as mut json to allow modification of the JSON during log extraction.


69-73: New header extraction logic for log field identification.

This code extracts an optional field name from request headers that identifies which field in the JSON contains the raw log text to be parsed.


81-90: Enhanced log source handling with structured extraction.

The code now handles different log sources, particularly focusing on extracting structured data from custom logs using regex patterns. For OTel formats, it properly returns an error, while for custom sources it attempts to extract fields based on the defined schema.


91-91: Improved log source metadata with extracted fields.

This creates a LogSourceEntry that includes both the log source type and the set of fields that were successfully extracted, which enhances the metadata for the stream.


253-253: Mutable JSON parameter for in-place field extraction.

The JSON parameter is now mutable to allow the extraction process to modify it by adding structured fields parsed from log lines.


284-288: Consistent header extraction logic across handlers.

This implements the same header extraction logic as in the ingest function, ensuring consistent behavior when extracting the log field name.


365-366: New error variant for schema validation failures.

A new KnownFormat error type is added to handle cases where the log doesn't match the expected format according to the predefined schema.


393-393: Proper HTTP status code for format validation errors.

Returns a BAD_REQUEST status code when a log doesn't match the expected format, which is appropriate for client-side errors related to input validation.

src/event/format/known_schema.rs (7)

27-32: Global schema definitions from embedded JSON resource.

The code loads predefined log format definitions from an embedded JSON file and initializes a global singleton using the Lazy pattern, which ensures thread-safe initialization on first access.
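A self-contained sketch of that pattern with a stand-in EventProcessor; in the PR the JSON comes from include_str! on resources/formats.json rather than an inline literal:

use once_cell::sync::Lazy;
use std::collections::HashMap;

// Minimal stand-in for the PR's EventProcessor (the real type lives in known_schema.rs).
struct EventProcessor {
    schema_definitions: HashMap<String, Vec<String>>,
}

impl EventProcessor {
    fn new(json_text: &str) -> Self {
        // The real implementation deserializes Vec<Format>; here we only record format names.
        let formats: Vec<serde_json::Value> = serde_json::from_str(json_text).unwrap_or_default();
        let mut schema_definitions = HashMap::new();
        for format in formats {
            if let Some(name) = format.get("name").and_then(|n| n.as_str()) {
                schema_definitions.insert(name.to_string(), Vec::new());
            }
        }
        EventProcessor { schema_definitions }
    }
}

// Parsed once, on first access; thread-safe by construction.
static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> =
    Lazy::new(|| EventProcessor::new(r#"[{"name": "syslog", "regex": []}]"#));

fn main() {
    println!("{} known formats", KNOWN_SCHEMA_LIST.schema_definitions.len());
}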


34-36: Clear error type for format validation failures.

The Unacceptable error type provides meaningful error messages when logs don't match the expected format, improving diagnostics.


40-51: Resilient regex deserializer ignores invalid patterns.

The deserializer handles malformed regex patterns gracefully by emitting warnings rather than failing completely. This prevents a single bad pattern from breaking the entire schema.
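A sketch of such a lenient deserializer, assuming serde's deserialize_with hook and the regex crate (names are illustrative, not the PR's exact code):

use regex::Regex;
use serde::{Deserialize, Deserializer};

// Invalid patterns become None with a warning instead of failing the whole formats.json parse.
fn lenient_regex<'de, D>(deserializer: D) -> Result<Option<Regex>, D::Error>
where
    D: Deserializer<'de>,
{
    let pattern: Option<String> = Option::deserialize(deserializer)?;
    Ok(pattern.and_then(|p| match Regex::new(&p) {
        Ok(re) => Some(re),
        Err(err) => {
            eprintln!("ignoring invalid regex {p:?}: {err}");
            None
        }
    }))
}

#[derive(Deserialize)]
struct Pattern {
    #[serde(default, deserialize_with = "lenient_regex")]
    pattern: Option<Regex>,
    #[serde(default)]
    fields: Vec<String>,
}

fn main() {
    // The unclosed character class makes this pattern invalid; parsing still succeeds.
    let json = r#"{ "pattern": "(?P<level>[A-Z", "fields": ["level"] }"#;
    let parsed: Pattern = serde_json::from_str(json).expect("struct still parses");
    println!("pattern compiled: {}, fields: {:?}", parsed.pattern.is_some(), parsed.fields);
}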


64-67: Schema definition structure for pattern collections.

The SchemaDefinition struct provides a clean abstraction for managing multiple regex patterns for a single log format.


83-129: Efficient field extraction with preemptive checking.

The check_or_extract method first checks if all required fields are already present before attempting extraction, which optimizes performance by avoiding unnecessary regex operations.


179-215: Comprehensive JSON extraction with multi-format support.

The extract_from_inline_log method handles both single JSON objects and arrays of objects, providing flexibility for different log formats.
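The shape of that dispatch, sketched with a stubbed per-object step (in the PR this step is the check_or_extract machinery; names and error text here are illustrative):

use serde_json::{json, Map, Value};

fn extract_one(obj: &mut Map<String, Value>) {
    // Stand-in for the regex-based field extraction.
    obj.insert("extracted".to_string(), Value::Bool(true));
}

fn extract_from_inline_log(json: &mut Value) -> Result<(), String> {
    match json {
        Value::Object(obj) => extract_one(obj),
        Value::Array(events) => {
            for event in events.iter_mut() {
                match event {
                    Value::Object(obj) => extract_one(obj),
                    other => return Err(format!("unexpected array element: {other}")),
                }
            }
        }
        other => return Err(format!("only objects and arrays are supported, got: {other}")),
    }
    Ok(())
}

fn main() {
    let mut batch = json!([{ "log": "a" }, { "log": "b" }]);
    extract_from_inline_log(&mut batch).unwrap();
    println!("{batch}");
}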


218-497: Comprehensive test suite for extraction functionality.

The tests cover various scenarios including successful extraction, handling of pre-existing fields, failed matches, and error cases, which ensures the robustness of the implementation.

Comment on lines +147 to +167
fn new(json_text: &str) -> Self {
let mut processor = EventProcessor {
schema_definitions: HashMap::new(),
};

let formats: Vec<Format> =
serde_json::from_str(json_text).expect("Known formats are stored as JSON text");

for format in formats {
for regex in format.regex {
let schema = processor
.schema_definitions
.entry(format.name.clone())
.or_insert_with(SchemaDefinition::default);

schema.patterns.push(regex);
}
}

processor
}

🛠️ Refactor suggestion

Potential panic during initialization.

The new method uses expect when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.

Consider handling JSON parsing errors more gracefully:

- let formats: Vec<Format> =
-     serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+ let formats: Vec<Format> = match serde_json::from_str(json_text) {
+     Ok(formats) => formats,
+     Err(e) => {
+         error!("Failed to parse format definitions: {}", e);
+         Vec::new() // Return empty formats to allow application to start with reduced functionality
+     }
+ };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn new(json_text: &str) -> Self {
let mut processor = EventProcessor {
schema_definitions: HashMap::new(),
};
let formats: Vec<Format> =
serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
for format in formats {
for regex in format.regex {
let schema = processor
.schema_definitions
.entry(format.name.clone())
.or_insert_with(SchemaDefinition::default);
schema.patterns.push(regex);
}
}
processor
}
fn new(json_text: &str) -> Self {
let mut processor = EventProcessor {
schema_definitions: HashMap::new(),
};
let formats: Vec<Format> = match serde_json::from_str(json_text) {
Ok(formats) => formats,
Err(e) => {
error!("Failed to parse format definitions: {}", e);
Vec::new() // Return empty formats to allow application to start with reduced functionality
}
};
for format in formats {
for regex in format.regex {
let schema = processor
.schema_definitions
.entry(format.name.clone())
.or_insert_with(SchemaDefinition::default);
schema.patterns.push(regex);
}
}
processor
}
