
refactor: event handling in parseable #1218

Open
wants to merge 44 commits into base: main

Conversation


@de-sh de-sh commented Mar 1, 2025

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Introduced enhanced log ingestion endpoints that now support payloads with size metadata.
    • Added dynamic schema management for more reliable log processing and event partitioning.
  • Refactor

    • Streamlined event processing to improve the handling of streaming logs and metrics.
    • Consolidated JSON handling to better support nested log structures.
  • Chore

    • Removed obsolete modules and redundant components to improve maintainability and overall system efficiency.


coderabbitai bot commented Mar 1, 2025

Walkthrough

This pull request introduces several refactors and improvements across multiple modules. In the Kafka connector, the event processing method has been overhauled to return a payload size instead of an event object. The event formatting modules have been updated to streamline JSON flattening, schema inference, and record batch processing. HTTP ingestion handlers are refactored to leverage a unified log ingestion interface using a new JsonWithSize type, while several legacy modules and functions have been removed. Additionally, new schema management and partitioning features have been added, alongside adjustments to error handling and type updates across the codebase.

Changes

File(s) Change Summary
src/connectors/kafka/processor.rs Refactored ParseableSinkProcessor: removed build_event_from_chunk and added process_event_from_chunk to return the payload size; streamlined imports.
src/event/format/json.rs, src/event/format/mod.rs Updated event formatting: Added flatten_logs, modified to_data, decode, into_event signatures, introduced new schema inference and validation methods.
src/handlers/http/ingest.rs Refactored log ingestion functions (ingest, handle_otel_logs_ingestion, handle_otel_metrics_ingestion, handle_otel_traces_ingestion, post_event) to use JsonWithSize and PARSEABLE.get_or_create_stream; removed ingest_internal_stream and related test code.
src/handlers/http/mod.rs, src/handlers/http/modal/ingest/ingestor_ingest.rs, src/handlers/http/modal/utils/ingest_utils.rs, src/handlers/http/modal/utils/mod.rs Removed deprecated modules and functions: deleted kinesis module declaration from HTTP, removed legacy ingestion utilities and HTTP POST event handler.
src/lib.rs Added a new kinesis module declaration.
src/parseable/streams.rs Added commit_schema to Stream, renamed variable (schema_keyprefix), and refined schema file handling and logging messages.
src/utils/json/mod.rs Refactored JSON utilities: removed convert_array_to_object, updated flatten_json_body to return the nested value directly, added type alias Json, and added new test cases.
src/event/mod.rs Modified Event: introduced new PartitionEvent struct and partitions field; updated processing methods to iterate over partitions; changed origin_size type and removed legacy commit_schema function.
src/handlers/http/logstream.rs Enhanced detect_schema: now checks that each log record is a JSON object and returns a custom error if not.
src/utils/arrow/flight.rs Updated append_temporary_events: now retrieves the schema via stream.get_schema() and passes a reference to stream in push_logs_unchecked.
src/handlers/http/query.rs Updated schema commitment: now uses PARSEABLE.get_or_create_stream and commits schema directly on the stream.
src/handlers/http/cluster/mod.rs Replaced ingest_internal_stream with an internal stream created via PARSEABLE.get_or_create_stream; updated cluster metrics processing and error logging.
src/handlers/http/cluster/utils.rs Introduced new JsonWithSize<T> struct with a FromRequest implementation to encapsulate JSON payloads along with their byte size.
src/metadata.rs Updated update_stats function by changing the size parameter from u64 to usize.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant KafkaProcessor
    participant Stream
    Client->>KafkaProcessor: Send chunk of ConsumerRecords
    KafkaProcessor->>KafkaProcessor: Process via process_event_from_chunk
    KafkaProcessor->>Stream: Call push_logs (with JSON array & log source)
    Stream-->>KafkaProcessor: Return payload size
    KafkaProcessor->>Client: Log processed size
sequenceDiagram
    participant HTTPRequest
    participant IngestHandler
    participant PARSEABLE
    participant Stream
    HTTPRequest->>IngestHandler: Send event with JsonWithSize (json, byte_size)
    IngestHandler->>PARSEABLE: get_or_create_stream(stream_name)
    PARSEABLE-->>IngestHandler: Return Stream
    IngestHandler->>Event: Create new json::Event using json, byte_size, log_source
    IngestHandler->>Stream: Call push_logs with stream reference
    Stream-->>IngestHandler: Return processing result

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a little rabbit, hopping with glee,
Celebrating changes from code to decree,
Refactored paths and processes anew,
Streams and logs now dancing in view.
With bytes hopped neatly and errors made light,
In our code realm, everything feels just right!
🐇💻


@de-sh de-sh marked this pull request as ready for review March 1, 2025 07:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
src/event/format/json.rs (2)

63-108: Great addition of flatten_logs, but watch out for large JSON inputs.
This function elegantly handles different log sources (Kinesis, OTEL, etc.) and converts them into flattened JSON arrays using specialized transformers. However, for very large JSON inputs, the repeated conversions and appending could become memory-intensive. If scaling is expected, consider streaming approaches or chunk processing.


310-311: Consider partial ingestion for non-object JSON entries.
collect_keys returns None if a single entry is not an object. Depending on your requirements, you might want to skip or sanitize malformed entries instead of completely halting ingestion.
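A minimal sketch of the skip-instead-of-halt variant suggested here, assuming collect_keys receives an iterator over serde_json::Value references (the real signature may differ):

use std::collections::HashSet;
use serde_json::Value;

// Hypothetical variant: ignore non-object entries instead of rejecting the
// whole batch, so malformed records do not halt ingestion.
fn collect_keys<'a>(values: impl Iterator<Item = &'a Value>) -> HashSet<&'a str> {
    values
        .filter_map(|value| value.as_object())          // skip anything that is not a JSON object
        .flat_map(|obj| obj.keys().map(String::as_str)) // gather the field names it does contain
        .collect()
}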

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e732821 and 1386d3b.

📒 Files selected for processing (12)
  • src/connectors/kafka/processor.rs (4 hunks)
  • src/event/format/json.rs (10 hunks)
  • src/event/format/mod.rs (4 hunks)
  • src/handlers/http/ingest.rs (6 hunks)
  • src/handlers/http/mod.rs (0 hunks)
  • src/handlers/http/modal/ingest/ingestor_ingest.rs (2 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (0 hunks)
  • src/handlers/http/modal/utils/mod.rs (0 hunks)
  • src/lib.rs (1 hunks)
  • src/parseable/streams.rs (3 hunks)
  • src/utils/arrow/mod.rs (2 hunks)
  • src/utils/json/mod.rs (1 hunks)
💤 Files with no reviewable changes (3)
  • src/handlers/http/mod.rs
  • src/handlers/http/modal/utils/mod.rs
  • src/handlers/http/modal/utils/ingest_utils.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (24)
src/lib.rs (1)

32-32: Added the kinesis module to support AWS Kinesis-based log ingestion.

This addition aligns well with the event handling refactor objective and enhances Parseable's data source support.

src/handlers/http/modal/ingest/ingestor_ingest.rs (2)

22-22: Updated import to use new push_logs method.

This change is part of the event handling refactoring, replacing the previous flatten_and_push_logs function with the simplified push_logs.


41-41: Function call updated to use the new push_logs interface.

The change maintains the same parameter signature while leveraging the improved backend implementation.

src/parseable/streams.rs (1)

549-549: Improved log formatting consistency.

Removed the line break in the warning message for better log readability and consistency.

src/utils/json/mod.rs (3)

282-295: Good test for error handling with invalid input.

This test ensures that passing non-object arrays to the conversion function results in appropriate errors, enhancing error handling verification.


297-356: Comprehensive test for nested JSON object handling with SchemaVersion::V0.

This test verifies the correct flattening behavior for arrays of objects containing nested structures, ensuring proper JSON transformation with the older schema version.


358-417: Parallel test case for SchemaVersion::V1 behavior.

This test ensures the flattening behavior works correctly with the newer schema version, showing the expected differences in how nested objects are processed compared to V0.

src/connectors/kafka/processor.rs (3)

28-30: Simplified import statement for better readability.

The consolidated import statement improves code organization by grouping related imports together.


39-68: Well-structured refactoring of the event processing logic.

This is a significant improvement that simplifies the event processing flow:

  1. Method renamed from build_event_from_chunk to process_event_from_chunk to better reflect its purpose
  2. Return type changed from ParseableEvent to u64 (payload size) which is more relevant for this context
  3. Direct use of push_logs eliminates the need to extract schema-related attributes

The new implementation is more straightforward and focuses on the essential task of pushing logs to the stream.
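A rough sketch of the size-tracking half of this flow, based only on the summary above; the helper name and error handling are illustrative, and the real method additionally pushes the resulting array to the stream via push_logs:

use serde_json::Value;

// Hypothetical sketch: decode raw Kafka payloads, sum their byte size, and
// return the combined JSON array together with that size.
fn collect_chunk(payloads: &[Vec<u8>]) -> anyhow::Result<(Value, u64)> {
    let mut payload_size = 0u64;
    let mut json_values = Vec::with_capacity(payloads.len());

    for payload in payloads {
        payload_size += payload.len() as u64; // origin size in bytes, not string length
        json_values.push(serde_json::from_slice::<Value>(payload)?);
    }

    Ok((Value::Array(json_values), payload_size))
}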


77-79: Updated method call with appropriate logging.

The logging now correctly displays the size of processed records in bytes, which provides more useful operational metrics.

src/utils/arrow/mod.rs (2)

64-65: Documentation updated to reflect new parameter structure.

The documentation properly explains the new parameter format.


179-179: Test updated to use the new parameter format.

Good practice to ensure tests remain in sync with code changes.

src/event/format/mod.rs (5)

101-109: Enhanced method signature with additional parameters.

The to_data method signature has been expanded to include:

  1. time_partition_limit - Provides control over time partition boundaries
  2. custom_partition - Allows for custom partitioning logic
  3. Renamed the log source parameter for clarity

These additions provide more flexibility in how event data is processed and partitioned.


116-126: Added corresponding parameters to into_recordbatch method.

Ensures consistency between related methods and properly propagates the new parameters.


128-135: Updated method call with new parameters.

The call to self.to_data has been properly updated to include all the new parameters, maintaining consistency throughout the code.


170-174: Updated replace_columns call to match new parameter format.

The call has been correctly updated to use the tuple format that was introduced in the utils/arrow/mod.rs file, showing good cross-module consistency.


179-192: Updated into_event method signature.

The method signature has been appropriately updated to include the new parameters, ensuring consistency with the other method changes in this file.

src/handlers/http/ingest.rs (5)

83-88: Streamlined internal stream ingestion.

Similar to the main ingestion endpoint, this function has been simplified to directly get the stream and push logs to it, maintaining consistency across the codebase.


116-120: Consistent approach for OTEL logs ingestion.

The OTEL-specific ingestion handler now follows the same simplified pattern as the other endpoints, providing a consistent approach throughout the codebase.


150-154: Consistent approach for OTEL metrics ingestion.

The OTEL metrics ingestion follows the same pattern, further demonstrating the consistency of the refactored implementation.


181-185: Consistent approach for OTEL traces ingestion.

The OTEL traces ingestion also follows the same simplified pattern, completing the consistent implementation across all ingestion endpoints.


233-237: Consistent approach for direct stream posting.

The post_event handler follows the same pattern as other ingestion endpoints, completing the consistent implementation across all entry points.

src/event/format/json.rs (2)

301-302: Confirm that forcing None on unknown fields aligns with your schema evolution strategy.
Returning None when encountering a new field triggers a new schema inference path. This is correct if you want dynamic schema evolution, but if you intend to reject unknown fields, you may want to handle it differently.


394-395: Good enhancement for test coverage.
Using explicit type imports (Int64Type) clarifies unit tests and ensures correctness in arrow array comparisons. This makes the test suite more readable.

Comment on lines +154 to +163
if static_schema_flag
    && schema.iter().any(|field| {
        storage_schema
            .get(field.name())
            .is_none_or(|storage_field| storage_field != field)
    })
{
    return Err(anyhow!("Schema mismatch"));
}


🛠️ Refactor suggestion

Simplified schema validation with inline logic.

The previous separate is_schema_matching method has been replaced with inline schema validation logic that:

  1. Is more concise and easier to understand
  2. Directly expresses the validation condition
  3. Uses modern Rust pattern with is_none_or for cleaner code

This change improves readability while maintaining the same functionality.
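For reference, a tiny standalone example of the Option::is_none_or semantics this check relies on (stable in the standard library since Rust 1.82): a missing field is treated the same as a mismatching one.

fn main() {
    // is_none_or(pred) is true when the Option is None, or when the contained
    // value satisfies the predicate.
    let storage_field: Option<&str> = None;
    assert!(storage_field.is_none_or(|f| f != "timestamp")); // missing field counts as a mismatch

    let storage_field = Some("message");
    assert!(storage_field.is_none_or(|f| f != "timestamp")); // present but different: mismatch

    let storage_field = Some("timestamp");
    assert!(!storage_field.is_none_or(|f| f != "timestamp")); // present and equal: no mismatch
}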

Comment on lines 73 to 77
PARSEABLE
    .get_or_create_stream(&stream_name)
    .push_logs(json, &log_source)
    .await?;


🛠️ Refactor suggestion

Simplified log ingestion with direct push_logs call.

This is a significant improvement that:

  1. Replaces complex flattening and processing logic with a direct method call
  2. Centralizes the log processing logic in the push_logs implementation
  3. Makes the code more maintainable by reducing complexity at the HTTP handler level

The new approach provides a cleaner separation of concerns between the HTTP handling and log processing logic.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/handlers/http/logstream.rs (1)

117-123: Consider including context (such as array index) in error message.

Right now, the error doesn’t indicate which array element triggered this failure, which might complicate debugging. Including the index of the invalid item in the message (e.g., “Expected an object at index: X”) could help troubleshoot large payloads.

src/utils/json/mod.rs (1)

259-274: Test coverage for error case might be more explicit.

Verifying that flatten_json_body fails for arrays containing non-object elements is great. If you’d like deeper assertions, you could test the exact error type or message for increased robustness.

src/event/mod.rs (1)

58-95: Key concatenation may benefit from a delimiter before timestamps.

Right now, key.push_str(&parsed_timestamp_to_min); appends the timestamp directly to the hashed schema key. Using a delimiter (e.g., _ or #) could reduce the chance of collisions or confusion in the resulting key string.
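A small hypothetical helper illustrating that suggestion; the key layout, delimiter, and function name are made up for illustration:

use std::collections::HashMap;
use chrono::NaiveDateTime;

// Hypothetical: join the schema hash, the minute-truncated timestamp, and any
// custom partition values with an explicit delimiter to avoid ambiguous keys.
fn partition_key(
    schema_hash: &str,
    parsed_timestamp: NaiveDateTime,
    custom_partition_values: &HashMap<String, String>,
) -> String {
    let mut key = format!("{schema_hash}#{}", parsed_timestamp.format("%Y%m%dT%H%M"));
    for (field, value) in custom_partition_values {
        key.push_str(&format!("#{field}={value}"));
    }
    key
}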

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38a52c2 and a7b2db3.

📒 Files selected for processing (6)
  • src/event/format/json.rs (9 hunks)
  • src/event/format/mod.rs (7 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/logstream.rs (1 hunks)
  • src/utils/json/mod.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/event/format/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (17)
src/utils/json/mod.rs (5)

25-25: Minor import expansion: no immediate issue.

Importing Map from serde_json is fine. Just be mindful of potential naming collisions with other modules named Json. Otherwise, this looks good.


32-32: Type alias is clear and improves readability.

Defining Json as a Map<String, Value> is a neat approach to standardize usage throughout the codebase.


67-67: Returning nested_value directly seems appropriate.

This line finalizes the flattening process correctly, no further concerns here.


276-333: Test naming and logic appear correct.

The test validates flattening with nested objects for SchemaVersion::V0. This thorough example increases confidence in the flattening logic.


335-392: Additional test for SchemaVersion::V1 is well-structured.

Ensuring different schema versions flatten data in the expected manner is excellent; no issues noted here.

src/event/mod.rs (2)

39-43: New PartitionEvent struct is a useful abstraction.

Encapsulating RecordBatch, timestamp, and partition data fosters clearer multi-partition handling. Confirm that storing RecordBatch by value doesn’t lead to unintended memory usage if events become large. Otherwise, this design is well-structured.

Also applies to: 51-51


100-111: Verify omission of time partition logic in process_unchecked.

Unlike the main process method, process_unchecked does not append timestamps or custom partition values to the computed key. If grouping data by time or custom partitions is still desired, consider adding them here too.

src/handlers/http/ingest.rs (5)

73-77: Simplified log ingestion with direct push_logs call.

This is a significant improvement that:

  1. Replaces complex flattening and processing logic with a direct method call
  2. Centralizes the log processing logic in the push_logs implementation
  3. Makes the code more maintainable by reducing complexity at the HTTP handler level

The new approach provides a cleaner separation of concerns between the HTTP handling and log processing logic.


83-87: Good simplification of internal stream ingestion.

The code has been streamlined by removing schema retrieval and event creation logic, instead directly pushing logs to the internal stream. This maintains consistency with the other ingestion endpoints and reduces code complexity.


116-119: Consistent use of push_logs across all ingestion handlers.

All OpenTelemetry ingestion handlers now use the same pattern of getting the stream and directly pushing logs. This consistency improves maintainability and ensures uniform behavior across different log types.

Also applies to: 150-153, 181-184


233-236: Consistent log handling in post_event.

The post_event endpoint now follows the same pattern as other ingestion endpoints, maintaining consistency throughout the codebase.


250-251: Clarify the uncertainty in the comment.

The comment "NOTE: Maybe should be false" suggests uncertainty about the correct value for is_first_event. This should be resolved rather than left as a question in the code.

Consider investigating the impact of this flag on log processing and document the decision more clearly:

-        is_first_event: true, // NOTE: Maybe should be false
+        is_first_event: true, // Set to true as this is always considered a new event

or

-        is_first_event: true, // NOTE: Maybe should be false
+        is_first_event: false, // Set to false as this is processing existing data
src/event/format/json.rs (5)

67-131: Well-structured log flattening logic that handles multiple formats.

The new flatten_logs function centralizes the log processing logic for different sources (Kinesis, OTel logs/metrics/traces), making the code more maintainable and easier to extend with new log formats in the future.

The function effectively:

  1. Deserializes the input based on log source type
  2. Applies appropriate flattening logic for each format
  3. Handles both single objects and arrays consistently

140-146: Method signatures updated for consistency.

Method signatures have been updated to include additional parameters like time_partition_limit and log_source, ensuring consistent handling of these parameters throughout the event processing pipeline.

Also applies to: 155-155, 235-246


343-352: Improved key collection with HashSet.

Changing the return type from Result<Vec<&'a str>, ()> to HashSet<&'a str> is a good improvement:

  1. Uses a more appropriate data structure for unique keys
  2. Simplifies error handling by eliminating explicit error types
  3. Makes the code more concise

354-361: Simplified field mismatch detection.

The updated fields_mismatch function is more concise and readable, making use of Rust's idiomatic is_none_or pattern to check field validity.


452-1069: Comprehensive test coverage for JSON processing.

The extensive test suite covers various scenarios:

  • Basic objects with different types
  • Null handling
  • Schema derivation
  • Type mismatches
  • Empty objects
  • Arrays of objects
  • Nested types
  • Schema version differences

This thorough testing approach helps ensure the reliability of the JSON processing logic after the significant refactoring.

Comment on lines 157 to 158
let fields = collect_keys(flattened.iter());


💡 Verification agent

🧩 Analysis chain

Avoid using expect for collecting keys.

Currently, if collect_keys returns an empty set (e.g., if any log entry is not an object), this code might not handle it properly. Consider adding error handling to avoid unexpected behavior.

Check if the updated implementation properly handles the case when fields is empty:


🏁 Script executed:

#!/bin/bash
# Find all usages of collect_keys to verify error handling
rg -A 3 "collect_keys" --type rust

Length of output: 594


🏁 Script executed:

#!/bin/bash
# Search for any usage of `expect` in the file to confirm if it's still being used.
rg "expect(" src/event/format/json.rs

Length of output: 104


Review Check: Error Handling for Empty Key Collection

After verifying the relevant code areas, I confirm that the use of collect_keys no longer involves an expect call. However, the potential issue remains: if collect_keys(flattened.iter()) returns an empty set—such as when a log entry isn’t an object—this state isn’t explicitly managed. While the subsequent call to derive_arrow_schema uses an if let to check for a valid schema, it isn’t clear whether this gracefully handles the empty key scenario (or if it might lead to unexpected behavior further downstream).

To improve robustness, please consider one of the following:

  • Explicit Check: Add a conditional that verifies fields isn’t empty before proceeding.
  • Fallback/Error Handling: Document or implement a fallback mechanism (or an error log/early return) to handle cases when no keys are collected.

This change will help ensure the code behaves predictably, even when encountering non-object log entries.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/parseable/streams.rs (1)

116-124: ⚠️ Potential issue

New push_logs method has potential error handling issues.

The new method provides a streamlined interface for log ingestion, but it calls unwrap() on line 117 which could lead to panics in production if serialization fails.

Additionally, the method is marked as async but doesn't contain any await operations.

Apply these changes to improve error handling:

-pub async fn push_logs(&self, json: Value, log_source: &LogSource) -> anyhow::Result<()> {
+pub fn push_logs(&self, json: Value, log_source: &LogSource) -> anyhow::Result<()> {
-    let origin_size = serde_json::to_vec(&json).unwrap().len() as u64; // string length need not be the same as byte length
+    let origin_size = serde_json::to_vec(&json)?.len() as u64; // string length need not be the same as byte length

    json::Event::new(json)
        .into_event(origin_size, self, log_source)?
        .process(self)?;

    Ok(())
}
🧹 Nitpick comments (12)
src/event/format/mod.rs (2)

116-116: Consider refactoring to reduce argument count
The #[allow(clippy::too_many_arguments)] hint suggests the function signature is getting bulky. For long-term maintainability, consider encapsulating related parameters into a dedicated struct or builder pattern.


153-154: Create a dedicated comment or function describing the "new_schema" preparation
Adding a short explanation or splitting into a helper function could improve clarity around the transformation steps for new_schema, especially considering time-partition updates and potential complexities in large schemas.

src/handlers/http/ingest.rs (2)

150-153: Refined OTEL metrics ingestion
This mirrors the pattern in logs ingestion. The uniform approach helps reduce code duplication, though a shared helper function might be considered if more OTEL variants arise.


257-257: Unchecked event processing
Calling process_unchecked(stream) avoids overhead from schema validation. While it boosts performance, keep in mind that it may also allow malformed records to enter your storage.

Would you like a fallback mechanism or error handling approach to reduce potential corruption when pushing data in "unchecked" mode?

src/event/format/json.rs (8)

33-37: Additional std and sync imports
Introducing HashMap, HashSet, and Arc is standard. If ingestion throughput is especially high, consider guard patterns to avoid overhead from frequent map operations.


67-131: Centralized flatten_logs function
Bringing all log-flattening logic under one function offers clarity. However, watch out for large JSON payloads causing bottlenecks at runtime. Depending on scale, consider streaming approaches or partial flattening to handle giant arrays.


157-157: Collecting keys for schema derivation
collect_keys(flattened.iter()) works well for simple merges. If partial or inconsistent records appear, it might lead to incomplete schemas or ingestion errors.


160-161: is_first event determination
Relying on whether a new schema is derived to set is_first is clever but might skip scenarios where the event is not literally “first” but still triggers a new field. Clarify or rename the variable if necessary.


231-239: Leverage stream-based partition config
Fetching partition configs like time_partition and custom_partitions from the stream ensures consistent ingestion logic. Just confirm that dynamic stream updates during ingestion do not conflict with these local references.


249-254: Extracting custom partitions
Splitting on commas for multiple custom partitions is straightforward. If user-supplied partition keys contain commas, you might need escaping or a more robust approach.


333-339: Schema derivation short-circuit
Returning None if a new field is found is straightforward. In future expansions, you might allow partial acceptance of new fields or dynamic updates rather than rejecting the entire batch.


355-360: Field mismatch detection
Inline fields_mismatch logic is easy to follow. In the future, consider returning a structured list of mismatch reasons to help users debug ingestion issues.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7b2db3 and dc34a85.

📒 Files selected for processing (7)
  • src/event/format/json.rs (9 hunks)
  • src/event/format/mod.rs (6 hunks)
  • src/event/mod.rs (2 hunks)
  • src/handlers/http/ingest.rs (6 hunks)
  • src/parseable/mod.rs (1 hunks)
  • src/parseable/streams.rs (3 hunks)
  • src/utils/arrow/flight.rs (1 hunks)
🔇 Additional comments (33)
src/parseable/mod.rs (1)

31-31: Good addition of Stream to the public exports.

Adding the Stream type to the list of exported entities makes it accessible to other modules that import from parseable, supporting the event handling refactoring. This change follows good Rust module design patterns.

src/utils/arrow/flight.rs (2)

98-101: Improved by directly using Stream object instead of just the name.

The code now gets the Stream object first and uses it consistently throughout the function. This provides better encapsulation and follows object-oriented principles.


105-106: Streamlined function call by passing stream reference.

Changed from passing a stream name to passing the Stream reference directly, which is more efficient as it avoids looking up the stream again inside push_logs_unchecked.

src/parseable/streams.rs (1)

530-531: Improved logging message clarity.

The warning message for file not found has been streamlined for better readability, using file.display() to show the file path.

src/event/mod.rs (4)

39-43: Good introduction of PartitionEvent to encapsulate related data.

Creating a dedicated struct to group related data (record batch, timestamp, and partition values) improves code organization and makes the relationships between these fields more explicit.


45-52: Improved Event struct with partitions vector.

The Event struct has been refactored to contain a vector of PartitionEvents, which allows handling multiple partitions per event. This design is more flexible and better supports complex event processing scenarios.


56-94: Refactored process method to handle multiple partitions.

The process method now iterates through each partition and applies the appropriate operations. This change supports the new multi-partition event model and properly delegates to the stream for data storage.

The method signature has also been updated to take a Stream reference, strengthening the relationship between Event and Stream.


98-110: Consistent refactoring of process_unchecked method.

The process_unchecked method has been updated to match the changes in the process method, maintaining consistency in the API. This method now also takes a Stream reference and processes multiple partitions.

src/event/format/mod.rs (7)

23-23: Use of NonZeroU32 for partition limit is appropriate
This import indicates the intention to avoid zero values for the time partition limit. It's a good approach to guarantee meaningful partition constraints.


36-40: New imports align with the extended event processing logic
Bringing in Stream, get_field, get_timestamp_array, replace_columns, and Json is consistent with the updated record batch construction flow. Ensure that any performance-sensitive areas involving these utilities have sufficient benchmarks, especially under high ingestion loads.


108-112: Additional parameters enhance flexibility
Extending fn to_data with time_partition_limit and custom_partition provides finer control over partitioning. Ensure any external calls to to_data are updated to supply valid values for these parameters or None as applicable.


117-125: New signature provides direct data injection
The updated into_recordbatch method signature is clearer about required data, schema, and partition details. Confirm that all call sites pass consistent arguments to prevent runtime mismatches in timestamp or schema references.


143-149: Validate schema mismatch logic
This inlined check for static schema mismatch is convenient. However, be sure that any nominal type differences (e.g., Float64 vs. Int64 for schema version upgrades) are gracefully handled to avoid frequent ingestion failures.


158-163: Column replacement for the timestamp
Replacing the first column with the computed timestamp array is a neat solution. Just verify that clients querying this data are aware that index 0 is now reserved for the system timestamp, preventing confusion.


165-165: Finalize record batch as expected
Returning the newly generated RecordBatch is straightforward. No issues spotted, provided that downstream consumers handle any schema modifications introduced by prior steps.

src/handlers/http/ingest.rs (8)

30-30: New import for LogSource
This import is consistent with the shift to a direct log source model in ingestion functions. Confirm consistent usage across handlers.


34-34: Access to PARSEABLE & Stream
Exposing Stream and PARSEABLE again emphasizes centralizing ingestion. Ensure concurrency controls exist for multi-threaded ingestion scenarios.


73-76: Refactored ingestion flow with direct push_logs
This refactor removes unnecessary flattening logic at the handler level, delegating directly to push_logs. It simplifies code but double-check that any exceptions raised in push_logs (e.g., schema mismatch) are handled or reported properly here.


83-87: Ingesting into an existing stream
Using get_stream with a potential fallback path organizes ingestion well. Just confirm there's a fallback or a user-facing error if the stream name is invalid or not found.


116-119: OTel logs ingestion for newly created stream
A consistent approach with get_or_create_stream. Make sure any changes in OTel data structure are thoroughly tested to avoid partial ingestion or data corruption.


181-184: OTel traces ingestion
Similar to logs and metrics, everything is abstracted into push_logs. Keep an eye on the potential performance overhead if OTel data sets are large.


233-236: Unified ingestion for explicit logstream
Pushes logs via the same method used in other ingestion endpoints. Check that usage analytics or logging facets remain consistent for user-defined vs. internal streams.


249-254: is_first_event: true potential logic discrepancy
Hard-coding is_first_event to true might misrepresent repeated ingestion calls. Clarify or dynamically determine whether an event truly is the first to avoid misleading metadata or schema transitions.

src/event/format/json.rs (10)

29-31: OpenTelemetry log, metrics, and trace imports
These references show that OTel data structures are parsed directly. Keep an eye on updates to these crates to stay compatible with future OTel versions.


40-41: Using EventFormat and LogSource
Aligns with the new trait-based approach to ingest various event types. This fosters a cleaner separation of concerns.


42-47: Flattening different log types
Imports for specialized flatteners (Kinesis, OTel logs/metrics/traces) indicate a consolidated method of turning raw JSON into standardized records. Ensure each flattener is thoroughly tested for unusual or edge-case formats.


133-135: type Data = Vec<Json>
Retaining a vector representation can be memory-intensive if logs are large. If memory usage becomes an issue, consider streaming or chunk-based ingestion.


140-146: Expanded to_data signature
Including new partition parameters and the log_source clarifies responsibilities within EventFormat. Double-check that all call sites supply consistent arguments, or fallback defaults.


147-155: Flatten logs before inferring schema
Invoking flatten_logs here is logical. If flattening fails for some edge case (e.g., deeply nested objects or partially invalid OTel data), ensure bubble-up error handling is clearly traceable.


225-230: New into_event signature
Taking stream and log_source directly avoids confusion of many separate parameters. Make sure no potential concurrency pitfalls arise from referencing the shared stream object in parallel actions.


259-263: Extracting time partition
Falling back to the event’s own timestamp if no partition key is present ensures consistent event ordering. Just confirm that the partial mismatch between local times and global ordering is acceptable for the system’s design.


341-343: Key collection from JSON objects
No explicit check for non-object values is performed here—flattening presumably ensures objects only. If flattened logs can contain non-object nodes, an early error might improve clarity.


423-424: Extra type definitions for test arrays
Reusing these type casts in tests is convenient. This organization helps keep test code tidy.


    Ok(HttpResponse::Ok().finish())
}

pub async fn push_logs_unchecked(
    batches: RecordBatch,
    stream_name: &str,
    stream: &Stream,

⚠️ Potential issue

Updated function signature for push_logs_unchecked
Accepting a &Stream parameter emphasizes the new ingestion pattern. Ensure concurrency safety whenever multiple threads attempt “unchecked” pushes simultaneously.

In highly concurrent deployments, consider locking or queueing events to avoid partial writes or race conditions in process_unchecked.

Comment on lines 264 to 279
    let rb = Self::into_recordbatch(
        p_timestamp,
        vec![json],
        schema.clone(),
        &storage_schema,
        static_schema_flag,
        time_partition.as_ref(),
        schema_version,
    )?;

    partitions.push(PartitionEvent {
        rb,
        parsed_timestamp,
        custom_partition_values,
    });
}

🛠️ Refactor suggestion

Per-event recordbatch creation
Generating a distinct RecordBatch for each item in the data array could be expensive for large volumes. Batching them together might improve performance if your usage tolerates it.
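If batching is adopted, a minimal sketch using arrow's concat_batches kernel, assuming the per-row batches for a given partition share the same schema:

use std::sync::Arc;

use arrow::{array::RecordBatch, compute::concat_batches, datatypes::Schema, error::ArrowError};

// Hypothetical: merge the per-row batches that land in the same partition into
// one RecordBatch before constructing its PartitionEvent.
fn merge_partition_batches(
    schema: &Arc<Schema>,
    batches: &[RecordBatch],
) -> Result<RecordBatch, ArrowError> {
    concat_batches(schema, batches)
}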


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/event/format/json.rs (1)

357-366: Simplified collect_keys implementation but needs error handling

The collect_keys function now returns a HashSet<&'a str> instead of an Option, which simplifies code but means there's no built-in way to check if the set is empty or if no valid fields were collected.

Consider adding explicit handling for the empty set case in the to_data method to avoid issues with schema inference.

- let fields = collect_keys(flattened.iter());
+ let fields = collect_keys(flattened.iter());
+ if fields.is_empty() {
+     return Err(anyhow!("Failed to collect any valid fields from the provided JSON objects"));
+ }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc34a85 and 4c1f6d8.

📒 Files selected for processing (6)
  • Cargo.toml (1 hunks)
  • src/event/format/json.rs (9 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (6 hunks)
  • src/handlers/http/query.rs (1 hunks)
  • src/parseable/streams.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (21)
Cargo.toml (1)

12-13: Updated arrow dependencies for better consistency.

The change simplifies the arrow-array dependency declaration while adding the core arrow crate with matching versions. This looks like a good improvement for dependency management.

src/handlers/http/query.rs (1)

175-177: Improved schema management using stream abstraction.

The refactoring improves encapsulation by having the stream instance handle its own schema commitment rather than using a standalone function. This aligns well with object-oriented principles and the broader refactoring pattern in this PR.

src/event/mod.rs (4)

34-38: Good introduction of PartitionEvent struct for better data encapsulation.

The new PartitionEvent struct cleanly encapsulates related fields (record batch, timestamp, and partition values) that were previously separate in the Event struct. This improves code organization and maintainability.


40-47: Well-structured Event refactoring to support multi-partition processing.

The Event struct now uses a HashMap to store multiple PartitionEvent instances, which provides more flexibility for handling logs with different partitioning schemes. This change is aligned with the overall refactoring goals and enhances the system's capability to process varied log formats.


51-75: Improved process method with cleaner iteration over partitions.

The refactored process method properly handles multiple partitions through iteration, maintaining the same functionality while being more flexible. The schema commitment, data pushing, and stat updating are all properly adapted to work with the new structure.


79-88: Enhanced process_unchecked method for consistent partition handling.

The updated method correctly iterates over partitions in the same way as the main process method, maintaining consistency in the codebase.

src/handlers/http/ingest.rs (9)

73-77: Simplified log ingestion with direct push_logs call.

This approach centralizes log processing logic and provides a cleaner separation of concerns between HTTP handling and log processing. The code is now more maintainable with reduced complexity at the handler level.


83-87: Streamlined internal stream ingestion using push_logs.

The ingestion process for internal streams has been simplified by directly using the push_logs method, which removes redundant code and improves maintainability.


116-119: Consistent use of push_logs for OTEL logs ingestion.

The refactoring applies the same pattern across all ingestion methods, which improves code consistency and makes future maintenance easier.


150-153: Consistent use of push_logs for OTEL metrics ingestion.

Following the same pattern as other ingestion methods maintains consistency across the codebase.


181-184: Consistent use of push_logs for OTEL traces ingestion.

The consistent application of the same pattern across all ingestion methods is excellent for code maintainability.


233-236: Consistent use of push_logs for direct event posting.

This maintains the same pattern established throughout the file, which is good for consistency and maintainability.


242-244: Updated function signature for push_logs_unchecked

The function now accepts a &Stream parameter directly, which aligns with the new ingestion pattern and emphasizes the stream-centric approach to log handling.

When deploying to highly concurrent environments, ensure proper synchronization mechanisms are in place to avoid race conditions when multiple threads attempt "unchecked" pushes simultaneously.


250-259: Well-structured Event construction using the new PartitionEvent format.

The creation of the unchecked event correctly uses the new partitioned structure with a consistent key generation based on schema fields. This aligns well with the changes in the event module.


263-263: Properly updated process_unchecked call with stream parameter.

The call to process_unchecked now correctly passes the stream parameter, matching the updated method signature in the Event struct.

src/event/format/json.rs (6)

23-52: Clean import organization and new dependencies

The imports have been properly updated to include necessary dependencies for the refactored event handling - OpenTelemetry proto types, specific collection types, and non-zero integers. The organization into distinct blocks improves readability.


68-132: Well-structured log flattening abstraction

The new flatten_logs function effectively centralizes the logic for handling different log sources (Kinesis, OpenTelemetry logs/traces/metrics) into a single function, which is excellent for maintainability. The pattern matching with detailed comments for each case makes the code easy to follow.


137-207: Improved to_data method with clean error handling

The to_data method has been nicely refactored with:

  1. Clearer parameter naming (stored_schema vs previous schema)
  2. Proper use of the new flatten_logs function

However, the code at line 158 still doesn't properly handle the case when collect_keys returns an empty set.

#!/bin/bash
# Check if collect_keys function is properly handling empty sets
rg "collect_keys" src/event/format/json.rs -B 2 -A 5

210-223: Simplified return type for consistency

Changing the return type from Result<RecordBatch, anyhow::Error> to anyhow::Result<RecordBatch> improves code consistency while maintaining the same functionality.


227-304: Excellent parameter reduction in into_event method

The method has been significantly improved by:

  1. Taking references to structured objects (Stream and LogSource) rather than individual parameters
  2. Extracting values from these objects when needed
  3. Using proper error handling with the ? operator

This change reduces parameter count and makes the API more maintainable.


467-1083: Comprehensive test coverage for different scenarios

Excellent work adding thorough test cases that cover:

  1. Basic JSON object processing
  2. Null value handling
  3. Schema derivation and compatibility
  4. Array processing
  5. Nested type handling
  6. Schema version differences

This comprehensive test suite will help ensure the refactored code maintains correct behavior across all scenarios.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c1f6d8 and 5a2bcc1.

📒 Files selected for processing (9)
  • src/connectors/kafka/processor.rs (4 hunks)
  • src/event/format/json.rs (9 hunks)
  • src/event/format/mod.rs (6 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/cluster/mod.rs (2 hunks)
  • src/handlers/http/cluster/utils.rs (2 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/metadata.rs (1 hunks)
  • src/parseable/streams.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/handlers/http/ingest.rs
🧰 Additional context used
🪛 GitHub Actions: Lint, Test and Coverage Report
src/handlers/http/cluster/utils.rs

[error] 291-291: Clippy: useless conversion to the same type: actix_web::Error. Consider removing .into(): ErrorPayloadTooLarge(byte_size).


[error] 303-303: Clippy: redundant closure. Replace the closure with the function itself: JsonPayloadError::Deserialize.

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (20)
src/handlers/http/cluster/utils.rs (2)

284-297: Efficient payload handling and size limiting

The streaming approach to collect bytes chunk by chunk with size limit checks provides an effective defense against potential DoS attacks. The early return with ErrorPayloadTooLarge ensures that oversized payloads are rejected before processing the entire request.

🧰 Tools
🪛 GitHub Actions: Lint, Test and Coverage Report

[error] 291-291: Clippy: useless conversion to the same type: actix_web::Error. Consider removing .into(): ErrorPayloadTooLarge(byte_size).
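A simplified sketch of that size-limited collection loop, written against the public actix-web API; the constant name and limit are assumptions, and the real extractor wraps this in a FromRequest implementation before deserializing the collected bytes into T:

use actix_web::{error::ErrorPayloadTooLarge, web, Error};
use futures_util::StreamExt;

// Assumed limit, for illustration only.
const MAX_EVENT_PAYLOAD_SIZE: usize = 10 * 1024 * 1024;

// Collect the request body chunk by chunk, rejecting it as soon as the
// accumulated size exceeds the limit instead of buffering an oversized payload.
async fn collect_body(mut payload: web::Payload) -> Result<(web::BytesMut, usize), Error> {
    let mut body = web::BytesMut::new();
    let mut byte_size = 0;
    while let Some(chunk) = payload.next().await {
        let chunk = chunk?;
        byte_size += chunk.len();
        if byte_size > MAX_EVENT_PAYLOAD_SIZE {
            return Err(ErrorPayloadTooLarge(byte_size));
        }
        body.extend_from_slice(&chunk);
    }
    Ok((body, byte_size))
}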


19-36: Appropriate imports for the new functionality

All the necessary imports for handling futures, payloads, and error types have been added to support the new JsonWithSize<T> implementation.

src/metadata.rs (1)

38-38: Type change from u64 to usize for size parameter.

This change aligns the size parameter type with the origin_size field in the Event struct (in src/event/mod.rs), providing type consistency across the codebase. The change from u64 to usize is appropriate as usize is the natural type for memory-related operations in Rust.

src/handlers/http/cluster/mod.rs (1)

777-797: Streamlined cluster metrics ingestion workflow.

The refactored code improves error handling and provides a more direct approach to processing cluster metrics. The code now:

  1. Properly checks if metrics are available before processing
  2. Calculates the byte size correctly
  3. Uses the new push_logs method from the Stream interface
  4. Includes appropriate error handling and logging

This change is part of the broader event handling refactoring and provides a more consistent approach to log ingestion across the codebase.

src/connectors/kafka/processor.rs (2)

39-69: Improved Kafka record processing with direct log ingestion.

The refactored method process_event_from_chunk greatly simplifies the event processing flow by:

  1. Removing unnecessary schema-related attribute retrievals
  2. Directly pushing logs to the stream using the new push_logs method
  3. Tracking and returning the payload size instead of constructing an event object

This approach is more efficient and aligns with the broader refactoring of event handling in the application.


78-80: Updated process call and logging to match new return type.

The method call and logging statement have been correctly updated to match the new return type (usize for payload size instead of an event object).

src/parseable/streams.rs (3)

116-127: New push_logs method provides streamlined log ingestion.

This new method offers a simplified interface for log processing by directly creating a JSON event and processing it. The method signature appropriately takes the JSON value, origin size, and log source as parameters, making the interface cleaner for callers.

The implementation is straightforward and handles the full event processing flow in a single method call, which aligns with the broader refactoring of event handling in the codebase.


630-641: New commit_schema method for schema management.

This method provides a clean way to merge a new schema with the existing one and update the metadata. The implementation correctly:

  1. Retrieves the current schema
  2. Merges it with the new schema
  3. Converts the fields to a map for storage in the metadata
  4. Updates the metadata with the merged schema

This addition enhances the Stream struct's capabilities for schema management.
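A minimal sketch of that merge step, assuming the stream metadata keeps fields as a name-to-field map (the exact layout is not shown in this review):

use std::{collections::HashMap, sync::Arc};

use arrow_schema::{Field, Schema};

// Hypothetical: merge the incoming schema into the current one and flatten the
// result into the name -> field map kept in the stream metadata.
fn merged_field_map(current: Schema, new: Schema) -> anyhow::Result<HashMap<String, Arc<Field>>> {
    let merged = Schema::try_merge([current, new])?;
    Ok(merged
        .fields()
        .iter()
        .map(|field| (field.name().clone(), Arc::clone(field)))
        .collect())
}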


533-534: Simplified warning log statement.

The multi-line warning has been consolidated into a single line for improved readability without changing its functionality.

src/event/mod.rs (4)

34-38: Good addition of the PartitionEvent struct.

The introduction of this struct nicely encapsulates the related data (record batch, timestamp, and partition values) that was previously scattered across the Event struct. This improves code organization and makes the data relationships clearer.


40-46: Event struct now uses a partition-based approach.

The refactoring from individual fields to a map-based partitioning approach is a good architectural decision that makes the code more flexible for handling diverse log sources and partitioning schemes.

Note the type change from u64 to usize for origin_size, which aligns better with Rust's memory model.
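
Read together, the two comments above suggest shapes along these lines; the field names are taken from this review and should be treated as illustrative, not as the PR's exact definitions:

use std::collections::HashMap;
use arrow_array::RecordBatch;
use chrono::NaiveDateTime;

// One record batch plus the metadata needed to place it in a partition.
pub struct PartitionEvent {
    pub rb: RecordBatch,
    pub parsed_timestamp: NaiveDateTime,
    pub custom_partition_values: HashMap<String, String>,
}

// The event now carries a map of partition prefix -> PartitionEvent instead of flat fields.
pub struct Event {
    pub origin_format: &'static str,
    pub origin_size: usize,            // was u64
    pub is_first_event: bool,
    pub partitions: HashMap<String, PartitionEvent>,
}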


51-75: Processing logic appropriately updated for multi-partition support.

The process method has been effectively refactored to handle multiple partitions. Each partition is now processed independently with its own schema, timestamp, and custom partition values.

The change to pass the stream parameter explicitly rather than accessing it through a global or cached reference improves testability and makes dependencies clearer.
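
A simplified sketch of that loop, with Stream method names assumed from the surrounding comments (the PR's actual ordering, stats updates, and live-tail handling may differ):

impl Event {
    // Sketch only: push/commit_schema/update_stats signatures are assumptions.
    pub fn process(self, stream: &Stream) -> anyhow::Result<()> {
        for (prefix, partition) in &self.partitions {
            // Commit the (possibly extended) schema only for the first event that introduces it
            if self.is_first_event {
                stream.commit_schema(partition.rb.schema().as_ref().clone())?;
            }
            // Stage this partition's record batch under its prefix
            stream.push(prefix, &partition.rb, partition.parsed_timestamp)?;
        }
        stream.update_stats(self.origin_size); // account for the ingested bytes
        Ok(())
    }
}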


79-88: Well-structured process_unchecked method.

This method has been properly updated to match the changes in the process method, maintaining consistent behavior between the two methods.

src/event/format/mod.rs (2)

104-112: Improved to_data method signature with explicit parameters.

The addition of time_partition_limit, custom_partition, and log_source parameters makes the function more flexible and reduces the need for global state or context objects.


117-166: Well-refactored into_recordbatch method with simplified schema validation.

The previous separate is_schema_matching method has been replaced with inline schema validation logic that:

  1. Is more concise and easier to understand
  2. Directly expresses the validation condition
  3. Uses modern Rust pattern with is_none_or for cleaner code

This change improves readability while maintaining the same functionality.

src/event/format/json.rs (5)

137-207: Refactored to_data method with improved parameter handling.

The method now properly uses the new flatten_logs function and handles the schema derivation more cleanly. The variable rename from schema to stored_schema improves clarity by distinguishing it from the derived schema.

However, there's still a potential issue with the error handling for empty key collections:

The implementation should more explicitly handle the case when fields is empty (line 158). While the code uses conditional logic to create a schema, it's not entirely clear how an empty field set would be handled downstream.


227-303: Significantly improved into_event implementation.

The implementation now:

  1. Extracts all necessary configuration from the stream object
  2. Creates partition-specific record batches
  3. Uses appropriate key generation with schema and partition values
  4. Concisely concatenates record batches for the same partition

This approach handles partitioning more effectively than the previous implementation and properly implements the partition-based architecture.
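
As a small standalone illustration of the concatenation step, using arrow's concat_batches (which the new code imports):

use arrow::compute::concat_batches;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Merge all record batches that landed in the same partition into one batch.
// Assumes `batches` is non-empty and that every batch shares the same schema.
fn merge_partition_batches(batches: &[RecordBatch]) -> Result<RecordBatch, ArrowError> {
    concat_batches(&batches[0].schema(), batches)
}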


355-366: Simplified collect_keys function with better return type.

The function now returns a HashSet<&'a str> instead of a complex Result type, making it easier to use and understand.
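
For reference, collecting the keys of a flattened JSON object is now essentially a one-liner (illustrative, not the PR's exact code):

use std::collections::HashSet;
use serde_json::{Map, Value};

// Gather the field names of a flattened JSON object for schema lookup.
fn collect_keys(object: &Map<String, Value>) -> HashSet<&str> {
    object.keys().map(String::as_str).collect()
}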


368-375: Improved fields_mismatch implementation.

The use of is_none_or makes the code more concise and readable. The function now checks if the field doesn't exist in the schema or has an invalid type more clearly.
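
A minimal standalone example of the is_none_or pattern referred to here, assuming the arrow_schema types used elsewhere in the file (the actual type-validation closure in the PR is more involved):

use std::{collections::HashMap, sync::Arc};
use arrow_schema::{DataType, Field};

// A field "mismatches" when it is missing from the stored schema or carries
// a different data type than the incoming value implies.
fn field_mismatches(stored: &HashMap<String, Arc<Field>>, name: &str, expected: &DataType) -> bool {
    stored.get(name).is_none_or(|field| field.data_type() != expected)
}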


467-1083: Comprehensive test suite additions.

The addition of extensive unit tests is excellent. The tests cover:

  1. Basic object conversion
  2. Null handling
  3. Schema derivation
  4. Schema mismatch cases
  5. Nested type handling
  6. Schema version differences

This thorough test coverage will help ensure the refactored code works correctly and prevent future regressions.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/parseable/streams.rs (1)

262-266: Consider avoiding .expect(...) in production code.
Using .expect("Schema File should exist") will cause a panic if the file is missing or inaccessible. It would be beneficial to handle this more gracefully by returning a custom error or skipping the file.

Below is a potential fix:

-for path in self.schema_files() {
-    let file = File::open(path).expect("Schema File should exist");
-    if let Ok(schema) = serde_json::from_reader(BufReader::new(file)) {
-        schemas.push(schema);
-    }
+for path in self.schema_files() {
+    let file = File::open(&path).map_err(|e| {
+        tracing::error!("Failed to open schema file {path:?}: {e}");
+        StagingError::Create // or any relevant error variant
+    })?;
+
+    if let Ok(schema) = serde_json::from_reader(BufReader::new(file)) {
+        schemas.push(schema);
+    }
}
src/connectors/kafka/processor.rs (1)

41-72: Check partial parse error handling.
Records that fail JSON deserialization (serde_json::from_slice) are silently dropped. Consider logging these parse errors or returning them to track ingestion failures.

src/event/format/json.rs (1)

71-134: New flatten_logs implementation.
The approach handles different log sources (Kinesis, Otel, etc.) effectively, then flattens each event. Watch out for the unreachable!() branch at line 129—consider a more explicit error to safeguard against unexpected input.

-                _ => unreachable!("flatten would have failed beforehand"),
+                other => return Err(anyhow!("Unexpected JSON format: {other:?}"))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a86fc47 and 51d166e.

📒 Files selected for processing (6)
  • src/connectors/kafka/processor.rs (4 hunks)
  • src/event/format/json.rs (8 hunks)
  • src/event/format/mod.rs (6 hunks)
  • src/handlers/http/cluster/mod.rs (3 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/parseable/streams.rs (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/handlers/http/cluster/mod.rs
  • src/handlers/http/ingest.rs
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (30)
src/parseable/streams.rs (5)

23-23: No issues with the new import.
Thank you for including BufReader; it seems necessary for reading schema files.


62-62: LGTM for usage of DiskWriter.
The import is consistent with the new staged disk writing approach.


129-139: Safe creation of DiskWriter after path resolution.
The approach of creating the directory if it doesn't exist and then instantiating a DiskWriter is correct and robust.


159-159: Clear filename format for parted data.
Incorporating date, hour, minute, and hostname in the filename improves traceability and structure.


596-607: Schema merging appears logically sound.
Merging the current schema with the new one, then storing it in metadata, is a clean approach.

src/connectors/kafka/processor.rs (2)

30-30: Import consolidation is fine.
No issues with referencing LogSource from the same line of imports.


81-83: Reporting of record count and size is well-done.
Providing debug logs with the total processed size is helpful for observing ingestion performance.

src/event/format/mod.rs (6)

23-23: Import of NonZeroU32 recognized.
No concerns.


36-40: Neat addition of parseable::Stream and arrow utils.
Adding these references is consistent with new event handling refactors.


106-112: Extended signature in to_data method.
The inclusion of static_schema_flag, time_partition_limit, and custom_partitions expands flexibility in event formatting without complicating the interface excessively.


117-147: Logic in prepare_and_validate_schema.
Nicely ensures the DEFAULT_TIMESTAMP_KEY isn't overwritten and checks for static schema mismatches. This approach is appropriate for dynamic vs. static checks.


149-170: Implementation of into_recordbatch method is straightforward.
Replacing columns with the newly generated timestamp array is a clear approach. No obvious issues.


171-172: Method signature for into_event updated.
Accepting a reference to Stream is simpler than passing multiple parameters.

src/event/format/json.rs (17)

23-23: Using arrow::compute::concat_batches.
No concerns, helps with batch concatenation.


30-32: Inclusion of OpenTelemetry log/metrics/traces Data structures.
This improves the code’s ability to handle more telemetry sources.


35-38: Extended usage of HashMap, HashSet, NonZeroU32, and Arc.
These additions align well with the additional logic needed for partitioning and concurrency.


41-52: Imported references for EventFormat, log flattening, and new modules.
No issues found.


54-54: Introducing origin_size and log_source into Event.
These fields improve traceability of event size and provide context for specialized flattening logic. Implementation in the new constructor is straightforward.

Also applies to: 58-58, 62-69


138-210: EventFormat trait implementation expansions.
The new logic in to_data merges inferred schema with stored schema and includes prepare_and_validate_schema. The error flows are well-structured.


214-214: Simplifying the return type of decode.
Returning anyhow::Result<RecordBatch> fosters consistent error handling.


228-299: Reworked into_event method.
Informs partition logic with custom_partitions and time_partition, then concatenates record batches. This is a clean approach to final event assembly.


305-310: Clear error message for schema mismatch.
Communicating the mismatch reason fosters easier debugging.


333-343: Good practice for partition-based time extraction.
The function returns a dedicated error if the field is missing or unparseable.
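
A sketch of that extraction, leaning on serde's chrono support as noted later in this review (the return type and error wording are assumptions):

use anyhow::anyhow;
use chrono::{DateTime, NaiveDateTime, Utc};
use serde_json::{Map, Value};

// Pull the configured time-partition field out of a flattened JSON object
// and parse it, erroring out if it is missing or unparseable.
fn extract_and_parse_time(json: &Map<String, Value>, time_partition: &str) -> anyhow::Result<NaiveDateTime> {
    let value = json
        .get(time_partition)
        .ok_or_else(|| anyhow!("missing time partition field: {time_partition}"))?;
    let timestamp: DateTime<Utc> = serde_json::from_value(value.clone())?;
    Ok(timestamp.naive_utc())
}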


347-358: Conditionally deriving arrow schema.
Returning None if a new field is not found in the existing schema is consistent with the approach.


360-371: Collecting object keys into a set.
This is straightforward, but watch out for potential collisions if keys differ only by case, though that's likely not an immediate concern.


373-380: Logical check for field presence and type validation.
No issues found.


442-445: Additional test imports (Int64Type, etc.).
No concerns.


449-451: Test covering parse_time_partition_from_value.
Verifying the correct parse of an ISO8601 date/time string is crucial.


459-460: Test ensuring time_partition field presence.
The error scenario is validated properly.


467-468: Test verifying string parse failure for invalid date/time.
Covers a negative path for time parsing.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/event/format/json.rs (1)

70-133: 🛠️ Refactor suggestion

flatten_logs function handles multiple log sources.

  1. The _ => unreachable!() for the final branch could panic in production if unexpected data sneaks through. Replacing it with a proper error return would improve robustness.
  2. Overall structure is clear and well-documented with inline comments.
- _ => unreachable!("flatten would have failed beforehand"),
+ other => return Err(anyhow!("Encountered an unexpected JSON type: {other:?}")),
🧹 Nitpick comments (4)
src/event/mod.rs (2)

34-40: Add documentation for the new PartitionEvent struct.
While the struct fields are self-explanatory, attaching doc comments to each field would ease understanding for contributors unfamiliar with this code.


82-93: process_unchecked lacks stats and schema commits.
The distinction from process is clear, yet consider documenting the rationale behind skipping commits and stats updates for clarity.

src/event/format/json.rs (2)

61-61: New constructor Event::new(...).
This is straightforward, though consider adding doc comments to clarify how origin_size is measured (e.g., raw bytes).


261-261: Extract custom partition values if configured.
Implementation is standard. Consider ignoring empty or null fields explicitly if relevant.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51d166e and 5d87407.

📒 Files selected for processing (4)
  • src/event/format/json.rs (7 hunks)
  • src/event/format/mod.rs (7 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/handlers/http/ingest.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
🔇 Additional comments (67)
src/event/mod.rs (6)

23-23: Imports look good.
No particular issues to report on this import statement.


28-28: Additional imports to support streaming and storage.
No concerns, as these are consistent with the references below.


42-42: #[derive(Debug)] on Event is beneficial.
No issues; this is consistent with typical Rust debugging needs.


45-45: Confirm feasibility of changing origin_size to usize.
Depending on the system architecture, large event sizes might exceed usize limits on 32-bit systems. If extremely large events are possible, re-evaluating u64 could be prudent.


48-48: Introducing partitions to group record batches.
This addition makes the event struct more flexible and extensible.


54-80: Validate concurrency and error handling within process.
The loop and minor condition checks appear correct; schema is only committed once if is_first_event is true. Ensure no concurrency hazards occur if multiple threads call this simultaneously, though it likely meets current requirements.

src/event/format/json.rs (44)

29-31: New OpenTelemetry imports.
These seem necessary for handling logs, metrics, and trace data. No issues here.


33-37: Added standard library imports.
All are relevant for the new logic. No concerns.


40-51: Reorganized imports for the json module.
The references to PartitionEvent and flattening helpers are consistent with the code below.


55-57: Updated fields in Event struct (origin_size and log_source).
No immediate issues, but ensure the log_source field is consistently set in all code paths to prevent confusion down the line.


139-139: to_data method refactor and new parameters.
The added time_partition_limit and custom_partitions integration align with the extended flattening logic. Implementation looks coherent.

Also applies to: 144-145, 147-148


156-162: infer_schema signature updated.
Accepting a single data element plus additional parameters is reasonable. This clarifies usage and moves towards consistent design.


164-164: Collecting keys from a single JSON object.
Ensure the object is validated before usage. The approach is otherwise straightforward.


167-169: Checking derived schema from stored schema.
Looks correct; returns Some(...) or proceeds to infer. No issues noted.


195-195: Marking is_first = true when a new schema is derived.
This logic is consistent with the notion of a newly created or expanded schema.


210-210: Call to prepare_and_validate_schema.
Helps keep the schema consistent with storage constraints. No concerns.


223-223: serialize(data)? usage.
This can produce an error if data is malformed. Properly handled by the ? operator, so it's fine.


232-232: into_event entry point for building Event.
Clean integration with the Stream.


241-243: Extract p_timestamp, origin_size, and build data.
Keeps logic consistent with earlier changes to Event.


249-249: is_first_event defaulting to false.
Reasonable default until proven otherwise in the loop.


250-250: Initialize partitions as HashMap::new().
No remarks; standard usage.


252-252: Iterate over flattened data.
Ensures each JSON chunk is processed individually.


253-253: Infer schema for each JSON record.
Allows dynamic adaptation if the data shape changes record by record, though repeated schema inference can have performance costs.


260-260: Accumulate is_first_event.
Straightforward OR logic to track any new schema events.


269-269: Derive parsed_timestamp from JSON or fallback.
Reasonable approach, deferring to the time_partition presence.


274-274: Create RecordBatch with into_recordbatch.
Flow is consistent with the trait-based approach.


281-281: Access final schema from the created batch.
Avoids storing stale references; no issues.


283-283: Generate partition key from schema hash.
Implementation detail is crisp, referencing get_schema_key.


289-289: Append custom partition kv pairs to partition key.
Straightforward approach; sorting ensures deterministic ordering.


293-306: Construct or update PartitionEvent within the HashMap.
Logic is idiomatic: push to existing or insert a new struct. No concerns.


309-309: Finalize the newly built Event.
Fields align with earlier definitions, reusing "json" as origin_format.


323-323: extract_custom_partition_values function signature.
Well-scoped. Ensure robust error handling if the partition field is missing from JSON.


345-345: extract_and_parse_time function.
Returns an error if the field is absent or unparsable. Good approach.


358-359: derive_arrow_schema returns Option<Vec<Arc<Field>>>.
When it returns None, new fields are inferred. Implementation is consistent.


370-371: collect_keys from a single JSON object.
Simple approach to gather field names. No issues.


375-376: fields_mismatch checks correctness for known fields.
Ensures data type alignment. Reasonable design.


443-444: Importing additional array types for testing.
No issues; these are used in test assertions.


474-492: TestExt trait for accessing typed arrays in tests.
Enhances readability in test code. Good approach.


494-496: Helper: fields_to_map.
Functional for building test schemas quickly.


498-529: Test basic_object_into_rb.
Covers standard object with numeric/string fields. Good coverage.


531-559: Test basic_object_with_null_into_rb.
Verifies null handling. Implementation is correct.


560-593: Test basic_object_derive_schema_into_rb.
Ensures it merges with stored schema. Good scenario coverage.


595-618: Test basic_object_schema_mismatch.
Properly expects an error if stored schema conflicts. Good negative test.


620-643: Test empty_object.
Confirms partial schema usage and ensures at least one column. Looks valid.


646-694: Test array_into_recordbatch_inffered_schema.
Verifies inference across multiple objects in an array. Comprehensive coverage.


697-739: Test arr_with_null_into_rb.
Properly checks mixing null and valid data. No issues.


742-791: Test arr_with_null_derive_schema_into_rb.
Similar coverage, ensuring merges with existing schema that includes potential null fields.


794-829: Test arr_schema_mismatch.
Expects an error when numeric/string type conflict arises. Solid negative test.


832-907: Test arr_obj_with_nested_type.
Covers arrays within objects. Thorough approach verifying list array behavior.


909-967: Test arr_obj_with_nested_type_v1.
Similar coverage under schema version V1 with forced float conversions. Good consistency check.

src/event/format/mod.rs (17)

23-23: Introducing NonZeroU32 in the format module.
Used to limit certain partition-based logic (e.g., size). Looks good.


27-27: use anyhow::anyhow;
Used for error management. No issues found.


36-36: Importing parseable::Stream.
Consistent with the rest of the code integration.


101-101: Type alias IsFirstEvent = bool.
Improves readability when returning the "is first" status from infer_schema.


109-110: Added time_partition_limit and custom_partitions to trait.
Aligns with the changes in json.rs, boosting extensibility.


114-114: New infer_schema method signature includes IsFirstEvent.
Clear approach to capturing both the schema and a boolean result in one call.


122-122: decode now handles slices of data.
Enables batch-based decoding. This is consistent with the new approach in json.rs.


125-125: Simplified prepare_and_validate_schema returning EventSchema.
Refined approach for finalizing the schema with reserved fields. Straightforward.


144-155: Validate static schema constraints during preparation.
Ensures the new event fields don't conflict with stored schema if static_schema_flag is set. Proper error return if mismatch occurs.


157-157: Refined into_recordbatch signature.
Directly injects p_timestamp and uses the simpler approach with data: &[Self::Data].


163-163: Using Arc::new(Schema::new(schema.clone())).
References a cloned vector of Field. No immediate concerns, though defensive copying might have a minimal overhead. Acceptable in current design.


167-167: Call to update_field_type_in_schema with None for log_records.
Defers final type adjustments for time partition logic. Implementation is cohesive.


173-173: Replacing timestamp column at index 0.
Good approach to universally apply p_timestamp.


233-233: update_field_type_in_schema function
Renamed/refined to handle new partition/time logic. Implementation remains consistent with the extended design.


237-237: log_records: Option<&Json> as an argument.
Allows dynamic adjustments based on actual data. Good design choice.


248-249: Handling override of data type based on log_record.
Applies custom rules introduced for schema version V1. No issues.


279-280: override_data_type modifies field types.
Handles new fields based on heuristics (time or numeric). This is a straightforward approach that helps unify data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
src/event/format/json.rs (1)

323-329: ⚠️ Potential issue

Avoid panic when partition field is absent.
Using .unwrap() can crash if the custom partition key doesn’t exist. Return an error or skip instead.

-    let custom_partition_value = json.get(custom_partition_field.trim()).unwrap().to_owned();
+    let Some(val) = json.get(custom_partition_field.trim()) else {
+        return Err(anyhow!("Missing custom partition field: {}", custom_partition_field));
+    };
+    let custom_partition_value = val.to_owned();
🧹 Nitpick comments (5)
src/event/format/json.rs (5)

55-66: Consider adding doc comments for new fields.
origin_size and log_source would benefit from small doc comments or inline explanations, clarifying their purpose and usage.


70-133: Avoid using unreachable!() in production code.
Line 128 uses unreachable!(), which can panic if the assumption ever becomes invalid. It's safer to return an error, preserving runtime stability.

-                _ => unreachable!("flatten would have failed beforehand"),
+                other => return Err(anyhow!("Expected object or array, got: {other:?}")),

156-212: Schema inference may benefit from caching or logs.
This dynamic approach works, but frequent merges can degrade performance in high-throughput scenarios. Adding caching or debug logs for newly inferred fields could improve maintainability.


233-307: Break down large method for clarity.
into_event is extensive. Consider extracting partition logic, time extraction, and schema inference steps into helper functions for readability and maintainability.


369-373: Doc comment is out of sync with return type.
The comment says "Returns None" but the function returns a HashSet<&str> unconditionally. Update the comment to reflect actual behavior.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d87407 and 08fea7b.

📒 Files selected for processing (1)
  • src/event/format/json.rs (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (20)
src/event/format/json.rs (20)

29-37: No issues with the new imports.
The additional imports for OpenTelemetry proto, collections, and synchronization primitives look appropriate.


40-51: All imports from local modules look consistent.
No concerns here—these imports are well-grouped and straightforward.


136-137: Type alias usage is clear.
Setting type Data = Json; aligns well with the rest of the implementation.


144-154: Signature change is logical.
Expanding parameters to include time_partition_limit and custom_partitions looks reasonable. Clear separation of concerns in to_data().


217-223: RecordBatch decoding logic is succinct.
Serializing data into Arrow is straightforward, and error handling looks fine here.


344-345: Timestamp extraction looks sound.
Relying on Serde to parse the field as a DateTime<Utc> is convenient and straightforward.


358-367: Early exit if any key is missing.
Returning None as soon as a schema field isn't found aligns with the code's approach of deferring to inference later. No issues here.


375-381: Field mismatch logic looks correct.
The function properly checks for unrecognized or invalid field types. Good approach for schema consistency.


443-444: Expanded imports for tests are valid.
They provide essential Arrow and array conversions used in subsequent tests.


499-530: Test coverage for basic JSON object ingestion looks good.
This test ensures integer, float, and string fields are properly captured. No issues noted.


531-559: Test with null fields is comprehensive.
Confirms that null fields don’t break ingestion. Great for verifying optional columns.


560-594: Schema derivation test is thorough.
Checks an existing schema merging in new fields. Implementation matches expected results.


595-618: Properly detects schema mismatch.
Ensures type differences raise errors. Good coverage for invalid input scenarios.


620-643: Empty object ingestion test is valid.
Verifies behavior when the JSON has no fields at all. Helps confirm robust schema handling.


645-694: Array ingestion and schema inference test.
Multiple objects with varying fields are a common scenario. This test helps validate partial field presence.


696-739: Array ingestion with null fields.
Verifies that partial null usage across rows is properly interpreted as optional columns.


741-791: Array ingestion with null plus existing stored schema.
Ensures stored schema merges gracefully with newly inferred columns.


793-819: Array schema mismatch test.
Correctly fails when new data conflicts with existing types.


821-897: Nested list ingestion test.
Checks arrays of integers, combined with other fields, verifying complex structure. No issues observed.


899-957: Float64 casting test for schema version V1.
Demonstrates handling numeric fields as floats in older or v1-based schemas. Coverage is solid.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/event/format/json.rs (1)

70-132: 🛠️ Refactor suggestion

Avoid using unreachable!() for data validation.
Although the logic suggests this branch won’t be reached, unexpected inputs in production could trigger a panic. Replace unreachable!() with a more descriptive error, preventing inadvertent panics.

-                _ => unreachable!("flatten would have failed beforehand"),
+                _ => return Err(anyhow!(
+                    "Unexpected JSON type encountered. Please verify inputs."
+                )),
🧹 Nitpick comments (3)
src/event/mod.rs (2)

34-40: Add explanatory documentation for PartitionEvent.
While this struct is straightforward, a short doc comment would help clarify its purpose and usage for future maintainers.


54-80: Consider consolidating repeated actions.
Calling commit_schema, push, update_stats, and LIVETAIL.process in the same loop may be acceptable, but running them repeatedly for multiple partitions could add overhead. Check whether some operations should be performed once outside the loop or batched together.

src/event/format/json.rs (1)

345-351: Broaden error handling or accepted formats.
Currently, only RFC3339 or RFC2822 formats are parsed as timestamps. If more date-time formats appear in production logs, consider expanding or customizing the parsing strategy.
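
For instance, one extra chrono format could be chained onto the existing fallbacks like this (the additional format string is purely illustrative):

use chrono::{DateTime, FixedOffset};

// Try RFC3339 first, then RFC2822, then one additional custom layout.
fn parse_event_time(s: &str) -> Option<DateTime<FixedOffset>> {
    DateTime::parse_from_rfc3339(s)
        .or_else(|_| DateTime::parse_from_rfc2822(s))
        // Hypothetical extra format, e.g. "2025-03-01 12:34:56+0000"
        .or_else(|_| DateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S%z"))
        .ok()
}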

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08fea7b and 9b0d865.

📒 Files selected for processing (3)
  • src/event/format/json.rs (7 hunks)
  • src/event/format/mod.rs (7 hunks)
  • src/event/mod.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (18)
src/event/mod.rs (3)

23-28: No issues with the new imports.
These imports are necessary for the functionality introduced below and seem properly scoped.


42-45: Validate change from u64 to usize.
Using usize could lead to potential inconsistencies across platforms (e.g., 32-bit vs. 64-bit). If the original intent was to store large sizes, confirm that usize remains safe in all target environments.

Also applies to: 48-48


82-96: Check rationale behind omitting stats & live-tail updates.
Unlike process(), this process_unchecked() method doesn’t call update_stats or LIVETAIL.process. Verify that this omission is intended and won’t cause data inconsistencies or missing real-time updates.

src/event/format/json.rs (8)

29-31: No concerns regarding these imports.
They provide necessary definitions for OpenTelemetry protos, standard collections, and other project modules.

Also applies to: 33-37, 40-50


140-154: Logic in to_data is clear and consistent.
Delegating to flatten_logs is a clean approach for abstracting JSON transformations.


156-213: Validate handling of new or unseen fields.
When schema inference fails due to new fields, the function returns an error. Consider whether partial schema updates or ignoring unknown fields might be appropriate in some scenarios.


215-229: Slice-based decode approach looks good.
Taking a slice of data instead of a single element is a more flexible design for creating a RecordBatch.


232-317: Beware potential partition key collisions.
The partition key includes a schema hash, optional time partition, and custom partition fields. For large-scale systems, confirm the combination sufficiently avoids collisions, or else concurrency issues and data overwrites might occur.


356-367: Confirm rejection of new fields is intended.
If “one new field” triggers None, existing data gets discarded. Check if partial schema extension is an option, or if a strict approach is desired.


369-373: Efficient collect_keys implementation.
Using a HashSet here is a clean and direct approach; it improves performance for large sets of keys.


375-382: Sufficient mismatch checks.
fields_mismatch ensures strict consistency with the existing schema, which is suitable for strongly typing logs.

src/event/format/mod.rs (7)

23-23: New import for NonZeroU32 is valid.
This nonzero type helps ensure that partition limits are never zero at runtime.


27-27: anyhow::anyhow import usage is acceptable.
No concerns; using anyhow is consistent across the updated code.


114-120: Schema inference signature extension looks consistent.
Accepting additional parameters like time_partition_limit and custom_partitions is logically aligned with the approach in json.rs.


124-155: Reserved field assertion is beneficial.
Returning an error if p_timestamp is already present as a column prevents conflicts or accidental overwriting of the timestamp field.


157-178: Solid approach in into_recordbatch.
Using update_field_type_in_schema before decoding ensures we apply consistent transformations (timestamp conversions, numeric casts, etc.) before finalizing the RecordBatch.


179-179: Method signature aligns with the rest of the refactor.
Accepting a stream reference fosters consistency with the JSON event logic.


237-319: Dynamic type overrides are well-structured.
Allowing “time” or “date”-named fields to become timestamps, and numbers to become float64 in Schema V1, makes sense for a more flexible ingestion experience.
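
In spirit, the heuristic amounts to something like the sketch below; the actual rules in the PR are richer, and the conditions and nullability here are simplified assumptions:

use arrow_schema::{DataType, Field, TimeUnit};
use serde_json::Value;

// Promote string fields whose names suggest a timestamp, widen integers to Float64
// under schema v1, and leave every other field untouched.
fn override_data_type(field: &Field, sample: &Value, schema_v1: bool) -> Field {
    let lowered = field.name().to_lowercase();
    let new_type = match (field.data_type(), sample) {
        (DataType::Utf8, Value::String(_)) if lowered.contains("time") || lowered.contains("date") => {
            DataType::Timestamp(TimeUnit::Millisecond, None)
        }
        (DataType::Int64 | DataType::UInt64, Value::Number(_)) if schema_v1 => DataType::Float64,
        (other, _) => other.clone(),
    };
    Field::new(field.name().clone(), new_type, true)
}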

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/event/format/json.rs (1)

78-141: 🛠️ Refactor suggestion

Robust log flattening with source-specific processing

The new flatten_logs method is a well-structured addition that cleanly handles different log formats based on the source type. The implementation properly uses type-specific flattening functions for various log sources.

However, there's a potential issue with the error handling:

Replace the unreachable! statement with a proper error return to avoid potential panics in production:

-            _ => unreachable!("flatten would have failed beforehand"),
+            other => return Err(anyhow!("Expected an object or an array but got: {:?}", other)),
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

237-245: Simplified schema retrieval with better error handling

The refactored get_schemas_if_present method is more concise and readable by directly iterating over schema files and using BufReader for efficient reading. However, there's a potential issue with the error handling in this method.

Consider handling potential errors when opening schema files instead of using expect, which could cause panics in production:

-            let file = File::open(path).expect("Schema File should exist");
+            let file = match File::open(path) {
+                Ok(file) => file,
+                Err(e) => {
+                    warn!("Failed to open schema file {}: {}", path.display(), e);
+                    continue;
+                }
+            };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b0d865 and d096ce0.

📒 Files selected for processing (9)
  • src/event/format/json.rs (7 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/handlers/http/logstream.rs (1 hunks)
  • src/handlers/http/mod.rs (0 hunks)
  • src/lib.rs (1 hunks)
  • src/metadata.rs (1 hunks)
  • src/parseable/mod.rs (1 hunks)
  • src/parseable/streams.rs (12 hunks)
💤 Files with no reviewable changes (1)
  • src/handlers/http/mod.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lib.rs
  • src/metadata.rs
  • src/handlers/http/logstream.rs
  • src/parseable/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (12)
src/parseable/streams.rs (2)

574-585: Well-designed schema merging implementation

The new commit_schema method provides a clean and efficient way to merge the incoming schema with the existing one, making the schema evolution process more maintainable. The implementation properly handles the merging and updating of metadata.


112-131: Improved code clarity with meaningful parameter names

Renaming schema_key to prefix enhances code readability by using a more descriptive name that better represents the parameter's purpose. The updated DiskWriter implementation is also more consistent with the pattern used elsewhere in the codebase.

src/event/mod.rs (4)

34-39: Good data structure design for partition events

The new PartitionEvent struct is a well-designed abstraction that cleanly encapsulates the relationship between a RecordBatch and its corresponding date, improving code readability and maintainability.


40-48: Enhanced Event structure for better partition management

The improved Event struct now uses a HashMap to map partition prefixes to PartitionEvent instances, which is a more elegant approach that better encapsulates the relationship between partitions and their data.


52-70: Streamlined event processing with better error handling

The updated process method now properly iterates over partition events, commits the schema only if it's the first event, and efficiently pushes data for each partition. This approach is more maintainable and aligns well with the new data structure design.


72-78: Consistent pattern for unchecked processing

The process_unchecked method follows the same pattern as the process method, maintaining consistency throughout the codebase. The method is concise and focused on its specific responsibility.

src/handlers/http/ingest.rs (3)

44-80: Simplified log ingestion with direct push_logs call

This refactoring significantly improves the ingest endpoint by:

  1. Using a direct method call to push_logs instead of complex flattening logic
  2. Centralizing log processing logic
  3. Enhancing code maintainability by reducing complexity at the HTTP handler level

The new approach provides a cleaner separation of concerns between HTTP handling and log processing.


236-260: Updated function signature for push_logs_unchecked

Accepting a &Stream parameter emphasizes the new ingestion pattern. Ensure concurrency safety whenever multiple threads attempt "unchecked" pushes simultaneously.

In highly concurrent deployments, consider locking or queueing events to avoid partial writes or race conditions in process_unchecked.

#!/bin/bash
# Look for any synchronization mechanisms around process_unchecked calls
rg -A 5 -B 5 "process_unchecked" --type rust

244-254:

❓ Verification inconclusive

Consider using more type-safe initialization for partitions

The initialization of PartitionEvent with hardcoded values for is_first_event (always true) seems potentially risky. This might lead to schema changes being committed unnecessarily if multiple threads are processing events concurrently.


🏁 Script executed:

#!/bin/bash
# Check if is_first_event is used for controlling schema commits in other places
rg "is_first_event.*commit_schema" --type rust

Length of output: 48


Action Required: Verify the Initialization of is_first_event in PartitionEvent

The current initialization hardcodes is_first_event as true, which might lead to unintended schema commits if multiple threads process events concurrently. Note that our search for usages linking is_first_event to schema commit operations (e.g. using patterns like commit_schema) did not return any results. However, this lack of evidence means we should manually verify if this field controls any critical behavior elsewhere in the codebase.

  • Review: Confirm whether is_first_event is intended to remain hardcoded or if a more robust, type-safe initialization is required.
  • Action: Manually verify (or add additional tests) that the current setup does not inadvertently trigger schema changes in concurrent scenarios.
src/event/format/json.rs (3)

149-162: Simplified interface for data conversion

The refactored to_data method now has a clearer contract, delegating the implementation details to the flatten_logs method. This separation of concerns enhances maintainability and readability.


394-397: Simplified key collection implementation

The collect_keys function has been simplified to directly collect keys into a HashSet, which is more concise and efficient than the previous implementation.


239-333:

❓ Verification inconclusive

Well-designed partition-aware event creation

The updated into_event method now properly handles partitioning based on schema keys, timestamps, and custom partition values. The implementation is thorough and correctly organizes data into appropriate partitions with their corresponding dates.

However, there's a potential issue with the prefix generation logic:


🏁 Script executed:

#!/bin/bash
# Check consistency of prefix generation across the codebase
rg "prefix = format!" --type rust

Length of output: 474


Prefix Generation Consistency — Please Verify Alignment

The updated into_event method is well structured and partition-aware, correctly organizing events by schema keys, timestamps, and custom partition values. The composite prefix generated in this method (using schema key, a formatted timestamp, and custom partition values) follows a clear intent.

However, a review of similar expressions across the codebase shows that:

  • src/event/format/json.rs builds a composite prefix via concatenation with dots.
  • src/utils/time.rs and src/parseable/streams.rs use format! with slightly different delimiters and structure (such as using folder separators like /).

Please verify that these differences in prefix formatting are intentional. If the prefixes are meant to follow the same structural conventions across modules, consider aligning the formatting logic to prevent potential partitioning mismatches.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/event/format/json.rs (1)

239-338: Improved event partitioning with clearer structure

The refactored into_event method now uses the Stream object to access all required parameters, which is a cleaner approach. The partitioning logic using JsonPartition is well-structured and makes the code more maintainable.

One potential improvement is to consider using a builder pattern for constructing the partitions, which could make the code even more readable, especially as the partitioning logic grows in complexity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d096ce0 and e15b0d2.

📒 Files selected for processing (1)
  • src/event/format/json.rs (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (22)
src/event/format/json.rs (22)

26-31: Enhanced imports for expanded functionality

The addition of Chrono date/time types and OpenTelemetry protobuf model imports properly supports the new log processing capabilities across multiple log sources.


39-53: Well-structured module organization

The imports are properly organized into logical groups, making dependencies clear. The codebase now directly uses important components like Stream and various flattening utilities for different log formats.


55-59: Appropriate addition of JsonPartition struct

The new JsonPartition struct effectively encapsulates related data (batch, schema, date) that was previously passed as separate parameters, improving code organization.


61-66: Enhanced Event struct with better source tracking

Good addition of log_source field to track the origin type of events, which supports the source-specific processing logic introduced in this PR.


69-76: Constructor updated to match new struct definition

The constructor properly initializes the new fields, maintaining backward compatibility with the existing API while supporting the new functionality.


78-141: Handle potential panic in unreachable code path

The flatten_logs function has a well-designed structure for handling different log sources. However, line 136 contains an unreachable! statement that could potentially panic in production if unexpected input is received.

Consider replacing the unreachable statement with a proper error handling:

-            _ => unreachable!("flatten would have failed beforehand"),
+            other => return Err(anyhow!("Expected an object or an array but got: {:?}", other)),

144-162: Clean simplification of to_data method

The method has been effectively simplified by delegating to the flatten_logs method, reducing code duplication and making the code more maintainable.


223-237: More efficient batch processing in decode method

The change to accept a slice of data rather than a single item allows for more efficient batch processing of events. This is a good optimization, especially for high-volume log ingestion.


392-394: Simplified key collection with HashSet

The collect_keys function now directly collects keys into a HashSet, which is appropriate for this use case, as it automatically handles duplicate keys and provides faster lookups.


397-403: Cleaner field mismatch detection

The updated fields_mismatch function is more concise and readable, using a cleaner approach to check for field validity.


495-513: Useful test helpers for type checking

The TestExt trait is a good addition that makes test assertions more readable and type-safe. This approach is much better than using raw type casting in the test code.


515-517: Utility function for test setup

The fields_to_map helper function effectively reduces repetition in the tests, making them more concise and easier to maintain.


519-550: Comprehensive test for basic object conversion

This test thoroughly verifies the conversion of a simple JSON object to a RecordBatch, correctly testing both schema inference and data conversion aspects.


552-579: Good testing of null value handling

The test properly verifies that null values are handled correctly in the conversion process, which is important for ensuring data integrity with real-world inputs.


581-614: Effective testing of schema derivation

This test properly verifies the schema derivation logic when a stored schema already exists, which is critical for ensuring schema evolution works correctly.


616-639: Important test for schema type mismatches

This test checks that the code correctly rejects inputs with types that don't match the schema, which is essential for ensuring data consistency.


641-664: Edge case testing for empty objects

Good test for handling empty JSON objects, which is an important edge case that could occur in production.


666-715: Thorough testing of array processing

This test effectively verifies that arrays of JSON objects are correctly processed into a RecordBatch with an inferred schema, including proper handling of missing fields.


717-812: Comprehensive null value handling in arrays

These tests thoroughly verify that null values in arrays of JSON objects are handled correctly, both with and without a pre-existing schema. This is critical for robustness with varied inputs.


814-918: Good testing of nested data structures

The test thoroughly verifies handling of nested data structures (arrays within objects), which is important for complex log formats.


920-978: Verification of schema version compatibilities

This test effectively verifies that the code handles different schema versions correctly, which is critical for backward compatibility as the system evolves.


164-221:

❓ Verification inconclusive

Simplification of schema inference logic

The code now better handles the case when no schema can be derived, with clearer error messaging and a more direct flow. The change from checking for None to using if let Some(schema) improves readability.

However, there's still a potential issue with collect_keys not handling empty sets explicitly. Consider adding a check to ensure the fields set isn't empty before proceeding.


🏁 Script executed:

#!/bin/bash
# Check how empty key sets are handled elsewhere in the codebase
rg "collect_keys.*empty|empty.*collect_keys" --type rust

Length of output: 58


Schema Inference Logic Improvement – Please Verify Empty Set Handling

The updated schema inference logic is clearer and easier to follow. The refactoring using if let Some(schema) has indeed improved readability and error messaging. However, the review comment’s concern regarding potential issues with collect_keys not explicitly handling empty sets remains unresolved. Our initial search for explicit handling (using rg "collect_keys.*empty|empty.*collect_keys") returned no results, so it’s unclear whether an empty fields set is adequately addressed elsewhere in the codebase.

  • Action Required: Please verify manually that when collect_keys returns an empty set, the downstream operations (such as schema derivation/merging) behave as intended. If the empty set could lead to runtime issues, consider adding an explicit check (and corresponding error handling or fallback behavior) before proceeding.

Signed-off-by: Devdutt Shenoi <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/handlers/http/cluster/utils.rs (2)

216-219: Add documentation for the JsonWithSize struct.

This is a well-designed struct that encapsulates JSON data with its size. Consider adding documentation comments to explain its purpose and usage, especially since it's a public struct that might be used by other developers.

+/// A wrapper struct that holds deserialized JSON data along with its original byte size.
+/// 
+/// This is useful for tracking payload sizes during request processing and can be used
+/// for monitoring and enforcing size limits.
 pub struct JsonWithSize<T> {
+    /// The deserialized JSON data
     pub json: T,
+    /// Size of the original JSON payload in bytes
     pub byte_size: usize,
 }

221-260: Well-implemented FromRequest trait with proper size checking.

The implementation correctly handles streaming JSON payloads, tracks their size, and enforces the size limit. The previously identified Clippy issues have been addressed properly.

Some suggestions for improvement:

  • Add documentation comments for the implementation
  • Consider adding debug logging for rejected oversized payloads
+/// Implements FromRequest for JsonWithSize<T> to allow automatic extraction and size tracking
+/// of JSON payloads from HTTP requests.
+/// 
+/// This implementation enforces the MAX_EVENT_PAYLOAD_SIZE limit and tracks the byte size
+/// of the original payload.
 impl<T: DeserializeOwned + 'static> FromRequest for JsonWithSize<T> {
     type Error = actix_web::error::Error;
     type Future = Pin<Box<dyn Future<Output = Result<Self, Self::Error>>>>;

     fn from_request(_: &HttpRequest, payload: &mut Payload) -> Self::Future {
         let limit = MAX_EVENT_PAYLOAD_SIZE;

         // Take ownership of payload for async processing
         let mut payload = payload.take();

         Box::pin(async move {
             // Buffer to collect all bytes
             let mut body = BytesMut::new();
             let mut byte_size = 0;

             // Collect all bytes from the payload stream
             while let Some(chunk) = payload.next().await {
                 let chunk = chunk?;
                 byte_size += chunk.len();

                 // Check the size limit
                 if byte_size > limit {
+                    // Log rejection of oversized payload
+                    tracing::debug!("Rejected payload exceeding size limit: {}/{} bytes", byte_size, limit);
                     return Err(ErrorPayloadTooLarge(byte_size));
                 }

                 // Extend our buffer with the chunk
                 body.extend_from_slice(&chunk);
             }

             // Convert the collected bytes to Bytes
             let bytes = body.freeze();

             // Deserialize the JSON payload
             let json =
                 serde_json::from_slice::<T>(&bytes).map_err(JsonPayloadError::Deserialize)?;

+            tracing::trace!("Successfully processed JSON payload: {} bytes", byte_size);
             Ok(JsonWithSize { json, byte_size })
         })
     }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e15b0d2 and b106106.

📒 Files selected for processing (5)
  • src/handlers/http/cluster/mod.rs (3 hunks)
  • src/handlers/http/cluster/utils.rs (2 hunks)
  • src/handlers/http/logstream.rs (1 hunks)
  • src/handlers/http/query.rs (1 hunks)
  • src/parseable/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/handlers/http/logstream.rs
  • src/parseable/mod.rs
  • src/handlers/http/cluster/mod.rs
  • src/handlers/http/query.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (1)
src/handlers/http/cluster/utils.rs (1)

19-36: Good choice of imports for the new functionality.

The added imports support the new JSON payload handling functionality and are appropriately scoped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
src/event/format/json.rs (1)

344-361: ⚠️ Potential issue

Risk of panic with .unwrap() in extract_custom_partition_values.
Using json.get(custom_partition_field.trim()).unwrap() can panic if the field is missing. To prevent runtime crashes, consider returning an error or skipping missing fields.

Here’s a possible fix:

- let custom_partition_value = json.get(custom_partition_field.trim()).unwrap().to_owned();
+ let custom_partition_value = match json.get(custom_partition_field.trim()) {
+     Some(value) => value.to_owned(),
+     None => {
+         // Return error or skip missing value
+         return Err(anyhow!("Missing custom partition field: {}", custom_partition_field));
+     }
+ };
🧹 Nitpick comments (5)
src/event/format/json.rs (5)

55-60: Consider adding a doc comment to JsonPartition.
A brief doc comment about the purpose and usage of JsonPartition would aid readability and maintainability.


63-66: Check cross-platform portability for origin_size.
usize is only 32 bits wide on 32-bit targets, so origin_size would have a platform-dependent maximum and need casts wherever a fixed-width integer is expected. Consider using u64 instead for a consistent range across architectures.


164-221: Potential performance bottleneck in infer_schema.
Repeatedly inferring and merging schemas on large data volumes can be expensive. Consider caching or consolidating these merge operations to reduce overhead.


240-339: Evaluate partition prefix collisions.
When building the partition prefix with dynamic custom partition values, consider how collisions or naming conflicts might be handled. This may require adding user warnings or fallback logic.


464-979: Comprehensive test coverage.
These tests effectively cover arrays, null values, nested fields, and schema derivation. Consider adding a negative test for the unreachable!() branch within flatten_logs to ensure robust error handling for malformed inputs.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b106106 and b8606b3.

📒 Files selected for processing (2)
  • src/event/format/json.rs (7 hunks)
  • src/handlers/http/cluster/utils.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (13)
src/handlers/http/cluster/utils.rs (3)

19-34: Well-structured import additions for new functionality

The new imports are appropriately organized and provide the necessary components for handling request payloads, futures, byte manipulation, and JSON deserialization - all essential for the new JsonWithSize implementation.


215-218: Good design of the JsonWithSize<T> struct

This generic struct effectively pairs the deserialized JSON data with its byte size, which is a clean approach for tracking resource usage during event ingestion.
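
For orientation, the pairing can be as small as the following sketch (field names mirror the excerpt earlier in this thread; this is an illustration, not the PR's exact definition):

/// Carries a deserialized payload together with the number of bytes read off the wire.
pub struct JsonWithSize<T> {
    pub json: T,
    pub byte_size: usize,
}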


220-259: Well-implemented FromRequest trait for JSON payload handling

The implementation of FromRequest for JsonWithSize<T> is robust, featuring:

  • Proper streaming of request payload to handle large data efficiently
  • Size limit enforcement to prevent resource exhaustion
  • Clear error handling for size violations and deserialization issues
  • Effective use of async/await patterns

I notice the implementation correctly addresses the previous Clippy warnings by:

  1. Removing the redundant .into() conversion at line 242
  2. Using direct function reference rather than closure for the error mapping at line 254
src/event/format/json.rs (10)

26-54: Use statements verified.
No issues found with these new or updated imports; they appear logically consistent and relevant to the newly introduced functionalities.


69-76: Initialization approach looks good.
The Event::new function is straightforward and aligns with standard Rust best practices.


78-135: flatten_logs function structure is clear.
The function covers multiple log sources and consistently flattens them using specialized helpers. Error handling for unexpected data is mostly robust.


136-136: Relying on unreachable!() can cause panics in production.
This pattern was flagged in a past review. Consider returning a user-friendly error instead of panicking if a truly unexpected value is encountered.


144-163: Clean extension of to_data with new arguments.
The refactoring to accept time_partition_limit and custom_partitions is logical and helps keep the flattening logic self-contained.


224-237: Arrow decoding logic looks fine.
Using the ReaderBuilder with serialize and flush is a standard approach. No apparent issues.


365-372: Time extraction logic is appropriate.
Using serde_json::from_value to parse RFC3339 strings into DateTime<Utc> is a robust approach; error cases are handled properly.


377-388: Schema derivation is straightforward.
Returning None if a field is missing is a simple fallback that the caller handles gracefully. No significant issues here.


390-394: collect_keys improvement noted.
This no longer uses expect, addressing previous concerns about potential panics.


396-403: No concerns in fields_mismatch.
The check for missing or invalid field types appears clear and robust.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/parseable/staging/writer.rs (2)

41-44: Consider making struct fields private for better encapsulation.

While the DiskWriter struct itself is a good addition, consider making the fields private and providing accessor methods if needed to follow Rust's information hiding principles.

 pub struct DiskWriter {
-    pub inner: StreamWriter<BufWriter<File>>,
-    pub path: PathBuf,
+    inner: StreamWriter<BufWriter<File>>,
+    path: PathBuf,
 }

47-53: Good implementation of the constructor method.

The new method appropriately sets up the file with OpenOptions and creates a buffered writer. Consider specifying a buffer size for potential performance optimization.

 pub fn new(path: PathBuf, schema: &Schema) -> Result<Self, StagingError> {
     let file = OpenOptions::new().create(true).append(true).open(&path)?;
-
-    let inner = StreamWriter::try_new_buffered(file, schema)?;
+    // Use a specified buffer size for potential I/O performance optimization
+    let buf_writer = BufWriter::with_capacity(64 * 1024, file); // 64KB buffer
+    let inner = StreamWriter::try_new(buf_writer, schema)?;

     Ok(Self { inner, path })
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8606b3 and fbd3a18.

📒 Files selected for processing (1)
  • src/parseable/staging/writer.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (4)
src/parseable/staging/writer.rs (4)

21-21: Good use of imports for the new functionality.

The added imports support the new DiskWriter struct implementation, particularly the file operations and buffered writing capabilities.


29-33: Appropriate error handling and module dependencies.

Good addition of the tracing::error import for logging errors and proper organization of imports for ARROW_FILE_EXTENSION and error handling.


38-39: Good architectural improvement through encapsulation.

Changing from direct StreamWriter<File> to DiskWriter improves encapsulation and separation of concerns, making the code more maintainable.


55-57: Clean delegation to inner writer with proper error mapping.

This method appropriately delegates to the inner stream writer and maps the error to the domain-specific StagingError type.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
src/event/format/json.rs (1)

351-368: ⚠️ Potential issue

Unwrap may trigger panic for missing partition fields.
Line 356 calls .unwrap() on json.get(...). If the field is missing, it will panic. For robust ingestion, consider returning an error or skipping missing fields.

- let custom_partition_value = json.get(custom_partition_field.trim()).unwrap().to_owned();
+ let custom_partition_value = match json.get(custom_partition_field.trim()) {
+     Some(val) => val.to_owned(),
+     None => return Err(anyhow!(
+         "Missing custom partition field: {}",
+         custom_partition_field
+     )),
+ };
♻️ Duplicate comments (1)
src/event/format/json.rs (1)

109-137: ⚠️ Potential issue

Avoid panic on unexpected input.
Using _ => unreachable!("flatten would have failed beforehand") can cause a runtime panic if the code ever encounters an unexpected JSON type. Consider returning a structured error to handle malformed data gracefully.

- _ => unreachable!("flatten would have failed beforehand"),
+ other => return Err(anyhow!("Expected object or array, found: {:?}", other)),
🧹 Nitpick comments (3)
src/event/format/json.rs (3)

164-205: Schema inference logic is improved, but there's a TODO.
Your logic for merging inferred schema with stored schema is sound. Line 179 has a TODO: marker; consider clarifying or removing if no additional plan is needed.

Would you like help completing or removing the TODO?


330-346: Potential string-escaping for partition prefixes.
If users include special characters in partition fields, consider minimal escaping or sanitizing to avoid filesystem or path issues.


382-399: Doc comment mismatch.
The doc for derive_arrow_schema and collect_keys mentions returning None if one value isn't an object, but the implementation differs. Update doc comments to match the current logic.

- // Returns None if even one of the value is not an Object
+ // Collects field names from the JSON object, returning a HashSet.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbd3a18 and 592d3a1.

📒 Files selected for processing (3)
  • src/event/format/json.rs (6 hunks)
  • src/parseable/staging/writer.rs (2 hunks)
  • src/parseable/streams.rs (12 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (23)
src/parseable/streams.rs (5)

113-114: Parameter rename improves clarity

Changing from schema_key to prefix better reflects the role of this parameter, making the code more intuitive.


119-131: DiskWriter implementation enhances resource management

The code now uses the new DiskWriter struct instead of direct StreamWriter operations, which provides better encapsulation and automatic resource cleanup via the Drop trait.


240-245: Simplified schema file reading

Using BufReader for reading schema files is more efficient for large files. The code now directly attempts to read schema files without the initial directory check, which streamlines the implementation.


327-333: Improved flush implementation with explicit resource management

The updated flush method clearly separates memory clearing and disk flushing operations with helpful comments. Using std::mem::take to drop disk writers is an effective way to ensure all resources are properly released.
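
As a generic illustration of that take-to-drop idea (not the PR's code), swapping the map for an empty one moves every writer out, so each writer's Drop runs immediately:

use std::collections::HashMap;

fn drop_all_writers<W>(writers: &mut HashMap<String, W>) {
    // Replace the map with an empty one; the returned map owns every writer,
    // so dropping it here runs each writer's Drop (flush/close) right away.
    let owned = std::mem::take(writers);
    drop(owned);
}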


567-578: New commit_schema method improves schema management

This new method encapsulates the logic for merging a provided schema with the current schema and updating the metadata. The implementation is clean and maintains proper error handling.

src/parseable/staging/writer.rs (6)

45-48: Well-structured DiskWriter struct

The new DiskWriter struct effectively encapsulates the file writing operations with clear field definitions.


51-57: Clean implementation of the new method

The constructor properly initializes the writer with appropriate file options and error handling.


59-61: Concise write method with proper error mapping

The write method correctly maps arrow errors to your application's StagingError type, maintaining consistent error handling.


63-74: Error handling in finish method could be improved

The finish method logs errors but doesn't propagate them to the caller, which could lead to silent failures.


77-81: Excellent Drop trait implementation

Implementing the Drop trait ensures that resources are properly cleaned up when the DiskWriter goes out of scope, preventing resource leaks.
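
A minimal sketch of that pattern, assuming the arrow crate's ipc::writer::StreamWriter and the field layout shown in the earlier diff (not the PR's exact code):

use std::{fs::File, io::BufWriter, path::PathBuf};
use arrow::ipc::writer::StreamWriter;

struct DiskWriter {
    inner: StreamWriter<BufWriter<File>>,
    path: PathBuf,
}

impl Drop for DiskWriter {
    fn drop(&mut self) {
        // Finalize the Arrow IPC stream on scope exit; report instead of panicking inside Drop.
        if let Err(err) = self.inner.finish() {
            eprintln!("failed to finalize {}: {err}", self.path.display());
        }
    }
}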


42-43: Updated Writer struct field improves resource management

Changing the disk field to use DiskWriter instead of StreamWriter directly improves encapsulation and ensures proper resource cleanup.

src/event/format/json.rs (12)

26-37: Imports look fine.
No issues spotted with the additional imports for chrono, opentelemetry_proto, and various standard library collections.


40-53: Module references are consistent with the refactor.
These references to super modules and crate internals align well with the broader event-handling refactor.


55-66: Good introduction of JsonPartition struct.
Storing batched JSON, schema, and date helps organize partitions for ingestion.


69-84: Constructor changes look good.
The additional fields (origin_size and log_source) provide better tracking of upstream metadata.


85-107: Source-based flattening is well-structured.
Deserialization for each log source is straightforward, and error handling is appropriate.


144-146: Data alias and to_data function.
These changes align with the new approach of converting JSON into a vector of flattened logs.

Also applies to: 152-162


212-219: Field mismatch check is straightforward.
Your mismatch error for incompatible data types is clearly reported.


223-223: Decoding improvements.
Increasing the batch size capacity should improve throughput, and handling the flush result explicitly avoids silently dropping buffered rows.

Also applies to: 231-231


240-327: Refined into_event logic.
Partitioning events by prefix, date, and custom partitions is a neat approach to batch record creation. This helps align ingestion with user-defined grouping.


372-379: Time extraction is good.
Graceful error handling on missing or invalid fields ensures robust parsing.


403-465: Validation functions are thorough.
Overall logic checks for a wide range of DataType–JSON matchups, providing robust schema enforcement.


471-1033: Unit tests cover multiple scenarios effectively.
The variety of JSON shapes, null values, type mismatches, and partition logic ensures good confidence in the refactored code.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/event/format/json.rs (1)

78-141: 💡 Verification agent

❓ Verification inconclusive

Robust handling for different log sources with a potential edge case

The new flatten_logs method nicely handles different log sources with specific processing for each type. This improves maintainability by centralizing all flattening logic in one place.

However, line 136 uses unreachable!("flatten would have failed beforehand") which could potentially panic in production if an unexpected JSON structure is encountered that passes the earlier checks.

-            _ => unreachable!("flatten would have failed beforehand"),
+            _ => return Err(anyhow!("Unexpected JSON value type: expected object or array")),

This ensures graceful error handling even for unexpected edge cases.


Action Required: Replace unreachable! with Explicit Error Handling

The new flatten_logs method centralizes the flattening logic well. However, using unreachable!("flatten would have failed beforehand") on line 136 risks a production panic if an unexpected JSON value slips through the earlier validations. Instead, return an error to gracefully handle such cases.

Recommended change:

-            _ => unreachable!("flatten would have failed beforehand"),
+            _ => return Err(anyhow!("Unexpected JSON value type: expected object or array")),

This change ensures that even in unforeseen scenarios, the function returns an error rather than panicking.

🧹 Nitpick comments (1)
src/event/format/json.rs (1)

164-173: Enhanced schema inference with potential empty fields concern

The infer_schema method is well-structured, but there's a potential issue: the call to collect_keys at line 172 doesn't check if the returned set might be empty (e.g., if processing an empty object).

Consider adding validation for empty field sets:

-        let fields = collect_keys(data);
+        let fields = collect_keys(data);
+        if fields.is_empty() {
+            // Either log a warning or handle this case specifically
+            // Example: return Err(anyhow!("No fields found in the data"))
+        }

Validating this will help prevent potential issues downstream when an empty schema is used.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bb7afc and 924ef5b.

📒 Files selected for processing (8)
  • src/connectors/kafka/processor.rs (4 hunks)
  • src/event/format/json.rs (6 hunks)
  • src/event/format/mod.rs (8 hunks)
  • src/event/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/handlers/http/logstream.rs (1 hunks)
  • src/metadata.rs (1 hunks)
  • src/parseable/streams.rs (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/metadata.rs
  • src/handlers/http/logstream.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (33)
src/parseable/streams.rs (4)

575-586: Good implementation of schema evolution with commit_schema

This new method provides a clean, dedicated way to merge schemas and update metadata. The implementation correctly handles schema evolution by merging the current schema with the new one and updating the metadata accordingly.

The use of try_merge for schema evolution is appropriate, as it preserves field compatibility while allowing for schema evolution.
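
A rough sketch of the merge step itself, using arrow's Schema::try_merge (the stream and metadata plumbing around it belong to the PR and are omitted):

use std::sync::Arc;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;

fn merge_into_current(current: &Schema, incoming: Schema) -> Result<Arc<Schema>, ArrowError> {
    // try_merge unions compatible fields and errors on conflicting data types,
    // which is what makes it suitable for additive schema evolution.
    let merged = Schema::try_merge([current.clone(), incoming])?;
    Ok(Arc::new(merged))
}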


118-119: Improved clarity with renamed parameter and updated DiskWriter implementation

Changing schema_key to prefix improves code understandability, as it better reflects the parameter's purpose. The updated implementation using DiskWriter instead of StreamWriter streamlines the record writing process.

The match pattern with improved error handling and file creation flow is now more robust and easier to follow.

Also applies to: 125-141


248-254: Simplified schema retrieval logic

The refactored get_schemas_if_present method removes redundant checks and streamlines the file reading process with BufReader, which improves performance when reading schema files.


150-156: Good extraction of filename generation into dedicated method

The extraction of filename generation logic into filename_from_prefix promotes code reusability and improves maintainability.

src/connectors/kafka/processor.rs (2)

41-77: Streamlined Kafka record processing with direct stream ingestion

The refactored method process_event_from_chunk improves the processing flow by:

  1. Directly pushing logs to the stream instead of creating an intermediate event object
  2. Returning the payload size, which is more meaningful for logging/tracking
  3. Simplifying the code path with fewer transformations

This change aligns with the overall event handling refactor and reduces complexity.


86-88: Updated method call with improved logging

The call to the renamed method correctly references the new return type, logging the processed size instead of just the record count, which provides more meaningful metrics.

src/event/mod.rs (4)

35-39: Good encapsulation with new PartitionEvent struct

The new PartitionEvent struct effectively encapsulates the record batch and its parsed timestamp, improving the organization of event data.


41-47: Improved event structure with partition-based approach

Replacing individual fields with a partition-based HashMap improves the Event struct by:

  1. Better organizing related data (record batches and timestamps)
  2. Supporting multi-partition events more naturally
  3. Simplifying iteration over partitions during processing

The change to usize for origin_size is also more appropriate for memory sizing.
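
The shape described above might look roughly like this; the field names are illustrative guesses, not copied from the PR:

use std::collections::HashMap;
use arrow::record_batch::RecordBatch;
use chrono::NaiveDateTime;

// One record batch plus the timestamp it was parsed under, keyed by partition prefix.
struct PartitionEvent {
    record_batch: RecordBatch,
    parsed_timestamp: NaiveDateTime,
}

struct Event {
    origin_size: usize,
    partitions: HashMap<String, PartitionEvent>,
}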


51-75: Enhanced event processing with stream-specific schema commitment

The refactored process method now:

  1. Iterates cleanly over partitions
  2. Commits the schema to the stream when needed (first event)
  3. Continues to update statistics and livetail functionality

Passing the stream as a parameter provides better encapsulation and clearer dependencies.


78-87: Consistent updates to process_unchecked method

The changes to process_unchecked appropriately mirror the partition-based approach of the main process method, maintaining consistency in the API.

src/handlers/http/ingest.rs (6)

49-52: Simplified log ingestion with direct push_logs call

The refactored ingest function simplifies the ingestion flow by:

  1. Using JsonWithSize to capture both the JSON payload and its size upfront
  2. Obtaining the stream directly with get_or_create_stream
  3. Creating and processing events in a clear, direct manner

This approach provides better separation of concerns between HTTP handling and log processing logic.

Also applies to: 85-90


97-100: Consistent pattern adoption across OTEL ingestion endpoints

The OTEL logs ingestion endpoint now follows the same pattern as the main ingestion endpoint, creating a more consistent codebase.

Also applies to: 130-135


142-145: Consistent pattern for OTEL metrics ingestion

The OTEL metrics ingestion endpoint follows the same simplified pattern, maintaining consistency across all ingestion endpoints.

Also applies to: 172-177


184-187: Consistent pattern for OTEL traces ingestion

The OTEL traces ingestion endpoint also adopts the simplified pattern, completing the consistent implementation across all ingestion endpoints.

Also applies to: 216-221


228-232: Direct event processing in post_event endpoint

The post_event endpoint now follows the same simplified pattern as other ingestion endpoints, improving consistency and maintainability.

Also applies to: 269-274


279-299: Updated function signature for push_logs_unchecked

The function now accepts a &Stream parameter instead of a stream name, emphasizing the new ingestion pattern. The implementation correctly uses the new PartitionEvent structure to match the refactored event handling system.

In highly concurrent deployments, consider locking or queueing events to avoid partial writes or race conditions in process_unchecked.

src/event/format/json.rs (9)

149-162: Well-designed method delegation to improve modularity

The refactored to_data method now delegates to the new flatten_logs method, which nicely separates concerns and allows for better code organization. This approach makes the code more maintainable by avoiding duplication and keeping related functionality together.


224-237: Improved batch processing for record creation

The decode method now accepts a slice of data objects rather than a single item, allowing for more efficient batch processing. This change aligns well with Arrow's batch-oriented processing model and should improve performance when handling multiple records.
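
A sketch of that slice-based, serialize-then-flush decoding with arrow's JSON decoder; batch sizing and error handling are simplified here and may differ from the PR:

use std::sync::Arc;
use arrow_array::RecordBatch;
use arrow_json::ReaderBuilder;
use arrow_schema::Schema;
use serde_json::{Map, Value};

fn decode(rows: &[Map<String, Value>], schema: Arc<Schema>) -> anyhow::Result<RecordBatch> {
    let mut decoder = ReaderBuilder::new(schema)
        // size the batch to the slice so a single flush can emit everything
        .with_batch_size(rows.len().max(1))
        .build_decoder()?;
    decoder.serialize(rows)?;
    decoder
        .flush()?
        .ok_or_else(|| anyhow::anyhow!("no rows were decoded"))
}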


240-323: Well-structured event partitioning with clear logic

The refactored into_event method now effectively:

  1. Extracts configuration from the stream
  2. Processes each JSON object in the data
  3. Groups records by partition prefix
  4. Creates record batches for each partition

This approach handles potentially large record sets efficiently by organizing them into logical partitions. The structure is clean and maintainable, with good separation of concerns between data retrieval, schema inference, and partition generation.


335-351: Clean prefix generation logic

The new generate_prefix function provides a clear, reusable way to create storage prefixes that incorporate schema, timestamp, and custom partition information. This improves maintainability by centralizing the prefix format in one place.
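
A hypothetical prefix builder along those lines; the exact separators and bucket granularity in the PR may differ:

use chrono::{DateTime, Timelike, Utc};

fn generate_prefix(schema_key: &str, time: DateTime<Utc>, custom: &[(String, String)]) -> String {
    // Combine schema identity, timestamp buckets and custom partition values
    // into a single storage prefix.
    let mut prefix = format!(
        "{schema_key}.date={}.hour={:02}.minute={:02}",
        time.format("%Y-%m-%d"),
        time.hour(),
        time.minute(),
    );
    for (key, value) in custom {
        prefix.push_str(&format!(".{key}={value}"));
    }
    prefix
}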


403-406: Simplified key collection with improved API

The collect_keys function has been simplified to directly return a HashSet<&str> instead of a more complex Result type. This makes the code more concise and easier to use.
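
An approximation of that simplified helper, gathering every field name across the flattened objects:

use std::collections::HashSet;
use serde_json::{Map, Value};

fn collect_keys<'a>(objects: &'a [Map<String, Value>]) -> HashSet<&'a str> {
    objects
        .iter()
        .flat_map(|obj| obj.keys().map(String::as_str))
        .collect()
}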


408-420: Cleaner schema validation approach

The fields_mismatch function has been refactored to use a more functional approach with any(). This makes the validation logic more concise and easier to understand.


566-584: Useful test utilities for Arrow arrays

The TestExt trait provides convenient methods for working with Arrow arrays in tests, making assertions more readable and concise. This is a good practice for test code maintainability.


586-589: Helpful schema utility for tests

The fields_to_map function simplifies creating a HashMap from Field objects, which is useful for test setup. This improves test readability and reduces duplication.


591-1096: Comprehensive test suite covering core functionality

The extensive test suite thoroughly covers different aspects of the refactored code:

  • Basic object processing
  • Schema derivation and validation
  • Array handling
  • Handling null values
  • Nested type handling
  • Schema mismatch detection
  • Partition prefix generation

This comprehensive testing is crucial for ensuring the refactored code works as expected across various scenarios.

src/event/format/mod.rs (8)

123-129: Streamlined method signature with clear parameters

The to_data method signature has been improved to:

  1. Take more specific parameters (time_partition_limit and custom_partitions)
  2. Return a cleaner result type (anyhow::Result<Vec<Self::Data>>)

This makes the API more focused and the error handling more consistent.


131-138: Good separation of concerns with schema inference

Adding the dedicated infer_schema method separates the schema inference logic from data conversion. This improves maintainability by providing a clear place for schema validation and derivation logic.


139-139: Enhanced batch processing capability

The decode method now accepts a slice of data items rather than a single item, enabling more efficient batch processing. This aligns with Arrow's batch-oriented processing model.


142-172: Simplified schema validation with modern Rust patterns

The refactored prepare_and_validate_schema method uses the is_none_or pattern for cleaner validation logic. This makes the code more concise while maintaining readability.

The explicit Ok(schema) return also makes the success path clearer.
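
A tiny, standalone illustration of the is_none_or idiom (stable since Rust 1.82), detached from the PR's actual validation code:

fn within_limit(limit: Option<usize>, observed: usize) -> bool {
    // true when no limit is configured, or when the observed value satisfies it
    limit.is_none_or(|max| observed <= max)
}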


174-194: Well-encapsulated record batch creation

The new into_recordbatch method cleanly encapsulates the process of:

  1. Creating a schema from fields
  2. Updating field types
  3. Decoding data into a record batch
  4. Setting the timestamp column

This makes the record batch creation logic more maintainable and reusable.


196-196: Simplified stream-based event creation

The into_event method signature has been simplified to take a stream reference, which is a cleaner approach that gives the implementation access to all the configuration it needs from a central place.


254-254: Type consistency with JSON handling

Changing the log_records parameter type from Option<&Vec<Value>> to Option<&Json> maintains consistency with the JSON handling approach used throughout the codebase.


297-297: Consistent parameter typing

The log_record parameter type in override_data_type has been changed to use the Json type alias consistently with other parts of the codebase. This improves type consistency across the application.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/event/format/mod.rs (1)

296-299: Improved type safety with custom JSON type

Changed log_record parameter from Value to Json and updated the matching logic to use this type. This change:

  1. Provides better type safety and specialized methods for JSON processing
  2. Makes the code more maintainable by using a custom type that can evolve with application needs

Consider documenting the advantages of the Json type over Value in comments to help future developers understand the rationale.

Also applies to: 306-306
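
One plausible definition of the alias, inferred from how it is used in these comments; the PR's actual definition may differ:

use serde_json::{Map, Value};

pub type Json = Map<String, Value>;

// With the alias, helpers read as operating on a JSON object rather than an arbitrary Value.
fn field_names(record: &Json) -> Vec<&str> {
    record.keys().map(String::as_str).collect()
}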

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 924ef5b and 3d8c338.

📒 Files selected for processing (1)
  • src/event/format/mod.rs (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (9)
src/event/format/mod.rs (9)

23-23: Updated imports to support new functionality

The imports have been updated to support the new event handling approach:

  • Added NonZeroU32 for time partition limits with validation
  • Switched to anyhow for more ergonomic error handling
  • Added utility imports for JSON handling and Arrow operations

These changes align well with the refactoring objectives to improve the event handling system.

Also applies to: 27-27, 36-40


99-100: Improved code semantics with type alias

The IsFirstEvent type alias makes the code more self-documenting and readable by clearly indicating the meaning of the boolean value in function signatures.
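
The alias is most likely just a named bool; even so, it documents intent at call sites (sketch, not copied from the PR):

type IsFirstEvent = bool;

fn log_schema_commit(stream: &str, is_first: IsFirstEvent) {
    // A bare `bool` here would not convey what `true` means; the alias does.
    if is_first {
        println!("committing initial schema for stream {stream}");
    }
}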


123-129: Improved signature for to_data method

The signature changes in to_data method indicate a significant refactoring:

  1. Added support for partition limits and custom partitions
  2. Changed return type to return multiple data items (Vec<Self::Data>)
  3. Switched to anyhow::Result for better error handling

This change appears to be part of an effort to improve the handling of large datasets by allowing batch processing and customized partitioning.


131-138: Added schema inference as a separate concern

The new infer_schema method is a good example of separation of concerns - extracting schema inference logic from data processing. This will make the code more maintainable and easier to test.

This approach allows for more flexible schema handling and will likely improve the performance of schema validation operations.


139-139: Updated decode method for batch processing

The decode method now accepts a slice of data items rather than a single item, enabling batch processing of events. This change should improve performance for high-volume log ingestion scenarios.


141-172: Simplified schema validation with inline logic

The previous separate is_schema_matching method has been replaced with inline schema validation logic that:

  1. Is more concise and easier to understand
  2. Directly expresses the validation condition
  3. Uses modern Rust pattern with is_none_or for cleaner code

This change improves readability while maintaining the same functionality.


174-195: Restructured into_recordbatch for clarity and flexibility

The into_recordbatch method has been completely restructured to:

  1. Take explicit parameters rather than using self
  2. Accept preprocessed data and schema
  3. Focus solely on converting data to a RecordBatch

This change follows good design principles by making the function more modular and single-purpose, which should make the code easier to maintain and test.


197-197: Added into_event method for type conversion

The new into_event method completes the event processing pipeline by providing a way to convert from the format-specific data type to the general Event type.

This addition makes the trait more comprehensive and ensures a complete processing flow from raw data to structured events.


251-256: Updated JSON handling with custom type

The parameter type change from Option<&Vec<Value>> to Option<&Json> and the corresponding implementation changes suggest a transition to a more specialized JSON handling approach.

This change likely provides better performance and more specific functionality for JSON processing, which is important for a log processing system.

Also applies to: 266-269
