refactor: retire to_recordbatch in favor of to_event #1181

Closed · wants to merge 3 commits

Conversation

de-sh (Contributor) commented Feb 9, 2025

Fixes #XXXX.

Description

Please note that this PR requires #1180 to be merged first.


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Introduced enhanced partition extraction to accurately capture timing and custom metadata from events.
    • Updated the event conversion process to use direct stream information, reducing parameter complexity.
  • Refactor

    • Streamlined the event ingestion workflow for more direct and reliable processing.
    • Improved consistency in handling stream information across the system by centralizing its access during event processing.

@de-sh changed the title to refactor: retire to_recordbatch in favor of to_event Feb 10, 2025
@de-sh marked this pull request as draft February 10, 2025 07:18
coderabbitai bot commented Mar 5, 2025

Walkthrough

This update streamlines and simplifies event processing and ingestion across multiple modules. In the Kafka connector, the event processing method has been renamed and refactored to process events directly. In the event formatting modules, new partition extraction logic is introduced, redundant methods removed, and method signatures adjusted for unified error handling. The Event struct now delegates stream name handling to its processing methods via a passed stream parameter. HTTP ingestion functions also utilize the stream object directly, reducing parameter complexity. Finally, the public exports for the parseable module now include the Stream type.

Changes

File(s) and change summary:

  • src/connectors/kafka/processor.rs: Renamed build_event_from_chunk to process_event_from_chunk, refactoring the logic to process events directly and return a Result<()> instead of constructing a ParseableEvent.
  • src/event/format/json.rs: Added get_partitions for partition extraction; updated the to_data return type to include EventSchema; removed into_event.
  • src/event/format/mod.rs: Added get_partitions to the trait; updated to_data and decode signatures to use anyhow::Result; replaced into_recordbatch with to_event, which now uses stream-based schema and partition extraction.
  • src/event/mod.rs: Removed the stream_name field from the Event struct; updated the process and process_unchecked methods to require a &Stream parameter, shifting stream name handling from the struct to the method inputs.
  • src/handlers/http/ingest.rs and src/handlers/http/modal/utils/ingest_utils.rs: Simplified ingestion functions by replacing schema retrieval and multi-parameter event creation (into_event) with direct stream object usage and a call to to_event; updated calls to process_unchecked to accept the stream parameter.
  • src/parseable/mod.rs: Updated public exports to include the Stream type, expanding the module interface.
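
To make the shape of the refactor concrete, here is a minimal sketch of the trait surface before and after. All types and signatures below are simplified assumptions, not the actual parseable definitions.

use anyhow::Result;

// Simplified stand-ins for the real parseable types (assumed shapes).
struct RecordBatch;
struct Stream {
    stream_name: String,
    time_partition: Option<String>,
    custom_partitions: Vec<String>,
}
struct Event {
    origin_format: &'static str,
    rb: RecordBatch,
}

trait EventFormat {
    // Before: callers pulled schema, flags, and partition settings out of the
    // stream and passed them one by one to into_recordbatch.
    // After: a single call that reads whatever it needs from the stream itself.
    fn to_event(self, stream: &Stream) -> Result<Event>;
}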

Sequence Diagram(s)

sequenceDiagram
    participant Kafka as Kafka Consumer
    participant Processor as ParseableSinkProcessor
    participant Formatter as Event Format

    Kafka->>Processor: Provide event chunk (records)
    Processor->>Formatter: Invoke to_event (internal conversion)
    Formatter-->>Processor: Return processed JSON event
    Processor-->>Kafka: Return processing Result
sequenceDiagram
    participant HTTP as HTTP Ingest Handler
    participant PARSE as Parseable Module
    participant Stream as Stream Object
    participant Formatter as Event Format
    participant EventProc as Event Processor

    HTTP->>PARSE: get_stream(stream_name)
    PARSE-->>HTTP: Return Stream object
    HTTP->>Formatter: Convert JSON to event (to_event with stream)
    Formatter-->>HTTP: Return event
    HTTP->>EventProc: Process event (using stream parameter)
    EventProc-->>HTTP: Return processing Result

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

Oh, what a hop, what a cheer,
Code’s been refactored far and near.
I nibble changes with a joyful heart,
Streamlined logic—a fresh new start.
With binkies and bytes, I celebrate this art!
🐇💻


@de-sh marked this pull request as ready for review March 5, 2025 18:17
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/event/mod.rs (2)

53-89: Use of stream_name from Stream in process

Removing the stream_name field from Event and referencing stream.stream_name directly is consistent with the overall goal of externalizing stream-related data. The logic (committing schema, pushing to storage, updating stats, notifying livetail) remains coherent.

Consider factoring out the repeating schema-commit and stats-update logic to a helper if further expansions or variants of process are planned, reducing potential duplication.


91-103: Minimal validations in process_unchecked

process_unchecked skips schema commit and stats updates, which appears intentional. The usage of stream.stream_name here is likewise consistent.

You might consider centralizing the duplicated push call into a shared helper to avoid divergence over time.
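
A minimal sketch of the shared helper suggested in the two comments above, with placeholder types standing in for the real ones:

use anyhow::Result;

// Placeholder types; the real Stream and RecordBatch live elsewhere in the crate.
struct Stream { stream_name: String }
struct RecordBatch;

// Hypothetical shared helper: both paths funnel their common push step through
// one function so the two code paths cannot drift apart over time.
fn push_to_storage(stream: &Stream, rb: RecordBatch) -> Result<()> {
    let _ = (&stream.stream_name, rb); // ...write the batch under the stream's name (elided)
    Ok(())
}

fn process(stream: &Stream, rb: RecordBatch) -> Result<()> {
    // commit schema and update stats here (elided), then the shared push
    push_to_storage(stream, rb)
}

fn process_unchecked(stream: &Stream, rb: RecordBatch) -> Result<()> {
    // intentionally skips schema commit and stats updates, but shares the push
    push_to_storage(stream, rb)
}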

src/event/format/mod.rs (1)

118-179: New to_event method

This method encapsulates event creation by merging decoded data with partition and schema information, then returning a finalized Event. The insertion of p_timestamp as a reserved field is clearly enforced.

Consider making origin_format configurable or automatically inferred if multiple formats are supported; currently, it’s hardcoded to "json".
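
One possible shape for that suggestion is an associated constant, so each format declares its own origin tag and a default to_event stamps it automatically; a sketch under assumed types, not the actual code:

struct Event { origin_format: &'static str }

trait EventFormat {
    // Each implementor names itself instead of to_event hardcoding "json".
    const ORIGIN: &'static str;

    fn to_event(self) -> Event
    where
        Self: Sized,
    {
        Event { origin_format: Self::ORIGIN }
    }
}

struct Json;
impl EventFormat for Json {
    const ORIGIN: &'static str = "json";
}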

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8381e72 and 72c6ab9.

📒 Files selected for processing (7)
  • src/connectors/kafka/processor.rs (4 hunks)
  • src/event/format/json.rs (2 hunks)
  • src/event/format/mod.rs (4 hunks)
  • src/event/mod.rs (4 hunks)
  • src/handlers/http/ingest.rs (4 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
  • src/parseable/mod.rs (1 hunks)
🔇 Additional comments (14)
src/parseable/mod.rs (1)

31-31: Adding Stream to public exports is appropriate for the refactoring.

This change makes the Stream type accessible to other modules, which aligns with the PR objective of refactoring event processing to use the Stream object directly rather than passing individual stream properties.

src/handlers/http/modal/utils/ingest_utils.rs (1)

114-115: Simplified the event processing workflow.

The event processing has been refactored to use the stream object directly with the new to_event method instead of extracting and passing individual properties. This change:

  1. Reduces parameter complexity
  2. Makes the code more maintainable
  3. Creates a more consistent API across the codebase

This aligns with the PR objective of retiring to_recordbatch in favor of to_event.

src/connectors/kafka/processor.rs (3)

30-31: Properly updated imports to include EventFormat.

The imports have been appropriately updated to include both json and EventFormat, which are required for the new event processing workflow.


41-68: Method signature and implementation improved for clarity and efficiency.

The method has been significantly refactored from build_event_from_chunk to process_event_from_chunk with key improvements:

  1. Return type changed from ParseableEvent to Result<()>, reflecting that event processing now happens directly
  2. Eliminated several schema-related variables that were previously extracted from the stream
  3. Uses the Stream object directly for event creation and processing
  4. Simplified the event creation and processing workflow

This change creates a more streamlined event processing pipeline and reduces code complexity.


77-77: Updated method call to match the new implementation.

The method call in the process implementation has been correctly updated to use the new process_event_from_chunk method.

src/handlers/http/ingest.rs (2)

82-88: Streamlined the internal stream ingestion process.

The ingest_internal_stream function has been refactored to use the stream object directly rather than extracting and passing individual properties, making the code cleaner and more maintainable.


233-245: Simplified unchecked log pushing.

The push_logs_unchecked function now properly retrieves the stream object and uses it directly in process_unchecked, which is consistent with the refactoring pattern across the codebase.

src/event/mod.rs (1)

30-30: Imports refactored to include Stream

No issues found with this added import. The inclusion of Stream aligns with the shift toward passing stream references.

src/event/format/json.rs (3)

33-34: New imports for EventFormat and EventSchema

The added imports are aligned with the trait and struct usage below.


58-77: New partition extraction method get_partitions

Effective handling of both optional time_partition and custom_partitions. The fallback to the ingestion timestamp if time_partition is missing is clear.
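
A sketch of the behavior described, assuming JSON field lookups and RFC 3339 timestamps; the names and signature are approximations, not the actual implementation:

use std::collections::HashMap;

use anyhow::{anyhow, Result};
use chrono::{DateTime, Utc};
use serde_json::Value;

// Take the event timestamp from an optional time-partition field (falling back
// to ingestion time) and collect the configured custom-partition values.
fn get_partitions(
    json: &Value,
    time_partition: Option<&str>,
    custom_partitions: &[String],
) -> Result<(DateTime<Utc>, HashMap<String, String>)> {
    let timestamp = match time_partition {
        Some(field) => json[field]
            .as_str()
            .ok_or_else(|| anyhow!("missing time partition field: {field}"))?
            .parse::<DateTime<Utc>>()?,
        None => Utc::now(), // fall back to the ingestion timestamp
    };

    let mut partitions = HashMap::new();
    for key in custom_partitions {
        if let Some(value) = json[key].as_str() {
            partitions.insert(key.clone(), value.to_owned());
        }
    }

    Ok((timestamp, partitions))
}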


86-86: Switched to anyhow::Result in to_data

Using anyhow::Result unifies error handling with other utilities in this file and across the codebase.

src/event/format/mod.rs (3)

26-39: Introduced anyhow and additional chrono imports

No conflicts or issues with these expanded imports. They facilitate the new approach to unified error handling and timestamp management.


100-105: Partition logic definition in the trait

Adding fn get_partitions directly to the trait ensures each event format implements consistent partition-extraction logic. The signature and return type look appropriate.


111-113: Unified error types for to_data and decode

Switching to anyhow::Result further standardizes error handling. This simplifies function signatures and helps keep the code uniform.
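
For illustration only: because serde_json (and arrow) errors convert into anyhow::Error, a decode path can attach context instead of defining bespoke error enums. A sketch, not the actual code:

use anyhow::{Context, Result};

// One uniform error type for callers, with context attached at the failure site.
fn decode(payload: &[u8]) -> Result<serde_json::Value> {
    serde_json::from_slice(payload).context("failed to decode JSON payload")
}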

Comment on lines +326 to +818
// assert_eq!(
// rb.column_by_name("c_a")
// .unwrap()
// .as_any()
// .downcast_ref::<ListArray>()
// .unwrap(),
// &ListArray::from_iter_primitive::<Int64Type, _, _>(vec![
// None,
// None,
// Some(vec![Some(1i64)]),
// Some(vec![Some(1)])
// ])
// );

// assert_eq!(
// rb.column_by_name("c_b")
// .unwrap()
// .as_any()
// .downcast_ref::<ListArray>()
// .unwrap(),
// &ListArray::from_iter_primitive::<Int64Type, _, _>(vec![
// None,
// None,
// None,
// Some(vec![Some(2i64)])
// ])
// );
// }

// #[test]
// fn arr_obj_with_nested_type_v1() {
// let json = json!([
// {
// "a": 1,
// "b": "hello",
// },
// {
// "a": 1,
// "b": "hello",
// },
// {
// "a": 1,
// "b": "hello",
// "c": [{"a": 1}]
// },
// {
// "a": 1,
// "b": "hello",
// "c": [{"a": 1, "b": 2}]
// },
// ]);
// let flattened_json = convert_to_array(
// convert_array_to_object(
// json,
// None,
// None,
// None,
// SchemaVersion::V1,
// &crate::event::format::LogSource::default(),
// )
// .unwrap(),
// )
// .unwrap();

// let (rb, _) = into_event_batch(
// flattened_json,
// HashMap::default(),
// false,
// None,
// SchemaVersion::V1,
// )
// .unwrap();

// assert_eq!(rb.num_rows(), 4);
// assert_eq!(rb.num_columns(), 5);
// assert_eq!(
// rb.column_by_name("a").unwrap().as_float64_arr().unwrap(),
// &Float64Array::from(vec![Some(1.0), Some(1.0), Some(1.0), Some(1.0)])
// );
// assert_eq!(
// rb.column_by_name("b").unwrap().as_utf8_arr().unwrap(),
// &StringArray::from(vec![
// Some("hello"),
// Some("hello"),
// Some("hello"),
// Some("hello")
// ])
// );

// assert_eq!(
// rb.column_by_name("c_a").unwrap().as_float64_arr().unwrap(),
// &Float64Array::from(vec![None, None, Some(1.0), Some(1.0)])
// );

// assert_eq!(
// rb.column_by_name("c_b").unwrap().as_float64_arr().unwrap(),
// &Float64Array::from(vec![None, None, None, Some(2.0)])
// );
// }
// }

💡 Verification agent

❓ Verification inconclusive

Consider updating or removing commented-out tests.

A large portion of test code has been commented out, which appears to be related to the retired to_recordbatch functionality. Since the PR is retiring this functionality, the tests should either be:

  1. Updated to use the new to_event method and uncommented, or
  2. Removed entirely if they're no longer applicable

Commented-out code can lead to confusion and maintenance issues in the future.


🏁 Script executed:

#!/bin/bash
# Check if there are any other tests that would provide coverage for the new functionality

# Look for tests that might be using the new to_event function
rg -p "to_event.*test" --type rust

Length of output: 36


Clean Up Deprecated Test Code in src/handlers/http/ingest.rs

It appears that the commented-out tests (covering the retired to_recordbatch functionality) have not been replaced with tests for the new to_event method—as no active tests using to_event were found. Please either update these tests to use the new to_event method or remove them entirely to avoid confusion and prevent future maintenance issues.
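
Rough shape an updated test could take, per that suggestion. The stand-in types below exist only to make the sketch self-contained; the real Stream, Event, and to_event live in the parseable crate and look different.

use anyhow::Result;

struct Stream { stream_name: String }
struct Event { num_rows: usize }

fn to_event(json: &serde_json::Value, _stream: &Stream) -> Result<Event> {
    // the real method would decode, partition, and build a RecordBatch
    Ok(Event { num_rows: json.as_array().map_or(1, |a| a.len()) })
}

#[test]
fn basic_object_via_to_event() {
    let stream = Stream { stream_name: "test_stream".into() };
    let json = serde_json::json!({"a": 1, "b": "hello"});
    let event = to_event(&json, &stream).unwrap();
    assert_eq!(stream.stream_name, "test_stream");
    assert_eq!(event.num_rows, 1);
}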

@de-sh closed this Mar 19, 2025
@de-sh deleted the event-refactor branch March 19, 2025 16:10