
refactor: replace anyhow with custom StaticSchemaError #1208

Merged
merged 4 commits into parseablehq:main from static_schema on Mar 2, 2025

Conversation

@7h3cyb3rm0nk (Contributor) commented Feb 24, 2025

Fixes #1197

This PR introduces two important changes:

  1. Adds validation for field names to prevent invalid inputs
  2. Replaces generic anyhow errors with custom error variants for better error handling

Implementation Details

  • Added validate_field_names(), which checks for duplicate or empty field names.
  • Added a custom error enum StaticSchemaError (using thiserror) with specific error variants, sketched below:
    • MissingCustomPartition
    • MissingTimePartition
    • ReservedKey
    • EmptyFieldName
    • DuplicateField
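
A condensed sketch of what these changes look like in src/static_schema.rs (the error message strings and exact attribute wording here are illustrative; the authoritative version is in the diff):

use std::collections::HashSet;

#[derive(Debug, thiserror::Error)]
pub enum StaticSchemaError {
    #[error("custom partition field {0} does not exist in the schema")]
    MissingCustomPartition(String),
    #[error("time partition field {0} does not exist in the schema")]
    MissingTimePartition(String),
    #[error("field {0:?} is a reserved field")]
    ReservedKey(&'static str),
    #[error("field names should not be empty")]
    EmptyFieldName,
    #[error("duplicate field name: {0}")]
    DuplicateField(String),
}

/// Rejects empty and duplicate field names before schema conversion proceeds.
fn validate_field_names(
    field_name: &str,
    existing_fields: &mut HashSet<String>,
) -> Result<(), StaticSchemaError> {
    if field_name.is_empty() {
        return Err(StaticSchemaError::EmptyFieldName);
    }
    // HashSet::insert returns false when the name is already present.
    if !existing_fields.insert(field_name.to_string()) {
        return Err(StaticSchemaError::DuplicateField(field_name.to_string()));
    }
    Ok(())
}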

Summary by CodeRabbit

  • New Features

    • Enhanced validations now ensure that all schema field names are non-empty and unique, helping detect configuration issues early.
  • Bug Fixes

    • Improved error messaging provides clearer, more actionable feedback when required fields are missing or reserved values are used, streamlining the troubleshooting process for end users.

coderabbitai bot commented Feb 24, 2025

Walkthrough

This pull request enhances error handling in the src/static_schema.rs file. The conversion functions have been modified to return a specific StaticSchemaError instead of a generic error type. Error reporting for missing custom/time partition fields, reserved keys, empty field names, and duplicate field names has been improved. Additionally, a new validate_field_names function has been introduced to ensure field names are non-empty and unique, alongside the creation of the StaticSchemaError enum to represent various error conditions clearly.

Changes

File: src/static_schema.rs
Change summary:
  • Updated method signatures: functions now return StaticSchemaError instead of a generic error.
  • Enhanced error handling: specific error variants for missing custom/time partitions, reserved keys, empty or duplicate field names.
  • Added new function validate_field_names for checking field validity.
  • Introduced StaticSchemaError enum.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Converter as convert_static_schema_to_arrow_schema
    participant Validator as validate_field_names
    participant Parser as add_parseable_fields_to_static_schema

    Client->>Converter: Request schema conversion
    Converter->>Validator: Validate each field name
    Validator-->>Converter: Return OK/error (StaticSchemaError)
    Converter->>Parser: Process parseable fields
    Parser-->>Converter: Return schema/result (StaticSchemaError)
    Converter-->>Client: Return final schema or error

Suggested reviewers

  • nikhilsinhaparseable

Poem

I hop through code with joyous cheer,
Fields validated, clean and clear.
Error types now sing a better tune,
In every line, our bugs are strewn.
With StaticSchemaError, the fixes bloom! 🐇🎉


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e233af and ec9827e.

📒 Files selected for processing (1)
  • src/static_schema.rs (5 hunks)
🔇 Additional comments (11)
src/static_schema.rs (11)

25-28: Improved organization of imports

Nice refactoring of imports into a structured block with organized collections. This makes the code more maintainable as dependencies are easier to identify.


65-65: Enhanced error type specificity

Good change to return a specific StaticSchemaError instead of a generic error type. This improves error handling by making the contract clearer to callers.


88-90: Better error reporting for missing custom partition

This change provides a more descriptive error with a specific error variant, improving the developer experience when troubleshooting schema issues.


94-96: Good addition of field name tracking

The HashSet for tracking existing field names is an efficient way to prevent duplicates, with O(1) lookup time complexity.
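
For illustration only (not the PR's code), the duplicate check relies on the standard HashSet::insert contract, which reports whether the value was newly added:

use std::collections::HashSet;

fn main() {
    let mut seen: HashSet<String> = HashSet::new();
    assert!(seen.insert("field1".to_string()));  // first occurrence is accepted
    assert!(!seen.insert("field1".to_string())); // duplicate detected in O(1) on average
}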


98-98: Field validation at the right place

Well-placed validation that catches potential issues early in the process. The validation happens before any processing, ensuring invalid data doesn't propagate.


138-140: Improved error reporting for missing time partition

Similar to the custom partition error handling, this provides clearer error information through a specific error variant.


147-147: Consistent error type usage

Good consistency in updating all related function signatures to use the same specific error type.


155-155: Better reserved key error handling

Using a specific error variant for reserved keys makes the error more descriptive and easier to understand.


183-196: Well-implemented field name validation

The validation function is simple, clear, and effective with two primary checks:

  1. Non-empty field names
  2. No duplicate field names

The function signature is also clear, taking the field name to validate and the set of existing fields.


198-218: Well-structured error enum with descriptive messages

The thiserror crate is used effectively here to create a robust error type with detailed error messages. The error variants cover all the specific error cases in the schema validation process.


220-235: Good test coverage for new validation

The tests appropriately cover the two validation cases:

  1. Empty field names
  2. Duplicate field names

This ensures the validation logic works as expected.


github-actions bot commented Feb 24, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/static_schema.rs (2)

178-189: Consider adding field name format validation.

While the current validation handles empty and duplicate fields well, consider adding validation for field name format to prevent issues with special characters or spaces that might cause problems in downstream processing.

Example validation to consider:

 fn validate_field_names(field_name: &str, existing_fields: &mut HashSet<String>) -> Result<(), StaticSchemaError> {
     if field_name.is_empty() {
         return Err(StaticSchemaError::EmptyFieldName);
     }
 
+    // Validate field name format (alphanumeric and underscore only)
+    if !field_name.chars().all(|c| c.is_alphanumeric() || c == '_') {
+        return Err(StaticSchemaError::InvalidFieldNameFormat(field_name.to_string()));
+    }
+
     if !existing_fields.insert(field_name.to_string()) {
         return Err(StaticSchemaError::DuplicateField(field_name.to_string()));
     }
 
     Ok(())
 }

215-233: Add tests for partition and default time field validation.

While the current tests cover field name validation, consider adding tests for:

  • Custom partition validation
  • Time partition validation
  • Default time field validation

Example additional tests:

#[test]
fn test_missing_custom_partition() {
    let schema = StaticSchema {
        fields: vec![SchemaFields {
            name: "field1".to_string(),
            data_type: "string".to_string(),
        }],
    };
    let custom_partition = "custom_field".to_string();
    assert!(matches!(
        convert_static_schema_to_arrow_schema(schema, "", Some(&custom_partition)),
        Err(StaticSchemaError::MissingCustomPartition(_))
    ));
}

#[test]
fn test_missing_time_partition() {
    let schema = StaticSchema {
        fields: vec![SchemaFields {
            name: "field1".to_string(),
            data_type: "string".to_string(),
        }],
    };
    assert!(matches!(
        convert_static_schema_to_arrow_schema(schema, "time_field", None),
        Err(StaticSchemaError::MissingTimePartition(_))
    ));
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 245ec54 and a3fb21d.

📒 Files selected for processing (1)
  • src/static_schema.rs (5 hunks)
🔇 Additional comments (4)
src/static_schema.rs (4)

25-25: LGTM: Appropriate import for field name tracking.

The addition of HashSet to track unique field names is a good choice for efficient duplicate detection.


62-62: LGTM: Improved error handling specificity.

The change from generic AnyError to specific StaticSchemaError enhances error handling clarity and type safety.

Also applies to: 141-141


85-85: LGTM: Clear and specific error messages.

The error messages are descriptive and provide actionable information about missing partitions and reserved fields.

Also applies to: 134-134, 150-150


192-213: LGTM: Well-structured error enum with clear messages.

The StaticSchemaError enum effectively categorizes different error cases with descriptive messages. Good use of thiserror for deriving the Error trait.

coderabbitai bot previously approved these changes Feb 24, 2025
@de-sh (Contributor) left a comment

Looks good, but can you restrict import for HashSet to the test mod and run cargo fmt to ensure stray whitespaces are pruned?

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/static_schema.rs (1)

217-233: Good test coverage for new validation logic.

The tests appropriately verify both validation rules (empty field names and duplicate field names).

Consider adding tests for the other error variants as well to ensure complete coverage of the error handling logic.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3fb21d and bd08e05.

📒 Files selected for processing (1)
  • src/static_schema.rs (5 hunks)
🔇 Additional comments (8)
src/static_schema.rs (8)

25-28: Clean organization of imports.

The imports are well-organized with proper grouping of related types. Adding HashSet alongside HashMap makes sense for the field name validation feature.


65-65: Good transition to specific error type.

Changing the return type from a generic error to a specific StaticSchemaError improves error handling clarity and provides better context for callers.


88-88: Improved error reporting with specific error variant.

Replacing generic anyhow errors with the specific MissingCustomPartition variant provides better error context and aligns with the PR objective of enhancing error management.


93-97: Effective implementation of field name validation.

Using a HashSet to track existing field names is efficient (O(1) lookups) and correctly validates fields early in the processing flow.


136-136: Consistent error handling with specific variant.

The MissingTimePartition error variant provides clearer feedback compared to a generic error message.


143-143: Consistent error handling throughout the module.

The function signature changes maintain consistency with the rest of the module, and replacing anyhow with the DefaultTime error variant follows the same pattern of improved error specificity.

Also applies to: 152-152


180-191: Well-implemented field name validation.

The validation function is clean, focused, and efficient. It correctly checks for both empty field names and duplicates, returning appropriate error variants in each case.


194-215: Well-structured error enum with descriptive messages.

The StaticSchemaError enum is well-designed with specific variants for each error case. Using thiserror for deriving the Error trait is a good practice, and the error messages are clear and informative.

@7h3cyb3rm0nk requested a review from de-sh on February 26, 2025 05:25
"time partition field {time_partition} does not exist in the schema for the static schema logstream"
),
});
return Err(StaticSchemaError::MissingTimePartition(time_partition.to_string()));

Suggested change
return Err(StaticSchemaError::MissingTimePartition(time_partition.to_string()));
return Err(StaticSchemaError::MissingTimePartition(
time_partition.to_string(),
));

@@ -83,11 +85,15 @@ pub fn convert_static_schema_to_arrow_schema(

for partition in &custom_partition_list {
if !custom_partition_exists.contains_key(*partition) {
return Err(anyhow!("custom partition field {partition} does not exist in the schema for the static schema logstream"));
return Err(StaticSchemaError::MissingCustomPartition(partition.to_string()));

Suggested change
return Err(StaticSchemaError::MissingCustomPartition(partition.to_string()));
return Err(StaticSchemaError::MissingCustomPartition(
partition.to_string(),
));

}
add_parseable_fields_to_static_schema(parsed_schema)
}

fn add_parseable_fields_to_static_schema(
parsed_schema: ParsedSchema,
) -> Result<Arc<Schema>, AnyError> {
) -> Result<Arc<Schema>, StaticSchemaError> {


Suggested change

Comment on lines 180 to 181
fn validate_field_names(field_name: &str, existing_fields: &mut HashSet<String>) -> Result<(), StaticSchemaError> {


Suggested change
fn validate_field_names(field_name: &str, existing_fields: &mut HashSet<String>) -> Result<(), StaticSchemaError> {
fn validate_field_names(
field_name: &str,
existing_fields: &mut HashSet<String>,
) -> Result<(), StaticSchemaError> {

Comment on lines 192 to 196


#[derive(Debug, thiserror::Error)]
pub enum StaticSchemaError{


Suggested change
#[derive(Debug, thiserror::Error)]
pub enum StaticSchemaError{
#[derive(Debug, thiserror::Error)]
pub enum StaticSchemaError {

MissingTimePartition(String),

#[error("field {DEFAULT_TIMESTAMP_KEY:?} is a reserved field")]
DefaultTime,
@de-sh (Contributor) commented Feb 26, 2025

Suggested change
DefaultTime,
ReservedKey(&'static str),

This should make more sense no?

#[cfg(test)]
mod tests {
use super::*;
use std::collections::HashSet;

Suggested change
use std::collections::HashSet;

I am sorry, I didn't understand why there was an import previously

@7h3cyb3rm0nk force-pushed the static_schema branch 2 times, most recently from 2f0b686 to d61138f, on February 26, 2025 16:40
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (13)
src/handlers/http/health_check.rs (3)

67-67: Consider confirming flush_and_convert successes.
flush_and_convert likely spawns tasks that join later. If the method returns additional information or errors, consider handling them here or logging them.


69-75: Partial failure handling might be enhanced.
The loop logs failures but continues. If complete reliability is critical, consider a strategy to handle or roll back partial failures.


77-86: Add or clarify retries on sync failure.
Currently, a warning is logged on sync failure, but the process proceeds. Consider adding a retry mechanism or a clear recovery path if remote upload is essential.

src/utils/arrow/mod.rs (1)

199-205: Beware of potential race conditions in time-based assertions.
Comparing the array's timestamp with now.timestamp_millis() might intermittently fail on slower systems when the test’s execution slightly delays the array creation. Consider adding a small tolerance or verifying exact values if deterministic.

src/sync.rs (4)

89-93: Error handling on sync handler joins.
Here, any JoinError is simply logged. If you need stronger guarantees (e.g., restart tasks), consider a more robust fallback.


104-104: Remote sync handler reinitialization.
The code logs the error and restarts the thread. Ensure logs are sufficient to debug repeated crashes if needed.


191-202: Robustly logs success or failure per flush+conversion task.
The approach is consistent with the new concurrency architecture. Optionally, consider collecting and reporting aggregated results if partial failures matter.


219-219: Panic handling logs the error and cleans up.
This is appropriate for catastrophic failures. If there's a chance for partial recovery, consider a separate fallback if beneficial.

src/parseable/streams.rs (3)

203-219: Fallback considerations for timestamps
The function relies on file creation times and uses .expect(..), which may panic on platforms lacking reliable creation metadata.

Consider gracefully handling this scenario and possibly falling back to modified time or skipping such files entirely if creation metadata is unavailable.


456-472: Metrics collection for total arrow files and sizes
Unwrapping metadata can lead to potential panics if files are removed or inaccessible mid-operation.

Use safer error handling or skip unreadable files to avoid unexpected panics.


168-168: File path format string
Embedding the hostname directly after the dot might obscure readability.

Consider using a clearer delimiter (e.g., .{hostname} or -).

src/static_schema.rs (2)

183-196: New function validate_field_names
The implementation excludes empty names and duplicates. Consider clarifying if whitespace-only is recognized as empty.

 fn validate_field_names(
     field_name: &str,
     existing_fields: &mut HashSet<String>,
 ) -> Result<(), StaticSchemaError> {
+    // For example, you could trim whitespace before checking:
+    let trimmed = field_name.trim();
+    if trimmed.is_empty() {
+        return Err(StaticSchemaError::EmptyFieldName);
+    }
     ...
 }

220-236: Unit tests for validation logic
Good coverage for both empty and duplicate field cases.

Consider adding a test to verify behavior when the field name contains only whitespace or special characters.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd08e05 and 2f0b686.

📒 Files selected for processing (15)
  • src/connectors/kafka/processor.rs (1 hunks)
  • src/event/format/mod.rs (3 hunks)
  • src/handlers/http/health_check.rs (2 hunks)
  • src/handlers/http/ingest.rs (14 hunks)
  • src/handlers/http/modal/mod.rs (0 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (4 hunks)
  • src/lib.rs (1 hunks)
  • src/parseable/mod.rs (3 hunks)
  • src/parseable/streams.rs (15 hunks)
  • src/query/listing_table_builder.rs (1 hunks)
  • src/static_schema.rs (5 hunks)
  • src/storage/mod.rs (0 hunks)
  • src/storage/object_storage.rs (2 hunks)
  • src/sync.rs (8 hunks)
  • src/utils/arrow/mod.rs (3 hunks)
💤 Files with no reviewable changes (2)
  • src/handlers/http/modal/mod.rs
  • src/storage/mod.rs
🔇 Additional comments (47)
src/connectors/kafka/processor.rs (1)

69-75: Timestamp handling enhancement.

The addition of Utc::now() as a parameter to into_recordbatch ensures current timestamp information is correctly passed to the record batch creation process, improving time-sensitive operations handling across the codebase.

src/storage/object_storage.rs (3)

39-40: Updated import organization for logging levels.

The change properly updates import statements to reflect the change in logging level usage from debug to info for synchronization operations.


716-716: Code simplification for directory existence check.

The directory existence check has been simplified by directly calling exists() on the staging directory path, eliminating an unnecessary intermediate step.


722-722: Improved logging visibility for synchronization operations.

Changing from debug! to info! increases the visibility of stream synchronization operations in logs, which is appropriate for this important operational event.

src/query/listing_table_builder.rs (1)

34-37: Improved import organization.

The import statements have been consolidated into a cleaner format without changing functionality, improving code readability.

src/lib.rs (1)

62-70: Enhanced time interval constants with Duration type.

The changes improve type safety and clarity by:

  1. Converting time values from raw integers to Duration types
  2. Adding clear documentation for each constant's purpose
  3. Establishing a consistent relationship between constants where OBJECT_STORE_DATA_GRANULARITY is derived from LOCAL_SYNC_INTERVAL

This refactoring provides better semantics and reduces the risk of misinterpreting time values.
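
A hypothetical sketch of this constant style (the names follow the review above; the values and exact types are made up for illustration):

use std::time::Duration;

/// Illustrative value: how often in-memory data is flushed/synced locally.
pub const LOCAL_SYNC_INTERVAL: Duration = Duration::from_secs(60);
/// Granularity (in minutes) of data in object storage, derived from the sync interval.
pub const OBJECT_STORE_DATA_GRANULARITY: u32 = (LOCAL_SYNC_INTERVAL.as_secs() / 60) as u32;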

src/handlers/http/health_check.rs (2)

30-31: New concurrency and logging imports look appropriate.
These additions properly support asynchronous task management and structured logging without introducing any apparent conflicts.


64-65: JoinSet creation is a good concurrency approach.
Initializing a dedicated JoinSet makes the stream-flushing tasks more manageable. Just ensure that any lingering tasks won't outlive the shutdown, or if they do, that it is by design.

src/utils/arrow/mod.rs (3)

48-48: Chrono import is correct for date/time management.
No issues found. This is necessary to handle DateTime<Utc> references in the updated code.


136-137: Updated function signature improves clarity and testability.
Accepting an external DateTime<Utc> instead of generating it internally is a solid enhancement for reproducibility and unit testing.
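
As a minimal sketch of the pattern (function and names are illustrative, not the crate's actual API): the caller injects the timestamp, so tests can pass a fixed value instead of depending on Utc::now() inside the function.

use chrono::{DateTime, Utc};

/// Every row gets the externally supplied timestamp, making output reproducible in tests.
fn timestamp_column(p_timestamp: DateTime<Utc>, rows: usize) -> Vec<i64> {
    vec![p_timestamp.timestamp_millis(); rows]
}

fn main() {
    let fixed: DateTime<Utc> = Utc::now();
    assert_eq!(timestamp_column(fixed, 3), vec![fixed.timestamp_millis(); 3]);
}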


211-211: Zero-size timestamp array test is valid.
Verifying boundary conditions for an empty array is a good practice. No concerns here.

src/sync.rs (8)

23-23: JoinSet import aligns with concurrency refactoring.
No concerns; this library usage is consistent with the pattern seen elsewhere.


30-30: Centralizing intervals is beneficial.
Referencing LOCAL_SYNC_INTERVAL and STORAGE_UPLOAD_INTERVAL in one place promotes maintainability.


67-67: Accurate duration logging.
Using start_time.elapsed() clarifies how long a task took post-threshold warning. This enhances observability.


75-75: Switched to multi-thread runtime.
This is ideal if tasks are CPU-intensive or numerous. Verify that the number of worker threads is sufficient for your throughput needs.


79-80: Local and remote sync concurrency.
Spawning both local and remote sync tasks allows them to operate independently. Confirm that both tasks handle partial failures and final states consistently.


122-122: Regular object store sync interval.
Employing interval_at(next_minute(), STORAGE_UPLOAD_INTERVAL) is a predictable approach. Just verify that large uploads complete before the next tick, or consider a check for in-progress tasks.


172-173: Documentation update for local_sync.
The docstring accurately represents that the function flushes and converts data to parquet. Good for clarity.


186-187: Local sync scheduling with JoinSet.
Each flush-and-convert triggers concurrent tasks. Might be worth confirming the maximum concurrency impact (e.g., memory usage with too many tasks).

src/event/format/mod.rs (2)

111-111: Parameter addition enhances timestamp control

Adding the p_timestamp parameter to the into_recordbatch method allows for more consistent timestamp handling across the codebase, enabling better testing and reproducibility.


149-149: Updated implementation uses externally-provided timestamp

This change correctly passes the provided timestamp to the get_timestamp_array function instead of generating a new timestamp internally, ensuring timestamp consistency.

src/parseable/mod.rs (2)

461-464: Improved error handling for conflicting partitions

Consolidating the error handling for when both time partition and custom partition are set improves code clarity and user experience.


613-618: Added validation to prevent conflicting partition configurations

This validation ensures that time partition and custom partition cannot be set simultaneously, which is an important business rule check that prevents invalid configurations.

src/handlers/http/modal/utils/ingest_utils.rs (5)

99-99: Centralizing timestamp creation improves consistency

Creating a single timestamp variable at the beginning of the function ensures consistent timestamps throughout the request processing lifecycle.


125-125: Using stored timestamp for consistent behavior

This change ensures that the same timestamp is used when no time partition is specified, improving consistency in the data.


148-148: Passing timestamp through function chain

This properly passes the timestamp to the into_event_batch function, maintaining consistency with the changes to the function signature.


173-173: Updated function signature to support timestamp parameter

The into_event_batch function now accepts a timestamp parameter, which aligns with the changes in the EventFormat trait and ensures consistent timestamp handling.


180-180: Forwarding timestamp to into_recordbatch

This correctly passes the timestamp to the underlying into_recordbatch method, completing the chain of timestamp propagation.

src/handlers/http/ingest.rs (4)

82-82: Renaming and creating timestamp variable for clarity

Replacing parsed_timestamp with now and capturing the current UTC time creates a clearer understanding of the timestamp's purpose.


96-96: Passing timestamp to into_recordbatch method

This correctly passes the timestamp to the function, which is consistent with the updated method signature.


104-104: Using consistent timestamp for Event creation

Using the same timestamp for the parsed_timestamp field ensures that all timestamp references within the request lifecycle are consistent.


399-399: Updated all test cases to include timestamp parameter

All test cases have been properly updated to include the timestamp parameter, ensuring that tests continue to work correctly with the new function signature.

Also applies to: 433-433, 468-468, 500-500, 615-615, 667-667, 715-715, 757-757, 846-846

src/parseable/streams.rs (10)

27-27: Leverage of SystemTime for file timestamps
No major concerns. The standard library’s SystemTime along with UNIX_EPOCH is a suitable choice for time-based operations.


33-33: Use of chrono for date/time operations
The chrono imports here seem appropriate for extracting time components (hours, minutes, etc.).


45-45: Async concurrency with JoinSet
Importing tokio::task::JoinSet is a clean approach for managing async tasks.


54-54: Referencing storage objects
Bringing in to_bytes, Retention, and StreamType is consistent with the updated logic for reading/writing data.


442-442: Switching to SystemTime acquisition
Using SystemTime::now() aligns well with minute_from_system_time. This provides consistent, OS-level timestamps.


673-674: Convenience method for flush and convert
This succinctly flushes to disk then delegates to prepare_parquet. The flow looks clear and maintainable.


744-750: Asynchronous flush_and_convert for all streams
Spawning the conversion tasks per stream is efficient. However, if a single stream encounters an error, the handling of partial failures may warrant further consideration.

Would you like additional checks or a concurrency test suite to verify robust error handling when multiple streams flush in parallel?

Also applies to: 758-758


769-769: chrono usage in tests
These imports for test-time manipulations are standard and acceptable.


886-886: Test path consistency
Ensuring the test’s path format matches production is good for validation.


920-920: Test path format with custom partition
No issues here—this string interpolation correctly includes the partitioning scheme.

src/static_schema.rs (5)

25-28: Introducing HashSet for field checking
No concerns—this is an appropriate data structure for uniqueness checks.


65-65: Return type changed to StaticSchemaError
Shifting from a generic error to a custom error type clarifies error semantics.


95-98: Validating field names before processing
This check ensures correctness by preventing empty or duplicate field names from proceeding.


138-140: Custom error for missing time partition
The dedicated MissingTimePartition error variant adequately highlights the missing field scenario.


198-218: Definition of StaticSchemaError
Well-structured use of Enum with comprehensive variants for static schema issues.

Comment on lines 819 to 835
if custom_partition_list.len() > 1 {
return Err(CreateStreamError::Custom {
msg: "Maximum 3 custom partition keys are supported".to_string(),
msg: "Maximum 1 custom partition key is supported".to_string(),
status: StatusCode::BAD_REQUEST,

⚠️ Potential issue

❓ Verification inconclusive

Breaking change: reduced maximum custom partition keys

The maximum number of custom partition keys has been reduced from 3 to 1. This is a breaking change that could affect existing users who are using multiple custom partition keys.


🏁 Script executed:

#!/bin/bash
# Find examples of code using multiple custom partition keys
rg -A 5 -B 5 "custom_partition.*split|split.*custom_partition" --glob "*.rs" | grep -v "len() > 1"

Length of output: 5563


Attention: Breaking Change Impact Detected

The change in src/parseable/mod.rs now enforces that only one custom partition key is allowed. The validator returns an error if custom_partition.split(',') yields more than one key, reducing the maximum from 3 to 1. This adjustment is intentionally breaking and may impact any users who rely on using multiple custom partition keys. Please review all related usages (e.g., in ingest_utils.rs, flatten.rs, and object_storage.rs) to ensure that any external or internal callers are aware of this limitation.
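
A rough, standalone illustration of the rule as described (the real validator is in src/parseable/mod.rs and returns CreateStreamError; this sketch only mirrors the check):

fn validate_custom_partition(custom_partition: &str) -> Result<(), String> {
    let custom_partition_list: Vec<&str> = custom_partition.split(',').collect();
    if custom_partition_list.len() > 1 {
        // The reviewed diff lowers the previous limit of 3 keys to 1.
        return Err("Maximum 1 custom partition key is supported".to_string());
    }
    Ok(())
}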

coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, please check "Code review limits" under "Moderation" settings.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/parseable/mod.rs (1)

819-822: ⚠️ Potential issue

Breaking change: reduced maximum custom partition keys

The maximum number of custom partition keys has been reduced from 3 to 1. This is a breaking change that could affect existing users who are using multiple custom partition keys.

Ensure that any external or internal callers are aware of this limitation, and verify that this change is properly documented in the release notes.

🛑 Comments failed to post (1)
src/parseable/streams.rs (1)

744-759: 🛠️ Refactor suggestion

Improved asynchronous processing with JoinSet

The function now spawns asynchronous tasks for each stream's conversion process, leveraging Tokio's JoinSet for better task management and parallelism.

Consider adding error handling for the spawned tasks:

 pub fn flush_and_convert(
     &self,
     joinset: &mut JoinSet<Result<(), StagingError>>,
     shutdown_signal: bool,
 ) {
     let streams: Vec<Arc<Stream>> = self
         .read()
         .expect(LOCK_EXPECT)
         .values()
         .map(Arc::clone)
         .collect();
     for stream in streams {
-        joinset.spawn(async move { stream.flush_and_convert(shutdown_signal) });
+        joinset.spawn(async move {
+            match stream.flush_and_convert(shutdown_signal) {
+                Ok(_) => Ok(()),
+                Err(e) => {
+                    error!("Error flushing and converting stream {}: {:?}", stream.stream_name, e);
+                    Err(e)
+                }
+            }
+        });
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    /// Asynchronously flushes arrows and compacts into parquet data on all streams in staging,
    /// so that it is ready to be pushed onto objectstore.
    pub fn flush_and_convert(
        &self,
        joinset: &mut JoinSet<Result<(), StagingError>>,
        shutdown_signal: bool,
    ) {
        let streams: Vec<Arc<Stream>> = self
            .read()
            .expect(LOCK_EXPECT)
            .values()
            .map(Arc::clone)
            .collect();
        for stream in streams {
            joinset.spawn(async move {
                match stream.flush_and_convert(shutdown_signal) {
                    Ok(_) => Ok(()),
                    Err(e) => {
                        error!("Error flushing and converting stream {}: {:?}", stream.stream_name, e);
                        Err(e)
                    }
                }
            });
        }
    }

@7h3cyb3rm0nk requested a review from de-sh on February 26, 2025 16:57
@7h3cyb3rm0nk force-pushed the static_schema branch 3 times, most recently from 710a4e9 to 756be55, on February 26, 2025 17:53
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/static_schema.rs (1)

222-237: Good test coverage for validation

Tests for both empty field names and duplicate field names ensure the validation function works as expected.

Consider adding a positive test case that confirms valid field names pass validation.

 #[cfg(test)]
 mod tests {
     use super::*;
     #[test]
     fn empty_field_names() {
         let mut existing_field_names: HashSet<String> = HashSet::new();
         assert!(validate_field_names("", &mut existing_field_names).is_err());
     }

     #[test]
     fn duplicate_field_names() {
         let mut existing_field_names: HashSet<String> = HashSet::new();
         let _ = validate_field_names("test_field", &mut existing_field_names);
         assert!(validate_field_names("test_field", &mut existing_field_names).is_err());
     }
+
+    #[test]
+    fn valid_field_names() {
+        let mut existing_field_names: HashSet<String> = HashSet::new();
+        assert!(validate_field_names("field1", &mut existing_field_names).is_ok());
+        assert!(validate_field_names("field2", &mut existing_field_names).is_ok());
+    }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d61138f and 756be55.

📒 Files selected for processing (1)
  • src/static_schema.rs (5 hunks)
🔇 Additional comments (10)
src/static_schema.rs (10)

25-28: Good organization of imports

The structured import for standard library components is clean and well-organized.


65-65: Improved error type specificity

Changing the return type from a generic error to the specific StaticSchemaError improves type safety and error handling clarity.


88-90: Better error reporting for missing custom partition

Using a specific error variant instead of anyhow! provides clearer error information.


94-95: Good validation approach using HashSet

Using a HashSet for tracking field names is an efficient way to check for duplicates.


98-98: Proper validation integration

The validation function is well-integrated into the schema processing flow.


138-140: Enhanced error handling for missing time partition

The specific error variant provides clearer context about the validation failure.


147-147: Consistent error type usage

Updating the return type to match StaticSchemaError maintains consistency throughout the module.


155-157: Improved reserved key error handling

The error now provides explicit information about which key is reserved.


185-198: Well-implemented field name validation

The validation function is concise and effectively checks for both empty field names and duplicates.


200-220: Comprehensive error enum with helpful messages

The StaticSchemaError enum is well-structured with descriptive error messages for each variant. Using the thiserror crate simplifies error handling implementation.

@de-sh (Contributor) commented Feb 28, 2025

@7h3cyb3rm0nk please sign the CLA

@7h3cyb3rm0nk (Contributor, Author) commented

I have read the CLA Document and I hereby sign the CLA

nitisht added a commit to parseablehq/.github that referenced this pull request Feb 28, 2025
@7h3cyb3rm0nk requested a review from de-sh on March 1, 2025 16:40
@nitisht changed the title from "Add field name validation and replace anyhow with custom error variantss" to "refactor: replace anyhow with custom StaticSchemaError" on Mar 2, 2025
@nitisht merged commit 887a63f into parseablehq:main on Mar 2, 2025
14 checks passed
@7h3cyb3rm0nk deleted the static_schema branch on March 4, 2025 16:34
Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: 🛠️ Validation of field names
3 participants