
updates for enterprise #1247

Open: parmesant wants to merge 2 commits into main from enterprise-changes

Conversation

parmesant
Contributor

@parmesant parmesant commented Mar 18, 2025

Adds multiple updates for Parseable Enterprise

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Added flexible configuration options to specify custom connection endpoints for improved integration with indexing services.
    • Introduced enhanced metadata management to clearly separate indexing from ingestion data.
    • Enabled multipart upload support for efficient handling of large file transfers across storage platforms.
    • Added a new function to generate unique indexer IDs.
  • Bug Fixes

    • Improved error handling for endpoint validation based on operational mode.
  • Refactor

    • Streamlined file processing with improved date-based filtering.
    • Simplified endpoint access control and extended HTTP timeout settings for better performance.
    • Updated metadata handling logic to accommodate new indexer structures.
    • Enhanced the TimeRange struct to support cloning.


coderabbitai bot commented Mar 18, 2025

Walkthrough

This pull request introduces several new and enhanced functionalities. It updates the CLI options to include a configurable indexer endpoint and modifies URL generation based on operation mode. Changes in metadata handling include new structures and methods for both ingestors and indexers, plus exposed retrieval of indexer metadata via HTTP handlers. Additionally, the PR adds multipart upload methods to various storage implementations and adjusts utility functions and client configurations. Overall, the changes extend the system’s flexibility in endpoint configuration, metadata management, and file uploading.

Changes

| File(s) | Summary of Changes |
| --- | --- |
| src/cli.rs | Added new public field indexer_endpoint to the Options struct; updated get_url to accept a mode parameter and select the appropriate endpoint (a rough sketch follows this table). |
| src/enterprise/utils.rs | Updated fetch_parquet_file_paths to use filter_map with date validation; added the chrono crate for date manipulation. |
| src/handlers/http/cluster/mod.rs, src/handlers/http/modal/mod.rs | Added type alias IndexerMetadataArr, new async function get_indexer_info(), and new struct IndexerMetadata with associated methods; modified metadata storage logic. |
| src/handlers/http/middleware.rs, src/handlers/http/modal/ingest_server.rs | Simplified request handling logic in ModeFilterMiddleware; changed check_querier_state visibility to public. |
| src/parseable/mod.rs | Added new field indexer_metadata; renamed store_ingestor_metadata to store_metadata, which now takes a mode parameter to handle both ingestor and indexer metadata. |
| src/lib.rs, src/metrics/prom_utils.rs | Made HTTP_CLIENT public and increased its timeout from 10 to 30 seconds; updated URL retrieval calls to include a mode parameter. |
| src/storage/azure_blob.rs, src/storage/localfs.rs, src/storage/object_storage.rs, src/storage/s3.rs | Introduced new async method upload_multipart for multipart uploads; in S3, defined multipart logic with a new size constant and adjusted method signatures. |
| src/utils/mod.rs, src/utils/time.rs | Added new public function get_indexer_id for indexer ID generation; updated TimeRange to derive the Clone trait. |
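For orientation, here is a rough sketch of the mode-aware get_url selection described above. Options, Mode, and the two endpoint fields come from this PR; the URL scheme, parsing, and fallback details are illustrative assumptions rather than the exact implementation.

```rust
use url::Url;

pub enum Mode {
    Ingest,
    Index,
    Query,
    All,
}

pub struct Options {
    pub address: String,           // the server's own bind address
    pub ingestor_endpoint: String, // P_INGESTOR_ENDPOINT override
    pub indexer_endpoint: String,  // P_INDEXER_ENDPOINT override (new in this PR)
}

impl Options {
    pub fn get_url(&self, mode: Mode) -> Url {
        // Pick the endpoint that matches the requested mode.
        let endpoint = match mode {
            Mode::Ingest => &self.ingestor_endpoint,
            Mode::Index => &self.indexer_endpoint,
            _ => panic!("get_url only applies to ingest/index modes"),
        };

        // Fall back to the node's own address when no override is configured.
        let host = if endpoint.is_empty() { &self.address } else { endpoint };
        Url::parse(&format!("http://{host}")).expect("endpoint should form a valid URL")
    }
}
```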

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant O as Options
    alt Request with Mode::Ingest
        C->>O: get_url(Mode::Ingest)
        O-->>C: Returns ingestor_endpoint URL
    else Request with Mode::Index
        C->>O: get_url(Mode::Index)
        O-->>C: Returns indexer_endpoint URL
    end
sequenceDiagram
    participant App as Application
    participant S3 as S3 Storage
    participant FS as FileSystem
    participant AW as Async Writer
    App->>S3: upload_multipart(key, path)
    S3->>FS: Read file metadata
    alt File size < 5MB
        S3->>AW: Upload file in a single part
    else File size ≥ 5MB
        S3->>FS: Split file into chunks
        loop Each Chunk
            S3->>AW: Upload chunk
        end
    end
    AW-->>App: Confirm upload result
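The second diagram can be read alongside a simplified version of the size-based branch. This is only a sketch under the 5 MB threshold mentioned in the review; upload_single and upload_part are hypothetical placeholders for the real storage-client calls, and the actual S3 code additionally creates, completes, and aborts a multipart upload.

```rust
use tokio::{fs::File, io::AsyncReadExt};

const MIN_MULTIPART_UPLOAD_SIZE: u64 = 5 * 1024 * 1024; // 5 MB, as in the review

async fn upload_multipart_sketch(path: &std::path::Path) -> std::io::Result<()> {
    let mut file = File::open(path).await?;
    let len = file.metadata().await?.len();

    if len < MIN_MULTIPART_UPLOAD_SIZE {
        // Small file: read it whole and push it in a single part.
        let mut buf = Vec::with_capacity(len as usize);
        file.read_to_end(&mut buf).await?;
        upload_single(buf).await?;
    } else {
        // Large file: stream fixed-size chunks, sizing the last one from the bytes remaining.
        let mut remaining = len;
        while remaining > 0 {
            let chunk = remaining.min(MIN_MULTIPART_UPLOAD_SIZE) as usize;
            let mut buf = vec![0u8; chunk];
            file.read_exact(&mut buf).await?;
            upload_part(buf).await?;
            remaining -= chunk as u64;
        }
    }
    Ok(())
}

// Placeholders standing in for the real storage-client calls.
async fn upload_single(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
async fn upload_part(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
```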

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a rabbit hopping in code so fine,
Through endpoints and metadata I twine.
Multipart uploads and URLs merge,
In every function, a joyful surge.
Hop along with code's bright design 🐰💻!

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true in the settings.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2cbb9 and 29597f5.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (5)
src/storage/object_storage.rs (2)

93-97: Appropriate trait extension for multipart upload functionality.

The addition of this method to the ObjectStorage trait enables an important capability for handling large files, providing a more efficient way to upload data in chunks rather than all at once.


847-854: Good implementation of multipart upload with proper error handling.

The change from using upload_file to upload_multipart in the upload_files_from_staging method provides better handling for large files. The error handling appropriately logs failures and continues with the next file, preventing a single failure from breaking the entire upload process.

src/storage/s3.rs (3)

65-65: Good constant definition for minimum multipart size.

The 5MB minimum size follows AWS S3's requirements for multipart uploads and is appropriately defined as a constant.


514-570: Well-implemented multipart upload with size-based optimization.

The implementation correctly handles both small and large files differently, optimizing performance in both cases. For files smaller than the minimum multipart size, it uses a single upload operation, while for larger files it implements chunked multipart uploads.

The implementation addresses previous concerns about memory usage by:

  1. Only reading the entire file for small files (< 5MB)
  2. Using buffered reads for larger files instead of loading everything into memory
  3. Calculating exact buffer slices based on remaining bytes to read

Good error handling is also present, including abort logic if the multipart upload fails to complete.


586-592: Clean trait implementation delegating to private method.

This public method provides a clean interface that adheres to the trait definition while delegating implementation details to the private method.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@parmesant parmesant force-pushed the enterprise-changes branch from 8d3c740 to b98106b Compare March 18, 2025 10:07
@parmesant parmesant marked this pull request as ready for review March 18, 2025 10:09
@parmesant parmesant force-pushed the enterprise-changes branch from b98106b to 8699ce8 Compare March 18, 2025 10:09
@nitisht nitisht requested review from nikhilsinhaparseable and de-sh and removed request for nikhilsinhaparseable March 18, 2025 10:10
@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (4)
src/cli.rs (1)

407-485: Refactor Duplicate Logic in get_url

Although adding Mode::Index is correct, the blocks for Mode::Ingest and Mode::Index largely duplicate logic. Consider extracting the shared logic into a helper function for maintainability.

pub fn get_url(&self, mode: Mode) -> Url {
    let (endpoint, env_var) = match mode {
        Mode::Ingest => (&self.ingestor_endpoint, "P_INGESTOR_ENDPOINT"),
        Mode::Index => (&self.indexer_endpoint, "P_INDEXER_ENDPOINT"),
        _ => panic!("Invalid mode"),
    };

    if endpoint.is_empty() {
-       // Duplicate code for returning default self.address-based URL ...
+       return self.build_default_url(); // Example helper
    }

    // ...
}
src/parseable/mod.rs (2)

132-133: Consider consolidating metadata fields with a generic approach.
This new indexer_metadata field closely mirrors the pattern of ingestor_metadata. Using a generic or unified structure could reduce code duplication and make maintenance simpler.


268-329: Refactor to eliminate duplication in store_metadata.
The branches for Mode::Ingest and Mode::Index share repeated logic. Extracting common steps into a helper function could reduce duplication and simplify maintenance.

src/handlers/http/modal/mod.rs (1)

337-510: Duplicated struct logic between IndexerMetadata and IngestorMetadata.
The new IndexerMetadata is nearly identical to IngestorMetadata. Consider extracting common fields into a shared struct or trait to reduce duplication and simplify maintenance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160dec4 and 8699ce8.

📒 Files selected for processing (15)
  • src/cli.rs (2 hunks)
  • src/enterprise/utils.rs (2 hunks)
  • src/handlers/http/cluster/mod.rs (2 hunks)
  • src/handlers/http/middleware.rs (1 hunks)
  • src/handlers/http/modal/ingest_server.rs (3 hunks)
  • src/handlers/http/modal/mod.rs (3 hunks)
  • src/lib.rs (1 hunks)
  • src/metrics/prom_utils.rs (2 hunks)
  • src/parseable/mod.rs (4 hunks)
  • src/storage/azure_blob.rs (1 hunks)
  • src/storage/localfs.rs (1 hunks)
  • src/storage/object_storage.rs (1 hunks)
  • src/storage/s3.rs (4 hunks)
  • src/utils/mod.rs (1 hunks)
  • src/utils/time.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (5)
src/storage/azure_blob.rs (3)
src/storage/s3.rs (1)
  • upload_multipart (589:595)
src/storage/object_storage.rs (1)
  • upload_multipart (93:97)
src/storage/localfs.rs (1)
  • upload_multipart (107:113)
src/metrics/prom_utils.rs (1)
src/option.rs (1)
  • url (123:125)
src/enterprise/utils.rs (2)
src/storage/s3.rs (1)
  • s (181:181)
src/parseable/streams.rs (1)
  • parquet_files (239:251)
src/cli.rs (1)
src/option.rs (1)
  • mode (127:135)
src/utils/mod.rs (1)
src/handlers/http/modal/mod.rs (1)
  • get_indexer_id (452:454)
🔇 Additional comments (22)
src/utils/time.rs (1)

52-52: Good addition of Clone trait to TimeRange

Adding the Clone trait to the TimeRange struct allows for more flexible usage patterns when you need to duplicate the time range data, which is likely needed for the new indexer functionality being added.

src/utils/mod.rs (1)

45-50: Good pattern reuse for indexer ID generation

The implementation follows the same pattern as get_ingestor_id(), which provides consistency in how different component identifiers are generated in the system.
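For reference, a hedged sketch of what a get_indexer_id mirroring get_ingestor_id could look like; the real helper may derive the identifier differently (this version simply truncates a random UUID for illustration).

```rust
use uuid::Uuid;

/// Illustrative only: generate a short unique id for an indexer node.
pub fn get_indexer_id() -> String {
    Uuid::new_v4().simple().to_string()[..15].to_string()
}
```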

src/metrics/prom_utils.rs (2)

22-22: Added Mode enum import for URL determination

This import supports the URL generation changes based on the operation mode.


65-66: URL generation now mode-specific

The change specifies that the URL should be generated for the ingest mode, which aligns with the enterprise updates for differentiating between ingestor and indexer endpoints.

The comment "for now it is only for ingestor" is helpful for documenting the current state, but consider whether metrics should eventually be collected for indexers as well.

src/handlers/http/middleware.rs (1)

362-367:

❓ Verification inconclusive

Security concern: All endpoints are now accessible in Index mode.

The previous endpoint restriction logic for Index mode has been removed. This means all API endpoints are now accessible when running in Index mode, which could potentially expose sensitive operations that were previously restricted.

Please verify whether this change is intentional and aligned with the security model for enterprise deployments. If some endpoints should still be restricted in Index mode, consider adding appropriate filtering logic.


🏁 Script executed:

#!/bin/bash
# Check if there are any references to endpoint restrictions for Index mode elsewhere in the codebase
grep -r "Mode::Index" --include="*.rs" .

Length of output: 970


Attention: Verify Endpoint Filtering for Index Mode

It appears that the logic which previously restricted certain endpoints in Index mode has been removed from the middleware, leaving all endpoints accessible when running in this mode. The grep output shows that while Mode::Index is still referenced in several parts of the codebase (e.g., in src/catalog/mod.rs, src/main.rs, and others), there’s no evidence of any updated filtering logic in the middleware (specifically around lines 362–367 in src/handlers/http/middleware.rs).

  • Affected File: src/handlers/http/middleware.rs (lines 362–367)
  • Observation: The current implementation wraps the service call without any conditional checks or filtering based on the operating mode.
  • Impact: When running in Index mode, all API endpoints become accessible, potentially exposing sensitive operations that were previously restricted.

Please verify if this removal is intentional for the enterprise deployment security model. If endpoint restrictions should still apply in Index mode, consider reintroducing appropriate filtering logic.

src/storage/object_storage.rs (1)

93-97: New multipart upload interface added to ObjectStorage trait.

The upload_multipart method adds support for multipart file uploads across different storage backends. This is a good addition for handling large files more efficiently. However, ensure that all implementations (S3, Azure, LocalFS) properly implement this method, as the LocalFS implementation currently uses unimplemented!().

src/lib.rs (1)

75-79:

✅ Verification successful

HTTP client visibility and timeout changes.

Two changes have been made to the HTTP_CLIENT:

  1. The client is now publicly accessible (pub static) rather than private
  2. The request timeout has been increased from 10 to 30 seconds

The increased timeout may help with handling larger requests or slower network conditions, which is beneficial for enterprise deployments. However, making the client public exposes implementation details that might be better encapsulated.

Verify whether making the HTTP_CLIENT public is necessary and used appropriately throughout the codebase:


🏁 Script executed:

#!/bin/bash
# Find all places where HTTP_CLIENT is now used outside this module
grep -r "HTTP_CLIENT" --include="*.rs" . | grep -v "src/lib.rs"

Length of output: 1517


HTTP Client Visibility: Confirmed Usage Across the Codebase

The verification shows that HTTP_CLIENT is used extensively in modules outside src/lib.rs (e.g., in src/analytics.rs, src/audit.rs, src/handlers/http/cluster/*, and others). Given this widespread usage, making the HTTP client public appears to be a deliberate design decision. Additionally, increasing the request timeout from 10 to 30 seconds aligns well with handling larger requests or slower network conditions in enterprise deployments.

  • Public Exposure Justified: Multiple modules rely on HTTP_CLIENT, so its public visibility is necessary.
  • Timeout Increase Acceptable: The raised timeout supports more resilient network conditions.

Overall, the changes are appropriate, and no further adjustments are required.
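A minimal sketch of the kind of shared client this refers to, assuming a once_cell Lazy around a reqwest client; builder options other than the 30-second timeout are assumptions, not necessarily what the PR configures.

```rust
use std::time::Duration;

use once_cell::sync::Lazy;
use reqwest::Client;

// Public so other modules (cluster handlers, analytics, audit, ...) can reuse one pool.
pub static HTTP_CLIENT: Lazy<Client> = Lazy::new(|| {
    Client::builder()
        .timeout(Duration::from_secs(30)) // raised from 10s for larger enterprise payloads
        .build()
        .expect("reqwest client construction should not fail")
});
```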

src/handlers/http/cluster/mod.rs (2)

54-54: Consistent Import Usage

Bringing IndexerMetadata and IngestorMetadata into scope here is straightforward and consistent with the existing structure.


60-61: Maintain Naming Consistency

Defining IndexerMetadataArr in parallel with IngestorMetadataArr ensures consistent naming conventions for collection types. No issues here.

src/cli.rs (1)

298-305: Indexer Endpoint Added

Introducing the indexer_endpoint field aligns with the existing style and expands configuration for indexing services. It's good that a default value is provided, though consider validating non-empty values if indexing is mandatory.
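A hedged sketch of how the new field might be declared with clap's derive API (env feature enabled); the flag name, help text, and empty default are assumptions based on the review discussion of P_INDEXER_ENDPOINT.

```rust
use clap::Parser;

#[derive(Parser)]
pub struct Options {
    /// Endpoint (host:port) other nodes should use to reach this indexer.
    #[arg(
        long = "indexer-endpoint",
        env = "P_INDEXER_ENDPOINT",
        default_value = ""
    )]
    pub indexer_endpoint: String,
    // ... remaining CLI options elided
}
```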

src/handlers/http/modal/ingest_server.rs (3)

31-31: Importing Mode Enum

Using Mode here is a natural extension if the ingest server needs mode-specific logic. Nothing concerning spotted.


112-112: Storing Metadata with Explicit Mode

Calling store_metadata(Mode::Ingest) is consistent with the broader shift towards mode-based metadata handling. Looks fine.


255-255: Confirm Public Access to check_querier_state

Changing visibility to pub makes this function callable from other modules. Verify that external callers cannot misuse this to bypass any internal workflows, especially around system readiness or security checks.

src/enterprise/utils.rs (1)

3-3: Chrono Import for Date Handling

Bringing in chrono::{TimeZone, Utc} is appropriate for robust date/time operations. No immediate issues here.

src/parseable/mod.rs (3)

50-50: No issues found for the new imports.
The import statement for IndexerMetadata seems correct and consistent with the existing structure.


149-152: Check whether Mode::All also needs indexer metadata.
The logic only loads metadata for Mode::Index. Please verify if running in Mode::All should also initialize indexer_metadata.


158-158: No concerns with storing the new field.
Passing indexer_metadata to the struct constructor looks straightforward and is consistent with the existing pattern for ingestor metadata.

src/storage/s3.rs (3)

46-46: Import statement needed for async file I/O.
Using tokio’s OpenOptions and AsyncReadExt is appropriate for streaming file reads.


66-66: Defines the minimum part size for multipart uploads.
This constant (5MB) aligns with AWS S3’s minimum valid chunk size for multipart operations.


589-595: Public wrapper for _upload_multipart.
This method succinctly exposes the multipart upload functionality. No concerns identified here.

src/handlers/http/modal/mod.rs (2)

37-37: Importing Mode for indexing logic.
Referencing option::Mode is consistent with the rest of the file’s approach to handling server modes.


40-40: New utility imports for ID retrieval.
Using get_indexer_id and get_ingestor_id will help differentiate between these roles. No issues here.

Comment on lines +427 to +433
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}

⚠️ Potential issue

Implementation needed for upload_multipart

The method is currently stubbed with unimplemented!(), which is appropriate for a work-in-progress PR. However, this should be implemented before the feature is considered complete since other storage backends like S3 have working implementations.

Consider implementing the method using the commented-out code around line 382 as a starting point, or at minimum add a TODO comment with a timeline for implementation:

async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
-    unimplemented!()
+    // TODO(enterprise): Implement multipart uploads for Azure Blob storage in ticket #XXXX
+    Err(ObjectStorageError::UnhandledError(Box::new(
+        std::io::Error::new(
+            std::io::ErrorKind::Unsupported,
+            "Multipart upload not implemented for Blob Storage yet",
+        ),
+    )))
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    // TODO(enterprise): Implement multipart uploads for Azure Blob storage in ticket #XXXX
    Err(ObjectStorageError::UnhandledError(Box::new(
        std::io::Error::new(
            std::io::ErrorKind::Unsupported,
            "Multipart upload not implemented for Blob Storage yet",
        ),
    )))
}

Comment on lines +107 to +113
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}

⚠️ Potential issue

Implementation needed for upload_multipart method.

The upload_multipart method is currently unimplemented. This could cause runtime panics if this method is called in production. Consider implementing this method with actual functionality similar to the other storage backends, or at minimum, returning a proper error instead of using unimplemented!().

async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
-    unimplemented!()
+    Err(ObjectStorageError::UnhandledError(Box::new(
+        std::io::Error::new(
+            std::io::ErrorKind::Unsupported,
+            "Multipart upload not implemented for LocalFS yet",
+        ),
+    )))
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    Err(ObjectStorageError::UnhandledError(Box::new(
        std::io::Error::new(
            std::io::ErrorKind::Unsupported,
            "Multipart upload not implemented for LocalFS yet",
        ),
    )))
}

Comment on lines +639 to +655
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();

    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        // this unwrap will most definateley shoot me in the foot later
        .map(|x| serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default())
        .collect_vec();

    Ok(arr)
}

🛠️ Refactor suggestion

Improve Error Handling in Metadata Deserialization

Using unwrap_or_default() can silently swallow parsing errors and may hinder debugging if the metadata is malformed. Prefer propagating errors or logging them for better visibility.

Consider changing to:

 .map(|x| {
-    serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default()
+    serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_else(|e| {
+        error!("Failed to parse indexer metadata: {:?}", e);
+        IndexerMetadata::default()
+    })
 })

to detect and log failures, or fully propagate the error if data integrity is critical.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();
    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        // this unwrap will most definateley shoot me in the foot later
        .map(|x| serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default())
        .collect_vec();
    Ok(arr)
}
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();
    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        .map(|x| {
            serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_else(|e| {
                error!("Failed to parse indexer metadata: {:?}", e);
                IndexerMetadata::default()
            })
        })
        .collect_vec();
    Ok(arr)
}

Comment on lines +123 to 153
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();

    let date = date.as_slice()[1..4].iter().map(|s| s.to_string());

    let date = RelativePathBuf::from_iter(date);

    parquet_files.entry(date).or_default().push(file);
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();

    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());

        let date = RelativePathBuf::from_iter(date);

        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});

🛠️ Refactor suggestion

Validate Path String Sub-slices

Extracting substrings like &date[1][5..9] risks panics if date[1] is shorter than expected. Consider verifying path segment lengths to guard against malformed or unexpected file paths.

- let year = &date[1][5..9];
+ if date[1].len() < 9 {
+     warn!("Unexpected file path format for: {:?}", date);
+     return None;
+ }
+ let year = &date[1][5..9];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();
    let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
    let date = RelativePathBuf::from_iter(date);
    parquet_files.entry(date).or_default().push(file);
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();
    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
        let date = RelativePathBuf::from_iter(date);
        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();
    if date[1].len() < 9 {
        warn!("Unexpected file path format for: {:?}", date);
        return None;
    }
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();
    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
        let date = RelativePathBuf::from_iter(date);
        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/storage/object_storage.rs (2)

93-97: Add documentation for the new trait method.
Although this new trait method upload_multipart is a valuable addition, it would be beneficial to include a doc comment explaining usage details and any assumptions made about file sizes or concurrency constraints.


847-851: Ensure robust error handling for partial uploads.
While it is good that the loop continues upon upload errors, consider whether you want to provide any retry logic for partial file uploads. In high-latency or failure scenarios, having granular retries for each chunk could ensure more resilient uploads.

src/storage/s3.rs (2)

66-66: Consider making the threshold configurable.
Defining a 5 MB threshold for MIN_MULTIPART_UPLOAD_SIZE is reasonable, but it might be even more robust to allow a user or environment variable to configure this value for edge cases or variable bandwidth constraints.


514-565: Check concurrency and finalization logic in _upload_multipart.
This implementation executes part-uploads in parallel with tokio::spawn, which can improve speed but may also raise memory usage for very large files. Examine whether a bounded concurrency strategy or streaming approach is more suitable. Additionally, you may want to handle failures in async_writer.complete() by aborting the multipart upload to avoid leaving stale partials.

Do you want a verification script to scan for any usage of abort_multipart calls or relevant error handling in other files that might be triggered upon failure?
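As a sketch of the bounded-concurrency idea raised here (not the PR's actual code), part uploads could be capped with a semaphore so only a fixed number are in flight at once; upload_part is a placeholder for the real storage call.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task::JoinSet;

async fn upload_parts_bounded(parts: Vec<Vec<u8>>, max_in_flight: usize) -> std::io::Result<()> {
    let permits = Arc::new(Semaphore::new(max_in_flight));
    let mut tasks = JoinSet::new();

    for part in parts {
        let permits = Arc::clone(&permits);
        tasks.spawn(async move {
            // Holding the permit for the duration of the upload bounds memory
            // and connection usage.
            let _permit = permits.acquire_owned().await.expect("semaphore not closed");
            upload_part(part).await
        });
    }

    // Surface the first failed part instead of silently dropping it.
    while let Some(res) = tasks.join_next().await {
        res.expect("upload task panicked")?;
    }
    Ok(())
}

// Placeholder for the real part-upload call.
async fn upload_part(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
```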

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8699ce8 and cd4f854.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/storage/s3.rs (3)
src/storage/mod.rs (1)
  • to_object_store_path (266:268)
src/storage/object_storage.rs (1)
  • upload_multipart (93:97)
src/storage/azure_blob.rs (1)
  • upload_multipart (427:433)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/storage/s3.rs (2)

46-46: Necessary import for asynchronous file I/O.
This import of OpenOptions and AsyncReadExt is required to support non-blocking reads, which is essential for multipart uploads.


581-587: Minimal pass-through implementation.
Forwarding upload_multipart to the private _upload_multipart method looks consistent. No further concerns.

@parmesant parmesant force-pushed the enterprise-changes branch from cd4f854 to 7e2cbb9 Compare March 18, 2025 12:42
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/storage/object_storage.rs (2)

93-97: Add a short docstring to the new method for clarity.
It would be helpful to document the behavior, use cases, and expected error scenarios of upload_multipart, so future contributors can easily understand its role and how it integrates with the rest of the trait methods.


847-851: Consider parallel uploading and improved error handling.
Currently, the upload is performed sequentially within this loop, and if an error persists, the code logs it and proceeds. For large-scale deployments, consider performing parallel uploads for multiple files to reduce overall time. Also, evaluate whether a retry or rollback mechanism is necessary to handle partial failures in multipart uploads.

src/storage/s3.rs (1)

65-65: Make the minimum multipart upload size configurable if future needs demand it.
The current 5 MB threshold is fairly standard. However, if you anticipate different file sizes or have memory constraints, consider making it user-configurable for flexibility.
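If the threshold were made configurable as suggested, one lightweight approach (the variable name is an assumption) would be an environment override with the current 5 MB value as the fallback:

```rust
// Illustrative sketch only: read the multipart threshold from the environment,
// falling back to the hard-coded 5 MB used today.
fn min_multipart_upload_size() -> u64 {
    std::env::var("P_MIN_MULTIPART_SIZE")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(5 * 1024 * 1024)
}
```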

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd4f854 and 7e2cbb9.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
🔇 Additional comments (2)
src/storage/s3.rs (2)

46-46: Import statement looks good.
The addition of OpenOptions and AsyncReadExt is appropriate for the asynchronous file reads.


578-584: Straightforward delegation to _upload_multipart.
This public method is cleanly forwarding the call to the private _upload_multipart function. The implementation is consistent with the trait requirement.

@parmesant parmesant force-pushed the enterprise-changes branch from 7e2cbb9 to 29597f5 Compare March 18, 2025 13:06