
updates for enterprise #1247

Open: parmesant wants to merge 2 commits into main from enterprise-changes

Conversation

parmesant
Contributor

@parmesant parmesant commented Mar 18, 2025

Adds multiple updates for Parseable Enterprise

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Added flexible configuration options to specify custom connection endpoints for improved integration with indexing services.
    • Introduced enhanced metadata management to clearly separate indexing from ingestion data.
    • Enabled multipart upload support for efficient handling of large file transfers across storage platforms.
    • Added a new function to generate unique indexer IDs.
  • Bug Fixes

    • Improved error handling for endpoint validation based on operational mode.
  • Refactor

    • Streamlined file processing with improved date-based filtering.
    • Simplified endpoint access control and extended HTTP timeout settings for better performance.
    • Updated metadata handling logic to accommodate new indexer structures.
    • Enhanced the TimeRange struct to support cloning.


coderabbitai bot commented Mar 18, 2025

Walkthrough

This pull request introduces several new and enhanced functionalities. It updates the CLI options to include a configurable indexer endpoint and modifies URL generation based on operation mode. Changes in metadata handling include new structures and methods for both ingestors and indexers, plus exposed retrieval of indexer metadata via HTTP handlers. Additionally, the PR adds multipart upload methods to various storage implementations and adjusts utility functions and client configurations. Overall, the changes extend the system’s flexibility in endpoint configuration, metadata management, and file uploading.

Changes

| File(s) | Summary of Changes |
| --- | --- |
| src/cli.rs | Added new public field indexer_endpoint to the Options struct; updated get_url to accept a mode parameter and select the appropriate endpoint (a rough sketch follows this table). |
| src/enterprise/utils.rs | Updated fetch_parquet_file_paths to use filter_map with date validation; added the chrono crate for date manipulation. |
| src/handlers/http/cluster/mod.rs, src/handlers/http/modal/mod.rs | Added type alias IndexerMetadataArr, new async function get_indexer_info(), and new struct IndexerMetadata with associated methods; modified metadata storage logic. |
| src/handlers/http/middleware.rs, src/handlers/http/modal/ingest_server.rs | Simplified request handling logic in ModeFilterMiddleware; changed check_querier_state visibility to public. |
| src/parseable/mod.rs | Added new field indexer_metadata; renamed store_ingestor_metadata to store_metadata, which now takes a mode parameter to handle both ingestor and indexer metadata. |
| src/lib.rs, src/metrics/prom_utils.rs | Made HTTP_CLIENT public and increased its timeout from 10 to 30 seconds; updated URL retrieval calls to include a mode parameter. |
| src/storage/azure_blob.rs, src/storage/localfs.rs, src/storage/object_storage.rs, src/storage/s3.rs | Introduced new async method upload_multipart for multipart uploads; in S3, defined multipart logic with a new size constant and adjusted method signatures. |
| src/utils/mod.rs, src/utils/time.rs | Added new public function get_indexer_id for indexer ID generation; updated TimeRange to derive the Clone trait. |
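For orientation, here is a rough sketch of the mode-aware get_url selection described above. Options, Mode, and the two endpoint fields come from this PR; the URL scheme, parsing, and fallback details are illustrative assumptions rather than the exact implementation.

```rust
use url::Url;

pub enum Mode {
    Ingest,
    Index,
    Query,
    All,
}

pub struct Options {
    pub address: String,           // the server's own bind address
    pub ingestor_endpoint: String, // P_INGESTOR_ENDPOINT override
    pub indexer_endpoint: String,  // P_INDEXER_ENDPOINT override (new in this PR)
}

impl Options {
    pub fn get_url(&self, mode: Mode) -> Url {
        // Pick the endpoint that matches the requested mode.
        let endpoint = match mode {
            Mode::Ingest => &self.ingestor_endpoint,
            Mode::Index => &self.indexer_endpoint,
            _ => panic!("get_url only applies to ingest/index modes"),
        };

        // Fall back to the node's own address when no override is configured.
        let host = if endpoint.is_empty() { &self.address } else { endpoint };
        Url::parse(&format!("http://{host}")).expect("endpoint should form a valid URL")
    }
}
```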

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant O as Options
    alt Request with Mode::Ingest
        C->>O: get_url(Mode::Ingest)
        O-->>C: Returns ingestor_endpoint URL
    else Request with Mode::Index
        C->>O: get_url(Mode::Index)
        O-->>C: Returns indexer_endpoint URL
    end
sequenceDiagram
    participant App as Application
    participant S3 as S3 Storage
    participant FS as FileSystem
    participant AW as Async Writer
    App->>S3: upload_multipart(key, path)
    S3->>FS: Read file metadata
    alt File size < 5MB
        S3->>AW: Upload file in a single part
    else File size ≥ 5MB
        S3->>FS: Split file into chunks
        loop Each Chunk
            S3->>AW: Upload chunk
        end
    end
    AW-->>App: Confirm upload result
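The second diagram can be read alongside a simplified version of the size-based branch. This is only a sketch under the 5 MB threshold mentioned in the review; upload_single and upload_part are hypothetical placeholders for the real storage-client calls, and the actual S3 code additionally creates, completes, and aborts a multipart upload.

```rust
use tokio::{fs::File, io::AsyncReadExt};

const MIN_MULTIPART_UPLOAD_SIZE: u64 = 5 * 1024 * 1024; // 5 MB, as in the review

async fn upload_multipart_sketch(path: &std::path::Path) -> std::io::Result<()> {
    let mut file = File::open(path).await?;
    let len = file.metadata().await?.len();

    if len < MIN_MULTIPART_UPLOAD_SIZE {
        // Small file: read it whole and push it in a single part.
        let mut buf = Vec::with_capacity(len as usize);
        file.read_to_end(&mut buf).await?;
        upload_single(buf).await?;
    } else {
        // Large file: stream fixed-size chunks, sizing the last one from the bytes remaining.
        let mut remaining = len;
        while remaining > 0 {
            let chunk = remaining.min(MIN_MULTIPART_UPLOAD_SIZE) as usize;
            let mut buf = vec![0u8; chunk];
            file.read_exact(&mut buf).await?;
            upload_part(buf).await?;
            remaining -= chunk as u64;
        }
    }
    Ok(())
}

// Placeholders standing in for the real storage-client calls.
async fn upload_single(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
async fn upload_part(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
```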

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a rabbit hopping in code so fine,
Through endpoints and metadata I twine.
Multipart uploads and URLs merge,
In every function, a joyful surge.
Hop along with code's bright design 🐰💻!

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true in the settings.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2cbb9 and 29597f5.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (5)
src/storage/object_storage.rs (2)

93-97: Appropriate trait extension for multipart upload functionality.

The addition of this method to the ObjectStorage trait enables an important capability for handling large files, providing a more efficient way to upload data in chunks rather than all at once.


847-854: Good implementation of multipart upload with proper error handling.

The change from using upload_file to upload_multipart in the upload_files_from_staging method provides better handling for large files. The error handling appropriately logs failures and continues with the next file, preventing a single failure from breaking the entire upload process.

src/storage/s3.rs (3)

65-65: Good constant definition for minimum multipart size.

The 5MB minimum size follows AWS S3's requirements for multipart uploads and is appropriately defined as a constant.


514-570: Well-implemented multipart upload with size-based optimization.

The implementation correctly handles both small and large files differently, optimizing performance in both cases. For files smaller than the minimum multipart size, it uses a single upload operation, while for larger files it implements chunked multipart uploads.

The implementation addresses previous concerns about memory usage by:

  1. Only reading the entire file for small files (< 5MB)
  2. Using buffered reads for larger files instead of loading everything into memory
  3. Calculating exact buffer slices based on remaining bytes to read

Good error handling is also present, including abort logic if the multipart upload fails to complete.


586-592: Clean trait implementation delegating to private method.

This public method provides a clean interface that adheres to the trait definition while delegating implementation details to the private method.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@parmesant parmesant force-pushed the enterprise-changes branch from 8d3c740 to b98106b Compare March 18, 2025 10:07
@parmesant parmesant marked this pull request as ready for review March 18, 2025 10:09
@parmesant parmesant force-pushed the enterprise-changes branch from b98106b to 8699ce8 Compare March 18, 2025 10:09
@nitisht nitisht requested review from nikhilsinhaparseable and de-sh and removed request for nikhilsinhaparseable March 18, 2025 10:10
@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (4)
src/cli.rs (1)

407-485: Refactor Duplicate Logic in get_url

Although adding Mode::Index is correct, the blocks for Mode::Ingest and Mode::Index largely duplicate logic. Consider extracting the shared logic into a helper function for maintainability.

pub fn get_url(&self, mode: Mode) -> Url {
    let (endpoint, env_var) = match mode {
        Mode::Ingest => (&self.ingestor_endpoint, "P_INGESTOR_ENDPOINT"),
        Mode::Index => (&self.indexer_endpoint, "P_INDEXER_ENDPOINT"),
        _ => panic!("Invalid mode"),
    };

    if endpoint.is_empty() {
-       // Duplicate code for returning default self.address-based URL ...
+       return self.build_default_url(); // Example helper
    }

    // ...
}
src/parseable/mod.rs (2)

132-133: Consider consolidating metadata fields with a generic approach.
This new indexer_metadata field closely mirrors the pattern of ingestor_metadata. Using a generic or unified structure could reduce code duplication and make maintenance simpler.


268-329: Refactor to eliminate duplication in store_metadata.
The branches for Mode::Ingest and Mode::Index share repeated logic. Extracting common steps into a helper function could reduce duplication and simplify maintenance.

src/handlers/http/modal/mod.rs (1)

337-510: Duplicated struct logic between IndexerMetadata and IngestorMetadata.
The new IndexerMetadata is nearly identical to IngestorMetadata. Consider extracting common fields into a shared struct or trait to reduce duplication and simplify maintenance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160dec4 and 8699ce8.

📒 Files selected for processing (15)
  • src/cli.rs (2 hunks)
  • src/enterprise/utils.rs (2 hunks)
  • src/handlers/http/cluster/mod.rs (2 hunks)
  • src/handlers/http/middleware.rs (1 hunks)
  • src/handlers/http/modal/ingest_server.rs (3 hunks)
  • src/handlers/http/modal/mod.rs (3 hunks)
  • src/lib.rs (1 hunks)
  • src/metrics/prom_utils.rs (2 hunks)
  • src/parseable/mod.rs (4 hunks)
  • src/storage/azure_blob.rs (1 hunks)
  • src/storage/localfs.rs (1 hunks)
  • src/storage/object_storage.rs (1 hunks)
  • src/storage/s3.rs (4 hunks)
  • src/utils/mod.rs (1 hunks)
  • src/utils/time.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (5)
src/storage/azure_blob.rs (3)
src/storage/s3.rs (1)
  • upload_multipart (589:595)
src/storage/object_storage.rs (1)
  • upload_multipart (93:97)
src/storage/localfs.rs (1)
  • upload_multipart (107:113)
src/metrics/prom_utils.rs (1)
src/option.rs (1)
  • url (123:125)
src/enterprise/utils.rs (2)
src/storage/s3.rs (1)
  • s (181:181)
src/parseable/streams.rs (1)
  • parquet_files (239:251)
src/cli.rs (1)
src/option.rs (1)
  • mode (127:135)
src/utils/mod.rs (1)
src/handlers/http/modal/mod.rs (1)
  • get_indexer_id (452:454)
🔇 Additional comments (22)
src/utils/time.rs (1)

52-52: Good addition of Clone trait to TimeRange

Adding the Clone trait to the TimeRange struct allows for more flexible usage patterns when you need to duplicate the time range data, which is likely needed for the new indexer functionality being added.

src/utils/mod.rs (1)

45-50: Good pattern reuse for indexer ID generation

The implementation follows the same pattern as get_ingestor_id(), which provides consistency in how different component identifiers are generated in the system.
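For reference, a hedged sketch of what a get_indexer_id mirroring get_ingestor_id could look like; the real helper may derive the identifier differently (this version simply truncates a random UUID for illustration).

```rust
use uuid::Uuid;

/// Illustrative only: generate a short unique id for an indexer node.
pub fn get_indexer_id() -> String {
    Uuid::new_v4().simple().to_string()[..15].to_string()
}
```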

src/metrics/prom_utils.rs (2)

22-22: Added Mode enum import for URL determination

This import supports the URL generation changes based on the operation mode.


65-66: URL generation now mode-specific

The change specifies that the URL should be generated for the ingest mode, which aligns with the enterprise updates for differentiating between ingestor and indexer endpoints.

The comment "for now it is only for ingestor" is helpful for documenting the current state, but consider whether metrics should eventually be collected for indexers as well.

src/handlers/http/middleware.rs (1)

362-367:

❓ Verification inconclusive

Security concern: All endpoints are now accessible in Index mode.

The previous endpoint restriction logic for Index mode has been removed. This means all API endpoints are now accessible when running in Index mode, which could potentially expose sensitive operations that were previously restricted.

Please verify whether this change is intentional and aligned with the security model for enterprise deployments. If some endpoints should still be restricted in Index mode, consider adding appropriate filtering logic.


🏁 Script executed:

#!/bin/bash
# Check if there are any references to endpoint restrictions for Index mode elsewhere in the codebase
grep -r "Mode::Index" --include="*.rs" .

Length of output: 970


Attention: Verify Endpoint Filtering for Index Mode

It appears that the logic which previously restricted certain endpoints in Index mode has been removed from the middleware, leaving all endpoints accessible when running in this mode. The grep output shows that while Mode::Index is still referenced in several parts of the codebase (e.g., in src/catalog/mod.rs, src/main.rs, and others), there’s no evidence of any updated filtering logic in the middleware (specifically around lines 362–367 in src/handlers/http/middleware.rs).

  • Affected File: src/handlers/http/middleware.rs (lines 362–367)
  • Observation: The current implementation wraps the service call without any conditional checks or filtering based on the operating mode.
  • Impact: When running in Index mode, all API endpoints become accessible, potentially exposing sensitive operations that were previously restricted.

Please verify if this removal is intentional for the enterprise deployment security model. If endpoint restrictions should still apply in Index mode, consider reintroducing appropriate filtering logic.

src/storage/object_storage.rs (1)

93-97: New multipart upload interface added to ObjectStorage trait.

The upload_multipart method adds support for multipart file uploads across different storage backends. This is a good addition for handling large files more efficiently. However, ensure that all implementations (S3, Azure, LocalFS) properly implement this method, as the LocalFS implementation currently uses unimplemented!().

src/lib.rs (1)

75-79:

✅ Verification successful

HTTP client visibility and timeout changes.

Two changes have been made to the HTTP_CLIENT:

  1. The client is now publicly accessible (pub static) rather than private
  2. The request timeout has been increased from 10 to 30 seconds

The increased timeout may help with handling larger requests or slower network conditions, which is beneficial for enterprise deployments. However, making the client public exposes implementation details that might be better encapsulated.

Verify whether making the HTTP_CLIENT public is necessary and used appropriately throughout the codebase:


🏁 Script executed:

#!/bin/bash
# Find all places where HTTP_CLIENT is now used outside this module
grep -r "HTTP_CLIENT" --include="*.rs" . | grep -v "src/lib.rs"

Length of output: 1517


HTTP Client Visibility: Confirmed Usage Across the Codebase

The verification shows that HTTP_CLIENT is used extensively in modules outside src/lib.rs (e.g., in src/analytics.rs, src/audit.rs, src/handlers/http/cluster/*, and others). Given this widespread usage, making the HTTP client public appears to be a deliberate design decision. Additionally, increasing the request timeout from 10 to 30 seconds aligns well with handling larger requests or slower network conditions in enterprise deployments.

  • Public Exposure Justified: Multiple modules rely on HTTP_CLIENT, so its public visibility is necessary.
  • Timeout Increase Acceptable: The raised timeout supports more resilient network conditions.

Overall, the changes are appropriate, and no further adjustments are required.
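A minimal sketch of the kind of shared client this refers to, assuming a once_cell Lazy around a reqwest client; builder options other than the 30-second timeout are assumptions, not necessarily what the PR configures.

```rust
use std::time::Duration;

use once_cell::sync::Lazy;
use reqwest::Client;

// Public so other modules (cluster handlers, analytics, audit, ...) can reuse one pool.
pub static HTTP_CLIENT: Lazy<Client> = Lazy::new(|| {
    Client::builder()
        .timeout(Duration::from_secs(30)) // raised from 10s for larger enterprise payloads
        .build()
        .expect("reqwest client construction should not fail")
});
```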

src/handlers/http/cluster/mod.rs (2)

54-54: Consistent Import Usage

Bringing IndexerMetadata and IngestorMetadata into scope here is straightforward and consistent with the existing structure.


60-61: Maintain Naming Consistency

Defining IndexerMetadataArr in parallel with IngestorMetadataArr ensures consistent naming conventions for collection types. No issues here.

src/cli.rs (1)

298-305: Indexer Endpoint Added

Introducing the indexer_endpoint field aligns with the existing style and expands configuration for indexing services. It's good that a default value is provided, though consider validating non-empty values if indexing is mandatory.
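A hedged sketch of how the new field might be declared with clap's derive API (env feature enabled); the flag name, help text, and empty default are assumptions based on the review discussion of P_INDEXER_ENDPOINT.

```rust
use clap::Parser;

#[derive(Parser)]
pub struct Options {
    /// Endpoint (host:port) other nodes should use to reach this indexer.
    #[arg(
        long = "indexer-endpoint",
        env = "P_INDEXER_ENDPOINT",
        default_value = ""
    )]
    pub indexer_endpoint: String,
    // ... remaining CLI options elided
}
```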

src/handlers/http/modal/ingest_server.rs (3)

31-31: Importing Mode Enum

Using Mode here is a natural extension if the ingest server needs mode-specific logic. Nothing concerning spotted.


112-112: Storing Metadata with Explicit Mode

Calling store_metadata(Mode::Ingest) is consistent with the broader shift towards mode-based metadata handling. Looks fine.


255-255: Confirm Public Access to check_querier_state

Changing visibility to pub makes this function callable from other modules. Verify that external callers cannot misuse this to bypass any internal workflows, especially around system readiness or security checks.

src/enterprise/utils.rs (1)

3-3: Chrono Import for Date Handling

Bringing in chrono::{TimeZone, Utc} is appropriate for robust date/time operations. No immediate issues here.

src/parseable/mod.rs (3)

50-50: No issues found for the new imports.
The import statement for IndexerMetadata seems correct and consistent with the existing structure.


149-152: Check whether Mode::All also needs indexer metadata.
The logic only loads metadata for Mode::Index. Please verify if running in Mode::All should also initialize indexer_metadata.


158-158: No concerns with storing the new field.
Passing indexer_metadata to the struct constructor looks straightforward and is consistent with the existing pattern for ingestor metadata.

src/storage/s3.rs (3)

46-46: Import statement needed for async file I/O.
Using tokio’s OpenOptions and AsyncReadExt is appropriate for streaming file reads.


66-66: Defines the minimum part size for multipart uploads.
This constant (5MB) aligns with AWS S3’s minimum valid chunk size for multipart operations.


589-595: Public wrapper for _upload_multipart.
This method succinctly exposes the multipart upload functionality. No concerns identified here.

src/handlers/http/modal/mod.rs (2)

37-37: Importing Mode for indexing logic.
Referencing option::Mode is consistent with the rest of the file’s approach to handling server modes.


40-40: New utility imports for ID retrieval.
Using get_indexer_id and get_ingestor_id will help differentiate between these roles. No issues here.

Comment on lines +427 to +433
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}

⚠️ Potential issue

Implementation needed for upload_multipart

The method is currently stubbed with unimplemented!(), which is appropriate for a work-in-progress PR. However, this should be implemented before the feature is considered complete since other storage backends like S3 have working implementations.

Consider implementing the method using the commented-out code around line 382 as a starting point, or at minimum add a TODO comment with a timeline for implementation:

async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
-    unimplemented!()
+    // TODO(enterprise): Implement multipart uploads for Azure Blob storage in ticket #XXXX
+    Err(ObjectStorageError::UnhandledError(Box::new(
+        std::io::Error::new(
+            std::io::ErrorKind::Unsupported,
+            "Multipart upload not implemented for Blob Storage yet",
+        ),
+    )))
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    // TODO(enterprise): Implement multipart uploads for Azure Blob storage in ticket #XXXX
    Err(ObjectStorageError::UnhandledError(Box::new(
        std::io::Error::new(
            std::io::ErrorKind::Unsupported,
            "Multipart upload not implemented for Blob Storage yet",
        ),
    )))
}

Comment on lines +107 to +113
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}

⚠️ Potential issue

Implementation needed for upload_multipart method.

The upload_multipart method is currently unimplemented. This could cause runtime panics if this method is called in production. Consider implementing this method with actual functionality similar to the other storage backends, or at minimum, returning a proper error instead of using unimplemented!().

async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
-    unimplemented!()
+    Err(ObjectStorageError::UnhandledError(Box::new(
+        std::io::Error::new(
+            std::io::ErrorKind::Unsupported,
+            "Multipart upload not implemented for LocalFS yet",
+        ),
+    )))
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    unimplemented!()
}
async fn upload_multipart(
    &self,
    _key: &RelativePath,
    _path: &Path,
) -> Result<(), ObjectStorageError> {
    Err(ObjectStorageError::UnhandledError(Box::new(
        std::io::Error::new(
            std::io::ErrorKind::Unsupported,
            "Multipart upload not implemented for LocalFS yet",
        ),
    )))
}

Comment on lines +639 to +655
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();

    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        // this unwrap will most definateley shoot me in the foot later
        .map(|x| serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default())
        .collect_vec();

    Ok(arr)
}

🛠️ Refactor suggestion

Improve Error Handling in Metadata Deserialization

Using unwrap_or_default() can silently swallow parsing errors and may hinder debugging if the metadata is malformed. Prefer propagating errors or logging them for better visibility.

Consider changing to:

 .map(|x| {
-    serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default()
+    serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_else(|e| {
+        error!("Failed to parse indexer metadata: {:?}", e);
+        IndexerMetadata::default()
+    })
 })

to detect and log failures, or fully propagate the error if data integrity is critical.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();
    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        // this unwrap will most definateley shoot me in the foot later
        .map(|x| serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_default())
        .collect_vec();
    Ok(arr)
}
pub async fn get_indexer_info() -> anyhow::Result<IndexerMetadataArr> {
    let store = PARSEABLE.storage.get_object_store();
    let root_path = RelativePathBuf::from(PARSEABLE_ROOT_DIRECTORY);
    let arr = store
        .get_objects(
            Some(&root_path),
            Box::new(|file_name| file_name.starts_with("indexer")),
        )
        .await?
        .iter()
        .map(|x| {
            serde_json::from_slice::<IndexerMetadata>(x).unwrap_or_else(|e| {
                error!("Failed to parse indexer metadata: {:?}", e);
                IndexerMetadata::default()
            })
        })
        .collect_vec();
    Ok(arr)
}

Comment on lines +123 to 153
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();

    let date = date.as_slice()[1..4].iter().map(|s| s.to_string());

    let date = RelativePathBuf::from_iter(date);

    parquet_files.entry(date).or_default().push(file);
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();

    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());

        let date = RelativePathBuf::from_iter(date);

        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});

🛠️ Refactor suggestion

Validate Path String Sub-slices

Extracting substrings like &date[1][5..9] risks panics if date[1] is shorter than expected. Consider verifying path segment lengths to guard against malformed or unexpected file paths.

- let year = &date[1][5..9];
+ if date[1].len() < 9 {
+     warn!("Unexpected file path format for: {:?}", date);
+     return None;
+ }
+ let year = &date[1][5..9];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();
    let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
    let date = RelativePathBuf::from_iter(date);
    parquet_files.entry(date).or_default().push(file);
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();
    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
        let date = RelativePathBuf::from_iter(date);
        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});
.filter_map(|file| {
    let date = file.file_path.split("/").collect_vec();
    if date[1].len() < 9 {
        warn!("Unexpected file path format for: {:?}", date);
        return None;
    }
    let year = &date[1][5..9];
    let month = &date[1][10..12];
    let day = &date[1][13..15];
    let hour = &date[2][5..7];
    let min = &date[3][7..9];
    let file_date = Utc
        .with_ymd_and_hms(
            year.parse::<i32>().unwrap(),
            month.parse::<u32>().unwrap(),
            day.parse::<u32>().unwrap(),
            hour.parse::<u32>().unwrap(),
            min.parse::<u32>().unwrap(),
            0,
        )
        .unwrap();
    if file_date < time_range.start {
        None
    } else {
        let date = date.as_slice()[1..4].iter().map(|s| s.to_string());
        let date = RelativePathBuf::from_iter(date);
        parquet_files.entry(date).or_default().push(file);
        Some("")
    }
})
.for_each(|_| {});

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/storage/object_storage.rs (2)

93-97: Add documentation for the new trait method.
Although this new trait method upload_multipart is a valuable addition, it would be beneficial to include a doc comment explaining usage details and any assumptions made about file sizes or concurrency constraints.


847-851: Ensure robust error handling for partial uploads.
While it is good that the loop continues upon upload errors, consider whether you want to provide any retry logic for partial file uploads. In high-latency or failure scenarios, having granular retries for each chunk could ensure more resilient uploads.

src/storage/s3.rs (2)

66-66: Consider making the threshold configurable.
Defining a 5 MB threshold for MIN_MULTIPART_UPLOAD_SIZE is reasonable, but it might be even more robust to allow a user or environment variable to configure this value for edge cases or variable bandwidth constraints.


514-565: Check concurrency and finalization logic in _upload_multipart.
This implementation executes part-uploads in parallel with tokio::spawn, which can improve speed but may also raise memory usage for very large files. Examine whether a bounded concurrency strategy or streaming approach is more suitable. Additionally, you may want to handle failures in async_writer.complete() by aborting the multipart upload to avoid leaving stale partials.

Do you want a verification script to scan for any usage of abort_multipart calls or relevant error handling in other files that might be triggered upon failure?
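As a sketch of the bounded-concurrency idea raised here (not the PR's actual code), part uploads could be capped with a semaphore so only a fixed number are in flight at once; upload_part is a placeholder for the real storage call.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task::JoinSet;

async fn upload_parts_bounded(parts: Vec<Vec<u8>>, max_in_flight: usize) -> std::io::Result<()> {
    let permits = Arc::new(Semaphore::new(max_in_flight));
    let mut tasks = JoinSet::new();

    for part in parts {
        let permits = Arc::clone(&permits);
        tasks.spawn(async move {
            // Holding the permit for the duration of the upload bounds memory
            // and connection usage.
            let _permit = permits.acquire_owned().await.expect("semaphore not closed");
            upload_part(part).await
        });
    }

    // Surface the first failed part instead of silently dropping it.
    while let Some(res) = tasks.join_next().await {
        res.expect("upload task panicked")?;
    }
    Ok(())
}

// Placeholder for the real part-upload call.
async fn upload_part(_bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
```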

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8699ce8 and cd4f854.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/storage/s3.rs (3)
src/storage/mod.rs (1)
  • to_object_store_path (266:268)
src/storage/object_storage.rs (1)
  • upload_multipart (93:97)
src/storage/azure_blob.rs (1)
  • upload_multipart (427:433)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/storage/s3.rs (2)

46-46: Necessary import for asynchronous file I/O.
This import of OpenOptions and AsyncReadExt is required to support non-blocking reads, which is essential for multipart uploads.


581-587: Minimal pass-through implementation.
Forwarding upload_multipart to the private _upload_multipart method looks consistent. No further concerns.

@parmesant parmesant force-pushed the enterprise-changes branch from cd4f854 to 7e2cbb9 Compare March 18, 2025 12:42
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/storage/object_storage.rs (2)

93-97: Add a short docstring to the new method for clarity.
It would be helpful to document the behavior, use cases, and expected error scenarios of upload_multipart, so future contributors can easily understand its role and how it integrates with the rest of the trait methods.


847-851: Consider parallel uploading and improved error handling.
Currently, the upload is performed sequentially within this loop, and if an error persists, the code logs it and proceeds. For large-scale deployments, consider performing parallel uploads for multiple files to reduce overall time. Also, evaluate whether a retry or rollback mechanism is necessary to handle partial failures in multipart uploads.

src/storage/s3.rs (1)

65-65: Make the minimum multipart upload size configurable if future needs demand it.
The current 5 MB threshold is fairly standard. However, if you anticipate different file sizes or have memory constraints, consider making it user-configurable for flexibility.
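If the threshold were made configurable as suggested, one lightweight approach (the variable name is an assumption) would be an environment override with the current 5 MB value as the fallback:

```rust
// Illustrative sketch only: read the multipart threshold from the environment,
// falling back to the hard-coded 5 MB used today.
fn min_multipart_upload_size() -> u64 {
    std::env::var("P_MIN_MULTIPART_SIZE")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(5 * 1024 * 1024)
}
```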

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd4f854 and 7e2cbb9.

📒 Files selected for processing (2)
  • src/storage/object_storage.rs (2 hunks)
  • src/storage/s3.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
🔇 Additional comments (2)
src/storage/s3.rs (2)

46-46: Import statement looks good.
The addition of OpenOptions and AsyncReadExt is appropriate for the asynchronous file reads.


578-584: Straightforward delegation to _upload_multipart.
This public method is cleanly forwarding the call to the private _upload_multipart function. The implementation is consistent with the trait requirement.

@parmesant parmesant force-pushed the enterprise-changes branch from 7e2cbb9 to 29597f5 Compare March 18, 2025 13:06