refactor: use regex to extract info from arrow file name #1239

Open
wants to merge 5 commits into
base: main
from

Conversation

de-sh
Contributor

@de-sh de-sh commented Mar 14, 2025

Fixes #XXXX.

Description

Using a regex to extract information from arrow file names is a more robust way of handling their complex structure than depending on specific string operations on the filenames themselves.


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features
    • Enhanced file processing by enforcing strict naming conventions, ensuring that data files comply with expected formats.
    • Improved conversion from arrow file paths to parquet file paths, with robust error handling for non-compliant filenames.
    • Added logging for unexpected filename structures to improve file handling robustness.


coderabbitai bot commented Mar 14, 2025

Walkthrough

The changes add a static regex (ARROWS_NAME_STRUCTURE) to validate arrow file names, ensuring they follow a specific format. A new function (arrow_path_to_parquet) converts arrow file paths to parquet file paths using this regex, returning an Option<PathBuf>. The filtering logic in arrow_files is updated to apply the regex instead of a simple extension check, and arrow_files_grouped_exclude_time now logs a warning when the file name does not match the expected structure.

Changes

File(s): src/.../streams.rs
Summary:
  • Added static regex ARROWS_NAME_STRUCTURE for validating arrow file naming.
  • Introduced new function arrow_path_to_parquet that converts arrow file paths to parquet file paths and returns Option<PathBuf>.
  • Updated filtering logic in arrow_files to use the regex.
  • Modified arrow_files_grouped_exclude_time to log warnings on invalid file names.

Possibly related PRs

  • fix: ensure panic safety #1212: Both PRs address file handling and validation in streams.rs. This PR introduces the arrow_path_to_parquet function and a regex pattern for validating arrow filenames, while #1212 improves error handling and logging in the same file.

  • fix: capture time created from metadata not filename #1211: Both PRs work on the arrow_files_grouped_exclude_time method with different focuses. This PR introduces a regex pattern and a function for validating and converting arrow file paths, while #1211 modifies the method's parameter type and logic.

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a little rabbit, hopping through the code,
With regex carrots and a new conversion road.
Arrow files transformed with a flick and a skip,
Logging warnings like a playful tip.
My ears twitch in joy, as changes unfold—
In the realm of code magic, vibrant and bold!
🐇🌟

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 382a8d8 and e87d96f.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (5)
src/parseable/streams.rs (5)

73-91: Well-structured regex implementation with clear documentation.

The regex pattern is well-defined to match the expected file name structure, and the extensive documentation clearly explains the format, limitations, and provides examples. This is a good improvement over string manipulation for file name validation.


93-105: Effective implementation with proper error handling

The function correctly extracts filename components using regex, and properly handles potential errors by returning Option<PathBuf> and using the ? operator instead of unwrap calls. This addresses the issues mentioned in previous review comments.


222-226: Good improvement to filter validation logic

Replacing the simple extension check with regex validation ensures arrow files follow the expected naming convention, which will prevent issues with malformed filenames later in the process.


269-276: Enhanced error handling for unexpected file formats

Adding a warning log when encountering arrow files that don't match the expected pattern improves observability and makes debugging easier. This is a valuable improvement to the error handling flow.


1222-1310: Comprehensive test coverage for the new functionality

The test suite covers various scenarios including valid paths, invalid paths, and complex directory structures. These tests help verify the correctness of the regex pattern and the arrow_path_to_parquet function.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

94-106: Improve error handling in arrow_path_to_parquet function.

The function correctly extracts the front part from the arrow file name using the regex pattern, but it uses unwrap() twice on line 96, which could panic if the path has no filename or the filename is not valid UTF-8.

Consider improving the error handling:

 fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
-    let filename = path.file_name().unwrap().to_str().unwrap();
+    let filename = path.file_name()?.to_str()?;
     let filename = ARROWS_NAME_STRUCTURE
         .captures(filename)
         .and_then(|c| c.get(1))?
         .as_str();
     let filename_with_random_number = format!("{filename}.data.{random_string}.parquet");
     let mut parquet_path = path.to_owned();
     parquet_path.set_file_name(filename_with_random_number);

     Some(parquet_path)
 }

This makes the function more robust by using the ? operator to propagate the None value if either operation fails.
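
As a minimal, dependency-free sketch of the difference (file_name_str is a hypothetical helper written for illustration, not code from the PR), the ? operator yields None in exactly the cases where the unwrap() chain would panic, for example a path ending in "..", which has no file name:

```rust
use std::path::Path;

/// Hypothetical helper (not from the PR): returns the final path component
/// as UTF-8 text, or `None` instead of panicking.
fn file_name_str(path: &Path) -> Option<&str> {
    // `file_name()` is `None` for paths such as `/` or `staging/..`;
    // `to_str()` is `None` when the name is not valid UTF-8.
    path.file_name()?.to_str()
}
```

A caller such as arrow_path_to_parquet can then propagate that None with another ? rather than crashing on an unusual path.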

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db4a68d and 41b3688.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
src/parseable/streams.rs (4)

36-45: Good addition of necessary imports.

The imports for once_cell::sync::Lazy and regex::Regex are appropriate for implementing the regex-based file name validation solution.


224-231: Improved arrow file filtering using regex.

The updated filtering logic using the regex pattern provides more robust validation of arrow filenames compared to the previous simple extension check.


274-281: Good error handling with warning for unexpected arrow files.

Adding a warning log when an arrow file doesn't match the expected pattern is a good practice. It will help identify problematic files during operation.


73-92:

❓ Verification inconclusive

Well-documented regex pattern for arrow file name validation.

The static regex pattern with detailed documentation explains the expected format, limitations, and gives examples, which is excellent for maintainability.

One consideration: the pattern uses \S+ for the front part which allows any non-whitespace character, but the documentation mentions limitations on partition keys/values. Consider if the regex should be more restrictive to enforce the documented limitations.


🏁 Script executed:

#!/bin/bash
# Check if there are any existing arrow files that might not match the new pattern
# This will help identify if the regex might be too restrictive for existing files

# Find current arrow files
echo "Searching for existing arrow files..."
find . -name "*.arrows" | grep -v "node_modules" | head -n 10

# Test a few example filenames against the regex
echo -e "\nTesting regex against sample filenames:"
test_regex() {
  if [[ "$1" =~ ^[a-zA-Z0-9&=]+\.(.*?)\.data\.arrows$ ]]; then
    echo "MATCH: $1"
  else
    echo "NO MATCH: $1"
  fi
}

# Valid examples
test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"
test_regex "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"

# Invalid examples
test_regex "abc-123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid character in schema_key
test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value-1.key2=value2.ee529ffc8e76.data.arrows" # Has hyphen in value

Length of output: 374


Attention: Refine Regex Restrictions & Address Shell Test Syntax

The documentation for the arrow file regex is clear and detailed, which is excellent for maintainability. However, two points need your attention:

  • Regex Specificity: The regex currently uses \S+ to capture the "front" part, which permits any non-whitespace characters. This does not enforce the documented limitations that partition keys and values must be strictly alphanumeric. Please consider updating the regex to use a more precise character class that only allows alphanumeric characters (and specific allowed punctuation) as described.

  • Shell Test Script Issue: The test script encountered a syntax error due to the unescaped ampersand (&) within the regex. When running regex tests in bash, special characters like & should be properly escaped or quoted to prevent shell misinterpretation. This adjustment will ensure reliable test execution.

Please address these items to align the implementation with the documentation and improve test reliability.
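
To make the first point concrete, here is a rough sketch that contrasts the permissive \S+ behaviour with a stricter character class. It uses hand-written checks instead of the regex crate so that it has no dependencies. The function names and the allowed punctuation set (=, . and -, chosen because the example file names contain dates such as 2020-01-21) are assumptions for illustration, not part of the PR.

```rust
/// Mirrors `\S+`: any non-empty run of non-whitespace characters.
fn front_is_permissive(front: &str) -> bool {
    !front.is_empty() && !front.chars().any(char::is_whitespace)
}

/// Stricter check, roughly `[a-zA-Z0-9=.-]+`. The allowed punctuation
/// (`=`, `.`, `-`) is an assumption made for this sketch.
fn front_is_strict(front: &str) -> bool {
    !front.is_empty()
        && front
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '=' | '.' | '-'))
}
```

With these definitions, a front part such as "date=2020-01-21.hour=10.key1=val/ue" passes the permissive check but fails the strict one. Note that a flat character class like this still admits a hyphen inside a partition value; rejecting that while keeping hyphenated dates would need a per-segment rule.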

coderabbitai bot previously approved these changes Mar 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41b3688 and 2995cac.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (5)
src/parseable/streams.rs (5)

36-36: Appropriate imports added.

The imports for once_cell::sync::Lazy and regex::Regex are correctly added to support the new regex-based file name extraction functionality.

Also applies to: 45-45


73-92: Good documentation and regex pattern implementation.

The documentation thoroughly explains the format, limitations, and provides examples for the arrow file name structure. Using Lazy::new() for initializing the static regex is a good practice that ensures the regex is compiled only once when needed.

The regex pattern looks correct for validating the described format:

  • Schema key (alphanumeric with & and = characters)
  • Front part (for parquet file naming)
  • Suffix .data.arrows
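
The three parts above can be sketched as a small matcher. The exact pattern is not shown in this conversation; the sketch assumes it is close to ^[a-zA-Z0-9&=]+\.(\S+)\.data\.arrows$, pieced together from the character class in the earlier verification script and the \S+ mentioned in the review. It is written with standard-library string operations as a stand-in for the regex crate, and front_part is a hypothetical name.

```rust
/// Hypothetical stand-in for the regex capture: returns the "front" part of
/// an arrow file name, assuming the pattern
/// `^[a-zA-Z0-9&=]+\.(\S+)\.data\.arrows$`.
fn front_part(filename: &str) -> Option<&str> {
    // The name must end with the fixed suffix.
    let rest = filename.strip_suffix(".data.arrows")?;
    // The schema key cannot contain '.', so it ends at the first dot.
    let (schema_key, front) = rest.split_once('.')?;
    let key_ok = !schema_key.is_empty()
        && schema_key
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '&' || c == '=');
    // The front part is any non-empty run of non-whitespace characters.
    let front_ok = !front.is_empty() && !front.chars().any(char::is_whitespace);
    (key_ok && front_ok).then_some(front)
}
```

Under that assumed pattern, a name with a hyphen in the schema key, a missing .data.arrows suffix, or an empty front part yields None, which matches the scenarios the PR's tests are described as covering.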

227-231: Improved file filtering with regex validation.

Replacing the simple extension check with regex validation improves the robustness of the arrow_files method by ensuring that files follow the expected naming pattern.


274-281: Added warning for unexpected arrow files.

Good improvement to log warnings for arrow files that don't match the expected pattern. This will help with debugging file naming issues.


1234-1296: Comprehensive test coverage for the new functionality.

The tests cover a good range of scenarios including:

  • Valid arrow path conversion
  • Invalid arrow paths (missing suffix)
  • Invalid schema keys with special characters
  • Complex nested paths
  • Edge cases with empty front part

These tests will help ensure the robustness of the implementation.

coderabbitai bot previously approved these changes Mar 14, 2025
de-sh added a commit to de-sh/parseable that referenced this pull request Mar 16, 2025