-
-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: use regex to extract info from arrow file name #1239
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe changes add a static regex ( Changes
Possibly related PRs
Suggested reviewers
Poem
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (10)
🔇 Additional comments (5)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
94-106
: Improve error handling inarrow_path_to_parquet
function.The function correctly extracts the front part from the arrow file name using the regex pattern, but it uses
unwrap()
twice on line 96, which could panic if the path has no filename or the filename is not valid UTF-8.Consider improving the error handling:
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { - let filename = path.file_name().unwrap().to_str().unwrap(); + let filename = path.file_name()?.to_str()?; let filename = ARROWS_NAME_STRUCTURE .captures(filename) .and_then(|c| c.get(1))? .as_str(); let filename_with_random_number = format!("{filename}.data.{random_string}.parquet"); let mut parquet_path = path.to_owned(); parquet_path.set_file_name(filename_with_random_number); Some(parquet_path) }This makes the function more robust by using the
?
operator to propagate theNone
value if either operation fails.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
src/parseable/streams.rs (4)
36-45
: Good addition of necessary imports.The imports for
once_cell::sync::Lazy
andregex::Regex
are appropriate for implementing the regex-based file name validation solution.
224-231
: Improved arrow file filtering using regex.The updated filtering logic using the regex pattern provides more robust validation of arrow filenames compared to the previous simple extension check.
274-281
: Good error handling with warning for unexpected arrow files.Adding a warning log when an arrow file doesn't match the expected pattern is a good practice. It will help identify problematic files during operation.
73-92
:❓ Verification inconclusive
Well-documented regex pattern for arrow file name validation.
The static regex pattern with detailed documentation explains the expected format, limitations, and gives examples, which is excellent for maintainability.
One consideration: the pattern uses
\S+
for the front part which allows any non-whitespace character, but the documentation mentions limitations on partition keys/values. Consider if the regex should be more restrictive to enforce the documented limitations.
🏁 Script executed:
#!/bin/bash # Check if there are any existing arrow files that might not match the new pattern # This will help identify if the regex might be too restrictive for existing files # Find current arrow files echo "Searching for existing arrow files..." find . -name "*.arrows" | grep -v "node_modules" | head -n 10 # Test a few example filenames against the regex echo -e "\nTesting regex against sample filenames:" test_regex() { if [[ "$1" =~ ^[a-zA-Z0-9&=]+\.(.*?)\.data\.arrows$ ]]; then echo "MATCH: $1" else echo "NO MATCH: $1" fi } # Valid examples test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" test_regex "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid examples test_regex "abc-123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid character in schema_key test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value-1.key2=value2.ee529ffc8e76.data.arrows" # Has hyphen in valueLength of output: 374
Attention: Refine Regex Restrictions & Address Shell Test Syntax
The documentation for the arrow file regex is clear and detailed, which is excellent for maintainability. However, two points need your attention:
Regex Specificity: The regex currently uses
\S+
to capture the "front" part, which permits any non-whitespace characters. This does not enforce the documented limitations that partition keys and values must be strictly alphanumeric. Please consider updating the regex to use a more precise character class that only allows alphanumeric characters (and specific allowed punctuation) as described.Shell Test Script Issue: The test script encountered a syntax error due to the unescaped ampersand (
&
) within the regex. When running regex tests in bash, special characters like&
should be properly escaped or quoted to prevent shell misinterpretation. This adjustment will ensure reliable test execution.Please address these items to align the implementation with the documentation and improve test reliability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (5)
src/parseable/streams.rs (5)
36-36
: Appropriate imports added.The imports for
once_cell::sync::Lazy
andregex::Regex
are correctly added to support the new regex-based file name extraction functionality.Also applies to: 45-45
73-92
: Good documentation and regex pattern implementation.The documentation thoroughly explains the format, limitations, and provides examples for the arrow file name structure. Using
Lazy::new()
for initializing the static regex is a good practice that ensures the regex is compiled only once when needed.The regex pattern looks correct for validating the described format:
- Schema key (alphanumeric with
&
and=
characters)- Front part (for parquet file naming)
- Suffix
.data.arrows
227-231
: Improved file filtering with regex validation.Replacing the simple extension check with regex validation improves the robustness of the
arrow_files
method by ensuring that files follow the expected naming pattern.
274-281
: Added warning for unexpected arrow files.Good improvement to log warnings for arrow files that don't match the expected pattern. This will help with debugging file naming issues.
1234-1296
: Comprehensive test coverage for the new functionality.The tests cover a good range of scenarios including:
- Valid arrow path conversion
- Invalid arrow paths (missing suffix)
- Invalid schema keys with special characters
- Complex nested paths
- Edge cases with empty front part
These tests will help ensure the robustness of the implementation.
Fixes #XXXX.
Description
Using regex for this purpose would be a better way of dealing with the complex filenames instead of having to depend on specific string operations on the filenames themselves.
This PR has:
Summary by CodeRabbit
Summary by CodeRabbit