feat: add api to execute clickbench benchmark #1219
Conversation
Walkthrough

A new benchmarking feature has been integrated into the system. A dedicated Rust module has been created to execute SQL queries on Parquet files using the DataFusion library. The feature exposes asynchronous endpoints that initialize a session, register data, and execute queries while returning results in JSON format. Additionally, a corresponding web scope and RBAC permission have been added to support access control. A minor change also refines error logging in the streams module without altering the underlying logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant C as Client
participant S as Server (HTTP Router)
participant CB as Clickbench Handler
participant R as Benchmark Runner
participant P as Parquet Data Source
C->>S: GET /benchmark/clickbench
S->>CB: Handle benchmarking request
CB->>R: call run_benchmark()
R->>R: Initialize session and register Parquet file ("hits")
R->>R: For each SQL query, parse and generate execution plan
loop Execute each query 3 times
R->>P: Execute query on Parquet file
P-->>R: Return query result and execution time
end
R->>CB: Return JSON with benchmark results
CB-->>S: Pass response
S-->>C: Send JSON response
```
Force-pushed from 91bdbfb to a5dc222
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🧹 Nitpick comments (5)
src/rbac/role.rs (1)
70-70
: Add necessary documentation for the Benchmark action.

Adding the `Benchmark` variant is straightforward, but it would be helpful to briefly document its intended usage so that future maintainers know why permissions related to benchmarking must be gated under this action.

Also applies to: 112-112
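A minimal sketch of such a doc comment, with the surrounding variants elided (the exact enum contents are not shown in this review):

```rust
/// RBAC actions; only the new variant is shown here for illustration.
pub enum Action {
    // ...existing variants...
    /// Permission to call the ClickBench benchmark endpoint
    /// (`/api/v1/benchmark/clickbench`). Intended to be gated to
    /// admin-style roles, since running the benchmark is resource intensive.
    Benchmark,
}
```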
src/handlers/http/modal/server.rs (2)
28-28
: Ensure consistent scope ordering and maintain readability.

New imports and routing entries look fine. However, consider grouping or commenting the `.service(Self::get_benchmark_webscope())` call near other related performance/analytics endpoints for clarity.

Also applies to: 91-92
165-173
: Implement input validation or usage notes for the benchmark endpoint.

The new `/benchmark/clickbench` scope is clear and well-defined. However, if environment variables are not set correctly or files are missing, the endpoint may fail. Consider returning a user-friendly error or guidance within the endpoint on how to supply valid environment variables.

src/handlers/http/clickbench.rs (2)
19-31
: Consider reorganizing imports for clarity.

The imports are correct and necessary for the functionality. Optionally, you could separate standard library, third-party, and local dependencies into distinct blocks for readability.
75-83
: Graceful error handling for table registration.

Registration logic is clear. Consider adding logs for troubleshooting if the Parquet file registration fails, to help diagnose environment or file issues quickly; see the sketch below.
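A minimal sketch of what that logging could look like, assuming `register_hits` wraps `SessionContext::register_parquet` (the PR's exact signature and error handling may differ):

```rust
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use tracing::{error, info};

// Hypothetical helper; the PR's actual register_hits may differ slightly.
async fn register_hits(ctx: &SessionContext, parquet_file: &str) -> Result<(), anyhow::Error> {
    ctx.register_parquet("hits", parquet_file, ParquetReadOptions::default())
        .await
        .map_err(|e| {
            // Log the failing path so environment or file issues are easy to diagnose.
            error!("failed to register '{parquet_file}' as table 'hits': {e}");
            anyhow::anyhow!("could not register hits table from {parquet_file}: {e}")
        })?;
    info!("registered table 'hits' from {parquet_file}");
    Ok(())
}
```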
🛑 Comments failed to post (1)
src/handlers/http/clickbench.rs (1)
40-73: 🛠️ Refactor suggestion
Avoid creating an additional tokio runtime inside an Actix service.
Using `#[tokio::main]` in `run_benchmark()` can lead to nested runtimes when running under Actix, which already uses a tokio runtime. Prefer converting `run_benchmark` into a standard async function and using `tokio::spawn` rather than nesting a new runtime.

```diff
- #[tokio::main(flavor = "multi_thread")]
  pub async fn run_benchmark() -> Result<Json<Value>, anyhow::Error> {
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
pub async fn run_benchmark() -> Result<Json<Value>, anyhow::Error> {
    let mut session_config = SessionConfig::from_env()?.with_information_schema(true);
    session_config = session_config.with_batch_size(8192);
    let rt_builder = RuntimeEnvBuilder::new();
    // set memory pool size
    let runtime_env = rt_builder.build_arc()?;
    let state = SessionStateBuilder::new()
        .with_default_features()
        .with_config(session_config)
        .with_runtime_env(runtime_env)
        .build();
    state
        .catalog_list()
        .catalog(&state.config_options().catalog.default_catalog)
        .expect("default catalog is provided by datafusion");
    let ctx = SessionContext::new_with_state(state);
    let mut table_options = HashMap::new();
    table_options.insert("binary_as_string", "true");
    let parquet_file = env::var("PARQUET_LOCATION")?;
    register_hits(&ctx, &parquet_file).await?;
    let mut query_list = Vec::new();
    let queries_file = env::var("QUERIES_FILE")?;
    let queries = fs::read_to_string(queries_file)?;
    for query in queries.lines() {
        query_list.push(query.to_string());
    }
    execute_queries(&ctx, query_list).await
}
```
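For reference, a minimal sketch of how the Actix handler could then simply await the plain async function (the handler name and error mapping here are assumptions, not the PR's exact code):

```rust
use actix_web::{web::Json, HttpResponse, Responder};

// Hypothetical handler: no extra runtime is created because Actix already
// drives a tokio runtime, so the async benchmark can be awaited directly.
async fn clickbench_benchmark() -> impl Responder {
    match run_benchmark().await {
        Ok(Json(results)) => HttpResponse::Ok().json(results),
        Err(e) => HttpResponse::InternalServerError().body(e.to_string()),
    }
}
```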
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/handlers/http/clickbench.rs (2)
40-58
: Session configuration could benefit from customization options.

The session configuration uses hardcoded values which might not be optimal for all systems. Consider making the batch size configurable through environment variables.

```diff
- session_config = session_config.with_batch_size(8192);
+ let batch_size = env::var("CLICKBENCH_BATCH_SIZE")
+     .unwrap_or_else(|_| "8192".to_string())
+     .parse::<usize>()
+     .unwrap_or(8192);
+ session_config = session_config.with_batch_size(batch_size);
```
120-123
: Enhance result format to include more details.

The current JSON result format doesn't provide enough context for analysis. Including additional metadata would make the benchmark results more usable.

```diff
- let result_json = json!(results);
+ let result_json = json!({
+     "metadata": {
+         "timestamp": chrono::Utc::now().to_rfc3339(),
+         "num_queries": query_list.len(),
+         "num_iterations": TRIES
+     },
+     "results": results
+ });
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/handlers/http/clickbench.rs (1 hunks)
- src/handlers/http/mod.rs (1 hunks)
- src/handlers/http/modal/server.rs (3 hunks)
- src/parseable/streams.rs (1 hunks)
- src/rbac/role.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/handlers/http/mod.rs
- src/rbac/role.rs
- src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
🔇 Additional comments (4)
src/handlers/http/modal/server.rs (3)
28-28
: Import added for clickbench functionality.

The import statement correctly adds the necessary module for the new clickbench benchmark functionality.
91-92
: New benchmark webscope added to server configuration.

The benchmark webscope is properly integrated into the server's route configuration, following the established pattern for other webscopes.
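For context, a minimal sketch of the pattern being referenced, with hypothetical helper names (the PR's actual handler and authorization wrapper may differ):

```rust
use actix_web::{web, Scope};

// Hypothetical shape of the new web scope; in the PR the GET route is
// additionally guarded by the Action::Benchmark RBAC permission.
fn get_benchmark_webscope() -> Scope {
    web::scope("/benchmark").service(
        web::resource("/clickbench").route(web::get().to(clickbench_benchmark)),
    )
}
```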
165-173
: ❓ Verification inconclusive
Benchmark webscope implementation looks good.
The implementation creates a web scope at the path `/benchmark/clickbench` with proper authorization using the `Action::Benchmark` permission.
🏁 Script executed:
```bash
#!/bin/bash
# Description: Verify that the Action::Benchmark enum variant is properly defined in the RBAC role module.

# Search for the Benchmark variant in the Action enum
rg -A 1 -B 1 "enum Action" --include="*.rs"
rg "Benchmark," --include="*.rs"
```

Length of output: 962
Benchmark webscope implementation looks good but requires manual verification of the RBAC configuration.
The implementation creates a web scope at the path `/benchmark/clickbench` with proper authorization using `Action::Benchmark`. However, our initial automated search for `Action::Benchmark` encountered issues due to command syntax. Please manually verify (or re-run the revised search commands below) that the `Action::Benchmark` variant is correctly defined in the RBAC role module.

Suggested revised verification commands:

```bash
# Check the definition of the Action enum, filtering only Rust sources
rg -g '*.rs' -A 1 -B 1 "enum Action"

# Search for the Benchmark variant within Rust files
rg -g '*.rs' "Benchmark,"
```

src/handlers/http/clickbench.rs (1)
32-38
: HTTP handler implementation looks good.

The function correctly spawns a blocking task to run the CPU-intensive benchmark operation, which is a good practice for handling potentially long-running operations in an async web server.
src/handlers/http/clickbench.rs
Outdated
let parquet_file = env::var("PARQUET_FILE")?; | ||
register_hits(&ctx, &parquet_file).await?; | ||
let mut query_list = Vec::new(); | ||
let queries_file = env::var("QUERIES_FILE")?; | ||
let queries = fs::read_to_string(queries_file)?; | ||
for query in queries.lines() { | ||
query_list.push(query.to_string()); | ||
} |
🛠️ Refactor suggestion
Improve environment variable handling.
The code directly uses environment variables without providing descriptive error messages if they're missing. It would be better to provide helpful error messages when required environment variables are not set.
- let parquet_file = env::var("PARQUET_FILE")?;
+ let parquet_file = env::var("PARQUET_FILE")
+ .map_err(|_| anyhow::anyhow!("PARQUET_FILE environment variable not set. Please set it to the path of the hits.parquet file."))?;
- let queries_file = env::var("QUERIES_FILE")?;
+ let queries_file = env::var("QUERIES_FILE")
+ .map_err(|_| anyhow::anyhow!("QUERIES_FILE environment variable not set. Please set it to the path of the queries file."))?;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let parquet_file = env::var("PARQUET_FILE")
    .map_err(|_| anyhow::anyhow!("PARQUET_FILE environment variable not set. Please set it to the path of the hits.parquet file."))?;
register_hits(&ctx, &parquet_file).await?;
let mut query_list = Vec::new();
let queries_file = env::var("QUERIES_FILE")
    .map_err(|_| anyhow::anyhow!("QUERIES_FILE environment variable not set. Please set it to the path of the queries file."))?;
let queries = fs::read_to_string(queries_file)?;
for query in queries.lines() {
    query_list.push(query.to_string());
}
```
src/handlers/http/clickbench.rs
Outdated
```rust
pub async fn execute_queries(
    ctx: &SessionContext,
    query_list: Vec<String>,
) -> Result<Json<Value>, anyhow::Error> {
    const TRIES: usize = 3;
    let mut results = Vec::new();

    for sql in query_list.iter() {
        let mut elapsed_times = Vec::new();
        for _iteration in 1..=TRIES {
            let start = Instant::now();
            let task_ctx = ctx.task_ctx();
            let dialect = &task_ctx.session_config().options().sql_parser.dialect;
            let dialect = dialect_from_str(dialect).ok_or_else(|| {
                plan_datafusion_err!(
                    "Unsupported SQL dialect: {dialect}. Available dialects: \
                     Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, \
                     MsSQL, ClickHouse, BigQuery, Ansi."
                )
            })?;

            let statements = DFParser::parse_sql_with_dialect(sql, dialect.as_ref())?;
            let statement = statements.front().unwrap();
            let plan = ctx.state().statement_to_plan(statement.clone()).await?;

            let df = ctx.execute_logical_plan(plan).await?;
            let physical_plan = df.create_physical_plan().await?;

            let _ = collect(physical_plan, task_ctx.clone()).await?;
            let elapsed = start.elapsed().as_secs_f64();
            elapsed_times.push(elapsed);
        }
        results.push(elapsed_times);
    }
```
🛠️ Refactor suggestion
Improve benchmark results structure and error handling.
The current implementation has several issues:
- The results don't include query identifiers, making it hard to interpret which elapsed times correspond to which query
- There's no error handling if statements.front() returns None
- Pre-allocating vectors would improve performance slightly
pub async fn execute_queries(
ctx: &SessionContext,
query_list: Vec<String>,
) -> Result<Json<Value>, anyhow::Error> {
const TRIES: usize = 3;
- let mut results = Vec::new();
+ let mut results = Vec::with_capacity(query_list.len());
- for sql in query_list.iter() {
- let mut elapsed_times = Vec::new();
+ for (query_index, sql) in query_list.iter().enumerate() {
+ let mut elapsed_times = Vec::with_capacity(TRIES);
for _iteration in 1..=TRIES {
let start = Instant::now();
let task_ctx = ctx.task_ctx();
let dialect = &task_ctx.session_config().options().sql_parser.dialect;
let dialect = dialect_from_str(dialect).ok_or_else(|| {
plan_datafusion_err!(
"Unsupported SQL dialect: {dialect}. Available dialects: \
Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, \
MsSQL, ClickHouse, BigQuery, Ansi."
)
})?;
let statements = DFParser::parse_sql_with_dialect(sql, dialect.as_ref())?;
- let statement = statements.front().unwrap();
+ let statement = statements.front()
+ .ok_or_else(|| anyhow::anyhow!("No SQL statement found in query: {}", sql))?;
let plan = ctx.state().statement_to_plan(statement.clone()).await?;
let df = ctx.execute_logical_plan(plan).await?;
let physical_plan = df.create_physical_plan().await?;
let _ = collect(physical_plan, task_ctx.clone()).await?;
let elapsed = start.elapsed().as_secs_f64();
elapsed_times.push(elapsed);
}
- results.push(elapsed_times);
+ results.push(json!({
+ "query_index": query_index,
+ "query": sql,
+ "elapsed_times": elapsed_times
+ }));
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub async fn execute_queries(
    ctx: &SessionContext,
    query_list: Vec<String>,
) -> Result<Json<Value>, anyhow::Error> {
    const TRIES: usize = 3;
    let mut results = Vec::with_capacity(query_list.len());
    for (query_index, sql) in query_list.iter().enumerate() {
        let mut elapsed_times = Vec::with_capacity(TRIES);
        for _iteration in 1..=TRIES {
            let start = Instant::now();
            let task_ctx = ctx.task_ctx();
            let dialect = &task_ctx.session_config().options().sql_parser.dialect;
            let dialect = dialect_from_str(dialect).ok_or_else(|| {
                plan_datafusion_err!(
                    "Unsupported SQL dialect: {dialect}. Available dialects: \
                     Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, \
                     MsSQL, ClickHouse, BigQuery, Ansi."
                )
            })?;
            let statements = DFParser::parse_sql_with_dialect(sql, dialect.as_ref())?;
            let statement = statements
                .front()
                .ok_or_else(|| anyhow::anyhow!("No SQL statement found in query: {}", sql))?;
            let plan = ctx.state().statement_to_plan(statement.clone()).await?;
            let df = ctx.execute_logical_plan(plan).await?;
            let physical_plan = df.create_physical_plan().await?;
            let _ = collect(physical_plan, task_ctx.clone()).await?;
            let elapsed = start.elapsed().as_secs_f64();
            elapsed_times.push(elapsed);
        }
        results.push(json!({
            "query_index": query_index,
            "query": sql,
            "elapsed_times": elapsed_times
        }));
    }
    // Assuming there is further handling of `results` before returning
}
```
Use API `/api/v1/benchmark/clickbench` to perform a benchmark on the ClickBench dataset and defined queries.
- Add env `PARQUET_FILE` to provide the file path of hits.parquet.
- Add env `QUERIES_FILE` to provide the file path of the queries file.
- 3 tries for each query, 43 queries total in the set.
- API responds with query number, iteration number, and response time in ms.
Force-pushed from a5dc222 to d694a70
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/handlers/http/clickbench.rs (2)
111-115
: Consider making benchmark parameters configurable.

The number of tries (3) is hardcoded. While this works, it might be better to make this configurable via environment variables or configuration parameters.

```diff
+static DEFAULT_TRIES: usize = 3;
+
 pub async fn execute_queries(
     ctx: &SessionContext,
     query_list: Vec<String>,
 ) -> Result<Json<Value>, anyhow::Error> {
-    const TRIES: usize = 3;
+    let tries = env::var("BENCHMARK_TRIES")
+        .ok()
+        .and_then(|v| v.parse::<usize>().ok())
+        .unwrap_or(DEFAULT_TRIES);
-    let mut results = Vec::with_capacity(query_list.len());
+    let mut results = Vec::with_capacity(query_list.len());
-    let mut total_elapsed_per_iteration = [0.0; TRIES];
+    let mut total_elapsed_per_iteration = vec![0.0; tries];
```
59-95
: Consider adding memory limits for benchmarking consistency.

The benchmark doesn't specify memory limits explicitly, which could lead to inconsistent results across different environments. Consider setting a memory limit for more consistent benchmarking, for example on the runtime environment builder (SessionConfig itself has no memory-limit setter):

```diff
 let mut session_config = SessionConfig::from_env()?.with_information_schema(true);
 session_config = session_config.with_batch_size(8192);
 let rt_builder = RuntimeEnvBuilder::new();
+// Set a memory limit for consistent benchmarking (4 GB, whole pool usable)
+let rt_builder = rt_builder.with_memory_limit(4 * 1024 * 1024 * 1024, 1.0);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/handlers/http/clickbench.rs (1 hunks)
- src/handlers/http/mod.rs (1 hunks)
- src/handlers/http/modal/server.rs (3 hunks)
- src/parseable/streams.rs (1 hunks)
- src/rbac/role.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/handlers/http/mod.rs
- src/rbac/role.rs
- src/handlers/http/modal/server.rs
- src/parseable/streams.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (3)
src/handlers/http/clickbench.rs (3)
83-90
: LGTM: Improved environment variable error handling.

Good job implementing detailed error messages for environment variables. This matches the recommended approach from previous reviews.
97-105
: LGTM: Good error handling in register_hits function.

The function properly wraps DataFusion errors with context about what operation was being performed, which will make debugging issues easier.
130-134
: LGTM: Proper error handling for SQL parsing.

Good implementation of error handling when parsing SQL statements, particularly for the case when no SQL statement is found.
src/handlers/http/clickbench.rs
Outdated
```rust
for (iteration, total_elapsed) in total_elapsed_per_iteration.iter().enumerate() {
    warn!(
        "Total time for iteration {}: {} seconds",
        iteration + 1,
        total_elapsed
    );
}
```
🛠️ Refactor suggestion
Use appropriate log levels for summary information.
Similar to the previous comment, use the `info!` level for logging summary information instead of `warn!`.
- warn!(
+ info!(
"Total time for iteration {}: {} seconds",
iteration + 1,
total_elapsed
);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
for (iteration, total_elapsed) in total_elapsed_per_iteration.iter().enumerate() {
    info!(
        "Total time for iteration {}: {} seconds",
        iteration + 1,
        total_elapsed
    );
}
```
src/handlers/http/clickbench.rs
Outdated
```rust
pub async fn drop_system_caches() -> Result<(), anyhow::Error> {
    // Sync to flush file system buffers
    Command::new("sync")
        .status()
        .expect("Failed to execute sync command");
    let _ = Command::new("sudo")
        .args(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"])
        .output()
        .map_err(|e| anyhow::Error::msg(e.to_string()))?;

    Ok(())
}
```
Platform dependency issue in system cache clearing.
The `drop_system_caches` function has several issues:
- It uses Linux-specific commands (`/proc/sys/vm/drop_caches`) without checking if the platform is Linux
- It requires sudo privileges, which may not be available in all environments
- It uses `expect()` on the sync command, which will panic if the command fails
Consider refactoring to make this cross-platform compatible or at least gracefully handle cases where these operations aren't supported:
pub async fn drop_system_caches() -> Result<(), anyhow::Error> {
// Sync to flush file system buffers
- Command::new("sync")
- .status()
- .expect("Failed to execute sync command");
+ match Command::new("sync").status() {
+ Ok(_) => {}
+ Err(e) => warn!("Failed to execute sync command: {}", e),
+ }
let _ = Command::new("sudo")
.args(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"])
.output()
- .map_err(|e| anyhow::Error::msg(e.to_string()))?;
+ .map_err(|e| {
+ warn!("Failed to drop system caches: {}", e);
+ anyhow::Error::msg("Failed to drop system caches. This might be expected if not running on Linux or without sudo privileges.")
+ })?;
Ok(())
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub async fn drop_system_caches() -> Result<(), anyhow::Error> {
    // Sync to flush file system buffers
    match Command::new("sync").status() {
        Ok(_) => {}
        Err(e) => warn!("Failed to execute sync command: {}", e),
    }
    let _ = Command::new("sudo")
        .args(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"])
        .output()
        .map_err(|e| {
            warn!("Failed to drop system caches: {}", e);
            anyhow::Error::msg("Failed to drop system caches. This might be expected if not running on Linux or without sudo privileges.")
        })?;
    Ok(())
}
```
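An alternative, purely illustrative sketch that gates the cache drop behind a Linux-only compile-time check (function bodies and log messages here are assumptions, not the PR's code):

```rust
use std::process::Command;
use tracing::warn;

// Hypothetical cross-platform variant: only attempt the cache drop on Linux,
// and treat failures as non-fatal warnings instead of hard errors.
#[cfg(target_os = "linux")]
pub async fn drop_system_caches() -> Result<(), anyhow::Error> {
    if let Err(e) = Command::new("sync").status() {
        warn!("Failed to execute sync command: {e}");
    }
    if let Err(e) = Command::new("sudo")
        .args(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"])
        .output()
    {
        warn!("Failed to drop system caches (sudo may be unavailable): {e}");
    }
    Ok(())
}

#[cfg(not(target_os = "linux"))]
pub async fn drop_system_caches() -> Result<(), anyhow::Error> {
    warn!("Skipping system cache drop: only supported on Linux");
    Ok(())
}
```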
src/handlers/http/clickbench.rs
Outdated
```rust
for (query_index, sql) in query_list.iter().enumerate() {
    let mut elapsed_times = Vec::with_capacity(TRIES);
    for iteration in 1..=TRIES {
        let start = Instant::now();
        let task_ctx = ctx.task_ctx();
        let dialect = &task_ctx.session_config().options().sql_parser.dialect;
        let dialect = dialect_from_str(dialect).ok_or_else(|| {
            plan_datafusion_err!(
                "Unsupported SQL dialect: {dialect}. Available dialects: \
                 Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, \
                 MsSQL, ClickHouse, BigQuery, Ansi."
            )
        })?;

        let statements = DFParser::parse_sql_with_dialect(sql, dialect.as_ref())?;
        let statement = statements
            .front()
            .ok_or_else(|| anyhow::anyhow!("No SQL statement found in query: {}", sql))?;
        let plan = ctx.state().statement_to_plan(statement.clone()).await?;

        let df = ctx.execute_logical_plan(plan).await?;
        let physical_plan = df.create_physical_plan().await?;

        let _ = collect(physical_plan, task_ctx.clone()).await?;
        let elapsed = start.elapsed().as_secs_f64();
        total_elapsed_per_iteration[iteration - 1] += elapsed;

        warn!("query {query_count} iteration {iteration} completed in {elapsed} secs");
        elapsed_times.push(elapsed);
    }
    query_count += 1;
    results.push(json!({
        "query_index": query_index,
        "query": sql,
        "elapsed_times": elapsed_times
    }));
}
```
🛠️ Refactor suggestion
Use appropriate log levels instead of warn for normal operations.
Using `warn!` for routine benchmark information isn't appropriate, as warnings should be reserved for actual warnings. Consider using `info!` or `debug!` instead.
- warn!("query {query_count} iteration {iteration} completed in {elapsed} secs");
+ info!("query {query_count} iteration {iteration} completed in {elapsed} secs");
Additionally, the outer loop structure and result collection look good with proper capacity pre-allocation and comprehensive result structure.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
for (query_index, sql) in query_list.iter().enumerate() {
    let mut elapsed_times = Vec::with_capacity(TRIES);
    for iteration in 1..=TRIES {
        let start = Instant::now();
        let task_ctx = ctx.task_ctx();
        let dialect = &task_ctx.session_config().options().sql_parser.dialect;
        let dialect = dialect_from_str(dialect).ok_or_else(|| {
            plan_datafusion_err!(
                "Unsupported SQL dialect: {dialect}. Available dialects: \
                 Generic, MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, \
                 MsSQL, ClickHouse, BigQuery, Ansi."
            )
        })?;
        let statements = DFParser::parse_sql_with_dialect(sql, dialect.as_ref())?;
        let statement = statements
            .front()
            .ok_or_else(|| anyhow::anyhow!("No SQL statement found in query: {}", sql))?;
        let plan = ctx.state().statement_to_plan(statement.clone()).await?;
        let df = ctx.execute_logical_plan(plan).await?;
        let physical_plan = df.create_physical_plan().await?;
        let _ = collect(physical_plan, task_ctx.clone()).await?;
        let elapsed = start.elapsed().as_secs_f64();
        total_elapsed_per_iteration[iteration - 1] += elapsed;
        info!("query {query_count} iteration {iteration} completed in {elapsed} secs");
        elapsed_times.push(elapsed);
    }
    query_count += 1;
    results.push(json!({
        "query_index": query_index,
        "query": sql,
        "elapsed_times": elapsed_times
    }));
}
```
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/handlers/http/clickbench.rs (2)
62-62
: Avoid multiple `tokio::main` declarations in library code.

Placing `#[tokio::main]` on this function makes it behave like an application entrypoint, which can lead to conflicts when the broader codebase includes other asynchronous main functions or test harnesses. Consider refactoring to allow a single main entrypoint, particularly if you plan to call `run_benchmark` from elsewhere.
91-96
: Use asynchronous file IO to avoid blocking within an async context.

Reading large files synchronously in an async function can block the thread. Refactor to use `tokio::fs` to avoid blocking calls:

```diff
- let queries = fs::read_to_string(queries_file)?;
- for query in queries.lines() {
-     query_list.push(query.to_string());
- }
+ let queries = tokio::fs::read_to_string(queries_file).await?;
+ for line in queries.lines() {
+     query_list.push(line.to_string());
+ }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/handlers/http/clickbench.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (1)
src/handlers/http/clickbench.rs (1)
45-60
: Consider making drop_system_caches cross-platform or optional.

This logic depends on Linux-specific paths and privileges. A previous review already flagged this concern. The function should either gracefully detect non-Linux environments or skip the cache-drop step when sudo privileges aren't available, rather than unconditionally attempting the command and returning an error.
src/handlers/http/clickbench.rs
Outdated
```rust
results.push(json!({
    "query_index": query_index,
    "query": sql,
    "elapsed_times": {
        "iteration": iteration + 1,
        "elapsed_time": elapsed_times
    }
}));
```
Fix iteration numbering and avoid pushing growing arrays for each iteration.
Right now, each iteration adds a JSON object containing the entire `elapsed_times`, and uses `iteration + 1` in one place but logs `iteration` in another. This inconsistency can confuse observers. Consider storing a single "elapsed_time" per iteration and logging the iteration consistently:
results.push(json!({
"query_index": query_index,
"query": sql,
- "elapsed_times": {
- "iteration": iteration + 1,
- "elapsed_time": elapsed_times
- }
+ "iteration": iteration,
+ "elapsed_time": elapsed
}));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
results.push(json!({
    "query_index": query_index,
    "query": sql,
    "iteration": iteration,
    "elapsed_time": elapsed
}));
```
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/handlers/http/clickbench.rs (4)
34-43
: Consider simplifying the async invocation rather than using `spawn_blocking`.

Because `run_benchmark` is itself an async function, using `tokio::task::spawn_blocking(run_benchmark)` introduces an extra layer you may not need if the tasks within `run_benchmark` are non-blocking. If some tasks are intensive, you can continue using `spawn_blocking`, but ensure the approach is intentional.
62-98
: Remove or use the unused `table_options` map.

You define `table_options` but do not reference it when registering the Parquet file or configuring the session. Either remove it to reduce noise or incorporate it if you intend to apply specialized read options.

```diff
 let mut table_options = HashMap::new();
 table_options.insert("binary_as_string", "true");
- // Currently unused
```
110-116
: Increase the capacity of the `results` vector to accommodate all iterations.

You allocate `Vec::with_capacity(query_list.len())` but add 3 entries per query (one for each iteration). To avoid reallocations, adjust the capacity to `query_list.len() * TRIES`.

```diff
- let mut results = Vec::with_capacity(query_list.len());
+ let mut results = Vec::with_capacity(query_list.len() * TRIES);
```
118-119
: Remove or utilize `elapsed_times`.

You populate `elapsed_times` but never use it after pushing values, since you push each iteration's results directly into `results`. Consider either removing `elapsed_times` or structuring the final results to present each query's times in a nested array.

```diff
- let mut elapsed_times = Vec::with_capacity(TRIES);
  for iteration in 1..=TRIES {
      ...
-     elapsed_times.push(elapsed);
  }
- // Not used anymore
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/handlers/http/clickbench.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (1)
src/handlers/http/clickbench.rs (1)
45-60
: Platform dependency note for cache clearing.

This function is tightly coupled to Linux (writing to `/proc/sys/vm/drop_caches`) and also requires `sudo` privileges. If you intended multi-platform support, consider gracefully handling non-Linux environments or skipping cache dropping in those cases.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/handlers/http/clickbench.rs (2)
89-91
: Consider ignoring or trimming blank lines in the queries file.

If the queries file contains comments or empty lines, attempting to parse them will cause an error. Below is a quick fix to skip blank lines:

```diff
  for query in queries.lines() {
-     query_list.push(query.to_string());
+     let trimmed = query.trim();
+     if !trimmed.is_empty() {
+         query_list.push(trimmed.to_string());
+     }
  }
```
143-143
: Consider reducing log verbosity for query iteration.

Repeated `info!` logs can quickly clutter the console during benchmarking. If these logs are mainly for debugging, use `debug!` instead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/handlers/http/clickbench.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (2)
src/handlers/http/clickbench.rs (2)
44-59
: Platform dependency issue in system cache clearing.

This code is effectively Linux-specific and requires `sudo` privileges. Please note that this was flagged in a previous discussion.
92-95
: Verify concurrency approach when offloading query execution.

Calling the asynchronous `execute_queries` function directly inside `tokio::task::spawn_blocking` can be problematic. Typically, `spawn_blocking` is meant for synchronous or CPU-bound work, and it may cause confusion in an already async context; a sketch of an alternative follows below.
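If offloading to a separate task is still desired, a minimal sketch using `tokio::spawn` instead of `spawn_blocking` (assuming `run_benchmark` has become a plain async function as suggested earlier; the wrapper name is an assumption):

```rust
use actix_web::web::Json;
use serde_json::Value;

// Hypothetical wrapper: keeps the async benchmark on the existing tokio
// runtime; spawn_blocking remains appropriate only for synchronous,
// CPU-bound sections.
async fn run_benchmark_task() -> Result<Json<Value>, anyhow::Error> {
    tokio::spawn(run_benchmark())
        .await
        .map_err(|e| anyhow::anyhow!("benchmark task panicked or was cancelled: {e}"))?
}
```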
```rust
#[tokio::main(flavor = "multi_thread")]
pub async fn execute_queries(
```
🛠️ Refactor suggestion
Remove or revise the `#[tokio::main]` attribute for this library function.

Using `#[tokio::main]` in a function called by Actix (which already runs under a Tokio runtime) can lead to conflicts. If you need a runtime in this context, consider reusing the Actix runtime or using `tokio::spawn` instead:
-#[tokio::main(flavor = "multi_thread")]
pub async fn execute_queries(
ctx: &SessionContext,
query_list: Vec<String>,
) -> Result<Json<Value>, anyhow::Error> {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub async fn execute_queries(
    ctx: &SessionContext,
    query_list: Vec<String>,
) -> Result<Json<Value>, anyhow::Error> {
```
Use API `/api/v1/benchmark/clickbench` to perform a benchmark on the ClickBench dataset and defined queries.

- Add env `PARQUET_FILE` to provide the file path of hits.parquet.
- Add env `QUERIES_FILE` to provide the file path of the queries file.
- 3 tries for each query, 43 queries total in the set.
- API responds with query number, iteration number, and response time in ms.
Summary by CodeRabbit
New Features
- New benchmark endpoint at `/benchmark/clickbench`.

Bug Fixes