Skip decoding of pages marked as pruned in PQ reader #18347

Draft
wants to merge 22 commits into
base: branch-25.06

mhaseeb123
Member

@mhaseeb123 mhaseeb123 commented Mar 21, 2025

Description

This PR implements the necessary mechanism to skip decoding of data pages marked as pruned in the Parquet reader.

Closes #18316
Part of #17896
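
As a rough illustration of the early-exit idea, here is a minimal host-side C++ sketch. It is not the code in this PR: `Page`, `decode_unpruned_pages`, and `page_mask` are hypothetical names, and the convention that a `false` mask entry means "pruned" is an assumption made for the example. The real decoders are CUDA kernels where each thread block handles one page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a data page; libcudf's real page metadata is far richer.
struct Page {
  std::vector<std::int32_t> encoded;  // pretend-encoded values
};

// Toy "decoder" that adds 1 to every value of every unpruned page.
// Assumptions: page_mask has one entry per page, and page_mask[i] == false
// means page i was pruned. Returns the number of pages actually decoded.
std::size_t decode_unpruned_pages(std::vector<Page> const& pages,
                                  std::vector<bool> const& page_mask,
                                  std::vector<std::vector<std::int32_t>>& out)
{
  out.assign(pages.size(), std::vector<std::int32_t>{});
  std::size_t num_decoded = 0;
  for (std::size_t page_idx = 0; page_idx < pages.size(); ++page_idx) {
    // Early exit for pruned pages, analogous to a thread block returning at
    // the top of a decode kernel before touching any page data.
    if (!page_mask[page_idx]) { continue; }
    for (auto const v : pages[page_idx].encoded) {
      out[page_idx].push_back(v + 1);
    }
    ++num_decoded;
  }
  return num_decoded;
}
```

With three pages and a mask of `{true, false, true}`, the sketch decodes the first and third pages and leaves the output slot of the second page empty. How skipped pages should be reflected in the output of nested (list) columns is not covered by this sketch.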

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Mar 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the libcudf (affects libcudf C++/CUDA code) and CMake (CMake build issue) labels on Mar 21, 2025
@mhaseeb123 added the feature request (new feature or request), 2 - In Progress (currently a work in progress), cuIO (cuIO issue), and breaking (breaking change) labels on Mar 21, 2025
Member Author

mhaseeb123 commented Mar 21, 2025

Here are the four Parquet test files used in this PR.
testfiles1_2.zip: dict/plain encoded: int, str, list<str>, list<list<str>>, list<list<list<int>>>
testfiles3_4.zip: dict/plain encoded: int, struct, float64; byte stream split encoded: FLBA (fixed-length byte array), i.e. pa.binary(32)
testfiles5_6.zip: byte stream split, delta length byte array, delta byte array, and binary bit packed encodings

@@ -641,6 +666,20 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)

bool const has_repetition = s->col.max_level[level_type::REPETITION] > 0;

// Exit early if the page is invalid
// MH: What to do if `has_repetition` is true?
Member Author

Hi @nvdbaranec,
Do you think we need any special handling for lists (`has_repetition == true`) in this kernel? I only see it being used to set the dst address. Same question for a couple of other kernels.

@@ -236,6 +249,16 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
int page_idx = blockIdx.x;
int t = threadIdx.x;
int out_thread0;

// Exit early if the page is invalid
// MH: How to handle all types in this decoder? Also what to do if `has_repetition` is true?
Member Author

A few questions about this one:

  1. Same question as below about lists.
  2. Do we need any special handling for any other types as well?
  3. When is this generic decoder called?

@@ -473,6 +483,20 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)

bool const has_repetition = s->col.max_level[level_type::REPETITION] > 0;

// Exit early if the page is invalid
// MH: What to do if `has_repetition` is true?
Member Author

Same question about lists here
