
Querying Parquet file specifically with a predicate returns invalid data error but works in other situations #14281

Open
Tracked by #14375
senyosimpson opened this issue Jan 24, 2025 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments


senyosimpson commented Jan 24, 2025

Describe the bug

When making a query with a predicate against Parquet files generated with parquet-go, DataFusion errors, saying the data is invalid. Without a predicate, however, the same query works fine.

When using the CLI, I get the error:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' where age > 10"
DataFusion CLI v44.0.0
Error: External error: Parquet error: External: bad data

In my application, it is more descriptive, showing:

ParquetError(External(ProtocolError { kind: InvalidData, message: "cannot convert 2 into TType" }))

However, the file itself appears to be intact; the metadata is successfully read and interpreted:

» datafusion-cli --command "describe 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+---------------+-------------------------------------+-------------+
| column_name   | data_type                           | is_nullable |
+---------------+-------------------------------------+-------------+
| city          | Utf8View                            | NO          |
| country       | Utf8View                            | NO          |
| age           | UInt8                               | NO          |
| scale         | Int16                               | NO          |
| status        | UInt32                              | NO          |
| time_captured | Timestamp(Millisecond, Some("UTC")) | NO          |
| checked       | Boolean                             | NO          |
+---------------+-------------------------------------+-------------+
7 row(s) fetched.
Elapsed 0.001 seconds.

When I run the query without a predicate, I get back the data:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
2 row(s) fetched.
Elapsed 0.002 seconds.

It even works with ORDER BY and GROUP BY:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' ORDER BY age DESC"
DataFusion CLI v44.0.0
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
| Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
+--------+---------+-----+-------+--------+--------------------------+---------+
2 row(s) fetched.
Elapsed 0.010 seconds.

» datafusion-cli --command "select city, SUM(age) AS age from 'go-parquet-writer/go-testfile.parquet' GROUP BY city"
DataFusion CLI v44.0.0
+--------+-----+
| city   | age |
+--------+-----+
| Athens | 32  |
| Madrid | 10  |
+--------+-----+
2 row(s) fetched.
Elapsed 0.004 seconds.

Additionally, this works when I use PyArrow and Pandas to load the Parquet file and filter it.

To Reproduce

The issue can be reproduced by creating a Parquet file with the parquet-go library and then querying it with a predicate. To make this easier, I created a public repo containing code to generate the file, with examples in the README similar to those shown in this report. A test file can be found in go-parquet-writer/go-testfile.parquet, generated by the Go program in that directory.

I've also gone through the effort of doing the same with PyArrow and Pandas (in the repo under pyarrow-ex) to verify that the Parquet file is not corrupted in some way. That works as expected.

Expected behavior

The Parquet files created by parquet-go can successfully be queried when the query contains a predicate.

Additional context

From everything I've gathered, this error is likely coming from this conversion function. However, 0x02 is only skipped when a collection is being parsed, and strangely, I don't have any list/map/set in my schema. I assume this means 0x02 is being used to encode something else, but that is beyond my knowledge.

I went spelunking in the parquet-go codebase. Its Thrift protocol implementation is split among the compact protocol, the Thrift type definitions, and the encoding logic.

@senyosimpson senyosimpson added the bug Something isn't working label Jan 24, 2025
@senyosimpson (Author)

I've heard from another user that they managed to work around this by switching off the page index when generating the files with parquet-go. However, when I tried this, I still ran into the problem.


matthewmturner commented Jan 25, 2025

Have you tried using the parquet_metadata function on the file to see if it returns anything? For example:

datafusion-cli --command "select * from parquet_metadata('go-parquet-writer/go-testfile.parquet')"


suremarc commented Jan 25, 2025

I've heard from another user that they managed to work around this by switching off the page index when generating the files by parquet-go. However, when I tried this, I still ran into this problem

Hey, I'm the one who mentioned this on Discord 😄 What I meant is that we disabled `datafusion.execution.parquet.enable_page_index` at query time, not that we skipped generating the page index when writing the Parquet file.

In your repro repository, I was able to confirm that disabling the page index makes it work:

❯ datafusion-cli
DataFusion CLI v43.0.0
> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
External error: Parquet error: External: bad data

> SET datafusion.execution.parquet.enable_page_index = false;
0 row(s) fetched. 
Elapsed 0.001 seconds.

> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
1 row(s) fetched. 
Elapsed 0.021 seconds.

Last time I looked at this issue, I had a feeling it was a problem with parquet-go's Thrift implementation, but I wasn't able to find evidence; to be honest, I also just don't remember since it's been quite some time. It's possible that PyArrow and Pandas only work because they aren't utilizing the page index for predicate pushdown (not sure whether they do or not).


senyosimpson commented Jan 25, 2025

Have you tried using the parquet_metadata function on the file to see if it returns anything? For example:

datafusion-cli --command "select * from parquet_metadata('go-parquet-writer/go-testfile.parquet')"

That seems to work fine:

» datafusion-cli --command "select path_in_schema, stats_min, stats_max from parquet_metadata('go-parquet-writer/go-testfile.parquet')"
DataFusion CLI v44.0.0
+-----------------+---------------+---------------+
| path_in_schema  | stats_min     | stats_max     |
+-----------------+---------------+---------------+
| "city"          | Athens        | Madrid        |
| "country"       | Greece        | Spain         |
| "age"           | 10            | 32            |
| "scale"         | -1            | 1             |
| "status"        | 12            | 20            |
| "time_captured" | 1737781788268 | 1737785388268 |
| "checked"       | false         | true          |
+-----------------+---------------+---------------+
7 row(s) fetched.
Elapsed 0.003 seconds.

Hey, I'm the one who mentioned this in discord 😄 what I meant is that we disabled datafusion.execution.parquet.enable_page_index at query time, not that we skipped generating the page index when generating the parquet file.

Ah right, that makes sense. Thanks for the help on getting that to work, that's super helpful!

It's possible that pyarrow and pandas only work because they aren't utilizing the page index for predicate pushdown (not sure if they do or not).

That seems to be the case (the documentation reports that they don't use the page index on the read side).

@senyosimpson (Author)

Confirmed that the following works now:

let mut parquet_options = TableParquetOptions::new();
parquet_options
    .set("enable_page_index", "false")
    .expect("could not set enable_page_index config option");

let exec = ParquetExec::builder(scan_config)
    .with_table_parquet_options(parquet_options)
    .with_predicate(predicate)
    .build_arc();

Weirdly enough, setting the variable on the SessionContext doesn't work? Is that the correct behaviour? I'm fairly sure I had it set correctly via:

let ctx = SessionContext::new_with_config(
    SessionConfig::new().set_bool("datafusion.execution.parquet.enable_page_index", false),
);


kosiew commented Jan 27, 2025

I used DuckDB to copy go-testfile.parquet to go-testfile3.parquet.

D create table data1 as select * from 'go-testfile.parquet';
D copy data1 to 'go-testfile3.parquet' (format 'parquet');

Then I used datafusion-cli to query go-testfile3.parquet.

DataFusion CLI v44.0.0
> select * from 'tests/data/go-testfile3.parquet' where age > 10;
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
1 row(s) fetched.
Elapsed 0.014 seconds.

> select * from 'tests/data/go-testfile.parquet' where age > 10;
External error: Parquet error: External: bad data

From these tests, it looks like the issue is related to the parquet-go format version.

Here's the metadata for both files:

go-testfile3.parquet -------------

created_by: DuckDB version v1.1.3 (build 19864453f7)
num_columns: 7
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 598

go-testfile.parquet -------------

created_by: github.com/parquet-go/parquet-go version 0.24.0(build )
num_columns: 7
num_rows: 2
num_row_groups: 1
format_version: 2.6
serialized_size: 721

@senyosimpson (Author)

Ah, interesting, thanks for that!

@senyosimpson (Author)

Interestingly enough, I can run a query on this file just fine using DuckDB:

» duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from 'go-testfile.parquet' where age > 10;
┌─────────┬─────────┬───────┬───────┬────────┬────────────────────────────┬─────────┐
│  city   │ country │  age  │ scale │ status │       time_captured        │ checked │
│ varchar │ varchar │ uint8 │ int16 │ uint32 │  timestamp with time zone  │ boolean │
├─────────┼─────────┼───────┼───────┼────────┼────────────────────────────┼─────────┤
│ Athens  │ Greece  │    32 │     1 │     20 │ 2025-01-25 13:13:48.332+02 │ true    │
└─────────┴─────────┴───────┴───────┴────────┴────────────────────────────┴─────────┘

And it has the format version you'd expect:

D select file_name, created_by, format_version from parquet_file_metadata('go-testfile.parquet');
┌─────────────────────┬─────────────────────────────────────────────────────────┬────────────────┐
│      file_name      │                       created_by                        │ format_version │
│       varchar       │                         varchar                         │     int64      │
├─────────────────────┼─────────────────────────────────────────────────────────┼────────────────┤
│ go-testfile.parquet │ github.com/parquet-go/parquet-go version 0.24.0(build ) │              2 │
└─────────────────────┴─────────────────────────────────────────────────────────┴────────────────┘

However, if I use PyArrow and write out a file with format version 2.6, DataFusion works fine. So I'm still not sure.


kosiew commented Jan 31, 2025

Findings:

The error message `cannot convert 2 into TType`

        0x00 => Ok(TType::Stop),
        0x03 => Ok(TType::I08), // equivalent to TType::Byte
        0x04 => Ok(TType::I16),
        0x05 => Ok(TType::I32),
        0x06 => Ok(TType::I64),
        0x07 => Ok(TType::Double),
        0x08 => Ok(TType::String),
        0x09 => Ok(TType::List),
        0x0A => Ok(TType::Set),
        0x0B => Ok(TType::Map),
        0x0C => Ok(TType::Struct),
        unkn => Err(thrift::Error::Protocol(thrift::ProtocolError {
            kind: thrift::ProtocolErrorKind::InvalidData,
            message: format!("cannot convert {} into TType", unkn),
        })),

https://github.com/apache/arrow-rs/blob/3bf29a2c7474e59722d885cd11fafd0dca13a28e/parquet/src/thrift.rs#L284-L298

is raised when trying to load the Parquet file metadata:

let options = ArrowReaderOptions::new().with_page_index(enable_page_index);
let mut metadata_timer = file_metrics.metadata_load_time.timer();
let metadata =
    ArrowReaderMetadata::load_async(&mut reader, options.clone()).await?;

Since 0x02 is not supported in arrow-rs, would it be an acceptable solution to either:

  1. raise an error message suggesting the user disable page_index, or
  2. silently fall back to disabling page_index?
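As an aside, the mapping quoted above can be mirrored as a quick Python sketch (for illustration only; `u8_to_ttype` is a hypothetical helper, not arrow-rs code) to show why a bare 0x02 byte produces exactly this error:

```python
# Python mirror of the arrow-rs TType mapping quoted above.
# 0x01 and 0x02 (the compact protocol's boolean codes) are absent,
# so any bare 0x02 falls through to the error branch.
TTYPE = {
    0x00: "Stop",
    0x03: "I08",  # equivalent to Byte
    0x04: "I16",
    0x05: "I32",
    0x06: "I64",
    0x07: "Double",
    0x08: "String",
    0x09: "List",
    0x0A: "Set",
    0x0B: "Map",
    0x0C: "Struct",
}

def u8_to_ttype(byte: int) -> str:
    """Mirror of the Rust match: unknown bytes raise an InvalidData-style error."""
    try:
        return TTYPE[byte]
    except KeyError:
        raise ValueError(f"cannot convert {byte} into TType")
```

Calling `u8_to_ttype(2)` raises `ValueError: cannot convert 2 into TType`, matching the message in the original report.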

@senyosimpson (Author)

Since 0x02 is not supported in arrow-rs

What's interesting here is that this seems to be a helper function. 0x02 is handled when reading a field, for example: https://github.com/apache/arrow-rs/blob/3bf29a2c7474e59722d885cd11fafd0dca13a28e/parquet/src/thrift.rs#L131-L147.

But when there's a collection, it is skipped: https://github.com/apache/arrow-rs/blob/3bf29a2c7474e59722d885cd11fafd0dca13a28e/parquet/src/thrift.rs#L275-L280

But there are no collections in the example schema, which points to another place where that function is being used.
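For context, here is a sketch based on the Thrift compact protocol specification (not on either codebase; `decode_short_field_header` is a hypothetical helper). In compact-protocol field headers, a boolean's value is folded into the type nibble itself, with 1 meaning true and 2 meaning false, which is why 0x02 is a valid byte in a field header even though the generic TType table above rejects it:

```python
# Sketch of a Thrift compact-protocol short-form field header.
# High nibble: field-id delta (1..15); low nibble: type code.
# Booleans carry their value in the type code: 1 = true, 2 = false.
def decode_short_field_header(byte: int, last_field_id: int):
    delta = (byte & 0xF0) >> 4        # field-id delta (short form)
    type_code = byte & 0x0F
    field_id = last_field_id + delta
    if type_code in (0x01, 0x02):     # BOOLEAN_TRUE / BOOLEAN_FALSE
        return field_id, "Bool", type_code == 0x01
    return field_id, type_code, None  # value is encoded separately
```

For example, `decode_short_field_header(0x12, 0)` yields `(1, "Bool", False)`: field id 1, a boolean whose value (false) was carried by the 0x02 type code.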


5 participants