Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161

Open
Tracked by #15072
scsmithr opened this issue Mar 11, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@scsmithr
Contributor

scsmithr commented Mar 11, 2025

Describe the bug

A comparison like column1 < '10' (where column1 is an int64) will cast column1 to utf8 instead of casting the utf8 constant to an integer.

Typically, string constants in a SQL query are treated as having an unknown type, and the preference should be to cast the "unknown" value to the target type.

To Reproduce

DataFusion CLI v46.0.0
> create table t1 as (values (1), (2), (3));
0 row(s) fetched. 
Elapsed 0.026 seconds.

> select * from t1 where column1 < '10';
+---------+
| column1 |
+---------+
| 1       |
+---------+
1 row(s) fetched. 
Elapsed 0.015 seconds.

> select * from t1 where column1 < 'hello';
+---------+
| column1 |
+---------+
| 1       |
| 2       |
| 3       |
+---------+
3 row(s) fetched. 
Elapsed 0.007 seconds.

> select arrow_typeof(column1) from t1 limit 1;
+--------------------------+
| arrow_typeof(t1.column1) |
+--------------------------+
| Int64                    |
+--------------------------+
1 row(s) fetched. 
Elapsed 0.010 seconds.

> explain select * from t1 where column1 < '10';
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: CAST(t1.column1 AS Utf8) < Utf8("10")         |
|               |   TableScan: t1 projection=[column1]                  |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: CAST(column1@0 AS Utf8) < 10            |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
|               |                                                       |
+---------------+-------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.009 seconds.
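
The results above are consistent with the plan: once column1 is cast to Utf8, the filter becomes a byte-wise string comparison. A minimal Rust sketch of that effect, using only the standard library (the helper name `filter_as_strings` is made up for illustration and is not DataFusion code):

```rust
/// Hypothetical helper: mimics `CAST(column1 AS Utf8) < constant`
/// by comparing the decimal text of each value with the constant.
fn filter_as_strings(values: &[i64], constant: &str) -> Vec<i64> {
    values
        .iter()
        .copied()
        // `str` ordering in Rust is lexicographic by bytes.
        .filter(|v| v.to_string().as_str() < constant)
        .collect()
}
```

With the values 1, 2, 3, this keeps only 1 for the constant "10" ("2" and "3" sort after "10"), and keeps all three for "hello" (digits sort before lowercase letters), matching the two query results.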

Expected behavior

column1 should not be cast to a string.

Postgres output:

postgres=# create table t1 as (values (1), (2), (3));
SELECT 3
postgres=# select * from t1 where column1 < '10';
 column1 
---------
       1
       2
       3
(3 rows)

postgres=# select * from t1 where column1 < 'hello';
ERROR:  invalid input syntax for type integer: "hello"
LINE 1: select * from t1 where column1 < 'hello';
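
For contrast, the Postgres behavior shown here can be sketched as casting the constant to the column's type and failing when the cast is impossible. This is only an illustration with the standard library; `filter_as_integers` is a hypothetical name, and Rust's integer parsing is a stand-in for the database's cast rules, not an exact match for them:

```rust
use std::num::ParseIntError;

/// Hypothetical helper: mimics `column1 < CAST(constant AS Int64)`.
/// An unparseable constant is reported as an error instead of
/// silently falling back to a string comparison.
fn filter_as_integers(values: &[i64], constant: &str) -> Result<Vec<i64>, ParseIntError> {
    let limit: i64 = constant.parse()?;
    Ok(values.iter().copied().filter(|v| *v < limit).collect())
}
```

Here the constant "10" keeps 1, 2 and 3, while "hello" produces an error, which lines up with the two Postgres statements.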

Additional context

No response

scsmithr added the bug label on Mar 11, 2025
@alamb
Contributor

alamb commented Mar 11, 2025

👋 @scsmithr

FYI @jonahgao

I vaguely remember something related to this automatic coercion in the past 🤔

BTW @alan910127 and I are trying to fix the equality case in

@jonahgao
Member

I checked and found that Spark has the same behavior as PostgreSQL. Maybe we should use a different coercion rule for this since the current comparison_coercion is also used by union.

@alamb
Contributor

alamb commented Mar 13, 2025

> I checked and found that Spark has the same behavior as PostgreSQL. Maybe we should use a different coercion rule for this since the current comparison_coercion is also used by union.

I agree we should split the coercion rules for union and binary comparison
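
One way to picture the split is with two separate rules, one for union and one for binary comparison. The sketch below is a toy model, not DataFusion's API: `SketchType`, `Operand`, `union_coercion_sketch` and `comparison_coercion_sketch` are all hypothetical names, and it assumes (without confirmation from this thread) that union should keep widening mixed Int64/Utf8 inputs to Utf8.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchType {
    Int64,
    Utf8,
}

/// An operand of a binary comparison: a column with a known type,
/// or a string literal whose type is treated as unknown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operand {
    Column(SketchType),
    StringLiteral,
}

/// Assumed union rule: mixed inputs widen to Utf8.
fn union_coercion_sketch(lhs: SketchType, rhs: SketchType) -> SketchType {
    if lhs == rhs { lhs } else { SketchType::Utf8 }
}

/// Assumed comparison rule: a string literal adopts the type of the
/// column it is compared with, so the column itself is never cast.
fn comparison_coercion_sketch(lhs: Operand, rhs: Operand) -> SketchType {
    match (lhs, rhs) {
        (Operand::Column(t), Operand::StringLiteral)
        | (Operand::StringLiteral, Operand::Column(t)) => t,
        // Two typed columns: reuse the union rule here (an assumption).
        (Operand::Column(a), Operand::Column(b)) => union_coercion_sketch(a, b),
        (Operand::StringLiteral, Operand::StringLiteral) => SketchType::Utf8,
    }
}
```

Under these assumptions, comparing an Int64 column with a string literal resolves to Int64, while a union of Int64 and Utf8 inputs still resolves to Utf8.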

@alamb
Contributor

alamb commented Mar 13, 2025

I also put this issue on my wishlist for
