Parametrize ListArray inner field #15162
I'm not familiar with how Comet interops with Spark, but it looks like whatever component is wrapping the Spark execution is incorrectly exposing the schema of its outputs? Provided the components correctly advertise the schema of their outputs, I would expect DF's coercion machinery to handle the rest... Edit: Unless the requirement is for the output of the DF plan to match the schema of the equivalent Spark plan?
Thanks @tustvold, the requirement is to customize the inner field name for ListType, which is hardcoded. Comet uses the DataFusion physical plan expressions directly, so there is no coercion phase, and the schema from Apache Spark for the same data comes as

When a RecordBatch is created, it checks both the schema (easy to modify) and the column arrays' schema, and an error is thrown if the two schemas don't match; in this specific case they don't match on the inner list field name.
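A minimal sketch of that failure mode using plain arrow-rs APIs (the column name `c0` here is made up for illustration):

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{ArrayRef, ListArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // arrow-rs names the inner list field "item" by default.
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(
        vec![Some(vec![Some(1), Some(2)])],
    );

    // Declare the schema the way Spark would: inner field named "element".
    let spark_style = Arc::new(Schema::new(vec![Field::new(
        "c0",
        DataType::List(Arc::new(Field::new("element", DataType::Int32, true))),
        true,
    )]));

    // The two schemas differ only by the inner field name, yet try_new
    // rejects the batch with a "column types must match schema types" error.
    let result = RecordBatch::try_new(spark_style, vec![Arc::new(list) as ArrayRef]);
    assert!(result.is_err());
}
```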
If field A and field B differ only by name, we could check every part besides the name, so the check would pass.
```rust
use arrow_schema::*;

let field = Field::new("c1", DataType::Int64, false)
    .with_name("c2");
assert_eq!(field.name(), "c2");
```

I am not sure if Spark has a chance to call `with_name`?
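A minimal sketch of such a name-insensitive comparison, assuming a hypothetical helper (`equal_ignoring_list_names` is not an arrow-rs API):

```rust
use arrow_schema::DataType;

/// Hypothetical helper: structural equality that ignores list field names.
fn equal_ignoring_list_names(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(fa), DataType::List(fb)) => {
            fa.is_nullable() == fb.is_nullable()
                && equal_ignoring_list_names(fa.data_type(), fb.data_type())
        }
        // Other nested types (structs, maps, ...) would need similar
        // handling in a real implementation; fall back to strict equality.
        _ => a == b,
    }
}
```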
IMO this is the issue: the Spark code is not returning data with the same schema as is then used to construct the RecordBatch. Where does the schema provided to the RecordBatch constructor come from? IMO either this schema needs to be updated to match what Spark is actually returning, or the Spark code needs to be updated to return the expected schema (e.g. by coercing on output).
This sounds great in theory, although I've yet to see a coherent approach to how this would work. Ultimately the field name is part of the schema in much the same way as StringViewArray and StringArray are different types; one can layer a logical schema type on top, but at some point something actually needs to coerce the types.
Agree. Unlike DataFusion, arrow-rs is not very configurable, and to be honest I would love to see arrow-rs support external configs so it could be more flexible in common areas like the Parquet reader (the INT96 problem, apache/arrow-rs#7220) or to redefine some other behavior. But this is another topic. :)
Apache Spark expects the inner list field to be named `element`. If talking about the specific case, for now it comes as

In this particular code the column arrays' schema is created as
If you have a concrete proposal, feel free to raise an issue. FWIW most kernels do take various options to alter their behaviour, a non-trivial number of which specifically exist for Spark compatibility.
I think you misunderstand what I am saying: arrow can't be opinionated about what people are permitted to use as the list field name, as people can always do something different. Even if you made new_list_field default to using "element" rather than "item", it wouldn't resolve your problem, as kernels could still construct lists with different naming.

FWIW arrow-rs doesn't use new_list_field outside of tests for this reason, and in fact I actually objected to its addition for this reason - apache/arrow-rs#4544 (comment). Kernels that construct lists should make this attribute configurable. If DF has such kernels, then DF should make this behaviour configurable. (Edit: or decide that use-cases that require a particular schema should coerce on output)
Good point, yes.
Yeah, this comes up as the first option in the list above.
Thank you for bringing this up @comphead -- I think we have struggled with this issue for a while downstream in DataFusion.

I think the core fix of this issue is not constructing

As @tustvold points out, the field name is arbitrary and not consistent across arrow implementations. Plumbing some way to change it around might work, but we'll be forever trying to find all the corner cases.

Thus in my opinion, rather than try and control the name of the field, a better approach is to change places where

For example, the specific error that @comphead posted in this issue is
It seems like that error actually comes from RecordBatch construction within arrow-rs. Perhaps we can relax this check / update RecordBatch::new() to align incoming
Thanks @alamb, this option is brought up as option 3 in the option list above.
I'm currently looking for any scenarios to verify whether the inner field name check is reasonable.
The major ones are interop boundaries; arrow itself largely doesn't care, but other systems do (as evidenced by this issue). For example, writing a RecordBatch to parquet, sending it over the C data interface, etc... There are quite a lot of places that assume the RecordBatch schema / StructArray schema are the same. I'm pretty certain we can't relax this check without it causing subtle breakage.

IMO if Spark has specific schema requirements, I'm not sure I see a way to avoid coercing at the boundary; it will be an indefinite game of whack-a-mole otherwise (not just for lists).
Thanks @tustvold for the explanation.
I guess it is so that child arrays can be processed independently without needing to somehow propagate their schema down from the parent at every callsite. But yes, it is redundant, and this is why inconsistency causes problems at interop boundaries, as typically these just pick one to propagate.
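For illustration, a small sketch of that redundancy: each array carries its own DataType, including nested field names, in addition to whatever the enclosing RecordBatch's Schema says about the same column:

```rust
use arrow_array::types::Int32Type;
use arrow_array::{Array, ListArray};
use arrow_schema::DataType;

fn main() {
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(
        vec![Some(vec![Some(1)])],
    );

    // The inner field name travels with the array itself, independently of
    // any RecordBatch schema that might also describe this column.
    if let DataType::List(inner) = list.data_type() {
        assert_eq!(inner.name(), "item");
    }
}
```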
So the proposal as I understand it is to implement something like the following function that is called on all batches prior to returning to Spark:

```rust
/// Converts the schema of `batch` to one suitable for Spark's conventions
///
/// Note: only converts the schema, no data is copied
///
/// Transformations applied:
/// * The name of the fields in `DataType::List` are changed to "element"
/// * ...
fn coerce_schema_for_spark(batch: RecordBatch) -> Result<RecordBatch> {
    ...
}
```
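A rough sketch of what that could look like with plain arrow-rs APIs. This assumes `cast` can be used to retag a list's inner field name (it rebuilds the array around the same child data when the value types already match), and it only handles `DataType::List`; a real version would also cover `LargeList`, `FixedSizeList`, `Struct`, `Map`, and friends:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType, Field, Schema};

/// Recursively rename every `List` inner field to Spark's "element".
/// Sketch only: other nested types are passed through untouched.
fn rename_list_fields(dt: &DataType) -> DataType {
    match dt {
        DataType::List(f) => DataType::List(Arc::new(Field::new(
            "element",
            rename_list_fields(f.data_type()),
            f.is_nullable(),
        ))),
        other => other.clone(),
    }
}

fn coerce_schema_for_spark(batch: RecordBatch) -> Result<RecordBatch, ArrowError> {
    // Rewrite the schema with Spark-style list field names.
    let fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| {
            f.as_ref()
                .clone()
                .with_data_type(rename_list_fields(f.data_type()))
        })
        .collect();
    let schema = Arc::new(Schema::new(fields));

    // Retag each column so its embedded DataType matches the new schema;
    // casting between lists with identical value types should not copy values.
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .zip(schema.fields())
        .map(|(col, field)| cast(col.as_ref(), field.data_type()))
        .collect::<Result<_, _>>()?;

    RecordBatch::try_new(schema, columns)
}
```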
One of the options, yes, as
Is your feature request related to a problem or challenge?
In Apache DataFusion Comet, during the implementation of handling ARRAY types from Apache Spark, it was found that the hardcoded inner field name differs between arrow-rs and Apache Spark.
The inner ListType field is hardcoded to `item` in https://github.com/apache/arrow-rs/blob/f4fde769ab6e1a9b75f890b7f8b47bc22800830b/arrow-schema/src/field.rs#L130. However, it is `element` for Apache Spark. Because of this discrepancy the schema check fails when the record batch gets created.

In DataFusion the List creation method `Field::new_list_field` with a hardcoded field name is heavily used. The ticket idea is to find a way to parametrize this. Options considered:

- Replace `Field::new_list_field` with `Field::new`, which gives an opportunity to provide a custom name. However, those methods are often called from contexts where no `SessionContext` exists, and thus there is no possibility to access a config variable from which the new name could be parametrized.
- Relax `RecordBatch::try_new`, and for ListTypes avoid checking the inner field name: just check the inner datatype and the other field properties except `name`.
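For illustration, the difference between the two constructors (behavior as of current arrow-schema; worth double-checking against the pinned revision linked above):

```rust
use arrow_schema::{DataType, Field};

fn main() {
    // arrow-rs hardcodes the inner name to "item":
    let arrow_style = Field::new_list_field(DataType::Int32, true);
    assert_eq!(arrow_style.name(), "item");

    // Field::new lets the caller pick Spark's "element" instead:
    let spark_style = Field::new("element", DataType::Int32, true);
    assert_eq!(spark_style.name(), "element");
}
```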
Related apache/datafusion-comet#1456
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response