Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Null handing issue with Partial upserts and partialUpsertStrategies #14924

Open
raghukn opened this issue Jan 27, 2025 · 14 comments · May be fixed by #15107
Open

Null handing issue with Partial upserts and partialUpsertStrategies #14924

raghukn opened this issue Jan 27, 2025 · 14 comments · May be fixed by #15107
Assignees
Labels
beginner-task Small task for new contributors to ramp up

Comments

@raghukn
Copy link

raghukn commented Jan 27, 2025

As described in the partial update behavior - https://docs.pinot.apache.org/basics/data-import/upsert

Below case is described:

With partial upsert, if the value is null in either the existing record or the new coming record, Pinot will ignore the upsert strategy and the null value:

(null, newValue) -> newValue

(oldValue, null) -> oldValue

(null, null) -> null

Partial Update is very desirable. But this behavior of ignoring fields present in payload but has null value, is not desirable. The fact that some one is sending null value show he / she intends to overwrite existing value. Other wise they would not have sent this column / value in the first place.

How do we get this behavior work correctly and incoming null not ignored?

@raghukn
Copy link
Author

raghukn commented Feb 3, 2025

All,

Is there any work around atall to achieve this??

@mayankshriv
Copy link
Contributor

Hi @raghukn, the idea behind current behavior is very intentional, and desired by most use cases. For example:

  • Use cases where sparse events are coming in for a PK, and we want to collect the non-null values from all events.
  • Use case where most data for a PK doesn't change and you just want to override few specific fields, but don't want to send the entire payload again.

Having said that, I can see how your requirement is also valid. Wondering, why full upsert doesn't work for you?

cc: @Jackie-Jiang for comment on whether there should be a mode in partial upsert to support this.

@raghukn
Copy link
Author

raghukn commented Feb 3, 2025

Thanks @mayankshriv -- Below is my usecase (DeNormalizing table and its extension table)

  • Fact table Tab1 with PK ID1
  • Extension table of this, say TabX1 which shares the PK ID1 (but it is stored in a column in TabX1)
  • I stream data from above 2 tables into a single TOPIC in Kafka

Now, In Pinot Realtime table, I have union of all columns from Tab1 and TabX1. Since these 2 tables share same value of PK, I can merge the rows using Partial Update behavior. It works fine, except for fact, that I am unable to remove any of the values previously set by send NULL in the payload.

I agree what I am trying is a remote/complex case, but there must a way to UNSET values even with simple PARTIAL UPDATE flows, without needing to Send whole row all the times.

@kishoreg
Copy link
Member

kishoreg commented Feb 3, 2025

Raghu, the partial update handler is a simple handler that takes old record and new record and returns a new record. May be you can add another handler and I think the handler is already configurable

@raghukn
Copy link
Author

raghukn commented Feb 4, 2025

@kishoreg that sounds good, If there is a handler that I can implement to get different behavior for NULL values, its is sufficient. Let me explore on this.

@deemoliu
Copy link
Contributor

deemoliu commented Feb 7, 2025

@raghukn thanks for reporting the issue. if you want to have behavior that "incoming null not ignored", you can create a new merger.

The current overwrite merger works as "Overwrite if new value not null"
you can add a force overwrite merger "Overwrite regardless of new value"

@Skyeliu04
Copy link

Hi can you assign this to me? I'm interested in working on it

@deemoliu deemoliu added the beginner-task Small task for new contributors to ramp up label Feb 8, 2025
@himanish-star
Copy link

himanish-star commented Feb 21, 2025

Hi @deemoliu @raghukn .... I can create a quick PR for this as I've gone through the code.

So there are two solutions I believe:

  1. Directly modify the current merge logic to always overwrite prev column value with new column value even if the new value is NULL. This would not require the creation of a new merger and would just need to remove the if block and the new code would look as below:
if (!(merger instanceof OverwriteMerger)) {
    Object prevValue = previousRow.getValue(column);
    if (prevValue != null) {
        resultHolder.put(column, merger.merge(prevValue, newRow.getValue(column)));
    }
}
  1. Create a new merger named ForceOverwriteMerger and this will also need the addition of a new strategy named ForceOverwrite and we can have the separate handling for it.

Let me know which one to move forward with? I'm in the favor of Approach 2 so that Overwrite strategy remains intact and we introduce a new strategy and merger

@raghukn
Copy link
Author

raghukn commented Feb 21, 2025

@himanish-star , solution 1 works great as this is Partial update scenario & you will be updating provided values only (including null)

@himanish-star
Copy link

Raising a PR, right away

@himanish-star himanish-star linked a pull request Feb 21, 2025 that will close this issue
@himanish-star
Copy link

@raghukn I've updated the PR to use Approach 2 because as I had pointed out Approach 1 will break the existing merge strategy and would not be backward compatible

@raghukn
Copy link
Author

raghukn commented Feb 23, 2025

@himanish-star -- I would still think the behavior of not doing anything about a NULL value being sent to partial update flow as bug. NULL is sent for that column and one would expect that being updated to target column.

But anyways, if there is a way to set a column to NULl that is good for my usecase.

Thanks!

@deemoliu
Copy link
Contributor

deemoliu commented Feb 24, 2025

@himanish-star please go with solution2 because the current behavior for "Overwrite" merger is an expected behavior in many use cases.
Using different mergers "Overwrite" and "ForceOverwrite" can provide the flexibility for users to select how to handle null values.

@raghukn one of the most common use case of partial upsert is to fill the missing field values for a existing row (primary key). Solution1 will break it.

@deemoliu deemoliu assigned himanish-star and unassigned Skyeliu04 Feb 24, 2025
@himanish-star
Copy link

@himanish-star please go with solution2 because the current behavior for "Overwrite" merger is an expected behavior in many use cases. Using different mergers "Overwrite" and "ForceOverwrite" can provide the flexibility for users to select how to handle null values.

@raghukn one of the most common use case of partial upsert is to fill the missing field values for a existing row (primary key). Solution1 will break it.

Hi @deemoliu , I've updated the PR with approach 2 and I've also added unit tests. I'll update the docs once the PR is checked-in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginner-task Small task for new contributors to ramp up
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants