Incremental update #798
Comments
@dmpetrov I'm looking at the example for incremental update in the issue description and I'm not sure it will be enough for users. It assumes someone only needs to add one new column for each new record in the bucket, but I think real-world scenarios will be much more complex. For example, someone could be adding multiple new columns, generating new ones, filtering things out, etc. The first thing that comes to my mind is to maybe introduce a "contract" where the user defines a custom function that receives a chain and returns an updated chain. In that function the user puts whatever modifications are needed. So something like:

```python
def update_dc(dc: DataChain) -> DataChain:
    return dc.filter(...).map(...).gen(...).merge(...)

dc = DataChain.incremental_dataset("s3://bkt/dir1/*.jpg", update_dc, "image_emb")
dc.update()  # first update
dc.update()  # second update
...
```

My first idea is to:

Constraints:

I'm not sure how we can mitigate the constraints, as we cannot evaluate the whole script: there can be a bunch of other unrelated code that should not run in an incremental update of a specific dataset.
Thank you @ilongin for the thoughtful feedback! You are right that this is not enough. The proposed approach looks good. Some comments are below.
Why is this needed? Shouldn't it be temporary, without a name?
Only if you run the code in a separate environment/process. Is that the plan?
That might be ok. Also, have you considered using requirements in function comments like here: #776 (comment)? It can look like:

```python
def my_embedding(file: File) -> list[float]:
    """
    # /// script
    # requires-python = ">=3.12"
    # dependencies = [
    #     "pandas < 2.1.0"
    # ]
    # ///
    """
    return [...]
```

It looks pretty ugly if you ask me. Is there any other way of doing this? Our MLEM project had some functionality to serialize all requirements from the current environment/process. It can also work at the script level, not the function level:

```python
dc = DataChain.incremental_dataset("s3://bkt/dir1/*.jpg", script="my_update.py", func="my_func")
```

my_update.py:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "pandas < 2.1.0",
#     "datachain > 0.6",
# ]
# ///

import pandas as pd

from datachain import File


def my_embedding(file: File) -> list[float]:
    return [...]
```

The script approach looks cleaner and we can start with it. But we need something for functions as well.
My 2 cents / ideas (a custom function and capturing envs seem complicated, tbh): since a dataset can be considered a graph of steps, can we walk back to the root (probably a storage, or reading a newer version of another dataset), get the diff there, and apply the steps only to that delta + do a union at the end? Most likely that's what most people want, no? I also assume users will be fine running their DataChain code themselves, so we don't need to control the environment. Something like:

```python
DataChain.from_storage("s3://bkt/dir1/").filter(...).save("something", incremental=True)
```

It has to figure out how to apply the filter only to a subset of files - all new files in the storage since the last run (not sure about deletions?). wdyt, folks?
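To make the delta semantics above concrete, here is a purely illustrative sketch of what such an incremental save could do under the hood. `incremental_save` is not an existing API, and the assumption that `subtract()`/`union()` compare rows on the file signal is mine:

```python
from datachain import DataChain


# Illustrative only: find files added since the previous run, apply the
# chain's steps to that delta alone, and union the result with the previous
# dataset version. Deletions are ignored, and the dataset is assumed to
# already exist (a first run would just apply the steps to everything).
def incremental_save(name: str, storage_uri: str, apply_steps) -> DataChain:
    previous = DataChain.from_dataset(name)        # last saved version
    current = DataChain.from_storage(storage_uri)  # current bucket listing
    delta = current.subtract(previous)             # new files only
    return previous.union(apply_steps(delta)).save(name)


# Usage: incremental_save("something", "s3://bkt/dir1/", lambda dc: dc.filter(...))
```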
Hm, I don't see why it should be temporary. My thought was that when the user calls
Yes, I think we need to run this "update" function in a separate env/process, i.e. someone could create another script doing just:

```python
ds = DataChain.from_dataset("my-incremental-update-ds")
ds.update()
```

and this should work.
I would leave this at the script level and save it into the dataset. Then we can use it on each incremental update call. I'm looking into that issue in parallel and I think inline script metadata with a custom
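On running the update in a separate env/process with a script that carries its own inline metadata, one possible shape could be the following; this is just a sketch, `uv` is only one example of a PEP 723-aware runner, and the file name is taken from the example above:

```python
import subprocess

# Execute the stored update script in its own process; a PEP 723-aware runner
# (uv here, as one possibility) resolves the script's pinned dependencies into
# an isolated environment before running it, so nothing leaks into or depends
# on the caller's environment.
subprocess.run(["uv", "run", "my_update.py"], check=True)
```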
Yea, this seems like a good idea as well. The only thing is that if someone wants to use this dataset in some other script / use case, and wants to make sure it has the latest data, he would need to copy all that code (you put only

```python
my_new_ds = DataChain.from_dataset("my-incremental-ds").update().filter(...).map(...).save("some-new-dataset")
```
Yep, I assume that people would control the dataset from its own script. Users won't be updating it (seems like an advanced case?)
@shcheklein is your idea to apply only filters, but not functions? If that's the case, I'm afraid it will block very basic use cases like delta updates for embeddings.
@ilongin this might be a good workaround, but only if we can't find a way to incorporate it into the regular API.
The big goal here is to support an "Update" button in the dataset UI. So, all required information has to be stored within the dataset. In realistic cases, people need to recompute embeddings for new files using their custom functions. This is a very basic use case we initially heard from users. Filters are also needed. And it could be a combination of (multiple) filters and (multiple) mappers like:

```python
from datachain import C, DataChain, File, func


def my_embedding(file: File) -> list[float]:
    return [...]


target_product = my_embedding(File("product_example.jpg"))

(
    DataChain
    .from_storage("s3://bkt/dir1/")
    .filter(C("file.path").glob("*.jpg"))
    .map(emb=my_embedding)
    .mutate(dist=func.cosine_distance(C("emb"), target_product))
    .filter(C("dist") < 0.107)
    .save("like_product", incremental=True)  # It should know about both of the filters as well as the custom function
)
# The most tricky part here is the requirements.txt
```

What is outside of the scope: join, group, aggregate, distinct, etc. These operations are not incremental-update compatible since they require the whole context (all files). The question is how we can implement this functionality step by step. We should probably start with @shcheklein's idea (if I understood it correctly) of saving only chains but not custom functions, and progress to the full functionality.
I'd love to hear your feedback, folks!
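To picture the "all required information has to be stored within the dataset" requirement, here is one possible, purely hypothetical shape of the metadata a dataset version could persist for the UI "Update" button; every field name below is illustrative, not an existing schema:

```python
# Hypothetical metadata record stored alongside a dataset version so that the
# "Update" button can re-run the chain without the original working copy.
incremental_meta = {
    "source": "s3://bkt/dir1/",                  # single root the dataset is built from
    "script": "like_product.py",                 # script (or serialized chain/function) that produced it
    "requirements": ["pandas < 2.1.0", "datachain > 0.6"],  # captured deps (e.g. from PEP 723 metadata)
    "requires_python": ">=3.12",
    "last_indexed_at": "2025-01-01T00:00:00Z",   # watermark used to compute the storage delta
}
```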
@dmpetrov thanks for the examples. So if we need to support that "Update" button in the UI, we definitely need to store all the info in the dataset, as you mentioned. The easiest solution to this, and I think that's what Ivan meant as well in his comment (@shcheklein please confirm), is for the user to just re-run the same script where he created the dataset (ofc. we also need that

I think we could even implement

If this contract of making sure there are no other side effects in the script is too limited (although I don't see a reason why), then the other option is what I suggested - a different type of "contract" where the user needs to put all those mappers, filters, etc. in a special function and make sure all needed imports are inside the function itself. Then we would serialize that function (as a string), along with all requirements and other settings, and run it on demand. The advantage of this approach is that we can have handy

Then we would have something like this:

```python
def update_fn(dc: DataChain) -> DataChain:
    from datachain import func, File, C

    def my_embedding(file: File) -> list[float]:
        return [...]

    target_product = my_embedding(File("product_example.jpg"))

    return (
        dc
        .filter(C("file.path").glob("*.jpg"))
        .map(emb=my_embedding)
        .mutate(dist=func.cosine_distance(C("emb"), target_product))
        .filter(C("dist") < 0.107)
    )


ds = DataChain.create_incremental(DataChain.from_storage("s3://bkt/dir1/"), "like_product", update_fn)

# Some other script
dc = DataChain.from_dataset("like_product").update()  # this is now possible to run in another script as we save everything we need in the dataset itself. We can
```
I think we should choose between these 2 proposals, and I think the first one, re-running the script itself, is the best. I don't think we should go down the path of separately serializing filters, mutates, mappers, etc., as it seems too complicated.
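For the second proposal's "serialize that function (as a string)" step, a minimal sketch of how the function source could be captured and later restored, assuming plain inspect.getsource plus exec and the update_fn from the example above (illustrative only, not part of the proposed API):

```python
import inspect

# Capture the source of the user's update function so it can be stored with
# the dataset. This assumes update_fn is defined at module level in a real
# file (inspect.getsource cannot see functions typed into a bare REPL).
serialized_fn = inspect.getsource(update_fn)

# Later, in a fresh process, re-create the function from the stored string.
namespace: dict = {}
exec(serialized_fn, namespace)          # defines update_fn inside `namespace`
restored_update_fn = namespace["update_fn"]
```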
Discussed with @dmpetrov - I think he understands the idea I was suggesting here. One new insight here - it seems we'll need to define
@shcheklein ok, so the final decision is to go with the first approach from my last comment (re-running the whole script every time the user wants to do an incremental update)?
@ilongin can we, as a first step, do a few examples - like actual scripts from some demo projects - to see how they would look? Or what would it take to do a prototype and experiment with it? Something quick and dirty?
@shcheklein scripts will look the same as they do now; the user just needs to remember not to put side effects in them, as they will be run multiple times, once for each update. I started working on this and will post some examples here when it's done for the CLI. For Studio, some additional work will be needed, which can be done in step 2.
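As a concrete illustration of the "no side effects" contract (the dataset name, column name, and the proposed incremental=True flag below are placeholders from this thread, not a finished API), such a demo script might look like:

```python
from datachain import C, DataChain, File


def my_embedding(file: File) -> list[float]:
    return [...]


# The whole file is safe to re-run on every incremental update: it only
# defines functions and builds/saves the chain, with no other side effects
# (no file writes, no training jobs, no notifications).
(
    DataChain
    .from_storage("s3://bkt/dir1/")
    .filter(C("file.path").glob("*.jpg"))
    .map(emb=my_embedding)
    .save("image_emb", incremental=True)  # proposed flag; only new files get reprocessed
)
```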
I thought more about this... It looks like defining it at the dataset level is ok, since an incremental dataset has to have a single source (bucket or dataset). Otherwise, merge() would be needed, which is not possible for incrementals - join, group, aggregate, distinct are prohibited.
It would be great to check if we can do this at the chain level (not rerunning a script). If it's too complicated, then we should take the script level as a shortcut. As you might remember, we went through this issue before with

It would be great if we can do the same here, without limiting users to scripts.
Description
A user can do an incremental update manually like:
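A rough sketch of what such a manual flow could look like; the dataset and column names are illustrative, and the subtract()/union() usage assumes they compare rows on the file signal:

```python
from datachain import C, DataChain, File


def my_embedding(file: File) -> list[float]:
    return [...]


previous = DataChain.from_dataset("image_emb")              # last saved version
current = (
    DataChain.from_storage("s3://bkt/dir1/")
    .filter(C("file.path").glob("*.jpg"))                   # current bucket state
)

new_files = current.subtract(previous)                      # only files added since the last run
updated = previous.union(new_files.map(emb=my_embedding))   # recompute embeddings for the delta only

updated.save("image_emb")                                   # becomes the new dataset version
```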
It would be great if this can be supported out of the box. Users could then update datasets directly from the UI.
Challenges:
my_embedding(). So, Inline project meta #776 might be a prerequisite.