
[RFC] Support System Generated Ingest Pipeline/Processor #17509

Open
bzhangam opened this issue Mar 4, 2025 · 6 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), untriaged

Comments

@bzhangam

bzhangam commented Mar 4, 2025

Is your feature request related to a problem? Please describe

I'm working on a proposal in the neural-search plugin to simplify the neural search setup. We want to remove the step where the user has to set up an ingest pipeline in order to use an ML model to generate embeddings.
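
For context, this is the kind of manual step we want to remove. Today the user has to create and maintain an ingest pipeline with a text_embedding processor themselves, roughly like the sketch below (pipeline and target field names are just placeholders):

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Generate embeddings for the text field",
  "processors": [
    {
      "text_embedding": {
        "model_id": "aVeif4oB5Vm0Tdw8zYO2",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}

The user then also has to attach this pipeline to the index (e.g. via index.default_pipeline) and keep it in sync with the index mapping.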

Describe the solution you'd like

We propose to create a new field type, semantic, for the original data. During indexing, OpenSearch will check whether the index has a semantic field. If it does, OpenSearch will automatically create an ingest processor and append it to the final ingest pipeline. If there is no final ingest pipeline, we will create a pipeline containing that processor and use it as the final ingest pipeline. This auto-generated ingest processor is invisible to users, and they don't need to manage it. In this solution we auto-generate the ingest processor based only on the index configuration, and we will limit the scope to that.

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Related component

No response

Describe alternatives you've considered

No response

Additional context

[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up LLD

bzhangam added the enhancement and untriaged labels on Mar 4, 2025
@dbwiddis
Member

dbwiddis commented Mar 4, 2025

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Are you referring to Flow Framework?

Wouldn't we just modify the Semantic Search Template (and related templates) with whatever improvements you're proposing?

@model-collapse

The proposal is about adding a semantic field. I think all of this can be covered by the neural-search plugin, so why do we need any code change in core? Can you elaborate?

@bzhangam
Author

bzhangam commented Mar 5, 2025

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Are you referring to Flow Framework?

Wouldn't we just modify the Semantic Search Template (and related templates) with whatever improvements you're proposing?

No. We want to further simplify it by providing a new field type where the user only needs to provide the model ID during index creation. We will then automatically add the embedding fields to the index mapping, generate embeddings during ingest, and rewrite queries against the embedding field.

Below is an example:

PUT /my-nlp-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "text": {
        "type": "semantic",
        "model_id": "aVeif4oB5Vm0Tdw8zYO2"
      }
    }
  }
}

Then we will actually create the index with a mapping like this:

"mappings": {
            "properties": {
                "id": {
                    "type": "text"
                },
                "text": {
                    "type": "semantic",
                    "raw_field_type": "text",
                    "model_id": "aVeif4oB5Vm0Tdw8zYO2"
                },
                // Auto add semantic_info fields
                "text_semantic_info": {
                    "properties": {
                        // Use nested field to handle text chunking
                        "chunks": {
                            "type": "nested",
                            "properties": {
                                // Use knn_vector for TEXT_EMBEDDING model
                                "embedding": {
                                    "type": "knn_vector",
                                    "dimension": 768,
                                    "method": {
                                        "engine": "faiss",
                                        "space_type": "l2",
                                        "name": "hnsw",
                                        "parameters": {}
                                    }
                                },
                                "text": {
                                    "type": "text"
                                }
                            }
                        },
                        // metadata of the model we use to generate the embedding
                        "model": {
                            "properties": {
                                "id": {
                                    "type": "text",
                                    "index": false
                                },
                                "name": {
                                    "type": "text",
                                    "index": false
                                },
                                "type": {
                                    "type": "text",
                                    "index": false
                                }
                            }
                        }
                    }
                }
            }
        }

Then during ingest we will automatically do text chunking and embedding generation. We don't want to create a concrete ingest pipeline and ask the user to manage it. That's why we propose to internally create the ingest processor based on the index mapping and inject it into the ingest process.
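
To make this concrete, if the user ingests a document that only contains the raw text field, the system-generated processor would conceptually expand it into something like the following before indexing (the values below are purely illustrative and the embedding vector is truncated):

{
  "id": "1",
  "text": "OpenSearch is a distributed search and analytics engine.",
  "text_semantic_info": {
    "chunks": [
      {
        "text": "OpenSearch is a distributed search and analytics engine.",
        "embedding": [0.012, -0.034, ...]
      }
    ],
    "model": {
      "id": "aVeif4oB5Vm0Tdw8zYO2",
      "name": "example-text-embedding-model",
      "type": "TEXT_EMBEDDING"
    }
  }
}

The user never sees or manages the processor that produces this; it is derived from the semantic field definition in the mapping.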

@bzhangam
Author

bzhangam commented Mar 5, 2025

The proposal is about adding a semantic field. I think all of this can be covered by the neural-search plugin, so why do we need any code change in core? Can you elaborate?

Yes, the proposal is about a new field type in the neural-search plugin. But we need support from core to allow us to inject the auto-generated ingest processor into the ingest process. This could be a generic capability: any plugin could systematically create an ingest processor based on the index and inject it into the ingest process. The main reason we need this is that we want to do some ingest work without a real ingest pipeline. In this way we can simplify the neural search setup.
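
Put differently, the system-generated processor would be roughly equivalent to the pipeline sketched below, except that it is derived from the index mapping at ingest time and never stored as a concrete pipeline the user can see or has to manage (the processor name and parameters here are purely illustrative, not a final design):

{
  "processors": [
    {
      "semantic_field_processor": {
        "field_map": {
          "text": "text_semantic_info"
        },
        "model_id": "aVeif4oB5Vm0Tdw8zYO2"
      }
    }
  ]
}

The core change we are asking for is just the extension point that lets a plugin contribute such a processor for a given target index, so it can run in addition to any user-defined pipeline.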

@bzhangam
Author

bzhangam commented Mar 6, 2025

Hi @dbwiddis @model-collapse. Do you still have any concerns here? If you need more clarification, you can take a look at:
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up LLD

@shwetathareja
Member

In your proposal @bzhangam, how is the embedding configuration below provided?

"properties": {
                                // Use knn_vector for TEXT_EMBEDDING model
                                "embedding": {
                                    "type": "knn_vector",
                                    "dimension": 768,
                                    "method": {
                                        "engine": "faiss",
                                        "space_type": "l2",
                                        "name": "hnsw",
                                        "parameters": {}
                                    }
                                },

Projects
None yet
Development

No branches or pull requests

4 participants