
[RFC] Support System Generated Ingest Pipeline/Processor #17509

Open
bzhangam opened this issue Mar 4, 2025 · 6 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), untriaged

Comments

@bzhangam

bzhangam commented Mar 4, 2025

Is your feature request related to a problem? Please describe

I'm working on a proposal in the neural-search plugin to simplify the neural search setup. We want to remove the step where the user has to set up an ingest pipeline in order to use an ML model to generate embeddings.
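
For context, this is the kind of manual step we want to remove. Today the user has to create and maintain an ingest pipeline with a text_embedding processor themselves, roughly like the sketch below (pipeline and target field names are just placeholders):

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Generate embeddings for the text field",
  "processors": [
    {
      "text_embedding": {
        "model_id": "aVeif4oB5Vm0Tdw8zYO2",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}

The user then also has to attach this pipeline to the index (e.g. via index.default_pipeline) and keep it in sync with the index mapping.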

Describe the solution you'd like

We propose to create a new field type, semantic, for the original data. During indexing, OpenSearch will check whether the index has a semantic field. If it does, OpenSearch will automatically create an ingest processor and append it to the final ingest pipeline. If there is no final ingest pipeline, we will create a pipeline containing that processor and use it as the final ingest pipeline. This auto-generated ingest processor is invisible to users, and they don't need to manage it. In this solution we auto-generate the ingest processor based only on the index configuration, and we will limit the scope to that.

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Related component

No response

Describe alternatives you've considered

No response

Additional context

[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up LLD

bzhangam added the enhancement and untriaged labels on Mar 4, 2025
@dbwiddis
Member

dbwiddis commented Mar 4, 2025

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Are you referring to Flow Framework?

Wouldn't we just modify the Semantic Search Template (and related templates) with whatever improvements you're proposing?

@model-collapse

The proposal is about adding a semantic field. I think all of this can be covered by the neural-search plugin, so why do we need any code change in core? Can you elaborate?

@bzhangam
Author

bzhangam commented Mar 5, 2025

But we are also wondering whether we should build a more generic mechanism for system-generated ingest pipelines/processors, for use cases where we want to auto-generate them to simplify the user experience.

Are you referring to Flow Framework?

Wouldn't we just modify the Semantic Search Template (and related templates) with whatever improvements you're proposing?

No. We want to further simplify it by providing a new field type where the user only needs to provide the model ID during index creation. We will then automatically add the embedding fields to the index mapping, generate embeddings during ingest, and rewrite queries against the embedding field.

Below is an example:

PUT /my-nlp-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "text": {
        "type": "semantic",
        "model_id": "aVeif4oB5Vm0Tdw8zYO2"
      }
    }
  }
}

Then we will actually create the index with a mapping like this:

"mappings": {
            "properties": {
                "id": {
                    "type": "text"
                },
                "text": {
                    "type": "semantic",
                    "raw_field_type": "text",
                    "model_id": "aVeif4oB5Vm0Tdw8zYO2"
                },
                // Auto add semantic_info fields
                "text_semantic_info": {
                    "properties": {
                        // Use nested field to handle text chunking
                        "chunks": {
                            "type": "nested",
                            "properties": {
                                // Use knn_vector for TEXT_EMBEDDING model
                                "embedding": {
                                    "type": "knn_vector",
                                    "dimension": 768,
                                    "method": {
                                        "engine": "faiss",
                                        "space_type": "l2",
                                        "name": "hnsw",
                                        "parameters": {}
                                    }
                                },
                                "text": {
                                    "type": "text"
                                }
                            }
                        },
                        // metadata of the model we use to generate the embedding
                        "model": {
                            "properties": {
                                "id": {
                                    "type": "text",
                                    "index": false
                                },
                                "name": {
                                    "type": "text",
                                    "index": false
                                },
                                "type": {
                                    "type": "text",
                                    "index": false
                                }
                            }
                        }
                    }
                }
            }
        }

Then during ingest we will automatically do text chunking and embedding generation. We don't want to create a concrete ingest pipeline and ask the user to manage it. That's why we propose to internally create the ingest processor based on the index mapping and inject it into the ingest process.
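
To make this concrete, if the user ingests a document that only contains the raw text field, the system-generated processor would conceptually expand it into something like the following before indexing (the values below are purely illustrative and the embedding vector is truncated):

{
  "id": "1",
  "text": "OpenSearch is a distributed search and analytics engine.",
  "text_semantic_info": {
    "chunks": [
      {
        "text": "OpenSearch is a distributed search and analytics engine.",
        "embedding": [0.012, -0.034, ...]
      }
    ],
    "model": {
      "id": "aVeif4oB5Vm0Tdw8zYO2",
      "name": "example-text-embedding-model",
      "type": "TEXT_EMBEDDING"
    }
  }
}

The user never sees or manages the processor that produces this; it is derived from the semantic field definition in the mapping.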

@bzhangam
Author

bzhangam commented Mar 5, 2025

The proposal is about adding a semantic field. I think all of this can be covered by the neural-search plugin, so why do we need any code change in core? Can you elaborate?

Yes, the proposal is about a new field type in the neural-search plugin. But we need support from core to allow us to inject the auto-generated ingest processor into the ingest process. This could be a generic capability: any plugin could systematically create an ingest processor based on the index and inject it into the ingest process. The main reason we need this is that we want to do some ingest work without a real ingest pipeline. In this way we can simplify the neural search setup.
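
Put differently, the system-generated processor would be roughly equivalent to the pipeline sketched below, except that it is derived from the index mapping at ingest time and never stored as a concrete pipeline the user can see or has to manage (the processor name and parameters here are purely illustrative, not a final design):

{
  "processors": [
    {
      "semantic_field_processor": {
        "field_map": {
          "text": "text_semantic_info"
        },
        "model_id": "aVeif4oB5Vm0Tdw8zYO2"
      }
    }
  ]
}

The core change we are asking for is just the extension point that lets a plugin contribute such a processor for a given target index, so it can run in addition to any user-defined pipeline.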

@bzhangam
Author

bzhangam commented Mar 6, 2025

Hi @dbwiddis @model-collapse. Do you still have any concerns here? If you need more clarification, you can take a look at:
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up HLD
[RFC] Support Semantic Field Type to Simplify Neural Search Set Up LLD

@shwetathareja
Member

In your proposal @bzhangam, how is the embedding configuration below provided?

"properties": {
                                // Use knn_vector for TEXT_EMBEDDING model
                                "embedding": {
                                    "type": "knn_vector",
                                    "dimension": 768,
                                    "method": {
                                        "engine": "faiss",
                                        "space_type": "l2",
                                        "name": "hnsw",
                                        "parameters": {}
                                    }
                                },

Projects
None yet
Development

No branches or pull requests

4 participants