null_pointer_exception - Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.wlm.QueryGroupTask.getQueryGroupId()" is null #17518

Open
etgraylog opened this issue Mar 5, 2025 · 5 comments
Labels
bug Something isn't working Search Search query, autocomplete ...etc

Comments


etgraylog commented Mar 5, 2025

Describe the bug

This is a continuation of the problem reported in issue #16874.

In version 2.19.1, the warning message can still be triggered by _search/scroll API queries:

[2025-03-05T06:47:19,875][INFO ][o.o.n.Node               ] [10.0.1.242] version[2.19.1], pid[83053], build[tar/2e4741fb45d1b150aaeeadf66d41445b23ff5982/2025-02-27T01:16:47.726162386Z], OS[Linux/6.8.0-1021-aws/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.6/21.0.6+7-LTS]
...
[2025-03-05T06:48:42,518][WARN ][o.o.w.QueryGroupTask     ] [10.0.1.242] QueryGroup _id can't be null, It should be set before accessing it. This is abnormal behaviour

Whatever causes these warning messages also appears to affect the Nodes Stats API.

BEFORE the Warning messages are triggered:

user@es-master-data-node-614:/usr/share/opensearch/logs# grep -c 'This is abnormal behaviour' os-cluster-1.log
0
user@es-master-data-node-614:/usr/share/opensearch/logs#
user@es-master-data-node-614:/usr/share/opensearch/logs# curl -s -X GET "http://10.0.1.242:9200/_nodes/stats?pretty" -u admin:******** -k | tail -n 20
            "rejection_count" : { }
          }
        }
      },
      "caches" : {
        "request_cache" : {
          "size_in_bytes" : 0,
          "evictions" : 0,
          "hit_count" : 0,
          "miss_count" : 0,
          "item_count" : 0,
          "store_name" : "noop_store"
        }
      },
      "remote_store" : {
        "last_successful_fetch_of_pinned_timestamps" : -1
      }
    }
  }
}

AFTER the Warning messages are triggered:

user@es-master-data-node-614:/usr/share/opensearch/logs# grep -c 'This is abnormal behaviour' os-cluster-1.log
537
user@es-master-data-node-614:/usr/share/opensearch/logs# curl -s -X GET "http://10.0.1.242:9200/_nodes/stats?pretty" -u admin:******** -k
{
  "_nodes" : {
    "total" : 1,
    "successful" : 0,
    "failed" : 1,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [Yo6mfyfRQvyNyMkb3iLuMg]",
        "node_id" : "Yo6mfyfRQvyNyMkb3iLuMg",
        "caused_by" : {
          "type" : "null_pointer_exception",
          "reason" : "Cannot invoke \"String.equals(Object)\" because the return value of \"org.opensearch.wlm.QueryGroupTask.getQueryGroupId()\" is null"
        }
      }
    ]
  },
  "cluster_name" : "os-cluster-1",
  "nodes" : { }
}
user@es-master-data-node-614:/usr/share/opensearch/logs#

From then on, the Nodes Stats API returns an NPE with the same reason for every HTTP GET to _nodes/stats until the OpenSearch node is restarted.
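For anyone who wants to watch for this condition automatically, the failure can be detected from the response body itself. The following is a minimal Python sketch, assuming the response has already been fetched and parsed into a dict. The helper name `failed_node_reasons` is hypothetical, and it reads only the fields visible in the failing output quoted above (`_nodes.failures[].node_id`, `caused_by.type`, `caused_by.reason`).

```python
def failed_node_reasons(stats_response: dict) -> list:
    """Return (node_id, cause_type, cause_reason) for each failed node.

    Hypothetical helper: it relies only on fields visible in the
    _nodes/stats output quoted in this issue.
    """
    header = stats_response.get("_nodes", {})
    results = []
    # A healthy response has no "failures" key, so this loop is skipped.
    for failure in header.get("failures", []):
        cause = failure.get("caused_by", {})
        results.append(
            (
                failure.get("node_id", ""),
                cause.get("type", ""),
                cause.get("reason", ""),
            )
        )
    return results


# Sample copied from the failing response in this issue.
sample = {
    "_nodes": {
        "total": 1,
        "successful": 0,
        "failed": 1,
        "failures": [
            {
                "type": "failed_node_exception",
                "reason": "Failed node [Yo6mfyfRQvyNyMkb3iLuMg]",
                "node_id": "Yo6mfyfRQvyNyMkb3iLuMg",
                "caused_by": {
                    "type": "null_pointer_exception",
                    "reason": 'Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.wlm.QueryGroupTask.getQueryGroupId()" is null',
                },
            }
        ],
    },
    "cluster_name": "os-cluster-1",
    "nodes": {},
}

for node_id, cause_type, _reason in failed_node_reasons(sample):
    print(f"{node_id}: {cause_type}")
```

An empty result means no node reported a failure. Any entry whose cause type is `null_pointer_exception` matches the symptom described here.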

Related component

Search

To Reproduce

The steps to reproduce are essentially the same as documented in #16874. This issue adds a step showing that whatever triggers the reported warning message also appears to affect the Nodes Stats API:

  1. Query the Nodes Stats API to confirm it is accessible and responding as expected.
  2. Confirm there are zero instances of the warning message in the OpenSearch node log file(s).
  3. Execute repeated scroll search queries until the warning message appears in the OpenSearch node log file(s).
  4. Query the Nodes Stats API again and observe the NPE in the response.
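Steps 2 and 3 amount to counting one log message, which is what the `grep -c` calls earlier in this report do. A minimal Python sketch of that check follows. The names `count_warnings` and `warning_appeared` are hypothetical, and the marker string is the tail of the WARN line quoted above.

```python
from typing import Iterable

# Tail of the WARN line quoted in this issue (the same text grep -c searched for).
MARKER = "This is abnormal behaviour"


def count_warnings(log_lines: Iterable[str], marker: str = MARKER) -> int:
    """Count lines containing the marker, similar to `grep -c`."""
    return sum(1 for line in log_lines if marker in line)


def warning_appeared(before: int, after: int) -> bool:
    """True once the count has grown past the baseline taken in step 2."""
    return after > before


# Illustrative input: one unrelated placeholder line plus the WARN line from this issue.
sample_log = [
    "placeholder for an unrelated log line",
    "[2025-03-05T06:48:42,518][WARN ][o.o.w.QueryGroupTask     ] [10.0.1.242] "
    "QueryGroup _id can't be null, It should be set before accessing it. "
    "This is abnormal behaviour",
]

baseline = count_warnings([])          # step 2: nothing logged yet
current = count_warnings(sample_log)   # step 3: after the scroll queries
print(warning_appeared(baseline, current))
```

Against a real node, the lines would come from the log file instead, for example by passing an open handle on `os-cluster-1.log` to `count_warnings`.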

Expected behavior

The expected behavior has 2 parts:

  • No WARN QueryGroupTask message is logged.
  • No NPE is returned in response to an HTTP GET for _nodes/stats.

Additional Details

Plugins
Please list all plugins currently enabled.

  • opensearch-alerting
  • opensearch-anomaly-detection
  • opensearch-asynchronous-search
  • opensearch-cross-cluster-replication
  • opensearch-custom-codecs
  • opensearch-flow-framework
  • opensearch-geospatial
  • opensearch-index-management
  • opensearch-job-scheduler
  • opensearch-knn
  • opensearch-ltr
  • opensearch-ml
  • opensearch-neural-search
  • opensearch-notifications
  • opensearch-notifications-core
  • opensearch-observability
  • opensearch-performance-analyzer
  • opensearch-reports-scheduler
  • opensearch-security
  • opensearch-security-analytics
  • opensearch-skills
  • opensearch-sql
  • opensearch-system-templates
  • query-insights
  • repository-s3


Host/Environment (please complete the following information):

  • OS: Debian
  • Version: 12 (Bookworm)


@etgraylog etgraylog added bug Something isn't working untriaged labels Mar 5, 2025
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Mar 5, 2025
@etgraylog etgraylog changed the title null_pointer_exception - Cannot invoke \"String.equals(Object)\" because the return value of \"org.opensearch.wlm.QueryGroupTask.getQueryGroupId()\" is null null_pointer_exception - Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.wlm.QueryGroupTask.getQueryGroupId()" is null Mar 5, 2025
@sandeshkr419
Contributor

@ansjcy @deshsidd Can you please take a look at this?

@deshsidd
Contributor

deshsidd commented Mar 5, 2025

cc @kaushalmahi12 This might be more relevant to wlm, based on the class named in the error: org.opensearch.wlm.QueryGroupTask

@kaushalmahi12
Contributor

kaushalmahi12 commented Mar 5, 2025

❯ curl -s "localhost:9200/_nodes/stats?pretty"                          
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "runTask",
  "nodes" : {
    ....
      "caches" : {
        "request_cache" : {
          "size_in_bytes" : 0,
          "evictions" : 0,
          "hit_count" : 0,
          "miss_count" : 0,
          "item_count" : 0,
          "store_name" : "noop_store"
        }
      },
      "remote_store" : {
        "last_successful_fetch_of_pinned_timestamps" : -1
      }
    }
  }
}

@etgraylog I followed the repro steps and the issue didn't occur. Can you share the stack trace from the logs?

@kaushalmahi12
Contributor

Steps I followed:

  1. Checked out the main branch
  2. Started OpenSearch from the local code using ./gradlew run
  3. Loaded some sample data into OpenSearch
  4. Created the scroll_id
  5. Ran oha -z 1m "http://localhost:9200/_search/scroll/${scroll_id}?scroll=30s"
  6. Ran curl -s "localhost:9200/_nodes/stats?pretty"
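Steps 4 and 5 can be sketched as follows. The example assumes the standard scroll workflow, in which the initial search response carries a `_scroll_id` field. The helper name `scroll_url` and the sample ID are hypothetical, and the URL shape is the one passed to oha in step 5.

```python
from urllib.parse import quote


def scroll_url(
    search_response: dict,
    base: str = "http://localhost:9200",
    keep_alive: str = "30s",
) -> str:
    """Build the scroll URL used in step 5 from an initial search response.

    Hypothetical helper: only the `_scroll_id` field of the response is read.
    """
    scroll_id = search_response["_scroll_id"]
    # Percent-encode the ID so it is safe to embed in the URL path.
    return f"{base}/_search/scroll/{quote(scroll_id, safe='')}?scroll={keep_alive}"


# Placeholder response; a real scroll ID is a long opaque string.
initial_response = {"_scroll_id": "EXAMPLE_SCROLL_ID"}
print(scroll_url(initial_response))
```

The resulting URL is what a load generator would then hit repeatedly for the duration of the test.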

@etgraylog
Author

etgraylog commented Mar 6, 2025

A stacktrace is not generated in the log of the OpenSearch node when this occurs @kaushalmahi12.

To reproduce this, I use a single-shard index (zero replicas) containing 40,000,878 documents (35.2GiB), which a sliced scroll search targets. Reproducing it might require, for example, oha -p 10 or more.
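For readers unfamiliar with sliced scrolls, the request bodies could look roughly like the sketch below. It assumes the standard `slice` object (`id`, `max`) of the search API. The slice count, page size, `match_all` query, and helper name are placeholders, because the actual query was not posted in this thread.

```python
def sliced_scroll_bodies(max_slices: int, page_size: int = 1000) -> list:
    """Build one search body per slice for a sliced scroll.

    Hypothetical helper with placeholder values; the query used in the
    original reproduction is not known.
    """
    if max_slices < 2:
        # Slicing only makes sense with at least two slices.
        raise ValueError("max_slices must be at least 2")
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "size": page_size,
            "query": {"match_all": {}},
        }
        for slice_id in range(max_slices)
    ]


bodies = sliced_scroll_bodies(2)
for body in bodies:
    print(body["slice"])
```

Each body would be sent as its own initial search request with a `scroll` keep-alive parameter, and each slice then yields its own scroll ID to page through.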

Here is a Gist with two files. The first is a snippet of the OpenSearch node's log file, covering a period when the _nodes/stats API was returning NPEs in response to cURL HTTP GETs. The second contains the output of those HTTP GETs to the _nodes/stats API, along with other output related to the health of the OpenSearch node and shards.

Note that the date command was appended to the cURL commands to indicate when they were executed in relation to the contents of the OpenSearch node's log-file.
