How to Debug an Unresponsive Elasticsearch Cluster

Elasticsearch is an open-source search engine and analytics store used by a variety of applications, from search in e-commerce stores to internal log management tools built on the ELK stack (short for “Elasticsearch, Logstash, Kibana”). Because Elasticsearch is a distributed database, your data is partitioned into “shards,” which are then allocated to one or more servers.

Because of this sharding, a read or write request to an Elasticsearch cluster requires coordinating between multiple nodes, as there is no “global view” of your data on a single server. While this makes Elasticsearch highly scalable, it also makes it much more complex to set up and tune than other popular databases like MongoDB or PostgreSQL, which can run on a single server.

When reliability issues come up, firefighting can be stressful if your Elasticsearch setup is buggy or unstable. Your incident could be impacting customers, which could hurt revenue and your business reputation. Fast remediation is important, yet most engineers do not have the luxury of spending a large amount of time researching solutions online during an incident or outage. This guide is intended to be a cheat sheet of common issues that engineers run into with Elasticsearch and what to look for.

As a general-purpose tool, Elasticsearch has thousands of different configurations, which enable it to fit a variety of workloads. Even if published online, a data model or configuration that worked for one company may not be appropriate for yours. There is no magic bullet for getting Elasticsearch to scale; it requires diligent performance testing and trial and error.

Unresponsive Elasticsearch cluster issues

Cluster stability issues are some of the hardest to debug, especially if nothing changes with your data volume or code base.

Check size of cluster state

What does it do:

  • Elasticsearch cluster state tracks the global state of our cluster, and is the heart of controlling traffic and the cluster. Cluster state includes metadata on nodes in your cluster, status of shards and how they are mapped to nodes, index mappings (i.e. the schema), and more.
  • Cluster state usually doesn’t change often. However, certain operations such as adding a new field to an index mapping can trigger an update.
  • Because cluster state updates are broadcast to all nodes in the cluster, the cluster state should be small (<100MB).
  • A large cluster state can quickly make the cluster unstable. A common way this happens is through a mapping explosion (too many keys in an index) or too many indices.

What to look for:

  • Download the cluster state using the below command and look at the size of the JSON returned.

    curl -XGET 'http://localhost:9200/_cluster/state'
    
  • In particular, if the cluster state is large and increasing, look at which indices have the most fields in the cluster state, as one of them could be the offending index causing stability issues. You can also get an idea by looking at an individual index or matching against an index pattern like so:

    curl -XGET 'http://localhost:9200/_cluster/state/_all/my_index-*'
    
  • You can also see the offending index’s mapping using the following command:

    curl -XGET 'http://localhost:9200/my_index/_mapping'
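
To compare indices by field count, a small script can walk the mapping JSON returned by the command above. The following is a minimal sketch, not an official tool. It assumes the typeless mapping layout used by Elasticsearch 7.x and later (index name, then mappings, then properties), and the function names are hypothetical.

```python
import json


def count_fields(mapping_node):
    """Count the fields under a mapping node, including sub-objects and multi-fields."""
    total = 0
    # "properties" holds object fields; "fields" holds multi-fields such as a keyword sub-field.
    for key in ("properties", "fields"):
        for child in mapping_node.get(key, {}).values():
            total += 1 + count_fields(child)
    return total


def fields_per_index(mapping_response):
    """Map each index name in a GET /<index>/_mapping response to its field count."""
    return {
        index: count_fields(body.get("mappings", {}))
        for index, body in mapping_response.items()
    }


def json_size_bytes(document):
    """Approximate the serialized size of a parsed JSON document, e.g. the cluster state."""
    return len(json.dumps(document).encode("utf-8"))
```

For example, save the curl output to a file, load it with json.load, and pass the result to fields_per_index. An index whose count keeps growing between runs is a likely candidate for mapping explosion.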
    

How to fix:

  • Look at how data is being indexed. A common way mapping explosion occurs is when high-cardinality identifiers are used as JSON keys. Each time a new key such as “4” or “5” is seen, the cluster state is updated. For example, the JSON below will quickly cause stability issues with Elasticsearch, as each key is added to the global state.

    {
      "1": {
        "status": "ACTIVE"
      },
      "2": {
        "status": "ACTIVE"
      },
      "3": {
        "status": "DISABLED"
      }
    }
    
  • To fix, flatten your data into something that is Elasticsearch friendly, where the identifier is a value rather than a key:

    [
      {
        "id": "1",
        "status": "ACTIVE"
      },
      {
        "id": "2",
        "status": "ACTIVE"
      },
      {
        "id": "3",
        "status": "DISABLED"
      }
    ]
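
If the keyed structure is produced by your own application code, the flattening can be done before indexing. A minimal sketch in Python follows. The function name and the id field name are illustrative assumptions, not part of any Elasticsearch client.

```python
def flatten_keyed_records(keyed, id_field="id"):
    """Turn {"1": {...}, "2": {...}} into [{"id": "1", ...}, {"id": "2", ...}].

    The identifier becomes a value, so new identifiers no longer add new
    keys to the index mapping.
    """
    return [{id_field: key, **value} for key, value in keyed.items()]
```

With this shape, every record has the same two fields no matter how many identifiers exist.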
    

Check Elasticsearch Tasks Queue

What does it do:

  • When a request is made against Elasticsearch (index operation, query operation, etc.), it’s first inserted into the task queue until a worker thread can pick it up.
  • Once a worker pool has a thread free, it will pick up a task from the task queue and process it.
  • These operations are usually made by you via HTTP requests on port :9200 or via the transport protocol on port :9300, but they can also be internal operations that handle maintenance tasks on an index.
  • At a given time there may be hundreds or thousands of in-flight operations, but each should complete very quickly (in microseconds or milliseconds).

What to look for:

  • Run the below command and look for tasks that are stuck running for a long time like minutes or hours.
  • This means something is starving the cluster and preventing it from making forward progress.
  • It’s ok for certain long running tasks like moving an index to take a long time. However, normal query and index operations should be quick.

    curl -XGET 'http://localhost:9200/_cat/tasks?detailed'
    
  • With the ?detailed param, you can get more info on the target index and query.
  • Look for patterns in which tasks are consistently at the top of the list. Is it the same index? Is it the same node?
  • If so, maybe something is wrong with that index’s data or the node is overloaded.
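
Spotting long-running tasks by eye gets tedious when there are thousands of them. The sketch below filters them programmatically. It assumes you fetch the JSON form of the task list from GET /_tasks?detailed=true instead of the _cat output, and that each task carries a running_time_in_nanos field. The function names and the 60 second threshold are illustrative.

```python
from collections import Counter


def long_running_tasks(tasks_response, threshold_seconds=60.0):
    """Return tasks running at least threshold_seconds, slowest first."""
    slow = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= threshold_seconds:
                slow.append({
                    "task_id": task_id,
                    "node": node.get("name", node_id),
                    "action": task.get("action"),
                    "seconds": seconds,
                })
    return sorted(slow, key=lambda task: task["seconds"], reverse=True)


def slow_tasks_by_node(slow_tasks):
    """Count slow tasks per node to see whether one node stands out."""
    return dict(Counter(task["node"] for task in slow_tasks))
```

If one node or one action dominates the result, that is where to look first.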

How to fix:

  • If the volume of requests is higher than normal, then look at ways to optimize the requests (such as using bulk APIs or more efficient queries/writes).
  • If there is no change in volume and the slow tasks look random, this implies something else is slowing down the cluster. The backup of tasks is just a symptom of a larger issue.
  • If you don’t know where the requests come from, add the X-Opaque-Id header to your Elasticsearch clients to identify which clients are triggering the queries.
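
As an illustration of tagging requests, the sketch below builds a search request with the X-Opaque-Id header using only the Python standard library. The base URL, index name, and client name are placeholders, and official Elasticsearch clients have their own ways of setting this header.

```python
import urllib.request


def build_search_request(base_url, index, client_name):
    """Build (but do not send) a search request tagged with X-Opaque-Id."""
    url = f"{base_url}/{index}/_search"
    # urllib stores header names capitalized as "X-opaque-id";
    # HTTP header names are case-insensitive, so the server still recognizes it.
    return urllib.request.Request(url, headers={"X-Opaque-Id": client_name})
```

Sending the request with urllib.request.urlopen should make the client name show up in the task list headers for that request, which helps trace a slow task back to the service that issued it.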

Check Elasticsearch Pending Tasks

What does it do:

  • Pending tasks are pending updates to the cluster state such as creating a new index or updating its mapping.
  • Unlike the previous tasks queue, pending updates require a multi-step handshake to broadcast the update to all nodes in the cluster, which can take some time.
  • There should be almost zero in-flight tasks at any given time. Keep in mind, expensive operations like a snapshot restore can cause this to spike temporarily.

What to look for:

  • Run the command below and ensure that no tasks, or only a few, are in flight.

    curl -XGET 'http://localhost:9200/_cat/pending_tasks'
    
  • If it looks to be a constant stream of cluster updates that finish quickly, look at what might be triggering them. Is it a mapping explosion or creating too many indices?
  • If it’s just a few, but they seem stuck, look at the logs and metrics of the master node to see if there are any issues. For example, is the master node running into memory or network issues such that it can’t process cluster updates?
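
To tell a constant stream of updates from a few stuck ones, it can help to summarize the queue. The following sketch assumes the JSON form of the same data from GET /_cluster/pending_tasks, where each task has a source string and a time_in_queue_millis value. Taking the first word of the source as the kind of update is a heuristic, and the function name is hypothetical.

```python
from collections import Counter


def summarize_pending_tasks(pending_response):
    """Summarize pending cluster state updates by kind and longest wait."""
    tasks = pending_response.get("tasks", [])
    # The source string usually starts with the kind of update,
    # for example "create-index [foo], cause [api]" or "put-mapping [foo]".
    kinds = Counter(task.get("source", "").split(" ", 1)[0] for task in tasks)
    longest_wait_ms = max(
        (task.get("time_in_queue_millis", 0) for task in tasks), default=0
    )
    return {
        "count": len(tasks),
        "by_kind": dict(kinds),
        "longest_wait_ms": longest_wait_ms,
    }
```

Many put-mapping entries with short waits point toward a mapping explosion, while a handful of entries with very long waits point toward a struggling master node.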

Hot Threads

What does it do:

  • The hot threads API is a valuable built-in profiler that tells you where Elasticsearch is spending the most time.
  • This can provide insights such as whether Elasticsearch is spending too much time on index refresh or performing expensive queries.

What to look for:

  • Make a call to the hot threads API. To improve accuracy, it’s recommended to capture many snapshots using the ?snapshots param:
    curl -XGET 'http://localhost:9200/_nodes/hot_threads?snapshots=1000'
    
  • This will return stack traces seen when the snapshot was taken.
  • Look for the same stack in many different snapshots. For example, you might see the text 5/10 snapshots sharing following 20 elements. This means a thread was spending time in that area of the code during 5 of the 10 snapshots.
  • You should also look at the CPU %. If an area of code has both high snapshot sharing and high CPU %, it is a hot code path.
  • By looking at the code module, work out what Elasticsearch is doing.
  • If you see wait or park state, this is usually ok.

How to fix:

  • If a large amount of CPU time is spent on index refresh, then try increasing the refresh interval beyond the default 1 second.
  • If you see a large amount of time spent in cache code, your default caching settings may be suboptimal and causing a high miss rate.

Memory Issues

Check Elasticsearch Heap / Garbage Collection

What does it do:

  • Elasticsearch runs as a JVM process. The heap is the area of memory where a lot of Elasticsearch data structures are stored, and it requires garbage collection cycles to prune old objects.
  • For typical production setups, Elasticsearch locks all memory using mlockall on boot and disables swapping. If you’re not doing this, do it now.
  • If heap usage is consistently above 85% or 90% for a node, the node is close to running out of memory.

What to look for:

  • Search for the phrase “collecting in the last” in the Elasticsearch logs. If these messages are present, Elasticsearch is spending significant overhead on garbage collection (which takes time away from other productive tasks).
  • A few of these every now and then are ok, as long as Elasticsearch is not spending the majority of its CPU cycles on garbage collection (calculate the percentage of time spent collecting relative to the overall time reported in the log line).
  • A node that is spending 100% time on garbage collection is stalled and cannot make forward progress.
  • Nodes that appear to have network issues like timeouts may actually be due to memory issues. This is because a node can’t respond to incoming requests during a garbage collection cycle.
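
The percentage mentioned above can be computed directly from a log line. The sketch below assumes the GC overhead message has the form "spent [750ms] collecting in the last [1s]", with durations in ms, s, or m. The exact log format is an assumption and may vary by version.

```python
import re

GC_LINE = re.compile(
    r"spent \[([\d.]+)(ms|s|m)\] collecting in the last \[([\d.]+)(ms|s|m)\]"
)
UNIT_MILLIS = {"ms": 1, "s": 1000, "m": 60000}


def gc_overhead_percent(log_line):
    """Return the share of the reported window spent in GC, or None if no match."""
    match = GC_LINE.search(log_line)
    if match is None:
        return None
    spent_ms = float(match.group(1)) * UNIT_MILLIS[match.group(2)]
    window_ms = float(match.group(3)) * UNIT_MILLIS[match.group(4)]
    if window_ms == 0:
        return None
    return 100.0 * spent_ms / window_ms
```

A node that repeatedly reports values close to 100 is effectively stalled on garbage collection.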

How to fix:

  • The easiest fix is to add more nodes to increase the heap available for the cluster. However, it takes time for Elasticsearch to rebalance shards to the empty nodes.
  • If only a small set of nodes have high heap usage, you may need to better balance your cluster. For example, if your shards vary drastically in size or have different query/index bandwidths, you may have allocated too many hot shards to the same set of nodes. To move a shard, use the reroute API. You may also need to adjust your shard allocation awareness settings to ensure the shard doesn’t get moved back.

    curl -XPOST -H "Content-Type: application/json" localhost:9200/_cluster/reroute -d '
    {
      "commands": [
        {
          "move": {
            "index": "test", "shard": 0,
            "from_node": "node1", "to_node": "node2"
          }
        }
      ]
    }'
    
  • If you are sending large bulk requests to Elasticsearch, try reducing the batch size so that each batch is under 100MB. While larger batches help reduce network overhead, they require allocating more memory to buffer the request, which cannot be freed until the request is complete and the next GC cycle has run.
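
One way to keep bulk requests under a size limit is to split documents into batches by serialized size before sending them. The following is a minimal sketch that builds newline-delimited bulk bodies using the typeless index action of recent Elasticsearch versions. The function name is hypothetical, and the 100MB default simply mirrors the limit suggested above.

```python
import json


def batch_bulk_bodies(docs, index, max_bytes=100 * 1024 * 1024):
    """Yield bulk request bodies (NDJSON strings), each at most max_bytes.

    A single document larger than max_bytes is still yielded in its own batch.
    """
    action = json.dumps({"index": {"_index": index}})
    batch, size = [], 0
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Start a new batch when adding this entry would exceed the limit.
        if batch and size + entry_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)
```

Each yielded string can be sent as the body of one request to the _bulk endpoint with the Content-Type set to application/x-ndjson.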

Check Elasticsearch Old Memory Pressure

What does it do:

  • The old memory pool contains objects that have survived multiple garbage collection cycles and are long living objects.
  • If the old memory is over 75%, you might want to pay attention to it. As this fills up beyond 85%, more GC cycles will happen but the objects can’t be cleaned up.

What to look for:

  • Look at the ratio of old pool used to old pool max. If this is over 85%, that is concerning.
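
The ratio can be computed per node from the node stats API. This sketch assumes the JSON returned by GET /_nodes/stats/jvm, where each node reports jvm.mem.pools.old with used_in_bytes and max_in_bytes. The function name is hypothetical.

```python
def old_pool_usage(node_stats_response):
    """Map each node name to its old-generation pool usage as a percentage."""
    usage = {}
    for node_id, node in node_stats_response.get("nodes", {}).items():
        old = node.get("jvm", {}).get("mem", {}).get("pools", {}).get("old", {})
        used = old.get("used_in_bytes")
        maximum = old.get("max_in_bytes")
        # Skip nodes that do not report the pool or report a zero maximum.
        if used is not None and maximum:
            usage[node.get("name", node_id)] = 100.0 * used / maximum
    return usage
```

Nodes whose value stays above the 85% mark across several samples are the ones to investigate.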

How to fix:

  • Are you eagerly loading a lot of fielddata? These structures reside in memory for a long time.
  • Are you performing many long running analytics tasks? Certain tasks should be offloaded to a distributed computing framework designed for map/reduce operations like Apache Spark.

Check Elasticsearch FieldData Size

What does it do:

  • Fielddata is used for computing aggregations on a field, such as a terms aggregation.
  • Usually fielddata for a field is not loaded in memory until the first time an aggregation is performed on it.
  • However, this can also be precomputed on index refresh if eager_global_ordinals is set.

What to look for:

  • Look at the fielddata size for one index or for all indices like so:

    curl -XGET 'http://localhost:9200/index_1/_stats/fielddata'
    
  • An index could have very large field data structures if we are using it on the wrong type of data. Are you performing aggregations on very high-cardinality fields like a UUID or trace id? Fielddata is not suited for very high-cardinality fields as they will create massive fielddata structures.
  • Do you have a lot of fields with eager_global_ordinals set, or have you allocated a large amount of memory to the fielddata cache? Eager loading causes the fielddata to be generated at refresh time instead of query time. While it can speed up aggregations, it’s not optimal if you’re computing the fielddata for many fields at index refresh and never consume it in your queries.
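
To find which indices hold the most fielddata, the stats response can be sorted by size. This sketch assumes the JSON returned by GET /_stats/fielddata, where each index reports total.fielddata.memory_size_in_bytes. The function name is hypothetical.

```python
def indices_by_fielddata(stats_response):
    """Return (index, fielddata_bytes) pairs, largest first."""
    sizes = {
        name: body.get("total", {}).get("fielddata", {}).get("memory_size_in_bytes", 0)
        for name, body in stats_response.get("indices", {}).items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

The indices at the top of the list are the first places to check for aggregations on high-cardinality fields.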

How to fix:

  • Make adjustments to your queries or mapping so that you do not aggregate on very high-cardinality keys.
  • Audit your mapping to reduce the number of fields that have eager_global_ordinals set to true.

Elasticsearch Networking issues

Node left or node disconnected

What does it do:

  • A node will eventually be removed from the cluster if it does not respond to requests.
  • This allows shards to be replicated to other nodes to meet the replication factor and ensure high availability even if a node was removed.

What to look for:

  • Look at the master node logs. Even if there are multiple master-eligible nodes, you should look at the master node that is currently elected. You can use the nodes API or a tool like Cerebro to find it.
  • Check whether the same node consistently times out or has issues. For example, you can see which nodes are still pending for a cluster update by looking for the phrase “pending nodes” in the master node’s logs.
  • If you see the same node keep getting added but then removed, it may imply the node is overloaded or unresponsive.
  • If you can’t reach the node from your master node, it could imply a networking issue. You could also be running into NIC or CPU bandwidth limitations.

How to fix:

  • Test with the setting transport.compress set to true. This will compress traffic between nodes (such as from ingestion nodes to data nodes), reducing network bandwidth at the expense of CPU bandwidth.
  • Note: Earlier versions called this setting transport.tcp.compress.
  • If you also have memory issues, try increasing memory. A node may become unresponsive due to large time spent on garbage collection.

Not enough master node issues

What does it do:

  • The master and other nodes need to discover each other to form a cluster.
  • On first boot, you must provide a static set of master nodes so you don’t have a split brain problem.
  • Other nodes will then discover the cluster automatically as long as the master nodes are present.

What to look for:

  • Enable Trace logging to review discovery related activities.

    curl -XPUT -H "Content-Type: application/json" localhost:9200/_cluster/settings -d '
    {
      "transient": {"logger.discovery.zen":"TRACE"}
    }'
    
  • Review configuration such as minimum_master_nodes (if on 6.x or older).
  • Look at whether all master nodes in your initial master nodes list can ping each other.
  • Review whether you have quorum, which is (number of master-eligible nodes / 2) + 1, rounded down. If you have fewer nodes than quorum, no updates to the cluster state will occur, which protects data integrity.
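
The quorum arithmetic can be written out as a short worked example. The function names here are illustrative.

```python
def quorum(master_eligible_nodes):
    """Minimum number of master-eligible nodes that must be reachable."""
    # Integer division rounds down, so 3 nodes need 2 and 4 nodes need 3.
    return master_eligible_nodes // 2 + 1


def has_quorum(configured, running):
    """True if enough master-eligible nodes are running to elect a master."""
    return running >= quorum(configured)
```

With three master-eligible nodes, the cluster tolerates losing one. With four, it still tolerates losing only one, which is why odd counts are the usual choice.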

How to fix:

  • Sometimes network or DNS issues can cause the original master nodes to not be reachable.
  • Verify that at least (number of master-eligible nodes / 2) + 1 master nodes are currently running.

Shard allocation errors

Elasticsearch in Yellow or Red State (Unassigned shards)

What does it do:

  • When a node reboots or a cluster restore is started, the shards are not immediately available.
  • Recovery is throttled to ensure the cluster does not get overwhelmed.
  • Yellow state means primary shards are allocated, but secondary (replica) shards have not been allocated yet. While yellow indices are both readable and writable, availability is decreased. Yellow state is usually self-healing as the cluster replicates shards.
  • Red state means some primary shards are not allocated. This could be transient, such as during a snapshot restore operation, but can also imply major problems such as missing data.

What to look for:

  • See the reason why allocation has stopped:

    curl -XGET 'http://localhost:9200/_cluster/allocation/explain'
    
    curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'
    
  • Get a list of red indices, to understand which indices are contributing to red state. The cluster state will be in the red state as long as at least one index is red.

    curl -XGET 'http://localhost:9200/_cat/indices' | grep red
    
  • For more detail on a single index, you can see recovery status for the offending index

    curl -XGET 'http://localhost:9200/index_1/_recovery'
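
When many shards are unassigned, grouping them by reason gives a quicker overview than reading the list line by line. The sketch below assumes you request the shard listing as JSON by adding format=json to the _cat/shards call with the same columns as above, so that each row is a dictionary keyed by column name. The function name is hypothetical.

```python
from collections import defaultdict


def unassigned_by_reason(cat_shards_rows):
    """Group unassigned shards by their unassigned.reason column."""
    grouped = defaultdict(list)
    for row in cat_shards_rows:
        if row.get("state") != "UNASSIGNED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        reason = row.get("unassigned.reason") or "UNKNOWN"
        grouped[reason].append((row.get("index"), row.get("shard"), kind))
    return dict(grouped)
```

Unassigned primaries are what turn an index red, so those entries deserve attention first.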
    

How to fix:

  • If you see that allocation stopped after hitting max_retries (maybe the cluster was busy during allocation), you can temporarily increase the retry limit (the default is 5). Once the limit is above the number of failed attempts, Elasticsearch will start to initialize the unassigned shards.

    curl -XPUT -H "Content-Type: application/json" localhost:9200/index1,index2/_settings -d '
    {
      "index.allocation.max_retries": 7
    }'
    

Elasticsearch Disk issues

Index is read-only

What does it do:

  • Elasticsearch has three disk-based watermarks that influence shard allocation. The cluster.routing.allocation.disk.watermark.low watermark prevents new shards from being allocated to a node whose disk is filling up. By default, this is 85% of the disk used.
  • The cluster.routing.allocation.disk.watermark.high watermark will force the cluster to start moving shards off of the node to other nodes. By default, this is 90%. The cluster will keep moving data around until the node is below the high watermark.
  • The flood stage watermark, cluster.routing.allocation.disk.watermark.flood_stage, is reached when the disk is getting so full that moving shards might not be fast enough to keep the disk from running out of space. By default, this is 95%. When it is reached, indices on the node are placed in a read-only state to avoid data corruption.
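
The three thresholds can be summarized in a small helper that classifies a node by its disk usage. This is only an illustration of the logic, using default percentages of 85, 90, and 95. Your cluster may be configured with different values or with absolute byte values, and the function name is hypothetical.

```python
def disk_watermark_stage(used_percent, low=85.0, high=90.0, flood_stage=95.0):
    """Classify disk usage against the low, high, and flood stage watermarks."""
    if used_percent >= flood_stage:
        return "flood_stage"  # indices on the node become read-only
    if used_percent >= high:
        return "high"  # shards are relocated away from the node
    if used_percent >= low:
        return "low"  # no new shards are allocated to the node
    return "ok"
```

Feeding it the disk usage of each node from your monitoring system gives an early warning before the flood stage is reached.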

What to look for:

  • Look at your disk space for each node
  • Review logs for nodes for a message like below:

    high disk watermark [90%] exceeded on XXXXXXXX free: 5.9gb[9.5%], shards will be relocated away from this node
    
  • Once the flood stage is reached, you’ll see logs like so:

    flood stage disk watermark [95%] exceeded on XXXXXXXX free: 1.6gb[2.6%], all indices on this node will be marked read-only
    
  • Once this happens, the indices on that node are read-only.
  • To confirm, see which indices have read_only_allow_delete set to true.

    curl -XGET 'http://localhost:9200/_all/_settings?pretty' | grep read_only
    

How to fix:

  • First, clean up disk space such as by deleting local logs or tmp files.
  • To remove the read-only block, run the following command:

    curl -XPUT -H "Content-Type: application/json" localhost:9200/_all/_settings -d '
    {
      "index.blocks.read_only_allow_delete": null
    }'
    

Conclusion

Troubleshooting stability and performance issues can be challenging. The best way to find the root cause is to use the scientific method: form a hypothesis, then prove it correct or incorrect. Using these tools and the Elasticsearch management API, you can gain a lot of insight into how Elasticsearch is performing and where issues may be.
