
How to monitor Elasticsearch with OpenTelemetry

by Deepa Ramachandra on June 15, 2022

Some popular monitoring tools can complicate your Elasticsearch monitoring and create blind spots. That’s why we made monitoring Elasticsearch simple, straightforward, and actionable. Read along as we walk through the steps to monitor Elasticsearch using observIQ’s distribution of the OpenTelemetry collector. To monitor Elasticsearch, we will configure two OpenTelemetry receivers: the elasticsearch receiver and the JVM receiver.

It is always good to stick to industry standards, and when it comes to monitoring, OpenTelemetry is the standard. We are simplifying the use of OpenTelemetry for all users. If you are as excited as we are, take a look at the details of this support in our repo. 

You can use this receiver with any OpenTelemetry collector, including the upstream OpenTelemetry Collector and observIQ’s distribution of the collector.

What signals matter?

Elasticsearch has clusters, nodes, and masters, which are concepts specific to Elasticsearch. When monitoring a cluster, you are essentially collecting metrics from a single Elasticsearch node or from multiple nodes in the cluster. Some of the most critical Elasticsearch metrics to monitor are:

Cluster health based on node availability and shards:

Elasticsearch’s most favorable feature is its scalability, which heavily depends on optimized cluster performance. Metrics deliver useful data such as cluster status, node status, and shard counts, split categorically into active shards, initializing shards, relocating shards, and unassigned shards. In addition, the elasticsearch.node.shards.size metric gives the size of the shards assigned to a specific node.

Node health based on disk space availability, CPU and memory usage percentages:

Elasticsearch’s performance depends on how efficiently its memory is used, specifically the memory health of each node. Constant node reboots can lead to increased disk read activity, degrading performance. CPU usage is another critical component of Elasticsearch monitoring: heavy search or indexing workloads can increase CPU usage, resulting in degraded performance. Metrics such as elasticsearch.node.fs.disk.available and elasticsearch.node.cluster.io help chart these values and derive useful inferences.

JVM metrics for JVM heap, garbage collection, and thread pool:

Elasticsearch is Java based, so it runs within a JVM (Java Virtual Machine). Cluster performance depends on how efficiently the JVM heap is used. All Java objects live in the JVM heap, which is created when the JVM starts, and objects are retained in the heap until it fills up. JVM heap usage is tracked using the metrics jvm.memory.heap.max, jvm.memory.heap.used, and jvm.memory.heap.committed.

Once the JVM heap is full, garbage collection is initiated. Garbage collection is an ongoing process, so it is critical to ensure that it does not slow the application’s performance. The JVM’s garbage collection activity is tracked using the metrics jvm.gc.collections.count and jvm.gc.collections.elapsed. Each node maintains thread pools of several types; the thread pools in turn have worker threads that reduce the overhead on overall performance. Thread pools queue requests and serve them when the node has the available capacity to accommodate them.

All of the metrics related to the categories above can be gathered with the Elasticsearch receiver – so let’s get started!

Configuring the Elasticsearch receiver

Use the following configuration to gather metrics with the elasticsearch receiver and forward them to the destination of your choice. OpenTelemetry supports over a dozen destinations to which you can forward the gathered metrics; more information about exporters is available in OpenTelemetry’s repo. This sample covers the configuration for the elasticsearch receiver. For details on the JVM receiver, check OpenTelemetry’s repo; a sketch of one option follows the elasticsearch receiver configuration below.

Receiver configuration:

  1. Use the nodes attribute to specify the node that is being monitored.
  2. Set the endpoint attribute to the system that is running the Elasticsearch instance.
  3. Configure the collection_interval attribute. It is set to 60 seconds in this sample configuration.
receivers:
 elasticsearch:
   nodes: ["_local"]
   endpoint: http://localhost:9200
   collection_interval: 60s
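
For the JVM side, one option is the collector’s jmx receiver, which runs the OpenTelemetry JMX metrics gatherer against the Elasticsearch JVM. The snippet below is a minimal sketch, assuming remote JMX is enabled on the Elasticsearch JVM at port 9999 and the gatherer jar is installed at the path shown; both values are assumptions you will need to adjust for your environment.

receivers:
 jmx:
   # Path to the OpenTelemetry JMX metrics gatherer jar (assumed install location)
   jar_path: /opt/opentelemetry-jmx-metrics.jar
   # JMX endpoint of the Elasticsearch JVM (assumes remote JMX is enabled on port 9999)
   endpoint: localhost:9999
   target_system: jvm
   collection_interval: 60s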

Processor configuration:

  1. The resourcedetection processor adds a unique identity to each metric host, so you can filter between hosts and view the metrics specific to each one.
  2. The resource processor sets additional resource attributes on the metrics; in this sample it adds a location attribute.
  3. The resourceattributetransposer processor enriches the metrics data with cluster information, making it easier to drill down into the metrics for each cluster.
  4. The batch processor batches all the metrics together during collection.
processors:
 # Resourcedetection is used to add a unique (host.name)
 # to the metric resource(s), allowing users to filter
 # between multiple agent systems.
 resourcedetection:
   detectors: ["system"]
   system:
     hostname_sources: ["os"]

 resource:
   attributes:
   - key: location
     value: global
     action: upsert

 resourceattributetransposer:
   operations:
     - from: host.name
       to: agent
     - from: elasticsearch.cluster.name
       to: cluster_name

 batch:

Exporter configuration:

In this example, the metrics are exported to Google Cloud Operations. If you would like to forward your metrics to a different destination, check the list of destinations OpenTelemetry supports at this time; an example of swapping in a different exporter follows the configuration below.

exporters:
 googlecloud:
   retry_on_failure:
     enabled: false
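
If Google Cloud is not your destination, you can swap in a different exporter. As a hedged example, the Prometheus exporter exposes the collected metrics on a local endpoint for scraping; the address and port below are assumptions.

exporters:
 prometheus:
   # Local address and port the collector exposes for Prometheus to scrape (assumed values)
   endpoint: "0.0.0.0:9000"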

Set up the pipeline

The pipeline ties the configured receiver, processors, and exporter together:

service:
 pipelines:
   metrics:
     receivers:
     - elasticsearch
     processors:
     - resourcedetection
     - resource
     - resourceattributetransposer
     - batch
     exporters:
     - googlecloud
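
If you also configured the jmx receiver sketched earlier, add it to the same metrics pipeline so both sets of metrics flow through the processors and exporter. A hypothetical combined pipeline would look like this:

service:
 pipelines:
   metrics:
     receivers:
     - elasticsearch
     - jmx
     processors:
     - resourcedetection
     - resource
     - resourceattributetransposer
     - batch
     exporters:
     - googlecloud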

Viewing the metrics

All the metrics the elasticsearch receiver scrapes are listed below, along with the attributes attached to them. Understanding these attributes is helpful if you need to enrich the metrics data further.

Metric: Description

elasticsearch.node.cache.memory.usage: The size in bytes of the cache.
elasticsearch.node.thread_pool.threads: The number of threads in the thread pool.
elasticsearch.node.thread_pool.tasks.queued: The number of queued tasks in the thread pool.
elasticsearch.node.thread_pool.tasks.finished: The number of tasks finished by the thread pool.
elasticsearch.node.shards.size: The size of the shards assigned to this node.
elasticsearch.node.operations.time: Time spent on operations.
elasticsearch.node.operations.completed: The number of operations completed.
elasticsearch.node.http.connections: The number of HTTP connections to the node.
elasticsearch.node.fs.disk.available: The amount of disk space available across all file stores for this node.
elasticsearch.node.cluster.io: The number of bytes sent and received on the network for internal cluster communication.
elasticsearch.node.cluster.connections: The number of open TCP connections for internal cluster communication.
elasticsearch.node.cache.evictions: The number of evictions from the cache.
elasticsearch.node.documents: The number of documents on the node.
elasticsearch.node.open_files: The number of open file descriptors held by the node.
jvm.classes.loaded: The number of loaded classes.
jvm.gc.collections.count: The total number of garbage collections that have occurred.
jvm.gc.collections.elapsed: The approximate accumulated collection elapsed time.
jvm.memory.heap.max: The maximum amount of memory that can be used for the heap.
jvm.memory.heap.used: The current heap memory usage.
jvm.memory.heap.committed: The amount of memory that is guaranteed to be available for the heap.
jvm.memory.nonheap.used: The current non-heap memory usage.
jvm.memory.nonheap.committed: The amount of memory that is guaranteed to be available for non-heap purposes.
jvm.memory.pool.max: The maximum amount of memory that can be used for the memory pool.
jvm.memory.pool.used: The current memory pool memory usage.
jvm.threads.count: The current number of threads.
elasticsearch.cluster.shards: The number of shards in the cluster.
elasticsearch.cluster.data_nodes: The number of data nodes in the cluster.
elasticsearch.cluster.nodes: The total number of nodes in the cluster.
elasticsearch.cluster.health: The health status of the cluster.

List of attributes:

Attribute Name: Description

elasticsearch.cluster.name: The name of the elasticsearch cluster.
elasticsearch.node.name: The name of the elasticsearch node.
cache_name: The name of the cache.
fs_direction: The direction of filesystem IO.
collector_name: The name of the garbage collector.
memory_pool_name: The name of the JVM memory pool.
disk_usage_state: The state of a section of space on disk.
direction: The direction of network data.
document_state: The state of the document.
shard_state: The state of the shard.
operation: The type of operation.
thread_pool_name: The name of the thread pool.
thread_state: The state of the thread.
task_state: The state of the task.
health_status: The health status of the cluster.
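
If you only need a subset of these metrics, the receiver configuration can usually toggle individual metrics on or off. The sketch below assumes your collector version supports per-metric settings for the elasticsearch receiver; the metric chosen here is just an example.

receivers:
 elasticsearch:
   nodes: ["_local"]
   endpoint: http://localhost:9200
   collection_interval: 60s
   metrics:
     # Turn off a metric you do not need (assumes per-metric settings
     # are supported by your collector version)
     elasticsearch.node.thread_pool.threads:
       enabled: false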

observIQ’s distribution is a game-changer for companies looking to implement the OpenTelemetry standards. The single-line installer and the seamlessly integrated receivers, processors, and exporters make working with this collector simple. Follow this space to keep up with all our future posts and simplified configurations for various sources. For questions, requests, and suggestions, reach out to our support team at support@observIQ.com.