
How to monitor Elasticsearch with OpenTelemetry

Deepa Ramachandra

Some popular monitoring tools on the market can complicate your Elasticsearch monitoring and leave blind spots. That’s why we made monitoring Elasticsearch simple, straightforward, and actionable. Read along as we walk through the steps to monitor Elasticsearch using observIQ’s distribution of the OpenTelemetry collector. To monitor Elasticsearch, we will configure two OpenTelemetry receivers: the Elasticsearch receiver and the JVM receiver.


It is always good to stick to industry standards, and when it comes to monitoring, OpenTelemetry is the standard. We are simplifying the use of OpenTelemetry for all users. If you are as excited as we are, look at the details of this support in our repo.

You can use this receiver with any OpenTelemetry Collector distribution, including the upstream OpenTelemetry Collector and observIQ’s distribution of the collector.

What signals matter?

Elasticsearch introduces its own concepts: clusters, nodes, and masters. When monitoring a cluster, you collect metrics from a single Elasticsearch node or from multiple nodes in the cluster. Some of the most critical Elasticsearch metrics to monitor:

Cluster health based on node availability and shards:

Elasticsearch’s most favorable feature is its scalability, which depends on optimized cluster performance. Metrics deliver valuable data such as cluster status, node status, and shard counts split categorically into active shards, initializing shards, relocating shards, and unassigned shards. In addition, the elasticsearch.node.shards.size metric reports the size of the shards assigned to a specific node.
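As a quick illustration, the shard counts behind these metrics are the same values Elasticsearch exposes through its `_cluster/health` API. The sketch below (with made-up sample values, not output from a real cluster) shows how such a response maps onto the shard states described above:

```python
# Sketch: map a _cluster/health-style response onto the shard states
# the receiver reports. The sample payload below is illustrative only.
sample_health = {
    "status": "yellow",
    "number_of_nodes": 3,
    "active_shards": 10,
    "initializing_shards": 1,
    "relocating_shards": 0,
    "unassigned_shards": 2,
}

def shard_metrics(health: dict) -> dict:
    """Group the API fields by the shard states named in this post."""
    return {
        "active": health["active_shards"],
        "initializing": health["initializing_shards"],
        "relocating": health["relocating_shards"],
        "unassigned": health["unassigned_shards"],
    }

metrics = shard_metrics(sample_health)
print(metrics["active"], sum(metrics.values()))  # 10 13
```

A "yellow" status with unassigned shards, as in this sample, is exactly the kind of condition these metrics are meant to surface.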

Node health based on disk space availability, CPU, and memory usage percentages:

Elasticsearch’s performance depends on how efficiently its memory is used, specifically the memory health of each node. Constant node reboots can lead to increased read-from-disk activity, reducing performance. CPU usage is another critical component of Elasticsearch monitoring. Heavy search or indexing workloads can increase CPU usage, resulting in degraded performance. Metrics such as elasticsearch.node.fs.disk.available and elasticsearch.node.cluster.io help chart these values and derive valuable inferences.

Related Content: How to Install and Configure an OpenTelemetry Collector

JVM metrics for JVM heap, garbage collection, and thread pool:

Elasticsearch is Java-based and runs within a JVM (Java Virtual Machine). Cluster performance depends on the efficiency of JVM heap usage. All Java objects live in the JVM heap, which is created when the JVM starts; objects are retained in the heap until they are no longer referenced. JVM heap is tracked using the metrics jvm.memory.heap.max, jvm.memory.heap.used, and jvm.memory.heap.committed.

Once the JVM heap fills up, garbage collection is initiated. JVM garbage collection is an ongoing process, so it is critical to ensure that it does not slow the application’s performance. Garbage collection is tracked using the metrics jvm.gc.collections.count and jvm.gc.collections.elapsed. Each node maintains thread pools of several types; the thread pools, in turn, have worker threads that reduce overhead on overall performance. Thread pools queue incoming requests and serve them when the node has available bandwidth to accommodate them.
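To make the heap metrics concrete, here is a small sketch (with made-up sample values) of how jvm.memory.heap.used and jvm.memory.heap.max combine into a utilization percentage you might watch or alert on:

```python
# Sketch: compute heap utilization from the two JVM heap metrics named
# above. The byte values are illustrative samples; a real pipeline would
# take them from the collected metrics.
heap_used_bytes = 768 * 1024 * 1024   # sample jvm.memory.heap.used
heap_max_bytes = 1024 * 1024 * 1024   # sample jvm.memory.heap.max

def heap_utilization(used: int, maximum: int) -> float:
    """Return heap usage as a percentage of the configured maximum."""
    return 100.0 * used / maximum

pct = heap_utilization(heap_used_bytes, heap_max_bytes)
# Sustained high utilization is when garbage-collection pressure tends
# to build, which is why the GC metrics are tracked alongside the heap.
print(f"{pct:.1f}% heap used")  # 75.0% heap used
```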

All metrics related to the categories above can be gathered with the Elasticsearch receiver – so let’s get started!

Configuring the Elasticsearch receiver

You can use the following configuration to gather metrics using the Elasticsearch receiver and forward them to the destination of your choice. OpenTelemetry supports over a dozen destinations to which you can forward the collected metrics. More information about exporters is available in OpenTelemetry’s repo. This sample covers the configuration for the Elasticsearch receiver. For details on the JVM receiver, check OpenTelemetry’s repo.

Receiver configuration:

  1. Use the nodes attribute to specify the node(s) being monitored.
  2. Set the endpoint attribute to the system running the Elasticsearch instance.
  3. Configure the collection_interval attribute. It is set to 60 seconds in this sample configuration.
```yaml
receivers:
  elasticsearch:
    nodes: ["_local"]
    endpoint: http://localhost:9200
    collection_interval: 60s
```
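If your cluster is secured, the receiver can also be given credentials, and the nodes filter can be widened to cover the whole cluster. The variant below is a hedged sketch; the field names follow the contrib receiver’s README, and the credentials shown are placeholders:

```yaml
receivers:
  elasticsearch:
    nodes: ["_all"]                    # collect from every node, not just the local one
    endpoint: http://localhost:9200
    username: otel_monitor             # example user; substitute your own
    password: ${ELASTICSEARCH_PASSWORD}
    collection_interval: 60s
```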

Processor configuration:

  1. The resourcedetection processor creates a unique identity for each metric host so that you can filter between the various hosts to view the metrics specific to that host.
  2. The resource processor is used to set and identify these parameters.
  3. The resourceattributetransposer processor enriches the metrics data with the cluster information. This makes it easier to drill down to the metrics for each cluster.
  4. The batch processor is used to batch all the metrics together during collection.
```yaml
processors:
  # resourcedetection is used to add a unique (host.name)
  # to the metric resource(s), allowing users to filter
  # between multiple agent systems.
  resourcedetection:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]

  resource:
    attributes:
    - key: location
      value: global
      action: upsert

  resourceattributetransposer:
    operations:
      - from: host.name
        to: agent
      - from: elasticsearch.cluster.name
        to: cluster_name

  batch:
```

Related Content: What is the OpenTelemetry Transform Language (OTTL)?

Exporter Configuration:

In this example, the metrics are exported to Google Cloud Operations. If you would like to forward your metrics to a different destination, check the list of destinations that OpenTelemetry supports at this time.

```yaml
exporters:
  googlecloud:
    retry_on_failure:
      enabled: false
```
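If you are not sending metrics to Google Cloud, the same pipeline works with any other supported exporter. For example, a hedged sketch of an OTLP exporter pointed at a placeholder endpoint (substitute your own backend):

```yaml
exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # placeholder; point at your OTLP receiver
```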

Set up the pipeline.

```yaml
service:
  pipelines:
    metrics:
      receivers:
      - elasticsearch
      processors:
      - resourcedetection
      - resource
      - resourceattributetransposer
      - batch
      exporters:
      - googlecloud
```
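Before starting the collector, it can be worth sanity-checking that every component named in the pipeline is actually defined in its top-level section, since a dangling reference will prevent the collector from starting. A minimal sketch of that check, with the configuration modeled as plain Python dicts (mirroring the YAML in this post) rather than parsed from a file:

```python
# Sketch: verify that each component referenced in a metrics pipeline
# has a matching definition in the corresponding top-level section.
config = {
    "receivers": {"elasticsearch": {}},
    "processors": {"resourcedetection": {}, "resource": {},
                   "resourceattributetransposer": {}, "batch": {}},
    "exporters": {"googlecloud": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["elasticsearch"],
                "processors": ["resourcedetection", "resource",
                               "resourceattributetransposer", "batch"],
                "exporters": ["googlecloud"],
            }
        }
    },
}

def undefined_components(cfg: dict) -> list:
    """Return (section, name) pairs referenced in a pipeline but never defined."""
    missing = []
    for pipeline in cfg["service"]["pipelines"].values():
        for section in ("receivers", "processors", "exporters"):
            for name in pipeline[section]:
                if name not in cfg[section]:
                    missing.append((section, name))
    return missing

print(undefined_components(config))  # []
```

An empty list means every pipeline reference resolves; any tuple returned names the section and component that still needs a definition.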

Viewing the metrics

All the metrics scraped by the Elasticsearch receiver are listed below, along with the attributes they use. Understanding these attributes helps if you need to enrich the metrics data further.

| Metric | Description |
| --- | --- |
| elasticsearch.node.cache.memory.usage | The size in bytes of the cache. |
| elasticsearch.node.thread_pool.threads | The number of threads in the thread pool. |
| elasticsearch.node.thread_pool.tasks.queued | The number of queued tasks in the thread pool. |
| elasticsearch.node.thread_pool.tasks.finished | The number of tasks finished by the thread pool. |
| elasticsearch.node.shards.size | The size of the shards assigned to this node. |
| elasticsearch.node.operations.time | Time spent on operations. |
| elasticsearch.node.operations.completed | The number of operations completed. |
| elasticsearch.node.http.connections | The number of HTTP connections to the node. |
| elasticsearch.node.fs.disk.available | The amount of disk space available across all file stores for this node. |
| elasticsearch.node.cluster.io | The number of bytes sent and received on the network for internal cluster communication. |
| elasticsearch.node.cluster.connections | The number of open TCP connections for internal cluster communication. |
| elasticsearch.node.cache.evictions | The number of evictions from the cache. |
| elasticsearch.node.documents | The number of documents on the node. |
| elasticsearch.node.open_files | The number of open file descriptors held by the node. |
| jvm.classes.loaded | The number of loaded classes. |
| jvm.gc.collections.count | The total number of garbage collections that have occurred. |
| jvm.gc.collections.elapsed | The approximate accumulated collection elapsed time. |
| jvm.memory.heap.max | The maximum amount of memory that can be used for the heap. |
| jvm.memory.heap.used | The current heap memory usage. |
| jvm.memory.heap.committed | The amount of memory that is guaranteed to be available for the heap. |
| jvm.memory.nonheap.used | The current non-heap memory usage. |
| jvm.memory.nonheap.committed | The amount of memory that is guaranteed to be available for non-heap purposes. |
| jvm.memory.pool.max | The maximum amount of memory that can be used for the memory pool. |
| jvm.memory.pool.used | The current memory pool memory usage. |
| jvm.threads.count | The current number of threads. |
| elasticsearch.cluster.shards | The number of shards in the cluster. |
| elasticsearch.cluster.data_nodes | The number of data nodes in the cluster. |
| elasticsearch.cluster.nodes | The total number of nodes in the cluster. |
| elasticsearch.cluster.health | The health status of the cluster. |

List of attributes:

| Attribute Name | Attribute Description |
| --- | --- |
| elasticsearch.cluster.name | The name of the Elasticsearch cluster. |
| elasticsearch.node.name | The name of the Elasticsearch node. |
| cache_name | The name of the cache. |
| fs_direction | The direction of filesystem IO. |
| collector_name | The name of the garbage collector. |
| memory_pool_name | The name of the JVM memory pool. |
| disk_usage_state | The state of a section of space on disk. |
| direction | The direction of network data. |
| document_state | The state of the document. |
| shard_state | The state of the shard. |
| operation | The type of operation. |
| thread_pool_name | The name of the thread pool. |
| thread_state | The state of the thread. |
| task_state | The state of the task. |
| health_status | The health status of the cluster. |

observIQ’s distribution is a game-changer for companies looking to implement the OpenTelemetry standards. The single-line installer and seamlessly integrated pool of receivers, processors, and exporters make working with this collector simple. Follow this space to keep up with all our future posts and simplified configurations for various sources. For questions, requests, and suggestions, contact our support team at support@observIQ.com.
