
How to monitor Hadoop with OpenTelemetry

Deepa Ramachandra

We are back with a simplified configuration for another critical open-source component: Hadoop. Monitoring Hadoop helps ensure that data sets are distributed as expected across the cluster. Although Hadoop is designed to be resilient to network mishaps, monitoring Hadoop clusters is still essential. In this setup, Hadoop is monitored using the JMX receiver.

The configuration detailed in this post uses observIQ’s distribution of the OpenTelemetry collector. We are simplifying the use of OpenTelemetry for all users. If you are as excited as we are, take a look at the details of this support in our repo.

The JMX receiver used in this configuration works with any OpenTelemetry collector distribution, including the upstream OpenTelemetry Collector and observIQ’s distribution of the collector.

What signals matter?

Monitoring performance metrics for Hadoop is necessary to ensure that all jobs are running as expected and the clusters are healthy. The following categories of metrics are monitored with this configuration:

HDFS Metrics:

It is critical to monitor the Hadoop Distributed File System (HDFS) to ensure disk space availability, optimize data storage, and track the capacity of the file system. There are two types of HDFS metrics: NameNode metrics and DataNode metrics. In the HDFS architecture, a single NameNode manages multiple DataNodes. NameNode metrics are the most important to monitor, because a NameNode failure renders the data in that cluster inaccessible. With these metrics you can:

  • Gauge the overall performance of the Hadoop system.
  • Keep track of anomalies in data directory growth and optimize data storage across the entire Hadoop system.

All of the metrics in the categories above can be gathered with the JMX receiver – so let’s get started!

Configuring the JMX receiver

After the installation, the config file for the collector can be found at:

  • C:\Program Files\observIQ OpenTelemetry Collector\config.yaml (Windows)
  • /opt/observiq-otel-collector/config.yaml (Linux)

Receiver Configuration:

  1. Specify the jar_path for the JMX receiver. The jar_path attribute points to the JMX metrics gatherer jar file that the receiver uses to collect Hadoop metrics. This file is installed automatically with observIQ’s distribution of the OpenTelemetry Collector.
  2. Set the endpoint attribute to the host and JMX port of the system running the Hadoop instance.
  3. Set the target_system attribute to hadoop,jvm. The JMX metrics gatherer exposes several categories of metrics; this attribute tells the receiver to scrape both the Hadoop metrics and the JVM metrics.
  4. Configure the collection_interval attribute. It is set to 60 seconds in this sample configuration.

Use resource_attributes to attach identifying attributes, such as the hadoop.endpoint value in the sample below, to every metric the receiver emits. For instance, if you are configuring multiple JMX receivers to collect metrics from many Hadoop servers, this option lets you tag each receiver’s metrics with the unique address of its endpoint system so they can be told apart downstream (a sketch of that multi-receiver setup follows the sample configuration below). Note that labeling endpoints is not the only use of resource_attributes.

yaml
receivers:
  jmx:
    jar_path: /opt/opentelemetry-java-contrib-jmx-metrics.jar
    endpoint: localhost:8004
    target_system: hadoop,jvm
    collection_interval: 60s
    resource_attributes:
      hadoop.endpoint: localhost:8004
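
If you are collecting from more than one Hadoop server, the sketch below shows what the multi-receiver setup mentioned above might look like. It is an illustration only: the hostnames namenode-1 and namenode-2 are placeholders for your own endpoints.

yaml
receivers:
  jmx/hadoop1:
    jar_path: /opt/opentelemetry-java-contrib-jmx-metrics.jar
    endpoint: namenode-1:8004          # placeholder hostname
    target_system: hadoop,jvm
    collection_interval: 60s
    resource_attributes:
      hadoop.endpoint: namenode-1:8004 # lets you tell the two servers apart downstream
  jmx/hadoop2:
    jar_path: /opt/opentelemetry-java-contrib-jmx-metrics.jar
    endpoint: namenode-2:8004          # placeholder hostname
    target_system: hadoop,jvm
    collection_interval: 60s
    resource_attributes:
      hadoop.endpoint: namenode-2:8004

Both named receivers (jmx/hadoop1 and jmx/hadoop2) would then be listed under receivers in the metrics pipeline shown later in this post.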

Processors:

  1. The resource detection processor is used to distinguish metrics received from multiple Hadoop systems. This helps with filtering metrics from specific Hadoop hosts in the monitoring tool, in this case Google Cloud Operations.
  2. Add the batch processor to bundle the metrics from multiple receivers. We highly recommend using this processor in the configuration, especially for the benefit of the logging component of the collector. To learn more about this processor, check the documentation.
yaml
processors:
  resourcedetection:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
  batch:

Exporters:

In this example, the metrics are exported to Google Cloud Operations using the googlecloudexporter. If you would like to forward your metrics to a different destination, check the destinations that OpenTelemetry supports at this time, here.

yaml
exporters:
  googlecloud:
    metric:
      resource_filters:
        - prefix: hadoop
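
If you prefer a different backend, most of the configuration stays the same and only the exporter changes. As a minimal sketch, the otlp exporter below forwards the metrics to any OTLP-compatible endpoint; the address my-otlp-backend:4317 is a placeholder, and the insecure setting assumes a plaintext test endpoint.

yaml
exporters:
  otlp:
    endpoint: my-otlp-backend:4317   # placeholder address for your OTLP backend
    tls:
      insecure: true                 # assumes a plaintext endpoint; use TLS in production

Whichever exporter you pick, reference it by name in the exporters list of the pipeline below.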

Set up the pipeline:

yaml
service:
  pipelines:
    metrics:
      receivers:
      - jmx
      processors:
      - resourcedetection
      - batch
      exporters:
      - googlecloud

Viewing the metrics collected

All the metrics the JMX receiver scrapes are listed below.

Metric | Description
hadoop.name_node.capacity.usage | The current used capacity across all data nodes reporting to the name node.
hadoop.name_node.capacity.limit | The total capacity allotted to data nodes reporting to the name node.
hadoop.name_node.block.count | The total number of blocks on the name node.
hadoop.name_node.block.missing | The number of blocks reported as missing to the name node.
hadoop.name_node.block.corrupt | The number of blocks reported as corrupt to the name node.
hadoop.name_node.volume.failed | The number of failed volumes reported to the name node.
hadoop.name_node.file.count | The total number of files being tracked by the name node.
hadoop.name_node.file.load | The current number of concurrent file accesses.
hadoop.name_node.data_node.count | The number of data nodes reporting to the name node.

Alerting:

With these metrics and dashboards created for the Hadoop server, here are some alerting and monitoring steps you can implement:

  • In addition to the Hadoop-specific metrics, track OS and JVM metrics to get a complete view of data usage capacity and projections for HDFS.
  • Set alert thresholds for capacity, missing blocks, corrupt blocks, and volume failures (see the example policy sketch after this list).
  • Avoid false alarms by resampling the metrics.
  • Set up alerts for failures related to individual data nodes.
  • Set up alerts for memory-shortage-related metric thresholds.
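
As a rough illustration of the alert-threshold suggestion above, the sketch below is a Google Cloud Monitoring alert policy, in the YAML form accepted by gcloud alpha monitoring policies create --policy-from-file, that fires when any missing blocks are reported. Treat it as an assumption-laden example: the metric type workload.googleapis.com/hadoop.name_node.block.missing reflects the default prefix the googlecloud exporter applies, and the display names and duration are placeholders to adapt to your environment.

yaml
# Hypothetical alert policy: fires when the NameNode reports any missing blocks.
displayName: Hadoop NameNode - missing blocks
combiner: OR
conditions:
  - displayName: Missing blocks greater than zero
    conditionThreshold:
      # Assumes the googlecloud exporter's default workload.googleapis.com prefix.
      filter: metric.type = "workload.googleapis.com/hadoop.name_node.block.missing"
      comparison: COMPARISON_GT
      thresholdValue: 0
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MAX

Similar policies can be cloned for hadoop.name_node.block.corrupt, hadoop.name_node.volume.failed, and the capacity metrics, adjusting the thresholds to your environment.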

observIQ’s distribution is a game-changer for companies looking to implement the OpenTelemetry standards. The single-line installer and the seamlessly integrated pool of receivers, processors, and exporters make working with this collector simple. Follow this space to keep up with all our future posts and simplified configurations for various sources. For questions, requests, and suggestions, reach out to our support team at support@observIQ.com or join the conversation on Slack!
