How to Monitor Apache Flink with OpenTelemetry

Jonathan Wamsley

Apache Flink monitoring support is now available in the open-source OpenTelemetry Collector through the Apache Flink receiver, which you can find in the OpenTelemetry repo. You can use this receiver with any OTel collector, including the OpenTelemetry Collector and observIQ’s collector distribution.

Below are quick instructions for setting up observIQ’s OpenTelemetry distribution and shipping Apache Flink telemetry to a popular backend: Google Cloud Ops. You can find out more on observIQ’s GitHub page: https://github.com/observIQ/observiq-otel-collector

What signals matter?

Apache Flink is an open-source framework that unifies batch processing and stream processing. The Apache Flink receiver records 29 unique metrics, so there is a lot of data to pay attention to. Some specific metrics that users find valuable are:

  • Uptime and restarts
    • Two metrics record how long a job has run without interruption and how many full restarts it has gone through, respectively.
  • Checkpoints
    • Several metrics monitoring checkpoints can tell you the number of active checkpoints, the number of completed and failed checkpoints, and the duration of ongoing and past checkpoints.
  • Memory Usage
    • Memory-related metrics are often worth monitoring. The Apache Flink receiver ships metrics that report total memory usage, both current and over time, minimums and maximums, and how memory is divided between different processes.

The Apache Flink receiver can gather all the above categories – so let’s get started.

Before you begin

If you don’t already have an OpenTelemetry collector built with the latest Apache Flink receiver installed, you’ll need to do that first. We suggest using the observIQ OpenTelemetry Collector distro that includes the Apache Flink receiver (and many others), and is simple to install with our one-line installer.

Configuring the Apache Flink receiver

Navigate to your OpenTelemetry configuration file. If you’re using the observIQ Collector, you’ll find it at the following location:

  • /opt/observiq-otel-collector/config.yaml (Linux)

For the observIQ OpenTelemetry Collector, edit the configuration file to include the Apache Flink receiver as shown below:

yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  nop:
    # Resourcedetection is used to add a unique (host.name)
    # to the metric resource(s),...  target_key: namespace

exporters:
  nop:
    # Add the exporter for your preferred destination(s)

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [nop]
      exporters: [nop]
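
The nop entries above are placeholders. As a rough sketch of what a filled-in pipeline might look like when shipping to Google Cloud, the fragment below swaps in the resourcedetection and batch processors and the googlecloud exporter. It assumes these components are included in your collector build and that Google credentials are already available on the host. The detector choice and the commented-out project ID are illustrative, not taken from the original configuration.

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  # Adds host.name to each metric resource so that metrics from
  # different hosts stay distinguishable (assumed detector choice).
  resourcedetection:
    detectors: [system]
    system:
      hostname_sources: [os]
  # Groups metrics into batches before export.
  batch:

exporters:
  googlecloud:
    # Hypothetical project ID; when omitted, the exporter relies on
    # the project associated with the host's credentials.
    # project: my-gcp-project

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [resourcedetection, batch]
      exporters: [googlecloud]
```

Every component named under service.pipelines must also be defined in its top-level section, with matching lowercase keys, or the collector will fail to start.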

You can find the relevant config file here if you’re using the Google Ops Agent instead.

Viewing the metrics collected

Once you complete the steps above, the collector delivers the following Apache Flink metrics to your chosen destination:

Metric | Description
flink.jvm.cpu.load | The CPU usage of the JVM for a jobmanager or taskmanager.
flink.jvm.cpu.time | The CPU time used by the JVM for a jobmanager or taskmanager.
flink.jvm.memory.heap.used | The amount of heap memory currently used.
flink.jvm.memory.heap.committed | The amount of heap memory guaranteed to be available to the JVM.
flink.jvm.memory.heap.max | The maximum amount of heap memory that can be used for memory management.
flink.jvm.memory.nonheap.used | The amount of non-heap memory currently used.
flink.jvm.memory.nonheap.committed | The amount of non-heap memory guaranteed to be available to the JVM.
flink.jvm.memory.nonheap.max | The maximum amount of non-heap memory that can be used for memory management.
flink.jvm.memory.metaspace.used | The amount of memory currently used in the Metaspace memory pool.
flink.jvm.memory.metaspace.committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool.
flink.jvm.memory.metaspace.max | The maximum amount of memory that can be used in the Metaspace memory pool.
flink.jvm.memory.direct.used | The amount of memory used by the JVM for the direct buffer pool.
flink.jvm.memory.direct.total_capacity | The total capacity of all buffers in the direct buffer pool.
flink.jvm.memory.mapped.used | The amount of memory used by the JVM for the mapped buffer pool.
flink.jvm.memory.mapped.total_capacity | The number of buffers in the mapped buffer pool.
flink.memory.managed.used | The amount of managed memory currently used.
flink.memory.managed.total | The total amount of managed memory.
flink.jvm.threads.count | The total number of live threads.
flink.jvm.gc.collections.count | The total number of collections that have occurred.
flink.jvm.gc.collections.time | The total time spent performing garbage collection.
flink.jvm.class_loader.classes_loaded | The total number of classes loaded since the start of the JVM.
flink.job.restart.count | The total number of restarts since this job was submitted, including full restarts and fine-grained restarts.
flink.job.last_checkpoint.time | The end-to-end duration of the last checkpoint.
flink.job.last_checkpoint.size | The total size of the last checkpoint.
flink.job.checkpoint.count | The number of checkpoints completed or failed.
flink.job.checkpoint.in_progress | The number of checkpoints in progress.
flink.task.record.count | The number of records a task has.
flink.operator.record.count | The number of records an operator has.
flink.operator.watermark.output | The last watermark this operator has emitted.
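
If you do not need every metric listed above, you may be able to trim what the receiver emits. The sketch below assumes the receiver honors the per-metric enabled toggle that OpenTelemetry Collector receivers commonly expose under a metrics key. The two metrics disabled here are arbitrary examples, so check your collector version's receiver documentation before relying on it.

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s
    # Assumed per-metric toggles; any metric not listed keeps its default.
    metrics:
      flink.operator.watermark.output:
        enabled: false
      flink.jvm.class_loader.classes_loaded:
        enabled: false
```

Disabling high-cardinality operator-level metrics is one way to reduce the volume sent to your backend while keeping job-level and JVM-level signals.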

observIQ’s distribution of the OpenTelemetry collector is a game-changer for companies looking to implement OpenTelemetry standards. The single-line installer and the integrated pool of receivers, exporters, and processors make working with this collector simple. Follow this space to keep up with all our future posts and simplified configurations for various sources. For questions, requests, and suggestions, contact our support team at support@observIQ.com.
