Agent Resilience

A reliable collector architecture combines three mechanisms: retry, a sending queue, and load balancing.

Retry

BindPlane destinations can retry sending telemetry batches when a request fails because of an error or a network outage.

Configuration

Retry is enabled by default on all destinations that support it. By default, a failed request is retried after five seconds (the initial interval), and the wait between retries backs off progressively up to 30 seconds (the max interval). Once five minutes have elapsed (the max elapsed time), the request is permanently dropped.
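
The default schedule can be sketched in a few lines of Python. This is an illustration only: the `retry_schedule` function and its parameter names are hypothetical, and the backoff multiplier of 1.5 is an assumption, since the text above states only the five-second start, the 30-second cap, and the five-minute limit. Real collectors may also add random jitter, which is omitted here.

```python
def retry_schedule(initial=5.0, max_interval=30.0, max_elapsed=300.0, multiplier=1.5):
    """Return the elapsed time (in seconds) at which each retry would fire.

    The multiplier is an assumed value; jitter is deliberately left out.
    """
    attempts = []
    elapsed = 0.0
    interval = initial
    # Keep retrying until the next attempt would exceed the max elapsed time.
    while elapsed + interval <= max_elapsed:
        elapsed += interval
        attempts.append(elapsed)
        # Grow the wait, but never beyond the max interval.
        interval = min(interval * multiplier, max_interval)
    return attempts


schedule = retry_schedule()
print(f"{len(schedule)} retries, last one at {schedule[-1]:.1f}s")
```

Under these assumptions the wait grows from 5 seconds to the 30-second cap within the first six retries, and every later retry is spaced 30 seconds apart until the five-minute limit is reached.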

Best Practices

For workloads that cannot afford to have telemetry dropped, the five-minute maximum elapsed time should be increased significantly. Keep in mind that a large max elapsed time combined with a large backend outage will cause the collector to "buffer" a significant amount of telemetry to disk. Aggregator collectors should be provisioned with disks large enough to sustain an outage lasting hours or days.
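
As a rough aid for sizing those disks, the arithmetic can be sketched as follows. The function name, the headroom factor, and the example throughput are all hypothetical, and the estimate assumes that the on-disk size of buffered telemetry is close to its size on the wire, which may not hold in practice.

```python
def buffer_bytes(throughput_bytes_per_sec, outage_hours, headroom=1.5):
    """Estimate the disk space needed to buffer telemetry through an outage.

    headroom is an assumed safety factor, not a documented setting.
    """
    return int(throughput_bytes_per_sec * outage_hours * 3600 * headroom)


# Hypothetical aggregator receiving 2 MiB/s that must survive a 24-hour outage.
needed = buffer_bytes(2 * 1024**2, 24)
print(f"{needed / 1024**3:.1f} GiB")
```

Measuring the actual queue directory growth during a test outage is a more dependable way to confirm the figure.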

If overwhelming the backend during an outage recovery is not a concern, reducing the max interval to match the initial interval can decrease the time it will take to recover from an outage, as telemetry sending will be retried more frequently.

Sending Queue

When telemetry requests are retried, they are first stored in a sending queue. By default, this queue is stored on disk so that queued batches survive a collector system crash.

Configuration

The sending queue has three options:

  • Number of consumers
  • Queue size
  • Persistent queuing

Number of Consumers

This option determines how many batches will be retried in parallel. For example, 10 consumers will retry 10 batches at a time. If each batch contains 100 logs, the collector will retry 1,000 logs at a time.

Generally, the default value of 10 is suitable for both low- and high-volume systems. Decreasing this number keeps resource consumption low, but the collector will recover from large outages more slowly. Increasing it puts more strain on the backend, because more batches are retried in parallel.
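
The trade-off can be illustrated with a simplified model. It assumes each consumer sends one batch per request, every request takes the same time, and no new telemetry arrives during recovery. None of these assumptions comes from the product itself, and `drain_seconds` is a hypothetical helper.

```python
import math


def drain_seconds(queued_batches, consumers, seconds_per_request):
    """Estimate how long it takes to drain a backlog after an outage.

    Consumers work in parallel, so the backlog drains in "waves"
    of `consumers` batches each.
    """
    waves = math.ceil(queued_batches / consumers)
    return waves * seconds_per_request


# Hypothetical backlog of 6,000 batches with 200 ms per request.
for consumers in (5, 10, 20):
    print(consumers, "consumers:", round(drain_seconds(6000, consumers, 0.2)), "s")
```

In this model, halving the number of consumers doubles the recovery time, while doubling it halves the recovery time at the cost of twice as many concurrent requests against the backend.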

Queue Size

The queue size option determines how many batches are stored in the queue. When the queue is at capacity, additional batches will be dropped.

Keep in mind that the queue size is the number of batches. You can calculate the number of metrics, traces, and logs by taking the batch size and multiplying it by the queue size. You can use the Batch processor to configure batch sizes.
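
The calculation described above can be sketched as follows. The function names and all of the example values are hypothetical and are not product defaults.

```python
def queue_capacity(queue_size_batches, batch_size_items):
    """Total number of logs, metrics, or traces the queue can hold."""
    return queue_size_batches * batch_size_items


def seconds_until_full(queue_size_batches, batch_size_items, items_per_sec):
    """How long the queue can absorb telemetry before batches are dropped."""
    return queue_capacity(queue_size_batches, batch_size_items) / items_per_sec


# Hypothetical settings: 5,000 batches of 100 logs, arriving at 2,000 logs/s.
print(queue_capacity(5000, 100), "logs")
print(seconds_until_full(5000, 100, 2000), "seconds of outage before drops")
```

Comparing the second figure with the longest outage you need to survive indicates whether the queue size or the batch size should be raised.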

Persistent Queuing

Persistent queuing allows the BindPlane agent to buffer telemetry batches to disk when a request to the backend fails. The BindPlane agent supports persistent queuing by default, and it is recommended that it be enabled at all times. Persistent queuing protects against data loss if the agent system shuts down suddenly due to a crash or other outside factors.

If persistent queuing is disabled, failed telemetry batches will be buffered in memory. This will increase performance on high-throughput systems, at the expense of reliability. During an outage, memory buffering will increase memory consumption drastically and can cause the BindPlane agent to crash if the system runs out of memory.

Load Balancing

Load balancing allows you to operate a fleet of aggregator agents for increased performance and redundancy. Load balancers allow you to scale your aggregator fleet horizontally and sustain failures without incurring an outage.

The BindPlane collector can work with a wide range of load balancers when operating in aggregator mode. This documentation will not discuss any particular option, as most popular load-balancing solutions support the required options for operating multiple collectors reliably.

Load balancing best practices:

  • Health checks. The load balancer should verify that a collector is ready to receive traffic before routing connections to it.
  • Even connection distribution. Connections should be distributed evenly among collectors.
  • Protocol support. OpenTelemetry has a wide range of network-based receivers. In order to support all of them, the load balancer should support the transport protocols TCP and UDP as well as the application protocols HTTP and gRPC.
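
The first two practices can be illustrated with a toy model of what a load balancer does internally. This is a sketch, not the behavior of any particular load balancer: the `RoundRobinBalancer` class and the collector names are hypothetical, and `mark` stands in for the result of a real health check.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distribute connections evenly across collectors that pass health checks."""

    def __init__(self, collectors):
        self._collectors = list(collectors)
        self._healthy = set(self._collectors)
        self._ring = cycle(self._collectors)

    def mark(self, collector, healthy):
        """Record the outcome of a health check for one collector."""
        if healthy:
            self._healthy.add(collector)
        else:
            self._healthy.discard(collector)

    def next_collector(self):
        """Return the next healthy collector in round-robin order."""
        if not self._healthy:
            raise RuntimeError("no healthy collectors available")
        # One full pass over the ring is enough to find a healthy collector.
        for _ in range(len(self._collectors)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy collectors available")


lb = RoundRobinBalancer(["collector-a", "collector-b", "collector-c"])
lb.mark("collector-b", False)  # simulate a failed health check
picks = [lb.next_collector() for _ in range(4)]
print(picks)
```

While one collector is marked unhealthy, new connections alternate between the remaining two, which is how a fleet keeps accepting telemetry through the failure of a single collector.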

Use Cases

The following source types can be used with a load balancer:

  • OTLP
  • Syslog
  • TCP / UDP
  • Splunk HEC
  • Fluent Forward

Any source type that receives telemetry from remote systems over the network is a suitable candidate for load balancing.