Log-based Monitoring Recommendations

Use this reference to configure log-based alerts for each listed component by matching the specified log patterns within your alerting system. When an alert is triggered, follow the corresponding recommended action to diagnose and remediate the issue.

The log-based monitoring rules described in this section can be implemented using any standard log alerting platform (for example, Azure Monitor, AWS CloudWatch, Splunk, or Datadog). These examples are provided as guidance and do not represent an exhaustive set of conditions to monitor.
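As a rough sketch of how such rules translate into code, the Python below matches incoming log lines against a small rule table and returns a severity and action. The RULES entries, function name, and abbreviated action strings are illustrative only; a real deployment would express these rules in your alerting platform's native query language.

```python
import re

# Illustrative rule set: patterns mapped to severity and a short action hint.
# These two entries mirror OpenSearch rules from the tables below; extend as needed.
RULES = [
    {
        "pattern": re.compile(r"Cluster health status changed from \[(GREEN|YELLOW)\] to \[RED\]"),
        "severity": "Critical",
        "action": "Treat as an incident; check _cluster/health and _cat/shards.",
    },
    {
        "pattern": re.compile(r"flood stage disk watermark|disk usage exceeded flood-stage watermark"),
        "severity": "Critical",
        "action": "Free disk space on the affected node immediately.",
    },
]

def evaluate(log_line: str):
    """Return (severity, action) for the first matching rule, or None."""
    for rule in RULES:
        if rule["pattern"].search(log_line):
            return rule["severity"], rule["action"]
    return None
```

In practice, each ingested log line is passed through a matcher like `evaluate()`, and any non-None result raises an alert at the returned severity.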

OpenSearch

1. Cluster health & disk space

A (Severity: Critical)
Log text to monitor: Cluster health status changed from [GREEN] to [RED] OR Cluster health status changed from [YELLOW] to [RED]
Recommended action: Treat as an incident. Freeze risky changes and notify on-call. Use GET _cluster/health and _cat/shards to identify failing indices, then check node logs for crashes, disk-full conditions, or mapping/index corruption. Restore failed nodes or recover data from snapshots as needed.

B (Severity: Critical)
Log text to monitor: high disk watermark OR disk.watermark.high
Recommended action: Shard allocation to the node is stopped. Act immediately: free disk (delete old indices, move data, or extend volumes) or scale out with additional data nodes. Confirm that watermark thresholds fit your disk size and free-space policy.

C (Severity: Critical)
Log text to monitor: flood stage disk watermark OR disk usage exceeded flood-stage watermark
Recommended action: Cluster is at emergency disk level and will begin blocking writes to affected indices. Urgently free space on the node (delete indices, move shards, or increase storage). Only consider raising the flood-stage watermark after you have restored sufficient free disk.

D (Severity: Critical)
Log text to monitor: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block
Recommended action: Index has been set to read-only due to disk pressure. First resolve the disk issue (delete old indices, move data, increase storage). Once stable, clear the block with PUT /<index>/_settings { "index.blocks.read_only_allow_delete": false }. Add alerts on disk usage and shard relocation so you catch issues before flood stage.
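The three disk watermarks referenced above can also be checked programmatically. The sketch below classifies a node's disk usage against the default OpenSearch thresholds (low 85%, high 90%, flood stage 95%, all configurable via the cluster.routing.allocation.disk.watermark.* settings); the function itself is an illustrative helper, not part of any OpenSearch client.

```python
def watermark_state(used_fraction: float,
                    low: float = 0.85, high: float = 0.90, flood: float = 0.95) -> str:
    """Classify a node's disk usage fraction against the default OpenSearch
    disk watermarks. Override the thresholds if your cluster changes the
    cluster.routing.allocation.disk.watermark.* settings."""
    if used_fraction >= flood:
        return "flood_stage"   # writes to indices with shards on this node are blocked
    if used_fraction >= high:  # shards start relocating away from the node
        return "high"
    if used_fraction >= low:   # no new shards are allocated to the node
        return "low"
    return "ok"
```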

2. Memory

A (Severity: Critical)
Log text to monitor: OutOfMemoryError: Java heap space
Recommended action: Node has exhausted heap and crashed or is about to crash. Treat as Sev1. Check OS and JVM metrics, collect a heap dump if possible, and look for heavy aggregations or large fielddata usage. Reduce query complexity, increase heap within supported limits, and/or scale out with more data nodes.
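When right-sizing heap after an OutOfMemoryError, a widely cited starting point is to give the JVM about half of system RAM while staying below the compressed object pointer threshold of roughly 32 GB. The helper below encodes that rule of thumb; treat the numbers as common guidance, not OpenSearch-mandated limits.

```python
def suggested_heap_gb(ram_gb: float) -> float:
    """Rule-of-thumb JVM heap size for an OpenSearch data node: about half
    of system RAM, capped below ~32 GB so the JVM can keep using compressed
    object pointers. A starting point for tuning, not a hard rule."""
    return min(ram_gb / 2, 31.0)
```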

3. Shards, indices & allocation

A (Severity: Critical)
Log text to monitor: SearchPhaseExecutionException: all shards failed
Recommended action: Search requests are failing across all shards. Treat as high priority if impacting production queries. Inspect the full error (especially root_cause) for mapping issues, illegal arguments, or circuit breaker trips. Use _cat/shards and _cluster/allocation/explain to confirm shard health and fix the underlying resource or mapping problem before rerunning queries.

B (Severity: Critical)
Log text to monitor: Shard failed to start OR primary shard is not active OR Replica shard is not allocated
Recommended action: Shards for one or more indices are not coming online. Run _cat/shards and _cluster/allocation/explain to see unassigned shard reasons. Check node logs for disk, permission, or corruption errors. Restore indices from snapshots or reindex from source if shards cannot be recovered.

C (Severity: Critical)
Log text to monitor: IndexPrimaryShardNotAllocatedException
Recommended action: Primary shard hasn't been allocated, so the index is unavailable. Use the allocation explain API to find allocation constraints (node attributes, disk thresholds, incompatible versions). Fix the constraint and rerun allocation or recreate the index from snapshot.
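The _cat/shards output mentioned above can be scanned mechanically for unassigned shards before drilling into the allocation explain API. A minimal parser sketch, assuming the default _cat/shards column order (index, shard, prirep, state, ...):

```python
def unassigned_shards(cat_shards_output: str):
    """Parse the plain-text output of GET _cat/shards and return
    (index, shard, prirep) tuples for shards in state UNASSIGNED.
    Assumes the default column order: index shard prirep state ..."""
    rows = []
    for line in cat_shards_output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            rows.append((parts[0], parts[1], parts[2]))
    return rows
```

Each returned tuple identifies a shard to feed into _cluster/allocation/explain for the detailed reason it is unassigned.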

5. Security, TLS & authorization

A (Severity: Critical)
Log text to monitor: OpenSearchSecurityException[OpenSearch Security not initialized for __PATH__]
Recommended action: Security plugin has not successfully initialized and requests are blocked. Verify that security system indexes exist, and rerun the security admin / initialization procedure, watching for errors. Check certificate configuration and opensearch.yml security settings across all nodes to ensure they match.

B (Severity: Critical)
Log text to monitor: PKIX path building failed OR unable to find valid certification path to requested target OR SSLHandshakeException OR Received fatal alert: bad_certificate
Recommended action: TLS handshake has failed between client and OpenSearch or between nodes. Verify certificate chains (CA, intermediate, node certificates), expiration dates, and hostname/SANs. Ensure clients trust the issuing CA and that your opensearch.yml and client configuration reference the correct key/trust stores.
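Certificate expiry is a frequent root cause of the handshake failures above and can be checked proactively. The sketch below computes days to expiry from a notAfter string in the format printed by `openssl x509 -enddate` (the default format string is an assumption; adjust it to match your tooling's output):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, fmt: str = "%b %d %H:%M:%S %Y %Z") -> int:
    """Days until a certificate's notAfter timestamp, e.g.
    'May 30 10:48:38 2026 GMT' as printed by `openssl x509 -enddate`.
    A negative result means the certificate has already expired."""
    expiry = datetime.strptime(not_after, fmt).replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).days
```

Alerting well before this value reaches zero (for example at 30 days) avoids the bad_certificate failures entirely.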

6. Node discovery, cluster-manager election & cluster formation

A (Severity: Critical)
Log text to monitor: MasterNotDiscoveredException
Recommended action: Cluster cannot elect a cluster-manager node; reads/writes are blocked. Confirm that the expected number of cluster-manager-eligible nodes are running and reachable. Verify discovery.seed_hosts and cluster.initial_cluster_manager_nodes settings and check for network partition or split-brain scenarios.
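Elections require a strict majority (quorum) of cluster-manager-eligible nodes, which is why an odd number of eligible nodes (typically three) is recommended. A one-line helper for sanity-checking how many eligible nodes must remain reachable:

```python
def quorum(manager_eligible_nodes: int) -> int:
    """Minimum number of cluster-manager-eligible nodes that must be
    reachable for an election to succeed: a strict majority."""
    return manager_eligible_nodes // 2 + 1
```

With three eligible nodes, quorum is two, so the cluster tolerates the loss of one eligible node.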

7. OpenSearch Dashboards & client connectivity

A (Severity: Critical)
Log text to monitor: No living connections OR NoNodeAvailableException
Recommended action: Clients cannot connect to any OpenSearch node. Verify the OpenSearch endpoint, port, and TLS settings in the client configuration. Check that the cluster is up and healthy and that firewalls, security groups, or Kubernetes Services allow access from the client.

Kafka

1. Cluster Health & Availability

A (Severity: Critical)
Log text to monitor: Fatal error during KafkaServer startup
Recommended action: The broker failed to start properly. Check the server logs for the root cause (e.g., port conflict, configuration error). Ensure the broker configuration is correct and that the port is available.

B (Severity: Critical)
Log text to monitor: Authorization failed
Recommended action: Authorization was denied for a client request. Check ACLs and authorizer configuration to ensure legitimate clients have access. If unexpected, investigate a potential security breach or misconfiguration.

2. Disk & Storage

A (Severity: Critical)
Log text to monitor: Error while writing meta.properties
Recommended action: The broker encountered an error writing to the meta.properties file in a log directory. This usually indicates a disk failure or permissions issue. Check the disk health and file permissions for the log directories.

B (Severity: Critical)
Log text to monitor: Failed to create or validate data directory
Recommended action: Kafka could not create or validate one of its data directories. Verify that the directory exists, permissions are correct, and the filesystem is mounted and writable.

C (Severity: Critical)
Log text to monitor: Disk error while locking directory
Recommended action: The broker failed to lock a log directory, possibly because another process is using it or the filesystem is read-only/corrupted. Ensure no other Kafka process is running on the same data directory.

3. Replication & Consistency

A (Severity: Critical)
Log text to monitor: under replicated
Recommended action: Partitions are under-replicated, meaning not all replicas are in sync. This increases the risk of data loss. Check network connectivity between brokers and ensure all brokers are running and healthy.
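Under-replicated partitions can also be listed directly with `kafka-topics.sh --describe --under-replicated-partitions`. For log-driven tooling, the sketch below parses the tab-separated --describe output and flags partitions whose in-sync replica set (ISR) is smaller than the replica set; the column layout is assumed from the standard CLI output.

```python
def under_replicated(describe_output: str):
    """Parse `kafka-topics.sh --describe` output and return
    (topic, partition) pairs where the ISR is smaller than the replica set.
    Assumes the standard tab-separated 'Key: value' partition lines."""
    results = []
    for line in describe_output.splitlines():
        if "Replicas:" not in line or "Isr:" not in line:
            continue  # skip topic summary lines and blanks
        fields = dict(
            part.split(": ", 1)
            for part in line.strip().split("\t")
            if ": " in part
        )
        replicas = fields["Replicas"].split(",")
        isr = fields["Isr"].split(",")
        if len(isr) < len(replicas):
            results.append((fields["Topic"], fields["Partition"]))
    return results
```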

Fluentd

1. Buffer & Performance

A (Severity: Critical)
Log text to monitor: BufferOverflowError OR BufferQueueLimitError
Recommended action: The buffer is full and Fluentd typically stops accepting new logs or drops them. Check if the output destination is down or slow. Consider increasing the buffer limits (chunk_limit_size / total_limit_size) or scaling out consumers.

B (Severity: Critical)
Log text to monitor: failed to flush
Recommended action: Fluentd failed to flush the buffer to the destination. This might be due to network issues, authentication failures, or destination downtime. Check the error message details for the root cause.
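When tuning chunk_limit_size and total_limit_size after a BufferOverflowError, it helps to know roughly how many chunks the buffer can queue before it fills. A small illustrative helper, assuming Fluentd's k/m/g size suffixes:

```python
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(value: str) -> int:
    """Parse a Fluentd-style size string such as '8m' or '512k' into bytes;
    a bare number is taken as bytes."""
    value = value.strip().lower()
    if value[-1] in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[value[-1]])
    return int(value)

def max_buffer_chunks(total_limit: str, chunk_limit: str) -> int:
    """Approximate number of full chunks the buffer can hold before
    overflow errors start: total_limit_size divided by chunk_limit_size."""
    return parse_size(total_limit) // parse_size(chunk_limit)
```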

2. Configuration & Startup

A (Severity: Critical)
Log text to monitor: config error
Recommended action: Fluentd encountered a configuration error and likely failed to start or load a plugin. Verify the configuration syntax, plugin names, and required parameters.

B (Severity: Critical)
Log text to monitor: Address already in use
Recommended action: Fluentd failed to bind to a network port because it is already in use. Check if another Fluentd instance or service is running on the same port.

3. Processing

A (Severity: Critical)
Log text to monitor: pattern not matched
Recommended action: The input or filter parser could not match the incoming log line against the specified pattern. This usually results in data loss or unparsed records. Review the log format and the regex/pattern in the configuration.
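Before deploying a parser change, candidate log lines can be checked against the pattern offline. The sketch below uses a hypothetical timestamp/level/message pattern; note that Fluentd parsers use Ruby regexes, whose named groups are written (?<name>...) rather than Python's (?P<name>...), so this is a quick sanity check rather than a byte-for-byte equivalent.

```python
import re

# Hypothetical parser pattern for lines like
# "2024-05-01 12:00:00 ERROR something broke".
PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)$"
)

def parses(line: str):
    """Return the parsed fields as a dict, or None if the line would
    trigger Fluentd's 'pattern not matched' warning."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None
```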

MongoDB

1. Cluster Health & Crash

A (Severity: Critical)
Log text to monitor: Fatal assertion OR aborting after fassert() failure
Recommended action: The MongoDB process has encountered a fatal error and is shutting down. Review the stack trace in the logs. Restart the process and check for data consistency.

B (Severity: Critical)
Log text to monitor: Out of memory
Recommended action: The process is running out of memory. Check memory usage trends and consider scaling up resources or optimizing queries/indexes.

2. Storage

A (Severity: Critical)
Log text to monitor: WiredTiger error
Recommended action: The WiredTiger storage engine reported an error. This often indicates corrupt data files or underlying disk issues. Restore from backup if data is corrupted.
