Log-based Monitoring Recommendations

Use this reference to configure log-based alerts for each listed component by matching the specified log patterns within your alerting system. When an alert is triggered, follow the corresponding recommended action to diagnose and remediate the issue.

The log-based monitoring rules described in this section can be implemented using any standard log alerting platform (for example, Azure Monitor, AWS CloudWatch, Splunk, or Datadog). These examples are provided as guidance and do not represent an exhaustive set of conditions to monitor.
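As a rough sketch of how such rules translate into code, the Python below matches incoming log lines against a small rule table and returns a severity and action. The RULES entries, function name, and abbreviated action strings are illustrative only; a real deployment would express these rules in your alerting platform's native query language.

```python
import re

# Illustrative rule set: patterns mapped to severity and a short action hint.
# These two entries mirror OpenSearch rules from the tables below; extend as needed.
RULES = [
    {
        "pattern": re.compile(r"Cluster health status changed from \[(GREEN|YELLOW)\] to \[RED\]"),
        "severity": "Critical",
        "action": "Treat as an incident; check _cluster/health and _cat/shards.",
    },
    {
        "pattern": re.compile(r"flood stage disk watermark|disk usage exceeded flood-stage watermark"),
        "severity": "Critical",
        "action": "Free disk space on the affected node immediately.",
    },
]

def evaluate(log_line: str):
    """Return (severity, action) for the first matching rule, or None."""
    for rule in RULES:
        if rule["pattern"].search(log_line):
            return rule["severity"], rule["action"]
    return None
```

In practice, each ingested log line is passed through a matcher like `evaluate()`, and any non-None result raises an alert at the returned severity.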

OpenSearch

1. Cluster health & disk space

A (Severity: Critical)
Log text to monitor: Cluster health status changed from [GREEN] to [RED] OR Cluster health status changed from [YELLOW] to [RED]
Recommended action: Treat as an incident. Freeze risky changes and notify on-call. Use GET _cluster/health and _cat/shards to identify failing indices, then check node logs for crashes, disk-full conditions, or mapping/index corruption. Restore failed nodes or recover data from snapshots as needed.

B (Severity: Critical)
Log text to monitor: high disk watermark OR disk.watermark.high
Recommended action: Shard allocation to the node is stopped. Act immediately: free disk (delete old indices, move data, or extend volumes) or scale out with additional data nodes. Confirm that watermark thresholds fit your disk size and free-space policy.

C (Severity: Critical)
Log text to monitor: flood stage disk watermark OR disk usage exceeded flood-stage watermark
Recommended action: Cluster is at emergency disk level and will begin blocking writes to affected indices. Urgently free space on the node (delete indices, move shards, or increase storage). Only consider raising the flood-stage watermark after you have restored sufficient free disk.

D (Severity: Critical)
Log text to monitor: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block
Recommended action: Index has been set to read-only due to disk pressure. First resolve the disk issue (delete old indices, move data, increase storage). Once stable, clear the block with PUT /<index>/_settings { "index.blocks.read_only_allow_delete": false }. Add alerts on disk usage and shard relocation so you catch issues before flood stage.
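The three disk watermarks referenced above can also be checked programmatically. The sketch below classifies a node's disk usage against the default OpenSearch thresholds (low 85%, high 90%, flood stage 95%, all configurable via the cluster.routing.allocation.disk.watermark.* settings); the function itself is an illustrative helper, not part of any OpenSearch client.

```python
def watermark_state(used_fraction: float,
                    low: float = 0.85, high: float = 0.90, flood: float = 0.95) -> str:
    """Classify a node's disk usage fraction against the default OpenSearch
    disk watermarks. Override the thresholds if your cluster changes the
    cluster.routing.allocation.disk.watermark.* settings."""
    if used_fraction >= flood:
        return "flood_stage"   # writes to indices with shards on this node are blocked
    if used_fraction >= high:  # shards start relocating away from the node
        return "high"
    if used_fraction >= low:   # no new shards are allocated to the node
        return "low"
    return "ok"
```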

2. Memory

A (Severity: Critical)
Log text to monitor: OutOfMemoryError: Java heap space
Recommended action: Node has exhausted heap and crashed or is about to crash. Treat as Sev1. Check OS and JVM metrics, collect a heap dump if possible, and look for heavy aggregations or large fielddata usage. Reduce query complexity, increase heap within supported limits, and/or scale out with more data nodes.
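When right-sizing heap after an OutOfMemoryError, a widely cited starting point is to give the JVM about half of system RAM while staying below the compressed object pointer threshold of roughly 32 GB. The helper below encodes that rule of thumb; treat the numbers as common guidance, not OpenSearch-mandated limits.

```python
def suggested_heap_gb(ram_gb: float) -> float:
    """Rule-of-thumb JVM heap size for an OpenSearch data node: about half
    of system RAM, capped below ~32 GB so the JVM can keep using compressed
    object pointers. A starting point for tuning, not a hard rule."""
    return min(ram_gb / 2, 31.0)
```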

3. Shards, indices & allocation

A (Severity: Critical)
Log text to monitor: SearchPhaseExecutionException: all shards failed
Recommended action: Search requests are failing across all shards. Treat as high priority if impacting production queries. Inspect the full error (especially root_cause) for mapping issues, illegal arguments, or circuit breaker trips. Use _cat/shards and _cluster/allocation/explain to confirm shard health and fix the underlying resource or mapping problem before rerunning queries.

B (Severity: Critical)
Log text to monitor: Shard failed to start OR primary shard is not active OR Replica shard is not allocated
Recommended action: Shards for one or more indices are not coming online. Run _cat/shards and _cluster/allocation/explain to see unassigned shard reasons. Check node logs for disk, permission, or corruption errors. Restore indices from snapshots or reindex from source if shards cannot be recovered.

C (Severity: Critical)
Log text to monitor: IndexPrimaryShardNotAllocatedException
Recommended action: Primary shard hasn't been allocated, so the index is unavailable. Use the allocation explain API to find allocation constraints (node attributes, disk thresholds, incompatible versions). Fix the constraint and rerun allocation or recreate the index from snapshot.
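The _cat/shards output mentioned above can be scanned mechanically for unassigned shards before drilling into the allocation explain API. A minimal parser sketch, assuming the default _cat/shards column order (index, shard, prirep, state, ...):

```python
def unassigned_shards(cat_shards_output: str):
    """Parse the plain-text output of GET _cat/shards and return
    (index, shard, prirep) tuples for shards in state UNASSIGNED.
    Assumes the default column order: index shard prirep state ..."""
    rows = []
    for line in cat_shards_output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            rows.append((parts[0], parts[1], parts[2]))
    return rows
```

Each returned tuple identifies a shard to feed into _cluster/allocation/explain for the detailed reason it is unassigned.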

5. Security, TLS & authorization

A (Severity: Critical)
Log text to monitor: OpenSearchSecurityException[OpenSearch Security not initialized for __PATH__]
Recommended action: Security plugin has not successfully initialized and requests are blocked. Verify that security system indexes exist, and rerun the security admin / initialization procedure, watching for errors. Check certificate configuration and opensearch.yml security settings across all nodes to ensure they match.

B (Severity: Critical)
Log text to monitor: PKIX path building failed OR unable to find valid certification path to requested target OR SSLHandshakeException OR Received fatal alert: bad_certificate
Recommended action: TLS handshake has failed between client and OpenSearch or between nodes. Verify certificate chains (CA, intermediate, node certificates), expiration dates, and hostname/SANs. Ensure clients trust the issuing CA and that your opensearch.yml and client configuration reference the correct key/trust stores.
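Certificate expiry is a frequent root cause of the handshake failures above and can be checked proactively. The sketch below computes days to expiry from a notAfter string in the format printed by `openssl x509 -enddate` (the default format string is an assumption; adjust it to match your tooling's output):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, fmt: str = "%b %d %H:%M:%S %Y %Z") -> int:
    """Days until a certificate's notAfter timestamp, e.g.
    'May 30 10:48:38 2026 GMT' as printed by `openssl x509 -enddate`.
    A negative result means the certificate has already expired."""
    expiry = datetime.strptime(not_after, fmt).replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).days
```

Alerting well before this value reaches zero (for example at 30 days) avoids the bad_certificate failures entirely.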

6. Node discovery, cluster-manager election & cluster formation

A (Severity: Critical)
Log text to monitor: MasterNotDiscoveredException
Recommended action: Cluster cannot elect a cluster-manager node; reads/writes are blocked. Confirm that the expected number of cluster-manager-eligible nodes are running and reachable. Verify discovery.seed_hosts and cluster.initial_cluster_manager_nodes settings and check for network partition or split-brain scenarios.
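Elections require a strict majority (quorum) of cluster-manager-eligible nodes, which is why an odd number of eligible nodes (typically three) is recommended. A one-line helper for sanity-checking how many eligible nodes must remain reachable:

```python
def quorum(manager_eligible_nodes: int) -> int:
    """Minimum number of cluster-manager-eligible nodes that must be
    reachable for an election to succeed: a strict majority."""
    return manager_eligible_nodes // 2 + 1
```

With three eligible nodes, quorum is two, so the cluster tolerates the loss of one eligible node.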

7. OpenSearch Dashboards & client connectivity

A (Severity: Critical)
Log text to monitor: No living connections OR NoNodeAvailableException
Recommended action: Clients cannot connect to any OpenSearch node. Verify the OpenSearch endpoint, port, and TLS settings in the client configuration. Check that the cluster is up and healthy and that firewalls, security groups, or Kubernetes Services allow access from the client.

Kafka

1. Cluster Health & Availability

A (Severity: Critical)
Log text to monitor: Fatal error during KafkaServer startup
Recommended action: The broker failed to start properly. Check the server logs for the root cause (e.g., port conflict, configuration error). Ensure the broker configuration is correct and that the port is available.

B (Severity: Critical)
Log text to monitor: Authorization failed
Recommended action: Authorization was denied for a client request. Check ACLs and authorizer configuration to ensure legitimate clients have access. If unexpected, investigate a potential security breach or misconfiguration.

2. Disk & Storage

A (Severity: Critical)
Log text to monitor: Error while writing meta.properties
Recommended action: The broker encountered an error writing to the meta.properties file in a log directory. This usually indicates a disk failure or permissions issue. Check the disk health and file permissions for the log directories.

B (Severity: Critical)
Log text to monitor: Failed to create or validate data directory
Recommended action: Kafka could not create or validate one of its data directories. Verify that the directory exists, permissions are correct, and the filesystem is mounted and writable.

C (Severity: Critical)
Log text to monitor: Disk error while locking directory
Recommended action: The broker failed to lock a log directory, possibly because another process is using it or the filesystem is read-only/corrupted. Ensure no other Kafka process is running on the same data directory.

3. Replication & Consistency

A (Severity: Critical)
Log text to monitor: under replicated
Recommended action: Partitions are under-replicated, meaning not all replicas are in sync. This increases the risk of data loss. Check network connectivity between brokers and ensure all brokers are running and healthy.
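Under-replicated partitions can also be listed directly with `kafka-topics.sh --describe --under-replicated-partitions`. For log-driven tooling, the sketch below parses the tab-separated --describe output and flags partitions whose in-sync replica set (ISR) is smaller than the replica set; the column layout is assumed from the standard CLI output.

```python
def under_replicated(describe_output: str):
    """Parse `kafka-topics.sh --describe` output and return
    (topic, partition) pairs where the ISR is smaller than the replica set.
    Assumes the standard tab-separated 'Key: value' partition lines."""
    results = []
    for line in describe_output.splitlines():
        if "Replicas:" not in line or "Isr:" not in line:
            continue  # skip topic summary lines and blanks
        fields = dict(
            part.split(": ", 1)
            for part in line.strip().split("\t")
            if ": " in part
        )
        replicas = fields["Replicas"].split(",")
        isr = fields["Isr"].split(",")
        if len(isr) < len(replicas):
            results.append((fields["Topic"], fields["Partition"]))
    return results
```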

Fluentd

1. Buffer & Performance

A (Severity: Critical)
Log text to monitor: BufferOverflowError OR BufferQueueLimitError
Recommended action: The buffer is full and Fluentd typically stops accepting new logs or drops them. Check if the output destination is down or slow. Consider increasing the buffer limits (chunk_limit_size / total_limit_size) or scaling out consumers.

B (Severity: Critical)
Log text to monitor: failed to flush
Recommended action: Fluentd failed to flush the buffer to the destination. This might be due to network issues, authentication failures, or destination downtime. Check the error message details for the root cause.
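When tuning chunk_limit_size and total_limit_size after a BufferOverflowError, it helps to know roughly how many chunks the buffer can queue before it fills. A small illustrative helper, assuming Fluentd's k/m/g size suffixes:

```python
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(value: str) -> int:
    """Parse a Fluentd-style size string such as '8m' or '512k' into bytes;
    a bare number is taken as bytes."""
    value = value.strip().lower()
    if value[-1] in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[value[-1]])
    return int(value)

def max_buffer_chunks(total_limit: str, chunk_limit: str) -> int:
    """Approximate number of full chunks the buffer can hold before
    overflow errors start: total_limit_size divided by chunk_limit_size."""
    return parse_size(total_limit) // parse_size(chunk_limit)
```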

2. Configuration & Startup

A (Severity: Critical)
Log text to monitor: config error
Recommended action: Fluentd encountered a configuration error and likely failed to start or load a plugin. Verify the configuration syntax, plugin names, and required parameters.

B (Severity: Critical)
Log text to monitor: Address already in use
Recommended action: Fluentd failed to bind to a network port because it is already in use. Check if another Fluentd instance or service is running on the same port.

3. Processing

A (Severity: Critical)
Log text to monitor: pattern not matched
Recommended action: The input or filter parser could not match the incoming log line against the specified pattern. This usually results in data loss or unparsed records. Review the log format and the regex/pattern in the configuration.
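Before deploying a parser change, candidate log lines can be checked against the pattern offline. The sketch below uses a hypothetical timestamp/level/message pattern; note that Fluentd parsers use Ruby regexes, whose named groups are written (?<name>...) rather than Python's (?P<name>...), so this is a quick sanity check rather than a byte-for-byte equivalent.

```python
import re

# Hypothetical parser pattern for lines like
# "2024-05-01 12:00:00 ERROR something broke".
PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)$"
)

def parses(line: str):
    """Return the parsed fields as a dict, or None if the line would
    trigger Fluentd's 'pattern not matched' warning."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None
```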

MongoDB

1. Cluster Health & Crash

A (Severity: Critical)
Log text to monitor: Fatal assertion OR aborting after fassert() failure
Recommended action: The MongoDB process has encountered a fatal error and is shutting down. Review the stack trace in the logs. Restart the process and check for data consistency.

B (Severity: Critical)
Log text to monitor: Out of memory
Recommended action: The process is running out of memory. Check memory usage trends and consider scaling up resources or optimizing queries/indexes.

2. Storage

A (Severity: Critical)
Log text to monitor: WiredTiger error
Recommended action: The WiredTiger storage engine reported an error. This often indicates corrupt data files or underlying disk issues. Restore from backup if data is corrupted.
