Log-based Monitoring Recommendations
Use this reference to configure log-based alerts for each listed component by matching the specified log patterns within your alerting system. When an alert is triggered, follow the corresponding recommended action to diagnose and remediate the issue.
The log-based monitoring rules described in this section can be implemented using any standard log alerting platform (for example, Azure Monitor, AWS CloudWatch, Splunk, or Datadog). These examples are provided as guidance and do not represent an exhaustive set of conditions to monitor.
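Independent of the platform, each rule in the tables below reduces to a log pattern mapped to a severity. A minimal sketch of that matching step in Python (the rule list is abbreviated and the names are hypothetical; real deployments would use their platform's native query language):

```python
import re

# Hypothetical rule set: regex over raw log lines -> alert severity.
RULES = [
    (re.compile(r"health status changed from \[\w+\] to \[RED\]"), "critical"),
    (re.compile(r"flood stage disk watermark .* exceeded"), "critical"),
    (re.compile(r"java\.lang\.OutOfMemoryError"), "critical"),
]

def classify(line):
    """Return the severity of the first matching rule, or None."""
    for pattern, severity in RULES:
        if pattern.search(line):
            return severity
    return None

print(classify("Cluster health status changed from [YELLOW] to [RED]"))  # critical
```

The "Log text to monitor" column in each table below supplies the patterns; the "Recommended action" column is what the resulting alert should link to.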
OpenSearch
1. Cluster health & disk space
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `Cluster health status changed from [YELLOW] to [RED]` | Treat as an incident. Freeze risky changes and notify on-call. Use `GET _cluster/health` and `GET _cluster/allocation/explain` to identify the unassigned primary shards and their cause. | Critical |
| B | `high disk watermark [90%] exceeded` | Shard allocation to the node is stopped. Act immediately: free disk (delete old indices, move data, or extend volumes) or scale out with additional data nodes. Confirm that watermark thresholds fit your disk size and free-space policy. | Critical |
| C | `flood stage disk watermark [95%] exceeded` | Cluster is at emergency disk level and will begin blocking writes to affected indices. Urgently free space on the node (delete indices, move shards, or increase storage). Only consider raising the flood-stage watermark after you have restored sufficient free disk. | Critical |
| D | `index has read-only-allow-delete block` | Index has been set to read-only due to disk pressure. First resolve the disk issue (delete old indices, move data, increase storage). Once stable, clear the block by setting `index.blocks.read_only_allow_delete` to `null` via the index settings API. | Critical |
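The three disk thresholds in rows B–D correspond to OpenSearch's low/high/flood-stage watermarks, which default to 85/90/95% of disk used and are configurable via `cluster.routing.allocation.disk.watermark.*`. A minimal sketch of classifying a node's disk usage against those defaults, for use in a complementary metric-based alert:

```python
def disk_watermark_status(used_pct: float,
                          low: float = 85.0,
                          high: float = 90.0,
                          flood: float = 95.0) -> str:
    """Classify disk usage against OpenSearch's default watermark thresholds."""
    if used_pct >= flood:
        return "flood_stage"  # writes to affected indices are blocked
    if used_pct >= high:
        return "high"         # shards are relocated away from the node
    if used_pct >= low:
        return "low"          # no new shards are allocated to the node
    return "ok"

print(disk_watermark_status(96.2))  # flood_stage
```

If you have overridden the watermark settings on your cluster, pass the same values here so the alert and the cluster agree.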
2. Memory
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `java.lang.OutOfMemoryError: Java heap space` | Node has exhausted heap and crashed or is about to crash. Treat as Sev1. Check OS and JVM metrics, collect a heap dump if possible, and look for heavy aggregations or large fielddata usage. Reduce query complexity, increase heap within supported limits, and/or scale out with more data nodes. | Critical |
3. Shards, indices & allocation
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `all shards failed` | Search requests are failing across all shards. Treat as high priority if impacting production queries. Inspect the full error (especially the per-shard failures and their `caused_by` reasons) to identify the failing index and nodes, then check node health and shard allocation. | Critical |
| B | `marking and sending shard failed` | Shards for one or more indices are not coming online. Run `GET _cluster/allocation/explain` to see why the shards cannot be assigned, fix the underlying cause (disk, node failure, allocation settings), then retry with `POST _cluster/reroute?retry_failed=true`. | Critical |
| C | `primary shard is not active` | Primary shard hasn't been allocated, so the index is unavailable. Use the allocation explain API to find allocation constraints (node attributes, disk thresholds, incompatible versions). Fix the constraint and rerun allocation, or recreate the index from snapshot. | Critical |
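When acting on rows B and C, the allocation explain API returns the fields needed to route the alert to a cause. A sketch of summarizing such a response for an alert annotation (the sample below is a hypothetical, trimmed `GET _cluster/allocation/explain` response, not output from a real cluster):

```python
import json

# Hypothetical sample response, trimmed to the fields relevant here.
sample = json.loads("""
{
  "index": "logs-2024.01.01",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {"reason": "NODE_LEFT"},
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes"
}
""")

def summarize(explain: dict) -> str:
    """Condense an allocation-explain response into one alert-friendly line."""
    shard = f"{explain['index']}[{explain['shard']}]"
    reason = explain.get("unassigned_info", {}).get("reason", "unknown")
    return f"{shard} unassigned ({reason}): {explain.get('allocate_explanation', '')}"

print(summarize(sample))
```

The `unassigned_info.reason` value (for example `NODE_LEFT` or `ALLOCATION_FAILED`) usually tells you whether to restore a node or retry allocation.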
4. Security, TLS & authorization
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `OpenSearch Security not initialized` | Security plugin has not successfully initialized and requests are blocked. Verify that the security system indices exist, and rerun the security admin / initialization procedure, watching for errors. Check certificate configuration and the security settings in `opensearch.yml`. | Critical |
| B | `javax.net.ssl.SSLHandshakeException` | TLS handshake has failed between client and OpenSearch or between nodes. Verify certificate chains (CA, intermediate, node certificates), expiration dates, and hostname/SANs. Ensure clients trust the issuing CA and that your TLS protocol versions and cipher suites are compatible on both ends. | Critical |
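Certificate expiry is the most common cause of row B, and it can be alerted on before the handshake starts failing. A small sketch using the Python standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` format returned by `ssl.getpeercert()`:

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after uses the 'notAfter' format from ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2025 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400

# Deterministic example with a fixed "now":
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2025 GMT")
print(round(days_until_expiry("Jan 31 00:00:00 2025 GMT", now=now)))  # 30
```

Paging at, say, 30 and 7 days remaining turns a Critical TLS outage into routine certificate rotation.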
5. Node discovery, cluster-manager election & cluster formation
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `cluster-manager not discovered or elected yet` | Cluster cannot elect a cluster-manager node; reads/writes are blocked. Confirm that the expected number of cluster-manager-eligible nodes are running and reachable. Verify `discovery.seed_hosts` and `cluster.initial_cluster_manager_nodes`, and check network connectivity between the nodes. | Critical |
6. OpenSearch Dashboards & client connectivity
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `No living connections` | Clients cannot connect to any OpenSearch node. Verify the OpenSearch endpoint, port, and TLS settings in the client configuration. Check that the cluster is up and healthy and that firewalls, security groups, or Kubernetes Services allow access from the client. | Critical |
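A plain TCP probe is often the fastest way to separate "cluster is down" from "network path is blocked" when this alert fires. A minimal sketch (it demonstrates itself against a throwaway loopback listener rather than a real cluster endpoint):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Basic TCP reachability probe for a client-to-cluster path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstrate against a throwaway local listener:
server = socket.socket()
server.bind(("127.0.0.1", 0))   # OS picks a free port
server.listen(1)
host, port = server.getsockname()
print(can_connect(host, port))  # True
server.close()
```

If the probe succeeds from the client host but Dashboards still reports no connections, the problem is above TCP: TLS settings, authentication, or the configured endpoint URL.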
Kafka
1. Cluster Health & Availability
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `Fatal error during KafkaServer startup. Prepare to shutdown` | The broker failed to start properly. Check the server logs for the root cause (e.g., port conflict, configuration error). Ensure the broker configuration is correct and that the port is available. | Critical |
| B | `is Denied Operation` | Authorization was denied for a client request. Check ACLs and authorizer configuration to ensure legitimate clients have access. If unexpected, investigate a potential security breach or misconfiguration. | Critical |
2. Disk & Storage
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `KafkaStorageException` | The broker encountered an error writing to the log directory. Check disk health, free space, and filesystem errors; a failed log directory takes its partitions offline. | Critical |
| B | `Failed to create or validate data directory` | Kafka could not create or validate one of its data directories. Verify that the directory exists, permissions are correct, and the filesystem is mounted and writable. | Critical |
| C | `Failed to acquire lock on file .lock` | The broker failed to lock a log directory, possibly because another process is using it or the filesystem is read-only/corrupted. Ensure no other Kafka process is running on the same data directory. | Critical |
3. Replication & Consistency
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `Shrinking ISR` | Partitions are under-replicated, meaning not all replicas are in sync. This increases the risk of data loss. Check network connectivity between brokers and ensure all brokers are running and healthy. | Critical |
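An alert on ISR shrinkage is more actionable if it names the partition and the replicas that fell out of sync. A sketch of extracting both from the broker log line; this assumes the `[Partition <topic-partition> broker=<id>] Shrinking ISR from ... to ...` shape, which can vary across Kafka versions:

```python
import re

ISR_RE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Shrinking ISR from (?P<before>[\d,]+) to (?P<after>[\d,]+)"
)

def lost_replicas(line):
    """Return (partition, replicas removed from the ISR), or None if no match."""
    m = ISR_RE.search(line)
    if not m:
        return None
    before = set(m.group("before").split(","))
    after = set(m.group("after").split(","))
    return m.group("partition"), sorted(before - after)

line = "[Partition orders-3 broker=1] Shrinking ISR from 1,2,3 to 1"
print(lost_replicas(line))  # ('orders-3', ['2', '3'])
```

The removed broker IDs point directly at the replicas (and hosts) to investigate first.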
Fluentd
1. Buffer & Performance
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `buffer space has too many data` | The buffer is full and Fluentd typically stops accepting new logs or drops them. Check whether the output destination is down or slow. Consider increasing the buffer size (`total_limit_size`) or adding flush threads (`flush_thread_count`). | Critical |
| B | `failed to flush the buffer` | Fluentd failed to flush the buffer to the destination. This might be due to network issues, authentication failures, or destination downtime. Check the error message details for the root cause. | Critical |
2. Configuration & Startup
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `error_class=Fluent::ConfigError` | Fluentd encountered a configuration error and likely failed to start or load a plugin. Verify the configuration syntax, plugin names, and required parameters. | Critical |
| B | `Address already in use` | Fluentd failed to bind to a network port because it is already in use. Check if another Fluentd instance or service is running on the same port. | Critical |
3. Processing
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `pattern not matched` | The input or filter parser could not match the incoming log line against the specified pattern. This usually results in data loss or unparsed records. Review the log format and the regex/pattern in the configuration. | Critical |
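Because a single unparsed line is usually noise while a sustained stream of them means a format change, this rule is best alerted with a count threshold over a window. A minimal sketch (the threshold and window contents are illustrative):

```python
def parse_failure_alert(fluentd_log_lines, threshold=10):
    """Alert when Fluentd reports too many unparsed records in one window.

    Returns (should_alert, failure_count)."""
    failures = sum(1 for line in fluentd_log_lines
                   if "pattern not matched" in line)
    return failures >= threshold, failures

window = [
    '2024-01-01 00:00:00 +0000 [warn]: #0 pattern not matched data="oops"',
] * 12 + ["2024-01-01 00:00:01 +0000 [info]: flushing buffer"]
print(parse_failure_alert(window, threshold=10))  # (True, 12)
```

Most alerting platforms express the same idea natively as a count-over-time condition on the matched pattern.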
MongoDB
1. Cluster Health & Crash
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `Fatal assertion` | The MongoDB process has encountered a fatal error and is shutting down. Review the stack trace in the logs. Restart the process and check for data consistency. | Critical |
| B | `out of memory` | The process is running out of memory. Check memory usage trends and consider scaling up resources or optimizing queries/indexes. | Critical |
2. Storage
| Index | Log text to monitor | Recommended action | Severity |
|---|---|---|---|
| A | `WiredTiger error` | The WiredTiger storage engine reported an error. This often indicates corrupt data files or underlying disk issues. Restore from backup if data is corrupted. | Critical |
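Since MongoDB 4.4, mongod writes structured JSON log entries, so the patterns above can also be matched on the parsed severity field (`"s"`: `F` fatal, `E` error) rather than raw text. A sketch of filtering severe entries (the sample line is hypothetical, trimmed to the standard `t`/`s`/`c`/`msg` fields):

```python
import json

def severe_events(log_lines):
    """Extract fatal/error entries from mongod's structured JSON log
    (the default log format since MongoDB 4.4)."""
    events = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. startup banners)
        if entry.get("s") in ("F", "E"):
            events.append((entry.get("c"), entry.get("msg")))
    return events

sample = ('{"t":{"$date":"2024-01-01T00:00:00Z"},'
          '"s":"F","c":"STORAGE","msg":"Fatal assertion"}')
print(severe_events([sample]))  # [('STORAGE', 'Fatal assertion')]
```

Matching on severity plus component (`"c":"STORAGE"` for WiredTiger issues) is more robust against message wording changes between MongoDB versions.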