On-Prem Troubleshooting Guide
This troubleshooting guide provides detailed procedures for diagnosing and resolving issues with the Keyfactor AgileSec Analytics Platform. It covers both single-node and multi-node deployments, with a focus on root cause analysis and resolution steps.
1. Introduction
1.1 Platform Architecture Overview
The platform consists of three service tiers:
Infrastructure Layer:
OpenSearch - Search and analytics engine
OpenSearch Dashboards - Web interface for OpenSearch
MongoDB - Operational data store with replica set support
Kafka - Message queue broker (KRaft mode)
Supporting Services:
HAProxy - Load balancer and reverse proxy
Fluentd (td-agent) - Ships data from Kafka to OpenSearch
Application Microservices:
Java Backend Services: ingestion, scheduler, sm (Security Manager), analytics-manager
Node.js Frontend Services: webui, api, cbom
1.2 Deployment Types
Type | Description |
|---|---|
Single-Node | All services on one server |
Multi-Node | Distributed deployment with PRIMARY_FULL_BACKEND, FULL_BACKEND, FRONTEND, and SCAN nodes |
1.3 Service Dependencies
Service : Dependencies
opensearch : (no dependencies)
opensearch-dashboards : opensearch
mongodb : (no dependencies)
kafka : (no dependencies)
td-agent : opensearch, kafka
scheduler : opensearch, mongodb, kafka
analytics-manager : mongodb, kafka
ingestion : kafka
webui : api
api : opensearch, mongodb, kafka, cbom, sm
cbom : opensearch, mongodb, kafka, sm
sm : (no dependencies)
haproxy : (no dependencies)
2. Log File Locations and Rotation
2.1 Log Directory Structure
All logs are stored under $installation_path/logs/.
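A quick way to see recent activity and follow a log live (the file name placeholder is illustrative; pick the log you need from the tables below):
# List the most recently written logs
ls -lt $installation_path/logs/ | head
# Follow a specific log in real time
tail -f $installation_path/logs/<service-log-file>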
2.2 Infrastructure Service Logs
Quick platform logs: These are the main log files for each infrastructure service.
Service | Log Location | Description |
|---|---|---|
OpenSearch | | Cluster logs, slow queries, deprecation warnings |
MongoDB | | Server logs, query logs |
Kafka | | Broker logs, controller logs |
Full platform logs: These contain the detailed logs for each service.
Service | Log Location | Description |
|---|---|---|
OpenSearch | | Cluster logs, slow queries, deprecation warnings |
MongoDB | | Server logs, query logs |
Kafka | | Broker logs, controller logs |
2.3 Application Microservice Logs
Service | Log Location |
|---|---|
webui | |
api | |
cbom | |
sm | |
analytics-manager | |
ingestion | |
scheduler | |
2.4 Supporting Service Logs
Service | Log Location |
|---|---|
HAProxy | |
Fluentd (td-agent) | |
2.5 Management and Health Check Logs
Log File | Purpose |
|---|---|
| Automated health check results and service restart attempts |
2.6 Log Rotation Configuration
OpenSearch Log Rotation: OpenSearch handles its own log rotation through Log4j2. Configure it in $installation_path/services/opensearch/config/log4j2.properties, for example:
logger.deprecation.level = warn
appender.rolling.type = RollingFile
appender.rolling.policies.size.size = 100MB
System Log Rotation: For application logs, you can configure logrotate manually by creating a configuration file (e.g., /etc/logrotate.d/kf-agilesec) similar to the example below. Adjust log-retention period based on your organizational policies.
Important: You must replace <installation_path>, <your_user>, and <your_group> with actual values from your environment.
<installation_path>/logs/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
copytruncate
create 0640 <your_user> <your_group>
}
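To validate the configuration before relying on it, logrotate's debug flag performs a dry run; once the output looks right, a forced rotation verifies permissions and copytruncate behavior:
# Dry run: show what would be rotated, change nothing
sudo logrotate -d /etc/logrotate.d/kf-agilesec
# Force a one-off rotation to confirm it works end to end
sudo logrotate -f /etc/logrotate.d/kf-agilesec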
3. Service Status Inspection Commands
3.1 Using the Unified Management Script
The primary tool for service management is manage.sh:
cd $installation_path
./scripts/manage.sh <action> [options] [service1 service2 ...]
Available Actions:
Action | Description |
|---|---|
start | Start services |
stop | Stop services |
restart | Stop and then start services |
reload | Reload service configuration (where supported) |
status | Check status of services |
list | List available services |
Options:
Option | Description |
|---|---|
-d | Enable debug mode (show service output in console) |
 | Enable silent mode (no output displayed) |
3.2 Common Status Commands
# Check all services status
./scripts/manage.sh status
# Check specific service status
./scripts/manage.sh status opensearch
./scripts/manage.sh status mongodb kafka
# Start all services
./scripts/manage.sh start
# Start specific service with debug output
./scripts/manage.sh start -d opensearch
# Stop specific services
./scripts/manage.sh stop haproxy td-agent
# Restart a service
./scripts/manage.sh restart scheduler
# Reload HAProxy configuration
./scripts/manage.sh reload haproxy
3.3 Understanding Status Output
The status command shows:
Service name
Running/Not running status
Process ID (if running)
Example Output:
2025-01-15 10:30:45 [INFO] Service Status
opensearch Running (PID: 12345)
mongodb Running (PID: 12346)
kafka Running (PID: 12347)
webui Not running
api Running (PID: 12349)
3.4 Systemd Service Status
The platform uses a systemd service for automatic startup:
# Check systemd service status
sudo systemctl status kf_analytics.service
# Enable automatic startup
sudo systemctl enable kf_analytics.service
# View service logs
sudo journalctl -u kf_analytics.service -f
4. Common Failure Scenarios and Fixes
This section provides detailed root cause analysis and resolution steps for the most common issues.
4.1 OpenSearch Failures
4.1.1 Service Fails to Start
Symptoms:
OpenSearch process exits immediately after starting
No listening on port 9200
Log shows "Unable to lock JVM Memory" or certificate errors
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Memory lock failure | "Unable to lock JVM Memory" | Configure ulimits (see below) |
Insufficient heap | "OutOfMemoryError" | Increase OPENSEARCH_JAVA_OPTS |
Certificate errors | "SSLHandshakeException" | Validate certificate paths |
Port already in use | "Address already in use" | Kill conflicting process |
Resolution - Memory Lock:
# Check current limits
ulimit -l
# Run tune.sh command to adjust system settings
cd <installer_directory>
sudo ./scripts/tune.sh
# Important: Logout/Login to refresh session
Resolution - Heap Size:
# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options
# Modify -Xms and -Xmx values
# Example:
# -Xms32g
# -Xmx32g
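Resolution - Port Conflict:
If the log shows "Address already in use", a minimal sketch to find and clear the conflicting process (9200 and 9300 are the standard OpenSearch HTTP and transport ports; adjust if your deployment differs):
# Identify the process holding the OpenSearch ports
ss -tlnp | grep -E '(9200|9300)'
# Kill the conflicting process (only if it is not another legitimate service)
fuser -k 9200/tcp
# Restart OpenSearch
./scripts/manage.sh start opensearch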
4.1.2 Cluster State RED
Symptoms:
API returns cluster health status as "red"
Some indices are unavailable
Write operations failing
Root Cause Analysis:
Unassigned primary shards
Node disconnection in multi-node setup
Disk space exhaustion
Diagnostic Commands:
# Set certificate variables
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Check cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty'
# Check shard allocation explanation
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/allocation/explain?pretty'
# List unassigned shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
Resolution:
# Enable shard allocation (if disabled)
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_cluster/settings' \
-H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}'
# Reroute stuck shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X POST 'https://127.0.0.1:9200/_cluster/reroute?retry_failed=true'
4.1.3 Certificate Authentication Failure
Symptoms:
"SSLHandshakeException" in logs
"certificate verify failed" errors
Services cannot connect to OpenSearch
Root Cause Analysis:
Certificate expired
Wrong CA certificate
Certificate chain incomplete
DN not in allowed list
Diagnostic Commands:
# Important:
# - Replace <client-cert-path> with actual path
# - Replace <ca-cert-path> with actual path
export CLIENT_CERT=<client-cert-path>
export CA_CERT=<ca-cert-path>
# Check certificate expiration
openssl x509 -in $CLIENT_CERT -noout -dates
# Verify certificate chain
openssl verify -CAfile $CA_CERT $CLIENT_CERT
# Check certificate subject/issuer
openssl x509 -in $CLIENT_CERT -noout -subject -issuer
Resolution:
Regenerate certificates if expired
Ensure CA certificate matches the one used to sign client/server certificates
Verify plugins.security.nodes_dn and plugins.security.authcz.admin_dn in opensearch.yml (a quick check is sketched below)
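A quick way to compare the configured DN lists against the subject your client certificate actually presents (assuming the default config location shown in Section 2.6):
# Show the allowed node and admin DNs
grep -A 3 -E '(nodes_dn|admin_dn)' $installation_path/services/opensearch/config/opensearch.yml
# Show the subject DN of the client certificate
openssl x509 -in $CLIENT_CERT -noout -subject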
4.1.4 JVM Heap Exhaustion
Symptoms:
"OutOfMemoryError: Java heap space" in logs
Service becomes unresponsive
Frequent garbage collection pauses
Root Cause Analysis:
Heap size too small for data volume
Large aggregation queries
Resolution:
# Check current heap usage
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_nodes/stats/jvm?pretty' | grep -A 10 "heap"
# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options
# Modify -Xms and -Xmx values
# Example:
# -Xms32g
# -Xmx32g
# Restart OpenSearch
./scripts/manage.sh restart opensearch
4.1.5 Disk Space Exhaustion
Symptoms:
Write operations rejected
"disk watermark exceeded" in logs
Index status becomes read-only
Root Cause Analysis:
Data growth exceeding available disk
Log accumulation
Old indices not cleaned up
Diagnostic Commands:
# Check disk usage
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Check disk usage on each node
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail,disk.total'
Resolution:
Increase available disk space (expand the volume or remove unneeded data) on any node above the watermark; then clear the read-only block if OpenSearch applied one, as sketched below
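If OpenSearch switched indices to read-only after hitting the flood-stage watermark, a hedged sketch to remove the block once space has been freed (index.blocks.read_only_allow_delete is the standard setting; verify against your OpenSearch version):
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_all/_settings' \
-H 'Content-Type: application/json' -d '{
"index.blocks.read_only_allow_delete": null
}'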
4.2 MongoDB Failures
4.2.1 Service Fails to Start
Symptoms:
mongod process exits immediately
"permission denied" or socket errors in logs
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Port in use | "Address already in use" | Kill conflicting process |
TLS certificate issues | "cannot read certificate" | Check certificate paths |
Lock file exists | "Unable to acquire lock" | Remove stale lock file |
Resolution - Lock File:
# Remove stale lock file (only if mongod is not running)
rm -f $installation_path/data/mongodb/mongod.lock
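A minimal pre-check so the lock file is only removed while mongod is genuinely stopped:
# Confirm no mongod process is running before removing the lock file
pgrep -af mongod || echo "mongod is not running; safe to remove the lock file"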
4.2.2 TLS/mTLS Connection Failure
Symptoms:
"SSL peer certificate validation failed"
"certificate verify failed"
Services cannot connect to MongoDB
Root Cause Analysis:
Client certificate not trusted by server
CA mismatch between client and server
Certificate expired
Wrong certificate key file
Diagnostic Commands:
# Test MongoDB connection with TLS
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# - Replace "<Path to installer-directory>" with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem \
--host 127.0.0.1 \
--authenticationMechanism MONGODB-X509 \
--port 27017 \
--eval 'db.runCommand({ ping: 1 })'
Resolution:
Verify tlsCAFile path in mongod.conf matches the CA used to sign client certificates
Ensure tlsCertificateKeyFile contains both certificate and key
Check certificate expiration dates (a quick check is sketched below)
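A hedged expiry and chain check on the shared client combo file (openssl reads the first certificate in a combined PEM):
# Check expiration and subject of the client certificate
openssl x509 -in $installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem -noout -dates -subject
# Verify it chains to the platform CA
openssl verify -CAfile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
$installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem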
4.3 Kafka Failures
4.3.1 Broker Fails to Start
Symptoms:
Kafka process exits immediately
"Address already in use" error
KRaft controller election fails
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Port conflict | "Address already in use" | Check ports 9092, 9093, 9094 |
KRaft init failure | "Cluster ID mismatch" | Check cluster_id.txt |
SSL configuration | "SSL handshake failed" | Verify keystore paths |
Insufficient disk | "No space left on device" | Free disk space |
Diagnostic Commands:
# Check if ports are in use
ss -tlnp | grep -E '(9092|9093|9094)'
# Check Kafka logs
tail -100 $installation_path/services/kafka_*/logs/server.log
Resolution - Port Conflict:
# Find and kill process using the port
fuser -k 9092/tcp
fuser -k 9093/tcp
fuser -k 9094/tcp
# Restart Kafka
./scripts/manage.sh start kafka
4.4 Java Microservice Failures
Applies to: scheduler, sm, ingestion, analytics-manager
4.4.1 Kafka Connection Failure
Symptoms:
Java microservice logs show errors matching one of the exception patterns below:
"TimeoutException"
"DisconnectException"
"SerializationException"
"DeserializationException"
"CommitFailedException"
"AuthorizationException"
"SaslAuthenticationException"
Root Cause Analysis:
Kafka broker not running or not healthy
SSL configuration mismatch
Kafka broker CPU or memory at peak utilization
Network connectivity issues
Resolution:
Verify Kafka is running: ./scripts/manage.sh status kafka
Check that Kafka is healthy (see Section 5.3 for Kafka health checks)
Check Kafka load (CPU, disk, memory usage)
Verify that the SSL certificates match between Kafka and the microservice
Check network connectivity between the microservice and the Kafka brokers (a TLS probe is sketched below)
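A minimal TLS reachability probe, assuming the broker listens on 127.0.0.1:9092 and the shared client certificate paths used in Section 5.3; a completed handshake rules out basic network and SSL problems:
openssl s_client -connect 127.0.0.1:9092 \
-CAfile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
-cert $installation_path/certificates/$analytics_internal_domain/shared-client-cert.pem \
-key $installation_path/certificates/$analytics_internal_domain/shared-client-key.pem \
</dev/null 2>/dev/null | grep -E 'Verify return code|subject='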
4.4.2 MongoDB Connection Failure
Symptoms:
Java microservice logs show errors matching one of the exception patterns below:
"MongoTimeoutException"
"MongoSocketOpenException"
"MongoSocketReadException"
"Timed out while waiting for a server"
"Connection refused"
"No server chosen by ReadPreference"
"MongoSecurityException"
"Authentication failed"
"Unauthorized"
"not authorized on .* to execute"
Resolution:
Verify MongoDB is running: ./scripts/manage.sh status mongodb
Check that MongoDB is healthy (see Section 5.2 for MongoDB health checks)
Check MongoDB load (CPU, disk, memory usage)
Verify that the SSL certificates match between MongoDB and the microservice
Check network connectivity between the microservice and the MongoDB cluster
4.4.3 SM Service Keystore Failure
Symptoms:
"Cannot load keystore"
"Keystore was tampered with, or password was incorrect"
Resolution:
# Verify keystore exists and is readable
# Important: Replace </path/to/installation> with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export sm_keystore_pass=$(cat $installation_path/certificates/$analytics_internal_domain/sm-service-keystore.pass)
ls -la $installation_path/certificates/$analytics_internal_domain/sm-service.p12
# Verify password
$installation_path/bin/java/bin/keytool -list -keystore $installation_path/certificates/$analytics_internal_domain/sm-service.p12 -storepass "$sm_keystore_pass"
4.5 Node.js Microservice Failures
Applies to: webui, api, cbom
4.5.1 Kafka Connection Failure
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"KafkaJSConnectionError"
"SSL alert number"
Root Cause Analysis:
Kafka broker not running or not healthy
SSL configuration mismatch
Kafka broker CPU or memory at peak utilization
Network connectivity issues
Resolution:
Verify Kafka is running: ./scripts/manage.sh status kafka
Check that Kafka is healthy (see Section 5.3 for Kafka health checks)
Check Kafka load (CPU, disk, memory usage)
Verify that the SSL certificates match between Kafka and the microservice
Check network connectivity between the microservice and the Kafka brokers (the TLS probe sketched in Section 4.4.1 applies here as well)
4.5.2 MongoDB Connection Errors
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"MongooseError"
"buffering timed out after"
Resolution:
Verify MongoDB is running: ./scripts/manage.sh status mongodb
Check that MongoDB is healthy (see Section 5.2 for MongoDB health checks)
Check MongoDB load (CPU, disk, memory usage)
Verify that the SSL certificates match between MongoDB and the microservice
Check network connectivity between the microservice and the MongoDB cluster
4.5.3 OpenSearch Connection Errors
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"Query open-search error"
"ECONNREFUSED"
Resolution:
Verify OpenSearch is running: ./scripts/manage.sh status opensearch
Check that OpenSearch is healthy (see Section 5.1 for OpenSearch health checks)
Check OpenSearch load (CPU, disk, memory usage)
Verify that the SSL certificates match between OpenSearch and the microservice
Check network connectivity between the microservice and the OpenSearch cluster (a quick probe is sketched below)
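A minimal reachability probe from the microservice host, reusing the certificate variables from Section 5.1; any HTTP status code (even 401) proves the port is reachable, while ECONNREFUSED points to the service or a firewall:
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-o /dev/null -w '%{http_code}\n' 'https://127.0.0.1:9200/'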
5. Cluster Health Diagnostics
5.1 OpenSearch Cluster Health
Using curl with certificate authentication for cluster diagnostics
# Set up certificate environment
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty'
# Node status
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/nodes?v&h=ip,name,role,master,heap.percent,disk.used_percent'
# Cluster statistics
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/stats?pretty'
# Index health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/indices?v'
Using OpenSearch Dashboards for cluster diagnostics
OpenSearch Dashboards' Dev Tools console provides additional cluster diagnostic capabilities:
Use GET /_cat/nodes?v and GET /_cat/indices?v for detailed node and index status
Use GET /_cluster/health for cluster health status
Use GET /_cluster/stats for cluster statistics
Use GET /_cat/shards?v for shard allocation status
Monitor for yellow/red health indicators, which may indicate shard allocation issues
Pay attention to unassigned shards and disk usage percentages (look for nodes with high disk usage, especially above 80%)
Check for any nodes showing "UNREACHABLE" or "DISCONNECTED" status
Look for nodes with high memory usage (above 85%) which may indicate performance issues
Watch for nodes with high CPU usage (above 90%) which may indicate resource contention
Check for any nodes with excessive load averages that may indicate system overload
5.2 MongoDB Replica Set Health
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# - Replace "<Path to installer-directory>" with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"
# Connect and check replica set status
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017 \
--eval 'rs.status()'
# Check replica set members
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017 \
--quiet \
--eval 'const h=db.hello(); print("Members:", h.hosts.length); printjson(h.hosts)'
5.3 Kafka Cluster Health
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# 1. Set certificate and key paths
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
CA_PEM="$installation_path/certificates/ca/agilesec-rootca-cert.pem"
CLIENT_CERT="$installation_path/certificates/$analytics_internal_domain/shared-client-cert.pem"
CLIENT_KEY="$installation_path/certificates/$analytics_internal_domain/shared-client-key.pem"
CLIENT_COMBO="/tmp/client-combo-key-cert.pem"
# 2. Create a combined PEM file (Required for ssl.keystore.location in Java)
# The order usually doesn't matter, but Key + Cert is standard.
cat "$CLIENT_KEY" "$CLIENT_CERT" > "$CLIENT_COMBO"
# 3. Generate the ssl.properties file
cat > /tmp/ssl.properties << EOF
security.protocol=SSL
# Truststore (The CA Certificate)
ssl.truststore.type=PEM
ssl.truststore.location=$CA_PEM
# Keystore (The Client Key + Certificate)
ssl.keystore.type=PEM
ssl.keystore.location=$CLIENT_COMBO
# If your Private Key is encrypted, uncomment the line below:
# ssl.key.password=$(cat $installation_path/certificates/$analytics_internal_domain/agilesec-client-keystore.pass)
EOF
# List all topics
export JAVA_HOME=$installation_path/bin/java
$installation_path/services/kafka_*/bin/kafka-topics.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--list
# Describe all consumer groups
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--describe --all-groups
# Check broker metadata
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
--files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
--cluster-metadata-decoder
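As an additional liveness check, kafka-broker-api-versions.sh (shipped with Kafka) confirms the broker answers API requests over its SSL listener:
# A response listing supported API versions means the broker is up and accepts the SSL config
$installation_path/services/kafka_*/bin/kafka-broker-api-versions.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties | head -5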
6. Node Recovery Procedures
6.1 Single-Node Recovery
Complete Recovery Sequence:
# 1. Stop all services gracefully
./scripts/manage.sh stop
# 2. Verify all processes are stopped
./scripts/manage.sh status
ps aux | grep -E '(opensearch|mongod|kafka|java|node)'
# 3. Verify disk space
df -h
# 4. Start services in order
./scripts/manage.sh start
# 5. Verify all services
./scripts/manage.sh status
6.2 Multi-Node Recovery
6.2.1 OpenSearch Node Recovery
# On the failed node:
# 1. Stop OpenSearch
./scripts/manage.sh stop opensearch
# 2. Check and fix data directory if corrupted. Backup the directory before deleting.
# (Only if necessary, e.g., when recovering from backup - THIS WILL RESULT IN DATA LOSS)
# rm -rf $installation_path/data/opensearch/node
# 3. Restart OpenSearch
./scripts/manage.sh start opensearch
# 4. Monitor cluster recovery
watch -n 5 "curl -s -k --cacert \$CA_CERT --cert \$CLIENT_CERT --key \$CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty' | grep -E '(status|relocating|initializing)'"
6.2.2 MongoDB Node Recovery
Secondary Node Recovery:
# 1. Stop MongoDB on failed secondary
./scripts/manage.sh stop mongodb
# 2. Optionally resync from primary (if data is corrupted)
# rm -rf $installation_path/data/mongodb/*
# 3. Start MongoDB
./scripts/manage.sh start mongodb
# 4. MongoDB will automatically sync from primary
Primary Node Recovery:
# If primary fails, a secondary should automatically become primary
# After recovering the failed node:
# 1. Start MongoDB
./scripts/manage.sh start mongodb
# 2. It will rejoin as secondary and sync
# 3. To force it back to primary (if desired):
# rs.stepDown() # On current primary
6.2.3 Kafka Broker Recovery
# 1. Stop failed broker
./scripts/manage.sh stop kafka
# 2. Check for log corruption
ls -la $installation_path/data/kafka/
# 3. Start broker
./scripts/manage.sh start kafka
# 4. Verify broker rejoined cluster
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
--files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
--cluster-metadata-decoder
6.3 Full Platform Recovery Sequence
For complete platform recovery after major failure:
# Phase 1: Infrastructure (run on all nodes)
./scripts/manage.sh start opensearch mongodb kafka
# Phase 2: Wait for infrastructure to stabilize
sleep 120
# Verify infrastructure health on each node
./scripts/manage.sh status opensearch mongodb kafka
# Phase 3: Supporting services
./scripts/manage.sh start haproxy td-agent
# Phase 4: Application services
./scripts/manage.sh start sm analytics-manager ingestion scheduler
./scripts/manage.sh start webui api cbom
# Phase 5: Final verification
./scripts/manage.sh status
7. Performance Troubleshooting
7.1 Resource Utilization Analysis
# CPU usage by process
top -b -n 1 | head -20
# Memory usage
free -h
ps aux --sort=-%mem | head -10
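Disk and I/O pressure often appears before CPU or memory saturation; a hedged sketch, assuming the sysstat package is installed:
# Disk I/O utilization (3 samples, 1 second apart)
iostat -x 1 3
# Run queue, swap activity, and I/O wait
vmstat 1 5
# Disk usage per mount point
df -h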
7.2 OpenSearch Performance Issues
High Query Latency:
# Enable slow query log
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_all/_settings' \
-H 'Content-Type: application/json' -d '{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s"
}'
# Check hot threads
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_nodes/hot_threads'
# Check pending tasks
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/pending_tasks?pretty'
7.3 MongoDB Performance Issues
# Open a mongosh session (reuses the exports from Section 5.2)
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017
// Enable profiling for slow queries
db.setProfilingLevel(1, { slowms: 100 })
// Check slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()
// Check current operations
db.currentOp({ "secs_running": { "$gt": 5 } })
// Server status
db.serverStatus()
7.4 Kafka Performance Issues
# Check consumer lag
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--describe --all-groups
# Check log segment sizes
du -sh $installation_path/data/kafka/*
8. Support Bundle Log Collection
8.1 Run command to generate support bundle logs
Create a support bundle script collect-logs.sh containing all relevant diagnostic information:
Important: Replace </path/to/installation> with your actual installation directory path.
#!/bin/bash
# Create support bundle
export installation_path=</path/to/installation> # Update this path to your actual installation directory
BUNDLE_DIR="/tmp/support-bundle-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"
# System information
echo "=== System Info ===" > "$BUNDLE_DIR/system_info.txt"
uname -a >> "$BUNDLE_DIR/system_info.txt"
cat /etc/os-release >> "$BUNDLE_DIR/system_info.txt"
free -h >> "$BUNDLE_DIR/system_info.txt"
df -h >> "$BUNDLE_DIR/system_info.txt"
uptime >> "$BUNDLE_DIR/system_info.txt"
# Service status
$installation_path/scripts/manage.sh status > "$BUNDLE_DIR/service_status.txt" 2>&1
# Application logs (last 5000 lines each)
mkdir -p "$BUNDLE_DIR/logs"
for log in $installation_path/logs/*.log; do
tail -5000 "$log" > "$BUNDLE_DIR/logs/$(basename $log)" 2>/dev/null
done
# Configuration (sanitized)
mkdir -p "$BUNDLE_DIR/config"
# Copy configs but remove sensitive data
for conf in $installation_path/config_envs/*; do
grep -v -E '(PASSWORD|SECRET|KEY|TOKEN)' "$conf" > "$BUNDLE_DIR/config/$(basename $conf)" 2>/dev/null
done
# Create archive
tar -czf "${BUNDLE_DIR}.tar.gz" -C /tmp "$(basename $BUNDLE_DIR)"
rm -rf "$BUNDLE_DIR"
echo "Support bundle created: ${BUNDLE_DIR}.tar.gz"
8.2 Files to Include
Retrieve the bundle archive from the /tmp directory after running the script. It includes:
Category | Files |
|---|---|
All Platform Logs | Last 5,000 lines of each log under $installation_path/logs/ |
Configuration | Sanitized copies of $installation_path/config_envs/ (passwords, secrets, keys, and tokens stripped) |
System Info | OS version, memory, disk, uptime |
8.3 Sanitizing Sensitive Information
Important: Before sharing logs, remove:
Passwords and secrets
API keys and tokens
Private keys and certificates
Personal information
Private IPs
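A hedged final scan of the bundle before upload (the patterns are illustrative; extend them per your policies, and use the archive name your script printed in place of the <timestamp> placeholder):
# Extract the archive to a scratch directory and scan it
mkdir -p /tmp/bundle-check
tar -xzf /tmp/support-bundle-<timestamp>.tar.gz -C /tmp/bundle-check
grep -rEin '(password|secret|token|private key)' /tmp/bundle-check || echo "No obvious secrets found"
# Look for leftover private 10.x addresses
grep -rEon '\b10\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b' /tmp/bundle-check | head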
8.4 Submitting Support Bundles
Create the support bundle using the procedure above
Verify no sensitive information is included
Upload to secure file sharing as directed by support
Include ticket number and description of the issue