On-Prem Troubleshooting Guide
This troubleshooting guide provides detailed procedures for diagnosing and resolving issues with the Keyfactor AgileSec Analytics Platform. It covers both single-node and multi-node deployments, with a focus on root cause analysis and resolution steps.
1. Introduction
1.1 Platform Architecture Overview
The platform consists of three service tiers:
Infrastructure Layer:
OpenSearch - Search and analytics engine
OpenSearch Dashboards - Web interface for OpenSearch
MongoDB - Operational data store with replica set support
Kafka - Message queue broker (KRaft mode)
Supporting Services:
HAProxy - Load balancer and reverse proxy
Fluentd (td-agent) - Ships data from Kafka to OpenSearch
Application Microservices:
Java Backend Services: ingestion, scheduler, sm (Security Manager), analytics-manager
Node.js Frontend Services: webui, api, cbom
1.2 Deployment Types
Type | Description |
|---|---|
Single-Node | All services on one server |
Multi-Node | Distributed deployment with PRIMARY_FULL_BACKEND, FULL_BACKEND, FRONTEND, and SCAN nodes |
1.3 Service Dependencies
Service : Dependencies
opensearch : (no dependencies)
opensearch-dashboards : opensearch
mongodb : (no dependencies)
kafka : (no dependencies)
td-agent : opensearch, kafka
scheduler : opensearch, mongodb, kafka
analytics-manager : mongodb, kafka
ingestion : kafka
webui : api
api : opensearch, mongodb, kafka, cbom, sm
cbom : opensearch, mongodb, kafka, sm
sm : (no dependencies)
haproxy : (no dependencies)
2. Log File Locations and Rotation
2.1 Log Directory Structure
All logs are stored under $installation_path/logs/.
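A quick way to see recent activity and follow a log live (the file name placeholder is illustrative; pick the log you need from the tables below):
# List the most recently written logs
ls -lt $installation_path/logs/ | head
# Follow a specific log in real time
tail -f $installation_path/logs/<service-log-file>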
2.2 Infrastructure Service Logs
Quick platform logs: These are the main log files for each infrastructure service.
Service | Log Location | Description |
|---|---|---|
OpenSearch | | Cluster logs, slow queries, deprecation warnings |
MongoDB | | Server logs, query logs |
Kafka | | Broker logs, controller logs |
Full platform logs: These contain the detailed logs for each service.
Service | Log Location | Description |
|---|---|---|
OpenSearch | | Cluster logs, slow queries, deprecation warnings |
MongoDB | | Server logs, query logs |
Kafka | | Broker logs, controller logs |
2.3 Application Microservice Logs
Service | Log Location |
|---|---|
webui | |
api | |
cbom | |
sm | |
analytics-manager | |
ingestion | |
scheduler | |
2.4 Supporting Service Logs
Service | Log Location |
|---|---|
HAProxy | |
Fluentd (td-agent) | |
2.5 Management and Health Check Logs
Log File | Purpose |
|---|---|
| Automated health check results and service restart attempts |
2.6 Log Rotation Configuration
OpenSearch Log Rotation: OpenSearch handles its own log rotation through Log4j2. Configure it in $installation_path/services/opensearch/config/log4j2.properties, for example:
logger.deprecation.level = warn
appender.rolling.type = RollingFile
appender.rolling.policies.size.size = 100MB
System Log Rotation: For application logs, you can configure logrotate manually by creating a configuration file (e.g., /etc/logrotate.d/kf-agilesec) similar to the example below. Adjust log-retention period based on your organizational policies.
Important: You must replace <installation_path>, <your_user>, and <your_group> with actual values from your environment.
<installation_path>/logs/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
copytruncate
create 0640 <your_user> <your_group>
}
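To validate the configuration before relying on it, logrotate's debug flag performs a dry run; once the output looks right, a forced rotation verifies permissions and copytruncate behavior:
# Dry run: show what would be rotated, change nothing
sudo logrotate -d /etc/logrotate.d/kf-agilesec
# Force a one-off rotation to confirm it works end to end
sudo logrotate -f /etc/logrotate.d/kf-agilesec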
3. Service Status Inspection Commands
3.1 Using the Unified Management Script
The primary tool for service management is manage.sh:
cd $installation_path
./scripts/manage.sh <action> [options] [service1 service2 ...]
Available Actions:
Action | Description |
|---|---|
start | Start services |
stop | Stop services |
restart | Stop and then start services |
reload | Reload service configuration (where supported) |
status | Check status of services |
list | List available services |
Options:
Option | Description |
|---|---|
-d | Enable debug mode (show service output in console) |
 | Enable silent mode (no output displayed) |
3.2 Common Status Commands
# Check all services status
./scripts/manage.sh status
# Check specific service status
./scripts/manage.sh status opensearch
./scripts/manage.sh status mongodb kafka
# Start all services
./scripts/manage.sh start
# Start specific service with debug output
./scripts/manage.sh start -d opensearch
# Stop specific services
./scripts/manage.sh stop haproxy td-agent
# Restart a service
./scripts/manage.sh restart scheduler
# Reload HAProxy configuration
./scripts/manage.sh reload haproxy
3.3 Understanding Status Output
The status command shows:
Service name
Running/Not running status
Process ID (if running)
Example Output:
2025-01-15 10:30:45 [INFO] Service Status
opensearch Running (PID: 12345)
mongodb Running (PID: 12346)
kafka Running (PID: 12347)
webui Not running
api Running (PID: 12349)
3.4 Systemd Service Status
The platform uses a systemd service for automatic startup:
# Check systemd service status
sudo systemctl status kf_analytics.service
# Enable automatic startup
sudo systemctl enable kf_analytics.service
# View service logs
sudo journalctl -u kf_analytics.service -f
4. Common Failure Scenarios and Fixes
This section provides detailed root cause analysis and resolution steps for the most common issues.
4.1 OpenSearch Failures
4.1.1 Service Fails to Start
Symptoms:
OpenSearch process exits immediately after starting
No listening on port 9200
Log shows "Unable to lock JVM Memory" or certificate errors
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Memory lock failure | "Unable to lock JVM Memory" | Configure ulimits (see below) |
Insufficient heap | "OutOfMemoryError" | Increase OPENSEARCH_JAVA_OPTS |
Certificate errors | "SSLHandshakeException" | Validate certificate paths |
Port already in use | "Address already in use" | Kill conflicting process |
Resolution - Memory Lock:
# Check current limits
ulimit -l
# Run tune.sh command to adjust system settings
cd <installer_directory>
sudo ./scripts/tune.sh
# Important: Logout/Login to refresh session
Resolution - Heap Size:
# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options
# Modify -Xms and -Xmx values
# Example:
# -Xms32g
# -Xmx32g
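Resolution - Port Conflict:
If the log shows "Address already in use", a minimal sketch to find and clear the conflicting process (9200 and 9300 are the standard OpenSearch HTTP and transport ports; adjust if your deployment differs):
# Identify the process holding the OpenSearch ports
ss -tlnp | grep -E '(9200|9300)'
# Kill the conflicting process (only if it is not another legitimate service)
fuser -k 9200/tcp
# Restart OpenSearch
./scripts/manage.sh start opensearch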
4.1.2 Cluster State RED
Symptoms:
API returns cluster health status as "red"
Some indices are unavailable
Write operations failing
Root Cause Analysis:
Unassigned primary shards
Node disconnection in multi-node setup
Disk space exhaustion
Diagnostic Commands:
# Set certificate variables
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Check cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty'
# Check shard allocation explanation
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/allocation/explain?pretty'
# List unassigned shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
Resolution:
# Enable shard allocation (if disabled)
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_cluster/settings' \
-H 'Content-Type: application/json' -d '{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}'
# Reroute stuck shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X POST 'https://127.0.0.1:9200/_cluster/reroute?retry_failed=true'
4.1.3 Certificate Authentication Failure
Symptoms:
"SSLHandshakeException" in logs
"certificate verify failed" errors
Services cannot connect to OpenSearch
Root Cause Analysis:
Certificate expired
Wrong CA certificate
Certificate chain incomplete
DN not in allowed list
Diagnostic Commands:
# Important:
# - Replace <client-cert-path> with actual path
# - Replace <ca-cert-path> with actual path
export CLIENT_CERT=<client-cert-path>
export CA_CERT=<ca-cert-path>
# Check certificate expiration
openssl x509 -in $CLIENT_CERT -noout -dates
# Verify certificate chain
openssl verify -CAfile $CA_CERT $CLIENT_CERT
# Check certificate subject/issuer
openssl x509 -in $CLIENT_CERT -noout -subject -issuer
Resolution:
Regenerate certificates if expired
Ensure CA certificate matches the one used to sign client/server certificates
Verify plugins.security.nodes_dn and plugins.security.authcz.admin_dn in opensearch.yml (a quick check is sketched below)
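A quick way to compare the configured DN lists against the subject your client certificate actually presents (assuming the default config location shown in Section 2.6):
# Show the allowed node and admin DNs
grep -A 3 -E '(nodes_dn|admin_dn)' $installation_path/services/opensearch/config/opensearch.yml
# Show the subject DN of the client certificate
openssl x509 -in $CLIENT_CERT -noout -subject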
4.1.4 JVM Heap Exhaustion
Symptoms:
"OutOfMemoryError: Java heap space" in logs
Service becomes unresponsive
Frequent garbage collection pauses
Root Cause Analysis:
Heap size too small for data volume
Large aggregation queries
Resolution:
# Check current heap usage
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_nodes/stats/jvm?pretty' | grep -A 10 "heap"
# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options
# Modify -Xms and -Xmx values
# Example:
# -Xms32g
# -Xmx32g
# Restart OpenSearch
./scripts/manage.sh restart opensearch
4.1.5 Disk Space Exhaustion
Symptoms:
Write operations rejected
"disk watermark exceeded" in logs
Index status becomes read-only
Root Cause Analysis:
Data growth exceeding available disk
Log accumulation
Old indices not cleaned up
Diagnostic Commands:
# Check disk usage
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Check disk usage on each node
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail,disk.total'
Resolution:
Increase available disk space (expand the volume or remove unneeded data) on any node above the watermark; then clear the read-only block if OpenSearch applied one, as sketched below
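If OpenSearch switched indices to read-only after hitting the flood-stage watermark, a hedged sketch to remove the block once space has been freed (index.blocks.read_only_allow_delete is the standard setting; verify against your OpenSearch version):
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_all/_settings' \
-H 'Content-Type: application/json' -d '{
"index.blocks.read_only_allow_delete": null
}'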
4.2 MongoDB Failures
4.2.1 Service Fails to Start
Symptoms:
mongod process exits immediately
"permission denied" or socket errors in logs
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Port in use | "Address already in use" | Kill conflicting process |
TLS certificate issues | "cannot read certificate" | Check certificate paths |
Lock file exists | "Unable to acquire lock" | Remove stale lock file |
Resolution - Lock File:
# Remove stale lock file (only if mongod is not running)
rm -f $installation_path/data/mongodb/mongod.lock
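A minimal pre-check so the lock file is only removed while mongod is genuinely stopped:
# Confirm no mongod process is running before removing the lock file
pgrep -af mongod || echo "mongod is not running; safe to remove the lock file"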
4.2.2 TLS/mTLS Connection Failure
Symptoms:
"SSL peer certificate validation failed"
"certificate verify failed"
Services cannot connect to MongoDB
Root Cause Analysis:
Client certificate not trusted by server
CA mismatch between client and server
Certificate expired
Wrong certificate key file
Diagnostic Commands:
# Test MongoDB connection with TLS
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# - Replace "<Path to installer-directory>" with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem \
--host 127.0.0.1 \
--authenticationMechanism MONGODB-X509 \
--port 27017 \
--eval 'db.runCommand({ ping: 1 })'
Resolution:
Verify tlsCAFile path in mongod.conf matches the CA used to sign client certificates
Ensure tlsCertificateKeyFile contains both certificate and key
Check certificate expiration dates (a quick check is sketched below)
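A hedged expiry and chain check on the shared client combo file (openssl reads the first certificate in a combined PEM):
# Check expiration and subject of the client certificate
openssl x509 -in $installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem -noout -dates -subject
# Verify it chains to the platform CA
openssl verify -CAfile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
$installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem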
4.3 Kafka Failures
4.3.1 Broker Fails to Start
Symptoms:
Kafka process exits immediately
"Address already in use" error
KRaft controller election fails
Root Cause Analysis:
Cause | Log Indicator | Resolution |
|---|---|---|
Port conflict | "Address already in use" | Check ports 9092, 9093, 9094 |
KRaft init failure | "Cluster ID mismatch" | Check cluster_id.txt |
SSL configuration | "SSL handshake failed" | Verify keystore paths |
Insufficient disk | "No space left on device" | Free disk space |
Diagnostic Commands:
# Check if ports are in use
ss -tlnp | grep -E '(9092|9093|9094)'
# Check Kafka logs
tail -100 $installation_path/services/kafka_*/logs/server.log
Resolution - Port Conflict:
# Find and kill process using the port
fuser -k 9092/tcp
fuser -k 9093/tcp
fuser -k 9094/tcp
# Restart Kafka
./scripts/manage.sh start kafka
4.4 Java Microservice Failures
Applies to: scheduler, sm, ingestion, analytics-manager
4.4.1 Kafka Connection Failure
Symptoms:
Java microservice logs show errors matching one of the exception patterns below:
"TimeoutException"
"DisconnectException"
"SerializationException"
"DeserializationException"
"CommitFailedException"
"AuthorizationException"
"SaslAuthenticationException"
Root Cause Analysis:
Kafka broker not running or not healthy
SSL configuration mismatch
Kafka broker CPU or memory at peak utilization
Network connectivity issues
Resolution:
Verify Kafka is running: ./scripts/manage.sh status kafka
Check that Kafka is healthy (see Section 5.3 for Kafka health checks)
Check Kafka load (CPU, disk, memory usage)
Verify that the SSL certificates match between Kafka and the microservice
Check network connectivity between the microservice and the Kafka brokers (a TLS probe is sketched below)
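A minimal TLS reachability probe, assuming the broker listens on 127.0.0.1:9092 and the shared client certificate paths used in Section 5.3; a completed handshake rules out basic network and SSL problems:
openssl s_client -connect 127.0.0.1:9092 \
-CAfile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
-cert $installation_path/certificates/$analytics_internal_domain/shared-client-cert.pem \
-key $installation_path/certificates/$analytics_internal_domain/shared-client-key.pem \
</dev/null 2>/dev/null | grep -E 'Verify return code|subject='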
4.4.2 MongoDB Connection Failure
Symptoms:
Java microservice logs show errors matching one of the exception patterns below:
"MongoTimeoutException"
"MongoSocketOpenException"
"MongoSocketReadException"
"Timed out while waiting for a server"
"Connection refused"
"No server chosen by ReadPreference"
"MongoSecurityException"
"Authentication failed"
"Unauthorized"
"not authorized on .* to execute"
Resolution:
Verify MongoDB is running: ./scripts/manage.sh status mongodb
Check that MongoDB is healthy (see Section 5.2 for MongoDB health checks)
Check MongoDB load (CPU, disk, memory usage)
Verify that the SSL certificates match between MongoDB and the microservice
Check network connectivity between the microservice and the MongoDB cluster
4.4.3 SM Service Keystore Failure
Symptoms:
"Cannot load keystore"
"Keystore was tampered with, or password was incorrect"
Resolution:
# Verify keystore exists and is readable
# Important: Replace </path/to/installation> with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export sm_keystore_pass=$(cat $installation_path/certificates/$analytics_internal_domain/sm-service-keystore.pass)
ls -la $installation_path/certificates/$analytics_internal_domain/sm-service.p12
# Verify password
$installation_path/bin/java/bin/keytool -list -keystore $installation_path/certificates/$analytics_internal_domain/sm-service.p12 -storepass "$sm_keystore_pass"
4.5 Node.js Microservice Failures
Applies to: webui, api, cbom
4.5.1 Kafka Connection Failure
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"KafkaJSConnectionError"
"SSL alert number"
Root Cause Analysis:
Kafka broker not running or not healthy
SSL configuration mismatch
Kafka broker CPU or memory at peak utilization
Network connectivity issues
Resolution:
Verify Kafka is running: ./scripts/manage.sh status kafka
Check that Kafka is healthy (see Section 5.3 for Kafka health checks)
Check Kafka load (CPU, disk, memory usage)
Verify that the SSL certificates match between Kafka and the microservice
Check network connectivity between the microservice and the Kafka brokers (the TLS probe sketched in Section 4.4.1 applies here as well)
4.5.2 MongoDB Connection Errors
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"MongooseError"
"buffering timed out after"
Resolution:
Verify MongoDB is running: ./scripts/manage.sh status mongodb
Check that MongoDB is healthy (see Section 5.2 for MongoDB health checks)
Check MongoDB load (CPU, disk, memory usage)
Verify that the SSL certificates match between MongoDB and the microservice
Check network connectivity between the microservice and the MongoDB cluster
4.5.3 OpenSearch Connection Errors
Symptoms:
Node.js microservice logs show errors matching one of the patterns below:
"Query open-search error"
"ECONNREFUSED"
Resolution:
Verify OpenSearch is running: ./scripts/manage.sh status opensearch
Check that OpenSearch is healthy (see Section 5.1 for OpenSearch health checks)
Check OpenSearch load (CPU, disk, memory usage)
Verify that the SSL certificates match between OpenSearch and the microservice
Check network connectivity between the microservice and the OpenSearch cluster (a quick probe is sketched below)
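A minimal reachability probe from the microservice host, reusing the certificate variables from Section 5.1; any HTTP status code (even 401) proves the port is reachable, while ECONNREFUSED points to the service or a firewall:
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-o /dev/null -w '%{http_code}\n' 'https://127.0.0.1:9200/'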
5. Cluster Health Diagnostics
5.1 OpenSearch Cluster Health
Using curl with certificate authentication for cluster diagnostics
# Set up certificate environment
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty'
# Node status
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/nodes?v&h=ip,name,role,master,heap.percent,disk.used_percent'
# Cluster statistics
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/stats?pretty'
# Index health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cat/indices?v'
Using OpenSearch Dashboards for cluster diagnostics
OpenSearch Dashboards' Dev Tools console provides additional cluster diagnostic capabilities:
Use GET /_cat/nodes?v and GET /_cat/indices?v for detailed node and index status
Use GET /_cluster/health for cluster health status
Use GET /_cluster/stats for cluster statistics
Use GET /_cat/shards?v for shard allocation status
Monitor for yellow/red health indicators, which may indicate shard allocation issues
Pay attention to unassigned shards and disk usage percentages (look for nodes with high disk usage, especially above 80%)
Check for any nodes showing "UNREACHABLE" or "DISCONNECTED" status
Look for nodes with high memory usage (above 85%) which may indicate performance issues
Watch for nodes with high CPU usage (above 90%) which may indicate resource contention
Check for any nodes with excessive load averages that may indicate system overload
5.2 MongoDB Replica Set Health
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# - Replace "<Path to installer-directory>" with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"
# Connect and check replica set status
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017 \
--eval 'rs.status()'
# Check replica set members
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017 \
--quiet \
--eval 'const h=db.hello(); print("Members:", h.hosts.length); printjson(h.hosts)'
5.3 Kafka Cluster Health
# Important:
# - Replace </path/to/installation> with actual path
# - Replace "kf-agilesec.internal" with your actual domain name
# 1. Set certificate and key paths
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
CA_PEM="$installation_path/certificates/ca/agilesec-rootca-cert.pem"
CLIENT_CERT="$installation_path/certificates/$analytics_internal_domain/shared-client-cert.pem"
CLIENT_KEY="$installation_path/certificates/$analytics_internal_domain/shared-client-key.pem"
CLIENT_COMBO="/tmp/client-combo-key-cert.pem"
# 2. Create a combined PEM file (Required for ssl.keystore.location in Java)
# The order usually doesn't matter, but Key + Cert is standard.
cat "$CLIENT_KEY" "$CLIENT_CERT" > "$CLIENT_COMBO"
# 3. Generate the ssl.properties file
cat > /tmp/ssl.properties << EOF
security.protocol=SSL
# Truststore (The CA Certificate)
ssl.truststore.type=PEM
ssl.truststore.location=$CA_PEM
# Keystore (The Client Key + Certificate)
ssl.keystore.type=PEM
ssl.keystore.location=$CLIENT_COMBO
# If your Private Key is encrypted, uncomment the line below:
# ssl.key.password=$(cat $installation_path/certificates/$analytics_internal_domain/agilesec-client-keystore.pass)
EOF
# List all topics
export JAVA_HOME=$installation_path/bin/java
$installation_path/services/kafka_*/bin/kafka-topics.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--list
# Describe all consumer groups
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--describe --all-groups
# Check broker metadata
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
--files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
--cluster-metadata-decoder
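As an additional liveness check, kafka-broker-api-versions.sh (shipped with Kafka) confirms the broker answers API requests over its SSL listener:
# A response listing supported API versions means the broker is up and accepts the SSL config
$installation_path/services/kafka_*/bin/kafka-broker-api-versions.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties | head -5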
6. Node Recovery Procedures
6.1 Single-Node Recovery
Complete Recovery Sequence:
# 1. Stop all services gracefully
./scripts/manage.sh stop
# 2. Verify all processes are stopped
./scripts/manage.sh status
ps aux | grep -E '(opensearch|mongod|kafka|java|node)'
# 3. Verify disk space
df -h
# 4. Start services in order
./scripts/manage.sh start
# 5. Verify all services
./scripts/manage.sh status
6.2 Multi-Node Recovery
6.2.1 OpenSearch Node Recovery
# On the failed node:
# 1. Stop OpenSearch
./scripts/manage.sh stop opensearch
# 2. Check and fix data directory if corrupted. Backup the directory before deleting.
# (Only if necessary, e.g., when recovering from backup - THIS WILL RESULT IN DATA LOSS)
# rm -rf $installation_path/data/opensearch/node
# 3. Restart OpenSearch
./scripts/manage.sh start opensearch
# 4. Monitor cluster recovery
watch -n 5 "curl -s -k --cacert \$CA_CERT --cert \$CLIENT_CERT --key \$CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/health?pretty' | grep -E '(status|relocating|initializing)'"
6.2.2 MongoDB Node Recovery
Secondary Node Recovery:
# 1. Stop MongoDB on failed secondary
./scripts/manage.sh stop mongodb
# 2. Optionally resync from primary (if data is corrupted)
# rm -rf $installation_path/data/mongodb/*
# 3. Start MongoDB
./scripts/manage.sh start mongodb
# 4. MongoDB will automatically sync from primary
Primary Node Recovery:
# If primary fails, a secondary should automatically become primary
# After recovering the failed node:
# 1. Start MongoDB
./scripts/manage.sh start mongodb
# 2. It will rejoin as secondary and sync
# 3. To force it back to primary (if desired):
# rs.stepDown() # On current primary
6.2.3 Kafka Broker Recovery
# 1. Stop failed broker
./scripts/manage.sh stop kafka
# 2. Check for log corruption
ls -la $installation_path/data/kafka/
# 3. Start broker
./scripts/manage.sh start kafka
# 4. Verify broker rejoined cluster
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
--files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
--cluster-metadata-decoder
6.3 Full Platform Recovery Sequence
For complete platform recovery after major failure:
# Phase 1: Infrastructure (run on all nodes)
./scripts/manage.sh start opensearch mongodb kafka
# Phase 2: Wait for infrastructure to stabilize
sleep 120
# Verify infrastructure health on each node
./scripts/manage.sh status opensearch mongodb kafka
# Phase 3: Supporting services
./scripts/manage.sh start haproxy td-agent
# Phase 4: Application services
./scripts/manage.sh start sm analytics-manager ingestion scheduler
./scripts/manage.sh start webui api cbom
# Phase 5: Final verification
./scripts/manage.sh status
7. Performance Troubleshooting
7.1 Resource Utilization Analysis
# CPU usage by process
top -b -n 1 | head -20
# Memory usage
free -h
ps aux --sort=-%mem | head -10
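Disk and I/O pressure often appears before CPU or memory saturation; a hedged sketch, assuming the sysstat package is installed:
# Disk I/O utilization (3 samples, 1 second apart)
iostat -x 1 3
# Run queue, swap activity, and I/O wait
vmstat 1 5
# Disk usage per mount point
df -h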
7.2 OpenSearch Performance Issues
High Query Latency:
# Enable slow query log
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
-X PUT 'https://127.0.0.1:9200/_all/_settings' \
-H 'Content-Type: application/json' -d '{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s"
}'
# Check hot threads
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_nodes/hot_threads'
# Check pending tasks
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
'https://127.0.0.1:9200/_cluster/pending_tasks?pretty'
7.3 MongoDB Performance Issues
# Open a mongosh session (reuses the exports from Section 5.2)
$installation_path/bin/mongosh \
--tls \
--tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
--tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
--authenticationMechanism MONGODB-X509 \
--host 127.0.0.1 \
--port 27017
// Enable profiling for slow queries
db.setProfilingLevel(1, { slowms: 100 })
// Check slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()
// Check current operations
db.currentOp({ "secs_running": { "$gt": 5 } })
// Server status
db.serverStatus()
7.4 Kafka Performance Issues
# Check consumer lag
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
--bootstrap-server 127.0.0.1:9092 \
--command-config /tmp/ssl.properties \
--describe --all-groups
# Check log segment sizes
du -sh $installation_path/data/kafka/*
8. Support Bundle Log Collection
8.1 Run command to generate support bundle logs
Create a support bundle script collect-logs.sh containing all relevant diagnostic information:
Important: Replace </path/to/installation> with your actual installation directory path.
#!/bin/bash
# Create support bundle
export installation_path=</path/to/installation> # Update this path to your actual installation directory
BUNDLE_DIR="/tmp/support-bundle-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"
# System information
echo "=== System Info ===" > "$BUNDLE_DIR/system_info.txt"
uname -a >> "$BUNDLE_DIR/system_info.txt"
cat /etc/os-release >> "$BUNDLE_DIR/system_info.txt"
free -h >> "$BUNDLE_DIR/system_info.txt"
df -h >> "$BUNDLE_DIR/system_info.txt"
uptime >> "$BUNDLE_DIR/system_info.txt"
# Service status
$installation_path/scripts/manage.sh status > "$BUNDLE_DIR/service_status.txt" 2>&1
# Application logs (last 5000 lines each)
mkdir -p "$BUNDLE_DIR/logs"
for log in $installation_path/logs/*.log; do
tail -5000 "$log" > "$BUNDLE_DIR/logs/$(basename $log)" 2>/dev/null
done
# Configuration (sanitized)
mkdir -p "$BUNDLE_DIR/config"
# Copy configs but remove sensitive data
for conf in $installation_path/config_envs/*; do
grep -v -E '(PASSWORD|SECRET|KEY|TOKEN)' "$conf" > "$BUNDLE_DIR/config/$(basename $conf)" 2>/dev/null
done
# Create archive
tar -czf "${BUNDLE_DIR}.tar.gz" -C /tmp "$(basename $BUNDLE_DIR)"
rm -rf "$BUNDLE_DIR"
echo "Support bundle created: ${BUNDLE_DIR}.tar.gz"
8.2 Files to Include
Retrieve the bundle archive from the /tmp directory after running the script. It includes:
Category | Files |
|---|---|
All Platform Logs | Last 5,000 lines of each log under $installation_path/logs/ |
Configuration | Sanitized copies of $installation_path/config_envs/ (passwords, secrets, keys, and tokens stripped) |
System Info | OS version, memory, disk, uptime |
8.3 Sanitizing Sensitive Information
Important: Before sharing logs, remove:
Passwords and secrets
API keys and tokens
Private keys and certificates
Personal information
Private IPs
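A hedged final scan of the bundle before upload (the patterns are illustrative; extend them per your policies, and use the archive name your script printed in place of the <timestamp> placeholder):
# Extract the archive to a scratch directory and scan it
mkdir -p /tmp/bundle-check
tar -xzf /tmp/support-bundle-<timestamp>.tar.gz -C /tmp/bundle-check
grep -rEin '(password|secret|token|private key)' /tmp/bundle-check || echo "No obvious secrets found"
# Look for leftover private 10.x addresses
grep -rEon '\b10\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b' /tmp/bundle-check | head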
8.4 Submitting Support Bundles
Create the support bundle using the procedure above
Verify no sensitive information is included
Upload to secure file sharing as directed by support
Include ticket number and description of the issue