
On-Prem Troubleshooting Guide

This troubleshooting guide provides detailed procedures for diagnosing and resolving issues with the Keyfactor AgileSec Analytics Platform. It covers both single-node and multi-node deployments, with a focus on root cause analysis and resolution steps.

1. Introduction

1.1 Platform Architecture Overview

The platform consists of three service tiers:

Infrastructure Layer:

  • OpenSearch - Search and analytics engine

  • OpenSearch Dashboards - Web interface for OpenSearch

  • MongoDB - Operational data store with replica set support

  • Kafka - Message queue broker (KRaft mode)

Supporting Services:

  • HAProxy - Load balancer and reverse proxy

  • Fluentd (td-agent) - Ships data from Kafka to OpenSearch

Application Microservices:

  • Java Backend Services: ingestion, scheduler, sm (Security Manager), analytics-manager

  • Node.js Frontend Services: webui, api, cbom

1.2 Deployment Types

Type        | Description
Single-Node | All services on one server
Multi-Node  | Distributed deployment with PRIMARY_FULL_BACKEND, FULL_BACKEND, FRONTEND, and SCAN nodes

1.3 Service Dependencies

Service dependencies:

CODE
opensearch            : (no dependencies)
opensearch-dashboards : opensearch
mongodb               : (no dependencies)
kafka                 : (no dependencies)
td-agent              : opensearch, kafka
scheduler             : opensearch, mongodb, kafka
analytics-manager     : mongodb, kafka
ingestion             : kafka
webui                 : api
api                   : opensearch, mongodb, kafka, cbom, sm
cbom                  : opensearch, mongodb, kafka, sm
sm                    : (no dependencies)
haproxy               : (no dependencies)

2. Log File Locations and Rotation

2.1 Log Directory Structure

All logs are stored under $installation_path/logs/.

2.2 Infrastructure Service Logs

Quick platform logs: These are the main log files for each infrastructure service.

Service    | Log Location                           | Description
OpenSearch | $installation_path/logs/opensearch.log | Cluster logs, slow queries, deprecation warnings
MongoDB    | $installation_path/logs/mongodb.log    | Server logs, query logs
Kafka      | $installation_path/logs/kafka.log      | Broker logs, controller logs

Full platform logs: These locations contain the detailed logs for each infrastructure service.

Service    | Log Location                                        | Description
OpenSearch | $installation_path/services/opensearch/logs/        | Cluster logs, slow queries, deprecation warnings
MongoDB    | $installation_path/services/mongodb/logs/mongod.log | Server logs, query logs
Kafka      | $installation_path/services/kafka/logs/             | Broker logs, controller logs

2.3 Application Microservice Logs

Service           | Log Location
webui             | $installation_path/logs/webui.log
api               | $installation_path/logs/api.log
cbom              | $installation_path/logs/cbom.log
sm                | $installation_path/logs/sm.log
analytics-manager | $installation_path/logs/analytics-manager.log
ingestion         | $installation_path/logs/ingestion.log
scheduler         | $installation_path/logs/scheduler.log

2.4 Supporting Service Logs

Service            | Log Location
HAProxy            | $installation_path/logs/haproxy.log
Fluentd (td-agent) | $installation_path/logs/td-agent.log

2.5 Management and Health Check Logs

Log File                                         | Purpose
$installation_path/logs/health_check_cronjob.log | Automated health check results and service restart attempts

2.6 Log Rotation Configuration

OpenSearch Log Rotation: OpenSearch handles its own log rotation. Configure in $installation_path/services/opensearch/config/opensearch.yml:

CODE
logger.deprecation.level: warn
appender.rolling.type: RollingFile
appender.rolling.policies.size.size: 100MB

System Log Rotation: For application logs, you can configure logrotate manually by creating a configuration file (e.g., /etc/logrotate.d/kf-agilesec) similar to the example below. Adjust the retention period to match your organizational policies.

Important: You must replace <installation_path>, <your_user>, and <your_group> with actual values from your environment.

CODE
<installation_path>/logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
    create 0640 <your_user> <your_group>
}
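Before relying on the configuration, it can be validated with a logrotate dry run, which parses the file and prints the planned actions without rotating anything. The sketch below writes a trimmed copy of the example to a temporary file for illustration; once the real file is installed, run the same dry run against /etc/logrotate.d/kf-agilesec instead.

```shell
# Dry-run a logrotate config to catch syntax errors before installing it.
CONF=$(mktemp)
cat > "$CONF" << 'EOF'
/tmp/example-logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF

# -d = debug mode: parse the config and report planned actions, change nothing
if command -v logrotate >/dev/null 2>&1; then
  logrotate -d "$CONF"
fi
```

To verify permissions and compression end to end after installing, `logrotate -f /etc/logrotate.d/kf-agilesec` forces one immediate rotation.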

3. Service Status Inspection Commands

3.1 Using the Unified Management Script

The primary tool for service management is manage.sh:

CODE
cd $installation_path
./scripts/manage.sh <action> [options] [service1 service2 ...]

Available Actions:

Action  | Description
start   | Start services
stop    | Stop services
restart | Stop and then start services
reload  | Reload service configuration (where supported)
status  | Check status of services
list    | List available services

Options:

Option       | Description
-d, --debug  | Enable debug mode (show service output in console)
-s, --silent | Enable silent mode (no output displayed)

3.2 Common Status Commands

CODE
# Check all services status
./scripts/manage.sh status

# Check specific service status
./scripts/manage.sh status opensearch
./scripts/manage.sh status mongodb kafka

# Start all services
./scripts/manage.sh start

# Start specific service with debug output
./scripts/manage.sh start -d opensearch

# Stop specific services
./scripts/manage.sh stop haproxy td-agent

# Restart a service
./scripts/manage.sh restart scheduler

# Reload HAProxy configuration
./scripts/manage.sh reload haproxy

3.3 Understanding Status Output

The status command shows:

  • Service name

  • Running/Not running status

  • Process ID (if running)

Example Output:

CODE
2025-01-15 10:30:45 [INFO] Service Status
opensearch             Running (PID: 12345)
mongodb                Running (PID: 12346)
kafka                  Running (PID: 12347)
webui                  Not running
api                    Running (PID: 12349)

3.4 Systemd Service Status

The platform uses a systemd service for automatic startup:

CODE
# Check systemd service status
sudo systemctl status kf_analytics.service

# Enable automatic startup
sudo systemctl enable kf_analytics.service

# View service logs
sudo journalctl -u kf_analytics.service -f

4. Common Failure Scenarios and Fixes

This section provides detailed root cause analysis and resolution steps for the most common issues.

4.1 OpenSearch Failures

4.1.1 Service Fails to Start

Symptoms:

  • OpenSearch process exits immediately after starting

  • Nothing listening on port 9200

  • Log shows "Unable to lock JVM Memory" or certificate errors

Root Cause Analysis:

Cause               | Log Indicator               | Resolution
Memory lock failure | "Unable to lock JVM Memory" | Configure ulimits (see below)
Insufficient heap   | "OutOfMemoryError"          | Increase OPENSEARCH_JAVA_OPTS
Certificate errors  | "SSLHandshakeException"     | Validate certificate paths
Port already in use | "Address already in use"    | Kill conflicting process

Resolution - Memory Lock:

CODE
# Check current limits
ulimit -l

# Run tune.sh command to adjust system settings
cd <installer_directory>
sudo ./scripts/tune.sh

# Important: Log out and log back in so the new limits take effect

Resolution - Heap Size:

CODE
# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options

# Modify -Xms and -Xmx values
# Example: 
#  -Xms32g 
#  -Xmx32g
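Resolution - Port Conflict:

For the "Address already in use" case, first identify what owns port 9200 before killing anything. A minimal sketch (the helper function is ours; ss ships with iproute2, fuser with psmisc):

```shell
# Report whether anything is listening on the given port
check_port() {
  if ss -tln 2>/dev/null | grep -q ":$1 "; then
    echo "port $1 in use"
  else
    echo "port $1 free"
  fi
}

check_port 9200

# If the port is held by a stale process, identify and stop it:
# ss -tlnp | grep ':9200 '
# fuser -k 9200/tcp   # forceful; prefer ./scripts/manage.sh stop opensearch
```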

4.1.2 Cluster State RED

Symptoms:

  • API returns cluster health status as "red"

  • Some indices are unavailable

  • Write operations failing

Root Cause Analysis:

  • Unassigned primary shards

  • Node disconnection in multi-node setup

  • Disk space exhaustion

Diagnostic Commands:

CODE
# Set certificate variables
# Important: 
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name

export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem

# Check cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/health?pretty'

# Check shard allocation explanation
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/allocation/explain?pretty'

# List unassigned shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'

Resolution:

CODE
# Enable shard allocation (if disabled)
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  -X PUT 'https://127.0.0.1:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
    "transient": {
      "cluster.routing.allocation.enable": "all"
    }
  }'

# Reroute stuck shards
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  -X POST 'https://127.0.0.1:9200/_cluster/reroute?retry_failed=true'

4.1.3 Certificate Authentication Failure

Symptoms:

  • "SSLHandshakeException" in logs

  • "certificate verify failed" errors

  • Services cannot connect to OpenSearch

Root Cause Analysis:

  • Certificate expired

  • Wrong CA certificate

  • Certificate chain incomplete

  • DN not in allowed list

Diagnostic Commands:

CODE
# Important: 
#   - Replace <client-cert-path> with actual path
#   - Replace <ca-cert-path> with actual path
export CLIENT_CERT=<client-cert-path>
export CA_CERT=<ca-cert-path>
# Check certificate expiration
openssl x509 -in $CLIENT_CERT -noout -dates

# Verify certificate chain
openssl verify -CAfile $CA_CERT $CLIENT_CERT

# Check certificate subject/issuer
openssl x509 -in $CLIENT_CERT -noout -subject -issuer

Resolution:

  • Regenerate certificates if expired

  • Ensure CA certificate matches the one used to sign client/server certificates

  • Verify plugins.security.nodes_dn and plugins.security.authcz.admin_dn in opensearch.yml
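When checking the allowed-DN lists, compare the certificate's exact subject string against what opensearch.yml permits. A sketch (the helper name is ours; RFC2253 is the comma-separated DN form typically used in the security plugin configuration):

```shell
# Print a certificate's subject DN in RFC2253 form for comparison against
# plugins.security.nodes_dn / plugins.security.authcz.admin_dn
print_cert_dn() {
  openssl x509 -in "$1" -noout -subject -nameopt RFC2253 | sed 's/^subject=//'
}

# Compare the two sides:
# print_cert_dn "$CLIENT_CERT"
# grep -A3 -E 'nodes_dn|admin_dn' \
#   "$installation_path/services/opensearch/config/opensearch.yml"
```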

4.1.4 JVM Heap Exhaustion

Symptoms:

  • "OutOfMemoryError: Java heap space" in logs

  • Service becomes unresponsive

  • Frequent garbage collection pauses

Root Cause Analysis:

  • Heap size too small for data volume

  • Large aggregation queries

Resolution:

CODE
# Check current heap usage
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_nodes/stats/jvm?pretty' | grep -A 10 "heap"

# Check current heap settings in service environment file
cat $installation_path/services/opensearch/config/jvm.options

# Modify -Xms and -Xmx values
# Example: 
#  -Xms32g 
#  -Xmx32g

# Restart OpenSearch
./scripts/manage.sh restart opensearch

4.1.5 Disk Space Exhaustion

Symptoms:

  • Write operations rejected

  • "disk watermark exceeded" in logs

  • Index status becomes read-only

Root Cause Analysis:

  • Data growth exceeding available disk

  • Log accumulation

  • Old indices not cleaned up

Diagnostic Commands:

CODE
# Check disk usage
# Important: 
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name

export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem

# Check disk usage of each node
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail,disk.total'

Resolution:

  • Free up disk space or expand the volume if available space is low; deleting or archiving old indices also reclaims space
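After space has been freed, indices that were switched to read-only when the flood-stage watermark was hit may still carry the block; recent OpenSearch versions release it automatically once usage drops below the high watermark, but it can also be cleared explicitly. This assumes the CA_CERT/CLIENT_CERT/CLIENT_KEY variables from the diagnostic commands above:

```shell
# Clear the read-only block placed on indices at the flood-stage watermark.
# Run only AFTER freeing disk space; a no-op if the block was auto-released.
curl -k --cacert "$CA_CERT" --cert "$CLIENT_CERT" --key "$CLIENT_KEY" \
  -X PUT 'https://127.0.0.1:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index.blocks.read_only_allow_delete": null }'
```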


4.2 MongoDB Failures

4.2.1 Service Fails to Start

Symptoms:

  • mongod process exits immediately

  • "permission denied" or socket errors in logs

Root Cause Analysis:

Cause                  | Log Indicator             | Resolution
Port in use            | "Address already in use"  | Kill conflicting process
TLS certificate issues | "cannot read certificate" | Check certificate paths
Lock file exists       | "Unable to acquire lock"  | Remove stale lock file

Resolution - Lock File:

CODE
# Remove stale lock file (only if mongod is not running)
rm -f $installation_path/data/mongodb/mongod.lock

4.2.2 TLS/mTLS Connection Failure

Symptoms:

  • "SSL peer certificate validation failed"

  • "certificate verify failed"

  • Services cannot connect to MongoDB

Root Cause Analysis:

  • Client certificate not trusted by server

  • CA mismatch between client and server

  • Certificate expired

  • Wrong certificate key file

Diagnostic Commands:

CODE
# Test MongoDB connection with TLS
# Important: 
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name
#   - Replace "<Path to installer-directory>" with actual path

export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"


$installation_path/bin/mongosh \
  --tls \
  --tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
  --tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/shared-client-combo-cert-key.pem \
  --host 127.0.0.1 \
  --authenticationMechanism MONGODB-X509 \
  --port 27017 \
  --eval 'db.runCommand({ ping: 1 })'

Resolution:

  • Verify tlsCAFile path in mongod.conf matches the CA used to sign client certificates

  • Ensure tlsCertificateKeyFile contains both certificate and key

  • Check certificate expiration dates


4.3 Kafka Failures

4.3.1 Broker Fails to Start

Symptoms:

  • Kafka process exits immediately

  • "Address already in use" error

  • KRaft controller election fails

Root Cause Analysis:

Cause              | Log Indicator             | Resolution
Port conflict      | "Address already in use"  | Check ports 9092, 9093, 9094
KRaft init failure | "Cluster ID mismatch"     | Check cluster_id.txt
SSL configuration  | "SSL handshake failed"    | Verify keystore paths
Insufficient disk  | "No space left on device" | Free disk space

Diagnostic Commands:

CODE
# Check if ports are in use
ss -tlnp | grep -E '(9092|9093|9094)'

# Check Kafka logs
tail -100 $installation_path/services/kafka_*/logs/server.log

Resolution - Port Conflict:

CODE
# Find and kill process using the port
fuser -k 9092/tcp
fuser -k 9093/tcp
fuser -k 9094/tcp

# Restart Kafka
./scripts/manage.sh start kafka

4.4 Java Microservice Failures

Applies to: scheduler, sm, ingestion, analytics-manager

4.4.1 Kafka Connection Failure

Symptoms:

  • Java microservice logs show errors matching the exception patterns below:

    • "TimeoutException"

    • "DisconnectException"

    • "SerializationException"

    • "DeserializationException"

    • "CommitFailedException"

    • "AuthorizationException"

    • "SaslAuthenticationException"

Root Cause Analysis:

  • Kafka broker not running or not healthy

  • SSL configuration mismatch

  • Kafka broker CPU or memory at peak utilization

  • Network connectivity issues

Resolution:

  • Verify Kafka is running: ./scripts/manage.sh status kafka

  • Check whether Kafka is healthy (see Section 5.3 for Kafka health checks)

  • Check Kafka load (CPU, disk, and memory usage)

  • Verify that the SSL certificates match between Kafka and the microservice

  • Check network connectivity between the microservice and the Kafka brokers
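The connectivity and certificate checks can be combined into a quick probe. A sketch (the helper name is ours; the ports follow the broker listener defaults used elsewhere in this guide):

```shell
# Probe a TLS listener: is it reachable, and what certificate does it present?
check_tls_endpoint() {
  local host="$1" port="$2"
  if ! timeout 5 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port unreachable"
    return 1
  fi
  # Show the presented certificate's subject and expiry
  echo | openssl s_client -connect "$host:$port" 2>/dev/null \
    | openssl x509 -noout -subject -enddate
}

# Run against each broker listener (defaults: 9092, 9093, 9094):
# check_tls_endpoint 127.0.0.1 9092
```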

4.4.2 MongoDB Connection Failure

Symptoms:

  • Java microservice logs show errors matching the exception patterns below:

    • "MongoTimeoutException"

    • "MongoSocketOpenException"

    • "MongoSocketReadException"

    • "Timed out while waiting for a server"

    • "Connection refused"

    • "No server chosen by ReadPreference"

    • "MongoSecurityException"

    • "Authentication failed"

    • "Unauthorized"

    • "not authorized on .* to execute"

Resolution:

  • Verify MongoDB is running: ./scripts/manage.sh status mongodb

  • Check whether MongoDB is healthy (see Section 5.2 for MongoDB health checks)

  • Check MongoDB load (CPU, disk, and memory usage)

  • Verify that the SSL certificates match between MongoDB and the microservice

  • Check network connectivity between the microservice and the MongoDB cluster

4.4.3 SM Service Keystore Failure

Symptoms:

  • "Cannot load keystore"

  • "Keystore was tampered with, or password was incorrect"

Resolution:

CODE
# Verify keystore exists and is readable
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export sm_keystore_pass=$(cat $installation_path/certificates/$analytics_internal_domain/sm-service-keystore.pass)

ls -la $installation_path/certificates/$analytics_internal_domain/sm-service.p12

# Verify password
$installation_path/bin/java/bin/keytool -list -keystore $installation_path/certificates/$analytics_internal_domain/sm-service.p12 -storepass "$sm_keystore_pass"


4.5 Node.js Microservice Failures

Applies to: webui, api, cbom

4.5.1 Kafka Connection Failure

Symptoms:

  • Node.js microservice logs show errors matching the exception patterns below:

    • "KafkaJSConnectionError"

    • "SSL alert number"

Root Cause Analysis:

  • Kafka broker not running or not healthy

  • SSL configuration mismatch

  • Kafka broker CPU or memory at peak utilization

  • Network connectivity issues

Resolution:

  • Verify Kafka is running: ./scripts/manage.sh status kafka

  • Check whether Kafka is healthy (see Section 5.3 for Kafka health checks)

  • Check Kafka load (CPU, disk, and memory usage)

  • Verify that the SSL certificates match between Kafka and the microservice

  • Check network connectivity between the microservice and the Kafka brokers

4.5.2 MongoDB Connection Errors

Symptoms:

  • Node.js microservice logs show errors matching the exception patterns below:

    • "MongooseError"

    • "buffering timed out after"

Resolution:

  • Verify MongoDB is running: ./scripts/manage.sh status mongodb

  • Check whether MongoDB is healthy (see Section 5.2 for MongoDB health checks)

  • Check MongoDB load (CPU, disk, and memory usage)

  • Verify that the SSL certificates match between MongoDB and the microservice

  • Check network connectivity between the microservice and the MongoDB cluster


4.5.3 OpenSearch Connection Errors

Symptoms:

  • Node.js microservice logs show errors matching the exception patterns below:

    • "Query open-search error"

    • "ECONNREFUSED"

Resolution:

  • Verify OpenSearch is running: ./scripts/manage.sh status opensearch

  • Check whether OpenSearch is healthy (see Section 5.1 for OpenSearch health checks)

  • Check OpenSearch load (CPU, disk, and memory usage)

  • Verify that the SSL certificates match between OpenSearch and the microservice

  • Check network connectivity between the microservice and the OpenSearch cluster


5. Cluster Health Diagnostics

5.1 OpenSearch Cluster Health

  • Using curl with certificate authentication for cluster diagnostics

CODE
# Set up certificate environment
# Important:
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export CA_CERT=$installation_path/certificates/ca/agilesec-rootca-cert.pem
export CLIENT_CERT=$installation_path/certificates/$analytics_internal_domain/admin-user-cert.pem
export CLIENT_KEY=$installation_path/certificates/$analytics_internal_domain/admin-user-key.pem
# Cluster health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/health?pretty'

# Node status
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cat/nodes?v&h=ip,name,role,master,heap.percent,disk.used_percent'

# Cluster statistics
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/stats?pretty'

# Index health
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cat/indices?v'

  • Using OpenSearch Dashboards for cluster diagnostics

    • OpenSearch Dashboards' Dev Tools console provides additional cluster diagnostic capabilities

    • Use GET /_cat/nodes?v and GET /_cat/indices?v for detailed node and index status

    • Use GET /_cluster/health for cluster health status

    • Use GET /_cluster/stats for cluster statistics

    • Use GET /_cat/shards?v for shard allocation status

    • Monitor for yellow/red health indicators which may indicate shard allocation issues

    • Pay attention to unassigned shards and disk usage percentages (look for nodes with high disk usage, especially above 80%)

    • Check for any nodes showing "UNREACHABLE" or "DISCONNECTED" status

    • Look for nodes with high memory usage (above 85%) which may indicate performance issues

    • Watch for nodes with high CPU usage (above 90%) which may indicate resource contention

    • Check for any nodes with excessive load averages that may indicate system overload

5.2 MongoDB Replica Set Health

CODE
# Important: 
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name
#   - Replace "<Path to installer-directory>" with actual path
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain
export installer_path=<Path to installer-directory> #Note that <installer_path> is different from <installation_path>
export OPENSSL_CONF="$installer_path/templates/mongodb/openssl-mongosh.cnf"

# Connect and check replica set status
$installation_path/bin/mongosh \
  --tls \
  --tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
  --tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
  --authenticationMechanism MONGODB-X509 \
  --host 127.0.0.1 \
  --port 27017 \
  --eval 'rs.status()'

# Check replica set members
$installation_path/bin/mongosh \
  --tls \
  --tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
  --tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
  --authenticationMechanism MONGODB-X509 \
  --host 127.0.0.1 \
  --port 27017 \
  --quiet \
  --eval 'const h=db.hello(); print("Members:", h.hosts.length); printjson(h.hosts)'

5.3 Kafka Cluster Health

CODE
# Important:
#   - Replace </path/to/installation> with actual path
#   - Replace "kf-agilesec.internal" with your actual domain name

# 1. Set certificate paths
export installation_path=</path/to/installation>
export analytics_internal_domain="kf-agilesec.internal" # default: kf-agilesec.internal. Change it to match your actual domain

CA_PEM="$installation_path/certificates/ca/agilesec-rootca-cert.pem"
CLIENT_CERT="$installation_path/certificates/$analytics_internal_domain/shared-client-cert.pem"
CLIENT_KEY="$installation_path/certificates/$analytics_internal_domain/shared-client-key.pem"
CLIENT_COMBO="/tmp/client-combo-key-cert.pem"

# 2. Create a combined PEM file (Required for ssl.keystore.location in Java)
# The order usually doesn't matter, but Key + Cert is standard.
cat "$CLIENT_KEY" "$CLIENT_CERT" > "$CLIENT_COMBO"

# 3. Generate the ssl.properties file
cat > /tmp/ssl.properties << EOF
security.protocol=SSL

# Truststore (The CA Certificate)
ssl.truststore.type=PEM
ssl.truststore.location=$CA_PEM

# Keystore (The Client Key + Certificate)
ssl.keystore.type=PEM
ssl.keystore.location=$CLIENT_COMBO

# If your Private Key is encrypted, uncomment the line below:
# ssl.key.password=$(cat $installation_path/certificates/$analytics_internal_domain/agilesec-client-keystore.pass)
EOF


# List all topics
export JAVA_HOME=$installation_path/bin/java
$installation_path/services/kafka_*/bin/kafka-topics.sh \
  --bootstrap-server 127.0.0.1:9092 \
  --command-config /tmp/ssl.properties \
  --list

# Describe all consumer groups
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
  --bootstrap-server 127.0.0.1:9092 \
  --command-config /tmp/ssl.properties \
  --describe --all-groups

# Check broker metadata
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
  --files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
  --cluster-metadata-decoder

6. Node Recovery Procedures

6.1 Single-Node Recovery

Complete Recovery Sequence:

CODE
# 1. Stop all services gracefully
./scripts/manage.sh stop

# 2. Verify all processes are stopped
./scripts/manage.sh status
ps aux | grep -E '(opensearch|mongod|kafka|java|node)'

# 3. Verify disk space
df -h

# 4. Start services in order
./scripts/manage.sh start 

# 5. Verify all services
./scripts/manage.sh status

6.2 Multi-Node Recovery

6.2.1 OpenSearch Node Recovery

CODE
# On the failed node:

# 1. Stop OpenSearch
./scripts/manage.sh stop opensearch

# 2. Check and fix data directory if corrupted. Backup the directory before deleting.
# (Only if necessary, e.g., when recovering from backup - THIS WILL RESULT IN DATA LOSS)
# rm -rf $installation_path/data/opensearch/node

# 3. Restart OpenSearch
./scripts/manage.sh start opensearch

# 4. Monitor cluster recovery
watch -n 5 "curl -s -k --cacert \$CA_CERT --cert \$CLIENT_CERT --key \$CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/health?pretty' | grep -E '(status|relocating|initializing)'"

6.2.2 MongoDB Node Recovery

Secondary Node Recovery:

CODE
# 1. Stop MongoDB on failed secondary
./scripts/manage.sh stop mongodb

# 2. Optionally resync from primary (if data is corrupted)
# rm -rf $installation_path/data/mongodb/*

# 3. Start MongoDB
./scripts/manage.sh start mongodb

# 4. MongoDB will automatically sync from primary

Primary Node Recovery:

CODE
# If primary fails, a secondary should automatically become primary
# After recovering the failed node:

# 1. Start MongoDB
./scripts/manage.sh start mongodb

# 2. It will rejoin as secondary and sync
# 3. To force it back to primary (if desired):
# rs.stepDown()  # On current primary

6.2.3 Kafka Broker Recovery

CODE
# 1. Stop failed broker
./scripts/manage.sh stop kafka

# 2. Check for log corruption
ls -la $installation_path/data/kafka/

# 3. Start broker
./scripts/manage.sh start kafka

# 4. Verify broker rejoined cluster
$installation_path/services/kafka_*/bin/kafka-dump-log.sh \
  --files $installation_path/data/kafka/__cluster_metadata-0/00000000000000000000.log \
  --cluster-metadata-decoder

6.3 Full Platform Recovery Sequence

For complete platform recovery after major failure:

CODE
# Phase 1: Infrastructure (run on all nodes)
./scripts/manage.sh start opensearch mongodb kafka

# Phase 2: Wait for infrastructure to stabilize
sleep 120

# Verify infrastructure health on each node
./scripts/manage.sh status opensearch mongodb kafka

# Phase 3: Supporting services
./scripts/manage.sh start haproxy td-agent

# Phase 4: Application services
./scripts/manage.sh start sm analytics-manager ingestion scheduler
./scripts/manage.sh start webui api cbom

# Phase 5: Final verification
./scripts/manage.sh status

7. Performance Troubleshooting

7.1 Resource Utilization Analysis

CODE
# CPU usage by process
top -b -n 1 | head -20

# Memory usage
free -h
ps aux --sort=-%mem | head -10
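Load average and disk pressure are worth capturing alongside CPU and memory. A quick snapshot (the data-directory path in the comment is an assumption; adjust to your layout):

```shell
# System load: 1/5/15-minute averages plus run-queue and last PID
cat /proc/loadavg

# Free space on the installation volume (falls back to / if the variable is unset)
df -h "${installation_path:-/}"

# Largest consumers under the data directory:
# du -sh "${installation_path}"/data/* | sort -rh | head
```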

7.2 OpenSearch Performance Issues

High Query Latency:

CODE
# Enable slow query log
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  -X PUT 'https://127.0.0.1:9200/_all/_settings' \
  -H 'Content-Type: application/json' -d '{
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s"
  }'

# Check hot threads
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_nodes/hot_threads'

# Check pending tasks
curl -k --cacert $CA_CERT --cert $CLIENT_CERT --key $CLIENT_KEY \
  'https://127.0.0.1:9200/_cluster/pending_tasks?pretty'

7.3 MongoDB Performance Issues

CODE
# Start an interactive mongosh session (set installation_path, analytics_internal_domain, and OPENSSL_CONF as in Section 5.2)
$installation_path/bin/mongosh \
  --tls \
  --tlsCAFile $installation_path/certificates/ca/agilesec-rootca-cert.pem \
  --tlsCertificateKeyFile $installation_path/certificates/$analytics_internal_domain/admin-user-combo-cert-key.pem \
  --authenticationMechanism MONGODB-X509 \
  --host 127.0.0.1 \
  --port 27017

// Enable profiling for slow queries
db.setProfilingLevel(1, { slowms: 100 })

// Check slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()

// Check current operations
db.currentOp({ "secs_running": { "$gt": 5 } })

// Server status
db.serverStatus()

7.4 Kafka Performance Issues

CODE
# Check consumer lag
$installation_path/services/kafka_*/bin/kafka-consumer-groups.sh \
  --bootstrap-server 127.0.0.1:9092 \
  --command-config /tmp/ssl.properties \
  --describe --all-groups

# Check log segment sizes
du -sh $installation_path/data/kafka/*

8. Support Bundle Log Collection

8.1 Run command to generate support bundle logs

Create a support bundle script collect-logs.sh containing all relevant diagnostic information:

Important: Update the installation_path variable in the script below to your actual installation directory.

CODE
#!/bin/bash
# Create support bundle
export installation_path="</path/to/installation>" # Update this path to your actual installation directory

BUNDLE_DIR="/tmp/support-bundle-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"

# System information
echo "=== System Info ===" > "$BUNDLE_DIR/system_info.txt"
uname -a >> "$BUNDLE_DIR/system_info.txt"
cat /etc/os-release >> "$BUNDLE_DIR/system_info.txt"
free -h >> "$BUNDLE_DIR/system_info.txt"
df -h >> "$BUNDLE_DIR/system_info.txt"
uptime >> "$BUNDLE_DIR/system_info.txt"

# Service status
$installation_path/scripts/manage.sh status > "$BUNDLE_DIR/service_status.txt" 2>&1

# Application logs (last 5000 lines each)
mkdir -p "$BUNDLE_DIR/logs"
for log in $installation_path/logs/*.log; do
  tail -5000 "$log" > "$BUNDLE_DIR/logs/$(basename $log)" 2>/dev/null
done

# Configuration (sanitized)
mkdir -p "$BUNDLE_DIR/config"
# Copy configs but remove sensitive data
for conf in $installation_path/config_envs/*; do
  grep -v -E '(PASSWORD|SECRET|KEY|TOKEN)' "$conf" > "$BUNDLE_DIR/config/$(basename $conf)" 2>/dev/null
done

# Create archive
tar -czf "${BUNDLE_DIR}.tar.gz" -C /tmp "$(basename $BUNDLE_DIR)"
rm -rf "$BUNDLE_DIR"

echo "Support bundle created: ${BUNDLE_DIR}.tar.gz"

8.2 Files to Include

Retrieve the bundle archive from the /tmp directory after running the script. It includes:

Category          | Files
All Platform Logs | $installation_path/logs/*.log
Configuration     | $installation_path/config_envs/* (sanitized)
System Info       | OS version, memory, disk, uptime

8.3 Sanitizing Sensitive Information

Important: Before sharing logs, remove:

  • Passwords and secrets

  • API keys and tokens

  • Private keys and certificates

  • Personal information

  • Private IPs
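A quick way to double-check an extracted bundle before uploading is to grep it for strings that look like credentials or key material. A sketch (the function name is ours; extend the pattern list to match your environment):

```shell
# Flag lines in a directory tree that look like credentials or key material
scan_for_secrets() {
  grep -rniE '(password|passwd|secret|token|api[_-]?key|BEGIN .*PRIVATE KEY)' "$1" \
    || echo "no obvious secrets found in $1"
}

# scan_for_secrets /tmp/support-bundle-20250115-103045
```

Anything it flags should be removed or redacted before the archive leaves your environment.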

8.4 Submitting Support Bundles

  1. Create the support bundle using the procedure above

  2. Verify no sensitive information is included

  3. Upload to secure file sharing as directed by support

  4. Include ticket number and description of the issue
