docs(deployment): add monitoring & observability guide (#14838) #16774

Open
YAMRAJ13y wants to merge 1 commit into OpenCTI-Platform:master from YAMRAJ13y:docs/14838-monitoring-guide

Conversation

@YAMRAJ13y


Proposed changes

Adds a new Monitoring & observability page to the deployment docs, as requested in #14838.

The existing Telemetry page is reference-level (metric names, attributes, exporter config). This guide adds the operational "so what?" layer on top:

  • The three layers to monitor — application (platform/workers/connectors), infrastructure (ElasticSearch, Redis, RabbitMQ, S3/MinIO), and host/orchestrator — and how to correlate symptoms across them.
  • What to watch for each native platform metric (opencti_api_requests, opencti_api_errors, opencti_api_latency, opencti_api_direct_bulk/side_bulk, opencti_sent_email, plus Node.js runtime metrics).
  • A suggested Grafana dashboard structure organised by operational concern (OpenCTI ships none).
  • Baselines & alerts guidance — framed as deployment-relative (no invented absolute numbers), alerting on deviations/trends.
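As an illustration of the deployment-relative approach in the last bullet, here is a minimal Python sketch of alerting on deviation from a locally observed baseline rather than on a fixed number. The function name, the `k` multiplier, and the sample values are all hypothetical and are not part of the guide or of OpenCTI; the history would come from whatever metrics backend the deployment already uses.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, k=3.0):
    """Return True when `current` is more than `k` standard deviations
    away from the mean of `history` (the deployment's own baseline)."""
    if len(history) < 2:
        # Not enough samples to define a baseline, so never alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > k * sigma


# Hypothetical latency samples (ms) collected during normal operation.
baseline_samples = [100, 102, 98, 101, 99]
print(deviates_from_baseline(baseline_samples, 101))  # within the usual range
print(deviates_from_baseline(baseline_samples, 150))  # far outside the usual range
```

The threshold is expressed in units of the deployment's own variability, so the same rule can apply to a small lab instance and a large production cluster without hard-coding a latency figure.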

It links to the existing Telemetry reference and to the Troubleshooting guide (including the Redis state section) for root-cause correlation, and is registered in the nav under Deployment → Platform, after Telemetry.

All metric names referenced are verified against the source (opencti-graphql/src/config/tracing.ts) and the existing telemetry reference.

The guide intentionally avoids prescriptive absolute thresholds (e.g. "p95 < X ms"), since healthy ranges depend on data volume and sizing. Happy to adjust the structure/placement to maintainer preference.

Related issues

Closes #14838

How to test

Render the docs and open Deployment → Platform → Monitoring & observability; verify the new page and its nav entry, and that the links to Telemetry/Troubleshooting resolve.

…orm#14838)

Add an operational monitoring guide layered on top of the telemetry reference:
the three layers to monitor (application, infrastructure, host), what to watch
for each native platform metric, a suggested Grafana dashboard structure by
operational concern, and guidance on setting deployment-relative baselines and
alerts. Register the page in the deployment navigation.

Closes OpenCTI-Platform#14838
Copilot AI review requested due to automatic review settings June 23, 2026 15:19

Copilot AI left a comment


Pull request overview

Adds a new deployment documentation page that provides operator-focused guidance for monitoring and observability in production OpenCTI deployments, complementing the existing Telemetry reference page.

Changes:

  • Adds a new Monitoring & observability guide under Deployment → Platform, focusing on what to watch and how to structure dashboards.
  • Registers the new page in docs/mkdocs.yml navigation after the existing Telemetry entry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/mkdocs.yml | Adds the new Monitoring & observability page to the Deployment → Platform navigation. |
| docs/docs/deployment/monitoring.md | Introduces the new operational monitoring guide (metrics to watch, dashboard layout suggestions, baseline/alerting guidance). |


!!! note "Workers and connectors"

Workers and connectors are the ingestion engine. Watch worker throughput against the RabbitMQ queue depth: if the queue grows while workers are busy, you are ingestion-bound and may need more workers or more ElasticSearch capacity. If the queue grows while workers are *idle*, look for a downstream problem (for example a flushed Redis — see [Troubleshooting](advanced/troubleshooting.md)).
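
The triage rule in the note above can be sketched as a small decision function. This is only an illustration: the function name and return labels are hypothetical, and how `workers_busy` is determined (for example from worker throughput) is left to the caller as an assumption.

```python
def triage_ingestion(queue_depth_samples, workers_busy):
    """Classify ingestion health from RabbitMQ queue depth over time
    and whether the workers are currently busy.

    queue_depth_samples: queue depths ordered oldest to newest.
    workers_busy: True if workers are actively processing messages.
    """
    # The queue counts as "growing" when the newest sample exceeds the oldest.
    growing = (
        len(queue_depth_samples) >= 2
        and queue_depth_samples[-1] > queue_depth_samples[0]
    )
    if not growing:
        return "ok"
    # Growing queue + busy workers: more workers or more ElasticSearch capacity.
    # Growing queue + idle workers: look downstream (for example a flushed Redis).
    return "ingestion-bound" if workers_busy else "downstream-problem"


print(triage_ingestion([10, 50, 200], workers_busy=True))
print(triage_ingestion([10, 50, 200], workers_busy=False))
```

Comparing only the first and last sample is a deliberate simplification; a real dashboard would more likely use a rate or trend over a time window.
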
OpenCTI's behaviour is tightly coupled to its dependencies. Monitor each one directly:

- **ElasticSearch / OpenSearch** — cluster status (green/yellow/red), disk usage against the flood-stage watermark, JVM heap pressure, rejected writes and merge activity. Disk filling up is the single most common cause of ingestion stalls (`TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark`).
- **Redis** — memory usage against `maxmemory`, blocked clients and the slowlog. Redis holds critical platform state; see the Redis section in [Troubleshooting](advanced/troubleshooting.md) for why it must never be flushed.
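
As a rough sketch of the two checks above, the following Python helpers interpret values that have already been fetched: the `status` field of an ElasticSearch `_cluster/health` response and the `used_memory` and `maxmemory` fields of Redis `INFO memory`. The function names and the severity labels are assumptions made for illustration; fetching the data is out of scope here.

```python
def es_status_severity(health):
    """Map the `status` field of a parsed `_cluster/health` response
    to a severity label (the labels themselves are an assumption)."""
    mapping = {"green": "ok", "yellow": "warning", "red": "critical"}
    return mapping.get(health.get("status"), "unknown")


def redis_memory_ratio(info):
    """Return used_memory / maxmemory from a parsed Redis INFO dict,
    or None when maxmemory is 0 (Redis reports 0 when no limit is set)."""
    maxmemory = int(info.get("maxmemory", 0))
    if maxmemory == 0:
        return None
    return int(info["used_memory"]) / maxmemory


print(es_status_severity({"status": "yellow"}))
print(redis_memory_ratio({"used_memory": 512, "maxmemory": 1024}))
```

Keeping these as pure functions over parsed responses means they can be fed from any client or exporter the deployment already runs.
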

aHenryJard added the community label Jun 24, 2026