docs(deployment): add monitoring & observability guide (#14838) by YAMRAJ13y · Pull Request #16774 · OpenCTI-Platform/opencti

YAMRAJ13y · 2026-06-23T15:19:11Z

Proposed changes

Adds a new Monitoring & observability page to the deployment docs, as requested in #14838.

The existing Telemetry page is reference-level (metric names, attributes, exporter config). This guide adds the operational "so what?" layer on top:

- The three layers to monitor: application (platform/workers/connectors), infrastructure (ElasticSearch, Redis, RabbitMQ, S3/MinIO), and host/orchestrator, plus how to correlate symptoms across them.
- What to watch for each native platform metric (`opencti_api_requests`, `opencti_api_errors`, `opencti_api_latency`, `opencti_api_direct_bulk`/`side_bulk`, `opencti_sent_email`, plus Node.js runtime metrics).
- A suggested Grafana dashboard structure organised by operational concern (OpenCTI ships none).
- Baselines and alerts guidance, framed as deployment-relative (no invented absolute numbers) and alerting on deviations/trends.
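As an illustration of the "what to watch" layer, the sketch below derives an error ratio from two successive scrapes of `opencti_api_requests` and `opencti_api_errors`. It assumes both metrics are exposed as cumulative counters, which this PR does not state; the function names are hypothetical and are not part of OpenCTI.

```python
from typing import Optional


def counter_delta(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two scrapes.

    Assumption: if the value went backwards, the process restarted and the
    counter began again from zero, so the current value is the increase.
    """
    return current - previous if current >= previous else current


def error_ratio(req_prev: float, req_now: float,
                err_prev: float, err_now: float) -> Optional[float]:
    """Share of requests that failed during the scrape interval.

    Returns None when no requests were observed, so that an idle platform
    is not reported as either healthy or failing.
    """
    requests = counter_delta(req_prev, req_now)
    errors = counter_delta(err_prev, err_now)
    if requests <= 0:
        return None
    return errors / requests
```

Tracking the ratio rather than the raw error count keeps the signal comparable between quiet and busy periods.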

It links to the existing Telemetry reference and to the Troubleshooting guide (including the Redis state section) for root-cause correlation, and is registered in the nav under Deployment → Platform, after Telemetry.

All metric names referenced are verified against the source (opencti-graphql/src/config/tracing.ts) and the existing telemetry reference.

The guide intentionally avoids prescriptive absolute thresholds (e.g. "p95 < X ms"), since healthy ranges depend on data volume and sizing. Happy to adjust the structure/placement to maintainer preference.
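One way to alert on deviation rather than on a fixed number is to compare the current value with a baseline computed from the deployment's own recent history. The following is a minimal sketch of that idea using the median and the median absolute deviation (MAD); the technique, the function name, and the default tolerance are illustrative choices, not something the guide prescribes.

```python
from statistics import median
from typing import Sequence


def deviates_from_baseline(history: Sequence[float], current: float,
                           tolerance: float = 3.0, min_samples: int = 5) -> bool:
    """Return True when `current` sits well above this deployment's own baseline.

    The baseline is the median of `history`; the spread is the median
    absolute deviation (MAD). Both are robust to occasional spikes.
    """
    if len(history) < min_samples:
        return False  # not enough data to call anything a deviation
    baseline = median(history)
    mad = median([abs(value - baseline) for value in history])
    if mad == 0:
        # Perfectly flat history: any increase counts as a deviation.
        return current > baseline
    return (current - baseline) / mad > tolerance


# Example: latency samples (any unit) collected from one deployment.
samples = [100, 102, 98, 101, 99, 103, 97]
print(deviates_from_baseline(samples, 104))  # within normal spread
print(deviates_from_baseline(samples, 120))  # far above the baseline
```

Because the baseline comes from the deployment's own data, the same rule applies to small and large installations without choosing an absolute threshold.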

Related issues

Closes #14838

How to test

Render the docs and open Deployment → Platform → Monitoring & observability; verify the new page and its nav entry, and that the links to Telemetry/Troubleshooting resolve.

…orm#14838) Add an operational monitoring guide layered on top of the telemetry reference: the three layers to monitor (application, infrastructure, host), what to watch for each native platform metric, a suggested Grafana dashboard structure by operational concern, and guidance on setting deployment-relative baselines and alerts. Register the page in the deployment navigation. Closes OpenCTI-Platform#14838

Copilot

Pull request overview

Adds a new deployment documentation page that provides operator-focused guidance for monitoring and observability in production OpenCTI deployments, complementing the existing Telemetry reference page.

Changes:

- Adds a new Monitoring & observability guide under Deployment → Platform, focusing on what to watch and how to structure dashboards.
- Registers the new page in `docs/mkdocs.yml` navigation after the existing Telemetry entry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `docs/mkdocs.yml` | Adds the new Monitoring & observability page to the Deployment → Platform navigation. |
| `docs/docs/deployment/monitoring.md` | Introduces the new operational monitoring guide (metrics to watch, dashboard layout suggestions, baseline/alerting guidance). |

+
+!!! note "Workers and connectors"
+
+    Workers and connectors are the ingestion engine. Watch worker throughput against the RabbitMQ queue depth: if the queue grows while workers are busy, you are ingestion-bound and may need more workers or more ElasticSearch capacity. If the queue grows while workers are *idle*, look for a downstream problem (for example a flushed Redis — see [Troubleshooting](advanced/troubleshooting.md)).
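The decision rule in this note can be sketched as a small classifier. This is only an illustration of the reasoning: the names are hypothetical, queue depth is assumed to be sampled from RabbitMQ by some external collector, and "busy" is assumed to be a boolean the operator derives from worker activity.

```python
from enum import Enum
from typing import Sequence


class IngestionState(Enum):
    HEALTHY = "healthy"
    INGESTION_BOUND = "ingestion-bound"
    DOWNSTREAM_PROBLEM = "downstream-problem"


def classify_ingestion(queue_depths: Sequence[int], workers_busy: bool) -> IngestionState:
    """Classify the ingestion pipeline from queue-depth samples (oldest first).

    A queue that grows while workers are busy suggests too little capacity;
    a queue that grows while workers are idle suggests that something
    downstream is preventing them from consuming.
    """
    growing = len(queue_depths) >= 2 and queue_depths[-1] > queue_depths[0]
    if not growing:
        return IngestionState.HEALTHY
    if workers_busy:
        return IngestionState.INGESTION_BOUND
    return IngestionState.DOWNSTREAM_PROBLEM
```

Comparing only the first and last sample is deliberately crude; a real check would likely smooth over a longer window before alerting.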


+OpenCTI's behaviour is tightly coupled to its dependencies. Monitor each one directly:
+
+- **ElasticSearch / OpenSearch** — cluster status (green/yellow/red), disk usage against the flood-stage watermark, JVM heap pressure, rejected writes and merge activity. Disk filling up is the single most common cause of ingestion stalls (`TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark`).
+- **Redis** — memory usage against `maxmemory`, blocked clients and the slowlog. Redis holds critical platform state; see the Redis section in [Troubleshooting](advanced/troubleshooting.md) for why it must never be flushed.
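The two headroom checks described above (disk usage against the flood-stage watermark, Redis memory against `maxmemory`) reduce to simple ratios. The sketch below assumes the inputs have already been collected, for example from ElasticSearch filesystem statistics and from the `used_memory` and `maxmemory` fields of Redis `INFO`. The 95% default mirrors ElasticSearch's default flood-stage watermark, but a cluster may override it or express it as an absolute size, so treat it as a parameter. Function names are hypothetical.

```python
from typing import Optional


def disk_headroom_pct(used_bytes: int, total_bytes: int,
                      flood_stage_pct: float = 95.0) -> float:
    """Percentage points left before disk usage reaches the flood-stage watermark.

    A negative result means the watermark has already been exceeded.
    """
    used_pct = 100.0 * used_bytes / total_bytes
    return flood_stage_pct - used_pct


def redis_memory_ratio(used_memory: int, maxmemory: int) -> Optional[float]:
    """Fraction of `maxmemory` currently in use.

    Redis reports maxmemory as 0 when no limit is configured; in that case
    there is nothing to compare against and None is returned.
    """
    if maxmemory == 0:
        return None
    return used_memory / maxmemory
```

Both values trend slowly in normal operation, so they lend themselves to alerts on the rate of change as well as on the remaining headroom.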


Copilot AI review requested due to automatic review settings June 23, 2026 15:19

Copilot started reviewing on behalf of YAMRAJ13y June 23, 2026 15:19

Copilot AI reviewed Jun 23, 2026


aHenryJard added the community Contribution from the community. label Jun 24, 2026

YAMRAJ13y wants to merge 1 commit into OpenCTI-Platform:master from YAMRAJ13y:docs/14838-monitoring-guide

3 participants

