docs(deployment): add monitoring & observability guide (#14838) #16774
Open
YAMRAJ13y wants to merge 1 commit into
…orm#14838) Add an operational monitoring guide layered on top of the telemetry reference: the three layers to monitor (application, infrastructure, host), what to watch for each native platform metric, a suggested Grafana dashboard structure by operational concern, and guidance on setting deployment-relative baselines and alerts. Register the page in the deployment navigation. Closes OpenCTI-Platform#14838
Pull request overview
Adds a new deployment documentation page that provides operator-focused guidance for monitoring and observability in production OpenCTI deployments, complementing the existing Telemetry reference page.
Changes:
- Adds a new Monitoring & observability guide under Deployment → Platform, focusing on what to watch and how to structure dashboards.
- Registers the new page in the `docs/mkdocs.yml` navigation after the existing Telemetry entry.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `docs/mkdocs.yml` | Adds the new Monitoring & observability page to the Deployment → Platform navigation. |
| `docs/docs/deployment/monitoring.md` | Introduces the new operational monitoring guide (metrics to watch, dashboard layout suggestions, baseline/alerting guidance). |
!!! note "Workers and connectors"

    Workers and connectors are the ingestion engine. Watch worker throughput against the RabbitMQ queue depth: if the queue grows while workers are busy, you are ingestion-bound and may need more workers or more ElasticSearch capacity. If the queue grows while workers are *idle*, look for a downstream problem (for example a flushed Redis — see [Troubleshooting](advanced/troubleshooting.md)).
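The triage rule in this note can be sketched as a small pure function. This is a minimal illustration, not part of the PR: the `IngestionSample` fields, the `busy_threshold` value, and the label strings are all hypothetical, and how you collect queue depth and worker activity depends on your own exporters.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IngestionSample:
    """One observation of the ingestion pipeline (hypothetical shape)."""
    queue_depth: int    # messages waiting in RabbitMQ
    busy_workers: int   # workers currently processing a message
    total_workers: int  # workers deployed


def triage_ingestion(samples: Sequence[IngestionSample],
                     busy_threshold: float = 0.8) -> str:
    """Classify a time-ordered window of samples.

    Returns 'healthy', 'ingestion-bound' or 'downstream-problem'.
    busy_threshold is an assumed cut-off between "busy" and "idle".
    """
    if len(samples) < 2:
        return "healthy"  # not enough data to see a trend

    # The queue is "growing" if it is deeper at the end of the window.
    if samples[-1].queue_depth <= samples[0].queue_depth:
        return "healthy"

    total = sum(s.total_workers for s in samples)
    busy_ratio = sum(s.busy_workers for s in samples) / total if total else 0.0

    # Growing queue + busy workers: add workers or search capacity.
    if busy_ratio >= busy_threshold:
        return "ingestion-bound"
    # Growing queue + idle workers: look downstream (for example Redis).
    return "downstream-problem"
```

A window where the queue goes from 100 to 500 messages with all workers busy is classified as `ingestion-bound`; the same growth with no busy workers is classified as `downstream-problem`.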
OpenCTI's behaviour is tightly coupled to its dependencies. Monitor each one directly:

- **ElasticSearch / OpenSearch** — cluster status (green/yellow/red), disk usage against the flood-stage watermark, JVM heap pressure, rejected writes and merge activity. Disk filling up is the single most common cause of ingestion stalls (`TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark`).
- **Redis** — memory usage against `maxmemory`, blocked clients and the slowlog. Redis holds critical platform state; see the Redis section in [Troubleshooting](advanced/troubleshooting.md) for why it must never be flushed.
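Two of the checks in this list, disk usage against the flood-stage watermark and Redis memory against `maxmemory`, reduce to simple threshold comparisons. The sketch below is illustrative only: the function names, the `margin_pct` and `warn_ratio` values, and the returned strings are assumptions, and the 95% default should be replaced with whatever watermark your cluster actually uses.

```python
from typing import Optional


def disk_alert(used_pct: float,
               flood_stage_pct: float = 95.0,
               margin_pct: float = 10.0) -> Optional[str]:
    """Compare disk usage (percent) against the flood-stage watermark.

    flood_stage_pct defaults to 95.0, Elasticsearch's documented default;
    confirm the value configured on your own cluster.
    margin_pct is an assumed early-warning margin below the watermark.
    """
    if used_pct >= flood_stage_pct:
        return "critical"  # writes are likely being rejected already
    if used_pct >= flood_stage_pct - margin_pct:
        return "warning"   # approaching the watermark
    return None


def redis_memory_alert(used_memory: int,
                       maxmemory: int,
                       warn_ratio: float = 0.8) -> Optional[str]:
    """Compare Redis used_memory against maxmemory (both in bytes).

    A maxmemory of 0 means no limit is configured, so no ratio exists.
    warn_ratio is an assumed threshold, not a recommended value.
    """
    if maxmemory <= 0:
        return None
    ratio = used_memory / maxmemory
    if ratio >= warn_ratio:
        return f"redis memory at {ratio:.0%} of maxmemory"
    return None
```

Keeping the checks as pure functions makes the thresholds easy to tune per deployment, independent of how the underlying values are scraped.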
Proposed changes
Adds a new Monitoring & observability page to the deployment docs, as requested in #14838.
The existing Telemetry page is reference-level (metric names, attributes, exporter config). This guide adds the operational "so what?" layer on top:
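- The three layers to monitor (application, infrastructure, host).
- What to watch for each native platform metric (`opencti_api_requests`, `opencti_api_errors`, `opencti_api_latency`, `opencti_api_direct_bulk`/`side_bulk`, `opencti_sent_email`, plus Node.js runtime metrics).
- A suggested Grafana dashboard structure by operational concern.
- Guidance on setting deployment-relative baselines and alerts.

It links to the existing Telemetry reference and to the Troubleshooting guide (including the Redis state section) for root-cause correlation, and is registered in the nav under Deployment → Platform, after Telemetry.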
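To illustrate what a deployment-relative alert on the request and error metrics might look like, here is a minimal sketch. It assumes `opencti_api_requests` and `opencti_api_errors` behave as monotonically increasing counters whose deltas over an interval have already been computed; the function names and the `tolerance` value are hypothetical and not taken from the guide.

```python
from statistics import mean, pstdev
from typing import Sequence


def error_ratio(requests_delta: float, errors_delta: float) -> float:
    """Errors per request over an interval, from counter deltas."""
    if requests_delta <= 0:
        return 0.0  # no traffic in the interval, so no meaningful ratio
    return errors_delta / requests_delta


def exceeds_baseline(current: float,
                     baseline: Sequence[float],
                     tolerance: float = 3.0) -> bool:
    """True if `current` is above mean + tolerance * stddev of the baseline.

    `baseline` holds ratios observed on this same deployment during
    normal operation; `tolerance` is an assumed multiplier.
    """
    if not baseline:
        return False  # no baseline recorded yet, nothing to compare with
    threshold = mean(baseline) + tolerance * pstdev(baseline)
    return current > threshold
```

The point of comparing against the deployment's own history, rather than a fixed number, is that a ratio that is normal on one platform may be a clear regression on another.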
All metric names referenced are verified against the source (`opencti-graphql/src/config/tracing.ts`) and the existing telemetry reference.

Related issues
Closes #14838
How to test
Render the docs and open Deployment → Platform → Monitoring & observability; verify the new page and its nav entry, and that the links to Telemetry/Troubleshooting resolve.
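The link check can also be approximated without rendering the site. The helper below is a rough sketch, not part of the PR: `broken_relative_links` is a hypothetical name, it only handles inline `[text](target)` links, and it resolves targets relative to the Markdown file's own directory.

```python
import re
from pathlib import Path
from typing import List

# Inline Markdown links: [text](target). Reference-style links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URL scheme (https:, mailto:, ...) is external.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> List[str]:
    """Return relative link targets in the file that do not exist on disk."""
    text = markdown_file.read_text(encoding="utf-8")
    broken = []
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip external URLs and same-page anchors
        path_part = target.split("#", 1)[0]  # drop any #anchor suffix
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

Calling `broken_relative_links(Path("docs/docs/deployment/monitoring.md"))` from the repository root would list any targets that fail to resolve; an empty list means every relative link points at an existing file.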