feat: consecutive-window alerts#2472
Conversation
|
|
@mrkaye97 is attempting to deploy a commit to the HyperDX Team on Vercel. A member of the Team first needs to authorize it. |
Greptile SummaryThis PR adds a "consecutive windows" feature to alerts, requiring a threshold condition to be met for N continuous evaluation windows before a notification fires. A new
Confidence Score: 5/5The change is safe to merge. The consecutive-window counting logic, fired-field tracking, and resolution-notification guard all behave correctly under normal operating conditions. All three code paths (single_value, zero-fill, and group-by time-series) correctly set fired=true/false in both branches of the shouldFireBasedOnConsecutiveWindows check, and the resolution guard properly uses fired !== false to suppress spurious OK webhooks when no notification was ever sent. No new defects were found in this review pass. packages/api/src/tasks/checkAlerts/index.ts — the shouldFireBasedOnConsecutiveWindows function is the most complex new code path and benefits from additional review, particularly around edge cases that arise when the alert-check service experiences downtime. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Alert check fires] --> B{doesExceedThreshold?}
B -- No --> C[history.state = OK\nfired = undefined]
B -- Yes --> D[history.state = ALERT]
D --> E{shouldFireBased\nOnConsecutiveWindows?}
E -- numConsecutiveWindows ≤ 1 --> F[return true]
E -- numConsecutiveWindows > 1 --> G[Query N-1 most recent\nAlertHistory records\ncreatedAt < nowRoundDown]
G --> H{length == N-1 AND\nall state == ALERT?}
H -- No --> I[history.fired = false\nNo notification sent]
H -- Yes --> F
F --> J[history.fired = true\nSend ALERT notification]
C --> K{sendNotificationIfResolved}
I --> K
J --> K
K --> L{previousHistory.state == ALERT\nAND fired !== false\nAND currentHistory.state == OK?}
L -- Yes --> M[Send OK/resolution notification]
L -- No --> N[Skip resolution]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[Alert check fires] --> B{doesExceedThreshold?}
B -- No --> C[history.state = OK\nfired = undefined]
B -- Yes --> D[history.state = ALERT]
D --> E{shouldFireBased\nOnConsecutiveWindows?}
E -- numConsecutiveWindows ≤ 1 --> F[return true]
E -- numConsecutiveWindows > 1 --> G[Query N-1 most recent\nAlertHistory records\ncreatedAt < nowRoundDown]
G --> H{length == N-1 AND\nall state == ALERT?}
H -- No --> I[history.fired = false\nNo notification sent]
H -- Yes --> F
F --> J[history.fired = true\nSend ALERT notification]
C --> K{sendNotificationIfResolved}
I --> K
J --> K
K --> L{previousHistory.state == ALERT\nAND fired !== false\nAND currentHistory.state == OK?}
L -- Yes --> M[Send OK/resolution notification]
L -- No --> N[Skip resolution]
Reviews (5): Last reviewed commit: "chore: lint" | Re-trigger Greptile |
| const alertHistory = await AlertHistory.find({ | ||
| alert: new mongoose.Types.ObjectId(alert.id), | ||
| ...groupFilter, | ||
| createdAt: { $lt: nowInMinsRoundDown }, | ||
| }) | ||
| .sort({ createdAt: -1 }) | ||
| .limit(numWindowsToLookBack - 1); | ||
|
|
||
| return ( | ||
| alertHistory.length === numWindowsToLookBack - 1 && | ||
| alertHistory.every(h => h.state === AlertState.ALERT) | ||
| ); |
There was a problem hiding this comment.
Consecutive-window check doesn't verify temporal proximity
shouldFireBasedOnConsecutiveWindows queries the N-1 most recent AlertHistory records without filtering by timestamp proximity. If the alert-check service experienced downtime (e.g., hours of outage), old ALERT records from before the gap would still be returned by the sort-and-limit query and would satisfy the every(h => h.state === ALERT) check, causing the current window to fire as if the intervening period had never broken the streak.
Adding a lower-bound timestamp filter — e.g. createdAt: { $gte: new Date(nowInMinsRoundDown.getTime() - (numWindowsToLookBack - 1) * windowSizeInMins * 60_000), $lt: nowInMinsRoundDown } — would ensure only truly consecutive windows are counted.
There was a problem hiding this comment.
This seems reasonable, would love input from a maintainer on this one, as I'm not too familiar here!
0214704 to
897fa74
Compare
Summary
Fixes #2469 (comment)
Adding consecutive-window configuration to alerts, so that you can specify a condition like "only fire this alert after some condition is met for N consecutive windows." This helps prevent flaky alerts (and pages), and cut down on alert noise in many cases.
Not too sure on a few things here, so would love some feedback:
Screenshots or video
Screenshot from my local of the new alert modal:
Testing Locally
I tested by creating a trivial alert that'd fire on any error log that we see at least once per minute for three minutes, and then sent a curl once per minute to trigger it:
Sent webhooks to
webhook.site, and saw logs matching what we'd expect:And then:
How to test on Vercel preview
Preview routes:
Wasn't sure what to put here!
References
Source Issue: #2469 (comment)