Receiving Resolved mail for a check where the fatigue setting is applied

The fatigue feature helps suppress alerts that are repeated within a certain time frame, but we are seeing some unexpected behavior in fatigue-enabled Sensu checks.

Fatigue settings in the Sensu check configuration:

  • "occurrences": "3"
  • "interval": "1200"
  • "allow_resolution": "false"

Expected behavior based on the above settings: if the Sensu check fails continuously, an alert should be raised on the 3rd failure; Sensu should then wait 20 minutes to see whether the check recovers, and if it is still failing, raise an alert every 20 minutes.
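
To make that cadence concrete, here is a minimal sketch in plain JavaScript (illustration only, not the Sensu fatigue filter itself) of when a critical alert would be expected under occurrences = 3 and interval = 1200 seconds:

// Minimal sketch (plain JavaScript, illustration only - not a Sensu filter):
// models when a critical alert is expected with occurrences = 3 and interval = 1200 seconds.
function shouldAlert(consecutiveFailures, secondsSinceLastAlert) {
  var OCCURRENCES = 3;  // fatigue_check/occurrences
  var INTERVAL = 1200;  // fatigue_check/interval, in seconds
  if (consecutiveFailures < OCCURRENCES) { return false; }   // suppress the first two failures
  if (consecutiveFailures === OCCURRENCES) { return true; }  // first alert on the 3rd failure
  return secondsSinceLastAlert >= INTERVAL;                  // then re-alert at most every 20 minutes
}

console.log(shouldAlert(1, 0));     // false - only one failure so far
console.log(shouldAlert(3, 0));     // true  - 3rd consecutive failure, alert raised
console.log(shouldAlert(5, 600));   // false - still failing, but only 10 minutes since the last alert
console.log(shouldAlert(5, 1200));  // true  - 20 minutes elapsed, alert again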

The issue is that, for example, a fatigue-enabled check fails once, the problem is fixed, and the check passes on its 2nd execution; a resolved alert is then sent even though no critical alert was ever sent.

Filter created:

{
  "type": "EventFilter",
  "api_version": "core/v2",
  "metadata": {
    "name": "fatique-resolve-alert-filter",
    "namespace": "default"
  },
  "spec": {
    "action": "allow",
    "expressions": [
      "event.check.status == 0 && event.check.history.slice(-parseInt(event.check.metadata.annotations['fatigue_check/occurrences'])).filter(h => h.status == 2).length >= parseInt(event.check.metadata.annotations['fatigue_check/occurrences'])"
    ],
    "runtime_assets": ["fatigue-check-filter"]
  }
}

But it seems like Sensu doesn't support the slice function in filter expressions.

I need it to dynamically check the last n history entries, where n is the configured number of occurrences.

@justinhenderson

Hi Shivani,
I am looking at it now. Please allow me a couple of hours to take a deep look at it and update you accordingly.

Hi @Nasirhussen_Mulla

Any update? Also, could you confirm that parseInt(event.check.metadata.annotations['fatigue_check/occurrences']) is correct and works as expected? The value is not getting updated.

Hey Shivani, Based on your configuration, here’s what we have observed:

  1. Issue Description:
  • Unexpected Resolved Alert: A resolved alert is sent after a single failure without a preceding critical alert. This suggests that the logic isn’t correctly handling a single failure followed by a resolution.
  • Possible Cause: The fatigue_check filter isn’t suppressing the resolved alert as expected.
  2. Expected Behavior:
  • Critical Alert Trigger: A critical alert should be triggered after the 3rd consecutive failure, as specified by your "occurrences": "3" setting.
  • Alert Interval: If the issue persists, Sensu should wait 1200 seconds (20 minutes) between sending additional alerts, per your "interval": "1200" setting.
  3. Analysis of the Filter Issue:
  • Disallowed Features in Sensu Filters:

  • Although Sensu’s event filters utilize ECMAScript 5, certain features—such as the slice method—are disallowed to prevent potential impacts on system performance.

  • This limitation affects your filter expressions, particularly the one using slice to process the event history.

  • Effect on Logic:

  • Using the unsupported slice method may cause the filter to malfunction, sending resolved alerts without prior critical alerts.

  4. Recommendations:
  • Modify the Filter Expression:

  • Replace the slice method with supported JavaScript constructs.

  • Use a manual loop to iterate over the event.check.history array and evaluate the last n entries.

  • Updated Filter Expression Example (see the ES5 sketch after this list):

event.check.status == 0 && (() => { const occurrences = parseInt(event.check.metadata.annotations['fatigue_check/occurrences']); const history = event.check.history; let failureCount = 0; for (let i = history.length - occurrences; i < history.length; i++) { if (history[i] && history[i].status == 2) { failureCount++; } } return failureCount >= occurrences; })()

  • Explanation:

  • This expression avoids the use of slice and utilizes a supported loop to count the number of failures in the last n history entries.

  • It ensures that a resolved alert is only sent if there were enough prior failures to trigger a critical alert.

  5. Potential Naming Confusion:
  • Misspelling of “Fatigue”:

  • We noticed that “fatigue” is misspelled as “fatique” in your configuration names and references.

  • While this may not be causing issues currently due to consistent usage, correcting the spelling can ensure clarity in the future.

  6. Configuration Details:
  • EventFilter Configuration:

  • Name: fatique-resolve-alert-filter (Consider changing to fatigue-resolve-alert-filter )

  • Location:

  • metadata.name : "fatique-resolve-alert-filter"

  • runtime_assets : ["fatigue-check-filter"] (Note the correct spelling here)

  • Handler Configuration:

  • Name: fatique_resolve_handler (Consider changing to fatigue_resolve_handler )

  • Location:

  • metadata.name : "fatique_resolve_handler"

  • filters : ["fatique-resolve-alert-filter"]

  • Check Configuration:

  • Name: elasticsearch-scroll-count-alert

  • Location:

  • handlers : ["tester_handler", "alert_handler", "fatique_resolve_handler"]

  7. Request for Additional Information if the issue continues after adjustments:
  • Please provide the changes you’ve made and the configuration files for other resources, such as handlers or filters, to better assist you.
  • It would be very helpful if you could share the script you’re using, as it might contain elements that affect the output and the pipeline.
  8. Next Steps:
  • Implement the Updated Filter: Apply the modified filter expression to replace the unsupported slice method.
  • Verify Alert Behavior: Test the alert flow to ensure critical and resolved alerts are sent as expected.
  • Consider Correcting Naming: Update the configuration names to the correct spelling of “fatigue” to avoid confusion.
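
For reference, the same loop can also be written in ES5 syntax (a function-expression IIFE with var instead of an arrow function with const/let). This is a sketch only: whether an IIFE is accepted inside a Sensu filter expression is an assumption that has not been verified in this thread, and the mock event exists only so the logic can be run in plain Node.js.

// ES5 sketch of the same loop (function-expression IIFE, var instead of const/let).
// Assumption: Sensu's filter evaluator accepts an IIFE; not verified here.
// The mock event is hypothetical and only allows local testing in Node.js.
var event = {
  check: {
    status: 0,
    history: [{ status: 2 }, { status: 2 }, { status: 2 }, { status: 0 }],
    metadata: { annotations: { 'fatigue_check/occurrences': '3' } }
  }
};

var allow = event.check.status == 0 && (function () {
  var occurrences = parseInt(event.check.metadata.annotations['fatigue_check/occurrences']);
  var history = event.check.history;
  var failureCount = 0;
  for (var i = history.length - occurrences; i < history.length; i++) {
    if (history[i] && history[i].status == 2) { failureCount++; }
  }
  return failureCount >= occurrences;
})();

// With this mock history the last 3 entries hold 2 failures plus the current success,
// so failureCount is 2 and allow is false. If the history array includes the current
// passing execution, the last N entries can never contain N failures, which is worth
// keeping in mind when testing condition 2.
console.log(allow);  // false

If this form parses on the backend, the boolean expression assigned to allow above would be the single string placed in the expressions array.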

Please don’t hesitate to let us know if you have any questions or need any more help with these changes.

Best,
Nasirhussen

Hi @Nasirhussen_Mulla

I tried out the expression you suggested.

{
  "type": "EventFilter",
  "api_version": "core/v2",
  "metadata": {
    "annotations": null,
    "labels": null,
    "name": "fatigue-resolve-alert-filter",
    "namespace": "default"
  },
  "spec": {
    "action": "allow",
    "expressions": [
      "event.check.status == 0 && (() => { const occurrences = parseInt(event.check.metadata.annotations['fatigue_check/occurrences']); const history = event.check.history; let failureCount = 0; for (let i = history.length - occurrences; i < history.length; i++) { if (history[i] && history[i].status == 2) { failureCount++; } } return failureCount >= occurrences; })()"
    ],
    "runtime_assets": [
      "fatigue-check-filter"
    ]
  }
}

Handler:

{
  "api_version": "core/v2",
  "type": "Handler",
  "metadata": {
    "namespace": "default",
    "name": "fatique_resolve_handler"
  },
  "spec": {
    "type": "pipe",
    "command": "sensu-email-handler --authMethod login -T /etc/sensu/configs/conf.d/email-body-templates/resolve_email_template -S 'ENV:@@envname@@|ALERT:{{.Entity.Name}}/{{.Check.Name}}:RESOLVED' -f @@FROM_MAIL_ID@@ -t @@mail_recipient@@ -s smtp.sendgrid.net -u apikey -p @@FROM_MAIL_ID_PSWRD@@",
    "timeout": 1200,
    "filters": [
      "fatigue-resolve-alert-filter"
    ]
  }
}

Check definition:

{
  "api_version": "core/v2",
  "type": "Check",
  "metadata": {
    "namespace": "default",
    "name": "elasticsearch-scroll-count-alert",
    "labels": {},
    "annotations": {
      "fatigue_check/occurrences": "3",
      "fatigue_check/interval": "21600",
      "fatigue_check/allow_resolution": "false"
    }
  },
  "spec": {
    "command": "python3.11 /etc/sensu/plugins/xyz.py",
    "subscriptions": [
      "worker"
    ],
    "publish": true,
    "round_robin": true,
    "cron": "*/5 * * * *",
    "handlers": [
      "tester_handler",
      "alert_handler",
      "fatigue_resolve_handler"
    ],
    "proxy_entity_name": "proxyclient",
    "timeout": 120
  }
}
But after making these changes, the Sensu server pods are crashing.

Logs:

Error: error putting resource #3 with name "fatique-resolve-alert-filter" and namespace "default" (/api/core/v2/namespaces/default/filters/fatique-resolve-alert-filter): resource is invalid: syntax error in expression 0: (anonymous): Line 1:30 Unexpected token ) (and 10 more errors)
could not configure Event Filters
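
The reported column appears to line up with the "()" parameter list of the arrow function, and arrow functions are ES6 syntax; an evaluator that only parses ECMAScript 5 would reject them at exactly that token. A small comparison sketch in plain Node.js (the ES5-only-evaluator point is an assumption, not verified here):

// "Line 1:30" falls on the ")" of "() =>" in the submitted expression.
var es6Style = (() => { return true; })();        // ES6 arrow-function IIFE - an ES5-only parser rejects this
var es5Style = (function () { return true; })();  // ES5 function-expression IIFE - parseable under ES5
console.log(es6Style, es5Style);                  // true true (when run in Node.js)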

Could you please suggest how to proceed further?

Hi @Shivani_Bhardwaj

It would be more helpful if you could DM us the last 48 hours of backend logs, or the logs of the pod where the backend is running. Meanwhile, I am doing further investigation with @justinhenderson and will keep you posted.

Many Thanks,
Nasirhussen

Hi @Nasirhussen_Mulla
These are the recent logs.
The pod is crashing internally with this configuration.

Hi @Nasirhussen_Mulla

To better understand where the issue could be, we tried out the expressions one by one.

The filter is not passing with the following:
{
  "type": "EventFilter",
  "api_version": "core/v2",
  "metadata": {
    "annotations": null,
    "labels": null,
    "name": "fatigue-resolve-alert-filter",
    "namespace": "default"
  },
  "spec": {
    "action": "allow",
    "expressions": [
      "event.check.status == 0",
      "const occurrences = parseInt(event.check.metadata.annotations['fatigue_check/occurrences']"
    ],
    "runtime_assets": [
      "fatigue-check-filter"
    ]
  }
}
The pod is crashing if I use this single line in the expressions. Could you please help us debug further?
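
One thing to note (our reading of the filter behavior, so treat it as an assumption): each entry in expressions needs to be a complete boolean expression on its own, and multiple entries are combined with a logical AND. The second entry above is a const declaration, i.e. a statement rather than an expression, and it is also missing a closing parenthesis; either of these would make the resource invalid. A hypothetical split into self-contained expressions, exercised against a mock event in plain Node.js:

// Mock event (hypothetical, for local testing only).
var event = { check: { status: 0, metadata: { annotations: { 'fatigue_check/occurrences': '3' } } } };

// Each line below is a complete boolean expression, which is the shape a filter entry needs.
var expr1 = event.check.status == 0;
var expr2 = parseInt(event.check.metadata.annotations['fatigue_check/occurrences']) >= 1;

console.log(expr1 && expr2);  // true - the separate entries would be ANDed together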

Hi @Shivani_Bhardwaj,

Yes, we are actively working on it and testing some filter expressions. We will provide an update once the testing is complete.

Many Thanks,
Nasirhussen

Hi @Nasirhussen_Mulla

Any luck on this?

Hi @Shivani_Bhardwaj

I’ve prepared another filter and am currently performing the final testing on our end. I’ll update you as soon as possible.

Many Thanks,
Nasirhussen

@Nasirhussen_Mulla Thanks for your update

Hi @Shivani_Bhardwaj

I’ve prepared and tested the following filter, and it is now working as expected.

type: EventFilter
api_version: core/v2
metadata:
  name: fatigue-resolve-alert-filter
  namespace: default
  labels:
    sensu.io/managed_by: sensuctl
    created_by: sensu
spec:
  action: allow
  expressions:
    - >-
      event.check.occurrences ==
      event.check.annotations["fatigue_check/occurrences"] ||
      (event.check.occurrences % event.check.annotations["fatigue_check/interval"] == 0)
  runtime_assets:
    - sensu/sensu-go-fatigue-check-filter
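
A short note on why the comparisons above work even though annotation values are strings: event.check.occurrences is a number, while the annotations are strings such as "3" and "21600", so the expression relies on JavaScript's loose equality and implicit numeric coercion. A plain Node.js sketch of that behavior:

console.log(3 == "3");           // true  - loose equality coerces the string to a number
console.log(3 === "3");          // false - strict equality would not match
console.log(21600 % "21600");    // 0     - the % operator also coerces the string
console.log(6 % "21600" == 0);   // false - e.g. occurrences = 6 against interval "21600"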

Many Thanks,
Nasirhussen

Hi @Shivani_Bhardwaj
I believe the label did not come through in the appropriate format, so you can use the screenshot below for reference.

Hi @Nasirhussen_Mulla

  • Condition 1:
    If the check fails fewer times than the defined number of ["fatigue_check/occurrences"], no resolved alert should be sent. — Works

  • Condition 2:
    If the check fails equal to or more than the defined ["fatigue_check/occurrences"], then a resolved alert should be sent upon a successful resolution of the check.

I see that Condition 2 (where a resolved alert should be sent after the number of failures equals or exceeds the defined occurrences) is not being satisfied.

We need to ensure that the logic accurately counts failures and then triggers a resolved alert after the failures meet or exceed the threshold.

Hi @Shivani_Bhardwaj ,

Could you please elaborate on condition 2 in more depth, and also confirm when the resolved alert should be triggered: on the first successful occurrence, or should it wait until the number of successes reaches the occurrence value specified in the dynamic field?

Many Thanks,

Nasirhussen

Hi @Nasirhussen_Mulla

A resolved alert should only be sent if the number of failures in the last N history entries is equal to or greater than fatigue_check/occurrences.

It should wait until the occurrence count mentioned in the dynamic field is fulfilled with successes.
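
One possible reading of these two requirements, sketched as a plain Node.js helper (the helper name and the use of the check history are assumptions, not a tested Sensu filter): allow a resolved alert only when the most recent N history entries are all successes and the N entries before them contained at least N failures.

// Hypothetical helper, illustration only - not a verified Sensu filter expression.
function allowResolve(history, occurrences) {
  var n = history.length;
  if (n < 2 * occurrences) { return false; }                     // not enough history yet
  var successes = 0;
  var failures = 0;
  for (var i = n - occurrences; i < n; i++) {                    // the most recent N entries
    if (history[i].status === 0) { successes++; }
  }
  for (var j = n - 2 * occurrences; j < n - occurrences; j++) {  // the N entries before those
    if (history[j].status === 2) { failures++; }
  }
  return successes === occurrences && failures >= occurrences;
}

// Example: three criticals followed by three passes -> the resolved alert is allowed.
var history = [{ status: 2 }, { status: 2 }, { status: 2 }, { status: 0 }, { status: 0 }, { status: 0 }];
console.log(allowResolve(history, 3));                        // true
console.log(allowResolve([{ status: 2 }, { status: 0 }], 3)); // false - only one failure and one pass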

Hi @Nasirhussen_Mulla

Could you please help us further with this case?

Hi Shivani,

Yes, I have started working on the expression based on your second condition requirement. It’s taking some time to create and test the filter properly.

Many Thanks,
Nasirhussen

Hi @Shivani_Bhardwaj

Just to reconfirm the requirements from your side, based on our understanding, the fatigue check filter will function as follows:

  1. Critical Alert: A critical alert will be triggered after the number of consecutive failures reaches the value specified in the “occurrence” dynamic field.
  2. Resolved Alert: A resolved alert will be triggered after the number of consecutive successes reaches the value specified in the “occurrence” dynamic field.
    Could you please confirm if our understanding of condition 2 is correct and that the filter will indeed behave as described for resolved alerts?

Many Thanks,
Nasirhussen