I have set up Sensu checks against my Windows Datacenter servers and noticed some weird behavior: the Sensu Windows plugin checks (CPU / RAM / disk) take up 20% CPU per run, and sometimes the agent stops responding to the backend because the machine runs out of resources. That in turn leaves the PowerShell scripts unable to exit, which increases CPU usage further. The issue is not so much running out of resources as the PowerShell scripts never finishing, which causes problems on the instances. This seems to point to a possible leak in the scripts, but I cannot say for sure.
The instances are windows instances on Microsoft Windows Server 2016 Datacenter 10.0.14393 Build 14393 / Server which are hosted on AWS. The version of Sensu Agent is 6.2 and Sensu backend is Sensu Go 6.1.0
The Windows plugin in use is sensu/sensu-windows-powershell-checks, and the checks are running something as simple as
powershell.exe -ExecutionPolicy ByPass -C check-windows-cpu-load.ps1 90 98
or
powershell.exe -ExecutionPolicy ByPass -C check-windows-ram.ps1 90 98
Another thing to note is that the same plugin scripts were being used with Sensu Core, and this was an issue then as well. From what I have observed, the issue seems to have started when we upgraded our instances from Microsoft Windows Server 2012 Datacenter to Microsoft Windows Server 2016 Datacenter, but I can no longer confirm it on the older machines because I have lost access to them.
Hey,
If you run the PowerShell scripts manually outside of Sensu, do they incur similar CPU load and time to complete? They are just scripts, so you can run a version of that command from a terminal on that host. Because the load was similar under Sensu Core as well, it feels like this is a PowerShell issue, or maybe these scripts need to be refactored. Unfortunately, I’m not the best person to assess the performance of these PowerShell scripts… I’m steeped in POSIX shell experience, but not PowerShell (and PowerShell on Linux, while available, is a totally different animal because it doesn’t have the Windows API integrations these scripts use).
If there is someone out there with Windows PowerShell experience who could contribute performance optimizations, I’m happy to work with them and set up a GitHub Actions testing workflow in the repository so we can track the performance and make sure it doesn’t regress in the future.
What might help you mitigate in the meantime is set a timeout in the checks to give the scripts a boxed time budget, and if the timeouts are tripping push the schedule interval and timeouts back a bit so they fire less often.
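For comparing timings outside of Sensu, something along these lines might help. It is only a sketch: `Measure-CheckRun` is a hypothetical helper name, not part of the plugin, and it only reports wall-clock time (not CPU usage) using the built-in `Measure-Command` cmdlet.

```powershell
# Hypothetical helper (not part of sensu-windows-powershell-checks):
# runs a script block several times and summarizes the wall-clock time in ms.
function Measure-CheckRun {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $ScriptBlock,
        [int] $Runs = 5
    )
    $ms = foreach ($i in 1..$Runs) {
        # Measure-Command returns a TimeSpan for one execution of the block.
        (Measure-Command { & $ScriptBlock | Out-Null }).TotalMilliseconds
    }
    # Count, Average, Minimum and Maximum over all runs.
    $ms | Measure-Object -Average -Minimum -Maximum
}
```

You would call it with the same command line the check uses, for example `Measure-CheckRun -ScriptBlock { powershell.exe -ExecutionPolicy ByPass -C check-windows-cpu-load.ps1 90 98 }`, and compare the numbers between hosts.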
Yup, running them manually also seems to push the CPU usage to around the 20% mark for a quick second or two. For now I have increased the check interval from 60 to 300 seconds so the calls hog the CPU less often. I am a Linux developer myself, and this was brought to my attention by a Windows team member who has noticed this oddity over the last few months. Let me know if there is something else I can do to help investigate. More information? A place to test any changes?
I am running a test currently to see if the same issue exists in Windows 2012 and will be able to gather enough data to compare the performance of the machine over a week period. I should be able to report back on Tuesday with the stats.
Okay, so since this is happening outside of Sensu itself, what I’m going to do is set up a GitHub Actions Windows environment that can be used as a test bed to get some performance testing into the repo where these plugins live. I’ll spin up a branch in the repository and try to build a cut-down no-op PowerShell check script as a baseline. I’ll see what I can set up next week.
Hey,
So I’ve spun up a windows based github action just to see what’s going on outside of Sensu…
here’s what’s weird… it looks like the checks i threw in there from the readme examples complete quickly.
Take a look…
https://github.com/sensu/sensu-windows-powershell-checks/runs/1939726860?check_suite_focus=true
is there a specific check you’d like me to try in the github action environment?
The fact that the github action environment seems to be running these much more quickly…is interesting. What’s different about that windows env github action is running?
@jspaleta I am specifically running the check-windows-cpu-load.ps1 and check-windows-ram.ps1 checks. What version of Windows are the actions being run against? Could the OS version itself be the issue? So far I have seen nothing on Windows Server 2012 Datacenter or Windows Server 2008 Datacenter, but the problem with those OSes is that they have many flaws and need patching, or they are at EOL.
@todd I’ll get someone from my Windows team to work with me to try to set this up and see if it works. I am assuming I will need to install these scripts at the Windows agent location as well.
From the first look, seems like the new asset suggested by you works much better due to the lack of powershell spin up for the check executions. I’m still monitoring a Windows 2016 instance to see if there is any issue with these assets from nixwiz. Currently testing the setup by running the check every second to hit the instance hard and reach a state where the possible memory leak issue seen in the other asset comes through but so far an hour of checks and nothing. I will keep you both posted
Here is the difference between 2 cpu checks that are being run. Powershell check (3rd from the top) for cpu takes about 5.0% consistently (i was wrong before when i mentioned 20% since this is a brand new instance with nothing on it) while the command based on nixwiz asset barely registers any CPU spike (last process in the image). These both are being run with an interval of 1 second but the difference is significant with the cpu and memory usage on the instance.
Even though the powershell process itself only takes 5% per check, it seems to trigger background processes to get the information which itself takes more CPU thereby spiking the usage. This is not the case in nixwiz assets.
To test the theory, I had 2 sets of checks run in different subscriptions (windows-sensu and windows-nixwiz), and with both enabled on the instance the CPU spiked up to 50%. As seen in the screenshot, the highest consumer is the anti-malware service (Windows Defender), but it seems to be triggered by the checks themselves. As soon as the Sensu PowerShell asset checks run, the CPU consumption of the anti-malware service also spikes, using up to 50% of the resources. As soon as I disabled the windows-sensu subscription on the instance, the anti-malware process also went down.
Same consumption when only windows-sensu subscription is enabled looks like the following
I am going to test this theory next on Windows 2012 instance but let me know if you want more information while i have access to a windows machine
This may all be associated with PowerShell itself, as performance may depend on the version of PowerShell running the scripts. Most of the useful actions the scripts take really use cmdlets that come with PowerShell itself. This could explain why the github action seems to take
Maybe your performance is related to this?
https://github.com/Powershell/PSReadLine/issues/673 (PowerShell Crashing on Startup · Issue #673 · PowerShell/PSReadLine)
It looks like PowerShell keeps a command history file, that it reads in on startup…and when that file gets very large… PowerShell takes a lot of time to startup. That would explain both the cpu and memory spikes…reading in a huge file into memory and would also explain why i’m not seeing the performance hit on the github action runner.
If you nuke the history file, as per the github instructions… does that help your performance?
This may be helpful to track down which history file you need to prune…
Powershell.exe -NonInteractive -NoProfile -ExecutionPolicy Bypass -NoLogo -Command Get-PSReadlineOption
From that command you will be able to see where the history file is that you need to potentially get rid of. You may want to run this as a Sensu check, to ensure you find the correct file that matches the sensu user.
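To put a number on how big that file is, a sketch like the one below might help. `Get-HistoryFileInfo` is a hypothetical helper name; the only PSReadLine detail it relies on is the `HistorySavePath` property in the `Get-PSReadlineOption` output, and you pass that path in yourself.

```powershell
# Hypothetical helper: reports whether a history file exists, its size in
# bytes and its line count. Nothing here is specific to Sensu.
function Get-HistoryFileInfo {
    param([Parameter(Mandatory = $true)] [string] $Path)
    if (-not (Test-Path -LiteralPath $Path)) {
        return [pscustomobject]@{ Path = $Path; Exists = $false; Bytes = 0; Lines = 0 }
    }
    $file = Get-Item -LiteralPath $Path
    [pscustomobject]@{
        Path   = $Path
        Exists = $true
        Bytes  = $file.Length
        # @(...) forces an array so an empty or single-line file still has a Count.
        Lines  = @(Get-Content -LiteralPath $Path).Count
    }
}
```

For example, `Get-HistoryFileInfo -Path (Get-PSReadlineOption).HistorySavePath`, run as the same user the Sensu agent runs as, should show the file that matters.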
Also what version does PowerShell.exe report as its version?
Powershell.exe -NonInteractive -NoProfile -ExecutionPolicy Bypass -NoLogo -Command $PSVersionTable
For my github runner its reporting:
Name                           Value
----                           -----
PSVersion                      5.1.17763.1490
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.17763.1490
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
If you discover the history file is the problem (and it’s not clear that it will be), you might be able to mitigate by prefixing the check command with additional PowerShell commands that tell the PSReadline module not to save into the history file. You’ll still need to prune that file, but once you do that, you can stop it from accumulating more history.
Something like this construction:
Powershell.exe -NonInteractive -NoProfile -ExecutionPolicy Bypass -NoLogo -Command
"Set-PSReadLineOption -HistorySaveStyle SaveNothing; check-windows-disk.ps1 90 95"
What I don’t know how to do is prevent powershell from reading in its default history file at startup.
I checked the history file and it does not seem to keep anything in there if i sign out. I believe on actual production instances this would not be the case since the service user which runs the calls would never sign out thereby having a large file. But to run your test of nuking the history file, it did help with the performance immediately but as the checks kept running on, it did start to spike again. Also the powershell version on my end was
I guess the command
Powershell.exe -NonInteractive -NoProfile -ExecutionPolicy Bypass -NoLogo -Command
"Set-PSReadLineOption -HistorySaveStyle SaveNothing; check-windows-disk.ps1 90 95"
to not save the history could possibly help in the short term but since we cannot change the way Powershell starts up, we are just pushing the problem for a later day. This command without adding to history can be used if we need to monitor process with the existing sensu-windows-powershell plugin but at some point we may need to move towards other plugins.
Again since powershell reading the history file on every startup is a likely issue that we cannot solve at sensu level, i do feel the non-powershell checks / cross platform checks are definitely more ideal for windows.
hey,
very interesting result. Nuking the file helped, but once the file got recreated, performance dragged again? Did I get that right?
Looking at the PowerShell PSReadLine options, we might also be able to point the history file at NUL, which is the Windows equivalent of /dev/null, to effectively disable the use of the history file.
Now that you have confirmed that nuking the history file helps… I might be able to put together a test case in the github action workflows using a large enough history file. We might be able to disable the file as part of the script actions. How big does the file have to be for you to feel the performance burn?
Longer term… I agree… moving to golang based plugins will be best. There is a golang wmi package that allows WQL queries against the Windows management interface, which is basically what all of these PowerShell scripts are doing, just using baked-in PowerShell modules to abstract the queries into objects PowerShell can work with. If there’s someone out there who enjoys writing WQL queries using the WMI command-line tools, they could definitely help flesh out such a golang effort. Once I have a useful query that works with the WMI command-line tools, rebuilding the golang around it to parse the output shouldn’t be too difficult.
But I expect a lot of people are going to want or need to write in-house PowerShell as a quick go-to for in-house check logic, and having a pattern to offer them will help.
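As a rough illustration of the kind of WQL query involved, here is a sketch in PowerShell. It is not how the existing plugin scripts are implemented; `Get-CheckStatus` is a hypothetical helper, the 90/98 thresholds are just the ones from the check commands earlier in the thread, and it assumes `Win32_Processor.LoadPercentage` is a good enough stand-in for CPU load.

```powershell
# Hypothetical helper: map a value onto the usual check exit codes
# (0 = OK, 1 = warning, 2 = critical).
function Get-CheckStatus {
    param([double] $Value, [double] $Warning, [double] $Critical)
    if ($Value -ge $Critical) { return 2 }
    if ($Value -ge $Warning)  { return 1 }
    return 0
}

# Only query WMI where Get-CimInstance is available (i.e. on Windows).
if (Get-Command Get-CimInstance -ErrorAction SilentlyContinue) {
    # Plain WQL: one row per processor, each with a LoadPercentage value.
    $cpus = Get-CimInstance -Query 'SELECT LoadPercentage FROM Win32_Processor'
    $load = ($cpus | Measure-Object -Property LoadPercentage -Average).Average
    $status = Get-CheckStatus -Value $load -Warning 90 -Critical 98
    "CPU load: $load% (status $status)"
}
```

The same WQL string is what a golang plugin would hand to a wmi package, so getting the query right here first would carry over.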
That is correct. A few hours of per second checks with history enabled started to slow down the performance. I would say there were about 10000 lines in the history when things slowed down.
I do agree people would want to run their own powershell scripts to check logic and i believe this thread and your solution about purging powershell history is a potential temporary fix that people having issues can be made aware of.
great…
10000 lines that’s small enough to simulate in the github actions without a problem for sure.
So once I can create that situation in the CI action, I can then test to see if we can disable that history as part of default script action.
I should be able to test something out by end of the week. Thanks for hanging in there. Now that I know the history file is a reproducible performance burn, hopefully we can make the scripts smart enough to disable it. fingers crossed.
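For anyone who wants to try the same simulation locally, a sketch like this could generate a history file of that size. `New-FakeHistoryFile` is a hypothetical helper, the repeated entry is just an assumed stand-in for real history content, and it deliberately writes to a temp path rather than the real PSReadLine history location.

```powershell
# Hypothetical helper: writes a file with the requested number of
# identical history-style lines.
function New-FakeHistoryFile {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $Lines = 10000
    )
    # Assumed entry; real history would hold whatever commands were run.
    $entry = 'check-windows-disk.ps1 90 95'
    1..$Lines | ForEach-Object { $entry } | Set-Content -LiteralPath $Path
}

# Write to a temp file so the real history file is left untouched.
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'fake_history.txt'
New-FakeHistoryFile -Path $target
"Wrote $(@(Get-Content -LiteralPath $target).Count) lines to $target"
```

To actually test startup cost, the generated file would have to be copied over the path reported by `Get-PSReadlineOption` for the user running the checks, so back up the original first.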
@swapzero
Hey, following up on this,
I still can’t reproduce this in the github action environment. What’s interesting is that in the github action environment the offending PSReadline module doesn’t seem to be loaded into the PowerShell session when inspected with the Get-Module cmdlet.
Would you be willing to run some ad hoc diagnostic checks on your Windows system? Not on an interval… just executing them once manually and reporting back the output from the Sensu event?
Hey @jspaleta. Sorry, I was out and missed this. Let me know what you need and I’ll try to have the diagnostic run.
thanks,
I just want to see what powershell modules are actually loaded in the powershell session.
The github action based performance environment doesn’t seem to have the readline module loaded… until i ask powershell a question concerning the readline module configuration…then it loads it.
So if the readline module isn’t loaded… it’s not clear if I’ll ever see a performance hit due to its log writing settings.
It’s just a little bit of a head scratcher because I can’t reproduce in my naive way. So before I go further I want to see if your powershell session has the module loaded. I’ll get back with specific check definitions for you soon. thanks.
Here are the results of the get-module command.
> get-module
ModuleType Version  Name                          ExportedCommands
---------- -------  ----                          ----------------
Manifest   3.1.0.0  Microsoft.PowerShell.Utility  {Add-Member, Add-Type, Clear-Variable, Compare-Object...}
Script     1.2      PSReadline                    {Get-PSReadlineKeyHandler, Get-PSReadlineOption, Remove-PSReadlineKeyHandler, Set-PSReadlineKeyHandler...}
and doing -All
> get-module -All
ModuleType Version  Name                                 ExportedCommands
---------- -------  ----                                 ----------------
Binary     3.0.0.0  Microsoft.PowerShell.Commands.Ut...  {New-Object, Measure-Object, Select-Object, Sort-Object...}
Binary     3.0.0.0  Microsoft.PowerShell.PSReadLine      {Get-PSReadlineOption, Set-PSReadlineOption, Set-PSReadlineKeyHandler, Get-PSReadlineKeyHandler...}
Script     0.0      Microsoft.PowerShell.Utility         {ConvertFrom-SddlString, Format-Hex, Get-FileHash, Import-PowerShellDataFile...}
Manifest   3.1.0.0  Microsoft.PowerShell.Utility         {Add-Member, Add-Type, Clear-Variable, Compare-Object...}
Script     1.2      PSReadline                           {Get-PSReadlineKeyHandler, Get-PSReadlineOption, Remove-PSReadlineKeyHandler, Set-PSReadlineKeyHandler...}
Yeah… what I need is the output of that run as a PowerShell-based Sensu check.
Is PSReadline loaded for you in the Sensu agent’s session?
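A one-off diagnostic along these lines might do for that. It is only a sketch: `Test-ModuleLoaded` is a hypothetical helper name, and it assumes the module shows up under the name PSReadline as in the output above.

```powershell
# Hypothetical helper: true if a module with this name is loaded in the
# current session (Get-Module without -ListAvailable only lists loaded ones).
function Test-ModuleLoaded {
    param([Parameter(Mandatory = $true)] [string] $Name)
    [bool](Get-Module -Name $Name)
}

# Print what the session has loaded, then the specific answer.
$names = @(Get-Module | Select-Object -ExpandProperty Name) -join ', '
"Loaded modules: $names"
"PSReadline loaded: $(Test-ModuleLoaded -Name 'PSReadline')"
```

Saved as a script and run once as a check command through powershell.exe, the output would land in the Sensu event, which shows what the agent's own session sees rather than an interactive console.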
got it. give me 5mins to get back to you