I have Sensu running on two hosts: one runs the full stack, the other is just a client.
The Sensu host has no issues; keepalives are appearing and all checks/metrics collectors are running.
The second host is running all checks/metrics, but it has flat out ceased sending keepalives. It sent keepalives for a ~12 hour period and then gave up. I’ve completely bounced Sensu on both hosts, including RabbitMQ and Redis (I even ran flushdb/flushall from redis-cli).
What else can I do?
I would focus on the client. Is there anything in the sensu-client.log around that time that indicates something going wrong? Can you post those logs?
The logs have not a single line relating to keepalives. I’ve since added yet another host, and that one is also not delivering keepalives. I’ve searched the sensu-client.log files for “keep”, “keepalive”, etc. There are no errors. All messages appear to be at the “info” level.
The clients work fine: they run their checks and deliver their metrics. They simply are not sending keepalives.
Log file is here: http://pastie.org/9841304
Repeat ad nauseam.
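This looks suspicious to me:
{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn","message":"config file applied changes","file":"/etc/sensu/conf.d/auth20.json","changes":{"client":[null,{"name":"local","address":"10.100.29.40","subscriptions":["vitals_all","metrics_base"]}]}}
client: null?
I think the configuration may be malformed, which is making it send keepalives in the name of "null"? Or "local"? Looks odd.
Are you using configuration management to configure these, or doing it by hand?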
By hand. I assume config management is an enterprise feature?
The configs look identical between the host that works and the ones that don’t (other than the number of items in the subscriptions array and optional config items for mysql / rabbitmq creds).
Here they are: http://pastie.org/9841473
Config management is not an enterprise feature. I highly recommend using Chef or Puppet (or similar) to put down valid configs and reduce human error.
I bet that there is a *different* JSON file somewhere in your tree overriding the "client" configuration.
Is that a paste of the auth20.json file? Are you able to paste your entire config? (find /etc/sensu/conf.d/ -type f | xargs cat)
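For anyone following along later, the "is there a second file defining client?" question can also be checked programmatically. The following is only a rough sketch in Python; the function name `files_defining` is made up for illustration, and it assumes the config files are plain JSON objects with scopes such as "client" at the top level.

```python
import json
import os


def files_defining(conf_dir, key="client"):
    """Return {path: value} for each .json file under conf_dir that has
    `key` at the top level of its JSON object (hypothetical helper)."""
    found = {}
    for root, _dirs, names in os.walk(conf_dir):
        for name in sorted(names):
            if not name.endswith(".json"):
                continue
            path = os.path.join(root, name)
            try:
                with open(path) as handle:
                    data = json.load(handle)
            except (OSError, ValueError) as err:
                # An unreadable or invalid file is worth reporting as well.
                found[path] = "unreadable: %s" % err
                continue
            if isinstance(data, dict) and key in data:
                found[path] = data[key]
    return found
```

Calling `files_defining("/etc/sensu")` and getting back more than one path would point to two files competing for the same scope; a single path would suggest the client definition is not being overridden.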
Yes, that was a paste of the auth20.json file as compared to the
sensu.json file.
Ah, if it does a big merge, then perhaps I’m misunderstanding the client config. I was under the impression that each machine (sensu-client) needs its own .json file with a client value.
All 3 servers have a .json file containing { "client": { "name": "some.host.name" … } }
Is this incorrect? The documentation implies it, though it’s possible I’ve overlooked something.
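To make the "big merge" idea concrete, here is a rough sketch in Python (not Sensu's own Ruby code). It rests on an assumption that nothing in this thread confirms: that the "changes" field in the "config file applied changes" log line lists each changed key as a [before, after] pair. If that holds, "client":[null,{...}] would only mean that no client scope existed before auth20.json was read. The function names, the `base` settings, and the 10.0.0.1 address are hypothetical; the client definition is copied from the log line quoted earlier in the thread.

```python
import copy
import json


def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`,
    recursing into nested dicts instead of replacing them wholesale."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def top_level_changes(before, after):
    """Map each changed top-level key to a [before, after] pair.
    A key that was absent before shows up with None (null in JSON)."""
    keys = sorted(set(before) | set(after))
    return {k: [before.get(k), after.get(k)]
            for k in keys if before.get(k) != after.get(k)}


# Hypothetical settings loaded so far: no "client" scope yet.
base = {"rabbitmq": {"host": "10.0.0.1"}}
# Client definition as it appears in the quoted log line.
auth20 = {"client": {"name": "local", "address": "10.100.29.40",
                     "subscriptions": ["vitals_all", "metrics_base"]}}

merged = deep_merge(base, auth20)
changes = top_level_changes(base, merged)
print(json.dumps(changes))
```

Read this way, one .json file per machine with its own "client" block is consistent with a merge: the file simply contributes the client scope to the combined settings.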
Okay, I’ve dug through the docs again and it’s quite clear that the config I’m using is absolutely correct. All I can determine from the error is that sensu is failing to properly read / parse the json file. It has replaced the “name” value with “local” on each machine rather than using the value I supplied.
I’m flat out not even sure what else to do at this point – from my perspective and from the docs, my configuration is correct.
I'm also at a loss. Do you feel comfortable tar'ing up /etc/sensu/ for analysis?
Also, can you paste the output of
grep -r -C1 'client' /etc/sensu/
and
grep -r -C1 'local' /etc/sensu/
"local" or "client" *must* be set in more than one place *somewhere*.
Have you verified that ntp is running and that the time is correct on the servers? I get bit by this occasionally.
I agree that ntp can be an issue sometimes, but in this case, whatever is causing it to misread the JSON is the core issue here.
So, I banged away at this for a few hours today and it turns out… it was the time!
ntpdate was keeping the clocks synced… for a little bit. Turns out two of the dev host nodes (ESXi) had desynced clocks because their own time sync feature was stopped. It just so happened that ntpdate would update between two checks and then the clocks would sync to the bad times right afterwards.
It seems that the null value in sensu-client.log is irrelevant.
Sometimes, I hate VMware. Sometimes. Most of the time I love the hell out of it. I’m moving us away from syncing off the hosts to syncing from our core router’s ntp service.
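A rough sketch of why a drifting clock can look like missing keepalives. It assumes the server judges a keepalive by subtracting the client-supplied timestamp from its own clock; the 120 and 180 second thresholds, the function name, and the sample timestamps are illustrative assumptions, not values taken from this thread.

```python
WARNING_SECONDS = 120   # assumed warning threshold
CRITICAL_SECONDS = 180  # assumed critical threshold


def keepalive_status(client_timestamp, server_now,
                     warning=WARNING_SECONDS, critical=CRITICAL_SECONDS):
    """Classify a keepalive by how old its timestamp looks from the
    server's point of view. Returns (status, apparent_age_seconds)."""
    age = server_now - client_timestamp
    if age >= critical:
        return "critical", age
    if age >= warning:
        return "warning", age
    return "ok", age


server_now = 1421950000  # arbitrary epoch seconds on the server

# Keepalive sent 5 seconds ago by a client whose clock is correct.
fresh = keepalive_status(server_now - 5, server_now)

# Same keepalive, but the client's clock runs 300 seconds behind,
# so the timestamp it stamps on the message is 300 seconds too old.
skewed = keepalive_status(server_now - 5 - 300, server_now)

print(fresh, skewed)
```

Under these assumptions the skewed client is flagged even though it is sending keepalives on schedule, which would fit the symptoms here: checks and metrics keep flowing while the keepalives appear to have stopped.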
Well gosh. I'm sorry I sent everyone down the wrong trail.