Sensu testing -- second machine stopped sending keepalives!

Topic says it.

I have sensu running on two hosts – one is running the full stack, the other is just a client.

The Sensu host has no issues: keepalives are appearing and all checks/metrics collectors are running.

The second host is running all checks/metrics, but it has flat out ceased sending keepalives. It was sending keepalives for a ~12 hour period and then gave up. I’ve completely bounced sensu on both hosts, including rabbitmq and redis (even did flushdb/flushall for redis-cli).

What else can I do?

I would focus on the client. Is there anything in the sensu-client.log
around that time that indicates something going wrong? Can you post
those logs?

···

On Fri, Jan 16, 2015 at 9:01 AM, <awpti@awpti.org> wrote:

Topic says it.

I have sensu running on two hosts -- one is running the full stack, other is
just a client.

The Sensu host has no issues, keepalives are appearing and all
checks/metrics collectors are running

The second host is running all checks/metrics, but it has flat out ceased
sending keepalives. It was sending keepalives for a ~12 hour period and then
gave up. I've completely bounced sensu on both hosts, including rabbitmq and
redis (even did flushdb/flushall for redis-cli).

What else can I do?

The logs have not a single line relating to keepalives. I’ve since added yet another host and that one is also not delivering keepalives. I’ve searched the sensu-client.log files for “keep”, “keepalive”, etc. There are no errors. All messages appear to be “info” level.

The clients work fine – they do their checks and deliver their metrics. They simply are not sending keepalives.

Log file is here:

http://pastie.org/9841304

Repeat ad nauseam.

···

On Saturday, January 17, 2015 at 11:46:40 AM UTC-7, Kyle Anderson wrote:

I would focus on the client. Is there anything in the sensu-client.log
around that time that indicates something going wrong? Can you post
those logs?

This looks suspicious to me:

{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn","message":"config file applied changes","file":"/etc/sensu/conf.d/auth20.json","changes":{"client":[null,{"name":"local","address":"10.100.29.40","subscriptions":["vitals_all","metrics_base"]}]}}
{"timestamp":"2015-01-19T11:20:19.

client: null?

I think the configuration may be malformed which is making it send
keepalives in the name of "null" ? Or "local"? Looks odd.

Are you using configuration management to configure these or by hand?
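
For anyone reading that warn line later: one possible reading, which the thread itself does not confirm, is that each entry under "changes" is a [before, after] pair. Under that assumption the null is simply the value "client" had before the file was loaded, not a client name. A minimal Python sketch that pulls the pair apart (the helper name describe_changes is made up for illustration):

```python
import json

# The warn line quoted in the thread, rejoined into a single JSON document.
LOG_LINE = (
    '{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn",'
    '"message":"config file applied changes",'
    '"file":"/etc/sensu/conf.d/auth20.json",'
    '"changes":{"client":[null,{"name":"local","address":"10.100.29.40",'
    '"subscriptions":["vitals_all","metrics_base"]}]}}'
)


def describe_changes(line):
    """Return (key, before, after) tuples for one log line.

    ASSUMPTION: every value under "changes" is a two-item [before, after] list.
    """
    entry = json.loads(line)
    return [(key, pair[0], pair[1]) for key, pair in entry.get("changes", {}).items()]


if __name__ == "__main__":
    for key, before, after in describe_changes(LOG_LINE):
        print(f"{key}: before={before!r}")
        print(f"{key}: after={after!r}")
```

If that reading is right, the line only says that auth20.json introduced a client definition where none had been loaded yet.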

···

On Mon, Jan 19, 2015 at 10:41 AM, <awpti@awpti.org> wrote:

The logs have not a single line relating to keepalives. I've since added yet
another host and that one is also not delivering keepalives. I've dug the
sensu-client.log files for "keep", "keepalive", etc. There are no errors.
All messages appear to be "info" state.

The clients work fine -- they do their checks and deliver their metrics.
They simply are not sending keepalives.

Log file is here:

http://pastie.org/9841304

Repeat ad nauseam.

By hand – I assume config management is an enterprise feature?

The configs look identical between the host that does work and the ones that don’t (other than the number of items in the subscriptions array and optional config items for mysql / rabbitmq creds).

Here they are:

http://pastie.org/9841473

···

On Monday, January 19, 2015 at 11:45:47 AM UTC-7, Kyle Anderson wrote:

This looks suspicious to me:

{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn","message":"config file applied changes","file":"/etc/sensu/conf.d/auth20.json","changes":{"client":[null,{"name":"local","address":"10.100.29.40","subscriptions":["vitals_all","metrics_base"]}]}}
{"timestamp":"2015-01-19T11:20:19.

client: null?

I think the configuration may be malformed which is making it send
keepalives in the name of "null" ? Or "local"? Looks odd.

Are you using configuration management to configure these or by hand?

Config management is not an enterprise feature. I highly recommend
using Chef or Puppet (or similar) to put down valid configs and reduce
human error.

I bet that there is a *different* JSON file somewhere in your tree
overriding the "client" configuration.

Is that paste of the auth20.json file? Are you able to paste your
entire config? (find /etc/sensu/conf.d/ -type f | xargs cat)
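
As an alternative to reading the concatenated output by eye, here is a rough Python sketch that reports which files define each top-level key, so a second "client" definition stands out. It assumes every config file is a JSON object ending in .json. It scans /etc/sensu rather than only conf.d so that config.json is included, and the function name is hypothetical:

```python
import json
from collections import defaultdict
from pathlib import Path


def top_level_keys_by_file(config_dir):
    """Map each top-level JSON key to the list of files that define it."""
    root = Path(config_dir)
    owners = defaultdict(list)
    if not root.is_dir():
        return {}
    for path in sorted(root.rglob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError) as exc:
            # A file that cannot be read or parsed is worth reporting by itself.
            owners["<unparseable>"].append(f"{path}: {exc}")
            continue
        if isinstance(data, dict):
            for key in data:
                owners[key].append(str(path))
    return dict(owners)


if __name__ == "__main__":
    for key, files in sorted(top_level_keys_by_file("/etc/sensu").items()):
        note = "  <-- defined in more than one file" if len(files) > 1 else ""
        print(f"{key}: {files}{note}")
```

A key listed under more than one file is a candidate for the kind of override suspected here. Files that fail to parse are collected under the placeholder key "<unparseable>".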

···

On Mon, Jan 19, 2015 at 11:21 AM, <awpti@awpti.org> wrote:

By hand -- I assume config management is an enterprise feature?

The configs look identical between the host that does work and the ones that
don't (other than the number of items in the subscriptions array and
optional config items for mysql / rabbitmq creds).

Here they are:

http://pastie.org/9841473

The only files that exist on the client install are:

/etc/sensu/config.json

/etc/sensu/conf.d/[hostname].json

/etc/sensu/plugins/(various ruby scripts)

Yes, that was a paste of the auth20.json file as compared to the sensu.json file.

···

On Monday, January 19, 2015 at 12:57:54 PM UTC-7, Kyle Anderson wrote:

Config management is not an enterprise feature. I highly recommend
using Chef or Puppet (or similar) to put down valid configs and reduce
human error.

I bet that there is a different JSON file somewhere in your tree
overriding the "client" configuration.

Is that paste of the auth20.json file? Are you able to paste your
entire config? (find /etc/sensu/conf.d/ -type f | xargs cat)

Small correction: graphite01 config, not the auth20 config.

···

On Monday, January 19, 2015 at 1:08:03 PM UTC-7, aw...@awpti.org wrote:

The only files that exist on the client install are:

/etc/sensu/config.json

/etc/sensu/conf.d/[hostname].json

/etc/sensu/plugins/(various ruby scripts)

Yes, that was a paste of the auth20.json file as compared to the sensu.json file.

Can you paste the contents of config.json on that client?

···

On Mon, Jan 19, 2015 at 12:08 PM, <awpti@awpti.org> wrote:

Small correction: graphite01 config, not the auth20 config.

Sorry, auth20.json. The config.json is correct, but auth20.json looks
like it has *another* "client" entry.

Sensu does a big JSON merge, so another client entry would produce this
kind of error.
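
To make the merge concern concrete, here is a small Python sketch of a recursive dictionary merge in which the later file wins on conflicting keys. It is an illustration of the general idea only, not Sensu's actual merge code. Both fragments are hypothetical, including the rabbitmq host value, and are only loosely modelled on the files named in this thread:

```python
import copy


def deep_merge(base, override):
    """Recursively merge two dicts; on a conflict the override value wins."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical fragments: the intended client definition and a stray second one.
config_json = {"rabbitmq": {"host": "10.0.0.1"}, "client": {"name": "some.host.name"}}
stray_json = {"client": {"name": "local", "address": "10.100.29.40"}}

merged = deep_merge(config_json, stray_json)
print(merged["client"])
```

With a merge like this, a second "client" block loaded later would quietly replace the name from the first one, which is the failure mode being suspected at this point in the thread.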

···

On Mon, Jan 19, 2015 at 12:16 PM, Kyle Anderson <kyle@xkyle.com> wrote:

Can you paste the contents of config.json on that client?

Ah, if it does a big merge, then perhaps I’m misunderstanding the client config. I was under the impression that each machine (sensu-client) needs its own .json file with a client value.

All 3 servers have a .json file with a { "client": { "name": "some.host.name" … } }

Is this incorrect? The documentation implies it, though it’s possible I’ve overlooked something.

http://pastie.org/9841833

···

On Monday, January 19, 2015 at 1:29:25 PM UTC-7, Kyle Anderson wrote:

Sorry, auth20.json. The config.json is correct, but auth20.json looks
like it has another "client" entry.

Sensu does a big JSON merge, so another client entry would produce this
kind of error.

Okay, I’ve dug through the docs again and it’s quite clear that the config I’m using is absolutely correct. All I can determine from the error is that sensu is failing to properly read / parse the json file. It has replaced the “name” value with “local” on each machine rather than using the value I supplied.

I’m flat out not even sure what else to do at this point – from my perspective and from the docs, my configuration is correct.

I'm also at a loss. Do you feel comfortable tar'ing /etc/sensu/ to analyze?

Also can you paste
grep -r -C1 'client' /etc/sensu/
and
grep -r -C1 'local' /etc/sensu ?

"local" or "client" *must* be set in more than one place *somewhere*.

···

On Wed, Jan 21, 2015 at 9:54 AM, <awpti@awpti.org> wrote:

Okay, I've dug through the docs again and it's quite clear that the config
I'm using is absolutely correct. All I can determine from the error is that
sensu is failing to properly read / parse the json file. It has replaced the
"name" value with "local" on each machine rather than using the value I
supplied.

I'm flat out not even sure what else to do at this point -- from my
perspective and from the docs, my configuration is correct.

Have you verified that ntp is running and that the time is correct on the servers? I get bit by this occasionally.

I agree that ntp can be an issue sometimes, but in this case, this is
the core issue:

{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn","message":"config file applied changes","file":"/etc/sensu/conf.d/auth20.json","changes":{"client":[null,{"name":"local","address":"10.100.29.40","subscriptions":["vitals_all","metrics_base"]}]}}
{"timestamp":"2015-01-19T11:20:19.

Whatever is causing it to misread the JSON is the core issue here.

···

On Thu, Jan 22, 2015 at 5:48 AM, James Taylor - OP <jtaylor@onpointlearning.com> wrote:

Have you verified that ntp is running and that the time is correct on the
servers? I get bit by this occasionally.

So, I banged away at this for a few hours today and it turns out… it was the time!

ntpdate was keeping the clocks synced… for a little bit. Turns out two of the dev host nodes (ESXi) had desynced clocks because their own time sync feature was stopped. It just so happened that ntpdate would update between two checks and then the clocks would sync to the bad times right afterwards.

It seems that null value coming from sensu-client.log is irrelevant.

Sometimes, I hate VMware. Sometimes. Most of the time I love the hell out of it. I’m moving us away from syncing off the hosts to syncing from our core router’s ntp service.
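
The thread does not spell out exactly how the skew kept keepalives from registering. One plausible mechanism, sketched below in Python, is that the server judges a keepalive by the timestamp the client stamped on it, so a client whose clock runs behind looks stale even while it is sending. The 120 and 180 second thresholds and the function name are assumptions for illustration only:

```python
WARNING_AFTER = 120   # seconds; assumed threshold for this sketch
CRITICAL_AFTER = 180  # seconds; assumed threshold for this sketch


def keepalive_status(keepalive_timestamp, server_now):
    """Classify a keepalive by its apparent age as seen from the server's clock."""
    age = server_now - keepalive_timestamp
    if age >= CRITICAL_AFTER:
        return "critical"
    if age >= WARNING_AFTER:
        return "warning"
    return "ok"


if __name__ == "__main__":
    server_now = 1_421_950_000  # arbitrary epoch second for the demo
    for skew in (0, 150, 600):  # client clock running this many seconds behind
        stamped = server_now - skew  # sent "just now", but stamped by the slow clock
        print(f"client {skew}s behind -> {keepalive_status(stamped, server_now)}")
```

Under these assumptions a keepalive sent a moment ago from a host whose clock is several minutes behind would already look expired on arrival. That would fit hosts that looked fine right after an ntpdate run and then drifted again.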

···

On Thursday, January 22, 2015 at 9:43:10 AM UTC-7, Kyle Anderson wrote:

I agree that ntp can be an issue sometimes, but in this case, this is
the core issue:

{"timestamp":"2015-01-19T11:20:19.816294-0700","level":"warn","message":"config file applied changes","file":"/etc/sensu/conf.d/auth20.json","changes":{"client":[null,{"name":"local","address":"10.100.29.40","subscriptions":["vitals_all","metrics_base"]}]}}
{"timestamp":"2015-01-19T11:20:19.

Whatever is causing it to misread the JSON is the core issue here.

Well gosh. I'm sorry I sent everyone down the wrong trail. :(

···

On Thu, Jan 22, 2015 at 11:51 AM, <awpti@awpti.org> wrote:

So, I banged away at this for a few hours today and it turns out.. it was
the time!

ntpdate was keeping the clocks synced.. for a little bit. Turns out two of
the dev host nodes (ESXi) had desynced clocks because their own time sync
feature was stopped. It just so happened that ntpdate would update between
two checks and then the clocks would sync to the bad times right afterwards.

It seems that null value coming from sensu-client.log is irrelevant.

Sometimes, I hate VMware. Sometimes. Most of the time I love the hell out of
it. I'm moving us away from syncing off the hosts to syncing from our core
router's ntp service.
