Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print "no action" run_cycle message at most every minute #2161

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mbanck
Copy link
Contributor

@mbanck mbanck commented Jan 7, 2022

I was getting annoyed at the overly chatty Patroni log if there are now problems and loop_wait is at the default 10 seconds.

So this prints the "no action.[...]" line only once every minute.

@CyberDem0n
Copy link
Collaborator

We can accept such a change only if it is possible to configure the behavior.

@mbanck
Copy link
Contributor Author

mbanck commented Jan 17, 2022

We can accept such a change only if it is possible to configure the behavior.

Hrm, in that case, maybe having a config flag print_heartbeat: yes/no would be better? I think anything else would be overkill.

@CyberDem0n
Copy link
Collaborator

But boolean can't describe the implemented behavior...

@CyberDem0n
Copy link
Collaborator

What else I don't like is changed severity of https://github.com/zalando/patroni/pull/2161/files#diff-b6a3a0ef64783a6d32feb908d74a3a009e9f25008b8ca5313c9c9747e3b74a63R174

Currently it is visible only when something is different from normal

patroni/patroni/log.py

Lines 178 to 185 in 3e1076a

if self._root_logger.level == logging.INFO:
if record.msg.startswith('Lock owner: '):
prev_record, record = record, None
else:
if prev_record and prev_record.thread == record.thread:
if not (record.msg.startswith('no action. ') or record.msg.startswith('PAUSE: no action')):
self.log_handler.handle(prev_record)
prev_record = None
and it helps with analysis if something went wrong.

@mbanck
Copy link
Contributor Author

mbanck commented Jan 17, 2022

But boolean can't describe the implemented behavior...

Yes, I would change the PR to either print the "no action" heartbeat (default) or not (if that new config option is set). If you think having a threshold parameter (and my original implementation) is better, than ok, I will look into it.

@mbanck
Copy link
Contributor Author

mbanck commented Jan 17, 2022

What else I don't like is changed severity of https://github.com/zalando/patroni/pull/2161/files#diff-b6a3a0ef64783a6d32feb908d74a3a009e9f25008b8ca5313c9c9747e3b74a63R174

Currently it is visible only when something is different from normal

patroni/patroni/log.py

Lines 178 to 185 in 3e1076a

if self._root_logger.level == logging.INFO:
if record.msg.startswith('Lock owner: '):
prev_record, record = record, None
else:
if prev_record and prev_record.thread == record.thread:
if not (record.msg.startswith('no action. ') or record.msg.startswith('PAUSE: no action')):
self.log_handler.handle(prev_record)
prev_record = None

and it helps with analysis if something went wrong.

You're right, this change is bogus - I saw it printed on the standby and assumed it would be printed every time, thanks for pointing that out.

@CyberDem0n
Copy link
Collaborator

The logic in the log.py knows that messages are normally coming in pairs, Lock owner ... + 'no action ...' and skips the first one.
Since you skip many no action messages it probably started showing Lock owner ....

@mbanck
Copy link
Contributor Author

mbanck commented Feb 10, 2022

But boolean can't describe the implemented behavior...

Yes, I would change the PR to either print the "no action" heartbeat (default) or not (if that new config option is set). If you think having a threshold parameter (and my original implementation) is better, than ok, I will look into it.

Can you clarify what you think is best here?

@CyberDem0n
Copy link
Collaborator

The thing is that heart-beat logs are super-useful when something goes wrong. When everything is normal we get logs every loop_wait seconds +- a few ms. If you see that the log time gets off by a few seconds this is already an indicator of the problem.

Of course, you won't be able to understand what exactly was wrong, but knowing the time when the problem started always helps to investigate further by looking into postgres logs or logs of the DCS.

Right now Lock owner INFO logs are filtered out in the

patroni/patroni/log.py

Lines 178 to 185 in 3e1076a

if self._root_logger.level == logging.INFO:
if record.msg.startswith('Lock owner: '):
prev_record, record = record, None
else:
if prev_record and prev_record.thread == record.thread:
if not (record.msg.startswith('no action. ') or record.msg.startswith('PAUSE: no action')):
self.log_handler.handle(prev_record)
prev_record = None
.
This code "detects" that everything is normal by analyzing messages and if the Lock owner: is followed by the no action., the first one is "removed".

If you do filtering of the no action messages outside of log.py, you break the pattern Lock owner-no action.-Lock owner-no action. Hence, if we want to have a new filter, it should happen in the same place (patroni/log.py#L178-L185)

But, we can do better than that. Instead of simply discarding 9 out of 10 no action. messages we can accumulate them into an array.
Something like:

if length(array) < 10:
    array.append(msg)
else:
    self.log_handler.handle(msg)
    array.clear()

In case the log anomality is detected we will first output all accumulated messages from the array. This way we can partially retain the usefulness of verbose logging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants