Fluentd logs showing SIGKILL issue while tailing a log file that is updated almost every second #4304
-
It doesn't reproduce in my environment.
Config:

```
<source>
  @type tail
  tag test
  path /test/fluentd/input/test.log*
  pos_file /test/fluentd/pos/pos
  refresh_interval 1s
  <parse>
    @type none
  </parse>
</source>
<match test.**>
  @type stdout
</match>
```

Add a dummy log every 0.3 s:

```
$ while true; do echo "foo" >> test.log; sleep 0.3; done
```

Then, SIGKILL doesn't occur.
-
Thanks for your reply.
When I check the memory consumption of the fluentd worker process during this time, its memory usage is really not high, so I'm unsure how it can lead to the SIGKILL issue.
-
Thanks for your report.
Can you also check the memory consumption of the entire system (or the container)?
-
I monitored the `kubectl top pod` output until the SIGKILL happened, and the memory usage is very normal (~600 MB) while the memory limit given to the pod is 4 GB. One query: do we need to provide any extra configuration/settings for Fluentd in a situation where the service keeps the log file open the entire time and continues to write to the open file descriptor every few milliseconds?
-
Hmm, it may be something wrong with the disk, CPU, or some other resource, not the memory...
If the cause is the load of tracing such a fast file update, setting `enable_stat_watcher` to `false` (so in_tail relies on its watch timer instead of inotify) may improve it:

```
<source>
  @type tail
  @log_level debug
  path /data0/podman/storage/overlay-containers/*/userdata/ctr.log
  pos_file /var/log/td-agent/podman.pos
  refresh_interval 1s
  tag kubernetes.podman.*
  enable_stat_watcher false
  <parse>
    @type cri
  </parse>
</source>
```
-
On the system, when I run the `top` Linux command, I see a few processes along with Ruby consuming more than 100% CPU (~150% to ~200% CPU usage). Could that be a cause?
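To narrow down which process or thread is actually busy, a couple of standard tools can help (a sketch, not from the thread; `<pid>` stands for the fluentd worker PID, and `pidstat` assumes the sysstat package is installed):

```
# Per-thread CPU usage of the worker process
$ top -H -p <pid>

# Sample per-thread CPU once per second
$ pidstat -t -p <pid> 1
```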
-
I think we should consider the possibility that it is the cause.
Does this mean that once the CPU shortage was resolved, this issue no longer occurred?
-
Thanks for the quick response. The `top` command was run on the node where Fluentd is running, not inside the Fluentd container, and that shows more than 100% CPU utilization. Can we consider this resource utilization? Do we have any flag or configuration that can be given to Fluentd to show more detailed logs about which exact event is causing the SIGKILL issue?
-
This will reduce the load of Fluentd, right? The system may SIGKILL highly loaded processes when resources are not enough. I'm not familiar with the SIGKILL mechanism, but I think you should resolve the resource issue first.
Sorry, I'm not familiar with k8s.
I think not; it is not a Fluentd issue, since the SIGKILL is sent from outside Fluentd.
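One common sender of SIGKILL is the kernel OOM killer, and it usually leaves a trace in the kernel log. A minimal check on the node (not mentioned in the thread; exact message formats vary by kernel) could be:

```
# Look for OOM-killer activity around the time of the SIGKILL
$ dmesg -T | grep -i -E "out of memory|oom|killed process"

# On systemd-based nodes the kernel journal works too
$ journalctl -k | grep -i oom
```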
-
When I look at the Fluentd debug-level logs, I see the SIGKILL happening just after tailing the log file from the source where it's getting updated every few milliseconds. At this time the CPU and memory given to the Fluentd pod were quite high (CPU: 7 cores, memory: 10 GB).
-
I moved this to Discussion Q&A because, at this point, we cannot judge whether this is a Fluentd bug or not (see the contributing guidelines).
-
Note: additional info #4306
-
#3614: as I see here, the pos file having two entries may be causing Fluentd not to tail the latest log file after log rotation happens.
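For illustration only (the container ID, offsets, and inode values below are made up): an in_tail pos file holds one tab-separated entry of path, byte offset, and inode per watched file, so a stale duplicate entry for a rotated path would look roughly like this:

```
/data0/podman/storage/overlay-containers/abc123/userdata/ctr.log	00000000004f1a2b	0000000000a1b2c3
/data0/podman/storage/overlay-containers/abc123/userdata/ctr.log	0000000000000000	0000000000a1b2c4
```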
-
hi @daipom, I see the SIGKILL in a pattern: after every log rotation (every 1 hr) of the source ctr.log file, within a 10-minute interval we always see the SIGKILL signal in the Fluentd pod logs. Below are the trace logs; every time we see SIGKILL, this is the log pattern in Fluentd. When we remove this source from Fluentd, I don't see the SIGKILL issue.
-
OK, I'll try with `refresh_interval 60s` and update.
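For reference, a sketch of that change applied to the source above (other parameters unchanged; `refresh_interval` only controls how often the wildcard path is rescanned for new files):

```
<source>
  @type tail
  path /data0/podman/storage/overlay-containers/*/userdata/ctr.log
  pos_file /var/log/td-agent/podman.pos
  refresh_interval 60s   # was 1s; rescan the glob once a minute instead
  enable_stat_watcher false
  tag kubernetes.podman.*
  <parse>
    @type cri
  </parse>
</source>
```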
-
hi @daipom
Fluentd keeps tailing the ctr.log file, and no issues are seen:

```
{"log":{"message":"following tail of /data0/podman/storage/overlay-containers/2f0b40af4be32c9e4ef3983e370fa5d2996f93442f7316c44dbc66999983e93a/userdata/ctr.log"},"extension":{"worker_id":0},"type":"log","level":"info","timezone":"xx","system":"xx","systemid":"xx","host":"xx","time":"2023-09-25T11:13:15+0300"}
```

When, for the first time, the ctr.log file rotation happens and the tail starts on the new ctr.log file after rotation, the next log shows the SIGKILL.
-
hi @daipom, when I stop the log rotation mechanism, I never see the SIGKILL issue and Fluentd runs successfully.
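Since rotation seems implicated, in_tail's rotation-related parameters may be worth trying; this is a hedged sketch, not a confirmed fix (`follow_inodes` is generally recommended with wildcard paths to avoid duplicate or missed reads across rotations):

```
<source>
  @type tail
  path /data0/podman/storage/overlay-containers/*/userdata/ctr.log
  pos_file /var/log/td-agent/podman.pos
  tag kubernetes.podman.*
  follow_inodes true   # track files by inode so rotated files are handled correctly
  rotate_wait 5        # keep reading a rotated file for 5s before closing it
  refresh_interval 60s
  <parse>
    @type cri
  </parse>
</source>
```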
-
Sorry for the delay.
-
Describe the bug
Version used: td-agent 4.4.2 / fluentd 1.15.3
Fluentd configuration used: the source log ctr.log comes from podman services, and this log file is updated more often than once per second. When Fluentd tails this file, a SIGKILL occurs:

```
{"log":{"message":"following tail of /xxx/podman/storage/overlay-containers//userdata/ctr.log"},"extension":{"worker_id":0},"type":"log","level":"info","timezone":"xxx/xxx","system":"BSSC","systemid":"035d867d3f4641bc89b5e2858f0116a9","host":"bcmt-fluentd-worker-bssc-fluentd-daemonset-86pfk.ncms","time":"2023-09-14T08:11:02+0300"}
{"time":"2023-09-14T08:23:00+0300","level":"error","message":"Worker 0 finished unexpectedly with signal SIGKILL"}
```

When debug logs are enabled, it shows the following:

```
{"log":{"message":"tailing paths: target = /xxx/podman/storage/overlay-containers/060a7faf16e6f7b481e81b214165cda82d16fb7a2dbc16e13c11d41db95969ca/userdata/ctr.log,/data0/podman/storage/overlay-containers/xxx/userdata/ctr.log | existing = /data0/podman/storage/overlay-containers//userdata/ctr.log,/data0/podman/storage/overlay-containers/def7/userdata/ctr.log"},"extension":{"worker_id":0},"type":"log","level":"debug","timezone":"Europe/Helsinki","system":"BSSC","systemid":"035d867d3f4641bc89b5e2858f0116a9","host":"bcmt-fluentd-worker-bssc-fluentd-daemonset-86pfk.ncms","time":"2023-09-14T09:55:29+0300"}
```

To Reproduce
Version used: td-agent 4.4.2 / fluentd 1.15.3
Expected behavior
The SIGKILL shouldn't appear in the Fluentd logs.
Your Environment
Your Configuration
Your Error Log
Additional context
No response