Describe the bug
When running as a DaemonSet in an EKS cluster and tailing containerd logs, Fluent Bit occasionally corrupts the time field of a record, leaving the chunk file stuck in the tail.0/ storage directory and unable to be flushed to the OUTPUT.
To Reproduce
Start Fluent Bit as a DaemonSet on Kubernetes.
Use the tail plugin to collect container logs.
Please refer to the configs in the Configuration section below.
Contents of the chunk file when stuck:
```
/var/log/flb-storage/tail.0/1-1715689020.629303639.flb
config.seen<D9>#2024-05-11T08:42:31.887233529+01:00<BC>cni.projectcalico.org/podIPs<B2>a-random-ip<A8>pod _name<B9>opensearch-cluster-data-2<A6>labels<8B><BA>app.kubernetes.io/instance<AF>opensearch-data<B6>app.kubernetes.io/name<AA>opensearch<D9> app.kubernetes.io/team-component<B7>opensearch-cluster-data<B8>controller-revision-hash<D9>"opensearch-cluster-data-868b795d8d<AD>helm.sh/chart<B1>opensearch-2.17.0<B7>sidecar.istio.io/inject<A5>false<D9>"statefulset.kubernetes.io/pod-name<B9>opensearch-cluster-data-2<A4> team<A4>a-random-team<B9>app.kubernetes.io/version<A6>2.11.1<BB>app.kubernetes.io/component<B7> opensearch-cluster-data<BC>app.kubernetes.io/managed-by<A4>Helm<AE> namespace_name<AA>opensearch<AF> container_image<D9>random-container-image-name<A6> pod_id<D9>$d641198c-6c29-4744-b20a-21a828f62f9b<A4>time<B4>14:42.26019153+01:00<AF>es_index_prefix<BE>a-random-index-prefix<A2>_p<A1>F<B4>kubernetes_namespace<82>
```
The above is just a snippet of the whole file, and some fields were amended to protect the data.
Pay close attention to the time field.
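For anyone else triaging a stuck chunk, the time field is easy to locate in the raw bytes: in the dump above, <A4> is the msgpack fixstr header for the 4-byte key time, and <B4> is the fixstr header for the 20-byte corrupted value that follows it (an intact timestamp like the one at the start of the chunk is 35 bytes and uses a str8 header, <D9> followed by # = 0x23). Below is a minimal diagnostic sketch along those lines; it only handles the fixstr and str8 encodings and is not a full .flb chunk parser:

```python
#!/usr/bin/env python3
"""Scan a Fluent Bit .flb chunk for msgpack-encoded "time" values.

Rough diagnostic sketch, not a chunk-format parser: it looks for the
fixstr-encoded key "time" (0xa4 't' 'i' 'm' 'e') in the raw bytes and
decodes the string value that immediately follows it.
"""
import sys

FIXSTR_KEY = b"\xa4time"  # msgpack fixstr header (0xa0 | len=4) + "time"

def scan(path: str) -> None:
    data = open(path, "rb").read()
    pos = 0
    while (pos := data.find(FIXSTR_KEY, pos)) != -1:
        pos += len(FIXSTR_KEY)
        if pos >= len(data):
            break
        head = data[pos]
        if 0xA0 <= head <= 0xBF:           # fixstr: low 5 bits are the length
            length, start = head & 0x1F, pos + 1
        elif head == 0xD9:                 # str8: next byte is the length
            length, start = data[pos + 1], pos + 2
        else:
            continue                       # "time" key with a non-string value
        value = data[start:start + length].decode("utf-8", "replace")
        # An intact value looks like 2024-05-11T08:42:31.887233529+01:00;
        # a corrupted one is missing the leading date portion.
        flag = "" if value[:4].isdigit() else "   <-- suspicious"
        print(f"offset {pos}: {value}{flag}")

if __name__ == "__main__":
    scan(sys.argv[1])
```

Running it as python3 scan_chunk.py /var/log/flb-storage/tail.0/1-1715689020.629303639.flb should flag the truncated 14:42.26019153+01:00 value while leaving intact timestamps unflagged.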
Error message from the OUTPUT:
```
2024-05-15 01:13:43 +0000 [error]: #1 failed to process request error_class=Fluent::Plugin::Parser::ParserError error="invalid time format: value = 14:42.26019153+01:00, error_class = ArgumentError, error = invalid xmlschema format: \"14:42.26019153+01:00\""
```
The original log file from which this entry was collected did not exhibit this truncation at the beginning of the datetime string.
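For context on the failure: the corrupted value looks like the tail end of an RFC 3339 timestamp whose leading date and hour bytes were lost, so it can no longer match the %Y-%m-%dT%H:%M:%S.%L shape that the docker_no_time parser (and the downstream xmlschema parse in Fluentd) expect. A quick illustrative check, using a regex that approximates that shape rather than Fluentd's exact validation:

```python
import re

# Rough RFC 3339 shape the docker_no_time parser expects
# (illustrative approximation, not Fluentd's exact check).
RFC3339 = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+(?:Z|[+-]\d{2}:\d{2})$"
)

for value in (
    "2024-05-11T08:42:31.887233529+01:00",  # intact timestamp from the chunk
    "14:42.26019153+01:00",                 # corrupted value from the chunk
):
    print(value, "->", "ok" if RFC3339.match(value) else "invalid time format")
```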
Expected behavior
The log files should be parsed and chunked correctly, as millions of others are.
Screenshots
N/A
Your Environment
Version used: 3.0.3
Configuration:
```yaml
custom_parsers.conf: |
  [PARSER]
      Name        docker_no_time
      Format      json
      Time_Keep   Off
      Time_Key    time
      Time_Format %Y-%m-%dT%H:%M:%S.%L
fluent-bit.conf: |
  [SERVICE]
      Daemon                              Off
      Flush                               1
      Log_Level                           error
      Parsers_File                        /fluent-bit/etc/parsers.conf
      Parsers_File                        /fluent-bit/etc/conf/custom_parsers.conf
      HTTP_Server                         On
      HTTP_Listen                         0.0.0.0
      HTTP_Port                           2020
      Health_Check                        On
      scheduler.cap                       300
      storage.path                        /var/log/flb-storage/
      storage.max_chunks_up               128
      storage.sync                        full
      storage.backlog.mem_limit           5M
      storage.delete_irrecoverable_chunks on

  [INPUT]
      Name                              tail
      Path                              /var/log/containers/*.log
      multiline.parser                  cri
      Tag                               kube.*
      Skip_Long_Lines                   On
      Skip_Empty_Lines                  On
      Buffer_Chunk_Size                 64KB
      Buffer_Max_Size                   128KB
      DB                                /var/log/flb-storage/containers.db
      storage.type                      filesystem
      storage.pause_on_chunks_overlimit on

  [INPUT]
      Name                              systemd
      Tag                               host.*
      Systemd_Filter                    _SYSTEMD_UNIT=kubelet.service
      Systemd_Filter                    _SYSTEMD_UNIT=docker.service
      Systemd_Filter                    _SYSTEMD_UNIT=containerd.service
      DB                                /var/log/flb-storage/systemd.db
      Read_From_Tail                    On
      storage.type                      filesystem
      storage.pause_on_chunks_overlimit on

  [FILTER]
      Name             kubernetes
      Match            kube.*
      Kube_URL         https://kubernetes.default.svc.cluster.local:443
      Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
      Kube_Tag_Prefix  kube.var.log.containers.
      Merge_Log        On
      Labels           On
      Annotations      On
      Buffer_Size      1MB
      Use_Kubelet      On
      namespace_labels On

  [FILTER]
      Name         modify
      Match        host.*
      Rename       _HOSTNAME hostname
      Rename       _SYSTEMD_UNIT systemd_unit
      Rename       MESSAGE log
      Remove_regex ^((?!hostname|systemd_unit|log).)*$

  [FILTER]
      Name         aws
      Match        host.*
      imds_version v2

  [FILTER]
      Name  modify
      Match *
      Add   environment_name env-name
      Add   cluster_name cluster-name

  [FILTER]
      Name   lua
      Match  *
      script /fluent-bit/scripts/index_name_filter.lua
      call   index_name

  [OUTPUT]
      Name                     http
      Alias                    an-alias-name
      Match                    *
      Host                     a-host-name.com
      Port                     443
      http_User                ${FLUENTD_USER}
      http_Passwd              ${FLUENTD_PASSWORD}
      URI                      /a-given-tag
      Format                   json
      header                   User-Agent a-user-agent
      header_tag               FLUENT-TAG
      json_date_format         iso8601
      tls                      on
      tls.verify               off
      compress                 gzip
      Retry_Limit              no_limits
      net.dns.resolver         async
      log_suppress_interval    10s
      storage.total_limit_size 500M
      Log_Level                error
```
Environment name and version: Kubernetes 1.27.12
Server type and version: Running as docker images on Kubernetes (fluent-bit:3.0.3-debug)
Operating System and version: Linux
Filters and plugins: tail, systemd, kubernetes, modify, http
Additional context
This means that affected log entries get stuck and are never processed and indexed as they should be.
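One way to at least detect the condition is to watch for chunks that linger in the tail.0/ directory. A hypothetical monitoring sketch, not part of the original report; the directory follows from the storage.path setting above, and the 10-minute threshold is an arbitrary assumption for "stuck":

```python
import os
import time

# Path from the storage.path setting plus the tail input's subdirectory;
# the 10-minute threshold is an arbitrary assumption for "stuck".
CHUNK_DIR = "/var/log/flb-storage/tail.0"
MAX_AGE_SECONDS = 600

now = time.time()
for entry in os.scandir(CHUNK_DIR):
    age = now - entry.stat().st_mtime
    if entry.name.endswith(".flb") and age > MAX_AGE_SECONDS:
        # With Flush 1, a healthy chunk should be flushed within seconds,
        # so anything this old is likely blocked (e.g., by a corrupted
        # time field as described above).
        print(f"possibly stuck chunk: {entry.path} (age {int(age)}s)")
```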
Potentially related issues:
#8413 #8718 #8798 #5217