Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seastar allocator causes infinite recursive call if seastar is compiled as a shared library #2247

Closed
tchaikov opened this issue May 17, 2024 · 0 comments · Fixed by #2255
Closed
Assignees

Comments

@tchaikov
Copy link
Contributor

tchaikov commented May 17, 2024

this can be reproduced with:

$ ./configure.py --mode dev
$ ninja -C build/dev test_unit_tls
$ build/dev/tests/unit/tls_test
Running 31 test cases...
INFO  2024-05-17 13:44:00,704 seastar - Reactor backend: io_uring
INFO  2024-05-17 13:44:00,707 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.
random-seed=4026002337
Segmentation fault (core dumped)

on fedora 40.

the backtrace looks like:

#47271 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47272 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47273 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47274 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47275 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47276 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47277 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47278 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47279 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47280 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47281 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47282 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47283 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47284 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47285 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47286 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47287 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47288 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47289 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47290 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47291 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47292 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47293 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47294 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47295 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47296 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47297 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47298 0x00007ffff7ac12a0 in free () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47299 0x00007ffff7fdb67b in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#47300 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#47301 0x00007ffff7fdb754 in update_get_addr (ti=0x7ffff7dbf288, gen=<optimized out>) at ../elf/dl-tls.c:916
#47302 0x00007ffff7fde4ec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#47303 0x00007ffff7b2641f in seastar::reactor::service_highres_timer() () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47304 0x00007ffff7a8fff5 in seastar::reactor_backend_uring::hrtimer_completion::complete_with(long) () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47305 0x00007ffff7a8f0c4 in seastar::reactor_backend_uring::wait_and_process_events(__sigset_t const*) () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47306 0x00007ffff7b273ca in seastar::reactor::do_run() () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47307 0x00007ffff7b43118 in std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) ()
   from /home/kefu/dev/seastar/build/dev/libseastar.so
#47308 0x00007ffff7adb2cb in seastar::posix_thread::start_routine(void*) () from /home/kefu/dev/seastar/build/dev/libseastar.so
#47309 0x00007ffff72a91b7 in start_thread () from /lib64/libc.so.6
#47310 0x00007ffff732b39c in clone3 () from /lib64/libc.so.6

it looks like an issue in glibc's TLS support. and it seems that there is a patch trying to address this, see https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/ . but at the moment of writing, the patch has not landed on upstream's master branch (https://sourceware.org/git/?p=glibc.git).

tchaikov added a commit to tchaikov/seastar that referenced this issue May 17, 2024
instead of using ubuntu:jammy and setup-cpp action for prepearing
the building toolchain, use fedora:40 container for building and
testing.

after switching to the github workflow based CI, we've been
seeing test failures due to networking issue:

```
  Failed to install llvm via system package manager Error: Command failed with exit code 35: curl -LJO https://apt.llvm.org/llvm.sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed

    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:--  0:04:19 --:--:--     0
  curl: (35) OpenSSL SSL_connect: Broken pipe in connection to apt.llvm.org:443
```

since fedora 40 comes with all the dependencies we need, let's
build and test in a container with the fedora:40 image. with,
hopefully the better CDN of the docker, and more reliable mirrors
of fedora repositories, and the package retrievial machinary built
into fedora's package management tools, we should have a more
resilient CI.

please note, in this change, we also

* install git before checkout the repo. the reason is that, unlike
  the github-hosted runner, the fedora:40 image does not have `git`
  installed, so we have to install it manually before using
  "actions/checkout" action.
* install clang-tools-extra when building with C++ modules enabled,
  because cmake and clang depend on clang-scan-deps to analyze the
  dependencies in betweener of C++20 modules.
* use static library in "dev" build mode. this is to work around
  the issue where seastar allocator causes infinite recursive call
  if seastar is compiled as a shared library. this only happens when
  the tree is compiled with newer glibc. see also scylladb#2247

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue May 17, 2024
instead of using ubuntu:jammy and setup-cpp action for prepearing
the building toolchain, use fedora:40 container for building and
testing.

after switching to the github workflow based CI, we've been
seeing test failures due to networking issue:

```
  Failed to install llvm via system package manager Error: Command failed with exit code 35: curl -LJO https://apt.llvm.org/llvm.sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed

    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:--  0:04:19 --:--:--     0
  curl: (35) OpenSSL SSL_connect: Broken pipe in connection to apt.llvm.org:443
```

since fedora 40 comes with all the dependencies we need, let's
build and test in a container with the fedora:40 image. with,
hopefully the better CDN of the docker, and more reliable mirrors
of fedora repositories, and the package retrievial machinary built
into fedora's package management tools, we should have a more
resilient CI.

please note, in this change, we also

* install git before checkout the repo. the reason is that, unlike
  the github-hosted runner, the fedora:40 image does not have `git`
  installed, so we have to install it manually before using
  "actions/checkout" action.
* install clang-tools-extra when building with C++ modules enabled,
  because cmake and clang depend on clang-scan-deps to analyze the
  dependencies in betweener of C++20 modules.
* use static library in "dev" build mode. this is to work around
  the issue where seastar allocator causes infinite recursive call
  if seastar is compiled as a shared library. this only happens when
  the tree is compiled with newer glibc. see also scylladb#2247

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue May 17, 2024
instead of using ubuntu:jammy and setup-cpp action for prepearing
the building toolchain, use fedora:40 container for building and
testing.

after switching to the github workflow based CI, we've been
seeing test failures due to networking issue:

```
  Failed to install llvm via system package manager Error: Command failed with exit code 35: curl -LJO https://apt.llvm.org/llvm.sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed

    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:--  0:04:19 --:--:--     0
  curl: (35) OpenSSL SSL_connect: Broken pipe in connection to apt.llvm.org:443
```

since fedora 40 comes with all the dependencies we need, let's
build and test in a container with the fedora:40 image. with,
hopefully the better CDN of the docker, and more reliable mirrors
of fedora repositories, and the package retrievial machinary built
into fedora's package management tools, we should have a more
resilient CI.

please note, in this change, we also

* install git before checkout the repo. the reason is that, unlike
  the github-hosted runner, the fedora:40 image does not have `git`
  installed, so we have to install it manually before using
  "actions/checkout" action.
* install clang-tools-extra when building with C++ modules enabled,
  because cmake and clang depend on clang-scan-deps to analyze the
  dependencies in betweener of C++20 modules.
* use static library in "dev" build mode. this is to work around
  the issue where seastar allocator causes infinite recursive call
  if seastar is compiled as a shared library. this only happens when
  the tree is compiled with newer glibc. see also scylladb#2247

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue May 20, 2024
quote from
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/

> On the glibc side, we should recommend that intercepting mallocs and its
> dependencies use initial-exec TLS because that kind of TLS does not use
> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
> totally by accident, and was in the past helped by glibc bug 19924.

so instead of allocating TLS variables using malloc, let's allocate them
using initial-exec TLS model. another approach is to single out the
static TLS variables in the code path of malloc/free and apply
`__attribute__ ((tls_model("initial-exec")))` to them, and optionally
only do this when we are building shared library.

but this could be overkill as

1. we build static library in the release build
2. the total size of the static TLS variables is presumably small, so
   the application linking against the seastar shared library
   should be able to afford this.

see also
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/
and
https://sourceware.org/bugzilla/show_bug.cgi?id=19924

Fixes scylladb#2247
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
@tchaikov tchaikov self-assigned this May 20, 2024
tchaikov added a commit to tchaikov/seastar that referenced this issue May 20, 2024
quote from
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/

> On the glibc side, we should recommend that intercepting mallocs and its
> dependencies use initial-exec TLS because that kind of TLS does not use
> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
> totally by accident, and was in the past helped by glibc bug 19924.

so instead of allocating TLS variables using malloc, let's allocate them
using initial-exec TLS model. another approach is to single out the
static TLS variables in the code path of malloc/free and apply
`__attribute__ ((tls_model("initial-exec")))` to them, and optionally
only do this when we are building shared library.

but this could be overkill as

1. we build static library in the release build
2. the total size of the static TLS variables is presumably small, so
   the application linking against the seastar shared library
   should be able to afford this.

see also
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/
and
https://sourceware.org/bugzilla/show_bug.cgi?id=19924

Fixes scylladb#2247
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue May 20, 2024
instead of using ubuntu:jammy and setup-cpp action for prepearing
the building toolchain, use fedora:40 container for building and
testing.

after switching to the github workflow based CI, we've been
seeing test failures due to networking issue:

```
  Failed to install llvm via system package manager Error: Command failed with exit code 35: curl -LJO https://apt.llvm.org/llvm.sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed

    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:--  0:04:19 --:--:--     0
  curl: (35) OpenSSL SSL_connect: Broken pipe in connection to apt.llvm.org:443
```

since fedora 40 comes with all the dependencies we need, let's
build and test in a container with the fedora:40 image. with,
hopefully the better CDN of the docker, and more reliable mirrors
of fedora repositories, and the package retrievial machinary built
into fedora's package management tools, we should have a more
resilient CI.

please note, in this change, we also

* install git before checkout the repo. the reason is that, unlike
  the github-hosted runner, the fedora:40 image does not have `git`
  installed, so we have to install it manually before using
  "actions/checkout" action.
* install clang-tools-extra when building with C++ modules enabled,
  because cmake and clang depend on clang-scan-deps to analyze the
  dependencies in betweener of C++20 modules.
* use static library in "dev" build mode. this is to work around
  the issue where seastar allocator causes infinite recursive call
  if seastar is compiled as a shared library. this only happens when
  the tree is compiled with newer glibc. see also scylladb#2247

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
avikivity pushed a commit that referenced this issue May 20, 2024
quote from
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/

> On the glibc side, we should recommend that intercepting mallocs and its
> dependencies use initial-exec TLS because that kind of TLS does not use
> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
> totally by accident, and was in the past helped by glibc bug 19924.

so instead of allocating TLS variables using malloc, let's allocate them
using initial-exec TLS model. another approach is to single out the
static TLS variables in the code path of malloc/free and apply
`__attribute__ ((tls_model("initial-exec")))` to them, and optionally
only do this when we are building shared library.

but this could be overkill as

1. we build static library in the release build
2. the total size of the static TLS variables is presumably small, so
   the application linking against the seastar shared library
   should be able to afford this.

see also
https://patchwork.ozlabs.org/project/glibc/patch/8734v1ieke.fsf@oldenburg.str.redhat.com/
and
https://sourceware.org/bugzilla/show_bug.cgi?id=19924

Fixes #2247
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue May 20, 2024
instead of using ubuntu:jammy and setup-cpp action for prepearing
the building toolchain, use fedora:40 container for building and
testing.

after switching to the github workflow based CI, we've been
seeing test failures due to networking issue:

```
  Failed to install llvm via system package manager Error: Command failed with exit code 35: curl -LJO https://apt.llvm.org/llvm.sh
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed

    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    0     0    0     0    0     0      0      0 --:--:--  0:04:19 --:--:--     0
  curl: (35) OpenSSL SSL_connect: Broken pipe in connection to apt.llvm.org:443
```

since fedora 40 comes with all the dependencies we need, let's
build and test in a container with the fedora:40 image. with,
hopefully the better CDN of the docker, and more reliable mirrors
of fedora repositories, and the package retrievial machinary built
into fedora's package management tools, we should have a more
resilient CI.

please note, in this change, we also

* install git before checkout the repo. the reason is that, unlike
  the github-hosted runner, the fedora:40 image does not have `git`
  installed, so we have to install it manually before using
  "actions/checkout" action.
* install clang-tools-extra when building with C++ modules enabled,
  because cmake and clang depend on clang-scan-deps to analyze the
  dependencies in betweener of C++20 modules.
* use static library in "dev" build mode. this is to work around
  the issue where seastar allocator causes infinite recursive call
  if seastar is compiled as a shared library. this only happens when
  the tree is compiled with newer glibc. see also scylladb#2247

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant