Partition split hung #1963

Open
acelyc111 opened this issue Mar 26, 2024 · 3 comments
Labels
type/bug This issue reports a bug.

Comments

@acelyc111

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    Create a table with 8 partitions
    Write some data to the table
    Start partition split

  2. What did you expect to see?
    The partition split can be completed successfully.

  3. What did you see instead?
    The partitions stayed in the SPLITTING state; the process hung and could not respond to requests.
    (Screenshot: WeCom screenshot_59984773-3820-4bd4-8fbf-c0fb02312e36)

  4. What version of Pegasus are you using?
    2.4

acelyc111 added the type/bug label Mar 26, 2024
@acelyc111

acelyc111 commented Mar 26, 2024

Excerpt from the pstack output:

Thread 273 (Thread 0x7f3df4dbc700 (LWP 25825)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1192a6e2 in dsn::task::wait(int) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e1193e44f in dsn::task_tracker::wait_outstanding_tasks() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e1179eef6 in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e117a0e1a in dsn::replication::replica_app_info::store(std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e116fbdd4 in dsn::replication::replica::store_app_info(dsn::app_info&, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e1173b18d in dsn::replication::replica::initialize_on_new() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e11764fe8 in dsn::replication::replica_stub::new_replica(dsn::gpid, dsn::app_info const&, bool, bool, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1176564f in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#12 0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#13 0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#14 0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#15 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#16 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#17 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 272 (Thread 0x7f3df45bb700 (LWP 25826)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 271 (Thread 0x7f3df3dba700 (LWP 25827)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 270 (Thread 0x7f3df35b9700 (LWP 25828)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 269 (Thread 0x7f3df2db8700 (LWP 25829)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 268 (Thread 0x7f3df25b7700 (LWP 25830)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 267 (Thread 0x7f3df1db6700 (LWP 25831)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6
Thread 266 (Thread 0x7f3df15b5700 (LWP 25832)):
#0  0x00007f3e0f2bbb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f3e0f2bbbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f3e0f2bbc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f3e1196a132 in dsn::tools::std_rwlock_nr_provider::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#4  0x00007f3e118f8e0d in dsn::zrwlock_nr::lock_write() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#5  0x00007f3e117653a3 in dsn::replication::replica_stub::create_child_replica_if_not_found(dsn::gpid, dsn::app_info*, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#6  0x00007f3e11765917 in dsn::replication::replica_stub::create_child_replica(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#7  0x00007f3e117ee1a6 in std::_Function_handler<void (), std::_Bind<void (dsn::replication::replica_stub::*(dsn::replication::replica_stub*, dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string))(dsn::rpc_address, dsn::app_info, long, dsn::gpid, dsn::gpid, std::string const&)> >::_M_invoke(std::_Any_data const&) () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#8  0x00007f3e11929f21 in dsn::task::exec_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#9  0x00007f3e1193f422 in dsn::task_worker::loop() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#10 0x00007f3e1193f5a0 in dsn::task_worker::run_internal() () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_replica_server.so
#11 0x00007f3e11125e9f in ?? () from /sensorsdata/main/program/skv/skv_offline/replica_server/lib/libdsn_utils.so
#12 0x00007f3e0f2b5ea5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f3e0d785b0d in clone () from /lib64/libc.so.6

Relevant part of the config:

[threadpool.THREAD_POOL_DEFAULT]
name = default
partitioned = false
worker_priority = THREAD_xPRIORITY_NORMAL
worker_count = 8

As shown above, the THREAD_POOL_DEFAULT thread-pool has 8 threads, the table to be split has 8 partitions too, and the cluster has only 1 replica server.

On receiving the partition split request, all 8 partitions start adding child replicas:

    tasking::enqueue(LPC_CREATE_CHILD,
                     tracker(),
                     std::bind(&replica_stub::create_child_replica,
                               _stub,
                               _replica->_config.hp_primary,
                               _replica->_app_info,
                               _child_init_ballot,
                               _child_gpid,
                               get_gpid(),
                               _replica->_dir),
                     get_gpid().thread_hash());
  1. The tasks are enqueued with task code LPC_CREATE_CHILD, which is handled by the THREAD_POOL_DEFAULT thread-pool.
  2. All 8 threads of the pool are now occupied.
  3. When one of them starts to write the replica info via replica_app_info::store(), the write is enqueued as an LPC_AIO_INFO_WRITE task, which also uses the THREAD_POOL_DEFAULT thread-pool.
  4. During step 3, the task holds the _replicas_lock of replica_stub.
  5. The other tasks wait for the lock to be released, but the lock owner is itself waiting for a thread to perform the file write from step 3, and no thread is available.
  6. This forms a deadlock.
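
The pattern in the steps above can be reproduced outside Pegasus with a small sketch. This is not Pegasus code: SimplePool, replicas_lock and count_stalled_waits are hypothetical names, and the unbounded wait in dsn::task::wait is replaced by a 500 ms timed wait so the sketch terminates and can report the stall instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal FIFO thread pool (hypothetical stand-in for THREAD_POOL_DEFAULT).
class SimplePool {
public:
    void enqueue(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> g(mu_);
            queue_.push(std::move(fn));
        }
        cv_.notify_one();
    }
    void start(int workers) {
        for (int i = 0; i < workers; ++i)
            threads_.emplace_back([this] { loop(); });
    }
    void stop() {  // drains the queue, then joins all workers
        {
            std::lock_guard<std::mutex> g(mu_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto &t : threads_) t.join();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> l(mu_);
                cv_.wait(l, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left to run
                fn = std::move(queue_.front());
                queue_.pop();
            }
            fn();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};

// Runs `tasks` outer tasks on a pool with `workers` threads. Each outer task
// takes a shared lock, enqueues a nested task on the SAME pool and waits for it.
// Returns how many of those nested waits timed out.
int count_stalled_waits(int workers, int tasks) {
    assert(workers > 0 && tasks > 0);
    SimplePool pool;
    std::mutex replicas_lock;  // stands in for replica_stub's _replicas_lock
    std::atomic<int> stalled{0};
    std::vector<std::future<void>> done;
    for (int i = 0; i < tasks; ++i) {
        auto finished = std::make_shared<std::promise<void>>();
        done.push_back(finished->get_future());
        pool.enqueue([&pool, &replicas_lock, &stalled, finished] {
            {
                std::lock_guard<std::mutex> g(replicas_lock);
                auto written = std::make_shared<std::promise<void>>();
                auto fut = written->get_future();
                pool.enqueue([written] { written->set_value(); });  // nested "write"
                if (fut.wait_for(std::chrono::milliseconds(500)) !=
                    std::future_status::ready)
                    ++stalled;  // no free worker picked up the nested task
            }
            finished->set_value();
        });
    }
    pool.start(workers);  // all outer tasks are queued before any worker runs
    for (auto &f : done) f.wait();
    pool.stop();
    return stalled.load();
}
```

With as many workers as outer tasks, for example count_stalled_waits(4, 4), the first lock holder's nested task finds no free worker and its wait times out. With an unbounded wait, as in the pstack above, that stall would be permanent. With one spare worker, count_stalled_waits(5, 4), every nested task runs and no wait times out.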

@acelyc111

The latest version has updated the logic to use rocksdb::WriteStringToFile, which writes the file in the current thread.
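
As a rough illustration of why writing in the current thread removes the dependency on a free pool thread, here is a dependency-free sketch. It uses std::ofstream as a stand-in for rocksdb::WriteStringToFile, and the names store_in_current_thread and read_back are hypothetical, not Pegasus functions.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `data` to `path` synchronously on the calling
// thread. Nothing is enqueued, so the caller never waits for another worker.
bool store_in_current_thread(const std::string &path, const std::string &data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return out.good();
}

// Hypothetical helper used only to check the result of the write.
std::string read_back(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Because the write completes before the function returns, a task that calls it while holding a lock can always make progress and release the lock, regardless of how many pool threads are busy.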

This issue can occur on versions 2.4 and 2.5. If you encounter it, you can increase the value of worker_count in the [threadpool.THREAD_POOL_DEFAULT] section.
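
A sketch of the workaround, based on the config excerpt above. The value 16 is only an illustrative assumption. The analysis above suggests the pool needs more threads than the number of partitions being split at once on a single replica server.

```
[threadpool.THREAD_POOL_DEFAULT]
name = default
partitioned = false
worker_priority = THREAD_xPRIORITY_NORMAL
; assumption: any value larger than the 8 partitions being split; 16 is illustrative
worker_count = 16
```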

@acelyc111

Please don't close this issue, so that it stays easy to find by search.
