Cylon container fails #679

Open
AymenFJA opened this issue Oct 10, 2023 · 8 comments
@AymenFJA
Contributor

AymenFJA commented Oct 10, 2023

Hello @nirandaperera and Cylon team,

I was testing the Cylon container with Kubernetes on AWS. I have a multi-node MPI environment set up on the cluster.

I tested Cylon with 1 and 2 nodes (each node has 128 cores and 16 GB of memory per core, 2048 GB per node in total). Both runs worked fine when executing a join operation with ~35M rows using the following script: https://github.com/cylondata/cylon/blob/main/summit/scripts/cylon_scaling.py.

The command line that I used:

mpirun -n 256 cylon_scaling.py -s w -n 35000000

I repeated the same setup but this time with 3 or 4 nodes:

mpirun -n 384 cylon_scaling.py -s w -n 35000000
mpirun -n 512 cylon_scaling.py -s w -n 35000000

And I started getting the following error:

[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) fail[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(258) failed: Bad file descriptor (9)
[cylon-join-worker-1:17259] *** Process received signal ***
[cylon-join-worker-1:17259] Signal: Segmentation fault (11)
[cylon-join-worker-1:17259] Signal code: Address not mapped (1)
[cylon-join-worker-1:17259] Failing at address: (nil)
[cylon-join-worker-1:17259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe751c04090]
[cylon-join-worker-1:17259] *** End of error message ***

[cylon-join-worker-1:17186] *** Process received signal ***
[cylon-join-worker-1:17186] Signal: Segmentation fault (11)
[cylon-join-worker-1:17186] Signal code: Address not mapped (1)
[cylon-join-worker-1:17186] Failing at address: 0x18
[cylon-join-worker-1:17186] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f50e0825090]
[cylon-join-worker-1:17186] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_tcp.so(mca_btl_tcp_endpoint_send+0x609)[0x7f50dbcdbfa9]
[cylon-join-worker-1:17186] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x1a1)[0x7f50db6d8bc1]
[cylon-join-worker-1:17186] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x482)[0x7f50db6ca3a2]
[cylon-join-worker-1:17186] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f50dd28893d]
[cylon-join-worker-1:17186] [ 5] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon10MPIChannel13progressSendsEv+0x15a)[0x7f5025e8703a]
[cylon-join-worker-1:17186] [ 6] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon8AllToAll10isCompleteEv+0x233)[0x7f5025e8d103]
[cylon-join-worker-1:17186] [ 7] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon13ArrowAllToAll10isCompleteEv+0x76a)[0x7f5025bb069a]
[cylon-join-worker-1:17186] [ 8] /cylon/build/lib/libcylon.so.0.6.0(+0x4ed43c)[0x7f5025ecc43c]
[cylon-join-worker-1:17186] [ 9] /cylon/build/lib/libcylon.so.0.6.0(+0x4ee323)[0x7f5025ecd323]
[cylon-join-worker-1:17186] [10] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon15DistributedJoinERKSt10shared_ptrINS_5TableEES4_RKNS_4join6config10JoinConfigERS2_+0x8a)[0x7f5025ece00a]
[cylon-join-worker-1:17186] [11] /cylon/ENV/lib/python3.8/site-packages/pycylon-0+untagged.1302.g44a27a6-py3.8-linux-x86_64.egg/pycylon/data/table.cpython-38-x86_64-linux-gnu.so(+0x75c02)[0x7f50db67ec02]
[cylon-join-worker-1:17186] [12] /cylon/ENV/bin/python3(PyCFunction_Call+0x59)[0x5f6939]
[cylon-join-worker-1:17186] [13] /cylon/ENV/bin/python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
[cylon-join-worker-1:17186] [14] /cylon/ENV/bin/python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
[cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cylon-join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[cylon-join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[cylon-join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail

Any help here would be appreciated.

@mstaylor
Collaborator

This looks like an execution issue. There are also subtleties related to Docker usage and port mapping. You might want to try host networking and make sure no processes are already executing or sitting idle on the various nodes. For ECS, it has been necessary to specifically map ports to avoid this sort of thing.
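As a rough sanity check on the `bind on local address (192.168.99.12:0) failed: Address already in use` messages: on Linux, a bind to port 0 fails with that error when no ephemeral port is free. The sketch below compares an upper bound on the TCP peer pairs one node takes part in against the local ephemeral port range. It assumes 128 ranks per node and that every rank may connect to every off-node rank, as in an all-to-all shuffle. The helper names are hypothetical, and the real connection count depends on which side initiates each connection and how many links Open MPI opens per peer, so treat it as an estimate, not a diagnosis.

```python
from pathlib import Path


def read_ephemeral_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Return (low, high) from the Linux sysctl file, or None if it is unavailable."""
    try:
        low, high = Path(path).read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return None


def remote_peers_per_node(total_ranks, ranks_per_node):
    """Upper bound on rank-to-rank pairs that cross the boundary of one node.

    Assumption: every rank on the node may open a TCP connection to every
    rank that lives on a different node.
    """
    off_node_ranks = total_ranks - ranks_per_node
    return ranks_per_node * off_node_ranks


port_range = read_ephemeral_port_range()
if port_range is None:
    print("ephemeral port range: not available on this system")
else:
    low, high = port_range
    print(f"ephemeral ports available: {high - low + 1} ({low}-{high})")

# Rank counts taken from the mpirun commands in this issue; 128 ranks per node is assumed.
for total in (256, 384, 512):
    print(f"-n {total}: up to {remote_peers_per_node(total, 128)} off-node peer pairs per node")
```

If the estimate for the failing runs is larger than the number of ephemeral ports inside the pod, port exhaustion is at least worth ruling out alongside the port-mapping and stale-process checks mentioned above.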

@AymenFJA
Contributor Author

@mstaylor Could you please elaborate on "For ECS, it has been necessary to specifically map ports to avoid this sort of thing"? It would be great if you have an example of how to do so. Thanks.

@AymenFJA
Contributor Author

AymenFJA commented Oct 23, 2023

@mstaylor, a gentle reminder about the comment above.

@mstaylor
Collaborator

@AymenFJA - your issue is here: [cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)

For my research experiments, I use UCX/UCC/Redis, which is a bit different. For OpenMPI, you might consider the following approach: https://github.com/everpeace/kube-openmpi. If you switch to ECS, you can generate a task that includes port mapping. Here's an example from my ECS task mapping:

"family": "cylon-ucc-ucx-redis-ec2-4_26_9100000-8Node-task",
"containerDefinitions": [
    {
        "name": "redisUCSUCX",
        "image": "448324707516.dkr.ecr.us-east-1.amazonaws.com/cylon-ucc-ucx-redis:latest",
        "cpu": 4096,
        "memory": 26624,
        "portMappings": [
            {
                "name": "redisucsucx-18-tcp",
                "containerPort": 18,
                "hostPort": 18,
                "protocol": "tcp"
            },
            {
                "name": "redisucsucx-41768-tcp",
                "containerPort": 41768,
                "hostPort": 41768,
                "protocol": "tcp"
            },...
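Since the task definition above repeats one entry per port, the `portMappings` list can be generated rather than written by hand. This is a minimal sketch that mirrors the field names and naming pattern of the two entries shown. The function name `make_port_mappings` and its `prefix` parameter are hypothetical, and the port list is only the two ports visible in the excerpt.

```python
import json


def make_port_mappings(ports, prefix="redisucsucx", protocol="tcp"):
    """Build ECS-style portMappings entries that map each container port to the same host port."""
    return [
        {
            "name": f"{prefix}-{port}-{protocol}",
            "containerPort": port,
            "hostPort": port,
            "protocol": protocol,
        }
        for port in ports
    ]


# Only the two ports that appear in the excerpt; the full list is elided there.
mappings = make_port_mappings([18, 41768])
print(json.dumps(mappings, indent=4))
```

The resulting list can be placed under `"portMappings"` in the container definition.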

The issue is that you are running on pods with addresses already in use (hence the error logged). What does your hosts file look like?

@mstaylor
Collaborator

@AymenFJA - did you use our Docker image, or did you build an image based on updates in main?

@AymenFJA
Contributor Author

Thanks, @mstaylor, for your response. Can we have a 1-1 meeting to discuss it? It would be great to do that. If you agree, I can ping you on Slack and take it from there.

@mstaylor
Collaborator

@AymenFJA - that sounds great.

@AymenFJA
Contributor Author

@mstaylor I pinged you on Slack (cylondata).
