In version 1.5.0b0, the program does not exit automatically after an SPU task finishes #1281

Open
nfangxu opened this issue May 9, 2024 · 2 comments

nfangxu commented May 9, 2024

Issue Type

Bug

Source

binary

Secretflow Version

1.5.0b0

OS Platform and Distribution

Centos 7.9.2009

Python version

3.10.14

Bazel version

No response

GCC/Compiler version

No response

What happened and what you expected to happen.

Start a Ray cluster on each of the two machines:

# 192.168.3.21
export ip="192.168.3.21"
ray start --head --node-ip-address="${ip}" --port="9010" --include-dashboard=False --disable-usage-stats
# 192.168.3.23
export ip="192.168.3.23"
ray start --head --node-ip-address="${ip}" --port="9010" --include-dashboard=False --disable-usage-stats

Then run the following on each machine:

# 192.168.3.23
python3 demo.py -p=client
# 192.168.3.21
python3 demo.py -p=server

After execution finishes, the logs are as follows:

  • client
[root@sf-3-23 ~]# python3 demo.py -p=client
2024-05-09 11:24:55,420 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 192.168.3.23:9010...
2024-05-09 11:24:55,436 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-09 11:24:55.481 INFO api.py:233 [client] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'client': '0.0.0.0:9020', 'server': '192.168.3.21:9020'}, 'CURRENT_PARTY_NAME': 'client', 'TLS_CONFIG': {}}
2024-05-09 11:24:55.481 DEBUG message_queue.py:56 [client] -- [Anonymous_job] Starting new thread[DataSendingQueueThread] for message polling.
2024-05-09 11:24:55.482 DEBUG cleanup.py:67 [client] -- [Anonymous_job] Start check sending thread.
2024-05-09 11:24:55.482 DEBUG message_queue.py:56 [client] -- [Anonymous_job] Starting new thread[ErrorSendingQueueThread] for message polling.
2024-05-09 11:24:55.483 DEBUG cleanup.py:69 [client] -- [Anonymous_job] Start check error sending thread.
2024-05-09 11:24:55.483 DEBUG barriers.py:445 [client] -- [Anonymous_job] Starting ReceiverProxyActor with options: {'max_concurrency': 1, 'name': 'SenderReceiverProxyActor'}
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:24:56.954 INFO link.py:38 [client] -- [Anonymous_job] brpc options: {'message_max_size_in_bytes': 2147483647, 'timeout_in_ms': 1800000, 'connect_retry_times': 8640, 'connect_retry_interval_ms': 10000, 'recv_timeout_ms': 21600000, 'http_timeout_ms': 21600000, 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:24:56.954 WARNING link_config.py:34 [client] -- [Anonymous_job] http_timeout_ms and timeout_ms are set at the same time, http_timeout_ms 21600000 will be used.
(SenderReceiverProxyActor pid=17006) I0509 11:24:56.980060 17006 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=9020.
(SenderReceiverProxyActor pid=17006) W0509 11:24:56.980113 17006 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-05-09 11:25:02.569 INFO barriers.py:465 [client] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-09 11:25:02.569 INFO barriers.py:520 [client] -- [Anonymous_job] Try ping ['server'] at 0 attemp, up to 3600 attemps.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:02.579 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id ping of server from ping without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:02.579 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id ping of server from ping. Response is True
=========================Start
2024-05-09 11:25:02.631 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function get_data at 0x7f406d3205e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:02.632 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:02.636 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7f404853d480>, num_returns=None, args len: 4, kwargs len: 0.
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': 
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2024-05-09 11:25:04.529 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.540 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7f404853e320>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:04.541 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.541 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.542 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:04.544 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.545 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
2024-05-09 11:25:04.545 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function get_data at 0x7f406d3205e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:04.545 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7f404853e320>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:04.530 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 7 of server from 6#0 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:04.530 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 7 of server from 6#0. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.609 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 10 of server from 8#1 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.609 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 10 of server from 8#1. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.611 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 10 of server from 8#3 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.611 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 10 of server from 8#3. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.612 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 15 from 14#0 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.612 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for ping from ping.
2024-05-09 11:25:06.650 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7f404853e680>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:06.651 DEBUG utils.py:66 [client] -- [Anonymous_job] Insert recv_op, arg task id 16#1, current task id 17
2024-05-09 11:25:06.652 DEBUG utils.py:66 [client] -- [Anonymous_job] Insert recv_op, arg task id 16#2, current task id 17
2024-05-09 11:25:06.653 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:06.653 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:06.653 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
=========================Success
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.647 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 15 from 14#0.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.648 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 15 from 14#0 of server.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.654 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 17 from 16#1 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.655 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 17 from 16#1.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.655 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 17 from 16#1 of server.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 17 from 16#2 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 17 from 16#2.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 17 from 16#2 of server.
  • server
[root@sf-3-21 ~]# python3 demo.py -p=server
2024-05-09 11:24:52,188 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 192.168.3.21:9010...
2024-05-09 11:24:52,203 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-09 11:24:52.249 INFO api.py:233 [server] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'client': '192.168.3.23:9020', 'server': '0.0.0.0:9020'}, 'CURRENT_PARTY_NAME': 'server', 'TLS_CONFIG': {}}
2024-05-09 11:24:52.249 DEBUG message_queue.py:56 [server] -- [Anonymous_job] Starting new thread[DataSendingQueueThread] for message polling.
2024-05-09 11:24:52.250 DEBUG cleanup.py:67 [server] -- [Anonymous_job] Start check sending thread.
2024-05-09 11:24:52.250 DEBUG message_queue.py:56 [server] -- [Anonymous_job] Starting new thread[ErrorSendingQueueThread] for message polling.
2024-05-09 11:24:52.250 DEBUG cleanup.py:69 [server] -- [Anonymous_job] Start check error sending thread.
2024-05-09 11:24:52.250 DEBUG barriers.py:445 [server] -- [Anonymous_job] Starting ReceiverProxyActor with options: {'max_concurrency': 1, 'name': 'SenderReceiverProxyActor'}
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:24:53.721 INFO link.py:38 [server] -- [Anonymous_job] brpc options: {'message_max_size_in_bytes': 2147483647, 'timeout_in_ms': 1800000, 'connect_retry_times': 8640, 'connect_retry_interval_ms': 10000, 'recv_timeout_ms': 21600000, 'http_timeout_ms': 21600000, 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:24:53.722 WARNING link_config.py:34 [server] -- [Anonymous_job] http_timeout_ms and timeout_ms are set at the same time, http_timeout_ms 21600000 will be used.
(SenderReceiverProxyActor pid=24536) I0509 11:24:53.749538 24536 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=9020.
(SenderReceiverProxyActor pid=24536) W0509 11:24:53.749588 24536 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=24536) I0509 11:24:53.869094 24630 external/com_github_brpc_brpc/src/brpc/socket.cpp:2466] Checking Socket{id=0 addr=192.168.3.23:9020} (0x3513080)
(SenderReceiverProxyActor pid=24536) I0509 11:24:59.871872 24662 external/com_github_brpc_brpc/src/brpc/socket.cpp:2526] Revived Socket{id=0 addr=192.168.3.23:9020} (0x3513080) (Connectable)
2024-05-09 11:25:02.792 INFO barriers.py:465 [server] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-09 11:25:02.792 INFO barriers.py:520 [server] -- [Anonymous_job] Try ping ['client'] at 0 attemp, up to 3600 attemps.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.799 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id ping of client from ping without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.800 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id ping of client from ping. Response is True
=========================Start
2024-05-09 11:25:02.852 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function get_data at 0x7fa10e3005e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:02.852 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7fa10471b0a0>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.856 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 7 from 6#0 of client
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.857 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for ping from ping.
2024-05-09 11:25:04.757 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7fa10471b130>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:04.758 DEBUG utils.py:66 [server] -- [Anonymous_job] Insert recv_op, arg task id 8#1, current task id 10
2024-05-09 11:25:04.760 DEBUG utils.py:66 [server] -- [Anonymous_job] Insert recv_op, arg task id 8#3, current task id 10
2024-05-09 11:25:04.762 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:04.764 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:04.764 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
2024-05-09 11:25:04.772 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function get_data at 0x7fa10e3005e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:04.772 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:04.778 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7fa10471b130>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.753 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 7 from 6#0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.754 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 7 from 6#0 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.761 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 10 from 8#1 of client
2024-05-09 11:25:06.716 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.720 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7fa104719480>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:06.722 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.722 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.722 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:06.723 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.723 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
=========================Success
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': 
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.865 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 10 from 8#1.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.866 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 10 from 8#1 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.867 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 10 from 8#3 of client
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 10 from 8#3.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 10 from 8#3 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 15 of client from 14#0 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.869 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 15 of client from 14#0. Response is True
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.870 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 17 of client from 16#1 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.870 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 17 of client from 16#1. Response is True
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.871 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 17 of client from 16#2 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.871 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 17 of client from 16#2. Response is True
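
As a rough aid for reading logs like these, the `Getting data for ...` and `Getted data for ...` lines can be paired to see whether any cross-party receive was still outstanding. The helper below is a hypothetical sketch based only on the message format visible in this log; `pending_receives` is not a SecretFlow or RayFed utility.

```python
import re

# Patterns copied from the message text in the logs above (spelling included).
GETTING = re.compile(r"Getting data for (\S+) from (\S+) of (\w+)")
GETTED = re.compile(r"Getted data for (\S+) from (\S+) of (\w+)")


def pending_receives(lines):
    """Return (seq_id, upstream_id, party) tuples that were requested
    but never reported as received."""
    pending = set()
    for line in lines:
        started = GETTING.search(line)
        if started:
            pending.add(started.groups())
            continue
        finished = GETTED.search(line)
        if finished:
            pending.discard(finished.groups())
    return pending


# Receive-related messages from the client log above, timestamps trimmed.
client_lines = [
    "Getting data for 15 from 14#0 of server",
    "Getted data for 15 from 14#0 of server.",
    "Getting data for 17 from 16#1 of server",
    "Getted data for 17 from 16#1 of server.",
    "Getting data for 17 from 16#2 of server",
    "Getted data for 17 from 16#2 of server.",
]
outstanding = pending_receives(client_lines)
```

Applied to the receive lines in both logs above, every `Getting` has a matching `Getted`, which suggests the data exchange itself completed on both sides.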

Reproduction code to reproduce the issue.

The code of demo.py is as follows:

import argparse
import secretflow as sf
import logging

def ray_init(self_party):
    sf.shutdown()

    ip={
        "server": "192.168.3.21",
        "client": "192.168.3.23",
    }[self_party]

    sf.init(address=ip+":9010",
            cluster_config={
                'self_party': self_party,
                'parties': {
                    'client': {
                        'id': 'client',
                        'party': 'client',
                        'address': '192.168.3.23:9020',
                        'listen_addr': '0.0.0.0:9020',
                    },
                    'server': {
                        'id': 'server',
                        'party': 'server',
                        'address': '192.168.3.21:9020',
                        'listen_addr': '0.0.0.0:9020',
                    }
                },
            },
            log_to_driver=True,
            logging_level=logging.getLevelName(logging.DEBUG).lower(),
            cross_silo_comm_backend='brpc_link',
            cross_silo_comm_options={
                "message_max_size_in_bytes": (2 << 30) - 1,
                "timeout_in_ms": 30 * 60 * 1000,
                # BRPC Config
                "connect_retry_times": 6 * 60 * 24,
                "connect_retry_interval_ms": 10 * 1000,
                "recv_timeout_ms": 6 * 3600 * 1000,
                "http_timeout_ms": 6 * 3600 * 1000,
                },
            )

def spu_init():
    cluster_def = {
        "runtime_config": {
            "protocol": "SEMI2K",
            "field": "FM128",
            "fxp_fraction_bits": 32,
            "fxp_div_goldschmidt_iters": 10,
        },
        "nodes": [
            {
                "party": 'client',
                'address': '192.168.3.23:9030',
                "listen_address": "0.0.0.0:9030"
            },
            {
                "party": 'server',
                'address': '192.168.3.21:9030',
                "listen_address": "0.0.0.0:9030"
            },
        ],
    }

    # link_desc
    link_desc = {
        "connect_retry_times": 6 * 60 * 24,
        "connect_retry_interval_ms": 10 * 1000,
        "recv_timeout_ms": 6 * 3600 * 1000,
        "http_timeout_ms": 6 * 3600 * 1000,
        "throttle_window_size": 0,
        "brpc_channel_protocol": "http",
        "brpc_channel_connection_type": "pooled",
    }

    return sf.SPU(cluster_def=cluster_def, link_desc=link_desc)


def get_data(i):
    return i


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--party", default="", help="party id")
    args = parser.parse_args()

    # Ray init
    ray_init(args.party)
    spu_device = spu_init()
    pyus = [sf.PYU("client"), sf.PYU("server")]

    print("=========================Start")
    for pyu in pyus:
        pyu(get_data)(1).to(spu_device)

    print("=========================Success")
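
For readers comparing the script against the logs, the arithmetic in `cross_silo_comm_options` can be checked against the `brpc options` line printed by SenderReceiverProxyActor. The sketch below only restates numbers already present above; whether any of these timeouts is related to the process not exiting is not established here.

```python
# Expressions copied from cross_silo_comm_options in demo.py.
options = {
    "message_max_size_in_bytes": (2 << 30) - 1,
    "timeout_in_ms": 30 * 60 * 1000,
    "connect_retry_times": 6 * 60 * 24,
    "connect_retry_interval_ms": 10 * 1000,
    "recv_timeout_ms": 6 * 3600 * 1000,
    "http_timeout_ms": 6 * 3600 * 1000,
}

# Values copied from the "brpc options" log line.
logged = {
    "message_max_size_in_bytes": 2147483647,
    "timeout_in_ms": 1800000,
    "connect_retry_times": 8640,
    "connect_retry_interval_ms": 10000,
    "recv_timeout_ms": 21600000,
    "http_timeout_ms": 21600000,
}

# The evaluated expressions match what the actor reported.
config_matches_log = options == logged

MS_PER_HOUR = 3600 * 1000
# Total time spent retrying a connection if every attempt fails.
retry_window_hours = (
    options["connect_retry_times"] * options["connect_retry_interval_ms"]
) / MS_PER_HOUR
recv_timeout_hours = options["recv_timeout_ms"] / MS_PER_HOUR
```

With these settings the connection retry window works out to 24 hours and the receive timeout to 6 hours.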
Author

nfangxu commented May 9, 2024

The relevant version information is below (SPU has been modified locally, so version 0.8.0b0 is used):

# pip3 list | grep secretflow
secretflow                   1.5.0b0
secretflow-rayfed            0.2.1a1
secretflow-serving-lib       0.3.0.dev20240320
# pip3 list | grep spu
spu                          0.8.0b0

anakinxc removed their assignment May 9, 2024
Member

ian-huu commented May 9, 2024

You need to add sf.shutdown() at the end of the script. You may then see the error AttributeError: 'NoneType' object has no attribute 'get_job_name'. This is a known issue and will be fixed as soon as possible.

In addition, to make sure the tasks have finished before shutdown, it is recommended to call sf.wait(some result) before shutdown, for example:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--party", default="", help="party id")
    args = parser.parse_args()

    # Ray init
    ray_init(args.party)
    spu_device = spu_init()
    pyus = [sf.PYU("client"), sf.PYU("server")]

    print("=========================Start")
    spu_objs = []
    for pyu in pyus:
        obj = pyu(get_data)(1).to(spu_device)
        spu_objs.append(obj)

    print("=========================Success")

    sf.wait(spu_objs)

    sf.shutdown()
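
The suggestion above can also be wrapped in a small helper so that shutdown still runs if a task raises. This is only a sketch: `run_then_shutdown` and its `submit` callback are hypothetical names, not part of SecretFlow, and the `sf` argument is assumed to expose `wait` and `shutdown` as used in the snippet above. Passing the module in as a parameter keeps the sketch runnable without a cluster.

```python
from typing import Any, Callable, Sequence


def run_then_shutdown(sf: Any, submit: Callable[[], Sequence[Any]]) -> None:
    """Submit work, block until it finishes, then always shut down.

    `sf` is assumed to be the imported secretflow module (or any object
    with compatible `wait` and `shutdown` callables); `submit` is a
    hypothetical callback returning the device objects to wait on.
    """
    try:
        objs = submit()
        # Block until the submitted tasks have actually completed.
        sf.wait(objs)
    finally:
        # Runs on success and on failure, so the driver is not left
        # waiting on background resources.
        sf.shutdown()
```

In demo.py this could be called after `ray_init` and `spu_init` as `run_then_shutdown(sf, lambda: [pyu(get_data)(1).to(spu_device) for pyu in pyus])`.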
