Ray-2.9.0

@architkulkarni architkulkarni released this 21 Dec 00:32
· 2 commits to releases/2.9.0 since this release
9be5a16

Release Highlights

  • This release contains fixes for the Ray Dashboard. Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023 
  • Ray Train now has improved support for spot node preemption, allowing it to handle node failures caused by preemption differently from application errors.
  • Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we’ve consistently received.
  • The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
  • Streaming generator is now officially a public API (#41436, #38784). It makes it easy to write streaming applications on top of Ray using the Python generator API, and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
  • We’ve added experimental support for new accelerators: Intel GPU (#38553), Intel Gaudi Accelerators (#40561), and Huawei Ascend NPU (#41256).

Ray Libraries

Ray Data

🎉 New Features:

💫 Enhancements:

  • Optimize OpState.outqueue_num_blocks (#41748)
  • Improve stall detection for StreamingOutputsBackpressurePolicy (#41637)
  • Enable read-only Datasets to be executed on new execution backend (#41466, #41597)
  • Inherit block size from downstream ops (#41019)
  • Use runtime object memory for scheduling (#41383)
  • Add retries to file writes (#41263)
  • Make range datasource streaming (#41302)
  • Test core performance metrics (#40757)
  • Allow ConcurrencyCapBackpressurePolicy._cap_multiplier to be set to 1.0 (#41222)
  • Create StatsManager to manage _StatsActor remote calls (#40913)
  • Expose max_retry_cnt parameter for BigQuery Write (#41163)
  • Add rows outputted to data metrics (#40280)
  • Add fault tolerance to remote tasks (#41084)
  • Add operator-level dropdown to ray data overview (#40981)
  • Avoid slicing too-small blocks (#40840)
  • Ray Data jobs detail table (#40756)
  • Update default shuffle block size to 1GB (#40839)
  • Log progress bar to data logs (#40814)
  • Operator level metrics (#40805)
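
The write-retry enhancement above (#41263) follows a common pattern that can be sketched with the standard library alone. This is an illustration only, not Ray Data's implementation: the helper name write_with_retries, its parameters, and the choice of OSError as the retryable error are all assumptions made for the example.

```python
import time


def write_with_retries(write_fn, max_attempts=3, base_delay_s=0.0):
    """Call write_fn(), retrying on OSError with exponential backoff.

    Hypothetical helper for illustration; not part of Ray Data.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay_s, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Usage: a fake write that fails twice, then succeeds on the third call.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient write failure")
    return "written"


result = write_with_retries(flaky_write)
```

With the default of three attempts, the fake write above succeeds on the last allowed try; a fourth consecutive failure would re-raise the OSError to the caller.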

🔨 Fixes:

  • Partial fix for Dataset.context not being sealed after creation (#41569)
  • Fix the issue that DataContext is not propagated when using streaming_split (#41473)
  • Fix Parquet partition filter bug (#40947)
  • Fix split read output blocks (#41070)
  • Fix BigQueryDatasource fault tolerance bugs (#40986)

📖 Documentation:

  • Add example of how to read and write custom file types (#41785)
  • Fix ray.data.read_databricks_tables doc (#41366)
  • Add read_json docs example for setting PyArrow block size when reading large files (#40533)
  • Add AllToAllAPI to dataset methods (#40842)

Ray Train

🎉 New Features:

  • Support reading Result from cloud storage (#40622)

💫 Enhancements:

  • Sort local Train workers by GPU ID (#40953)
  • Improve logging for Train worker scheduling information (#40536)
  • Load the latest unflattened metrics with Result.from_path (#40684)
  • Skip incrementing failure counter on preemption node died failures (#41285)
  • Update TensorFlow ReportCheckpointCallback to delete temporary directory (#41033)
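
The preemption change above (#41285) amounts to keeping infrastructure failures out of the failure budget. The sketch below shows that idea in plain Python. It is not Ray Train's code: the exception classes and the run_with_failure_budget helper are hypothetical names introduced for this example.

```python
class NodePreemptedError(Exception):
    """Hypothetical error standing in for a preempted-node failure."""


class ApplicationError(Exception):
    """Hypothetical error standing in for a failure in user code."""


def run_with_failure_budget(attempts, max_failures):
    """Run attempt callables in order until one succeeds.

    Only ApplicationError counts against max_failures; preemptions are
    retried without touching the counter.
    Returns (result, counted_failures).
    """
    failures = 0
    for attempt in attempts:
        try:
            return attempt(), failures
        except NodePreemptedError:
            continue  # infrastructure failure: retry, budget unchanged
        except ApplicationError:
            failures += 1
            if failures > max_failures:
                raise
    raise RuntimeError("ran out of attempts")


# Usage: two preemptions followed by a success, with a zero failure budget.
def preempted():
    raise NodePreemptedError("spot node reclaimed")


def succeeds():
    return "done"


result, counted = run_with_failure_budget(
    [preempted, preempted, succeeds], max_failures=0
)
```

Even with max_failures=0, the run above completes because neither preemption is counted; a single ApplicationError under the same budget would be re-raised.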

🔨 Fixes:

  • Update config dataclass repr to check against None (#40851)
  • Add a barrier in Lightning RayTrainReportCallback to ensure synchronous reporting. (#40875)
  • Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

  • Improve torch, lightning quickstarts and migration guides + fix torch restoration example (#41843)
  • Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (#41844)
  • Copy edits and adding links to docstrings (#39617)
  • Fix the missing ray module import in PyTorch Guide (#41300)
  • Fix typo in lightning_mnist_example.ipynb (#40577)
  • Fix typo in deepspeed.rst (#40320)

🏗 Architecture refactoring:

  • Remove Legacy Trainers (#41276)

Ray Tune

🎉 New Features:

  • Support reading Result from cloud storage (#40622)

💫 Enhancements:

  • Skip incrementing failure counter on preemption node died failures (#41285)

🔨 Fixes:

  • Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

  • Remove low value Tune examples and references to them  (#41348)

  • Clarify when to use MLflowLoggerCallback and setup_mlflow (#37854)

🏗 Architecture refactoring:

  • Delete legacy TuneClient/TuneServer APIs (#41469)
  • Delete legacy Searchers (#41414)
  • Delete legacy persistence utilities (air.remote_storage, etc.) (#40207)

Ray Serve

🎉 New Features:

  • Introduce logging config so that users can set different logging parameters for different applications & deployments.
  • Added a gRPC context object to gRPC deployments so that users can set a custom code and details to send back to the client.
  • Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental and a new guide can be found in the Serve docs.

💫 Enhancements:

  • Explicitly cancel the gRPC proxy task when the client drops a request, so that compute resources are not wasted.
  • Enable async __del__ in the deployment to execute custom clean up steps.
  • Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.
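
The supported Pydantic range stated above (<2.0.0 and >=2.5.0) leaves a gap from 2.0.0 up to, but not including, 2.5.0. A small standard-library check makes the boundaries explicit. The function names below are hypothetical, and the parser assumes plain numeric versions such as "2.5.3" (no pre-release suffixes).

```python
def parse_version(version):
    """Turn '2.5.3' into (2, 5, 3), padding short versions with zeros."""
    parts = [int(part) for part in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_supported_pydantic(version):
    """Return True if version is <2.0.0 or >=2.5.0, per the release notes."""
    parsed = parse_version(version)
    return parsed < (2, 0, 0) or parsed >= (2, 5, 0)


# Usage: filter a few candidate versions down to the supported ones.
candidates = ["1.10.13", "2.0.0", "2.4.2", "2.5.0"]
supported = [v for v in candidates if is_supported_pydantic(v)]
```

Here supported ends up as ["1.10.13", "2.5.0"]; the two versions inside the 2.0.0 to 2.4.x gap are rejected.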

🔨 Fixes:

  • Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
  • Changed the status returned for timed-out gRPC proxy requests from CANCELLED to DEADLINE_EXCEEDED.
  • Fixed Serve shutdown spamming log files with a message for each event loop; it now logs only once on shutdown.
  • Fixed an issue with batched requests where a dropped request killed the batch loop, so that no future requests were processed.
  • Updated replica log filenames to include only POSIX-compliant characters (removed the “#” character).
  • Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
    • Setting the environment variable RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1 restores the previous force-kill behavior, but this option is planned to be removed in the near future. If you rely on it, please file an issue on GitHub.
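
For the log filename fix above, the POSIX portable filename character set consists of the ASCII letters, digits, period, underscore, and hyphen. A minimal sketch of restricting a name to that set follows. It is not Serve's implementation: the helper name, the sample filename, and the use of "_" as the replacement character are assumptions for illustration.

```python
import re

# POSIX portable filename character set: letters, digits, '.', '_', '-'.
_NON_PORTABLE = re.compile(r"[^A-Za-z0-9._-]")


def to_portable_filename(name, replacement="_"):
    """Replace every character outside the portable set (hypothetical helper)."""
    return _NON_PORTABLE.sub(replacement, name)


# Usage: a made-up replica log name containing '#'.
portable = to_portable_filename("replica_app#deployment#abc123.log")
```

The sample name becomes replica_app_deployment_abc123.log, which contains only portable characters.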

RLlib

🎉 New Features:

  • New API stack (in progress):
    • New MultiAgentEpisode class introduced. Basis for upcoming multi-agent EnvRunner, which will replace RolloutWorker APIs. (#40263, #40799)
    • PPO runs with new SingleAgentEnvRunner (w/o Policy/RolloutWorker APIs). CI learning tests added. (#39732, #41074, #41075)
    • PPO reverted to the old API stack by default, for now, pending feature-completion of the new API stack (incl. multi-agent, RNN support, new EnvRunners, etc.). (#40706)
  • Old API stack:
    • APPO/IMPALA: Enable using 2 separate optimizers (and 2 learning rates) for the policy and the value function on the old API stack. (#40927)
    • Added on_workers_recreated callback to Algorithm, which is triggered after workers have failed and been restarted. (#40354)

💫 Enhancements:

🔨 Fixes:

  • Fixed a failure when restoring from a checkpoint created with an older wheel (where AlgorithmConfig.rl_module_spec was not yet a @property). (#41157)
  • Fixed a SampleBatch slicing crash when using tf + SEQ_LENS + zero-padding. (#40905)
  • Other fixes: #39978, #40788, #41168, #41204

📖 Documentation:

  • Updated codeblocks in RLlib. (#37271)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Streaming generator is now officially a public API (#41436, #38784). It makes it easy to write streaming applications on top of Ray using the Python generator API, and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
    • As part of the change, num_returns="dynamic" is planned for deprecation, and its return type changes from ObjectRefGenerator to DynamicObjectRefGenerator.
  • Add experimental accelerator support for new hardware.
    • Add experimental support for Intel GPU (#38553)
    • Add experimental support for Intel Gaudi Accelerators (#40561)
    • Add experimental support for Huawei Ascend NPU (#41256)
  • Add initial support for running MPI-based code on top of Ray. (#40917, #41349)
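
The streaming generator item above can be sketched as follows. This is a minimal example under stated assumptions, not a definitive usage guide: it assumes Ray is installed for the collect_with_ray helper, that ray.remote accepts num_returns="streaming", and that iterating the returned generator yields one object reference per yielded value. The function names are hypothetical. Ray is imported lazily so that the plain generator stays usable without it.

```python
def stream_squares(n):
    """Plain Python generator: yields one result at a time."""
    for i in range(n):
        yield i * i


def collect_with_ray(n):
    """Run stream_squares as a Ray task and consume results as they arrive.

    Assumes Ray is installed; the import is local so the generator above
    can be used without it.
    """
    import ray

    ray.init(ignore_reinit_error=True)
    # Assumption: "streaming" makes the task return a generator of ObjectRefs.
    remote_squares = ray.remote(num_returns="streaming")(stream_squares)
    return [ray.get(ref) for ref in remote_squares.remote(n)]


# Usage without Ray: the same generator works as ordinary Python.
local_results = list(stream_squares(4))
```

Keeping the task body as an ordinary generator means it can be exercised locally first and wrapped as a remote task afterwards.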

💫 Enhancements:

  • Optimize next/anext performance for streaming generator (#41270)
  • Make the number of connections and thread number of the object manager client tunable. (#41421)
  • Add __ray_call__ default actor method (#41534)

🔨 Fixes:

  • Fix NullPointerException caused by an empty raylet ID when getting actor info in the Java worker (#40560)
  • Fix a bug where SIGTERM sent to worker processes is ignored (#40210)
  • Fix mmap file leak. (#40370)
  • Fix the lifetime issue in Plasma server client releasing object. (#40809)
  • Upgrade grpc from 1.50.2 to 1.57.1 to include security fixes (#39090)
  • Fix the bug where two head nodes are shown from ray list nodes (#40838)
  • Fix the crash when the GCS address is not valid. (#41253)
  • Fix the issue of unexpectedly high socket usage in ray core worker processes. (#41121)
  • Make worker_process_setup_hook work with strings instead of Python functions (#41479)

Ray Clusters

💫 Enhancements:

  • Stability improvements for the vSphere cluster launcher
  • Better CLI output for cluster launcher

🔨 Fixes:

  • Fixed run_init for TPU command runner

📖 Documentation:

  • Added missing steps and simplified YAML in top-level clusters quickstart
  • Clarify that job entrypoints run on the head node by default and how to override it

Dashboard

💫 Enhancements:

  • Improvements to the Ray Data Dashboard
    • Added Ray Data-specific overview on jobs page, including a table view with Dataset-level metrics
    • Added operator-level metrics granularity to drill down on Dataset operators
    • Added additional metrics for monitoring iteration over Datasets

Docs

🎉 New Features:

  • Updated to Sphinx version 7.1.2. Previously, the docs build used Sphinx 4.3.2. Upgrading to a recent version provides a more modern user experience while fixing many long-standing issues. Let us know how you like the upgrade, or raise any other docs issues on your mind, in the #docs channel on the Ray Slack.

Thanks

Many thanks to all those who contributed to this release!

@justinvyu, @zcin, @avnishn, @jonathan-anyscale, @shrekris-anyscale, @LeonLuttenberger, @c21, @JingChen23, @liuyang-my, @ahmed-mahran, @huchen2021, @raulchen, @scottjlee, @jiwq, @z4y1b2, @jjyao, @JoshTanke, @marxav, @ArturNiederfahrenhorst, @SongGuyang, @jerome-habana, @rickyyx, @rynewang, @batuhanfaik, @can-anyscale, @allenwang28, @wingkitlee0, @angelinalg, @peytondmurray, @rueian, @KamenShah, @stephanie-wang, @bryanjuho, @sihanwang41, @ericl, @sofianhnaide, @RaffaGonzo, @xychu, @simonsays1980, @pcmoritz, @aslonnie, @WeichenXu123, @architkulkarni, @matthew29tang, @larrylian, @iycheng, @hongchaodeng, @rudeigerc, @rkooo567, @robertnishihara, @alanwguo, @emmyscode, @kevin85421, @alexeykudinkin, @michaelhly, @ijrsvt, @ArkAung, @mattip, @harborn, @sven1977, @liuxsh9, @woshiyyya, @hahahannes, @GeneDer, @vitsai, @Zandew, @evalaiyc98, @edoakes, @matthewdeng, @bveeramani