
Releases: ray-project/ray

Ray-2.22.0

14 May 23:39
a8ab7b8

Ray Libraries

Ray Data

🎉 New Features:

  • Add function to dynamically generate ray_remote_args for Map APIs (#45143); see the sketch after this list
  • Allow manually setting resource limits for training jobs (#45188)
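
A minimal sketch of the dynamic ray_remote_args feature, assuming it is exposed as a `ray_remote_args_fn` argument on `map_batches` (per #45143); the callable and resource values are illustrative:

```python
import ray

# Hypothetical callable: invoked to produce fresh ray_remote_args
# for the underlying map tasks (values illustrative).
def make_remote_args():
    return {"num_cpus": 1}

ds = ray.data.range(1000).map_batches(
    lambda batch: batch,  # identity transform, for demonstration only
    ray_remote_args_fn=make_remote_args,
)
```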

💫 Enhancements:

  • Introduce abstract interface for data autoscaling (#45002)
  • Add debugging info for SplitCoordinator (#45226)

🔨 Fixes:

  • Don’t show AllToAllOperator progress bar if the disable flag is set (#45136)
  • Don't load Arrow PyExtensionType by default (#45084)
  • Don't raise batch size error if num_gpus=0 (#45202)

Ray Train

💫 Enhancements:

  • [XGBoost][LightGBM] Update RayTrainReportCallback to only save checkpoints on rank 0 (#45083)

Ray Core

🔨 Fixes:

  • Fix the cpu percentage metrics for dashboard process (#45124)

Dashboard

💫 Enhancements:

  • Improvements to log viewer so line numbers do not get selected when copying text.
  • Improvements to the log viewer to avoid unnecessary re-rendering which causes text selection to clear.

Many thanks to all those who contributed to this release: @justinvyu, @simonsays1980, @chris-ray-zhang, @kevin85421, @angelinalg, @rynewang, @brycehuang30, @alanwguo, @jjyao, @shaikhismail, @khluu, @can-anyscale, @bveeramani, @jrosti, @WeichenXu123, @MortalHappiness, @raulchen, @scottjlee, @ruisearch42, @aslonnie, @alexeykudinkin

Ray-2.21.0

08 May 20:34
a912be8

Ray Libraries

Ray Data

🎉 New Features:

  • Add read_lance API to read Lance Dataset (#45106)
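
A minimal usage sketch of the new read_lance API; the URI is a placeholder for an existing Lance dataset:

```python
import ray

# Read a Lance dataset into a Ray Dataset; the path is a placeholder.
ds = ray.data.read_lance("s3://my-bucket/my_lance_dataset")
print(ds.schema())
```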

🔨 Fixes:

  • Retry RaySystemError application errors (#45079)

📖 Documentation:

  • Fix broken references in data documentation (#44956)

Ray Train

📖 Documentation:

  • Fix broken links in Train documentation (#44953)

Ray Tune

📖 Documentation:

  • Update Hugging Face example to add reference (#42771)

🏗 Architecture refactoring:

  • Remove deprecated ray.air.callbacks modules (#45104)

Ray Serve

💫 Enhancements:

  • Allow methods decorated with @serve.batch to pass type hints (#45004)
  • Allow configuring Serve control loop interval (#45063)

🔨 Fixes:

  • Fix bug with controller failing to recover for autoscaling deployments (#45118)
  • Fix Ctrl+C after serve run not shutting down Serve components (#45087)
  • Fix lightweight update max ongoing requests (#45006)

RLlib

🎉 New Features:

  • New MetricsLogger API now fully functional on the new API stack (working now also inside Learner classes, i.e. loss functions). (#44995, #45109)

💫 Enhancements:

  • Renamings and cleanups (toward new API stack and more consistent naming schemata): WorkerSet -> EnvRunnerGroup, DEFAULT_POLICY_ID -> DEFAULT_MODULE_ID, config.rollouts() -> config.env_runners(), etc. (#45022, #44920)
  • Changed behavior of EnvRunnerGroup.foreach_worker… methods to new defaults: mark_healthy=True (used to be False) and healthy_only=True (used to be False). (#44993)
  • Fix get_state()/from_state() methods in SingleAgent- and MultiAgentEpisodes. (#45012)

📖 Documentation:

  • Example scripts using the MetricsLogger for env rendering and recording w/ WandB: #45073, #45107

Ray Core

🔨 Fixes:

  • Fix ray.init(logging_format) argument being ignored (#45037)
  • Handle unserializable user exception (#44878)
  • Fix dashboard process event loop blocking issues (#45048, #45047)

Dashboard

🔨 Fixes:

  • Fix Nodes page sorting not working correctly.
  • Add back “actors per page” UI control in the actors page.

Many thanks to all those who contributed to this release: @rynewang, @can-anyscale, @scottsun94, @bveeramani, @ceddy4395, @GeneDer, @zcin, @JoshKarpel, @nikitavemuri, @stephanie-wang, @jackhumphries, @matthewdeng, @yash97, @simonsays1980, @peytondmurray, @evalaiyc98, @c21, @alanwguo, @shrekris-anyscale, @kevin85421, @hongchaodeng, @sven1977, @st--, @khluu

Ray-2.20.0

01 May 21:58
5708e75

Ray Libraries

Ray Data

💫 Enhancements:

  • Dedupe repeated schema during ParquetDatasource metadata prefetching (#44750)
  • Update map_groups implementation to better handle large outputs (#44862)
  • Deprecate prefetch_batches arg of iter_rows and change default value (#44982)
  • Default to not creating directories on S3 writes (#44972)
  • Make internal UDF names more descriptive (#44985)
  • Make name a required argument for AggregateFn (#44880)

📖 Documentation:

  • Add key concepts to and revise "Data Internals" page (#44751)

Ray Train

💫 Enhancements:

  • Setup XGBoost CommunicatorContext automatically (#44883)
  • Track Train Run Info with TrainStateActor (#44585)

📖 Documentation:

  • Add documentation for accelerator_type (#44882)
  • Update Ray Train example titles (#44369)

Ray Tune

💫 Enhancements:

  • Remove trial table when running Ray Train in a Jupyter notebook (#44858)
  • Clean up temporary checkpoint directories for class Trainables (ex: RLlib) (#44366)

📖 Documentation:

  • Fix minor doc format issues (#44865)
  • Remove outdated ScalingConfig references (#44918)

Ray Serve

💫 Enhancements:

  • The handle metric push interval is now configurable with the environment variable RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S (#32920); see the sketch after this list
  • Improve performance of developer API serve.get_app_handle (#44812)
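
A sketch of configuring the handle metric push interval via the environment variable named above; the value and setup flow are illustrative:

```python
import os

# Must be set before Serve starts; "1" (second) is an illustrative value.
os.environ["RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S"] = "1"

from ray import serve

serve.start()
```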

🔨 Fixes:

  • Fix memory leak in handles for autoscaling deployments (the leak happens when RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=1) (#44877)

RLlib

🎉 New Features:

  • Introduce MetricsLogger, a unified API for users of RLlib to log custom metrics and stats in all of RLlib’s components (Algorithm, EnvRunners, and Learners). Rolled out for new API stack for Algorithm (training_step) and EnvRunners (custom callbacks). Learner (custom loss functions) support in progress. #44888, #44442
  • Introduce “inference-only” (slim) mode for RLModules that run inside an EnvRunner (and thus don’t require value-functions or target networks): #44797

💫 Enhancements:

  • MultiAgentEpisodeReplayBuffer for new API stack (preparation for multi-agent support of SAC and DQN): #44450
  • AlgorithmConfig cleanup and renaming of properties and methods for better consistency/transparency: #44896

Ray Core and Ray Clusters

💫 Enhancements:

  • Report GCS internal pubsub buffer metrics and cap message size (#44749)

🔨 Fixes:

  • Fix task submission never returning when a network partition happens (#44692)
  • Fix incorrect use of the SSH port forward option. (#44973)
  • Make sure dashboard will exit if grpc server fails (#44928)
  • Make sure dashboard agent will exit if grpc server fails (#44899)

Many thanks to all those who contributed to this release: @can-anyscale, @hongchaodeng, @zcin, @marwan116, @khluu, @bewestphal, @scottjlee, @andrewsykim, @anyscalesam, @MortalHappiness, @justinvyu, @JoshKarpel, @woshiyyya, @rynewang, @Abirdcfly, @omatthew98, @sven1977, @marcelocarmona, @rueian, @mattip, @angelinalg, @aslonnie, @matthewdeng, @abizjakpro, @simonsays1980, @jjyao, @terraflops1048576, @hongpeng-guo, @stephanie-wang, @bw-matthew, @bveeramani, @ruisearch42, @kevin85421, @Tongruizhe

Ray-2.12.0

25 Apr 21:50

Ray Libraries

Ray Data

🎉 New Features:

  • Store Ray Data logs in special subdirectory (#44743)

💫 Enhancements:

  • Add local_read option to from_torch (#44752)
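
A sketch of the local_read option, assuming it is exposed as a keyword argument on from_torch (per #44752); the dataset is illustrative:

```python
import ray
from torch.utils.data import Dataset

class Squares(Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, i):
        return i * i

# local_read=True performs the read on the caller node instead of
# distributing it across remote tasks (assumption based on #44752).
ds = ray.data.from_torch(Squares(), local_read=True)
```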

🔨 Fixes:

  • Fix the config to disable progress bar (#44342)

📖 Documentation:

  • Clarify deprecated Datasource docstrings (#44790)

Ray Train

🔨 Fixes:

  • Disable gathering the full state dict in RayFSDPStrategy for lightning>2.1 (#44569)

Ray Tune

💫 Enhancements:

  • Remove spammy log for "new output engine" (#44824)
  • Enable isort (#44693)

Ray Serve

🔨 Fixes:

  • [Serve] fix getting attributes on stdout during Serve logging redirect (#44787)

RLlib

🎉 New Features:

  • Support of images and video logging in WandB (env rendering example script for the new API stack coming up). (#43356)

💫 Enhancements:

  • Better support and separation-of-concerns for model_config_dict in new API stack. (#44263)
  • Added example script to pre-train an RLModule in single-agent fashion, then bring checkpoint into multi-agent setup and continue training. (#44674)
  • More example scripts were translated from the old to the new API stack: curriculum learning, custom-gym-env, etc. (#44706, #44707, #44735, #44841)

Ray Core and Ray Clusters

🔨 Fixes:

  • Fix GetAllJobInfo is_running_tasks not returning the correct value when the driver starts Ray (#44459)

Thanks

Many thanks to all those who contributed to this release!
@can-anyscale, @hongpeng-guo, @sven1977, @zcin, @shrekris-anyscale, @liuxsh9, @jackhumphries, @GeneDer, @woshiyyya, @simonsays1980, @omatthew98, @andrewsykim, @n30111, @architkulkarni, @bveeramani, @aslonnie, @alexeykudinkin, @WeichenXu123, @rynewang, @matthewdeng, @angelinalg, @c21

Ray-2.11.0

17 Apr 23:31

Release Highlights

  • [data] Support reading Avro files with ray.data.read_avro
  • [train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.

Ray Libraries

Ray Data

🎉 New Features:

  • Support reading Avro files with ray.data.read_avro (#43663)
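
A minimal usage sketch; the path is a placeholder for one or more Avro files:

```python
import ray

# Read Avro files into a Ray Dataset; the URI is a placeholder.
ds = ray.data.read_avro("s3://my-bucket/events.avro")
ds.show(3)
```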

💫 Enhancements:

  • Pin ipywidgets==7.7.2 to enable Data progress bars in VSCode Web (#44398)
  • Change log level for ignored exceptions (#44408)

🔨 Fixes:

  • Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
  • Fix throughput time calculations for metrics (#44138)
  • Fix nested ragged numpy.ndarray (#44236)
  • Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)

📖 Documentation:

  • Update "Data Loading and Preprocessing" doc (#44165)
  • Move imports into TFPredictor in batch inference example (#44434)

Ray Train

🎉 New Features:

  • Add experimental support for AWS Trainium (Neuron) (#39130)
  • Add experimental support for Intel HPU (#43343)

💫 Enhancements:

  • Log a deprecation warning for local_dir and related environment variables (#44029)
  • Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)

🔨 Fixes:

  • Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
  • Fix maximum recursion issue when serializing exceptions (#43952)
  • Remove base config deepcopy when initializing the trainer actor (#44611)

🏗 Architecture refactoring:

  • Remove deprecated BatchPredictor (#43934)

Ray Tune

💫 Enhancements:

  • Add support for new style lightning import (#44339)
  • Log a deprecation warning for local_dir and related environment variables (#44029)

🏗 Architecture refactoring:

  • Remove scikit-optimize search algorithm (#43969)

Ray Serve

🔨 Fixes:

  • Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
  • Fix _to_object_ref memory leak (#43763)
  • Log warning to reconfigure max_ongoing_requests if max_batch_size is less than max_ongoing_requests (#43840)
  • Deployment fails to start with ModuleNotFoundError in Ray 2.10 (#44329)
    • This was fixed by reverting the original core changes on the sys.path behavior. Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
  • The batch_queue_cls parameter is removed from the @serve.batch decorator (#43935)

RLlib

🎉 New Features:

  • New API stack: DQN Rainbow is now available for single-agent (#43196, #43198, #43199)
  • PrioritizedEpisodeReplayBuffer is available for off-policy learning using the EnvRunner API (SingleAgentEnvRunner) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)

💫 Enhancements:

  • Restructured examples/ folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
  • Evaluation do-over: Deprecate enable_async_evaluation option (in favor of existing evaluation_parallel_to_training setting). (#43787)
  • Add module_for API to MultiAgentEpisode (analogous to policy_for API of the old Episode classes). (#44241)
  • All rllib_contrib old stack algorithms have been removed from rllib/algorithms (#43656)

Ray Core and Ray Clusters

🎉 New Features:

  • Added Ray check-open-ports CLI for checking potential open ports to the public (#44488)

💫 Enhancements:

  • Support nodes sharing the same spilling directory without conflicts. (#44487)
  • Create two subclasses of RayActorError to distinguish between actor died (ActorDiedError) and actor temporarily unavailable (ActorUnavailableError) cases.

🔨 Fixes:

  • Fixed the ModuleNotFoundError issue introduced in 2.10 (#44435)
  • Fixed an issue where the agent process was using too much CPU (#44348)
  • Fixed race condition in multi-threaded actor creation (#44232)
  • Fixed several streaming generator bugs (#44079, #44257, #44197)
  • Fixed an issue where user exception raised from tasks cannot be subclassed (#44379)

Dashboard

💫 Enhancements:

  • Add serve controller metrics to serve system dashboard page (#43797)
  • Add Serve Application rows to Serve top-level deployments details page (#43506)
  • [Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)

🔨 Fixes:

  • Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
  • Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)

Docs

💫 Enhancements:

  • Landing page refreshes its look and feel. (#44251)

Thanks

Many thanks to all those who contributed to this release!

@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin

Ray-2.10.0

21 Mar 19:02
09abba2

Release Highlights

The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).

  • [Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
  • [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
  • [Serve] Added a default autoscaling policy set via num_replicas="auto" (#42613).
  • [Serve] Added support for active load shedding via max_queued_requests (#42950).
  • [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should improve overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub and you can disable this behavior by setting: RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
    • max_concurrent_queries -> max_ongoing_requests
    • target_num_ongoing_requests_per_replica -> target_ongoing_requests
    • downscale_smoothing_factor -> downscaling_factor
    • upscale_smoothing_factor -> upscaling_factor
  • [Serve] WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.
  • [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
  • [Train] Added support for accelerator types via ScalingConfig(accelerator_type).
  • [Train] Revamped the XGBoostTrainer and LightGBMTrainer to no longer depend on xgboost_ray and lightgbm_ray. A new, more flexible API will be released in a future release.
  • [Train/Tune] Refactored local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR.

Ray Libraries

Ray Data

🎉 New Features:

  • Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
  • Metadata read stability improvements to avoid transient AWS errors, including retrying on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759).
  • Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637); see the sketch after this list
  • Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
  • Allow specifying application-level errors to retry for actor tasks (#42492)
  • Add num_rows_per_file parameter to file-based writes (#42694)
  • Add DataIterator.materialize (#43210)
  • Skip schema call in DataIterator.to_tf if tf.TypeSpec is provided (#42917)
  • Add option to append for Dataset.write_bigquery (#42584)
  • Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
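
A sketch of the concurrency control and num_rows_per_file additions mentioned above; the transform, values, and output path are illustrative:

```python
import ray

ds = ray.data.range(100_000)

# Cap the number of concurrently running map tasks.
ds = ds.map_batches(lambda batch: batch, concurrency=4)

# Bound the number of rows written per output file.
ds.write_parquet("/tmp/out", num_rows_per_file=25_000)
```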

💫 Enhancements:

  • Restructure stdout logging for better readability (#43360)
  • Add a more performant way to read large TFRecord datasets (#42277)
  • Modify ImageDatasource to use Image.BILINEAR as the default image resampling filter (#43484)
  • Reduce internal stack trace output by default (#43251)
  • Perform incremental writes to Parquet files (#43563)
  • Warn on excessive driver memory usage during shuffle ops (#42574)
  • Distributed reads for ray.data.from_huggingface (#42599)
  • Remove Stage class and related usages (#42685)
  • Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)

🔨 Fixes:

  • Turn off actor locality by default (#44124)
  • Normalize block types before internal multi-block operations (#43764)
  • Fix memory metrics for OutputSplitter (#43740)
  • Fix race condition issue in OpBufferQueue (#43015)
  • Fix early stop for multiple Limit operators. (#42958)
  • Fix deadlocks caused by Dataset.streaming_split for job hanging (#42601)

Ray Train

🎉 New Features:

  • Add support for accelerator types via ScalingConfig(accelerator_type) for improved worker scheduling (#43090)
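
A minimal sketch of requesting an accelerator type; "A10G" is an illustrative label:

```python
from ray.train import ScalingConfig

# Workers will be scheduled on nodes with the requested accelerator.
scaling = ScalingConfig(num_workers=4, use_gpu=True, accelerator_type="A10G")
```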

💫 Enhancements:

  • Add a backend-specific context manager for train_func for setup/teardown logic (#43209)
  • Remove DEFAULT_NCCL_SOCKET_IFNAME to simplify network configuration (#42808)
  • Colocate Trainer with rank 0 Worker to improve scheduling behavior (#43115)

🔨 Fixes:

  • Enable scheduling workers with memory resource requirements (#42999)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)
  • [Lightning] Fix resuming from checkpoint when using RayFSDPStrategy (#43594)
  • [Lightning] Fix deadlock in RayTrainReportCallback (#42751)
  • [Transformers] Fix checkpoint reporting behavior when get_latest_checkpoint returns None (#42953)

📖 Documentation:

  • Enhance docstring and user guides for train_loop_config (#43691)
  • Clarify in ray.train.report docstring that it is not a barrier (#42422)
  • Improve documentation for prepare_data_loader shuffle behavior and set_epoch (#41807)

🏗 Architecture refactoring:

  • Simplify XGBoost and LightGBM Trainer integrations. Implemented XGBoostTrainer and LightGBMTrainer as DataParallelTrainer. Removed dependency on xgboost_ray and lightgbm_ray. (#42111, #42767, #43244, #43424)
  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Split overloaded ray.train.torch.get_device into another get_devices API for multi-GPU worker setup (#42314); see the sketch after this list
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Deprecations related to SyncConfig (#42909)
  • Remove deprecated preprocessor argument from Trainers (#43146, #43234)
  • Hard-deprecate MosaicTrainer and remove SklearnTrainer (#42814)
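
A sketch of the get_device/get_devices split referenced above, as it would appear inside a Train worker loop:

```python
from ray.train.torch import get_device, get_devices

def train_func():
    device = get_device()    # the single device assigned to this worker
    devices = get_devices()  # all assigned devices, for multi-GPU workers
```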

Ray Tune

💫 Enhancements:

  • Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
  • Add support to TBXLogger for logging images (#37822)
  • Improve validation of Experiment(config) to handle RLlib AlgorithmConfig (#42816, #42116)

🔨 Fixes:

  • Fix reuse_actors error on actor cleanup for function trainables (#42951)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)

🏗 Architecture refactoring:

  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Deprecations related to SyncConfig and chdir_to_trial_dir (#42909)
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Add back NevergradSearch (#42305)
  • Clean up invalid checkpoint_dir and reporter deprecation notices (#42698)

Ray Serve

🎉 New Features:

  • Added support for active load shedding via max_queued_requests (#42950).
  • Added a default autoscaling policy set via num_replicas="auto" (#42613).
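
A sketch combining the two new features; the handler body is illustrative:

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",     # default autoscaling policy
    max_queued_requests=10,  # shed load beyond this queue depth
)
class Model:
    def __call__(self, request):
        return "ok"

app = Model.bind()
```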

🏗 API Changes:

  • Renamed the following parameters. Each of the old names will be supported for another release before removal. See the sketch after this section.
    • max_concurrent_queries to max_ongoing_requests
    • target_num_ongoing_requests_per_replica to target_ongoing_requests
    • downscale_smoothing_factor to downscaling_factor
    • upscale_smoothing_factor to upscaling_factor
  • WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.
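
A sketch of the renamed parameters in context; all values are illustrative:

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # was max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 5,
        "target_ongoing_requests": 2,  # was target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # was upscale_smoothing_factor
        "downscaling_factor": 1.0,     # was downscale_smoothing_factor
    },
)
def handler(request):
    return "ok"
```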

💫 Enhancements:

  • Add RAY_SERVE_LOG_ENCODING env to set the global logging behavior for Serve (#42781).
  • Config Serve's gRPC proxy to allow large payload (#43114).
  • Add blocking flag to serve.run() (#43227).
  • Add actor id and worker id to Serve structured logs (#43725).
  • Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should improve overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947) …

Ray-2.9.3

22 Feb 19:57
62655e1

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

  • Fix protobuf breaking change by adding a compat layer. (#43172)
  • Bump up task failure logs to warnings so that failures can be diagnosed (#43147)
  • Fix placement group leaks (#42942)

Ray Data

🔨 Fixes:

  • Skip schema call in to_tf if tf.TypeSpec is provided (#42917)
  • Skip recording memory spilled stats when get_memory_info_reply fails (#42824)

Ray Serve

🔨 Fixes:

  • Fixing DeploymentStateManager qualifying replicas as running prematurely (#43075)

Thanks

Many thanks to all those who contributed to this release!

@rynewang, @GeneDer, @alexeykudinkin, @edoakes, @c21, @rkooo567

Ray-2.9.2

06 Feb 01:23
fce7a36

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

  • Fix out of disk test on release branch (#42724)

Ray Data

🔨 Fixes:

  • Fix failing huggingface test (#42727)
  • Fix deadlocks caused by streaming_split (#42601) (#42755)
  • Fix locality config not being respected in DataConfig (#42204) (#42722)
  • Stability & accuracy improvements for Data+Train benchmark (#42027)
  • Add retry for _sample_fragment during ParquetDatasource._estimate_files_encoding_ratio() (#42759) (#42774)
  • Skip recording memory spilled stats when get_memory_info_reply fails (#42824) (#42834)

Ray Serve

🔨 Fixes:

  • Pin the fastapi & starlette version to avoid breaking proxy (#42740)
  • Fix IS_PYDANTIC_2 logic for pydantic<1.9.0 (#42704) (#42708)
  • Fix missing message body for JSON log formats (#42729) (#42874)

Thanks

Many thanks to all those who contributed to this release!

@c21, @raulchen, @can-anyscale, @edoakes, @peytondmurray, @scottjlee, @aslonnie, @architkulkarni, @GeneDer, @Zandew, @sihanwang41

Ray-2.9.1

19 Jan 00:28
cfbf98c

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

  • Add debugpy as the Ray debugger (#42311)
  • Fix task events profile events per task leak (#42248)
  • Make sure redis sync context and async context connect to the same redis instance (#42040)

Ray Data

🔨 Fixes:

  • [Data] Retry write if error during file clean up (#42326)

Ray Serve

🔨 Fixes:

  • Improve handling the websocket server disconnect scenario (#42130)
  • Fix pydantic config documentation (#42216)
  • Address issues under high network delays:
    • Enable setting queue length response deadline via environment variable (#42001)
    • Add exponential backoff for queue_len_response_deadline_s (#42041)

Ray-2.9.0

21 Dec 00:32
9be5a16

Release Highlights

  • This release contains fixes for the Ray Dashboard. Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023 
  • Ray Train has now upgraded support for spot node preemption -- allowing Ray Train to handle preemption node failures differently than application errors.
  • Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we’ve consistently received.
  • The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
  • Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
  • We’ve added experimental support for new accelerators: Intel GPU (#38553), Intel Gaudi Accelerators (#40561), and Huawei Ascend NPU (#41256).

Ray Libraries

Ray Data

💫 Enhancements:

  • Optimize OpState.outqueue_num_blocks (#41748)
  • Improve stall detection for StreamingOutputsBackpressurePolicy (#41637)
  • Enable read-only Datasets to be executed on new execution backend (#41466, #41597)
  • Inherit block size from downstream ops (#41019)
  • Use runtime object memory for scheduling (#41383)
  • Add retries to file writes (#41263)
  • Make range datasource streaming (#41302)
  • Test core performance metrics (#40757)
  • Allow ConcurrencyCapBackpressurePolicy._cap_multiplier to be set to 1.0 (#41222)
  • Create StatsManager to manage _StatsActor remote calls (#40913)
  • Expose max_retry_cnt parameter for BigQuery Write (#41163)
  • Add rows outputted to data metrics (#40280)
  • Add fault tolerance to remote tasks (#41084)
  • Add operator-level dropdown to ray data overview (#40981)
  • Avoid slicing too-small blocks (#40840)
  • Ray Data jobs detail table (#40756)
  • Update default shuffle block size to 1GB (#40839)
  • Log progress bar to data logs (#40814)
  • Operator level metrics (#40805)

🔨 Fixes:

  • Partial fix for Dataset.context not being sealed after creation (#41569)
  • Fix the issue that DataContext is not propagated when using streaming_split (#41473)
  • Fix Parquet partition filter bug (#40947)
  • Fix split read output blocks (#41070)
  • Fix BigQueryDatasource fault tolerance bugs (#40986)

📖 Documentation:

  • Add example of how to read and write custom file types (#41785)
  • Fix ray.data.read_databricks_tables doc (#41366)
  • Add read_json docs example for setting PyArrow block size when reading large files (#40533)
  • Add AllToAllAPI to dataset methods (#40842)

Ray Train

🎉 New Features:

  • Support reading Result from cloud storage (#40622)
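
A minimal sketch of reading a Result back from storage; the URI is a placeholder:

```python
from ray.train import Result

# Rehydrate a Result from a local or cloud storage path.
result = Result.from_path("s3://my-bucket/experiment/trial_0")
print(result.metrics)
```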

💫 Enhancements:

  • Sort local Train workers by GPU ID (#40953)
  • Improve logging for Train worker scheduling information (#40536)
  • Load the latest unflattened metrics with Result.from_path (#40684)
  • Skip incrementing failure counter on preemption node died failures (#41285)
  • Update TensorFlow ReportCheckpointCallback to delete temporary directory (#41033)

🔨 Fixes:

  • Update config dataclass repr to check against None (#40851)
  • Add a barrier in Lightning RayTrainReportCallback to ensure synchronous reporting. (#40875)
  • Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

  • Improve torch, lightning quickstarts and migration guides + fix torch restoration example (#41843)
  • Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (#41844)
  • Copy edits and adding links to docstrings (#39617)
  • Fix the missing ray module import in PyTorch Guide (#41300)
  • Fix typo in lightning_mnist_example.ipynb (#40577)
  • Fix typo in deepspeed.rst (#40320)

🏗 Architecture refactoring:

  • Remove Legacy Trainers (#41276)

Ray Tune

🎉 New Features:

  • Support reading Result from cloud storage (#40622)

💫 Enhancements:

  • Skip incrementing failure counter on preemption node died failures (#41285)

🔨 Fixes:

  • Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

  • Remove low value Tune examples and references to them  (#41348)
  • Clarify when to use MLflowLoggerCallback and setup_mlflow (#37854)

🏗 Architecture refactoring:

  • Delete legacy TuneClient/TuneServer APIs (#41469)
  • Delete legacy Searchers (#41414)
  • Delete legacy persistence utilities (air.remote_storage, etc.) (#40207)

Ray Serve

🎉 New Features:

  • Introduce logging config so that users can set different logging parameters for different applications & deployments. See the sketch after this list.
  • Added gRPC context object into gRPC deployments for user to set custom code and details back to the client.
  • Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental and a new guide can be found in the Serve docs.
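
A sketch of the per-deployment logging config referenced above; the keys and values shown are illustrative assumptions, not the full schema (consult the Serve logging guide):

```python
from ray import serve

# logging_config here is an assumed minimal form; other fields such as
# encoding and log directory are described in the Serve docs.
@serve.deployment(logging_config={"log_level": "DEBUG"})
def f(request):
    return "ok"
```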

💫 Enhancements:

  • Explicitly handle gRPC proxy task cancellation when the client drops a request, to avoid wasting compute resources.
  • Enable async __del__ in the deployment to execute custom clean up steps.
  • Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.

🔨 Fixes:

  • Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
  • Fixed gRPC proxy timeout request status from CANCELLED to DEADLINE_EXCEEDED.
  • Fixed previously Serve shutdown spamming log files with logs for each event loop to only log once on shutdown.
  • Fixed an issue during batch requests where, if a request was dropped, the batch loop would be killed and would not process any future requests.
  • Updated replica log filenames to only include POSIX-compliant characters (removed the “#” character).
  • Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
    • This behavior can be toggled using the environment variable RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1, but this is planned to be removed in the near future. If you rely on this behavior, please file an issue on GitHub.

RLlib

🎉 New Features:

  • New API stack (in progress):
    • New MultiAgentEpisode class introduced. Basis for upcoming multi-agent EnvRunner, which will replace RolloutWorker APIs. (#40263, #40799)
    • PPO runs with new SingleAgentEnvRunner (w/o Policy/RolloutWorker APIs). CI learning tests added. (#39732, #41074, #41075)
    • PPO reverted to the old API stack by default, for now, pending feature-completion of the new API stack (incl. multi-agent, RNN support, new EnvRunners, etc.). (#40706)
  • Old API stack:
    • APPO/IMPALA: Enable using 2 separate optimizers for policy and value function (and 2 learning rates) on the old API stack. (#40927)
    • Added on_workers_recreated callback to Algorithm, which is triggered after workers have failed and been restarted. (#40354)

🔨 Fixes:

  • Restoring from a checkpoint created with an older wheel (where AlgorithmConfig.rl_module_spec was NOT a @property yet) broke when trying to load from this checkpoint. (#41157)
  • SampleBatch slicing crashes when using tf + SEQ_LENS + zero-padding. (#40905)
  • Other fixes: #39978, #40788, #41168, #41204

📖 Documentation:

  • Updated codeblocks in RLlib. (#37271)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details, and the sketch after this list.
    • As part of the change, num_returns="dynamic" is planned to be deprecated, and its return type has changed from ObjectRefGenerator to DynamicObjectRefGenerator
  • Add experimental accelerator support for new hardware.
    • Add experimental support for Intel GPU (#38553)
    • Add experimental support for Intel Gaudi Accelerators (#40561)
    • Add experimental support for Huawei Ascend NPU (#41256)
  • Add initial support to run MPI-based code on top of Ray. (#40917, #41349)
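
A minimal sketch of the streaming generator API referenced in the first item of this list:

```python
import ray

@ray.remote
def stream(n):
    for i in range(n):
        yield i  # each yielded value becomes an object ref as it is produced

gen = stream.remote(3)
for ref in gen:
    print(ray.get(ref))  # prints 0, 1, 2 as results stream in
```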

💫 Enhancements:

  • Optimize next/anext performance for streaming generator (#41270)
  • Make the number of connections and threads of the object manager client tunable. (#41421)
  • Add __ray_call__ default actor method (#41534)
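
A minimal sketch of the __ray_call__ default actor method:

```python
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.value = 7

c = Counter.remote()
# Run an arbitrary function on the actor without defining a method for it.
print(ray.get(c.__ray_call__.remote(lambda self: self.value)))  # prints 7
```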

🔨 Fixes:

  • Fix NullPointerException caused by an empty raylet ID when getting actor info in the Java worker (#40560)
  • Fix a bug where SIGTERM was ignored by worker processes (#40210)
  • Fix mmap file leak. (#40370)
  • Fix the lifetime issue in Plasma server client releasing object. (...