Ray-2.5.0
The Ray 2.5 release focuses on a number of enhancements and improvements across the Ray ecosystem, including:
- Training LLMs with Ray Train: New support for checkpointing distributed models, and PyTorch Lightning FSDP to enable training large models on Ray Train’s LightningTrainer
- LLM applications with Ray Serve & Core: New support for streaming responses and model multiplexing
- Improvements to Ray Data: In 2.5, strict mode is enabled by default. This means that schemas are required for all Datasets, and standalone Python objects are no longer supported. Also, the default batch format is fixed to NumPy, giving better performance for batch inference.
- RLlib enhancements: New support for multi-GPU training, along with ray-project/rllib-contrib to host community-contributed algorithms
- Core enhancements: Lightweight resource broadcasting is now enabled to improve reliability and scalability, along with many enhancements to Core reliability, logging, the scheduler, and worker processes.
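To make the Ray Data change above concrete, here is a minimal plain-Python sketch of the idea behind strict mode. It does not import Ray: every record is a mapping of named columns, and a batch is a mapping from column name to a column of values. The helper names `to_record` and `to_column_batch` are hypothetical, the `"item"` column name is an assumption chosen for illustration, and plain lists stand in for the NumPy arrays that the default batch format would use.

```python
from typing import Any, Dict, List


def to_record(value: Any) -> Dict[str, Any]:
    """Wrap a bare Python object in a single-column record.

    Values that are already dicts are returned unchanged.
    """
    if isinstance(value, dict):
        return value
    return {"item": value}  # "item" is an illustrative column name (assumption)


def to_column_batch(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of row records into a column-oriented batch."""
    if not records:
        return {}
    columns = list(records[0].keys())
    for record in records:
        # Having a schema means every record exposes the same columns.
        if list(record.keys()) != columns:
            raise ValueError(f"record {record!r} does not match schema {columns}")
    return {name: [record[name] for record in records] for name in columns}


records = [to_record(v) for v in [1, 2, 3]]
batch = to_column_batch(records)
print(batch)
```

The point of the sketch is only the shape of the data: once every row has named columns, a batch can be handed to user code as one array per column.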
Ray Libraries
Ray AIR
💫Enhancements:
- Experiment restore stress tests (#33706)
- Context-aware output engine
- Add parameter columns to status table (#35388)
- Context-aware output engine: Add docs, experimental feature docs, prepare default on (#35129)
- Fix trial status at end (more info + cut off) (#35128)
- Improve leaked mentions of Tune concepts (#35003)
- Improve passed time display (#34951)
- Use flat metrics in results report, use Trainable._progress_metrics (#35035)
- Print experiment information at experiment start (#34952)
- Print single trial config + results as table (#34788)
- Print out worker ip for distributed train workers. (#33807)
- Minor fix to print configuration on start. (#34575)
- Check `air_verbosity` against None. (#33871)
- Better wording for empty config. (#33811)
- Flatten config and metrics before passing to mlflow (#35074)
- Remote_storage: Prefer fsspec filesystems over native pyarrow (#34663)
- Use filesystem wrapper to exclude files from upload (#34102)
- GCE test variants for air_benchmark and air_examples (#34466)
- New storage path configuration
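The MLflow entry in the list above (#35074) refers to flattening nested config and metric dictionaries before logging them. A minimal sketch of such a flattening step follows. It is not the code from that change: the function name `flatten_dict` and the `"/"` delimiter are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_dict(
    nested: Dict[str, Any], delimiter: str = "/", prefix: str = ""
) -> Dict[str, Any]:
    """Flatten a nested dict into a single level with joined keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{prefix}{delimiter}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dicts, carrying the joined key as the new prefix.
            flat.update(flatten_dict(value, delimiter, full_key))
        else:
            flat[full_key] = value
    return flat


config = {"lr": 0.01, "model": {"layers": 2, "opt": {"name": "adam"}}}
flat_config = flatten_dict(config)
print(flat_config)
```

Flat keys matter here because a tracking backend that stores parameters as key/value pairs has no natural place for a nested dictionary.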
🔨 Fixes:
- Store unflattened metrics in _TrackedCheckpoint (#35658) (#35706)
- Fix `test_tune_torch_get_device_gpu` race condition (#35004)
- Deflake test_e2e_train_flow.py (#34308)
- Pin deepspeed version for now to unblock ci. (#34406)
- Fix AIR benchmark configuration link failure. (#34597)
- Fix unused config building function in lightning MNIST example.
📖Documentation:
- Change doc occurrences of ray.data.Dataset to ray.data.Datastream (#34520)
- DreamBooth example: Fix code for batch size > 1 (#34398)
- Synced tabs in AIR getting started (#35170)
- New Ray AIR link for try it out (#34924)
- Correctly Render the Enumerate Numbers in `convert_torch_code_to_ray_air` (#35224)
Ray Data Processing
🎉 New Features:
- Implement Strict Mode and enable it by default.
- Add column API to Dataset (#35241)
- Configure progress bars via DataContext (#34638)
- Support using concurrent actors for ActorPool (#34253)
- Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217)
💫Enhancements:
- Improve map batches error message for strict mode migration (#35368)
- Improve docstring and warning message for from_huggingface (#35206)
- Improve notebook widget display (#34359)
- Implement some operator fusion logic for the new backend (#35178 #34847)
- Use wait based prefetcher by default (#34871)
- Implement limit physical operator (#34705 #34844)
- Require compute spec to be explicitly spelled out (#34610)
- Log a warning if the batch size is misconfigured in a way that would grossly reduce parallelism for actor pool. (#34594)
- Add alias parameters to the aggregate function, and add quantile fn (#34358)
- Improve repr for Arrow Table and pandas types (#34286 #34502)
- Defer first block computation when reading a Datasource with schema information in metadata (#34251)
- Improve handling of KeyboardInterrupt (#34441)
- Validate aggregation key in Aggregate LogicalOperator (#34292)
- Add usage tag for which block formats are used (#34384)
- Validate sort key in Sort LogicalOperator (#34282)
- Combine_chunks before chunking pyarrow.Table block into batches (#34352)
- Use read stage name for naming Data-read tasks on Ray Dashboard (#34341)
- Update path expansion warning (#34221)
- Improve state initialization for ActorPoolMapOperator (#34037)
🔨 Fixes:
- Fix ipython representation (#35414)
- Fix bugs in handling of nested ndarrays (and other complex object types) (#35359)
- Capture the context when the dataset is first created (#35239)
- Cooperatively exit producer threads for iter_batches (#34819)
- Autoshutdown executor threads when deleted (#34811)
- Fix backpressure when reading directly from input datasource (#34809)
- Fix backpressure handling of queued actor pool tasks (#34254)
- Fix row count after applying filter (#34372)
- Remove unnecessary setting of global logging level to INFO when using Ray Data (#34347)
- Make sure the tf and tensor iteration work in dataset pipeline (#34248)
- Fix '_unwrap_protocol' for Windows systems (#31296)
📖Documentation:
Ray Train
🎉 New Features:
- Experimental support for distributed checkpointing (#34709)
💫Enhancements:
- LightningTrainer: Enable prog bar (#35350)
- LightningTrainer enable checkpoint full dict with FSDP strategy (#34967)
- Support FSDP Strategy for LightningTrainer (#34148)
🔨 Fixes:
- Fix HuggingFace -> Transformers wrapping logic (#35276, #35284)
- LightningTrainer always resumes from the latest AIR checkpoint during restoration. (#35617) (#35791)
- Fix lightning trainer devices setting (#34419)
- TorchCheckpoint: Specifying pickle_protocol in `torch.save()` (#35615) (#35790)
📖Documentation:
- Improve visibility of Trainer restore and stateful callback restoration (#34350)
- Fix rendering of diff code-blocks (#34355)
- LightningTrainer Dolly V2 FSDP Fine-tuning Example (#34990)
- Update LightningTrainer MNIST example. (#34867)
- LightningTrainer Advanced Example (#34082, #34429)
🏗 Architecture refactoring:
- Restructure `ray.train` HuggingFace modules (#35270) (#35488)
- Rename _base_dataset to _base_datastream (#34423)
Ray Tune
🎉 New Features:
💫Enhancements:
- Make `Tuner.restore(trainable=...)` a required argument (#34982)
- Enable `tune.ExperimentAnalysis` to pull experiment checkpoint files from the cloud if needed (#34461)
- Add support for nested hyperparams in PB2 (#31502)
- Release test for durable multifile checkpoints (#34860)
- GCE variants for remaining Tune tests (#34572)
- Add tune frequent pausing release test. (#34501)
- Add PyArrow to ray[tune] dependencies (#34397)
- Fix new execution backend for BOHB (#34828)
🔨 Fixes:
- Set config on trial restore (#35000)
- Fix `test_tune_torch_get_device_gpu` race condition (#35004)
- Fix a typo in `tune/execution/checkpoint_manager` state serialization. (#34368)
- Fix tune_scalability_network_overhead by adding `--smoke-test`. (#34167)
- Fix lightning_gpu_tune_.* release test (#35193)
📖Documentation:
🏗 Architecture refactoring:
- Use Ray-provided `tabulate` package (#34789)
Ray Serve
🎉 New Features:
- Add support for JSON logging format. (#35118)
- Add experimental support for model multiplexing. (#35399, #35326)
- Added experimental support for HTTP StreamingResponses. (#35720)
- Add support for application builders & arguments (#34584)
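The model multiplexing entry above concerns serving many models from a shared pool of replicas. One common way to do this is to keep a small least-recently-used cache of loaded models per replica. The following is a plain-Python sketch of that idea only. It does not use Ray Serve, and the class name `ModelCache`, its parameters, and the fake loader are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Keep at most `max_models` loaded, evicting the least recently used."""

    def __init__(self, load_model: Callable[[str], Any], max_models: int = 2) -> None:
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self._load_model = load_model
        self._max_models = max_models
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            # Cache full: drop the least recently used model first.
            self._models.popitem(last=False)
        model = self._load_model(model_id)
        self._models[model_id] = model
        return model

    def loaded_ids(self) -> List[str]:
        return list(self._models)


loads: List[str] = []


def fake_loader(model_id: str) -> str:
    """Stand-in for an expensive model load; records each call."""
    loads.append(model_id)
    return f"model:{model_id}"


cache = ModelCache(fake_loader, max_models=2)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
print(loads)
print(cache.loaded_ids())
```

In this run, the second request for `"a"` is served from the cache, while `"b"` has to be reloaded after `"c"` evicted it.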
💫Enhancements:
- Add more bucket sizes for histogram metrics. (#35242)
- Add route information into the custom metrics. (#35246)
- Add HTTPProxy details to Serve Dashboard UI (#35159)
- Add status_code to http qps & latency (#35134)
- Stream Serve logs across different drivers (#35070)
- Add health checking for http proxy actors (#34944)
- Better surfacing of errors in serve status (#34773)
- Enable TLS on gRPCIngress if RAY_USE_TLS is on (#34403)
- Wait until replicas have finished recovering (with timeout) to broadcast `LongPoll` updates (#34675)
- Replace `ClassNode` and `FunctionNode` with `Application` in top-level Serve APIs (#34627)
🔨 Fixes:
- Set `app_msg` to empty string by default (#35646)
- Fix dead replica counts in the stats. (#34761)
- Add default app name (#34260)
- gRPC Deployment schema check & minor improvements (#34210)
📖Documentation:
- Clean up API reference and various docstrings (#34711)
- Clean up `RayServeHandle` and `RayServeSyncHandle` docstrings & typing (#34714)
RLlib
🎉 New Features:
- Migrating approximately 25 of the 30 algorithms from RLlib into rllib_contrib. See the corresponding REP for details. This release covers A3C and MAML.
- APPO, IMPALA, and PPO have all moved to the new Learner and RLModule stack.
- The RLModule now supports checkpointing. (#34717 #34760)
💫Enhancements:
- Introduce experimental larger than GPU train batch size feature for torch (#34189)
- Change occurrences of "_observation_space_in_preferred_format" to "_obs_space_in_preferred_format" (#33907)
- Add a flag to allow disabling initialize_loss_from_dummy_batch logic. (#34208)
- Remove check specs from default Model forward code path to improve performance (#34877)
- Remove some specs from encoders to smoothen dev experience (#34911)
🔨 Fixes:
- Fix MultiCallbacks class: To be used only with utility function that returns a class to use in the config. (#33863)
- Fix backward compatibility test for RL Modules (#33857)
- Don't serialize config in Policy states (unless needed for msgpack-type checkpoints). (#33865)
- DM control suite wrapper fix: dtype of obs needs to be pinned to float32. (#33876)
- In the Json_writer, convert all non-string keys to strings (#33896)
- Fixed a bug with kl divergence calculation of torch.Dirichlet distribution within RLlib (#34209)
- Change broken link in parameter_noise.py (#34231)
- Fixed bug in restoring a gpu trained algorithm (#35024)
- Fix IMPALA/APPO when using multi GPU setup and Multi-Agent Env (#35120)
📖Documentation:
- Add examples and docs for Catalog (#33898)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Support both sync and async actor generator interface. (#35584 #35708 #35324 #35656 #35803 #35794 #35707)
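The entry above distinguishes two generator forms. As a reminder of how they differ in plain Python (no Ray involved), the sketch below defines one synchronous and one asynchronous generator method on an ordinary class. The class name `TokenStreamer` and the helper `collect` are hypothetical, and how Ray exposes such methods on actors is not shown here.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class TokenStreamer:
    """Ordinary class with a sync and an async generator method."""

    def stream_sync(self, n: int) -> Iterator[int]:
        # Sync generator: the caller pulls items with a regular for loop.
        for i in range(n):
            yield i

    async def stream_async(self, n: int) -> AsyncIterator[int]:
        # Async generator: yields control to the event loop between items.
        for i in range(n):
            await asyncio.sleep(0)
            yield i


async def collect(agen: AsyncIterator[int]) -> List[int]:
    """Drain an async generator into a list using `async for`."""
    return [item async for item in agen]


streamer = TokenStreamer()
sync_items = list(streamer.stream_sync(3))
async_items = asyncio.run(collect(streamer.stream_async(3)))
print(sync_items, async_items)
```

Both forms produce the same items. The difference is that the async version can interleave with other coroutines while it waits.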
💫Enhancements:
- [Scheduler] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (#34224)
- [Data] Use wait based prefetcher by default (#34871)
- [Reliability] During GCS restarts, grpc based resource broadcaster should only add ALIVE nodes during initialization (#35349)
- [Reliability] Guarantee the ordering of put ActorTaskSpecTable and ActorTable (#35683) (#35718)
- [Reliability] Graceful handling of returning bundles when node is removed (#34726) (#35542)
- [Reliability] Task backend - marking tasks failed on worker death (#33818)
- [Reliability] Task backend - Add worker dead info to failed tasks when job exits. (#34166)
- [Logging] Make ray.get(timeout=0) throw a timeout error (#35126)
- [Logging] Provide good error message if the fractional resource precision is beyond 0.0001 (#34590)
- [Logging] Add debug logs to show UpdateResourceUsage rpc source (#35062)
- [Logging] Add actor_id as an attribute of RayActorError when the actor constructor fails (#34958)
- [Logging] Worker startup hook (#34738)
- [Worker] Partially addresses ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#33976)
- [Worker] Change worker niceness in job submission environment (#34727)
- Shorten the membership checking time to 5 seconds. (#34769)
- [Syncer] Remove spammy logs. (#34654)
- [Syncer] Delete disconnected node view in ray syncer when connection is broken. (#35312)
- [Syncer] Turn on ray syncer again. (#35116)
- [Syncer] Start ray syncer reconnection after a delay (#35115)
- Serialize requests in the redis store client. (#35123)
- Reduce self alive check from 60s to 5s. (#34992)
- Add object owner and copy metrics to node stats (#35119)
- Start the synchronization connection after receiving all nodes info. (#34645)
- Improve the workflow finding Redis leader. (#34108)
- Make execute_after accept chrono (#35099)
- Lazy import autoscaler + don't import opentelemetry unless setup hook (#33964)
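The `ray.get(timeout=0)` entry in the list above describes a non-blocking "fail immediately if not ready" behavior. The standard library's `concurrent.futures.Future` treats a zero timeout the same way, so it can serve as a rough analogy. The sketch below uses only the standard library, not Ray, and the helper name `get_nowait` is hypothetical.

```python
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Any


def get_nowait(future: Future) -> Any:
    """Return the result if it is ready, otherwise raise a timeout error."""
    return future.result(timeout=0)


pending: Future = Future()  # never completed, so no result is available
try:
    get_nowait(pending)
    timed_out = False
except FutureTimeoutError:
    timed_out = True

done: Future = Future()
done.set_result(42)  # a completed future returns immediately
value = get_nowait(done)
print(timed_out, value)
```

Callers that previously relied on a zero timeout to mean "wait indefinitely" would need to pass no timeout instead.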
🔨 Fixes:
- [pubsub] Handle failures when publish fails. (#33115)
- Convert gcs port read from env variable from str to int (#34482)
- Fix download_wheels.sh wheel urls (#34616)
- Fix ray start command output (#34081)
- Fetch_local once for each object ref (#34884)
- Combine_chunks before chunking pyarrow.Table block into batches (#34352)
- Replace deprecated usage of get_runtime_context().node_id (#34874)
- Fix std::move without std namespace (#34149)
- Fix the recursion error when an async actor has lots of deserialization. (#35494) (#35532)
- Fix async actor shutdown issue when exit_actor is used (#32407)
- [Event] Fix incorrect event timestamp (#34402)
- [Metrics] Fix shared memory not being displayed properly (#34301)
- Fix GCS FD usage increase regression. (#35624) (#35738)
- Fix raylet memory leak in the wrong setup. (#35647) (#35673)
- Retry failed redis request (#35249) (#35481)
- Add more messages when accessing a dead actor. (#34697)
- Fix the placement group stress test regression. (#34192)
- Mark Raylet unhealthy if GCS can't recognize it. (#34087)
- Remove multiple core workers in one process 2/n (#34942)
- Remove python 3.6 support (#34373 #34416)
📖Documentation:
- Make doc code snippet testable (#35274 #35057)
- Revamp ray core api reference [1/n] (#34428)
- Add Ray core fault tolerance guide for GCS and node (#33446)
- Ray Debugging Doc Part 1 (OOM) (#34309)
- Rewrite the placement group documentation (#33518)
Ray Clusters
💫Enhancements:
- [Docker] [runtime env] Bump boto3 version from 1.4.8 to 1.26.82, add pyOpenSSL and cryptography (#33273)
- [Jobs] Fix race condition in supervisor actor creation and add timeout for pending jobs (#34223)
- [Release test] [Cluster launcher] Add gcp minimal and full cluster launcher release test (#34878)
- [Release test] [Cluster launcher] Add release test for aws `example-full.yaml` (#34487)
📖Documentation:
- [runtime env] Clarify conditions for local `pip` and `conda` requirements files (#34071)
- [KubeRay] Provide GKE instructions in KubeRay example (#33339)
- [KubeRay] Update KubeRay doc for release v0.5.0 (#34178)
Dashboard
💫Enhancements:
- Feature flag task logs recording (#34056)
- Fix log proxy not loading non text/plain files. (#33870)
- [no_early_kickoff] Make dashboard address connectable from remote nodes when not set to 127.0.0.1 (localhost) (#35027)
- [state][job] Supporting job listing(getting) and logs from state API (#35124)
- [state][ci] Fix stress_test_state_api_scale (#35332)
- [state][dashboard][log] Fix subdirectory log getting (#35283)
- [state] Push down filtering to GCS for listing/getting task from state api (#35109) (#34433)
- [state] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] (#28188)
- [state] Use `--err` flag to query stderr logs from worker/actors instead of `--suffix=err` (#34300)
- [state][no_early_kickoff] Make state api return results that are strongly typed (#34297)
- [state] Efficient get/list actors with filters on some high-cardinality fields (#34348)
- [state] Fix list nodes test in test_state_api.py (#34349)
- [state] Add head node flag `is_head_node` to state API and GcsNodeInfo (#34299)
- Make actor tasks' name default to <actor_repr>.<task_name> (#35371)
- Task backend GC policy - worker update [1/3] (#34896)
- [state] Support task logs from state API (#35101)
Known Issues
- A bug in the Autoscaler can cause undefined behaviour when clusters attempt to scale up aggressively. This is fixed in subsequent releases, as well as post-release on the 2.5.0 branch (#36482).
Many thanks to all those who contributed to this release!
@vitsai, @XiaodongLv, @justinvyu, @Dan-Yeh, @dependabot[bot], @alanwguo, @grimreaper, @yiwei00000, @pomcho555, @ArturNiederfahrenhorst, @maxpumperla, @jjyao, @ijrsvt, @sven1977, @Yard1, @pcmoritz, @c21, @architkulkarni, @jbedorf, @amogkam, @ericl, @jiafuzha, @clarng, @shrekris-anyscale, @matthewdeng, @gjoliver, @jcoffi, @edoakes, @ethanabrooks, @iycheng, @Rohan138, @angelinalg, @Linniem, @aslonnie, @zcin, @wuisawesome, @Catch-Bull, @woshiyyya, @avnishn, @jjyyxx, @jianoaix, @bveeramani, @sihanwang41, @scottjlee, @YQ-Wang, @mattip, @can-anyscale, @xwjiang2010, @fedassembly, @joncarter1, @robin-anyscale, @rkooo567, @DACUS1995, @simran-2797, @ProjectsByJackHe, @zen-xu, @ashahab, @larrylian, @kouroshHakha, @raulchen, @sofianhnaide, @scv119, @nathan-az, @kevin85421, @rickyyx, @Sahar-E, @krfricke, @chaowanggg, @peytondmurray, @cadedaniel