[Auto Parallel]: Speed up intra-op plan generation by 44% #5446
📌 Checklist before creating the PR
[doc/gemini/tensor/...]: A concise description
🚨 Issue number
resolves #5436
📝 What does this PR do?
As stated in issue #5436, creating a `DimSpec` object is comparatively costly because a dictionary is built every time and a deep copy of two strings is made. Because so many `DimSpec` objects are created, several seconds in total are spent constructing them when generating intra-op plans with ColossalAuto.
This pull request addresses this inefficiency by converting the dict to a class attribute of the `DimSpec` class, so that it is shared among all instances. The dict is initialized lazily the first time the property `difference_dict` is used. This is possible because the contents of `difference_dict` are not modified by other portions of the code.
Additionally, the methods `build_difference_2d_dict` and `convert_str_to_shard_list` are made class/static methods and private because they don't need access to instance properties.
Effect
These changes reduce the end-to-end wall-clock time to build the strategy constructor by 44% when running the script `examples/tutorial/auto_parallel/auto_parallel_with_resnet.py`. On a laptop with an Intel Core i7 7700HQ, computing a solution took 56.79 s on average before the change and about 31.93 s after it.
💥 Checklist before requesting a review
⭐️ Do you enjoy contributing to Colossal-AI?
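The pattern described under "What does this PR do?" (a lookup table stored once on the class, built lazily on first access, with helpers that need no instance state) can be sketched as follows. This is a minimal illustration, not the Colossal-AI code: the class name `DimSpecSketch`, the `build_calls` counter, the label set, and the placeholder cost are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


class DimSpecSketch:
    """Hypothetical stand-in for DimSpec; NOT the Colossal-AI implementation."""

    # Class attribute: one table shared by every instance, filled on first use.
    _difference_dict: Optional[Dict[Tuple[str, str], int]] = None
    # Instrumentation for this sketch only: counts how often the table is built.
    build_calls = 0

    def __init__(self, shard_list: List[int]) -> None:
        # Construction stays cheap: no per-instance dict is created here.
        self.shard_list = shard_list

    def __repr__(self) -> str:
        if not self.shard_list:
            return "R"
        return "S" + "".join(str(dim) for dim in self.shard_list)

    @property
    def difference_dict(self) -> Dict[Tuple[str, str], int]:
        cls = type(self)
        if cls._difference_dict is None:
            # Lazy initialization: runs once, on the first access by any instance.
            cls._difference_dict = cls._build_difference_2d_dict()
        return cls._difference_dict

    @classmethod
    def _build_difference_2d_dict(cls) -> Dict[Tuple[str, str], int]:
        cls.build_calls += 1
        labels = ["R", "S0", "S1", "S01"]  # assumed label set for a 2D mesh
        table: Dict[Tuple[str, str], int] = {}
        for source in labels:
            for target in labels:
                source_dims = set(cls._convert_str_to_shard_list(source))
                target_dims = set(cls._convert_str_to_shard_list(target))
                # Placeholder cost (size of the symmetric difference),
                # not the cost model used by the real library.
                table[(source, target)] = len(source_dims ^ target_dims)
        return table

    @staticmethod
    def _convert_str_to_shard_list(label: str) -> List[int]:
        # "R" means replicated; "S01" means sharded along mesh dims 0 and 1.
        if label == "R":
            return []
        return [int(ch) for ch in label[1:]]
```

Creating many `DimSpecSketch` objects and reading `difference_dict` from any of them builds the table only once. Sharing it is safe only under the condition the description states: no caller mutates the dict after it is built.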