Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Introduce alternative list sorting for scalar and order comparable elements #2609

Open
wants to merge 9 commits into
base: nightly
Choose a base branch
from

Conversation

mzaks
Copy link
Contributor

@mzaks mzaks commented May 10, 2024

This is a big PR which addresses #2390 and makes following PRs:
#2517
#2378
#2409

obsolete.

It replaces the recently introduced builtin/sort module. I suppose that the replacement should not be problematic as the legacy sort module was introduced due to discussion in #2390 and is not used anywhere in the code base as far as I can tell.

@mzaks mzaks requested a review from a team as a code owner May 10, 2024 12:31
@mzaks mzaks changed the title Introduce alternative list sorting for scalar and order comparable elements [stdlib] Introduce alternative list sorting for scalar and order comparable elements May 10, 2024
@gabrieldemarmiesse
Copy link
Contributor

That's a pretty cool PR! To make the review easier (1000 lines of code to read in one go is a lot), do you think we could merge

first or that would be incompatible? If we merge those three and then rebase this branch on top of nightly afterwards, that might make it more managable. Are there any conflict in the design philosophy between those and yours?

@JoeLoser
Copy link
Collaborator

That's a pretty cool PR! To make the review easier (1000 lines of code to read in one go is a lot), do you think we could merge

first or that would be incompatible? If we merge those three and then rebase this branch on top of nightly afterwards, that might make it more managable. Are there any conflict in the design philosophy between those and yours?

All of the dependent PRs landed in today's nightly. Do you mind rebasing like Gabriel suggested?

@mzaks mzaks force-pushed the feature/alternative-sort-module branch from 817160f to 630e5f8 Compare May 11, 2024 05:08
@mzaks
Copy link
Contributor Author

mzaks commented May 11, 2024

The branch is rebased, still quite a large PR :)

What I can offer is to remove Radix sort from this PR and introduce it in the next one.

@soraros
Copy link
Contributor

soraros commented May 11, 2024

@mzaks I'm in favour of removing Radix from this PR. I also think the Radix needs a lot more comments/design doc/implementation doc/reference, and Dict's inline comments is a good example for that.

@soraros
Copy link
Contributor

soraros commented May 11, 2024

Also, can we make functions other than sort private (prefix by _)? I think they should be treated as experimental/unstable. We could maybe also keep the old implementations if only for evaluation purposes, and only remove them when we build up the benchmark infrastructure.

@mzaks
Copy link
Contributor Author

mzaks commented May 11, 2024

Also, can we make functions other than sort private (prefix by _)? I think they should be treated as experimental/unstable. We could maybe also keep the old implementation if only for evaluation purpose, and only remove them when we build up the benchmark infrastructure.

My Idea was, that there is the sort function which users will use by default. This function decides which algorithm is best given the list size. And there are the actual sorting algorithms quick_sort, insertion_sort etc... which users can explicitly use, they have same API, so I don't see the point of hiding them.

Regarding the old std lib sort functions. I did a benchmark https://github.com/mzaks/mojo-sort/blob/main/benchmark.mojo and the old sort function performs worse through out the test. Radix sort performs incredible when the list size is higher. Hence why I added Radix sort from the get go, but now I removed it. I agree that it adds lots of complexity to this PR.

@soraros
Copy link
Contributor

soraros commented May 11, 2024

My Idea was, that there is the sort function which users will use by default. This function decides which algorithm is best given the list size. And there are the actual sorting algorithms quick_sort, insertion_sort etc... which users can explicitly use, they have same API, so I don't see the point of hiding them.

I agree that we can and probably should expose them. I only suggested to use the prefix as indicator for unstable APIs (signature may change, they should live in algorithm, etc.), we might want to separate them from stable ones like sort. Maybe we can move them to utils.sort, and have the free function sort be a thin wrapper.

Regarding the old std lib sort functions. I did a benchmark https://github.com/mzaks/mojo-sort/blob/main/benchmark.mojo and the old sort function performs worse through out the test. Radix sort performs incredible when the list size is higher. Hence why I added Radix sort from the get go, but now I removed it. I agree that it adds lots of complexity to this PR.

I knew that you did the benchmark, I don't question your result at all, and the Redix sort is a great addition. I meant keeping the old implementation to test the benchmarking infrastructure in the Mojo repo (CI etc.), and use the future PR which removes them as test bed.

But, of course, these are merely my own views and suggests. As long as the team is ok with your approach, I'm fine with them too. Thank you for the great work.

Copy link
Collaborator

@JoeLoser JoeLoser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!!! Thanks so much for pushing on this.

I left a few minor comments that I'd like to see addressed before we land this.


Args:
v: Input vector to sort.
list: The list of the order comparable elements which will be sorted inpace.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quibble s/inpace/in-place

stdlib/src/builtin/sort.mojo Show resolved Hide resolved
stdlib/src/builtin/sort.mojo Show resolved Hide resolved
stdlib/src/builtin/sort.mojo Outdated Show resolved Hide resolved
stdlib/test/builtin/test_sort.mojo Outdated Show resolved Hide resolved
stdlib/test/builtin/test_sort.mojo Show resolved Hide resolved
@mzaks mzaks force-pushed the feature/alternative-sort-module branch from 0e93b77 to 9559893 Compare May 16, 2024 10:42
@mzaks
Copy link
Contributor Author

mzaks commented May 16, 2024

I left a few minor comments that I'd like to see addressed before we land this.

The comments are addressed and the branch is rebased.

@JoeLoser
Copy link
Collaborator

I left a few minor comments that I'd like to see addressed before we land this.

The comments are addressed and the branch is rebased.

Nice, thank you! Would you like to add a changelog entry for your awesome work here as well? Sorry I forgot to mention this yesterday.

@JoeLoser JoeLoser self-assigned this May 16, 2024
@gryznar
Copy link
Contributor

gryznar commented May 16, 2024

There are few places with typos / incosistency:

  • inpace
  • in-pace
  • in place

Everywhere should be : "in-place"

@gryznar
Copy link
Contributor

gryznar commented May 16, 2024

We can use some more meaningful name than "D". "D" is both used here for DType and CollectionComparableElement Maybe just: "type"?


Args:
v: Input integer vector to sort.
list: The list of the order comparable elements which will be sorted inpace.
low: Int value identifying the lowes index of the list section to be sorted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*lowest

@mzaks mzaks requested a review from a team as a code owner May 17, 2024 09:02
@mzaks mzaks force-pushed the feature/alternative-sort-module branch 5 times, most recently from 07d2bf5 to 0770b9e Compare May 17, 2024 09:44
@mzaks mzaks requested a review from gryznar May 17, 2024 09:46
Copy link
Contributor

@gryznar gryznar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more nit. Changelog entry is not very consistent with the rest, but overall LGTM

@mzaks mzaks requested a review from gryznar May 17, 2024 12:53
@mzaks
Copy link
Contributor Author

mzaks commented May 17, 2024

One more nit. Changelog entry is not very consistent with the rest, but overall LGTM

I rephrased it a bit :)


fn _quicksort[
type: CollectionElement, cmp_fn: fn (type, type) capturing -> Bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question Previously, this quicksort function allowed specifying a comparator; we use this functionality internally for sorting by greater_than comparator for example instead of the default less than comparator. Do you mind adding this level of flexibility and plumbing it through the various sorting algorithms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There a wrinkle with that, currently all sort algorithms are comparison based so it is not a problem. But Radix sort is not comparison based, hence it would mean we would have an issue to call it from the sort function, if sort function allows configurable comparison operator. I guess if the main motivation is for sorting in descending order, then we can introduce an ascending/descending parameter, instead of a sorting function. This should work for radix sort as well. If there are other motivations, then I guess we can go with operator and avoid calling radix sort from the sort function. We could still introduce it as a standalone function in the module though.

Let me know what you prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. We have a couple of use cases internally for calling quick_sort with various comparators such as >=, <=, or == on the elements when doing the comparisons. I don't think we should go the ascending/descending route since it's not much more work to be generic over any comparison function. Perhaps we should add an explicit sort_by function that allows passing this generic functor working on comparison of two ComparableCollectionElement values? Something like https://doc.rust-lang.org/std/primitive.slice.html#method.sort_by for example. With this particularly, we can decide whether we want to push the function (and default it to a generic less than function for sorting purposes) in the sort API or have it be a different named function (such as the sort_by).

I think we should go in this general direction and then think about various things as follow-ups — do we want various sorting APIs or have a one-stop-shop like in C++ here? Of note, you mentioned non-comparison based sorting algorithms wouldn't be suitable for this sort of function signature.

What do you think is best? Also CC @laszlokindrat @rparolin in case you have thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is, how generic do we want the sort to be.

The current internal functions signatures are:
[type: AnyRegType, cmp_fn: _cmp_fn_type](array: Pointer[type])

and

[type: CollectionElement, cmp_fn: fn (type, type) capturing -> Bool](array: UnsafePointer[type])

which are called from public signatures like:

sort(inout buff: Pointer[Int], len: Int):

fn sort[type: DType](inout buff: Pointer[Scalar[type]], len: Int):

fn sort(inout v: List[Int]):

fn sort[type: DType](inout v: List[Scalar[type]]):

and

fn sort[type: CollectionElement, cmp_fn: fn (type, type) capturing -> Bool](inout v: List[type]):

The state of the sort module actually changed a lot since my PR was opened so I think it is best to take a step back. I will make this PR less aggressive I will just introduce a new sort function with following signature:

fn sort[type: ComparableCollectionElement](inout v: List[type]):

So the PR will introduce the initial concept of generic comparison sorting and then we can build upon that.

return right
@always_inline
@parameter
fn _partition(low: Int, high: Int) -> Int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion Do you mind lifting this out of _quick_sort as a free function? We use _partition internally and it would be nice to preserve that level of flexibility for custom sorting things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoeLoser I just wanted to lift the _partition function to upper level, but then I looked through the previous code again and I noticed that the top level partition and _partition functions expect cmp_fn: _cmp_fn_type as a parameter, which I think is quite specific for the builtin sort module. Or rather, we need to decide if we want a complete set of functions which expect the cmp_fn: _cmp_fn_type parameter. Currently I implemented a set of functions for Scalar list elements (mainly because allows some performance optimisations) and ComparableCollectionElement list elements.

If we want to keep

alias _cmp_fn_type = fn[type: AnyRegType] (type, type) capturing -> Bool

It will make sense to introduce a third set of functions just with this signature. To be honest I would be agains it as it increases maintenance complexity for a very specific use case, but I am ok with going along with it, if it important internally.

Copy link
Collaborator

@JoeLoser JoeLoser May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my other comment, I think we should introduce a generic range partition that works on ComparableCollectionElement I suppose. I'm imagining something similar to std::ranges::partition from C++ as seen here. What do you think about that? I think partition as a general algorithm (outside of just the sorting use case) is useful and it'll evolve over time.

@mzaks mzaks force-pushed the feature/alternative-sort-module branch from df7ef95 to 3227d3d Compare May 20, 2024 14:38
@JoeLoser
Copy link
Collaborator

!sync

mzaks added 9 commits May 23, 2024 15:55
…ements

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
@mzaks mzaks force-pushed the feature/alternative-sort-module branch from 3227d3d to a6cff31 Compare May 23, 2024 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants