Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Adding a builtin sort/sorted function #2390

Open
1 task done
Moosems opened this issue Apr 23, 2024 · 14 comments
Open
1 task done

[Feature Request] Adding a builtin sort/sorted function #2390

Moosems opened this issue Apr 23, 2024 · 14 comments
Labels
enhancement New feature or request mojo Issues that are related to mojo mojo-repo Tag all issues with this label mojo-stdlib Tag for issues related to standard library

Comments

@Moosems
Copy link

Moosems commented Apr 23, 2024

Review Mojo's priorities

What is your request?

I think it would be a great improvement to the stdlib to bring a sort/sorted function that uses a Sortable trait for things like large length string sorting.

What is your motivation for this change?

I use the sort function all the time in python (especially with a key) and I'd love to see this make its way into Mojo

Any other details?

No response

@Moosems Moosems added enhancement New feature or request mojo Issues that are related to mojo labels Apr 23, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Apr 24, 2024

I'm interested to support this feature.

@ematejska ematejska added the mojo-repo Tag all issues with this label label Apr 29, 2024
@jayzhan211
Copy link
Contributor

I think we need to implement comparison for CollectionElement

'T' does not implement the '__lt__' method

@Moosems
Copy link
Author

Moosems commented Apr 30, 2024

That'd be a good start :)

@JoeLoser
Copy link
Collaborator

JoeLoser commented May 3, 2024

This would be great! Right now we only have a basic sorting algorithm in the algorithm package that only works on AnyRegType if I recall correctly. Perhaps it makes sense to open source that as a reference and then we can make it work on AnyType, improve the performance of it, and use the ordering trait in #2462.

Do you or @mzaks already have a generic sorting algorithm on AnyType/some refined trait such as the ordering one? If so, would very much welcome a PR into the library and we can build out this area.

@jayzhan211
Copy link
Contributor

Do you or @mzaks already have a generic sorting algorithm on AnyType/some refined trait such as the ordering one? If so, would very much welcome a PR into the library and we can build out this area.

I don't have :( but willing to help on this! Perhaps building on what you have mentioned

@mzaks
Copy link
Contributor

mzaks commented May 3, 2024

Do you or @mzaks already have a generic sorting algorithm on AnyType/some refined trait such as the ordering one? If so, would very much welcome a PR into the library and we can build out this area.

I updated mojo-sort so insertion sort and quick sort are working on nightly, I am currently using following signature: fn quick_sort[D: CollectionElement, lt: fn (D, D) -> Bool](inout vector: List[D]): to make it generic, but with OrderingComparable trait it will be even simpler. For std sort I would implement a mix of quick and insertions sort, use insertion sort for small count and quick sort for larger count.

BTW if we will have a sorting package then I can also contribute Radix sort for numerical values. For large lists it is up to 30x faster than current std lib sort algorithm. Also in this case, a mix of insertion sort and Radix sort gives best performance, however important to know that Radix sort is not an in-place algorithm, so I would suggest to have it in std under an explicit name.

@mzaks
Copy link
Contributor

mzaks commented May 3, 2024

And another point, there is a class of sorting algorithms specific for strings.

https://github.com/rantala/string-sorting is a treasure trove when it comes to string sorting.

I ported multi key sorting algorithm in my repo, but now it seems to be buggy and slower than generic insertion sort or quick sort.
But for the future I think it would be great to pick a few ( e.g. SIMD based) string sorting algorithm which performs best for certain use cases and implement them in std lib.

@mzaks
Copy link
Contributor

mzaks commented May 3, 2024

@JoeLoser how do we want to start? You will opensource the sorting package? I can add the ComparableCollectionElement trait which is a CollectionElement and has the compare methods. And implement a sorting function which is a mix of insertion and quick sort, for Lists where elements conform to ComparableCollectionElement.

@jayzhan211
Copy link
Contributor

There are lots of sorting algorithm which has their own strength in specific cases. I think the built-in algorithm should be the best for general cases. We can take other languages stdlib as reference. In Python, they have Timsort. In Rust, they have quicksort for unstable, timsort for stable sort. https://doc.rust-lang.org/core/slice/sort/index.html.

I think it is fine for me if we start from @mzaks sorting algorithm, but I expect at the end we can have Timsort like what in Python and Rust, or other competitive sorting algorithms by default for most of the use cases. Other sorting algorithms should be a package imported if needed.

@mzaks
Copy link
Contributor

mzaks commented May 3, 2024

@JoeLoser I heard that there are plans to introduce benchmarks (infrastructure and code) to std lib. Is it correct? For evolution on algorithms and data structures it would be crucial to have it, so we can make data driven decisions.

@ematejska ematejska added the mojo-stdlib Tag for issues related to standard library label May 3, 2024 — with Linear
Copy link
Collaborator

JoeLoser commented May 5, 2024

I'll do my best to respond to everything here in one go.

  1. I don't mind open sourcing our basic sorting algorithm whose sort public API works on CollectionElement types and is using a basic quicksort if you guys would find that useful. It's nothing fancy and its tests aren't anything to brag about either. The benefit of this is it gives us a place to swap (no pun intended) the implementation out in builtin/sort.mojo for example with one of the best out-of-the-box algorithms mentioned above, such as Timsort without breaking anything internally which is a weak positive. Otherwise, if we just add a new sort function now, it would require our very few internal uses of our algorithm.sort function to change to use the built-in, which isn't an issue at all. They would just get an immediate performance boost.
  2. I agree regarding there being a myriad of sorting algorithms and opportunities for Mojo to shine here with its SIMD-as-a-first-class library functionality with regard to this. Like always, I recommend we start small and simple with the Timsort mentioned in (1) and then we can talk about more fancy algorithms. I do agree that these other algorithms would want to be in their own place outside of the builtins. Should they be in the library or a standalone package like @mzaks has them now? I haven't thought about it, but our package manager is a ways out, and I think it's meaningful to provide a few good SIMD-powered sorting algorithms in the library. I'll see what the team thinks and add this to our agenda for the design discussions this coming week.
  3. With the benchmarks, as mentioned in our roadmap, this is something on our radar. I don't have an exact timeline for when this will land in the open source, but it's something people internally are working on and I agree it would be immensely useful for things like this, SSO in [stdlib] Add small string optimization to the String struct #2507, etc. I'll take an action item to go check on where this is at internally and see if there's appetite for delivering incremental value and continue iterating on it out in the open rather than waiting for it to be even further.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 5, 2024

I think it would be nice if internal code builtin/sort.mojo can open source first, then we can incrementally improve based on it, the improvement can also benefit both open source and internal usage. We should avoid duplicate function and ends up divergence

I recommend we start small and simple with the Timsort mentioned in (1) and then we can talk about more fancy algorithms

I agree to start with Timsort, others sorting algorithm should be a standalone package for now. I believe there will come out a better SIMD-based sorting as a builtin candidate in the future 😎

@mzaks
Copy link
Contributor

mzaks commented May 5, 2024

Actually if the sorting algorithm should land in builtin/sort.mojo than there is no need to wait. I actually started on it, but it is a bit unfortunate that we have this PRs already in flight:
#2517
#2378
#2409

BTW after banging my head against the wall with string sorting doing something strange, I realised that the memcmp function is broken. I provided a fix:
#2524

@ematejska ematejska removed the mojo-stdlib Tag for issues related to standard library label May 6, 2024
@ematejska ematejska added the mojo-stdlib Tag for issues related to standard library label May 6, 2024 — with Linear
@JoeLoser
Copy link
Collaborator

I just landed our basic sort algorithm from algorithm.sort into builtin/sort.mojo - see #2605

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request mojo Issues that are related to mojo mojo-repo Tag all issues with this label mojo-stdlib Tag for issues related to standard library
Projects
None yet
Development

No branches or pull requests

5 participants