recursive implementation for in(::Any, ::Tuple) #54411

Closed
wants to merge 1 commit into from

nsajko
Contributor

@nsajko nsajko commented May 8, 2024

Avoids relying on tail(::Tuple), thus it can be performant for moderate length tuples. Before this change the recursive implementation could only be used for tuples of length less than about thirty, while now the recursion is used for lengths up to 600.

This means that `in` will be both faster and amenable to constant folding for larger tuples than before.

This introduces the TupleView type, which will hopefully be used in a few other places in the future, too. It's meant to help with recursing over every tuple element in a compiler-friendly way. In particular, TupleView could be useful for _isdisjoint, isdisjoint(::Tuple, ::Tuple) and other places where tail(::Tuple) is currently used.
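
To illustrate the idea, here is a minimal sketch of how a view type can stand in for tail(::Tuple) in a recursion. It is not the code from this PR: the names SketchTupleView, remaining, first_elem, rest_view and sketch_in are hypothetical, and missing handling is left out. The point is only that each recursive step changes a type parameter instead of constructing a shorter tuple.

```julia
# Hypothetical sketch, not the PR's TupleView: a view into a tuple that
# starts at index `Start`, with `Start` stored as a type parameter.
struct SketchTupleView{Start, T<:Tuple}
    parent::T
end

# Convenience constructors (assumed for this sketch).
SketchTupleView{Start}(t::T) where {Start, T<:Tuple} = SketchTupleView{Start, T}(t)
SketchTupleView(t::Tuple) = SketchTupleView{1}(t)

# Number of elements still visible through the view.
remaining(v::SketchTupleView{Start}) where {Start} = length(v.parent) - Start + 1

# First visible element, and a view that skips it. No new tuple is built;
# only the `Start` parameter changes.
first_elem(v::SketchTupleView{Start}) where {Start} = v.parent[Start]
rest_view(v::SketchTupleView{Start}) where {Start} = SketchTupleView{Start + 1}(v.parent)

# Recursive membership test over the view. Unlike `in`, a `missing`
# comparison result is simply treated as "no match" here.
function sketch_in(x, v::SketchTupleView)
    remaining(v) == 0 && return false
    (first_elem(v) == x) === true && return true
    return sketch_in(x, rest_view(v))
end
sketch_in(x, t::Tuple) = sketch_in(x, SketchTupleView(t))
```

With these definitions, sketch_in(3, (1, 2, 3)) walks the tuple through three views over the same parent tuple.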

@nsajko
Contributor Author

nsajko commented May 8, 2024

The TupleView type is a simplified version of my unregistered package StaticViews.jl. This PR was prompted by @jishnub's recent message on Slack about in(::Any, ::Tuple).

@jishnub
Contributor

jishnub commented May 8, 2024

We may want a cutoff length for this, as the present loop-based implementation is faster for large tuples.
On nightly v"1.12.0-DEV.467"

julia> t = ntuple(x->'A', 1000);

julia> @btime 'C' in $t;
  354.118 ns (0 allocations: 0 bytes)

This PR

julia> @btime 'C' in $t;
  2.889 μs (0 allocations: 0 bytes)

@nsajko

This comment was marked as outdated.

@aviatesk
Sponsor Member

aviatesk commented May 9, 2024

Can you explain the advantages of using StaticView, especially how it makes tail unnecessary? I'm particularly interested in how it compares to the technique that uses the length of the argument tuple as a threshold to split dispatch between recursive and loop implementations, as in the implementation of map.
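
For context, the length-threshold technique mentioned here can be sketched as follows. This is an illustration under stated assumptions, not Base's code: the alias AtLeast4, the cutoff of 4, and the function names short_in and threshold_in are hypothetical, and a real cutoff would be much larger.

```julia
# Hypothetical alias: matches any tuple with at least 4 elements.
const AtLeast4{N} = Tuple{Any, Any, Any, Any, Vararg{Any, N}}

# Recursive path for short tuples, peeling one element at a time with tail.
short_in(x, ::Tuple{}) = false
short_in(x, t::Tuple) = (t[1] == x) === true || short_in(x, Base.tail(t))

# Loop path: selected by dispatch when the tuple has 4 or more elements.
function threshold_in(x, t::AtLeast4)
    for y in t
        (y == x) === true && return true
    end
    return false
end

# Fallback: shorter tuples use the recursive path.
threshold_in(x, t::Tuple) = short_in(x, t)
```

Dispatch picks the loop method for long tuples because AtLeast4 is more specific than Tuple, so no runtime length check is needed. As in the sketch's comparisons, missing results are treated as "no match", which is a simplification relative to `in`.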

@aviatesk
Sponsor Member

aviatesk commented May 9, 2024

Also xref: #54026

@nsajko nsajko marked this pull request as draft May 9, 2024 08:05
@nsajko

This comment was marked as resolved.

@tecosaur tecosaur added the domain:collections Data structures holding multiple items, e.g. sets label May 9, 2024
@nsajko nsajko marked this pull request as ready for review May 10, 2024 07:35
@nsajko
Contributor Author

nsajko commented May 10, 2024

Performance comparisons show 3x-4x improvement:

Before:

julia> using BenchmarkTools

julia> t = ntuple(Returns(7), 100);

julia> minimum(@benchmark 3 ∈ $t)
BenchmarkTools.TrialEstimate: 
  time:             34.215 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> t = ntuple(Returns(7), 500);

julia> minimum(@benchmark 3 ∈ $t)
BenchmarkTools.TrialEstimate: 
  time:             203.996 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> versioninfo()
Julia Version 1.12.0-DEV.495
Commit 3c966a5107a (2024-05-09 11:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

After:

julia> using BenchmarkTools

julia> t = ntuple(Returns(7), 100);

julia> minimum(@benchmark 3 ∈ $t)
BenchmarkTools.TrialEstimate:
  time:             9.829 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> t = ntuple(Returns(7), 500);

julia> minimum(@benchmark 3 ∈ $t)
BenchmarkTools.TrialEstimate:
  time:             53.974 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> versioninfo()
Julia Version 1.12.0-DEV.502
Commit 9598e8dce1 (2024-05-10 07:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

@nsajko nsajko added the performance Must go faster label May 10, 2024
@aviatesk
Sponsor Member

The benchmark isn't fair because this PR increases the threshold for tuple lengths that permit unrolling. Essentially, the performance difference we're seeing is between the loop implementation (on master) and the recursive implementation (in this PR) for very large (100-length) tuples. If comparisons are needed, they should be made between recursive implementations, something like this:

julia> @benchmark i ∈ t setup=(n = rand((2:10...,25,100)); t = ntuple(n) do _; rand(1:10); end; i = rand(1:10))
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  17.117 ns … 83.709 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.253 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.440 ns ±  2.913 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▁▁   ▃▂▄▁▁▂█▇▄                                     
  ▁▁▂▂▂▄▄▄▃▅██▅▅▇█████████▇▇▇▇▄▃▃▅▅▃▃▃▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  17.1 ns         Histogram: frequency by time        31.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@nsajko
Contributor Author

nsajko commented May 10, 2024

> The benchmark isn't fair because this PR increases the threshold for tuple lengths that permit unrolling.

Increasing the threshold is the point of this PR (now that the competing PR is already merged).

> If comparisons are needed, they should be made between recursive implementations.

That obviously makes this PR's perf wins even greater, by a lot. The recursive implementation on master allocates when given tuples larger than intended (because of relying on tail(::Tuple)):

julia> f(x, t) = Base._in_tuple(x, t, false)
f (generic function with 1 method)

julia> using BenchmarkTools

julia> t = ntuple(Returns(7), 100);

julia> minimum(@benchmark f(3,$t))
BenchmarkTools.TrialEstimate: 
  time:             175.159 μs
  gctime:           0.000 ns (0.00%)
  memory:           37.48 KiB
  allocs:           69
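
For reference, a tail-based recursion of the kind being discussed can be sketched like this. It is a hypothetical reconstruction for illustration, not the source of Base._in_tuple: the name sketch_in_tuple is made up, and only the general shape (compare the first element, recurse on tail, carry a flag for missing) is assumed.

```julia
# Base case: nothing matched. The result is `missing` if any comparison
# along the way was `missing`, otherwise `false`.
sketch_in_tuple(x, ::Tuple{}, any_missing::Bool) = any_missing ? missing : false

# Recursive case: compare the first element, then recurse on the rest.
# Each step calls Base.tail, which produces a new, shorter tuple.
function sketch_in_tuple(x, t::Tuple, any_missing::Bool)
    v = first(t) == x
    v === true && return true
    return sketch_in_tuple(x, Base.tail(t), any_missing | ismissing(v))
end
```

Every level of this recursion works on a different tuple type, which is why the approach is normally limited to short tuples.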

Comment on lines +1346 to +1347
let f = is_missing | any_missing, rest = tail(v)
_in_tupleview(x, rest, f)
Contributor Author

Probably I should just get rid of the let and inline f and rest.

@nsajko
Contributor Author

nsajko commented May 13, 2024

Closing in favor of an upcoming, more comprehensive PR.

@nsajko nsajko closed this May 13, 2024