Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce matmul latency by splitting small matmul #54421

Merged
merged 5 commits into from May 12, 2024

Conversation

jishnub
Copy link
Contributor

@jishnub jishnub commented May 9, 2024

This splits the matmul2x2 and matmul3x3 into components that depend on MulAddMul and those that don't depend on it. This improves compilation time, as the MulAddMul-independent methods won't need to be recompiled in the @stable_muladdmul branches.

TTFX (each call timed in a separate session):

julia> using LinearAlgebra

julia> A = rand(2,2); B = Symmetric(rand(2,2)); C = zeros(2,2);

julia> @time mul!(C, A, B);
  1.927468 seconds (5.67 M allocations: 282.523 MiB, 12.09% gc time, 100.00% compilation time) # nightly v"1.12.0-DEV.492"
  1.282717 seconds (4.46 M allocations: 228.816 MiB, 4.58% gc time, 100.00% compilation time) # This PR

julia> A = rand(2,2); B = rand(2,2); C = zeros(2,2);

julia> @time mul!(C, A, B);
  1.653368 seconds (5.75 M allocations: 291.586 MiB, 13.94% gc time, 100.00% compilation time) # nightly
  1.148330 seconds (4.46 M allocations: 230.714 MiB, 4.47% gc time, 100.00% compilation time) # This PR

Edit: Not inlining the function seems to incur a runtime perfomance cost.

julia> using LinearAlgebra

julia> A = rand(3,3); B = rand(size(A)...); C = zeros(size(A));

julia> @btime mul!($C, $A, $B);
  23.923 ns (0 allocations: 0 bytes) # nightly
  31.732 ns (0 allocations: 0 bytes) # This PR

Adding @inline annotations resolves this difference, but this reintroduces the compilation latency. The tradeoff is perhaps ok, as users may use StaticArrays for performance-critical matrix multiplications.

@jishnub jishnub added the domain:linear algebra Linear algebra label May 9, 2024
@tecosaur tecosaur added compiler:latency Compiler latency and removed compiler:latency Compiler latency labels May 11, 2024
end
__matmul2x2_elements(tA, tB, A, B) = __matmul2x2_elements(tA, A), __matmul2x2_elements(tB, B)

function matmul2x2!(C::AbstractMatrix, tA, tB, A::AbstractMatrix, B::AbstractMatrix,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you added aggressive constant propagation here? tA and tB should be known as constants at the call site, because they are obtained from unpeeling outer wrappers. Does that also increase latency again?

Copy link
Contributor Author

@jishnub jishnub May 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had tried this, but this doesn't seem to change the compilation time at all. Some of the branches are certainly eliminated, but the bulk of the latency probably arises elsewhere.

This doesn't make things worse either, though, so perhaps we may include this, just in case the other issues are resolved in the future. This may also let us use lazy versions of the wrappers instead of copies, as this would be type-stable if the other branches are eliminated.

@dkarrasch
Copy link
Member

Awesome! The compilation latency improvement should hold for any initial calls to mul!, right? Not only for small matrices, because those code paths dependent on values of objects unknown at compile time?

@nanosoldier runbenchmarks(["linalg"], vs=":master")

@dkarrasch
Copy link
Member

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@jishnub
Copy link
Contributor Author

jishnub commented May 12, 2024

Yes, this improves the compilation time for matrices of any size

julia> A = rand(30,30); B = rand(size(A)...); C = zeros(size(A));

julia> @time mul!(C, A, B); # similar to above
  1.272428 seconds (4.25 M allocations: 221.403 MiB, 19.92% gc time, 99.93% compilation time)

@jishnub jishnub merged commit 5006312 into master May 12, 2024
7 checks passed
@jishnub jishnub deleted the jishnub/matmulsplit_2x2_3x3 branch May 12, 2024 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:latency Compiler latency domain:linear algebra Linear algebra
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants