Reduce matmul latency by splitting small matmul #54421
Conversation
stdlib/LinearAlgebra/src/matmul.jl (Outdated)

```julia
end

__matmul2x2_elements(tA, tB, A, B) = __matmul2x2_elements(tA, A), __matmul2x2_elements(tB, B)

function matmul2x2!(C::AbstractMatrix, tA, tB, A::AbstractMatrix, B::AbstractMatrix,
```
What if you added aggressive constant propagation here? `tA` and `tB` should be known as constants at the call site, because they are obtained from unpeeling outer wrappers. Does that also increase latency again?
I had tried this, but it doesn't seem to change the compilation time at all. Some of the branches are certainly eliminated, but the bulk of the latency probably arises elsewhere.

This doesn't make things worse either, though, so we may as well include it in case the other issues are resolved in the future. It may also let us use lazy versions of the wrappers instead of copies, as that would be type-stable once the other branches are eliminated.
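The lazy-wrapper idea could look something like this (illustrative only; `wrap` is a hypothetical helper, not stdlib code). Selecting `transpose(A)` versus `A` is type-unstable in general, but becomes type-stable once the branch on a constant `tA` is folded away:

```julia
# Illustrative only: when tA is constant-propagated, this branch folds and
# the result type (Matrix vs. Transpose) is known at compile time, so the
# lazy wrapper can replace an eager copy of the matrix.
wrap(tA::AbstractChar, A::AbstractMatrix) = tA == 'T' ? transpose(A) : A
```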
Awesome! The compilation latency improvement should hold for any initial calls.

@nanosoldier
Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
Yes, this improves the compilation time for matrices of any size:

```julia
julia> A = rand(30,30); B = rand(size(A)...); C = zeros(size(A));

julia> @time mul!(C, A, B); # similar to above
  1.272428 seconds (4.25 M allocations: 221.403 MiB, 19.92% gc time, 99.93% compilation time)
```
This splits `matmul2x2` and `matmul3x3` into components that depend on `MulAddMul` and those that don't. This improves compilation time, as the `MulAddMul`-independent methods won't need to be recompiled in the `@stable_muladdmul` branches.

TTFX (each call timed in a separate session):
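A simplified sketch of the splitting idea (function names here are illustrative, not the stdlib's): the element extraction never mentions the `α`/`β` coefficients, so it compiles once, while only the small store kernel depends on them and needs recompiling per `@stable_muladdmul` branch.

```julia
# Coefficient-independent part: compiled once, reused by every branch.
function matmul2x2_elements(A, B)
    (A[1,1], A[1,2], A[2,1], A[2,2]), (B[1,1], B[1,2], B[2,1], B[2,2])
end

# Coefficient-dependent part: the only code specialized per (α, β) combination.
function matmul2x2_store!(C, (a11, a12, a21, a22), (b11, b12, b21, b22), α, β)
    C[1,1] = α * (a11*b11 + a12*b21) + β * C[1,1]
    C[1,2] = α * (a11*b12 + a12*b22) + β * C[1,2]
    C[2,1] = α * (a21*b11 + a22*b21) + β * C[2,1]
    C[2,2] = α * (a21*b12 + a22*b22) + β * C[2,2]
    C
end
```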
Edit: Not inlining the function seems to incur a runtime performance cost. Adding `@inline` annotations resolves this difference, but reintroduces the compilation latency. The tradeoff is perhaps acceptable, as users may use `StaticArrays` for performance-critical matrix multiplications.
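As the comment notes, `StaticArrays` (a registered package, not part of the stdlib) is the usual tool when small-matrix multiplication is on the hot path: the matrix sizes live in the type, so `*` compiles to fully unrolled, allocation-free code.

```julia
# Small-matrix multiply via StaticArrays: sizes are encoded in the type,
# so the product is unrolled at compile time and stack-allocated.
using StaticArrays

A = @SMatrix [1.0 2.0; 3.0 4.0]
B = @SMatrix [5.0 6.0; 7.0 8.0]
C = A * B  # an SMatrix{2,2,Float64}; no heap allocation
```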