Reduce matmul latency by splitting small matmul #54421
Conversation
stdlib/LinearAlgebra/src/matmul.jl (Outdated)

```julia
end

__matmul2x2_elements(tA, tB, A, B) = __matmul2x2_elements(tA, A), __matmul2x2_elements(tB, B)

function matmul2x2!(C::AbstractMatrix, tA, tB, A::AbstractMatrix, B::AbstractMatrix,
```
What if you added aggressive constant propagation here? `tA` and `tB` should be known as constants at the call site, because they are obtained from unpeeling outer wrappers. Does that also increase latency again?
I had tried this, but it doesn't seem to change the compilation time at all. Some of the branches are certainly eliminated, but the bulk of the latency probably arises elsewhere.

This doesn't make things worse either, though, so we may as well include it in case the other issues are resolved in the future. It may also let us use lazy versions of the wrappers instead of copies, as that would be type-stable once the other branches are eliminated.
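The lazy-wrapper idea could look something like this (illustrative only; `wrap` is a hypothetical helper, not stdlib code). Selecting `transpose(A)` versus `A` is type-unstable in general, but becomes type-stable once the branch on a constant `tA` is folded away:

```julia
# Illustrative only: when tA is constant-propagated, this branch folds and
# the result type (Matrix vs. Transpose) is known at compile time, so the
# lazy wrapper can replace an eager copy of the matrix.
wrap(tA::AbstractChar, A::AbstractMatrix) = tA == 'T' ? transpose(A) : A
```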
Awesome! The compilation latency improvement should hold for any initial calls.

@nanosoldier
Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
Yes, this improves the compilation time for matrices of any size:

```julia
julia> A = rand(30,30); B = rand(size(A)...); C = zeros(size(A));

julia> @time mul!(C, A, B); # similar to above
  1.272428 seconds (4.25 M allocations: 221.403 MiB, 19.92% gc time, 99.93% compilation time)
```
This splits `matmul2x2` and `matmul3x3` into components that depend on `MulAddMul` and those that don't. This improves compilation time, as the `MulAddMul`-independent methods won't need to be recompiled in the `@stable_muladdmul` branches.

TTFX (each call timed in a separate session):
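A simplified sketch of the splitting idea (function names here are illustrative, not the stdlib's): the element extraction never mentions the `α`/`β` coefficients, so it compiles once, while only the small store kernel depends on them and needs recompiling per `@stable_muladdmul` branch.

```julia
# Coefficient-independent part: compiled once, reused by every branch.
function matmul2x2_elements(A, B)
    (A[1,1], A[1,2], A[2,1], A[2,2]), (B[1,1], B[1,2], B[2,1], B[2,2])
end

# Coefficient-dependent part: the only code specialized per (α, β) combination.
function matmul2x2_store!(C, (a11, a12, a21, a22), (b11, b12, b21, b22), α, β)
    C[1,1] = α * (a11*b11 + a12*b21) + β * C[1,1]
    C[1,2] = α * (a11*b12 + a12*b22) + β * C[1,2]
    C[2,1] = α * (a21*b11 + a22*b21) + β * C[2,1]
    C[2,2] = α * (a21*b12 + a22*b22) + β * C[2,2]
    C
end
```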
Edit: Not inlining the function seems to incur a runtime performance cost. Adding `@inline` annotations resolves this difference, but reintroduces the compilation latency. The tradeoff is perhaps acceptable, as users may use `StaticArrays` for performance-critical matrix multiplications.
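As the comment notes, `StaticArrays` (a registered package, not part of the stdlib) is the usual tool when small-matrix multiplication is on the hot path: the matrix sizes live in the type, so `*` compiles to fully unrolled, allocation-free code.

```julia
# Small-matrix multiply via StaticArrays: sizes are encoded in the type,
# so the product is unrolled at compile time and stack-allocated.
using StaticArrays

A = @SMatrix [1.0 2.0; 3.0 4.0]
B = @SMatrix [5.0 6.0; 7.0 8.0]
C = A * B  # an SMatrix{2,2,Float64}; no heap allocation
```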