[Portable Pipeline] Composition operators not working (only SrcCopy and SrcOver) #182

Open
HowToExpect opened this issue Nov 23, 2023 · 11 comments

Comments

@HowToExpect

```cpp
BLImage img(480, 480, BL_FORMAT_PRGB32);
BLContext ctx(img);

ctx.clearAll();

// First shape filled with a radial gradient.
// By default, SRC_OVER composition is used.
BLGradient radial(
  BLRadialGradientValues(180, 180, 180, 180, 180));
radial.addStop(0.0, BLRgba32(0xFFFFFFFF));
radial.addStop(1.0, BLRgba32(0xFFFF6F3F));
ctx.fillCircle(180, 180, 160, radial);

// Second shape filled with a linear gradient.
BLGradient linear(
  BLLinearGradientValues(195, 195, 470, 470));
linear.addStop(0.0, BLRgba32(0xFFFFFFFF));
linear.addStop(1.0, BLRgba32(0xFF3F9FFF));

// Use 'setCompOp()' to change a composition operator.
ctx.setCompOp(BL_COMP_OP_DIFFERENCE);
ctx.fillRoundRect(
  BLRoundRect(195, 195, 270, 270, 25), linear);

ctx.end();
```
After calling `ctx.setCompOp(BL_COMP_OP_DIFFERENCE)`, the subsequent `fillRoundRect()` call has no visible effect: the rounded rectangle is not drawn.

@kobalicek
Member

I'm sorry, but the portable pipeline doesn't provide all composition operators at the moment. This is still on the to-do list.
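For readers who need the result of `BL_COMP_OP_DIFFERENCE` while an operator is unavailable in a given pipeline, the per-pixel math can be sketched in plain C++. This is only an illustration of the commonly cited premultiplied "difference" formula, not Blend2D's implementation; the helper names `div255` and `differencePRGB32` are hypothetical, and the sketch assumes both pixels are valid premultiplied ARGB32 values (each color channel no greater than alpha).

```cpp
#include <algorithm>
#include <cstdint>

// Rounded division by 255 for the product of two 8-bit values.
static inline uint32_t div255(uint32_t x) {
  x += 128;
  return (x + (x >> 8)) >> 8;
}

// Hypothetical helper (not part of Blend2D): blends premultiplied ARGB32
// source `s` onto destination `d` using the "difference" formula:
//   Dca' = Sca + Dca - 2 * min(Sca * Da, Dca * Sa)
//   Da'  = Sa + Da - Sa * Da
static inline uint32_t differencePRGB32(uint32_t d, uint32_t s) {
  uint32_t da = d >> 24;
  uint32_t sa = s >> 24;

  // Alpha is combined the same way as in SRC_OVER.
  uint32_t out = (sa + da - div255(sa * da)) << 24;

  // Process the blue, green, and red channels in turn.
  for (int shift = 0; shift <= 16; shift += 8) {
    uint32_t dc = (d >> shift) & 0xFFu;
    uint32_t sc = (s >> shift) & 0xFFu;
    uint32_t m = std::min(div255(sc * da), div255(dc * sa));
    // Clamp defensively in case of rounding at the upper bound.
    uint32_t c = std::min<uint32_t>(sc + dc - 2u * m, 255u);
    out |= c << shift;
  }
  return out;
}
```

Applied to every pixel covered by the shape, for example by rendering the shape into a temporary image first, this gives a slow but portable fallback. A fully transparent source leaves the destination unchanged.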

@HowToExpect
Author

ok

@kobalicek changed the title from "AARCH64 GNU/Linux,BLContext::setCompOp question" to "[Portable Pipeline] Composition operators not working (only SrcCopy and SrcOver)" on Dec 5, 2023

@dongzhong

@kobalicek How can I use composition operators properly?

@kobalicek
Member

This is something that will be solved by the AArch64 JIT. I'm not investing much time into portable pipelines at the moment; the JIT seems more important, and its first version will premiere very soon.

@openlearnc

openlearnc commented Mar 16, 2024

I found an example that shows how to optimize alpha blending using ARM NEON, with a function that blends 8 pixels at a time: https://github.com/tttapa/ARM-NEON-Compositor

@kobalicek
Member

kobalicek commented Mar 16, 2024

@openlearnc I have found this approach slow on ARM. The most important thing to do on ARM is to align the destination and then use pair stores; at least this works great on Apple Silicon.

There is now an `aarch64_jit` branch, which can be used by people interested in testing the new AArch64 JIT. It's still a little experimental, but it's much faster than the portable pipelines.

In addition, the aarch64_jit branch has an optimized filler that specializes for smaller widths as well, which makes both small and large fills a little faster especially on ARM hardware.
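The "align the destination, then store in pairs" advice above can be sketched in portable C++ as a head/body/tail split. This is not Blend2D's filler: `fillSpanAligned` is a hypothetical name, the sketch assumes `dst` is at least 4-byte aligned, and whether the two adjacent 16-byte copies actually become a single pair store (such as `STP` on AArch64) depends on the compiler and target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Blend2D code): fills `count` 32-bit pixels at `dst`.
void fillSpanAligned(uint32_t* dst, size_t count, uint32_t pixel) {
  // Head: write single pixels until `dst` sits on a 16-byte boundary.
  while (count != 0 && (reinterpret_cast<uintptr_t>(dst) & 15u) != 0) {
    *dst++ = pixel;
    count--;
  }

  // Body: 8 pixels (32 bytes) per iteration, written as two adjacent
  // 16-byte halves that a compiler may merge into one pair store.
  const uint32_t block[4] = { pixel, pixel, pixel, pixel };
  while (count >= 8) {
    std::memcpy(dst, block, 16);
    std::memcpy(dst + 4, block, 16);
    dst += 8;
    count -= 8;
  }

  // Tail: the remaining 0..7 pixels.
  while (count != 0) {
    *dst++ = pixel;
    count--;
  }
}
```

For narrow spans the head and tail loops dominate, which is one reason a filler might specialize for small widths separately.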

@openlearnc

@kobalicek Although the `aarch64_jit` branch supports more composition operators, my testing found that it is not as fast as the non-JIT version.

@kobalicek
Member

kobalicek commented Mar 24, 2024

@openlearnc I'm interested in a workload that performs better without JIT. I have an Apple M3 chip here and I see a 2-5x performance increase when using JIT. I mostly optimized SRC and SRC_OVER though, so other composition operators need some optimization first (right now they basically use the strategy used on x86, which is not ideal on ARM).

@openlearnc

openlearnc commented Mar 24, 2024

This is the result of my testing Blend2D on an Android device; all builds were compiled with clang.

@kobalicek
Member

If you are doing a single-shot benchmark, such as calling something only once, there is a small overhead to compile each function. The functions are cached after that, but they have to be compiled the first time they are used.

Maybe the Apple M3 is too powerful and has lower latencies for the instructions Blend2D prefers, but I would need more information about that. For example, Blend2D has no problem using the TBL instruction, which was slow in the past.

Still, the JIT understands how to unroll some things much better than the C++ compiler, so it's hard to believe that the non-JIT version would be faster. Maybe some very tiny workloads do better without JIT, when only a few pixels per scanline are modified, but I would like to know about these cases.
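To separate the one-time compilation cost mentioned above from steady-state performance, a benchmark can time the first call on its own and then take the best of the following calls. The sketch below uses only the standard library; `BenchResult` and `benchWithWarmup` are hypothetical names, and the workload passed in is assumed to be whatever rendering code is being compared.

```cpp
#include <algorithm>
#include <chrono>

struct BenchResult {
  double firstCallUs;   // first call, including any one-time setup cost
  double bestSteadyUs;  // fastest of the calls that follow
};

// Hypothetical helper: runs `work` at least twice (once "cold", then
// `iterations - 1` more times) and reports both timings in microseconds.
template<typename Fn>
BenchResult benchWithWarmup(Fn&& work, int iterations) {
  using Clock = std::chrono::steady_clock;

  auto timeOnce = [&]() {
    auto t0 = Clock::now();
    work();
    auto t1 = Clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
  };

  BenchResult r;
  r.firstCallUs = timeOnce();   // may include pipeline compilation
  r.bestSteadyUs = timeOnce();  // first steady-state sample
  for (int i = 2; i < iterations; i++)
    r.bestSteadyUs = std::min(r.bestSteadyUs, timeOnce());
  return r;
}
```

Running the same lambda against a JIT and a non-JIT build and comparing `bestSteadyUs` would show whether a reported slowdown comes from first-use compilation or from the generated code itself.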
