Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize ArrayList #316

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

mouse0w0
Copy link
Contributor

In the previous pull request (#311), I optimized the array field access of some classes. Recently, with my deeper understanding of fastutil, I have discovered more optimizations and technical debts that need attention. However, resolving these issues is challenging and time-consuming, requiring contributions from the community and multiple pull requests. This pull request is the recent work I have done on ArrayList.

@incaseoftrouble
Copy link
Collaborator

Did you profile these changes? (With a proper JMH setup)

I'm doubtful that these changes have any noticeable performance improvements (the JVM is incredibly good at figuring out these things). In particular, copying fields to local variables in my experience just adds clutter and doesn't yield anything. (Why I come to this conclusion: I also thought that it helps, after all, the compiler cannot prove the absence of concurrent modification. But even in really costly iterative methods this made practically no difference. I think this is partly due to such a local variable needing to be copied and written to memory, since, indeed, the compiler cannot prove that not copying the value will have no influence.)

Similarly, marking local variables as final has (hardly) any influence. To my knowledge, the only possibility is that the compiler can perform some static analysis (which the JIT would very likely figure out, too) - final is discarded from the byte-code.

@mouse0w0
Copy link
Contributor Author

mouse0w0 commented Feb 12, 2024

I conducts a benchmark for List forEach. The benchmark code is as follows:

@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"10", "1000", "100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            blackhole.consume(a[i]);
        }
    }
}

The benchmark results are as follows:

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach                10  avgt   10      7.003 ±   0.006  ns/op
IterationBenchmark.forEach              1000  avgt   10    656.301 ±   0.153  ns/op
IterationBenchmark.forEach            100000  avgt   10  65878.207 ± 424.070  ns/op
IterationBenchmark.forEachCacheArray      10  avgt   10      4.208 ±   0.004  ns/op
IterationBenchmark.forEachCacheArray    1000  avgt   10    330.056 ±   0.024  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt   10  32982.049 ±   6.810  ns/op
IterationBenchmark.forEachCacheSize       10  avgt   10      2.805 ±   0.004  ns/op
IterationBenchmark.forEachCacheSize     1000  avgt   10    105.135 ±   0.013  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt   10  13622.145 ±   9.760  ns/op

Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching. Regarding the benchmark results, my opinions are as follows:

  1. When the code attempts to get a field of the this object, it needs to execute two opcodes, aload 0 and getfield. However, when getting a local variable, only one opcode is required, such as iload 1. Therefore, if the code needs to access a field value more than twice, the number of opcodes for accessing fields is greater than accessing local variables.

  2. Local variables are stored in the stack rather than the heap, implying that they can be released from memory when the method returns.

  3. Computers tend to store local variables in the CPU cache rather than memory, facilitating faster access.

  4. The getfield opcode is relatively time-consuming, which to some extent affects performance.

Furthermore, using the final modifier to declare local variables is to conform to the existing code style of fastutil. Looking forward to your reply.

@incaseoftrouble
Copy link
Collaborator

Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching.

... on this benchmark that doesn't have dynamic dispatch etc. involved and is about forEach (I'm totally in favor of specializing forEach!)

I doubt that there will be any difference (there might even be a penalty) for get and set (i.e. methods where assigning the local variable takes a constant fraction of the overall number of operations)

Also: Which version of Java are you using? I cannot replicate these findings. For me, all three are the same timing:

IterationBenchmark.forEach            100000  avgt    4  12461.602 ± 438.815  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    4  12404.329 ± 242.271  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    4  12649.409 ± 985.465  ns/op

opcodes

All these assumptions are assumptions. The bytecode is not at all what is actually produced by the JIT, which in turn is not at all what the actual operations executed on the CPU are (see pipelining, speculative execution / branch prediction, etc.). Such constant-factor optimizations need (as realistic as possible) profiling as justification.

(regarding point 2: The field however already exists on the heap, so there is no need to write and then free that memory at all!)

using the final modifier to declare local variables is to conform to the existing code style of fastutil.

Oh, right! Then making all effectively final variables final definitely is good :)

@incaseoftrouble
Copy link
Collaborator

PS: I don't want to sound overly negative! I've just seen way too often that people try to outsmart a compiler, which more often than not actually is counterproductive.

@szarnekow
Copy link

It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.

@incaseoftrouble
Copy link
Collaborator

incaseoftrouble commented Feb 12, 2024

It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.

Oh, that is surprising! I am on the same major version, to be precise # VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04, Blackhole mode: compiler and for me there really are no differences. I tried several variations of this benchmark.

Can you try with .0.9? I doubt that this changes with a minor version, but still. I don't get why there should be such drastic differences.

@incaseoftrouble
Copy link
Collaborator

@szarnekow @mouse0w0

Could you try this benchmark variant:

@State(Scope.Benchmark)
@Warmup(time = 2, iterations = 2)
@Measurement(time = 2, iterations = 4)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        Arrays.setAll(a, IntUnaryOperator.identity());
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }
}

(I just want to eliminate the low probability that it has something to do with the used blackhole implementation)

@mouse0w0
Copy link
Contributor Author

My benchmark result is based on # VM version: JDK 17.0.7, Java HotSpot(TM) 64-Bit Server VM, 17.0.7+8-LTS-224.

@incaseoftrouble
Copy link
Collaborator

incaseoftrouble commented Feb 12, 2024

Ok this is very surprising. Let me try if I can reproduce with HotSpot and, if so, if it is hotspot being slow or openjdk being fast.

EDIT: Nope. JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240 seems to be slightly faster overall (~2%) but the "naive" forEach actually is fastest ...

@mouse0w0
Copy link
Contributor Author

Run your benchmark variant in JDK 17.0.7, the benchmark results are as follows:

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach            100000  avgt    4  24642.541 ± 212.961  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    4  24641.441 ± 106.699  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    4  24642.589 ±  38.049  ns/op

@mouse0w0
Copy link
Contributor Author

I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.

@incaseoftrouble
Copy link
Collaborator

incaseoftrouble commented Feb 12, 2024

I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.

But the compiler cannot know that .consume(a) does not access the content of the array, thus it cannot skip the for loop.

I re-ran the original benchmark with HotSpot and get the same results:

Benchmark                                (n)  Mode  Cnt      Score   Error  Units
IterationBenchmark.forEach            100000  avgt    2  12524.095          ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    2  12577.578          ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    2  12394.182          ns/op

For completeness: JMH version: 1.36. But I am sort of at a loss right now. I have no idea how our results could be that different.

@mouse0w0
Copy link
Contributor Author

But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.

@mouse0w0
Copy link
Contributor Author

My JMH version is 1.37.

@incaseoftrouble
Copy link
Collaborator

incaseoftrouble commented Feb 12, 2024

But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.

Absolutely. This was just an attempt at pinpointing where the difference comes from.

Let's try this:

I started from an empty folder and ran what the JMH guide suggests:

mvn archetype:generate \
  -DinteractiveMode=false \
  -DarchetypeGroupId=org.openjdk.jmh \
  -DarchetypeArtifactId=jmh-java-benchmark-archetype \
  -DgroupId=org.sample \
  -DartifactId=test \
  -Dversion=1.0

Then, I put the following code into src/main/java/bench/IterationBenchmark.java:

package bench;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;


@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            blackhole.consume(a[i]);
        }
    }
}

followed by mvn clean verify and java -jar target/benchmarks.jar.

Output:

# JMH version: 1.37
# VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 2 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

and

Benchmark                                (n)  Mode  Cnt      Score   Error  Units
IterationBenchmark.forEach            100000  avgt    2  12523.915          ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    2  12503.226          ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    2  12481.488          ns/op

(* I ran this with iterations = 2 because I didn't want to wait :) but the numbers are so consistent that iterations=1 already should be enough)

@szarnekow
Copy link

Original benchmark; M1 native;

JDK 19.0.2, OpenJDK 64-Bit Server VM, 19.0.2+7

Benchmark                                   (n)  Mode  Cnt  Score   Error  Units
IterationBenchmark.forEach                   10  avgt   10  7,168 ± 0,177  ns/op
IterationBenchmark.forEachCacheArrayLength   10  avgt   10  2,375 ± 0,129  ns/op
IterationBenchmark.forEachCacheField         10  avgt   10  3,860 ± 0,141  ns/op

JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS

Benchmark                                   (n)  Mode  Cnt  Score   Error  Units
IterationBenchmark.forEach                   10  avgt   10  2,547 ± 0,059  ns/op
IterationBenchmark.forEachCacheArrayLength   10  avgt   10  2,541 ± 0,036  ns/op
IterationBenchmark.forEachCacheField         10  avgt   10  2,556 ± 0,065  ns/op

Numbers on 17.0.9 or newer look similar to 21 for me

@mouse0w0
Copy link
Contributor Author

Upgrade JVM to 17.0.10 and get the same results as you:

# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

and

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach            100000  avgt    5  13633.378 ± 892.284  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    5  13517.690 ±  47.061  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    5  13550.632 ± 198.555  ns/op

@incaseoftrouble
Copy link
Collaborator

Huh! I didn't expect this to come out of a minor version change. Well, that at least explains what happens in that particular case ... Thanks for trying with an upgraded version!

@mouse0w0
Copy link
Contributor Author

Benchmark in JVM versions 17.0.8 and 17.0.9 respectively, the results indicate that the change occurred in 17.0.9.

@incaseoftrouble
Copy link
Collaborator

Aha! The 17.0.9 changelog includes hotspot/compiler C2 Blackholes should allow load optimizations

Since your performance was similar when the blackhole was moved outside the loop, I am guessing this is the culprit. This would mean the performance difference was caused by blackhole and would not be observable in real applications (which aligns with what I have seen in e.g. JBDD)

drv/ArrayList.drv Outdated Show resolved Hide resolved
drv/ArrayList.drv Outdated Show resolved Hide resolved
@vigna
Copy link
Owner

vigna commented Feb 13, 2024

Wow. That's a great and constructive discussion!

My 2€¢: I welcomed the PRs from @mouse0w0 because this is how fastutil code is written. Beyond 2 accesses to the same array I try to cache. I know it's up and down with actual improvements etc., but semantically it really makes sense and aften works (remember we're supporting Java 8).

@mouse0w0 noted that in some places I wasn't being coherent and fixed those places. So I'm totally happy to see this discussion but at least for the time being (and seeing different results on different architectures) for 3 accesses or more the rule is to cache. If you spot a relevant method where this does not happen I'm happy to accept PRs.

drv/ArrayList.drv Outdated Show resolved Hide resolved
@incaseoftrouble
Copy link
Collaborator

incaseoftrouble commented Feb 13, 2024

for 3 accesses or more the rule is to cache

I'm fine with that. Its just important to note that a) this primarily is a style decision, not performance and b) it should be tested that such a change doesn't actually negatively impact performance (which can happen).

The only thing I am "unhappy" about is the for (int i = 0, max = size; ...). I find these not nice to read. If the size field should be copied to local, I would prefer int size = this.size; for (int i = 0; i < size; i++) {...}. But this is up for discussion.

@vigna
Copy link
Owner

vigna commented Feb 13, 2024

Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.

What about for(i = 0, _size = size, i < _size; i++)?

@incaseoftrouble
Copy link
Collaborator

Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.

Sure, I just meant that we cant just assume that local caching makes things faster. We tested it, saw that it doesn't make a difference, great

for(i = 0, _size = size, i < _size; i++)?

I equally dislike that - for me, a for loop should just have a single initialization. But its your project and your style decision :)

@mouse0w0
Copy link
Contributor Author

mouse0w0 commented Feb 13, 2024

for (int i = 0, max = size; ...)

This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.

int size = this.size; for (int i = 0; i < size; i++) {...}

Personally, I prefer this code style as well, int size = this.size is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.

@vigna
Copy link
Owner

vigna commented Feb 14, 2024

for (int i = 0, max = size; ...)

This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.

int size = this.size; for (int i = 0; i < size; i++) {...}

Personally, I prefer this code style as well, int size = this.size is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.

Ok, so let's go for int size = this.size;. One just has to be sure not to change the semantics of the the method, as this. is not compulsory in Java. I'd thus suggest final int size = this.size;. In this way, if in the rest of the method where was some mutation for this.size we'll get an error.

Comment on lines 667 to 677
public void forEachRemaining(final METHOD_ARG_KEY_CONSUMER action) {
final KEY_GENERIC_TYPE[] a = ARRAY_LIST.this.a;
final int max = to - from;
while(pos < max) {
action.accept(a[from + (lastReturned = pos++)]);
final int from = SubList.this.from, to = SubList.this.to;
for (int i = from + pos; i < to; i++) {
action.accept(a[i]);
}

final int max = to - from;
pos = max;
lastReturned = max - 1;
}
Copy link
Contributor Author

@mouse0w0 mouse0w0 Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The forEachRemaining change is meaningless (maybe slower) in the latest benchmark, and I'm not sure if I should keep it. I did this optimization because Java's ArrayList uses the same code, and it seems more efficient (less assignment and self-increment).
Benchmark:

@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ForEachRemainingBenchmark {
    @Param({"10", "1000", "100000"})
    private int max;
    private int[] a;

    @Setup
    public void setup() {
        a = new int[max];
        Arrays.setAll(a, IntUnaryOperator.identity());

    }

    private int pos;
    private int last;

    @Benchmark
    public void raw(Blackhole blackhole) {
        pos = max >> 1;
        last = pos - 1;

        while (pos < max) {
            blackhole.consume(a[last = pos++]);
        }
        blackhole.consume(pos);
        blackhole.consume(last);
    }

    @Benchmark
    public void optimize(Blackhole blackhole) {
        pos = max >> 1;
        last = pos - 1;

        for (int i = pos; i < max; i++) {
            blackhole.consume(a[i]);
        }
        pos = max;
        last = max - 1;
        blackhole.consume(pos);
        blackhole.consume(last);
    }
}

Output:

# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: -javaagent:C:\Program Files\IntelliJ IDEA 2023.2.5\lib\idea_rt.jar=8024:C:\Program Files\IntelliJ IDEA 2023.2.5\bin -Dfile.encoding=UTF-8
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

Benchmark                            (max)  Mode  Cnt     Score    Error  Units
ForEachRemainingBenchmark.optimize      10  avgt    5     3.331 ±  0.018  ns/op
ForEachRemainingBenchmark.optimize    1000  avgt    5    56.202 ±  0.023  ns/op
ForEachRemainingBenchmark.optimize  100000  avgt    5  6028.862 ±  7.591  ns/op
ForEachRemainingBenchmark.raw           10  avgt    5     3.786 ±  0.006  ns/op
ForEachRemainingBenchmark.raw         1000  avgt    5    53.319 ±  0.026  ns/op
ForEachRemainingBenchmark.raw       100000  avgt    5  6007.510 ± 14.774  ns/op

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's a sensible criterion, but the ArrayList code has been written by many people over the years, and optimized for different implementations of Collections and different JVMs, so it's not necessary the best.

However, the difference is so small that I'd keep the new ArrayList-style code. It's definitely cleaner, and writing in Rust I realized that exceeding with side-effects in expression evaluation doesn't do a good service to anybody.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants