Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41560: [C++] ChunkResolver: Implement ResolveMany and add unit tests #41561

Merged
merged 11 commits into from May 15, 2024

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented May 6, 2024

Rationale for this change

I want ResolveMany to support me in the implementation of Take that doesn't Concatenate all the chunks from a ChunkedArray values parameter.

What changes are included in this PR?

  • Implementation of ChunkResolver::ResolveMany()
  • Addition of missing unit tests for ChunkResolver

Are these changes tested?

Yes. By new unit tests.

Are there any user-facing changes?

No. ChunkResolver is an internal API at the moment (see #34535 for future plans).

Copy link

github-actions bot commented May 6, 2024

⚠️ GitHub issue #41560 has been automatically assigned in GitHub to PR creator.

@zeroshade zeroshade requested a review from pitrou May 8, 2024 14:32
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be some (unwarranted? gratuitous?) complexity here. Also, it's not obvious why this would be better than calling Resolve multiple times in a row.

int64_t index_in_chunk = 0;

/// \brief Create a ChunkLocation without asserting any preconditions.
static ChunkLocation Forge(int64_t chunk_index, int64_t index_in_chunk) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid inventing terminology that's currently not used in the project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your actionable suggestion? I'm trying to make this API less error-prone. MakeUnsafe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an actual need for a checked constructor?

Copy link
Contributor Author

@felipecrv felipecrv May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It adds a layer of safety and by poisoning the index_in_chunk value when chunk_index is invalid, I (and other users of the library) are more likely to detect unguarded use of chunk_index this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how such unguarded uses could occur. The primary use case for ChunkLocation is as a return value for ChunkResolver. Unless the user has passed an invalid logical index, ChunkResolver should always return valid ChunkLocations.

cpp/src/arrow/chunk_resolver.h Outdated Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Outdated Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Outdated Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Outdated Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Outdated Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.h Show resolved Hide resolved
cpp/src/arrow/chunk_resolver.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/vector_sort.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting committer review Awaiting committer review labels May 13, 2024
@felipecrv
Copy link
Contributor Author

There seems to be some (unwarranted? gratuitous?) complexity here.

You can comment on what you think is "unwarranted" and "gratuitous" complexity. This kind of feedback is demotivating and not actionable. I try to simplify the code before opening a PR instead of pushing the first thing that works.

Also, it's not obvious why this would be better than calling Resolve multiple times in a row.

I tried that in the Take kernels, but things got messy and ugly because:

  • Resolve and ChunkLocation use int64_t for everything while Take kernels use unsigned integer of all widths. The loops would be full of unsafe casts with confusing safety justifications.
  • Take kernels have complex loops handling different cases (the 4 combinations of nullability of the values and indices), so Resolve would be called (inlined) in 4 branches per kernel loop specialization.
  • Ensuring Resolve gets called only once per index is tricky because in the Take loops you need to resolve chunks for reading values and validity bits from multiple functions if you modularize the loop. By having a buffer with all the chunks resolved at once, code that performs these loads can be more cleanly separated and accept an index array position as parameter instead of a ChunkLocation
  • That makes the code that deals with ChunkedArray look very similar to the code that handles flat arrays — the ChunkedArray version needs a preparation pass to resolve the chunks
  • By having this chunk-resolution pass, I can make Take loops for ChunkedArray of nested types more easily

In addition to that, ResolveMany can be further optimized with different search strategies (e.g. branchless and interleaved binary searches), the step that calculates index_in_chunk can be vectorized. All of that without contributing to binary size of all the numerous Take specializations that will rely on it because ResolveManyImpl is not inlined.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 14, 2024
@pitrou
Copy link
Member

pitrou commented May 14, 2024

You can comment on what you think is "unwarranted" and "gratuitous" complexity.

I did post specific comments :-)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 14, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 15, 2024
@felipecrv felipecrv requested a review from pitrou May 15, 2024 01:40
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the number of back-and-forths. Here are a number of followup questions and comments, but feel free to merge in any case.


/// \brief Resolve `n` logical indices to chunk indices.
///
/// \pre 0 <= logical_index_vec[i] < n (for well-defined and valid chunk index results)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean logical_array_length() rather than n?

///
/// \pre 0 <= logical_index_vec[i] < n (for well-defined and valid chunk index results)
/// \pre out_chunk_index_vec has space for `n_indices`
/// \post chunk_hint in [0, chunks.size()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it rather a pre-condition?

/// \param n_indices The number of logical indices to resolve
/// \param logical_index_vec The logical indices to resolve
/// \param out_chunk_index_vec The output array where the chunk indices will be written
/// \param chunk_hint 0 or the last chunk_index produced by ResolveMany
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning the caller is supposed to pass the value of out_chunk_index_vec[n_indices - 1] from the previous call to ResolveMany?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's the plan. Or 0 if n_indices == 0. That's why I didn't want to write the formula.

/// \param chunk_hint 0 or the last chunk_index produced by ResolveMany
/// \param out_index_in_chunk_vec If not NULLPTR, the output array where the
/// within-chunk indices will be written
/// \return false iff chunks.size() > std::numeric_limits<IndexType>::max()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming ResolveMany will be invoked in batches? This condition doesn't need to be checked in each ResolveMany call, only once for each call to the larger operation (such as Take). That's not necessarily a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming ResolveMany will be invoked in batches?

It could be. I'm not doing it like that for Take at the moment. It's a very predictable branch though and I'm allowing it to be inlined at the caller.

//
// Negative logical indices can become large values when cast to unsigned, but
// they are gracefully handled by ResolveManyImpl. Although both the chunk index
// and the index in chunk values will be undefined in these cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem to be the case: if a logical index is negative, its unsigned counterpart will be out of bounds and the corresponding out_chunk_index_vec value will therefore be equal to chunks_.size().

Copy link
Contributor Author

@felipecrv felipecrv May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The overflow check guarantees it's impossible for negative logical indices to become valid indices.

Not really. INT8_MIN becomes 128 when cast to uint8_t, -1 becomes 255, so depending on the chunks, they won't be an out-of-bounds logical indices.

I'm tweaking the comment here and improving the tests.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 15, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 15, 2024
@felipecrv felipecrv merged commit e04f5b4 into apache:main May 15, 2024
36 of 37 checks passed
@felipecrv felipecrv removed the awaiting change review Awaiting change review label May 15, 2024
@felipecrv felipecrv deleted the resolve_many branch May 16, 2024 00:00
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e04f5b4.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…it tests (apache#41561)

### Rationale for this change

I want `ResolveMany` to support me in the implementation of `Take` that doesn't `Concatenate` all the chunks from a `ChunkedArray` `values` parameter.

### What changes are included in this PR?

 - Implementation of `ChunkResolver::ResolveMany()`
 - Addition of missing unit tests for `ChunkResolver`

### Are these changes tested?

Yes. By new unit tests.

### Are there any user-facing changes?

No. `ChunkResolver` is an internal API at the moment (see apache#34535 for future plans).
* GitHub Issue: apache#41560

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants