Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] How do I sort RecordBatch using SortOptions? #41591

Open
caiwanli opened this issue May 8, 2024 · 8 comments
Open

[C++] How do I sort RecordBatch using SortOptions? #41591

caiwanli opened this issue May 8, 2024 · 8 comments
Labels
Component: C++ Type: usage Issue is a user question

Comments

@caiwanli
Copy link

caiwanli commented May 8, 2024

Describe the usage question you have. Please include as many useful details as possible.

I want to sort a RecordBatch, and I noticed the SortOptions class in compute, so I wrote the following code:

arrow::Int32Builder int32builder;
int32_t ids[10] = {0,1,2,3,4,5,6,7,8,9};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(ids, 10));
std::shared_ptr<arrow::Array> id;
ARROW_ASSIGN_OR_RAISE(id, int32builder.Finish());
int32_t ages[10] = {28,34,25,26,24,23,25,23,30,22};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(ages, 10));
std::shared_ptr<arrow::Array> age;
ARROW_ASSIGN_OR_RAISE(age, int32builder.Finish());
int32_t salarys[10] = {20000,38000,23000,22000,5000,8000,9000,18000,45000,4000};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(salarys, 10));
std::shared_ptr<arrow::Array> salary;
ARROW_ASSIGN_OR_RAISE(salary, int32builder.Finish());
int32_t heights[10] = {165,178,183,180,177,174,176,173,168,171};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(heights, 10));
std::shared_ptr<arrow::Array> height;
ARROW_ASSIGN_OR_RAISE(height, int32builder.Finish());
arrow::StringBuilder utf8builder;
std::vector<std::string> names = {"cwl_0","hhq_1","lsc_2","zyn_3","yyj_4","zyy_5","chj_6","hcf_7","wk_8","zw_9"};
ARROW_RETURN_NOT_OK(utf8builder.AppendValues(names));
std::shared_ptr<arrow::Array> name;
ARROW_ASSIGN_OR_RAISE(name, utf8builder.Finish());
std::shared_ptr<arrow::Field> field_id, field_name, field_age, field_salary, field_height;
std::shared_ptr<arrow::Schema> schema;
field_id = arrow::field("ID", arrow::int32());
field_name = arrow::field("Name", arrow::utf8());
field_age = arrow::field("Age", arrow::int32());
field_salary = arrow::field("Salary", arrow::int32());
field_height = arrow::field("Height", arrow::int32());
schema = arrow::schema({field_id, field_name, field_age, field_salary, field_height});
std::shared_ptr<arrow::RecordBatch> rbatch;
rbatch = arrow::RecordBatch::Make(schema, 10, {id, name, age, salary, height});
std::cout << "===================Sort====================" << std::endl;
arrow::Datum res;
std::vector<arrow::compute::SortKey> sort_keys;
arrow::compute::SortKey age_key("Age", arrow::compute::SortOrder::Ascending);
arrow::compute::SortKey salary_key("Salary", arrow::compute::SortOrder::Ascending);
arrow::compute::SortKey name_key("Name", arrow::compute::SortOrder::Descending);
sort_keys.emplace_back(age_key);
sort_keys.emplace_back(salary_key);
sort_keys.emplace_back(name_key);
arrow::compute::SortOptions sort_options(std::move(sort_keys));
ARROW_ASSIGN_OR_RAISE(res, arrow::compute::CallFunction("order", {rbatch}, &sort_options));
rbatch =  res.record_batch();
std::cout << "Datum kind: " << rbatch->ToString() << "++++" << rbatch->num_rows()<< std::endl;
return arrow::Status::OK();

But here I'm not sure if the usage is correct, and I don't know what function name to pass to CallFunction. I didn't find any sorting-related instructions in the official documentation. How should I implement sorting for RecordBatch?

Component(s)

C++

@caiwanli caiwanli added the Type: usage Issue is a user question label May 8, 2024
@kou kou changed the title How do I sort RecordBatch using SortOptions? [C++] How do I sort RecordBatch using SortOptions? May 8, 2024
@kou
Copy link
Member

kou commented May 8, 2024

Could you use sort_indices not order?
See also: https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions

We have arrow::compute::SortIndices() as a shortcut:

/// \brief Return the indices that would sort an input in the
/// specified order. Input is one of array, chunked array record batch
/// or table.
///
/// Perform an indirect sort of input. The output array will contain
/// indices that would sort an input, which would be the same length
/// as input. Nulls will be stably partitioned to the start or to the end
/// of the output depending on SortOrder::null_placement.
///
/// For example given input (table) = {
/// "column1": [[null, 1], [ 3, null, 2, 1]],
/// "column2": [[ 5], [3, null, null, 5, 5]],
/// } and options = {
/// {"column1", SortOrder::Ascending},
/// {"column2", SortOrder::Descending},
/// }, the output will be [5, 1, 4, 2, 0, 3].
///
/// \param[in] datum array, chunked array, record batch or table to sort
/// \param[in] options options
/// \param[in] ctx the function execution context, optional
/// \return offsets indices that would sort a table
ARROW_EXPORT
Result<std::shared_ptr<Array>> SortIndices(const Datum& datum, const SortOptions& options,
ExecContext* ctx = NULLPTR);

@caiwanli
Copy link
Author

caiwanli commented May 9, 2024

After using sort_indices, I indeed obtained a sorted index array. How should I use this array to sort the RecordBatch?Because I'm going to sort a RecordBatch.

@kou
Copy link
Member

kou commented May 9, 2024

You can use take for it: https://arrow.apache.org/docs/cpp/compute.html#selections

@kou
Copy link
Member

kou commented May 9, 2024

/// \brief Take from an array of values at indices in another array
///
/// The output array will be of the same type as the input values
/// array, with elements taken from the values array at the given
/// indices. If an index is null then the taken element will be null.
///
/// For example given values = ["a", "b", "c", null, "e", "f"] and
/// indices = [2, 1, null, 3], the output will be
/// = [values[2], values[1], null, values[3]]
/// = ["c", "b", null, null]
///
/// \param[in] values datum from which to take
/// \param[in] indices which values to take
/// \param[in] options options
/// \param[in] ctx the function execution context, optional
/// \return the resulting datum
ARROW_EXPORT
Result<Datum> Take(const Datum& values, const Datum& indices,
const TakeOptions& options = TakeOptions::Defaults(),
ExecContext* ctx = NULLPTR);

@caiwanli
Copy link
Author

caiwanli commented May 9, 2024

Thank you very much for your answer, it's very helpful for me. But now I have another question: If I have multiple (let's say 4) sorted RecordBatches, how can I merge them into one sorted RecordBatch? Let's assume that the RecordBatches are sorted according to a specified arrow::compute::SortKey.

@kou
Copy link
Member

kou commented May 9, 2024

@caiwanli
Copy link
Author

caiwanli commented May 9, 2024

I think you misunderstood my meaning. What I meant is: Suppose I have 2 RBs (sorted by Age in ascending order, sorted by Salary in descending order), as shown below:
1715237107648

So the result after merging:
1715237313834

@kou
Copy link
Member

kou commented May 9, 2024

How about arrow::Table::FromRecordBatches() -> sort_indices -> take?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Type: usage Issue is a user question
Projects
None yet
Development

No branches or pull requests

2 participants