-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add bitpack encoding for LanceV2 #2333
base: main
Are you sure you want to change the base?
Conversation
ACTION NEEDED Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Nice work so far. This looks like the correct general approach to me. Still some details to work out but nothing looks out of place.
_all_null: &mut bool, | ||
) { | ||
// TODO -- not sure if this is correct | ||
buffers[0].0 = self.uncompressed_bits_per_value / 8 * num_rows as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works as long as uncompressed_bits_per_value
is a multiple of 8 and, for now, it should always be so. If we have to start handling cases where it isn't we will need to update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a debug assert for now
} | ||
|
||
fn decode_into(&self, rows_to_skip: u32, num_rows: u32, dest_buffers: &mut [BytesMut]) { | ||
let mut bytes_to_skip = rows_to_skip as u64 * self.bits_per_value / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rows_to_skip * self.bits_per_value
isn't always going to be a multiple of 8. What happens when it isn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this logic wasn't correct. Reworked the decode_into method
// pre-add enough capacity to the buffer to hold all the values we're about to put in it | ||
let capacity_to_add = dst.capacity() as i64 - dst.len() as i64 + num_rows as i64; | ||
if capacity_to_add > 0 { | ||
let bytes_to_add = | ||
capacity_to_add as usize * self.uncompressed_bits_per_value as usize / 8; | ||
dst.extend((0..bytes_to_add).into_iter().map(|_| 0)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need to do this. As long as update_capacity
is returning a valid value then you should be able to safely assume the capacity is already there.
That being said, it doesn't really hurt to have this code. Maybe simpler to just put a debug_assert
checking that there is enough capacity.
let mut mask = 0u64; | ||
for _ in 0..self.bits_per_value { | ||
mask = mask << 1 | 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this means you have a limit of 64 bits per value. This is probably fine but you should add a debug_assert
somewhere verifying this.
fn num_buffers(&self) -> u32 { | ||
// TODO ask weston what this is about | ||
1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is correct. There are some cases (e.g. dictionary encoding) where we encode 1 input buffer into 2 output buffers.
|
||
// additional metadata that should be present if bitpacking is used | ||
optional BitpackMeta bitpack_meta = 4; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I think of bitpacking less as an extension of Flat
and more as it's own encoding that has another array encoding inside of it (like fixed_size_list
). I don't know of any concrete reason that's better but I like thinking of these as small composable pieces rather than one piece with lots of options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, made this change
let mut dest = vec![BytesMut::new()]; | ||
unit.decode_into(0, 7, &mut dest); | ||
|
||
println!("{:?}", dest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: convert to an assert when ready to move out of draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted this.. we have other tests covering this code path
let mut packed_arrays = vec![]; | ||
for arr in arrays { | ||
let packed = pack_array(arr.clone(), num_bits)?; | ||
packed_arrays.push(packed.into()); | ||
} | ||
|
||
let data_type = arrays[0].data_type(); | ||
let bits_per_value = 8 * data_type.byte_width() as u64; | ||
|
||
Ok(EncodedBuffer { | ||
bits_per_value: num_bits, | ||
parts: packed_arrays, | ||
bitpack_meta: Some(pb::BitpackMeta { | ||
uncompressed_bits_per_value: bits_per_value, | ||
}), | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to conditionally bitpack based on the whether num_bits
is less than "native num bits" if that makes sense? E.g. if a number is using the full range then don't bitpack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure -- made this change
T: ArrowPrimitiveType, | ||
T::Native: PrimInt + AsPrimitive<u64>, | ||
{ | ||
let max = arrow::compute::bit_or(arr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well this is convenient :)
let buffers = data.buffers(); | ||
let mut packed_buffers = vec![]; | ||
for buffer in buffers { | ||
let packed_buffer = pack_bits(&buffer, num_bits, byte_len); | ||
packed_buffers.push(packed_buffer); | ||
} | ||
packed_buffers.concat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only want to pack the values buffer, I think this will also try and pack the validity buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're actually OK here. This gets passed the result of array.to_data()
here:
lance/rust/lance-encoding/src/encodings/physical/bitpack.rs
Lines 165 to 168 in d18b7df
match arr.data_type() { | |
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => Ok( | |
pack_buffers(arr.to_data(), num_bits, arr.data_type().byte_width()), | |
), |
And the validity buffer doesn't get included in that result. For example:
let arr = UInt16Array::from(vec![Some(1), None, Some(2)]);
let data = arr.to_data();
let buffers = data.buffers();
for buffer in buffers {
println!("{:?}", buffer);
}
prints:
Buffer { data: Bytes { ptr: 0x124704e80, len: 6, data: [1, 0, 0, 0, 2, 0] }, ptr: 0x124704e80, length: 6 }
// if if there's a partial byte at the end, we need to allocate one more byte | ||
if (src_items * num_bits as usize) % 8 != 0 { | ||
dst_bytes_total += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do a divide_round_up here? like (src_items * num_bits as usize + 7) / 8;
// we wrote a partial byte | ||
if bit_len % num_bits != 0 { | ||
partial_bytes_written += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can have a general divide_round_up
function
d18b7df
to
47afcba
Compare
056e140
to
e07e788
Compare
Work in progress
TODO