feat: add bitpack encoding for LanceV2 #2333

albertlockett · 2024-05-13T20:31:11Z

Work in progress

TODO

improve tests
support signed types
handle case where buffer is all 0s
handle case where num compressed bits = num uncompressed bits

github-actions · 2024-05-13T20:31:30Z

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

westonpace

+1 Nice work so far. This looks like the correct general approach to me. Still some details to work out but nothing looks out of place.

westonpace · 2024-05-13T21:04:07Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        _all_null: &mut bool,
+    ) {
+        // TODO -- not sure if this is correct
+        buffers[0].0 = self.uncompressed_bits_per_value / 8 * num_rows as u64;


This works as long as uncompressed_bits_per_value is a multiple of 8 and, for now, it should always be so. If we have to start handling cases where it isn't we will need to update this.

albertlockett · 2024-05-30

I've added a debug assert for now.
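For illustration, the byte-length computation with the guard discussed above might look roughly like this. `decoded_byte_len` is a hypothetical free function, not a name from the PR, and the exact assert added in the PR is not shown in this thread.

```rust
/// Hypothetical helper (not from the PR): number of bytes needed to hold
/// `num_rows` decoded values of `uncompressed_bits_per_value` bits each.
fn decoded_byte_len(uncompressed_bits_per_value: u64, num_rows: u32) -> u64 {
    // The integer division below silently truncates for widths that are
    // not whole bytes, so fail loudly in debug builds if that ever happens.
    debug_assert!(
        uncompressed_bits_per_value % 8 == 0,
        "uncompressed_bits_per_value must be a multiple of 8, got {}",
        uncompressed_bits_per_value
    );
    uncompressed_bits_per_value / 8 * num_rows as u64
}
```

With this shape, three `u16` values need `decoded_byte_len(16, 3)`, which is 6 bytes.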

westonpace · 2024-05-13T21:04:59Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+    }
+
+    fn decode_into(&self, rows_to_skip: u32, num_rows: u32, dest_buffers: &mut [BytesMut]) {
+        let mut bytes_to_skip = rows_to_skip as u64 * self.bits_per_value / 8;


rows_to_skip * self.bits_per_value isn't always going to be a multiple of 8. What happens when it isn't?

albertlockett · 2024-05-28

Yeah, this logic wasn't correct. Reworked the decode_into method.
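To make the alignment problem concrete, here is a minimal sketch of reading packed values from an arbitrary row offset by tracking a bit position instead of a byte position. It assumes values are packed least-significant-bit first with no padding between them; that layout, and the name `unpack_rows`, are assumptions for illustration and may not match the reworked decode_into.

```rust
/// Hypothetical helper (not from the PR): reads `num_rows` values of
/// `bits_per_value` bits each from `src`, starting at row `rows_to_skip`.
/// Assumes LSB-first packing with no padding between values.
fn unpack_rows(src: &[u8], bits_per_value: u64, rows_to_skip: u32, num_rows: u32) -> Vec<u64> {
    debug_assert!((1..=64).contains(&bits_per_value));
    let mut out = Vec::with_capacity(num_rows as usize);
    for row in 0..num_rows as u64 {
        // The first bit of this value; it may fall in the middle of a byte.
        let start_bit = (rows_to_skip as u64 + row) * bits_per_value;
        let mut value = 0u64;
        for i in 0..bits_per_value {
            let bit = start_bit + i;
            // Split the absolute bit position into a byte index and a bit
            // index within that byte.
            let byte = src[(bit / 8) as usize];
            let bit_val = (byte >> (bit % 8)) & 1;
            value |= (bit_val as u64) << i;
        }
        out.push(value);
    }
    out
}
```

For example, the values 1 through 5 packed at 3 bits each occupy the two bytes `[0xD1, 0x58]` under this layout. Skipping 3 rows starts at bit 9, which is not byte-aligned, and the bit-by-bit loop still recovers 4 and 5. A production decoder would likely process whole bytes or words at a time; the loop here favors clarity.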

westonpace · 2024-05-13T21:06:14Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        // pre-add enough capacity to the buffer to hold all the values we're about to put in it
+        let capacity_to_add = dst.capacity() as i64 - dst.len() as i64 + num_rows as i64;
+        if capacity_to_add > 0 {
+            let bytes_to_add =
+                capacity_to_add as usize * self.uncompressed_bits_per_value as usize / 8;
+            dst.extend((0..bytes_to_add).into_iter().map(|_| 0));
+        }


You shouldn't need to do this. As long as update_capacity is returning a valid value then you should be able to safely assume the capacity is already there.

That being said, it doesn't really hurt to have this code. Maybe simpler to just put a debug_assert checking that there is enough capacity.
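A sketch of the debug_assert alternative, using `Vec<u8>` as a stand-in for `BytesMut` so the example has no external dependencies. `append_decoded` is a hypothetical name, not a function from the PR.

```rust
/// Hypothetical helper (not from the PR): appends already-decoded bytes to
/// `dst`, assuming the caller reserved enough space up front.
fn append_decoded(dst: &mut Vec<u8>, decoded: &[u8]) {
    // Instead of growing the buffer here, check in debug builds that the
    // capacity promised by the earlier sizing step is actually available.
    debug_assert!(
        dst.capacity() - dst.len() >= decoded.len(),
        "destination buffer is missing {} bytes of capacity",
        decoded.len() - (dst.capacity() - dst.len())
    );
    dst.extend_from_slice(decoded);
}
```

In release builds the assert compiles away, and the extend call would still grow the buffer if the capacity was wrong, so the check only surfaces sizing bugs during testing.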

westonpace · 2024-05-13T21:07:03Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        let mut mask = 0u64;
+        for _ in 0..self.bits_per_value {
+            mask = mask << 1 | 1;
+        }


I think this means you have a limit of 64 bits per value. This is probably fine but you should add a debug_assert somewhere verifying this.
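One possible shape for the mask computation with that check, as a hedged sketch. `low_bits_mask` is a hypothetical name; the shift form needs a special case at 64 because `1u64 << 64` overflows.

```rust
/// Hypothetical helper (not from the PR): a mask with the low
/// `bits_per_value` bits set.
fn low_bits_mask(bits_per_value: u64) -> u64 {
    // A u64 mask cannot represent more than 64 bits per value.
    debug_assert!(
        bits_per_value <= 64,
        "bits_per_value must be at most 64, got {}",
        bits_per_value
    );
    if bits_per_value >= 64 {
        // Shifting a u64 by 64 would overflow, so handle full width directly.
        u64::MAX
    } else {
        (1u64 << bits_per_value) - 1
    }
}
```

For 3 bits per value this yields `0b111`, the same result the loop in the diff produces.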

westonpace · 2024-05-13T21:08:32Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+    fn num_buffers(&self) -> u32 {
+        // TODO ask weston what this is about
+        1
+    }


1 is correct. There are some cases (e.g. dictionary encoding) where we encode 1 input buffer into 2 output buffers.

westonpace · 2024-05-13T21:48:37Z

protos/encodings.proto

+
+  // additional metadata that should be present if bitpacking is used
+  optional BitpackMeta bitpack_meta = 4;
+}


Minor nit: I think of bitpacking less as an extension of Flat and more as its own encoding that has another array encoding inside of it (like fixed_size_list). I don't know of any concrete reason that's better but I like thinking of these as small composable pieces rather than one piece with lots of options.

albertlockett · 2024-05-22

Good call, made this change.

westonpace · 2024-05-13T21:50:21Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+        let mut dest = vec![BytesMut::new()];
+        unit.decode_into(0, 7, &mut dest);
+
+        println!("{:?}", dest);


Nit: convert to an assert when ready to move out of draft.

albertlockett · 2024-05-28

I deleted this; we have other tests covering this code path.

westonpace · 2024-05-13T21:51:41Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+        let mut packed_arrays = vec![];
+        for arr in arrays {
+            let packed = pack_array(arr.clone(), num_bits)?;
+            packed_arrays.push(packed.into());
+        }
+
+        let data_type = arrays[0].data_type();
+        let bits_per_value = 8 * data_type.byte_width() as u64;
+
+        Ok(EncodedBuffer {
+            bits_per_value: num_bits,
+            parts: packed_arrays,
+            bitpack_meta: Some(pb::BitpackMeta {
+                uncompressed_bits_per_value: bits_per_value,
+            }),
+        })


Do we want to conditionally bitpack based on whether num_bits is less than the "native num bits", if that makes sense? E.g. if a number is using the full range then don't bitpack?

albertlockett · 2024-05-28

Sure -- made this change

westonpace · 2024-05-13T21:52:50Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+    T: ArrowPrimitiveType,
+    T::Native: PrimInt + AsPrimitive<u64>,
+{
+    let max = arrow::compute::bit_or(arr);


Well this is convenient :)
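The reason the bitwise OR is convenient: the OR of all values has the same highest set bit as the maximum value, so its bit length is the number of bits needed per value. A minimal sketch over a plain `u64` slice rather than an Arrow array; both function names are hypothetical, and `should_bitpack` mirrors the conditional check discussed earlier in this review.

```rust
/// Hypothetical helper (not from the PR): bits needed to represent every
/// value in `values`. Returns 0 when all values are zero, a case the PR
/// description lists as still to be handled.
fn num_bits_needed(values: &[u64]) -> u64 {
    // OR-ing everything together keeps the highest set bit of the maximum.
    let ored = values.iter().fold(0u64, |acc, v| acc | v);
    (64 - ored.leading_zeros()) as u64
}

/// Hypothetical helper: only bitpack when it actually saves space.
fn should_bitpack(num_bits: u64, native_bits: u64) -> bool {
    num_bits < native_bits
}
```

For the values 1, 2 and 5 the OR is 7, so 3 bits per value suffice, and a `u16` column with those values would be worth packing.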

westonpace · 2024-05-13T21:53:59Z

rust/lance-encoding/src/encodings/physical/buffers.rs

+    let buffers = data.buffers();
+    let mut packed_buffers = vec![];
+    for buffer in buffers {
+        let packed_buffer = pack_bits(&buffer, num_bits, byte_len);
+        packed_buffers.push(packed_buffer);
+    }
+    packed_buffers.concat()


We only want to pack the values buffer. I think this will also try to pack the validity buffer.

albertlockett · 2024-05-28

I think we're actually OK here. This gets passed the result of array.to_data() here:

lance/rust/lance-encoding/src/encodings/physical/bitpack.rs

Lines 165 to 168 in d18b7df

match arr.data_type() {
    DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => Ok(
        pack_buffers(arr.to_data(), num_bits, arr.data_type().byte_width()),
    ),

And the validity buffer doesn't get included in that result. For example:

let arr = UInt16Array::from(vec![Some(1), None, Some(2)]);
let data = arr.to_data();
let buffers = data.buffers();
for buffer in buffers {
    println!("{:?}", buffer);
}

prints:

Buffer { data: Bytes { ptr: 0x124704e80, len: 6, data: [1, 0, 0, 0, 2, 0] }, ptr: 0x124704e80, length: 6 }

broccoliSpicy · 2024-05-18T16:20:18Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+    // if there's a partial byte at the end, we need to allocate one more byte
+    if (src_items * num_bits as usize) % 8 != 0 {
+        dst_bytes_total += 1;
+    }


Can we do a divide_round_up here? Something like (src_items * num_bits as usize + 7) / 8.

broccoliSpicy · 2024-05-18T16:23:31Z

rust/lance-encoding/src/encodings/physical/bitpack.rs

+            // we wrote a partial byte
+            if bit_len % num_bits != 0 {
+                partial_bytes_written += 1;
+            }


Perhaps we can have a general divide_round_up function.
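A general helper along those lines could be sketched as follows. Both function names are hypothetical; newer Rust toolchains also provide `div_ceil` on unsigned integers, which could replace the manual version.

```rust
/// Hypothetical helper (not from the PR): integer division rounded up.
fn div_round_up(numerator: usize, denominator: usize) -> usize {
    debug_assert!(denominator > 0, "denominator must be non-zero");
    (numerator + denominator - 1) / denominator
}

/// Bytes needed to store `src_items` values of `num_bits` bits each,
/// including a final partial byte if the total is not a multiple of 8.
fn packed_byte_len(src_items: usize, num_bits: usize) -> usize {
    div_round_up(src_items * num_bits, 8)
}
```

Five 3-bit values take 15 bits, so `packed_byte_len(5, 3)` is 2, which matches the "allocate one more byte" branch in the diff without the separate modulo check.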

albertlockett requested a review from westonpace May 13, 2024 20:31

github-actions bot added the enhancement New feature or request label May 13, 2024

westonpace reviewed May 13, 2024

albertlockett changed the title ~~feat: Add bitpac encoding for LanceV2~~ feat: Add bitpack encoding for LanceV2 May 13, 2024

westonpace mentioned this pull request May 14, 2024

Lance File Format Version 2 (technically v0.3) #1929

Open

77 tasks

broccoliSpicy reviewed May 18, 2024

albertlockett force-pushed the lw-compressions branch 2 times, most recently from d18b7df to 47afcba Compare May 28, 2024 23:23

albertlockett changed the title ~~feat: Add bitpack encoding for LanceV2~~ feat: add bitpack encoding for LanceV2 May 28, 2024

albertlockett added 4 commits May 30, 2024 10:58

feat: Add bitpacking encoding

fad3108

fixup code

5fd3190

fix bug packing multiple pages

0a3fcb6

lint and clippy

e07e788

albertlockett force-pushed the lw-compressions branch from 056e140 to e07e788 Compare May 30, 2024 13:59

albertlockett added 4 commits May 30, 2024 16:05

some code cleanup

b593760

some code cleanup

4b9051b

fmt and clippy

de94eb1

fix order of compression vs bitpacking in buffer encoding strategy

18c4b5a
