tantivy document memory experiment #2371

PSeitz · 2024-04-23T08:40:00Z

Some tests on the memory consumption of TantivyDocument.

Experiment

Parse data set lines and store all Documents in a Vec.
hdfs: 3 fields (timestamp, body, severity), raw dataset 22MB
gh: all fields in a JSON field (dynamic mode), raw dataset 2.3MB

Note: The root-level fields in hdfs are stored as Field ids instead of strings.
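For readers who want to reproduce "Peak Memory" style numbers, here is a minimal sketch of one way to track peak heap usage with a counting global allocator. It is an assumption about the approach, not necessarily how examples/doc_mem.rs measures it; `PeakAlloc` and `measure_peak` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator wrapper that tracks live and peak heap bytes.
struct PeakAlloc;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Add the new allocation to the live byte count and update the peak.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: PeakAlloc = PeakAlloc;

/// Runs `f` and returns its result together with the peak heap growth
/// (in bytes) observed while it ran.
fn measure_peak<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let base = CURRENT.load(Ordering::Relaxed);
    PEAK.store(base, Ordering::Relaxed);
    let out = f();
    let peak = PEAK.load(Ordering::Relaxed);
    (out, peak.saturating_sub(base))
}

fn main() {
    // Stand-in for "parse every line and keep all documents in a Vec".
    let (docs, peak) = measure_peak(|| {
        (0..1000)
            .map(|i| format!("line {}", i))
            .collect::<Vec<String>>()
    });
    println!("Peak Memory {} : \"{} strings\"", peak, docs.len());
}
```

The counters include allocator-visible bytes only (requested sizes, not allocator overhead), so absolute numbers will differ from OS-level measurements.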

Variant1: TantivyDocumentMedVec

Replace Vec in OwnedValue with 32-bit versions of the Vec, and drop Facet and PreTokStr.
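To illustrate where the saving comes from, the sketch below compares the inline size of a standard Vec with a hypothetical vector header that keeps length and capacity as u32. `Vec32Header` is an illustration only and not the mediumvec implementation; the sizes in the comments assume a 64-bit target.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical header of a vector that stores length and capacity as u32.
/// Illustration only: this is not how mediumvec is necessarily implemented.
#[allow(dead_code)]
struct Vec32Header<T> {
    ptr: NonNull<T>,
    len: u32,
    cap: u32,
}

fn main() {
    // 64-bit target: pointer (8) + len (8) + cap (8) = 24 bytes.
    println!("Vec<u8>         = {}", size_of::<Vec<u8>>());
    // 64-bit target: pointer (8) + len (4) + cap (4) = 16 bytes.
    println!("Vec32Header<u8> = {}", size_of::<Vec32Header<u8>>());
}
```

The trade-off is that such a vector can hold at most u32::MAX elements, which is unlikely to matter for the values inside a single document.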

Variant2: DocContainerRef

All nodes and their data are stored in two vecs; a node only references the position of its data there.

#[derive(Default)]
struct OwnedValueRefContainer {
    nodes: mediumvec::Vec32<ValueContainerRef>,
    node_data: mediumvec::Vec32<u8>,
}
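A minimal sketch of the idea, assuming a simplified node type: each node holds a type tag plus an offset and length into one shared byte buffer, so no node owns a heap allocation of its own. `Container`, `NodeRef`, and `Kind` are hypothetical stand-ins, and std Vec is used instead of mediumvec::Vec32 to keep the example self-contained.

```rust
/// Hypothetical type tag for a node.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Str,
    U64,
}

/// Hypothetical node: a type tag plus a byte range into the shared buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeRef {
    kind: Kind,
    start: u32,
    len: u32,
}

#[derive(Default)]
struct Container {
    nodes: Vec<NodeRef>,
    node_data: Vec<u8>,
}

impl Container {
    /// Appends the payload to the shared buffer and records where it lives.
    fn push_bytes(&mut self, kind: Kind, bytes: &[u8]) -> usize {
        let start = self.node_data.len() as u32;
        self.node_data.extend_from_slice(bytes);
        self.nodes.push(NodeRef { kind, start, len: bytes.len() as u32 });
        self.nodes.len() - 1
    }

    fn push_str(&mut self, s: &str) -> usize {
        self.push_bytes(Kind::Str, s.as_bytes())
    }

    fn push_u64(&mut self, v: u64) -> usize {
        self.push_bytes(Kind::U64, &v.to_le_bytes())
    }

    /// Returns the raw payload bytes of node `idx`.
    fn bytes(&self, idx: usize) -> &[u8] {
        let n = self.nodes[idx];
        let start = n.start as usize;
        &self.node_data[start..start + n.len as usize]
    }

    fn get_str(&self, idx: usize) -> Option<&str> {
        match self.nodes[idx].kind {
            Kind::Str => std::str::from_utf8(self.bytes(idx)).ok(),
            _ => None,
        }
    }

    fn get_u64(&self, idx: usize) -> Option<u64> {
        let b = self.bytes(idx);
        if self.nodes[idx].kind != Kind::U64 || b.len() != 8 {
            return None;
        }
        let mut arr = [0u8; 8];
        arr.copy_from_slice(b);
        Some(u64::from_le_bytes(arr))
    }
}

fn main() {
    let mut c = Container::default();
    let severity = c.push_str("INFO");
    let timestamp = c.push_u64(42);
    println!("{:?} {:?}", c.get_str(severity), c.get_u64(timestamp));
}
```

A whole document then costs two allocations regardless of how many values it holds, at the price of an extra indirection and more involved mutation.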

Results

cargo run --example doc_mem
[examples/doc_mem.rs:21:5] std::mem::size_of::<TantivyDocument>() = 24
[examples/doc_mem.rs:22:5] std::mem::size_of::<DocContainerRef>() = 48
[examples/doc_mem.rs:23:5] std::mem::size_of::<OwnedValue>() = 48
[examples/doc_mem.rs:24:5] std::mem::size_of::<OwnedValueMedVec>() = 24
[examples/doc_mem.rs:25:5] std::mem::size_of::<ValueContainerRef>() = 12
[examples/doc_mem.rs:26:5] std::mem::size_of::<mediumvec::vec32::Vec32<u8>>() = 16
Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec "
Peak Memory 27555817 : "hdfs DocContainerRef "
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec "
Peak Memory 3533176 : "gh DocContainerRef "

Conclusion

There should be some easy gains by using 32-bit vecs, which use only 16 bytes instead of 24 bytes.
DocContainerRef could provide additional gains, but adds some complexity.
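To put the gains in relative terms, the peak memory figures from the Results above can be turned into percentage reductions against the TantivyDocument baseline. The helper `reduction_pct` is a hypothetical name; the inputs are copied from the Results. This works out to roughly 32% and 35% less peak memory for hdfs, and roughly 29% and 46% for gh.

```rust
/// Percentage reduction of `variant` relative to `baseline`.
fn reduction_pct(baseline: u64, variant: u64) -> f64 {
    (1.0 - variant as f64 / baseline as f64) * 100.0
}

fn main() {
    // (label, baseline TantivyDocument peak, variant peak), in bytes.
    let rows = [
        ("hdfs TantivyDocumentMedVec", 42_308_307u64, 28_708_435u64),
        ("hdfs DocContainerRef", 42_308_307, 27_555_817),
        ("gh TantivyDocumentMedVec", 6_555_583, 4_668_215),
        ("gh DocContainerRef", 6_555_583, 3_533_176),
    ];
    for (name, base, variant) in rows {
        println!("{}: {:.1}% less peak memory", name, reduction_pct(base, variant));
    }
}
```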

quickwit-oss/quickwit#4890

PSeitz · 2024-05-20T02:39:09Z

Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec"
Peak Memory 25155841 : "hdfs DocContainerRef"
Peak Memory 25456237 : "hdfs CompactDoc" // Current version in PR https://github.com/quickwit-oss/tantivy/pull/2402
Peak Memory 27857662 : "hdfs RkyvDoc"         // zero deserialization rkyv
Peak Memory 21055858 : "hdfs PostcardDoc" // postcard serialized
Peak Memory 20106059 : "hdfs ZstdDoc"         // postcard + Zstd
Peak Memory 22555843 : "hdfs BinarySerializable"
Peak Memory 25309370 : "hdfs JsonSerialized"
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec"
Peak Memory 2735326 : "gh DocContainerRef"
Peak Memory 2543967 : "gh CompactDoc"
Peak Memory 3274042 : "gh RkyvDoc"
Peak Memory 2197615 : "gh PostcardDoc"
Peak Memory 862839 : "gh ZstdDoc"
Peak Memory 2325673 : "gh BinarySerialized"
Peak Memory 2508695 : "gh JsonSerialized"

tantivy document memory test

cf1460d

PSeitz force-pushed the check_doc_mem branch from de6c719 to cf1460d Compare April 23, 2024 08:49

PSeitz changed the title ~~tantivy document memory test~~ tantivy document memory experiment Apr 23, 2024
