
Bug: Cannot insert/create a record that contains a large vector #4007

Open

yc-wang00 opened this issue May 9, 2024 · 8 comments
Describe the bug

Description

In SurrealDB, I am trying to insert a record in which one field contains an array of 2_000_000 elements, and I found out it won't let me do this: it returns Api(Ws("IO error: Broken pipe (os error 32)")), or Payload too large when using an HTTP connection. How should I increase the maximum payload limit in this case? Is it designed to be capped for data insertion queries? I've provided a minimal reproducible example below, which just inserts an array of 2_000_000 elements. I hope someone can take a look and give some suggestions. Thanks in advance.

Steps to reproduce

Here's a minimal reproducible example:

use serde::{Deserialize, Serialize};
use surrealdb::engine::remote::ws::Ws;
use surrealdb::opt::auth::Root;
use surrealdb::sql::Thing;
use surrealdb::Surreal;

#[derive(Debug, Deserialize)]
struct Record {
    #[allow(dead_code)]
    id: Thing,
}

#[derive(Debug, Serialize, Deserialize)]
struct TestVec {
    uid: String,
    test_vec: Vec<u32>,
}


#[tokio::main]
async fn main() -> surrealdb::Result<()> {
    // Connect to the server
    let db = Surreal::new::<Ws>("localhost:3301").await?;
    db.use_ns("test").use_db("test").await?;
    db.signin(Root {
        username: "root",
        password: "root",
    })
    .await?;
    // create a vector with length 1_000_000
    // Note: 
    // 1_000_000 works
    // 1_500_000 works
    // 1_750_000 works
    // 1_900_000 doesn't work 
    // 2_000_000 doesn't work
    let mut data: Vec<u32> = Vec::with_capacity(1_900_000);
    for i in 0..1_900_000 {
        data.push(i as u32);
    }
    dbg!(data.len());

    let test_data: TestVec = TestVec {
        uid: "123".to_string(),
        test_vec: data,
    };

    // ERROR! This will error and return: Error: Api(Ws("IO error: Broken pipe (os error 32)"))
    // Create a vector
    let created: Vec<Record> = db
        .create("test_vec")
        .content(&test_data)
        .await?;
    dbg!(created);
    Ok(())
}

Expected behaviour

The data is inserted successfully.

SurrealDB version

1.4.2 for Linux on x86_64

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
yc-wang00 added the bug and triage labels on May 9, 2024
phughk (Contributor) commented May 13, 2024

This question can probably be better answered by @emmanuel-keller, but on the network side we may make the limit on network messages configurable in the future to allow this.

phughk added the topic:indexing and topic:net labels and removed the triage label on May 13, 2024
emmanuel-keller (Contributor) commented May 13, 2024

The Rust SDK currently has a hard limit on the maximum size of a message:

pub(crate) const MAX_MESSAGE_SIZE: usize = 64 << 20; // 64 MiB

Workaround 1 - Use u16:

In this case, a possible option (if the values fit) would be to use u16 rather than u32. That will reduce the size of the statement and let it pass.
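
For illustration, a minimal sketch of the changed struct (assuming every value fits in u16):

#[derive(Debug, Serialize, Deserialize)]
struct TestVec {
    uid: String,
    // Halves the raw per-element size compared to u32, but only
    // works if every value is <= u16::MAX (65_535).
    test_vec: Vec<u16>,
}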

Workaround 2 - Use HTTP:

The HTTP connection has a different size limit. You may try:

// use surrealdb::engine::remote::http::Http;
let db = Surreal::new::<Http>("localhost:3301").await?;

Workaround 3 - array::push

You can split the vector into smaller arrays and use the array::push function to build the final vector.

Follow up

We are going to make this configurable in the SDK. Note that you will also have to increase the value on the server side, using this environment variable:

export SURREAL_WEBSOCKET_MAX_MESSAGE_SIZE=134000000

yc-wang00 (Author) commented May 13, 2024

> I believe you've reached a hard limit in the Rust SDK, related to the maximum size of a message.
>
> pub(crate) const MAX_MESSAGE_SIZE: usize = 64 << 20; // 64 MiB
>
> Would it be possible to use u16? Or do you indeed need u32?

Thanks for your response!

Re (1): Ah, I see. Yes, I do need u32, since this array is used to store tokens (LLM context); the llama3 tokenizer vocabulary size can go up to 128256, which exceeds u16::MAX (65_535) and therefore needs u32.

Re (2): I tried the HTTP connection and it gives me the same error.

Re (3): Could you give more context on how I should use array::push?

Re the follow-up: can I configure the env var right now, or only in a future version?

yc-wang00 (Author) commented

> pub(crate) const MAX_MESSAGE_SIZE: usize = 64 << 20; // 64 MiB

Also, regarding this line: the max size looks like 64 MiB, but u32 × array size (2_000_000) should be only 8 MB if my math is correct. Just wondering if this is expected?

emmanuel-keller (Contributor) commented May 13, 2024

array::push has an operator equivalent: +=

Here is how you can do that:

CREATE foo:1 SET bar = [1, 2, 3];
[[{ bar: [1, 2, 3], id: foo:1 }]]

UPDATE foo:1 SET bar += [4, 5, 6];
[[{ bar: [1, 2, 3, 4, 5, 6], id: foo:1 }]]

SELECT * FROM foo;
[[{ bar: [1, 2, 3, 4, 5, 6], id: foo:1 }]]
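
From the Rust SDK, the same idea can be applied by creating the record with an empty vector and then appending chunks with the += operator. A minimal sketch (the record id test_vec:one and the chunk size of 100_000 are illustrative assumptions, not prescribed values):

// Create the record with an empty vector, then append chunks.
let data: Vec<u32> = (0..1_900_000).collect();
db.query("CREATE test_vec:one SET uid = '123', test_vec = []").await?;
for chunk in data.chunks(100_000) {
    // Each UPDATE statement stays well below the message-size limit.
    db.query("UPDATE test_vec:one SET test_vec += $chunk")
        .bind(("chunk", chunk.to_vec()))
        .await?;
}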

On the server side, you can already change the configuration using the environment variable (since 1.4.x).

> Also, regarding this line: the max size looks like 64 MiB, but u32 × array size (2_000_000) should be only 8 MB if my math is correct. Just wondering if this is expected?

That's a good point. It all depends on the serialisation. Which version of the server and the SDK are you using?

yc-wang00 (Author) commented

> That's a good point. It all depends on the serialisation. Which version of the server and the SDK are you using?

I am using:

  • surrealdb 1.4.2 for Linux on x86_64
  • SDK: surrealdb = "1.4.2" (in Cargo.toml)

emmanuel-keller (Contributor) commented May 13, 2024

The current binary serialization is not optimal for large vectors. Due to the parsing process, the Vec<u32> is currently transformed into a Vec<Value>. The Value structure is an enum which, in turn, points to another enum (Number) that represents numbers, internally stored as 64 bits. These structures are also versioned. So each number is likely to consist of 64 bits + 2 ordinals (8 bits each) + 2 version holders (8 bits each), totaling about 96 bits per element. This brings us close to 24 MB.
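
A quick back-of-the-envelope check of that estimate (the 96-bits-per-element layout is the assumption described above, not the exact wire format):

// Rough size check for the assumed per-element layout.
const BITS_PER_ELEMENT: usize = 64 + 2 * 8 + 2 * 8; // 96 bits
const ELEMENTS: usize = 2_000_000;

fn main() {
    let serialized = ELEMENTS * BITS_PER_ELEMENT / 8; // 24_000_000 bytes ≈ 24 MB
    let raw = ELEMENTS * std::mem::size_of::<u32>();  // 8_000_000 bytes = 8 MB
    println!("serialized ≈ {serialized} B, raw = {raw} B");
}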

That said, that's still under the 64 MiB limit, so we are still investigating this issue.

yc-wang00 (Author) commented

Thanks for your active and quick responses! I really appreciate it. I have tried your workaround and made it work for my case.

I am also curious about the underlying issue, so keep me updated if you find out more later.

Thanks again. I really like this project; keep up the good work!
