Speed up bulk insert #2290
-
I come back to the community with a new question (two, actually), and I'm also open to any general remark. The really relevant function here is `insert_history_values`. I currently use the Docker image edgedb/edgedb:1-alpha7, and I run the script below twice: first with `skip_init=False`, then with `skip_init=True`.

```python
import asyncio
import sys
from datetime import datetime, timedelta
from typing import List, Tuple
import pandas as pd
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
import edgedb
import ujson
from aiostream import stream
from edgedb import AsyncIOConnection, AsyncIOPool, InvalidReferenceError

# Statements to (re)initialize the schema. The drop statements fail with
# InvalidReferenceError on a fresh database; init_base() below logs that
# and continues.
init_esdl = [
    """drop type MyTable;""",
    """drop type SubTable;""",
    """drop scalar type posfloat64;""",
    """create scalar type posfloat64 extending float64 {
        create constraint min_value(0);
    }""",
    """create type SubTable {
        create required property name -> str;
    };""",
    """create type MyTable {
        create required link sub -> SubTable;
        create required property timestamp -> datetime;
        create required property value -> posfloat64;
        create index on (.timestamp);
    };""",
    """INSERT SubTable {
        name := "NAME"
    };""",
]

async def batch(iterable, n=1):
    """Iterates asynchronously over batches of an iterable.

    batch([1, 2, 3, 4, 5, 6, 7], 3) -> [[1, 2, 3], [4, 5, 6], [7]]
    """
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx : min(ndx + n, length)]

async def check_if_history_is_in_db(
    connection: AsyncIOConnection, values: List[Tuple[str, float]]
) -> bool:
    """Heuristic check: the batch is assumed to already be in the DB if enough
    records exist in its date range and its first and last (timestamp, value)
    pairs are found."""
    query = f"""
    SELECT (
        {len(values)} <= count( MyTable
            FILTER MyTable.timestamp >= <datetime>$first_date
            AND MyTable.timestamp <= <datetime>$last_date)
    ) AND (
        0 < count( MyTable {{ value }}
            FILTER MyTable.timestamp = <datetime>$first_date
            AND MyTable.value = <posfloat64>$first_value)
    ) AND (
        0 < count( MyTable {{ value }}
            FILTER MyTable.timestamp = <datetime>$last_date
            AND MyTable.value = <posfloat64>$last_value)
    );"""
    start = datetime.now()
    res = await connection.query_one(
        query,
        first_date=datetime.fromisoformat(f"{values[0][0]}:00"),
        first_value=values[0][1],
        last_date=datetime.fromisoformat(f"{values[-1][0]}:00"),
        last_value=values[-1][1],
    )
    print(
        f"Checking if it's already in DB took {(datetime.now() - start).total_seconds()} s."
    )
    return res

async def insert_history_values(
    connection: AsyncIOConnection,
    values: List[Tuple[str, float]],
    batch_size: int = 50_000,
) -> None:
    # One INSERT per unpacked JSON array element, batch_size elements per query.
    dml = """
    WITH SNAME := (SELECT SubTable FILTER .name = <str>$subname),
    FOR x IN {
        json_array_unpack(<json>$data)
    }
    UNION (INSERT MyTable {
        sub := SNAME,
        timestamp := <datetime>x[0],
        value := <posfloat64>x[1]
    });
    """
    n_sub_queries = (len(values) - 1) // batch_size + 1
    print(f"Splitting the query into {n_sub_queries} sub-queries.")
    start = datetime.now()
    xs = stream.enumerate(batch(values, batch_size), start=1)
    async with xs.stream() as streamer:
        async with connection.transaction():
            try:
                async for i, batch_values in streamer:
                    print(
                        f"Running query {i}/{n_sub_queries} of {len(dml)} characters and {len(batch_values)} values."
                    )
                    json_batch_values = ujson.dumps(batch_values)
                    start_sub = datetime.now()
                    await connection.query(dml, data=json_batch_values, subname="NAME")
                    print(
                        f"DML query {i}/{n_sub_queries} completed in {(datetime.now() - start_sub).total_seconds()} secs!"
                    )
            except Exception as e:
                print(e, file=sys.stderr)
                raise
    print(f"Done in {(datetime.now() - start).total_seconds()} secs!")

async def insert_history_into_db(connection: AsyncIOConnection, values) -> bool:
    try:
        is_in_db = await check_if_history_is_in_db(connection=connection, values=values)
        print(
            f"History of {values[0][0]} has{' ' if is_in_db else ' NOT '}already been inserted."
        )
        if not is_in_db:
            await insert_history_values(connection=connection, values=values)
    except Exception as e:
        print(e, file=sys.stderr)
        return False
    else:
        return True

async def init_base(connection: AsyncIOConnection) -> None:
    print("Initializing the DB!")
    for s in init_esdl:
        print(s)
        try:
            await connection.query(s)
        except InvalidReferenceError as e:
            print(e, file=sys.stderr)

async def init_data(connection: AsyncIOConnection) -> None:
    """Create `r` records per day for `duration` days."""
    r = 60_000
    duration = 5
    begin = datetime(2015, 1, 1)
    days = [begin + timedelta(days=d) for d in range(duration)]
    rs = RandomState(MT19937(SeedSequence(123456789)))
    data = [
        [
            (t.isoformat(), n)
            for t, n in zip(
                pd.date_range(
                    day, day + timedelta(days=1, microseconds=-1), periods=r, tz="utc"
                ).to_pydatetime(),
                rs.uniform(low=23, high=42, size=(r,)),
            )
        ]
        for day in days
    ]
    for day in data:
        await insert_history_into_db(connection=connection, values=day)

async def init(pool: AsyncIOPool, *, skip_init=True) -> None:
    async with pool.acquire() as connection:
        if not skip_init:
            await init_base(connection)
        await init_data(connection)

async def main(*, skip_init=True) -> None:
    async with await edgedb.create_async_pool(
        min_size=2, host="localhost", port=5656, user="edgedb", database="edgedb"
    ) as pool:
        await init(pool, skip_init=skip_init)

if __name__ == "__main__":
    asyncio.run(main(skip_init=False))
    asyncio.run(main(skip_init=True))
```

The output shows that the bulk inserts take a long time. Thanks in advance, and I hope I made it easy for anyone who is interested to help.
-
We'll implement batched `executemany` soon a-la asyncpg, so this'll be much faster. Meanwhile, the best way is to open a bunch of concurrent connections and chunk data in. Example: https://github.com/edgedb/webapp-bench/blob/366a9a74f59442cf279b02ab482d4ff48c6b5b2b/_edgedb/loaddata.py
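A minimal sketch of that suggestion, reusing the `FOR ... UNION INSERT` query and the pool API from the script above (`insert_concurrently`, `batch_size=10_000`, and `max_concurrency=8` are hypothetical names and values, and this trades the single enclosing transaction for parallelism):

```python
import asyncio
from typing import List, Tuple

import ujson

# The same FOR ... UNION INSERT query used by insert_history_values above.
INSERT_QUERY = """
WITH SNAME := (SELECT SubTable FILTER .name = <str>$subname),
FOR x IN {
    json_array_unpack(<json>$data)
}
UNION (INSERT MyTable {
    sub := SNAME,
    timestamp := <datetime>x[0],
    value := <posfloat64>x[1]
});
"""


async def insert_concurrently(
    pool,  # an edgedb.AsyncIOPool
    values: List[Tuple[str, float]],
    batch_size: int = 10_000,
    max_concurrency: int = 8,
) -> None:
    """Chunk `values` and insert the chunks over concurrent connections."""
    chunks = [values[i : i + batch_size] for i in range(0, len(values), batch_size)]
    # Cap how many chunks are in flight at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def insert_chunk(chunk):
        async with sem:
            # Each task acquires its own connection, so chunks are inserted
            # in parallel instead of serially on a single connection.
            async with pool.acquire() as connection:
                await connection.query(
                    INSERT_QUERY, data=ujson.dumps(chunk), subname="NAME"
                )

    await asyncio.gather(*(insert_chunk(c) for c in chunks))
```

For the chunks to actually run in parallel, the pool has to allow enough connections, e.g. `edgedb.create_async_pool(..., max_size=max_concurrency)`.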
Do you mean the actual size in megabytes or the number of records in an object set?
-
This is relevant to my use case. @Mulugruntz, are you in the community chat?