Efficiently Loading a Large Time Series Dataset into KairosDB #647

Open
AKheli opened this issue Sep 5, 2021 · 4 comments
AKheli commented Sep 5, 2021

I am trying to load 100 billion multi-dimensional time series datapoints into KairosDB from a CSV file with the following format:

timestamp value_1 value_2 .... value_n

I tried to find a fast loading method in the official documentation; here is how I am currently doing the insertion (my codebase is in Python):

import json
from datetime import datetime

import requests
from tqdm import tqdm

i = 0
with open(args.file, "r") as f, tqdm(total=int(rows)) as pbar:
    while i < rows:
        data = []
        batch_size = 65000 // column   # rows per batch, so each POST carries ~65000 datapoints
        while i < rows and batch_size > 0:
            batch_size -= 1
            i += 1
            values = f.readline().rstrip("\n").split(" ")
            # convert the row's timestamp to milliseconds since the epoch
            t = int((get_datetime(values[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000)
            # one metric entry per column, each holding a single datapoint
            for j in range(column):
                data.append({
                    "name": "master.data",
                    "datapoints": [[t, values[j + 1]]],
                    "tags": {
                        "dim": "dim" + str(j)
                    }
                })
        r = requests.post("http://localhost:8080/api/v1/datapoints", data=json.dumps(data))
        pbar.update(65000 // column)

As the code above shows, I read the dataset CSV file, prepare batches of roughly 65,000 data points, and send each batch to KairosDB with requests.post.
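For reference, each POST body produced by this loop is a JSON array with one single-datapoint metric entry per value, so a row with 100 columns becomes 100 separate entries. The timestamps and values below are made up, but the shape matches the snippet above:

[
    {"name": "master.data", "datapoints": [[1630800000000, "0.42"]], "tags": {"dim": "dim0"}},
    {"name": "master.data", "datapoints": [[1630800000000, "1.87"]], "tags": {"dim": "dim1"}}
]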

However, this method is not very efficient. I am trying to load 100 billion data points, and loading only 3 million rows of 100 columns each has already been running for 29 hours, with an estimated 991 hours remaining:

[screenshot of the tqdm progress bar]

I am certain there is a better way to load the dataset into KairosDB. Any suggestions for a faster loading method would be appreciated.
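One likely bottleneck in the snippet above is that every value becomes its own metric entry with a single datapoint, so the JSON payload is far larger than it needs to be, and each batch opens a fresh HTTP connection. Below is a minimal sketch of one way to reduce that overhead; it reuses args.file, get_datetime, the metric name, and the tags from the snippet above, while ROWS_PER_BATCH, N_COLUMNS, and post_batch are illustrative names introduced here, and the gzip upload path should be checked against the KairosDB REST documentation for the version in use:

import gzip
import json
from datetime import datetime

import requests

KAIROS_URL = "http://localhost:8080/api/v1/datapoints"   # same endpoint as above
ROWS_PER_BATCH = 650    # hypothetical batch size (~65000 datapoints for 100 columns); tune as needed
N_COLUMNS = 100         # number of value columns per row

session = requests.Session()   # reuse one HTTP connection instead of reconnecting per batch


def post_batch(rows):
    """rows is a list of (timestamp_ms, [value_1, ..., value_n]) tuples."""
    # One metric entry per dimension tag, each carrying every timestamp in the
    # batch, instead of one entry per individual value as in the snippet above.
    data = [
        {
            "name": "master.data",
            "tags": {"dim": "dim" + str(j)},
            "datapoints": [[t, float(values[j])] for t, values in rows],
        }
        for j in range(N_COLUMNS)
    ]
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    # KairosDB's REST docs describe gzip-compressed uploads with this content
    # type; confirm against the server version you are running.
    r = session.post(KAIROS_URL, data=body,
                     headers={"Content-Type": "application/gzip"})
    r.raise_for_status()   # fail loudly if the server rejects a batch


with open(args.file, "r") as f:
    batch = []
    for line in f:
        fields = line.rstrip("\n").split(" ")
        # same timestamp conversion as in the original snippet
        t = int((get_datetime(fields[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000)
        batch.append((t, fields[1:]))
        if len(batch) == ROWS_PER_BATCH:
            post_batch(batch)
            batch = []
    if batch:
        post_batch(batch)

Beyond the payload shape, a single Python process is often CPU-bound on parsing and JSON encoding, so splitting the file and running several such loaders in parallel against the same endpoint is also worth measuring.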

biswaKL commented Sep 6, 2021 via email

biswaKL commented Sep 6, 2021

Is it possible for you to share the CSV file?

brianhks (Member) commented
I posted a comment on the forum in response to this; that should be sufficient.

AKheli commented Sep 15, 2021 via email
