batch_size_bytes not working as expected #1554
Hi @dpmabo, thanks for the detailed report here. We think this might be related to a missing sort order specification on write. This still needs to be confirmed though, and a more robust solution determined.
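For context, a minimal sketch of what sorting on write could look like with pyarrow. The `ts_init` column name matches the diagnostic script below; the file name is taken from the report, and the exact write path used by the catalog is an assumption:

```python
import pyarrow.parquet as pq

# Hypothetical illustration: ensure deltas are sorted by ts_init before writing,
# so each Parquet file (and its row groups) is in timestamp order.
table = pq.read_table("part-20240208-010605.parquet")  # example file from the report
sorted_table = table.sort_by([("ts_init", "ascending")])
pq.write_table(sorted_table, "part-20240208-010605-sorted.parquet")
```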
Thank you for being patient @dpmabo. The sort order issue has been solved in #1656 and will be merged very soon. We've encountered OOM issues before, and they were related to the row group size in the data files. Can you please run the following script on any of the files in your dataset and share the output here?

```python
import pyarrow.parquet as pq
import csv
import sys


def extract_ts_init_values(parquet_file, csv_file):
    # Open the Parquet file
    parquet_file = pq.ParquetFile(parquet_file)
    # Open the CSV file for writing
    with open(csv_file, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        # Write the header
        writer.writerow(["index", "start_ts", "end_ts", "group_size"])
        # Iterate over each row group in the Parquet file
        for i in range(parquet_file.num_row_groups):
            # Read the row group into a table
            table = parquet_file.read_row_group(i)
            # Convert the 'ts_init' column to a list
            ts_init_values = table.column("ts_init").to_pandas().tolist()
            # Write the index, first and last value, and row count to the CSV file
            writer.writerow([i, ts_init_values[0], ts_init_values[-1], table.num_rows])


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python extract_ts_init.py <parquet_file> <csv_file>")
        sys.exit(1)

    parquet_file = sys.argv[1]
    csv_file = sys.argv[2]
    extract_ts_init_values(parquet_file, csv_file)
```
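Relatedly, if oversized row groups turn out to be the culprit, here is a minimal sketch of how row group size can be bounded when rewriting a Parquet file with pyarrow. The file names are placeholders, not the catalog's actual write path, and the group size of 5000 is an arbitrary example:

```python
import pyarrow.parquet as pq

# Hypothetical illustration: rewrite a file with a capped row group size so
# readers can stream one modest group at a time instead of one huge group.
table = pq.read_table("input.parquet")     # placeholder input file
pq.write_table(table, "output.parquet", row_group_size=5000)
```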
@twitu Sorry for my delayed post. Here is a CSV generated by your script:

index,start_ts,end_ts,group_size
Hi @dpmabo, a fix has been merged into develop. Please give it a try and open a new issue if it doesn't work for you.
Bug Report
I just tested a simple OrderBookImbalance example strategy. When I use a one-month order_book_delta dataset, memory usage grows enormously and the run is always OOM-killed (12-core / 32 GB workstation); batch_size_bytes has no effect on memory usage.
The order_book_delta dataset is like below:
8.6M part-20240208-010605.parquet
8.7M part-20240208-014357.parquet
9.1M part-20240208-023338.parquet
8.7M part-20240208-034528.parquet
8.7M part-20240208-042741.parquet
8.9M part-20240208-050715.parquet
9.7M part-20240208-060216.parquet
8.7M part-20240208-075000.parquet
......
9.2M part-20240310-184726.parquet
9.1M part-20240310-195349.parquet
9.2M part-20240310-205305.parquet
8.8M part-20240310-215918.parquet
8.7M part-20240310-224410.parquet
4.3M part-20240310-232658.parquet
9.5M part-20240310-235944.parquet
(799 files in total, 6.8 GB)
Expected Behavior
As the docs state: "Consider the high-level API when: Your data stream's size exceeds available memory, necessitating streaming data in batches." When batch_size_bytes is set, NT should run in streaming mode, with the data in the catalog gradually loaded into memory and fed to the engine.
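For reference, a minimal sketch of the kind of configuration being described, based on my understanding of the high-level API around version 1.189 — the catalog path, instrument ID, venue settings, and batch size are all placeholder assumptions:

```python
from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import BacktestDataConfig, BacktestRunConfig, BacktestVenueConfig
from nautilus_trader.model.data import OrderBookDelta

# Hypothetical illustration: with batch_size_bytes set, the node is expected to
# stream OrderBookDelta batches from the catalog instead of loading everything up front.
data = BacktestDataConfig(
    catalog_path="/path/to/catalog",        # placeholder
    data_cls=OrderBookDelta,
    instrument_id="BTCUSDT-PERP.BINANCE",   # placeholder
)
venue = BacktestVenueConfig(
    name="BINANCE",
    oms_type="NETTING",
    account_type="MARGIN",
    base_currency="USDT",
    starting_balances=["100000 USDT"],
)
config = BacktestRunConfig(
    data=[data],
    venues=[venue],
    batch_size_bytes=1024 * 1024 * 256,     # stream in ~256 MB batches
)
node = BacktestNode(configs=[config])
results = node.run()
```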
Actual Behavior
All data in the catalog is loaded into memory up front, which easily results in an OOM kill.
Steps to Reproduce the Problem
Specifications
nautilus_trader
version: 1.189.0