GluonTS not using all available (CPU) resources #200

Open · emobs opened this issue Sep 28, 2023 · 14 comments
@emobs commented Sep 28, 2023

While training a GluonTS model, less than 10% of the CPU resources (16 cores in my test case) are used, while other models use 70%+ of the available CPU capacity. Memory and disk are not the bottleneck in my case; there are more than enough free resources there as well. The model is set to use n_jobs='auto', which works fine for the other models, which do use the major part of the CPU capacity. Is GluonTS not configured to, or not capable of, parallelizing tasks? Thanks for your reply and explanation in advance.

@winedarksea (Owner) commented Sep 28, 2023

I've seen GluonTS use high CPU in the past, but it does vary by underlying model.
Some quick first questions:

  • What version of GluonTS are you using?
  • Is MXNet enabled with GPU?
  • Which GluonTS sub-models are running when CPU utilization is low?
  • What operating system and CPU architecture (x86 vs ARM) are you using?

@emobs (Author) commented Sep 28, 2023

Thanks for the quick reply!

  • I'm using version 0.13.5
  • No, CPU only: gluonts.mx.context:Using CPU
  • I'm not sure, but I think it's the training model. If this is not correct, please let me know how to find out.
  • x86

@winedarksea (Owner) commented:

Here on gluonts 0.11.9, on Windows x86, it's working fine for me right now. Also works on Linux. I'll check the updated package and see if the newest version has issues.
[screenshot attached]

@winedarksea (Owner) commented:

Set verbose=2 and you'll see gluon_model printed, which is the sub-model being used. In this example, NBEATS is running.

@winedarksea (Owner) commented:

Working on gluonts==0.13.5 as well on Windows. Some models, NPTS comes to mind, seem likely to run with only one job.

@emobs (Author) commented Oct 17, 2023

Sorry for the late reply. I have been doing some more testing, and it appears the CPU usage drops when several models from GluonTS run, including DeepAR, WaveNet, Transformer, SFF, and NPTS.
After AutoTS has tried to train quite a few models according to the logs, the process is killed with no specific error message, even though verbose=2, so I would guess there's another root cause. However, I can't figure it out...

I attached a few lines from the top of the run log, hoping you're willing to take a look at them and help me find out what's wrong here. If you need more details, please let me know.

The input pandas DataFrame consists of 50k rows (I tried fewer as well, by the way) with a datetime column in the correct format and multiple feature columns with proper data (no NaNs, extreme outliers, or otherwise invalid values).

Thank you very much in advance for your support!!

debug_log.txt

@winedarksea (Owner) commented Oct 17, 2023

Some things I have noticed:

  • Kernels don't die nearly as often when scripts are run from the command line; python my_file.py is usually more stable than IPython notebook kernels.
  • RAM is the major issue I see with random kernel failures, depending on the size of your data. Keep an eye on RAM usage and see if you are exceeding it. Some models handle limited RAM better than others. Try subset=50 and see if that helps; it should use only a subset of series for the model search. Also try manually setting n_jobs lower - you may not have enough RAM to match your CPU core count. (For example, I upgraded my Ryzen 5950X to 128 GB of RAM, which helped reduce these problems on that 16C/32T CPU; if buying lots more RAM isn't an option, reducing your max core use will reduce memory use.)
  • Try running with model_list='superfast' and see if it runs to completion and looks somewhat reasonable before going back and adding in harder models. Next, try model_list='fast_parallel_no_arima'.

What frequency are you using with 50k rows of history? That must be hourly or minute-level data?
AutoTS can handle NaNs and outliers; those won't be the cause of the crashing.

Neural networks are always unreliable. I've been working on adding the TiDE model and it keeps killing my kernel for no clear reason. For GluonTS, I am adding limited support for their PyTorch approach in the next release, which might help, since apparently MXNet is deprecated.

Something else that can help with full core utilization is setting some environment variables:
export OMP_NUM_THREADS=8 (use 'set' on Windows), or whatever thread count you want to use. Beware that this can sometimes conflict with other methods of multiprocessing.
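
For instance, a minimal sketch of pinning those variables from inside a Python script (the three variable names are the common OpenMP/OpenBLAS/MKL ones; they must be set before numpy is first imported to take effect, and 8 is just an assumed thread budget):

import os

# Must run before the first `import numpy` in the process.
os.environ["OMP_NUM_THREADS"] = "8"       # OpenMP threads (MKL/OpenBLAS honor this)
os.environ["OPENBLAS_NUM_THREADS"] = "8"  # OpenBLAS-specific override
os.environ["MKL_NUM_THREADS"] = "8"       # MKL-specific override

import numpy as np  # now picks up the limits above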

@emobs (Author) commented Oct 18, 2023

Thanks for your reply Colin, first of all. I really appreciate your support!

Secondly, I'm running the script directly from the command line (user terminal on Ubuntu 22.04), FYI.

I ran the script once again with the suggested parameters:

from autots import AutoTS

model = AutoTS(
    forecast_length=1,
    frequency='5T',
    model_list='superfast',
    ensemble='all',
    n_jobs='auto',
    subset=50,
    verbose=2,
)
model = model.fit(train, date_col='Time', value_col=target_col)

Run results:

  • It finished without being killed this time, so that's the good part :)
  • It took almost 5 hours for a single training run, and CPU (around 12% the whole way) and RAM (around 3% the whole way) usage were minimal, although CPU usage peaked at (but did not exceed) almost 100% in the first 15 minutes of the run (RAM didn't peak at all). That could be the point after which the first errors came in, but I'm not sure.
  • There are several errors and warnings in the debug log of which I don't know whether they should be addressed and/or are causing any issues, like ValueError: Input y contains NaN., ValueError: The number of quantiles cannot be greater than the number of samples used. Got 17538 quantiles and 10000 samples., ValueError: freq T not understood. and TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'.
  • There's also some output in the terminal that seems relevant; to me it suggests something is wrong with the input data, but I don't have a clue what could be wrong with it...

I attached the terminal output and full debug log here for a quick review, hoping you are willing to take a look. I would be very grateful if you would take the time to do so and help me along, since I don't know what to do at this point.

terminal ouput.txt
debug_log.txt

Last but not least:
I am currently running the script on a DigitalOcean droplet with dedicated, but virtual, CPUs. Could that cause trouble with multiprocessing because of the GIL (global interpreter lock) and result in just a fraction of the CPU capacity (and hence minimal RAM) being used?

Thanks in advance for any help and please let me know if you need more details.

@winedarksea (Owner) commented:

Cloud provider VMs are listed with vCPUs, which count 'logical' processors (threads), not physical cores. For workloads with lots of small, lightweight tasks that works out fine, but data science tends toward heavy-duty workloads that can't hyperthread very effectively, so the number of physical cores is the performance constraint. Normally the number of actual cores is the listed number of CPUs divided by 2. It might be worth setting n_jobs to half the VM's official CPU count.

But as for the low usage of RAM and CPU, that might be because the 'superfast' model list is mostly matrix operations, which are quite efficient and don't parallelize much. That said, you should double check that NumPy is linked to a BLAS/LAPACK correctly. It should be, but it's worth checking (and also setting that OMP_NUM_THREADS environment variable, which can help). Here's a Stack Overflow thread on checking the LAPACK linkage:
https://stackoverflow.com/questions/9000164/how-to-check-blas-lapack-linkage-in-numpy-and-scipy
Installing with Anaconda or mamba usually ensures packages have been built with MKL or OpenBLAS support.
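
As a quick sanity check along the lines of that link (numpy.show_config() is standard NumPy; the exact output layout varies between versions):

import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against;
# look for 'openblas' or 'mkl' in the output.
np.show_config()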

Some errors are to be expected. Some combinations of parameters AutoTS generates don't work, and some parameters don't work on some datasets. Those errors don't look like a problem to me.

Something that can help diagnose which model is crashing your script is passing current_model_file="current_model" to AutoTS(). It saves the model currently being run to your drive as JSON, and you can then (hopefully) see what was running before the crash.
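
For example (a minimal sketch; the path here is arbitrary, and AutoTS writes the in-progress model parameters to it as JSON):

from autots import AutoTS

# If the process dies, inspect the saved JSON to see which
# model and parameter set were running at the time.
model = AutoTS(
    forecast_length=1,
    current_model_file="current_model",
)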

Not sure if you have looked at the production_example.py example yet, but you definitely should if you haven't.

ensemble='all' may be unnecessarily slow. ensemble=['simple', 'horizontal-max'] should be sufficient, and often ensemble=None is not much worse and is definitely faster, so that's probably worth starting with if you are concerned about speed.
It might also be worth setting verbose=0 to show only the errors.

Check model.df_wide_numeric to make sure your input data is getting processed in correctly. You may need 'series_id' as a column name if your data has multiple features; see the sketch below.
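
For illustration, a sketch of a long-format call with a series identifier (the column names 'Time', 'series_id', and 'value' are assumptions; id_col is the fit() parameter that maps to the identifier column):

# One row per (timestamp, series) pair in long format.
model = model.fit(
    df_long,
    date_col='Time',
    value_col='value',
    id_col='series_id',
)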

@emobs (Author) commented Oct 19, 2023

Thanks for the advice on the virtual CPUs. I am playing with the OMP_NUM_THREADS variable and n_jobs parameter settings, but haven't gotten much better results than about 30% usage yet. It's better than 13% though, so I will persevere to find an optimum (>30% hopefully!). Any other advice that might help with this? Or is this actually the best possible on a VM?

I checked the numpy package, which is using OpenBLAS:
openblas64__info.txt
Looks good to me, do you agree?

The script doesn't crash anymore with a smaller input dataset (currently testing with 500 rows of data). With 50k rows of data I get some memory-allocation errors.

Regarding the other errors, could you please take a look at this merged debug log:
debug_log.txt
It contains logs for runs with and without a transformer_list specified, and with verbose set to 0 and 2 respectively. Are these 'Transformer failed on fit' errors something to worry about? And if so, what could be their cause and possible solutions?

I did study production_example.py before, but I will go through it again tonight :)

I haven't yet managed to check model.df_wide_numeric properly. Could you provide example code for doing so? Should it be done after model.fit or after model creation?

Again, thanks a lot for your help in advance, very much appreciated!

@winedarksea (Owner) commented Oct 19, 2023

Where are you collecting your CPU utilization from? From the droplet internally or externally on the dashboard? Internally collected metrics on a VM might be inaccurate.

If your only goal is 100% CPU utilization, try model_list='parallel', then increase n_jobs until you see 100%. Although I wouldn't advise aiming for a full 100% all the time; that often means the system is overutilized and 'thrashing'. A bit lower is better.

Actually, I should point out that utilization will still be low if you are only inputting ONE time series. Many of the optimizations here are designed to parallelize across multiple time series, not within a single input series. Try the example load_daily() dataset and see how utilization looks. If you only have one series, try model_list=['WindowRegression'] too; those scikit-learn/xgboost models mostly parallelize, although you will only see brief spikes since most of them train pretty fast.
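
For instance, a rough comparison run (load_daily comes from autots.datasets; long=False returns a wide frame with many series and a DatetimeIndex, which is the shape that lets AutoTS parallelize across series; forecast_length here is arbitrary):

from autots import AutoTS
from autots.datasets import load_daily

df = load_daily(long=False)  # wide: one column per series

model = AutoTS(forecast_length=14, model_list='parallel', n_jobs='auto')
model = model.fit(df)  # watch CPU utilization during this fit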

But really, you shouldn't be aiming for maximal CPU utilization, you should focus on overall runtime optimization. Some operations and some models can't be parallelized but are still fast.

openblas config looks good.

Is this a dataset you can share? If you want, you can send it to me and I'll see how it works on my end. Given that you are working with 5-minute frequency and 1-step-ahead predictions, I'm 90% willing to bet you are trying to do some sort of semi-high-frequency stock trading automation.

No, 'Transformer failed on fit' errors are not usually something to worry about.

model.df_wide_numeric should be a pandas DataFrame; it will be available after .fit() is run. Although if you just want to check the data, you could try just fit_data:

from autots import AutoTS

model = AutoTS()
model = model.fit_data(df)
print(model.df_wide_numeric)

@emobs (Author) commented Oct 20, 2023

Thanks again for your reply Colin.

I'm using the graphs on the DO dashboard for CPU and RAM utilization analysis, but also tried using the internal Ubuntu performance metrics application. The metrics of both instruments are in line with each other.

Full CPU utilization is not necessarily my goal; however, given the quite extensive amount of training data, it would save a lot of time if all CPUs could be utilized as optimally as possible. Nevertheless, thank you for pointing out that runtime optimization matters more than CPU-utilization optimization. Thanks to your explanation I now understand that parallelization is limited depending on the model (list) used.

Thanks for the confirmation on the openblas config.

We're trying to predict dynamic electricity prices in order to advise on when and where to charge/discharge EVs and other high-capacity batteries, like home batteries. Here's an example of the input data (simplified to only 5 feature columns and 1k records): data.csv
Please note that the csv file is UTF-16 encoded, not the regular UTF-8, because of our data sources.
We're predicting 1 step ahead in the current tests, but will also predict more steps ahead and on other time frames later. I was hoping that if I get the 5-minute 1-step forecasting to work properly, forecasting on other time frames and with larger steps would be easy to accomplish as well.

Lastly, using model_list=['WindowRegression'] throws some more specific errors related to the datetime column, like:
AssertionError: df index is not pd.DatetimeIndex and Could not convert date to datetime format. Incorrect column name or preformat with pandas to_datetime.
I have been trying to explicitly define the correct format and set the 'Time' column as the index this way:

import pandas as pd

train['Time'] = pd.to_datetime(train['Time'], format='%Y.%m.%d %H:%M:%S')
train.set_index('Time', inplace=True)

Still I get the above errors, however, which I guess might be the cause of other errors as well.
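
For reference, the fit call afterwards looks roughly like this (a sketch; my understanding is that when date_col and value_col are omitted, AutoTS treats the frame as wide format with a DatetimeIndex, so maybe that assumption is where things go wrong):

from autots import AutoTS

# 'Time' is already the DatetimeIndex, so date_col/value_col are omitted
# and every remaining column is treated as a separate series.
model = AutoTS(forecast_length=1, frequency='5T', verbose=2)
model = model.fit(train)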

Hope you will be able to evaluate the example data and pinpoint the cause of the errors.

Thanks again in advance!

@winedarksea (Owner) commented:

I'm a bit busy with a few other things at the moment, but I hope to give your data a proper examination in the next couple of days.
I've never tested on 5-minute-frequency data, so it's possible one or two things don't like the frequency. I'll check and fix as needed.

@emobs (Author) commented Oct 23, 2023

Thanks Colin, I'm very happy to hear you're willing to take the time to help us with this. Looking forward to your message once you've had the opportunity to examine the case. Thanks a million already!
