Nonblocking callback #459

Open
wants to merge 5 commits into master

Conversation

erik-overdahl

This PR makes the ARA callback module no longer block the flow of Ansible playbook execution. The only remaining blocking is for the requests sent at the end of the playbook (this is unavoidable). This is accomplished by moving all callback logic to a separate worker thread and pushing received callback messages to it via a queue.
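
For illustration, here is a rough sketch of that pattern (not the actual patch; apart from the `AraWorker` and `CallbackModule` names, all details are simplified and hypothetical):

```python
import queue
import threading

class AraWorker(threading.Thread):
    """Drains callback events from a queue and forwards them to the ARA API client."""

    def __init__(self, handle_event):
        super().__init__(daemon=True)      # runs as a daemon thread
        self.handle_event = handle_event   # e.g. a function that POSTs to the ARA API
        self.queue = queue.Queue()

    def run(self):
        while True:
            hook, payload = self.queue.get()
            try:
                self.handle_event(hook, payload)
            finally:
                self.queue.task_done()

class CallbackModule:
    """Thin shell: each Ansible hook enqueues a message and returns immediately."""

    def __init__(self, handle_event=print):
        self.worker = AraWorker(handle_event)
        self.worker.start()

    def v2_runner_on_ok(self, result):
        # No HTTP request happens here, so Ansible is never blocked mid-playbook.
        self.worker.queue.put(("v2_runner_on_ok", result))
```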

I ran some very simple benchmarks to confirm that this does in fact improve performance. I would like to have more robust measurements, but right now these are what I had time for.

All measurements were taken with 10 hosts performing 5 tasks across 10 forks. Times are reported in milliseconds and averaged over 5 runs.

First, the test used in the ARA benchmark article, where each task is just `debug`:

| strat  | client  | db       | threads | master | nonblock | diff  |
|--------|---------|----------|---------|--------|----------|-------|
| free   |         |          |         | 984    |          |       |
| linear |         |          |         | 1042   |          |       |
| linear | http    | sqlite   | 0       | 2628   | 2380     | -248  |
| free   | http    | sqlite   | 0       | 3588   | 3073     | -515  |
| free   | http    | postgres | 4       | 4463   | 3888     | -575  |
| linear | http    | postgres | 0       | 4847   | 4267     | -580  |
| linear | offline | postgres | 4       | 4892   | 4722     | -170  |
| free   | http    | postgres | 0       | 7194   | 6244     | -950  |
| linear | offline | sqlite   | 0       | 8026   | 7426     | -600  |
| linear | offline | postgres | 0       | 8262   | 7487     | -775  |
| free   | offline | postgres | 4       | 8771   | 8039     | -732  |
| free   | offline | sqlite   | 0       | 11835  | 10850    | -985  |
| free   | offline | postgres | 0       | 11939  | 10771    | -1168 |

The second test used the ansible.builtin.shell module to sleep for 1 to 3 seconds in order to mimic more realistic conditions:

| strat  | client  | db       | threads | master | nonblock | diff  |
|--------|---------|----------|---------|--------|----------|-------|
| free   |         |          |         | 15641  |          |       |
| free   | http    | postgres | 4       | 15704  | 16150    | 446   |
| free   | http    | sqlite   | 0       | 16221  | 14842    | -1379 |
| linear |         |          |         | 18190  |          |       |
| free   | offline | postgres | 4       | 18353  | 15907    | -2446 |
| free   | http    | postgres | 0       | 18454  | 15572    | -2882 |
| linear | http    | sqlite   | 0       | 19229  | 18668    | -561  |
| free   | offline | sqlite   | 0       | 19590  | 17511    | -2079 |
| linear | http    | postgres | 4       | 19919  | 19405    | -514  |
| free   | offline | postgres | 0       | 20170  | 16072    | -4098 |
| linear | http    | postgres | 0       | 20595  | 19226    | -1369 |
| linear | offline | postgres | 4       | 20849  | 19746    | -1103 |
| linear | offline | postgres | 0       | 22578  | 20276    | -2302 |
| linear | offline | sqlite   | 0       | 23304  | 19960    | -3344 |

The nonblocking client is faster than the current ARA master branch, and, importantly, makes a big performance difference in situations where the previous threading patch cannot help (using sqlite as the database backend).

Move all callback operations into an auxiliary `AraWorker` class that
runs in a separate thread started by the `CallbackModule` class, which
is now just a shell that puts messages into a queue for the worker
thread to process. The callback plugin now does not block the main
Ansible thread.
@softwarefactory-project-zuul

Build failed.

✔️ ara-tox-py3 SUCCESS in 4m 23s
✔️ ara-tox-linters SUCCESS in 4m 30s
ara-basic-ansible-core-devel FAILURE in 4m 10s (non-voting)
ara-basic-ansible-6 FAILURE in 4m 15s
ara-basic-ansible-core-2.14 FAILURE in 3m 56s
ara-basic-ansible-core-2.13 FAILURE in 4m 14s
ara-basic-ansible-core-2.12 FAILURE in 4m 08s
ara-basic-ansible-core-2.11 FAILURE in 4m 11s
ara-basic-ansible-2.9 FAILURE in 4m 30s
✔️ ara-container-images SUCCESS in 11m 45s

Running the thread as "daemon" makes the previous cleanup unnecessary.
@softwarefactory-project-zuul

Build failed.

✔️ ara-tox-py3 SUCCESS in 4m 01s
✔️ ara-tox-linters SUCCESS in 4m 01s
ara-basic-ansible-core-devel FAILURE in 2m 56s (non-voting)
ara-basic-ansible-6 FAILURE in 3m 21s
ara-basic-ansible-core-2.14 FAILURE in 2m 51s
ara-basic-ansible-core-2.13 FAILURE in 4m 44s
ara-basic-ansible-core-2.12 FAILURE in 2m 49s
ara-basic-ansible-core-2.11 FAILURE in 3m 16s
ara-basic-ansible-2.9 FAILURE in 4m 00s
✔️ ara-container-images SUCCESS in 11m 50s

Prevents program from exiting before all ARA work is finished.

(meant to commit this earlier - whoops!)
@dmsimard
Contributor

dmsimard commented Nov 17, 2022

Hey @erik-overdahl and thanks for the PR !

I've looked at the CI failure and I think I know what's going on but it's a bit tricky.

This is the error:

```
TASK [Generate a failure that will be rescued] *********************************
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:76
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "Message from a failed task"
}

TASK [We can go on, it was a failure for test purposes] ************************
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:80
ok: [localhost] => {
    "msg": "👍"
}
2022-11-15 17:10:51,558 DEBUG ara.clients.http: HTTP 201: post on /api/v1/tasks
2022-11-15 17:10:51,561 DEBUG ara.plugins.callback.worker: v2_runner_on_start
2022-11-15 17:10:51,561 DEBUG ara.plugins.callback.worker: v2_runner_on_ok

TASK [Find the task with the ignored errors and validate the status] ***********
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:85
2022-11-15 17:10:51,615 DEBUG ara.clients.http: HTTP 200: get on /api/v1/tasks?name=ignore_errors&playbook=2
fatal: [localhost]: FAILED! => {
    "msg": "The conditional check '_task['results'][0]['name'] == 'Fail something with ignore_errors'' failed. The error was: error while evaluating conditional (_task['results'][0]['name'] == 'Fail something with ignore_errors'): list object has no element 0"
}

PLAY RECAP *********************************************************************
localhost                  : ok=9    changed=1    unreachable=0    failed=1    skipped=1    rescued=1    ignored=1
```

It's these tasks:

```yaml
- block:
    - name: Fail something with ignore_errors
      command: /usr/bin/false
      ignore_errors: true
    - name: Generate a failure that will be rescued
      fail:
        msg: Message from a failed task
  rescue:
    - name: We can go on, it was a failure for test purposes
      debug:
        msg: 👍

# failed is a new status for a task, validate that it actually works
- name: Find the task with the ignored errors and validate the status
  vars:
    _task: "{{ lookup('ara_api', '/api/v1/tasks?name=ignore_errors&playbook=' + playbook_id ) }}"
  assert:
    that:
      - _task['results'][0]['name'] == 'Fail something with ignore_errors'
      - _task['results'][0]['status'] == 'completed'
```

In other words, we are generating a failure for the sake of testing that we can query and search for it within the playbook.
This is meant to be integration testing coverage for ara itself and I guess it assumes that by the time we query for the failure, it should be there -- but since we're moving stuff around in threads, it's possible that the failure gets posted later, so the query can't find it.

I don't have a good suggestion at this time but does my explanation make sense ?

Edit: I haven't got around to reviewing and testing the code yet, apologies for the delay

@erik-overdahl
Author

That is my understanding as well. I tried adding a retry... until... delay to that assertion, but when I run the tests locally I get all sorts of weird "connection timeout" errors at random and inconsistent points in the test playbooks.

@hille721
Contributor

I was reading your conversation with much interest. What I'm a bit afraid of, and where I feel confirmed after the latest findings, is that the reporting is no longer live. As long as the playbook runs to the end, everything might be fine. But if the playbook run is aborted or crashes for any other reason, there is the possibility that later tasks have already been performed before earlier tasks are fully reported to Ara.

Thus the question with this change is speed vs. accuracy. For me speed is important, but accuracy is more important. I prefer correct results in Ara over faster playbook runs. (Btw, that's why I tell everyone in my company not to use the free strategy, because this led to problems in Ara too.)
Are the benchmark results really so good that we want to go for a more complex solution with maybe less accuracy?
Imo it would also be great to make such a thing optional, meaning to support the traditional behavior and make this a feature you can enable if performance is really important to you.

@erik-overdahl
Author

I do not agree with your assessment.

All results will still be logged by ARA. A failing task "crashes" the playbook but the Ansible program underneath continues to run. The ARA worker added in this PR runs as a daemon thread, so the program does not exit until the thread is done with its work. So there is no problem with accuracy or missing results unless the Ansible program itself crashes, and if your playbooks are causing that to happen, you have bigger problems to deal with.

Additionally, in a more realistic Ansible scenario, tasks on hosts do not complete immediately like they do in our simple benchmark, giving ARA more time to complete the HTTP calls. In a normal situation, the reporting should line up closely to the actual run of the playbook. I attempted to reproduce this in my second set of benchmarks, and indeed you'll notice that the time differences between the runs with ARA and those without are much less severe than in the simple benchmark where all tasks complete immediately.

Refactoring the callback to be nonblocking addresses more than just performance. The synchronous nature of callbacks in the Ansible architecture means that, in its current form, ARA makes callbacks for things like timing the execution of tasks completely useless. Not only is this bad in itself, but it is also not transparent; people using those callbacks may enable ARA and still be relying on those timings to be accurate representations of the actual task execution times on the hosts, unaware of the slowdown coming from ARA.

Avoiding the free strategy because ARA cannot handle it is backwards thinking, imo. If there is a language feature that is perfectly fine by itself but you cannot use it because your logging system cannot handle it, then the problem is with the logging system. ARA should be implemented better. No one should even have to think about including this in their Ansible toolkit and they should not have to change their Ansible playbooks to accommodate ARA.

@hille721
Contributor

> unless the Ansible program itself crashes

That is exactly what I meant. Maybe a user kills it, or the controller (e.g. a VM) crashes for some reason that need not be related to the playbook itself. But in such cases I want to have proper Ara reporting as well.

> The synchronous nature of callbacks in the Ansible architecture...

But wouldn't it be better then if Ansible itself supported some kind of async callbacks in general?

@erik-overdahl
Author

erik-overdahl commented Nov 18, 2022

If the machine you are running Ansible on crashes, there isn't anything you can do, and current ARA also has the risk of losing results in this case. A task may finish, but if we are still waiting on the POST requests in the callback to finish when the crash happens, then you just lost all results you didn't send yet.

If any exception happens in the Ansible process, the exception is caught at the top level and the process shuts down gracefully - see the cli_executor method. Because the thread runs as a daemon, your outstanding ARA requests will complete before the Ansible process exits. Which, actually, I'm not sure is currently true for ARA? If you cancel the playbook before the request for a task has been sent and you aren't using threads for ARA, then the UserInterrupt error will be thrown and the program will exit; because your requests were happening on the main thread that just got interrupted, you lose that data. So if anything, this PR makes it less likely that you will lose data, not more.

EDIT: after running some tests, you are indeed correct that user cancelling will cause lost data when this patch is applied. I can fix this by bringing back the __del__ method on the CallbackModule class that I removed in 6397d24. In this case, there is a hang after the user cancels the playbook, but all results are successfully recorded. It's up to you whether that hang is acceptable (we could put up a message to the user?). I haven't encountered lost data from ARA master in my tests, but I'm not sure if that's because it is impossible or if I am not running a test that will cause that problem to arise.
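
For illustration, building on the rough sketch in the PR description (again hypothetical, not the actual patch), such a drain-on-exit hook could look roughly like this:

```python
class CallbackModule:
    def __init__(self, handle_event=print):
        self.worker = AraWorker(handle_event)   # AraWorker as sketched above
        self.worker.start()

    def v2_runner_on_ok(self, result):
        self.worker.queue.put(("v2_runner_on_ok", result))

    def __del__(self):
        # Block teardown until every queued message has been processed, so a
        # cancelled run does not drop outstanding ARA requests. This wait is
        # also where the post-cancel "hang" comes from.
        self.worker.queue.join()
```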

I agree that Ansible SHOULD implement the callback plugins in a non-blocking manner, but they don't, so it is up to the callback plugins themselves to not block.

Even if the user cancels the run.

Brings back the `__del__` method on CallbackModule removed in 6397d24
@softwarefactory-project-zuul

Build failed.

✔️ ara-tox-py3 SUCCESS in 3m 49s
✔️ ara-tox-linters SUCCESS in 3m 33s
ara-basic-ansible-core-devel FAILURE in 3m 25s (non-voting)
ara-basic-ansible-6 FAILURE in 3m 30s
ara-basic-ansible-core-2.14 FAILURE in 3m 22s
ara-basic-ansible-core-2.13 FAILURE in 3m 33s
ara-basic-ansible-core-2.12 FAILURE in 3m 23s
ara-basic-ansible-core-2.11 FAILURE in 4m 41s
ara-basic-ansible-2.9 FAILURE in 3m 41s
✔️ ara-container-images SUCCESS in 12m 18s

@dmsimard
Contributor

dmsimard commented Nov 19, 2022

> unless the Ansible program itself crashes

> That is exactly what I meant. Maybe a user kills it, or the controller (e.g. a VM) crashes for some reason that need not be related to the playbook itself. But in such cases I want to have proper Ara reporting as well.

> If the machine you are running Ansible on crashes, there isn't anything you can do, and current ARA also has the risk of losing results in this case. A task may finish, but if we are still waiting on the POST requests in the callback to finish when the crash happens, then you just lost all results you didn't send yet.

Right.

I might describe ara's attempts to record data as "best effort", to some extent, due to the nature of how the callback interface works. ara's callback gets "woken up" when it receives a hook but if it stops receiving hooks, it can't do anything.
It's why interrupted playbooks and tasks may end up in the "running" status forever because the callback didn't receive the last hook, v2_playbook_on_stats.

There isn't much we can do about that, unfortunately.

> I agree that Ansible SHOULD implement the callback plugins in a non-blocking manner, but they don't, so it is up to the callback plugins themselves to not block.

Over the years, ara has helped with finding issues in the Ansible callback interface and while some were fixed, others were closed because fixing them would be too complicated or break stuff.

One of those is about the blocking nature of the callback interface but that was more than 5 years ago now: ansible/ansible#27705

We could revisit that and ask about it upstream but the challenge is that any meaningful change would probably break compatibility and only be available in newer versions of ansible-core since it wouldn't be backported.

Anyway, before we settled on the current threading implementation, every request was synchronous.
This worked reliably well at a small scale but performed poorly at scale because, as you note, ansible would only continue running the playbook once ara's API requests had been sent and returned.

Now, there are still some synchronous requests because, for example, to create a play we need a playbook ID, so we need to make sure the playbook has been created before we try to create the play.

For what it's worth, I went through the exercise of picturing a high level overview of how the sequence currently works from the perspective of the ara callback:
(attached image: Selection_051)

@dmsimard
Contributor

> Refactoring the callback to be nonblocking addresses more than just performance. The synchronous nature of callbacks in the Ansible architecture means that, in its current form, ARA makes callbacks for things like timing the execution of tasks completely useless. Not only is this bad in itself, but it is also not transparent; people using those callbacks may enable ARA and still be relying on those timings to be accurate representations of the actual task execution times on the hosts, unaware of the slowdown coming from ARA.

I think calling the current task durations completely useless is a bit of a stretch.
Yes, they include the overhead of ara processing the task because ansible waits for the task to be finished -- but this overhead is typically negligible in the grand scheme of things when tasks can last several minutes.

There are certain tradeoffs in ara for the sake of simplicity and sometimes it comes at the expense of performance.

Don't get me wrong, performance is important and I understand what you mean and how it would be more accurate to record the task duration as 3m 22s instead of 3m 22.5s, but that's neither useless nor the end of the world.

> Thus the question with this change is speed vs. accuracy. For me speed is important, but accuracy is more important. I prefer correct results in Ara over faster playbook runs. (Btw, that's why I tell everyone in my company not to use the free strategy, because this led to problems in Ara too.)

> Avoiding the free strategy because ARA cannot handle it is backwards thinking, imo. If there is a language feature that is perfectly fine by itself but you cannot use it because your logging system cannot handle it, then the problem is with the logging system. ARA should be implemented better. No one should even have to think about including this in their Ansible toolkit and they should not have to change their Ansible playbooks to accommodate ARA.

ara should work reliably with the free strategy. It historically hasn't because I did not personally have a use case for it and so a lot of ara was built around assumptions of running synchronously with the serial strategy.

There are some fixes for the free strategy in 1.6.0 and if there are still issues, we should document and fix them.
I still don't personally have a use case for the free strategy so any help is welcome here.

@erik-overdahl
Author

Yes, upon further reflection, I agree with you that ARA calls being included in the reported execution time is mostly a non-issue. I was thinking of it as reporting the execution time of a task on a host, but that is plainly not the case when the number of hosts is greater than ANSIBLE_FORKS.

I personally think that the performance gains from this patch are worth it, but I understand if you do not want to take on the added complexity. It isn't much more complexity, but it isn't none either. I can try to update that integration test to wait until the record is available, if you like (although I'm not sure of the best way to do this - you can't really retry until the record appears, because the test would never fail if the record wasn't actually there, so then your option is, what, a timeout? but that seems flaky, especially in a test, which you would hope is 100% deterministic).

Perhaps the most useful datapoint on ARA performance is this comment from the issue related to the benchmark blog post which claims that the 30% gains they saw from using Ansible with Mitogen were wiped out by ARA.

I would like to have a wider range of benchmarks. The numbers in the original ARA blogpost make the situation look a lot worse than it actually is. But I think it would be difficult to develop something useful; there are a lot of potential scenarios in a realistic workload.

Re: the free strategy - One of the things I found when I sprinkled logging statements throughout the callback code was that the ARA threading strategy ends up killing and creating a lot of ThreadPoolExecutors as the free strategy calls v2_playbook_on_task_start when it schedules a task on a host. In the ARA callback module, this triggers the _end_task method, as ARA assumes the linear strategy in which all runners for a task complete before the next task begins. Oddly, though... I don't know if this is actually an issue?? Look at the benchmarks in my first post; the free strategy with multiple threads still seems to perform pretty well.

@dmsimard
Contributor

Hi @erik-overdahl,

I just wanted to say I haven't forgotten about your PR and that I'll get back to it when the time is right.
I believe the callback is due for a refactor but I don't know what that would look like in practice yet.

The callback has grown organically over the years: it has had features and fixes patched on along some reverse engineering of internal ansible-core's "API"s. It could be a good idea to take a step back, look at what it does and what we've learned over the years to come up with something a bit cleaner.

ara didn't always have an API and the callback used to make database queries directly with flask so a lot has changed since then :p

Performance, threading, queueing or asynchronicity can be part of an eventual refactor but I mean to say that a lot of what the callback does could benefit from being put in a library -- things like creating a playbook, a task, a host or a result, for example.

I want to have a better idea of what this looks like before going down the road of optimizing performance such that we don't step on each other's feet.

I appreciate your patience in the meantime !

@phemmer
Contributor

phemmer commented May 6, 2023

If the risk of losing updates on crash is an issue, I think we can address those easily enough.

As @erik-overdahl mentioned we should be able to delay the shutdown until the queue drains. A typical approach I see in other applications (including my own) with similar concerns is that SIGINT/SIGTERM will initiate a graceful shutdown (wait for things to terminate cleanly, and in this case for the queue to drain), and a second SIGINT/SIGTERM causes an immediate exit.
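
As a generic illustration of that double-signal pattern (not ARA- or Ansible-specific, and not part of this PR):

```python
import signal
import sys

_signals_seen = 0

def _on_interrupt(signum, frame):
    global _signals_seen
    _signals_seen += 1
    if _signals_seen > 1:
        sys.exit(1)          # second signal: give up and exit immediately
    # First signal: raise KeyboardInterrupt so the normal shutdown path runs
    # and can wait for the queue to drain.
    raise KeyboardInterrupt

signal.signal(signal.SIGINT, _on_interrupt)
signal.signal(signal.SIGTERM, _on_interrupt)
```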

We can also put in a max queue size. For example if we make the max size 0, then the request has to be passed synchronously to the thread, but once the thread has it ansible can continue and the thread would make the API call in the background. Any further attempts to push something onto the queue will block until the API call completes.
This can be customizable, allowing the user to specify how big that max size should be based on how much risk of lost data they're willing to accept. And a max size of -1 would disable async entirely (though might make it more intuitive to shift the values by +1, so 0 would disable async).
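
A rough sketch of how a configurable bound could map onto Python's queue module (the option name and the exact meaning of 0/-1 here are purely illustrative, not an existing ARA setting):

```python
import queue

def make_callback_queue(max_pending):
    """max_pending: how many unsent callback events we are willing to buffer.

    A negative value could mean "no queueing at all", i.e. keep today's fully
    synchronous behaviour; any positive value bounds the data that could be
    lost on a crash, because put() blocks once the queue is full.
    """
    if max_pending < 0:
        return None  # caller falls back to synchronous requests
    # Note: queue.Queue(maxsize=0) means "unbounded" in Python, so a bounded
    # queue needs maxsize >= 1; put() then blocks whenever the worker is that
    # many events behind.
    return queue.Queue(maxsize=max(1, max_pending))
```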

I also just want to add my own real world experience. The benchmarks in the original post were overly optimistic compared to my experience. My playbooks have lots of very short tasks that run across hundreds of hosts. I saw playbooks taking over 3x longer when ARA was enabled. A playbook that took about 7 minutes was taking almost 30 minutes. Though this was a while ago, and I can't remember the details. I might have been using django auth, and might have been using sqlite, of which I now use neither, and am using the async wrapper I posted in #171 (comment).

@berendt
Contributor

berendt commented Dec 4, 2023

I have tested the Async Wrapper. It does not work for us because it changes the timestamps in ARA (or in other words, these do not correspond to the actual time of execution). But I think the idea of the async wrapper (thanks for sharing @phemmer) is a good one.
