WIP: First iteration of a prometheus exporter for ara #483
Conversation
Heavily a work in progress and a learning experience that we will iterate over a number of times. The intent is to build a prometheus exporter that gathers metrics from an ara instance and exposes them so that prometheus can scrape them.
- Added support for querying results through pagination
- Added support for paginating through pages of results
- Query everything at boot via a result limit (i.e., ?limit=1000) and pagination
- Store the latest object timestamp so that the next scrape only picks up objects created after it, using ?created_after=<timestamp>
- Moved it under our existing ara CLI so it can re-use all the boilerplate for instantiating an API client with all the settings
- Added args for limits, poll frequency and the port for the exporter to listen on
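The boot-time backfill described above can be sketched roughly like this; the endpoint, field names and the client object are illustrative stand-ins for ara's actual API client, not the real implementation:

```python
# Page through the API at boot with ?limit=N, keeping the newest "created"
# timestamp so the next scrape can query ?created_after=<timestamp>.
# The client, endpoint and field names are assumptions for illustration.
def backfill(client, limit=1000):
    items, offset, latest = [], 0, None
    while True:
        page = client.get("/api/v1/results", limit=limit, offset=offset)
        items.extend(page["results"])
        for item in page["results"]:
            if latest is None or item["created"] > latest:
                latest = item["created"]
        if not page.get("next"):
            break
        offset += limit
    return items, latest
```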
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 09s
- Added --max-days to limit backfill at boot
- Added a bit of verbosity
- Adjusted hosts to be scanned before tasks (there are way, way more tasks than hosts in terms of volume)
- First try at a playbook histogram containing the timestamp and duration
I've added a bit more context in the issue (#177 (comment)) and got two quick iterations in:
Edit: I've put up an example /metrics response with a single playbook's metric as a histogram in the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0#file-playbooks_as_histogram-txt Prometheus wants to group metrics based on their label uniqueness; I suppose in our case we want each playbook to be represented individually, so we should include their id? More on that later.
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 24s
Still heavily a work in progress but getting a better understanding of how things work. Hosts and tasks now have gauges by status. Playbook metrics are disabled temporarily until we revisit them with newfound knowledge.
I think my brain is starting to understand what is happening. I've temporarily commented out the current iteration of the playbook metrics until I revisit it with newfound knowledge. This latest iteration re-works the host and task metrics to have gauges per status such that we are able to do graphs like this, for example (screenshots: Prometheus task results in grafana; Prometheus host results in grafana). A snippet of what this looks like when querying the prometheus exporter:
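The snippet itself did not survive extraction. As a hedged stand-in, here is what "a gauge per status" can look like when rendered in the Prometheus text exposition format; the metric and status names are illustrative, and the real exporter would use prometheus_client Gauges rather than formatting lines by hand:

```python
from collections import Counter

# Count task results by status and render one gauge sample per status.
# The metric name ara_tasks_total is an assumption for illustration.
def render_task_gauges(tasks):
    counts = Counter(task["status"] for task in tasks)
    lines = ["# TYPE ara_tasks_total gauge"]
    for status, count in sorted(counts.items()):
        lines.append('ara_tasks_total{status="%s"} %d' % (status, count))
    return "\n".join(lines)
```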
Build failed. ✔️ ara-tox-py3 SUCCESS in 9m 57s
- Add a summary metric for tracking the duration of tasks. This is what was intended when trying to do the playbook histogram so we'll come back to that later.
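A Prometheus Summary keeps a running count and sum of observations, which is what a task duration metric needs (the average duration is then rate of the sum over rate of the count). A minimal stdlib stand-in for what prometheus_client's Summary tracks, with an assumed metric name:

```python
# Stand-in for prometheus_client.Summary: track the _count and _sum of
# observed task durations. The metric name is an illustrative assumption.
class DurationSummary:
    def __init__(self, name="ara_task_duration_seconds"):
        self.name = name
        self.count = 0
        self.total = 0.0

    def observe(self, seconds):
        self.count += 1
        self.total += seconds

    def expose(self):
        return "%s_count %d\n%s_sum %s" % (self.name, self.count, self.name, self.total)
```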
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 14s
Build succeeded. ✔️ ara-tox-py3 SUCCESS in 4m 15s
- Substantial cleanup and cut down on code duplication
- Fixed linting and style
- Moved metric labels to default constants, leaving the door open for the possibility of customizing them
- Retrofitted what we learned back to the playbook metrics
- Re-enabled playbook metrics
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 12s
- More cleanup
- Removed gauges for each status of playbooks and tasks; they were not useful once we understood how to use Summaries, and generated a lot of needless metrics in hindsight
- Added a package extra for [prometheus]
- First iteration of docs
- Added first iteration of a grafana dashboard
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 15s
I feel this is ready for a first look by a wider audience, so I've asked around for testing and feedback:
The final implementation may change before landing (for example if I screwed up in metric types) but this will be useful to make sure we made the right decisions and make the necessary changes before merging. I am narrowing the scope of this first PR to playbooks, tasks and hosts for now. Results and plays can come in a later patch as necessary.
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 50s
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 49s
Nothing special pushed, just rebased on top of latest master.
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 05s
I will eventually include it in the docs but in the meantime, I've come up with a graph that explains how one might use the exporter.
ara doesn't provide monitoring or alerting out of the box (they are out of scope) but it records a number of granular metrics about Ansible playbooks, tasks and hosts, amongst other things.

Starting with version 1.6.2, ara provides an integration of `prometheus_client <https://github.com/prometheus/client_python>`_ that queries the ara API and then exposes these metrics for prometheus to scrape.
1.6.2 didn't pan out, we went straight to 1.7.0. It can be included in a release as soon as it's ready.
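For reference, a minimal prometheus scrape configuration for such an exporter could look like the following; the job name, target port and scrape interval are illustrative assumptions rather than values from this PR:

```yaml
scrape_configs:
  - job_name: ara
    scrape_interval: 60s
    static_configs:
      # the port the exporter listens on is configurable via a CLI arg;
      # 8001 is only a placeholder
      - targets: ["localhost:8001"]
```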
help='Maximum number of days to backfill metrics for (default: 90)',
default=90,
type=int
)
I think it could be interesting for the exporter to be able to filter queries the way the general CLI commands do; for example, ara playbook list (docs) has:
--ansible_version <ansible_version>
List playbooks that ran with the specified Ansible
version (full or partial)
--client_version <client_version>
List playbooks that were recorded with the specified
ara client version (full or partial)
--server_version <server_version>
List playbooks that were recorded with the specified
ara server version (full or partial)
--python_version <python_version>
List playbooks that were recorded with the specified
python version (full or partial)
--user <user> List playbooks that were run by the specified user
(full or partial)
--controller <controller>
List playbooks that ran from the provided controller
(full or partial)
--name <name> List playbooks matching the provided name (full or
partial)
--path <path> List playbooks matching the provided path (full or
partial)
--status <status> List playbooks matching a specific status
('completed', 'running', 'failed')
Hi, I think you can transform, for example, this metric into several metrics:

ara_tasks_duration{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} <number of seconds (or microseconds if needed)>

ara_tasks_results{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1

We can work together to build the correct metrics, then we will produce the correct python for the exporter.
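Rendering the per-task samples suggested in this comment could look something like the following sketch; the label set comes from the comment itself, while the escaping and exact output format are illustrative:

```python
# Render one labeled sample per task for the two suggested metrics
# (ara_tasks_duration and ara_tasks_results). Label values are escaped
# minimally; the exact format here is an illustrative assumption.
def task_samples(task):
    labels = ",".join(
        '%s="%s"' % (key, str(task[key]).replace('"', r'\"'))
        for key in ("action", "name", "path", "playbook")
    )
    return [
        "ara_tasks_duration{%s} %s" % (labels, task["duration"]),
        "ara_tasks_results{%s} %s" % (labels, task["result"]),
    ]
```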
Hi @voileux and thanks for reaching out! What you suggest makes sense to me and it's worth looking into. I don't have bandwidth to look into this /right now/ but I will revisit this in the near future.
Hello, depending on your goal here: it might be easier for you to limit the "exporter part" to what you want to monitor live (i.e. what you want to trigger alerts on), and for the visualization aspects, directly connect grafana to your database with the specific grafana datasource.
something like:

```mermaid
flowchart TD
    G[Grafana] -->|promql <br/> visualize <b>alerts</b><br/> and correlate current metrics| P(Prometheus)
    G -->|db datasource <br/> visualize <b>metrics</b> <br/>current and historical| D
    W(alertmanager) -->|promql<br/>trigger alerts| P
    P -->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D
```
instead of (from your previous schema here):

```mermaid
flowchart TD
    G[Grafana] -->|promql| P(Prometheus)
    P -->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D
```
(edit: I forgot to put the mermaid keyword, and took this opportunity to add it.) This indeed requires you to rewrite your panels in grafana in order to make use of the proper SQL, and you will need to open the connection between grafana and your DB. It also avoids transforming the whole content of the DB into opentelemetry format and scraping it each time, which will scale better :-D
Hi, I haven't revisited this in a little while but I wanted to say it was still on my radar and I plan to work on this some more in the near future.
As discussed on the issue for this topic: #177
It's not finished and still very much a WIP but I figured it might be worthwhile to iterate under a branch in a PR instead of the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0
If prometheus_client is installed, there will be an ara prometheus command to expose prometheus metrics gathered and parsed from an ara instance:
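A hedged sketch of the polling loop such a command implies: fetch objects newer than the last seen timestamp every frequency seconds and feed them into the metrics. fetch and update_metrics are hypothetical hooks for illustration, not ara's actual API:

```python
import time

# Poll for new objects on an interval, remembering the newest "created"
# timestamp so each poll only fetches what appeared since the last one.
# fetch() and update_metrics() are hypothetical hooks for illustration.
def run_exporter(fetch, update_metrics, frequency=60, iterations=None):
    latest = None
    done = 0
    while iterations is None or done < iterations:
        for obj in fetch(created_after=latest):
            update_metrics(obj)
            if latest is None or obj["created"] > latest:
                latest = obj["created"]
        done += 1
        if iterations is None or done < iterations:
            time.sleep(frequency)
    return latest
```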