WIP: First iteration of a prometheus exporter for ara #483

Draft
wants to merge 9 commits into
base: master

dmsimard
Contributor

As discussed on the issue for this topic: #177

It's not finished and still very much a WIP but I figured it might be worthwhile to iterate under a branch in a PR instead of the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0

If prometheus_client is installed, there will be an ara prometheus command to expose prometheus metrics gathered and parsed from an ara instance:

usage: ara prometheus [-h] [--client <client>] [--server <url>] [--timeout <seconds>] [--username <username>] [--password <password>] [--ssl-cert <path/to/certificate>] [--ssl-key <path/to/key>] [--ssl-ca <path/to/cacert>] [--insecure]
                      [--playbook-limit PLAYBOOK_LIMIT] [--task-limit TASK_LIMIT] [--host-limit HOST_LIMIT] [--poll-frequency POLL_FREQUENCY] [--prometheus-port PROMETHEUS_PORT]

Exposes a prometheus exporter to provide metrics from an instance of ara

options:
  -h, --help            show this help message and exit
  --client <client>
                        API client to use, defaults to ARA_API_CLIENT or 'offline'
  --server <url>
                        API server endpoint if using http client, defaults to ARA_API_SERVER or 'http://127.0.0.1:8000'
  --timeout <seconds>
                        Timeout for requests to API server, defaults to ARA_API_TIMEOUT or 30
  --username <username>
                        API server username for authentication, defaults to ARA_API_USERNAME or None
  --password <password>
                        API server password for authentication, defaults to ARA_API_PASSWORD or None
  --ssl-cert <path/to/certificate>
                        If a client certificate is required, the path to the certificate to use, defaults to ARA_API_CERT or None
  --ssl-key <path/to/key>
                        If a client certificate is required, the path to the private key to use, defaults to ARA_API_KEY or None
  --ssl-ca <path/to/cacert>
                        Path to a certificate authority for trusting the API server certificate, defaults to ARA_API_CA or None
  --insecure            Ignore SSL certificate validation, defaults to ARA_API_INSECURE or False
  --playbook-limit PLAYBOOK_LIMIT
                        Max number of playbooks to request at once (default: 1000)
  --task-limit TASK_LIMIT
                        Max number of tasks to request at once (default: 2500)
  --host-limit HOST_LIMIT
                        Max number of hosts to request at once (default: 2500)
  --poll-frequency POLL_FREQUENCY
                        Seconds to wait until querying ara for new metrics (default: 60)
  --prometheus-port PROMETHEUS_PORT
                        Port on which the prometheus exporter will listen (default: 8001)

Heavily a work in progress and learning experience over which we will
iterate a number of times.

The intent is to make a prometheus exporter gather metrics from an ara
instance and expose them so that prometheus can scrape them.
- Added support for querying results through pagination
- Added support for paginating through pages of results
- Query everything at boot via result limit (e.g., ?limit=1000) and pagination
- Store the latest object timestamp such that next scrape will only pick up
   objects created after that using ?created_after=<timestamp>
- Move it under our existing ara CLI so it can re-use all the
  boilerplate about instantiating an API client with all the settings
- Add args for limits, poll frequency and port for the exporter to
  listen on
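
The pagination and incremental polling described in this commit message could be sketched roughly as follows. This is not the code from this PR: `fetch_all`, `poll_once` and the `get` callable are hypothetical names, and the response shape (`results` and `next` keys, `limit`/`offset` parameters, a `created` field on each object) is an assumption about a limit/offset paginated API.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical signature: get(path, **params) returns a decoded JSON dict.
Getter = Callable[..., Dict[str, Any]]


def fetch_all(get: Getter, path: str, limit: int = 1000, **params: Any) -> Iterator[Dict[str, Any]]:
    """Yield every object from a limit/offset paginated endpoint."""
    offset = 0
    while True:
        page = get(path, limit=limit, offset=offset, **params)
        results = page.get("results", [])
        yield from results
        # Assumption: "next" is empty or missing on the last page.
        if not page.get("next") or not results:
            break
        offset += len(results)


def poll_once(get: Getter, path: str, last_seen: Optional[str], limit: int = 1000):
    """Return (new objects, latest "created" timestamp seen so far)."""
    params = {"created_after": last_seen} if last_seen else {}
    objects = list(fetch_all(get, path, limit=limit, **params))
    # ISO 8601 UTC timestamps in the same format sort lexicographically.
    latest = max((o["created"] for o in objects), default=last_seen)
    return objects, latest
```

The first call would pass `last_seen=None` to backfill everything; later calls pass the timestamp returned by the previous one so that only newer objects are requested.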
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/f9d8f487b49d447d8f37dc2007613d34

✔️ ara-tox-py3 SUCCESS in 4m 09s
ara-tox-linters FAILURE in 3m 32s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 33s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 35s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 03s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 04s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 5m 20s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 11m 19s

- Added --max-days to limit backfill at boot
- Added a bit of verbosity
- Adjust hosts to be scanned before tasks (there are way, way more tasks
  than hosts in terms of volume)
- First try at a playbook histogram containing the timestamp and
  duration
@dmsimard
Contributor Author

dmsimard commented Feb 24, 2023

I've added a bit more context in the issue (#177 (comment)) and got two quick iterations in:

  • Added --max-days to limit backfill at boot
  • Added a bit of verbosity
  • Adjust hosts to be scanned before tasks (there are way, way more tasks than hosts in terms of volume)
  • First try at a playbook histogram containing the timestamp and duration
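
For illustration, a `--max-days` limit could translate into a `created_after` cutoff along these lines. `backfill_cutoff` is a hypothetical helper, and the timestamp format is modelled on the `updated` values that appear in the exporter output in this thread rather than taken from the PR's code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backfill_cutoff(max_days: int, now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 UTC timestamp max_days before now."""
    # "now" can be injected to make the function deterministic in tests.
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
```

The returned string could then be used as the `created_after` value of the very first query, so the backfill at boot never reaches further back than `max_days`.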

Edit: I've put up an example /metrics response from a single playbook's metric as a histogram in the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0#file-playbooks_as_histogram-txt

It wants to group metrics based on the uniqueness of their labels. I suppose in our case we want each playbook to be represented individually, so we should include their id? More on that later.
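
The grouping behaviour follows from how Prometheus identifies a time series: by its metric name plus its exact set of labels. A plain-Python sketch (no `prometheus_client` involved; `series_key`, `group`, the `ara_playbooks` name and the sample data are all made up for illustration) shows why playbooks that share every label collapse into one series unless an `id` label is included:

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name and sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


def group(samples, label_names):
    """Group sample values by the series they would land in."""
    series = defaultdict(list)
    for sample in samples:
        labels = {k: sample[k] for k in label_names}
        series[series_key("ara_playbooks", labels)].append(sample["duration"])
    return series


# Made-up playbooks: same path and status, different ids.
playbooks = [
    {"id": "29", "path": "smoke.yaml", "status": "completed", "duration": 12.5},
    {"id": "30", "path": "smoke.yaml", "status": "completed", "duration": 14.0},
]

without_id = group(playbooks, ["path", "status"])
with_id = group(playbooks, ["id", "path", "status"])
print(len(without_id), len(with_id))
```

The trade-off is cardinality: with an `id` label, every recorded playbook becomes its own series, so the number of series grows with the number of playbooks.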

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/d069974d12c14515aded43c6df617003

✔️ ara-tox-py3 SUCCESS in 3m 24s
ara-tox-linters FAILURE in 3m 15s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 50s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 26s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 15s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 16s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 29s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 28s
✔️ ara-container-images SUCCESS in 11m 56s

Still heavily a work in progress but getting a better understanding of
how things work.

Hosts and tasks now have gauges by status.
Disable playbook metrics temporarily until we revisit it with newfound
knowledge.
@dmsimard
Contributor Author

I think my brain is starting to understand what is happening.

I've temporarily commented out the current iteration of the playbook metrics until I revisit it with newfound knowledge.

This latest iteration re-works the host and tasks metrics to have gauges per status such that we are able to do graphs like this, for example:

[Screenshot from 2023-06-18 19-53-59: Prometheus task results in grafana]

[Screenshot from 2023-06-18 19-54-20: Prometheus host results in grafana]

A snippet of what this looks like when querying the prometheus exporter:

# HELP ara_tasks_total Number of tasks recorded by ara in prometheus
# TYPE ara_tasks_total gauge
ara_tasks_total 403.0
# HELP ara_tasks_range Limit metric collection to the N most recent tasks
# TYPE ara_tasks_range gauge
ara_tasks_range 2500.0
# HELP ara_tasks_completed Completed Ansible tasks
# TYPE ara_tasks_completed gauge
ara_tasks_completed{action="command",duration="00:00:00.294820",name="Echo the �abc binary string",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.665787Z"} 1.0
ara_tasks_completed{action="debug",duration="00:00:00.155210",name="Task with non-ascii characters - ä, ö, ü",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.317583Z"} 1.0
ara_tasks_completed{action="gather_facts",duration="00:00:01.035601",name="Gathering Facts",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.098823Z"} 1.0
# HELP ara_tasks_failed Failed Ansible tasks
# TYPE ara_tasks_failed gauge
ara_tasks_failed{action="command",duration="00:00:00.455411",name="smoke-tests : Return false",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/roles/smoke-tests/tasks/test-ops.yaml",playbook="30",results="1",status="failed",updated="2023-06-08T02:43:25.190901Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.210469",name="fail",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/failed.yaml",playbook="29",results="1",status="failed",updated="2023-06-08T02:43:07.648379Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.219566",name="Generate a failure that will be rescued",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/lookups.yaml",playbook="26",results="1",status="failed",updated="2023-06-08T02:32:51.180755Z"} 1.0
# ...

# HELP ara_hosts_total Hosts recorded by ara
# TYPE ara_hosts_total gauge
ara_hosts_total 43.0
# HELP ara_hosts_range Limit metric collection to the N most recent hosts
# TYPE ara_hosts_range gauge
ara_hosts_range 2500.0
# HELP ara_hosts_changed Number of changes on a host
# TYPE ara_hosts_changed gauge
ara_hosts_changed{name="localhost",playbook="30",updated="2023-06-08T02:43:29.848077Z"} 10.0
ara_hosts_changed{name="localhost",playbook="28",updated="2023-06-08T02:33:20.625359Z"} 1.0
ara_hosts_changed{name="localhost",playbook="26",updated="2023-06-08T02:32:54.179356Z"} 1.0
# HELP ara_hosts_failed Number of failures on a host
# TYPE ara_hosts_failed gauge
ara_hosts_failed{name="localhost",playbook="29",updated="2023-06-08T02:43:07.767992Z"} 1.0
ara_hosts_failed{name="localhost",playbook="24",updated="2023-06-08T02:32:18.773096Z"} 1.0
ara_hosts_failed{name="localhost",playbook="23",updated="2023-06-08T02:04:04.810142Z"} 1.0
# ...

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/75ed0374bc6e4344af27503fe6350e60

✔️ ara-tox-py3 SUCCESS in 9m 57s
ara-tox-linters FAILURE in 9m 48s
✔️ ara-basic-ansible-core-devel SUCCESS in 4m 59s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 6m 11s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 10m 57s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 10m 38s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 10m 51s
✔️ ara-basic-ansible-2.9 SUCCESS in 10m 50s
✔️ ara-container-images SUCCESS in 17m 13s

- Add a summary metric for tracking the duration of tasks.

This is what was intended when trying to do the playbook histogram so
we'll come back to that later.
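
As a rough illustration of what a summary tracks, the plain-Python sketch below keeps a count and a sum of observed durations per label value, which is the information a Prometheus summary without quantiles exposes as `_count` and `_sum`. It only mimics the idea: `DurationSummary` is a hypothetical class, and it is not how this PR or `prometheus_client` implements it.

```python
from collections import defaultdict


class DurationSummary:
    """Tracks the count and sum of observations per label value."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def observe(self, action, seconds):
        # One observation per finished task, keyed by its action label.
        self.count[action] += 1
        self.total[action] += seconds

    def average(self, action):
        # Mean duration, the same idea as dividing _sum by _count.
        return self.total[action] / self.count[action]
```

Compared to one gauge per task, this keeps a fixed number of series per label value no matter how many tasks are observed.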
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/4c6c9dea87f14d93aa1ec28b71ebc083

✔️ ara-tox-py3 SUCCESS in 4m 14s
ara-tox-linters FAILURE in 3m 12s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 20s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 7m 07s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 8m 02s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 20s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 32s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 17s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 40s
✔️ ara-container-images SUCCESS in 11m 13s

@softwarefactory-project-zuul

Build succeeded.
https://ansible.softwarefactory-project.io/zuul/buildset/59731f5a132942749960db45ae05a18a

✔️ ara-tox-py3 SUCCESS in 4m 15s
✔️ ara-tox-linters SUCCESS in 3m 57s
✔️ ara-basic-ansible-core-devel SUCCESS in 7m 09s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 6m 09s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 24s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 6m 30s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 08s
✔️ ara-basic-ansible-2.9 SUCCESS in 6m 31s
✔️ ara-container-images SUCCESS in 11m 36s

- Substantial cleanup and cut on code duplication
- Fix linting and style
- Metric labels moved to default constants, leaving the door open to
  the possibility of customizing them
- Retrofit what we learned back to the playbook metrics
- Re-enable playbook metrics
@dmsimard
Contributor Author

Lots of cleanup in this last iteration and I've done some tweaking on the grafana dashboard.

It looks like this now:
[Screenshot from 2023-06-20 01-17-51]

[Screenshot from 2023-06-20 01-18-23]

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/0eed3702b4444312b85e762bc95e51dc

✔️ ara-tox-py3 SUCCESS in 3m 12s
ara-tox-linters FAILURE in 3m 12s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 16s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 58s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 20s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 6m 54s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 4m 51s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 6m 03s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 11m 33s

- More cleanup
- Removed Gauges for each status of playbooks and tasks: in hindsight they
  generated a lot of needless metrics and were not useful once we
  understood how to use Summaries
- Added a package extra for [prometheus]
- First iteration of docs
- Add first iteration of grafana dashboard
@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/fe23cb058a504bc48f68b007b1d4de91

✔️ ara-tox-py3 SUCCESS in 3m 15s
ara-tox-linters FAILURE in 3m 07s
✔️ ara-tox-docs SUCCESS in 7m 57s
✔️ ara-basic-ansible-core-devel SUCCESS in 5m 09s (non-voting)
✔️ ara-basic-ansible-6 SUCCESS in 5m 03s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 11m 10s
✔️ ara-basic-ansible-core-2.13 SUCCESS in 5m 06s
✔️ ara-basic-ansible-core-2.12 SUCCESS in 5m 06s
✔️ ara-basic-ansible-core-2.11 SUCCESS in 4m 45s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 08s
✔️ ara-container-images SUCCESS in 10m 57s

@dmsimard
Contributor Author

I feel this is ready for a first look by a wider audience so I've asked around for testing and feedback.

The final implementation may change before landing (for example if I screwed up in metric types) but this will be useful to make sure we made the right decisions and make the necessary changes before merging.

I am narrowing the scope of this first PR to playbooks, tasks and hosts for now. Results and plays can come in a later patch as necessary.

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/5332cbba06be4ca09a29ccbfe24bb719

✔️ ara-tox-py3 SUCCESS in 3m 50s
ara-tox-linters FAILURE in 3m 56s
✔️ ara-tox-docs SUCCESS in 3m 58s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 17s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 6m 03s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 6m 53s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 5m 23s
✔️ ara-basic-ansible-2.9 SUCCESS in 6m 06s
✔️ ara-container-images SUCCESS in 12m 00s

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/51c4f4164d66409bbf48568389543706

✔️ ara-tox-py3 SUCCESS in 3m 49s
ara-tox-linters FAILURE in 3m 53s
✔️ ara-tox-docs SUCCESS in 3m 11s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 03s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 6m 01s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 7m 29s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 7m 20s
✔️ ara-basic-ansible-2.9 SUCCESS in 5m 55s
✔️ ara-container-images SUCCESS in 11m 19s

@dmsimard dmsimard marked this pull request as draft September 9, 2023 15:49
@dmsimard
Contributor Author

Nothing special pushed, just rebased on top of latest master.

@softwarefactory-project-zuul

Build failed.
https://ansible.softwarefactory-project.io/zuul/buildset/7f750024dd7b42b2987983a14fc3a884

✔️ ara-tox-py3 SUCCESS in 4m 05s
ara-tox-linters FAILURE in 3m 50s
✔️ ara-tox-docs SUCCESS in 3m 15s
✔️ ara-basic-ansible-core-devel SUCCESS in 6m 55s (non-voting)
✔️ ara-basic-ansible-8 SUCCESS in 7m 00s
✔️ ara-basic-ansible-core-2.15 SUCCESS in 6m 58s
✔️ ara-basic-ansible-core-2.14 SUCCESS in 6m 21s
✔️ ara-container-images SUCCESS in 13m 52s

@dmsimard
Contributor Author

I will eventually include it in the docs but in the meantime, I've come up with the following graph that explains how one might use the exporter:

                                         ┌──────────────────┐
       ┌────────────┐ promql ┌─────────┐ │ ansible-playbook │
       │ Prometheus │◄───────┤ Grafana │ │    (with ara)    │
       └──────┬─────┘        └─────────┘ └───────┬──────────┘
              │                                  │
              │ scrapes /metrics                 │ collects data
              │ & stores results                 │ & sends it
              │                                  │
   ┌──────────▼──────────┐               ┌───────▼────────┐
   │ Prometheus Exporter ├──────────────►│ ara API server │
   │ (prometheus_client) │ query metrics │    (django) ┌──┴─────────┐
   └─────────────────────┘               └─────────────┤ recorded   │
                                                       │  playbooks │
                                                       └────────────┘


ara doesn't provide monitoring or alerting out of the box (they are out of scope) but it records a number of granular metrics about Ansible playbooks, tasks and hosts, amongst other things.

Starting with version 1.6.2, ara provides an integration of `prometheus_client <https://github.com/prometheus/client_python>`_ that queries the ara API and then exposes these metrics for prometheus to scrape.
Contributor Author

1.6.2 didn't pan out, we went straight to 1.7.0. It can be included in a release as soon as it's ready.

help='Maximum number of days to backfill metrics for (default: 90)',
default=90,
type=int
)
Contributor Author

I think it could be interesting for the exporter to be able to filter queries the same way the general CLI commands do. For example, ara playbook list (docs) has:

  --ansible_version <ansible_version>
                        List playbooks that ran with the specified Ansible
                        version (full or partial)
  --client_version <client_version>
                        List playbooks that were recorded with the specified
                        ara client version (full or partial)
  --server_version <server_version>
                        List playbooks that were recorded with the specified
                        ara server version (full or partial)
  --python_version <python_version>
                        List playbooks that were recorded with the specified
                        python version (full or partial)
  --user <user>         List playbooks that were run by the specified user
                        (full or partial)
  --controller <controller>
                        List playbooks that ran from the provided controller
                        (full or partial)
  --name <name>         List playbooks matching the provided name (full or
                        partial)
  --path <path>         List playbooks matching the provided path (full or
                        partial)
  --status <status>     List playbooks matching a specific status
                        ('completed', 'running', 'failed')
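
A minimal sketch of how such filters might be wired into the exporter, assuming the API accepts the same field names as query parameters (that mapping, and the `build_parser` and `query_params` helpers, are hypothetical and not part of this PR):

```python
import argparse


def build_parser():
    """Declare a subset of the playbook filters as optional arguments."""
    parser = argparse.ArgumentParser(prog="ara prometheus")
    for field in ("name", "path", "user", "controller"):
        parser.add_argument(f"--{field}", metavar=f"<{field}>", default=None)
    parser.add_argument("--status", choices=("completed", "running", "failed"), default=None)
    return parser


def query_params(args):
    """Keep only the filters that were actually provided."""
    return {key: value for key, value in vars(args).items() if value is not None}
```

For example, `query_params(build_parser().parse_args(["--status", "failed"]))` yields a dict that could be passed along with each API query, so only failed playbooks would be turned into metrics.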

@voileux

voileux commented Nov 17, 2023

Hi,
I was at the Ansible meetup at the OVH building in Montreal; your presentation was really good.
In Prometheus, it's a problem when the value of a label changes between polling intervals for a given metric; it's better to turn that label into a metric of its own.

For example, I think you can transform this metric:
ara_tasks_completed{
action="command",
duration="00:00:00.294820",
name="Echo the �abc binary string",
path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",
playbook="30",
results="1",
status="completed",
updated="2023-06-08T02:43:29.665787Z"} 1.0

into several metrics:
ara_tasks_status { action="command", name="Echo the abc binary string", path="/home/.......", playbook="30" } 1 (you can map an integer value to each status name: 1 for 'completed', 2 for 'running', 3 for 'failed')

ara_tasks_duration { action="command", name="Echo the abc binary string", path="/home/.......", playbook="30" } number of seconds (or microseconds if needed)

ara_tasks_results { action="command", name="Echo the abc binary string", path="/home/.......", playbook="30" } 1

We can work together to define the correct metrics, then we can write the corresponding Python for the exporter.
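
To make the suggestion concrete, here is a rough sketch of the transformation in plain Python. The `STATUS_CODES` mapping follows the numbering proposed above; `duration_to_seconds` and `split_task` are hypothetical helpers, and the duration parsing assumes the `HH:MM:SS.ffffff` format seen in the exporter output earlier in this thread.

```python
STATUS_CODES = {"completed": 1, "running": 2, "failed": 3}


def duration_to_seconds(duration):
    """Convert "HH:MM:SS.ffffff" into a float number of seconds."""
    # Assumption: durations are shorter than 24 hours (no "N days" prefix).
    hours, minutes, seconds = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def split_task(task):
    """Return {metric name: (labels, value)} for a single task."""
    # Only stable fields stay as labels; changing values become metrics.
    labels = {k: task[k] for k in ("action", "name", "path", "playbook")}
    return {
        "ara_tasks_status": (labels, STATUS_CODES[task["status"]]),
        "ara_tasks_duration": (labels, duration_to_seconds(task["duration"])),
        "ara_tasks_results": (labels, int(task["results"])),
    }
```

With this shape, `duration`, `status` and `results` are no longer labels, so a task that changes status or keeps running updates the value of an existing series instead of creating a new one.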

@dmsimard
Contributor Author

Hi @voileux and thanks for reaching out!

What you suggest makes sense to me and it's worth looking into.

I don't have bandwidth to look into this /right now/ but I will revisit this in the near future.

@copolycube

copolycube commented Nov 23, 2023

Hello,

Depending on your goal here, it might be easier for you to limit the "exporter part" to what you want to monitor live (i.e. what you want to trigger alerts on).

For the visualization aspects, you could connect grafana directly to your database with the specific grafana datasource.

Something like:

flowchart TD
    G[Grafana] -->|promql <br/> visualize <b>alerts</b><br/> and correlate current metrics| P(Prometheus )
    G -->|db datasource <br/> visualize <b>metrics</b> <br/>current and historical| D
    W(alertmanager) -->|promql<br/>trigger alerts| P
    P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D

instead of (from your previous schema here)

flowchart TD
    G[Grafana] -->|promql| P(Prometheus)
    P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D

(edit: I forgot to put the mermaid keyword, and took this opportunity to add alertmanager & clarify the schema equivalent to the one you presented before)

This does require you to rewrite your panels in grafana to use the proper SQL, and you will need to open the connection between grafana and your DB.

It also avoids transforming the whole content of the DB into opentelemetry format and scraping it each time, which will scale better :-D

@dmsimard
Contributor Author

Hi, I haven't revisited this in a little while but I wanted to say it was still on my radar and I plan to work on this some more in the near future.
