Explore adding table ids tags to some metrics #4511

Open
keith-turner opened this issue May 1, 2024 · 14 comments
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@keith-turner
Contributor

Is your feature request related to a problem? Please describe.

The monitor currently has a custom metrics system that tracks certain metrics per tablet server per table id. The monitor currently lacks any scan server metrics similar to its tablet server metrics. The metrics being emitted by scan servers almost have enough information to construct a scan server metrics dashboard in an external metrics system; however, the lack of table id information prevents parity with the tablet server view in the monitor.

In general, if table ids were added to some subset of tablet server and scan server metrics, then an external metrics system could achieve parity with the metrics offered by the monitor. The monitor metrics in question are primarily concerned with table reads, writes, and compactions.

Describe the solution you'd like

For a subset of metrics related to reading and writing data, tag them with table ids. This would likely result in a Map<TableId, Meter> in the code. This map would need to support adding new meters as new table ids are encountered and removing meters for table ids that have not been used in a while. The meters in this map would carry the table id tag.
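A rough sketch of what that map could look like using Micrometer (the metrics library Accumulo uses); the class name, metric name, and idle-eviction policy below are illustrative assumptions, not a proposed implementation:

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

/**
 * Hypothetical sketch: lazily creates one counter per table id, tagged with
 * "tableId", and removes counters for table ids that have gone idle.
 */
public class PerTableMeters {

  private static class Entry {
    final Counter counter;
    volatile long lastUsedNanos;

    Entry(Counter counter) {
      this.counter = counter;
      this.lastUsedNanos = System.nanoTime();
    }
  }

  private final MeterRegistry registry;
  private final String metricName;
  private final Duration idleTimeout;
  private final ConcurrentMap<String,Entry> meters = new ConcurrentHashMap<>();

  public PerTableMeters(MeterRegistry registry, String metricName, Duration idleTimeout) {
    this.registry = registry;
    this.metricName = metricName;
    this.idleTimeout = idleTimeout;
  }

  /** Record activity for a table, creating its tagged meter on first use. */
  public void increment(String tableId, double amount) {
    Entry entry = meters.computeIfAbsent(tableId,
        id -> new Entry(Counter.builder(metricName).tag("tableId", id).register(registry)));
    entry.lastUsedNanos = System.nanoTime();
    entry.counter.increment(amount);
  }

  /** Called periodically to drop meters for table ids that have been idle. */
  public void removeIdle() {
    long cutoff = System.nanoTime() - idleTimeout.toNanos();
    meters.entrySet().removeIf(e -> {
      if (e.getValue().lastUsedNanos < cutoff) {
        registry.remove(e.getValue().counter);
        return true;
      }
      return false;
    });
  }
}
```

The key pieces are creating the tagged meter lazily via computeIfAbsent and calling MeterRegistry.remove so idle table ids stop being reported.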

Describe alternatives you've considered

Do not try to achieve feature parity in external metrics and custom monitor metrics.

keith-turner added the enhancement label May 1, 2024
@EdColeman
Contributor

Other than for well-defined Accumulo system tables (root, metadata, ...), it is probably not a good idea to add unbounded tags. Systems that have large numbers of dynamic tables will have issues.

@keith-turner
Contributor Author

The following may be relevant if something is done for this issue.

https://docs.micrometer.io/micrometer/reference/concepts/meter-provider.html
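As a quick illustration of the linked API, a sketch assuming Micrometer 1.12+ (where meter builders expose withRegistry(...)); the metric name and table ids here are made up:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterProvider;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MeterProviderExample {
  public static void main(String[] args) {
    MeterRegistry registry = new SimpleMeterRegistry();

    // Build the meter once, without the table id tag ...
    MeterProvider<Counter> scanBytes =
        Counter.builder("scan.bytes.read").withRegistry(registry);

    // ... and supply the tag value at recording time; Micrometer caches the
    // resulting meter per unique combination of tags.
    scanBytes.withTags("tableId", "1a").increment(4096);
    scanBytes.withTags("tableId", "2b").increment(1024);
  }
}
```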

@keith-turner
Contributor Author

keith-turner commented May 1, 2024

Other than for well-defined Accumulo system tables (root, metadata, ...), it is probably not a good idea to add unbounded tags. Systems that have large numbers of dynamic tables will have issues.

Yeah, there is a big difference there. The monitor metrics only track tables that currently exist; they will not track any deleted tables. Adding a table id tag for external metrics would cause the metrics system to retain data for tables that no longer exist, so that is a huge difference. If anything is done towards this, it would probably need to only add tags for tables specified by a user. So we could provide a configurable list of tables to tag and only add tags for tables in that list.

@dlmarion
Contributor

dlmarion commented May 1, 2024

https://docs.micrometer.io/micrometer/reference/concepts/meter-provider.html

I saw this as well. Additionally, there is a MultiGauge.

If we are going to do something that will dynamically create Meters, then we need to be sure to remove them (MeterRegistry.remove). Otherwise the process will continue to report the metrics for its lifetime.
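For the MultiGauge route, a rough sketch (metric and tag names are made up) that re-registers all rows on each update with overwrite=true, so rows for table ids that disappear stop being reported:

```java
import java.util.Map;
import java.util.stream.Collectors;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.MultiGauge;
import io.micrometer.core.instrument.Tags;

/** Hypothetical per-table gauge; rows are replaced wholesale on each update. */
public class QueuedCompactionsGauge {

  private final MultiGauge gauge;

  public QueuedCompactionsGauge(MeterRegistry registry) {
    // Registered once; the per-table rows are supplied later.
    this.gauge = MultiGauge.builder("compactions.queued").register(registry);
  }

  /**
   * Replace all rows with the current per-table counts. overwrite=true means
   * rows for table ids missing from the map are dropped rather than reported
   * forever.
   */
  public void update(Map<String,Number> queuedByTableId) {
    gauge.register(queuedByTableId.entrySet().stream()
        .map(e -> MultiGauge.Row.of(Tags.of("tableId", e.getKey()), e.getValue()))
        .collect(Collectors.toList()), true);
  }
}
```

That covers the removal concern for gauges; dynamically created counters and timers would still need explicit MeterRegistry.remove calls.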

@keith-turner
Contributor Author

It may be more useful to tag the metrics with a table name instead of a table id.

@EdColeman
Contributor

One way to think of it is that each unique tag value creates a unique time series in some (most?) back ends. So, in addition to the number of metrics reported each interval, there can be a negative impact on the collection / storage / display systems.

@EdColeman
Contributor

Would something like tracing work? If we could activate / de-activate tracing on demand, then maybe we could collect the needed values for profiling and turn tracing off when it is not being used? No idea what it would look like at this point.

@dlmarion
Contributor

dlmarion commented May 1, 2024

Would something like tracing work?

If tracing provided the level of information needed, then it could be enabled on a subset of the scan servers / tablet servers via settings in accumulo-env.sh. If there are problems with specific queries, those queries could be directed to specific scan servers via the resource group mechanism.

@keith-turner
Contributor Author

One way to think of it is that each unique tag value creates a unique time series in some (most?) back ends. So, in addition to the number of metrics reported each interval, there can be a negative impact on the collection / storage / display systems.

Yeah, I think for this to be workable it would need a user-provided list of tables to tag. This could be a metrics property whose value is a list of table names, or it could be a per-table boolean property. This would allow control over which tables are tagged.

Would something like tracing work?

I think that would be out of scope for this issue. This issue could be closed if it does not seem workable. This issue was about supporting the monitor's current functionality in an external metrics system. Currently in the monitor, clicking on a table shows metrics that are per table and per tserver. Also, in the monitor's tables view you can see metrics like how many compactions are queued for a table and what the current ingest rate is for a table.

@dlmarion
Contributor

dlmarion commented May 1, 2024

Yeah, I think for this to be workable it would need a user-provided list of tables to tag. This could be a metrics property whose value is a list of table names, or it could be a per-table boolean property. This would allow control over which tables are tagged.

The code needs to be responsive to changes in the list so that processes don't need to be restarted for metrics to be enabled / disabled.
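Tying the two ideas together, a sketch of a filter that re-reads a hypothetical table-list property on each check, so enabling / disabling tags would not require a restart; the property name, its comma-separated format, and the Supplier used to re-read configuration are all made-up placeholders:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: decides whether a table's metrics get a table id/name tag.
 * The Supplier stands in for however configuration would actually be re-read.
 */
public class TaggedTableFilter {

  private final Supplier<String> tableListProperty;

  public TaggedTableFilter(Supplier<String> tableListProperty) {
    this.tableListProperty = tableListProperty;
  }

  public boolean shouldTag(String tableName) {
    // e.g. a hypothetical property like general.metrics.table.tag.list=ci,trades
    String value = tableListProperty.get();
    if (value == null || value.isBlank()) {
      return false;
    }
    Set<String> tables = Arrays.stream(value.split(","))
        .map(String::trim)
        .collect(Collectors.toSet());
    return tables.contains(tableName);
  }
}
```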

@EdColeman
Contributor

For example guidance, the Prometheus label guidelines provide the following:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.

@dlmarion
Contributor

dlmarion commented May 1, 2024

For example guidance, the Prometheus label guidelines provide the following:

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.

These guidelines aren't general time series database guidelines, right? They are based on the limitations of Prometheus?

@EdColeman
Contributor

While those are specific to Prometheus, my understanding is that they apply generally across various metrics systems. There are similar limitations for things like InfluxDB, and it all relates to unique tags (labels) creating unique time series.

@dlmarion
Contributor

dlmarion commented May 1, 2024

Yeah, I get that total cardinality is an issue for most of the TSDBs, but I think the limitations are different for each. My point was that we should not take the numbers from Prometheus as a hard limit.
