
Verify FATE metric counts. #4498

Open
EdColeman opened this issue Apr 26, 2024 · 16 comments

@EdColeman
Contributor

Noticed that the FATE metric values logged during MetricsIT (in the manager log) are mostly reporting 0. Given the activity generated by the test, it seems likely that some values should be non-zero.

It could be that 0 is correct, but this should be verified.

Logged metrics for one metric update cycle (manager log):

2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.errors{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,tag1=value1,tag2=value2,type=zk.connection} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.ops.activity{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,tag1=value1,tag2=value2} value=24
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=submitted,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=in_progress,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=successful,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=failed,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=unknown,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=new,tag1=value1,tag2=value2} value=0
2024-04-26T21:29:45,965 47 [accumulo.METRICS] INFO : accumulo.fate.tx{host=ip-10-113-15-130,instance.name=miniInstance,port=39177,process.name=manager,resource.group=default,state=failed_in_progress,tag1=value1,tag2=value2} value=0

@kevinrr888
Contributor

I can look into this

@kevinrr888
Contributor

kevinrr888 commented May 8, 2024

It seems like the METRICS are logged every second (configured by cfg.setProperty("general.custom.metrics.opts.logging.step", "1s"); in MetricsIT) but only updated every 5 seconds (configured by private static final long DEFAULT_MIN_REFRESH_DELAY = TimeUnit.SECONDS.toMillis(5); in FateMetrics). This makes it look like the metric values are wrong at the time they are logged in MetricsIT. Since this is a fast test, the logged values are often stale; after decreasing the update interval to one second, we can see values other than 0. Here's one set of the logged metrics for one cycle, showing a submitted transaction:

2024-05-07T16:05:07,695 [accumulo.METRICS] INFO : accumulo.fate.errors{host=groot,instance.name=miniInstance,port=35101,process.name=manager,type=zk.connection} value=0
2024-05-07T16:05:07,695 [accumulo.METRICS] INFO : accumulo.fate.ops.activity{host=groot,instance.name=miniInstance,port=35101,process.name=manager} value=1
2024-05-07T16:05:07,695 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress{host=groot,instance.name=miniInstance,port=35101,process.name=manager} value=1
2024-05-07T16:05:07,696 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=failed_in_progress} value=0
2024-05-07T16:05:07,696 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=in_progress} value=0
2024-05-07T16:05:07,696 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=new} value=0
2024-05-07T16:05:07,697 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=unknown} value=0
2024-05-07T16:05:07,697 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=successful} value=0
2024-05-07T16:05:07,697 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=failed} value=0
2024-05-07T16:05:07,697 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=35101,process.name=manager,state=submitted} value=1

So it appears that the values of 0 are correct; they are just not updated as frequently as they are logged, which makes the count of 0 appear more frequent than it actually is. Perhaps for this test we could increase the logging interval to 5 seconds, or decrease the update interval to 1 second, so it is clearer in the logs what is actually happening at what time.
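
As a minimal sketch of the first option (this only changes the step value in the cfg.setProperty call that MetricsIT already makes; nothing else is assumed):

// In MetricsIT's cluster configuration: log on the same 5 second cadence that
// FateMetrics uses for DEFAULT_MIN_REFRESH_DELAY, so each logged value reflects
// a fresh update.
cfg.setProperty("general.custom.metrics.opts.logging.step", "5s");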

One thing I noticed while looking at this: accumulo.fate.ops.in.progress is a sum of all current txns (failed_in_progress, in_progress, new, unknown, successful, failed, and submitted). The "in.progress" may be a bit misleading since the "in.progress" in this case is different from the state "in_progress". Maybe something like accumulo.fate.ops.total or accumulo.fate.ops.current.ops would be clearer? Or maybe this was intended to only include txns that are in_progress? This doesn't seem to apply to accumulo.fate.ops.in.progress.by.type, since that metric only includes those that are in_progress.

Another thing I noticed: it seems like the accumulo.fate.ops.in.progress.by.type metric is not reset between updates. For example, the following two cycles of logs are 1 second apart:

2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.errors{host=groot,instance.name=miniInstance,port=34595,process.name=manager,type=zk.connection} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.ops.activity{host=groot,instance.name=miniInstance,port=34595,process.name=manager} value=3
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress{host=groot,instance.name=miniInstance,port=34595,process.name=manager} value=1
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress.by.type{host=groot,instance.name=miniInstance,op.type=TABLE_COMPACT,port=34595,process.name=manager} value=1
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=unknown} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=successful} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=failed} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=submitted} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=failed_in_progress} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=new} value=0
2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=in_progress} value=1

2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.errors{host=groot,instance.name=miniInstance,port=34595,process.name=manager,type=zk.connection} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.ops.activity{host=groot,instance.name=miniInstance,port=34595,process.name=manager} value=4
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress{host=groot,instance.name=miniInstance,port=34595,process.name=manager} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress.by.type{host=groot,instance.name=miniInstance,op.type=TABLE_COMPACT,port=34595,process.name=manager} value=1
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=unknown} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=successful} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=failed} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=submitted} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=failed_in_progress} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=new} value=0
2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=in_progress} value=0

In the first set of logs, 1 transaction is in_progress on TABLE_COMPACT; in the second set, 0 transactions are in_progress but 1 is still reported as in progress doing op TABLE_COMPACT, which doesn't make sense. Looking at the code and after some experimentation, this doesn't seem to be an issue of the transaction still appearing in_progress in ZK; instead it seems like it is just being logged again. I can't exactly narrow down why it's being logged again; someone more familiar with the metrics might be able to better understand.

And one question: I noticed that in elasticity these metrics are only obtained by looking at FATE transactions in ZK (List<AdminUtil.TransactionStatus> currFates = admin.getTransactionStatus(Map.of(FateInstanceType.META, metaFateStore), null, null, null); in FateMetricValues). Should this also be looking at the UserFateStore/Accumulo FATE table for transactions, or do we only care about META transactions here?
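
For illustration only, a sketch of what including the user store might look like, assuming FateMetricValues is given a reference to the user fate store (userFateStore is a hypothetical variable name, and passing both stores in the one map is an assumption about the API rather than something confirmed here):

// Hypothetical sketch: mirror the existing META lookup but include the user store as well.
List<AdminUtil.TransactionStatus> currFates = admin.getTransactionStatus(
    Map.of(FateInstanceType.META, metaFateStore,
        FateInstanceType.USER, userFateStore),
    null, null, null);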

@EdColeman
Contributor Author

I believe accumulo.fate.ops.in.progress.by.type was modeled on a monotonically increasing counter and so should not be reset. Not sure about the reporting, though.

The FATE metrics should provide a window / snapshot into system activity. If the UserFateStore/FATE table is necessary for a complete view, then yes, it should be included.

Since these were introduced in 1.10, the usage was to allow users to monitor system activity like bulk imports, create table, and drop table; knowing the number of transactions provides a quick view of the system health. Knowing the type and step counts helps isolate issues.

@EdColeman
Contributor Author

The LoggingMetrics are useful for development and testing, but in production another solution should be used; StatsD and Prometheus are two that we should strive to support. Trying a different collection system may be informative.

@kevinrr888
Contributor

kevinrr888 commented May 8, 2024

I believe accumulo.fate.ops.in.progress.by.type was modeled on a monotonically increasing counter and so should not be reset. Not sure about the reporting, though.

It just seems like a bug that it is continually reported as being in progress on a certain step when it has already completed.

The FATE metrics should provide a window / snapshot into system activity. If the UserFateStore/FATE table is necessary for a complete view, then yes, it should be included.

The UserFateStore includes the FATE transactions on user tables (a user creating a table, a user compacting a table, etc.), so this may or may not be a part of "system activity" (I'm just not sure what is/isn't considered system activity).

@EdColeman
Contributor Author

It just seems like a bug that it is continually reported as being in progress on a certain step when it has already completed.

Are you questioning the reporting or the value? If it is a counter, then systems like Prometheus will end up calculating the delta between measurements. If it should be a gauge, then yes, it should be reset.

@EdColeman
Contributor Author

EdColeman commented May 8, 2024

I'm just not sure what is/isn't considered system activity

My interpretation is anything that is occurring. While users may want visibility into specific Accumulo internals (say metadata, gc, ...), overall what they care about is how Accumulo is performing: is it ingesting data in a timely fashion when received? Are user queries being served and returned? Is the system under strain, and are there specific bottlenecks or failures? Are compactions running and completing as expected? Are files being deleted as intended so they don't run out of space ...

The metrics should be able to answer these types of questions as well as provide feedback for tuning system parameters: increasing the number of threads for..., allocating more memory to..., adding or removing compactors, scan servers, and tservers to provide enough capability in an efficient manner.

@kevinrr888
Contributor

It just seems like a bug that it is continually reported as being in progress on a certain step when it has already completed.

Are you questioning the reporting or the value? If it is a counter, then systems like Prometheus will end up calculating the delta between measurements. If it should be a gauge, then yes, it should be reset.

We can see in the first set of logs that there is one transaction in progress:

2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=in_progress} value=1

It also shows the operation that transaction is currently on in that same set of logs:

2024-05-08T10:32:56,013 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress.by.type{host=groot,instance.name=miniInstance,op.type=TABLE_COMPACT,port=34595,process.name=manager} value=1

This is all fine/expected

One second later, when the next set of logs occur, there are zero transactions in progress:

2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.tx{host=groot,instance.name=miniInstance,port=34595,process.name=manager,state=in_progress} value=0

But we still see the same log that there is a transaction on operation TABLE_COMPACT:

2024-05-08T10:32:57,014 [accumulo.METRICS] INFO : accumulo.fate.ops.in.progress.by.type{host=groot,instance.name=miniInstance,op.type=TABLE_COMPACT,port=34595,process.name=manager} value=1

The compaction has completed, so it seems wrong that this is still logged

@kevinrr888
Contributor

I'm just not sure what is/isn't considered system activity

My interpretation is anything that is occurring. While users may want visibility into specific Accumulo internals (say metadata, gc, ...), overall what they care about is how Accumulo is performing: is it ingesting data in a timely fashion when received? Are user queries being served and returned? Is the system under strain, and are there specific bottlenecks or failures? Are compactions running and completing as expected? Are files being deleted as intended so they don't run out of space ...

The metrics should be able to answer these types of questions as well as provide feedback for tuning system parameters: increasing the number of threads for..., allocating more memory to..., adding or removing compactors, scan servers, and tservers to provide enough capability in an efficient manner.

Based on this, it seems like UserFateStore should be included. I will create an issue for this

@EdColeman
Contributor Author

The tension between a counter and a gauge comes down to the intended usage.

Your interpretation seems to be that the by.type metric should be a snapshot of an instant in time. That is not necessarily incorrect.

Using a counter provides a different view. Because a counter is always increasing, you can reason about what happened between measurements. It is not the instantaneous value but the absolute value that is reported; the absolute value allows the delta and the rate of change over time to be determined. The counter answers how many X's have ever occurred.

For example, how many compactions have occurred over the last T? One way the counter is useful: say you look at an instantaneous snapshot and there are 0 compactions. No problem, that seems normal. However, if you looked over the last 24 hours and there were never any compactions, you might need to look into things.

The counter allows you to look at things like how many compactions happened over the last reporting period - has that increased? Counters also allow you to aggregate between reporting periods. With a snapshot you just have to hope that the activity happens to coincide with the measurement time.

A counter also provides a way to trend things. For example, you may want to trend bulk imports - is there a surge at certain times? Does it vary by day? A counter can answer that; a gauge cannot.

Using a combination of gauges (a snapshot of state) and counters (for trending and history) may be appropriate. Think how it would be displayed on a dashboard: the gauges are the state right now, and the counters could be graphs showing the activity over the last few hours.

So using a counter provides a view into activity over time, which is also not incorrect.
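
As a generic illustration of that counter/gauge distinction (a standalone Micrometer sketch, not Accumulo's actual FateMetrics code; the class, metric wiring, and variable names here are made up):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.atomic.AtomicLong;

public class CounterVsGaugeSketch {
  public static void main(String[] args) {
    MeterRegistry registry = new SimpleMeterRegistry();

    // Gauge: re-read on every publish, so it drops back to 0 when the work finishes.
    AtomicLong inProgress = new AtomicLong();
    Gauge.builder("accumulo.fate.ops.in.progress", inProgress, AtomicLong::get)
        .register(registry);

    // Counter: monotonically increasing; backends like Prometheus derive deltas
    // and rates from it, so it is never reset.
    Counter compactOps = Counter.builder("accumulo.fate.ops.in.progress.by.type")
        .tag("op.type", "TABLE_COMPACT").register(registry);

    inProgress.incrementAndGet(); // a FATE txn starts
    compactOps.increment();       // total TABLE_COMPACT ops observed goes up
    inProgress.decrementAndGet(); // txn finishes: the gauge returns to 0, the counter stays at 1
  }
}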

@kevinrr888
Contributor

Thank you for the detailed explanation. That makes sense.
It seemed like that could have been a potential issue, so I thought I would point it out. Glad that this is the expected behavior, and that the "issue" instead stemmed from my misunderstanding of how the metrics are used.

@kevinrr888
Contributor

A couple more things that I'm still curious about from my original comment:

So it appears that the values of 0 are correct; they are just not updated as frequently as they are logged, which makes the count of 0 appear more frequent than it actually is. Perhaps for this test we could increase the logging interval to 5 seconds, or decrease the update interval to 1 second, so it is clearer in the logs what is actually happening at what time.

Should any of the timings be adjusted for this test? It might be confusing for the log and update intervals to differ. This seemed to be the basis of the original issue - being unsure whether the FATE metric counts were correct.

One thing I noticed while looking at this: accumulo.fate.ops.in.progress is a sum of all current txns (failed_in_progress, in_progress, new, unknown, successful, failed, and submitted). The "in.progress" may be a bit misleading since the "in.progress" in this case is different from the state "in_progress". Maybe something like accumulo.fate.ops.total or accumulo.fate.ops.current.ops would be clearer? Or maybe this was intended to only include txns that are in_progress? This doesn't seem to apply to accumulo.fate.ops.in.progress.by.type, since that metric only includes those that are in_progress.

Should the naming for accumulo.fate.ops.in.progress be changed?

@EdColeman
Contributor Author

Should the naming for accumulo.fate.ops.in.progress be changed?

Maybe accumulo.fate.ops.in.progress.total if it is a counter?

@kevinrr888
Contributor

My concern was more with the "in.progress" part: accumulo.fate.ops.in.progress is a count of all the current transactions (failed_in_progress, in_progress, new, unknown, successful, failed, and submitted), so it seems misleading for it to be named "in.progress".

@EdColeman
Contributor Author

Then maybe in_progress would be better specified as a gauge? My comments earlier were generic. One convention that I have seen is that total is common for a unit-less accumulating count. Maybe we adopt info for instantaneous counts?

@kevinrr888
Contributor

So it appears that the values of 0 are correct; they are just not updated as frequently as they are logged, which makes the count of 0 appear more frequent than it actually is. Perhaps for this test we could increase the logging interval to 5 seconds, or decrease the update interval to 1 second, so it is clearer in the logs what is actually happening at what time.

If the above is fine/not an issue, this issue can be closed as completed, since everything else has been addressed.
