Issue 6581: LTS - Throttle SLTS on slow or unstable LTS #7300

Open · wants to merge 20 commits into base: issue-6581-LTS-throttle
Conversation

@sachin-j-joshi (Contributor) commented Oct 3, 2023

[PR against feature branch]
Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>

Change log description
LTS - Throttle major SLTS activities on slow or unavailable LTS.

Purpose of the change
Fixes #6581

What the code does
A new StorageHealthTracker tracks the health of ChunkedSegmentStorage and reports the storage health status periodically. Based on the reported health, throttling is applied as described below (a sketch of this periodic reporting follows).
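
The periodic reporting can be pictured as a scheduled task that snapshots per-iteration request counters. The sketch below is only an illustration of that idea; every name in it (PeriodicHealthReportSketch, recordRequest, the one-minute interval) is hypothetical and not taken from the PR's actual StorageHealthTracker.

```java
// Sketch of "report storage health periodically": a scheduled task snapshots
// per-iteration request counters. Names and the interval are assumptions.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class PeriodicHealthReportSketch {
    private final AtomicLong totalRequests = new AtomicLong();
    private final AtomicLong lateRequests = new AtomicLong();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Evaluate health once per iteration (interval chosen arbitrarily here).
        scheduler.scheduleAtFixedRate(this::reportIteration, 1, 1, TimeUnit.MINUTES);
    }

    /** Called by the storage layer after each request completes. */
    void recordRequest(boolean wasLate) {
        totalRequests.incrementAndGet();
        if (wasLate) {
            lateRequests.incrementAndGet();
        }
    }

    private void reportIteration() {
        // Snapshot and reset the counters for the next iteration.
        long total = totalRequests.getAndSet(0);
        long late = lateRequests.getAndSet(0);
        double percentLate = total == 0 ? 0.0 : 100.0 * late / total;
        // The percentage of late requests drives the SLOW/DEGRADED decision for the next iteration.
        System.out.printf("Storage health iteration: %.1f%% late requests%n", percentLate);
    }
}
```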

How the throttle is calculated

  • The existing setting for request lateness is used: any relevant request that takes longer than the configured value is considered late. This value should be chosen as a small multiple of the expected T99 latency for the given storage (default is 100ms).
  • SLOW - when the percentage of late requests in the current iteration is above the minimum throttle threshold, throttling is applied on the next iteration, and each updating request is delayed by a duration proportional to the percentage of late requests.
  • DEGRADED - when the percentage of late requests is above the maximum threshold, ALL update requests are rejected with StorageUnavailableException.
  • UNAVAILABLE - when the underlying storage itself indicates that it is too busy or unavailable (e.g., HTTP 503).
  • In addition, in response to StorageUnavailableException, the StorageHealthTracker throttles all requests until there is at least one successful request. This throttle duration is linearly proportional to the number of iterations during which storage was unavailable. As soon as there is a successful non-late request, this throttle becomes inactive. (A sketch of this classification and the delay calculation is shown after this list.)
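
A minimal sketch of the classification and delay calculation described above. All names and thresholds here (ThrottleCalculatorSketch, minThrottlePercent, maxThrottlePercent, maxThrottleDelay, and the concrete numbers) are hypothetical illustrations, not the PR's actual StorageHealthTracker API.

```java
// Minimal sketch only; names and thresholds are hypothetical.
import java.time.Duration;

public class ThrottleCalculatorSketch {
    // Assumed configuration values, loosely based on the description above.
    private final Duration lateLatencyThreshold = Duration.ofMillis(100); // the "lateness" setting
    private final double minThrottlePercent = 10.0;  // below this, no throttling
    private final double maxThrottlePercent = 80.0;  // above this, reject update requests
    private final Duration maxThrottleDelay = Duration.ofSeconds(1);

    enum Health { HEALTHY, SLOW, DEGRADED, UNAVAILABLE }

    /** Classify the last iteration from the observed percentage of late requests. */
    Health classify(double percentLate, boolean storageReportedUnavailable) {
        if (storageReportedUnavailable) {   // e.g., HTTP 503 from the backend
            return Health.UNAVAILABLE;
        }
        if (percentLate >= maxThrottlePercent) {
            return Health.DEGRADED;         // reject updates with a StorageUnavailableException-like error
        }
        if (percentLate >= minThrottlePercent) {
            return Health.SLOW;             // delay updates proportionally
        }
        return Health.HEALTHY;
    }

    /** Delay applied to each update request in the next iteration when SLOW. */
    Duration throttleDelay(double percentLate) {
        double fraction = Math.min(1.0, percentLate / 100.0);
        return Duration.ofMillis((long) (maxThrottleDelay.toMillis() * fraction));
    }

    /** Back-off while UNAVAILABLE: linear in the number of consecutive unavailable iterations. */
    Duration unavailableDelay(int unavailableIterations) {
        return Duration.ofMillis(maxThrottleDelay.toMillis() * unavailableIterations);
    }
}
```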

When LTS is slow or its performance is degraded:

  • SLTS operations are throttled
  • GC activity is throttled

When LTS is unavailable:

  • Any SLTS request is rejected
  • GC activity is throttled

Only the following modifying operations are throttled (a sketch of how an update path could apply the throttle follows this list):

  • write
  • truncate
  • concat
  • entire GC batch
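
For illustration, an update path (write/truncate/concat) could consult the tracker as shown below. The HealthTracker interface, its method names, and the use of a plain IllegalStateException as a stand-in for StorageUnavailableException are assumptions made for this sketch, not the PR's implementation.

```java
// Illustrative only: gating an update operation on the tracked storage health.
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class ThrottledWriteSketch {
    interface HealthTracker {
        boolean isStorageUnavailable();   // DEGRADED or UNAVAILABLE
        Duration currentThrottleDelay();  // zero when HEALTHY
    }

    private final HealthTracker healthTracker;

    ThrottledWriteSketch(HealthTracker healthTracker) {
        this.healthTracker = healthTracker;
    }

    CompletableFuture<Void> write(Runnable actualWrite) {
        if (healthTracker.isStorageUnavailable()) {
            // Reject immediately instead of queueing more work onto a failing store.
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Storage unavailable (stand-in for StorageUnavailableException)"));
        }
        Duration delay = healthTracker.currentThrottleDelay();
        // Defer the operation by the computed throttle delay, then perform it.
        return CompletableFuture.runAsync(actualWrite,
                CompletableFuture.delayedExecutor(delay.toMillis(), TimeUnit.MILLISECONDS));
    }
}
```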

How to verify it
All tests should pass

Commits (messages truncated by the page; each commit is signed off by Sachin Joshi <sachin.joshi@emc.com>):

  • … part 1
  • … Simplify logic, remove TimerUtils.java
  • … Add mock tests.
  • … Fix doc comments.
  • … Fix doc comments.
  • …orageHealthTracker.
  • …duled task.
  • …e-6581-LTS-throttle-storage-writer
    # Conflicts:
    #	segmentstore/storage/src/main/java/io/pravega/segmentstore/storage/chunklayer/ChunkedSegmentStorage.java
    #	segmentstore/storage/src/main/java/io/pravega/segmentstore/storage/chunklayer/SystemJournal.java
    #	segmentstore/storage/src/test/java/io/pravega/segmentstore/storage/chunklayer/ChunkedSegmentStorageTests.java
  • …ub.com/sachin-j-joshi/pravega into issue-6581-LTS-throttle-storage-writer-for-merge
    # Conflicts:
    #	segmentstore/server/src/main/java/io/pravega/segmentstore/server/containers/StreamSegmentContainer.java
    #	segmentstore/storage/src/main/java/io/pravega/segmentstore/storage/chunklayer/ChunkedSegmentStorage.java
    #	segmentstore/storage/src/main/java/io/pravega/segmentstore/storage/chunklayer/ChunkedSegmentStorageConfig.java
codecov bot commented Oct 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (issue-6581-LTS-throttle@3c72c17).

Additional details and impacted files
@@                    Coverage Diff                     @@
##             issue-6581-LTS-throttle    #7300   +/-   ##
==========================================================
  Coverage                           ?   86.00%           
  Complexity                         ?    16385           
==========================================================
  Files                              ?     1045           
  Lines                              ?    61187           
  Branches                           ?     6224           
==========================================================
  Hits                               ?    52626           
  Misses                             ?     5354           
  Partials                           ?     3207           


@sachin-j-joshi sachin-j-joshi marked this pull request as ready for review October 3, 2023 20:46
@RaulGracia RaulGracia self-requested a review October 4, 2023 14:13
@RaulGracia (Contributor) commented:

Thanks @sachin-j-joshi, one question before getting into the review: has this been tested with performance tests to check for potential regressions and/or the behavior of the system when LTS unavailability is induced on purpose?

@sachin-j-joshi (Contributor, Author) commented Oct 6, 2023

@RaulGracia We do plan to validate this change with perf and load regression tests. (That is the reason this PR is against the feature branch, so that we can continue to improve it before merging to master.)

Signed-off-by: Sachin Joshi <sachin.joshi@emc.com>
@sachin-j-joshi force-pushed the issue-6581-LTS-throttle-storage-writer-for-merge branch from 366138f to d37b635 on November 16, 2023 19:10
@tkaitchuck (Member) left a comment:

100ms is far too low for most of the backends we support (because 100ms is an expected p50 for many LTSs). Rather than making this a fixed configuration, can we make it adaptive by tracking the latency of successful requests? Or, failing that, could we make the default value depend on which backend store is being used, so that we can set reasonable defaults?
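
One way to make the threshold adaptive, along the lines suggested here, is to derive the lateness cutoff from a moving average of successful request latencies. The sketch below is only an illustration of that idea (the class name, smoothing factor, and multiplier are all assumptions), not code from this PR.

```java
// Illustrative sketch of an adaptive lateness threshold based on an exponentially
// weighted moving average (EWMA) of successful request latencies. Not part of this PR.
import java.time.Duration;

class AdaptiveLatenessSketch {
    private static final double ALPHA = 0.1;            // EWMA smoothing factor
    private static final double LATE_MULTIPLIER = 4.0;  // "small multiple" of typical latency
    private volatile double ewmaMillis = 100.0;         // seeded with the current default

    /** Update the estimate from a successful request's observed latency. */
    void recordSuccess(Duration latency) {
        ewmaMillis = ALPHA * latency.toMillis() + (1 - ALPHA) * ewmaMillis;
    }

    /** A request slower than this is considered late. */
    Duration latenessThreshold() {
        return Duration.ofMillis((long) (LATE_MULTIPLIER * ewmaMillis));
    }
}
```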
