Fate reservations moved out of memory #4524

Open
wants to merge 1 commit into elasticity

Conversation

@kevinrr888 kevinrr888 commented May 3, 2024

closes #4131
Fate reservations were moved out of memory. This is a WIP PR* covering one part of what is needed for Fate to be distributed in the future.

  • Reservations for MetaFateStore were moved out of memory into ZooKeeper
  • Reservations for UserFateStore were moved out of memory into the Accumulo Fate table
    • This is an additional column which just indicates whether the FateId of that row is reserved or not (a rough sketch of this idea follows the description)
  • Added test MultipleStoresIT
    • Tests the new functionality of how reservations are stored
    • I also included a couple of tests in MultipleStoresIT that test the reserve/unreserve functionality and verify that it still works as intended. These can be moved out of this class if desired.
  • Verified that the existing tests in the fate test package (FateIT, FateOpsCommandsIT, FateStoreIT, ZooMutatorIT, FateMutatorImplIT, UserFateStoreIT) still pass

*There are still a few things I have marked as TODO that I would like input/suggestions on. These are marked in the code as "TODO 4131".
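
As a rough sketch (not the PR's actual code) of what the UserFateStore side of this could look like, reserving a FateId can be a conditional mutation that only succeeds when the reservation column is absent. The table name, column family/qualifier, and value below are placeholders for illustration:

import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

// Hypothetical sketch: the "tx"/"reservation" column and the table name are assumptions
// for illustration, not the actual columns this PR adds.
class ReservedFlagSketch {
  static boolean tryReserve(AccumuloClient client, String fateTable, String fateIdRow)
      throws Exception {
    ConditionalMutation cm = new ConditionalMutation(fateIdRow);
    // a Condition with no value requires the column to be absent for the mutation to apply
    cm.addCondition(new Condition("tx", "reservation"));
    cm.put("tx", "reservation", "reserved");
    try (ConditionalWriter writer =
        client.createConditionalWriter(fateTable, new ConditionalWriterConfig())) {
      // ACCEPTED: this caller reserved it; REJECTED: someone else holds the reservation;
      // UNKNOWN: ambiguous (see the review discussion below)
      return writer.write(cm).getStatus() == ConditionalWriter.Status.ACCEPTED;
    }
  }
}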

- Reservations for MetaFateStore were moved out of memory into ZooKeeper
- Reservations for UserFateStore were moved out of memory into the Accumulo Fate table
- Added test MultipleStoresIT
This commit is one part needed for having Fate be distributed
@kevinrr888
Contributor Author

Marking this as ready for review, since it may not be reviewed otherwise

@kevinrr888 kevinrr888 marked this pull request as ready for review May 7, 2024 15:28
@Override
public Optional<FateTxStore<T>> tryReserve(FateId fateId) {
  // TODO 4131 should this throw an exception if the id doesn't exist (status = UNKNOWN)?
  FateMutator.Status status = newMutator(fateId).putReservedTx(fateId).tryMutate();
Contributor

This code needs to be refactored to use a UUID similar to the ZooKeeper code. The UUID covers the problem of a write making it to the server, but the client not knowing the write made it. Both ZooKeeper and Accumulo can have this problem and both can have a similar solution. The following examples show how the UUID can help address this problem.

  1. FATE1 exists and is not currently reserved.
  2. Thread1 attempts to reserve FATE1 by writing isreserved only if the current value is notreserved.
  3. Thread2 attempts to reserve FATE1 by writing isreserved only if the current value is notreserved.
  4. One of the writes succeeds and the other fails. However, before the tablet server can respond to Thread1 or Thread2, it dies. This causes both threads to receive an unknown status.
  5. After getting the UNKNOWN status, Thread1 reads FATE1 and sees it marked reserved, however it does not know whether it reserved FATE1 or another thread did.
  6. After getting the UNKNOWN status, Thread2 reads FATE1 and sees it marked reserved, however it does not know whether it reserved FATE1 or another thread did.

In the above example, FATE1 is marked reserved but no thread will be working on it. If instead of setting a value of isreserved we set a per-thread UUID, then the above situation could look like the following (a minimal code sketch of this approach follows the steps).

  1. FATE1 exists and is not currently reserved. Its reserved column has a value of empty string.
  2. Thread1 wants to reserve FATE1 and generates UUID1 for this purpose
  3. Thread2 wants to reserve FATE1 and generates UUID2 for this purpose
  4. Thread1 attempts to reserve FATE1 by writing UUID1 as the value for the reserved column only if the current value is empty.
  5. Thread2 attempts to reserve FATE1 by writing UUID2 as the value for the reserved column only if the current value is empty.
  6. The write for Thread2 succeeds and the write for Thread1 fails because the column value is not empty. However, before the tablet server can respond to Thread1 or Thread2, it dies. This causes both threads to receive an unknown status.
  7. After getting the UNKNOWN status, Thread1 reads the FATE1 reserved column and sees UUID2, so it knows it did not reserve it.
  8. After getting the UNKNOWN status, Thread2 reads the FATE1 reserved column and sees UUID2, so it knows it did reserve it.
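
A minimal sketch of what that UUID-based reserve attempt could look like against the Accumulo fate table, including the read-back after an UNKNOWN status. As before, the table name and column names are placeholder assumptions, not the actual FateMutator code:

import java.util.Map.Entry;
import java.util.UUID;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// Hypothetical sketch: column names and the empty-string "not reserved" convention are
// assumptions used only to illustrate the scenario above.
class UuidReservationSketch {
  static boolean tryReserve(AccumuloClient client, String fateTable, String fateIdRow)
      throws Exception {
    String myUuid = UUID.randomUUID().toString();
    ConditionalMutation cm = new ConditionalMutation(fateIdRow);
    // write our UUID only if the reservation column currently holds the empty string
    cm.addCondition(new Condition("tx", "reservation").setValue(""));
    cm.put("tx", "reservation", myUuid);

    ConditionalWriter.Status status;
    try (ConditionalWriter writer =
        client.createConditionalWriter(fateTable, new ConditionalWriterConfig())) {
      status = writer.write(cm).getStatus();
    }

    if (status == ConditionalWriter.Status.ACCEPTED) {
      return true;
    }
    if (status == ConditionalWriter.Status.UNKNOWN) {
      // the server may have applied the write before we lost contact; read the column
      // back and see whether it holds our UUID or another thread's
      try (Scanner scanner = client.createScanner(fateTable, Authorizations.EMPTY)) {
        scanner.setRange(new Range(fateIdRow));
        scanner.fetchColumn(new Text("tx"), new Text("reservation"));
        for (Entry<Key,Value> entry : scanner) {
          return entry.getValue().toString().equals(myUuid);
        }
      }
    }
    return false;
  }
}

The key difference from a simple reserved/notreserved flag is that, after an UNKNOWN status, the value read back tells the caller whether it is the one that actually holds the reservation.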

In addition to the UUID mentioned above, the persisted reservation will need another field: a process lock id. The UUID mentioned above is scoped very narrowly; it is scoped to an individual reservation attempt and is used to handle the case of the server applying a write and then dying. The process lock id would be used to detect reservations that are held by a process that is no longer alive. The following is an example of how all of this works together.

  1. A manager process starts in Accumulo and acquires a lock in ZK, let's call it PL1.
  2. The manager process reserves FATE1, setting the reservation column to UUID1:PL1. UUID1 is scoped to this reservation thread/code block as mentioned above.
  3. After successfully reserving (it could have seen an unknown status and checked for UUID1), the manager process dies.
  4. A new manager process is started and acquires a lock in ZK, let's call it PL2.
  5. The new manager process starts FATE, which starts a periodic task to look for reservations held by dead processes. The periodic task sees FATE1 is reserved by PL1, which no longer has a lock in ZK and is therefore presumed dead.
  6. The periodic task issues a conditional mutation (or a conditional update for ZK) to delete the reserved value for FATE1 only if the current value is UUID1:PL1 (a rough sketch of this follows).
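
A sketch of what the conditional delete in step 6 could look like on the Accumulo side, reusing the placeholder column names and conditional writer classes from the sketches above and assuming the reservation value is stored as the UUID:LockID string:

// Hypothetical sketch: delete the reservation only if it still holds exactly the value
// that the presumed-dead process wrote, e.g. "UUID1:PL1".
static boolean deleteDeadReservation(AccumuloClient client, String fateTable,
    String fateIdRow, String deadReservationValue) throws Exception {
  ConditionalMutation cm = new ConditionalMutation(fateIdRow);
  cm.addCondition(new Condition("tx", "reservation").setValue(deadReservationValue));
  cm.putDelete("tx", "reservation");
  try (ConditionalWriter writer =
      client.createConditionalWriter(fateTable, new ConditionalWriterConfig())) {
    // REJECTED here just means the reservation changed since it was read, which is safe to ignore
    return writer.write(cm).getStatus() == ConditionalWriter.Status.ACCEPTED;
  }
}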

Taking all of this together, we may want to do the following.

  1. Create a new type that represents the reservation value, maybe like
          class FateReservation {
                 UUID reservationId;
                 ZooUtil.LockID processID; // TODO not sure if this is the best type for this
          }
  2. Use the above type internally in the User and Meta FateStores for their reservations
  3. Add methods to FateStore like Stream<FateReservation> getActiveReservations() and void deleteReservation(FateReservation)
  4. Add a periodic task to Fate that calls the two methods above, looks for active reservations held by dead processes, and then deletes them (see the sketch after this list).
  5. When creating a new FateStore instance, pass in the lock of the process where it is running. For now this would mean when the manager creates a new FateStore instance it will pass in its lock. This will be used to set the processID on reservations.
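
A minimal sketch of item 4, assuming the FateReservation type from item 1 and the two FateStore methods proposed in item 3 (none of which exist yet); the cleanup interval and the liveness check are placeholders:

import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;
import org.apache.accumulo.core.fate.zookeeper.ZooUtil;

// Hypothetical sketch of the periodic task that cleans up reservations held by dead processes.
class DeadReservationCleaner {

  // stand-in for the FateReservation type from item 1
  static class FateReservation {
    UUID reservationId;
    ZooUtil.LockID processID;
  }

  // stand-in for the FateStore additions proposed in item 3
  interface ReservationStore {
    Stream<FateReservation> getActiveReservations();
    void deleteReservation(FateReservation reservation);
  }

  private final ReservationStore store;
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  DeadReservationCleaner(ReservationStore store) {
    this.store = store;
  }

  void start() {
    // interval is arbitrary for the sketch
    executor.scheduleWithFixedDelay(this::cleanUp, 30, 30, TimeUnit.SECONDS);
  }

  private void cleanUp() {
    // delete reservations whose process lock is no longer held; deleteReservation() would
    // internally use a conditional mutation/update so it only removes the exact reservation
    // value that was observed (see the conditional delete sketch above)
    store.getActiveReservations()
        .filter(reservation -> !isProcessAlive(reservation.processID))
        .forEach(store::deleteReservation);
  }

  private boolean isProcessAlive(ZooUtil.LockID processLockId) {
    // placeholder: a real implementation would check ZooKeeper, e.g. via ServiceLock.isLockHeld
    // as mentioned in the comment below
    return false;
  }
}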

Contributor

The manager lock is here, and from that you can access ServiceLock.getLockID() and ServiceLock.isLockHeld(). So the LockId can be persisted w/ reservations, although I am not sure of the best way to do that. Then the periodic task that looks for reservations held by dead processes can call ServiceLock.isLockHeld.
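
For illustration, the liveness check used by that periodic task might boil down to something like the following; the package locations and the exact ServiceLock.isLockHeld signature are assumptions here and should be checked against the elasticity branch:

import org.apache.accumulo.core.fate.zookeeper.ZooCache;
import org.apache.accumulo.core.fate.zookeeper.ZooUtil;
import org.apache.accumulo.core.lock.ServiceLock;

// Hypothetical sketch: returns true if the process that made a reservation still holds its
// ZooKeeper lock; reservations whose process lock is gone can be cleaned up.
static boolean reservationProcessIsAlive(ZooCache zooCache, ZooUtil.LockID processLockId) {
  return ServiceLock.isLockHeld(zooCache, processLockId);
}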
