
Stop tracking last compactor check-in for non-existent groups #4403

Open

wants to merge 10 commits into base: elasticity
Conversation

dlmarion
Contributor

No description provided.

@dlmarion dlmarion self-assigned this Mar 19, 2024
@dlmarion dlmarion changed the title Stop tracking last compactor check-in for non-existant groups Stop tracking last compactor check-in for non-existent groups Mar 19, 2024
@keith-turner
Contributor

Seems like there is an existing catch-22 in the code that we may want to fix. The problem is that if a compactor has never checked in for a group, then there will be no entry in TIME_COMPACTOR_LAST_CHECKED and no warning will ever be logged. So groups that have jobs queued but are never being serviced will never get a warning.

Wondering if we can fix the current code and do the needed cleanup of TIME_COMPACTOR_LAST_CHECKED in the existing idleCompactionWarning() method. I was playing around with the following to work through this; it calls functions that do not currently exist in the code, it's just trying to work out the concept.

private void idleCompactionWarning() {

    // The set of groups that currently have > 0 jobs queued
    Set<CompactorGroupId> queuedGroups = jobQueues.getCurrentActiveGroups();

    // The set of groups that have > 0 live compactors w/ a ZK lock, i.e. groups that have
    // running compactors.
    Set<CompactorGroupId> liveGroups = getLiveGroups();

    long now = System.currentTimeMillis();
    for (var groupName : queuedGroups) {
      // If the group has running compactors then we do not want to log a warning. Looking for
      // groups that have jobs queued, no compactors running, and where the situation has been
      // like this for a while.
      if (!liveGroups.contains(groupName)) {
        long lastCheckTime = TIME_COMPACTOR_LAST_CHECKED.getOrDefault(groupName, 0L);
        if ((now - lastCheckTime) > getMissingCompactorWarningTime()) {
          LOG.warn("No compactors have checked in with coordinator for group {} in {}ms",
              groupName, getMissingCompactorWarningTime());
        }
      }
    }

    // Only care about tracking check-in times for groups with compactions currently queued
    TIME_COMPACTOR_LAST_CHECKED.keySet().retainAll(queuedGroups);
  }
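
For reference, a rough sketch of the hypothetical getLiveGroups() helper (one of the functions that does not exist yet), assuming the set of live groups can be built from the compactor addresses registered in ZooKeeper via ExternalCompactionUtil.getCompactorAddrs(ctx):

  // Hypothetical helper, not in the current code: the groups that currently have at least one
  // compactor process registered in ZooKeeper.
  private Set<CompactorGroupId> getLiveGroups() {
    Set<CompactorGroupId> liveGroups = new HashSet<>();
    ExternalCompactionUtil.getCompactorAddrs(ctx).keySet()
        .forEach(group -> liveGroups.add(CompactorGroupId.of(group)));
    return liveGroups;
  }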

@dlmarion
Contributor Author

We have the following maintenance tasks running in the Coordinator:

DeadCompactionDetector - fails compactions that the Coordinator thinks are running, but are not
cleanUpCompactors - removes nodes in ZooKeeper for empty groups and compactors
cleanUpRunning - removes compactions from RUNNING_CACHE that are not actually running
Idle compaction warning - warns when no compactors for a group have checked in
Group check thread - removes non-existent groups from TIME_COMPACTOR_LAST_CHECKED

I'm thinking that we might be able to combine the last 4 into one method.

@dlmarion
Contributor Author

I'm thinking that we might be able to combine the last 4 into one method.

Something like:

  private void cleaner() {

    // This method does the following:
    //
    // 1. Removes entries from RUNNING_CACHE that are not really running
    // 2. Cancels running compactions for groups that are not in the current configuration
    // 3. Removes groups not in the configuration from TIME_COMPACTOR_LAST_CHECKED
    // 4. Logs groups with no compactors
    // 5. Logs compactors with no groups
    // 6. Logs groups with compactors that have not checked in

    // ** BEGIN contents of cleanUpRunning ** //

    // Grab a snapshot of the ids in the set before reading the metadata table. This is done to
    // avoid removing things that are added while reading the metadata.
    Set<ExternalCompactionId> idsSnapshot = Set.copyOf(RUNNING_CACHE.keySet());

    // Grab the ids that are listed as running in the metadata table. It is important that this
    // is done after getting the snapshot.
    Set<ExternalCompactionId> idsInMetadata = readExternalCompactionIds();

    var idsToRemove = Sets.difference(idsSnapshot, idsInMetadata);

    // Remove ids that are in the running set but not in the metadata table
    idsToRemove.forEach(this::recordCompletion);

    if (!idsToRemove.isEmpty()) {
      LOG.debug("Removed stale entries from RUNNING_CACHE : {}", idsToRemove);
    }

    // ** END contents of cleanUpRunning ** //

    // Get the set of groups referenced in the current configuration
    // Needs Dan's changes for this
    Set<CompactorGroupId> groupsInConfiguration = new HashSet<>();

    // Compaction jobs are created in the TabletGroupWatcher and added to the Coordinator
    // via the addJobs method, which adds the job to the CompactionJobQueues object.
    Set<CompactorGroupId> groupsWithJobs = jobQueues.getQueueIds();

    Set<CompactorGroupId> jobGroupsNotInConfiguration =
        Sets.difference(groupsWithJobs, groupsInConfiguration);

    if (!jobGroupsNotInConfiguration.isEmpty()) {
      RUNNING_CACHE.values().forEach(rc -> {
        if (jobGroupsNotInConfiguration.contains(CompactorGroupId.of(rc.getGroupName()))) {
          LOG.warn(
              "External compaction {} running in group {} on compactor {},"
                  + " but group not found in current configuration. Failing compaction...",
              rc.getJob().getExternalCompactionId(), rc.getGroupName(), rc.getCompactorAddress());
          cancelCompactionOnCompactor(rc.getCompactorAddress(),
              rc.getJob().getExternalCompactionId());
        }
      });

      // Remove groups not in the configuration from TIME_COMPACTOR_LAST_CHECKED
      LOG.debug("No longer tracking compactor check-in times for groups: {}",
          jobGroupsNotInConfiguration);
      jobGroupsNotInConfiguration.forEach(TIME_COMPACTOR_LAST_CHECKED::remove);
    }

    // The set of groups that currently have at least one live compactor process
    final Set<CompactorGroupId> runningCompactorGroups = new HashSet<>();
    ExternalCompactionUtil.getCompactorAddrs(ctx).keySet()
        .forEach(group -> runningCompactorGroups.add(CompactorGroupId.of(group)));

    Set<CompactorGroupId> groupsWithNoCompactors =
        Sets.difference(groupsInConfiguration, runningCompactorGroups);
    if (!groupsWithNoCompactors.isEmpty()) {
      LOG.warn("The following groups have no running compactors: {}", groupsWithNoCompactors);
    }

    Set<CompactorGroupId> compactorsWithNoGroups =
        Sets.difference(runningCompactorGroups, groupsInConfiguration);
    if (!compactorsWithNoGroups.isEmpty()) {
      LOG.warn(
          "The following groups have running compactors, but are not in the current configuration: {}",
          compactorsWithNoGroups);
    }

    // ** BEGIN IdleCompactionWarning code ** //
    long now = System.currentTimeMillis();
    Map<String,Set<HostAndPort>> idleCompactors = getIdleCompactors();
    TIME_COMPACTOR_LAST_CHECKED.forEach((groupName, lastCheckTime) -> {
      if ((now - lastCheckTime) > getMissingCompactorWarningTime()
          && jobQueues.getQueuedJobs(groupName) > 0
          && idleCompactors.containsKey(groupName.canonical())) {
        LOG.warn("No compactors have checked in with coordinator for group {} in {}ms", groupName,
            getMissingCompactorWarningTime());
      }
    });
    // ** END IdleCompactionWarning code ** //
  }
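
The combined method could then be scheduled as a single periodic task. A rough sketch, assuming a plain ScheduledExecutorService and a hypothetical interval parameter rather than whatever scheduling utility the Coordinator currently uses:

  // Rough sketch only: run cleaner() on one periodic schedule in place of the separate
  // schedules for the tasks it absorbs. The method name and interval are hypothetical.
  private void startCleanerTask(ScheduledExecutorService executor, long intervalMillis) {
    executor.scheduleWithFixedDelay(() -> {
      try {
        cleaner();
      } catch (RuntimeException e) {
        // Log and keep the task alive instead of letting the schedule die silently
        LOG.error("Error running coordinator cleaner task", e);
      }
    }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }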

Contributor

@keith-turner keith-turner left a comment


Consolidating some of the periodic tasks is nice.

@dlmarion
Contributor Author

dlmarion commented Apr 8, 2024

I believe this is complete; I'm just waiting on some changes from @ddanielr related to the compaction configuration.
