
Bug(go client): The cluster added a new meta node, but the meta server configuration of the go client was not updated. As a result, the client cannot find the new meta address and can only access the meta servers listed in its meta list #1880

Open
lengyuexuexuan opened this issue Jan 30, 2024 · 2 comments
Labels
type/bug This issue reports a bug.

Comments

@lengyuexuexuan
Collaborator

Assume the Pegasus client is configured with the meta server list "127.0.0.1:34602" and "127.0.0.1:34603", while the actual primary meta server of the cluster is "127.0.0.1:34601". In that case the Pegasus client cannot connect to the Pegasus server; it just retries until the operation times out.
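
For illustration, here is a minimal sketch of a client configured this way, using the go-client's pegasus package (the addresses are the ones from the scenario; the table name "temp" and the timeout are made up for the example):

// Sketch: a client whose meta list omits the actual primary (127.0.0.1:34601).
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/apache/incubator-pegasus/go-client/pegasus"
)

func main() {
    cfg := pegasus.Config{
        MetaServers: []string{"127.0.0.1:34602", "127.0.0.1:34603"},
    }
    client := pegasus.NewClient(cfg)
    defer client.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    // OpenTable keeps querying the configured metas; since neither is the
    // primary, the call fails only after the timeout expires.
    if _, err := client.OpenTable(ctx, "temp"); err != nil {
        fmt.Println("open table failed:", err)
    }
}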

The reason is that when the go client looks for the primary meta server, it iterates over its configured meta server list, sends an RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX RPC to each meta server, and decides which one is the leader from the responses.

Unlike the Java client, the go client cannot follow the forward address returned by a non-primary meta server, so it never adds meta servers that are missing from its configuration. Below is the relevant logic in the go client.

// go-client/session/meta_call.go
func (c *metaCall) issueBackupMetas(ctx context.Context) {
    for i := range c.metas {
        if i == c.lead {
            continue
        }
        // Concurrently issue the RPC to the rest of the meta servers.
        go func(idx int) {
            c.issueSingleMeta(ctx, idx)
        }(i)
    }
}

// issueSingleMeta returns false if we should try another meta
func (c *metaCall) issueSingleMeta(ctx context.Context, i int) bool {
    meta := c.metas[i]
    resp, err := c.callFunc(ctx, meta)
    if err != nil || resp.GetErr().Errno == base.ERR_FORWARD_TO_OTHERS.String() {
        // On ERR_FORWARD_TO_OTHERS the client merely gives up on this meta.
        // It never learns the leader's address, so a leader that is absent
        // from c.metas is unreachable.
        return false
    }
    // the RPC succeeds, this meta becomes the new leader now.
    atomic.StoreUint32(&c.newLead, uint32(i))
    select {
    case <-ctx.Done():
    case c.respCh <- resp:
        // notify the caller
    }
    return true
}

For comparison, here is the corresponding logic in the Java client, which does add the forwarded meta server:

// com/xiaomi/infra/pegasus/rpc/async/MetaSession.java  onFinishQueryMeta()

synchronized (this) {
  if (needSwitchLeader) {
    if (forwardAddress != null && !forwardAddress.isInvalid()) {
      boolean found = false;
      for (int i = 0; i < metaList.size(); i++) {
        if (metaList.get(i).getAddress().equals(forwardAddress)) {
          curLeader = i;
          found = true;
          break;
        }
      }
      if (!found) {
        logger.info("add forward address {} as meta server", forwardAddress);
        metaList.add(clusterManager.getReplicaSession(forwardAddress));
        curLeader = metaList.size() - 1;
      }
    } else if (metaList.get(curLeader) == round.lastSession) {
      curLeader = (curLeader + 1) % metaList.size();
      if (curLeader == 0 && hostPort != null && round.maxResolveCount != 0) {
        resolveHost(hostPort);
        round.maxResolveCount--;
        round.maxExecuteCount = metaList.size();
      }
    }
  }
  round.lastSession = metaList.get(curLeader);
}
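
A possible direction for a fix, mirroring the Java logic above, is sketched below. It assumes the forwarded leader's address can be extracted from the query-config response that carries ERR_FORWARD_TO_OTHERS; forwardAddrFromResp and appendMeta are hypothetical helpers, not existing go-client APIs:

// go-client/session/meta_call.go (sketch only, not the actual implementation)
func (c *metaCall) issueSingleMeta(ctx context.Context, i int) bool {
    meta := c.metas[i]
    resp, err := c.callFunc(ctx, meta)
    if err != nil {
        return false
    }
    if resp.GetErr().Errno == base.ERR_FORWARD_TO_OTHERS.String() {
        // Hypothetical helper: read the forwarded leader address out of
        // the response, if the meta server provided one.
        addr, ok := forwardAddrFromResp(resp)
        if !ok {
            return false
        }
        // Hypothetical helper: add a session for the forwarded address to
        // c.metas, like metaList.add(...) in the Java client, then retry
        // against the newly added meta.
        idx := c.appendMeta(addr)
        return c.issueSingleMeta(ctx, idx)
    }
    // The RPC succeeded; this meta becomes the new leader.
    atomic.StoreUint32(&c.newLead, uint32(i))
    select {
    case <-ctx.Done():
    case c.respCh <- resp:
    }
    return true
}

A real patch would also need to bound the number of forward hops and to guard c.metas with a lock, since issueBackupMetas reads it from multiple goroutines.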

In summary

The primary impact of this issue: in a production cluster, after a new meta server is added and later becomes the primary, users who have not updated their client configuration are unable to connect to the cluster.

lengyuexuexuan added the type/bug label Jan 30, 2024
@acelyc111
Member

acelyc111 commented Jan 30, 2024

Good point, could you please submit a patch to fix it?

@lengyuexuexuan
Collaborator Author

Good point, could you please submit a patch to fix it?

OK, got it.
