AtxBuilder should validate initial PoST loaded from DB #5921

poszu · 2024-05-09T08:11:27Z

Description

When ATX Builder starts, it first determines whether it needs to build an initial PoST proof. It doesn't need to when the smesher already has previous ATXs or the proof is found in the local database. However, in the latter case (PoST in local DB), it could be that the proof is not valid anymore for the current configuration. It could happen if the user increases Space Units.

Steps to reproduce

initialize X SU
wait for the initial PoST to be generated
increase to X+N SU
the initial PoST proof in the DB is no longer valid

Actual Behavior

Atx Builder trusts the initial PoST found in the local DB. It later repeatedly fails to publish an ATX because of invalid initial PoST proof. It cannot break out of the cycle on its own, the only way to get out is to manually remove the local.sql DB.

Expected Behavior

Atx Builder should verify the initial PoST found in the local DB and recreate it if it's not valid anymore.

Affected code

go-spacemesh/activation/activation.go

Lines 329 to 344 in ebaf0a0

    
           func (b *Builder) buildInitialPost(ctx context.Context, nodeID types.NodeID) error { 
        
           	// Generate the initial POST if we don't have an ATX... 
        
           	if _, err := atxs.GetLastIDByNodeID(b.db, nodeID); err == nil { 
        
           		return nil 
        
           	} 
        
           	// ...and if we haven't stored an initial post yet. 
        
           	_, err := nipost.GetPost(b.localDB, nodeID) 
        
           	switch { 
        
           	case err == nil: 
        
           		b.logger.Info("load initial post from db") 
        
           		return nil 
        
           	case errors.Is(err, sql.ErrNotFound): 
        
           		b.logger.Info("creating initial post") 
        
           	default: 
        
           		return fmt.Errorf("get initial post: %w", err) 
        
           	}

acud · 2024-06-05T22:44:10Z

@poszu there are several issues here:

the node was started with different configurations, and now needs to reinitialize the storage and redo the post
the user changed the data over the disk by using the initialization tool and now the data doesn't match on the disk and the proof can never be made either (the initial post needs to be made again). For this case I'm not sure we can do anything, because it involves verifying that the labels are correct (regenerate them) for the nodes involved in the proof.

My approach to handling #1 was to try to use the prover to run the existing proof that we find in the DB with the parameters of smeshing-opts-numunits we get in the CLI. That should at least cover the case where there's a mismatch in the storage units. Just the only things is that I just can't seem to find in any relevant place in the activation package the usages of this value. It is not clear whether the grpc node on the other end (which handles communicating with the rust library) has this configuration somehow implicitly. It is also not clear why this very specific parameter isn't passed on the actual protobuf message which asks for the initial post to be generated (it only specifies the identity and the challenge).

acud · 2024-06-06T18:57:03Z

Updating after discussing further with @poszu, that this approach may work fine for cases where the spacemesh node is run in standalone mode, but won't in cases where it manages more identities.

In that case, my suggestion would be the following:

that configuration for every identity (what is now managed by the rust process in the postdata_metadata.json file) should be moved to the go node's responsibility
that the rpc method of get_metadata should be inverted - that the rust process should ask the go process for that metadata, not the other way around (my understanding is that the rust process is not expected to be running always, but only when the PoST window comes into view). the rust process would do so when it starts, and keep that in-mem for it's entire lifecycle (as it does now with the data in the file).
once we agree to these changes, we can then add a simple post verification call in the go node's startup sequence that checks whether the proof we have matches up with the number of SUs a certain identity has configured

Since this is infers a bunch of changes to CLI flags in both the rust application and the go application, we need to plan on how to phase this and allow users time to migrate.
The other issue is that it may affect the initialization process as it is generated there.
I see that the node ID is included in the file, however, there's absolutely no assurance that the location of the metadata file is actually accessible from the go node. Therefore, it may have implications for node operators.

EDIT: the other alternative is to have reliance that the post service is always available from the go node on startup. This will save on all these changes (but might complicate the startup sequence a bit)

fasmat · 2024-06-07T14:56:48Z

I went through the code and think these are the steps required to fix the core of the issue:

Update the nipostBuilder interface:

type nipostBuilder interface {
   BuildNIPost(
 	  ctx context.Context,
 	  sig *signing.EdSigner,
 	  challenge *types.NIPostChallenge,
 	  challengeHash types.Hash32,
   ) (*nipost.NIPostState, error)
   Proof(ctx context.Context, nodeID types.NodeID, challenge []byte, initialPost *types.Post) (*types.Post, *types.PostInfo, error)
   ResetState(types.NodeID) error
}

We need to pass the full challenge to the builder instead of just its hash. It contains the publish epoch and the (optional) initial post that needs to be checked if present.

The Proof method also needs to be adjusted to receive the (optional) initial PoST that needs to be verified to still be valid when generating a proof.

In the nipostBuilder in the Proof method the initial proof (if not nil) needs to be verified with the post info from the connected post-service before the new proof is generated:

 	retries = 0
 	if initialPost != nil {
 		info, err := client.Info(ctx)
 		if err != nil {
 			events.EmitPostFailure(nodeID)
 			return nil, nil, fmt.Errorf("failed to get post info: %w", err)
 		}
 		// verify that the initial post is valid with the given `info` parameters
 		// if not return an `ErrInvalidInitialPost`
 	}
 	post, postInfo, err := client.Proof(ctx, challenge)

BuildNipost in nipost.go needs to be updated to pass the initial post from the challenge to the Proof method - it is not nil when this is the first ATX and is nil otherwise.
Other callers of Proof just pass nil for the initial post
createAtx needs to be updated to pass the full challange and its hash instead of just its hash and the publish epoch
The run method in activation.go needs to check for the new error returned by nipostbuilder.Proof and do the same as in the case of ErrATXChallengeExpired plus building a new initial post by removing it from the local database and calling buildInitialPost again.
Update existing tests to the new interfaces, add new (unit) tests for the updated behavior.

poszu added bug area/atx labels May 9, 2024

pigmej mentioned this issue Jun 3, 2024

Smesher UX improvements spacemeshos/pm#328

Open

10 tasks

acud self-assigned this Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AtxBuilder should validate initial PoST loaded from DB #5921

AtxBuilder should validate initial PoST loaded from DB #5921

poszu commented May 9, 2024 •

edited

acud commented Jun 5, 2024

acud commented Jun 6, 2024 •

edited

fasmat commented Jun 7, 2024 •

edited

AtxBuilder should validate initial PoST loaded from DB #5921

AtxBuilder should validate initial PoST loaded from DB #5921

Comments

poszu commented May 9, 2024 • edited

Description

Steps to reproduce

Actual Behavior

Expected Behavior

Affected code

acud commented Jun 5, 2024

acud commented Jun 6, 2024 • edited

fasmat commented Jun 7, 2024 • edited

poszu commented May 9, 2024 •

edited

acud commented Jun 6, 2024 •

edited

fasmat commented Jun 7, 2024 •

edited