Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#2871 disconnect from bad peers in consensus #9417

Draft
wants to merge 303 commits into
base: main
Choose a base branch
from

Conversation

ebuchman
Copy link
Contributor

Addresses #2871. Still a WIP. Wanted to get some feedback on how the ConsensusState should communicate bad peers to the ConsensusReactor. I see basically three main options:

  1. give ConS a direct reference to ConR or Switch or StopPeer func
  2. introduce a new StopPeer event for the reactor to listen on
  3. use the statsMsgQueue or a new routine/channel

(1) is prob easiest but involves more coupling maybe than we want. (2) would require a new event and is maybe excessive? (3) has two options - using the statsMsgQueue is straightforward but would require us implementing a "fake" StopPeer msg type to send on it, which is easy and simple enough and uses existing communication channel between ConS and ConR, but a bit of abuse of the Message interface. A new go-routine/channel is also easy but it's just another layer in our concurrency pie. I would probably lean towards using the statsMsgQueue.

Note I didn't add the errors from addProposalBlockPart because I wasn't entirely sure if they can happen accidentally (I recall maybe some recent reports where they did?)

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed

cmwaters and others added 30 commits March 10, 2020 16:15
* codeowners: add code owners

- added some codeowners
please comment if youd like to be added as well.

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* remove comment of repo maintainers
Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>
* Added RFC for truncated block history coordination

* Clarified minimum block retention

* Added hard checks on block retention and snapshot interval, and made some minor tweaks

* Genesis parameters are immutable

* Use local config for snapshot interval

* Reordered parameter descriptions

* Clarified local config option for snapshot-interval

* rewrite for ABCI commit response

* Renamed RFC

* add block retention diagram

* Removed retain_blocks table

* fix image numbers

* resolved open questions

* image quality
Documentation for block pruning, once it's merged: #4588.

Minimum documentation, for now - we probably shouldn't encourage using this feature too much until we release state sync.
The corresponding Tendermint PRs are #4704 and #4705.
* abci: specify sorting of RequestInitChain.Validators

* blockchain: change validator sorting method

Refs #2478
* Revert "Revert "document state sync ABCI interface and P2P protocol (#90)" (#92)"

This reverts commit 90797ce.

* update with new enum case

* fix links

Co-authored-by: Erik Grinaker <erik@interchain.berlin>
evidence params now includes maxNum which is the maximum number of evidence that can be committed on a single block
now evidence reflects the actual evidence present in the tendermint repo
Co-authored-by: Igor Konnov <igor.konnov@gmail.com>
spec/consensus: canonical vs subjective commit
@ebuchman
Copy link
Contributor Author

ebuchman commented Sep 14, 2022

Cool I think I got this working and tested!

Will keep as a draft for now as there may be more to think through to fully close the issue and we may want a more complete approach to testing

@ebuchman ebuchman marked this pull request as ready for review September 14, 2022 16:13
@ebuchman ebuchman requested a review from a team September 14, 2022 16:13
@ebuchman ebuchman changed the title WIP: #2871 disconnect from bad peers in consensus #2871 disconnect from bad peers in consensus Sep 14, 2022
@ebuchman ebuchman marked this pull request as draft September 14, 2022 16:23
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Sep 25, 2022
@jmalicevic jmalicevic removed the stale for use by stalebot label Sep 26, 2022
@cason cason added the S:wip Work in progress (prevents stalebot from automatically closing) label Sep 28, 2022
}
// XXX: does this block? should we run this in its own go-routine
// to be safe?
conR.Switch.StopPeerForError(peer, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function itself is not blocking (no waiting on channels etc.) . There are locks being taken and connections being closed but that should not impact things too much.

@@ -1922,6 +1950,7 @@ func (cs *State) addProposalBlockPart(msg *BlockPartMessage, peerID p2p.ID) (add

added, err = cs.ProposalBlockParts.AddPart(part)
if err != nil {
// TODO: can we punish for these or is there a reason they might accidentaly happen?
if errors.Is(err, types.ErrPartSetInvalidProof) || errors.Is(err, types.ErrPartSetUnexpectedIndex) {
cs.metrics.BlockGossipPartsReceived.With("matches_current", "false").Add(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen during catchup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find disconnecting from peers here a risky approach. A legit peer can send us parts of a block that is not the block we are assembling. This happens because a peer that does not know which block we are looking for will assume that we are looking for the same block as its consensus layer is building. In rounds with multiple proposals, this will lead a legit peer to send us parts of a block that is not the block we are looking for.

In more detail, a node sets the ProposalBlockPartSetHeader for a peer the first time one of the following happen:

  1. It receives a Proposal from that peer, regardless whether it accepts the Proposal
  2. It sends a Proposal to that peer, regardless whether this Proposal is accepted by the peer
  3. The peer is at a previous height, in this case we assume that the peer is looking for the same block we have decided in that height

Once set, this field is only updated upon a NewValidBlockMessage from the peer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen during catchup?

The catch-up consists of receiving block parts from a peer. As mentioned above (3), if we don't know which block the peer is looking for (cases 1 or 2), we ship to the peer the block we have committed on that height.

}
// XXX: does this block? should we run this in its own go-routine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call may block for closing the (raw) connection, or by the RemovePeer calls on other reactors. I think it is safe to run it in this same thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same time, if the stopped peer is configured as persistent peer, the switch will attempt to reconnect to it, in background. Not sure if this is a desired behavior.

Comment on lines +52 to +55
type badPeerInfo struct {
Error error `json:"error"`
PeerID p2p.ID `json:"peer_key"`
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error doesn't marshal effectively.

package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type badPeerInfo struct {
	Error error `json:"error"`
}

func main() {
	i := badPeerInfo{
		Error: errors.New("this is an error"),
	}

	data, err := json.Marshal(i)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%s\n", string(data))
}
{"error":{}}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow thanks. What is the right way to do this ? Just use a string in the struct?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow thanks. What is the right way to do this ? Just use a string in the struct?

Yeah, basically, e.g.

type badPeerInfo struct {
	Error string `json:"error,omitempty"` // omitempty is important I guess
}

@ebuchman
Copy link
Contributor Author

ebuchman commented Nov 16, 2022

I'm thinking we should update this to actually not remove the peer yet, but leave the other code changes, just so we can merge it sooner without blocking on more analysis, and we can come back to activate the peer removal after doing a bit more analysis later

@cason
Copy link
Contributor

cason commented Nov 29, 2022

I think what this PR introduces in the current state is a way to the consensus protocol (consensus/state.go) to report bad peer behavior to the consensus reactor (consensus/reactor.go). For that, we are augmenting the existing interface between this two layers, used to report good peer behavior (peer ID and message accepted from the peer), to also report bad peer behavior (peer ID and error returned by the consensus implementation upon processing a message from the peer).

I would rename the PR to make this clear. Plus, I would log the errors reported by consensus instead of acting upon them.

@cason
Copy link
Contributor

cason commented Nov 29, 2022

In terms of implementation, I note that we report an error upon receiving a message from a peer. From this perspective, the message causing the error could also be reported to the peerInfo/Stat queue.

Comment on lines +120 to +121
// Takes a `msgInfo` or a `badPeerInfo`
peerInfoQueue chan interface{}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be two distinct channels, not a single channel of interface{}.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go is not a dynamically typed language, interface{} is a huge red flag, not a pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh I agonized over this a bit and wrote about the problem in the opening description. I thought introducing a channel would be more invasive but on second thought it should be pretty straight forward and would preserve better type discipline so you're probably right, thanks.

Comment on lines +908 to 913
peer := conR.Switch.Peers().Get(peerID)
if peer == nil {
conR.Logger.Debug("Attempt to remove non-existent peer",
"peer", peerID)
continue
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Get can fail, it needs to return an error, failure should not be expressed as a nil peer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not a failure, its just a miss - there's no value under that key.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A read miss is a failure. Or, at least, it's distinct from all valid values of the type expressed by the collection you're calling Get on. Like if you got

m := map[int]int{31: 0}

then

v12 := m[12] // read miss resulting in v12 == 0
v31 := m[31] // read hit resulting in v31 == 0

and

println(v12 == v31)

is gonna output "true" but it ain't true, right? v12 was miss and v31 was a hit and those zeroes aren't the same. So when you get something out of a collection that operation has got to return (ValueType, error) or at least (ValueType, bool) to be sound. No way around it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterbourgon in your example, v12 can validly be 0, but a peer can't validly be nil here.

While I understand that it's more idiomatic, it seems like addressing that would add complexity for very little value (we'd need to add a specific error type for when the peer doesn't exist, and then check that error's type to ensure that it was a miss and not some other error, and then what do we do with other errors?).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that in this specific case a nil value can be interpreted as a read miss without loss of information, and that my suggestion is ultimately making the code more idiomatic rather than (necessarily) more correct. I don't think that modeling read misses as errors represents an increase in complexity, but no reason to get into that here.

Comment on lines +778 to +797
err := cs.handleMsg(mi)
if err != nil {
// if error is due to misbehaviour, disconect from peer
// TODO: is there a cleaner way to do these checks?
if errors.Is(err, ErrInvalidProposalSignature) ||
errors.Is(err, types.ErrVoteInvalidSignature) ||
errors.Is(err, types.ErrVoteInvalidValidatorAddress) ||
errors.Is(err, types.ErrVoteInvalidValidatorIndex) {

// tell reactor to disconnect from peer
cs.peerInfoQueue <- badPeerInfo{
Error: err,
PeerID: mi.PeerID,
}
} else {
// other errors don't necessarily mean the peer is bad,
// so there is nothing to do
// see https://github.com/tendermint/tendermint/issues/2871
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleMsg is a method defined on the State type. It returns an error which, apparently, in certain cases, causes the caller to send something on a Go channel defined as a field on that same State type. Is there any reason that this work is not done by the handleMsg method directly? Maybe there is a reason! and so maybe this should in fact be done by the caller like this here code suggests. But then you want like

if err := cs.wal.Write(mi); err != nil {
	cs.Logger.Error("failed writing to WAL", "err", err)
	// IS IT VALID TO CARRY ON FROM THIS POINT?? I DOUBT IT?? 
	// PROBABLY NEED TO CONTINUE HERE?? I MEAN DANG DUDE
}

var (
	handleErr                        = cs.handleMsg(mi)
	isErrInvalidProposalSignature    = errors.Is(handleErr, ErrInvalidProposalSignature)
	isErrVoteInvalidSignature        = errors.Is(handleErr, types.ErrVoteInvalidSignature)
	isErrVoteInvalidValidatorAddress = errors.Is(handleErr, types.ErrVoteInvalidValidatorAddress)
	isErrVoteInvalidValidatorIndex   = errors.Is(handleErr, types.ErrVoteInvalidValidatorIndex)
	isMisbehavior                    = isErrInvalidProposalSignature || isErrVoteInvalidSignature || isErrVoteInvalidValidatorAddress || isErrVoteInvalidValidatorIndex
)
switch {
case isMisbehavior:
	cs.peerInfoQueue <- badPeerInfo{Error: handleErr, PeerID: mi.PeerID}
default:
	// PROBABLY AT LEAST LOG SOMETHING, I MEAN LIKE WYD FAM
}

Comment on lines +817 to 821
if err := cs.handleMsg(mi); err != nil {
// TODO: why are we erroring on our own messages?
// should we panic ?
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bro.

If handleMsg returns an error you can't make any assumptions about what that means except that the handleMsg call failed. An error means you didn't handle the message. And there is basically no situation where panicking is an acceptable response to functions that return errors.

If handleMsg fails, what does that mean, in the context? Should it terminate the goroutine that's running the receiveRoutine method of this type? If so, return fmt.Errorf("handle message: %w", err). Should the receive goroutine be resilient to this error? If so, log the error and continue.

@@ -896,6 +923,7 @@ func (cs *State) handleMsg(mi msgInfo) {
"err", err,
)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log an error or return it, don't do both.

@@ -807,13 +836,9 @@ func (cs *State) receiveRoutine(maxSteps int) {
}

// state transitions on complete-proposal, 2/3-any, 2/3-one
func (cs *State) handleMsg(mi msgInfo) {
func (cs *State) handleMsg(mi msgInfo) (err error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (cs *State) handleMsg(mi msgInfo) (err error) {
func (cs *State) handleMsg(mi msgInfo) error {

@@ -825,6 +850,7 @@ func (cs *State) handleMsg(mi msgInfo) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return cs.setProposal(msg.Proposal)

@@ -859,11 +885,12 @@ func (cs *State) handleMsg(mi msgInfo) {
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return nil

Comment on lines -886 to +913
return
return nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an error?

@@ -896,6 +923,7 @@ func (cs *State) handleMsg(mi msgInfo) {
"err", err,
)
}
return err

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete everything after the switch statement

Suggested change
return err

Comment on lines 891 to +893
added, err = cs.tryAddVote(msg.Vote, peerID)
if added {
cs.statsMsgQueue <- mi
cs.peerInfoQueue <- mi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added, err := cs.tryAddVote(msg.Vote, peerID)
if err != nil { // added is by definition invalid 
	return err
}
if added {
	cs.whatever <- whatever
}
return nil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S:wip Work in progress (prevents stalebot from automatically closing)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet