Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add chaos-mesh for testing (BFT-392) #70

Open
wants to merge 185 commits into
base: main
Choose a base branch
from
Open

Conversation

IAvecilla
Copy link
Contributor

@IAvecilla IAvecilla commented Feb 27, 2024

What ❔

Incorporate chaos-mesh tools into the consensus testing framework.

Why ❔

To validate the network's behavior in response to non-canonical node behavior, we need tests. Chaos-mesh enables us to introduce delays or node crashes, simulating real-world network scenarios for thorough testing.

IAvecilla and others added 30 commits January 10, 2024 18:34
Co-authored-by: Bruno França <bruno@franca.xyz>
@IAvecilla IAvecilla marked this pull request as ready for review March 11, 2024 18:24
@IAvecilla
Copy link
Contributor Author

@moshababo, I've implemented the initial approach for this integration here. It's likely to undergo several iterations and discussions to refine the testing flow and the introduction of chaos into the network. I may make some additional changes in the coming days, but please feel free to suggest any improvements or request changes as needed.

@moshababo
Copy link
Contributor

Can you respond to my previous reviews so I'll know what changes to expect?

@IAvecilla
Copy link
Contributor Author

@moshababo Done!

.unwrap();
let last_voted_view: u64 =
serde_json::from_value(response.get("last_commited_block").unwrap().to_owned()).unwrap();
for _ in 0..5 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this loop?

Copy link
Contributor Author

@IAvecilla IAvecilla Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first my idea was to check a few times that the node was indeed going through the delay and not being able to vote for the views, now I'm not sure if that is useful at all. Maybe there's a better way to achieve this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it adds much value, while adding non-determinism to the scenario (for being dependent on the response latency).
It is good to check that status while the chaos resources are deployed, but it's better done via test controlling the teardown explicitly.

pub async fn delay_test(test_result: Arc<Mutex<u8>>) -> anyhow::Result<()> {
let client = k8s::get_client().await.unwrap();
let target_node = "consensus-node-01";
let ip = k8s::get_node_rpc_address_with_name(&client, target_node)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to further encapsulate technical details under the test framework. The actual test scenario should more declarative.

/// We use unwraps here because this function is intended to be used like a test.
pub async fn delay_test(test_result: Arc<Mutex<u8>>) -> anyhow::Result<()> {
let client = k8s::get_client().await.unwrap();
let target_node = "consensus-node-01";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test framework should support the installation of chaos resources for a set of nodes, not just a single one.

@@ -83,6 +83,15 @@ pub async fn sanity_test() {
}
}

/// Sanity test for the RPC server.
/// We use unwraps here because this function is intended to be used like a test.
pub async fn delay_test() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, most byzantine behavior duration will need to be "denominated" in views, not clock time, for actually testing liveness during the disruption (and not just quick recovery once it's over).

It might be needed only for the removal, while applying it always at the beginning of the test, but it's better to just have the test framework supports chaos scheduling with from_view,to_view args.

Base automatically changed from rpc_execution_connection to main March 19, 2024 20:00
@brunoffranca
Copy link
Member

@moshababo given the developments with AttackNet, should we close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants