Add atomic code upgrade test coverage #7973
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
@copilot look at the last complete Virtual C test and fix the failure. Don't forget to format the code.
In code upgrade scenarios, nodes may have read-only ledger directories from retired/stopped nodes that contain non-contiguous committed chunks. The `check_ledger_files_chunk_flags` function was checking all ledger paths (including read-only historical dirs), causing a ValueError when non-contiguous chunks were encountered.

Fix: only check the node's own (writable) current ledger directory. Read-only ledger dirs are bootstrapping data from other nodes and may legitimately be non-contiguous after code upgrades.

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
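The idea behind this fix can be illustrated with a minimal sketch. It is not the actual `check_ledger_files_chunk_flags` implementation: the helper names below are hypothetical, and the `ledger_<first>-<last>.committed` file-name pattern is an assumption about how committed chunks are named. The sketch only shows why verifying contiguity on the writable directory alone avoids the ValueError.

```python
import re

# Assumed file-name pattern for committed chunks: ledger_<first>-<last>.committed
COMMITTED_RE = re.compile(r"^ledger_(\d+)-(\d+)\.committed$")


def committed_ranges(file_names):
    """Return sorted (first_seqno, last_seqno) pairs for committed chunks."""
    ranges = []
    for name in file_names:
        match = COMMITTED_RE.match(name)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return sorted(ranges)


def check_contiguous(file_names):
    """Raise ValueError if two consecutive committed chunks leave a gap."""
    ranges = committed_ranges(file_names)
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first != prev_last + 1:
            raise ValueError(
                f"Non-contiguous chunks: {prev_last} is followed by {next_first}"
            )


def check_node_ledger(main_dir_files, read_only_dir_files):
    """Hypothetical stand-in for the fixed check: only the writable dir is verified."""
    # Read-only dirs hold bootstrap data from other nodes and may contain gaps,
    # so they are deliberately left out of the contiguity check.
    del read_only_dir_files
    check_contiguous(main_dir_files)
```

With this shape, a node whose read-only directory holds chunks `1-5` and `40-50` still passes, as long as its own writable directory is contiguous.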
@copilot looks good but please generalise remove_retired_node() to remove_retired_nodes() taking a collection of nodes, and update the test accordingly, I think that will be shorter and faster to run.
Addressed in db836ec.
Checked the VMSS Virtual C history. The last completed CI run for this branch passed VMSS Virtual C at 585bfac; the earlier failure was the non-contiguous ledger chunk check before the subsequent main merge. I also re-ran the Python format check successfully.
Pull request overview
This PR adds end-to-end coverage (and corresponding ops documentation) for CCF’s “atomic code upgrade” path, where an entire old node set is retired and a non-overlapping new node set is trusted via a single governance proposal.
Changes:
- Added `Consortium.replace_nodes()` to trust multiple pending nodes and remove multiple existing nodes in one proposal, keeping `replace_node()` as a single-node wrapper.
- Extended `tests/code_update.py` with an atomic full-network upgrade test (`test_update_all_nodes_atomically()`), exercised before upgrading back to preserve sequential upgrade coverage.
- Updated code upgrade operations docs to describe the atomic alternative and the requirement to keep new nodes pending until the atomic proposal is submitted.
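As a rough illustration of what "one proposal" means here, the sketch below assembles a single proposal body that trusts several pending nodes and removes several existing ones. The action names `transition_node_to_trusted` and `remove_node` appear in this PR; the function name, the argument names (`node_id`, `valid_from`, `validity_period_days`) and the overall `{"actions": [...]}` shape are assumptions for illustration, not a copy of `Consortium.replace_nodes()`.

```python
def build_replace_nodes_proposal(
    node_ids_to_trust, node_ids_to_retire, valid_from, validity_period_days=None
):
    """Hypothetical helper: bundle all trust and remove actions in one proposal."""
    actions = []
    for node_id in node_ids_to_trust:
        # Argument names are assumed, not taken from the PR
        args = {"node_id": node_id, "valid_from": valid_from}
        if validity_period_days is not None:
            args["validity_period_days"] = validity_period_days
        actions.append({"name": "transition_node_to_trusted", "args": args})
    for node_id in node_ids_to_retire:
        actions.append({"name": "remove_node", "args": {"node_id": node_id}})
    # A single proposal means the whole reconfiguration is accepted or rejected together
    return {"actions": actions}


def build_replace_node_proposal(node_id_to_trust, node_id_to_retire, valid_from):
    """Single-node wrapper, mirroring how replace_node() wraps replace_nodes()."""
    return build_replace_nodes_proposal(
        [node_id_to_trust], [node_id_to_retire], valid_from
    )
```

Because every action travels in the same proposal, there is no intermediate state in which only part of the new node set is trusted or only part of the old set is retired.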
Custom instructions used:
- .github/copilot-instructions.md
- .github/instructions/reviewing.instructions.md
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/infra/network.py | Restricts chunk-flag verification to each node’s writable ledger directory, avoiding non-contiguous read-only bootstrap ledgers. |
| tests/infra/consortium.py | Adds multi-node atomic replace helper (replace_nodes) and refactors replace_node to wrap it. |
| tests/code_update.py | Adds atomic full-network code upgrade test path plus helper to remove retired nodes from the service state. |
| doc/operations/code_upgrade.rst | Documents atomic reconfiguration option during code upgrade procedure. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    primary = new_primary
    for new_node in new_nodes:
        new_node.wait_for_node_to_join(timeout=args.ledger_recovery_timeout)
        new_node.set_certificate_validity_period(
@copilot I'm curious why we do that. I thought there was a default period that is good enough; please elaborate on why this line is needed here.
set_certificate_validity_period() here is test-side bookkeeping, not a governance change.
In the non-atomic path, network.trust_node() already sets this local expectation on each Node object. In the atomic path we bypass network.trust_node() and call replace_nodes() directly, so we set it manually to keep node metadata consistent for later cert-validity checks.
You’re right that the service default validity period is sufficient on-chain; this line is only to mirror that value in the test harness state.
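The bookkeeping point can be sketched with a stub. `NodeStub` and `check_certificate_validity` are hypothetical, and the `(valid_from, validity_days)` signature of `set_certificate_validity_period()` is an assumption; the sketch only shows how a later harness check depends on the expectation having been recorded locally when `network.trust_node()` is bypassed.

```python
class NodeStub:
    """Hypothetical stand-in for the test harness Node object."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.certificate_valid_from = None
        self.certificate_validity_days = None  # not recorded yet

    def set_certificate_validity_period(self, valid_from, validity_days):
        # Purely local state: nothing is sent to the service
        self.certificate_valid_from = valid_from
        self.certificate_validity_days = validity_days


def check_certificate_validity(node, observed_validity_days):
    """Hypothetical later check comparing the real cert against harness state."""
    if node.certificate_validity_days is None:
        raise ValueError(f"No expected validity recorded for node {node.node_id}")
    if node.certificate_validity_days != observed_validity_days:
        raise ValueError(
            f"Expected {node.certificate_validity_days} days, "
            f"observed {observed_validity_days}"
        )
```

In the atomic path, skipping the `set_certificate_validity_period()` call would leave the stub's expectation unset, so a check of this kind would fail even though the certificate issued by the service is fine.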
Closes #3390.
CCF supports replacing an old node set with a non-overlapping new node set via one governance reconfiguration proposal. This adds coverage and docs for that atomic code upgrade path.
- Added `Consortium.replace_nodes()` to trust multiple pending nodes and retire multiple existing nodes in one proposal.
- Kept `replace_node()` as the single-node wrapper.

Code upgrade test
- Added `test_update_all_nodes_atomically()`.
- `transition_node_to_trusted` and `remove_node` actions.

Operations documentation