Skip to content

Add atomic code upgrade test coverage#7973

Open
Copilot wants to merge 23 commits into
mainfrom
copilot/add-test-for-atomic-code-upgrade
Open

Add atomic code upgrade test coverage#7973
Copilot wants to merge 23 commits into
mainfrom
copilot/add-test-for-atomic-code-upgrade

Conversation

Copilot AI commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Closes #3390.

CCF supports replacing an old node set with a non-overlapping new node set via one governance reconfiguration proposal. This adds coverage and docs for that atomic code upgrade path.

  • Atomic reconfiguration helper
    • Added Consortium.replace_nodes() to trust multiple pending nodes and retire multiple existing nodes in one proposal.
    • Kept replace_node() as the single-node wrapper.
network.consortium.replace_nodes(
    primary,
    old_nodes,
    new_nodes,
    valid_from=valid_from,
)
  • Code upgrade test

    • Added test_update_all_nodes_atomically().
    • Starts a full replacement node set with the new package.
    • Submits one proposal containing all transition_node_to_trusted and remove_node actions.
    • Waits for primary election among the new nodes, then removes retired old nodes.
    • Preserves sequential upgrade coverage by upgrading back afterward.
  • Operations documentation

    • Documented the atomic alternative in the code upgrade procedure.
    • Clarified that new nodes should remain pending until the atomic proposal is submitted.

Copilot AI and others added 10 commits June 24, 2026 21:00
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Copilot AI changed the title [WIP] Add test for atomic code upgrade Add atomic code upgrade test coverage Jun 24, 2026
Copilot AI requested a review from achamayou June 24, 2026 21:09
@achamayou

Copy link
Copy Markdown
Member

@copilot check code format and drop the reference to #1713, that's been addressed long ago.

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>

Copilot AI commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@copilot check code format and drop the reference to #1713, that's been addressed long ago.

Addressed in a870d10. I re-ran the targeted Python format check and removed the stale #1713 references from the updated code path.

@achamayou

Copy link
Copy Markdown
Member

@copilot look at the last complete Virtual C test and fix the failure. Don't forget to format the code.

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Copilot AI requested a review from achamayou June 27, 2026 08:35
In code upgrade scenarios, nodes may have read-only ledger directories
from retired/stopped nodes that contain non-contiguous committed chunks.
The check_ledger_files_chunk_flags function was checking all ledger paths
(including read-only historical dirs), causing a ValueError when non-
contiguous chunks were encountered.

Fix: only check the node's own (writable) current ledger directory.
Read-only ledger dirs are bootstrapping data from other nodes and may
legitimately be non-contiguous after code upgrades.

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
@achamayou

Copy link
Copy Markdown
Member

@copilot looks good but please generalise remove_retired_node() to remove_retired_nodes() taking a collection of nodes, and update the test accordingly, I think that will be shorter and faster to run.

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>

Copilot AI commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

@copilot looks good but please generalise remove_retired_node() to remove_retired_nodes() taking a collection of nodes, and update the test accordingly, I think that will be shorter and faster...

Addressed in db836ec. remove_retired_node() is now remove_retired_nodes() and the atomic upgrade test removes the old node set with one batched helper call.

@copilot look at the last complete Virtual C test and fix the failure. Don't forget to format the code.

Checked the VMSS Virtual C history. The last completed CI run for this branch passed VMSS Virtual C at 585bfac; the earlier failure was the non-contiguous ledger chunk check before the subsequent main merge. I also re-ran the Python format check successfully.

@copilot look at the last complete Virtual C test and fix the failure. Don't forget to format the code.

Checked the VMSS Virtual C history. The last completed CI run for this branch passed VMSS Virtual C at 585bfac; the earlier failure was the non-contiguous ledger chunk check before the subsequent main merge. I also re-ran the Python format check successfully.

@copilot looks good but please generalise remove_retired_node() to remove_retired_nodes() taking a collection of nodes, and update the test accordingly, I think that will be shorter and faster...

Addressed in db836ec. remove_retired_node() is now remove_retired_nodes() and the atomic upgrade test removes the old node set with one batched helper call.

@achamayou achamayou marked this pull request as ready for review June 30, 2026 09:37
@achamayou achamayou requested a review from a team as a code owner June 30, 2026 09:37
Copilot AI review requested due to automatic review settings June 30, 2026 09:37

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end coverage (and corresponding ops documentation) for CCF’s “atomic code upgrade” path, where an entire old node set is retired and a non-overlapping new node set is trusted via a single governance proposal.

Changes:

  • Added Consortium.replace_nodes() to trust multiple pending nodes and remove multiple existing nodes in one proposal, keeping replace_node() as a single-node wrapper.
  • Extended tests/code_update.py with an atomic full-network upgrade test (test_update_all_nodes_atomically()), exercised before upgrading back to preserve sequential upgrade coverage.
  • Updated code upgrade operations docs to describe the atomic alternative and the requirement to keep new nodes pending until the atomic proposal is submitted.

Custom instructions used:

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
tests/infra/network.py Restricts chunk-flag verification to each node’s writable ledger directory, avoiding non-contiguous read-only bootstrap ledgers.
tests/infra/consortium.py Adds multi-node atomic replace helper (replace_nodes) and refactors replace_node to wrap it.
tests/code_update.py Adds atomic full-network code upgrade test path plus helper to remove retired nodes from the service state.
doc/operations/code_upgrade.rst Documents atomic reconfiguration option during code upgrade procedure.

Comment thread tests/code_update.py
Comment thread tests/code_update.py Outdated
Comment thread doc/operations/code_upgrade.rst Outdated
achamayou and others added 3 commits June 30, 2026 10:54
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment thread tests/code_update.py
primary = new_primary
for new_node in new_nodes:
new_node.wait_for_node_to_join(timeout=args.ledger_recovery_timeout)
new_node.set_certificate_validity_period(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot I'm curious why do we do that, I thought there's some default period good enough, elaborate on the necessity of this line here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_certificate_validity_period() here is test-side bookkeeping, not a governance change.

In the non-atomic path, network.trust_node() already sets this local expectation on each Node object. In the atomic path we bypass network.trust_node() and call replace_nodes() directly, so we set it manually to keep node metadata consistent for later cert-validity checks.

You’re right that the service default validity period is sufficient on-chain; this line is only to mirror that value in the test harness state.

Copilot AI requested a review from maxtropets June 30, 2026 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add test for atomic code upgrade

4 participants