Reattempts (#483)

* Schedule re-attempts and add a (not filled out) match statement to actually execute them

A comment explains the methodology. To copy it here:

"""
This is because we *always* re-attempt any protocol which had participation. That doesn't
mean we *should* re-attempt this protocol.

The alternatives were:
1) Note on-chain we completed a protocol, halting re-attempts upon 34%.
2) Vote on-chain to re-attempt a protocol.

This schema doesn't have any additional messages upon the success case (whereas
alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and
then preprocesses. This only sends preprocesses).
"""

Any signing protocol which reaches sufficient participation will be
re-attempted until it no longer does.
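
To make that concrete, here is a minimal, self-contained sketch of that scheduling model. The `ReattemptScheduler` type, its topic strings, and the block delay are illustrative stand-ins, not the PR's actual `ReattemptDb`/`AttemptDb` API:

```rust
use std::collections::HashMap;

// Hypothetical in-memory stand-in for the PR's persistent re-attempt tracking, which
// the real coordinator keeps in its database keyed by the Tributary genesis.
#[derive(Default)]
struct ReattemptScheduler {
  // Block number -> topics to re-attempt once that block is reached.
  scheduled: HashMap<u64, Vec<&'static str>>,
  // Topic -> current attempt number.
  attempts: HashMap<&'static str, u32>,
}

impl ReattemptScheduler {
  // Called when a signing protocol's preprocesses reach 2/3rds participation.
  // No on-chain vote or completion notice is needed; participation alone queues a
  // future re-attempt, and a completed protocol simply won't gather participation
  // on that re-attempt, so the chain of re-attempts dies out on its own.
  fn schedule(&mut self, current_block: u64, delay_blocks: u64, topic: &'static str) {
    self.scheduled.entry(current_block + delay_blocks).or_default().push(topic);
  }

  // Called for every Tributary block: drain any due topics and bump their attempt.
  fn on_block(&mut self, block: u64) -> Vec<(&'static str, u32)> {
    let mut reattempts = vec![];
    for topic in self.scheduled.remove(&block).unwrap_or_default() {
      let attempt = self.attempts.entry(topic).or_insert(0);
      *attempt += 1;
      reattempts.push((topic, *attempt));
    }
    reattempts
  }
}

fn main() {
  let mut scheduler = ReattemptScheduler::default();
  // 2/3rds of validators preprocessed for a Batch signing protocol at block 10.
  scheduler.schedule(10, 5, "batch-0");
  assert!(scheduler.on_block(14).is_empty());
  // At block 15, attempt #1 of the re-attempt is spawned.
  assert_eq!(scheduler.on_block(15), vec![("batch-0", 1)]);
}
```

The success case thus sends no extra messages, matching the trade-off described in the quoted comment.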

* Have the Substrate scanner track DKG removals/completions for the Tributary code

* Don't keep trying to publish a participant removal if we've already set keys

* Pad out the re-attempt match a bit more

* Have CosignEvaluator reload from the DB

* Correctly schedule cosign re-attempts

* Actually spawn new DKG removal attempts

* Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing

The batch ID was an opaque [u8; 5] which also included the network, yet that's
redundant and unhelpful.
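
As a rough illustration of the change (the real SubstrateSignableId has more variants, and the old [u8; 5] layout is paraphrased from this description rather than copied from the prior code):

```rust
// Illustrative shapes only.
#[allow(dead_code)]
enum OldSubstrateSignableId {
  // Network byte plus batch ID packed into an opaque array, even though the network is
  // already implied by the validator set the Tributary is for.
  Batch([u8; 5]),
}

enum SubstrateSignableId {
  // Just the batch number.
  Batch(u32),
}

fn main() {
  let id = SubstrateSignableId::Batch(5);
  // Where raw bytes are still needed (e.g. the recognized_id call in the diff below),
  // the u32 is serialized explicitly.
  let SubstrateSignableId::Batch(batch) = id;
  assert_eq!(batch.to_le_bytes().to_vec(), vec![5, 0, 0, 0]);
}
```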

* Clarify a pair of TODOs in the coordinator

* Remove old TODO

* Final comment cleanup

* Correct usage of TARGET_BLOCK_TIME in reattempt scheduler

It's in ms and I assumed it was in s.
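
A toy illustration of that class of unit mix-up; the constant's value and both helper functions here are made up for the example, only the ms-vs-s point comes from the commit:

```rust
// TARGET_BLOCK_TIME is in milliseconds; the value here is only a placeholder.
const TARGET_BLOCK_TIME: u64 = 6000;

// Buggy: treats the constant as if it were in seconds, so the computed block delay
// collapses to ~0 and re-attempts fire far too early.
fn reattempt_delay_blocks_buggy(delay_secs: u64) -> u64 {
  delay_secs / TARGET_BLOCK_TIME
}

// Fixed: convert the desired delay to milliseconds before dividing.
fn reattempt_delay_blocks(delay_secs: u64) -> u64 {
  (delay_secs * 1000).div_ceil(TARGET_BLOCK_TIME)
}

fn main() {
  // A 30-second delay maps to 5 six-second blocks, not 0.
  assert_eq!(reattempt_delay_blocks_buggy(30), 0);
  assert_eq!(reattempt_delay_blocks(30), 5);
}
```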

* Have coordinator tests drop BatchReattempts which aren't relevant, yet may exist

* Bug fix and pointless oddity removal

We scheduled a re-attempt upon receiving 2/3rds of preprocesses and again upon
receiving 2/3rds of shares, so any signing protocol could cause two re-attempts
instead of just one (see the sketch after this bullet).

The coordinator tests randomly generated the Batch ID since it was previously an
opaque byte array. While that didn't break the test, it was pointless and made
the already-succeeded check before re-attempting impossible to hit.
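
A simplified sketch of the corrected condition, distilled from the `Label::Preprocess` check visible in the diff below; the enum and the threshold handling are reduced stand-ins, not the actual DataSpecification/TributarySpec types:

```rust
#[derive(PartialEq)]
enum Label { Preprocess, Share }

// Only the preprocess round schedules a re-attempt. Before this fix, crossing the
// 2/3rds threshold on shares scheduled a second re-attempt for the same protocol.
fn should_schedule_reattempt(label: &Label, prior_received: u16, now_received: u16, t: u16) -> bool {
  (*label == Label::Preprocess) && ((prior_received + 1 ..= now_received).contains(&t))
}

fn main() {
  // Crossing the threshold with preprocesses schedules exactly one re-attempt...
  assert!(should_schedule_reattempt(&Label::Preprocess, 3, 4, 4));
  // ...while crossing it with shares no longer schedules another.
  assert!(!should_schedule_reattempt(&Label::Share, 3, 4, 4));
}
```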

* Add log statements, correct a deadlock in coordinator tests

* Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts

* Further bump timeout by a minute

AFAICT, GH failed by just a few seconds.

This is also a worst case within a single instance, making it fine to be decently long.

* Further further bump timeout due to lack of distinct error
Author: Luke Parker
Date: 2023-12-12 12:28:53 -05:00 (committed by GitHub)
Parent: b297b79f07
Commit: 6a172825aa
17 changed files with 437 additions and 191 deletions

@@ -26,12 +26,9 @@ use serai_db::*;
use crate::{
processors::Processors,
tributary::{
SignData, Transaction, TributarySpec, SeraiBlockNumber, Topic, Label, DataSpecification,
DataSet, Accumulation,
*,
signing_protocol::{DkgConfirmer, DkgRemoval},
scanner::{RecognizedIdType, RIDTrait, PstTxType, PSTTrait, PTTTrait, TributaryBlockHandler},
FatallySlashed, DkgShare, DkgCompleted, PlanIds, ConfirmationNonces, RemovalNonces, DkgKeyPair,
AttemptDb, DataReceived, DataDb,
},
P2p,
};
@@ -106,6 +103,7 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
signer: <Ristretto as Ciphersuite>::G,
data: &Vec<u8>,
) -> Accumulation {
log::debug!("accumulating entry for {:?} attempt #{}", &data_spec.topic, &data_spec.attempt);
let genesis = self.spec.genesis();
if DataDb::get(self.txn, genesis, data_spec, &signer.to_bytes()).is_some() {
panic!("accumulating data for a participant multiple times");
@@ -121,13 +119,37 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
DataReceived::set(self.txn, genesis, data_spec, &now_received);
DataDb::set(self.txn, genesis, data_spec, &signer.to_bytes(), data);
let received_range = (prior_received + 1) ..= now_received;
// If 2/3rds of the network participated in this preprocess, queue it for an automatic
// re-attempt
// DkgConfirmation doesn't have a re-attempt as it's just an extension for Dkg
if (data_spec.label == Label::Preprocess) &&
received_range.contains(&self.spec.t()) &&
(data_spec.topic != Topic::DkgConfirmation)
{
// Double check the attempt on this entry, as we don't want to schedule a re-attempt if this
// is an old entry
// This is an assert, not part of the if check, as old data shouldn't be here in the first
// place
assert_eq!(AttemptDb::attempt(self.txn, genesis, data_spec.topic), Some(data_spec.attempt));
ReattemptDb::schedule_reattempt(self.txn, genesis, self.block_number, data_spec.topic);
// Because this attempt was participated in, it was justified
// The question becomes why did the prior attempt fail?
// TODO: Slash people who failed to participate as expected in the prior attempt
}
// If we have all the needed commitments/preprocesses/shares, tell the processor
let needed = if (data_spec.topic == Topic::Dkg) || (data_spec.topic == Topic::DkgConfirmation) {
self.spec.n()
} else {
self.spec.t()
};
if (prior_received < needed) && (now_received >= needed) {
let needs_everyone =
(data_spec.topic == Topic::Dkg) || (data_spec.topic == Topic::DkgConfirmation);
let needed = if needs_everyone { self.spec.n() } else { self.spec.t() };
if received_range.contains(&needed) {
log::debug!(
"accumulation for entry {:?} attempt #{} is ready",
&data_spec.topic,
&data_spec.attempt
);
return Accumulation::Ready({
let mut data = HashMap::new();
for validator in self.spec.validators().iter().map(|validator| validator.0) {
@@ -187,6 +209,12 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
// If the attempt is lesser than the blockchain's, return
if data_spec.attempt < curr_attempt {
log::debug!(
"dated attempt published onto tributary for topic {:?} (used attempt {}, current {})",
data_spec.topic,
data_spec.attempt,
curr_attempt
);
return Accumulation::NotReady;
}
// If the attempt is greater, this is a premature publication, full slash
@@ -541,17 +569,22 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
return;
};
// TODO: Only handle this if we're not actively removing any of the signers
// We need to only handle this if we're not actively removing any of the signers
// At the start of this function, we only handle messages from non-fatally slashed
// participants, so this is held
//
// The created Substrate call will fail if a removed validator was one of the signers
// Since:
// 1) publish_serai_tx will block this task until the TX is published
// 2) We won't scan any more TXs/blocks until we handle this TX
// The TX *must* be successfully published *before* we start removing any more
// signers
//
// Accordingly, if the signers aren't currently being removed, they won't be removed
// by the time this transaction is successfully published *unless* a malicious 34%
// participates with the non-participating 33% to continue operation and produce a
// distinct removal (since the non-participating won't block in this block)
//
// This breaks BFT and is accordingly within bounds
let tx = serai_client::SeraiValidatorSets::remove_participant(
@@ -560,6 +593,7 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
signers,
Signature(signature),
);
LocallyDkgRemoved::set(self.txn, genesis, data.plan, &());
self
.publish_serai_tx
.publish_serai_tx(self.spec.set(), PstTxType::RemoveParticipant(data.plan), tx)
@@ -597,7 +631,12 @@ impl<T: DbTxn, Pro: Processors, PST: PSTTrait, PTT: PTTTrait, RID: RIDTrait, P:
);
self
.recognized_id
.recognized_id(self.spec.set(), genesis, RecognizedIdType::Batch, batch.to_vec())
.recognized_id(
self.spec.set(),
genesis,
RecognizedIdType::Batch,
batch.to_le_bytes().to_vec(),
)
.await;
}