Files
serai/tests/coordinator/src/lib.rs

357 lines
11 KiB
Rust
Raw Normal View History

2023-08-01 19:00:48 -04:00
#![allow(clippy::needless_pass_by_ref_mut)] // False positives
2023-08-14 06:54:04 -04:00
use std::{
sync::{OnceLock, Arc},
2023-08-14 06:54:04 -04:00
time::Duration,
};
2023-08-01 19:00:48 -04:00
use tokio::{
task::AbortHandle,
sync::{Mutex as AsyncMutex, mpsc},
};
use rand_core::{RngCore, OsRng};
use zeroize::Zeroizing;
use ciphersuite::{
group::{ff::PrimeField, GroupEncoding},
Ciphersuite, Ristretto,
};
2023-08-01 19:00:48 -04:00
use serai_client::primitives::NetworkId;
use messages::{
coordinator::{SubstrateSignableId, SubstrateSignId, cosign_block_msg},
CoordinatorMessage, ProcessorMessage,
};
use serai_message_queue::{Service, Metadata, client::MessageQueue};
use serai_client::{primitives::Signature, Serai};
use dockertest::{PullPolicy, Image, TestBodySpecification, DockerOperations};
2023-08-01 19:00:48 -04:00
#[cfg(test)]
mod tests;
pub fn coordinator_instance(
name: &str,
message_queue_key: <Ristretto as Ciphersuite>::F,
) -> TestBodySpecification {
2023-08-01 19:00:48 -04:00
serai_docker_tests::build("coordinator".to_string());
TestBodySpecification::with_image(
2023-08-01 19:00:48 -04:00
Image::with_repository("serai-dev-coordinator").pull_policy(PullPolicy::Never),
)
.replace_env(
2023-08-01 19:00:48 -04:00
[
("MESSAGE_QUEUE_KEY".to_string(), hex::encode(message_queue_key.to_repr())),
("DB_PATH".to_string(), "./coordinator-db".to_string()),
("SERAI_KEY".to_string(), {
use serai_client::primitives::insecure_pair_from_name;
hex::encode(&insecure_pair_from_name(name).as_ref().secret.to_bytes()[.. 32])
}),
(
"RUST_LOG".to_string(),
"serai_coordinator=trace,".to_string() + "tributary_chain=trace," + "tendermint=trace",
),
2023-08-01 19:00:48 -04:00
]
.into(),
)
}
pub fn serai_composition(name: &str) -> TestBodySpecification {
2023-08-01 19:00:48 -04:00
serai_docker_tests::build("serai".to_string());
TestBodySpecification::with_image(
Image::with_repository("serai-dev-serai").pull_policy(PullPolicy::Never),
)
Redo Dockerfile generation (#530) Moves from concatted Dockerfiles to pseudo-templated Dockerfiles via a dedicated Rust program. Removes the unmaintained kubernetes, not because we shouldn't have/use it, but because it's unmaintained and needs to be reworked before it's present again. Replaces the compose with the work in the new orchestrator binary which spawns everything as expected. While this arguably re-invents the wheel, it correctly manages secrets and handles the variadic Dockerfiles. Also adds an unrelated patch for zstd and simplifies running services a bit by greater utilizing the existing infrastructure. --- * Delete all Dockerfile fragments, add new orchestator to generate Dockerfiles Enables greater templating. Also delete the unmaintained kubernetes folder *for now*. This should be restored in the future. * Use Dockerfiles from the orchestator * Ignore Dockerfiles in the git repo * Remove CI job to check Dockerfiles are as expected now that they're no longer committed * Remove old Dockerfiles from repo * Use Debian for monero-wallet-rpc * Remove replace_cmds for proper usage of entry-dev Consolidates ports a bit. Updates serai-docker-tests from "compose" to "build". * Only write a new dockerfile if it's distinct Preserves the updated time metadata. * Update serai-docker-tests * Correct the path Dockerfiles are built from * Correct inclusion of orchestration folder in Docker builds * Correct debug/release flagging in the cargo command Apparently, --debug isn't an effective NOP yet an error. * Correct path used to run the Serai node within a Dockerfile * Correct path in Monero Dockerfile * Attempt storing monerod in /usr/bin * Use sudo to move into /usr/bin in CI * Correct 18.3.0 to 18.3.1 * Escape * with quotes * Update deny.toml, ADD orchestration in runtime Dockerfile * Add --detach to the Monero GH CI * Diversify dockerfiles by network * Fixes to network-diversified orchestration * Bitcoin and Monero testnet scripts * Permissions and tweaks * Flatten scripts folders * Add missing folder specification to Monero Dockerfile * Have monero-wallet-rpc specify the monerod login * Have the Docker CMD specify env variables inserted at time of Dockerfile generation They're overrideable with the global enviornment as for tests. This enables variable generation in orchestrator and output to productionized Docker files without creating a life-long file within the Docker container. * Don't add Dockerfiles into Docker containers now that they have secrets Solely add the source code for them as needed to satisfy the workspace bounds. * Download arm64 Monero on arm64 * Ensure constant host architecture when reproducibly building the wasm Host architecture, for some reason, can effect the generated code despite the target architecture always being foreign to the host architecture. * Randomly generate infrastructure keys * Have orchestrator generate a key, be able to create/start containers * Ensure bash is used over sh * Clean dated docs * Change how quoting occurs * Standardize to sh * Have Docker test build the dev Dockerfiles * Only key_gen once * cargo update Adds a patch for zstd and reconciles the breaking nightly change which just occurred. * Use a dedicated network for Serai Also fixes SERAI_HOSTNAME passed to coordinator. * Support providing a key over the env for the Serai node * Enable and document running daemons for tests via serai-orchestrator Has running containers under the dev network port forward the RPC ports. * Use volumes for bitcoin/monero * Use bitcoin's run.sh in GH CI * Only use the volume for testnet (not dev)
2024-02-09 02:48:44 -05:00
.replace_env([("SERAI_NAME".to_string(), name.to_lowercase())].into())
.set_publish_all_ports(true)
2023-08-01 19:00:48 -04:00
}
fn is_cosign_message(msg: &CoordinatorMessage) -> bool {
matches!(
msg,
CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::CosignSubstrateBlock { .. }
)
) || matches!(
msg,
CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::SubstratePreprocesses {
id: SubstrateSignId { id: SubstrateSignableId::CosigningSubstrateBlock(_), .. },
..
}
),
) || matches!(
msg,
CoordinatorMessage::Coordinator(messages::coordinator::CoordinatorMessage::SubstrateShares {
id: SubstrateSignId { id: SubstrateSignableId::CosigningSubstrateBlock(_), .. },
..
}),
)
}
#[derive(Clone, PartialEq, Eq, Debug)]
pub struct Handles {
pub(crate) serai: String,
pub(crate) message_queue: String,
}
pub struct Processor {
network: NetworkId,
serai_rpc: String,
#[allow(unused)]
handles: Handles,
msgs: mpsc::UnboundedReceiver<messages::CoordinatorMessage>,
queue_for_sending: MessageQueue,
abort_handle: Option<Arc<AbortHandle>>,
substrate_key: Arc<AsyncMutex<Option<Zeroizing<<Ristretto as Ciphersuite>::F>>>>,
}
impl Drop for Processor {
fn drop(&mut self) {
if let Some(abort_handle) = self.abort_handle.take() {
abort_handle.abort();
};
}
}
impl Processor {
pub async fn new(
raw_i: u8,
network: NetworkId,
ops: &DockerOperations,
handles: Handles,
processor_key: <Ristretto as Ciphersuite>::F,
) -> Processor {
let message_queue_rpc = ops.handle(&handles.message_queue).host_port(2287).unwrap();
let message_queue_rpc = format!("{}:{}", message_queue_rpc.0, message_queue_rpc.1);
// Sleep until the Substrate RPC starts
let serai_rpc = ops.handle(&handles.serai).host_port(9944).unwrap();
let serai_rpc = format!("http://{}:{}", serai_rpc.0, serai_rpc.1);
// Bound execution to 60 seconds
for _ in 0 .. 60 {
2023-08-14 06:54:04 -04:00
tokio::time::sleep(Duration::from_secs(1)).await;
let Ok(client) = Serai::new(serai_rpc.clone()).await else { continue };
if client.latest_finalized_block_hash().await.is_err() {
continue;
}
break;
}
// The Serai RPC may or may not be started
// Assume it is and continue, so if it's a few seconds late, it's still within tolerance
// Create the queue
let mut queue = (
0,
Arc::new(MessageQueue::new(
Service::Processor(network),
message_queue_rpc.clone(),
Zeroizing::new(processor_key),
)),
);
let (msg_send, msg_recv) = mpsc::unbounded_channel();
let substrate_key = Arc::new(AsyncMutex::new(None));
let mut res = Processor {
network,
serai_rpc,
handles,
queue_for_sending: MessageQueue::new(
Service::Processor(network),
message_queue_rpc,
Zeroizing::new(processor_key),
),
msgs: msg_recv,
abort_handle: None,
substrate_key: substrate_key.clone(),
};
// Spawn a task to handle cosigns and forward messages as appropriate
let abort_handle = tokio::spawn({
async move {
loop {
// Get new messages
let (next_recv_id, queue) = &mut queue;
let msg = queue.next(Service::Coordinator).await;
assert_eq!(msg.from, Service::Coordinator);
assert_eq!(msg.id, *next_recv_id);
queue.ack(Service::Coordinator, msg.id).await;
*next_recv_id += 1;
let msg_msg = borsh::from_slice(&msg.msg).unwrap();
// Remove any BatchReattempts clogging the pipe
// TODO: Set up a wrapper around serai-client so we aren't throwing this away yet
// leave it for the tests
if matches!(
msg_msg,
messages::CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::BatchReattempt { .. }
)
) {
continue;
}
if !is_cosign_message(&msg_msg) {
msg_send.send(msg_msg).unwrap();
continue;
}
let msg = msg_msg;
let send_message = |msg: ProcessorMessage| async move {
queue
.queue(
Metadata {
from: Service::Processor(network),
to: Service::Coordinator,
intent: msg.intent(),
},
borsh::to_vec(&msg).unwrap(),
)
.await;
};
struct CurrentCosign {
block_number: u64,
block: [u8; 32],
}
static CURRENT_COSIGN: OnceLock<AsyncMutex<Option<CurrentCosign>>> = OnceLock::new();
let mut current_cosign =
CURRENT_COSIGN.get_or_init(|| AsyncMutex::new(None)).lock().await;
match msg {
// If this is a CosignSubstrateBlock, reset the CurrentCosign
// While technically, each processor should individually track the current cosign,
// this is fine for current testing purposes
CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::CosignSubstrateBlock { id, block_number },
) => {
let SubstrateSignId {
id: SubstrateSignableId::CosigningSubstrateBlock(block), ..
} = id
else {
panic!("CosignSubstrateBlock didn't have CosigningSubstrateBlock ID")
};
let new_cosign = CurrentCosign { block_number, block };
if current_cosign.is_none() || (current_cosign.as_ref().unwrap().block != block) {
*current_cosign = Some(new_cosign);
}
send_message(
messages::coordinator::ProcessorMessage::CosignPreprocess {
id: id.clone(),
preprocesses: vec![[raw_i; 64]],
}
.into(),
)
.await;
}
CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::SubstratePreprocesses { id, .. },
) => {
// TODO: Assert the ID matches CURRENT_COSIGN
// TODO: Verify the received preprocesses
send_message(
messages::coordinator::ProcessorMessage::SubstrateShare {
id,
shares: vec![[raw_i; 32]],
}
.into(),
)
.await;
}
CoordinatorMessage::Coordinator(
messages::coordinator::CoordinatorMessage::SubstrateShares { .. },
) => {
// TODO: Assert the ID matches CURRENT_COSIGN
// TODO: Verify the shares
let block_number = current_cosign.as_ref().unwrap().block_number;
let block = current_cosign.as_ref().unwrap().block;
let substrate_key = substrate_key.lock().await.clone().unwrap();
// Expand to a key pair as Schnorrkel expects
// It's the private key + 32-bytes of entropy for nonces + the public key
let mut schnorrkel_key_pair = [0; 96];
schnorrkel_key_pair[.. 32].copy_from_slice(&substrate_key.to_repr());
OsRng.fill_bytes(&mut schnorrkel_key_pair[32 .. 64]);
schnorrkel_key_pair[64 ..].copy_from_slice(
&(<Ristretto as Ciphersuite>::generator() * *substrate_key).to_bytes(),
);
let signature = Signature(
schnorrkel::keys::Keypair::from_bytes(&schnorrkel_key_pair)
.unwrap()
.sign_simple(b"substrate", &cosign_block_msg(block_number, block))
.to_bytes(),
);
send_message(
messages::coordinator::ProcessorMessage::CosignedBlock {
block_number,
block,
signature: signature.0.to_vec(),
}
.into(),
)
.await;
}
_ => panic!("unexpected message passed is_cosign_message"),
}
}
}
})
.abort_handle();
res.abort_handle = Some(Arc::new(abort_handle));
res
}
pub async fn serai(&self) -> Serai {
Serai::new(self.serai_rpc.clone()).await.unwrap()
}
/// Send a message to the coordinator as a processor.
pub async fn send_message(&mut self, msg: impl Into<ProcessorMessage>) {
let msg: ProcessorMessage = msg.into();
self
.queue_for_sending
.queue(
Metadata {
from: Service::Processor(self.network),
to: Service::Coordinator,
intent: msg.intent(),
},
borsh::to_vec(&msg).unwrap(),
)
.await;
}
Reattempts (#483) * Schedule re-attempts and add a (not filled out) match statement to actually execute them A comment explains the methodology. To copy it here: """ This is because we *always* re-attempt any protocol which had participation. That doesn't mean we *should* re-attempt this protocol. The alternatives were: 1) Note on-chain we completed a protocol, halting re-attempts upon 34%. 2) Vote on-chain to re-attempt a protocol. This schema doesn't have any additional messages upon the success case (whereas alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and then preprocesses. This only sends preprocesses). """ Any signing protocol which reaches sufficient participation will be re-attempted until it no longer does. * Have the Substrate scanner track DKG removals/completions for the Tributary code * Don't keep trying to publish a participant removal if we've already set keys * Pad out the re-attempt match a bit more * Have CosignEvaluator reload from the DB * Correctly schedule cosign re-attempts * Actuall spawn new DKG removal attempts * Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing The batch ID was an opaque [u8; 5] which also included the network, yet that's redundant and unhelpful. * Clarify a pair of TODOs in the coordinator * Remove old TODO * Final comment cleanup * Correct usage of TARGET_BLOCK_TIME in reattempt scheduler It's in ms and I assumed it was in s. * Have coordinator tests drop BatchReattempts which aren't relevant yet may exist * Bug fix and pointless oddity removal We scheduled a re-attempt upon receiving 2/3rds of preprocesses and upon receiving 2/3rds of shares, so any signing protocol could cause two re-attempts (not one more). The coordinator tests randomly generated the Batch ID since it was prior an opaque byte array. While that didn't break the test, it was pointless and did make the already-succeeded check before re-attempting impossible to hit. * Add log statements, correct dead-lock in coordinator tests * Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts * Further bump timeout by a minute AFAICT, GH failed by just a few seconds. This also is worst-case in a single instance, making it fine to be decently long. * Further further bump timeout due to lack of distinct error
2023-12-12 12:28:53 -05:00
/// Receive a message from the coordinator as a processor.
pub async fn recv_message(&mut self) -> CoordinatorMessage {
2024-02-18 08:19:07 -05:00
// Set a timeout of 20 minutes to allow effectively any protocol to occur without a fear of
Reattempts (#483) * Schedule re-attempts and add a (not filled out) match statement to actually execute them A comment explains the methodology. To copy it here: """ This is because we *always* re-attempt any protocol which had participation. That doesn't mean we *should* re-attempt this protocol. The alternatives were: 1) Note on-chain we completed a protocol, halting re-attempts upon 34%. 2) Vote on-chain to re-attempt a protocol. This schema doesn't have any additional messages upon the success case (whereas alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and then preprocesses. This only sends preprocesses). """ Any signing protocol which reaches sufficient participation will be re-attempted until it no longer does. * Have the Substrate scanner track DKG removals/completions for the Tributary code * Don't keep trying to publish a participant removal if we've already set keys * Pad out the re-attempt match a bit more * Have CosignEvaluator reload from the DB * Correctly schedule cosign re-attempts * Actuall spawn new DKG removal attempts * Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing The batch ID was an opaque [u8; 5] which also included the network, yet that's redundant and unhelpful. * Clarify a pair of TODOs in the coordinator * Remove old TODO * Final comment cleanup * Correct usage of TARGET_BLOCK_TIME in reattempt scheduler It's in ms and I assumed it was in s. * Have coordinator tests drop BatchReattempts which aren't relevant yet may exist * Bug fix and pointless oddity removal We scheduled a re-attempt upon receiving 2/3rds of preprocesses and upon receiving 2/3rds of shares, so any signing protocol could cause two re-attempts (not one more). The coordinator tests randomly generated the Batch ID since it was prior an opaque byte array. While that didn't break the test, it was pointless and did make the already-succeeded check before re-attempting impossible to hit. * Add log statements, correct dead-lock in coordinator tests * Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts * Further bump timeout by a minute AFAICT, GH failed by just a few seconds. This also is worst-case in a single instance, making it fine to be decently long. * Further further bump timeout due to lack of distinct error
2023-12-12 12:28:53 -05:00
// an arbitrary timeout cutting it short
tokio::time::timeout(Duration::from_secs(20 * 60), self.msgs.recv()).await.unwrap().unwrap()
Reattempts (#483) * Schedule re-attempts and add a (not filled out) match statement to actually execute them A comment explains the methodology. To copy it here: """ This is because we *always* re-attempt any protocol which had participation. That doesn't mean we *should* re-attempt this protocol. The alternatives were: 1) Note on-chain we completed a protocol, halting re-attempts upon 34%. 2) Vote on-chain to re-attempt a protocol. This schema doesn't have any additional messages upon the success case (whereas alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and then preprocesses. This only sends preprocesses). """ Any signing protocol which reaches sufficient participation will be re-attempted until it no longer does. * Have the Substrate scanner track DKG removals/completions for the Tributary code * Don't keep trying to publish a participant removal if we've already set keys * Pad out the re-attempt match a bit more * Have CosignEvaluator reload from the DB * Correctly schedule cosign re-attempts * Actuall spawn new DKG removal attempts * Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing The batch ID was an opaque [u8; 5] which also included the network, yet that's redundant and unhelpful. * Clarify a pair of TODOs in the coordinator * Remove old TODO * Final comment cleanup * Correct usage of TARGET_BLOCK_TIME in reattempt scheduler It's in ms and I assumed it was in s. * Have coordinator tests drop BatchReattempts which aren't relevant yet may exist * Bug fix and pointless oddity removal We scheduled a re-attempt upon receiving 2/3rds of preprocesses and upon receiving 2/3rds of shares, so any signing protocol could cause two re-attempts (not one more). The coordinator tests randomly generated the Batch ID since it was prior an opaque byte array. While that didn't break the test, it was pointless and did make the already-succeeded check before re-attempting impossible to hit. * Add log statements, correct dead-lock in coordinator tests * Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts * Further bump timeout by a minute AFAICT, GH failed by just a few seconds. This also is worst-case in a single instance, making it fine to be decently long. * Further further bump timeout due to lack of distinct error
2023-12-12 12:28:53 -05:00
}
pub async fn set_substrate_key(
&mut self,
substrate_key: Zeroizing<<Ristretto as Ciphersuite>::F>,
) {
*self.substrate_key.lock().await = Some(substrate_key);
}
}