A Memory Amplification Bug in Sigma Prime's rust-libp2p (used by Lighthouse)

Introduction

Sigma Prime maintains a fork of rust-libp2p for Lighthouse, the Ethereum consensus client. Gossipsub is the protocol Lighthouse uses to push attestations and blocks across the validator network. So any extra allocator work you can force on a beacon node touches every staker behind it.

I was reading the gossipsub code one evening and noticed two truncate calls in behaviour.rs that looked fine but were actually too late.

The PR is at sigp/rust-libp2p#578.

The bug

Gossipsub peers exchange control messages. Two of them carry lists of message IDs: IWANT and IDONTWANT. Both get decoded into a Vec<MessageId>.

The config has bounds on these. max_iwant_length defaults to 5000, max_idontwant_messages to 1000. So in theory a node never accepts more than that.

The behaviour layer enforces it like this:

ControlAction::IWant(IWant { mut message_ids }) => {
    message_ids.truncate(self.config.max_iwant_length());
    self.handle_iwant(&propagation_source, message_ids)
}

ControlAction::IDontWant(IDontWant { mut message_ids }) => {
    // ...peer lookup...
    message_ids.truncate(self.config.max_idontwant_messages());
    let removed = peer.messages.remove_data_messages(&message_ids);
}

Looks fine. The problem is what happens before this code runs.

GossipsubCodec decodes the protobuf frame and builds the full Vec<MessageId> first. Every ID in the frame gets allocated. Then the vector reaches the behaviour, and only now does truncate drop the tail.

So a peer can pack a single ~10 MB frame with hundreds of thousands of IDs. With 20-byte IDs, that fits about 477,000 of them. Each MessageId is roughly 44 bytes once you count the heap. That is about 21 MB allocated per frame, most of it freed right after.

It does not crash anything. It is allocator pressure, not denial of service. But on a beacon node already under load, "recoverable" is not the same as "free".

Why max_messages_per_rpc does not save you

The first thing you notice is max_messages_per_rpc. It looks like the right knob.

It is not. That cap counts ControlAction entries, not the message IDs inside each entry. One ControlAction::IWant with 477,000 IDs counts as one entry. The cap is satisfied. The vector is still huge.

You only see this if you walk the code from the wire to the behaviour. Reading the config docs alone tells you the wrong story.

The fix

Move the cap into the codec, where the vector is first built. Then the oversized portion is never allocated to begin with.

I added two fields to ProtocolConfig, plumbed them through GossipsubCodec::new from the existing Config setters, and changed the decode path in protocol.rs from this:

let message_ids: Vec<MessageId> = iwant_msg
    .message_ids
    .into_iter()
    .map(MessageId::from)
    .collect();

to this:

let message_ids: Vec<MessageId> = iwant_msg
    .message_ids
    .into_iter()
    .take(self.max_iwant_length)
    .map(MessageId::from)
    .collect();

Same shape for IDONTWANT. The .take(N) runs on the iterator before .collect(), so the Vec is bounded at construction.

I left the behaviour-layer truncate in place as a defense-in-depth net. Defaults stay at 5000 and 1000, so nothing else changes.

Disclosure

This is a low severity finding. Allocator pressure from a peer with mesh access. Not memory corruption, not unauthenticated.

I sent it to security@sigmaprime.io. Kirk Baird at Sigma Prime confirmed the severity and approved a direct public PR rather than a private patch.

cargo check, the full cargo test -p libp2p-gossipsub suite (148 tests), cargo clippy, and cargo fmt all passed clean.

What happened upstream

Two days after my PR, jxs (João Oliveira), a libp2p maintainer, opened libp2p/rust-libp2p#6409. Same branch name as mine: cap-iwant-idontwant-decode.

His PR generalizes the same fix across all control messages, not just IWANT and IDONTWANT. It folds max_iwant_length, max_ihave_length, and max_idontwant_messages into one max_control_messages bound, and renames max_ihave_messages to max_ihave_messages_heartbeat for clarity.

That is a better shape than my narrower change. His comment on my PR was direct:

Hi, and thanks for this! We were already working on hardening this part of code with other improvements as well. I have submitted this upstream: https://github.com/libp2p/rust-libp2p/pull/6409/

So my fix did not land as-is. The idea did, generalized, upstream. Sigma Prime's fork will pick it up on the next sync.

Final Thoughts

"It is bounded" and "the bound is enforced where it matters" are two different statements. The config had the right numbers. The behaviour layer respected them. But the codec decoded the full vector before the behaviour layer ever saw it. A cap that runs after allocation is documentation, not protection.

You only catch this if you read the code from the wire inward. The config docs and the behaviour handler both look fine in isolation.

Honest about the outcome: parallel work by the actual maintainers landed the broader fix. My contribution was finding and reporting the class of bug, with a working patch and tests. The branch name match is the part that tells the story.

Worth doing this kind of read on any P2P code where decoded vectors get capped after the fact. The pattern shows up a lot, because protobuf decoders return everything and caps get bolted on after.

Thanks to Kirk Baird at Sigma Prime for the quick triage, and to jxs for picking up the broader cleanup upstream.