Kafka Internals — Wire Protocol, Data Encoding, and the RecordBatch Format
Most engineers interact with Kafka through high-level client abstractions: produce a string, consume a string. But Kafka has a well-defined binary wire protocol, and the way data is encoded on the wire — and stored on disk — has direct implications for performance, compatibility, and schema evolution.
This post covers the RecordBatch format that Kafka has used since version 0.11, how variable-length integer encoding reduces overhead, how the message format evolved and why it matters, and how Schema Registry solves the schema management problem.
The Kafka Wire Protocol
Kafka uses a custom binary TCP protocol (not HTTP, not gRPC). Every interaction between clients and brokers — produce, fetch, metadata, offset commit — is a typed request/response pair.
Request/Response Frame
Every message on the wire starts with a 4-byte length prefix:
┌──────────────────────────────────────────────────┐
│ Length (4 bytes, big-endian int32) │
├──────────────────────────────────────────────────┤
│ API Key (2 bytes) e.g. 0 = Produce │
│ API Version (2 bytes) e.g. 9 │
│ Correlation ID (4 bytes) for matching responses │
│ Client ID (nullable string) │
├──────────────────────────────────────────────────┤
│ Request Body (varies by API Key + Version) │
└──────────────────────────────────────────────────┘
The API Version field is how Kafka handles backwards compatibility — both client and broker negotiate the highest version they both support. This allows brokers to be upgraded before clients without breaking anything.
Responses mirror the structure: length prefix, correlation ID (to match the response to the request), and the response body.
Message Format Evolution
Kafka has gone through three message format versions, each identified by a magic byte in the message header.
v0 and v1 (pre-0.11)
Early Kafka wrapped each record in a Message struct:
v0 Message:
Offset (8 bytes)
Message size (4 bytes)
CRC (4 bytes)
Magic (1 byte, = 0)
Attributes (1 byte) compression codec in bits 0-2
Key length (4 bytes)
Key (variable)
Value length (4 bytes)
Value (variable)
v1 added a timestamp field (8 bytes). Compressed messages were stored as a special case: the value field of an outer message contained a compressed sequence of inner messages. This "message-of-messages" nesting was inefficient: to verify the CRC of a compressed batch, you had to decompress it first.
v2 (0.11+): RecordBatch
Kafka 0.11 introduced a completely new format — RecordBatch — that moves the per-record overhead into a shared batch header. This is the format in use today.
The RecordBatch Format
A RecordBatch is the unit of data on the wire and on disk. One batch can contain many records.
RecordBatch:
BaseOffset (8 bytes) first offset in the batch
BatchLength (4 bytes) byte length of the rest
PartitionLeaderEpoch (4 bytes) for leader epoch tracking
Magic (1 byte, = 2)
CRC (4 bytes) CRC32C over everything after this field
Attributes (2 bytes) compression, timestamp type, transactional, control flags
LastOffsetDelta (4 bytes) lastOffset - baseOffset
BaseTimestamp (8 bytes)
MaxTimestamp (8 bytes)
ProducerID (8 bytes) for idempotent/transactional producers
ProducerEpoch (2 bytes)
BaseSequence (4 bytes) for deduplication
RecordCount (4 bytes)
Records (variable) array of Record structs
The CRC covers everything after the CRC field (including attributes, timestamps, records). This means the CRC can be verified without decompressing the records, which was not possible with the v0/v1 nesting approach.
Individual Record
Within the batch, each Record is compact — per-record fields that are the same for all records in the batch (timestamps, key, compression) are hoisted into the batch header:
Record:
Length (varint) byte length of the record
Attributes (int8) reserved, currently 0
TimestampDelta (varlong) delta from BaseTimestamp
OffsetDelta (varint) delta from BaseOffset
KeyLength (varint) -1 for null
Key (bytes)
ValueLength (varint) -1 for null
Value (bytes)
HeadersCount (varint)
Headers (key-value pairs)
Notice that TimestampDelta and OffsetDelta are deltas, not absolute values. If records in a batch have similar timestamps and consecutive offsets (the common case), these deltas are small integers — which compresses very well with the varint encoding below.
Variable-Length Integer Encoding (Varint)
Kafka v2 uses zigzag-encoded varints (the same scheme as Protocol Buffers) for variable-length fields.
Why Varint?
A fixed 4-byte int32 always uses 4 bytes, even for the value 1. A varint uses 1 byte for values 0–63, 2 bytes for 64–8191, and so on. For the small deltas typical in a RecordBatch, varints dramatically reduce per-record overhead.
Zigzag Encoding
Standard varint encoding is inefficient for negative numbers (e.g., -1 encodes as 10 bytes). Zigzag encoding maps signed integers to unsigned integers:
0 → 0
-1 → 1
1 → 2
-2 → 3
2 → 4
...
n → 2*n (n ≥ 0)
n → -2*n - 1 (n < 0)
This keeps small negative numbers small — -1 encodes as 1 (1 byte), not as a large unsigned number.
In Practice
A batch of 1000 records with consecutive offsets and similar timestamps, the OffsetDelta for each record is a number from 0–999, and the TimestampDelta is a number from 0 to a few milliseconds in nanoseconds. Both encode as 1–2 byte varints per record, versus 4–8 bytes fixed. For a high-throughput topic, this adds up.
Compression Framing
Compression in the RecordBatch format works differently from v0/v1. The Attributes field in the batch header encodes the compression codec (bits 0-2):
| Value | Codec |
|---|---|
| 0 | none |
| 1 | GZIP |
| 2 | Snappy |
| 3 | LZ4 |
| 4 | ZSTD |
When compression is enabled, the Records field of the batch contains the compressed bytes of all records. The rest of the batch header (base offset, timestamps, producer ID, CRC) is uncompressed.
This is why the broker can verify the batch CRC and update the high watermark without decompressing the records — the integrity check and metadata are outside the compressed region.
When the consumer receives the batch, it decompresses the Records field and deserialises the individual Record structs. The broker never needs to decompress — it forwards the raw batch bytes directly.
Headers
Every Record can carry arbitrary headers — key-value pairs of raw bytes. Headers are part of the record but outside the value field, which means they are accessible without deserialising the value.
Common uses:
- Routing metadata (which microservice should handle this event)
- Tracing context (W3C Trace-Context, OpenTelemetry span ID)
- Schema ID (pointing to a schema in Schema Registry — see below)
- Event type for envelope pattern
ProducerRecord<String, byte[]> record = new ProducerRecord<>("orders", key, value);
record.headers().add("event-type", "OrderPlaced".getBytes());
record.headers().add("trace-id", traceId.getBytes());
Schema Management: Why You Need It
Kafka's wire protocol is schema-agnostic — values are raw bytes. This flexibility is also a liability: if the producer changes the shape of the data it sends, consumers that expect the old shape break silently or noisily.
Without schema management, changes require carefully coordinated deployments of all producers and consumers, and there is no machine-readable contract for what a topic contains.
Schema Registry
Confluent's Schema Registry (and compatible alternatives like Apicurio) solves this by maintaining a central repository of schemas, with each schema assigned an integer ID.
Instead of embedding the full schema in every message, the producer registers the schema once and gets back a 5-byte magic prefix:
Value bytes:
0x00 (magic byte, 1 byte)
Schema ID (4 bytes, big-endian int32)
Serialised payload (Avro / Protobuf / JSON Schema)
The consumer reads the first 5 bytes, looks up the schema by ID, and deserialises the rest. Schema lookup results are cached locally.
Avro
Avro is the most commonly used format with Schema Registry. A schema is a JSON document describing the fields and types:
{
"type": "record",
"name": "Order",
"fields": [
{"name": "id", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "currency", "type": "string"},
{"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"}
]
}
Avro binary encoding is compact — field names are not included in the payload (they are in the schema), so an Order above encodes as roughly 20 bytes.
Schema Evolution and Compatibility
Schema Registry enforces compatibility rules when a new schema version is registered:
| Mode | Allowed changes |
|---|---|
BACKWARD |
New schema can read old data (add optional fields with defaults) |
FORWARD |
Old schema can read new data (remove fields) |
FULL |
Both — add optional fields with defaults only |
NONE |
No checks |
BACKWARD compatibility (the default) means consumers can be upgraded before producers — the new consumer can read both old and new messages. This is the correct default for rolling deployments.
Protobuf
Protocol Buffers are an alternative to Avro. Protobuf uses field numbers instead of field names, which makes certain schema changes (renaming a field) backwards-compatible that would not be in Avro.
message Order {
string id = 1;
double amount = 2;
string currency = 3;
int64 timestamp = 4;
}
Protobuf is generally preferred when you have existing Protobuf infrastructure or need cross-language compatibility with strong typing.
JSON Schema
JSON Schema validation works but produces larger payloads than Avro or Protobuf because field names are included in every message. It is useful for debugging or when human readability matters more than wire efficiency.
Practical Recommendations
| Scenario | Recommendation |
|---|---|
| Internal event streaming, high throughput | Avro + Schema Registry + ZSTD compression |
| Cross-team or cross-company data contracts | Protobuf + Schema Registry |
| Simple scripts, debugging, low volume | Plain JSON (no Schema Registry) |
| Strongly typed ML feature pipelines | Avro or Protobuf, FULL compatibility mode |
Regardless of serialisation format, always:
- Register schemas before deploying producers
- Set compatibility mode before anyone writes to a topic
- Never reuse a topic for structurally different event types — use separate topics instead
Summary
Kafka's encoding story has three layers:
Wire framing: a simple length-prefixed binary protocol with versioned API keys. API versioning is how Kafka achieves rolling upgrades across hundreds of brokers without downtime.
RecordBatch format: a shared header per batch, per-record deltas encoded as varints, CRC outside the compressed region. This format makes the broker a fast byte forwarder — it never needs to inspect or decompress records to do its job.
Schema layer: Kafka's bytes are opaque without an agreement on what they mean. Schema Registry, Avro, and Protobuf provide that agreement, along with machine-enforced compatibility rules that make schema evolution safe at the cost of a small operational overhead.
Understanding these three layers tells you where to look when things go wrong: wire-level issues at the protocol layer, serialisation errors at the RecordBatch layer, and schema mismatch errors at the Schema Registry layer.