
Data Encoding (JSON, XML, Avro)


Text and Binary Data Encoding Formats

When systems communicate over a network or store data to disk, they need to serialize structured data into a sequence of bytes (encoding) and later deserialize it back (decoding). The choice of encoding format affects performance, interoperability, and the ability to evolve data schemas over time.

Text Formats

JSON (JavaScript Object Notation) is the dominant format for web APIs. It is human-readable, self-describing (field names are included in the data), and supported by every programming language. JSON supports objects, arrays, strings, numbers, booleans, and null. Its weaknesses: no schema enforcement (the sender and receiver must agree on structure informally), no native binary data type (binary must be Base64-encoded, expanding size by ~33%), and numbers lack precision guarantees (large integers may lose precision in JavaScript's 64-bit floats).
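
JSON's lack of a binary type is easy to demonstrate with Python's standard library; the sketch below round-trips a record and shows the Base64 size expansion (the field names here are illustrative):

```python
import base64
import json

# Binary payloads must be Base64-encoded before going into JSON,
# which inflates them: 4 output characters per 3 input bytes (~33%).
record = {"id": 42, "payload": base64.b64encode(bytes(300)).decode("ascii")}

encoded = json.dumps(record)                 # serialize to a JSON string
decoded = json.loads(encoded)                # deserialize it back
raw = base64.b64decode(decoded["payload"])   # recover the original bytes

print(len(raw))                  # 300 raw bytes...
print(len(decoded["payload"]))   # ...became 400 Base64 characters
```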

XML (Extensible Markup Language) is more verbose than JSON but supports namespaces (avoiding name collisions when mixing schemas), attributes on elements, and formal schema definitions via XSD (XML Schema Definition) or DTD. XML is still used in enterprise systems (SOAP web services, configuration files like Maven's pom.xml, Android layouts). Its verbosity — opening and closing tags for every element — makes it significantly larger than equivalent JSON.
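
Namespace handling can be seen with the standard library's xml.etree.ElementTree; the namespace URIs below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a <name> element; namespaces keep them apart.
doc = """<order xmlns:u="http://example.com/user"
               xmlns:b="http://example.com/billing">
  <u:name>Alice</u:name>
  <b:name>Visa ending 4242</b:name>
</order>"""

root = ET.fromstring(doc)
user_name = root.findtext("{http://example.com/user}name")
bill_name = root.findtext("{http://example.com/billing}name")
print(user_name)   # Alice
print(bill_name)   # Visa ending 4242
```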

Binary Formats

Binary formats are more compact and faster to parse than text formats:

Protocol Buffers (protobuf) — developed by Google. You define a schema in a .proto file, then use a code generator to produce serialization/deserialization code in your target language. Fields are identified by numeric tags (not names), making the encoded data compact. Values use varint encoding (small integers use fewer bytes). Adding new optional fields is backward-compatible because unknown tags are skipped.
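
The varint scheme can be sketched in a few lines of Python (a simplified illustration, not the official protobuf library):

```python
def encode_varint(n: int) -> bytes:
    # Emit 7 bits per byte, least-significant group first; the high bit
    # of each byte signals "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(encode_varint(42).hex())    # '2a': one byte
print(encode_varint(300).hex())   # 'ac02': two bytes

# A field is prefixed by a tag byte: (field_number << 3) | wire_type.
# For field 1 (wire type 0 = varint), id=42 costs just two bytes total.
tag = (1 << 3) | 0
print((bytes([tag]) + encode_varint(42)).hex())   # '082a'
```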

Apache Avro — developed for Hadoop. The writer's schema is embedded in the file header (or exchanged during the RPC handshake). The encoded data contains no field tags or names — just values in schema order — making it the most compact format. Avro resolves differences between the writer's schema and the reader's schema at decode time, enabling flexible evolution.
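
Avro encodes integers with a ZigZag mapping followed by a varint, and strings as length-prefixed UTF-8 bytes. A minimal sketch of the record body for the user example (values only, in writer-schema order):

```python
def zigzag_varint(n: int) -> bytes:
    # ZigZag maps signed to unsigned so small magnitudes stay small
    # (0->0, -1->1, 1->2, -2->3, ...), then varint-encodes the result.
    u = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data   # length prefix, then raw bytes

# The body carries no tags and no names: just values in schema order.
body = zigzag_varint(42) + avro_string("Alice") + avro_string("alice@example.com")
print(len(body))   # 25 bytes for the values alone (the schema lives in the header)
```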

MessagePack — a binary encoding of JSON. It preserves JSON's data model (objects, arrays, strings, numbers) but encodes them in a compact binary format. No schema needed. Useful as a drop-in replacement for JSON when bandwidth matters but schema management is unwanted.
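
A hand-rolled sketch (covering only the type shapes this record needs, not the full MessagePack spec) shows where the bytes go:

```python
def pack_map(d: dict) -> bytes:
    # fixmap header: 0x80 | entry count (valid for maps with <= 15 entries).
    out = bytearray([0x80 | len(d)])
    for key, value in d.items():
        out += pack(key)
        out += pack(value)
    return bytes(out)

def pack(v):
    if isinstance(v, str):
        data = v.encode("utf-8")
        return bytes([0xA0 | len(data)]) + data   # fixstr: tag byte carries length
    if isinstance(v, int) and 0 <= v <= 127:
        return bytes([v])                         # positive fixint: the byte itself
    raise NotImplementedError("sketch covers only short strings and small ints")

record = {"id": 42, "name": "Alice", "email": "alice@example.com"}
packed = pack_map(record)
print(len(packed))   # 40 bytes vs 52 for the equivalent compact JSON
```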

Schema Evolution

As systems evolve, schemas change. Compatibility rules determine which changes are safe:

  • Backward compatibility — new code can read data written by old code. Achieved by only adding optional fields (with defaults) to the new schema. Old data missing the new field uses the default.
  • Forward compatibility — old code can read data written by new code. Achieved because old code ignores unknown fields (works naturally in protobuf and Avro).
  • Full compatibility — both directions work. Required for systems where producers and consumers upgrade at different times (common in microservice architectures).
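
These rules can be illustrated with plain dictionaries standing in for decoded records (the V1/V2 schemas and defaults below are hypothetical):

```python
V2_DEFAULTS = {"id": 0, "name": "", "email": ""}   # hypothetical V2 schema
V1_FIELDS = {"id", "name"}                          # hypothetical V1 schema

def read_v2(data: dict) -> dict:
    # Backward compatibility: new (V2) code fills missing fields with defaults.
    return {field: data.get(field, default) for field, default in V2_DEFAULTS.items()}

def read_v1(data: dict) -> dict:
    # Forward compatibility: old (V1) code silently ignores unknown fields.
    return {field: data[field] for field in V1_FIELDS if field in data}

old_record = {"id": 42, "name": "Alice"}                       # written by V1
new_record = {"id": 7, "name": "Bob", "email": "b@x.com"}      # written by V2
print(read_v2(old_record))   # email filled in with the default ""
print(read_v1(new_record))   # email dropped, not an error
```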

Required vs optional fields: making a field required locks you in. If you later need to remove or change it, old producers still send it and new consumers expect it. Protobuf 3 dropped the concept of required fields entirely for this reason — all fields are implicitly optional.

Real-World Example: The Same Data in Different Formats

Consider a simple user record: id=42, name="Alice", email="alice@example.com".

JSON (52 bytes as UTF-8):

{"id":42,"name":"Alice","email":"alice@example.com"}

XML (~110 bytes):

<?xml version="1.0"?>
<user>
  <id>42</id>
  <name>Alice</name>
  <email>alice@example.com</email>
</user>

Protocol Buffers (~30 bytes):

Schema:

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
}

Binary encoding: field tags + varint for id, length-prefixed strings for name and email. No field names appear in the encoded data.

Avro (~28 bytes):

Schema (in file header):

{"type":"record","name":"User","fields":[{"name":"id","type":"int"},{"name":"name","type":"string"},{"name":"email","type":"string"}]}

Data: just the values in schema order: varint 42, then length-prefixed "Alice", then length-prefixed "alice@example.com". No field tags at all.

MessagePack (~40 bytes): Binary representation of the JSON structure. Field names are included (like JSON) but encoded in binary form, saving overhead bytes on quotes, delimiters, and type tags.

Size comparison for 1 million records: JSON ~52 MB, XML ~110 MB, protobuf ~30 MB, Avro ~28 MB. For large-scale data processing (Hadoop, Kafka), this 2-4x size reduction matters enormously for storage cost and network throughput.
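
The JSON and XML figures can be checked with a short script (the XML layout matches the example shown earlier):

```python
import json

record = {"id": 42, "name": "Alice", "email": "alice@example.com"}

# Compact JSON: no whitespace after separators.
compact_json = json.dumps(record, separators=(",", ":"))

# XML with declaration, indentation, and newlines, as in the example above.
xml = (
    '<?xml version="1.0"?>\n<user>\n'
    f'  <id>{record["id"]}</id>\n'
    f'  <name>{record["name"]}</name>\n'
    f'  <email>{record["email"]}</email>\n'
    "</user>"
)

print(len(compact_json.encode("utf-8")))  # 52 bytes
print(len(xml.encode("utf-8")))           # 106 bytes
```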

Schema Evolution: Backward and Forward Compatibility

Schema Evolution: Adding an Optional Field

Schema V1:

message User {
  int32 id = 1;
  string name = 2;
}

Schema V2:

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
}

Backward compatible: a V1 producer writes {id: 42, name: "Alice"}; the data flows to a V2 consumer, which reads email as the default "".
Forward compatible: a V2 producer writes {id, name, email}; the data flows to a V1 consumer, which ignores tag 3 (email).