Xlang Serialization Format
Cross-language Serialization Specification
Apache Fory™ xlang serialization enables automatic cross-language object serialization with support for shared references, circular references, and polymorphism. Unlike traditional serialization frameworks that require IDL definitions and schema compilation, Fory serializes objects directly without any intermediate steps.
Key characteristics:
- Automatic: No IDL definition, no schema compilation, no manual object-to-protocol conversion
- Cross-language: Same binary format works seamlessly across Java, Python, C++, Rust, Go, JavaScript, and more
- Reference-aware: Handles shared references and circular references without duplication or infinite recursion
- Polymorphic: Supports object polymorphism with runtime type resolution
This specification defines the Fory xlang binary format. The format is dynamic rather than static, which enables flexibility and ease of use at the cost of additional complexity in the wire format.
Type Systems
Data Types
- bool: a boolean value (true or false).
- int8: an 8-bit signed integer.
- int16: a 16-bit signed integer.
- int32: a 32-bit signed integer.
- var_int32: a 32-bit signed integer that uses Fory var_int32 encoding.
- int64: a 64-bit signed integer.
- var_int64: a 64-bit signed integer that uses Fory PVL encoding.
- sli_int64: a 64-bit signed integer that uses Fory SLI encoding.
- float16: a 16-bit floating point number.
- float32: a 32-bit floating point number.
- float64: a 64-bit floating point number including NaN and Infinity.
- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
- enum: a data type consisting of a set of named values. Rust enums with non-predefined field values are not supported as an enum.
- named_enum: an enum whose value will be serialized as the registered name.
- struct: a morphic (final) type serialized by the Fory struct serializer, i.e. it doesn't have subclasses. Suppose we're deserializing List<SomeClass>: we can skip dynamic serializer dispatch since SomeClass is morphic (final).
- compatible_struct: a morphic (final) type serialized by the Fory compatible struct serializer.
- named_struct: a struct whose type mapping will be encoded as a name.
- named_compatible_struct: a compatible_struct whose type mapping will be encoded as a name.
- ext: a type which will be serialized by a customized serializer.
- named_ext: an ext type whose type mapping will be encoded as a name.
- list: a sequence of objects.
- set: an unordered set of unique elements.
- map: a map of key-value pairs. Mutable types such as list/map/set/array are not allowed as map keys.
- duration: an absolute length of time, independent of any calendar/timezone, as a count of nanoseconds.
- timestamp: a point in time, independent of any calendar/timezone, as a count of nanoseconds. The count is relative to an epoch at UTC midnight on January 1, 1970.
- local_date: a naive date without timezone. The count is days relative to an epoch at UTC midnight on Jan 1, 1970.
- decimal: exact decimal value represented as an integer value in two's complement.
- binary: a variable-length array of bytes.
- array: only 1D numeric components are allowed. Other arrays will be treated as List. Implementations should support interoperability between array and list.
- bool_array: one dimensional bool array.
- int8_array: one dimensional int8 array.
- int16_array: one dimensional int16 array.
- int32_array: one dimensional int32 array.
- int64_array: one dimensional int64 array.
- float16_array: one dimensional float16 array.
- float32_array: one dimensional float32 array.
- float64_array: one dimensional float64 array.
- union: a tagged union type that can hold one of several alternative types. The active alternative is identified by an index.
- none: represents an empty/unit value with no data (e.g., for empty union alternatives).
Note:
- Unsigned int/long are not added here, since not every language supports those types.
Polymorphisms
For polymorphism, if a non-final class is registered and only one subclass is registered, then we can assume all elements in a List/Map have the same type, which reduces runtime check cost.
Collection/Array polymorphism is not fully supported, since some languages such as golang have only one collection type. If users want to get back exactly the type they passed, they must pass that type when deserializing or annotate the struct field with that type.
Type disambiguation
Due to differences between the type systems of languages, types can't be mapped one-to-one between languages. When deserializing, Fory uses the target data structure type and the data type in the data jointly to determine how to deserialize and populate the target data structure. For example:
class Foo {
    int[] intArray;
    Object[] objects;
    List<Object> objectList;
}
class Foo2 {
    int[] intArray;
    List<Object> objects;
    List<Object> objectList;
}
intArray has an int32_array type, but both the objects and objectList fields in the serialized data have the list data type. When deserializing, the implementation will create an Object array for objects, but an ArrayList for objectList, and populate each with the elements. The serialized data of Foo can also be deserialized into Foo2.
Users can also provide meta hints for the fields of a type, or for the whole type. Here is an example in Java which uses annotations to provide such information.
@ForyObject(fieldsNullable = false, trackingRef = false)
class Foo {
    @ForyField(trackingRef = false)
    int[] intArray;
    @ForyField(polymorphic = true)
    Object object;
    @ForyField(tagId = 1, nullable = true)
    List<Object> objectList;
}
Such information can be provided in other languages too:
- cpp: use macro and template.
- golang: use struct tag.
- python: use typehint.
- rust: use macro.
Type ID
All internal data types are expressed using an ID in range 0~64. Users can use IDs in range 0~8192 for registering their
custom types (struct/ext/enum). User type IDs are in a separate namespace and combined with internal type IDs via bit shifting:
(user_type_id << 8) | internal_type_id.
Internal Type ID Table
| Type ID | Name | Description |
|---|---|---|
| 0 | UNKNOWN | Unknown type, used for dynamic typing |
| 1 | BOOL | Boolean value |
| 2 | INT8 | 8-bit signed integer |
| 3 | INT16 | 16-bit signed integer |
| 4 | INT32 | 32-bit signed integer |
| 5 | VAR_INT32 | Variable-length encoded 32-bit signed integer |
| 6 | INT64 | 64-bit signed integer |
| 7 | VAR_INT64 | Variable-length encoded 64-bit signed integer |
| 8 | SLI_INT64 | Small Long as Int encoded 64-bit signed integer |
| 9 | FLOAT16 | 16-bit floating point (half precision) |
| 10 | FLOAT32 | 32-bit floating point (single precision) |
| 11 | FLOAT64 | 64-bit floating point (double precision) |
| 12 | STRING | UTF-8/UTF-16/Latin1 encoded string |
| 13 | ENUM | Enum registered by numeric ID |
| 14 | NAMED_ENUM | Enum registered by namespace + type name |
| 15 | STRUCT | Struct registered by numeric ID (schema consistent) |
| 16 | COMPATIBLE_STRUCT | Struct with schema evolution support (by ID) |
| 17 | NAMED_STRUCT | Struct registered by namespace + type name |
| 18 | NAMED_COMPATIBLE_STRUCT | Struct with schema evolution (by name) |
| 19 | EXT | Extension type registered by numeric ID |
| 20 | NAMED_EXT | Extension type registered by namespace + type name |
| 21 | LIST | Ordered collection (List, Array, Vector) |
| 22 | SET | Unordered collection of unique elements |
| 23 | MAP | Key-value mapping |
| 24 | DURATION | Time duration (seconds + nanoseconds) |
| 25 | TIMESTAMP | Point in time (nanoseconds since epoch) |
| 26 | LOCAL_DATE | Date without timezone (days since epoch) |
| 27 | DECIMAL | Arbitrary precision decimal |
| 28 | BINARY | Raw binary data |
| 29 | ARRAY | Generic array type |
| 30 | BOOL_ARRAY | 1D boolean array |
| 31 | INT8_ARRAY | 1D int8 array |
| 32 | INT16_ARRAY | 1D int16 array |
| 33 | INT32_ARRAY | 1D int32 array |
| 34 | INT64_ARRAY | 1D int64 array |
| 35 | FLOAT16_ARRAY | 1D float16 array |
| 36 | FLOAT32_ARRAY | 1D float32 array |
| 37 | FLOAT64_ARRAY | 1D float64 array |
| 38 | UNION | Tagged union type (one of several alternatives) |
| 39 | NONE | Empty/unit type (no data) |
Type ID Encoding for User Types
When registering user types (struct/ext/enum), the full type ID combines user ID and internal type ID:
Full Type ID = (user_type_id << 8) | internal_type_id
Examples:
| User ID | Type | Internal ID | Full Type ID | Decimal |
|---|---|---|---|---|
| 0 | STRUCT | 15 | (0 << 8) \| 15 | 15 |
| 0 | ENUM | 13 | (0 << 8) \| 13 | 13 |
| 1 | STRUCT | 15 | (1 << 8) \| 15 | 271 |
| 1 | COMPATIBLE_STRUCT | 16 | (1 << 8) \| 16 | 272 |
| 2 | NAMED_STRUCT | 17 | (2 << 8) \| 17 | 529 |
When reading type IDs:
- Extract internal type: internal_type_id = full_type_id & 0xFF
- Extract user type ID: user_type_id = full_type_id >> 8
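The combine and extract rules above can be checked with a short Python sketch. The function names `encode_type_id` and `decode_type_id` are illustrative only and are not part of any Fory API; the constants come from the internal type ID table.

```python
# Internal type IDs taken from the table above.
ENUM = 13
STRUCT = 15
COMPATIBLE_STRUCT = 16
NAMED_STRUCT = 17

INTERNAL_ID_MASK = 0xFF  # low 8 bits hold the internal type ID


def encode_type_id(user_type_id: int, internal_type_id: int) -> int:
    """Combine a user type ID and an internal type ID into a full type ID."""
    if user_type_id < 0:
        raise ValueError("user_type_id must be non-negative")
    if not 0 <= internal_type_id <= INTERNAL_ID_MASK:
        raise ValueError("internal_type_id must fit in 8 bits")
    return (user_type_id << 8) | internal_type_id


def decode_type_id(full_type_id: int) -> tuple:
    """Split a full type ID into (user_type_id, internal_type_id)."""
    return full_type_id >> 8, full_type_id & INTERNAL_ID_MASK


print(encode_type_id(1, STRUCT))   # 271, matching the examples table
print(decode_type_id(529))         # (2, 17), i.e. user ID 2 + NAMED_STRUCT
```

Because the internal ID occupies exactly the low byte, decoding needs only a mask and a shift.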
Type mapping
See Type mapping
Spec overview
Here is the overall format:
| fory header | object ref meta | object type meta | object value data |
Data is serialized in little-endian byte order overall. If byte swapping is costly for some object, Fory will write the byte order for that object into the data instead of converting it to little endian.
Fory header
Fory header format for xlang serialization:
| 2 bytes | 1 byte bitmap | 1 byte | optional 4 bytes |
+--------------+--------------------------------+------------+------------------------------------+
| magic number | 4 bits reserved | 4 bits meta | language | unsigned int for meta start offset |
Detailed byte layout:
Byte 0-1: Magic number (0x62d4) - little endian
Byte 2: Bitmap flags
- Bit 0: null flag (0x01)
- Bit 1: endian flag (0x02)
- Bit 2: xlang flag (0x04)
- Bit 3: oob flag (0x08)
- Bits 4-7: reserved
Byte 3: Language ID (only present when xlang flag is set)
Byte 4-7: Meta start offset (only present when meta share mode is enabled)
- magic number: 0x62d4 (2 bytes, little endian), used to identify the Fory xlang serialization protocol.
- null flag (bit 0): 1 when object is null, 0 otherwise. If an object is null, only this flag and the endian flag are set.
- endian flag (bit 1): 1 when data is encoded by little endian, 0 for big endian. Modern implementations always use little endian.
- xlang flag (bit 2): 1 when serialization uses Fory xlang format, 0 when serialization uses Fory language-native format.
- oob flag (bit 3): 1 when out-of-band serialization is enabled (BufferCallback is not null), 0 otherwise.
- language: 1 byte indicating the source language. This allows deserializers to optimize for specific language characteristics.
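As an illustration of the byte layout above, here is a minimal Python sketch that parses the fixed part of the header. `ForyHeader` and `parse_header` are hypothetical names, not Fory APIs. The sketch stops before the optional meta start offset, because whether that field is present depends on configuration that is not visible in the bytes themselves.

```python
import struct
from typing import NamedTuple, Optional

MAGIC_NUMBER = 0x62D4
NULL_FLAG = 0x01
ENDIAN_FLAG = 0x02
XLANG_FLAG = 0x04
OOB_FLAG = 0x08


class ForyHeader(NamedTuple):
    is_null: bool
    little_endian: bool
    xlang: bool
    oob: bool
    language: Optional[int]  # None when the xlang flag is not set
    body_offset: int         # index of the first byte after the parsed fields


def parse_header(data: bytes) -> ForyHeader:
    """Parse magic number, bitmap and (if present) the language byte."""
    if len(data) < 3:
        raise ValueError("header needs at least 3 bytes")
    # "<HB": little-endian unsigned 16-bit magic, then the 1-byte bitmap.
    magic, bitmap = struct.unpack_from("<HB", data, 0)
    if magic != MAGIC_NUMBER:
        raise ValueError(f"bad magic number: {magic:#06x}")
    xlang = bool(bitmap & XLANG_FLAG)
    language = None
    offset = 3
    if xlang:
        if len(data) < 4:
            raise ValueError("missing language byte")
        language = data[3]
        offset = 4
    return ForyHeader(
        is_null=bool(bitmap & NULL_FLAG),
        little_endian=bool(bitmap & ENDIAN_FLAG),
        xlang=xlang,
        oob=bool(bitmap & OOB_FLAG),
        language=language,
        body_offset=offset,
    )


# 0x62d4 in little endian is d4 62; bitmap 0x06 = endian + xlang; language 2.
header = parse_header(bytes([0xD4, 0x62, 0x06, 0x02]))
print(header)
```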
Language IDs
| Language | ID |
|---|---|
| XLANG | 0 |
| JAVA | 1 |
| PYTHON | 2 |
| CPP | 3 |
| GO | 4 |
| JAVASCRIPT | 5 |
| RUST | 6 |
| DART | 7 |
Meta Start Offset
If compatible mode is enabled, an uncompressed unsigned int32 (4 bytes, little endian) is appended to indicate the start offset of metadata. During serialization, this is initially written as a placeholder (e.g., -1 or 0), then updated after all objects are serialized and metadata is collected.
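The placeholder-then-patch step can be sketched in Python as follows. This is an illustration under stated assumptions, not the Fory implementation: `write_with_meta_offset` is a hypothetical helper, the payload and type defs are opaque byte strings, and the offset is measured from the start of this buffer. The text above does not say what the offset is relative to, so that last point is an assumption.

```python
import struct


def write_with_meta_offset(payload: bytes, type_defs: bytes) -> bytes:
    """Write a 4-byte placeholder, the payload, then patch the placeholder."""
    buf = bytearray()
    offset_pos = len(buf)
    buf += struct.pack("<I", 0)   # placeholder for the meta start offset
    buf += payload                # serialized object graph (opaque here)
    meta_start = len(buf)         # assumption: offset counted from buffer start
    # Go back and overwrite the placeholder with the real offset.
    struct.pack_into("<I", buf, offset_pos, meta_start)
    buf += type_defs              # collected metadata goes last
    return bytes(buf)


out = write_with_meta_offset(b"abc", b"XY")
meta_start = struct.unpack_from("<I", out, 0)[0]
print(meta_start, out[meta_start:])
```

The need to seek back and rewrite an earlier position is why this layout cannot be produced by a purely forward-only writer.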
Reference Meta
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
Reference Flags
| Flag | Byte Value (int8) | Hex | Description |
|---|---|---|---|
| NULL FLAG | -3 | 0xFD | Object is null. No further bytes are written for this object. |
| REF FLAG | -2 | 0xFE | Object was already serialized. Followed by unsigned varint32 reference ID. |
| NOT_NULL VALUE FLAG | -1 | 0xFF | Object is non-null but reference tracking is disabled for this type. Object data follows immediately. |
| REF VALUE FLAG | 0 | 0x00 | Object is referencable and this is its first occurrence. Object data follows. Assigns next reference ID. |
Reference Tracking Algorithm
Writing:
function write_ref_or_null(buffer, obj):
    if obj is null:
        buffer.write_int8(NULL_FLAG)  // -3
        return true  // done, no more data to write
    if reference_tracking_enabled:
        ref_id = lookup_written_objects(obj)
        if ref_id exists:
            buffer.write_int8(REF_FLAG)  // -2
            buffer.write_varuint32(ref_id)
            return true  // done, reference written
        else:
            buffer.write_int8(REF_VALUE_FLAG)  // 0
            add_to_written_objects(obj, next_ref_id++)
            return false  // continue to serialize object data
    else:
        buffer.write_int8(NOT_NULL_VALUE_FLAG)  // -1
        return false  // continue to serialize object data
Reading:
function read_ref_or_null(buffer):
    flag = buffer.read_int8()
    switch flag:
        case NULL_FLAG (-3):
            return (null, true)  // null object, done
        case REF_FLAG (-2):
            ref_id = buffer.read_varuint32()
            obj = get_from_read_objects(ref_id)
            return (obj, true)  // referenced object, done
        case NOT_NULL_VALUE_FLAG (-1):
            return (null, false)  // non-null, continue reading
        case REF_VALUE_FLAG (0):
            reserve_ref_slot()  // will be filled after reading
            return (null, false)  // non-null, continue reading
Reference ID Assignment
- Reference IDs are assigned sequentially starting from 0
- The ID is assigned when REF_VALUE_FLAG is written (first occurrence)
- Objects are stored in a list/map indexed by their reference ID
- For reading, a placeholder slot is reserved before deserializing the object, then filled after
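The write side of the algorithm above can be expressed as a small runnable Python sketch. `RefWriter` is a hypothetical class, not a Fory API. The `write_varuint32` helper assumes a common 7-bits-per-byte varint with a continuation bit; the varint layout is not defined in this part of the document, so treat that as an assumption.

```python
NULL_FLAG = -3
REF_FLAG = -2
NOT_NULL_VALUE_FLAG = -1
REF_VALUE_FLAG = 0


def write_varuint32(buf: bytearray, value: int) -> None:
    # Assumption: 7 data bits per byte, high bit set on every byte but the last.
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


class RefWriter:
    def __init__(self, tracking: bool = True):
        self.tracking = tracking
        # Maps object identity to reference ID. Callers must keep the objects
        # alive while serializing, since id() values can be reused otherwise.
        self.written = {}

    def write_ref_or_null(self, buf: bytearray, obj) -> bool:
        """Return True when nothing more needs to be written for obj."""
        if obj is None:
            buf.append(NULL_FLAG & 0xFF)
            return True
        if not self.tracking:
            buf.append(NOT_NULL_VALUE_FLAG & 0xFF)
            return False
        ref_id = self.written.get(id(obj))
        if ref_id is not None:
            buf.append(REF_FLAG & 0xFF)
            write_varuint32(buf, ref_id)
            return True
        # First occurrence: assign the next sequential ID, starting from 0.
        self.written[id(obj)] = len(self.written)
        buf.append(REF_VALUE_FLAG)
        return False


writer = RefWriter()
buf = bytearray()
shared = [1, 2]
writer.write_ref_or_null(buf, shared)  # first occurrence
writer.write_ref_or_null(buf, shared)  # back-reference to ID 0
writer.write_ref_or_null(buf, None)    # null
print(buf.hex())
```

The sketch writes only the flags; in a real writer the object data would follow each `False` return.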
When Reference Tracking is Disabled
When reference tracking is disabled globally or for specific types, only the NULL and NOT_NULL VALUE flags
will be used for reference meta. This reduces overhead for types that are known not to have references.
Language-Specific Considerations
Languages with nullable and reference types by default (Java, Python, JavaScript):
In xlang mode, for cross-language compatibility:
- All fields are treated as not-null by default
- Reference tracking is disabled by default
- Users can explicitly mark fields as nullable or enable reference tracking via annotations
- Optional types (e.g., java.util.Optional, typing.Optional) are treated as nullable
Annotation examples:
// Java: use @ForyField annotation
public class MyClass {
    @ForyField(nullable = true, ref = true)
    private Object refField;
    @ForyField(nullable = false)
    private String requiredField;
}

# Python: use typing with fory field descriptors
from pyfory import Fory, ForyField
class MyClass:
    ref_field: ForyField(SomeType, nullable=True, ref=True)
    required_field: ForyField(str, nullable=False)
Languages with non-nullable types by default:
| Language | Null Representation | Reference Tracking Support |
|---|---|---|
| Rust | Option::None | Via Rc<T>, Arc<T>, Weak<T> |
| C++ | std::nullopt, nullptr | Via std::shared_ptr<T>, weak_ptr<T> |
| Go | nil interface/pointer | Via pointer/interface types |
Important: For languages like Rust that don't have implicit reference semantics, reference tracking must use
explicit smart pointers (Rc, Arc).
Type Meta
Every type to be serialized has a type id to indicate its type.
- basic types: the type id
- enum:
  - Type.ENUM + registered id
  - Type.NAMED_ENUM + registered namespace+typename
- list: Type.LIST
- set: Type.SET
- map: Type.MAP
- ext:
  - Type.EXT + registered id
  - Type.NAMED_EXT + registered namespace+typename
- struct:
  - Type.STRUCT + struct meta
  - Type.NAMED_STRUCT + struct meta
Every type must be registered with an ID or name first. The registration can be used for security check and type identification.
Struct is a special type: depending on whether schema compatibility is enabled, Fory will write struct meta differently.
Struct Schema consistent
- If schema consistent mode is enabled globally when creating Fory, type meta will be written as a Fory unsigned varint of type_id. Schema evolution related meta will be ignored.
- If schema evolution mode is enabled globally when creating Fory, and the current class is configured to use schema consistent mode (like struct vs table in flatbuffers):
  - Type meta will be added to captured_type_defs: captured_type_defs[type def stub] = map size, ahead of time when registering the type.
  - Get the index of the meta in captured_type_defs, write that index as | unsigned varint: index |.
Struct Schema evolution
If schema evolution mode is enabled globally when creating Fory, and enabled for the current type, type meta will be written using one of the following modes. Which mode to use is configured when creating Fory.
- Normal mode (meta share not enabled):
  - If type meta hasn't been written before, add type def to captured_type_defs: captured_type_defs[type def] = map size.
  - Get the index of the meta in captured_type_defs, write that index as | unsigned varint: index |.
  - After finishing the serialization of the object graph, Fory will start to write captured_type_defs:
    - First, record the current write position as the meta start offset in the Fory header.
    - Then write captured_type_defs one by one:
        buffer.write_var_uint32(len(writing_type_defs) - len(schema_consistent_type_def_stubs))
        for type_meta in writing_type_defs:
            if not type_meta.is_stub():
                type_meta.write_type_def(buffer)
        writing_type_defs = copy(schema_consistent_type_def_stubs)
- Meta share mode: the writing steps are the same as in normal mode, but captured_type_defs will be shared across multiple serializations of different objects. For example, suppose we have a batch to serialize:
    captured_type_defs = {}
    stream = ...
    # add `Type1` to `captured_type_defs` and write `Type1`
    fory.serialize(stream, [Type1()])
    # add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before.
    fory.serialize(stream, [Type1(), Type2()])
    # `Type1` and `Type2` are written before, no need to write meta.
    fory.serialize(stream, [Type1(), Type2()])
- Streaming mode (streaming mode doesn't support meta share):
  - If type meta hasn't been written before, the data will be written as: | unsigned varint: 0b11111111 | type def |
  - If type meta has been written before, the data will be written as: | unsigned varint: written index << 1 |, where written index is the id in captured_type_defs.
  - With this mode, meta start offset can be omitted.
Normal mode and meta share mode forbid streaming writes, since the writer needs to go back and update the start offset after the whole object graph has been written and meta collection has finished. Only in this way can we ensure that a deserialization failure in meta share mode doesn't lose shared meta.
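The streaming-mode rule (inline type def on first use, shifted index afterwards) can be sketched in Python. `write_type_meta_streaming` is a hypothetical helper, type defs are treated as opaque bytes, and the varint layout is assumed to be the common 7-bits-per-byte form with a continuation bit; none of these are confirmed by the text above.

```python
def write_varuint32(buf: bytearray, value: int) -> None:
    # Assumption: 7 data bits per byte, high bit set on every byte but the last.
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


def write_type_meta_streaming(buf: bytearray, type_name: str,
                              type_def: bytes, captured: dict) -> None:
    """Write a type def inline on first use, otherwise write its index."""
    index = captured.get(type_name)
    if index is None:
        # First occurrence: remember the index, then marker + inline type def.
        captured[type_name] = len(captured)
        write_varuint32(buf, 0b11111111)
        buf += type_def
    else:
        # Already written: index shifted left by one, so the low bit is 0.
        write_varuint32(buf, index << 1)


captured_type_defs = {}
buf = bytearray()
write_type_meta_streaming(buf, "Type1", b"DEF", captured_type_defs)
write_type_meta_streaming(buf, "Type1", b"DEF", captured_type_defs)
print(buf.hex())
```

Since every piece of meta is written at the point of first use, nothing has to be patched afterwards, which is why the meta start offset can be omitted in this mode.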
Type Def
Here we mainly describe the meta layout for schema evolution mode:
| 8 bytes header | variable bytes | variable bytes |
+----------------------+--------------------+-------------------+
| global binary header | meta header | fields meta |
For languages that support inheritance, if a parent class and a subclass have fields with the same name, the field in the subclass is used.
Global binary header
50 bits hash + 1 bit compress flag + 1 bit write fields meta flag + 12 bits meta size. Right is the lower bits.
- The lower 12 bits are used to encode meta size. If meta size >= 0b1111_1111_1111, then write meta_size - 0b1111_1111_1111 next.
- The 13th bit is used to indicate whether to write fields meta. When this class is schema-consistent or uses a registered serializer, fields meta will be skipped. Class meta will then be used only to share namespace + type name.
- The 14th bit is used to indicate whether meta is compressed.
- The other 50 bits are used to store the unique hash of flags + all layers class meta.
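The bit layout of the 8-byte global binary header can be illustrated with a short Python sketch. `pack_global_header` and `unpack_global_header` are hypothetical names. The hash is supplied by the caller, since the hash function is not described here, and the overflow size is returned as a plain integer because its wire encoding is not specified in this passage.

```python
META_SIZE_MASK = 0xFFF            # lower 12 bits: meta size
WRITE_FIELDS_META_BIT = 1 << 12   # 13th bit
COMPRESS_BIT = 1 << 13            # 14th bit
HASH_SHIFT = 14                   # remaining 50 bits: hash
HASH_MASK = (1 << 50) - 1


def pack_global_header(type_hash: int, compressed: bool,
                       write_fields_meta: bool, meta_size: int):
    """Return (64-bit header, extra size to write next or None)."""
    header = (type_hash & HASH_MASK) << HASH_SHIFT
    if compressed:
        header |= COMPRESS_BIT
    if write_fields_meta:
        header |= WRITE_FIELDS_META_BIT
    header |= min(meta_size, META_SIZE_MASK)
    # When the size saturates the 12-bit field, the remainder is written next.
    extra = meta_size - META_SIZE_MASK if meta_size >= META_SIZE_MASK else None
    return header, extra


def unpack_global_header(header: int):
    """Return (hash, compressed, write_fields_meta, size_bits)."""
    return (
        header >> HASH_SHIFT,
        bool(header & COMPRESS_BIT),
        bool(header & WRITE_FIELDS_META_BIT),
        header & META_SIZE_MASK,
    )


header, extra = pack_global_header(0x1, False, True, 20)
print(header, extra, unpack_global_header(header))
```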
Meta header
Meta header is an 8-bit number value.
- Lowest 5 bits: 0b00000~0b11110 are used to record num fields. 0b11111 is reserved to indicate that Fory needs to read more bytes for the length using Fory unsigned int encoding. Note that num_fields is the number of compatible fields. Users can use tag id to mark some fields as compatible fields in a schema consistent context. In such cases, schema consistent fields will be serialized first, then compatible fields. At deserialization, Fory will use the field info of fields that aren't annotated with a tag id to deserialize schema consistent fields, then use the field info in meta to deserialize compatible fields.
- The 6th bit: 0 for registered by id, 1 for registered by name.
- Remaining 2 bits are reserved for future extension.
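A minimal Python sketch of the meta header byte follows. The function names are illustrative. When the field count does not fit in 5 bits the sketch only sets the escape value and leaves the follow-up bytes to the caller, because the exact value written after the escape is not spelled out above.

```python
NUM_FIELDS_ESCAPE = 0b11111       # reserved: real count follows in extra bytes
REGISTER_BY_NAME_BIT = 1 << 5     # the 6th bit


def pack_meta_header(num_fields: int, registered_by_name: bool) -> int:
    """Build the 8-bit meta header; the 2 high bits stay 0 (reserved)."""
    if num_fields < 0:
        raise ValueError("num_fields must be non-negative")
    header = min(num_fields, NUM_FIELDS_ESCAPE)
    if registered_by_name:
        header |= REGISTER_BY_NAME_BIT
    return header


def unpack_meta_header(header: int):
    """Return (num_fields or None if more bytes must be read, by_name)."""
    low = header & NUM_FIELDS_ESCAPE
    num_fields = None if low == NUM_FIELDS_ESCAPE else low
    return num_fields, bool(header & REGISTER_BY_NAME_BIT)


print(bin(pack_meta_header(3, True)))
print(unpack_meta_header(pack_meta_header(40, False)))
```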
Fields meta
Format:
| field info: variable bytes | variable bytes | ... |
+---------------------------------+-----------------+-----+
| header + type info + field name | next field info | ... |
Field Header
Field header is 8 bits. Annotations can be used to provide more specific info. If no annotation exists, Fory will infer this info automatically.
The format for field header is:
2 bits field name encoding + 4 bits size + nullability flag + ref tracking flag
Detailed spec:
- 2 bits field name encoding:
  - encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID
  - If tag id is used, the field name will be written as an unsigned varint tag id, and the 2 bits encoding will be 11.
- size of field name:
  - The 4 bits size: 0~14 will be used to indicate length 1~15. The value 15 indicates that more bytes must be read: the encoding will encode size - 15 as a varint next.
  - If encoding is TAG_ID, then num_bytes of field name will be used to store tag id.
- ref tracking: when set to 1, ref tracking will be enabled for this field.
- nullability: when set to 1, this field can be null.
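The field header can be sketched as bit packing in Python. Two things here are assumptions rather than facts from the text: the fields are placed from high bits to low bits in the order the format line lists them (encoding, size, nullability, ref tracking), and the first three encodings are numbered 0 to 2 in listing order. Only TAG_ID = 0b11 is stated above. The sketch covers plain names of 1 to 15 bytes only; the varint escape and the TAG_ID reuse of the size bits are left out.

```python
# Assumed numbering; only TAG_ID == 0b11 is given by the spec text.
UTF8 = 0
ALL_TO_LOWER_SPECIAL = 1
LOWER_UPPER_DIGIT_SPECIAL = 2
TAG_ID = 3


def pack_field_header(encoding: int, name_len: int,
                      nullable: bool, track_ref: bool) -> int:
    """Pack an 8-bit field header (assumed high-to-low bit placement)."""
    if not 0 <= encoding <= 3:
        raise ValueError("encoding must fit in 2 bits")
    if not 1 <= name_len <= 15:
        raise ValueError("longer names need the varint escape (not covered)")
    size_bits = name_len - 1      # 0~14 stands for length 1~15
    return (encoding << 6) | (size_bits << 2) | (int(nullable) << 1) | int(track_ref)


def unpack_field_header(header: int):
    """Return (encoding, name_len or None on escape, nullable, track_ref)."""
    size_bits = (header >> 2) & 0xF
    name_len = None if size_bits == 15 else size_bits + 1
    return header >> 6, name_len, bool(header & 0b10), bool(header & 0b01)


header = pack_field_header(UTF8, 5, nullable=True, track_ref=False)
print(bin(header), unpack_field_header(header))
```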
Field Type Info
Field type info is written as unsigned int8. Detailed id spec is:
- For struct registered by id, it will be Type.STRUCT.
- For struct registered by name, it will be Type.NAMED_STRUCT.
- For enum registered by id, it will be Type.ENUM.
- For enum registered by name, it will be Type.NAMED_ENUM.
- For ext type registered by id, it will be Type.EXT.
- For ext type registered by name, it will be Type.NAMED_EXT.
- For list/set type, it will be written as Type.LIST/SET, then write element type recursively.
- For 1D primitive array type, it will be written as Type.XXX_ARRAY.
- For multi-dimensional primitive array type with same size on each dim, it will be written as Type.TENSOR.
- For other array type, it will be written as Type.LIST, then write element type recursively.
- For map type, it will be written as Type.MAP, then write key and value type recursively.
- For other types supported by Fory directly, it will be the Fory type id for that type.
- For other types not determined at compile time, write Type.UNKNOWN instead. For such types, the actual type will be written when serializing such field values.
Polymorphism spec:
- struct/named_struct/ext/named_ext are taken as polymorphic. The meta for those types is written separately instead of being inlined here, which reduces meta space cost if objects of this type are serialized multiple times in the current object graph; the field value may be null too.
- enum is taken as morphic. If the deserializing side doesn't have this field, or the type is not enum, the enum value will be skipped.
- list/map/set are taken as morphic. When serializing values of those types, the concrete types won't be written again.
- Other types that Fory supports are taken as morphic too.
List/Set/Map nested type spec:
- list: | list type id | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
- set: | set type id | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
- map: | map type id | key type info | value type info |
  - Key type format: | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
  - Value type format: | nested type id << 2 + nullability flag + ref tracking flag | ... multi-layer type info |
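To make the recursive layout concrete, the Python sketch below flattens a nested field type into a list of integers. `TypeNode`, `nested_type_info` and `field_type_info` are hypothetical names. It assumes that nullability occupies bit 1 and ref tracking bit 0 of each nested entry, which the format line suggests but does not state outright. The result is a list of integers, not bytes, since the byte-level encoding of each entry is outside this passage.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Internal type IDs taken from the type ID table.
INT32, STRING, LIST, SET, MAP = 4, 12, 21, 22, 23


@dataclass(frozen=True)
class TypeNode:
    type_id: int
    children: Tuple["TypeNode", ...] = ()  # element type, or (key, value) for map
    nullable: bool = False
    track_ref: bool = False


def nested_type_info(node: TypeNode) -> List[int]:
    """Encode a nested type as `type_id << 2 | flags`, then its children."""
    # Assumed flag placement: nullability in bit 1, ref tracking in bit 0.
    out = [(node.type_id << 2) | (int(node.nullable) << 1) | int(node.track_ref)]
    for child in node.children:
        out += nested_type_info(child)
    return out


def field_type_info(node: TypeNode) -> List[int]:
    """The outermost type id is written bare; nested types carry flags."""
    out = [node.type_id]
    for child in node.children:
        out += nested_type_info(child)
    return out


# map<string, list<int32>> with a nullable list value
field = TypeNode(MAP, (
    TypeNode(STRING),
    TypeNode(LIST, (TypeNode(INT32),), nullable=True),
))
print(field_type_info(field))
```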
Field Name
If tag id is set, tag id will be used instead. Otherwise meta string of field name will be written instead.
Field order
Field order is left as an implementation detail and is not exposed in the specification. Deserialization needs to re-sort fields using the Fory field sort algorithm. This way, Fory can compute statistics for field names or types and use a more compact encoding.
Extended Type Meta with Inheritance support
To support inheritance for structs, an implementation can follow the spec below.
Schema consistent
Fields are serialized from parent type to leaf type. Fields are sorted using fory struct fields sort algorithms.
Schema Evolution
Meta layout for schema evolution mode:
| 8 bytes header | variable bytes | variable bytes | variable bytes | variable bytes |
+----------------------+----------------+----------------+--------------------+--------------------+
| global binary header | meta header | fields meta | parent meta header | parent fields meta |
Meta header
Meta header is a 64-bit number value encoded in little endian order.
- Lowest 4 bits: 0b0000~0b1110 are used to record num classes. 0b1111 is reserved to indicate that Fory needs to read more bytes for the length using Fory unsigned int encoding. If the current type doesn't have a parent type, or the parent type doesn't have fields to serialize, or we're in a context which serializes fields of the current type only, num classes will be 1.
- The 5th bit is used to indicate whether this type needs schema evolution.
- Other 56 bits are used to store the unique hash of flags + all layers type meta.