ruby/doc/marshal.rdoc

= Marshal Format

The Marshal format is used to serialize ruby objects.  The format can store
arbitrary objects through three user-defined extension mechanisms.

For documentation on using Marshal to serialize and deserialize objects, see
the Marshal module.

This document calls a serialized set of objects a stream.  The Ruby
implementation can load a set of objects from a String, an IO or an object
that implements a +getc+ method.

== Stream Format

The first two bytes of the stream contain the major and minor version, each as
a single byte encoding a digit.  The version implemented in Ruby is 4.8
(stored as "\x04\x08") and is supported by ruby 1.8.0 and newer.

Different major versions of the Marshal format are not compatible and cannot
be understood by other major versions.  Lesser minor versions of the format
can be understood by newer minor versions.  Format 4.7 can be loaded by a 4.8
implementation but format 4.8 cannot be loaded by a 4.7 implementation.
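
As a quick check of the version bytes, the following sketch dumps +nil+ and
compares the first two bytes of the stream with the Marshal::MAJOR_VERSION
and Marshal::MINOR_VERSION constants.  The variable names are illustrative
only.

```ruby
stream = Marshal.dump(nil)

# The first two bytes are the major and minor version of the format.
major, minor = stream.bytes.first(2)

puts "format #{major}.#{minor}"
puts [major, minor] == [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]
```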

Following the version bytes is a stream describing the serialized object.  The
stream contains nested objects (the same as a Ruby object) but objects in the
stream do not necessarily have a direct mapping to the Ruby object model.

Each object in the stream is described by a byte indicating its type followed
by one or more bytes describing the object.  When "object" is mentioned below
it means any of the types below that defines a Ruby object.

=== true, false, nil

These objects are each one byte long.  "T" represents +true+, "F"
represents +false+ and "0" represents +nil+.

=== Fixnum and long

"i" represents a signed 32 bit value using a packed format.  One through five
bytes follow the type.  The value loaded will always be a Fixnum.  On
32 bit platforms (where the precision of a Fixnum is less than 32 bits)
loading large values will cause overflow on CRuby.

The fixnum type is used to represent both ruby Fixnum objects and the sizes of
marshaled arrays, hashes, instance variables and other types.  In the
following sections "long" will mean the format described below, which supports
full 32 bit precision.

The first byte has the following special values:

"\x00"::
  The value of the integer is 0.  No bytes follow.

"\x01"::
  The total size of the integer is two bytes.  The following byte is a
  positive integer in the range of 0 through 255.  Only values between 123
  and 255 should be represented this way to save bytes.

"\xff"::
  The total size of the integer is two bytes.  The following byte is a
  negative integer in the range of -1 through -256.

"\x02"::
  The total size of the integer is three bytes.  The following two bytes are a
  positive little-endian integer.

"\xfe"::
  The total size of the integer is three bytes.  The following two bytes are a
  negative little-endian integer.

"\x03"::
  The total size of the integer is four bytes.  The following three bytes are
  a positive little-endian integer.

"\xfd"::
  The total size of the integer is four bytes.  The following three bytes are
  a negative little-endian integer.

"\x04"::
  The total size of the integer is five bytes.  The following four bytes are a
  positive little-endian integer.  For compatibility with 32 bit ruby,
  only Fixnums less than 1073741824 should be represented this way.  For sizes
  of stream objects full precision may be used.

"\xfc"::
  The total size of the integer is five bytes.  The following four bytes are a
  negative little-endian integer.  For compatibility with 32 bit ruby,
  only Fixnums greater than -1073741825 should be represented this way.  For
  sizes of stream objects full precision may be used.

Otherwise the first byte is a sign-extended eight-bit value with an offset.
If the byte is positive, the integer is determined by subtracting 5 from the
byte.  If the byte is negative, the integer is determined by adding 5 to the
byte.

There are multiple representations for many values.  CRuby always outputs the
shortest representation possible.
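
The rules above can be sketched as a small decoder.  This is an illustration
only, not the CRuby implementation: +read_long+ is a hypothetical helper that
takes an Array of byte values and returns the decoded integer together with
the number of bytes it consumed.

```ruby
# Hypothetical helper: decodes one packed "long" from an Array of byte
# values and returns [value, number_of_bytes_consumed].
def read_long(bytes)
  first = bytes[0]
  first -= 256 if first > 127           # sign-extend the first byte

  case first
  when 0
    [0, 1]
  when 1..4                              # positive, `first` bytes follow
    value = 0
    bytes[1, first].each_with_index { |b, i| value |= b << (8 * i) }
    [value, first + 1]
  when -4..-1                            # negative, `-first` bytes follow
    size = -first
    value = 0
    bytes[1, size].each_with_index { |b, i| value |= b << (8 * i) }
    [value - (1 << (8 * size)), size + 1]
  else                                   # small value stored with an offset of 5
    [first > 0 ? first - 5 : first + 5, 1]
  end
end

[0, 5, -5, 300, -70_000].each do |n|
  packed = Marshal.dump(n).bytes.drop(3) # skip the version bytes and "i"
  p [n, packed, read_long(packed)]
end
```

The loop feeds the helper with bytes produced by Marshal.dump, so each decoded
value can be compared with the Integer that was dumped.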

=== Symbols and Byte Sequence

":" represents a real symbol.  A real symbol contains the data needed to
define the symbol for the rest of the stream as future occurrences in the
stream will instead be references (a symbol link) to this one.  The reference
is a zero-indexed 32 bit value (so the first occurrence of <code>:hello</code>
is 0).

Following the type byte is a byte sequence which consists of a long indicating
the number of bytes in the sequence followed by that many bytes of data.  Byte
sequences have no encoding.

For example, the following stream contains the Symbol <code>:hello</code>:

  "\x04\x08:\x0ahello"

";" represents a Symbol link which references a previously defined Symbol.
Following the type byte is a long containing the index in the lookup table for
the linked (referenced) Symbol.

For example, the following stream contains <code>[:hello, :hello]</code>:

  "\x04\b[\a:\nhello;\x00"

When a "symbol" is referenced below it may be either a real symbol or a
symbol link.
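
The two symbol forms can be observed by dumping an Array that repeats a
Symbol.  A minimal sketch; the variable name is illustrative:

```ruby
stream = Marshal.dump([:hello, :hello])

# "[" + long(2), then ":" + long(5) + "hello", then ";" + long(0)
puts stream.inspect

puts stream.count(":")   # how many real symbols were written
puts stream.count(";")   # how many symbol links were written
```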

=== Object References

Separate from but similar to symbol references, the stream contains only one
copy of each object (as determined by #object_id).  This applies to all
objects except true, false, nil, Fixnums and Symbols (which are stored
separately as described above).  Each such object is assigned a 32 bit index
in the order it appears in the stream, and that index is stored and reused
when the object is encountered again.  The outermost object has an index of
0, so the first object nested inside it has an index of 1.

"@" represents an object link.  Following the type byte is a long giving the
index of the object.

For example, the following stream contains an Array of the same
<code>"hello"</code> object twice:

  "\004\b[\a\"\nhello@\006"
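
The stream above can be reproduced with the sketch below.  It uses a binary
String so that no encoding instance variable is written, and it also shows
that equal but distinct Strings are not linked.  Variable names are
illustrative.

```ruby
greeting = "hello".b               # binary String, so no encoding ivar is dumped
stream = Marshal.dump([greeting, greeting])

puts stream.inspect                # the second element is written as an "@" link

loaded = Marshal.load(stream)
puts loaded[0].equal?(loaded[1])   # the link is restored as the same object

# Equal but distinct Strings are each dumped in full.
distinct = Marshal.dump(["hello".b, "hello".b])
puts distinct.include?("@")
```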

=== Instance Variables

"I" indicates that instance variables follow the next object.  An object
follows the type byte.  Following the object is a length indicating the number
of instance variables for the object.  Following the length is a set of
name-value pairs.  The names are symbols while the values are objects.  The
symbols must be instance variable names (<code>:@name</code>).

An Object ("o" type, described below) uses the same format for its instance
variables as described here.

For a String and Regexp (described below) a special instance variable
<code>:E</code> is used to indicate the Encoding.
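
As an illustration, dumping a short String shows the "I" wrapper and the
<code>:E</code> instance variable.  This sketch assumes a Ruby where String
literals are UTF-8 by default; the values shown for <code>:E</code> are what
current CRuby writes for UTF-8 and US-ASCII Strings.

```ruby
utf8 = Marshal.dump("hi")   # assumes the default UTF-8 source encoding

# I  "  \a  h i   \x06      :  \x06  E   T
# |  |  len=2     1 ivar    symbol   :E  true
puts utf8.inspect

ascii = Marshal.dump("hi".encode("US-ASCII"))
puts ascii.inspect          # the same layout, with :E set to false
```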

=== Extended

"e" indicates that the next object is extended by a module.  Following the
type byte is a symbol that contains the name of the module the object is
extended by.  The extended object follows the symbol.
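
A minimal sketch of an extended object, using a hypothetical module named
+Greeter+.  Inspecting the stream shows where the module name symbol sits
relative to the object.

```ruby
module Greeter; end          # hypothetical module for illustration

obj = Object.new.extend(Greeter)
stream = Marshal.dump(obj)

puts stream.inspect          # "e", the module name symbol, then the "o" object

loaded = Marshal.load(stream)
puts loaded.is_a?(Greeter)   # the loaded object is extended again
```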

=== Array

"[" represents an Array.  Following the type byte is a long indicating the
number of objects in the array.  The given number of objects follow the
length.

=== Bignum

"l" represents a Bignum which is composed of three parts:

sign::
  A single byte containing "+" for a positive value or "-" for a negative
  value.
length::
  A long indicating the number of bytes of Bignum data that follow, divided
  by two.  Multiply the length by two to determine the number of bytes of
  data that follow.
data::
  Bytes of Bignum data representing the number.

The following ruby code will reconstruct the Bignum value from an array of
bytes:

  result = 0

  bytes.each_with_index do |byte, exp|
    result += (byte * 2 ** (exp * 8))
  end
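
The loop can be exercised against a real stream.  The sketch below pulls the
sign, length and data out of a dumped Bignum by position; it assumes the
length is small enough to be stored in a single offset-encoded byte, which
holds for the value used here.

```ruby
value  = 2**64 + 12_345
stream = Marshal.dump(value)

sign   = stream[3]                 # "+" or "-" follows the "l" type byte
length = stream.getbyte(4) - 5     # assumes a one-byte long with an offset of 5
bytes  = stream.bytes[5, length * 2]

result = 0
bytes.each_with_index do |byte, exp|
  result += byte * 2**(exp * 8)    # little-endian: byte 0 is least significant
end
result = -result if sign == "-"

puts result == value
```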

=== Class and Module

"c" represents a Class object, "m" represents a Module and "M" represents
either a class or module (this is an old-style type kept for compatibility).
No class or module content is included; these types are only references.
Following the type byte is a byte sequence which is used to look up an
existing class or module.

Instance variables are not allowed on a class or module.

If no class or module exists an exception should be raised.

For "c" and "m" types, the loaded object must be a class or module,
respectively.
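
For illustration, the sketch below dumps a class and a module and then loads a
hand-written stream that names a constant assumed not to exist (+NoSuch+ is a
hypothetical name).  In CRuby the failed lookup raises ArgumentError.

```ruby
class_stream  = Marshal.dump(String)
module_stream = Marshal.dump(Kernel)

puts class_stream.inspect    # "c" followed by the byte sequence "String"
puts module_stream.inspect   # "m" followed by the byte sequence "Kernel"

# Loading returns the existing class, not a copy.
puts Marshal.load(class_stream).equal?(String)

begin
  Marshal.load("\x04\bc\vNoSuch")   # hypothetical name that is not defined
rescue ArgumentError => e
  puts e.message
end
```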

=== Data

"d" represents a Data object.  (Data objects are wrapped pointers from ruby
extensions.)  Following the type byte is a symbol indicating the class for the
Data object and an object that contains the state of the Data object.

To dump a Data object Ruby calls _dump_data.  To load a Data object Ruby calls
_load_data with the state of the object on a newly allocated instance.

=== Float

"f" represents a Float object.  Following the type byte is a byte sequence
containing the float value.  The following values are special:

"inf"::
  Positive infinity

"-inf"::
  Negative infinity

"nan"::
  Not a Number

Otherwise the byte sequence contains the text representation of a C double
(loadable by strtod(3)).  Older minor versions of Marshal also stored extra
mantissa bits to ensure portability across platforms but 4.8 does not include
these.  See [ruby-talk:69518] for some explanation.
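
The byte sequence can be inspected directly.  In this sketch +float_payload+
is a hypothetical helper that assumes the length fits in a single
offset-encoded byte, which is the case for the short values used here.

```ruby
# Hypothetical helper: returns the byte sequence that follows the "f" type
# byte, assuming its length is stored in one byte with an offset of 5.
def float_payload(value)
  stream = Marshal.dump(value)
  length = stream.getbyte(3) - 5
  stream[4, length]
end

puts float_payload(Float::INFINITY)
puts float_payload(-Float::INFINITY)
puts float_payload(Float::NAN)

# Ordinary values are stored as text that parses back to the same Float.
puts Float(float_payload(1.5)) == 1.5
```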

=== Hash and Hash with Default Value

"{" represents a Hash object while "}" represents a Hash with a default value
set (<code>Hash.new 0</code>).  Following the type byte is a long indicating
the size, the number of key-value pairs in the Hash.  Twice that number of
objects follow the size, alternating between key and value.

For a Hash with a default value, the default value follows all the pairs.
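
Both forms can be seen in a short sketch; the variable names are
illustrative.

```ruby
plain = Marshal.dump({ 1 => 2 })
puts plain.inspect            # "{" + long(1), then the key and the value

with_default = Marshal.dump(Hash.new(0))
puts with_default.inspect     # "}" + long(0), then the default value

# The default value survives the round trip.
puts Marshal.load(with_default)[:missing]
```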

=== Object

"o" represents an object that doesn't have any other special form (such as
a user-defined or built-in format).  Following the type byte is a symbol
containing the class name of the object.  Following the class name is a long
indicating the number of instance variable name and value pairs for the
object.  The given number of pairs of objects follow the size.

The keys in the pairs must be symbols containing instance variable names.
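
A minimal sketch with a hypothetical +Point+ class that has a single instance
variable:

```ruby
class Point                    # hypothetical class for illustration
  def initialize(x)
    @x = x
  end
end

stream = Marshal.dump(Point.new(1))

# o   :\nPoint      \x06     :\a@x        i\x06
# |   class name    1 pair   ivar name    value 1
puts stream.inspect
```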

=== Regular Expression

"/" represents a regular expression.  Following the type byte is a byte
sequence containing the regular expression source.  Following the source is
a byte containing the regular expression options (case-insensitive, etc.) as a
signed 8-bit value.

Regular expressions can have an encoding attached through instance variables
(see above).  If no encoding is attached, escapes for the following regexp
specials not present in ruby 1.8 must be removed: g-m, o-q, u, y, E, F, H-L,
N-V, X, Y.
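
The source and options can be picked out of a dumped Regexp by position.
This sketch assumes the Regexp carries an encoding and is therefore wrapped
in an "I" type, and that the source length fits in one offset-encoded byte.

```ruby
stream = Marshal.dump(/ab/i)
puts stream.inspect

# Layout assumed here: version (2 bytes), "I", "/", length, source, options
source_length = stream.getbyte(4) - 5
source  = stream[5, source_length]
options = stream.getbyte(5 + source_length)

puts source
puts options == Regexp::IGNORECASE
```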

=== String

'"' represents a String.  Following the type byte is a byte sequence
containing the string content.  When dumped from ruby 1.9 or newer an encoding
instance variable (<code>:E</code>, see above) should be included unless the
encoding is binary.

=== Struct

"S" represents a Struct.  Following the type byte is a symbol containing the
name of the struct.  Following the name is a long indicating the number of
members in the struct.  Double the number of objects follow the member count.
Each member is a pair containing the member's symbol and an object for the
value of that member.

If the struct name does not match a Struct subclass in the running ruby an
exception should be raised.

If there is a mismatch between the struct in the currently running ruby and
the member count in the marshaled struct an exception should be raised.
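
A sketch with a hypothetical +Pair+ struct.  The second stream is written by
hand with a member name that does not match; in CRuby loading it raises
TypeError.

```ruby
Pair = Struct.new(:left, :right)   # hypothetical Struct for illustration

stream = Marshal.dump(Pair.new(1, 2))

# S, struct name, long(2), then a symbol and a value for each member
puts stream.inspect

begin
  # Same layout, but the second member is named :wrong instead of :right.
  Marshal.load("\x04\bS:\tPair\a:\tlefti\x06:\nwrongi\a")
rescue TypeError => e
  puts e.message
end
```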

=== User Class

"C" represents a subclass of a String, Regexp, Array or Hash.  Following the
type byte is a symbol containing the name of the subclass.  Following the name
is the wrapped object.
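
A minimal sketch with a hypothetical Array subclass named +Stack+:

```ruby
class Stack < Array; end       # hypothetical Array subclass

stream = Marshal.dump(Stack.new([1]))

# C, the subclass name symbol, then the wrapped Array
puts stream.inspect

puts Marshal.load(stream).class
```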

=== User Defined

"u" represents an object with a user-defined serialization format using the
+_dump+ instance method and +_load+ class method.  Following the type byte is
a symbol containing the class name.  Following the class name is a byte
sequence containing the user-defined representation of the object.

The class method +_load+ is called on the class with a string created from the
byte-sequence.
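
A minimal sketch of the +_dump+ and +_load+ pair, using a hypothetical
+Temperature+ class whose state is a single Integer.  The exact bytes depend
on the String returned by +_dump+ (including its encoding), so the stream is
only printed here.

```ruby
class Temperature              # hypothetical class for illustration
  attr_reader :degrees

  def initialize(degrees)
    @degrees = degrees
  end

  # Must return a String; it becomes the byte sequence in the stream.
  def _dump(_level)
    @degrees.to_s
  end

  # Receives a String built from the byte sequence.
  def self._load(data)
    new(Integer(data))
  end
end

stream = Marshal.dump(Temperature.new(21))
puts stream.inspect

restored = Marshal.load(stream)
puts restored.degrees
```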

=== User Marshal

"U" represents an object with a user-defined serialization format using the
+marshal_dump+ and +marshal_load+ instance methods.  Following the type byte
is a symbol containing the class name.  Following the class name is an object
containing the data.

Upon loading, a new instance must be allocated and +marshal_load+ must be
called on the instance with the data.
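
A minimal sketch of the +marshal_dump+ and +marshal_load+ pair, using a
hypothetical +Account+ class that stores its state as an Array:

```ruby
class Account                  # hypothetical class for illustration
  attr_reader :owner, :balance

  def initialize(owner, balance)
    @owner = owner
    @balance = balance
  end

  # May return any object that Marshal can dump.
  def marshal_dump
    [@owner, @balance]
  end

  # Called on a newly allocated instance; initialize is not run.
  def marshal_load(data)
    @owner, @balance = data
  end
end

stream = Marshal.dump(Account.new("ann", 5))
puts stream.inspect            # "U", the class name symbol, then the Array

restored = Marshal.load(stream)
puts restored.owner
puts restored.balance
```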