* doc/marshal.rdoc: Add description of Marshal format.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@41075 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
2013-06-04 21:35:31 +00:00 · 2013-06-04 21:35:31 +00:00 · 0454c0a281
--- a/4
+++ b/4
@ -1,3 +1,7 @@
+Wed Jun  5 06:35:15 2013  Eric Hodel  <drbrain@segment7.net>
+
+	* doc/marshal.rdoc:  Add description of Marshal format.
+
 Wed Jun  5 01:16:09 2013  Benoit Daloze  <eregontp@gmail.com>

 	* array.c (Array#+): fix documentation example.
--- a/doc/marshal.rdoc
+++ b/doc/marshal.rdoc
@ -0,0 +1,313 @@
+= Marshal Format
+
+The Marshal format is used to serialize ruby objects.  The format can store
+arbitrary objects through three user-defined extension mechanisms.
+
+For documentation on using Marshal to serialize and deserialize objects, see
+the Marshal module.
+
+This document calls a serialized set of objects a stream.  The Ruby
+implementation can load a set of objects from a String, an IO or an object
+that implements a +getc+ method.
+
+== Stream Format
+
+The first two bytes of the stream contain the major and minor version, each as
+a single byte encoding a digit.  The version implemented in Ruby is 4.8
+(stored as "\x04\x08") and is supported by ruby 1.8.0 and newer.
+
+Different major versions of the Marshal format are not compatible and cannot
+be understood by other major versions.  Lesser minor versions of the format
+can be understood by newer minor versions.  Format 4.7 can be loaded by a 4.8
+implementation but format 4.8 cannot be loaded by a 4.7 implementation.
+
+Following the version bytes is a stream describing the serialized object.  The
+stream contains nested objects (the same as a Ruby object) but objects in the
+stream do not necessarily have a direct mapping to the Ruby object model.
+
+Each object in the stream is described by a byte indicating its type followed
+by one or more bytes describing the object.  When "object" is mentioned below
+it means any of the types below that defines a Ruby object.
+
+=== true, false, nil
+
+These objects are each one byte long.  "T" is represents +true+, "F"
+represents +false+ and "0" represents +nil+.
+
+=== Fixnum and long
+
+"i" represents a signed 32 bit value using a packed format.  One through five
+bytes follows the type.  The value loaded will always be a Fixnum.  On
+32 bit platforms (where the precision of a Fixnum is less than 32 bits)
+loading large values will cause overflow on CRuby.
+
+The fixnum type is used to represent both ruby Fixnum objects and the sizes of
+marshaled arrays, hashes, instance variables and other types.  In the
+following sections "long" will mean the format described below, which supports
+full 32 bit precision.
+
+The first byte has the following special values:
+
+"\x00"::
+  The value of the integer is 0.  No bytes follow.
+
+"\x01"::
+  The total size of the integer is two bytes.  The following byte is a
+  positive integer in the range of 0 through 255.  Only values between 123
+  and 255 should be represented this way to save bytes.
+
+"\xff"::
+  The total size of the integer is two bytes.  The following byte is a
+  negative integer in the range of -1 through -256.
+
+"\x02"::
+  The total size of the integer is three bytes.  The following two bytes are a
+  positive little-endian integer.
+
+"\xfe"::
+  The total size of the integer is three bytes.  The following two bytes are a
+  negative little-endian integer.
+
+"\x03"::
+  The total size of the integer is four bytes.  The following three bytes are
+  a positive little-endian integer.
+
+"\xfd"::
+  The total size of the integer is two bytes.  The following three bytes are a
+  negative little-endian integer.
+
+"\x04"::
+  The total size of the integer is five bytes.  The following four bytes are a
+  positive little-endian integer.  For compatibility with 32 bit ruby,
+  only Fixnums less than 1073741824 should be represented this way.  For sizes
+  of stream objects full precision may be used.
+
+"\xfc"::
+  The total size of the integer is two bytes.  The following four bytes are a
+  negative little-endian integer.  For compatibility with 32 bit ruby,
+  only Fixnums greater than -10737341824 should be represented this way.  For
+  sizes of stream objects full precision may be used.
+
+Otherwise the first byte is a sign-extended eight-bit value with an offset.
+If the value is positive the value is determined by subtracting 5 from the
+value.  If the value is negative the value is determined by adding 5 to the
+value.
+
+There are multiple representations for many values.  CRuby always outputs the
+shortest representation possible.
+
+=== Symbols and Byte Sequence
+
+":" represents a real symbol.  A real symbol contains the data needed to
+define the symbol for the rest of the stream as future occurrences in the
+stream will instead be references (a symbol link) to this one.  The reference
+is a zero-indexed 32 bit value (so the first occurrence of <code>:hello</code>
+is 0).
+
+Following the type byte is byte sequence which consists of a long indicating
+the number of bytes in the sequence followed by that many bytes of data.  Byte
+sequences have no encoding.
+
+For example, the following stream contains the Symbol <code>:hello</code>:
+
+  "\x04\x08:\x0ahello"
+
+";" represents a Symbol link which references a previously defined Symbol.
+Following the type byte is a long containing the index in the lookup table for
+the linked (referenced) Symbol.
+
+For example, the following stream contains <code>[:hello, :hello]</code>:
+
+  "\x04\b[\a:\nhello;\x00"
+
+When a "symbol" is referenced below it may be either a real symbol or a
+symbol link.
+
+=== Object References
+
+Separate from but similar to symbol references, the stream contains only one
+copy of each object (as determined by #object_id) for all objects except
+true, false, nil, Fixnums and Symbols (which are stored separately as
+described above) a one-indexed 32 bit value will be stored and reused when the
+object is encountered again.  (The first object has an index of 1).
+
+"@" represents an object link.  Following the type byte is a long giving the
+index of the object.
+
+For example, the following stream contains an Array of the object
+<code>"hello"</code> twice:
+
+  "\004\b[\a\"\nhello@\006"
+
+=== Instance Variables
+
+"I" indicates that instance variables follow the next object.  An object
+follows the type byte.  Following the object is a length indicating the number
+of instance variables for the object.  Following the length is a set of
+name-value pairs.  The names are symbols while the values are objects.  The
+symbols must be instance variable names (<code>:@name</code>).
+
+An Object ("o" type, described below) uses the same format for its instance
+variables as described here.
+
+For a String and Regexp (described below) a special instance variable
+<code>:E</code> is used to indicate the Encoding.
+
+=== Extended
+
+"e" indicates that the next object is extended by a module.  An object follows
+the type byte.  Following the object is a symbol that contains the name of the
+module the object is extended by.
+
+=== Array
+
+"[" represents an Array.  Following the type byte is a long indicating the
+number of objects in the array.  The given number of objects follow the
+length.
+
+=== Bignum
+
+"l" represents a Bignum which is composed of three parts:
+
+sign::
+  A single byte containing "+" for a positive value or "-" for a negative
+  value.
+length::
+  A long indicating the number of bytes of Bignum data follows, divided by
+  two.  Multiply the length by two to determine the number of bytes of data
+  that follow.
+data::
+  Bytes of Bignum data representing the number.
+
+The following ruby code will reconstruct the Bignum value from an array of
+bytes:
+
+  result = 0
+
+  bytes.each_with_index do |byte, exp|
+   result += (byte * 2 ** (exp * 8))
+  end
+
+=== Class and Module
+
+"c" represents a Class object, "m" represents a Module and "M" represents
+either a class or module (this is an old-style for compatibility).  No class
+or module content is included, this type is only a reference.  Following the
+type byte is a byte sequence which is used to look up an existing class or
+module, respectively.
+
+Instance variables are not allowed on a class or module.
+
+If no class or module exists an exception should be raised.
+
+For "c" and "m" types, the loaded object must be a class or module,
+respectively.
+
+=== Data
+
+"d" represents a Data object.  (Data objects are wrapped pointers from ruby
+extensions.)  Following the type byte is a symbol indicating the class for the
+Data object and an object that contains the state of the Data object.
+
+To dump a Data object Ruby calls _dump_data.  To load a Data object Ruby calls
+_load_data with the state of the object on a newly allocated instance.
+
+=== Float
+
+"f" represents a Float object.  Following the type byte is a byte sequence
+containing the float value.  The following values are special:
+
+"inf"::
+  Positive infinity
+
+"-inf"::
+  Negative infinity
+
+"nan"::
+  Not a Number
+
+Otherwise the byte sequence contains a C double (loadable by strtod(3)).
+Older minor versions of Marshal also stored extra mantissa bits to ensure
+portability across platforms but 4.8 does not include these.  See
+[ruby-talk:69518] for some explanation.
+
+=== Hash and Hash with Default Value
+
+"{" represents a Hash object while "}" represents a Hash with a default value
+set (<code>Hash.new 0</code>).  Following the type byte is a long indicating
+the number of key-value pairs in the Hash, the size.  Double the given number
+of objects follow the size.
+
+For a Hash with a default value, the default value follows all the pairs.
+
+=== Module and Old Module
+
+=== Object
+
+"o" represents an object that doesn't have any other special form (such as
+a user-defined or built-in format).  Following the type byte is a symbol
+containing the class name of the object.  Following the class name is a long
+indicating the number of instance variable names and values for the object.
+Double the given number of pairs of objects follow the size.
+
+The keys in the pairs must be symbols containing instance variable names.
+
+=== Regular Expression
+
+"/" represents a regular expression.  Following the type byte is a byte
+sequence containing the regular expression source.  Following the type byte is
+a byte containing the regular expression options (case-insensitive, etc.) as a
+signed 8-bit value.
+
+Regular expressions can have an encoding attached through instance variables
+(see above).  If no encoding is attached escapes for the following regexp
+specials not present in ruby 1.8 must be removed: g-m, o-q, u, y, E, F, H-L,
+N-V, X, Y.
+
+=== String
+
+'"' represents a String.  Following the type byte is a byte sequence
+containing the string content.  When dumped from ruby 1.9 an encoding instance
+variable (<code>:E</code> see above) should be included unless the encoding is
+binary.
+
+=== Struct
+
+"S" represents a Struct.  Following the type byte is a symbol containing the
+name of the struct.  Following the name is a long indicating the number of
+members in the struct.  Double the number of objects follow the member count.
+Each member is a pair containing the member's symbol and an object for the
+value of that member.
+
+If the struct name does not match a Struct subclass in the running ruby an
+exception should be raised.
+
+If there is a mismatch between the struct in the currently running ruby and
+the member count in the marshaled struct an exception should be raised.
+
+=== User Class
+
+"C" represents a subclass of a String, Regexp, Array or Hash.  Following the
+type byte is a symbol containing the name of the subclass.  Following the name
+is the wrapped object.
+
+=== User Defined
+
+"u" represents an object with a user-defined serialization format using the
+_dump+ instance method and +_load+ class method.  Following the type byte is
+a symbol containing the class name.  Following the class name is a byte
+sequence containing the user-defined representation of the object.
+
+The class method +_load+ is called on the class with a string created from the
+byte-sequence.
+
+=== User Marshal
+
+"U" represents an object with a user-defined serialization format using the
+marshal_dump+ and +marshal_load+ instance methods.  Following the type byte
+is a symbol containing the class name.  Following the class name is an object
+containing the data.
+
+Upon loading a new instance must be allocated and +marshal_load+ must be
+called on the instance with the data.
+