In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later in the same or another computer environment. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serializing an object-oriented object does not include any of the methods with which it was previously associated.
This process of serializing an object is also called marshalling an object. The opposite operation, extracting a data structure from a series of bytes, is deserialization (which is also called unmarshalling).
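The round trip can be sketched in Python. This is a minimal illustration, not a prescribed technique: the `Point` class, the helper names `serialize` and `deserialize`, and the choice of JSON as the format are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Point:
    """Hypothetical example type, invented for this sketch."""
    x: float
    y: float

    def norm_squared(self) -> float:
        return self.x * self.x + self.y * self.y


def serialize(p: Point) -> bytes:
    # Only the object's state (its fields) is written, not its methods.
    return json.dumps(asdict(p)).encode("utf-8")


def deserialize(data: bytes) -> Point:
    # The receiving side must already know the Point class to rebuild it.
    return Point(**json.loads(data.decode("utf-8")))


original = Point(3.0, 4.0)
wire = serialize(original)      # series of bytes that can be stored or sent
clone = deserialize(wire)       # equal to the original, but a distinct object
```

The clone compares equal to the original but is a separate object, and its methods come from the class definition available at the receiving end rather than from the byte stream.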
- A method of transferring data through the wires (messaging).
- A method of storing data (in databases, on hard disk drives).
- A method of remote procedure calls, e.g., as in SOAP.
- A method for distributing objects, especially in component-based software engineering such as COM, CORBA, etc.
- A method for detecting changes in time-varying data.
For some of these features to be useful, architecture independence must be maintained. For example, for maximal use of distribution, a computer running on a different hardware architecture should be able to reliably reconstruct a serialized data stream, regardless of endianness. This means that the simpler and faster procedure of directly copying the memory layout of the data structure cannot work reliably for all architectures. Serializing the data structure in an architecture-independent format avoids the problems of byte ordering, memory layout, or simply different ways of representing data structures in different programming languages.
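The byte-ordering problem can be illustrated with Python's standard `struct` module. This is a sketch under the assumption of a single 32-bit unsigned field; the variable names are chosen only for the example.

```python
import struct
import sys

value = 0x01020304

# Native byte order: the result depends on the machine that runs the code.
native = struct.pack("=I", value)

# Explicit byte orders: the result is the same on every machine.
big = struct.pack(">I", value)      # big-endian, often called network order
little = struct.pack("<I", value)   # little-endian

# A reader that assumes the wrong order reconstructs a different number.
misread = struct.unpack("<I", big)[0]

# A reader that follows the agreed format recovers the original value.
recovered = struct.unpack(">I", big)[0]
```

Fixing the byte order in the format, rather than copying whatever layout the writing machine uses, is what lets a machine with a different architecture read the data back correctly.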
Inherent to any serialization scheme is that, because the encoding of the data is by definition serial, extracting one part of the serialized data structure requires that the entire object be read from start to end, and reconstructed. In many applications this linearity is an asset, because it enables simple, common I/O interfaces to be utilized to hold and pass on the state of an object. In applications where higher performance is an issue, it can make sense to expend more effort to deal with a more complex, non-linear storage organization.
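The trade-off between linear and non-linear organization can be sketched in Python. The length-prefixed record layout and the separate offset index below are assumptions made for this illustration, not a standard format; the function names are likewise hypothetical.

```python
import io
import json
import struct


def write_stream(records, out):
    """Append length-prefixed JSON records; return each record's byte offset."""
    offsets = []
    for rec in records:
        payload = json.dumps(rec).encode("utf-8")
        offsets.append(out.tell())
        out.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
        out.write(payload)
    return offsets


def read_all(inp):
    """Linear access: walk the stream from the start to reach any record."""
    result = []
    while True:
        header = inp.read(4)
        if not header:
            break
        (size,) = struct.unpack(">I", header)
        result.append(json.loads(inp.read(size).decode("utf-8")))
    return result


def read_at(inp, offset):
    """Non-linear access: jump straight to one record using the extra index."""
    inp.seek(offset)
    (size,) = struct.unpack(">I", inp.read(4))
    return json.loads(inp.read(size).decode("utf-8"))


buf = io.BytesIO()
records = [{"id": i, "name": f"item{i}"} for i in range(3)]
index = write_stream(records, buf)

buf.seek(0)
everything = read_all(buf)       # simple, works with any sequential I/O
third = read_at(buf, index[2])   # faster lookup, but needs a seekable store
```

The linear reader needs nothing more than a sequential byte source, while the indexed reader buys direct access at the cost of maintaining extra structure and requiring seekable storage.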
Even on a single machine, primitive pointer objects are too fragile to save because the objects to which they point may be reloaded to a different location in memory. To deal with this, the serialization process includes a step called unswizzling or pointer unswizzling, where direct pointer references are converted to references based on name or position. The deserialization process includes an inverse step called pointer swizzling.
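A minimal Python sketch of the two steps, assuming a hypothetical `Node` class whose `next` attribute stands in for a raw pointer. References are replaced by positions in a list on the way out and turned back into direct references on the way in.

```python
import json


class Node:
    """Hypothetical linked node; `next` is an in-memory reference."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def unswizzle(nodes):
    """Serialize: replace direct references with positions in `nodes`."""
    position = {id(n): i for i, n in enumerate(nodes)}
    return json.dumps([
        {"value": n.value,
         "next": None if n.next is None else position[id(n.next)]}
        for n in nodes
    ])


def swizzle(text):
    """Deserialize: rebuild the nodes, then restore direct references."""
    table = json.loads(text)
    nodes = [Node(entry["value"]) for entry in table]
    for node, entry in zip(nodes, table):
        if entry["next"] is not None:
            node.next = nodes[entry["next"]]
    return nodes


a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next, c.next = b, c, a          # a cycle: a -> b -> c -> a
restored = swizzle(unswizzle([a, b, c]))
```

Because the stored form uses positions rather than addresses, the restored nodes may live anywhere in memory and the cycle is still rebuilt correctly.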
Since both serializing and deserializing can be driven from common code (for example, the Serialize function in Microsoft Foundation Classes), it is possible for the common code to do both at the same time, and thus, 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy because differences can be detected on the fly. The technique is called differential execution. It is useful in the programming of user interfaces whose contents are time-varying — graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.
Serialization breaks the opacity of an abstract data type by potentially exposing private implementation details. Trivial implementations which serialize all data members may violate encapsulation.
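The difference between a trivial implementation and one that respects encapsulation can be sketched in Python. The `Account` class and its `to_state` and `from_state` methods are hypothetical names invented for this example.

```python
import json


class Account:
    """Hypothetical class with one public field and private details."""

    def __init__(self, owner, password):
        self.owner = owner
        self._password = password     # private implementation detail
        self._login_cache = {}        # private implementation detail

    def to_state(self):
        # Deliberately chosen external representation: public data only.
        return {"owner": self.owner}

    @classmethod
    def from_state(cls, state):
        # Private members are re-created, not read from the stream.
        return cls(state["owner"], password=None)


account = Account("alice", "hunter2")

# Trivial approach: dump every data member, private ones included.
naive = json.dumps(vars(account))

# Controlled approach: the class decides what leaves the object.
controlled = json.dumps(account.to_state())
restored = Account.from_state(json.loads(controlled))
```

The naive output contains the private password field, whereas the controlled output exposes only what the class chooses to publish.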
To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs' serialization formats a trade secret. Some deliberately obfuscate or even encrypt the serialized data. Yet, interoperability requires that applications be able to understand each other's serialization formats. Therefore, remote method call architectures such as CORBA define their serialization formats in detail.
In the late 1990s, a push to provide an alternative to the standard serialization protocols started: XML was used to produce a human readable text-based encoding. Such an encoding can be useful for persistent objects that may be read and understood by humans, or communicated to other systems regardless of programming language. It has the disadvantage of losing the more compact, byte-stream-based encoding, but by this point larger storage and transmission capacities made file size less of a concern than in the early days of computing. Binary XML had been proposed as a compromise which was not readable by plain-text editors, but was more compact than regular XML. In the 2000s, XML was often used for asynchronous transfer of structured data between client and server in Ajax web applications.
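The size trade-off between a text-based and a byte-stream encoding can be sketched with Python's standard library. The record, the element names, and the fixed binary layout (a 2-byte integer followed by an 8-byte float) are assumptions made for this illustration.

```python
import struct
import xml.etree.ElementTree as ET

reading = {"station": 17, "temperature": 21.5}

# Text encoding: self-describing and readable in any text editor.
root = ET.Element("reading")
ET.SubElement(root, "station").text = str(reading["station"])
ET.SubElement(root, "temperature").text = repr(reading["temperature"])
as_xml = ET.tostring(root)

# Binary encoding: compact, but meaningless without knowing the layout.
as_binary = struct.pack(">Hd", reading["station"], reading["temperature"])

# Both forms can be decoded back into the same values.
parsed = ET.fromstring(as_xml)
from_xml = {
    "station": int(parsed.findtext("station")),
    "temperature": float(parsed.findtext("temperature")),
}
station, temperature = struct.unpack(">Hd", as_binary)
```

The binary form occupies 10 bytes, while the XML form is several times larger but carries its own field names.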
Another alternative, YAML, is similar to JSON and includes features that make it more powerful for serialization, more "human friendly," and potentially more compact. These features include a notion of tagging data types, support for non-hierarchical data structures, the option to structure data with indentation, and multiple forms of scalar data quoting.
For large volume scientific datasets, such as satellite data and output of numerical climate, weather, or ocean models, specific binary serialization standards have been developed, e.g. HDF, netCDF and the older GRIB.
Programming language support
Several object-oriented programming languages directly support object serialization (or object archival), either through syntactic sugar or by providing a standard interface for doing so. These languages include Ruby, Smalltalk, Python, PHP, Objective-C, Delphi, Java, and the .NET family of languages. There are also libraries available that add serialization support to languages that lack native support for it.
- CFML allows data structures to be serialized to WDDX with the <cfwddx> tag and to JSON with the SerializeJSON() function.
- OCaml's standard library provides marshalling through the Marshal module and the Pervasives functions output_value and input_value. While OCaml programming is statically type-checked, uses of the Marshal module may break type guarantees, as there is no way to check whether an unmarshalled stream represents objects of the expected type. In OCaml it is difficult to marshal a function or a data structure which contains a function (e.g. an object which contains a method), because executable code in functions cannot be transmitted across different programs. (There is a flag to marshal the code position of a function, but it can only be unmarshalled in exactly the same program.) The standard marshalling functions can preserve sharing and handle cyclic data, which can be configured by a flag.
- Several Perl modules available from CPAN provide serialization mechanisms, including Storable and FreezeThaw. Storable includes functions to serialize and deserialize Perl data structures to and from files or Perl scalars. In addition to serializing directly to files, Storable includes the freeze function to return a serialized copy of the data packed into a scalar, and thaw to deserialize such a scalar. This is useful for sending a complex data structure over a network socket or storing it in a database. When serializing structures with Storable, there are network-safe functions that always store their data in a format that is readable on any computer, at a small cost in speed. These functions are named nstore, nfreeze, etc. There are no "n" functions for deserializing these structures; the regular thaw and retrieve deserialize structures serialized with the "n" functions and their machine-specific equivalents.
- C and C++
- C and C++ do not provide serialization as any sort of high-level construct, but both languages support writing any of the built-in data types, as well as plain old data structs, as binary data. As such, it is usually trivial to write custom serialization functions. Moreover, compiler-based solutions, such as the ODB ORM system for C++, are capable of automatically producing serialization code with few or no modifications to class declarations. Other popular serialization frameworks are Boost.Serialization from the Boost Framework, the S11n framework, and Cereal. Microsoft's MFC framework also provides a serialization mechanism as part of its Document-View architecture.
- Delphi provides a built-in mechanism for serialization of components (also called persistent objects), which is fully integrated with its IDE. The component's contents are saved to a DFM file and reloaded on-the-fly.
- Java provides automatic serialization which requires that the object be marked by implementing the java.io.Serializable interface. Implementing the interface marks the class as "okay to serialize", and Java then handles serialization internally. There are no serialization methods defined on the Serializable interface, but a serializable class can optionally define methods with certain special names and signatures that, if defined, will be called as part of the serialization/deserialization process. The language also allows the developer to override the serialization process more thoroughly by implementing another interface, the Externalizable interface, which includes two special methods that are used to save and restore the object's state. There are three primary reasons why objects are not serializable by default and must implement the Serializable interface to access Java's serialization mechanism. Firstly, not all objects capture useful semantics in a serialized state. For example, a Thread object is tied to the state of the current JVM. There is no context in which a deserialized Thread object would maintain useful semantics. Secondly, the serialized state of an object forms part of its class's compatibility contract. Maintaining compatibility between versions of serializable classes requires additional effort and consideration. Therefore, making a class serializable needs to be a deliberate design decision and not a default condition. Lastly, serialization allows access to non-transient private members of a class that are not otherwise accessible. Classes containing sensitive information (for example, a password) should be neither serializable nor externalizable. The standard encoding method uses a simple translation of the fields into a byte stream. Primitives as well as non-transient, non-static referenced objects are encoded into the stream. Each object that is referenced by the serialized object via a field that is not marked as transient must also be serialized; and if any object in the complete graph of non-transient object references is not serializable, then serialization will fail. The developer can influence this behavior by marking fields as transient, or by redefining the serialization for an object so that some portion of the reference graph is truncated and not serialized. Java does not use a constructor to serialize objects. It is possible to serialize Java objects through JDBC and store them into a database. While Swing components do implement the Serializable interface, they are not portable between different versions of the Java Virtual Machine. As such, a Swing component, or any component which inherits it, may be serialized to an array of bytes, but it is not guaranteed that this storage will be readable on another machine.
- .NET Framework
- .NET Framework has several serializers designed by Microsoft. There are also many serializers by third parties. More than a dozen serializers have been discussed and tested in published comparisons, and the list is constantly growing.
- In Python, the core general serialization mechanism is the pickle standard library module, alluding to the database systems term pickling to describe data serialization (unpickling for deserializing). Pickle uses a simple stack-based virtual machine that records the instructions used to reconstruct the object. It is a cross-version, customisable but unsafe (not secure against erroneous or malicious data) serialization format. Malformed or maliciously constructed data may cause the deserializer to import arbitrary modules and instantiate any object. The standard library also includes modules serializing to standard data formats: json (with built-in support for basic scalar and collection types and able to support arbitrary types via encoding and decoding hooks) and plistlib (XML-encoded property lists), limited to plist-supported types (numbers, strings, booleans, tuples, lists, dictionaries, datetime and binary blobs). Finally, it is recommended that an object's __repr__ be evaluable in the right environment, making it a rough match for Common Lisp's print-object. Not all object types can be pickled automatically, especially ones that hold operating system resources like file handles, but users can register custom "reduction" and construction functions to support the pickling and unpickling of arbitrary types. Pickle was originally implemented as the pure Python pickle module, but, in versions of Python prior to 3.0, the cPickle module (also a built-in) offers improved performance (up to 1000 times faster). The cPickle module later received optimizations from the Unladen Swallow project. In Python 3, users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version.
- PHP originally implemented serialization through the built-in serialize() and unserialize() functions. PHP can serialize any of its data types except resources (file pointers, sockets, etc.). The built-in unserialize() function is often dangerous when used on completely untrusted data. For objects, there are two "magic methods" that can be implemented within a class, __sleep() and __wakeup(), which are called from within serialize() and unserialize(), respectively, and which can clean up and restore an object. For example, it may be desirable to close a database connection on serialization and restore the connection on deserialization; this functionality would be handled in these two magic methods. They also permit the object to pick which properties are serialized. Since PHP 5.1, there is an object-oriented serialization mechanism for objects, the Serializable interface.
- R has the function dput, which writes an ASCII text representation of an R object to a file or connection. A representation can be read from a file using dget. More specifically, the function serialize serializes an R object to a connection, the output being a raw vector coded in hexadecimal format. The unserialize function reads an object from a connection or a raw vector.
- REBOL will serialize to file (save/all) or to a string (mold/all). Strings and files can be deserialized using the polymorphic load function.
- RProtoBuf provides cross-language data serialization in R, using protocol buffers.
- Ruby includes the standard module Marshal with 2 methods, dump and load, akin to the standard Unix utilities dump and restore. These methods serialize to the standard class String, that is, they effectively become a sequence of bytes. Some objects cannot be serialized (doing so would raise a TypeError exception): bindings, procedure objects, instances of class IO, singleton objects and interfaces. If a class requires custom serialization (for example, it requires certain cleanup actions done on dumping / restoring), it can be done by implementing 2 methods: _dump and _load. The instance method _dump should return a String object containing all the information necessary to reconstitute objects of this class and all referenced objects up to a maximum depth given as an integer parameter (a value of -1 implies that depth checking should be disabled). The class method _load should take a String and return an object of this class.
- In Smalltalk, non-recursive and non-sharing objects can in general be stored and retrieved in a human-readable form using the storeOn:/readFrom: protocol. The storeOn: method generates the text of a Smalltalk expression which, when evaluated using readFrom:, recreates the original object. This scheme is special in that it uses a procedural description of the object, not the data itself. It is therefore very flexible, allowing classes to define more compact representations. However, in its original form, it does not handle cyclic data structures or preserve the identity of shared references (i.e. two references to a single object will be restored as references to two equal, but not identical, copies). For this, various portable and non-portable alternatives exist. Some of them are specific to a particular Smalltalk implementation or class library. There are several ways in Squeak Smalltalk to serialize and store objects. The easiest and most used are storeOn:/readFrom: and binary storage formats based on SmartRefStream serializers. In addition, bundled objects can be stored and retrieved using ImageSegments. Both provide a so-called "binary-object storage framework", which supports serialization into and retrieval from a compact binary form. Both handle cyclic, recursive and shared structures and the storage/retrieval of class and metaclass info, and both include mechanisms for "on the fly" object migration (i.e. to convert instances which were written by an older version of a class with a different object layout). The APIs are similar (storeBinary/readBinary), but the encoding details are different, making these two formats incompatible. However, the Smalltalk/X code is open source and free and can be loaded into other Smalltalks to allow for cross-dialect object interchange. Object serialization is not part of the ANSI Smalltalk specification. As a result, the code to serialize an object varies by Smalltalk implementation. The resulting binary data also varies. For instance, a serialized object created in Squeak Smalltalk cannot be restored in Ambrai Smalltalk. Consequently, applications that run on multiple Smalltalk implementations and rely on object serialization cannot share data between these different implementations. These applications include the MinneStore object database and some RPC packages. A solution to this problem is SIXX, which is a package for multiple Smalltalks that uses an XML-based format for serialization.
- Generally a Lisp data structure can be serialized with the functions "read" and "print". An object held in a variable foo can be printed with (print foo). Similarly an object can be read from a stream named s by (read s). These two parts of the Lisp implementation are called the Printer and the Reader. The output of "print" is human readable; it uses lists delimited by parentheses, for example (4 2.9 "x" y). In many types of Lisp, including Common Lisp, the printer cannot represent every type of data because it is not clear how to do so. In Common Lisp, for example, the printer cannot print CLOS objects. Instead the programmer may write a method on the generic function print-object, which will be invoked when the object is printed. This is somewhat similar to the method used in Ruby. Lisp code itself is written in the syntax of the reader, called read syntax. Most languages use separate and different parsers to deal with code and data; Lisp only uses one. A file containing Lisp code may be read into memory as a data structure, transformed by another program, then possibly executed or written out, such as in a read–eval–print loop. Not all readers/writers support cyclic, recursive or shared structures.
- In Haskell, serialization is supported for types that are members of the Read and Show type classes. Every type that is a member of the Read type class defines a function that will extract the data from the string representation of the dumped data. The Show type class, in turn, contains the show function from which a string representation of the object can be generated. The programmer need not define the functions explicitly: merely declaring a type to be deriving Read or deriving Show, or both, can make the compiler generate the appropriate functions for many cases (but not all: function types, for example, cannot automatically derive Show or Read). The auto-generated instance for Show also produces valid source code, so the same Haskell value can be generated by running the code produced by show in, for example, a Haskell interpreter. For more efficient serialization, there are Haskell libraries that allow high-speed serialization in binary format, e.g. binary.
- Windows PowerShell
- Windows PowerShell implements serialization through the built-in cmdlet Export-CliXML. Export-CliXML serializes .NET objects and stores the resulting XML in a file. To reconstitute the objects, use the Import-CliXML cmdlet, which generates a deserialized object from the XML in the exported file. Deserialized objects, often known as "property bags", are not live objects; they are snapshots that have properties, but no methods. Two-dimensional data structures can also be (de)serialized in CSV format using the built-in cmdlets Import-CSV and Export-CSV.
- Julia implements serialization through the serialize() and deserialize() functions, intended to work within the same version of Julia, and/or instance of the same system image. The HDF5.jl package offers a more stable alternative, using a documented format and common library with wrappers for different languages, while the default serialization format is suggested to have been designed with maximal performance for network communication in mind.
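Two points from the Python entry above, the stack-machine nature of pickle and the encoding and decoding hooks of json, can be sketched as follows. The `"__date__"` tag and the helper names `encode_extra` and `decode_extra` are conventions invented for this example, not part of either module.

```python
import json
import pickle
import pickletools
from datetime import date

# pickle: the byte stream is a small program for a stack-based machine.
stream = pickle.dumps([1, "two"], protocol=2)
opcodes = [op.name for op, arg, pos in pickletools.genops(stream)]

# Loading executes that program; do this only with data you trust.
rebuilt = pickle.loads(stream)


# json: hooks extend the built-in types to an otherwise unsupported one.
def encode_extra(obj):
    """Called by json.dumps for objects it cannot encode itself."""
    if isinstance(obj, date):
        return {"__date__": obj.isoformat()}   # hypothetical tagging scheme
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_extra(mapping):
    """Called by json.loads for every decoded JSON object."""
    if "__date__" in mapping:
        return date.fromisoformat(mapping["__date__"])
    return mapping


text = json.dumps({"released": date(2020, 1, 2)}, default=encode_extra)
back = json.loads(text, object_hook=decode_extra)
```

Inspecting `opcodes` shows the instruction sequence that the unpickler will run, which is why unpickling untrusted input is unsafe, whereas the json hooks only ever build the types the application explicitly handles.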
See also
- Commutation (telemetry)
- Comparison of data serialization formats
- Hibernate (Java)
- XML Schema
- Basic Encoding Rules
- Google Protocol Buffers
- Apache Avro