Serialization: Difference between revisions

Revision as of 20:34, 25 September 2006

In computer science, serialization has several distinct meanings.

In the context of concurrency control, serialization means forcing one-at-a-time access. For example, a single-threaded ActiveX server can process only one request at a time; requests are therefore queued and executed in the order in which they are made.
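As a minimal sketch of this one-at-a-time behaviour (not tied to ActiveX; the function name `run_serialized` and the sentinel convention are assumptions made for illustration), a single worker thread can drain a FIFO queue so that requests are handled strictly in arrival order:

```python
import queue
import threading


def run_serialized(requests):
    """Handle requests one at a time, in arrival order, on a single worker."""
    pending = queue.Queue()   # FIFO queue of waiting requests
    handled = []              # touched only by the worker thread

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more requests will arrive
                break
            handled.append(item)

    thread = threading.Thread(target=worker)
    thread.start()
    for request in requests:
        pending.put(request)
    pending.put(None)
    thread.join()
    return handled


order = run_serialized(["a", "b", "c"])
print(order)
```

Because only one thread ever processes a request, no two requests can overlap, at the cost of each caller waiting for the ones queued ahead of it.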

In the context of data storage and transmission, serialization is the process of saving an object onto a storage medium (such as a file or a memory buffer) or transmitting it across a network connection (such as a socket), either as a series of bytes or in some human-readable format such as XML. The series of bytes or the format can be used to re-create an object that is identical in its internal state to the original object (in effect, a clone). This type of serialization is used mostly to transport an object across a network, to persist objects to a file or database, or to distribute identical objects to several applications or locations.

This process of serializing an object is also called deflating an object or marshalling an object.
The opposite operation, extracting a data structure from a series of bytes, is deserialization (which is also called inflating or unmarshalling).
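The round trip described above can be sketched with Python's standard pickle module; the dictionary contents are arbitrary sample data chosen for illustration:

```python
import pickle

original = {"name": "example", "values": [1, 2, 3]}

data = pickle.dumps(original)   # serialize (marshal) into a series of bytes
restored = pickle.loads(data)   # deserialize (unmarshal) back into an object

print(type(data).__name__)
print(restored == original)     # same internal state
print(restored is original)     # but a distinct object, i.e. a clone
```

Note that pickle.loads should only be applied to data from a trusted source, since unpickling can execute arbitrary code.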

Uses

Serialization has a number of advantages. It provides:

a simple and robust way to make objects persistent
a method of issuing remote procedure calls, e.g., as in SOAP
a method for distributing objects, especially in software componentry such as COM, CORBA, etc.

For some of these features to be useful, architecture independence must be maintained. For example, for maximal use of distribution, a computer running on a different hardware architecture should be able to reliably reconstruct a serialized data stream, regardless of endianness. This means that the simpler and faster procedure of directly copying the memory layout of the data structure cannot work reliably for all architectures. Serializing the data structure in an architecture-independent format avoids the problems of byte ordering, memory layout, and differing representations of data structures in different programming languages.
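A small sketch of the byte-ordering problem, using Python's standard struct module. The value is the timestamp from the Objective-C example later in this article; treating it as an unsigned 32-bit integer is an assumption made for illustration:

```python
import struct

value = 1089356705

# Fix the byte order explicitly instead of copying the native memory layout.
big = struct.pack(">I", value)      # big-endian ("network order")
little = struct.pack("<I", value)   # little-endian

# A reader that assumes the wrong byte order reconstructs a different number.
right = struct.unpack(">I", big)[0]
wrong = struct.unpack("<I", big)[0]

print(big.hex(), little.hex())
print(right == value, wrong == value)
```

Agreeing on one byte order in the format itself is what lets machines of either endianness read the same stream.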

In some forms, however, serialization has the disadvantage that since the encoding of the data is serial, merely extracting one part of the data structure that is serialized means that the entire object must be reconstructed or read before this can be done. The serialization capabilities in the Cocoa framework, NSKeyedArchiver, alleviate the problem somewhat by allowing an object to be archived with each instance variable of the object accessible by using a key.

Even on a single machine, primitive pointer objects are too fragile to save, because the objects to which they point may be reloaded to a different location in memory. To deal with this, the serialization process includes a step called unswizzling or pointer unswizzling and the deserialization process includes a step called pointer swizzling.
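The idea can be sketched as follows, assuming a toy graph of dictionary nodes with a single "next" reference; the function names `unswizzle` and `swizzle` and the node layout are hypothetical. Unswizzling replaces each in-memory reference with a position-independent index, and swizzling turns the indices back into references to freshly created objects:

```python
def unswizzle(nodes):
    """Replace in-memory references with indices into the node list."""
    index_of = {id(node): i for i, node in enumerate(nodes)}
    return [
        {
            "value": node["value"],
            "next": None if node["next"] is None else index_of[id(node["next"])],
        }
        for node in nodes
    ]


def swizzle(records):
    """Rebuild live references from the stored indices."""
    nodes = [{"value": record["value"], "next": None} for record in records]
    for node, record in zip(nodes, records):
        if record["next"] is not None:
            node["next"] = nodes[record["next"]]
    return nodes


# Two nodes that refer to each other (a cycle).
a = {"value": "a", "next": None}
b = {"value": "b", "next": a}
a["next"] = b

flat = unswizzle([a, b])   # safe to store: contains no memory addresses
rebuilt = swizzle(flat)    # new objects, same shape
print(flat)
```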

Consequences

Serialization, however, breaks the opacity of an abstract data type by potentially exposing private implementation details. To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs' serialization formats a trade secret. Some deliberately obfuscate or even encrypt the serialized data.

Yet, interoperability requires that applications be able to understand the serialization of each other. Therefore remote method call architectures such as CORBA define their serialization formats in detail and often provide methods of checking the consistency of any serialized stream when converting it back into an object.

Human-readable serialization

In the late 1990s, a push began to provide an alternative to the standard binary serialization protocols: one that uses XML and produces a human-readable encoding. Such an encoding is useful for persistent objects that may be read and understood by humans, or communicated to other systems regardless of programming language. Its disadvantage is that it gives up the more compact byte-stream encoding, which is generally more practical. A future solution to this dilemma could be transparent compression schemes (see binary XML).
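The trade-off can be illustrated with a small sketch that encodes the same two-field record both ways. The record, the element names, and the choice of two 32-bit integers for the binary form are assumptions made for illustration, not a standard format:

```python
import struct
import xml.etree.ElementTree as ET

point = {"x": 3, "y": 4}

# Human-readable encoding: one XML element per field.
root = ET.Element("point")
for name, value in point.items():
    ET.SubElement(root, name).text = str(value)
xml_bytes = ET.tostring(root)

# Compact encoding: two big-endian 32-bit integers, with no field names.
binary = struct.pack(">ii", point["x"], point["y"])

# The XML form can be read back without knowing the field layout in advance.
parsed = {child.tag: int(child.text) for child in ET.fromstring(xml_bytes)}

print(xml_bytes.decode("ascii"))
print(len(xml_bytes), len(binary))
```

The XML form is self-describing and readable, while the binary form is much smaller but meaningless without out-of-band knowledge of its layout.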

Programming language support

Several object-oriented programming languages directly support object serialization (or object archival), either by syntactic sugar elements or providing a standard interface for doing so.

Some of these programming languages are Ruby, Smalltalk, Python, Objective-C, Java, and the .NET family of languages.

There are also libraries available that add serialization support to languages that lack native support for it.

Objective-C

In the Objective-C programming language, serialization (most commonly known as archival) is achieved by overriding the write: and read: methods in the Object root class. (NB This is in the GNU runtime variant of Objective-C. In the NeXT-style runtime, the implementation is very similar.)

Example

The following example demonstrates two independent programs: a "sender", which takes the current time (as per time in the C standard library), archives it, and prints the archived form to the standard output; and a "receiver", which decodes the archived form, reconstructs the time, and prints it out.

When compiled, we get a sender program and a receiver program. If we just execute the sender program, we will get out a serialization that looks like:

GNU TypedStream 1D@îC¡

(with a NUL character after the 1). If we pipe the two programs together, as sender | receiver, we get

received 1089356705

showing the object was serialized, sent, and reconstructed properly.

In essence, the sender and receiver programs could be distributed across a network connection, providing distributed object capabilities.

Sender.h

#import <objc/Object.h>
#import <time.h>
#import <stdio.h>

@interface Sender : Object
{
   time_t  current_time;
}

- (id) setTime;
- (time_t) time;
- (id) send;
- (id) read: (TypedStream *) s;
- (id) write: (TypedStream *) s;

@end

Sender.m

#import "Sender.h"

@implementation Sender
- (id) setTime
{
   //Set the time
   current_time = time(NULL);
   return self;
}

- (time_t) time
{
   return current_time;
}

- (id) write: (TypedStream *) stream
{
   /*
    *Write the superclass to the stream.
    *We do this so we have the complete object hierarchy,
    *not just the object itself.
    */
   [super write:stream];

   /*
    *Write the current_time out to the stream.
    *This assumes time_t is a typedef for a type the size of int;
    *on platforms where time_t is wider, the type string must change.
    *The second argument, the string "i", specifies the types to write
    *as per the @encode directive.
    */
   objc_write_types(stream, "i", &current_time);
   return self;
}

- (id) read: (TypedStream *) stream
{
   /*
    *Do the reverse to write: - reconstruct the superclass...
    */
   [super read:stream];

   /*
    *And reconstruct the instance variables from the stream...
    */
   objc_read_types(stream, "i", &current_time);
   return self;
}

- (id) send
{
   //Convenience method to do the writing. We open stdout as our byte stream
   TypedStream *s = objc_open_typed_stream(stdout, OBJC_WRITEONLY);

   //Write the object to the stream
   [self write:s];

   //Finish up - close the stream.
   objc_close_typed_stream(s);
   return self;
}
@end

Sender.c

#import "Sender.h"

int
main(void)
{
   Sender *s = [Sender new];
   [s setTime];
   [s send];

   return 0;
}

Receiver.h

#import <objc/Object.h>
#import "Sender.h"

@interface Receiver : Object
{
   Sender *t;
}

- (id) receive;
- (id) print;
@end

Receiver.m

#import "Receiver.h"

@implementation Receiver

- (id) receive
{
   //Open stdin as our stream for reading.
   TypedStream *s = objc_open_typed_stream(stdin, OBJC_READONLY);

   //Allocate memory for, and instantiate the object from reading the stream.
   t = [[Sender alloc] read:s];
   objc_close_typed_stream(s);
   return self;
}

- (id) print
{
   //Cast to long so the format specifier matches whatever time_t is.
   fprintf(stderr, "received %ld\n", (long) [t time]);
   return self;
}

@end

Receiver.c

#import "Receiver.h"

int
main(void)
{
   Receiver *r = [Receiver new];
   [r receive];
   [r print];

   return 0;
}

Java

Java provides automatic serialization which requires only that the object be marked by implementing the java.io.Serializable interface. Implementing the interface marks the class as "okay to serialize," and Java then handles serialization internally. There are no serialization methods defined on the Serializable interface, but a serializable class can optionally define methods with certain special names and signatures that if defined, will be called as part of the serialization/deserialization process. The language also allows the developer to override the serialization process more thoroughly by implementing another interface, the Externalizable interface, which includes two special methods that are used to save and restore the object's state.

There are three primary reasons why objects are not serializable by default and must implement the Serializable interface to access Java's serialization mechanism.

Not all objects capture useful semantics in a serialized state. For example, a Thread object is tied to the state of the current JVM. There is no context in which a deserialized Thread object would maintain useful semantics.
The serialized state of an object forms part of its class's compatibility contract. Maintaining compatibility between versions of serializable classes requires additional effort and consideration. Therefore, making a class serializable needs to be a deliberate design decision and not a default condition.
Serialization allows access to non-transient private members of a class that are not otherwise accessible. Classes containing sensitive information (for example, a password) should not be serializable or externalizable.

The standard encoding method uses a simple translation of the fields into a byte stream. Primitives as well as non-transient, non-static referenced objects are encoded into the stream. Each object that is referenced by the serialized object and not marked as transient must also be serialized; if any object in the complete graph of non-transient object references is not serializable, then serialization will fail. The developer can influence this behavior by marking objects as transient, or by redefining the serialization for an object so that some portion of the reference graph is truncated and not serialized.

ColdFusion

ColdFusion allows data structures to be serialized to WDDX with the <cfwddx> tag.

OCaml

OCaml's standard library provides marshalling through the Marshal module. While OCaml programming is statically type-checked, uses of the Marshal module may break type guarantees, as there is no way to check whether an unmarshalled stream represents objects of the expected type.

Perl

Several Perl modules available from CPAN provide serialization mechanisms, including Storable and FreezeThaw.

Storable includes functions to serialize and deserialize Perl data structures to and from files or Perl scalars.

use Storable;

# Create a hash with some nested data structures
my %struct = ( text => 'Hello, world!', list => [1, 2, 3] );

# Serialize the hash into a file
store \%struct, 'serialized';

# Read the data back later
my $newstruct = retrieve 'serialized';

In addition to serializing directly to files, Storable includes the freeze function to return a serialized copy of the data packed into a scalar, and thaw to deserialize such a scalar. This is useful for sending a complex data structure over a network socket or storing it in a database.

C++

The Boost library includes a library for serializing C++ data structures. XML Data Binding implementations, such as XML Schema to C++ data binding compiler, provide serialization/deserialization of C++ objects to/from XML and binary formats.

Python

Python implements serialization through the standard pickle module and, to a lesser extent, the older marshal module. Unlike pickle, marshal can serialize Python code objects.
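The difference regarding code objects can be sketched as follows; the compiled source string and the "<example>" file name are arbitrary choices for illustration:

```python
import marshal
import pickle

# Compile a tiny program into a code object.
code = compile("result = 6 * 7", "<example>", "exec")

# marshal can serialize the code object and bring it back.
restored = marshal.loads(marshal.dumps(code))
namespace = {}
exec(restored, namespace)
print(namespace["result"])

# pickle refuses to serialize a code object.
try:
    pickle.dumps(code)
    pickle_ok = True
except (TypeError, pickle.PicklingError):
    pickle_ok = False
print(pickle_ok)
```

The marshal format is specific to a Python version, so it is mainly suited to internal uses such as cached bytecode rather than long-term storage.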

PHP

PHP implements serialization through the built-in 'serialize' and 'unserialize' functions. PHP can serialize any of its datatypes except resources (file pointers, sockets, etc.).

For objects (as of at least PHP 4) there are two "magic methods" that can be implemented within a class, __sleep() and __wakeup(), which are called from within serialize() and unserialize(), respectively, and which can clean up and restore an object. For example, it may be desirable to close a database connection on serialization and restore the connection on unserialization; this functionality would be handled in these two magic methods. They also permit the object to pick which properties are serialized.

REBOL

REBOL will serialize to file (save/all) or to a string! (mold/all). Strings and files can be deserialized using the polymorphic load function.

Ruby

Ruby includes the standard module Marshal with two methods, dump and restore, akin to the standard Unix utilities dump and restore. dump serializes an object to an instance of the standard class String, which is effectively a sequence of bytes; restore (an alias for load) reconstructs the object from such a string.

Some objects cannot be serialized (attempting to do so raises a TypeError exception):

bindings,
procedure objects,
instances of class IO,
singleton objects.

If a class requires custom serialization (for example, it requires certain cleanup actions on dumping or restoring), this can be done by implementing two methods: _dump and _load. The instance method _dump should return a String object containing all the information necessary to reconstitute objects of this class and all referenced objects up to a maximum depth given as an integer parameter (a value of -1 implies that depth checking should be disabled). The class method _load should take a String and return an object of this class.

Smalltalk

Squeak Smalltalk

There are several ways in Squeak Smalltalk to serialize and store objects. The easiest and most used method will be shown below. Other classes of interest in Squeak for serializing objects are SmartRefStream and ImageSegment.

To store a Dictionary (sometimes called a hash map in other languages) containing some nonsense data of varying types into a file named "data.obj":

| data rr |
data := Dictionary new.
data at: #Meef put: 25;
	at: 23 put: 'Amanda';
	at: 'Small Numbers' put: #(0 1 2 3 four).
rr := ReferenceStream fileNamed: 'data.obj'.
rr nextPut: data; close.

To restore the Dictionary object stored in "data.obj" and bring up an inspector on it:

| restoredData rr |
rr := ReferenceStream fileNamed: 'data.obj'.
restoredData := rr next.
restoredData inspect.
rr close.

Other Smalltalk dialects

Object serialization is not part of the ANSI Smalltalk specification. As a result, the code to serialize an object varies by Smalltalk implementation, and so does the resulting binary data. For instance, a serialized object created in Squeak Smalltalk cannot be restored in Ambrai Smalltalk. Consequently, applications that run on multiple Smalltalk implementations and rely on object serialization cannot share data between those implementations. These applications include the MinneStore object database [1] and some RPC packages. A solution to this problem is SIXX [2], a package for multiple Smalltalks that uses an XML-based format for serialization.

External links

For C#:

Deep Serialization: Binary and SOAP Serialization with a Generic Twist

For Java:

Java Object Serialization documentation

Java Object Serialization Specification
