Bobby Woolf



Designing For Object Serialization

Often objects need to be serializable. This can be as easy as simply declaring that a class implements the Serializable interface. But just because an object says it's serializable doesn't necessarily mean that it will serialize (and deserialize) successfully. This article will explore what serialization is, how to prove that a purportedly serializable object can really be serialized, and how to redesign a nonserializable class to make it serializable.

The first question is: What is serialization and why do objects need to be serializable? The Javadocs aren't much help; they say what java.io.Serializable does, but not what serialization is. To learn what it is, you must read the "Java Object Serialization Specification" in Sun's Object Serialization documentation (http://java.sun.com/products/jdk/1.2/docs/guide/serialization/). According to the spec, "The key to storing and retrieving objects in a serialized form is representing the state of objects sufficient to reconstruct the object(s)." So serializing an object converts it to a form that can be stored easily, and the stored form can easily be converted back into a Java object again. The new object is essentially a copy of the original. Furthermore, the object being serialized typically refers to other objects, a graph of objects that must be serialized and deserialized so that the entire structure is copied.

Why do objects need to be serialized? The simple answer is persistence and distribution. One of the goals of serialization is to "support simple persistence of Java objects," so object persistence schemes that simply want to store objects as blobs tend to use serialization and store the serialized form. (GemStone/J is an exception; it persists objects without serializing them, so it can persist any Java object, even those that don't implement Serializable.)

In the EJB technology, the javax.ejb.EntityBean and javax.ejb.SessionBean interfaces extend Serializable so that EJBs can easily be passivated and activated. To persist an entity bean field that is a complex object, the value is serialized.

Another common use of serialization is as part of object distribution. A distributed object is one that is instantiated in one Java Virtual Machine (JVM) but accessible from other virtual machines. If the object is not copied but is accessed remotely via Remote Method Invocation (RMI), the arguments and result of each remote invocation must be marshaled, and marshaling relies on serialization (java.rmi.MarshalledObject, which itself implements Serializable, holds a value in this serialized form).

When using the Java Message Service (JMS), if you know that the consumer of your message will be another Java application, you can send it an object using a javax.jms.ObjectMessage that contains the serialized object. Even a Throwable is serializable, so that it can be thrown from one VM to another like a marshaled object. Thus if you're going to implement objects that can be distributed or persisted, you're going to need to make them serializable.

Moving Around in the Byte Stream
Serialization is not unique to Java. BOSS, the binary object streaming system, was a prominent feature of VisualWorks Smalltalk since the early '90s. Whenever you have objects, you'll find a need to move them around using simple byte streams.

As common as serialization is, there are alternative ways to persist and distribute objects. Java serialization is practical only when Java code is being used for both the serialization and deserialization.

CORBA objects can be transferred from one object space to another, even when the two spaces use different languages (for example, to enable a Java application to exchange objects with a C++ application). XML converts an object to a form that can be read by any application with an XML parser. Object-to-relational frameworks like TopLink (www.webgain.com) and CocoBase (www.thoughtinc.com) convert objects into rows that can be stored in relational database (RDBMS) tables. Yet when you know that Java will be used both to convert the object and to unconvert it again, serialization is definitely the easiest way to do so.

You'll rarely write your own serialization code. Rather, the frameworks you're using - such as RMI, EJB, and JMS - will use serialization and will expect your objects to be serializable. But to understand how to make your objects serializable, it helps to understand a little about how serialization works.

Implementing Serialization
Serialization is implemented in the java.io package by two main types: ObjectOutput (implemented by ObjectOutputStream) and ObjectInput (implemented by ObjectInputStream). Ultimately, serialization and deserialization are performed by two methods:

  • void::ObjectOutput.writeObject(Object obj): Serializes the object into bytes
  • Object::ObjectInput.readObject(): Deserializes the bytes back into an object, a new instance of the same class as the original object (whose type is upcast to Object)

    This code will serialize some object, obj, into a byte array:

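    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutput;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        // Closing the object stream flushes any buffered bytes into the byte array.
        try (ObjectOutput serialStream = new ObjectOutputStream(stream)) {
            serialStream.writeObject(obj);
        }
        return stream.toByteArray();
    }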

    This code will deserialize the byte array back into an object:

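    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectInputStream;

    public Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        // try-with-resources closes the object stream once the object has been read.
        try (ObjectInput serialStream = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return serialStream.readObject();
        }
    }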

    Thus when you're using a framework that does serialization, these two pieces of code are essentially what the framework is doing.

    A class can customize its serialization by implementing writeObject(ObjectOutputStream) and readObject(ObjectInputStream). An object can even use Object::writeReplace() and Object::readResolve() to specify an instance of an entirely different class to be serialized instead of itself. However, most classes don't need to implement these methods and just let the stream classes do the heavy lifting.
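
    As a minimal sketch of such customization, the hypothetical Temperature class below (not part of the DataMapper) declares the two private methods so that a derived value is rebuilt on deserialization rather than stored. The class name, its fields, and the roundTrip helper are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example class, for illustration only.
final class Temperature implements Serializable {
    private static final long serialVersionUID = 1L;

    private final double celsius;
    private transient String display; // derived from celsius, so not stored

    Temperature(double celsius) {
        this.celsius = celsius;
        this.display = format(celsius);
    }

    private static String format(double celsius) {
        return celsius + " C";
    }

    String display() {
        return display;
    }

    // Invoked by ObjectOutputStream instead of its default behavior.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // writes the nontransient field, celsius
    }

    // Invoked by ObjectInputStream instead of its default behavior.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();    // restores celsius
        display = format(celsius); // rebuilds the derived value
    }

    // Serializes the object and deserializes a copy of it.
    static Temperature roundTrip(Temperature original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Temperature) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    Calling defaultWriteObject() and defaultReadObject() first keeps the stream's standard handling of the ordinary fields, so the custom code only has to deal with what is special about the class.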

    As if Serializable weren't enough, it has a specialized subtype called Externalizable. The difference between serialization and externalization is a bit like the difference between container-managed persistence (CMP) and bean-managed persistence (BMP) in EJB. With Serializable, the object delegates serialization to the container and its implementations of ObjectOutputStream and ObjectInputStream. With Externalizable, the object manages its own serialization, ignoring what the container's streams would otherwise try to do.

    Externalizable declares two methods - writeExternal(ObjectOutput) and readExternal(ObjectInput) - that are solely responsible for saving and restoring the object's contents. Externalizable isn't used much, so I won't discuss it further here.

    Books on improving the performance of Java code point out that you can usually improve your objects' serialization performance by implementing the serialization code yourself. This is because the serialization streams are ignorant of your particular class's implementation and use lots of reflection to pick an object apart and get its (often private) state. Your custom code - knowing how the class is implemented and having direct access even to private variables - is bound to be more efficient. On the other hand, implementing your own serialization code is tricky and difficult to maintain. You should generally only do it for large classes that are serialized frequently, so that the performance improvements will be significant enough to justify the implementation effort.

    Let's look at a simple example of serialization that doesn't involve a more complex framework like RMI or EJB. The DataMapper (www.xprogramming.com/datamapper/data- mapper_1.htm) is a framework for converting record files into Java objects and vice versa (which itself is an alternative to serialization and XML for converting objects into an easily transferable and persistable form). The mappings for doing the conversion are implemented via a format map, a tree of FieldFormat objects.

    Sometimes it is useful to have a process in one virtual machine define a map, then have a process in another virtual machine use the map, perhaps storing the map in between. To be able to move the map between VMs, it needs to be serializable, which means that the FieldFormat graph - and everything those objects reference - needs to be serializable. The remainder of this article will discuss the redesign necessary to take the working format map design and modify it to make the maps serializable as well.

    The Principle Is Simple
    In principle, making objects serializable is pretty easy: just change the class to implement Serializable.
    This class isn't serializable:

    public class MyClass {
    . . .
    }

    whereas this makes the class serializable:

    public class MyClass implements java.io.Serializable {
    . . .
    }

    When you implement Serializable, you don't even have to implement any new methods because Serializable doesn't declare any. So then what good does implementing Serializable do? Serializable does not implement or declare a class's serializable behavior; the default implementation for that behavior is already implemented by ObjectOutputStream.defaultWriteObject() and ObjectInputStream.defaultReadObject(). (It would have just been easier to put this code in Object as default implementations of readObject(ObjectInputStream) and writeObject(ObjectOutputStream) that subclasses could override or extend. But since objects delegate this responsibility to the stream classes, new subclasses of ObjectOutputStream and ObjectInputStream can impose their own serialization approaches and default implementations.) Thus all objects already know how to serialize themselves (or at least the serialization streams already know how to serialize any objects that don't otherwise implement their own serialization); implementing Serializable just enables the serialization behavior.

    Serialization Surprises
    This seems to imply that all objects are naturally serializable and that the default implementations are sufficient. It turns out that's not always the case. The problem is especially difficult to detect because even classes whose objects will not serialize successfully still compile just fine as if they will serialize. Typically, when a class says it implements an interface but does not have the necessary code to do so, it will fail to compile and the compiler will indicate what code the class is missing. But in the case of serialization, you can change any properly compiling class to implement Serializable and the class will compile. But just because the class compiles doesn't mean it will serialize.

    Also remember that an object doesn't just get state from its class, but can also inherit state from its class's superclasses. Serialization only serializes the state in a class that says it's serializable. If a superclass declares state, but doesn't implement Serializable, then the superclass's state won't be serialized. Instead, when the object is deserialized, the superclass's no-argument constructor runs and sets that state back to its defaults. This can result in an object whose serialized state represents only part of its total state. The state declared by its class gets serialized but the state declared by the superclass does not.
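
    The following sketch illustrates the partial-state problem. Labeled and Counter are hypothetical classes invented for this example: Labeled is deliberately not serializable, and the roundTrip helper is an assumption that simply serializes and deserializes in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical superclass that does NOT implement Serializable.
class Labeled {
    protected String label = "unlabeled";

    Labeled() { } // deserialization of a subclass runs this constructor
}

// Hypothetical serializable subclass; only its own field is written to the stream.
final class Counter extends Labeled implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int count;

    Counter(String label, int count) {
        this.label = label;
        this.count = count;
    }

    String label() { return label; }
    int count() { return count; }

    // Serializes the object and deserializes a copy of it.
    static Counter roundTrip(Counter original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Counter) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    After Counter.roundTrip(new Counter("widgets", 3)), the copy still has a count of 3, but its label is back to "unlabeled" because the label was never written to the stream.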

    Another common pitfall is forgetting that for an object to be serializable, all of its parts must be serializable. Serialization will recurse through the entire object structure, serializing each object in the structure. If any of those objects are not serializable, the serialization will fail.
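
    Here is a small sketch of that pitfall, using hypothetical Car and Engine classes. Car compiles as a Serializable class, yet writing it fails because its Engine part is not serializable. The canSerialize helper is an assumption added for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical part that does not implement Serializable.
final class Engine { }

// Hypothetical container: compiles fine, but cannot actually be serialized.
final class Car implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Engine engine = new Engine(); // the nonserializable part
}

final class PartsCheck {
    // Returns true if the whole object graph can be written to an object stream,
    // false if some object in the graph is not serializable.
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```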

    A risk to keep in mind with serialization is that it's potentially a huge security hole. A class may be carefully designed to hide its state, but a serialized version of an object lays its state out for anyone who wants to read it. It can also enable a malicious client to bypass a class's constructors and manufacture new instances with virtually any possible state stuffed inside it. Thus classes with especially sensitive variables should be designed to avoid serializing those variables. If the whole class must be secure, it should block serialization completely by not implementing Serializable and perhaps even implementing writeObject(ObjectOutputStream) and readObject(ObjectInputStream) to throw NotSerializableException.
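
    Blocking matters most when a class inherits Serializable from a superclass and so cannot simply decline to implement it. The sketch below assumes two hypothetical classes, Account and VaultAccount, and a hypothetical isBlocked helper. It is one way to apply the technique, not the only one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical serializable superclass.
class Account implements Serializable {
    private static final long serialVersionUID = 1L;
    protected final String owner;

    Account(String owner) {
        this.owner = owner;
    }
}

// Hypothetical sensitive subclass that refuses to be serialized or deserialized.
final class VaultAccount extends Account {
    private static final long serialVersionUID = 1L;

    VaultAccount(String owner) {
        super(owner);
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        throw new NotSerializableException(VaultAccount.class.getName());
    }

    private void readObject(ObjectInputStream in) throws IOException {
        throw new NotSerializableException(VaultAccount.class.getName());
    }
}

final class BlockingDemo {
    // Returns true if writing the object fails with NotSerializableException.
    static boolean isBlocked(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return false;
        } catch (NotSerializableException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```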

    When Objects Don't Serialize
    So how do you find out that your objects won't serialize? The compiler won't tell you; it'll let anything implement Serializable. It's not until you actually try to serialize an object that you'll find out whether or not it's really serializable.

    When you try to serialize an object that's not serializable, Java will throw some sort of ObjectStreamException. Typically it'll be a NotSerializableException, which indicates that you tried to serialize an object that doesn't implement Serializable. The stack trace is like the code shown in Listing 1.

    Interestingly, NotSerializableException doesn't necessarily mean that the object inherently is not serializable. It just means that the object doesn't explicitly say that it can be serialized. Often all that's needed is to change the object's class to implement Serializable and the object will now serialize successfully.

    Still, though, how do you know whether or not an object will serialize? The compiler won't tell you. You could run all the code in your application in every possible way, but that tends to be difficult and is a kind of overkill just to make sure that you can serialize everything you need to. What we need is some sort of serialization tester that we can use as part of our normal testing procedures. It will make sure that the objects we're producing that are supposed to be serializable and say that they're serializable really are serializable.

    A Serialization Tester
    As it turns out, I implemented just such a serialization tester for the DataMapper. It's the class bw.dm.test.ObjectSerializer, and although it's bundled with the DataMapper's testing code, it can be used to test the serialization of any code. It's a very small class; the whole thing is shown in Listing 2. The code should look very familiar at this point. The class just embodies the serialization and deserialization code shown earlier in this article.

    So, to test an object structure and verify that it can really be serialized and deserialized successfully, you just do this (assuming that MyClass extends/implements Serializable):

    MyClass obj1 = . . .; // create an instance of MyClass
    MyClass obj2 = (MyClass) ObjectSerializer.serializeAndDeserialize(obj1);

    Two things should happen here:

    1. obj2 should be a copy of obj1, such that they're equal and can be used interchangeably.
    2. Perhaps more important, serializeAndDeserialize should not throw any exceptions.
    For convenience, the ObjectSerializer converts any exceptions the serialization throws (IOException or ClassNotFoundException) into RuntimeExceptions. As long as no exceptions get thrown, the serialization has most likely worked. On the other hand, if the object cannot be serialized successfully, you'll know right away.
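
    For readers without the DataMapper source at hand, a tester along these lines might look like the sketch below. This is not the actual bw.dm.test.ObjectSerializer; the class name RoundTripTester is hypothetical, and only the serializeAndDeserialize method name and the conversion to RuntimeException are taken from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serialization tester; an assumption, not the author's listing.
final class RoundTripTester {
    private RoundTripTester() { }

    // Serializes the object to bytes and deserializes a copy from those bytes.
    // Checked exceptions are converted to RuntimeException so tests fail loudly.
    static Object serializeAndDeserialize(Serializable original) {
        try {
            return deserialize(serialize(original));
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("Serialization round trip failed", e);
        }
    }

    private static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    private static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}
```

    A test can then compare the returned copy with the original using equals(), provided the class under test defines equality by value.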

    Making the Map Serializable
    I already had tests written for the DataMapper that created different types of maps and tried them to verify that they work. Now that I had ObjectSerializer, I modified each test to create the map, serialize, and deserialize it, and then use the copy for the rest of the test. As long as no exceptions were thrown and the copy worked the way the original was supposed to, that was good enough for me.

    Fortunately, my tests were now able to reproduce the NotSerializableException all too easily. I had assumed that since my code implemented Serializable and it compiled successfully, serialization would work. The tester now confirmed that the serialization was not working.

    Some problems were easy to fix. Basically, certain classes that needed to be serializable didn't implement Serializable, so making them implement Serializable fixed the problem. But in other cases, labeling a class as serializable didn't help because the class fundamentally isn't serializable. These classes either need significant redesign, or the map structure needs to be redesigned so that these nonserializable objects don't need to be serialized.

    Nonserializable Objects
    Some objects just cannot be serialized. These include some of the most fundamental classes in Java. While Class is serializable, Method and Field are not. And that makes sense. They represent particular members of a particular class. What if they were deserialized in a VM without the class in its classpath, or if the class didn't contain the member? Likewise, a Thread (an object representing the execution of a program) is Runnable, but it isn't Serializable, which is good, because if you copy a thread into a new VM, or persist it and reload it later, what's it supposed to do after it deserializes?

    So it makes sense that objects like these aren't serializable, but that's a problem for the DataMapper. It has objects like FieldAspectAdapter, which uses a pair of Methods to get and set a value in the object being mapped. It has to store Methods, which are not serializable, so how can FieldAspectAdapter be?

    The DataMapper gets around this problem with a FieldAdapterProxy class. An implementation of the Proxy pattern, it's a FieldFormatAdapter that is serializable. A similar example is MethodSpec, a class that makes it easy to convert the parameters necessary to specify a method into the Method itself, but also a proxy designed to be serializable, a distinct advantage over Method.

    Transient Variables
    Serializable proxies for nonserializable objects don't completely solve the problem, however. Eventually, the proxy has to point to its subject, and the subject (something like a ProtocolAdapter or a Method) still isn't any more serializable than it ever was. So how do these proxies really solve the problem?

    Let's consider MethodSpec. It never actually stores the Method. It stores the receiver class, the method name, and the parameter types, but stores them all as Strings, which are very serializable. When asked for the Method, the MethodSpec quickly creates the Method from the Strings and returns it. By never storing the Method, MethodSpec has no problem being serializable even though Method isn't.
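
    To make the idea concrete, here is a minimal sketch of a class in the spirit of MethodSpec. It is not the DataMapper's implementation: the class name MethodSpecSketch, its fields, and its toMethod() and roundTrip() methods are all assumptions. It also only handles public methods whose parameter types are classes, since Class.forName cannot resolve a primitive name like "int".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.lang.reflect.Method;

// Hypothetical sketch: describes a method with Strings only, so it serializes easily.
final class MethodSpecSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String className;
    private final String methodName;
    private final String[] parameterTypeNames;

    MethodSpecSketch(String className, String methodName, String... parameterTypeNames) {
        this.className = className;
        this.methodName = methodName;
        this.parameterTypeNames = parameterTypeNames.clone();
    }

    // Creates the Method from the stored Strings each time it is requested.
    Method toMethod() {
        try {
            Class<?>[] parameterTypes = new Class<?>[parameterTypeNames.length];
            for (int i = 0; i < parameterTypeNames.length; i++) {
                parameterTypes[i] = Class.forName(parameterTypeNames[i]);
            }
            return Class.forName(className).getMethod(methodName, parameterTypes);
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            throw new IllegalStateException("Cannot resolve " + className + "." + methodName, e);
        }
    }

    // Serializes the spec and deserializes a copy of it.
    static MethodSpecSketch roundTrip(MethodSpecSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (MethodSpecSketch) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    For example, new MethodSpecSketch("java.lang.String", "concat", "java.lang.String") survives a round trip and can still produce the Method for String.concat(String) afterward.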

    This won't work for FieldAspectAdapterProxy, however. As a FieldAdapterProxy, it has to be serializable. But it's a proxy for a ProtocolAdapter, which uses either a Field or a pair of Methods to do its work and so clearly isn't serializable. The FieldAdapterProxy could create the ProtocolAdapter for every use, but ProtocolAdapter is a pretty complex object that references several others, and probably shouldn't be re-created over and over even if it could be. So FieldAdapterProxy needs to create its ProtocolAdapter once and maintain a reference to it, but prevent it from being serialized.

    A serializable object can prevent some of its references from being serialized by declaring them as transient. A standard instance variable declaration looks like this:

    public class MyClass {
    private SomeType variable;
    . . .
    }

    To make the variable transient, do this:

    public class MyClass {
    transient private SomeType variable;
    . . .
    }

    The serialization code in ObjectOutputStream.defaultWriteObject() serializes the object's state by serializing each of its nonstatic, nontransient fields. A transient field is skipped entirely: nothing is written to the stream for it. Then when the object is deserialized, the input stream has no value to restore, so the transient variable is left at its type's default value (false, 0, null, etc.). In the end, the transient variable's real value is never written to the stream, which is an important security consideration if the variable is something sensitive like a password.

    So while the FieldAdapterProxy maintains a reference to its ProtocolAdapter, it is a transient reference, so the stream will not attempt to serialize the ProtocolAdapter.
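
    The effect of transient is easy to observe with a small example. The Credentials class below is hypothetical, as is its roundTrip helper; it simply shows a sensitive field being left out of the serialized form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical class with one ordinary field and one transient field.
final class Credentials implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String user;
    private transient String password; // never written to the stream

    Credentials(String user, String password) {
        this.user = user;
        this.password = password;
    }

    String user() { return user; }
    String password() { return password; }

    // Serializes the object and deserializes a copy of it.
    static Credentials roundTrip(Credentials original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Credentials) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    In the copy returned by roundTrip, user() still returns the original name while password() returns null.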

    A more complex approach for specifying what parts of an object should be serialized is to declare a special serialPersistentFields variable, which must be private, static, and final. The value is an array of java.io.ObjectStreamField objects, each of which specifies the name and type of a nonstatic field to be serialized. These nonstatic fields don't necessarily have to exist in the current version of the class, so this approach is used for migrating serialized instances from one version of a class to another, even when the class's fields have changed.
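
    As a rough sketch of the simplest case, the hypothetical Reading class below lists only one of its two fields in serialPersistentFields, so the other is left out of the serialized form even though it isn't transient. The class, its fields, and the roundTrip helper are assumptions for illustration; the version-migration use mentioned above needs additional custom code that isn't shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical class that names its serializable fields explicitly.
final class Reading implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only the fields listed here become part of the serialized form.
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("sensor", String.class)
    };

    private final String sensor;
    private String note; // not listed above, so skipped although it is not transient

    Reading(String sensor, String note) {
        this.sensor = sensor;
        this.note = note;
    }

    String sensor() { return sensor; }
    String note() { return note; }

    // Serializes the object and deserializes a copy of it.
    static Reading roundTrip(Reading original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Reading) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```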

    Lazy Initialization
    Yet transient variables don't entirely solve the problem either. They start out as null (or for a primitive type, a default value like false or zero). Yet they're expected to be valid object instances. So when does a transient variable's value get set, and how does the container object avoid setting the value over and over?

    A good technique to use here is lazy initialization. Basically, you access the variable through a getter method that initializes the variable the first time, then just returns its value every time after that. The code looks like this:

    public MyClass getLazyVariable() {
        if (lazyVariable == null) {
            lazyVariable = this.defaultLazyVariable();
        }
        return lazyVariable;
    }

    Here's what happens when lazy initialization is used with the transient variable:

    1. The container object gets created or deserialized, so the transient variable is null.
    2. The first time the variable is accessed, it's null, so it gets initialized and the newly initialized value is returned.
    3. After that, each time the variable is accessed, its value is no longer null, so the current value is returned immediately.
    4. When the container is serialized, the transient value is not written, so in the deserialized copy the variable starts out as null again.
    Thus lazy initialization makes sure that a transient variable gets initialized before it's used but not reinitialized unnecessarily.

    Proxy objects with transient variables that are lazy initialized are a good way to develop a serializable wrapper for an otherwise unserializable object.
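
    Putting the pieces together, a serializable wrapper of this kind might be sketched as follows. Resource and ResourceProxy are hypothetical names, not DataMapper classes: Resource stands in for any nonserializable subject, and the proxy keeps only a String description of it in its serialized state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical nonserializable subject.
final class Resource {
    final String name;

    Resource(String name) {
        this.name = name;
    }
}

// Hypothetical serializable proxy for a Resource.
final class ResourceProxy implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String resourceName;   // serializable description of the subject
    private transient Resource resource; // the subject itself, never serialized

    ResourceProxy(String resourceName) {
        this.resourceName = resourceName;
    }

    // Lazy initialization: create the subject on first use, then reuse it.
    Resource getResource() {
        if (resource == null) {
            resource = new Resource(resourceName);
        }
        return resource;
    }

    // Serializes the proxy and deserializes a copy of it.
    static ResourceProxy roundTrip(ResourceProxy original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ResourceProxy) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    The deserialized proxy builds its own Resource the first time getResource() is called, and every later call returns that same instance.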

    Implementing a Proxy
    For an example of transient fields and lazy initialization, let's look at another DataMapper class. FieldAspectAdapterProxy is a kind of FieldAdapterProxy, which, among other things, needs to be serializable. It's a proxy for a ProtocolAdapter, which is not serializable. (What all these classes do exactly is not too important in this example; you just need to keep your eye on which ones are serializable and which ones aren't.)

    First, we have FieldAdapterProxy, which is serializable:

    public abstract class FieldAdapterProxy implements Serializable {
    protected FieldFormat field;
    . . .
    }

    It has a variable, field, which is a FieldFormat. Since FieldFormat is serializable, we're okay so far.

    Second, one of the main subclasses of FieldAdapterProxy is FieldAspectAdapterProxy. As a subclass, it has to be serializable too. It looks like this:

    public class FieldAspectAdapterProxy extends FieldAdapterProxy {
    protected AspectSpec spec;
    protected ProtocolAdapter adapter;
    . . .
    }

    AspectSpec is also serializable, so it's okay. But ProtocolAdapter isn't serializable. (Obviously. If it were, we wouldn't need a proxy class for it!) Implementations of ProtocolAdapter are those classes like FieldAspectAdapter and FieldFieldAdapter that are implemented with reflection classes like Method and Field that aren't serializable. So when we serialize a FieldAspectAdapterProxy, we're not going to be able to serialize its ProtocolAdapter.

    This is where we want to make a variable transient, so that it won't be serialized. Thus we change FieldAspectAdapterProxy like this:

    public class FieldAspectAdapterProxy extends FieldAdapterProxy {
    protected AspectSpec spec;
    transient protected ProtocolAdapter adapter = null;
    . . .
    }

    Initializing the variable to null isn't really necessary, but it helps remind us what the variable's initial/default value will be.

    Now the problem is that when a new instance is created, or an instance is deserialized, the adapter variable will be null. Let's add some code that uses lazy initialization to set the value when necessary (see Listing 3).

    Now, the methods that use adapter call setupDomainAdapter() first to make sure that the variable is initialized. If I weren't sure which methods these were (they're actually just about every method in FieldAspectAdapterProxy), I could put the lazy initialization in a getter method like this:

    protected ProtocolAdapter getAdapter() {
        if (adapter == null) {
            try {
                this.initializeAdapter();
            }
            catch . . .
        }
        // Return the adapter, initialized above if it had not been already.
        return adapter;
    }

    Then as long as all methods using the variable use the getter to access it, it'll get initialized. To help make sure methods don't access the variable directly (at least methods in subclasses), I could declare the variable to be private instead of protected.

    Conclusions
    To summarize:

    • Object serialization is one of Java's fundamental standard features. It's used for enabling objects to be persisted and/or distributed.
    • Serialization is accomplished via three types in the java.io package: Serializable, ObjectOutput, and ObjectInput.
    • Making an object serializable may be as simple as making its class implement Serializable. But it may take more than that.
    • A NotSerializableException is the main sign that an object is not serializable but needs to be.
    • A simple serialization tester like ObjectSerializer removes much of the uncertainty about whether or not an object structure is really serializable.
    • When an object cannot be designed for serialization, it can often be wrapped with a proxy that is serializable.
    • When a serializable object, such as a proxy, has nonserializable parts, those parts can be excluded from the serialization process by declaring their variables as transient.
    • Lazy initialization is a good way to make sure that a transient variable gets initialized at the right time and doesn't get initialized repeatedly.
    Now, if you ever need to make a data structure of yours serializable, and a teammate advises, "Just implement Serializable," you'll know what you really need to do.
    Bobby Woolf is a senior architect at GemStone Systems (www.gemstone.com), a Brokat company, and a member of their Professional Services division. He specializes in developing application architectures using various J2EE technologies and embeddable tools.
