Serialization In Detail

Introduction

Serialization is one of the core benefits of the NPF -- classes are generated with serialization in place, and the use of XML as the file format ensures that the serialization schema is forward-compatible, so that earlier versions of an application will be able to read and manipulate the files generated by later versions. This fact alone ensures a high degree of maintainability in software that uses the NPF.

This chapter describes the details of the serialization scheme, including the lumped file format. Files can be read and written in either lumped or non-lumped form. Lumped files contain the XML document and the binary entities in a single file: this is the most useful format for applications that deal with a collection of pieces that are conceptually a single file. For testing purposes, however, it is also desirable to be able to generate a pure XML file with binary entities in separate files. This may also be useful for delivering XML documents over the Web in the case where the core application is a server.

After covering some of the issues that come up in serialization of different parts of the document -- the basic XML, the binary entities and the text entities -- the details of the reading and writing processes are described.

XML Serialization

The core serialization code deals with the XML part of the document. One of the interesting things about XML is that it does not define a top-level element: any element can be the outer or bounding element of a document. This means that you can start serializing anywhere in the application element tree. The method for serializing a sub-tree is:

  void NPFElement::writeXML(ostream &ssXML, bool bDecompose, bool bValidating)

The output stream argument is the file to write to. The decompose flag, which is false by default, tells the serialization code whether to write a lumped document or a decomposed one. If the decompose flag is true then the document is decomposed into separate files, one for each binary entity. The validating flag, which is also false by default, tells whether the document should be validating or just well formed. In most cases the document should not be validating -- validation should be for testing purposes only, because the XML parser used by the framework is non-validating.

The lumped file format is as human-readable as possible. The first part is a table of contents that gives the offset, length and name of each sub-file. Because the XML document is the first part of the document, the first blocks are sure to be all ASCII. An example of the first few lines of a simple lumped document is shown in Table 6.1 -- the first line of the table of contents always contains the name "XML" to identify the XML document. The table of contents ends with a line of exactly 20 equals signs as a separator.

======================================================================
Table 6.1 Beginning of Lumped Serialization Document
======================================================================

0 1022 XML
1022 12 entityFileName
====================
<!DOCTYPE example [
<!ENTITY OhMy "¢Test">
<!ENTITY OhNo SYSTEM "test1.ent">
<!ENTITY aName SYSTEM "entityFileName" NDATA JPEG>
]>
<example...

======================================================================

The difference between a validating and non-validating document is in the prolog. A non-validating document has all of the required entities declared in the internal DTD subset in the document prolog: that is, in the part between the square brackets in the DOCTYPE declaration. A validating document just has a reference to the DTD file, which will be the file generated from the input DTD. Its name is the same as that of the input DTD, but with the letters XG prepended.
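
As an illustration of the layout in Table 6.1, the table of contents can be read with ordinary stream extraction. The following is only a sketch, not the framework's own reader: the TocEntry type and the readTableOfContents() function are hypothetical names, and the field order (offset, length, name) is taken from the example in Table 6.1.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the table of contents (hypothetical type, not part of the NPF).
struct TocEntry {
    std::size_t offset;
    std::size_t length;
    std::string name;
};

// Read table-of-contents lines up to the separator of exactly 20 equals signs.
// Returns false if a line is malformed, if no separator is found, or if the
// first entry is not the "XML" sub-file.
bool readTableOfContents(std::istream& in, std::vector<TocEntry>& toc)
{
    const std::string separator(20, '=');
    std::string line;
    while (std::getline(in, line)) {
        if (line == separator)
            return !toc.empty() && toc.front().name == "XML";
        std::istringstream fields(line);
        TocEntry entry;
        if (!(fields >> entry.offset >> entry.length >> entry.name))
            return false;  // malformed line
        toc.push_back(entry);
    }
    return false;  // stream ended before the separator
}
```

On success the stream is left positioned at the first character after the separator line, which in a lumped file is the start of the XML document.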

One important feature of the lumped file format is that the length of every binary entity is given. This is used to read these entities in the case when they are not recognized. The serialization code has a concept of "orphaned entities", which are binary entities that have a type that is not known, or that are not referenced by an attribute that is understood by the version of the code doing the reading. For instance, suppose that in version 5 of an application we add support for PGF files, which didn't even exist when version 1 was produced. Elements with binary entities of this type will have them represented by some PGF class. Version 1 will of course not have any class like this, and these entity attributes will just get passed on to the NPFElement base class, where their names will be stored. But what about the binary data?

When an entity attribute is recognized as such, the NPFElement's static readBinary() method is called to read the binary data. The static method populates a map that is used to keep track of all the binary entities that are read, and so when the document read is complete it can tell -- by comparing the entities in the table of contents to the list of entities actually read -- if there are any unread entities. These entities are declared "orphaned" and are read into blocks of memory as undifferentiated hexadecimal sludge. Nothing can be done with them, of course, but when the document is written back out they are put back safely in their places without a bit misplaced. Thus, the NPF can even deal safely with binary data of unknown type.

As mentioned in the previous chapter, the framework keeps track of what has been read for other reasons as well, so that multiple references to the same object can be resolved to the same thing. The NPFElement class contains a static method for reading:

  static NPFElement* readXML(class istream& inStream, bool bClearReadMap)

The stream argument is the place to read from. The clear read map flag is false by default. The read map is where binary entities that have already been read are stored. If we do multiple reads, we may well want to keep track of entities that have already been read to increase the speed of reading. On the other hand, there will be plenty of cases where we need to "forget" what we have already read so that new copies get read.
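
The effect of the clear read map flag can be pictured with a small cache keyed by entity name. This is a sketch under stated assumptions, not the framework's implementation: ReadMap and Blob are hypothetical stand-ins, and the "loading" here just constructs an object rather than reading a stream.

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for a binary entity (the framework would use an NPFBinary subclass).
struct Blob {
    std::string name;
};

// Hypothetical read map: entity name -> entity that has already been read.
class ReadMap {
public:
    // Called at the start of each document read; mirrors bClearReadMap.
    void beginRead(bool clearReadMap)
    {
        if (clearReadMap)
            entities_.clear();
    }

    // Return the entity with this name, loading it only on the first request.
    std::shared_ptr<Blob> get(const std::string& name)
    {
        auto found = entities_.find(name);
        if (found != entities_.end())
            return found->second;  // already read: hand back the same object
        ++loads_;
        auto blob = std::make_shared<Blob>(Blob{name});
        entities_[name] = blob;
        return blob;
    }

    // Number of entities that were really loaded rather than found in the map.
    int loads() const { return loads_; }

private:
    std::map<std::string, std::shared_ptr<Blob>> entities_;
    int loads_ = 0;
};
```

Two reads with the flag left false share one copy of an entity, while a read with the flag set to true produces a fresh copy.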

Text Entity Serialization

When a reference to an external text entity is written into a file it is preceded by a processing instruction (PI) to tell if it is to be written out when the file is written, and if so, how. This PI is written into the normal XML part of the file, not the entity file itself, so the PI is part of the document, not part of the entity. This is useful because it means that the same entity can be used in different ways in different files if so desired.

The PIs are very simple:

<? SAVE_NONE ?>      : do not write the entity
<? SAVE_INTERNAL ?>  : write the entity as a normal part of the document
<? SAVE_EXTERNAL ?>  : write the entity back to its external file

Public entities are SAVE_NONE. SAVE_INTERNAL should be used rarely -- it effectively means that the entity is made part of the document, which does not seem likely to happen all that often. SAVE_EXTERNAL is expected to be the most frequently used case, as external entities will probably be used most often as places to put application-specific configuration information.
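
A reader of these PIs only has to map one keyword to one of three behaviours. The following sketch shows one way that mapping might look. The SaveMode enumeration and the parseSaveMode() function are hypothetical names introduced for illustration and are not taken from the NPF.

```cpp
#include <sstream>
#include <string>

// The three save behaviours, plus a value for anything unrecognized.
enum class SaveMode { None, Internal, External, Unknown };

// Map a complete PI such as "<? SAVE_EXTERNAL ?>" to a save mode.
SaveMode parseSaveMode(const std::string& pi)
{
    const std::string open = "<?";
    const std::string close = "?>";
    if (pi.size() < open.size() + close.size()
        || pi.compare(0, open.size(), open) != 0
        || pi.compare(pi.size() - close.size(), close.size(), close) != 0)
        return SaveMode::Unknown;

    // Exactly one keyword is expected between the delimiters.
    std::istringstream body(
        pi.substr(open.size(), pi.size() - open.size() - close.size()));
    std::string keyword, extra;
    if (!(body >> keyword) || (body >> extra))
        return SaveMode::Unknown;

    if (keyword == "SAVE_NONE")     return SaveMode::None;
    if (keyword == "SAVE_INTERNAL") return SaveMode::Internal;
    if (keyword == "SAVE_EXTERNAL") return SaveMode::External;
    return SaveMode::Unknown;
}
```

Returning a distinct Unknown value lets the caller decide how to treat a PI it does not understand, rather than silently choosing one of the three modes.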

Parts of the document can be turned into entities by writing them to files independently of the rest of the document. To produce a well-formed fragment of XML with no prolog, which is just the thing to go into an entity file, call toXML() with a stream pointing to the entity file. Then all you have to do is remove that element from the document, register the entity with the external entity processor and call addExternalEntity() to register the entity as one of the children of the node. You may have to do a little fiddling with swap() to get the entity into the right place in the element's list of children. A complete example is shown in Figure 6.1.

======================================================================
Figure 6.1 Example of Converting Sub-tree to Entity
======================================================================

... assume we are in a derived NPFElement object method ...

    // pull a child out of the element to make into entity
    NPFElement *pChild = removeChild(5);

    // write the child to a file
    ofstream entityFile("newEntity.xml");
    pChild->toXML(entityFile);

    // delete the child
    delete pChild;

    // add the entity to the external entity processor
    ExternalEntityProcessor::addEntity("newEntity",
                                       "newEntity.xml",
                                       "newEntity.xml", false);

    // add the entity to this element as a child, with write to file set
    addExternalEntity("newEntity", "SAVE_EXTERNAL");

    // insert a dummy element where we pulled this one out
    insertChild(5, "DUMMY");

    // record the location the entity was added
    int nLocation = getChildNumber() - 1;

    // get the entity (the last child in this element)
    pChild = getLastChild();

    // replace the dummy with this child
    swap(5, pChild);

    // delete the dummy
    delete pChild;

    // remove the entity from its original location
    removeChild(nLocation);

======================================================================

The process of converting a sub-tree to an entity is made a bit awkward by the fact that the framework does not allow the developer to add a child pointer directly to an element, but as explained above, this is intended to make it difficult to have multiple copies of the same element running around, which makes the document distinctly un-tree-like.

Binary Entity Serialization

Binary entity serialization is made complex because there is no way to be sure that binary entities don't reference each other in arbitrarily complex ways. This means careful track has to be kept of which entities have already been read or written. The usual way to do this in class libraries is to provide a special stream class for serialization that keeps track of what has been read or written. This solution has a number of drawbacks: the big one is that every class you want to serialize has to have insertion and extraction operators, or something like them, written for these specialized streams. And sometimes you are going to want to read and write such objects in contexts where a standard stream would be nice to use.

The NPF solution to this is to have the NPFBinary base class do the tracking of what has been read or written. This way you, the developer, can deal entirely with the familiar stream classes and not have to worry about any custom behaviour from the streams. On the other hand, it means that some care needs to be taken in writing the I/O methods of NPFBinary: subRead() and subWrite().

The important thing to remember about subRead() and subWrite() is never to read or write pointers to non-NPFBinary types directly if they are going to be shared amongst several classes. The situations where they are shared should be pretty uncommon to begin with, but if you must do it, make sure they are wrapped in NPFBinary classes. When you need to write one NPFBinary class from another, simply call write() on it.

Reading is a bit more complex, as it involves object creation. The global function readBinary(istream&, NPFBinary**) must be used to read binary objects from other binary objects. This will deal with the case where we have already read the object and so just need to pass its pointer back to the client.

Writing In Detail

The method NPFElement::writeXML() is used to write to files. It does things in several stages. First, it uses the visitor pattern to walk over the document tree to figure out what binary entities are required in the document prolog. This process has a big limitation: if any of the binary entities contain other binary entities, we need to know about it as well. But in purely XML terms, we can't, because so far as XML is concerned every binary entity is just a big lump of data with no internal structure. So when we actually do the XML generation, we need to ensure that every binary entity that needs to be written gets added to a set of binary entities for writing. This is done by passing a binary entity set to the NPFBinary class where it is held as a static member. Each NPFBinary object that needs to be written is added to this set.

The XML generation is done to a string by the generateXML() method, and after the table of contents is written to the stream, the XML string is written. Once the XML is written, the binary entities are written. During the XML generation process all that happens is that the binary entities get added to the set for writing: no actual writing takes place. After the binary entities we know about are written, any orphaned entities that were picked up on read are written.

Reading In Detail

Reading is a bit more complicated than writing because we need to be able to create new objects. Thus, readXML() is a static method of the NPFElement class that returns a pointer to the top level element in the document. Reading of binary entities also mostly takes place during the actual XML parsing: as entity attributes are read an attempt is made to read their data from the input stream, or from an external file of the appropriate name if their names cannot be found in the table of contents. Before any reading is done, the entity name is checked against the read map, and if it is found there a pointer to the existing entity is returned. This is all handled by the NPFElement method readBinary().

Once all the XML is parsed and the reading should be complete, readXML() checks the entities that have actually been read against those in the table of contents. If any entities are left over, they are read in as orphaned entities, and a set of pointers to them is maintained by the NPFElement base class for future writing.

Summary

This chapter has covered the serialization process in more detail, including the serialization of binary entities. It is very important that binary entities be serialized properly, or you can wind up with infinite loops, which are generally considered bad for your application's performance.

This concludes the introduction of the Narrative Programming Framework. The next section deals with the question: where do document type definitions come from? The answer to this question reveals the full power of the Narrative Programming idea.