Entity Attributes and Binary Data

Introduction

The ENTITY keyword in XML covers a multitude of afterthoughts. An entity is a mapping between a name and some data. Apart from parameter entities, which appear only in the DTD and were described briefly in the preceding chapter, there are three types of entity that are important in the NPF: internal entities, and text and binary external entities.

Internal entities are used to represent special characters, and should be kept to a minimum for efficiency reasons. The two kinds of external entity have very different roles. External text entities can be used for a wide range of housekeeping tasks: they are an excellent place to put application configuration information so it can be globally managed without much trouble. Binary entities, on the other hand, are the way that XML handles binary (non-textual) data. They are mostly useful for wrapping non-framework legacy classes in a form that the framework can deal with.

Internal Entities

Internal entities are typically used to map special characters; for instance, the XML spec reserves the characters "<", ">", "&", "'" and """, and requires that every XML processor recognize the entity names "lt", "gt", "amp", "apos" and "quot" as mapping to these characters. In the document itself, entity names appear bounded by "&" and ";", so the text "5 < 10" would be written in an XML document as "5 &lt; 10". The name of an entity in the text bounded by "&" and ";" is called an entity reference.

Any character can be represented by its numeric value using text entities: the number is preceded by "&#" for decimal or "&#x" for hexadecimal values, and followed by ";". So to represent "a" in an XML document we could use "a", "&#97;" or "&#x61;", depending on how confusing we wanted to be.
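
The numeric form can be decoded mechanically. The following is a minimal C++ sketch rather than NPF code: the function name decodeNumericReference is hypothetical, and it handles only a single reference written as "&#" or "&#x", then digits, then ";".

```cpp
#include <cstddef>
#include <string>

// Decode one numeric reference such as "&#97;" or "&#x61;".
// Returns the character code, or -1 if the text is not well formed.
// Hypothetical helper for illustration; not part of the NPF API.
long decodeNumericReference(const std::string &ref)
{
    // The shortest legal form is "&#N;" (four characters).
    if (ref.size() < 4 || ref.compare(0, 2, "&#") != 0 ||
        ref[ref.size() - 1] != ';')
        return -1;

    std::size_t start = 2;
    int base = 10;
    if (ref[start] == 'x')          // "&#x" introduces a hexadecimal value
    {
        base = 16;
        ++start;
    }

    long value = 0;
    bool sawDigit = false;
    for (std::size_t i = start; i + 1 < ref.size(); ++i)
    {
        const char c = ref[i];
        int digit;
        if (c >= '0' && c <= '9')                    digit = c - '0';
        else if (base == 16 && c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return -1;                              // not a digit in this base

        value = value * base + digit;
        if (value > 0x10FFFF) return -1;             // beyond any character code
        sawDigit = true;
    }
    return sawDigit ? value : -1;
}
```

With this sketch, decodeNumericReference("&#97;") and decodeNumericReference("&#x61;") both yield 97, the code for "a".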

An internal entity is declared in the DTD as follows:

<!ENTITY entityName "entityText">
For example, the following maps the name "hello" to the string "goodbye":
<!ENTITY hello "goodbye">
Entities may (and frequently do) contain entity references. For instance, to map the entity BEL to the character 7, we can define:
<!ENTITY BEL "&#7;">
In general this kind of thing should not be necessary, as non-printing, non-whitespace characters are handled automatically by the framework, as described below, and entity translation is an extremely expensive part of serialization. Although some attention has been paid to making it efficient, the fact remains that every character may have to undergo translation, so the framework needs to check every character on output against the set of declared entities, and the bigger that set gets, the slower the operation becomes.

A great deal of work is done by the xml2cpp framework and generated code to deal with internal entities as painlessly as possible -- the five basic entities are all defined in the framework, and any entities defined in the DTD are automatically added. Also, non-printing, non-whitespace characters in the ASCII character set are automatically mapped to numeric entities, which just give the ASCII character codes for those characters. Numeric entities are delimited by "&#" and ";", so the ASCII "BEL" character, which has a decimal value of 7, will automatically get mapped to "&#7;" in the XML generated from an object in which it appears. On the API side, however, programmers can deal with the raw characters they care about: the framework translates to and from XML-encoded, entity-laden text as required. It is worth reiterating that screening strings for characters that might need to be converted to entities is terribly inefficient, and if you can avoid all but the basic entities you will be doing well.
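
To make the cost concrete, here is a rough C++ sketch of the kind of output translation described above. It is not the framework's implementation: the function name escapeForXml is hypothetical, and "non-printing" is assumed to mean whatever std::isprint rejects in the default "C" locale. Note that every character has to pass through the test, which is why extra declared entities slow things down.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Translate raw text into XML-safe text: the five basic entities are
// replaced by name, and non-printing, non-whitespace ASCII characters
// become decimal numeric entities. Hypothetical helper, not NPF code.
std::string escapeForXml(const std::string &raw)
{
    std::ostringstream out;
    for (std::string::size_type i = 0; i < raw.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(raw[i]);
        switch (c)
        {
            case '<':  out << "&lt;";   break;
            case '>':  out << "&gt;";   break;
            case '&':  out << "&amp;";  break;
            case '\'': out << "&apos;"; break;
            case '"':  out << "&quot;"; break;
            default:
                if (c < 128 && !std::isprint(c) && !std::isspace(c))
                    out << "&#" << static_cast<int>(c) << ";";  // e.g. BEL
                else
                    out << static_cast<char>(c);                // pass through
        }
    }
    return out.str();
}
```

Under these assumptions the text "5 < 10" comes out as "5 &lt; 10", and a string containing the BEL character comes out with "&#7;" in its place.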

External Binary Entities

An external entity is a mapping between an entity name and a file.

External entities are either textual -- in which case they are just files containing fragments of well-formed XML that get inserted into the document in place of the entity reference -- or binary.

From a programming point of view binary entities may represent legacy or third-party classes that must be brought into the framework, as well as classes that really do have to contain large chunks of binary data. The archetypal binary entity is the JPEG image: a large chunk of binary data that is completely non-textual. Binary entities allow XML to deal with such large blocks of non-textual data with relative efficiency.

An external binary entity is defined as follows:

<!ENTITY entityName SYSTEM "fileName" NDATA JPEG>
The SYSTEM keyword indicates that the quoted string following it is a filename. If the keyword PUBLIC had been used instead this would have identified a public name. In XML, public names are often stored in a catalog system. In the NPF, declaring an entity PUBLIC indicates that it is read-only: the data may be read and modified by the application, but will not be written back out. Public entities are often accessed through URLs.

Following the filename comes the keyword NDATA, and after that comes a notation name. The NDATA keyword is what identifies this as a binary entity: a notation is a binary data type, and the NDATA keyword tells the XML processor that the named entity maps to a file that contains information in the format specified by the named notation declaration. Every binary entity must specify the name of a notation that is defined in the DTD. A notation declaration looks like:

<!NOTATION JPEG SYSTEM "jpeg">
where the notation name is given after the NOTATION keyword, and the name of the processor for handling notations of this type is given after the SYSTEM identifier. In ordinary XML this would typically be the name of the application or plug-in component for dealing with files of the notation's type. In the NPF, it gives the name of the class to be generated to serve as an interface between this type of data and the rest of the system. This will be described in much more detail below -- like a great deal of XML, there is less to it than may first appear.

Binary Entity Declarations

Now that it is clear what binary entities do, there is the question of how one is included in the document. The answer is that binary entities are included by name as the value of an entity attribute.

An entity attribute contains the name of a binary entity. The name of a binary entity is just a string that can be mapped to a filename by the XML processor that is part of the NPF.

A typical entity attribute declaration might look something like:

<!ATTLIST elementName
jpegImage1 ENTITY>
There are two parts to this: the ENTITY keyword obviously is what identifies this as an entity attribute. And the name of the attribute has a special form that is similar to those of other attribute names: the first part of the name is all lower case and specifies the type of the entity, and the second part of the name starts with an upper case letter to separate it from the type.

The type of the entity is the notation name for the entity. XML does not give any way of specifying the notation associated with a particular entity attribute, so we have to encode the type in the name as usual.
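
The naming convention is easy to apply mechanically. As a small illustration, the C++ sketch below splits an attribute name at its first upper-case letter. The function name splitEntityAttributeName is hypothetical and this is not the code xml2cpp uses; it only shows how "jpegImage" breaks into the type "jpeg" and the name "Image".

```cpp
#include <cctype>
#include <string>
#include <utility>

// Split an attribute name such as "jpegImage" into its lower-case type
// prefix ("jpeg") and the remainder ("Image"). If the name contains no
// upper-case letter, the whole name is returned as the prefix.
// Hypothetical helper for illustration only.
std::pair<std::string, std::string>
splitEntityAttributeName(const std::string &name)
{
    std::string::size_type i = 0;
    while (i < name.size() &&
           !std::isupper(static_cast<unsigned char>(name[i])))
    {
        ++i;
    }
    return std::make_pair(name.substr(0, i), name.substr(i));
}
```

Applied to "jpegImage1" this gives the pair ("jpeg", "Image1"); applied to an ordinary attribute such as "nAge" it gives ("n", "Age").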

The code generated for binary entities is complex: the notation declaration generates code for a class derived from the NPFBinary base class, and the C++ representation of the attribute is a class member of this type. The serialization code calls read and write methods on this class to insert it into and extract it from streams and files. The XML that gets generated for an entity attribute comes in two parts: the first is the attribute value, which is just the entity name. The second is XML that gets generated for the document prolog, so that there is a valid declaration for the entity, with the correct name and notation type, so that the XML processor knows how to deal with the entity name, and the XML document is properly parsable. These serialization issues are dealt with in detail in the next chapter.

The relationship between entity, entity attribute and notation declarations is shown in Figure 5.1. Any file that holds a binary entity must be declared as part of the DTD or the document prolog -- although this list can be added to at runtime, there are often "standard" images that it is convenient to declare up front. The notation type for the entity is given after the NDATA keyword. Every type used in an entity declaration must have an associated notation declaration, as shown. This just maps the type name to another name, and this other name is what is used to determine the type of entity attributes, as shown.

=============================================================
Figure 5.1 Entity, Entity Attribute and Notation Declarations
=============================================================

<!ELEMENT imageHolder EMPTY>


<!ENTITY myImage SYSTEM "c:\images\myimage.jpg" NDATA JPEG>
                                                       |
             ------image identified as this type--------              
             |
<!NOTATION JPEG SYSTEM "jpeg">
                         |
                         | attribute type same as notation "filename"
                         |
<!ATTLIST imageHolder  jpegImage ENTITY #REQUIRED>

=============================================================
Of course, not all entities need to be declared in the DTD: entities that are added to the document at runtime have declarations written into the document's prolog so that other XML parsers can deal with them.
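
The declaration written into the prolog has the same shape as the entity declaration in Figure 5.1. The C++ sketch below builds such text from its parts; the function names makeBinaryEntityDecl and makeDoctype are hypothetical, and the real framework produces this text through the generated classes rather than through free functions like these.

```cpp
#include <string>

// Build the text of an external binary entity declaration, for example
//   <!ENTITY myImage SYSTEM "c:\images\myimage.jpg" NDATA JPEG>
// Hypothetical helper; the arguments are inserted without validation.
std::string makeBinaryEntityDecl(const std::string &entityName,
                                 const std::string &fileName,
                                 const std::string &notationName)
{
    return "<!ENTITY " + entityName + " SYSTEM \"" + fileName +
           "\" NDATA " + notationName + ">";
}

// Wrap declarations in a document type declaration with an internal
// subset, which is where a document prolog carries its own declarations.
std::string makeDoctype(const std::string &rootName,
                        const std::string &declarations)
{
    return "<!DOCTYPE " + rootName + " [\n" + declarations + "\n]>";
}
```

A caller would typically concatenate one declaration per runtime entity and pass the result to makeDoctype, so that a stand-alone parser sees every entity name used in the document.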

Code Generated for Binary Entities

A modified version of the DTD from Example 1 is shown in Figure 5.2: it includes an entity attribute for the person class.

=============================================================
Figure 5.2  Use of Notations and Entity Attributes
=============================================================

<!ELEMENT person EMPTY>
<!ATTLIST person
strFirstName CDATA ""
strMiddleInitial CDATA ""
strLastName CDATA #REQUIRED
nAge CDATA "-1"
jpegImage ENTITY #IMPLIED>

<!NOTATION JPEG SYSTEM "jpeg" >

=============================================================
As well as the entity attribute, there is a notation declaration for an entity type. No entity is declared in this case, because of course every person will have a different image, which will have to be supplied by the system somehow -- perhaps this DTD is part of a personnel records system that requires a digital photograph to be input.

There are two classes generated by xml2cpp for this DTD: one representing a person, one representing the JPEG type. The person class looks much like before, as shown in Figure 5.3. Most of the class is similar to the previous version, with the added features that there is a "jpeg.h" file included, and several new methods for dealing with the new attribute.

=============================================================
Figure 5.3 Partial Person Class Declaration -- parts deleted for clarity
=============================================================

#ifndef PERSON_H
#define PERSON_H

#include        "jpeg.h"
#include        <string>

#include        "npf_element.h"

class person : public virtual NPFElement
{
  public:

        string getNotationDecls(set<NPFBinary*> &setBinaryEntity);

        jpeg*& getImage() {return m_pjpegImage;}
        void setImage(const string & jpegImage)
        {
                if (m_pjpegImage == 0) m_pjpegImage = new jpeg;
                m_pjpegImage->readFile(jpegImage);
        }
        void setImage(jpeg *pjpegImage)
        {
                if (m_pjpegImage != 0) delete m_pjpegImage;
                m_pjpegImage = pjpegImage;
        }

  protected:

        jpeg*   m_pjpegImage;
        int     m_nAge;
        string  m_strFirstName;
        string  m_strLastName;
        string  m_strMiddleInitial;

};

#endif

=============================================================
Note that the entity attribute is represented by a pointer to an object of type "jpeg" -- this is based on the notion that binary objects are generally going to be large and we don't want to reserve any space for them if we don't need them. Also, when we have multiple copies of the same entity we want to have all the pointers pointing to the same object, which would not be possible if entity attributes were held by value.

The set and get methods for the entity attribute have the type stripped off of their names as usual, and there is also an extra set method that takes a string rather than a pointer to a jpeg. This method causes a new object to be allocated if required, and tries to read its data from the file named in the argument. The other set method sets the pointer directly, and if there is already one allocated it is deleted.

The get method returns a reference to the pointer, so it can be modified if necessary without deleting the existing pointer.

Part of the implementation for the class is shown in Figure 5.4 -- the most important features are the handling of the binary data object in the addAttribute() and toXML() methods, and the getNotationDecls() method for handling the document prolog.

=============================================================
Figure 5.4 Partial Person Class Definition -- parts deleted for clarity
=============================================================

void person::addAttribute(const string &strName, const string &strValue)
{
        if (strName == "jpegImage")
        {
          if (m_pjpegImage == 0)
          {
                m_pjpegImage = new jpeg();
          }
          if (m_pssXML == 0)
          {
                m_pjpegImage->setName(strValue);
          }
          else
          {
                readBinary(*m_pssXML,strValue,(NPFBinary**) &m_pjpegImage);
          }
        }
        else if (strName == "nAge")
        {	
       		:
		:
}

void person::toXML(ostream& ssXML)
{
        ssXML << "<person ";
        if (m_pjpegImage != 0)
        {
                ssXML << "jpegImage=\"" << m_pjpegImage->getName() << "\" ";
        }
        ssXML << "nAge=\"" << m_nAge << "\" ";
       		:
		:
}

string person::getNotationDecls(set<NPFBinary*> &setBinaryEntity)
{
   	string strDecls;

        if (m_pjpegImage != 0) 
		strDecls += m_pjpegImage->getProlog(setBinaryEntity);

        return strDecls;
}

=============================================================
Depending on the situation, addAttribute() does different things. If no pointer has been allocated, a new object is constructed. The next question has to do with how the element is being read: if it is being read from a stream, that stream will be pointed to by m_pssXML, which is an istream that is part of the NPFElement class. There are circumstances, however, where all we want to do is store the name of the entity (the attribute's value) and not actually do any reading immediately. In this case, the stream pointer will be null, and just the name will be stored.

In the case where reading is done, note that it is done through a call to the NPFElement's readBinary() method rather than through the member's readFile() method, which is used by the set method. The reason for this will have to wait for the discussion of serialization to get to its full-blown detail, but the short answer has to do with keeping track of multiple copies of the same entity within the application. We obviously want to ensure that only one copy gets read or written, and that all pointers to the entity point to that single copy. The readBinary() method is part of the system that ensures this happens.

The toXML() method, by contrast, is amazingly simple: all it does is write the name of the entity into the attribute value. The question arises: where and how does the binary entity actually get written out? This will be described in the following section.

The getNotationDecls() method takes a set of pointers to NPFBinary objects as an argument, which is passed on to the entity attribute's getProlog() method. The entity attribute itself, as will be described below, is derived from the NPFBinary class, and getProlog() is part of that class's interface. The getProlog() method returns a string that contains the declaration for the entity represented by the entity attribute -- it is necessary that all such entities be declared in the prolog of the document if they are not in the DTD. If the person class had multiple entity attributes, each of them would have an entry in the getNotationDecls() method. Note that the getNotationDecls() method is quite poorly named: it should be called getBinaryEntityDecls() or something like that, as it does not get notation declarations at all.

The other class created by processing the example DTD is the jpeg class, which represents the notation type. This class must be modified by the developer to contain the code that represents the jpeg image itself, of course -- the generated code only provides a few hooks to be used by the framework. The header file is shown in Figure 5.5.

=============================================================
Figure 5.5 jpeg Class Header Generated from Notation Declaration
=============================================================

#ifndef JPEG_H
#define JPEG_H

#include "npf_binary.h"

//##NPF_USER_HEADER_INCLUDE##

//##NPF_USER_HEADER_INCLUDE##
class jpeg : public NPFBinary
{
  public:

        jpeg();
//##NPF_USER_PUBLIC_DECL##

//##NPF_USER_PUBLIC_DECL##
  protected:

        void subRead(class istream& inStream);
        void subWrite(class ostream& outStream);
//##NPF_USER_PROTECTED_DECL##

//##NPF_USER_PROTECTED_DECL##
  private:

//##NPF_USER_PRIVATE_DECL##

//##NPF_USER_PRIVATE_DECL##
};

#endif

=============================================================
The thing to notice is that the class derives not from NPFElement, but from NPFBinary. The full reference to the NPFBinary class is given in the appendix, but for now we are interested only in a few features. In particular, NPFBinary provides an interface that is used by NPFElement in reading and writing binary data, and the hooks we have into that are the subRead() and subWrite() methods, which are called by the framework after some setup on the streams. These methods are actually responsible for reading and writing the binary data objects. The NPFBinary class also has a protected size member, m_nSize, which is checked after subRead() and subWrite() are called to ensure that the correct number of bytes was actually read or written. Used properly, this can help protect the developer against a large class of read/write errors.

The generated implementation file for the jpeg class is shown in Figure 5.6, and does not contain anything very interesting. The bodies of the subRead() and subWrite() methods of course need to be supplied by the developer, probably by doing reads and writes on the stream arguments. The streams are in most cases not going to be simple filestreams that can be rewound to zero to get to the start of the binary object. If you need to rewind the stream for any reason, make sure that you mark the point it is at by calling tellp() or tellg() at the start of the method, and then seek back to that point rather than to zero. As will be described in the next chapter, a single stream in the NPF is used to store many objects, and so each object has to respect the boundaries of the others.

=============================================================
Figure 5.6 jpeg Class Implementation Generated from Notation Declaration
=============================================================

#include <istream.h>
#include <ostream.h>
#include "jpeg.h"

//##NPF_USER_IMPL_INCLUDE##

//##NPF_USER_IMPL_INCLUDE##

jpeg::jpeg() : NPFBinary("JPEG")
{
//##NPF_USER_CONSTRUCTOR_BODY##

//##NPF_USER_CONSTRUCTOR_BODY##
}
void jpeg::subRead(istream &inStream)
{
//##NPF_USER_SUB_READ_BODY##

//##NPF_USER_SUB_READ_BODY##

}

void jpeg::subWrite(ostream &outStream)
{
//##NPF_USER_BINARY_WRITE_BODY##

//##NPF_USER_BINARY_WRITE_BODY##

}

//##NPF_USER_OTHER_METHODS##

//##NPF_USER_OTHER_METHODS##

=============================================================
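
The advice about marking the stream position before rewinding can be sketched as follows. This is an illustration in standard C++ rather than a real subRead() body: the function names readObject and readObjectFromText are hypothetical, and the two-byte "signature" peek simply stands in for any reason the code might have to look ahead and then rewind.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read `size` bytes of one binary object from a stream that is shared
// with other objects. Hypothetical helper, not part of the NPF.
std::vector<char> readObject(std::istream &in, std::size_t size)
{
    // Bookmark where this object starts; it is usually not offset zero.
    const std::istream::pos_type start = in.tellg();

    // Look ahead at the first two bytes, e.g. to check a format signature.
    char signature[2] = {0, 0};
    in.read(signature, 2);

    // Rewind to the bookmark, never to the beginning of the stream.
    in.clear();
    in.seekg(start);

    std::vector<char> data(size);
    std::streamsize got = 0;
    if (size > 0)
    {
        in.read(&data[0], static_cast<std::streamsize>(size));
        got = in.gcount();
    }
    data.resize(static_cast<std::size_t>(got));   // keep only what was read
    return data;
}

// Usage helper: treat `text` as the shared stream, with the object of
// interest beginning at `offset`.
std::string readObjectFromText(const std::string &text,
                               std::size_t offset, std::size_t size)
{
    std::istringstream in(text);
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    const std::vector<char> data = readObject(in, size);
    return std::string(data.begin(), data.end());
}
```

Comparing the size of the returned buffer with the expected size plays the same role as the m_nSize check described earlier: a short read shows up immediately instead of silently corrupting the next object in the stream.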

Binary Entity Serialization

There are two main issues raised by the previous section: how do we make sure that only one copy of a binary entity gets read/written, even if it appears multiple times in a document, and where does the writing occur, given that it does not happen in toXML()? Detailed discussion of both appears in the next chapter, but this section will give a sketch of the solution.

All binary entities are derived from NPFBinary. NPFBinary contains a rich API for handling I/O of binary entities, most of which should be completely invisible to the developer. This section describes it because people often like to know what they are using.

The important members of NPFBinary are two static maps that keep track of the objects that need to be read/written and that have actually been read/written. These maps are populated in various ways depending on what is being done. In the ordinary course of a write, for instance, a tree-walker is sent across the document tree that is about to be written to pull out all of the binary entities that need to be written. When the document has been written, this list of entities is walked over, and each one is written in turn. This process will be described in more detail in the next chapter.

On read, the process takes a slightly different course: when an object gets read for the first time it is really read and a pointer to it is put into the read map. After that, if the same entity needs to be read again, the pointer is just set from the value in the map, so all copies point to the same place.
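
The read-side behaviour can be pictured with a small C++ sketch. The names here (Entity, ReadMap, realReads) are hypothetical, the map is assumed to be keyed by entity name, and the "real read" is simulated with a placeholder string; the actual NPFBinary maps are static members and may be organised differently.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for a binary entity object.
struct Entity
{
    std::string name;
    std::string data;
};

// Sketch of a read map: the first request for a name performs a real
// read, and later requests return the pointer stored by the first one.
class ReadMap
{
  public:
    ReadMap() : m_realReads(0) {}

    Entity *read(const std::string &name)
    {
        auto it = m_read.find(name);
        if (it != m_read.end())
            return it->second.get();        // already read: share the object

        std::unique_ptr<Entity> fresh(new Entity);
        fresh->name = name;
        fresh->data = "bytes of " + name;   // placeholder for real file I/O
        ++m_realReads;

        Entity *raw = fresh.get();
        m_read[name] = std::move(fresh);
        return raw;
    }

    int realReads() const { return m_realReads; }

  private:
    std::map<std::string, std::unique_ptr<Entity> > m_read;
    int m_realReads;
};
```

Two calls to read() with the same name return the same pointer and increase realReads() only once, which is the property the text relies on.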

There is also the issue of read and write loops, which are dealt with by setting flags in each object to determine if they are already being read or written, so that if two objects refer to each other they will not be able to get into an infinite cycle of reading or writing each other.
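
A flag of this kind is simple to illustrate. In the C++ sketch below, the Node type, the beingWritten flag and the function names are all hypothetical stand-ins, and writing an object is reduced to printing its name; the point is only that a visit to an object that is already in progress returns immediately.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for an object that may refer to other objects, possibly
// in a cycle.
struct Node
{
    std::string        name;
    std::vector<Node*> refs;
    bool               beingWritten;

    explicit Node(const std::string &n) : name(n), beingWritten(false) {}
};

// Write a node and everything it refers to. The flag marks nodes whose
// write is in progress, so a cycle is cut instead of recursing forever.
void writeNode(Node &node, std::ostream &out)
{
    if (node.beingWritten)
        return;                         // already being written: stop here

    node.beingWritten = true;
    out << node.name << ' ';
    for (std::size_t i = 0; i < node.refs.size(); ++i)
        writeNode(*node.refs[i], out);
    node.beingWritten = false;
}

// Usage helper: collect the output for one root node as a string.
std::string writeToString(Node &root)
{
    std::ostringstream out;
    writeNode(root, out);
    return out.str();
}
```

If two nodes "a" and "b" refer to each other, writing "a" produces each name once and then terminates, because the second visit to "a" finds its flag already set.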

The structure of an XML document stream as used by the NPF is fairly rich. Depending on how the document is written, it may reside in many different files -- one for each binary entity, plus one for the document -- or it may reside in a single file that has an index table at the top, and all the binary entities following after the XML document itself. The latter is the more common for applications programming, but the former is useful for debugging purposes, as it is the only way to generate a document that can be parsed by a validating XML parser.

External Text Entities

External text entities are well-formed pieces of XML. The NPF is slightly more restrictive than XML here: XML allows text entities to have leading and trailing text that is outside of any element, and the NPF does not. In XML this text has to be valid where it appears, of course, but so far as the entity is concerned it is not part of any element. This model is too general to be supported efficiently, and rarely useful. It results in a situation where any character could be an entity boundary, and processing the document back into entities becomes prohibitively expensive.

Text entities can be used for a number of purposes, and their behaviour can be controlled depending on what purpose they are to be used for. Most often, text entities are used to contain configuration information for the application. For static configuration information, the entity should be public, so it will be read only. For entities that get written there are two types: those that get written back into their original file, and those that get saved with the document. The latter kind lose their identity as entities, and become part of the document itself.

Entity behaviour is controlled in the document by processing instructions. A PI is a bit of XML markup that looks like: <?target some text?>. XML processors ignore processing instructions that they don't recognize, so this is a safe way to encode extra information. There are three processing instructions defined for text entities, as shown in Table 5.1.

=====================================================================
Table 5.1 Processing Instructions for External Text Entities
=====================================================================
	PI			Meaning
---------------------------------------------------------------------
SAVE_INTERNAL		Save the entity as part of the document
SAVE_EXTERNAL		Save the entity in its original file
SAVE_NONE		Do not save the entity -- it is read only
=====================================================================
Public entities are SAVE_NONE, and are typically expected to be described by URLs.
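
In application code the three names in Table 5.1 are convenient to handle as an enumeration. The C++ sketch below is illustrative only: the SavePolicy type and the parseSavePolicy function are hypothetical and are not part of the NPF API. An unrecognized string is reported rather than treated as an error, in keeping with the rule that unknown processing instructions are ignored.

```cpp
#include <string>

// The three output policies from Table 5.1. Hypothetical type.
enum SavePolicy
{
    SaveNone,       // SAVE_NONE: read only, never written
    SaveInternal,   // SAVE_INTERNAL: written as part of the document
    SaveExternal    // SAVE_EXTERNAL: written back to its original file
};

// Translate a policy name into a SavePolicy. Returns false, leaving
// `policy` untouched, if the name is not one of the three above.
bool parseSavePolicy(const std::string &text, SavePolicy &policy)
{
    if (text == "SAVE_NONE")     { policy = SaveNone;     return true; }
    if (text == "SAVE_INTERNAL") { policy = SaveInternal; return true; }
    if (text == "SAVE_EXTERNAL") { policy = SaveExternal; return true; }
    return false;
}
```

A caller can then switch on the resulting value instead of comparing strings at every point where the policy matters.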

An entity can be added to a document at run-time using the NPFElement's API:

  // Add a well-formed external entity as a child of this element.  Unlike
  // the XML spec, I do not allow entities with leading and trailing
  // text -- they must begin and end on a tag.  The processing instruction
  // (PI) string identifies the output policy for the entity, which
  // may be SAVE_NONE, SAVE_INTERNAL or SAVE_EXTERNAL. SAVE_NONE entities
  // are not output on write, SAVE_INTERNAL makes the entity part of the
  // document and SAVE_EXTERNAL maintains the entity as an external file,
  // over-writing the existing one.  
  void addExternalEntity(const string &strEntityName, const string &strPI);
The entity actually has to exist before this call is made. This means that the file containing the entity should exist (possibly created by writing a sub-tree of the existing document) and it should be registered with the external entity processor. There are two entity processing classes: the entity processor, which deals with internal entities, and the external entity processor, which deals with external entities. The appropriate call to the external entity processor is:
  static void ExternalEntityProcessor::addEntity(string strName,
                                                 string strSysName,
                                                 string strPubName,
                                                 bool bBinary);
This will register the entity and the filenames associated with it in the external entity processor's static maps, and then when addExternalEntity is called the entity will be read from the appropriate file and represented as a child of the current element.
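
The idea of registering an entity in a static map before it is used can be sketched in a few lines of C++. The EntityRegistry class below is a hypothetical stand-in written for illustration; it mirrors the argument list of addEntity() shown above, but its Record type, its find() method and its internals are assumptions and say nothing about how ExternalEntityProcessor is actually implemented.

```cpp
#include <map>
#include <string>

// Stand-in for a static registry of external entities, keyed by name.
class EntityRegistry
{
  public:
    struct Record
    {
        std::string sysName;   // system identifier (file name)
        std::string pubName;   // public identifier, may be empty
        bool        binary;    // true for binary entities
    };

    // Register, or re-register, an entity under `name`.
    static void addEntity(const std::string &name,
                          const std::string &sysName,
                          const std::string &pubName,
                          bool binary)
    {
        Record record;
        record.sysName = sysName;
        record.pubName = pubName;
        record.binary  = binary;
        table()[name] = record;
    }

    // Look up a registered entity; returns a null pointer if unknown.
    static const Record *find(const std::string &name)
    {
        auto it = table().find(name);
        return it == table().end() ? nullptr : &it->second;
    }

  private:
    // Function-local static avoids initialization-order problems.
    static std::map<std::string, Record> &table()
    {
        static std::map<std::string, Record> entities;
        return entities;
    }
};
```

In the framework itself the corresponding sequence is a call to ExternalEntityProcessor::addEntity() with the entity name and its file names, followed by a call to addExternalEntity() on the element with the same entity name and one of the three policy strings.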

Summary

This chapter has dealt with the complex topic of entities: internal, external text and external binary. There are still a few details that will be introduced in the following chapter, but this should be sufficient to use entities for many practical things. Internal entities should be used as rarely as possible. Non-printing, non-whitespace characters will be automatically converted to numeric entities during XML generation, and this, together with the five entities predefined by the XML specification, should deal with most other cases. External text entities are intended as a powerful mechanism for storing configuration information that is shared between many documents. And binary entities are a means of bringing non-framework classes into the NPF fold.