Parser APIs

Extensible Markup Language (XML) describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application.

This C implementation of the XML processor (or parser) follows the W3C XML specification (rev REC-xml-19980210) and implements the required behavior of an XML processor: how it must read XML data and what information it must provide to the application.

The following is the general behavior of this parser:

  1. If an input's character encoding cannot be determined automatically by a BOM (Byte Order Mark) or XMLDecl, then UTF-8 is assumed. A separate, fast single-byte code path exists, as well as the multibyte path. To use this fast track, if your documents are single-byte (ASCII, ISO-8859, EBCDIC, etc), make sure to specify the correct input encoding and not let it default to UTF-8.
  2. Output encoding (DOM/SAX data) will be in the same encoding as the first input encountered. To explicitly set the output encoding, use xmlinitenc and pass in the extra outcoding argument. UTF-16 is supported.
  3. Messages are printed to stderr unless msghdlr is given. If you provide a message handler (and context), a numeric error code, error message, and context will be passed to this function instead. Error message text will be in UTF-8, and any data included as part of a message will be converted to UTF-8.
  4. DOM is the default interface for accessing a parsed document. To use SAX instead, specify a structure of SAX callback functions (and SAX context) at initialization time. Not all SAX functions need be provided; you can set any or all to NULL and only process those events you care about.
  5. The default behavior for the parser is to check that the input is well-formed, but not to validate. Set the xmlinit flag XML_FLAG_VALIDATE to turn on validation.
  6. Whitespace processing is fully conformant with the XML 1.0 spec, i.e. all whitespace is reported back to the application but it is indicated which whitespace is "ignorable". Some applications may want to set the XML_FLAG_DISCARD_WHITESPACE flag which will discard all whitespace between an end-element tag and the following start-element tag (such as newlines).
  7. Validation problems are printed (or passed to the error message callback) but do not halt validation. Set the flag XML_FLAG_STOP_ON_WARNING to cause validation to cease immediately on the first warning (as for an error).
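
Item 1 above describes a fallback order for choosing the input encoding. The following is a minimal sketch of that order, not the parser's actual code: the helper names are hypothetical, only the UTF-8 and UTF-16 byte order marks are recognised, and XMLDecl sniffing is left out.

```c
#include <stddef.h>
#include <string.h>

/* Returns the encoding implied by a leading byte order mark, or NULL
 * when the buffer does not start with one of the BOMs handled here.
 * Simplification: UTF-32 BOMs are not distinguished. */
static const char *encoding_from_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;
}

/* Fallback order from the text: BOM first, then the caller's default
 * (incoding), then UTF-8 when no default was given. */
static const char *pick_input_encoding(const unsigned char *buf, size_t len,
                                       const char *incoding)
{
    const char *enc = encoding_from_bom(buf, len);
    if (enc != NULL)
        return enc;
    return incoding != NULL ? incoding : "UTF-8";
}
```

With no BOM and an explicit default such as "ISO-8859-1", the default wins, which is the case where the single-byte fast path can apply.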

Calling Sequence

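The sequence of calls to the parser can be one of the following:

  1. Parsing a single document: xmlinit, xmlparse, xmlterm.
  2. Parsing multiple documents, where only the latest document needs to be available: xmlinit, then xmlparse followed by xmlclean for each document, then xmlterm.
  3. Parsing multiple documents, where all documents must remain available: xmlinit, then xmlparse for each document, then xmlterm.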
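
The multi-document cases differ only in whether xmlclean is called between parses. The sketch below shows that loop under stated assumptions: the typedefs, the XMLERR_OK value, and the bodies of xmlparsefile and xmlclean are stand-ins so the example compiles without the library (with the real library, its header replaces that section), and parse_each is a hypothetical helper. The caller is assumed to have obtained ctx from xmlinit and to call xmlterm afterwards.

```c
#include <stddef.h>

/* ---- STAND-INS: replace this section with the library header. ---- */
typedef unsigned char oratext;
typedef unsigned int  ub4;
typedef unsigned int  uword;
typedef struct xmlctx { int docs_live; } xmlctx;  /* real type is opaque */
#define XMLERR_OK 0                               /* value assumed */

static uword xmlparsefile(xmlctx *ctx, const oratext *path,
                          const oratext *incoding, ub4 flags)
{
    (void)path; (void)incoding; (void)flags;
    ctx->docs_live++;          /* pretend one more document is in memory */
    return XMLERR_OK;
}

static void xmlclean(xmlctx *ctx)
{
    ctx->docs_live = 0;        /* pretend earlier documents were recycled */
}
/* ---- end of stand-ins ---- */

/* Parses each path in turn. When keep_all is zero, xmlclean is called
 * before every parse after the first, so only the latest document's
 * data stays valid. Stops at the first parse error. */
static uword parse_each(xmlctx *ctx, const char *const *paths, size_t n,
                        int keep_all)
{
    for (size_t i = 0; i < n; i++) {
        if (!keep_all && i > 0)
            xmlclean(ctx);
        uword err = xmlparsefile(ctx, (const oratext *)paths[i], NULL, 0);
        if (err != XMLERR_OK)
            return err;
        /* ... work with the document for paths[i] here ... */
    }
    return XMLERR_OK;
}
```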

Memory Callbacks

The memory callback functions memcb may be used if you wish to use your own memory allocation. If they are used, all of the functions should be specified.

The memory allocated for parameters passed to the SAX callbacks or for nodes and data stored with the DOM parse tree will not be freed until one of the following is done:

  1. xmlparse or variant is called to parse another document.
  2. xmlclean is called.
  3. xmlterm is called.

Error Message Callbacks

By default, error messages are printed to stderr. An error message callback may be provided at initialization time, however. If given, error numbers and text are passed to that function, and the user may do whatever they wish with them. Location information (line number and source filename) is available through the xmlwhere function. This function should only be called while an error is in progress (i.e. while in the error callback function). Error message callback functions should be declared using the XML_MSGHDLRF function prototype macro.
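
As an illustration, a handler might record errors in its context instead of printing them. This is only a sketch: the argument list (context, message text, numeric code) is an assumption based on the description above, the typedefs are stand-ins, and a real handler should be declared with the XML_MSGHDLRF macro, which defines the actual prototype.

```c
#include <stdio.h>
#include <string.h>

typedef unsigned char oratext;  /* stand-in for the library typedef */
typedef unsigned int  ub4;      /* stand-in for the library typedef */

/* Hypothetical message context: counts errors and keeps the latest. */
typedef struct {
    unsigned count;
    ub4      last_code;
    char     last_msg[256];
} errlog;

/* ASSUMED prototype; with the library, declare this through
 * XML_MSGHDLRF(collect_error) rather than by hand. */
static void collect_error(void *msgctx, const oratext *msg, ub4 errcode)
{
    errlog *log = (errlog *)msgctx;
    log->count++;
    log->last_code = errcode;
    /* Message text is UTF-8 per the text above; copy with truncation. */
    snprintf(log->last_msg, sizeof log->last_msg, "%s", (const char *)msg);
}
```

The address of an errlog would be passed as msgctx at initialization time, so the application can inspect it after a parse returns.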


I/O Callbacks

Document input is handled through a set of I/O callback functions. For most access methods (HTTP, FTP, filesystem, etc), built-in callbacks are provided. For other methods, notably stream, the user must specify their own callbacks, as none will be provided. Any of the built-in callbacks may be overridden with user-defined ones. The function xmlaccess sets the callbacks for the given access method (xmlacctype).


Thread Safety

If threads are forked off somewhere in the midst of the init-parse-term sequence of calls, you will get unpredictable behavior and results.


Data Types Index

  oratext    - String pointer used for all data encodings, cast as needed; for UTF-16, to (ub2 *)
  xmlctx     - Top-level XML context
  xmlmemcb   - Memory callback structure (optional)
  xmlsaxcb   - SAX callback structure (SAX only)
  xmlacctype - XML access type (HTTP, FTP, File, etc)
  ub4        - 32-bit (or larger) unsigned integer
  uword      - Native unsigned integer

Function Index

  xmlaccess        - Set I/O access method callbacks
  xmlclean         - Clean up memory used during parse
  xmlinit          - Initialize XML parser
  xmlinitenc       - Initialize XML parser specifying DOM data encoding
  xmlparse         - Parse a URI
  xmlparsebuf      - Parse a buffer
  xmlparsefile     - Parse a file
  xmlparsestream   - Parse a user-defined stream
  xmlterm          - Shut down XML parser
  xmlwhere         - Return error location information
  createDocument   - Create a new document
  createDocumentNS - Create a new document (namespace aware)
  isSingleChar     - Is document data single/multibyte?
  isUnicode        - Is document data Unicode?
  isStandalone     - Is document standalone?
  getEncoding      - Return document's encoding

Data Structures and Types


oratext

xmlctx

xmlmemcb

xmlsaxcb

xmlacctype

ub4

uword


Functions


xmlaccess

Purpose

Sets the I/O callback functions for the given access method.

Syntax
uword xmlaccess(xmlctx *ctx, xmlacctype access, XML_OPENF((*openf)),
		XML_CLOSEF((*closef)), XML_READF((*readf)));

Parameters

 ctx    (IN) - The XML context
 access (IN) - access method enum, XMLACCESS_xxx
 openf  (IN) - Open-input callback function
 closef (IN) - Close-input callback function
 readf  (IN) - Read-input callback function
Comments

Sets the I/O callback functions for the given access method. Most methods have built-in callback functions, so none need be provided by the user. The notable exception is XMLACCESS_STREAM (user-defined streams), where the user must set the stream callback functions themselves.

The three callback functions are invoked to open, close, and read from the input source. The functions should be declared using the function prototype macros XML_OPENF, XML_CLOSEF and XML_READF.

XML_OPENF is the open function, called once to open the input source. It should set its persistent handle in the xmlihdl union, which has two choices, a generic pointer (void *), and an integer (as unix file or socket handle). This function must return XMLERR_OK on success. Args:

 ctx    (IN)  - XML context
 path   (IN)  - full path to the source to be opened
 parts  (IN)  - path broken down into components; opaque pointer
 length (OUT) - total length of input source, if known (0 if not known)
 ih     (OUT) - the opened handle is placed here
XML_CLOSEF is the close function; it closes an open source and frees resources. Args:

 ctx    (IN) - XML context
 ih     (IN) - input handle union
XML_READF is the reader function; it reads data from an open source into a buffer, and returns the number of bytes read. On EOI, the matching close function will be called automatically. Args:

 ctx      (IN)  - XML context
 path     (IN)  - full path to the source to be opened; only
                  provided here for use in error messages
 ih       (IN)  - input handle union
 dest     (OUT) - destination buffer to read data into
 destsize (IN)  - size of dest
 nraw     (OUT) - number of bytes read
 eoi      (OUT) - hit End of Information?
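
To make the three roles concrete, here is a sketch of open, read, and close callbacks over an in-memory source. It does not use the real prototypes: the argument order follows the lists above, but the C types, the xmlihdl layout, the XMLERR_OK value, and the convention of returning a status while reporting the byte count through nraw are all assumptions. The helper mem_set_source and the memstream type are hypothetical. Real callbacks should be declared with XML_OPENF, XML_CLOSEF and XML_READF.

```c
#include <stddef.h>
#include <string.h>

#define XMLERR_OK 0                             /* value assumed */
typedef union { void *ptr; int fd; } xmlihdl;   /* ASSUMED layout */

/* Hypothetical in-memory source shared by the callbacks below. */
typedef struct { const unsigned char *data; size_t len, pos; } memstream;
static memstream g_source;

static void mem_set_source(const unsigned char *data, size_t len)
{
    g_source.data = data;
    g_source.len = len;
    g_source.pos = 0;
}

/* Open: store the persistent handle in ih and report the total length. */
static unsigned mem_open(void *ctx, const char *path, void *parts,
                         size_t *length, xmlihdl *ih)
{
    (void)ctx; (void)path; (void)parts;
    g_source.pos = 0;
    ih->ptr = &g_source;
    *length = g_source.len;
    return XMLERR_OK;
}

/* Read: copy up to destsize bytes, set nraw, flag EOI when exhausted. */
static unsigned mem_read(void *ctx, const char *path, xmlihdl *ih,
                         unsigned char *dest, size_t destsize,
                         size_t *nraw, int *eoi)
{
    memstream *ms = (memstream *)ih->ptr;
    size_t left = ms->len - ms->pos;
    size_t n = left < destsize ? left : destsize;
    (void)ctx; (void)path;
    memcpy(dest, ms->data + ms->pos, n);
    ms->pos += n;
    *nraw = n;
    *eoi = (ms->pos == ms->len);
    return XMLERR_OK;
}

/* Close: nothing to free for a memory source; just clear the handle. */
static void mem_close(void *ctx, xmlihdl *ih)
{
    (void)ctx;
    ih->ptr = NULL;
}
```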


xmlinit, xmlinitenc

Purpose

Initializes the C XML parser. It must be called before any parsing can take place.

Syntax
xmlctx *xmlinit(uword *err, const oratext *incoding, 
		XML_MSGHDLRF((*msghdlr)), void *msgctx,
		const xmlsaxcb *saxcb, void *saxcbctx, 
                const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
xmlctx *xmlinitenc(uword *err, const oratext *incoding, const oratext *outcoding,
		   XML_MSGHDLRF((*msghdlr)), void *msgctx,
		   const xmlsaxcb *saxcb, void *saxcbctx, 
                   const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
Parameters
   err       (OUT) - Numeric error code, on failure
   incoding  (IN)  - default input character set encoding
   outcoding (IN)  - output (DOM/SAX data) character set encoding (xmlinitenc only)
   msghdlr   (IN)  - Error message handler function
   msgctx    (IN)  - Context for the error message handler
   saxcb     (IN)  - SAX callback structure (filled with function pointers)
   saxcbctx  (IN)  - Context for SAX callbacks
   memcb     (IN)  - Memory function callback structure
   memcbctx  (IN)  - Context for the memory function callbacks
   lang      (IN)  - Language for error messages
Comments

Do not call any other XML parser functions if this is not successful!

This function should only be called once before parsing any XML files. xmlterm should be called after all parsing and DOM use has completed. Multiple parses should call xmlclean between runs if only the current document needs to be available. Until xmlclean is called, data pointers from all previous parses will continue to be valid.

All arguments may be NULL except for err, which is required. On success, an XML context (xmlctx *) is returned. If this is NULL, a failure occurred and the numeric error code is stored in *err.

Data Encoding

The encoding of input documents is detected automatically (by BOM, XMLDecl, etc). If the encoding cannot be determined, incoding is assumed. If incoding is not specified (NULL), UTF-8 is assumed. incoding should be an IANA/MIME encoding name, e.g. "UTF-16", "ASCII", etc.

NOTE: A separate, fast code path exists for single-byte character sets like ASCII, ISO-8859, and EBCDIC. This path is considerably faster than the UTF-8 multibyte path, so if you are sure your input documents are single-byte, you are strongly encouraged to say so by setting the incoding.

The encoding which data will be presented as (through DOM/SAX) is given as outcoding. If not specified, UTF-8 is chosen. Unicode (UTF-16) is supported. Since DOM/SAX APIs specify (oratext *) as data pointers, for Unicode these should be cast to (ub2 *).

NOTE: For backwards compatibility (until the next major release), xmlinit will set the outcoding to the input encoding of the first document parsed, to simulate the old behavior. For xmlinitenc, the output encoding is explicitly specified.
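
The cast to (ub2 *) mentioned above can be sketched as follows. The typedefs are stand-ins for the library's own, utf16_units is a hypothetical helper, and the sketch assumes the data is suitably aligned and terminated by a 16-bit zero, neither of which is stated in this document.

```c
#include <stddef.h>

typedef unsigned char  oratext;  /* stand-in for the library typedef */
typedef unsigned short ub2;      /* stand-in: 16-bit unsigned integer */

/* Counts the 16-bit code units in UTF-16 data handed back through an
 * (oratext *) pointer. ASSUMES 2-byte alignment and a 16-bit zero
 * terminator. Surrogate pairs count as two units. */
static size_t utf16_units(const oratext *data)
{
    const ub2 *p = (const ub2 *)data;
    size_t n = 0;
    while (p[n] != 0)
        n++;
    return n;
}
```

Byte-oriented functions such as strlen would stop at the first zero byte, which for most UTF-16 text is inside the first character.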

Error Messages, Language

By default, error messages are printed to stderr. To handle messages yourself, specify a handler function pointer. The formatted error string and numeric error code will be passed to your function, along with the user-defined message context msgctx. The error strings will be UTF-8; any data included as part of the error message will be converted to UTF-8. If you need the line number and path/URL where the error occurred, the xmlwhere function returns this information, but it may only be called from the user's callback function (while the error is in progress).

The error language is specified as lang, e.g. "AMERICAN", "JAPANESE", "FRENCH", etc, and defaults to American.

SAX vs DOM

By default, a DOM parse tree is built. To use SAX instead, specify a SAX callback structure (saxcb). The callbacks will be invoked with the given SAX context pointer. If any of the SAX functions returns an error (non-zero), parsing stops immediately.

Memory Allocation

The parser allocates memory in large chunks. The default system memory allocator (malloc etc) will be used to allocate and free the chunks unless a memory callback structure is provided. If given, it contains function pointers to alloc/free functions which will be used instead. The memory callback context memcbctx is passed to the callback functions.
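
A pair of allocation callbacks that count calls might look like the sketch below. The structure layout is an assumption: this document only establishes that xmlmemcb carries alloc and free function pointers and that memcbctx is passed to them, so demo_memcb, its field names, and the argument order are hypothetical stand-ins.

```c
#include <stdlib.h>

/* Hypothetical callback context: simple allocation statistics. */
typedef struct { size_t allocs; size_t frees; } memstats;

/* ASSUMED shape, standing in for xmlmemcb. */
typedef struct {
    void *(*alloc_fn)(void *ctx, size_t size);
    void  (*free_fn)(void *ctx, void *ptr);
} demo_memcb;

/* Allocates with malloc and counts successful allocations. */
static void *counting_alloc(void *ctx, size_t size)
{
    memstats *st = (memstats *)ctx;
    void *p = malloc(size);
    if (p != NULL)
        st->allocs++;
    return p;
}

/* Frees with free and counts non-NULL releases. */
static void counting_free(void *ctx, void *ptr)
{
    memstats *st = (memstats *)ctx;
    if (ptr != NULL) {
        st->frees++;
        free(ptr);
    }
}

/* Both pointers are set, matching the advice to specify all functions. */
static const demo_memcb counting_memcb = { counting_alloc, counting_free };
```

Because the parser allocates in large chunks, counters like these would reflect chunk requests rather than individual nodes or strings.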

Error Codes

  XMLERR_NLS_INIT      - The National Language Service package could not be initialized. Perhaps an installation or configuration problem.
  XMLERR_INVALID_MEMCB - A memory callback structure (memcb) was specified, but it did not have alloc and free function pointers.
  XMLERR_BAD_ENCODING  - An encoding was not known. Use IANA/MIME names for encodings, and make sure NLS data is present.
  XMLERR_INVALID_LANG  - The language specified for error messages was not known.
  XMLERR_LEH_INIT      - The LEH (catch/throw) package could not be initialized. An internal error, contact support.


xmlclean

Purpose

Frees any memory used during the previous parse.

Syntax
void xmlclean(xmlctx *ctx);
Parameters

ctx (IN) - The XML parser context

Comments

Recycles memory within the XML parser, but does not free it to the system-- only xmlterm finally releases all memory back to the system. If xmlclean is not called between parses, then the data used by the previous documents remains allocated, and pointers to it are valid. Thus, the data for multiple documents can be accessible simultaneously, although only the current document can be manipulated with DOM.

If you just want to access one document's data at a time (within a single context), then call xmlclean before each new parse.


xmlparse, xmlparsebuf, xmlparsefile, xmlparsestream

Purpose

These functions invoke the XML parser on various input sources. The parser must have been initialized successfully with a call to xmlinit first.

Syntax
uword xmlparse(xmlctx *ctx, const oratext *uri,
               const oratext *incoding, ub4 flags);
uword xmlparsebuf(xmlctx *ctx, const oratext *buffer, size_t len,
                  const oratext *incoding, ub4 flags);
uword xmlparsefile(xmlctx *ctx, const oratext *path,
                   const oratext *incoding, ub4 flags);
uword xmlparsestream(xmlctx *ctx, const void *stream,
                     const oratext *incoding, ub4 flags);
Parameters

 ctx      (IN/OUT) - The XML parser context
 uri      (IN)     - URI of XML document (xmlparse only)
 path     (IN)     - path of XML document (xmlparsefile only)
 buffer   (IN)     - input buffer (xmlparsebuf only)
 len      (IN)     - length of the buffer (xmlparsebuf only)
 stream   (IN)     - input stream (xmlparsestream only)
 incoding (IN)     - default input character set encoding
 flags    (IN)     - mask of parser options
Comments

Parser options are specified as flag bits OR'd together into the flags mask. Flag bits are:

  XML_FLAG_VALIDATE           - Turn validation on
  XML_FLAG_DISCARD_WHITESPACE - Discard whitespace where it appears to be extraneous (end-of-line etc)
  XML_FLAG_STOP_ON_WARNING    - Stop validation on warnings

By default, the parser does not validate the input. To validate against a DTD, set the XML_FLAG_VALIDATE flag. Validation problems are considered warnings, not errors, and by default validation will continue after warnings have occurred. To treat warnings as errors, set the flag XML_FLAG_STOP_ON_WARNING.

The default behavior for whitespace processing is to be fully conformant to the XML 1.0 spec, i.e. all whitespace is reported back to the application, but it is indicated which whitespace is "ignorable". However, some applications may prefer to set the XML_FLAG_DISCARD_WHITESPACE flag, which will discard all whitespace between an end-element tag and the following start-element tag.

The default input encoding may be specified as incoding, which overrides the incoding given to xmlinit. If the input's encoding cannot be determined automatically (based on BOM, XMLDecl, etc) then it is assumed to be incoding. IANA/MIME encoding names should be used, "UTF-8", "ASCII", etc.

Data pointers returned by DOM APIs remain valid until xmlclean or xmlterm is called.

For SAX, the data pointers only remain valid for the duration of the user's callback function. That is, once the callback function has returned, the data pointers become invalid. If longer access is needed, the data can be stored in the XML memory pool using stringSave (or stringSave2 for UCS2 data).
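
The lifetime rule for SAX data can be illustrated with a callback that copies what it receives. This sketch copies into application-owned storage with realloc rather than using stringSave, and the callback shape (context, data pointer, length, non-zero return to stop the parse) is a hypothetical simplification; the real prototypes are defined by the xmlsaxcb structure.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned char oratext;  /* stand-in for the library typedef */

/* Hypothetical SAX context: accumulated character data, owned by the
 * application and released with free(). */
typedef struct { char *text; size_t len; } textbuf;

/* HYPOTHETICAL character-data callback. The bytes at ch are copied
 * before returning because the pointer is not valid afterwards. */
static int on_characters(void *saxctx, const oratext *ch, size_t len)
{
    textbuf *tb = (textbuf *)saxctx;
    char *grown = realloc(tb->text, tb->len + len + 1);
    if (grown == NULL)
        return 1;               /* non-zero: ask the parser to stop */
    memcpy(grown + tb->len, ch, len);
    tb->len += len;
    grown[tb->len] = '\0';
    tb->text = grown;
    return 0;
}
```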

Streams: A stream is a user-defined entity here-- all that's passed in is a stream/context pointer, which is in turn passed to the I/O callback functions. The parser does not reference the stream directly.

DTD: The DTD parser invokes the XML parser on an external DTD, not a complete document. It is used mainly by the Class Generator so that classes may be generated from a DTD without needing a complete (dummy) document.


xmlterm

Purpose

Terminates the XML parser. It should be called after xmlinit, and before exiting the main program.

Syntax
uword xmlterm(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function tears down the parser. It frees all allocated memory, giving it back to the system (through free or the user's memory callback). Contrast to xmlclean, which recycles memory internally without giving it back to the system.

No additional XML parser calls can be made until xmlinit is called again to get a new context.


xmlwhere

Purpose

Return error location information for the last (current) error.

Syntax
uword xmlwhere(xmlctx *ctx, ub4 *line, oratext **path, uword idx);
Parameters
  ctx  (IN)  - the XML parser context
  line (OUT) - line# where the error occurred
  path (OUT) - source path/URL where error occurred
  idx  (IN)  - error# in stack (starting at 0)
Comments

Returns the location information for the idx'th error on the stack. This function should only be called while an error is in progress, i.e. from within an error message callback function. Since errors can occur in nested inputs (document A includes document B, which includes document C, which contains an error), more than one location may be available. The highest-level input file is idx 0, the next level down is 1, etc. If only the highest level is desired, just call once with idx=0. If all levels are desired, loop starting with idx=0 and incrementing until the function returns FALSE.
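
The loop over all levels can be sketched as below. To keep the example self-contained, xmlwhere is replaced by a stand-in that serves a fixed three-level stack, and the typedefs are stand-ins as well; format_error_stack is a hypothetical helper that would be called from inside an error message callback.

```c
#include <stdio.h>
#include <string.h>

typedef unsigned char oratext;   /* stand-ins for the library typedefs */
typedef unsigned int  ub4;
typedef unsigned int  uword;
typedef struct xmlctx xmlctx;

/* STAND-IN for the library's xmlwhere: a fixed stack of three inputs,
 * outermost first. Returns 0 (FALSE) past the last level. */
static const char *stub_paths[] = {"a.xml", "b.xml", "c.xml"};
static const ub4   stub_lines[] = {12, 3, 40};

static uword xmlwhere(xmlctx *ctx, ub4 *line, oratext **path, uword idx)
{
    (void)ctx;
    if (idx >= 3)
        return 0;
    *line = stub_lines[idx];
    *path = (oratext *)stub_paths[idx];
    return 1;
}

/* Writes "path:line > path:line ..." for every level, starting at
 * idx 0 and stopping when xmlwhere returns FALSE or the buffer is
 * full. Returns the number of levels written. */
static size_t format_error_stack(xmlctx *ctx, char *out, size_t outsize)
{
    ub4 line;
    oratext *path;
    uword idx;
    size_t used = 0;

    if (outsize > 0)
        out[0] = '\0';
    for (idx = 0; xmlwhere(ctx, &line, &path, idx); idx++) {
        int n = snprintf(out + used, outsize - used, "%s%s:%u",
                         idx ? " > " : "", (const char *)path,
                         (unsigned)line);
        if (n < 0 || (size_t)n >= outsize - used)
            break;              /* would not fit: stop here */
        used += (size_t)n;
    }
    return idx;
}
```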


createDocument, createDocumentNS

Purpose

Creates a new document in memory.

Syntax
xmlnode* createDocument(xmlctx *ctx);

xmlnode* createDocumentNS(xmldomimp *imp, oratext *uri, oratext *qname, xmlnode *dtd);

Parameters
ctx   (IN) - XML parser context
imp   (IN) - XML DOMImplementation (see getImplementation)
uri   (IN) - New document's namespace URI
qname (IN) - Namespace qualified name of new document (DOCUMENT_NODE's name)
dtd   (IN) - DTD this document is associated with
Comments

The original function createDocument has now been standardized in DOM 2.0 CORE. For compatibility, the old function remains with its original usage, and the new CORE function is called createDocumentNS.

Creates a new document in memory. An XML document is always rooted in a node of type DOCUMENT_NODE-- this function creates that root node and sets it in the context. There can be only one current document and hence only one document node; if one already exists, this function does nothing and returns NULL.

For createDocumentNS, if a DTD is specified, its ownerDocument attribute will be set to the document being created.


isStandalone

Purpose

Return value of document's standalone flag.

Syntax
boolean isStandalone(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function returns the boolean value of the document's standalone flag, as specified in the XML declaration (<?xml ... ?>).


isSingleChar

Purpose

Return value of "simple encoding" flag.

Syntax
boolean isSingleChar(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function returns the boolean value of the document's "simple" encoding flag. If the document is single-byte encoded (ASCII, ISO-8859, EBCDIC, etc), TRUE is returned; otherwise, encoding is multibyte or Unicode and FALSE is returned. See also the getEncoding function which returns the name of the specific encoding, and isUnicode which tests for Unicode data.


isUnicode

Purpose

Return value of Unicode encoding flag.

Syntax
boolean isUnicode(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function returns the flag which determines whether DOM/SAX data for this context is in Unicode (UCS2).


getEncoding

Purpose

Returns the IANA/MIME name of the character encoding used by the document, e.g. "ASCII", "ISO-8859-1", "UTF-8", "UTF-16", etc.

Syntax
oratext *getEncoding(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function returns the name of the document's encoding, e.g. "ASCII", "UTF-8", etc. See also the isSingleChar function, which can be used to simply determine if the document is single or multibyte, and the isUnicode function, which determines if the input is Unicode (UTF-16).