Parser APIs
Extensible Markup Language (XML) describes a class of data objects
called XML documents and partially describes the behavior of computer
programs which process them. XML is an application profile or restricted
form of SGML, the Standard Generalized Markup Language [ISO 8879]. By
construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called entities, which contain
either parsed or unparsed data. Parsed data is made up of characters, some
of which form character data, and some of which form markup. Markup encodes
a description of the document's storage layout and logical structure. XML
provides a mechanism to impose constraints on the storage layout and logical
structure.
A software module called an XML processor is used to read XML documents and
provide access to their content and structure. It is assumed that an XML
processor is doing its work on behalf of another module, called the
application.
This C implementation of the XML processor (or parser) follows the W3C XML
specification (rev REC-xml-19980210) and includes the required behavior of
an XML processor in terms of how it must read XML data and the information
it must provide to the application.
The following is the general behavior of this parser:
- If an input's character encoding cannot be determined automatically
by a BOM (Byte Order Mark) or XMLDecl, then UTF-8 is assumed. A separate,
fast single-byte code path exists, as well as the multibyte path. To use
this fast track, if your documents are single-byte (ASCII, ISO-8859, EBCDIC,
etc), make sure to specify the correct input encoding and not let it default
to UTF-8.
- Output encoding (DOM/SAX data) will be in the same encoding as the
first input encountered. To explicitly set the output encoding, use
xmlinitenc and pass in the extra outcoding argument. UTF-16
is supported.
- Messages are printed to stderr unless msghdlr is given.
If you provide a message handler (and context), a numeric error code,
error message, and context will be passed to this function instead. Error
message text will be in UTF-8, and any data included as part of a message
will be converted to UTF-8.
- DOM is the default interface for accessing a parsed document. To
use SAX instead, specify a structure of SAX callback functions (and SAX
context) at initialization time. Not all SAX functions need be provided;
you can set any or all to NULL and only process those events you care about.
- The default behavior for the parser is to check that the input is
well-formed, but not to validate. Set the xmlinit flag
XML_FLAG_VALIDATE to turn on validation.
- Whitespace processing is fully conformant with the XML 1.0 spec,
i.e. all whitespace is reported back to the application but it is indicated
which whitespace is "ignorable". Some applications may want to
set the XML_FLAG_DISCARD_WHITESPACE flag which will discard all whitespace
between an end-element tag and the following start-element tag (such as
newlines).
- Validation problems are printed (or passed to the error message callback)
but do not halt validation. Set the flag XML_FLAG_STOP_ON_WARNING to cause
validation to cease immediately on the first warning (as for an error).
Calling Sequence
The sequence of calls to the parser can be:
Parsing a single document:
- xmlinit - xmlparsexxx - xmlterm
Parsing multiple documents, but only the latest document needs to be
available:
- xmlinit - xmlparsexxx - xmlclean - xmlparsexxx - xmlclean ... xmlterm
Parsing multiple documents, all documents must be available:
- xmlinit - xmlparsexxx - xmlparsexxx ... xmlterm
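The second calling sequence above (only the latest document kept available) can be sketched as follows. This is a minimal illustration, not the parser's own code: `demo_ctx`, `demo_init`, `demo_parse`, `demo_clean` and `demo_term` are hypothetical stand-ins that only record the order of calls so the sketch compiles without the parser library. Real code would call xmlinit, one of the xmlparsexxx functions, xmlclean and xmlterm in the same order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char oratext;
typedef unsigned int uword;

/* Stand-in context: records the order of calls instead of parsing. */
typedef struct demo_ctx { char log[128]; } demo_ctx;

static void log_call(demo_ctx *ctx, const char *name)
{
    size_t room = sizeof ctx->log - strlen(ctx->log) - 1;
    if (ctx->log[0] != '\0' && room > 0) {
        strncat(ctx->log, " ", room);
        room--;
    }
    strncat(ctx->log, name, room);
}

/* Hypothetical stand-ins for xmlinit / xmlparse / xmlclean / xmlterm. */
static demo_ctx *demo_init(demo_ctx *storage, uword *err)
{
    storage->log[0] = '\0';
    *err = 0;
    log_call(storage, "init");
    return storage;
}
static uword demo_parse(demo_ctx *ctx, const oratext *uri)
{
    (void)uri;
    log_call(ctx, "parse");
    return 0;
}
static void demo_clean(demo_ctx *ctx) { log_call(ctx, "clean"); }
static uword demo_term(demo_ctx *ctx) { log_call(ctx, "term"); return 0; }

/* Parse several documents in turn, keeping only the latest one available. */
uword parse_latest_only(demo_ctx *storage, const char *const *uris, size_t count)
{
    uword err = 0;
    demo_ctx *ctx;
    size_t i;

    assert(storage != NULL && uris != NULL);
    ctx = demo_init(storage, &err);
    if (ctx == NULL)
        return err;              /* make no other parser call after a failed init */
    for (i = 0; i < count; i++) {
        err = demo_parse(ctx, (const oratext *)uris[i]);
        if (err != 0)
            break;
        /* ... use document i here, before its memory is recycled ... */
        demo_clean(ctx);
    }
    demo_term(ctx);              /* always release the context at the end */
    return err;
}
```

Dropping the `demo_clean` call from the loop gives the third sequence, in which data from every parsed document stays available until termination.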
Memory Callbacks
The memory callback functions memcb may be used if you wish to use your
own memory allocation. If they are used, all of the functions should be
specified.
The memory allocated for parameters passed to the SAX callbacks or for
nodes and data stored with the DOM parse tree will not be freed until one
of the following is done:
- xmlparse or variant is called to parse another document.
- xmlclean is called.
- xmlterm is called.
Error Message Callbacks
By default, error messages are printed to stderr. An error message
callback may be provided at initialization time, however. If given,
error numbers and text are passed to that function, and the user may do
whatever they wish with them. Location information (line number and
source filename) is available through the xmlwhere function. This
function should only be called while an error is in progress (i.e. while
in the error callback function). Error message callback functions should
be declared using the XML_MSGHDLRF function prototype macro.
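As an illustration, a handler that collects messages instead of printing them might look like the sketch below. The parameter order (context, message text, error code) and the `msg_log` structure are assumptions made for this example. A real handler should be declared with the XML_MSGHDLRF macro so that it matches whatever signature the parser header defines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned char oratext;
typedef unsigned int ub4;

/* Hypothetical message context: a count plus the most recent message. */
typedef struct msg_log {
    unsigned count;
    ub4      last_code;
    char     last_text[256];
} msg_log;

/* Assumed handler shape; the real prototype comes from XML_MSGHDLRF. */
void collect_message(void *msgctx, const oratext *msg, ub4 errcode)
{
    msg_log *log = (msg_log *)msgctx;

    assert(log != NULL && msg != NULL);
    log->count++;
    log->last_code = errcode;
    /* Message text is UTF-8; copy it, truncating if it does not fit. */
    snprintf(log->last_text, sizeof log->last_text, "%s", (const char *)msg);
}
```

The address of a `msg_log` would be passed as the message context at initialization time, so each call receives the same structure.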
I/O Callbacks
Document input is handled through a set of I/O callback functions. For most
access methods (HTTP, FTP, filesystem, etc), built-in callbacks are provided.
For other methods, notably stream, the user must specify their
own callbacks, as none will be provided. Any of the built-in callbacks may
be overridden with user-defined ones.
The function xmlaccess sets the callbacks for
the given access method (xmlacctype).
Thread Safety
If threads are forked off somewhere in the midst of the init-parse-term
sequence of calls, you will get unpredictable behavior and results.
Data Types Index
| oratext | String pointer used for all data encodings, cast as needed; for UTF-16, to (ub2 *) |
| xmlctx | Top-level XML context |
| xmlmemcb | Memory callback structure (optional) |
| xmlsaxcb | SAX callback structure (SAX only) |
| xmlacctype | XML access type (HTTP, FTP, File, etc) |
| ub4 | 32-bit (or larger) unsigned integer |
| uword | Native unsigned integer |
Function Index
Data Structures and Types
typedef unsigned char oratext;
typedef struct xmlctx xmlctx;
Note: The contents of xmlctx are private and must not be accessed by users.
struct xmlmemcb
{
void *(*alloc)(void *ctx, size_t size);
void (*free)(void *ctx, void *ptr);
void *(*realloc)(void *ctx, void *ptr, size_t size);
};
typedef struct xmlmemcb xmlmemcb;
Note: This is the memory callback structure.
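A minimal sketch of filling in this structure follows, assuming the goal is simply to count allocations on top of the C library allocator. The structure is repeated here only so the example compiles on its own; real code would take it from the parser header. `mem_stats`, the `counting_*` functions and `counting_memcb` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Repeated from the definition above so the sketch is self-contained. */
struct xmlmemcb
{
    void *(*alloc)(void *ctx, size_t size);
    void  (*free)(void *ctx, void *ptr);
    void *(*realloc)(void *ctx, void *ptr, size_t size);
};
typedef struct xmlmemcb xmlmemcb;

/* Hypothetical callback context: counts allocations and frees. */
typedef struct mem_stats { size_t allocs; size_t frees; } mem_stats;

static void *counting_alloc(void *ctx, size_t size)
{
    assert(ctx != NULL);
    ((mem_stats *)ctx)->allocs++;
    return malloc(size);
}

static void counting_free(void *ctx, void *ptr)
{
    assert(ctx != NULL);
    if (ptr != NULL)
        ((mem_stats *)ctx)->frees++;
    free(ptr);
}

static void *counting_realloc(void *ctx, void *ptr, size_t size)
{
    (void)ctx;                   /* resizing does not change the counts */
    return realloc(ptr, size);
}

/* All three members are set, as the memory callback notes recommend. */
const xmlmemcb counting_memcb = {
    counting_alloc,
    counting_free,
    counting_realloc
};
```

The structure and a pointer to a `mem_stats` would be passed as memcb and memcbctx at initialization. The members are called through parentheses, for example `(counting_memcb.free)(ctx, ptr)`, in case a C library defines `free` as a macro.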
struct xmlsaxcb
{
sword (*startDocument)(void *ctx);
sword (*endDocument)(void *ctx);
sword (*startElement)(void *ctx, const oratext *name,
const struct xmlattrs *attrs);
sword (*endElement)(void *ctx, const oratext *name);
sword (*characters)(void *ctx, const oratext *ch, size_t len);
sword (*ignorableWhitespace)(void *ctx, const oratext *ch,
size_t len);
sword (*processingInstruction)(void *ctx, const oratext *target,
const oratext *data);
sword (*notationDecl)(void *ctx, const oratext *name,
const oratext *publicId,
const oratext *systemId);
sword (*unparsedEntityDecl)(void *ctx, const oratext *name,
const oratext *publicId,
const oratext *systemId,
const oratext *notationName);
sword (*nsStartElement)(void *ctx, const oratext *qname,
const oratext *local,
const oratext *namespace,
const struct xmlattrs *attrs);
/* The following 8 fields are reserved for future use. */
void (*empty1)();
void (*empty2)();
void (*empty3)();
void (*empty4)();
void (*empty5)();
void (*empty6)();
void (*empty7)();
void (*empty8)();
};
typedef struct xmlsaxcb xmlsaxcb;
Note: Callbacks for SAX-like API.
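The sketch below illustrates two points from the text: callbacks that are not needed can be left NULL, and a non-zero return value stops processing. It is not the parser's dispatch code. `demo_saxcb` is a trimmed, hypothetical stand-in with three of the members above, `sword` is assumed to be a signed native integer, and `demo_dispatch` only imitates how events might be delivered.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char oratext;
typedef int sword;               /* assumption: signed native integer */
struct xmlattrs;                 /* opaque here */

/* Trimmed stand-in for xmlsaxcb: three of the documented members. */
typedef struct demo_saxcb {
    sword (*startElement)(void *ctx, const oratext *name,
                          const struct xmlattrs *attrs);
    sword (*endElement)(void *ctx, const oratext *name);
    sword (*characters)(void *ctx, const oratext *ch, size_t len);
} demo_saxcb;

/* Hypothetical SAX context: counters plus an element limit. */
typedef struct sax_counts {
    unsigned elements;
    size_t   text_bytes;
    unsigned limit;
} sax_counts;

static sword on_start(void *ctx, const oratext *name,
                      const struct xmlattrs *attrs)
{
    sax_counts *c = (sax_counts *)ctx;
    (void)name; (void)attrs;
    c->elements++;
    return c->elements > c->limit ? 1 : 0;   /* non-zero: stop parsing */
}

static sword on_chars(void *ctx, const oratext *ch, size_t len)
{
    (void)ch;
    ((sax_counts *)ctx)->text_bytes += len;
    return 0;
}

/* endElement is left NULL: those events are simply not processed. */
const demo_saxcb demo_callbacks = { on_start, NULL, on_chars };

/* Imitation of event delivery: one start, text and end event per name. */
sword demo_dispatch(const demo_saxcb *cb, void *ctx,
                    const char *const *names, size_t count)
{
    size_t i;
    sword rc;

    assert(cb != NULL);
    for (i = 0; i < count; i++) {
        const oratext *name = (const oratext *)names[i];
        if (cb->startElement && (rc = cb->startElement(ctx, name, NULL)) != 0)
            return rc;
        if (cb->characters && (rc = cb->characters(ctx, name, 1)) != 0)
            return rc;
        if (cb->endElement && (rc = cb->endElement(ctx, name)) != 0)
            return rc;
    }
    return 0;
}
```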
typedef unsigned int ub4;
typedef unsigned int uword;
Functions
xmlaccess
- Purpose
- Sets the I/O callback functions for the given access method.
- Syntax
uword xmlaccess(xmlctx *ctx, xmlacctype access, XML_OPENF((*openf)),
XML_CLOSEF((*closef)), XML_READF((*readf)));
- Parameters
ctx (IN) - The XML context
access (IN) - access method enum, XMLACCESS_xxx
openf (IN) - Open-input callback function
closef (IN) - Close-input callback function
readf (IN) - Read-input callback function
- Comments
- Sets the I/O callback functions for the given access method. Most
methods have built-in callback functions, so none need be provided by the user.
The notable exception is XMLACCESS_STREAM, user-defined streams, where the
user must set the stream callback functions themselves.
- The three callback functions are invoked to open, close, and read from
the input source. The functions should have been declared using
the function prototype macros XML_OPENF, XML_CLOSEF and XML_READF.
- XML_OPENF is the open function, called once to open the input
source. It should set its persistent handle in the xmlihdl
union, which has two choices, a generic pointer (void *), and
an integer (as unix file or socket handle). This function
must return XMLERR_OK on success. Args:
ctx (IN) - XML context
path (IN) - full path to the source to be opened
parts (IN) - path broken down into components; opaque pointer
length (OUT) - total length of input source, if known (0 if not known)
ih (OUT) - the opened handle is placed here
- XML_CLOSEF is the close function; it closes an open source and
frees resources. Args:
ctx (IN) - XML context
ih (IN) - input handle union
- XML_READF is the reader function; it reads data from an open
source into a buffer, and returns the number of bytes read:
- If <= 0, an EOI condition is indicated.
- If > 0, then the EOI flag determines if this is the terminal data.
On EOI, the matching close function will be called automatically. Args:
ctx (IN) - XML context
path (IN) - full path to the source to be opened; only
provided here for use in error messages
ih (IN) - input handle union
dest (OUT) - destination buffer to read data into
destsize (IN) - size of dest
nraw (OUT) - number of bytes read
eoi (OUT) - hit End of Information?
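To make the reader contract more concrete, here is a sketch of a read function that serves data from an in-memory buffer. It follows the argument list above but is written as a plain function: real callbacks should be declared with XML_READF. The union `demo_ihdl` and its member names, the `mem_source` structure, and the exact C types are assumptions. The sketch reports the byte count through nraw, as the argument list suggests, and uses the return value as a status code (0 for success), which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the xmlihdl union (pointer or integer handle). */
typedef union demo_ihdl { void *ptr; int fd; } demo_ihdl;

/* Hypothetical in-memory input source tracked through the handle. */
typedef struct mem_source {
    const unsigned char *data;
    size_t               len;
    size_t               pos;
} mem_source;

/* Copies up to destsize bytes into dest and reports count and EOI. */
unsigned demo_read(void *ctx, const char *path, demo_ihdl *ih,
                   unsigned char *dest, size_t destsize,
                   long *nraw, unsigned char *eoi)
{
    mem_source *src;
    size_t left, n;

    (void)ctx;
    (void)path;                  /* only useful for error messages */
    assert(ih != NULL && dest != NULL && nraw != NULL && eoi != NULL);

    src  = (mem_source *)ih->ptr;
    left = src->len - src->pos;
    n    = left < destsize ? left : destsize;

    memcpy(dest, src->data + src->pos, n);
    src->pos += n;

    *nraw = (long)n;                              /* bytes placed in dest */
    *eoi  = (unsigned char)(src->pos == src->len); /* set on the final chunk */
    return 0;
}
```

With a 5-byte source and a 4-byte destination buffer, the first call delivers 4 bytes with the EOI flag clear and the second delivers the last byte with the flag set.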
xmlinit, xmlinitenc
- Purpose
- Initializes the C XML parser. It must be called before any parsing
can take place.
- Syntax
xmlctx *xmlinit(uword *err, const oratext *incoding,
XML_MSGHDLRF((*msghdlr)), void *msgctx,
const xmlsaxcb *saxcb, void *saxcbctx,
const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
xmlctx *xmlinitenc(uword *err, const oratext *incoding, const oratext *outcoding,
XML_MSGHDLRF((*msghdlr)), void *msgctx,
const xmlsaxcb *saxcb, void *saxcbctx,
const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
- Parameters
err (OUT) - Numeric error code, on failure
incoding (IN) - default input character set encoding
outcoding (IN) - output (DOM/SAX data) character set encoding (xmlinitenc only)
msghdlr (IN) - Error message handler function
msgctx (IN) - Context for the error message handler
saxcb (IN) - SAX callback structure (filled with function pointers)
saxcbctx (IN) - Context for SAX callbacks
memcb (IN) - Memory function callback structure
memcbctx (IN) - Context for the memory function callbacks
lang (IN) - Language for error messages
- Comments
- Do not call any other XML parser functions if this is not successful!
- This function should only be called once before parsing any XML files.
xmlterm should be called after all parsing and DOM use has
completed. Multiple parses should call xmlclean between runs
if only the current document needs to be available. Until clean is called,
data pointers from all previous parses will continue to be valid.
- All arguments may be NULL except for err, which is required. On
success, an XML context (xmlctx *) is returned. If this is NULL, a
failure occurred and the numeric error code is stored in *err.
- Data Encoding
- The encoding of input documents is detected automatically (by BOM,
XMLDecl, etc). If the encoding cannot be determined, incoding is
assumed. If incoding is not specified (NULL), UTF-8 is assumed.
incoding should be an IANA/MIME encoding name, e.g. "UTF-16", "ASCII", etc.
- NOTE: A separate, fast code path exists for single-byte character
sets like ASCII, ISO-8859, and EBCDIC. This path is considerably
faster than the UTF-8 multibyte path, so if you are sure your input
documents are single-byte, you are strongly encouraged to say so by
setting the incoding.
- The encoding which data will be presented as (through DOM/SAX) is given
as outcoding. If not specified, UTF-8 is chosen. Unicode (UTF-16)
is supported. Since DOM/SAX APIs specify (oratext *) as data pointers,
for Unicode these should be cast to (ub2 *).
- NOTE: For backwards compatibility (until the next major release),
xmlinit will set the outcoding to the input encoding of the first
document parsed, to simulate the old behavior. For xmlinitenc,
the output encoding is explicitly specified.
- Error Messages, Language
- By default, error messages are printed to stderr. To handle messages
yourself, specify a handler function pointer. The formatted error
string and numeric error code will be passed to your function, along
with the user-defined message context msgctx. The error strings will
be UTF-8; any data included as part of the error message will be
converted to UTF-8. If you need the line number and path/URL where the
error occurred, the xmlwhere function returns this information,
but it may only be called from the user's callback function (while the
error is in progress).
- The error language is specified as lang, e.g. "AMERICAN", "JAPANESE",
"FRENCH", etc, and defaults to American.
- SAX vs DOM
- By default, a DOM parse tree is built. To use SAX instead, specify a
SAX callback structure (saxcb). The callbacks will be invoked with
the given SAX context pointer. If any of the SAX functions returns
an error (non-zero), parsing stops immediately.
- Memory Allocation
- The parser allocates memory in large chunks. The default system
memory allocator (malloc etc) will be used to allocate and free the
chunks unless a memory callback structure is provided. If given, it
contains function pointers to alloc/free functions which will be used
instead. The memory callback context memcbctx is passed to the
callback functions.
- Error Codes
| XMLERR_NLS_INIT | The National Language Service package could not be initialized. Perhaps an installation or configuration problem. |
| XMLERR_INVALID_MEMCB | A memory callback structure (memcb) was specified, but it did not have alloc and free function pointers. |
| XMLERR_BAD_ENCODING | An encoding was not known. Use IANA/MIME names for encodings, and make sure NLS data is present. |
| XMLERR_INVALID_LANG | The language specified for error messages was not known. |
| XMLERR_LEH_INIT | The LEH (catch/throw) package could not be initialized. An internal error, contact support. |
xmlclean
- Purpose
- Frees any memory used during the previous parse.
- Syntax
void xmlclean(xmlctx *ctx);
- Parameters
- ctx (IN) - The XML parser context
- Comments
- Recycles memory within the XML parser, but does not free it to the
system-- only xmlterm finally releases all memory back to the
system. If xmlclean is not called between parses, then the data
used by the previous documents remains allocated, and pointers to
it are valid. Thus, the data for multiple documents can be accessible
simultaneously, although only the current document can be manipulated
with DOM.
- If you just want to access one document's data at a time (within a
single context), then call xmlclean before each new parse.
xmlparse, xmlparsebuf, xmlparsefile, xmlparsestream
- Purpose
- These functions invoke the XML parser on various input sources. The
parser must have been initialized successfully with a call to
xmlinit first.
- Syntax
uword xmlparse(xmlctx *ctx, const oratext *uri,
const oratext *incoding, ub4 flags);
uword xmlparsebuf(xmlctx *ctx, const oratext *buffer, size_t len,
const oratext *incoding, ub4 flags);
uword xmlparsefile(xmlctx *ctx, const oratext *path,
const oratext *incoding, ub4 flags);
uword xmlparsestream(xmlctx *ctx, const void *stream,
const oratext *incoding, ub4 flags);
- Parameters
ctx (IN/OUT) - The XML parser context
uri (IN) - URI of XML document (xmlparse only)
buffer (IN) - input buffer (xmlparsebuf only)
len (IN) - length of the buffer (xmlparsebuf only)
path (IN) - path of XML document (xmlparsefile only)
stream (IN) - input stream (xmlparsestream only)
incoding (IN) - default input character set encoding
flags (IN) - mask of parser options
- Comments
- Parser options are specified as flag bits OR'd together into
the flags mask. Flag bits are:
| XML_FLAG_VALIDATE | Turn validation on |
| XML_FLAG_DISCARD_WHITESPACE | Discard whitespace where it appears to be extraneous (end-of-line etc) |
| XML_FLAG_STOP_ON_WARNING | Stop validation on warnings |
- By default, the parser does not validate the input. To validate
against a DTD, set the XML_FLAG_VALIDATE flag. Validation problems
are considered warnings, not errors, and by default validation will
continue after warnings have occurred. To treat warnings as errors,
set the flag XML_FLAG_STOP_ON_WARNING.
- The default behavior for whitespace processing is to be fully
conformant to the XML 1.0 spec, i.e. all whitespace is reported
back to the application, but it is indicated which whitespace is
"ignorable". However, some applications may prefer to set the
XML_FLAG_DISCARD_WHITESPACE which will discard all whitespace
between an end-element tag and the following start-element tag.
- The default input encoding may be specified as incoding, which
overrides the incoding given to xmlinit. If the input's encoding
cannot be determined automatically (based on BOM, XMLDecl, etc)
then it is assumed to be incoding. IANA/MIME encoding names
should be used, e.g. "UTF-8", "ASCII", etc.
- Data pointers returned by DOM APIs remain valid until xmlclean
or xmlterm is called.
- For SAX, the data pointers only remain valid for the duration of
the user's callback function. That is, once the callback function
has returned, the data pointers become invalid. If longer access
is needed, the data can be stored in the XML memory's pool using
stringSave (or stringSave2 for UCS2 data).
- Streams: A stream is a user defined entity here-- all that's passed
in is a stream/context pointer, which is in turn passed to the
I/O callback functions. The parser does not reference the stream
directly.
- DTD: The DTD parser invokes the XML parser on an external DTD, not
a complete document. It is used mainly by the Class Generator so
that classes may be generated from a DTD without needing a complete
(dummy) document.
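The flag bits described in the comments above are combined with bitwise OR. The sketch below shows one way to build the mask from application settings. The `DEMO_FLAG_*` values are placeholders chosen for illustration, since the real bit values are defined in the parser header and are not given here. Real code would use XML_FLAG_VALIDATE, XML_FLAG_DISCARD_WHITESPACE and XML_FLAG_STOP_ON_WARNING directly.

```c
#include <assert.h>

typedef unsigned int ub4;

/* Placeholder bit values, for illustration only. */
#define DEMO_FLAG_VALIDATE           0x01u
#define DEMO_FLAG_DISCARD_WHITESPACE 0x02u
#define DEMO_FLAG_STOP_ON_WARNING    0x04u

/* Builds a parser option mask from three yes/no settings. */
ub4 demo_parse_flags(int validate, int stop_on_warning, int discard_ws)
{
    ub4 flags = 0;                       /* default: well-formedness check only */

    if (validate) {
        flags |= DEMO_FLAG_VALIDATE;
        /* Warnings come from validation, so only ask to stop on them
         * when validation itself is switched on. */
        if (stop_on_warning)
            flags |= DEMO_FLAG_STOP_ON_WARNING;
    }
    if (discard_ws)
        flags |= DEMO_FLAG_DISCARD_WHITESPACE;

    /* Post-condition: stop-on-warning never appears without validation. */
    assert(!(flags & DEMO_FLAG_STOP_ON_WARNING) ||
           (flags & DEMO_FLAG_VALIDATE));
    return flags;
}
```

The resulting value would be passed as the flags argument of whichever parse function is used.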
xmlterm
- Purpose
- Terminates the XML parser. It should be called after
xmlinit, and before exiting the main program.
- Syntax
uword xmlterm(xmlctx *ctx);
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function tears down the parser. It frees all allocated memory,
giving it back to the system (through free or the user's memory
callback). Contrast to xmlclean, which recycles memory internally
without giving it back to the system.
- No additional XML parser calls can be made until xmlinit
is called again to get a new context.
xmlwhere
- Purpose
- Return error location information for the last (current) error.
- Syntax
uword xmlwhere(xmlctx *ctx, ub4 *line, oratext **path, uword idx);
- Parameters
ctx (IN) - the XML parser context
line (OUT) - line# where the error occurred
path (OUT) - source path/URL where error occurred
idx (IN) - error# in stack (starting at 0)
- Comments
- Returns the location information for the idx'th error on the stack.
This function should only be called while an error is in progress, i.e.
from within an error message callback function. Since errors occur in
nested inputs (document A includes document B includes document C which
contains an error), more than one location is available. The highest-
level input file is idx 0, then the next level down is 1, etc. If only
the highest level is desired, just call once with idx=0. If all levels
are desired, loop starting with idx=0 and incrementing until the
function returns FALSE.
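The loop described above, starting at idx=0 and stopping when the function returns FALSE, can be sketched as follows. `demo_where` is a hypothetical stand-in that uses the documented argument order of xmlwhere and answers from a fixed error stack, so that the sketch compiles without the parser library. `demo_ctx`, `demo_loc` and `demo_format_locations` are likewise illustrative names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned char oratext;
typedef unsigned int ub4;
typedef unsigned int uword;

/* Hypothetical error stack: index 0 is the highest-level input. */
typedef struct demo_loc { ub4 line; oratext *path; } demo_loc;
typedef struct demo_ctx { const demo_loc *stack; uword depth; } demo_ctx;

/* Stand-in for xmlwhere: returns 0 (FALSE) once idx runs past the stack. */
uword demo_where(demo_ctx *ctx, ub4 *line, oratext **path, uword idx)
{
    if (idx >= ctx->depth)
        return 0;
    *line = ctx->stack[idx].line;
    *path = ctx->stack[idx].path;
    return 1;
}

/* Writes "path:line > path:line ..." into out; returns the levels visited. */
uword demo_format_locations(demo_ctx *ctx, char *out, size_t outsize)
{
    ub4      line;
    oratext *path;
    uword    idx;
    size_t   used = 0;

    assert(ctx != NULL && out != NULL && outsize > 0);
    out[0] = '\0';
    for (idx = 0; demo_where(ctx, &line, &path, idx); idx++) {
        int n = snprintf(out + used, outsize - used, "%s%s:%u",
                         idx ? " > " : "", (const char *)path, line);
        if (n < 0 || (size_t)n >= outsize - used)
            break;                       /* output buffer is full */
        used += (size_t)n;
    }
    return idx;
}
```

In a real program this loop would run inside the error message callback, since the location information is only available while an error is in progress.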
createDocument, createDocumentNS
- Purpose
- Creates a new document in memory.
- Syntax
xmlnode* createDocument(xmlctx *ctx)
xmlnode* createDocumentNS(xmldomimp *imp, oratext *uri,
oratext *qname, xmlnode *dtd);
- Parameters
ctx (IN) - XML parser context
imp (IN) - XML DOMImplementation (see getImplementation)
uri (IN) - New document's namespace URI
qname (IN) - Namespace qualified name of new document (DOCUMENT_NODE's name)
dtd (IN) - DTD this document is associated with
- Comments
- The original function createDocument has now been
standardized in DOM 2.0 CORE. For compatibility, the old function
remains with its original usage, and the new CORE function is called
createDocumentNS.
- Creates a new document in memory. An XML document is always rooted in
a node of type DOCUMENT_NODE-- this function creates that root
node and sets it in the context. There can be only one current document
and hence only one document node; if one already exists, this function
does nothing and returns NULL.
- For createDocumentNS, if a DTD is specified, its ownerDocument
attribute will be set to the document being created.
isStandalone
- Purpose
- Return value of document's standalone flag.
- Syntax
boolean isStandalone(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the boolean value of the document's standalone
flag, as specified in the XML declaration (<?xml ...?>).
isSingleChar
- Purpose
- Return value of "simple encoding" flag.
- Syntax
boolean isSingleChar(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the boolean value of the document's "simple"
encoding flag. If the document is single-byte encoded (ASCII, ISO-8859,
EBCDIC, etc), TRUE is returned; otherwise, encoding is multibyte or
Unicode and FALSE is returned. See also the getEncoding function
which returns the name of the specific encoding, and isUnicode
which tests for Unicode data.
isUnicode
- Purpose
- Return value of Unicode encoding flag.
- Syntax
boolean isUnicode(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the flag which determines whether DOM/SAX data
for this context is in Unicode (UCS2).
getEncoding
- Purpose
- Returns the IANA/MIME name of the character encoding
used by the document, e.g. "ASCII", "ISO-8859-1", "UTF-8", "UTF-16", etc.
- Syntax
oratext *getEncoding(xmlctx *ctx)
- Parameters
ctx (IN) - the XML parser context
- Comments
- This function returns the name of the document's encoding, e.g. "ASCII",
"UTF-8", etc. See also the isSingleChar function, which can be
used to simply determine if the document is single or multibyte, and
the isUnicode function, which determines if the input is Unicode
(UTF-16).