HTML(3) Library Functions Manual HTML(3)

NAME

html, hattr_alloc_chars, hattr_alloc_text, hattr_clone, hattr_column, hattr_delete, hattr_enumname, hattr_find, hattr_line, hattr_literal, hattr_name, hattr_parent, hattr_sibling, hattr_type, hattr_value, hcache_alloc, hcache_clone, hcache_delete, hcache_get, hcache_root, hcache_verify, hcref_find, hcref_name, hdecl_enumname, hdecl_find, hdecl_name, helem_enumname, helem_find, helem_name, hident_alloc, hident_clone, hident_column, hident_delete, hident_line, hident_literal, hident_sibling, hident_value, hnode_addattr, hnode_addchild, hnode_addident, hnode_alloc_chars, hnode_alloc_comment, hnode_alloc_decl, hnode_alloc_elem, hnode_alloc_proc, hnode_alloc_root, hnode_alloc_text, hnode_attr, hnode_child, hnode_clone, hnode_column, hnode_comment, hnode_dechild, hnode_dechildpart, hnode_decl, hnode_delete, hnode_elem, hnode_ident, hnode_line, hnode_parent, hnode_proc, hnode_repattr, hnode_sibling, hnode_text, hnode_type, hparse_alloc, hparse_delete, hparse_tree, hproc_enumname, hproc_find, hproc_name, hvalid_alloc, hvalid_delete, hvalid_tree, hwrite_alloc, hwrite_delete, hwrite_mode, hwrite_tree - simple HTML parsing library

SYNOPSIS

#include <html.h>

enum hattrt;
enum hcreft;
enum hdeclt;
enum helemt;
enum hmode;
enum hnodet;
enum hproct;
struct ioctx;
struct iofd;
struct ioout;
struct iovalid;

struct hattr *
hattr_alloc_chars(enum hattrt t, const char *val, int lit, int line, int col);

struct hattr *
hattr_alloc_text(enum hattrt t, const char *val, int lit, int line, int col);

struct hattr *
hattr_clone(struct hattr *);

int
hattr_column(struct hattr *);

void
hattr_delete(struct hattr *);

const char *
hattr_enumname(enum hattrt);

enum hattrt
hattr_find(const char *);

int
hattr_line(struct hattr *);

int
hattr_literal(struct hattr *);

const char *
hattr_name(enum hattrt);

struct hnode *
hattr_parent(struct hattr *);

struct hattr *
hattr_sibling(struct hattr *);

enum hattrt
hattr_type(struct hattr *);

const char *
hattr_value(struct hattr *);

struct hcache *
hcache_alloc(struct hnode *root, const struct hcid *ids, int maxid, size_t idsz);

struct hcache *
hcache_clone(struct hcache *p);

void
hcache_delete(struct hcache *);

struct hnode *
hcache_get(struct hcache *p, int id);

struct hnode *
hcache_root(struct hcache *p);

int
hcache_verify(struct hcache *p);

enum hcreft
hcref_find(const char *);

const char *
hcref_name(enum hcreft);

const char *
hdecl_enumname(enum hdeclt);

enum hdeclt
hdecl_find(const char *);

const char *
hdecl_name(enum hdeclt);

const char *
helem_enumname(enum helemt);

enum helemt
helem_find(const char *);

const char *
helem_name(enum helemt);

struct hident *
hident_alloc(const char *, int, int, int);

struct hident *
hident_clone(struct hident *);

int
hident_column(struct hident *);

void
hident_delete(struct hident *);

int
hident_line(struct hident *);

int
hident_literal(struct hident *);

struct hident *
hident_sibling(struct hident *);

const char *
hident_value(struct hident *);

int
hnode_addattr(struct hnode *, struct hattr *);

int
hnode_addchild(struct hnode *, struct hnode *);

int
hnode_addident(struct hnode *, struct hident *);

struct hnode *
hnode_alloc_chars(const char *p, int line, int col);

struct hnode *
hnode_alloc_comment(const char *p, int line, int col);

struct hnode *
hnode_alloc_decl(enum hdeclt type, int line, int col);

struct hnode *
hnode_alloc_elem(enum helemt type, int line, int col);

struct hnode *
hnode_alloc_proc(enum hproct type, int line, int col);

struct hnode *
hnode_alloc_root(int line, int col);

struct hnode *
hnode_alloc_text(const char *p, int line, int col);

struct hattr *
hnode_attr(struct hnode *);

struct hnode *
hnode_child(struct hnode *);

struct hnode *
hnode_clone(struct hnode *);

int
hnode_column(struct hnode *);

const char *
hnode_comment(struct hnode *);

void
hnode_dechild(struct hnode *n);

void
hnode_dechildpart(struct hnode *n, const struct hnode **r, int rsz);

enum hdeclt
hnode_decl(struct hnode *);

void
hnode_delete(struct hnode *);

enum helemt
hnode_elem(struct hnode *);

struct hident *
hnode_ident(struct hnode *);

int
hnode_line(struct hnode *);

struct hnode *
hnode_parent(struct hnode *);

enum hproct
hnode_proc(struct hnode *);

int
hnode_repattr(struct hnode *, struct hattr *);

struct hnode *
hnode_sibling(struct hnode *);

const char *
hnode_text(struct hnode *);

enum hnodet
hnode_type(struct hnode *);

struct hparse *
hparse_alloc(enum hmode);

void
hparse_delete(struct hparse *);

int
hparse_tree(struct hparse *, struct ioctx *, struct hnode **);

const char *
hproc_enumname(enum hproct);

enum hproct
hproc_find(const char *);

const char *
hproc_name(enum hproct);

struct hvalid *
hvalid_alloc(enum hmode);

void
hvalid_delete(struct hvalid *);

int
hvalid_tree(struct hvalid *, struct iovalid *, struct hnode *);

struct hwrite *
hwrite_alloc(enum hmode);

void
hwrite_delete(struct hwrite *);

void
hwrite_mode(struct hwrite *p, enum hmode mode);

int
hwrite_tree(struct hwrite *, struct ioout *, struct hnode *);

DESCRIPTION

The html library contains HTML DOM parsing, serialising, and manipulation functions. It only works on input and output described by a subset of HTML-4.01 and XHTML-1.0 strict (full compliance is an eventual possibility). Input and output contexts are provided by the calling application. All operations are strictly checked for correctness.

In general, a parser object is first allocated with hparse_alloc(). Input is parsed using hparse_tree(), which reads from user-supplied callbacks in struct ioctx. A commonly-used context, reading from a file, is documented as struct iofd. Elements carrying an “ID” attribute may then be indexed with hcache_alloc() and queried with hcache_get(). Finally, output is written with hwrite_tree().

The html library is currently under development. A subset of both languages is implemented: parse trees are validated syntactically (XML/SGML) but not semantically (HTML).

VALIDATION

This section documents the validation process of the html library. Validation is two-phase: syntactic, which occurs when a tree is being assembled by parsing (hparse_tree() et al.) or manual construction (hnode_addchild() et al.); and semantic, when a tree has been constructed and is walked with hvalid_tree().

Syntax

The tree is structurally validated in terms of SGML/XML as it is being built. First, only certain node types may be children of other nodes (comments may not have element children, roots may not be children of anybody, text may be a child of elements, etc.). Pseudo-semantic closure is also enforced, where some nodes (such as the “LINK” element in HTML-4.01) may not have children.

After successful construction, a tree is guaranteed to be syntactically valid.
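The parent/child rules above can be pictured as a small decision function. The following is only a sketch of the idea, not the library's implementation: the toy_nodet enumeration and toy_may_adopt() are hypothetical names, and the rules for what a root may hold and for text and comments being leaves are assumptions made for illustration. The 1/0 return convention mirrors that of hnode_addchild().

```c
#include <assert.h>

/* Hypothetical node classes; the library's real enum hnodet may differ. */
enum toy_nodet {
	TOY_ROOT,
	TOY_ELEM,
	TOY_TEXT,
	TOY_COMMENT,
	TOY__MAX
};

/*
 * Return 1 if a node of class "child" may be placed under a node of
 * class "parent", 0 otherwise.
 */
static int
toy_may_adopt(enum toy_nodet parent, enum toy_nodet child)
{
	assert(parent < TOY__MAX && child < TOY__MAX);

	/* Roots may not be children of anybody. */
	if (TOY_ROOT == child)
		return 0;

	switch (parent) {
	case TOY_ROOT:
		/* Assumption: a root holds elements and comments only. */
		return TOY_ELEM == child || TOY_COMMENT == child;
	case TOY_ELEM:
		/* Elements hold elements, text, and comments. */
		return 1;
	default:
		/* Assumption: text and comments are leaves. */
		return 0;
	}
}
```

A tree builder following this model would call toy_may_adopt() before linking a child and refuse the operation when it returns 0, so that an invalid tree can never be assembled.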

Semantics

Exclusion is enforced, where some nodes may not contain other nodes. For example, “A” may not contain other “A” nodes.

Assignment of attributes to processing instructions and elements is also enforced, such as the “version” attribute only being applicable to the “xml” processing instruction in XML mark-up, or “CELLPADDING” only being applicable to “TABLE” in HTML-4.01.

Lastly, text data streams are validated when named or numeric character references are encountered.
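The exclusion rule can be checked by walking from a node towards the top of the tree. The sketch below models that walk with a hypothetical struct toy_node and toy_exclusion_ok(); neither is part of the library, and it assumes element names have already been normalised to a single case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tree node; not the library's struct hnode. */
struct toy_node {
	const char	*name;		/* element name, e.g. "A" */
	struct toy_node	*parent;	/* NULL at the top of the tree */
};

/*
 * Return 1 if no ancestor of "n" is an element named "excl",
 * 0 if at least one ancestor is.
 */
static int
toy_exclusion_ok(const struct toy_node *n, const char *excl)
{
	const struct toy_node	*p;

	assert(n != NULL && excl != NULL);

	for (p = n->parent; p != NULL; p = p->parent)
		if (0 == strcmp(p->name, excl))
			return 0;
	return 1;
}
```

A validator built this way would call toy_exclusion_ok(n, "A") for every “A” node it visits and report a failure when the result is 0.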

REFERENCE

This section contains a canonical list of Data Types and Functions for this library.

Data Types

enum hattrt
An HTML attribute type. See hattr_type(), among other functions.
enum hcreft
A named character reference. See hcref_find(), among other functions.
enum hdeclt
A document type (“doctype”) identifier. See hnode_decl(), among other functions.
enum helemt
An HTML element type. See hnode_elem(), among other functions.
enum hmode
An HTML version. This bit-wise value may be masked with HMODE_SGML for SGML-based HTML versions or HMODE_XML for XML-based versions.
enum hnodet
The classification of a node. See hnode_type(), among other functions.
enum hproct
A processing instruction (“pic”) identifier. See hnode_proc(), among other functions.
struct ioctx
Provides functions for manipulating I/O during a parsing sequence. See hparse_tree() for usage and struct iofd for an example implementation.
struct iofd
A bundled implementation of a struct ioctx context. The included iofd_open(), iofd_getchar(), iofd_close(), and iofd_rew() functions should be used for the callbacks. Before use, the structure should be zeroed and fd set to -1.
struct ioout
Provides functions for serialising I/O during a write sequence. See hwrite_tree(). The convenience functions iostdout_putchar() and iostdout_puts() are provided for writing to standard output.
struct iovalid
Provides functions for validating HTML trees. See hvalid_tree().
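The masking described for enum hmode can be illustrated as follows. The bit values and the toy_mode names are invented for this sketch; the real HMODE_SGML and HMODE_XML values are defined by the library header and are not reproduced here. The sketch further assumes that every version carries exactly one family bit.

```c
#include <assert.h>

/* Hypothetical bit layout; the real enum hmode values live in html.h. */
enum toy_mode {
	TOY_MODE_SGML	= 0x10,		/* family bit: SGML-based versions */
	TOY_MODE_XML	= 0x20,		/* family bit: XML-based versions */
	TOY_MODE_HTML4	= 0x10 | 0x01,	/* an SGML-based version */
	TOY_MODE_XHTML1	= 0x20 | 0x02	/* an XML-based version */
};

/* Return 1 if "mode" belongs to the XML family, 0 otherwise. */
static int
toy_mode_is_xml(enum toy_mode mode)
{
	int	sgml = (mode & TOY_MODE_SGML) != 0;
	int	xml = (mode & TOY_MODE_XML) != 0;

	/* Assumption: exactly one family bit is set per version. */
	assert(sgml != xml);
	return xml;
}
```

With the real library, the equivalent test would mask an enum hmode value with HMODE_XML or HMODE_SGML in the same manner.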

Functions

hattr_alloc_chars()
Allocate an HTML key/value pair attribute. The lit argument indicates whether the attribute value is a literal. Returns NULL if allocation failed.
hattr_alloc_text()
Like hattr_alloc_chars(), but HTML-encodes val.
hattr_clone()
Clone an attribute. This will create a new attribute with no context (i.e., parent or siblings), but all values duplicated. Returns NULL if allocation failed.
hattr_column()
Get the column number where the attribute was first parsed.
hattr_delete()
Remove an HTML attribute from its context (if applicable) and free it.
hattr_enumname()
Convert an attribute type to the string representation of its enumeration type.
hattr_find()
Look up an attribute type by its name. Returns the type or HATTR__MAX if none was found.
hattr_line()
Get the line number where the attribute was first parsed.
hattr_literal()
Whether the attribute value was invoked as a literal or not.
hattr_name()
Convert an attribute type to its string representation.
hattr_parent()
Get the parent of an attribute or NULL if it has none.
hattr_sibling()
Get the next sibling of an attribute or NULL if it has none.
hattr_type()
Get the type of an attribute.
hattr_value()
Get the value of an attribute or NULL if it has none.
hcache_alloc()
Allocate a cached node tree. A cached tree is one where elements with the “ID” attribute are each assigned to a numeric identifier. These nodes may then be queried with hcache_get() in constant time. This function should be called on the document root. It returns an allocated cache object over the root node, whose pointer should not be manipulated outside of the cache context. Cached node identifiers are stored in ids, which is of size idsz, where the maximum numeric identifier is strictly less than maxid. Returns NULL if memory allocation fails.
hcache_clone()
Clone a cached node tree. Cloned nodes will propagate their cached-ness. Returns NULL if memory allocation fails.
hcache_delete()
Delete a cache, including the cached tree.
hcache_get()
Get a cached node. This will trigger an assertion failure if the cached node has been unlinked.
hcache_root()
Return the node that was used to allocate the cache.
hcache_verify()
Verify that all cached entries are filled. Returns 1 if they are filled, 0 if they are not.
hcref_find()
Look up a character reference by its name (note that character references are case sensitive). Returns the type or HCREF__MAX if none was found.
hcref_name()
Convert a character reference to its string representation.
hdecl_enumname()
Convert a declaration type to the string representation of its enumeration type.
hdecl_find()
Look up a declaration type by its name. Returns the type or HDECL__MAX if none was found.
hdecl_name()
Convert a declaration type to its string representation.
helem_enumname()
Convert an element type to the string representation of its enumeration type.
helem_find()
Look up an element type by its name. Returns the type or HELEM__MAX if none was found.
helem_name()
Convert an element type to its string representation.
hident_alloc()
Allocate an identifier, recording whether it is a literal (i.e., one that was invoked with surrounding quotes). Returns NULL if allocation failed.
hident_clone()
Clone an identifier. This will create a new identifier with no context (i.e., parent or siblings), but all values duplicated. Returns NULL if allocation failed.
hident_column()
Get the column number where the identifier was first parsed.
hident_delete()
Remove an identifier from its context (if applicable) and free it.
hident_line()
Get the line number where the identifier was first parsed.
hident_literal()
Returns 1 if this identifier was invoked as a literal, 0 otherwise.
hident_sibling()
Get the next sibling of an identifier or NULL if it has none.
hident_value()
Get the value of an identifier.
hnode_addattr()
Unlink an attribute from its current context (if applicable) and add it to another element. Returns 1 on success or 0 if the attribute is not allowed to be the child of the parent.
hnode_addchild()
Unlink an element from its current context (if applicable) and append it to another element's queue of children. Returns 1 on success or 0 if the node is not allowed to be a child of the parent.
hnode_addident()
Unlink an identifier from its current context (if applicable) and add it to another declaration. Returns 1 on success or 0 if the identifier is not allowed to be a child of the parent.
hnode_alloc_chars()
Allocate an HTML text node. Copies over all values in the string as-is. This should be used for pre-validated strings (i.e., text that is guaranteed not to have invalid mark-up). Returns NULL if memory allocation failed.
hnode_alloc_comment()
Allocate an HTML comment node. Returns NULL if memory allocation failed.
hnode_alloc_decl()
Allocate an HTML type declaration. Returns NULL if memory allocation failed.
hnode_alloc_elem()
Allocate an HTML element node. Returns NULL if memory allocation failed.
hnode_alloc_proc()
Allocate a processing instruction node. Returns NULL if memory allocation failed.
hnode_alloc_root()
Allocate the root HTML node. Returns NULL if memory allocation failed.
hnode_alloc_text()
Allocate an HTML text node. The ‘<’, ‘>’, ‘"’, and ‘&’ characters are transformed into their named entity forms. Returns NULL if memory allocation failed.
hnode_attr()
Get the first attribute of a node or NULL if it has none.
hnode_child()
Get the first child of a node or NULL if it has none.
hnode_clone()
Recursively clone a node and any sub-components (nodes, attributes, identifiers, etc.). The HATTR_ID attribute is not duplicated. This will create a new node with no context (i.e., parent or siblings), but all other values duplicated. Sub-components are grouped under the returned node. Returns NULL if allocation failed.
hnode_column()
Get the column where the node began to be parsed.
hnode_comment()
Get the comment field of a node. The node must be of type HNODE_COMMENT.
hnode_dechild()
Convenience routine invoking hnode_delete() for all children of a node.
hnode_dechildpart()
Like hnode_dechild(), but preserves each node listed in the array r (of rsz entries), together with the path leading up to it and the subtree below it.
hnode_decl()
Get the declaration type of a node. The node must be of type HNODE_DECL.
hnode_delete()
Recursively unlink HTML elements rooted at the current element (if applicable) and free them.
hnode_elem()
Get the element type of a node. The node must be of type HNODE_ELEM.
hnode_ident()
Get the first identifier of a node or NULL if it has none.
hnode_line()
Get the line number where the node was first parsed.
hnode_parent()
Get the parent of a node or NULL if it has none.
hnode_proc()
Get the processing type of a node. The node must be of type HNODE_PROC.
hnode_repattr()
Replace an existing attribute with a new one. An existing attribute, if found, will be deleted with hattr_delete(). See hnode_addattr() for return values.
hnode_sibling()
Get the next sibling of a node or NULL if it has none.
hnode_text()
Get the text of a node. The node must be of type HNODE_TEXT.
hnode_type()
Get the type of a node.
hparse_alloc()
Allocates a parse sequence with a certain document type. Returns NULL if memory allocation failed. A single struct hparse * may be used for multiple parse sequences of this document type.
hparse_delete()
Free a parse sequence.
hparse_tree()
Opens a struct ioctx and begins processing. The tree is processed until end of file or error. The struct ioctx is always closed unless the open fails. Returns 0 on syntax or system error, 1 on success. The tree is guaranteed to be well-structured, but not validated for HTML. The result, if not NULL (it may be non-NULL even if parsing fails), must be freed by a separate call to hnode_delete().
hproc_enumname()
Convert a processing instruction type to the string representation of its enumeration type.
hproc_find()
Look up a processing type by its name. Returns the type or HPROC__MAX if none was found.
hproc_name()
Convert a processing type to its string representation.
hvalid_alloc()
Allocate a validator bound to a validation mode. Returns NULL if memory allocation fails. The same validator may be used with multiple hvalid_tree() calls.
hvalid_delete()
Frees a validator.
hvalid_tree()
Depth-first validation of node and all children. Validation is tied to a particular mode, which is usually the same as the input mode (mixing modes is perfectly reasonable, but will probably fail validation due to differences between XHTML and HTML). Returns 1 on success and 0 on failure.
hwrite_alloc()
Allocate a writer bound to an output mode. Returns NULL if memory allocation fails. The same writer may be used with multiple hwrite_tree() calls.
hwrite_delete()
Frees a writer.
hwrite_mode()
Changes the output mode of a writer.
hwrite_tree()
Serialise node and all children to a struct ioout. The input tree need not have been validated, as the writer assumes nothing about the semantic structure of the tree. Writing is necessarily tied to a particular mode, which is usually the same as the input and validation mode. Mixing modes is not suggested (parsing and validating in XHTML and printing in HTML, for example, will result in the <?xml?> processing instruction printed alongside SGML-formed HTML code). Returns 1 on success and 0 on failure (which can only happen if a struct ioout fails).
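The character transformation described for hnode_alloc_text() can be sketched as a stand-alone routine. The toy_escape() function below is a hypothetical illustration of replacing the four characters with their named entities; it is not the library's code, and how the library allocates or stores the result is not specified here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of "src" with '<', '>', '"' and '&'
 * replaced by named entities, or NULL if allocation fails.
 * The caller frees the result.
 */
static char *
toy_escape(const char *src)
{
	const char	*rep;
	char		*dst;
	size_t		 i, len, pos;

	assert(src != NULL);

	/* "&quot;" is the longest replacement: six bytes per input byte. */
	len = strlen(src);
	if (len > (SIZE_MAX - 1) / 6)
		return NULL;
	if (NULL == (dst = malloc(len * 6 + 1)))
		return NULL;

	for (pos = i = 0; i < len; i++) {
		switch (src[i]) {
		case '<':
			rep = "&lt;";
			break;
		case '>':
			rep = "&gt;";
			break;
		case '"':
			rep = "&quot;";
			break;
		case '&':
			rep = "&amp;";
			break;
		default:
			/* Ordinary characters are copied as-is. */
			dst[pos++] = src[i];
			continue;
		}
		memcpy(dst + pos, rep, strlen(rep));
		pos += strlen(rep);
	}
	dst[pos] = '\0';
	return dst;
}
```

Text that is already known to be free of mark-up does not need this step, which is the case hnode_alloc_chars() is documented for.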

EXAMPLES

The test.c file contained in the distribution has a complete example.
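The first-child and next-sibling accessors listed under Functions suggest a simple depth-first traversal. The sketch below models that pattern on a hypothetical struct toy_tnode so that it stands alone; toy_tnode and toy_count() are illustrative names and not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node with first-child and next-sibling links. */
struct toy_tnode {
	struct toy_tnode	*child;		/* first child, or NULL */
	struct toy_tnode	*sibling;	/* next sibling, or NULL */
};

/* Count "n" and every node beneath it, depth first. */
static size_t
toy_count(const struct toy_tnode *n)
{
	const struct toy_tnode	*c;
	size_t			 total = 1;

	assert(n != NULL);

	/* Visit the first child, then follow the sibling chain. */
	for (c = n->child; c != NULL; c = c->sibling)
		total += toy_count(c);
	return total;
}
```

Against the library itself, the same loop would presumably be written with hnode_child() as the initial value and hnode_sibling() as the step, both of which are documented to return NULL when there is nothing further to visit.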

STANDARDS

The html library implements W3C REC-HTML-4.01 (HTML 4.01) and REC-XHTML-1.0 (XHTML 1.0) with constant low-level syntax reference to ISO 8879:1986 (SGML) and W3C REC-XML-1.0 (XML 1.0), respectively; and UTR-20 (Unicode in XML and other Markup Languages) for character references.

AUTHORS

The html library was written by Kristaps Dzonsons <kristaps@bsd.lv>.

CAVEATS

Many parts of HTML-4.01 and XHTML-1.0 aren't implemented. Most notable are XML namespaces for tags, which will confuse the parser.

The html library does not support tag and attribute names in non-ASCII characters: although the xcode facility in struct ioctx allows for transcoding arbitrary input character types, this is only called for text or literal data. Thus, while the SGML standard, in this regard, is implemented, the XML reference is not.

July 5, 2011 OpenBSD 4.6