GlueX Internal Note
HDDM - GlueX Data Model
draft 1.1
Richard Jones
September 12, 2003
(supersedes draft 1.0, September 12, 2003)
  Fig. 1: The conceptual data model for GlueX begins with a physics
event, coming either from the detector or a Monte Carlo program,
which builds up internal structure as it flows through the
analysis pipeline.  The data model specifies the elements of information
that are contained in a view of the event at each stage and the
relationships between them.  The implementation provides standard
methods for creating, storing and accessing the information.
General Notes:
-  At each stage (lower-case items in diagram) in the pipeline one has
     a unique view of the event.
-  To each of these is associated a unique data model that expresses
     the event in that view.
-  GlueX policy is to use xml to describe all of our shared data, which
     means any data that might be passed as input to a program or produced
     as output.  This does not mean that all data records are represented
     as plain-text files, but that to each data file or i/o port is attached
     some metadata that a tool can use to automatically express all of its
     contents in the form of a plain-text xml document.
-  This policy is interpreted to mean that to each data file or i/o port
     of a program is associated an xml schema that defines the data structure
     that the program expects or produces.  The schemas should be either
     bundled with the distribution of the program, or published on the web
     and indicated by links in the Readme file.
-  Any xml document should be accepted as input to a program if it is
     valid according to the schema associated to that input port.
-  In practice, this last requirement adds significant overhead to the
     task of writing a simple analysis program, because it must be capable of
     parsing general xml documents as input.  In addition to this overhead
     imposed on the program code itself, the author must also produce schemas
     for each input or output port or file accessed by the program.
-  The purpose of the Hall D Data Model (HDDM) is to simplify the
     programmer's task by providing automatic ways to generate schemas
     and providing i/o libraries to standardize input and output of data
     in the form of xml-described files.
-  HDDM consists of a specification supported by a 
     set of tools to assist its implementation.  The
     specification is a set of rules that a programmer must obey in
     constructing schemas in order for them to be supported by the tools.
     The tools include an automatic schema generator and an i/o 
     library builder for c and c++.
-  The HDDM specification was designed to enable the construction of an
     efficient i/o library.  It was assumed in the design that users could
     not afford a general xml-parsing cycle every time an event is read in
     or written out by a program.  It was also assumed that serializing data
     in plain-text xml is too expensive in terms of bandwidth and storage.
     Using the HDDM tools, users can efficiently pass data between programs
     in a serialized binary stream, and convert to/from plain-text xml
     representations using a translation tool when desired.
-  Programmers are not obligated to use HDDM tools to work in the GlueX
     software framework. If they provide their own schema for each file or
     i/o port used by the program and accept any xml that respects their
     input schema then they are within the agreed framework.
-  The HDDM tools are presently implemented in c and c++, so programmers
     wishing to work in java have more work to do.  However, they will find
     it easy to interface to other programs that do use the HDDM libraries
     because they provide for the correct reading and writing of valid xml
     files and the automatic generation of schemas that describe them.
-  The hddm-c tool automatically constructs a set of c-structures
     based on xml metadata that can be used to express the data in memory.
     It also builds an i/o library that can be called by user code to
     serialize/deserialize these structures.
-  The serialized data format supported by hddm-c consists of an
     xml header in plain text that describes the structure and contents of
     the file, followed by a byte string that is a reasonably compact
     serialization of the structured data in binary form.  Such hddm
     files are inherently self-describing.  The overhead of including the
     metadata in the stream with the data is negligible.
-  The hddm-xml tool extracts the xml metadata from a hddm file
     header and expresses the data stored in the file in the form of a
     plain-text xml document.
-  The hddm-schema tool extracts the xml metadata from a hddm file
     header and generates a schema that describes the structure of the data
     in the file.  The schema produced by hddm-schema will always
     validate the document produced by hddm-xml when both act on the
     same hddm file.  More significantly, the schema can be used to check
     the validity of other xml data that originate from a different source.
-  The xml-hddm tool reads an xml document and examines its schema
     for compliance with the HDDM specification. If successful, it parses
     the xml file and converts it into hddm format.
-  The schema-hddm tool reads a schema and checks it for compliance
     with the HDDM specification. If successful, it parses it into the form
     of a hddm file consisting of the header only and no data.  Such a
     data-less hddm file is also called a "template"
     (see below).
-  Note that the hddm-xml, hddm-schema, and hddm-c tools
     can act on any hddm data file written by any program, even if the code
     that produced the data is no longer available.  This is because
     sufficient metadata is provided in the schema header to completely
     reconstruct the file's contents in xml, or instantiate it in c-structures.
-  A tool called xml-xml has been included in the tool set as a
     simple means to validate an arbitrary xml document against a dtd or
     schema, and reformat it with indentation to make it easier to read.
-  Tools called stdhep2hddm and hddm2stdhep provide
     conversion between the hddm data stream and the STDHEP format used by
     HDFast.  This is an example where a user program achieves xml i/o
     by employing translators, in this case a two-stage pipeline.
-  In spite of the array of tools described above, the programmer still
     must do the work of describing the structure and contents of the data
     expected or produced by his program.  He may do this in one of two
     ways: either he constructs an original schema describing his data, or
     he creates an original xml template of his data and then generates the
     schema using hddm-schema.
-  Since schemas are rather verbose and repetitive, the suggested method
     is to create a template first, use hddm-schema to transform it
     into a basic HDDM schema, and then add facets to the schema to enrich the
     minimal set of metadata generated from the template.  This method has
     the advantage that one starts off with a basic schema that is known to
     conform to the rules for HDDM schemas (see below)
     so it is relatively simple thereafter to stay within the specification.
-  As a shortcut to creating schemas, it is not necessary to do anything
     more than just create the template.  The basic schema that is generated
     automatically from the template contains sufficient information to
     validate most data, so a programmer can get by without ever learning
     how to write or modify schemas.
Rules for constructing HDDM templates:
-  A hddm template is nothing more than a plain-text xml file that mimics
     the structure of the xml that the program expects on input or produces
     on output.  In some ways it is like sample data that the programmer
     might provide to a user to demonstrate how to use it, although the
     comparison is not perfect.
-  The top element in the template must be <HDDM> and have
     three required attributes: class, version, and xmlns.
     The value of the latter must be xmlns="http://www.gluex.org/hddm".
     The values of the class and version attributes are user-defined. They
     serve to identify a group of schemas that share a basic set of tags.
     See below for more details on classes.
-  The names of elements below the root <HDDM> element are
     user-defined, but they must be constructed according to the following
     rules.
-  All values in hddm files are expressed as attributes of elements.
     Any text that appears between tags in the template is treated as
     a comment and ignored.
-  An element may have two kinds of information attached to it: child elements 
     which appear as new tags enclosed between the open and close tags of
     the parent element, and attributes which appear as key="value"
     items inside the open tag.  
-  All quantities in the data model are carried by named attributes of
     elements.  The rest of the document exists to express the meaning of
     the data and the relationships between them.
-  All elements in the model document either hold attributes, contain other
     elements, or both.  Empty elements are meaningless, and are not allowed.
-  One way a template is not like sample data is that it does not
     contain actual numerical or symbolic values for the fields in the
     structure.  In the place of actual values, the types of the fields
     are given.  For example, instead of showing energy="12.5" as
     might be shown for sample data, the template would show in this
     position energy="float" or energy="double".
-  The complete list of allowed types supported by hddm is "int", "long",
     "float", "double", "boolean", "string" and "Particle_t".  The
     Particle_t type is a value from an enumerated list of capitalized
     names of physical particles. The int type is a 32-bit signed integer,
     and long is a 64-bit signed integer.  The other cases are obvious.
-  Attributes in the template whose values do not belong to this list are
     assumed to be constants.  Constants are sometimes useful for annotating
     the xml record.  They must have the same value for all instances of the
     element throughout the template.
-  Any given element may appear more than once throughout the template
     hierarchy.  Wherever it appears, it must appear with identical
     attributes and with content elements of the same order and type.
-  Another difference between a template and sample data is that the
     template never shows a given element more than once in a given context,
     even if the given tag would normally be repeated more than once for
     an actual sample.  One obvious example of this is a physics event,
     which is represented only once in the template, but repeated multiple
     times in a file.
-  By default, it is assumed that an element appearing in the template
     must appear in that position exactly once.  If the element is allowed
     to appear more than once or not at all then additional attributes
     should be inserted in the element of the form minOccurs="N1"
     and maxOccurs="N2", where N1 can be zero or any positive
     integer and N2 can be any integer no smaller than N1, or
     set to the string "unbounded".  Each defaults to 1.
-  Arrays of simple types are represented by a sequence of elements,
     each carrying an attribute containing a single value from the array.
     This is more verbose than allowing users to include arrays as a simple
     space separated string of values, but the chosen method is more apt
     for expressing parallelism between related arrays of data.
-  An element may be used more than once in the model, but it may never
     appear as a descendent of itself.  Such recursion is complicated to
     handle and it is hard to think of a situation where it is necessary.
-  Examples of valid hddm templates are given in the examples section
     below.
-  Because templates contain new tags that are invented by the programmer,
     it is not possible to write a standard template schema against which a
     programmer can check his new xml file for use as a template.  Instead of
     using schema validation, the programmer can use the hddm-schema
     tool to check an xml file for correctness as a hddm template.  Any errors
     that occur in the hddm-schema transformation indicate problems in the
     xml file that must be fixed before it can be used as a template.
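-  As an illustration of the rules above, here is a minimal sketch of a
     template.  The tag and attribute names (calibration, channel, remark)
     and the class and version values are invented for this illustration
     and are not part of any GlueX model.  The xml comments are annotations
     for the reader; this note does not say whether the tools tolerate
     comments in a template, so it is safest to strip them before use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">
  <!-- no minOccurs/maxOccurs here, so calibration appears exactly once -->
  <!-- units="ns" is not a type name, so it is treated as a constant -->
  <calibration runNo="int" units="ns">
    <!-- an array: a repeated element carrying one set of values each -->
    <channel id="int" gain="float" pedestal="float" minOccurs="0" maxOccurs="unbounded" />
    <!-- a datum that is sometimes absent gets its own optional element -->
    <remark text="string" minOccurs="0" />
  </calibration>
</HDDM>
```

     The gain and pedestal values travel together in each channel element,
     which is the parallelism between related arrays that the element
     sequence is meant to express.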
Rules for constructing HDDM schemas:
-  HDDM schemas must be valid xml schemas, belonging to the namespace
     http://www.w3.org/2001/XMLSchema.  Not every valid schema is a valid
     HDDM schema, however, because xml allows for several different ways to
     express a given data structure.
-  GlueX programmers are not obligated to write schemas that conform to
     the HDDM specification, but if they do, they have the help of the HDDM
     tools for efficient file storage and i/o.
-  In the following specification, a prefix xs: is applied to the
     names of elements, attributes or datatypes that belong to the official
     schema namespace "http://www.w3.org/2001/XMLSchema", whose meaning is
     defined by the xml schema standard.  The extensions introduced for the
     specific needs of GlueX are assigned to a private namespace called
     "http://www.gluex.org/hddm" that is denoted by the prefix hddm:.
-  The top element defined by the schema must be <hddm:HDDM> and have
     three required attributes: class, version, and xmlns.
     The value of the latter must be xmlns="http://www.gluex.org/hddm".
     The class and version attributes are of type xs:string and are
     user-defined.   They serve to identify a group of schemas that share a
     basic set of tags.  See below for more details.
-  The names of elements below the root <hddm:HDDM> element are
     user-defined, but they must be constructed according to the following
     rules.
-  An element may have two kinds of content: child elements and attributes,
     and hence must have xs:complexType.  Elements represent the
     grouping together of related pieces of data in a hierarchy of nodes.
     The actual numerical or symbolic values of individual variables appear
     as the values of attributes. Examples are shown
     below.
-  All quantities in the data model are carried by named attributes of
     elements.  The rest of the document exists to express the meaning of
     the data and the relationships between them.
-  All elements in the model document either hold attributes, contain other
     elements, or both.  Empty nodes are meaningless, and are not allowed.
-  Text content between open and close tags is allowed in documents
     (type="mixed") but it is treated as a comment and stripped on
     translation.  Basic HDDM schemas do not use type="mixed"
     elements.
-  The datatype of an attribute is restricted to a subset of basic types
     to simplify the task of translation.  Currently the list is
     xs:int, xs:long, xs:float, xs:double,
     xs:boolean, xs:string, xs:anyURI and
     hddm:Particle_t.  User types that are derived from the above
     by xs:restriction may also be defined and used in a HDDM schema.
-  Attributes must always be either "required" or "fixed".  Default
     attributes, i.e. those that are sometimes present inside their host and
     sometimes not, are not allowed.  This allows a single element to be
     treated as a fixed-length binary object on serialization, which has
     advantages for efficient i/o.
-  A datum that is sometimes absent can be expressed in the model by
     assigning it as an attribute to its own host element and putting the
     host element into its parent with minOccurs="0".
-  Fixed attributes (with use="fixed") may be attached to
     user-defined elements.  They may be of any valid schema datatype, not
     just those listed above, and may be used as comments to qualify the
     information contained in the element.  Because they have the same
     value for every instance of the element, they do not take up space in
     the binary stream, but they are included explicitly in the output
     produced by the hddm-xml translator.
-  All elements must be globally defined in the schema, i.e. declared at
     the top level of the xs:schema element.  Child elements are
     included in the definition of their parents through a ref=tagname
     reference. Local definitions of elements inside other elements are not
     allowed. This guarantees that a given element has the same meaning and
     contents wherever it appears in the hierarchy.
-  Arrays of simple types are represented by a sequence of elements,
     each carrying an attribute containing a single value from the array.
     This is more verbose than allowing a simple list type such as that
     defined by xs:list, but the chosen method is more apt for expressing
     parallelism between related arrays of data, such as frequently occurs
     in descriptions of physical events.  Forbidding the use of simple
     xs:list datatypes should encourage programmers to choose the
     better model, although of course they could just mimic the habitual use
     of lists by filling the data tree with long strings of monads!
-  Elements are included inside their parent elements within a
     xs:sequence schema declaration.  Each member of the sequence
     must be a reference to another element with a top-level definition.
-  A given element may occur only once in a given sequence, but may
     have minOccurs and maxOccurs attributes to indicate
     possible absence or repetition of the element.
-  The sequence is the only content organizer allowed by HDDM.
     More complex organizers are supported by schema standards, such as
     all and choice, but their use would complicate the i/o
     interfaces that have to handle them and they add little by way
     of flexibility to the model the way it is currently defined.
-  An element may be used more than once in the model, but it may never
     appear as a descendent of itself.  Such recursion is complicated to
     handle and it is hard to think of a situation where it is necessary.
-  A user can check whether a given schema conforms to the HDDM rules 
     by transforming it into a hddm template
     document.  Any errors that occur during the transformation generate
     a message indicating where the specification has been violated.
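-  To make these rules concrete, the following is a hand-written sketch
     of a schema fragment for the forwardTOF model shown in the examples
     section.  It is not the output of hddm-schema, whose exact form may
     differ.  It is written in standard xml schema syntax, and the
     enclosing HDDM root element with its class, version and xmlns
     attributes is omitted for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:hddm="http://www.gluex.org/hddm"
           targetNamespace="http://www.gluex.org/hddm"
           elementFormDefault="qualified">

  <!-- every element is declared at the top level of xs:schema -->
  <xs:element name="forwardTOF">
    <xs:complexType>
      <!-- xs:sequence is the only content organizer used -->
      <xs:sequence>
        <!-- children are included by reference, never defined locally -->
        <xs:element ref="hddm:slab" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="slab">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="hddm:side" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <!-- data values are carried by required attributes -->
      <xs:attribute name="y" type="xs:float" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="side">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="hddm:hit" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="end" type="xs:int" use="required" />
    </xs:complexType>
  </xs:element>

  <!-- a leaf element: attributes only, no child elements -->
  <xs:element name="hit">
    <xs:complexType>
      <xs:attribute name="t" type="xs:float" use="required" />
      <xs:attribute name="dE" type="xs:float" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```

     Because hit is declared once at the top level, it has the same
     attributes wherever it is referenced, which is the guarantee that the
     global-definition rule is meant to provide.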
Class relationships between HDDM schemas:
-  Two HDDM schemas belong to the same class if all tags that are
     defined in both have the same set of attributes in both.
-  This is a fairly weak condition.  It is possible that all data files
     used in GlueX will belong to the same class, but it is not required.
-  If two HDDM schemas belong to the same class then it is possible to
     form a union schema that will validate documents of either type by
     taking the xml union of the two schema documents and changing any
     sequence elements in one and not in the other to minOccurs="0".
-  The translation tools xml-hddm and hddm-xml will work
     with any HDDM class.
-  Any program built using the i/o library created with hddm-c is
     dependent on the class of the schema used during the build.  Any files
     it writes through this interface will be built on this schema; however,
     it is able to read any file of the same class without recompilation.
-  A new schema may be derived from an existing HDDM schema by taking the
     existing one and adding new elements to the structure.  In this case
     the version attribute of the HDDM tag should be incremented, while
     leaving the class attribute unchanged.
-  A program that was built using the hddm-c tool for its i/o
     interface can read from any hddm file of the same class as
     the original schema used during the build.  If the content of the file
     is a superset of the original schema then nothing has changed.  If
     some elements of the original schema are missing in the file then the
     i/o still works transparently, but the c-structures corresponding to
     the missing elements will be empty, i.e. zeroed out.
-  The c/c++ i/o library rejects an attempt to read from a hddm file that
     has a schema of a different class from the one for which it was built.
-  No mandatory rules are enforced on the version attribute of the
     hddm file, but it is available to programs and may be used to select
     certain actions based on the "vintage" of the data.
-  Programs that need simultaneous access to multiple classes of hddm
     files can be built with more than one i/o library.  The structures and
     i/o interface are defined in separate header files hddm_X.h and
     implementation files hddm_X.c, where X is the class letter.
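-  The class rule can be sketched with trimmed-down versions of the two
     class "s" models from the examples section: one holding only a
     reaction branch under physicsEvent, the other only a hitView branch.
     The note defines the union on schemas; it is shown here in template
     form for brevity, so the exact layout is an assumption made for this
     illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="s" version="1.0" xmlns="http://www.gluex.org/hddm">
  <!-- defined in both models with the same attributes: no collision -->
  <physicsEvent eventNo="int" runNo="int">
    <!-- found only in the generator model, so optional in the union -->
    <reaction type="int" weight="float" minOccurs="0" maxOccurs="unbounded">
      <vertex maxOccurs="unbounded">
        <origin vx="float" vy="float" vz="float" t="float" />
      </vertex>
    </reaction>
    <!-- found only in the hits model, so optional in the union -->
    <hitView version="1.0" minOccurs="0">
      <forwardTOF>
        <slab y="float" minOccurs="0" maxOccurs="unbounded">
          <side end="int" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" maxOccurs="unbounded" />
          </side>
        </slab>
      </forwardTOF>
    </hitView>
  </physicsEvent>
</HDDM>
```

     By contrast, a model that declared physicsEvent with only an eventNo
     attribute would give a shared tag a different set of attributes, and
     so would have to be assigned to a different class.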
Implementation Notes:
-  There is a complementarity between xml schemas and the xml templates
     that express the metadata in hddm files.  Depending on the level of
     detail desired, schemas may become arbitrarily sophisticated and
     complex.  On the other hand, only a small subset of that information
     is needed to support the functions of the hddm tool set.  Templates
     allow that information to be distilled in a compact form that is both
     human-readable and valid xml.
-  In the present implementation, the text layout of the template
     (including the whitespace between the tags) is used by the hddm tools
     to simplify the encoding and decoding.  There is exactly one tag per
     line and two leading spaces per level of indent.  This may change in
     future implementations.  This means that hddm file headers should not
     be edited by hand.
-  The XDR library is used to encode the binary values in the hddm
     stream.  This means that hddm files are machine-independent, and
     can be read and written on any machine without any dependence on
     whether the machine is little-endian or big-endian.  XDR is the network
     encoding standard library developed for Sun's rpc and nfs services.
     For more info, search for RFC 1014 on the web or do "man xdr" under
     linux.
-  The binary file format will change.  The point is not to fix
     on some absolute binary format at this early stage.  The only
     design constraint was that the data model be specified in xml and
     that the data be readily converted into plain-text xml, preferably
     without needing to look up auxiliary files or loading the libraries
     that wrote it.
-  The design of the i/o library has been optimized for flexibility:
     the user can request only the part of the model that is of interest.
     The entire model does not even have to be present in the file, in which
     case only the parts of the tree that are present in the file are loaded
     into memory, and the rest of the requested structure is zeroed out.
-  The only constraint between the model used in the program and that
     of the hddm stream is that there be no collisions, that is, tags
     found in both but with different attributes.
-  Two data models with colliding definitions can be used in one program
     but they have to have different class Ids.  Two streams with
     different class Ids cannot feed into each other.   In any case the
     xml viewing tool hddm-xml can read a hddm stream of any class.
Examples:
-  A simple model of an event fragment describing hits in a 
     time-of-flight wall.  It allows for multiple hits per detector
     in a single event, with t and dE information for each hit.
     The hits are ordered by side (right: end=0, left: end=1) and then by
     horizontal slab.  The minOccurs and maxOccurs attributes allow those
     tags to appear any number of times, or not at all, in the given context.
<forwardTOF>
  <slab y="float" minOccurs="0" maxOccurs="unbounded">
    <side end="int" minOccurs="0" maxOccurs="unbounded">
      <hit t="float" dE="float" maxOccurs="unbounded" />
    </side>
  </slab>
</forwardTOF>
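-  For comparison, a document that follows the template above might look
     like the sketch below.  The numerical values are invented for this
     illustration, no units are implied, and this is not actual output
     from hddm-xml.

```xml
<forwardTOF>
  <!-- slab may repeat: one element per slab that was hit -->
  <slab y="-12.5">
    <side end="0">
      <!-- two hits on the same side of the same slab -->
      <hit t="24.1" dE="0.0031" />
      <hit t="57.8" dE="0.0012" />
    </side>
    <side end="1">
      <hit t="25.3" dE="0.0029" />
    </side>
  </slab>
  <slab y="6.0">
    <!-- a side element, when present, holds at least one hit -->
    <side end="1">
      <hit t="31.0" dE="0.0045" />
    </side>
  </slab>
</forwardTOF>
```

     An event with no hits in the wall would reduce to an empty forwardTOF
     element, since slab is declared with minOccurs="0".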
-  A model of the output from an event generator.  The template for the
     model is shown below; actual output from genr8 can be converted to
     xml of this form using hddm-xml.  Warning: some browsers have
     difficulty displaying plain xml.  Mozilla 1.x and Internet Explorer 6
     give a nice view of such documents.
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="s" version="1.0" xmlns="http://www.gluex.org/hddm">
  <physicsEvent eventNo="int" runNo="int">
    <reaction type="int" weight="float" maxOccurs="unbounded">
      <beam type="Particle_t">
        <momentum px="float" py="float" pz="float" E="float" />
        <properties charge="int" mass="float" />
      </beam>
      <target type="Particle_t">
        <momentum px="float" py="float" pz="float" E="float" />
        <properties charge="int" mass="float" />
      </target>
      <vertex maxOccurs="unbounded">
        <product type="Particle_t" decayVertex="int" maxOccurs="unbounded">
          <momentum px="float" py="float" pz="float" E="float" />
          <properties charge="int" mass="float" />
        </product>
        <origin vx="float" vy="float" vz="float" t="float" />
      </vertex>
    </reaction>
  </physicsEvent>
</HDDM>
-  A more complex example follows, showing a hits tree for the full
detector.
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="s" version="1.0" xmlns="http://www.gluex.org/hddm">
  <physicsEvent eventNo="int" runNo="int">
    <hitView version="1.0">
      <barrelDC>
        <cathodeCyl radius="float" minOccurs="0" maxOccurs="unbounded">
          <strip sector="int" z="float" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" maxOccurs="unbounded" />
          </strip>
        </cathodeCyl>
        <ring radius="float" minOccurs="0" maxOccurs="unbounded">
          <straw phim="float" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" minOccurs="0" maxOccurs="unbounded" />
            <point z="float" dEdx="float" phi="float"
                        dradius="float" maxOccurs="unbounded" />
          </straw>
        </ring>
      </barrelDC>
    
      <forwardDC>
        <package pack="int" minOccurs="0" maxOccurs="unbounded">
          <chamber module="int" minOccurs="0" maxOccurs="unbounded">
            <cathodePlane layer="int" u="float" minOccurs="0" maxOccurs="unbounded">
              <hit t="float" dE="float" minOccurs="0" maxOccurs="unbounded"/>
              <cross v="float" z="float" tau="float" maxOccurs="unbounded" />
            </cathodePlane>
          </chamber>
        </package>
      </forwardDC>
    
      <startCntr>
        <sector sector="float" minOccurs="0" maxOccurs="unbounded">
          <hit t="float" dE="float" maxOccurs="unbounded" />
        </sector>
      </startCntr>
    
      <barrelCal>
        <module sector="float" minOccurs="0" maxOccurs="unbounded">
          <flash t="float" pe="float" maxOccurs="unbounded" />
        </module>
      </barrelCal>
        
      <Cerenkov>
        <module sector="float" minOccurs="0" maxOccurs="unbounded">
          <flash t="float" pe="float" maxOccurs="unbounded" />
        </module>
      </Cerenkov>
    
      <forwardTOF>
        <slab y="float" minOccurs="0" maxOccurs="unbounded">
          <side end="int" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" maxOccurs="unbounded" />
          </side>
        </slab>
      </forwardTOF>
    
      <forwardEMcal>
        <row row="int" minOccurs="0" maxOccurs="unbounded">
          <column col="int" minOccurs="0" maxOccurs="unbounded">
            <flash t="float" pe="float" maxOccurs="unbounded" />
          </column>
        </row>
      </forwardEMcal>
    </hitView>
  </physicsEvent>
</HDDM>
This material is based upon work supported by the National Science Foundation under Grant No. 0072416.