StatDataML: An XML Format for Statistical Data

title?, source?, date?, version?,
comment?, creator?, properties?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
<!ELEMENT creator (#PCDATA)>
<!ELEMENT properties (list)>
It consists of seven elements. Six of them (title, source, date, comment,
version, and creator) are simple strings (PCDATA), whereas properties is a list
element (see next section). The creator element should contain information
about the creating application and the StatDataML implementation; version
complements date in uniquely identifying the data set; and properties offers
a well-defined structure for saving application-specific meta-information.
1.3
The dataset element
We define a dataset element either as a list or as an array:
<!ELEMENT dataset (list | array)>
We use arrays and lists as basic data types in StatDataML because virtually
every data object in statistics can be decomposed into a set of arrays and
lists (as in the S language, or the corresponding arrays and cell-arrays in
MATLAB). The basic property of a list is its generic structure (it may contain
data of any type), in contrast to arrays whose elements are all of the same type.
As a consequence, lists can also represent recursive structures because they can
also contain lists.
1.3.1
Lists
A list consists of three elements: dimension, properties, and listdata:
<!ELEMENT list (dimension, properties?, listdata)>
<!ELEMENT listdata (list | array | empty)*>
The dimension element may contain several dim tags, depending on the
number of dimensions:
<!ELEMENT dimension (dim*)>
<!ELEMENT dim (e*)>
<!ATTLIST dim size CDATA #REQUIRED>
<!ATTLIST dim name CDATA #IMPLIED>
Each dim element has size as a required attribute and may optionally contain up
to size names, specified with <e>...</e> tags. In addition, the dimension
as a whole can be given a name through the optional name attribute. Note
that arrays, like the whole dataset, can also have additional properties
attached, corresponding, e.g., to attributes in S. The listdata element may
contain arrays (with the actual data), further lists (allowing complex and
even recursive structures), or empty tags (indicating non-existing elements,
corresponding to NULL in S).
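The list structure described above can be walked recursively. The following is a minimal sketch in Python using only the standard library; the instance document, the dimension name "parts", and the helper names read_dims and read_list are illustrative assumptions, not part of StatDataML itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: a one-dimensional list with three slots. The first
# slot holds a nested (empty) list, the other two are non-existing elements.
DOC = """
<list>
  <dimension>
    <dim size="3" name="parts"><e>a</e><e>b</e><e>c</e></dim>
  </dimension>
  <listdata>
    <list><dimension><dim size="0"/></dimension><listdata/></list>
    <empty/>
    <empty/>
  </listdata>
</list>
"""

def read_dims(node):
    """Return one (size, name, element_names) tuple per <dim> tag."""
    dims = []
    for dim in node.find("dimension").findall("dim"):
        names = [e.text for e in dim.findall("e")]
        dims.append((int(dim.get("size")), dim.get("name"), names))
    return dims

def read_list(node):
    """Convert a <list> into nested Python lists, with None for <empty/>."""
    items = []
    for child in node.find("listdata"):
        if child.tag == "list":
            items.append(read_list(child))   # recursion handles nested lists
        elif child.tag == "empty":
            items.append(None)               # corresponds to NULL in S
        else:
            items.append(child)              # <array>: left unparsed in this sketch
    return items

root = ET.fromstring(DOC)
dims = read_dims(root)
items = read_list(root)
print(dims)
print(items)
```

Mapping empty to None is only one possible choice; an application with its own NULL concept would substitute that instead.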
1.3.2
Arrays
Arrays are blocks of data objects of the same elementary type with dimension
information used for memory allocation and data access (indexing):
<!ELEMENT array (dimension, type, properties?, (data | textdata))>
The dimension and properties elements are identical to the corresponding
list tags. The listdata block is replaced by the data (or textdata)
element, which contains the data itself. The type element contains all
information about the statistical data type:
<!ELEMENT type (logical | categorical | numeric | character | datetime)>
<!ELEMENT logical EMPTY>
<!ELEMENT categorical (label)+>
<!ELEMENT numeric (integer | real | complex)?>
<!ELEMENT character EMPTY>
<!ELEMENT datetime EMPTY>
It must contain exactly one logical, categorical, numeric, character,
or datetime tag. The categorical tag must, and the numeric element
may, contain additional elements that characterize the type more finely.
The categorical tag carries a mode attribute that can be unordered
(factors), ordered, or cyclic (e.g., days of the week); unordered is the
default:
<!ELEMENT categorical (label)+>
<!ATTLIST categorical mode (unordered | ordered | cyclic) "unordered">
In addition, the factor labeling has to be specified by means of one or more
label tags:
<!ELEMENT label (#PCDATA)>
<!ATTLIST label code CDATA #REQUIRED>
The label element has a mandatory code attribute specifying the level's
integer value, and optionally contains a name. If no name is given, the
application should use the numerical code instead. The order of the label
elements also defines the ordering relation of the levels for ordinal data.
As an example, consider the type specification of a color factor:
<type>
<categorical mode="unordered">
<label code="1">red</label>
<label code="2">green</label>
<label code="3">blue</label>
<label code="4">yellow</label>
</categorical>
</type>
In the data section (see below), only the codes will be used.
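A reader has to translate the codes back into labels. The sketch below shows one way to do this in Python with the standard library; the helper name read_levels is hypothetical, and the fourth label is deliberately left without a name to illustrate the fallback to the numerical code.

```python
import xml.etree.ElementTree as ET

# Type section of a color factor; the last level has no name on purpose.
TYPE_XML = """
<type>
  <categorical mode="unordered">
    <label code="1">red</label>
    <label code="2">green</label>
    <label code="3">blue</label>
    <label code="4"/>
  </categorical>
</type>
"""

def read_levels(type_node):
    """Return (mode, {code: label}) for a categorical type."""
    cat = type_node.find("categorical")
    # ElementTree does not apply DTD defaults, so the default is repeated here.
    mode = cat.get("mode", "unordered")
    levels = {}
    for label in cat.findall("label"):
        code = label.get("code")
        # Fall back to the numerical code if no name is given.
        levels[code] = label.text if label.text else code
    return mode, levels

mode, levels = read_levels(ET.fromstring(TYPE_XML))
decoded = [levels[code] for code in ["1", "2", "3", "4"]]
print(mode, decoded)
```

Because Python dictionaries preserve insertion order, the document order of the label elements, which defines the ordering relation for ordinal data, is kept in levels.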
Finally, the numeric element may contain a further tag, allowing the
distinction of integer, real, and complex data:
<!ELEMENT numeric (integer | real | complex)?>
<!ELEMENT integer (min?, max?)>
<!ELEMENT real (min?, max?)>
<!ELEMENT complex EMPTY>
If numeric is left empty, the data is assumed to be real. For integer and
real, one can optionally specify the data range using the min and max tags,
allowing the parsing software both to choose a memory-saving storage mode
and to check the validity of the data:
<!ENTITY % RANGE "#PCDATA | posinf | neginf">
<!ELEMENT min (%RANGE;)*>
<!ELEMENT max (%RANGE;)*>
As an example, consider the type specification for the integers from 1 to 10:
<type>
<numeric>
<integer>
<min>1</min> <max>10</max>
</integer>
</numeric>
</type>
The content of min and max should be <posinf/> and <neginf/> for +∞
and −∞, respectively.
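The range information can be used for validity checks. The following Python sketch (standard library only) reads the optional bounds and tests a few values against them; the helper names read_bound and read_range, and the choice of an unbounded default when a tag is absent, are assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET

# Integers from 1 upward: <max> uses the <posinf/> tag instead of a number.
TYPE_XML = """
<type>
  <numeric>
    <integer>
      <min>1</min> <max><posinf/></max>
    </integer>
  </numeric>
</type>
"""

def read_bound(node, default):
    """Translate a <min> or <max> element into a float."""
    if node is None:
        return default                      # tag omitted: assume no bound
    if node.find("posinf") is not None:
        return math.inf
    if node.find("neginf") is not None:
        return -math.inf
    return float(node.text)

def read_range(type_node):
    """Return (kind, lower, upper) for a numeric type."""
    children = list(type_node.find("numeric"))
    if not children:
        return "real", -math.inf, math.inf  # empty <numeric/> means real
    kind = children[0]
    lower = read_bound(kind.find("min"), -math.inf)
    upper = read_bound(kind.find("max"), math.inf)
    return kind.tag, lower, upper

kind, lower, upper = read_range(ET.fromstring(TYPE_XML))
in_range = [lower <= value <= upper for value in (0, 1, 10**6)]
print(kind, lower, upper, in_range)
```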
1.3.3
The data tag
If data is used (especially recommended for character data), then each element
of the array representing an existing value is enclosed in an <e>...</e> pair
(or <ce>...</ce> for complex numbers). Missing values have to be marked with
<na/>, whereas empty values are simply represented by <e></e> (or <e/>):
<!ELEMENT data (e|ce|na|T|F)* >
<!ENTITY % REAL "#PCDATA|posinf|neginf|nan">
<!ELEMENT e (%REAL;)* >
<!ELEMENT posinf EMPTY>
<!ELEMENT neginf EMPTY>
<!ELEMENT nan EMPTY>
<!ELEMENT ce (r,i) >
<!ELEMENT r (%REAL;)* >
<!ELEMENT i (%REAL;)* >
<!ELEMENT na EMPTY>
<!ATTLIST e info CDATA #IMPLIED>
<!ATTLIST ce info CDATA #IMPLIED>
<!ATTLIST na info CDATA #IMPLIED>
<!ATTLIST T info CDATA #IMPLIED>
<!ATTLIST F info CDATA #IMPLIED>
As an example, consider a character dataset formed by color names, with one
value missing (after green), and one being empty (after blue). The
corresponding data section would appear as follows:
<data>
<e>red</e> <e>green</e> <na/> <e>blue</e> <e></e> <e>yellow</e>
</data>
If the colors were coded as factor levels, the example would become:
<data>
<e>1</e> <e>2</e> <na/> <e>3</e> <e></e> <e>4</e>
</data>
(Note that the association between numbers and labels is defined in the type
section as mentioned above.)
The <e>, <ce>, and <na/> tags (and also <T/> and <F/>, see below) can
carry an optional info attribute, allowing the storage of meta-information:
<e>120</e> <e info="unsure">123</e> <na info="data deleted"/>
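The distinction between missing and empty values, together with the info attribute, can be illustrated with a short Python sketch (standard library only). The function name read_character_data and the use of None for missing values are assumptions; an application would use its own NA representation.

```python
import xml.etree.ElementTree as ET

# Character data with one missing value, one empty value, and two annotations.
DATA_XML = """
<data>
  <e>red</e> <e>green</e> <na info="data deleted"/>
  <e>blue</e> <e></e> <e info="unsure">yellow</e>
</data>
"""

def read_character_data(data_node):
    """Return parallel lists of values and info annotations."""
    values, infos = [], []
    for child in data_node:
        if child.tag == "na":
            values.append(None)              # missing value
        elif child.tag == "e":
            values.append(child.text or "")  # <e></e> or <e/> is an empty string
        else:
            raise ValueError("unexpected tag in character data: " + child.tag)
        infos.append(child.get("info"))      # None if no info attribute is set
    return values, infos

values, infos = read_character_data(ET.fromstring(DATA_XML))
print(values)
print(infos)
```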
IEEE Number Format
Computer systems represent numbers in different ways, depending on their
hardware architecture. We require the number format to follow the IEEE Standard
for Binary Floating Point Arithmetic (Institute of Electrical and Electronics
Engineers, 1985), implemented by most programming languages and system
libraries. However, the IEEE special values +Inf, -Inf, and NaN must be
explicitly specified by <posinf/>, <neginf/>, and <nan/>, respectively, to
facilitate parsing in case the IEEE standard is not implemented.
(We do not distinguish between the Quiet and Signalling variants of NaN
provided by the IEEE standard; the signalling one, if ever used, is closer
to NA values, for which we define a special symbol.) These special values
could appear, e.g., as follows:
<data>
<e>1.23</e> <e><posinf/></e> <e><nan/></e> <e>2.43</e>
</data>
When an application reads a StatDataML file, the implementation is responsible
for the correct casts, i.e. for choosing the appropriate number representation in
the computer system.
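As an illustration of such a cast, the following Python sketch (standard library only) maps the special tags onto the corresponding floating-point values. The function name read_real and the lookup table SPECIAL are hypothetical; the example data extends the one above by a <neginf/> entry.

```python
import math
import xml.etree.ElementTree as ET

DATA_XML = (
    "<data>"
    "<e>1.23</e> <e><posinf/></e> <e><nan/></e> <e>2.43</e> <e><neginf/></e>"
    "</data>"
)

# Mapping from the StatDataML special tags to Python floats.
SPECIAL = {"posinf": math.inf, "neginf": -math.inf, "nan": math.nan}

def read_real(e_node):
    """Convert one <e> element of a real-valued array into a float."""
    for child in e_node:
        if child.tag in SPECIAL:
            return SPECIAL[child.tag]
    return float(e_node.text)

values = [read_real(e) for e in ET.fromstring(DATA_XML).findall("e")]
print(values)
```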
Complex Numbers
Complex numbers are enclosed in <ce>...</ce> tags, containing exactly one
<r>...</r> tag (for the real part) and one <i>...</i> tag (for the
imaginary part). Apart from that, the same rules as for <e>...</e> apply:
<data>
<ce> <r>12.4</r> <i>1</i> </ce>
</data>
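A reader could assemble complex values from the two parts as in the following Python sketch (standard library only). The helper names read_part and read_complex are assumptions; the second <ce> entry is added here only to show that the special tags are also allowed inside <r> and <i>.

```python
import xml.etree.ElementTree as ET

DATA_XML = (
    "<data>"
    "<ce> <r>12.4</r> <i>1</i> </ce>"
    "<ce> <r>0</r> <i><posinf/></i> </ce>"
    "</data>"
)

def read_part(node):
    """Convert an <r> or <i> element into a float, honoring the special tags."""
    tags = {child.tag for child in node}
    if "posinf" in tags:
        return float("inf")
    if "neginf" in tags:
        return float("-inf")
    if "nan" in tags:
        return float("nan")
    return float(node.text)

def read_complex(ce_node):
    """Combine the real and imaginary parts of a <ce> element."""
    return complex(read_part(ce_node.find("r")), read_part(ce_node.find("i")))

values = [read_complex(ce) for ce in ET.fromstring(DATA_XML).findall("ce")]
print(values)
```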
Logical Values
The true and false values are represented by the special tags <T/> and <F/>:
<data>
<T/> <F/>
</data>
Date and Time Information
Data of type datetime has to follow the ISO 8601 specification (see
International Organization for Standardization, 2000). StatDataML should only
make use of the complete representation in extended format of the combined
calendar date and time of the day representation:
CCYY-MM-DDThh:mm:ss±hh:mm
where the characters represent Century (C), Year (Y), Month (M), Day (D),
Time designator (T; indicates the start of time elements), Hour (h), Minutes
(m) and Seconds (s). For example, the 12th of March 2001 at 12 hours and 53
minutes, UTC+1, would be represented as: 2001-03-12T12:53:00+01:00.
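Timestamps in this form can be read directly by many standard libraries. As a minimal sketch, Python's datetime.fromisoformat (available since Python 3.7) parses the example above; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# The example from the text: 12 March 2001, 12:53:00, UTC+1.
stamp = datetime.fromisoformat("2001-03-12T12:53:00+01:00")

offset = stamp.utcoffset()               # the "+01:00" part as a timedelta
as_utc = stamp.astimezone(timezone.utc)  # same instant expressed in UTC
round_trip = stamp.isoformat()           # back to the extended format

print(offset, as_utc, round_trip)
```

Keeping the offset in the parsed value lets an application write the timestamp back unchanged.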
1.3.4
The textdata tag
Fo