XRel: A Path-Based Approach to Storage and Retrieval of XML Documents ...

L Documents using Relational Databases
XRel: A Path-Based Approach to Storage and
Retrieval of XML Documents using Relational
Databases
Masatoshi Yoshikawa and Toshiyuki Amagasa
Nara Institute of Science and Technology,
Takeyuki Shimura
IBM Japan, Ltd.
and
Shunsuke Uemura
Nara Institute of Science and Technology
This paper describes XRel, a novel approach to storage and retrieval of XML documents using relational databases.
In this approach, an XML document is decomposed into nodes based on its tree structure, and stored in relational
tables according to the node type, with path information from the root to each node. XRel enables us to store
XML documents using a xed relational schema without any information about DTDs and element types, and
also enables us to utilize indices such as the
B
+
-tree and the
R-tree supported by database management systems.
For the processing of XML queries, we present an algorithm for translating a core subset of XPath expressions
into SQL queries. Thus, XRel does not impose any extension of relational databases for storage of XML doc-
uments, and query retrieval based on XPath expressions can be realized in terms of a preprocessor for database
query language. Finally, we demonstrate the effectiveness of this approach through several experiments using
actual XML documents.
Categories and Subject Descriptors: I.7.1 [Document and Text Processing]: Document and Text Editing
Document management; I.7.2 [Document and Text Processing]: Document PreparationMarkup languages,
XML; H.2.1 [Database Management]: Logical DesignSchema and subschema; H.3.6 [Information Storage
and Retrieval
]: Library AutomationLarge text archives
General Terms: Algorithms, Design, Management, Performance
This work is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan
under grants 11480088, 12680417 and 12208032, and by CREST of JST (Japan Science and Technology).
An earlier version of this paper appeared in the Proceedings of the Tenth International Conference on Database
and Expert Systems Applications, Aug. 30 - Sep. 3, 1999 (DEXA99), pp. 206-217.
Name: Masatoshi Yoshikawa, Toshiyuki Amagasa and Shunsuke Uemura
Address: 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
Afliation: Graduate School of Information Science, Nara Institute of Science and Technology; e-mail:
{yosikawa, amagasa, uemura}@is.aist-nara.ac.jp
Name: Takeyuki Shimura
Address: 1623-14, Shimotsuruma, Yamato, Kanagawa, 242-8502, Japan
Afliation: IBM Japan, Ltd.; e-mail: takeyus@jp.ibm.com
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted with-
out fee provided that copies are not made or distributed for prot or direct commercial advantage and that copies
show this notice on the rst page or initial screen of a display along with the full citation. Copyrights for com-
ponents of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other
works, requires prior specic permission and/or a fee. Permissions may be requested from Publications Dept,
ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or
permissions@acm.org
. 2
·
M. Yoshikawa et al.
Additional Key Words and Phrases: text markup, text tagging, XML query, XPath
1. INTRODUCTION
XML (Extensible Markup Language) [World Wide Web Consortium 1998] is emerging as
a standard format for data and documents on the Internet. Various kinds of applications
that use the XML format have been developed (see [Cover ]). As a result, many kinds of
data will be exchanged in the form of XML documents or XML data. In addition, storage
devices are begin miniaturized while their capacity is being enlarged. Thus, it is expected
that not only organizations but also individuals will use a large quantity of XML documents
in the near future. Development of techniques for storing massive XML documents and
retrieving information needed from them is one of the core problems regarding the point
of contact between the research area of database and XML.
In this paper, we describe XRel, a novel approach to building XML databases on top of
off-the-shelf relational databases. The design goals of XRel are as follows: (1) No restric-
tion should be imposed on XML documents being stored. Any valid or well-formed XML
documents should be stored and queried; (2) XML queries should be based on W3C stan-
dardization activities. (3) Storage and manipulation of XML documents should be made
possible using currently available relational databases. No extension of data model, query
expressive power, or index structures of relational database systems should be assumed;
and (4) Query processing should be efcient.
One of the important features of XML documents is that we can perform operations
based on their logical structures. Based on this feature, databases that manage XML docu-
ments have to support queries on their logical structures and on their contents. Considering
access based on a logical unit and reuse-ability, it is appropriate to decompose and store
XML documents according to their tree structures. They are then stored in databases. In
order to retrieve XML documents from such databases, XML queries are translated into
database queries (typically in SQL).
There are two approaches to designing database schemas for XML documents, as fol-
lows.
Structure-mapping approach: Database schemas represent the logical structure (or DTDs
if they are available) of target XML documents. In a basic design method, a relation
or class is created for each element type in the XML documents (e.g. [Christophides
et al. 1994; Abiteboul et al. 1997]). A more sophisticated mapping method has also
been proposed, whereby database schemas are designed based on detailed analysis of
DTDs [Shanmugasundaram et al. 1999]. In the structure-mapping approach, a database
schema is dened for each XML document structure or DTD.
Model-mapping approach: Database schemas represent constructs of the XML document
model. In this approach, a xed database schema is used to store the structure of all
XML documents. Early proposals of this approach include [Zhang 1995]. Another
example includes the edge approach [Florescu and Kossmann 1999b], in which edges
in XML document trees are stored as relational tuples.
In this paper, we adopt the model-mapping approach for the following two reasons:
The data structure of XML documents has richer expressive power than the relational XRel: A Path-Based Approach to Storage and Retrieval of XML Documents
·
3
data model or object-oriented data model; more concretely, neither the relational data
model nor the object-oriented data model has constructs to express the order or choice
(
|) of elements in the element content models in DTDs. This implies that we cannot
nd a method of structure mapping that maps data structures of XML documents into
database schemas in a natural way. To cope with this problem, we need to extend the
expressive power of database models [Christophides et al. 1994; Abiteboul et al. 1997].
However, storage schemes assuming extended database models are not applicable to
off-the-shelf database systems.
The structure-mapping approach is suitable when we store a large number of XML doc-
uments that conform to a limited number of document structures or DTDs, and when
the document structures or DTDs are modestly static. However, numerous sophisticated
Web applications are based on the exible and dynamic usage of XML. In such appli-
cations, there is a demand to store various kinds of XML documents i) the DTDs of
which are not known beforehand, or ii) that are well-formed but do not have DTDs. Fur-
ther, many such applications deal with XML documents the logical structure of which
changes often. Obviously, the structure-mapping approach is inappropriate to the stor-
age of a large number of such dynamic and structurally-variant XML documents.
In both of the above-mentioned approaches to database schema design, XML documents
are decomposed into fragments composed of certain logical units. Obviously, these decom-
position approaches have drawbacks; it takes time to restore the entire or large sub-portion
of original XML documents, and processing of certain text operations such as a proxim-
ity search beyond the boundaries of elements becomes very complex. A simple alternative
approach to overcome these problems is to store the entire text of XML documents in a sin-
gle database attribute as a CLOB (Character Large OBject), or as les outside of database
systems. In our system, we optionally keep the entire text of XML documents in addition
to their fragments stored in database schemas. The decision whether to keep entire XML
documents depends on the demand of the applications that will be used. The entire text of
each XML document in our system is stored as CLOB data if CLOB is supported in the
database system, or stored as text les otherwise.
There are additional important design choices; that is, which database models should we
use for the XML document databases the (object) relational database model or object-
oriented database ? We chose the (object) relational databases for the following two rea-
sons:
Current use of (object) relational databases is widespread [Leavitt 2000]. Consequently,
a large quantity of non-XML data have already been stored in them. In order to work in
closer cooperation with such traditional data, it is useful to store XML data in the same
type of databases.