Secure and Privacy Preserving Outsourcing of Tree Structured Data
a State University
Tempe AZ. 85287, USA.
{ping.lin, candan}@asu.edu
phone:480-727-3611
Abstract. With the increasing use of web services, many new challenges
concerning data security are becoming critical. Data or applications can
now be outsourced to powerful remote servers, which are able to pro-
vide services on behalf of the owners. Unfortunately, such hosts may not
always be trustworthy. In [1, 2], we presented a one-server computation-
ally private tree traversal technique, which allows clients to outsource
tree-structured data. In this paper, we extend this protocol to prevent a
polynomial time server with large memory to use correlations in client
queries and in data structures to learn private information about queries
and data. We show that, when the proposed techniques are used, com-
putational privacy is achieved even for non-uniformly distributed node
accesses that are common in real databases.
Keywords: Search on Encrypted Data, Tree structured data (XML) Security,
Private Information Retrieval
1
Introduction
In web and mobile computing, clients usually do not have sucient computation
power or memory and they need remote servers to do the computation or store
data for them. Publishing data on remote servers helps improve service avail-
ability and system scalability, reducing clients burden of managing data. With
their computation power and large memory, such remote servers are called data
stores or oracles. Typically, as the entities dierent than the data owners, these
data stores can not be fully trusted, for they may be malicious and can be driven
by their own benets to make illegal use of information stored on them. Hence
data outsourcing introduces security concerns that are dierent from traditional
database service which always assumes that database is honest and it is the il-
legal users access that should protect the data from. These concerns have to be
addressed eectively to convince customers that outsourcing their IT needs is a
viable alternative to deploying complex infrastructures locally.
1.1
Problem Statement
Special security concerns with respect to data outsourcing can be categorized
into content privacy and access privacy [12]. Clients with sensitive data (e.g.,
personal identiable data) outsourced to untrusted host may require that their
data be protected from such data storage oracles. This is dened as content
privacy [12] and leads to encrypted database research [7], in which sensitive
data is encrypted, so the content is hidden from the database. Sometimes not
only the data outsourced to a data store, but also queries are of value and a
malicious data store can make use of such information for its own benets. This
is dened as access privacy [12]. Access privacy leads to private information
retrieval [13] research, which studies how to let users retrieve information from
database without leaking (even to the server) the identity and the location of
the retrieved data item.
Access privacy and content privacy are not independent. If the data has some
structure, plain access may reveal this structure, hence impair content privacy.
For example, if the data is in the form of an XML tree (as described in the next
subsection), without proper methods to protect access privacy, the path along
which to nd the target data will be revealed, so will be the whole structure of
the data tree, which impairs the content privacy. Hence to protect both content
privacy and access privacy, we need to hide the structure of the data.
In this paper, we address secure outsourcing of tree structured data, such
as XML documents. To be specic, we address hiding of tree-structured data
and queries on this data. XML documents [8] have tree-like structures. XML
has become a de facto standard for data exchange and representation over the
Internet [9]. Some work has been done on selective and authentic untrusted
third-party distribution of XML documents [3, 911]. The work focuses on access
control and authentication of document (i.e.,query result) source and content.
With more and more data stored in XML documents, techniques to hide tree
structures (the content and structure of XML documents) from untrusted data
stores are in great need. In an XML database, a query is often given in the
form of tree paths, like XQuery. To hide XML queries and the structure of XML
documents, clients need to traverse XML trees in a hidden way. Other frequently
used tree-structures include indexes that are often built for convenient access to
data. However, most index structures closely reect the distribution of the data.
Thus, in order to hide the data and data distribution from the database, tree
structure hiding techniques must be adopted to protect index trees from oracles.
Though the techniques we present in this paper can also be used for hiding
index structure accesses, here we do not focus on this application. Recent work
in privacy-preserving index structures includes [4].
In [1, 2], we proposed a protocol for hiding traversals of trees from ora-
cles. Noticing that existing private information retrieval techniques require ei-
ther heavy replication of the database onto multiple non-communicating servers
or large communication costs or computation costs [13], we provided an one-
server tree-traversal protocol that provides a balance between the communica-
tion/computation cost and security requirements. To protect the client from the
malicious data store, some tasks (such as traversing the tree-structures) are per-
formed interactively. Client responsibilities include encryption and decryption of
the data received from the data store during the traversal of the tree. The en-
cryption capability required at client side can be achieved by assistant hardware
equipments, such as smartcards that are cheap (generally no more than sev-
eral dollars) and now commonly used in mobile environments [20]. We analyzed
the overhead incurred by proposed technique, including communication cost,
encryption/decryption costs, and concurrency overhead. Since [13] has argued
that information-theoretical private information retrieval cannot be achieved
without a signicant communication overhead, our proposed method minimizes
those costs in a computational hiding sense.
In this paper, we build on the approaches proposed in [1, 2] to develop a
computationally secure protocol for hiding correlated accesses to tree-structured
outsourced data. We nd that if node accesses are uniformly distributed, the
original protocol achieves computational privacy [2]. To ensure computational
privacy in face of non-uniformly distributed and correlated node accesses, which
actually occur in real scenarios, we propose a systematic way to enhance the
preliminary protocol so that from the servers view, node accesses are uniformly
distributed.
2
Related Work
Besides the common data encryption methods that hide sensitive data, there are
various eorts to hide other kinds of secret information from untrusted servers.
Basic methods to protect content privacy include database encryption, where
critical data such as credit card number can be encrypted. DBMS suppliers
such as Oracle and DB2 have provided encryption functionality. Bertino and
Ferrari [10] have studied how to protect sensitive XML data content from dif-
ferent entities by performing dierentiated encryption of various portions using
multiple keys. Hacig¨
um¨
us, H. et al. [7] have studied how to execute SQL over en-
crypted data. Other recent work on querying encrypted data includes [5] and [6].
Sensitive information about data may include users queries about data. Dif-
ferent from traditional database security which deals with preventing, detecting,
and deterring improper disclosure or modication of information in databases,
private information retrieval (PIR) aims to let users query a database without
leaking to the database what data is queried.
The basic idea behind any information theoretic PIR scheme is to replicate
the database to several non-communicating servers and pose randomized queries
to each server so that from the servers view those queries are independent of
the target but the user can reconstruct the target data from query results. The
privacy guarantee lies in that even computationally un-bounded server can not
tell the dierence between any two communications for dierent targets. [13]
showed that if one copy of database is used, the only way to hide the query
in the information theoretic sense is to send the whole database to the user.
In order to reduce even one bit in communication between the server and the
user, replication of the whole database is required. Hence information theoretic
privacy techniques require multiple data copies and cause heavy communication
overheads.
In order to achieve practical use, communication cost and the number of
replicas need to be reduced. Ambainis proposed a k-server scheme that requires
O(n
2k1
) communication [14] (where n is the database size). This result is fur-
ther improved to n
O
(
loglogk
klogk
)
by Beimel et al. [15]. This is the best known k-server
information theoretic private information retrieval scheme so far. However, to
achieve communication that is subpolynomial in the size of the database, more
than a constant number of servers are needed [13]. In real world, database repli-
cation is not a preferable solution. It may not be possible to prevent servers
from communicating with each other. In computationally private information
retrieval (CPIR) schemes, therefore, the users privacy requirement is relaxed so
that any two