OLAP Data Scalability
ack height=1 colspan=2>
OLAP Data Scalability
OLAP Data Scalability
White Paper
Ignore OLAP Data Explosion at great cost.
many organisations will never know that they figuratively bought a very
expensive rowing boat, when they could have traveled business class for less!
by Johann Potgieter
© 2003 SPF Pty Ltd. All Rights Reserved.
© 2003 SPF Pty Ltd. All Rights Reserved.
Page 1/8
Executive Summary:
Introduction:
The reality of data explosion in multi-dimensional databases is a surprising and
widely misunderstood phenomenon. For those about to buy or use an OLAP product,
it is critically important to understand what data explosion is, what causes it, and how
it can be avoided, because the consequences of ignoring data explosion can be very
costly, and in most cases, result in project failure.
There are very few OLAP vendors who can truly claim to have technically conquered
the consequences of data explosion. The claims offered by many vendors about how
they manage data explosion make it very difficult to understand what is actually
important on this topic and whats not.
For example, one of the problems of data explosion is that it results in a massive
database. The size of the database in one product can literally be hundreds and even
thousands of times bigger than the same database in another product.
Rather than admit to the problems of data explosion, the vendor with the massive
database will argue that his database is handling large data sets, while he will imply
that the vendor of the smaller database a database without data explosion - cannot
address large enterprise datasets.
The correct analysis should be to compare sizes with equal volumes of base data, but
because the size of the databases are so profoundly different, prospective customers
find it hard to believe that such dramatic differences are possible with similar datasets.
The end result is that organisations often commit to what they erroneously believe is
the best so-called enterprise solution. This mistake comes at a huge price (see
consequences of ignoring data explosion). Ironically, in this way, a vendors biggest
weakness (data explosion) becomes their biggest selling point.
The consequence of ignoring data explosion:
Massive databases that are literally hundreds and even thousands of times larger
than is necessary
Expensive hardware is required to process and accommodate exploded data
Load and or calculation times that take hours rather than seconds or minutes
Large costs to build and maintain these monolithic models
The hidden cost of failing to provide timely and relevant enterprise business
intelligence - there is a great cost associated with the inability to make fast
business decisions and the negative culture that prevails because of poor
underlying analytical systems
Real or defacto project failure
© 2003 SPF Pty Ltd. All Rights Reserved.
Page 2/8
Sparsity (antonym Density):
Sparsity and sparsity handling are important concepts worth understanding as a
precursor to understanding data explosion.
Input data or base data (i.e. before calculated hierarchies or levels) in OLAP
applications is typically sparse (not densely populated). Also, as the number of
dimensions increase, data will typically become sparser (less dense).
For example, in a 1 dimensional matrix, you can suppress all zero values and
therefore have a 100% dense matrix. In a 2 dimensional matrix, you cannot suppress
zeros if there is a non-zero value in any element in the two dimensions (see figs 1 and
2).
YEAR
A
10
C
20
D
8
F
15
Fig 1: 100% dense
Q1
Q2
Q3
Q4
A
10
0
0
0
B
0
20
0
0
C
0
0
8
0
D
0
0
0
15
Fig 2: 25% dense (4 out of 16 data points populated)
Whilst it is not true in all cases, typically as the number of dimensions in a model
increases, so does the data sparsity. For example, if you are storing sales data by
product and by month, it is conceivable that you will sell each product each month
(100% dense). However, if you were storing sales data, by product, by customer, by
region, by month, clearly you would not sell each product to every customer in every
region every month.
By adding the dimension gender to this model, you would double the possible size
of the cube by storing the data by either of the two variables male or female, but the
size of the stored data would remain the same. In this case, by introducing another
simple dimension, the sparsity will have doubled!
To provide a practical baseline expectation for sparsity, we researched data sparsity
on a variety of models with a sample of 7 companies. Each company had a variety of
models (eg P&L, Balance Sheet, Cash Flow, Sales Analysis, HR/Labour Analysis,
Budgeting & Forecasting, Industry Specific models etc) with differing dimensions.
The Industry specific models included insurance claim analysis, telco call analysis
and revenue per user analysis and a medical device companys sales analysis. A
detailed summary of our research can be found in Appendix A.
Our summary findings were as follows:
1. Data density in all cases was significantly less than 1% - i.e. extremely sparse.
2. As the number of dimensions increases, so did the sparsity of the data (models
reviewed had between 5 and 16 dimensions).
3. Extreme sparsity existed in all the Industry Specific models (all models had
density of less than 1 billionth of a %).
© 2003 SPF Pty Ltd. All Rights Reserved.
Page 3/8
Sparsity Handling
Superficially, any multi-dimensional model needs to provide space for every possible
combination of data points. Since in sparse models most data points are zeros, the
main issue is how to store all values other than zero values. For example, if the data
density of a model is 1% and there is no sparsity handling, the resulting model will be
100 times larger than a model that has perfect sparsity handling. Sparsity handling
therefore is the efficient storage of very sparse data.
Dont confuse poor sparsity handling with data explosion a common myth:
It is important not to confuse poor sparsity handling (the inefficient storage of zero
values) with data explosion. Although sparsity handling is an issue for multi-
dimensional databases, it usually only accounts for differences of less than ten times
between products.
Whilst some might say that a difference of size of up to ten times is important, it is
nowhere near as important as the differences that arise as a result of data explosion.
As stated previously, these differences can be hundreds and even thousands of times
between good and bad databases. Also, whilst sparsity handling was more of a
problem a few years ago, most vendors now have a reasonable solution for this.
Importantly, this is another classic area of vendor deception. When challenged on the
topic of data explosion, some vendors divert the argument to sparsity handling.
Because both sparsity handling and data explosion are poorly understood, a vendors
handling of sparsity may be incorrectly accepted as an ability to adequately address
and avoid data explosion.
Data Explosion - what it is and what causes it the facts
Data explosion is the phenomenon that occurs in multidimensional models where the
derived or calculated values significantly exceed the base values. There are three main
factors that contribute to data explosion.
1. Sparsely populated base data increases the likelihood of data explosion
2. Many dimensions in a model increase the likelihood of data explosion
3. A high number of calculated levels in each dimension increase the likelihood of
data explosion
© 2003 SPF Pty Ltd. All Rights Reserved.
Page 4/8
Lets extend our previous example to explain this. In fig 3, the 100% dense 1
dimensional model with 2 levels, has base data that exceeds calculated data in a ratio
of 4:1. There is no data explosion here. In fig 4, the 25% dense 2 dimensional model
with 3 levels in 1 dimension, has base data of 4 values that explodes to 15
calculated values.
YEAR
A
10
C
20
D
8
F
15
Total
53
Fig 3: No data explosion
YEAR
Q1
Q2
Q3
Q4
A
10
10
0
0
0
B
20
0
20
0
0
C
8
0
0
8
0
D
15
0
0
0
15
A+B
30
10
20
0
0
C+D
23
0
0
8
15
A+B+C+D
53
10
20
8
15
Fig 4: 3.75 times data explosion
By increasing sparsity, and/or adding dimensions, and/or adding calculated levels in
dimensions in the above example, the data explosion rate will increase.
A frame of reference for data explosion in practice:
In practice between 5 and 12 dimensions are very common. Highly sparse models are
also typical. Density factors of 1% or less are very common and should be assumed
unless proven otherwise. A typical product dimension has between 4 and 9 levels and
an account dimension has between 8 and 16 levels. The other dimensions in typical
models often have between 2 and 6 levels.
To provide a practical frame of reference, we researched data explosion in a variety of
models from our sample of seven companies. A detailed summary of our research can
be found in Appendix A. Important findings were as follows:
1. Data density in all cases was significantly less than 1% with extremes in all the
Industry Specific models.
2. The average number of levels in each model was between 3 and 6.
3. The highest number of levels for any