APIC: An Efficient Algorithm for Computing Iceberg Datacubes

Rosine Cicchetti, Noël Novelli, Lotfi Lakhal
LIM, CNRS FRE-2246 - Université de la Méditerranée, Case 901
163 Avenue de Luminy, F-13288 Marseille Cedex 9, France
E-mail: [email protected]

Abstract

OLAP databases are increasingly used and require handling multidimensional data within seconds. The cube operator was introduced to precompute aggregates in order to improve the response time of aggregation queries. Data collected in data warehouses is frequently sparse, and datacubes, which are costly to compute and especially voluminous when compared to the input size, can encompass many aggregated results that are not significant for decision makers. In order to avoid this drawback, the concept of iceberg datacube (answering iceberg queries) was recently introduced along with the algorithm BUC. Iceberg datacubes group aggregates satisfying a selection condition (i.e. an SQL HAVING clause). In this paper, we propose an approach for computing a condensed representation of either full or iceberg datacubes. We introduce a novel and sound characterization of datacubes based on dimensional-measurable partitions. Such partitions have an attractive advantage: they avoid sorting techniques, which are replaced by a linear product of dimensional-measurable partitions. Moreover, our datacube characterization provides a logical condensed representation, interesting when considering the storage explosion problem. We show that our approach leads to an operational solution more efficient than previous proposals: the algorithm APIC. It enforces a lectic-wise traverse of the dimensional lattice and takes into account the critical problem of memory limitation. Our analytical and experimental performance study shows that APIC and BUC are promising candidates for scalable computation, and that APIC is the more efficient of the two.

Keywords: Algorithms, Data Warehouse, Iceberg Datacubes, Lattices, Mining Queries, Partitions

1 Motivation

The cube operator was introduced [18] for efficiently supporting the multiple aggregations widely used in OLAP databases [1, 8, 12, 37, 22]. It provides multiple points of view on metrics of interest, relevant for decision makers. In other words, measured values are aggregated, at various levels of granularity, across dimensions which stand for analysis criteria. Given a relation r over a schema R, a set D of dimensions D = {D1, ..., Dn}, D ⊆ R, a measure M, and an aggregative function f, the cube operator is expressed as follows:

SELECT D1, ..., Dn, f(M)
FROM r
CUBE BY D1, ..., Dn

Such a query achieves all the group-bys according to any attribute combination belonging to the power set of D.
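As an illustration (ours, not from the paper), the group-by combinations generated by such a query are exactly the elements of the power set of D; a minimal Python sketch, with the hypothetical helper name cube_groupbys:

```python
from itertools import combinations

def cube_groupbys(dims):
    """All attribute combinations aggregated by CUBE BY: the power set of the dimensions."""
    return [c for k in range(len(dims) + 1) for c in combinations(dims, k)]

print(cube_groupbys(('Product', 'Store', 'Year')))
# 2^3 = 8 combinations, from () (the fully aggregated cuboid) to ('Product', 'Store', 'Year')
```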

It results in what is called a datacube, and each sub-query performing a single group-by yields a cuboid [22]. Computing datacubes is exponential in the number of dimensions (the lattice of the dimension set must be explored), and the problem worsens when very large data sets are to be aggregated. Various approaches addressing datacube computation make use of sorting or hash-based techniques [16], and study optimization strategies allowing results obtained earlier during execution to be reused [1, 37, 36]. More precisely, they convert the dimensional lattice into a processing tree providing the order according to which cuboids are to be computed.

Recently, a variant of the problem, called iceberg datacube computation, was introduced with BUC [4]. In order to meet similar objectives, [37] proposes "multi-feature cubes". When computing such cubes, aggregates not satisfying a selection condition specified by the user (similar to the HAVING clause in SQL) are discarded. The motivations behind computing iceberg cubes are the following. Firstly, iceberg cubes provide users only with relevant results, so decision makers can focus on significant data. This advantage is especially interesting when managing sparse data, because scarce or exceptional combinations of dimensional values are discarded. Moreover, computation performance can be improved since the lattice to be explored can be pruned (using the selection condition) [4]. Precomputed iceberg cubes also offer an efficient solution for answering iceberg queries, which restrict the result of a cuboid to the groups satisfying a given condition [14]. Another important issue addressed when computing iceberg cubes concerns the discovery of multidimensional patterns (multidimensional association rules [22], or roll-up dependencies [43]) which generalize frequent patterns by taking various attributes into account.

In this paper, we propose an approach for achieving full or iceberg datacubes. The originality of our proposal is that it computes a condensed representation of datacubes, with various advantages detailed further (cf. Section 6.1). The notion of condensed representation was introduced in [29] for frequent itemsets and suggested for datacubes without further investigation. A condensed representation of a datacube is a representation smaller than the datacube itself, and from which OLAP queries yield the same results. The main contributions of our approach are the following. Firstly, we introduce a novel and sound characterization of datacubes based on the concept of dimensional-measurable

partitions, inspired from the partition model [10, 41]. From a dimensional-measurable partition according to a set of dimensions, computing the associated cuboid is simple. This new concept is attractive because dealing with dimensional-measurable partitions means performing set intersections linearly (and thus the use of sorting or hash-based methods is avoided). Secondly, our characterization provides a condensed representation of datacubes in order to minimize disk storage. Various approaches have addressed the problem of storing datacubes, which is wasteful in storage and can be harmful to query performance [25, 45]. They enforce efficient physical techniques, whereas we propose a logical condensed representation which can be combined with the former techniques. The third contribution is a new principle, dictated by the critical problem of main memory limitation, for navigating through the dimensional lattice. It is called the lectic-wise traverse of the subset lattice, offers a sound basis for our computing strategy, and applies to the computation of full or iceberg datacubes. Finally, the described ideas are enforced through a new operational algorithm, called APIC, which has been experimented with on various benchmarks. Through an in-depth discussion giving both analytical and experimental results, we compare APIC with the algorithm BUC which is, among previous proposals, the only one computing iceberg cubes and the most efficient for yielding full datacubes. Our experiments show that APIC is more efficient than BUC in all cases.

The rest of this paper is organized as follows. In section 2, we briefly describe the problem statement. Then we introduce the novel concepts of our approach and characterize datacubes (section 3). Section 4 is devoted to our algorithmic solution: the lectic-wise traverse of the dimensional lattice is given and the algorithm APIC is detailed. Related work addressing datacube computation is summarized in section 5. Then we give an analytical comparison between APIC and BUC by examining four key factors; it is complemented by experimental and comparative results (cf. Section 6). Finally, we discuss the strengths of our approach and evoke further research work.

2 Problem survey

In this section, we aim to highlight four critical concerns when computing datacubes: execution time, main memory limitation, disk storage explosion, and the pruning method. The two former factors, intrinsically related, are examined together. They are, like the third factor, especially critical when addressing full datacube computation, but remain key parameters for achieving iceberg cubes. The last factor stems from a new vision of the problem and is relevant only for iceberg cubes.

Algorithm cost. For computing full datacubes, the number of cuboids (or group-bys) to be dealt with is exponential in the number of dimensions. Moreover, datacubes typically summarize huge amounts of collected data, thus the computation of a single cuboid can be costly and require external processing (e.g. sorting) because the data does not fit in main memory [36, 4]. In order to minimize the overall cost of computation, an attractive idea is to overlap computations as much as possible. For instance, the least aggregated cuboid (according to the whole set of dimensions) is achieved first and, if a distributive function (e.g. count, sum) is used for aggregation, this cuboid can be used to compute more aggregated cuboids, themselves being used for achieving new results at a coarser granularity level, and so on [22]. Following such a strategy, the dimensional lattice is traversed level by level in a top-down way, and the yielded cuboids are more and more aggregated until the associated dimensional set is empty (the corresponding query does not involve any group-by). Although such a strategy has been successfully applied [1, 36], it cannot, by its very nature, take benefit from the threshold condition given when computing iceberg datacubes.

Disk storage explosion. Datacubes are considerably larger than the input relation ([36] exemplifies the problem with a full datacube encompassing more than 210 million tuples computed from an input relation of 1 million tuples). The problem has a twofold origin: on one hand the exponential number of dimensional combinations to be dealt with, and on the other hand the cardinality of dimensions. The larger the dimension domains, the more aggregated results there are (according to each real value combination). Unfortunately, it is widely recognized that in OLAP bases data can be very sparse [36, 14, 4], thus scarce value combinations are likely to be numerous and, when computing full datacubes, each exception must be preserved. In such a context, storage explosion is unavoidable [12] and causes a twofold drawback: (i) physical organization techniques are strongly required and (ii) OLAP query performance is likely to be degraded [25]. Since the efficiency of such queries is the main motivation of all approaches aiming to compute datacubes, we believe, like [4], that the issue must be addressed from another point of view.

An alternative vision of the problem. Faced with the problem of computing and storing a huge amount of aggregated results, an interesting question is: which aggregates are really significant and informative for the decision maker? Two kinds of answers have been recently proposed: [14] investigates iceberg queries yielding an iceberg cuboid by selecting aggregates having a measure beyond a certain threshold, and [37, 4] generalize the solution by computing iceberg datacubes. The latter proposal contrasts with all previous cube construction algorithms in the sense that more aggregated cuboids are computed before less aggregated ones. Thus a bottom-up traverse of the dimensional lattice is achieved. It is therefore impossible to overlap parts of the computation, but the counterpart is that pruning rules can be applied in order to minimize the dimensional lattice to be explored: as soon as a dimensional value combination is proved to be scarce, it can be discarded, and coarser aggregates originating from that combination are not even considered. When traversing the dimensional lattice, such a strategy applies for achieving full but also iceberg datacubes. Moreover, the approach enforcing it [4] is proved to be more efficient than previous proposals when exploring the whole dimensional lattice. Taking into account the critical factors of the problem argues for an approach computing iceberg datacubes efficiently.

3 Condensed representation of datacubes

In this section, we introduce a novel characterization of datacubes which is based on simple concepts. It offers a condensed representation which can be seen as a logical proposal, complementing physical contributions [25], for minimizing the storage explosion. With this characterization, we are provided with a simple and formal framework for computing full or iceberg datacubes. First of all, we assume that the relation r to be aggregated is defined over a schema R encompassing, apart from the tuple identifier, two kinds of attributes: (i) a set D of n dimensions which are the criteria for further analysis (also called categorical attributes in statistical databases [40, 9]), and (ii) a measurable attribute M standing for the measure being analyzed¹. In the rest of the paper, such a relation is called a Dimensional-Measurable relation, or DM-relation for briefness. We introduce in the following subsection the new concept of Dimensional-Measurable partition, which is used as a novel representation of DM-relations and from which a cuboid can be easily achieved.

3.1 Dimensional-Measurable partitions

Inspired from the concept of partition defined in [10, 41] (and used in [23, 28, 31] for mining functional dependencies), we introduce the concept of Dimensional-Measurable partition according to a set of dimensions X ⊆ D. Such a definition requires extending the concept of tuple equivalence classes. In our approach, the Dimensional-Measurable equivalence class of a tuple t is a set of couples, each of which encompasses (i) the identifier of a tuple t' sharing with t the very same value for X, and (ii) the value of M for t'.

Definition 1 (Dimensional-Measurable Classes) Let r be a DM-relation over the schema R = {Id, D1, ..., Dn, M}, and X ⊆ D. The Dimensional-Measurable equivalence class of a tuple t ∈ r according to X is denoted by [t]_X^M and defined as follows: [t]_X^M = { (t'[Id], t'[M]) | t' ∈ r, t'[X] = t[X] }.

Example 1 Let us consider the classical example of studying the sales of a company according to various criteria such as the sold product (Product), the store (Store) and the year (Year). The measure being studied according to the previous criteria is the total amount of sales (Total). An instance of our relation example is illustrated in figure 1.

¹ All definitions given in this section can be easily extended in order to consider a set of measures, like in [37, 7].

IdRow  Product  Store  Year  Total
1      100      a      1999    70
2      100      a      2000    85
3      100      b      1999   105
4      100      b      2000   120
5      100      c      1999    55
6      100      c      2000    60
7      103      a      1999    36
8      103      a      2000    37
9      103      b      1999    55
10     103      b      2000    60
11     103      c      1999    28
12     103      c      2000    30

Figure 1: The relation example Sales

The DM-equivalence class of the tuple t1 (i.e. the tuple having IdRow = 1) according to the dimension Product groups all the tuples (their identifier and measurable value) concerning the product 100: [t1]_Product^Total = { (1, 70) (2, 85) (3, 105) (4, 120) (5, 55) (6, 60) }.

A Dimensional-Measurable partition (or DM-partition) of a relation r, according to a set of dimensions, is the collection of DM-equivalence classes obtained for the tuples of r.

Definition 2 (Dimensional-Measurable Partition) Let r be a DM-relation over the schema R = {Id, D1, ..., Dn, M}, and X a set of dimensions, X ⊆ D. The Dimensional-Measurable partition of r according to X is denoted by Π_X^M(r) and defined as follows: Π_X^M(r) = { [t]_X^M | t ∈ r }.

Remark: in the particular case where X = ∅, the partition of r is r itself, and all its tuples belong to a single equivalence class.

Example 2 Let us resume our relation example given in figure 1. The DM-partitions according to the attributes Product and Store respectively are given below (the DM-equivalence classes are delimited by < >).
Π_Product^Total(Sales) = { < (1, 70) (2, 85) (3, 105) (4, 120) (5, 55) (6, 60) >, < (7, 36) (8, 37) (9, 55) (10, 60) (11, 28) (12, 30) > } and
Π_Store^Total(Sales) = { < (1, 70) (2, 85) (7, 36) (8, 37) >, < (3, 105) (4, 120) (9, 55) (10, 60) >, < (5, 55) (6, 60) (11, 28) (12, 30) > }.
Since the dimension Product has two distinct values in our relation instance, the associated DM-partition groups two equivalence classes, while Π_Store^Total(Sales) encompasses three classes.

Let us underline that our implementation of DM-partitions only preserves tuple identifiers, which are used for indexing the measurable values (stored only once in an additional storage structure, cf. Section 6.1). In order to efficiently handle DM-partitions, we introduce the concept of DM-partition product, inspired from the partition product [10, 41].

Lemma 1 (DM-Partition Product) Let r be a DM-relation and Π_X^M(r), Π_Y^M(r) two DM-partitions of r computed according to X and Y respectively. The product of the two partitions, denoted by Π_X^M(r) • Π_Y^M(r), is the DM-partition of r according to X ∪ Y: it is obtained by intersecting, over tuple identifiers, the DM-equivalence classes of the two operand partitions, i.e. Π_X^M(r) • Π_Y^M(r) = Π_{X ∪ Y}^M(r).
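To make these notions concrete, the following is a small illustrative sketch (in Python, not the authors' C++ implementation; dm_partition, dm_product and aggregate are our own names). It rebuilds the DM-partitions of example 2 from the Sales relation of figure 1, computes their product by intersecting classes on tuple identifiers in linear time, and derives the corresponding cuboid by aggregating each class.

```python
from collections import defaultdict

# The Sales relation of figure 1: (IdRow, Product, Store, Year, Total).
SALES = [
    (1, 100, 'a', 1999, 70),  (2, 100, 'a', 2000, 85),
    (3, 100, 'b', 1999, 105), (4, 100, 'b', 2000, 120),
    (5, 100, 'c', 1999, 55),  (6, 100, 'c', 2000, 60),
    (7, 103, 'a', 1999, 36),  (8, 103, 'a', 2000, 37),
    (9, 103, 'b', 1999, 55),  (10, 103, 'b', 2000, 60),
    (11, 103, 'c', 1999, 28), (12, 103, 'c', 2000, 30),
]
COLS = {'Product': 1, 'Store': 2, 'Year': 3}

def dm_partition(relation, dims):
    """DM-partition: one class of (tuple identifier, measure) couples per value of dims."""
    classes = defaultdict(list)
    for row in relation:
        classes[tuple(row[COLS[d]] for d in dims)].append((row[0], row[-1]))
    return list(classes.values())

def dm_product(p1, p2):
    """Product of two DM-partitions: intersect classes on tuple identifiers (linear time)."""
    owner = {tid: j for j, cls in enumerate(p2) for tid, _ in cls}
    result = defaultdict(list)
    for i, cls in enumerate(p1):
        for tid, measure in cls:
            result[(i, owner[tid])].append((tid, measure))
    return list(result.values())

def aggregate(partition, f=sum):
    """One aggregated value per DM-equivalence class (the tuples of the cuboid)."""
    return [f(measure for _, measure in cls) for cls in partition]

product_p = dm_partition(SALES, ['Product'])   # 2 classes, as in example 2
store_p = dm_partition(SALES, ['Store'])       # 3 classes
both = dm_product(product_p, store_p)          # 6 classes = partition on {Product, Store}
print(aggregate(product_p))                    # [495, 246]
print(aggregate(both))                         # [155, 225, 115, 73, 115, 58]
```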

With lemma 3, we are provided with a sound guideline for selecting, among DM-partitions, the ones which offer the best compromise between execution cost and memory requirement. Cases 1 and 2 of lemma 3 guarantee that the expected results can be efficiently yielded when provided with the original DM-partitions and the partition just being dealt with. Thus let us assume that the quoted DM-partitions are in-memory structures. In the former case, the result is straightforward since the associated DM-partition is available; in the latter case, it requires a single product of two in-memory DM-partitions. For dealing with the third case of lemma 3, the least often encountered, we do not need additional input data since any cuboid can be computed from the original DM-partitions. Even if the number of DM-partition products is not optimal in that case, the size of in-memory data is minimized. Definition 8 and lemma 4 state the conditions under which an iceberg cube computation, following from our navigation principles, can be carried out, and how it can be efficiently performed by enforcing pruning rules.

4.2 The algorithm APIC

We propose an algorithm, called APIC, for computing full and iceberg datacubes. It fits in the theoretical framework

previously presented. A pre-processing step is required in order to build the DM-partitions according to each single attribute of the input relation. According to the user's needs, these DM-partitions can be full or iceberg partitions. While performing this initial step, the cuboid according to the empty set is computed and its result is output. The pseudo-code of our algorithm is given below; the monotonic aggregative function f is used when performing the DM-partition product (⊗).

Algorithm APIC
input: the original DM-partitions, the aggregative function f
output: condensed (iceberg) datacube

   X := ∅
   while some dimensional combination remains to be dealt with do
      Xprev := X
      X := SkipNotEmpty(Next_lec(X))
      if |X| = 1
      then AggregateGen(Π_X)                          -- case 1: original DM-partition available
      else if X ⊃ Xprev
      then Π_X := Π_Xprev ⊗ Π_A, where {A} = X \ Xprev  -- case 2: a single product
           AggregateGen(Π_X)
      else Π_X := Π_A1, where X = {A1, ..., Ak}          -- case 3: rebuild from the
           for all Ai ∈ X, i > 1 do                      -- original DM-partitions
              Π_X := Π_X ⊗ Π_Ai
           endfor
           if Π_X ≠ ∅
           then AggregateGen(Π_X)
           endif
      endif
   endwhile
end APIC

For describing the principles of our algorithm, we first assume that the original DM-partitions fit in main memory (our experiments show that, for relations encompassing 1 million tuples and 10 dimensions, this assumption is satisfied with 42 MB of memory). We explain below what to do when this constraint is not satisfied. Thus let us suppose that the original DM-partitions are preserved all along the computation process. The lattice traverse starts from ∅ and proceeds according to the principles given in section 4.1. As stated in lemma 3, three cases must be considered:

1. if the combination yielded by Next_lec is a single attribute, then the associated original DM-partition is available and measured values can be directly aggregated for each equivalence class (possibly respecting the threshold condition);

2. if Next_lec returns a superset of the combination just dealt with, a single partition product is required. It operates on Π_X(r) and the original DM-partition Π_A(r) according to the additional attribute A ∈ Next_lec(X). In our implementation, both partitions are available. If necessary, the aggregation condition is verified, and equivalence classes are possibly discarded. An aggregative operation is then performed on the remaining classes and results are output;

3. in the other cases, the DM-partition associated with the next combination must be computed from the original DM-partitions, and requires (|Next_lec(X)| − 1) DM-partition products (| | stands for the cardinality).

The algorithm completes when the DM-partition of the last combination in the lectic order is processed. These general principles are enforced through the algorithm APIC. The procedure SkipNotEmpty does not strictly follow from definition 8, because it is recursive: it can perform repeated skips until the reached dimensional combination satisfies the given condition (i.e. the associated DM-partition is not empty). The procedure uses a structure in which the minimal combinations of dimensions having empty DM-partitions are preserved. From a DM-partition, the procedure AggregateGen computes an aggregated value for each equivalence class in the partition, and outputs the aggregates according to our condensed representation.

Let us now consider that the original partitions (Π_D1(r), ..., Π_Dn(r)) cannot fit in main memory; then the fragmentation strategy proposed in [36] and used in [4] is applied (cf. section 5). It divides the input relation into fragments according to an attribute. The process can be repeated until each fragment can be loaded. We adopt that technique, which is especially efficient for dealing with I/O.
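For illustration only, the sketch below (ours) enumerates dimensional combinations in a lectic order, assuming the standard lectic order on subsets [15] with the convention that the first dimension plays the role of the least significant bit of a binary counter; the SkipNotEmpty mechanism is omitted.

```python
def lectic_subsets(dims):
    """Enumerate the subsets of dims in lectic order (first dimension = least significant bit)."""
    n = len(dims)
    for code in range(2 ** n):
        yield tuple(d for i, d in enumerate(dims) if code & (1 << i))

print([''.join(c) for c in lectic_subsets('ABCD')])
# ['', 'A', 'B', 'AB', 'C', 'AC', 'BC', 'ABC', 'D', 'AD', 'BD', 'ABD',
#  'CD', 'ACD', 'BCD', 'ABCD']
```

Under this assumed convention, every combination in the sequence is either a single attribute (case 1), a superset of the combination just dealt with (case 2), or must be rebuilt from the original DM-partitions (case 3); with four dimensions, BCD is the only combination of the third kind requiring more than one partition product, which is consistent with the remark made in section 6.1.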

5 Related work

Managing OLAP databases is very close to dealing with statistical databases [40]. Most approaches addressing one of the two issues are concerned with similar problems, such as logical representations of aggregated data computed from huge amounts of collected data [17, 21, 6], suitable manipulation operators [35, 9, 33], or efficient physical organization techniques [25]. However, certain problems were introduced only recently by OLAP database management, such as selecting materialized views [20], answering OLAP queries efficiently [12, 38], or computing datacubes. In this section, we only focus on research work addressing datacube computation. We place much more emphasis on the algorithm BUC, devoted to iceberg cube computation, because it is the only approach addressing this issue and, to our knowledge, the most efficient solution for achieving full cubes. Most of the related approaches are intended to compute full datacubes from very large data sets [1, 11, 36]. Their major objective is to study optimization strategies which can turn into operational algorithms, and their aim is to derive, from the dimensional lattice, a processing tree covering the search lattice and used to order and drive all the

cuboid computations. In [1], the PipeSort algorithm enforces repeated sorts of tuples according to dimensional attributes and aims to optimize the overall cost of cube computation. It takes benefit of the commonality of dimensional combinations: cuboids are computed from their smallest parent in the dimension set lattice. For instance, when sorting is performed according to ABCD, sorting according to ABC (or AB) is straightforward. However, when the input relation is large, many cuboids require external sorts and the amount of I/O can be considerable [36]: it is exponential in the number of dimensions. PipeHash [1] also computes group-bys from their smallest parent. It makes use of hash tables. While PipeSort achieves multiple group-bys with a single sort, PipeHash must re-hash the data for computing every cuboid [4], while spending the significant memory space required for storing hash tables. The algorithm Overlap [1, 11] aims to minimize the number of I/Os by overlapping the cuboid computations. It is based on a partial matching sort order. Once the least aggregated cuboid is achieved, it is sorted according to a particular attribute order. Attribute combinations in the dimensional lattice are ordered so as to be subsequences of the initial order. Then the lattice is converted into a tree: a node is linked to the parent sharing with it the longest prefix, and labeled with the estimated cost for computing it. Overlap traverses the search space breadth-first and dynamically evaluates the set of cuboids which can be computed together. When sparse and large relations are dealt with, the I/O cost is at least quadratic in the number of dimensions [36]. An array-based approach is proposed for MOLAP systems [44]. It is close to Overlap, but data partitions are stored in in-memory arrays and each cell can be directly accessed. Thus it avoids sorting and tuple comparisons. Its input is an array file structure, possibly chunked into smaller sub-arrays which can be loaded in main memory. Compression techniques are also used. Nevertheless, when data is too sparse, the in-memory arrays are too voluminous to be loaded and the algorithm becomes impracticable. The approach presented in [36] divides large relations into fragments fitting in memory (algorithm Partitioned-Cube). Then a datacube is computed from each fragment by using the algorithm Memory-Cube. Fragments are obtained by partitioning the input relation according to an attribute, and correspond to tuple equivalence classes in that partition. The process is recursively applied until tuple equivalence classes can be managed by Memory-Cube. In order to minimize the number of required sorts, the approach computes the desired set of paths traversing the search lattice. Given a prefix-ordered path, Memory-Cube sorts an in-memory fragment according to the attribute order of the root node of the path. Then a single data scan returns all cuboids, at whatever granularity level; they are output immediately. The I/O cost of the algorithm is proportional to the number of dimensions. In contrast with the approaches previously presented, the algorithm BUC is intended for computing iceberg datacubes, even if it can also efficiently achieve full datacubes [4]. It takes

benefit of the I/O efficiency of the algorithm Partitioned-Cube but provides a different processing of in-memory fragments and can apply minimum-support pruning like Apriori [2]. The fragments dealt with are tuple equivalence classes. We can say that the relation is partitioned in a "horizontal" way because all the values of the tuples are preserved, whereas our approach makes use of a "vertical" partitioning of tuples (preserving only identifiers according to dimensions). After an overall aggregation over the whole relation, BUC partitions it according to the first dimension (say A). Each tuple equivalence class is then recursively dealt with in the following way. Aggregation is performed for the whole class (yielding a tuple of the cuboid according to the considered dimension A). Then BUC partitions the class according to the second dimension (B), considers the first equivalence class according to (A, B), performs aggregation (the result is a tuple of the cuboid according to (A, B)), and so on. As soon as a class is proved to contain a single tuple, the recursive process (aggregating, partitioning) stops since further results for the class are straightforward. Following this strategy, BUC computes, at each recursive step, a tuple of a cuboid at a variable granularity level. When the last class of the tuple partition according to A has been dealt with, all the cuboids according to a dimensional combination prefixed by A have been achieved, from the coarsest granularity to the finest. By enforcing such a strategy, BUC differs from all previous approaches, and can apply pruning rules in a relevant way for computing iceberg datacubes. Actually, once a tuple class, at any level of granularity, does not satisfy a given condition, finer granularity levels are not even examined since the condition can no longer hold.
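For concreteness, the bottom-up strategy described above can be sketched as follows (a simplified illustration, not the authors' optimized implementation; buc, rows, measures and min_count are our own names, and count/sum is used as the aggregation):

```python
from collections import defaultdict

def buc(rows, measures, dims, min_count, prefix=(), out=None):
    """Simplified BUC-style recursion: aggregate the current class, then partition it
    on each remaining dimension, pruning classes below the minimum support."""
    if out is None:
        out = []
    if len(rows) < min_count:          # iceberg pruning: no finer aggregate can qualify
        return out
    out.append((prefix, len(rows), sum(measures)))
    for pos, d in enumerate(dims):
        classes = defaultdict(list)     # partition (the "sort" step) on dimension d
        for row, m in zip(rows, measures):
            classes[row[d]].append((row, m))
        for value, members in classes.items():
            buc([r for r, _ in members], [m for _, m in members],
                dims[pos + 1:], min_count, prefix + ((d, value),), out)
    return out

# e.g. buc(tuples, totals, dims=(0, 1, 2), min_count=2) yields one (prefix, count, sum)
# triple for every dimensional value combination supported by at least two tuples.
```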

6 Comparative analysis

In this section, our aim is to provide a comparison between our approach and related work. Of course, we place much more emphasis on the only algorithm designed for achieving iceberg datacubes, BUC [4].

Earlier contributions are intended for computing full datacubes [1, 11, 36]. Their driving principle is to overlap cuboid computations as much as possible in order to minimize the overall cost (cf. section 5). Such a principle dictates the chosen computation strategy: less aggregated cuboids are computed first (beginning with the least aggregated one) and are used for achieving more and more aggregated cuboids. This means that, even if adapted to iceberg cube computation, these approaches cannot take any benefit from a threshold condition because, by their very nature, the search space cannot be pruned. The quoted approaches enforce sorting or hash-based techniques. The former provide a good compromise between execution cost and main memory requirement [1, 11, 36]; the latter attempt to optimize the computation cost [1], but the price is an increasing need for main memory. Thus, as soon as very large and sparse data sets must be handled, these approaches become impracticable because the main memory requirement increases unreasonably (and consequently execution time is strongly degraded).

A similar attempt, devoted to MOLAP databases, is investigated in [45] with an array-based technique, but with similar drawbacks: the execution cost is especially attractive (direct accesses to cells are performed, and no sort is needed) until the critical parameter of memory limitation is taken into account [36, 4]. When intending to compute iceberg datacubes, a natural idea is to extend the algorithm Apriori [2] because it deals with longer and longer combinations (in our context, with less and less aggregated cuboids) and encompasses a threshold-based pruning step. Such an idea has been investigated in [4]. It seems especially attractive for a twofold reason: tuple sorting is not required (because candidate verification is based on value set comparison) and pruning is optimal⁴. Unfortunately, an Apriori-like approach (and in fact any level-wise algorithm) cannot apply in practice because of main memory limitation. Actually, in our research context, generating all possible candidates is unthinkable because, by their very nature, Apriori candidates are value sets and dimensions in OLAP databases can be very sparse. Moreover, as previously mentioned, preserving the two largest levels of the dimensional lattice requires managing C(n, ⌊n/2⌋) + C(n, ⌊n/2⌋+1) dimensional combinations (where n = |D| is the number of dimensions), and main memory capacity is likely to be exceeded (cf. Section 4). This is why it is claimed in [4] that an extended version of Apriori "needs too much memory and performs terribly". The algorithm BUC was recently proposed. It is the only algorithm computing iceberg cubes, and the most efficient solution (to our knowledge) for achieving full datacubes [4]. This is why we focus on BUC for providing a comparison with our approach. This comparison has a twofold aspect: an in-depth analysis of the two algorithms is first given from a theoretical viewpoint, then experimental results are detailed.

⁴ As soon as a candidate is proved to be unfrequent, it cannot be extended for generating new candidates.

6.1 Analytical comparison

In order to compare BUC and APIC, we analyse the following four key parameters: execution time, main memory requirement, disk space required for storing results, and the enforced pruning method. We examine the three former parameters in the worst case, i.e. when computing full datacubes, and consider iceberg cubes only when studying pruning strategies.

Execution cost. BUC spends most of its execution time sorting tuples in a recursive way [4], and sorting operations can be seen as its elementary processing units when achieving datacubes. BUC enforces efficient sorting techniques, but the number of sorts can be very high.

Lemma 5 For an input relation (or fragment) loaded in main memory, the number of tuple sorts performed by BUC is bounded by:

n + Σ_{X ∈ {D1, D1D2, ..., D1D2···D(n−1)}} ||X|| × (n − |X|)

where ||X|| stands for the number of distinct values for the dimensional combination X. The number of sorts performed by BUC depends on the dimensional domain cardinalities (as stated by the previous lemma), but it also depends on the dimensional value distribution for the various combinations. For instance, if a dimension is very sparse, the overall number of tuple sorts can be very high. However, each sort applies to a smaller and smaller number of tuples. Furthermore, the chance of finding all possible values of such an attribute within a single tuple equivalence class is likely to decrease strongly. Despite these latter remarks, the cost of BUC remains significant. On the other hand, for performing aggregate formation, APIC deals with DM-partitions and computes their intersection (its elementary processing unit) linearly in the number of elements: the product of two partitions encompassing N elements has a complexity in O(N). In addition, when designing APIC, we paid much attention to the number of required partition products. In most cases, a single DM-partition product is required, or none at all (an in-memory DM-partition is already available and offers the expected results). In some exceptional cases, a cuboid according to the dimensional combination X can only be achieved after (|X| − 1) DM-partition products (|X| is the number of attributes in X). Let us underline, on one hand, that the latter number of DM-partition products does not depend on dimension domain cardinalities but only on the number of dimensions and, on the other hand, let us exemplify the number of worst cases which can be encountered. In our running example, and assuming that a full datacube is to be computed, the worst case is encountered only once, i.e. for computing the cuboid according to BCD (cf. Figure 4). When achieving iceberg datacubes, the number of worst cases is likely to increase, but the predominant factor, the number of tuples, is significantly decreased.

Memory requirement. As previously mentioned, main memory limitation is a critical concern when dealing with OLAP databases. The two compared approaches take this key parameter into account and yield similar results for computing full datacubes (BUC is disadvantaged in the case of iceberg datacube computation, which is examined further, cf. Section 6.1). BUC requires a significant but reasonable main memory capability, which is measured as follows [4]:

4 N (n + 3) + 4 Σ_{i=1..n} ||Di|| + 4 max_{i=1..n} ||Di||

where Di ∈ D and ||Di|| is its number of distinct values. For each tuple, n dimensional values and a measure are preserved, along with a counter and a pointer, each of which requires 4 bytes. Thus loading the input relation or fragment requires 4 N (n + 3) bytes (where N is the number of handled tuples). BUC needs additional counters for the aggregated values (as many as the number of distinct values for each dimensional attribute) and counting variables used when sorting (bounded by the maximal cardinality of the dimensional attributes). APIC preserves in main memory the n original DM-partitions, the last partition computed, the one being currently processed, and the set of measure values. Thus for an input relation encompassing N tuples, the main memory requirement is given by:

4 N (n + 3)

This result is similar to the BUC memory requirement, which in addition depends on parameters related to dimension cardinalities. The latter parameters are not determinant, even if they are not negligible when several very sparse attributes have to be considered.

Datacube storage space and I/O cost. When computing full datacubes, 2^n dimensional combinations have to be examined, each of which originates a cuboid. For each cuboid, the number of output tuples depends on the domain cardinality of the considered combination [36], which is denoted by ||X||. All approaches deal not with the original data but with coded data (for obvious optimization reasons): each value of a dimensional attribute Di is replaced by an integer in the range [1 .. ||Di||] during a pre-processing step [36, 4]. Under this assumption, the storage space required for preserving a cuboid according to X is:

4 (n + 1) ||X||

The overall space for storing a full datacube is thus bounded by:

4 (n + 1) Σ_{X ⊆ D} ||X||

In contrast, APIC generates a condensed representation of datacubes. For each cuboid according to a dimensional combination X, ||X|| tuples are to be computed, but each one only requires storing three values (an identifier, the associated aggregated value, and the corresponding dimensional combination), each needing 4 bytes. Thus the storage requirement for a cuboid is:

12 ||X||

When compared to the classical representation (used by BUC and all previous proposals), the latter result is really significant because, as soon as the number of dimensions is higher than 2, our condensed representation is more compact. Of course, this advantage increases as the set of dimensions is enlarged, because the APIC storage requirement for any cuboid is independent of the number of dimensions. The latter remark explains the results reported in figure 5. They give, according to the number of considered dimensions, the percentage of space occupied by our condensed representation when compared to a classical datacube storage (adopted by all previous approaches). Such a ratio is independent

of the dimensional domain cardinality (because the APIC and BUC storage requirements are related to this factor in the very same way). When the number of dimensions is equal to 10, our condensed representation requires 27.2% of the space needed by BUC to store a full datacube, and only 14.2% for 20 dimensions.

Figure 5: % of storage space for APIC vs. BUC (condensed vs. classical storage, plotted against the number of dimensions)
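The percentages reported in figure 5 follow directly from the two per-cuboid storage costs given above (12·||X|| bytes for the condensed representation against 4·(n+1)·||X|| bytes for the classical one, so ||X|| cancels out); a quick check of this assumption, using our own helper name:

```python
def condensed_vs_classical(n):
    """Storage ratio of the condensed representation to the classical one for n dimensions."""
    return 12 / (4 * (n + 1))

for n in (3, 10, 20, 25):
    print(n, f"{condensed_vs_classical(n):.1%}")
# 75.0%, 27.3%, 14.3%, 11.5% (truncated to 27.2% and 14.2% in the text above)
```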

According to our knowledge, all approaches addressing the problem of disk explosion when storing datacubes investigate physical techniques (such as data compression). Let us recall that this problem is especially critical when attempting to optimize OLAP queries. We show, through the results just presented, that a logical point of view can be especially relevant, because the storage space can be strongly decreased when compared to a classical representation. Moreover, nothing prevents combining logical and physical solutions to this particularly critical problem. Another advantage of our condensed representation is its good influence on the I/O cost. For example, for a relation with 10 attributes and 1,000,000 tuples, the condensed datacube size is around 6 GB while, with a classical representation, the datacube size is around 22 GB. For an I/O rate set to 5 MB/s, the I/O time is 1,230 seconds for APIC and 4,500 seconds for BUC.

Pruning method. We finish our analytical comparison by examining how pruning is performed by the two algorithms when computing iceberg cubes. Let us imagine that BUC is dealing with an equivalence class grouping tuples which share the very same value for the k first dimensions (k < n). The associated measures are aggregated. If the result does not respect the threshold condition, then the recursive process stops for this tuple class and starts again for the next class of tuples. Rather than pruning, we can consider that BUC performs skips during the computation of iceberg datacubes. Actually, no tuple can be discarded because it can be significant when dealing with another combination of dimensions. In contrast, APIC really performs pruning. Let us consider that our algorithm is achieving the iceberg DM-partition according to the k first dimensions. As soon as a DM-equivalence class does not respect the selection condition, it is removed. Thus, on one hand, no further computation is performed from the underlying DM-class (APIC performs a skip to another branch in the lectic tree) and, on the other hand, the main memory requirement decreases.

For the four critical parameters being studied, the APIC results are slightly or strongly better than those of BUC. In order to confirm this analytical comparison, it remains to provide experimental results.

6.2 Experimental comparison

In order to assess the performance of APIC, the algorithm was implemented in C++. An executable file can be generated with the Visual C++ 5.0 or GNU g++ compilers. Experiments were performed on a Pentium III/700 MHz with 2 GB of memory, running Linux. An executable version of BUC not being available from the authors, we developed a new version of this algorithm under the very same conditions as for the APIC implementation and with a similar programming style. For simplicity, our implementation enforces the QuickSort routine provided with the C++ compiler. The benchmark relations used for the experiments are synthetic data sets, automatically generated under the assumption that the data is uniformly distributed at random. For the experimental results of the two algorithms, we do not take the I/O cost into account but only focus on execution times.

Figure 6 (A) gives the execution times of the two algorithms when varying the dimension cardinalities from 10 to 1,000. The input is a relation encompassing 100,000 tuples and 10 attributes. APIC behaves especially well, and the presented experimental results are in accordance with our original expectations. As mentioned in [4], BUC is penalized when domain cardinalities are small, and the gap between the execution times of APIC and BUC decreases as the domain cardinality increases. Figures 6 (B), (C), and (D) illustrate the execution times of APIC and BUC when computing the iceberg cube from the same input relation. The minimum support varies from 0.001% to 0.01% and the cardinality of all dimensions is successively set to 10, 100, and 1,000. In all cases, APIC is more efficient than BUC.

Figure 7 (A) provides the execution times of APIC for computing a full datacube from a relation encompassing 1 million tuples and 10 attributes. The curve labeled "APIC + Gen" includes not only execution times but also result generation, whereas the second one ("APIC") does not consider result generation (thus it provides an evaluation of APIC without any call to the procedure AggregateGen). The latter curve is interesting for estimating the cost of the DM-partition products. We do not integrate in our results the time spent to output aggregates. Under the experimental conditions, the full datacube can be computed in less than 50 minutes. Figures 7 (B), (C), and (D) give, for the same input relation, the results obtained for computing the iceberg cube when varying the minimum threshold from 0.001% to 0.01%. The aggregative function used is COUNT and the parameter varying across the three figures is the dimension cardinality (set to 10, 100, and 1,000 respectively). As expected, the more the threshold increases, the faster the algorithm runs, even if the synthetic data distribution is not really relevant for highlighting this aspect because, from a threshold of 0.005% on, pruning differs only slightly.

Figure 6: Execution times in seconds for relations with 100,000 tuples and 10 attributes. (A) APIC vs. BUC when varying the attribute cardinality; (B), (C), (D) APIC vs. BUC for minimum supports from 0.001% to 0.01%, with cardinalities 10, 100, and 1,000 respectively.

Figure 7: Execution times in seconds for relations with 1,000,000 tuples and 10 attributes. (A) APIC with and without result generation (full datacube); (B), (C), (D) APIC with and without result generation for minimum supports from 0.001% to 0.01%, with cardinalities 10, 100, and 1,000 respectively.

The curves in figure 8 (A) illustrate the scalability of the algorithm according to the number of tuples, which varies from 100,000 to 5,000,000. As expected, APIC behaves linearly in the number of tuples. We also study the influence of the number of dimensions; the associated results are given in figure 8 (B). The scalability of APIC according to the number of tuples and attributes is also illustrated for the iceberg cube, with a minimum support set to 0.01%, in figure 9.

Figure 8: Execution times in seconds for various numbers of tuples and attributes (full datacube, cardinality = 100). (A) 10 attributes, varying the number of tuples from 100,000 to 5,000,000; (B) 1,000,000 tuples, varying the number of attributes from 2 to 10.

Figure 9: Execution times in seconds for various numbers of tuples and attributes with a minimum support of 0.01% (cardinality = 100). (A) 10 attributes, varying the number of tuples; (B) 1,000,000 tuples, varying the number of attributes.

7 Conclusion

The approach presented in this paper addresses the computation of either full or iceberg datacubes. It fits in a formal framework proved to be sound and based on simple concepts. We propose an alternative representation of the data sets to be aggregated: the DM-partitions. By selecting relevant DM-partitions, we show that, on one hand, the memory requirement is similar to that of BUC [4] and, on the other hand, all necessary cuboids can be computed by enforcing in-memory DM-partition products, i.e. by performing set intersections, linearly in the set cardinalities. APIC traverses the dimensional lattice by following the lectic order; its navigation principles are soundly founded. In addition, we propose a condensed representation of datacubes which significantly reduces the necessary storage space without making use of physical techniques. The space saving in our approach is really significant and shows that addressing the disk explosion problem from a logical point of view is relevant. We provide an in-depth comparison with the competing proposal BUC by, on one hand, analyzing the critical factors of the problem for the two algorithms and, on the other hand, performing experimental comparisons. We show that APIC has good scale-up properties and is more efficient than BUC.

Intended further work concerns an extension of datacubes that we call decision datacubes. Our aim is to define a mining approach, fully integrated in a DBMS, for extracting key decision rules [34] (relevant and non-redundant decision rules). Such rules are useful for classification. Another research perspective is related to OLAP queries. Actually, we are interested in proposing a mining query language for our condensed representation of datacubes, as well as in studying set updates on them (like in [26, 38]). In particular, we would like to apply our algorithm in order to efficiently implement the operator Roll-up [25] according to dimension hierarchies [37, 33].

References

[1] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the Computation of Multidimensional Aggregates. In VLDB'96, pages 506–521, 1996.

[2] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In VLDB'94, pages 487–499, Santiago, Chile, 1994.

[3] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining Frequent Patterns with Counting Inference. ACM SIGKDD Explorations, 2(2):66–75, December 2000.

[25] Y. Kotidis and N. Roussopoulos. An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees. In ACM SIGMOD, Seattle, Washington, USA, pages 249–258, 1998.

[4] K.S. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In ACM SIGMOD, Philadelphia, Pennsylvania, USA, pages 359–370, 1999.

[26] D. Laurent, J. Lechtenbörger, N. Spyratos, and G. Vossen. Complements for Data Warehouses. In ICDE’99, Sydney, Australia, pages 490–499, 1999.

[5] J.F. Boulicaut, A. Bykowski, and C. Rigotti. Approximation of Frequency Queries by Means of Free-Sets. In Proceedings of Conference PKDD, pages 75–85, 2000.

[27] L. Libkin. Expressive Power of SQL. In ICDT’01, London, UK, LNCS vol. 1973, pages 1–21. Springer Verlag, January 2001.

[6] L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases. In EDBT'98, Valencia, Spain, LNCS vol. 1377, pages 183–197. Springer Verlag, 1998. [7] L. Cabibbo and R. Torlone. A Framework for the Investigation of Aggregate Functions in Database Queries. In C. Beeri and P. Buneman, editors, ICDT'99, Jerusalem, Israel, LNCS vol. 1540, pages 383–397. Springer Verlag, 1999. [8] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1):65–74, 1997. [9] R. Cicchetti and L. Lakhal. Matrix-Relation for Statistical Database Management. In EDBT'94, Cambridge, UK, LNCS vol. 779, pages 31–44. Springer Verlag, 1994. [10] S.S. Cosmadakis, P.C. Kanellakis, and N. Spyratos. Partition Semantics for Relations. Journal of Computer and System Sciences, 33(2):203–233, 1986. [11] P.M. Deshpande, S. Agarwal, J.F. Naughton, and R. Ramakrishnan. Computation of multidimensional aggregates. Technical report, University of Wisconsin-Madison, 1996.

[28] S. Lopes, J.M. Petit, and L. Lakhal. Efficient Discovery of Functional Dependencies and Armstrong Relations. In EDBT’2000, Konstanz, Germany, LNCS vol. 1777, pages 350–364. Springer Verlag, 2000. [29] H. Mannila and H. Toivonen. Multiple Uses of Frequent Sets and Condensed Representations (Extended Abstract). In KDD’96, Portland, Oregon, USA, pages 189–194, 1996. [30] H. Mannila and H. Toivonen. Levelwise Search and Borders of Theories in Knowledge Discovery. Data Mining and Knowledge Discovery, 10(3):241–258, 1997. [31] N. Novelli and R. Cicchetti. FUN: An Efficient Algorithm for Mining Functional and Embedded Dependencies. In ICDT’01, London, UK, LNCS vol. 1973, pages 189–203. Springer Verlag, 2001. [32] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering Frequent Closed Itemsets for Association Rules. In ICDT’99, Jerusalem, Israel, LNCS vol. 1540, pages 398–416. Springer Verlag, 1999. [33] E. Pourabbas and M. Rafanelli. Hierarchies and relative operators in the olap environment. SIGMOD Record, 29(1):32–37, 2000. [34] J. R. Quinlan. C 4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993. [35] M. Rafanelli and F.L. Ricci. Mefisto: A Functional Model for Statistical Entities. TKDE, 5(4):670–681, 1993.

[12] P.M. Deshpande, J.F. Naughton, K. Ramasamy, A. Shukla, K. Tufte, and Y. Zhao. Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP. Bulletin of IEEE, pages 3–11, 1997.

[36] K.A. Ross and D. Srivastava. Fast Computation of Sparse Datacubes. In VLDB’97, Athens, Greece, pages 116–125, 1997.

[13] T. Eiter and G. Gottlob. Identifying the Minimal Transversals of a Hypergraph and Related Problems. SIAM Journal on Computing, 24(6), 1995.

[37] K.A. Ross, D. Srivastava, and D. Chatziantoniou. Complex Aggregation at Multiple Granularities. In EDBT'98, LNCS vol. 1377, pages 263–277. Springer Verlag, 1998.

[14] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J.D. Ullman. Computing Iceberg Queries Efficiently. In VLDB’98, New York City, New York, USA, pages 299–310. Morgan Kaufmann, 1998.

[38] K.A. Ross and K.A. Zaman. Serving Datacube Tuples from Main Memory. In SSDM’2000, Berlin, Germany, pages 182–195, 2000.

[15] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, 1999.

[39] P. Shenoy, J. R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging Vertical Mining of Large Databases. In ACM SIGMOD’00, Dallas, Texas, USA, pages 22–33, 2000.

[16] H. Garcia-Molina, J.D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, 1999.

[40] A. Shoshani. OLAP and Statistical Databases: Similarities and Differences. In ACM PODS, Tucson, Arizona, pages 185–196, 1997.

[17] S.P. Ghosh. Statistical Relational Databases: Normal Forms. TKDE, 3(1):55–64, 1991.

[41] N. Spyratos. The partition model: A deductive database model. ACM TODS, 12(1):1–37, 1987.

[18] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1(1), 1997.

[42] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier, and L. Lakhal. Fast Computation of Concept lattices Using Data Mining Techniques. In Proceedings of the 7th International Workshop on Knowledge Representation meets Databases (KRDB 2000), Berlin, Germany, number 29 in CEUR Workshop Proceedings, pages 129–139, August 2000.

[19] D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data Mining, Hypergraph Transversals, and Machine Learning. In ACM PODS, Tucson, Arizona, pages 209–216, 1997.

[43] J. Wijsen, R.T. Ng, and T. Calders. Discovering Roll-Up Dependencies. In KDD’99, San Diego, CA, USA, pages 213–222, 1999.

[20] A. Gupta and I.S. Mumick. Maintenance of Materialized Views: Problems, Techniques, and Applications. Data Engineering Bulletin, June 1995.

[44] Y. Zhao, P.M. Deshpande, and J.F. Naughton. An Array-based Algorithm for Simultaneous Multidimensional Aggregates. In ACM SIGMOD’97, pages 159–170, 1997.

[21] M. Gyssens and L.V.S. Lakshmanan. A Foundation for Multidimensional Databases. In VLDB’97, Athens, Greece, pages 106–115, 1997.

[45] Y. Zhao, P.M. Deshpande, J.F. Naughton, and A. Shukla. Simultaneous Optimization and Evaluation of Multiple Dimensional Queries. In ACM SIGMOD’98, pages 271–282, 1998.

[22] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001. [23] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. In ICDE’98, Orlando, Florida, USA, pages 392–401, 1998. [24] A. C. Klug. Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions. Journal of ACM, 29(3):699–717, 1982.