Array Database Internals

The emergence of huge volumes of scientific data that need to be stored and processed has drawn close attention in the database world to the problem of supporting large multidimensional arrays. Devising special database engines that support an array data model became an issue. Developing a well-organized database management system that stands on a completely uncommon data model requires performing the following tasks: formally defining the data model, building a formal algebra operating on objects from the data model, and devising optimization rules first on the logical level and then on the physical one. Those tasks have already been completed by the creators of different array databases. In this paper array formalization, core algebra and optimization techniques are reviewed using the examples of AML, RasDaMan and SciDB, three developed array database management systems with different algebras and optimization approaches.
Keywords: array databases; overview; formal array algebra; array query processing; array query optimization; AML; RasDaMan; SciDB
DOI: 10.15514/ISPRAS-2018-30(1)-10
For citation: Pavlov V.A., Novikov B.A. Array Database Internals. Trudy ISP RAN/Proc. ISP RAS, vol. 30, issue 1, 2018, pp. 137-160. DOI: 10.15514/ISPRAS-2018-30(1)-10


Introduction
Users of databases in many scientific fields now need to store and process new, non-traditional data structures. Among such uncommon structures are various hierarchical structures, graphs, and arrays. It is worth noting that this need is not explained by the subjective preferences of database users; it is fully justified by the actual state of affairs and by the users' requirements for processing the data under study. In this paper we partially consider what is offered to users who need to store and process array data and how storage and processing are made efficient. But first, let us understand more precisely what kind of data such users have to deal with.
The data referred to is also called multidimensional discrete data (MDD) or raster data [1]. Such data is homogeneous; each element has an index (represented by a vector in a d-dimensional Euclidean space) and hence has adjacent elements. This data is typically huge. On a more intuitive level, MDD can be imagined as huge multidimensional cubes. Each cell of such a cube has a discrete multidimensional index and contains a value of a fixed type. An example of a 3D cube is a series of images [2] obtained from two Hubble telescope cameras over some period of time, say, a year. As has been said, in real life those cubes are usually of tera- or even petabyte scale: for instance, the Large Hadron Collider (LHC), which produces raster cubes during its work, generates over 5 terabytes of multidimensional data after a day of functioning and filtering the produced data [3].
Granted, such cubes are not just stored, as they often demand some kind of analysis. Usually the analysis to be done is not trivial, because the need for intelligent raster data processing arises in such fields as natural sciences, medicine, census, multimedia and OLAP. The demand for efficient storage and processing of huge raster data cubes poses the problem of devising special tools and algorithms. The specificity of the data in use is another factor increasing the need for specialized storage systems. Raster data has several peculiarities induced by its properties mentioned above. These peculiarities include the large size of a single raster data value (a single cube may occupy several disk pages instead of a part of a disk page as in the case of conventional data types, e.g. numeric values), the lack of index support due to the absence of a natural ordering of cubes, etc. Those peculiarities make efficient processing of raster data different from processing conventional data types in terms of storage and optimization techniques.
Fortunately, the problem of optimized storage and efficient processing of raster data has already been faced by the authors of special extensions for existing relational/object-relational databases (such as Terralib [4], PostGIS [5], SpatialLite [6], Oracle GeoRaster [7]) and by the creators of array databases standing on specially devised array data models [8]. Today there are several array database management systems (ADBMS), such as RasDaMan [9] and SciDB [10], which are still maintained and intensively developed with the aim of continuously improving and conforming to rapidly increasing scientific demands. In each ADBMS much attention is paid to optimization, as optimizing queries is crucial when processing queries that operate on petabyte-sized data cubes. There are two kinds of optimization: logical and physical. Logical optimization is usually based on the formal algebra standing behind the array model. Physical optimization is typically achieved by devising a special storage scheme and/or data retrieval order [11], [12].
In the current work we briefly review the theoretical foundations and optimization techniques of three distinct developed ADBMSs: RasDaMan, AML and SciDB. The reader should be aware that the aim of this paper is to present and analyze different data models, algebras and optimization techniques used in some array databases and certainly NOT to compare those databases in order to determine the advantages of one over another. The paper does NOT intend to characterize the databases in any way that would rest on subjective opinions.

Diving In
To study the theoretical foundations along with the optimization techniques in ADBMSs, it is important to understand a generic algorithm by which a new ADBMS can be built.
First, the term array should be formally defined, as it is the central object of interest in such systems. Obviously, in array DBMS algebras the main object of all operations is an array. To the best of our knowledge, all the existing algebras define an array in much the same way. Formally, an array is a function from an index domain D to some value set.
Second, a formal algebra is introduced. Algebras are mathematical structures in which several operations on some core objects are defined. Operations on those objects return an object from the same algebra. One of the most important features of an algebra is the ability to construct expressions in it by combining applications of the algebra's operations. As the operations of an algebra are closed, the result of evaluating an expression is again an object from the algebra. In simpler words, a formal algebra makes it possible to construct complex expressions whose values do not leave the algebra. In practice, this is where the underlying algebras of different systems start to differ. Several algebras used in existing array DBMSs are described further.
Third, logical optimization rules are introduced. Complex expressions can be overburdened with unnecessary operations, and eliminating them simplifies the expression and lowers the execution complexity.
Fourth, physical optimization rules are derived. Logical optimization of an expression is not sufficient for actually executing the query in the most efficient way. In most cases a single query can be executed differently depending on "physical" information which tells how the queried data is actually stored.
Fifth, the query language is introduced to give a user of the system a convenient high-level language that takes the user away from the lower-level algebra language.
In the current paper we look at how the first four steps were followed for each of the highlighted array databases, ignoring high-level query languages, as they do not contribute much to understanding the theoretical essentials of ADBMSs.
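The first two steps can be illustrated with a minimal sketch. It assumes a finite rectangular index domain given by inclusive bounds; the class `Array` and the operations `add` and `scale` are hypothetical names introduced only for this illustration and do not belong to any of the systems discussed.

```python
from itertools import product

class Array:
    """An array as a function from a finite index domain to cell values."""
    def __init__(self, bounds, cell):
        # bounds: one inclusive (low, high) pair per dimension (an assumption)
        self.bounds = list(bounds)
        self.cells = {idx: cell(idx) for idx in self.domain()}

    def domain(self):
        return product(*(range(lo, hi + 1) for lo, hi in self.bounds))

    def __call__(self, idx):
        return self.cells[idx]

def add(a, b):
    """A closed operation: two arrays over one domain in, one array out."""
    assert a.bounds == b.bounds
    return Array(a.bounds, lambda idx: a(idx) + b(idx))

def scale(a, k):
    """Another closed operation: multiply every cell by k."""
    return Array(a.bounds, lambda idx: k * a(idx))

a = Array([(0, 1), (0, 2)], lambda idx: idx[0] + idx[1])
b = Array([(0, 1), (0, 2)], lambda idx: 10)
c = scale(add(a, b), 2)   # expressions compose because every result is an Array
```

Because `add` and `scale` both return an `Array`, the expression `scale(add(a, b), 2)` never leaves the set of arrays, which is the closure property described above.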

Baumann's array algebra
We will now look at how the optimization process is organized in the RasDaMan ADBMS, explaining its core model and formal algebra, Baumann's array algebra [13]. The overview of optimization techniques is mostly based on the PhD thesis [14] of Ronald Ritsch. To present the entire data model, the following terms should be explained:

Multidimensional Intervals and Spatial Domains
An m-interval X is defined as follows: let Multiple probe functions are defined on the multidimensional intervals: Usually a multidimensional interval represents the index set of a multidimensional array, therefore it is commonly called a spatial domain. It is convenient to restrict the possible value type of an interval by introducing a spatial domain type. More precisely, a spatial domain type  over all non-negative integers d will be denoted as  . Then, it is possible to define multiple operations from  to  :

MDD values and types
An MDD value a over base type

Core operations
As reported in [14], Baumann's array algebra stands on two basic operators, MARRAY and COND. A further operator can be defined as follows: let X be an array, i be a dimension number, and r be an expression of some type E on which a total ordering is defined. Let S be a one-dimensional array representing a permutation of elements of a set Those types of operations help to formally classify all probable expressions provided for the cell expression and to exploit that knowledge for finer optimization.
Similarly, each operation op of COND is considered to belong to one of the following classes:

Derived operations
Once the core operations are defined, multiple additional ones are built on top of the low-level core operations to provide a convenient notation for frequently used typical operations. Such operations are called derived operations, and there are several types of them: geometric, induced, binary induced and aggregate induced.

Geometric operations
These operations are some special cases of application of , K be a result of application of trim with some parameters to multidimensional interval D, i be a dimension number, v be some valid point along the i-th dimension of domain D. Then the following operations can be defined:
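A minimal sketch of two geometric operations, trimming and slicing, may help to fix the intuition. The array is stored as a Python dictionary from index tuples to values, and the function names `trim` and `slice_` as well as their signatures are assumptions made for illustration; they are not the formal definitions from [14].

```python
def trim(arr, dim, lo, hi):
    """Keep cells whose index along `dim` lies in [lo, hi]; dimensionality is unchanged."""
    return {idx: v for idx, v in arr.items() if lo <= idx[dim] <= hi}

def slice_(arr, dim, pos):
    """Fix the index `pos` along `dim` and drop that dimension."""
    return {idx[:dim] + idx[dim + 1:]: v for idx, v in arr.items() if idx[dim] == pos}

cube = {(x, y, z): 100 * x + 10 * y + z
        for x in range(3) for y in range(4) for z in range(2)}

# Trim along y first, then slice at x = 1 ...
left = slice_(trim(cube, 1, 1, 2), 0, 1)
# ... or slice first; y has become dimension 0 of the sliced array.
right = trim(slice_(cube, 0, 1), 0, 1, 2)
```

In this toy setting both orders give the same four cells, provided the dimension number of the trim is adjusted after the slice removes a dimension.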

Induced operations
These operations are again some special cases of application of Then the following operations are defined: Note that in the case of unary induced operations the MARRAY cell expression belongs to class $CE_3$, and in the case of binary induced operations to class $CE_6$.
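The idea of building induced operations on top of an array constructor can be sketched as follows. The helper `marray` only mimics the role of MARRAY (evaluate a cell expression at every index of a domain); its signature, the inclusive bounds and the helpers `induced` and `induced2` are assumptions for illustration rather than the definitions used in [14].

```python
from itertools import product

def marray(bounds, cell_expr):
    """Build an array over the domain `bounds` by evaluating cell_expr at every index."""
    domain = product(*(range(lo, hi + 1) for lo, hi in bounds))
    return {"bounds": bounds, "cells": {x: cell_expr(x) for x in domain}}

def induced(f, a):
    """Unary induced operation: apply f to every cell of a."""
    return marray(a["bounds"], lambda x: f(a["cells"][x]))

def induced2(f, a, b):
    """Binary induced operation: combine the cells of a and b at equal indices."""
    assert a["bounds"] == b["bounds"]
    return marray(a["bounds"], lambda x: f(a["cells"][x], b["cells"][x]))

img = marray([(0, 1), (0, 1)], lambda x: x[0] * 2 + x[1])   # cells 0, 1, 2, 3
neg = induced(lambda v: -v, img)
total = induced2(lambda u, v: u + v, img, neg)               # every cell is 0
```

Both induced operations are plain calls to the constructor with a particular cell expression, which is what makes them derived rather than core operations.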

Aggregate induced operations
The main convenient operation is the reduce operation, on which other derived aggregates are based. Let , op be an associative and commutative binary operation defined and closed on T. Then the reduce operation can be defined as follows: Then several convenient operations can be defined as follows:
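As a rough illustration, a reduce operation and a few aggregates derived from it can be sketched in Python. The function name `reduce_array` and the derived aggregates chosen here are assumptions for this example; the concrete derived operations of the algebra are not reproduced.

```python
from functools import reduce as fold
import operator

def reduce_array(op, cells):
    """Fold all cell values with op. op must be associative and commutative,
    so that the order in which cells are visited cannot change the result."""
    return fold(op, cells.values())

cells = {(0, 0): 3, (0, 1): 7, (1, 0): 1, (1, 1): 5}

add_cells = reduce_array(operator.add, cells)     # 16
max_cells = reduce_array(max, cells)              # 7
avg_cells = add_cells / len(cells)                # 4.0
all_positive = reduce_array(operator.and_, {i: v > 0 for i, v in cells.items()})
```

Associativity and commutativity matter in practice because they allow tiles of a large array to be reduced independently and in any order before the partial results are combined.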

Extended relational model
In order to examine MDD-specific optimization techniques in combination with set-based query processing, the MDD model is integrated into an adapted relational model. The attribute domain of multidimensional values can be specified differently: . When attribute domains are formalized in the model, three relational operations are defined. Let R and S be relations, cond be a predicate on R, and $op_j$ be a function defined on R and returning either scalar or multidimensional values for each j. Then relational operations are defined as follows: The first two operations are very similar to those in the canonical relational model; however, the third operator differs significantly. It applies the provided operators to all of the tuples Relational projection can be expressed through projection operation with special functions returning the element of a tuple at a specific position.

Formalizing Array Query Processing
An array query can be represented as a special graph. Each node of the graph is an operator that is part of the query. This graph is a tree and consists of set trees and element trees. Set trees include relational operations as inner nodes and MDD relations (see the definition above) as leaves, whereas the inner nodes of element trees are MDD/logical operations and their leaves are MDD constants/iterators. There are also some specific kinds of trees named for convenience: condition trees and operation trees. The former are element trees representing boolean multidimensional expressions attached to a select node. The latter are element trees representing some multidimensional expression attached to an application node.
The query graph itself represents the data flow between operator nodes. A single edge transfers only a particular type of data, thus all edges can be classified into distinct categories depending on the type of data flowing through them. The whole edge set is divided into non-intersecting sets of three types: relational, dimensional and scalar. Edges carrying relations comprise the relational sets, edges carrying raster data the dimensional ones, and edges carrying non-dimensional or scalar values form the scalar sets. According to this classification, the graph is partitioned into subgraphs of maximal size, each of which contains only one sort of edges. Such subgraphs are called areas, in particular relational data areas (RDAs), dimensional data areas (DDAs) and scalar data areas (SDAs), depending on the type of edges they contain. Optimizing data flow in DDAs is of primary importance in the extended relational model for array query processing.
In general, when a query optimizer receives a query, it processes it in three stages: rewriting, transformation and execution. During rewriting a query is represented in a normal query form which is based on logical optimization rules and the following key principles:
• Eager evaluation of constant subexpressions. This saves computational resources, as those expressions might otherwise need to be computed for each cell of MDD objects.
• Boolean expression normalization aimed at application of optimization rules. All boolean expressions are transformed to CNF or DNF depending on the predicates in order to let the optimizer detect patterns for application of logical optimization rules.
A is an MDD value.
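The first principle, eager evaluation of constant subexpressions, can be sketched on a tiny expression tree. The tuple encoding of expressions, the operator table `OPS` and the function `fold_constants` are assumptions made for this illustration and do not reflect the internal query representation of RasDaMan.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(expr):
    """expr is a number, a string naming an MDD object, or a tuple (op, left, right)."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # evaluated once, not once per cell
    return (op, left, right)

# A * (2 + 3 * 4), where "A" stands for an MDD object
query = ("*", "A", ("+", 2, ("*", 3, 4)))
rewritten = fold_constants(query)        # ("*", "A", 14)
```

Without the rewrite, the scalar subexpression would be a candidate for re-evaluation at every cell of `A`; after it, only one multiplication per cell remains.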
In the second case, rewriting expressions by leveraging beneficial properties of the operation on which some induced, binary induced or aggregation operation is based may reduce the number of multidimensional operations in an expression. A remarkable example of such a rule is In the third case, the computational effort is potentially diminished by reducing the number of multidimensional operations performed. For example, the number of multidimensional predicate evaluations may be diminished as in the case of application of the rule where R, S are relations and condR, condS are predicates defined on R and S respectively.
In general, query rewriting is reported to be notably profitable if the following heuristics are taken into account:
• Perform geometric operations eagerly (load optimization rewriting).
• Reduce the number and overall cardinality of Dimensional Data Areas as much as possible. Abiding by this rule diminishes the number of edges in DDAs by applying rules that eliminate MDD expressions or transform them into scalar ones.
• Perform applications eagerly. This is similar to pushing down projections while using greedy algorithms in relational query processing [16], which is not always the best choice [17]. However, the lack of join conditions allows application to be pushed down into the cross product to reduce the number of tuples to which the given functions are applied. Application is an expensive operation, as MDD values typically have plenty of cells to operate on.
• Perform selections eagerly. This heuristic also resembles the common heuristic in relational query processing, as noted in [18]. Selections are pushed down to diminish operation sets as early as possible. Scalar predicates are given priority over MDD ones.
• Search for common subexpressions. Common subexpressions are stored as intermediate results. Doing one unit of work several times is useless and even costly when operating on large MDD values.
During transformation, logical plan operations are mapped to physical plan operations. Such a mapping is not a distinctive feature of RasDaMan; for example, AML (described in Sec. 2.2) also maps logical operations to physical ones. Typically multiple physical plans are valid and semantically equivalent. They can be compared analytically using special array cost models which were devised for the corresponding algebras in [19], [14]. However, being able to compare physical plans using cost functions is not always sufficient. It is crucial to exploit physical plan refinement techniques, whose aim is to reduce the cost of a physical plan by taking into account the physical layout of the processed MDD values and adjusting the iteration order correspondingly. In the RasDaMan system each type of operation is considered separately. Transformation of induced and aggregate operations is fairly straightforward: the main idea in such cases is to provide parallel tile processing. Transformation of binary induced operations is trickier, and we focus on it in more detail below.
Binary induced operations are optimized by trying to find the optimal traversal order over the tiles comprising the operands of the binary induced operation. Finding the optimal tile traversal order minimizes disk reads, the main bottleneck during MDD value processing, and makes efficient use of main memory. The problem can be formalized using graph terminology. Let $V_1$ and $V_2$ represent the tile sets forming the first and the second operands of the binary induced operation respectively. The edge $(v_1, v_2)$ belongs to $E$ if and only if the tiles corresponding to $v_1$ and $v_2$ need to be processed simultaneously during the binary induced operation. The resulting graph is bipartite, as any edge from $E$ has one end in $V_1$ and the other in $V_2$. An edge in the graph may intuitively be perceived as an indicator of the need to hold two tiles from different sets in main memory for faster processing. However, to process the tiles in main memory those tiles should first be loaded into it, unless they are already in place. Loading a tile is an expensive operation as it is directly related to expensive disk I/O, hence the number of loads should be diminished as much as possible. If there is no cache and only two tiles can be held in main memory at a time, then the problem of minimizing disk access can be formulated as follows: find a vertex traversal order that minimizes the amount of disk access.
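The effect of the traversal order on the number of tile loads can be sketched with a small counting function. The sketch assumes that memory holds exactly one tile of each operand and that there is no additional cache; the tile names and the function `count_loads` are hypothetical, and this simple counter is not the cost function of [14].

```python
def count_loads(edge_order):
    """Count tile loads when tile pairs are processed in the given order and
    memory holds one tile of each operand at a time (no additional cache)."""
    held_left = held_right = None
    loads = 0
    for left, right in edge_order:
        if left != held_left:          # left tile is not in memory yet
            held_left, loads = left, loads + 1
        if right != held_right:        # right tile is not in memory yet
            held_right, loads = right, loads + 1
    return loads

# Four tile pairs that must be processed together, in two different orders.
naive = [("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a2", "b1")]
better = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b1")]

print(count_loads(naive), count_loads(better))   # 7 5
```

The second order keeps one of the two tiles in memory at every step, so each step after the first costs a single load.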
The tile traversal algorithm for binary induced operations has been considered in [14]. The disk access minimization problem is reduced to the problem of minimizing a special cost function that is defined as The minimum is obtained when the sequence In the generic case, determining whether such a sequence of vertices exists is known to be an NP-complete problem. However, the restriction to visit each vertex only once can be relaxed. Such an approach has been used in [21] for the problem of finding the optimal tile traversal order for array join, a special case of a binary induced operation. In that paper the requirement of visiting each vertex exactly once is replaced with the requirement of visiting each edge exactly once, with the intention of minimizing repeated reads of a tile. The authors exploit the Hierholzer and Wiener necessary and sufficient condition for the existence of an Euler circuit, i.e. a circuit that visits each edge in the graph exactly once. As the Hierholzer and Wiener criterion states [22], the presence of an Euler circuit in a graph is equivalent to the graph being connected and all of its vertices having even degrees. The authors split the graph G into connected components, augmenting each with extra auxiliary edges so that each vertex has an even degree and an Euler circuit exists. For each component an Euler circuit is computed and the tile traversal path is determined. The resulting traversal path is built from a random permutation of the components' paths. The authors report that it is still an open question how this approach may be adapted either to a distributed environment where each tile load may have its own cost or to the presence of an arbitrarily sized cache.
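Once every vertex of a connected component has an even degree, an Euler circuit can be found with Hierholzer's algorithm. The following is a generic textbook-style sketch for an undirected multigraph, not the implementation from [21]; the augmentation with auxiliary edges is not shown, and the tile names are hypothetical.

```python
from collections import defaultdict

def euler_circuit(edges, start):
    """Hierholzer's algorithm for a connected undirected multigraph in which
    every vertex has even degree. Returns the circuit as a list of vertices."""
    adj = defaultdict(list)
    for k, (u, v) in enumerate(edges):
        adj[u].append((v, k))
        adj[v].append((u, k))
    used = [False] * len(edges)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()                  # discard edges already traversed
        if adj[u]:
            v, k = adj[u].pop()
            used[k] = True
            stack.append(v)               # extend the current trail
        else:
            circuit.append(stack.pop())   # dead end: vertex joins the circuit
    return circuit[::-1]

# Tiles a1, a2 of the first operand and b1, b2 of the second; all degrees are 2.
edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
path = euler_circuit(edges, "a1")
```

Consecutive vertices of `path` share one tile, so walking the circuit processes every tile pair once while replacing only one tile per step.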

AML
In the Array Manipulation Language (AML) [23], an array A is defined by its shape S, its domain D (which is conceptually similar to the value set in Baumann's array algebra), and the mapping M that establishes the link between the array's shape and domain. $S[i]$ represents the extent of A along the i-th dimension, as every dimension of an AML array has a lower bound equal to zero. A vector x is said to be in the array's shape if $0 \le x[i] < S[i]$ for every dimension i. That said, we can define the mapping M more formally as a function that returns some value from D for every $x \in S$ and a special element $null \in D$ otherwise.
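This definition can be mirrored by a small sketch. The class name `AmlArray`, the use of Python's `None` for the null element and the requirement that the index vector has exactly as many components as the shape are assumptions made for this illustration, not part of AML itself.

```python
NULL = None   # stands in for the special null element of the domain D

class AmlArray:
    def __init__(self, shape, mapping):
        self.shape = tuple(shape)     # shape[i] is the extent along dimension i
        self.mapping = mapping        # M: index vector -> value from D

    def contains(self, x):
        # Lower bounds are zero, so x is inside iff 0 <= x[i] < shape[i] for all i.
        return len(x) == len(self.shape) and all(
            0 <= xi < si for xi, si in zip(x, self.shape))

    def __call__(self, x):
        # Outside the shape the mapping yields the null element.
        return self.mapping(x) if self.contains(x) else NULL

a = AmlArray((2, 3), lambda x: 10 * x[0] + x[1])
print(a((1, 2)), a((2, 0)))   # 12 None
```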

Algebra operations
AML makes use of bit patterns and several probe functions defined on those patterns (index and count). The function application operator iterates over slabs of shape $D_f$ from A and checks whether each slab is "allowed" by all of the patterns $P_i$ (by looking through the corresponding bits over all patterns and rejecting the slab if some of the bits is not set). If the slab is "allowed", function f is applied to the slab and a slab of shape $R_f$ is obtained. The resulting slab is "glued" to the result array from the side indicated by the iteration position.
During the rewriting phase an AML optimizer receives an AML query, builds the query expression tree, and transforms it into a semantically equivalent one using the algebra's transformation rules. The transformed expression is guaranteed to evaluate to the same result as the initial one; however, the transformed query is expected to execute faster (e.g. due to operating on a smaller number of cells, as in the case of load optimization in Baumann's array algebra, see Sec. 2.1.6). In [19], [23] several algebraic rules are presented. The set of logical optimization rules is not as diverse as those devised in [14]. The approaches used by the AML optimizer are similar to those exploited by the Baumann's algebra optimizer for load optimization. Examples of the AML optimizer's query transformations are: merging several subsample operators into one, pushing subsample through merge in some cases, pushing subsamples through apply in some cases, etc. As AML's algebra operators are by nature superior to those of Baumann's algebra in terms of flexibility (e.g. bit pattern support in all elementary operators, differently sized range boxes for the function application operator), logical optimization rules become more complex, are accompanied by extra initial conditions and are overburdened by auxiliary bit pattern calculations. For example, some generic optimizations exploited by a Baumann's algebra optimizer, like reducing the cardinality of the processed cell set before function application, cannot simply be borrowed by the AML optimizer. This is because additional logic is introduced by bit patterns and possibly different sizes of range and domain boxes. However, the overall relative complexity of AML's logical transformation rules does not diminish the AML optimization potential.
At the plan generation phase the rewritten expression tree is mapped to a physical plan, just as during the transformation phase of Baumann's optimizer. The physical plan is represented by a directed graph where a vertex represents a logical operator, while an edge depicts a data flow. Every operator expects as input a stream of non-overlapping chunks (tiles in RasDaMan terminology) of some particular shape and in some particular order. Operators produce non-overlapping array chunks of a particular shape and in a particular order as output. An operator may have parameters which specify its behavior. A summary of the physical operators is provided in Table 1. The physical operator tree is built by a recursive top-down traversal of the expression tree. The algorithm step is determined by the currently processed node of the expression tree. The algorithm for building the physical plan tree is summarized in Table 2.
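The rewriting example of merging several subsample operators into one can be illustrated on a one-dimensional sequence of slabs. The sketch assumes that a bit pattern is repeated cyclically along the subsampled dimension and that a set bit keeps the slab at that position; the function `subsample` and the string encoding of patterns are hypothetical simplifications, not the AML operator itself.

```python
from itertools import cycle

def subsample(slabs, pattern):
    """Keep the slabs whose position has a set bit in the cyclically repeated pattern."""
    return [s for s, bit in zip(slabs, cycle(pattern)) if bit == "1"]

rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]

every_other = subsample(rows, "10")        # r0, r2, r4, r6
two_steps = subsample(every_other, "10")   # r0, r4
merged = subsample(rows, "1000")           # r0, r4 with a single operator
```

Computing the merged pattern is the kind of auxiliary bit pattern calculation that makes AML rewriting rules more involved than their counterparts in Baumann's algebra.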

Table 2. AML physical plan tree building algorithm (columns: "Current tree root node is ..." and "Action")

The main aims of the optimizer in the plan refinement phase are to remove no-op operators and to specify the chunk ordering of each operator. Chunk iteration order directly affects the amount of data buffered by an operator, therefore the optimizer tries to minimize the memory requirement so that the plan can be executed entirely in memory without spending effort on materializing intermediate results. For a d-dimensional array consisting of q chunks there are $q!$ iteration orders. If a plan consists of multiple operators consuming several arrays, then considering all iteration orders becomes exponentially expensive. The AML authors decided that the optimizer should consider only d iteration orders for each operator, where d is the maximum dimensionality of an array consumed in the plan. Each of those d iteration orders differs in the primary dimension by which all the input chunks are ordered. The other sort dimensions are taken in increasing order of dimension number. The authors claim that a Hilbert curve [24] or Z-order [25] could be considered by an optimizer, as those orders might be related to the secondary storage scheme, but those types of orders are neglected for the sake of simplicity.
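For reference, a Z-order over chunk indices can be obtained by interleaving the bits of the chunk coordinates. The sketch below is a generic Morton-code computation under the assumption that dimension 0 contributes the least significant bit of each group; it is not taken from AML, which leaves such orders out of consideration.

```python
def z_order_key(index, bits=16):
    """Interleave the bits of the coordinates of a chunk index (Morton code)."""
    key = 0
    d = len(index)
    for bit in range(bits):
        for dim, coord in enumerate(index):
            # Bit `bit` of coordinate `dim` goes to position bit * d + dim.
            key |= ((coord >> bit) & 1) << (bit * d + dim)
    return key

chunks = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(chunks, key=z_order_key)
print(z_sorted[:4])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Chunks that are close in the multidimensional index space tend to be close in this linear order, which is why such curves are attractive as storage and iteration orders.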

For a physical plan consisting of w operators there will be $d^w$ iteration orders.
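The restricted set of iteration orders and the sizes of the search spaces can be made concrete with a short sketch. The function `iteration_orders`, the chunk grid and the plan size `w` are hypothetical; the sketch only enumerates the d orders that differ in the primary dimension and compares the counts mentioned above.

```python
from itertools import product
from math import factorial

def iteration_orders(chunk_grid):
    """Return d chunk orders; order k sorts by dimension k first and by the
    remaining dimensions in increasing dimension number."""
    d = len(chunk_grid)
    chunks = list(product(*(range(n) for n in chunk_grid)))
    orders = []
    for primary in range(d):
        dims = [primary] + [i for i in range(d) if i != primary]
        orders.append(sorted(chunks, key=lambda c: tuple(c[i] for i in dims)))
    return orders

grid = (2, 3)                       # 2 x 3 chunks, so q = 6 and d = 2
orders = iteration_orders(grid)
q, d, w = 6, 2, 4                   # w operators in a hypothetical plan
all_orders = factorial(q)           # 720 possible orders of one chunk stream
considered = d ** w                 # 16 order assignments for the whole plan
```

Even for this tiny grid, restricting each operator to d orders shrinks the space from 720 orders per chunk stream to 16 assignments for the whole plan.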
Another problem is that even if an optimal iteration order is found for some operator, the optimizer is not necessarily done with this operator. There might be another dependent operator expecting the output chunk stream of the just considered operator in a completely different order. However, instead of examining another possible chunk ordering it might be more beneficial to insert a reorder operator between the producer and the consumer operators. Having addressed the aforementioned problems, the AML authors devised a cost-based algorithm reducing the complexity of assigning chunk iteration orders to operators down to shows the additional cost augmented after inserting a REORDER_P operator between operators x and y, reordering a chunk stream in the j-th order into the i-th order.

SciDB
To the best of our knowledge, a formal description of the SciDB algebra with possible derived optimization techniques has not yet been published. This may be explained by the fact that, formally speaking, there is no formal algebra in SciDB, as will be seen below, and all operators are initially considered to be user-defined functions (UDFs). There are some built-in operators, but they do not form any algebra and are still considered UDFs, as SciDB pays much attention to extensibility. Despite this fact, here we try to formalize our knowledge about SciDB and structure it using the plan used for AML and Baumann's array algebra. First we define an array, then list the built-in elementary operations of SciDB, and finally try to explore how the SciDB query optimizer works.

Array abstraction
SciDB operates on a collection of n-dimensional arrays, each of which again may be represented as a mapping. As can be seen from the formal SciDB array definition, the cell value is actually a heterogeneous tuple or a special NULL value. The presence of the NULL value lets SciDB store sparse arrays out of the box. Cells which are mapped to NULL are called empty. Each tuple element can be addressed by a so-called attribute name. It should be mentioned that there are some constraints on the value sets $V_j$, which restrict a tuple value to be of one of some predefined types: fixed-length string, number, etc. However, users of SciDB are given an opportunity to compose custom types (user-defined types, UDTs). One of the features of the SciDB array data model is that it is nested, allowing an array cell to contain another array.

Algebra operators
One of the most distinguishing features of SciDB is that, as reported in [26], it has no built-in operators forming a rigorous algebraic system. The authors claim that all operators are in fact UDFs and that SciDB has some embedded UDFs, which to some extent may be perceived as elementary operators forming the SciDB base algebra. However, this might seem to contradict the SciDB ideology. In [27] the term 'algebra' is used, but formalism is avoided and just usage examples of built-in operators are provided. Below we describe the SciDB built-in operators mentioned in [27] and specified in the online SciDB documentation [28]. Let
• A be an n-dimensional SciDB array, B be an m-dimensional array
• V be the array index values for A, where array index values can be defined as a set of pairs A iff E evaluates to true on it, or considered empty otherwise.

• Apply(A, F)
Produces a new array by applying F to each cell (substituting the cell's corresponding attribute values into F) and storing the result of the calculation in the result array's cell with the same dimension index.
It should be mentioned that SciDB is oriented towards high extensibility, which explains the shift of SciDB towards UDFs (and UDTs). SciDB provides a special facility that makes it possible to extend the aforementioned set of operators with custom ones written in C++. Custom operators are required to take array input(s) and return array output(s). Moreover, SciDB supports defining custom aggregates (user-defined aggregates, UDAs), increasing the degree of extensibility even more.
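The behaviour of the filtering and apply operators on a sparse array can be sketched in plain Python. This is an illustration only and not SciDB's query language or its C++ operator interface; the dictionary representation, the function names `filter_array` and `apply_array` and the attribute names are assumptions.

```python
def filter_array(array, predicate):
    """Keep cells on which predicate holds; all other cells become empty (absent)."""
    return {idx: cell for idx, cell in array.items() if predicate(cell)}

def apply_array(array, name, func):
    """Add attribute `name`, computed by func from the cell's attributes,
    to every non-empty cell; dimension indexes are left unchanged."""
    return {idx: {**cell, name: func(cell)} for idx, cell in array.items()}

# A sparse 2-D array: empty cells are simply missing from the dictionary.
sky = {(0, 0): {"flux": 1.5}, (0, 2): {"flux": 9.0}, (3, 1): {"flux": 4.0}}

bright = filter_array(sky, lambda c: c["flux"] > 2.0)
scaled = apply_array(bright, "flux2", lambda c: c["flux"] * 2)
```

Both functions take an array and return an array, which matches the requirement that custom operators consume and produce arrays.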

Optimization
The shift to the paradigm 'everything is defined by a user' makes query planning a much harder task. The SciDB optimizer operates on 'black-box' operators which may theoretically be optimized, but in the general case their nature is too generic for the optimizer to determine which optimizations it might perform. However, compared to the AML and RasDaMan systems, SciDB tries to overcome this difficulty with optimizations based more on physical data storage and parallelism. Those optimizations are discussed below.
When a query is received, the optimizer builds a logical plan, doing all the required semantic checks. As stated in [27], the optimizer will produce a complete physical plan corresponding to the built logical plan where possible. Otherwise it will split the query plan into subplans consisting of pipelinable operators and execute them (in parallel, taking into account the physical structure discussed further). One logical optimization of the SciDB query optimizer mentioned in [29] is detecting commuting operations and pushing them down in the query tree. However, due to the generic nature of UDFs, finding such operators is typically a matter of luck.
When physical information comes into use, the SciDB optimizer starts to 'breathe easier'. A SciDB instance is supposed to run on multiple nodes, adhering to a shared-nothing design. A central system catalog exists, storing meta information about user-defined extensions, data distribution, etc. This enables SciDB to provide a high level of parallelism and the related performance improvements, given that array data manipulations are known to be CPU bound ([14], [29]). The SciDB optimizer makes use of the distributed architecture and performs several related optimizations discussed in [29]. For example, the optimizer examines the logical tree for blocking operations, i.e. those which require a temporary array to be constructed (e.g. operations demanding redistribution of data in order to execute). In [29] the built-in SciDB optimizer is reported to be incremental and cost-based. This means that the optimizer picks the best choice for the first subtree to execute, making use of a cost model for plan evaluation. The same paper states that the SciDB optimizer can be called a 'simple optimizer' which tries to minimize the amount of data movement and increase the level of parallelism. However, different optimization techniques that might be used in the SciDB environment have recently been proposed. Those include an iterative array processing model for a parallel array engine proposed in [30], optimization of SciDB's Filter operator proposed in [31], a shuffle join optimization framework for the SciDB array data model presented in [32], etc.

Possible directions of investigation
Based on the overview of the optimization process provided above we summarize some directions that are available for further investigation.Under no circumstances should this list be perceived as complete one.The list below is just a set of noticed future work directions which is actually much larger.
• Baumann's Array Algebra. Array Query Processing. Commutativity of slicing and trimming is not taken into account during optimization. How can this property be exploited for load optimization?
• Baumann's Array Algebra. Array Cost Model. Approximating the selectivity of predicates containing MDD expressions using techniques common to the AQP approach. In relational query processing there are three well-known families of techniques: sampling, parametric and histogram-based. RasDaMan creators opted for the histogram-based approach, arguing that parametric techniques suffer from the fact that real distributions (especially those of operation results) are seldom accurately approximated by mathematical distributions. The problem is very serious in the case of raster image data. Sampling techniques are reported to be very flexible and tolerant to updates. The main disadvantage of such an approach is that sampling has considerable I/O and CPU overhead and lacks reusability of computation results. How serious is that overhead, and to what extent do the disadvantages outweigh the advantages?

• Array join problem. Optimization for relaxed restrictions. When the tile graph is built as in [21], the cost of fetching a tile from disk is considered the same for all tiles. However, this might not be the case in a distributed environment or in the presence of a cache large enough to hold more than one tile of each operand. The approach presented in [33] might be considered.
• AML. Plan refinement. Iteration order. The authors claim that the Hilbert curve or Z-order could be taken into account by an optimizer during plan refinement, as those orders might be related to the storage scheme, but those types of orders are dismissed from consideration for the sake of simplicity. Can such a simplification be revisited?
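To make the Z-order mentioned in the last item concrete, the following minimal sketch computes a two-dimensional Z-order (Morton) index by bit interleaving and uses it to order the cells of a small 4 x 4 array. The function name `morton2d`, the choice of placing the bits of x in the even positions and the 16-bit coordinate width are assumptions made for illustration; the sketch does not reproduce the storage scheme of AML or of any other system.

```python
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one integer.

    Bits of x go to even positions and bits of y to odd positions, so
    cells that are close in 2D tend to receive close one-dimensional keys.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Order the cells of a 4 x 4 array along the Z-curve.
cells = [(x, y) for y in range(4) for x in range(4)]
z_order = sorted(cells, key=lambda c: morton2d(*c))
print(z_order[:8])
```

An iteration order of this kind visits each 2 x 2 block completely before moving on to the next one, which is why it may match a tiled storage layout better than a plain row-major scan.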

Conclusion
In the current paper we have investigated the theoretical background of array databases by exploring three different mature array database management systems: RasDaMan, AML and SciDB. We have looked at those databases from a fixed perspective: firstly, we explored the data model the system uses to define an array; secondly, we examined what formal algebra the system constructs on top of arrays; thirdly, we took a closer look at algebraic optimizations (logical optimizations) and at those applied when information about physical storage and retrieval of data is taken into account (physical level optimizations). We have also collected some possible directions of further investigation for the considered array databases. The list of directions mainly contains the ones outlined by the authors of the array databases themselves.

Abstract
In less formal words, the SORT operator sorts the hyperplanes of a hypercube along a given dimension by the value of expression r evaluated on each hyperplane. In [15] it is used for modeling such geo-raster operations as top-k and median.
It should be mentioned that, from the optimization point of view, it is very important to understand what expression e or operation op is provided for an operator in order to make the best possible optimizations. As reported in [14], all operations for MARRAY's expression e are classified into seven disjoint classes: expression on cell at probing point x; CE_4: cell access with cluster preserving index expression; CE_5: access to small neighbourhood of probing point x; CE_6: simple expression on two cells at probing point x; CE_7: general expression

CA
incorporate expressions to which an optimizer cannot apply any modifications to optimize the MARRAY performance.

Prepare induction expressions for the application of optimization rules. Such a modification leverages associativity and distributivity of induced operations, which can replace MDD operations by scalar ones, dramatically reducing the computation cost of the expression.
In [14] more than one hundred logical optimization rules are presented. Those rules are mainly based on load optimization for geometric operations; exploitation of operations' beneficial properties (such as associativity and distributivity) for induced, binary induced and aggregate operations; and movement of individual subexpressions through the cross product operation for set trees. In the first case, geometric operations are pushed down to multidimensional nodes serving as data sources for upper nodes. This potentially reduces the amount of I/O needed to process a single MDD value by an upper node. An example of such a is an unary induced operation,

The
domain and range boxes respectively; then the core operations on arrays in AML algebra are:
over hyperplanes along the i-th dimension and, if allowed by the corresponding bit in P, concatenates the hyperplane to the result array. The MERGE operator cyclically glues together hyperplanes cut along the i-th dimension according to pattern P. Hyperplanes are taken from A for a set bit in the pattern and from B otherwise. If the source array has no values, the hyperplane is filled with  value. Function application.
), of SUB and MERGE rooted at current node. Translate it to n-ary COMBINE_P and n x neighbors. If the x is LEAF_P then the cost function is equal to the operator memory cost (defined for each operator separately), otherwise:

nI
are closed subsets of Z^n just as in the case of Baumann's array algebra. Value associated with each index vector   an array's cell. What needs more attention here is the nature of a SciDB array's cell. As it can

F
admissible index value for array's dimension j.
Q be a predicate containing free occurrences of A cell's dimensional index
E be a predicate containing free occurrences of A cell's attributes
be a function mapping A cell attribute values to some admissible cell type
Then the set of the following operators may be described: A as buffer array and sequentially cutting out hyperplanes from buffer array based on elements in ' values of two input arrays if cells have the identical dimensional indices. of the same dimensionality as A where each cell is taken from

• Multidimensional intervals and spatial domains
• MDD types, values and elementary operations on them
• Derived operations on MDD data
• Extended relational model with MDD support

• Base type is fixed. Spatial domain is unknown
• Base type is fixed. Spatial domain's dimensionality is fixed
• Base type is fixed. Spatial domain is fixed
• Base type is fixed. Spatial domain is fixed with its physical representation (violation of the 'hidden physical representation' principle)
The type of attribute domain sets the amount of restrictions on MDD values that can be effectively used by an optimizer.
i A be attributes with

Optimization
AML optimization techniques, similarly to those in Baumann's algebra, can be classified into logical and physical ones. In AML those techniques are applied in three phases: the so-called rewriting phase, the plan generation phase and the plan refinement phase.

Table 1. AML physical operations