Static dependency analysis for semantic data validation

Abstract. Modern information systems manipulate data models containing millions of items, and the tendency is towards even more complex models. One of the most crucial aspects of modern concurrent engineering environments is their reliability. The ACID principles (atomicity, consistency, isolation, durability) are aimed at providing it, but following them directly leads to serious performance drawbacks on large-scale models, since the correctness of every performed transaction must be controlled. In this paper, a method for incremental validation of object-oriented data is presented. Assuming that a submitted transaction is applied to originally consistent data, the final data representation is guaranteed to be consistent provided that the spot rules are satisfied. To identify the data items subject to spot rule validation, a bipartite data-rule dependency graph is formed. To build the dependency graph automatically, static analysis of the model specification is applied. In the case of complex object-oriented models defining hundreds and thousands of data types and semantic rules, static analysis seems to be the only way to realize incremental validation and to make it possible to manage the data in accordance with the ACID principles.


Introduction
Management of semantically complex data is one of the challenging problems tightly connected with emerging information systems such as concurrent engineering environments and product data management systems [1][2][3][4]. Although the ACID transactional guarantees (Atomicity, Consistency, Isolation, and Durability) are widely recognized and recommended for any information system, it is difficult to maintain the consistency and integrity of data driven by complex object-oriented models. Often such models are specified in the EXPRESS language, which is part of the STEP standard on industrial automation systems and integration (ISO 10303). To be unambiguously interpretable by different systems, the data must satisfy numerous semantic rules imposed by the formal models. Maintaining data consistency and ensuring system interoperability thus become a serious computational problem. Full semantic validation incurs extremely high costs, often exceeding the processing time of individual transactions. Periodic validation is possible, but at a high risk of violating rules and losing actual data. This paper presents an effective method for incremental validation of object-oriented data. The idea of incremental checks is well understood and has been successfully implemented for the validation of such specific data as UML charts, XML documents and deductive databases [5][6][7]. Unlike the aforementioned results, the proposed method can be applied to semantically complex data driven by arbitrary object-oriented models. Assuming that a submitted transaction is applied to originally consistent data, the final data representation is guaranteed to be consistent provided that the spot rules are satisfied. To identify the data items subject to spot rule validation, a bipartite data-rule dependency graph is formed. To build the dependency graph automatically, static analysis of the model specification is applied.
In the case of large-scale models defining hundreds and thousands of data types and semantic rules, static analysis seems to be the only way to realize incremental validation and to make it possible to manage the data effectively in accordance with the ACID principles. The structure of the paper is as follows. In Section 2, we briefly overview the EXPRESS language with an emphasis on the data types and the rule categories admitted by the language; formal definitions of model-driven data, rules and transactions are also provided. In Section 3, we present the complete validation routine and then explain how incremental validation can be arranged using the proposed dependency graph; this is accompanied by an example of a model specification. In the conclusion, we summarise the benefits of the proposed validation method and outline future efforts.

EXPRESS language
Product data models and, particularly, semantic rules can be specified formally in the EXPRESS language (ISO 2004) [8]. This object-oriented modeling language provides a wide range of declarative and imperative constructs to define both data types and constraints imposed upon them. The supported data types can be subdivided into the following groups: simple types (number, real, integer, string, Boolean, logical, binary), aggregate types (set, multi-set, sequence, array), selects, enumerations, and entity types.
Depending on the definition context, three basic sorts of constraints are distinguished in the modeling language: rules for simple user-defined data types, local rules for object types, and global rules for object type extents. Depending on the evaluation context, these imply the following semantic checks:
- attribute type compliance (R0);
- limited widths of strings and binaries (R1, R2);
- size of aggregates (R3);
- multiplicity of direct and inverse associations in objects (R4, R5);
- uniqueness of elements in sets, unique lists and arrays (R6);
- mandatory attributes in objects (R7);
- mandatory elements in aggregates, excluding sparse arrays (R8);
- value domains for primitive data types (R9);
- value domains restricting and interrelating the states of separate attributes within objects (R10, so-called local rules);
- uniqueness of attribute values (optionally, of their groups) on object type extents (R11, uniqueness rules);
- value domains restricting and interrelating the states of whole object populations (R12, so-called global rules).
Value domains can be specified in a general algebraic form by means of the full variety of imperative constructs available in the language (control statements, functions, procedures, etc.). Certainly, each product model defines its own data types and rules. Therefore, semantic validation methods and tools should be developed in a model-driven paradigm allowing their application to any data whose model is formally specified in EXPRESS. For a more detailed description, refer to the standard family mentioned above, which regulates the language.
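As a concrete reading of this taxonomy, the thirteen check categories can be written down as a simple enumeration. The Python identifiers below are our own illustrative naming, not part of ISO 10303:

```python
from enum import Enum

class RuleCategory(Enum):
    """Illustrative naming of the check categories R0-R12 listed above."""
    R0_TYPE_COMPLIANCE = 0        # attribute type compliance
    R1_STRING_WIDTH = 1           # limited widths of strings
    R2_BINARY_WIDTH = 2           # limited widths of binaries
    R3_AGGREGATE_SIZE = 3         # size of aggregates
    R4_DIRECT_MULTIPLICITY = 4    # multiplicity of direct associations
    R5_INVERSE_MULTIPLICITY = 5   # multiplicity of inverse associations
    R6_ELEMENT_UNIQUENESS = 6     # uniqueness of elements in sets, unique lists, arrays
    R7_MANDATORY_ATTRIBUTES = 7   # mandatory attributes in objects
    R8_MANDATORY_ELEMENTS = 8     # mandatory elements in aggregates (except sparse arrays)
    R9_PRIMITIVE_DOMAINS = 9      # value domains for primitive data types
    R10_LOCAL_RULES = 10          # local rules within single objects
    R11_UNIQUENESS_RULES = 11     # uniqueness rules on class extents
    R12_GLOBAL_RULES = 12         # global rules on whole object populations

# R0-R10 are evaluated per object (or per attribute); R11-R12 per extent.
ATTRIBUTE_LEVEL = {c for c in RuleCategory if c.value <= 10}
EXTENT_LEVEL = {RuleCategory.R11_UNIQUENESS_RULES, RuleCategory.R12_GLOBAL_RULES}
```

The object-level/extent-level split matters later: it determines whether a check can be run immediately on a touched object or must be deferred to a query over a class extent.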

Formalization of models, data and transactions
An object-oriented data model can be formally considered as a triple M = ⟨T, ≺, R⟩, where the types T = {C ∪ S ∪ A ∪ …} are classes C, simple types S, aggregates A and other constructed structures allowed by EXPRESS. Generalization/specialization relations ≺ are defined among these types. Each class c ∈ C defines a set of attributes of the form c.a : c ↦ t, t ∈ T. The attributes c.a : c ↦ c* and c.a : c ↦ A(c*), c* ∈ C, are single and multiple associations which play the role of object references. The rules R = {R0 ∪ R1 ∪ R2 ∪ … ∪ R12} define the value domains of typed data in an algebraic way in accordance with EXPRESS. The rules are subdivided into the categories enumerated above. Let us define the key concepts used in the further consideration.
An object-oriented dataset D = {o1, o2, …} is said to be driven by the model M = ⟨T, ≺, R⟩ if all of its objects belong to the model classes: ∀o ∈ D → type(o) ∈ C ⊂ T.

Let a dataset D be driven by the model ⟨T, ≺, R⟩. All the objects {o*} ⊂ D such that type(o*) = c ∈ C ⊂ T are called the extent of the class c on the dataset D. A query returning the class extent on the dataset is called the extent query and is designated as Q(c, D).

Let a dataset D be driven by the model ⟨T, ≺, R⟩. An object set {o*} ⊂ D, type(o*) = c* ∈ C ⊂ T, is said to be interlinked with the objects {o} ⊂ D, type(o) = c ∈ C ⊂ T, along the association c.a if ∀o ∈ {o}, o.a ⊂ {o*} and ∀o* ∈ {o*} → ∃o ∈ {o}: o* ∈ o.a. We denote this as {o} →c.a {o*}.

Interlinking generalizes to routes, i.e. chains of associations {c1.a1, …, cn.an}. A query returning the objects reachable from an object o by successively traversing such a chain of associations is called a route query and is designated as Q(o, D, {c1.a1, …, cn.an}).

In what follows, we assume that each creation operation in the delta representation is complemented by operations initializing the attributes, which are equivalent to modification operations. Each deletion operation is supplemented by operations resetting the attributes to an undefined state, also representable by modification operations. Regardless of how the delta is structured, only elementary operations are taken into account in the context of the studied validation problems.
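The extent and route queries admit a very direct implementation if a dataset is treated as a flat object collection. The following minimal Python sketch uses illustrative Task/Calendar classes of our own; it is not the paper's data structures:

```python
def extent_query(cls_name, dataset):
    """Q(c, D): all objects of the dataset whose class is c."""
    return [o for o in dataset if type(o).__name__ == cls_name]

def route_query(obj, route):
    """Q(o, D, {c1.a1, ..., cn.an}): objects reached from obj by
    successively traversing the named associations."""
    frontier = [obj]
    for attr in route:
        next_frontier = []
        for o in frontier:
            target = getattr(o, attr, None)
            if target is None:
                continue  # unset association: nothing to traverse
            # a multiple association holds a list, a single one holds an object
            next_frontier.extend(target if isinstance(target, list) else [target])
        frontier = next_frontier
    return frontier

# Illustrative demo classes (our own, not from the paper's model).
class Calendar:
    def __init__(self, name):
        self.name = name

class Task:
    def __init__(self, name, calendar=None):
        self.name, self.calendar = name, calendar

night = Calendar("night")
data = [Calendar("std"), Task("t1"), Task("t2", night)]
```

Here `extent_query("Task", data)` returns both tasks, and `route_query(data[2], ["calendar"])` follows the single association to the night calendar.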

Complete validation
The complete validation routine is provided below (see Figure 1). In a cycle over all objects, their attributes are checked against the rules of the categories R1 ∪ R2 ∪ … ∪ R9. The checks are performed individually for each attribute, provided that the corresponding rules are imposed on their types. If violations are detected, error messages are logged. The rules R10 are evaluated for entire objects in the same loop. The second cycle is needed to check the uniqueness rules R11: since these rules are declared inside the class definitions, an additional cycle is arranged over the model classes, and the rules are evaluated on the class extents. Finally, the third cycle checks the global rules R12, which are defined directly in the model; such checks are performed on the corresponding class extents. As mentioned above, complete validation of semantically complex product data is a computationally costly task that can cause performance degradation when processing transactions. Incremental validation makes it possible to reduce the amount of checks to be performed.
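The three cycles of the routine can be sketched as follows. The predicate dictionaries standing in for the model's rule sets are our own simplification for illustration, not the paper's representation:

```python
def validate_complete(objects, attr_rules, local_rules, uniq_rules, global_rules):
    """Sketch of the three-pass complete validation.
    attr_rules:   {class_name: {attr_name: [predicate(value)]}}  (R1..R9)
    local_rules:  {class_name: [predicate(obj)]}                 (R10)
    uniq_rules:   {class_name: [predicate(extent)]}              (R11)
    global_rules: [predicate(all_objects)]                       (R12)"""
    errors = []
    # Pass 1: per-object checks of attribute rules and local rules.
    for o in objects:
        cls = type(o).__name__
        for attr, preds in attr_rules.get(cls, {}).items():
            for p in preds:
                if not p(getattr(o, attr)):
                    errors.append(("attr", cls, attr))
        for p in local_rules.get(cls, []):
            if not p(o):
                errors.append(("local", cls, None))
    # Pass 2: uniqueness rules evaluated on class extents.
    extents = {}
    for o in objects:
        extents.setdefault(type(o).__name__, []).append(o)
    for cls, preds in uniq_rules.items():
        for p in preds:
            if not p(extents.get(cls, [])):
                errors.append(("unique", cls, None))
    # Pass 3: global rules evaluated on the whole population.
    for p in global_rules:
        if not p(objects):
            errors.append(("global", None, None))
    return errors

# Illustrative usage with a toy Task class and rules of our own.
class Task:
    def __init__(self, name):
        self.name = name

errors = validate_complete(
    [Task("a"), Task("")],
    attr_rules={"Task": {"name": [lambda v: len(v) >= 1]}},        # R1-style width check
    local_rules={},
    uniq_rules={"Task": [lambda ext: len({o.name for o in ext}) == len(ext)]},
    global_rules=[lambda objs: len(objs) <= 10],
)
```

The empty task name violates the attribute rule, so `errors` contains a single `("attr", "Task", "name")` record.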

Incremental validation
The proposed incremental validation method is based on the idea of localizing the spot rules that can be affected by a transaction and generating a set of semantic checks sufficient to detect all potential violations. For this purpose, a dependency graph is built from the given specification of the data model in EXPRESS. For brevity, we only explain what this structure represents and omit the details of how it can be formed using static analysis of the specification. The dependency graph is a bipartite graph whose nodes represent the kinds of transaction operations and the categories of semantic rules, both defined by the underlying model. An operation node is connected with a rule node by a directed edge only if operations of that kind can violate the rule once it is instantiated for particular data. Usually, the semantics of an operation imply which data it is applied to. Sometimes the inspected data are a priori unknown and have to be determined by executing corresponding route queries. Therefore, each edge carries a dependency structure containing both a rule reference r and an optional query route.
In some sense, the graph reflects the transaction structure as if it contained all possible kinds of changes, and the data organisation as if all data types were present and all rules were potentially subject to violations. As mentioned above, only elementary operations are involved in the dependency analysis. Thus, the dependency graph enables determining the spot rules that could be violated for particular data due to the accepted transaction. For example, if the operation node is a modification of the object attribute c.a and a rule r ∈ R0 ∪ R1 ∪ R2 ∪ … ∪ R9 is defined for its type, then the node mod(c.a) is connected with the rule node r by a corresponding edge. Given a specific operation of this kind mod(o.a), type(o) = c, in the delta representation, the corresponding check r(o.a) can be produced using the dependency edge. The method of the dependency graph construction is described in more detail in the next section; still, here we point out some of its important features. If the same attribute c.a participates in an expression of a domain rule r ∈ R10 for the class c, then the operation mod(o.a), type(o) = c produces the check r(o) for the object o. If the attribute c.a participates in a uniqueness rule r ∈ R11 defined for the class c, then another dependency edge must be associated with the operation node; in this case, the corresponding check r(Q(c, D)) must be performed. There is a more difficult case when the attribute c.a participates in an expression of a domain rule r ∈ R10 defined for another class c*. The attribute c.a is assumed to be accessed by traversing associated objects along the route {c*.a*} from the objects o* ∈ c*. Then the operation mod(o.a), type(o) = c induces the checks r(o*) for all o* ∈ Q(o, D, {c*.a*}). To identify and perform such checks, the operation node must be connected with the evaluated rule node and the route {c*.a*} must be prescribed to the edge. The dependency analysis of spot rules r ∈ R12 is carried out in a similar way.
Finally, we note that under the assumptions made above, the operations of creating and deleting objects can only violate global rules, and only in those cases where the cardinalities of class extents are computed. Considering object references as specific attribute types makes it possible to localize some spot rules more exactly. Distinguishing between operations on aggregates also leads to better localization of spot rules. For brevity, we omit the details of how the spot rules can be localized more precisely and provide an example in the next subsection.
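A minimal sketch of such a bipartite graph is given below, assuming a simple string encoding of operation kinds and rule references; all the identifiers are illustrative, not a normative API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpNode:
    kind: str    # "mod", "ins", "rem", "new", "del"
    target: str  # e.g. "Task.Name", or a class name for "new"/"del"

@dataclass(frozen=True)
class Edge:
    rule: str           # reference to the spot rule to re-check
    route: tuple = ()   # optional association route to the checked objects

class DependencyGraph:
    """Bipartite graph: operation-kind nodes connected to rule nodes."""
    def __init__(self):
        self._edges = {}  # OpNode -> set of Edge

    def connect(self, op, rule, route=()):
        self._edges.setdefault(op, set()).add(Edge(rule, tuple(route)))

    def affected_rules(self, op):
        """Spot rules potentially violated by an elementary operation."""
        return self._edges.get(op, set())

# Illustrative usage: edges for a name attribute of a Task class.
g = DependencyGraph()
g.connect(OpNode("mod", "Task.Name"), "R0")
g.connect(OpNode("mod", "Task.Name"), "wr2")
g.connect(OpNode("mod", "Task.Name"), "wr2")  # duplicate, absorbed by the set
```

Because edges are kept in a set keyed by frozen dataclasses, connecting the same operation to the same rule twice is harmless, which matches the graph's role as a static, model-level structure.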

Fig. 2. Incremental validation routine
The validation routine presented in Figure 2 consists of sequentially traversing the delta operations, determining the corresponding operation nodes, obtaining the associated spot rule nodes, and either evaluating the rules directly or filling the checkset for subsequent validation. The checkset is organized as an indexed set of records, each of which stores references to the validated rule, the query and the factual data needed to perform the corresponding check. The use of the checkset is motivated by the fact that some operations lead to repeated checks of the same rules. Indexing the checkset allows excluding repeated records and, thus, avoiding redundant computations. At the same time, attribute rule checks are always produced exactly once by the modification operations and, therefore, it is more expedient to execute them immediately, without overloading the checkset.
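The routine can be condensed into the following sketch. Unlike the description above, it routes every check through the checkset for brevity (the paper executes attribute checks immediately), and the operation keys and rule names are illustrative:

```python
def validate_incremental(delta, edges, run_check, route_query=None):
    """delta: list of (op_key, obj) elementary operations;
    edges: {op_key: [(rule, route)]} from the dependency graph;
    run_check(rule, data) -> bool; an empty route means 'check obj itself'."""
    checkset = {}  # (rule, data identity) -> (rule, data): index drops repeats
    for op_key, obj in delta:
        for rule, route in edges.get(op_key, []):
            targets = [obj] if not route else route_query(obj, route)
            for t in targets:
                checkset.setdefault((rule, id(t)), (rule, t))
    return [(rule, data) for rule, data in checkset.values()
            if not run_check(rule, data)]

# Illustrative usage: a duplicated operation produces each check only once.
class Obj:
    pass

o = Obj()
calls = []

def check(rule, data):
    calls.append(rule)
    return rule != "wr2"  # pretend the local rule wr2 is violated

errors = validate_incremental(
    delta=[("mod:Task.Name", o), ("mod:Task.Name", o)],  # repeated operation
    edges={"mod:Task.Name": [("wr2", ()), ("R0", ())]},
    run_check=check,
)
```

Despite the duplicated delta operation, each rule is evaluated once, and only the failing `wr2` check ends up in `errors`.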

Dependency graph construction
To construct the dependency graph, an abstract syntax tree of the model specification is built. From the retrieved data, operation nodes are built for all attribute declarations. The number and types of the nodes constructed for a single attribute depend on its type: for a non-aggregate attribute c.a, a single modification node mod(c.a) is built, while for aggregate attributes, insertion and removal nodes are built as well. Construction of the dependency graph proceeds with generating rule nodes. We handle the construction of nodes for rules R1-R9 and R10-R12 differently. For rules R1-R9, we take all explicit attributes and build rule nodes for each of them. The types of the rule nodes depend on the type of the attribute in question. For instance, if it is a bounded string c.S, we generate a node R1(c.S) (R1, limited width of strings) connected with the node corresponding to the modification of S, mod(c.S). Similarly, if an attribute is a bounded aggregate, we construct a node of type R4 and connect it with the insertion ins(c.a[]) and/or removal rem(c.a[]) operation nodes of the attribute, depending on the side from which the aggregate is bounded: if it is bounded above, then only with the insertion node; if below, with the removal node; if from both sides, with both of them. The construction of rule nodes for R10-R12 is uniform. We start by locating all local rules for R10, all uniqueness rules for R11 and all global rules for R12. For each of the rules, we find all attributes used in it. If an attribute is explicit, we connect only its modification with the rule node, and also the insertion and removal if it is an aggregate used inside a SIZEOF operation. If an attribute is derived, we take its definition and find the attributes used in it; if inverse, we proceed with analyzing the attribute it references. For derived and explicit attributes, the analysis is performed recursively until all the explicit attributes directly and indirectly referenced by them are located. Then all of them are connected with the rule node corresponding to the rule in question.
If during the analysis we find a node that is a function call, we substitute its formal parameters with the actual ones and thus locate the attributes used in it; the analysis of a function body with the parameters substituted is completely identical to the analysis of a rule. An example illustrating the constructed graph is given in the next subsection.
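The recursive attribute harvesting can be illustrated with Python's ast module, using Python-syntax stand-ins for EXPRESS rule expressions; a real implementation would walk the EXPRESS abstract syntax tree instead, and the derived-attribute mapping here is an assumption of the sketch:

```python
import ast

def referenced_attributes(expr, derived):
    """Collect explicit attributes referenced by a rule expression,
    recursing through derived-attribute definitions.
    expr:    rule expression as Python source, attributes as `self.<name>`
    derived: {derived_attr_name: defining_expression}"""
    found = set()

    def visit(source, seen):
        for node in ast.walk(ast.parse(source, mode="eval")):
            if isinstance(node, ast.Attribute):
                name = node.attr
                if name in derived:
                    if name not in seen:  # guard against cyclic definitions
                        visit(derived[name], seen | {name})
                else:
                    found.add(name)       # explicit attribute: keep it

    visit(expr, set())
    return found

# Illustrative usage: Duration is derived from two explicit attributes.
derived = {"Duration": "self.Finish - self.Start"}
attrs = referenced_attributes("self.Duration > 0 and self.Name != ''", derived)
```

Here the derived attribute Duration is expanded into the explicit attributes Finish and Start, which would then all be connected to the rule node.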

Example of a dependency graph
Let us consider a fragment of the EXPRESS specification of a project management system. The three classes depicted in Figure 3 (Task, Link and Calendar) are its core entities. The meaning of Task is self-evident; Link represents a connection defining a relation and execution order between two tasks. The fact that between two tasks there might be only a single link of one type is reflected in the uniqueness rule ur1. A Calendar defines a typical working pattern: working days, working times, holidays. A calendar can be assigned to specific tasks, and one calendar can be set as the default project calendar, meaning that it will be used for tasks for which no task calendar is set. Besides that, it is possible to use an Elapsed calendar for a task, implying that work will be performed 24/7. The global rule SingleProjectCalendar restricts the possible number of project calendars to no more than one. Moreover, the local rule wr3 checks that if a task has a task calendar, the reference to it must be non-null. One more local rule, wr2, restricts the length of an EntityName to be between 1 and 32 characters.

Fig. 3. An example of the model specification in EXPRESS language
The dependency graph for this fragment of the specification is shown in Figure 4.

Fig. 4. A fragment of the model dependency graph
Each operation of attribute modification, except for removal of elements from the list of task children, is connected with the rules validating the corresponding attribute type compliance R0 and the availability of defined values for mandatory attributes R7.
To avoid placing null values into a list of mandatory elements, the rule R8 should be validated as well after such operations are performed. Insertion cannot violate the multiplicity of the direct and inverse associations, as their upper bounds are unlimited, but the checks R4, R5 should be performed when an element is removed from Children. Therefore, the corresponding operation nodes are connected with the aforementioned nodes of the rules that the operations may potentially violate. As the expression of the local rule wr3 includes the attributes CalendarRule and TaskCalendar, the nodes corresponding to the modification of these attributes are connected with the wr3 rule node. For the rule wr2, defining the value range of the EntityName type, there is a connection between the EntityName modification node and the wr2 rule node. The corresponding edges are assigned the routes by traversing which the attributes can be accessed. The expression of the global rule SingleProjectCalendar references only one attribute, IsProjectCalendar, so the appropriate graph nodes are connected by an edge as well. Modification of any attribute of the Link class can affect its uniqueness defined by ur2; hence the connections between LinkType, Predecessor and Successor and the uniqueness rule node. It is also possible that a change affects a constraint not directly but through an inverse association, or even a chain of them, in which other classes can be involved. In this case, the rules for the whole chain of affected classes are added to the checkset. Furthermore, the rules can be affected not only through direct associations but also through inverse ones. For instance, cardinality constraints on inverse aggregate attributes cause insertion of additional rule nodes into the graph.
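The connections just described can be summarized as an edge table mapping operation kinds to the rules they may violate. This is our own reconstruction of the Figure 4 fragment; in particular, it assumes that Task.Name is declared with the EntityName type, so that wr2 attaches to its modification:

```python
# (operation kind, attribute) -> rules possibly violated; a reconstruction
# of the example fragment, with assumed attribute names where Figure 4
# is not reproduced here.
EXAMPLE_EDGES = {
    ("mod", "Task.Name"):                  ["R0", "R7", "wr2"],
    ("mod", "Task.CalendarRule"):          ["R0", "R7", "wr3"],
    ("mod", "Task.TaskCalendar"):          ["R0", "R7", "wr3"],
    ("mod", "Calendar.IsProjectCalendar"): ["R0", "R7", "SingleProjectCalendar"],
    ("ins", "Task.Children"):              ["R0", "R7", "R8"],  # no null elements
    ("rem", "Task.Children"):              ["R4", "R5"],        # multiplicities
    ("mod", "Link.LinkType"):              ["R0", "R7", "ur2"],
    ("mod", "Link.Predecessor"):           ["R0", "R7", "ur2"],
    ("mod", "Link.Successor"):             ["R0", "R7", "ur2"],
}
```

Note that insertion into Children carries no multiplicity checks (the upper bounds are unlimited), while removal carries only R4 and R5, exactly as argued above.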

Conclusion
This paper has presented an incremental method for the validation of model-driven data. The method is applicable to semantically complex data driven by arbitrary object-oriented models. It increases the performance of semantic validation and makes it possible to manage the data effectively in accordance with the ACID principles. The planned work mainly concerns the implementation of the proposed method and its evaluation on industrially meaningful product data. Positive results would allow its wide introduction into new software engineering technologies and emerging information systems.