An Approach to Test Program Generation Based on Formal Specifications of Caching and Address Translation Mechanisms

A memory subsystem is one of the key components of a microprocessor. It consists of a number of storage devices (instruction buffers, address translation buffers, multilevel cache memory, main memory, and others) organized into a complex hierarchical structure. The huge state space of a memory subsystem makes its functional verification extremely labor-intensive. Nowadays, the main approach to functional verification of microprocessors at the system level is simulation with automatically generated test programs. In this paper, a method for generating test programs for functional verification of microprocessors' memory management units is proposed. The approach is based on formal specifications of memory access instructions, namely load and store instructions, and formal specifications of memory devices, such as cache units and address translation buffers. The use of formal specifications makes it possible to automate the development of test program generators and makes functional verification systematic thanks to a clear definition of testing goals. In the suggested approach, test programs are constructed by using combinatorial techniques: stimuli (sequences of loads and stores) are created by enumerating all feasible combinations of instructions, situations (instruction execution paths) and dependencies (sets of conflicts between instructions). Importantly, test situations and dependencies are automatically extracted from the formal specifications. The approach was used in several industrial projects on verification of MIPS microprocessors and made it possible to discover critical bugs in the memory management mechanisms.


Introduction
A computer memory is known to be a complex hierarchy of data storage devices varying in volume, latency and price [1]. In addition to registers and main memory, microprocessors include a multi-level cache memory and address translation buffers. The set of devices responsible for handling memory accesses is referred to as a memory subsystem or a memory management unit (MMU). Being one of the key microprocessor components, the memory subsystem is required to be correct and reliable. Due to the complicated structure of the memory, the number of situations that can occur in processing load and store instructions is huge; this makes it infeasible to verify the subsystem "manually". In current practice, tests (programs in the assembly language of the microprocessor under test) are created in an automated way with intensive use of random generation. A tool that constructs test programs is called a test program generator (TPG) or an instruction stream generator (ISG) [2]. In a typical use case, a TPG accepts probability distributions for instruction types and operand values as well as other parameters and produces a set of programs in compliance with the settings. Though the randomization-based approach is able to find "high-quality" bugs, it is not systematic and does not guarantee verification completeness. In the present work, an approach to generating test programs for memory subsystems of single-core microprocessors is discussed (multi-core issues, such as memory consistency and cache coherence [3], are out of the scope of the paper). The proposed approach complements random-based testing and enables thorough checking of situations in the MMU behavior. It uses specifications of memory access instructions, i.e. load and store instructions, and specifications of memory devices including, first of all, caches and address translation buffers.
The formal specifications serve as a source of test coverage information and allow automatically extracting instruction-level situations and dependencies. Test programs are built by composing possible situations and dependencies for instruction sequences of bounded length. The rest of the paper is organized as follows. Section II is a primer on microprocessor memory organization. Section III provides a brief overview of the related work. Section IV describes in detail the mentioned approach to test program generation. Section V considers industrial applications of the described approach. Finally, Section VI concludes the paper and outlines directions for future research and development.

Memory Subsystem
In a nutshell, the memory subsystem of a microprocessor is intended for handling memory accesses, namely instruction fetch requests, data loads and data stores. Its functions include translation of virtual addresses into physical ones, memory protection, code and data caching, etc. [1]. Let us consider the essential concepts of memory management.
From a programmer's perspective, a computer memory is a linear array of bytes. However, the underlying mechanisms and techniques, usually referred to as virtual memory, are rather sophisticated. A virtual address space, i.e. a range of the byte array indices available for programs to use, is commonly divided into disjoint segments. Given a segment and a virtual address, the MMU acts as follows. If the microprocessor mode satisfies the segment's privilege level, the virtual address is translated into the physical address, and an access to the physical memory is performed; otherwise, an address error exception is thrown. Segments are divided into mapped and unmapped; the latter, in turn, are subdivided into cached and uncached. Addresses of mapped segments are translated with the help of translation lookaside buffers (TLB), which store the mapping between virtual page numbers (VPN) and physical frame numbers (PFN). If there is a match in the TLB, the VPN bits of the virtual address are replaced with the PFN bits, and the process continues. Otherwise, a TLB refill exception is thrown, which triggers the operating system to look up the page table and update the TLB. Unmapped addresses are translated directly with no use of the buffers. Accessing cached segments, as opposed to uncached ones, activates the caching mechanisms.
A cache is an intermediate storage responsible for speeding up access to frequently used data. A typical microprocessor has a two- or three-level cache memory. Typically, an $L_i$ cache stores a subset of the $L_{i+1}$ contents; the highest-level cache is the largest one and interacts directly with the main memory. A cache works as follows. As soon as data are requested, the cache controller checks whether they are in the buffer. If they are (this is called a cache hit), the data are taken from there and returned to the requester. Otherwise (this is called a cache miss), the controller chooses a victim among the data blocks stored in the buffer and replaces it with the data loaded from the higher-level cache or the main memory.
In the general case, a cache comprises a number of sets; each set consists of a number of lines; each line includes data and a tag. Let $S = 2^s$ be the number of sets, $W$ be the number of lines in a set, and $B = 2^b$ be the size of a data block. Depending on the values of $S$ and $W$, the following types of cache memory are recognized: (1) a direct-mapped cache ($W = 1$); (2) a fully associative cache ($S = 1$); (3) a set-associative cache ($W > 1$ and $S > 1$). The bit representation of an address is interpreted as follows: the bits $[0, \dots, b-1]$ refer to a byte inside a data block; the bits $[b, \dots, b+s-1]$ identify a set; the bits $[b+s, \dots, m-1]$, where $m$ is the address length, define a tag. To determine whether the cache contains data for a given address, first, the set is identified; then, the tags of the set's lines are concurrently compared with the tag extracted from the address. If there is a match, then the requested data are available in the cache.
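The address interpretation and the lookup procedure described above can be illustrated with a minimal Python sketch. It is not taken from any tool discussed in the paper: the class name Cache, its methods and the FIFO replacement policy are assumptions made for the illustration, and data payloads are omitted so that only tags are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Tag-only model of a cache with S = 2**s sets, W lines per set, B = 2**b bytes per block."""
    s: int      # number of set index bits
    b: int      # number of byte offset bits
    ways: int   # W, the number of lines in a set
    sets: list = field(init=False)

    def __post_init__(self):
        # Each set is a list of at most `ways` tags, oldest first.
        self.sets = [[] for _ in range(1 << self.s)]

    def split(self, addr):
        """Return (tag, index, offset): bits [b+s..], [b..b+s-1] and [0..b-1] of the address."""
        offset = addr & ((1 << self.b) - 1)
        index = (addr >> self.b) & ((1 << self.s) - 1)
        tag = addr >> (self.b + self.s)
        return tag, index, offset

    def is_hit(self, addr):
        """Identify the set, then compare the tags of its lines with the address tag."""
        tag, index, _ = self.split(addr)
        return tag in self.sets[index]

    def fill(self, addr):
        """Load the block for addr; return the evicted tag, or None if nothing was evicted.

        The victim is the oldest line of the set (FIFO, an assumed policy).
        """
        tag, index, _ = self.split(addr)
        lines = self.sets[index]
        if tag in lines:
            return None
        victim = lines.pop(0) if len(lines) == self.ways else None
        lines.append(tag)
        return victim
```

For instance, with s = 7, b = 5 and two ways, the address 0x12345 splits into the tag 0x12, the set index 26 and the byte offset 5; filling the same set with two other tags evicts the tag 0x12.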

Related Work
There are several TPG tools based on formal specifications of memory subsystems. DeepTrans (IBM Research) [4] is one of them. The approach is targeted at testing address translation mechanisms and uses a special-purpose modeling language. A process of address translation is depicted as a directed acyclic graph whose vertices correspond to the process stages and whose edges correspond to the transitions between the stages. A path from the source of the graph to the sink defines a particular situation in the address translation. Such situations can be referenced from high-level descriptions of test programs, so-called templates. The latter are processed by the Genesys-Pro generator [2], which formulates constraints on instruction operands, solves them and transforms the results into instruction sequences. The major advantage of the approach is the use of highly developed languages for modeling address translation and describing test templates. The disadvantage is that the tool is not able to automatically extract conflicts and dependencies between instructions. Verification engineers have to specify this kind of information manually in test templates.
In [5], the Java programming language coupled with a specialized library is used to specify an MMU. As in DeepTrans, the situations correspond to the paths in the graph describing the subsystem under test; here is an example: {Mapped (data are requested via a mapped segment), TLBHit (there is a TLB hit), TLBValid (the matched TLB entry is valid), ¬L1Hit (a miss in the first-level cache occurs)}. In addition, the approach provides means for specifying instruction dependencies; an example is as follows: {¬TLBEqual (instructions use different TLB entries), L1IndexEqual (data are mapped to the same set of the first-level cache), ¬L1TagEqual (data belong to different cache lines)}. Test templates are constructed automatically by combining situations and dependencies for short sequences of instructions.
Building templates and creating programs on their basis is done by the MicroTESK generator (ISP RAS) [6]. The strength of the approach is systematic test enumeration that takes into consideration instruction execution paths as well as dependencies between instructions. The principal weakness is underdeveloped specification facilities.

Approach Description
The main goal of the presented research is to combine the advantages of the methods [4] and [5] while avoiding their drawbacks. This can be achieved by using formal specifications. Accordingly, microprocessor instructions, an MMU and test templates are described in formal domain-specific languages. Specifications are analyzed to extract testing knowledge, that is, situations and dependencies. The extracted information is used to automatically generate test programs from templates as well as to automatically construct templates in a systematic way. The suggested method is supported by the MicroTESK TPG [7].

Formal Specifications
Formal specification of a microprocessor under test covers the instruction set and the memory subsystem. Instructions are described in the nML language [8]. Descriptions declare the registers and define the assembly syntax, the binary image and the semantics of the instructions. Semantics is specified in the usual imperative form by means of bit-vector and floating-point operations. Here is an nML specification of the MIPS [9] integer addition instruction (ADD):

    op ADD (rd: REG, rs: REG, rt: REG)
      syntax = format("add %s, %s, %s", rd.syntax, rs.syntax, rt.syntax)
      image = format("000000%s%s%s00000100000", rs.image, rt.image, rd.image)
      action = {
        temp = rs<31>::rs<31..0> + rt<31>::rt<31..0>;
        if temp<32> != temp<31> then
          exception("IntegerOverflow");
        else
          rd = coerce(DWORD, temp<31..0>);
        endif;
      }

Being rather simple, nML does not have adequate facilities to describe memory management. Although the language is powerful enough to specify caching and address translation mechanisms, pure nML specifications of an MMU are awkward and hard to analyze; in particular, it is difficult to extract testing knowledge to automate test program generation. For this reason, a domain-specific language has been introduced. A memory access instruction is described in nML in an intuitive manner by reading or writing data from or to the byte array representing the physical memory. Every access to the array triggers the MMU logic specified in a separate file. An nML specification of the MIPS load byte instruction (LB) may look as follows:

    op LB (rt: REG, offset: SHORT, base: REG)
      syntax = format("lb %s, %d(%s)", rt.syntax, offset, base.syntax)
      image = format("100000%s%s%s", base.image, rt.image, offset)

Here MEM, the byte array accessed by the instruction, is declared as mem MEM[2**36, BYTE]; 2**36 (that is, $2^{36}$) is the memory size in bytes. Note that although the array is specified as the physical memory, it is accessed through the virtual address. Memory management is described in a special language.
MMU specifications include address types, memory segments, buffers (such as TLBs and caches) and detailed algorithms for handling load and store instructions. Addresses and segments are described straightforwardly; buffers are specified with the following parameters: the associativity (ways), the number of sets (sets), the entry (line) format (entry), the index calculation function (index), the tag calculation function (tag) and the data eviction policy (policy). Here is a description of the virtual and physical addresses (VA and PA, respectively), the user segment (XUSEG), the address translation buffer (TLB) and the first-level cache memory (L1) of a MIPS microprocessor:
Processing of loads and stores is specified by requesting the buffers and handling their responses. The syntax is similar to that of nML but also allows conditions such as XUSEG(va).hit (the address va belongs to the segment XUSEG) and L1(pa).hit (the buffer L1 contains the data for the address pa). Here comes an example:
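To illustrate the kind of control flow such a specification describes, here is a minimal Python sketch. It is not written in the MMU specification language and does not reproduce any specification from the paper: the function process_load, the constants XUSEG_LIMIT, PAGE_BITS and LINE_BITS, and the representation of the TLB as a plain VPN-to-PFN dictionary are assumptions made for the illustration. In particular, the sketch ignores TLB control bits and the paired-page organization of the MIPS TLB.

```python
class AddressError(Exception):
    """The virtual address does not belong to the (only) modeled segment."""


class TlbRefill(Exception):
    """No TLB entry matches the virtual page number."""


XUSEG_LIMIT = 1 << 40  # hypothetical size of the user segment
PAGE_BITS = 12         # hypothetical page size: 4 KB
LINE_BITS = 5          # hypothetical cache line size: 32 bytes


def process_load(va, tlb, l1_lines):
    """Handle one load request.

    tlb maps virtual page numbers to physical frame numbers;
    l1_lines is the set of physical line numbers currently held in L1.
    Returns the physical address and the list of guards that were satisfied.
    """
    trace = []
    if not 0 <= va < XUSEG_LIMIT:
        raise AddressError(hex(va))
    trace.append("XUSEG(va).hit")

    vpn = va >> PAGE_BITS
    offset = va & ((1 << PAGE_BITS) - 1)
    if vpn not in tlb:
        raise TlbRefill(hex(va))
    trace.append("TLB(va).hit")

    # Replace the VPN bits with the PFN bits.
    pa = (tlb[vpn] << PAGE_BITS) | offset
    line = pa >> LINE_BITS
    if line in l1_lines:
        trace.append("L1(pa).hit")
    else:
        trace.append("L1(pa).miss")
        l1_lines.add(line)  # the line is loaded as a side effect of the miss
    return pa, trace
```

The returned list of guards plays the role of an execution path: two consecutive loads from the same address yield a path ending with an L1 miss followed by a path ending with an L1 hit.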

Coverage Extractor
Formal specifications are parsed and the control flow graph (CFG) is built. A coverage extractor traverses the CFG and constructs the set of all possible execution paths (the graph is assumed to be acyclic). A single path, a so-called situation, describes processing of an individual request and finishes either with a memory access or with an exception (incorrect address, TLB refill, etc.). Each transition of the path is labeled with a guard, i.e. a condition that enables the transition, and an action to be performed. Here is an example of a load situation (for the sake of simplicity, the transition actions are omitted): {XUSEG(va).hit, TLB(va).hit, va<12> = 0, v = 1, L1(pa).hit}.
Given a pair of execution paths, the coverage extractor can be asked to construct the set of all possible dependencies. A dependency is a map from the set of buffers common to the two given execution paths to the set of conflicts. Formally, a dependency is a partial map $d: B \to C$, where $B$ is the set of buffers and $C$ is the set of conflicts. The following types of buffer usage conflicts are predefined in the tool:
- AddrEqual: using the same data;
- AddrNotEqual: using different data:
  - IndexEqual: using data of the same set:
    - TagEqual: using data of the same line;
    - TagReplaced: using data of the replaced line;
    - TagNotReplaced
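The path enumeration performed by the coverage extractor can be sketched in Python as follows. This is only an illustration under stated assumptions: the CFG is represented as a dictionary mapping a node to its outgoing (guard, successor) pairs, the node names and the ".miss" guards are invented for the example, and transition actions are omitted.

```python
def enumerate_situations(cfg, node):
    """Yield the guard sequence of every path from node to a terminal node.

    cfg maps a node to a list of (guard, successor) pairs; a node without
    outgoing transitions is terminal. The graph is assumed to be acyclic.
    """
    edges = cfg.get(node, [])
    if not edges:
        yield []
        return
    for guard, successor in edges:
        for rest in enumerate_situations(cfg, successor):
            yield [guard] + rest


# A hypothetical, heavily simplified CFG of load processing.
LOAD_CFG = {
    "start": [("XUSEG(va).hit", "tlb"), ("XUSEG(va).miss", "address_error")],
    "tlb": [("TLB(va).hit", "l1"), ("TLB(va).miss", "tlb_refill")],
    "l1": [("L1(pa).hit", "access"), ("L1(pa).miss", "access")],
}

situations = list(enumerate_situations(LOAD_CFG, "start"))
```

For this graph, four situations are produced: two that end with a memory access (an L1 hit or an L1 miss) and two that end with an exception (an address error or a TLB refill).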

Template Iterator
A template is a sequence of situations linked together with a number of dependencies. A template iterator systematically enumerates templates to cover a representative set of cases of the memory subsystem behavior. Let $S$ be the set of situations, $D$ be the set of dependencies, and $n$ be the length of templates. Formally, a test template of length $n$ is a pair $\langle (s_1, \dots, s_n), \{d_{ij}\} \rangle$, where $(s_1, \dots, s_n) \in S^n$ is the template skeleton and $\{d_{ij}\}$, where $d_{ij} \in D$, $i = 1, \dots, n-1$ and $j = i+1, \dots, n$, is the set of template ligaments. An example of a two-situation template is given below:
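The enumeration of skeletons and ligaments can be sketched in Python as follows. The function enumerate_templates and the string labels are assumptions made for the illustration. The sketch treats the set of dependencies as flat and does not filter out infeasible combinations, whereas an actual iterator has to take into account which buffers are common to each pair of situations.

```python
from itertools import combinations, product


def enumerate_templates(situations, dependencies, n):
    """Yield (skeleton, ligaments) pairs for templates of length n.

    skeleton is a tuple (s_1, ..., s_n) of situations;
    ligaments maps each index pair (i, j), 1 <= i < j <= n, to a dependency.
    """
    pairs = list(combinations(range(1, n + 1), 2))
    for skeleton in product(situations, repeat=n):
        for choice in product(dependencies, repeat=len(pairs)):
            yield skeleton, dict(zip(pairs, choice))


templates = list(enumerate_templates(["hit", "miss"], ["d0", "d1", "d2"], 2))
```

Without filtering, the number of templates is $|S|^n \cdot |D|^{n(n-1)/2}$, which is 12 for the two situations and three dependencies above and grows quickly with $n$; this is one reason for bounding the length of instruction sequences.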

Test Data Generator
Templates are symbolic representations of test programs. To produce a test program from a template, the latter should be instantiated. A test data generator plays the key role in this activity. Test data, in a sense, are a solution to the constraints stipulated in the template. They include virtual addresses to be used by the instructions as well as some auxiliary information intended for setting up the state of the microprocessor under test, such as indices of TLB entries, VPN-to-PFN mappings, sequences of addresses to be accessed to load or evict data to or from the buffers, etc. The test data generator acts in compliance with one of the following strategies: (1) heavyweight template elaboration with an attempt to find an exact solution to the problem or (2) lightweight processing targeted at constructing an approximate solution. In general, our approach follows the second strategy. Detailed analysis of templates makes sense only for accurate MMU specifications, while instruction-level models are rather abstract. Another argument is that the lightweight approach gives a significant benefit in terms of performance, while the quality of testing is comparable.
Given a template $\langle (s_1, \dots, s_n), \{d_{ij}\} \rangle$, consider how test data are generated. First, for each situation $s_j$ of the template, a united dependency $dep_j: B \times C \to 2^{\{1, \dots, j-1\}}$ is built. For each buffer $b$ and conflict $c$, $dep_j(b, c)$ contains the indices $i < j$ such that $b \in dom(d_{ij})$ and $d_{ij}(b) = c$, that is, the situations $s_i$ and $s_j$ access the buffer $b$ and there is the access conflict $c$. Then, the template's situations are processed one after another. Given a situation $s_j$, the buffers affected in $s_j$ are sequentially inspected. For each buffer $b$, the actions listed below are performed:
- if $dep_j(b, AddrEqual) \neq \emptyset$, then $data(s_j).addr \leftarrow data(s_i).addr$, where $data(s_j)$ denotes the test data associated with $s_j$; $addr$ is the virtual or physical address depending on the type of $b$; $i$ is any index from $dep_j(b, AddrEqual)$;
- otherwise, if $dep_j(b, IndexEqual) \neq \emptyset$, then $data(s_j).addr<I> \leftarrow data(s_i).addr<I>$, where $I$ is the bit range given in the index section of the specification of $b$.
TagReplaced conflicts, referred to as dynamic conflicts, are handled in a special way. As soon as all other constraints, including hits and misses (see the next paragraph for details), are resolved, the created sequence of instructions is simulated on a simplified model derived from the MMU specifications. This enables the generator to predict the lines being evicted and replaced with recently accessed data. If there is a TagReplaced conflict between two instructions (template situations, to be more precise), the evicted tag predicted for the first instruction is copied into the address of the second one.
In between static Equal/NotEqual and dynamic Replaced conflicts, hits and misses are considered. For a hit, an access to the designated address is appended to the template test data: $hit(b).add(data(s_j).addr)$, where $hit(b)$ is a set-separated data structure that stores sequences of addresses targeted at loading data into the buffer $b$. For a miss, an address sequence is added: $miss(b).add(\{addr_1, \dots, addr_W\})$, where $miss(b)$ is a storage of addresses used to evict data from $b$, and $\{addr_1, \dots, addr_W\}$ is a so-called evicting sequence, that is, $addr_k<I> = data(s_j).addr<I>$, $addr_k<T> \neq data(s_j).addr<T>$ and $addr_k<T> \neq addr_l<T>$ for all $k, l \in \{1, \dots, W\}$ such that $k \neq l$; here $T$ is the tag bit range and $W$ is the associativity of $b$. Note that appending an address to the $hit(b)$ structure may require adding evicting sequences for the preceding buffers for which the miss constraint has been set.
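The construction of an evicting sequence can be sketched in Python as follows. The helper names bits and evicting_sequence, the representation of bit ranges as inclusive (low, high) pairs and the choice of consecutive tags are assumptions made for the illustration; any set of W pairwise distinct tags that differ from the tag of the target address would do.

```python
def bits(value, lo, hi):
    """Return the bit field value<hi..lo> (both bounds inclusive)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def evicting_sequence(addr, index_range, tag_range, ways):
    """Build W addresses that map to the set of addr but carry other tags.

    Every returned address has the index of addr; the tags are pairwise
    distinct and differ from the tag of addr. Bits outside the index and
    tag ranges are left zero.
    """
    i_lo, i_hi = index_range
    t_lo, t_hi = tag_range
    tag_space = 1 << (t_hi - t_lo + 1)
    if ways >= tag_space:
        raise ValueError("not enough distinct tags for an evicting sequence")
    index = bits(addr, i_lo, i_hi)
    tag = bits(addr, t_lo, t_hi)
    # Tags tag+1, ..., tag+W modulo the tag space never wrap back onto tag
    # or onto each other because W < tag_space.
    return [
        (((tag + k) % tag_space) << t_lo) | (index << i_lo)
        for k in range(1, ways + 1)
    ]


sequence = evicting_sequence(0x12345, (5, 11), (12, 35), 4)
```

Accessing the addresses of such a sequence fills all W lines of the set with foreign tags, so a subsequent access to the target address misses in the buffer regardless of the replacement policy.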

Test Data Adapter
Test data concretize symbolic templates, but being instruction set independent, they are still too general to be immediately applied to testing. It is the test data adapter that translates a template coupled with test data into a sequence of specific instructions, a so-called test case. Such a sequence usually consists of two parts: a preparation, which sets up the microprocessor state, and a stimulus, which performs a series of memory accesses to stress the microprocessor's MMU. Making a stimulus is straightforward: each situation of the template skeleton is converted into a load or a store depending on the specification section, read or write, that the execution path belongs to. A particular type of the instruction, i.e. the size of the data block being accessed, is either derived from the template or the specifications or randomized. The instruction is allowed to use any registers from the user-defined set.
Note that the procedure requires a mapping from $\{read, write\} \times \{byte, word, \dots\}$ to the set of memory access instructions implemented in the design. Constructing a preparation sequence is more intricate. The main problem is that placing data into a buffer may change the state of others. Here is how the problem is solved. First, virtual address based buffers, e.g., TLB, are handled before buffers accessed by physical addresses, e.g., L1 and L2. Initialization of the latter can be carried out by using unmapped addresses, which does not affect the former. Second, the "largest buffer first" strategy is applied. Typically, a set of lines of a smaller buffer maps to several sets of lines of a larger one, which makes it possible to change the smaller buffer with no tangible effect on the larger one. Given a buffer, the preparation sequence is cut into pieces corresponding to particular sets of the buffer. Each piece is the concatenation of the miss and hit sequences. It is implied that each buffer is provided with a code pattern to be used to place data for a given address.
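The ordering rules for preparation sequences can be sketched in Python as follows. The Buffer class, its fields and the dictionary-based representation of the miss and hit structures are assumptions made for the illustration; the sketch produces only address sequences and leaves out the per-buffer code patterns.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    virtual: bool  # True if the buffer is accessed by virtual addresses (e.g. a TLB)
    size: int      # total number of lines (sets * ways)


def preparation_order(buffers):
    """Virtual-address buffers first; then physical ones, largest buffer first."""
    return sorted(buffers, key=lambda b: (not b.virtual, -b.size))


def preparation_for(miss, hit):
    """Concatenate, set by set, the miss sequences followed by the hit sequences.

    miss and hit map a set index to a list of addresses for one buffer.
    """
    sequence = []
    for index in sorted(set(miss) | set(hit)):
        sequence += miss.get(index, []) + hit.get(index, [])
    return sequence


order = preparation_order([
    Buffer("L1", virtual=False, size=512),
    Buffer("TLB", virtual=True, size=64),
    Buffer("L2", virtual=False, size=8192),
])
piece = preparation_for({3: [0x100, 0x200]}, {3: [0x300], 1: [0x20]})
```

Here the buffers are prepared in the order TLB, L2, L1, and within one buffer the evicting (miss) addresses of a set precede the loading (hit) addresses of the same set.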
Here comes a simplistic test case for the MIPS architecture:
The instructions here are as follows [9]: TLBWI writes a TLB entry; LUI loads a constant into the upper half of a word; ORI does a bitwise OR with a constant; LB loads a byte from memory; SB stores a byte to memory. Preparations may be of significant length, but the tool is able to reduce the volume of such code. It keeps track of the microprocessor state during test generation and skips useless initialization (e.g., it does not load data into a buffer if they are already there). Moreover, the generator can choose a data tag so as to fit the desired event, a hit or a miss. On the other hand, preparation sequences are of interest in their own right since, as our experience shows, they can stress the memory subsystem and discover "high-quality" bugs.

Industrial Application
The proposed approach is implemented in the MicroTESK test program generator [6,7]. Since 2006, different versions of the tool, including the one described in [5], have been applied to functional verification of several industrial microprocessors with the MIPS architecture [9]. MMU specifications take into account such buffers as a JTLB (a joint TLB), a DTLB (a micro TLB used to speed up data address translation), an L1 (a first-level cache) and an L2 (a second-level cache). In addition, they involve mapped and unmapped memory segments (XUSEG, KSEG0, KSEG1 and XKPHYS), TLB control bits (Valid, Dirty and Global) and cache policies (various combinations of Write-Through, Write-Allocate and Write-Back flags). Stimuli are composed of load and store instructions. The approach has revealed a great number of critical bugs (e.g., reading incorrect data from memory) in the MMU designs, which had not been detected by randomly generated test programs.