THE RELIABILITY MODEL OF A DISTRIBUTED DATA STORAGE IN CASE OF EXPLICIT AND LATENT DISK FAULTS

This work examines an approach to the estimation of data storage reliability that accounts for both explicit disk faults and latent bit errors, as well as the procedures to detect them. A new analytical mathematical model of the failure and recovery events in a distributed data storage is proposed to calculate reliability. The model describes the dynamics of data loss and recovery based on Markov chains corresponding to different schemes of redundant encoding. Advantages of the developed model as compared to classical models for traditional RAIDs are covered. The influence of latent HDD errors is considered, while bit faults occurring in other hardware components of the machine are omitted. Reliability is estimated according to new analytical formulas for the mean time to failure, at which data loss exceeds the recoverability threshold defined by the redundant encoding parameters. New analytical dependencies between the storage average lifetime until data loss and the mean time for complete verification of the storage data are given.


INTRODUCTION
Petabyte-size data storages are composed of a large number of hard disk drives. Data integrity in a distributed system largely depends on the reliability and performance characteristics of the HDDs used. Determining the storage reliability in general is an important applied problem that becomes more complex as the number of disks in the storage grows.
A hard disk is a complex technical device prone to failures that are caused by a set of various random factors and are of a stochastic nature. With a certain level of accuracy, the probability of HDD failure can be approximated by a statistical law, and the fault occurrence process can then be generalized to the whole population of disks in the storage.
A model of storage reliability describing a series of HDD faults and replacements until a certain condition becomes true corresponds to Huygens' gambler's ruin problem. This class of problems considers a game with a limited initial sum, a fixed stake, and a known mathematical expectation of winning each round. The game is played either until the initial capital is increased to a target amount or until it is completely lost. Two equivalent approaches to this problem are widely used: the Bernoulli process and the random walk. The random walk method considers the stochastic movement of the gambler among discrete states depending on the outcome of each round.
In the storage reliability model, assuming all disks are equally significant and independent, the number of discrete states corresponds to the number of operational HDDs. A transition between states is triggered by an explicit failure detected right after its occurrence; a fault of the HDD firmware is one example of such a failure.
Since backup methods are used in actual applications, data integrity is preserved when a certain portion of HDDs fail.The maximum threshold of failed HDDs is defined by the scheme and parameters of redundant encoding, and if exceeded, causes irrecoverable data loss.
Another technical aspect of HDDs is the occurrence of latent bit errors that are not detected at the moment they occur. Bit errors corrupt the stored data but do not affect the physical operation of the equipment in any way. Since the recovery procedure starts only when errors are detected, latent bit errors have a negative impact on storage reliability. This class of errors belongs to irrecoverable read errors and should be considered when designing data storages.
Calculation of checksums for the fragments of each data block, with subsequent verification of these checksums, is the main way to deal with such errors. Several approaches to data verification are possible.
According to the first method, checksums are verified only when the client requests the data. The recovery process is initiated if a checksum mismatch is detected. In this case, the mean intensity of data access is an important parameter affecting the storage reliability. In practice, the intensity of access to different types of data may vary significantly, so the risk of losing data for a large archive file that is rarely accessed may be rather high. Bearing this in mind, we can state that the first option does not provide a sufficient level of storage reliability for all files.
The second approach to checksum verification presumes that the storage has a centralized service that manages the continuous process of checksum verification for the data fragments. This technique is also known as scrubbing. In this method, the centralized service does not necessarily have to recalculate checksums itself. It is sufficient for this service to store the information required to check the data and to send data verification commands to the storage services. Thus, the load related to the data verification process is distributed across the entire storage. The intensity of such a process is usually characterized by the mean scrubbing interval, i.e., the expected time to check all data contained in the storage, which varies from several days to several weeks.

This work proposes a new mathematical model of storage reliability with an original analytical description of transitions between states. The model gives consideration to latent disk errors and the continuous scrubbing process. The presented model builds on the ideas used to describe traditional local RAID storage of individual HDDs, but considers advanced distributed data storage systems with redundant encoding, where individual fragments of data blocks are not bound to disks and are moved by the replication or load balancing processes. The developed model redefines the semantics of the classical RAID models: its states correspond to the block fragments rather than to the disks. Similarly, transitions between the states describe loss or recovery of a data fragment, rather than failure of an individual HDD.

LITERATURE REVIEW
Classical data storage reliability models based on Markov chains in continuous time provide an approximate idea of the storage MTTDL but do not account for bit errors on disks. These models are covered in several works (for example [1], [2], [3], [4]) dedicated to the reliability of RAIDs under explicit disk failures. In recent years, an increasing number of scientific works is dedicated to the development of extended Markov reliability models that include mathematical descriptions of latent disk errors and the processes used to detect them. Thus, for example, the drawbacks of classical models with unidimensional memoryless Markov chains are described in detail in the well-known work [5]. Counterarguments showing the applicability of Markov chains for reliability estimation are provided in [6], where the memory effect is reproduced within a more accurate, detailed description of the system by increasing the number of states and transitions between them. An original model for a RAID of SSDs with growth of error intensity due to wear is proposed in [7]. For example, [8, 9, 10, 11] and [12] propose consistent generalizations of the RAID reliability model considering latent sector read errors and the mechanisms of their detection (scrubbing). However, the underlying RAID-5 and RAID-6 data redundancy schemes impose severe limitations on the governing mathematical models. RAID-groups usually allow only the complete scanning of whole disks. This mode of operation is unsuitable for data storages consisting of high-capacity disks due to prolonged scrubbing times and wasteful free-space checking of half-empty disks [13]; [14]. Moreover, in [15] Iliadis et al. state that an excessive scrubbing rate decreases reliability. To overcome the mentioned problems, in [16] Liu et al. propose a frequency-cost function to keep an optimum trade-off between reliability and data cost by dynamically adjusting the scrubbing frequency. In [17] Venkatesan et al. summarize the effect of latent errors, stating that latent errors of high probability reduce the reliability of RAID-6 to that of RAID-5 without sector errors for all symmetric data placement schemes and all MDS erasure codes.
In published works, the two-dimensional Markov model, due to its complexity, is examined by numeric simulation methods that include statistical modeling. Despite the complexity of describing two-dimensional models, Markov models remain attractive to use, since they provide an estimation of the data storage reliability for an arbitrary scheme of redundant encoding with consideration given to various types of disk failures, different states of storage components, various policies for replacement of failed components and data recovery, as well as varying intensities of recovery for a different number of lost fragments. Unlike simulation modeling based on Monte Carlo statistical tests, this method allows obtaining an exact analytical expression for storage reliability within the given model. This analytical expression may be used to identify functional dependencies between the model parameters. Besides, Markov models preserve a significant advantage in terms of performance and calculation speed (up to 150 times, see [6]) as compared to full-scale simulation.

METHODOLOGY

a) Basic math model
Assume there is a data block that consists of n encoded fragments, of which k fragments correspond to the source data, while the remaining (n − k) fragments are checksums. Also, assume that the encoding scheme allows data to be recovered when any k fragments are available. A data block is defined by the following set of states S_i: S_0, all n fragments are available and no fragment is defective; S_1, one fragment is defective; S_2, two fragments are defective; …; S_{n−k}, (n − k) fragments are defective; S_{n−k+1}, more than (n − k) fragments are defective, which means it is impossible to recover the block, i.e., the block data are lost. For future use, it is convenient to call the S_{n−k+1} state DL ("data loss") to distinguish it from the remaining states. The DL state is an absorbing one, since there are no transitions from it back into other states of the system. According to the model, in finite time the system will go into the absorbing state with probability 1 regardless of its initial state. An important practical characteristic of such systems is the mean time of operation until transition into the absorbing state.
The system's transitions between states describe the loss of data fragments in a block due to HDD failures. As the first simplifying assumption, we take disk failures to follow a Poisson process, in which the probability of a disk failure during any given period of time does not depend on the disk age. A disk failure means complete loss of the data stored on the disk, so the lost fragments can be restored only from the fragments residing on operational disks. Assume the intensity of disk failures is λ, which means that the expected time to failure of a disk is 1/λ. For the majority of HDD models, the failure intensity is about 10^−9 per second. The probability of disk failure within the [0, t) time interval is defined by the disk failure intensity: Pr(T_f < t) = 1 − e^(−λt).
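As an illustration, the exponential failure law above can be evaluated numerically. The following sketch (in Python; the 200000-hour MTTF is the value assumed later in the text, the one-year interval is an arbitrary example) computes the probability that a single disk fails within a given interval:

```python
import math

def failure_probability(mttf_hours: float, t_hours: float) -> float:
    """Pr(T_f < t) = 1 - exp(-lambda * t) for a disk with constant
    failure intensity lambda = 1 / MTTF."""
    lam = 1.0 / mttf_hours
    return 1.0 - math.exp(-lam * t_hours)

# A disk with MTTF = 200000 hours over one year of service:
p_year = failure_probability(200_000, 365 * 24)
print(f"{p_year:.4f}")  # ~0.0429, i.e. about a 4.3% chance of failing in a year
```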
The second generally accepted assumption is the mutual independence of individual disk failures. In this case, for n working disks the total failure intensity is nλ, while the expected time to failure of at least one disk out of n is 1/(nλ). A disk failure defines the transition of the model from the S_i state with i non-functional disks into the S_{i+1} state with (i + 1) non-functional disks. Since in continuous-time models the simultaneous failure of any two disks has zero probability, it may be considered that there are no direct failure transitions between states that differ by more than one operational disk.
The next important parameter of the model is the mean time to data fragment recovery, 1/μ. Unlike the disk failure intensity, this parameter is defined not only by the physical properties of the disk, but also by the architecture of the particular distributed storage, and it may depend on many other factors. The mean time to recover a data fragment is made up of the time from the moment of disk failure to its detection and the time from failure detection to the moment when the recovered fragment is written to one of the disks of the system. Generally, the architecture of a distributed data storage provides a monitoring service that tracks the state of all blocks in the data storage and launches the data recovery procedure when failures are detected. Because of this, the time to failure detection, even for large volumes of data stored in the system, can be considered small, about several minutes. Below, the classical model is generalized to the scenario of latent disk failures and the procedures to detect them.
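The classical one-dimensional chain described above can be solved numerically for its mean time to absorption. The sketch below is a hypothetical illustration, not the authors' code: it assumes, as in the text, that recovery returns the block to the fully intact state, and the parameter values in the usage line are arbitrary examples.

```python
def classical_mttdl(n: int, k: int, lam: float, mu: float) -> float:
    """Mean time to data loss for a block of n fragments that tolerates
    n - k losses.  State i = number of lost fragments; losing more than
    n - k fragments absorbs into DL.  Failures: i -> i+1 at rate (n-i)*lam.
    Recovery: i -> 0 at rate mu (recovery restores the whole block)."""
    m = n - k + 1                       # transient states 0 .. n-k
    A = [[0.0] * m for _ in range(m)]   # matrix (I - P) over transient states
    b = [0.0] * m                       # mean holding times 1/R_i
    for i in range(m):
        rate_fail = (n - i) * lam
        rate_rec = mu if i > 0 else 0.0
        total = rate_fail + rate_rec
        b[i] = 1.0 / total
        A[i][i] = 1.0
        if i + 1 < m:                   # failure into another transient state
            A[i][i + 1] -= rate_fail / total
        if i > 0:                       # recovery back to the intact state
            A[i][0] -= rate_rec / total
    # Gaussian elimination with partial pivoting (small dense system)
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for cc in range(c, m):
                A[r][cc] -= f * A[c][cc]
            b[r] -= f * b[c]
    T = [0.0] * m
    for r in range(m - 1, -1, -1):
        T[r] = (b[r] - sum(A[r][cc] * T[cc] for cc in range(r + 1, m))) / A[r][r]
    return T[0]                         # expected time starting fully intact

# Example: n = 10, k = 9, lambda = 1/23 per year, repair in about one day:
print(classical_mttdl(10, 9, 1 / 23, 365.0))   # MTTDL in years
```

For n − k = 1 the result reproduces the familiar closed-form approximation MTTDL ≈ μ/(n(n − 1)λ²) when μ ≫ nλ.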

Frequency of irrecoverable read errors
Generally, the frequency of irrecoverable read errors is provided by the manufacturer in the disk specification and is about one error per 10^14 bits, or about one error per 11 TB of data read from the disk. To account for this parameter in the proposed model, we need to estimate how often such errors occur. To do this, we estimate the average amount of data read from a single disk in the data center during some characteristic time, a year for example.
The mean disk access intensity can be estimated based on the average disk load provided in the specification. Assume that the expected disk load is about 20%, i.e., during 20% of the time the disk is accessed, and during the remaining 80% it is idle. Then, with a steady-state sequential read speed of 140 MB/s, (140 · 0.2 · 3600 · 24 · 365)/1024 = 862312 GB, or about 842 TB, is read from the disk annually. Based on this, the frequency of irrecoverable bit errors for a single disk can be estimated, by order of magnitude, as about 77 errors per year. Assume the disk size is 2 TB and it is half full, while the typical size of a data fragment in the storage is 50 MB. Then, the mean frequency of read errors for a selected data block fragment is about 0.00367 errors per year, or about 1 error during 272 years.
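The back-of-the-envelope estimate above can be reproduced with a short script; all constants are the ones assumed in the text (140 MB/s sequential read, 20% load, one unrecoverable error per 11 TB read, a half-full 2 TB disk, 50 MB fragments):

```python
# Per-fragment bit-error frequency, following the figures used in the text.
SECONDS_PER_YEAR = 3600 * 24 * 365

read_per_year_gb = 140 * 0.2 * SECONDS_PER_YEAR / 1024   # MB/s -> GB per year
errors_per_disk_year = read_per_year_gb / 1024 / 11      # ~1 error per 11 TB read
fragments_per_disk = (1 * 1024 * 1024) // 50             # 1 TB of data in 50 MB pieces
errors_per_fragment_year = errors_per_disk_year / fragments_per_disk

print(round(read_per_year_gb))       # ~862312 GB read annually (~842 TB)
print(round(errors_per_disk_year))   # ~77 errors per disk per year
print(1 / errors_per_fragment_year)  # roughly one error per fragment every ~270 years
```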
The parameter ν (the frequency of irrecoverable read errors for a single data block fragment), which in this case is 1/272 years⁻¹, has been added to the model. The value of this parameter can be compared to the disk failure intensity λ, which for disks with MTTF = 200000 hours can be estimated as 1/23 years⁻¹. The two intensities thus differ by only about an order of magnitude, and this circumstance should be accounted for when evaluating the MTTDL of the data storage.
It is worth noting that this parameter was calculated for a scenario with a relatively high degree of disk usage. In general, the frequency of irrecoverable read errors depends significantly on the disk usage intensity. The average expected disk load defined in the specification is a value for approximate estimates; for a particular data storage, it is better to take the average disk usage measured for that storage. For example, an archive data storage may have a lower disk usage rate. In general, the frequency of irrecoverable read errors depends on the data access intensity and patterns of the actual data center. However, the value calculated above provides a lower-bound estimate of the storage MTTDL with respect to the presence of bit errors and the way they affect data storage reliability.

The mean scrubbing interval
The mean scrubbing interval, i.e., the mean time to perform a complete data integrity check in a data center, is defined by the mean data verification intensity, i.e., the average amount of data checked in a unit of time. Physically realistic values for the mean data integrity checking intensity depend on the storage load level and the guaranteed performance the storage is required to provide to its clients, as well as the economic costs of performing scrubbing.
Depending on the storage load level, user tasks and system processes may compete for the same resources. Examples of such resources in a distributed storage system are CPU time, disk IO operations, and network throughput. In this case, as a rule, the resources used by service processes are limited to provide customers with a guaranteed performance level (guaranteed response time, guaranteed number of IO operations per second).
In cases when the actual storage load level is far from its maximum values and system processes do not interfere with user queries, the data integrity checking intensity is completely defined by the balance between the data storage reliability requirements and the maximum allowed costs of the scrubbing process. Firstly, these costs are defined by the possible need to spin up a disk containing a fragment to be verified when that disk is in standby mode with low power consumption. Secondly, the costs depend on the need to use CPU time for the computationally intensive checksum calculation algorithm and, possibly, to wake one of the CPU cores up from a low power consumption mode.
As an approximate value for the mean scrubbing interval, a period specific to the data center, from one week to one month, can be selected. Also, the mean time of complete verification should be less than the mean time between requests to the same data by the storage user. The data integrity checking intensity, i.e., the reciprocal of the mean scrubbing interval, is denoted by δ.
In the classical model, the state of a distributed storage system is fully defined by the number of faulty HDDs. In the new model, the system state depends not only on the number of explicit faults, but also on the number of latent fragment faults. So, the extended model, unlike the classical one, is built on a two-dimensional rather than a one-dimensional space of Markov chain states.
Assume that the state (l, m) of the system is defined by explicit faults of l disks and by m latent fragment corruptions. Latent fragment corruptions are counted only on operational disks, since data on faulty disks are considered unavailable.
It should also be noted that the value of the sum (l + m) for a state in which the block data can still be restored is bounded above by the number of fragments that can be lost under the given error-correcting code without losing data. In other words, the inequality 0 ≤ l + m ≤ n − k holds for such a state. For states with l + m > n − k, the block data are irrecoverable; therefore, regardless of the ratio of latent and explicit faults, all these states may be joined into a single DL state ("data loss").
So, the total number of states in the system, apart from the DL state, is (n − k + 1)(n − k + 2)/2.

The first type of state transition describes disk failures. Two transition options are possible for the (l, m) state. Firstly, if a disk corresponding to one of the corrupted fragments fails, then the system goes into the (l + 1, m − 1) state with intensity mλ. Secondly, if a disk corresponding to one of the intact fragments fails, then the system goes into the (l + 1, m) state with intensity (n − l − m)λ.
The second transition type is the transition into states with latent corruptions caused by bit errors. Bit errors cause a transition from the (l, m) state into the (l, m + 1) state with intensity (n − l − m)ν. This transition does not account for latent bit errors on disks that have already failed, nor for new bit errors in already corrupted fragments, since these events do not change the state of the system.
Data-recovery transitions between the states take place due to the recovery process initiated after an explicit disk fault or after detection of a latent bit corruption. Assume that when recovering after explicit faults, checksums for fragments on operational disks are also verified, and conversely, when latent errors are detected, not only the corrupted fragments are rewritten, but also the fragments lost as a result of explicit disk failures are recovered. Also, assume that the data recovery process always moves the system into the (0, 0) state without any explicit or latent faults. The move into the (0, 0) state is based on the fact that the MTTDL value for sequential and parallel data recovery processes is the same when the intensity of the recovery processes is much higher than the intensity of disk failures. Depending on the state of the system, the intensity of the transition into the (0, 0) state is as follows: for states of the (l, 0), l > 0 type the intensity is μ; for states of the (0, m), m > 0 type the intensity is δ; and for states of the (l, m), l > 0, m > 0 type the intensity is (μ + δ).
In the new model, transitions into the DL state differ from transitions into other states, since the DL state combines multiple states. The system may move into the DL state from states of the (l, n − k − l) type, where 0 ≤ l ≤ n − k. Since such a state has exactly k intact fragments, the intensity of the transition into the DL state from any of these states is k(λ + ν).
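The transition structure just described can be assembled into a small numerical model. The sketch below is a hypothetical illustration (not the authors' code): it enumerates the (l, m) states, fills in the three transition types and the recovery intensities exactly as stated above, merges every state with l + m > n − k into DL, and computes the MTTDL as the mean time to absorption; the parameter values in the tests are arbitrary examples.

```python
def extended_mttdl(n, k, lam, nu, mu, delta):
    """MTTDL of one block in the two-dimensional model: state (l, m) =
    l explicit disk faults and m latent fragment corruptions, l + m <= n - k;
    every state with l + m > n - k is merged into the absorbing DL state."""
    r = n - k
    states = [(l, m) for l in range(r + 1) for m in range(r + 1 - l)]
    idx = {s: i for i, s in enumerate(states)}
    N = len(states)
    rates = [dict() for _ in range(N)]    # per state: target index -> rate, -1 = DL

    def add(i, target, rate):
        if rate <= 0:
            return
        j = idx.get(target, -1)           # a target with l + m > n - k is DL
        rates[i][j] = rates[i].get(j, 0.0) + rate

    for i, (l, m) in enumerate(states):
        intact = n - l - m
        add(i, (l + 1, m), intact * lam)      # an intact fragment's disk fails
        add(i, (l + 1, m - 1), m * lam)       # a corrupted fragment's disk fails
        add(i, (l, m + 1), intact * nu)       # a new latent bit error appears
        rec = (mu if l > 0 else 0.0) + (delta if m > 0 else 0.0)
        add(i, (0, 0), rec)                   # detection + full recovery

    # Solve T_i = 1/R_i + sum_j (rate_ij / R_i) * T_j with T_DL = 0.
    A = [[0.0] * N for _ in range(N)]
    b = [0.0] * N
    for i in range(N):
        R = sum(rates[i].values())
        b[i] = 1.0 / R
        A[i][i] = 1.0
        for j, rate in rates[i].items():
            if j >= 0:
                A[i][j] -= rate / R
    for c in range(N):                        # Gaussian elimination
        p = max(range(c, N), key=lambda rr: abs(A[rr][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for rr in range(c + 1, N):
            f = A[rr][c] / A[c][c]
            for cc in range(c, N):
                A[rr][cc] -= f * A[c][cc]
            b[rr] -= f * b[c]
    T = [0.0] * N
    for rr in range(N - 1, -1, -1):
        T[rr] = (b[rr] - sum(A[rr][cc] * T[cc] for cc in range(rr + 1, N))) / A[rr][rr]
    return T[idx[(0, 0)]]
```

Setting ν = 0 reduces the chain to the classical one-dimensional model, which is a useful consistency check; increasing δ (more frequent scrubbing) increases the resulting MTTDL, in line with the behavior discussed later in the text.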

MTTDL calculation accounting for irrecoverable bit errors and the scrubbing process
Prior to describing the model with arbitrary n and k parameters, for illustration purposes it is helpful to consider the practically important special cases of MTTDL calculation with n − k = 1 and n − k = 2. The applied efficiency of these parameter sets for Locally Repairable Codes (LRC) is confirmed in [18] and [12]. The schemes of the system states and transitions between them for these cases are shown in Figure-1 and Figure-2. In general, the calculation algorithm is similar to calculations for the classical model. Assume that P_{l,m} is the probability of the system moving from the initial (l, m) state into the DL state before reaching the (0, 0) state without any latent or explicit failures. The P_{l,m} definition implies that P_{0,0} = 0 and P_{DL} = 1.
For n − k = 1, the value of P_{1,0} is expressed through the P_{0,0} and P_{DL} probabilities of the states into which the system can move from the (1, 0) state:

P_{1,0} = [k(λ + ν) · P_{DL} + μ · P_{0,0}] / [k(λ + ν) + μ].

From this point on, when performing calculations we assume that the values of the λ and ν parameters are negligible as compared to the values of the μ and δ parameters. Given that P_{0,0} = 0 and P_{DL} = 1, and omitting negligible elements, we obtain:

P_{1,0} ≈ k(λ + ν)/μ.

The value of P_{0,1} is expressed similarly through the probabilities of the states reachable from (0, 1):

P_{0,1} = [k(λ + ν) · P_{DL} + λ · P_{1,0} + δ · P_{0,0}] / [k(λ + ν) + λ + δ] ≈ k(λ + ν)/δ.

In this model, two alternatives of the cycle start are possible; the cycle can start either from a disk failure, i.e., from the (1, 0) state, or from an irrecoverable read error, i.e., from the (0, 1) state.
Since each cycle ends in data loss with probability equal to the corresponding P value, the expected number of cycles to data loss in the case of starting from the (1, 0) state is 1/P_{1,0}, and in the case of starting from the (0, 1) state it is 1/P_{0,1}. The mean cycle time is calculated as follows. Assume that T_{l,m} is the expected time that passes before the system gets from the (l, m) state into the (0, 0) or DL state. By definition, T_{0,0} = 0 and T_{DL} = 0.
For the mean cycle times T_{c,(1,0)} and T_{c,(0,1)} in the case of starting from the (1, 0) and (0, 1) states, the following formulas hold:

T_{c,(1,0)} = 1/(nλ) + T_{1,0},  T_{c,(0,1)} = 1/(nν) + T_{0,1}.

Given that the 1/μ and 1/δ values are negligible as compared to 1/(n(λ + ν)), we obtain T_{c,(1,0)} ≈ 1/(nλ) and T_{c,(0,1)} ≈ 1/(nν). The above formulas suggest that the storage MTTDL estimate for the new model equals the least of the expected times to data loss for cycles starting from the (1, 0) or (0, 1) state:

MTTDL ≈ min( T_{c,(1,0)}/P_{1,0}, T_{c,(0,1)}/P_{0,1} ),

which for n − k = 1 gives MTTDL ≈ min( μ/(nkλ(λ + ν)), δ/(nkν(λ + ν)) ). In the n − k = 2 case, the value of P_{1,0} is negligible in comparison with P_{1,1} and P_{0,2}, so the corresponding expressions simplify accordingly, and the expected number of cycles till data loss is again calculated for the initial (1, 0) and (0, 1) states. The T_{c,(1,0)} and T_{c,(0,1)} mean cycle times for the model with arbitrary (n, k) parameters are calculated as follows.
Omitting negligible elements, we obtain a formula which shows that the T_{l,m} value depends only on the diagonal l + m = s containing the (l, m) state, and does not depend on the position of the state within that diagonal. So, the mean cycle times keep the form T_{c,(1,0)} ≈ 1/(nλ) and T_{c,(0,1)} ≈ 1/(nν). Calculation of P_{l,m} in the arbitrary case is somewhat more complex.
Bearing in mind that P_{l+1,m−1} is negligible as compared to P_{l+1,m}, that the values of P_{l,m+1} and P_{l+1,m} lie on the same l + m = s + 1 diagonal and are comparable, and that the values of P_{l,m} on inner diagonals are negligible as compared to the values on the outer diagonals external to them, the following construction holds. Let a "simple" path from the (l_1, m_1) state into the (l_2, m_2) state be a sequence of transitions between states of the system that contains neither data recovery events nor transitions between states located on the same diagonal. Paths containing transitions between states on the same diagonal can be omitted, since their probability is small as compared to that of "simple" paths.
It should be noted that the P_{l,m} value in the (l, m) state is defined as the sum of terms P_path calculated along all "simple" paths from the (l, m) state to the DL state. The number of such "simple" paths for the (l, m) state is 2^(n−k−(l+m)), since the length of every "simple" path from the (l, m) state into the DL state is fixed and equal to n − k − (l + m), and in each intermediate state there are two options for the further direction of the path: up or down.
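The path count 2^(n−k−(l+m)) can be verified directly by enumeration; the short sketch below models a "simple" path as a sequence of steps that each increase either l (a disk fault) or m (a bit error), with the final forced step into DL from the boundary diagonal:

```python
def count_simple_paths(n, k, l, m):
    """Count 'simple' paths from state (l, m) to DL: each step increases
    either l or m; once l + m reaches n - k, only the forced transition
    into DL remains, so branching happens on the n - k - (l + m)
    intermediate steps only."""
    if l + m >= n - k:
        return 1                      # only the forced step into DL remains
    return count_simple_paths(n, k, l + 1, m) + count_simple_paths(n, k, l, m + 1)

# The recursion reproduces the closed form 2**(n - k - (l + m)):
n, k = 12, 8
assert all(count_simple_paths(n, k, l, m) == 2 ** (n - k - (l + m))
           for l in range(n - k + 1) for m in range(n - k + 1 - l))
```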
The recurrent relations for P_{l,m} imply that the P_path value corresponding to some fixed "simple" path is defined as the product of the transition intensities between the states along this path, divided by the product of the data recovery intensities in each of the path states except the DL state.

RESULTS AND COMPARATIVE ANALYSIS OF THE CLASSICAL AND EXTENDED MODELS
A comparison of the calculation results based on the proposed model with those obtained using the classical model is shown in Figure-4. In the proposed model, the intensity of bit errors was defined by the mean intensity of disk access in the storage. As one can see in the chart, the results obtained in the classical model, which accounts only for explicit disk failures, differ from the refined results obtained within the proposed model. The observed difference is significant and confirms the practical relevance of the proposed model. The obtained results are in line with the conclusions of [11] on the need to account for latent errors and perform data checksum verification. The dependency of the storage MTTDL on the (n, k) parameters agrees quantitatively with the results of [19], namely with the behavior of the conditional probability of data loss against the number of parity fragments of the erasure code.
The plots show that in this model the storage MTTDL largely depends on the mean disk usage intensity: as the disk usage intensity decreases, the bit error rate also decreases, and the storage MTTDL approaches the MTTDL values computed using the simpler model that does not account for irrecoverable bit errors. The graph of the dependence of the data storage MTTDL on the mean scrubbing interval shows that in the initial section MTTDL is constant and virtually does not depend on the scrubbing intensity. Then it starts to drop abruptly as the mean scrubbing interval increases (see Figure-5). This happens because the formula for the storage MTTDL contains two expressions, of which the minimum one is selected; at the point where MTTDL starts to decrease, the expression with the minimum value changes. The graph shows that the proposed method allows us to find the balance between data storage reliability and the financial costs imposed by the continuous scrubbing process.

DISCUSSION
The proposed model offers a more flexible and reliable scheme than those describing the RAID-5 and RAID-6 data redundancy schemes. The scrubbing of RAID-groups involves the complete scanning of whole disks, leading to prolonged timeframes of reduced performance and reliability.
The study [15] considers the similar problem of improving the reliability of RAID-group storages by means of disk scrubbing. Despite the differences in RAID and error coding schemes, the results of the corresponding models are in agreement with each other. The plot (Figure-6) demonstrates the MTTDL dependence on the scrubbing period for an event-driven simulated 10 PB system of RAID-6 groups with scrubbing IO load at ten per cent. The set of underlying parameters represents SATA drives with


Figure-1. Intensity of transitions in the Markov chain for the case n − k = 1.

Figure-2. Intensity of transitions in the Markov chain for the case n − k = 2.

Values of P_{l,m} are calculated beginning from the states on the l + m = n − k diagonal.
Mean cycle times for the initial (1, 0) and (0, 1) states are then estimated. Values of T_{l,m} are calculated in the same sequence as the corresponding P_{l,m} values. For the mean cycle times T_{c,(1,0)} and T_{c,(0,1)} for the (1, 0) and (0, 1) initial states, the formulas take the same form as in the n − k = 1 case, and the storage MTTDL for this model is again the minimum of the two expected times to data loss.

Generalization of the extended model to arbitrary parameters (n, k)

The formulas obtained for the n − k = 1 and n − k = 2 special cases can be generalized to arbitrary parameters (n, k). The scheme for the calculation of the P_{l,m} and T_{l,m} values shown in Figure-3 assumes sequential, diagonal-by-diagonal calculation of the values, beginning from the diagonal corresponding to the (l, m) states for which l + m = n − k and moving from top to bottom along the diagonals.

Figure-4. Comparison of results obtained in the model that accounts for irrecoverable bit errors and the scrubbing process with those from the simpler model.

Figure-5. Dependence of the data storage MTTDL on the mean scrubbing interval.