Hybrid importance sampling Monte Carlo approach for yield estimation in circuit design

The dimensions of transistors shrink with each new technology developed in the semiconductor industry. This extreme scaling introduces significant statistical variations in their process parameters. A large digital integrated circuit consists of a very large number (millions or billions) of transistors, so the number of statistical parameters can become very large when mismatch variations are modeled. The parametric variations often cause circuit performance degradation, which can lead to a circuit failure that directly affects the yield of the producing company and its reputation for reliable products. Consequently, the failure probability of a circuit must be estimated sufficiently accurately. In this paper, we consider the Importance Sampling Monte Carlo (ISMC) method as a reference probability estimator for estimating tail probabilities. We propose a Hybrid ISMC approach for dealing with circuits having a large number of input parameters and for providing a fast estimation of the probability. In the hybrid approach, we replace the expensive-to-evaluate circuit model by a cheap surrogate for most of the simulations. The expensive circuit model is used only for generating the training sets (to fit the surrogates) and near the failure threshold, in order to reduce the bias introduced by the replacement.


Introduction
Due to the continuous increase in the number of individual components on an Integrated Circuit (IC), the probability of a malfunctioning IC increases dramatically, see [1,2]. This can be illustrated by the example in [2], where an IC with S "identical" components (each having a failure probability p_fail) has a rather large break-down probability P_fail = 1 - (1 - p_fail)^S, even if p_fail is considerably small (i.e., failure being a rare event). For example, consider a 256 Mbit SRAM circuit, having 256 million "identical" bit cells. To guarantee a failure probability of 1% for this circuit (i.e., P_fail = 0.01 with S = 256 × 10^6), it is required that p_fail < 3.9 × 10^-11, a rare event indeed. Notice that the yield Y of an IC is closely related to the failure probability P_fail and can be expressed as Y = 1 - P_fail = (1 - p_fail)^S. Thus, the yield Y of an IC is estimated via the failure probability p_fail of its components.
We consider Monte Carlo (MC) techniques [3] for estimating the failure probability p_fail. The standard Monte Carlo method produces an estimator p̂_fail = k/n for the true probability p_fail by running the simulator n times with independent random inputs and counting the k occurrences of the 'fail' event. Notice that n p̂_fail ~ Bin(n, p_fail) follows a binomial law with success probability p_fail in n trials. The useful properties of the estimator p̂_fail are its unbiasedness, i.e., E(p̂_fail) = p_fail, and its independence of the dimension d of the random vector X. However, the variance of the estimator p̂_fail is Var(p̂_fail) = p_fail(1 - p_fail)/n, which can be relatively large for small p_fail and a limited number n of MC runs. Using the normal approximation of the binomial distribution, the 95% relative confidence interval for (small) p̂_fail is estimated to be ±1.96/√(n p̂_fail). So, to determine a p̂_fail in the range of 10^-11 with an accuracy of ±10% at a 95% confidence level, one needs about 4 × 10^13 MC runs, which is intractable in industry even with the fastest computer simulations.
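As a rough illustration of these formulas, the following minimal sketch (our own toy example, with a hypothetical one-dimensional response H(x) = x) computes the standard MC estimator, the normal-approximation relative confidence half-width, and the number of runs needed for ±10% accuracy at p_fail = 10^-11:

```python
import math
import random

def mc_fail_prob(h, gamma, n, seed=0):
    """Standard MC: p_hat = k/n, plus the 95% relative CI half-width 1.96/sqrt(n*p_hat)."""
    rng = random.Random(seed)
    k = sum(1 for _ in range(n) if h(rng.gauss(0.0, 1.0)) >= gamma)
    p_hat = k / n
    rel_hw = 1.96 / math.sqrt(n * p_hat) if p_hat > 0 else float("inf")
    return p_hat, rel_hw

# Toy response H(x) = x with gamma = 2: true p_fail is about 2.3e-2, still cheap to hit.
p_hat, rel_hw = mc_fail_prob(lambda x: x, gamma=2.0, n=100_000)

# Runs needed for +/-10% relative accuracy at p_fail = 1e-11:
runs_needed = (1.96 / 0.10) ** 2 / 1e-11   # on the order of 4e13, matching the text
```

For genuinely rare events the same code would need the intractable sample sizes quoted above, which motivates the importance sampling approach of the next section.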
To overcome this drawback of the standard MC method, a variance-reducing Importance Sampling Monte Carlo (ISMC) technique was proposed in [2], where it was shown that a reduction of several orders of magnitude can be achieved: from 4 × 10^13 to at most a few thousand runs. However, the failure probability is estimated at some fixed environmental parameters (such as temperature, supply voltage, and process corners), and these parameters add multiple levels of complexity. For instance, the failure probability must be computed for a complete range of working temperatures, and the complexity grows exponentially when the other dimensions are combined. For complex systems one can usually afford only a very limited number (say, hundreds) of simulations, and therefore the ISMC technique remains unattractive. In [1], a model based ISMC approach has been proposed for estimating rare circuit events. In the model based approach the circuit model is replaced by a surrogate which is much faster to evaluate; the circuit model is only used for drawing the training samples from which the surrogate is built. Usually, the number of training samples is much smaller than the total number of MC simulations needed to estimate the probability, so the overall computational cost is reduced. Nevertheless, it is often difficult or even impossible to quantify the error made by such a substitution. Another model based approach, proposed in [4], introduces statistical blockade: one initially draws a large number of samples from a surrogate model, finds the samples belonging to the tail region, and replaces their responses by the true ones. The authors use a linear classifier, arguing that such a classifier suffices for SRAM bit cells. However, our goal is to address large circuits (such as analog IPs (Intellectual Properties)).
Our experiments show that a linear model does not work for such large circuits, and it is difficult to fit a (nonlinear) surrogate model that is accurate enough in the tail to classify the samples that really belong to the tail region. The other limitation of both model based approaches above is that they do not address the dimensionality of the problem.
In this paper, we propose a Hybrid Importance Sampling Monte Carlo (HISMC) approach for estimating small failure probabilities. This approach is a modification of the model based approach proposed in [1] and can be used for large dimensional circuit problems. The idea is to use the expensive circuit model only close to the failure threshold (for a small portion of the overall samples used to estimate the probability), while the surrogate is used for the remaining samples that are reasonably far from the failure threshold. Using this small number of circuit-model samples prevents loss of accuracy. The Kriging model [5,6] is used as a surrogate of the circuit model because it rests on a solid mathematical foundation with several useful properties, including interpolation of the training data and a closed formula approximating the prediction error, known as the Kriging variance. The latter is useful for improving the Kriging model near the failure threshold as well as for selecting the samples near the failure threshold for which the circuit model is to be used. Our experience with circuits shows that the Kriging model works well for up to 35 input variables. This paper is organised as follows. We start with the reference method, the mean-shift ISMC approach, in Sect. 2. Then we introduce a surrogate modelling technique in Sect. 3 that combines a feature selection method and the Kriging model. Using this surrogate technique we present our HISMC approach in Sect. 4. Finally, the results are shown in Sect. 5 and conclusions are drawn in Sect. 6.

The importance sampling Monte Carlo method

General framework
Let x ∈ R^d be a vector of d process parameters, which is a realization of the random vector (r.v.) X with probability density function (pdf) g(x), and let H(x) be the corresponding response of the circuit under examination. The failure probability is given by

p_fail = P(H(X) ≥ γ) = E_g[1_{H(X) ≥ γ}],   (2.1)

where the subscript g means that the expectation is taken with respect to the pdf g(x), γ is a given failure threshold and 1_{H(x) ≥ γ} is an indicator function that takes the value 1 if H(x) ≥ γ and 0 otherwise. We assume that the (failure) region of interest lies in the upper tail of the output distribution. This is without loss of generality, because any lower tail can be converted to an upper tail by replacing H(X) with -H(X): the probability P[H(X) ≤ γ] equals p⁻_fail(γ) = P[-H(X) ≥ -γ] for a given failure threshold γ in the lower tail of the distribution. Hence, it is sufficient to estimate the probability for the upper tail, and hereafter we simply write p_fail instead of p⁺_fail(γ). Assume that we have another density f such that 1_{H(x) ≥ γ} g(x) > 0 ⇒ f(x) > 0; we then say g is absolutely continuous with respect to f. We can rewrite (2.1) as

p_fail = E_f[1_{H(Y) ≥ γ} L(Y)],   (2.2)

where Y is a r.v. with the new pdf f(x) and L(x) = g(x)/f(x) if f(x) > 0 and L(x) = 0 otherwise is the likelihood ratio between the two densities. The ISMC estimator is then given by

p̂^IS_fail = (1/N) Σ_{i=1}^N 1_{H(Y_i) ≥ γ} L(Y_i),   (2.3)

where the Y_i are N independent and identically distributed (iid) random samples drawn from f(x).
The estimator p̂^IS_fail is unbiased [7] with variance

Var(p̂^IS_fail) = σ²_IS / N,   (2.4)

where

σ²_IS = E_f[(1_{H(Y) ≥ γ} L(Y))²] - p_fail².   (2.5)

From a practical point of view, one has to find the 'best' pdf f(x) in order to maximize the accuracy of the estimator p̂^IS_fail. One way to find such an f(x) is by minimizing the variance σ²_IS. The work in this paper is an additional contribution to the developments at Mentor Graphics, where a mean-shift ISMC technique (see Sect. 2.2) is being used, assuming that the original input distributions can be transformed into a Gaussian distribution. In this context, the importance density is found by shifting the mean of the original density to the area of interest. We use the same technique as a reference approach; the study of other ISMC techniques is out of the scope of this paper.

The mean-shift approach
In this paper, we consider the particular case where the original pdf g(x) is Gaussian with mean 0 and covariance I, i.e., g(x) ~ N(0, I). We define the importance density f(x) = g_θ(x) with g_θ(x) ~ N(θ, I), parameterized by its mean θ ∈ R^d (see [2]); in other words, g_θ(x) = g(x - θ). The likelihood ratio L(x) then becomes

L(x) = g(x)/g_θ(x) = exp(θᵀθ/2 - θᵀx),   (2.6)

and the relation between the random vectors X and Y is

Y = X + θ.   (2.7)

Using (2.6) and (2.7), the ISMC probability estimator (2.3) can be written as

p̂^IS_fail = (1/N) Σ_{i=1}^N 1_{H(X_i + θ) ≥ γ} exp(-θᵀX_i - θᵀθ/2),   (2.8)

where the X_i are N iid random vectors with density g(x). Furthermore, the second moment (the first term on the right hand side of (2.5)) can also be written in a simplified way:

v(θ) = E_g[1_{H(X) ≥ γ} exp(θᵀθ/2 - θᵀX)].   (2.9)

Recall that to maximize the accuracy of the probability estimator p̂^IS_fail one has to find the pdf g_θ(x), or equivalently the mean-shift θ, such that the variance σ²_IS is minimal. Since p_fail does not depend on θ, the variance σ²_IS is minimal when the second moment v(θ) is minimal. By minimizing v(θ) we obtain a probability estimator requiring far fewer simulations [2, Sect. 3.1] than the standard MC simulation. Under some assumptions it has been proved in [7] that the function v(θ) has a unique minimizer θ* such that ∇v(θ*) = 0. The optimal θ* can be approximated with the Newton algorithm by solving the optimization problem

θ* = argmin_θ v̂_m(θ),   (2.10)

where v̂_m(θ) is the MC approximation of the second moment v(θ). For details we refer to [2]. Following [2], there must be at least one X_j such that 1_{H(X_j) ≥ γ} = 1 to solve the optimization problem (2.10). However, this condition may fail in a rare event context. To overcome this problem, a multilevel approach is suggested in [8] for solving such problems in the context of cross-entropy approaches.
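The mean-shift estimator can be sketched in one dimension. The following toy example is ours (hypothetical response H(y) = y, so the exact tail probability of a standard normal is known), not from the paper:

```python
import math
import random

def mean_shift_is(gamma, theta, n, seed=1):
    """Mean-shift IS estimate of P[H(Y) >= gamma] for g ~ N(0,1), toy H(y) = y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)      # X ~ g
        y = x + theta                # Y = X + theta
        if y >= gamma:
            # likelihood ratio L(y) = g(y)/g_theta(y) = exp(theta^2/2 - theta*y)
            total += math.exp(0.5 * theta * theta - theta * y)
    return total / n

# Shift the mean onto the threshold: theta = gamma = 4.
p_is = mean_shift_is(gamma=4.0, theta=4.0, n=50_000)   # true value is about 3.17e-5
```

With the mean shifted onto the threshold, roughly half of the samples hit the failure region, and a few tens of thousands of cheap draws recover a probability that plain MC would need billions of runs to see.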

Multi-level approach for rare event simulations
In the multi-level approach, we solve the optimization problem (2.10) iteratively. Starting from the mean θ^(0) = 0 of the given density g(x), we construct a sequence of mean-shifts {θ^(k), k ≥ 1} and a sequence of levels {γ_k, k ≥ 1}, and iterate in both γ_k and θ^(k) until convergence; see steps 1 to 6 of Algorithm 2.1 below. Following [8], each iteration k of the multi-level approach consists of two phases: in the first phase we fix θ^(k-1) and obtain the level γ_k, and in the second phase we compute θ^(k) using θ^(k-1) and γ_k. The computation of γ_k and θ^(k) at iteration k is as follows: 1. Computation of γ_k: For fixed θ^(k-1), we let γ_k be the (1 - ρ)-quantile of H(X^(k-1)), i.e.,

P[H(X^(k-1)) ≥ γ_k] = ρ,   (2.11)

where X^(k-1) ~ g_{θ^(k-1)}(x) and ρ is a probability chosen such that ρ ≫ p_fail, the probability to be estimated. An estimator γ̂_k of γ_k is obtained by drawing m random samples X^(k-1)_i, calculating the responses H(X^(k-1)_i) for all i, ordering them from smallest to largest, H_(1) ≤ ... ≤ H_(m), and finally evaluating the (1 - ρ) sample quantile as

γ̂_k = H_(⌈(1-ρ)m⌉),   (2.12)

where ⌈x⌉ is the smallest integer greater than or equal to x. Note that the estimate γ̂_k of γ_k depends on two parameters: the probability ρ and the number of samples m. Our empirical results show that if we fix m = 1000, then ρ = 0.20 is a good choice for an accurate estimate of γ_k. One may choose a smaller ρ, but that may require a larger m for estimating γ_k accurately. For more details we refer to [8,9].
2. Computation of θ^(k): Let g_{θ^(k-1)}(x) be the density known at iteration k and g_{θ^(k)}(x) the new density we want to obtain. The likelihood ratio of the densities g_{θ^(k-1)}(x) and g_{θ^(k)}(x) at iteration k is

L^(k)(x) = g_{θ^(k-1)}(x) / g_{θ^(k)}(x).   (2.13)

The second moment (2.9) at iteration k becomes

v^(k)(θ) = E_g[1_{H(X) ≥ γ_k} exp(θᵀθ/2 - θᵀX)],

i.e., (2.9) with γ replaced by the level γ_k, estimated with samples drawn from g_{θ^(k-1)}(x) via the likelihood ratio (2.13). Using this information, the optimal mean-shift θ^(k) can be approximated with the Newton algorithm by solving the optimization problem

θ^(k) = argmin_θ v̂^(k)_m(θ),   (2.14)

where v̂^(k)_m(θ) is the MC approximation of the second moment v^(k)(θ). Below we present the multi-level ISMC algorithm.
Algorithm 2.1 (multi-level ISMC) 1. Set k = 1 and θ^(0) = 0. 2. Draw m samples X^(k-1)_i ~ g_{θ^(k-1)}(x) and compute the responses H(X^(k-1)_i). 3. Compute the level γ̂_k = min{γ, H_(⌈(1-ρ)m⌉)} via (2.12). 4. Form the MC approximation v̂^(k)_m(θ) using the m samples and the likelihood ratio (2.13). 5. Compute θ^(k) = argmin_θ v̂^(k)_m(θ). 6. If γ̂_k < γ, go to step 2 with k ← k + 1; otherwise, go to step 7. 7. Compute (2.8) with θ* = θ^(k). 8. End of the algorithm.

Figure 1 illustrates the multi-level ISMC approach (Algorithm 2.1) for x ∈ R^2. Here, we assume that the optimal mean-shift θ* is estimated in four iterations. The sample distribution per mean-shift θ^(k) is shown by the ellipses. Moreover, the blue curves (three dotted ones are intermediate and the solid one is the target) represent the contours at levels γ_k, k = 1, ..., 4. Notice that γ_4 = γ. We start by sampling from the original pdf g_{θ^(0)}(x) = g(x); see steps 1 and 2 of the algorithm for iteration k = 1. Then we find the level γ_1 (step 3). Afterwards, we estimate θ^(1) using the variance criterion in steps 4 and 5. We then proceed to the second iteration and start by sampling from the new distribution g_{θ^(1)}(x). We repeat the same procedure until convergence to the optimal mean-shift θ*, see step 6; here γ_4 = γ. Finally, we sample from the optimal pdf g_{θ*}(x) (its distribution is shown by the green ellipse centered at θ*) and compute p̂^IS_fail, see step 7.
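To make the iteration concrete, here is a toy one-dimensional sketch of the multi-level loop, with a hypothetical response H(y) = y and γ = 5. For brevity, the Newton minimization of v̂^(k)_m(θ) in steps 4 and 5 is replaced by a cross-entropy-style weighted-mean update, which is a common stand-in for the variance criterion, not the paper's exact step:

```python
import math
import random

def multilevel_mean_shift(gamma, rho=0.2, m=1000, seed=3, max_iter=20):
    """Toy 1-D multi-level loop: H(y) = y, g ~ N(0,1), importance density N(theta,1)."""
    rng = random.Random(seed)
    theta, gamma_k = 0.0, -float("inf")
    for _ in range(max_iter):
        ys = [rng.gauss(theta, 1.0) for _ in range(m)]
        ordered = sorted(ys)
        gamma_k = min(gamma, ordered[math.ceil((1.0 - rho) * m) - 1])  # level (2.12)
        # CE-style update: likelihood-weighted mean of the samples above gamma_k
        # (a stand-in for minimizing the second moment with Newton's method).
        num = den = 0.0
        for y in ys:
            if y >= gamma_k:
                w = math.exp(0.5 * theta * theta - theta * y)  # L(y) = g(y)/g_theta(y)
                num += w * y
                den += w
        theta = num / den
        if gamma_k >= gamma:
            break
    return theta, gamma_k

theta_star, gamma_last = multilevel_mean_shift(gamma=5.0)
```

Each iteration moves the sampling mean toward the failure region by roughly the (1 - ρ) quantile step, so the target level γ is reached in a handful of iterations even though failures are invisible under the original density.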

The surrogate modeling technique
As stated in Sect. 1, we deal with circuits having a large number of statistical input variables. Usually, only a few of them are important, i.e., have a statistical effect on the circuit response; the remaining variables have a negligible or no statistical effect. It should be noted that almost all surrogate models (including Kriging) lose accuracy with increasing dimension, and it is not possible to build an accurate surrogate model with the complete set of variables. Therefore, dimension reduction is required before fitting a surrogate model. Dimension reduction is the process of reducing the number of random variables under consideration and is often divided into two categories: feature extraction and feature selection [10, Chap. 6]. In feature extraction, one projects the original input space onto a reduced space whose dimension is much smaller. There exists a variety of projection methods; among others, Partial Least Squares Regression (PLSR) [11,12] belongs to this category. Feature selection, on the other hand, is the process of selecting a subset of important variables, while the other variables are set to their nominal values and removed from the set of input variables used for building the surrogate model. Our experience with feature extraction, especially with PLSR, shows that it requires a large number of training samples (growing with the dimension of the problem) to perform accurately, and therefore it cannot be used when only a limited number of simulations is allowed. Hence, we prefer a feature selection method that performs accurately even with a limited number of simulations.

Feature selection
In circuit simulations, feature selection is the process of selecting a subset of important variables, i.e., those having a statistical effect on the response of the circuit under study, while the other variables, having no or negligible effect, are set to their nominal values without incurring much loss of information. Several approaches for feature selection exist, see for example [13]. Here we employ the well known Least Absolute Shrinkage and Selection Operator (LASSO) [14], which is stable and exhibits good feature-selection properties.

LASSO
In a linear model Y ≈ b_0 + Σ_j b_j x_j, the magnitude of the coefficient b_j gives a certain amount of information about the relative importance of x_j. In the context of circuit simulation the dimension of the vector x of input variables is very large, but only a few of the x_j are important. This means we wish to determine a small subset of variables that gives the main effects and neglect the small effects. LASSO is a well known method that performs this task. Given a training set (X_i, Y_i), i = 1, ..., N_tr, the LASSO estimate is

b̂ = argmin_b { Σ_{i=1}^{N_tr} (Y_i - b_0 - Σ_{j=1}^d b_j X_ij)² + λ Σ_{j=1}^d |b_j| }.

Depending on the value of the tuning parameter λ, LASSO sets many coefficients exactly to zero. The variables x_j corresponding to the nonzero coefficients b̂_j are selected as important variables. The computation of the LASSO solutions is a quadratic programming problem and can be tackled by standard numerical algorithms. However, the least angle regression (LARS) [15] procedure is a better approach in terms of computational efficiency. The LARS algorithm exploits the special structure of the LASSO problem and provides an efficient way to compute the solutions simultaneously for all possible values of λ. For details we refer to [15].

Remark 3.1 To avoid the use of additional notation, we introduce the Kriging model for an input vector x having full dimension d. However, in our algorithms, x will be used in its reduced form. See Algorithm 4.1.
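To illustrate the selection behaviour, here is a small coordinate-descent LASSO sketch with soft-thresholding updates. The solver, the variable names and the toy data are ours (the paper uses LARS); the data are centered, so no intercept is fitted:

```python
import random

def soft_threshold(a, lam):
    """Soft-thresholding operator S(a, lam)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 (data assumed centered)."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # correlation of feature j with the partial residual (feature j excluded)
            rho = sum(X[i][j] * (y[i] - sum(b[k] * X[i][k] for k in range(d))
                                 + b[j] * X[i][j]) for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z
    return b

# Toy data: only x0 and x2 matter among 6 inputs.
rng = random.Random(4)
n, d = 80, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[2] + 0.1 * rng.gauss(0.0, 1.0) for row in X]
b = lasso_cd(X, y, lam=0.2)
selected = [j for j in range(d) if abs(b[j]) > 1e-6]
```

The four irrelevant coefficients are driven exactly to zero while the two true effects survive (slightly shrunk by λ), which is precisely the sparsity property exploited for feature selection.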

The Kriging model
The notion of a Kriging model, also known as Gaussian process regression in the literature [6], was initially developed in a geostatistical framework [17]. The Kriging model also plays a central role in the design and analysis of "computer experiments" [5,18]. The main idea of the Kriging model is to assume that the response function H(x) is a realization of a Gaussian process

G(x) = f(x)ᵀβ + Z(x),   (3.3)

where f(x)ᵀβ is a linear regression model on a given functional basis f(x) = [f_1(x), ..., f_p(x)]ᵀ with coefficients β ∈ R^p. For the numerical experiments in Sect. 5, we use the DACE Kriging toolbox [19] for Matlab to model the response function H(x). DACE provides the functional basis f(x) as a set of polynomials of order 0, 1 and 2; in this paper, we use the linear functional basis, i.e., f(x) = [1, x_1, ..., x_d]ᵀ. For details we refer to [19]. The second term Z(x) on the right hand side of (3.3) is a Gaussian process with zero mean and covariance

Cov(Z(x), Z(x′)) = σ² R(x, x′; Θ),  x, x′ ∈ X,

where X ⊆ R^d is the domain of the input variable x, σ² is the process variance of G(x) and R(x, x′; Θ) is a correlation function characterized by a vector of parameters Θ = [ϑ_1, ..., ϑ_d]ᵀ. The choice of the correlation function R depends on the smoothness of the response function H(x) [20]. Assuming H(x) is smooth, we use the Gaussian correlation function

R(x, x′; Θ) = exp(-Σ_{j=1}^d ϑ_j (x_j - x′_j)²).

For other correlation functions we refer to [19,20].
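The Gaussian correlation is straightforward to evaluate; the parameter values below are our own toy choices:

```python
import math

def gauss_corr(x1, x2, thetas):
    """Gaussian correlation R(x, x') = exp(-sum_j theta_j * (x_j - x'_j)^2)."""
    return math.exp(-sum(t * (a - b) ** 2 for t, a, b in zip(thetas, x1, x2)))

r_same = gauss_corr([0.0, 1.0], [0.0, 1.0], thetas=[0.5, 0.5])  # identical points
r_far = gauss_corr([0.0, 0.0], [3.0, 3.0], thetas=[0.5, 0.5])   # distant points
```

Identical points are perfectly correlated (R = 1), and the correlation decays rapidly with distance, at a rate controlled per coordinate by the parameters ϑ_j.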

The Kriging predictor
We have a circuit model that simulates the behaviour of a circuit. To build a surrogate, we evaluate this model at a set of N_tr training inputs s_i, obtaining the responses H(s_i) and hence the training set D = (S, H). Notice that the notation s is used here for the input x to distinguish the training samples from the ones used for the other simulations. The set of experiments is usually referred to as a Design Of Experiments (DOE), see [5]. The construction of a Kriging predictor depends on D, and the DOE should be selected carefully in order to extract the largest amount of statistical information about the response function over the input space.
In this paper, we use the Latin Hypercube Sampling (LHS) space filling technique for our DOE. We do not discuss LHS further, but refer to [21].
Following [19], given a training set D = (S, H), the Kriging predictor at an untried point x, i.e., x ∉ S, is given by

Ĥ(x) = f(x)ᵀβ̂ + r(x)ᵀ R⁻¹ (H - F β̂),   (3.8)

where F = [f(s_1), ..., f(s_{N_tr})]ᵀ is the regression matrix,

β̂ = (Fᵀ R⁻¹ F)⁻¹ Fᵀ R⁻¹ H   (3.9)

is the generalized least-squares estimate of β,

r(x) = [R(x, s_1; Θ), ..., R(x, s_{N_tr}; Θ)]ᵀ   (3.10)

is the vector of correlations between x and the training inputs, and

R = [R(s_i, s_j; Θ)]_{i=1,...,N_tr, j=1,...,N_tr}   (3.11)

is the correlation matrix of the training inputs. The prediction error, also known as the Kriging variance, at a point x is given by

σ̂²(x) = σ² (1 - r(x)ᵀ R⁻¹ r(x) + u(x)ᵀ (Fᵀ R⁻¹ F)⁻¹ u(x)),  with u(x) = Fᵀ R⁻¹ r(x) - f(x).   (3.12)
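A compact, self-contained sketch of these formulas in one dimension. For brevity it uses the constant basis f(x) = 1 (ordinary Kriging) rather than the paper's linear basis, a hand-picked correlation parameter instead of the MLE, and a tiny dense linear solver; all names are ours:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_kriging(S, H, theta=2.0, nugget=1e-10):
    """Ordinary Kriging (constant basis) in 1-D with a Gaussian correlation."""
    n = len(S)
    corr = lambda a, b: math.exp(-theta * (a - b) ** 2)
    R = [[corr(S[i], S[j]) + (nugget if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    Rinv_1 = solve(R, [1.0] * n)
    Rinv_H = solve(R, H)
    beta = sum(Rinv_H) / sum(Rinv_1)                 # GLS estimate of the mean
    res = [h - beta for h in H]
    Rinv_res = solve(R, res)
    sigma2 = sum(r * w for r, w in zip(res, Rinv_res)) / n   # process variance
    def predict(x):
        r = [corr(x, s) for s in S]
        mean = beta + sum(a * w for a, w in zip(r, Rinv_res))
        Rinv_r = solve(R, r)
        u = 1.0 - sum(Rinv_r)                        # 1 - 1^T R^-1 r
        mse = sigma2 * (1.0 - sum(a * b for a, b in zip(r, Rinv_r))
                        + u * u / sum(Rinv_1))       # Kriging variance
        return mean, max(mse, 0.0)
    return predict

S = [0.0, 0.5, 1.0, 1.5, 2.0]
H = [math.sin(s) for s in S]
predict = fit_kriging(S, H)
m_train, v_train = predict(1.0)   # at a training point: interpolation, ~zero variance
m_new, v_new = predict(0.75)      # between training points
```

At training points the predictor interpolates the data and the Kriging variance vanishes; between points the variance grows, which is exactly the error information the hybrid approach exploits near the failure threshold.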

Estimation of parameters
Given a choice of regression and correlation models, the optimal values of the parameters β, σ² and Θ can be inferred using Maximum Likelihood Estimation (MLE). The MLE of β is the generalized least-squares estimate β̂ given in (3.9), and the MLE of σ² (see [19, Sect. 3]) is

σ̂² = (1/N_tr) (H - F β̂)ᵀ R⁻¹ (H - F β̂).   (3.14)

Using these β̂ and σ̂², the optimal parameters of the correlation function solve the optimization problem

Θ* = argmin_Θ { |R|^{1/N_tr} σ̂² },

where |R| is the determinant of R. See [19].

A hybrid importance sampling Monte Carlo approach
In this section we propose the HISMC approach. This approach is a modification of the hybrid approach proposed in [1, Sect. 4] and can be used for large circuits. Similar to [1], we split the ISMC Algorithm 2.1 into two phases: the first is the Exploration Phase, comprising steps 1 to 6, which estimates the optimal mean-shift θ*; the second is the Estimation Phase, consisting of step 7, which estimates the probability (2.8) using the optimal value θ* of the mean-shift θ. In the HISMC approach, we use a hybrid combination of LASSO, the Kriging model and the circuit model. We treat the two phases separately.

The exploration phase
The goal of the exploration phase is to find the optimal mean-shift θ*. Here we replace the indicator function 1_{H(x) ≥ γ_k} in step 4 of Algorithm 2.1 by the approximation 1_{Ĥ(x) ≥ γ_k}, where Ĥ(x) is the Kriging prediction of the response H(x) at input x. Notice that the accuracy of the metamodel 1_{Ĥ(x) ≥ γ_k} is important and must be checked before its use. The Misclassification Rate (MCR) is used as its measure of accuracy.
A leave-one-out cross validation technique for the Kriging model is proposed in [22], and we use it to estimate the MCR in this paper. Before presenting our algorithm for the exploration phase, we mention the following preliminary for a better understanding of the algorithm. It can be seen (cf. Fig. 1) that choosing a larger value of γ_k gives faster convergence to the failure threshold γ. However, this requires a smaller ρ, which in turn needs a larger m. In a surrogate based approach we use a cheap model instead of the full circuit model, so a large number m (say, 10,000) of simulations can be used, and therefore a small ρ (say, 0.05) is acceptable.

The estimation phase
Assuming that the optimal value θ* of the mean-shift θ has been computed in the exploration phase, our next goal is to estimate the probability p_fail. To this end, we build an accurate surrogate based probability estimator. One simple approach [1] is to replace the indicator function 1_{H(Y) ≥ γ} in (2.8) by its surrogate 1_{Ĥ(Y) ≥ γ}, where Ĥ(Y) is the Kriging prediction of the response H(Y) at input Y = X + θ*. Here the Kriging predictor is built on a training set with input vectors centered at θ*. To get an impression of the accuracy of the probability estimator we would like to have a confidence interval. A candidate is the Pseudo Confidence Interval (PCI) provided in [1], which depends on the Kriging variance. However, there is no proof that the true probability lies in the PCI; moreover, the PCI can be very wide if the Kriging prediction has a large variance. To prevent loss of accuracy of the probability estimator, a hybrid approach was proposed in [23,24] that combines simulations of the Kriging model and of the original system (the circuit model in our context). The original system is used only for the responses that are close to the failure threshold. In this section we use a similar approach based on the Kriging model. Unlike the hybrid approach in [23], where the initially built surrogate model is used throughout, we check the accuracy of the model online and improve (re-build) it if required. The authors in [25] demonstrate the benefits of using an improved surrogate model, which is useful for our application. In this paper, we use an adaptive sampling technique to improve the Kriging model: we add some samples (adaptively) to the initial training set from the region of interest, see step 14 in Algorithm 4.2. Due to its interpolating nature, the Kriging model gives a better fit in that region after the improvement.
We start by drawing a training set D* = (S*, H*), where S* is the N_tr × d design matrix whose rows are the N_tr random vectors generated with LHS(θ* ± a) and H* is the N_tr × 1 vector of corresponding responses. Then we perform feature selection (using LASSO) and obtain the reduced training set D*_r = (S*_r, H*), where S*_r ⊆ S* is the reduced design matrix containing the columns of S* corresponding to the important variables. Afterwards, we build a Kriging model on the reduced training set D*_r. Finally, we build a hybrid indicator function I_γ(Y), called an emulator, that combines the true indicator function 1_{H(Y) ≥ γ} and its surrogate 1_{Ĥ(Y) ≥ γ} based on an accept/reject criterion, see the next section.

The emulator
We define an interval known as the "margin of uncertainty" [20] around the threshold γ:

M = { Y : γ - z_{α/2} σ̂(Y) ≤ Ĥ(Y) ≤ γ + z_{α/2} σ̂(Y) },

where σ̂(Y) is the Kriging standard deviation at Y and z_{α/2} = Φ⁻¹(1 - α/2), with Φ⁻¹ the inverse cumulative distribution function of the standard normal distribution. For α = 0.01, for which z_{α/2} = 2.58, we assume that there is a 99% chance that the true response lies in the interval [Ĥ(Y) - z_{α/2} σ̂(Y), Ĥ(Y) + z_{α/2} σ̂(Y)]. The accept/reject regions are indicated in Fig. 2: we accept the Kriging prediction Ĥ(Y) if it is reasonably far (Y ∉ M) from the failure threshold γ, and we reject it if it is close (Y ∈ M) to γ; in the latter case the circuit model must be used. The mathematical formulation of the emulator is

I_γ(Y) = 1_{Ĥ(Y) ≥ γ} if Y ∉ M,  and  I_γ(Y) = 1_{H(Y) ≥ γ} if Y ∈ M.   (4.3)
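The accept/reject rule can be sketched as follows, with hypothetical surrogate and circuit models of our own (the surrogate returns its prediction together with its Kriging variance):

```python
def emulator_indicator(y, gamma, surrogate, circuit_model, z=2.58):
    """Hybrid indicator I_gamma: trust the surrogate outside the margin of
    uncertainty, fall back to the expensive circuit model inside it.
    Returns (indicator, used_full_simulation)."""
    mean, var = surrogate(y)
    if abs(mean - gamma) > z * var ** 0.5:        # outside M: accept the surrogate
        return (1 if mean >= gamma else 0), False
    return (1 if circuit_model(y) >= gamma else 0), True   # inside M: full run

# Hypothetical models: true response H(y) = y, surrogate with a 0.05 bias, sd 0.1.
circuit = lambda y: y
surrogate = lambda y: (y + 0.05, 0.01)

ind_far, full_far = emulator_indicator(5.0, 3.0, surrogate, circuit)     # far from gamma
ind_near, full_near = emulator_indicator(3.01, 3.0, surrogate, circuit)  # inside the margin
```

Only the samples whose predictions fall inside the margin trigger a full circuit simulation, which is what keeps the number of expensive runs small.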

Probability estimator
The emulator-based probability estimator p̂^E_fail of the failure probability p_fail is given by

p̂^E_fail = (1/N) Σ_{i=1}^N I_γ(Y_i) L(Y_i),   (4.4)

and its variance can be estimated by

σ̂²_E = (1/N) [ (1/N) Σ_{i=1}^N (I_γ(Y_i) L(Y_i))² - (p̂^E_fail)² ].   (4.5)

(Figure 2 visualizes the accept/reject criterion.) Using (4.4) and (4.5), the 100(1 - α′)% confidence interval for the true probability p_fail is defined as

CI = [ p̂^E_fail - z_{α′/2} σ̂_E, p̂^E_fail + z_{α′/2} σ̂_E ],   (4.6)

and the accuracy of the probability estimator p̂^E_fail can be measured by the coefficient of variation

CV = (σ̂_E / p̂^E_fail) × 100%.   (4.7)

The 95% confidence interval is the most commonly used interval and corresponds to z_{α′/2} = 1.96 for α′ = 0.05. An estimate p̂^E_fail is accepted when its coefficient of variation satisfies CV ≤ 10%.
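These formulas can be sketched in the toy mean-shift setting with a hypothetical response H(y) = y; without a surrogate the emulator reduces to the exact indicator, so the sketch shows only the estimator, its variance estimate, the CI and the CV (all names are ours):

```python
import math
import random

def is_estimate_with_cv(gamma, theta, n, seed=5):
    """Mean-shift IS estimate of P[Y >= gamma], Y = X + theta, X ~ N(0,1),
    with its variance estimate, 95% CI and coefficient of variation (in %)."""
    rng = random.Random(seed)
    terms = []
    for _ in range(n):
        y = rng.gauss(theta, 1.0)
        lr = math.exp(0.5 * theta * theta - theta * y)   # likelihood ratio L(y)
        terms.append(lr if y >= gamma else 0.0)
    p_hat = sum(terms) / n
    var_hat = (sum(t * t for t in terms) / n - p_hat ** 2) / n
    cv = 100.0 * math.sqrt(var_hat) / p_hat
    half = 1.96 * math.sqrt(var_hat)
    return p_hat, cv, (p_hat - half, p_hat + half)

p_hat, cv, ci = is_estimate_with_cv(gamma=4.0, theta=4.0, n=20_000)
```

With the mean shifted onto the threshold, a CV of a few percent is reached with only tens of thousands of cheap samples, comfortably below the 10% acceptance criterion.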
Combining the information from the exploration and estimation phases, the full HISMC algorithm can be formulated as follows:

13. If CV > tol_cv and l < l_max, go to step 14; otherwise, go to step 15.
14. Use the N_full full simulations drawn to compute I_γ(Y^(l)_i), see definition (4.3). Select a few points (say, min{10, N_full}) uniformly among all N_full full simulations. Add these samples to the training set and rebuild the Kriging model with the updated training set. Go to step 6 with l ← l + 1.
15. Determine the 95% confidence interval (4.6).
16. Save the probability p̂^E_fail, the variance σ̂²_E, the CI and the CV.
17. End of the algorithm.

Results and discussion
In this paper two realistic circuits are used for validation purposes. The first is a VCO (voltage-controlled oscillator) with 1500 statistical input parameters and the scalar response 'oscillation frequency'; the second is a memory cell with 2096 statistical input parameters and the scalar response 'read delay'. The ISMC and HISMC algorithms are repeated 100 times, and the empirical results for both the exploration and estimation phases are compared. In the exploration phase we compare the optimal mean-shift computed by both algorithms and the required number of simulations. In the estimation phase we compute the probability, and therefore we need to measure the efficiency of the probability estimator. When we replace the circuit model with a surrogate model, the surrogate simulation time becomes negligible, and one can take only true simulation runs into account when estimating the efficiency of the algorithm. However, the surrogate model introduces a bias that may be very large. Therefore, we use the following procedure to measure the efficiency of the emulator based probability estimator.
For a 95% confidence interval CI_M, an unbiased estimator p̂^M_fail and a large N_rep, the central coverage probability (CCP) must be approximately 0.95. However, for a biased estimator the CCP might be smaller than 0.95. We assume that a (biased) estimator is good enough if its CCP is not lower than 0.90, i.e., a 5% error in the confidence interval is acceptable. Further, the mean squared error (MSE) is the sum of the variance and the squared bias of the estimator, MSE = Var + bias², which provides a useful way to assess the efficiency of a biased estimator (see item 4 below).
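The bias/variance decomposition and the empirical CCP can be computed as follows; the repetition data here are synthetic (a hypothetical 3%-biased estimator standing in for the 100 experiments):

```python
import math
import random

def mse_bias_var(estimates, p_ref):
    """MSE = Var + bias^2 of repeated estimates against a reference probability."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - p_ref
    var = sum((e - mean) ** 2 for e in estimates) / n
    return var + bias ** 2, bias, var

# Synthetic repetitions: a 3%-biased estimator with 5% relative noise.
rng = random.Random(6)
p_ref = 1.0e-5
reps = [p_ref * (1.03 + 0.05 * rng.gauss(0.0, 1.0)) for _ in range(100)]
mse, bias, var = mse_bias_var(reps, p_ref)

# Empirical central coverage probability: fraction of repetitions whose
# 95% interval (built from the empirical spread) covers the reference.
half = 1.96 * math.sqrt(var)
ccp = sum(1 for e in reps if abs(e - p_ref) <= half) / len(reps)
```

The bias shifts the intervals off the reference value, so the empirical CCP drops below the nominal 0.95, which is exactly the effect the acceptance criterion CCP ≥ 0.90 is meant to bound.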

Estimate the Efficiency Metric: We now introduce an efficiency estimator, denoted Eff, given by

Eff(M1, M2) = (MSE_M1 · T_M1) / (MSE_M2 · T_M2),   (5.6)

where T_M1 and T_M2 are the average computational costs (CPU time) required by methods M1 and M2, respectively. If Eff(M1, M2) = κ > 1, method M1 requires κ times more computational cost than M2 to obtain the same accuracy; hence, if Eff(M1, M2) > 1, estimator M2 is preferred over M1.

Figure 3 shows the computation of the mean-shift θ^(k) and the target level γ_k at each iteration of the exploration phase. Note that we plot the empirical means of θ^(k) and γ_k over the 100 repetitions of the ISMC and HISMC algorithms. In both plots, the blue dotted curves with asterisks stand for the ISMC approach and the red dotted curves with squares for the HISMC approach. We start with θ^(0) = 0 and compute the pairs (γ_k, θ^(k)) iteratively. It can be seen from the right plot that the ISMC algorithm takes 7 iterations to reach the target γ = 1900, whereas the HISMC algorithm takes only 4. Indeed, for ISMC we used ρ = 0.2 and m = 1000 for estimating the intermediate levels γ_k, while in HISMC we replace the expensive circuit model by the Kriging model, which allows us to use more samples (m = 10,000), so a smaller value of ρ (0.05) is acceptable. Clearly, HISMC makes bigger steps, which results in fewer iterations. The left plot shows the estimated norm of the mean-shift corresponding to γ_k at level k. From Tables 1 and 2, the final ‖θ^(k)‖ of the two methods are 6.42 and 6.44; HISMC has only a 0.3% relative error with respect to ISMC, and thus in Fig. 3 the final ‖θ^(k)‖ of both methods lies approximately on the same (black horizontal) line. Tables 1 and 2 present the numerical results for the exploration phase of the ISMC and HISMC algorithms, respectively. The first, second, third and fourth columns give the iteration number k, the number of full simulations, the intermediate level γ_k and the mean-shift θ^(k) per iteration, respectively.
The fifth and sixth columns in Table 2 give the reduced dimension d_r and the LOO-MCR (leave-one-out misclassification rate) at iteration k, respectively. Recalling the training sample rule (see Algorithm 4.1), we draw 500 training samples at iteration k = 1 (see the second column), since d = 1500 is greater than 50. The training sample size at iteration k + 1 depends on the reduced dimension d_r (fifth column) at iteration k. Further, the LOO-MCR (4.1) is the leave-one-out misclassification rate of the Kriging model at iteration k; the maximum error is 1% (for k = 3 and 4). Moreover, it can be seen that ISMC requires in total 7000 simulations to estimate the optimal mean-shift, whereas HISMC requires only 1260 full simulations.

Results of the estimation phase
Given the optimal mean-shift from the exploration phase, we estimate the failure probability p_fail. First we obtain a reference probability p̂^ref_fail by running the ISMC algorithm with a small (less than 1%) coefficient of variation. The reference results are shown in Table 3. The probability p̂^ref_fail is useful for measuring the bias, the MSE and thus the efficiency of the HISMC method. We now present the empirical results of the estimation phase of the ISMC and HISMC algorithms. Recall that these results are based on 100 experiments, i.e., both algorithms are repeated 100 times with a different 'seed' of the random generator each time. It is worth mentioning that we performed a feature selection before fitting the Kriging model in the estimation phase of the HISMC algorithm, and that we also rebuilt the Kriging model several times during the process. The average number of important parameters is d_r = 23. Before the probability converges to the required accuracy, we rebuild the Kriging model 4 times by adding samples from the region near the failure threshold. Table 4 presents the results of the estimation phase of the ISMC and HISMC algorithms. The mean probability, the average coefficient of variation (CV(%)) and the mean squared error (MSE) are computed for both techniques. We also computed the relative bias (ε_rel(%)) and the central coverage probability (CCP) for the HISMC technique. The last column indicates the total number (#Runs) of true simulations used in the estimation phase. The results in Table 4 are visualized in Fig. 4. The left and right plots show the probability distributions from the N_rep = 100 experiments (repetitions) of the ISMC and HISMC methods, respectively. The vertical solid black line in the center represents the mean of the N_rep random values of the failure probability. The dotted line in the center is the reference probability p̂^ref_fail.
The two dashed lines around the center represent the 95% confidence interval. Clearly, the reference probability p^ref_fail lies within the 95% empirical CI for both methods. Moreover, the difference between the mean and the reference probability gives the bias in the probability estimator. For the HISMC estimator (right-hand plot), a bias (ε_rel = 2.92%) can be noticed. Because of this bias, the CCP equals 0.92 (see Table 4), which is smaller than 0.95 (the desired coverage for a 95% CI). We consider a CCP with at most 5% error acceptable for N_rep = 100. A more accurate estimation of the bias ε_rel and the CCP would require a larger number N_rep (say 1000) of experiments. Finally, we estimate the efficiency of the HISMC with respect to the ISMC method. Recalling formula (5.6), we require the mean squared error and the average CPU time of both methods. Using the mean squared errors from Table 4 and the average CPU times, ISMC requires approximately 6 times more CPU time than HISMC to achieve the same accuracy. Hence, we obtain a speedup factor of 6.
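The empirical summary statistics above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: it assumes each repetition reports an estimate p̂ and a (fractional) coefficient of variation, and that the per-run 95% CI has the usual normal-approximation form p̂ ± 1.96·CV·p̂.

```python
from statistics import mean

def estimation_stats(p_hats, cvs, p_ref, z=1.96):
    """Summary over N_rep repeated runs of an estimator:
    mean estimate, MSE against the reference probability,
    relative bias (%), and the central coverage probability
    (fraction of runs whose 95% CI covers p_ref)."""
    m = mean(p_hats)
    mse = mean((p - p_ref) ** 2 for p in p_hats)
    rel_bias = 100.0 * abs(m - p_ref) / p_ref
    ccp = mean(1.0 if abs(p - p_ref) <= z * cv * p else 0.0
               for p, cv in zip(p_hats, cvs))
    return m, mse, rel_bias, ccp
```

With N_rep = 100, a CCP of 0.92 against a nominal 0.95 is within the ±5% tolerance the text allows; distinguishing a real undercoverage from repetition noise would indeed require more repetitions.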

Results of the exploration phase
Similar to the VCO, both the ISMC and HISMC algorithms were repeated N_rep = 100 times with a different seed of the random generator each time. The mean results (averages over 100 experiments) of the exploration phase are given in this section. Figure 5 shows the computation of the mean-shift θ^(k) and the intermediate failure thresholds γ_k at each iteration of the exploration phase for the memory cell. Here the target failure threshold γ = 902 is reached in 6 and 4 iterations (right plot) by the two methods. As in the VCO case, we rebuild the Kriging model 4 times during the process by adding samples from the region near the failure threshold. Furthermore, we see in the table that the relative bias (ε_rel) for HISMC is small, and therefore we have a good estimation of the CCP. As for the VCO, the probability distribution plots from N_rep = 100 experiments (repetitions) of the ISMC and HISMC methods are shown in Fig. 6. Clearly, the reference probability p^ref_fail lies within the 95% empirical CI for both methods. Moreover, the difference between the mean and the reference probability gives the bias in the probability estimator. Compared to the VCO, we see a smaller bias in the HISMC estimator here, although with only N_rep = 100 experiments this estimate is itself uncertain. For an accurate estimation of the bias ε_rel and the CCP we need a higher number N_rep (say 1000) of experiments.
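The iterative computation of the mean-shift θ^(k) and intermediate thresholds γ_k can be illustrated with a toy adaptive scheme in the spirit of cross-entropy / multilevel methods. Everything here is an assumption for illustration: a one-dimensional standard-normal input, a user-supplied `simulate` function standing in for the (expensive or surrogate) circuit response, and γ_k chosen as an empirical quantile of the current batch; the paper's actual exploration algorithm (Algorithm 4.1) differs in its details.

```python
import random

def explore_mean_shift(simulate, target_gamma, n=200, q=0.9,
                       max_iter=20, seed=0):
    """Toy exploration phase: at each iteration, sample inputs under the
    current mean-shift theta, set the intermediate threshold gamma_k to
    the q-quantile of the responses (capped at the target), and move
    theta toward the samples exceeding gamma_k. Stops once gamma_k
    reaches the target failure threshold."""
    rng = random.Random(seed)
    theta = 0.0
    for k in range(max_iter):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        ys = sorted(((x, simulate(x)) for x in xs), key=lambda t: t[1])
        gamma_k = min(target_gamma, ys[int(q * n)][1])
        elite = [x for x, y in ys if y >= gamma_k]
        theta = sum(elite) / len(elite)  # shift sampling mean to failures
        if gamma_k >= target_gamma:
            return theta, k + 1
    return theta, max_iter
```

Because each γ_k is only a moderate quantile of the current batch, every iteration has enough "elite" samples to update θ, which is what lets the scheme climb toward a rare failure region in a handful of iterations, as in Fig. 5.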
Finally, we estimate the efficiency of the HISMC with respect to the ISMC method. Given the mean squared errors and the average CPU times of both methods, ISMC requires approximately 6.7 times more CPU time than HISMC to achieve the same accuracy. Hence, we obtain a speedup factor of 6.7, and therefore we prefer HISMC over ISMC.
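The speedup computation can be sketched as follows. This assumes the standard work-normalized reading of an efficiency comparison (one plausible interpretation of formula (5.6), which is not reproduced in this section): if an estimator's MSE decays like 1/(CPU time), the time a method needs to reach a given accuracy is proportional to MSE × T, and the speedup of one method over another is the ratio of these products.

```python
def speedup_factor(mse_ref, cpu_ref, mse_hyb, cpu_hyb):
    """Assuming MSE ~ 1/(CPU time) for both estimators, the factor by
    which the reference method (ISMC) needs more CPU time than the
    hybrid (HISMC) to match its accuracy is the ratio of the
    MSE * time products."""
    return (mse_ref * cpu_ref) / (mse_hyb * cpu_hyb)
```

For example, hypothetical values with a 2x MSE gap and a 3x CPU-time gap combine into a 6x speedup, matching the way the factors of 6 (VCO) and 6.7 (memory cell) are obtained.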

Conclusion and future work
In this paper we proposed a HISMC approach for yield estimation of circuits having a very large number of input variables and a scalar response. We assume that only a few (say fewer than 35) of the input variables are important. The HISMC approach uses a feature selection method (LASSO) that reduces the dimension of the input space of the underlying problem, which allows us to fit the Kriging model on the reduced dimension. The Kriging model is used for most of the simulations and significantly reduces the number of runs of the expensive circuit model. Although it is hard or even impossible to quantify the bias in the probability estimator, the emulator prevents loss of accuracy because true simulations are used near the failure threshold. For future work we will compare the HISMC approach (in terms of efficiency and robustness) with the hybrid approach proposed in [24] and with commercially available methods (e.g., Solido and MunEDA). More focus will be on multi-input and multi-output circuits, especially on outputs H(x) with more constraints involved.