- Research
- Open Access

# Hybrid importance sampling Monte Carlo approach for yield estimation in circuit design

- Anuj K. Tyagi^{1} (corresponding author), Xavier Jonsson^{2}, Theo G. J. Beelen^{1} and Wil H. A. Schilders^{1}

*Journal of Mathematics in Industry* **8**:11

https://doi.org/10.1186/s13362-018-0053-4

© The Author(s) 2018

**Received:** 17 January 2018 · **Accepted:** 22 October 2018 · **Published:** 25 October 2018

## Abstract

The dimensions of transistors shrink with each new technology developed in the semiconductor industry. This extreme scaling introduces significant statistical variations in their process parameters. A large digital integrated circuit consists of a very large number (millions or billions) of transistors, so the number of statistical parameters can become very large if mismatch variations are modeled. The parametric variations often degrade the circuit performance. Such degradation can lead to circuit failure, which directly affects the yield of the producing company and its reputation for reliable products. As a consequence, the failure probability of a circuit must be estimated accurately. In this paper, we consider the Importance Sampling Monte Carlo (ISMC) method as a reference probability estimator for estimating tail probabilities. We propose a Hybrid ISMC approach that deals with circuits having a large number of input parameters and provides a fast estimate of the probability. In the Hybrid approach, we replace the expensive-to-evaluate circuit model by a cheap surrogate for most of the simulations. The expensive circuit model is used only for obtaining the training sets (to fit the surrogates) and near the failure threshold, to reduce the bias introduced by the replacement.

## Keywords

- Yield
- Failure probability
- Monte Carlo
- Hybrid importance sampling Monte Carlo
- Dimension reduction
- Exploration phase
- Estimation phase
- Kriging model
- Probability estimator

## 1 Introduction

Due to the continuous increase of the number of individual components on an Integrated Circuit (IC), the probability of a malfunctioning IC increases dramatically, see [1, 2]. This can be illustrated by the example in [2], where an IC with *S* “identical” components (each having a failure probability \(\mathrm {p}_{\mathrm {fail}}\)) has a rather large break-down probability \(P_{\mathrm{fail}}=1-(1-\mathrm {p}_{\mathrm {fail}})^{S}\), even if \(\mathrm {p}_{\mathrm {fail}}\) is considerably small (i.e., failure is a rare event). For example, consider a 256 Mbit SRAM circuit, having 256 million “identical” bit cells. To guarantee a failure probability of 1% for this circuit (i.e., \(P_{\mathrm{fail}}=0.01\) with \(S=256\times10^{6}\)), it is required that \(\mathrm {p}_{\mathrm {fail}}< 3.9\times 10^{-11}\), a rare event indeed. Notice that the yield *Y* of an IC is closely related to the failure probability \(P_{\mathrm{fail}}\) and can be expressed as \(Y=1-P_{\mathrm{fail}} = (1-\mathrm {p}_{\mathrm {fail}})^{S}\). Thus, the yield *Y* of an IC can be estimated from the failure probability \(\mathrm {p}_{\mathrm {fail}}\) of its components.

We consider Monte Carlo (MC) techniques [3] for estimating the failure probability \(\mathrm {p}_{\mathrm {fail}}\). The standard Monte Carlo method produces an estimator \(\hat {\mathrm {p}}_{\mathrm {fail}}= k/n\) for the true probability \(\mathrm {p}_{\mathrm {fail}}\) by running the simulator *n* times with independent random inputs and counting the *k* occurrences of the ‘fail’ event. Notice that \(n\hat {\mathrm {p}}_{\mathrm {fail}}\sim \operatorname{Bin}(n,\mathrm {p}_{\mathrm {fail}})\) follows a binomial law with success probability \(\mathrm {p}_{\mathrm {fail}}\) in *n* trials. The useful properties of the estimator \(\hat {\mathrm {p}}_{\mathrm {fail}}\) are its unbiasedness, i.e., \(\mathbb{E}(\hat {\mathrm {p}}_{\mathrm {fail}}) = \mathrm {p}_{\mathrm {fail}}\), and its independence of the dimension *d* of the random vector **X**. However, the variance of the estimator \(\hat {\mathrm {p}}_{\mathrm {fail}}\) is given by \(\operatorname{Var}(\hat {\mathrm {p}}_{\mathrm {fail}}) = \mathrm {p}_{\mathrm {fail}}(1-\mathrm {p}_{\mathrm {fail}})/n\), which can be (relatively) large for small \(\mathrm {p}_{\mathrm {fail}}\) and a limited number *n* of MC runs. Using the ‘normal approximation’ of the binomial distribution, the 95% relative confidence interval for (small) \(\hat {\mathrm {p}}_{\mathrm {fail}}\) is estimated to be \({\pm}1.96/\sqrt {n \hat {\mathrm {p}}_{\mathrm {fail}}}\). So, to determine \(\hat {\mathrm {p}}_{\mathrm {fail}}\) of the order of \(10^{-11}\) with an accuracy of \({\pm}10\%\) at the 95% confidence level, one needs about \(4\times 10^{13}\) MC runs, which is intractable in industry even with the fastest computer simulations.
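The two headline numbers above can be checked in a few lines; this is an illustrative aside, not part of the method:

```python
# 256 Mbit SRAM from Sect. 1: P_fail = 1 - (1 - p_fail)^S with S bit cells.
# Inverting for P_fail = 0.01 gives the admissible per-cell failure probability.
S = 256e6
p_fail_max = 1.0 - (1.0 - 0.01) ** (1.0 / S)
print(f"p_fail must be below {p_fail_max:.2e}")   # ~3.9e-11

# Standard MC: the 95% relative half-width is 1.96 / sqrt(n * p_hat).
# Requiring +-10% accuracy for p ~ 1e-11 gives the number of runs:
p = 1e-11
n_required = (1.96 / 0.10) ** 2 / p
print(f"required MC runs: {n_required:.2e}")      # ~3.8e13
```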

To overcome this drawback of the standard MC method, a variance-reducing Importance Sampling Monte Carlo (ISMC) technique was proposed in [2]. There it was shown that a reduction of several orders of magnitude can be achieved, from \(4\times10^{13}\) to at most a few thousand runs. However, the failure probability is estimated at some fixed environmental parameters (such as temperature, supply voltage, and process corners). These parameters add multiple levels of complexity. For instance, the failure probability must be computed for a complete range of working temperatures, and the complexity grows exponentially when the other dimensions are combined. For complex systems one can usually afford only a very limited number (say, in the hundreds) of simulations, and therefore the ISMC technique remains unattractive. In [1], a model-based ISMC approach has been proposed for estimating rare circuit events. In the model-based approach the circuit model is replaced by a surrogate which is much faster to evaluate; the circuit model is used only for computing the training samples from which the surrogate is built. Usually, the number of training samples is much smaller than the total number of MC simulations needed to estimate the probability, so the overall computational cost is reduced. Nevertheless, it is often difficult or even impossible to quantify the error made by such a substitution. Another model-based approach, proposed in [4], introduces a statistical blockade: one initially draws a large number of samples from a surrogate model, finds the samples that belong to the tail region, and replaces them by the true responses. The authors use a linear classifier, arguing that such a classifier suffices for SRAM bit cells. Our goal, however, is to address large circuits (such as analog IPs (Intellectual Properties)). Our experiments show that a linear model does not work for such large circuits, and it is difficult to fit a (nonlinear) surrogate model that is accurate enough in the tail to classify the samples that really belong to the tail region. Another limitation of both model-based approaches above is that they do not address the dimensionality issue of the problem.

In this paper, we propose a Hybrid Importance Sampling Monte Carlo (HISMC) approach for estimating small failure probabilities. This approach is a modification of the model-based approach proposed in [1] and can be used for high-dimensional circuit problems. The idea is to use the expensive circuit model^{1} only close to the failure threshold (for a small portion of the overall samples used to estimate the probability), while the surrogate is used for the remaining samples that are reasonably far from the failure threshold. Using this small number of circuit-model samples prevents loss of accuracy. The Kriging model [5, 6] is used as a surrogate of the circuit model because it has a solid mathematical foundation with several useful properties, including interpolation of the training data and a closed formula for the prediction error, known as the Kriging variance. The latter is useful for improving the Kriging model near the failure threshold, as well as for selecting the samples near the failure threshold for which the circuit model is to be used. Our experience with circuits shows that the Kriging model works well for up to 35 input variables.

This paper is organised as follows. We start with the reference method, the mean-shift ISMC approach, in Sect. 2. Then we introduce a surrogate modelling technique in Sect. 3 that combines a feature selection method and the Kriging model. Using this surrogate technique we present our HISMC approach in Sect. 4. Finally, the results are shown in Sect. 5 and conclusions are drawn in Sect. 6.

## 2 The importance sampling Monte Carlo method

### 2.1 General framework

Let \(\mathbf{x}\in\mathbb{R}^{d}\) be a vector of *d* process parameters, which is a realization of the random vector (r.v.) **X** with probability density function (pdf) \(g(\mathbf{x})\), and let \(H(\mathbf {x})\) be a corresponding response^{2} of the circuit under examination. The failure probability \(\mathrm {p}_{\mathrm {fail}}=\mathbb {P}(H(\mathbf{X})\geq\gamma)\) is given by

$$ \mathrm {p}_{\mathrm {fail}} = \mathbb{E}_{g} \bigl[{\mathbb{1}}_{\{H(\mathbf{X})\geq\gamma\}} \bigr] = \int{\mathbb{1}}_{\{H(\mathbf{x})\geq\gamma\}}\,g(\mathbf{x})\,d\mathbf{x}, $$(2.1)

where the subscript *g* means that the expectation is taken with respect to the pdf \(g(\mathbf{x})\), *γ* is a given failure threshold and \({\mathbb{1}}_{\{H(\mathbf{x})\geq\gamma\}}\) is an indicator function that takes the value 1 if \(H(\mathbf{x})\geq\gamma\) and 0 otherwise.

We assume that the (failure) region of interest lies on the upper tail of the output distribution. This is without loss of generality, because any lower tail can be converted to the upper tail by replacing \(H(\mathbf{X})\) with \(-H(\mathbf{X})\). Therefore, the probability \(\mathbb {P} [H(\mathbf{X})\leq\gamma' ]\) can be converted to \(\mathrm {p}_{\mathrm {fail}}^{-}(\gamma') = \mathbb{P} [-H(\mathbf{X})\geq-\gamma' ]\) for some given failure threshold \(\gamma'\) on the lower tail of the distribution. Hence, it is sufficient to estimate the probability for the upper tail, and hereafter we simply write \(\mathrm {p}_{\mathrm {fail}}\) instead of \(\mathrm {p}_{\mathrm {fail}}^{+}(\gamma)\).

Given another pdf *f* such that \({\mathbb{1}}_{\{H(\mathbf{x})\geq\gamma\}}g(\mathbf{x})>0\Rightarrow f(\mathbf{x})>0\) (we then say *g* is absolutely continuous with respect to *f*), we can write (2.1) as

$$ \mathrm {p}_{\mathrm {fail}} = \mathbb{E}_{f} \bigl[{\mathbb{1}}_{\{H(\mathbf{Y})\geq\gamma\}}\,\mathcal{L}(\mathbf{Y}) \bigr], $$

where **Y** is a r.v. generated from the new pdf \(f(\mathbf{x})\) and \(\mathcal{L}(\mathbf{x}) = {g(\mathbf{x})}/{f(\mathbf{x})}\) if \(f>0\), \(\mathcal{L}(\mathbf{x})=0\) otherwise, is the likelihood ratio between the two densities. The ISMC estimator is then given by

$$ \hat {\mathrm {p}}_{\mathrm {fail}} = \frac{1}{N}\sum_{i=1}^{N}{\mathbb{1}}_{\{H(\mathbf{Y}_{i})\geq\gamma\}}\,\mathcal{L}(\mathbf{Y}_{i}), $$

where \(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{N}\) are *N* independent and identically distributed (iid) random samples generated from \(f(\mathbf{x})\).

The work in this paper is an additional contribution to the developments at Mentor Graphics, where a mean-shift ISMC technique (see Sect. 2.2) is being used, assuming that the original input distributions can be transformed into a Gaussian distribution. In this context, the importance density is found by shifting the mean of the original density to the area of interest. We use the same technique as a reference approach. The study of other ISMC techniques is beyond the scope of this paper.

### 2.2 The mean-shift approach

Assume that \(g(\mathbf{x})\) is a standard Gaussian density with mean **0** and covariance **I**, i.e., \(g(\mathbf{x})\sim\mathcal{N}(\boldsymbol {0},\mathbf{I})\). We define the importance density \(f(\mathbf{x}) = g^{\boldsymbol {\theta }}(\mathbf{x})\) with \(g^{\boldsymbol {\theta}}(\mathbf{x}) \sim\mathcal{N}(\boldsymbol {\theta },\mathbf{I})\) parameterized by its mean \(\boldsymbol {\theta} \in\mathbb{R}^{d}\) (see [2]); in other words, \(g^{\boldsymbol {\theta}}(\mathbf{x}) = g(\mathbf{x} - \boldsymbol {\theta})\). Then the likelihood ratio \(\mathcal {L}(\mathbf{x})\) becomes

$$ \mathcal{L}(\mathbf{x}) = \frac{g(\mathbf{x})}{g^{\boldsymbol {\theta}}(\mathbf{x})} = e^{-\boldsymbol {\theta}^{T}\mathbf{x}+\frac{1}{2}|\boldsymbol {\theta}|^{2}}. $$

The relation between **X** and **Y** is \(\mathbf{Y} = \mathbf{X}+\boldsymbol {\theta}\), so the ISMC estimator becomes

$$ \hat {\mathrm {p}}_{\mathrm {fail}} = \frac{1}{N}\sum_{j=1}^{N}{\mathbb{1}}_{\{H(\mathbf{X}_{j}+\boldsymbol {\theta})\geq\gamma\}}\,e^{-\boldsymbol {\theta}^{T}\mathbf{X}_{j}-\frac{1}{2}|\boldsymbol {\theta}|^{2}}, $$(2.8)

where \(\mathbf{X}_{1},\ldots,\mathbf{X}_{N}\) are *N* iid random vectors with density \(g(\mathbf{x})\). The mean-shift θ is chosen to minimize the variance of this estimator, or equivalently its second moment

$$ v(\boldsymbol {\theta}) = \mathbb{E}_{g} \bigl[{\mathbb{1}}_{\{H(\mathbf{X})\geq\gamma\}}\,e^{-\boldsymbol {\theta}^{T}\mathbf{X}+\frac{1}{2}|\boldsymbol {\theta}|^{2}} \bigr], $$(2.9)

which leads to the optimization problem

$$ \boldsymbol {\theta}^{*} = \arg\min_{\boldsymbol {\theta}} v_{N}(\boldsymbol {\theta}),\qquad v_{N}(\boldsymbol {\theta}) = \frac{1}{N}\sum_{j=1}^{N}{\mathbb{1}}_{\{H(\mathbf{X}_{j})\geq\gamma\}}\,e^{-\boldsymbol {\theta}^{T}\mathbf{X}_{j}+\frac{1}{2}|\boldsymbol {\theta}|^{2}}. $$(2.10)

Following [2], there must be at least one \(\mathbf {X}_{j}\) such that ${\mathbb{1}}_{(H({\mathbf{X}}_{j})\ge \gamma )}\ne 0$ to solve the optimization problem (2.10). However, this condition may fail in a rare event context. To overcome this problem, a multilevel approach is suggested in [8] for solving such problems in the context of cross-entropy approaches.

### 2.3 Multi-level approach for rare events simulations

Each iteration *k* of the multi-level approach consists of two phases: in the first phase we fix \(\boldsymbol {\theta}^{(k-1)}\) and obtain the level \(\gamma_{k}\), and in the second phase we compute \(\boldsymbol {\theta}^{(k)}\) using \(\boldsymbol {\theta}^{(k-1)}\) and \(\gamma_{k}\). The computation of \(\gamma_{k}\) and \(\boldsymbol {\theta}^{(k)}\) at iteration *k* is as follows:

- 1. *Computation of* \(\gamma_{k}\): For fixed \(\boldsymbol {\theta }^{(k-1)}\), we let \(\gamma_{k}\) be a \((1-\rho)\)-quantile of \(H(\mathbf{X}^{(k-1)})\), i.e.,

  $$ \mathbb{P} \bigl(H\bigl(\mathbf{X}^{(k-1)}\bigr)\geq \gamma_{k} \bigr) \geq \rho, $$(2.12)

  where \(\mathbf{X}^{(k-1)}\sim g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})\), and

  $$ \mathbb{P} \bigl(H\bigl(\mathbf{X}^{(k-1)}\bigr)\leq\gamma_{k} \bigr) \geq 1-\rho, $$(2.13)

  where *ρ* is a probability chosen such that \(\rho\gg \mathrm {p}_{\mathrm {fail}}\), the probability to be estimated.

  An estimator \(\widehat{\gamma}_{k}\) of \(\gamma_{k}\) is obtained by drawing *m* random samples \(\mathbf{X}^{(k-1)}_{i}\sim g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf{x})\), calculating the responses \(H(\mathbf {X}^{(k-1)}_{i})\) for all *i*, ordering them from smallest to largest, \(H_{(1)}^{(k-1)}\leq\cdots\leq H_{(m)}^{(k-1)}\) with \(H_{(l)}^{(k-1)} := H(\mathbf{X}_{l}^{(k-1)})\), and finally evaluating the \((1-\rho)\) sample quantile as

  $$ \widehat{\gamma}_{k} = H^{(k-1)}_{(\lceil(1-\rho) m\rceil)}, $$(2.14)

  where \(\lceil x\rceil\) is the smallest integer greater than or equal to *x*.

  Note that the estimate \(\widehat{\gamma}_{k}\) of \(\gamma_{k}\) depends on two parameters, the probability *ρ* and the number of samples *m*. Our empirical results show that if we fix \(m=1000\), then \(\rho=0.20\) is a good choice for an accurate estimate of \(\gamma_{k}\). One may choose a smaller *ρ*, but that may require a larger *m* for estimating \(\gamma_{k}\) accurately. For more details we refer to [8, 9].

- 2. *Computation of* \(\boldsymbol {\theta}^{(k)}\): Let \(g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})\) be the density function known at iteration *k* and \(g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})\) be the new density we want to obtain. The likelihood ratio of the densities \(g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})\) and \(g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})\) at iteration *k* is given by

  $$ \mathcal{L}^{(k)}(\mathbf{x}) = \frac{g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf {x})}{g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})} = e^{- (\boldsymbol {\theta }^{(k)}-\boldsymbol {\theta}^{(k-1)} )^{T}\mathbf{x}+\frac{1}{2} (|\boldsymbol {\theta}^{(k)}|^{2}-|\boldsymbol {\theta}^{(k-1)}|^{2} )}. $$(2.15)

  Therefore, the second moment (2.9) at iteration *k* can be written as

  $$ v^{(k)} \bigl(\boldsymbol {\theta}^{(k)} \bigr) = \mathbb{E}_{g^{\boldsymbol {\theta}^{(k)}}} \bigl[ \bigl({\mathbb{1}}_{\{H(\mathbf{X}^{(k)})\geq\gamma_{k}\}}\,\mathcal{L}^{(k)} \bigl(\mathbf{X}^{(k)} \bigr) \bigr)^{2} \bigr] = \mathbb{E}_{g^{\boldsymbol {\theta}^{(k-1)}}} \bigl[{\mathbb{1}}_{\{H(\mathbf{X}^{(k-1)})\geq\gamma_{k}\}}\,\mathcal{L}^{(k)} \bigl(\mathbf{X}^{(k-1)} \bigr) \bigr], $$(2.16)

  where \(\mathbf{X}^{(k)}\sim g^{\boldsymbol {\theta}^{(k)}}\) and \(\mathbf {X}^{(k-1)}\sim g^{\boldsymbol {\theta}^{(k-1)}}\). Using this, the optimal mean-shift \(\boldsymbol {\theta}^{(k)}\) can be approximated with the Newton algorithm by solving the optimization problem

  $$ \boldsymbol {\theta}^{(k)} = \arg\min_{\boldsymbol {\theta}}v_{m}^{(k)}( \boldsymbol {\theta}), $$(2.17)

  with

  $$ v_{m}^{(k)}(\boldsymbol {\theta}) = \frac{1}{m}\sum_{j=1}^{m}{\mathbb{1}}_{\{H(\mathbf{X}_{j}^{(k-1)})\geq\gamma_{k}\}}\,e^{-(\boldsymbol {\theta}-\boldsymbol {\theta}^{(k-1)})^{T}\mathbf{X}_{j}^{(k-1)}+\frac{1}{2}(|\boldsymbol {\theta}|^{2}-|\boldsymbol {\theta}^{(k-1)}|^{2})} $$(2.18)

  the MC approximation of the second moment \(v^{(k)}(\boldsymbol {\theta})\).

Below we present the multi-level ISMC algorithm.

### Algorithm 2.1

(Multi-level ISMC)

- 1. Set \(k=1\), \(\boldsymbol {\theta}^{(0)} = 0 \in\mathbb{R}^{d}\) (then \(g(\mathbf{x})\equiv g^{\boldsymbol {\theta}^{(0)}}(\mathbf{x})\)), \(\rho= 0.2\) and \(m=1000\).
- 2. Draw *m* samples \(\mathbf{X}_{1}^{(k-1)},\ldots, \mathbf {X}_{m}^{(k-1)}\) according to the distribution \(g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf{x})\).
- 3. Compute the \((1-\rho)\)-quantile \(\gamma_{k}\) of \(H(\mathbf {X}^{(k-1)})\). If \(\gamma_{k} > \gamma\), reset \(\gamma_{k}\) to *γ*.
- 4. Introduce

  $$ v_{m}^{(k)}(\boldsymbol {\theta}) = \frac{1}{m}\sum_{j=1}^{m}{\mathbb{1}}_{\{H(\mathbf{X}_{j}^{(k-1)})\geq\gamma_{k}\}}\,e^{-(\boldsymbol {\theta}-\boldsymbol {\theta}^{(k-1)})^{T}\mathbf{X}_{j}^{(k-1)}+\frac{1}{2}|\boldsymbol {\theta}-\boldsymbol {\theta}^{(k-1)}|^{2}}. $$

- 5. Solve \(\boldsymbol {\theta}^{(k)} = \arg\min_{\boldsymbol {\theta}}v_{m}^{(k)}(\boldsymbol {\theta})\).

- 6. If \(\gamma_{k} < \gamma\), go to step 2 with \(k \leftarrow k+1\); otherwise, go to step 7.
- 7. Compute (2.8) with \(\boldsymbol {\theta}^{*} = \boldsymbol {\theta}^{(k)}\).
- 8. End of the algorithm.
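To make the algorithm concrete, the following self-contained sketch runs the multi-level scheme on a toy response \(H(\mathbf{x})=\sum_{i}x_{i}\) (standing in for a circuit simulator), for which \(\mathrm{p}_{\mathrm{fail}}\) is known in closed form. For simplicity it updates θ as the mean of the samples above the level (a cross-entropy-style update, instead of the Newton minimization of \(v_{m}^{(k)}\) in step 5); all constants are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d, m, rho = 4, 10_000, 0.2
gamma = 8.0                        # failure threshold

def H(x):                          # toy response standing in for the circuit simulator
    return x.sum(axis=1)

theta = np.zeros(d)                # theta^(0) = 0
for k in range(1, 21):
    X = rng.standard_normal((m, d)) + theta            # step 2: X ~ N(theta^(k-1), I)
    h = H(X)
    gamma_k = min(np.quantile(h, 1.0 - rho), gamma)    # step 3: level, capped at gamma
    # steps 4-5, simplified: move the mean toward the samples above the level
    # (cross-entropy-style update instead of the Newton minimization)
    theta = X[h >= gamma_k].mean(axis=0)
    if gamma_k >= gamma:                               # step 6: threshold reached
        break

# step 7: ISMC estimate (2.8) with the final mean-shift theta
N = 100_000
Y = rng.standard_normal((N, d)) + theta
lr = np.exp(-Y @ theta + 0.5 * theta @ theta)          # likelihood ratio L(Y)
p_hat = float(np.mean((H(Y) >= gamma) * lr))

p_exact = 0.5 * math.erfc(gamma / math.sqrt(2.0 * d))  # P(N(0, d) >= gamma)
print(p_hat, p_exact)
```

On this toy problem the scheme reaches γ in a handful of levels and the ISMC estimate agrees with the exact tail probability (about \(3\times10^{-5}\)) to within a few percent, using far fewer evaluations than standard MC would need.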

## 3 The surrogate modeling technique

As stated in Sect. 1, we deal with circuits having a large number of statistical input variables. Usually, only a few of them are important, i.e., have a significant statistical effect on the circuit response; the remaining variables have a negligible or no statistical effect. It should be noticed that almost all surrogate models (including Kriging) lose their accuracy with increasing dimension, and it is not possible to build an accurate surrogate model with the complete set of variables. Therefore, dimension reduction is required before fitting a surrogate model. Dimension reduction is the process of reducing the number of random variables under consideration and is often divided into two categories: feature extraction and feature selection [10, Chap. 6]. In *feature extraction*, one projects the original input space onto a reduced space whose dimension is much smaller than that of the original space. A variety of projection methods exists; among others, Partial Least Squares Regression (PLSR) [11, 12] belongs to this category. On the other hand, *feature selection* is the process of selecting a subset of important variables, while the other variables are set to their nominal values and removed from the set of input variables used for building a surrogate model. Our experience with feature extraction, especially with PLSR, shows that it requires a large number of training samples (growing with the dimension of the problem) to perform accurately, and therefore it cannot be used when only a limited number of simulations is allowed. Hence, we prefer a feature selection method that performs accurately even with a limited number of simulations.

### 3.1 Feature selection

In circuit simulations, *feature selection* is the process of selecting a subset of important variables, i.e., those having a statistical effect on the response of the circuit under study, while the other variables, having no or negligible effect, are set to their nominal values without incurring much loss of information. Several different approaches for feature selection exist, see, e.g., [13]. Here we employ the well-known Least Absolute Shrinkage and Selection Operator (LASSO) [14], which is stable and exhibits good properties for feature selection.

#### 3.1.1 LASSO

Consider a linear model of the response,

$$ H(\mathbf{x}) \approx b_{0} + \sum_{j=1}^{d} b_{j}\mathrm{x}_{j}, $$(3.2)

where the magnitude of each component \(b_{j}\) of the coefficient vector **b** gives a certain amount of information about the relative importance of \(\mathrm{x}_{j}\). In the context of circuit simulation the dimension of the vector **x** of input variables is very large, but only a few of the \(\mathrm{x}_{j}\)’s are important. This means we want to determine a small subset of variables that gives the main effects and neglect the small effects. LASSO is a well-known method that performs this task. Given a training set \((\mathbf{X}_{i}, H_{i})_{i=1,\ldots,n}\), where \(\mathbf{X}_{i}\) are *n* random input vectors generated with some design of experiments (see Sect. 3.2.1) and \(H_{i}:=H(\mathbf{X}_{i})\) are the corresponding responses, and a tuning parameter \(\lambda\geq0\), LASSO estimates the components of **b** by minimizing the Lagrangian expression

$$ \widehat{\mathbf{b}} = \arg\min_{\mathbf{b}} \sum_{i=1}^{n} \Biggl(H_{i} - b_{0} - \sum_{j=1}^{d} b_{j}\mathrm{X}_{ij} \Biggr)^{2} + \lambda\sum_{j=1}^{d}\vert b_{j}\vert . $$

For sufficiently large values of *λ*, LASSO sets many coefficients exactly to zero. The variables \(\mathrm{x}_{j}\) corresponding to the nonzero coefficients \(b_{j}\) are selected as important variables. Computing the LASSO solution is a quadratic programming problem and can be tackled by standard numerical algorithms. However, the least angle regression (LARS) [15] procedure is a better approach in terms of computational efficiency: the LARS algorithm exploits the special structure of the LASSO problem and provides an efficient way to compute the solutions simultaneously for all possible values of *λ*. For details we refer to [16, Sect. 3]. Among all solutions we choose the one that fits the model (3.2) best by cross-validation. Another aspect of LASSO is the choice of the number *n* of training samples. Typically, *n* can be much smaller than the dimension *d* (when *d* is large). Our experiments show that *n* depends on the number of important parameters rather than on the dimension. A rule of thumb suggests 10 samples per important parameter. Therefore, if we assume there are at most 50 important variables, then at most \(50\times10 = 500\) simulations are required to perform LASSO accurately.

Once we have selected the important variables we are ready to build a surrogate (Kriging) model. The surrogate model will be built with the important variables only.
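As an illustration of the feature-selection step, the sketch below fits LASSO by cyclic coordinate descent (a simple stand-in for the LARS solver and the cross-validated choice of λ used in the paper) on synthetic data with d = 100 parameters, of which 5 are important; the data, the support and the value of λ are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 100                        # few training samples, many parameters
X = rng.standard_normal((n, d))
b_true = np.zeros(d)
b_true[[3, 17, 42, 65, 90]] = [2.0, -1.5, 1.0, 3.0, -2.0]   # 5 important variables
y = X @ b_true + 0.1 * rng.standard_normal(n)               # centered response

def lasso_cd(X, y, lam, n_sweeps=200):
    """LASSO by cyclic coordinate descent: each coordinate update is a
    soft-thresholding step on the partial residual."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ b                       # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(d):
            r += X[:, j] * b[j]         # remove coordinate j from the fit
            rho_j = X[:, j] @ r
            b[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

b_hat = lasso_cd(X, y, lam=0.5 * n)     # lam large enough to zero out noise variables
selected = np.flatnonzero(np.abs(b_hat) > 1e-8)
print(sorted(selected.tolist()))
```

With this λ, the noise coordinates are set exactly to zero by the soft-thresholding step and the selected set recovers the important variables (their coefficients are shrunk by roughly λ divided by the column norm, which is harmless for selection).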

### 3.2 The Kriging model

### Remark 3.1

To avoid the use of additional notations, we will introduce the Kriging model for an input vector **x** having full dimension *d*. However, in our algorithms, **x** will be used in its reduced form. See Algorithm 4.1.

We use the DACE toolbox in MATLAB^{®} to model the response function \(H(\mathbf{x})\). DACE provides the functional basis \(\mathbf{f}(\mathbf{x})\) as a set of polynomials of order 0, 1 and 2. In this paper, we use the linear functional basis, i.e., \(\mathbf{f}(\mathbf{x}) = [1, \mathrm{x}_{1}, \ldots, \mathrm{x}_{d} ]^{T}\). The Kriging model treats the response as a realization of a Gaussian process

$$ \mathcal{G}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^{T}\boldsymbol{\beta} + Z(\mathbf{x}), $$

where \(\boldsymbol{\beta}\) is a vector of regression coefficients, \(Z(\mathbf{x})\) is a zero-mean Gaussian process whose covariance between two inputs \(\mathbf{x}\) and \(\mathbf{x}'\) is \(\sigma^{2}\mathcal{R}(\mathbf{x},\mathbf{x}',\boldsymbol{\ell})\), \(\sigma^{2}\) is the process variance of \(\mathcal {G}(\mathbf {x})\) and \(\mathcal{R}(\mathbf{x}, \mathbf{x}',\boldsymbol {\ell})\) is a correlation function characterized by a vector of parameters \(\boldsymbol {\ell} = [\ell_{1}, \ldots, \ell_{d}]^{T}\).

#### 3.2.1 The Kriging predictor

We have a circuit model to be used for simulating the behaviour of a circuit. We form an ordered set \(\mathcal{D} = (\mathbf{S}, \mathbf {H})\), called the training set, where \(\mathbf{H} = [H(\mathbf {s}_{1}), \ldots, H(\mathbf{s}_{N_{\mathrm{tr}}}) ]^{T}\) is the vector of observations that results from running the circuit model on a set of experiments \(\mathbf{S} = [\mathbf{s}_{1}, \ldots, \mathbf {s}_{N_{\mathrm{tr}}} ]^{T}\). Notice that the notation **s** is used here for the input **x** to distinguish the training samples from those used for the other simulations. The set of experiments is usually referred to as a *Design Of Experiments* (DOE), see [5]. The construction of a Kriging predictor depends on \(\mathcal{D}\), and the DOE should be selected carefully in order to obtain the largest amount of statistical information about the response function over the input space. In this paper, we use the *Latin Hypercube Sampling* (LHS) space-filling technique for our DOE. We will not discuss LHS here, but refer to [21].

The Kriging prediction of the response at an untried input **x**, i.e., \(\mathbf{x}\notin\mathbf{S}\), is given by:

$$ \widehat{H}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^{T}\widehat{\boldsymbol{\beta}} + \mathbf{r}(\mathbf{x})^{T}\mathbf{R}^{-1} \bigl(\mathbf{H}-\mathbf{F}\widehat{\boldsymbol{\beta}} \bigr), $$

where \(\mathbf{F} = [\mathbf{f}(\mathbf{s}_{1}),\ldots,\mathbf{f}(\mathbf{s}_{N_{\mathrm{tr}}})]^{T}\) is the regression matrix, \(\mathbf{R} = [\mathcal{R}(\mathbf{s}_{i},\mathbf{s}_{j},\boldsymbol{\ell})]_{i,j}\) is the correlation matrix of the training samples, \(\mathbf{r}(\mathbf{x}) = [\mathcal{R}(\mathbf{x},\mathbf{s}_{1},\boldsymbol{\ell}),\ldots,\mathcal{R}(\mathbf{x},\mathbf{s}_{N_{\mathrm{tr}}},\boldsymbol{\ell})]^{T}\), and

$$ \widehat{\boldsymbol{\beta}} = \bigl(\mathbf{F}^{T}\mathbf{R}^{-1}\mathbf{F} \bigr)^{-1}\mathbf{F}^{T}\mathbf{R}^{-1}\mathbf{H} $$(3.9)

is the generalized least-squares estimate of β. The Kriging variance at **x** is given by:

$$ \widehat{\sigma}_{K}^{2}(\mathbf{x}) = \sigma^{2} \bigl(1 - \mathbf{r}(\mathbf{x})^{T}\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}) + \mathbf{u}(\mathbf{x})^{T} \bigl(\mathbf{F}^{T}\mathbf{R}^{-1}\mathbf{F} \bigr)^{-1}\mathbf{u}(\mathbf{x}) \bigr), $$

where \(\mathbf{u}(\mathbf{x}) = \mathbf{F}^{T}\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}) - \mathbf{f}(\mathbf{x})\).

#### 3.2.2 Estimation of parameters

The parameters β, \(\sigma^{2}\) and ℓ are estimated by *Maximum Likelihood Estimation* (MLE). The MLE of β is the generalized least-squares estimate \(\widehat {\boldsymbol {\beta}}\) given in (3.9), and the MLE of \(\sigma^{2}\) (see [19, Sect. 3]) is

$$ \widehat{\sigma}^{2} = \frac{1}{N_{\mathrm{tr}}} \bigl(\mathbf{H}-\mathbf{F}\widehat{\boldsymbol{\beta}} \bigr)^{T}\mathbf{R}^{-1} \bigl(\mathbf{H}-\mathbf{F}\widehat{\boldsymbol{\beta}} \bigr). $$

The correlation parameters ℓ are obtained by maximizing the likelihood, which amounts to minimizing \(\vert\mathbf{R}\vert^{1/N_{\mathrm{tr}}}\,\widehat{\sigma}^{2}\), involving the determinant of **R**. See [19].
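The predictor and MLE formulas above can be exercised in a few lines. The following numpy sketch uses a Gaussian correlation function and the linear basis, with the correlation parameters ℓ held fixed rather than estimated by MLE; the toy DOE (a grid instead of LHS), the toy response and all constants are illustrative. It checks the interpolation property: at a training point the prediction reproduces the observation and the Kriging variance vanishes.

```python
import numpy as np

ell = np.array([0.8, 0.8])            # fixed correlation lengths (their MLE is skipped)

def corr(A, B):
    # Gaussian correlation R(x, x', ell) = exp(-sum_i ((x_i - x'_i) / ell_i)^2)
    return np.exp(-(((A[:, None, :] - B[None, :, :]) / ell) ** 2).sum(axis=-1))

def fit(S, Hv):
    F = np.hstack([np.ones((len(S), 1)), S])       # linear basis f(x) = [1, x1, x2]^T
    R = corr(S, S) + 1e-10 * np.eye(len(S))        # tiny jitter for conditioning
    Ri = np.linalg.inv(R)
    beta = np.linalg.solve(F.T @ Ri @ F, F.T @ Ri @ Hv)       # GLS estimate (3.9)
    sigma2 = (Hv - F @ beta) @ Ri @ (Hv - F @ beta) / len(S)  # MLE of sigma^2
    return F, Ri, beta, sigma2

def predict(x, S, Hv, F, Ri, beta, sigma2):
    f = np.concatenate([[1.0], x])
    r = corr(x[None, :], S)[0]
    mu = f @ beta + r @ Ri @ (Hv - F @ beta)       # Kriging prediction
    u = F.T @ Ri @ r - f
    var = sigma2 * (1.0 - r @ Ri @ r + u @ np.linalg.solve(F.T @ Ri @ F, u))
    return mu, max(var, 0.0)                       # Kriging variance

# toy DOE (a 4 x 4 grid; the paper uses LHS) and a smooth toy response
g = np.array([-1.5, -0.5, 0.5, 1.5])
S = np.array([[a, b] for a in g for b in g])
Hv = np.sin(S[:, 0]) + S[:, 1] ** 2
F, Ri, beta, sigma2 = fit(S, Hv)
mu0, var0 = predict(S[0], S, Hv, F, Ri, beta, sigma2)                  # training point
muf, varf = predict(np.array([5.0, 5.0]), S, Hv, F, Ri, beta, sigma2)  # far away
print(mu0 - Hv[0], var0, varf)
```

The near-zero error and variance at the training point illustrate the interpolation property used in Sect. 4, while the large variance far from the DOE is exactly the signal the HISMC approach exploits to decide where the surrogate cannot be trusted.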

## 4 A hybrid importance sampling Monte Carlo approach

In this section we propose the HISMC approach. This approach is a modification of the hybrid approach proposed in [1, Sect. 4] and can be used for large circuits. Similar to [1], we split the ISMC Algorithm 2.1 into two phases: the first is the *Exploration Phase*, which comprises steps 1 to 6 and estimates the optimal mean-shift \(\boldsymbol {\theta}^{*}\); the second is the *Estimation Phase*, which consists of step 7 and estimates the probability (2.8) using the optimal value \(\boldsymbol {\theta }^{*}\) of the mean-shift θ. In the HISMC approach, we use a hybrid combination of LASSO, the Kriging model and the circuit model. We treat the two phases separately.

### 4.1 The exploration phase

In the exploration phase, the expensive circuit response \(H(\mathbf{x})\) is replaced by its Kriging prediction \(\widehat{H}(\mathbf{x})\), built on the important variables of the input vector **x**. Notice that the accuracy of the model \({\mathbb{1}}_{\{\widehat{H}(\mathbf{x})\geq\gamma_{k}\}}\) is important and must be checked before its use. The *Misclassification Error* (MCR) is used as a measure of accuracy for the metamodel \({\mathbb{1}}_{\{\widehat{H}(\mathbf{x})\geq\gamma_{k}\}}\).

Before presenting our algorithm for the exploration phase, we want to mention some preliminaries for extra understanding of the algorithm.

### Remark 4.1

(Preliminaries)

- (i) A new Kriging model is built at each level of the exploration phase, i.e., in general, the Kriging model at iteration *k* is different from the one at iteration \(k-1\).
- (ii) At each iteration *k* of the exploration phase, feature selection (using LASSO) is performed before fitting a Kriging model. Note that for iteration *k*, the feature selection is performed on the reduced set of variables selected at iteration \(k-1\).
- (iii) The notation \(\operatorname{LHS} (\boldsymbol {\theta}^{(k-1)} \pm\underline {a} )\) indicates the LHS in the interval \([\boldsymbol {\theta}^{(k-1)} - \underline{a}, \boldsymbol {\theta}^{(k-1)}+\underline{a}]\), where \(\boldsymbol {\theta} ^{(k-1)}\in\mathbb{R}^{d}\) is the mean of the known pdf \(g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})\) at iteration *k* and \(\underline{a} = [a, a, \ldots, a ]\in\mathbb{R}^{d}\) is a vector with a user-defined positive integer *a*.
- (iv) \(N_{\mathrm{tr}}\), *d* and \(d_{r}\) denote the size of a training set, and the full and reduced dimensions of the input vector **X**, respectively.
- (v) Define a design matrix \(\mathbf{S}^{(k)}=[\mathbf {s}_{i}^{(k)}]_{N_{\mathrm{tr}}\times d}\), where \(\mathbf {s}_{i}^{(k)}\sim\operatorname{LHS} (\boldsymbol {\theta}^{(k-1)} \pm \underline {a} )\) are \(N_{\mathrm{tr}}\) training samples. In this paper, we use \(a=3\) so that a surrogate model can be fitted to predict the response for inputs that lie in the 3-sigma range. For a Gaussian distribution, 99.7% (almost all) of the samples lie in this range.
- (vi) Given the design matrix \(\mathbf{S}^{(k)}\), the corresponding response vector is evaluated from the circuit model and is denoted by \(\mathbf{H}^{(k)}\). Note that for iteration *k*, the columns of \(\mathbf {S}^{(k)}\) corresponding to irrelevant variables (outcome of LASSO at iteration \(k-1\)) are set to zero (nominal values) before evaluating the outputs \(\mathbf{H}^{(k)}\).
- (vii) Introduce the training set \(\mathcal{D}_{r}^{(k)} = (\mathbf {S}_{r}^{(k)}, \mathbf{H}^{(k)} )\), where \(\mathbf {S}_{r}^{(k)}\subseteq \mathbf{S}^{(k)}\) is the reduced design matrix containing the columns of \(\mathbf{S}^{(k)}\) corresponding to the important variables (outcome of the feature selection process).
- (viii) In ISMC Algorithm 2.1, we have \(\rho=0.2\) and \(m=1000\) for estimating an intermediate level \(\gamma_{k}\) at iteration *k*. Notice (from Fig. 1) that choosing a larger value of \(\gamma_{k}\) gives faster convergence to the failure threshold *γ*; however, doing so requires a smaller *ρ*, which in turn needs a larger *m*. In a surrogate-based approach we use a cheap model instead of the full circuit model, so a large number *m* (say, 10,000) of simulations can be used, and therefore a small *ρ* (say, 0.05) is acceptable.

### Algorithm 4.1

(HISMC/Exploration phase)

- 1. Set \(k = 1\), \(\boldsymbol {\theta}^{(k-1)}=\boldsymbol {\theta}^{(0)} = 0 \in \mathbb{R}^{d}\), \(\rho=0.05\), \(m=10\text{,}000\), \(d_{r}^{(k-1)}=d\) and \(a=3\).
- 2. Choose \(N_{\mathrm{tr}}=200\) if \(d_{r}^{(k-1)}<20\), \(N_{\mathrm {tr}}=10d_{r}^{(k-1)}\) if \(20\leq d_{r}^{(k-1)} \leq50\), otherwise \(N_{\mathrm{tr}}=500\).
- 3. Find the training set \(\mathcal{D}^{(k)} = (\mathbf {S}^{(k)}, \mathbf{H}^{(k)} )\) at iteration *k*.
- 4. Perform feature selection using LASSO on the training set \(\mathcal{D}^{(k)}\).
- 5. Find \(d_{r}^{(k)}\) and \(\mathcal{D}_{r}^{(k)} = (\mathbf{S}_{r}^{(k)}, \mathbf {H}^{(k)})\), the number of important variables and the reduced training set, respectively.
- 6. Fit a Kriging model using the training set \(\mathcal{D}_{r}^{(k)}\).
- 7. Draw *m* iid random samples \(\mathbf{X}_{1}^{(k)}, \ldots, \mathbf {X}_{m}^{(k)} \sim f(\mathbf{x},\boldsymbol {\theta}^{(k-1)})\) and estimate \(\widehat{H}(\mathbf{X}_{1}^{(k)}), \ldots, \widehat{H}(\mathbf {X}_{m}^{(k)})\) using the Kriging model.
- 8. Compute the \((1-\rho)\) sample quantile \(\gamma_{k}\). If \(\gamma _{k}>\gamma\), reset \(\gamma_{k}\) to *γ*.
- 9. Introduce the Kriging-based variance criterion

  $$ v_{m}^{(k)}(\boldsymbol {\theta}) = \frac{1}{m}\sum_{j=1}^{m} \bigl({\mathbb{1}}_{\{\widehat{H}(\mathbf{X}_{j}^{(k)})\geq\gamma_{k}\}} \bigr)^{2}\,e^{-(\boldsymbol {\theta}-\boldsymbol {\theta}^{(k-1)})^{T}\mathbf{X}_{j}^{(k)}+\frac{1}{2}(|\boldsymbol {\theta}|^{2}-|\boldsymbol {\theta}^{(k-1)}|^{2})} $$

  and compute \(\boldsymbol{\theta}^{(k)} = \arg\min_{\boldsymbol{\theta}} v_{m}^{(k)}(\boldsymbol{\theta})\).
- 10. If \(\gamma_{k} < \gamma\), return to step 2 and proceed with \(k \leftarrow k+1\); otherwise save \(\boldsymbol{\theta}^{*} = \boldsymbol{\theta}^{(k)}\), \(d_{r} = d_{r}^{(k)}\) and the reduced set (say \(\mathbf{x}_{r}\)) of input variables.
- 11. Go to the estimation phase.

- 11.
Go to the estimation phase.

### 4.2 The estimation phase

Assuming that the optimal value \(\boldsymbol {\theta}^{*}\) of the mean-shift θ has been computed in the exploration phase, our next goal is to estimate the probability \(\mathrm {p}_{\mathrm {fail}}\). To this end, we build an accurate surrogate-based probability estimator. One simple approach [1] is to replace the indicator function \({\mathbb{1}}_{\{H(\mathbf{Y})\geq\gamma\}}\) in (2.8) by its surrogate model \({\mathbb{1}}_{\{\widehat{H}(\mathbf{Y})\geq\gamma\}}\), where \(\widehat{H}(\mathbf{Y})\) is the Kriging prediction of the response \(H(\mathbf{Y})\) at some input \(\mathbf{Y} = \mathbf{X} +\boldsymbol {\theta}^{*}\). Here the Kriging predictor is built on a training set with input vectors centered at \(\boldsymbol {\theta}^{*}\). To get an impression of the accuracy of the probability estimator we would like to have a confidence interval. A candidate is the Pseudo Confidence Interval (PCI), which depends on the Kriging variance and is provided in [1]. However, there is no proof that the true probability lies in the PCI. Moreover, the PCI can be very wide if the Kriging prediction has a large variance. To prevent loss of accuracy of the probability estimator, a hybrid approach is proposed in [23, 24] that combines simulations of the Kriging model and of the original system (the circuit model in our context). The original system is used only for the responses that are close to the failure threshold. In this section we use a similar Kriging-based approach. Unlike the hybrid approach in [23], where the first surrogate model is used throughout, we check the accuracy of the model online and improve (re-build) it if required. The authors in [25] demonstrate the benefits of using an improved surrogate model, which might be useful for our application. In this paper, we use an adaptive sampling technique to improve the Kriging model: we add some samples (adaptively) to the initial training set from the region of interest, see step 14 in Algorithm 4.2. Due to its interpolating nature, the Kriging model gives a better fit in that region after the improvement.

We start by drawing a training set \(\mathcal{D}^{*} = (\mathbf {S}^{*}, \mathbf{H}^{*})\), where \(\mathbf{S}^{*}\) is the \(N_{\mathrm {tr}}\times{d}\) design matrix whose rows are the \(N_{\mathrm {tr}}\) random vectors generated with \(\operatorname{LHS} (\boldsymbol {\theta} ^{*} \pm\underline{a} )\) and \(\mathbf{H}^{*}\) is the \(N_{\mathrm {tr}}\times1\) vector of corresponding responses. Then we perform feature selection (using LASSO) and find the reduced training set \(\mathcal{D}_{r}^{*} = (\mathbf{S}_{r}^{*}, \mathbf{H}^{*} )\), where \(\mathbf{S}_{r}^{*}\subseteq\mathbf{S}^{*}\) is the reduced design matrix containing the columns of \(\mathbf{S}^{*}\) corresponding to the important variables. Afterward, we build a Kriging model on the reduced training set \(\mathcal{D}_{r}^{*}\). Finally, we build a hybrid indicator function \(\mathcal{I}_{\gamma}(\mathbf{Y})\), called an emulator, that combines the true indicator function ${\mathbb{1}}_{(H(\mathbf{Y})\ge \gamma )}$ and its surrogate ${\mathbb{1}}_{(\widehat{H}(\mathbf{Y})\ge \gamma )}$ based on an accept/reject criterion; see the next section.
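
The stratified design \(\operatorname{LHS}(\boldsymbol{\theta}^{*}\pm\underline{a})\) above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's code: the function name `lhs_around` and the componentwise uniform-box assumption are ours.

```python
import numpy as np

def lhs_around(theta_star, a, n_tr, rng):
    """Latin Hypercube Sample of n_tr points in the box theta* +/- a
    (componentwise): one point per stratum in every dimension."""
    theta_star = np.asarray(theta_star, dtype=float)
    d = theta_star.size
    # One permutation of the n_tr strata per dimension, plus a uniform
    # jitter inside each stratum, gives u in [0,1)^d with the LHS property.
    strata = rng.permuted(np.tile(np.arange(n_tr), (d, 1)), axis=1).T
    u = (strata + rng.random((n_tr, d))) / n_tr
    # Map [0,1)^d onto the box [theta* - a, theta* + a].
    return theta_star + a * (2.0 * u - 1.0)
```

For example, `lhs_around(np.zeros(3), 3.0, 10, rng)` returns a 10 × 3 design in \([-3,3]^{3}\) with exactly one point in each of the 10 strata of every coordinate.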

#### 4.2.1 The emulator

The Kriging prediction \(\widehat{H}(\mathbf{Y})\) is accepted or rejected depending on its distance to the failure threshold *γ*. For any input **Y**, let \(z_{\alpha/2} = \varPhi^{-1}(1-\alpha/2)\), where \(\varPhi ^{-1}(x)\) is the inverse cumulative distribution function of the standard normal distribution, and define the margin of uncertainty around *γ* as

$$ \mathbb{M} = \bigl\{ \mathbf{Y} : \gamma-z_{\alpha/2}\widehat{\sigma}_{K}(\mathbf {Y})\leq\widehat{H}(\mathbf{Y})\leq\gamma+z_{\alpha/2}\widehat {\sigma}_{K}(\mathbf{Y}) \bigr\}. $$

If \(\alpha= 0.01\), for which \(z_{\alpha /2} = 2.58\), we assume that there is a 99% chance that the true value lies in the interval \(\gamma-z_{\alpha/2}\widehat{\sigma}_{K}(\mathbf {Y})\leq\widehat{H}(\mathbf{Y})\leq\gamma+z_{\alpha/2}\widehat {\sigma}_{K}(\mathbf{Y})\). We accept the Kriging prediction \(\widehat{H}(\mathbf{Y})\) if it is far from *γ* and we reject it if it is close (\(\mathbf{Y}\in \mathbb{M}\)) to *γ*. In the latter case the circuit model must be used.
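
The accept/reject rule can be sketched as follows. This is our illustration of the emulator logic: `kriging_predict` (returning the prediction and its standard deviation) and `circuit_sim` are hypothetical callables standing in for the Kriging formulas and the circuit simulation.

```python
def emulator_indicator(Y, gamma, kriging_predict, circuit_sim, z_alpha_2=2.58):
    """Hybrid indicator I_gamma(Y): use the Kriging prediction when it is
    far from the threshold gamma, and fall back to the expensive circuit
    model when Y lies in the margin of uncertainty M.
    Returns (indicator value, whether a full simulation was used)."""
    h_hat, sigma_k = kriging_predict(Y)
    if abs(h_hat - gamma) <= z_alpha_2 * sigma_k:   # Y in M: prediction unreliable
        return float(circuit_sim(Y) >= gamma), True
    return float(h_hat >= gamma), False
```

Only inputs falling in the margin trigger a full simulation, so the fraction of expensive runs shrinks as the Kriging variance \(\widehat{\sigma}_{K}\) decreases.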

#### 4.2.2 Probability estimator

Combining the information from the exploration and estimation phases, the full HISMC algorithm can be formulated as follows:

### Algorithm 4.2

(HISMC/Estimation phase)

- 1.
Use Algorithm 4.1 for finding the optimal mean-shift \(\boldsymbol {\theta}^{*}\), the number \({d_{r}}\) of important variables and the reduced set of (important) input variables \(\mathbf{x}_{r}\).

- 2.
*Initialize the Estimation phase*:

- (a)
Set \(a=3\) and \(N_{\mathrm{tr}}=200\) if \(d_{r}<20\), \(N_{\mathrm{tr}}=10d_{r}\) if \(20\leq d_{r} \leq50\), and \(N_{\mathrm{tr}}=500\) otherwise.

- (b)
Set the iteration parameter \(l=0\), the maximum number of iterations \(l_{\mathrm{max}}=10\), and the number \(n=1000\) of simulations used at iteration *l*; initialize the total number \(N=0\) of simulations before iteration *l*, \(z_{\alpha/2}=2.58\), \(z_{\alpha'/2}=1.96\), the CV tolerance \(\mathrm{tol\_cv} = 0.1\), \(\tau=0\), and \(\eta=0\).
- 3.
Get the training set \(\mathcal{D}^{*} = (\mathbf{S}^{*}, \mathbf {H}^{*} )\) where \(\mathbf{S}^{*}=[\mathbf{s}_{i}^{*}]_{N_{\mathrm {tr}}\times d}\) with \(\mathbf{s}_{i}^{*}\sim\operatorname{LHS} (\boldsymbol {\theta }^{*} \pm\underline{a} )\) and \(\mathbf{H}^{*}\) is a vector of corresponding responses.^{3}

- 4.
Perform feature selection using LASSO on the training set \(\mathcal{D}^{*}\).

- 5.
Update the number \({d_{r}}\) of important parameters and find the reduced training set \(\mathcal{D}_{r}^{*} = (\mathbf{S}_{r}^{*}, \mathbf {H}^{*} )\). See preliminaries for Algorithm 4.1.

- 6.
Fit a Kriging model.

- 7.
Draw *n* iid random samples \(\mathbf{X}_{i}^{(l)}\sim g(\mathbf {x})\) and shift them to \(\mathbf{Y}_{i}^{(l)}=\mathbf{X}_{i}^{(l)}+\boldsymbol {\theta}^{*}\).

- 8.
Find the Kriging predictions \(\widehat{H} (\mathbf {Y}_{i}^{(l)} )\) and Kriging variances \(\widehat{\sigma}_{K}^{2} (\mathbf{Y}_{i}^{(l)} )\) using (3.7) and (3.13), respectively.

- 9.
Compute the likelihood weights \({w}_{i}^{(l)}= e^{-\boldsymbol {\theta}^{*\top} \mathbf{X}_{i}^{(l)} - \frac{1}{2}\|\boldsymbol {\theta}^{*}\|^{2}}\) and \({v}_{i}^{(l)} = ({w}_{i}^{(l)} )^{2}\) for all \(i=1,\ldots, n\).

- 10.
Determine \(\tau\leftarrow\tau+ \sum_{i=1}^{n}w_{i}^{(l)}\mathcal {I}_{\gamma}(\mathbf{Y}_{i}^{(l)})\), \(\eta\leftarrow\eta+ \sum_{i=1}^{n}v_{i}^{(l)}\mathcal{I}_{\gamma}(\mathbf{Y}_{i}^{(l)})\) and \(N \leftarrow N + n\).

- 11.
Compute \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}= \frac {\tau}{N}\) and \(\hat{\sigma}_{E}^{2} = \frac {1}{N(N-1)} (\eta- N (\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}} )^{2} )\).

- 12.
Calculate \(\mathrm {CV}= z_{\alpha'/2}\frac{\hat{\sigma}_{E}}{\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}}\).

- 13.
If \(\mathrm {CV}> \mathrm{tol\_{cv}}\) and \(l < l_{\mathrm{max}}\) go to step 14 otherwise go to step 15.

- 14.
Use the \(N_{\mathrm{full}}\) full simulations drawn to compute \(\mathcal{I}_{\gamma}(\mathbf{Y}_{i}^{(l)})\), see definition (4.3). Select some points^{4} (say, \(\min\{10, N_{\mathrm{full}}\}\)) uniformly among all \(N_{\mathrm{full}}\) simulations. Add these samples to the training set and rebuild the Kriging model with the updated training set. Go to step 6 with \(l\leftarrow l+1\).

- 15.
Determine the 95% confidence interval
$$ \mathrm {CI}= \bigl[\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {E}}-z_{\alpha'/2}\hat{ \sigma}_{E},\, \hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}}+z_{\alpha '/2} \hat{\sigma}_{E} \bigr]. $$
- 16.
Save the probability \(\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}}\), the variance \(\hat{\sigma}_{E}^{2}\), CI and the CV.

- 17.
End of the algorithm.
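
For a toy response, the core of steps 7 to 12 (one batch of the shifted-mean importance sampling estimator, with the true indicator used in place of the emulator for simplicity) can be sketched as below. The function `ismc_batch` and the 1-d example are our illustration, not the paper's code.

```python
import numpy as np

def ismc_batch(H, theta_star, gamma, n, rng, z=1.96):
    """One batch of steps 7-12: draw n shifted samples, weight them by the
    likelihood ratio, and return the estimate, its variance, CV and CI."""
    theta_star = np.atleast_1d(np.asarray(theta_star, dtype=float))
    X = rng.standard_normal((n, theta_star.size))          # X_i ~ g = N(0, I)
    Y = X + theta_star                                     # mean shift (step 7)
    w = np.exp(-X @ theta_star - 0.5 * theta_star @ theta_star)  # weights (step 9)
    ind = np.array([float(H(y) >= gamma) for y in Y])      # true indicator
    tau, eta, N = (w * ind).sum(), (w ** 2 * ind).sum(), n # step 10
    p_hat = tau / N                                        # step 11
    var_hat = (eta - N * p_hat ** 2) / (N * (N - 1))
    sd = np.sqrt(var_hat)
    cv = z * sd / p_hat                                    # step 12
    return p_hat, var_hat, cv, (p_hat - z * sd, p_hat + z * sd)

# 1-d toy problem: H(x) = x, gamma = 3, so p_fail = P(X >= 3) ~ 1.35e-3
p_hat, var_hat, cv, ci = ismc_batch(lambda y: y[0], theta_star=3.0,
                                    gamma=3.0, n=20000,
                                    rng=np.random.default_rng(0))
```

With the mean shifted onto the threshold, roughly half of the samples hit the failure region, which is why the CV of this batch is already of the order of a few percent.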

## 5 Results and discussion

To validate a probability estimator M against a reference, we proceed as follows:

- 1.
*Get a reference probability*: Let \(\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{M}}\) be a probability estimator of a method M for estimating the probability \(\mathrm {p}_{\mathrm {fail}}\). To estimate empirically the bias in \(\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}\) we need a reference probability \(\mathrm{p}_{\mathrm{fail}}^{\mathrm{ref}}\). A simple estimate \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}\) of \(\mathrm{p}_{\mathrm{fail}}^{\mathrm{ref}}\) can be obtained by running ISMC Algorithm 2.1 with a small coefficient of variation (say, less than 1%).

- 2.
*Perform* \(N_{\mathrm{rep}}\) *experiments of method* M: Notice that a probability generated from the MC estimator \(\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\) is a random number, and thus comparing a single outcome of \(\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\) with the reference probability \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}\) is not valid. Hence, we perform \(N_{\mathrm{rep}}\) independent experiments of the method M and store the \(N_{\mathrm{rep}}\) outcomes \(\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}(i)\) and their confidence intervals \(\mathrm {CI}^{\mathrm{M}}(i)\) for \(i=1,\ldots, N_{\mathrm{rep}}\).

- 3.
Then we compute:
- (a)
*Relative bias*: The relative bias \(\epsilon_{\mathrm {rel}}\) of the estimator \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\) with respect to the reference probability \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}\) is
$$ \epsilon_{\mathrm{rel}}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {M}} \bigr) = \frac{\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}-\operatorname {mean}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}} )}{|\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} |}\times100\%, $$(5.1)
where
$$ \operatorname {mean}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) = \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm{rep}}} \hat{ \mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}(i). $$(5.2)
Note that we do not take the absolute value of the numerator in (5.1), since its sign indicates whether \(\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{M}}\) underestimates or overestimates the reference value.

- (b)
*Central Coverage Probability* (*CCP*): The CCP of the estimator \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\), i.e., the probability that \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}\) lies within \(\mathrm {CI}^{\mathrm{M}}\), is given by
$\mathrm{CCP}\left({\hat{\mathrm{p}}}_{\mathrm{fail}}^{\mathrm{M}}\right)=\frac{1}{{N}_{\mathrm{rep}}}\sum _{i=1}^{{N}_{\mathrm{rep}}}{\mathbb{1}}_{\{{\hat{\mathrm{p}}}_{\mathrm{fail}}^{\mathrm{ref}}\in {\mathrm{CI}}^{\mathrm{M}}(i)\}}.$(5.3)
For a 95% confidence interval \(\mathrm {CI}^{\mathrm{M}}\), an unbiased estimator \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\) and a large \(N_{\mathrm{rep}}\), the CCP must be approximately 0.95. However, for a biased estimator the CCP might be smaller than 0.95. We consider a (biased) estimator good enough if its CCP is not lower than 0.90, i.e., a 5% error in the confidence interval is acceptable.

- (c)
*Mean Squared Error (MSE)*: The MSE of \(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\) is computed as
$$ \mathrm {MSE}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) = \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm {rep}}} \bigl(\hat{ \mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}(i)-\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} \bigr)^{2}. $$(5.4)
Note that \(\mathrm {MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}})\) can be written as
$$\begin{aligned} \mathrm {MSE}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) &= \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm{rep}}} \bigl(\hat{ \mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}(i)-\operatorname {mean}\bigl(\hat{ \mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}\bigr) \bigr)^{2} + \bigl( \operatorname {mean}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr)-\hat { \mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} \bigr)^{2} \\ &= \operatorname {Var}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) + \bigl( \mathrm{bias}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {M}}\bigr) \bigr)^{2}. \end{aligned}$$(5.5)
Hence, the MSE is the sum of the variance and the squared bias of the estimator, which provides a useful way to assess the efficiency of a biased estimator (see step 4 below).

- 4.
*Estimate the Efficiency Metric*: We now introduce an efficiency estimator, denoted by \(\widehat{\operatorname {Eff}}\) and given by
$$ \widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2})=\frac{\mathrm{MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M1}})}{\mathrm{MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M2}})}\, \frac{\overline{T}_{\mathrm{M1}}}{\overline{T}_{\mathrm{M2}}}, $$(5.6)
where \(\overline{T}_{\mathrm{M1}}\) and \(\overline{T}_{\mathrm{M2}}\) are the average computational costs (CPU time) required for methods \(\mathrm{M1}\) and \(\mathrm{M2}\), respectively. If \(\widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2}) = \kappa>1\), method \(\mathrm{M1}\) requires *κ* times more computational cost than \(\mathrm{M2}\) to obtain the same accuracy. Hence, if \(\widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2})>1\), estimator \(\mathrm{M2}\) is preferred over \(\mathrm{M1}\).
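
The four validation metrics above are straightforward to compute from the stored outcomes; the sketch below is our illustration (the function names are ours).

```python
import numpy as np

def rel_bias(p_hats, p_ref):
    """Relative bias (5.1), signed so under/overestimation is visible (in %)."""
    return (p_ref - np.mean(p_hats)) / abs(p_ref) * 100.0

def ccp(cis, p_ref):
    """Central coverage probability (5.3): fraction of CIs containing p_ref."""
    return float(np.mean([lo <= p_ref <= hi for lo, hi in cis]))

def mse(p_hats, p_ref):
    """Mean squared error (5.4) of the repeated estimates around p_ref."""
    return float(np.mean((np.asarray(p_hats) - p_ref) ** 2))

def efficiency(mse_m1, mse_m2, t_m1, t_m2):
    """Efficiency estimator (5.6); a value > 1 means M2 is preferred."""
    return (mse_m1 / mse_m2) * (t_m1 / t_m2)
```

For instance, two outcomes \(\{1.0, 1.2\}\times10^{-3}\) against a reference of \(10^{-3}\) give a relative bias of −10%.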

### 5.1 The VCO

#### 5.1.1 Results of the exploration phase

*ρ* (0.05) is acceptable. Clearly, HISMC makes bigger steps, which results in fewer iterations. The left plot shows the estimated norm of the mean-shift corresponding to \(\gamma_{k}\) at level *k*. From Tables 1 and 2, the last \(\|\boldsymbol {\theta}^{(k)}\|\) values of the two methods are 6.42 and 6.44, respectively. The HISMC value has only a 0.3% relative error with respect to ISMC, and thus in Fig. 3 the last \(\|\boldsymbol {\theta}^{(k)}\|\) of both methods lie approximately on the same (black horizontal) line.

VCO: ISMC exploration phase

Iteration (*k*) | #Runs | \(\gamma_{k}\) | \(\|\boldsymbol {\theta}_{k}\|\) |
---|---|---|---|
1 | 1000 | 1600 | 1.20 |
2 | 1000 | 1655 | 2.18 |
3 | 1000 | 1710 | 3.20 |
4 | 1000 | 1764 | 4.14 |
5 | 1000 | 1816 | 5.06 |
6 | 1000 | 1868 | 5.92 |
7 | 1000 | 1900 | 6.42 |
Total | 7000 | | |

VCO: HISMC exploration phase

Iteration (*k*) | #Runs | \(\gamma_{k}\) | \(\|\boldsymbol {\theta}_{k}\|\) | \(d_{r}\) | LOO-MCR (%) |
---|---|---|---|---|---|
1 | 500 | 1645 | 1.87 | 27 | 0.6 |
2 | 270 | 1738 | 3.64 | 25 | 0.3 |
3 | 250 | 1835 | 5.38 | 24 | 1.0 |
4 | 240 | 1900 | 6.44 | 24 | 1.0 |
Total | 1260 | | | | |

Tables 1 and 2 show the numerical results for the exploration phase of the ISMC and the HISMC algorithm, respectively. The first, second, third and fourth columns list the iteration number *k*, the number of full simulations, the intermediate level \(\gamma_{k}\) and the norm \(\|\boldsymbol {\theta}^{(k)}\|\) of the mean-shift per iteration, respectively. The fifth and sixth columns in Table 2 list the reduced dimension \(d_{r}\) and the LOO-MCR (4.1), the leave-one-out misclassification error of the Kriging model, at iteration *k*. Recalling the training sample rule (see Algorithm 4.1), we draw 500 training samples at iteration \(k=1\) (see the second column) since \(d=1500\), which is greater than 50. The training sample size at iteration \(k+1\) depends on the reduced dimension \(d_{r}\) (fifth column) at iteration *k*. The maximum LOO-MCR is 1% (for \(k=3\) and 4).

Moreover, it can be seen that the ISMC exploration phase requires a total of 7000 simulations to estimate the optimal mean-shift, whereas HISMC requires only 1260 full simulations.

#### 5.1.2 Results of the estimation phase

VCO: reference probability estimation

Method | Probability (\(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}\)) | CV (%) | #Runs |
---|---|---|---|
Reference | 1.10 × 10 | 0.99 | 345,000 |

VCO: ISMC versus HISMC probability estimation

Method | Mean probability | CV (%) | \(\epsilon_{\mathrm{rel}}\) (%) | CCP | MSE | #Runs |
---|---|---|---|---|---|---|
ISMC | 1.10 × 10 | 8.14 | 0 | 0.94 | 2.34 × 10 | 5000 |
HISMC | 1.07 × 10 | 9.21 | 2.92 | 0.92 | 3.27 × 10 | 457 |

### 5.2 The memory cell

#### 5.2.1 Results of the exploration phase

Similar to the VCO, both the ISMC and HISMC algorithms were repeated \(N_{\mathrm{rep}}=100\) times with a different seed of the random generator each time. The mean results (average of 100 experiments) of the exploration phase are given in this section.

Memory cell: ISMC exploration phase

Iteration (*k*) | #Runs | \(\gamma_{k}\) | \(\|\boldsymbol {\theta}_{k}\|\) |
---|---|---|---|
1 | 1000 | 878.72 | 1.30 |
2 | 1000 | 884.99 | 2.36 |
3 | 1000 | 890.40 | 3.38 |
4 | 1000 | 896.01 | 4.14 |
5 | 1000 | 901.10 | 5.26 |
6 | 1000 | 902.00 | 5.31 |
Total | 6000 | | |

Memory cell: HISMC exploration phase

Iteration (*k*) | #Runs | \(\gamma_{k}\) | \(\|\boldsymbol {\theta}_{k}\|\) | \(d_{r}\) | LOO-MCR (%) |
---|---|---|---|---|---|
1 | 500 | 883.38 | 1.91 | 35 | 2.0 |
2 | 350 | 892.40 | 3.64 | 34 | 1.0 |
3 | 340 | 901.75 | 5.30 | 33 | 1.0 |
4 | 330 | 902.00 | 5.32 | 33 | 1.0 |
Total | 1520 | | | | |

#### 5.2.2 Results of the estimation phase

Memory cell: reference probability estimation

Method | Probability (\(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}\)) | Variance (\(\widehat{\sigma }_{\mathrm {ref}}^{2}\)) | CV (%) | #Runs |
---|---|---|---|---|
ISMC | 6.01 × 10 | 9.26 × 10 | 0.99 | 315,000 |

Memory cell: ISMC versus HISMC probability estimation

Method | Mean probability | CV (%) | \(\epsilon_{\mathrm{rel}}\) (%) | CCP | MSE | #Runs |
---|---|---|---|---|---|---|
ISMC | 6.02 × 10 | 7.85 | −0.17 | 0.95 | 3.65 × 10 | 5000 |
HISMC | 5.98 × 10 | 6.83 | 0.51 | 0.96 | 3.70 × 10 | 1130 |

## 6 Conclusion and future work

In this paper we proposed a HISMC approach for yield estimation of circuits having a very large number of input variables and a scalar response. Moreover, we assume that only a few (say, less than 35) of the input variables are important. The HISMC approach uses a feature selection method (LASSO) that reduces the dimension of the input variables of the underlying problem, which allows us to fit the Kriging model on the reduced dimension. The Kriging model is used for most of the simulations and thus significantly reduces the number of runs of the expensive-to-use circuit model. Although it is hard or even impossible to quantify the bias in the probability estimator, the emulator prevents loss of accuracy by using the true simulations near the failure threshold. For future work we will compare the HISMC approach (in terms of efficiency and robustness) with the hybrid approach proposed in [24] and with commercially available methods (e.g., Solido^{®} and MunEDA^{®}). More focus will be on multi-input and multi-output circuits, especially on outputs \(H(\mathbf{x})\) with more constraints involved.

We use the SPICE-like Eldo^{®} simulator from Mentor Graphics^{®} to perform the circuit simulations.

Note that, in this paper, we only consider a scalar response of the circuit. For cases where a circuit has multiple responses, the proposed algorithm has to be repeated for each output individually, which will reduce the overall speedup of the method. For such cases further research is required.

The columns of \(\mathbf{S}^{*}\) corresponding to irrelevant variables (the complement of \(\mathbf{x}_{r}\)) are set to zero before evaluating the outputs \(\mathbf{H}^{*}\).

The purpose of selecting these points is to improve the Kriging predictor in the margin of uncertainty.

## Declarations

### Acknowledgements

The authors would like to thank Cyril Desclèves, Joost Rommes and Pascal Bolcato from Mentor Graphics, Grenoble, for valuable discussions that improved some aspects of this work. The first author is grateful for the financial support from the Marie Curie Action.

### Availability of data and materials

The data that support the findings of this study are available from Mentor Graphics but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Mentor Graphics.

### Funding

This research is completely supported by the European Union in the FP7-PEOPLE-2013-ITN Program under Grant Agreement Number 608243 (FP7 Marie Curie Action, Project ASIVA14—Analog Simulation and Variability Analysis for 14nm designs).

### Authors’ contributions

The first author AKT wrote this manuscript and performed all the experiments. XJ did help to find the research direction, arranged the circuit examples and reviewed this work closely together with TGJB who also supported with fine-tuning and proofreading of the manuscript. WHAS followed the work closely, arranged the meetings with the expertise in the field and made useful suggestions. All the authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


## References

- Tyagi AK, Jonsson X, Beelen TGJ, Schilders WHA. Speeding up rare event simulations using Kriging models. In: Proceedings of IEEE 21st workshop on signal and power integrity (SPI). Baveno: IEEE; 2017.
- Ciampolini L, Lafont J-C, Drissi FT, Morin J-P, Turgis D, Jonsson X, Desclèves C, Nguyen J. Efficient yield estimation through generalized importance sampling with application to NBL-assisted SRAM bitcells. In: Proceedings of the 35th international conference on computer-aided design. ICCAD ’16. New York: ACM; 2016.
- Haldar A, Mahadevan S. Probability, reliability and statistical methods in engineering design. New York: Wiley; 2000.
- Singhee A, Rutenbar RA. Statistical blockade: very fast statistical simulation and modeling of rare circuit events and its application to memory design. IEEE Trans Comput-Aided Des Integr Circuits Syst. 2009;28(8):1176–89.
- Santner T, Williams B, Notz W. The design and analysis of computer experiments. Berlin: Springer; 2003.
- Rasmussen CE, Williams CKI. Gaussian processes for machine learning. Cambridge: MIT Press; 2006.
- Jourdain B, Lelong J. Robust adaptive importance sampling for normal random vectors. Ann Appl Probab. 2009;19(5):1687–718.
- Homem-de-Mello T, Rubinstein RY. Estimation of rare event probabilities using cross-entropy. In: Proceedings of the winter simulation conference. vol. 1. San Diego, CA, USA. 2002. p. 310–9.
- Kroese DP, Porotsky S, Rubinstein RY. The cross-entropy method for continuous multi-extremal optimization. Methodol Comput Appl Probab. 2006;8(3):383–407.
- Alpaydin E. Introduction to machine learning. London: MIT Press; 2010.
- Rosipal R, Krämer N. Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J, editors. Subspace, latent structure and feature selection. Berlin: Springer; 2006. p. 34–51.
- Li G-Z, Zeng X-Q, Yang JY, Yang MQ. Partial least squares based dimension reduction with gene selection for tumor classification. In: 2007 IEEE 7th international symposium on BioInformatics and BioEngineering. Boston, MA, USA. 2007. p. 1439–44.
- Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc, Ser B, Methodol. 1996;58(1):267–88.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32(2):407–99.
- Tibshirani RJ. The lasso problem and uniqueness. Electron J Stat. 2013;7:1456–90.
- Matheron G. Principles of geostatistics. Econ Geol. 1963;58:1246–68.
- Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Stat Sci. 1989;4:409–35.
- Lophaven SN, Nielsen HB, Sondergaard J. DACE: a Matlab Kriging toolbox, version 2.0. Technical University of Denmark, Kgs. Lyngby, Denmark; 2002.
- Dubourg V. Adaptive surrogate models for reliability analysis and reliability-based design optimizations [PhD thesis]. Clermont-Ferrand, France: Blaise Pascal University—Clermont II; 2011.
- McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics. 1979;21(2):239–45.
- Dubrule O. Cross validation of Kriging in unique neighborhood. Math Geol. 1983;15(6):687–99.
- Li J, Xiu D. Evaluation of failure probability via surrogate models. J Comput Phys. 2010;229:8966–80.
- Li J, Li J, Xiu D. An efficient surrogate-based method for computing rare failure probability. J Comput Phys. 2011;230:8683–97.
- Butler T, Dawson C, Wildey T. Propagation of uncertainties using surrogate models. SIAM/ASA J Uncertain Quantificat. 2013;1:164–91.