# Hybrid importance sampling Monte Carlo approach for yield estimation in circuit design

## Abstract

The dimensions of transistors shrink with each new technology developed in the semiconductor industry. This extreme scaling introduces significant statistical variations in the process parameters. A large digital integrated circuit consists of a very large number (millions or billions) of transistors, and therefore the number of statistical parameters may become very large if mismatch variations are modeled. These parametric variations often degrade circuit performance. Such degradation can lead to a circuit failure that directly affects the yield of the manufacturer and its reputation for reliable products. As a consequence, the failure probability of a circuit must be estimated accurately enough. In this paper, we consider the Importance Sampling Monte Carlo (ISMC) method as a reference probability estimator for estimating tail probabilities. We propose a Hybrid ISMC approach that handles circuits with a large number of input parameters and provides a fast estimate of the probability. In the Hybrid approach, we replace the expensive-to-use circuit model by a cheap surrogate for most of the simulations. The expensive circuit model is used only to generate the training sets (to fit the surrogates) and near the failure threshold, to reduce the bias introduced by the replacement.

## Introduction

Due to the continuous increase of the number of individual components on an Integrated Circuit (IC), the probability of a malfunctioning IC increases dramatically, see [1, 2]. This can simply be illustrated by the example in , where an IC with S “identical” components (each having a failure probability $$\mathrm {p}_{\mathrm {fail}}$$) has a rather large probability $$P_{\mathrm{fail}}=1-(1-\mathrm {p}_{\mathrm {fail}})^{S}$$ of breaking down, even if $$\mathrm {p}_{\mathrm {fail}}$$ is considerably small (i.e., a rare event). For example, consider a 256 Mbit SRAM circuit, having 256 million “identical” bit cells. To guarantee a failure probability of at most 1% for this circuit (i.e. $$P_{\mathrm{fail}}=0.01$$ with $$S=256\times10^{6}$$), it is required that $$\mathrm {p}_{\mathrm {fail}}< 3.9\times 10^{-11}$$, which is a rare event indeed. Notice that the yield Y of an IC is closely related to the failure probability $$P_{\mathrm{fail}}$$ and can be expressed as $$Y=1-P_{\mathrm{fail}} = (1-\mathrm {p}_{\mathrm {fail}})^{S}$$. Thus, the yield Y of an IC can be estimated from the failure probability $$\mathrm {p}_{\mathrm {fail}}$$ of its components.
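The SRAM figures above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and `log1p`/`expm1` are used because the naive expression loses precision when $$\mathrm{p}_{\mathrm{fail}}$$ is this small.

```python
import math


def max_cell_failure_probability(p_chip: float, n_cells: float) -> float:
    """Largest per-cell p_fail such that 1 - (1 - p_fail)**S <= p_chip."""
    # Solve (1 - p)^S = 1 - P for p; log1p/expm1 keep precision for tiny p.
    return -math.expm1(math.log1p(-p_chip) / n_cells)


def chip_yield(p_cell: float, n_cells: float) -> float:
    """Yield Y = (1 - p_fail)**S, evaluated in log space."""
    return math.exp(n_cells * math.log1p(-p_cell))


p_cell = max_cell_failure_probability(0.01, 256e6)
print(f"per-cell bound on p_fail: {p_cell:.2e}")
print(f"yield at that bound:      {chip_yield(p_cell, 256e6):.4f}")
```

The bound scales roughly as $$P_{\mathrm{fail}}/S$$, which is why larger memories push the per-cell requirement further into the tail.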

We consider Monte Carlo (MC) techniques for estimating the failure probability $$\mathrm {p}_{\mathrm {fail}}$$. The standard Monte Carlo method produces an estimator $$\hat {\mathrm {p}}_{\mathrm {fail}}= k/n$$ for the true probability $$\mathrm {p}_{\mathrm {fail}}$$ by running the simulator n times with independent random inputs and counting the k occurrences of the ‘fail’ event. Notice that $$n\hat {\mathrm {p}}_{\mathrm {fail}}\sim \operatorname{Bin}(n,\mathrm {p}_{\mathrm {fail}})$$ follows a binomial law with success probability $$\mathrm {p}_{\mathrm {fail}}$$ in each of the n trials. The useful properties of the estimator $$\hat {\mathrm {p}}_{\mathrm {fail}}$$ are its unbiasedness, i.e., $$\mathbb{E}(\hat {\mathrm {p}}_{\mathrm {fail}}) = \mathrm {p}_{\mathrm {fail}}$$, and its independence of the dimension d of the random vector X. However, the variance of the estimator $$\hat {\mathrm {p}}_{\mathrm {fail}}$$ is given by $$\operatorname{Var}(\hat {\mathrm {p}}_{\mathrm {fail}}) = \mathrm {p}_{\mathrm {fail}}(1-\mathrm {p}_{\mathrm {fail}})/n$$, which can be (relatively) large for small $$\mathrm {p}_{\mathrm {fail}}$$ and a limited number n of MC runs. Using the ‘normal approximation’ of the binomial distribution, the relative half-width of the 95% confidence interval for (small) $$\mathrm {p}_{\mathrm {fail}}$$ is approximately $$1.96/\sqrt {n \hat {\mathrm {p}}_{\mathrm {fail}}}$$. So, to determine $$\mathrm {p}_{\mathrm {fail}}$$ in the range $$10^{-11}$$ with an accuracy of $${\pm}10\%$$ at the 95% confidence level, one needs about $$4\times 10^{13}$$ MC runs, which is intractable in industry even with the fastest computer simulations.
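The run count quoted above follows directly from the variance formula. The sketch below (hypothetical helper names) inverts the normal-approximation half-width to obtain the number of runs for a given relative accuracy.

```python
import math


def mc_relative_halfwidth(p_fail: float, n_runs: float, z: float = 1.96) -> float:
    """Relative half-width of the normal-approximation confidence interval."""
    return z * math.sqrt((1.0 - p_fail) / (n_runs * p_fail))


def mc_runs_needed(p_fail: float, rel_accuracy: float, z: float = 1.96) -> float:
    """Number of runs n for which the relative half-width equals rel_accuracy."""
    return (z / rel_accuracy) ** 2 * (1.0 - p_fail) / p_fail


n_runs = mc_runs_needed(1e-11, 0.10)
print(f"standard MC runs needed: {n_runs:.2e}")
```

Because the count grows like $$1/\mathrm{p}_{\mathrm{fail}}$$, every extra decade in the tail costs ten times more standard MC runs.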

To overcome the drawback of the standard MC method, a variance-reducing Importance Sampling Monte Carlo (ISMC) technique is proposed in . There it was shown that a reduction of several orders of magnitude can be achieved, from $$4\times10^{13}$$ to at most a few thousand runs. However, the probability of failure is estimated at fixed environmental parameters (such as temperature, supply voltage, and process corners). These parameters add multiple levels of complexity. For instance, the failure probability must be computed for a complete range of working temperatures. The complexity grows exponentially when the other dimensions are combined. For complex systems one usually can only afford a very limited number (say, hundreds) of simulations, and therefore the ISMC technique remains unattractive. In , a model based ISMC approach has been proposed for estimating rare circuit events. In the model based approach the circuit model is replaced by a surrogate which is much faster to evaluate; the circuit model is only used for drawing the training samples from which the surrogate is built. Usually, the number of training samples is much smaller than the total number of MC simulations for estimating the probability. Hence, the overall computational cost is reduced. Nevertheless, it is often difficult or even impossible to quantify the error made by such a substitution. Another model based approach, the statistical blockade, is proposed in . In this approach one initially draws a large number of samples from a surrogate model, finds the samples which belong to the tail region, and replaces them by the true responses. The authors use a linear classifier, arguing that such a classifier is sufficient for SRAM bitcells. However, our goal is to address large circuits (such as analog IPs (Intellectual Properties)). Our experiments show that a linear model does not work for such large circuits and that it is difficult to fit a (nonlinear) surrogate model that is accurate enough in the tail to classify the samples that really belong to the tail region. The other limitation of both of the above model based approaches is that they do not address the dimensionality issue of the problem.

In this paper, we propose a Hybrid Importance Sampling Monte Carlo (HISMC) approach for estimating small failure probabilities. This approach is a modification of the model based approach proposed in  and can be used for large dimensional circuit problems. The idea is to use the expensive circuit model only close to the failure threshold (for a small portion of the overall samples used to estimate the probability), while the surrogate is used for the remaining samples that are reasonably far away from the failure threshold. The use of this small number of circuit model samples prevents loss of accuracy. The Kriging model [5, 6] is used as a surrogate of the circuit model because it has a solid mathematical foundation with several useful properties, including interpolation of the training data and a closed formula for approximating the prediction error, known as the Kriging variance. The latter is useful for improving the Kriging model near the failure threshold as well as for selecting the samples near the failure threshold for which the circuit model is to be used. Our experience with circuits shows that the Kriging model works well up to 35 input variables.

This paper is organised as follows. We start with the reference method mean-shift ISMC approach in Sect. 2. Then we introduce a surrogate modelling technique in Sect. 3 that combines a feature selection method and the Kriging model. Using this surrogate technique we present our HISMC approach in Sect. 4. Finally, the results are shown in Sect. 5 and a conclusion is made in Sect. 6.

## The importance sampling Monte Carlo method

### General framework

Let $$\mathbf{x}\in\mathbb{R}^{d}$$ be a vector of d process parameters, which is a realization of the random vector (r.v.) X with probability density function (pdf) $$g(\mathbf{x})$$, and let $$H(\mathbf {x})$$ be a corresponding response of the circuit under examination. The failure probability $$\mathrm {p}_{\mathrm {fail}}=\mathbb {P}(H(\mathbf{X})\geq\gamma)$$ is given by

${\mathrm{p}}_{\mathrm{fail}}^{+}\left(\gamma \right)={\mathbb{E}}_{g}\left[{\mathbb{1}}_{\left\{H\left(\mathbf{X}\right)\ge \gamma \right\}}\right]=\int {\mathbb{1}}_{\left\{H\left(\mathbf{x}\right)\ge \gamma \right\}}g\left(\mathbf{x}\right)\phantom{\rule{0.2em}{0ex}}d\mathbf{x},$
(2.1)

where the subscript g means that the expectation is taken with respect to the pdf $$g(\mathbf{x})$$, γ is a given failure threshold and ${\mathbb{1}}_{\left\{H\left(\mathbf{x}\right)\ge \gamma \right\}}$ is an indicator function that takes the value 1 if $$H(\mathbf{x})\geq\gamma$$ and 0 otherwise.

We assume that the (failure) region of interest lies on the upper tail of the output distribution. This is without loss of generality, because any lower tail can be converted to an upper tail by replacing $$H(\mathbf{X})$$ with $$-H(\mathbf{X})$$. Therefore, the probability $$\mathbb {P} [H(\mathbf{X})\leq\gamma' ]$$ can be converted to $$\mathrm {p}_{\mathrm {fail}}^{-}(\gamma') = \mathbb{P} [-H(\mathbf{X})\geq-\gamma' ]$$ for some given failure threshold $$\gamma'$$ on the lower tail of the distribution. Hence, it is sufficient to estimate the probability for the upper tail and hereafter we will simply write $$\mathrm {p}_{\mathrm {fail}}$$ instead of $$\mathrm {p}_{\mathrm {fail}}^{+}(\gamma)$$.

Assume that we have another density f such that ${\mathbb{1}}_{\left\{H\left(\mathbf{x}\right)\ge \gamma \right\}}g\left(\mathbf{x}\right)>0\Rightarrow f\left(\mathbf{x}\right)>0$, i.e., ${\mathbb{1}}_{\left\{H\left(\mathbf{x}\right)\ge \gamma \right\}}g$ is absolutely continuous with respect to f. Then we can write (2.1) as

$\begin{array}{rl}{\mathbb{E}}_{g}\left[{\mathbb{1}}_{\left(H\left(\mathbf{X}\right)\ge \gamma \right)}\right]& =\int {\mathbb{1}}_{\left(H\left(\mathbf{x}\right)\ge \gamma \right)}g\left(\mathbf{x}\right)\phantom{\rule{0.2em}{0ex}}d\mathbf{x}\\ & =\int {\mathbb{1}}_{\left(H\left(\mathbf{x}\right)\ge \gamma \right)}\mathcal{L}\left(\mathbf{x}\right)f\left(\mathbf{x}\right)\phantom{\rule{0.2em}{0ex}}d\mathbf{x}\\ & ={\mathbb{E}}_{f}\left[{\mathbb{1}}_{\left(H\left(\mathbf{Y}\right)\ge \gamma \right)}\mathcal{L}\left(\mathbf{Y}\right)\right],\end{array}$
(2.2)

where Y is a r.v. generated from the new pdf $$f(\mathbf{x})$$ and $$\mathcal{L}(\mathbf{x}) = {g(\mathbf{x})}/{f(\mathbf{x})}$$ if $$f(\mathbf{x})>0$$ and $$\mathcal{L}(\mathbf{x})=0$$ otherwise, is the likelihood ratio between the two densities. The ISMC estimator is then given by

${\hat{\mathrm{p}}}_{\mathrm{fail}}^{\mathrm{IS}}=\frac{1}{N}\sum _{i=1}^{N}{\mathbb{1}}_{\left(H\left({\mathbf{Y}}_{i}\right)\ge \gamma \right)}\mathcal{L}\left({\mathbf{Y}}_{i}\right),$
(2.3)

where $$\mathbf{Y}_{i}$$’s are N independent and identically distributed (iid) random samples generated from $$f(\mathbf{x})$$.

The $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{IS}}$$ is unbiased  with the variance

${\sigma }_{\mathrm{IS}}^{2}=\frac{1}{N}\operatorname{Var}_{f}\left({\mathbb{1}}_{\left(H\left(\mathbf{Y}\right)\ge \gamma \right)}\mathcal{L}\left(\mathbf{Y}\right)\right),$
(2.4)

where

$\operatorname{Var}_{f}\left({\mathbb{1}}_{\left(H\left(\mathbf{Y}\right)\ge \gamma \right)}\mathcal{L}\left(\mathbf{Y}\right)\right)={\mathbb{E}}_{g}\left[{\left({\mathbb{1}}_{\left\{H\left(\mathbf{X}\right)\ge \gamma \right\}}\right)}^{2}\frac{g\left(\mathbf{X}\right)}{f\left(\mathbf{X}\right)}\right]-{\mathrm{p}}_{\mathrm{fail}}^{2}.$
(2.5)

From a practical point of view, one has to find the ‘best’ pdf $$f(\mathbf{x})$$ in order to maximize the accuracy of the estimator $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{IS}}$$. One of the ways to find such $$f(\mathbf{x})$$ is by minimizing the variance $$\sigma_{\mathrm{IS}}^{2}$$.

The work in this paper is an additional contribution to the developments at Mentor Graphics where a mean-shift ISMC technique (see Sect. 2.2) is being used, assuming that the original input distributions can be transformed into a Gaussian distribution. In this context, the importance density is found by shifting the mean of the original density to the area of interest. We use the same technique as a reference approach. The study of other ISMC techniques is out of scope of this paper.

### The mean-shift approach

In this paper, we consider a particular case where the original pdf $$g(\mathbf{x})$$ is Gaussian with mean 0 and variance I, i.e., $$g(\mathbf{x})\sim\mathcal{N}(\boldsymbol {0},\mathbf{I})$$. We define the importance density $$f(\mathbf{x}) = g^{\boldsymbol {\theta }}(\mathbf{x})$$ with $$g^{\boldsymbol {\theta}}(\mathbf{x}) \sim\mathcal{N}(\boldsymbol {\theta },\mathbf{I})$$ parameterized by its mean $$\boldsymbol {\theta} \in\mathbb{R}^{d}$$ (see, ), in other words $$g^{\boldsymbol {\theta}}(\mathbf{x}) = g(\mathbf{x} - \boldsymbol {\theta})$$. Then the likelihood ratio $$\mathcal {L}(\mathbf{x})$$ becomes

$$\mathcal{L}(\mathbf{x}) = \frac{g(\mathbf{x})}{g^{\boldsymbol {\theta }}(\mathbf {x})}=e^{-\boldsymbol {\theta}\mathbf{x} + \frac{1}{2}|\boldsymbol {\theta}|^{2}}$$
(2.6)

and the relation between the random vectors X and Y is

$$\mathbf{Y} = \mathbf{X} + \boldsymbol {\theta}.$$
(2.7)

Using (2.6) and (2.7), the ISMC probability estimator (2.3) can then be written as

${\hat{\mathrm{p}}}_{\mathrm{fail}}^{\mathrm{IS}}=\frac{1}{N}\sum _{i=1}^{N}{\mathbb{1}}_{\left(H\left({\mathbf{X}}_{i}+\boldsymbol{\theta }\right)\ge \gamma \right)}\,{e}^{-\boldsymbol{\theta }{\mathbf{X}}_{i}-\frac{1}{2}{|\boldsymbol{\theta }|}^{2}},$
(2.8)

where $$\mathbf{X}_{i}$$’s are N iid random vectors with density $$g(\mathbf{x})$$.
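As an illustration of (2.8), the following minimal sketch evaluates the mean-shift estimator on a toy problem. Everything concrete here is an assumption made for the example: the response `toy_response` is a hypothetical linear function whose tail probability is known in closed form, and the shift is picked by hand rather than optimized.

```python
import math
import random


def ismc_mean_shift(H, theta, gamma, n_samples, rng):
    """Estimator (2.8): mean of 1{H(X + theta) >= gamma} * exp(-theta.X - |theta|^2 / 2)."""
    half_norm2 = 0.5 * sum(t * t for t in theta)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in theta]      # X_i ~ g = N(0, I)
        y = [xi + ti for xi, ti in zip(x, theta)]     # Y_i = X_i + theta, eq. (2.7)
        if H(y) >= gamma:
            dot = sum(ti * xi for ti, xi in zip(theta, x))
            total += math.exp(-dot - half_norm2)      # likelihood ratio at Y_i
    return total / n_samples


def toy_response(x):
    """Hypothetical response: H(X) is standard normal when X ~ N(0, I) in R^2."""
    return (x[0] + x[1]) / math.sqrt(2.0)


gamma = 4.0
theta = [gamma / math.sqrt(2.0)] * 2                  # hand-picked shift onto the threshold
p_is = ismc_mean_shift(toy_response, theta, gamma, 20000, random.Random(0))
p_exact = 0.5 * math.erfc(gamma / math.sqrt(2.0))     # exact tail P(N(0,1) >= gamma)
print(f"ISMC estimate: {p_is:.3e}, exact: {p_exact:.3e}")
```

With the mean shifted onto the threshold, roughly half of the samples fall in the failure region, whereas standard MC would see one such sample in about every 30,000 runs for this toy threshold.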

Furthermore, the second moment (first term in the right hand side of (2.5)) can also be written in a simplified way

${\mathbb{E}}_{g}\left[{\mathbb{1}}_{\left\{H\left(\mathbf{X}\right)\ge \gamma \right\}}\,\frac{g\left(\mathbf{X}\right)}{{g}^{\boldsymbol{\theta }}\left(\mathbf{X}\right)}\right]={\mathbb{E}}_{g}\left[{\mathbb{1}}_{\left\{H\left(\mathbf{X}\right)\ge \gamma \right\}}\,{e}^{-\boldsymbol{\theta }\mathbf{X}+\frac{1}{2}{|\boldsymbol{\theta }|}^{2}}\right]=:v\left(\boldsymbol{\theta }\right).$
(2.9)

Recall that, to maximize the accuracy of the probability estimator $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{IS}}$$, one has to find the pdf $$g^{\boldsymbol {\theta}}(\mathbf{x})$$, or equivalently the mean-shift θ, such that the variance $$\sigma_{\mathrm{IS}}^{2}$$ is minimal. By (2.4) and (2.5), the variance $$\sigma _{\mathrm{IS}}^{2}$$ is minimal when the second moment $$v(\boldsymbol {\theta})$$ is minimal. By minimizing $$v(\boldsymbol {\theta})$$ we obtain a probability estimator that needs fewer simulations [2, Sect. 3.1] than standard MC simulation.

Under some assumptions it has been proved in  that the function $$v(\boldsymbol {\theta})$$ has a unique minimizer $$\boldsymbol {\theta}^{*}$$ such that $$\nabla v(\boldsymbol {\theta}^{*}) = 0$$. The optimal $$\boldsymbol {\theta}^{*}$$ can be approximated with the Newton algorithm by solving the following optimization problem

$$\boldsymbol {\theta}^{*} = \operatorname*{arg\,min}_{\boldsymbol {\theta}}v_{m}(\boldsymbol { \theta})$$
(2.10)

with

${v}_{m}\left(\boldsymbol{\theta }\right)=\frac{1}{m}\sum _{j=1}^{m}{\mathbb{1}}_{\left(H\left({\mathbf{X}}_{j}\right)\ge \gamma \right)}\,{e}^{-\boldsymbol{\theta }{\mathbf{X}}_{j}+\frac{{|\boldsymbol{\theta }|}^{2}}{2}},$
(2.11)

the MC approximation of the second moment $$v(\boldsymbol {\theta})$$. For details we refer to .
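For orientation, the Newton iteration for (2.10) can be written out explicitly by differentiating (2.11) term by term. The weight symbol $$w_{j}$$ and the iteration index t are notation introduced here for this sketch; they do not come from the cited work. With

$$w_{j}(\boldsymbol{\theta}) = \mathbb{1}_{(H(\mathbf{X}_{j})\geq\gamma)}\, e^{-\boldsymbol{\theta}\mathbf{X}_{j}+\frac{1}{2}|\boldsymbol{\theta}|^{2}},$$

the gradient and the Hessian of $$v_{m}$$ are

$$\nabla v_{m}(\boldsymbol{\theta}) = \frac{1}{m}\sum_{j=1}^{m} w_{j}(\boldsymbol{\theta})\, (\boldsymbol{\theta}-\mathbf{X}_{j}),\qquad \nabla^{2} v_{m}(\boldsymbol{\theta}) = \frac{1}{m}\sum_{j=1}^{m} w_{j}(\boldsymbol{\theta}) \bigl(\mathbf{I} + (\boldsymbol{\theta}-\mathbf{X}_{j}) (\boldsymbol{\theta}-\mathbf{X}_{j})^{T} \bigr),$$

and one Newton step reads

$$\boldsymbol{\theta}^{[t+1]} = \boldsymbol{\theta}^{[t]} - \bigl[\nabla^{2} v_{m}\bigl(\boldsymbol{\theta}^{[t]}\bigr)\bigr]^{-1} \nabla v_{m}\bigl(\boldsymbol{\theta}^{[t]}\bigr).$$

The Hessian is positive definite as soon as at least one weight $$w_{j}$$ is nonzero, so $$v_{m}$$ is then strictly convex; if no sample falls in the failure region, $$v_{m}$$ vanishes identically and the step is undefined.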

Following , there must be at least one $$\mathbf {X}_{j}$$ such that ${\mathbb{1}}_{\left(H\left({\mathbf{X}}_{j}\right)\ge \gamma \right)}\ne 0$ to solve the optimization problem (2.10). However, this condition may fail in a rare event context. To overcome this problem, a multilevel approach is suggested in  for solving such problems in the context of cross-entropy approaches.

### Multi-level approach for rare events simulations

In the multi-level approach, we solve the optimization problem (2.10) iteratively. Starting with the mean $$\boldsymbol {\theta}^{(0)} = \boldsymbol{0}$$ of the given density function $$g(\mathbf{x})$$, we construct a sequence of mean-shifts $$\{\boldsymbol {\theta}^{(k)}, k\geq1\}$$ and a sequence of levels $$\{\gamma_{k}, k\geq1\}$$, and iterate in both $$\gamma_{k}$$ and $$\boldsymbol {\theta}^{(k)}$$ until convergence, see the steps 1 to 6 of the Algorithm 2.1 below. Following , each iteration k of the multi-level approach consists of two phases; in the first phase we fix $$\boldsymbol {\theta}^{(k-1)}$$ and obtain the level $$\gamma_{k}$$, and in the second phase we compute $$\boldsymbol {\theta}^{(k)}$$ using $$\boldsymbol {\theta}^{(k-1)}$$ and $$\gamma_{k}$$. The computation of $$\gamma_{k}$$ and $$\boldsymbol {\theta}^{(k)}$$ at iteration k is as follows:

1. Computation of $$\gamma_{k}$$: For fixed $$\boldsymbol {\theta }^{(k-1)}$$, we let $$\gamma_{k}$$ be a $$(1-\rho)$$-quantile of $$H(\mathbf{X}^{(k-1)})$$, i.e.,

$$\mathbb{P} \bigl(H\bigl(\mathbf{X}^{(k-1)}\bigr)\geq \gamma_{k} \bigr) \geq \rho,$$
(2.12)

$$\mathbb{P} \bigl(H\bigl(\mathbf{X}^{(k-1)}\bigr)\leq\gamma_{k} \bigr) \geq 1-\rho,$$
(2.13)

where $$\mathbf{X}^{(k-1)}\sim g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})$$ and ρ is a probability which is to be chosen such that $$\rho\gg \mathrm {p}_{\mathrm {fail}}$$, the probability to be estimated.

An estimator $$\widehat{\gamma}_{k}$$ of $$\gamma_{k}$$ is obtained by drawing m random samples $$\mathbf{X}^{(k-1)}_{i}\sim g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf{x})$$, calculating the responses $$H(\mathbf {X}^{(k-1)}_{i})$$ for all i, ordering them from smallest to largest $$H_{(1)}^{(k-1)}\leq\cdots\leq H_{(m)}^{(k-1)}$$ where $$H_{(l)}^{(k-1)} := H(\mathbf{X}_{l}^{(k-1)})$$ and finally evaluating the $$(1-\rho)$$ sample quantile as

$$\widehat{\gamma}_{k} = H^{(k-1)}_{(\lceil(1-\rho) m\rceil)},$$
(2.14)

where $$\lceil x\rceil$$ is the smallest integer greater than or equal to x.

Note that the estimation $$\widehat{\gamma}_{k}$$ of $$\gamma_{k}$$ depends on two parameters, the probability ρ and the number of samples m. Our empirical results show that if we fix $$m=1000$$ then a good choice of ρ is 0.20 for getting an accurate estimation of $$\gamma_{k}$$. However, one may choose a smaller ρ but that may require larger m for estimating $$\gamma_{k}$$ accurately. For more details we refer to [8, 9].

2. Computation of $$\boldsymbol {\theta}^{(k)}$$: Let $$g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})$$ be the density function known at iteration k and $$g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})$$ be the new density we want to obtain. The likelihood ratio of the densities $$g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})$$ and $$g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})$$ at iteration k is given by

$$\mathcal{L}^{(k)}(\mathbf{x}) = \frac{g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf {x})}{g^{\boldsymbol {\theta}^{(k)}}(\mathbf{x})} = e^{- (\boldsymbol {\theta }^{(k)}-\boldsymbol {\theta}^{(k-1)} )\mathbf{x}+\frac{1}{2} (|\boldsymbol {\theta}^{(k)}|^{2}-|\boldsymbol {\theta}^{(k-1)}|^{2} )}.$$
(2.15)

Therefore, the second moment (2.9) at iteration k can be extended as

$\begin{array}{rl}{v}^{\left(k\right)}\left({\boldsymbol{\theta }}^{\left(k\right)}\right)& ={\mathbb{E}}_{{g}^{{\boldsymbol{\theta }}^{\left(k\right)}}}\left[{\left({\mathbb{1}}_{\left(H\left({\mathbf{X}}^{\left(k\right)}\right)\ge {\gamma }_{k}\right)}\,{\mathcal{L}}^{\left(k\right)}\left({\mathbf{X}}^{\left(k\right)}\right)\right)}^{2}\right]\\ & ={\mathbb{E}}_{{g}^{{\boldsymbol{\theta }}^{\left(k-1\right)}}}\left[{\mathbb{1}}_{\left(H\left({\mathbf{X}}^{\left(k-1\right)}\right)\ge {\gamma }_{k}\right)}\,{\mathcal{L}}^{\left(k\right)}\left({\mathbf{X}}^{\left(k-1\right)}\right)\right],\end{array}$
(2.16)

where $$\mathbf{X}^{(k-1)}\sim g^{\boldsymbol {\theta}^{(k-1)}}$$ and $$\mathbf {X}^{(k)}\sim g^{\boldsymbol {\theta}^{(k)}}$$.

Using the above information the optimal mean-shift $$\boldsymbol {\theta}^{(k)}$$ can be approximated with the Newton algorithm by solving the following optimization problem

$$\boldsymbol {\theta}^{(k)} = \operatorname*{arg\,min}_{\boldsymbol {\theta}}v_{m}^{(k)}( \boldsymbol {\theta})$$
(2.17)

with

${v}_{m}^{\left(k\right)}\left(\boldsymbol{\theta }\right)=\frac{1}{m}\sum _{j=1}^{m}{\mathbb{1}}_{\left(H\left({\mathbf{X}}_{j}^{\left(k-1\right)}\right)\ge {\gamma }_{k}\right)}\,{e}^{-\left(\boldsymbol{\theta }-{\boldsymbol{\theta }}^{\left(k-1\right)}\right){\mathbf{X}}_{j}^{\left(k-1\right)}+\frac{1}{2}\left({|\boldsymbol{\theta }|}^{2}-{|{\boldsymbol{\theta }}^{\left(k-1\right)}|}^{2}\right)}$
(2.18)

the MC approximation of the second moment $$v^{(k)}(\boldsymbol {\theta})$$.

Below we present the multi-level ISMC algorithm.

### Algorithm 2.1

(Multi-level ISMC)

1. Set $$k=1$$, $$\boldsymbol {\theta}^{(0)} = 0 \in\mathbb{R}^{d}$$ (then $$g(\mathbf{x})\equiv g^{\boldsymbol {\theta}^{(0)}}(\mathbf{x})$$), $$\rho= 0.2$$ and $$m=1000$$.

2. Draw m samples $$\mathbf{X}_{1}^{(k-1)},\ldots, \mathbf {X}_{m}^{(k-1)}$$ according to the distribution $$g^{\boldsymbol {\theta }^{(k-1)}}(\mathbf{x})$$.

3. Compute the $$(1-\rho)$$-quantile $$\gamma_{k}$$ of $$H(\mathbf {X}^{(k-1)})$$. If $$\gamma_{k} > \gamma$$, reset $$\gamma_{k}$$ to γ.

4. Introduce, as in (2.18),

   ${v}_{m}^{\left(k\right)}\left(\boldsymbol{\theta }\right)=\frac{1}{m}\sum _{j=1}^{m}{\mathbb{1}}_{\left(H\left({\mathbf{X}}_{j}^{\left(k-1\right)}\right)\ge {\gamma }_{k}\right)}\,{e}^{-\left(\boldsymbol{\theta }-{\boldsymbol{\theta }}^{\left(k-1\right)}\right){\mathbf{X}}_{j}^{\left(k-1\right)}+\frac{1}{2}\left({|\boldsymbol{\theta }|}^{2}-{|{\boldsymbol{\theta }}^{\left(k-1\right)}|}^{2}\right)}.$

5. Solve $$\boldsymbol {\theta}^{(k)} = \operatorname*{arg\,min}_{\boldsymbol {\theta}}v_{m}^{(k)}(\boldsymbol {\theta})$$.

6. If $$\gamma_{k} < \gamma$$, go to step 2 with $$k \leftarrow k+1$$; otherwise, go to step 7.

7. Compute (2.8) with $$\boldsymbol {\theta}^{*} = \boldsymbol {\theta}^{(k)}$$.

8. End of the algorithm.

Figure 1 illustrates the multi-level ISMC approach (Algorithm 2.1) in the situation $$\mathbf{x}\in\mathbb {R}^{2}$$. Here, we assume that the optimal mean-shift $$\boldsymbol {\theta}^{*}$$ is estimated in four iterations. The sample distribution per mean-shift $$\boldsymbol {\theta }^{(k)}$$ is shown by the ellipses. Moreover, the blue curves (3-dotted are intermediate and 1-solid is the target) represent the contours at levels $$\gamma_{k}$$, $$k=1,\ldots, 4$$. Notice that $$\gamma_{4}=\gamma$$. We start by sampling from the original pdf $$g^{\boldsymbol {\theta }^{(0)}}(\mathbf{x})=g(\mathbf{x})$$, see steps 1 and 2 of the algorithm for iteration $$k=1$$. Then we find the level $$\gamma_{1}$$ (step 3). Afterwards, we estimate $$\boldsymbol {\theta}^{(1)}$$ using the variance criterion in steps 4 and 5. Now, we go to the second iteration and start by sampling from the new distribution $$g^{\boldsymbol {\theta}^{(1)}}(\mathbf{x})$$. We repeat the same procedure until convergence to the optimal mean-shift $$\boldsymbol {\theta }^{*}$$, see step 6. Here we have $$\gamma_{4}=\gamma$$. Finally, we sample from the optimal pdf $$g^{\boldsymbol {\theta}^{*}}(\mathbf{x})$$ (distribution is shown by green ellipse centered at $$\boldsymbol {\theta}^{*}$$) and compute $$\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{IS}}$$, see step 7.
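Algorithm 2.1 can be sketched in plain Python for the two-dimensional situation. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the response `toy_response` is hypothetical (chosen so that the exact probability is known), the function names are invented for the example, the Newton solver uses a fixed number of undamped steps, and the `max_levels` guard is an addition that is not part of the algorithm.

```python
import math
import random


def newton_shift(elite, theta_prev, iters=50):
    """Minimise v_m^(k) of (2.18) over theta in R^2 with plain Newton steps."""
    theta = list(theta_prev)
    for _ in range(iters):
        # Exponent of (2.18) up to a term common to all samples; common factors
        # (including the subtracted maximum) cancel in the Newton step.
        expo = [-sum((t - tp) * xi for t, tp, xi in zip(theta, theta_prev, x))
                for x in elite]
        top = max(expo)
        g = [0.0, 0.0]
        h = [[0.0, 0.0], [0.0, 0.0]]
        for x, e in zip(elite, expo):
            w = math.exp(e - top)
            u = [theta[0] - x[0], theta[1] - x[1]]
            for a in range(2):
                g[a] += w * u[a]                                          # gradient
                for b in range(2):
                    h[a][b] += w * ((1.0 if a == b else 0.0) + u[a] * u[b])  # Hessian
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        step = [(g[0] * h[1][1] - h[0][1] * g[1]) / det,                  # solve h s = g
                (h[0][0] * g[1] - h[1][0] * g[0]) / det]
        theta = [theta[0] - step[0], theta[1] - step[1]]
    return theta


def multilevel_ismc(H, gamma, rho=0.2, m=1000, n_final=5000, seed=0, max_levels=50):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                                                    # step 1
    for _ in range(max_levels):
        xs = [[rng.gauss(t, 1.0) for t in theta] for _ in range(m)]       # step 2
        hs = [H(x) for x in xs]
        quantile = sorted(hs)[math.ceil((1.0 - rho) * m) - 1]             # eq. (2.14)
        gamma_k = min(quantile, gamma)                                    # step 3
        elite = [x for x, hx in zip(xs, hs) if hx >= gamma_k]
        theta = newton_shift(elite, theta)                                # steps 4-5
        if gamma_k >= gamma:                                              # step 6
            break
    half_norm2 = 0.5 * sum(t * t for t in theta)
    total = 0.0
    for _ in range(n_final):                                              # step 7, eq. (2.8)
        x = [rng.gauss(0.0, 1.0) for _ in theta]
        if H([xi + ti for xi, ti in zip(x, theta)]) >= gamma:
            total += math.exp(-sum(t * xi for t, xi in zip(theta, x)) - half_norm2)
    return total / n_final, theta


def toy_response(x):
    """Hypothetical response with known tail: H(X) ~ N(0, 1) for X ~ N(0, I)."""
    return (x[0] + x[1]) / math.sqrt(2.0)


p_hat, theta_star = multilevel_ismc(toy_response, gamma=4.0)
p_exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(f"theta* = {theta_star}, estimate = {p_hat:.3e}, exact = {p_exact:.3e}")
```

Only samples with nonzero indicator contribute to the gradient and Hessian, so the sketch filters them once per level (`elite`) before the Newton iterations.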

## The surrogate modeling technique

As stated in Sect. 1, we deal with circuits having a large number of statistical input variables. Usually, only a few of them are important i.e., having a statistical effect on the circuit response. The remaining variables have a negligible or no statistical effect on the circuit response. It should be noticed that almost all surrogate models (including Kriging) lose their accuracy with increasing dimension and it is not possible to build an accurate surrogate model with a complete set of variables. Therefore, dimension reduction is required before fitting a surrogate model. Dimension reduction is the process of reducing the number of random variables under consideration and is often divided into two categories: feature extraction and feature selection [10, Chap. 6]. In feature extraction, one performs the projection of the original input space to a reduced space. The dimension of the reduced space is much smaller than the original space. There exists a variety of different projection methods. Among others Partial Least Squares Regression (PLSR) [11, 12] belongs to this category. On the other hand, feature selection is the process of selecting a subset of important variables while other variables are set to their nominal values and removed from the set of input variables used for building a surrogate model. Our experience with feature extraction, especially with PLSR, shows that it requires a large number (which grows with the dimension of the problem) of training samples to perform accurately, and therefore it cannot be used when only a limited number of simulations are allowed. Hence, we prefer to use a feature selection method that performs accurately even with a limited number of simulations.

### Feature selection

In circuit simulations, feature selection is the process of selecting a subset of important variables, i.e., those having a statistical effect on the response of the circuit under study, while the other variables, having no or negligible effect, are set to their nominal values without incurring much loss of information. Several different approaches for feature selection exist, see for example . Here we will employ the well known Least Absolute Shrinkage and Selection Operator (LASSO)  which is stable and exhibits good properties for feature selection.

#### LASSO

Let $$\mathbf{x} = [\mathrm{x}_{1},\ldots,\mathrm{x}_{d}]$$ be a vector of input variables and $$H(\mathbf{x})$$ be a circuit response of interest. We can approximate $$H(\mathbf{x})$$ by a linear model

$$H(\mathbf{x}) \approx\mathbf{x}\mathbf{b}=b_{1} \mathrm{x}_{1}+\cdots +b_{d}\mathrm{x}_{d},$$
(3.1)

where $$\mathbf{b}=[b_{1},\ldots, b_{d}]^{T}$$ is a column vector of regression coefficients. It is clear from (3.1) that each component $$b_{j}$$ of b carries information about the relative importance of $$\mathrm{x}_{j}$$. In the context of circuit simulation the dimension of the vector x of input variables is very large. However, only a few of the $$\mathrm{x}_{j}$$’s are important. This means that we want to determine a small subset of variables that gives the main effects and to neglect the small effects. LASSO is a well known method that performs this task. Given a training set ($$\mathbf{X}_{i}$$, $$H_{i}$$)$$_{i=1,\ldots,n}$$ where $$\mathbf{X}_{i}$$ are n random input vectors generated with some design of experiment (see Sect. 3.2.1) and $$H_{i}:=H(\mathbf{X}_{i})$$ are the corresponding responses, and a tuning parameter $$\lambda\geq0$$, LASSO estimates the components of b by minimizing the following Lagrangian expression

$$\widehat{\mathbf{b}} = \operatorname*{arg\,min}_{\mathbf{b}} \Biggl\{ \sum _{i=1}^{n} (H_{i}- \mathbf{X}_{i}\mathbf{b} )^{2} + \lambda\sum _{j=1}^{d}|b_{j}| \Biggr\} .$$
(3.2)

Depending on the value of the tuning parameter λ, LASSO sets many coefficients exactly to zero. The variables $$\mathrm{x}_{j}$$ corresponding to the nonzero coefficients $$b_{j}$$ are selected as important variables. The computation of the LASSO solutions is a quadratic programming problem and can be tackled by standard numerical analysis algorithms. However, the least angle regression (LARS)  procedure is a better approach in terms of computational efficiency. The LARS algorithm exploits the special structure of the LASSO problem, and provides an efficient way to compute the solutions simultaneously for all possible values of λ. For details we refer to [16, Sect. 3]. Among all solutions we choose the one that fits the model (3.2) best according to cross-validation. Another aspect of LASSO is the choice of the number n of training samples. Typically, n should be much smaller than the dimension d (when d is large). Our experiments show that n depends on the number of important parameters rather than on the dimension. A rule of thumb is to use 10 samples per important parameter. Therefore, if we assume that there are at most 50 important variables, then at most $$50\times10 = 500$$ simulations are required to perform LASSO accurately.

Once we have selected the important variables we are ready to build a surrogate (Kriging) model. The surrogate model will be built with the important variables only.
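To make the selection mechanism of (3.2) concrete, here is a small self-contained sketch. It deliberately swaps in cyclic coordinate descent with soft-thresholding instead of LARS, fixes λ by hand instead of using cross-validation, and runs on synthetic data in which only two of twenty variables matter; all names and numbers are illustrative assumptions.

```python
import random


def soft_threshold(a, t):
    """Shrink a towards zero by t; values inside [-t, t] become exactly zero."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def lasso_cd(X, H, lam, sweeps=100):
    """Minimise sum_i (H_i - X_i b)^2 + lam * sum_j |b_j| by coordinate descent."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    resid = list(H)                                   # residual H - X b for b = 0
    z = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of column j with the residual that excludes b_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + z[j] * b[j]
            new = soft_threshold(rho, lam / 2.0) / z[j]
            if new != b[j]:
                delta = new - b[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new
    return b


# Synthetic training set: 60 samples, 20 inputs, only x_0 and x_5 are important.
rng = random.Random(1)
n, d = 60, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
H = [3.0 * x[0] - 2.0 * x[5] + 0.1 * rng.gauss(0.0, 1.0) for x in X]

b_hat = lasso_cd(X, H, lam=20.0)
selected = [j for j, bj in enumerate(b_hat) if bj != 0.0]
print("selected variables:", selected)
```

The soft-threshold is what produces exact zeros, so the selected set can be read off directly from the nonzero entries of the fitted coefficient vector.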

### Remark 3.1

To avoid the use of additional notations, we will introduce the Kriging model for an input vector x having full dimension d. However, in our algorithms, x will be used in its reduced form. See Algorithm 4.1.

The notion of a Kriging model, also known as Gaussian process regression in the literature , was initially developed in a geostatistical framework . The Kriging model also plays a central role in the design and analysis of “computer experiments” [5, 18]. The main idea of the Kriging model is to assume that the response function $$H(\mathbf{x})$$ is a realization of a Gaussian process $$\mathcal{G}(\mathbf{x})$$

$$\mathcal{G}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^{T} \boldsymbol {\beta} + Z(\mathbf{x}),$$
(3.3)

where $$\mathbf{f}(\mathbf{x})^{T}\boldsymbol {\beta}$$ is a linear regression model on a given functional basis $$\mathbf{f}(\mathbf{x}) = [f_{1}(\mathbf{x}), \ldots, f_{q}(\mathbf{x}) ]^{T}$$ and a vector of regression parameters $$\boldsymbol {\beta} = [\beta_{1}, \ldots,\beta_{q}]^{T}$$.

For the numerical experiments in Sect. 5, we use the DACE Kriging toolbox  for Matlab® to model the response function $$H(\mathbf{x})$$. DACE provides the functional basis $$\mathbf{f}(\mathbf{x})$$ as a set of polynomials of order 0, 1 and 2. In this paper, we use the linear functional basis, i.e.,

$$f_{1}(\mathbf{x})=1,\qquad f_{2}(\mathbf{x})= \mathrm{x}_{1},\qquad \ldots,\qquad f_{d+1}(\mathbf{x})= \mathrm{x}_{d}.$$
(3.4)

For details we refer to .

The second term $$Z(\mathbf{x})$$ on the right hand side of (3.3) is a Gaussian process with zero mean and covariance

$$\mathbb{E}\bigl[Z(\mathbf{x})Z\bigl(\mathbf{x}'\bigr)\bigr] = \sigma^{2}\mathcal {R}\bigl(\mathbf {x}, \mathbf{x}',\boldsymbol { \ell}\bigr),\quad \forall\bigl(\mathbf{x},\mathbf {x}'\bigr)\in \mathbb{X}\times\mathbb{X},$$
(3.5)

where $$\mathbb{X}\subseteq\mathbb{R}^{d}$$ is the domain of the input variable x, $$\sigma^{2}$$ is the process variance of $$\mathcal {G}(\mathbf {x})$$ and $$\mathcal{R}(\mathbf{x}, \mathbf{x}',\boldsymbol {\ell})$$ is a correlation function characterized by a vector of parameters $$\boldsymbol {\ell} = [\ell_{1}, \ldots, \ell_{d}]^{T}$$.

The choice of the correlation function $$\mathcal{R}$$ depends on the smoothness of the response function $$H(\mathbf{x})$$ . Assuming $$H(\mathbf{x})$$ is smooth, we use the following Gaussian correlation function

$$\mathcal{R}\bigl(\mathbf{x}, \mathbf{x}',\boldsymbol {\ell} \bigr) = \prod_{j=1}^{{d}}\exp \biggl(- \biggl\vert \frac{(\mathrm{x}_{j} - \mathrm{x}'_{j})^{2}}{\ell _{j}} \biggr\vert \biggr).$$
(3.6)

For other correlation functions we refer to [19, 20].

#### The Kriging predictor

We have a circuit model that simulates the behaviour of a circuit. We construct an ordered set $$\mathcal{D} = (\mathbf{S}, \mathbf {H})$$, called the training set, where $$\mathbf{H} = [H(\mathbf {s}_{1}), \ldots, H(\mathbf{s}_{N_{\mathrm{tr}}}) ]^{T}$$ is a vector of observations obtained from the circuit model on a set of experiments $$\mathbf{S} = [\mathbf{s}_{1}, \ldots, \mathbf {s}_{N_{\mathrm{tr}}} ]^{T}$$. Notice that the notation s is used here for the input x to distinguish the training samples from the ones that will be used for the other simulations. The set of experiments is usually referred to as a Design Of Experiments (DOE), see . The construction of a Kriging predictor depends on $$\mathcal{D}$$, and the DOE should be selected carefully in order to get the largest amount of statistical information about the response function over the input space. In this paper, we use the Latin Hypercube Sampling (LHS) space filling technique for our DOE. We will not discuss LHS here, but we refer to .

Following , given a training set $$\mathcal{D} = (\mathbf {S}, \mathbf{H})$$ the Kriging predictor of an untried point x, i.e., $$\mathbf{x}\notin\mathbf{S}$$ is given by:

$$\widehat{H}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^{T} \widehat{\boldsymbol {\beta}} + r(\mathbf{x})^{T}R^{-1}(\mathbf{H} - F \widehat{\boldsymbol {\beta}}),$$
(3.7)

where

\begin{aligned}& \mathbf{H} = \bigl[H(\mathbf{s}_{i})\bigr]_{i=1,\ldots,N_{\mathrm{tr}}}, \end{aligned}
(3.8)
\begin{aligned}& F = \bigl[f_{i}({\mathbf{s}}_{j}) \bigr]_{i=1,\ldots,q, j=1,\ldots,N_{\mathrm {tr}}}, \end{aligned}
(3.9)
\begin{aligned}& \widehat{\boldsymbol {\beta}} = \bigl(F^{T}R^{-1}F \bigr)^{-1}F^{T}R^{-1}\mathbf{H}, \end{aligned}
(3.10)
\begin{aligned}& R = \bigl[\mathcal{R}({\mathbf{s}}_{i}, {\mathbf{s}}_{j}, \boldsymbol {\ell })\bigr]_{i=1,\ldots ,N_{\mathrm{tr}}, j=1,\ldots,N_{\mathrm{tr}}}, \end{aligned}
(3.11)
\begin{aligned}& r(\mathbf{x}) = \bigl[\mathcal{R}(\mathbf{x}, {\mathbf{s}}_{i}, \boldsymbol { \ell })\bigr]_{i=1,\ldots,N_{\mathrm{tr}}}. \end{aligned}
(3.12)

The prediction error, also known as the Kriging variance, at a point x is given by:

$$\widehat{\sigma}_{K}^{2}(\mathbf{x}) = \sigma^{2} \bigl(1 + u(\mathbf {x})^{T}\bigl(F^{T}R^{-1}F \bigr)^{-1}u(\mathbf{x}) - r(\mathbf{x})^{T}R^{-1}r( \mathbf {x}) \bigr),$$
(3.13)

where $$u(\mathbf{x}) = F^{T}R^{-1}r(\mathbf{x}) - \mathbf{f}(\mathbf{x})$$.
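The predictor (3.7) and the Kriging variance (3.13) can be sketched in a few lines of NumPy. The sketch below is illustrative rather than the implementation used for the experiments: it assumes a constant regression basis $$f(\mathbf{x}) = 1$$ (so $$q=1$$), takes $$\boldsymbol{\ell}$$ and $$\sigma^{2}$$ as given, and the function names `corr` and `kriging_predict` are hypothetical.

```python
import numpy as np

def corr(A, B, ell):
    """Gaussian correlation (3.6) between the rows of A (n x d) and B (m x d)."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2 / ell
    return np.exp(-d2.sum(axis=2))

def kriging_predict(x, S, H, ell, sigma2):
    """Return (H_hat(x), sigma_K^2(x)) following (3.7) and (3.13).

    Assumption: constant regression basis f(x) = 1, so F is a column of ones.
    """
    x = np.atleast_2d(x)
    n = S.shape[0]
    F = np.ones((n, 1))
    f = np.ones(1)
    R = corr(S, S, ell)                     # (3.11)
    r = corr(S, x, ell)[:, 0]               # (3.12)
    Ri_F = np.linalg.solve(R, F)
    Ri_r = np.linalg.solve(R, r)
    A = F.T @ Ri_F                          # F^T R^{-1} F
    beta = np.linalg.solve(A, F.T @ np.linalg.solve(R, H))   # (3.10)
    resid = H - F @ beta
    mean = f @ beta + r @ np.linalg.solve(R, resid)          # (3.7)
    u = F.T @ Ri_r - f
    var = sigma2 * (1.0 + u @ np.linalg.solve(A, u) - r @ Ri_r)  # (3.13)
    return float(mean), float(var)
```

Because Kriging interpolates, evaluating this sketch at a training point should return the observed response with a variance that is zero up to rounding.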

#### Estimation of parameters

Given a choice of regression and correlation models, the optimal values of the parameters β, $${\sigma}^{2}$$ and $$\boldsymbol {\ell}$$ can be inferred using Maximum Likelihood Estimation (MLE). The MLE of β is the generalized least-squares estimate $$\widehat {\boldsymbol {\beta}}$$ given in (3.10) and the MLE of $$\sigma^{2}$$ (see [19, Sect. 3]) is

$$\widehat{\sigma}^{2} = \frac{1}{N_{\mathrm{tr}}}(\mathbf{H} - F\widehat{\boldsymbol { \beta }})^{T}R^{-1}(\mathbf{H} - F\widehat{\boldsymbol {\beta}}).$$
(3.14)

Using these $$\widehat{\boldsymbol {\beta}}$$ and $$\widehat{\sigma}^{2}$$ the optimal correlation coefficients $$\widehat{\boldsymbol {\ell}}$$ of the correlation function solve the following optimization problem

$$\widehat{\boldsymbol {\ell}} = \arg\min_{\boldsymbol {\ell}}|R|^{1/N_{\mathrm{tr}}}\widehat { \sigma}^{2},$$
(3.15)

where $$|R|$$ is the determinant of R. See .
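To make the objective in (3.15) concrete, the following sketch evaluates $$|R|^{1/N_{\mathrm{tr}}}\widehat{\sigma}^{2}$$ for a given $$\boldsymbol{\ell}$$ and picks the best value from a small grid. Several things here are assumptions for illustration only: the constant regression basis, the small `nugget` added to the diagonal of R for numerical stability, the toy data, and the crude grid search standing in for a proper optimizer. The name `mle_objective` is hypothetical.

```python
import numpy as np

def mle_objective(ell, S, H, nugget=1e-8):
    """|R|^(1/N_tr) * sigma_hat^2 from (3.15), with sigma_hat^2 as in (3.14)."""
    n = S.shape[0]
    d2 = (S[:, None, :] - S[None, :, :]) ** 2 / np.asarray(ell, dtype=float)
    R = np.exp(-d2.sum(axis=2)) + nugget * np.eye(n)   # nugget: stability assumption
    F = np.ones((n, 1))                                # constant basis assumption
    beta = np.linalg.solve(F.T @ np.linalg.solve(R, F),
                           F.T @ np.linalg.solve(R, H))
    resid = H - F @ beta
    sigma2 = resid @ np.linalg.solve(R, resid) / n     # (3.14)
    _, logdet = np.linalg.slogdet(R)                   # log|R| without overflow
    return float(np.exp(logdet / n) * sigma2)

# Toy one-dimensional data and an isotropic grid search (illustrative only).
S = np.linspace(-3.0, 3.0, 12)[:, None]
H = np.sin(S[:, 0])
grid = [0.1, 0.5, 1.0, 2.0, 5.0]
best_ell = min(grid, key=lambda l: mle_objective([l], S, H))
```

Working with the log-determinant avoids under- or overflow of $$|R|$$ when $$N_{\mathrm{tr}}$$ is in the hundreds, as it is in the algorithms below.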

## A hybrid importance sampling Monte Carlo approach

In this section we propose a HISMC approach. This approach is a modification of the hybrid approach proposed in [1, Sect. 4] and can be used for large circuits. Similar to , we split the ISMC Algorithm 2.1 into two phases: the first is the Exploration Phase, which includes steps 1 to 6 for estimating the optimal mean-shift $$\boldsymbol {\theta}^{*}$$, and the second is the Estimation Phase, which consists of step 7 for estimating the probability (2.8) using the optimal value $$\boldsymbol {\theta }^{*}$$ of the mean-shift θ. In the HISMC approach, we use a hybrid combination of LASSO, the Kriging model and the circuit model. We treat the two phases (the exploration phase and the estimation phase) separately.

### The exploration phase

The goal of the exploration phase is to find the optimal mean-shift $$\boldsymbol {\theta}^{*}$$. Here we replace the indicator function $$\mathbb{1}_{(H(\mathbf{x})\geq\gamma_{k})}$$ in step 4 of Algorithm 2.1 by an approximation $$\mathbb{1}_{(\widehat{H}(\mathbf{x})\geq\gamma_{k})}$$, where $$\widehat{H}(\mathbf{x})$$ is the Kriging prediction of the response $$H(\mathbf{x})$$ at some input x. Notice that the accuracy of the model $$\mathbb{1}_{(\widehat{H}(\mathbf{x})\geq\gamma_{k})}$$ is important and must be checked before its use. The Misclassification Error (MCR) is used as a measure of accuracy for the metamodel $$\mathbb{1}_{(\widehat{H}(\mathbf{x})\geq\gamma_{k})}$$:

$$\mathrm{MCR}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{ (\mathbb{1}_{H(\mathbf{x}_{i})\geq\gamma_{k}}\neq\mathbb{1}_{\widehat{H}(\mathbf{x}_{i})\geq\gamma_{k}} )}.$$
(4.1)

A leave-one-out cross validation technique for the Kriging model is proposed by  and it is used to estimate the MCR in this paper.
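As a small illustration of (4.1), the function below counts how often the surrogate and the true response fall on different sides of the level $$\gamma_{k}$$. The name `misclassification_rate` is hypothetical. In a leave-one-out setting, `h_pred[i]` would be the prediction at the i-th training point from a model fitted without that point; this pairing is an assumption about how the two ingredients are combined, not a description of the cited procedure.

```python
def misclassification_rate(h_true, h_pred, gamma_k):
    """MCR (4.1): fraction of points whose indicator differs between H and H_hat."""
    mismatches = sum((t >= gamma_k) != (p >= gamma_k)
                     for t, p in zip(h_true, h_pred))
    return mismatches / len(h_true)
```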

Before presenting our algorithm for the exploration phase, we list some preliminaries that help in understanding the algorithm.

### Remark 4.1

(Preliminaries)

1. (i) A new Kriging model is built at each level of the exploration phase, i.e., in general, the Kriging model at iteration k is different from the one at iteration $$k-1$$.

2. (ii) At each iteration k of the exploration phase, the feature selection (using LASSO) is performed before fitting a Kriging model. Note that for iteration k, the feature selection is performed on the reduced set of variables selected at iteration $$k-1$$.

3. (iii) The notation $$\operatorname{LHS} (\boldsymbol {\theta}^{(k-1)} \pm\underline {a} )$$ indicates the LHS in the interval $$[\boldsymbol {\theta}^{(k-1)} - \underline{a}, \boldsymbol {\theta}^{(k-1)}+\underline{a}]$$, where $$\boldsymbol {\theta} ^{(k-1)}\in\mathbb{R}^{d}$$ is the mean of the known pdf $$g^{\boldsymbol {\theta}^{(k-1)}}(\mathbf{x})$$ at iteration k and $$\underline{a} = [a, a, \ldots, a ]\in\mathbb{R}^{d}$$ is a vector with a user-defined positive integer a.

4. (iv) $$N_{\mathrm{tr}}$$, d and $$d_{r}$$ denote the size of a training set and the full and reduced dimensions of the input vector X, respectively.

5. (v) Define a design matrix $$\mathbf{S}^{(k)}=[\mathbf {s}_{i}^{(k)}]_{N_{\mathrm{tr}}\times d}$$ where $$\mathbf {s}_{i}^{(k)}\sim\operatorname{LHS} (\boldsymbol {\theta}^{(k-1)} \pm \underline {a} )$$ are $$N_{\mathrm{tr}}$$ training samples. In this paper, we use $$a=3$$ so that a surrogate model can be fitted to predict the response for inputs that lie within the 3-sigma range. For a Gaussian distribution, 99.7% (almost all) of the samples lie in this range.

6. (vi) Given the design matrix $$\mathbf{S}^{(k)}$$, the corresponding response vector is evaluated from the circuit model and is denoted by $$\mathbf{H}^{(k)}$$. Note that for iteration k, the columns of $$\mathbf {S}^{(k)}$$ corresponding to irrelevant variables (outcome of LASSO at iteration $$k-1$$) are set to zero (nominal values) before evaluating the outputs $$\mathbf{H}^{(k)}$$.

7. (vii) Introduce the training set $$\mathcal{D}_{r}^{(k)} = (\mathbf {S}_{r}^{(k)}, \mathbf{H}^{(k)} )$$ where $$\mathbf {S}_{r}^{(k)}\subseteq \mathbf{S}^{(k)}$$ is the reduced design matrix containing the columns of $$\mathbf{S}^{(k)}$$ corresponding to the important variables (outcome of the feature selection process).

8. (viii) In ISMC Algorithm 2.1, we have $$\rho=0.2$$ and $$m=1000$$ for estimating an intermediate level $$\gamma_{k}$$ at iteration k. Notice (from Fig. 1) that choosing a larger value of $$\gamma_{k}$$ gives faster convergence to the failure threshold γ. However, this requires a smaller ρ, which in turn needs a larger number m of samples. In a surrogate-based approach we use a cheap model instead of the full circuit model, so a large number m (say, 10,000) of simulations can be used, and therefore a small ρ (say, 0.05) is acceptable.

### Algorithm 4.1

(HISMC/Exploration phase)

1. Set $$k = 1$$, $$\boldsymbol {\theta}^{(k-1)}=\boldsymbol {\theta}^{(0)} = 0 \in \mathbb{R}^{d}$$, $$\rho=0.05$$, $$m=10\text{,}000$$, $$d_{r}^{(k-1)}=d$$ and $$a=3$$.

2. Choose $$N_{\mathrm{tr}}=200$$ if $$d_{r}^{(k-1)}<20$$, $$N_{\mathrm {tr}}=10d_{r}^{(k-1)}$$ if $$20\leq d_{r}^{(k-1)} \leq50$$, and $$N_{\mathrm{tr}}=500$$ otherwise.

3. Find the training set $$\mathcal{D}^{(k)} = (\mathbf {S}^{(k)}, \mathbf{H}^{(k)} )$$ at iteration k.

4. Perform feature selection using LASSO on the training set $$\mathcal{D}^{(k)}$$.

5. Find $$d_{r}^{(k)}$$ and $$\mathcal{D}_{r}^{(k)} = (\mathbf{S}_{r}^{(k)}, \mathbf {H}^{(k)})$$, the number of important variables and the reduced training set, respectively.

6. Fit a Kriging model using the training set $$\mathcal{D}_{r}^{(k)}$$.

7. Draw m iid random samples $$\mathbf{X}_{1}^{(k)}, \ldots, \mathbf {X}_{m}^{(k)} \sim f(\mathbf{x},\boldsymbol {\theta}^{(k-1)})$$ and evaluate $$\widehat{H}(\mathbf{X}_{1}^{(k)}), \ldots, \widehat{H}(\mathbf {X}_{m}^{(k)})$$ using the Kriging model.

8. Compute the $$(1-\rho)$$ sample quantile $$\gamma_{k}$$ of these predictions. If $$\gamma _{k}>\gamma$$, reset $$\gamma_{k}$$ to γ.

9. Introduce the Kriging-based variance criterion

    $$v_{m}^{(k)}(\boldsymbol {\theta}) = \frac{1}{m}\sum_{j=1}^{m} \bigl(\mathbb{1}_{\{\widehat{H}(\mathbf{X}_{j}^{(k)})\geq\gamma_{k}\}} \bigr)^{2}\, e^{-\mathbf{X}_{j}^{(k)}\cdot (\boldsymbol {\theta}-\boldsymbol {\theta}^{(k-1)} )+\frac{1}{2} (|\boldsymbol {\theta}|^{2}-|\boldsymbol {\theta}^{(k-1)}|^{2} )}$$

    and compute $$\boldsymbol {\theta}^{(k)} = \arg\min_{\boldsymbol {\theta}} v_{m}^{(k)}(\boldsymbol {\theta})$$.

10. If $$\gamma_{k} < \gamma$$, return to step 2 and proceed with $$k \gets k+1$$; otherwise save $$\boldsymbol {\theta}^{*} = \boldsymbol {\theta}^{(k)}$$, $$d_{r} = d_{r}^{(k)}$$ and the reduced set (say $$\mathbf{x}_{r}$$) of input variables.

11. Go to the estimation phase.
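Steps 8 and 9 of Algorithm 4.1 can be sketched as follows, under several stated assumptions. The surrogate predictions `h_hat` are passed in as an array. The exponent of the variance criterion is taken as $$-\mathbf{X}_{j}\cdot(\boldsymbol{\theta}-\boldsymbol{\theta}^{(k-1)})+\frac{1}{2}(|\boldsymbol{\theta}|^{2}-|\boldsymbol{\theta}^{(k-1)}|^{2})$$, i.e. the log-ratio of two unit-variance Gaussian densities with means $$\boldsymbol{\theta}^{(k-1)}$$ and $$\boldsymbol{\theta}$$. The minimizer is found with a damped fixed-point iteration on the stationarity condition $$\nabla v_{m}^{(k)}=0$$, which is a simple stand-in for whatever optimizer is actually used and is not guaranteed to converge in general. The function names are hypothetical.

```python
import numpy as np

def variance_criterion(theta, X, fail, theta_prev):
    """v_m^(k)(theta) for 0/1 indicators `fail` (squaring them changes nothing)."""
    expo = -X @ (theta - theta_prev) + 0.5 * (theta @ theta - theta_prev @ theta_prev)
    return float(np.mean(fail * np.exp(expo)))

def exploration_step(X, h_hat, theta_prev, gamma, rho=0.05, iters=200):
    """Return (gamma_k, theta_k) from samples X ~ N(theta_prev, I)."""
    # Step 8: (1 - rho) sample quantile, capped at the failure threshold.
    gamma_k = min(float(np.quantile(h_hat, 1.0 - rho)), gamma)
    Xf = X[h_hat >= gamma_k]                  # samples counted as failures
    theta = np.array(theta_prev, dtype=float)
    for _ in range(iters):
        # Setting the gradient to zero gives theta = weighted mean of Xf
        # with weights exp(-X_j . (theta - theta_prev)).
        logw = -Xf @ (theta - theta_prev)
        w = np.exp(logw - logw.max())         # shift for numerical stability
        target = w @ Xf / w.sum()
        theta = 0.5 * (theta + target)        # damping (assumption)
    return gamma_k, theta
```

Since $$v_{m}^{(k)}$$ is a sum of exponentials of convex quadratics it is convex in $$\boldsymbol{\theta}$$, so any standard convex optimizer could replace the fixed-point loop.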

### The estimation phase

Assuming that the optimal value $$\boldsymbol {\theta}^{*}$$ of the mean-shift θ is computed in the exploration phase, our next goal is to estimate the probability $$\mathrm {p}_{\mathrm {fail}}$$. To this end, we build an accurate surrogate-based probability estimator. One simple approach  is to replace the indicator function $$\mathbb{1}_{(H(\mathbf{Y})\geq\gamma)}$$ in (2.8) by its surrogate model $$\mathbb{1}_{(\widehat{H}(\mathbf{Y})\geq\gamma)}$$, where $$\widehat{H}(\mathbf{Y})$$ is the Kriging prediction of the response $$H(\mathbf{Y})$$ at some input $$\mathbf{Y} = \mathbf{X} +\boldsymbol {\theta}^{*}$$. Here the Kriging predictor is built on the training set with input vectors centered at $$\boldsymbol {\theta}^{*}$$. To get an impression of the accuracy of the probability estimator we would like to have a confidence interval. A candidate is the Pseudo Confidence Interval (PCI), which depends on the Kriging variance and is provided in . However, there is no proof that the true probability lies in the PCI. Moreover, the PCI can be very wide if the Kriging prediction has a large variance. To prevent loss of accuracy of the probability estimator, a hybrid approach is proposed by [23, 24] that combines the simulations of the Kriging model and the original system (the circuit model in our context). The original system is used only for the responses that are close to the failure threshold. In this section we use a similar approach based on the Kriging model. Unlike the hybrid approach in  where the first surrogate model is used to simulate, we check the accuracy of the model online and improve (re-build) it, if required. The authors in  demonstrate the benefits of using an improved surrogate model, which might be useful for our application. In this paper, we use an adaptive sampling technique to improve the Kriging model. We add some samples (adaptively) to the initial training set from the region of interest, see step 14 in Algorithm 4.2. Due to its interpolating nature, the Kriging model gives a better fit in that region after the improvement.

We start by drawing a training set $$\mathcal{D}^{*} = (\mathbf {S}^{*}, \mathbf{H}^{*})$$ where $$\mathbf{S}^{*}$$ is the $$N_{\mathrm {tr}}\times{d}$$ design matrix with rows representing the $$N_{\mathrm {tr}}$$ random vectors generated with the $$\operatorname{LHS} (\boldsymbol {\theta} ^{*} \pm\underline{a} )$$ and $$\mathbf{H}^{*}$$ is an $$N_{\mathrm {tr}}\times1$$ vector of corresponding responses. Then we perform the feature selection (using LASSO) and find the reduced training set $$\mathcal{D}_{r}^{*} = (\mathbf{S}_{r}^{*}, \mathbf{H}^{*} )$$ where $$\mathbf{S}_{r}^{*}\subseteq\mathbf{S}^{*}$$ is the reduced design matrix containing the columns of $$\mathbf{S}^{*}$$ corresponding to the important variables. Afterward, we build a Kriging model on the updated training set $$\mathcal{D}_{r}^{*}$$. Finally, we build a hybrid indicator function $$\mathcal{I}_{\gamma}(\mathbf{Y})$$, called an emulator, that combines the true indicator function $$\mathbb{1}_{(H(\mathbf{Y})\geq\gamma)}$$ and its surrogate $$\mathbb{1}_{(\widehat{H}(\mathbf{Y})\geq\gamma)}$$ based on an accept/reject criterion; see the next section.

#### The emulator

We define an interval known as the “margin of uncertainty”  around the threshold γ.

$$\mathbb{M} = \bigl\{ \mathbf{Y} : \bigl\vert \widehat{H}( \mathbf{Y}) - \gamma \bigr\vert \leq z_{\alpha/2}\widehat{ \sigma}_{K}(\mathbf{Y}) \bigr\} ,$$
(4.2)

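where $$\widehat{\sigma}_{K}(\mathbf{Y})$$ is the Kriging standard deviation at point Y, i.e., the square root of the Kriging variance (3.13), and $$z_{\alpha/2} = \varPhi^{-1}(1-\alpha/2)$$, with $$\varPhi ^{-1}$$ the inverse cumulative distribution function of the standard normal distribution. For $$\alpha= 0.01$$, for which $$z_{\alpha /2} = 2.58$$, we assume that there is a 99% chance that the true value $$H(\mathbf{Y})$$ lies within $$z_{\alpha/2}\widehat{\sigma}_{K}(\mathbf{Y})$$ of the prediction $$\widehat{H}(\mathbf{Y})$$. A prediction in the interval $$\gamma-z_{\alpha/2}\widehat{\sigma}_{K}(\mathbf {Y})\leq\widehat{H}(\mathbf{Y})\leq\gamma+z_{\alpha/2}\widehat {\sigma}_{K}(\mathbf{Y})$$ therefore cannot be classified reliably against the threshold γ.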

The accept/reject regions are indicated in Fig. 2 which says that we accept the simulation of the Kriging predictor $$\widehat {H}(\mathbf{Y})$$ if $$\widehat{H}(\mathbf{Y})$$ is reasonably away ($$\mathbf{Y}\notin\mathbb{M}$$) from the failure threshold γ and we reject $$\widehat{H}(\mathbf{Y})$$ if it is close ($$\mathbf{Y}\in \mathbb{M}$$) to γ. In the latter case the circuit model must be used.

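The mathematical formulation of the emulator, following the accept/reject rule above, is given as follows

$$\mathcal{I}_{\gamma}(\mathbf{Y}) = \begin{cases} \mathbb{1}_{(\widehat{H}(\mathbf{Y})\geq\gamma)} & \text{if } \mathbf{Y}\notin\mathbb{M}, \\ \mathbb{1}_{(H(\mathbf{Y})\geq\gamma)} & \text{if } \mathbf{Y}\in\mathbb{M}. \end{cases}$$
(4.3)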
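The accept/reject rule of the emulator can be sketched as a single function. The sketch assumes that `sigma_k` is the Kriging standard deviation at the point, that `full_model` is a callable standing in for the expensive circuit simulation, and that the function name and the returned flag are illustrative choices rather than part of the method.

```python
from statistics import NormalDist

def emulator_indicator(y, h_hat, sigma_k, gamma, full_model, alpha=0.01):
    """Return (indicator, used_full_model) for one shifted sample y.

    The surrogate value h_hat is accepted when it is outside the margin of
    uncertainty (4.2); otherwise the full model is evaluated.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 2.58 for alpha = 0.01
    if abs(h_hat - gamma) <= z * sigma_k:
        return (1 if full_model(y) >= gamma else 0), True
    return (1 if h_hat >= gamma else 0), False
```

Returning the flag makes it easy to count how many full simulations were spent, which is the cost that the hybrid approach tries to keep small.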

#### Probability estimator

The emulator-based probability estimator $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}}$$ of the failure probability $$\mathrm {p}_{\mathrm {fail}}$$ is given by

$$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}= \frac{1}{N} \sum_{i=1}^{N} \mathcal{I}_{\gamma}( \mathbf {Y}_{i})\, e^{-\boldsymbol {\theta}^{*} \mathbf{X}_{i} - \frac{1}{2}|\boldsymbol {\theta}^{*}|^{2}}$$
(4.4)

and its variance can be estimated by

\begin{aligned} \hat{\sigma}_{E}^{2} &= \frac{1}{N} \operatorname{Var} \bigl(\mathcal {I}_{\gamma }(\mathbf{Y}) \,e^{-\boldsymbol {\theta}^{*} \mathbf{X} - \frac{1}{2}|\boldsymbol {\theta }^{*}|^{2}} \bigr) \\ &= \frac{1}{N}\mathbb{E} \bigl[ \bigl(\mathcal{I}_{\gamma}(\mathbf {Y})\, e^{-\boldsymbol {\theta}^{*} \mathbf{X} - \frac{1}{2}|\boldsymbol {\theta}^{*}|^{2}} - \hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}} \bigr)^{2} \bigr] \\ &\approx\frac{1}{N} \Biggl[\frac{1}{N-1} \Biggl(\sum _{i=1}^{N} \bigl(\mathcal{I}_{\gamma}( \mathbf{Y}_{i})\,e^{-\boldsymbol {\theta}^{*} \mathbf {X}_{i} - \frac{1}{2}|\boldsymbol {\theta}^{*}|^{2}} \bigr)^{2} - N \bigl(\hat{ \mathrm {p}}_{\mathrm{fail}}^{\mathrm{E}} \bigr)^{2} \Biggr) \Biggr]. \end{aligned}
(4.5)

Using (4.4) and (4.5), the $$100(1-\alpha')\%$$ confidence interval for the true probability $$\mathrm {p}_{\mathrm {fail}}$$ is defined as

$$\mathrm {CI}= \bigl[\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}-z_{\alpha '/2} \hat{\sigma}_{E},\, \hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {E}}+z_{\alpha'/2} \hat {\sigma}_{E}\bigr]$$
(4.6)

and the accuracy of the probability estimator $$\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{E}}$$ can be measured by the coefficient of variation

$$\mathrm {CV}= z_{\alpha'/2}\,\frac{\hat{\sigma}_{E}}{\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{E}}}.$$
(4.7)

The 95% confidence interval is the most commonly used interval that corresponds to $$z_{\alpha'/2}=1.96$$ for $$\alpha'=0.05$$. The accepted probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}$$ is the one with a coefficient of variation $$\mathrm {CV}\leq 10\%$$.
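Formulas (4.4) to (4.7) translate directly into code once the emulator values $$\mathcal{I}_{\gamma}(\mathbf{Y}_{i})$$ and the likelihood weights are available. The sketch below assumes both are given as plain Python sequences; the function name `is_estimate` is hypothetical.

```python
import math

def is_estimate(indicators, weights, z=1.96):
    """Return (p_hat, var_hat, ci, cv) following (4.4)-(4.7).

    indicators: emulator values I_gamma(Y_i) in {0, 1}
    weights:    likelihood weights exp(-theta* . X_i - |theta*|^2 / 2)
    """
    n = len(weights)
    tau = sum(i * w for i, w in zip(indicators, weights))
    eta = sum(i * w * w for i, w in zip(indicators, weights))
    p_hat = tau / n                                        # (4.4)
    var_hat = (eta - n * p_hat ** 2) / (n * (n - 1))       # (4.5)
    se = math.sqrt(max(var_hat, 0.0))
    ci = (p_hat - z * se, p_hat + z * se)                  # (4.6)
    cv = z * se / p_hat if p_hat > 0 else math.inf         # (4.7)
    return p_hat, var_hat, ci, cv
```

With `z=1.96` the returned interval is the 95% confidence interval, and the estimate would be accepted when `cv <= 0.1`.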

Combining the information from the exploration and estimation phases, the full HISMC algorithm can be formulated as follows:

### Algorithm 4.2

(HISMC/Estimation phase)

1. Use Algorithm 4.1 for finding the optimal mean-shift $$\boldsymbol {\theta}^{*}$$, the number $${d_{r}}$$ of important variables and the reduced set of (important) input variables $$\mathbf{x}_{r}$$.

2. Initialize the estimation phase:

    (a) Set $$a=3$$ and choose $$N_{\mathrm{tr}}=200$$ if $$d_{r}<20$$, $$N_{\mathrm{tr}}=10d_{r}$$ if $$20\leq d_{r} \leq50$$, and $$N_{\mathrm{tr}}=500$$ otherwise.

    (b) Set the iteration parameter $$l=0$$, the maximum number of iterations $$l_{\mathrm{max}}=10$$, the number $$n=1000$$ of simulations used at iteration l, the total number $$N=0$$ of simulations before iteration l, $$z_{\alpha/2}=2.58$$, $$z_{\alpha'/2}=1.96$$, the tolerance of the CV $$\mathrm{tol\_cv} = 0.1$$, $$\tau=0$$ and $$\eta=0$$.

3. Get the training set $$\mathcal{D}^{*} = (\mathbf{S}^{*}, \mathbf {H}^{*} )$$ where $$\mathbf{S}^{*}=[\mathbf{s}_{i}^{*}]_{N_{\mathrm {tr}}\times d}$$ with $$\mathbf{s}_{i}^{*}\sim\operatorname{LHS} (\boldsymbol {\theta }^{*} \pm\underline{a} )$$ and $$\mathbf{H}^{*}$$ is a vector of corresponding responses (see Footnote 3).

4. Perform feature selection using LASSO on the training set $$\mathcal{D}^{*}$$.

5. Update the number $${d_{r}}$$ of important parameters and find the reduced training set $$\mathcal{D}_{r}^{*} = (\mathbf{S}_{r}^{*}, \mathbf {H}^{*} )$$. See the preliminaries for Algorithm 4.1.

6. Fit a Kriging model.

7. Draw n iid random samples $$\mathbf{X}_{i}^{(l)}\sim g(\mathbf {x})$$ and shift them to $$\mathbf{Y}_{i}^{(l)}=\mathbf{X}_{i}^{(l)}+\boldsymbol {\theta}^{*}$$.

8. Find the Kriging predictions $$\widehat{H} (\mathbf {Y}_{i}^{(l)} )$$ and Kriging variances $$\widehat{\sigma}_{K}^{2} (\mathbf{Y}_{i}^{(l)} )$$ using (3.7) and (3.13), respectively.

9. Compute the likelihood weights $${w}_{i}^{(l)}= e^{-\boldsymbol {\theta}^{*} \mathbf{X}_{i}^{(l)} - \frac{1}{2}|\boldsymbol {\theta}^{*}|^{2}}$$ and $${v}_{i}^{(l)} = ({w}_{i}^{(l)} )^{2}$$ for all $$i=1,\ldots, n$$.

10. Determine $$\tau\leftarrow\tau+ \sum_{i=1}^{n}w_{i}^{(l)}\mathcal {I}_{\gamma}(\mathbf{Y}_{i}^{(l)})$$, $$\eta\leftarrow\eta+ \sum_{i=1}^{n}v_{i}^{(l)}\mathcal{I}_{\gamma}(\mathbf{Y}_{i}^{(l)})$$ and $$N \leftarrow N + n$$.

11. Compute $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}= \frac {\tau}{N}$$ and $$\hat{\sigma}_{E}^{2} = \frac {1}{N(N-1)} (\eta- N (\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}} )^{2} )$$.

12. Calculate $$\mathrm {CV}= z_{\alpha'/2}\frac{\hat{\sigma}_{E}}{\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{E}}}$$.

13. If $$\mathrm {CV}> \mathrm{tol\_{cv}}$$ and $$l < l_{\mathrm{max}}$$, go to step 14; otherwise go to step 15.

14. Take the $$N_{\mathrm{full}}$$ full simulations that were run to compute $$\mathcal{I}_{\gamma}(\mathbf{Y}_{i}^{(l)})$$, see definition (4.3). Select some points (say, $$\min\{10, N_{\mathrm{full}}\}$$; see Footnote 4) uniformly among all $$N_{\mathrm{full}}$$ simulations. Add these samples to the training set and rebuild the Kriging model with the updated training set. Go to step 6 with $$l\leftarrow l+1$$.

15. Determine the 95% confidence interval

    $$\mathrm {CI}= \bigl[\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {E}}-z_{\alpha'/2}\hat{ \sigma}_{E},\, \hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}}+z_{\alpha '/2} \hat{\sigma}_{E} \bigr].$$

16. Save the probability $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}}$$, the variance $$\hat{\sigma}_{E}^{2}$$, the CI and the CV.

17. End of the algorithm.
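The control flow of steps 7 to 13 can be illustrated with a one-dimensional toy problem in which the exact answer is known. Everything specific in the sketch is an assumption made for illustration: the "circuit model" is $$H(y)=y$$, the "surrogate" is $$H$$ plus a small bounded perturbation with a constant standard deviation, the shift is $$\theta^{*}=\gamma=4$$, and the surrogate rebuild of step 14 is omitted. The name `hybrid_estimate` is hypothetical.

```python
import math
import random
from statistics import NormalDist

def hybrid_estimate(full_model, surrogate, theta, gamma, n=1000, l_max=10,
                    tol_cv=0.1, z_margin=2.58, z_ci=1.96, seed=0):
    """Batch loop of the estimation phase in one dimension.

    surrogate(y) must return (h_hat, sigma_k). Returns (p_hat, cv, N, n_full).
    """
    rng = random.Random(seed)
    tau = eta = 0.0
    N = n_full = 0
    for _ in range(l_max + 1):                 # iterations l = 0, ..., l_max
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)            # X ~ g
            y = x + theta                      # shifted sample Y
            w = math.exp(-theta * x - 0.5 * theta * theta)
            h_hat, sigma_k = surrogate(y)
            if abs(h_hat - gamma) <= z_margin * sigma_k:
                n_full += 1                    # inside the margin: full model
                fails = full_model(y) >= gamma
            else:
                fails = h_hat >= gamma         # outside: trust the surrogate
            if fails:
                tau += w
                eta += w * w
        N += n
        p_hat = tau / N
        var = (eta - N * p_hat ** 2) / (N * (N - 1))
        cv = z_ci * math.sqrt(max(var, 0.0)) / p_hat if p_hat > 0 else math.inf
        if cv <= tol_cv:
            break
    return p_hat, cv, N, n_full

# Toy setup (assumed): exact failure probability is 1 - Phi(4).
p_hat, cv, N, n_full = hybrid_estimate(
    full_model=lambda y: y,
    surrogate=lambda y: (y + 0.05 * math.sin(5.0 * y), 0.05),
    theta=4.0, gamma=4.0)
p_exact = NormalDist().cdf(-4.0)
```

In this toy setup the surrogate error never exceeds the margin width, so every sample the surrogate could misclassify is re-evaluated with the full model, and only a fraction of the N samples needs a full simulation.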

## Results and discussion

In this paper two realistic circuits are used for validation purposes. The first one is a VCO with 1500 statistical input parameters and scalar response ‘oscillation frequency’ and the second one is a memory cell with 2096 statistical input parameters and scalar response ‘read delay’. The ISMC and HISMC algorithms are repeated 100 times, and the empirical results for both the exploration and estimation phases are compared. In the exploration phase we compare the optimal mean-shift computed by both algorithms and the required number of simulations. In the estimation phase, on the other hand, we compute the probability, and therefore we need to measure the efficiency of the probability estimator. When we replace the circuit model with a surrogate model, the simulation time becomes negligible, and one can take only true simulation runs into account for estimating the efficiency of the algorithm. However, the surrogate model introduces a bias that may be very large. Therefore, we use the following procedure to measure the efficiency of the emulator-based probability estimator.

1. Get a reference probability: Let $$\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{M}}$$ be a probability estimator of a method M for estimating the probability $$\mathrm {p}_{\mathrm {fail}}$$. To estimate empirically the bias in $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}$$ we need a reference probability $$\mathrm{p}_{\mathrm{fail}}^{\mathrm{ref}}$$. A simple estimate $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}$$ of $$\mathrm{p}_{\mathrm{fail}}^{\mathrm{ref}}$$ can be obtained by running ISMC Algorithm 2.1 with a small coefficient of variation (say less than 1%).

2. Perform $$N_{\mathrm{rep}}$$ experiments of method M: Notice that a probability generated from the MC estimator $$\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$ is a random number and thus comparing a single outcome of $$\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$ with the reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}$$ is not valid. Hence, we perform $$N_{\mathrm{rep}}$$ independent experiments of the method M and store $$N_{\mathrm{rep}}$$ outcomes $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}(i)$$ and their confidence intervals $$\mathrm {CI}^{\mathrm{M}}(i)$$ for $$i=1,\ldots, N_{\mathrm{rep}}$$.

3. Then we compute

    (a) Relative bias: The relative bias $$\epsilon_{\mathrm {rel}}$$ of the estimator $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$ with respect to the reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}$$ is

    $$\epsilon_{\mathrm{rel}}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {M}} \bigr) = \frac{\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}-\operatorname {mean}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}} )}{|\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} |}\times100\%,$$
    (5.1)

    where

    $$\operatorname {mean}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) = \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm{rep}}} \hat{ \mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}(i).$$
    (5.2)

    Note that we do not use the absolute value for the numerator in (5.1), since its sign indicates whether $$\hat{\mathrm {p}}_{\mathrm{fail}}^{\mathrm{M}}$$ underestimates or overestimates the reference value.

    (b) Central Coverage Probability (CCP): The CCP for the estimator $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$, which is the probability that $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}$$ lies within $$\mathrm {CI}^{\mathrm{M}}$$, is given by

    $$\mathrm{CCP}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr)=\frac{1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm{rep}}}\mathbb{1}_{\{\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}\in\mathrm{CI}^{\mathrm{M}}(i)\}}.$$
    (5.3)

    For a 95% confidence interval $$\mathrm {CI}^{\mathrm{M}}$$, an unbiased estimator $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$ and a large $$N_{\mathrm{rep}}$$, the CCP must be approximately 0.95. However, for a biased estimator the CCP might be smaller than 0.95. We assume that a (biased) estimator is good enough if its CCP is not lower than 0.90, i.e., a 5% error in the confidence interval is acceptable.

    (c) Mean Squared Error (MSE): The MSE of $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}$$ is computed as

    $$\mathrm {MSE}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) = \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm {rep}}} \bigl(\hat{ \mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}(i)-\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} \bigr)^{2}.$$
    (5.4)

    Note that $$\mathrm {MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}})$$ can be written as

    \begin{aligned} \mathrm {MSE}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) &= \frac {1}{N_{\mathrm{rep}}}\sum_{i=1}^{N_{\mathrm{rep}}} \bigl(\hat{ \mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}(i)-\operatorname {mean}\bigl(\hat{ \mathrm{p}}_{\mathrm {fail}}^{\mathrm{M}}\bigr) \bigr)^{2} + \bigl( \operatorname {mean}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr)-\hat { \mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}} \bigr)^{2} \\ &= \operatorname {Var}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M}}\bigr) + \bigl( \mathrm{bias}\bigl(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {M}}\bigr) \bigr)^{2}. \end{aligned}
    (5.5)

    Thus the MSE is the sum of the variance and the squared bias of the estimator, which provides a useful way to estimate the efficiency of a biased estimator (see item 4 below).

4. Estimate the Efficiency Metric: We now introduce an efficiency estimator denoted by $$\widehat{\operatorname {Eff}}$$ and given as

    $$\widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2})=\frac{\mathrm{MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M1}})}{\mathrm{MSE}(\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{M2}})}\,\frac{\overline{T}_{\mathrm{M1}}}{\overline{T}_{\mathrm{M2}}},$$
    (5.6)

    where $$\overline{T}_{\mathrm{M1}}$$ and $$\overline{T}_{\mathrm{M2}}$$ are the average computational costs (CPU time) required for method $$\mathrm {M1}$$ and $$\mathrm{M2}$$, respectively.

    If $$\widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2}) = \kappa>1$$, it means that method $$\mathrm{M1}$$ requires κ times more computational cost than $$\mathrm{M2}$$ to obtain the same accuracy. If $$\widehat{\operatorname {Eff}}(\mathrm{M1},\mathrm{M2})>1$$ then estimator $$\mathrm{M2}$$ is preferred to $$\mathrm{M1}$$.
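The four quantities of this procedure, (5.1), (5.3), (5.4) and (5.6), are simple enough to sketch directly. The function names below are hypothetical; the inputs are the stored outcomes and confidence intervals of the $$N_{\mathrm{rep}}$$ repetitions.

```python
def relative_bias(p_runs, p_ref):
    """Relative bias (5.1) in percent; positive means underestimation."""
    mean_p = sum(p_runs) / len(p_runs)
    return (p_ref - mean_p) / abs(p_ref) * 100.0

def coverage(cis, p_ref):
    """CCP (5.3): fraction of intervals (lo, hi) that contain p_ref."""
    return sum(lo <= p_ref <= hi for lo, hi in cis) / len(cis)

def mse(p_runs, p_ref):
    """Mean squared error (5.4) with respect to the reference probability."""
    return sum((p - p_ref) ** 2 for p in p_runs) / len(p_runs)

def efficiency(mse_1, time_1, mse_2, time_2):
    """Efficiency metric (5.6) of method 1 relative to method 2."""
    return (mse_1 * time_1) / (mse_2 * time_2)
```

A value of `efficiency(...)` above 1 favours the second method, matching the interpretation of κ given above.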

### The VCO

#### Results of the exploration phase

Figure 3 shows the mean-shift $$\boldsymbol {\theta}^{(k)}$$ and the target level $$\gamma_{k}$$ computed at each iteration of the exploration phase. Note that we plot the empirical means of $$\| \boldsymbol {\theta}^{(k)}\|$$ and $$\gamma_{k}$$ from the ISMC and HISMC algorithms repeated 100 times. In both plots, the blue dotted curves with asterisks stand for the ISMC approach and the red dotted curves with squares are for the HISMC approach. We start with $$\boldsymbol {\theta}^{(0)} = 0$$ and compute the pair $$(\gamma_{k}, \boldsymbol {\theta}^{(k)})$$ iteratively. It can be seen from the right plot that the ISMC algorithm takes 7 iterations to reach the target $$\gamma=1900$$, whereas the HISMC algorithm takes only 4 iterations. Indeed, in the case of ISMC we used $$\rho=0.2$$ and $$m=1000$$ for estimating the intermediate levels $$\gamma_{k}$$. In HISMC, on the other hand, we replace the expensive circuit model by the Kriging model, which allows us to use more samples ($$m=10\text{,}000$$), so that a smaller value of ρ (0.05) is acceptable. Clearly, HISMC makes bigger steps, which results in fewer iterations. In the left plot the estimated norm of the mean-shift corresponding to $$\gamma_{k}$$ is shown at level k. From Tables 1 and 2, the last $$\|\boldsymbol {\theta}^{(k)}\|$$ of both methods are 6.42 and 6.44. The HISMC has only 0.3% relative error with respect to ISMC and thus we can see in Fig. 3 that the last $$\|\boldsymbol {\theta}^{(k)}\|$$ of both methods lie approximately on the same (black horizontal) line.

Tables 1 and 2 present the numerical results for the exploration phase of the ISMC and the HISMC algorithm, respectively. The first, second, third and fourth columns give the iteration number k, the number of full simulations, the intermediate level $$\gamma_{k}$$ and the mean-shift $$\boldsymbol {\theta}^{(k)}$$ per iteration, respectively. The fifth and sixth columns in Table 2 give the reduced dimension $$d_{r}$$ and the LOO-MCR (leave-one-out misclassification error) at iteration k, respectively. Recalling the training sample rule (see Algorithm 4.1), we draw 500 training samples at iteration $$k=1$$ (see the second column) since $$d=1500$$, which is greater than 50. The training sample size at iteration $$k+1$$ depends on the reduced dimension $$d_{r}$$ (fifth column) at iteration k. Further, the LOO-MCR is the error (4.1) of the Kriging model at iteration k, estimated by leave-one-out cross validation. We can see that the maximum error is 1% (for $$k=3$$ and 4).

Moreover, it can be seen that ISMC requires a total of 7000 simulations to estimate the optimal mean-shift, whereas HISMC requires only 1260 full simulations.

#### Results of the estimation phase

Given the optimal mean-shift from the exploration phase, we estimate the failure probability $$\mathrm {p}_{\mathrm {fail}}$$. First we get a reference probability $$\mathrm{p}_{\mathrm{fail}}^{\mathrm{ref}}$$ by running the ISMC algorithm with a small (less than 1%) coefficient of variation. The reference results are shown in Table 3. The probability $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{ref}}$$ is used to measure the bias, the MSE and thus the efficiency of the HISMC method.

Now we present the empirical results of the estimation phase of the ISMC and HISMC algorithms. Recall that these results are based on 100 experiments, i.e., both algorithms are repeated 100 times with a different seed of the random generator each time. It is worth mentioning that we perform a feature selection before fitting the Kriging model in the estimation phase of the HISMC algorithm and that we also rebuild the Kriging model several times during the process. The average number of important parameters is $${d_{r}}=23$$. Before the probability converges to the required accuracy, we rebuild the Kriging model 4 times by adding some samples from the region near the failure threshold. Table 4 presents the results of the estimation phase of the ISMC and HISMC algorithms. The mean probability, the average coefficient of variation ($$\mathrm {CV}(\%)$$) and the mean squared error (MSE) are computed for both techniques. We also computed the relative bias ($$\epsilon_{\mathrm{rel}}(\%)$$) and the central coverage probability (CCP) for the HISMC technique. The last column indicates the total number (#Runs) of true simulations used in the estimation phase.

The results in Table 4 are visualized in Fig. 4. The left and right-hand side plots show the probability distributions from $$N_{\mathrm{rep}}=100$$ experiments (repetitions) of the ISMC and HISMC methods, respectively. The vertical solid black line in the center represents the mean of the $$N_{\mathrm{rep}}$$ random values of the failure probability. The dotted line in the center is the reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}$$. The two dashed lines around the center represent the 95% confidence interval. Clearly, the reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}$$ lies within the 95% empirical CI for both methods. Moreover, the difference between the mean and the reference probability gives the bias in the probability estimator. For the HISMC estimator (right-hand side plot), a bias ($$\epsilon_{\mathrm {rel}}=2.92\%$$) can be noticed. Because of this bias, the CCP is equal to 0.92 (see Table 4), which is smaller than 0.95 (the desired probability for a 95% CI). We assume that for $$N_{\mathrm{rep}}=100$$ a CCP with at most 5% error is acceptable. For a more accurate estimation of the bias $$\epsilon _{\mathrm{rel}}$$ and the CCP we need to perform a larger number $$N_{\mathrm{rep}}$$ (say 1000) of experiments. Finally, we estimate the efficiency of HISMC with respect to the ISMC method. Recalling formula (5.6), we require the mean squared error and the average CPU time for both methods. The mean squared error is given in Table 4. The average CPU times required for ISMC and HISMC are $$\bar{T}_{\mathrm{ISMC}}= 82\text{,}239~\mbox{s}$$ and $$\bar{T}_{\mathrm{HISMC}}= 9676~\mbox{s}$$, respectively. Thus, we get

$$\widehat{\operatorname {Eff}}(\mathrm{ISMC}, \mathrm{HISMC}) = \frac{\mathrm {MSE}(\hat {\mathrm{p}}_{\mathrm{fail}}^{\mathrm{IS}} )\times\bar{T}_{\mathrm{ISMC}}}{\mathrm {MSE}(\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{E}})\times\bar{T}_{\mathrm {HISMC}}} = \frac{2.34\times10^{-23}\times82\text{,}239}{3.28\times 10^{-23}\times9676} \approx6.$$

This means that ISMC requires approximately 6 times more CPU time than HISMC to achieve the same accuracy. Hence, HISMC gives a speedup of about a factor of 6.
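The arithmetic behind this efficiency ratio is easy to check. The short Python sketch below uses the MSE and CPU-time values quoted above for the VCO; the function name `efficiency` and its argument names are chosen here for illustration only.

```python
def efficiency(mse_ref, time_ref, mse_new, time_new):
    """Ratio of (MSE x average CPU time) of a reference method to a new method.

    A value above 1 means the new method reaches the same accuracy
    with less CPU time than the reference method.
    """
    return (mse_ref * time_ref) / (mse_new * time_new)


# Values quoted in the text for the VCO example.
eff_vco = efficiency(
    mse_ref=2.34e-23, time_ref=82239.0,   # ISMC
    mse_new=3.28e-23, time_new=9676.0,    # HISMC
)
print(f"Eff(ISMC, HISMC) = {eff_vco:.2f}")
```

The ratio evaluates to about 6.06, which is rounded to 6 in the text. HISMC has a slightly larger MSE than ISMC here, so the gain comes entirely from the much shorter CPU time.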

### The memory cell

#### Results of the exploration phase

As for the VCO, both the ISMC and HISMC algorithms were repeated $$N_{\mathrm{rep}}=100$$ times with a different seed of the random generator each time. The mean results (average of 100 experiments) of the exploration phase are given in this section.

Figure 5 shows the mean-shift $$\boldsymbol {\theta}^{(k)}$$ and the intermediate failure thresholds $$\gamma_{k}$$ computed at each iteration of the exploration phase for the memory cell. Here the target failure threshold $$\gamma= 902$$ is reached in 6 and 4 iterations (right plot) for the two methods. The final values of $$\|\boldsymbol {\theta}^{(k)}\|$$, known as the optimal mean-shift, are 5.31 and 5.32 for the ISMC and HISMC methods, respectively. Thus, $$\|\boldsymbol {\theta }^{(k)}\|$$ converges to essentially the same level for both methods. We can therefore say that HISMC gives a good estimate of the mean-shift computed with ISMC.

Tables 5 and 6 show the numerical results from the exploration phase of the ISMC and HISMC, respectively. It can be seen from the tables that HISMC requires 1520 full simulations (used for training the Kriging models) for estimating the optimal mean-shift. On the other hand, ISMC requires 6000 simulations.

#### Results of the estimation phase

The reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm {ref}}$$ is shown in Table 7. Again, $$\hat{\mathrm{p}}_{\mathrm {fail}}^{\mathrm{ref}}$$ is computed using the ISMC algorithm with a small coefficient of variation.

The empirical results of the estimation phase with $$N_{\mathrm {rep}}=100$$ experiments of the ISMC and HISMC methods are shown in Table 8. It is worth noting that, in HISMC, we perform a dimension reduction (using LASSO) before fitting the initial Kriging model in the estimation phase. The average number of important parameters is $${d_{r}}=33$$. Moreover, we rebuild the Kriging model 4 times during the process by adding samples from the region near the failure threshold. Furthermore, we see in the table that the relative bias ($$\epsilon_{\mathrm{rel}}$$) for HISMC is small, and therefore we have a good estimation of the CCP.
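To illustrate how LASSO can pick out a small set of important parameters before a surrogate is fitted, the following Python sketch runs a plain coordinate-descent LASSO on synthetic data. It is a toy example under stated assumptions, not the implementation used in this work: the data, the penalty `lam`, the three "important" indices and all function names are hypothetical.

```python
import random


def soft_threshold(value, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)  # y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of column j with the residual that leaves out
            # the current contribution of parameter j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic data: 20 parameters, of which only three drive the response.
rng = random.Random(0)
n_samples, n_params = 200, 20
true_coef = {0: 3.0, 5: -2.0, 12: 1.5}  # hypothetical important parameters
X = [[rng.gauss(0.0, 1.0) for _ in range(n_params)] for _ in range(n_samples)]
y = [
    sum(c * row[j] for j, c in true_coef.items()) + 0.1 * rng.gauss(0.0, 1.0)
    for row in X
]

beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected parameters:", selected)
```

The parameters with nonzero coefficients form the reduced input set, and a surrogate such as a Kriging model would then be fitted on those columns only. The penalty value controls how many parameters survive; here it is set well above the noise level so that only the truly active ones remain.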

As for the VCO, the probability distribution plots from $$N_{\mathrm {rep}}=100$$ experiments (repetitions) of the ISMC and HISMC methods are shown in Fig. 6. Clearly, the reference probability $$\hat{\mathrm{p}}_{\mathrm{fail}}^{\mathrm{ref}}$$ lies within the 95% empirical CI for both methods. Moreover, the difference between the mean and the reference probability gives the bias of the probability estimator. Compared to the VCO, we see here a smaller bias in the HISMC estimator. This difference can be attributed to the small number $$N_{\mathrm{rep}}=100$$ of experiments. For an accurate estimation of the bias $$\epsilon_{\mathrm{rel}}$$ and the CCP, a larger number $$N_{\mathrm{rep}}$$ (say 1000) of experiments is needed.

Finally, we estimate the efficiency of HISMC with respect to ISMC. The mean squared error is given in Table 8. The average CPU times required for ISMC and HISMC are $$\bar {T}_{\mathrm {ISMC}}= 53\text{,}448~\mbox{s}$$ and $$\bar{T}_{\mathrm{HISMC}}= 7922~\mbox{s}$$, respectively. Thus, we get

$$\widehat{\operatorname {Eff}}(\mathrm{ISMC}, \mathrm{HISMC}) = \frac{3.65\times 10^{-18}\times53\text{,}448}{3.70\times10^{-18}\times7922} \approx 6.7.$$

This means that ISMC requires approximately 6.7 times more CPU time than HISMC to achieve the same accuracy. Hence, we get a speedup of about a factor of 6.7, and therefore we prefer HISMC over ISMC.

## Conclusion and future work

In this paper we proposed a HISMC approach for yield estimation of circuits having a very large number of input variables and a scalar response. We assumed that only a few (say fewer than 35) of the input variables are important. The HISMC approach uses a feature selection method (LASSO) to reduce the dimension of the input variables of the underlying problem, which allows us to fit the Kriging model on the reduced dimension. The Kriging model is used for most of the simulations and significantly reduces the number of runs of the expensive circuit model. Although it is hard or even impossible to quantify the bias in the probability estimator, the emulator prevents loss of accuracy by using true simulations near the failure threshold. In future work we will try to compare the HISMC approach (in terms of efficiency and robustness) with a hybrid approach proposed by  and with commercially available methods (e.g., Solido® and MunEDA®). More focus will be on multi-input and multi-output circuits, especially on outputs $$H(\mathbf{x})$$ with more constraints involved.
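The idea of trusting the surrogate away from the failure threshold and falling back to the true model near it can be sketched as follows. This is a minimal Python illustration under simplifying assumptions, not the algorithm of this paper: the two model functions are stand-ins, failure is taken as a response at or above the threshold, and a fixed `margin` is used in place of a margin of uncertainty derived from the surrogate.

```python
import math


def true_model(x):
    """Stand-in for the expensive circuit simulation (hypothetical)."""
    return x * x


def surrogate(x):
    """Stand-in for a cheap surrogate whose error is at most 0.05."""
    return x * x + 0.05 * math.sin(10.0 * x)


def hybrid_fail_indicator(x, gamma, margin, counter):
    """Return True if the response is classified as a failure (>= gamma).

    The surrogate is trusted away from the threshold; within `margin`
    of gamma the true model is evaluated instead.
    """
    y_hat = surrogate(x)
    if abs(y_hat - gamma) <= margin:
        counter["true_calls"] += 1
        return true_model(x) >= gamma
    return y_hat >= gamma


gamma = 4.0    # failure threshold (hypothetical)
margin = 0.1   # fixed margin, chosen larger than the surrogate error
xs = [i / 100.0 for i in range(301)]  # deterministic points in [0, 3]

counter = {"true_calls": 0}
hybrid = [hybrid_fail_indicator(x, gamma, margin, counter) for x in xs]
exact = [true_model(x) >= gamma for x in xs]

print("misclassified points:", sum(h != e for h, e in zip(hybrid, exact)))
print("true-model calls:", counter["true_calls"], "of", len(xs))
```

Because the margin exceeds the surrogate error in this toy setting, every point is classified exactly as the true model would classify it, while the true model is called only for the few points close to the threshold. In practice the surrogate error is not known, which is why the bias of such a scheme is hard to quantify.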

1. We use the SPICE-like Eldo® simulator from Mentor Graphics® to perform the circuit simulations.

2. Note that, in this paper, we only consider a scalar response of the circuit. For cases where a circuit has multiple responses, the algorithm proposed in this paper has to be repeated for each output individually. This reduces the overall speedup of the proposed method. For such cases further research is required.

3. The columns of $$\mathbf{S}^{*}$$ corresponding to irrelevant variables (complement of $$\mathbf{x}_{r}$$) are set to zero before evaluating the outputs $$\mathbf{H}^{*}$$.

4. The purpose of selecting these points is to improve the Kriging predictor in the margin of uncertainty.

## Abbreviations

CCP: Central Coverage Probability

CI: Confidence Interval

CV: Coefficient of Variation

DOE: Design Of Experiments

HISMC: Hybrid Importance Sampling Monte Carlo

IC: Integrated Circuit

iid: independent and identically distributed

IPs: Intellectual Properties

ISMC: Importance Sampling Monte Carlo

LASSO: Least Absolute Shrinkage and Selection Operator

LHS: Latin Hypercube Sampling

MC: Monte Carlo

MCR: Misclassification Error

MLE: Maximum Likelihood Estimation

MSE: Mean Squared Error

pdf: probability density function

PLSR: Partial Least Squares Regression

r.v.: random vector

## References

1. Tyagi AK, Jonsson X, Beelen TGJ, Schilders WHA. Speeding up rare event simulations using Kriging models. In: Proceedings of IEEE 21st workshop on signal and power integrity (SPI). Baveno: IEEE; 2017.

2. Ciampolini L, Lafont J-C, Drissi FT, Morin J-P, Turgis D, Jonsson X, Desclèves C, Nguyen J. Efficient yield estimation through generalized importance sampling with application to NBL-assisted SRAM bitcells. In: Proceedings of the 35th international conference on computer-aided design. ICCAD ’16. New York: ACM; 2016.

3. Haldar A, Mahadevan S. Probability, reliability and statistical methods in engineering design. New York: Wiley; 2000.

4. Singhee A, Rutenbar RA. Statistical blockade: very fast statistical simulation and modeling of rare circuit events and its application to memory design. IEEE Trans Comput-Aided Des Integr Circuits Syst. 2009;28(8):1176–89.

5. Santner T, Williams B, Notz W. The design and analysis of computer experiments. Berlin: Springer; 2003.

6. Rasmussen CE, Williams CKI. Gaussian processes for machine learning. Cambridge: MIT Press; 2006.

7. Jourdain B, Lelong J. Robust adaptive importance sampling for normal random vectors. Ann Appl Probab. 2009;19(5):1687–718.

8. Homem-de-Mello T, Rubinstein RY. Estimation of rare event probabilities using cross-entropy. In: Proceedings of the winter simulation conference. vol. 1. San Diego, CA, USA. 2002. p. 310–9.

9. Kroese DP, Porotsky S, Rubinstein RY. The cross-entropy method for continuous multi-extremal optimization. Methodol Comput Appl Probab. 2006;8(3):383–407.

10. Alpaydin E. Introduction to machine learning. London: MIT Press; 2010.

11. Rosipal R, Krämer N. Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J, editors. Subspace, latent structure and feature selection. Berlin: Springer; 2006. p. 34–51.

12. Li G-Z, Zeng X-Q, Yang JY, Yang MQ. Partial least squares based dimension reduction with gene selection for tumor classification. In: 2007 IEEE 7th international symposium on BioInformatics and BioEngineering. Boston, MA, USA. 2007. p. 1439–44.

13. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.

14. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc, Ser B, Methodol. 1996;58(1):267–88.

15. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32(2):407–99.

16. Tibshirani RJ. The lasso problem and uniqueness. Electron J Stat. 2013;7:1456–90.

17. Matheron G. Principles of geostatistics. Econ Geol. 1963;58:1246–68.

18. Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Stat Sci. 1989;4:409–35.

19. Lophaven SN, Nielsen HB, Sondergaard J. DACE: a Matlab Kriging toolbox, version 2.0. Technical University of Denmark, DK-2800 Kgs. Lyngby—Denmark. 2002.

20. Dubourg V. Adaptive surrogate models for reliability analysis and reliability-based design optimizations [PhD thesis]. Clermont-Ferrand, France: Blaise Pascal University—Clermont II; 2011.

21. McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics. 1979;21(2):239–45.

22. Dubrule O. Cross validation of Kriging in unique neighborhood. Math Geol. 1983;15(6):687–99.

23. Li J, Xiu D. Evaluation of failure probability via surrogate models. J Comput Phys. 2010;229:8966–80.

24. Li J, Li J, Xiu D. An efficient surrogate-based method for computing rare failure probability. J Comput Phys. 2011;230:8683–97.

25. Butler T, Dawson C, Wildey T. Propagation of uncertainties using surrogate models. SIAM/ASA J Uncertain Quantificat. 2013;1:164–91.

### Acknowledgements

The authors would like to thank Cyril Desclèves, Joost Rommes, and Pascal Bolcato from Mentor Graphics, Grenoble, for valuable discussions that improved some aspects of this work. The first author is grateful for the financial support from the Marie Curie Action.

### Availability of data and materials

The data that support the findings of this study are available from Mentor Graphics but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Mentor Graphics.

## Funding

This research is completely supported by the European Union in the FP7-PEOPLE-2013-ITN Program under Grant Agreement Number 608243 (FP7 Marie Curie Action, Project ASIVA14—Analog Simulation and Variability Analysis for 14nm designs).

## Author information

### Contributions

The first author, AKT, wrote this manuscript and performed all the experiments. XJ helped to find the research direction, arranged the circuit examples and reviewed this work closely together with TGJB, who also supported the fine-tuning and proofreading of the manuscript. WHAS followed the work closely, arranged meetings with experts in the field and made useful suggestions. All authors read and approved the final manuscript.

### Corresponding author

Correspondence to Anuj K. Tyagi.

## Ethics declarations

### Competing interests

The authors declare that they have no competing interests. 