A neural network-based framework for financial model calibration

A data-driven approach called CaNN (Calibration Neural Network) is proposed to calibrate financial asset price models using an Artificial Neural Network (ANN). Determining optimal values of the model parameters is formulated as training hidden neurons within a machine learning framework, based on available financial option prices. The framework consists of two parts: a forward pass in which we train the weights of the ANN off-line, valuing options under many different asset model parameter settings; and a backward pass, in which we evaluate the trained ANN-solver on-line, aiming to find the weights of the neurons in the input layer. The rapid on-line learning of implied volatility by ANNs, in combination with the use of an adapted parallel global optimization method, tackles the computation bottleneck and provides a fast and reliable technique for calibrating model parameters while avoiding, as much as possible, getting stuck in local minima. Numerical experiments confirm that this machine-learning framework can be employed to calibrate parameters of high-dimensional stochastic volatility models efficiently and accurately.


Introduction
Model calibration can be formulated as an inverse problem, where, based on observed output results, the input parameters need to be inferred. Previous work on solving inverse problems includes research on adjoint optimization methods (Deng et al., 2008;Bouchouev and Isakov, 1997)), Bayesian methods (Kennedy and O'Hagan, 2001;Cont, 2019), and sparsity regularization (Daubechies et al., 2004).
In a financial context, e.g., in the pricing and risk management of financial derivative contracts, asset model calibration aims at recovering the model Secondly, there is inherent parallelism in our ANN approach, so we will also take advantage of modern processing units (like GPUs). The paper (Horvath et al., 2019) also presented a neural network-based method to compute and calibrate rough volatility models. Our CaNN however incorporates a parallel global search method for calibration. Moreover, the CaNN is a generic ANNbased framework, and views the three phases, training/prediction/calibration, as a whole, the difference between them being just to change the learnable units. Furthermore, the proposed ANN approach can handle a flexible number of input market data. In other papers, like (Hernandez, 2016), (Dimitroff et al., 2018), the number of input data had to be fixed in order to fit the employed Convolutional Neural Networks.
Calibrating financial models often gives rise to non-convex optimization problems, for which local optimization algorithms may have convergence issues. A local optimization technique is generally relatively cheap and fast, but a key factor is to choose an accurate initial guess. Otherwise it may fail to converge and get stuck in a local minimum. The authors in (Gilli and Schumann, 2012) vary two parameters of the Heston model (keeping the other parameters unchanged), and show that the objective function exhibits multiple local minima. Also in (Homescu, 2011) it is stated that multiple local minimal are common for calibration in the FX and commodities markets.
To address robustness, global optimizers are popular to calibrate financial models, like the Differential Evolution (DE) technique, Particle Swarm optimization and Simulated Annealing, as their convergence does not rely on specific initial values. DE has been used to calibrate financial models (Vollrath and Wendland, 2009;Gilli and Schumann, 2012) and to train neural networks (Slowik and Bialko, 2008). Parallel computing may help to solve calibration problems with global optimization within reasonable time.
The contributions of this paper are three-fold. First, we design a generic ANN-based framework for calibration. Apart from data generators, all the components and tasks are implemented on a unified computing platform. Second, a parallel global searcher is adopted based on a population-based optimization algorithm (here DE), an approach that fits well within the ANN-based calibration framework. Both the forward and backward passes run in parallel, tackling the computational bottleneck of global optimization and making the calibration time reasonable, even in the case of employing a large neural network. Third, the key components are robust and stable: using a robust data generator in combination and the global optimization technique makes sure that the ANN-based calibration method does not get stuck in local minima.
The rest of the paper is organized as follows. In Section 2, the Heston and Bates stochastic volatility models and their calibration requirements are briefly introduced. In Section 3, artificial neural networks are introduced as function approximators, in the context of parametric financial models. In this section, a generic machine learning framework for model calibration to find the global solution is presented as well. In Section 4, numerical experiments are presented to demonstrate the performance of the proposed calibration framework. Some details of the employed COS option pricing method are given in the appendix.

Financial Model Calibration
We start by explaining the stochastic models for the asset prices, the corresponding partial differential equations for the option valuation and the standard ways of calibrating these models. The open parameters in these models, that need to be calibrated with the help of an objective function are also discussed.

Asset pricing models
In the following subsections we present the financial asset pricing models that will be used in this paper, the Heston and Bates stochastic volatility models. European option contracts are used as the examples to derive the pricing models, however, other types of financial derivatives can be taken into consideration in a similar way.

The Heston model
One of the most popular stochastic volatility asset pricing models is the Heston model (Heston, 1993), for which the system of stochastic equations under the risk-neural measure Q reads, with ν t the instantaneous variance, r the risk-free interest rate and W s t , W ν t are two Wiener processes with correlation coefficient ρ x,ν 1 . To avoid negative volatilities, the asset's variance in Equation (1) is modeled by a CIR process, which is proposed in (Cox et al., 1985) to model interest rates. It precludes negative values for ν(t), so that when ν(t) reaches zero it subsequently becomes positive. The process can be characterized as a mean reverting square-root process, with as the parametersν the long term variance, κ the reversion speed; γ is the volatility of the variance. An additional parameter is ν 0 , the t 0 -value of the variance.
By the martingale approach, the following two-dimensional Heston option pricing PDE is found, with the given terminal condition V (T, S, ν; T, K), where V = V (t, S, ν; T, K) is the option price at time t.

The Bates model
Next to the Heston model, we will also consider its generalization, the Bates model (Bates, 1996), by adding jumps to the Heston stock price process. The model is described by the following system of SDEs: with X P (t) a Poisson process with intensity λ J , and J being normally distributed jump sizes with expectation µ J and variance ν 2 J , i.e. J ∼ N (µ J , ν 2 J ). The Poisson process X P (t) is assumed to be independent of the Brownian motions and of the jump sizes. Clearly, we have three more parameters, λ J , µ J and ν 2 J , to calibrate in this case. The corresponding option pricing equation is a so-called Partial Integro-Differential Equation (PIDE), with the given terminal condition V (T, S, ν; T, K), where P J (x) is the lognormal probability density function of the jump magnitudes. Both the Heston and Bates models do not give rise to analytic option value solutions and the governing P(I)DEs thus have to be solved numerically. There are several possibilities for this, like by means of finite difference PDE techniques, Monte Carlo, or numerical integration methods. We will employ a Fourier-type method, the COS method from (Fang and Oosterlee, 2009), to obtain highly accurate option values, for the details we refer to the Appendix. A prerequisite to using Fourier methods is the availability of the asset price's characteristic function. From the resulting option values, the corresponding Black-Scholes' implied volatilities will be determined by means of a robust rootfinding iteration known as Brent's method (Brent, 1973).

The calibration procedure
Calibration refers to estimating the model parameters (i.e., the constant coefficients in the PDEs) given the samples of the market data. The market value of either option prices or implied volatilities, with moneyness m := S 0 /K and time to maturity τ := T − t, is denoted by Q * (τ, m), and the corresponding model-based value is Q(τ, m; Θ), with the parameter vector Θ ∈ R n , where n denotes the number of parameters to calibrate. For the Heston model,Θ := [ρ,κ,γ,ν,ν 0 ], while for the Bates model we have, Θ := [ρ, κ, γ,ν, ν 0 , λ J , µ J , σ J ].
The difference between the observed values and the ones given by the model is indicated by an error measure, where || · || measures the distance, and N is the number of available calibration instruments. The total difference is represented by the following target function, where ω i are the corresponding weights andλ is a regularization parameter.
When ω i = 1 N andλ = 0 with squared errors in Equation (6), we obtain a well-known error measure, the MSE (Mean Squared Error). When people wish to guarantee perfect calibration for ATM options (the options are most liquid in the market), the corresponding weight value ω i is sometimes increased. Usually calibrating financial models reduces to the following minimization problem, which gives us a set of parameter values making the difference between the market and the model quantities as small as possible. The above formula is over-determined in the sense that N > n, i.e., the number of data samples is larger than the number of to-calibrate parameters. Equation (7) is usually solved iteratively to minimize the residual. Initially a set of parameter values is assigned and the corresponding model values are determined; These values are compared with market data, and the corresponding error is computed, after which a search direction is determined to find a next parameter set. The above steps are repeated until a stopping criterion is met. While evaluating Equation (6), an array of options with different strikes and maturities need to be valued thousands of times and therefore this valuation should be performed highly efficiently. Here, we will employ ANNs that can deal with a complete array of option prices in parallel.

Choices within calibration
Usually the objective function is highly nonlinear and even nonconvex. The authors in (Guillaume and Schoutens, 2012) discuss the impact of the objective function and the calibration method for the Heston model. This issue becomes worse when being faced with a high-dimensional optimization problem. A way to address this problem is to smooth the objective function and employ traditional local optimization methods. Another difficulty when calibrating the model is that the set Θ includes multiple parameters that need to be determined, and that these model parameters are not completely "independent", for example, the effect of different parameters on the shape of the implied volatility smile may be quite similar. For this reason, one may encounter several "local minima" when searching for optimal parameter values. In most cases, a global optimization algorithm should be preferred during calibration.
Regarding the target objective function, there are two popular choices in the financial context, namely either based on observed option prices or based on computed implied volatilities. Option prices can be collected directly from the market, and implied volatility should be computed based on the collected option prices. The most common choices without regularization terms include, and where V * c (T j − t 0 , S 0 /K i ) is the call option price for strike K i and maturity T j with instantaneous stock price S 0 at time t 0 as observed in the market; V c (T j − t 0 , S 0 /K i ; Θ) is the call option value computed from the model using model parameters Θ; similarly σ * imp (·), σ imp (·) are the implied volatilities from the market and from the Heston/Bates model, respectively; ω i,j is some weighting function. The notation i and j is to distinguish the two factors impacting the target quantity. A third approach is to calibrate the model to both prices and implied volatility. For option prices, weighting the target quantity by Vega (the derivative of the option price with respect to the volatility) is a technique to remedy model risk. When taking implied volatility into account, a numerical root-finding method is often employed to invert the Black-Scholes formula in addition to computing option prices. That is to say, two numerical methods are required, one for pricing options, the other one for calculating the Black-Scholes implied volatility. Nevertheless calibrating to an implied volatility surface can help people specify prices of all vanilla options, given the current term structure of interest rates. This is one of the reasons why the practitioners prefer implied volatility during calibration. Besides, we will mathematically discuss the difference between calibrating to option prices and implied volatilities in Section 4.3.2. Moreover, it is well known that OTM instruments are liquid or heavily traded in the market. Calibrating the financial models to OTM instruments is common practice in reality.
The calibration performance (e.g., speed and accuracy) is also influenced by the employed method while solving the financial models. An analytic solution is not necessarily available for the model to be calibrated, and different numerical methods have therefore been developed to solve the corresponding option pricing models. Alternatively, based on some existing solvers, ANNs can be used as a numerical method to learn the solution (Liu et al., 2019).

An ANNs-based approach to calibration
This section presents the framework to calibrate a financial model by means of machine learning. Training the ANNs and calibrating financial models both boil down to optimization problems, which motivates the present machine learningbased approach to model calibration.

Artificial Neural Networks
This section introduces the ANNs. In general, ANNs are built using three components: neurons, layers and the complete architecture from bottom to top. As the fundamental unit, a neuron consists of three consecutive operations, summing up the weighted input, adding a bias to the summation, and computing the output via an activation function. This activation function determines whether and by how much a particular neuron is active. A number of neurons make up a hidden layer. Stacking different layers then defines the full architecture of the ANNs. With signals travelling from the input layer through the hidden layers to the output layer, the ANN builds a mapping among input-output pairs.
The basic ANN is the multi-layer perceptron (MLP), which can be written as a composite function, where θ (i) = (w i , b i ) 2 , w i is a weight matrix and b i is a bias vector. A one hidden layer MLP can, for example, be written as follows, with w j the unknown weights, ϕ(w 1j x j + b 1j ) the neuron's basis function, ϕ(·) an activation function (m is the number of neurons in a hidden layer). The loss function is equivalent to a distance in the case of supervised learning, where f (x) is the target function. Training the ANNs is learning the optimal weights and biases in Equation (10) to make the loss function as small as possible. The process of training neural networks can be formulated as an optimization problem, arg min given the input-output pairs (X, Y) and a user-defined loss function L(θ). Assuming the training data set (X, Y) can define the true function on a domain Ω, ANNs with sufficiently many neurons can approximate this function in a certain norm, e.g., the l 2 -norm. ANNs are thus powerful universal function approximators and can be used without assuming any pre-specified relation between the input and the output. Quantitative theoretical error bounds for ANNs to approximate any function are not yet available. For continuous functions, in the case of a single hidden layer, the number of neurons should grow exponentially with the input dimensionality (Mhaskar, 1996). In the case of two hidden layers, the number of neurons should grow polynomially. The authors in (Maiorov and Pinkus, 1999) proved that any continuous function defined on the unit hypercube C[0, 1] d can be uniformly approximated to arbitrary precision by a two hidden layer MLP, with 3d and 6d + 3 neurons in the first and second hidden layer, respectively. In (Yarotsky, 2017) the error bounds for approximating smooth functions by ANNs with adaptive depth architectures are presented. The theory gets complicated when the ANN structure goes deeper, however, these deep neural networks have recently significantly increased the power of ANNs, see, for example the Residual Neural Networks (Lin and Jegelka, 2018).
In order to perform the optimization in (13), the above composite function (10) is differentiated using the chain rule. The first-and second-order partial derivatives of the loss function with respect to any weight w (or bias b) are easily computable; for more details we refer to (Goodfellow et al., 2016). This differentiation enables us to not only train ANNs with gradient-based methods, but also the sensitivity of the approximated functions using the trained ANN can be investigated. For this latter task, the Hessian matrix will be derived in Section 4 to study the sensitivity of the objective function with respect to the calibrated parameters.

The forward pass: learning the solution with ANNs
The first part of the CaNN, the forward pass, employs an ANN, in the form of an MLP, to learn the solution generated by different numerical methods and subsequently maps the input to the output of interest (i.e., neglecting the intermediate variables). For example, in order to approximate the Black-Scholes implied volatilities based on the Heston input parameters, two numerical methods are required, i.e., the COS method to calculate the Heston option prices and Brent's root-finding algorithm to determine the corresponding implied volatility, as presented in Figure 1. Using two separate ANNs to map the Heston parameters to implied volatility has been applied in Liu et al. (2019). In the present paper, we merge these two ANNs, see Figure 1. In other words, the Heston-IV-ANN is used as the forward pass to learn the mapping between the model parameters and the implied volatility. Note that a similar model is employed for the Bates model, however then based on the Bates model parameters. The forward pass consists of training and prediction, and in order to do so the network architecture and optimization method have to be defined. Generally, an increasing number of neurons, or a deeper structure, may lead to better approximations, but may also result in a computationally heavy optimization and evaluation of the network. In (Liang and Srikant, 2016) it is proved that a deep NN can approximate a function for which a shallow NN may need a very large number of neurons to reach the same accuracy. Different residual neural networks have been trained and tested as a validation of our work. They may improve the predictive power while using a similar number of weights as in an MLP, but they typically take significantly more computing time during the training and testing phases. Very deep network structures may reduce the parallel efficiency, because the operations within a layer have to wait for the output of previous layers. With the limitation of computing resources available, a trade-off between ANN's computation speed and approximation capacity may be considered.
Many techniques have been put forward to train ANNs, especially for deep networks. Most of the neural network training relies on gradient-based methods. A proper random initialization may ensure the network to start with suitable initial weight values. Batch normalization scales the output of a layer by subtracting the batch mean and dividing it by the batch standard deviation. This can often speed up the training process. A dropout operation randomly selects a proportion of the neurons and deactivates them, which forces the network to learn more generalized features and prevents over-fitting. The dropout rate p refers to the proportion of deactivated neurons in a layer. In the testing phase, in order to take into account the missing activation during training, each activation in the entire network is reduced by a factor p. As a consequence, the ANNs prediction slows down, which has been verified during experiments on GPUs. We found that our ANNs model did not encounter over-fitting even when using a zero dropout rate, as long as sufficient training data were provided. In our neural network we employ the Stochastic Gradient Descent method, as further described in Section 3.4.

The backward pass: calibration using ANNs
This section discusses the connection between training the ANN and calibrating the financial model. First of all, both Equations (7) and (13) aim at estimating a set of parameters to minimize a particular objective function. For the calibration problem, these are the parameters of the financial model and the objective function is the error measure between the market quantity and the model-based quantity. For the neural networks, the parameters correspond to the learnable weights and biases in the artificial neurons and the objective function is the user-defined loss. This connection forms an inspiration for the machine learningbased approach to calibrate financial models.
As mentioned before, the ANN approach comprises three phases, training, prediction and calibration. During training, given the input-output pairs and a loss function as in Equation (13), the hidden layers are optimized to determine the appropriate values of the weights and biases, as shown in Figure 2a, which results in a trained ANN approximating the option solutions of the financial model (the forward pass, as explained in the previous section).
During the prediction phase, the hidden layers of the trained ANN are fixed (frozen), and new input parameters enter the ANN to yield the output quantities of interest. This phase is used to evaluate the performance of the trained ANN (the so-called model testing) or to accelerate option pricing by replacing the original solver.  During the calibration phase (or the backward pass), the original input layer of the ANN is transformed into a learnable layer, while all hidden layers remain unchanged. These layers are the ANN layers obtained from the forward pass with the already trained weights, as shown in Figure 2b. By providing the output data, here consisting of market-observed option prices and implied volatilities, and changing to an objective function for model calibration, see Equation (7), the ANN can be used to find the input values that match the given output. The task is thus to solve the inverse problem by learning a cer-tain set of input values, here the model parameters Θ, either for the Heston or Bates model. The option's strike price K, as an example, belongs to the input layer, but is not estimated in this phase. Note that the training phase in the forward pass is time-consuming but done off-line and only once. The calibration phase is computationally cheap, and is performed on-line. The calibration phase thus results in model parameters that best match the observed market data.
The gradients of the objective function, with respect to the input parameters, can be derived. This is useful when employing gradient-based optimization algorithms to conduct model calibration with the trained ANNs. Compared to the classical calibration methods, in the ANN-based approach it is also possible to incorporate the gradient information from the trained ANNs to compute the search direction (without external numerical techniques). As mentioned, we focus on a general calibration framework in which we integrate both gradientbased and gradient-free algorithms. Importantly, within the proposed calibration framework we may insert any number of market quotes, without requiring a fixed structure or a grid of input parameters.

Optimization
The optimization method plays a key role in training ANNs and calibrating financial models, but there are different requirements on the solutions for different phases. When training the neural network to learn the mapping between input and output values, we aim for a good performance on a test data set while optimizing the model on a training data set (this concept is called generalization). Calibration is regarded as an optimization problem with only a training data set, where the objective is to fit the market-observed prices as well as possible. In this work, the Stochastic Gradient Descent (SGD) is used when training the ANN, and Differential Evolution is preferred in the phase of calibration to address the problem of multiple local minima. 3

Stochastic Gradient Descent
A popular optimizer to train ANNs is SGD (Robbins and Monro, 1951). Neural networks contain thousands of weights, which gives rise to a high-dimensional, nonconvex optimization problem. The local minima appear not to be problematic for this involved black-box system, as long as the cost function reaches a sufficiently low value. Optimization of (6) based on SGD is computed using, where L is a loss function as in (12) and N T is the number of training iterations. The bias and weights parameters are denoted by θ = (W, b). The loss function of training the ANN solver is based on MSE in this paper. In practice, the gradients are computed over mini-batches because of computer memory limitations. Instead of all input samples, a portion is randomly selected within each iteration to calculate an approximation of the gradient of the objective function. The size of the mini-batch is used to determine the portion. Due to the architecture of the GPUs, batch sizes of powers of two can be efficiently implemented. Several variants of SGD have been developed in the past decades, e.g., RMSprop and Adam, where the latter method handles an optimization problem adaptively by adjusting the involved parameters over time.

Differential Evolution
Differential Evolution (DE) (Storn and Price, 1997) is a population-based, derivativefree optimization algorithm, which does not require any specific initialization. With DE, a global optimum can be found, even when the objective function is nonconvex. The general form of the DE algorithm usually comprises the following four steps: 1. Initialization: Generate the population with N p individuals and locate each member with random positions in the search space, (θ 1 , θ 2 , ..., θ Np ) 2. Mutation: Once initialized, a randomly sampled difference is added to each individual, named differential mutation.
where i represents the i-th candidate, and the indices a, b, c are randomly selected from the population with a = i. The resulting θ is called a mutant. The differential weight F ∈ [0, ∞) determines the step size of the evolution. Generally, large F values increase the search radius, but may cause DE to converge slowly. There are several mutation strategies, for example, when θ a is always the best candidate of the previous population, the mutation strategy is called best1bin, which will be used in the following numerical experiments; when θ a is randomly chosen, it is called rand1bin. After this step, an intermediary (or donor) population, consisting of N p mutant candidates, is generated.
3. Crossover: During the crossover stage mutated candidates that may enter the next evaluation stage will be determined. For each i ∈ {1, . . . , N p }, a uniformly distributed random number p i ∼ U (0, 1) is selected. Some samples are filtered out by setting a user-defined crossover possibility Cr ∈ [0, 1], If the probability is greater than Cr, the donor candidate will be discarded. Increasing Cr allows more mutants to enter the next generation, but at the expense of population stability. Here, a trial population (θ 1 , θ 2 , ..., θ Np ) has been defined.
4. Selection: Comparing each new trial candidate with the corresponding target individual on the objective function, If the trail individual has improved performance, the selected individual is replaced. Otherwise, the offspring individual inherits the parameters from its parent. This gives birth to a next generation population.
The Steps (2)-(4) are repeated until the algorithm converges or until a predefined criterion is satisfied. Adjusting the control parameters may impact the performance of DE. For example, a large population size and mutation rate can increase the probability of finding the global minimum. An additional parameter, convergence tolerance, is used to measure the diversity within a population, and determines when to stop DE. The control parameters can also change over time, which is out of our scope here.

Acceleration of calibration
In this section we develop DE into a parallel version which is beneficial within the ANNs. Generally, matrix multiplications and element-wise operations in a neural network can be implemented in parallel to reduce the computing time, especially when a large number of arguments is involved. As a result, several components of the calibration procedure can be accelerated. For the ANN solver in the forward pass, all observed market samples can be evaluated at once. Furthermore, in the selection stage of the DE, an entire population can be treated simultaneously. Note that the ANN solver runs in parallel, especially on any GPU.  Table 1, where the population of one generation comprises 50 vector candidates for the calibrated parameters (e.g., a vector candidate contains five parameters to calibrate in the Heston model), and each candidate produces a number of market samples (here 35, i.e., 7 strike prices K and 5 time points). So, there are 50×35=1650 input samples for the Heston model each generation. Traditionally, all these input samples (here 1650) are computed individually, except for those with the same maturity time T . The first speed-up is achieved because 35 sample output quantities from each parameter candidate can be computed by the ANN solver at the same time, even if these samples have different maturity times and strike prices. The second speed-up is based on the parallel DE combined with the ANN, where all parameter candidates in one generation enter the ANN solver at once, that is, all 1650 input samples in one generation can be included in the ANN solver simultaneously, giving 1650 output values (e.g., implied volatilities). Note that the batch size of the ANN solver should be adapted to the limitations of the specific processor, here 2048 in our used processor. We find that with the population size being around 50, the parallel CaNN is at least 10 times faster than the conventional CaNN, on either a CPU or a GPU. It is believed that a larger population size should lead to a higher parallel computing performance, especially on a GPU.
Remark. There are basically two error sources in this framework. One is a consistency error which comes from the employed numerical methods to solve the financial model, and it is found while generating the training data set. The other is an optimization error during training and calibration. These errors will influence the performance of the CaNN.

Numerical results
In this section we show the performance of the proposed CaNN. We begin with calibrating the Heston model, a special case of the Bates model. Some insights into the effect of the Heston parameters on the implied volatility are discussed to give some intuition on the relation, since no explicit mapping between them exists. Then, the forward pass is presented where an ANN is trained to build a mapping between the model parameters and implied volatilities. It is also demonstrated that the trained forward pass can be used as a tool for performing the sensitivity analysis of the model parameters. After that, we implement the backward pass of the Heston-CaNN to calibrate the model and evaluate the CaNN performance. We end this section by considering the calibration of the Bates model, a model that consists of more parameters than the Heston model, using the Bates-CaNN.

Parameter sensitivities for Heston model
This section discusses the sensitivity of the implied volatility to the Heston coefficients. This sensitivity analysis can be used to estimate a set of initial parameters, as is used in traditional calibration methods. In our calibration method this will not be required, however, we can gain some insights in the case of no explicit formulas.
The typically observed implied volatility shapes in the market, e.g., the implied volatility smile or skew, can be reproduced by varying the above parameters {κ, ρ, γ, ν 0 ,ν}. We will give some intuition about the parameter values and their impact on the implied volatility shape. From a PDE viewpoint, the calibration problem consists of finding appropriate values of PDE coefficients {κ, ρ, γ, ν 0 ,ν} to make the Heston model to reproduce the observed option/implied volatility data. The authors in (Gauthier and Rivaille, 2009) reduce the calibration time by giving smart initial values for asset models, whereas in (Forde et al., 2010) an approximation formula for the Heston dynamics was employed to determine a satisfactory initial set of parameters, followed by a local optimization to reach the final parameters. The paper (Cui et al., 2017) derived a Heston model characteristic function to analytically obtain gradient information of the option prices during the search for an optimal solution. In Section 4.3.2 we will use the ANN to extract gradient information of the implied volatility with respect to the Heston parameters.

Effect of individual parameters
To analyze the parameter effects numerically, we use the following set of reference parameters, Two important parameters that are varied are the correlation parameter ρ and the volatility-of-variance parameter γ. Figure 3 (left side) shows that, when ρ = 0%, an increasing value of γ gives a more pronounced implied volatility smile. A higher volatility-of-variance parameter thus increases the implied volatility curvature. We also see, in Figure 3 (right side), that when the correlation between stock and variance process gets increasingly negative, the slope of the skew in the implied volatility curve increases. Furthermore, it is found that parameter κ has a limited effect on the implied volatility smile or skew, up to 1% − 2% only. It determines the speed at which the volatility converges to the long-term volatilityν.
The optimization can be accelerated by a reduction of the set of parameters to be optimized. By comparing the impact of the speed of mean reversion parameter κ and the curvature parameter γ, it is observed that these two parameters have a similar effect on the shape of the implied volatility. It is therefore common (industrial) practice to prescribe (or fix) one of them. Practitioners often fix κ = 0.5 and optimize parameter γ. By this, the optimization reduces to four parameters. Effect of x,v on implied volatility x,v =0.00 x,v =-0.25 x,v =-0.50 x,v =-0.90 Figure 3: Impact of variation of the Heston parameter γ (left side), and correlation parameter ρ (right side), on the implied volatility which varies as a function of strike price K.
Another parameter which may be determined in advance, using heuristics, is the initial value of the variance process ν 0 . For maturity time T "close to today" (i.e., T → 0), one expects the stock price to behave like in the Black-Scholes case. The impact of a stochastic variance process should reduce to zero, in the limit T → 0. For options with short maturities, the process may therefore be approximated by a process of the following form: This suggests that for initial variance ν 0 one may use the square of the ATM implied volatility of an option with the shortest maturity, ν 0 ≈ σ 2 imp , for T → 0, as an accurate approximation for the initial guess for the parameter. One may also use the connection of the Heston dynamics to the Black-Scholes dynamics with a time-dependent volatility function. In the Heston model we may, for example, project the variance process onto its expectation, i.e.,

dS(t) = rS(t)dt + E[ ν(t)]S(t)dW x (t).
By this projection the parameters of the variance process ν(t) may be calibrated similar to the case of the time-dependent Black-Scholes model. The Heston parameters are then determined, such that where σ AT M (T i ) is the ATM implied volatility for maturity T i .
Another classical calibration technique for the Heston parameters is to use VIX index market quotes. With different market quotes for different strike prices K i and for different maturities T j , we may determine the optimal parameters by solving the following equalities, for all pairs (i, j), When the initial values of the parameters have been determined, one can use the whole implied volatility surface to determine the optimal model parameters.
To conclude, the number of the Heston parameter to be calibrated depends on different scenarios. The flexibility of our CaNN is that it can handle varying numbers of to-calibrate parameters.

Effect of two combined parameters
In this section, two parameters are varied simultaneously in order to understand the joint impact on the objective function. Figure 4a presents the landscape of the objective function, here the logarithm of the MSE, when varying ν 0 and κ but keeping the other parameters fixed in the Heston model. It is observed that the valley is narrow in the direction of ν 0 but flat in the direction of κ. Several values of these parameters thus result in similar values of the objective function, which means that there may be no unique global minimum above a certain error threshold. Furthermore, forν and κ we observe also a flat minimum, with multiple local minima giving rise to similar MSEs, see Figure 4b.  For higher-dimensional objective functions, the structure becomes even more complex. This is a preliminary study of the sensitives, and advanced tools are required for studying the effect of more than two parameters. We will show that the ANN can be used to obtain the sensitivities for more than two parameters to present the bigger picture of the dependencies and sensitivities. For this task the Hessian matrix of the five Heston parameters will be extracted (see Section 4.3.2).

The forward pass
In this section, we discuss the forward pass, i.e., Heston-IV-ANN. The selected hyper-parameters are listed in Table 2. Increasing the number of neurons or using a deeper structure may lead to better approximations, but gives rise to an expensive-to-compute network. With our computing resources, we choose to employ 200 neurons each hidden layer to balance the calibration speed and accuracy. We use 4 hidden layers and a linear output (regression) layer, so that the network contains 122,601 trainable parameters. MSE is used as the loss function measure to train the forward pass. The global structure is depicted in Figure 6. More details on the ANN solver can be founded in (Liu et al., 2019). Glorot uniform Optimizer Adam Batch size 1024 Figure 6: The structure of the ANN.
As a data-driven method, the samples from the parameter set for which the ANN is trained are randomly generated for the pricing of European put options. The input contains eight variables, and Table 3 presents the range of six Heston input parameters (r, ρ, κ,ν, γ, ν 0 ) as well as two option contractrelated parameters (τ , m), with a fixed strike price K = 1. There are around one million data points. The complete data set is randomly divided into three parts, with 10% as the testing set, 10% as validation and 80% as the training data set.
After sampling the parameters, a robust version of the COS method is used to determine the option prices under the Heston model numerically. The default setting with L COS = 50 and N COS = 1500 will provide highly accurate option solutions for most of the samples, but it may end up with insufficient precision in some extreme parameter cases. In such cases, the integration interval [a, b] will be enlarged automatically, by increasing L COS until the lower bound a and the upper bound b have different signs. Subsequently, the Black-Scholes implied volatility is calculated by Brent's method.
The option prices are just intermediate variables during training in the forward pass. The overall Heston-IV-ANN solver does not depend on the type of European option (e.g., call or put), since during the computation of the Black-Scholes implied volatilities the European options with identical Heston parameters should give rise to the same implied volatilities, independent of call or put prices. The forward pass can handle both call and put implied volatilities without requiring additional efforts. The ANN takes as input parameters (r, ρ, κ,ν, γ, ν 0 , τ , m), and approximates the Black-Scholes implied volatility σ. As mentioned in Table 2, the optimizer Adam is used to train the ANN on the generated data set. The learning rate is halved every 500 epochs. The training consists of 8000 epochs, both the training and validation losses have converged. The performance of the trained model is shown in Table 4. We observe that the forward pass is able to obtain a very good accuracy and therefore learns the mapping between model parameters and implied volatility in a robust and accurate manner. The test set performance is very similar to the train performance, showing that the ANN is able to generalize well.

The backward pass
We will perform calibration using the CaNN based on the trained ANN from the previous section and evaluate its performance. We will work with the full set of Heston parameters to calibrate, but we will also study the impact of reducing the number of parameters to calibrate. The aim is to check how accurately and efficiently the ANN approach can recover the input values. In order to investigate the performance of the proposed calibration approach, as shown in Figure 7, we generate synthetic samples by means of Heston-IV/COS-Brent, where the 'true' values of the parameters are known in advance. In other words, the parameters used to obtain the IV's from the COS-Brent's method, are now taken as output of the backward pass of the neural network, with σ imp being the input conditional on (K, τ, S 0 , r). Different financial models correspond to different CaNNs. Here we distinguish the Heston-CaNN (based on the Heston model, studied in this section), from the Bates-CaNN (based on the Bates model, studied in Section 4.3.3).
There are 5 × 7 = 35 'observed' European option prices, that are made up of European OTM puts and calls. As shown in Table 5, the moneyness ranges from 0.85 to 1.15, and the maturity times vary from 0.5 to 2.0. Each implied volatility surface contains moneyness levels (85%, 90%, 95%, 100%, 110%, 115%) and maturities (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0) with a prescribed risk-free interest rate of 3%. The samples with m < 1 correspond to European call OTM options, while those ones with m > 1 and m = 1 are OTM and ATM put options, respectively. We use the total squared error measure J(Θ) as the objective function during the calibration, where σ ANN imp is the ANN-model-based value and σ * imp is the observed one. We give a small penalty parameterλ depending on the dimensionality of the cal-ibration 4 . The forward pass has been trained with implied volatility as the output quantity, as described in Section 4.1.2. The parameter settings of the DE optimization is shown in Table 1.

Calibration to Heston option quotes
In this section we focus on two scenarios for the Heston model, calibrating either three parameters, with a fixed κ and a known ν 0 , or calibrating five parameters. In order to create synthetic calibration data, we choose five equallyspaced points between the lower and upper bound for each parameter, and there are 5 5 = 3125 combination cases in total, as shown in Table 6. For each experiment, five different random seeds of DE are tested, because the DE optimization involves random operations which may cause the performance to fluctuate. In addition, all quotes have the equal weight ω = 1 in this section.
First, the scenario of three parameters is studied, fixing κ and ν 0 during calibration. We compare the averaged results by implementing each test case five times. The wording "function evaluation" refers to how many times the model has been compared to the observed implied volatility. The population size in the DE is 15 × N v , that is, 15 × 3 = 45. With the population ratio increasing further, no significant benefits were observed. As shown in Table 7, the time on the GPU is around half of that on the CPU.  In the case of five parameters (ρ,ν, γ, ν 0 , κ), the calibration problem is more likely to give rise to a many-to-one problem; that is, many sets of parameter values may correspond to the same volatility surface. A regularization factorλ = 1.0 × 10 −6 is added to guide CaNN to a set of values for which the sum of their magnitude is the smallest among the feasible solutions, as shown in Equation (6). Here the DE population size is 50 = 10 × 5 parameters. As shown in Table  8, the Heston-CaNN finds the values of these parameters in approximately 0.5 seconds on a GPU, with around 20,000 function evaluations. There are several reasons why the CaNN with DE performs fast and efficiently. One reason is that the forward pass runs faster compared to a two-step computation from the Heston parameters to the implied volatilities, since an iterative root-finding algorithm for the implied volatility takes some computing time. In addition, the entire group of observed data can be evaluated at once in the framework. Other benefits come from the acceleration due to the parallelized DE optimization, where the whole population is computed simultaneously in the selection stage.

Sensitivity analysis based on ANNs
The gradients of the objective function can be extracted from the trained model, as mentioned in Section 3.1. These can be used to gain some insights into the complex structure of the loss surface and thus into the complexity of the optimization problem for calibration. We use here the Hessian matrix, which describes the local curvature of the loss function. No explicit formula is available for the relations the neural network learns between the implied volatilities and the model parameters, however, it is feasible to extract the Hessian from the trained ANN, giving insight into this relation and the sensitivities. Table 9 shows a Hessian matrix, where the Hessian is defined as ∂ yiyj L(θ), where y i and y j are output of the neural network (the to-be calibrated parameters). The Hessian is computed by differentiating the Heston-IV-ANN loss for computing the Black-Scholes implied volatility with respect to the Heston parameters on 35 market data points based on the parameter ranges in Table 5. Here the objective function is the MSE to exclude the effects of a regularization factor.
We can understand how the parameters affect the loss surface around the optimum with help of the Hessian matrix, by analyzing the sensitivities of the implied volatility with respect to the five parameters. Observe that the value of the Hessian with respect to κ is the smallest among the sensitivities. As shown in Table 9, the ratio between ∂ 2 J(Θ * )/∂ν 2 0 and ∂ 2 J(Θ * )/∂κ 2 is around 323, which suggests that changing 1 unit of ν 0 is approximately equivalent to changing 323 units of κ for the objective function. When the Hessian value is small in absolute value, the loss surface at that point exhibits flatness in the corresponding direction. As visible in Figure 4, the ground-truth loss surface gets increasingly stretched along the axis with κ, resulting in a narrow valley with a flat bottom. This also indicates that there is no unique global minimum above a certain non-zero convergence tolerance, since multiple values of κ would result in similar values of the loss function. In addition, the convergence performance, especially for the steepest descent method, depends on the ratio of the smallest to the largest eigenvalue of the Hessian; this ratio is also known as the condition number in the case of symmetric positive matrices. The ratio between ∂ 2 J(Θ * )/∂ν 2 and ∂ 2 J(Θ * )/∂κ 2 is around 45, as visible in Figure 4b. From the results in (Cui et al., 2017), when the target quantity is based on the option prices, this ratio between ∂ 2 J(Θ * )/∂ν 2 and ∂ 2 J(Θ * )/∂κ 2 is sometimes found to be of order 10 6 , which makes the calibration problem increasingly complex due to a great disparity in sensitivity. Calibrating to the implied volatility appears to reduce the ratio between different Hessian entries compared to the option prices, thus decreasing Hessian's condition number and resulting in a more efficient and accurate calibration performance.
Table 9 also suggests that the entries |∂ 2 J(Θ * )/∂κ 2 | and |∂ 2 J(Θ * )/∂ρ 2 | are among the smallest ones around the optimum. These two parameters thus have the smallest effect on the objective function. Therefore, the DE method can converge to values that are in a wide area of the search space, since these parameters do not impact the error measure significantly. A straightforward way to address this issue is by adding a regularization term to choose a particular solution, for example, like Equation (19). Another way is to take advantage of the population-based algorithm DE. Since there are several candidates in each generation, we can select the top few candidates to get an averaged solution when DE converges. This averaged solution may lead to wider optima and better robustness. Some recent papers, like (Izmailov et al., 2018) have used similar ideas to improve the generalization of the neural network. The parameter ν 0 is the most sensitive one and it appears to dominate the ANN calibration process. Therefore, the predicted parameter ν 0 is the most precise among all parameters in order to achieve the desired accuracy. The above analysis explains the behavior of the absolute deviation of the five parameters as shown in Table 8. The error measure MJ can not drop significantly below 7.18 × 10 −8 , as this value is close to the testing accuracy, MSE= 1.23 × 10 −7 , of the Heston-IV-ANN model. In other words, any further exploration of the DE optimization can not distinguish the parameters impact on the loss anymore.

Calibration to Bates quotes
In this section, we use the Bates model to create the synthetic market data, in order to generate a more realistic (complex) volatility shape by adding some 'perturbations' to the previous Heston data. It is then followed by a calibration based on the Heston model. The aim is to check whether the resulting implied volatilities can be recovered by the machine learning calibration framework. So, the observed data set in Table 5 is from the Bates option prices. During the calibration, we will employ the backward pass based on the Heston model to determine a set of parameter values which approximate the generated implied volatility function.
There are two sets of experiments, based on either rare jumps or common jumps in the stock price process. Figure 8 compares the implied volatility from the Bates model (forward) computations and the CaNN-based Heston implied volatilities. Clearly, when the impact of the jumps is small, the Heston model can accurately mimic the implied volatility generated by the Bates parameters. In this case, many different input parameters for the Bates model will give very similar implied volatility surfaces. With an increasing jump intensity, the deviation between two models can become significant, especially for short maturity options.  In financial practice, a perfect calibration to the ATM options is often required. We can enforce this, by increasing the weights of the ATM options in the objective function. The third figure from Figure 8 and Table 10 compare the differences when weighting ATM options in the objective function. The two curves fit very well ATM, however, in this case the total error increases with unequal weighting. The results demonstrate the robustness of the CaNN framework. It is however well-known that the Heston model can not fit shortmaturity market implied volatility very well, and therefore we will also employ a higher-dimensional model, e.g., calibrating directly the Bates model, which will be discussed in Section 4.4.

Calibrating the Bates model
In this section, we show the ability of Bates-CaNN to calibrate the Bates model parameters. The Bates model calibration is a higher-dimensional problem, since the Bates model is based on more parameters than the Heston model. The proposed CaNN framework is used to calibrate eight parameters in the Bates model, a setting in which we are dealing with more complex implied volatility surfaces.
Initially, the Bates-IV-ANN forward pass is trained on the training data set consisting of one million samples that are generated by the Bates model. Compared to the forward pass of the Heston model, merely a different characteristic function is inserted in the COS method, and three additional model parameters have been varied. The Bates-CaNN is employed to calibrate the Bates model, aiming to recover the eight Bates model parameters possibly well. All the samples have equal weight, and the regularization factor isλ = 1.0 × 10 −6 . Table 11 shows an example with high intensity, large variance jumps, for which the Heston model can not capture the corresponding implied volatility accurately. There are still 35 market samples as shown in Table 5. This is a challenging problem, estimating eight parameters, and including millions of comparisons between the model and the market values during calibration. Figure 9 compares the implied volatilities from the synthetic market and the calibrated Bates model. These volatilities resemble each other very well, even when the curvature is high with short time to maturity.  Figure 9: The solid lines represent the observed implied volatilities, with the dashed lines being the model calibrated ones. This plot shows the result with equal weights andλ = 1.0 × 10 −6 . The random seed is 2 during calibration.

Conclusion
In this work we proposed a machine learning-based framework to calibrate pricing models, in particular focussing on the high-dimensional calibration problems of the Heston and Bates models. The proposed approach has several favorable features, where an important one is robustness. Without choosing specific initial values, the DE global optimizer prevents the model calibration getting stuck in a local minimum. Fast calibration results from several factors. An ANN is efficient in computing the output values for a single input setting. When calibrating, the market data can be computed by ANNs simultaneously. Using DE, during the selection stage, ANNs can calculate a whole population in each generation at once, in parallel on a parallel computing architecture. The numerical experiments show that optimal values can be found within a second even when using a global optimization algorithm.
Using the ANN-based approach provides new tools to gain insight into the calibration problem. We used the Hessian matrix to perform a sensitivity analysis, where the sensitivities can efficiently be extracted for large numbers of model parameters. The Hessian matrix also explained why implied volatility, used in our work, is preferred over option prices, used in previous works, from an optimization perspective.
The calibration framework furthermore is generic, and does not require characteristic functions, or explicit gradients of financial models. The number of market data or to-calibrate parameters is also flexible. With this framework, the model can be extended to multiple quantities, e.g., calibrating to both option prices and implied volatility. To conclude, the ANN combined with DE provides an efficient and accurate framework for calibrating financial models.
To look forward, the above ANN calibration process does not rely on the quality of the initial guess. However, because the market does not change dramatically in a short time period, it may make sense to take the last available values as starting point of the calibration in future work. There are several possible strategies for the calibration framework in this situation. One is switching to the gradient-based local optimization algorithms and another one is narrowing the search space of the DE, which will further reduce the computational time considerably. Further future improvements include combining gradient-based optimization with the DE, since the gradient information is readily accessible. It is also feasible to employ a small neural network to reduce the computing time, like in the paper (Horvath et al., 2019) which builds a three-hidden-layers ANNs and each layer has 30 nodes during the calibration.

A COS Pricing Method
Based on the Feynman-Kac Theorem, the solution of the governing option valuation PDEs is given by the risk-neutral valuation formula, where V (t, x, ν) is the option value, and x, y are increasing functions of the underlying at t 0 and T , respectively. To get to the COS formula, we truncate the integration range, so that with | R f (y|x)dy − b a f (y|x)dy| < T OL. The density function of the underlying is then approximated by means of the characteristic function with a truncated Fourier cosine expansion, as follows: where Re means taking the real part of the expression in brackets, andf (ω; x) is the characteristic function of f (y|x) defined as beloŵ f (ω; x) = E(e iωy |x).
The prime at the sum symbol in (21) indicates that the first term in the expansion is multiplied by one-half. Replacing f (y|x) by its approximation (21) in (20) and interchanging the order of integration and summation, gives us the COS algorithm to approximate the value of a European option, as below: where is the Fourier cosine coefficient of H(t, y) = V (T, y, ν), which is available in closed-form for several European option payoff functions. Formula (23) can be directly applied to calculate the value of European option, it also forms the basis for the pricing of Bermudan options.
The COS algorithm exhibits an exponential convergence rate for all processes whose conditional density f (y|x) ∈ C ∞ ((a, b) ⊂ R). The size of the integration interval [a, b] can be determined with help of the cumulants.