How deep is your model? Network topology selection from a model validation perspective


Deep learning is a powerful tool, which is becoming increasingly popular in financial modeling. However, model validation requirements such as SR 11-7 pose a significant obstacle to the deployment of neural networks in a bank’s production system. Their typically high number of (hyper-)parameters poses a particular challenge to model selection, benchmarking and documentation. We present a simple grid based method together with an open source implementation and show how this pragmatically satisfies model validation requirements. We illustrate the method by learning the option pricing formula in the Black–Scholes and the Heston model.

Recent advances in machine learning have shown how neural networks can learn to solve classical problems in quantitative finance such as pricing (see [19]), calibration (see [3, 12, 16]) and hedging (see [5]). A key advantage of using this technique is that once the network has been trained, it performs the computations much faster than classical approaches such as Monte Carlo simulation.

While the machine learning theory and practical advances as well as the quantitative finance foundations behind this approach are well recognized, these models are not yet widely used in production. One of the key obstacles to the deployment of any model in a bank is that it has to pass a thorough model validation first and – depending on its use case – requires regulatory approval, which is not straightforward to obtain.

A key step in model validation is the model selection process. The model development function of a bank needs to explain and justify to the model validation function why a certain model has been selected (“conceptual soundness”). Traditionally, this is often formulated by choosing a champion model and one or various challenger models. However, those principles and practices have been formulated with models in mind like Heston vs. Black–Scholes.

For models with parameters, the model selection process includes the choice of these parameters. Consequently, for artificial neural networks (ANN), which typically have many parameters and hyperparameters, this question is a bit more complex than for classical quantitative finance models. First of all, a type of network topology has to be chosen. For example, can the problem be learned best with a multilayer perceptron (MLP) or a long-term-short-term memory network (LSTM)?Footnote 1 After a type of topology has been selected, for example MLP, the precise shape of that topology still needs to be determined and justified, e.g. the number of layers and neurons needs to be specified as hyperparameters as well as the activation functions. In a last step, the network's weights have to be chosen. This last step can be performed automatically by stochastic gradient descent methods such as Adam, see [18], but those often require hyperparameters as well, e.g. the learning rate.

ANNs are already widely used in many areas and the problem of choosing the type of network topology and the number of layers and units occurs everywhere and not just in quantitative finance.

Typical approaches to this problem are:

  1. Arbitrary choice: The model developer simply makes a choice and plays around with it until the results are satisfactory. Alternative choices are not documented or systematically evaluated.

  2. Automatic Machine Learning: There are attempts to automate the process of finding supervised machine learning solutions for a given problem and data set using unsupervised machine learning techniques – those are dubbed AutoML.

  3. Functional Analysis: From a mathematical perspective, a neural network is a method to approximate a non-linear function. All possible choices of network topologies and weights constitute a space of approximation functions and thus existing theory from approximation theory can be utilized.

The first method is of course straightforward and quite often successfully used in practice. However, it is not always successful and it is certainly not compliant with any model governance framework. For example, the SR 11-7 guidelines clearly state that “a sound development process will produce documented evidence in support of all model choices”, see [2, Sect. V.1]. This method clearly violates that requirement and thus cannot be used for financial models in production.

While the idea of the second method, i.e. using unsupervised learning to automatically choose the topology for a supervised learning problem, is very appealing, there are two problems to consider: First, automatic machine learning is unsurprisingly a much more difficult problem than just one supervised learning problem and it is not always feasible to apply it in practice. The second issue is that from a model validation perspective, one tries to shed light into a blackbox with another blackbox. AutoML, in particular if used in a proprietary version, can only be used to validate a bank’s machine learning solution to a specific problem after the AutoML engine that was used has itself successfully completed a model validation process. That might be possible, but requires a lot of additional resources – potentially much more than to justify the neural network in question directly. Also, it remains to be seen if a regulator would ever sign off such a blank check. One should however remark that open source initiatives such as [17] are currently improving transparency and efficacy of such automatic machine learning solutions.

The third method has the advantage that a lot of literature and research already exists in the area of function approximation, see for example [9, 22, 24]. Specifically for neural networks, there is the famous universal approximation theorem, see [8, 11, 15]. This literature is very helpful in providing sound methodological justifications and theoretical background. However, in practice, this will often not be enough to make the determination of the concrete (hyper-)parameters of the network straightforward for the problem at hand.

We therefore propose an intermediary solution that improves the first method by some of the techniques used in the second, in particular the use of grids. We formulate this approach in a language that takes the more classical perspective of degrees of freedom, which is very common in model validation. By systematically training neural networks on a grid of comparable parameters, we obtain a framework that can be used in practice to simultaneously satisfy multiple model validation requirements.

The rest of this paper is organized as follows: After a review of some popular neural network topologies in Sect. 1, in particular the MLPs and the LSTMs, we discuss the model selection problem in Sect. 2 and present a simple grid based method to select a neural network topology for a quantitative finance problem. We discuss in detail how this method addresses SR 11-7 model validation requirements in Sect. 3. After quickly discussing the implementation in Sect. 4, we apply the method in Sect. 5 to pricing in the context of the Black–Scholes model and the Heston model. The conclusions are summarized in Sect. 6.

1 Artificial neural networks (ANN)

In this section, we establish the notation for two of the most common typesFootnote 2 of artificial neural networks: the multilayer perceptron (MLP) and the long-term-short-term-memory network (LSTM). It should be highlighted that the notation, the mathematical formalization and also the implementation vary slightly throughout the literature, see for example [10, 20, 21]. We have chosen an approach here that is compatible with keras and tensorflow, see [1, 6], as those frameworks are very common.

1.1 Multilayer perceptron (MLP)

The most common and most basic form of neural networks are multilayer perceptrons.

Definition 1.1

(Multilayer perceptron)

A multilayer perceptron MLP is a tuple \(\operatorname{MLP}=(A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) defined by

  • a number \(n_{i}\) of inputs,

  • a number \(n_{o}\) of outputs,

  • a number \(n_{L}\) of layers and

  • for each layer \(1 \leq l \leq n_{L}\)

    • a number \(n_{l}\) of neurons (or units),

    • a matrix \(A_{l} = (A_{l;ij}) \in \operatorname{\mathbb{R}}^{n_{l-1} \times n_{l}}\) and a vector \(b_{l} = (b_{l;i}) \in \operatorname{\mathbb{R}}^{n_{l}}\) (called bias) of weights such that \(n_{0} = n_{i}\), \(n_{n_{L}}=n_{o}\) and

    • an activation function \(\sigma _{l}:\operatorname{\mathbb{R}}\to \operatorname{\mathbb{R}}\).

For any \(1 \leq l \leq n_{L}\), the tuple \((A_{l}, b_{l}, \sigma _{l})\) is called a layer. For \(l=n_{L}\), the layer is called the output layer and for \(1 \leq l< n_{L}\), the layer is called a hidden layer.

Neural networks can be visualized as in Fig. 1: This shows a network \((A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) with a total of \(n_{L}=4\) layers, i.e. 3 layers are hidden. Notice that the input layer is just a visualization of the input and is not part of the actual network topology.

Figure 1. Multilayer Perceptron

Computing the output from the input is codified in the feed forward.

Definition 1.2

(Feed forward)

Let \(\operatorname{MLP}=(A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) be a multilayer perceptron. Then for each \(1 \leq l \leq n_{L}\), we define a function

$$\begin{aligned} F_{l}:\operatorname{\mathbb{R}}^{n_{l-1}} \to \operatorname{\mathbb{R}}^{n_{l}}, \qquad v \mapsto \sigma _{l}\bigl(v^{T} A_{l} + b_{l}\bigr), \end{aligned}$$

where we employ the convention that \(\sigma _{l}\) is applied in every component. The composition

$$\begin{aligned} F:\operatorname{\mathbb{R}}^{n_{i}} \to \operatorname{\mathbb{R}}^{n_{o}}, \qquad F := F_{n_{L}} \circ \cdots \circ F_{2} \circ F_{1} \end{aligned}$$

is called the feed forward of MLP. Any set of inputs \(x \in \operatorname{\mathbb{R}}^{n_{i}}\) is called an input layer.

The links in Fig. 1 between the nodes visualize the feedforward and the dependence of each output on each input.

We assume here that the activation functions are chosen as the standard sigmoid, \(\sigma _{l}(x) := (1 + e^{-x})^{-1}\), for all but the last layer,Footnote 3 where we choose the linear activation \(\sigma _{n_{L}}(x) = x\).
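As an illustration, the feed forward of Definition 1.2 with sigmoid hidden layers and a linear output layer can be sketched in plain Python. This is a minimal sketch, not the implementation used for the numerical results: the function names `layer_forward` and `feed_forward` and the toy weights are our own assumptions, and matrices are stored as nested lists of shape \(n_{l-1} \times n_{l}\).

```python
import math


def sigmoid(x):
    """Standard sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """Linear activation used for the output layer."""
    return x


def layer_forward(v, A, b, sigma):
    """Compute F_l(v) = sigma(v^T A + b), with sigma applied componentwise.

    A is a nested list with len(v) rows and len(b) columns.
    """
    return [
        sigma(sum(v[i] * A[i][j] for i in range(len(v))) + b[j])
        for j in range(len(b))
    ]


def feed_forward(x, layers):
    """Compose the layer maps F_1, ..., F_{n_L} in order."""
    v = x
    for A, b, sigma in layers:
        v = layer_forward(v, A, b, sigma)
    return v


# Toy network (arbitrary weights): n_i = 2 inputs, 3 hidden units, n_o = 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1, -0.1], sigmoid),
    ([[1.0], [-1.0], [0.5]], [0.2], identity),
]
y = feed_forward([1.0, 2.0], layers)
```

The list `layers` plays the role of the tuple \((A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) with \(n_{L}=2\).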

The feed forward \(F=F_{\Theta}\) depends on all the parameters \(\Theta =(A_{l}, b_{l})_{1 \leq l \leq n_{L}}\). In general, the function \(F_{\Theta}\) will be unrelated to the problem if the weights Θ are not chosen carefully. This is performed by training the neural network with a training set \((x_{i}, y_{i})_{1 \leq i \leq N}\). More precisely, for a given cost function, say least squares, this amounts to solving the optimization problem

$$\begin{aligned} \Theta ^{*} := \operatorname*{argmin}_{\Theta}{\sum _{i=1}^{N}{ \bigl\Vert F_{\Theta}(x_{i}) - y_{i} \bigr\Vert ^{2}}}. \end{aligned}$$

In practice, this optimization is performed by stochastic gradient descent methods such as Adam, see [18], and a clever computation of the gradient, called backpropagation, see [23].

1.2 Long-Term-Short-Term-Memory Network (LSTM)

While MLPs work very well in many situations, there are certain applications for which that network topology is not ideal. MLPs are built to make a prediction y given one input x. But some applications have a canonical time structure and the task is to make time-dependent predictions \(y_{t}\) from time-dependent inputs \(x_{t}\). In principle, this case can of course be covered by MLPs as well, for example by adding a time grid to the inputs and the outputs, i.e. to learn \((y_{t_{1}}, \ldots , y_{t_{n}})\) from \((x_{t_{1}}, \ldots , {x_{t_{n}}}, t_{1}, \ldots , t_{n})\). This has the advantage that it is straightforward to implement, but the disadvantage that the network potentially has to be quite large. Another option is to train a network separately for each point \(t_{i}\). That has the disadvantage that there is no flow of information between the networks of different points in time \(t_{i}\) and the prediction of this sequence of networks might suffer from inconsistencies. A key application that has driven the research in that area is Natural Language Processing (NLP), where language is seen as a sequence of words.

A known way out of these technical problems is the long-term-short-term memory network as suggested in [14]. The idea depicted in Fig. 2 is as follows: Only one neural network, i.e. with one fixed set of weights, is trained, but in addition to the input, the network also processes the cell state. This additional piece of information is transmitted through the network and serves as a memory of previous predictions.

Figure 2. LSTM Concept

Formally, an LSTM can be defined by first defining a single LSTM layer, LSTML, and the associated feedforward and then stacking several of those together to form an LSTM.

Definition 1.3


A long-term-short-term-memory neural network layer is a tuple \(\operatorname{LSTML}=\operatorname{LSTML}(W, U, b, \tau , \sigma )\) consisting of

  • a number m of units and a number k of features,

  • a 4-tuple W of matrices \(W_{i}, W_{f}, W_{c}, W_{o} \in \operatorname{\mathbb{R}}^{k \times m}\) called input, forget, cell and output kernels,

  • a 4-tuple U of matrices \(U_{i}, U_{f}, U_{c}, U_{o} \in \operatorname{\mathbb{R}}^{m \times m}\) called input, forget, cell and output recurrent kernels,

  • a 4-tuple b of vectors \(b_{i}, b_{f}, b_{c}, b_{o} \in \operatorname{\mathbb{R}}^{m}\) called input, forget, cell and output bias,

  • two functions \(\sigma , \tau :\operatorname{\mathbb{R}}\to \operatorname{\mathbb{R}}\) called activation and recurrent activation.

This definition needs to be understood in the context of the associated feedforward. The feedforward of an LSTML is more complex than for MLPs. Because of the time-dependence, it needs to keep track not only of the final output, but of all the outputs at the various points in time and in addition it needs to keep track of the cell state.

Definition 1.4


Let \(\operatorname{LSTML}=\operatorname{LSTML}(W,U,b,\tau ,\sigma )\) be as above and let \(T \in \mathbb{N}\) be a natural number. Any sequence \(x = (x_{1}, \ldots , x_{T})\), \(x_{t} \in \operatorname{\mathbb{R}}^{k}\), is called an input sequence. The two sequences \(c_{t}\) and \(y_{t}\), \(t=1, \ldots , T\), called cell state and carry state, are recursively defined as follows:Footnote 4


\(i_{t} := \tau (x_{t} \bullet W_{i} + y_{t-1} \bullet U_{i} + b_{i}) \in \operatorname{\mathbb{R}}^{m}\),


\(f_{t} := \tau (x_{t} \bullet W_{f} + y_{t-1} \bullet U_{f} + b_{f}) \in \operatorname{\mathbb{R}}^{m}\),


\(\tilde{c}_{t} := \sigma (x_{t} \bullet W_{c} + y_{t-1} \bullet U_{c} + b_{c}) \in \operatorname{\mathbb{R}}^{m}\),


\(c_{t} := f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \in \operatorname{\mathbb{R}}^{m}\),


\(o_{t} := \tau (x_{t} \bullet W_{o} + y_{t-1} \bullet U_{o} + b_{o}) \in \operatorname{\mathbb{R}}^{m}\),


\(y_{t} := o_{t} \odot \sigma (c_{t}) \in \operatorname{\mathbb{R}}^{m}\).

Finally, the function

$$\begin{aligned} F_{T}: &\operatorname{\mathbb{R}}^{k \times T} \to \operatorname{\mathbb{R}}^{m \times T}, \\ x=&(x_{1}, \ldots , x_{T}) \mapsto (y_{1}, \ldots , y_{T}) \end{aligned}$$

is called the feedforward of LSTML of length T.

The interpretation of this is as follows (see Fig. 3): The input value \(i_{t}\) and forget value \(f_{t}\) represent how much weight is put by the network on the current input \(x_{t}\) and how much weight is put on forgetting the past memory. Then, a candidate cell state \(\tilde{c}_{t}\) and a candidate output \(o_{t}\) are computed on the basis of only the current input \(x_{t}\) and the last prediction \(y_{t-1}\). Next, the new cell state \(c_{t}\) is computed as a weighted combination of the candidate cell state \(\tilde{c}_{t}\) and the previous cell state \(c_{t-1}\), where the weights are given by the input weight \(i_{t}\) and the forget weight \(f_{t}\). Finally, the carry \(y_{t}\), i.e. the intermediate output at t, is computed from the candidate output \(o_{t}\), which is also based only on the current input \(x_{t}\) and the last prediction \(y_{t-1}\), by weighting \(o_{t}\) with the new cell state \(c_{t}\). It should be noted that depending on the application, one might either consider the value \(y_{T}\) as the feedforward of the network or the vector \((y_{1}, \ldots , y_{T})\).
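The recursion of Definition 1.4 can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions of ours: the initial states \(c_{0}\) and \(y_{0}\) are set to zero, the recurrent activation τ is the sigmoid, the activation σ is tanh, and the carry is computed as \(o_{t} \odot \sigma (c_{t})\), which is the convention used by keras. The names `affine` and `lstm_layer_forward` and the dictionary layout of the weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, W, y_prev, U, b):
    """Return x . W + y_prev . U + b as a list of length m."""
    m = len(b)
    return [
        sum(x[a] * W[a][j] for a in range(len(x)))
        + sum(y_prev[a] * U[a][j] for a in range(m))
        + b[j]
        for j in range(m)
    ]


def lstm_layer_forward(xs, W, U, b, tau=sigmoid, sigma=math.tanh):
    """Feedforward of one LSTM layer over an input sequence xs.

    W, U, b are dicts keyed by 'i', 'f', 'c', 'o' (input, forget, cell, output).
    Returns the sequence of carries (y_1, ..., y_T).
    """
    m = len(b["i"])
    c = [0.0] * m  # assumed initial cell state c_0 = 0
    y = [0.0] * m  # assumed initial carry y_0 = 0
    ys = []
    for x in xs:
        i_t = [tau(z) for z in affine(x, W["i"], y, U["i"], b["i"])]
        f_t = [tau(z) for z in affine(x, W["f"], y, U["f"], b["f"])]
        c_tilde = [sigma(z) for z in affine(x, W["c"], y, U["c"], b["c"])]
        o_t = [tau(z) for z in affine(x, W["o"], y, U["o"], b["o"])]
        # New cell state: forget part of the old state, add part of the candidate.
        c = [f_t[j] * c[j] + i_t[j] * c_tilde[j] for j in range(m)]
        # Carry: output gate applied to the activated cell state.
        y = [o_t[j] * sigma(c[j]) for j in range(m)]
        ys.append(y)
    return ys


# Toy example with k = 1 feature and m = 1 unit; only the cell kernel is non-zero.
W = {"i": [[0.0]], "f": [[0.0]], "c": [[1.0]], "o": [[0.0]]}
U = {g: [[0.0]] for g in "ifco"}
b = {g: [0.0] for g in "ifco"}
ys = lstm_layer_forward([[1.0], [1.0]], W, U, b)
```

In this toy example all gates equal 0.5, so the cell state after the first step is \(0.5 \tanh (1)\) and the first carry is \(0.5 \tanh (0.5 \tanh (1))\).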

Figure 3. Long-Term-Short-Term-Memory Network Cell

Practical applications rarely consist of a single LSTML for various reasons. First, as we can see in Definition 1.4, each step of the computation, i.e. the input, forget, output and candidate cell state, is equivalent to a feedforward of an MLP with just a single layer. That means that a single LSTML cannot capture arbitrary non-linearities with these computations. Also, notice that if a network consists of just an LSTML, then the number k of features is forced to be \(k=n_{i}\), i.e. the number of inputs, and the number m of units is forced to be \(m=n_{o}\), i.e. the number of outputs. That means that no parameters can be changed to adapt the network to the problem. The solution to both of these problems is to chain multiple LSTMLs in sequence, where the number m of units can be chosen at will, followed by a single MLP layer to ensure that the last output has dimension \(n_{o}\).

Definition 1.5


A long-term-short-term memory network (LSTM) with L layers is defined by a sequence \(\operatorname{LSTM}= (\operatorname{LSTML}_{1}, \ldots , \operatorname{LSTML}_{L-1}, \operatorname{MLP})\) of \(L-1\) LSTMLs with number of units \(m_{l}\) and number of features \(k_{l}\) such that \(k_{1}=n_{i}\) and \(k_{l+1}=m_{l}\), and a single MLP with input dimension \(m_{L-1}\) and output dimension \(n_{o}\).

2 Network topology selection

In this section we introduce a method to select a network topology that is consistent with established model validation practices.

Assume we want to train an MLP on a financial problem such as pricing. The key model parameters to choose are the number of layers \(n_{L}\) and the number of units \(n_{u}\) in each layer (we are assuming that we want to choose the same number in each layer). A very simple way to obtain a documented choice for that is to not pick one arbitrary parameter vector \((n_{u}, n_{L})\), but to specify a grid of those parameters and build an MLP for each of them, see Fig. 4 for an example. The idea is to then train all of those models with the same data, analyze their learning curves and the performance of the trained models, and then select the least complex model whose performance is within thresholds acceptable for the business case. As discussed in Sect. 3, this simple procedure already satisfies many key requirements of model validation, but there is one catch.

Figure 4. Mapping model parameters to models NN

A key question, which is interesting not only from a model validation perspective, is: For the given learning problem, is it better to increase model performance by increasing the number of layers or by increasing the number of units? If the grid \(\mathcal{G}=(g_{ij})\) of parameters \(g_{ij}=(n_{u_{i}}, n_{L_{j}})\) is a cartesian product of a vector of possible numbers of units \((n_{u_{i}})\) and a vector of possible numbers of layers \((n_{L_{j}})\), this amounts to the question of whether one should go down a row or right along a column in the grid, see Fig. 4.

In order to obtain a meaningful answer to this question, one has to consider the degrees of freedom of the model. For a neural network, these are exactly the number \(n_{w}\) of trainable weights. It is good model selection practice to increase the degrees of freedom of a model slowly from below to obtain as many as needed to be accurate, but no more than necessary to avoid overfitting. Applied to MLPs, this means that we actually want a grid where along one (and only one) dimension we can increase the degrees of freedom and along the other we can change the topology of the network keeping the degrees of freedom fixed. This is not achieved by a grid \(\mathcal{G}\) that is simply a cartesian product of number of units and number of layers, because both increase the degrees of freedom and they do so in a very different way, see Fig. 5(a). An easy way to fix this is to build the grid by first fixing the smallest candidate for the number of layers, say \(n_{L_{1}}=2\), and then filling the first column of the grid with \(g_{i1}=(n_{u_{i}}, n_{L_{1}})\), where \((n_{u_{i}})\) is the vector of candidate numbers of units. In a second step, we then compute the resulting degrees of freedom \(n_{w}\) for each row in the first column. In a third step, we fill the other columns by gradually increasing the number of layers, but simultaneously reducing the number of units to keep the degrees of freedom in each row constant. In this way, the row axis increases the degrees of freedom (via increasing the number of units) and along the column axis we obtain models with the same degrees of freedom, but different network topologies. That is exactly what we want. To carry out this program in detail, we need to calculate the degrees of freedom, i.e. the number of trainable weights of the network, see below, and formalize the above in an algorithm, see Algorithm 2.2.

Figure 5. Number of trainable weights

For an MLP \(\operatorname{NN}=(A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\), calculating the number of trainable weights amounts to the following: Given that layer l has a matrix \(A_{l} \in \mathbb{R}^{n_{l-1} \times n_{l}}\) and a bias \(b_{l} \in \mathbb{R}^{n_{l}}\), this yields \(n_{l-1} n_{l} + n_{l} = n_{l}(n_{l-1}+1)\) trainable weights per layer. Taking into account that \(n_{0}=n_{i}\) and \(n_{n_{L}}=n_{o}\), i.e. the dimensions of the input and the output layers are fixed, the total number \(n_{w}\) of trainable weights is given by

$$\begin{aligned} n_{w} = \textstyle\begin{cases} n_{o}(n_{i} + 1), & n_{L}=1, \\ n_{1}(n_{i} + n_{o} + 1) + n_{o}, & n_{L}=2, \end{cases}\displaystyle \end{aligned}$$

and for \(n_{L} \geq 3\)

$$\begin{aligned} n_{w} ={}& n_{1}(n_{i} + 1) + n_{o} (n_{n_{L}-1}+1) + \sum_{l=2}^{n_{L} - 1}{n_{l}(n_{l-1} + 1)}. \end{aligned}$$

While in theory it is perfectly possible to choose a different number of units for every hidden layer, in practice one often uses the following.

Assumption 2.1

The number of units \(n_{u}\) is the same in each layer.

In that case Eq. (2) simplifies to

$$\begin{aligned} n_{w} & = n_{u}(n_{i} + 1) + n_{o} (n_{u}+1) + (n_{L}-2)n_{u}(n_{u} + 1) \\ &= n_{u}^{2} (n_{L} - 2) + (n_{i} + n_{o} + n_{L} - 1)n_{u} + n_{o}, \end{aligned}$$

which requires two choices, namely \(n_{L}\), the number of layers, and \(n_{u}\), the number of units per layer. In Fig. 5(a) we plot this function for an example. We see that, in accordance with Eq. (3), the number of trainable weights \(n_{w} = n_{w}(n_{L}, n_{u})\) depends linearly on the number \(n_{L}\) of layers, but quadratically on the number \(n_{u}\) of units. The crucial exception to this is the case of \(n_{L}=2\), as is evident from Eq. (1), where the number of units only enters linearly. This is why we see the huge jump in trainable weights when passing from \(n_{L}=2\) to \(n_{L}=3\) layers.
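As a sanity check, the layer-by-layer count \(n_{l}(n_{l-1}+1)\) can be compared with the closed form for equal hidden layer sizes. The sketch below uses function names of our own choosing (`count_weights`, `count_weights_uniform`); the example dimensions are arbitrary.

```python
def count_weights(n_i, hidden_units, n_o):
    """Sum n_l * (n_{l-1} + 1) over all layers of an MLP.

    hidden_units lists the sizes of the hidden layers, so n_L = len(hidden_units) + 1.
    """
    sizes = [n_i] + list(hidden_units) + [n_o]
    return sum(sizes[l] * (sizes[l - 1] + 1) for l in range(1, len(sizes)))


def count_weights_uniform(n_i, n_o, n_L, n_u):
    """Closed form for n_L >= 2 when every hidden layer has n_u units."""
    return n_u**2 * (n_L - 2) + (n_i + n_o + n_L - 1) * n_u + n_o


# Example: 5 inputs, 1 output, 3 layers (2 hidden layers of 10 units each).
n_w = count_weights(5, [10, 10], 1)
```

For \(n_{L}=2\) the quadratic term vanishes, which reproduces the linear dependence on \(n_{u}\) noted above.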

Computing a reduced number \(n_{u}\) of units after increasing the number \(n_{L}\) of layers, while keeping the total number of weights constant, therefore amounts to rewriting the quadratic equation Eq. (3) as

$$\begin{aligned} n_{u}^{2} + \underbrace{ \frac{n_{i} + n_{o} + n_{L} - 1}{n_{L} - 2}}_{=:p} n_{u} + \underbrace{ \frac{n_{o} - n_{w}}{n_{L} - 2}}_{=:q} = 0, \end{aligned}$$

which is easily solved by setting

$$\begin{aligned} n_{u} = - \frac{p}{2} + \sqrt{\frac{p^{2}}{4} - q}. \end{aligned}$$

Of course in practice one has to take the floor \(\lfloor n_{u} \rfloor \) (or a rounding) to enforce an integer number of units. Using this we can keep the number of weights approximately constant when increasing the number of layers. This is illustrated in an example in Fig. 5(b). Keeping the number of weights constant when increasing the layers allows us to control the degrees of freedom with a single variable resulting in a more meaningful comparison between various network topologies.
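The computation of the reduced number of units can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, the floor is used for rounding, and the case \(n_{L}=2\) is excluded because the equation is then linear rather than quadratic.

```python
import math


def count_weights_uniform(n_i, n_o, n_L, n_u):
    """Number of trainable weights of an MLP with n_u units per hidden layer (n_L >= 2)."""
    return n_u**2 * (n_L - 2) + (n_i + n_o + n_L - 1) * n_u + n_o


def reduced_units(n_w, n_i, n_o, n_L):
    """Largest integer n_u such that an MLP with n_L >= 3 layers has about n_w weights.

    Solves n_u^2 + p * n_u + q = 0 for the positive root and takes the floor.
    """
    if n_L < 3:
        raise ValueError("the quadratic formula requires n_L >= 3")
    p = (n_i + n_o + n_L - 1) / (n_L - 2)
    q = (n_o - n_w) / (n_L - 2)
    return math.floor(-p / 2 + math.sqrt(p * p / 4 - q))


# Example: start from 2 layers with 64 units (5 inputs, 1 output), then go to 3 layers.
n_w = count_weights_uniform(5, 1, 2, 64)
n_u_reduced = reduced_units(n_w, 5, 1, 3)
```

In this example the two-layer network has 449 weights, and the three-layer network with 17 units has 426 weights, the closest count that does not exceed the original one.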

This leaves us with the following method of determining a good network topology for any given problem.

Algorithm 2.2

(Network topology selection)


Input:

  (i) An artificial neural network NN.

  (ii) A labeled data set \((x,y)\) together with a train/test split (e.g. 80%, 20%).

  (iii) A range of number of layers \(\mathcal{N}_{L} = (L_{1}, \ldots , L_{r})\).

  (iv) A range of number of original units \(\mathcal{N}_{u} = (u_{1}, \ldots , u_{s})\).

  (v) A bias threshold \(t_{b}\) and a variance threshold \(t_{v}\) together with metrics for both (e.g. MSE).

  (vi) A number \(e_{\max}\) of maximal epochs.


  (i) Create a grid \(\mathcal{G} = (g_{ij})_{1 \leq i \leq s, 1 \leq j \leq r}\) of tuples \(g_{ij}\) as follows: For the first number \(L_{1}\) of layers, initialize \(g_{i1} := (u_{i}, L_{1})\), i.e. use the original number of units and \(L_{1}\) layers. For any \(j>1\), set \(g_{ij} := (u_{i}', L_{j})\), where \(u_{i}'\) is the solution of Eq. (5) with \(n_{w}\) set to the value resulting from \(g_{i1}\). This results in a grid where the degrees of freedom increase when moving down the rows, but stay constant when moving right along the columns, cf. Fig. 4.

  (ii) For each network NN resulting from the parameters in the grid \(\mathcal{G}\), train the network with \((x,y)\) until the bias and variance are below the thresholds \(t_{b}\) and \(t_{v}\) (or the maximum number of epochs \(e_{\max}\) is reached). This results in a grid of trained models, see Sect. 5.2 for an example.

  (iii) Cross out all networks on the grid for which the bias and the variance are not below the given thresholds.Footnote 5

  (iv) Amongst the remaining, find the smallest number \(n_{L}\) of layers for which there exists a number of units \(n_{u}\) such that the model \((n_{u}, n_{L})\) has not been crossed out. Amongst those, choose the one with the smallest number \(n_{u}\) of units.

Output: A pair \((n_{u}, n_{L})\) of numbers of units and layers for the network NN such that the bias and the variance of the network on \((x,y)\) are within the thresholds and the numbers \((n_{u}, n_{L})\) are optimal within the given range.Footnote 6
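The grid construction and the final selection step of Algorithm 2.2 can be sketched in plain Python. This is a sketch under stated assumptions, not the keras_grid implementation: the training step is left out, the dictionary `results` holds made-up bias and variance values purely for illustration, all function names are hypothetical, and every number of layers after the first is assumed to be at least 3.

```python
import math


def count_weights_uniform(n_i, n_o, n_L, n_u):
    """Trainable weights of an MLP with n_u units per hidden layer (n_L >= 2)."""
    return n_u**2 * (n_L - 2) + (n_i + n_o + n_L - 1) * n_u + n_o


def reduced_units(n_w, n_i, n_o, n_L):
    """Floor of the positive root of the quadratic in n_u (requires n_L >= 3)."""
    p = (n_i + n_o + n_L - 1) / (n_L - 2)
    q = (n_o - n_w) / (n_L - 2)
    return math.floor(-p / 2 + math.sqrt(p * p / 4 - q))


def build_grid(n_i, n_o, layers, units):
    """Rows follow the original units u_i, columns follow the numbers of layers L_j."""
    grid = []
    for u in units:
        n_w = count_weights_uniform(n_i, n_o, layers[0], u)
        row = [(u, layers[0])]
        for L in layers[1:]:
            row.append((reduced_units(n_w, n_i, n_o, L), L))
        grid.append(row)
    return grid


def select_topology(grid, results, t_b, t_v):
    """Pick the admissible model with the fewest layers, then the fewest units.

    results maps (n_u, n_L) to a (bias, variance) pair measured after training.
    Returns None if every model has been crossed out.
    """
    admissible = [
        g for row in grid for g in row
        if results[g][0] < t_b and results[g][1] < t_v
    ]
    if not admissible:
        return None
    return min(admissible, key=lambda g: (g[1], g[0]))


grid = build_grid(n_i=5, n_o=1, layers=(2, 3, 4), units=(16, 64))

# Hypothetical (bias, variance) values; in practice these come from training.
results = {
    (16, 2): (0.5, 0.5), (7, 3): (0.01, 0.01), (5, 4): (0.01, 0.01),
    (64, 2): (0.2, 0.01), (17, 3): (0.001, 0.001), (12, 4): (0.001, 0.001),
}
best = select_topology(grid, results, t_b=0.05, t_v=0.05)
```

With these made-up numbers both two-layer models are crossed out, so the selection falls on the three-layer model with the fewest units.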

Optionally, one can create a second grid to compare how the MLP performs against the LSTM. One only has to determine the number of weights in an LSTM as well and choose the number of units in the original LSTMs such that the resulting degrees of freedom are approximately the same as in the reference MLPs. It follows from Definition 1.3 that the number of weights in a single LSTM layer is given by

$$\begin{aligned} 4m^{2} + 4(k+1)m \end{aligned}$$

and thus by Definition 1.5, we obtain that the total number \(n_{w}\) of weights in the LSTM is given by

$$\begin{aligned} n_{w} & = 4 (2 n_{L} -3) n_{u}^{2} + (4 n_{i} + n_{o} + 4 n_{L} - 4 )n_{u} + n_{o}, \end{aligned}$$

where \(n_{L}\) is the number of layers, \(n_{u}\) is the number of units in each layer and \(n_{i}\) and \(n_{o}\) are the number of inputs and outputs.
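The closed form for the LSTM can be checked against a layer-by-layer count. The sketch below assumes, as our reading of Definition 1.5, that each LSTM layer after the first receives the \(n_{u}\) outputs of the previous layer as its features and that the final dense layer has \(n_{o}(n_{u}+1)\) weights; the function names are hypothetical.

```python
def lstm_layer_weights(k, m):
    """Weights of one LSTM layer: 4 kernels (k x m), 4 recurrent kernels (m x m), 4 biases (m)."""
    return 4 * m * m + 4 * (k + 1) * m


def lstm_weights(n_i, n_o, n_L, n_u):
    """Layer-by-layer count: n_L - 1 LSTM layers with n_u units, then one dense layer."""
    total = 0
    k = n_i
    for _ in range(n_L - 1):
        total += lstm_layer_weights(k, n_u)
        k = n_u  # the next layer sees the n_u outputs of this layer as features
    return total + n_o * (n_u + 1)


def lstm_weights_closed_form(n_i, n_o, n_L, n_u):
    """Closed form of the total number of weights for n_L >= 2."""
    return 4 * (2 * n_L - 3) * n_u**2 + (4 * n_i + n_o + 4 * n_L - 4) * n_u + n_o


# Example: 5 inputs, 1 output, one LSTM layer with 10 units plus the dense layer.
n_w = lstm_weights(5, 1, 2, 10)
```

For comparable units and layers, the LSTM has roughly four to eight times as many weights as the MLP, which is why the number of units has to be reduced to match the degrees of freedom of the reference MLPs.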

3 Fulfillment of model validation requirements

The network topology selection as discussed in Algorithm 2.2 serves to fulfill various model validation requirements as mandated by SR 11-7, see [2]. First of all we note that just because neural networks are not classical Monte Carlo simulations, this does not mean that they are in principle ineligible for production. In fact, SR 11-7 explicitly states that the “nature of testing and analysis will depend on the type of model and will be judged by different criteria depending on the context”, [2, Sect. IV, p.6]. Thus, it is reasonable and compliant that testing neural networks looks a bit different from testing a Monte Carlo simulation.

Our approach to select a network topology addresses multiple SR 11-7 validation requirements simultaneously:

  1. Documented Choice: “A sound development process will produce documented evidence in support of all model choices. Comparison to alternative theories and approaches should be included” ([2, Sect. V.1, p.11]). The learning curves of a grid of trained models resulting from the network topology selection method, see Fig. 9 for an example, serve as documented evidence that a “comparison with alternative theories and approaches” as a “fundamental component of a sound modeling process” ([2, Sect. IV, p.6]) has been conducted. This applies both to the number of layers and units within a network topology and to comparisons between multiple network topologies.

  2. Benchmarking: “Benchmarking is the comparison of a given model’s inputs and outputs to estimates from alternative internal or external data or models. It can be incorporated in model development as well as in ongoing monitoring. Whatever the source, benchmark models should be rigorous and benchmark data should be accurate and complete to ensure a reasonable comparison.” ([2, Sect. V.2, p.13]) Because the model selection is based on a comprehensive systematic benchmark against alternatives based on exactly the same implementation approach, input data and fitting parameters, this requirement is satisfied automatically.

  3. Outcome Analysis: “The third core element of the validation process is outcomes analysis, a comparison of model outputs to corresponding actual outcomes. The precise nature of the comparison depends on the objectives of a model, and might include an assessment of the accuracy of estimates or forecasts, an evaluation of rank-ordering ability, or other appropriate tests.” ([2, Sect. V.3, p.13]) Because a model in the grid is only chosen if its bias is below the specified threshold in a suitable metric, this ensures that the chosen model is accurate.

  4. Prevention of Overfitting: “Analysis of in-sample fit and of model performance in holdout samples (data set aside and not used to estimate the original model) are important parts of model development...” ([2, Sect. V.3, p.14]) Because the learning curves include the variance, the requirement to consider both is satisfied. In fact, because the total number of degrees of freedom in the model is increased from below, this method automatically makes the model much less prone to overfitting.

One should highlight that the last point is particularly delicate for financial applications. Models in the financial domain that have hundreds or even thousands of parameters are typically met with great scepticism as avoiding overfitting and instabilities in those models is no easy task. Because the machine learning community routinely deals with models that have many parameters, diagnostic methodological frameworks, a strong culture to cross-validate and standardized open source solutions have already made those models tractable.

We conclude by stressing that the proposed method for network topology selection is only one aspect of model selection, which in turn is only one aspect of model validation. Performing the suggested method of network topology selection does not exonerate the user from performing the full set of validation tasks mandated by SR 11-7.

4 Technical implementation

An advantage of the network topology selection method (Algorithm 2.2) is that in theory, the implementation is straightforward. Any framework that can run one neural network can be used to run a grid of neural networks by using a loop. In practice, however, this results in various IT tasks such as managing the IDs of the models, loading and saving them, ensuring that their training parameters are consistent etc. The numerical simulation in Sect. 5 has been performed using the popular keras models, see [6]. We have isolated the part of the code that wraps the models into keras_grid, an open source module, which conveniently solves these problems.Footnote 7

5 Application to quant finance models

In this section we apply the network topology selection method (Algorithm 2.2) to the problem of learning the option pricing function of the Black–Scholes model and the Heston model. In both cases we compare the topologies resulting from the MLP and the LSTM. We find that even though the prices \(C(T, K)\) of call options clearly depend on the maturity T, the MLP is actually much better suited to learn them than the LSTM. This is plausible, as the problem of managing complex long short-term memory does not really occur for Markovian paths generated by the Black–Scholes or Heston model. The results are described below and can also be explored interactively in a Jupyter notebook (see footnote 8).

5.1 Black–Scholes & Heston model

The Black–Scholes model, see [4], assumes that the stock price \(S_{t}\) is a stochastic process on a probability space \((\Omega , \mathcal{F}, \mathbb{Q})\) (where we think of \(\mathbb{Q}\) as the risk-neutral measure) satisfying

$$\begin{aligned} dS_{t} &= r S_{t}\,dt + \sigma S_{t}\,dW_{t}, \end{aligned}$$

where \(r \in \operatorname{\mathbb{R}}\) is a fixed risk-free rate, \(\sigma > 0\) is the volatility and the process \(W_{t}\) is a Brownian motion. We denote by \(\mathbb{F} = (\mathcal{F}_{t})_{t \geq 0}\) the augmented filtration generated by \(W_{t}\). Under these assumptions, a European call option with expiry at T and strike K, i.e. a derivative with payoff \((S_{T} - K)^{+}\), can be priced analytically with the famous Black–Scholes formula:

$$\begin{aligned} \begin{aligned} C_{t}(T,K) & = \mathbb{E} \bigl[e^{-r(T-t)}(S_{T} - K)^{+} \mid \mathcal{F}_{t}\bigr] \\ & = S_{t} \Phi (d_{1}) - K e^{-r (T-t)} \Phi (d_{2}), \end{aligned} \end{aligned}$$

where Φ denotes the cdf of the standard normal distribution and

$$\begin{aligned} &d_{1}:= \frac{1}{\sigma \sqrt{T-t}} \biggl( \log \biggl( \frac{S_{t}}{K} \biggr) + \biggl(r + \frac{\sigma ^{2}}{2}\biggr) (T-t) \biggr), \\ &d_{2} := d_{1} - \sigma \sqrt{T-t}. \end{aligned}$$
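
As a quick check of the closed form, the formula can be evaluated in a few lines. The sketch below uses the standard form of \(d_{1}\), with \(\sigma \sqrt{T-t}\) in the denominator, and writes `tau` for \(T-t\); the function names and the sample inputs are illustrative choices.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bs_call(s, k, r, sigma, tau):
    """Black-Scholes price of a European call with time to expiry tau = T - t."""
    d1 = (log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return s * norm_cdf(d1) - k * exp(-r * tau) * norm_cdf(d2)


price = bs_call(s=100.0, k=100.0, r=0.05, sigma=0.2, tau=1.0)
print(round(price, 4))  # 10.4506
```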

The Black–Scholes model rests on the assumption that the volatility is constant, which is arguably not realistic. The Heston model, see [13], belongs to the class of stochastic volatility models, which assume a stochastic dynamic not just for the stock price, but also for the volatility. It is defined by

$$\begin{aligned} &dS_{t} = r S_{t}\,dt + \sqrt{\nu _{t}} S_{t}\,dW_{t}^{S}, \end{aligned}$$
$$\begin{aligned} &d \nu _{t} = \kappa (\theta - \nu _{t})\,dt + \xi \sqrt{ \nu _{t}}\,dW_{t}^{\nu}, \end{aligned}$$
$$\begin{aligned} &dW_{t}^{S}\,dW_{t}^{\nu}= \rho \,dt, \end{aligned}$$

where \(r \in \operatorname{\mathbb{R}}\) is the risk-free rate, \(\kappa \in \operatorname{\mathbb{R}}\) is the rate at which the stochastic variance \(\nu _{t}\) reverts to the long-term mean \(\theta > 0\), \(\xi > 0\) is the volatility of the volatility and \(\rho \in [-1, 1]\) is the correlation between the Brownian motions \(W_{t}^{S}, W_{t}^{\nu}\).

The option price in a Heston model can be computed via (see [7])

$$\begin{aligned} C_{0}(T,K) = S_{0} \Pi _{1} - e^{-rT} K \Pi _{2}, \end{aligned}$$

where \(\Pi _{1}\) and \(\Pi _{2}\) are given as integrals over the characteristic function \(\Psi = \Psi _{\ln (S_{T})}\) of \(\ln (S_{T})\):

$$\begin{aligned} &\Pi _{1}= \frac{1}{2} + \frac{1}{\pi} \int _{0}^{\infty}{ \operatorname{Re} \biggl( \frac{e^{-iw \ln (K)} \Psi (w-i)}{i w \Psi (-i)} \biggr)\,dw}, \\ &\Pi _{2}= \frac{1}{2} + \frac{1}{\pi} \int _{0}^{\infty}{ \operatorname{Re} \biggl( \frac{e^{-iw \ln (K)} \Psi (w)}{i w } \biggr)\,dw}. \end{aligned}$$
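
A minimal numerical sketch of this pricing formula is given below. It is not the implementation used for the experiments: it assumes the widely used numerically stable representation of the Heston characteristic function, a plain midpoint rule on a truncated interval, and \(\kappa - \rho \xi > 0\). The function names, the truncation `w_max` and the number of nodes `n` are illustrative choices.

```python
import cmath
from math import exp, log, pi


def heston_cf(w, s0, r, t, v0, kappa, theta, xi, rho):
    """Characteristic function Psi(w) of ln(S_T) in the Heston model (stable branch)."""
    iw = 1j * w
    b = kappa - rho * xi * iw
    d = cmath.sqrt(b * b + xi**2 * (iw + w * w))
    g = (b - d) / (b + d)
    e = cmath.exp(-d * t)
    c = kappa * theta / xi**2 * ((b - d) * t - 2.0 * cmath.log((1.0 - g * e) / (1.0 - g)))
    dd = v0 / xi**2 * (b - d) * (1.0 - e) / (1.0 - g * e)
    return cmath.exp(iw * (log(s0) + r * t) + c + dd)


def heston_call(s0, k, r, t, v0, kappa, theta, xi, rho, w_max=200.0, n=20000):
    """Call price S_0 * Pi_1 - exp(-rT) * K * Pi_2 with midpoint-rule integration."""
    args = (s0, r, t, v0, kappa, theta, xi, rho)
    dw = w_max / n
    psi_minus_i = heston_cf(-1j, *args)  # Psi(-i), the normalisation in Pi_1
    int1 = int2 = 0.0
    for j in range(n):
        w = (j + 0.5) * dw  # midpoints never evaluate the integrand at w = 0
        kernel = cmath.exp(-1j * w * log(k)) / (1j * w)
        int1 += (kernel * heston_cf(w - 1j, *args) / psi_minus_i).real
        int2 += (kernel * heston_cf(w, *args)).real
    pi1 = 0.5 + int1 * dw / pi
    pi2 = 0.5 + int2 * dw / pi
    return s0 * pi1 - exp(-r * t) * k * pi2


price = heston_call(s0=100.0, k=100.0, r=0.05, t=1.0,
                    v0=0.04, kappa=2.0, theta=0.04, xi=0.3, rho=-0.7)
print(price)
```

A useful sanity check is the limit of a very small volatility of volatility with \(\nu _{0} = \theta\), where the price should approach the Black–Scholes price with \(\sigma = \sqrt{\theta}\).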

We learn the Black–Scholes formula Eq. (7) for \(t=0\) as well as the Heston option pricing formula Eq. (11) with a neural network.

To that end we generate a data set as follows: First, we define an evenly spaced grid of 60 maturities T between 3M and 5Y. Second, for the other model parameters, we generate \(10{,}000\) samples uniformly distributed in the hypercube with the bounds specified in Fig. 6. Third, we take the Cartesian product of the maturities and the samples and obtain a data set with \(600{,}000\) samples. The special treatment of the maturity as a parameter is required here to adhere to the input format of the LSTM. Notice that in a production environment, the specification of the bounds for the training set has to be in line with business requirements, and careful input checking against these bounds has to be performed after training when predictions are made. An example of a price surface for both models is shown in Fig. 7.
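
The three steps can be sketched as follows. The bounds in `BS_BOUNDS` are hypothetical placeholders (the actual ranges are those of Fig. 6), the choice of four Black–Scholes inputs is an assumption, and the demo uses only 100 samples to keep it light; with \(10{,}000\) samples the same construction yields \(60 \times 10{,}000 = 600{,}000\) rows.

```python
import random
from itertools import product

# Hypothetical bounds for illustration only; the actual ranges are given in Fig. 6.
BS_BOUNDS = {"S0": (50.0, 150.0), "K": (50.0, 150.0), "r": (0.0, 0.05), "sigma": (0.05, 0.5)}


def maturity_grid(n=60, t_min=0.25, t_max=5.0):
    """Evenly spaced maturities in years between 3M and 5Y, both ends included."""
    step = (t_max - t_min) / (n - 1)
    return [t_min + i * step for i in range(n)]


def sample_parameters(bounds, n_samples, seed=0):
    """Draw n_samples points uniformly from the hypercube described by bounds."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds.values()) for _ in range(n_samples)]


def build_inputs(maturities, samples):
    """Cartesian product: every parameter sample is paired with every maturity."""
    return [(t, *params) for params, t in product(samples, maturities)]


maturities = maturity_grid()
demo = build_inputs(maturities, sample_parameters(BS_BOUNDS, n_samples=100))
print(len(maturities), len(demo))  # 60 maturities, 6000 rows in this small demo
```

Iterating over samples in the outer loop keeps the 60 maturities of one parameter sample contiguous, which is convenient when the rows are later reshaped into sequences for the LSTM.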

Figure 6: Parameter Ranges Training Set

Figure 7: Data Set

5.2 Performing network topology selection

For the grid of MLPs we choose the range of the number of layers as \(\mathcal{N}_{L} = (2, 3, 4)\) and the original number of units for \(L=2\) layers as \(\mathcal{N}_{u} = (64, 128, 256, 512, 1024)\). For the grid of LSTMs we choose the original number of units such that the total number of trainable weights in the first column of the LSTM grid is the same as in the MLP grid. This ensures that the degrees of freedom of the LSTMs are comparable to those of the MLPs. The resulting graph of weights is shown in Fig. 8. The key insight here is that while their shape looks the same, as expected, the order of magnitude of the original number of units is much lower for the LSTM than for the MLP. The reason is that the additional complexities of the LSTM, recall Definition 1.3, mean that for the same number of units, the LSTM has many more trainable weights than the MLP. Thus, in order to achieve the same number of trainable weights as the MLP, the LSTM has to be instantiated with a much lower number of units.

Figure 8: Number of trainable weights

We train both models on the above data set with an 80%/20% split into train/test data with random shuffling. The thresholds are set to \(t_{b} := t_{v} := 0.25\) and the maximal number of epochs is \(e_{\max} := 50\). We choose the mean squared error (MSE) as the loss function and the mean absolute error (MAE) as our metric. The resulting learning curves with the bias and the variance are shown in Fig. 9 for the MLP. In this grid the number of layers increases in each column from left to right and the number of original units increases in each row from top to bottom. We find that the first column does not have enough layers to capture the non-linearity in the pricing function of the Black–Scholes or the Heston model. In the second column, the first row is already below the threshold, but only barely and not yet very stably after the 50 epochs, so we select the model below it in the second row. This model has \(L=3\) layers and just \(n_{u}=26\) units in the hidden layers (reduced from the original number of 128). This means that this model has learned the Black–Scholes and Heston pricing function with only \(n_{w} = 885\) trained weights after just 50 epochs.
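
The check against the thresholds can be sketched as follows. This is only an illustration under assumptions that are not taken from the text above: the bias is read as the training MAE, the variance as the gap between test and training MAE, and "stably below the threshold" is modelled as holding over the last `window` epochs. The function names, the `window` parameter and the toy curves are hypothetical.

```python
def bias_variance(train_curve, test_curve):
    """Per-epoch (bias, variance) pairs under the assumed convention."""
    return [(tr, te - tr) for tr, te in zip(train_curve, test_curve)]


def is_stably_below(train_curve, test_curve, t_b=0.25, t_v=0.25, window=5):
    """True if bias and variance stay within the thresholds over the last `window` epochs."""
    tail = bias_variance(train_curve, test_curve)[-window:]
    return all(b <= t_b and v <= t_v for b, v in tail)


# Toy learning curves (made-up numbers for illustration, not results from the paper).
barely = {"train": [0.40, 0.27, 0.24, 0.26, 0.24], "test": [0.45, 0.30, 0.27, 0.28, 0.26]}
stable = {"train": [0.35, 0.20, 0.15, 0.12, 0.10], "test": [0.40, 0.24, 0.18, 0.15, 0.12]}

print(is_stably_below(barely["train"], barely["test"], window=3))  # False
print(is_stably_below(stable["train"], stable["test"], window=3))  # True
```

A model that dips below the threshold only in its final epoch is rejected by this check, which mirrors the reasoning for preferring the second row over the first.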

Figure 9: Learning Curves MLP

The learning curves for the corresponding LSTMs are shown in Fig. 10. We find that they are significantly worse than those of the MLP. Only for the very last model, with \(L=4\) layers and \(n_{u}=17\) units per hidden layer (reduced from an original number of 1024), are the learning curves just at the threshold, so we select this one. Despite having \(n_{w} = 6274\) trained weights in total, it still performs worse than the MLP we selected above (which has fewer than one seventh of the trained weights). The conclusion from this is not that LSTMs cannot learn the Black–Scholes or Heston pricing function, but rather that the simplicity of the MLP network topology is better adapted to this specific problem. The complexity of the LSTM network topology is adapted to a situation that does not occur here, and thus it cannot achieve the same performance as the MLP when constrained by the same number of trainable weights.
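
The weight counts quoted above can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated in this section: L counts the hidden layers plus one dense output layer, the MLP receives five inputs (four Black–Scholes parameters plus the maturity), the LSTM receives four features per time step (the maturity being the sequence axis), and each LSTM layer is a standard cell with four gates and no peephole connections. The function names are illustrative.

```python
def mlp_weights(n_inputs, n_units, n_layers, n_outputs=1):
    """Weights and biases of an MLP with n_layers - 1 hidden layers plus an output layer."""
    sizes = [n_inputs] + [n_units] * (n_layers - 1) + [n_outputs]
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


def lstm_weights(n_inputs, n_units, n_layers, n_outputs=1):
    """Same layout with LSTM hidden layers: four gates with input, recurrent and bias weights."""
    total, fan_in = 0, n_inputs
    for _ in range(n_layers - 1):
        total += 4 * (n_units * (fan_in + n_units) + n_units)
        fan_in = n_units
    return total + (fan_in + 1) * n_outputs  # dense output layer


print(mlp_weights(n_inputs=5, n_units=26, n_layers=3))   # 885
print(lstm_weights(n_inputs=4, n_units=17, n_layers=4))  # 6274
```

Under these assumptions, 885 corresponds to two hidden layers of 26 units, and 6274 to three stacked LSTM layers of 17 units. The two-layer MLP with the original 128 units has 897 weights, so the reduced three-layer network stays within that budget.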

Figure 10: Learning Curves LSTM

5.3 Error distribution

While the mean absolute error (MAE) is a good metric for the learning curve, it is often not sufficient for financial applications. Therefore, we study the whole error distribution for the MLP and the LSTM selected above, see Fig. 11. Unsurprisingly, the LSTM has a higher and wider error distribution, indicating that it has learned the pricing functions less well than the MLP. For the MLP it is interesting to note that while the learning curves in Fig. 9 suggest that the mean error for the Black–Scholes and the Heston model is similar, Fig. 11 shows that the Heston error distribution is wider than the Black–Scholes one. This shows that it is ‘harder’ for the network to learn the Heston pricing function, which is not surprising given the higher dimensionality of the data set.

Figure 11: Distribution of Error

Both impressions are also confirmed by Fig. 12, where we compute some statistics of the error distribution. For the MLP, the high percentiles and even the maximum of the error are still <1, meaning that in cases where the MLP error is worse than the mean, this network fails ‘gracefully’. The maximum error for the LSTM is much worse. For the MLP, the 95th and 99th percentiles of the error as well as the maximum error are slightly higher for Heston than for Black–Scholes.
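
Statistics of this kind can be computed with the Python standard library alone. The sketch below assumes that the error is the absolute difference between predicted and reference prices; the function name and the returned keys are illustrative.

```python
from statistics import mean, quantiles


def error_statistics(y_true, y_pred):
    """Mean, 95th and 99th percentile and maximum of the absolute prediction error."""
    abs_err = [abs(p - t) for t, p in zip(y_true, y_pred)]
    pct = quantiles(abs_err, n=100, method="inclusive")  # 99 percentile cut points
    return {"mean": mean(abs_err), "p95": pct[94], "p99": pct[98], "max": max(abs_err)}


# Synthetic check: errors 0.00, 0.01, ..., 1.00
stats = error_statistics([0.0] * 101, [i / 100 for i in range(101)])
print(stats)
```

Reporting the upper percentiles and the maximum next to the mean makes it visible whether a network fails gracefully or produces a few very large errors.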

Figure 12: Error Statistics

5.4 Generalization to other models

We have illustrated the network topology selection method for parametric models trained on synthetic data in a given range, as this flavour is currently the most popular. It is well known that neural networks usually do not perform very well in regions outside the training set, i.e. when they extrapolate. Therefore, we suggest taking that into account when generating the training set. The fact that this is possible is one of the big advantages of working with synthetic data (rather than real world or historic data). In practice, we recommend an automated bound checking of the new inputs supplied to the network that is consistent with the bounds used in training. Even more care has to be taken when training non-parametric models on historic data, e.g. a VaR model on historic market data shifts. The high dimensionality of the input and the non-linearity of the PnL might require a topology so big that, given the limited availability of historic data, it might simply not be trainable to a bias below an acceptable threshold. This will also make it impossible to reduce the variance to a satisfactory level. If the network is trained on a decade in which markets are calm, it will also easily breach the ranges in which it was trained when used in a time of market stress, causing extrapolation failures.
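
The recommended bound checking can be sketched as a thin guard around the prediction call. The bounds in `TRAINING_BOUNDS`, the function names and the `predict` callable are hypothetical; in practice the bounds would be stored together with the trained model so that training and prediction stay consistent.

```python
# Hypothetical training bounds; in practice these are saved alongside the trained model.
TRAINING_BOUNDS = {"S0": (50.0, 150.0), "K": (50.0, 150.0), "sigma": (0.05, 0.5), "T": (0.25, 5.0)}


def check_bounds(inputs, bounds):
    """Return the names of inputs that are missing or lie outside the training range."""
    violations = []
    for name, (lo, hi) in bounds.items():
        value = inputs.get(name)
        if value is None or not lo <= value <= hi:
            violations.append(name)
    return violations


def guarded_predict(predict, inputs, bounds=TRAINING_BOUNDS):
    """Call predict only if all inputs lie inside the training hypercube."""
    violations = check_bounds(inputs, bounds)
    if violations:
        raise ValueError(f"inputs outside training range: {violations}")
    return predict(inputs)


print(check_bounds({"S0": 100.0, "K": 100.0, "sigma": 0.9, "T": 1.0}, TRAINING_BOUNDS))  # ['sigma']
```

Raising an error instead of silently extrapolating forces the caller to fall back to a classical pricer for inputs the network has never seen.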

6 Conclusions

We conclude that the SR 11-7 requirement to conduct model validation, in particular a thorough model selection process, can be satisfied for neural network models as well. The simple grid based network topology selection method is a pragmatic way of producing a good and documented choice of hyperparameters for a given financial application.

As a byproduct we obtain interesting insights into how neural networks learn financial models. Given that evaluating a network is very fast, this makes feasible the use of pricing models that would otherwise be too computationally expensive. We also find that while option prices (or implied volatilities) clearly have a time dependence, LSTMs are overly complicated for this application and the much simpler MLPs perform better in a fair comparison – a textbook case of Occam’s razor.

Availability of data and materials

Data and an illustrating notebook are available on GitHub:


  1. See Sect. 1 for detailed definitions.

  2. We illustrate our method on MLPs and LSTMs, but it can in general be applied to other topologies as well like CNNs.

  3. The sigmoid function takes values in \([0,1]\). Thus, choosing a sigmoid activation in the output layer can vastly decrease the accuracy of the network, in particular if the function to be learned takes unbounded values in \(\operatorname{\mathbb{R}}\).

  4. Here, we assume that all vectors are row vectors, all sequences are initialized with zero, • denotes the usual matrix-vector multiplication, \(\odot \) denotes the element-wise multiplication of vectors (Hadamard product) and the application of a function \(\operatorname{\mathbb{R}}\to \operatorname{\mathbb{R}}\), e.g. σ and τ, to a vector is performed element-wise.

  5. In case all models are crossed out, the number of units or layers or the number of training samples or the number of epochs needs to be increased to yield a meaningful result.

  6. They are not optimal in a mathematical sense as a grid obviously only tests this on a finite number of candidate models. However, if the grid is fine enough, this is sufficient for practical model validation purposes.

  7. See

  8. See


  1. Abadi M, Agarwal A, Barham P et al. TensorFlow: large-scale machine learning on heterogeneous systems. 2015.

  2. Board of Governors of the Federal Reserve System/Office of the Comptroller of the Currency. Supervisory Guidance on Model Risk Management. 2010.

  3. Bayer C, Stemper B. Deep calibration of rough stochastic volatility models. 2018.

  4. Black F, Scholes M. The pricing of options and corporate liabilities. J Polit Econ. 1973;81(3):637–54.

  5. Buhler H, Gonon L, Teichmann J, Wood B. Deep Hedging. 2019.

  6. Chollet F, et al. Keras. 2015.

  7. Crisostomo R. An analysis of the Heston stochastic volatility model: implementation and calibration using Matlab. 2014.

  8. Cybenko G. Approximation by superpositions of a sigmoidal function. Math Control Signals Syst. 1989;2(4):303–14. ISSN 0932-4194.

  9. Ern A, Guermond J. Theory and practice of finite elements. Applied mathematical sciences. New York: Springer; 2004. ISBN 9780387205748.

  10. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer series in statistics. New York: Springer; 2001.

  11. Haykin S. Neural networks: a comprehensive foundation. 2nd ed. New York: Prentice Hall; 1998. ISBN 0132733501.

  12. Hernandez A. Model Calibration with Neural Networks. 2016.

  13. Heston SL. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Rev Financ Stud. 1993;6(2):327–43. ISSN 0893-9454.

  14. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80.

  15. Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991;4(2):251–7.

  16. Horvath B, Muguruza A, Tomas M. Deep Learning Volatility. 2019.

  17. Jin H, Song Q, Hu X. Efficient neural architecture search with network morphism. In: CoRR. 2018.

  18. Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. 1412.6980 [cs.LG].

  19. Liu S, Oosterlee CW, Bohte SM. Pricing options and computing implied volatilities using neural networks. 2019.

  20. Muller A, Guido S. Introduction to machine learning with python: a guide for data scientists. Sebastopol: O’Reilly Media; 2018. ISBN 9789352134571.

  21. Murphy KP. Machine learning: a probabilistic perspective. Cambridge: MIT Press; 2013. ISBN 9780262018029.

  22. Rudin W. Functional analysis. International series in pure and applied mathematics. New York: McGraw-Hill; 1991. ISBN 9780070542365.

  23. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.

  24. Timan A. Theory of approximation of functions of a real variable. Dover books on advanced mathematics. New York: Dover; 1994. ISBN 9780486678306.


We would like to thank Gordon Lee for interesting discussions and feedback. We would also like to thank the reviewer for constructive comments.


Open Access funding enabled and organized by Projekt DEAL.

Author information

All authors contributed equally to the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jörg Kienitz.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Nowaczyk, N., Kienitz, J., Acar, S.K. et al. How deep is your model? Network topology selection from a model validation perspective. J.Math.Industry 12, 1 (2022).
