How deep is your model? Network topology selection from a model validation perspective

Deep learning is a powerful tool, which is becoming increasingly popular in financial modeling. However, model validation requirements such as SR 11-7 pose a significant obstacle to the deployment of neural networks in a bank’s production system. Their typically high number of (hyper-)parameters poses a particular challenge to model selection, benchmarking and documentation. We present a simple grid based method together with an open source implementation and show how this pragmatically satisfies model validation requirements. We illustrate the method by learning the option pricing formula in the Black–Scholes and the Heston model.

example, can the problem be learned best with a multilayer perceptron (MLP) or a long-term-short-term memory network (LSTM)? After a type of topology has been selected, for example MLP, the precise shape of that topology still needs to be determined and justified, e.g. the number of layers and neurons as well as the activation functions need to be specified as hyperparameters. In a last step, the network's weights have to be chosen. This last step can be performed automatically by stochastic gradient descent methods such as Adam, see [18], but those often require hyperparameters as well, e.g. the learning rate.
ANNs are already widely used in many areas and the problem of choosing the type of network topology and the number of layers and units occurs everywhere and not just in quantitative finance.
Typical approaches to this problem are:
1. Arbitrary choice: The model developer simply makes a choice and plays around with it until the results are satisfactory. Alternative choices are not documented or systematically evaluated.
2. Automatic Machine Learning: There are attempts to automate the process of finding supervised machine learning solutions for a given problem and data set using unsupervised machine learning techniques; those are dubbed AutoML.
3. Functional Analysis: From a mathematical perspective, a neural network is a method to approximate a non-linear function. All possible choices of network topologies and weights constitute a space of approximation functions and thus existing results from approximation theory can be utilized.
The first method is of course straightforward and quite often successfully used in practice. However, it is not always successful and it is certainly not compliant with any model governance framework. For example, the SR 11-7 guidelines clearly state that "a sound development process will produce documented evidence in support of all model choices", see [2,Sect. V.1]. The first method clearly violates that requirement and thus cannot be used for financial models in production.
While the idea of the second method, i.e. using unsupervised learning to automatically choose the topology for a supervised learning problem, is very appealing, there are two problems to consider: First, automatic machine learning is unsurprisingly a much more difficult problem than just one supervised learning problem and it is not always feasible to apply it in practice. The second issue is that from a model validation perspective, one tries to shed light into a blackbox with another blackbox. AutoML, in particular if used in a proprietary version, can only be used to validate a bank's machine learning solution to a specific problem after the AutoML engine that was used has itself successfully completed a model validation process. That might be possible, but requires a lot of additional resources, potentially much more than justifying the neural network in question directly. Also, it remains to be seen if a regulator would ever sign off on such a blank check. One should however remark that open source initiatives such as [17] are currently improving transparency and efficacy of such automatic machine learning solutions.
The third method has the advantage that a lot of literature and research already exists in the area of function approximation, see for example [9,22,24]. Specifically for neural networks, there is the famous universal approximation theorem, see [8,11,15]. This literature is very helpful in providing sound methodological justifications and theoretical

Artificial neural networks (ANN)
In this section, we establish the notation for two of the most common types of artificial neural networks: the multilayer perceptron (MLP) and the long-term-short-term memory network (LSTM). It should be highlighted that the notation, the mathematical formalization and also the implementation vary slightly throughout the literature, see for example [10,20,21]. We have chosen an approach here that is compatible with keras and tensorflow, see [1,6], as those frameworks are very common.

Multilayer perceptron (MLP)
The most common and most basic form of neural networks are multilayer perceptrons. A multilayer perceptron (MLP) consists of
• a number $n_i$ of inputs,
• a number $n_o$ of outputs,
• a number $n_L$ of layers and
• for each layer $1 \leq l \leq n_L$
  - a number $n_l$ of neurons (or units),
  - a matrix $A_l = (A_{l;ij}) \in \mathbb{R}^{n_{l-1} \times n_l}$ and a vector $b_l = (b_{l;i}) \in \mathbb{R}^{n_l}$ (called bias) of weights such that $n_0 = n_i$, $n_{n_L} = n_o$ and
  - an activation function $\sigma_l: \mathbb{R} \to \mathbb{R}$.
For any $1 \leq l \leq n_L$, the tuple $(A_l, b_l, \sigma_l)$ is called a layer. For $l = n_L$, the layer is called output layer and for $1 \leq l < n_L$, the layer is called hidden layer.
Neural networks can be visualized as in Fig. 1: This shows a network $(A_l, b_l, \sigma_l)_{1 \leq l \leq n_L}$ with a total of $n_L = 4$ layers, i.e. 3 layers are hidden. Notice that the input layer is just a visualization of the input and is not part of the actual network topology.
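Computing the output from the input is codified in the feed forward. For each $1 \leq l \leq n_L$, we define a function
$$F_l: \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}, \quad v \mapsto \sigma_l(A_l^{\top} v + b_l),$$
where we employ the convention that $\sigma_l$ is applied in every component. The composition
$$F := F_{n_L} \circ \dots \circ F_1: \mathbb{R}^{n_i} \to \mathbb{R}^{n_o}$$
is called the feed forward of the MLP. Any set of inputs $x \in \mathbb{R}^{n_i}$ is called an input layer.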
The links in Fig. 1 between the nodes visualize the feedforward and the dependence of each output on each input.
We assume here that the activation functions are chosen as the standard sigmoid, $\sigma_l(x) := (1 + e^{-x})^{-1}$, for all but the last layer, where we choose the linear activation $\sigma_{n_L}(x) = x$.
The feed forward $F = F_\Theta$ depends on all the parameters $\Theta = (A_l, b_l)_{1 \leq l \leq n_L}$. In general, the function $F_\Theta$ will be unrelated to the problem if the weights are not chosen carefully. Choosing the weights is performed by training the neural network with a training set $(x_i, y_i)_{1 \leq i \leq N}$. More precisely, for a given cost function, say least squares, this amounts to solving the optimization problem
$$\Theta^* = \operatorname{argmin}_{\Theta} \sum_{i=1}^{N} \| F_\Theta(x_i) - y_i \|^2.$$
In practice, this optimization is performed by stochastic gradient descent methods such as Adam, see [18], and a clever computation of the gradient, called backpropagation, see [23].
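The feed forward and the least squares cost can be sketched in a few lines of NumPy. This is only an illustration of the definitions above, not the implementation used for the numerical results; the function names `feed_forward` and `least_squares_cost` and the layer sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    """Standard sigmoid, applied in every component."""
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, layers):
    """Apply the layers (A, b, sigma) one after another.

    Each A has shape (n_{l-1}, n_l), so v @ A + b maps R^{n_{l-1}} to R^{n_l}.
    """
    v = np.asarray(x, dtype=float)
    for A, b, sigma in layers:
        v = sigma(v @ A + b)
    return v

def least_squares_cost(X, Y, layers):
    """Sum of squared differences between predictions and targets."""
    preds = np.array([feed_forward(x, layers) for x in X])
    return float(np.sum((preds - Y) ** 2))

# Illustrative sizes (assumptions): 5 inputs, one hidden layer of 3 units, 1 output.
rng = np.random.default_rng(0)
n_i, n_u, n_o = 5, 3, 1
layers = [
    (rng.normal(size=(n_i, n_u)), np.zeros(n_u), sigmoid),      # hidden layer
    (rng.normal(size=(n_u, n_o)), np.zeros(n_o), lambda z: z),  # linear output layer
]
y = feed_forward(np.ones(n_i), layers)
```

Training would then minimize `least_squares_cost` over the entries of all `A` and `b`, which in practice is left to a framework such as keras.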

Long-Term-Short-Term-Memory Network (LSTM)
While MLPs work very well in many situations, there are certain applications for which that network topology is sometimes not ideal. MLPs are built to make a prediction $y$ given one input $x$. But some applications have a canonical time structure and the task is to make time-dependent predictions $y_t$ from time-dependent inputs $x_t$. In principle, this case can of course be covered by MLPs as well, for example by adding a time grid to the inputs and the outputs, i.e. to learn $(y_{t_1}, \dots, y_{t_n})$ from $(x_{t_1}, \dots, x_{t_n}, t_1, \dots, t_n)$. This has the advantage that it is straightforward to implement, but the disadvantage that the network potentially has to be quite large. Another option is to train a network separately for each point $t_i$. That has the disadvantage that there is no flow of information between the networks of different points in time $t_i$ and the prediction of this sequence of networks might suffer from inconsistencies. A key application that has driven the research in that area is Natural Language Processing (NLP), where language is seen as a sequence of words.
A known way out of these technical problems is offered by long-term-short-term memory networks as suggested in [14]. The idea depicted in Fig. 2 is as follows: Only one neural network, i.e. with one fixed set of weights, is trained, but in addition to the input, the network also processes the cell state. This additional piece of information is transmitted through the network and serves as a memory of previous predictions.
Formally, an LSTM can be defined by first defining a single LSTM layer, LSTML, and the associated feedforward and then stacking multiple of those together to an LSTM. This definition needs to be understood in the context of the associated feedforward. The feedforward of an LSTML is more complex than for MLPs because of the time dependence: two sequences $c_t$ and $y_t$, $t = 1, \dots, T$, called cell state and carry state, are defined recursively from an input value $i_t$, a forget value $f_t$, a candidate cell state $\tilde{c}_t$ and a candidate output $o_t$.
The interpretation of this is as follows (see Fig. 3): The input value $i_t$ and forget value $f_t$ represent how much weight is put by the network on the current input $x_t$ and how much weight is put on forgetting the past memory. Then, a candidate cell state $\tilde{c}_t$ is computed on the basis of only the current input $x_t$ and the last prediction $y_{t-1}$. The new cell state $c_t$ is then computed as a weighted average between the candidate cell state $\tilde{c}_t$ and the previous cell state $c_{t-1}$, where the weights are given by the input weight $i_t$ and the forget weight $f_t$. Finally, the carry $y_t$, i.e. the intermediate output at $t$, is computed by first computing a candidate output $o_t$, which is also based only on the current input $x_t$ and the last prediction $y_{t-1}$, and then weighing $o_t$ with the new cell state $c_t$. It should be noted that depending on the application, one might either consider the value $y_T$ as the feedforward of the network or the vector $(y_1, \dots, y_T)$.
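The recursion described above can be sketched in NumPy. The gate equations are not reproduced in this text, so the sketch assumes the standard formulation that keras also uses (sigmoid for the input, forget and output values, tanh for the candidate cell state and for the carry); the names `lstm_layer`, `W`, `U` and `b` are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(xs, W, U, b):
    """Feedforward of a single LSTM layer (assumed standard formulation).

    xs: inputs of shape (T, k); W: (k, 4m); U: (m, 4m); b: (4m,).
    Returns the carries y_1, ..., y_T as an array of shape (T, m).
    """
    m = U.shape[0]
    c = np.zeros(m)  # cell state c_0
    y = np.zeros(m)  # carry y_0
    ys = []
    for x in xs:
        z = x @ W + y @ U + b               # four single-layer computations at once
        i = sigmoid(z[:m])                  # input value i_t
        f = sigmoid(z[m:2 * m])             # forget value f_t
        c_tilde = np.tanh(z[2 * m:3 * m])   # candidate cell state
        o = sigmoid(z[3 * m:])              # candidate output o_t
        c = f * c + i * c_tilde             # new cell state c_t
        y = o * np.tanh(c)                  # carry y_t
        ys.append(y)
    return np.array(ys)

# Illustrative sizes (assumptions): T = 60 time steps, k = 4 features, m = 17 units.
T, k, m = 60, 4, 17
rng = np.random.default_rng(0)
W = rng.normal(size=(k, 4 * m))
U = rng.normal(size=(m, 4 * m))
b = np.zeros(4 * m)
ys = lstm_layer(rng.normal(size=(T, k)), W, U, b)
n_params = W.size + U.size + b.size
```

Counting the entries of `W`, `U` and `b` gives $4m^2 + 4(k+1)m$ weights for one layer under this formulation.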
Practical applications rarely consist of a single LSTML for various reasons. First, as we can see in Definition 1.4, each step of the computation, i.e. the input, forget, output and candidate cell state, is equivalent to a feedforward of an MLP with just a single layer. That means that a single LSTML cannot capture arbitrary non-linearities with these computations. Also, notice that if a network consists of just an LSTML, then the number $k$ of features is forced to be $k = n_i$, i.e. the number of inputs, and the number $m$ of units is forced to be $m = n_o$, i.e. the number of outputs. That means that no parameters can be changed to adapt the network to the problem. The solution to both of these problems is to chain multiple LSTMLs in sequence, where the number $m$ of units can be chosen at will, followed by a single MLP layer to ensure that the last output has the shape required by $n_o$.

Network topology selection
In this section we introduce a method to select a network topology that is consistent with established model validation practices.
Assume we want to train an MLP on a financial problem such as pricing. The key model parameters to choose are the number of layers $n_L$ and the number of units $n_u$ in each layer (we are assuming that we want to choose the same number in each layer). A very simple way to obtain a documented choice for that is to not pick one arbitrary parameter vector $(n_u, n_L)$, but to specify a grid of those parameters and build an MLP for each of them, see Fig. 4 for an example. The idea is to then train all of those models with the same data, analyze their learning curves and the performance of the trained models and then select the least complex model whose performance is within thresholds acceptable for the business case. As discussed in Sect. 3, this simple procedure already satisfies many key requirements of model validation, but there is one catch.
A key question, which is not only interesting from a model validation perspective, is: For the given learning problem, is it better to increase model performance by increasing the number of layers or by increasing the number of units? The simplest choice is a grid $G = (g_{ij})$ of parameters $g_{ij} = (n_{u_i}, n_{L_j})$ that is a cartesian product of a vector $(n_{u_i})$ of possible numbers of units and a vector $(n_{L_j})$ of possible numbers of layers.
In order to obtain a meaningful answer to this question, one has to consider the degrees of freedom of the model. For a neural network, these are exactly the number $n_w$ of trainable weights. It is good model selection practice to increase the degrees of freedom of a model slowly from below to obtain as many as needed to be accurate, but no more than necessary to avoid overfitting. Applied to MLPs, this means that we actually want a grid where along one (and only one) dimension we can increase the degrees of freedom and along the other we can change the topology of the network while keeping the degrees of freedom fixed. This is not achieved by a grid $G$ that is simply a cartesian product of number of units and number of layers, because both increase the degrees of freedom and they do so in a very different way, see Fig. 5(a).
An easy way to fix this is to build the grid by first fixing the smallest candidate for the number of layers, say $n_{L_1} = 2$, and then filling the first column of the grid with $g_{i1} = (n_{u_i}, n_{L_1})$, where $(n_{u_i})$ is the vector of candidate numbers of units. In a second step, we then compute the resulting degrees of freedom $n_w$ for each row in the first column. In a third step, we fill the other columns by gradually increasing the number of layers, but simultaneously reducing the number of units to keep the degrees of freedom in each row constant. In this way, the row axis increases the degrees of freedom (via increasing the number of units) and along the column axis we obtain models with the same degrees of freedom, but different network topologies. That is exactly what we want.
To carry out this program in detail, we need to calculate the degrees of freedom, i.e. the number of trainable weights of the network, see below, and formalize the above in an algorithm, see Algorithm 2.2.
For an MLP $NN = (A_l, b_l, \sigma_l)_{1 \leq l \leq n_L}$, calculating the number of trainable weights amounts to the following: Given that any layer has a matrix $A_l \in \mathbb{R}^{n_{l-1} \times n_l}$ and a bias $b_l \in \mathbb{R}^{n_l}$, this yields $n_{l-1} n_l + n_l = n_l (n_{l-1} + 1)$ trainable weights per layer. Taking into account that $n_0 = n_i$ and $n_{n_L} = n_o$, i.e. the dimensions of the input and the output layers are fixed, the total number $n_w$ of trainable weights is given by
$$n_w = \sum_{l=1}^{n_L} n_l (n_{l-1} + 1)$$
and for $n_L \geq 3$
$$n_w = n_1 (n_i + 1) + n_o (n_{n_L - 1} + 1) + \sum_{l=2}^{n_L - 1} n_l (n_{l-1} + 1). \qquad (2)$$
While in theory it is perfectly possible to choose a different number of units for every hidden layer, in practice one often uses the following.

Assumption 2.1
The number of units $n_u$ is the same in each hidden layer.
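In that case Eq. (2) simplifies to
$$n_w = n_u (n_i + 1) + n_o (n_u + 1) + (n_L - 2)\, n_u (n_u + 1), \qquad (3)$$
which requires two choices, namely $n_L$, the number of layers, and $n_u$, the number of units per layer. In Fig. 5(a) we plot this function for an example. We see that, in accordance with Eq. (3), the number of weights grows linearly in the number of layers and quadratically in the number of units. To keep $n_w$ fixed while changing $n_L$, we solve Eq. (3) for $n_u$. For $n_L > 2$ this is the quadratic equation
$$(n_L - 2)\, n_u^2 + (n_i + n_o + n_L - 1)\, n_u + n_o - n_w = 0, \qquad (4)$$
which is easily solved by setting
$$n_u = \frac{-(n_i + n_o + n_L - 1) + \sqrt{(n_i + n_o + n_L - 1)^2 + 4 (n_L - 2)(n_w - n_o)}}{2 (n_L - 2)}. \qquad (5)$$
Of course in practice one has to take the floor $\lfloor n_u \rfloor$ (or a rounding) to enforce an integer number of units. Using this we can keep the number of weights approximately constant when increasing the number of layers. This is illustrated in an example in Fig. 5(b). Keeping the number of weights constant when increasing the layers allows us to control the degrees of freedom with a single variable, resulting in a more meaningful comparison between various network topologies. This leaves us with the following method of determining a good network topology for any given problem.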
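The weight count and its inversion can be sketched as follows. The function names are hypothetical, and the values $n_i = 5$ and $n_o = 1$ are assumptions chosen for illustration (with them, 128 original units in 2 layers reduce to 26 units in 3 layers with 885 weights, which matches the numbers quoted in the application section).

```python
import math

def mlp_num_weights(n_i, n_o, n_u, n_L):
    """Trainable weights of an MLP with n_L >= 2 layers and n_u units per hidden layer."""
    if n_L < 2:
        raise ValueError("at least one hidden layer is required")
    return n_u * (n_i + 1) + n_o * (n_u + 1) + (n_L - 2) * n_u * (n_u + 1)

def units_for_weights(n_w, n_i, n_o, n_L):
    """Largest integer n_u such that the weight count does not exceed n_w."""
    b = n_i + n_o + n_L - 1
    if n_L == 2:
        # The quadratic term vanishes, so the equation is linear in n_u.
        return (n_w - n_o) // b
    a = n_L - 2
    root = (-b + math.sqrt(b * b + 4 * a * (n_w - n_o))) / (2 * a)
    return math.floor(root)

def build_grid(units, layers, n_i, n_o):
    """Rows: increasing degrees of freedom. Columns: same n_w, more layers."""
    grid = []
    for u in units:
        n_w = mlp_num_weights(n_i, n_o, u, layers[0])
        row = [(u, layers[0])]
        row += [(units_for_weights(n_w, n_i, n_o, n_L), n_L) for n_L in layers[1:]]
        grid.append(row)
    return grid

# Assumed dimensions for illustration: 5 inputs, 1 output.
grid = build_grid(units=[64, 128], layers=[2, 3, 4], n_i=5, n_o=1)
```

Because of the floor, the weight count in a row is only approximately constant: in the second row it is 897, 885 and 894 for 2, 3 and 4 layers.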

Algorithm 2.2 (Network topology selection)
Input:
(i) An artificial neural network NN.
(ii) A training set $(x, y)$.
(iii) A range of numbers of layers $N_L = (n_{L_1}, \dots, n_{L_r})$.
(iv) A range of numbers of original units $N_u = (u_1, \dots, u_s)$.
(v) A bias threshold $t_b$ and a variance threshold $t_v$ together with metrics for both (e.g. MSE).
(vi) A number $e_{\max}$ of maximal epochs.
Steps:
(i) Create a grid $G = (g_{ij})_{1 \leq i \leq s, 1 \leq j \leq r}$ of tuples $g_{ij}$ as follows: For the first number $n_{L_1}$ of layers initialize $g_{i1} := (u_i, n_{L_1})$, i.e. use the original number of units and $n_{L_1}$ layers. For any $j > 1$, set $g_{ij} := (\tilde{u}_{ij}, n_{L_j})$, where $\tilde{u}_{ij}$ is the solution of Eq. (5) for $n_L = n_{L_j}$ with $n_w$ set to the number of weights resulting from $g_{i1}$. This results in a grid where the degrees of freedom increase when going down a row, but stay constant when going right a column, cf. Fig. 4.
(ii) For each network NN resulting from the parameters in the grid $G$, train the network with $(x, y)$ until the bias and the variance are below the thresholds $t_b$ and $t_v$ (or the maximum number of epochs $e_{\max}$ is reached). This results in a grid of trained models, see Sect. 5.2 for an example.
(iii) Cross out all networks on the grid for which the bias and the variance are not below the given thresholds.
(iv) Amongst the remaining networks, find the smallest number $n_L$ of layers for which there exists a number of units $n_u$ such that the model $(n_u, n_L)$ has not been crossed out. Amongst those, choose the one with the smallest number $n_u$ of units.
Output: A number $(n_u, n_L)$ of units and layers for the network NN such that the bias and the variance of the network on $(x, y)$ are within the thresholds and the numbers $(n_u, n_L)$ are optimal within the given range.
Optionally, one can create a second grid to compare how the MLP performs against the LSTM. One only has to determine the number of weights in an LSTM as well and choose the number of units in the original LSTMs such that the resulting degrees of freedom are approximately the same as in the reference MLPs.
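Steps (iii) and (iv) of the algorithm can be sketched as a small selection function. The name `select_topology`, the dictionary layout and all bias and variance values below are hypothetical and only serve to illustrate the rule; they are not results from the numerical experiments.

```python
def select_topology(results, t_b, t_v):
    """Pick (n_u, n_L) from a grid of trained models.

    results maps (n_u, n_L) to the measured (bias, variance) after training.
    Returns None if every model on the grid has been crossed out.
    """
    # Step (iii): keep only models with both bias and variance below threshold.
    admissible = [
        key for key, (bias, variance) in results.items()
        if bias < t_b and variance < t_v
    ]
    if not admissible:
        return None
    # Step (iv): smallest number of layers first, then smallest number of units.
    n_L = min(layers for _, layers in admissible)
    n_u = min(units for units, layers in admissible if layers == n_L)
    return (n_u, n_L)

# Made-up measurements for a 2 x 3 grid (illustration only).
results = {
    (64, 2): (0.90, 0.90), (17, 3): (0.24, 0.24), (12, 4): (0.20, 0.20),
    (128, 2): (0.80, 0.70), (26, 3): (0.10, 0.12), (19, 4): (0.10, 0.10),
}
choice = select_topology(results, t_b=0.25, t_v=0.25)
```

The strict rule returns the first admissible model; a reviewer may still prefer a neighbouring model when the learning curve of the selected one is only barely and not yet stably below the threshold.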
It follows from Definition 1.3 that the number of weights in a single LSTM layer is given by $4m^2 + 4(k + 1)m$ and thus by Definition 1.5, we obtain that the total number $n_w$ of weights in the LSTM is given by
$$n_w = 4 n_u^2 + 4 (n_i + 1) n_u + (n_L - 1)\big(4 n_u^2 + 4 (n_u + 1) n_u\big) + n_o (n_u + 1),$$
where $n_L$ is the number of layers, $n_u$ is the number of units in each layer and $n_i$ and $n_o$ are the number of inputs and outputs.
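The LSTM weight count can be sketched analogously. The sketch assumes that $n_L$ counts the stacked LSTM layers and that one final MLP layer maps the last $n_u$ units to the $n_o$ outputs; the function name and the choice $n_i = 4$, $n_o = 1$ are assumptions made for illustration (with them, 3 layers of 17 units give the 6274 weights quoted in the application section).

```python
def lstm_num_weights(n_i, n_o, n_u, n_L):
    """Trainable weights of n_L stacked LSTM layers followed by one MLP layer."""
    def layer(k, m):
        # A single LSTM layer with k features and m units.
        return 4 * m * m + 4 * (k + 1) * m

    first = layer(n_i, n_u)               # sees the n_i input features
    rest = (n_L - 1) * layer(n_u, n_u)    # sees the n_u units of the previous layer
    output = n_o * (n_u + 1)              # final MLP layer with bias
    return first + rest + output

n_w_lstm = lstm_num_weights(n_i=4, n_o=1, n_u=17, n_L=3)
```

For the same number of units, this count is several times larger than that of an MLP, which is why the LSTM grid has to start from far fewer units to match the degrees of freedom of the reference MLPs.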

Fulfillment of model validation requirements
The network topology selection as discussed in Algorithm 2.2 serves to fulfill various model validation requirements as mandated by SR 11-7, see [2]. First of all we note that just because neural networks are not classical Monte Carlo simulations, this does not mean that they are in principle ineligible for production. In fact, SR 11-7 explicitly states that the "nature of testing and analysis will depend on the type of model and will be judged by different criteria depending on the context", [2, Sect. IV, p.6]. Thus, it is reasonable and compliant that testing neural networks looks a bit different from testing a Monte Carlo simulation.
Our approach to select a network topology addresses multiple SR 11-7 validation requirements simultaneously:
1. Documented Choice: "A sound development process will produce documented evidence in support of all model choices. Comparison to alternative theories and approaches should be included" ([2, Sect. V.I, p.11]). The learning curves of a grid of trained models resulting from the network topology selection method, see Fig. 9 for an example, serve as documented evidence that a "comparison with alternative theories and approaches" as a "fundamental component of a sound modeling process" ([2, Sect. IV, p.6]) has been conducted. This applies both to the number of layers and units within a network topology and to comparisons between multiple network topologies.
2. Because the learning curves include the variance as well as the bias, the requirement to consider both is satisfied. In fact, because the total number of degrees of freedom in the model is increased from below, this method automatically makes the model much less prone to overfitting.
One should highlight that the last point is particularly delicate for financial applications. Models in the financial domain that have hundreds or even thousands of parameters are typically met with great scepticism, as avoiding overfitting and instabilities in those models is no easy task. Because the machine learning community routinely deals with models that have many parameters, diagnostic methodological frameworks, a strong culture of cross-validation and standardized open source solutions have already made those models tractable.
We conclude by stressing that the proposed method of network topology selection is only one aspect of model selection, which in turn is only one aspect of model validation. Performing the suggested method of network topology selection does not exonerate the user from performing the full set of validation tasks mandated by SR 11-7.

Technical implementation
An advantage of the network topology selection method (Algorithm 2.2) is that in theory, the implementation is straightforward. Any framework that can run one neural network can be used to run a grid of neural networks by using a loop. In practice, however, this results in various IT tasks such as managing the IDs of the models, loading and saving them, ensuring that their training parameters are consistent etc. The numerical simulation in Sect. 5 has been performed using the popular keras models, see [6]. We have isolated the part of the code that wraps the models into keras_grid, an open source module, which conveniently solves these problems.

Application to quant finance models
In this section we apply the network topology selection method (Algorithm 2.2) to the problem of learning the option pricing function of the Black-Scholes model and the Heston model. In both cases we compare the topologies resulting from the MLP and the LSTM. We find that even though the prices $C(T, K)$ of call options clearly have a time dependence $T$, the MLP is actually much better suited to learn them than the LSTM. This is plausible as the problem of managing complex long-term-short-term memory does not really occur for Markovian paths generated by the Black-Scholes or Heston model. The results are described below and can also be explored interactively in a Jupyter notebook.

Black-Scholes & Heston model
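The Black-Scholes model, see [4], assumes that the stock price $S_t$ is a stochastic process on a probability space $(\Omega, \mathcal{F}, \mathbb{Q})$ (where we think of $\mathbb{Q}$ as the risk-neutral measure) satisfying
$$dS_t = r S_t\, dt + \sigma S_t\, dW_t,$$
where $r \in \mathbb{R}$ is a fixed risk-free rate, $\sigma > 0$ is the volatility and the process $W_t$ is a Brownian motion. We denote by $\mathbb{F} = (\mathcal{F}_t)_{t \geq 0}$ the augmented filtration generated by $W_t$. Under these assumptions, a European call option with expiry at $T$ and strike $K$, i.e. a derivative with payoff $(S_T - K)^+$, can be priced analytically with the famous Black-Scholes formula:
$$C(t, S_t) = S_t\, \Phi(d_1) - K e^{-r(T-t)}\, \Phi(d_2), \qquad (7)$$
where $\Phi$ denotes the cdf of the standard normal distribution and
$$d_1 = \frac{\ln(S_t / K) + (r + \sigma^2 / 2)(T - t)}{\sigma \sqrt{T - t}}, \qquad d_2 = d_1 - \sigma \sqrt{T - t}.$$
The Black-Scholes model rests on the assumption that the volatility is constant, which is arguably not realistic. The Heston model, see [13], belongs to the class of stochastic volatility models, which assume a stochastic dynamic not just for the stock price, but also for the volatility. It is defined by
$$dS_t = r S_t\, dt + \sqrt{\nu_t}\, S_t\, dW^S_t, \qquad d\nu_t = \kappa (\theta - \nu_t)\, dt + \xi \sqrt{\nu_t}\, dW^\nu_t, \qquad d\langle W^S, W^\nu \rangle_t = \rho\, dt,$$
where $r \in \mathbb{R}$ is the risk-free rate, $\kappa \in \mathbb{R}$ is the rate at which the stochastic variance $\nu_t$ reverts to the long-term mean $\theta > 0$, $\xi > 0$ is the volatility of the volatility and $\rho \in [-1, 1]$ is the correlation between the Brownian motions $W^S_t$, $W^\nu_t$. The option price in a Heston model can be computed via (see [7])
$$C = S_0 \Pi_1 - e^{-rT} K\, \Pi_2, \qquad (11)$$
where $\Pi_1$ and $\Pi_2$ are given as integrals over the characteristic function $\Psi = \Psi_{\ln(S_T)}$ of $\ln(S_T)$:
$$\Pi_1 = \frac{1}{2} + \frac{1}{\pi} \int_0^\infty \operatorname{Re}\left[ \frac{e^{-iu \ln K}\, \Psi(u - i)}{iu\, \Psi(-i)} \right] du, \qquad \Pi_2 = \frac{1}{2} + \frac{1}{\pi} \int_0^\infty \operatorname{Re}\left[ \frac{e^{-iu \ln K}\, \Psi(u)}{iu} \right] du.$$
We learn the Black-Scholes formula Eq. (7) for $t = 0$ as well as the Heston option pricing formula Eq. (11) with a neural network.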
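For reference, the Black-Scholes call price at $t = 0$ can be evaluated with the Python standard library alone. This is a minimal sketch; the function names and the example parameters are chosen here for illustration and are not taken from the data set described below.

```python
import math

def norm_cdf(x):
    """Cdf of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(S, K, r, sigma, tau):
    """Price of a European call with time to expiry tau = T - t > 0."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

# Illustrative parameters: at-the-money call, 5% rate, 20% volatility, 1Y expiry.
price = black_scholes_call(S=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)
```

A function of this kind provides the labels $y$ for the synthetic training set, while the sampled model parameters provide the inputs $x$.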
To that end we generate a data set as follows: We first define an evenly spaced grid of 60 maturities $T$ between 3M and 5Y. Second, for the other model parameters, we generate 10,000 samples uniformly distributed in the hypercube with the bounds specified in Fig. 6. Third, we take the cartesian product between the maturities and the samples and obtain a data set with 600,000 samples. The special treatment of the maturity as a parameter is required here to adhere to the input format of the LSTM. Notice that in a production environment, the specification of the bounds for the training set has to be in line with business requirements and careful input checking against these bounds has to be performed after training when predictions are made. An example of a price surface for both models is shown in Fig. 7.
The parameter grids of the MLP and the LSTM are shown in Fig. 8. The key insight here is that while their shape looks the same as expected, the order of magnitude of the original number of units is much lower for the LSTM than for the MLP. That is because all the additional complexities of the LSTM, recall Definition 1.3, mean that for the same number of units, the LSTM has many more trainable weights than the MLP. Thus, in order to achieve the same number of trainable weights as the MLP, the LSTM has to be instantiated with a much lower number of units.
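The three steps of the data generation can be sketched with NumPy. The parameter names and bounds below are placeholders, since the actual bounds are those of Fig. 6 and are not reproduced here; only the grid of 60 maturities, the 10,000 samples and the cartesian product follow the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: evenly spaced grid of 60 maturities between 3M and 5Y (in years).
maturities = np.linspace(0.25, 5.0, 60)

# Step 2: 10,000 uniform samples in a hypercube (hypothetical bounds).
bounds = {"S0": (50.0, 150.0), "K": (50.0, 150.0), "r": (0.0, 0.05), "sigma": (0.05, 0.5)}
low = np.array([lo for lo, _ in bounds.values()])
high = np.array([hi for _, hi in bounds.values()])
samples = rng.uniform(low, high, size=(10_000, len(bounds)))

# Step 3: cartesian product of samples and maturities.
params = np.repeat(samples, len(maturities), axis=0)     # each sample 60 times
T_col = np.tile(maturities, len(samples))[:, None]       # the grid, 10,000 times
X_mlp = np.hstack([params, T_col])                       # maturity as an extra input

# LSTM format: (samples, time steps, features) with maturity as the time axis.
X_lstm = params.reshape(len(samples), len(maturities), len(bounds))
```

Treating the maturity as the time axis is what makes the same data usable for both topologies: the MLP sees it as one more input column, the LSTM as the sequence dimension.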

Performing network topology selection
We train both models on the above data set with an 80%/20% split into train/test data with random shuffling. The thresholds are set to $t_b := t_v := 0.25$ and the maximal number of epochs is $e_{\max} := 50$. We choose the mean squared error (MSE) as a loss function and the mean absolute error (MAE) as our metric. The resulting learning curves with the bias and the variance are shown in Fig. 9 for the MLP. In this grid the number of layers increases in each column from left to right and the number of original units increases in each row from top to bottom. We find that the first column does not have enough layers to capture the non-linearity in the pricing function of the Black-Scholes or the Heston model. In the second column, however, the first row is already below the threshold, but only barely and not yet very stably after the 50 epochs, so we select the model below it in the second row. This model has $n_L = 3$ layers and just $n_u = 26$ units in the hidden layers (reduced from the original number of 128). This means that this model has learned the Black-Scholes and Heston pricing function with only $n_w = 885$ trained weights after just 50 epochs.
The learning curves for the corresponding LSTMs are shown in Fig. 10. We find that they are significantly worse than those of the MLP. Only for the very last model, with $n_L = 3$ layers and $n_u = 17$ units per hidden layer (reduced from an original number of 1024), are the learning curves just at the threshold, so we select this one. Despite having $n_w = 6274$ trained weights in total, it still performs worse than the MLP we selected above (which has about seven times fewer trained weights). The conclusion from this is not that LSTMs cannot learn the Black-Scholes or Heston pricing function, but rather that the simplicity of the MLP network topology is better adapted to this specific problem. The complexity of the LSTM network topology is adapted to a situation that does not occur here, and thus it cannot achieve the same performance as the MLP when constrained by the same number of trainable weights.

Error distribution
While the mean absolute error (MAE) is a good metric for the learning curve, it is often not sufficient for financial applications. Therefore, we study the whole error distribution for the MLP and the LSTM selected above, see Fig. 11. Unsurprisingly, the LSTM has a higher and wider error distribution, indicating that it has learned the pricing functions less well than the MLP. For the MLP it is interesting to note that while the learning curves in Fig. 9 suggest that the mean error for the Black-Scholes and the Heston model is similar, we see in Fig. 11 that the Heston error distribution is wider than the Black-Scholes one. This shows that it is 'harder' for the network to learn the Heston pricing function, which is not surprising given the higher dimensionality of the data set.
Both impressions are also confirmed by Fig. 12, where we compute some statistics of the error distribution. For the MLP, high percentiles and even the maximum of the error are still below 1, meaning that in cases where the MLP error is worse than the mean, this network fails 'gracefully', whereas the maximum error for the LSTM is much worse. For the MLP, the 95th and 99th percentile of the error as well as the maximum error are slightly higher for Heston than for Black-Scholes.
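Statistics of this kind can be computed from the absolute errors on the test set. The following is a minimal sketch; the function name, the selection of statistics and the toy data are assumptions for illustration and do not reproduce Fig. 12.

```python
import numpy as np

def error_statistics(y_true, y_pred):
    """Mean, high percentiles and maximum of the absolute prediction error."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return {
        "mean": float(err.mean()),
        "p95": float(np.percentile(err, 95)),
        "p99": float(np.percentile(err, 99)),
        "max": float(err.max()),
    }

# Toy data (illustration only): absolute errors 0.00, 0.01, ..., 1.00.
stats = error_statistics(np.zeros(101), np.arange(101) / 100.0)
```

Looking at the tail statistics in addition to the mean is what reveals whether a model fails 'gracefully' on the samples where it is worse than average.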

Generalization to other models
We have illustrated the network topology selection method for parametric models trained on synthetic data in a given range as this flavour is currently the most popular. It is well known that neural networks usually do not perform very well in regions outside the training set, i.e. when they extrapolate. Therefore, we suggest taking that into account when generating the training set. The fact that this is possible is one of the big advantages of working with synthetic data (rather than real world or historic data). In practice, we recommend an automated bound checking of the new inputs supplied to the network that is consistent with the bounds used in training. Even more care has to be taken when training non-parametric models on historic data, e.g. a VaR model on historic market data shifts. The high dimensionality of the input and the non-linearity of the PnL might require a topology so big that, given the limited availability of historic data, it might simply not be trainable to the point where the bias falls below an acceptable threshold. This will also make it impossible to reduce the variance to a satisfactory level. If the network is trained on a decade in which markets are calm, it will also easily breach the ranges in which it was trained when used in a time of market stress, causing extrapolation failures.
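The recommended bound checking can be sketched as a small guard in front of the network. The function name and the bounds are hypothetical; in practice the bounds would be exactly those used to generate the training set.

```python
import numpy as np

def within_training_bounds(x, low, high):
    """Return a boolean per input row: True if every feature is inside the bounds."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.all((x >= low) & (x <= high), axis=1)

# Hypothetical training bounds for two features (e.g. rate and volatility).
low = np.array([0.0, 0.05])
high = np.array([0.05, 0.5])

new_inputs = np.array([
    [0.01, 0.20],   # inside the training range
    [0.01, 0.80],   # volatility above the training range: extrapolation
])
ok = within_training_bounds(new_inputs, low, high)
```

Inputs flagged as out of bounds should not be passed to the network silently, since its predictions there are extrapolations.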

Conclusions
We conclude that the SR 11-7 requirement to conduct model validation, in particular a thorough model selection process, can be satisfied for neural network models as well. The simple grid based network topology selection method is a pragmatic way of producing a good and documented choice of hyperparameters for a given financial application.
As a byproduct we obtain interesting insights into how neural networks learn financial models. Given that evaluating a network is very fast, this makes it feasible to use pricing models that would otherwise be too computationally expensive. We also find that while option prices (or implied volatilities) clearly have a time dependence, LSTMs are overly complicated for this application and the much simpler MLPs perform better in a fair comparison, a textbook case of Occam's razor.