An SGBM-XVA demonstrator: a scalable Python tool for pricing XVA

In this work, we developed a Python demonstrator for pricing total valuation adjustment (XVA) based on the stochastic grid bundling method (SGBM). XVA is an advanced risk management concept which became relevant after the recent financial crisis. This work is a follow-up work on Chau and Oosterlee in (Int J Comput Math 96(11):2272–2301, 2019), in which we extended SGBM to numerically solving backward stochastic differential equations (BSDEs). The motivation for this work is basically two-fold. On the application side, by focusing on a particular financial application of BSDEs, we can show the potential of using SGBM on a real-world risk management problem. On the implementation side, we explore the potential of developing a simple yet highly efficient code with SGBM by incorporating CUDA Python into our program.


Introduction
Backward stochastic differential equations (BSDEs) have been a popular research subject ever since the work of Pardoux and Peng, [2] and [3]. There are many papers regarding for their applications in mathematical finance and stochastic control. In terms of numerical analysis, there is also significant research on the efficient calculation or approximation for BSDEs. This includes Monte Carlo-based research, like [4], chaos decomposition method [5], cubature methods [6] or Fourier and wavelet-based methods, like in [7] and [8]. However, there are relatively few studies on the practical application of BSDEs. As far as we know, there is not yet an industrial software package for solving general BSDEs. This work is our first step to address this issue.
The goal of this study is basically two-fold. This work can be seen as a follow-up of our theoretical research on the stochastic grid bundling method (SGBM) for BSDEs, see [1]. SGBM is a Monte Carlo-based algorithm which was introduced in [9], and extended its application to BSDEs in [1]. Here, we will study the practical side by developing a demonstrator in Python, where we shall also make use of computing on a Graphics Processing Unit (GPU) in order to improve the scalability, and make use of the CUDA Python package. CUDA Python is a recently open to public programming tool, which was developed by Anaconda. It has become freely available at the end of 2017, being previously commer-cial software. This programming tool carries the promise of combining fast development time for Python coding with the high efficiency of GPU computing.
The second focus of this work is the application of BSDEs in financial risk management, where we would like to demonstrate the practical opportunities for efficient BSDE solving software. We choose the modeling of the variation margin and the close-out value within risk management, in the form of BSDEs, as the main test problem. In this work, we work under a complete market assumption which includes counterparty risk and margin requirements, and develop a numerical solver for XVA.
A Python demonstrator for solving XVA pricing problems with the SGBM algorithm with GPU computing is the main result of this study. The strength of this package is its scalability with respect to the dimensionality of the underlying stock price processes. While the demonstrator is designed for a specific problem setting, because of the general framework with BSDEs and SGBM, the package could easily be transformed into a general solver for BSDEs and used for other financial problems.
This article is organized as follows. In Sect. 2, we introduce the basic mathematical setting of BSDEs, the fundamental properties of the SGBM algorithm and the application of parallel computing with SGBM. Section 3 describes the programming language, the financial setting for our SGBM-XVA demonstrator, and other technical details, while some numerical tests are performed in Sect. 4. Concluding remarks, possible extensions and outlook are given in Sect. 5.

Methodology
We begin this section with a brief review of the mathematical background of BSDEs and of the Monte Carlo-based simulation method SGBM. For further details, the readers are referred to our previous work [1]. Furthermore, we will describe a parallel computing version of SGBM, which is a follow-up on [10], in which parallel SGBM was applied to the pricing of multi-dimensional Bermudan options.

Backward stochastic differential equations
We use a standard filtered complete probability space (Ω, F, F, P), where F := (F t ) 0≤t≤T is a filtration satisfying the usual conditions for a fixed terminal time T > 0. The process W := (W t ) 0≤t≤T is a d-dimensional standard Brownian motion, adapted to the filtration F and the so-called decoupled forward-backward stochastic differential equation (FBSDE) defines a system of equations of the following form: are referred to as the drift and the diffusion coefficients of the forward stochastic process S, respectively, and s 0 ∈ F 0 is the initial condition for S. The function g : [0, T] × R d × R × R 1×d is called the driver function of the backward process and the terminal condition V T is given by N(S T ), for a function N : R d → R. All stochastic integrals with Wiener process W are of the Itô type.
For both μ(t, s) and σ (t, s), we assume standard conditions such that a unique strong solution for the forward stochastic differential equation exists, This process also satisfies the Markov property, where denotes the expectation operator with respect to probability measure P.
A pair of adapted processes (V , Z) is said to be the solution of the FBSDE, if V is a continuous real-valued adapted process, Z is a real-valued predictable row vector process, such that T 0 Z t 2 dt < ∞ almost surely in P, where · denotes the Euclidean norm, and the pair satisfies Equation (1). Our goal is to find (V 0 , Z 0 ) by solving the problem backward in time. We do this by first discretizing Equation (1) along the time-wise direction, π : 0 = t 0 < t 1 < t 2 < · · · < t N = T. We assume a fixed, uniform time-step, t = t k+1t k , ∀k, and let W k+1,q := W t k+1 ,q -W t k ,q ∼ N (0, t), a normally distributed process, for q = 1, . . . , d. The vector W k+1 is defined as ( W t k+1 ,1 , . . . , W t k+1 ,d ) . The discretized forward process S π is defined by It is defined by the classical Euler-Maruyama discretization. We define a discrete-time approximation (Y π , Z π ) for (Y , Z): The notation ∇ denotes the gradient of a function. Note that various combinations of θ 1 and θ 2 give different approximation schemes. We have an explicit scheme for V π if θ 1 = 0, and an implicit scheme otherwise.

Stochastic grid bundling method (SGBM)
We now introduce the Monte Carlo-based algorithm SGBM for BSDEs. The main difference between SGBM and other commonly known Monte Carlo algorithm, for example the one in [11], include a so-called regress-later regression stage and a so-called equal-sized bundling localization. Due to the Markovian setting of (S π t k , F t k ) t k ∈π and the terminal processes V π t N and Z π t N being deterministic with respect to S π t N , there exist functions v (θ 1 ,θ 2 ) k (s) and z (θ 1 ,θ 2 ) k (s) such that for the given approximation in Equations (2a)-(2c). Our algorithm estimates these functions, (y (θ 1 ,θ 2 ) k (s), z (θ 1 ,θ 2 ) k (s)), recursively, backward in time, by a local least-squares regression technique onto a function space with basis functions (p l ) 0≤l≤Q .
As a Monte Carlo-based algorithm, our algorithm starts with the simulation of M independent samples of (S π t k ) 0≤k≤N , denoted by (S π ,m t k ) 1≤m≤M,0≤k≤N . Note that in this basic algorithm, the simulation is performed only once. Therefore, this scheme is a non-nested Monte Carlo scheme. The next step is the backward recursion. At initialization, we assign the terminal, t N = T, values to each path for our approximations, i.e., v (θ 1 ,θ 2 ),R,I The superscript R is used to distinguish the simple discretization and the discretization with SGBM, while I denotes the number of Picard iterations. The following steps are performed recursively, backward in time, at t k , k = N -1, . . . , 0. First of all, we bundle all paths into B t k (1), . . . , B t k (B) equal-sized, non-overlapping partitions based on a sorting result of a bundling function defined on (S π ,m t k ). This is the partition step. Next, we perform the local regress-later approximation separately within each bundle. The regress-later technique we are using combines the least-squares regression with (analytically determined) expectations of the basis functions to calculate the necessary expectations.
Generally speaking, for any target function f and M Monte Carlo paths, a standard regress-now algorithm for a dynamic programming problem approximates a function ι within the space spanned by the regression basis such that it minimizes the value t+ t )ι(S π ,i t )) 2 and approximates the expectation E t [f (S π t+ t )] by E t [ι(S π t )] = ι(S π t ). As a projection from a function of S π t+ t to a function of S π t is performed, it will introduce a statistical bias to the approximation.
Instead, the regress-later technique, as we employ in SGBM, finds a function κ which minimizes the functional 1 t+ t )κ(S π ,i t+ t )) 2 and approximates the expectation ]. By using functions on the same variable, at t + t, in the regression basis, we can avoid the statistical bias in the regression. However, the expectation of all basis functions must preferably be known in closed-form, in order to apply the regresslater technique efficiently.
In the context of our BSDE SGBM algorithm, we define the bundle-wise regression pa- , .
Note that as we actually apply the equal partition bundling, M m = 1 B t k (b) = N for some constant N and for all bundle. Therefore we won't face the problem of dividing by zero.
The approximate functions within the bundle at time k are defined by the above parameters and the expectations E x As stated before, a Picard iteration is performed at each time step within each bundle if the choice of (θ 1 , θ 2 ) results in an implicit scheme. For further details on the application of the Picard iteration, readers may refer to [12] or [7] and the references therein. Finally, the full approximations for each time step are defined as:

Parallel SGBM
One way to improve the efficiency of the SGBM algorithm is by means of GPU acceleration, which has been successfully implemented for the SGBM algorithm for early-exercise options in [10]. The framework of parallel computation from [10] can also be used for the SGBM solver of BSDEs. In this framework, we divide the SGBM algorithm in two stages, namely, the forward simulation stage and the backward recursion approximation stage. In this subsection, we briefly describe how parallel GPU computing is used in each stage.
In the forward simulation stage, we simulate independent samples of the stock price models from the initial time t 0 to the expiration date T as defined by the problem. Moreover, we can already pre-compute all values of interest for the backward step that are related to the stock prices and store them in the memory. These values of interest include the sorting parameter, the basis function values, the expectations of the basis functions and the terminal values. Since the generation of each sample is independent of the other samples and a huge amount of samples may be needed for an accurate Monte Carlo simulation, this stage is particularly suitable for parallel computing. Also note that after the calculation of the values of interest, the actual stock paths can be discarded in the GPU, as they are not required anymore in the backward step.
The second stage is the backward approximation stage. From the discretization of the BSDE, we notice that the calculation in time-wise direction should process sequentially, i.e., starting from the terminal time and going backward along the time direction. However, within each time step, there are many independent processes that are well suitable for parallel computing. Within each time step, the data (i.e. the values of interest) is separated into different non-overlapping bundles and the computations in the different bundles are independent of each other. Within a bundle, multiple regression steps on different variables (V , Z, g, . . .) need to be completed. For a graphical representation of the computation within each time step, we refer to Fig. 1. As each regression within the time step is independent, we can also perform these computation simultaneously. Finally, in order to reduce the overall volume of memory transfers, only the information for the current time point is transferred to the GPU.

The SGBM-XVA demonstrator
In order to test and analyze the applicability and practicality of the SGBM algorithm in the BSDE-based financial model framework, we have created an SGBM-XVA demonstrator which computes XVA, i.e. a value of great interest in modern risk management, with the algorithm introduced in the previous section. We make use of the Python programming language and the CUDA Python package for the development of this SGBM-XVA demonstrator. In this section, we address the programming tools that we have used, the basic financial setting for this problem and the design of our code.

Programming language
Python is the programming language of choice for this project as it is a popular tool in the financial industry nowadays. Being a high-level programming language, Python is easy to develop and particularly useful for scripting because of its easy to write syntax and grammar. These properties are especially useful for the financial industry, as practitioners need to constantly monitor and adapt their models to the changing markets. Moreover, Python has been one of the dominating programming languages in data science with its widely available packages. Therefore, we develop our algorithm under the Python framework. However, Python is an interpreted programming language, so, its performance is not as strong as a compiled language. One of our focusses of study is therefore the balance between rapid development and actual computation performance which can be achieved by Python.
One of the effective techniques for improving the computational efficiency of Python is to pre-compile parts of the code, for example, with the help of the Numba package. This technique has been adopted in our SGBM-XVA demonstrator. In order to further improve the efficiency of our algorithm, we apply parallel GPU computing as stated in the last section. The use of GPUs has been a major development in scientific computing. Along with the computing platform CUDA, the GPU provides a high potential for computational speed-up. With more than hundred threads (the basic operational units within CUDA) in a typical GPU, repetitive function computations can be dedicated to various treads to be run in parallel. It will greatly reduce the computational time.
In this work, we use the CUDA Python packages to incorporate GPU programming into Python. CUDA Python is made out of Python packages from Continuum Analytics, that allow a user to make use of CUDA within the native Python syntax. The tool consists of two main parts: a Numba compiler and the pyculib library. The Numba compiler transforms a native Python code with only supported features into a CUDA kernel or a device function with only a function decorator. This feature enables GPU computing in Python without the need for the programmer to learn a new language. The pyculib library [13] is a collection of several numerical libraries, that provide functions for random number generation, sorting, linear algebra operation and more. This set of tools has been previously included in commercial software, but it has become open source since 2017.
Another benefit of using CUDA Python is in terms of an automatic memory transfer. In a GPU computing framework, it is often necessary to transfer data between the CPU and GPU memory space. In this tool, a programmer can either manage the transfer of memory for better control of GPU memory usage and the bandwidth of memory transfer between CPU and GPU, or let the platform handle it automatically, again simplifying the code development.
The main down-side of this tool is that so far it only supports certain functions from the native Python and Numpy packages. Some of the well-optimized packages in linear algebra are not available for the GPU, and some of the Python code has to be rewritten into a supported version. However, using the package still requires less adaptation effort as compared to the incorporation of other compiled programming languages, like C, into a Python code. As we will discuss later, this tool delivers a great improvement in efficiency in some areas.

Financial test case: total valuation adjustment (XVA)
Next, we shall introduce the test problem for our SGBM-XVA demonstrator. The main goal here is to show that BSDEs and SGBM can be used to solve financial total valuation adjustment XVA problems in risk management.
In short, we consider a financial market where the bank is selling derivatives to a counterparty, but either the bank or the counterparty may default before expiry. Therefore, a variation margin has to be posted as collateral which will be used to settle the account when one party defaults. In this market, the funding rate for each financial product may be different, as well as the deposit rate for a risk-free account and the funding rate through bonds. The goal of this model is to compare the price of a financial portfolio with and without counterparty risk. The difference between these two prices is called the total value adjustment.
We use a simplified version of the dynamics for our financial market, as in [14]. Within this setting, we take into account the possibility that both parties, in an over-the-counter financial contract, may default, also we include the exchange of collateral, a close-out exchange in the case of a default and the usage of a repurchasing agreement (repo) for position funding. While this model is definitely more realistic than the classical financial derivative pricing theory (where one can borrow and lend freely with negligible cost) by taking into account the counterparty risk, this model still leaves out some parts of the financial deals, like regulatory capital or haircuts that apply to collateral. As already mentioned, we use the standard multi-dimensional Black-Scholes model for the asset dynamics. We select this model as a balance between a relatively realistic model and the tractability for the equations involved. The specific SGBM algorithm can be easily generalized to other models, as described in the outlook section.
Next, we introduce the mathematical model for our asset prices and the default events, as well as the notations that we use. Subsequently, we present the fundamental equations that we are solving in our demonstrator. Detailed financial interpretation as well as the model derivation will be left out and readers are referred to [14] for the description of this XVA model.

The XVA model
In this model, we take on a bank's perspective onto risk management. This perspective focuses on a non-centrally cleared financial derivative on underlying assets S, which are traded between the bank and its client and both parties may default. However, the default will not affect the underlying assets S.
Before we proceed to the BSDE description, which we are going to compute in the SGBM-XVA demonstrator, we introduce the meaning and financial background of the different terms involved. S i denotes the underlying i-th asset, which follows the standard Black-Scholes model under our assumptions. Its dynamics are defined by the SDE: The constant vectorμ = (μ 1 , . . . ,μ d ) and (σ 1 , . . . ,σ d ) represent respectively the drift rate and the standard deviation of the financial assets. The parameters ρ ij form a symmetric non-negative correlation matrix ρ, which is invertible under our assumptions. We can relate the correlated Brownian motion B to a standard, independent d-dimensional Brownian motion W by performing a Cholesky decomposition on ρ. They satisfy the equality where C is a lower triangular matrix with real and positive diagonal entries, and CC = ρ. The processes J B and J C are used to model the events of default for each party in the transaction. Mathematically, they are defined as counting processes, and where τ B and τ C are stopping times, denoting the random default times of the bank and the counterparty, respectively. The processes are assumed to have stochastic, timedependent intensities, λ B , λ C , i.e.
Next, we discuss the financial derivatives traded within this model, where we use N to denote the terminal payoff for the portfolio. To mitigate counterparty risk, the variation margins, X, need to be computed for the two parties. As in [14], the values X are based on the market value of the financial product, and they are computed and re-adjusted frequently. When X > 0, the counterparty is posting collateral with the bank.
When one party in the financial contract defaults, the contract position needs to be closed. We denote the portfolio value at default (at time τ = τ B ∧ τ C ) by θ τ , and it is given by where R C , R B ∈ [0, 1] are the recovery rates in the case the counterparty and the bank default, respectively. The variable M denotes the close-out value of the portfolio when any party defaults. We will give more details regarding M in a later section. Next, we introduce the notation for the financial quantities in our model. The adapted stochastic vector processes q S and γ S are respectively the repo rate and the dividend yield of the underlying assets. The process r is the stochastic risk-less interest rate. The processes r B and r C are the yields of the risky zero coupon bonds of the bank and the counterparty, respectively. The process q C is the repo rate for the bonds of the counterparty. The interest rate for the variation margin is given by r X , and, finally, r F is the cost of external funding. In order to simplify the expression of the demonstrator, we assume all the above processes to be constant.

Fundamental BSDE and reduced BSDE
In [14], a fundamental BSDE is derived through a hedging argument based on a replicating portfolio for the financial derivative. For the hedging portfolioV , including the counterparty credit risk, the dynamics of posting collateral and holding counterparty bonds for hedging, are given by, and the driver function is defined as The notation diag(m) denotes a diagonal matrix with the terms of vector m on the main diagonal. We compute the price of the derivative by approximating the values of the stochastic processes (V ,Ẑ, U B , U C ) that solve the above BSDE. The price of the contract at time 0 is given byV 0 .
The main difficulty of dealing with this BSDE is that the terminal time is random, as it depends on the time of default. Therefore, this BSDE has to be transformed into a standard BSDE, to reduce the computational complexity, following the techniques used in [15]. The authors in [15] showed that we may recover the solution of (4) by solving the BSDE, and usinĝ Proposition 1 (Theorem A.1 in [14]) If the pair of adapted stochastic processes (V,Ẑ) solves Equation (5), then the solution (V ,Ẑ, U B , U C ) of Equation (4) is given by (6).
Idea of the proof This technique considers the three possibilities of termination, i.e. no default, the bank's default and the counterparty default, and shows that the results in (6) solve (4) under all three situations.
When we consider the total value adjustment to the financial option values, we compute an alternative price for the same financial contracts, however under the assumption of risk-free dynamics (meaning no counterparty default). The value adjustment is defined as the difference between the two portfolios values (risky minus risk-free option values). The default-free portfolio dynamics, in a repo funding setting, can be expressed as

Discrete system
Here, we explain the setting for the variation margins X and close-out value M. The actual discrete system used in our demonstrator will be introduced as well.
In [16], the variation margin has been mandated to be "the full colleteralised mark-tomarket exposure of non-centrally cleared derivatives". Therefore, at any time t, either the mark-to-market risk-free portfolio value V or the counterparty risk adjusted valueV appear reasonable choices for the variation margin X. There are also two possible conventions for the portfolio close-out value, namely M = V and M =V . In this demonstration, the variation margin and the close-out value are assumed to be the mark-to-market price V . In this case, the valuation problem results in a linear BSDE that contains an exogenous process V . This means that both the variation margin and the close-out value are marked to the market, and are thus not dictated by the bank's internal model. Therefore, there is an additional stochastic process V in the driver forV . If the expression of V is not readily available, the numerical simulation of theV system poses extra challenges, because we have to simulate V andV simultaneously.
The driver for the reduced BSDE with counterparty credit risk is now given bȳ while the counterparty risk-free price V can be expressed as Indeed, the elementary expression of V may be available in closed-form for some basic financial derivatives under the simple 1D Black-Scholes model. However, in order for our demonstrator to cover a wide range of products, we solve V by SGBM instead.
Recall the discretization scheme in Equations (2a)-(2c), we can approximate the reduced BSDE with the following numerical scheme: This system is the one that we have implemented in our demonstrator. Note that the no- . To close-out this section, we would like to mention that there is an analytic expression for the reference price under the current setting, which is given bŷ There is however no elementary expression or simple way to evaluate this quantity.

Function descriptions
There are two parallel versions of our algorithm, both are written in Python. One of them is based on generic Python code and the Numpy, Scipy and Numba packages for performance and efficiency, which we would refer to as Version 1 from now on for simplicity. The second one makes use of the CUDA toolkit, the CUDA Python portion of the Numba package and the pyculib library and shall be referred to as Version 2. A numerical_test script has been provided for running basic comparison tests between the two versions under a predefined test setting. Next, we list the main functions within our implementation. Since the two versions have similar architecture, we would only present Version 2. We have: • the main function for the whole algorithm: cuda_sgbm_xva; • the forward simulation function for the basis values and partition ordering: cuda_jit_montecarlo • the main function for the backward approximation stage: cuda_sgbminnerblock • the parallel regression function: cuda_regression • an example class to store all the information related to the test case, including Equation (7): ArithmeticBasketPut All of the above functions form the complete demonstrator but each individual component can be used or replaced separately, which gives us flexibility for future development.
Version 1 only depends on Python and the Numba library. In addition to that, Version 2 also requires the CUDA driver and the pyculib library.

Numerical experiments
In this section, we present the tests we have implemented in our demonstrator. In these tests we assume that the bank sells a portfolio consisting of an arithmetic put option, whose payoff is given by where K is the strike price. The detailed setting for the numerical test is presented in Table 1. This set of parameters is based on the one used in [10] and it has been adopted here as an easy to scale up problem. The main purpose of the tests is to investigate whether the code can be easily extended to solving high-dimensional problems, so we choose a set of parameters whose properties do not change when we change the dimensionality of the problem. For a more practical problem, a test based on the German stock index model from [17] has been conducted in [1].
Note that here we have adopted a test case with the borrowing rate and lending rate being equal for the purpose of comparing our default-free prices to previously known result. Additional tests have been performed to assure that the convergence behavior will not be affected by different sets of parameters.
The numerical test has been conducted on a common desktop computer with an Intel(R) Core(TM) i5-7400 processor and a GeForce GTX 1080 graphic card.

Performance of GPU computing
The first set of tests is performed to check whether there is any benefit of GPU computing for our demonstrator. We perform the same test 10 times separately with the two versions and take the averaged computing time over these 10 runs to compare the efficiency between them. Moreover, in accordance with the work in [1], we use 5 sets of parameters for testing the consistency of this SGBM algorithm. The parameters are presented in Table 2. We have conducted tests with the stock price dimensionality being 1, 5 and 10. The results are shown in the Tables 3, 4 and 5.
As we expected from the theoretical work, the approximate values converge with respect to the progressively increasing grid sizes and the standard deviations decrease as we progress through the parameters sets. More importantly, there is a clear speed-up of Version 2 over Version 1. The biggest improvement comes from the forward simulation stage    where we run each independent sample in parallel, and greatly reduce the computational time. There still seems to be room for improvement regarding the backward simulation stage as even if we perform the regression in parallel on the GPU in Version 2 instead of sequentially in Version 1, the speed-up is not obvious. The main reason may be that some part of the code is not performed on the GPU to keep some flexibility of our demonstrator. Some optimization is only available for the basic Python code but not for CUDA Python. Nevertheless, our tests show that CUDA Python improves the efficiency and may provide a good balance between the speed of execution and the speed of development. In particular, as the workload increases due to using more time steps, samples and bundles, the speed-up get higher due to GPU computing.
Finally, whereas the build-in floating point format of the GPU is 32 bits, 64 bits format is needed for ensuring high accuracy with our result.

Scalability
Next, we would like to check if our algorithm can be used for high-dimensional test cases. The setting is the same as in the last section. However, Version 1 is too time-consuming for a common desktop, so the focus is on Version 2. We perform tests up to 40 dimensions, as shown in Table 6.
It can be seen that with the help of the GPU, our method can be generalized to higher dimension. The application of GPU computing definitely expands the applicability of our method. However, we also notice that the time ratio increase is higher then the dimensional ratio. This is probably related to the architecture and the build-in hardware of the GPU. It maybe worthwhile to tune the GPU setting within our demonstrator corresponding to the GPU at hand, but this is hardware-dependent.
To sum up, the application of GPU computing not only speeds up our algorithm, but it also allows us to deal with more demanding problems with about the same hardware.

Conclusion
Although we just developed a demonstrator, our study gives us interesting insight in the performance of the SGBM algorithm for BSDEs. We have shown that we can apply the SGBM algorithm to solve high-dimensional BSDEs with the help of GPU computing, which is an important feature for using BSDEs in practice. With a suitable BSDE model, we can deal with complicated financial risk management problems with our solver and since our code is based on a general framework for BSDEs, our demonstrator can be adopted to different pricing and valuation situations. We also demonstrated that we can incorporate GPU computing into a native Python code. We believe that this work serves as a basis for developing BSDE-based software for financial applications. The authors encourage readers to download and test our demonstrator, which solves the problem which was stated in this work. The aim is to developed this tool further in terms of financial applications and computational efficiency.
Next, we will mention some possible generalizations for our algorithm as a guideline to future use. We will divide the outlook into two parts, a financial and a computational part.

Financial outlook
In this work, we used the classical Black-Scholes model for the stock dynamics. It is of interest to generalize the stock model by other diffusion SDEs, of the form: This would already result in a different implementation regarding the SGBM regress-later component.
We may also alter the model dynamics to better fit the market, for example, with the inclusion of initial margins and capital requirement. Recall that in Sect. 3.2.3 we have made some assumptions about the variation margin X and the close-out value M. Simply by adjusting these assumptions, a completely different BSDE driver may result. There are four possible combinations for the variation margin and the close-out value. Each choice gives rise to a different BSDE driver, which can be seen in Table 7.
When X and M are the adjusted valuesV , it results in a linear BSDE forV . This means that the variation margin and the close-out value are marked to the model. As all the values are marked inV and both values match, the equation reduces to an option pricing equation, with different interest rates. When both of them are the risk-free interest rates, a linear BSDE results that contains an exogenous process V . This means that both the variation margin and the close-out value are marked to the market and are not dictated by the internal model. Then there is an additional stochastic process V in the driver for V , which is the case we have used in this work.
Finally, when there is a mismatch between M and X, the resulting BSDE is no longer linear. Take case 4 as an example, it implies that the variation margin is collected according to an internal model while the close-up value is given by the market. The mismatch between the variation margin and the close-up value results in non-linear terms in the driver. These non-linear terms come from the cash flow when one party defaults.
We should notice that since we have a general BSDE algorithm in the form of SGBM, we may adapt our code to these new models by simply changing the model setting in the example class without changing the actual function. This is one of the advantages of using BSDEs in pricing and risk management.

Computational outlook
As we have mentioned before, one possible way of improvement is to sacrifice some flexibility and move the remaining solver parts of the code also to the GPU. Alternatively, one may further optimize the code within the GPU as the GPU set of routines seems to be still developing.
Furthermore, other features may be included in the software, for example, different payoff functions or a different SGBM regression basis. Finally, to have a fully independent software, a user interface and modular code are recommended.