Multistep schemes for solving backward stochastic differential equations on GPU

The goal of this work is to parallelize a multistep scheme for the numerical approximation of backward stochastic differential equations (BSDEs) in order to achieve both high accuracy and a reduction of the computation time. In the multistep scheme, the computations at each grid point are independent; this fact motivates us to select massively parallel GPU computing using CUDA. In our investigations we identify performance bottlenecks and apply appropriate optimization techniques to reduce the computation time on a uniform domain. Finally, some examples with financial applications are provided to demonstrate the achieved acceleration on GPUs.


Introduction
Backward stochastic differential equations (BSDEs) have been widely used in various areas such as physics and finance due to one of their key features, namely that they provide a probabilistic representation of solutions of nonlinear parabolic partial differential equations (PDEs). We consider a (decoupled) forward backward stochastic differential equation (FBSDE) of the form

dX_t = a(t, X_t) dt + b(t, X_t) dW_t, X_0 = x_0,
-dy_t = f(t, X_t, y_t, z_t) dt - z_t dW_t, y_T = ξ = g(X_T), (1)

where f is the driver function and ξ is the terminal condition. For a = 0 and b = 1, namely X_t = W_t, one obtains the standard BSDE of the form

-dy_t = f(t, y_t, z_t) dt - z_t dW_t, y_T = ξ, (2)

where y_t ∈ R^m and f(t, y_t, z_t): [0, T] × R^m × R^{m×d} → R^m. The existence and uniqueness of the solution of (1) are proven in [20]. If dX_t in (1) is defined as a geometric Brownian motion modelling the asset price, and g is the payoff function of a European option, then (1) is related to the Black-Scholes PDE, see [14]. That is to say, y_0 gives the option price and z_t/b represents the hedging portfolio, which describes the sensitivity of the option price y_t with respect to the asset price X_t. This is also called Δ-hedging in the pricing theory.
Higher-order numerical schemes usually demand more computing effort; effective parallel implementations are thus in great demand. Some acceleration strategies based on Graphics Processing Unit (GPU) computing have been developed for pricing problems in finance; however, only very few of them are BSDE-based. Such works can be found in [7, 11, 22], where the acceleration strategies are applied to numerical methods with convergence order not higher than 2.
For a higher order of convergence, the first multistep scheme on time-space grids was proposed in [32], where the integrands resulting from discretizing the BSDE in time are approximated using Lagrange interpolating polynomials. In [13], we successfully parallelized that multistep scheme on GPUs and showed the gain in computational time for option pricing via the Black-Scholes BSDE. However, that multistep scheme is only stable up to 3 time levels due to Runge's phenomenon. For better stability and the admission of more time levels, a new multistep scheme was proposed in [27] using splines instead of Lagrange interpolating polynomials. In principle, arbitrarily many time levels can be chosen in that multistep scheme, and in general, the more time levels, the higher the accuracy. However, using more time levels also incurs more computational cost. For this reason, in this work we investigate massively parallel GPU computing for the multistep scheme [27], to make the scheme more useful in practice. For example, high accuracy at low computation time can minimize monetary loss in financial problems. An application that shows the importance of efficient approaches for pricing and risk management of financial models can be found in [3].
The remainder of this paper is organized as follows. In Sect. 2, we introduce the multistep method [27] for the numerical solution of BSDEs. Section 3 presents the algorithmic framework of the numerical method and the potential for a preliminary reduction of computing time due to its special structure on uniform domains. In Sect. 4, we describe the GPU acceleration strategies for a further reduction of computing time. Section 5 shows the numerical results, including financial applications. Finally, we give our conclusions in Sect. 6.

The multistep scheme
In this section we introduce the multistep scheme [27], oriented toward its parallelization on GPUs.

Preliminaries
Let (Ω, F, P, {F_t}_{0≤t≤T}) be a complete, filtered probability space on which a standard d-dimensional Brownian motion W_t is defined, such that the filtration {F_t}_{0≤t≤T} is the natural filtration of W_t. We define |·| as the standard Euclidean norm in the Euclidean space R^m or R^{m×d}, and L² = L²_F(0, T; R^d) as the set of all F_t-adapted and square integrable processes valued in R^d. Moreover, let F_s^{t,x} for t ≤ s ≤ T be the σ-field generated by the Brownian motion {x + W_r - W_t, t ≤ r ≤ s} starting from the time-space point (t, x), and let E_s^{t,x}[X] denote the conditional expectation of the random variable X with respect to F_s^{t,x}. A pair (y_t, z_t) is called an L²-adapted solution of (1) if it is F_t-adapted and square integrable, and satisfies (1) in the sense of

y_t = g(X_T) + ∫_t^T f(s, X_s, y_s, z_s) ds - ∫_t^T z_s dW_s, 0 ≤ t ≤ T,

where the third term on the right-hand side is an Itô-type integral. This solution exists under regularity conditions [20].
Let us consider the semilinear PDE

∂u/∂t + (1/2) Σ_{i,j} (b b^T)_{i,j} ∂²u/∂x_i ∂x_j + Σ_i a_i ∂u/∂x_i + f(t, x, u, (∇u) b) = 0 (4)

with the terminal condition u(T, x) = g(x). The following theorem can be obtained with a straightforward application of Itô's lemma.
Theorem 1 (Nonlinear Feynman-Kac theorem) Let u ∈ C^{1,2} satisfy (4) and suppose that there exists a constant C such that |a(t, x)| + |b(t, x)| ≤ C(1 + |x|). Then (y_t, z_t) = (u(t, X_t), ∇_x u(t, X_t) b(t, X_t)) is the unique solution of (1).
We note that the authors in [19] show the existence and uniqueness of a solution for BSDEs driven by a Lévy process with moments of all orders, which can be used for pricing in a Lévy market. The nonlinear Feynman-Kac formula for a general non-Markovian BSDE has been established in [21], where path-dependent quasi-linear parabolic PDEs are considered. Furthermore, the Feynman-Kac representation of fully nonlinear PDEs has been investigated, e.g., in [23].

The stable semidiscrete scheme
Let N be a positive integer and Δt = T/N the step size that uniformly partitions the time interval [0, T]: 0 = t_0 < t_1 < ... < t_N = T with t_n = nΔt. Let k and K_y be two positive integers such that 1 ≤ k ≤ K_y ≤ N, which represent the number of time layers and interpolation points, respectively. The BSDE (2) can be expressed as

y_{t_n} = y_{t_{n+k}} + ∫_{t_n}^{t_{n+k}} f(s, y_s, z_s) ds - ∫_{t_n}^{t_{n+k}} z_s dW_s. (6)

Taking the conditional expectation E_{t_n}^x[·] in (6) to obtain the adaptability of the solution and using a cubic spline polynomial to approximate the integrand, the reference equation for the y process is obtained (see the Appendix), where R_y^n is the interpolation error. For the z process, using l and K_z instead of k and K_y, multiplying both sides of (6) by ΔW_{t_{n+l}} and taking the conditional expectation E_{t_n}^x[·], the reference equation using cubic spline interpolation follows (see the Appendix), where R_z^n = R_{z_1}^n + R_{z_2}^n collects the interpolation errors. In [27], the authors have shown that the scheme is stable for k = 1, ..., K_y with K_y = 1, 2, 3, ..., N, and for l = 1 with K_z = 1, 2, 3, ..., N. That is to say, the algorithm allows for arbitrarily many time levels K_y and K_z. Using the conditions of the cubic spline interpolation to calculate the unknown coefficients, we obtain the coefficients γ_{K_y,j}^{K_y} and γ_{K_z,j}^1 of the cubic spline interpolation (Tables 1 and 2 give their values up to 6 time levels). It is shown in [27] that the local errors in (7) are of high order, provided that f and g are smooth enough. In (7) we need to divide by Δt to find the value of the z process. Therefore, in order to balance the time truncation errors, one might set K_z = K_y + 1.
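As a sketch in the notation above (the precise reference equations and coefficient values are given in [27] and the Appendix), the spline-based y-equation has the following form:

```latex
% Sketch: reference equation for the y process (cf. the Appendix and [27]).
% The integral of the conditional expectation of f is replaced by the
% integrated cubic spline, which yields a weighted sum with the
% coefficients gamma from Table 1.
y_{t_n} \;=\; \mathbb{E}_{t_n}^{x}\!\left[y_{t_{n+k}}\right]
       \;+\; \Delta t \sum_{j=0}^{K_y} \gamma^{K_y}_{K_y,j}\,
             \mathbb{E}_{t_n}^{x}\!\left[f\!\left(t_{n+j},\, y_{t_{n+j}},\, z_{t_{n+j}}\right)\right]
       \;+\; R^n_y .
```

Dropping the error term R_y^n and replacing the exact conditional expectations by their numerical approximations yields the computable scheme.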
The stable semidiscrete scheme for the d-dimensional case is given as follows: we denote by (y^n, z^n) the approximation to (y_{t_n}, z_{t_n}). Given the random variables (y^{N-i}, z^{N-i}), i = 0, 1, ..., K - 1 with K = max{K_y, K_z}, the pair (y^n, z^n) can be found for n = N - K, ..., 0, componentwise for the m components of y and the m × d components of z. In the following, we only present the results of the error analysis; for the proofs we refer to [27] and [32].

Lemma 1 The estimates of the local truncation errors in the semidiscrete scheme read
where C > 0 is a constant depending on T, f , g and the derivatives of f and g.

Theorem 2 Suppose that the initial values satisfy
where K_y = 1, 2, 3 for the first equation and K_y > 3 for the second one. For a sufficiently small time step Δt, an error bound can be shown to hold, where C > 0 is a constant depending on T, f, g and the derivatives of f and g.

Theorem 3 Suppose that the initial values satisfy
where K_z = 1, 2, 3 for the first equation and K_z > 3 for the second one, and the condition on the initial values in Theorem 2 is fulfilled. For a sufficiently small time step Δt, an analogous error bound can be shown to hold, where C > 0 is a constant depending on T, f, g and the derivatives of f and g.
Remark 1 If f does not depend on the process z, the maximum order of convergence is 4 for the y process and 3 for the z process; if f depends on z, the maximum order of convergence for both the y and z processes is 3.
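The convergence orders stated in Remark 1 and in the theorems above can be checked empirically from a table of errors obtained while doubling N; a small helper for this (Python used purely as an illustration, with a hypothetical name):

```python
import math

def observed_order(errors):
    """Observed convergence orders log2(e_N / e_{2N}) from a list of errors
    measured while doubling the number of time layers N each time."""
    return [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]
```

For a scheme of order 2, halving the step size divides the error by 4, so `observed_order([0.5, 0.125, 0.03125])` returns `[2.0, 2.0]`.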

The fully discrete scheme
Let Δx denote the step size of the uniform partition of the d-dimensional space domain. The fully discrete scheme then computes (y_i^n, z_i^n) at every space grid point x_i for n = N - K, ..., 0 via (9), where Ê_{t_n}^{x_i}[·] denotes the approximation of the conditional expectation. The calculations at each space grid point x_i in (9) are independent for each time layer t_n. Therefore, the parallelization strategy is fully related to the space discretization, which will be discussed in the following sections.
The functions inside the conditional expectations involve the d-dimensional probability density function of the Brownian motion; one can therefore choose, e.g., the Gauss-Hermite quadrature rule to achieve high accuracy with only a few quadrature points. The conditional expectation can then be sufficiently accurately approximated by a table interpolation as in (10), where ŷ^{n+k} are interpolated values at the space points x_i + √(2kΔt) a_λ based on the y^{n+k} values, and (ω_λ, a_λ) for λ = (λ_1, λ_2, ..., λ_d) are the weights and roots of the Hermite polynomial of degree L (see [12]). In a similar way, one can express the other conditional expectations in (9).
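As a minimal 1-dimensional sketch of (10) (Python for illustration only; the paper's implementation is CUDA C, the names are ours, and a hardcoded degree-3 rule replaces the higher degrees used in the experiments):

```python
import math

# Degree-3 Gauss-Hermite nodes a_l and weights w_l for the weight exp(-a^2);
# for the paper's L = 32 one would tabulate the degree-32 rule instead.
GH_NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
GH_WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]

def cond_expectation(u, x, k, dt):
    """Approximate E[u(x + W)] for a Brownian increment W over time k*dt,
    i.e. (1/sqrt(pi)) * sum_l w_l * u(x + sqrt(2*k*dt) * a_l), cf. (10)."""
    s = math.sqrt(2.0 * k * dt)
    total = sum(w * u(x + s * a) for w, a in zip(GH_WEIGHTS, GH_NODES))
    return total / math.sqrt(math.pi)
```

The degree-3 rule is exact for polynomials up to degree 5, so `cond_expectation(lambda v: v * v, 0.0, 1, 0.5)` recovers E[W_t²] = t = 0.5 up to rounding.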

The algorithmic framework
According to the numerical algorithm presented in Sect. 2, the whole process for numerically solving BSDEs can be divided into three steps.
1. Construct the time-space discrete domain.
We divide the time period [0, T] into N time steps using Δt = T/N, i.e., N + 1 time layers, and the space domain R^d using the step size Δx (see also Sect. 2.3). We use the truncated domains [-16, 16] for the 1-dimensional examples and [-8, 8] for the 2-dimensional ones. Note that a larger domain could also be used; however, the approximation would not improve, which is to say that these truncated domains are sufficient for our numerical experiments. Furthermore, in order to balance the errors in the time and space directions, we adjust Δx and Δt such that they satisfy the equality (Δx)^r = (Δt)^{q+1}, where q = min{K_y + 1, K_z} and r denotes the global error order of the interpolation method used to generate the non-grid points when calculating the conditional expectations. For a better illustration, we show a visualisation of the domain for K_y = K_z = 1, d = 1 in Fig. 1, including the non-grid points.
2. Calculate K initial solutions with K = max{K_y, K_z}.
Generally, only the terminal values are given and one needs to compute the other K - 1 initial values. To obtain these initial values, we start with K = 1 and choose an extremely small time step size Δt.
3. Calculate the numerical solution (y_0^0, z_0^0) backward using equation (9). Note that the calculation for the y process is done implicitly with a Picard iteration.
In step 1, some of the generated non-grid points may lie outside of the truncated domain (see Fig. 1). For these points, we take the values on the boundaries as approximations (constant extrapolation). Note that one could also extrapolate those outside points; however, a longer computation time would be needed. As mentioned in (10), we use interpolation to approximate the values of y and z at the non-grid points. The computational drawback of the interpolation is locating the new points within the grid. A naive search algorithm loops over all grid points and finds the interval to which each point belongs; in the worst case, O(M^d) work is needed. Fortunately, the structure of the Gauss-Hermite quadrature creates symmetry for the non-grid points. Recall that each new point is generated as X_λ = x_i + √(2kΔt) a_λ; on a uniform grid, int[(X_λ - x_min)/Δx] gives the left boundary of the grid interval to which X_λ belongs, with int(x) giving the integer part of x. As a result, this step of the algorithm can be done in O(d), i.e., without a for-loop. This benefit comes from the uniformity of the space domain and substantially reduces the total computation time, as will be demonstrated in the numerical experiments.
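On a uniform grid this index computation can be sketched in a few lines (Python as a CPU-side illustration; the paper's implementation is a CUDA kernel, and the function name is hypothetical):

```python
def locate(X, x_min, dx):
    """Index i* of the left grid boundary of the interval containing X on a
    uniform grid x_i = x_min + i*dx: a single O(1) operation per dimension
    (O(d) overall), replacing the O(M^d) linear search over grid points."""
    return int((X - x_min) / dx)
```

For example, on a grid over [-16, 16] with Δx = 0.5, the point X = 16 maps to the last index 64, and any non-grid point falls into its interval with one division and one truncation.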
In step 2, we do not perform 2K interpolations (K for y and K for z) for each new calculation, but only 2. Suppose that we are at the time layer t_{n-K}. To calculate the y and z values on this time layer, one needs the conditional expectations of K time layers. In our numerical experiments, we consider 1- and 2-dimensional examples; higher-dimensional problems can be parallelized on a device (GPU) with large memory. The cubic spline interpolation is used to find the non-grid values in the 1-dimensional cases and the bicubic interpolation in the 2-dimensional cases. For instance, the coefficients for the y process are A_y ∈ R^{K×(4^d×M^d)}, and all the coefficients are stored. When we arrive at the time layer t_{n-K-1}, only the spline interpolation corresponding to the newly calculated values is constructed. The columns of the matrix A_y are then shifted one position to the right, deleting the last column, and the currently calculated coefficients enter the first column. The new A_y is used for the current step, and the same procedure is followed until t_0. This also reduces the amount of work of the algorithm.
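The column-shifting of A_y amounts to a fixed-length buffer of per-layer coefficients; a sketch of the bookkeeping (Python, hypothetical names):

```python
from collections import deque

class CoeffBuffer:
    """Keeps the interpolation coefficients of the last K time layers.
    Moving one layer back in time inserts the newly computed coefficients
    in front and drops the oldest column, so only one new interpolation
    per step has to be built (a sketch of the column-shifting of A_y)."""
    def __init__(self, K, initial_columns):
        assert len(initial_columns) == K
        self.cols = deque(initial_columns, maxlen=K)  # cols[0] = newest layer
    def step(self, new_coeffs):
        self.cols.appendleft(new_coeffs)  # maxlen silently drops the oldest
    def layer(self, j):
        return self.cols[j]  # coefficients of the j-th layer back in time
```

With K = 3, after `step("c3")` the buffer holds the three newest layers and the oldest column has been discarded, mirroring the shift of A_y described above.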
In the final step (step 3), we merge the calculation of the conditional expectations in (9), which is important for the reduction of computation time in the parallel implementation. This will be discussed in more detail in the next section.

Parallel algorithm
In this section we present the parallelization strategy for the multistep scheme presented in the previous sections. Basically, we parallelize the problem straightforwardly while paying attention to the Compute Unified Device Architecture (CUDA) execution model, i.e., creating arrays such that the accesses are aligned and coalesced, reducing redundant accesses to global memory, using registers when needed, etc. The first and second steps of the algorithm are implemented on the host. The third step, together with the operator performing the O(d) work in the searching procedure of the interpolation and the idea of shifting the coefficients of this procedure, is fully implemented on the GPU device. Recall from (9) that the following steps are needed to calculate the approximated values on each unknown time layer backward in time:
• Generation of the non-grid points X = x_i + √(2kΔt) a_λ. In the uniform space domain, the non-grid points need to be generated only once. To do this, a kernel is created in which each thread generates L^d points (Gauss-Hermite points) for each space direction. This kernel adds insignificant computing time to the total algorithmic time.
• Calculation of the values ŷ and ẑ at the non-grid points. This is the most time consuming part of the algorithm, which is related to finding the corresponding interpolating functions for the given y and z points in order to interpolate their values at the non-grid ones. For the 1-dimensional cases, we consider the cubic spline interpolation. Since (9) involves the solution of two linear systems with tridiagonal matrices, the Biconjugate Gradient Stabilized (BiCGSTAB) [28] iterative method is used. For this, we employ the CUDA Basic Linear Algebra Subroutine (cuBLAS) and CUDA Sparse (cuSPARSE) libraries [5]: the cuBLAS library for the inner product, Euclidean norm and addition of vectors, and the cuSPARSE library with the compressed sparse row format for the matrix-vector multiplication, due to the structure of the system matrix. Moreover, we created a kernel to calculate the spline coefficients based on the solved systems. Finally, a kernel applying the operator of the searching procedure of the interpolation is created to find the values at the non-grid points. Note that each thread is assigned to find m + m × d values (m for y and m × d for z). For the 2-dimensional examples, we consider the bicubic interpolation, for which 16 coefficients need to be calculated for each point. The bicubic interpolation requires the first and mixed derivatives; these are approximated using finite difference schemes of fourth order of accuracy (central for the interior points, forward and backward for the boundary points). Therefore, a kernel is created in which each thread calculates these values. Moreover, to find the 16 coefficients, a matrix-vector multiplication needs to be applied for each point; each thread performs this multiplication in another kernel. Finally, a kernel applying the operator of the searching procedure of the interpolation is created to find the values at the non-grid points, where each thread calculates m + m × d values.
• Calculation of the conditional expectations.
As mentioned above, we merge the calculation of the conditional expectations. For the first conditional expectation on the right-hand side of (9), we create one kernel in which each thread calculates one value using (10). The other three conditional expectations are merged into one kernel, namely for j = 1, 2, ..., K. This avoids accessing the same data multiple times from global memory and gives more work to each thread so that it does not stay idle. Note that one thread calculates 2 × m × d + m values.
• Calculation of the z values.
The second equation in (9) is used, and each thread calculates m × d values.

• Calculation of the y values.
The first equation in (9) is used, and each thread calculates m values using the Picard iterative process. In the next section, we present some numerical examples to show the accelerated computation on GPUs with applications in option pricing.
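The tridiagonal systems arising from the cubic spline interpolation above are solved on the GPU with BiCGSTAB via cuBLAS/cuSPARSE; as a CPU-side sketch (Python, hypothetical names, not the paper's implementation), the natural cubic spline system can equally be set up and solved directly with the Thomas algorithm:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d (forward sweep + back substitution)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def natural_spline_second_derivs(xs, ys):
    """Second derivatives M_i of the natural cubic spline through (xs, ys):
    interior rows h_{i-1} M_{i-1} + 2(h_{i-1}+h_i) M_i + h_i M_{i+1}
    = 6 * (divided-difference term), with boundary conditions M_0 = M_{n-1} = 0."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    a = [0.0] + [h[i - 1] for i in range(1, n - 1)] + [0.0]
    b = [1.0] + [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)] + [1.0]
    c = [0.0] + [h[i] for i in range(1, n - 1)] + [0.0]
    d = [0.0] + [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
                 for i in range(1, n - 1)] + [0.0]
    return thomas(a, b, c, d)
```

For data lying on a straight line all second derivatives vanish, which gives a quick sanity check of both routines; on the GPU the same system is instead fed to BiCGSTAB in CSR format, as described above.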

Numerical results
We implement the parallel algorithm using CUDA C programming. The parallel computing times (for a single run) are compared with the serial ones on a CPU, and the speedups are calculated. The Central Processing Unit (CPU) is an Intel(R) Core(TM) i5-4670 at 3.40 GHz with 4 cores, of which only one core is used for the serial calculations. The GPU is an NVIDIA GeForce 1070 Ti with a total of 8 GB GDDR5 memory.
We choose the degree of the Hermite polynomial L = 32 for the 1-dimensional examples and L = 8 for the 2-dimensional ones. For the Picard iterations, we choose p = 30. With these values, the quadrature error and iteration error are so small that they do not affect the convergence rate.
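The implicit y-update solved by the Picard iteration has the generic fixed-point form y = E + Δt·f(y); a minimal sketch (Python, hypothetical names; in the actual update (9), E collects the spline-weighted conditional expectation terms):

```python
def picard_solve(E_term, f, dt, y0=0.0, p=30):
    """Solve the implicit equation y = E_term + dt * f(y) by Picard
    (fixed-point) iteration with p iterations (the experiments use p = 30).
    E_term stands for all explicit conditional-expectation terms."""
    y = y0
    for _ in range(p):
        y = E_term + dt * f(y)
    return y
```

Since the map is a contraction for small Δt (factor Δt·Lip(f)), 30 iterations drive the fixed-point error far below the discretization error; e.g. y = 1 + 0.25·y converges to 4/3.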
The analytic solution is known in closed form; the exact solution with T = 1 is (y_0, z_0) = (1/2, 1/4). In Table 3, we show the importance of the proposed technique for interpolating the non-grid values for a high N. Note that the computation time is given in seconds. We see that without the operator to find the position of the non-grid points in space, the serial computation time is very high.
In Table 4, we present the results using 256 threads per block with K = K_y = K_z, t_0 = 0 and T = 1. As mentioned before, a better accuracy can be achieved by using a higher-step scheme. For K > 3, the errors decrease moderately, in line with Theorem 2. The higher the number of time layers N, the more work can be assigned to the GPU, and the speedup of the application is thus increased. The speedup also increases, at a smaller magnitude, for K > 3, since the number of space data points given by M stays the same. These effects are visualized in the plots of log_10(|y_{0,0} - y_0^0|), log_10(|z_{0,0} - z_0^0|) and of the speedup with respect to N in Fig. 2.

Example 2 Consider the nonlinear BSDE [32]

-dy_t = 1/2 (exp(t²) - 4t y_t - 3 exp(t² - y_t exp(-t²)) + z_t² exp(-t²)) dt - z_t dW_t,
y_T = ln(sin(W_T) + 3) exp(T²).

The analytic solution is y_t = exp(t²) ln(sin(W_t) + 3), z_t = exp(t²) cos(W_t)/(sin(W_t) + 3); the exact solution with T = 1 is (y_0, z_0) = (ln(3), 1/3). The results using 256 threads per block with K = K_y = K_z, t_0 = 0 and T = 1 are presented in Table 5. We draw the same conclusions as in the previous example, except that the accuracy is decreased. This is due to the convergence order given in Theorem 3, which reduces to at most 3, since the driver function depends on the z process. Furthermore, we obtain a higher speedup compared with the previous example due to the more complicated driver function (i.e., more data are accessed, the special function units are used more, etc.). The speedup is 35×. We display the plots of log_10(|y_{0,0} - y_0^0|), log_10(|z_{0,0} - z_0^0|) and of the speedup with respect to N for K = 1, ..., 6 in Fig. 3.

To optimize the application, we have used the following iterative approach:
1. Apply a profiler to the application to gather information.
2. Identify application hotspots.
3. Determine performance inhibitors.
4. Optimize the code.
5. Repeat the previous steps until the desired performance is achieved.
We considered the case with N = 1024 and K_y = K_z = 6. To make the optimization process clearer, we show the detailed information after each optimization iteration as follows.
• In the first iteration of the optimization process, we gathered the application information using nvprof (the NVIDIA command-line profiler). The results are presented in Table 6(a). The application hotspot is the nrm2_kernel kernel, which calculates the Euclidean norm in the BiCGSTAB algorithm and is already optimized. Therefore, to overcome this bottleneck, we used the dot product kernel dot_kernel instead. The computation time is reduced from 8.04 s to 0.86 s, and the new speedup after the first iteration becomes 57× instead of 35×.
• In the second iteration, the next bottleneck of the application is the kernel that calculates the non-grid values for the processes y and z (sp_inter_non_grid_d_no_for) after each backward time layer. The performance of this kernel is limited by the latency of arithmetic and memory operations. Therefore, we applied loop interchange and loop unrolling techniques. This reduced the computation time of the corresponding kernel and of other kernels related to it, as shown in Table 6(b): the computation time of sp_inter_non_grid_d_no_for decreased from 2.48 s to 1.16 s. As a side effect, the computation time of calc_f_and_c_exp (the kernel in the third item of Sect. 4) decreased from 2.28 s to 1.46 s, because we needed to change the way the non-grid points are stored and accessed, and that of calc_c_exp_d (which calculates the conditional expectation) from 0.22 s to 0.04 s. The new speedup is 69×. It can be observed from Table 6(c) that the application bottleneck is again the same kernel; therefore, it is not worth optimizing the application further.
• Finally, we decreased the block dimension from 256 threads to 128 in order to increase parallelism. The final speedup is 70×.
For constant parameters (i.e., r_t = r, μ_t = μ, σ_t = σ, δ_t = δ), the analytic solution for a call option is

y_t = S_t e^{-δ(T-t)} N(d_1) - K e^{-r(T-t)} N(d_2), d_{1,2} = [ln(S_t/K) + (r - δ ± σ²/2)(T - t)] / (σ√(T - t)),

where N(·) is the cumulative standard normal distribution function. In this example, we consider T = 0.33, K = S_0 = 100, r = 0.03, μ = 0.05, δ = 0.04 and σ = 0.2 (taken from [32]) with the solution (y_0, z_0) ≐ (4.3671, 10.0950). Note that the terminal condition is non-smooth, which is problematic for the z process: for discrete points near the strike price K (the at-the-money region), the initial value for the z process causes large errors on the next time layers. To overcome this non-smoothness problem, we smooth the initial conditions, cf. the approach of Kreiss [15]. For the forward part of Example 3, we used the analytic solution of the geometric Brownian motion. In order to ensure a uniform stock price domain, we switch to the log stock price domain X_t = ln(S_t); in Table 7 we show the importance of this transformation. The results using 256 threads per block with K = K_y = K_z, t_0 = 0 and T = 0.33 are presented in Table 8. As in the previous BSDE examples, the highest accuracy is achieved for the maximal considered number of steps K and number of time layers N, namely a 6-step scheme with N = 256, where we also obtain the highest speedup of 12×. The corresponding plots are shown in Fig. 4. Compared to the parallelized multistep scheme [32] in [13], the numerical results are more accurate, since we can consider K > 3. Moreover, we optimized the kernels created for the Black-Scholes BSDE for N = 256 and a high K, namely K = K_y = K_z = 6. The optimization iteration process is the same as in Example 2; the final speedup is 31×. Note that this speedup is for 256 time layers, whereas in Example 2 we optimized for 1024 time layers. The high accuracy of O(10^{-12}) with a high value of K can thus be achieved in a more modest computation time than 6.47 seconds.
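The reference values (y_0, z_0) ≐ (4.3671, 10.0950) can be reproduced from the closed-form solution with continuous dividend yield δ, using z_0 = σ S_0 ∂u/∂S for the hedging component; a sketch (Python, our function names, not the paper's code):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_y_z(S, K, r, delta, sigma, T):
    """Black-Scholes call price y_0 and z_0 = sigma * S * dC/dS with a
    continuous dividend yield delta (the parameter set of Example 3)."""
    d1 = (math.log(S / K) + (r - delta + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    y0 = S * math.exp(-delta * T) * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)
    z0 = sigma * S * math.exp(-delta * T) * norm_cdf(d1)
    return y0, z0
```

With S = K = 100, r = 0.03, δ = 0.04, σ = 0.2 and T = 0.33, this returns values close to (4.3671, 10.0950), matching the reference solution quoted above.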
Obviously, it is more interesting to achieve the parallelization of higher dimensional BSDEs on the GPUs.
Due to the GPU memory limitation (8 GB), we optimized the case with N = 64 and K_y = K_z = 6. In the 1-dimensional case, we used different kernels for the calculation of the non-grid points, the conditional expectations and the values of the processes y and z. Here we merged these kernels due to the memory constraint and named the result the main_body kernel. In the first iteration, we gathered the application information using nvprof; the results are presented in Table 10. The application hotspot is the main_body kernel, whose performance is limited by the memory operations, because the access to the bicubic spline coefficients is not optimal. However, we cannot optimize this part: requesting aligned and coalesced memory operations would shift the spline coefficients, and the threads would access the wrong coefficients. Moreover, we cannot increase the block dimension because there are not enough resources. Therefore, the final speedup is ca. 59×. Finally, we consider the zero strike European spread option, which is a 2-dimensional problem.
Example 5 Consider the zero strike European spread option BSDE; its analytic solution is given by Margrabe's formula. In this example, we consider T = 0.1, S_0^1 = S_0^2 = 100, r = 0.05, μ_1 = μ_2 = 0.1, σ_1 = 0.25, σ_2 = 0.3 and ρ = 0.0 with the solution (y_0, (z_0^1, z_0^2)) ≐ (15.48076, (14.3510, -12.6779)). The results are presented in Table 11 for the same parameters as in Example 4. The highest speedup is ca. 64×, which again requires a large memory of ca. 3 GB. The plots of log_10(|y_{0,0} - y_0^0|), log_10(|z_{0,0} - z_0^0|) and of the speedup with respect to N for K = 1, ..., 6 are shown in Fig. 6. Note that the speedup is higher than in Example 4, since we additionally have the forward SDE and more work is conducted by the GPU threads; however, we could not optimize further, as the main constraint is the GPU memory. In Table 12 we present the performance of the main kernels. Our results show that the multistep scheme [27] with GPU computing also performs very well in the 2-dimensional case. Increasing the dimension becomes difficult on a single GPU due to the memory constraint. However, this approach can be applied to a cluster of GPUs, where the possibility of achieving even higher speedups is enormous. This highlights that implementing the multistep scheme [27] on a GPU cluster can offer highly accurate results at low computation time even for high-dimensional problems.

Conclusions
In this work we parallelized the multistep method developed in [27] for solving BSDEs on the GPU. Firstly, we analyzed the algorithm and presented approaches for reducing the computation time; the most important of these was the optimal operation to find the location of the interpolated values, which was essential for the reduction of the computational time. For a further acceleration, we investigated how to optimize the application by finding the performance bottlenecks and applying optimization techniques. Our numerical results have shown that the multistep scheme is well suited to massively parallel GPU computing and very efficient for real-time applications such as option pricing and the corresponding risk management. Our parallelization strategy should work for other multistep schemes as well, making those schemes more useful in practice. Using sparse grids and cluster GPU computing to solve higher-dimensional problems is the subject of our ongoing work.

Appendix: Detailed derivation of reference equations
We derive the reference equations for y and z in Sect. 2.2. To obtain the reference equation for y, we need to ensure the adaptedness of the solution; therefore, we take the conditional expectation

Figure 1 Time-space domain

Figure 2 Plots of the results for Example 1

Figure 3 Plots of the results for Example 2

Figure 4 Plots of results for Example 3

Figure 5 Plots of the results for Example 4

Figure 6 Plots of the results for Example 5

Table 3
Preliminary results with N = 256, K y = K z = 3 for Example 1

Table 4
The results for Example 1

Table 5
The results for Example 2

Table 6
Results of iterative optimization process for Example 2

Table 7
Preliminary results with N = 256, K y = K z = 3 for Example 3

Table 8
The results for Example 3

Table 9
The results for Example 4

Table 10
Performance of the main kernels for Example 4

Table 11
The results for Example 5

Table 12
Performance of the main kernels for Example 5

Taking the conditional expectation E_{t_n}^x[·] in (6) and using the martingale property of the Itô integral, we obtain

y_{t_n} = E_{t_n}^x[y_{t_{n+k}}] + ∫_{t_n}^{t_{n+k}} E_{t_n}^x[f(s, y_s, z_s)] ds. (11)

To approximate the integral in (11), Teng et al. [27] used the cubic spline polynomial to approximate the integrand. Based on the support points (t_{n+j}, E_{t_n}^x[f(t_{n+j}, y_{t_{n+j}}, z_{t_{n+j}})]), j = 0, ..., K_y, we have

∫_{t_n}^{t_{n+k}} E_{t_n}^x[f(s, y_s, z_s)] ds = ∫_{t_n}^{t_{n+k}} S_{K_y}^{t_n,x}(s) ds + R_y^n,