In this section, we establish the notation for two of the most common typesFootnote 2 of artificial neural networks: the multilayer perceptron (MLP) and the long short-term memory network (LSTM). It should be highlighted that the notation, the mathematical formalization and also the implementation vary slightly throughout the literature, see for example [10, 20, 21]. We have chosen an approach here that is compatible with keras and tensorflow, see [1, 6], as those frameworks are very common.
Multilayer perceptron (MLP)
The most common and most basic form of neural networks are multilayer perceptrons.
Definition 1.1
(Multilayer perceptron)
A multilayer perceptron MLP is a tuple \(\operatorname{MLP}=(A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) defined by
- a number \(n_{i}\) of inputs,
- a number \(n_{o}\) of outputs,
- a number \(n_{L}\) of layers and
- for each layer \(1 \leq l \leq n_{L}\)
  - a number \(n_{l}\) of neurons (or units),
  - a matrix \(A_{l} = (A_{l;ij}) \in \operatorname{\mathbb{R}}^{n_{l-1} \times n_{l}}\) and a vector \(b_{l} = (b_{l;i}) \in \operatorname{\mathbb{R}}^{n_{l}}\) (called bias) of weights such that \(n_{0} = n_{i}\), \(n_{n_{L}}=n_{o}\) and
  - an activation function \(\sigma _{l}:\operatorname{\mathbb{R}}\to \operatorname{\mathbb{R}}\).
For any \(1 \leq l \leq n_{L}\), the tuple \((A_{l}, b_{l}, \sigma _{l})\) is called a layer. For \(l=n_{L}\), the layer is called the output layer and for \(1 \leq l< n_{L}\), the layer is called a hidden layer.
Neural networks can be visualized as in Fig. 1: This shows a network \((A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) with a total of \(n_{L}=4\) layers, i.e. 3 layers are hidden. Notice that the input layer is just a visualization of the input and is not part of the actual network topology.
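To make Definition 1.1 concrete, the following is a minimal sketch of such a network in keras. The input and output dimensions and the widths of the three hidden layers are purely illustrative choices, and the activations anticipate the choice discussed below (sigmoid for the hidden layers, linear for the output layer); note that the kernel of a Dense layer has exactly the shape \(n_{l-1} \times n_{l}\) of the matrix \(A_{l}\).

```python
# Minimal sketch of an MLP in the sense of Definition 1.1 using keras.
# The sizes n_i, n_o and the hidden widths are illustrative choices.
import tensorflow as tf

n_i, n_o = 5, 1                 # number of inputs and outputs (illustrative)
hidden_units = [16, 16, 16]     # n_1, n_2, n_3 of the three hidden layers

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(n_i,)))   # the input layer is not part of the network topology
for n_l in hidden_units:
    # hidden layer l: kernel A_l of shape (n_{l-1}, n_l), bias b_l of shape (n_l,)
    model.add(tf.keras.layers.Dense(n_l, activation="sigmoid"))
model.add(tf.keras.layers.Dense(n_o, activation="linear"))  # output layer l = n_L
model.summary()
```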
Computing the output from the input is codified in the feed forward.
Definition 1.2
(Feed forward)
Let \(\operatorname{MLP}=(A_{l}, b_{l}, \sigma _{l})_{1 \leq l \leq n_{L}}\) be a multilayer perceptron. Then for each \(1 \leq l \leq n_{L}\), we define a function
$$\begin{aligned} F_{l}:\operatorname{\mathbb{R}}^{n_{l-1}} \to \operatorname{\mathbb{R}}^{n_{l}}, \qquad v \mapsto \sigma _{l}\bigl(v^{T} A_{l} + b_{l}\bigr), \end{aligned}$$
where we employ the convention that \(\sigma _{l}\) is applied in every component. The composition
$$\begin{aligned} F:\operatorname{\mathbb{R}}^{n_{i}} \to \operatorname{\mathbb{R}}^{n_{o}}, \qquad F := F_{n_{L}} \circ \cdots \circ F_{2} \circ F_{1} \end{aligned}$$
is called the feed forward of the MLP. Any input \(x \in \operatorname{\mathbb{R}}^{n_{i}}\) is called an input layer.
The links in Fig. 1 between the nodes visualize the feedforward and the dependence of each output on each input.
We assume here that the activation functions are chosen as the standard sigmoid, \(\sigma _{l}(x) := (1 + e^{-x})^{-1}\), for all but the last layer,Footnote 3 where we choose the linear activation \(\sigma _{n_{L}}(x) = x\).
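For illustration, a plain NumPy sketch of this feed forward, with the parameters given as lists of matrices \(A_{l}\) of shape \(n_{l-1} \times n_{l}\) and vectors \(b_{l}\) of length \(n_{l}\), could look as follows.

```python
# NumPy sketch of the feed forward of Definition 1.2 with sigmoid hidden
# activations and a linear output activation, for a single input vector x.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, A, b):
    """Evaluate F(x) = (F_{n_L} o ... o F_1)(x) for A = [A_1, ..., A_{n_L}], b = [b_1, ..., b_{n_L}]."""
    n_L = len(A)
    v = x
    for l in range(n_L):
        z = v @ A[l] + b[l]                      # v^T A_l + b_l
        v = z if l == n_L - 1 else sigmoid(z)    # sigma_l applied componentwise
    return v
```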
The feed forward \(F=F_{\Theta}\) depends on all the parameters \(\Theta =(A_{l}, b_{l})_{1 \leq l \leq n_{L}}\). In general, the function \(F_{\Theta}\) will be unrelated to the problem at hand if the weights Θ are not chosen carefully. Choosing suitable weights is done by training the neural network on a training set \((x_{i}, y_{i})_{1 \leq i \leq N}\). More precisely, for a given cost function, say least squares, this amounts to solving the optimization problem
$$\begin{aligned} \Theta ^{*} := \operatorname*{\operatorname{argmin}}_{\Theta}{\sum _{i=1}^{N}{ \bigl\Vert F_{\Theta}(x_{i}) - y_{i} \bigr\Vert ^{2}}}. \end{aligned}$$
In practice, this optimization is performed by stochastic gradient descent methods such as Adam, see [18], and a clever computation of the gradient, called backpropagation, see [23].
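In a keras setup, this training step amounts to compiling and fitting the model. The following sketch reuses the model and the dimensions \(n_{i}\), \(n_{o}\) from the MLP snippet above and replaces the training set by random placeholder data; note that the built-in mean squared error loss is the mean rather than the sum of squared errors, which rescales the objective but does not change its minimizer.

```python
# Sketch of the training described above: least-squares cost minimized with Adam.
# x_train, y_train are random placeholders standing in for a real training set.
import numpy as np

N = 1000
x_train = np.random.rand(N, n_i)
y_train = np.random.rand(N, n_o)

model.compile(optimizer="adam", loss="mse")             # Adam [18] with squared-error cost
model.fit(x_train, y_train, epochs=10, batch_size=32)   # gradients computed via backpropagation [23]
```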
Long Short-Term Memory Network (LSTM)
While MLPs work very well in many situations, there are applications for which this network topology is not ideal. MLPs are built to make one prediction y from one input x. Some applications, however, have a canonical time structure, and the task is to make time-dependent predictions \(y_{t}\) from time-dependent inputs \(x_{t}\). In principle, this case can of course be covered by MLPs as well, for example by adding a time grid to the inputs and the outputs, i.e. by learning \((y_{t_{1}}, \ldots , y_{t_{n}})\) from \((x_{t_{1}}, \ldots , {x_{t_{n}}}, t_{1}, \ldots , t_{n})\). This has the advantage that it is straightforward to implement, but the disadvantage that the network potentially has to be quite large. Another option is to train a separate network for each point in time \(t_{i}\). This has the disadvantage that there is no flow of information between the networks of different points in time \(t_{i}\), so the predictions of this sequence of networks might suffer from inconsistencies. A key application that has driven the research in this area is Natural Language Processing (NLP), where language is seen as a sequence of words.
A well-known way out of these technical problems is given by long short-term memory networks, as suggested in [14]. The idea, depicted in Fig. 2, is as follows: Only one neural network, i.e. one with a single fixed set of weights, is trained, but in addition to the input, the network also processes the cell state. This additional piece of information is transmitted through the network and serves as a memory of previous predictions.
Formally, an LSTM can be defined by first defining a single LSTM layer, LSTML, together with its associated feedforward, and then stacking several of these into an LSTM.
Definition 1.3
(LSTML)
A long short-term memory network layer is a tuple \(\operatorname{LSTML}=\operatorname{LSTML}(W, U, b, \tau , \sigma )\) consisting of
- a number m of units and a number k of features,
- a 4-tuple W of matrices \(W_{i}, W_{f}, W_{c}, W_{o} \in \operatorname{\mathbb{R}}^{k \times m}\) called input, forget, cell and output kernels,
- a 4-tuple U of matrices \(U_{i}, U_{f}, U_{c}, U_{o} \in \operatorname{\mathbb{R}}^{m \times m}\) called input, forget, cell and output recurrent kernels,
- a 4-tuple b of vectors \(b_{i}, b_{f}, b_{c}, b_{o} \in \operatorname{\mathbb{R}}^{m}\) called input, forget, cell and output bias,
- two functions \(\sigma , \tau :\operatorname{\mathbb{R}}\to \operatorname{\mathbb{R}}\) called activation and recurrent activation.
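In keras, these parameters correspond to the weights of a single LSTM layer: the kernels, recurrent kernels and biases are stored concatenated along the last axis, in the order input, forget, cell, output. The following sketch, with illustrative values of k and m, shows how to recover them.

```python
# Sketch: how the tuples W, U, b of Definition 1.3 appear in a keras LSTM layer.
import tensorflow as tf

k, m = 3, 8                                  # number of features and units (illustrative)
inputs = tf.keras.Input(shape=(None, k))     # (time, features); batch dimension implicit
layer = tf.keras.layers.LSTM(m)
_ = layer(inputs)                            # build the layer so the weights exist

kernel, recurrent_kernel, bias = layer.get_weights()
print(kernel.shape)            # (k, 4m): [W_i | W_f | W_c | W_o]
print(recurrent_kernel.shape)  # (m, 4m): [U_i | U_f | U_c | U_o]
print(bias.shape)              # (4m,):   [b_i | b_f | b_c | b_o]
W_i, U_i, b_i = kernel[:, :m], recurrent_kernel[:, :m], bias[:m]  # e.g. the input parts
```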
This definition needs to be understood in the context of the associated feedforward. The feedforward of an LSTML is more complex than for MLPs. Because of the time-dependence, it needs to keep track not only of the final output, but of all the outputs at the various points in time and in addition it needs to keep track of the cell state.
Definition 1.4
(Feedforward)
Let \(\operatorname{LSTML}=\operatorname{LSTML}(W,U,b,\tau ,\sigma )\) be as above and let \(T \in \mathbb{N}\) be a natural number. Any sequence \(x = (x_{1}, \ldots , x_{T})\), \(x_{t} \in \operatorname{\mathbb{R}}^{k}\), is called an input sequence. The two sequences \(c_{t}\) and \(y_{t}\), \(t=1, \ldots , T\), called cell state and carry state, are recursively defined as follows:Footnote 4
- input: \(i_{t} := \tau (x_{t} \bullet W_{i} + y_{t-1} \bullet U_{i} + b_{i}) \in \operatorname{\mathbb{R}}^{m}\),
- forget: \(f_{t} := \tau (x_{t} \bullet W_{f} + y_{t-1} \bullet U_{f} + b_{f}) \in \operatorname{\mathbb{R}}^{m}\),
- candidate: \(\tilde{c}_{t} := \sigma (x_{t} \bullet W_{c} + y_{t-1} \bullet U_{c} + b_{c}) \in \operatorname{\mathbb{R}}^{m}\),
- cell: \(c_{t} := f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \in \operatorname{\mathbb{R}}^{m}\),
- output: \(o_{t} := \tau (x_{t} \bullet W_{o} + y_{t-1} \bullet U_{o} + b_{o}) \in \operatorname{\mathbb{R}}^{m}\),
- carry: \(y_{t} := o_{t} \odot \sigma (c_{t}) \in \operatorname{\mathbb{R}}^{m}\).
Finally, the function
$$\begin{aligned} F_{T}: &\operatorname{\mathbb{R}}^{k \times T} \to \operatorname{\mathbb{R}}^{m \times T}, \\ x=&(x_{1}, \ldots , x_{T}) \mapsto (y_{1}, \ldots , y_{T}) \end{aligned}$$
is called the feedforward of the LSTML of length T.
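The recursion of Definition 1.4 can be sketched in plain NumPy as follows. As a concrete, keras-like choice we take τ to be the sigmoid and σ to be tanh, and we initialise \(c_{0} = y_{0} = 0\); both choices are assumptions made for illustration (cf. Footnote 4).

```python
# NumPy sketch of the LSTML feedforward of Definition 1.4 for one input sequence.
# W, U, b are dicts with keys "i", "f", "c", "o" holding the kernels, recurrent
# kernels and biases; tau and sigma are the recurrent activation and activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstml_feed_forward(x, W, U, b, tau=sigmoid, sigma=np.tanh):
    """x has shape (T, k); returns (y_1, ..., y_T) of shape (T, m)."""
    m = b["i"].shape[0]
    c = np.zeros(m)   # cell state c_0 (assumed zero)
    y = np.zeros(m)   # carry state y_0 (assumed zero)
    ys = []
    for x_t in x:
        i = tau(x_t @ W["i"] + y @ U["i"] + b["i"])          # input
        f = tau(x_t @ W["f"] + y @ U["f"] + b["f"])          # forget
        c_tilde = sigma(x_t @ W["c"] + y @ U["c"] + b["c"])  # candidate
        c = f * c + i * c_tilde                              # cell
        o = tau(x_t @ W["o"] + y @ U["o"] + b["o"])          # output
        y = o * sigma(c)                                     # carry
        ys.append(y)
    return np.stack(ys)
```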
The interpretation of this is as follows (see Fig. 3): The input value \(i_{t}\) and the forget value \(f_{t}\) represent how much weight the network puts on the current input \(x_{t}\) and how much weight it puts on forgetting the past memory. A candidate cell state \(\tilde{c}_{t}\) is computed on the basis of only the current input \(x_{t}\) and the last prediction \(y_{t-1}\). The new cell state \(c_{t}\) is then computed as a weighted average of the candidate cell state \(\tilde{c}_{t}\) and the previous cell state \(c_{t-1}\), where the weights are given by the input weight \(i_{t}\) and the forget weight \(f_{t}\). Finally, the carry \(y_{t}\), i.e. the intermediate output at time t, is computed by first computing a candidate output \(o_{t}\), which is also based only on the current input \(x_{t}\) and the last prediction \(y_{t-1}\), and then weighing \(o_{t}\) with the new cell state \(c_{t}\). It should be noted that, depending on the application, one might consider either the value \(y_{T}\) or the whole vector \((y_{1}, \ldots , y_{T})\) as the feedforward of the network.
Practical applications rarely consist of a single LSTML, for various reasons. First, as we can see in Definition 1.4, each step of the computation, i.e. the input, forget, output and candidate cell state, is equivalent to the feedforward of an MLP with just a single layer. That means that a single LSTML cannot capture arbitrary non-linearities with these computations. Also, notice that if a network consists of just one LSTML, then the number k of features is forced to be \(k=n_{i}\), i.e. the number of inputs, and the number m of units is forced to be \(m=n_{o}\), i.e. the number of outputs. That means that no parameters can be changed to adapt the network to the problem. The solution to both of these problems is to chain multiple LSTMLs in sequence, where the number m of units of each layer can be chosen at will, followed by a single MLP layer to ensure that the final output has the required dimension \(n_{o}\).
Definition 1.5
(LSTM)
A long-term-short-term memory network (LSTM) with L layers is defined by a sequence \(\operatorname{LSTM}= (\operatorname{LSTML}_{1}, \ldots , \operatorname{LSTML}_{L-1}, \operatorname{MLP})\) of \(L-1\) LSTMLs with number of units \(m_{l}\) and number of features \(k_{l}\) such that \(k_{1}=n_{i}\), and a single MLP with input dimension \(k_{L-1}\) and output dimension \(n_{o}\).