Background-foreground segmentation for interior sensing in automotive industry

Drygala, Claudia; Rottmann, Matthias; Gottschalk, Hanno; Friedrichs, Klaus; Kurbiel, Thomas

doi:10.1186/s13362-022-00128-9

Research
Open access
Published: 30 December 2022


Claudia Drygala ORCID: orcid.org/0000-0002-2248-3475¹,
Matthias Rottmann¹,
Hanno Gottschalk¹,
Klaus Friedrichs² &
…
Thomas Kurbiel²

Journal of Mathematics in Industry volume 12, Article number: 13 (2022)


Abstract

To ensure safety in automated driving, the correct perception of the situation inside the car is as important as that of its environment. Thus, seat occupancy detection and classification of detected instances play an important role in interior sensing. Knowing the seat occupancy status makes it possible to, e.g., automate the airbag deployment control. Furthermore, the presence of a driver, which is necessary for partially automated driving cars at automation levels two to four, can be verified. In this work, we compare different statistical methods from the field of image segmentation to approach the problem of background-foreground segmentation in camera-based interior sensing. In recent years, several methods based on different techniques have been developed and applied to images or videos from different applications. The peculiarity of the given interior sensing scenarios is that the foreground instances and the background both contain static as well as dynamic elements. In the data considered in this work, even the camera position is not completely fixed. We review and benchmark three different methods, namely Gaussian Mixture Models (GMM), Morphological Snakes and a deep neural network, a Mask R-CNN. In particular, the limitations of the classical methods, GMM and Morphological Snakes, for interior sensing are shown. Furthermore, it turns out that it is possible to overcome these limitations by deep learning, e.g. using a Mask R-CNN. Although only a small amount of ground truth data was available for training, we enabled the Mask R-CNN to produce high quality background-foreground masks via transfer learning. Moreover, we demonstrate that certain augmentation as well as pre- and post-processing methods further enhance the performance of the investigated methods.

1 Introduction

Interior sensing is of high importance for automated driving. For instance, interior sensing aims at seat occupancy detection and classification [1, 2]. The classes may range from “Person”, “Child seat” and “Animal” to “Everyday object”. This knowledge about the seat occupancy can be used, e.g., for smart airbag deployment control systems [3]. While the activation of the airbag could save a person’s life in case of an accident, it could lead to serious injuries [4, 5] or even to death [6–8] for a child sitting in a rear-facing child seat on the passenger seat. In the case of partially autonomous vehicles at levels two to four (defined by [9]), a driver has to be present in the car. Thus it is necessary to verify whether a person is present on the driver seat [2]. Lastly, the back seats of the car are of as much interest as the front seats. Thinking of the so-called “Forgotten Baby Syndrome” [7, 10], the system could raise an alarm if a forgotten child were detected on a back seat.

In this work, we suggest seat occupancy detection by background-foreground segmentation methods. Only the extracted foreground instances, belonging to the classes “Person”, “Child Seat” or “Object”, should be considered for the classification. The motivation behind this approach is to realize the classification task independently of the car’s interior features and thus achieve better generalization.

Background-foreground segmentation [11–13], also known as background-foreground detection [14–16] or background subtraction [16–20], is an intensively studied field in computer vision. In recent years, several methods have been developed addressing various scenarios. These methods are based on completely different techniques. As described extensively in the survey [16], the approaches range from classical statistics-based methods to modern methods which incorporate deep convolutional neural networks [21].

The goal of this work is to introduce a dataset for the training of the background-foreground segmentation task and benchmark three methods in the given setting. The benchmark is performed on the quality of the generated background-foreground masks (see Fig. 1). Note that the minimization of the computational costs is not of interest for this work. The three methods we selected are based on different techniques:

1. Gaussian Mixture Model: A classical statistical method for background subtraction.
2. Morphological Snakes: A classical approach to object detection based on active contours.
3. Mask R-CNN: A modern method solving the instance segmentation task by using deep neural networks.

In this work, we investigate the limitations of each approach. To ensure the comparability of the methods, all of them have been tested on a challenging test set of 100 real-world images. The test set is part of the dataset introduced by this work, named the ISSO dataset (Interior Sensing and Seat Occupancy). The dataset consists of 1300 annotated real-world images extracted from videos recorded by employees of the company APTIV in Wuppertal, Germany, which are split into a training set of 1100 images and a validation and a test set of 100 images each. The images of the ISSO dataset show scenarios of the interior of 13 different cars, as shown in Fig. 1. Scenarios of interior sensing are highly complex, since the foreground instances and the background can both be dynamic as well as static. In this work, even the camera position varies slightly from car to car. Additionally, the impact of environmental effects has to be taken into account, like different weather conditions, shadows, traffic lights and vibrations.

Although only a rather small number of 1100 annotated real-world images is available for training, we demonstrate that, with the help of transfer learning, it is possible to generate background-foreground masks of high quality with a Mask R-CNN. Furthermore, we investigate to what extent the performance of the methods can be improved by certain pre- and post-processing methods for the data as well as by applying data augmentation techniques during the training of the neural network. In particular we study the effect of

the conversion to different color spaces (RGB, HSV, CIE L*a*b*),
contrast enhancement methods (Histogram Equalization, CLAHE),
morphological operators (Closing, Opening) and
data augmentation before and during the training of neural networks.

The paper is organized as follows: The theoretical background of the methods considered is briefly explained in Sect. 2. This is followed by Sect. 3, where the description of the data set designed, registered and annotated for this work is given. In Sect. 4 the metrics by which the methods are evaluated are introduced. The results of the experiments are presented and discussed in Sect. 5. In particular, the three methods are compared and the limitations of each method are discussed. Finally, we provide our conclusion and an outlook in Sect. 6.

2 A choice of methods for background-foreground segmentation

2.1 Gaussian mixture model (GMM)

The GMM, introduced for foreground-background segmentation in [22], is based on the principle of background subtraction. As described in Sect. A, the values of a pixel are defined by a certain color space. For example, the pixel values of a gray scale image are given by single scalars, whereas the pixel values of a color image are given by a vector whose dimension equals the number of channels. In the GMM framework, the values of each pixel are modeled by a mixture of adaptive Gaussian distributions. The underlying data for the computation of those mixture models is given by a so-called pixel process $\{X_{1}, \ldots, X_{t}\}$, which is a time series of pixel values. Thus, at any time t, the history for each pixel at position $(w_{0},h_{0})$ is known:

$$ \{X_{1}, \ldots, X_{t}\} = \bigl\{ I\bigl((w_{0}, h_{0}),i\bigr) | 1 \leq i \leq t\bigr\} $$

(1)

with I being the image sequence. Now, the probability of observing a certain pixel value at time t is defined as

$$ p(X_{t}) = \sum_{m=1}^{K} \hat{w}_{m,t} \mathcal{N}(X_{t}; \hat{\mu}_{m,t}, \hat{\Sigma}_{m,t}) $$

(2)

with

$\mathcal{N}$: The multivariate Gaussian distribution.
K: Number of Gaussian distributions in a mixture.
$\hat{\mu}_{m,t}$: The estimate of the mean value of the m-th Gaussian at time t.
$\hat{\Sigma}_{m,t}$: The estimate of the covariance matrix of the m-th Gaussian at time t defined as
$$ \hat{\Sigma}_{m,t}=\hat{\sigma}^{2}_{m,t}\mathbf{I} $$
with $\hat{\sigma}^{2}_{m,t}$ the estimate of the variance of the m-th Gaussian at time t and I the identity matrix of appropriate dimension.
$\hat{w}_{m,t}$: The estimated weight of the m-th Gaussian at time t. Moreover, the weights fulfill the properties of non-negativity and normalization, i.e., $\hat{w}_{m,t} \geq 0$ and $\sum_{m=1}^{K} \hat{w}_{m,t} = 1$.

If a new frame of the image sequence is considered at the current time t, a new pixel value enters the pixel process. With the update of the pixel process, the estimates of the Gaussian distributions also have to be updated. For the estimation of the parameters, the Maximum Likelihood estimator for the currently observed data is computed. A well-known approach for this computation is the Expectation Maximization (EM) algorithm [23]. However, here the application of the exact EM algorithm would be costly, since the values of each pixel are modeled by a mixture of Gaussians and thus the parameters are updated pixel-wise. Hence, the parameter update is realized by an online K-means approximation [22]. For each new pixel value it is checked whether the pixel is represented by one of the already existing K Gaussians. The Gaussians are checked one after another until a match is found. For example, a match is given if the new pixel value is within 2.5 standard deviations of a distribution.

Estimation of the model parameters

Whether there is a match or not, the weight parameters are iteratively updated via

$$\begin{aligned} \hat{w}_{m,t} &=(1-\alpha )\hat{w}_{m,t-1}+\alpha \mathbb{M}_{m,t}. \end{aligned}$$

(3)

Here, $\alpha \in [0, 1]$ is the learning rate which determines the influence of data from past points in time and the speed at which the model parameters are updated and

$$ \mathbb{M}_{m,t}= \textstyle\begin{cases} 1 & {\text{in case of a match}}, \\ 0 & {\text{else.}} \end{cases} $$

(4)

The distribution parameters $\hat{\mu}$ and $\hat{\sigma}^{2}$ are only updated for the distribution that matches the new pixel value $X_{t}$, otherwise no update is performed:

$$\begin{aligned} &\hat{\mu}_{m,t}= \textstyle\begin{cases} (1-\rho )\hat{\mu}_{m,t-1} +\rho X_{t} & \mathbb{M}_{m,t}=1, \\ \hat{\mu}_{m,t-1} & {\text{else}}. \end{cases}\displaystyle \end{aligned}$$

(5)

$$\begin{aligned} &\hat{\sigma}^{2}_{m,t}= \textstyle\begin{cases} (1-\rho )\hat{\sigma}^{2}_{m,t-1}+\rho \hat{\delta}_{m,t}^{T} \hat{\delta}_{m,t} & \mathbb{M}_{m,t}=1, \\ \hat{\sigma}^{2}_{m,t-1} & {\text{else}}. \end{cases}\displaystyle \end{aligned}$$

(6)

with $\rho =\alpha \mathcal{N}(X_{t}|\hat{\mu}_{m,t}, \hat{\sigma}^{2}_{m,t})$ and $\hat{\delta}_{m,t}=X_{t}-\hat{\mu}_{m,t}$. If no match is found at all, the distribution that assigns the lowest probability to the data is replaced by a new distribution with the initial parameters $\hat{w}_{\mathrm{new}}=\alpha $, $\hat{\mu}_{\mathrm{new}}=X_{t}$ and $\hat{\sigma}^{2}_{\mathrm{new}}=\sigma _{0}^{2}$, where $\sigma _{0}^{2}$ is an appropriate initial variance [20, 22].
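The update rules (3)–(6) and the replacement rule can be illustrated for a single gray scale pixel. The following is a minimal sketch, not the implementation used in this work: the function names, the dictionary layout, the default values of `alpha` and `var0`, the use of the previous estimates when evaluating ρ, and the final renormalization of the weights are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def update_pixel_mixture(components, x, alpha=0.05, match_factor=2.5, var0=900.0):
    """Update the mixture of one gray scale pixel with the new value x.

    components: list of dicts with keys "w", "mu", "var" (hypothetical layout).
    Returns the index of the matched Gaussian, or None if one was replaced.
    """
    # Match test: first Gaussian whose mean is within 2.5 standard deviations.
    matched = None
    for m, comp in enumerate(components):
        if abs(x - comp["mu"]) <= match_factor * math.sqrt(comp["var"]):
            matched = m
            break

    # Eq. (3): weight update for every Gaussian, match or not.
    for m, comp in enumerate(components):
        indicator = 1.0 if m == matched else 0.0
        comp["w"] = (1.0 - alpha) * comp["w"] + alpha * indicator

    if matched is not None:
        comp = components[matched]
        # Assumption: rho is evaluated with the estimates from time t-1.
        rho = alpha * gaussian_pdf(x, comp["mu"], comp["var"])
        comp["mu"] = (1.0 - rho) * comp["mu"] + rho * x          # Eq. (5)
        delta = x - comp["mu"]
        comp["var"] = (1.0 - rho) * comp["var"] + rho * delta * delta  # Eq. (6)
    else:
        # No match: replace the Gaussian that gives x the lowest density.
        worst = min(
            range(len(components)),
            key=lambda m: gaussian_pdf(x, components[m]["mu"], components[m]["var"]),
        )
        components[worst] = {"w": alpha, "mu": float(x), "var": var0}

    # Assumption: renormalize so that the weights sum to one again.
    total = sum(comp["w"] for comp in components)
    for comp in components:
        comp["w"] /= total
    return matched
```

Calling the function once per frame for every pixel yields the adaptive behaviour described above; for color images the scalar distance would have to be replaced by a vector norm.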

Estimation of the background model

Next, it has to be determined which of the computed Gaussian distributions model the background. In particular, the Gaussians with the highest weights and the lowest variances are of interest. Generally, it can be assumed that pixel values describing the background of a scenario are repeated and thus also their distributions. Hence, if a new pixel value enters a pixel process which describes the background, a match is highly probable. From the update rule (3), it can be observed that the weights increase in the case of a match. Moreover, the background consists mostly of static elements that produce less variance than dynamic ones. Therefore, to determine the distributions of the mixture model that describe the background best, the Gaussians are first sorted in descending order by the value $\hat{w}_{m}/\hat{\sigma}_{m}$ [22]. Hence, the distributions which most likely represent the background are at the top of the list. Then, the first B distributions are chosen to model the background

$$ B=\mathop{\operatorname{argmin}}_{b}\Biggl( \sum_{m=1}^{b} \hat{w}_{m} > \tau \Biggr) $$

(7)

with τ the percentage of the pixel process that should affect the background model. The pixel values that cannot be assigned to a distribution which belongs to the background model are grouped by a two-pass connected components algorithm [24].
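The selection rule (7) can be sketched in a few lines. This is an illustrative sketch only: it assumes that the Gaussians are ranked by the ratio of weight to standard deviation, as in [22], and the function name, the dictionary layout and the default value of `tau` are hypothetical.

```python
import math


def select_background(components, tau=0.7):
    """Return the first B Gaussians whose cumulative weight exceeds tau.

    components: list of dicts with keys "w" and "var" (hypothetical layout).
    """
    # Rank by weight divided by standard deviation, highest first.
    ranked = sorted(
        components, key=lambda c: c["w"] / math.sqrt(c["var"]), reverse=True
    )
    background = []
    cumulative = 0.0
    for comp in ranked:
        background.append(comp)
        cumulative += comp["w"]
        if cumulative > tau:  # smallest b with sum of weights > tau, Eq. (7)
            break
    return background
```

A larger τ admits more Gaussians into the background model, which helps with multi-modal backgrounds but also makes it easier for slowly moving foreground to be absorbed.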

Number of mixtures

In the GMM framework introduced above, the number of Gaussian distributions in a mixture is given by a constant value that is determined by the available memory and computational power. In this work, a modified version of the original GMM framework is used, where the number of Gaussians is also adaptive. In [20] the update rule of $\hat{w}$ is reformulated such that the weights may take negative values. This aims at omitting Gaussians whose weights indicate that they are not relevant for the background estimation. Hence, the distributions that do not describe the background with high certainty are directly excluded. We refer to [20] for a detailed derivation of the modified update rule.

2.2 Morphological snakes

Originally, the object detection method based on active contours (also called “snakes”) was presented in [25]. The idea behind this approach is to detect foreground instances of an image I by evolving an initial curve $C_{0}$ towards the instances’ boundaries. In particular, the evolution of this curve is achieved by minimizing the energy functional

$$\begin{aligned} \begin{aligned} E(C) ={}& \alpha \int _{0}^{1} \bigl\lVert C'(q) \bigr\rVert _{2} ^{2} \,dq + \beta \int _{0}^{1} \bigl\lVert C''(q) \bigr\rVert _{2} ^{2} \,dq \\ &{}- \lambda \int _{0}^{1} \bigl\lVert \nabla I\bigl(C(q)\bigr) \bigr\rVert _{2} \,dq \end{aligned} \end{aligned}$$

(8)

with $C(q):[0,1]\rightarrow \mathbb{R}^{2}$ a parameterized planar curve, which represents the contour, $I:[0,a]\times [0,b] \rightarrow \mathbb{R}^{+}$ the considered image, $a, b \in \mathbb{R}^{+}$ and $\alpha, \beta, \lambda \in \mathbb{R}^{+}$ constant parameters. By the design of the functional, the smoothness of the curve is controlled by the first two terms, while the third term attracts the curve towards the boundary of the object. Therein, the gradient of the image ∇I acts as an edge detector. Hence, the (local) minimum should be obtained at the object’s boundary.

To handle topological changes, such as splitting and merging, automatically, the original energy functional is modified in different ways. The “Geodesic Active Contours” (GAC) [26] and the “Active Contours Without Edges” (ACWE) [27] are based on the level-set method [28], which has been successfully applied to conduct curve evolution [26]. Here, the curve $C: [0,1] \times \mathbb{R}^{+} \rightarrow \mathbb{R}^{2}, (q,t) \mapsto C(q,t)$ parameterized over time $t\in \mathbb{R}^{+}$ is embedded as a level set of an arbitrary smooth embedding function $u:\mathbb{R}^{2} \times \mathbb{R}^{+} \rightarrow \mathbb{R}$, such that $C(q,t)=\{(x,y) | u((x,y);t)=0\}$ holds. Hence, the curve C is represented implicitly by u [29].

To obtain the level-set formulation, the evolution of C is defined by a partial differential equation (PDE) $C_{t}$ obtained by minimizing the respective energy functional $E(C)$ with the steepest descent method. Then, this curve evolution $C_{t}$ can be reformulated into the level-set equation $u_{t}=\frac{\partial u}{\partial t}$ for $t>0$ with the initial value $u_{0}=u((x,y);0)$, as shown in [26, 30]. For the GAC and ACWE methods, this approach results in the following level-set equations for $t>0$.

Geodesic active contours (GAC)

For this approach, the level-set equation is given by

$$ u_{t} = g(I) \tilde{\kappa} \lVert \nabla u \rVert _{2} + g(I)v \lVert \nabla u \rVert _{2} + \nabla g(I) \nabla u $$

(9)

with

$I:[0,a]\times [0,b] \rightarrow \mathbb{R}^{+}, a,b \in \mathbb{R}^{+}$ an image.
$g(I):[0,\infty ) \rightarrow \mathbb{R}^{+}$ a strictly decreasing function. By the values of g, the image regions of interest can be selected, e.g. the object boundaries in the case of image segmentation. In this work, g is defined as
$$ g(I)=\frac{1}{\sqrt{1+\alpha \lVert \nabla (G_{\sigma }\ast I) \rVert _{2}}} $$
(10)
with $G_{\sigma }\ast I $ the image smoothed by a Gaussian filter $G_{\sigma }$ (∗ being the convolution operator), σ the standard deviation of the filter and $\alpha >0$ a non-linear scaling parameter. On object boundaries, where the gradient magnitude is large, $g(I)$ takes smaller values than on homogeneous image areas.
$\tilde{\kappa}:= {\mathrm{div}} ( \frac{\nabla u}{\lVert \nabla u \rVert} )$ the Euclidean curvature of the embedding function u, proven in [30].
$v \in \mathbb{R}$ the balloon force parameter.

Active contours without edges (ACWE)

Herein, the level-set equation is given by

$$ u_{t}= \lVert \nabla u \rVert _{2} \bigl(\mu \tilde{\kappa} -v- \lambda _{1}(I-c_{1})^{2} + \lambda _{2}(I-c_{2})^{2} \bigr) $$

(11)

with

I, $\tilde{\kappa}$ and v analogous to above.
$c_{1}, c_{2}$ constants that depend on the curve $C: [0,1] \rightarrow \mathbb{R}^{2}$. In particular, $c_{1}$ represents the average of the pixel values of I inside C and $c_{2}$ the average of $I(x,y)$ outside C.
$\lambda _{1}, \lambda _{2} >0$ and $\mu \geq 0$ fixed weight parameters.
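To make the role of $c_{1}$, $c_{2}$ and the image attachment part of (11) concrete, the following sketch evaluates $\lambda _{2}(I-c_{2})^{2}-\lambda _{1}(I-c_{1})^{2}$ on a tiny gray scale image. It is only an illustration under simplifying assumptions: the image and the binary inside/outside mask are plain nested lists, both regions are assumed to be non-empty, and the function names are hypothetical.

```python
def region_means(image, inside):
    """Return (c1, c2): mean pixel value inside and outside the contour.

    image: 2D list of gray values; inside: 2D list of 0/1 of the same shape.
    Assumes that both regions contain at least one pixel.
    """
    in_vals, out_vals = [], []
    for img_row, mask_row in zip(image, inside):
        for value, flag in zip(img_row, mask_row):
            (in_vals if flag == 1 else out_vals).append(value)
    return sum(in_vals) / len(in_vals), sum(out_vals) / len(out_vals)


def attachment_term(image, inside, lam1=1.0, lam2=1.0):
    """Pixel-wise lam2 * (I - c2)^2 - lam1 * (I - c1)^2 from Eq. (11)."""
    c1, c2 = region_means(image, inside)
    return [
        [lam2 * (value - c2) ** 2 - lam1 * (value - c1) ** 2 for value in row]
        for row in image
    ]
```

A positive value means that the pixel resembles the inside mean more than the outside mean, so this term drives u upwards there; a negative value has the opposite effect.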

Both level-set equations are composed of a smoothing term, a balloon force term and an attraction force or image attachment term. In particular, the parts of the curve with a high curvature are smoothed by the smoothing term. The balloon force term helps to accelerate the curve evolution, especially in areas where the attraction force is too weak due to small gradient values in less informative areas. The contour C is inflated ($v>0$) or deflated ($v<0$) depending on the sign of the velocity $v \in \mathbb{R}$. If $v=0$, the balloon force is switched off.

The solutions of the level-set equations (9) and (11) are obtained by solving the time-dependent PDEs iteratively. However, numerical methods for solving PDEs are costly, challenging to implement and suffer from stability constraints. In [29] it is shown that it is possible to overcome these difficulties by approximating the PDE terms of which (9) and (11) are composed by binary morphological operators. These operators are formulated as sup-inf operators as given in (18). The implementation of such a sup-inf operator is much easier, and its computation is more stable and faster than the numerical solution of the PDEs. Hence, the PDEs (9) and (11) are approximated by the composition of mathematical morphological operators, whereby the implicit representation of C is maintained.

Generally, a morphological operator T satisfies the properties of monotonicity, translation invariance and contrast invariance [31]. Furthermore, T is defined uniquely by a structuring element B, which is a set of arbitrary but small size and shape with a predefined origin as described in Fig. 2. Usually, B is significantly smaller than the considered image I. Mathematically, B is a matrix of dimension $c \times d$, $c,d \geq 1$ consisting of zeros and ones. The role of B is to probe a given image pixel-wise, whereby the positioning of B at a pixel is given by its defined origin. According to the rule which is specified by a morphological operator, every pixel is evaluated by the comparison with the origin of B and its corresponding neighborhood, which is represented by the values equal to one in the matrix [32]. Hence, the structuring element can be interpreted as a kernel in the context of machine learning [31].

Morphological operators of interest for this work are the dilation, the erosion (see Fig. 3) and the curvature morphological operators, due to their properties regarding the infinitesimal behaviour.

Remark 2.1

(Notation)

Let $\mathcal{F} \subseteq C_{b}^{k}(\mathbb{R}^{n})$ be a set of bounded functions over $\mathbb{R}^{n}$ that are continuously differentiable up to order k. The function operator is denoted by $T:\mathcal{F} \rightarrow \mathcal{F}$, which is assumed to be well-defined on $C_{b}^{k}(\mathbb{R}^{n})$ [33].

Definition 2.2

(Dilation and Erosion [34])

Let $u \in \mathcal{F}$, B the structuring element and $h\geq 1, h\in \mathbb{R}$ a scaling parameter. The Dilation of u by hB, written as $D_{h}=D_{hB}$, is defined by

$$ D_{h}u(\mathbf{x}) = \sup_{\mathbf{y}\in hB} u(\mathbf{x}- \mathbf{y}). $$

(12)

Whereas, the Erosion of u by hB, written as $E_{h}=E_{hB}$, is given by

$$ E_{h}u(\mathbf{x}) = \inf_{\mathbf{y}\in -hB} u(\mathbf{x}- \mathbf{y}). $$

(13)
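For a binary image, the supremum and infimum in (12) and (13) reduce to a maximum and a minimum over the pixels covered by the structuring element. The sketch below illustrates this with a 3×3 square as structuring element; this choice, the skipping of samples outside the image and all function names are assumptions made for illustration, not the configuration of this work.

```python
# Offsets (row, column) of a 3x3 square structuring element with origin (0, 0).
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def _probe(u, offsets, reduce_fn):
    """Apply reduce_fn to the values u(x - y) for all offsets y."""
    rows, cols = len(u), len(u[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Samples that fall outside the image are skipped (assumption).
            samples = [
                u[r - dr][c - dc]
                for dr, dc in offsets
                if 0 <= r - dr < rows and 0 <= c - dc < cols
            ]
            out_row.append(reduce_fn(samples))
        out.append(out_row)
    return out


def dilate(u, offsets=SQUARE_3X3):
    """Eq. (12): supremum of u(x - y) over y in B."""
    return _probe(u, offsets, max)


def erode(u, offsets=SQUARE_3X3):
    """Eq. (13): infimum of u(x - y) over y in -B (reflected element)."""
    reflected = [(-dr, -dc) for dr, dc in offsets]
    return _probe(u, reflected, min)
```

Dilating a single foreground pixel grows it into a 3×3 block, and eroding that block shrinks it back to the single pixel, which matches the inflating and deflating behaviour used later for the balloon force.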

Infinitesimal behaviour of dilation and erosion

Let B be convex and bounded, and let the B-norm on $\mathbb{R}^{n}$ be defined by $\lVert \mathbf{x}\rVert _{B} = \sup_{\mathbf{y}\in B} (\mathbf{x} \cdot \mathbf{y})$ with ⋅ the Euclidean scalar product. Furthermore, the initial value of u at time $t=0$ is given by $u_{0}(\mathbf{x})=u(\mathbf{x};0)$ with $\mathbf{x}\in \mathbb{R}^{n}$. By defining $u:\mathbb{R}^{n} \times \mathbb{R}^{+} \rightarrow \mathbb{R}$ as the dilation of the initial value $u_{0}$ by tB, such that $u(\mathbf{x};t)=D_{t} u_{0}(\mathbf{x})$, it holds (see [33, 34]):

$$ \frac {\partial u}{\partial t} =\lVert \nabla u \rVert _{B}. $$

(14)

Analogously it holds for $u(\mathbf{x};t)=E_{t} u_{0}(\mathbf{x})$ (see [33, 34]):

$$ \frac {\partial u}{\partial t} =-\lVert \nabla u \rVert _{B}. $$

(15)

Here, it holds in particular that

$$ \lVert \nabla u \rVert _{B}=\lVert \nabla u \rVert _{2} $$

(16)

if the structuring element is defined as the unit ball

$$ B_{1}(0)=\bigl\{ \mathbf{x}\in \mathbb{R}^{n}: \lVert \mathbf{x}\rVert _{2} < 1 \bigr\} $$

(17)

on $\mathbb{R}^{n}$ [34, 35]. Hence, under certain conditions, the infinitesimal behaviour of the Dilation and the Erosion is equivalent to the PDE $\frac{\partial u}{\partial t}=\pm \lVert \nabla u \rVert _{2}$, which is a component of the level-set equations (9) and (11).

The sup-inf representation of morphological operators

The authors of [34] show that every morphological operator has a sup-inf representation and that also the dual inf-sup form exists.

Let $\mathcal{B}$ be a set of structuring elements and $T:\mathcal{F}\rightarrow \mathcal{F}$ an arbitrary morphological operator. Then T can be represented by the sup-inf operator

$$ SI_{h}:= T_{h} u(\mathbf{x})=\sup_{B\in \mathcal{B}} \inf_{ \mathbf{y}\in \mathbf{x}+hB} u(\mathbf{y}). $$

(18)

The dual operator of T is defined as $\tilde{T}(u)=-T(-u)$, which is also a morphological operator. Thus, the inf-sup representation of T is given by

$$ IS_{h}:= T_{h} u(\mathbf{x})=\inf_{B\in \tilde{\mathcal{B}}} \sup_{ \mathbf{y}\in \mathbf{x}+hB} u(\mathbf{y}) $$

(19)

with $\tilde{\mathcal{B}}$ the set of structuring elements of $\tilde{T}$.

Definition 2.3

(Curvature morphological operator [29])

Let the morphological operators $SI_{h}$ and $IS_{h}$ be given with the set of structuring elements $\mathcal{B}=\{[-1,1]_{\theta }\subset \mathbb{R}^{2}; \theta \in [0, \pi ) \}$, i.e., line segments of length two centered at the origin with orientation θ, and let h be sufficiently small. Then the composition

$$ SI_{\sqrt{h}} \circ IS_{\sqrt{h}} $$

(20)

is defined as the curvature morphological operator.

Infinitesimal behaviour of ${SI_{\sqrt{h}} \circ IS_{\sqrt{h}}}$

It is shown in [34], that the mean operator

$$ F_{h} u(\mathbf{x}) = \frac {SI_{2h}u(\mathbf{x})+IS_{2h}u(\mathbf{x})}{2} $$

(21)

has an infinitesimal behaviour which is equivalent to the mean curvature motion $\tilde{\kappa}\lVert \nabla u \rVert _{2}$. However, the problem with $F_{h}$, or rather $F_{\sqrt{h}}$, is that it is not a morphological operator, since the property of contrast invariance is not satisfied [36]. For this reason, the curvature morphological operator, which approximates the mean operator, is introduced in [29]. Thus, (20) has the same infinitesimal behaviour as (21), namely $\tilde{\kappa}\lVert \nabla u \rVert _{2}$, which is also a component of the level-set equations (9) and (11).

Morphological GAC (MGAC) and ACWE (MACWE)

In summary, the introduced morphological operators approximate the PDE terms of which the level-set equations $u_{t}$ of the GAC and ACWE methods are composed. With this knowledge, it is possible to derive the morphological versions of both methods [29].

To this end, the embedding function u needs to be redefined. Firstly, u should be discrete in practice and secondly, u has to be binary, since the morphological operators are also binary. Hence, $u:\mathbb{Z}^{2} \rightarrow \{0,1\}$ is defined as a binary piece-wise constant function with

$$ u(\mathbf{x})= \textstyle\begin{cases} 1& {\text{if }} \mathbf{x} {\text{ is inside the curve boundaries},} \\ 0 & {\text{if }} \mathbf{x} {\text{ is outside the curve boundaries.}} \end{cases} $$

(22)

Due to the discretization of u, the (sets of) structuring elements also have to be discretized. This realizes the discretization of the morphological operators. In Fig. 4, a possible discrete version $\mathcal{B}^{d}$ of $\mathcal{B}$ in definition 2.3 is described. Here, $\mathcal{B}^{d}$ consists of four discrete line segments with a length of three pixels and the origin at the pixel coordinate $(0,0)$. Analogously, the structuring element of the Dilation and the Erosion given by the unit ball $B_{1}(0)$ can be discretized.
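A discrete version of the curvature morphological operator (20) can be sketched directly from the sup-inf form (18) and the inf-sup form (19), using four line segments of three pixels each as described above. The sketch is illustrative only: the concrete offsets, the replication of border pixels and the function names are assumptions and not taken from [29].

```python
# Four discrete line segments of length three pixels through the origin:
# horizontal, vertical, diagonal and anti-diagonal (offsets as (row, column)).
SEGMENTS = [
    [(0, -1), (0, 0), (0, 1)],
    [(-1, 0), (0, 0), (1, 0)],
    [(-1, -1), (0, 0), (1, 1)],
    [(-1, 1), (0, 0), (1, -1)],
]


def _at(u, r, c):
    """Read u at (r, c), replicating the border pixels (assumption)."""
    r = min(max(r, 0), len(u) - 1)
    c = min(max(c, 0), len(u[0]) - 1)
    return u[r][c]


def sup_inf(u):
    """Discrete SI: maximum over the segments of the minimum along a segment."""
    return [
        [
            max(min(_at(u, r + dr, c + dc) for dr, dc in seg) for seg in SEGMENTS)
            for c in range(len(u[0]))
        ]
        for r in range(len(u))
    ]


def inf_sup(u):
    """Discrete IS: minimum over the segments of the maximum along a segment."""
    return [
        [
            min(max(_at(u, r + dr, c + dc) for dr, dc in seg) for seg in SEGMENTS)
            for c in range(len(u[0]))
        ]
        for r in range(len(u))
    ]


def curvature_smooth(u):
    """Composition SI after IS, the discrete counterpart of Eq. (20)."""
    return sup_inf(inf_sup(u))
```

Applied to a binary level-set function, this composition removes isolated foreground pixels and fills isolated holes while leaving straight contour pieces untouched, which is the smoothing effect expected from mean curvature motion.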

Intuitively, the balloon force operator acts in a similar way as the Dilation or the Erosion by inflating or deflating a contour, respectively. So the PDE of the balloon force term can be approximated by the Dilation if $v>0$ and by the Erosion if $v<0$. The smoothing term represents the mean curvature motion, such that the PDE of this component is approximated by the curvature morphological operator. Finally, the remaining attraction force $\nabla g(I) \nabla u$ and the image attachment term $\lVert \nabla u \rVert _{2} (\lambda _{2}(I-c_{2})^{2}- \lambda _{1}(I-c_{1})^{2} )$ can be discretized directly, as can the remaining factors $g(I)$ of the balloon and the smoothing term. In [29] it is described how this discretization is realized.

In conclusion, the level-set equations (9) and (11) are solved by the successive computation of the composition of three discrete morphological operators in the MGAC or the MACWE approach. The algorithms of both approaches can be found in [29].

2.3 Mask R-CNN

The Mask R-CNN [37] is a Region-based Convolutional Neural Network for instance segmentation. Hence, the goal is to detect and classify each object of an image, whereby it should be also distinguished between every individual instance within a class. By a Mask R-CNN, the detection and classification task and the generation of a mask for each instance are managed simultaneously.

In particular, the Mask R-CNN is an extension of the Faster R-CNN [38]. This framework for object detection consists of two components: a Region Proposal Network (RPN) and a region-based object detection network, here given by the Fast R-CNN [39] (the predecessor of Faster R-CNN). The RPN generates candidate object locations, called “proposals”, which the Fast R-CNN uses to determine the exact locations of the detected objects. The innovation of the Faster R-CNN is to unify both networks into one framework by developing training algorithms in which both networks share some of their layers. The extension of the Faster R-CNN is realized by adding a mask branch. As described in Fig. 5, the instance masks are generated by a Fully Convolutional Network (FCN) [40], which classifies for each pixel whether it belongs to a certain class or not.

To achieve the prediction of high quality masks, the authors of [37] show that the following two novelties play a key role. Firstly, the “RoIPooling” layer (RoI: Region of Interest) of the Faster R-CNN is substituted by the “RoIAlign” layer. The Faster R-CNN itself is not designed for a pixel-to-pixel relation between input and output. With RoIAlign, the features which are extracted by the convolutional backbone can be properly aligned with the input image (see Fig. 5). Thus, the generation of pixel-accurate instance segmentation masks is possible. Secondly, the classification task and the prediction of the mask for each instance are decoupled. The loss function is defined in such a manner that binary masks are predicted for all of the K classes independently, such that no competition exists among these classes during inference. Hence, the prediction of the class is not based on a predicted mask, but solely on the classification branch. Since all desired outputs are computed in parallel, the multi-task loss

$$ \mathcal{L}= \mathcal{L}_{C} + \mathcal{L}_{B} + \mathcal{L}_{M} $$

(23)

is defined on each RoI which is composed of the classification loss $\mathcal{L}_{C}$ [39], the bounding box regression loss $\mathcal{L}_{B}$ [39] and the loss of the mask branch $\mathcal{L}_{M}$ [37, 41]. These loss functions are defined as described below.

Remark 2.4

(Notation)

The set of ground truth labels is given by $\mathcal{Y}$, whereby $\# \mathcal{Y} = K+1$, and consists of K predefined object classes and an additional background class. In particular, the background class is denoted by $y=0$. Moreover, the ground truth class y of each instance is assigned to each RoI.

1. Classification loss $\mathcal{L}_{C}$

The output of the classification branch is given by a discrete probability distribution $p = (p_{0}, p_{1}, \ldots, p_{y}, \ldots, p_{K})$ over all $K+1$ classes whereby

$$ p_{k} = \frac {e^{z_{k}}}{\sum_{i=0}^{K} e^{z_{i}}}\quad \forall k=0, \ldots, K $$

(24)

with $z \in \mathbb{R}^{K+1}$ as the output of the last fully connected layer. Then the classification loss is defined as the log loss of the true class y:

$$ \mathcal{L}_{C} = -\log (p_{y}) $$

(25)
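Equations (24) and (25) can be checked numerically with a few lines of code. The sketch below is a plain illustration with hypothetical function names; subtracting the maximum logit before exponentiating is a standard numerical safeguard that does not change the result of (24).

```python
import math


def softmax(z):
    """Eq. (24): class probabilities from the logits z of length K + 1."""
    shift = max(z)  # numerical stabilization, leaves the ratios unchanged
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [value / total for value in exps]


def classification_loss(z, y):
    """Eq. (25): negative log-probability of the true class index y."""
    return -math.log(softmax(z)[y])
```

For uniform logits over three classes the loss equals log 3, and it decreases as the logit of the true class grows relative to the others.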

2. Bounding box regression loss $\mathcal{L}_{B}$

The output of the bounding box regression branch is given by a four-tuple of pixel values $\hat{b}^{k} = (\hat{b}^{k}_{c}, \hat{b}^{k}_{d}, \hat{b}^{k}_{w}, \hat{b}^{k}_{h} ) \forall k=1, \ldots, K$. Here, $(\cdot _{c}, \cdot _{d})$ describe the pixel coordinates of the center of the bounding box, while the width and height of the bounding box are given by the pixel values with the indices w and h. For detailed information on the derivation of these values we refer to [39]. With the ground truth bounding box $b^{y} = (b^{y}_{c}, b^{y}_{d}, b^{y}_{w}, b^{y}_{h} )$, assigned to each RoI if $y\neq 0$, the loss function is defined by

$$ \mathcal{L}_{B} = 1_{\{y > 0\}} \sum_{j \in \{c, d, w, h\}} \mathcal{H} \bigl(\hat{b}_{j}^{y} - b_{j}^{y}\bigr) $$

(26)

with $\mathcal{H}(\phi )$ the Huber loss function [42]

$$ \mathcal{H}(\phi ) = \textstyle\begin{cases} 0.5 \phi ^{2} & {\text{if }} \vert \phi \vert < 1, \\ \vert \phi \vert - 0.5& {\text{else}} \end{cases} $$

(27)

and the indicator function

$$ 1_{\{y > 0\}} = \textstyle\begin{cases} 1 & {\text{if }} y > 0, \\ 0 & {\text{if }} y = 0. \end{cases} $$

(28)
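The bounding box loss (26) with the Huber function (27) and the indicator (28) can be sketched as follows. The data layout (a dictionary that maps each class index to its predicted four-tuple) and the function names are hypothetical choices for this illustration.

```python
def huber(phi):
    """Eq. (27): quadratic near zero, linear for large residuals."""
    return 0.5 * phi * phi if abs(phi) < 1.0 else abs(phi) - 0.5


def bbox_loss(pred_boxes, gt_box, y):
    """Eq. (26): Huber loss summed over (c, d, w, h) for the true class y.

    pred_boxes: dict mapping class index k >= 1 to a predicted 4-tuple.
    gt_box: ground truth 4-tuple; ignored for the background class y = 0.
    """
    if y == 0:  # indicator function of Eq. (28)
        return 0.0
    return sum(huber(pred - gt) for pred, gt in zip(pred_boxes[y], gt_box))
```

Only the box predicted for the ground truth class contributes, so the boxes predicted for all other classes are ignored by the loss of that RoI.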

3. Loss of the mask branch $\mathcal{L}_{M}$

The output of the mask branch has the dimension $Km^{2}$ since it encodes a binary mask of a spatial dimension of $m\times m$ for all K object classes except the background class. The values of each pixel of the predicted mask $\hat{\tau}_{rs}^{k}, k=1, \ldots, K$ are derived by applying a sigmoid activation function to the outputs of the last feature map. With the pixel values $\tau _{rs}^{y}$ of the ground truth mask, the loss function is defined as the average binary cross-entropy:

$$ \mathcal{L}_{M} = - 1_{\{y > 0\}} \frac{1}{m^{2}} \sum_{r = 1}^{m} \sum_{s = 1}^{m} \bigl[ \tau _{rs}^{y} \log \bigl(\hat{\tau}_{rs}^{y}\bigr) + \bigl(1 - \tau _{rs}^{y}\bigr) \log \bigl(1 - \hat{\tau}_{rs}^{y}\bigr) \bigr] $$

(29)

with $1_{{y > 0}}$ defined as in (28).
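A minimal sketch of the mask loss, written with the usual sign convention under which the binary cross-entropy is non-negative. The function name, the clipping constant `eps` and the example masks are assumptions made for illustration; the predicted mask passed to the function is assumed to be the one belonging to the true class $y$.

```python
import math

def mask_loss(tau_hat, tau, y, eps=1e-12):
    """Average binary cross-entropy over an m x m mask.

    tau_hat: predicted mask for the true class y, values in (0, 1).
    tau:     ground truth mask with values in {0, 1}.
    eps:     clipping constant (an assumption, to avoid log(0)).
    """
    if y == 0:  # background RoIs do not contribute to the mask loss
        return 0.0
    m = len(tau)
    total = 0.0
    for r in range(m):
        for s in range(m):
            p = min(max(tau_hat[r][s], eps), 1.0 - eps)
            t = tau[r][s]
            total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / (m * m)

# Hypothetical 2 x 2 example: a maximally uncertain prediction.
tau = [[1, 0], [0, 1]]
tau_hat = [[0.5, 0.5], [0.5, 0.5]]
loss = mask_loss(tau_hat, tau, y=1)
```

Since only the mask of the true class enters the loss, the masks predicted for the other $K-1$ classes do not compete with it.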

Those three tasks of classification, bounding box regression and mask generation are solved in the head architecture of the Mask R-CNN which operates on each RoI, whereas the important features are extracted in the convolutional backbone architecture. In this work, the backbone architecture is given by a combination of the ResNet with 101 layers [43] and the Feature Pyramid Network (FPN) [44]. The head architecture is given only by the FPN.

In summary, the Mask R-CNN creates instance segmentation masks that separate the relevant instances from the background and distinguish them individually.

3 Datasets for interior sensing

Scenarios of interior sensing are highly complex. In this work, the instances to be separated from the background belong to the classes “Person”, “Child seat” and “Object”. Thus, the foreground instances can be both dynamic and static. Furthermore, only those detected instances which are positioned on the front and back seats of the car are of interest. Modelling the background is also non-trivial, since it contains dynamic elements due to the motion of objects visible through the car windows. Moreover, the camera position changes slightly from car to car in this work. Additionally, the impact of environmental effects such as different weather conditions, shadows, traffic lights and vibrations has to be taken into account.

To solve the background-foreground segmentation task adequately in this highly complex setting, it is important, especially for the training of the Mask R-CNN, that an appropriate training set covering a wide range of these challenges is available. To this end, the ISSO dataset has been created by APTIV and the authors of this work. It consists of 1300 real-world images extracted from videos of different interiors of stationary or driving cars. The images show a high variety regarding the foreground instances, the background and the environmental conditions. Further details are provided in the upcoming section.

Since the annotation of images is time-consuming and costly, only 1100 annotated real-world images are available for training. To overcome this problem, we apply transfer learning. Here, the training of the Mask R-CNN is initialized with a model pretrained on the COCO dataset. This pretrained model is able to detect persons and certain everyday objects outside the scope of car interiors. Therefore, we also consider the impact of keeping the COCO dataset in the training data. Moreover, images of the synthetic dataset SVIRO are used for the training of the Mask R-CNN to compensate for the small amount of annotated real-world data. The SVIRO dataset consists of rendered images describing scenarios in the passenger compartment of different cars.

Hence, in total, we consider images and videos of three different datasets—the COCO dataset, the SVIRO dataset and our ISSO dataset. In the next section, we describe these datasets in more detail.

3.1 The ISSO dataset

The Interior Sensing and Seat Occupancy (ISSO) dataset has been created by APTIV and the authors of this work. It consists of images extracted from videos that have been recorded in driving or stationary cars by APTIV in Wuppertal, Germany. The purpose of this dataset is to enable the feasibility study provided by the present article, and it is not meant to be representative of a local or global population. The images were selected such that the instances, the backgrounds and the light conditions vary strongly across the dataset. The camera is mounted on the upper or lower area of the windshield in each car. Hence, its position changes slightly from car to car. Since the camera is not integrated, its position might even change slightly within one and the same car. In total, 1300 images were labeled, from which we define training, validation and test sets.

1. Training set

The training set is used to train the Mask R-CNN. In total, it consists of 1100 labeled images recorded in five different cars. 500 images were selected and labeled at the beginning of this work. Since the class “Child seat” suffered from a lack of variation, care was taken to record as many different child seats as possible in later recording sessions. From these new videos, 600 additional images were chosen and labeled, such that the number of instances of the class “Child seat” increased in particular. Nevertheless, as one can see from Fig. 6, the instances of this class are the least represented in the training set. For the class “Child seat”, it should be noted that instances are not clearly visible in two situations. Here, a situation refers to a specific camera position in a specific car. First, due to the camera position, only a small part of a child seat is visible if it is mounted on the back seat of the car interior. Second, a child seat is hardly visible if it is occupied by a child. Hence, especially for the training, it is of interest in how many cases the child seats are mounted on the front passenger seat and thus clearly visible. In particular, the child seats are mounted on the front passenger seat of the car in about 60% of the images that contain a child seat. Of these front-mounted child seats, over 80% are not occupied. Detailed statistics for the training set are given in Appendix B.

2. Validation set

The validation set consists of 100 images distributed over three different cars. It contains five persons, among them one child, one female and four male. The class “Object” is represented by four instances of five main categories (laptop, PC-keyboard, backpack and beverage crate). Moreover, two child seats are available in the validation dataset. A child seat appears in 29 images, of which only seven show a child seat mounted on the front passenger seat. None of the front-mounted child seats is occupied by a child. All cars, child seats, objects and persons are different from those shown in the training set.

3. Test set

The test set is used to evaluate and to compare the performance of the implemented foreground-background detection methods. It consists of 100 labeled images extracted from 70 videos that are recorded in five different cars. 50 images are extracted from videos inside a driving car and the other 50 images are extracted from videos inside stationary cars.

The test set contains 13 persons, among them three children and one baby, three female and ten male. The class “Object” collects everyday items, such as a backpack or a wallet. As described in Table 14, 42 instances of 16 main categories are included in the test set. Furthermore, four different child seats are available in the test set. The child seat is mounted on the front passenger seat of the car in about 30% of the images that contain a child seat. Additionally, about 53% of these front-mounted child seats are occupied. All cars, child seats, objects and persons are different from those shown in the training and validation sets.

Creation of the ground truth

The annotations of the images are created with the tool “Labelme” from MIT [45], extended by an eraser function. With “Labelme”, it is possible to annotate the instances of an image by closed polygons. From this information, the ground truth segmentation masks with a different gray scale value for each instance can be created (see Fig. 7).

In this feasibility study, we focus only on foreground-background segmentation. Hence, for the images of the test set, we generated binary ground truth segmentation masks. As shown in Fig. 8, the foreground instances are represented by white pixels, while the background is given by black pixels.
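The conversion from an instance mask to a binary ground truth mask can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the function name is hypothetical, the background is assumed to carry the gray value 0 in the instance mask, and white and black are assumed to be encoded as 255 and 0.

```python
def to_binary_mask(instance_mask, background_value=0):
    """Map every non-background gray value to white (255), background to black (0)."""
    return [
        [0 if value == background_value else 255 for value in row]
        for row in instance_mask
    ]

# Hypothetical 2 x 3 instance mask with two instances (gray values 80 and 160).
instance_mask = [
    [0, 0, 80],
    [160, 160, 0],
]
binary_mask = to_binary_mask(instance_mask)
```

After this step the two instances are no longer distinguishable, which matches the foreground-background setting considered here.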

3.2 The SVIRO dataset

As the name suggests, the Synthetic Vehicle Interior Rear Seat Occupancy (SVIRO) dataset [46] consists of images produced artificially with the rendering software Blender, version 2.79. These images depict randomly generated scenarios in the passenger compartment of ten different vehicles. For each of these cars, 2500 images were generated and split into a training and a test set. Here, each training set contains 2000 labeled images and each test set 500 images. The labeled instances of the classes “Person”, “Child seat” and “Everyday object” differ between the training and the test set.

In this work, the training datasets of the five car models Ford Escape, Lexus GSF, Tesla Model 3, VW Tiguan and Renault Zoe are used. For the training set of each car, the following statistics apply: each training set contains 23 persons, of whom six are children and three are babies. Moreover, three different child seats and four different everyday objects are used in each training dataset. Furthermore, different light conditions are taken into account. For the task of instance segmentation, the ground truth corresponding to the RGB image is given by an instance segmentation mask as shown in Fig. 9.

3.3 The COCO dataset

The “Common Objects in Context” (COCO, [47]) dataset consists of images that depict everyday objects in typical environments. The images are mainly non-iconic. This means, for example, that an object is not shown in front of a calm background but in a complex scene. For this work, the training dataset of the year 2017 is used. This dataset consists of over 118,000 labeled images covering over 60 object categories of 10 main categories including the category “background” as described in Table 15. The annotations are created by labeling each instance in an image with a closed polygon and its corresponding bounding box.

4 Evaluation metrics

The foreground-background masks generated by the implemented frameworks are evaluated pixel-wise. The metrics by which this evaluation is realized are composed of the following four terms:

True positives (TP): Pixels that belong to the foreground and which are correctly classified.
False positives (FP): Pixels that belong to the background, but which are misclassified.
True negatives (TN): Pixels that belong to the background and which are correctly classified.
False negatives (FN): Pixels that belong to the foreground, but which are misclassified.

From these, commonly used metrics for the evaluation of background-foreground segmentation tasks [48] can be defined as given in Table 1. While the precision describes the number of correctly predicted foreground pixels relative to the total number of predicted foreground pixels, the recall considers the correctly predicted foreground pixels relative to the total number of actual (true) foreground pixels according to the ground truth. The specificity describes the proportion of true background pixels that are correctly classified. The accuracy provides the overall proportion of correct classifications. The similarity is also known as the Jaccard index or the Intersection over Union (IoU) [48–50]. This value measures to what extent the ground truth mask and the predicted mask resemble one another. Finally, the $F_{1}$-score is the harmonic mean of precision and recall [51].
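The pixel-wise evaluation can be sketched as follows. The snippet uses the standard textbook definitions of the six metrics, which are assumed to agree with Table 1; the function names and the example masks are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN and FN for two binary masks of equal shape."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Safe division; an empty denominator yields 0.0 (sketch convention)."""
    return numerator / denominator if denominator else 0.0

def evaluate(pred, truth):
    tp, fp, tn, fn = confusion_counts(pred, truth)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "similarity": ratio(tp, tp + fp + fn),  # Jaccard index / IoU
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical 2 x 4 masks (1 = foreground, 0 = background).
truth = [[1, 1, 0, 0], [1, 1, 0, 0]]
pred = [[1, 0, 0, 0], [1, 1, 1, 0]]
scores = evaluate(pred, truth)
```

In this example one foreground pixel is missed and one background pixel is falsely marked, so the similarity (3/5) is lower than the $F_{1}$-score (3/4), although both are computed from the same counts.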

Table 1 Definition of the evaluation metrics
