If $S$ has a tree structure, it has been shown [4, 5] that for every such functional network there are low-dimensional manifolds $M$ such that it is sufficient to measure data in an $\varepsilon$-environment of $M$ in order to identify the model properly. Such manifolds $M$ are called data bases. The same authors have proven that the minimal dimension of a data base is equal to the maximum number of input edges of any black-box node in the network. Moreover, almost all differentiable, monotonic submanifolds whose dimension equals this maximum number of input variables of a black-box node have (at least locally) the properties of a data base. Additionally, direct as well as indirect identification procedures have been analysed and implemented in software [13].
This result is based on the structure of $S$, which guarantees that, even though all nodes in $S$ may be black-box models, the overall functional network model cannot represent an arbitrary smooth function of $n$ input variables. We now show that this intrinsic property of hierarchical functional networks is specific to the topology of $S$ and allows, if sufficiently large data sets are available, a direct reconstruction of the topology of $S$ from data.
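In all functional models where $S$ has a tree structure, there is a unique path $P_i$ connecting each input variable $x_i$ to the output node. The paths $P_i$ and $P_j$ from inputs $i$ and $j$ to the output node may join in a node $k$, so $P_i$ and $P_j$ are not necessarily disjoint. Suppose all node functions are strictly monotonic in all variables with bounded second derivatives. Then the partial derivative of the output function $y$ with respect to $x_i$ is the product of the partial derivatives of all input-output functions along the path starting at the input node of $x_i$ and ending with the output node of the entire model:

$$\frac{\partial y}{\partial x_i} = \frac{\partial u_{l(i)}}{\partial x_i}\,\Pi(P_i),$$

where $l(i)$ is the input node of $x_i$ and $u_{l(i)}$ is its node function. The term $\Pi(P_i)$ represents the product of the partial derivatives of the functional nodes along the path $P_i$, each taken with respect to the output of the preceding node on $P_i$. Let $P_{ij}$ be the common part of the paths $P_i$ and $P_j$ following the node in which they join; then it holds $\Pi(P_i) = \Pi(P_i \setminus P_{ij})\,\Pi(P_{ij})$ and $\Pi(P_j) = \Pi(P_j \setminus P_{ij})\,\Pi(P_{ij})$.

Let the input variables $x_i$ and $x_j$ be inputs to the same input node $l$, whose input-output relation is represented by the function $u_l$, and let $x_k$ be an input variable to any other node $l'$ with function $u_{l'}$. Then application of the chain rule for the derivatives with respect to $x_i$, $x_j$ and $x_k$ leads to the following set of partial differential equations (PDEs) for the output function $y$:

$$\begin{aligned}
\frac{\partial y}{\partial x_i} &= \frac{\partial u_l}{\partial x_i}\,\Pi(P_i), \\
\frac{\partial y}{\partial x_j} &= \frac{\partial u_l}{\partial x_j}\,\Pi(P_j), \\
\frac{\partial y}{\partial x_k} &= \frac{\partial u_{l'}}{\partial x_k}\,\Pi(P_k).
\end{aligned} \tag{1}$$

Since the variables $x_i$ and $x_j$ are inputs of the same node $l$, $P_i$ and $P_j$ are identical. The respective products of the partial derivatives along both pathways are therefore the same for $i$ and $j$, leading to the relation:

$$\frac{\partial y/\partial x_i}{\partial y/\partial x_j} = \frac{\partial u_l/\partial x_i}{\partial u_l/\partial x_j}. \tag{2}$$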
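As a worked illustration (a hypothetical three-input example, not taken from the references), consider a tree in which $x_1$ and $x_2$ enter an input node with function $u_1$, whose output enters the output node with function $u_2$ together with $x_3$. The names $u_1$ and $u_2$ are assumptions made for this sketch only.

```latex
\begin{aligned}
y &= u_2\bigl(u_1(x_1, x_2),\, x_3\bigr), \\
\frac{\partial y}{\partial x_1} &= \frac{\partial u_1}{\partial x_1}\,\frac{\partial u_2}{\partial u_1}, \qquad
\frac{\partial y}{\partial x_2} = \frac{\partial u_1}{\partial x_2}\,\frac{\partial u_2}{\partial u_1}, \qquad
\frac{\partial y}{\partial x_3} = \frac{\partial u_2}{\partial x_3}, \\
\frac{\partial y/\partial x_1}{\partial y/\partial x_2} &= \frac{\partial u_1/\partial x_1}{\partial u_1/\partial x_2}.
\end{aligned}
```

The common factor $\partial u_2/\partial u_1$ cancels in the ratio, which therefore depends on $x_1$ and $x_2$ only. Strict monotonicity of the node functions ensures that the denominators do not vanish.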
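All partial derivatives of (2) with respect to any variable $x_k$ which is not an input of node $l$ will vanish everywhere:

$$\frac{\partial}{\partial x_k}\left(\frac{\partial y/\partial x_i}{\partial y/\partial x_j}\right) = 0. \tag{3}$$

Therefore, all functions $y$ which can be represented by the functional network have to satisfy the set of PDEs:

$$\frac{\partial^2 y}{\partial x_i\,\partial x_k}\,\frac{\partial y}{\partial x_j} - \frac{\partial^2 y}{\partial x_j\,\partial x_k}\,\frac{\partial y}{\partial x_i} = 0 \tag{3a}$$

for all triplets $(x_i, x_j, x_k)$ where $x_i$ and $x_j$ are inputs to the same node, whereas $x_k$ is the input to another node.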
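The structural constraint can be probed numerically. The following is a minimal sketch, assuming two hypothetical test functions (`y_tree`, `y_other`) and central finite differences with a hand-picked step size `h`; it evaluates the expression $y_{x_i x_k}\,y_{x_j} - y_{x_j x_k}\,y_{x_i}$ for a given triplet. None of these names come from the original work.

```python
import math

def y_tree(x1, x2, x3):
    """Hypothetical tree network: u1(x1, x2) feeds the output node u2(u1, x3)."""
    u1 = x1 + x2 ** 3
    return math.exp(0.5 * u1) + u1 * x3 + x3

def y_other(x1, x2, x3):
    """Hypothetical function in which x1 and x3 (not x1 and x2) share a node."""
    return x1 * x3 + x2 + x3

def partial(f, x, i, h=1e-3):
    """Central-difference estimate of df/dx_i at the point x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(*xp) - f(*xm)) / (2.0 * h)

def mixed(f, x, i, k, h=1e-3):
    """Central-difference estimate of d^2 f / (dx_i dx_k) at the point x."""
    xp, xm = list(x), list(x)
    xp[k] += h
    xm[k] -= h
    return (partial(f, xp, i, h) - partial(f, xm, i, h)) / (2.0 * h)

def structure_residual(f, x, i, j, k, h=1e-3):
    """y_ik * y_j - y_jk * y_i; close to zero if x_i, x_j share a node without x_k."""
    return (mixed(f, x, i, k, h) * partial(f, x, j, h)
            - mixed(f, x, j, k, h) * partial(f, x, i, h))

point = [0.3, 0.5, 0.7]
res_tree = structure_residual(y_tree, point, 0, 1, 2)
res_other = structure_residual(y_other, point, 0, 1, 2)
res_other_regrouped = structure_residual(y_other, point, 0, 2, 1)
print(res_tree, res_other, res_other_regrouped)
```

For `y_tree` the residual for the triplet $(x_1, x_2, x_3)$ is zero up to discretization error, whereas for `y_other` it is not; regrouping the triplet as $(x_1, x_3, x_2)$ makes the residual vanish for `y_other`, reflecting that $x_1$ and $x_3$ share a node there. A single evaluation point is used only for brevity; the condition has to hold everywhere.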
Generalizing this argument, we show that $S$ is associated with an even larger set of structural PDEs that $y$ has to satisfy. Let the root and the rank be defined as follows:

Definition 1 Node $k$ shall be the root $\operatorname{root}(x_i, x_j)$ of the input variables $x_i$ and $x_j$ if the pathways from $x_i$ to the output $y$ of the entire system and from $x_j$ to $y$ join for the first time in node $k$. As the pathways from each input variable to the output are unique in tree structures, every pair of input variables has a unique root.

The rank $\operatorname{rank}(k)$ of a node $k$ shall be given by the length of the path from $k$ to the output $y$ of the entire system. In tree structures each node has a unique rank.
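The two definitions can be made concrete for a tree stored as a child-to-parent map. The sketch below is only an illustration: the dictionary `parent`, the node names and the convention that the output node has rank 0 are assumptions introduced here.

```python
def path_to_output(node, parent):
    """Nodes visited when walking from `node` up to the output node (inclusive)."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def rank(node, parent):
    """Length of the path from `node` to the output node (output node: rank 0)."""
    return len(path_to_output(node, parent)) - 1

def root(a, b, parent):
    """First node in which the pathways of the inputs `a` and `b` join."""
    nodes_above_a = set(path_to_output(a, parent)[1:])
    for node in path_to_output(b, parent)[1:]:
        if node in nodes_above_a:
            return node
    raise ValueError("inputs do not belong to the same tree")

# Hypothetical example: nodes A(x1, x2) and B(x3, x4) feed the output node C(A, B, x5).
parent = {"x1": "A", "x2": "A", "x3": "B", "x4": "B",
          "A": "C", "B": "C", "x5": "C"}

print(root("x1", "x2", parent), rank(root("x1", "x2", parent), parent))
print(root("x1", "x3", parent), rank(root("x1", "x3", parent), parent))
```

In this example the pair $(x_1, x_2)$ has a root of higher rank than the pair $(x_1, x_3)$, since the pathways of $x_1$ and $x_2$ join earlier.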
Then, in tree structures with n input variables and one output variable y the following theorem holds:
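Theorem 1 (Structure-Constraint Theorem) For each triplet of input variables $x_i$, $x_j$, $x_k$ the conditions

(i) $$\frac{\partial^2 y}{\partial x_i\,\partial x_k}\,\frac{\partial y}{\partial x_j} - \frac{\partial^2 y}{\partial x_j\,\partial x_k}\,\frac{\partial y}{\partial x_i} = 0$$

and

(ii) $$\operatorname{rank}\bigl(\operatorname{root}(x_i, x_j)\bigr) > \operatorname{rank}\bigl(\operatorname{root}(x_i, x_k)\bigr)$$

are equivalent.

Remark Eq. (3a) is a special case of the structure-constraint theorem, where $\operatorname{rank}(\operatorname{root}(x_i, x_j))$ is maximal.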
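Proof For all triplets $i$, $j$, $k$ satisfying (ii), the pathways $P_i$, $P_j$ and $P_k$ must be at least partially disjoint. As (ii) is satisfied, the pathways can be decomposed into components with specific overlaps:

$$P_i = P_i^{(1)} \cup P^{(2)} \cup P^{(3)}, \qquad P_j = P_j^{(1)} \cup P^{(2)} \cup P^{(3)}, \qquad P_k = P_k^{(1)} \cup P^{(3)}, \tag{4a}$$

$$P_k \cap P_i^{(1)} = P_k \cap P_j^{(1)} = \emptyset, \tag{4b}$$

with $P_i^{(1)}$ and $P_j^{(1)}$ running from the input nodes of $x_i$ and $x_j$ up to and including $\operatorname{root}(x_i, x_j)$, $P^{(2)}$ the segment shared by $P_i$ and $P_j$ between $\operatorname{root}(x_i, x_j)$ and $\operatorname{root}(x_i, x_k)$, and $P^{(3)}$ the segment shared by all three pathways. Because of the partial coincidence of the pathways, the factors along $P^{(2)}$ and $P^{(3)}$ are identical for $i$ and $j$, and it holds:

$$\frac{\Pi(P_i)}{\Pi(P_j)} = \frac{\Pi\bigl(P_i^{(1)}\bigr)}{\Pi\bigl(P_j^{(1)}\bigr)},$$

where $\Pi(P)$ denotes the product of the partial derivatives of the node functions along $P$. In analogy to Eq. (2), this leads to

$$\frac{\partial y/\partial x_i}{\partial y/\partial x_j} = \frac{\partial u_{l(i)}/\partial x_i}{\partial u_{l(j)}/\partial x_j}\cdot\frac{\Pi\bigl(P_i^{(1)}\bigr)}{\Pi\bigl(P_j^{(1)}\bigr)},$$

where $u_{l(i)}$ and $u_{l(j)}$ are the functions of the input nodes of $x_i$ and $x_j$. Because of (4b) the last term does not depend on $x_k$, and it holds:

$$\frac{\partial}{\partial x_k}\left(\frac{\partial y/\partial x_i}{\partial y/\partial x_j}\right) = 0,$$

which is equivalent to (i) because $\partial y/\partial x_j \neq 0$ by strict monotonicity. Conversely, if (i) holds, then we can find a decomposition of the respective pathways $P_i$, $P_j$ and $P_k$ according to Eqs. (4a) and (4b), resulting in (ii). □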
Based on the Structure-Constraint Theorem, the structure S of the functional network can be unravelled from the data as follows:
Algorithm 1 Direct hierarchical functional network reconstruction:
i. Test for every triplet of input variables $x_i$, $x_j$, $x_k$ whether condition (i) of the structure-constraint theorem is globally satisfied. This yields the full set of satisfied rank-root conditions for the structure $S$.

ii. Pick all pairs $i$, $j$ for which there is no $k \neq i, j$ such that condition (ii) holds in the form

$$\operatorname{rank}\bigl(\operatorname{root}(x_i, x_k)\bigr) > \operatorname{rank}\bigl(\operatorname{root}(x_i, x_j)\bigr) \quad\text{or}\quad \operatorname{rank}\bigl(\operatorname{root}(x_j, x_k)\bigr) > \operatorname{rank}\bigl(\operatorname{root}(x_i, x_j)\bigr).$$

Then $x_i$ and $x_j$ are inputs to the same input node. Use this combinatorial information to distribute all input variables onto their respective input nodes.

iii. Join the outputs of each input node $l$ to one 'child' variable $x'_l$. The roots for a 'child' variable are equal to those roots of the respective 'parent' variables which are not yet identified as input nodes. The respective ranks for the roots of the 'child' variables are the ranks of the respective roots of the parent variables minus 1. So we arrive at a new, smaller structure $S'$ which consists of all nodes which have not been identified in step (ii) as input nodes. Therefore, $S'$ is identical to the respective part of $S$, and the input variables of $S'$ are the 'child' variables of the input nodes. The respective roots and ranks can be determined from the roots and ranks of $S$.

iv. Distribute the 'child' variables as input variables of $S'$ onto their input nodes in $S'$. This can be performed as described in step (ii), leading to new 'grand-child' variables. To do so, go to step (ii).

v. In each tree structure there exists a finite number $m$ such that $m$ loops of steps ii-iv described above lead to a structure where all new input variables have the same root node. This common root is the output node of the entire system structure $S$, and the algorithm stops.
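The loop over steps ii-iv can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation referred to in the text: the function `closer(a, b, c)` is a hypothetical oracle that answers whether the root of $(a, b)$ has a higher rank than the root of $(a, c)$, which is the information that testing condition (i) for a triplet is meant to provide. Here the oracle is derived from a known child-to-parent map (`make_oracle`) purely for demonstration; with real data it would be replaced by a statistical test. Each 'child' variable is represented by one of its original parent variables.

```python
def make_oracle(parent):
    """Hypothetical stand-in for the data-driven test of condition (i)."""
    def nodes_above(node):
        nodes = []
        while node in parent:
            node = parent[node]
            nodes.append(node)
        return nodes

    def root_rank(a, b):
        on_a = set(nodes_above(a))
        for node in nodes_above(b):
            if node in on_a:
                return len(nodes_above(node))   # distance of the root to the output
        raise ValueError("inputs do not belong to the same tree")

    def closer(a, b, c):
        return root_rank(a, b) > root_rank(a, c)

    return closer

def reconstruct(variables, closer):
    """Return the tree as nested tuples; each tuple is one node with its inputs."""
    # Each entry: (representative original variable, subtree built so far).
    current = [(v, v) for v in variables]
    while len(current) > 1:
        reps = [rep for rep, _ in current]

        def siblings(a, b):
            # Step ii: no third variable joins a or b earlier than they join each other.
            return not any(closer(a, c, b) or closer(b, c, a)
                           for c in reps if c not in (a, b))

        groups = []
        for item in current:
            for group in groups:
                if siblings(group[0][0], item[0]):
                    group.append(item)
                    break
            else:
                groups.append([item])

        # Step iii: every group of siblings becomes one 'child' variable.
        merged = []
        for group in groups:
            if len(group) == 1:
                merged.append(group[0])
            else:
                merged.append((group[0][0], tuple(sub for _, sub in group)))
        if len(merged) == len(current):
            raise ValueError("no input node found; relations inconsistent with a tree")
        current = merged                      # Step iv: repeat on the smaller structure
    return current[0][1]

# Hypothetical ground truth: A(x1, x2) and B(x3, x4) feed the output node C(A, B, x5).
parent = {"x1": "A", "x2": "A", "x3": "B", "x4": "B",
          "A": "C", "B": "C", "x5": "C"}
tree = reconstruct(["x1", "x2", "x3", "x4", "x5"], make_oracle(parent))
print(tree)  # (('x1', 'x2'), ('x3', 'x4'), 'x5')
```

The sketch recovers only the topology (which variables enter which node), not node labels or node functions, and it relies on the oracle being error-free. Nodes with a single input cannot be distinguished by this procedure.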
Notes
a. If the rank-root relations are known for all triplets of input variables, then the adjoint tree structure of $S$ can be directly reengineered from this set of relations. Therefore, if very large sets of data are given (for example, from high-throughput experimentation) such that conditions (i) and (ii) can be tested reliably for all triplets, the structure of the underlying functional network can be directly reconstructed. This direct approach is much more effective than quantitatively identifying the model for all possible model structures $S$ and then selecting the structure with the lowest residuals.

b. The results described above can be transferred to models with discrete, for example binary, outputs. This allows the direct identification of the structure of the functional mechanisms behind the measured data in various scientific applications, for example in the identification of pharmacological mechanisms from high-throughput screening data [16].
The direct network identification algorithm provides a very efficient approach to hierarchical network reengineering. It is superior to one-step reengineering approaches, which require the minimization of an error functional of the residuals and thus lead to a highly nonlinear, combinatorial optimization problem. As the algorithm can be generalized to discrete variables, it may be an efficient method for the analysis of next-generation sequencing data once large data sets are available. However, its drawbacks are the current limitation to tree structures and the need to estimate the derivatives in condition (i) from data, which is an ill-posed problem. Further research will be necessary to develop stable routines which can be applied by non-experts in a standardized workflow.