1 Introduction
Neural Ordinary Differential Equations (ODENets; Chen et al., 2018) can learn latent models from observations that are sparse in time. This property has the potential to enhance the performance of neural network predictive models in applications where information is sparse in time and it is important to account for exact arrival times and delays. In complex control systems and model-based reinforcement learning, planning over a long horizon is often needed, while high-frequency feedback is necessary for maintaining stability (Franklin et al., 2014). Discrete-time models, including RNNs (Jain & Medsker, 1999), often struggle to fully meet the needs of such applications due to their fixed time resolution. ODENets have been shown to outperform classic RNNs on time-series forecasting with sparse training data. However, learning their parameters can be computationally intensive: ODENets are memory efficient but time inefficient. In this paper, we address this bottleneck and propose a novel alternative strategy for training ODENets.
Summary of contributions.
We propose SNet, a compact representation of ODENets that makes use of a higher-order approximation of their states by means of Legendre polynomials. This is outlined in Section 4. In order to find the optimal polynomial coefficients and network parameters, we develop a novel optimization scheme that does not require solving an ODE at each iteration. The resulting algorithm is detailed in Section 3 and is based on backpropagation (Linnainmaa, 1970; Werbos, 1981; Lecun, 1988) and automatic differentiation (Paszke et al., 2017). The proposed method is fully parallel with respect to time, and its approximation error decreases exponentially with the Legendre polynomial order (Canuto et al., 1988).
Summary of numerical experiments.
In Section 5, our method is tested on a 6-state vehicle dynamics identification problem, where each optimizer iteration is at least one order of magnitude faster than with explicit and adjoint methods, while convergence is achieved in a third of the iterations. At test time, the MSE is reduced by one order of magnitude. In Section 6, the method is used for the identification of a 30-state system consisting of identical vehicles, coupled via a known collision avoidance policy. Again, our method converges in a third of the iterations required by backpropagation through a solver, and each iteration is faster than the fastest explicit scheme.
2 Neural ordinary differential equations
The minimization of a scalar-valued loss function that depends on the output of an ODENet can be formulated as a general constrained optimization problem:

$$\min_{\theta}\ \ell(x) \quad \text{s.t.}\quad \dot{x}(t) = f(x(t), u(t), \theta),\quad x(0) = x_0, \tag{1}$$

where $x$ is the state, $u$ is the input, the loss and ODE functions $\ell$ and $f$ are given, and the parameters $\theta$ have to be learned. Equation (1) can be used to represent several inverse problems, for instance in machine learning, estimation, and optimal control (Stengel, 1994; Law et al., 2015; Ross, 2009). Problem (1) can be solved using gradient-based optimization, with the ODE handled by one of several time-stepping schemes. Chen et al. (2018) and Gholami et al. (2019) proposed to use the adjoint method when $f$ is a neural network. These methods typically rely on explicit time-stepping schemes (Butcher & Wanner, 1996). The limitations of these approaches are briefly summarized below.
Limitations of backpropagation through an ODE solver.
The standard approach for solving this problem is to compute the gradients using backpropagation through a discrete approximation of the constraints, such as Runge-Kutta methods (Runge, 1895; Butcher & Wanner, 1996) or multistep solvers (Raissi et al., 2018). This ensures that the solution remains feasible (within a numerical tolerance) at each iteration of a gradient descent method. However, it has several drawbacks: 1) the memory cost of storing intermediate quantities during backpropagation can be significant; 2) implicit methods would require solving a nonlinear equation at each step; 3) the numerical error can significantly affect the solution; and 4) the problem topology can be unsuitable for optimization (Petersen et al., 2018).
Limitations of adjoint methods.
ODENets (Chen et al., 2018) solve (1) using the adjoint method, which consists of simulating a dynamical system defined by an appropriate augmented Hamiltonian (Ross, 2009), with an additional state referred to as the adjoint. In the backward pass, the adjoint ODE is solved numerically to provide the gradients of the loss function, so the intermediate states of the forward pass do not need to be stored; an additional ODE solve is needed for the backward pass. This suffers from a few drawbacks: 1) the dynamics of either the hidden state or the adjoint might be unstable, due to the symplectic structure of the underlying Hamiltonian system, referred to as the curse of sensitivity in (Ross, 2009); 2) the procedure requires solving a differential algebraic equation and a boundary value problem, which is complex, time consuming, and might not have a solution (Ross & Karpenko, 2012).
Limitations of hybrid methods.
ANODE (Gholami et al., 2019) combines the two methods above in order to mitigate their limitations. The problem is split into time batches, where the adjoint is used, and only some intermediate states from the forward pass are stored in memory. This is a compromise between time and storage that does not entirely resolve the above limitations, providing results within the convex envelope of the two methods. Moreover, there is no clear strategy for determining the size of the time batches a priori, which can significantly affect speed and performance.
3 Relaxation of the supervised learning problem
Our algorithm is based on two ingredients: i) the discretization of the problem using spectral elements leading to SNet, detailed in Section 4, and ii) the relaxation of the ODE constraint from (1), enabling efficient training through backpropagation. The latter can be applied directly at the continuous level and significantly reduces the difficulty of the optimization, as shown in our examples.
The problem in (1) is split into two smaller subproblems: one finds the trajectory that minimizes an unconstrained relaxation of (1); the other trains the network weights so that the trajectory becomes a solution of the ODE. Both are addressed using standard gradient descent and backpropagation. In particular, a fixed number of ADAM or SGD steps is performed for each subproblem in an alternating fashion, until convergence. In the following, the details of each subproblem are discussed.
Step 0: Initial trajectory.
The initial trajectory is chosen by solving the problem

$$x^{(0)} = \arg\min_{x}\ \ell(x). \tag{2}$$

If this problem does not have a unique solution, a regularization term is added. For a quadratic loss, a closed-form solution of the above problem is readily available. Otherwise, a prescribed number of SGD iterations is used.
Step 1: Coordinate descent on residual.
Once the initial trajectory is found, the candidate parameters $\theta$ are computed by solving the unconstrained problem:

$$\theta^{\star} = \arg\min_{\theta}\ \int_{0}^{T} \big\| \dot{x}(t) - f(x(t), u(t), \theta) \big\|^{2}\, dt. \tag{3}$$

If the value of the residual at the optimum is smaller than a prescribed tolerance, then the algorithm stops. Otherwise, steps 1 and 2 are iterated until convergence.
Step 2: Coordinate descent on relaxation.
Once the candidate parameters are found, the trajectory is updated by minimizing the relaxed objective:

$$x^{\star} = \arg\min_{x}\ \ell(x) + \rho \int_{0}^{T} \big\| \dot{x}(t) - f(x(t), u(t), \theta) \big\|^{2}\, dt. \tag{4}$$
Discussion.
The proposed algorithm can be seen as an alternating coordinate gradient descent on the relaxed functional used in problem (4), i.e., alternating a minimization with respect to $x$ and $\theta$. If the ODE constraint were enforced exactly, multiple minima could exist, since each choice of the parameters $\theta$ would induce a different dynamics $x$, solution of the original constraint. For finite $\rho$, the loss function in (4) trades off the ODE solution residual for the data fitting, providing a unique solution. The choice of $\rho$ implicitly introduces a constraint satisfaction tolerance, similar to regularized regression (Hastie et al., 2001). Concurrently, problem (3) reduces the residual.
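As a concrete illustration, the alternating scheme of Steps 0-2 can be sketched for a scalar toy model $\dot{x} = \theta x$ with a quadratic loss, in which case both coordinate updates admit closed-form least-squares solutions. Everything below (the toy dynamics, the node placement, and the exact update rules) is an assumption for illustration, not the paper's implementation:

```python
import numpy as np
from numpy.polynomial import legendre as L

# Alternate closed-form updates of the Legendre coefficients c of x(t)
# and of a scalar parameter theta, for dx/dt = theta * x on [-1, 1],
# minimizing  E(c, theta) = |x - y|^2 + rho * |dx/dt - theta * x|^2.
P, rho = 8, 1.0
ts = -np.cos(np.pi * np.arange(33) / 32)     # nodes clustered at the ends
theta_true = -0.5
y = np.exp(theta_true * (ts + 1.0))          # samples of the true solution

# Basis and differentiation matrices: A[q, k] = P_k(t_q), D[q, k] = P_k'(t_q)
A = np.stack([L.legval(ts, np.eye(P + 1)[k]) for k in range(P + 1)], axis=1)
D = np.stack([L.legval(ts, L.legder(np.eye(P + 1)[k]))
              for k in range(P + 1)], axis=1)

c = np.linalg.lstsq(A, y, rcond=None)[0]     # Step 0: fit the data only
theta = 0.0
for _ in range(20):
    x, dx = A @ c, D @ c
    theta = (dx @ x) / (x @ x)               # Step 1: residual least squares
    M = np.vstack([A, np.sqrt(rho) * (D - theta * A)])
    b = np.concatenate([y, np.zeros_like(y)])
    c = np.linalg.lstsq(M, b, rcond=None)[0] # Step 2: relaxed objective
print(round(theta, 3))
```

Here Step 1 reduces to a scalar least-squares fit of $\theta$ and Step 2 to a linear least-squares problem in the coefficients, mirroring the closed-form case mentioned above for quadratic losses.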
4 SNet – High-order discretization of the relaxed problem
In order to numerically solve the problems presented in the previous section, a discretization of the trajectory $x(t)$ is needed. Rather than updating its values at time points from the past to the future, we introduce a compact representation of the complete discrete trajectory by means of the spectral element method.
Spectral approximation.
We start by representing the scalar unknown trajectory, $x(t)$, and the known input, $u(t)$, as truncated series:

$$x(t) = \sum_{k=0}^{P} \hat{x}_k\, \phi_k(t), \qquad u(t) = \sum_{k=0}^{P} \hat{u}_k\, \phi_k(t), \tag{5}$$

where $\hat{x}_k$ and $\hat{u}_k$ are series coefficients and $\{\phi_k\}$ are a set of given basis functions that span a space $\mathcal{X}$. In this work, we choose orthogonal Legendre polynomials of order $P$ (Canuto et al., 1988), where $P$ is a hyperparameter.
Collocation and quadrature.
In order to compute the coefficients of (5), we enforce the equation at a discrete set of collocation points $\{t_q\}$. Here, we choose Gauss-Lobatto nodes (Canuto et al., 1988), which include the initial point $t_0$. Introducing the vector of series coefficients $\hat{x}$ and the vector of evaluations at the quadrature points $\bar{x}$, with $\Phi_{qk} = \phi_k(t_q)$, the collocation problem can be solved in matrix form as

$$\bar{x} = \Phi\, \hat{x} \quad \Longleftrightarrow \quad \hat{x} = \Phi^{-1} \bar{x}. \tag{6}$$

We approximate the integral in (3) as a sum of residual evaluations over the quadrature points. Assuming that the input $u$ is represented in the same basis, the integrand at all quadrature points can be computed as a component-wise norm

$$\bar{r}_q = \big\| (\Phi'\hat{x})_q - f(\bar{x}_q, \bar{u}_q, \theta) \big\|^{2}, \tag{7}$$

where $\Phi'_{qk} = \dot{\phi}_k(t_q)$.
Fitting the input data.
Integral (4) entails evaluating the loss function at the quadrature points, which requires knowledge of the input data at those points. For the case when this is available, we propose a new direct training scheme, namely SNet. This is summarized in Algorithm 1.
If data at the quadrature points is not available, for instance when it is sparse, or if problem (2) does not admit a unique solution, a least-squares approach can be used. We approximate the integral by evaluating it at the available time points. The corresponding alternating coordinate descent scheme, SNet, is presented in Algorithm 2. We use fixed numbers of updates for, respectively, $\theta$ and $x$. Both are performed with standard routines, such as SGD. In the following sections, we study the consequences of a very low-data scenario on this approach. In our experiments, we use ADAM to optimize the parameters and a fixed interpolation order, but any other orders and solvers are possible.
Ease of time parallelization.
If the ODE is enforced explicitly at the collocation points, then the resulting discrete system can be seen as an implicit time-stepping method of order $P$. However, while ODE integrators can only be made parallel across the different components of the state, the assembly of the residual can be done in parallel also across time.
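A minimal sketch of this time-parallel residual assembly, using Chebyshev-Lobatto nodes as an easy stand-in for Gauss-Lobatto points and an assumed toy dynamics $f(x) = -x$:

```python
import numpy as np
from numpy.polynomial import legendre as L

# Given Legendre coefficients c for x(t), the residual
# r(t_q) = x'(t_q) - f(x(t_q)) at ALL quadrature points is just a pair of
# vectorized basis evaluations -- no sequential time stepping is involved.
P = 10
tq = -np.cos(np.pi * np.arange(P + 1) / P)   # Lobatto-type nodes on [-1, 1]
c = np.zeros(P + 1); c[1] = 1.0              # x(t) = P_1(t) = t
x_q = L.legval(tq, c)                        # all states at once
dx_q = L.legval(tq, L.legder(c))             # all derivatives at once
r = dx_q - (-x_q)                            # residual at every t_q
print(r.shape)
```

For $x(t) = t$ and $f(x) = -x$, the residual is simply $1 + t_q$, computed for every node in one shot; on a multi-dimensional state the same two products would be batched matrix operations.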
Memory cost.
It can be shown that, if an ODE admits a smooth solution, the approximation error of the SNet converges exponentially with $P$ (Canuto et al., 1988), thereby producing a very compact representation of an ODENet. Thanks to this property, $P$ is typically much lower than the equivalent number of time steps of explicit or implicit schemes with a fixed order. This greatly reduces the complexity and the memory requirements of the proposed method, whose solution can be evaluated at any time via (5) by storing only a few coefficients.
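The exponential decay of the approximation error with $P$ is easy to observe numerically; the smooth test function below is an arbitrary choice for illustration:

```python
import numpy as np
from numpy.polynomial import legendre as L

# Max error of a least-squares Legendre fit of a smooth function drops
# geometrically as the order P grows (spectral convergence).
t = np.linspace(-1.0, 1.0, 401)
f = np.sin(3.0 * t) * np.exp(-t)
errs = []
for P in (4, 8, 16):
    c = L.legfit(t, f, P)                      # only P + 1 coefficients
    errs.append(np.max(np.abs(L.legval(t, c) - f)))
print([f"{e:.1e}" for e in errs])
```

Each doubling of the order gains several orders of magnitude of accuracy, which is why a handful of coefficients can replace hundreds of stored time steps.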
5 Modeling of planar vehicle dynamics
Let us consider the system

$$\dot{\eta} = J(\eta)\,\nu, \qquad M\dot{\nu} + C(\nu)\,\nu + D\,\nu = \tau, \tag{8}$$

where $(\eta, \nu)$ are the states, $\tau$ is the control, $C$ is the Coriolis matrix, $D\nu$ is the damping force, assumed to be linear, and $J$ encodes the coordinate transformation from the body to the world frame (Fossen, 2011).
Identification of a surrogate model.
A model of the true system is built using a neural network for each matrix in (8). Each network consists of two layers, the first with a nonlinear activation; bias is excluded for some of the networks. For $J$, we use sine and cosine input features of the vehicle orientation. When inserted in (8), these networks produce an ODENet that is a surrogate model of the physical system. A fixed time horizon and batch size are used. All experiments are performed with fixed learning rates for ADAM (for all methods) and for SGD (only for SNet). The trajectories of the system are shown in Figure 1, for a single batch, as well as the learning curve.
Comparison of methods with full data.
The trajectories were sampled at 100 equally-spaced time points, and we compared the training performance of the novel and traditional training methods. We choose a fixed order for the spectral interpolation. For the SNet method, we use fixed numbers of SGD and ADAM iterations at each epoch, as outlined in Algorithm 2, and perturb the initial trajectory with small random noise. This perturbation prevents the exact convergence of Algorithm 1 during initialization, allowing the alternating coordinate descent algorithm to proceed. Table 1(a) shows that SNet outperforms BKPR-DoPr5 by a factor of 50, while producing significantly improved generalization. The speedup reduces to 20 for SNet, which however yields a further reduction of the testing MSE by a factor of 10, as can be seen in Figure 2.


Comparison of methods with sparse data.
We compared again the performance of the methods, with trajectories sampled at fewer time points. In the case of SNet, only a small number of points is needed for very accurate integration of the loss function, if such points coincide with the Gauss-Lobatto quadrature points. We found that 100 equally-spaced points produce a comparable result. Here, the points are reduced further. Table 1(b) shows that SNet preserves a good testing MSE, at the price of an increased number of iterations. We argue that this could be improved by a more specific Legendre interpolation of the data or by a more specific solver for (2). However, even with no modifications and only a fraction of the data, SNet is 10x faster than BKPR-DoPr5 with a fraction of its test MSE.
6 Learning a multi-agent simulation
Consider the multi-agent system consisting of $M$ identical kinematic vehicles:

$$\dot{\eta}_i = J(\eta_i)\,\nu_i, \qquad i = 1, \dots, M, \tag{9}$$

where $\eta_i$ are the states (planar position and orientation) and $J$ is the coordinate transform from the body to the world frame, common to all agents. The agents' velocities $\nu_i$ are determined by known arbitrary control and collision avoidance policies, plus an additional high-frequency measurable signal shared by all vehicles. We wish to learn their kinematics matrix $J$ by means of a neural network as in Section 5. The task is simpler here, but the resulting ODE has 30 states, coupled by the collision avoidance policy. We simulate the agents in series.


Results.
We use fixed learning rates for ADAM (for all methods) and for SGD (only for SNet), with a fixed time horizon and batch size. The learning curves for the high-data regime are in Figure 3. For the SNet method, training is terminated when the loss in (4) falls below a prescribed tolerance (a different tolerance is used in the sparsest-data case). Table 2 summarizes the results. SNet is the fastest method, followed by SNet, which is the best performing. BKPR-Euler falls behind in iteration time, with a substantially worse test MSE. Downsampling the data confirms the trend. BKPR-DoPr5 failed due to an underflow when computing the time step, so the solver tolerances (rtol, atol) had to be increased. Since the loss continued to increase, training was terminated early. Test trajectories are shown in Figure 4. Additional figures and details are in the Appendix.
7 Scope and limitations
Test time and cross-validation.
At test time, since the future outputs are unknown, an explicit integrator is needed. For cross-validation, the loss instead needs to be evaluated on a different dataset. In order to do so, one would need to solve the ODE forward in time. However, since the output data is available during cross-validation, a corresponding polynomial representation of the form (5) can be found and the relaxed loss (4) can be evaluated efficiently.
Non-smooth dynamics.
We have used the fact that the ODENet dynamics is smooth in order to take advantage of the exponential convergence of spectral methods. However, this might not hold in general. In such cases, the optimal choice would be to enrich the spectral element space with (possibly low-order) locally-supported basis functions (Canuto et al., 1988).
Topological properties.
As discussed in (Petersen et al., 2018), the set of functions generated by a fixed neural network topology does not possess favorable topological properties for optimization. We argue that one reason for the performance improvements shown by our algorithms is that the constraint relaxation proposed in this work can improve the properties of the optimization space.
Multiple ODEs: Synchronous vs Asynchronous.
The proposed method can be used for an arbitrary cascade of dynamical systems, as they can be encoded as a single ODE. In the special case when only the final state of one ODE (or its trajectory) is fed into the next block, e.g., as in (Ciccone et al., 2018), the method could be extended by means of $N$ smaller optimizations, where $N$ is the number of ODEs.
8 Related work
RNN training pathologies.
Among the first RNNs to be trained successfully were LSTMs (Greff et al., 2017), thanks to their particular architecture. Training an arbitrary RNN effectively is generally difficult, as standard RNN dynamics can become unstable or chaotic during training, which can cause the gradients to explode and SGD to fail (Pascanu et al., 2012). When RNNs consist of discretised ODEs, the stability of SGD is intrinsically related to the size of the convergence region of the solver (Ciccone et al., 2018). Since higher-order and implicit solvers have larger convergence regions (Hairer et al., 1993), following (Pascanu et al., 2012) it can be argued that our method has the potential to mitigate instabilities and hence make learning more efficient. This is supported by our results.
Unrolled architectures.
In (Graves, 2016), an RNN has been used with a stopping criterion, for iterative estimation with adaptive computation time. Highway (Srivastava et al., 2015) and residual networks (He et al., 2015) have been studied in (Greff et al., 2016) as unrolled estimators. In this context, (Haber & Ruthotto, 2017) treated residual networks as autonomous discrete ODEs and investigated their stability. Finally, in (Ciccone et al., 2018) a discrete-time non-autonomous ODE based on residual networks has been made explicitly stable and convergent to an input-dependent equilibrium, then used for adaptive computation.
Training stable ODEs.
In (Haber & Ruthotto, 2017; Ciccone et al., 2018), ODE stability conditions were used to train unrolled recurrent residual networks. Similarly, when using our method on the model of (Ciccone et al., 2018), ODE stability can be enforced by projecting the state weight matrices onto the Hurwitz-stable set. At test time, overall stability will also depend on the solver (Durran, 2010; Isermann, 1989). Therefore, a high-order variable-step method (e.g., DoPr5) should be used at test time in order to minimize the approximation error.
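One common projection of this kind (not necessarily the exact one used by the cited works) shifts a weight matrix so that its symmetric part becomes negative definite, which guarantees that all eigenvalues lie in the open left half-plane:

```python
import numpy as np

# Project A onto the Hurwitz-stable set by shifting its diagonal: since
# Re(lambda_i(A)) <= lambda_max((A + A^T) / 2), forcing the symmetric part
# below -margin forces every eigenvalue into the left half-plane.
def hurwitz_project(A, margin=0.1):
    S = 0.5 * (A + A.T)
    shift = max(np.linalg.eigvalsh(S).max() + margin, 0.0)
    return A - shift * np.eye(A.shape[0])

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Ah = hurwitz_project(A)
print(np.linalg.eigvals(Ah).real.max() < 0)
```

The shift is a conservative choice; tighter projections exist, but this one is cheap and can be applied after every weight update.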
Dynamics and machine learning.
A physics prior on a neural network was used by (Jia et al., 2018) in the form of a consistency loss with data from a simulation. In (De Avila Belbute-Peres et al., 2018), a differentiable physics framework was introduced for point-mass planar models with contact dynamics. (Ruthotto & Haber, 2018) used Partial Differential Equations (PDEs) to analyze neural networks, while (Raissi & Karniadakis, 2018; Raissi et al., 2017) used Gaussian Processes (GPs) to model PDEs. The solution of a linear ODE was used in (Soleimani et al., 2017), in conjunction with a structured multi-output GP, to model patient outcomes of continuous treatments observed at random times. (Pathak et al., 2017) predicted the divergence rate of a chaotic system with RNNs.
9 Conclusions
We proposed SNet, a novel approximation of the internal dynamics of ODENets in terms of a truncated Legendre series. This allows for a very compact representation using very few coefficients, thanks to its exponential convergence for smooth functions. In order to solve the associated optimization problem, a coordinate descent method is employed, where the series coefficients and network weights are updated alternately. When a good guess for the optimal trajectory can be produced from the training data, such as when the dataset is rich, the optimization converges very rapidly. SNet trains orders of magnitude faster than the state of the art and improves generalization. Faster and more accurate ODENets, such as SNet, are a step towards tackling larger and more complex problems.
References

Butcher & Wanner (1996)
Butcher, J., & Wanner, G. (1996).
Runge-Kutta methods: some historical notes.
Applied Numerical Mathematics, 22(13), 113–151.
URL https://linkinghub.elsevier.com/retrieve/pii/S0168927496000487 
Canuto et al. (1988) Canuto, C., Hussaini, M. Y., Quarteroni, A., & Zang, T. A. (1988). Spectral Methods in Fluid Dynamics. Springer-Verlag.
 Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural Ordinary Differential Equations. In NeurIPS.
 Ciccone et al. (2018) Ciccone, M., Gallieri, M., Masci, J., Osendorfer, C., & Gomez, F. (2018). Naisnet: Stable deep networks from nonautonomous differential equations. In NeurIPS.
 De Avila BelbutePeres et al. (2018) De Avila BelbutePeres, F., Smith, K. A., Allen, K., Tenenbaum, J. B., & Kolter, J. Z. (2018). Endtoend differentiable physics for learning and control. In NeurIPS.

Durran (2010)
Durran, D. R. (2010).
Numerical Methods for Fluid Dynamics.
Springer New York.
URL https://doi.org/10.1007/9781441964120 
Fossen (2011) Fossen, T. I. (2011). Handbook of marine craft hydrodynamics and motion control. John Wiley & Sons.
 Franklin et al. (2014) Franklin, G. F., Powell, J. D., & EmamiNaeini, A. (2014). Feedback Control of Dynamic Systems (7th Edition). Pearson.
 Gholami et al. (2019) Gholami, A., Keutzer, K., & Biros, G. (2019). ANODE: Unconditionally Accurate MemoryEfficient Gradients for Neural ODEs. Preprint at arXiv:1902.10298.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

Graves (2016)
Graves, A. (2016).
Adaptive Computation Time for Recurrent Neural Networks.
arXiv:1603.08983 [cs].
ArXiv: 1603.08983.
URL http://arxiv.org/abs/1603.08983 
Greff et al. (2017)
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., &
Schmidhuber, J. (2017).
LSTM: A Search Space Odyssey.
IEEE Transactions on Neural Networks and Learning Systems,
28(10), 2222–2232.
ArXiv: 1503.04069.
URL http://arxiv.org/abs/1503.04069 
Greff et al. (2016)
Greff, K., Srivastava, R. K., & Schmidhuber, J. (2016).
Highway and residual networks learn unrolled iterative estimation.
CoRR, abs/1612.07771.
URL http://arxiv.org/abs/1612.07771 
Haber & Ruthotto (2017)
Haber, E., & Ruthotto, L. (2017).
Stable Architectures for Deep Neural Networks.
arXiv preprint arXiv:1705.03341.
URL https://arxiv.org/abs/1705.03341 
Hairer et al. (1993) Hairer, E., Nørsett, S. P., & Wanner, G. (1993). Solving Ordinary Differential Equations I (2nd Revised Ed.): Nonstiff Problems. Berlin, Heidelberg: Springer-Verlag.
 Hastie et al. (2001) Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.

He et al. (2015)
He, K., Zhang, X., Ren, S., & Sun, J. (2015).
Deep Residual Learning for Image Recognition.
arXiv:1512.03385 [cs].
ArXiv: 1512.03385.
URL http://arxiv.org/abs/1512.03385 
Horn & Johnson (2012) Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis. New York, NY, USA: Cambridge University Press, 2nd ed.

Ioffe & Szegedy (2015)
Ioffe, S., & Szegedy, C. (2015).
Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift.
arXiv:1502.03167 [cs].
ArXiv: 1502.03167.
URL http://arxiv.org/abs/1502.03167 
Isermann (1989)
Isermann, R. (1989).
Digital Control Systems.
Springer Berlin Heidelberg.
URL https://doi.org/10.1007/9783642864179 
Jain & Medsker (1999) Jain, L. C., & Medsker, L. R. (1999). Recurrent Neural Networks: Design and Applications (International Series on Computational Intelligence). CRC Press.

Jia et al. (2018)
Jia, X., Karpatne, A., Willard, J., Steinbach, M., Read, J. S., Hanson, P. C.,
Dugan, H. A., & Kumar, V. (2018).
Physics guided recurrent neural networks for modeling dynamical
systems: Application to monitoring water temperature and quality in lakes.
CoRR, abs/1810.02880.
URL http://arxiv.org/abs/1810.02880 
Law et al. (2015) Law, K., Stuart, A., & Zygalakis, K. (2015). Data assimilation. Cham, Switzerland: Springer.
 Lecun (1988) Lecun, Y. (1988). A theoretical framework for backpropagation. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.) Proceedings of the 1988 Connectionist Models Summer School, CMU, Pittsburg, PA, (pp. 21–28). Morgan Kaufmann.
 Linnainmaa (1970) Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki.
 Pascanu et al. (2012) Pascanu, R., Mikolov, T., & Bengio, Y. (2012). On the difficulty of training recurrent neural networks. CoRR, abs/1211.5063.

Paszke et al. (2017)
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., & Lerer, A. (2017).
Automatic differentiation in pytorch.
In NIPS 2017 Workshop Autodiff. 
Pathak et al. (2017)
Pathak, J., Lu, Z., Hunt, B. R., Girvan, M., & Ott, E. (2017).
Using Machine Learning to Replicate Chaotic Attractors and
Calculate Lyapunov Exponents from Data.
Chaos: An Interdisciplinary Journal of Nonlinear Science,
27(12), 121102.
ArXiv: 1710.07313.
URL http://arxiv.org/abs/1710.07313 
Petersen et al. (2018) Petersen, P., Raslan, M., & Voigtlaender, F. (2018). Topological properties of the set of functions generated by neural networks of fixed size. arXiv preprint arXiv:1806.08459.

Raissi & Karniadakis (2018)
Raissi, M., & Karniadakis, G. E. (2018).
Hidden Physics Models: Machine Learning of Nonlinear
Partial Differential Equations.
Journal of Computational Physics, 357, 125–141.
ArXiv: 1708.00588.
URL http://arxiv.org/abs/1708.00588 
Raissi et al. (2017)
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017).
Numerical Gaussian Processes for Timedependent and
Nonlinear Partial Differential Equations.
arXiv:1703.10230 [cs, math, stat].
ArXiv: 1703.10230.
URL http://arxiv.org/abs/1703.10230 
Raissi et al. (2018)
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2018).
Multistep Neural Networks for Datadriven Discovery of
Nonlinear Dynamical Systems.
arXiv:1801.01236 [nlin, physics:physics, stat].
ArXiv: 1801.01236.
URL http://arxiv.org/abs/1801.01236 
Ross (2009) Ross, I. M. (2009). A primer on Pontryagin's principle in optimal control, vol. 2. Collegiate Publishers, San Francisco.
 Ross & Karpenko (2012) Ross, I. M., & Karpenko, M. (2012). A review of pseudospectral optimal control: From theory to flight. Annual Reviews in Control, 36(2), 182–197.

Runge (1895)
Runge, C. (1895).
Ueber die numerische Auflösung von Differentialgleichungen.
Mathematische Annalen, 46(2), 167–178.
URL http://link.springer.com/10.1007/BF01446807 
Ruthotto & Haber (2018)
Ruthotto, L., & Haber, E. (2018).
Deep Neural Networks Motivated by Partial Differential
Equations.
arXiv:1804.04272 [cs, math, stat].
ArXiv: 1804.04272.
URL http://arxiv.org/abs/1804.04272 
Siciliano et al. (2008) Siciliano, B., Sciavicco, L., Villani, L., & Oriolo, G. (2008). Robotics: Modelling, Planning and Control (Advanced Textbooks in Control and Signal Processing). Springer.

Soleimani et al. (2017)
Soleimani, H., Subbaswamy, A., & Saria, S. (2017).
TreatmentResponse Models for Counterfactual Reasoning with
Continuoustime, Continuousvalued Interventions.
arXiv:1704.02038 [cs, stat].
ArXiv: 1704.02038.
URL http://arxiv.org/abs/1704.02038 
Srivastava et al. (2015)
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015).
Highway Networks.
arXiv:1505.00387 [cs].
ArXiv: 1505.00387.
URL http://arxiv.org/abs/1505.00387 
Stengel (1994) Stengel, R. F. (1994). Optimal control and estimation. Dover Books on Advanced Mathematics. New York: Dover Publications.
 Werbos (1981) Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8  4.9, NYC, (pp. 762–770).

Zilly et al. (2016)
Zilly, J. G., Srivastava, R. K., Koutník, J., & Schmidhuber, J. (2016).
Recurrent Highway Networks.
arXiv:1607.03474 [cs].
ArXiv: 1607.03474.
URL http://arxiv.org/abs/1607.03474
Appendix A Vehicle dynamics simulation
The model is formulated in a concentrated-parameter form (Siciliano et al., 2008). We follow the notation of (Fossen, 2011). Recall the system definition:

$$\dot{\eta} = J(\eta)\,\nu, \qquad M\dot{\nu} + C(\nu)\,\nu + D\,\nu = \tau,$$

where $\eta = (x, y, \psi)$ are the states, namely, the $x$ and $y$ coordinates in a fixed (world) frame and the vehicle orientation $\psi$ with respect to this frame, and $\nu = (u, v, r)$ are the body-frame velocities, $u$, $v$, and angular rate, $r$. The input is a set of torques in the body frame, $\tau$. The kinematic matrix is

$$J(\eta) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix},$$

the mass matrix is $M = \mathrm{diag}(m, m, I_z)$, where $m$ is the vehicle mass and $I_z$ represents the rotational inertia. The Coriolis matrix is

$$C(\nu) = \begin{bmatrix} 0 & 0 & -m\,v \\ 0 & 0 & m\,u \\ m\,v & -m\,u & 0 \end{bmatrix},$$

and the damping force is $D\,\nu$, with $D$ linear. The mass, inertia, and damping parameters are set to fixed values. The input, $\tau$, comes from a Fourier series with a fixed fundamental amplitude.
Appendix B Multiagent simulation
The multi-agent simulation consists of $M$ kinematic vehicles:

$$\dot{\eta}_i = J(\eta_i)\,\nu_i, \qquad i = 1, \dots, M,$$

where $\eta_i = (x_i, y_i, \theta_i)$ are the states for each vehicle, namely, the $x_i$, $y_i$ positions and the orientation $\theta_i$ of vehicle $i$ in the world frame, while $\nu_i$ are the control signals, in the form of linear and angular velocities, $v_i$ and $\omega_i$. The kinematics matrix is

$$J(\eta_i) = \begin{bmatrix} \cos\theta_i & 0 \\ \sin\theta_i & 0 \\ 0 & 1 \end{bmatrix}.$$

The agents' velocities are determined by known arbitrary control and collision avoidance policies.
Training configuration.
The gains of the collision avoidance policy are set to small values. The shared signal is generated by a Fourier series with a fixed fundamental amplitude.
Test configuration.
For testing, the gains of the collision avoidance policy are increased, making the system less stable, and the remaining parameters are adjusted accordingly.
Appendix C Robustness of the methods
In the main paper we have demonstrated that the proposed SNet methods can train deep ODE models orders of magnitude faster, and with better generalization, than standard backpropagation through an ODE solver as well as the adjoint method. Since for moderate-scale ODEs the use of the adjoint becomes infeasible in terms of iteration time, in Section 6 we focused solely on comparing our methods with backpropagation through an ODE solver.
In Section 6, the use of a high-order variable-step method (DoPr5), allegedly providing accurate solutions, did not lead to good training results: the loss function continued to increase over the iterations. On the other hand, despite being substantially slower than our methods, the fixed-step forward Euler solver was successfully used for learning the dynamics of the 30-state system in the training configuration described in Appendix B. One should however note that, in this configuration, the gains of the collision avoidance policy (which couples the ODE) were set to small values. This makes the system simpler and more stable than with larger gains. As a result, if one attempts to train with the test configuration from Appendix B, where the gains are increased and the system is less stable, then backpropagating through Euler simply fails. Comparing Figures 3 and 5, it can be seen that the learning curves of our methods are unaffected by the change in the gains, while in the latter configuration BKPR-Euler fails to decrease the loss.
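This failure mode is easy to reproduce on the scalar test equation $\dot{x} = \lambda x$: forward Euler is stable only when $|1 + h\lambda| \le 1$, i.e., when $h\lambda$ lies in the unit circle centered at $-1$. A minimal sketch:

```python
import numpy as np

# Forward Euler on dx/dt = lam * x contracts iff |1 + h*lam| <= 1.
# For lam = -10, h = 0.25 gives h*lam = -2.5 (outside the stability
# circle) and the iteration blows up, while h = 0.05 decays correctly.
def euler_final(lam, h, T=10.0):
    x = 1.0
    for _ in range(int(T / h)):
        x = x + h * lam * x
    return abs(x)

print(euler_final(-10.0, 0.05) < 1e-6, euler_final(-10.0, 0.25) > 1e6)
```

The true solution decays in both cases; only the discretization, not the dynamics, is responsible for the divergence, which mirrors what happens when the coupled multi-agent system becomes too fast for the chosen step.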
The forward Euler method is known to have a small region of convergence. In other words, integrating very fast dynamics requires a very small time step, $h$, in order to provide accurate results. In particular, for the solver error to be bounded, the eigenvalues of the state Jacobian of the ODE need to lie inside the circle of the complex plane centered at $-1/h$ with radius $1/h$ (Ciccone et al., 2018; Isermann, 1989). Higher-order explicit methods, such as Runge-Kutta (Runge, 1895), have larger but still limited convergence regions. Our algorithms, on the other hand, are implicit methods, which have a larger region of convergence than recursive (explicit) methods (Hairer et al., 1993). We claim that this results in more stable and robust training. This claim is supported by our experiments.
Appendix D Gradient analysis
Consider the classic discrete-time RNN:

$$h_{k+1} = \sigma\!\left(W h_k + U i_k + b\right), \qquad k = 0, \dots, T-1. \tag{10}$$

Then, given a loss $\ell$, the following gradients are used during training (we consider a single batch element, so that the hidden states and gradients are vectors of the state dimensionality):

$$\frac{\partial \ell}{\partial W} = \sum_{k=0}^{T-1} \frac{\partial \ell}{\partial h_{k+1}}\, \frac{\partial h_{k+1}}{\partial W}, \tag{11}$$

where, for any $k < T$, the chain rule gives:

$$\frac{\partial \ell}{\partial h_k} = \frac{\partial \ell}{\partial h_{k+1}}\, \frac{\partial h_{k+1}}{\partial h_k}. \tag{12}$$
Iteration of (12) is the main principle of backpropagation through time (Goodfellow et al., 2016). A known fact is that iterating (12) behaves like a geometric series. Therefore, depending on the spectral radius of the Jacobian $\partial h_{k+1} / \partial h_k$, it can result in vanishing (radius smaller than one) or exploding (radius larger than one) gradients (Zilly et al., 2016).
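This geometric behaviour can be checked numerically: the norm of a product of $T$ identical Jacobians with spectral radius $\sigma$ scales like $\sigma^T$. The orthogonal factor below is an illustrative construction that fixes all singular values to the chosen scale:

```python
import numpy as np

# Norm of a T-fold Jacobian product: contracts for radius < 1 (vanishing
# gradients) and blows up for radius > 1 (exploding gradients).
rng = np.random.default_rng(1)

def bptt_norm(scale, T=50, n=4):
    Q = np.linalg.qr(rng.standard_normal((n, n)))[0]  # orthogonal factor
    J = scale * Q                                     # singular values = scale
    G = np.eye(n)
    for _ in range(T):
        G = J @ G                                     # iterate (12) T times
    return np.linalg.norm(G)

print(f"{bptt_norm(0.9):.2e}", f"{bptt_norm(1.1):.2e}")
```

After 50 steps, a radius of 0.9 has already shrunk the gradient by two orders of magnitude, while a radius of 1.1 has amplified it by two.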
We can now demonstrate that, by avoiding function compositions, our approach is immune to the exploding gradient problem. In particular, our gradients are fully time-independent and their accuracy is not affected by the sequence length. Recall the ODE:

$$\dot{x}(t) = f(x(t), u(t), \theta). \tag{13}$$

The following result is obtained:
Theorem 1. Assume that the Jacobian of $f$ has finite singular values. Then the gradients of the residual and relaxed losses used in Algorithm 2 have finite norms, independently of the length of the time horizon.
Proof.
Given the Legendre basis functions, $\phi_k$, define the ODE residual as:

$$r(t) = \dot{x}(t) - f(x(t), u(t), \theta),$$

for which the residual loss is $L_r = \sum_q \| r(t_q) \|^2$. Then, Algorithm 2 consists of the concurrent minimization of the relaxed loss function:

$$E(\hat{x}) = \ell(x) + \rho\, L_r,$$

with respect to the coefficients $\hat{x}$ of the Legendre polynomials, given $\theta$, and of the residual loss, $L_r$, with respect to $\theta$, given the set of coefficients $\hat{x}$. For the residual loss, the gradients are:

$$\frac{\partial L_r}{\partial \theta} = -2 \sum_{q} r(t_q)^{\top}\, \frac{\partial f}{\partial \theta}\big(x(t_q), u(t_q), \theta\big), \tag{14}$$

where there is no recursion over the previous values of the state, since the basis functions are given and the points $x(t_q)$ are treated as data. By assumption, the Jacobian of $f$ has finite singular values. Hence, by standard matrix norm identities (see Chapter 5 of Horn & Johnson (2012)), the gradient of $L_r$ with respect to $\theta$ has a finite norm for all $\theta$. Similarly, the absence of time recursions in the gradient for the update of $\hat{x}$ using the relaxed loss follows from the fact that the coefficients are independent variables, that $\theta$ is given, and that the basis functions are fixed. The claim then follows again from the assumption on the Jacobian of $f$. ∎
Note that the result of Theorem 1 cannot easily be achieved when training ODEs using backpropagation through the solver or the adjoint method, unless some gradient conditioning, such as clipping or batch normalization (Ioffe & Szegedy, 2015), is performed after each application of $f$. On the other hand, our result relies only on the gradient of $f$ being finite. This is trivial for a shallow network and, for a very deep $f$, the methods in (Ioffe & Szegedy, 2015; Srivastava et al., 2015; Ciccone et al., 2018; He et al., 2015) can be applied inside the architecture of $f$, if necessary. This fundamental difference with standard RNN training allows $f$ to have unrestricted Jacobian magnitudes, which are needed to effectively model instabilities and long- and short-term dynamics, similar to LSTMs (Greff et al., 2017).