The Langevin Approach: An R Package for Modeling Markov Processes

We describe an R package developed by the research group Turbulence, Wind energy and Stochastics (TWiSt) at the Carl von Ossietzky University of Oldenburg, which extracts the (stochastic) evolution equation underlying a set of data or measurements. The method can be directly applied to data sets with one or two stochastic variables. Examples for the one-dimensional and two-dimensional cases are provided. This framework is valid under a small set of conditions which are explicitly presented and which imply simple preliminary test procedures to the data. For Markovian processes involving Gaussian white noise, a stochastic differential equation is derived straightforwardly from the time series and captures the full dynamical properties of the underlying process. Still, even in the case such conditions are not fulfilled, there are alternative versions of this method which we discuss briefly and provide the user with the necessary bibliography.


Introduction
When dealing with stochastic series of data measurements, standard statistical tools, such as mean and centered moments, are able to catch the essential features of the distribution of observed values. Last end, sufficient high-order moments will retrieve a good approximation of the probability density function (PDF) associated with the stochastic process. However, PDFs are not able to fully characterize the dynamics underlying the process. A typical example is the Gaussian distribution: if the stochastic variable assumes values according to a Gaussian distribution, the dynamics producing such distribution of values can be as simple as an Ornstein-Uhlenbeck process (Risken 1996) but it may also be the result of a much more complicated dynamics, as we exemplify below. Thus, while knowing the distribution of observed values is important as a first approach to the data, uncovering the complete dynamics of the process provides a much deeper insight into the system, which cannot be accessed through standard statistical tools.
Starting from a stochastic differential equation, a process can be statistically reconstructed through simple stochastic integration. The inverse problem however is much more complicated: would a set of measurements be enough for a bottom-up approach to infer the underlying dynamics of the process? The short answer is yes, there are cases where this is possible. In this paper we present the long answer implemented as a package for R (see R Core Team 2015), which can be easily used, composing a method which we call the Langevin Approach. This approach was introduced by Peinke and Friedrich in the late 1990s (Friedrich and Peinke 1997;Siegert et al. 1998) and further developed in the last decades. For a review see Friedrich et al. (2011).
We start in Section 2 by describing the mathematical background on which our method is based. Here we also discuss the necessary conditions for the method to be applicable. The R functions implementing our approach are presented in Section 3 together with two examples for one and for two stochastic variables. Additional remarks on our method are discussed in Section 4, providing the user with supplementary issues and suggestions for more general situations, in particular those where the conditions supporting our mathematical background do no longer hold. A selected bibliography is also provided. Finally, Section 5 concludes the paper.
2. Stochastic equations: from data to models and back 2.1. The Langevin model A wide range of dynamical systems can be described by a stochastic differential equation, the (non-linear) Langevin equation (cf. Risken 1996;Hänggi and Thomas 1982;Van Kampen 2007).
Consider a general stochastic trajectory X(t) in time t. The time derivative of the system's trajectory dX dt can be expressed as the sum of two complementary contributions: one being purely deterministic and another one being stochastic, governed by a stochastic "force" Γ(t), defined as a δ-correlated Gaussian white noise, i.e., Γ(t) = 0 and Γ(t)Γ(t ) = 2δ(t − t ). While the deterministic term is defined by a function, D (1) (X) the stochastic contribution is weighted by another function, D (2) (X), yielding the evolution equation of X where the square root is taken for consistency, as will be clear below. We assume stationary time series here, so D (1) and D (2) are not time dependent but we show briefly how nonstationary time series can be treated in Section 4.
The Langevin equation should be interpreted as follows: for every time t where the system meets an arbitrary but fixed point X in phase space, X(t + τ ) with small τ is defined by the deterministic function D (1) (X) and the stochastic function D (2) (X)Γ(t), through trivial (Euler) stochastic integration : where η(t) is a normally distributed random variable. Here we use the Itô picture of stochastic integrals, for further details see Gardiner (2004).
Functions D (1) (X) and D (2) (X) are usually called drift and diffusion coefficients respectively and they can be as simple as constants or linear functions of X, as e.g., the Ornstein-Uhlenbeck process, as well as more complicated nonlinear functions, typically polynomials up to a given order. In particular if D (2) is explicitly depending on x, the case is called multiplicative noise.
In all cases, through substitution of the selected functions into Equation 2 one is able to generate samples of series having the same statistical features and obeying the same dynamics. Figure 1: (a) Sketch of a stochastic process in time governed by a cubic drift and quadratic diffusion contributions and (b) its corresponding probability density function (PDF). Though the series shows a bistable dynamics (cubic drift) the PDF follows a Gaussian function, equivalent to an Ornstein-Uhlenbeck process (see text). Figure 1a shows an illustration of a time series obtained through integration of Equation 2 for a cubic drift D (1) (X) = −X 3 + X, and a quadratic diffusion, D (2) (X) = X 2 + 1. Notice that these drift and diffusion coefficients describe a non-trivial dynamics, namely the underlying deterministic process, i.e., D (2) ≡ 0, has two attractive fixed points at X = ±1. The processes tends to converge to one of two stable states being at the same time perturbed by a stochastic fluctuation (D (2) = 0) which is able to push the system from one stable fixed point to the other. As shown in Figure 1b, despite this non-trivial dynamics, the PDF is a Gaussian distribution with zero mean and unit standard deviation, the same PDF as for a simple Ornstein-Uhlenbeck process with D (1) = −X and D (2) = 1. This is one of many possible examples that illustrates the deep insight, which an evolution equation like in Equation 1 can provide and which is not obtained by looking at a density distribution, see Appendix A for further details.

From stochastic data to the Langevin model
As explained previously, it is easy to generate data through the integration of a stochastic equation, such as Equation 2. More difficult is the inverse problem, to derive functions D (1) and D (2) from given data.
A condition to derive the drift and diffusion numerically is that the time-steps τ of the set of X-values are small enough (see Honisch and Friedrich (2011) for details). If the system is at time t in the state x = X(t) the drift can be calculated for small τ by averaging over the difference, X(t + τ ) − X(t), of the system state at t + τ and the state at t. Check Equation 2 above. This average is the first conditional moment of the series and it can be mathematically proven that its time derivative yields the drift coefficient. Similarly, computing the second Circles indicate the numerical results while the red dashed line indicates the theoretical coefficient, used when generating the synthetic data. Here 10 7 data points from the series illustrated in Figure 1a, were used for computing the averaged conditional moments.
Therefore, having a series of data, one estimates the drift and diffusion by computing the averages of the first and second power of the differences between X(t + τ ) and x: where · represents the average over time t. Mathematically the drift and diffusion coefficients are defined as (Risken 1996) which means that they are given by derivatives of the corresponding conditional moments M (n) (x, τ ) with respect to τ . In many cases, for a fixed x, the conditional moments depend linearly on τ for the smallest range of τ values and consequently the drift and the diffusion coefficients at this state x are estimated solely by the quotient between the corresponding conditional moment and τ in this range.

Back to the data
The Langevin Approach summarized previously is applied under a few conditions, though, as we discus afterwards, when such conditions are not fulfilled in many cases it is still possible to overcome that and apply an alternative approach which also retrieves the dynamics underlying the stochastic process. For the completeness of this paper and for the consistency of the application of our R functions, we advise the user to briefly test the data. Three conditions should in general be tested as a preliminary checking procedure and two further conditions can afterwards be tested as cross-checking.
The first condition is that the data series is stationary. Indeed, the averages for computing the conditional moments have to be taken over all If the series is non-stationary these averages are in principle not meaningful.
The second condition is that the process should be Markovian, i.e., the present state should depend on the previous state solely. Mathematically it means an equivalence between twopoint statistics, p(X(t + τ ), X(t)), and any higher-order statistics, p(X(t + τ ), X(t), ..., X(t − nτ )). This equivalence leads to the following equality between conditional probabilities of finding a value of This should hold for any positive integer n. In practice, one tests the equality for three-point statistics (n = 1) only and assumes that if the equality holds it will also hold for higher-order statistics, since all correlations shall decrease monotonically with time.
To test if the process is Markovian one can also use alternatively the Wilcoxon test (Wilcoxon 1945), in case one is dealing with single variable stochastic processes. For details see Renner et al. (2001, Appendix A).
The third condition to be tested comes from a mathematical result called Pawula Theorem (Risken 1996), from which it follows a necessary condition for Equation 1 to be valid: the fourth conditional moment must be constant, i.e., D (4) = 0. To test that one computes its derivative with respect to the time-lag, the fourth coefficient and check if it vanishes, i.e., if it is small compared to the diffusion coefficient: This coefficient is also useful for computing the numerical error of the diffusion coefficient (Lind et al. 2010).
The tests whether conditions two and three hold ensure that Γ(t) (see Equation 1) is δcorrelated and Gaussian distributed.
If all these conditions are fulfilled the Langevin Approach can be carried out and the two functions, drift coefficient D (1) and diffusion coefficient D (2) , can be derived from the given data. With the derived coefficients two additional cross-checking tests can be done.
The first one is to check if the stochastic force in Equation 1 fulfills the two conditions of a δ-correlated Gaussian white noise. To that end, one substitutes in Equation 2 the derived D (1) (X) and D (2) (X) and solves it with respect to η(t): Taking τ as the time-step of the observed time-series and substituting in X(t + τ ) and X(t) successive values of that series one re-obtains a series for η(t) which should be normally distributed.
The second cross-checking test is to substitute in Equation 2 the derived D (1) (X) and D (2) (X) coefficients, generate synthetic series and compare if its increments have the same distribution as the original series for a fixed τ spanning from the time-step of the original series up to two or more orders of magnitude larger.
Some extra care should be taken if the derived D (1) (X) and D (2) (X) coefficients show linear drift and quadratic diffusion forms as this is also the case for every Langevin process if the sampling interval is large compared to the relaxation time of the process. Riera and Anteneodo (2010) presented a method for cross-checking in this case.
Notice that, though the fulfillment of all such conditions through the proposed preliminary tests and cross-checking tests guarantees that the Langevin Approach can be applied, the rejection of one or more of these tests is still no reason for avoiding this approach. In Section 4 we will come back to this issue.

Implementation and examples
In this section we present the implementation of the Langevin Approach describing the two available R functions, Langevin1D and Langevin2D. The function Langevin1D deals with single time-series while Langevin2D should be used for two-dimensional cases, when one has two stochastic variables to be analyzed simultaneously.
The one-dimensional case deals with an evolution equation similar to Equation 1 and the two-dimensional case comprehends two stochastic variables, X 1 (t) and X 2 (t), governed by: where clearly now the drift function 2 ) is a two-dimensional vector and the diffusion coefficient is a 2 × 2-matrix given by D (2) = gg T , i.e., D (2) ij = k g ik g jk . Similar to the one-dimensional case the integration of Equation 9 follows from a simple Euler scheme leading to: 2 (X 1 , X 2 ) + √ τ g 11 (X 1 , X 2 ) g 12 (X 1 , X 2 ) g 21 (X 1 , X 2 ) g 22 (X 1 , X 2 ) where η 1 (t) and η 2 (t) are two independent normally distributed random variables.
In our implementation the conditional moments M (n) (x, τ ), Equation 3, are estimated by dividing the state space of x in N intervals, or bins, (I 1 , ..., I N ) and calculating the mean values for each interval I i : For estimating the drift and diffusion coefficients from the conditional moments we insert Equation 2 into Equation 11 and apply the conditional averages for n = 1, 2 leading to: Important to notice is that for M (2) (x, τ ), Equation 13, a term quadratic in D (1) (x) and τ has to be considered. We estimate drift and diffusion coefficients from the slope of a weighted linear regression of Equations 12 and 13.
The implementation of the functions heavily relies on the C++ linear algebra library Armadillo (Sanderson 2010) for which RcppArmadillo and Rcpp provide the integration with R (Eddelbuettel and François 2011; Eddelbuettel and Sanderson 2014). We choose Armadillo as it results in fast code especially for large data sets and has an easy readable syntax. The functions Langevin1D and Langevin2D use OpenMP (Dagum and Menon 1998) if available to take advantage of shared memory multiprocessing. Here we parallelize the evaluation of the drift and diffusion coefficients for the bins as their evaluation is independent for each bin.
In the following subsections we present one-and two-dimensional examples of Langevin processes and walk through the analysis based on the framework described in the previous section.
3.1. Langevin1D: analyzing one-dimensional data sets As an example we integrate the Langevin equation illustrated in Figure 1a with cubic drift and quadratic diffusion, namely The presented package provides the function timeseries1D to do the integration using an Euler integration scheme: R> library("Langevin") R> sf <-1000 R> set.seed(4711) R> x <-timeseries1D(N = 1e7, d11 = 1, d13 = -1, d22 = 1, d20 = 1, sf = sf) Extracting drift and diffusion coefficients from the generated time series is done by the function Langevin1D. Here two parameters that are important for the estimation have to be given as arguments.
The first one is the number of bins dividing the variable space x in discrete bins at which drift and diffusion are estimated. This integer should not be so large that each bin does no longer include a significant number of points (typically ∼ 100) and also not so small that no dependence of the drift and diffusion on the state variable can be observed.
The second parameter is the vector steps to calculate the conditional moments for different τ values (Equation 3). The conditional moments will be computed for each bin and for each step. For each bin, a linear fit is computed for all steps in steps. Typically a vector of up to ten steps is given in samples (= τ · sf).

R> ests <-Langevin1D(x, bins, steps)
From the resulting list ests, plots of the estimated drift and diffusion coefficients can be generated (see Figure 2). Here we use plotrix (Lemon 2006) to add errorbars.
Therefore we concentrate on cross-checking the estimated drift and diffusion coefficients. For checking if D (4) (X) is small compared to D (2) (X) (Pawula Theorem) we use the function summary which also computes the ratio between D (4) and (D (2) ) 2 : The result shows that D (4) (X) is smaller than 0.5% of the squared diffusion coefficient, indicating the necessary condition of the Pawula Theorem holds. To this end we fit a cubic function to the estimated drift coefficient and a quadratic function to the diffusion coefficient:

R> summary(ests)
R> estD1 <-coef(lm(D1~mean_bin + I(mean_bin^2) + I(mean_bin^3), weights = 1/eD1)) R> estD2 <-coef(lm(D2~mean_bin + I(mean_bin^2), weights = 1/eD2)) The resulting coefficients are used to generate a new time series with timeseries1D: We want to emphasize here that the Langevin Approach does not require the drift and the diffusion coefficients to be of any particular functional form, from the estimated coefficients one could directly integrate a stochastic time series which can be used to calculate the increments. We fit the estimated coefficients to polynomials only to be able to use the function timeseries1D for the integration. From the original and the reconstructed time series we now calculate PDFs of the increments for different τ and plot them to inspect their agreement visually: R> plot(1,1, log = "y", type = "n", xlim = c (-11, 12) Figure 3 shows the output: there is indeed good agreement of both increment PDFs for a wide range of τ values. Therefore we can assume that our estimated drift and diffusion coefficients describe the process sufficiently.
Notice once again that while the PDF of the series generated by Equation 14 is the same as the one of the simple Ornstein-Uhlenbeck process, dx dt = −x(t) + Γ(t), our Langevin Approach is able to uncover the correct dynamics with a bistable drift and a non-constant diffusion. See Appendix A.

Langevin2D: analyzing two-dimensional data sets
As a two-dimensional example we integrate the coupled Langevin equations in Equations 9 for a particular choice of the drift and diffusion coefficients, namely (Siegert et al. 1998) where a is a constant. Figure 4a shows the integrated trajectory (X 1 , X 2 ) for a = 0, a case where no stochastic contribution is present, whereas in Figure 4b the same trajectory is plotted now with stochastic forces having a constant amplitude of a = 0.05.
The integration is performed by timeseries2D. Drift and diffusion functions are full cubic and quadratic polynomials respectively and the elements a ij of the matrices are defined by the corresponding equations for the drift and diffusion terms (see Equations 9 and 10): Figure 4: (a) Trajectory (X 1 (t), X 2 (t)) from Equations 15 with a = 0 and (b) the same trajectory integrating the same equations with non-zero stochastic terms (a = 0.05). For plotting 10 6 resp. 10 5 data points where used.

D
(1) Estimating the drift and diffusion coefficients is done by Langevin2D, here the same rules for bins and steps apply as for the one-dimensional case. The results shown in Figure 5 are generated by the following command lines: The numerical results can be properly fitted through the functions used for the integration in Equations 15, namely: D

A glimpse beyond the Langevin package
The two examples exposed above show cases where all conditions are fulfilled. When analyzing real empirical data sets this is often not the case: one or more of the conditions under which the Langevin Approach is applied are not met. Still, in the last years we developed different alternatives and extensions to this approach to overcome specific situations in stochastic data analysis. In this section we briefly describe these alternatives and extensions.
One first problem that researchers face is the non-stationary character often appearing in real data. Here, one of two approaches may be possible. One is to ascertain if for "shorter" time-windows of the data series stationarity may be assumed. In case the data set can be decomposed in a series of time-windows which may overlap, each one having more or less constant statistical moments of the observable, the Langevin approach can be applied separately to each one of them, yielding a set of drift and diffusion coefficients, one for each time-window. In the end one extracts one drift and one diffusion coefficient, both functions of the observable and also of time.
Another possibility to handle non-stationary data sets is to check if they can be conditioned to other observables. In that case, considering the periods of the data sets associated to a particular value of the conditioning observable may be itself stationary. This is the case of the stochastic series measured of wind turbines (Wächter et al. 2011;Lind et al. 2014). The power output of one wind turbine or the loads applied to it by the wind field are two observables whose measurement series are by themselves non-stationary. The wind velocity is the observable driving those properties and it is also non-stationary. However, we have shown that both wind power production (Wächter et al. 2011) and instantaneous loads (Lind et al. 2014) can be analyzed through the Langevin Approach if we conditioned both the drift and the diffusion coefficients to each particular admissible value of the wind speed.
The second condition listed above is the Markov property. When the series of measurements fails to fulfill the Markov tests described above, it cannot be reconstructed through stochastic Euler integration since the next state cannot be estimated from the present state alone (see Equation 2). This happens, for instance, when a Markov process is spoiled by additional additive noise when a measurement is taken (see Kleinhans et al. 2007). While the process alone, X(t), is Markovian, the actual measurement, which retrieves X(t) + Y (t), does not fulfill the Markov property. In such cases the limits computed for the coefficients D (1) and D (2) diverge (see Equation 4): when τ → 0 the conditional moment for the measured values (numerator) does not vanish. Still, it is frequently possible to obtain the correct drift and diffusion coefficients for the Markov process X(t) through simple changes of their estimates (Böttcher et al. 2006;Lehle 2011Lehle , 2013Scholz et al. 2015). In cases of correlated noise where Γ (t) = 0 holds the drift coefficient D (1) (X) can still be reconstructed correctly.
A third problem that may appear during preliminary tests of empirical data is the nonvanishing fourth coefficient D (4) . As stated above in Section 2.3, according to Pawula Theorem (Risken 1996) the fourth conditional moment must be independent of the time-lag τ . If not, one cannot assume that the stochastic process is governed by a Langevin equation, Equation 1. However, in such cases, although no evolution equation can be extracted and therefore the estimated functions D (1) and D (2) have not the meaning of drift and diffusion contributions, one can still use both to provide valuable insight about the system being analyzed. One example is the work of Rinn et al. (2012) on in-situ analysis of the elastic features of a mechanical beam structure for realistic excitations with correlated noise as it appears in realworld situations. They could show that the slope of the drift coefficient D (1) is a sensitive indicator of the damage and compared to frequency based approaches, like power spectra, which estimate changes of the eigenfrequency of the structure, it is even more sensitive to small damages.
Finally, it is also important to stress that, while the functions of the presented package were prepared for analyzing data series as processes in time, the Langevin Approach can be adapted for analyzing processes in scale. In fact, when the process is not Markovian in time, violating Equation 5, there is the possibility that it is Markovian in "scale". What does this mean? It means that the increments ∆ τ introduced above follow a Markovian process in τ i.e., in time-lags but are instationary. Such analysis in scale is able to reproduce e.g., turbulence energy cascades (Friedrich and Peinke 1997;Stresing and Peinke 2010) or ocean rogue waves (Hadjihosseini et al. 2014).
More details on all these extensions and alternatives to the Langevin Approach can be found in Friedrich et al. (2011).

Discussion and conclusions
In this paper we present an R package for stochastic data analysis that is able to extract the stochastic evolution equations of physical properties from sets of their measurements.
The introduced functions serve as a framework to analyze one-and two-dimensional time series. They provide estimation of drift and diffusion coefficients describing the deterministic and the stochastic part of the analyzed process respectively. Integrating Langevin processes numerically enables one for cross-checking the obtained result and for generation of synthetic data sets.
Through illustrative examples we have shown that the Langevin evolution equation is able to uncover complex dynamics, even in cases when the associated statistics is identical to many other stochastic processes.
The presented package can be straightforwardly applied by R-users and it implies only a few preliminary tests to ascertain if all conditions on which the Langevin Approach is built are fulfilled. In case they are not, we briefly explain how to overcome them with simple extensions to the method that were already successfully applied in several applications .
Still, additional improvements of the presented functions are possible. For instance, instead of using the common average bin value when performing the binning of the data, one can incorporate a kernel-based regression of such values (Lamouroux and Lehnertz 2009) or a maximum likelihood framework (Kleinhans 2012) for estimating the drift and diffusion functions. Moreover, the approach can also be applied to data having low sampling rates (Honisch and Friedrich 2011).
given by the standard normal distribution i.e., a Gaussian distribution with zero mean and unit variance.
To that end, we start with one important remark concerning the evolution equation of one stochastic variable X, Equation 1: this equation is related to an another evolution equation, namely the one of the probability density function (PDF) of X, so-called Fokker-Planck Equation (Risken 1996): The stationary solution ( ∂P ∂t = 0) of the one-dimensional Fokker-Planck is given by Risken (1996): For the simple Ornstein-Uhlenbeck process, governed by Equation 1 with D (1) = −x and D (2) = 1, the stationary PDF reduces to P 0 in Equation 16.
One could, however, consider a much more complex dynamics such as the one exemplified in this paper, with a bistable (cubic) drift coefficient and a non-constant diffusion, depending quadratically on the stochastic variable X: Here, D (1) has two stable fixed points at ±b, with a maximum amplitude between them proportional to a, while D (2) has a minimum value c and a broadness proportional to 1/d.
Substituting the cubic drift and the quadratic diffusion, given in Equations 19, into the stationary solution, Equation 18, and integrating, yields the stationary solution: As one sees, the solution in Equation 20 has, in general, not only a Gaussian part, like Ornstein-Uhlenbeck processes, but also a polynomial part with an exponent depending on all parameters of D (1) and D (2) . However, if the exponent is exactly zero, the polynomial part vanishes and the stationary solution reduces to the Gaussian distribution.
In the example used in Section 3.1 with a = b = c = d = 1, this is the case (see Figures 1 and  2 and Equation 14).
In general, the one-point statistic in the stationary regime given by Equation 18 yields Gaussian distributions even in more complex dynamics than the one here chosen. One only needs to have a drift coefficient given by one polynomial of odd degree n > 0 and simultaneously have a diffusion coefficient given by a polynomial of degree n − 1. In that case, whatever general expression both coefficients have, it is always possible to find a combination of their parameter values for which the quotient D (1) /D (2) in the stationary solution reduces to a linear function in x yielding the PDF of a Gaussian distribution. The two-point statistic, P (X(t)|X(t − τ )), however is able to distinguish between sets of (D (1) , D (2) ) yielding the same one-point statistic.
The ambiguity of one-point statistics in characterizing the dynamics of stochastic processes in general, motivates the Langevin Approach implemented in our R package. Our approach has the advantage of being parameter free: since it computes numerically D (1) and D (2) without any given Ansatz, it can easily distinguish between higher-order drift and diffusion coefficients.