Tempus aims to provide the most accurate time-series predictions on the market. It achieves this through massive parallelization, low-level optimization, and scalability in multiple domains, making use of all hardware thrown at it: it scales across computer clusters, multiple GPUs, and CPUs using CUDA, MPI, TBB, and OpenMP. I provide consultancy, maintenance, and administration services for Tempus; please contact me with enquiries or suggestions. The source code repository is available here.
A comparison of Tempus versus LightGBM on the XAUUSD 12h time frame, predicting 345 positions into the future, shown at the 115th and 345th positions:
2025-Jul-24 22:31:07.504972 [debug] 7a78d1fd000 ModelService.cpp:294 [validate] Position 115, level 0, step 0, actual 4.78291, batch predicted 4.72244, LGBM predicted 4.72174, last known 4.71862 batch MAE 0.0506433, MAE last-known 0.0516979 LGBM MAE 0.0535584, LGBM MAPE 1.21857pc, batch MAPE 1.15225pc, MAPE last-known 1.17624pc, batch alpha 2.08228pc, LGBM alpha -3.47389pc, current batch alpha 6.32244pc, current LGBM alpha 5.09979pc, batch correct predictions 62.931pc, batch correct directions 64.6552pcpc, LGBM correct predictions 49.1379pc, LGBM correct directions 53.4483pc
2025-Jul-24 22:31:07.508605 [debug] 7a78d1fd000 ModelService.cpp:294 [validate] Position 345, level 0, step 0, actual 9.14403, batch predicted 9.19745, LGBM predicted 9.20839, last known 9.19362 batch MAE 0.0644838, MAE last-known 0.0652292 LGBM MAE 0.068582, LGBM MAPE 1.25383pc, batch MAPE 1.1789pc, MAPE last-known 1.19253pc, batch alpha 1.15586pc, LGBM alpha -4.88875pc, current batch alpha -7.17171pc, current LGBM alpha -22.9539pc, batch correct predictions 59.5376pc, batch correct directions 60.4046pcpc, LGBM correct predictions 44.7977pc, LGBM correct directions 49.1329pc
You can see that Tempus outperforms LightGBM for regression forecasting in every cumulative measurement, with the best known hyperparameters, where Tempus uses LightGBM as its kernel function with similar hyperparameters. Tempus is therefore inherently slower than LightGBM, but provides predictions that are actually usable for financial purposes, e.g. price directions are predicted accurately in 64% and 60% of cases at the 115th and 345th positions respectively. These are preliminary results, and I believe Tempus can do better in the near future as the project matures.
I forked the MAGMA library (ICL, University of Tennessee) and optimized the random butterfly transform to provide the best found solution and a user-configurable number of iterations; it also uses the stream provided as an argument to the dgesv_rbt_async function. You can browse the modified code here.
Having the ideal kernel matrix available to train an SVM presents the user with an obvious issue arising from the original theory as described by Prof. Vapnik and Chervonenkis: the weights x for the model are calculated from the equation Ax = b, where A is the kernel matrix and b is the labels vector of the trained samples. When the kernel matrix fits the trained labels perfectly, weights of value 1 make Ax = b hold with zero error. A less-than-ideal kernel function, though, will produce a kernel matrix that deviates slightly from the ideal one, and that is to be expected; the weights vector x exists to compensate for a less-than-ideal kernel function.
What happens in reality, when searching for the best weights vector x, is that Ax should be validated not against b but against a blurred vector of labels b', in which each element is the average of its n neighboring elements, in order to achieve better generalization of the prediction method. This way the model produces forecasts with significantly higher precision on unseen data.
In Tempus, n is calculated as the SOLVE_RADIUS configuration coefficient multiplied by the kernel dimension m. For example, if the number of trained samples is 1000, the kernel is of dimensions 1000x1000, the labels are 1000x1, and SOLVE_RADIUS is 0.05, then each label is averaged with the 50 labels before and the 50 labels after it. This method experimentally showed a marked improvement in prediction accuracy.
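The label-blurring step described above can be sketched as follows. This is an illustrative reimplementation, not Tempus' actual code; the function name `blur_labels` and the default radius are my own choices:

```python
import numpy as np

def blur_labels(b, solve_radius=0.05):
    """Blur the labels vector b: each element becomes the average of
    itself and its neighbors within radius r = solve_radius * len(b),
    clipped at the vector's edges. Hypothetical sketch of the idea."""
    m = len(b)
    r = int(round(solve_radius * m))
    blurred = np.empty(m)
    for i in range(m):
        lo, hi = max(0, i - r), min(m, i + r + 1)
        blurred[i] = b[lo:hi].mean()
    return blurred

# With m = 10 and solve_radius = 0.2, each label averages over
# itself plus 2 neighbors on each side.
b = np.arange(10.0)
bp = blur_labels(b, solve_radius=0.2)
```

Validating Ax against `bp` instead of `b` is then a drop-in change in the solver's objective.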
The decomposition of a signal in frequency domain and subsequently modeling the resulting component is a process I refer to as spectral boosting. Spectral boosting produces many simple, predictable signal components out of a complex signal that, because of its high complexity, may seem random at first glance.
There are many methods for decomposing a signal; in Tempus I prefer variants of empirical mode decomposition, variational mode decomposition, wavelets, and sometimes a short-time Fourier transform. After a signal is decomposed, it must be possible to reassemble it so that it exactly matches the original signal. This very important property of signal decomposition methods is called perfect reconstruction. Some methods lack this characteristic, such as the continuous wavelet transform, while the discrete wavelet transform does have it. Reconstruction can be additive or multiplicative, depending on the decomposition method.
The decomposition method should be able to separate the predictable components of the signal from the rest of the (unpredictable) signal, which should be muted (i.e. neither modeled nor predicted). Usually this is the highest-frequency part of the spectrum, the one with the lowest mean absolute value of the autocorrelation function, and it is labeled as the decomposition residual or noise. If you predict this noisy component and add the result to the predictions of the other signal components, the overall quality of the final prediction will likely deteriorate. Therefore the residual component should be nulled right after decomposition.
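Selecting and nulling the residual component by the mean absolute autocorrelation criterion above could be sketched like this. The function names and the `max_lag` cutoff are my own illustrative choices, not Tempus' API:

```python
import numpy as np

def mean_abs_autocorr(x, max_lag=20):
    """Mean absolute value of the autocorrelation function up to
    max_lag; low values indicate an unpredictable (noisy) component."""
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0:
        return 0.0
    acf = [np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)]
    return float(np.mean(np.abs(acf)))

def null_residual(components, max_lag=20):
    """Zero out the least predictable component (lowest mean |ACF|)
    so it is neither modeled nor added back during reconstruction."""
    scores = [mean_abs_autocorr(c, max_lag) for c in components]
    idx = int(np.argmin(scores))
    out = [c.copy() for c in components]
    out[idx][:] = 0.0
    return out, idx
```

A smooth oscillatory component scores high on this measure, while white noise scores near zero, so the noise is the one nulled.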
I suppose a similar rule applies to boosting (or ensembling) in other domains, such as the dynamic time slicing method Tempus implements, but I haven't tested it yet.
I recommend standardizing only the shortest time frame of data used for training the model, using the following formula: Dt = Di - mean(Di), Dt = Dt / median(abs(Dt)) / Rt, where Dt denotes the data used to train the model, Di is the original input data, and Rt is the preferred range of the training data.
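The formula above translates directly into code. A minimal sketch, with `standardize` being a hypothetical name:

```python
import numpy as np

def standardize(di, rt=1.0):
    """Standardize input data per the formula in the text:
    subtract the mean, then divide by the median absolute deviation
    and by the preferred range Rt."""
    dt = di - np.mean(di)
    mad = np.median(np.abs(dt))
    return dt / mad / rt

# For [1, 2, 3, 4, 5]: mean is 3, centered data is [-2, -1, 0, 1, 2],
# median absolute value is 1, so with rt=1 the output equals the
# centered data.
out = standardize(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
```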
The ideal kernel matrix of a support vector machine is anti-symmetric and should not be positive semi-definite, contrary to what is stated in the original SVM publication by Vapnik and Chervonenkis. The proof follows below.
My work on Tempus over the past five years has included developing a multi-layered SVM, using a support vector regressor as the kernel of an SVM. The support vector regression machine used as a kernel is trained on a dataset produced from the ideal kernel matrix for the given problem; this dataset is called a support vector manifold. You can see an implementation that generates a manifold dataset for another SVM kernel here, and one that generates a manifold for a gradient-boosted model (Microsoft LightGBM) here.
The ideal kernel matrix is anti-symmetric and generated from both differences of labels (L1 - L2 as well as L2 - L1, where K(1,2) = L1 - L2 = -K(2,1) = -(L2 - L1)) and from concatenations of the feature vectors: L1 - L2 -> F1 § F2 and L2 - L1 -> F2 § F1, where § denotes concatenation of the feature vectors. You can see the implementation here.
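A minimal sketch of generating such a manifold dataset follows. The name `manifold_dataset` is hypothetical; Tempus' linked implementation may structure this differently:

```python
import numpy as np

def manifold_dataset(features, labels):
    """Build the anti-symmetric 'support vector manifold' training set:
    for every ordered pair (i, j), the sample is the concatenation
    Fi § Fj with target Li - Lj, so the learned kernel satisfies
    K(i, j) = -K(j, i) by construction."""
    X, y = [], []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            X.append(np.concatenate([features[i], features[j]]))
            y.append(labels[i] - labels[j])
    return np.array(X), np.array(y)
```

Training any regressor (another SVM, or LightGBM) on (X, y) yields a learned kernel function approximating the ideal one.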
Having modified the original theory to use an anti-symmetric kernel matrix also imposed the need to change the prediction process. The ideal kernel matrix Ki is augmented by adding the trained labels to every row of Ki:
         Ki                           L
Ki00, Ki01, Ki02 ... Ki0n     L0, L1, L2 ... Ln
Ki10, Ki11, Ki12 ... Ki1n     L0, L1, L2 ... Ln
Ki20, Ki21, Ki22 ... Ki2n  +  L0, L1, L2 ... Ln
...                           L0, L1, L2 ... Ln
Kim0, Kim1, Kim2 ... Kimn     L0, L1, L2 ... Ln

You can see the implementation of the prediction method in Tempus here.
EDIT: The ideal kernel matrix cannot be positive semi-definite, and support vector machine theory should be amended not to require the Karush-Kuhn-Tucker conditions, since having an ideal kernel function removes the need for a weights vector. The kernel distances can be negative and, consequently, the weights vector is always set to 1. The prediction process for a 3x3 kernel matrix therefore follows the augmented-row form shown above.
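With unit weights, the augmented-row prediction reduces to a simple average: since the ideal kernel gives K(x, xi) ≈ L(x) - L(xi), each augmented term K(x, xi) + L(xi) is an estimate of L(x). A sketch under that assumption; `predict_row` is a hypothetical name, not Tempus' actual function:

```python
import numpy as np

def predict_row(kernel_row, trained_labels):
    """Predict a label from one row of kernel distances to the training
    samples, with the weights vector fixed at 1: average the augmented
    terms K(x, xi) + L(xi), each of which estimates L(x)."""
    return float(np.mean(np.asarray(kernel_row) + np.asarray(trained_labels)))

# If the true label is 5 and the trained labels are [1, 2, 3], an ideal
# kernel yields distances [4, 3, 2] and the prediction recovers 5.
p = predict_row([4.0, 3.0, 2.0], [1.0, 2.0, 3.0])
```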
Prof. Emanouil Atanasov proposed that the ideal kernel matrix could be turned into a positive semi-definite matrix by applying exp(-lambda*A) element-wise to the distance matrix, where lambda is a parameter. I don't see the need for that because it makes the prediction process impossible.
Prediction alpha is measured by comparing the L1 error of a model to the L1 error produced by using the last-known (nearest entropy-wise) value at the time of training that model for predicting the variable. For a model predicting every hour's price with a prediction horizon of 10 minutes, the last-known value would be the index value at T - 10 minutes, where T is the starting time of the predicted price; i.e., when predicting the price starting at 14:00, the price at 13:50 is taken as the last-known price. Example:
MAE = mean(|Pv - Av|) is the mean absolute error of the predicted values Pv versus the actual values Av
MAE_LK = mean(|Lv - Av|) is the mean absolute error of using the last-known values Lv as predictions
Ap = 100 * (MAE_LK / MAE - 1) is the prediction alpha in percentage points: the mean absolute error of using the last-known values divided by the L1 error of the model, minus one, multiplied by one hundred.
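The three formulas above fit in a few lines of code; `prediction_alpha` is an illustrative name of my own:

```python
import numpy as np

def prediction_alpha(predicted, actual, last_known):
    """Prediction alpha in percent: how much better the model's L1
    error is than naively repeating the last-known value."""
    mae = np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))
    mae_lk = np.mean(np.abs(np.asarray(last_known) - np.asarray(actual)))
    return 100.0 * (mae_lk / mae - 1.0)

# Model errors of 0.5 versus last-known errors of 1.0 give +100% alpha.
ap = prediction_alpha([10.5, 9.5], [10.0, 10.0], [11.0, 9.0])
```

A negative alpha means the model is worse than the last-known baseline, as in the LGBM figures quoted above.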
Converting analog data and high-precision digital sampling to time-series data of particular resolution inevitably leads to loss of information. Here I describe the problem, its solution or alleviation, and its application to financial data.
A well-known sampling issue is aliasing: frequency components above the Nyquist frequency fold back below it, generating erroneous information that was not present in the original data before sampling.
Faint blue lines show aliasing
The same issue occurs when processing financial data. When a filter needs to be applied to financial data sampled at 1 ms (1 kHz), the standard resolution in the MetaQuotes 5 terminal, or at 10 ns (100 MHz), the highest resolution of Deutsche Börse's T7 trading software, processing hardware requirements impose the need to resample the index data to a lower resolution, i.e. a lower frequency (e.g. 1 Hz). To lose as little information as possible and prevent aliasing, a TWAP strategy is applied to every frame of the destination sampling frequency, while keeping the destination sampling frequency as high as processing resources permit. FIR filters for extracting monthly, weekly, or daily patterns can become quite large, while IIR filters are not usable in this situation because their phase offset is undetermined.
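For a uniformly sampled input, the per-frame TWAP reduces to averaging each destination frame, which acts as an anti-aliasing box filter before decimation. A simplified sketch (real tick data arrives irregularly and needs true time weighting; `twap_downsample` is a hypothetical name):

```python
import numpy as np

def twap_downsample(samples, factor):
    """Downsample by averaging each destination frame of `factor`
    input samples (TWAP over a uniformly sampled input). The tail
    remainder that does not fill a whole frame is dropped."""
    n = len(samples) - len(samples) % factor
    frames = np.asarray(samples[:n], dtype=float).reshape(-1, factor)
    return frames.mean(axis=1)

# Seven 1 ms samples downsampled by 2 yield three 2 ms TWAP frames.
out = twap_downsample([1, 2, 3, 4, 5, 6, 7], 2)
```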
I do not recommend using frequencies lower than 1 Hz when filtering financial indexes.
To reduce the complexity of the trained labels, I use the following method: from the analyzed label, subtract the closest known value in the analyzed domain context (time duration for time series, GPS-coordinate distance for GIS data). The label set modified this way becomes simpler (less complex) and more predictable.
Predictability (inverse of complexity) is measured using the mean absolute of the autocorrelation function of the labels set.
For example, suppose at 12:30 I am predicting the time-weighted average price starting at 13:00 and ending at 14:00. From the TWAP price I subtract the last known price of the index at 12:30, i.e. Ld = L - k.
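The differencing step Ld = L - k, and its inversion after prediction, can be sketched as follows; the function names are my own illustrative choices:

```python
def simplify_labels(labels, last_known):
    """Subtract the closest known value from each label (Ld = L - k);
    the differenced label set is simpler and more predictable."""
    return [l - k for l, k in zip(labels, last_known)]

def restore_labels(diffed, last_known):
    """Invert the transform after predicting: L = Ld + k."""
    return [d + k for d, k in zip(diffed, last_known)]

# TWAP labels [101, 103] with last-known prices [100, 102] give
# differenced labels [1, 1], and restoring recovers the originals.
diffed = simplify_labels([101.0, 103.0], [100.0, 102.0])
```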
To produce a causal EMD transform usable in a real-time streaming-data scenario, Prof. Atanasov and I chose to use low-frequency-targeted FIR filters for the sifting process instead of interpolating the input signal. Components with lower entropy (measured by the amount of dissipation of the signal across FFT bins) and higher autocorrelation (self-comparison at different offsets) are extracted using a non-linear optimizer.
To convert the VMD implementation available here into an online streaming transformation, I store the state of the accumulators and carry it forward as new data arrives. You can see a sped-up implementation in Tempus' continuous VMD transform implementation here.
A higher-than-appropriate gamma will cause predictions to gravitate towards the training-set mean, as seen in the left figure below. The figures show a screenshot of the LibSVM Java applet from C.-C. Chang's website predicting labels marked with dense white dots, while the blue squares show the training set used to train the support vector machine before forecasting. The kernel used is RBF, with a cost of 8000 and an epsilon of 0. The features and labels start from 0 and increment by 1.
Fig. 1 - SVR with RBF kernel, gamma is too high
Fig. 2 - SVR with RBF kernel, gamma is appropriate
I'm currently working on this. You can find the preliminary source code of a C++ implementation here.
Update: Boosting on residuals across several support vector machine regression models, where the residuals produced by one model are fed to the following SVM model, is not giving the results I expected, so I have put the idea on ice for the time being.
On the other hand, bagging, i.e. splitting the input data into contiguous slices while preserving the original order (not shuffled), gives a major speed improvement without much sacrifice in accuracy. Shuffling the input data, though, lowers accuracy significantly.
I tune hyperparameters for every chunk of data separately in order to maximize accuracy. Overlapping chunks seem to give minor accuracy improvements at a significant cost in training speed. You can see SVM chunking in action in Tempus. Thanks to Prof. Atanasov for the bagging suggestion, though his suggestion was for shuffled chunks.
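The contiguous, order-preserving chunking described above can be sketched in a few lines; `contiguous_chunks` is an illustrative name, and per-chunk hyperparameter tuning would happen inside the loop over the returned slices:

```python
def contiguous_chunks(n, chunk_size):
    """Split range(n) into contiguous, ordered slices (no shuffling),
    as in chunked SVM training. The last chunk may be shorter."""
    return [slice(i, min(i + chunk_size, n)) for i in range(0, n, chunk_size)]

# Ten samples in chunks of four: indices 0-3, 4-7, 8-9, in order.
chunks = contiguous_chunks(10, 4)
# for c in chunks:
#     train_svm(X[c], y[c])  # hypothetical per-chunk training call
```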