Neural Network Research to Predict Gap Risk for Financial Assets

I set out to explore different methodologies applied to Neural Networks to predict gap risk on financial assets.

This is a blog I aim to update on a weekly basis sharing my journey for this research project.

Motivation For This Post

This project stems from my wish to keep learning after completing, in June 2024, a 6-month Professional Certification in Machine Learning and Artificial Intelligence at Imperial Business School.

I continue experimenting with neural networks on this project. I am privileged to be guided by Ali Muhammad, who lectured me during my certification at Imperial.

In this project we use different neural network approaches to estimate gap risk on the price of financial assets, and each approach is held in a subdirectory in this repo. The projects in this repo may slightly deviate from this objective as I explore and research associated predictions that help me build towards the end goal.

This project is work in progress and this page serves as a log for this amazing journey.

Open Source Code Available

PyTorch-powered project stack

Work in progress:

CNN training by encoding price time series into Gramian Angular Field (GAF) images

Application of Recurrent Neural Networks with Long Short-Term Memory (LSTM) models

Get Github Repo

Before I started this blog

I had a feed-forward network with two convolutional layers and three fully connected layers running in a Jupyter notebook. I tried to solve a regression problem: predicting next-day prices from a financial time series. The model was trained on Silicon Valley Bank (SIVB) and its predictions validated on Silvergate Bank (SICP); both entities entered bankruptcy proceedings in 2023.

The stock time series are encoded into images using Gramian Angular Fields (GAF), which represent every pair of time points through either the sum (GASF) or the difference (GADF) of their angular encodings. This approach enabled me to capture the time dependencies in prices and enhance the explainability of results via CNNs.
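
To make the encoding step concrete, here is a minimal NumPy sketch of a Gramian Angular Field. It is an illustration only: the function name `gramian_angular_field` is hypothetical, the `sample_range` argument is assumed to play the role of the `gaf_sample_range` setting mentioned later in this log, and it is written from the textbook definition rather than taken from the library class used in the project.

```python
import numpy as np


def gramian_angular_field(series, method="summation", sample_range=(-1.0, 1.0)):
    """Encode a 1-D series as a Gramian Angular Field image (illustrative helper)."""
    x = np.asarray(series, dtype=float)
    low, high = sample_range
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("a constant series cannot be rescaled")
    # Min-max rescale into sample_range, which must lie inside [-1, 1] for arccos.
    scaled = (x - x.min()) / span * (high - low) + low
    phi = np.arccos(np.clip(scaled, -1.0, 1.0))  # polar-angle encoding of each point
    if method == "summation":   # GASF: cos(phi_i + phi_j)
        return np.cos(phi[:, None] + phi[None, :])
    if method == "difference":  # GADF: sin(phi_i - phi_j)
        return np.sin(phi[:, None] - phi[None, :])
    raise ValueError(f"unknown method: {method}")


# Toy prices, not real market data.
prices = [10.0, 10.5, 10.2, 11.0]
gasf = gramian_angular_field(prices, method="summation")
gadf = gramian_angular_field(prices, method="difference")
```

A window of n prices becomes an n×n image, which is why the length of the time series limits the image sizes that can be tested.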

Poor results on the CNN led me to start considering different architectures and study RNNs.

At this point I started posting to this blog …

Week 1

  • I train and evaluate the network to solve a regression problem, so I evaluate R^2. However, since we predict financial asset prices, I am particularly interested in whether the predicted value matches the actual value to 1 decimal place, i.e. whether their absolute difference is at most 0.1. I refer to the share of predictions that pass this test as Threshold Accuracy:
    • (torch.abs(predicted_tensor - actual_tensor) <= 0.1).sum() / total
  • The CNN results are not optimal, with Threshold Accuracy in the low 20s (%)
  • I had thus started researching LSTMs and preparing a simple LSTM model to understand them better
  • After meeting with Ali, I have decided that an approach that is not “fail fast” can bring benefits to my edification and eventually this research even if the results are bad
  • Thus I have paused the development on the RNN and I am digging deeper on the CNN
  • I have refactored the CNN from a Jupyter notebook into Python scripts and helper function modules. The model continues to yield unstable results
  • The correlation between encoded images drops to 60%, whereas the correlation between the actual price time series used to build these images is >99%. I need to research and understand this difference by encoding images with different parameters
  • I intend to run a grid search for the above-mentioned image generation algorithm
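
The two metrics above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's code: the function names and the toy tensors are assumptions, and the 0.1 tolerance corresponds to the "1 decimal place" comparison described in this week's notes.

```python
import torch


def threshold_accuracy(predicted, actual, tolerance=0.1):
    """Share of predictions whose absolute error is at most `tolerance`."""
    within = torch.abs(predicted - actual) <= tolerance
    return within.float().mean().item()


def r2_score(predicted, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = torch.sum((actual - predicted) ** 2)
    ss_tot = torch.sum((actual - actual.mean()) ** 2)
    return (1.0 - ss_res / ss_tot).item()


# Toy values for illustration only.
actual = torch.tensor([1.00, 2.00, 3.00, 4.00])
predicted = torch.tensor([1.05, 1.95, 3.50, 4.30])

accuracy = threshold_accuracy(predicted, actual)  # two of four predictions are within 0.1
r2 = r2_score(predicted, actual)
```

The toy example also shows how the two metrics can tell different stories: R^2 stays high because the errors are small relative to the spread of the actual values, while only half of the predictions pass the 0.1 threshold.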

Week 2

  • I have re-trained and validated several scenarios for the CNN using GAF images. Unfortunately I had to stop because the results were hard to track. The analysis is therefore incomplete.
  • The results for these simulations so far are more encouraging: a validation dataset with ~60% correlation between the training and validation GAF images yields [Threshold Accuracy at 1 decimal place = 44%, R^2=46%, MSE=3%].
  • I still need to dig into the correlation between these images for different datasets, but it’s clearer which hyperparameters help
  • I asked my professor for guidance on how to better track results, and the suggestion was mlflow. I will spend the next week setting it up.

Week 3

  • I run an mlflow server locally and I have set up a public mlflow server, running in a Docker container, that serves my results from storage and the database. Unfortunately, for now it serves from a cheap Azure SQL database so it’s not the fastest to show results
  • Mlflow was particularly useful to visualize the results from encoding images with different parameters. I compared scenario results between encoding the time series into images using MarkovTransitionField vs GramianAngularField (GAF). I obtain the best results (results and parameters used are stored in Mlflow gaprisk-experiment-005) with the following key encoding and model parameters:
    • GAF summation method
    • gaf_sample_range (-1, 0.5)
    • MinMax(-1,0) scaler
    • dropout 0.25 to 0.5
  • Several runs did not converge, training endlessly through the 10k-15k epochs. I have now added a PyTorch learning rate scheduler (ReduceLROnPlateau in min mode) to dynamically ratchet the learning rate down with a patience of 300 epochs. For 32×32 images, training is abandoned if the loss does not improve for 300 epochs.
  • I hit a wall: at the best Threshold Accuracy result, the CNN gets stuck at a local minimum ~0.15%. I am using multiple regularization methods:
    • nn.BatchNorm1d
    • Leaky ReLU
    • momentum
    • He (Kaiming) weight initialization tested at 0 and range (0, 0.5)
    • Dropout
  • Thus I have started to search for alternative approaches that may help me reach the global minimum, the first being an increase in the size of the GAF images so that each represents a larger time series cohort.
  • I have also found that performance during model training is suboptimal: my RTX 3090 Ti GPU is humming at 30-50% capacity. I suspect the root of the issue is moving tensor data to the CPU to perform interim calculations during training and to log data to mlflow.
  • MLflow has provided clarity on the inconsistency of results: there may be a bug in my calculation of R^2. I am going to refactor the code, making it more GPU-optimal along with fixing this bug.
  • I have added to the backlog re-running scenarios with optimizer and loss function combinations other than Adam and MSELoss.
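
The learning rate ratchet and the "abandon training" rule described this week can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the toy linear model, the separate `lr_patience` and `stop_patience` arguments and their small demo values are all illustrative (the log mentions a patience of 300 epochs), and only `torch.optim.lr_scheduler.ReduceLROnPlateau` itself is a real PyTorch class.

```python
import torch
from torch import nn


def train_with_plateau_scheduler(model, inputs, targets, max_epochs=200,
                                 lr_patience=10, stop_patience=30):
    """Full-batch training that lowers the LR on plateaus and stops when stalled."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=lr_patience)
    loss_fn = nn.MSELoss()
    best_loss, stale_epochs, epochs_run = float("inf"), 0, 0
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())  # lowers the LR once the loss has plateaued
        epochs_run += 1
        if loss.item() < best_loss:
            best_loss, stale_epochs = loss.item(), 0
        else:
            stale_epochs += 1
            if stale_epochs >= stop_patience:
                break  # abandon training: no improvement for stop_patience epochs
    return best_loss, epochs_run, optimizer.param_groups[0]["lr"]


# Toy regression problem, for illustration only.
torch.manual_seed(0)
demo_model = nn.Linear(3, 1)
demo_x = torch.randn(64, 3)
demo_y = demo_x.sum(dim=1, keepdim=True)
best_loss, epochs_run, final_lr = train_with_plateau_scheduler(demo_model, demo_x, demo_y)
```

Keeping the stop patience larger than the scheduler patience gives the lowered learning rate a chance to act before training is abandoned; whether the project uses the same value for both is not something this sketch assumes.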

Access The MLFlow Server

MLFlow Server

Username and Password: visitor

Week 4

  • In order to refactor the code on tensors, I realize I am lacking some important PyTorch concepts on tensor operations. I am taking a few days to study this interesting course and this one.

Week 5

  • I have refactored the code to keep all calculations on the GPU unless moving them is necessary. However, mlflow logging requires metrics on the CPU and this slows down training.
  • I have confirmed the R^2 and Threshold Accuracy calculations are correct, cross-checking R^2 with torcheval.metrics.functional.regression.r2_score. Since R^2 is calculated at 64-bit precision, it suggests poor predictions when compared with the 1 decimal place Threshold Accuracy (=~50%), but not when compared with the 2 decimal place Threshold Accuracy (=~10%). Other error measures like RMSE are not as extreme.
  • Regarding the GPU’s low utilization rate, I have confirmed the bottleneck is not the DataLoader. For the short dataset I use, and for the hyper-parameters that yield the best results, GPU utilization is highest with DataLoader num_workers=0, even with larger batches (best performing batch_size=512). Disabling mlflow logging, which moves data to the CPU, increases GPU utilization to 100% as expected. GPU utilization is 20-40% when mlflow logs.
  • I have started running the Threshold Accuracy analysis for larger images
  • I am now researching an embedding model suitable for univariate analysis.
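
One possible way to reduce the cost of moving metrics to the CPU on every epoch is to buffer them on the device and transfer them in batches. The sketch below is an assumption about how this could be done, not what the project implements: `MetricBuffer` is a hypothetical helper, and the flushed values would still need to be passed to the logging backend afterwards.

```python
import torch


class MetricBuffer:
    """Collects scalar metric tensors on their device and transfers them in one go."""

    def __init__(self):
        self._values = {}

    def add(self, name, value):
        # detach() drops the autograd graph; no device-to-host copy happens here.
        self._values.setdefault(name, []).append(value.detach())

    def flush(self):
        # One stack and one .cpu() call per metric, instead of one sync per epoch.
        history = {name: torch.stack(values).cpu().tolist()
                   for name, values in self._values.items()}
        self._values.clear()
        return history


device = "cuda" if torch.cuda.is_available() else "cpu"
buffer = MetricBuffer()
for epoch in range(3):
    loss = torch.tensor(float(epoch), device=device) * 0.5  # stand-in for a training loss
    buffer.add("train_loss", loss)
history = buffer.flush()  # e.g. flush every N epochs, then log the returned lists
```

The trade-off is that metrics appear in the tracking UI with a delay of up to one flush interval.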

Week 6

  • Completed testing the network with different image sizes. Using the best hyperparameters and model parameters, I have found that, due to the short time series in this dataset, I can only run image sizes 32, 64 and 128. 32×32 performed best at 48% Threshold Accuracy and 42% R^2 for the 1 decimal place prediction calculation, whilst the 2 decimal place Threshold Accuracy continues to underperform at 4%. Interestingly, to reach training loss <0.1 for the larger images, I had to remove the ReduceLROnPlateau or abandon training if convergence did not happen during training.
  • Meeting with my professor this week was productive as always. I have now adjusted the backlog priority on the back of this conversation.
  • Refactored to switch the mlflow context on/off: back to 95% GPU utilization but no mlflow logging 🙁

Week 7

  • Refactored the code to adjust the learning rate dynamically during training by mixing ReduceLROnPlateau Min with a LR reset rate, which helps the training loss bounce off local minima with the objective of reaching the global minimum. Training with dropout 0.25 and weights init=0 was stuck at a minimum loss of 0.16. The above-mentioned strategy has helped me bring this loss down to 0.02, but prediction Threshold Accuracy at 1 d.p. remains at 41%. Training with no dropout, the model achieves a loss of 0.000003 but the lack of generalization reduces prediction accuracy by over 6%.
  • I’ve enhanced the network to predict the classification for the next day price (i.e. next day price class for each window is price up or down). See Mlflow results:
    • Test Data (SIVBQ): Next day price classification class Threshold Accuracy is 95%
    • Evaluation Data (SICP): Where next day price prediction to 1 decimal place is 41%, its corresponding next day price classification class Threshold Accuracy is 50%, and it exhibits a bearish bias reflecting the dataset used to train the model.
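
The idea of combining a reduce-on-plateau rule with a learning rate reset can be sketched as below. This is one reading of the approach described this week, not the project's implementation: the class `PlateauResetScheduler` is hypothetical, it re-implements the plateau logic in plain Python rather than wrapping PyTorch's ReduceLROnPlateau, and the rule "reset to the initial rate once the reduced rate would fall below `min_lr`" is an assumption.

```python
import torch


class PlateauResetScheduler:
    """Reduce the LR on plateaus; once it would drop below min_lr, reset it."""

    def __init__(self, optimizer, factor=0.5, patience=10, min_lr=1e-5):
        self.optimizer = optimizer
        self.base_lrs = [group["lr"] for group in optimizer.param_groups]
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0
        self.resets = 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.bad_epochs = loss, 0
            return
        self.bad_epochs += 1
        if self.bad_epochs <= self.patience:
            return
        self.bad_epochs = 0
        for group, base_lr in zip(self.optimizer.param_groups, self.base_lrs):
            new_lr = group["lr"] * self.factor
            if new_lr < self.min_lr:
                new_lr = base_lr  # reset: a large step to try to leave the local minimum
                self.resets += 1
            group["lr"] = new_lr


# Tiny demonstration with a constant (stalled) loss.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = PlateauResetScheduler(optimizer, factor=0.1, patience=0, min_lr=0.005)
lrs = []
for loss in [1.0, 1.0, 1.0]:
    scheduler.step(loss)
    lrs.append(optimizer.param_groups[0]["lr"])
```

In the demonstration the rate goes from 0.1 to 0.01 on the first stalled epoch and back to 0.1 on the next, because a further reduction would fall below `min_lr`.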

Week 8

  • During my attendance at Unity Unite 2024 Barcelona, I came across a discussion on the use of a wrapped Adam optimizer: AdamW with cyclical learning rates and a cosine learning rate scheduler. I implement this approach, following this implementation, to test for better generalization and smoother convergence. The optimizer does not seem to differ much from the existing Adam implementation for PyTorch, except that it separates weight decay from the batch gradient calculations. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart, and normalizes the weight decay hyperparameter value according to the length of the restart period.
    • Unfortunately I observe that this approach, combined with the MSELoss loss function, does not converge to a loss <0.02, and the dynamic ReduceLROnPlateau Min with a LR reset rate performs better in this regard.
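
For readers who want to try a similar setup, the sketch below uses PyTorch's built-in `torch.optim.AdamW` and `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`. It is a swapped-in approximation, not the implementation followed in this project: in particular, the built-in scheduler does not normalize weight decay by the restart period, and the toy model and hyperparameter values are assumptions.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # toy model for illustration
base_lr = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-2)
# Restart the cosine cycle every 5 epochs; the LR never goes below eta_min.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=5, T_mult=1, eta_min=1e-6)

lrs = []
for epoch in range(10):
    lrs.append(optimizer.param_groups[0]["lr"])  # LR used during this epoch
    optimizer.zero_grad()
    loss = model(torch.ones(2, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The recorded learning rates decay along a cosine curve for five epochs and then jump back to `base_lr`, which is the restart that lets training leave the region it had settled in.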

Taking Stock

Weeks 1 – 8

  • Training vs Evaluation time series correlation is 99%, but drops to 60% for the encoded GASF images
  • Regression Model:
    • Key encoding and model parameters that achieve the best next day price prediction Threshold Accuracy, at a loss <0.025:
      • GAF summation method
      • gaf_sample_range (-1, 0.5)
      • MinMax(-1,0) scaler
      • dropout 0.25 to 0.5
      • Adam optimizer and MSELoss loss function
    • A custom pytorch ReduceLROnPlateau Min with a learning rate reset rate allows the model to reach global minima.
    • The best Threshold Accuracy results are achieved via the following regularization methods to generalize:
      • nn.BatchNorm1d
      • Leaky ReLU
      • momentum
      • He (Kaiming) weight initialization tested at 0 and range (0, 0.5)
      • Dropout
    • The next day price prediction results worth mentioning are listed below. We conclude that minimizing the training loss erodes model generalization.
      • Min Model loss < 0.025:
        • Encoded image test dataset next day prediction at 1 decimal place 67%, R^2 = 78%
        • Encoded image evaluation dataset next day prediction at 1 decimal place 44%, 2 decimal places 4%, R^2 = 35%
      • Min Model loss = 0.26
        • Encoded image test dataset next day prediction at 1 decimal place 58%, R^2 = 78%
        • Encoded image evaluation dataset next day prediction at 1 decimal place 46%, 2 decimal places 4%, R^2 = 47%
  • Classification Model:
    • The next day higher-or-lower price prediction reaches its global minimum at a loss <0.000003, with an evaluation dataset score of 50% and a test dataset score of 52%

Week 9

  • I have completed enhancing the code to train the model with concatenated time series. Due to the price difference between stocks, I rebase the concatenated time series to a stock price starting at 100, using the log return from one data point to the next.
  • The log-price correlation metric (image below) is not representative for our comparison purposes with highly volatile short time series, because it only captures tendency but not the gamma of price: see the Comerica (CMA) time series, with 92% correlation vs SIVBQ, whilst SIVBQ drops in price by an additional 50% in the period. Therefore I am building a dynamic time warping (DTW) metric to compare stock similarity. This metric will also enable measuring similarity for time series of different sizes.
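
The rebasing step can be illustrated with a short NumPy sketch. It assumes one particular way of joining the series, which the log does not spell out: each stock contributes only its log returns, so the price level of the second stock is dropped at the junction. The function name and the toy prices are hypothetical.

```python
import numpy as np


def rebase_concatenated(price_series, base=100.0):
    """Chain the log returns of several price series into one series starting at `base`."""
    # Log return between consecutive points of each individual series.
    log_returns = [np.diff(np.log(np.asarray(prices, dtype=float)))
                   for prices in price_series]
    all_returns = np.concatenate(log_returns)
    # Cumulative log return, with 0 for the starting point.
    cumulative = np.concatenate([[0.0], np.cumsum(all_returns)])
    return base * np.exp(cumulative)


# Two toy stocks at very different price levels, each rising 10% per step.
stock_a = [10.0, 11.0]
stock_b = [200.0, 220.0, 242.0]
rebased = rebase_concatenated([stock_a, stock_b])
```

Although the two toy stocks trade at 10 and 200, the rebased series is a smooth 100, 110, 121, 133.1, so the network sees returns rather than an artificial jump where the series are joined.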

Week 10

  • The dynamic time warping – DTW – matrix below shows the distance for each pair of stocks, the lower the distance the greater the similarity.
  • As found in the prior week, there is a clear disconnect between log price correlation and log price DTW distance. I start investigating this issue.
| Stock Pair Tickers | Log Price DTW Distance | Log Price Pearson Correlation |
| --- | --- | --- |
| KEY-CFG | 461 | 0.90 |
| FITB-CFG | 632 | 0.86 |
| SIVBQ-ALLY | 1944 | 0.98 |
| FITB-ALLY | 2090 | 0.91 |
| SIVBQ-SICP | 2500 | 0.93 |
  • Model predictions are far from what would be expected when considering log price correlations. I research similarity indexes to compare encoded GAF image inputs and their respective log price time series, including:
    • DTW Distance of log price series, a similarity metric which accounts for the series’ speed variation.
    • Pearson correlation: the misalignment between DTW distances and log price correlation coefficients is particularly noticeable as shown above for the FITB-CFG pair.
| Stock Pair Tickers | Log Price DTW Distance | Log Price Pearson Correlation | Encoded Image Mean Pearson Correlation | Structural Similarity (SSIM) by Tensor BxCxXxY |
| --- | --- | --- | --- | --- |
| KEY-CFG | 461 | 0.90 | 0.92 | 0.7988 |
| FITB-CFG | 632 | 0.86 | 0.55 | 0.7222 |
| SIVBQ-ALLY | 1944 | 0.98 | 0.84 | 0.4821 |
| FITB-ALLY | 2090 | 0.91 | 0.70 | 0.4810 |
| SIVBQ-SICP | 2500 | 0.93 | 0.65 | 0.3705 |
    • Structural Similarity Index Measure (SSIM), which extracts the structure of an image. It is computed using PyTorch’s implementation, which efficiently handles multi-channel data without requiring explicit iteration over channels (unlike the scikit-image library on CPU).
    • Mean Squared Error (MSE) provides a pixel-wise comparison of the average squared difference between corresponding pixels.
    • Cosine Similarity.
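
To show why DTW distance and Pearson correlation can disagree, here is a minimal NumPy sketch of both metrics. It is a textbook, exact O(n·m) dynamic programming version of DTW with an absolute-difference cost, written for illustration; it is not the library used in this project, and the function names and toy series are assumptions.

```python
import numpy as np


def dtw_distance(a, b):
    """Exact dynamic time warping distance with absolute-difference cost."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])


series = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]  # same shape, different level
stretched = [1.0, 2.0, 2.0, 3.0, 4.0]  # same shape, different speed

correlation_shifted = pearson_correlation(series, shifted)
dtw_shifted = dtw_distance(series, shifted)
dtw_stretched = dtw_distance(series, stretched)
```

Pearson correlation ignores a level shift, so the shifted pair correlates perfectly while its DTW distance is large; conversely, DTW absorbs the change of speed in the stretched pair and can compare series of different lengths.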

Week 11

  • Model predictions are far from what would be expected when considering log price correlations. I research similarity indexes to compare encoded GAF image inputs and their respective log price time series, including:
    • DTW Distance of log price series, a similarity metric which accounts for the series’ speed variation.
    • Pearson correlation: the misalignment between DTW distances and log price correlation coefficients is particularly noticeable as shown in the images in Week 10 for the FITB-CFG pair.

Week 12

  • I seek to compare last week’s similarity metrics to the neural network’s Threshold Accuracy.
  • I have implemented a few more changes to train and evaluate:
    • Network’s Loss Function: Taking a cue from the results in this research paper, I have tested training the network with Mean Absolute Error (MAE). The paper goes further and suggests training with a 10-day stride and a 20-day overlap between successive windows.
      • I implement this loss function in PyTorch as nn.L1Loss().
      • In my case predictions performed better with a stride of 1. I will need to test this further as it may overfit the data.
      • I also refactored the code to a less exhaustive windowing approach than what I had, with a 25-day overlap between windows.
    • Regularization: I notice the model fails to generalize with these new settings. I have replaced the ReLU activation function with Swish, which I came across in this Kaggle competition. The PyTorch implementation is SiLU. It improves the results, in particular evaluation R^2, but the Threshold Accuracy metric is similar to that of ReLU.
  • I obtain the Threshold Accuracy metric for divergent pairs of stocks:
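| MLFlow Link | Training-Eval Stocks | DTW Distance | Input Images MSE | Input Images SSIM | Evaluation Threshold Accuracy at 0.1 (%) | R^2 | Model Loss |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Click Here | PWBK-OZK | 1291 | 1.97 | 0.05 | 22.2 | -0.07 | 0.010 |
| Click Here | SIVBQ-PWBK | 6038 | 2.19 | 0.05 | 16.6 | -0.22 | 0.013 |
| Click Here | ZION-KEY | 514 | 0.395 | 0.61 | 58.33 | 0.38 | 0.014 |
| Click Here | CFG-RF | 2020 | 0.375 | 0.63 | 41.66 | 0.33 | 0.015 |
| Click Here | FITB-CFG | 632 | 0.28 | 0.72 | 66.6 | 0.65 | 0.014 |
| Click Here | KEY-CFG | 460 | 0.14 | 0.8 | 72.2 | 0.85 | 0.014 |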
  • Since both SSIM and MSE are calculated on the images and are consistent with each other, I conclude that a possible cause of the deviation in the network’s Threshold Accuracy is that feature extraction is sub-optimal for different time series. Based on research findings, I explore adding two additional convolutional layers, targeting the deviations I observed. I outline below the difference in the network before and after the enhancement. Unfortunately, the Threshold Accuracy predictions are slightly worse, but the overperformance in Threshold Accuracy for images with higher MSEs remains.
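
The windowing, loss and activation changes discussed this week can be sketched with standard PyTorch building blocks. This is a minimal illustration under assumptions: `sliding_windows` is a hypothetical helper and the toy series is not market data, while `Tensor.unfold`, `nn.L1Loss` and `nn.SiLU` are existing PyTorch APIs. With a window of length w and a stride s, successive windows overlap by w - s points, so a 10-day stride with a 20-day overlap corresponds to 30-day windows.

```python
import torch
from torch import nn


def sliding_windows(series, window, stride):
    """Split a 1-D tensor into windows of length `window`, `stride` points apart."""
    return series.unfold(0, window, stride)


series = torch.arange(10, dtype=torch.float32)  # toy series 0..9
windows = sliding_windows(series, window=4, stride=2)  # overlap of 2 points

loss_fn = nn.L1Loss()     # Mean Absolute Error
activation = nn.SiLU()    # Swish: x * sigmoid(x)

mae = loss_fn(torch.tensor([1.0, 2.0]), torch.tensor([2.0, 4.0]))
activated = activation(torch.tensor([0.0, 1.0]))
```

A stride of 1 produces the largest number of heavily overlapping windows, which is consistent with the concern above that it may overfit: neighbouring training samples are almost identical.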

Week 13

  • SSIM and MSE are consistent, and the prediction Threshold Accuracy is consistent with them except for the pairs ZION-KEY and CFG-RF. I observe that the SSIM of the feature maps captured after the last conv2d layer and after the fully connected layers flips between these pairs: after the last conv2d layer, CFG-RF Feature Map SSIM < ZION-KEY Feature Map SSIM, whereas after the fully connected layers the order reverses. ZION-KEY prediction Threshold Accuracy is indeed greater than CFG-RF, but the gap is about 17 percentage points (41% vs 58%).
  • The higher SSIM for CFG-RF could be due to the model overfitting on features that are common but do not necessarily contribute to predicting the stock price movements of RF.
  • CFG-RF has a slightly lower SSIM in the convolutional layers compared to ZION-KEY. Since the convolutional layers are responsible for initial feature extraction, any shortcomings here may impact the quality of the features passed to the FC layers.
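| Training-Eval Stocks | DTW Distance | Input Images MSE | Input Images SSIM | Feature Image Conv2d SSIM | Feature Image Fully Connected SSIM | Evaluation Threshold Accuracy at 0.1 (%) | Evaluation R^2 | Model Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PWBK-OZK | 1291 | 1.97 | 0.05 | 0.1138 | | 22.2 | -0.07 | 0.010 |
| SIVBQ-PWBK | 6038 | 2.19 | 0.05 | 0.1099 | | 16.6 | -0.22 | 0.013 |
| ZION-KEY | 514 | 0.395 | 0.61 | 0.5708 | 0.4154 | 58.33 | 0.38 | 0.014 |
| CFG-RF | 2020 | 0.375 | 0.63 | 0.5431 | 0.5708 | 41.66 | 0.33 | 0.015 |
| FITB-CFG | 632 | 0.28 | 0.72 | 0.5966 | 0.5394 | 66.6 | 0.65 | 0.014 |
| KEY-CFG | 460 | 0.14 | 0.8 | 0.6930 | | 72.2 | 0.85 | 0.014 |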

Week 14

  • I set out to examine descriptive statistics for each layer’s weights and gradients. This may shed some light on weight explosion, overfitting or otherwise.
  • The lower SSIM index for FC layers observed for the CFG-RF pair may imply the network is struggling to learn features in the later layers. An increase in the optimizer’s adamw_weight_decay may help.
  • Stepping through training for the CFG-RF pair, I find:
    • Training R^2 reaches 1 early on and stabilizes around 0.993, indicating overfitting. I am considering re-training with dropout.
    • The conv1 layer exhibits a slightly asymmetrical weight distribution (non-zero mean) that increases over epochs, suggesting that weight drift might be happening. The continued growth of conv1.weight_weight_std during training is consistent with weight drift. By contrast, drift may be present for conv2 during training, but its weights mean converges to zero by the end of training.
    • The increasing dispersion of the conv weight means may indicate that the model is learning more nuanced and specific patterns, with individual weights adjusting to fit particular training samples or features. However, in the particular case of conv1, the combination of the continued increase in the mean (drift) and in dispersion may well mean that the network is overfitting.
    • Since both conv1 and conv2 gradient means oscillate around zero, the model may not be learning general patterns; what it does learn, as seen in the drift of the weights, is possibly idiosyncrasies, leading to overfitting. An increase in the optimizer’s weight decay to penalize large weights should help prevent these from growing.
    • fc1 weights and gradients exhibit a similar drift / overfitting behaviour to conv1. fc2 gradient means oscillate around zero, which may indicate the model is not learning at this layer.
    • bn_fc2 and bn_fc2.weight_weight_mean and bn_fc2.weight_weight_std stabilize over the last 20% of training. A weights mean of ~0.8 indicates the layer is scaling inputs close to their learned distribution. A std of ~0.4-0.6 indicates it allows diverse patterns to be passed on without overly compressing feature ranges. These should hopefully help the model generalize.
    • bn_fc2.weight gradients fluctuate in a low range throughout training, indicating the layer makes small adjustments to its weights, which should help us avoid overfitting.
    • fc3 shows stable weights and gradients, indicating it has converged and is well regularized.
    • fc3.bias mean seems to be “offsetting” or behaving like an increasing bias shift, perhaps trying to correct or balance the conv and fc layers above it, which output higher values on average than the model requires to match the observations. This raises some concern as all layers should be sharing this heavy lifting. Forcing this adjustment may be behind the SSIM degradation observed at the fc3 layer.
  • Following these observations, I think I have identified a possible issue. To balance learning throughout the network:
    • I increase the optimizer’s weight decay to prevent weight drift (from 0.00001 to 0.001) and
    • introduce dropout (from 0 to 0.4) to achieve generalization
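
The per-layer statistics examined this week can be collected with a small helper like the one below. It is a sketch under assumptions: `layer_statistics` is a hypothetical function, the key names only imitate the style of the metric names quoted above (for example conv1.weight_weight_std), and the toy model stands in for the real network. The `weight_decay=1e-3` value matches the increased weight decay mentioned above.

```python
import torch
from torch import nn


def layer_statistics(model):
    """Mean and std of every parameter tensor and, when available, its gradient."""
    stats = {}
    for name, param in model.named_parameters():
        weights = param.detach()
        stats[f"{name}_weight_mean"] = weights.mean().item()
        # unbiased=False keeps single-element tensors (e.g. a lone bias) at std 0.
        stats[f"{name}_weight_std"] = weights.std(unbiased=False).item()
        if param.grad is not None:
            stats[f"{name}_grad_mean"] = param.grad.mean().item()
            stats[f"{name}_grad_std"] = param.grad.std(unbiased=False).item()
    return stats


# Toy model with known weights so the statistics are easy to check by hand.
model = nn.Sequential(nn.Linear(2, 1))
with torch.no_grad():
    model[0].weight.fill_(1.0)
    model[0].bias.fill_(0.5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)

model(torch.ones(4, 2)).sum().backward()
stats = layer_statistics(model)  # call once per epoch and log the values
```

Tracking a drifting mean together with a growing std per layer, epoch by epoch, is what makes patterns such as the conv1 weight drift visible.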

Week 15

  • Having experimented with additional convolutional and FC layers in this network, I conclude that the convolutional layers learn quickly and thoroughly, possibly so much that they can overfit. The lion’s share of the learning relies on these layers. Trying to shift this learning to the fully connected layers yields low R^2 results. For example, since in general the SSIM between inputs and layer feature maps is >60% for the convolutional layers and 25% for the FC layers, and in order to balance the learning more evenly between the two, I took an extreme scenario where the optimizer’s learning rate is set to 0.0000001 for the convolutional layers and to 0.1 for the fully connected layers. The objective was to observe whether the evaluation R^2 followed an increase in the FC SSIM. Both train and evaluation R^2 dropped significantly. In particular, when SSIM_FC spikes, it leads to a major drop in training R^2. On this basis I think of this network as one where the convolutional layers learn complex patterns and the FC layers serve as an offset. The challenge is making the combination of these layers, and the SSIMs achieved, help us generalize.
  • The objective is to increase the SSIM of the FC layers whilst not overfitting with too high an SSIM in the convolutional layers, using a simpler model (see mlflow results)
  • I find that SSIM Convo > 60% with SSIM FC > 50% produces evaluation R^2 >65% (fixed precision float16) and 70% at float32, eval MAE 0.108. Unfortunately, a higher training R^2 leads to overfitting and reduces the generalization of the model for the pair trained with CFG and evaluated with RF. Recall this pair is an outlier, in that its SSIM Input-Evaluation dataset is relatively high (=0.65) but its DTW Distance is large (=2020).
  • Unfortunately, none of the multiple regularization techniques used with this setup helped generalize the model when evaluated with the RF stock series; instead, the model’s performance decreased dramatically. Other architectures tested would overfit or underperform when regularization techniques were used, making this test the best performer.
  • Weights and their volatility are proving to be stable during training:
  • Training the model with the CFG share price time series, the training R^2 is 0.5678 and the lowest cum loss 0.1458. These are the comparable results across stocks selected for their DTW and SSIM against this CFG-trained checkpoint.
  • The table reflects a clear relationship between the SSIM of input-to-evaluated GAF images and the model’s predictive performance. The SSIM between the conv feature image output and the CFG training image input is also consistent. This relationship begins to break at the FC layer feature image output level, consistent with our observation that the FC layer offsets the conv layer outputs to more closely predict the labels.
  • Interestingly, FITB’s MAE = 0.12 > RF MAE = 0.095 and yet both prediction R^2 are pretty much the same. Perhaps FITB’s series speed variation closer to the training share series of CFG justifies this result.
  • The graph indicates the threshold where the model’s predicted values start to underperform is over 0.7.
| Eval Stock | DTW Distance To CFG | Stock Pair Images SSIM | Feature Image Conv2d SSIM Inputs Training-Vs-Eval | Feature Image FC SSIM Inputs Training-Vs-Eval | Evaluation Threshold Accuracy 1 dp (%) | Evaluation MAE | Evaluation R^2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OZK | 1261.1 | 0.5620 | 0.5858 | 0.3172 | 50.0 | 0.1559 | 0.3140 |
| PWBK | 999.9 | 0.0565 | 0.2229 | 0.1444 | 16.66 | 0.2888 | -0.0830 |
| SIVBQ | 5443 | 0.3778 | 0.4032 | 0.3152 | 5.55 | 0.2775 | 0.0273 |
| ZION | 628.8 | 0.5414 | 0.5541 | 0.4291 | 38.88 | 0.1833 | 0.2563 |
| FITB | 632.7 | 0.72 | 0.6532 | 0.5092 | 58.33 | 0.1233 | 0.6964 |
| KEY | 514 | 0.8 | 0.7382 | 0.5067 | 44.44 | 0.1426 | 0.5724 |
| RF | 2020 | 0.6275 | 0.6176 | 0.4633 | 66.66 | 0.09514 | 0.6998 |
  • I then train/evaluate with the CFG-KEY pair. A more complex model performs better with ~90% evaluation R^2 and MAE=0.0951.
| Eval Stock | DTW Distance To CFG | Stock Pair Images SSIM | Feature Image Conv2d SSIM Inputs Training-Vs-Eval | Feature Image FC SSIM Inputs Training-Vs-Eval | Evaluation Threshold Accuracy 1 dp (%) | Evaluation MAE | Evaluation R^2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OZK | 1261.1 | 0.5620 | 0.9473 | 0.1333 | 38.88 | 0.1739 | 0.2031 |
| PWBK | 999.9 | 0.0565 | 0.9140 | 0.0914 | 16.66 | 0.3882 | -1.1904 |
| SIVBQ | 5443 | 0.3778 | 0.9392 | 0.1445 | 19.44 | 0.2596 | -0.1551 |
| ZION | 628.8 | 0.5414 | 0.9516 | 0.2111 | 38.88 | 0.1531 | 0.4448 |
| FITB | 632.7 | 0.72 | 0.9638 | 0.4490 | 63.88 | 0.0920 | 0.8599 |
| KEY | 514 | 0.8 | 0.9682 | 0.4871 | 69.44 | 0.0760 | 0.8810 |
| RF | 2020 | 0.6275 | 0.9563 | 0.3015 | 38.88 | 0.1721 | 0.0844 |
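
The extreme scenario described at the start of this week, with a tiny learning rate for the convolutional layers and a large one for the fully connected layers, can be set up with optimizer parameter groups. The sketch below is illustrative only: `TinyCNN` is a hypothetical stand-in for the real network, and the choice of AdamW is an assumption.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical stand-in: one conv block followed by one fully connected layer."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 4, kernel_size=3, padding=1), nn.SiLU())
        self.fc = nn.Linear(4 * 8 * 8, 1)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(start_dim=1))


model = TinyCNN()
# One parameter group per part of the network, each with its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": model.conv.parameters(), "lr": 1e-7},  # convolutional layers barely move
    {"params": model.fc.parameters(), "lr": 1e-1},    # fully connected layers learn fast
])

output = model(torch.zeros(2, 1, 8, 8))  # batch of two 8x8 single-channel images
```

Schedulers applied to this optimizer scale each group's rate separately, so the ratio between the two groups is preserved unless a scheduler is given per-group settings.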

Week 16

  • Since changing the network’s architecture or hyperparameters led to an overfitted model, I considered testing whether a learning rate derived via Bayesian optimization would help, with the objective function being an epoch’s training loss plus an adjustment factor that primarily increases this loss for any low training-input-image-to-FullyConnected-feature-map SSIM.
  • I implemented this using the GPyOpt library because it was easy to read, integrates nicely with the optimizer and was easy to test. However, it had some limitations in letting the user control exploration/exploitation.
  • Unfortunately, the adjustment factor did not help and the model overfit (mlflow experiment)
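
The shape of such an objective can be sketched as a training loss plus a penalty that grows as the FC feature-map SSIM falls below a floor. The exact form of the adjustment factor is not given in this log, so the hinge penalty, the function name and the default values below are all assumptions; the sketch also leaves out the Bayesian optimization loop itself.

```python
def penalized_objective(train_loss, fc_ssim, ssim_floor=0.5, penalty_weight=1.0):
    """Training loss plus a penalty for FC feature-map SSIM below `ssim_floor`."""
    shortfall = max(0.0, ssim_floor - fc_ssim)  # zero when the SSIM is high enough
    return train_loss + penalty_weight * shortfall


# An epoch with an acceptable SSIM is scored by its loss alone...
score_ok = penalized_objective(train_loss=0.1, fc_ssim=0.7)
# ...while a low SSIM makes the same loss look worse to the optimizer.
score_low = penalized_objective(train_loss=0.1, fc_ssim=0.3, penalty_weight=2.0)
```

A function of this kind would be evaluated once per candidate learning rate, with the Bayesian optimizer proposing the next candidate from the scores seen so far.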

Backlog

  • Pre-processing/Transformations:
    • Train the NN for 2 stocks that are highly correlated and compare to a highly correlated evaluation time series. Then compare this time series to a medium and low correlated evaluation time series to confirm the approach is robust. Try 1k data points in time series, then 2k, then 3k.
    • Eval model and log to mlflow during training when best cum loss < threshold
    • Explore a stride > 1
    • Encode images pre-log delta and gamma transformation
    • Since it’s likely the dataset’s max-min is large and the data volatile, differencing transformation (i.e. abs % change) is unlikely to help the model on its learning, but we’ll test it.
    • Embeddings – leaning towards stumpy library for matrix profiles with interesting applications.
    • Refactor mlflow warning UCVolumeDatasetSource
    • Implement DTW with RAPIDS cuML as fastdtw lib runs on CPU