Neural Network Research to Predict Gap Risk for Financial Assets

I set out to explore different methodologies applied to Neural Networks to predict gap risk on financial assets.

This is a blog I aim to update on a weekly basis sharing my journey for this research project.

Motivation For This Post

The inspiration for this project stems from my interest in continuing to learn after completing, in June 2024, a 6-month Professional Certification in Machine Learning and Artificial Intelligence at Imperial Business School.

I continue experimenting with neural networks on this project. I am privileged to be guided by Ali Muhammad, who lectured me during my certification at Imperial.

In this project we use different neural network approaches to estimate gap risk on the price of financial assets, and each approach is held in a subdirectory in this repo. The projects in this repo may slightly deviate from this objective as I explore and research associated predictions that help me build towards the end goal.

This project is work in progress and this page serves as a log for this amazing journey.

Open Source Code Available

PyTorch-powered project stack

Work in progress:

CNN training by encoding price time series into Gramian Angular Field (GAF) images

Application of Recurrent Neural Networks with Long Short-Term Memory (LSTM) models

Get GitHub Repo

At the beginning, before starting this blog

I had a basic CNN Jupyter notebook that attempted to predict next-day prices from a time series. The model was trained on Silicon Valley Bank (SIVB) and its predictions were validated on Silvergate Bank (SICP), both of which collapsed in 2023. Poor results on the CNN led me to start studying RNNs.

It was at this point that I started posting to this blog …

Week 1

  • The CNN results are not optimal with accuracy in the low 20s
  • I had thus started research on LSTMs and preparing a simple LSTM model to understand this better
  • After meeting with Ali, I have decided that an approach that is not “fail fast” can bring benefits to my edification and eventually this research even if the results are bad
  • Thus I have paused the development on the RNN and I am digging deeper on the CNN
  • I have refactored the CNN from a Jupyter Notebook to python scripts and helper function modules. The model continues to yield unstable results
  • The correlation between the encoded images drops to 60%, whereas the correlation between the actual price time series used to build these images is >99%. I need to research and understand this difference by encoding images with different parameters
  • I intend to run a grid search for the above-mentioned image generation algorithm
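The GAF encoding mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the summation variant (GASF), not the code in the repo: the function name gasf and the sample_range argument are assumptions made for this example, and the image size here simply equals the window length (no down-sampling step).

```python
import numpy as np

def gasf(series, sample_range=(-1.0, 1.0)):
    """Gramian Angular Summation Field of a 1-D series (illustrative sketch)."""
    x = np.asarray(series, dtype=float)
    lo, hi = sample_range
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        raise ValueError("series must not be constant")
    # Min-max scale the series into sample_range, which must lie inside [-1, 1]
    scaled = (x - x_min) / (x_max - x_min) * (hi - lo) + lo
    scaled = np.clip(scaled, -1.0, 1.0)
    # Polar encoding: each value becomes an angle
    phi = np.arccos(scaled)
    # Summation field: cos(phi_i + phi_j) for every pair of time steps
    return np.cos(phi[:, None] + phi[None, :])

# Hypothetical 4-point window scaled into (-1, 0.5)
image = gasf([1.0, 2.0, 3.0, 4.0], sample_range=(-1.0, 0.5))
```

Because the scaling range changes the angles, two series that are highly correlated as prices can produce images that correlate much less, which is one reason a grid search over the encoding parameters is worth running.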

Week 2

  • I have re-trained and validated several scenarios for the CNN using GAF images. Unfortunately I had to stop because the results were hard to track. The analysis is therefore incomplete.
  • The results for these simulations so far are more encouraging: with ~60% correlation between the training and validation GAF image datasets, the validation dataset yields [Accuracy at 1 decimal place = 44%, R^2 = 46%, MSE = 3%].
  • I still need to dig deeper to understand the correlation between these images for different datasets, but it is now clearer which hyperparameters help
  • I asked my professor for guidance on how to better track results, and he suggested mlflow. I will spend the next week setting it up.

Week 3

  • I run an mlflow server locally, and I have set up a public mlflow server running in a Docker container that serves my results from storage and the database. Unfortunately, for now it serves from a cheap Azure SQL database, so it is not the fastest to show results
  • Mlflow was particularly useful to visualize the results from encoding images with different parameters. I compared scenario results between encoding the time series into images using MarkovTransitionField vs GramianAngularField (GAF). I obtain best results (results and parameters used are stored in Mlflow gaprisk-experiment-005) with the following key encoding parameters and model parameter:
    • GAF summation method
    • gaf_sample_range (-1, 0.5)
    • MinMax(-1,0) scaler
    • dropout 0.25 to 0.5
  • Several runs did not converge, training endlessly through the 10k-15k epochs. I have now added PyTorch’s ReduceLROnPlateau scheduler in min mode to ratchet the learning rate down dynamically with a patience of 300 epochs. For 32×32 images, training is abandoned if the loss does not improve for 300 epochs.
  • I hit a wall: at the best accuracy result, the CNN gets stuck at a local minimum of ~0.15%. I am using multiple regularization methods:
    • nn.BatchNorm1d
    • Leaky ReLU
    • momentum
    • He (Kaiming) weight initialization, tested at 0 and in the range (0, 0.5)
    • Dropout
  • Thus I have started to search for alternative approaches that may help me reach the global minimum, the first being an increase in the size of the GAF images so that they represent a larger time series cohort.
  • I have also found that performance during model training is suboptimal: my RTX 3090 Ti GPU is humming at 30-50% capacity. I suspect the root of the issue is moving tensor data to the CPU to perform interim calculations during training and to log data to mlflow.
  • MLflow has provided clarity on the inconsistency of results: there may be a bug in my calculation of R^2. I am going to refactor the code to make it more GPU-optimal, along with fixing this bug.
  • I have added to the backlog re-running scenarios with alternative optim and loss function combinations than MSELoss and Adam.
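The learning rate schedule with early abandonment described above could look roughly like the sketch below. It uses PyTorch's built-in ReduceLROnPlateau in min mode; the function name train_with_plateau, the stop_after argument and the toy linear model in the usage note are assumptions for illustration only, not the repo's training loop.

```python
import torch
from torch import nn

def train_with_plateau(model, x, y, max_epochs=1000, patience=300,
                       stop_after=300, lr=1e-2):
    """Full-batch training that lowers the LR on plateaus and stops when stuck."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=patience)
    loss_fn = nn.MSELoss()
    best_loss, stale_epochs, epochs_run = float("inf"), 0, 0
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        epochs_run += 1
        current = loss.item()      # note: .item() forces a GPU-to-CPU sync
        scheduler.step(current)    # scheduler lowers the LR after `patience` flat epochs
        if current < best_loss:
            best_loss, stale_epochs = current, 0
        else:
            stale_epochs += 1
            if stale_epochs >= stop_after:
                break              # abandon training: no improvement for too long
    return best_loss, optimizer.param_groups[0]["lr"], epochs_run
```

A call such as train_with_plateau(nn.Linear(4, 1), x, y, max_epochs=200, patience=5, stop_after=20) returns the best loss, the final learning rate and the number of epochs actually run, which makes it easy to log all three per scenario.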

Access The MLFlow Server

MLFlow Server

Username and Password: visitor

Week 4

  • In order to refactor the code on tensors, I realize I am lacking some important PyTorch concepts on tensor operations. I am taking a few days to study this interesting course and this one.

Week 5

  • I have refactored the code to keep all calculations on the GPU unless a transfer is necessary. However, mlflow logging requires metrics on the CPU, and this slows down training.
  • I have confirmed that the R^2 and accuracy calculations are correct, further confirmed with torcheval.metrics.functional.regression.r2_score. Because R^2 is calculated at 64-bit precision, it suggests poorer predictions than the 1 decimal place accuracy (≈50%) does, and sits closer to what the 2 decimal place accuracy (≈10%) indicates. Other error measures like RMSE are not as extreme.
  • Regarding the GPU’s low utilization rate, I have confirmed that the bottleneck is not the DataLoader. For the short dataset I use, GPU utilization for the hyper-parameters that yield the best results is highest with DataLoader num_workers=0, even with larger batches (best performing batch_size=512). Disabling mlflow logging, which moves data to the CPU, increases GPU utilization to 100% as expected. GPU utilization is 20-40% when mlflow logs.
  • I have started running the accuracy analysis for larger images
  • I am now researching an embedding model suitable for univariate analysis.
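To make the gap between R^2 and the decimal-place accuracy concrete, here is a small NumPy sketch. It assumes that "accuracy at N decimal places" means the share of predictions that equal the target once both are rounded to N decimals; that definition, the function names and the sample values are assumptions for this example, not taken from the repo.

```python
import numpy as np

def accuracy_at_decimals(y_true, y_pred, decimals=1):
    """Share of predictions matching the target after rounding to `decimals` places."""
    scale = 10 ** decimals
    t = np.rint(np.asarray(y_true, dtype=float) * scale)
    p = np.rint(np.asarray(y_pred, dtype=float) * scale)
    return float(np.mean(t == p))

def r2_score(y_true, y_pred):
    """Coefficient of determination computed at full float64 precision."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values, for illustration only
y_true = [0.50, 0.62, 0.71, 0.80]
y_pred = [0.52, 0.58, 0.74, 0.93]
acc_1dp = accuracy_at_decimals(y_true, y_pred, decimals=1)
acc_2dp = accuracy_at_decimals(y_true, y_pred, decimals=2)
r2 = r2_score(y_true, y_pred)
```

Rounding forgives small errors, so the 1 decimal place score can look healthy while R^2, which sees every digit of the error, stays low.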

Week 6

  • Completed testing the network with different image sizes: using the best hyperparameters and model parameters, I have found that due to the short time series in this dataset, I can only run image sizes 32, 64 and 128. 32×32 performed best at 48% accuracy and 42% R^2 for the 1 decimal place prediction calculation, whilst 2 decimal places accuracy continues to underperform at 4%. Interestingly, to reach training loss <0.1 for these larger images, I had to remove the ReduceLROnPlateau or abandon training if convergence did not happen during training.
  • Meeting with my professor this week was productive as always. I have now adjusted the backlog priority on the back of this conversation.
  • Refactored to switch the mlflow context on/off: back to 95% GPU utilization but no mlflow logging 🙁

Week 7

  • Refactored the code to adjust the learning rate dynamically during training by mixing ReduceLROnPlateau (min mode) with a learning rate reset that helps the training loss bounce out of local minima, with the objective of reaching the global minimum. Training with dropout 0.25 and weights init=0 was stuck at a minimum loss of 0.16. The above-mentioned strategy has brought this loss down to 0.02, but prediction accuracy at 1 d.p. remains at 41%. With no dropout the model achieves a loss of 0.000003, but the lack of generalization reduces prediction accuracy by over 6%.
  • I’ve enhanced the network to predict the classification for the next day price (i.e. next day price class for each window is price up or down). See Mlflow results:
    • Test Data (SIVBQ): Next day price classification class accuracy is 95%
    • Evaluation Data (SICP): Where next day price prediction to 1 decimal place is 41%, its corresponding next day price classification class accuracy is 50%, and it exhibits a bearish bias reflecting the dataset used to train the model.
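The idea of combining reduce-on-plateau with a learning rate reset can be sketched without any framework. The class below is a hypothetical policy written for illustration (its name, its arguments and the reset rule are all assumptions, not the repo's custom scheduler): it cuts the rate after a plateau and, once the rate would fall below a floor, jumps back to the base rate so the optimizer can escape a local minimum.

```python
class PlateauWithReset:
    """Reduce the LR on plateaus; reset it to base_lr once it would drop below min_lr."""

    def __init__(self, base_lr, factor=0.5, patience=300, min_lr=1e-6):
        self.base_lr = base_lr
        self.lr = base_lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Call once per epoch with the training loss; returns the LR to use next."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
            return self.lr
        self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.bad_epochs = 0
            reduced = self.lr * self.factor
            # Reset instead of shrinking further once the floor is crossed
            self.lr = self.base_lr if reduced < self.min_lr else reduced
        return self.lr

# With a real PyTorch optimizer, the returned value could be applied with:
#   for group in optimizer.param_groups:
#       group["lr"] = new_lr
```

Feeding a flat loss into PlateauWithReset(base_lr=0.4, factor=0.5, patience=1, min_lr=0.15) walks the rate down from 0.4 to 0.2 and then resets it to 0.4 rather than continuing to 0.1.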

Week 8

  • During my attendance at Unity Unite 2024 Barcelona, I came across a discussion on the use of a wrapped Adam optimizer: AdamW with cyclical learning rates and a cosine learning rate scheduler. I implemented this approach, following this implementation, to test for better generalization and smoother convergence. The optimizer does not seem to differ much from the existing Adam implementation in PyTorch, except that it decouples weight decay from the batch gradient calculation. The cosine annealing scheduler with restarts allows the model to converge to a (possibly) different local minimum on every restart and normalizes the weight decay hyperparameter according to the length of the restart period.
    • Unfortunately, I observe that this approach combined with the MSELoss loss function does not converge to loss <0.02, and the dynamic ReduceLROnPlateau (min mode) with a learning rate reset performs better in this regard.
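For reference, the cosine schedule with restarts follows a simple closed form. The sketch below is a plain Python version with a fixed restart period; the function name and default values are assumptions for illustration. In PyTorch the equivalent building blocks are torch.optim.AdamW and torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, which may differ in detail from the wrapped optimizer discussed above.

```python
import math

def cosine_restart_lr(epoch, base_lr, min_lr=0.0, period=10):
    """Learning rate for `epoch` under cosine annealing with fixed-length restarts."""
    t_cur = epoch % period  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur / period))

# The rate decays smoothly inside each cycle and jumps back to base_lr at every restart
schedule = [cosine_restart_lr(e, base_lr=0.01, min_lr=0.0001, period=10) for e in range(30)]
```

Each restart gives the optimizer a fresh, large step size, which is what lets the model settle into a possibly different minimum on every cycle.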

Taking Stock

Weeks 1 – 8

  • The training vs evaluation time series correlation is 99%, but it drops to 60% between the corresponding encoded GASF images
  • Regression Model:
    • Key encoding and model parameters that achieve the best next day price prediction accuracy, at a loss <0.025:
      • GAF summation method
      • gaf_sample_range (-1, 0.5)
      • MinMax(-1,0) scaler
      • dropout 0.25 to 0.5
      • Adam optimizer and MSELoss loss function
    • A custom PyTorch ReduceLROnPlateau (min mode) with a learning rate reset allows the model to reach global minima.
    • The best accuracy results are achieved via the following regularization methods to generalize:
      • nn.BatchNorm1d
      • Leaky ReLU
      • momentum
      • He (Kaiming) weight initialization, tested at 0 and in the range (0, 0.5)
      • Dropout
    • The next day price prediction results worth mentioning are listed here. We conclude that minimizing the training loss erodes model generalization.
      • Min Model loss < 0.025:
        • Encoded image test dataset next day prediction at 1 decimal place 67%, R^2 = 78%
        • Encoded image evaluation dataset next day prediction at 1 decimal place 44%, 2 decimal places 4%, R^2 = 35%
      • Min Model loss = 0.26
        • Encoded image test dataset next day prediction at 1 decimal place 58%, R^2 = 78%
        • Encoded image evaluation dataset next day prediction at 1 decimal place 46%, 2 decimal places 4%, R^2 = 47%
  • Classification Model:
    • Next day price higher or lower price prediction achieves global minima at function loss <0.000003 with evaluation dataset score 50%. Test dataset score 52%

Week 9

  • I have completed enhancing the code to train the model with concatenated time series. Due to the price difference between stocks, I calculate a rebased stock price starting at 100 for the concatenated time series, using the log return from one data point to the next.
  • The log-price correlation metric (image below) is not representative for our comparison purposes with highly volatile short time series because it only captures tendency but not the gamma of price – see Comerica (CMA) time series 92% correlation vs SIVBQ, whilst SIVBQ drops in price by an additional 50% in the period. Therefore I am building a dynamic time warping (DTW) metric to compare stocks similarity. This metric will also enable measuring similarity for time series of different sizes.
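The rebasing step can be sketched as follows. This is one plausible reading of the description above, assuming log returns are computed within each stock, concatenated, and then compounded from a base of 100; the function name and the sample prices are made up for illustration.

```python
import numpy as np

def rebase_concatenated(price_series_list, base=100.0):
    """Concatenate several price series as one path rebased to `base`."""
    # Log returns are taken inside each series, so the price jump
    # between two different stocks never enters the data
    log_returns = [np.diff(np.log(np.asarray(p, dtype=float)))
                   for p in price_series_list]
    all_returns = np.concatenate(log_returns)
    # Compound the returns from the base level
    return base * np.exp(np.concatenate(([0.0], np.cumsum(all_returns))))

# Two hypothetical stocks trading at very different price levels
rebased = rebase_concatenated([[10.0, 11.0, 12.1], [200.0, 220.0]])
```

The second stock continues from wherever the first one ended, so the network sees one continuous series regardless of the original price levels.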

Week 10

  • The dynamic time warping – DTW – matrix below shows the distance for each pair of stocks, the lower the distance the greater the similarity.
  • As found in the prior week, there is a clear disconnect between log price correlation and log price DTW distance. I have started investigating this issue.
Stock Pair Tickers | Log Price DTW Distance | Log Price Pearson Correlation
KEY-CFG | 461 | 0.90
FITB-CFG | 632 | 0.86
SIVBQ-ALLY | 1944 | 0.98
FITB-ALLY | 2090 | 0.91
SIVBQ-SICP | 2500 | 0.93
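For readers unfamiliar with DTW, the distance can be computed with a short dynamic program. The sketch below is the exact O(n·m) version in NumPy with an absolute-difference cost, written for illustration; the figures in the table may come from a different (for example approximate) implementation, so values will not necessarily match.

```python
import numpy as np

def dtw_distance(a, b):
    """Exact dynamic time warping distance between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Perfectly correlated series can still be far apart in DTW terms
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
pearson = float(np.corrcoef(a, b)[0, 1])
distance = dtw_distance(a, b)
```

Pearson correlation ignores scale and offset while DTW does not, which is one way a pair can show high correlation and a large DTW distance at the same time.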

Week 11

  • Model predictions are far from what would be expected given the log price correlations. I research similarity indexes to compare the encoded GAF image inputs and their respective log price time series, including:
    • DTW Distance of log price series, a similarity metric which accounts for the series’ speed variation.
    • Pearson correlation: the misalignment between DTW distances and log price correlation coefficients is particularly noticeable, as shown below for the FITB-CFG pair.
Stock Pair Tickers | Log Price DTW Distance | Log Price Pearson Correlation | Encoded Image Mean Pearson Correlation | Structural Similarity (SSIM) by Tensor BxCxXxY
KEY-CFG | 461 | 0.90 | 0.92 | 0.7988
FITB-CFG | 632 | 0.86 | 0.55 | 0.7222
SIVBQ-ALLY | 1944 | 0.98 | 0.84 | 0.4821
FITB-ALLY | 2090 | 0.91 | 0.70 | 0.4810
SIVBQ-SICP | 2500 | 0.93 | 0.65 | 0.3705
    • Structural Similarity Index Measure (SSIM), which compares the structure of two images. I compute it with a PyTorch-based implementation, which efficiently handles multi-channel data without requiring explicit iteration over channels (unlike the scikit-image library on CPU).
    • Mean Squared Error (MSE) provides a pixel-wise comparison of the average squared difference between corresponding pixels.
    • Cosine Similarity.
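The simpler image metrics in this list can be sketched directly in NumPy. The helper below is a hypothetical illustration covering MSE, cosine similarity and Pearson correlation over flattened pixels; SSIM is left out because it needs a windowed implementation from a dedicated library.

```python
import numpy as np

def image_similarity(img_a, img_b):
    """Pixel-wise MSE, cosine similarity and Pearson correlation of two images."""
    a = np.asarray(img_a, dtype=float).ravel()
    b = np.asarray(img_b, dtype=float).ravel()
    mse = float(np.mean((a - b) ** 2))
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pearson = float(np.corrcoef(a, b)[0, 1])
    return {"mse": mse, "cosine": cosine, "pearson": pearson}

# Tiny made-up "images" for illustration
img = np.array([[1.0, 2.0], [3.0, 4.0]])
same = image_similarity(img, img)
flipped = image_similarity(img, -img)
```

MSE is sensitive to absolute pixel differences, whereas cosine similarity and Pearson correlation only look at direction and co-movement, so the three can rank image pairs differently.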

Week 12

  • I seek to compare last week’s similarity metrics to the neural network’s accuracy.
  • I have implemented a few more changes to train and evaluate:
    • Network’s Loss Function: Taking a cue from the results in this research paper, I have tested training the network with Mean Absolute Error (MAE). The paper goes further and suggests training with a 10-day stride and a 20-day overlap between successive windows.
      • I implement this loss function in PyTorch as nn.L1Loss().
      • In my case predictions performed better with a stride of 1. I will need to test this further as it may over-fit the data.
      • I also refactored the code to a less exhaustive windowing approach than what I had, with a 25-day overlap between windows.
    • Regularization: I notice the model fails to generalize with these new settings. I have swapped the ReLU activation function for Swish, which I learned about from this Kaggle competition. The PyTorch implementation is SiLU. It improves the results, in particular evaluation R^2, but the accuracy metric is similar to that with ReLU.
  • I obtain the accuracy metrics for divergent pairs of stocks:
MLFlow Link | Training-Eval Stocks | DTW Distance | MSE | SSIM | Accuracy (%) | R^2 (%) | Model Loss
Click Here | PWBK-OZK | 1291 | 1.97 | 0.05 | 22.2 | -0.07 | 0.010
Click Here | SIVBQ-PWBK | 6038 | 2.19 | 0.05 | 16.6 | -0.22 | 0.013
Click Here | ZION-KEY | 514 | 0.395 | 0.61 | 58.33 | 0.38 | 0.014
Click Here | CFG-RF | 2020 | 0.375 | 0.63 | 41.66 | 0.33 | 0.015
Click Here | FITB-CFG | 632 | 0.28 | 0.72 | 66.6 | 0.65 | 0.014
Click Here | KEY-CFG | 460 | 0.14 | 0.8 | 72.2 | 0.85 | 0.014
  • Since both SSIM and MSE are calculated on the images and are consistent with each other, I conclude that a possible cause for the network’s accuracy deviation is that feature extraction is sub-optimal for different time series. Based on research findings, I explore adding two additional convolutional layers targeting the deviations I observed; I outline below the difference in the network before and after the enhancement. Unfortunately, the accuracy predictions are slightly worse, but the overperformance in accuracy for images with higher MSEs remains.
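The relationship between window length, stride and overlap used in this week's experiments can be sketched as below. The helper names are hypothetical, the next-day target is an assumption about how samples are built, and MAE is written out in NumPy to mirror what nn.L1Loss() computes with its default mean reduction.

```python
import numpy as np

def sliding_windows(series, window, stride):
    """Split a series into (input window, next-day target) pairs."""
    x = np.asarray(series, dtype=float)
    # Stop early enough that every window still has a following target value
    starts = list(range(0, len(x) - window, stride))
    if not starts:
        raise ValueError("series is too short for this window length")
    inputs = np.stack([x[s:s + window] for s in starts])
    targets = np.array([x[s + window] for s in starts])
    return inputs, targets

def mean_absolute_error(y_true, y_pred):
    """MAE, the quantity minimized when training with an L1 loss."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# A 30-day window with a 10-day stride leaves a 20-day overlap between windows
inputs, targets = sliding_windows(np.arange(100.0), window=30, stride=10)
```

With a stride of 1 almost every window shares all but one day with its neighbour, which multiplies the number of samples but also makes them highly redundant, consistent with the over-fitting concern noted above.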

Week 13

  • SSIM and MSE are consistent, and prediction accuracy is consistent with them except for the pairs ZION-KEY and CFG-RF. I observe that the SSIM of the resulting feature maps, captured after the last Conv2d layer and after the fully connected layers, flips: CFG-RF Feature Map SSIM < ZION-KEY Feature Map SSIM. ZION-KEY prediction accuracy is indeed greater than CFG-RF, but the accuracy gap is about 17 percentage points (41.66% vs 58.33%).
Training-Eval Stocks | DTW Distance | MSE | SSIM | Feature Image Conv2d SSIM | Feature Image Fully Connected SSIM | Accuracy (%) | R^2 (%) | Model Loss
PWBK-OZK | 1291 | 1.97 | 0.05 | 0.1138 |  | 22.2 | -0.07 | 0.010
SIVBQ-PWBK | 6038 | 2.19 | 0.05 | 0.1099 |  | 16.6 | -0.22 | 0.013
ZION-KEY | 514 | 0.395 | 0.61 | 0.5708 | 0.4154 | 58.33 | 0.38 | 0.014
CFG-RF | 2020 | 0.375 | 0.63 | 0.5431 | 0.5708 | 41.66 | 0.33 | 0.015
FITB-CFG | 632 | 0.28 | 0.72 | 0.5966 | 0.5394 | 66.6 | 0.65 | 0.014
KEY-CFG | 460 | 0.14 | 0.8 | 0.6930 |  | 72.2 | 0.85 | 0.014

Backlog

  • Pre-processing/Transformations:
    • Train the NN for 2 stocks that are highly correlated and compare to a highly correlated evaluation time series. Then compare this time series to a medium and low correlated evaluation time series to confirm the approach is robust. Try 1k data points in time series, then 2k, then 3k.
    • Calculate image similarity metric from CNN feature extraction embeddings
    • Eval model and log to mlflow during training when best cum loss < threshold
    • Explore a stride > 1
    • Key stroke trigger to run eval during training
    • Encode images pre-log delta and gamma transformation
    • Since it’s likely the dataset’s max-min is large and the data volatile, differencing transformation (i.e. abs % change) is unlikely to help the model on its learning, but we’ll test it.
    • Embeddings – leaning towards stumpy library for matrix profiles with interesting applications.
    • Refactor mlflow warning UCVolumeDatasetSource
    • Implement DTW with RAPIDS cuML as fastdtw lib runs on CPU