Generative Adversarial Networks in Finance: an overview

Modelling in finance is a challenging task: the data often has complex statistical properties and its inner workings are largely unknown. Deep learning algorithms are making progress in the field of data-driven modelling, but the lack of sufficient data to train these models is currently holding back several new applications. Generative Adversarial Networks (GANs) are a neural network architecture family that has achieved good results in image generation and is being successfully applied to generate time series and other types of financial data. The purpose of this study is to present an overview of how GANs work, their capabilities and limitations in the current state of research with financial data, and some practical applications in the industry. As a proof of concept, three known GAN architectures were tested on financial time series, and the generated data was evaluated on its statistical properties, yielding solid results. Finally, it was shown that GANs have made considerable progress in their finance applications and can be a solid additional tool for data scientists in this field.


Clare-GAN: Generation Of Class-Specific Time Series

Numerous works have attempted to obtain generative models for time series that correctly reproduce the underlying temporal characteristics of a given training data set. However, we show in this work that the performance of these models is limited on datasets with high variability, for example those containing different classes. In such setups, it is extremely difficult for a generative model to find the right trade-off between sample fidelity, i.e., the similarity of the samples to the real time series, and sample diversity. Furthermore, it is essential to preserve the original classes and the variation within each class. To tackle this issue, we propose a new class-sensitive generative model, Class-specific Recurrent GAN (CLaRe-GAN), that conditions the generator on auxiliary information containing the class-specific and class-independent attributes. Our model relies on class-specific encoders: a unique encoder for two contradictory functions, i.e., extracting the inter- and intra-class attributes. To extract the high-level representation of the time series, we make a shared-latent space assumption [19]. At the same time, we use a class discriminator that discriminates between the latent vectors to efficiently extract the class-specific attributes. We test our approach on a set of publicly available datasets in which the number of classes, the length, and the number of available time series for each class vary, and evaluate our approach both visually and computationally. We show that our model outperforms the state-of-the-art generative models and leads to a significant and consistent improvement in the quality of the generated time series while preserving the classes and variation of the original dataset.
Italy power demand; Two lead ECG; Freezer regular train; Distal Phalanx TW; Yoga
t-SNE, PCA visualization; Discriminative score; Predictive score


Multivariate Time Series Synthesis Using Generative Adversarial Networks

Collection and analysis of distributed (cloud) computing workloads allows for a deeper understanding of user and system behavior and is necessary for efficient operation of infrastructures and applications. The availability of such workload data is, however, often limited, as most cloud infrastructures are commercially operated and monitoring data is considered proprietary or falls under GDPR regulations. This work investigates the generation of synthetic workloads using Generative Adversarial Networks and addresses a current need for more data and better tools for workload generation. Resource utilization measurements such as the utilization rates of Content Delivery Network (CDN) caches are generated, and a comparative evaluation pipeline using descriptive statistics and time-series analysis is developed to assess the statistical similarity of generated and measured workloads. We use CDN data open sourced by us in a data generation pipeline as well as back-end ISP workload data to demonstrate the multivariate synthesis capability of our approach. The work contributes a method for multivariate time series workload generation that can provide arbitrary amounts of statistically similar data sets based on small subsets of real data. The presented technique shows promising results, in particular for heterogeneous workloads that are not too irregular in temporal behavior.
Content Delivery Network data
Engle-Granger test; Johansen test


Generative Time-Series Modeling With Fourier Flows

Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, wherein research is hamstrung by the lack of access to data due to concerns over privacy. Most of the recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs)—but such models can be difficult to train, and may jeopardize privacy by “memorizing” temporal patterns in training data. In this paper, we propose an explicit likelihood model based on a novel class of normalizing flows that view time-series data in the frequency-domain rather than the time-domain. The proposed flow, dubbed a Fourier flow, uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into fixed-length spectral representations, then applies a (data-dependent) spectral filter to the frequency-transformed time-series. We show that, by virtue of the DFT analytic properties, the Jacobian determinants and inverse mapping for the Fourier flow can be computed efficiently in linearithmic time, without imposing explicit structural constraints as in existing flows such as NICE (Dinh et al. (2014)), RealNVP (Dinh et al. (2016)) and GLOW (Kingma & Dhariwal (2018)). Experiments show that Fourier flows perform competitively compared to state-of-the-art baselines.
Sinusoidal sequence; Google stock data; UCI Energy data; Lung cancer pathways dataset
F-score; Predictive score
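
The frequency-domain idea in the abstract above can be illustrated with a minimal sketch. It is not the Fourier flow architecture itself: the filter here is a fixed, data-independent positive scaling (the names `spectral_filter`, `spectral_filter_inverse`, `log_abs_det` and the parameter `log_scale` are hypothetical), whereas the paper describes a data-dependent spectral filter. The sketch only shows why filtering in the DFT domain gives an invertible map whose log-determinant is a cheap sum, with the transforms themselves costing O(N log N) via the FFT.

```python
import numpy as np

def spectral_filter(x, log_scale):
    """Scale DFT coefficient k of the real series x by exp(log_scale[k])."""
    n = x.shape[-1]
    spectrum = np.fft.rfft(x)                      # n // 2 + 1 complex coefficients
    return np.fft.irfft(spectrum * np.exp(log_scale), n=n)

def spectral_filter_inverse(y, log_scale):
    """Undo spectral_filter by dividing each coefficient by the same factor."""
    n = y.shape[-1]
    spectrum = np.fft.rfft(y)
    return np.fft.irfft(spectrum * np.exp(-log_scale), n=n)

def log_abs_det(log_scale, n):
    """Log |det| of the time-domain linear map x -> spectral_filter(x).

    The map is a circular convolution whose eigenvalues are the scale
    factors. Every frequency except DC (and Nyquist when n is even)
    appears twice in the full spectrum, hence the weight of 2.
    """
    weights = np.full(log_scale.shape, 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0
    return float(np.sum(weights * log_scale))
```

In a likelihood model, `log_abs_det` would be the change-of-variables term added to the base density of the filtered series; a learned, input-dependent filter would replace the fixed `log_scale` assumed here.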


A Spectral Enabled GAN for Time Series Data Generation

Time-dependent data is a main source of information in today's data-driven world. Generating this type of data, however, has proven challenging, which has made it an interesting research area in the field of generative machine learning. One such approach is that of Smith et al., who developed the Time Series Generative Adversarial Network (TSGAN), which showed promising performance in generating time-dependent data and the ability to perform few-shot generation, though it was flawed in certain aspects of training and learning. This paper looks to improve on the results from TSGAN and address those flaws by unifying the training of the independent networks in TSGAN and creating a dependency in both training and learning. This improvement, called unified TSGAN (uTSGAN), was tested and compared both quantitatively and qualitatively to its predecessor on 70 benchmark time series data sets used in the community. uTSGAN outperformed TSGAN on 80% of the data sets given the same number of training epochs, and on 60% of the data sets in three quarters of the training time or less, while maintaining the few-shot generation ability with better FID scores across those data sets.


How Faithful Is Your Synthetic Data? Sample-Level Metrics For Evaluating And Auditing Generative Models

Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, (α-Precision, β-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner.
SIVEP-Gripe COVID-19 patient data; MNIST data; AmsterdamUMCdb data
alpha-Precision; beta-Recall; Authenticity


Pareto GAN: Extending The Representational Power Of GANs To Heavy-Tailed Distributions

Generative adversarial networks (GANs) are often billed as "universal distribution learners", but precisely what distributions they can represent and learn is still an open question. Heavy-tailed distributions are prevalent in many different domains such as financial risk-assessment, physics, and epidemiology. We observe that existing GAN architectures do a poor job of matching the asymptotic behavior of heavy-tailed distributions, a problem that we show stems from their construction. Additionally, when faced with the infinite moments and large distances between outlier points that are characteristic of heavy-tailed distributions, common loss functions produce unstable or near-zero gradients. We address these problems with the Pareto GAN. A Pareto GAN leverages extreme value theory and the functional properties of neural networks to learn a distribution that matches the asymptotic behavior of the marginal distributions of the features. We identify issues with standard loss functions and propose the use of alternative metric spaces that enable stable and efficient learning. Finally, we evaluate our proposed approach on a variety of heavy-tailed datasets.
Wikipedia Web traffic; 136 million keystrokes; SNAP Live Journal; S&P 500 Daily Changes
Kolmogorov–Smirnov test; AUC of log-log plots of empirical CCDFs
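
The two evaluation quantities listed above can be sketched in a few lines. This is an illustrative assumption about how such metrics are typically computed, not the evaluation code of the Pareto GAN paper: `ks_statistic` is the standard two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical CDFs), and `empirical_ccdf` returns the points one would draw on a log-log tail plot. Both function names are hypothetical.

```python
import numpy as np

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))
    grid = np.concatenate([real, synthetic])       # the gap can only change at sample points
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synthetic = np.searchsorted(synthetic, grid, side="right") / synthetic.size
    return float(np.max(np.abs(cdf_real - cdf_synthetic)))

def empirical_ccdf(samples):
    """Sorted values and the fraction of samples >= each value (assumes no ties)."""
    x = np.sort(np.asarray(samples, dtype=float))
    ccdf = (x.size - np.arange(x.size)) / x.size   # strictly positive, so the log is defined
    return x, ccdf
```

For positive heavy-tailed data, plotting `np.log(x)` against `np.log(ccdf)` gives the log-log tail plot; how the area between the real and generated curves is aggregated into a single score is left open here.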


Using GANs For Sharing Networked Time Series Data: Challenges, Initial Promise, And Open Questions

Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.
Wikipedia Web traffic; Measuring Broadband America; Google Cluster Usage Traces
Temporal and spatial autocorrelation; W1 metric; Resource costs


Cot-GAN: Generating Sequential Data Via Causal Optimal Transport

We introduce COT-GAN, an adversarial algorithm to train implicit generative models optimized for producing sequential data. The loss function of this algorithm is formulated using ideas from Causal Optimal Transport (COT), which combines classic optimal transport methods with an additional temporal causality constraint. Remarkably, we find that this causality condition provides a natural framework to parameterize the cost function that is learned by the discriminator as a robust (worst-case) distance, and an ideal mechanism for learning time dependent data distributions. Following Genevay et al. (2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost. Our experiments show effectiveness and stability of COT-GAN when generating both low- and high-dimensional time series data. The success of the algorithm also relies on a new, improved version of the Sinkhorn divergence which demonstrates less bias in learning.
Multivariate AR; Noisy oscillation; UCI Electroencephalography dataset
Correlation metric  
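
The entropic penalization mentioned in the abstract above is what makes the Sinkhorn algorithm applicable. The sketch below shows plain entropy-regularized optimal transport between two discrete distributions; it does not include the causality constraint, the learned cost, or the modified Sinkhorn divergence of COT-GAN. The function name `sinkhorn_plan` and the defaults for `epsilon` and `n_iters` are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan between histograms a and b.

    cost    : (n, m) matrix of pairwise costs
    a, b    : probability vectors of length n and m
    epsilon : strength of the entropic penalty (smaller means closer to exact OT)
    """
    kernel = np.exp(-cost / epsilon)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)                       # rescale rows to match a
        v = b / (kernel.T @ u)                     # rescale columns to match b
    return u[:, None] * kernel * v[None, :]

# Usage: transport cost between two small point clouds on the unit interval.
x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 5)
cost = (x[:, None] - y[None, :]) ** 2
plan = sinkhorn_plan(cost, np.full(4, 0.25), np.full(5, 0.2))
transport_cost = float(np.sum(plan * cost))
```

Each iteration is only a pair of matrix-vector products, which is why the regularized cost is convenient inside a training loop. For very small `epsilon` the kernel can underflow, and a log-domain variant would be the safer choice.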


Conditional GAN for timeseries generation

It is abundantly clear that time-dependent data is a vital source of information in the world. The challenge has been for applications in machine learning to gain access to a considerable amount of quality data needed for algorithm development and analysis. Modeling synthetic data using a Generative Adversarial Network (GAN) has been at the heart of providing a viable solution. Our work focuses on one-dimensional time series and explores the few-shot approach, which is the ability of an algorithm to perform well with limited data. This work attempts to ease the frustration by proposing a new architecture, Time Series GAN (TSGAN), to model realistic time series data. We evaluate TSGAN on 70 data sets from a benchmark time series database. Our results demonstrate that TSGAN performs better than the competition both quantitatively, using the Fréchet Inception Distance (FID) metric, and qualitatively, when classification is used as the evaluation criterion.
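
The FID metric referred to above is a Fréchet distance between two Gaussians fitted to feature vectors of real and generated samples. The sketch below computes only that distance and assumes the features have already been extracted; which network produces the features for time series is not specified here, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(features_real, features_fake):
    """Fréchet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_real = features_real.mean(axis=0)
    mu_fake = features_fake.mean(axis=0)
    cov_real = np.atleast_2d(np.cov(features_real, rowvar=False))
    cov_fake = np.atleast_2d(np.cov(features_fake, rowvar=False))
    # tr((cov_real cov_fake)^(1/2)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices; clipping removes tiny negative round-off.
    eigenvalues = np.linalg.eigvals(cov_real @ cov_fake)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    mean_term = np.sum((mu_real - mu_fake) ** 2)
    return float(mean_term + np.trace(cov_real) + np.trace(cov_fake) - 2.0 * trace_sqrt)
```

Lower values indicate that the generated features are statistically closer to the real ones. The score depends on the feature extractor and on the sample size, so values are only comparable within one evaluation setup.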