
Research and publications

AI Garage drives cutting-edge research by developing novel algorithms across diverse fields. Our team actively contributes to the global research community, publishing consistently in top-tier peer-reviewed conferences and specialized workshops.


Machine learning

SHIP: Structural Hierarchies for Instance-dependent Partial Labels

Venue: WACV

SHIP introduces a plug-and-play hierarchy module for partial-label learning that derives structural label hierarchies directly from instance-dependent candidate label sets. It splits feature representations into multiple heads corresponding to hierarchy levels and supervises them using dynamically generated coarse-to-fine targets. This reduces mistake severity and improves representation quality across datasets with intrinsic or simulated hierarchies while adding minimal overhead to base PLL architectures.

Authors: Tushar Kadam, Utkarsh Mishra and Aakarsh Malhotra

 

 

Machine learning

SALE-MLP: Structure Aware Latent Embeddings for GNN to Graph-free MLP Distillation

Venue: IJCAI

SALE-MLP proposes a structure-aware Graph-to-MLP distillation method that learns graph-semantic latent embeddings from node features without using the graph at inference time. It aligns a student MLP's feature space with a teacher GNN via unsupervised structural losses instead of relying on precomputed GNN embeddings. The approach achieves superior performance to existing G2M methods on node classification and link prediction in both transductive and inductive settings, with notable gains in inductive scenarios.

Authors: Harsh Pal, Sarthak Malik, Rajat Patel and Aakarsh Malhotra

 

 

Machine learning

Tag2M: A Task-Agnostic Knowledge Distillation Framework for Distilling GNN to MLP

Venue: KDD

Tag2M presents a task-agnostic GNN-to-MLP distillation framework that transfers structural knowledge from a teacher GNN into a lightweight MLP for graph-free, few-shot inference. It uses a self-supervised contrastive loss to encode topology from node attributes and Lipschitz positional embeddings plus an inference-time prompt head for rapid task adaptation. Tag2M generalizes across homophilous and heterophilous graphs and delivers large speedups (up to 20–200×) while outperforming prior distillation methods on multiple node-level tasks over 11 public datasets.

Authors: Ram Ganesh V, Ayush Singh, Aditi Rai, Harsh Pal, Deepanshu, Akshay Sethi, Aakarsh Malhotra and Sayan Ranu

 

 

Fraud Detection

Prodem: Proactive Detection of Model Degradation in Financial Fraud Prediction Under Label Delay

Venue: ECML PKDD

Prodem targets proactive detection of performance degradation in fraud prediction models deployed under significant label delays typical of financial chargeback workflows. It develops monitoring signals and detection mechanisms that operate before true fraud labels fully materialize, enabling timely intervention. Experiments on real-world fraud pipelines show that Prodem flags degradation earlier and more reliably than conventional delayed-label monitoring, helping maintain fraud catch rates and business KPIs.

Authors: Akshay Sethi, Priyanshi Gupta, Sparsh Kansotia, Kamal Kant and Nitish Srivastava

 

 

Machine learning

FairFusion: Debiasing Diffusion Models for Fair Synthetic Tabular Data Generation

Venue: ECAI

FairFusion introduces a debiasing framework for diffusion models that generate synthetic tabular data while enforcing fairness across sensitive groups. The method integrates fairness-aware constraints and loss terms into the diffusion process, reducing disparate treatment and impact in downstream models trained on the synthetic data. Empirical results on benchmark tabular datasets demonstrate that FairFusion achieves competitive utility while substantially improving group fairness relative to standard diffusion-based generators.

Authors: Ruma Roy, Darshika Tiwari and Anubha Pandey

 

 

Others

BiGReachFRauD: Bipartite Graph Representation Learning using Breached Sources for Financial Fraud Detection

Venue: ECAI (PAIS)

BiGReachFRauD builds bipartite-graph representations by linking payment entities to externally breached identifiers (such as leaked emails or devices) to augment fraud detection. It designs a representation learning pipeline over this bipartite structure to capture reachability patterns indicative of compromised entities. The learned embeddings, when fed into downstream fraud models, improve detection of compromised merchants and cards compared with baselines that ignore breached-source connectivity.

Authors: Manasvi, Suhas, Deepanshu, Hariom and Yatin

 

 

Others

FgenXAI: A Generative AI Framework for Explainable Financial Records Summarization

Venue: KDD Workshop

FgenXAI proposes a generative-and-explainable AI framework that turns model explanations (e.g., feature attributions) into user-friendly summaries of financial records. The architecture includes query filtering, parsing/context building, response synthesis, and safety-focused response checking, loosely inspired by RAG-style modularity. Experiments on real financial workflows evaluate hallucination, refusal, and jailbreak robustness, showing that FgenXAI enables interactive, safer explanation consumption compared to one-shot XAI methods like SHAP/LIME alone.

Authors: Rakshit Rao, Manoj Mangam, Shivam Arora, Raahul Nallasamy, Sherin Bharathiy M, Aakarsh Malhotra and Alok Mani Singh

 

 

Machine learning

Towards Equitable Coreset Selection: Addressing Challenges Under Class Imbalance

Venue: CIKM Short

This work introduces Equitable Coreset Selection (ECS), a coreset framework explicitly designed for imbalanced classification settings. ECS adaptively prunes data while preserving minority-class coverage, mitigating the overrepresentation of majority classes seen in standard coreset methods. On multiple benchmarks, ECS consistently improves performance and robustness under severe class imbalance compared to state-of-the-art coreset baselines.
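The minority-preserving idea behind ECS can be illustrated with a toy per-class quota selector (a minimal sketch under stated assumptions: the equal-quota rule, the `scores` importance values, and the global top-up step are illustrative, not the paper's actual selection criterion):

```python
import numpy as np

def equitable_coreset(labels, scores, budget):
    """Pick `budget` examples: give each class an equal quota (capped by
    class size), fill each quota by importance score, then spend any
    leftover budget on the highest-scoring unselected examples.
    This preserves minority-class coverage that score-only pruning loses."""
    labels, scores = np.asarray(labels), np.asarray(scores, dtype=float)
    classes = np.unique(labels)
    quota = budget // len(classes)
    chosen = []
    for c in classes:
        idx = np.where(labels == c)[0]
        ranked = idx[np.argsort(-scores[idx])]      # best-scoring first
        chosen.extend(ranked[: min(quota, len(idx))].tolist())
    picked = set(chosen)
    leftovers = [int(i) for i in np.argsort(-scores) if i not in picked]
    chosen.extend(leftovers[: budget - len(chosen)])
    return sorted(chosen)
```

With a 18:2 class imbalance and low-scoring minority examples, a pure score-based top-k would drop the minority class entirely; the quota keeps it represented.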

Authors: Liyana Sahir, Anugu Namratha Reddy, B Srinath Achary, Ashutosh Sharma, Krisha Shah, Sonia Gupta and Siddhartha Asthana

 

 

Others

EvenOddML: Even and Odd Aggregation with Multi-Level Contrastive Learning for Bipartite Graph

Venue: CIKM Full

EvenOddML is a bipartite-graph representation learning model that aggregates information from immediate neighbors and 2-hop same-type neighbors via an even-and-odd encoder. It couples this encoder with a three-level contrastive learning scheme (layer, type-global, network-global) to jointly capture local and global structures. Evaluations on recommendation and link prediction tasks show that EvenOddML outperforms existing bipartite GNN methods, especially in capturing indirect same-type influences.

Authors: Manasvi Aggarwal, Jahnavi Methukumalli, Deepanshu Bagotia and Suhas Power

 

 

Trustworthy AI

Unmasking Bias in Financial AI: A Robust Framework for Evaluating and Mitigating Hidden Biases in LLMs

Venue: ICAIF

This paper proposes a systematic framework to surface and quantify hidden biases in LLM-based financial AI applications. It designs stress-test prompts, evaluation protocols, and mitigation strategies that target fairness, robustness, and regulatory concerns in financial decision-support use cases. Results highlight non-trivial biases in off-the-shelf LLMs and show that the framework’s mitigation pipeline can significantly reduce disparate behaviors across demographic or customer segments.

Authors: Shresth, Balraj, Raghavendra, Hrishikesh and Puspita

 

 

Others

BMI-GP: Unsupervised Breach Merchant Identification via Adaptive Graph Pruning

Venue: ICAIF

BMI-GP addresses unsupervised identification of potentially breached merchants by modeling transaction networks as graphs and pruning them adaptively. The method constructs merchant-centric graphs and applies graph-pruning strategies to isolate suspicious connectivity patterns without labeled breach data. Experimental results on real payment data indicate that BMI-GP surfaces high-risk merchants earlier and with fewer false positives than heuristic thresholding approaches.

Authors: Kamna Meena, Subham Kumar Singh, Priyanshi Gupta, Gaurav Oberoi, Nitish Srivastava and Siddhartha Asthana

 

 

Others

Temporal Boosting for Incremental Tree-based Learning on Tabular Data

Venue: CODS (ADS)

This work proposes a temporal boosting strategy that incrementally updates tree-based models as new time-stamped tabular data streams in, without full retraining. The method adjusts boosting weights and tree updates to respect temporal drift while preserving past knowledge. Across several temporal tabular benchmarks, it yields better accuracy–latency trade-offs than standard batch retraining or naive online updates.

Authors: Rahul, Payal, Bhanu, Maneet, Josh and Chris

 

 

Others

Time-dependent Check-in Attribute Prediction via Domain-aware CSMTPP

Venue: CODS

The paper models time-dependent user check-in attributes using a domain-aware variant of continuous-time spatio-temporal point processes (CSMTPP). It incorporates domain-specific signals (such as location semantics or periodicity) into the intensity function to better predict attributes associated with future check-ins. Experiments on real-world mobility datasets show improved predictive performance over generic spatio-temporal baselines.


Authors: Anand, Ushmita and Maneet

 

 

Others

Can curriculum learning overcome structural disparity in MP-GNNs?

Venue: CODS

This work investigates whether curriculum learning can mitigate structural disparity issues in message-passing GNNs, where nodes with different structural roles are hard to learn jointly. It designs curricula that schedule training over graph regions or structural patterns, gradually increasing complexity. Results on multiple graph benchmarks indicate that appropriate curricula improve stability and accuracy of MP-GNNs under structural heterogeneity compared to standard training.


Authors: Ushmita Pareek, Raunak Pandey, Krisha, Srinath, Sonia and Siddhartha

Machine learning

LocaRank: Merchant Store Location ranking in a Geo-Spatial Graph with GNNs

Venue: GCLR@AAAI 2024

The multi-outlet model (e.g., Target, Walmart, CVS) is a focus of many brands in the retail industry. They want to broaden their market presence and increase accessibility in all areas so that they can minimize the demand-supply gap. Hence, it is imperative that store locations are chosen in a manner that maximizes performance metrics, which may vary with evolving consumer trends and business objectives. In this work, we review existing frameworks for ranking traditional brick-and-mortar stores and highlight their limitations in addressing factors such as a) changing customer demands, needs, and preferences, and b) changing business requirements and priorities. With this paper, we aim to understand the relationship between changing market trends over time and develop a data-driven solution that can help businesses rank and benchmark the performance of a store location against other stores on custom-defined metrics. We evaluate the method using real-world data from merchant stores in the city of San Francisco. We experiment with two different techniques for generating embeddings using a Siamese network and evaluate the performance of different GNNs for ranking merchant stores in a given geo-spatial graph.

Authors: Garima Arora, Akash Choudhary, Kanishk Goyal, Siddhartha Asthana and Deepak Yadav

 

 

Machine learning

A Closer look at Consistency Regularization for Semi-Supervised Learning

Venue: CODS COMAD

Several state-of-the-art deep learning models have utilized consistency regularization by augmenting data during training. In addition to contributing to the generalizability of a model, data augmentation techniques have also been used in semi-supervised learning, where a trained network is used to pseudolabel unlabelled data. During this process, a supervised model assigns pseudolabels generated from augmented variations of the unlabelled data. This allows the model to look at different prediction vectors over such augmented versions of each unlabelled data sample. However, some of these augmentations are stronger than others, depending on the challenges they pose for a supervised model that has been trained on very limited data. We present a thorough study of data augmentation techniques and show that only using the mean response of the model on augmentations, as previous semi-supervised methods do, may not be the best idea for pseudolabelling in such a weakly-supervised paradigm of learning. In particular, for this work, we study consistency regularization from the perspective of pseudolabelling data for a self-training based student-teacher learning framework.

Authors: Soumyadeep Ghosh, Sanjay Kumar, Awanish Kumar and Janu Verma

 

 

Machine learning

HierTGAN: Hierarchical Time Series Generation with Aggregation Constraints

Venue: CODS COMAD

Generative models for time series data have been able to preserve the temporal dynamics of the original time series and are extremely successful in generating realistic synthetic data. However, in the real world, time series data can be disaggregated by various attributes of interest, thereby forming a hierarchical structure, often referred to as hierarchical time series data. Existing models for time series generation do not capture the structural dynamics (inter-level relationships of the hierarchy) of hierarchical time series data. Therefore, in this research, for the first time, we introduce HierTGAN, an auto-regressive generative adversarial network (GAN) for hierarchical time series generation. The proposed HierTGAN solves for an equivalent inter-level relationship within the embedding space generated by an autoencoder. Multiple experiments have been performed to evaluate the effectiveness of HierTGAN in generating realistic synthetic hierarchical time series data.

Authors: Srini Rohan Gujulla Leel, Vikrant Dey, Puspita Majumdar and Ankit Khairkar

 

 

Machine learning

SGD-MLP: Structure Generation and Distillation using a Graph-free MLP

Venue: CODS COMAD

While Graph Neural Networks (GNNs) have shown great results on graph-structured data, they are difficult to use in real-world scenarios due to scalability constraints. Existing methods try to solve this issue by distilling the knowledge from trained GNNs into MLPs. However, these methods still require the graph structure representation during inference, significantly increasing inference latency. While there are strategies available to sparsify or simplify the graph structure and enhance the speed of GNNs, such as minimizing computations like multiplication and accumulation through pruning and quantization, the inherent graph dependency remains. The main hurdle that remains unaddressed is the interdependence of data, which significantly limits the potential for speed improvement. Driven by the distinct advantages and limitations of GNNs and MLPs, we propose SGD-MLP, which integrates the two by leveraging neighbourhood contrastive learning. The key idea is to pre-train the student MLP with structural information alongside knowledge distillation (KD), a process we term structure induction with KD. Inference for SGD-MLP is 225 times faster than GNNs, without sacrificing much accuracy on average. SGD-MLP enhances accuracy by 11.52% compared to standalone MLPs, performs on par with GNNs on 3 out of 5 datasets, and beats the state-of-the-art GLNN by 1.47% on average.
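The distillation side of the idea above can be sketched as a supervised loss plus a soft-target term pulling the student MLP toward the teacher GNN's predictions (a minimal sketch, assuming a standard temperature-scaled KD objective; the weighting `lam`, temperature, and the contrastive pre-training stage are illustrative, not the paper's exact formulation):

```python
import numpy as np

def softmax(z, temp=1.0):
    z = z / temp
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def g2m_distill_loss(student_logits, teacher_logits, labels, lam=0.5, temp=2.0):
    """Cross-entropy on true labels plus temperature-scaled
    KL(teacher || student) on softened logits (classic KD shape)."""
    n = len(labels)
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(n), labels] + 1e-12).mean()
    p_t = softmax(teacher_logits, temp)
    p_st = softmax(student_logits, temp)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_st + 1e-12))).sum(axis=1).mean()
    return ce + lam * (temp ** 2) * kl
```

At inference only the student MLP is evaluated, which is where the graph-free speedup comes from.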

Authors: Sanjay Kumar Patnala, Sumedh B G, Akshay Sethi, Sonia Gupta and Siddhartha Asthana

 

 

Graph Learning

CPa-WAC: Constellation Partitioning-based Scalable Weighted Aggregation Composition for Knowledge Graph Embedding

Venue: IJCAI 2024

Scalability and training time are crucial for any graph neural network model processing a knowledge graph (KG). While partitioning knowledge graphs helps reduce training time, prediction accuracy drops significantly compared to training the model on the whole graph. In this paper, we propose CPa-WAC: a lightweight architecture that incorporates graph convolutional networks and modularity-maximization-based constellation partitioning to harness the power of local graph topology. The proposed CPa-WAC method reduces the training time and memory cost of knowledge graph embedding, making the learning model scalable. Results from our experiments on standard databases, such as WordNet and Freebase, show that by achieving meaningful partitioning, any knowledge graph can be broken down into subgraphs and processed separately to learn embeddings. Furthermore, these learned embeddings can be used for knowledge graph completion, retaining similar performance to training a GCN on the whole KG while speeding up the training process by up to five times. Additionally, the proposed CPa-WAC method outperforms several other state-of-the-art KG embedding methods in terms of prediction accuracy.

Authors: Sudipta Modak, Aakarsh Malhotra, Sarthak Malik, Anil Surisetty and Esam Abdel-Raheem

 

 

Machine learning

CASH via Optimal Diversity for Ensemble Learning

Venue: SIKDD 2024

The Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem is pivotal in Automatic Machine Learning (AutoML). Most leading approaches combine Bayesian optimization with post-hoc ensemble building to create advanced AutoML systems. Bayesian optimization (BO) typically focuses on identifying a singular algorithm and its hyperparameters that outperform all other configurations. Recent developments have highlighted an oversight in prior CASH methods: the lack of consideration for diversity among the base learners of the ensemble. This oversight was overcome by explicitly injecting the search for diversity into the traditional CASH problem. However, despite recent developments, BO’s limitation lies in its inability to directly optimize ensemble generalization error, offering no theoretical assurance that increased diversity correlates with enhanced ensemble performance. Our research addresses this gap by establishing a theoretical foundation that integrates diversity into the core of BO for direct ensemble learning. We explore a theoretically sound framework that describes the relationship between pair-wise diversity and ensemble performance, which allows our Bayesian optimization framework Optimal Diversity Bayesian Optimization (OptDivBO) to directly and efficiently minimize ensemble generalization error. OptDivBO guarantees an optimal balance between pairwise diversity and individual model performance, setting a new precedent in ensemble learning within CASH. Empirical results on 20 public datasets show that OptDivBO achieves the best average test ranks of 1.57 and 1.4 in classification and regression tasks.

Authors: Pranav Poduval, Sanjay Kumar Patnala, Gaurav Oberoi, Nitish Srivasatava and Siddhartha Asthana

Graph Learning

MEGA: Multi-Encoder GNN Architecture for stronger task collaboration and generalization

Venue: ECML PKDD 2024

Self-supervised learning in graphs has emerged as a promising avenue for harnessing unlabeled graph data, leveraging pretext tasks to generate informative node representations. However, the reliance on a single pretext task often constrains generalization across various downstream tasks and datasets. Recent advancements in multi-task learning on graphs aim to tackle this limitation by integrating multiple pretext tasks, framing the problem as a multi-objective optimization to train a shared set of parameters. However, these approaches frequently encounter task interference, where competing tasks degrade overall performance by conflicting with each other due to the limited expressivity of the model. In this work, we introduce MEGA, a novel multi-encoder graph neural network architecture designed to alleviate task interference by providing distinct parameter spaces for the decoupled training of each task. This architecture allows for independent learning from multiple pretext tasks, followed by a simple self-supervised dimensionality reduction technique to combine the insights gleaned. Through extensive experiments, we demonstrate the superiority of our approach, showcasing consistent average performance improvements across three commonly used downstream tasks (i.e., link prediction, node classification, and partition prediction) and nine benchmark datasets.

Authors: Faraz Khoshbakhtian, Gaurav Oberoi, Dionne Aleman and Siddhartha Asthana

 

 

Others

GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP‎

Venue: Learning on Graphs (LoG)

Graph Transformers (GTs) like NAGphormer have shown impressive performance by encoding a graph's structural information and node features. However, their self-attention and complex architectures require high computation and memory, hindering deployment. Thus, we propose a novel framework called Graph Transformer Distillation to Multi-Layer Perceptron (GraTeD-MLP). GraTeD-MLP leverages knowledge distillation (KD) and a novel decomposition of attentional representation to distill the learned representations from the teacher GT to a student MLP. During distillation, we incorporate a gated MLP architecture where two branches learn the decomposed attentional representation for a node while the third predicts node embeddings. Encoding the attentional representation mitigates the MLP's over-reliance on node features, enabling robust performance even in inductive settings. Empirical results demonstrate that the proposed GraTeD-MLP has significantly faster inference than the teacher GT model, with speed-ups ranging from 20x to 40x, and up to 25% improved performance over a vanilla MLP. Furthermore, we empirically show that GraTeD-MLP outperforms other GNN distillation methods on seven datasets in both inductive and transductive settings.

Authors: Sarthak Malik, Aditi Rai, Ram Ganesh V, Himank Sehgal, Akshay Sethi and Aakarsh Malhotra

 

 

Others

Progressive Label Disambiguation for Partial Label Learning in Homogeneous Graphs

Venue: International Conference on Information and Knowledge Management (CIKM)

Many existing Graph Neural Network (GNN) methods assume that labels are reliable and sufficient, which may not be the case in real-world scenarios. This paper addresses one such problem: Partial Label Learning (PLL) on graph-structured data. In PLL for graphs, each node is represented by a candidate set of labels, where only one is true while the others are inaccurate. Despite advancements with PLL in the tabular and vision domains, graph-structured data remains underexplored. In this work, we first define PLL for graphs. Subsequently, we propose a new PLD-Graph algorithm for PLL in homogeneous graphs with scarce labels. We utilize graph augmentation to reduce the effects of inexact labels and provide additional supervision from unlabeled nodes. Progressive label disambiguation is performed based on the model's ability to predict correct classes. Furthermore, an additional loss estimates the label corruption matrix to capture associations between correct and incorrect labels. We show the effectiveness of the proposed algorithm on multiple graph datasets, with two types of noise and varying levels of ambiguous labels. Overall, the proposed PLD-Graph algorithm outperforms state-of-the-art PLL methods.
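The progressive-disambiguation step can be sketched as a momentum update of each node's candidate-label weights toward the model's prediction restricted to its candidate set (a toy sketch in the style of PiCO-like PLL methods; the momentum form and `candidate_mask` interface are illustrative assumptions, not the paper's exact rule):

```python
import numpy as np

def disambiguate_step(weights, probs, candidate_mask, momentum=0.9):
    """One progressive-disambiguation step: renormalize the model's class
    probabilities over each node's candidate set, then move the node's
    label weights toward them with a momentum update."""
    p = probs * candidate_mask                       # zero out non-candidates
    p = p / (p.sum(axis=1, keepdims=True) + 1e-12)   # renormalize over candidates
    w = momentum * weights + (1.0 - momentum) * p
    return w / w.sum(axis=1, keepdims=True)
```

Iterating this concentrates each node's weight on the candidate the model grows confident about, while non-candidate classes can never gain mass.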

Authors: Rajat Patel, Aakarsh Malhotra, Sudipta Modak and Siddharth Yeramsetty

 

 

Machine learning

Learning Representations for Bipartite Graphs using Multi-Task Self-Supervised Learning

Venue: ECML PKDD: Machine Learning and Knowledge Discovery in Databases: Research Track, 2023

Representation learning for bipartite graphs is a challenging problem due to their unique structure and characteristics. The primary challenge is the lack of extensive supervised data and the bipartite graph structure, where two distinct types of nodes exist with no direct connections between nodes of the same kind. Hence, recent algorithms utilize Self-Supervised Learning (SSL) to learn effective node embeddings without needing costly labeled data. However, conventional SSL methods learn through a single pretext task, making the trained model specific to the downstream task. This paper proposes a novel approach for learning generalized representations of bipartite graphs using multi-task SSL. The proposed method utilizes multiple self-supervised tasks to learn improved embeddings that capture different aspects of bipartite graphs, such as graph structure, node features, and local-global information. We utilize deep multi-task learning (MTL) to further assist in learning a generalizable self-supervised solution. To mitigate negative transfer when related and unrelated tasks are trained in MTL, we propose a novel DST++ algorithm. The proposed DST++ optimization algorithm improves on existing DST by considering task affinities and groupings for better initialization and training. The end-to-end proposed method, with complementary SSL tasks and DST++ multi-task optimization, is evaluated on three tasks: node classification, link prediction, and node regression, using four publicly available benchmark datasets. The results demonstrate that our proposed method outperforms state-of-the-art methods for representation learning in bipartite graphs. Specifically, our method achieves up to 12% improvement in accuracy for node classification and up to 9% improvement in AUC for link prediction tasks compared to the baseline methods.

Authors: Akshay Sethi, Sonia Gupta, Aakarsh Malhotra and Siddhartha Asthana

 

 

Machine learning

Contrastive Representation through Angle and Distance based Loss for Partial Label Learning

Venue: ECML PKDD: Machine Learning and Knowledge Discovery in Databases: Research Track, 2023

Partial label learning (PLL) is a form of weakly supervised learning that aims to train a deep network from training instances and their corresponding label sets. The label set, also known as the candidate set, is a group of labels associated with each training instance, out of which only one is the ground truth. Contrastive learning is one of the popular techniques used to learn from a partially labeled dataset, intending to reduce intra-class distance while maximizing inter-class distance. In this paper, we suggest improving the contrastive technique used in PiCO. The proposed Contrastive Representation via Angle and Distance based Loss (CRADL) segregates the contrastive loss into two parts: an angle-based loss and a distance-based loss. The angle-based loss covers the angular separation between two contrastive vectors. However, we showcase a scenario where such an angular loss cannot prefer one contrastive vector over another when both have the same angle. The second, distance-based loss term fixes this issue. We show experiments on CIFAR-10 and CIFAR-100, where the corresponding PLL databases are generated using uniform noise. The experiments show that PLL algorithms learn better with the proposed CRADL-based learning and generate distinguishing representations, as observed by the compact cluster formation with CRADL. This eventually results in CRADL outperforming the current state-of-the-art studies in the PLL setup at different uniform noise rates.
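The angle/distance decomposition can be illustrated with a toy pairwise loss (a minimal sketch; the exact loss forms and the `alpha`/`beta` weights are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def angle_loss(u, v):
    """Angular part: 1 - cosine similarity (zero for aligned vectors),
    blind to vector magnitude."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return 1.0 - cos

def distance_loss(u, v):
    """Distance part: squared Euclidean gap, sensitive to magnitude."""
    return float(np.sum((u - v) ** 2))

def cradl_style_loss(u, v, alpha=1.0, beta=0.1):
    """Combined loss: the distance term breaks ties the angle term cannot."""
    return alpha * angle_loss(u, v) + beta * distance_loss(u, v)
```

Two vectors at the same angle but different magnitudes get identical angular loss, which is exactly the failure case the abstract describes; the distance term separates them.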

Authors: Priyanka Chudasama, Tushar Kadam, Rajat Patel, Aakarsh Malhotra and Mangam Manoj

 

 

Machine learning & Trustworthy AI

Practical Bias Mitigation through Proxy Sensitive Attribute Label Generation

Venue: Workshop on Modelling Uncertainty in the Financial World (MUFin'23) in conjunction with AAAI, 2023

Addressing bias in a trained machine learning system often requires access to sensitive attributes. In practice, these attributes are not available, either due to legal and policy regulations or data unavailability for a given demographic. Existing bias mitigation algorithms are limited in their applicability to real-world scenarios as they require access to sensitive attributes to achieve fairness. In this research work, we aim to address this bottleneck through our proposed unsupervised proxy-sensitive attribute label generation technique. To this end, we propose a two-stage approach of unsupervised embedding generation followed by clustering to obtain proxy-sensitive labels. The efficacy of our work relies on the assumption that bias propagates through non-sensitive attributes that are correlated with the sensitive attributes and, when mapped to the high-dimensional latent space, produce clusters of the different demographic groups that exist in the data. Experimental results demonstrate that bias mitigation using existing algorithms such as Fair Mixup and Adversarial Debiasing yields comparable results on the derived proxy labels when compared against using true sensitive attributes.

Authors: Bhushan Chaudhari, Anubha Pandey, Deepak Bhatt and Darshika Tiwari

 

 

Machine learning

TETRAA - Trained and Selective Transmutation of Encoder-based Adversarial Attack

Venue: International Joint Conference on Neural Networks (IJCNN)

Adversarial attacks are a pivotal field of research. These attacks can help organizations identify model vulnerabilities and protect them from financial and reputational loss. Researchers have already explored several adversarial attacks on computer vision applications. However, these attacks fail on transactional datasets, which carry the added constraints of limited attack attempts and scalability over a large feature space containing both categorical merchant codes and continuous transaction amounts. The existing literature on adversarial attacks for the tabular domain is limited to numerical features only. To overcome these challenges, we introduce TETRAA, a novel genetic algorithm for black-box attacks that produces evolution-based, specialized perturbations through an encoder/autoencoder for each example, giving the advantage of query efficiency. A novel fitness function is formalized for handling the perturbation of categorical features under the constraint of realistic samples. Further, we improve the success rate and the perturbation norm of generated samples using a binary-search-based approach. Experiments show TETRAA requires significantly fewer model queries than several state-of-the-art black-box adversarial attacks (ZOO, boundary attack, HopSkipJump, and ESPA), along with a better success rate and perturbation norm when tested on untargeted attacks.

Authors: Sarthak Malik, Himanshi Charotia and Gaurav Dhama

 

 

Machine learning

Auto-TabTransformer: Hierarchical Transformers for Self and Semi Supervised Learning in Tabular Data

Venue: 2023 International Joint Conference on Neural Networks (IJCNN)

Self- and semi-supervised learning have shown promising results in language and computer vision but are still underexplored in the context of tabular data. This paper focuses on exploring self- and semi-supervised methods for tabular data. Towards this, we propose Auto-TabTransformer, a method for training hierarchical transformers in a self- and semi-supervised setup using redundancy reduction. The technique focuses on key aspects of self- and semi-supervised learning: feature encoding, pre-training objective, training methodology, and neural architecture. Performing extensive experiments on four publicly accessible datasets, we show that Auto-TabTransformer achieves state-of-the-art (SOTA) results in the low-labelled-data regime. We conduct extensive ablation studies detailing the importance of all the components used.

Authors: Akshay Sethi, Sonia Gupta, Ayush Agarwal, Nancy Agrawal and Siddhartha Asthana

 

 

Machine learning

FraudAmmo: Large Scale Synthetic Transactional Dataset for Payment Fraud Detection‎

Venue: IJCNN, 2023‎‎

Global losses due to payment fraud have more than tripled, from $9.84 billion in 2011 to $32.39 billion in 2020, and are expected to reach $40.62 billion by 2027. In addition to the financial losses, fraud negatively impacts brand reputation and leads to a poor customer experience. Advanced machine learning has been actively adopted to tackle the fraud detection problem at scale. However, the scarcity of open datasets leads to less reproducible research, especially in the payments domain. We have released FraudAmmo, a synthetic transactional dataset containing 3 million transactions generated from real-world datasets. FraudAmmo is diverse with respect to transactional channels, geography, customer and fraud types. We leverage the idea of privacy preservation in tabular Generative Adversarial Networks (GANs) to generate FraudAmmo. In addition to privacy-preserving GANs, we apply further checks to ensure customers' differential privacy. The quality of the generated dataset is evaluated on various metrics, including machine learning efficacy, statistical similarity and privacy preservability. We have also benchmarked results on FraudAmmo using machine learning algorithms such as bagging, boosting, MLP and logistic regression. To the best of our knowledge, this is the first large-scale synthetic fraud transaction dataset aimed at helping academia and research groups develop and validate their fraud detection models. The dataset is available at https://github.com/karthi2107/FraudAmmo

Authors: Karthikeswaren R, Kanishka Kayathwal, Gaurav Dhama and Hardik Wadhwa

 

 

Machine learning

TBoost: Gradient Boosting Temporal Graph Neural Networks‎

Venue: TGLR@NeurIPS 2023

Fraud prediction, compromised account detection, and attrition signaling are vital problems in the financial domain. These are generally temporal classification problems, as the labels change with time and exhibit temporal dependence. Each financial transaction contains heterogeneous data like account number, merchant, amount, decline status, etc., and a financial dataset contains chronological transactions. This data possesses three distinct characteristics: heterogeneity, relational structure, and temporal nature. Previous efforts fall short of modeling all these characteristics in a unified way: gradient-boosted decision trees (GBDTs) tackle heterogeneity, graph neural networks (GNNs) model relational information, and temporal GNNs account for temporal dependencies. In this paper, we propose a novel unified framework, TBoost, which combines GBDTs and temporal GNNs to jointly model the heterogeneous, relational, and temporal characteristics of the data. It leverages both node- and edge-level dynamics to solve temporal classification problems. To validate the effectiveness of TBoost, we conduct extensive experiments, demonstrating its superiority in handling the complexities of financial data.

Authors: P Nath, G Waghmare, N Agrawal, N Kumar and S Asthana

 

 

Machine learning

Structure Aware Transformers on Graphs for Node Classification‎

Venue: NeurIPS 2023 GLFrontiers‎

Transformers have achieved state-of-the-art performance in computer vision (CV) and natural language processing (NLP). Inspired by this, recent architectures have incorporated transformers into graph neural networks. Most existing graph transformers either take the set of all nodes as the input sequence, leading to quadratic time complexity, or take only one-hop or k-hop neighbours as the input sequence, thereby completely ignoring long-range interactions. To this end, we propose Structure Aware Transformer on Graphs (SATG), which captures both short-range and long-range interactions in a computationally efficient manner. When dealing with non-Euclidean spaces like graphs, positional encoding becomes an integral component for providing structural knowledge to the transformer. Upon observing the shortcomings of existing positional encodings, we introduce a new class of positional encodings trained on a Neighbourhood Contrastive Loss that effectively captures the entire topology of the graph. We also introduce a method to capture long-range interactions without incurring quadratic time complexity. Extensive experiments on five benchmark datasets show that SATG consistently outperforms GNNs by a substantial margin and also outperforms other graph transformers.

Authors: Sumedh B G, Sanjay Patnala, Himil Vasava, Akshay Sethi and Sonia Gupta

 

 

Machine learning

Learning Temporal Representations of Bipartite Financial Graphs‎

Venue: International Conference on AI in Finance 2023

Dynamic bipartite graphs are naturally suited for modeling temporally evolving interactions in several domains, including digital payments and social media. Though dynamic graphs are widely studied, existing work focuses on homogeneous graphs. This paper proposes a novel framework for representation learning on temporally evolving bipartite graphs. It introduces a bipartite graph transformer layer, an attention-based temporal bipartite graph encoder for learning node representations, and extends the information-maximization objective based on noise contrastive learning to temporal bipartite graphs. This combination of bipartite encoder layer and noise contrastive loss ensures each node-set in the temporal bipartite graph is represented uniquely and disentangled from the other node-set. We experiment on four public datasets with temporal bipartite characteristics. The proposed model shows promising results on transductive and inductive dynamic link prediction and on temporal recommendation.

Authors: Pritam Kumar Nath, Govind Waghmare, Nikhil Tumbde, Nitish Kumar and Siddhartha Asthana

 

 

Machine learning

Improving the Robustness of Financial Models through Identification of the Minimal Vulnerable Feature Set‎

Venue: International Conference on AI in Finance 2023

Research in adversarial robustness has primarily focused on neural networks in domains like computer vision, neglecting the heterogeneous tabular datasets prevalent in finance. The financial domain faces a heightened risk, as malicious actors may exploit ML model vulnerabilities to manipulate transactions and gain unauthorized access. To address this gap, we simulate adversaries' intentions on heterogeneous tabular data and focus on identifying a minimal vulnerable set of features most susceptible to an external attack. Identifying such features enables developers to safeguard their models against adversarial attacks by updating or refining the rules of deployed models. To this effect, we propose a GAN-based architecture, termed the Feature Selector Network, for learning the minimal vulnerable feature set. The proposed method is evaluated on attack imperceptibility metrics, the number of queries, and the time to generate attacks using existing state-of-the-art attack algorithms. Experimental evaluation shows a substantial reduction in the number of queries and in overall attack generation time, along with a significant improvement in imperceptibility metrics such as the perturbation norm and distance to the closest neighbour, while achieving a good success rate. Further analysis showed that a model trained on adversaries generated using the proposed pipeline yields a notable decrease in the adversarial attack success rate on the test set, giving developers a robust technique for safeguarding ML models.

Authors: Anubha Pandey, Himanshu Chaudhary, Alekhya Bhatraju, Deepak Bhatt and Maneet Singh

 

 

Machine learning

Study of Topology Bias in GNN-based Knowledge Graphs Algorithms‎

Venue: MLoG@ICDM‎

Graph neural networks (GNNs) have recently been integrated into knowledge graph representation learning. The efficient message-passing functions in GNNs capture latent relationships between entities within these semantic networks, which aids various downstream tasks such as link prediction, node classification, and entity alignment. However, there is a general deficiency in representation learning on graphs with loops (cycles) and self-loops: traditional message-passing functions induce biased learning on knowledge graphs, leading to skewed predictions. This work presents a detailed analysis of the representation bias generated by these functions on knowledge graphs containing short loops and self-loops. We demonstrate the variance in performance on knowledge graphs with varying topology over two downstream tasks: link prediction and entity alignment. The experiments show that representations from popular learning algorithms are prone to capturing biases in the graphs' structures. These biases, however, have different effects on the formulated downstream tasks, motivating research on topology-invariant representation algorithms for knowledge graphs.

Authors: Anil Surisetty, Aakarsh Malhotra, Deepak Chaurasiya, Sudipta Modak, Siddharth Yerramsetty, Alok Singh, Liyana Sahir and Esam Abdel-Raheem

 

 

Machine learning

INFRANET: Forecasting intermittent time series using DeepNet with parameterized conditional demand and size distribution‎

Venue: AI4TS workshop at ICDM 2023‎

Real-life time series data deals with the problem of intermittency, where demand appears sporadically in time, i.e., long runs of zero demand are observed before periods of nonzero demand. Modelling time series for intermittent demand forecasting (IDF) plays a significant role in several industries, such as retail, automotive, and slow-moving spare parts, to enable effective inventory management. Existing methods forecast inter-demand time and size independently, predicting constant values over the entire interval; this static forecasting reduces overall point accuracy. To tackle this issue, we introduce INFRANET (Intermittent demand Forecasting using Autoregressive Network), a novel approach for generating accurate probabilistic forecasts by parameterizing conditional demand probability and size distributions. The framework incorporates an autoregressive network that jointly models the demand probability and demand size via a weighted loss function. Additionally, we explore three new decoding techniques, two probabilistic and one deterministic, for accurate demand forecasting. Our model is specifically designed to address intermittent and lumpy demand forecasting as well as obsolescence scenarios. Extensive empirical evaluation on multiple datasets shows that the proposed model outperforms existing methods.
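Schematically, a weighted loss that jointly models demand occurrence and demand size can be written as below; the symbols and weights are illustrative, and the paper's exact parameterization may differ:

```latex
% z_t \in \{0,1\}: demand occurrence; y_t: demand size; \mathbf{h}_t: autoregressive hidden state
\mathcal{L}_t = -\,\alpha \left[ z_t \log \pi_t + (1 - z_t) \log (1 - \pi_t) \right]
                \;-\; \beta \, z_t \log f_\phi(y_t \mid \mathbf{h}_t),
\qquad \pi_t = p_\theta(z_t = 1 \mid \mathbf{h}_t)
```

Note the size term contributes only at steps with nonzero demand, which keeps the objective well defined over long zero-demand runs.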

Authors: Diksha Shrivastava, Sarthak Pujari, Yatin Katyal, Siddhartha Asthana, Chandrudu K and Aakashdeep Singh

 

 

Machine learning

On Incorporating new Variables during Evaluation‎

Venue: TRL @ NeurIPS 2023 Poster‎

Any classification or regression model needs access at inference time to the same features and inputs that were utilized to train it. However, in real-world scenarios, many models remain in operation for years, and new variables/features may become available at the inferencing stage. Ordinarily, utilizing such features would require capturing their values in the training dataset and retraining the model. We propose a model-agnostic approach in which a model trained without access to these features can still benefit from the additional features available during testing. We show that, without any access to the extra features during the training phase, the proposed approach improves model performance on four real-world tabular datasets. We provide extensive analysis of how, and which, variables result in improvement over the model trained without the extra feature(s).

Authors: Harsimran Bhasin and Soumyadeep Ghosh

Machine learning, Others

RePS: Relation, Position and Structure aware Entity Alignment‎

Venue: Graph Learning Workshop in conjunction with The ACM Web Conference 2022


Authors: Anil Surisetty, Deepak Chaurasiya, Nitish Kumar, Alok Singh, Gaurav Dhama, Aakarsh Malhotra, Vikrant Dey and Ankur Arora

 

 

Machine learning

Modeling Inter-Dependence Between Time and Mark in Multivariate Temporal Point Processes‎

Venue: ACM International Conference on Information and Knowledge Management, 2022‎

Temporal point processes (TPPs) are probabilistic generative frameworks that model discrete event sequences localized in continuous time. Real-life events generally reveal descriptive information, known as marks; marked TPPs model the time and mark of an event together for practical relevance. Conditioned on past events, marked TPPs aim to learn the joint distribution of the time and mark of the next event. For simplicity, conditionally independent TPP models assume time and marks are independent given the event history, factorizing the conditional joint distribution of time and mark into the product of individual conditional distributions. This structural limitation in the design of TPP models hurts predictive performance on entangled time-mark interactions. In this work, we model the conditional inter-dependence of time and mark to overcome the limitations of conditionally independent models. We construct a multivariate TPP that conditions the time distribution on the current event's mark in addition to past events. Besides conventional intensity-based models for the conditional joint distribution, we also draw on flexible intensity-free TPP models from the literature. The proposed TPP models outperform conditionally independent and dependent models in standard prediction tasks. Our experimentation on various datasets with multiple evaluation metrics highlights the merit of the proposed approach.
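The structural difference at stake can be written out directly; with $\mathcal{H}_t$ the event history, the two factorizations of the next event's time $t$ and mark $m$ are:

```latex
% Conditionally independent models: time and mark decoupled given history
p(t, m \mid \mathcal{H}_t) = p(t \mid \mathcal{H}_t)\, p(m \mid \mathcal{H}_t)
% Proposed inter-dependent models: time conditioned on the current mark
p(t, m \mid \mathcal{H}_t) = p(m \mid \mathcal{H}_t)\, p(t \mid m, \mathcal{H}_t)
```

The second factorization is exact by the chain rule, so the gain comes from letting the time model see the mark rather than from any extra assumption.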


Authors: Govind Waghmare, Ankur Debnath, Siddhartha Asthana and Aakarsh Malhotra

 

 

Others

CaPE: Category Preserving Embeddings for Similarity-Search in Financial Graphs

Venue: ACM International Conference on AI in Finance. 2022‎

Similarity search is an important problem for the payment industry, which holds rich user-merchant interaction data. It finds merchants similar to a given merchant and supports tasks like peer-set generation, recommendation, community detection, and anomaly detection. Recent works have shown that, by leveraging interaction data, graph neural networks (GNNs) can generate node embeddings for entities like merchants, which can then be used for such similarity-search tasks. However, most real-world financial data comes with high-cardinality categorical features, such as city, industry and super-industry, which are fed to GNNs in a one-hot-encoded manner. Current GNN algorithms are not designed for such sparse features, which makes it difficult for them to learn embeddings that preserve this information. In this work, we propose CaPE, a Category Preserving Embedding generation method that preserves high-cardinality feature information in the embeddings; CaPE is designed to preserve other important numerical feature information as well. We compare CaPE with the latest GNN embedding generation algorithms to showcase its superiority on peer-set generation tasks over real-world datasets, both external and internal (synthetically generated). We also compare our method on a downstream link prediction task.

Authors: Gaurav Oberoi, Pranav Poduval, Karamjit Singh, Sangam Verma and Pranay Gupta

 

 

Fraud Detection, others

Guided Self-Training based Semi-Supervised Learning for Fraud Detection

Venue: ACM International Conference on AI in Finance, 2022‎

Semi-supervised learning has attracted the attention of AI researchers in the recent past, especially after the advent of deep learning methods and their success in several real-world applications. Most deep learning models require large amounts of labelled data, which is expensive to obtain. Fraud detection is a very important problem for several industries, and large amounts of data are often available; however, obtaining labels is cumbersome, so semi-supervised learning is perfectly positioned to aid in building robust and accurate supervised models. In this work, we consider different fraud detection paradigms and show that a self-training-based semi-supervised learning approach can produce significant improvements over a model trained on a limited set of labelled data. We propose a novel self-training approach using a guided sharpening technique based on a pair of autoencoders, which provide useful cues for incorporating unlabelled data into the training process. We conduct thorough experiments and analysis on three different real-world datasets to showcase the effectiveness of the approach. On the Elliptic Bitcoin fraud dataset, we show that utilizing unlabelled data improves the F1-score of a model trained on limited labelled data by around 10%.
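The paper's guided sharpening relies on a pair of autoencoders; the generic confidence-thresholded self-training loop it builds on can be sketched with a toy nearest-centroid classifier standing in for the fraud model (all names are hypothetical):

```python
import numpy as np

def self_train(x_lab, y_lab, x_unlab, rounds=5, threshold=0.9):
    """Confidence-thresholded self-training: repeatedly fit a toy
    nearest-centroid classifier, pseudo-label the unlabelled pool, and
    promote only confident pseudo-labels into the labelled set."""
    x_lab, y_lab = x_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        # "Train": one centroid per class from the current labelled pool
        centroids = np.stack([x_lab[y_lab == c].mean(0) for c in (0, 1)])
        if len(x_unlab) == 0:
            break
        d = np.linalg.norm(x_unlab[:, None, :] - centroids[None], axis=2)
        p = np.exp(-d) / np.exp(-d).sum(1, keepdims=True)  # crude confidence
        conf, pred = p.max(1), p.argmax(1)
        keep = conf >= threshold
        if not keep.any():
            break
        # Promote confident pseudo-labels into the labelled pool
        x_lab = np.concatenate([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        x_unlab = x_unlab[~keep]
    return centroids

rng = np.random.default_rng(1)
x0 = rng.normal(loc=-3, size=(5, 2))   # class-0 cluster
x1 = rng.normal(loc=3, size=(5, 2))    # class-1 cluster
x_l = np.concatenate([x0[:2], x1[:2]]); y_l = np.array([0, 0, 1, 1])
x_u = np.concatenate([x0[2:], x1[2:]])  # unlabelled remainder
cents = self_train(x_l, y_l, x_u)
```

The guided sharpening of the paper would replace the softmax confidence here with cues from the autoencoder pair.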

Authors: Awanish Kumar, Soumyadeep Ghosh and Janu Verma

 

 

Fraud Detection, others

Adversarial Fraud Generation for Improved Detection

Venue: ACM International Conference on AI in Finance, 2022‎‎

Generative Adversarial Networks (GANs) are known for their ability to learn data distributions and hence are a suitable alternative for handling class imbalance through oversampling. However, they still fail to capture the diversity of the minority class owing to its limited representation, for example, frauds in our study; in particular, fraudulent patterns closer to the class boundary get missed by the model. This paper proposes using GANs to simulate fraud transaction patterns conditioned on genuine transactions, enabling the model to learn a translation function between both spaces. Further, to synthesize fraudulent samples from the class boundary, we train GANs using losses inspired by the data-poisoning-attack literature and discuss their efficacy in improving fraud detection classifier performance. The efficacy of the proposed framework is demonstrated through experimental results on the publicly available European Credit-Card Dataset and the CIS Fraud Dataset.

Authors: Anubha Pandey, Alekhya Bhatraju, Shiv Markam and Deepak Bhatt

 

 

Others

A Semi-Supervised Vulnerability Management System

Venue: ‎Intelligent Systems Conference (IntelliSys) 2022​​​​​​​‎

With the advent of modern network security advancements, the computational resources of an organization are always under threat from external entities, such as hackers or miscreants who might cause significant damage to data and other software or hardware resources. A vulnerability is a general way of representing a weakness in the software or hardware resources of an organization's computational infrastructure. Such vulnerabilities may be minor software issues, or in some cases may expose vital computational resources to external threats. The first step is to scan the entire computational infrastructure for such vulnerabilities; once they are ascertained, a patching process is carried out to mitigate the threats. For effective mitigation, the most serious vulnerabilities should be given higher priority, which requires a scoring mechanism for all scanned vulnerabilities. We present an end-to-end deployed vulnerability management system that scores these vulnerabilities using their natural-language descriptions.

Authors: Soumyadeep Ghosh, Sourojit Bhaduri, Sanjay Kumar, Janu Verma, Yatin Katyal and Ankur Saraswat

 

 

Trustworthy AI, others

FLiB: Fair Link Prediction in Bipartite Network‎

Venue: 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-22), 2022‎

Graph neural networks (GNNs) have become a popular modeling choice in many real-world applications like social networks, recommender systems, and molecular science. GNNs have been shown to exhibit greater bias than other ML models trained on i.i.d. data, and as they are applied to many socially consequential use cases, it becomes imperative for the model results and learned representations to be fair. Real-world applications of GNNs involve learning over heterogeneous networks with several node and edge types. We show that various kinds of nodes in a heterogeneous network can pick up bias from a particular node type and remain non-trivial to debias using standard fairness algorithms. We propose a novel framework, Fair Link Prediction in Bipartite Networks (FLiB), that ensures fair link prediction while learning fair representations for all node types with respect to the sensitive attribute of one of them. We further propose S-FLiB, which mitigates bias at the subgroup level by regularising model predictions for subgroups defined over problem-specific grouping criteria.

Authors: Piyush, Nitish Kumar, Sangam Verma, Karamjit Singh and Pranav Poduval

 

 

Trustworthy AI, others

GroupMixNorm Layer for Learning Fair Models

Venue: Workshop on Interpolation Regularizers and Beyond in conjunction with NeurIPS, 2022‎‎

Recent research has focused on proposing algorithms for mitigating bias in automated prediction systems. Most techniques include convex surrogates of fairness metrics, such as demographic parity or equalized odds, in the loss function, which are not easy to estimate. Further, these fairness constraints are mostly data-dependent and aim to minimize disparity among the protected groups during training, so they may not achieve similar performance on the test set. To address these limitations, this research proposes a novel GroupMixNorm layer for bias mitigation in deep learning models. As an alternative to solving a constrained optimization separately for each fairness metric, we formulate bias mitigation as a problem of distribution alignment across the groups identified through the protected attributes. To this effect, the GroupMixNorm layer probabilistically mixes group-level feature statistics of samples across different groups based on the protected attribute. The proposed method improves upon several fairness metrics with minimal impact on accuracy. Experimental evaluation and extensive analysis on benchmark tabular and image datasets demonstrate its efficacy in achieving state-of-the-art performance.
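A minimal sketch of the core idea is to normalize each sample with its own group's statistics and re-scale with a mixture of both groups' statistics, aligning the group-conditional feature distributions. The actual layer is trained end-to-end inside the network; the details below (including the Beta-distributed mixing coefficient and all names) are assumptions for illustration:

```python
import numpy as np

def group_mix_norm(x, group, lam=None, rng=None):
    """Mix group-level feature statistics across two protected groups.

    Each sample is normalized with its own group's (mean, std), then
    re-scaled with a convex mixture of both groups' statistics, pushing
    group-conditional feature distributions toward alignment.
    """
    rng = rng or np.random.default_rng()
    stats = {g: (x[group == g].mean(0), x[group == g].std(0) + 1e-8)
             for g in (0, 1)}
    if lam is None:
        lam = rng.beta(2.0, 2.0)  # mixup-style mixing coefficient (assumed)
    mu_mix = lam * stats[0][0] + (1 - lam) * stats[1][0]
    sd_mix = lam * stats[0][1] + (1 - lam) * stats[1][1]
    out = np.empty_like(x, dtype=float)
    for g in (0, 1):
        mu_g, sd_g = stats[g]
        out[group == g] = (x[group == g] - mu_g) / sd_g * sd_mix + mu_mix
    return out

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (100, 3)), rng.normal(5, 2, (100, 3))])
g = np.array([0] * 100 + [1] * 100)      # protected-group labels
x_mixed = group_mix_norm(x, g, lam=0.5)
gap_before = abs(x[g == 0].mean() - x[g == 1].mean())
gap_after = abs(x_mixed[g == 0].mean() - x_mixed[g == 1].mean())
```

After the layer, both groups share the mixed mean and scale, so the group-wise mean gap collapses while within-group structure is preserved.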

Authors: Anubha Pandey, Aditi Rai, Maneet Singh, Deepak Bhatt and Tanmoy Bhowmik

 

 

Others

Post-pandemic Economic Transformations in the United States of America‎

Venue: Workshop on Social Data Mining in the Post-pandemic Era (SocDM 2022) in conjunction with IEEE Conference on Data Mining, 2022‎

The COVID-19 pandemic has impacted economic activity not only in the United States but across the globe. Lockdowns and travel restrictions imposed by local authorities have led to changes in customer preferences and thus a transformation of economic activity from traditional areas to new regions. While most changes have been temporary and short-term, some have been observed to be of a permanent nature. Using large-scale aggregated and anonymized transaction data across various socio-economic groups, we analyse and discuss this temporary relocation of citizens' economic activities in metropolitan areas of 15 US states. The results of this study have extensive implications for urban planners and business owners, and can provide insights into the temporary relocation of economic activities resulting from an extreme exogenous shock like the COVID-19 pandemic.

Authors: Avi Chawla, Nidhi Mulay, Vikas Bishnoi, Yatin Katyal, Ankur Saraswat, Mohsen Bahrami, Esteban Moro and Alex Pentland

 

 

Fraud Detection, Machine learning

Label-aware Sampling using Contrastive Learning for GNN-based Fraud Detection

Venue: Workshop on ML in Finance​​​​​​​ in conjunction with 28th SIGKDD Conference on Knowledge Discovery and Data Mining, 2022‎

Graph-based methods have garnered a lot of attention in fraud detection tasks due to the relational nature of fraud behaviour. Owing to the success of graph neural networks (GNNs) in various graph-analytical problems like link prediction, node classification and graph classification, various GNN-based fraud detection models have been proposed. Most GNN-based approaches rely on aggregating information from neighbours to make inferences for a given node; however, these architectures do not explicitly identify which neighbours are valuable to the learning task, and aggregating uninformative neighbours may harm model performance. In many real-world fraud situations, the label distribution is highly skewed, with a small fraction of fraud events compared to non-fraud events. The problem of sampling relevant neighbours for GNN aggregation is further exacerbated under heavy class imbalance, since a fraudulent node can easily camouflage among many non-fraud nodes and rely on neighbour aggregation to evade the fraud detector. In this paper, we propose a novel GNN-based imbalanced fraud detection model. Our approach first splits a node's full neighbourhood into label-aware sub-graphs, which are then sampled by means of a separate Siamese network trained with a contrastive loss. The contrastive network assigns a score to each pair of neighbouring nodes, and sampling the neighbourhood by contrastive score allows us to under-sample the majority-class neighbourhood. We then employ separate GNN layers for each filtered sub-graph to aggregate information and build corresponding node embeddings, which are combined via an aggregation function into a final node representation that is mapped to the class label by a multi-layer perceptron. Experiments on the real-world Bitcoin transaction dataset (Elliptic) demonstrate that the proposed framework outperforms state-of-the-art baselines.
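The label-aware sampling step can be sketched as follows; cosine similarity stands in for the trained Siamese/contrastive scorer, and all names are hypothetical:

```python
import numpy as np

def sample_label_aware(node_feat, neigh_feats, neigh_labels, k):
    """Split a node's neighbourhood into label-aware sub-graphs and
    under-sample the majority-class sub-graph by a relevance score."""
    def score(a, b):
        # Cosine similarity as a stand-in for the contrastive scorer
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    subgraphs = {}
    for label in (0, 1):  # 0 = non-fraud (majority), 1 = fraud (minority)
        idx = np.where(neigh_labels == label)[0]
        if len(idx) > k:  # keep only the k highest-scoring neighbours
            s = np.array([score(node_feat, neigh_feats[i]) for i in idx])
            idx = idx[np.argsort(-s)[:k]]
        subgraphs[label] = idx
    return subgraphs

rng = np.random.default_rng(2)
node = rng.normal(size=8)
feats = rng.normal(size=(50, 8))
labels = np.array([0] * 45 + [1] * 5)   # heavy class imbalance
sub = sample_label_aware(node, feats, labels, k=5)
```

Each label-aware sub-graph would then be fed to its own GNN layer, as described above.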

Authors: Garima Arora, Adarsh Patankar, Akash Choudhary and Janu Verma

 

 

Machine learning, Fraud Detection

FairGen: Fair Synthetic Data Generation

Venue: DataPerf2022 Workshop in conjunction with International Conference on Machine Learning (ICML), 2022‎‎

With the rising adoption of machine learning across domains like banking, pharmaceuticals and ed-tech, it has become of utmost importance to adopt responsible AI methods to ensure models do not unfairly discriminate against any group. Given the lack of clean training data, generative adversarial techniques are preferred for generating synthetic data, with several state-of-the-art architectures readily available across domains, from unstructured data such as text and images to structured datasets modelling fraud detection and more. These techniques overcome challenges such as class imbalance, limited training data, and restricted access to data due to privacy issues. Existing work on generating fair data either works only for a certain GAN architecture or is very difficult to tune across GANs. In this paper, we propose a pipeline to generate fairer synthetic data independent of the GAN architecture. The pipeline utilizes a pre-processing algorithm to identify and remove bias-inducing samples. In particular, we claim that while generating synthetic data, most GANs amplify bias present in the training data, but by removing these bias-inducing samples, GANs focus more on real, informative samples. Our experimental evaluation on two open-source datasets demonstrates that the proposed pipeline generates fairer data, along with improved performance in some cases.


Authors: Himanshu Chaudhary, Bhushan Chaudhari, Aakash Agarwal, Kamna Meena and Tanmoy Bhowmik

 

 

Fraud Detection

TeGraF: Temporal and Graph based Fraudulent Transaction Detection Framework

Venue: 2nd ACM International Conference on Artificial Intelligence in Finance (ICAIF), 2021‎‎

Detection of fraudulent transactions is an imperative research area in the financial domain, affecting the different entities involved in the payment process. An accurate fraud detection algorithm helps identify fraudulent transactions, facilitating immediate response and dispute resolution. To this effect, this research proposes TeGraF, a novel framework for detecting fraudulent transactions by modeling temporal and structural features from a given input. The proposed algorithm operates at the intersection of two key research areas: temporal point processes (TPPs) and graph neural networks (GNNs). Due to the wide occurrence of sequential data in the financial domain, TPPs are very useful for modeling sequences of transactions. In parallel, financial data can also be represented as a graph structure capturing interactions between users and vendors/merchants. Thus, the proposed algorithm utilizes temporal features learned via a TPP-based model and structural features captured via a GNN for modeling fraudulent transactions. Different graph representation learning techniques, including Node2Vec, Metapath2Vec, LINE, DeepWalk, and BiNE, are employed to compare overall performance. Experiments on a synthetic dataset containing 62K users and 4M transactions demonstrate the improved performance of the proposed technique compared to existing algorithms.


Authors: Shivshankar Reddy, Pranav Poduval, Anand Vir Singh Chauhan, Maneet Singh, Sangam Verma, Karamjit Singh and Tanmoy Bhowmik

 

 

Machine learning

AuthSHAP — Authentication Vulnerability Detection on Tabular Data in Black Box Setting

Venue: 2nd ACM International Conference on Artificial Intelligence in Finance (ICAIF), 2021‎‎‎

Adversarial archetypes exploit the workings of any system to disrupt the robustness and decision-making of the underlying machine learning algorithms. This is an important area of concern in the banking industry, where the global adoption of real-time and frictionless payment systems has prompted financial institutions to invest in superior authentication solutions. Identifying fraudulent actors poses an inherent challenge for varied reasons: thorough fraud detection can hurt customer satisfaction, since it slows down the authentication process while customers demand faster responses. Present systems tend to produce numerous false positives and remain vulnerable to ingenious fraudsters, which, in conjunction with dollar loss, also leads to reputation loss for the issuers. In this paper, we present AuthSHAP, a first-of-its-kind, model-agnostic and robust application of SHAP (SHapley Additive exPlanations) to uncover the extent to which key features are (or are not) leveraged by a model in its decision making. This 'knowledge' is significant information for a fraudster designing intelligent or adversarial attacks. We show that even in black-box settings, where the attacker has only a fair process- or business-level understanding of the raw features passed through the system and of its responses, it is possible to understand the vulnerability. Thus, this can be used by any financial institution as an active preventive measure (a) to shield their authentication system from adversarial attacks and (b) to reduce false declines that cause a sub-optimal customer experience, amounting to both revenue and reputation loss for entities across the payment value chain. We propose an evaluation method using a simulation where we create a decision system (model) with a desired vulnerability and identify it via our proposed methodology in a black-box setting. We extend this by performing several experiments on aggregated and anonymized real-world financial transaction data and validate our findings with internal subject matter experts.
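AuthSHAP builds on Shapley values; as a rough illustration of the underlying computation, the exact Shapley attribution for a toy black-box scoring function can be computed by enumerating feature coalitions (the scoring function and feature values below are hypothetical; real systems rely on SHAP's sampling-based approximations):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for a black-box scoring function f.

    Features in a coalition take their value from x; the rest are
    held at the baseline. Enumerating coalitions is feasible only
    for a handful of features, but it mirrors what SHAP
    approximates at scale.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Hypothetical linear fraud score: the Shapley values recover each
# feature's contribution relative to the baseline.
score = lambda v: 0.7 * v[0] + 0.3 * v[1]
phi = shapley_values(score, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

Only query access to the scoring function is needed here, which is why attributions of this kind are both an explanation tool and, as the paper argues, a potential attack surface.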


Authors: Debasmita Das, Yatin Katyal, Rajesh Kumar Ranjan, Ram Ganesh V and Rohit Bhattacharya

 

 

Machine learning

MeTGAN: Memory efficient Tabular GAN for high cardinality categorical datasets

Venue: 28th International Conference on Neural Information Processing (ICONIP), 2021‎

Generative Adversarial Networks (GANs) have seen their use for generating synthetic data expand from unstructured data like images to structured tabular data. One of the recently proposed models in the field of tabular data generation, CTGAN, demonstrated state-of-the-art performance on this task even in the presence of a high class imbalance in categorical columns or multiple modes in continuous columns. Many recently proposed methods have also derived ideas from CTGAN. However, training CTGAN requires a high memory footprint when dealing with high-cardinality categorical columns in the dataset. In this paper, we propose MeTGAN, a memory-efficient version of CTGAN, which reduces memory usage by roughly 80% with a minimal effect on performance. MeTGAN uses sparse linear layers to overcome the memory bottlenecks of CTGAN. We compare the performance of MeTGAN with other models on publicly available datasets. Quality of data generation, memory requirements, and the privacy guarantees of the models are the metrics considered in this study. This paper also aims to draw the attention of the research community to the computational footprint of tabular data generation methods, to enable them on larger datasets, especially those with high-cardinality categorical variables.
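As a rough sketch of why sparse layers cut memory (the toy layer below is an illustrative stand-in, not MeTGAN's actual implementation): a layer that stores only its nonzero weights scales with the number of nonzeros rather than with in_features × out_features, which is what dominates when one-hot-encoded high-cardinality columns feed a dense layer.

```python
import random

class SparseLinear:
    """Toy sparse linear layer: stores only nonzero weights in a
    (row, col) -> value map, so memory scales with the number of
    nonzeros rather than in_features * out_features. MeTGAN applies
    the same idea to the layers touching high-cardinality inputs."""

    def __init__(self, in_features, out_features, density=0.1, seed=0):
        rng = random.Random(seed)
        self.out_features = out_features
        self.weights = {}
        for o in range(out_features):
            for i in range(in_features):
                if rng.random() < density:
                    self.weights[(o, i)] = rng.uniform(-0.1, 0.1)

    def forward(self, x):
        # Accumulate only over stored (nonzero) weights.
        y = [0.0] * self.out_features
        for (o, i), w in self.weights.items():
            y[o] += w * x[i]
        return y

layer = SparseLinear(in_features=1000, out_features=64, density=0.05)
dense_params = 1000 * 64
sparse_params = len(layer.weights)
```

At 5% density the layer holds roughly a twentieth of the dense parameter count, which is the kind of saving the paper exploits.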


Authors: Shreyansh Singh, Kanishka Kayathwal, Hardik Wadhwa and Gaurav Dhama

 

 

Time Series Modelling

Deviation-based Marked Temporal Point Process for Marker Prediction (2021)

Venue: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2021‎

Temporal Point Processes (TPPs) are useful for modeling event sequences that do not occur at regular time intervals. For example, TPPs can be used to model the occurrence of earthquakes, social media activity, financial transactions, etc. Owing to their flexible nature and applicability in several real-world scenarios, TPPs have gained wide attention from the research community. In the literature, TPPs have mostly been used to predict the occurrence time of the next event, with limited focus on the type/category of the event, termed the marker. Further, limited focus has been given to modeling the inter-dependency of the event time and marker information for more accurate predictions. To this effect, this research proposes a novel Deviation-based Marked Temporal Point Process (DMTPP) algorithm focused on predicting the marker corresponding to the next event. Specifically, the deviation between the estimated and actual occurrence of the event is modeled for predicting the event marker. The DMTPP model is particularly useful in scenarios where the marker information is not known immediately upon the event's occurrence but is instead obtained after some time. DMTPP utilizes a Recurrent Neural Network (RNN) as its backbone for encoding the historical sequence pattern, and models the dependence between the marker and event time prediction. Experiments have been performed on three publicly available datasets for different tasks, where the proposed DMTPP model demonstrates state-of-the-art performance. For example, an accuracy of 91.76% is obtained on the MIMIC-II dataset, an improvement of over 6% from the state-of-the-art model.

Authors: Anand Vir Singh Chauhan, Shivshankar Reddy, Maneet Singh, Karamjit Singh and Tanmoy Bhowmik

 

 

Fraud Detection

CuRL: Coupled Representation Learning of Cards and Merchants to detect Transaction Frauds (2021)

Venue: 30th International Conference on Artificial Neural Networks (ICANN), 2021‎

Payment networks like Mastercard or Visa process billions of transactions every year. A significant number of these transactions are fraudulent and cause huge losses to financial institutions. Conventional fraud detection methods fail to capture higher-order interactions between payment entities, i.e., cards and merchants, which could be crucial to detect out-of-pattern, possibly fraudulent transactions. Several works have focused on capturing these interactions by representing the transaction data either as a bipartite graph or as homogeneous graph projections of the payment entities. In a homogeneous graph, higher-order cross-interactions between the entities are lost and hence the representations learned are sub-optimal. In a bipartite graph, the sequences generated through random walks are stochastic, computationally expensive to generate, and sometimes drift away to include uncorrelated nodes. Moreover, scaling graph-learning algorithms and using them for real-time fraud scoring is an open challenge. In this paper, we propose CuRL and tCuRL, coupled representation learning methods that can effectively capture the higher-order interactions in a bipartite graph of payment entities. Instead of relying on random walks, the proposed methods generate coupled session-based interaction pairs of entities, which are then fed as input to a skip-gram model to learn entity representations. The model learns the representations for both entities simultaneously and in the same embedding space, which helps to capture their cross-interactions effectively. Furthermore, considering the session-constrained neighborhood structure of an entity makes the pair generation process efficient. This paper demonstrates that the proposed methods run faster than many state-of-the-art representation learning algorithms and produce embeddings that outperform other relevant baselines on the fraud classification task.
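The session-based pair generation can be sketched as follows (a simplified, hypothetical version of the paper's scheme; the resulting pairs would then be fed to a skip-gram model to learn card and merchant embeddings in one space):

```python
from collections import defaultdict

def session_pairs(transactions, session_gap=3600):
    """Generate coupled training pairs per session, in the spirit of
    CuRL: a card's transactions are split into sessions by time gaps,
    and entities seen within a session are paired.

    transactions: list of (card, merchant, unix_time) tuples.
    """
    by_card = defaultdict(list)
    for card, merchant, t in transactions:
        by_card[card].append((t, merchant))
    pairs = []
    for card, events in by_card.items():
        events.sort()
        session, last_t = [], None
        for t, merchant in events:
            if last_t is not None and t - last_t > session_gap:
                session = []                    # time gap: new session
            pairs.append((card, merchant))      # card-merchant pair
            for prev in session:                # merchant-merchant pairs
                pairs.append((prev, merchant))
            session.append(merchant)
            last_t = t
    return pairs

txns = [("c1", "m1", 0), ("c1", "m2", 100), ("c1", "m3", 10000)]
pairs = session_pairs(txns, session_gap=3600)
```

Merchants visited close together in time end up paired, while the session reset keeps unrelated entities apart, in contrast to random walks that can drift to uncorrelated nodes.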


Authors: Maitrey Gramopadhye*, Shreyansh Singh*, Kushagra Agarwal, Nitish Srivasatava, Alok Mani Singh, Siddhartha Asthana and Ankur Arora

 

 

Time Series Modelling

A Survey on Classical and Deep Learning based Intermittent Time Series Forecasting Methods (2021)

Venue: International Joint Conference on Neural Networks (IJCNN), 2021

Demand forecasting is a fundamental aspect of inventory and supply chain management. Due to the sporadic nature of the demand, demand forecasting involves dealing with intermittent time series in domains such as retail and manufacturing. Conventional forecasting methods do not work well for intermittent time series due to the inherent sparsity in such series. Researchers have proposed multiple methods to deal with intermittent time series, such as Croston's method and its variants. Our work aims to provide insight into the various forecasting methods traditionally known to work well for intermittent series. We have also explored deep learning methods proposed in recent literature. These methods are thoroughly reviewed and explained in this survey paper. Additionally, experiments are done on two publicly available datasets to compare the performance of the traditional methods with deep learning models. Furthermore, a hybrid model made of independent classification and regression trees has been implemented and studied as well. We provide a comprehensive evaluation that aims at selecting the appropriate method given the dataset, context, and objectives that have to be met by the forecasting practitioner/researcher.
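For reference, Croston's method, the classical baseline such surveys start from, smooths the nonzero demand sizes and the intervals between them separately; the per-period forecast is size divided by interval (a minimal sketch; initialization conventions vary across implementations):

```python
def croston(demand, alpha=0.1):
    """Croston's method for intermittent demand: exponential
    smoothing of nonzero demand sizes (z) and of inter-demand
    intervals (p), forecasting z / p per period. Forecasts are
    None until the first nonzero demand is observed."""
    z = p = None       # smoothed demand size, smoothed interval
    q = 1              # periods since the last nonzero demand
    forecasts = []
    for d in demand:
        forecasts.append(None if z is None else z / p)
        if d > 0:
            if z is None:
                z, p = float(d), float(q)      # initialize on first demand
            else:
                z += alpha * (d - z)
                p += alpha * (q - p)
            q = 1
        else:
            q += 1
    return forecasts

f = croston([0, 3, 0, 0, 3, 0], alpha=0.1)
```

Because the forecast is a ratio of two smoothed quantities rather than a smoothed series of mostly zeros, it does not collapse toward zero between demand spikes, which is why it remains the standard baseline for intermittent series.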

Authors: Karthikeswaren Ramachandran, Kanishka Kayathwal, Gaurav Dhama and Ankur Arora

 

 

Fraud Detection

Application of Reinforcement Learning to Payment Fraud (2021)‎

Venue: Workshop on Multi-Armed Bandits and Reinforcement Learning: Advancing Decision Making in E-Commerce and Beyond in conjunction with the 27th ACM Conference on Knowledge Discovery and Data Mining (KDD), 2021‎

The large variety of digital payment choices available to consumers today has been a key driver of e-commerce transactions in the past decade. Unfortunately, this has also given rise to cybercriminals and fraudsters who are constantly looking for vulnerabilities in these systems by deploying increasingly sophisticated fraud attacks. A typical fraud detection system employs standard supervised learning methods where the focus is on maximizing the fraud recall rate. However, we argue that such a formulation can lead to sub-optimal solutions. The design requirements for these fraud models require that they be robust to the high class imbalance in the data, adaptive to changes in fraud patterns, maintain a balance between the fraud rate and the decline rate to maximize revenue, and be amenable to asynchronous feedback, since there is usually a significant lag between the transaction and the fraud realization. To achieve this, we formulate fraud detection as a sequential decision-making problem by including the utility maximization within the model in the form of the reward function. The historical decline rate and fraud rate define the state of the system, with a binary action space composed of approving or declining the transaction. In this study, we primarily focus on utility maximization and explore different reward functions to this end. The performance of the proposed Reinforcement Learning system has been evaluated on two publicly available fraud datasets using Deep Q-learning and compared with different classifiers. We aim to address the rest of the issues in future work.
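The utility-maximization idea can be made concrete with a toy reward function (the coefficients and exact form below are illustrative assumptions; the paper explores several reward designs):

```python
APPROVE, DECLINE = 1, 0

def reward(action, is_fraud, amount, margin=0.02, fraud_penalty=1.0):
    """Illustrative utility-style reward for a fraud-detection agent:
    approving a genuine transaction earns a small revenue margin,
    approving fraud costs the transaction amount, and declining a
    genuine transaction forfeits that margin (a false decline)."""
    if action == APPROVE:
        return -fraud_penalty * amount if is_fraud else margin * amount
    return 0.0 if is_fraud else -margin * amount
```

Under a reward like this, a policy that maximizes expected return must trade the fraud rate off against the decline rate, rather than maximizing recall alone.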


Authors: Siddharth Vimal, Kanishka Kayathwal, Hardik Wadhwa and Gaurav Dhama

 

 

Machine learning

DEDD: Deep Encoder with Dual Decoder Architecture for Stability and Specificity Preserving Encoding and Translation of Embedding between Domains (2021)

Venue: International Conference on Information Technology and Cloud Computing (ITCC) in conjunction with International Conference on Computing, Networks and Internet of Things (CNIOT), 2021‎

We propose a deep learning-based encoder with a dual-decoder system to enrich the expressive power of embeddings pre-trained on two different corpora, along with switching representations between domains. There are two scenarios: (a) each corpus pertains to a different subject matter or topic of interest, and (b) one corpus is a vast super-domain with generic, non-specific embeddings while the second pertains to one specific sub-domain. In either case, the criterion for high-quality training is to have enough common words between them. Mapping the contextual embeddings from both corpora into a common latent space blends the semantic richness of both corpus-specific learnings while maintaining embedding stability. Furthermore, there is one dedicated decoder for each domain for generating representations from the common latent space. We evaluated our method for cross-learning between generalized GloVe embeddings and a very specialized skill embedding developed by random walks on a graph-based Skills Hierarchy. We demonstrate that our method preserves the stability of the generic embedding and the specificity of the skill domain, and enriches the semantic representation of either domain through the switching enabled by the encoder-to-dual-decoder path.


Authors: Rajesh Kumar Ranjan, Debasmita Das, Ram Ganesh V., Yatin Katyal and Tanmoy Bhowmik

 

 

Time Series Modelling

Deep Learning based Time Series Forecasting (2021)‎

Venue: Book chapter in the ‘Deep Learning Applications, Volume 3’ to be published by Springer, 2021‎

For decision-makers in the forecasting sector, decision processes such as facility planning and optimal day-to-day operation within the domain are complex, with several different levels to be considered. These decisions address widely different time horizons and aspects of the system, making them difficult to model. The advent of deep learning in forecasting removed the need for expensive hand-crafted features and deep domain knowledge. This work aims to give structure to the existing literature on deep learning-based time series forecasting. Based on the underlying structure of each technique, such as RNN, CNN, and Transformer, we categorize various deep learning-based time series forecasting techniques and provide a consolidated report. Additionally, we perform experiments to compare these techniques on four different publicly available datasets. Finally, based on these experiments, we provide intuitive reasoning behind their performance. We believe this work will help researchers choose relevant techniques for future research.

Authors: Kushagra Agarwal, Lalasa Dheekollu, Gaurav Dhama, Ankur Arora, Siddhartha Asthana and Tanmoy Bhowmik

 

 

Machine learning

A comparative study on Transformers for Word Sense Disambiguation‎

Venue: 28th International Conference on Neural Information Processing (ICONIP), 2021‎

Recent years of research in Natural Language Processing (NLP) have witnessed dramatic growth in training large models for generating context-aware language representations. In this regard, numerous NLP systems have leveraged the power of neural network-based architectures to incorporate sense information in embeddings, resulting in Contextualized Word Embeddings (CWEs). Despite this progress, the NLP community has not witnessed any significant work performing a comparative study on the contextualization power of such architectures. This paper presents a comparative study and an extensive analysis of nine widely adopted Transformer models: BERT, CTRL, DistilBERT, OpenAI-GPT, OpenAI-GPT2, Transformer-XL, XLNet, ELECTRA, and ALBERT. We evaluate their contextualization power using two lexical sample Word Sense Disambiguation (WSD) tasks, SensEval-2 and SensEval-3. We adopt a simple yet effective approach to WSD that uses a k-Nearest Neighbor (kNN) classification on CWEs. Experimental results show that the proposed techniques achieve superior results over the current state-of-the-art on both WSD tasks.
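The kNN-on-CWEs approach is simple enough to sketch end to end: assign a query occurrence the majority sense among its k nearest labelled occurrences by cosine similarity (toy two-dimensional vectors stand in for real Transformer embeddings):

```python
from math import sqrt
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_wsd(query_emb, labelled_occurrences, k=3):
    """kNN word sense disambiguation over contextualized embeddings:
    rank labelled occurrences of the target word by cosine similarity
    to the query occurrence and take the majority sense of the top k.

    labelled_occurrences: list of (embedding, sense) pairs."""
    ranked = sorted(labelled_occurrences,
                    key=lambda item: cosine(query_emb, item[0]),
                    reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy inventory for the ambiguous word "bank" (vectors are made up).
inventory = [([1.0, 0.1], "bank/finance"),
             ([0.9, 0.2], "bank/finance"),
             ([0.1, 1.0], "bank/river")]
pred = knn_wsd([0.95, 0.15], inventory, k=3)
```

Because the classifier itself is fixed and trivial, any accuracy difference between models isolates the contextualization power of the embeddings, which is the point of the comparison.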


Authors: Avi Chawla, Nidhi Mulay, Vikas Bishnoi, Gaurav Dhama and A.K. Singh

 

 

Machine learning

Improving the performance of Transformer Context Encoders for NER‎

Venue: IEEE Conference on Information Fusion (FUSION), 2021​‎

Large Transformer-based models have provided state-of-the-art results on a variety of Natural Language Processing (NLP) tasks. While these Transformer models perform exceptionally well on a wide range of NLP tasks, their usage in sequence labeling has been mostly muted. Although pretrained Transformer models such as BERT and XLNet have been successfully employed as input representations, the use of the Transformer model as a context encoder for sequence labeling is still minimal, and most recent works still use a recurrent architecture as the context encoder. In this paper, we compare the performance of the Transformer and recurrent architectures as context encoders on the Named Entity Recognition (NER) task. We vary the character-level representation module from previously proposed NER models in the literature and show how the modification can improve the NER model's performance. We also explore data augmentation as a method for enhancing performance. Experimental results on three NER datasets show that our proposed techniques establish a new state-of-the-art using the Transformer encoder over previously proposed models in the literature using only non-contextualized embeddings.


Authors: Avi Chawla, Nidhi Mulay, Vikas Bishnoi and Gaurav Dhama

 

 

Machine learning

Evolutionary adversarial attacks on Payment Systems‎

Venue: 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021‎‎‎

Credit card fraud detection is arguably the most critical use case of machine learning for any payment system. Deep neural networks and tree-based classifiers can provide state-of-the-art performance for fraud classification. However, we try to emphasize that these models have serious vulnerabilities that need to be addressed. Studies show that it is possible to fool machine learning models with curated input samples known as adversarial examples. Attackers can use these examples to deceive the fraud classifiers deployed by institutions, causing considerable financial harm. We feel that the literature on adversarial examples for fraud detection systems has been limited to simpler datasets. In this paper, we use two large publicly available datasets for credit card fraud detection to benchmark the performance of some conventional machine learning models and compare the effectiveness of different black-box attacks on the best-performing model. Lastly, we introduce a novel gradient-free approach to black-box attacks, which uses evolution-based specialized perturbations to create attacks (ESPA). We show that the new method requires far fewer queries than other black-box attack methods like Zeroth Order optimization, Boundary Attack, and HopSkipJump, and can leverage the information gained from previously successful attacks.
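A minimal gradient-free, evolution-style black-box attack loop looks like the following (an illustrative sketch under simplifying assumptions, not the paper's ESPA algorithm; the toy fraud scorer is hypothetical):

```python
import math
import random

def evolve_attack(classify, x, steps=200, pop=8, sigma=0.05, seed=0):
    """Evolution-style black-box attack sketch: each generation
    mutates the current best input with Gaussian noise, keeps the
    mutant that most lowers the model's fraud score, and stops once
    the decision flips below 0.5. Each classify() call is one query,
    so fewer generations means fewer queries."""
    rng = random.Random(seed)
    best = list(x)
    best_score = classify(best)
    for _ in range(steps):
        if best_score < 0.5:          # decision flipped: attack found
            break
        candidates = []
        for _ in range(pop):
            cand = [v + rng.gauss(0, sigma) for v in best]
            candidates.append((classify(cand), cand))
        score, cand = min(candidates)  # elitism: keep only improvements
        if score < best_score:
            best_score, best = score, cand
    return best, best_score

# Toy differentiable-free target: a logistic score over two features.
fraud_score = lambda v: 1 / (1 + math.exp(-(v[0] + v[1])))
adv, adv_score = evolve_attack(fraud_score, [1.0, 1.0])
```

The elitist update guarantees the score never increases, and no gradient information is ever requested, which is the defining property of the attack family the paper benchmarks against.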

Authors: Nishant Kumar, Siddharth Vimal, Kanishka Kayathwal and Gaurav Dhama

 

 

Machine learning

Modelling Approaches for Silent Attrition Prediction in Payment Networks‎

Venue: 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021‎‎

Predicting customer attrition (churn) is a well-known problem in industries that provide services, such as financial institutions, telecommunications, e-commerce, and retail. There are two kinds of attrition: active and passive (silent). Active attrition is usually associated with subscription-based business models, commonly seen in telecommunications and internet businesses like Netflix. In industries like finance, retail, and e-commerce, we see the other kind: silent attrition, where customers stop doing business without formal notice. This makes the silent attrition prediction problem even more challenging, because it is difficult to differentiate between attrited and inactive customers. We focus our work on predicting silent attrition, which is still under-explored in the payment card industry (i.e., Mastercard, Visa). The contribution of our work is threefold. First, we present a data-driven approach to define silent attrition as customer inactivity. Second, we discuss multiple procedures to generate synthetic data, thereby preserving customers' privacy. Last, we present a comprehensive view of the various machine learning (ML) pathways in which this churn prediction problem can be framed and solved, each requiring specific feature engineering. We present experimental results corresponding to each pathway for comparative analysis. We believe this work will be beneficial to researchers and ML practitioners who often have to deal with sensitive financial data but have limited permission to use it. In this direction, we demonstrate the use of synthetic data generation to reduce the risk of data leakage and other privacy concerns relating to ML model development.
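A data-driven inactivity definition can be sketched by deriving the attrition window from observed inter-transaction gaps (an illustrative construction with a hypothetical quantile choice, not necessarily the paper's exact procedure):

```python
def inactivity_threshold(txn_days, quantile=0.95):
    """Data-driven inactivity window: take a high quantile of the
    customer base's inter-transaction gaps, so 'silently attrited'
    means 'inactive for longer than almost any active customer
    ever is between purchases'.

    txn_days: list of per-customer sorted transaction-day lists."""
    gaps = sorted(b - a for days in txn_days for a, b in zip(days, days[1:]))
    return gaps[min(int(quantile * len(gaps)), len(gaps) - 1)]

def is_silently_attrited(last_txn_day, today, threshold):
    """Label a customer attrited once their silence exceeds the
    data-derived window (no formal churn event exists to label)."""
    return (today - last_txn_day) > threshold

threshold = inactivity_threshold([[0, 10, 20], [0, 30]])
```

Deriving the window from the data, rather than fixing it by fiat, is what separates attrited customers from merely infrequent ones.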


Authors: Lalasa Dheekollu, Hardik Wadhwa, Siddharth Vimal, Anubhav Gupta, Siddhartha Asthana, Ankur Arora and Smriti Gupta

 

 

Machine learning

Label-Value Extraction from Documents using Co-SSL Framework‎

Venue: 17th International Conference on Advanced Data Mining and Applications (ADMA), 2021‎

Label-value extraction from documents refers to the task of extracting relevant values for corresponding labels/fields. For example, it encompasses extracting the total amount from receipts, the date value from invoices/patents/forms, or the tax amount from receipts/invoices. Automated label-value extraction has widespread applicability in real-world scenarios of document understanding, book-keeping, reconciliation and content summarization. Recent research has focused on developing label-value extraction models; however, to the best of our knowledge, limited attention has been given to developing a lightweight, compact label-value extraction module generalizable across different document types. Since in real-world deployment a model is often required to process different types of documents for the same label/field type, this research proposes a novel Context-based Semi-supervised (Co-SSL) framework for the task. The proposed Co-SSL framework focuses on identifying candidates for each label/field, followed by the generation of their context based on spatial cues. Further, novel data augmentation strategies are proposed that are specifically applicable to the problem of information extraction from documents. The extracted information (candidate and context) is then provided to a deep learning based model trained in a novel semi-supervised setting for applicability in real-world scenarios of limited training data. The performance of the Co-SSL framework has been demonstrated on three challenging datasets containing different document types (receipts, patents and forms).

Authors: Sara Sai Abhishek, Maneet Singh, Bhanupriya Pegu and Karamjit Singh

 

 

Fraud Detection

Temporal Debiasing using Adversarial Loss based GNN Architecture for Crypto Fraud Detection

Venue: ‎20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021‎

The tremendous rise of cryptocurrency in the payment domain has unlocked huge opportunities but has also raised numerous challenges involving cybercriminal activities such as money laundering, terrorist financing, and illegal and risky services, owing to its anonymous and decentralized setup. The demand for building a more transparent cryptocurrency network, resilient to such activities, has risen extensively as more financial institutions look to incorporate it into their networks. While a plethora of traditional machine learning and graph-based deep learning techniques have been developed to detect illicit activities in a cryptocurrency transaction network, the challenge of generalization and robust model performance on future timesteps still exists. In this paper, we show that models learned on the transactional feature set provided in the Elliptic dataset carry a temporal bias, i.e., they are highly dependent on the timesteps in which transactions occur. Deploying temporally biased models limits their performance on future timesteps. To address this, we propose a temporal debiasing technique using a GNN-based architecture that ensures generalization by adversarially learning between fraud classification and temporal classification. (Fraud and illicit are used interchangeably in this paper.) The constructed adversarial loss optimizes the embeddings to ensure that they (1) perform well on the fraud classification task and (2) do not contain temporal bias. The proposed architecture captures the underlying fraud patterns that remain consistent over time. We evaluate the performance of our proposed architecture on the Elliptic dataset and compare it with existing machine learning and graph-based architectures.

Authors: Aditya Singh, Anubhav Gupta, Hardik Wadhwa, Siddhartha Asthana and Ankur Arora

 

 

Fraud Detection

Med-Dynamic Meta Learning — A multi-layered representation to identify provider fraud in healthcare

Venue: The International FLAIRS Conference Proceedings, Vol. 34, 2021‎‎

Every year, health insurance fraud costs taxpayers billions of dollars and puts patients' health and welfare at risk. Existing solutions to detect fraudulent providers (hospitals, physicians, etc.) aim to find unusual patterns in claim-level features but fail to harness provider-provider and provider-patient interaction information. We propose a novel framework, Med-Dynamic Meta Learning (MeDML), that extends the capability of traditional fraud detection by learning patterns from (1) patient-provider interaction, using temporal and geo-spatial characteristics, (2) a provider's treatment, using encounter data (e.g., medical codes, mix of attended patients), and (3) referral, using underlying provider-provider relationships based on common patient visits within 30 days. To the best of our knowledge, MeDML is the first framework that can model fraud using a multi-aspect representation of providers. MeDML also encapsulates a provider's phantom-billing index, which identifies excessive and unnecessary services provided to patients, by segmenting frequently co-occurring diagnoses and procedures in non-fraudulent providers' claims. It uses a novel framework to aggregate the learned representations, capturing their task-specific relative importance via an attention mechanism. We test the dynamically generated meta embedding using various downstream models and show that it outperforms all baseline algorithms on the provider fraud prediction task.
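The attention-based aggregation of the per-aspect representations can be sketched as a softmax-weighted sum (the relevance scores are given here for illustration; in MeDML they are learned jointly with the task):

```python
from math import exp

def attention_aggregate(embeddings, scores):
    """Attention-style fusion of per-aspect provider embeddings
    (e.g., interaction, treatment, referral): softmax the relevance
    scores into weights, then take the weighted sum of the vectors."""
    weights = [exp(s) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]

# Two toy aspect embeddings with equal relevance fuse to their mean.
meta = attention_aggregate([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

The softmax weights make the relative importance of each aspect explicit and task-specific, which is the interpretability benefit attention brings over plain concatenation.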


Authors: Nitish Kumar, Deepak Chaurasiya, Alok Singh, Siddhartha Asthana, Kushagra Agarwal and Ankur Arora

 

 

Machine learning

MoDest: Multi-module Design Validation for Documents‎

Venue: ACM India Joint International Conference on Data Science and Management of Data (ACM CODS-COMAD), 2021‎

Information extraction (IE) from Visually Rich Documents (VRDs) is a common need for businesses, where extracted information is used for various purposes such as verification, design validation, or compliance. Most of the research in IE from VRDs has focused on textual documents such as invoices and receipts, while extracting information from multi-modal VRDs remains a challenging task. This research presents a novel end-to-end design validation framework for multi-modal VRDs containing textual and visual components, checking compliance against a pre-defined set of rules. The proposed Multi-mOdule DESign validaTion (MoDest) framework constitutes two steps: (i) information extraction using five modules for obtaining the textual and visual components, followed by (ii) validation of the extracted components against a pre-defined set of design rules. Given an input multi-modal VRD image, the MoDest framework either accepts or rejects its design while providing an explanation for the decision. The proposed framework is tested for design validation on a particular type of VRD, banking cards, under the real-world constraint of limited and highly imbalanced training data with more than 99% of card designs belonging to one class (accepted). Experimental evaluation on real-world images from our in-house dataset demonstrates the effectiveness of the proposed MoDest framework. Analysis drawn from the real-world deployment of the framework further strengthens its utility for design validation.

Authors: Bhanupriya Pegu, Maneet Singh, Kamal Kant, Karamjit Singh and Tanmoy Bhowmik

 

 

Trustworthy AI

Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data‎

Venue: Data-centric AI Workshop in conjunction with Conference on Neural Information Processing Systems (NeurIPS), 2021‎

Machine learning models built on datasets containing discriminative instances, attributable to various underlying factors, result in biased and unfair outcomes. It is a well-founded and intuitive observation that existing bias mitigation strategies often sacrifice accuracy in order to ensure fairness. But when an AI engine's predictions are used for decision making that affects revenue or operational efficiency, such as credit risk modeling, the business would prefer that accuracy be reasonably preserved. This conflicting requirement of maintaining both accuracy and fairness motivates our research. In this paper, we propose a fresh approach for simultaneously improving the fairness and accuracy of ML models within a realistic paradigm. The essence of our work is a data preprocessing technique that can detect instances ascribing a specific kind of bias that should be removed from the dataset before training, and we further show that such instance removal has no adverse impact on model accuracy. In particular, we claim that in problem settings where instances exist with similar features but different labels caused by variation in protected attributes, an inherent bias gets induced in the dataset, which can be identified and mitigated through our novel scheme. Our experimental evaluation on two open-source datasets demonstrates how the proposed method can mitigate bias while improving rather than degrading accuracy, all while offering a certain degree of control to the end user.
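The core preprocessing idea, flagging instances whose non-protected features match but whose labels differ across protected groups, can be sketched as follows (a simplified, hypothetical version using exact feature matches; the paper's detection scheme is more general):

```python
from collections import defaultdict

def biased_instances(rows):
    """Flag instances that encode label bias: groups of rows with
    identical non-protected features where the label varies along
    with the protected attribute. Such rows are candidates for
    removal before training.

    rows: list of (features_tuple, protected_value, label)."""
    groups = defaultdict(list)
    for idx, (feats, prot, label) in enumerate(rows):
        groups[feats].append((idx, prot, label))
    flagged = set()
    for feats, members in groups.items():
        labels_by_prot = defaultdict(set)
        for idx, prot, label in members:
            labels_by_prot[prot].add(label)
        # Bias signal: multiple protected groups share these features
        # but receive different label sets.
        label_sets = set(frozenset(s) for s in labels_by_prot.values())
        if len(labels_by_prot) > 1 and len(label_sets) > 1:
            flagged.update(idx for idx, _, _ in members)
    return flagged

# Rows 0 and 1 have identical features but opposite labels that track
# the protected attribute; row 2 is untouched.
rows = [((1, 2), "a", 1), ((1, 2), "b", 0), ((3, 4), "a", 1)]
flagged = biased_instances(rows)
```

Removing only these contradictory instances is what lets the method improve fairness without the usual accuracy penalty: the remaining training data is more self-consistent, not merely smaller.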


Authors: Aakash Agarwal, Bhushan Chaudhari and Tanmoy Bhowmik

 

 

Others

Effects of stimulus checks on spending patterns of different economic groups

Venue: ‎IEEE International Conference on Data Mining Workshops (ICDMW), 2021‎‎

This paper uses daily anonymous aggregated transaction data to analyze the changes in consumer spending caused by receipt of the stimulus payments in the United States during the COVID-19 pandemic. The stimulus checks were provided as part of the CARES Act, aiming to provide emergency assistance to individuals and businesses affected by the pandemic. We analyze the impact of the receipt of those payments on the aggregated daily spending of different socio-economic groups and industries. We show that the transaction patterns of low-spending consumers were the most impacted by the stimulus payments among the different spending groups. Our results also indicate that consumer responses after the first stimulus check (April 2020) were substantial and significant in industries that sell daily essential items, whereas consumer responses after the third stimulus check (March 2021) were significant for non-essential goods (e.g., the luxury and entertainment sectors). The results of this study are of crucial importance because they could help policy makers better shape stimulus payments that may be needed in future emergencies.


Authors: Nidhi Mulay, Vikas Bishnoi, Yatin Katyal, Mohsen Bahrami, Esteban Moro, Ankur Saraswat and Alex Pentland

 

 

Time Series Modelling

Adversarial Generation of Temporal Data: A Critique on Fidelity of Synthetic Data‎

Venue: First International Workshop on Machine Learning for Irregular Time Series in conjunction with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2021

Generative modelling for temporal data has seen a paradigm shift from autoregressive to adversarial models. Adversarial generation algorithms have proven to be more efficient in capturing the complex temporal correlations that the simplistic autoregressive model could not. Albeit, high-fidelity remains a concern even for adversarial modelling. The generation of high-fidelity data requires the model to have three strengths: capture feature correlations, model long-term dependencies, and scalability in dimensions. This paper analyzes these strengths on the existing methods of adversarial temporal generation regarding the fidelity of synthetic data. Towards this, we evaluate different algorithms for adversarial temporal generation on five different datasets of varying dynamics (long-term vs. short-term dependency) and dimensionality. We conclude by discussing gaps in the literature and future directions for high fidelity temporal data generation through adversarial methods.

Authors: Ankur Debnath, Govind Waghmare, Hardik Wadhwa, Siddhartha Asthana and Ankur Arora

 

 

Time Series Modelling

Exploring Generative Data Augmentation in Multivariate Time Series Forecasting: Opportunities and Challenges (2021)‎

Venue: Workshop on Mining and Learning from Time Series (MiLeTS) in conjunction with ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2021‎

In multivariate time series (MTS), each time point comprises multiple time-dependent variables. The short-term and long-term correlation of these variables is a significant characteristic of MTS and a key challenge when modelling it. While classical auto-regressive models are heavily used to model MTS, neural models are more flexible and efficient. However, neural models rely on a large amount of labelled data for training, and the availability of labelled time series data can be a bottleneck in real-world scenarios. This scarcity of labelled data can be mitigated by data augmentation. In MTS, augmentation techniques need to capture both short-term correlations and long-term temporal dynamics. In this work, we introduce a novel meta-algorithm for time-series data augmentation to address the data scarcity problem. Due to the intrinsic ordering of samples in time series, we argue that one cannot simply add synthetic samples to the real samples for augmentation. To this end, we generate synthetic MTS data that preserves temporal dynamics using an off-the-shelf generative algorithm and frame augmentation in MTS as a transfer learning problem. In addition, we point out the drawbacks of generative models in MTS augmentation. We show the effectiveness of our method on publicly available MTS datasets for forecasting, and perform qualitative and quantitative analysis of synthetic MTS data and its applicability to long-term forecasting. To the best of our knowledge, this is the first study of generative data augmentation for MTS forecasting.
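The transfer-learning framing above — pretrain a forecaster on synthetic series, then fine-tune on scarce real data — can be sketched as a toy. The linear AR(p) forecaster and sine-wave data below are our own illustrative stand-ins, not the paper's generator or model:

```python
import numpy as np

def fit_ar(series, p=3, w=None, lr=0.1, epochs=200):
    """Fit a linear AR(p) one-step forecaster by gradient descent.
    Passing `w` warm-starts from weights pretrained on synthetic data."""
    X = np.stack([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    w = np.zeros(p) if w is None else w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# plentiful synthetic series vs. a scarce real one (hypothetical data)
synthetic = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.normal(size=500)
real = np.sin(np.linspace(0, 20, 80)) + 0.1 * rng.normal(size=80)

w_pre = fit_ar(synthetic)         # pretrain on synthetic data
w_final = fit_ar(real, w=w_pre)   # fine-tune on the small real set
```

The key point is that the synthetic samples never enter the real training set directly; they only shape the initialization, respecting the intrinsic ordering of the real series.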

Authors: Ankur Debnath, Govind Waghmare, Hardik Wadhwa, Siddhartha Asthana and Ankur Arora

 

 

Machine learning

Self-Training with Ensemble of Teacher Models (2021)‎

Venue: Workshop on Weakly Supervised Representation Learning in conjunction with the 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021‎‎

Training robust deep learning models requires large amounts of labelled data. In the absence of such large repositories of labelled data, unlabeled data can be exploited instead. Semi-supervised learning aims to utilize such unlabeled data for training classification models. Recent progress in self-training based approaches has shown promise in this area, motivating this study, where we utilize an ensemble approach for the same. A by-product of any semi-supervised approach may be a loss of calibration of the trained model, especially in scenarios where the unlabeled data contains out-of-distribution samples, which leads to this investigation of how to adapt to such effects. Our proposed algorithm carefully avoids common pitfalls in utilizing unlabeled data and leads to a more accurate and calibrated supervised model compared to vanilla self-training based student-teacher algorithms. We perform several experiments on the popular STL-10 database, followed by an extensive analysis of our approach and a study of its effects on model accuracy and calibration.
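The student-teacher step at the heart of such methods can be illustrated with a toy pseudo-labelling routine. The ensemble averaging and confidence filter below are a minimal sketch of the general idea, not the paper's algorithm; the probabilities and threshold are hypothetical:

```python
import numpy as np

def pseudo_label(teacher_probs, threshold=0.9):
    """Average the softmax outputs of an ensemble of teachers and keep
    only unlabeled samples the ensemble agrees on with high confidence,
    which also filters likely out-of-distribution samples."""
    avg = np.mean(teacher_probs, axis=0)   # (n_samples, n_classes)
    conf = avg.max(axis=1)
    keep = conf >= threshold
    return np.where(keep)[0], avg[keep].argmax(axis=1)

# three hypothetical teachers scoring four unlabeled samples, two classes
probs = np.array([
    [[0.95, 0.05], [0.6, 0.4], [0.1, 0.9], [0.55, 0.45]],
    [[0.97, 0.03], [0.5, 0.5], [0.2, 0.8], [0.45, 0.55]],
    [[0.99, 0.01], [0.7, 0.3], [0.15, 0.85], [0.5, 0.5]],
])
idx, labels = pseudo_label(probs)
# only sample 0 (average confidence ~0.97) earns a pseudo-label
```

The retained pairs would then be added to the student's training set for the next round.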


Authors: Soumyadeep Ghosh, Sanjay Kumar, Janu Verma and Awanish Kumar

 

 

Time Series Modelling

Semi-supervised Learning for Marked Temporal Point Processes‎

Venue: Workshop on Weakly Supervised Representation Learning in conjunction with the 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021‎

Temporal Point Processes (TPPs) are often used to represent sequences of events ordered by time of occurrence. Owing to their flexible nature, TPPs have been used to model different scenarios and have shown applicability in various real-world applications. While TPPs focus on modeling event occurrence, Marked Temporal Point Processes (MTPPs) also model the category/class of the event (termed the marker). Research in MTPP has garnered substantial attention over the past few years, with an extensive focus on supervised algorithms. Despite this focus, limited attention has been given to the challenging problem of developing solutions for semi-supervised settings, where algorithms have access to a mix of labeled and unlabeled data. This research proposes a novel algorithm for Semi-supervised Learning for Marked Temporal Point Processes (SSL-MTPP) applicable in such scenarios. The proposed SSL-MTPP algorithm utilizes a combination of labeled and unlabeled data to learn a robust marker prediction model, using an RNN-based Encoder-Decoder module to learn effective representations of the time sequence. The efficacy of the proposed algorithm is demonstrated via multiple protocols on the Retweet dataset, where SSL-MTPP achieves improved performance in comparison to the traditional supervised learning approach.

Authors: Shivshankar Reddy, Anand Vir Singh Chauhan, Maneet Singh and Karamjit Singh

 

 

Machine learning

Table Structure Recognition using CoDec Encoder-Decoder (2021)

Venue: Workshop on Machine Learning in conjunction with the 16th International Conference on Document Analysis and Retrieval (ICDAR), 2021‎

Automated document analysis and parsing has been a focus of research for a long time. An important component of document parsing is understanding tabular regions: identifying their structure, followed by precise information extraction. While substantial effort has gone into table detection and information extraction from documents, table structure recognition remains a long-standing task demanding dedicated attention. Identifying the table structure enables the extraction of structured information from tabular regions, which can then be utilized in further applications. To this end, this research proposes a novel table structure recognition pipeline consisting of row identification and column identification modules. The column identification module utilizes a novel Column Detector Encoder-Decoder model (termed the CoDec Encoder-Decoder), trained via a novel loss function to predict the column mask for a given input image. Experiments analyze the different components of the proposed pipeline, supporting their inclusion for enhanced performance. The proposed pipeline is evaluated on the challenging ICDAR 2013 table structure recognition dataset, where it demonstrates state-of-the-art performance.


Authors: Bhanupriya Pegu, Maneet Singh, Aakash Agarwal, Aniruddha Mitra and Karamjit Singh

 

 

Fraud Detection

Intent2Vec: Representation Learning of Cardholder and Merchant intent from Temporal Interaction Sequences for Fraud Detection‎

Venue: Workshop on Applied Semantics Extraction and Analytics in conjunction with the 30th International Joint Conference on Artificial Intelligence (IJCAI), 2021‎

Fraud detection has been a challenging problem for financial institutions, as fraud causes a loss of $24.2 billion per annum globally. This paper focuses on transaction fraud, the most prevalent type of fraud in the payment industry. The ability to detect and decline potentially fraudulent transactions in real time is crucial to guarantee a robust and secure environment for both cardholders and merchants. Conventional fraud detection techniques predominantly use rule-based methods or extensive manual feature engineering for machine learning models. These fraud models rely on detecting anomalies in the attributes of a transaction; however, they fail to capture any interaction between the cardholder and merchant involved in a transaction. The proposed approach, Intent2Vec, extends the capability of traditional fraud models by learning representations of payment entities using NLP approaches to semantically capture the intent behind a transaction. The modelled intent enables us to predict the next set of plausible merchants for a card, and vice versa. Any deviation between the predicted and observed card or merchant can point towards potential fraud. We test the relevance of intent-based semantics on the downstream task of fraud detection, where classifiers utilizing the entities' learnt intent outperform other baseline algorithms on metrics such as AUC-PR and F1 score.
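A minimal sketch of how learnt intent can flag a deviation: score a candidate merchant against the mean embedding of a card's recent merchants. The embeddings and merchant names below are hypothetical toys; the paper learns real entity representations with NLP methods:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def intent_score(history, candidate, emb):
    """Cosine similarity between a candidate merchant's embedding and the
    mean embedding ("intent") of the card's recently visited merchants.
    A low score can flag a potentially fraudulent transaction."""
    intent = np.mean([emb[m] for m in history], axis=0)
    return cosine(intent, emb[candidate])

# hypothetical 2-d embeddings for three merchants
emb = {
    "grocery_a": np.array([1.0, 0.1]),
    "grocery_b": np.array([0.9, 0.2]),
    "casino_x":  np.array([-0.8, 1.0]),
}
usual = intent_score(["grocery_a", "grocery_b"], "grocery_a", emb)
odd = intent_score(["grocery_a", "grocery_b"], "casino_x", emb)
# the in-pattern merchant scores near 1; the outlier scores negative
```

In practice such a score would be one feature feeding a downstream fraud classifier rather than a decision rule on its own.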

Authors: Nitish Kumar, Shinyjohn Shaju, Kanishka Kayathwal, Deepak Chaurasiya, Kushagra Agarwal, Alok Singh, Siddhartha Asthana and Ankur Arora

 

 

Trustworthy AI

Transitioning from Real to Synthetic Data: Quantifying the Bias in Model

Venue: Workshop on Synthetic Data Generation: Quality, Privacy, Bias in conjunction with the International Conference on Learning Representations (ICLR), 2021

With the advent of generative modeling techniques, synthetic data has penetrated various domains, from unstructured data such as images and text to structured datasets modeling healthcare outcomes, risk decisioning in the financial domain, and many more. It overcomes challenges such as limited training data, class imbalance, and restricted access to datasets owing to privacy issues. To ensure that models used for automated decisioning make fair decisions, prior work exists to quantify and mitigate these issues. This study aims to establish the trade-off between bias and fairness in models trained on synthetic data. Variants of synthetic data generation techniques, including differentially private generation schemes, were studied to understand bias amplification. Through experiments on a tabular dataset, we demonstrate that models trained on synthetic data exhibit varying levels of bias impact. Techniques generating less correlated features perform well on fairness metrics, with 94%, 82%, and 88% relative drops in DPD (demographic parity difference), EoD (equality of odds) and EoP (equality of opportunity) respectively, and a 24% relative improvement in DPR (demographic parity ratio) with respect to the real dataset. We believe the outcome of our research will help data science practitioners understand the bias inherent in the use of synthetic data.
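The fairness metrics mentioned (DPD and DPR) reduce to simple comparisons of group-wise positive-prediction rates; a minimal sketch, assuming binary predictions and a binary protected attribute:

```python
import numpy as np

def demographic_parity(y_pred, group):
    """Demographic parity difference (DPD) and ratio (DPR): the absolute
    gap and the ratio of positive-prediction rates between two groups.
    DPD near 0 and DPR near 1 indicate a fairer model."""
    rate0 = y_pred[group == 0].mean()
    rate1 = y_pred[group == 1].mean()
    return abs(rate0 - rate1), min(rate0, rate1) / max(rate0, rate1)

# hypothetical predictions for eight individuals, four per group
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
dpd, dpr = demographic_parity(y_pred, group)
# group 0 positive rate 0.75 vs. group 1 rate 0.25 → DPD 0.5, DPR 1/3
```

Equality of odds and equality of opportunity are computed analogously, but on group-wise true-positive (and false-positive) rates rather than raw prediction rates.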


Authors: Aman Gupta, Deepak Bhatt and Anubha Pandey

 

 

Others

Server-Language Processing: A Semi-Supervised approach to Server Failure Detection

Venue: International Conference on Information Technology and Cloud Computing (ITCC) in conjunction with the International Conference on Computing, Networks and Internet of Things (CNIOT), 2021

As industrial systems continue to grow in scale and complexity, an effective and proactive failure management approach helps mitigate the impact of server failure. While supervised methods fail to perform well on real-world servers due to label noise in log data and their inability to detect unseen failures, unsupervised techniques are often too naive to differentiate between complex log structures. We propose an NLP-based semi-supervised solution that learns a rich understanding of healthy and failure log patterns using an ensemble of deep learning based density and sequential models. Our hypothesis is that server logs follow a language of their own, which we attempt to decipher through Server-Language Processing. Experimental evaluations on real-world log data show that our proposed solution outperforms other existing log-based anomaly detection methods in real-world applications. The solution was deployed on 3,000 servers over 6 months of log data and was able to pick up server failures up to 2 weeks in advance without raising an excess of false alarms.
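The density-plus-sequential ensemble idea can be sketched on toy log templates: rare templates and unlikely template-to-template transitions both raise the anomaly score. The count-based scores below are an illustrative stand-in for the paper's deep learning models:

```python
from collections import Counter
import numpy as np

def ensemble_anomaly_scores(templates, alpha=0.5):
    """Blend a density score (rare log templates are suspicious) with a
    sequential score (unlikely transitions are suspicious); `alpha`
    weights the two members of the ensemble."""
    counts = Counter(templates)
    n = len(templates)
    density = {t: -np.log(c / n) for t, c in counts.items()}

    trans = Counter(zip(templates, templates[1:]))
    def seq_score(prev, cur):  # add-one smoothed transition surprisal
        return -np.log((trans[(prev, cur)] + 1) / (counts[prev] + len(counts)))

    return [alpha * density[templates[i]]
            + (1 - alpha) * seq_score(templates[i - 1], templates[i])
            for i in range(1, len(templates))]

# a hypothetical log stream: one rare "disk_error" among heartbeats
logs = ["heartbeat"] * 20 + ["disk_error"] + ["heartbeat"] * 5
scores = ensemble_anomaly_scores(logs)
# the rare "disk_error" entry receives by far the highest score
```

A deployed system would replace both members with learned models (e.g. a density estimator and a sequence model over log embeddings) while keeping this blending structure.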

Authors: Sonali Syngal, Sangam Verma, Kandukuri Karthik, Yatin Katyal and Soumyadeep Ghosh

 

 

Others

Pandemic Spread Prediction and Healthcare Preparedness Through Financial and Mobility Data

Venue: Workshop on AI for Public Health in conjunction with the International Conference on Learning Representations (ICLR), 2021

Pandemics like Coronavirus disease 2019 (COVID-19) require governments and health professionals to make time-sensitive, critical decisions about travel restrictions and resource allocation. This paper identifies various factors that affect the spread of the disease using transaction data and proposes a model to predict the degree of spread, and hence the number of medical resources required, in upcoming weeks. We perform a region-wise analysis of these factors to identify control measures that affect the smallest portion of the population. Our model also helps estimate surges in clinical demand and identify when medical resources will become saturated. Using this estimate, we suggest preventive as well as corrective measures to avoid critical situations.

Authors: Nidhi Mulay, Vikas Bishnoi, Himanshi Charotia, Siddhartha Asthana, Gaurav Dhama and Ankur Arora

Time Series Modelling

Deep Learning based Time Series Forecasting (2020)‎

Venue: 19th IEEE International Conference on Machine Learning and Applications (ICMLA), 2020‎

For decision-makers in the forecasting sector, decision processes such as facility planning and optimal day-to-day operation within the domain are complex, with several different levels to be considered. These decisions address widely different time horizons and aspects of the system, making it difficult to model. The advent of deep learning in forecasting removed the need for expensive hand-crafted features and deep domain knowledge. This work aims to give structure to the existing literature on time-series forecasting with deep learning. Based on the underlying structure of each technique, such as RNN, CNN, and Transformer, we categorize various deep learning based time series forecasting techniques and provide a consolidated report. Additionally, we perform experiments to compare these techniques on 4 different publicly available datasets. Finally, based on these experiments, we provide intuitive reasoning behind their performance. We believe this work will help researchers choose relevant techniques for future research.

Authors: Kushagra Agarwal, Lalasa Dheekollu, Gaurav Dhama, Ankur Arora, Siddhartha Asthana and Tanmoy Bhowmik

 

 

Machine learning

Deep Learning Algorithm to Rank-Order Resumes using Discriminative Embedding Space Session Track (2020)‎

Venue: Grace Hopper Celebration India (GHCI), 2020‎‎

Authors: Sonali Syngal and Debasmita Das

 

 

Others

Information Retrieval and Extraction on COVID-19 Clinical Articles using Graph Community Detection and BIO-Bert Embeddings (2020)

Venue: Workshop on NLP for COVID-19 in conjunction with the 58th Annual Meeting of the Association for Computational Linguistics, 2020‎

In this paper, we present an information retrieval system on a corpus of scientific articles related to COVID-19. We build a similarity network on the articles where similarity is determined via shared citations and biological domain-specific sentence embeddings. Ego-splitting community detection on the article network is employed to cluster the articles and then the queries are matched with the clusters. Extractive summarization using BERT and PageRank methods is used to provide responses to the query. We also provide a Question-Answer bot on a small set of intents to demonstrate the efficacy of our model for an information extraction module.
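The PageRank step of the extractive summarizer can be sketched with a plain power iteration over a sentence-similarity matrix. The toy matrix below is a hypothetical stand-in for the BIO-BERT embedding similarities the system uses:

```python
import numpy as np

def pagerank_summarize(sim, k=2, d=0.85, iters=100):
    """Rank sentences by PageRank over a nonnegative similarity matrix
    and return the indices of the top-k most central sentences."""
    n = len(sim)
    M = sim / sim.sum(axis=0, keepdims=True)  # column-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                    # damped power iteration
        r = (1 - d) / n + d * M @ r
    return np.argsort(r)[::-1][:k]

# toy similarities for 4 sentences; sentence 0 is similar to all others
sim = np.array([
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.1, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.7, 0.1, 0.1, 0.0],
], dtype=float)
top = pagerank_summarize(sim)
# sentence 0, the most central one, ranks first
```

The selected sentences form the extractive summary returned for a matched query cluster.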


Authors: Debasmita Das, Yatin Katyal, Janu Verma, Rajesh Kumar Ranjan, Shashank Dubey, Aakash Deep Singh, Sourojit Bhaduri and Kushagra Agarwal

 

 

Machine learning

Word and Graph Embeddings for COVID-19 Retweet Prediction (2020)

Venue: AnalytiCup Workshop in conjunction with the 29th ACM International Conference on Information and Knowledge Management (CIKM), 2020

In this paper, we present our solution to the COVID-19 retweet prediction challenge. The proposed approach consists of feature engineering and modeling. For feature engineering, we leverage both hand-crafted and unsupervised learning features. As the provided data set is large, we implement auto-encoding algorithms to reduce feature dimensionality. To develop predictive models, we utilize ensemble learning and deep learning algorithms, and then combine these models to generate the final blended model. Moreover, to stabilize the predictions, we apply bagging as well as down-sampling techniques that remove tweets where the number of retweets equals zero. Our solution ranked first on the public test set and second on the private test set.
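The blending step reduces to a convex combination of base-model outputs; a minimal sketch with hypothetical predictions from two base models:

```python
import numpy as np

def blend(preds, weights):
    """Blend predictions from several base models (e.g. gradient boosting
    and deep nets) with a convex weight combination; weights are
    normalized so they sum to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.asarray(preds), axes=1)

gbm = np.array([10.0, 0.0, 3.0])  # hypothetical retweet-count predictions
dnn = np.array([12.0, 1.0, 2.0])
final = blend([gbm, dnn], [0.6, 0.4])
# → [10.8, 0.4, 2.6]
```

In competition pipelines the weights are typically chosen on a hold-out fold rather than fixed by hand as here.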

Authors: Tam T. Nguyen, Karamjit Singh, Sangam Verma, Hardik Wadhwa, Siddharth Vimal, Lalasa Dheekollu, Sheng Jie Lui, Divyansh Gupta, Dong Yang Jin and Zha Wei

 

 

Fraud Detection

Limitations and Applicability of GANs in Banking Domain‎

Venue: Workshop on Applied Deep Generative Networks in conjunction with the 24th European Conference on Artificial Intelligence (ECAI), 2020 

Threats due to payment-related fraud are always a primary concern for financial institutions (FIs), often leading to huge losses and impacting the consumer experience. To combat emerging fraud and improve system robustness, FIs need an efficient system to detect fraud while authorizing payments. The biggest challenge in developing a fraud detection system is the high degree of class imbalance between fraudulent and legitimate transactions. Recently, Generative Adversarial Networks (GANs) have been employed as an oversampling technique to augment the dataset with synthetic minority samples. In this paper, we present a systematic study of training GANs for synthetic fraud generation, demonstrating improved classifier performance in detecting fraud. GANs are trained in various settings, including the min-max objective alone and with an auxiliary loss discriminating synthetic fraud and real fraud from non-fraud samples; the auxiliary loss is obtained using contrastive loss or triplet loss. The quality of the trained GANs is estimated by evaluating the lift in classifier performance when trained on a dataset augmented with synthetic fraud. Further, we study the effect of Discriminator Rejection Sampling (DRS) on the selection of synthetic samples used for training data augmentation. The performance comparison of the different settings proposed in this study is evaluated on a publicly available credit-card dataset and shows an absolute improvement of up to 6% in recall and 3% in precision. We hope this paper advances the applicability of GANs with practical insight into the research done on this topic so far and opens doors to interesting future research directions.
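Discriminator Rejection Sampling keeps a synthetic sample with a probability derived from its discriminator score; a simplified sketch (sigmoid of the logit shifted by the batch maximum, not the exact DRS correction from the literature):

```python
import numpy as np

def drs_accept_prob(logits, gamma=0.0):
    """Simplified DRS-style acceptance probability: synthetic samples the
    discriminator scores as most realistic (high logit) are kept with the
    highest probability; `gamma` trades acceptance rate against quality."""
    return 1.0 / (1.0 + np.exp(-(logits - logits.max() - gamma)))

# hypothetical discriminator logits for three synthetic fraud samples
p = drs_accept_prob(np.array([3.0, -2.0, 2.5]))
# the best-scored sample gets probability 0.5 (sigmoid of 0); the
# low-scored sample is almost always rejected
```

Samples would then be kept with these probabilities before being added to the classifier's augmented training set.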



Authors: Anubha Pandey, Deepak Bhatt and Tanmoy Bhowmik
