Ayan Sengupta

I am a third-year PhD student at IIT Delhi, where I work in the Laboratory for Computational Social Systems (LCS2) led by Dr. Tanmoy Chakraborty. My primary research area is understanding the generalization capabilities of small and efficient language models. I also actively work on designing efficient methods for compressing large language models for adoption in resource-constrained settings.

I am also a master inventor at UnitedHealth Group with more than 7 granted patents (and 10 additional applications submitted to the US patent office). These innovations address critical challenges in the healthcare sector, including rare disease prediction and member care delivery.

I completed my Masters in Business Analytics and Data Science (PGDBA) from the Indian Institute of Management, Calcutta. Prior to that, I completed my Masters of Science in Mathematics and Computing from the Indian Institute of Technology, Guwahati in 2015 and my Bachelor of Science in Mathematics and Computer Science from Chennai Mathematical Institute in 2013.

Email / GitHub / LinkedIn / Kaggle / Topcoder / Scholar

Recent News · Education · Selected Publications · Granted Patents · Data Science Competitions · Extra

Recent Highlights

[May 2025] Our paper on downscaling was accepted at the International Conference on Machine Learning (ICML 2025) in the Position track.
[May 2025] Our paper on the generalization vs. fidelity paradox of knowledge distillation was accepted to the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL).
[Feb 2025] Our paper on calibration-free model compression was accepted at the International Conference on Learning Representations (ICLR 2025).

Education

PhD, Computer Science
Indian Institute of Technology, Delhi
2023-Present

Advisor: Dr. Tanmoy Chakraborty

PhD, Computer Science
Indraprastha Institute of Information Technology, Delhi
2021-2022

Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar
Courses: Machine Learning, Natural Language Processing, Social Network Analysis, Data Mining, Artificial Intelligence, Bayesian Machine Learning

MS, Business Analytics and Data Science
Indian Institute of Management, Calcutta (jointly with Indian Institute of Technology, Kharagpur and Indian Statistical Institute, Kolkata)
2015-2017

Courses: Algorithms, Machine Learning, Multivariate Analysis, Complex Networks, Information Retrieval, Econometrics, Statistical Inference

Masters of Science, Mathematics and Computing
Indian Institute of Technology, Guwahati
2013-2015

Advised by Prof. Anupam Saikia for masters thesis on Mathematics of Elliptic Curves and its application in Cryptography
Courses: Algorithms, Logic Programming (Prolog, Introduction to AI), Probability Theory, Numerical Analysis, Optimization

Bachelor of Science, Mathematics and Computer Science
Chennai Mathematical Institute
2010-2013

Courses: Mathematical Logic, Algorithm Design, Game Theory, Theory of Computation, Programming

Selected Publications (* denotes equal contribution)

Position: Enough of Scaling LLMs! Lets Focus on Downscaling
Yash Goel*, Ayan Sengupta*, Tanmoy Chakraborty
International Conference on Machine Learning (ICML) 2025
Paper | Code
We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.

You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning
Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
International Conference on Learning Representations (ICLR) 2025
Openreview | Code
The ever-increasing size of large language models (LLMs) presents significant challenges for deployment due to their heavy computational and memory requirements. Current model pruning techniques attempt to alleviate these issues by relying heavily on external calibration datasets to determine which parameters to prune or compress, thus limiting their flexibility and scalability across different compression ratios. Moreover, these methods often cause severe performance degradation, particularly in downstream tasks, when subjected to higher compression rates. In this paper, we propose PruneNet, a novel model compression method that addresses these limitations by reformulating model pruning as a policy learning process. PruneNet decouples the pruning process from the model architecture, eliminating the need for calibration datasets. It learns a stochastic pruning policy to assess parameter importance solely based on intrinsic model properties while preserving the spectral structure to minimize information loss. PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model, proving itself a superior alternative to conventional structured compression techniques.

A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation
Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
International Conference on Learning Representations (ICLR) 2024
Openreview | Code
Knowledge distillation (KD) is a technique used to transfer knowledge from a larger “teacher” model into a smaller “student” model. Recent advancements in meta-learning-based knowledge distillation (MetaKD) emphasize that the fine-tuning of teacher models should be aware of the student’s need to achieve better knowledge distillation. However, existing MetaKD methods often lack incentives for the teacher model to improve itself. In this study, we introduce MPDistil, a meta-policy distillation technique that utilizes novel optimization strategies to foster both collaboration and competition during the fine-tuning of the teacher model in the meta-learning step. Additionally, we propose a curriculum learning framework for the student model in a competitive setup, in which the student model aims to outperform the teacher model by self-training on various tasks. Exhaustive experiments on SuperGLUE and GLUE benchmarks demonstrate the efficacy of MPDistil compared to 20 conventional KD and advanced MetaKD baselines, showing significant performance enhancements in the student model; for example, a distilled 6-layer BERT model outperforms a 12-layer BERT model on five out of six SuperGLUE tasks. Furthermore, MPDistil, while applied to a large language teacher model (DeBERTa-v2-xxlarge), significantly narrows the performance gap of its smaller student counterpart (DeBERTa-12) by just 4.6% on SuperGLUE. We further demonstrate how higher rewards and customized training curricula strengthen the student model and enhance generalizability.

Persona-aware Generative Model for Code-mixed Language
Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
Transactions on Machine Learning Research (TMLR) 2024
Openreview | Code
Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user’s preference toward code-mixing depends on the socioeconomic status, demographics of the user, and the local context, which existing generative models tend to ignore while generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals. We propose PARADOX, a persona-aware generative model for code-mixed text generation, which is a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user’s persona and generates code-mixed texts without monolingual reference data. We propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics: CM BLEU, CM Rouge-1, CM Rouge-L and CM KS. On average, PARADOX achieves 1.6% better CM BLEU, 57% better perplexity and 32% better semantic coherence than the non-persona-based counterparts.

Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2023
Paper | Code
Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.

Does aggression lead to hate? Detecting and reasoning offensive traits in hinglish code-mixed texts
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Md Shad Akhtar, Tanmoy Chakraborty
Neurocomputing | Elsevier
Paper | Code
Aggression is a prominent trait of human beings that can affect social harmony in a negative way. The hate mongers misuse the freedom of speech in social media platforms to flood with their venomous comments in many forms. Identifying different traits of online offense is thus inevitable and the need of the hour. Existing studies usually handle one or two offense traits at a time, mainly due to the lack of a combined annotated dataset and a scientific study that provides insights into the relationship among the traits. In this paper, we study the relationship among five offense traits (aggression, hate, sarcasm, humor, and stance) in Hinglish (Hindi-English) social media code-mixed texts. We employ various state-of-the-art deep learning systems at different morphological granularities for the classification across five offense traits. Our evaluation of the unified framework suggests performance across all major traits. Furthermore, we propose a novel notion of causal importance score to quantify the effect of different abusive keywords and the overall context on the offensiveness of the texts.

HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Tanmoy Chakraborty, Md Shad Akhtar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Paper | Code
Understanding linguistics and morphology of resource-scarce code-mixed texts remains a key challenge in text processing. Although word embedding comes in handy to support downstream tasks for low-resource languages, there is plenty of scope for improving the quality of language representation, particularly for code-mixed languages. In this paper, we propose HIT, a robust representation learning method for code-mixed texts. HIT is a hierarchical transformer-based framework that captures the semantic relationship among words and hierarchically learns the sentence-level semantics using a fused attention mechanism. HIT incorporates two attention modules, a multi-headed self-attention and an outer product attention module, and computes their weighted sum to obtain the attention weights. Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages across four NLP tasks on eleven datasets suggests significant performance improvement against various state-of-the-art systems. We further show the adaptability of learned representation across tasks in a transfer learning setup (with and without fine-tuning).

Gated Transformer for Robust De-noised Sequence-to-Sequence Modelling
Ayan Sengupta, Amit Kumar, Sourabh Kumar Bhattacharjee, Suman Roy
Findings of the Association for Computational Linguistics: EMNLP 2021
Paper | Code
Robust sequence-to-sequence modelling is an essential task in the real world where the inputs are often noisy. Both user-generated and machine-generated inputs contain various kinds of noise in the form of spelling mistakes, grammatical errors, and character recognition errors, all of which impact downstream tasks and affect interpretability of texts. In this work, we devise a novel sequence-to-sequence architecture for detecting and correcting different real-world and artificial noises (adversarial attacks) from English texts. Towards that, we propose a modified Transformer-based encoder-decoder architecture that uses a gating mechanism to detect the types of corrections required and corrects texts accordingly. Experimental results show that our gated architecture with pre-trained language models performs significantly better than the non-gated counterparts and other state-of-the-art error correction models in correcting spelling and grammatical errors. Extrinsic evaluation of our model on Machine Translation (MT) and Summarization tasks shows the competitive performance of the model against other generative sequence-to-sequence models under noisy inputs.

An Embedding-based Joint Sentiment-Topic Model for Short Texts
Ayan Sengupta, William Scott Paka, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty
Vol. 15 (2021): Fifteenth International AAAI Conference on Web and Social Media
Paper | Code
Short text is a popular avenue of sharing feedback, opinions and reviews on social media, e-commerce platforms, etc. Many companies need to extract meaningful information (which may include thematic content as well as semantic polarity) out of such short texts to understand users’ behaviour. However, obtaining high-quality sentiment-associated and human-interpretable themes still remains a challenge for short texts. In this paper we develop ELJST, an embedding-enhanced generative joint sentiment-topic model that can discover more coherent and diverse topics from short texts. It uses a Markov Random Field regularizer that can be seen as a generalisation of skip-gram based models. Further, it can leverage higher-order semantic information appearing in word embedding, such as self-attention weights in graphical models. Our results show an average improvement of 10% in topic coherence and 5% in topic diversification over baselines. Finally, ELJST helps understand users' behaviour at more granular levels which can be explained. All these can bring significant value to service and healthcare industries often dealing with customers.

A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks
Ayan Sengupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020) | Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Paper | Code
This document describes the system developed by team datamafia at WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. This paper contains a thorough study of pre-trained language models on a downstream binary classification task over noisy user-generated Twitter data. The solution submitted to the final test leaderboard is a fine-tuned RoBERTa model which achieves F1 scores of 90.8% and 89.4% on the dev and test data respectively. In the later part, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding regularizations to the RoBERTa pre-trained model can make it very robust to data and annotation noise and can improve overall performance by more than 1.2%.

Granted Patents

US12112132: NATURAL LANGUAGE PROCESSING MACHINE LEARNING FRAMEWORKS TRAINED USING MULTI-TASK TRAINING ROUTINES
US11210818: SUPERVISED AND UNSUPERVISED MACHINE LEARNING TECHNIQUES FOR COMMUNICATION SUMMARIZATION
US12229512: SIGNIFICANCE-BASED PREDICTION FROM UNSTRUCTURED TEXT
US11698934: GRAPH EMBEDDING BASED PARAGRAPH VECTOR MACHINE LEARNING MODELS
US11494565: NATURAL LANGUAGE PROCESSING TECHNIQUES USING JOINT SENTIMENT-TOPIC MODELING
US12008321: NATURAL LANGUAGE PROCESSING TECHNIQUES FOR SEQUENTIAL TOPIC MODELING
US11068666: NATURAL LANGUAGE PROCESSING USING JOINT SENTIMENT-TOPIC MODELING

Data Science Competitions

2023 Kaggle AI Report
Code
In this challenge, participants were asked to write an essay on one of seven chosen topics from the field of AI, describing what the community has learned over the past 2 years of working and experimenting with it.

Kaggle: Jigsaw Multilingual Toxic Comment Classification Challenge 2020
Code | Link
In this challenge, the objective is to classify whether a comment is toxic (or abusive). The dataset contains multilingual comments from different Wikipedia talk pages. We experimented with various multilingual transformer models, Universal Sentence Encoder (USE) and their ensembles. Final private leaderboard rank: 156 (Bronze medal).

Kaggle: Google QUEST Q&A Labeling Challenge 2020
Code | Link
In this challenge, the objective is to predict different subjective aspects of question-answering gathered from different StackExchange properties. In this competition, we explored the BERT model and hand-picked feature engineering to develop a robust model that can predict the subjectivity metrics accurately. Final private leaderboard rank: 116 (Bronze medal).

Kaggle: Bengali.AI Handwritten Grapheme Classification Challenge 2020
Code | Link
Experimented with EfficientNet models with different CNN heads (SSE, GEM pooling) on GPU/TPU to classify handwritten Bengali alphabets and their constituents (diacritics). Final private leaderboard rank: 233 (out of 2000 teams).

CrowdANALYTIX: Gamma Log Facies Type Prediction Challenge 2019
Code | Link
Given an array of GR (Gamma-Ray) values, accurately predict the log facies type corresponding to each value. The solution includes stacking of several seq2seq models with attention. Achieved overall 96.8% accuracy. Final private leaderboard rank: 26 (out of 350 teams).

Data Science Game 2016: Online selection
Bodhisattwa Prasad Majumder, Robin Singh, Ayan Sengupta, Jayanta Mandi
Code | Link1 | Link2
The challenge was to classify the orientation of building roofs using satellite images of rooftops. We used different image augmentation techniques along with a VGG network and achieved 82% accuracy on the test dataset. Final private leaderboard rank: 22 (out of 110 teams); selected for the Finals of Data Science Game 2016.

Development Projects

jointtsmodel - Python package for Probabilistic Joint Topic-Sentiment Models
Code

Joint topic-sentiment models, a.k.a. aspect rating models, can extract thematic representations from texts at a granular level. In customer relationship management, it is of utmost importance to understand customers' pain points from their feedback, complaint data and other data sources. By leveraging both sentiment and thematic information, JST-based models can understand intent at the overall level as well as at the theme level, which makes them well suited for analyzing the voice of the customer.

This package contains implementations of various probabilistic generative JST models from the literature.

Extra

[2019] Passed Grade 1 in Rock & Pop Bass, awarded by Trinity College London 🤘
[2019] Performed at Rockathon organized by Noida School of Rock 🤘🤘
[2016] Passed A1 in French by CdA Global Language Centre

Thanks to Jon Barron and Bodhisattwa P. Majumder for this nice template.