Ayan Sengupta

I am a first-year PhD student at IIT Delhi, where I work in the Laboratory for Computational Social Systems (LCS2) led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar. My primary areas of research include understanding the generalization capabilities of pre-trained language models and representation learning for code-mixed and low-resource languages.

My other areas of interest are Bayesian meta-learning, representation learning from knowledge graphs, graph neural networks, and generative NLP.

I completed my Masters in Business Analytics and Data Science (PGDBA) at the Indian Institute of Management, Calcutta. Prior to that, I completed my Masters of Science in Mathematics and Computing at the Indian Institute of Technology, Guwahati in 2015 and my Bachelor of Science in Mathematics and Computer Science at Chennai Mathematical Institute in 2013.

Email  /  CV  /  GitHub  /  LinkedIn  /  Kaggle  /  Topcoder  /  Scholar

Recent Highlights
  • [Dec 2023] Attending EMNLP 2023. See you in Singapore.
  • [Oct 2023] Paper titled "Manifold-Preserving Transformers are Effective for Short-Long Range Encoding" accepted at EMNLP (findings) 2023.
  • [Dec 2022] Invited talk on Code-mixed representation learning at Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP) collocated with ICON 2022.
  • [Dec 2022] Joined the PhD program at IIT Delhi in the Electrical Engineering Dept.
  • [Nov 2022] Publication of patent ID US11494565B2 titled "Natural Language Processing Techniques using Joint Sentiment-Topic Modeling".
Education
PhD, Computer Science
Indian Institute of Technology, Delhi
2023-Present

Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar

PhD, Computer Science
Indraprastha Institute of Information Technology, Delhi
2021-2022

Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar

Courses: Machine Learning, Natural Language Processing, Social Network Analysis, Data Mining, Artificial Intelligence, Bayesian Machine Learning

MS, Business Analytics and Data Science
Indian Institute of Management, Calcutta (jointly with Indian Institute of Technology, Kharagpur and Indian Statistical Institute, Kolkata)
2015-2017

Courses: Algorithms, Machine Learning, Multivariate Analysis, Complex Networks, Information Retrieval, Econometrics, Statistical Inference

Masters of Science, Mathematics and Computing
Indian Institute of Technology, Guwahati
2013-2015

Advised by Prof. Anupam Saikia for masters thesis on Mathematics of Elliptic Curves and its application in Cryptography

Courses: Algorithms, Logic Programming (Prolog, Introduction to AI), Probability Theory, Numerical Analysis, Optimization

Bachelor of Science, Mathematics and Computer Science
Chennai Mathematical Institute
2010-2013

Courses: Mathematical Logic, Algorithm Design, Game Theory, Theory of Computation, Programming

Selected Publications (* denotes equal contribution)
Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2023 | Association for Computational Linguistics
Paper | Code

Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.

Does aggression lead to hate? Detecting and reasoning offensive traits in hinglish code-mixed texts
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Md Shad Akhtar, Tanmoy Chakraborty
Neurocomputing | Elsevier
Paper | Code

Aggression is a prominent trait of human beings that can affect social harmony in a negative way. The hate mongers misuse the freedom of speech in social media platforms to flood with their venomous comments in many forms. Identifying different traits of online offense is thus inevitable and the need of the hour. Existing studies usually handle one or two offense traits at a time, mainly due to the lack of a combined annotated dataset and a scientific study that provides insights into the relationship among the traits. In this paper, we study the relationship among five offense traits – aggression, hate, sarcasm, humor, and stance in Hinglish (Hindi-English) social media code-mixed texts. We employ various state-of-the-art deep learning systems at different morphological granularities for the classification across five offense traits. Our evaluation of the unified framework suggests performance across all major traits. Furthermore, we propose a novel notion of causal importance score to quantify the effect of different abusive keywords and the overall context on the offensiveness of the texts.

HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Tanmoy Chakraborty, Md Shad Akhtar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 | Association for Computational Linguistics
Paper | Code

Understanding linguistics and morphology of resource-scarce code-mixed texts remains a key challenge in text processing. Although word embedding comes in handy to support downstream tasks for low-resource languages, there is plenty of scope for improving the quality of language representation, particularly for code-mixed languages. In this paper, we propose HIT, a robust representation learning method for code-mixed texts. HIT is a hierarchical transformer-based framework that captures the semantic relationship among words and hierarchically learns the sentence-level semantics using a fused attention mechanism. HIT incorporates two attention modules, a multi-headed self-attention and an outer product attention module, and computes their weighted sum to obtain the attention weights. Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages across four NLP tasks on eleven datasets suggests significant performance improvement against various state-of-the-art systems. We further show the adaptability of learned representation across tasks in a transfer learning setup (with and without fine-tuning).

Gated Transformer for Robust De-noised Sequence-to-Sequence Modelling
Ayan Sengupta, Amit Kumar, Sourabh Kumar Bhattacharjee, Suman Roy
Findings of the Association for Computational Linguistics: EMNLP 2021 | Association for Computational Linguistics
Paper | Code

Robust sequence-to-sequence modelling is an essential task in the real world where the inputs are often noisy. Both user-generated and machine-generated inputs contain various kinds of noise in the form of spelling mistakes, grammatical errors, and character recognition errors, all of which impact downstream tasks and affect interpretability of texts. In this work, we devise a novel sequence-to-sequence architecture for detecting and correcting different real-world and artificial noises (adversarial attacks) from English texts. Towards that, we propose a modified Transformer-based encoder-decoder architecture that uses a gating mechanism to detect the types of corrections required and corrects texts accordingly. Experimental results show that our gated architecture with pre-trained language models performs significantly better than the non-gated counterparts and other state-of-the-art error correction models in correcting spelling and grammatical errors. Extrinsic evaluation of our model on Machine Translation (MT) and Summarization tasks shows the competitive performance of the model against other generative sequence-to-sequence models under noisy inputs.

An Embedding-based Joint Sentiment-Topic Model for Short Texts
Ayan Sengupta*, William Scott Paka*, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty
Vol. 15 (2021): Fifteenth International AAAI Conference on Web and Social Media
Paper | Code

Short text is a popular avenue of sharing feedback, opinions and reviews on social media, e-commerce platforms, etc. Many companies need to extract meaningful information (which may include thematic content as well as semantic polarity) out of such short texts to understand users’ behaviour. However, obtaining high-quality sentiment-associated and human-interpretable themes still remains a challenge for short texts. In this paper we develop ELJST, an embedding-enhanced generative joint sentiment-topic model that can discover more coherent and diverse topics from short texts. It uses a Markov Random Field regularizer that can be seen as a generalisation of skip-gram based models. Further, it can leverage higher-order semantic information appearing in word embedding, such as self-attention weights in graphical models. Our results show an average improvement of 10% in topic coherence and 5% in topic diversification over baselines. Finally, ELJST helps understand users' behaviour at more granular levels, in a way that can be explained. All of this can bring significant value to service and healthcare industries, which often deal with customers.

A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks
Ayan Sengupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020) | Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Paper | Code

This document describes the system developed by team datamafia at WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. This paper contains a thorough study of pre-trained language models on a downstream binary classification task over noisy user-generated Twitter data. The solution submitted to the final test leaderboard is a fine-tuned RoBERTa model which achieves F1 scores of 90.8% and 89.4% on the dev and test data respectively. In the later part, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding regularization to the pre-trained RoBERTa model can make it very robust to data and annotation noise and can improve overall performance by more than 1.2%.

Corporate Projects
Multi-modal Disease Prediction
Dec 2021 - Dec 2022
Ayan Sengupta, Amit Kumar, Suman Roy

Built a deep learning-based multi-task model for disease prediction. We utilized clinical knowledge graphs and tempo-spatial representations for encounter-level predictions of diagnoses and medications.

Structured Information Extraction from Scanned Medical Charts
Nov 2019 - Oct 2021

Built an information extraction engine to extract structured information from noisy scanned images of handwritten medical charts. Our solution uses a multi-modal system to first detect the type of chart page, given the image and OCR text. We then use sequential and transformer models to detect named entities from noisy word tokens and parse the text into key-value pairs. The solution can be used to extract demographic information as well as key medical conditions from unstructured texts.

Voice of Customer Analytics
Jan 2019 - May 2020
Code1 | Code2
Ayan Sengupta, William Scott Paka, Vijay Malladi Varma, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty

In this project, we developed an end-to-end system that extracts meaningful and actionable pain points from user-generated texts such as feedback, complaints, social media feeds, and telecommunications. Our solution extracts customers' intent at the topic level and prioritizes critical issues in an unsupervised manner. We use probabilistic generative models based on Latent Dirichlet Allocation (LDA) to detect joint topic-sentiments from texts. Further, we use statistical measures to quantify information content and assign a priority score to each complaint so that the downstream CRM teams can be assigned accordingly. We also developed an NMF (Non-Negative Matrix Factorization) based online version of our solution to work with high-velocity streaming data.

Other Research Projects
Code-Mixed Sentiment Classification (SemEval 2020 Task 9 and FIRE 2020)
Jan 2020 - Sep 2020
Code1 | Code2 | Link1 | Link2

Code-mixing is very common on social media. In this project, we learned textual representations from Hinglish, Tamil-English and Malayalam-English texts and classified their sentiment. We learned unsupervised word embeddings (word2vec, fastText) and contextual embeddings using BERT, and used sequential attention models and BERT to classify tweet sentiment. We further used byte-pair encoding (BPE) and learned representations with custom transformers trained from scratch.

PharmaCoNER: Named Entity Recognition from Biomedical Texts (BioNLP 2019 Task 2)
May 2019 - Jul 2019
Code | Link

In this project, we developed a named entity recognition system to detect pharmaceutical entities in a Spanish clinical corpus. We explored different embedding techniques along with sequential and convolutional models with a conditional random field (CRF) for classifying entities and detecting entity offsets.

Data Science Competitions
2023 Kaggle AI Report
Code

In this challenge, participants were asked to write an essay on one of seven chosen topics from the field of AI, describing what the community has learned over the past 2 years of working and experimenting with that topic.

Kaggle: Jigsaw Multilingual Toxic Comment Classification Challenge 2020
Code | Link

In this challenge, the objective is to classify whether a comment is toxic (or abusive). The dataset contains multilingual comments from different Wikipedia talk pages. We experimented with various multilingual transformer models, Universal Sentence Encoder (USE) and their ensembles.

Final private leaderboard rank achieved 156 (Bronze medal)

Kaggle: Google QUEST Q&A Labeling Challenge 2020
Code | Link

In this challenge, the objective is to predict different subjective aspects of question-answer pairs gathered from different StackExchange properties. We explored BERT models and hand-picked feature engineering to develop a robust model that can predict the subjectivity metrics accurately.

Final private leaderboard rank achieved 116 (Bronze medal)

Kaggle: Bengali.AI Handwritten Grapheme Classification Challenge 2020
Code | Link

Experimented with EfficientNet models with different CNN heads (SSE, GEM pooling) on GPU/TPU to classify handwritten Bengali alphabets and their constituents (diacritics).

Final private leaderboard rank achieved 233 (out of 2000 teams)

CrowdANALYTIX: Gamma Log Facies Type Prediction Challenge 2019
Code | Link

Given an array of GR (Gamma-Ray) values, accurately predict the log facies type corresponding to each value. Solution includes stacking of several seq2seq models with attention. Achieved overall 96.8% accuracy.

Final private leaderboard rank achieved 26 (out of 350 teams)

Codalab: Fine-Grained Classification of Objects from Aerial Imagery Challenge 2019
Code | Link | Blog

The competition focused on fine-grained classification of objects from aerial imagery. We used different image augmentation techniques along with different image classification models (MobileNet, ResNet, InceptionNet) and achieved an average precision of 55% on the test data.

Final private leaderboard rank achieved 16 (out of 50 teams)

Data Science Game 2016: Online selection
Bodhisattwa Prasad Majumder, Robin Singh, Ayan Sengupta, Jayanta Mandi
Code | Link1 | Link2

The challenge was to classify the orientation of building roofs using satellite images of rooftops. We used different image augmentation techniques along with a VGG network and achieved 82% accuracy on the test dataset.

Final private leaderboard rank achieved 22 (out of 110 teams) and got selected for Finals of Data Science Game 2016

Development Projects
consNLP - An NLP toolkit for text data Exploration, Visualization and Modeling
Code

A consolidated NLP toolkit for text data analysis and modeling. It supports training as well as inference for tasks such as tokenization, lemmatization, binary/multiclass classification, sequence classification, Question-Answering, and Natural Language Inference (NLI). It supports CPU/GPU/TPU platforms.

jointtsmodel - Python package for Probabilistic Joint Topic-Sentiment Models
Code

Joint topic-sentiment models, a.k.a. aspect rating models, can extract thematic representations from texts at a granular level. In customer relationship management, it is of utmost importance to understand customers' pain points from their feedback, complaints and other data sources. By leveraging both sentiment and thematic information, JST-based models can understand intent at the overall level as well as at the theme level, which makes them perfect for analyzing the voice of the customer.

This package contains implementations of various probabilistic generative JST models from the literature.

Slack bot for Online Product Selection
Code

This is a simple Slack bot that reads the required product description from the user, parses the input and shows relevant products from Amazon.

Awards
  • [2020] Became Kaggle Competition Expert
  • [2016] Finalist, Data Science Game '16, Paris; Represented India (1 out of 3 teams), International Rank 14
  • [2009] 7-year KVPY fellowship for pursuing undergraduate and graduate research in basic sciences. Fellowship funded by the Department of Science and Technology (Govt. of India)
Invited Talks and Trainings
  • [2022] Invited talk on Code-mixed representation learning at Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP) collocated with ICON 2022
  • [2019] Student talk at IIIT Delhi (course taught by Prof. Tanmoy Chakraborty) on Probabilistic Generative Models
  • [2015] Research Scholar's Seminar at IIT Guwahati on Existence of Algebraic Closure of a field
Voluntary Services
  • PC member of ACL 2023
  • Reviewer at Journal of Intelligent & Fuzzy Systems
  • Reviewer at International AAAI Conference on Web and Social Media (ICWSM) 2021
  • Reviewer at Workshop on Noisy User-generated Text (W-NUT) 2020
  • PC member of ECML-PKDD 2020 in Applied Data Science Track
  • Writer and member of Topcoder Thrive community
Extra
  • [2019] Passed Grade 1 in Rock & Pop Bass, awarded by Trinity College London 🤘
  • [2019] Performed at Rockathon organized by Noida School of Rock 🤘🤘
  • [2016] Passed A1 in French by CdA Global Language Centre

Thanks to Jon Barron and Bodhisattwa P. Majumder for this nice template.