Recent Highlights
- [Dec 2023] Attending EMNLP 2023. See you in Singapore.
- [Oct 2023] Paper titled "Manifold-Preserving Transformers are Effective for Short-Long Range Encoding" accepted at Findings of EMNLP 2023.
- [Dec 2022] Invited talk on Code-mixed representation learning at the Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP), co-located with ICON 2022.
- [Dec 2022] Joined the PhD program at IIT Delhi in the Electrical Engineering Department.
- [Nov 2022] Patent US11494565B2, titled "Natural Language Processing Techniques using Joint Sentiment-Topic Modeling", published.
|
|
PhD, Computer Science
Indian Institute of Technology, Delhi
2023-Present
Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar
|
|
PhD, Computer Science
Indraprastha Institute of Information Technology, Delhi
2021-2022
Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar
Courses: Machine Learning, Natural Language Processing, Social Network Analysis, Data Mining, Artificial Intelligence, Bayesian Machine Learning
|
|
MS, Business Analytics and Data Science
Indian Institute of Management, Calcutta (jointly with Indian Institute of Technology, Kharagpur and Indian Statistical Institute, Kolkata)
2015-2017
Courses: Algorithms, Machine Learning, Multivariate Analysis, Complex Networks, Information Retrieval, Econometrics, Statistical Inference
|
|
Master of Science, Mathematics and Computing
Indian Institute of Technology, Guwahati
2013-2015
Advised by Prof. Anupam Saikia for master's thesis on the mathematics of elliptic curves and their applications in cryptography
Courses: Algorithms, Logic Programming (Prolog, Introduction to AI), Probability Theory, Numerical Analysis, Optimization
|
|
Bachelor of Science, Mathematics and Computer Science
Chennai Mathematical Institute
2010-2013
Courses: Mathematical Logic, Algorithm Design, Game Theory, Theory of Computation, Programming
|
Selected Publications
(* denotes equal contribution)
|
|
Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2023 | Association for Computational Linguistics
Paper | Code
Multi-head self-attention-based Transformers have shown promise in different learning tasks. Although these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings that transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was conceived to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and in an unordered fashion. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.
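The layer-wise distance preservation that TransJect guarantees can be pictured with orthogonal maps: an orthogonal linear transform is an isometry of Euclidean space, so pairwise token distances survive the projection. Below is a minimal PyTorch sketch of that property (a toy illustration, not the TransJect architecture itself).

```python
# Toy sketch: an orthogonally parametrized linear layer is an isometry,
# so it preserves pairwise Euclidean distances between token vectors.
# This illustrates the distance-preservation idea, not TransJect itself.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d_model = 64
layer = orthogonal(nn.Linear(d_model, d_model, bias=False))  # weight kept orthogonal

tokens = torch.randn(10, d_model)        # 10 token representations
projected = layer(tokens)

before = torch.cdist(tokens, tokens)     # pairwise distances before projection
after = torch.cdist(projected, projected)
print(torch.allclose(before, after, atol=1e-5))  # True: distances preserved
```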
|
|
Does aggression lead to hate? Detecting and reasoning offensive traits in hinglish code-mixed texts
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Md Shad Akhtar, Tanmoy Chakraborty
Neurocomputing | Elsevier
Paper | Code
Aggression is a prominent trait of human beings that can affect social harmony in a negative way. Hate mongers misuse the freedom of speech on social media platforms to flood them with venomous comments in many forms. Identifying different traits of online offense is thus essential and the need of the hour. Existing studies usually handle one or two offense traits at a time, mainly due to the lack of a combined annotated dataset and a scientific study that provides insights into the relationship among the traits. In this paper, we study the relationship among five offense traits – aggression, hate, sarcasm, humor, and stance – in Hinglish (Hindi-English) social media code-mixed texts. We employ various state-of-the-art deep learning systems at different morphological granularities for classification across the five offense traits. Our evaluation suggests the effectiveness of the unified framework across all major traits. Furthermore, we propose a novel notion of a causal importance score to quantify the effect of different abusive keywords and the overall context on the offensiveness of the texts.
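A simple stand-in for such a keyword-level score is occlusion: mask a token and measure the drop in the predicted offensiveness probability. The sketch below illustrates that idea with a placeholder scoring function; the paper's causal importance score is defined differently.

```python
# Hedged sketch: occlusion-based keyword importance. `offensive_prob` is a
# placeholder for any model mapping a text to P(offensive); the paper's
# causal importance score is more involved than this simple ablation.
from typing import Callable, List

def keyword_importance(tokens: List[str],
                       offensive_prob: Callable[[List[str]], float],
                       mask: str = "[MASK]") -> List[tuple]:
    base = offensive_prob(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        # Importance = probability drop when the token is masked out.
        scores.append((tok, base - offensive_prob(occluded)))
    return sorted(scores, key=lambda s: -s[1])

# Toy scorer: counts lexicon hits; in practice this would be a classifier.
LEXICON = {"badword"}
toy_prob = lambda toks: sum(t in LEXICON for t in toks) / max(len(toks), 1)
print(keyword_importance("you are a badword".split(), toy_prob))
```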
|
|
HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Tanmoy Chakraborty, Md Shad Akhtar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 | Association for Computational Linguistics
Paper | Code
Understanding the linguistics and morphology of resource-scarce code-mixed texts remains a key challenge in text processing. Although word embeddings come in handy to support downstream tasks for low-resource languages, there is plenty of scope for improving the quality of language representation, particularly for code-mixed languages. In this paper, we propose HIT, a robust representation learning method for code-mixed texts. HIT is a hierarchical transformer-based framework that captures the semantic relationship among words and hierarchically learns the sentence-level semantics using a fused attention mechanism. HIT incorporates two attention modules, a multi-headed self-attention and an outer-product attention module, and computes their weighted sum to obtain the attention weights. Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages across four NLP tasks on eleven datasets suggests significant performance improvements over various state-of-the-art systems. We further show the adaptability of the learned representations across tasks in a transfer learning setup (with and without fine-tuning).
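The fused attention can be pictured as a learnable combination of two attention maps. The sketch below is a simplified stand-in: the outer-product branch here is illustrative only, so refer to the paper for HIT's exact formulation.

```python
# Simplified sketch of a fused attention block: a learnable convex
# combination of two attention modules. The second (outer-product) branch
# is a stand-in, not HIT's exact outer-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # fusion weight
        self.scale = d_model ** -0.5

    def forward(self, x):                        # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dot = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        outer = F.softmax(torch.tanh(q) @ torch.tanh(k).transpose(-2, -1), dim=-1)
        attn = self.alpha * dot + (1 - self.alpha) * outer  # weighted sum
        return attn @ v

x = torch.randn(2, 8, 32)
print(FusedAttention(32)(x).shape)  # torch.Size([2, 8, 32])
```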
|
|
Gated Transformer for Robust De-noised Sequence-to-Sequence Modelling
Ayan Sengupta, Amit Kumar, Sourabh Kumar Bhattacharjee, Suman Roy
Findings of the Association for Computational Linguistics: EMNLP 2021 | Association for Computational Linguistics
Paper | Code
Robust sequence-to-sequence modelling is an essential task in the real world, where inputs are often noisy. Both user-generated and machine-generated inputs contain various kinds of noise in the form of spelling mistakes, grammatical errors, and character recognition errors, all of which impact downstream tasks and affect the interpretability of texts. In this work, we devise a novel sequence-to-sequence architecture for detecting and correcting different real-world and artificial noises (adversarial attacks) in English texts. Towards that, we propose a modified Transformer-based encoder-decoder architecture that uses a gating mechanism to detect the types of corrections required and accordingly corrects the texts. Experimental results show that our gated architecture with pre-trained language models performs significantly better than the non-gated counterparts and other state-of-the-art error correction models in correcting spelling and grammatical errors. Extrinsic evaluation of our model on Machine Translation (MT) and Summarization tasks shows the competitive performance of the model against other generative sequence-to-sequence models under noisy inputs.
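The gating idea can be sketched as a sigmoid gate that decides, per token, how much of a corrected representation to keep versus the original input. The following is a minimal illustration, not the paper's exact architecture.

```python
# Minimal sketch of the gating idea: a learned sigmoid gate mixes the
# original token representation with a corrected one, so unchanged tokens
# can pass through largely untouched. Illustrative, not the paper's model.
import torch
import torch.nn as nn

class GatedCorrection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.correct = nn.Linear(d_model, d_model)   # stand-in corrector
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h):                            # h: (batch, seq, d)
        c = torch.tanh(self.correct(h))
        g = torch.sigmoid(self.gate(torch.cat([h, c], dim=-1)))
        return g * c + (1 - g) * h                   # gated mixture

h = torch.randn(2, 8, 32)
print(GatedCorrection(32)(h).shape)  # torch.Size([2, 8, 32])
```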
|
|
An Embedding-based Joint Sentiment-Topic Model for Short Texts
Ayan Sengupta*, William Scott Paka*, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty
Proceedings of the Fifteenth International AAAI Conference on Web and Social Media (ICWSM 2021) | AAAI
Paper | Code
Short text is a popular avenue for sharing feedback, opinions and reviews on social media, e-commerce platforms, etc. Many companies need to extract meaningful information (which may include thematic content as well as semantic polarity) from such short texts to understand users' behaviour. However, obtaining high-quality, sentiment-associated and human-interpretable themes still remains a challenge for short texts. In this paper we develop ELJST, an embedding-enhanced generative joint sentiment-topic model that can discover more coherent and diverse topics from short texts. It uses a Markov Random Field regularizer that can be seen as a generalisation of skip-gram-based models. Further, it can leverage higher-order semantic information appearing in word embeddings, such as self-attention weights in graphical models. Our results show an average improvement of 10% in topic coherence and 5% in topic diversification over baselines. Finally, ELJST helps understand users' behaviour at more granular levels, which can be explained. All these can bring significant value to service and healthcare industries that often deal directly with customers.
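One way to picture the embedding-based MRF regularizer is as a graph that links words with similar embeddings, nudging linked words toward the same topic assignment. The sketch below builds such an edge set; the similarity measure and threshold are illustrative, not ELJST's exact construction.

```python
# Sketch of the embedding-based MRF idea: link words whose embeddings are
# similar, so linked words are encouraged to share topic assignments.
# Threshold and similarity choice are illustrative, not ELJST's setup.
import numpy as np

def mrf_edges(words, embeddings, threshold=0.7):
    """Return undirected edges (i, j) between semantically similar words."""
    vecs = np.stack([embeddings[w] for w in words])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T                       # cosine similarity matrix
    return [(i, j) for i in range(len(words))
            for j in range(i + 1, len(words)) if sim[i, j] > threshold]

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["price", "cost", "delivery"]}
emb["cost"] = emb["price"] + 0.1 * rng.normal(size=50)  # make a close pair
print(mrf_edges(["price", "cost", "delivery"], emb))     # [(0, 1)]
```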
|
|
A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks
Ayan Sengupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020) | Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Paper | Code
This paper describes the system developed by team datamafia for WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. It contains a thorough study of pre-trained language models on a downstream binary classification task over noisy, user-generated Twitter data. The solution submitted to the final test leaderboard is a fine-tuned RoBERTa model, which achieves F1 scores of 90.8% and 89.4% on the dev and test data, respectively. In the latter part, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding regularization to the pre-trained RoBERTa model makes it robust to data and annotation noise and can improve overall performance by more than 1.2%.
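One explicit regularization route consistent with this setup is to raise dropout in the pre-trained configuration and add weight decay during fine-tuning. The sketch below uses the Hugging Face transformers API; the hyperparameter values are illustrative, not the tuned values from the paper.

```python
# Hedged sketch of one explicit regularization route: stronger dropout in
# the pre-trained config plus weight decay while fine-tuning RoBERTa.
# Hyperparameters are illustrative, not the paper's tuned values.
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
    "roberta-base",
    num_labels=2,                    # INFORMATIVE vs UNINFORMATIVE
    hidden_dropout_prob=0.2,         # stronger dropout than the default 0.1
    attention_probs_dropout_prob=0.2,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", config=config)

# During training, pair this with weight decay in the optimizer, e.g.
# torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01).
```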
|
|
Multi-modal Disease Prediction
Dec 2021 - Dec 2022
Ayan Sengupta, Amit Kumar, Suman Roy
Built a deep learning-based multi-task model for disease prediction. We utilized clinical knowledge graphs and spatio-temporal representations for encounter-level prediction of diagnoses and medications.
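A minimal picture of the multi-task setup is a shared temporal encoder over patient visits with separate prediction heads per task. The sketch below is a simplified stand-in (the actual model fuses knowledge-graph and spatio-temporal signals); all dimensions and module choices are illustrative.

```python
# Simplified multi-task sketch: one shared encoder over visit sequences,
# with separate heads for diagnoses and medications. Illustrative only.
import torch
import torch.nn as nn

class MultiTaskDiseaseModel(nn.Module):
    def __init__(self, d_in=128, d_hid=256, n_diagnoses=500, n_medications=300):
        super().__init__()
        self.encoder = nn.GRU(d_in, d_hid, batch_first=True)  # temporal encoder
        self.diagnosis_head = nn.Linear(d_hid, n_diagnoses)
        self.medication_head = nn.Linear(d_hid, n_medications)

    def forward(self, visits):                 # visits: (batch, time, d_in)
        _, h = self.encoder(visits)
        h = h.squeeze(0)
        # Multi-label logits per task; train with BCEWithLogitsLoss.
        return self.diagnosis_head(h), self.medication_head(h)

dx, rx = MultiTaskDiseaseModel()(torch.randn(4, 10, 128))
print(dx.shape, rx.shape)  # torch.Size([4, 500]) torch.Size([4, 300])
```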
|
|
Structured Information Extraction from Scanned Medical Charts
Nov 2019 - Oct 2021
Built an information extraction engine to extract structured information from noisy scanned images of handwritten medical charts. Our solution involves a multi-modal system that first detects the type of chart page, given the image and OCR text. We then use sequential and transformer models to detect named entities from noisy word tokens and parse the text into key-value pairs. Our solution can be used to extract demographic information as well as key medical conditions from unstructured text.
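The final parsing stage can be pictured as folding BIO-tagged tokens into key-value pairs. Below is a sketch of that step; the tag names are illustrative, not the production schema.

```python
# Sketch of the last pipeline stage: folding BIO-tagged tokens into
# key-value pairs. Tag names are illustrative, not the production schema.
def bio_to_key_values(tokens, tags):
    """Collect contiguous B-*/I-* spans into (entity_type, text) pairs."""
    pairs, current_type, span = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                pairs.append((current_type, " ".join(span)))
            current_type, span = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            span.append(token)
        else:
            if current_type:
                pairs.append((current_type, " ".join(span)))
            current_type, span = None, []
    if current_type:
        pairs.append((current_type, " ".join(span)))
    return pairs

tokens = ["DOB", ":", "04", "/", "12", "/", "1961", "John", "Smith"]
tags = ["O", "O", "B-DOB", "I-DOB", "I-DOB", "I-DOB", "I-DOB", "B-NAME", "I-NAME"]
print(bio_to_key_values(tokens, tags))
# [('DOB', '04 / 12 / 1961'), ('NAME', 'John Smith')]
```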
|
|
Voice of Customer Analytics
Jan 2019 - May 2020
Code1 | Code2
Ayan Sengupta, William Scott Paka, Vijay Malladi Varma, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty
In this project we developed an end-to-end system that extracts meaningful and actionable pain points from user-generated texts such as feedback, complaints, social media feeds, and telecommunications data. Our solution extracts customers' intent at the topic level and prioritizes critical issues in an unsupervised manner. We use probabilistic generative models based on Latent Dirichlet Allocation (LDA) to detect joint topic-sentiments from texts. Further, we use statistical measures to quantify information content and assign a priority score to each complaint so that downstream CRM teams can be assigned accordingly. We also developed an NMF (Non-negative Matrix Factorization)-based online version of our solution to work with high-velocity streaming data.
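As a toy version of the prioritization step, one can fit a topic model and use the sharpness (low entropy) of a complaint's topic mixture as an information-content proxy. The sketch below uses plain LDA from scikit-learn as a stand-in for our joint topic-sentiment model and exact priority score.

```python
# Sketch of one simple priority signal: fit LDA and rank complaints by how
# sharply focused their topic mixture is (low entropy = high priority).
# This is a stand-in for the joint topic-sentiment model and exact score.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

complaints = [
    "app crashes every time I open my account page",
    "billing billing wrong charge on my bill again",
    "great service but delivery was slow and support was slow too",
]
X = CountVectorizer().fit_transform(complaints)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                            # per-document topic mixture

entropy = -(theta * np.log(theta + 1e-12)).sum(axis=1)
priority = 1.0 - entropy / np.log(theta.shape[1])   # 1 = sharply focused
for text, score in sorted(zip(complaints, priority), key=lambda t: -t[1]):
    print(f"{score:.2f}  {text}")
```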
|
|
Code-Mixed Sentiment Classification (SemEval 2020 Task 9 and FIRE 2020)
Jan 2020 - Sep 2020
Code1 | Code2 | Link1 | Link2
Code-mixing is very common on social media. In this project, we learned textual representations from Hinglish, Tamil-English and Malayalam-English texts and classified their sentiment. We learned unsupervised word embeddings (word2vec, fastText) and contextual embeddings using BERT, and used sequential attention models and BERT for classifying tweet sentiment. We further used Byte-Pair Encoding (BPE) and learned representations using custom transformers trained from scratch.
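Subword-aware embeddings are particularly useful here because romanized code-mixed text has heavy spelling variation. Below is a small gensim fastText sketch on a toy Hinglish corpus; the corpus and hyperparameters are placeholders.

```python
# Sketch: subword-aware fastText embeddings help with the spelling
# variation typical of romanized code-mixed text. Toy corpus and
# hyperparameters, for illustration only.
from gensim.models import FastText

corpus = [
    ["movie", "bahut", "accha", "tha"],   # "the movie was very good"
    ["movie", "acha", "nahi", "laga"],    # "did not like the movie"
]
model = FastText(vector_size=50, window=3, min_count=1, min_n=2, max_n=5)
model.build_vocab(corpus_iterable=corpus)
model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=10)

# Character n-grams relate spelling variants of the same word...
print(model.wv.similarity("accha", "acha"))
# ...and give vectors even for unseen spellings like "achha".
print(model.wv["achha"].shape)   # (50,)
```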
|
|
PharmaCoNER: Named Entity Recognition from Biomedical Texts (BioNLP 2019 Task 2)
May 2019 - Jul 2019
Code | Link
In this project we developed a named entity recognition system to detect pharmaceutical entities in a Spanish clinical corpus. We explored different embedding techniques along with sequential and convolutional models topped with a conditional random field (CRF) layer for classifying entities and detecting entity offsets.
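The CRF tagging stage can be sketched with sklearn-crfsuite over simple hand-crafted token features; the features below are a minimal illustrative set, not the full feature templates we used.

```python
# Sketch of the CRF tagging stage with sklearn-crfsuite. The feature set
# and training data are tiny illustrations, not the actual system.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "suffix3": w[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

sents = [["El", "paciente", "recibió", "ibuprofeno"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]
y = [["O", "O", "O", "B-NORMALIZABLES"]]   # PharmaCoNER-style labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=50, all_possible_transitions=True)
crf.fit(X, y)
print(crf.predict(X))
```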
|
Data Science Competitions
|
|
2023 Kaggle AI Report
Code
In this challenge, participants were asked to write an essay on one of seven chosen topics from the field of AI, with a prompt describing what the community has learned over the past two years of working and experimenting with that topic.
|
|
Kaggle: Jigsaw Multilingual Toxic Comment Classification Challenge 2020
Code | Link
In this challenge, the objective is to classify whether a comment is toxic (or abusive). The dataset contains multilingual comments from different Wikipedia talk pages. We experimented with various multilingual transformer models, the Universal Sentence Encoder (USE) and their ensembles.
Final private leaderboard rank: 156 (Bronze medal)
|
|
Kaggle: Google QUEST Q&A Labeling Challenge 2020
Code | Link
In this challenge, the objective is to predict different subjective aspects of question-answer pairs gathered from various StackExchange properties. We explored BERT models along with hand-picked feature engineering to develop a robust model that can predict the subjectivity metrics accurately.
Final private leaderboard rank: 116 (Bronze medal)
|
|
Kaggle: Bengali.AI Handwritten Grapheme Classification Challenge 2020
Code | Link
Experimented with EfficientNet models with different CNN heads (SSE, GeM pooling; GeM is sketched below) on GPU/TPU to classify handwritten Bengali characters and their constituents (diacritics).
Final private leaderboard rank: 233 (out of 2000 teams)
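Generalized mean (GeM) pooling interpolates between average pooling (p = 1) and max pooling (p → ∞) with a learnable exponent. A standard PyTorch implementation is sketched below; the feature-map shape is illustrative.

```python
# Standard generalized mean (GeM) pooling head: a learnable exponent p
# interpolates between average (p=1) and max (p -> inf) pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learnable pooling exponent
        self.eps = eps

    def forward(self, x):                         # x: (batch, C, H, W)
        pooled = F.avg_pool2d(x.clamp(min=self.eps).pow(self.p),
                              kernel_size=(x.size(-2), x.size(-1)))
        return pooled.pow(1.0 / self.p).flatten(1)  # (batch, C)

features = torch.randn(2, 1280, 7, 7)   # e.g. EfficientNet feature maps
print(GeM()(features).shape)             # torch.Size([2, 1280])
```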
|
|
CrowdANALYTIX: Gamma Log Facies Type Prediction Challenge 2019
Code | Link
Given an array of GR (gamma-ray) values, the task was to accurately predict the log facies type corresponding to each value. Our solution stacks several seq2seq models with attention and achieved an overall accuracy of 96.8%.
Final private leaderboard rank: 26 (out of 350 teams)
|
|
Codalab: Fine-Grained Classification of Objects from Aerial Imagery Challenge 2019
Code | Link | Blog
The competition focused on fine-grained classification of objects from aerial imagery. We used different image augmentation techniques along with several image classification models (MobileNet, ResNet, InceptionNet) and achieved an average precision of 55% on the test data.
Final private leaderboard rank: 16 (out of 50 teams)
|
|
Data Science Game 2016: Online selection
Bodhisattwa Prasad Majumder, Robin Singh, Ayan Sengupta, Jayanta Mandi
Code | Link1 | Link2
The challenge was to classify the orientation of building roofs using satellite images of rooftops. We used different image augmentation techniques along with a VGG network and achieved 82% accuracy on the test dataset.
Final private leaderboard rank: 22 (out of 110 teams); selected for the Finals of Data Science Game 2016
|
|
consNLP - An NLP toolkit for text data Exploration, Visualization and Modeling
Code
A consolidated NLP toolkit for text data analysis and modeling. It supports various functionalities such as tokenization, lemmatization, binary/multiclass classification, sequence classification, question answering, and Natural Language Inference (NLI), for both training and inference. It runs on CPU/GPU/TPU platforms.
|
|
jointtsmodel - Python package for Probabilistic Joint Topic-Sentiment Models
Code
Joint topic-sentiment (JST) models, also known as aspect-rating models, can extract thematic representations from texts at a granular level. In customer relationship management, it is of utmost importance to understand customers' pain points from feedback, complaints and other data sources. By leveraging both sentiment and thematic information, JST-based models can capture intent at the overall level as well as at the theme level, which makes them well suited for analyzing the voice of the customer.
This package contains implementations of various probabilistic generative JST models from the literature.
|
|
Slack bot for Online Product Selection
Code
This is a simple Slack bot that reads a product description from the user, parses the input, and shows relevant products from Amazon.
|
Awards
- [2020] Became a Kaggle Competitions Expert
- [2016] Finalist, Data Science Game '16, Paris; Represented India (1 out of 3 teams), International Rank 14
- [2009] Seven-year KVPY fellowship for pursuing undergraduate and graduate research in basic sciences, funded by the Department of Science and Technology (Govt. of India)
|
Invited Talks and Trainings
- [2022] Invited talk on Code-mixed representation learning at the Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP), co-located with ICON 2022
- [2019] Student talk at IIIT Delhi (course taught by Prof. Tanmoy Chakraborty) on Probabilistic Generative Models
- [2015] Research Scholars' Seminar at IIT Guwahati on the existence of the algebraic closure of a field
|
Voluntary Services
- PC member of ACL 2023
- Reviewer at Journal of Intelligent & Fuzzy Systems
- Reviewer at International AAAI Conference on Web and Social Media (ICWSM) 2021
- Reviewer at Workshop on Noisy User-generated Text (W-NUT) 2020
- PC member of ECML-PKDD 2020 in Applied Data Science Track
- Writer and member of the Topcoder Thrive community
|
|