Ayan Sengupta

I am a first-year PhD student at IIT Delhi, where I work in the Laboratory for Computational Social Systems (LCS2) led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar. My primary research areas are understanding the generalization capabilities of pre-trained language models and representation learning for code-mixed and low-resource languages.

My other areas of interest are Bayesian meta-learning, representation learning from knowledge graphs, graph neural networks, and generative NLP.

I am also a Senior Data Scientist at Optum (UnitedHealth Group), where I work on building NLP solutions for Optum's housecall program. Optum's housecall program provides in-house care for patients. As part of this initiative, we analyze patient call transcripts for pain point analysis and help Optum provide a better experience for its customers.

Before joining Optum, I completed my Master's in Business Analytics and Data Science (PGDBA) at the Indian Institute of Management, Calcutta. Prior to that, I completed my Master of Science in Mathematics and Computing at the Indian Institute of Technology, Guwahati in 2015 and my Bachelor of Science in Mathematics and Computer Science at Chennai Mathematical Institute in 2013.

Email / CV / GitHub / LinkedIn / Kaggle / Topcoder / Scholar

Recent News · Education · Selected Publications · Professional Experiences · Corporate Projects · Data Science Competitions · Development Projects · Awards · Invited Talks and Trainings · Voluntary Services · Extra

Recent Highlights

[Dec 2023] Attending EMNLP 2023. See you in Singapore.
[Oct 2023] Paper titled "Manifold-Preserving Transformers are Effective for Short-Long Range Encoding" accepted at Findings of EMNLP 2023.
[Dec 2022] Invited talk on Code-mixed representation learning at Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP), co-located with ICON 2022.
[Dec 2022] Joined the PhD program at IIT Delhi in the Electrical Engineering Department.
[Nov 2022] Publication of patent ID US11494565B2 titled "Natural Language Processing Techniques using Joint Sentiment-Topic Modeling".

Education

PhD, Computer Science
Indian Institute of Technology, Delhi
2023-Present

Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar

PhD, Computer Science
Indraprastha Institute of Information Technology, Delhi
2021-2022

Advisors: Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar
Courses: Machine Learning, Natural Language Processing, Social Network Analysis, Data Mining, Artificial Intelligence, Bayesian Machine Learning

MS, Business Analytics and Data Science
Indian Institute of Management, Calcutta (jointly with Indian Institute of Technology, Kharagpur and Indian Statistical Institute, Kolkata)
2015-2017

Courses: Algorithms, Machine Learning, Multivariate Analysis, Complex Networks, Information Retrieval, Econometrics, Statistical Inference

Master of Science, Mathematics and Computing
Indian Institute of Technology, Guwahati
2013-2015

Advised by Prof. Anupam Saikia for master's thesis on Mathematics of Elliptic Curves and its application in Cryptography
Courses: Algorithms, Logic Programming (Prolog, Introduction to AI), Probability Theory, Numerical Analysis, Optimization

Bachelor of Science, Mathematics and Computer Science
Chennai Mathematical Institute
2010-2013

Courses: Mathematical Logic, Algorithm Design, Game Theory, Theory of Computation, Programming

Selected Publications (* denotes equal contribution)

Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EMNLP 2023 | Association for Computational Linguistics
Paper | Code

Multi-head self-attention-based Transformers have shown promise in different learning tasks. Albeit these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings to transform token representations to different manifolds with similar topology and preserve Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over the variants of Transformers. Additionally, TransJect displays 79% better performance than Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from the statistical physics viewpoint. Although multi-head self-attention was incepted to learn different abstraction levels within the networks, our empirical analyses suggest that different attention heads learn randomly and unorderly. In contrast, TransJect adapts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.

Does aggression lead to hate? Detecting and reasoning offensive traits in hinglish code-mixed texts
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Md Shad Akhtar, Tanmoy Chakraborty
Neurocomputing | Elsevier
Paper | Code

Aggression is a prominent trait of human beings that can affect social harmony in a negative way. The hate mongers misuse the freedom of speech in social media platforms to flood with their venomous comments in many forms. Identifying different traits of online offense is thus inevitable and the need of the hour. Existing studies usually handle one or two offense traits at a time, mainly due to the lack of a combined annotated dataset and a scientific study that provides insights into the relationship among the traits. In this paper, we study the relationship among five offense traits – aggression, hate, sarcasm, humor, and stance in Hinglish (Hindi-English) social media code-mixed texts. We employ various state-of-the-art deep learning systems at different morphological granularities for the classification across five offense traits. Our evaluation of the unified framework suggests performance across all major traits. Furthermore, we propose a novel notion of causal importance score to quantify the effect of different abusive keywords and the overall context on the offensiveness of the texts.

HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta, Sourabh Kumar Bhattacharjee, Tanmoy Chakraborty, Md Shad Akhtar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 | Association for Computational Linguistics
Paper | Code

Understanding linguistics and morphology of resource-scarce code-mixed texts remains a key challenge in text processing. Although word embedding comes in handy to support downstream tasks for low-resource languages, there are plenty of scopes in improving the quality of language representation particularly for code-mixed languages. In this paper, we propose HIT, a robust representation learning method for code-mixed texts. HIT is a hierarchical transformer-based framework that captures the semantic relationship among words and hierarchically learns the sentence-level semantics using a fused attention mechanism. HIT incorporates two attention modules, a multi-headed self-attention and an outer product attention module, and computes their weighted sum to obtain the attention weights. Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages across four NLP tasks on eleven datasets suggests significant performance improvement against various state-of-the-art systems. We further show the adaptability of learned representation across tasks in a transfer learning setup (with and without fine-tuning).

Gated Transformer for Robust De-noised Sequence-to-Sequence Modelling
Ayan Sengupta, Amit Kumar, Sourabh Kumar Bhattacharjee, Suman Roy
Findings of the Association for Computational Linguistics: EMNLP 2021 | Association for Computational Linguistics
Paper | Code

Robust sequence-to-sequence modelling is an essential task in the real world where the inputs are often noisy. Both user-generated and machine generated inputs contain various kinds of noises in the form of spelling mistakes, grammatical errors, character recognition errors, all of which impact downstream tasks and affect interpretability of texts. In this work, we devise a novel sequence-to-sequence architecture for detecting and correcting different real world and artificial noises (adversarial attacks) from English texts. Towards that we propose a modified Transformer-based encoder-decoder architecture that uses a gating mechanism to detect types of corrections required and accordingly corrects texts. Experimental results show that our gated architecture with pre-trained language models perform significantly better than the non-gated counterparts and other state-of-the-art error correction models in correcting spelling and grammatical errors. Extrinsic evaluation of our model on Machine Translation (MT) and Summarization tasks show the competitive performance of the model against other generative sequence-to-sequence models under noisy inputs.

An Embedding-based Joint Sentiment-Topic Model for Short Texts
Ayan Sengupta, William Scott Paka, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty
Vol. 15 (2021): Fifteenth International AAAI Conference on Web and Social Media
Paper | Code

Short text is a popular avenue of sharing feedback, opinions and reviews on social media, e-commerce platforms, etc. Many companies need to extract meaningful information (which may include thematic content as well as semantic polarity) out of such short texts to understand users’ behaviour. However, obtaining high quality sentiment-associated and human interpretable themes still remains a challenge for short texts. In this paper we develop ELJST, an embedding enhanced generative joint sentiment-topic model that can discover more coherent and diverse topics from short texts. It uses Markov Random Field Regularizer that can be seen as generalisation of skip-gram based models. Further, it can leverage higher order semantic information appearing in word embedding, such as self-attention weights in graphical models. Our results show an average improvement of 10% in topic coherence and 5% in topic diversification over baselines. Finally, ELJST helps understand users' behaviour at more granular levels which can be explained. All these can bring significant values to service and healthcare industries often dealing with customers.

A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks
Ayan Sengupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020) | Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Paper | Code

This document describes the system description developed by team datamafia at WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. This paper contains a thorough study of pre-trained language models on downstream binary classification task over noisy user generated Twitter data. The solution submitted to final test leaderboard is a fine tuned RoBERTa model which achieves F1 score of 90.8% and 89.4% on the dev and test data respectively. In the later part, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding regularizations to RoBERTa pre-trained model can be very robust to data and annotation noises and can improve overall performance by more than 1.2%.

Professional Experiences

Optum (UnitedHealth Group)

Aug 2019 - Present
Senior Data Scientist
Working with Optum's housecall team to build NLP solutions on customer call transcripts, including better conversational agents for capturing customers' pain points. Previously worked with the Natural Language Processing group of Optum's RQNS (Risk, Quality and Network Solutions) team, building NLP solutions to extract structured information from noisy OCR output of clinical charts. Explored deep sequential models and transformer-based models for named entity recognition and representation learning from noisy texts. Additionally, used multi-modal systems leveraging graphs, computer vision, and NLP to classify page contents and extract meaningful information from the clinical charts.

May 2017 - Aug 2019
Data Scientist
Worked with Optum's RQNS risk team to determine overall health risk for 7 million patients. Built and deployed a big-data Python module to extract suspects, screenings, hospitalization, visit information, etc. for patients. This information is sent to providers and payers (insurance companies) to calculate the overall risk score, and payments are adjusted accordingly.

Worked with Optum India's R&D team on various research initiatives. We handled R&D projects spanning voice-of-customer analytics, NPS analytics, medical fraud detection, etc. One of our projects, on medical fraud detection from clinical charts, was demonstrated to the National Health Authority (NHA), Govt. of India.

Nov 2016 - Apr 2017
Data Science Intern
Worked with the NLP group of Optum India on text extraction from scanned medical charts. Built and deployed an OCR pipeline that currently processes 1 billion barcode pages every year.

Greenfield Softwares Private Limited

May 2016 - Jun 2016
Data Science Intern
Devised a predictive analytics engine to detect failures in data centre infrastructure management systems. We demonstrated our solution to senior management of the company.

Corporate Projects

Multi-modal Disease Prediction
Dec 2021 - Dec 2022
Ayan Sengupta, Amit Kumar, Suman Roy

Built a deep-learning-based multi-task model for disease prediction. We utilized clinical knowledge graphs and tempo-spatial representations for encounter-level predictions of diagnoses and medications.

Structured Information Extraction from Scanned Medical Charts
Nov 2019 - Oct 2021

Built an information extraction engine to extract structured information from noisy scanned images of handwritten medical charts. Our solution uses a multi-modal system to first detect the type of chart page from the image and OCR text. We then use sequential and transformer models to detect named entities from noisy word tokens and parse the text into key-value pairs. Our solution can be used for extracting demographic information as well as key medical conditions from unstructured texts.

Voice of Customer Analytics
Jan 2019 - May 2020
Code1 | Code2
Ayan Sengupta, William Scott Paka, Vijay Malladi Varma, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty

In this project, we developed an end-to-end system that extracts meaningful and actionable pain points from user-generated texts such as feedback, complaints, social media feeds, telecommunications, etc. Our solution extracts customers' intent at the topic level and prioritizes critical issues in an unsupervised manner. We use probabilistic generative models based on Latent Dirichlet Allocation (LDA) to detect joint topic-sentiments from texts. Further, we use statistical measures to quantify information content and assign a priority score to each complaint so that the downstream CRM teams can be assigned accordingly. We also developed an NMF (Non-Negative Matrix Factorization) based online version of our solution to work with high-velocity streaming data.

Patents

Machine Learning Techniques for Cross-Domain Text Classification, Optum, 2022
Differential Word Mover Similarity for Cross-Domain Medical Coding, Optum, 2022
Attention-Based Machine Learning Techniques Using Temporal Sequence Data and Dynamic Co-Occurrence Graph Data Objects, Optum, 2022
Natural Language Processing Machine Learning Frameworks Trained Using Multi-Task Training Routines, Optum, 2022
Supervised and Unsupervised Machine Learning Techniques for Communication Summarization, Optum, 2022
Machine Learning Techniques for Denoising Input Sequences, Optum, 2021
Graph-Embedding-based Paragraph Vector Machine Learning Model, Optum, 2021
Significance-based Prediction from Unstructured Text, Optum, 2021
Natural Language Processing Techniques for Sequential Topic Modeling, Optum, 2020 [Link]
Natural Language Processing Techniques using Joint Sentiment-Topic Modeling, Optum, 2020 [Link]
Auto Prioritization of Customer Grievances using Sentiment-Topic Information, Optum, 2020
Natural Language Processing Using Joint Sentiment-Topic Modeling, Optum, 2019 [Link]

Other Research Projects

Code-Mixed Sentiment Classification (SemEval 2020 Task 9 and FIRE 2020)
Jan 2020 - Sep 2020
Code1 | Code2 | Link1 | Link2

Code-mixing is very common in social media. In this project, we learned textual representations from Hinglish, Tamil-English, and Malayalam-English texts and classified their sentiment. We learned unsupervised word embeddings (word2vec, fastText) and contextual embeddings using BERT, and used sequential attention models and BERT to classify tweet sentiment. We further used byte-pair encoding (BPE) and learned representations using custom transformers trained from scratch.

PharmaCoNER: Named Entity Recognition from Biomedical Texts (BioNLP 2019 Task 2)
May 2019 - Jul 2019
Code | Link

In this project, we developed a named entity recognition system to detect pharmaceutical entities in a Spanish clinical corpus. We explored different embedding techniques along with sequential and convolutional models with conditional random fields (CRF) for classifying entities and detecting entity offsets.

Data Science Competitions

2023 Kaggle AI Report
Code

In this challenge, participants were asked to write an essay on one of seven chosen topics from the field of AI, describing what the community has learned over the past 2 years of working and experimenting with that topic.

Kaggle: Jigsaw Multilingual Toxic Comment Classification Challenge 2020
Code | Link

In this challenge, the objective is to classify whether a comment is toxic (or abusive). The dataset contains multilingual comments from different Wikipedia talk pages. We experimented with various multilingual transformer models, Universal Sentence Encoder (USE), and their ensembles. Achieved a final private leaderboard rank of 156 (Bronze medal).

Kaggle: Google QUEST Q&A Labeling Challenge 2020
Code | Link

In this challenge, the objective is to predict different subjective aspects of question-answering gathered from different StackExchange properties. In this competition, we explored a BERT model and hand-picked feature engineering to develop a robust model that can predict the subjectivity metrics accurately. Achieved a final private leaderboard rank of 116 (Bronze medal).

Kaggle: Bengali.AI Handwritten Grapheme Classification Challenge 2020
Code | Link

Experimented with EfficientNet models with different CNN heads (SSE, GEM pooling) on GPU/TPU to classify handwritten Bengali alphabets and their constituents (diacritics). Achieved a final private leaderboard rank of 233 (out of 2000 teams).

CrowdANALYTIX: Gamma Log Facies Type Prediction Challenge 2019
Code | Link

Given an array of GR (Gamma-Ray) values, accurately predict the log facies type corresponding to each value. The solution stacks several seq2seq models with attention and achieved 96.8% overall accuracy. Achieved a final private leaderboard rank of 26 (out of 350 teams).

Codalab: Fine-Grained Classification of Objects from Aerial Imagery Challenge 2019
Code | Link | Blog

The competition focused on fine-grained classification of objects from aerial imagery. We used different image augmentation techniques along with different image classification models (MobileNet, ResNet, InceptionNet) and achieved an average precision of 55% on the test data. Achieved a final private leaderboard rank of 16 (out of 50 teams).

Data Science Game 2016: Online selection
Bodhisattwa Prasad Majumder, Robin Singh, Ayan Sengupta, Jayanta Mandi
Code | Link1 | Link2

The challenge was to classify the orientation of building roofs using satellite images of rooftops. We used different image augmentation techniques along with a VGG network and achieved 82% accuracy on the test dataset. Achieved a final private leaderboard rank of 22 (out of 110 teams) and were selected for the Finals of Data Science Game 2016.

Development Projects

consNLP - An NLP toolkit for text data Exploration, Visualization and Modeling
Code

A consolidated NLP toolkit for text data analysis and modeling. It supports functionalities such as tokenization, lemmatization, binary/multiclass classification, sequence classification, question answering, and natural language inference (NLI), for both training and inference. It supports CPU/GPU/TPU platforms.

jointtsmodel - Python package for Probabilistic Joint Topic-Sentiment Models
Code

Joint topic-sentiment (JST) models, a.k.a. aspect rating models, can extract thematic representations from texts at a granular level. In customer relationship management, it is of utmost importance to understand customers' pain points from their feedback, complaints, and other data sources. By leveraging both sentiment and thematic information, JST-based models can understand intent at the overall level as well as at the theme level, which makes them well suited for analyzing the voice of the customer.

This package contains implementation of various probabilistic generative JST models from the literature.

Slack bot for Online Product Selection
Code

This is a simple Slack bot that reads the required product description from users, parses the input, and shows relevant products from Amazon.

Awards

[2021] Senior inventor award by UnitedHealth Group
[2020] Became Kaggle Competition Expert
[2019] Innovation award at Optum for invention disclosure filing
[2018] Mastermind award Q4-2018 at Optum for research initiatives
[2018] 3rd place in MindSpark, Annual Data Science Championship at Optum
[2016] Finalist, Data Science Game '16, Paris; Represented India (1 out of 3 teams), International Rank 14
[2009] 7-year KVPY fellowship for pursuing undergraduate and graduate research in basic sciences. Fellowship funded by the Department of Science and Technology (Govt. of India)

Invited Talks and Trainings

[2022] Invited talk on Code-mixed representation learning at Workshop on Indian Code-mixed and Low-resource Natural Language Processing (ICLrNLP), co-located with ICON 2022
[2020] Conducted Clinical NLP specialized training for 200+ employees at Optum
[2019] Session at Optum on A Comprehensive Overview of Natural Language Processing in Healthcare industry
[2019] Student talk at IIIT Delhi (course taught by Prof. Tanmoy Chakraborty) on Probabilistic Generative Models
[2015] Research Scholar's Seminar at IIT Guwahati on Existence of Algebraic Closure of a Field

Voluntary Services

PC member of ACL 2023
Reviewer at Journal of Intelligent & Fuzzy Systems
Reviewer at International AAAI Conference on Web and Social Media (ICWSM) 2021
Reviewer at Workshop on Noisy User-generated Text (W-NUT) 2020
PC member of ECML-PKDD 2020 in Applied Data Science Track
Writer and member of Topcoder Thrive community

Extra

[2019] Passed Grade 1 in Rock & Pop Bass, awarded by Trinity College London 🤘
[2019] Performed at Rockathon organized by Noida School of Rock 🤘🤘
[2016] Passed A1 in French by CdA Global Language Centre

Thanks to Jon Barron and Bodhisattwa P. Majumder for this nice template.