Ayan Sengupta

I am a Senior Data Scientist at Optum (UnitedHealth Group), where I build NLP solutions for Risk, Quality and Network Solutions (RQNS). Optum's RQNS team is primarily responsible for capturing health risks for 7 million patients and acting as a bridge between its members, payers and provider networks across the US.

My current areas of interest are representation learning from noisy and code-mixed texts, information extraction and unsupervised deep generative models. In the future, I want to explore meta learning and multi-modal systems that leverage NLP, Computer Vision and Graphs.

Before joining Optum, I completed my Masters in Business Analytics and Data Science (PGDBA) from the Indian Institute of Management, Calcutta. Prior to that, I completed my Masters of Science in Mathematics and Computing from the Indian Institute of Technology, Guwahati in 2015 and my Bachelor of Science in Mathematics and Computer Science from Chennai Mathematical Institute in 2013.

Email  /  CV  /  GitHub  /  LinkedIn  /  Kaggle  /  Topcoder


Optum (UnitedHealth Group)

Aug 2019 - Present
Senior Data Scientist
Working with the Natural Language Processing group of Optum's RQNS (Risk, Quality and Network Solutions) team. Building NLP solutions to extract structured information from noisy OCR output of clinical charts. Exploring deep sequential models and transformer-based models for named entity recognition and representation learning from noisy texts. Additionally, using multi-modal systems that leverage graphs, computer vision and NLP to classify the page content of clinical charts and extract meaningful information from them.

May 2017 - Aug 2019
Data Scientist
Worked with Optum's RQNS risk team to determine the overall health risk for 7 million patients. Built and deployed a big data Python module to extract suspects, screenings, hospitalizations, visit information etc. for patients. This information is sent to providers and payers (insurance companies) to calculate the overall risk score, and payments are adjusted accordingly.

Worked with Optum India's R&D team on various research initiatives. We handled R&D projects across voice of customer analytics, NPS analytics, medical fraud detection etc. One of our projects, on Medical Fraud Detection from Clinical Charts, was demonstrated to the National Health Authority (NHA), Govt. of India.

Nov 2016 - Apr 2017
Data Science Intern
Worked with the NLP group of Optum India on text extraction from scanned medical charts. Built and deployed an OCR pipeline that currently processes 1 billion barcode pages every year.


Greenfield Softwares Private Limited

May 2016 - Jun 2016
Data Science Intern
Devised a predictive analytics engine to detect failures in data centre infrastructure management systems. We demonstrated our solution to senior management of the company.


I'm interested in representation learning, meta learning and unsupervised generative models in NLP. Previously, I have worked on probabilistic generative models, sentiment analysis, sequence classification etc. In the future, I wish to explore multi-modal representation learning.

Recent Projects


Structured Information Extraction from Scanned Medical Charts
Nov 2019 - Present

Building an information extraction engine to extract structured information from noisy scanned images of handwritten medical charts. Our solution uses a multi-modal system to first detect the type of chart page, given the image and OCR text. We then use sequential and transformer models to detect named entities from noisy word tokens and parse the text into key-value pairs. Our solution can be used for extracting demographic information as well as key medical conditions from unstructured texts.
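The last step, turning token-level entity tags into key-value pairs, can be illustrated with a minimal sketch. It assumes the tagger emits BIO-style labels; the tag names (`NAME`, `DOB`), the sample tokens and the function name are hypothetical and only serve the illustration, they are not taken from the production system.

```python
def bio_to_key_values(tokens, tags):
    """Group BIO-tagged tokens into (field, value) pairs."""
    pairs = []
    field, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts: flush the span that was open, if any.
            if field is not None:
                pairs.append((field, " ".join(words)))
            field, words = tag[2:], [token]
        elif tag.startswith("I-") and field == tag[2:]:
            # Continuation of the currently open span.
            words.append(token)
        else:
            # "O" tag, or a stray I- tag that does not continue the open span.
            if field is not None:
                pairs.append((field, " ".join(words)))
            field, words = None, []
    if field is not None:
        pairs.append((field, " ".join(words)))
    return pairs


# Hypothetical noisy OCR tokens ("J0hn" imitates an OCR error).
tokens = ["Patient", "Name", ":", "J0hn", "Doe", "DOB", ":", "01/02/1960"]
tags = ["O", "O", "O", "B-NAME", "I-NAME", "O", "O", "B-DOB"]
pairs = bio_to_key_values(tokens, tags)
print(pairs)
```

In this sketch a stray `I-` tag without a matching open span is simply dropped; a real system might instead try to repair such tags.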



External Knowledge infused NER from noisy OCR texts
Nov 2019 - Present

Named entity recognition (NER) is a well-explored area in NLP. However, challenges arise when the texts are noisy. Capturing correct semantics and contextual meaning is very difficult with noisy, error-prone data such as OCR texts. In this project, we explore leveraging external knowledge about the entities, in the form of graph embeddings, to enhance NER quality. We are building a multi-modal system that uses knowledge graphs along with contextual text information to detect and link entities of interest.



Voice of Customer Analytics
Jan 2019 - May 2020
Code1 | Code2
Ayan Sengupta, William Scott Paka, Vijay Malladi Varma, Suman Roy, Gaurav Ranjan, Tanmoy Chakraborty

In this project, we develop an end-to-end system that extracts meaningful and actionable pain points from user-generated texts such as feedback, complaints, social media feeds, telecommunications etc. Our solution extracts customers' intent at the topic level and prioritizes critical issues in an unsupervised manner. We use probabilistic generative models based on Latent Dirichlet Allocation (LDA) to detect joint topic-sentiments from texts. Further, we use statistical measures to quantify information content and assign a priority score to each complaint so that the downstream CRM teams can be assigned accordingly. We also developed an NMF (Non-Negative Matrix Factorization) based online version of our solution to work with high-velocity streaming data.
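To give a feel for the NMF idea mentioned above, here is a minimal sketch of plain batch NMF with multiplicative updates on a toy document-term matrix. It is not the online variant used in the project, and the vocabulary, counts and function names are made up for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into non-negative W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_docs)]
    return w, h


# Hypothetical vocabulary and term counts for four complaints.
terms = ["billing", "refund", "network", "outage"]
v = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 6],
]
w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Each row of `h` can be read as a topic over the vocabulary and each row of `w` as a complaint's topic weights. An online variant would update the factors incrementally as new complaints arrive rather than refitting on the full matrix.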

  • Auto Prioritization of Customer Grievances using Sentiment-Topic Information, Optum, 2020
  • Natural Language Processing Using Joint Sentiment-Topic Modeling, Optum, 2019

Other Research Projects


Denoising Noisy Text Data
Mar 2020 - Present
Ayan Sengupta, Amit Kumar, Vijay Malladi Varma, Suman Roy, Gaurav Ranjan

Noisy text is everywhere, from social media posts to text messages to machine-generated or machine-translated data. In this project, we try to formalize the different kinds of noise and adversarial attacks on text data and then denoise them. We are exploring different encoder-decoder architectures that read a noisy input and generate the expected correct text output.


Code-Mixed Sentiment Classification (SemEval 2020 Task 9)
Jan 2020 - Mar 2020
Code | Link

Code-mixing is very common in social media. In this project, we learned textual representations from Hinglish tweets and classified their sentiment. We learned unsupervised word embeddings (word2vec, fastText) and contextual embeddings using BERT, and used sequential attention models and BERT to classify tweet sentiment.


PharmaCoNER: Named Entity Recognition from Biomedical Texts (BioNLP 2019 Task 2)
May 2019 - Jul 2019
Code | Link

In this project, we developed a named entity recognition system to detect pharmaceutical entities in a Spanish clinical corpus. We explored different embedding techniques along with sequential and convolutional models with a conditional random field (CRF) layer for classifying entities and detecting entity offsets.
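As a rough illustration of what a CRF layer adds at prediction time, the sketch below runs Viterbi decoding over per-token label scores plus label-to-label transition scores. All scores, the label set and the sample tokens are invented for the example; in a trained model they would be learned.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list (one entry per token) of dicts mapping label -> score
    transitions: dict mapping (previous_label, current_label) -> score
    """
    # best[t][label] = best score of any path that ends in `label` at position t
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B", "I"]
# Hypothetical transition scores: an entity cannot continue (I) right after O.
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "I")] = -10.0

# Hypothetical per-token scores for the tokens ["con", "ácido", "acetilsalicílico"].
emissions = [
    {"O": 2.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 1.0, "I": 1.5},
    {"O": 0.5, "B": 0.0, "I": 1.0},
]
path = viterbi(emissions, transitions, labels)
print(path)
```

Taken token by token, the second token would be labelled `I` directly after an `O`; the transition penalty makes the decoder prefer a sequence in which the entity starts with `B`.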

Past Publications
(* denotes equal contribution)


A Large-scale Analysis of the Marketplace Characteristics in Fiverr
Suman Kalyan Maity, Chandra Bhanu Jha*, Avinash Kumar*, Ayan Sengupta*, Madhur Modi*, Animesh Mukherjee
2017 Hawaii International Conference On System Sciences (HICSS-50)

Crowdsourcing platforms have become quite popular due to the increasing demand for human computation-based tasks. Though crowdsourcing systems such as MTurk are primarily demand-driven, supply-driven marketplaces are becoming increasingly popular. Fiverr is one of the fastest-growing supply-driven marketplaces, where sellers post micro-tasks (gigs) and users purchase them for prices as low as $5. In this paper, we studied the Fiverr platform as a unique marketplace and characterized the sellers, the buyers and the interactions among them. We also studied Fiverr as a seller-driven marketplace in terms of sales, churn rates, competitiveness among various subcategories etc.


Fault Detection Engine in Intelligent Predictive Analytics Platform for DCIM
Bodhisattwa Prasad Majumder, Ayan Sengupta, Sajal Jain, Parikshit Bhaduri
2016 Fourth International Conference on Business Analytics and Intelligence
Paper | Code

It is imperative for data centers to maximize equipment uptime. The various equipment in a data center, such as UPS, PDUs and PACs, is monitored by DCIM, which raises real-time alerts when critical parameters cross thresholds. The streaming data from devices, as well as alert and fault reports, can be correlated through analytics to pinpoint root causes of failures and to predict device failures. This paper introduces a computing unit, the Fault Engine, which leverages the log data from all concerned devices in the device chain and employs a Markov process based failure model to predict whether a failure is permanent or transient, and hence raises an alarm with the appropriate severity. The paper also discusses probabilistic detection of the root cause when a device fails.
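The permanent-versus-transient idea can be sketched with a toy three-state Markov chain. This is only an illustration and not the model from the paper: the state names (`ok`, `fault`, `failed`), the sample logs and the alarm threshold are all assumptions.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Estimate transition probabilities from observed state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


def prob_permanent(transitions):
    """P(device reaches 'failed' before recovering to 'ok' | it is in 'fault').

    Solves h = P(fault -> failed) + P(fault -> fault) * h for h.
    """
    row = transitions["fault"]
    stay = row.get("fault", 0.0)
    if stay >= 1.0:
        raise ValueError("the chain never leaves the 'fault' state")
    return row.get("failed", 0.0) / (1.0 - stay)


# Hypothetical per-device state logs.
logs = [
    ["ok", "ok", "fault", "ok", "ok"],
    ["ok", "fault", "fault", "failed"],
    ["ok", "fault", "ok", "fault", "ok"],
]
transitions = estimate_transitions(logs)
p_permanent = prob_permanent(transitions)

# Hypothetical severity rule: the 0.5 threshold is an arbitrary choice.
severity = "critical" if p_permanent >= 0.5 else "warning"
print(p_permanent, severity)
```

In these toy logs, a device in the fault state usually recovers, so the estimated probability of a permanent failure is low and the alarm is raised with the lower severity.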

Data Science Competitions

Kaggle: Jigsaw Multilingual Toxic Comment Classification Challenge 2020
Code | Link

In this challenge, the objective is to classify whether a comment is toxic (or abusive). The dataset contains multilingual comments from different Wikipedia talk pages. We experimented with various multilingual transformer models, Universal Sentence Encoder (USE) and their ensembles.

Final private leaderboard rank: 156 (Bronze medal)


Kaggle: Google QUEST Q&A Labeling Challenge 2020
Code | Link

In this challenge, the objective is to predict different subjective aspects of question-answer pairs gathered from different StackExchange properties. We explored a BERT model and hand-picked feature engineering to develop a robust model that can predict the subjectivity metrics accurately.

Final private leaderboard rank: 116 (Bronze medal)


Kaggle: Bengali.AI Handwritten Grapheme Classification Challenge 2020
Code | Link

Experimented with EfficientNet models with different CNN heads (SSE, GeM pooling) on GPU/TPU to classify handwritten Bengali characters and their constituents (diacritics).

Final private leaderboard rank: 233 (out of 2000 teams)



CrowdANALYTIX: Gamma Log Facies Type Prediction Challenge 2019
Code | Link

Given an array of GR (Gamma-Ray) values, accurately predict the log facies type corresponding to each value. Solution includes stacking of several seq2seq models with attention. Achieved overall 96.8% accuracy.

Final private leaderboard rank: 26 (out of 350 teams)


Codalab: Fine-Grained Classification of Objects from Aerial Imagery Challenge 2019
Code | Link | Blog

The competition focused on fine-grained classification of objects from aerial imagery. We used different image augmentation techniques along with different image classification models (MobileNet, ResNet, InceptionNet) and achieved an average precision of 55% on the test data.

Final private leaderboard rank: 16 (out of 50 teams)



Data Science Game 2016: Online selection
Bodhisattwa Prasad Majumder, Robin Singh, Ayan Sengupta, Jayanta Mandi
Code | Link1 | Link2

The challenge was to classify the orientation of building roofs using satellite images of rooftops. We used different image augmentation techniques along with a VGG network and achieved 82% accuracy on the test dataset.

Final private leaderboard rank: 22 (out of 110 teams); selected for the finals of Data Science Game 2016

Open Source Development


An Interactive Text Editor with Intent Understanding
Coming Soon

A GPT-powered interactive text editor that predicts the next likely word along with the overall text sentiment and allows the user to write text that suits her intent.


jointtsmodel - Python package for Probabilistic Joint Topic-Sentiment Models

Joint topic-sentiment (JST) models, a.k.a. aspect rating models, can extract thematic representations from texts at a granular level. In customer relationship management, it is of utmost importance to understand customers' pain points from their feedback, complaints and other data sources. By leveraging both sentiment and thematic information, JST-based models can understand intent at the overall level as well as at the theme level, which makes them well suited for analyzing the voice of the customer.

This package contains implementation of various probabilistic generative JST models from the literature.


Slack bot for Online Product Selection

This is a simple Slack bot that reads the required product description from the user, parses the input and shows relevant products from Amazon.


Education

MS, Business Analytics and Data Science
Indian Institute of Management, Calcutta (jointly with Indian Institute of Technology, Kharagpur and Indian Statistical Institute, Kolkata)

Courses: Algorithms, Machine Learning, Multivariate Analysis, Complex Networks, Information Retrieval, Econometrics, Statistical Inference


Masters of Science, Mathematics and Computing
Indian Institute of Technology, Guwahati

Advised by Prof. Anupam Saikia for my master's thesis on the Mathematics of Elliptic Curves and its Application in Cryptography

Courses: Algorithms, Logic Programming (Prolog, Introduction to AI), Probability Theory, Numerical Analysis, Optimization


Bachelor of Science, Mathematics and Computer Science
Chennai Mathematical Institute

Courses: Mathematical Logic, Algorithm Design, Game Theory, Theory of Computation, Programming

Awards
  • [2020] Became Kaggle Competition Expert (Global rank 1507)
  • [2019] Innovation award at Optum for invention disclosure filing
  • [2018] Mastermind award Q4-2018 at Optum for research initiatives
  • [2018] 3rd place in MindSpark, Annual Data Science Championship at Optum
  • [2016] Finalist, Data Science Game '16, Paris; Represented India (1 out of 3 teams), International Rank 14
  • [2009] 7-year KVPY fellowship for pursuing undergraduate and graduate research in basic sciences. Fellowship funded by the Department of Science and Technology (Govt. of India)

Invited Talks and Trainings
  • [2020] Conducted Clinical NLP specialized training for 200+ employees at Optum
  • [2019] Session at Optum on A Comprehensive Overview of Natural Language Processing in Healthcare industry
  • [2019] Student talk at IIIT Delhi (course taught by Prof. Tanmoy Chakraborty) on Probabilistic Generative Models
  • [2015] Research Scholar's Seminar at IIT Guwahati on Existence of Algebraic Closure of a field

Extra
  • [2019] Passed Grade 1 in Rock & Pop Bass, awarded by Trinity College London 🤘
  • [2019] Performed at Rockathon organized by Noida School of Rock 🤘🤘
  • [2016] Passed A1 in French by CdA Global Language Centre


Thanks to Jon Barron and Bodhisattwa P. Majumder for this nice template.