Hi there, my name is Arpan Mishra and I currently work as a Data Science Consultant at ZS Associates. I graduated in July 2021 with a Bachelor of Science in Statistics from KMC, University of Delhi.
I’m an Applied GenAI Engineer with 4.5 years of experience architecting scalable multi-agent LLM systems, retrieval-augmented generation pipelines, and AI evaluation frameworks. I deliver production solutions in regulated domains, combining GenAI, structured data processing, and cloud infrastructure.
If I’m not coding, you can find me playing my ukulele or crushing someone on chess.com. Challenges are accepted ♚
Experience
This is how my professional journey has gone so far.
Data Science Consultant at ZS Associates
December 2025 – Present | Gurgaon, India
Clinical Trial Document Authoring Platform
Led development of an AI-powered platform supporting 12 clinical trial document types using a multi-agent LLM architecture, managing a team of 10 AI engineers and collaborating cross-functionally to drive production delivery.
Designed a Planner-based agent orchestration framework with RAG-powered context grounding to execute complex document drafting workflows with improved factual accuracy.
Built scalable digitization and indexing pipelines across AWS and Azure, enabling structured processing of diverse clinical document formats.
Developed a centralized knowledge layer (MCP server on AWS Agent Core) and established an LLM evaluation framework to ensure high-quality, transparent AI outputs in regulated environments.
Clinical Query Analysis System
Designed and deployed a GenAI-powered clinical query intelligence system that reduced redundant queries by 20%, improving operational efficiency across clinical workflows.
Architected a multi-agent LLM framework comprising an Executor Agent, Redundancy Classification Agent, and Query Category Detection Agent.
Processed and generated insights from ~1.3M historical clinical queries to benchmark performance and improve classification accuracy.
Integrated the system into the clinical query portal to enable real-time redundancy detection and query-type tagging during user input.
Data Science Associate Consultant at ZS Associates
December 2023 – December 2025 | Gurgaon, India
Safety Narrative Document Authoring
Developed an AI-powered pipeline to analyze SDTM and ADaM clinical trial datasets and automatically generate chronological safety narratives for patients.
Engineered an end-to-end ETL pipeline to transform structured clinical datasets into LLM-compatible formats for downstream reasoning and summarization.
Leveraged structured prompting techniques to enable temporal reasoning over patient safety events and produce coherent, medically aligned summaries.
Built a lightweight application to operationalize the pipeline, generating safety narratives for 600+ patients across 3 clinical trials.
Implemented automated Gantt chart visualizations from trial data to assist medical writers in validating event timelines.
Auto Document Redactor
Designed and deployed an automated document redaction pipeline to remove patient PII from confidential clinical documents, achieving 98% recall and 90% precision.
Applied chain-of-thought prompting and few-shot learning techniques to accurately identify nuanced and context-sensitive PII entities.
Hardened the redaction workflow by integrating a rule-based and ML-driven spaCy entity recognition pipeline to improve robustness and reduce false negatives.
Built evaluation benchmarks and validation workflows to ensure compliance and reliability in regulated clinical environments.
Data Science Associate at ZS Associates
November 2021 – December 2023 | New Delhi, India
ISR Entity Detection Pipeline
Built an entity and relationship extraction pipeline from clinical protocol documents to support Industry Sponsored Research (ISR) decision-making.
Extracted key entities including drug, dosage, cycle, endpoints, and inclusion/exclusion criteria using spaCy NER, custom fine-tuned entity recognition models, and XGBoost-based classifiers.
Enabled structured protocol intelligence to help stakeholders make data-driven research sponsorship decisions.
Patient Discontinuation Prediction
Developed an XGBoost-based predictive model to identify patients at risk of therapy discontinuation.
Engineered features from claims data, patient demographics, and census datasets, applying feature selection techniques such as RFE and forward/backward elimination.
Packaged the solution into a reusable internal asset for client-specific model training and feature generation.
Research Intern at Inria
June 2021 – September 2021 | Lille, France
Worked with medical data for mental health patients with a history of suicide attempts.
The objective was to model the recurrence of suicide attempts from demographic and medical survey data from VigilanS, using both parametric and non-parametric statistical methods.
Conducted spatial analysis of the patients and used geostatistical techniques to incorporate the effect of spatial autocorrelation into the model.
Machine Learning Engineer (Part Time) at Omdena
August 2020 – February 2021 | Remote
Worked with satellite imagery and survey data from Census and DHS as part of a global team of 50 change makers.
Used Landsat 7 & 8 Satellite Images and census data to create a model predicting district-level census variables using a multi-modal, multi-task learning approach.
Used DHS data and Sentinel images to classify the Asset Wealth Index of clusters across India.
This project was hosted by the World Resources Institute (WRI) and falls under the UN’s Sustainable Development Goal 8 (Decent Work & Economic Growth).
Cloud & Infrastructure
AWS (SageMaker, Agent Runtime, Bedrock Data Automation, EKS, DocumentDB), Azure Document Intelligence, Google Vertex AI
Python
R
SQL
NLP
Projects
These are some of the personal projects that I have built in the past.
Rossmann Sales Prediction
Created a tool to predict the daily sales of any store in the Rossmann drug store chain, the second-largest drug store chain in Germany.
Sentiment Extraction using BERT
Used BERT to detect the sentiment of a given text and extract the words that best convey the detected sentiment.
Generating Anime Synopsis using Deep Learning
I compared the language modeling capabilities of two techniques, LSTMs and a fine-tuned GPT2, and the results were astounding!
Global Suicide Analysis EDA
I analyzed global suicide data for 90+ countries from 1985 to 2015 in R.
Various statistical tests and data visualization techniques were used to explain the data.
Text Analysis Webapp
This app offers anyone starting an NLP project a fast and convenient way to explore text data, cutting down the time between EDA and modelling.
Rubik’s Cube Rotation Prediction
Predicted the X-axis rotation of a given Rubik’s cube using ResNet-50. This was part of the AI Blitz Challenge, a hackathon hosted by AIcrowd.
Selfie Filter using CNN
I used a CNN architecture for facial keypoint detection, then used OpenCV to apply a sunglasses filter that works in real time with a webcam.
Blog
Here are a few of the blog posts I have written about machine learning, data science, and the projects I have built.
Faster Machine Learning Using Hub by Activeloop
A code walkthrough of using the hub package for satellite imagery
Let’s make some Anime using Deep Learning
Comparing text generation methods: LSTM vs GPT2
Decoding Support Vector Machines
Intuitively understand how Support Vector Machines work
Predicting HR Attrition using Support Vector Machines
Learn to train an SVM model following best practices