About Me

Hi there, my name is Arpan Mishra and I currently work as a Data Science Consultant at ZS Associates. I graduated in July 2021 with a Bachelor of Science in Statistics from KMC, University of Delhi.

I’m an Applied AI Engineer with 4.5 years of experience building production-grade LLM systems in regulated, high-stakes environments. I specialize in translating complex business problems into scalable AI architectures—spanning multi-agent pipelines, RAG-based retrieval, prompt engineering, and evaluation frameworks. I’m equally comfortable in deep technical implementation and in communicating architectural decisions to non-technical audiences.

If I’m not building things, I’m either at the gym, napping, or losing my mind over FIFA.

Experience

Here is my professional journey so far.


Data Science Consultant at ZS Associates
December 2025 – Present | Gurgaon, India

Clinical Trial Document Authoring Platform

  • Built and delivered a $5M enterprise-grade multi-agent document authoring system supporting 12 clinical trial document types for a Fortune 500 client—a production-ready platform that reduced manual drafting time and earned trust from cross-functional clinical and regulatory stakeholders.
  • Designed the core agent orchestration layer using a Planner-based framework with RAG-powered context grounding, enabling complex document workflows to execute with improved factual accuracy and compliance with regulatory standards.
  • Built scalable digitization and indexing pipelines across AWS and Azure, enabling structured ingestion of diverse clinical document formats—creating a reliable foundation for downstream AI reasoning.
  • Developed a centralized knowledge layer as a service (MCP server on AWS Agent Core) and established an LLM evaluation framework, ensuring high-quality, auditable AI outputs in a regulated clinical environment.

Clinical Query Analysis System

  • Designed and deployed a GenAI-powered clinical query intelligence system that reduced redundant queries by 20%, directly improving operational efficiency across clinical workflows.
  • Architected a multi-agent LLM framework comprising an Executor Agent, Redundancy Classification Agent, and Query Category Detection Agent—each with well-defined responsibilities for modular iteration and evaluation.
  • Processed and benchmarked insights from ~1.3M historical clinical queries, establishing a performance baseline and improving classification accuracy through targeted prompt optimization and model tuning.
  • Integrated the system into the live clinical query portal, enabling real-time redundancy detection and query-type tagging at the point of user input.

Data Science Associate Consultant at ZS Associates
December 2023 – December 2025 | Gurgaon, India

Safety Narrative Document Authoring

  • Developed an AI-powered pipeline to analyze SDTM and ADaM clinical trial datasets and automatically generate chronological safety narratives for patients.
  • Engineered an end-to-end ETL pipeline to transform structured clinical datasets into LLM-compatible formats for downstream reasoning and summarization.
  • Leveraged structured prompting techniques to enable temporal reasoning over patient safety events and produce coherent, medically aligned summaries.
  • Built a lightweight application to operationalize the pipeline, generating safety narratives for 600+ patients across 3 clinical trials.
  • Implemented automated Gantt chart visualizations from trial data to assist medical writers in validating event timelines.

Auto Document Redactor

  • Designed and deployed an automated PII redaction pipeline for confidential clinical documents, achieving 98% recall and 90% precision—meeting strict compliance requirements for regulated clinical environments.
  • Implemented a hybrid architecture combining rule-based heuristics and an ML-driven spaCy NER pipeline with chain-of-thought prompting, accurately identifying nuanced and context-sensitive PII entities.
  • Built evaluation benchmarks and validation workflows to ensure ongoing reliability and compliance, establishing a repeatable testing framework for the team.

Data Science Associate at ZS Associates
November 2021 – December 2023 | New Delhi, India

ISR Entity Detection Pipeline

  • Built an entity and relationship extraction pipeline from clinical protocol documents to support Industry Sponsored Research (ISR) decision-making—turning unstructured text into structured intelligence.
  • Extracted key entities (drug, dosage, cycle, endpoints, inclusion/exclusion criteria) using spaCy NER, custom fine-tuned entity recognition models, and XGBoost-based classifiers.
  • Enabled structured protocol intelligence to help stakeholders make data-driven research sponsorship decisions.

Patient Discontinuation Prediction

  • Developed an XGBoost-based predictive model to identify patients at risk of therapy discontinuation; engineered features from claims data, patient demographics, and census datasets, then selected them using recursive feature elimination (RFE) and forward/backward elimination.
  • Packaged the solution as a reusable internal asset, enabling client-specific model training and feature generation across engagements.

Research Intern at Inria, University of Lille
June 2021 – September 2021 | Lille, France

Recidivism Prediction — Mental Health & Suicide Risk

  • Worked with medical data for mental health patients with a history of suicide attempts.
  • Modeled the recurrence of suicide attempts using both parametric and non-parametric statistical methods, incorporating medical survey data from VigilanS and identifying factors affecting re-attempt probability.
  • Conducted spatial analysis of patient data and incorporated geostatistical spatial autocorrelation into the model to account for regional patterns.

Machine Learning Engineer (Part Time) at Omdena
August 2020 – February 2021 | Remote

  • Worked with satellite imagery and survey data from Census and DHS as part of a global team of 50 change makers.
  • Used Landsat 7 and 8 satellite images and census data to build a model predicting district-level census variables with a multi-modal, multi-task learning approach.
  • Used DHS data and Sentinel images to classify the Asset Wealth Index of clusters across India.
  • The project was hosted by the World Resources Institute (WRI) and falls under the UN’s Sustainable Development Goal 8 (Decent Work & Economic Growth).

Skills

Languages & Tools
Python, R, SQL, JavaScript, Docker, Streamlit, LangGraph, LangChain

LLM & Agent Systems
Multi-Agent Architectures, RAG, Prompt Engineering (Chain-of-Thought, Few-Shot, Structured), Agent Orchestration, Context Engineering, LLM Evaluation & Validation Frameworks, MCP Servers

Cloud & Infrastructure
AWS (SageMaker, Agent Runtime, Bedrock, EKS, DocumentDB), Azure Document Intelligence, Google Vertex AI

Domains
Clinical Trials, Healthcare AI, Regulated AI Deployment, Document Intelligence, Safety-Critical Systems

Projects

These are some of the personal projects that I have built in the past.


CalorieBot — Nutrition Tracking Agent (2025)

A LangGraph-powered 6-phase nutrition agent deployed as a WhatsApp bot. Users describe meals in natural text or voice and the agent runs item parsing → food search → macro scaling → diary logging → daily summary, end-to-end.

Rossmann Sales Prediction

Created a tool to predict the daily sales of any store in the Rossmann drugstore chain, the second-largest drugstore chain in Germany.

Sentiment Extraction using BERT

Used BERT to detect the sentiment of a given text and then extract the words that best convey the detected sentiment.

Generating Anime Synopsis using Deep Learning

I compared two techniques, LSTMs and a fine-tuned GPT-2, on their language modeling capabilities, and the results were astounding!

Global Suicide Analysis EDA

I analyzed global suicide data for 90+ countries from 1985 to 2015 in R, using various statistical tests and data visualization techniques to explain the data.

Text Analysis Webapp

This app offers anyone starting an NLP project a fast and convenient way to explore text data, cutting down the time between EDA and modeling.

Rubik’s Cube Rotation Prediction

Predicted the x-axis rotation of a given Rubik’s Cube using ResNet-50. This was part of the AI Blitz challenge, a hackathon hosted by AIcrowd.

Selfie Filter using CNN

I used a CNN architecture for facial keypoint detection and then used OpenCV to apply a sunglasses filter that works in real time with a webcam.

Blog

Here are a few of the blog posts I have written about machine learning, data science, and the projects I have built.


Faster Machine Learning Using Hub by Activeloop

A code walkthrough of using the hub package for satellite imagery

Let’s make some Anime using Deep Learning

Comparing text generation methods: LSTM vs GPT-2

Decoding Support Vector Machines

Intuitively understand how Support Vector Machines work

Predicting HR Attrition using Support Vector Machines

Learn to train an SVM model following best practices

Contact