Thomas Ajai
CS @ Columbia University · New York City, New York
tl3444@columbia.edu
GitHub
LinkedIn
Education
Columbia University
M.S. in Computer Science · GPA: 3.92 / 4.00
Courses: Natural Language Processing, Analysis of Algorithms, ML for Functional Genomics, Distributed Systems
College of Engineering Trivandrum
B.Tech in Computer Science and Engineering · GPA: 9.52 / 10.00
Courses: Algorithm Analysis and Design, Computational Theory, Database Systems
Experience
Data Engineer — Providence India
Jul 2023 – Jul 2025
- Built and maintained end-to-end ETL pipelines using Alteryx and Azure Data Factory, integrating data from Epic, Cerner, and Athena EHR systems (53M+ records)
- Migrated data workflows to Snowflake, reducing ETL runtime by 80% and improving data reliability for analytics teams
- Developed Power BI dashboards tracking 25+ KPIs for clinical and financial data
- Collaborated with cross-functional teams to design data models and automate data quality validation for production systems
- Introduced Docker-based containerization for reproducible ETL environments and easier deployment across systems
NLP Developer Intern — Jiffy.ai
Oct 2020 – Jan 2021
- Developed an NLP model (SpaCy, NER) to extract key clauses from unstructured documents, enhancing efficiency by 30%
- Designed and automated data preprocessing pipelines for 2,000+ legal and business documents
- Communicated analytical insights to stakeholders, translating ML model performance into business value
Projects
Columbia University
Post-Training a Large Language Model for Mathematical Reasoning
PyTorch · LLaMA · LoRA · REINFORCE
- Implemented and fine-tuned Meta LLaMA-3.2-1B on GSM8K using supervised LoRA fine-tuning and REINFORCE, improving grade-school math accuracy from ~3% to ~24%.
- Designed robust evaluation pipelines for numeric answer extraction, tolerance-based exact-match scoring, and automated benchmarking.
- Implemented custom training loops, reward functions, prompt masking, and formatting-constrained generation using Hugging Face Transformers and PyTorch.
RAG-Based Document QA Assistant
LangChain · FAISS · OpenAI · Streamlit
- Built a Retrieval-Augmented Generation (RAG) system enabling users to query unstructured internal PDF documents beyond LLM context window limits.
- Designed ingestion pipelines using LangChain to chunk, embed, and index documents into a FAISS vector store for semantic retrieval.
- Implemented a conversational retrieval chain to preserve multi-turn context by injecting chat history into each query.
- Developed an interactive Streamlit chat UI and integrated OpenAI embeddings + generation for end-to-end document QA.
Ancient Egyptian → German Neural Machine Translation
Transformers · LoRA
- Fine-tuned a Transformer-based model for Ancient Egyptian transliteration → German translation using LoRA on domain-specific corpora.
- Extended tokenizer to support Gardiner sign codes and rare transliteration symbols, improving lexical coverage and training stability.
- Achieved BLEU = 45.6 and chrF = 61.6 on held-out test sets; conducted ablations on tokenizer design and adapter rank.
- Constructed end-to-end data pipelines for transliteration normalization, token alignment, and evaluation.
Neural Machine Translation with Transformers (French → English)
PyTorch · Transformers
- Implemented a decoder-only Transformer from scratch in PyTorch with multi-head causal attention, residual connections, learned positional embeddings, and GPT-style weight tying.
- Trained and evaluated on Multi30k, implementing masked cross-entropy loss, autoregressive decoding, and BLEU evaluation via SacreBLEU.
- Fine-tuned Helsinki-NLP OpusMT and benchmarked against the custom Transformer baseline.
Fault-Tolerant MapReduce Framework
Go · RPC · Distributed Systems
- Implemented a distributed MapReduce system in Go with a centralized master and parallel workers communicating via RPC.
- Designed task scheduling and worker coordination to support concurrent map/reduce execution and dynamic worker availability.
- Added fault handling to reassign tasks from failed workers, ensuring job completion under partial failures.
Replicated Key-Value Store with Raft Consensus
Go · Distributed Consensus
- Built a replicated key-value store using Raft to provide linearizable Get/Put operations across multiple replicas.
- Implemented leader election, log replication, and state machine application to tolerate crashes and network partitions.
- Integrated snapshotting and log compaction to bound log growth and improve recovery performance.
Technical Skills
Languages:
Python, Go, SQL, C++, JavaScript
Data Engineering:
Alteryx, Azure Data Factory, Snowflake, Power BI, Pandas, NumPy
Tools:
Docker, GitHub, Linux, VS Code, Jupyter Notebooks
Machine Learning:
PyTorch, Transformers, LoRA, LLMs, NER, Prompt Engineering