Thomas Ajai

CS @ Columbia University · New York City, New York
tl3444@columbia.edu · GitHub · LinkedIn

Education

Columbia University

M.S. in Computer Science · GPA: 3.92 / 4.00

New York, NY · Aug 2025 – Dec 2026

Courses: Natural Language Processing, Analysis of Algorithms, ML for Functional Genomics, Distributed Systems

College of Engineering Trivandrum

B.Tech in Computer Science and Engineering · GPA: 9.52 / 10.00

Trivandrum, IN · Aug 2019 – Jun 2023

Courses: Algorithm Analysis and Design, Computational Theory, Database Systems

Experience

Data Engineer — Providence India

Jul 2023 – Jul 2025
  • Built and maintained end-to-end ETL pipelines using Alteryx and Azure Data Factory, integrating data from Epic, Cerner, and Athena EHR systems (53M+ records)
  • Migrated data workflows to Snowflake, reducing ETL runtime by 80% and improving data reliability for analytics teams
  • Developed Power BI dashboards tracking 25+ KPIs for clinical and financial data
  • Collaborated with cross-functional teams to design data models and automate data quality validation for production systems
  • Introduced Docker-based containerization for reproducible ETL environments and easier deployment across systems

NLP Developer Intern — Jiffy.ai

Oct 2020 – Jan 2021
  • Developed an NLP model (spaCy, NER) to extract key clauses from unstructured documents, improving efficiency by 30%
  • Designed and automated data preprocessing pipelines for 2,000+ legal and business documents
  • Communicated analytical insights to stakeholders, translating ML model performance into business value

Projects

Columbia University

Post-Training a Large Language Model for Mathematical Reasoning

PyTorch · LLaMA · LoRA · REINFORCE
  • Fine-tuned Meta LLaMA-3.2-1B on GSM8K using supervised LoRA fine-tuning and REINFORCE, improving grade-school math accuracy from ~3% to ~24%.
  • Designed robust evaluation pipelines for numeric answer extraction, tolerance-based answer matching, and automated benchmarking.
  • Implemented custom training loops, reward functions, prompt masking, and formatting-constrained generation using Hugging Face Transformers and PyTorch.

RAG-Based Document QA Assistant

LangChain · FAISS · OpenAI · Streamlit
  • Built a Retrieval-Augmented Generation (RAG) system enabling users to query unstructured internal PDF documents beyond LLM context window limits.
  • Designed ingestion pipelines using LangChain to chunk, embed, and index documents into a FAISS vector store for semantic retrieval.
  • Implemented a conversational retrieval chain to preserve multi-turn context by injecting chat history into each query.
  • Developed an interactive Streamlit chat UI and integrated OpenAI embeddings + generation for end-to-end document QA.

Ancient Egyptian → German Neural Machine Translation

Transformers · LoRA
  • Fine-tuned a Transformer-based model for Ancient Egyptian transliteration → German translation using LoRA on domain-specific corpora.
  • Extended tokenizer to support Gardiner sign codes and rare transliteration symbols, improving lexical coverage and training stability.
  • Achieved BLEU = 45.6 and chrF = 61.6 on held-out test sets; conducted ablations on tokenizer design and adapter rank.
  • Constructed end-to-end data pipelines for transliteration normalization, token alignment, and evaluation.

Neural Machine Translation with Transformers (French → English)

PyTorch · Transformers
  • Implemented a decoder-only Transformer from scratch in PyTorch with multi-head causal attention, residual connections, learned positional embeddings, and GPT-style weight tying.
  • Trained and evaluated on Multi30k, implementing masked cross-entropy loss, autoregressive decoding, and BLEU evaluation via SacreBLEU.
  • Fine-tuned Helsinki-NLP OpusMT and benchmarked against the custom Transformer baseline.

Fault-Tolerant MapReduce Framework

Go · RPC · Distributed Systems
  • Implemented a distributed MapReduce system in Go with a centralized master and parallel workers communicating via RPC.
  • Designed task scheduling and worker coordination to support concurrent map/reduce execution and dynamic worker availability.
  • Added fault handling to reassign tasks from failed workers, ensuring job completion under partial failures.

Replicated Key-Value Store with Raft Consensus

Go · Distributed Consensus
  • Built a replicated key-value store using Raft to provide linearizable Get/Put operations across multiple replicas.
  • Implemented leader election, log replication, and state machine application to tolerate crashes and network partitions.
  • Integrated snapshotting and log compaction to bound log growth and improve recovery performance.

Technical Skills

Languages: Python, Go, SQL, C++, JavaScript
Data Engineering: Alteryx, Azure Data Factory, Snowflake, Power BI, Pandas, NumPy
Tools: Docker, GitHub, Linux, VS Code, Jupyter Notebooks
Machine Learning: PyTorch, Transformers, LoRA, LLMs, NER, Prompt Engineering