Thomas Ajai

CS @ Columbia University · New York City, New York
tl3444@columbia.edu · GitHub · LinkedIn

Education

M.S. in Computer Science · GPA: 3.92 / 4.00

New York, NY · Aug 2025 – Dec 2026

Courses: Natural Language Processing, Analysis of Algorithms, ML for Functional Genomics, Distributed Systems

B.Tech in Computer Science and Engineering · GPA: 9.52 / 10.00

Trivandrum, IN · Aug 2019 – Jun 2023

Courses: Algorithm Analysis and Design, Computational Theory, Database Systems

Experience

Jul 2023 – Jul 2025

Built and maintained end-to-end ETL pipelines using Alteryx and Azure Data Factory, integrating data from Epic, Cerner, and Athena EHR systems (53M+ records).
Migrated data workflows to Snowflake, reducing ETL runtime by 80% and improving data reliability for analytics teams.
Developed Power BI dashboards tracking 25+ KPIs for clinical and financial data.
Collaborated with cross-functional teams to design data models and automate data quality validation for production systems.
Introduced Docker-based containerization for reproducible ETL environments and easier deployment across systems.

Oct 2020 – Jan 2021

Developed an NLP model (spaCy, NER) to extract key clauses from unstructured documents, enhancing efficiency by 30%.
Designed and automated data preprocessing pipelines for 2,000+ legal and business documents.
Communicated analytical insights to stakeholders, translating ML model performance into business value.

Projects

Columbia University

PyTorch · LLaMA · LoRA · REINFORCE

Fine-tuned Meta LLaMA-3.2-1B on GSM8K using supervised LoRA fine-tuning and REINFORCE, improving grade-school math accuracy from ~3% to ~24%.
Designed robust evaluation pipelines for numeric answer extraction, answer matching with numeric tolerance, and automated benchmarking.
Implemented custom training loops, reward functions, prompt masking, and formatting-constrained generation using Hugging Face Transformers and PyTorch.

LangChain · FAISS · OpenAI · Streamlit

Built a Retrieval-Augmented Generation (RAG) system enabling users to query unstructured internal PDF documents beyond LLM context window limits.
Designed ingestion pipelines using LangChain to chunk, embed, and index documents into a FAISS vector store for semantic retrieval.
Implemented a conversational retrieval chain to preserve multi-turn context by injecting chat history into each query.
Developed an interactive Streamlit chat UI and integrated OpenAI embeddings and generation for end-to-end document QA.

Transformers · LoRA

Fine-tuned a Transformer-based model for Ancient Egyptian transliteration → German translation using LoRA on domain-specific corpora.
Extended tokenizer to support Gardiner sign codes and rare transliteration symbols, improving lexical coverage and training stability.
Achieved BLEU = 45.6 and chrF = 61.6 on held-out test sets; conducted ablations on tokenizer design and adapter rank.
Constructed end-to-end data pipelines for transliteration normalization, token alignment, and evaluation.

PyTorch · Transformers

Implemented a decoder-only Transformer from scratch in PyTorch with multi-head causal attention, residual connections, learned positional embeddings, and GPT-style weight tying.
Trained and evaluated on Multi30k, implementing masked cross-entropy loss, autoregressive decoding, and BLEU evaluation via SacreBLEU.
Fine-tuned Helsinki-NLP OpusMT and benchmarked against the custom Transformer baseline.

Go · RPC · Distributed Systems

Implemented a distributed MapReduce system in Go with a centralized master and parallel workers communicating via RPC.
Designed task scheduling and worker coordination to support concurrent map/reduce execution and dynamic worker availability.
Added fault handling to reassign tasks from failed workers, ensuring job completion under partial failures.

Go · Distributed Consensus

Built a replicated key-value store using Raft to provide linearizable Get/Put operations across multiple replicas.
Implemented leader election, log replication, and state machine application to tolerate crashes and network partitions.
Integrated snapshotting and log compaction to bound log growth and improve recovery performance.

Skills

Languages: Python, Go, SQL, C++, JavaScript

Data Engineering: Alteryx, Azure Data Factory, Snowflake, Power BI, Pandas, NumPy

Tools: Docker, GitHub, Linux, VS Code, Jupyter Notebooks

Machine Learning: PyTorch, Transformers, LoRA, LLMs, NER, Prompt Engineering