ML Researcher & AI Engineer
Building at the frontier of large language models and agentic AI. MS Analytics student at Georgia Tech, with research spanning NLP, computer vision, and multimodal AI systems.
About
Hello — I'm Saehee [Say-Hee]. I'm an ML researcher and AI engineer with dual degrees in Artificial Intelligence and Linguistics from Seoul National University, currently pursuing my MS in Analytics at Georgia Tech. I'm conducting LLM training data attribution research at GT EI-Lab under Prof. Mark Riedl.
My work spans four research experiences — NLP at SNU SKI-ML Lab, Computer Vision at SNU MIPA Lab, Legal AI at LBOX, and LLM pre-training research at GT EI-Lab — alongside production ML deployments at enterprise companies like Salesforce. I'm building toward a research career at the intersection of LLM development, agentic workflows, and multimodal AI.
From improving LLM fluency by 23% through controlled text generation, to curating ~5B-token training corpora for LLM pre-training, to building adversarial defenses against deepfakes — I bridge rigorous research with deployable ML systems.
Projects
GT EI-Lab · Prof. Mark Riedl · Ongoing
Research on training data attribution for large language models. Enriched a pre-training corpus with 576 topic × format bins via stratified and proportional sampling, investigating data quality signals and provenance to improve LLM pre-training transparency.
~5B tokens curated · 576 topic × format bins
Research in progress ↗
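The binning step above can be sketched as a proportional stratified sampler; the bin labels, corpus layout, and quota rule here are illustrative stand-ins, not the lab's actual pipeline.

```python
import random
from collections import defaultdict

def proportional_stratified_sample(docs, budget, seed=0):
    """Sample `budget` docs, giving each (topic, format) bin a quota
    proportional to its share of the corpus."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for doc in docs:
        bins[(doc["topic"], doc["format"])].append(doc)
    total = len(docs)
    sample = []
    for members in bins.values():
        quota = max(1, round(budget * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Toy corpus: 100 docs in each of 3 topics x 2 formats = 6 bins.
corpus = [{"id": i, "topic": t, "format": f}
          for i, (t, f) in enumerate(
              (t, f) for t in ("news", "code", "forum")
              for f in ("article", "qa") for _ in range(100))]
subset = proportional_stratified_sample(corpus, budget=60)
```

Because every bin here is the same size, each receives an equal quota; skewed corpora get proportionally skewed quotas instead.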
GT Sonification Lab · Ongoing
Agentic AI framework with autonomous agents orchestrating multimodal data collection — audio, facial, and behavioral signals — combined with LLM-based clinical reasoning for continuous mental health monitoring.
Emotion AI · multimodal time series · autonomous agent coordination
Research in progress ↗
SNU SKI-ML Lab · Under revision
Co-authored research on energy model-based controlled text generation for black-box LLMs, handling multiple constraints simultaneously — toxicity, logical consistency, formality — while preserving fluency. Models were fine-tuned via SFT on synthetic datasets.
+23% BERTScore · −22% perplexity · +31% constraint satisfaction across LLaMA, Phi, Qwen
View overview ↗
SNU · B.A. Thesis · December 2024
Quantified the effects of syntactic complexity on LLM performance across QA, text completion, and EN–KO translation using 8 linguistically motivated metrics. Identified POS divergence as the key bottleneck and demonstrated prompt engineering strategies to mitigate degradation.
In-context learning, kNN example selection, instruction tuning
View paper ↗
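The kNN example-selection strategy mentioned above can be sketched as follows; the embeddings and example texts are toy values, not the thesis's actual data or encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_examples(query_vec, pool, k=2):
    """Pick the k pool items most similar to the query embedding;
    these become the in-context demonstrations in the prompt."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]),
                    reverse=True)
    return ranked[:k]

pool = [
    {"text": "short declarative", "vec": [1.0, 0.0, 0.1]},
    {"text": "nested relative clause", "vec": [0.1, 1.0, 0.9]},
    {"text": "coordinated clauses", "vec": [0.2, 0.9, 1.0]},
]
demos = knn_examples([0.0, 1.0, 1.0], pool, k=2)
```

Selecting demonstrations whose syntactic profile matches the query is one way such a retriever can mitigate complexity-driven degradation.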
SNU MIPA Lab · 2024
Adversarial defense system against diffusion-based deepfakes — injects imperceptible noise into training photos to degrade unauthorized AI-generated identity recreation and protect against identity misuse.
−2% generated image quality · −3% facial similarity
View research ↗
SNU · 2nd place / 40 teams
Two-stage detection pipeline combining object detection and OCR for intelligent transportation systems.
87.60% OCR accuracy · 0.72 mean IoU · 81% end-to-end detection
View project ↗
Georgia Tech · 2025
Hybrid search engine combining structured SQL queries with FAISS vector similarity search, enabling ANN-based retrieval and relational filtering on large-scale text corpora. Adaptive query strategies dynamically tune search scope based on filter selectivity.
View project ↗
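The adaptive query strategy can be illustrated with a minimal sketch: sqlite3 stands in for the relational store, and brute-force cosine similarity stands in for the FAISS index. The schema, threshold, and data are illustrative assumptions.

```python
import math
import sqlite3

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# Toy corpus: (id, year, embedding). In the real system the vectors
# would live in a FAISS index; a dict stands in here.
rows = [(1, 2020, [1.0, 0.0]), (2, 2021, [0.9, 0.1]),
        (3, 2021, [0.0, 1.0]), (4, 2022, [0.1, 0.9])]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, year INTEGER)")
db.executemany("INSERT INTO docs VALUES (?, ?)",
               [(i, y) for i, y, _ in rows])
vecs = {i: v for i, _, v in rows}

def hybrid_search(query_vec, year, k=2, selectivity_threshold=0.5):
    """Pre-filter via SQL when the predicate is selective; otherwise
    rank the whole collection and filter afterwards."""
    total = db.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    matching = [r[0] for r in
                db.execute("SELECT id FROM docs WHERE year = ?", (year,))]
    if len(matching) / total <= selectivity_threshold:
        candidates = matching          # selective: filter first, then rank
    else:
        candidates = list(vecs)        # broad: rank all, filter later
    ranked = sorted(candidates,
                    key=lambda i: cosine(query_vec, vecs[i]), reverse=True)
    keep = set(matching)
    return [i for i in ranked if i in keep][:k]

top = hybrid_search([1.0, 0.0], year=2021, k=1)
```

Both paths return the same results; the threshold only decides which ordering of filtering and ranking is cheaper for a given predicate.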
Hyundai Home Shopping · 2022
Ensemble model combining LightGBM gradient boosting and SASRec sequential recommendation to optimize customer targeting for marketing campaigns. Deployed on AWS for real-time predictions with sentiment analysis across 10K+ reviews.
+42% purchase rate · +14% campaign engagement lift
View project ↗
Growth Hackers · SNU · 2022
Hybrid CatBoost + LSTM forecasting for call center volume with an interactive Streamlit dashboard. Enabled managers to optimize workforce allocation and reduce operational costs.
MAE < 100 · 29% lower monthly error vs baseline
View demo ↗
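The hybrid design can be illustrated with a minimal forecast blend; the equal weighting and toy numbers below are stand-ins for the project's actual CatBoost and LSTM outputs.

```python
def blend_forecasts(f1, f2, w=0.5):
    """Weighted average of two model forecasts — one common way to
    combine a tree-based and a sequential forecaster."""
    return [w * a + (1 - w) * b for a, b in zip(f1, f2)]

def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

actual    = [120, 135, 128, 150]   # observed call volumes
tree_like = [118, 140, 125, 149]   # stand-in for the CatBoost forecast
seq_like  = [125, 130, 131, 152]   # stand-in for the LSTM forecast
combined  = blend_forecasts(tree_like, seq_like)
```

When the two models make partly uncorrelated errors, the blend's MAE can undercut either component's, which is the motivation for the hybrid.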
LBOX · EMNLP 2022 Workshop
South Korea's first legal data analytics platform — developed end-to-end, visualizing sentencing patterns and improving accessibility through information extraction pipelines.
4× engagement from non-legal users · 161% YoY sales increase
View details ↗
Georgia Tech · 2025
Web-based traffic prediction dashboard for Atlanta's I-285 with 330,000+ data points. Features historical/forecast mode switching and animated timeline visualization with geospatial rendering via Leaflet.js.
75% compression ratio · <5MB · 1–3s initial load
View dashboard ↗
Georgia Tech · 2025
SQL database engine built from scratch with a Lark-based parser, supporting DDL/DML operations. Integrated collaborative filtering for book recommendations.
<1s on 100K+ rows · +18% Precision@10 vs baseline
View code ↗
Experience
Research Intern · Advisor: Prof. Mark Riedl
Atlanta, GA
Data Scientist Intern · Industry GTM Team
Seoul, South Korea
Research Intern (NLP) · Advisor: Prof. Jay-yoon Lee
Seoul, South Korea
Research Intern (Computer Vision)
Seoul, South Korea
Data Scientist Intern
Seoul, South Korea
Data Science Project Lead
Seoul, South Korea
Skills
Publications
Locate & Edit: Text Editing-Based Controlled Text Generation for Black-Box LMs
Under Revision · July 2025
Energy model-based approach for controlled text generation handling multiple constraints simultaneously. +23% BERTScore, −22% perplexity across LLaMA, Phi, and Qwen.
Syntactic Complexity and Prompt Design: A Linguistic Analysis of LLM Performance
B.A. Thesis, Department of Linguistics, Seoul National University · December 2024
Operationalized syntactic complexity metrics to analyze LLM performance across QA, text completion, and EN–KO translation. Identified POS divergence as a key predictor of degradation.
Data-efficient End-to-end Information Extraction for Statistical Legal Analysis
EMNLP 2022 Workshop (NLLP) · December 2022
Data-efficient information extraction from legal documents, introducing novel legal ontologies, baseline models, and evaluation frameworks. Contributed to South Korea's first legal analytics platform.