ML Researcher & AI Engineer
Building at the frontier of large language models and agentic AI. MS Analytics student at Georgia Tech, with research spanning NLP, computer vision, and multimodal AI systems.
About
Hello — I'm Saehee [Say-Hee]. I'm an ML researcher and AI engineer with dual degrees in Artificial Intelligence and Linguistics from Seoul National University, currently pursuing my MS in Analytics at Georgia Tech. I'm conducting LLM training data attribution research at GT EI-Lab under Prof. Mark Riedl.
My work spans four research experiences — NLP at SNU SKI-ML Lab, Computer Vision at SNU MIPA Lab, Legal AI at LBOX, and LLM pre-training research at GT EI-Lab — alongside production ML deployments at enterprise companies like Salesforce. I'm building toward a research career at the intersection of LLM development, agentic workflows, and multimodal AI.
From improving LLM fluency by 23% through controlled text generation, to curating ~5B-token training corpora for LLM pre-training, to building adversarial defenses against deepfakes — I bridge rigorous research with deployable ML systems.
Projects
GT EI-Lab · Prof. Mark Riedl · Ongoing
Research on training data attribution for large language models. Enriched a pre-training corpus with 576 topic × format bins via stratified and proportional sampling, investigating data quality signals and provenance to improve LLM pre-training transparency.
~5B tokens curated · 576 topic × format bins
Research in progress ↗
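The binning step above can be sketched as a proportional stratified sampler; the bin labels, corpus layout, and quota rule here are illustrative stand-ins, not the lab's actual pipeline.

```python
import random
from collections import defaultdict

def proportional_stratified_sample(docs, budget, seed=0):
    """Sample `budget` docs, giving each (topic, format) bin a quota
    proportional to its share of the corpus."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for doc in docs:
        bins[(doc["topic"], doc["format"])].append(doc)
    total = len(docs)
    sample = []
    for members in bins.values():
        quota = max(1, round(budget * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Toy corpus: 100 docs in each of 3 topics x 2 formats = 6 bins.
corpus = [{"id": i, "topic": t, "format": f}
          for i, (t, f) in enumerate(
              (t, f) for t in ("news", "code", "forum")
              for f in ("article", "qa") for _ in range(100))]
subset = proportional_stratified_sample(corpus, budget=60)
```

Because every bin here is the same size, each receives an equal quota; skewed corpora get proportionally skewed quotas instead.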
GT Sonification Lab · Ongoing
Agentic AI framework with autonomous agents orchestrating multimodal data collection — audio, facial, and behavioral signals — combined with LLM-based clinical reasoning for continuous mental health monitoring.
Emotion AI · multimodal time series · autonomous agent coordination
Research in progress ↗
SNU SKI-ML Lab · Under revision
Co-authored research on energy model-based controlled text generation for black-box LLMs, handling multiple constraints simultaneously — toxicity, logical consistency, formality — while preserving fluency. Models were fine-tuned via SFT on synthetic datasets.
+23% BERTScore · −22% perplexity · +31% constraint satisfaction across LLaMA, Phi, Qwen
View overview ↗
SNU · B.A. Thesis · December 2024
Quantified the effects of syntactic complexity on LLM performance across QA, text completion, and EN–KO translation using 8 linguistically motivated metrics. Identified POS divergence as the key bottleneck and demonstrated prompt engineering strategies to mitigate degradation.
In-context learning, kNN example selection, instruction tuning
View paper ↗
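The kNN example-selection strategy mentioned above can be sketched as follows; the embeddings and example texts are toy values, not the thesis's actual data or encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_examples(query_vec, pool, k=2):
    """Pick the k pool items most similar to the query embedding;
    these become the in-context demonstrations in the prompt."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]),
                    reverse=True)
    return ranked[:k]

pool = [
    {"text": "short declarative", "vec": [1.0, 0.0, 0.1]},
    {"text": "nested relative clause", "vec": [0.1, 1.0, 0.9]},
    {"text": "coordinated clauses", "vec": [0.2, 0.9, 1.0]},
]
demos = knn_examples([0.0, 1.0, 1.0], pool, k=2)
```

Selecting demonstrations whose syntactic profile matches the query is one way such a retriever can mitigate complexity-driven degradation.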
SNU MIPA Lab · 2024
Adversarial defense system against diffusion-based deepfakes — injects imperceptible noise into training photos to degrade unauthorized AI-generated identity recreation and protect against identity misuse.
−2% generated image quality · −3% facial similarity
View research ↗
SNU · 2nd place / 40 teams
Two-stage detection pipeline combining object detection and OCR for intelligent transportation systems.
87.60% OCR accuracy · 0.72 mean IoU · 81% end-to-end detection
View project ↗
Georgia Tech · 2025
Hybrid search engine combining structured SQL queries with FAISS vector similarity search, enabling ANN-based retrieval and relational filtering on large-scale text corpora. Adaptive query strategies dynamically tune search scope based on filter selectivity.
View project ↗
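The adaptive query strategy can be illustrated with a minimal sketch: sqlite3 stands in for the relational store, and brute-force cosine similarity stands in for the FAISS index. The schema, threshold, and data are illustrative assumptions.

```python
import math
import sqlite3

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# Toy corpus: (id, year, embedding). In the real system the vectors
# would live in a FAISS index; a dict stands in here.
rows = [(1, 2020, [1.0, 0.0]), (2, 2021, [0.9, 0.1]),
        (3, 2021, [0.0, 1.0]), (4, 2022, [0.1, 0.9])]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, year INTEGER)")
db.executemany("INSERT INTO docs VALUES (?, ?)",
               [(i, y) for i, y, _ in rows])
vecs = {i: v for i, _, v in rows}

def hybrid_search(query_vec, year, k=2, selectivity_threshold=0.5):
    """Pre-filter via SQL when the predicate is selective; otherwise
    rank the whole collection and filter afterwards."""
    total = db.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    matching = [r[0] for r in
                db.execute("SELECT id FROM docs WHERE year = ?", (year,))]
    if len(matching) / total <= selectivity_threshold:
        candidates = matching          # selective: filter first, then rank
    else:
        candidates = list(vecs)        # broad: rank all, filter later
    ranked = sorted(candidates,
                    key=lambda i: cosine(query_vec, vecs[i]), reverse=True)
    keep = set(matching)
    return [i for i in ranked if i in keep][:k]

top = hybrid_search([1.0, 0.0], year=2021, k=1)
```

Both paths return the same results; the threshold only decides which ordering of filtering and ranking is cheaper for a given predicate.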
Hyundai Home Shopping · 2022
Ensemble model combining LightGBM gradient boosting and SASRec sequential recommendation to optimize customer targeting for marketing campaigns. Deployed on AWS for real-time predictions with sentiment analysis across 10K+ reviews.
+42% purchase rate · +14% campaign engagement lift
View project ↗
Growth Hackers · SNU · 2022
Hybrid CatBoost + LSTM forecasting for call center volume with an interactive Streamlit dashboard. Enabled managers to optimize workforce allocation and reduce operational costs.
MAE < 100 · 29% lower monthly error vs baseline
View demo ↗
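The hybrid design can be illustrated with a minimal forecast blend; the equal weighting and toy numbers below are stand-ins for the project's actual CatBoost and LSTM outputs.

```python
def blend_forecasts(f1, f2, w=0.5):
    """Weighted average of two model forecasts — one common way to
    combine a tree-based and a sequential forecaster."""
    return [w * a + (1 - w) * b for a, b in zip(f1, f2)]

def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

actual    = [120, 135, 128, 150]   # observed call volumes
tree_like = [118, 140, 125, 149]   # stand-in for the CatBoost forecast
seq_like  = [125, 130, 131, 152]   # stand-in for the LSTM forecast
combined  = blend_forecasts(tree_like, seq_like)
```

When the two models make partly uncorrelated errors, the blend's MAE can undercut either component's, which is the motivation for the hybrid.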
LBOX · EMNLP 2022 Workshop
South Korea's first legal data analytics platform — developed end-to-end, visualizing sentencing patterns and improving accessibility through information extraction pipelines.
4× engagement from non-legal users · 161% YoY sales increase
View details ↗
Georgia Tech · 2025
Web-based traffic prediction dashboard for Atlanta's I-285 with 330,000+ data points. Features historical/forecast mode switching and animated timeline visualization with geospatial rendering via Leaflet.js.
75% compression ratio · <5MB · 1–3s initial load
View dashboard ↗
Georgia Tech · 2025
SQL database engine built from scratch with a Lark-based parser, supporting DDL/DML operations. Integrated collaborative filtering for book recommendations.
<1s on 100K+ rows · +18% Precision@10 vs baseline
View code ↗
Experience
Research Intern · Advisor: Prof. Mark Riedl
Atlanta, GA
Data Scientist Intern · Industry GTM Team
Seoul, South Korea
Research Intern (NLP) · Advisor: Prof. Jay-yoon Lee
Seoul, South Korea
Research Intern (Computer Vision)
Seoul, South Korea
Data Scientist Intern
Seoul, South Korea
Data Science Project Lead
Seoul, South Korea
Skills
Publications
Locate & Edit: Text Editing-Based Controlled Text Generation for Black-Box LMs
Under Revision · July 2025
Energy model-based approach for controlled text generation handling multiple constraints simultaneously. +23% BERTScore, −22% perplexity across LLaMA, Phi, and Qwen.
Syntactic Complexity and Prompt Design: A Linguistic Analysis of LLM Performance
B.A. Thesis, Department of Linguistics, Seoul National University · December 2024
Operationalized syntactic complexity metrics to analyze LLM performance across QA, text completion, and EN–KO translation. Identified POS divergence as a key predictor of degradation.
Data-efficient End-to-end Information Extraction for Statistical Legal Analysis
EMNLP 2022 Workshop (NLLP) · December 2022
Data-efficient information extraction from legal documents, introducing novel legal ontologies, baseline models, and evaluation frameworks. Contributed to South Korea's first legal analytics platform.