Saehee Eom

ML Researcher & AI Engineer

Building at the frontier of large language models and agentic AI. MS Analytics student at Georgia Tech, with research spanning NLP, computer vision, and multimodal AI systems.


Hello — I'm Saehee [Say-Hee]. I'm an ML researcher and AI engineer with dual degrees in Artificial Intelligence and Linguistics from Seoul National University, currently pursuing my MS in Analytics at Georgia Tech. I'm conducting LLM training data attribution research at GT EI-Lab under Prof. Mark Riedl.

My work spans four research experiences — NLP at SNU SKI-ML Lab, Computer Vision at SNU MIPA Lab, Legal AI at LBOX, and LLM pre-training research at GT EI-Lab — alongside production ML deployments at enterprise companies like Salesforce. I'm building toward a research career at the intersection of LLM development, agentic workflows, and multimodal AI.

From improving LLM fluency by 23% through controlled text generation, to curating ~5B-token training corpora for LLM pre-training, to building adversarial defenses against deepfakes — I bridge rigorous research with deployable ML systems.

3 Published papers
4 Research roles
90% Manual reporting reduced
~5B Tokens curated
🔬 01
LLM Training Data Attribution AI Research

GT EI-Lab · Prof. Mark Riedl · Ongoing

Research on training data attribution for large language models. Enriched a pre-training corpus with 576 topic × format bins via stratified and proportional sampling, investigating data quality signals and provenance to improve LLM pre-training transparency.

~5B tokens curated · 576 topic × format bins
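A minimal sketch of proportional stratified sampling over topic × format bins, in the spirit of the curation step above — the bin labels, budget, and field names are illustrative, not the lab's actual pipeline:

```python
import random
from collections import defaultdict

def proportional_sample(docs, budget, seed=0):
    """Sample documents so each (topic, format) bin keeps roughly its
    original share of the corpus (proportional allocation)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for doc in docs:
        bins[(doc["topic"], doc["format"])].append(doc)
    total = len(docs)
    sample = []
    for members in bins.values():
        # Allocate budget proportionally, keeping at least one doc per bin
        k = max(1, round(budget * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Toy corpus: 100 docs across three (topic, format) bins
corpus = [{"topic": t, "format": f, "id": i}
          for i, (t, f) in enumerate([("sci", "web"), ("sci", "web"),
                                      ("law", "pdf"), ("news", "web")] * 25)]
subset = proportional_sample(corpus, budget=20)
```

With a 50/25/25 bin split and a budget of 20, the subset keeps the same proportions (10/5/5).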

LLMs Pre-training Data Curation NLP Research
Research in progress ↗
🤖 02
Agentic Multimodal Monitoring for Mental Health AI Research

GT Sonification Lab · Ongoing

Agentic AI framework with autonomous agents orchestrating multimodal data collection — audio, facial, and behavioral signals — combined with LLM-based clinical reasoning for continuous mental health monitoring.

Emotion AI · multimodal time series · autonomous agent coordination

Agentic AI Multimodal LLMs Emotion AI
Research in progress ↗
✍️ 03
Controlled Text Generation AI Research

SNU SKI-ML Lab · Under revision

Co-authored research on energy model-based controlled text generation for black-box LLMs, handling multiple constraints simultaneously — toxicity, logical consistency, formality — while preserving fluency. Models fine-tuned via SFT on synthetic datasets.

+23% BERTScore · −22% perplexity · +31% constraint satisfaction across LLaMA, Phi, Qwen

PyTorch LLMs NLP SFT
View overview ↗
🧠 04
Syntactic Complexity & LLM Performance AI Research

SNU · B.A. Thesis · December 2024

Quantified the effects of syntactic complexity on LLM performance across QA, text completion, and EN–KO translation using 8 linguistically motivated metrics. Identified POS divergence as the key bottleneck and demonstrated prompt engineering strategies to mitigate degradation.

In-context learning, kNN example selection, instruction tuning
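One way to quantify the POS divergence mentioned above is Jensen–Shannon divergence between POS-tag distributions — a hypothetical sketch, not necessarily the metric used in the thesis; the toy tag sequences stand in for real tagger output:

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two tag distributions."""
    tags = set(p) | set(q)
    p = {t: p.get(t, 0.0) for t in tags}
    q = {t: q.get(t, 0.0) for t in tags}
    m = {t: (p[t] + q[t]) / 2 for t in tags}  # mixture distribution

    def kl(a, b):
        return sum(a[t] * math.log2(a[t] / b[t]) for t in tags if a[t] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pos_distribution(tags):
    counts = Counter(tags)
    n = len(tags)
    return {t: c / n for t, c in counts.items()}

# Toy POS sequences standing in for tagger output on input vs. model output
prompt_pos = ["DET", "NOUN", "VERB", "DET", "NOUN"]
output_pos = ["DET", "NOUN", "VERB", "ADV", "VERB"]
d = js_divergence(pos_distribution(prompt_pos), pos_distribution(output_pos))
```

JS divergence is symmetric and bounded in [0, 1] with log base 2, which makes it convenient for comparing distributions of different support.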

Linguistics LLMs Prompt Engineering
View paper ↗
🛡️ 05
GuardianDream: Adversarial Defense AI Research

SNU MIPA Lab · 2024

Adversarial defense system against diffusion-based deepfakes — injects imperceptible noise into training photos to degrade unauthorized AI-generated identity recreation and protect against identity misuse.

−2% generated image quality · −3% facial similarity

PyTorch Diffusion Models Computer Vision
View research ↗
🚗 06
Korean License Plate Detection AI Research

SNU · 2nd place / 40 teams

Two-stage detection pipeline combining object detection and OCR for intelligent transportation systems. Placed 2nd out of 40 teams in the competition.

87.60% OCR accuracy · 0.72 mean IoU · 81% end-to-end detection

Faster R-CNN EasyOCR Computer Vision
View project ↗
🔍 07
Hybrid Vector Search System ML Systems

GT Georgia Tech · 2025

Hybrid search engine combining structured SQL queries with FAISS vector similarity search, enabling ANN-based retrieval and relational filtering on large-scale text corpora. Adaptive query strategies dynamically tune search scope based on filter selectivity.
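A minimal sketch of the adaptive idea — when the relational filter is highly selective, scan the few surviving rows exactly; otherwise rank by vector similarity first and post-filter. The threshold, row schema, and in-memory scan (standing in for a FAISS index and SQL backend) are all illustrative:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hybrid_search(query_vec, rows, predicate, k=2, selectivity_threshold=0.1):
    """Return top-k rows by similarity among rows passing the predicate,
    choosing filter-first or rank-first based on filter selectivity."""
    matches = [r for r in rows if predicate(r)]
    if len(matches) / len(rows) <= selectivity_threshold:
        # Selective filter: exact scan over the small filtered set
        candidates = matches
    else:
        # Broad filter: rank everything first (stand-in for an ANN index),
        # then keep only rows passing the predicate
        ranked = sorted(rows, key=lambda r: -dot(query_vec, r["vec"]))
        candidates = [r for r in ranked if predicate(r)]
    return sorted(candidates, key=lambda r: -dot(query_vec, r["vec"]))[:k]

# Toy corpus: 50 rows with a year column and a 3-d embedding
rows = [{"id": i, "year": 2020 + i % 5, "vec": [float(i % 3), 1.0, float(-(i % 2))]}
        for i in range(50)]
top = hybrid_search([1.0, 0.5, 0.0], rows, lambda r: r["year"] == 2024, k=2)
```

In a real system the rank-first branch would over-fetch from the ANN index to compensate for candidates dropped by the post-filter.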

FAISS IVF-PQ Python SQL
View project ↗
🎯 08
CRM Targeting Ensemble ML Systems

Hyundai Home Shopping · 2022

Ensemble model combining LightGBM gradient boosting and SASRec sequential recommendation to optimize customer targeting for marketing campaigns. Deployed on AWS for real-time predictions with sentiment analysis across 10K+ reviews.

+42% purchase rate · +14% campaign engagement lift
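At its simplest, an ensemble like this blends each customer's tabular propensity score with their sequential-recommendation score — a hedged sketch with an illustrative blend weight, not the deployed model:

```python
def blend_scores(gbm_scores, seq_scores, alpha=0.6):
    """Blend per-customer purchase-propensity scores from a tabular model
    (e.g. LightGBM) with sequential-recommendation scores (e.g. SASRec).
    alpha weights the tabular model; inputs map customer_id -> score."""
    customers = set(gbm_scores) | set(seq_scores)
    return {c: alpha * gbm_scores.get(c, 0.0) + (1 - alpha) * seq_scores.get(c, 0.0)
            for c in customers}

def top_targets(blended, n):
    """Pick the n highest-scoring customers for the campaign."""
    return [c for c, _ in sorted(blended.items(), key=lambda kv: -kv[1])[:n]]

gbm = {"c1": 0.9, "c2": 0.4, "c3": 0.2}
seq = {"c1": 0.1, "c2": 0.8, "c3": 0.3}
targets = top_targets(blend_scores(gbm, seq), n=2)
```

In practice the blend weight would be tuned on a validation split against the campaign metric rather than fixed.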

LightGBM SASRec AWS
View project ↗
📊 09
Call Volume Forecasting ML Systems

Growth Hackers · SNU · 2022

Hybrid CatBoost + LSTM forecasting for call center volume with an interactive Streamlit dashboard. Enabled managers to optimize workforce allocation and reduce operational costs.

MAE < 100 · 29% lower monthly error vs baseline

CatBoost LSTM Streamlit
View demo ↗
⚖️ 10
Legal Analytics Platform Data Engineering

LBOX · EMNLP 2022 Workshop

South Korea's first legal data analytics platform — developed end-to-end, visualizing sentencing patterns and improving access to case law through information extraction pipelines.

4× engagement from non-legal users · 161% YoY sales increase

Python NLP IE SQL
View details ↗
🗺️ 11
Atlanta Traffic Visualization Dashboard Data Engineering

GT Georgia Tech · 2025

Web-based traffic prediction dashboard for Atlanta's I-285 with 330,000+ data points. Features historical/forecast mode switching and animated timeline visualization with geospatial rendering via Leaflet.js.

75% compression ratio · <5MB payload · 1–3s initial load

Python Leaflet.js Time Series
View dashboard ↗
🗄️ 12
SQL Database Engine & Recommender Data Engineering

GT Georgia Tech · 2025

SQL database engine built from scratch with a Lark-based parser, supporting DDL/DML operations. Integrated collaborative filtering for book recommendations.

<1s on 100K+ rows · +18% Precision@10 vs baseline

Python SQLite Lark
View code ↗
GT EI-Lab · NLP Research Lab
Feb 2026 – Present

Research Intern · Advisor: Prof. Mark Riedl

Atlanta, GA

  • LLM training data attribution research; enriched pre-training corpus with 576 topic × format bins via stratified and proportional sampling (~5B tokens)
  • Investigating data quality signals and provenance to improve LLM pre-training transparency and reproducibility
Salesforce
Jan – Jun 2025

Data Scientist Intern · Industry GTM Team

Seoul, South Korea

  • Designed scoring algorithm and recommender system to predict enterprise SaaS opportunity close rates across 50+ industries, informing GTM strategy
  • Engineered automated Snowflake-Tableau ETL pipelines with RBAC dashboards, reducing manual reporting by 90%
  • Collaborated with global stakeholders (Korea, USA, Japan) to unify data definitions
SNU SKI-ML Lab
Jun 2024 – Jan 2025

Research Intern (NLP) · Advisor: Prof. Jay-yoon Lee

Seoul, South Korea

  • Co-authored paper on Controlled Text Generation; +23% BERTScore, −22% perplexity across LLaMA, Phi, Qwen
  • Fine-tuned LLMs via SFT with ChatGPT-generated synthetic datasets; +31% constraint satisfaction
  • Built scalable LLM evaluation framework; led journal clubs on RAG, autonomous agents, quantization
SNU MIPA Lab
Jan – Feb 2024

Research Intern (Computer Vision)

Seoul, South Korea

  • Developed GuardianDream — adversarial defense against diffusion-based deepfakes
  • Led journal club on GANs, Diffusion Models, VLMs, LoRA, and SOTA generative AI
LBOX · Series C LegalTech
Jul 2022 – Jan 2023

Data Scientist Intern

Seoul, South Korea

  • Co-authored EMNLP 2022 workshop paper on IE for legal documents; built baseline models, evaluation frameworks, and legal ontologies
  • Built 200+ SQL/Redash dashboards; drove 161% YoY sales growth
  • Developed South Korea's first legal analytics platform — 4× higher engagement from non-experts
Growth Hackers · SNU Data Science Club
Jul 2021 – Jun 2022

Data Science Project Lead

Seoul, South Korea

  • Led industry-sponsored data science consulting projects with measurable business outcomes
  • Delivered CRM targeting and call center forecasting solutions
Programming & ML Python · PyTorch · TensorFlow · Scikit-Learn · PySpark · C++ · Java · R
LLMs & Generative AI Hugging Face · vLLM · LangChain · OpenAI API · Transformers · FAISS · RAG · Agentic AI
AI & Model Ops Fine-tuning (SFT) · Quantization · Synthetic Data · Model Evaluation · llama.cpp · Prompt Engineering
Data Systems SQL · BigQuery · MySQL · Snowflake · Spark · Databricks · Hadoop
Cloud & Tools AWS · GCP · Docker · Kubernetes · Git · Tableau · Streamlit
[1]

Locate & Edit: Text Editing-Based Controlled Text Generation for Black-Box LMs

H. R. Son, Saehee Eom, et al.

Under Revision · July 2025

Energy model-based approach for controlled text generation handling multiple constraints simultaneously. +23% BERTScore, −22% perplexity across LLaMA, Phi, and Qwen.

[2]

Syntactic Complexity and Prompt Design: A Linguistic Analysis of LLM Performance

Saehee Eom

B.A. Thesis, Department of Linguistics, Seoul National University · December 2024

Operationalized syntactic complexity metrics to analyze LLM performance across QA, text completion, and EN–KO translation. Identified POS divergence as a key predictor of degradation.

[3]

Data-efficient End-to-end Information Extraction for Statistical Legal Analysis

W. Hwang, Saehee Eom, et al.

EMNLP 2022 Workshop (NLLP) · December 2022

Data-efficient IE from legal documents including novel legal ontologies, baseline models, and evaluation frameworks. Contributed to South Korea's first legal analytics platform.