Adversarial Robustness
Building AI systems that are robust to adversarial attacks and distribution shifts.
Overview
Modern vision-language models achieve remarkable performance on standard benchmarks but remain vulnerable to adversarial perturbations and distribution shifts. My research develops detection methods with provable guarantees and robust training procedures.
Current Projects
KoALA: KL-L0 Adversarial Detector via Label Agreement
| Status: Under review at ICLR 2026 | arXiv |
→ Key insight: Combining KL divergence with L0-based label agreement enables formal detection guarantees.
What we did:
- Designed a dual-head detector that flags inputs when its prediction heads disagree (see the sketch after this list)
- Established formal detection guarantees, backed by mathematical proofs
- Validated on ResNet-18 and CLIP (ViT-B/32) across standard adversarial benchmarks
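The disagreement test itself is easy to sketch. The snippet below is a minimal illustration, not the KoALA implementation: `backbone`, `head_a`, `head_b`, `flag_adversarial`, and `kl_threshold` are assumed placeholder names, and the paper's formal guarantees depend on how the heads and thresholds are actually constructed. It flags an input when the KL divergence between the two heads' output distributions is large or when their top-1 labels differ (the label-agreement check).

```python
# Hypothetical sketch of a dual-head disagreement detector.
# Module and parameter names are illustrative, not the KoALA code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def flag_adversarial(backbone, head_a, head_b, x, kl_threshold=0.5):
    """Flag inputs whose two prediction heads disagree.

    An input is flagged when KL(p_a || p_b) exceeds `kl_threshold`
    OR the heads' top-1 labels differ (the label-agreement check).
    """
    feats = backbone(x)                      # shared features, shape (B, D)
    p_a = F.softmax(head_a(feats), dim=-1)   # head A class probabilities
    p_b = F.softmax(head_b(feats), dim=-1)   # head B class probabilities

    # KL(p_a || p_b), summed over classes, one value per input
    kl = torch.sum(
        p_a * (torch.log(p_a + 1e-12) - torch.log(p_b + 1e-12)), dim=-1
    )

    # Label agreement: do the two heads predict the same class?
    labels_agree = p_a.argmax(dim=-1) == p_b.argmax(dim=-1)

    # Flag if either signal indicates disagreement
    return (kl > kl_threshold) | (~labels_agree)
```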
Why it matters: Most adversarial detection methods lack formal guarantees — they work empirically but can be bypassed by adaptive attacks. KoALA provides provable bounds on detection performance.
Research Questions I’m Exploring
- Certified robustness for VLMs — Can we extend randomized smoothing (sketched after this list) to vision-language models?
- Distribution shift detection — How do we distinguish adversarial inputs from natural distribution shifts?
- Robust fine-tuning — Does adversarial training transfer across modalities?
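For context on the first question: standard randomized smoothing certifies an image classifier by classifying many Gaussian-noised copies of an input, taking the majority vote, and converting the vote share into a certified L2 radius. The sketch below is a generic illustration under assumed names (`certify`, `model`, `sigma`, `n_samples` are placeholders), and a careful implementation would use a binomial lower confidence bound rather than the plug-in estimate; extending this recipe to open-ended vision-language outputs is exactly the open question.

```python
# Generic sketch of randomized smoothing certification for an image
# classifier; all names here are assumptions, not a specific codebase.
import torch
from scipy.stats import norm

@torch.no_grad()
def certify(model, x, sigma=0.25, n_samples=1000, num_classes=1000):
    """Certify a smoothed classifier's prediction on a single input x.

    Vote over Gaussian-noised copies of x; if the top class wins more
    than half the votes, return it with a certified L2 radius.
    """
    counts = torch.zeros(num_classes)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)          # Gaussian-noised copy
        pred = model(noisy.unsqueeze(0)).argmax(dim=-1).item()
        counts[pred] += 1

    top_class = counts.argmax().item()
    p_hat = counts[top_class].item() / n_samples         # plug-in vote share
    if p_hat <= 0.5:
        return None, 0.0                                  # abstain: no certificate
    radius = sigma * norm.ppf(p_hat)                      # certified L2 radius
    return top_class, radius
```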