DPO vs. SFT Fine-Tuning

March 4, 2026

When people say "fine-tuning," they usually mean changing a model’s behavior by training it on a curated dataset. There are two common ways to do that.

SFT (Supervised Fine-Tuning) = training on demonstrations. Data looks like prompt → ideal response, and the model learns to imitate the response. Best when you want the model to reliably produce a specific format: structured JSON, consistent tool calls, company-specific writing patterns.
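To make the prompt → ideal response shape concrete, here is a minimal sketch of what SFT records might look like on disk. The field names ("prompt", "response"), the example tasks, and the helper functions are assumptions chosen for illustration; real training libraries each define their own expected schema.

```python
import json

# Hypothetical SFT records: each pairs a prompt with ONE ideal response.
# Field names and contents are illustrative, not a library-mandated schema.
sft_records = [
    {
        "prompt": "Extract the city and date: 'Meet me in Oslo on 2026-03-04.'",
        "response": json.dumps({"city": "Oslo", "date": "2026-03-04"}),
    },
    {
        "prompt": "Extract the city and date: 'Flying to Lima on 2026-05-17.'",
        "response": json.dumps({"city": "Lima", "date": "2026-05-17"}),
    },
]


def is_valid_sft_record(record):
    """Return True if the record has non-empty string prompt and response."""
    return all(
        isinstance(record.get(key), str) and bool(record[key].strip())
        for key in ("prompt", "response")
    )


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    assert all(is_valid_sft_record(r) for r in sft_records)
    print(to_jsonl(sft_records))
```

Every response here follows the same JSON structure, which is the point: SFT works best when the dataset shows the target format consistently.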

DPO (Direct Preference Optimization) = training on preferences. Data looks like prompt + chosen response + rejected response. Best when the model can already generate reasonable candidates but you want it to consistently favor the better behavior: more concise, better tone, fewer refusals.
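As a rough illustration of the prompt + chosen + rejected shape and of what DPO optimizes, the sketch below shows one hypothetical preference record and the standard per-example DPO loss, computed from sequence log-probabilities under the model being trained (the policy) and a frozen reference model. The record contents, the function name dpo_loss, the log-probability values in the usage lines, and beta=0.1 are all assumptions for illustration; in practice a training library computes the log-probabilities for you.

```python
import math

# Hypothetical preference record: same prompt, two candidate responses.
dpo_record = {
    "prompt": "Summarize: The meeting moved from Tuesday to Thursday at 3pm.",
    "chosen": "The meeting is now Thursday at 3pm.",
    "rejected": (
        "Sure! Here is a summary of the text you provided: the meeting, "
        "which was originally on Tuesday, has been moved to Thursday at 3pm."
    ),
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each argument is the total log-probability of a response given the prompt.
    beta (assumed 0.1 here) scales how strongly the margin is rewarded.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Made-up log-probabilities, only to show the direction of the loss.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy == reference
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # policy favors chosen more
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.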

If you can write down the right answer, start with SFT. If you mostly know which answer is better, DPO is often a cleaner fit. A common production pattern: SFT to teach the target shape, then DPO to polish behavior.