1. Understand Fine-tuning
Fine-tuning an LLM customizes its behavior, injects + enhances its knowledge, and optimizes its performance for specific domains or tasks. For example:
- GPT-4 serves as a base model; OpenAI fine-tuned it to better comprehend instructions and prompts, which led to the ChatGPT-4 that everyone uses today.
- DeepSeek-R1-Distill-Llama-8B is a fine-tuned version of Llama-3.1-8B. DeepSeek used data generated by DeepSeek-R1 to fine-tune Llama-3.1-8B. This process, known as distillation (a subcategory of fine-tuning), transfers the teacher model's reasoning capabilities into the Llama model.
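At the data level, distillation of this kind amounts to collecting the teacher model's outputs and training the student on them like any other fine-tuning dataset. Below is a minimal, hypothetical sketch of that preparation step: the `teacher_samples` contents, the `<think>` trace format, and the `distill_data.jsonl` filename are illustrative assumptions, not DeepSeek's actual pipeline.

```python
import json

# Hypothetical teacher outputs: in practice these would be reasoning traces
# generated at scale by a large model such as DeepSeek-R1.
teacher_samples = [
    {"prompt": "What is 12 * 8?",
     "response": "<think>12 * 8 = 96</think> The answer is 96."},
    {"prompt": "Is 17 prime?",
     "response": "<think>17 is divisible only by 1 and 17.</think> Yes, 17 is prime."},
]

def to_chat_format(sample):
    """Convert one teacher sample into a chat-style training record."""
    return {"messages": [
        {"role": "user", "content": sample["prompt"]},
        {"role": "assistant", "content": sample["response"]},
    ]}

# Write one JSON object per line, the format most fine-tuning trainers accept.
records = [to_chat_format(s) for s in teacher_samples]
with open("distill_data.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

The student model (here, Llama-3.1-8B) is then fine-tuned on this dataset exactly as it would be on human-written data; the reasoning behavior comes from the teacher's traces.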
With Unsloth, you can fine-tune for free on Colab, Kaggle, or locally with just 3GB VRAM by using our notebooks. By fine-tuning a pre-trained model (e.g. Llama-3.1-8B) on a specialized dataset, you can:
- Update + Learn New Knowledge: Inject and learn new domain-specific information.
- Customize Behavior: Adjust the model’s tone, personality, or response style.
- Optimize for Tasks: Improve accuracy and relevance for specific use cases.
Example use cases:
- Train an LLM to predict whether a headline impacts a company positively or negatively.
- Use historical customer interactions for more accurate and custom responses.
- Fine-tune LLM on legal texts for contract analysis, case law research, and compliance.
You can think of a fine-tuned model as a specialized agent designed to do specific tasks more effectively and efficiently. Fine-tuning can replicate all of RAG's capabilities, but not vice versa.
Fine-tuning misconceptions:
You may have heard that fine-tuning does not make a model learn new knowledge, or that RAG performs better than fine-tuning. Both claims are false. Read more FAQ + misconceptions here:
🤔FAQ + Is Fine-tuning Right For Me?
2. Choose the Right Model + Method
If you're a beginner, it is best to start with a small instruct model like Llama 3.1 (8B) and experiment from there. You'll also need to decide between QLoRA and LoRA training:
- LoRA: Fine-tunes small, trainable low-rank matrices in 16-bit without updating all model weights.
- QLoRA: Combines LoRA with 4-bit quantization to handle very large models with minimal resources.
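To make the LoRA idea concrete, here is a minimal numpy sketch (not Unsloth's implementation): the frozen weight W is left untouched, and only two small matrices A and B of rank r are trained, scaled by alpha/r. The sizes and the alpha value are illustrative assumptions; QLoRA additionally stores W in 4-bit, which this sketch does not show.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained weight (illustrative size; real layers are far larger).
d_out, d_in, r = 64, 64, 8            # r is the LoRA rank, r << d
W = rng.standard_normal((d_out, d_in))

# LoRA trains only A and B; B starts at zero, so the adapted layer
# initially behaves exactly like the base model.
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))
alpha = 16                            # common scaling hyperparameter

def adapted_forward(x):
    """Base projection plus the low-rank update (alpha/r) * B @ A @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(adapted_forward(x), W @ x)  # identical before training

# Trainable-parameter comparison: full fine-tune vs. LoRA adapter.
full_params = W.size           # 64 * 64 = 4096
lora_params = A.size + B.size  # 8 * 64 + 64 * 8 = 1024
```

Even at this toy scale, the adapter trains a quarter of the parameters; at real model sizes with r of 8–64 against hidden dimensions in the thousands, the savings are far larger, which is what makes LoRA (and 4-bit QLoRA) feasible on small GPUs.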