Fine-tuning and Application of LLMs in Specialized Fields
General-purpose LLMs are strong across a broad range of tasks, but specialized fields (medicine, law, finance, code) require adaptation. Fine-tuning and retrieval-augmented generation (RAG) turn them into domain experts without training from scratch.
Methods for Adapting LLMs
Full fine-tuning: update all model parameters on domain data. For GPT-3 (175B), a fine-tuning run costs ~$100K. Risk of "catastrophic forgetting": the model loses general knowledge. Justified only for highly specialized tasks with a large dataset.
LoRA (Low-Rank Adaptation; Hu et al., 2021): add to each weight matrix W a low-rank update: W' = W + BA, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d,k). Only B and A are trained (0.01–1% of the parameters). Initialization: B = 0, A ~ N(0, σ²), so BA = 0 and the model starts out unchanged. At inference the update can be merged into the weights (W' = W + BA), adding no latency. With r = 8–64, quality is comparable to full fine-tuning.
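A minimal PyTorch sketch of this update (the class name, initialization constant, and α/r scaling are illustrative; the peft library provides a production implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer W plus a trainable low-rank update BA (sketch of Hu et al., 2021)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d, r))         # B = 0, so BA = 0 at the start
        self.scale = alpha / r                           # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> None:
        """Fold BA into W for inference: W' = W + (alpha/r)·BA, so there is no extra latency."""
        self.base.weight += self.scale * (self.B @ self.A)
```

In practice such wrappers replace the attention projections (e.g., q_proj, v_proj), so only the small A and B matrices receive gradients.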
QLoRA (Dettmers et al., 2023): LoRA on top of a base model quantized to 4-bit NormalFloat (NF4). Makes it possible to fine-tune LLaMA-65B on a single 80 GB A100, a "killer feature" for academic labs.
Adapter layers: insert small trainable "adapter" layers between the existing ones and freeze everything else. Less popular than LoRA.
Retrieval-Augmented Generation (RAG)
Problem: the LLM lacks domain knowledge, its information is outdated, or the relevant data is confidential, and it hallucinates on specific facts.
RAG architecture (a minimal pipeline sketch follows the list):
- Indexing: split documents into chunks (256–512 tokens), create embeddings via text-embedding-ada-002 or a local model (e.g., BAAI/bge-base-en), and store them in a vector database (Chroma, Faiss, Pinecone, Qdrant).
- Retrieval: for a query q: embedding(q) → similarity search in the vector DB → top-k (k=3–10) relevant chunks.
- Generation: the LLM generates an answer using the retrieved fragments as context. Prompt: "Based on the following context, answer the question. Context: [chunks]. Question: [q]".
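A minimal end-to-end sketch with sentence-transformers and Chroma (the model name, collection name, and sample chunks are placeholders; the final LLM call is left as a stub):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any local embedding model
client = chromadb.Client()
collection = client.create_collection("docs")

# 1. Indexing: embed 256-512-token chunks and store them in the vector DB
chunks = ["First document chunk ...", "Second document chunk ..."]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# 2. Retrieval: embed the query and take the top-k nearest chunks (k=2 for this toy index)
query = "What does the contract say about termination?"
hits = collection.query(query_embeddings=embedder.encode([query]).tolist(), n_results=2)
context = "\n".join(hits["documents"][0])

# 3. Generation: pass the retrieved fragments to any LLM
prompt = (
    "Based on the following context, answer the question.\n"
    f"Context: {context}\nQuestion: {query}"
)
# answer = llm(prompt)   # call the LLM of your choice here
```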
Evaluation of RAG: RAGAS metrics: Faithfulness (is the answer grounded in the retrieved context), Answer Relevancy (does the answer address the question), Context Precision (are the retrieved chunks relevant), Context Recall (were all the necessary chunks retrieved).
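A hedged sketch of scoring with the ragas library (the toy row is invented; ragas needs an LLM judge, by default via an OpenAI API key):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One toy example; real evaluation uses your logged question/answer/context triples.
data = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["30 days, per clause 7.2."],
    "contexts": [["Clause 7.2: either party may terminate on 30 days' written notice."]],
    "ground_truth": ["30 days"],   # required by context_recall
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)
```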
Advanced RAG strategies: HyDE (Hypothetical Document Embeddings): the LLM generates a hypothetical answer, and that answer (not the query) is embedded, yielding more accurate retrieval. Reranking: after retrieval, a cross-encoder re-scores the candidates (sketch below). Self-RAG: the model itself decides whether retrieval is needed.
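A reranking sketch with a sentence-transformers cross-encoder (the model name is a common public checkpoint; the candidate list stands in for retriever output):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What does the contract say about termination?"
candidates = [   # top-k chunks returned by the retriever
    "Clause 7.2: either party may terminate on 30 days' written notice.",
    "Clause 3.1: invoices are payable within 14 days.",
]
# The cross-encoder scores each (query, chunk) pair jointly: slower than
# comparing independent embeddings, but noticeably more accurate.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
```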
Specialized LLMs
Medicine: Med-PaLM 2 (Google, 2023): PaLM 2 fine-tuned on medical texts plus RLHF with feedback from physicians. On the USMLE (US medical licensing exam), Med-PaLM 2 scores 85% (expert level). Limitations: not FDA-approved, and hallucinated medical facts are dangerous.
Law: LegalBERT, SaulLM-7B: pretrained or fine-tuned on legal corpora (CanLII, US court decisions, contracts). They outperform general models on named entity recognition in contracts, claim classification, and summarization of decisions.
Finance: BloombergGPT (50B, 2023): trained on 363B financial tokens (Bloomberg News, SEC filings, financial reports) plus 345B general tokens. It outperforms general-purpose models of comparable size on financial NLP tasks (sentiment, NER, QA), but is weaker on general tasks.
Code: GitHub Copilot, CodeLlama: Copilot (Codex-based): programmers accept 40–55% of suggested lines; in a controlled study, developers completed a task ~55% faster with Copilot (Peng et al., 2023). CodeLlama-34B: SOTA on HumanEval (Python coding benchmark) among open-source models.
Evaluation of LLMs
Standard benchmarks: MMLU (57 academic tasks): GPT-4 = 86%, LLaMA-3-8B = 68%. HumanEval (code): GPT-4 = 67%, CodeLlama-34B = 55%. MT-Bench: conversational quality, GPT-4 as judge.
Live benchmarks (Chatbot Arena): users compare answers from two anonymous models, and votes feed an Elo-style rating. Much harder to "over-optimize" for than a static benchmark.
Problem of data contamination: test data may have leaked into the training set, inflating results. LiveBench: regularly refreshed tasks drawn from recent events.
Numerical Example
Fine-tune LLaMA-3-8B via QLoRA on a legal dataset (50K contracts); a config sketch follows the list:
- Rank r=16, alpha=32, target modules=q_proj, v_proj
- Trainable params: 16.7M (0.21% of 8B)
- Training: 3 epochs × 50K samples, A100 40GB, ~6 hours
- Val loss: 1.82 → 1.31 (28% reduction)
- NER in contracts (extraction of parties, dates, sums): F1 = 0.91 vs 0.74 for base LLaMA-3-8B (zero-shot)
- RAG over contract database: F1 = 0.86 (worse than fine-tune, but no retraining needed)
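A sketch of this configuration with transformers, peft, and bitsandbytes (the model id and task_type are assumptions; r, alpha, and the target modules mirror the numbers above):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NormalFloat quantization of the frozen base model (QLoRA, Dettmers et al., 2023)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto"
)

# LoRA adapters on the attention projections, as in the example above
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # ~16.7M trainable parameters (~0.2% of 8B)
```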
Conclusion: For specialized NER tasks with data → fine-tune. For QA on specific documents → RAG.
Assignment: Build a RAG system for scientific articles (use arXiv abstracts in your field); a starter skeleton follows. (1) Download 1000 articles via the arXiv API. (2) Split them into chunks (chunk_size=512, overlap=50), create embeddings (sentence-transformers), and save them in Chroma. (3) Implement RAG: for 20 questions about the articles' contents, run retrieval and generate answers with GPT-3.5. (4) Evaluate with RAGAS Faithfulness and Context Precision. (5) Compare with direct GPT-3.5 (without RAG): how much does accuracy improve on factual questions?
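A starter skeleton for steps (1)–(2) (the search query is a placeholder, and the whitespace chunking only approximates token counts; swap in a real tokenizer if you need exact sizes):

```python
import arxiv

# (1) Download abstracts via the arXiv API (replace the query with your field)
client = arxiv.Client()
search = arxiv.Search(query="cat:cs.CL", max_results=1000)
abstracts = [r.summary for r in client.results(search)]

# (2) Chunk with overlap; whitespace "tokens" stand in for real token counts
def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    words, step = text.split(), size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = [c for a in abstracts for c in chunk(a)]
# Next: embed with sentence-transformers and store in Chroma as in the RAG sketch above.
```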