Supported Models

The models below are tested and validated with Octogent's agent loop. Performance ratings reflect reasoning quality, tool-call accuracy, and multi-step task completion.

| Model | Parameters | Quant | VRAM / RAM | Context | Speed | Reasoning | Status |
|---|---|---|---|---|---|---|---|
| llama3.2:8b-instruct-q8_0 | 8B | Q8_0 | ~8 GB | 128K | Fast | Good | recommended |
| llama3.1:70b-instruct-q4_K_M | 70B | Q4_K_M | ~42 GB | 128K | Slow | Excellent | stable |
| qwen2.5-coder:14b-instruct-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| deepseek-r1:14b-distill-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| mistral:7b-instruct-q8_0 | 7B | Q8_0 | ~7 GB | 32K | Fast | Fair | stable |
| phi4:14b-q4_K_M | 14B | Q4_K_M | ~9 GB | 16K | Medium | Good | beta |
| groq/llama-3.3-70b-versatile | 70B | Cloud | n/a | 128K | Very Fast | Excellent | cloud |

Inference Parameters

These parameters control how the LLM generates text at the Ollama layer. They can be set globally in octogent.config.json or per-skill.

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.2 | Sampling temperature |
| top_p | 0.85 | Nucleus sampling threshold |
| top_k | 40 | Token candidate pool size |
| repeat_penalty | 1.1 | Anti-repetition penalty |
| num_ctx | 8192 | Context window in tokens |
| num_predict | -1 | Max output tokens (-1 = uncapped) |
| num_thread | auto | CPU threads |
| num_gpu | auto | GPU layers to offload |

Configuration Schema

Full LLM configuration block for octogent.config.json:

```json
{
  "llm": {
    "primary": {
      "provider": "ollama",
      "model": "llama3.2:8b-instruct-q8_0",
      "baseUrl": "http://localhost:11434",
      "options": {
        "temperature": 0.2,
        "top_p": 0.85,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "num_ctx": 8192,
        "num_predict": -1
      }
    },
    "fallback": {
      "provider": "groq",
      "model": "llama-3.3-70b-versatile",
      "apiKey": "${GROQ_API_KEY}"
    },
    "routing": {
      "strategy": "primary-first",
      "fallbackOnTimeout": true,
      "timeoutMs": 30000
    }
  }
}
```
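
The `"${GROQ_API_KEY}"` value is an environment-variable placeholder. The loader below is a hypothetical sketch of how such a config could be read and resolved; Octogent's actual resolution logic may differ (for example, it may resolve keys lazily rather than at load time).

```python
import json
import os
import re

# Matches ${VAR_NAME} placeholders, e.g. "${GROQ_API_KEY}" in the config above.
_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value):
    """Recursively expand ${VAR} placeholders from the environment."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: os.environ.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value

def load_config(path: str = "octogent.config.json") -> dict:
    """Load the config file and resolve all environment placeholders."""
    with open(path) as f:
        return expand_env(json.load(f))
```

Unset variables expand to an empty string in this sketch; a production loader would more likely raise an error so a missing `GROQ_API_KEY` fails loudly.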

Pull Models via Ollama

Run these commands after installing Ollama to pull supported models:

```shell
ollama pull llama3.2:8b-instruct-q8_0        # Recommended — fast & capable
ollama pull qwen2.5-coder:14b-instruct-q6_K  # Best for code tasks
ollama pull deepseek-r1:14b-distill-q6_K     # Best for complex reasoning
ollama pull llama3.1:70b-instruct-q4_K_M     # Max quality (needs 48GB+ RAM)
```

Quantization Guide

- **Q4_K_M** (4-bit mixed-precision K-quant): Best size/quality tradeoff for machines with limited RAM. Recommended for 8–16 GB systems. Minor quality reduction on complex reasoning tasks.
- **Q5_K_M** (5-bit mixed-precision K-quant): Slightly higher RAM cost than Q4_K_M with noticeably better output quality. Recommended for 16–24 GB systems running 14B models.
- **Q6_K** (6-bit K-quant): Near-original quality with moderate compression. Ideal for 14B models on 24 GB systems. Strong reasoning and code-generation performance.
- **Q8_0** (8-bit quantization): Closest to FP16 quality; the default for 8B models. Requires more RAM but produces the highest-quality output for smaller models.
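
A rough rule of thumb ties these quant levels to the VRAM/RAM column in the model table: weight size ≈ parameter count × bits-per-weight / 8. The bits-per-weight figures below are approximate effective values for llama.cpp-style K-quants (an assumption, since exact values vary by model); actual runtime use adds KV-cache and overhead on top of the weights.

```python
# Approximate effective bits per weight for each quantization level.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate model weight size in GB: params * bits-per-weight / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(round(estimate_weight_gb(70, "Q4_K_M"), 1))  # close to the ~42 GB in the table
print(round(estimate_weight_gb(14, "Q6_K"), 1))    # close to the ~12 GB in the table
```

This estimate covers weights only; add headroom for the context window (larger `num_ctx` means a larger KV-cache) before choosing a quant for your hardware.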