Supported Models

The models below are tested and validated with Octogent's agent loop. Performance ratings reflect reasoning quality, tool-call accuracy, and multi-step task completion.

| Model | Parameters | Quant | VRAM / RAM | Context | Speed | Reasoning | Status |
|---|---|---|---|---|---|---|---|
| llama3.2:8b-instruct-q8_0 | 8B | Q8_0 | ~8 GB | 128K | Fast | Good | recommended |
| llama3.1:70b-instruct-q4_K_M | 70B | Q4_K_M | ~42 GB | 128K | Slow | Excellent | stable |
| qwen2.5-coder:14b-instruct-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| deepseek-r1:14b-distill-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| mistral:7b-instruct-q8_0 | 7B | Q8_0 | ~7 GB | 32K | Fast | Fair | stable |
| phi4:14b-q4_K_M | 14B | Q4_K_M | ~9 GB | 16K | Medium | Good | beta |
| groq/llama-3.3-70b-versatile | 70B | Cloud | n/a | 128K | Very Fast | Excellent | cloud |

Inference Parameters

These parameters control how the LLM generates text at the Ollama layer. They can be set globally in octogent.config.json or per-skill.

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.2 | Sampling temperature |
| top_p | 0.85 | Nucleus sampling threshold |
| top_k | 40 | Token candidate pool size |
| repeat_penalty | 1.1 | Anti-repetition penalty |
| num_ctx | 8192 | Context window in tokens |
| num_predict | -1 | Max output tokens (-1 = uncapped) |
| num_thread | auto | CPU threads |
| num_gpu | auto | GPU layers to offload |

Configuration Schema

Full LLM configuration block for octogent.config.json:

```json
{
  "llm": {
    "primary": {
      "provider": "ollama",
      "model": "llama3.2:8b-instruct-q8_0",
      "baseUrl": "http://localhost:11434",
      "options": {
        "temperature": 0.2,
        "top_p": 0.85,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "num_ctx": 8192,
        "num_predict": -1
      }
    },
    "fallback": {
      "provider": "groq",
      "model": "llama-3.3-70b-versatile",
      "apiKey": "${GROQ_API_KEY}"
    },
    "routing": {
      "strategy": "primary-first",
      "fallbackOnTimeout": true,
      "timeoutMs": 30000
    }
  }
}
```
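
The `"${GROQ_API_KEY}"` value is an environment-variable placeholder. The loader below is a hypothetical sketch of how such a config could be read and resolved; Octogent's actual resolution logic may differ (for example, it may resolve keys lazily rather than at load time).

```python
import json
import os
import re

# Matches ${VAR_NAME} placeholders, e.g. "${GROQ_API_KEY}" in the config above.
_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value):
    """Recursively expand ${VAR} placeholders from the environment."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: os.environ.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value

def load_config(path: str = "octogent.config.json") -> dict:
    """Load the config file and resolve all environment placeholders."""
    with open(path) as f:
        return expand_env(json.load(f))
```

Unset variables expand to an empty string in this sketch; a production loader would more likely raise an error so a missing `GROQ_API_KEY` fails loudly.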

Pull Models via Ollama

Run these commands after installing Ollama to pull supported models:

```shell
ollama pull llama3.2:8b-instruct-q8_0        # Recommended — fast & capable
ollama pull qwen2.5-coder:14b-instruct-q6_K  # Best for code tasks
ollama pull deepseek-r1:14b-distill-q6_K     # Best for complex reasoning
ollama pull llama3.1:70b-instruct-q4_K_M     # Max quality (needs 48GB+ RAM)
```

Quantization Guide

- **Q4_K_M** (4-bit mixed-precision K-quant): Best size/quality tradeoff for machines with limited RAM. Recommended for 8–16 GB systems. Minor quality reduction on complex reasoning tasks.
- **Q5_K_M** (5-bit mixed-precision K-quant): Slightly higher RAM cost than Q4_K_M with noticeably better output quality. Recommended for 16–24 GB systems running 14B models.
- **Q6_K** (6-bit K-quant): Near-original quality with moderate compression. Ideal for 14B models on 24 GB systems. Strong reasoning and code-generation performance.
- **Q8_0** (8-bit quantization): Closest to FP16 quality; the default for 8B models. Requires more RAM but produces the highest-quality output for smaller models.
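
A rough rule of thumb ties these quant levels to the VRAM/RAM column in the model table: weight size ≈ parameter count × bits-per-weight / 8. The bits-per-weight figures below are approximate effective values for llama.cpp-style K-quants (an assumption, since exact values vary by model); actual runtime use adds KV-cache and overhead on top of the weights.

```python
# Approximate effective bits per weight for each quantization level.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate model weight size in GB: params * bits-per-weight / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(round(estimate_weight_gb(70, "Q4_K_M"), 1))  # close to the ~42 GB in the table
print(round(estimate_weight_gb(14, "Q6_K"), 1))    # close to the ~12 GB in the table
```

This estimate covers weights only; add headroom for the context window (larger `num_ctx` means a larger KV-cache) before choosing a quant for your hardware.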