Octogent runs local LLM inference via Ollama — no cloud dependency, no API key, no rate limits. Every parameter is configurable. Every model is interchangeable.
All models below have been tested and validated with Octogent's agent loop. Performance ratings reflect reasoning quality, tool-call accuracy, and multi-step task completion.
| Model | Parameters | Quant | VRAM / RAM | Context | Speed | Reasoning | Status |
|---|---|---|---|---|---|---|---|
| llama3.2:8b-instruct-q8_0 | 8B | Q8_0 | ~8 GB | 128K | Fast | Good | recommended |
| llama3.1:70b-instruct-q4_K_M | 70B | Q4_K_M | ~42 GB | 128K | Slow | Excellent | stable |
| qwen2.5-coder:14b-instruct-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| deepseek-r1:14b-distill-q6_K | 14B | Q6_K | ~12 GB | 128K | Medium | Excellent | stable |
| mistral:7b-instruct-q8_0 | 7B | Q8_0 | ~7 GB | 32K | Fast | Fair | stable |
| phi4:14b-q4_K_M | 14B | Q4_K_M | ~9 GB | 16K | Medium | Good | beta |
| groq/llama-3.3-70b-versatile | 70B | Cloud | n/a | 128K | Very Fast | Excellent | cloud |
The generation options (temperature, top_p, and so on) control how the LLM produces text at the Ollama layer. They can be set globally in `octogent.config.json` or overridden per-skill.
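A minimal sketch of a per-skill override, assuming skill-level config accepts the same `options` shape as the global block; the `skills` key and the `code-review` skill name here are illustrative, not confirmed keys:

```json
{
  "skills": {
    "code-review": {
      "llm": {
        "model": "qwen2.5-coder:14b-instruct-q6_K",
        "options": { "temperature": 0.1, "num_ctx": 16384 }
      }
    }
  }
}
```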
Full LLM configuration block for `octogent.config.json`:

```json
{
  "llm": {
    "primary": {
      "provider": "ollama",
      "model": "llama3.2:8b-instruct-q8_0",
      "baseUrl": "http://localhost:11434",
      "options": {
        "temperature": 0.2,
        "top_p": 0.85,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "num_ctx": 8192,
        "num_predict": -1
      }
    },
    "fallback": {
      "provider": "groq",
      "model": "llama-3.3-70b-versatile",
      "apiKey": "${GROQ_API_KEY}"
    },
    "routing": {
      "strategy": "primary-first",
      "fallbackOnTimeout": true,
      "timeoutMs": 30000
    }
  }
}
```

Run these commands after installing Ollama to pull supported models:
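A sketch of the pull step, using the model tags from the compatibility table above; trim the list to match your hardware (the 70B build alone needs ~42 GB). The `command -v` guard is only there so the script degrades gracefully when Ollama is not on PATH:

```shell
# Model tags taken from the compatibility table; remove any you don't need.
MODELS="llama3.2:8b-instruct-q8_0 \
llama3.1:70b-instruct-q4_K_M \
qwen2.5-coder:14b-instruct-q6_K \
deepseek-r1:14b-distill-q6_K \
mistral:7b-instruct-q8_0 \
phi4:14b-q4_K_M"

for m in $MODELS; do
  if command -v ollama >/dev/null 2>&1; then
    ollama pull "$m"
  else
    # Dry run when Ollama is not installed, so the script never hard-fails.
    echo "ollama not found; would pull: $m"
  fi
done
```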