llama.cpp Client
This requires you to have a llama.cpp server running
If you do not have a llama.cpp server running, you can follow the setup instructions in the llama.cpp GitHub repository.
The llama.cpp client connects to the llama-server executable from the llama.cpp project. This is an efficient, lightweight way to run local LLM inference.
If you want to add a llama.cpp client, change the Client Type to llama.cpp.
Should work out of the box with a local llama.cpp server
The default values should work with a local llama.cpp server if you are running llama-server on the default port (8080).
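If you want to verify that the server is reachable before adding the client, llama-server exposes a /health endpoint you can query directly; the URL below assumes the default host and port:

curl http://localhost:8080/health

An "ok" status in the response means the server is up and the model has finished loading.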
Click Save to add the client.
Ready to use
Once added, the client should appear in the clients list and display the currently loaded model.
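To see which model the server has loaded without going through Talemate, you can also query llama-server's OpenAI-compatible models endpoint (again assuming the default port):

curl http://localhost:8080/v1/models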
Settings
Client Name
A unique name for the client that makes sense to you.
API URL
The URL of your llama.cpp server, without any path. For example, http://localhost:8080.
The llama.cpp server (llama-server) defaults to port 8080, so unless you changed the port when starting the server, the default URL should work.
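If you started llama-server on a different port, enter that port in the URL instead. For example, a server started with the --port option (the port number here is just an example):

llama-server -m /path/to/your/model.gguf --port 9090

would be reachable at http://localhost:9090.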
API Key
If your llama.cpp server is configured to require authentication, you can set the API key here. Most local setups do not require this.
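For reference, llama-server can require a key via its --api-key option; the key below is just a placeholder:

llama-server -m /path/to/your/model.gguf --api-key my-secret-key

Enter the same key (my-secret-key in this example) in this field so the client can authenticate.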
Context Length
The number of tokens to use as context when generating text. Defaults to 8192.
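The server also has its own context size, set when it is started (llama-server's -c / --ctx-size option); generally it should be at least as large as this value. For example, to start the server with an 8192-token context:

llama-server -m /path/to/your/model.gguf -c 8192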
Common issues
Generations are weird / bad
Make sure the correct prompt template is assigned. Talemate will attempt to automatically detect the appropriate prompt template based on the model name, but this does not always work.
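If you are unsure which template the model expects, recent llama-server builds expose a /props endpoint whose response includes the chat template embedded in the model (if any); inspecting it can help you assign the right template manually:

curl http://localhost:8080/props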
Could not connect
This usually means one of the following:
- Your llama.cpp server (llama-server) is not running
- The API URL is incorrect
- The connection is blocked (for example, by a firewall)
Make sure the server is running before attempting to connect. You can start the server with a command like:
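# replace /path/to/your/model.gguf with the path to a GGUF model file on disk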
llama-server -m /path/to/your/model.gguf


