The K2 Intelligence Service
The K2 Intelligence Service runs llama-server (from llama.cpp) as a managed Windows service. The service hosts your AI models, validates them, and provides model health management and logging. See LLaMA Introduction and Core Concepts for more information on llama-server.
You configure which AI models to load and how they behave through two files:
- models.ini: Your model configuration file. Defines which models to load and their parameters.
- appsettings.json: The service configuration file. Controls the port, timeouts, restart behavior, and other operational settings of the service.
A default model is included out of the box, such as Qwen3-0.6B-Q4_K_M.gguf (open source, released under the Apache 2.0 license).
The K2 Intelligence Service has the following startup and monitoring behaviors:
- Validation Phase: Every time the service starts, it tests each model to confirm it can load. Only models that pass this check are used.
- Health Monitoring: Once running, the service continuously checks that the server and your models are healthy.
Prerequisites
Before using K2 Intelligence, you must have a configured K2 Intelligence Service instance with:
- the API Base URL (your K2 Intelligence endpoint - typically https://localhost:62137/v1).
- an API Key (for authentication; if your setup does not require one, enter any placeholder text).
- at least the default Model (you can change this or add others).
- a default Temperature (optional, controls randomness. 0.7 is normally a good middle ground).
- a default Max Tokens (optional, controls the total token allowance, including reasoning tokens).
For recommended minimum system requirements and the installation steps, see the installation guide's K2 Intelligence topic.
Server checks and logs
Use the following commands to check if the service is running and to verify that the API responds.
Check if the service is running
You should see Status: Running. If it shows Stopped, the service has not been installed or has stopped unexpectedly. You can also check the service directly in the Windows Services application.
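For example, a quick check from PowerShell (the display name used here is an assumption based on the Event Log source; adjust it to match the name shown in your Services list):

```powershell
Get-Service -DisplayName "*K2 Intelligence*"
```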
Verify the API responds
You can paste either of the following service URLs directly into your web browser. Update the IP address and port to match your environment; the port value is set in your appsettings.json file.
A healthy server returns an OK status from the health endpoint and a JSON description of your model from the models endpoint.
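For example, with the default Host (127.0.0.1) and Port (62137) from appsettings.json:

```
http://127.0.0.1:62137/health
http://127.0.0.1:62137/v1/models
```

llama-server's health endpoint typically responds with {"status": "ok"} once the server is ready.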
If you get a connection error, the server has not finished starting yet, or the port and host configuration does not match what you have set. Troubleshoot your service using the information available in the logs.
Where to find logs
The service writes logs to two places:
| Location | What you see | When to check |
|---|---|---|
| logs/llama-service-.log | All operational messages (rolling daily, 14-day retention) | General troubleshooting, model validation results |
| Windows Event Log | Error-level events only (source: "K2 Intelligence Service") | Service crashes, critical failures |
The file log is located in the logs folder in the K2 Intelligence directory (usually C:\Program Files\K2\Intelligence\logs). Each day gets its own file, and files are kept for 14 days before being cleaned up automatically.
To open the latest log:
(Powershell script)
Get-Content ".\logs\llama-service-$(Get-Date -Format 'yyyyMMdd').log" -Tail 50
You can also open the log file manually and scroll through the K2 Intelligence Service log content.
Modifying your intelligence service
Edit the configuration files to add or remove models, and configure model and service options.
| File | Purpose | Edit It? |
|---|---|---|
| models.ini | Your model definitions and their settings | Yes - this is your main file |
| appsettings.json | Service behavior (port, timeouts, restart policy) | Yes - when you need to change operational settings |
Do not edit the files below - they are generated and regenerated automatically.
| File | What it is |
|---|---|
| validated.models.ini | The production INI with only models that passed validation. Overwritten on every restart. |
| temp.validation.ini | A temporary file used during the validation phase. Deleted after validation completes. |
Both generated files live in the root K2 Intelligence directory.
models.ini
The models.ini file tells the service which AI models to load and what settings to use for each one.
File format
The file uses the standard INI file format:
; Lines starting with ; are comments
[SectionName]
key = value
another-key = value
Section names go in square brackets (this is also the model's alias that you use in K2 to make requests). Each section (except the global one) represents one model. Key-value pairs are the settings for that section. Comments start with a semicolon.
Global defaults with [*]
The [*] section sets defaults for all models. Anything you define here cascades down to every model, unless a specific model overrides it in its own section.
Here is an example global section from the default models.ini:
Global Defaults with [*]
[*]
threads = 2
stop-timeout = 10
cache-ram = 1024
ctx-size = 512
ubatch-size = 512
batch-size = 2048
prio-batch = 0
prio = 0
sleep-idle-seconds = 120
parallel = 4
You do not have to include all of these parameters. The global section can be as simple or as detailed as you need.
Model sections
Each model gets its own named section. The section name becomes the model ID - this is what you (and K2) use to reference the model in API calls. For example:
[K2IntelligenceModel]
model = Models\Qwen3-0.6B-Q4_K_M.gguf
threads = 1
threads-batch = 1
temperature = 0.7
ctx-size = 8192
min-p = 0.00
top-p = 0.95
top-k = 20
repeat-penalty = 1.0
reasoning = off
In this example, the model ID is K2IntelligenceModel. When K2 makes an API call, it specifies "model": "K2IntelligenceModel".
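For illustration, a chat completion request to the OpenAI-compatible endpoint (POST /v1/chat/completions) could look like this; the prompt content is just an example:

```json
{
  "model": "K2IntelligenceModel",
  "messages": [
    { "role": "user", "content": "Summarize this workflow step." }
  ]
}
```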
Finding your model file
The model key points to the GGUF file on disk (GGUF is the model file format that llama.cpp supports). You can download other models from the Hugging Face Hub.
Relative paths are resolved against your WorkingDirectory setting defined in your appsettings.json file (or the service directory if WorkingDirectory is empty). Absolute paths are used as-is.
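For example (the absolute path is hypothetical):

```ini
; Relative path, resolved against WorkingDirectory (or the service directory)
model = Models\Qwen3-0.6B-Q4_K_M.gguf

; Absolute path, used exactly as written
model = C:\AI\Models\Qwen3-0.6B-Q4_K_M.gguf
```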
If a model file does not exist or cannot be loaded, that model fails validation and does not start. The Windows service stays running, but the llama-server does not, so check the logs and Event Viewer when troubleshooting.
How settings inherit
Settings cascade from global down to models. A model section only needs to include the model location and the settings you want to override; everything else comes from [*]. At the model level, include only the settings that are unique to that model. For example:
[*]
threads = 2
ctx-size = 4096
temperature = 0.7
[FastModel]
model = Models\small-model.gguf
; Inherits threads=2, ctx-size=4096, temperature=0.7
[DetailedModel]
model = Models\large-model.gguf
ctx-size = 8192 ; Overrides global
temperature = 0.3 ; Overrides global
; threads still inherited from global
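The cascade can be sketched in a few lines of Python using configparser. This is an illustration of the merge behavior, not the service's actual implementation:

```python
from configparser import ConfigParser

# Same layout as the models.ini example above.
ini_text = r"""
[*]
threads = 2
ctx-size = 4096
temperature = 0.7

[DetailedModel]
model = Models\large-model.gguf
ctx-size = 8192
temperature = 0.3
"""

def effective_settings(config: ConfigParser, model: str) -> dict:
    """Start from the global [*] defaults, then apply the model's own overrides."""
    merged = dict(config["*"]) if config.has_section("*") else {}
    merged.update(config[model])
    return merged

config = ConfigParser()
config.read_string(ini_text)
settings = effective_settings(config, "DetailedModel")
# threads is inherited from [*]; ctx-size and temperature are overridden.
```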
Common model settings
These are a few of the possible settings; see the llama.cpp documentation for more information.
| Setting | What It Does | Typical Range |
|---|---|---|
| threads | CPU threads for token generation | 1 to the number of physical cores |
| threads-batch | CPU threads for prompt processing | Same as threads or higher (depends on many factors; the llama.cpp docs cover this) |
| threads-http | Threads handling HTTP requests | 2-4; you can also omit this entirely and let the server manage it |
| parallel | Simultaneous request slots | 1-8 (each slot uses memory; more simultaneous requests means each reply is slower, but overall throughput is higher than processing them sequentially. Remove this setting to use the automatic default of 4 slots.) |
| ctx-size | Context window in tokens | 512-32768 (bigger = more memory; choose a value so that ctx-size divided by parallel is at least 1512, since single-shot prompts are normally not longer, though this depends on your configuration and usage requirements) |
| batch-size | Batch size for prompt processing | 128-4096 |
| ubatch-size | Physical micro-batch size | 64-512 |
| cache-ram | Prompt cache in MiB (0 = off, -1 = unlimited) | 0-8192 |
| temperature | Randomness in responses | 0.0-2.0 (0 = deterministic, 1 = balanced; 0.3-0.7 is usually ideal, but check the model you are using and your own requirements) |
| top-p | Cumulative probability threshold | 0.0-1.0 |
| top-k | Top-K token selection | 1-100 |
| min-p | Minimum probability threshold | 0.0-1.0 |
| repeat-penalty | Discourages repetition | 1.0-1.5 (1.0 = no penalty) |
| reasoning | Reasoning mode | on or off |
| prio | Process priority (-1=low, 0=normal, 1=medium, 2=high, 3=realtime) | 0-1 |
| prio-batch | Priority for batch operations | 0-1 |
| poll | CPU polling level (0-100) | 0 for low CPU, 50+ for low latency |
| sleep-idle-seconds | Seconds idle before model sleeps (-1 = disabled) | 60-300, or -1 |
| stop-timeout | Seconds to wait for a model to unload during switches or shutdown. Values outside the range are clamped back into it to avoid long shutdown waits. Increase it if you have very large models that take a long time to unload and you see them being killed prematurely during switches; the default of 10 seconds works for most cases. | 5-120 |
| reasoning-budget | Token budget for reasoning (-1 = unlimited) | -1 or a positive integer |
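The clamping described for stop-timeout can be sketched as follows (an illustration of the documented behavior, not the service's code):

```python
def clamp_stop_timeout(seconds: int, low: int = 5, high: int = 120) -> int:
    # Values outside 5-120 are pulled back into range, so a misconfigured
    # value cannot cause excessively long waits during model switches.
    return max(low, min(high, seconds))
```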
Any flag you can pass to llama-server on the command line can go in your models.ini (if configuring a model), or appsettings.json (if configuring the service).
Steps to add a new model
- Open models.ini.
- Add a new section at the bottom of the file.
- Optionally add settings that override or extend the global defaults.
- Save the file.
- Restart the service, or restart it manually.
- Check the logs to confirm the model passed validation, or try hitting the endpoints.
If you add multiple models, the service will start as long as at least one of them loads (hardware requirements permitting), so it is always good to validate each of them.
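A new section might look like this (the model name and file are hypothetical; the keys follow the examples earlier in this topic):

```ini
[SummaryModel]
model = Models\my-summary-model.gguf
ctx-size = 4096
temperature = 0.3
```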
Steps to remove a model
- Open models.ini.
- Delete the entire model section - the [ModelName] header and all its lines.
- Save the file.
- Restart the service.
Model validation
After restarting, check the logs or test the endpoints directly:
(Powershell script)
Get-Content ".\logs\llama-service-$(Get-Date -Format 'yyyyMMdd').log" -Tail 100 | Select-String "validation"
Some common reasons a model fails validation:
- The model path is wrong or the file does not exist.
- The GGUF file is corrupted.
- The model requires more memory than is available.
- The model is incompatible with your build of llama-server or hardware.
appsettings.json
The appsettings.json file controls how the service itself behaves. You do not need to change most of these settings, but here are the ones you may need to edit.
Settings
| Setting | Default | What it does |
|---|---|---|
| WorkingDirectory | "" (service directory) | Base directory for resolving relative paths. Leave empty unless your files live elsewhere. |
| ExecutablePath | Binaries/llama-server.exe | Path to llama-server. Change if your binary is in a different location. |
| ModelsIniPath | models.ini | Path to your models.ini file. Change if you want to use a different filename or location. |
| Host | 127.0.0.1 | Network interface the server binds to. 127.0.0.1 means localhost only. Change to 0.0.0.0 to allow network access (do this with caution). |
| Port | 62137 | TCP port for the API server. Change if this port conflicts with another service. |
| models-max | 4 | The maximum number of models loaded at once. When set to 1 with 2 models defined, the server can only have one model loaded and swaps models based on API requests: if model A is loaded and a request for model B arrives, the server ejects model A and loads model B. This is set to 1 during validation. |
ServerArguments
Use the ServerArguments section to pass any llama-server command-line flags as key-value pairs. These apply globally to the server process and model processes. For example:
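A hypothetical ServerArguments section might look like this (verbose is mentioned elsewhere in this topic; treat the other flags as placeholders and check the llama.cpp docs before using them):

```json
"ServerArguments": {
  "verbose": true,
  "metrics": false,
  "threads-http": 4
}
```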
How values work:
- Boolean true: passed as a flag (e.g., "verbose": true becomes --verbose)
- Boolean false: not passed at all (e.g., "verbose": false is omitted)
- Strings and numbers: passed as --key value
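Those three rules can be sketched as a small conversion function (an illustration, not the service's actual code):

```python
def to_cli_args(server_arguments: dict) -> list:
    """Convert ServerArguments key-value pairs to llama-server CLI tokens."""
    args = []
    for key, value in server_arguments.items():
        if value is True:
            args.append(f"--{key}")            # boolean true -> bare flag
        elif value is False:
            continue                           # boolean false -> omitted entirely
        else:
            args += [f"--{key}", str(value)]   # everything else -> --key value
    return args
```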
You can put any llama-server flag here that is not managed by the service - see the Server managed flags section for what to avoid.
SkipValidation
You can set SkipValidation to true to skip the model validation phase and start all models from your models.ini file without testing them first.
When to use it:
- You are testing and want faster restarts.
- You know all your models work and want to skip the startup check (...100% trust).
- You are troubleshooting validation itself.
The trade-off: If a model is broken (corrupt file, missing, or out of memory), the service will start but that model will fail when K2 tries to use it. Without validation, you lose the early warning. Use this setting deliberately, not as a default.
Restart behavior
- MaxRestartAttempts: How many times the service will try to restart after a failure. Set to 0 for unlimited restarts (forced immortality). Default is 3.
- RestartDelaySeconds: How long to wait between restart attempts. Default is 5 seconds.
When the restart attempts run out, the service stops permanently and will not try again until you restart it manually (or Windows Service recovery settings kick in).
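In appsettings.json these two settings are plain key-value pairs; for example, the defaults look like this:

```json
"MaxRestartAttempts": 3,
"RestartDelaySeconds": 5
```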
Highly sensitive settings
The following are only a few of the sensitive settings:
| Setting | Why to leave it |
|---|---|
| LlamaClient timeouts | These control how long the service waits for API responses from llama-server. Too low and health checks fail prematurely; too high and restarts drag out. |
| LlamaClient.Resilience | The retry and circuit breaker policy. Tuned for the service's own API calls to llama-server. |
| Serilog section | Logging configuration. Only change if you need more verbose logs for debugging (set MinimumLevel.Default to Debug). |
| HealthCheck.ModelStatusPollDelaySeconds | How often the service checks model health. The default (60 seconds) is fine for most cases. Lower values mean more frequent checks but higher background CPU usage. |
Server managed flags
Some llama-server flags are managed entirely by the service and cannot be overridden. If you add them to your configuration anyway, nothing will break - they will just be ignored.
Orchestration flags
These are set by the service and cannot be overridden:
| Flag | Why the service manages it |
|---|---|
| offline | Prevents llama-server from downloading anything. Always enabled. |
| log-prefix | Adds structured prefixes to logs. Needed for log parsing. |
| models-preset | Points to the generated INI file. Dynamically set on each start. |
| host | Set from your Host config. Do not set it elsewhere. |
| port | Set from your Port config. Do not set it elsewhere. |
Forbidden flags
These flags are stripped from your configuration because they conflict with the service's operation:
| Flag | Why it is forbidden |
|---|---|
| log-timestamp | Changes the log format, which would break the service's log parsing. |
| load-on-startup | Would cause models to load automatically, bypassing the validation phase. The service needs to control when and how models load. |
If you add these to ServerArguments in appsettings.json, you will see a warning in the logs and the flag will be ignored. In models.ini they are silently removed during processing.
HTTPS setup
By default the service runs on HTTP. If you need HTTPS, such as when K2 and the AI service are on different machines, you can enable it by following the steps below.
Enabling HTTPS
To enable HTTPS for your llama-server, set the following values in your appsettings.json file:
"LlamaServer": {
"UseHttps": true,
"SslCertFile": "certs/server.crt",
"SslKeyFile": "certs/server.key"
}
- UseHttps: Set to true to enable HTTPS.
- SslCertFile: Path to your SSL certificate file (.pem format).
- SslKeyFile: Path to your SSL private key file (.key format).
Certificate paths can be relative (resolved against WorkingDirectory - the root of your K2 Intelligence directory) or absolute.
What happens
When UseHttps is true, the service:
- Validates that both certificate files exist and are accessible.
- Adds --ssl-cert-file and --ssl-key-file flags when starting the llama-server.
- Changes the API endpoint to https:// instead of http://.
If the certificate or key files are missing, the service will fail to start and log an error.
Certificate notes
- The service does not generate certificates for you; you must create and sign them.
- Self-signed certificates work fine for internal use. For production, use certificates from your organization’s Certificate Authority.
- If you change the certificate files, restart the service to pick up the new ones.
- When using HTTPS, the API key you configure for K2 will be sent over the encrypted connection.
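For internal testing, one way to create a self-signed certificate is with openssl (assuming openssl is available on your system; the paths match the appsettings.json example above, and the CN value is a placeholder):

```shell
# Create the certs folder and a self-signed certificate valid for one year.
mkdir -p certs
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout certs/server.key -out certs/server.crt \
  -days 365 -subj "/CN=localhost"
```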
Considerations
- Running the K2 Intelligence Service will close all other llama-server processes.
- In your models.ini file, if threads is defined as 2 globally, each model gets 2 threads, so two loaded models use 4 threads in total.
- In your appsettings.json file, when verbose is set to true, raw requests and responses are logged, so only use verbose for debugging.
- The K2 Intelligence Service install includes all currently available CPU binaries for llama-server. In extremely rare cases, the CPU binary that llama-server chooses to load may be wrong for your CPU. If this happens, the logs will only show which binaries were loaded and then fail at some point with no clear indicator. Remove the failed binary to test whether this is what has happened.