The K2 Intelligence Service Windows service

This topic covers the most commonly used parameters and options. llama-server supports many more, but those are beyond the scope of this content.

The K2 Intelligence Service runs llama-server (from llama.cpp) as a managed Windows service. This service hosts your AI models, validates them, and provides model health management and logging. See LLaMA Introduction and Core Concepts for more information on the llama-server.

You configure which AI models to load and how they behave through two files:

  • models.ini: Your model configuration file. Defines which models to load and their parameters.

  • appsettings.json: The service configuration. Controls the port, timeouts, restart behavior, and other operational settings of the llama-service.

A default model is included out of the box (for example, Qwen3-0.6B-Q4_K_M.gguf, an open-source model released under the Apache 2.0 license).

The K2 Intelligence Service performs the following checks, and records both in its logs:

  • Validation Phase: Every time the service starts, it tests each model to confirm it can load. Only models that pass this check are used.

  • Health Monitoring: Once running, the service continuously checks that the server and your models are healthy.

Running the K2 Intelligence Service will close all other llama-server processes.

Prerequisites

Before using K2 Intelligence, you must have a configured K2 Intelligence Service instance with:

  • the API Base URL (your K2 Intelligence endpoint - typically https://localhost:62137/v1).
  • an API Key (for authentication - if authentication is not required, enter any placeholder text).
  • at least the default Model (you can change this or add others).
  • a default Temperature (optional; controls randomness. 0.7 is normally a good middle ground).
  • a default Max Tokens (optional; controls the total token allowance for a response, including reasoning tokens).

For recommended minimum system requirements and the installation steps, see the installation guide's K2 Intelligence topic.
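Once the service is installed, you can confirm how these values fit together with a single test request. The sketch below is an example only: the endpoint, API key, model name, temperature, and max tokens are the defaults described above, so adjust them to match your environment (including the scheme, host, and port of your configured API Base URL).

(PowerShell script)

$headers = @{ Authorization = "Bearer your-api-key" }   # your API Key (any text if authentication is not required)
$body = @{
    model       = "K2IntelligenceModel"                 # the model ID (a section name from models.ini)
    messages    = @(@{ role = "user"; content = "Hello" })
    temperature = 0.7
    max_tokens  = 256
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://127.0.0.1:62137/v1/chat/completions" -Method Post -ContentType "application/json" -Headers $headers -Body $body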

Server checks and logs

Use the following commands to check if the service is running and to verify that the API responds.

Check if the service is running

(PowerShell script)

Get-Service -Name "K2IntelligenceService"

You should see Status: Running. If it shows Stopped, the service has stopped unexpectedly or was never started; if the command returns an error, the service has not been installed. You can also check the service directly in the Windows Services application (services.msc).
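If the service is installed but stopped, you can try starting it from the same PowerShell session. This is a minimal sketch that assumes the service name used in the Get-Service example above:

(PowerShell script)

Start-Service -Name "K2IntelligenceService"   # start the service if it is installed but stopped
Get-Service -Name "K2IntelligenceService"     # confirm it now shows Status: Running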

Verify the API responds

You can paste either of the following URLs directly into your web browser. Update the IP address and port to match your environment; the port value is set in your appsettings.json file.

(Web browser)

http://127.0.0.1:62137/health
http://127.0.0.1:62137/v1/models

A healthy server returns the following from the health endpoint, and a JSON description of your models from the models endpoint:

{ "status": "ok" }

If you get a connection error, the server has not finished starting yet, or the host and port do not match your configuration. Troubleshoot your service using the information available in the logs.
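On a headless machine without a browser, you can run the same checks from PowerShell. A minimal sketch, assuming the default host and port shown above:

(PowerShell script)

Invoke-RestMethod -Uri "http://127.0.0.1:62137/health"      # expects { "status": "ok" }
Invoke-RestMethod -Uri "http://127.0.0.1:62137/v1/models"   # lists the models the server is serving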

Where to find logs

The service writes logs to two places:

Location | What you see | When to check
logs/llama-service-.log | All operational messages (rolling daily, 14-day retention) | General troubleshooting, model validation results
Windows Event Log | Error-level events only (source: "K2 Intelligence Service") | Service crashes, critical failures

The file log is located in the logs folder in the K2 Intelligence directory (usually C:\Program Files\K2\Intelligence\logs). Each day gets its own file, and files are kept for 14 days before being cleaned up automatically.

To open the latest log:

(PowerShell script)

Get-Content ".\logs\llama-service-$(Get-Date -Format 'yyyyMMdd').log" -Tail 50

You can also open the log file manually and scroll through the K2 Intelligence Service log content.
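To watch the log in real time while you reproduce an issue, you can follow the file as it is written. A small sketch of the same command with the -Wait switch:

(PowerShell script)

# Press Ctrl+C to stop following the log.
Get-Content ".\logs\llama-service-$(Get-Date -Format 'yyyyMMdd').log" -Tail 20 -Wait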

Modifying your intelligence service

Edit the configuration files to add or remove models, and configure model and service options.

File | Purpose | Edit it?
models.ini | Your model definitions and their settings | Yes - this is your main file
appsettings.json | Service behavior (port, timeouts, restart policy) | Yes - when you need to change operational settings

Do not edit the files below - they are generated and regenerated automatically.

File | What it is
validated.models.ini | The production INI with only the models that passed validation. Overwritten on every restart.
temp.validation.ini | A temporary file used during the validation phase. Deleted after validation completes.

Both generated files live in the root K2 Intelligence directory.

If you edit validated.models.ini, your edits will be overwritten the next time the service starts. Make your edits in models.ini.

models.ini

The models.ini file tells the service which AI models to load and what settings to use for each one.

File format

The file uses a standard INI file format:

; Lines starting with ; are comments
[SectionName]
key = value
another-key = value

Section names go in square brackets (this is also the model's alias that you use in K2 to make requests). Each section (except the global one) represents one model. Key-value pairs are the settings for that section. Comments start with a semicolon.

Global defaults with [*]

The [*] section sets defaults for all models. Anything you define here cascades down to every model, unless a specific model overrides it in its own section.

Here is an example global section from the default models.ini:

[*]
threads = 2
stop-timeout = 10
cache-ram = 1024
ctx-size = 512
ubatch-size = 512
batch-size = 2048
prio-batch = 0
prio = 0
sleep-idle-seconds = 120
parallel = 4

You do not have to include all of these parameters. The global section can be as simple or as detailed as you need.

Model sections

Each model gets its own named section. The section name becomes the model ID - this is what you (and K2) use to reference the model in API calls. For example:

[K2IntelligenceModel]
model = Models\Qwen3-0.6B-Q4_K_M.gguf
threads = 1
threads-batch = 1
temperature = 0.7
ctx-size = 8192
min-p = 0.00
top-p = 0.95
top-k = 20
repeat-penalty = 1.0
reasoning = off

In this example, the model ID is K2IntelligenceModel. When K2 makes an API call, it specifies "model": "K2IntelligenceModel".

Finding your model file

The model key points to the GGUF file on disk (GGUF is the model file format that llama.cpp supports). To download other models, you can go to the Hugging Face Hub.

Relative paths are resolved against your WorkingDirectory setting defined in your appsettings.json file (or the service directory if WorkingDirectory is empty): 

model = Models\Qwen3-0.6B-Q4_K_M.gguf

Absolute paths are used as-is:

model = D:\AI\Models\Qwen3-0.6B-Q4_K_M.gguf

If a model file does not exist or cannot be loaded, that model fails validation and is not started. The Windows service stays running even when llama-server does not, so check the logs and the Event Viewer when troubleshooting.
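Before restarting the service, it can save a validation cycle to confirm the file is actually where the configuration says it is. A small sketch, assuming the default install directory and the bundled model path used above:

(PowerShell script)

# Relative model paths resolve against WorkingDirectory (or the service directory if it is empty).
Test-Path "C:\Program Files\K2\Intelligence\Models\Qwen3-0.6B-Q4_K_M.gguf"   # returns True if the file exists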

How settings inherit

Settings cascade from global down to models. A model section only needs to include the model location and the settings you want to override - everything else comes from [*]. At the model level, include only the settings that are specific to that model. For example:

[*]
threads = 2
ctx-size = 4096
temperature = 0.7

[FastModel]
model = Models\small-model.gguf
; Inherits threads=2, ctx-size=4096, temperature=0.7

[DetailedModel]
model = Models\large-model.gguf
ctx-size = 8192        ; Overrides global
temperature = 0.3      ; Overrides global
; threads still inherited from global

If threads is defined as 2 globally, each model has 2 threads set, so loading two models means 4 threads are used.

Use a lower context size under the global section and then override it in each model individually. This allows each model to load faster during validation (because it has lower RAM requirements). If ctx-size is unset or set very high, the entire RAM requirement is reserved up front and can be quite large.

Common model settings

These are a few of the possible settings. See the llama.cpp documentation for more information.

Setting | What it does | Typical range
threads | CPU threads for token generation | 1 up to the true core count
threads-batch | CPU threads for prompt processing | Same as threads or higher (depends on many factors; the llama.cpp docs cover this)
threads-http | Threads handling HTTP requests | 2-4; this has little CPU impact, so omitting it entirely and letting the server manage it is also valid
parallel | Simultaneous request slots | 1-8 (each slot uses memory; this is the number of requests processed at once - more simultaneous requests make each reply slower, but overall throughput is faster than sequential processing). If removed, request slots are set to auto, which uses a default value of 4.
ctx-size | Context window in tokens | 512-32768 (bigger = more memory; set it so that ctx-size divided by parallel gives each slot at least 1512 tokens - single-shot prompts are normally not longer than that, but this depends on your configuration and usage requirements)
batch-size | Batch size for prompt processing | 128-4096
ubatch-size | Physical micro-batch size | 64-512
cache-ram | Prompt cache in MiB (0 = off, -1 = unlimited) | 0-8192
temperature | Randomness in responses | 0.0-2.0 (0 = deterministic, 1 = balanced; 0.3-0.7 is usually ideal, but check the model you are using and your own requirements)
top-p | Cumulative probability threshold | 0.0-1.0
top-k | Top-K token selection | 1-100
min-p | Minimum probability threshold | 0.0-1.0
repeat-penalty | Discourages repetition | 1.0-1.5 (1.0 = no penalty)
reasoning | Reasoning mode | on or off
prio | Process priority (-1 = low, 0 = normal, 1 = medium, 2 = high, 3 = realtime) | 0-1
prio-batch | Priority for batch operations | 0-1
poll | CPU polling level (0-100) | 0 for low CPU, 50+ for low latency
sleep-idle-seconds | Seconds idle before a model sleeps (-1 = disabled) | 60-300, or -1
stop-timeout | Seconds to wait for model unload during switches | 5-120 (see below)
reasoning-budget | Token budget for reasoning (-1 = unlimited) | -1 or a positive integer

The stop-timeout setting affects how models behave when being swapped or unloaded.

What it does: When llama-server needs to unload a model (to load another one, or during shutdown), it waits up to stop-timeout seconds for the unload to complete.

Range: 5 to 120 seconds. Values outside this range are clamped back into it to avoid excessively long shutdown waits.

Where to set it:

  • In [*] as a default for all models.
  • In a model section to override it for that specific model.
  • During validation, it is always forced to 5 seconds - the service needs to fail fast.

When to increase it: If you have very large models that take a long time to unload from memory and you see them being killed prematurely during switches. The default of 10 seconds works for most cases.
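Because stop-timeout (like any other setting) can live in [*] or in a model section, the sketch below shows both placements. The [LargeModel] section name and file path are hypothetical examples, not part of the default configuration:

[*]
stop-timeout = 10              ; default for all models

[LargeModel]
model = Models\large-model.gguf
stop-timeout = 60              ; give this large model longer to unload during switches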

Any flag you can pass to llama-server on the command line can go in your models.ini (if configuring a model), or appsettings.json (if configuring the service).

Steps to add a new model

  1. Open models.ini.
  2. Add a new section at the bottom:

    [MyNewModel]
    model = Models\my-new-model.gguf
  3. Optionally, add settings that override the global defaults, or add new settings to the global defaults.
  4. Save the file.
  5. Restart the service:
    (PowerShell script)

    Restart-Service -Name "K2 Intelligence Service"
    Or restart it manually.
  6. Check the logs to confirm the model passed validation, or test the endpoints directly.

If you add multiple models (keeping hardware requirements in mind), the service will start as long as at least one of them loads, so it is always good to validate each model individually.
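One quick way to confirm a newly added model is being served is to list the model IDs from the models endpoint. A sketch, assuming the default host and port and that the endpoint returns the usual OpenAI-style list:

(PowerShell script)

# Each ID should match a section name from your models.ini, for example MyNewModel.
(Invoke-RestMethod -Uri "http://127.0.0.1:62137/v1/models").data.id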

Steps to remove a model

  1. Open models.ini.
  2. Delete the entire model section - the [ModelName] header and all its lines.
  3. Save the file.
  4. Restart the service.

Model validation

After restarting, check the logs or test the endpoints directly:

(PowerShell script)

Get-Content ".\logs\llama-service-$(Get-Date -Format 'yyyyMMdd').log" -Tail 100 | Select-String "validation"

Some common reasons a model fails validation:

  • The model path is wrong or the file does not exist.
  • The GGUF file is corrupted.
  • The model requires more memory than is available.
  • The model is incompatible with your build of llama-server or hardware.

appsettings.json

The appsettings.json file controls how the service runs and manages llama-server. You don't need to change most of these settings, but here are the ones you may need to edit.

Settings

Setting | Default | What it does
WorkingDirectory | "" (service directory) | Base directory for resolving relative paths. Leave empty unless your files live elsewhere.
ExecutablePath | Binaries/llama-server.exe | Path to llama-server. Change if your binary is in a different location.
ModelsIniPath | models.ini | Path to your models.ini file. Change if you want to use a different filename or location.
Host | 127.0.0.1 | Network interface the server binds to. 127.0.0.1 means localhost only. Change to 0.0.0.0 to allow network access (do this with caution).
Port | 62137 | TCP port for the API server. Change if this port conflicts with another service.
models-max | 4 | The maximum number of models loaded at the same time; this sets the model swap limit (see below).

When models-max is 1 and two models are defined, the server can only keep one model loaded, so it swaps models based on API requests: if model A is loaded and a request for model B arrives, the server ejects model A and loads model B. This value is set to 1 during validation.
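If you need to change the bind address or port, the values are plain JSON key-value pairs. A minimal sketch with the defaults is shown below; the exact nesting in your appsettings.json (for example, whether these keys sit under a LlamaServer section like the HTTPS settings shown later in this topic) may differ from this fragment, so edit the keys where they already appear in your installed file rather than copying this verbatim:

"Host": "127.0.0.1",
"Port": 62137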

HTTPS is also supported, but to use HTTPS you must generate and use your own certificates and key files. For your safety, the service will not start if the certificates are not signed. See HTTPS setup for more information.

ServerArguments

Use the ServerArguments section to pass any llama-server command-line flags as key-value pairs. These apply globally to the server process and model processes. For example:

"ServerArguments": {
"verbose": false,
"no-webui": true
}

How values work:

  • Boolean true: passed as a flag (e.g., "verbose": true becomes --verbose)
  • Boolean false: not passed at all (e.g., "verbose": false is omitted)
  • Strings and numbers: passed as --key value 
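For example, a value-taking flag is written as a key with a string or number. In the sketch below, timeout is used purely as an illustration of a value-taking flag (check llama-server --help for the flags your build supports). This configuration would pass --no-webui --timeout 600 to llama-server, while verbose would be omitted because it is false:

"ServerArguments": {
  "verbose": false,
  "no-webui": true,
  "timeout": 600
}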

You can put any llama-server flag here that is not managed by the service - see the Server managed flags section for what to avoid.

SkipValidation

You can set SkipValidation to true to skip the model validation phase and start all models from your models.ini file without testing them first.

When to use it:

  • You are testing and want faster restarts.
  • You know all your models work and want to skip the startup check (...100% trust).
  • You are troubleshooting validation itself.

The trade-off: If a model is broken (corrupt file, missing, or out of memory), the service will start but that model will fail when K2 tries to use it. Without validation, you lose the early warning. Use this setting deliberately, not as a default.

"StartupValidation": {
"SkipValidation": true
}

Restart behavior

"MaxRestartAttempts": 3,
"RestartDelaySeconds": 5

  • MaxRestartAttempts: How many times the service will try to restart after a failure. Set to 0 for unlimited restarts (forced immortality). Default is 3.
  • RestartDelaySeconds: How long to wait between restart attempts. Default is 5 seconds.

When the restart attempts run out, the service stops permanently and will not try again until you restart it manually (or Windows Service recovery settings kick in).

Highly sensitive settings

These settings are tuned for normal operation. Changing them can cause issues. We mention them so that you can tweak things in your environment if you need to. Always make tweaks in a test environment first.

The following are only a few of the sensitive settings:

Setting | Why to leave it
LlamaClient timeouts | These control how long the service waits for API responses from llama-server. Too low and health checks fail prematurely; too high and restarts drag out.
LlamaClient.Resilience | The retry and circuit breaker policy. Tuned for the service's own API calls to llama-server.
Serilog section | Logging configuration. Only change this if you need more verbose logs for debugging (set MinimumLevel.Default to Debug).
HealthCheck.ModelStatusPollDelaySeconds | How often the service checks model health. The default (60 seconds) is fine for most cases. Lower values mean more frequent checks but higher background CPU usage.
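If you do need more verbose logging temporarily, the Serilog change mentioned above typically follows the standard Serilog configuration shape shown below. This is a sketch of that shape only; your installed Serilog section will contain additional settings that you should leave in place:

"Serilog": {
  "MinimumLevel": {
    "Default": "Debug"
  }
}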

Server managed flags

Some llama-server flags are managed entirely by the service and cannot be overridden. If you add them to your configuration anyway, nothing will break - they will just be ignored.

Orchestration flags

These are set by the service and cannot be overridden:

Flag | Why the service manages it
offline | Prevents llama-server from downloading anything. Always enabled.
log-prefix | Adds structured prefixes to logs. Needed for log parsing.
models-preset | Points to the generated INI file. Set dynamically on each start.
host | Set from your Host config. Do not set it elsewhere.
port | Set from your Port config. Do not set it elsewhere.

Forbidden flags

These are stripped from your configuration because they conflict with the service's operation and would inherently cause issues:

Flag | Why it is forbidden
log-timestamp | Changes the log format, which would break the service's log parsing.
load-on-startup | Would cause models to load automatically, bypassing the validation phase. The service needs to control when and how models load.

If you add these to ServerArguments in appsettings.json, you will see a warning in the logs and the flag will be ignored. In models.ini they are silently removed during processing.

HTTPS setup

By default the service runs on HTTP. If you need HTTPS - such as when K2 and the AI service are on different machines - you can enable it by following the steps below.

Enabling HTTPS

To enable HTTPS for your llama-server, set the following values in your appsettings.json file:

"LlamaServer": {
"UseHttps": true,
"SslCertFile": "certs/server.crt",
"SslKeyFile": "certs/server.key"
}
  • UseHttps: Set to true to enable HTTPS.
  • SslCertFile: Path to your SSL certificate file (.pem format).
  • SslKeyFile: Path to your SSL private key file (.key format).

Certificate paths can be relative (resolved against WorkingDirectory - the root of your K2 Intelligence directory) or absolute.

What happens

When UseHttps is true, the service:

  1. Validates that both certificate files exist and are accessible.
  2. Adds --ssl-cert-file and --ssl-key-file flags when starting the llama-server.
  3. Changes the API endpoint to https:// instead of http://.

If the certificate or key files are missing, the service will fail to start and log an error.

Certificate notes

  • The service does not generate certificates for you; you must create and sign them yourself (a sketch using OpenSSL follows this list).
  • Self-signed certificates work fine for internal use. For production, use certificates from your organization’s Certificate Authority.
  • If you change the certificate files, restart the service to pick up the new ones.
  • When using HTTPS, the API key you configure for K2 will be sent over the encrypted connection.
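For internal use, one common way to produce a self-signed certificate and key pair is with OpenSSL. This is a sketch only - it assumes OpenSSL is installed and on your PATH, that the certs folder from the example above already exists, and that localhost is the host name K2 will use to reach the service:

(PowerShell script)

# Creates a self-signed certificate and private key valid for one year.
openssl req -x509 -newkey rsa:2048 -sha256 -days 365 -nodes -keyout certs/server.key -out certs/server.crt -subj "/CN=localhost"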

Considerations

  • Running the K2 Intelligence Service will close all other llama-server processes.

  • In your models.ini file, if threads is defined as 2 globally, each model has 2 threads set, so two models means 4 threads.

  • In your appsettings.json file, when verbose is set to true, raw requests and responses are logged, so only use verbose for debugging.

  • The K2 Intelligence Service install includes all currently available CPU binaries for llama-server. In extremely rare cases, the CPU binary that llama-server chooses to load may be wrong for your CPU. If this happens, the logs show which binaries were loaded and then fail at some point with no clear indicator. Remove the suspect binary to test whether this is what has happened.