Arif Sheikh


Understanding AI Chat Settings

A Complete Guide to Large Language Model Parameters

A step-by-step guide to understanding and customizing AI chatbot settings for researchers and users.

Introduction

AI chatbots are highly configurable and can be customized for different applications. This guide explains each setting in **simple terms** to help users optimize responses for **accuracy, creativity, and efficiency**.

General Parameters

These settings control how the AI behaves when generating responses. They influence factors like response flow, creativity, consistency, and stopping conditions.

Stream Chat Response

This controls how the AI delivers its responses—either word-by-word in real time or as a fully-formed message.

ON: The model types responses **word by word**, similar to a human thinking and speaking in real time.
OFF: The model **processes the full response silently** and then displays the entire answer all at once.
**Analogy:** It’s like **watching someone type a message** live vs. **getting a full email response instantly**.
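The two modes can be sketched with a toy Python generator (illustrative only; the token list and function name are invented for this example):

```python
def stream_response(tokens):
    """Yield tokens one at a time, like a streamed chat response."""
    for token in tokens:
        yield token  # each piece is delivered as soon as it exists

tokens = ["Hello", ", ", "world", "!"]

# Streaming ON: the reader sees each piece as it arrives.
streamed = ""
for piece in stream_response(tokens):
    streamed += piece

# Streaming OFF: wait for everything, then show it at once.
full = "".join(tokens)

print(streamed == full)  # True: same text, different delivery
```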

Function Calling

Some advanced LLMs can interact with **external functions or APIs** to fetch real-world data or execute commands.

Enabled: The AI can perform real-world tasks like checking the weather, setting reminders, or running calculations.
Disabled: The AI only generates responses based on its internal knowledge, without external actions.
**Analogy:** Imagine a chatbot that can either **just chat** or **also act as a personal assistant** (e.g., answering "What’s the weather like?").
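A minimal sketch of the idea, assuming the model emits a structured call as JSON that the host application then executes (the `get_weather` function and its output are made up for illustration):

```python
import json

def get_weather(city):
    # Stand-in for a real weather API lookup.
    return f"Sunny in {city}"

# The host app maps function names the model may "call" to real code.
TOOLS = {"get_weather": get_weather}

# Pretend the model returned this JSON instead of a chat message:
model_output = '{"function": "get_weather", "arguments": {"city": "Oslo"}}'

call = json.loads(model_output)
result = TOOLS[call["function"]](**call["arguments"])
print(result)  # Sunny in Oslo
```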

Seed

A **seed value** ensures that the model generates the **same response** for the same input, making results **predictable and reproducible**.

With a Seed: Asking "What is AI?" multiple times will always return the same answer.
Without a Seed: The response may vary slightly each time.
**Analogy:** Think of it like **rolling dice**. A fixed seed is like **weighted dice**—it always lands on the same number.
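Seeding a random generator in Python shows the same effect (the canned answers here are invented stand-ins for model output):

```python
import random

def fake_reply(prompt, seed=None):
    """Toy stand-in for a model: pick a wording at random, optionally seeded."""
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    wordings = [
        "AI is machine intelligence.",
        "AI means machines that learn.",
        "AI is software that mimics reasoning.",
    ]
    return rng.choice(wordings)

# Same seed, same prompt -> identical answer every time.
a = fake_reply("What is AI?", seed=42)
b = fake_reply("What is AI?", seed=42)
print(a == b)  # True
```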

Stop Sequence

Defines **special words or characters** that tell the AI when to **stop generating text**.

If the stop sequence is set to **"END"**, the AI will stop generating text as soon as it encounters this word.
**Analogy:** It’s like a **director yelling "Cut!"** on a movie set to stop an actor from continuing.
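A sketch of how a host application might enforce a stop sequence on generated text (the example text is invented):

```python
def apply_stop(text, stop="END"):
    """Cut generation at the first occurrence of the stop sequence."""
    idx = text.find(stop)
    return text if idx == -1 else text[:idx]

raw = "Here is the summary. END And here is text we never wanted."
print(apply_stop(raw))  # everything after "END" is discarded
```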

Temperature

Temperature controls **how creative or predictable** the AI’s responses are.

High (e.g., 1.0): The AI generates more **creative and varied** responses.
Low (e.g., 0.2): The AI **sticks to safe, factual** responses.
**Analogy:** Imagine an artist painting. A **high-temperature** artist experiments with wild colors, while a **low-temperature** artist follows a reference image closely.
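Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities, so low values sharpen the distribution and high values flatten it. A self-contained sketch with invented scores:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # sharply favors the top word
high = softmax_with_temperature(logits, 1.0)  # spreads probability around
print(low[0] > high[0])  # True: low temperature concentrates on the best token
```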

Reasoning Effort

This setting controls **how much time** the AI spends **thinking before responding**.

Higher effort: The AI produces **more detailed and thoughtful** responses but takes longer.
Lower effort: The AI **responds quickly**, but answers may be **simpler**.
**Analogy:** Think of a student answering a test. **Rushing** gives **quick but shallow** answers, while **taking more time** results in **better explanations**.

Mirostat Parameters (Balance Control)

Mirostat is an **advanced dynamic control mechanism** designed to **adjust randomness** in AI-generated responses. It helps maintain a balance between **coherent, predictable text** and **creative, varied responses**.

Mirostat

Mirostat dynamically adjusts **temperature and randomness** to **keep responses balanced**. It prevents the model from becoming **too random or too deterministic** during a conversation.

Enabled: The AI automatically **self-regulates** randomness and creativity.
Disabled: You need to manually control randomness using **Temperature, Top-K, and Top-P**.
**Analogy:** Think of Mirostat like a **thermostat** for AI creativity—it adjusts responses dynamically so they don’t get **too wild or too robotic**.
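A deliberately simplified, hypothetical feedback loop in the spirit of Mirostat: cool things down when output is more surprising than the target, warm them up when it is too predictable. The real algorithm and its update rule differ; `tau` and `eta` here only mirror the parameter names described in this section.

```python
def adjust_temperature(temperature, observed_surprise, tau=5.0, eta=0.1):
    """Nudge temperature toward the target surprise level tau at rate eta."""
    error = observed_surprise - tau  # positive = output was too surprising
    return max(0.1, temperature - eta * error)

t = 1.0
t = adjust_temperature(t, observed_surprise=8.0)  # too surprising -> cool down
print(t < 1.0)  # True
```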

Mirostat Eta

This parameter controls **how fast** the AI adapts to randomness and adjusts its response style.

Higher values (e.g., 1.0+): The AI **quickly adapts** and adjusts its response randomness on the fly.
Lower values (e.g., 0.1): The AI **adjusts gradually**, making slower changes.
**Analogy:** It’s like a car’s **cruise control sensitivity**—higher reacts to speed changes faster, while lower adjusts more smoothly.

Mirostat Tau (t)

Mirostat Tau defines the **level of "surprise"** in the AI’s word selection. A lower value makes responses **more controlled**, while a higher value introduces **more variety**.

Lower values (e.g., 2.0): AI sticks to **predictable and logical** words.
Higher values (e.g., 8.0): AI allows for **more surprising and diverse** responses.
**Analogy:** It’s like **deciding how much to improvise** in a speech—low tau follows the script, high tau allows for spontaneous creativity.

Word Choice Controls

These parameters control how the AI **chooses words** during text generation. They determine whether responses are **precise and focused** or **diverse and creative**.

Top-K

Top-K limits the **vocabulary pool** by selecting words from the **K most likely choices**. The lower the value, the **more deterministic** the AI’s response.

Lower K (e.g., 10): AI picks from a **small set of highly probable** words → safer, structured responses.
Higher K (e.g., 100): AI picks from a **larger set** → more diverse but potentially random wording.
**Analogy:** It’s like choosing words from a **predefined vocabulary list** vs. having **the entire dictionary available**.
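A sketch of Top-K filtering over a toy probability table (the words and numbers are invented):

```python
def top_k_filter(word_probs, k):
    """Keep only the k most probable words as sampling candidates."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

probs = {"the": 0.4, "a": 0.3, "cat": 0.2, "zebra": 0.07, "quark": 0.03}
print(top_k_filter(probs, 2))  # only the two most likely words survive
```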

Top-P (Nucleus Sampling)

Instead of selecting a **fixed number** of words (like Top-K), Top-P selects words **whose cumulative probability adds up to a given threshold**.

Lower P (e.g., 0.3): AI **sticks to the safest words** → fewer surprises.
Higher P (e.g., 0.9): AI **considers a broader range of words** → richer, more unpredictable responses.
**Analogy:** Instead of picking the top 10 words (Top-K), Top-P picks **enough words to reach 90% confidence**.

Min-P

Sets a **minimum probability floor**: any word whose probability falls below a set fraction of the **most likely word's** probability is dropped from consideration.

Higher Min-P: AI filters out **more of the unlikely words**, keeping responses **focused and coherent**.
Lower Min-P: AI keeps **more of the long tail**, allowing **more diverse** word choices.
**Analogy:** It’s like an audition cutoff: anyone scoring below a set fraction of the top performer’s score is eliminated.

Frequency Penalty

Prevents AI from **repeating the same words too often**, promoting more **varied** responses.

Higher Value (e.g., 1.5): Strong penalty → AI **avoids repeating words**.
Lower Value (e.g., 0.2): Minimal penalty → AI **may reuse the same words more frequently**.
**Analogy:** Like a teacher deducting points if a student **uses the same word too many times** in an essay.
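One common way to implement this (a sketch, not any specific model's exact formula) is to subtract a penalty proportional to each word's usage count from its score:

```python
from collections import Counter

def penalize_logits(logits, generated_tokens, penalty):
    """Lower each word's score in proportion to how often it already appeared."""
    counts = Counter(generated_tokens)
    return {w: score - penalty * counts[w] for w, score in logits.items()}

logits = {"good": 2.0, "great": 1.8}
history = ["good", "good", "good"]  # "good" was already used three times
adjusted = penalize_logits(logits, history, penalty=0.2)
print(adjusted["good"] < adjusted["great"])  # True: repetition flips the preference
```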

Repeat Last N

Sets how far back the AI looks (the last N tokens) when applying repetition penalties to **recently used phrases**.

Lower N: AI **remembers fewer past words**, increasing the risk of repetition.
Higher N: AI **remembers more**, avoiding word-for-word repeats.
**Analogy:** Like a public speaker making sure **they don’t repeat the same phrase** in a talk.

Tfs Z (Tail-Free Sampling)

Ensures that uncommon words **don’t sneak in too frequently** while still keeping responses **diverse**.

Lower Tfs Z: AI prefers **common, expected** words.
Higher Tfs Z: AI allows **more rare words**, which can add depth or make responses sound odd.
**Analogy:** Like **editing a script** to keep **focus on the main topic** instead of adding unnecessary complexity.

Memory and Processing

These settings control how much **conversation history the AI remembers** and how it processes text. They impact the **quality, continuity, and efficiency** of AI-generated responses.

Context Length

Determines how many **previous words or messages** the AI remembers when generating responses. A longer context helps maintain continuity in long conversations.

Short Context (e.g., 512 tokens): AI **forgets** older messages quickly, leading to **disjointed responses**.
Long Context (e.g., 4,000+ tokens): AI **remembers more conversation history**, improving coherence.
**Analogy:** It’s like a person remembering only **the last sentence vs. an entire conversation**.
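Mechanically, a fixed context window just means the oldest tokens are dropped once the limit is reached; a trivial sketch:

```python
def trim_context(tokens, context_length):
    """Keep only the most recent tokens that fit the context window."""
    return tokens[-context_length:]

conversation = ["msg1", "msg2", "msg3", "msg4", "msg5"]
print(trim_context(conversation, 2))  # only the two newest turns survive
```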

Batch Size

Controls **how many tokens** the AI processes **at once** while working through the input.

Small Batch Size (e.g., 64): AI **processes fewer tokens at a time**, making it **slower but more memory efficient**.
Large Batch Size (e.g., 512+): AI **processes more tokens at once**, improving **speed but consuming more memory**.
**Analogy:** It’s like a chef **preparing one dish at a time vs. cooking multiple meals in parallel**.

Tokens To Keep On Context Refresh

Defines how many **tokens (words or word fragments)** are preserved when the AI **refreshes its memory**. Affects how well the AI maintains **long-term consistency** in responses.

Low (e.g., 100 tokens): AI **forgets most past context**, making responses **less coherent**.
High (e.g., 2,000+ tokens): AI **remembers key parts of past conversations**, improving continuity.
**Analogy:** Like keeping **important notes from a meeting** instead of **forgetting everything**.

Max Tokens

Sets a **hard limit** on how many **tokens the AI can generate** in a single response.

Low Limit (e.g., 50 tokens): AI gives **short, concise responses**.
High Limit (e.g., 500+ tokens): AI provides **longer, more detailed answers**.
**Analogy:** It’s like setting a **word limit** on an essay—you can **summarize** or **write a full report**.
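A sketch of a generation loop honoring a token budget (the token list stands in for model output):

```python
def generate(tokens_available, max_tokens):
    """Stop emitting tokens once the max_tokens budget is spent."""
    out = []
    for token in tokens_available:
        if len(out) >= max_tokens:
            break  # hard limit reached, even mid-thought
        out.append(token)
    return out

reply = generate(["A", "long", "detailed", "answer", "continues"], max_tokens=3)
print(reply)  # only the first three tokens are produced
```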

Computer Performance Settings

These settings determine **how efficiently the AI model runs** on your machine. They help optimize memory usage, **CPU and GPU performance**, and overall system stability.

use_mmap

Enables **memory-mapped file access**, letting the model file be read from disk **on demand** instead of being loaded fully into **RAM**. This helps systems with **low RAM capacity** run large models more efficiently.

Enabled: Saves RAM by loading model parts **on demand** from disk.
Disabled: Keeps everything in RAM, which may cause **out-of-memory crashes**.
**Analogy:** Like streaming a movie **instead of downloading** it all at once to save storage space.

use_mlock

Keeps the AI model **locked in RAM**, preventing the operating system from **swapping** it to disk. This can **improve response speed** but uses more memory.

Enabled: The model stays in **fast-access RAM**, making responses **quicker**.
Disabled: The system may **swap data to disk**, slowing down performance.
**Analogy:** Like keeping a **frequently used book** open on your desk instead of putting it back on the shelf.

num_thread

Sets how many **CPU threads** are used to process AI computations. A higher number improves **speed**, but excessive use may slow down **other tasks** on your computer.

Low (e.g., 1-4 threads): Uses less CPU but **slower responses**.
High (e.g., max available threads): **Faster AI** but may cause **lag** on other applications.
**Analogy:** Like cooking with **one burner vs. using all burners** on a stove.

num_gpu

Defines how many **model layers** are offloaded to the GPU for acceleration. More layers on the GPU mean faster processing, but **only if the model and hardware support GPU inference**.

0 (CPU-only): Uses CPU, **slower but widely supported**.
1+ (GPU-enabled): Offloads layers to the **graphics card**, significantly increasing speed.
**Analogy:** Like using a **gaming graphics card** to process high-resolution images instead of relying on a basic CPU.

Quick Tips for AI Optimization

For Factual Answers:
  • Lower **Temperature** (0.2-0.3) → Reduces randomness.
  • Higher **Reasoning Effort** → AI thinks deeper before answering.
  • Moderate **Context Length** → Retains past details without memory overload.
For Creative Writing:
  • Higher **Temperature** (0.7-1.0) → Allows for creative variations.
  • Higher **Top-P** (0.9) → Ensures diverse vocabulary usage.
  • Longer **Context Length** → Helps AI maintain storytelling consistency.
For Conversational AI:
  • Enable **Mirostat** → Keeps randomness under control.
  • Use **Stop Sequences** → Prevents excessive text generation.
  • Lower **Frequency Penalty** → Allows the AI to reuse phrases naturally.
For Fast AI Performance:
  • Enable **use_mmap** → Saves RAM by streaming models from disk.
  • Enable **use_mlock** → Keeps models in RAM for quick access.
  • Adjust **num_thread & num_gpu** based on your hardware.
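Many local runners accept these settings together as a single options object. As one concrete example, Ollama uses option names like the ones below; this hypothetical "factual answers" preset is only a sketch, and you should check your runner's documentation for the exact names and valid ranges.

```python
# Hypothetical preset for factual Q&A, using Ollama-style option names.
factual_preset = {
    "temperature": 0.2,   # low randomness
    "top_p": 0.5,         # conservative vocabulary
    "top_k": 20,          # small candidate pool
    "seed": 42,           # reproducible answers
    "num_ctx": 4096,      # context length in tokens
    "num_predict": 300,   # max tokens to generate
    "stop": ["END"],      # stop sequence
}
print(sorted(factual_preset))
```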