Monday, June 15, 2026

Adding a Web UI to Your Local LLM - Open WebUI, Docker & GPU Acceleration | EngineerVisions

Adding a Web UI to Your Local LLM — Open WebUI, Docker & GPU Acceleration

📌 Part 2 of the Local LLM Series: In the previous post, I covered how to install Ollama on Linux Mint and run Phi-4 Mini and Gemma3:4B locally on an MSI Katana with an RTX 3050. There I mentioned Open WebUI as a web interface only briefly and never went into how to set it up properly, especially when things go wrong with Docker networking and GPU access. That is exactly what this post covers.

The terminal chat that Ollama provides works perfectly well for quick experiments, but for longer sessions - especially when you are comparing model outputs or picking up a conversation later - a proper browser-based interface makes a noticeable difference. Open WebUI is a self-hosted interface that looks and feels very much like ChatGPT or Claude, and it connects directly to your local Ollama instance. No data leaves your machine.

What I did not expect was the number of small but meaningful hurdles involved in getting it running correctly - particularly around Docker networking and getting the GPU accessible from inside the container. This post documents every step and every error I hit along the way.

Installation Options

Open WebUI supports several installation methods: Docker, pip, uv, and a standalone desktop app. I tried both Docker and the desktop app. They are functionally nearly identical, so I uninstalled the desktop app and stuck with Docker - it is easier to manage, restart, and update in place. Everything from here on assumes Docker.

📦 Available install methods: Docker (recommended for most users), pip, uv, and a native desktop app. Full documentation is at docs.openwebui.com.

Basic Docker Setup

Start by pulling the main image and running the container. The image is lightweight and the setup is a single command:

Terminal
$ docker pull ghcr.io/open-webui/open-webui:main
Terminal
$ docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open: http://localhost:3000

This starts a container named open-webui, maps port 3000 on your host to port 8080 inside the container, and keeps application data in the open-webui named volume so it survives removing the container. Open a browser tab and navigate to localhost:3000 - you should see the Open WebUI login screen.

You will notice, however, that no Ollama models appear in the model selector. The interface loads but cannot reach your local Ollama instance. This is a networking problem, and it took me a while to track down exactly why.

The Networking Problem

The root cause is straightforward once you see it. By default, Ollama binds only to 127.0.0.1 - meaning it only accepts connections from the same machine it is running on. You can confirm this with:

Terminal
$ ss -tulpn | grep 11434
tcp   LISTEN 0      4096       127.0.0.1:11434      0.0.0.0:*
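If you would rather run this check from a script than read the output by eye, the ss output can be classified with a few lines of awk. This is only a sketch: listen_scope is a hypothetical helper name, and it assumes the local address shows up as a field ending in :PORT, as it does in the output above.

```shell
# listen_scope PORT: read `ss -tln` style output on stdin and report where
# the given port is bound. Hypothetical helper; prints nothing if the port
# is not listening at all.
listen_scope() {
  awk -v port="${1:-11434}" '
    /LISTEN/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ (":" port "$")) {
          addr = $i
          sub(":" port "$", "", addr)   # strip the port, keep the address
          if (addr == "127.0.0.1" || addr == "[::1]") { print "loopback-only" }
          else if (addr == "0.0.0.0" || addr == "*" || addr == "[::]") { print "all-interfaces" }
          else { print "bound-to " addr }
          exit
        }
      }
    }'
}

# Usage on the host:
#   ss -tln | listen_scope 11434
```

A result of loopback-only is the situation described here: the service is up, but only processes on the host itself can reach it.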

The Docker container has its own isolated network namespace - its localhost is not the host machine's localhost. Even though I had passed --add-host=host.docker.internal:host-gateway when starting the container, that flag only adds a hosts entry inside the container that maps host.docker.internal to the host's gateway address (the docker0 bridge IP on a default setup). Ollama is still listening only on 127.0.0.1, so when Open WebUI inside the container tries to reach http://host.docker.internal:11434, nothing is listening on that address and the connection is refused. The architecture looks like this:

Network Diagram
            Host machine (Linux Mint)

                Ollama
                  |
                  |
            127.0.0.1:11434
                  |
                  X  ← Docker cannot reach this
                  |
        -------------------------
        | Docker container      |
        | Open WebUI            |
        |                       |
        | localhost != host     |
        -------------------------

The fix is to make Ollama listen on the Docker bridge interface instead, so that containers on the default Docker network can actually reach it.

Fixing the Connection - Binding Ollama to the Docker Bridge

First, find the IP address of the Docker bridge interface on your host:

Terminal
$ ip addr show docker0
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether de:09:f4:3c:a7:23 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

The Docker bridge IP here is 172.17.0.1. Now edit the Ollama systemd service to tell it to listen on that address:

Terminal
$ sudo systemctl edit ollama

This opens a drop-in override file in your editor. Add the following between the two comment markers:

/etc/systemd/system/ollama.service.d/override.conf
### Editing /etc/systemd/system/ollama.service.d/override.conf
### Anything between here and the comment below will become the contents of the drop-in file

[Service]
Environment="OLLAMA_HOST=172.17.0.1"

### Edits below this comment will be discarded

Save and exit. You should see a confirmation message: Successfully installed edited file '/etc/systemd/system/ollama.service.d/override.conf'. Now reload the daemon and restart both Ollama and the Open WebUI container:

Terminal
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
$ docker restart open-webui
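For repeat installs, the same drop-in can be generated without opening an editor. The following is a minimal sketch rather than a tested installer: bridge_ip and render_override are hypothetical helper names, and the override path is the one systemctl edit reported above.

```shell
# bridge_ip: read `ip addr show <iface>` output on stdin and print the
# first IPv4 address without its prefix length.
bridge_ip() {
  awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# render_override ADDRESS: print drop-in contents that bind Ollama to ADDRESS.
render_override() {
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$1"
}

# Usage (needs root for the write and the restart):
#   ip=$(ip addr show docker0 | bridge_ip)
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   render_override "$ip" | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Keeping the two helpers free of side effects means the generated file can be inspected before anything is written under /etc.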

Refresh localhost:3000 in your browser. Your locally installed Ollama models should now appear in the model selector and one will be pre-selected by default.
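If the model list stays empty, it helps to check whether Ollama answers on the bridge address at all. The sketch below assumes curl is installed on the host and uses /api/tags, the Ollama endpoint that lists local models. The function names are hypothetical.

```shell
# ollama_base_url HOST: turn an OLLAMA_HOST-style value into a base URL,
# assuming Ollama's default port 11434 when none is given.
ollama_base_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *:*)                printf 'http://%s\n' "$1" ;;
    *)                  printf 'http://%s:11434\n' "$1" ;;
  esac
}

# check_ollama HOST: report whether the model-list endpoint responds.
check_ollama() {
  if curl -fsS --max-time 3 "$(ollama_base_url "$1")/api/tags" >/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Usage on the host:
#   check_ollama 172.17.0.1
```

One side effect worth knowing: with OLLAMA_HOST set to the bridge address, the service no longer listens on 127.0.0.1. The ollama CLI on the host defaults to 127.0.0.1:11434, so it needs the same variable, for example OLLAMA_HOST=172.17.0.1 ollama list.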

Running Open WebUI in GPU Mode

The standard :main image runs everything inside the container on the CPU. Chat inference itself is handled by Ollama on the host, which already uses the GPU regardless of which Open WebUI image you pick. The CUDA variant matters for the models Open WebUI runs inside its own container, chiefly the sentence-transformer embeddings used for document retrieval. If you want those to use your GPU, you need the CUDA variant.

Pulling the CUDA Image

Terminal
$ docker pull ghcr.io/open-webui/open-webui:v0.9.6-cuda

Fair warning: this image is large. It ships with full CUDA runtime libraries bundled inside.

Image                     Content Size   Disk Usage
open-webui:v0.9.6-cuda    5.63 GB        16.9 GB

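Once pulled, remove the existing container first, since the name open-webui and port 3000 are already taken. Your chats and settings live in the open-webui volume and are not affected. Then try running the CUDA image with GPU access:

Terminal
$ docker rm -f open-webui
$ docker run -d \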
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --gpus all \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:v0.9.6-cuda

If everything is configured correctly, the container starts silently and you can open your browser. However, on a fresh Linux system, you will likely hit this error:

Docker Error
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Run 'docker run --help' for more information

This means Docker cannot find the NVIDIA Container Toolkit. Your NVIDIA driver is working fine - but there is a separate middleware layer called nvidia-container-toolkit that lets Docker communicate with the GPU. It is not installed by default even when your driver is.

Installing the NVIDIA Container Toolkit

First, confirm your NVIDIA driver is still healthy:

Terminal
$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   52C    P8              7W /   60W |     196MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1277      G   /usr/lib/xorg/Xorg                        4MiB |
|    0   N/A  N/A           19922      C   /usr/local/bin/python3                  174MiB |
+-----------------------------------------------------------------------------------------+

Good. Now install the toolkit - this adds the NVIDIA repository and installs the package:

Terminal
# Add the NVIDIA Container Toolkit GPG key and repository
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

$ sudo apt update
$ sudo apt install -y nvidia-container-toolkit

Once installed, configure Docker to use the NVIDIA runtime and restart the Docker daemon:

Terminal
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
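To confirm that the configure step took effect, you can look for nvidia in the Runtimes line of docker info. A small sketch, with has_nvidia_runtime as a hypothetical helper name:

```shell
# has_nvidia_runtime: read `docker info` output on stdin and succeed if the
# "Runtimes:" line lists nvidia.
has_nvidia_runtime() {
  grep -E '^[[:space:]]*Runtimes:' | grep -qw 'nvidia'
}

# Usage:
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

If the check fails, /etc/docker/daemon.json is the file nvidia-ctk runtime configure edits, so that is the first place to look.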

Verifying GPU Access Inside Docker

Before re-running the Open WebUI container, verify that Docker can now see the GPU. This command pulls a minimal CUDA image and runs nvidia-smi inside it - if the output matches what you see on the host, everything is wired up correctly:

Terminal
$ docker run --rm --gpus all nvidia/cuda:12.9.2-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.9.2-base-ubuntu22.04' locally
12.9.2-base-ubuntu22.04: Pulling from nvidia/cuda
1dca0c9bc5f7: Pull complete
504205c74aa4: Pull complete
33b2e09f3b0d: Pull complete
95b418228d05: Pull complete
12eeb868872b: Pull complete
40d16f30db40: Pull complete
4f4fb700ef54: Pull complete
Digest: sha256:8cd34c18c70fcb862f9829e7a2a04597feeb5f5d221904c77610b60c78c00ba4
Status: Downloaded newer image for nvidia/cuda:12.9.2-base-ubuntu22.04
Sun Jun 14 15:22:37 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   46C    P0             12W /   60W |      14MiB /   4096MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The GPU is visible from inside the container. Now remove the container left behind by the failed attempt and start the CUDA-enabled one properly.

Running the GPU-Enabled Container

Terminal
$ docker rm -f open-webui
$ docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --gpus all \
  -v open-webui:/app/backend/data \
  -e HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN> \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:v0.9.6-cuda

💡 Add a Hugging Face token: Passing -e HF_TOKEN=<token> with a read-only Hugging Face token can speed up the first startup noticeably. Open WebUI downloads sentence-transformer embedding models from Hugging Face on first boot, and authenticated requests are less likely to be rate limited. Create a free read token at huggingface.co/settings/tokens.

You can tail the container logs to watch the startup sequence. The key lines to look for are "CUDA enabled" near the top and "Started server process" once the models have loaded:

Terminal - docker logs -f open-webui
No WEBUI_SECRET_KEY environment variable set, loading from file.
Generating new WEBUI_SECRET_KEY...
Loading WEBUI_SECRET_KEY from .webui_secret_key
CUDA enabled — extending LD_LIBRARY_PATH for torch/cudnn libraries.
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [open_webui.env] 

WARNING: CORS_ALLOW_ORIGIN IS SET TO '*' - NOT RECOMMENDED FOR PRODUCTION DEPLOYMENTS.

WARNI [langchain_community.utils.user_agent] USER_AGENT environment variable not set, consider setting it to identify your requests.

 ██████╗ ██████╗ ███████╗███╗   ██╗    ██╗    ██╗███████╗██████╗ ██╗   ██╗██╗
██╔═══██╗██╔══██╗██╔════╝████╗  ██║    ██║    ██║██╔════╝██╔══██╗██║   ██║██║
██║   ██║██████╔╝█████╗  ██╔██╗ ██║    ██║ █╗ ██║█████╗  ██████╔╝██║   ██║██║
██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║    ██║███╗██║██╔══╝  ██╔══██╗██║   ██║██║
╚██████╔╝██║     ███████╗██║ ╚████║    ╚███╔███╔╝███████╗██████╔╝╚██████╔╝██║
 ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝     ╚══╝╚══╝ ╚══════╝╚═════╝  ╚═════╝ ╚═╝


v0.9.6 - building the best AI user interface.

https://github.com/open-webui/open-webui

Fetching 30 files: 100%|██████████| 30/30 [01:56<00:00,  3.90s/it]
Loading weights: 100%|██████████| 103/103 [00:00<00:00, 5293.76it/s]
BertModel LOAD REPORT from: /app/backend/data/cache/embedding/models/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/1110a243fdf4706b3f48f1d95db1a4f5529b4d41
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
2026-06-14 15:37:55.856 | INFO     | open_webui.utils.logger:start_logger:214 - GLOBAL_LOG_LEVEL: INFO
2026-06-14 15:37:55.857 | INFO     | open_webui.main:lifespan:661 - Installing external dependencies of functions and tools...
2026-06-14 15:37:55.904 | INFO     | open_webui.utils.plugin:install_frontmatter_requirements:419 - No requirements found in frontmatter.
2026-06-14 15:37:55.904 | INFO     | open_webui.utils.automations:scheduler_worker_loop:171 - Scheduler worker started (poll interval: 10s)
2026-06-14 15:40:13.061 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50272 - "GET / HTTP/1.1" 200
2026-06-14 15:40:13.246 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50274 - "GET /_app/immutable/chunks/3wvqjvW3.js HTTP/1.1" 200
2026-06-14 15:40:13.250 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50272 - "GET /_app/immutable/entry/start.C43MIJEF.js HTTP/1.1" 200
2026-06-14 15:40:13.251 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50282 - "GET /_app/immutable/entry/app.C84R6Z2n.js HTTP/1.1" 200
2026-06-14 15:40:13.253 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50286 - "GET /_app/immutable/chunks/pAaorhp4.js HTTP/1.1" 200
2026-06-14 15:40:13.263 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50284 - "GET /_app/immutable/nodes/0.p2u0EbrE.js HTTP/1.1" 200
2026-06-14 15:40:13.264 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50304 - "GET /_app/immutable/chunks/DeuB8iJ0.js HTTP/1.1" 200
2026-06-14 15:40:13.272 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50286 - "GET /_app/immutable/chunks/DC0W6J9h.js HTTP/1.1" 200
2026-06-14 15:40:13.278 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50282 - "GET /_app/immutable/chunks/CBKmxchQ.js HTTP/1.1" 200
2026-06-14 15:40:13.280 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50272 - "GET /_app/immutable/chunks/CMqz8Nsr.js HTTP/1.1" 200
2026-06-14 15:40:13.281 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50304 - "GET /_app/immutable/chunks/T8jte1Yq.js HTTP/1.1" 200
2026-06-14 15:40:13.281 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50274 - "GET /_app/immutable/chunks/C7Lxt8YS.js HTTP/1.1" 200
2026-06-14 15:40:13.284 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50284 - "GET /_app/immutable/chunks/BWUGEjTy.js HTTP/1.1" 200
2026-06-14 15:40:13.300 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50286 - "GET /_app/immutable/chunks/CaTaFyk-.js HTTP/1.1" 200
2026-06-14 15:40:13.306 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50272 - "GET /_app/immutable/chunks/C-Awsziq.js HTTP/1.1" 200
2026-06-14 15:40:13.307 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50274 - "GET /_app/immutable/chunks/QTFhdCgb.js HTTP/1.1" 200
2026-06-14 15:40:13.308 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50304 - "GET /_app/immutable/chunks/DadTtvaW.js HTTP/1.1" 200
2026-06-14 15:40:13.311 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50282 - "GET /static/custom.css HTTP/1.1" 200
2026-06-14 15:40:13.312 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50284 - "GET /static/loader.js HTTP/1.1" 200
2026-06-14 15:40:13.316 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50286 - "GET /static/splash.png HTTP/1.1" 200
2026-06-14 15:40:13.445 | INFO     | uvicorn.protocols.http.httptools_impl:send:483 - 172.17.0.1:50282 - "GET /_app/immutable/nodes/1.uvDIxH0j.js HTTP/1.1" 200

Once startup is complete, check that the container is healthy:

Terminal
$ docker ps
CONTAINER ID   IMAGE                                        COMMAND          CREATED         STATUS                   PORTS                    NAMES
98b68ce37c24   ghcr.io/open-webui/open-webui:v0.9.6-cuda   "bash start.sh"  4 minutes ago   Up 4 minutes (healthy)   0.0.0.0:3000->8080/tcp   open-webui

The (healthy) status confirms the container's health check is passing. You can now open localhost:3000 and chat with your local models through the browser interface, with the GPU fully active.
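Because the first boot spends a couple of minutes downloading the embedding model, a script that depends on the UI may want to wait for the health check instead of sleeping for a fixed time. This sketch polls docker inspect for the health status. wait_healthy is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# wait_healthy CONTAINER [ATTEMPTS] [DELAY_SECONDS]
# Poll the container's health status until it reports "healthy" or the
# attempts run out. Returns 0 on success, 1 otherwise.
wait_healthy() {
  name="$1"; attempts="${2:-30}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts checks (last status: ${status:-unknown})"
  return 1
}

# Usage:
#   wait_healthy open-webui && echo "UI is ready at http://localhost:3000"
```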

Monitoring GPU Usage

With the GPU-enabled container running, it is satisfying to watch the VRAM and utilisation numbers move in real time while chatting. Two commands cover this well:

Terminal - GPU stats (refreshes every second)
$ watch -n 1 nvidia-smi
Terminal - container resource usage
$ docker stats open-webui

⚠️ VRAM headroom on 4 GB cards: The CUDA container holds some VRAM for the embedding model it loads on startup (around 150 to 200 MB). Combined with the model layers Ollama loads for inference, you are operating close to the 4 GB ceiling with a 4B model. Avoid running the CUDA container alongside heavier models, or layers will spill into system RAM and inference will slow down significantly.
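The arithmetic behind that warning is simple enough to sketch. The inputs below are illustrative: 4096 MiB is the card's total, 2560 MiB stands in for roughly 2.5 GB of weights for a Q4-quantized model of about 4B parameters, and 196 MiB is the usage nvidia-smi showed on the host earlier. vram_headroom_mib is a hypothetical helper, and the estimate ignores the context (KV cache), which also takes VRAM.

```shell
# vram_headroom_mib TOTAL MODEL OTHER: integer MiB left after the model
# weights and other GPU users are subtracted from the card total.
# Rough estimate only; context memory is not included.
vram_headroom_mib() {
  echo $(( $1 - $2 - $3 ))
}

vram_headroom_mib 4096 2560 196   # prints 1340
```

In practice, ollama ps reports how a loaded model is split between CPU and GPU, which is a quicker way to see whether layers have already spilled into system RAM.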
· · ·

Getting Open WebUI talking to Ollama is not quite the one-liner the documentation suggests - at least not on Linux when Ollama is running as a systemd service. The Docker networking problem in particular is easy to miss if you do not know to check what address Ollama is actually bound to. But once the bind address and the NVIDIA Container Toolkit are sorted out, the setup is genuinely solid: conversation history, multiple model support, document uploads, and a clean interface, all running locally with no data leaving the machine.

In the next post, I want to explore some of the features Open WebUI adds on top of the basic chat - search engine integration, custom system prompts, and the built-in tools and skills system. More to come.

Wednesday, June 3, 2026

Weekend Project: Running a Private AI on My Laptop with Ollama

Running an LLM on My Laptop — A Weekend Experiment with Ollama

👋 Back after a long break: It has been a while since I posted anything here - the last one was back in 2014. Life, work, and a pile of unfinished side projects got in the way. But I recently got into something exciting enough to write about again: running AI models locally on my own machine, with no cloud and no API keys. This post is the first in what I hope will be a new series of experiments on EngineerVisions.

If you have been following the AI space at all, you have probably been using tools like ChatGPT or Gemini through a web browser. They are impressive, but there is always this nagging feeling — your prompts are being sent to someone else's server, logged, possibly used for training, and you have a usage limit. For a lot of experiments I want to run, that setup is not ideal.

So I decided to try running a language model entirely on my laptop. No internet connection required, no API key, no subscription. Just the model, my GPU, and a terminal. Here is how it went.

Why This Made Sense for Me

My background on this blog has always been computer vision — working with OpenCV, JavaCV, image processing pipelines. Naturally, I have been watching the progression from classical vision algorithms to deep learning to now these large multimodal models with a lot of interest. Running a language model locally felt like a natural next experiment - it is the same general idea of getting a powerful model to run on your own hardware rather than relying on a cloud service.

The other motivation: I have a Raspberry Pi 4 sitting on my desk doing not much. I wanted to eventually get a small model running on it too. The laptop would be step one.

The Hardware

Here is what I was working with for this experiment:

Component   Spec
Laptop      MSI Katana GF66 11UC
CPU         Intel Core i7-11800H (8 cores / 16 threads)
GPU         NVIDIA RTX 3050 (4 GB VRAM)
RAM         16 GB DDR4
OS          Linux Mint 22.3 (x64)

The RTX 3050 is not a high-end card, but it has 4 GB of dedicated VRAM, which is enough to run smaller models entirely on the GPU. That means fast inference - not the agonizingly slow CPU-only speeds you might expect on a consumer laptop.

💡 NVIDIA drivers on Linux Mint: Linux Mint's Driver Manager makes installing the proprietary NVIDIA driver straightforward - no manual CUDA setup needed. Ollama ships with its own CUDA runtime libraries, so you only need the driver.

Prerequisites

Check that the NVIDIA driver is installed and working by running nvidia-smi:

Terminal
$ nvidia-smi  
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0            752W /   60W |      14MiB /   4096MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1267      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

After confirming that the driver was working, I installed nvidia-cuda-toolkit, which provides nvcc. This step is not required: Ollama includes its own pre-compiled CUDA libraries and detects and uses your NVIDIA GPU out of the box. I installed the toolkit in case I ever want to compile Ollama from source or customize the underlying llama.cpp engine that Ollama relies on.

Terminal
$ sudo apt install nvidia-cuda-toolkit        
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libnvidia-cfg1-590 libnvidia-common-590 libnvidia-compute-590 libnvidia-decode-590 libnvidia-egl-wayland1:i386 libnvidia-encode-590 libnvidia-extra-590 libnvidia-fbc1-590 libnvidia-gl-590 libwayland-server0:i386
  nvidia-compute-utils-590 nvidia-firmware-590-590.48.01 nvidia-kernel-source-590-open nvidia-utils-590 xserver-xorg-video-nvidia-590
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  ca-certificates-java cpp-12 fonts-dejavu-extra g++-12 gcc-12 gcc-12-base ibverbs-providers java-common libaccinj64-12.0 libamd-comgr2 libamdhip64-5 libatk-wrapper-java libatk-wrapper-java-jni libcu++-dev libcub-dev libcublas12
  libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11 libcusolvermg11 libcusparse12 libgcc-12-dev libgl-dev libglx-dev libhsa-runtime64-1 libhsakmt1
  libhwloc-plugins libhwloc15 libibumad3 libjpeg62 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12 libnppitc12 libnpps12 libnvblas12 libnvidia-ml-dev libnvjitlink12 libnvjpeg12
  libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libpfm4 libpthread-stubs0-dev librdmacm1t64 libstdc++-12-dev libtbb-dev libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libucx0 libvdpau-dev libx11-dev libxau-dev
  libxcb1-dev libxdmcp-dev node-html5shiv nsight-compute nsight-compute-target nsight-systems nsight-systems-target nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit-doc nvidia-opencl-dev nvidia-profiler nvidia-visual-profiler
  ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless x11proto-dev xorg-sgml-doctools xtrans-dev
Suggested packages:
  gcc-12-locales cpp-12-doc g++-12-multilib gcc-12-doc gcc-12-multilib default-jre libhwloc-contrib-plugins libstdc++-12-doc libtbb-doc libvdpau-doc libx11-doc libxcb-doc nvidia-cuda-samples opencl-clhpp-headers-doc fonts-nanum
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
Recommended packages:
  libnvcuvid1
The following NEW packages will be installed:
  ca-certificates-java cpp-12 fonts-dejavu-extra g++-12 gcc-12 gcc-12-base ibverbs-providers java-common libaccinj64-12.0 libamd-comgr2 libamdhip64-5 libatk-wrapper-java libatk-wrapper-java-jni libcu++-dev libcub-dev libcublas12
  libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11 libcusolvermg11 libcusparse12 libgcc-12-dev libgl-dev libglx-dev libhsa-runtime64-1 libhsakmt1
  libhwloc-plugins libhwloc15 libibumad3 libjpeg62 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12 libnppitc12 libnpps12 libnvblas12 libnvidia-ml-dev libnvjitlink12 libnvjpeg12
  libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libpfm4 libpthread-stubs0-dev librdmacm1t64 libstdc++-12-dev libtbb-dev libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libucx0 libvdpau-dev libx11-dev libxau-dev
  libxcb1-dev libxdmcp-dev node-html5shiv nsight-compute nsight-compute-target nsight-systems nsight-systems-target nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit nvidia-cuda-toolkit-doc nvidia-opencl-dev nvidia-profiler
  nvidia-visual-profiler ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless x11proto-dev xorg-sgml-doctools xtrans-dev
0 upgraded, 91 newly installed, 0 to remove and 69 not upgraded.
Need to get 2,261 MB of archives.
After this operation, 6,892 MB of additional disk space will be used.

Once the installation was complete, I verified it by running nvcc --version:

Terminal
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

The Tool: Ollama

Ollama is the tool that makes local LLM inference easy. Think of it like a package manager for AI models - it downloads models (typically pre-quantized), manages GPU memory, and exposes a REST API, all through a single binary. It detects your NVIDIA GPU automatically and figures out how many model layers to load into VRAM. You do not have to configure any of that manually.

It also has a growing library of models you can pull with one command, similar to how docker pull works.
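As a small illustration of the REST API mentioned above, the sketch below posts a prompt to the /api/generate endpoint on Ollama's default address, 127.0.0.1:11434. generate_payload and ask_ollama are hypothetical names, and OLLAMA_URL is an override variable invented for this example. The payload builder does no JSON escaping, so prompts containing quotes or backslashes would need a proper JSON tool such as jq.

```shell
# generate_payload MODEL PROMPT: build a non-streaming request body.
# No JSON escaping is done, so keep the prompt free of quotes and backslashes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# ask_ollama MODEL PROMPT: POST the body to Ollama's /api/generate endpoint.
# OLLAMA_URL is an assumed override for this sketch, not something Ollama reads.
ask_ollama() {
  curl -fsS "${OLLAMA_URL:-http://127.0.0.1:11434}/api/generate" \
    -d "$(generate_payload "$1" "$2")"
}

# Usage (the model must already be pulled):
#   ask_ollama phi4-mini "Why is the sky blue?"
```

The reply is a JSON object whose response field holds the generated text.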

Installation

One command:

Terminal
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local  
>>> Downloading ollama-linux-amd64.tar.zst
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

The installer detects your OS, downloads the right binary, and sets up Ollama as a background service. Once done, verify it:

Terminal
$ ollama --version
ollama version is 0.24.0
$ systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-28 00:53:08 IST; 14min ago
   Main PID: 19849 (ollama)
      Tasks: 21 (limit: 18725)
     Memory: 61.2M (peak: 292.1M)
        CPU: 3.397s
     CGroup: /system.slice/ollama.service
             └─19849 /usr/local/bin/ollama serve

No config files, no dependency juggling. That was the entire setup.

Picking the Models

1. Phi-4 Mini

With 4 GB of VRAM I needed to be selective. I went with Phi-4 Mini from Microsoft - a 3.8 billion parameter model that fits comfortably within 4 GB and genuinely performs well for its size.

📊 Phi-4 Mini - Quick Specs: 3.8B parameters · 128K token context window · Knowledge cutoff: early 2023 (as self-reported by the model) · VRAM usage: ~2.5 GB (Q4 quantized) · Released: February 2025

For anyone wondering about model quality - it handles multi-step reasoning, code generation, and technical questions surprisingly well for a 3.8B model. It is not GPT-4, but for a laptop running a fully offline model, the output quality is genuinely impressive.

Running It

Ollama has a simple ollama run command that takes a model name. It first looks for the model on the local machine; if the model is not there, it downloads it and then runs it.

Terminal
$ ollama run phi4-mini
pulling manifest 
pulling 3c168af1dea0: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 2.5 GB                         
pulling 813f53fdc6e5: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  655 B                         
pulling fa8235e5b48f: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.1 KB                         
pulling 8c2539a423c4: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  411 B                         
verifying sha256 digest 
writing manifest 
success 
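The same "check locally, pull if missing, then run" behaviour can be spelled out by hand. This is only an illustration of the logic, not Ollama's internal implementation. It assumes that ollama list prints a header row followed by one model per line with the name in the first column, and that an untagged name means the :latest tag. Both function names are hypothetical.

```shell
# Succeeds if the model named in $1 appears in `ollama list` output on stdin.
# Assumptions: first row is a header, first column is NAME, and a bare
# name such as "phi4-mini" refers to "phi4-mini:latest".
model_is_local() {
  awk -v want="$1" '
    NR > 1 && ($1 == want || $1 == want ":latest") { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Hypothetical wrapper: pull only when the model is missing, then start the chat.
run_model() {
  if ! ollama list | model_is_local "$1"; then
    ollama pull "$1"
  fi
  ollama run "$1"
}

# Usage: run_model phi4-mini
```

In practice ollama run already does all of this; the split is only useful in scripts where you want to pre-pull models without starting a chat session.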

I tried a few different types of prompts to get a sense of what it could do:

Sample conversations

─────────────────────────────────────────
phi4-mini · 3.8B · GPU (RTX 3050 · 4GB)
─────────────────────────────────────────

>>> Can you access the internet ?

As Phi developed by Microsoft, I don't have direct real-time browsing capabilities or active Internet connection. My responses are generated based on a mixture of licensed data, human-created content, and publicly available information up to my last training cut-off in early 2023.

If you're looking for specific answers that require the latest online resources such as current events news articles from today onward (after March 31st), I would not be able to access or retrieve those details. However, I'm here with a vast amount of general knowledge on numerous topics Up until my last update! Feel free to ask any question and I'll do my best to help you using the information I've been trained on.

>>> How do I enable internet access to you so that you can get updated information? 2023 is 3 years ago !

I appreciate your interest in ensuring I'm up-to-date with current events, but as an AI developed by Microsoft known as Phi (and previously released under the name Phi), my capabilities are not designed for real-time Internet browsing or direct updates from online sources. My responses reflect knowledge and data available until early 2023.

However, you can always ask questions about recent trends up to that time in history without expecting live updates—I'll do what I can with existing information! If you're looking at a specific field like technology advancements post-early 2023 or current events happening after my last update (which are beyond March 31st), unfortunately those details won't be accessible through me. But if there's anything else you'd need knowledge about up to early 2023, feel free to ask and I'll do the best I can!

>>> Is there a way that I can update the model or somehow may able to get internet access ?

As an AI developed by Microsoft called Phi (formerly known as Phi), you don't have direct control over updating me in real-time with new data from online sources. However, updates and improvements are periodically made internally at Microsoft's servers.

For future advancements that could potentially be incorporated into my architecture:
1. **Official Updates:** Keep track of official announcements or releases by Microsoft related to AI enhancements.
2. **Direct Support Channels:** Engage through support channels provided for users (like feedback forms) which can convey your interest in updates and improvements, including the need for enhanced real-time capabilities.

If there's a specific field you're interested in that requires up-to-date knowledge beyond my current training data cut-off point of March 2023:
1. **Stay Updated Yourself:** You could use dedicated tools or services online (like Google Scholar) to keep yourself informed about recent advancements post-March 31st.
2. **AI Development News**: Follow Microsoft, AI research publications and news platforms for any breakthroughs related directly after my last update.
Microsoft continually works on improving the technology behind me; it's a dynamic process with many factors such as privacy regulations that guide how these updates occur without compromising user data security or trustworthiness of information provided through an online assistant like Phi.
─────────────────────────────────

After these conversations I wasn't satisfied with Phi-4 Mini, so I looked at other models that would fit my hardware. Gemma3:4B looked like another good candidate, so I went ahead and installed it.

2. Gemma3:4B

Terminal
$ ollama pull gemma3:4b
pulling manifest 
pulling aeda25e63ebd: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ ▏ 3.3 GB/3.3 GB  1.5 MB/s      0s
pulling e0a42594d802: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  358 B                         
pulling dd084c7d92a3: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 8.4 KB                         
pulling 3116c5225075: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   77 B                         
pulling b6ae5839783f: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  489 B                         
verifying sha256 digest 
writing manifest 
success

Sample conversations - I started with the same first question.

─────────────────────────────────────────
gemma3 · 4B · GPU (RTX 3050 · 4GB)
─────────────────────────────────────────

>>> Can you access the internet ?

Yes, I can! I have access to Google Search and other Google services, which allows me to access and process information from the internet in real-time.
Think of it like this: when you ask me a question, I don't just rely on the knowledge I was trained on. I can go out and search the web for the most up-to-date information to give you a more complete and accurate answer.
**However, it's important to note a few things:**
* **My access is filtered:** Google has filters in place to ensure I don't generate responses that are harmful, biased, or misleading.
* **I’m constantly learning:** My knowledge is continually being updated as Google updates its search algorithms and my training data.

**To test it, you can ask me anything you'd typically search for on Google!** 😊
For example, you could ask me:
* "What's the current weather in London?"
* "What's the latest news on the Ukraine conflict?"
* "Who won the Super Bowl?"


>>> I am able to run Phi4-mini in my laptop. But it can't access the internet and was trained in 2023 March. So it's not upto date. How do I update it ? Is there a way that I can update the model or somehow may able to get internet access to it ?

You’ve hit on a really critical point – and a common limitation with models like Phi-4-mini that aren't designed for continuous internet access. You're absolutely correct that its knowledge cutoff (March 2023) and lack of internet connectivity severely limit its usefulness. Let's break down how you might approach updating it and potentially adding some level of connectivity.

**1. Updating the Model (The Biggest Challenge):**
* **No Official Updates:** Currently, there's no official mechanism for updating the Phi-4-mini model itself. The original developers (Microsoft) haven’t released updated versions with new training data.
* **Fine-tuning (The Best You Can Do):** The most realistic approach is **fine-tuning**. This involves taking the existing Phi-4-mini model and training it *further* on a smaller dataset that *does* contain more recent information.
    * **What data to use:** You’d need to compile a dataset of text relevant to the time period you want to update its knowledge. News articles, blog posts, or even carefully curated sets of facts from reputable sources would be good.
    * **Tools:** You'd typically use a framework like PyTorch or TensorFlow and libraries like Hugging Face Transformers to perform this fine-tuning. This requires some programming expertise.

**2. Adding Internet Access (More Complex, Less Reliable):**
This is significantly harder and requires a more architecturally complex approach. Here are a few potential, but challenging, methods:
* **Retrieval-Augmented Generation (RAG):** This is the most promising approach currently. Instead of updating the model directly, you create a system where Phi-4-mini *queries* an external knowledge base (which *could* be an internet-connected database).
    * **How it works:** When you ask a question, Phi-4-mini first searches an external database for relevant information. Then, it combines that information with its existing knowledge to generate a response.
    * **Implementation:** You'd need to build a system that:
        * Has an internet-connected database (e.g., using a vector database like Pinecone or Weaviate to store embeddings of web content).
        * Writes code to interface Phi-4-mini with the database.
* **Agent Frameworks:** Frameworks like LangChain and AutoGPT allow you to build agents that can interact with external tools, including the internet. You could potentially integrate Phi-4-mini into such a framework, giving it limited internet access. This is very advanced.

**3. Getting More Current Data – The Realistic Strategy:**
Given the difficulties of updating the model itself, the most practical solution is to **continuously provide it with the latest information through prompts.**
* **Contextual Prompts:** Include up-to-date information in your prompts to guide its responses. For example, "Based on the latest news as of today..."
**Resources to Explore:**
* **Hugging Face Transformers:** [https://huggingface.co/transformers/](https://huggingface.co/transformers/) - The core library for working with models like Phi-4-mini.
* **LangChain:** [https://www.langchain.com/](https://www.langchain.com/) - A powerful framework for building applications with LLMs, including RAG.
* **Pinecone/Weaviate:** [https://www.pinecone.io/](https://www.pinecone.io/) and [https://weaviate.io/](https://weaviate.io/) - Vector databases for storing and searching embeddings.
**Important Note:** Building any of these more advanced solutions (RAG or agent frameworks) requires significant technical skills in Python, machine learning, and potentially database management.
To help me give you more tailored advice, could you tell me:
* **What is your programming experience level?** (e.g., beginner, intermediate, advanced)
* **What are you hoping to *do* with Phi-4-mini?** (e.g., answer questions, generate creative text, etc.)

─────────────────────────────────

The output of Gemma3:4B was far better than that of Phi-4 Mini. I will keep using Gemma3:4B on my laptop for some time, and then I'll try running it on my Raspberry Pi 4 (if it can handle it).

Adding a Web Interface

The terminal chat works fine, but for longer sessions I wanted something more comfortable. Open WebUI is a self-hosted interface — essentially a ChatGPT clone — that connects to Ollama automatically. With Docker it is a single command:

Terminal
$ docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open: http://localhost:3000

You get conversation history, multiple model support, and a clean interface — all running locally, zero data leaving your machine.

⚠️
VRAM limit If you try a model that exceeds your VRAM (e.g., a 7B model on 4 GB), Ollama will partially offload layers to system RAM. It works, but inference slows noticeably. Stick to 3B–4B models for fully GPU-accelerated performance on 4 GB.
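One way to spot this offloading is the PROCESSOR column of ollama ps, which reads "100% GPU" for a fully GPU-resident model and shows a CPU/GPU split otherwise. The sketch below assumes that column format, which may change between Ollama versions. The function name is hypothetical.

```shell
# Reads `ollama ps` output on stdin and prints the names of loaded models
# whose row does not contain "100% GPU" (i.e. partly or fully on the CPU).
# Assumption: first row is a header and the first column is the model name.
models_not_fully_on_gpu() {
  awk 'NR > 1 && $0 !~ /100% GPU/ { print $1 }'
}

# Usage: ollama ps | models_not_fully_on_gpu
```

An empty result means every loaded model sits entirely in VRAM. Any name printed is a model that has spilled over into system RAM.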

Handy Commands to Know

Quick Reference
# List downloaded models
$ ollama list

# Pull without running
$ ollama pull mistral

# Check what's loaded in GPU memory
$ ollama ps

# Hit the REST API directly (useful for scripts)
$ curl http://localhost:11434/api/generate \
    -d '{"model":"phi4-mini","prompt":"Hello!","stream":false}'
· · ·

If you are sitting on a laptop with an NVIDIA card and running Linux, there is genuinely no reason not to try this. The setup is simpler than anything I was doing with JavaCV back in 2013, and the results are far more immediately impressive. With the systemd service that the installer creates, the model files live in /usr/share/ollama/.ollama/models (or in ~/.ollama/models if you run ollama serve as your own user). Everything is local, and you can delete it all cleanly if it is not for you.

Good to be writing here again. More experiments coming.