Wednesday, June 3, 2026

Weekend Project: Running a Private AI on My Laptop with Ollama

Running an LLM on My Laptop — A Weekend Experiment with Ollama

👋 Back after a long break
It has been a while since I posted anything here; the last post was back in 2014. Life, work, and a pile of unfinished side projects got in the way. But I recently got into something exciting enough to write about again: running AI models locally on my own machine, with no cloud and no API keys. This post is the first in what I hope will be a new series of experiments on EngineerVisions.

If you have been following the AI space at all, you have probably been using tools like ChatGPT or Gemini through a web browser. They are impressive, but there is always a nagging concern: your prompts are sent to someone else's server, logged, and possibly used for training, and you are subject to usage limits. For a lot of the experiments I want to run, that setup is not ideal.

So I decided to try running a language model entirely on my laptop. No internet connection required once the model is downloaded, no API key, no subscription. Just the model, my GPU, and a terminal. Here is how it went.

Why This Made Sense for Me

My background on this blog has always been computer vision: working with OpenCV, JavaCV, and image processing pipelines. Naturally, I have followed the progression from classical vision algorithms to deep learning, and now to large multimodal models, with a lot of interest. Running a language model locally felt like the next experiment: it is the same general idea of getting a powerful model to run on your own hardware rather than relying on a cloud service.

The other motivation: I have a Raspberry Pi 4 sitting on my desk not doing much. I want to eventually get a small model running on it too. The laptop is step one.

The Hardware

Here is what I was working with for this experiment:

Component   Spec
Laptop      MSI Katana GF66 11UC
CPU         Intel Core i7-11800H (8 cores / 16 threads)
GPU         NVIDIA RTX 3050, 4 GB VRAM
RAM         16 GB DDR4
OS          Linux Mint 22.3 (x64)

The RTX 3050 is not a high-end card, but it has 4 GB of dedicated VRAM, which is enough to run smaller models entirely on the GPU. That means fast inference — not the agonizingly slow CPU-only speeds you might expect on a consumer laptop.

💡 NVIDIA drivers on Linux Mint
Linux Mint's Driver Manager makes installing the proprietary NVIDIA driver straightforward, with no manual CUDA setup needed. Ollama ships with its own CUDA runtime libraries, so you just need the driver.

The Tool: Ollama

Ollama is the tool that makes local LLM inference easy. Think of it like a package manager for AI models: a single binary that downloads models (the library's default builds are typically already quantized), manages GPU memory, and exposes a REST API. It detects your NVIDIA GPU automatically and figures out how many model layers to load into VRAM. You do not have to configure any of that manually.

It also has a growing library of models you can pull with one command, similar to how docker pull works.

Installation

One command:

Terminal
$ curl -fsSL https://ollama.com/install.sh | sh

The installer detects your OS, downloads the right binary, and sets up Ollama as a background service. Once done, verify the install and check that the NVIDIA driver can see your GPU:

Terminal
$ ollama --version
# ollama version 0.6.x

$ nvidia-smi
# Should list your RTX 3050; this confirms the driver sees the GPU.
# Once a model is loaded, ollama ps shows whether Ollama is running it on the GPU.

No config files, no dependency juggling. That was the entire setup.
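
Since the installer registers Ollama as a background service, it can also be probed over HTTP. This is a minimal sketch, assuming the default address http://localhost:11434 (the same one used by the REST call in the quick reference further down). The function name `ollama_up` and the `OLLAMA_URL` variable are my own inventions for this example, not Ollama settings.

```shell
# Hypothetical helper: succeeds if the Ollama API answers at the given address.
# OLLAMA_URL is a variable of this sketch only, not something Ollama reads.
ollama_up() {
  local url="${OLLAMA_URL:-http://localhost:11434}"
  # -f: treat HTTP errors as failure, -s: quiet, --max-time: do not hang forever
  curl -fs --max-time 5 "$url/api/tags" >/dev/null 2>&1
}

if ollama_up; then
  echo "Ollama API is reachable"
else
  echo "Ollama API is not reachable (is the service running?)"
fi
```

The `/api/tags` endpoint lists the locally available models, so a successful call also tells you the server is ready to serve requests.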

Picking a Model: Phi-4 Mini

With 4 GB of VRAM I needed to be selective. I went with Phi-4 Mini from Microsoft — a 3.8 billion parameter model that fits comfortably within 4 GB and genuinely performs well for its size.

📊 Phi-4 Mini: Quick Specs
3.8B parameters · 128K token context window · Knowledge cutoff: June 2024 · VRAM usage: ~2.5 GB (Q4 quantized) · Released: February 2025
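
The ~2.5 GB figure is in the range that simple arithmetic suggests. Here is a rough sketch, assuming exactly 4 bits per weight. Real Q4 formats store somewhat more than that per weight, and the KV cache and runtime buffers come on top, so treat the result as a lower bound. `weights_gb` is a hypothetical helper name.

```shell
# Weights-only estimate: parameters (in billions) x bits per weight / 8 = GB.
# Ignores the KV cache and runtime buffers, so real usage is higher.
weights_gb() {
  local params_b="$1" bits="$2"
  LC_ALL=C awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

weights_gb 3.8 4   # Phi-4 Mini at 4 bits per weight -> 1.90
weights_gb 7 4     # a 7B model at 4 bits per weight -> 3.50
```

Under those assumptions the 3.8B model needs about 1.9 GB for weights alone, while a 7B model needs about 3.5 GB, which leaves very little of a 4 GB card for everything else.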

As for model quality, it handles multi-step reasoning, code generation, and technical questions surprisingly well for a 3.8B model. It is not GPT-4, but for a fully offline model running on a laptop, the output quality is impressive.

Running It

Terminal
$ ollama run phi4-mini
# Downloads ~2.5 GB on first run, cached locally after that
# Drops you directly into an interactive chat
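
Besides the interactive chat, `ollama run` also accepts the prompt as an argument and exits after answering, which is handy for trying a list of prompts in one go. This is a small sketch, not a polished tool. `run_prompts`, `OLLAMA_BIN`, and `MODEL` are names made up for this example.

```shell
# Run each non-empty line of a prompt list through the model, one call per line.
# OLLAMA_BIN and MODEL are variables of this sketch; they default to the
# real CLI and the model used in this post.
run_prompts() {
  local bin="${OLLAMA_BIN:-ollama}" model="${MODEL:-phi4-mini}" prompt
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r prompt || [ -n "$prompt" ]; do
    [ -z "$prompt" ] && continue
    # </dev/null keeps the command from swallowing the remaining prompt lines
    "$bin" run "$model" "$prompt" </dev/null
  done
}

# Usage: run_prompts < prompts.txt
```

The file name prompts.txt is only a placeholder: any text file with one prompt per line works.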

I tried a few different types of prompts to get a sense of what it could do. Here is one exchange that stood out:

Sample conversation

─────────────────────────────────────────
phi4-mini · 3.8B · GPU (RTX 3050 · 4GB)
─────────────────────────────────────────

>>> Can you access the internet ?

As Phi developed by Microsoft, I don't have direct real-time browsing capabilities or active Internet connection. My responses are generated based on a mixture of licensed data, human-created content, and publicly available information up to my last training cut-off in early 2023.

If you're looking for specific answers that require the latest online resources such as current events news articles from today onward (after March 31st), I would not be able to access or retrieve those details. However, I'm here with a vast amount of general knowledge on numerous topics Up until my last update! Feel free to ask any question and I'll do my best to help you using the information I've been trained on.

─────────────────────────────────

Adding a Web Interface

The terminal chat works fine, but for longer sessions I wanted something more comfortable. Open WebUI is a self-hosted interface — essentially a ChatGPT clone — that connects to Ollama automatically. With Docker it is a single command:

Terminal
$ docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open: http://localhost:3000

You get conversation history, multiple model support, and a clean interface — all running locally, zero data leaving your machine.

⚠️ VRAM limit
If you try a model that exceeds your VRAM (e.g., a 7B model on 4 GB), Ollama loads only part of the layers onto the GPU and runs the rest on the CPU from system RAM. It works, but inference slows noticeably. Stick to 3B–4B models for fully GPU-accelerated performance on 4 GB.
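
To see how close a loaded model is to that limit, the driver can report memory use directly. This is a small sketch, assuming `nvidia-smi` is on the PATH. `vram_summary` is a hypothetical helper name, and the numbers cover everything on the GPU (desktop included), not only Ollama.

```shell
# Turn one "used, total" line (values in MiB) into a readable summary.
# Expects the format printed by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_summary() {
  LC_ALL=C awk -F', *' 'NR == 1 {
    printf "%d / %d MiB used (%.0f%%)\n", $1, $2, 100 * $1 / $2
  }'
}

# Only query the GPU when the NVIDIA tool is actually present.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits \
    | vram_summary
fi
```

Run it while a model is loaded. `ollama ps` gives the complementary, model-side view.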

Handy Commands to Know

Quick Reference
# List downloaded models
$ ollama list

# Pull without running
$ ollama pull mistral

# Check what's loaded in GPU memory
$ ollama ps

# Hit the REST API directly (useful for scripts)
$ curl http://localhost:11434/api/generate \
    -d '{"model":"phi4-mini","prompt":"Hello!","stream":false}'
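
For scripts, the same request is easier to reuse when wrapped in a function. This is a minimal sketch, assuming the server is on the default port shown above. `generate_body` and `ask` are hypothetical names, and the escaping is deliberately basic.

```shell
# Build the JSON body for /api/generate from a model name and a prompt.
# Escapes backslashes and double quotes only; prompts containing newlines or
# other control characters need a real JSON tool such as jq.
generate_body() {
  local model="$1" prompt="$2"
  prompt=$(printf '%s' "$prompt" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$model" "$prompt"
}

# Send a prompt to the local server and print the raw JSON reply.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(generate_body "$1" "$2")"
}

# Usage: ask phi4-mini 'Why is the sky blue?'
```

With `stream` set to false the server replies with a single JSON object, and the generated text is in its `response` field.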

· · ·

If you are sitting on a laptop with an NVIDIA card and running Linux, there is no good reason not to try this. The setup is simpler than anything I was doing with JavaCV back in 2013, and the results are far more immediately impressive. With the service install on Linux, the model files live in /usr/share/ollama/.ollama/models (or ~/.ollama/models if you run the server under your own user), everything is local, and you can delete it all cleanly if it is not for you.

Good to be writing here again. More experiments coming.
