Running an LLM on My Laptop — A Weekend Experiment with Ollama
If you have been following the AI space at all, you have probably been using tools like ChatGPT or Gemini through a web browser. They are impressive, but there is always this nagging feeling — your prompts are being sent to someone else's server, logged, possibly used for training, and you have a usage limit. For a lot of experiments I want to run, that setup is not ideal.
So I decided to try running a language model entirely on my laptop. No internet connection required, no API key, no subscription. Just the model, my GPU, and a terminal. Here is how it went.
Why This Made Sense for Me
My background on this blog has always been computer vision — working with OpenCV, JavaCV, image processing pipelines. Naturally, I have been watching the progression from classical vision algorithms to deep learning to now these large multimodal models with a lot of interest. Running a language model locally felt like a natural next experiment — it is the same general idea of getting a powerful model to run on your own hardware rather than relying on a cloud service.
The other motivation: I have a Raspberry Pi 4 sitting on my desk doing not much. I wanted to eventually get a small model running on it too. The laptop would be step one.
The Hardware
Here is what I was working with for this experiment:
| Component | Spec |
|---|---|
| Laptop | MSI Katana GF66 11UC |
| CPU | Intel Core i7-11800H (8 cores / 16 threads) |
| GPU | NVIDIA RTX 3050 — 4 GB VRAM |
| RAM | 16 GB DDR4 |
| OS | Linux Mint 22.3 (x64) |
The RTX 3050 is not a high-end card, but it has 4 GB of dedicated VRAM, which is enough to run smaller models entirely on the GPU. That means fast inference — not the agonizingly slow CPU-only speeds you might expect on a consumer laptop.
The Tool: Ollama
Ollama is the tool that makes local LLM inference easy. Think of it as a package manager for AI models: it handles model downloads, serves pre-quantized builds, manages GPU memory, and exposes a REST API, all through a single binary. It detects your NVIDIA GPU automatically and decides how many model layers to load into VRAM, leaving whatever does not fit to the CPU. You do not have to configure any of that manually.
It also has a growing library of models you can pull with one command, similar to how docker pull works.
Installation
One command:
$ curl -fsSL https://ollama.com/install.sh | sh
The installer detects your OS, downloads the right binary, and sets up Ollama as a background service. Once done, check the install and confirm the GPU driver is working:
$ ollama --version
# ollama version 0.6.x
$ nvidia-smi
# Should list your RTX 3050; this confirms the NVIDIA driver sees the GPU, which Ollama needs in order to use it
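If you want to check the background service from a script rather than by eye, one option is to ask its HTTP API. This is a minimal sketch, assuming the default address http://localhost:11434 (the same one used in the curl example near the end of this post). The helper name ollama_up is made up for illustration, and the systemctl hint assumes the installer registered a systemd unit named ollama.

```shell
# ollama_up is a hypothetical helper name, not part of Ollama itself.
# It returns 0 if the Ollama HTTP API answers at the given base URL.
ollama_up() {
  base_url=${1:-http://localhost:11434}
  # /api/tags lists locally downloaded models; -f makes curl fail on HTTP errors.
  curl -fsS --max-time 3 "$base_url/api/tags" >/dev/null 2>&1
}

if ollama_up; then
  echo "Ollama is reachable"
else
  # Assumption: the installer set up a systemd service called "ollama".
  echo "Ollama is not reachable; try: systemctl status ollama"
fi
```

The function takes an optional base URL, so the same check works if you later move Ollama to another port or machine.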
No config files, no dependency juggling. That was the entire setup.
Picking a Model: Phi-4 Mini
With 4 GB of VRAM I needed to be selective. I went with Phi-4 Mini from Microsoft — a 3.8 billion parameter model that fits comfortably within 4 GB and genuinely performs well for its size.
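To see why a 3.8B model fits in 4 GB, a back-of-the-envelope estimate helps. The sketch below counts only the weights and assumes 4 bits per weight for the quantized build versus 16 bits unquantized. The 4-bit figure is an assumption for illustration, and the exact quantization of a given model tag may differ. The helper name weights_mb is hypothetical.

```shell
# weights_mb PARAMS_IN_MILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the weights in megabytes (10^6 bytes):
# millions of parameters * bits per weight / 8 = millions of bytes.
weights_mb() {
  echo $(( $1 * $2 / 8 ))
}

echo "4-bit:  $(weights_mb 3800 4) MB"    # prints "4-bit:  1900 MB"
echo "16-bit: $(weights_mb 3800 16) MB"   # prints "16-bit: 7600 MB"
```

Real usage is higher than the weight-only figure, since the runtime also needs memory for the context and quantized builds often keep some tensors at higher precision. Even so, the estimate shows the shape of it: roughly 2 GB of weights leaves headroom on a 4 GB card, while the 16-bit version would not fit at all.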
On quality: it handles multi-step reasoning, code generation, and technical questions surprisingly well for a 3.8B model. It is not GPT-4, but for a fully offline model running on a laptop, the output is impressive.
Running It
$ ollama run phi4-mini
# Downloads ~2.5 GB on first run, cached locally after that
# Drops you directly into an interactive chat
I tried a few different types of prompts to get a sense of what it could do. Here is one exchange that stood out:
Sample conversation
Adding a Web Interface
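The terminal chat works fine, but for longer sessions I wanted something more comfortable. Open WebUI is a self-hosted interface, essentially a ChatGPT clone, that talks to Ollama's API on the host. One caveat on Linux: Ollama listens only on 127.0.0.1 by default, so if the interface shows no models, the container probably cannot reach it. The usual fixes are setting OLLAMA_HOST=0.0.0.0 for the Ollama service or running the container with --network=host. With Docker it is a single command: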
$ docker run -d \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui \
    --restart always \
    ghcr.io/open-webui/open-webui:main
# Open: http://localhost:3000
You get conversation history, multiple model support, and a clean interface — all running locally, zero data leaving your machine.
Handy Commands to Know
# List downloaded models
$ ollama list
# Pull without running
$ ollama pull mistral
# Check what's loaded in GPU memory
$ ollama ps
# Hit the REST API directly (useful for scripts)
$ curl http://localhost:11434/api/generate \
    -d '{"model":"phi4-mini","prompt":"Hello!","stream":false}'
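For scripting against the REST API, hand-writing the JSON gets fragile as soon as a prompt contains quotes. Here is a small sketch of one way to wrap the call. The function names json_escape, build_payload, and ask are made up for illustration, and the escaping only covers backslashes and double quotes in single-line prompts, so treat it as a starting point rather than a complete JSON encoder.

```shell
# Hypothetical helpers for calling Ollama's /api/generate endpoint.

# Escape backslashes and double quotes so a single-line string is safe
# inside a JSON string literal. Newlines and control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload MODEL PROMPT: prints the JSON request body.
build_payload() {
  model=$1
  prompt=$(json_escape "$2")
  printf '{"model":"%s","prompt":"%s","stream":false}' "$model" "$prompt"
}

# ask MODEL PROMPT: sends the request to the default local address.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Show the body that would be sent (no network needed):
build_payload phi4-mini 'Hello!'
echo

# Example call, commented out because it needs a running Ollama:
# ask phi4-mini 'Explain VRAM in one sentence.'
```

With "stream" set to false the reply comes back as a single JSON object, and the generated text is in its response field. If jq is installed, piping the output of ask through jq -r .response prints just the text.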
If you are sitting on a laptop with an NVIDIA card and running Linux, there is genuinely no reason not to try this. The setup is simpler than anything I was doing with JavaCV back in 2013, and the results are far more immediately impressive. Everything is local: with the Linux installer's background service the model files normally live under /usr/share/ollama/.ollama/models (or ~/.ollama/models if you run ollama serve as your own user), and you can delete it all cleanly if it is not for you.
Good to be writing here again. More experiments coming.