Wednesday, June 3, 2026

Weekend Project: Running a Private AI on My Laptop with Ollama

Running an LLM on My Laptop — A Weekend Experiment with Ollama

👋 Back after a long break: It has been a while since I posted anything here; the last post was back in 2014. Life, work, and a pile of unfinished side projects got in the way. But I recently got into something exciting enough to write about again: running AI models locally on my own machine, with no cloud and no API keys. This post is the first in what I hope will be a new series of experiments on EngineerVisions.

If you have been following the AI space at all, you have probably been using tools like ChatGPT or Gemini through a web browser. They are impressive, but there is always this nagging feeling — your prompts are being sent to someone else's server, logged, possibly used for training, and you have a usage limit. For a lot of experiments I want to run, that setup is not ideal.

So I decided to try running a language model entirely on my laptop. No internet connection required, no API key, no subscription. Just the model, my GPU, and a terminal. Here is how it went.

Why This Made Sense for Me

My background on this blog has always been computer vision — working with OpenCV, JavaCV, image processing pipelines. Naturally, I have been watching the progression from classical vision algorithms to deep learning to now these large multimodal models with a lot of interest. Running a language model locally felt like a natural next experiment - it is the same general idea of getting a powerful model to run on your own hardware rather than relying on a cloud service.

The other motivation: I have a Raspberry Pi 4 sitting on my desk doing not much. I wanted to eventually get a small model running on it too. The laptop would be step one.

The Hardware

Here is what I was working with for this experiment:

Component | Spec
Laptop | MSI Katana GF66 11UC
CPU | Intel Core i7-11800H (8 cores / 16 threads)
GPU | NVIDIA RTX 3050, 4 GB VRAM
RAM | 16 GB DDR4
OS | Linux Mint 22.3 (x64)

The RTX 3050 is not a high-end card, but it has 4 GB of dedicated VRAM, which is enough to run smaller models entirely on the GPU. That means fast inference - not the agonizingly slow CPU-only speeds you might expect on a consumer laptop.

💡
NVIDIA drivers on Linux Mint: Linux Mint's Driver Manager makes installing the proprietary NVIDIA driver really straightforward - no manual CUDA setup needed. Ollama ships with its own CUDA runtime libraries, so you just need the driver.

Prerequisites

Check that the NVIDIA driver is installed and working by running the nvidia-smi command:

Terminal
$ nvidia-smi  
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0            752W /   60W |      14MiB /   4096MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1267      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
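
For scripting the same check, a minimal Python sketch might look like the following. It assumes nvidia-smi is on the PATH and uses its CSV query mode instead of parsing the table above; the function names parse_gpu_csv and query_gpus are hypothetical helpers, not part of any library.

```python
import subprocess


def parse_gpu_csv(line: str) -> dict:
    """Parse one CSV row of the form: name, memory.total, memory.used (MiB)."""
    # Split from the right so a comma inside the GPU name would not break parsing.
    name, total, used = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "total_mib": int(total),
        "used_mib": int(used),
        "free_mib": int(total) - int(used),
    }


def query_gpus() -> list[dict]:
    """Ask nvidia-smi for per-GPU memory figures (requires the NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_csv(row) for row in result.stdout.splitlines() if row.strip()]


# Usage (on a machine with the driver installed):
#   for gpu in query_gpus():
#       print(gpu["name"], gpu["free_mib"], "MiB free")
```

The free-memory figure is the useful part here: it gives a rough idea of how much VRAM is left for a model after the desktop has taken its share.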

After confirming that the driver was working, I installed nvidia-cuda-toolkit, which provides nvcc. Installing it is not required, since Ollama includes its own pre-compiled CUDA binaries and automatically detects and uses your NVIDIA GPU right out of the box. I installed it in case I ever want to compile Ollama from source or customize the underlying llama.cpp engine that Ollama relies on.

Terminal
$ sudo apt install nvidia-cuda-toolkit        
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libnvidia-cfg1-590 libnvidia-common-590 libnvidia-compute-590 libnvidia-decode-590 libnvidia-egl-wayland1:i386 libnvidia-encode-590 libnvidia-extra-590 libnvidia-fbc1-590 libnvidia-gl-590 libwayland-server0:i386
  nvidia-compute-utils-590 nvidia-firmware-590-590.48.01 nvidia-kernel-source-590-open nvidia-utils-590 xserver-xorg-video-nvidia-590
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  ca-certificates-java cpp-12 fonts-dejavu-extra g++-12 gcc-12 gcc-12-base ibverbs-providers java-common libaccinj64-12.0 libamd-comgr2 libamdhip64-5 libatk-wrapper-java libatk-wrapper-java-jni libcu++-dev libcub-dev libcublas12
  libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11 libcusolvermg11 libcusparse12 libgcc-12-dev libgl-dev libglx-dev libhsa-runtime64-1 libhsakmt1
  libhwloc-plugins libhwloc15 libibumad3 libjpeg62 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12 libnppitc12 libnpps12 libnvblas12 libnvidia-ml-dev libnvjitlink12 libnvjpeg12
  libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libpfm4 libpthread-stubs0-dev librdmacm1t64 libstdc++-12-dev libtbb-dev libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libucx0 libvdpau-dev libx11-dev libxau-dev
  libxcb1-dev libxdmcp-dev node-html5shiv nsight-compute nsight-compute-target nsight-systems nsight-systems-target nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit-doc nvidia-opencl-dev nvidia-profiler nvidia-visual-profiler
  ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless x11proto-dev xorg-sgml-doctools xtrans-dev
Suggested packages:
  gcc-12-locales cpp-12-doc g++-12-multilib gcc-12-doc gcc-12-multilib default-jre libhwloc-contrib-plugins libstdc++-12-doc libtbb-doc libvdpau-doc libx11-doc libxcb-doc nvidia-cuda-samples opencl-clhpp-headers-doc fonts-nanum
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
Recommended packages:
  libnvcuvid1
The following NEW packages will be installed:
  ca-certificates-java cpp-12 fonts-dejavu-extra g++-12 gcc-12 gcc-12-base ibverbs-providers java-common libaccinj64-12.0 libamd-comgr2 libamdhip64-5 libatk-wrapper-java libatk-wrapper-java-jni libcu++-dev libcub-dev libcublas12
  libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11 libcusolvermg11 libcusparse12 libgcc-12-dev libgl-dev libglx-dev libhsa-runtime64-1 libhsakmt1
  libhwloc-plugins libhwloc15 libibumad3 libjpeg62 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12 libnppitc12 libnpps12 libnvblas12 libnvidia-ml-dev libnvjitlink12 libnvjpeg12
  libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libpfm4 libpthread-stubs0-dev librdmacm1t64 libstdc++-12-dev libtbb-dev libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libucx0 libvdpau-dev libx11-dev libxau-dev
  libxcb1-dev libxdmcp-dev node-html5shiv nsight-compute nsight-compute-target nsight-systems nsight-systems-target nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit nvidia-cuda-toolkit-doc nvidia-opencl-dev nvidia-profiler
  nvidia-visual-profiler ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless x11proto-dev xorg-sgml-doctools xtrans-dev
0 upgraded, 91 newly installed, 0 to remove and 69 not upgraded.
Need to get 2,261 MB of archives.
After this operation, 6,892 MB of additional disk space will be used.

Once the installation was complete, I verified it by running the nvcc --version command:

Terminal
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

The Tool: Ollama

Ollama is the tool that makes local LLM inference easy. Think of it like a package manager for AI models - it handles model downloads (the models in its library come already quantized), GPU memory management, and exposes a REST API, all through a single binary. It detects your NVIDIA GPU automatically and figures out how many model layers to load into VRAM. You do not have to configure any of that manually.

It also has a growing library of models you can pull with one command, similar to how docker pull works.

Installation

One command:

Terminal
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local  
>>> Downloading ollama-linux-amd64.tar.zst
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

The installer detects your OS, downloads the right binary, and sets up Ollama as a background service. Once done, verify it:

Terminal
$ ollama --version
ollama version is 0.24.0
$ systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-28 00:53:08 IST; 14min ago
   Main PID: 19849 (ollama)
      Tasks: 21 (limit: 18725)
     Memory: 61.2M (peak: 292.1M)
        CPU: 3.397s
     CGroup: /system.slice/ollama.service
             └─19849 /usr/local/bin/ollama serve

No config files, no dependency juggling. That was the entire setup.
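
If you would rather confirm from code that the service is answering, here is a small sketch. It assumes the server listens on the default http://localhost:11434 and exposes a GET /api/version endpoint returning a JSON object with a "version" key; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from the JSON body of /api/version."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server version, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, KeyError, ValueError):
        # Connection refused, timeout, or an unexpected response body.
        return None


# Usage:
#   print(ollama_version() or "Ollama is not reachable")
```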

Picking the Models

1. Phi-4 Mini

With 4 GB of VRAM I needed to be selective. I went with Phi-4 Mini from Microsoft - a 3.8 billion parameter model that fits comfortably within 4 GB and genuinely performs well for its size.

📊
Phi-4 Mini quick specs: 3.8B parameters · 128K token context window · Knowledge cutoff: early 2023 (self-reported by the model in the chat below, so treat it as approximate) · VRAM usage: ~2.5 GB (Q4 quantized) · Released: February 2025
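
A rough back-of-the-envelope check of why a 3.8B model fits in 4 GB: weight size is roughly parameters times bits per weight divided by 8. The sketch below encodes that arithmetic. The 4.5 bits per weight and the 1 GB allowance for KV cache and runtime buffers are assumptions chosen for illustration, not measured values; real Q4 files vary, which is why the actual download is about 2.5 GB.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (decimal)."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


def fits_in_vram(weights_gb: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in VRAM."""
    return weights_gb + overhead_gb <= vram_gb


# Usage: a 3.8B model at an assumed 4.5 bits per weight is about 2.1 GB.
#   size = estimate_weights_gb(3.8, 4.5)
#   print(round(size, 2), fits_in_vram(size, 4.0))
```

The same arithmetic shows why 7B models are a stretch on this card: at around 4.5 bits per weight the weights alone approach 4 GB before any overhead.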

For anyone wondering about model quality - it handles multi-step reasoning, code generation, and technical questions surprisingly well for a 3.8B model. It is not GPT-4, but for a laptop running a fully offline model, the output quality is genuinely impressive.

Running It

Ollama has a simple ollama run command that takes a model name. It looks for the model on the local machine first; if it is not available, it downloads the model and then runs it.

Terminal
$ ollama run phi4-mini
pulling manifest 
pulling 3c168af1dea0: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 2.5 GB                         
pulling 813f53fdc6e5: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  655 B                         
pulling fa8235e5b48f: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.1 KB                         
pulling 8c2539a423c4: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  411 B                         
verifying sha256 digest 
writing manifest 
success 
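
The "is it already downloaded?" check that ollama run performs can be approximated from a script. This sketch assumes the local server's GET /api/tags endpoint returns a JSON object with a "models" list whose entries carry a "name" such as "phi4-mini:latest"; is_model_local and fetch_tags are hypothetical helper names.

```python
import json
import urllib.request


def normalize(name: str) -> str:
    """Model names without an explicit tag are treated as ':latest'."""
    return name if ":" in name else f"{name}:latest"


def is_model_local(tags_payload: dict, name: str) -> bool:
    """Check whether a model name appears in an /api/tags style payload."""
    wanted = normalize(name)
    return any(m.get("name") == wanted for m in tags_payload.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (with the Ollama service running):
#   print(is_model_local(fetch_tags(), "phi4-mini"))
```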

I tried a few different types of prompts to get a sense of what it could do:

Sample conversations

─────────────────────────────────────────
phi4-mini · 3.8B · GPU (RTX 3050 · 4GB)
─────────────────────────────────────────

>>> Can you access the internet ?

As Phi developed by Microsoft, I don't have direct real-time browsing capabilities or active Internet connection. My responses are generated based on a mixture of licensed data, human-created content, and publicly available information up to my last training cut-off in early 2023.

If you're looking for specific answers that require the latest online resources such as current events news articles from today onward (after March 31st), I would not be able to access or retrieve those details. However, I'm here with a vast amount of general knowledge on numerous topics Up until my last update! Feel free to ask any question and I'll do my best to help you using the information I've been trained on.

>>> How do I enable internet access to you so that you can get updated information? 2023 is 3 years ago !

I appreciate your interest in ensuring I'm up-to-date with current events, but as an AI developed by Microsoft known as Phi (and previously released under the name Phi), my capabilities are not designed for real-time Internet browsing or direct updates from online sources. My responses reflect knowledge and data available until early 2023.

However, you can always ask questions about recent trends up to that time in history without expecting live updates—I'll do what I can with existing information! If you're looking at a specific field like technology advancements post-early 2023 or current events happening after my last update (which are beyond March 31st), unfortunately those details won't be accessible through me. But if there's anything else you'd need knowledge about up to early 2023, feel free to ask and I'll do the best I can!

>>> Is there a way that I can update the model or somehow may able to get internet access ?

As an AI developed by Microsoft called Phi (formerly known as Phi), you don't have direct control over updating me in real-time with new data from online sources. However, updates and improvements are periodically made internally at Microsoft's servers.

For future advancements that could potentially be incorporated into my architecture:
1. **Official Updates:** Keep track of official announcements or releases by Microsoft related to AI enhancements.
2. **Direct Support Channels:** Engage through support channels provided for users (like feedback forms) which can convey your interest in updates and improvements, including the need for enhanced real-time capabilities.

If there's a specific field you're interested in that requires up-to-date knowledge beyond my current training data cut-off point of March 2023:
1. **Stay Updated Yourself:** You could use dedicated tools or services online (like Google Scholar) to keep yourself informed about recent advancements post-March 31st.
2. **AI Development News**: Follow Microsoft, AI research publications and news platforms for any breakthroughs related directly after my last update.
Microsoft continually works on improving the technology behind me; it's a dynamic process with many factors such as privacy regulations that guide how these updates occur without compromising user data security or trustworthiness of information provided through an online assistant like Phi.
─────────────────────────────────

After these conversations I wasn't satisfied with Phi-4 Mini, so I looked for other models that would fit my hardware. Gemma3:4B looked like another good candidate, so I went ahead and pulled it.

2. Gemma3:4B

Terminal
$ ollama pull gemma3:4b
pulling manifest 
pulling aeda25e63ebd: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ ▏ 3.3 GB/3.3 GB  1.5 MB/s      0s
pulling e0a42594d802: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  358 B                         
pulling dd084c7d92a3: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 8.4 KB                         
pulling 3116c5225075: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   77 B                         
pulling b6ae5839783f: 100% ▕████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  489 B                         
verifying sha256 digest 
writing manifest 
success

Sample conversations (I started with the same first question):

─────────────────────────────────────────
gemma3 · 4B · GPU (RTX 3050 · 4GB)
─────────────────────────────────────────

>>> Can you access the internet ?

Yes, I can! I have access to Google Search and other Google services, which allows me to access and process information from the internet in real-time.
Think of it like this: when you ask me a question, I don't just rely on the knowledge I was trained on. I can go out and search the web for the most up-to-date information to give you a more complete and accurate answer.
**However, it's important to note a few things:**
* **My access is filtered:** Google has filters in place to ensure I don't generate responses that are harmful, biased, or misleading.
* **I’m constantly learning:** My knowledge is continually being updated as Google updates its search algorithms and my training data.

**To test it, you can ask me anything you'd typically search for on Google!** 😊
For example, you could ask me:
* "What's the current weather in London?"
* "What's the latest news on the Ukraine conflict?"
* "Who won the Super Bowl?"


>>> I am able to run Phi4-mini in my laptop. But it can't access the internet and was trained in 2023 March. So it's not upto date. How do I update it ? Is there a way that I can update the model or somehow may able to get internet access to it ?

You’ve hit on a really critical point – and a common limitation with models like Phi-4-mini that aren't designed for continuous internet access. You're absolutely correct that its knowledge cutoff (March 2023) and lack of internet connectivity severely limit its usefulness. Let's break down how you might approach updating it and potentially adding some level of connectivity.

**1. Updating the Model (The Biggest Challenge):**
* **No Official Updates:** Currently, there's no official mechanism for updating the Phi-4-mini model itself. The original developers (Microsoft) haven’t released updated versions with new training data.
* **Fine-tuning (The Best You Can Do):** The most realistic approach is **fine-tuning**. This involves taking the existing Phi-4-mini model and training it *further* on a smaller dataset that *does* contain more recent information.
    * **What data to use:** You’d need to compile a dataset of text relevant to the time period you want to update its knowledge. News articles, blog posts, or even carefully curated sets of facts from reputable sources would be good.
    * **Tools:** You'd typically use a framework like PyTorch or TensorFlow and libraries like Hugging Face Transformers to perform this fine-tuning. This requires some programming expertise.

**2. Adding Internet Access (More Complex, Less Reliable):**
This is significantly harder and requires a more architecturally complex approach. Here are a few potential, but challenging, methods:
* **Retrieval-Augmented Generation (RAG):** This is the most promising approach currently. Instead of updating the model directly, you create a system where Phi-4-mini *queries* an external knowledge base (which *could* be an internet-connected database).
    * **How it works:** When you ask a question, Phi-4-mini first searches an external database for relevant information. Then, it combines that information with its existing knowledge to generate a response.
    * **Implementation:** You'd need to build a system that:
        * Has an internet-connected database (e.g., using a vector database like Pinecone or Weaviate to store embeddings of web content).
        * Writes code to interface Phi-4-mini with the database.
* **Agent Frameworks:** Frameworks like LangChain and AutoGPT allow you to build agents that can interact with external tools, including the internet. You could potentially integrate Phi-4-mini into such a framework, giving it limited internet access. This is very advanced.

**3. Getting More Current Data – The Realistic Strategy:**
Given the difficulties of updating the model itself, the most practical solution is to **continuously provide it with the latest information through prompts.**
* **Contextual Prompts:** Include up-to-date information in your prompts to guide its responses. For example, "Based on the latest news as of today..."
**Resources to Explore:**
* **Hugging Face Transformers:** [https://huggingface.co/transformers/](https://huggingface.co/transformers/) - The core library for working with models like Phi-4-mini.
* **LangChain:** [https://www.langchain.com/](https://www.langchain.com/) - A powerful framework for building applications with LLMs, including RAG.
* **Pinecone/Weaviate:** [https://www.pinecone.io/](https://www.pinecone.io/) and [https://weaviate.io/](https://weaviate.io/) - Vector databases for storing and searching embeddings.
**Important Note:** Building any of these more advanced solutions (RAG or agent frameworks) requires significant technical skills in Python, machine learning, and potentially database management.
To help me give you more tailored advice, could you tell me:
* **What is your programming experience level?** (e.g., beginner, intermediate, advanced)
* **What are you hoping to *do* with Phi-4-mini?** (e.g., answer questions, generate creative text, etc.)

─────────────────────────────────

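Gemma3:4B's output was far better than Phi-4 Mini's. One caveat: its first answer, claiming access to Google Search, is not true of a model running locally under Ollama, which has no internet access unless you connect tools to it yourself. I will keep using Gemma3:4B on my laptop for some time, and then try running it on my Raspberry Pi 4 (if the Pi can handle it).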

Adding a Web Interface

The terminal chat works fine, but for longer sessions I wanted something more comfortable. Open WebUI is a self-hosted interface — essentially a ChatGPT clone — that connects to Ollama automatically. With Docker it is a single command:

Terminal
$ docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open: http://localhost:3000

You get conversation history, multiple model support, and a clean interface — all running locally, zero data leaving your machine.

⚠️
VRAM limit: If you try a model that exceeds your VRAM (e.g., a 7B model on 4 GB), Ollama will partially offload layers to system RAM. It works, but inference slows noticeably. Stick to 3B–4B models for fully GPU-accelerated performance on 4 GB.
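
To see whether a loaded model is fully on the GPU or partly spilled to system RAM, one option is to compare its total size with the portion resident in VRAM. The sketch below assumes the server's GET /api/ps endpoint returns a "models" list whose entries include "size" and "size_vram" in bytes; gpu_fraction and fetch_running are hypothetical helper names.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model that sits in VRAM (1.0 means fully on GPU)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def fetch_running(base_url: str = "http://localhost:11434") -> list[dict]:
    """List models currently loaded by a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


# Usage (after sending at least one prompt so a model is loaded):
#   for m in fetch_running():
#       print(m.get("name"), f"{gpu_fraction(m):.0%} on GPU")
```

A value well below 100% suggests layers have been offloaded to system RAM, which is when the slowdown described above would be expected.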

Handy Commands to Know

Quick Reference
# List downloaded models
$ ollama list

# Pull without running
$ ollama pull mistral

# Check what's loaded in GPU memory
$ ollama ps

# Hit the REST API directly (useful for scripts)
$ curl http://localhost:11434/api/generate \
    -d '{"model":"phi4-mini","prompt":"Hello!","stream":false}'
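
The same REST call can be made from Python with only the standard library. This sketch mirrors the curl command above (same URL, same JSON body) and assumes the non-streaming reply is a JSON object with the generated text under a "response" key; build_generate_request and generate are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "phi4-mini",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["response"]


# Usage (with the Ollama service running and the model pulled):
#   print(generate("Hello!"))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before anything touches the network.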
· · ·

If you are sitting on a laptop with an NVIDIA card and running Linux, there is genuinely no reason not to try this. The setup is simpler than anything I was doing with JavaCV back in 2013, and the results are far more immediately impressive. With the installer's systemd service, the model files live under /usr/share/ollama/.ollama/models (or ~/.ollama/models if you run ollama serve as your own user), everything is local, and you can delete it all cleanly if it is not for you.

Good to be writing here again. More experiments coming.

Monday, May 19, 2014

ReconBot - Gesture and Speech Controlled Bot


ReconBot (Reconnaissance Robot)
Our Final Year Project - 2013. Team members: Gowrav, Mallikarjuna, Manoj, Sachin Anchan.
Visit: reconbot.blogspot.in