Lesson 6 of 8

Local AI models: the private alternative

~16 min read

Every prompt you send to ChatGPT, Claude, or Gemini travels to a remote server, gets processed, and may be stored, reviewed, or used to improve future models. For most tasks, that's fine. For sensitive work — HR decisions, legal strategy, client financials, medical data — it's a genuine liability. Local AI models run entirely on your own hardware. No data leaves your machine. No terms of service govern what you can ask. No vendor can read your prompts. This lesson maps the local AI landscape so you can make informed decisions about when and how to use it.

7 Things You Need to Know About Local AI

  1. Local models run on your CPU or GPU — your laptop, desktop, or on-premise server — with zero internet required after setup.
  2. The leading tool for running local models is Ollama (free, open-source), which lets you download and run models like Llama 3, Mistral, and Phi-3 in minutes.
  3. Model quality has closed significantly: Meta's Llama 3 8B (free) performs comparably to GPT-3.5 on many professional tasks.
  4. Hardware is the real constraint — a 7B parameter model needs roughly 8GB of RAM; a 13B model needs 16GB; a 70B model needs 48GB+.
  5. LM Studio provides a graphical interface for non-technical users — no command line required — and works on Mac, Windows, and Linux.
  6. Privacy is absolute by design: if your machine isn't connected to the internet while running the model, no data can possibly leave.
  7. Local models can be fine-tuned on your own documents, enabling custom AI assistants trained exclusively on company data.

Why Cloud AI Creates Privacy Exposure

When you use ChatGPT or Claude through a browser or the default API, your prompts are transmitted over HTTPS to OpenAI's or Anthropic's servers. Both companies retain conversation data by default, though retention periods and training opt-outs vary by plan. OpenAI's free and Plus tiers use conversations for model improvement unless you manually disable it in settings. Claude's free tier operates under Anthropic's standard privacy policy, which permits human review of conversations for safety purposes. Enterprise plans for both offer stronger contractual protections, but the data still leaves your infrastructure.

The risk isn't hypothetical. In 2023, Samsung engineers accidentally leaked proprietary chip design data by pasting source code into ChatGPT. The incident prompted Samsung to ban generative AI tools on internal devices entirely. Legal, healthcare, and financial firms face additional regulatory exposure: GDPR, HIPAA, and SOC 2 compliance frameworks all have specific requirements about where sensitive data can be processed. A local model eliminates this exposure entirely — not by adding security layers, but by removing the data transmission step from the equation.

  • ChatGPT free/Plus: conversations used for training by default; opt-out available in Settings → Data Controls
  • Claude free: subject to human review; no guaranteed opt-out for free users
  • Gemini (Google): data may be reviewed by human raters; tied to your Google account
  • Microsoft Copilot (M365): enterprise version offers stronger isolation; consumer version stores data in Microsoft cloud
  • All cloud providers: subject to government data requests under applicable law in their jurisdiction

Quick Privacy Audit

Before pasting anything into a cloud AI tool, ask: 'Would I be comfortable if this text appeared in a court filing, a news article, or a competitor's hands?' If the answer is no, that content belongs in a local model. This single question catches 90% of risky prompts before they become a problem.

Cloud vs. Local: The Real Trade-offs

Factor | Cloud AI (ChatGPT, Claude) | Local AI (Ollama + Llama 3)
Data privacy | Data leaves your device; vendor policies apply | Data never leaves your hardware
Setup time | Instant — browser or API key | 15–45 minutes for first-time setup
Cost (ongoing) | $0–$30+/user/month depending on plan | $0 after hardware; electricity only
Model quality (general) | GPT-4o, Claude 3.5 Sonnet — frontier models | Llama 3 70B approaches GPT-4 level; 8B is GPT-3.5 level
Internet required | Always | Only for initial model download
Customization | Limited (system prompts, fine-tuning via API) | Full access — fine-tune, modify, deploy freely
Hardware requirement | None beyond a browser | 8GB RAM minimum; 16–64GB recommended
Compliance (HIPAA/GDPR) | Requires enterprise contracts + BAAs | On-premise processing simplifies compliance; other controls still apply
Core trade-offs between cloud and local AI deployment for professional use

The Local AI Ecosystem: Tools You Should Know

Three tools dominate the local AI space for non-developers. Ollama is a command-line tool that manages model downloads and serves a local API — think of it as the engine. LM Studio wraps that functionality in a clean desktop GUI with a built-in chat interface, model library browser, and OpenAI-compatible API server. Jan.ai is a newer entrant with a similar GUI approach but stronger focus on extensions and integrations. For teams wanting a shared internal server rather than individual laptops, Ollama can be deployed on a local network machine or private cloud instance, making models accessible to everyone without any data leaving the organization's infrastructure.

The models themselves are separate from the tools that run them. Hugging Face hosts over 500,000 open models, though most aren't optimized for local inference. The practical shortlist for professional use is much shorter: Llama 3 (Meta), Mistral and Mixtral (Mistral AI), Phi-3 (Microsoft), Gemma 2 (Google), and Qwen 2 (Alibaba). These models are released under licenses that generally permit commercial use, though you should verify the specific license for your use case — Llama 3's license, for example, restricts use by companies with over 700 million monthly active users.

  1. Ollama — install at ollama.com; run 'ollama run llama3' to download and start the model in one command
  2. LM Studio — download at lmstudio.ai; browse models visually, chat in-app, or run as a local API server
  3. Jan.ai — download at jan.ai; strong extension ecosystem, good for integrating with other local tools
  4. AnythingLLM — wraps Ollama with a full RAG (retrieval-augmented generation) interface for chatting with your own documents
  5. Open WebUI — a browser-based ChatGPT-style interface that connects to a local Ollama instance, good for team deployments

Choosing the Right Local Model for Your Hardware

Model | Size | RAM Required | Best For | License
Llama 3.2 3B | 3 billion params | 4–6 GB | Quick tasks, low-power hardware | Llama 3.2 Community License
Llama 3.1 8B | 8 billion params | 8–10 GB | General professional use, good balance | Llama 3.1 Community License
Mistral 7B | 7 billion params | 8 GB | Instruction following, writing tasks | Apache 2.0 (commercial OK)
Phi-3 Medium | 14 billion params | 10–12 GB | Reasoning, analysis, coding | MIT License (commercial OK)
Llama 3.1 70B | 70 billion params | 48 GB+ | Near-frontier quality, complex tasks | Llama 3.1 Community License
Mixtral 8x7B | ~47B total params (13B active per token) | 26–32 GB | High quality, mixture-of-experts architecture | Apache 2.0 (commercial OK)
Gemma 2 9B | 9 billion params | 10–12 GB | Strong at structured reasoning, Google-trained | Gemma Terms of Use
Practical local model selection guide — RAM figures are for 4-bit quantized versions (standard for local use)

Quantization: Why 70B Models Fit on Normal Hardware

Raw model weights for a 70B parameter model would require ~140GB in full 16-bit precision. Quantization compresses those weights to 4-bit or 8-bit representations, reducing Llama 3.1 70B to roughly 40GB with minimal quality loss. Ollama and LM Studio handle this automatically — when you download a model, you're getting the quantized version. You don't need to understand the math; just know that the RAM figures in the table above already reflect this compression.
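To see where those numbers come from, here is a back-of-the-envelope sketch of the arithmetic (weights only; real memory use also includes the KV cache and runtime overhead, so treat the figures as lower bounds):

```python
# Rough size of raw model weights at different precisions.
# Illustrative arithmetic only -- actual RAM needs are somewhat higher
# because of the KV cache and runtime overhead.

def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_size_gb(70, bits):.0f} GB")

# Prints roughly: 140 GB at 16-bit, 70 GB at 8-bit, 35 GB at 4-bit.
```
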
Local Model Privacy Test — Running Mistral 7B via Ollama

Prompt

I'm going to share some confidential information with you. Our company, Meridian Consulting, is planning to acquire a competitor, Apex Analytics, at a valuation of $14M. This is not public. Help me draft talking points for a board presentation explaining the strategic rationale.

AI Response

Here are concise board-level talking points for the Meridian-Apex acquisition rationale:

**Strategic Fit**
- Apex Analytics fills the data engineering capability gap that currently limits Meridian's ability to compete for enterprise contracts above $500K
- Combined entity addresses a $2.1B addressable market; neither firm reaches it independently

**Financial Logic**
- $14M valuation represents 2.3x Apex's trailing revenue — within normal range for analytics firms at this growth stage
- Projected cost synergies of $800K–$1.2M annually through consolidated infrastructure and eliminated redundant roles

**Risk Mitigation**
- Key retention: structure earn-out provisions for Apex's three lead data scientists
- Integration timeline: 90-day parallel operations period before system consolidation

Would you like me to expand any section or adjust the tone for a more conservative board?

This Prompt Would Be Dangerous in ChatGPT

The prompt above contains material non-public information (MNPI) about an acquisition — exactly the kind of content that creates legal liability if transmitted to a third-party server. Running this through ChatGPT's standard interface could violate securities regulations, breach client confidentiality agreements, or expose your firm to regulatory scrutiny. The identical prompt run through Ollama on a disconnected laptop carries zero data transmission risk. Same output. Entirely different risk profile.

Quick-Start Task: Run Your First Local Model

Install Ollama and Run Llama 3 Locally

Goal: Have a working local LLM running on your own hardware, with firsthand experience of its capability and response speed, ready to use for sensitive professional tasks.

1. Go to ollama.com and download the installer for your operating system (Mac, Windows, or Linux). Run the installer — it takes under 2 minutes.
2. Open your terminal (Mac: press Cmd+Space, type 'Terminal'; Windows: search 'Command Prompt' or 'PowerShell').
3. Type the command: ollama run llama3.1 and press Enter. Ollama will download the 8B model (~4.7GB). This takes 5–15 minutes depending on your connection.
4. Once the download completes, you'll see a '>>>' prompt. Type a test message: 'Summarize the key risks of a vendor contract in 3 bullet points.' Press Enter.
5. Review the response. Note the response speed — on most modern laptops with 16GB RAM, you'll see 15–30 tokens per second.
6. Now test privacy-sensitive content: paste a paragraph of fictional but realistic confidential business data and ask for analysis. Confirm you're comfortable knowing this never left your machine. (The sketch after this list shows how to send the same test prompt to the running model over its local API.)
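Once the model is running, you can also confirm it answers over the local API. The sketch below sends the same test prompt to Ollama's /api/chat endpoint on its default port; it assumes the requests package is installed and that llama3.1 has been pulled.

```python
# Send one test prompt to the local Ollama server and print the reply.
# Nothing in this script talks to any host other than localhost.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "user",
             "content": "Summarize the key risks of a vendor contract in 3 bullet points."}
        ],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```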

Local AI Quick-Reference Cheat Sheet

  • Best tool to start with: LM Studio (GUI, no terminal needed) — lmstudio.ai
  • Best model for most professionals: Llama 3.1 8B (8GB RAM) or Mistral 7B (8GB RAM)
  • Best model if you have 16GB+ RAM: Phi-3 Medium 14B or Llama 3.1 8B with more headroom
  • Best model for near-GPT-4 quality: Llama 3.1 70B — needs 48GB RAM or a good GPU
  • Command to pull any model in Ollama: ollama pull [modelname] (e.g., ollama pull mistral)
  • Command to list installed models: ollama list
  • To run as a local API (OpenAI-compatible): Ollama serves on http://localhost:11434 by default
  • AnythingLLM = best option for chatting with your own PDF/Word documents locally
  • Internet needed: only during model download; inference runs fully offline
  • Quantization is automatic: 4-bit models are standard and suitable for professional use

Key Takeaways So Far

  1. Cloud AI tools transmit every prompt to vendor servers — this creates real legal and reputational risk for sensitive professional content.
  2. Local models run entirely on your hardware; with no internet connection, data transmission is architecturally impossible.
  3. Ollama and LM Studio make local AI accessible to non-technical professionals in under an hour of setup time.
  4. The hardware constraint is RAM: 8GB supports capable 7B–8B models; 16GB opens up 13B–14B models with significantly better output.
  5. Quantization makes large models practical — a 70B model fits in ~40GB RAM thanks to automatic compression in tools like Ollama.
  6. Open-source models like Llama 3.1 8B perform at GPT-3.5 level on most professional tasks, which is sufficient for the majority of workplace use cases.
  7. Regulatory frameworks (HIPAA, GDPR, SOC 2) impose strict requirements on where and how sensitive data is processed; local models satisfy the data-residency part of those requirements by default.

Choosing the Right Local Model for Your Work

Not every local model fits every job. The gap between a 7-billion-parameter model running on a laptop and a 70-billion-parameter model running on a workstation is enormous — in quality, speed, and hardware cost. Your choice depends on three variables: what you're doing with the model, what hardware you have, and how much accuracy you actually need. A consultant summarizing client notes has different requirements than a developer generating code or an analyst processing financial data. Get this decision wrong and you'll either run a model too slow to use or waste time on one too weak to trust.

The Model Size vs. Hardware Reality

Model size is measured in parameters — the numerical weights that define what a model knows. More parameters generally means better reasoning, broader knowledge, and more nuanced output. The catch is VRAM: video memory on your GPU. A 7B model typically needs 4–6 GB of VRAM to run at full speed. A 13B model needs 8–10 GB. A 70B model needs 40–48 GB — that's a high-end workstation or a Mac with 64 GB unified memory. CPU-only inference is possible but runs 5–20x slower, making it impractical for real work. Match your model to your machine before anything else.

Model Size | VRAM Required | Typical Hardware | Best Use Cases | Speed (tokens/sec)
3B–7B | 4–6 GB | Most gaming laptops, M1/M2 MacBook | Summarization, Q&A, drafting | 30–80 t/s
13B | 8–10 GB | RTX 3080/4070, M2 Pro Mac | Analysis, coding, longer docs | 15–40 t/s
34B | 20–24 GB | RTX 4090, M2 Max Mac | Complex reasoning, multi-step tasks | 8–20 t/s
70B | 40–48 GB | Dual GPU, M2 Ultra Mac | Near-GPT-4 quality tasks | 3–10 t/s
Quantized 7B (Q4) | 3–4 GB | Older laptops, 8 GB RAM | Basic tasks, privacy-sensitive drafts | 20–60 t/s
Model size guide: match parameters to your hardware before downloading anything

Quantization cuts size without killing quality

Quantized models compress weights from 16-bit to 4-bit or 8-bit precision. A quantized 7B model (Q4_K_M format) uses roughly half the VRAM of the full version with only a 5–10% quality drop on most tasks. In Ollama and LM Studio, look for model names ending in Q4_K_M or Q5_K_M — these are the best balance of size and performance for most professionals.

The Leading Local Models Right Now

The open-source model landscape moves fast. Meta's Llama 3 family (released April 2024) set a new benchmark for what's possible locally — the 8B version matches or beats older GPT-3.5 on many tasks, and the 70B version trades blows with GPT-4 on reasoning benchmarks. Mistral's models punch above their weight: Mistral 7B outperforms Llama 2 13B despite being half the size. Microsoft's Phi-3 Mini (3.8B parameters) is remarkable for edge devices — it runs on smartphones and handles structured tasks well. Google's Gemma 2 (2B and 9B) is optimized for instruction-following and runs efficiently on consumer hardware.

For code-specific work, Meta's Code Llama and DeepSeek Coder V2 are purpose-built — they outperform general models on programming tasks by a significant margin. If your work is primarily English-language document processing, Llama 3 8B or Mistral 7B cover 80% of professional use cases. If you need multilingual support, Llama 3 has stronger non-English performance than Mistral. For structured data extraction or JSON output, Mistral's models follow formatting instructions more reliably. The right model isn't the biggest one — it's the one calibrated to your actual workflow.

Model | Developer | Sizes Available | Strengths | Weaknesses | Best For
Llama 3 | Meta | 8B, 70B | Broad reasoning, multilingual, instruction following | Larger sizes need serious hardware | General professional tasks
Mistral 7B / Mixtral | Mistral AI | 7B, 8x7B | Efficient, fast, strong structured output | Weaker on non-English | Summarization, drafting, analysis
Phi-3 Mini | Microsoft | 3.8B | Tiny footprint, surprisingly capable | Struggles with long context | Edge devices, quick tasks
Gemma 2 | Google | 2B, 9B | Clean instruction-following, safe outputs | Less creative than Llama | Structured tasks, Q&A
Code Llama | Meta | 7B, 13B, 34B | Code generation, debugging, explanation | Weak on non-code tasks | Developers, technical analysts
DeepSeek Coder V2 | DeepSeek | 16B, 236B | State-of-the-art coding benchmark scores | Large sizes only | Serious software development
Current leading local models as of mid-2024 — check Hugging Face leaderboards for updated rankings

Where to find and compare models

The Hugging Face Open LLM Leaderboard ranks open models on standardized benchmarks — it's the most reliable public comparison tool. For practical local deployment, the Ollama model library (ollama.com/library) lists models tested and packaged for easy installation. LM Studio's built-in search pulls directly from Hugging Face. Always check a model's license: Llama 3 and Mistral allow commercial use; some others restrict it.

Running Local Models: What a Real Workflow Looks Like

The abstraction of 'running a local model' hides a concrete sequence of steps. With Ollama — the most accessible tool for non-developers — the entire process from installation to first response takes under 10 minutes on a capable machine. You install Ollama, pull a model with a single terminal command, and interact via a chat interface or API. LM Studio offers a graphical interface for those who prefer not to use a terminal. Both tools handle the complex parts: model loading, memory management, and inference optimization. What you see is a chat window. What's happening underneath is a full inference engine running on your hardware.

The API compatibility these tools offer is strategically important. Ollama exposes an OpenAI-compatible API endpoint at localhost:11434. This means any application built to talk to ChatGPT's API can be redirected to talk to your local model instead — by changing one URL. Tools like Obsidian, VS Code extensions, and custom scripts that use the OpenAI API can be pointed at your local Ollama instance, giving you private AI integration inside your existing tools without rewriting anything. This is how professionals build genuinely private AI workflows without depending on any external service.
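As a concrete illustration, the sketch below points the official openai Python client (version 1 or later) at a local Ollama instance via its OpenAI-compatible /v1 endpoint. It assumes Ollama is running with a llama3 model pulled; the API key is a placeholder because Ollama ignores it, but the client library requires a value.

```python
# Reuse an OpenAI-style client against a local Ollama server.
# Only the base_url changes; no request ever leaves localhost.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama, not api.openai.com
    api_key="ollama",                      # placeholder; Ollama does not check it
)

completion = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user",
               "content": "List three risks to flag in a fixed-price vendor contract."}],
)
print(completion.choices[0].message.content)
```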

  1. Install Ollama from ollama.com (macOS, Windows, Linux — all supported)
  2. Open terminal and run: ollama pull llama3 (downloads the 8B model, ~4.7 GB)
  3. Start a chat session: ollama run llama3
  4. Type your prompt — response generates locally, nothing leaves your machine
  5. For a GUI, open LM Studio, search for your model, download, and load it
  6. To use the API: send POST requests to http://localhost:11434/api/chat
  7. Point any OpenAI-compatible tool at localhost:11434 to use your local model

Local model prompt — sensitive client analysis

Prompt

The following is a summary of notes from a client strategy session. Extract the three most significant business risks they mentioned, and for each risk, suggest one concrete mitigation action. Keep your response under 300 words. [PASTE CLIENT NOTES HERE]

AI Response

**Risk 1: Supply chain concentration**
The client sources 74% of critical components from a single supplier in Malaysia. A disruption — weather, geopolitical, or operational — would halt production within 2 weeks.
Mitigation: Qualify a secondary supplier in a different geography within 90 days. Negotiate a dual-source agreement even if the secondary supplier is more expensive.

**Risk 2: Key-person dependency**
Two individuals hold the client relationships for 60% of annual revenue. No documented handover process exists.
Mitigation: Initiate structured relationship transfer over Q3 — assign junior account managers to shadow all key client calls and begin formal introduction to client contacts.

**Risk 3: Pricing model exposure**
Fixed-price contracts signed 18 months ago don't include inflation adjustment clauses. Margins have compressed by an estimated 8–12 points.
Mitigation: Renegotiate renewal contracts to include CPI-linked adjustment clauses. Flag the three largest fixed-price contracts for priority review.

Privacy Boundaries: What Local Actually Protects

Running a model locally does not make you immune to all privacy risks. It eliminates one major risk: your data leaving your machine to a third-party server. But other risks remain. If you're running Ollama on a networked computer, other users on the same network can potentially query your local API endpoint unless you configure firewall rules. If your prompts or outputs are stored in a local log file — which many tools do by default — that data can be accessed by anyone with filesystem access to your machine. Local means no cloud transmission, not no exposure.

The model itself is also a consideration. Open-weight models like Llama 3 are released by companies (Meta, in this case) who designed and trained them. The model weights don't phone home — they're static files on your disk. But the training data those models learned from may include copyrighted material, personal data scraped from the web, or content with uncertain provenance. This doesn't affect your privacy directly, but it matters for compliance in regulated industries. Legal and healthcare professionals should verify that their organization's AI policy covers locally-run open-weight models, not just cloud services.

Keep Ollama's API bound to localhost

Ollama binds its API to localhost (127.0.0.1) by default, but team and server guides often set OLLAMA_HOST=0.0.0.0 so other machines can reach it. In that configuration, any device on your local network can send it prompts. On a home network this is usually low risk; in a corporate office or shared workspace, it's a real exposure. Keep Ollama restricted to localhost by leaving OLLAMA_HOST unset or by explicitly setting OLLAMA_HOST=127.0.0.1 before starting the service. Check your LM Studio server settings for equivalent controls.
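If you would rather verify the binding than trust it, a short standard-library check like the one below tells you whether the API answers on your machine's network address as well as on localhost. The port is Ollama's default, and the LAN-address lookup is best-effort (on some systems it resolves to 127.0.0.1, in which case look up your real address in network settings).

```python
# Check whether the Ollama API is reachable only on localhost or also
# via this machine's LAN address. Standard library only.
import socket

def port_open(host: str, port: int = 11434) -> bool:
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

lan_ip = socket.gethostbyname(socket.gethostname())  # best-effort LAN address

print(f"127.0.0.1:11434 reachable: {port_open('127.0.0.1')}")
print(f"{lan_ip}:11434 reachable:  {port_open(lan_ip)}")
# If the second line prints True, other devices on the network can reach the API.
```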

Threat | Cloud AI Tools | Local AI (Default) | Local AI (Hardened)
Data sent to vendor servers | Yes — every prompt | No | No
Vendor uses data for training | Possible (check ToS) | No | No
Network interception of prompts | Encrypted (TLS) | Risk if LAN exposed | Mitigated with localhost binding
Local log file exposure | N/A | Possible | Disable logging in tool settings
Unauthorized LAN access to API | N/A | Risk if bound to 0.0.0.0 | Set OLLAMA_HOST=127.0.0.1
Regulatory compliance gaps | Major concern | Reduced but not eliminated | Requires policy review
Privacy threat comparison: cloud AI vs. local AI in default and hardened configurations

Set Up and Test a Private Local AI Session

Goal: Complete a real professional task using a locally-run model, confirm no data was transmitted externally, and establish a baseline for evaluating local model quality against cloud alternatives.

1. Download and install Ollama from ollama.com for your operating system.
2. Before launching, set the environment variable OLLAMA_HOST=127.0.0.1 to restrict API access to localhost only.
3. Open a terminal and run: ollama pull mistral — this downloads the Mistral 7B model (~4.1 GB).
4. Once downloaded, run: ollama run mistral to open a local chat session.
5. Paste a paragraph of real work content you would normally hesitate to put into ChatGPT — a client email, a draft report section, or internal meeting notes.
6. Prompt the model to summarize, extract key points, or suggest improvements — verify the output quality meets your standard.
7. Open your system's network monitor (Activity Monitor on Mac, Task Manager on Windows) and confirm zero outbound network traffic occurred during the session.

Running and Trusting Local AI: What You Need to Know

You have a local model running. Now the real questions start: which tasks belong on local AI, which still need the cloud, and how do you know your data never leaves the machine? This section answers those questions with concrete criteria, a deployment decision table, and a practical privacy audit you can keep on file. Local AI is not a replacement for every cloud tool — it is a precise instrument for specific, sensitive work.

  1. Local models process everything on your CPU or GPU — no data touches an external server.
  2. Model quality scales with your hardware: 8GB RAM handles 7B models; 16GB+ handles 13B–30B models comfortably.
  3. Ollama, LM Studio, and Jan are the three dominant local runners for non-developers in 2024.
  4. Llama 3 (Meta), Mistral 7B, and Phi-3 (Microsoft) are the top open-weight models for professional use.
  5. Quantized models (GGUF format) trade ~5% accuracy for dramatically lower hardware requirements.
  6. Local models have no memory between sessions unless you configure a persistent context tool like Continue or Open WebUI.
  7. Air-gapped deployment — running on a machine with no internet — is the gold standard for maximum data privacy.

Choosing the Right Task for Local AI

Not every task benefits from local deployment. Cloud models like GPT-4o and Claude 3.5 Sonnet still outperform most local alternatives on complex reasoning, long-document synthesis, and real-time web access. Where local models win decisively is on sensitive data: HR documents, legal drafts, financial summaries, client records, and internal strategy. The privacy advantage is architectural: prompts are never transmitted to a vendor at all, rather than merely being protected by policy.

Local models also excel at repetitive, templated work — reformatting data, summarizing meeting notes, drafting standard emails — where a 7B model is fast enough and accurate enough. Response speed on modern laptops (Apple M-series or a recent AMD/Intel with integrated GPU) is typically 15–40 tokens per second, which feels close to real-time for most writing tasks. For batch processing of sensitive files overnight, speed becomes irrelevant and privacy becomes everything.
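A minimal sketch of that overnight pattern, assuming Ollama is running with Mistral pulled and using a hypothetical confidential_notes folder of plain-text files, might look like this; every request stays on localhost:

```python
# Batch-summarize every .txt file in a folder through the local Ollama API
# and write each summary next to the original. Folder and model names are
# placeholders -- adjust to your own setup.
from pathlib import Path
import requests

MODEL = "mistral"
FOLDER = Path("confidential_notes")  # hypothetical input directory

for doc in FOLDER.glob("*.txt"):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": "Summarize the following notes in five bullet points:\n\n"
                      + doc.read_text(),
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    doc.with_suffix(".summary.txt").write_text(resp.json()["response"])
    print(f"Summarized {doc.name}")
```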

  • Use local AI for: PII-heavy documents, internal financials, client data, HR reviews, legal drafts, proprietary code.
  • Use cloud AI for: complex multi-step reasoning, real-time research, large context windows (>32K tokens), image generation.
  • Hybrid approach: draft sensitive content locally, then refine non-sensitive structure with a cloud model.
  • Never paste salary data, social security numbers, medical records, or NDA-protected text into ChatGPT, Claude, or Gemini.

The Sensitivity Test

Before pasting anything into a cloud AI, ask: 'Would I be comfortable if this appeared in a data breach headline?' If no, use a local model. This single question eliminates the most common privacy mistakes professionals make with AI tools.

Task Type | Recommended Tool | Reason
Summarize client contracts | Llama 3 (local) | Contains PII and confidential terms
Draft a marketing email | ChatGPT / Claude | No sensitive data; benefits from cloud quality
Analyze HR performance reviews | Mistral 7B (local) | Employee data — legally sensitive in most jurisdictions
Brainstorm product names | ChatGPT / Gemini | No sensitivity; creative breadth matters more
Reformat internal financial tables | Phi-3 (local) | Financial data should not leave the organization
Research competitor landscape | Perplexity / ChatGPT | Needs real-time web access; no sensitive input
Write code handling user PII | Ollama with a code model (e.g., Code Llama) | Source code with data logic is proprietary
Task-to-tool routing guide for privacy-conscious AI use

Model Selection and Hardware Reality

Picking a model without knowing your hardware is like ordering a car without checking the garage. The GGUF quantized format — used by Ollama and LM Studio — compresses models so a 7B parameter model runs in roughly 4–5GB of RAM. A 13B model needs 8–10GB. If you are on a MacBook Pro M2 or M3 with 16GB unified memory, you can run 13B models smoothly. Windows machines need a dedicated GPU with VRAM, or they fall back to slower CPU inference.

For most professionals, Mistral 7B or Llama 3 8B hits the sweet spot: fast, capable, and fits on standard business laptops. Phi-3 Mini from Microsoft is remarkably capable at 3.8B parameters — designed specifically for efficiency on constrained hardware. If your organization has a shared server with a GPU, deploying Ollama server-side lets your whole team query a local model without any individual needing powerful hardware.

Model | Parameters | Min RAM (GGUF) | Best For | Runner
Phi-3 Mini | 3.8B | 3GB | Low-resource devices, quick tasks | Ollama, LM Studio
Mistral 7B | 7B | 5GB | Writing, summarization, instruction-following | Ollama, Jan
Llama 3 8B | 8B | 6GB | General professional use, reasoning | Ollama, LM Studio
Llama 3 70B | 70B | 48GB+ | Near-GPT-4 quality, needs server GPU | Ollama (server)
Code Llama 13B | 13B | 10GB | Code generation and review | Ollama, LM Studio
Local model quick-reference: hardware requirements and use cases

Open-Weight ≠ Fully Open Source

Meta's Llama 3 and Microsoft's Phi-3 are 'open-weight' — the model weights are free to download and run locally, but usage licenses restrict certain commercial applications. For most business use under 700 million monthly users (Llama's threshold), you are fully permitted. Always check the specific model license before enterprise deployment.

Verifying Your Privacy Setup Actually Works

Claiming a setup is private and verifying it are different things. The practical test is network monitoring: run your local model while capturing traffic with a free tool like Little Snitch (Mac) or GlassWire (Windows). If the AI runner makes zero outbound connections while processing your prompt, your data is staying local. Ollama by default binds to localhost (127.0.0.1) and makes no outbound calls during inference — but update checks happen separately, so watch for that distinction.
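If you prefer a scriptable check over a GUI monitor, the sketch below uses the psutil package to list any remote endpoints held by the runner process while it is generating. The process names are assumptions (they vary by platform and version), and some operating systems require elevated privileges to inspect another process's sockets.

```python
# Snapshot the remote connections of a local AI runner process.
# Run this while the model is generating a response.
# Loopback addresses (127.0.0.1) are expected; anything else is outbound.
import psutil

RUNNER_NAMES = {"ollama", "ollama_llama_server"}  # assumed names; adjust as needed

for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() in RUNNER_NAMES:
        try:
            remotes = [c.raddr for c in proc.connections(kind="inet") if c.raddr]
        except psutil.AccessDenied:
            print(f"{proc.info['name']} (pid {proc.pid}): need elevated privileges")
            continue
        print(f"{proc.info['name']} (pid {proc.pid}): "
              f"{remotes or 'no remote connections'}")
```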

For organizations handling regulated data (HIPAA, GDPR, SOC 2 scope), air-gapping the inference machine adds a second layer: no network card active means no accidental transmission regardless of software behavior. Document your local AI setup in writing — which model, which runner, which hardware, network status — because auditors and legal teams increasingly ask about AI data handling. A one-page configuration record protects you and your organization.

Update and Telemetry Risks

LM Studio and Ollama both check for updates by default, and desktop runners may also send anonymous telemetry. Neither involves prompt data, but both are outbound traffic. In high-security environments, disable auto-update in settings and block outbound connections from the runner application entirely. Run manual updates on a schedule instead. Assume any software with a network connection can be a vector until you verify otherwise.

Build Your Personal Local AI Privacy Audit Card

Goal: Produce a completed one-page Local AI Configuration Record that documents your setup, verifies network isolation, and defines your personal data routing policy — a document you can share with a manager, compliance team, or keep for your own reference.

1. Open a blank document (Word, Notion, or Google Docs) and title it 'Local AI Configuration Record — [Your Name] — [Date]'.
2. Record your hardware: device model, RAM, GPU/VRAM if applicable, and operating system.
3. List the local AI runner you installed (Ollama, LM Studio, or Jan) and its version number.
4. List the model(s) you have pulled, including the exact model name and parameter size (e.g., 'llama3:8b', 'mistral:7b-instruct'). If you run Ollama, the sketch after this list can pre-fill these model details along with your OS info.
5. Note your network status during inference: open the runner, run a test prompt ('Summarize: The meeting covered Q3 targets'), then check your network monitor to confirm zero external connections were made. Write 'Confirmed local-only' or note any exceptions.
6. Write a two-sentence data handling statement: what types of work you will route to local AI versus cloud AI, based on the task routing table above.
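If Ollama is your runner, a short script can pre-fill part of the record. The sketch below queries Ollama's /api/tags endpoint on the default port and assumes the requests package; RAM, GPU, and network-status fields still need to be filled in by hand.

```python
# Pre-fill part of the Local AI Configuration Record: OS details plus the
# models currently installed in a local Ollama instance.
import platform
from datetime import date
import requests

models = requests.get("http://localhost:11434/api/tags", timeout=10).json().get("models", [])

print(f"Local AI Configuration Record — {date.today()}")
print(f"OS: {platform.system()} {platform.release()} ({platform.machine()})")
print("Installed models:")
for m in models:
    print(f"  - {m['name']} (~{m.get('size', 0) / 1e9:.1f} GB on disk)")
print("Hardware (RAM / GPU): ____")
print("Network status during inference: ____")
```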

Local AI Cheat Sheet

  • Best runners: Ollama (CLI, developer-friendly), LM Studio (GUI, beginner-friendly), Jan (open-source GUI).
  • Best models for most professionals: Mistral 7B or Llama 3 8B on 8–16GB RAM machines.
  • GGUF quantized models run on CPU — no GPU required, just slower.
  • Sensitive data routing: HR, legal, financial, client PII → always local.
  • Verify privacy by monitoring network traffic during inference — zero outbound = confirmed local.
  • Disable telemetry and auto-updates in high-security environments.
  • Air-gap (no internet) = maximum protection for regulated data environments.
  • Document your setup: model name, runner version, hardware, network status.
  • Open-weight licenses (Llama 3, Phi-3) permit most business use — verify before large-scale enterprise deployment.

Key Takeaways

  1. Local models run entirely on your hardware — your prompts and outputs never leave your machine.
  2. Hardware determines model choice: match parameter size to your available RAM using the GGUF format.
  3. Ollama, LM Studio, and Jan make local deployment accessible without developer skills.
  4. Route sensitive tasks (PII, financials, legal, HR) to local models; use cloud models for complex reasoning and research.
  5. Verify your privacy setup with network monitoring — don't assume, confirm.
  6. Disable telemetry and update checks in regulated or high-security environments.
  7. Document your configuration: a one-page record satisfies auditors and clarifies your own policy.

Knowledge Check

You need to summarize 50 employee performance reviews to identify common themes. Which approach is most appropriate?

Your MacBook Pro M2 has 8GB of unified memory. Which model configuration is most realistic for smooth local inference?

How do you confirm that your local AI runner is genuinely not sending your prompts to external servers?

What does 'open-weight' mean in the context of models like Llama 3 and Phi-3?

A colleague says: 'I disabled the internet on my laptop while using LM Studio, so I don't need to worry about telemetry.' What is the most accurate assessment?
