Lesson 6 of 8

Local AI models: the private alternative

~16 min read

Every prompt you send to ChatGPT, Claude, or Gemini travels to a remote server, gets processed, and may be stored, reviewed, or used to improve future models. For most tasks, that's fine. For sensitive work — HR decisions, legal strategy, client financials, medical data — it's a genuine liability. Local AI models run entirely on your own hardware. No data leaves your machine. No terms of service govern what you can ask. No vendor can read your prompts. This lesson maps the local AI landscape so you can make informed decisions about when and how to use it.

7 Things You Need to Know About Local AI

  1. Local models run on your CPU or GPU — your laptop, desktop, or on-premise server — with zero internet required after setup.
  2. The leading tool for running local models is Ollama (free, open-source), which lets you download and run models like Llama 3, Mistral, and Phi-3 in minutes.
  3. Model quality has closed significantly: Meta's Llama 3 8B (free) performs comparably to GPT-3.5 on many professional tasks.
  4. Hardware is the real constraint — a 7B parameter model needs roughly 8GB of RAM; a 13B model needs 16GB; a 70B model needs 48GB+.
  5. LM Studio provides a graphical interface for non-technical users — no command line required — and works on Mac, Windows, and Linux.
  6. Privacy is absolute by design: if your machine isn't connected to the internet while running the model, no data can possibly leave.
  7. Local models can be fine-tuned on your own documents, enabling custom AI assistants trained exclusively on company data.

Why Cloud AI Creates Privacy Exposure

When you use ChatGPT or Claude through a browser or the default API, your prompts are transmitted over HTTPS to OpenAI's or Anthropic's servers. Both companies retain conversation data by default, though retention periods and training opt-outs vary by plan. OpenAI's free and Plus tiers use conversations for model improvement unless you manually disable it in settings. Claude's free tier operates under Anthropic's standard privacy policy, which permits human review of conversations for safety purposes. Enterprise plans for both offer stronger contractual protections, but the data still leaves your infrastructure.

The risk isn't hypothetical. In 2023, Samsung engineers accidentally leaked proprietary chip design data by pasting source code into ChatGPT. The incident prompted Samsung to ban generative AI tools on internal devices entirely. Legal, healthcare, and financial firms face additional regulatory exposure: GDPR, HIPAA, and SOC 2 compliance frameworks all have specific requirements about where sensitive data can be processed. A local model eliminates this exposure entirely — not by adding security layers, but by removing the data transmission step from the equation.

  • ChatGPT free/Plus: conversations used for training by default; opt-out available in Settings → Data Controls
  • Claude free: subject to human review; no guaranteed opt-out for free users
  • Gemini (Google): data may be reviewed by human raters; tied to your Google account
  • Microsoft Copilot (M365): enterprise version offers stronger isolation; consumer version stores data in Microsoft cloud
  • All cloud providers: subject to government data requests under applicable law in their jurisdiction

Quick Privacy Audit

Before pasting anything into a cloud AI tool, ask: 'Would I be comfortable if this text appeared in a court filing, a news article, or a competitor's hands?' If the answer is no, that content belongs in a local model. This single question catches 90% of risky prompts before they become a problem.

Cloud vs. Local: The Real Trade-offs

Factor | Cloud AI (ChatGPT, Claude) | Local AI (Ollama + Llama 3)
Data privacy | Data leaves your device; vendor policies apply | Data never leaves your hardware
Setup time | Instant — browser or API key | 15–45 minutes for first-time setup
Cost (ongoing) | $0–$30+/user/month depending on plan | $0 after hardware; electricity only
Model quality (general) | GPT-4o, Claude 3.5 Sonnet — frontier models | Llama 3 70B approaches GPT-4 level; 8B is GPT-3.5 level
Internet required | Always | Only for initial model download
Customization | Limited (system prompts, fine-tuning via API) | Full access — fine-tune, modify, deploy freely
Hardware requirement | None beyond a browser | 8GB RAM minimum; 16–64GB recommended
Compliance (HIPAA/GDPR) | Requires enterprise contracts + BAAs | On-premise processing simplifies compliance; other controls still apply
Core trade-offs between cloud and local AI deployment for professional use

The Local AI Ecosystem: Tools You Should Know

Three tools dominate the local AI space for non-developers. Ollama is a command-line tool that manages model downloads and serves a local API — think of it as the engine. LM Studio wraps that functionality in a clean desktop GUI with a built-in chat interface, model library browser, and OpenAI-compatible API server. Jan.ai is a newer entrant with a similar GUI approach but stronger focus on extensions and integrations. For teams wanting a shared internal server rather than individual laptops, Ollama can be deployed on a local network machine or private cloud instance, making models accessible to everyone without any data leaving the organization's infrastructure.

The models themselves are separate from the tools that run them. Hugging Face hosts over 500,000 open models, though most aren't optimized for local inference. The practical shortlist for professional use is much shorter: Llama 3 (Meta), Mistral and Mixtral (Mistral AI), Phi-3 (Microsoft), Gemma 2 (Google), and Qwen 2 (Alibaba). These models are released under licenses that generally permit commercial use, though you should verify the specific license for your use case — Llama 3's license, for example, restricts use by companies with over 700 million monthly active users.

  1. Ollama — install at ollama.com; run 'ollama run llama3' to download and start the model in one command
  2. LM Studio — download at lmstudio.ai; browse models visually, chat in-app, or run as a local API server
  3. Jan.ai — download at jan.ai; strong extension ecosystem, good for integrating with other local tools
  4. AnythingLLM — wraps Ollama with a full RAG (retrieval-augmented generation) interface for chatting with your own documents
  5. Open WebUI — a browser-based ChatGPT-style interface that connects to a local Ollama instance, good for team deployments

Choosing the Right Local Model for Your Hardware

Model | Size | RAM Required | Best For | License
Llama 3.2 3B | 3 billion params | 4–6 GB | Quick tasks, low-power hardware | Llama 3.2 Community License
Llama 3.1 8B | 8 billion params | 8–10 GB | General professional use, good balance | Llama 3.1 Community License
Mistral 7B | 7 billion params | 8 GB | Instruction following, writing tasks | Apache 2.0 (commercial OK)
Phi-3 Medium | 14 billion params | 10–12 GB | Reasoning, analysis, coding | MIT License (commercial OK)
Llama 3.1 70B | 70 billion params | 48 GB+ | Near-frontier quality, complex tasks | Llama 3.1 Community License
Mixtral 8x7B | ~47B total params (13B active per token) | 26–32 GB | High quality, mixture-of-experts architecture | Apache 2.0 (commercial OK)
Gemma 2 9B | 9 billion params | 10–12 GB | Strong at structured reasoning, Google-trained | Gemma Terms of Use
Practical local model selection guide — RAM figures are for 4-bit quantized versions (standard for local use)

Quantization: Why 70B Models Fit on Normal Hardware

Raw model weights for a 70B parameter model would require ~140GB in full 16-bit precision. Quantization compresses those weights to 4-bit or 8-bit representations, reducing Llama 3.1 70B to roughly 40GB with minimal quality loss. Ollama and LM Studio handle this automatically — when you download a model, you're getting the quantized version. You don't need to understand the math; just know that the RAM figures in the table above already reflect this compression.
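To see where those numbers come from, here is a back-of-the-envelope sketch of the arithmetic (weights only; real memory use also includes the KV cache and runtime overhead, so treat the figures as lower bounds):

```python
# Rough size of raw model weights at different precisions.
# Illustrative arithmetic only -- actual RAM needs are somewhat higher
# because of the KV cache and runtime overhead.

def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_size_gb(70, bits):.0f} GB")

# Prints roughly: 140 GB at 16-bit, 70 GB at 8-bit, 35 GB at 4-bit.
```
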
Local Model Privacy Test — Running Mistral 7B via Ollama

Prompt

I'm going to share some confidential information with you. Our company, Meridian Consulting, is planning to acquire a competitor, Apex Analytics, at a valuation of $14M. This is not public. Help me draft talking points for a board presentation explaining the strategic rationale.

AI Response

Here are concise board-level talking points for the Meridian-Apex acquisition rationale:

**Strategic Fit**
- Apex Analytics fills the data engineering capability gap that currently limits Meridian's ability to compete for enterprise contracts above $500K
- Combined entity addresses a $2.1B addressable market; neither firm reaches it independently

**Financial Logic**
- $14M valuation represents 2.3x Apex's trailing revenue — within normal range for analytics firms at this growth stage
- Projected cost synergies of $800K–$1.2M annually through consolidated infrastructure and eliminated redundant roles

**Risk Mitigation**
- Key retention: structure earn-out provisions for Apex's three lead data scientists
- Integration timeline: 90-day parallel operations period before system consolidation

Would you like me to expand any section or adjust the tone for a more conservative board?

This Prompt Would Be Dangerous in ChatGPT

The prompt above contains material non-public information (MNPI) about an acquisition — exactly the kind of content that creates legal liability if transmitted to a third-party server. Running this through ChatGPT's standard interface could violate securities regulations, breach client confidentiality agreements, or expose your firm to regulatory scrutiny. The identical prompt run through Ollama on a disconnected laptop carries zero data transmission risk. Same output. Entirely different risk profile.

Quick-Start Task: Run Your First Local Model

Install Ollama and Run Llama 3 Locally

Goal: Have a working local LLM running on your own hardware, with firsthand experience of its capability and response speed, ready to use for sensitive professional tasks.

1. Go to ollama.com and download the installer for your operating system (Mac, Windows, or Linux). Run the installer — it takes under 2 minutes.
2. Open your terminal (Mac: press Cmd+Space, type 'Terminal'; Windows: search 'Command Prompt' or 'PowerShell').
3. Type the command: ollama run llama3.1 and press Enter. Ollama will download the 8B model (~4.7GB). This takes 5–15 minutes depending on your connection.
4. Once the download completes, you'll see a '>>>' prompt. Type a test message: 'Summarize the key risks of a vendor contract in 3 bullet points.' Press Enter.
5. Review the response. Note the response speed — on most modern laptops with 16GB RAM, you'll see 15–30 tokens per second.
6. Now test privacy-sensitive content: paste a paragraph of fictional but realistic confidential business data and ask for analysis. Confirm you're comfortable knowing this never left your machine. (The sketch after this list shows how to send the same test prompt to the running model over its local API.)
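Once the model is running, you can also confirm it answers over the local API. The sketch below sends the same test prompt to Ollama's /api/chat endpoint on its default port; it assumes the requests package is installed and that llama3.1 has been pulled.

```python
# Send one test prompt to the local Ollama server and print the reply.
# Nothing in this script talks to any host other than localhost.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "user",
             "content": "Summarize the key risks of a vendor contract in 3 bullet points."}
        ],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```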

Local AI Quick-Reference Cheat Sheet

  • Best tool to start with: LM Studio (GUI, no terminal needed) — lmstudio.ai
  • Best model for most professionals: Llama 3.1 8B (8GB RAM) or Mistral 7B (8GB RAM)
  • Best model if you have 16GB+ RAM: Phi-3 Medium 14B or Llama 3.1 8B with more headroom
  • Best model for near-GPT-4 quality: Llama 3.1 70B — needs 48GB RAM or a good GPU
  • Command to pull any model in Ollama: ollama pull [modelname] (e.g., ollama pull mistral)
  • Command to list installed models: ollama list
  • To run as a local API (OpenAI-compatible): Ollama serves on http://localhost:11434 by default
  • AnythingLLM = best option for chatting with your own PDF/Word documents locally
  • Internet needed: only during model download; inference runs fully offline
  • Quantization is automatic: 4-bit models are standard and suitable for professional use

Key Takeaways So Far

  1. Cloud AI tools transmit every prompt to vendor servers — this creates real legal and reputational risk for sensitive professional content.
  2. Local models run entirely on your hardware; with no internet connection, data transmission is architecturally impossible.
  3. Ollama and LM Studio make local AI accessible to non-technical professionals in under an hour of setup time.
  4. The hardware constraint is RAM: 8GB supports capable 7B–8B models; 16GB opens up 13B–14B models with significantly better output.
  5. Quantization makes large models practical — a 70B model fits in ~40GB RAM thanks to automatic compression in tools like Ollama.
  6. Open-source models like Llama 3.1 8B perform at GPT-3.5 level on most professional tasks, which is sufficient for the majority of workplace use cases.
  7. Regulatory frameworks (HIPAA, GDPR, SOC 2) impose strict requirements on where and how sensitive data is processed; local models satisfy the data-residency part of those requirements by default.

Choosing the Right Local Model for Your Work

Not every local model fits every job. The gap between a 7-billion-parameter model running on a laptop and a 70-billion-parameter model running on a workstation is enormous — in quality, speed, and hardware cost. Your choice depends on three variables: what you're doing with the model, what hardware you have, and how much accuracy you actually need. A consultant summarizing client notes has different requirements than a developer generating code or an analyst processing financial data. Get this decision wrong and you'll either run a model too slow to use or waste time on one too weak to trust.

The Model Size vs. Hardware Reality

Model size is measured in parameters — the numerical weights that define what a model knows. More parameters generally means better reasoning, broader knowledge, and more nuanced output. The catch is VRAM: video memory on your GPU. A 7B model typically needs 4–6 GB of VRAM to run at full speed. A 13B model needs 8–10 GB. A 70B model needs 40–48 GB — that's a high-end workstation or a Mac with 64 GB unified memory. CPU-only inference is possible but runs 5–20x slower, making it impractical for real work. Match your model to your machine before anything else.

Model Size | VRAM Required | Typical Hardware | Best Use Cases | Speed (tokens/sec)
3B–7B | 4–6 GB | Most gaming laptops, M1/M2 MacBook | Summarization, Q&A, drafting | 30–80 t/s
13B | 8–10 GB | RTX 3080/4070, M2 Pro Mac | Analysis, coding, longer docs | 15–40 t/s
34B | 20–24 GB | RTX 4090, M2 Max Mac | Complex reasoning, multi-step tasks | 8–20 t/s
70B | 40–48 GB | Dual GPU, M2 Ultra Mac | Near-GPT-4 quality tasks | 3–10 t/s
Quantized 7B (Q4) | 3–4 GB | Older laptops, 8 GB RAM | Basic tasks, privacy-sensitive drafts | 20–60 t/s
Model size guide: match parameters to your hardware before downloading anything

Quantization cuts size without killing quality

Quantized models compress weights from 16-bit to 4-bit or 8-bit precision. A quantized 7B model (Q4_K_M format) uses roughly half the VRAM of the full version with only a 5–10% quality drop on most tasks. In Ollama and LM Studio, look for model names ending in Q4_K_M or Q5_K_M — these are the best balance of size and performance for most professionals.

The Leading Local Models Right Now

The open-source model landscape moves fast. Meta's Llama 3 family (released April 2024) set a new benchmark for what's possible locally — the 8B version matches or beats older GPT-3.5 on many tasks, and the 70B version trades blows with GPT-4 on reasoning benchmarks. Mistral's models punch above their weight: Mistral 7B outperforms Llama 2 13B despite being half the size. Microsoft's Phi-3 Mini (3.8B parameters) is remarkable for edge devices — it runs on smartphones and handles structured tasks well. Google's Gemma 2 (2B and 9B) is optimized for instruction-following and runs efficiently on consumer hardware.

For code-specific work, Meta's Code Llama and DeepSeek Coder V2 are purpose-built — they outperform general models on programming tasks by a significant margin. If your work is primarily English-language document processing, Llama 3 8B or Mistral 7B cover 80% of professional use cases. If you need multilingual support, Llama 3 has stronger non-English performance than Mistral. For structured data extraction or JSON output, Mistral's models follow formatting instructions more reliably. The right model isn't the biggest one — it's the one calibrated to your actual workflow.

Model | Developer | Sizes Available | Strengths | Weaknesses | Best For
Llama 3 | Meta | 8B, 70B | Broad reasoning, multilingual, instruction following | Larger sizes need serious hardware | General professional tasks
Mistral 7B / Mixtral | Mistral AI | 7B, 8x7B | Efficient, fast, strong structured output | Weaker on non-English | Summarization, drafting, analysis
Phi-3 Mini | Microsoft | 3.8B | Tiny footprint, surprisingly capable | Struggles with long context | Edge devices, quick tasks
Gemma 2 | Google | 2B, 9B | Clean instruction-following, safe outputs | Less creative than Llama | Structured tasks, Q&A
Code Llama | Meta | 7B, 13B, 34B | Code generation, debugging, explanation | Weak on non-code tasks | Developers, technical analysts
DeepSeek Coder V2 | DeepSeek | 16B, 236B | State-of-the-art coding benchmark scores | Large sizes only | Serious software development
Current leading local models as of mid-2024 — check Hugging Face leaderboards for updated rankings

Where to find and compare models

The Hugging Face Open LLM Leaderboard ranks open models on standardized benchmarks — it's the most reliable public comparison tool. For practical local deployment, the Ollama model library (ollama.com/library) lists models tested and packaged for easy installation. LM Studio's built-in search pulls directly from Hugging Face. Always check a model's license: Llama 3 and Mistral allow commercial use; some others restrict it.

Running Local Models: What a Real Workflow Looks Like

The abstraction of 'running a local model' hides a concrete sequence of steps. With Ollama — the most accessible tool for non-developers — the entire process from installation to first response takes under 10 minutes on a capable machine. You install Ollama, pull a model with a single terminal command, and interact via a chat interface or API. LM Studio offers a graphical interface for those who prefer not to use a terminal. Both tools handle the complex parts: model loading, memory management, and inference optimization. What you see is a chat window. What's happening underneath is a full inference engine running on your hardware.

The API compatibility these tools offer is strategically important. Ollama exposes an OpenAI-compatible API endpoint at localhost:11434. This means any application built to talk to ChatGPT's API can be redirected to talk to your local model instead — by changing one URL. Tools like Obsidian, VS Code extensions, and custom scripts that use the OpenAI API can be pointed at your local Ollama instance, giving you private AI integration inside your existing tools without rewriting anything. This is how professionals build genuinely private AI workflows without depending on any external service.
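As a concrete illustration, the sketch below points the official openai Python client (version 1 or later) at a local Ollama instance via its OpenAI-compatible /v1 endpoint. It assumes Ollama is running with a llama3 model pulled; the API key is a placeholder because Ollama ignores it, but the client library requires a value.

```python
# Reuse an OpenAI-style client against a local Ollama server.
# Only the base_url changes; no request ever leaves localhost.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama, not api.openai.com
    api_key="ollama",                      # placeholder; Ollama does not check it
)

completion = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user",
               "content": "List three risks to flag in a fixed-price vendor contract."}],
)
print(completion.choices[0].message.content)
```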

  1. Install Ollama from ollama.com (macOS, Windows, Linux — all supported)
  2. Open terminal and run: ollama pull llama3 (downloads the 8B model, ~4.7 GB)
  3. Start a chat session: ollama run llama3
  4. Type your prompt — response generates locally, nothing leaves your machine
  5. For a GUI, open LM Studio, search for your model, download, and load it
  6. To use the API: send POST requests to http://localhost:11434/api/chat
  7. Point any OpenAI-compatible tool at localhost:11434 to use your local model

Local model prompt — sensitive client analysis

Prompt

The following is a summary of notes from a client strategy session. Extract the three most significant business risks they mentioned, and for each risk, suggest one concrete mitigation action. Keep your response under 300 words. [PASTE CLIENT NOTES HERE]

AI Response

**Risk 1: Supply chain concentration**
The client sources 74% of critical components from a single supplier in Malaysia. A disruption — weather, geopolitical, or operational — would halt production within 2 weeks.
Mitigation: Qualify a secondary supplier in a different geography within 90 days. Negotiate a dual-source agreement even if the secondary supplier is more expensive.

**Risk 2: Key-person dependency**
Two individuals hold the client relationships for 60% of annual revenue. No documented handover process exists.
Mitigation: Initiate structured relationship transfer over Q3 — assign junior account managers to shadow all key client calls and begin formal introduction to client contacts.

**Risk 3: Pricing model exposure**
Fixed-price contracts signed 18 months ago don't include inflation adjustment clauses. Margins have compressed by an estimated 8–12 points.
Mitigation: Renegotiate renewal contracts to include CPI-linked adjustment clauses. Flag the three largest fixed-price contracts for priority review.

Privacy Boundaries: What Local Actually Protects

Running a model locally does not make you immune to all privacy risks. It eliminates one major risk: your data leaving your machine to a third-party server. But other risks remain. If you're running Ollama on a networked computer, other users on the same network can potentially query your local API endpoint unless you configure firewall rules. If your prompts or outputs are stored in a local log file — which many tools do by default — that data can be accessed by anyone with filesystem access to your machine. Local means no cloud transmission, not no exposure.

The model itself is also a consideration. Open-weight models like Llama 3 are released by companies (Meta, in this case) who designed and trained them. The model weights don't phone home — they're static files on your disk. But the training data those models learned from may include copyrighted material, personal data scraped from the web, or content with uncertain provenance. This doesn't affect your privacy directly, but it matters for compliance in regulated industries. Legal and healthcare professionals should verify that their organization's AI policy covers locally-run open-weight models, not just cloud services.

Keep Ollama's API bound to localhost

Ollama binds its API to localhost (127.0.0.1) by default, but team and server guides often set OLLAMA_HOST=0.0.0.0 so other machines can reach it. In that configuration, any device on your local network can send it prompts. On a home network this is usually low risk; in a corporate office or shared workspace, it's a real exposure. Keep Ollama restricted to localhost by leaving OLLAMA_HOST unset or by explicitly setting OLLAMA_HOST=127.0.0.1 before starting the service. Check your LM Studio server settings for equivalent controls.
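If you would rather verify the binding than trust it, a short standard-library check like the one below tells you whether the API answers on your machine's network address as well as on localhost. The port is Ollama's default, and the LAN-address lookup is best-effort (on some systems it resolves to 127.0.0.1, in which case look up your real address in network settings).

```python
# Check whether the Ollama API is reachable only on localhost or also
# via this machine's LAN address. Standard library only.
import socket

def port_open(host: str, port: int = 11434) -> bool:
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

lan_ip = socket.gethostbyname(socket.gethostname())  # best-effort LAN address

print(f"127.0.0.1:11434 reachable: {port_open('127.0.0.1')}")
print(f"{lan_ip}:11434 reachable:  {port_open(lan_ip)}")
# If the second line prints True, other devices on the network can reach the API.
```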

Threat | Cloud AI Tools | Local AI (Default) | Local AI (Hardened)
Data sent to vendor servers | Yes — every prompt | No | No
Vendor uses data for training | Possible (check ToS) | No | No
Network interception of prompts | Encrypted (TLS) | Risk if LAN exposed | Mitigated with localhost binding
Local log file exposure | N/A | Possible | Disable logging in tool settings
Unauthorized LAN access to API | N/A | Risk if bound to 0.0.0.0 | Set OLLAMA_HOST=127.0.0.1
Regulatory compliance gaps | Major concern | Reduced but not eliminated | Requires policy review
Privacy threat comparison: cloud AI vs. local AI in default and hardened configurations

Set Up and Test a Private Local AI Session

Goal: Complete a real professional task using a locally-run model, confirm no data was transmitted externally, and establish a baseline for evaluating local model quality against cloud alternatives.

1. Download and install Ollama from ollama.com for your operating system.
2. Before launching, set the environment variable OLLAMA_HOST=127.0.0.1 to restrict API access to localhost only.
3. Open a terminal and run: ollama pull mistral — this downloads the Mistral 7B model (~4.1 GB).
4. Once downloaded, run: ollama run mistral to open a local chat session.
5. Paste a paragraph of real work content you would normally hesitate to put into ChatGPT — a client email, a draft report section, or internal meeting notes.
6. Prompt the model to summarize, extract key points, or suggest improvements — verify the output quality meets your standard.
7. Open your system's network monitor (Activity Monitor on Mac, Task Manager on Windows) and confirm zero outbound network traffic occurred during the session.

Running and Trusting Local AI: What You Need to Know

You have a local model running. Now the real questions start: which tasks belong on local AI, which still need the cloud, and how do you know your data never leaves the machine? This section answers those questions with concrete criteria, a deployment decision table, and a practical privacy audit you can keep on file. Local AI is not a replacement for every cloud tool — it is a precise instrument for specific, sensitive work.

  1. Local models process everything on your CPU or GPU — no data touches an external server.
  2. Model quality scales with your hardware: 8GB RAM handles 7B models; 16GB+ handles 13B–30B models comfortably.
  3. Ollama, LM Studio, and Jan are the three dominant local runners for non-developers in 2024.
  4. Llama 3 (Meta), Mistral 7B, and Phi-3 (Microsoft) are the top open-weight models for professional use.
  5. Quantized models (GGUF format) trade ~5% accuracy for dramatically lower hardware requirements.
  6. Local models have no memory between sessions unless you configure a persistent context tool like Continue or Open WebUI.
  7. Air-gapped deployment — running on a machine with no internet — is the gold standard for maximum data privacy.

Choosing the Right Task for Local AI

Not every task benefits from local deployment. Cloud models like GPT-4o and Claude 3.5 Sonnet still outperform most local alternatives on complex reasoning, long-document synthesis, and real-time web access. Where local models win decisively is on sensitive data: HR documents, legal drafts, financial summaries, client records, and internal strategy. The privacy advantage is architectural: prompts are never transmitted to a vendor at all, rather than merely being protected by policy.

Local models also excel at repetitive, templated work — reformatting data, summarizing meeting notes, drafting standard emails — where a 7B model is fast enough and accurate enough. Response speed on modern laptops (Apple M-series or a recent AMD/Intel with integrated GPU) is typically 15–40 tokens per second, which feels close to real-time for most writing tasks. For batch processing of sensitive files overnight, speed becomes irrelevant and privacy becomes everything.
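A minimal sketch of that overnight pattern, assuming Ollama is running with Mistral pulled and using a hypothetical confidential_notes folder of plain-text files, might look like this; every request stays on localhost:

```python
# Batch-summarize every .txt file in a folder through the local Ollama API
# and write each summary next to the original. Folder and model names are
# placeholders -- adjust to your own setup.
from pathlib import Path
import requests

MODEL = "mistral"
FOLDER = Path("confidential_notes")  # hypothetical input directory

for doc in FOLDER.glob("*.txt"):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": "Summarize the following notes in five bullet points:\n\n"
                      + doc.read_text(),
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    doc.with_suffix(".summary.txt").write_text(resp.json()["response"])
    print(f"Summarized {doc.name}")
```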

  • Use local AI for: PII-heavy documents, internal financials, client data, HR reviews, legal drafts, proprietary code.
  • Use cloud AI for: complex multi-step reasoning, real-time research, large context windows (>32K tokens), image generation.
  • Hybrid approach: draft sensitive content locally, then refine non-sensitive structure with a cloud model.
  • Never paste salary data, social security numbers, medical records, or NDA-protected text into ChatGPT, Claude, or Gemini.

The Sensitivity Test

Before pasting anything into a cloud AI, ask: 'Would I be comfortable if this appeared in a data breach headline?' If no, use a local model. This single question eliminates the most common privacy mistakes professionals make with AI tools.

Task Type | Recommended Tool | Reason
Summarize client contracts | Llama 3 (local) | Contains PII and confidential terms
Draft a marketing email | ChatGPT / Claude | No sensitive data; benefits from cloud quality
Analyze HR performance reviews | Mistral 7B (local) | Employee data — legally sensitive in most jurisdictions
Brainstorm product names | ChatGPT / Gemini | No sensitivity; creative breadth matters more
Reformat internal financial tables | Phi-3 (local) | Financial data should not leave the organization
Research competitor landscape | Perplexity / ChatGPT | Needs real-time web access; no sensitive input
Write code handling user PII | Ollama with a code model (e.g., Code Llama) | Source code with data logic is proprietary
Task-to-tool routing guide for privacy-conscious AI use

Model Selection and Hardware Reality

Picking a model without knowing your hardware is like ordering a car without checking the garage. The GGUF quantized format — used by Ollama and LM Studio — compresses models so a 7B parameter model runs in roughly 4–5GB of RAM. A 13B model needs 8–10GB. If you are on a MacBook Pro M2 or M3 with 16GB unified memory, you can run 13B models smoothly. Windows machines need a dedicated GPU with VRAM, or they fall back to slower CPU inference.

For most professionals, Mistral 7B or Llama 3 8B hits the sweet spot: fast, capable, and fits on standard business laptops. Phi-3 Mini from Microsoft is remarkably capable at 3.8B parameters — designed specifically for efficiency on constrained hardware. If your organization has a shared server with a GPU, deploying Ollama server-side lets your whole team query a local model without any individual needing powerful hardware.

Model | Parameters | Min RAM (GGUF) | Best For | Runner
Phi-3 Mini | 3.8B | 3GB | Low-resource devices, quick tasks | Ollama, LM Studio
Mistral 7B | 7B | 5GB | Writing, summarization, instruction-following | Ollama, Jan
Llama 3 8B | 8B | 6GB | General professional use, reasoning | Ollama, LM Studio
Llama 3 70B | 70B | 48GB+ | Near-GPT-4 quality, needs server GPU | Ollama (server)
Code Llama 13B | 13B | 10GB | Code generation and review | Ollama, LM Studio
Local model quick-reference: hardware requirements and use cases

Open-Weight ≠ Fully Open Source

Meta's Llama 3 and Microsoft's Phi-3 are 'open-weight' — the model weights are free to download and run locally, but usage licenses restrict certain commercial applications. For most business use under 700 million monthly users (Llama's threshold), you are fully permitted. Always check the specific model license before enterprise deployment.

Verifying Your Privacy Setup Actually Works

Claiming a setup is private and verifying it are different things. The practical test is network monitoring: run your local model while capturing traffic with a free tool like Little Snitch (Mac) or GlassWire (Windows). If the AI runner makes zero outbound connections while processing your prompt, your data is staying local. Ollama by default binds to localhost (127.0.0.1) and makes no outbound calls during inference — but update checks happen separately, so watch for that distinction.
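If you prefer a scriptable check over a GUI monitor, the sketch below uses the psutil package to list any remote endpoints held by the runner process while it is generating. The process names are assumptions (they vary by platform and version), and some operating systems require elevated privileges to inspect another process's sockets.

```python
# Snapshot the remote connections of a local AI runner process.
# Run this while the model is generating a response.
# Loopback addresses (127.0.0.1) are expected; anything else is outbound.
import psutil

RUNNER_NAMES = {"ollama", "ollama_llama_server"}  # assumed names; adjust as needed

for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() in RUNNER_NAMES:
        try:
            remotes = [c.raddr for c in proc.connections(kind="inet") if c.raddr]
        except psutil.AccessDenied:
            print(f"{proc.info['name']} (pid {proc.pid}): need elevated privileges")
            continue
        print(f"{proc.info['name']} (pid {proc.pid}): "
              f"{remotes or 'no remote connections'}")
```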

For organizations handling regulated data (HIPAA, GDPR, SOC 2 scope), air-gapping the inference machine adds a second layer: no network card active means no accidental transmission regardless of software behavior. Document your local AI setup in writing — which model, which runner, which hardware, network status — because auditors and legal teams increasingly ask about AI data handling. A one-page configuration record protects you and your organization.

Update and Telemetry Risks

LM Studio and Ollama both check for updates by default, and desktop runners may also send anonymous telemetry. Neither involves prompt data, but both are outbound traffic. In high-security environments, disable auto-update in settings and block outbound connections from the runner application entirely. Run manual updates on a schedule instead. Assume any software with a network connection can be a vector until you verify otherwise.

Build Your Personal Local AI Privacy Audit Card

Goal: Produce a completed one-page Local AI Configuration Record that documents your setup, verifies network isolation, and defines your personal data routing policy — a document you can share with a manager, compliance team, or keep for your own reference.

1. Open a blank document (Word, Notion, or Google Docs) and title it 'Local AI Configuration Record — [Your Name] — [Date]'.
2. Record your hardware: device model, RAM, GPU/VRAM if applicable, and operating system.
3. List the local AI runner you installed (Ollama, LM Studio, or Jan) and its version number.
4. List the model(s) you have pulled, including the exact model name and parameter size (e.g., 'llama3:8b', 'mistral:7b-instruct'). If you run Ollama, the sketch after this list can pre-fill these model details along with your OS info.
5. Note your network status during inference: open the runner, run a test prompt ('Summarize: The meeting covered Q3 targets'), then check your network monitor to confirm zero external connections were made. Write 'Confirmed local-only' or note any exceptions.
6. Write a two-sentence data handling statement: what types of work you will route to local AI versus cloud AI, based on the task routing table above.
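If Ollama is your runner, a short script can pre-fill part of the record. The sketch below queries Ollama's /api/tags endpoint on the default port and assumes the requests package; RAM, GPU, and network-status fields still need to be filled in by hand.

```python
# Pre-fill part of the Local AI Configuration Record: OS details plus the
# models currently installed in a local Ollama instance.
import platform
from datetime import date
import requests

models = requests.get("http://localhost:11434/api/tags", timeout=10).json().get("models", [])

print(f"Local AI Configuration Record — {date.today()}")
print(f"OS: {platform.system()} {platform.release()} ({platform.machine()})")
print("Installed models:")
for m in models:
    print(f"  - {m['name']} (~{m.get('size', 0) / 1e9:.1f} GB on disk)")
print("Hardware (RAM / GPU): ____")
print("Network status during inference: ____")
```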

Local AI Cheat Sheet

  • Best runners: Ollama (CLI, developer-friendly), LM Studio (GUI, beginner-friendly), Jan (open-source GUI).
  • Best models for most professionals: Mistral 7B or Llama 3 8B on 8–16GB RAM machines.
  • GGUF quantized models run on CPU — no GPU required, just slower.
  • Sensitive data routing: HR, legal, financial, client PII → always local.
  • Verify privacy by monitoring network traffic during inference — zero outbound = confirmed local.
  • Disable telemetry and auto-updates in high-security environments.
  • Air-gap (no internet) = maximum protection for regulated data environments.
  • Document your setup: model name, runner version, hardware, network status.
  • Open-weight licenses (Llama 3, Phi-3) permit most business use — verify before large-scale enterprise deployment.

Key Takeaways

  1. Local models run entirely on your hardware — your prompts and outputs never leave your machine.
  2. Hardware determines model choice: match parameter size to your available RAM using the GGUF format.
  3. Ollama, LM Studio, and Jan make local deployment accessible without developer skills.
  4. Route sensitive tasks (PII, financials, legal, HR) to local models; use cloud models for complex reasoning and research.
  5. Verify your privacy setup with network monitoring — don't assume, confirm.
  6. Disable telemetry and update checks in regulated or high-security environments.
  7. Document your configuration: a one-page record satisfies auditors and clarifies your own policy.

Knowledge Check

You need to summarize 50 employee performance reviews to identify common themes. Which approach is most appropriate?

Your MacBook Pro M2 has 8GB of unified memory. Which model configuration is most realistic for smooth local inference?

How do you confirm that your local AI runner is genuinely not sending your prompts to external servers?

What does 'open-weight' mean in the context of models like Llama 3 and Phi-3?

A colleague says: 'I disabled the internet on my laptop while using LM Studio, so I don't need to worry about telemetry.' What is the most accurate assessment?
