AI and Machine Learning

Artificial Intelligence is the broad idea of machines doing things that seem intelligent. Machine Learning is the specific approach of learning from data. Deep Learning is ML where the model has many stacked layers.

Concept | Relationship
AI | Broad field: machines doing intelligent things
Machine Learning | Learning from data by adjusting weights
Neural Network | An ML model shaped like a layered network of nodes
Deep Learning | Neural network with many layers (deep = many floors)
LLM | Large Language Model: deep learning trained on text at massive scale

How Machine Learning Works

A machine learning model is a mathematical function with millions of adjustable numbers called weights. Training runs labelled examples through the model, measures how wrong the output is (the loss), then nudges the weights slightly in the right direction using gradient descent.

The free-throw analogy: You throw the ball, get feedback (“too hard”), adjust, repeat. After thousands of attempts your body dials in the right motion. ML works the same way — guess, measure error, adjust, repeat.

The dial panel analogy: Imagine thousands of dials all set to random values. Feed in a cat photo, get “dog” back — wrong. A process called backpropagation nudges each dial slightly toward the answer that would have produced “cat”. After a million examples, the dials settle into positions that reliably recognise cats.

Key insight: The model never understands anything. It finds statistical patterns in numbers. A model that recognises cats has never “seen” a cat — it has found regularities in pixel values.
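The guess, measure, adjust loop described above can be sketched in a few lines. This is a minimal illustration with a single weight and three made-up examples, not how any production framework is written; the function names, learning rate and step count are assumptions chosen for the example.

```python
# Fit y = w * x to examples generated with w = 2, starting from a bad guess.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, label) pairs


def loss(w, examples):
    """Mean squared error: how wrong the current weight is."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)


def train(examples, steps=200, learning_rate=0.05):
    w = 0.0  # the single "dial", starting from an arbitrary value
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= learning_rate * grad  # nudge the dial against the gradient
    return w


w = train(examples)
print(round(w, 3), round(loss(w, examples), 6))
```

A real model repeats the same loop over millions of weights at once, with backpropagation supplying the gradient for each one.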


Neural Networks and Layers

A neural network is ML where the dial panel is organised into stacked layers — like floors in an office building.

Input layer  →  Hidden layers (progressively abstract)  →  Output layer
  (pixels)       (edges → shapes → faces → "person")        (cat/dog)
  • Ground floor — raw input (pixels, tokens, audio samples)
  • Middle floors — progressively abstract patterns
  • Top floor — final answer

Why layers matter: a single layer can only learn simple relationships. With layers the network chains them:

  • Layer 1: “these pixel patterns form an edge”
  • Layer 2: “these edges form a pointy ear”
  • Layer 3: “pointy ear + whisker shape = cat”

No human programmed those rules — they emerge from training.
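A small sketch of why stacking helps, using XOR, the classic relationship that no single linear layer can represent. The weights below are hand-picked purely for illustration (an assumption of this example); in a real network they would come out of training rather than being written by a person.

```python
def relu(value):
    return max(0.0, value)


def two_layer_xor(x1, x2):
    # Hidden layer: two units, each a weighted sum followed by ReLU.
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    # Output layer: combines the two hidden features.
    return h1 - 2.0 * h2


for a in (0, 1):
    for b in (0, 1):
        print(a, b, two_layer_xor(a, b))
```

The second layer only works because the first layer has already turned the raw inputs into more useful features, which is the same chaining idea as edges, then ears, then cat.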


Language Models

History

Year | Milestone
1954 | Georgetown-IBM Experiment: 250-word, 49-sentence translation system
1980s | Statistical approaches
Early 2000s | Rise of neural networks
2013 | Word embeddings
2017 | Transformer architecture, the breakthrough
2018 | BERT (bidirectional encoder) + GPT (generative pre-trained transformer)
2020+ | GPT-3/4, Claude, Gemini: LLMs at massive scale

Foundation models

A Foundation Model is a versatile AI model trained on extensive, diverse data — adaptable for broad applications. Foundation models handle multiple mediums: text, audio, visual.

Transformer components:

  • Attention mechanism — identifies and focuses on important parts of input
  • Activation functions — decide how much information moves to next layer
  • Parameters — the adjustable dials/sliders that fine-tune the model
  • Loss function — gives the error score during training
  • Optimizers — tweak parameters for better performance

Neural network layers:

Layer type | What it learns
Groundwork | Basic concepts: lines, colours, shapes
Intermediate | Complex structures: faces, objects, syntax, grammar
Advanced | Abstract concepts: sentiment, sarcasm, reasoning

How an LLM answers a question — step by step

Example: “What is the capital of Latvia?”

1. Tokenisation — sentence split into chunks:

["What", " is", " the", " capital", " of", " Latvia", "?"]

2. Token → vector — each token maps to thousands of numbers encoding meaning. “Latvia” lands near “Estonia”, “Lithuania”, “Baltic” in that number space because they co-occurred in training text.

3. Attention — the model computes relationships between all tokens simultaneously. For “capital” it notices:

  • “What” → this is a question asking what something is
  • “Latvia” → the subject of the capital question

4. Pattern matching against training weights — the model never looks up a database. During training on billions of pages, it saw:

"...Riga, the capital of Latvia..."
"...Latvia's capital city, Riga..."

Those patterns adjusted the model’s weights so “capital of Latvia” now strongly predicts “Riga”.

5. Token prediction — the model outputs a probability distribution:

Token | Probability
Riga | 97.3%
Tallinn | 0.8%
Vilnius | 0.6%

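It picks “Riga”, appends it to the sequence, and repeats the prediction step for the next token until it emits an end-of-sequence token.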
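Step 5 can be sketched as a softmax over raw scores (logits). The scores below are hypothetical and the vocabulary has only three tokens, so the percentages will not match the table above; a real model scores every token in a vocabulary of tens of thousands.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtracting the max avoids overflow
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical logits for three candidate next tokens.
logits = {"Riga": 9.0, "Tallinn": 4.2, "Vilnius": 3.9}
probs = softmax(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```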

LLMs vs databases

LLMs are not databases. They don’t store and retrieve exact data. They generate responses based on patterns learned from training data — synthesising new content by predicting the most likely sequence of tokens. This is why they can hallucinate.

Key insight: “Riga” is baked into the model’s weights as a statistical pattern — which is also why models can confidently give wrong answers for obscure facts where training data was sparse or contradictory.

Hallucination: When a generative AI model produces inaccurate or irrelevant results that seem convincing. Not always easy to recognise.


AI Chatbots

Type | Description
Rule-based | If/then decision trees, pre-programmed flow, can’t detect synonyms
AI chatbots | NLP + ML, user types freely, learns from data, recognises intent

Components of an AI chatbot:

  • NLP (Natural Language Processing) — reads, understands, derives meaning from human language
  • Machine Learning — improves responses from collected data via a Knowledge Base
  • Intent recognition — understands the goal behind a query even when phrased differently

Industry applications: Customer service, HR, finance, marketing, sales, e-commerce (product recommendations), healthcare (appointment scheduling), insurance (automated quotes), manufacturing (supply queries).

Building chatbots:

  • Code: Python, JavaScript, Java
  • SDKs and frameworks: Microsoft Bot Framework, Dialogflow (Node.js client library), ChatterBot (Python)
  • APIs: OpenAI API, Google Chat API, Facebook Messenger API
  • No-code builders: Azure Cognitive Services, Microsoft Power Virtual Agents

Prompt Engineering

Crafting effective prompts is an art requiring experimentation. Every prompt has attributes — the same qualities that define any piece of writing.

Prompt attributes

Attribute | Description | Example
Format | Type of output | essay, list, blog post, tweet, code
Length | How long | “5-minute read”, “10-item list”, “500 words”
Audience | Who it’s for | “explain to a 10-year-old”, “for software developers”
Tone | Mood/style | formal, casual, empathetic, funny
Domain | Subject focus | health benefits, economic impacts, neurological
Perspective | Point of view | optimistic, neutral, pessimistic
Role/Persona | Who the AI acts as | “act as a marketing copywriter”, “act as a teacher”

Prompt techniques

Prompt chaining — treat it as a conversation, refining iteratively:

1. "Write marketing copy for a newsletter about the Eames Lounge Chair"
2. "Make it a 5-minute read"
3. "Remove the salutations and signature, and suggest images"

Flipping the role — get multiple perspectives:

"As a hiring manager, what do you look for in a leader?"
→ flip →
"As a leadership candidate, what concerns you about a new job?"

Shot-based prompting:

Type | Examples given | When to use
Zero-shot | None | Simple, clear tasks
One-shot | 1 | Orients the model toward format/style
Few-shot | 2–3 | Complex format requirements
Many-shot | Many | Precise output formatting
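One way to keep shot-based prompts consistent is to assemble them from a list of example pairs. The helper below is a hypothetical sketch (the `Input:`/`Output:` labels and the function name are assumptions, not a required format): an empty list gives a zero-shot prompt, one pair gives one-shot, and so on.

```python
def build_prompt(task, examples, query):
    """Assemble a zero-, one-, or few-shot prompt from (input, output) pairs."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


shots = [
    ("The chair arrived broken.", "negative"),
    ("Fast delivery, great quality.", "positive"),
]
print(build_prompt("Classify the sentiment of each review.", shots, "It was fine."))
```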

Advanced techniques:

  • Combine role + time period: “Imagine you’re a historian in 2150 looking back at social media’s impact”
  • Ask the model to quiz you with multiple-choice questions
  • Long, detailed prompts for nuanced answers in ambiguous domains

For image generation, specify: style (abstract/realistic), composition (rule of thirds, depth), colour scheme (monochromatic, complementary), subject (foreground/background), mood (eerie, serene). Avoid overspecification — leave room for creativity.

Prompt library

Keep a personal prompt library of what attributes and orderings give the best results for your common tasks.


Local LLMs vs cloud (2026 reality check)

From Eduard Ruzga’s “Local LLMs Are Finally Beating the Cloud! — But Are They?”:

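What’s actually happening: Consumer hardware (RTX 4090, M-series Macs) can run 70B+ parameter models in quantised form, although a 24 GB RTX 4090 needs very aggressive quantisation or CPU offload to do it. Benchmarks show local models beating cloud on specific coding tasks, but those benchmarks are cherry-picked.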

Dimension | Local LLMs | Cloud LLMs
Cost at scale | Near-zero marginal | Pay per token
Privacy | 100% local | Data sent to provider
Speed | Limited by VRAM | Provider-side scaling
Model quality | ~70B max practical | 1T+ parameter models
Setup complexity | High | Zero
Updates | Manual | Automatic

Verdict (2026): Local wins for privacy-sensitive tasks and budget-constrained high-volume use. Cloud wins for quality, reliability, and complex reasoning. Most serious developers use both.

“Local LLMs Are Finally Beating the Cloud! — But Are They?” — Eduard Ruzga (206 claps)

Dedicated local-inference hardware — NVIDIA RTX Spark

Rather than repurposing a gaming GPU, RTX Spark is a desktop box built specifically for local LLM inference:

Spec | Value
CPU | 20-core Arm
GPU | Blackwell architecture, 6,144 CUDA cores
Memory | 128GB unified (shared CPU/GPU)
Interconnect | NVLink
Peak throughput | Up to 1 petaflop
Practical capability | Runs a 120B-parameter model locally

128GB unified memory at this price point pushes the practical “what fits locally” ceiling well past the 70B class in the consumer-GPU table above — closer to mid-size frontier models than to the RTX 4090 generation.

Source: Pramod Chandrayan (in Predict) — “NVIDIA Just Put a 120-Billion-Parameter AI Model in Your Laptop. Here’s What That Actually Changes.” (2026-06-11)


Gemma 4 models (Google, April 2026)

Google released Gemma 4 family under Apache 2.0:

Model | Params | Note
Gemma 4 1B | 1B | Ultra-lightweight, on-device
Gemma 4 4B | 4B | Mobile/edge
Gemma 4 12B | 12B | Balanced
Gemma 4 26B | 26B | Disproportionately strong for its size; standout of the family

The 26B model tested notably above its weight class on coding and reasoning benchmarks.

“I Tested All 4 Gemma 4 Models: The 26B One Is Cheating (In the Best Way)” — Chew Loong Nian (174 claps)


LLM GPU fit tool

From Pawel’s “Stop Guessing Which LLMs Fit Your GPU”:

A community tool that calculates whether a given model fits in your GPU’s VRAM before you download it. Solves the common problem of downloading a 70B model only to find it needs 40GB VRAM but you have 8GB.

Inputs: model size (params), quantisation level (4-bit, 8-bit, fp16), available VRAM
Output: will it fit? how much headroom? recommended quantisation for your hardware

Quick VRAM reference:

Model size | Quantisation | VRAM needed
7B | 4-bit (GGUF Q4) | ~4 GB
7B | 8-bit | ~8 GB
13B | 4-bit | ~8 GB
30B | 4-bit | ~16 GB
70B | 4-bit | ~40 GB
70B | 2-bit | ~20 GB
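The reference figures above follow roughly from parameters × bits per weight ÷ 8, plus some headroom. The sketch below is a back-of-the-envelope estimate, not the tool's actual method: the 15% overhead factor is an assumption, and real usage also depends on context length and KV-cache size.

```python
def estimate_vram_gb(params_billions, bits_per_weight, overhead=1.15):
    """Weights-only estimate scaled by a rough overhead factor (an assumption)."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead


def fits(params_billions, bits_per_weight, vram_gb):
    return estimate_vram_gb(params_billions, bits_per_weight) <= vram_gb


for size, bits in [(7, 4), (7, 8), (13, 4), (30, 4), (70, 4), (70, 2)]:
    print(f"{size}B @ {bits}-bit: ~{estimate_vram_gb(size, bits):.1f} GB")

print("70B 4-bit on an 8 GB card:", fits(70, 4, 8))
```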

“Stop Guessing Which LLMs Fit Your GPU — There’s a Tool for That” — Pawel (209 claps)


Nemotron 3 Nano (NVIDIA, 2026)

NVIDIA’s small model designed for practical on-device deployment:

  • Optimised for inference efficiency over raw benchmark scores
  • Targets edge devices and resource-constrained environments
  • Positioned as the most practical small model for real workloads (not just benchmarks)
  • Apache 2.0 licence — fully open for commercial use

“Nemotron 3 Nano: Why This ‘Small’ Model Might Be the Most Practical AI You’ll Use” — Faisal haque (201 claps)


Claude Opus 4.7 (April 2026)

Released 2026-04-16 to the API, Bedrock (anthropic.claude-opus-4-7-v1:0), Vertex AI (claude-opus-4-7@20260416), and Microsoft Foundry. Same price as 4.6: $5/$25 per million input/output tokens.

Benchmark gains over Opus 4.6

BenchmarkOpus 4.6Opus 4.7DeltaGPT-5.4Gemini 3.1 Pro
SWE-bench Verified80.8%87.6%+6.880.6%
SWE-bench Pro (agentic coding)53.4%64.3%+10.957.7%54.2%
GPQA Diamond (grad reasoning)91.3%94.2%+2.994.4% Pro94.3%
MMMLU (multilingual)91.1%91.5%+0.492.6%
MCP-Atlas (tool use)75.8%77.3%+1.5
BrowseComp (agentic search)79.3%89.3% Pro
Anthropic internal 93-taskbaseline+13%

Headline claim: Opus 4.7 solves 4 tasks that Opus 4.6 categorically cannot solve at default settings (2 of them verified independently by Chew Loong Nian).

The default-effort swap (the real story)

The “nerfed Claude” complaints in March 2026 traced to a silent change on March 3 setting Opus 4.6’s default effort to medium (internal tag: effort 85). Fine for chat, disastrous for long agentic coding runs (loops, hallucinated imports, abandoned sessions).

Opus 4.7 silently flipped the default back to high. Setting effort: "high" on Opus 4.6 manually closes ~60% of the perceived 4.6→4.7 gap on long-horizon coding tasks. Most of the headline upgrade is the default change, not the model weights.

# Free upgrade for anyone still on 4.6:
client.messages.create(
    model="claude-opus-4-6",
    thinking={"type": "adaptive", "effort": "high"},  # was "medium" by default
    ...
)

Other changes

  • Adaptive thinking is now the only mode — model auto-scales compute per task
  • Manual extended thinking removed: thinking.budget_tokens is no longer settable on 4.7+
  • Vision improvements — chemical structures and technical diagrams (e.g. CPU pipeline forwarding paths) now correctly interpreted; not marketing fluff
  • Long-context consistency — 900KB-context retrieval task: 4.7 found all 17 callers vs 4.6@medium (14, with hallucinated paths) and 4.6@high (16)

Use-case guidance

Workload | Recommendation
Long agentic coding workflows | Switch to 4.7: the +10.9 on SWE-bench Pro is real and reproducible
On 4.6, tight budget | Set effort: "high" first; it closes most of the gap for free
Web-research / browse-heavy agents | Stay on GPT-5.4: a 10-point BrowseComp lead is not a rounding error
Vision (diagrams, chemistry, medical) | 4.7 is a meaningful upgrade
One-shot completions (<500 tokens out) | Either is fine; the gap shows up on long runs
Cost-sensitive chat | Sonnet 4.6 ($3/$15) is still the right pick; 4.7@high actually costs more per session than 4.6@medium at the same sticker price

Counterpoint

Alex Dunlop’s “Opus 4.7 Is The Worst Release Anthropic Has Ever Shipped” (5 min, 235 claps) argues the removal of manual thinking.budget_tokens is a step backward for power users who had tuned budgets for specific workloads — variability of adaptive mode trades predictability for average-case quality. Valid for batch pipelines where token budget is a hard constraint; less relevant for interactive coding.

Sources: Chew Loong Nian, “I Tested Claude Opus 4.7 vs 4.6 on 7 Real Tasks: The Default Setting Swap” (Towards AI, 2026-04-17); Alex Dunlop, “Opus 4.7 Is The Worst Release Anthropic Has Ever Shipped”


Qwen 30B on Raspberry Pi (April 2026)

From Sebastian Buzdugan’s “A 30B Qwen model runs in real time on a Raspberry Pi” (8 min, 1K claps):

A 30B parameter Qwen model achieves real-time inference on a Raspberry Pi — a significant milestone for edge AI. Why it matters:

  • Demonstrates that 30B models are becoming viable on sub-$100 hardware
  • Quantisation (likely 2-bit or 3-bit) enables it — heavy quality trade-off but functional
  • Changes the calculus for air-gapped / offline AI deployments
  • Raspberry Pi = deployable in environments where cloud is impractical (field sensors, kiosks, embedded systems)

VRAM table addition: at 2-3 bit quantisation, 30B model fits in ~8–12 GB — overlapping with consumer GPU territory, not just server hardware.


Qwen Instruct models (Alibaba) — the open coding workhorse

Qwen is Alibaba’s open-weight LLM family and the recurring “free model that matches Claude on coding” in the backlog. The thing to understand first is what “Instruct” now means in Qwen.

Instruct vs Thinking — a deliberate split, not just “chat-tuned”

The original Qwen3 (April 2025) shipped hybrid checkpoints that toggled thinking on/off in one model. Alibaba abandoned the hybrid approach — it dragged down benchmark quality — and now trains two separate lines:

Aspect | Instruct | Thinking
Optimised for | chatbots, OCR/extraction, direct answers, low latency | math, STEM, code, multi-step reasoning
Behaviour | answers directly, no extended reasoning trace | explicit reasoning before answering
Speed / cost | fast, predictable, cheaper | deeper but slower and pricier
Sampling defaults | temp 0.7 / top_p 0.8 | temp 1.0 / top_p 0.95

The split paid off immediately: Qwen3-235B-A22B-Instruct-2507 posted a ~2.8× AIME25 jump over the April hybrid release. Rule of thumb: reasoning-heavy work (code, math) benefits most from Thinking; for direct generation/extraction the Instruct variant is near-equal and much cheaper. Alibaba says hybrid may return once the quality regression is solved.

Instruct lineup (Qwen3 → 3.5)

  • Dense Instruct: 0.6B, 1.7B, 4B, 8B, 14B, 32B — all 128K context, tool calling, structured output.
  • MoE Instruct: 30B-A3B (3B active), 235B-A22B (22B active). MoE = total params for capacity, active params for inference cost.
  • Qwen3.5 (Feb 2026): scales to ~397B total / 17B active, 201 languages, claimed 8.6×–19× throughput gain over the prior generation.

Coder-Instruct — the variants the backlog keeps citing

  • Qwen3-Coder-480B-A35B-Instruct — flagship open agentic coder (35B active). 256K context native → 1M with extrapolation. SOTA among open models on agentic coding / browser-use / tool-use, reported comparable to Claude Sonnet 4.
  • Qwen3-Coder-30B-A3B-Instruct — the run-it-locally one; 50.3% Pass@1 on SWE-bench Verified. This is the model behind the vault’s “run Claude Code locally on a Mac with a 4-bit Qwen3.6-27B” backlog items — at 2-4 bit quant it lands in consumer-GPU / Apple-Silicon territory (see GPU-and-Hardware-for-AI).
  • Qwen3-Coder-Next — newer technical-report entry continuing the coder line.

Why it matters here

Qwen is the practical answer to “Claude-Code but local / zero marginal cost”: the 30B-A3B Coder-Instruct runs on a high-RAM Mac or a single consumer GPU and is routinely benchmarked against Opus/GPT on real coding tasks. Compare with Gemma 4 (the other local-model camp). Caveat from the vault’s hardware notes: hosted Qwen via Ollama Cloud has been unreliable (high failure rates on Qwen3.5) — the value is in self-hosting.

Sources (web research, 2026-06-27): Qwen3-Coder blog; Qwen3-Coder GitHub; Qwen3 full lineup guide 2026; The Register — Alibaba drops hybrid thinking; Fireworks — Qwen3 Instruct vs Thinking vs Coder; Best Qwen models 2026


AI as pattern matching — the developer mental model

“AI is software that uses statistical patterns, learned from data, to perform tasks that traditionally required human judgment.”

The key shift: deterministic → probabilistic. Traditional code says “if X then Y because I told it to.” AI says “when I see X, it’s probably Y because that’s what patterns suggest.” This is why fraud detection is 94% accurate rather than 100% correct — and why that’s not a bug.

What AI is not doing: understanding, reasoning, or knowing. A language model that writes Shakespeare-quality prose has no understanding of narrative or emotion — it found patterns in vast amounts of text. This explains both the impressive capabilities and the bizarre failures:

  • Can write human-like text → learned those patterns from training data
  • Fails at counting letters in “strawberry” → counting is computation, not pattern matching
  • Confidently states wrong facts → those word patterns were statistically likely in training
  • Fails when an image is rotated 45° → different pixel patterns = different input

The evolving taxonomy: ML, deep learning, neural networks, transformers are all techniques within AI — not competing approaches. They all do the same fundamental thing: learn patterns from data. Expert systems (rule-based, 1980s) were once called “AI”; today they’re just code.


Training vs inference — the key operational split

Aspect | Training | Inference
What | Model learns patterns from labelled examples | Learned patterns applied to new data
When | Once per model version, offline | Every API call, in production
Cost | Millions of dollars (large models), weeks of GPU time | Fast, relatively cheap
Who does it | Model providers (OpenAI, Anthropic, Google) | Everyone who calls the API

Most developers only deal with inference. You call a pre-trained model’s API — the patterns were already learned.

Understanding this explains:

  • Why models don’t improve from your production data unless you retrain
  • Why “teaching it your use case on the fly” doesn’t work
  • Why retraining is a big deal (cost, time)
  • Why data quality at training time determines production quality forever

Fine-tuning / transfer learning: take a model with general patterns, teach it more specific ones for your use case. Much cheaper than training from scratch. Still bound by the same limitations — pattern matching, data quality dependency.


Tokens and context windows

A token is roughly 4 characters of English text: a word or part of a word (“understanding” = “under” + “standing” = 2 tokens). Models process everything as tokens, not words or sentences.

Context window = how many tokens the model can process at once (its “working memory”). The model literally cannot see past this limit.

Era | Context window
Early GPT-3 | ~2K tokens (GPT-3.5: ~4K)
GPT-4 | 8K–32K tokens
Claude 2 / Claude 3+ | 100K–200K+ tokens

Why this matters in production:

  • Long documents get truncated
  • Long conversations cause the model to “forget” early context
  • Context usage = cost (every token in the window costs money)
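A rough sketch of how “forgetting” happens in practice: the application, not the model, decides which messages still fit. The 4-characters-per-token figure is only the heuristic from this section (real counts come from the model's own tokenizer), and the function names and the reply reserve are assumptions for illustration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough estimate only; real counts come from the model's tokenizer."""
    return max(1, -(-len(text) // chars_per_token))  # ceiling division


def trim_history(messages, context_window, reserve_for_reply):
    """Keep the most recent messages that fit; older ones are dropped."""
    budget = context_window - reserve_for_reply
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept)), used


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
kept, used = trim_history(history, context_window=30, reserve_for_reply=10)
print(len(kept), used)
```

Here the oldest message is silently dropped, which is exactly what the user experiences as the model forgetting early context.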

Embeddings — how text becomes numbers

Everything AI processes — text, images, audio — must become numbers. The word “king” becomes a list of hundreds of numbers (a vector). “Queen” becomes a different vector.

These numbers capture semantic relationships: KING - MAN + WOMAN ≈ QUEEN. This works because the numerical representations capture patterns of co-occurrence in training data.

Practical implications:

  • Search finds related concepts, not just exact keyword matches — “car” also retrieves “automobile”
  • Similar meaning → similar vectors → similar search results
  • Vector databases (Pinecone, Weaviate) store and query these embeddings efficiently
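The analogy arithmetic can be tried on toy vectors. The three-dimensional vectors below are hand-made for illustration (the dimensions loosely stand for royalty, male, female); real embeddings are learned from data and have hundreds of dimensions, so results there are only approximate.

```python
import math

# Hand-made toy vectors: (royalty, male, female). Not learned embeddings.
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
    "princess": [0.8, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(target, exclude=()):
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


# king - man + woman, computed element by element.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude=("king", "man", "woman")))
```

A vector database does the same nearest-neighbour lookup, just over millions of stored vectors with an index to keep it fast.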

Temperature and sampling parameters

Controls how generative models (LLMs) make decisions — how deterministic vs creative the output is:

Parameter | Effect
Temperature = 0 | Near-deterministic: the same input gives the same output almost every time
Temperature = 1 | Creative: the same input gives varied phrasings each time
top-k | Only consider the K most probable next tokens (e.g. top-k=50)
top-p | Only consider the smallest set of most probable tokens whose cumulative probability reaches P (e.g. top-p=0.9)

Practical guidance:

  • Customer service bot → low temperature (0–0.2): consistent, reliable responses
  • Creative writing assistant → higher temperature (0.7–1.0): variety and surprise
  • Code generation → low temperature: deterministic, fewer hallucinations
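A minimal sketch of how these parameters reshape the next-token distribution, assuming temperature is applied first, then top-k, then top-p. The order, the function names and the example logits are assumptions for illustration; individual providers may combine the filters differently.

```python
import math
import random


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then top-k, then top-p; return renormalised probabilities."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        return {max(logits, key=logits.get): 1.0}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the K most probable tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:  # stop once the nucleus covers P
                break
        ranked = kept
    norm = sum(prob for _, prob in ranked)
    return {token: prob / norm for token, prob in ranked}


def sample(distribution, rng=random):
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]


logits = {"Riga": 9.0, "Tallinn": 4.2, "Vilnius": 3.9}  # hypothetical scores
print(next_token_distribution(logits, temperature=0))
print(next_token_distribution(logits, temperature=5.0))
print(sample(next_token_distribution(logits, top_p=0.9)))
```

Raising the temperature flattens the distribution, so unlikely tokens get picked more often; the top-k and top-p filters then cut off the long tail before sampling.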

Confidence scores ≠ accuracy

A model saying it’s “99% confident” does NOT mean it’s 99% likely to be correct. It means the pattern strongly matches what it learned as that class.

If the model learned wrong patterns, it can be very confident and very wrong.

Threshold decisions belong to you, not the AI:

  • Model outputs: “75% probability this transaction is fraud”
  • You decide: block at 70%? 85%? Higher threshold = fewer false positives but more missed fraud
  • These are ethical and business decisions — the AI only gives probabilities

Calibration: test model confidence against actual outcomes before trusting in production. A model “90% confident” should be right ~90% of the time — if it’s actually right 70% of the time, all downstream risk assessments are broken.
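A calibration check can be as simple as grouping held-out predictions by stated confidence and comparing against how often they were right. The sketch below uses made-up results and a hypothetical helper name; rounding to one decimal place is just one simple way to form the groups.

```python
from collections import defaultdict


def calibration_report(predictions):
    """predictions: (confidence, was_correct) pairs from a labelled hold-out set."""
    buckets = defaultdict(list)
    for confidence, was_correct in predictions:
        buckets[round(confidence, 1)].append(was_correct)
    # For each confidence level: (observed accuracy, number of samples).
    return {level: (sum(hits) / len(hits), len(hits))
            for level, hits in sorted(buckets.items())}


# Hypothetical results: ten predictions made at 90% confidence, seven correct.
results = [(0.9, True)] * 7 + [(0.9, False)] * 3
print(calibration_report(results))
```

A well-calibrated model would show observed accuracy close to each confidence level; a gap like 0.9 stated versus 0.7 observed means thresholds built on the raw confidence need adjusting.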


Error compounding in AI pipelines

When you chain multiple AI models (agents, RAG pipelines, multi-step workflows), errors multiply:

3 models, each 90% accurate:
0.9 × 0.9 × 0.9 = 0.729  →  27% error rate

5 models, each 90% accurate:
0.9^5 = 0.59  →  41% error rate

This is why multi-agent systems need careful design — uncertainty compounds at every step. A final output that’s “5 steps deep” can be largely random even if each individual model performs well.
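The arithmetic above generalises to any chain of steps. The sketch assumes each step's errors are independent and that one wrong step spoils the final output, which is the same simplification the figures above rely on; the function name is hypothetical.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Chance that every step is right, assuming independent errors."""
    return math.prod(step_accuracies)


for steps in (3, 5):
    accuracy = pipeline_accuracy([0.9] * steps)
    print(f"{steps} steps: {accuracy:.3f} accurate, {1 - accuracy:.0%} error rate")
```

Plugging in the measured accuracy of each real step shows quickly which link is worth improving first, since the weakest step dominates the product.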


Preprocessing requirements

Models are extremely picky about input format:

Input type | Requirement
Images | Exact resolution (224×224 means exactly 224×224), correct color channels (RGB vs grayscale), normalised pixel values (0–1 or -1 to +1); one pixel off = failure
Text | Correct encoding, within context window
All | Data must match the distribution of training data; a mismatch is the most common production failure

Data representativeness > data correctness: A model trained on perfectly scanned documents will fail on phone photos with coffee stains. The data isn’t garbage — it’s just not representative.


Explainability problem

Most models are black boxes — they output a decision, not an explanation. Tools like SHAP/LIME approximate explanations after the fact (“income was the main factor”) but these are approximations, not true causal explanations.

When explainability is required: regulated industries (credit, healthcare, insurance, legal) often require documented reasoning. If your use case mandates explanation, AI may be the wrong tool. Design for this constraint before building.


Feedback loops

AI decisions create data that influences future AI decisions — which can reinforce biases:

  • Recommendation systems: show users what they seem to like → they click it → system learns they like it → shows more → echo chamber
  • Credit models: reject applicants with certain profiles → never see if they’d have been good customers → keep rejecting similar profiles → entire segments locked out

Loops are everywhere. Models can actively make things worse by reinforcing their own biases. Plan to detect and break loops at design time.


Production realities

What works reliably

  • High-volume classification with stable patterns — fraud detection, spam filtering, content moderation
  • Recommendation — correlation at scale; Netflix doesn’t need to understand why you like dystopian fiction, just that you fit a pattern
  • Document processing — OCR on standard forms, invoice extraction — key word: consistent format and representative training data

What doesn’t work reliably

  • Novel reasoning — “reset my password” (pattern) vs “here’s my unique situation” (reasoning required)
  • Guaranteed accuracy — probabilistic systems cannot guarantee deterministic outcomes
  • Self-correction — AI doesn’t learn from production mistakes; retrain to fix

Monitoring AI in production

Traditional uptime monitoring is insufficient — a model can be up and returning responses while being completely wrong.

Metric | What it catches
Prediction drift | Production data patterns diverging from training data
Accuracy decay | Performance degrading over time
P95/P99 latency | AI inference spikes (P99 spike = 1 in 100 users waits 10× longer)
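To see why tail percentiles matter, here is a small nearest-rank percentile sketch over made-up latency samples. Monitoring systems usually compute this for you; the function and the sample data are illustrative assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [120] * 98 + [1200] * 2
for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(latencies_ms, pct)} ms")
```

The median and P95 look healthy here while P99 exposes the 10× slow requests, which is why a dashboard that only shows averages can miss inference spikes.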

Deployment strategies

  • Shadow mode — run new model alongside old, compare results before switching
  • Canary deployment — route 1% of traffic to new model, watch for issues
  • Never swap full production on passing tests alone — behaviour changes may only appear at scale

Cost management

Strategy | Impact
Temperature = 0 | More predictable output, so fewer retries; the per-token price itself does not change
Token limits on API calls | Hard cap on per-call cost
Cache repeated queries | One e-commerce site cut costs 50% caching product description enhancements
Batch vs real-time | Batch is ~10× cheaper but adds latency; most teams start real-time, hit bills, redesign for batch
Use smaller models for lower-stakes tasks | Simple classification → cheap model; high-stakes decisions → best model

Indirect costs often exceed AI costs: data pipelines, storage, human review of edge cases, monitoring infrastructure.

Security concerns specific to AI

  • Prompt injection — manipulating prompts to leak information or bypass safeguards
  • Model extraction — competitors reverse-engineering your fine-tuned model through careful queries
  • PII leakage — models can accidentally reproduce private data from training sets (GDPR implications)

Agent harness components (2026)

From Divy Yadav’s “7 Agent Harness Components Every AI Developer Needs to Build Reliable AI Agents” (13 min, 298 claps) and Yanli Liu’s “Harness Engineering: What Every AI Engineer Needs to Know in 2026” (22 min, 698 claps):

A harness is the infrastructure layer that wraps an LLM to make it production-reliable. Raw LLM calls are not enough — agents need scaffolding to handle failures, state, and coordination.

Three architectural camps (Yanli Liu):

Camp | Approach | Best for
Prompt-centric | Rich prompts + few tools; minimal orchestration | Simple, single-step tasks
Tool-centric | Heavy tool use; model selects and chains tools | Multi-step retrieval/action workflows
Agent-centric | Agents orchestrate other agents; full multi-agent | Complex, long-horizon tasks

7 harness components (Divy Yadav):

Component | Role
State management | Track conversation, task progress, and intermediate results across turns
Tool registry | Discoverable catalogue of tools with schemas the agent can query
Retry + fallback logic | Handle transient failures without losing task state
Memory layer | Short-term (in-context), long-term (vector/file), working (scratchpad)
Observability | Trace every tool call, token count, and decision for debugging and cost tracking
Guardrails | Input/output validation, content filtering, loop detection
Handoff protocol | How agents pass work to each other: structured output schema + acknowledgement
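As one illustration of the retry + fallback row, the sketch below retries a primary model with exponential backoff and then hands the same request to a fallback. It is not taken from either article: the function names, the stand-in models and the backoff values are assumptions.

```python
import time


def call_with_retry(primary, fallback, request, max_attempts=3, base_delay=0.5):
    """Try `primary` with exponential backoff, then give the same request to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return primary(request)
        except Exception:  # in practice, catch only errors known to be transient
            time.sleep(base_delay * (2 ** attempt))
    return fallback(request)


attempts = {"count": 0}


def flaky_model(request):
    """Stand-in primary model that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("transient failure")
    return f"primary handled {request}"


def backup_model(request):
    """Stand-in fallback, for example a smaller or secondary model."""
    return f"fallback handled {request}"


result = call_with_retry(flaky_model, backup_model, "summarise ticket 42", base_delay=0)
print(result)
```

Because the request object is passed through unchanged, task state survives both the retries and the switch to the fallback.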

Key insight: the harness is what distinguishes a demo agent from a production agent. Most agent failures in production are harness failures, not LLM failures — the model is fine but the scaffolding doesn’t handle edge cases.

See also Harness engineering for the two-agent (Planner/Executor) pattern and Three generations of agent orchestration for the adversarial critic pattern.


World models — Yann LeCun’s anti-LLM bet (AMI Labs, 2026)

LeCun left Meta in November 2025 to found AMI Labs (Advanced Machine Intelligence Labs) in Paris. Seed: $1.03B at $3.5B pre-money / $4.5B post-money, Europe’s largest seed round ever (Crunchbase). Backers include Bezos Expeditions, Eric Schmidt, Mark Cuban, Jim Breyer, Tim & Rosemary Berners-Lee, Xavier Niel.

Thesis (NVIDIA GTC 2026): “LLMs are too limiting. Scaling them up will not allow us to reach AGI.” LeCun calls LLMs “an offramp on the path to AGI” — physical understanding requires latent-state prediction from video/sensor data, not autoregressive token prediction on text. AMI’s target verticals are healthcare, robotics, industrial process control, automation, wearables, transportation — sectors where reliability and physical grounding matter more than fluent text.

Three shipping artifacts in 60 days, all built on JEPA (Joint Embedding Predictive Architecture):

Project | Date | Headline result | Repo
AMI Labs | 2026-03-10 | $1.03B seed announcement |
V-JEPA 2.1 (arXiv 2603.14482) | 2026-03-16 | Dense-feature video model. 77.3% top-1 on Something-Something v2; 39.7 R@5 Epic-Kitchens-100 SOTA. V-JEPA 2-AC trained on <62 hours of unlabeled robot video deployed zero-shot on Franka arms in two labs for image-goal pick-and-place. | facebookresearch/vjepa2 (Apache 2.0, 3.7k★)
LeWorldModel (LeWM) (arXiv 2603.19312) | 2026-03-27 | First JEPA that trains stably end-to-end from raw pixels with only two losses (next-embedding prediction + Gaussian latent regularization), with no stop-gradient/EMA tricks. ~15M params, single GPU, ~few hours, up to 48× faster planning than foundation-model-based world models. | lucas-maes/le-wm

Why it matters in 2026:

  1. LLM scaling curve is visibly bending — GPT-5.5 doubled price (30 per M tokens) on April 23, 2026; DeepSeek V4-Pro (1.6T params) is still 0.2 pts behind Claude Opus 4.6 on SWE-bench Verified.
  2. David Silver (AlphaGo lead) raised **1B+ within 8 weeks: this is now a cohort, not a contrarian opinion.
  3. Robotics demand is real revenue — V-JEPA 2’s zero-shot Franka result is what factory-floor automation has been asking for since 2022.

Counter-argument: GPT-5.5 hits 88.7% SWE-bench Verified, 82.7% Terminal-Bench 2.0; LLMs ship customer-facing capability today. JEPA has no comparable revenue-generating artifact yet. The scaling thesis may be wrong about AGI but right about $100B revenue lines this decade.

Run V-JEPA 2 yourself (Linux/WSL — decord blocks macOS):

git clone https://github.com/facebookresearch/vjepa2.git
cd vjepa2 && conda create -n vjepa2-312 python=3.12 -y
conda activate vjepa2-312 && pip install .

Then, from Python:

import torch
# Load the action-conditioned encoder and predictor via torch.hub
encoder, ac_predictor = torch.hub.load(
    'facebookresearch/vjepa2', 'vjepa2_ac_vit_giant'
)
preprocessor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')

These are the same weights used for the zero-shot Franka pick-and-place demo. Model sizes: ViT-Large, ViT-Huge, ViT-Giant, ViT-Giant-384.

See also AI-Agents for harness/orchestration patterns; world models change what “the model” is but agents still need scaffolding.

AMI Labs is private. The $1.03B seed was a venture round (Cathay Innovation, Greycroft, Hiro Capital, HV Capital, Bezos Expeditions, plus angels). No ticker, no public shares. Access is limited to LPs in those VC funds or a future secondary round.

Closest public-market proxies for the world-models / anti-LLM-scaling thesis:

Angle | Tickers
Robotics arms / industrial automation (V-JEPA’s target) | ABB, FANUY, ISRG, ROK, SYM
Humanoid / mobile robotics exposure | TSLA (Optimus), NVDA (Isaac/GR00T), GOOGL (DeepMind robotics)
Compute substrate either thesis still needs | NVDA, AMD, AVGO, TSM
Schmidt-style “AI-physical-world” plays | Public via ETFs: BOTZ, ROBO, ARKQ

Caveat: none of these are AMI — they just benefit if the JEPA/world-model thesis pans out. If LeCun’s bet works, the upside is captured privately first; public markets see it via downstream robotics/compute revenue years later. Not investment advice

See also

  • Claude-Code — Claude is an LLM; understanding prompting makes you more effective with it
  • Python — common language for ML/AI development
  • Databases-NoSQL — vector databases used in AI applications
  • AI-Agents — multi-agent patterns and error compounding in practice