<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Limited Intelligence: Engineering]]></title><description><![CDATA[Posta around engineering, computer science, AI/ML and much more...]]></description><link>https://limitedintelligence.substack.com/s/technical</link><image><url>https://substackcdn.com/image/fetch/$s_!GWza!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2d3d3d-0b69-4ede-af41-e07792d3d4c0_240x240.png</url><title>Limited Intelligence: Engineering</title><link>https://limitedintelligence.substack.com/s/technical</link></image><generator>Substack</generator><lastBuildDate>Mon, 27 Apr 2026 04:06:27 GMT</lastBuildDate><atom:link href="https://limitedintelligence.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[João Paulo Vieira da Silva]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[limitedintelligence@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[limitedintelligence@substack.com]]></itunes:email><itunes:name><![CDATA[João Silva]]></itunes:name></itunes:owner><itunes:author><![CDATA[João Silva]]></itunes:author><googleplay:owner><![CDATA[limitedintelligence@substack.com]]></googleplay:owner><googleplay:email><![CDATA[limitedintelligence@substack.com]]></googleplay:email><googleplay:author><![CDATA[João Silva]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Agentic Singularity]]></title><description><![CDATA[Architecting the Orchestration Layer for Autonomous Execution]]></description><link>https://limitedintelligence.substack.com/p/the-agentic-singularity</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-agentic-singularity</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 24 Apr 2026 13:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lXDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lXDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lXDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 424w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 848w, 
https://substackcdn.com/image/fetch/$s_!lXDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1272w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What Is an AI Agent? The Future Explained&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What Is an AI Agent? The Future Explained" title="What Is an AI Agent? The Future Explained" srcset="https://substackcdn.com/image/fetch/$s_!lXDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 424w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 848w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1272w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" 
stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In 2026, the &#8220;Chatbot&#8221; is a legacy term. It belongs to the era of 2023, where we were impressed that a machine could write a cover letter. Today, if your AI doesn&#8217;t have a &#8220;hand&#8221; to move a mouse or an API key to execute a trade, it&#8217;s essentially a very expensive paperweight.</p><p>We have transitioned from <strong>Generative AI</strong> to <strong>Agentic AI</strong>. The technical delta between the two is massive. Generative AI is a stateless prediction engine; Agentic AI is a stateful, goal-oriented system. This week, the trending discourse isn&#8217;t about how many parameters a model has, but how it manages its <strong>Orchestration Layer</strong>.</p><p>To build a functional agent in 2026, you aren&#8217;t just calling an LLM. You are building a complex feedback loop.</p><h4>1. Recursive Planning (System 2 Reasoning)</h4><p>Early models used &#8220;Chain of Thought&#8221; (CoT) as a prompting trick. Today, CoT is baked into the architecture. We see this in the latest iterations of <strong>Reasoning-Native Kernels</strong>.</p><ul><li><p><strong>The Workflow:</strong> When a goal is received (e.g., &#8220;Analyze this 10-K and execute a hedge strategy&#8221;), the agent doesn&#8217;t start typing. It initiates a <strong>Sub-goal Decomposition</strong>.</p></li><li><p><strong>The Scratchpad:</strong> Modern agents maintain a hidden &#8220;latent scratchpad&#8221; where they simulate different execution paths before committing to an action. This reduces &#8220;hallucination-in-action,&#8221; which was the primary killer of 2025-era agents.</p></li></ul><h4>2. The Model Context Protocol (MCP) and Tool-Augmented Generation</h4><p>The most significant technical trend this week is the maturation of <strong>MCP</strong>. We&#8217;ve finally moved past the &#8220;Plugin&#8221; mess. MCP provides a standardized interface for models to talk to the world.</p><ul><li><p><strong>Standardized Schemas:</strong> Whether the agent is talking to a SQL database or a robotic arm, it uses the same protocol.</p></li><li><p><strong>Dynamic Discovery:</strong> Agents can now &#8220;poll&#8221; an environment to see what tools are available. If you give an agent access to a new GitHub repo, it can read the README, understand the functions, and begin using them without a human having to write a &#8220;system prompt&#8221; explaining the API.</p></li></ul><h4>3. Long-term Memory and State Persistence</h4><p>The &#8220;Context Window&#8221; wars are over. We won. But 10-million-token windows are useless if the model can&#8217;t find the needle in the haystack. 
<h4>2. The Model Context Protocol (MCP) and Tool-Augmented Generation</h4><p>The most significant technical trend this week is the maturation of <strong>MCP</strong>. We&#8217;ve finally moved past the &#8220;Plugin&#8221; mess. MCP provides a standardized interface for models to talk to the world.</p><ul><li><p><strong>Standardized Schemas:</strong> Whether the agent is talking to a SQL database or a robotic arm, it uses the same protocol.</p></li><li><p><strong>Dynamic Discovery:</strong> Agents can now &#8220;poll&#8221; an environment to see what tools are available. If you give an agent access to a new GitHub repo, it can read the README, understand the functions, and begin using them without a human having to write a &#8220;system prompt&#8221; explaining the API.</p></li></ul><h4>3. Long-term Memory and State Persistence</h4><p>The &#8220;Context Window&#8221; wars are over. We won. But 10-million-token windows are useless if the model can&#8217;t find the needle in the haystack. The trend now is <strong>Semantic State Management</strong>.</p><ul><li><p><strong>Vectorized Ephemeral Memory:</strong> Agents now use a tiered memory system (see the sketch after this list):</p><ol><li><p><strong>L1 (Working):</strong> The immediate context window.</p></li><li><p><strong>L2 (Short-term):</strong> A high-speed cache of the last 50 steps in a workflow.</p></li><li><p><strong>L3 (Long-term):</strong> A RAG-based (Retrieval-Augmented Generation) archive of all past interactions, indexed by &#8220;importance&#8221; scores rather than just chronological order.</p></li></ol></li></ul>
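<p>A rough sketch of that tiered layout in Go. The plain importance sort below stands in for a real vector index, and names like <code>TieredMemory</code> are illustrative, not from any specific agent framework:</p><pre><code>package main

import (
    "fmt"
    "sort"
)

// Record is one remembered step plus an importance score,
// mirroring the L3 idea of importance-ranked retrieval.
type Record struct {
    Text       string
    Importance float64
}

// TieredMemory is a toy version of the L1/L2/L3 layout above.
type TieredMemory struct {
    Working []string // L1: immediate context window
    Recent  []string // L2: cache of the last N workflow steps
    Archive []Record // L3: scored long-term store
}

const recentCap = 50

func (m *TieredMemory) Observe(step string, importance float64) {
    m.Working = append(m.Working, step)
    m.Recent = append(m.Recent, step)
    if len(m.Recent) &gt; recentCap { // evict the oldest beyond the cap
        m.Recent = m.Recent[1:]
    }
    m.Archive = append(m.Archive, Record{step, importance})
}

// Recall returns the k most important archived records; a real
// system would use vector similarity instead of a global sort.
func (m *TieredMemory) Recall(k int) []Record {
    sort.Slice(m.Archive, func(i, j int) bool {
        return m.Archive[i].Importance &gt; m.Archive[j].Importance
    })
    if k &gt; len(m.Archive) {
        k = len(m.Archive)
    }
    return m.Archive[:k]
}

func main() {
    var m TieredMemory
    m.Observe("opened ticket", 0.2)
    m.Observe("user approved budget", 0.9)
    fmt.Println(m.Recall(1)) // [{user approved budget 0.9}]
}</code></pre>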
<p>As we give agents more autonomy, we encounter a new technical challenge: <strong>Agentic Drift</strong>. This is where an agent, in the process of solving a complex, multi-day task, slowly veers away from the original constraints.</p><p>Technical journals are currently buzzing with <strong>Constraint-Satisfaction-Checking (CSC)</strong>. This is a secondary, highly quantized model that runs in parallel, acting as a &#8220;referee&#8221; to ensure the main agent doesn&#8217;t violate safety or budget parameters.</p>
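<p>Because the referee is just a parallel process watching a stream of proposed actions, it maps naturally onto message passing. A toy Go version, with a single hypothetical budget constraint standing in for a real CSC model:</p><pre><code>package main

import "fmt"

// Action is a proposed agent step with an estimated cost.
type Action struct {
    Name    string
    Cost    float64
    Verdict chan bool // the referee answers here
}

// referee runs in parallel with the main agent and enforces a
// hard budget, vetoing any action that would exceed it.
func referee(proposals &lt;-chan Action, budget float64) {
    spent := 0.0
    for a := range proposals {
        ok := spent+a.Cost &lt;= budget
        if ok {
            spent += a.Cost
        }
        a.Verdict &lt;- ok
    }
}

func main() {
    proposals := make(chan Action)
    go referee(proposals, 10.0)

    for _, a := range []Action{
        {"fetch filings", 4.0, make(chan bool)},
        {"run full backtest", 8.0, make(chan bool)}, // would blow the budget
    } {
        proposals &lt;- a
        fmt.Println(a.Name, "approved:", &lt;-a.Verdict)
    }
    close(proposals)
}</code></pre>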
<p>If you are still thinking about AI as a &#8220;text-in, text-out&#8221; system, you are falling behind. The value in 2026 lies in the <strong>Orchestration</strong>. It&#8217;s about how you handle retries when an API fails, how you manage the state across a three-day autonomous task, and how you ensure the agent knows when to stop and ask for human permission.</p>]]></content:encoded></item><item><title><![CDATA[The Efficiency Paradox]]></title><description><![CDATA[Why Small Language Models (SLMs) Are Winning the Enterprise War]]></description><link>https://limitedintelligence.substack.com/p/the-efficiency-paradox</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-efficiency-paradox</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 23 Apr 2026 13:03:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZNbg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1dd62-9952-412d-9483-47e987460925_1000x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ZNbg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1dd62-9952-412d-9483-47e987460925_1000x400.jpeg" width="1000" height="400" alt="How SLMs are redesigning academic intelligence | Software" title="How SLMs are redesigning academic intelligence | Software"></figure></div><p>For half a decade, the AI industry followed a predictable path: add more GPUs, add more data, get a smarter model. But in 2026, we have hit the &#8220;Energy Wall.&#8221; Training a model 10x larger than GPT-4 doesn&#8217;t yield 10x more intelligence. It yields a 10x higher electricity bill.</p><p>This week&#8217;s technical trend is the <strong>Rise of the Specialist</strong>. We are seeing a massive migration away from &#8220;God-models&#8221; toward <strong>Small Language Models (SLMs)</strong> that are hyper-optimized for specific domains.</p><p>How can a 7B-parameter model in 2026 match a 175B model from 2023? The answer lies in <strong>Knowledge Distillation</strong> and <strong>Curated Synthetic Data</strong>.</p><h4>1. The Student-Teacher Framework</h4><p>We are now using the &#8220;massive&#8221; models&#8212;the ones that take a small country&#8217;s power grid to run&#8212;as &#8220;Teachers.&#8221; They generate millions of high-quality reasoning chains. These &#8220;gold-standard&#8221; examples are then used to train the &#8220;Student&#8221; (the SLM).</p><ul><li><p><strong>The Result:</strong> The SLM doesn&#8217;t have to learn how to speak English from scratch by reading the messy internet. It learns purely from the &#8220;refined logic&#8221; of the teacher model. This allows it to punch way above its weight class in reasoning capability.</p></li></ul><h4>2. Specialized Loss Functions</h4><p>In 2026, we aren&#8217;t just using standard cross-entropy loss. We are using <strong>Task-Specific Loss Functions</strong>. If you are building an SLM for medical diagnosis, the model is penalized more heavily for a &#8220;false negative&#8221; than for a formatting error. This &#8220;weighted intelligence&#8221; makes SLMs safer and more reliable for production than general-purpose LLMs.</p>
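<p>The core mechanic is simple enough to show in a few lines. A toy weighted cross-entropy in Go, where a per-class weight scales the usual penalty term; this is an illustration of the idea, not the API of any training framework:</p><pre><code>package main

import (
    "fmt"
    "math"
)

// weightedCrossEntropy scales the standard -log(p[target]) loss by a
// per-class weight, so mistakes on critical classes cost more.
func weightedCrossEntropy(probs []float64, target int, weight []float64) float64 {
    return -weight[target] * math.Log(probs[target])
}

func main() {
    probs := []float64{0.7, 0.2, 0.1}   // model's predicted distribution
    weights := []float64{1.0, 5.0, 1.0} // class 1 (a "false negative") is 5x costlier
    fmt.Println(weightedCrossEntropy(probs, 1, weights)) // ~8.05
}</code></pre>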
<p>The real reason SLMs are trending is the hardware. Every laptop and smartphone sold in 2026 has a dedicated <strong>NPU (Neural Processing Unit)</strong>.</p><ul><li><p><strong>Privacy by Default:</strong> Because an SLM can fit on a device, we are seeing a &#8220;Privacy Renaissance.&#8221; Companies are no longer sending sensitive IP to a cloud provider&#8217;s API. They are running a specialized 3B-parameter model locally on the engineer&#8217;s machine.</p></li><li><p><strong>Zero Latency:</strong> When the model is on your local bus, latency is measured in microseconds, not seconds. This has enabled the &#8220;Live Interaction&#8221; era&#8212;AI that responds to your voice or screen state in real time without the &#8220;thinking&#8221; pause.</p></li></ul><p>The &#8220;smart&#8221; engineering teams are no longer choosing one model. They are building <strong>Ensembles of Specialists</strong>.</p><ul><li><p><strong>The Router Architecture:</strong> A tiny, extremely fast model (~100M parameters) acts as a traffic cop. It listens to the user&#8217;s request (see the sketch after this list).</p><ul><li><p><em>Code request?</em> Send it to the Code-SLM.</p></li><li><p><em>Legal question?</em> Send it to the Law-SLM.</p></li><li><p><em>Silly joke?</em> Send it to the general-purpose &#8220;cheap&#8221; model.</p></li></ul></li></ul>
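<p>A toy version of that traffic cop in Go. A keyword heuristic stands in here for the real small classifier model, and the SLM names are placeholders:</p><pre><code>package main

import (
    "fmt"
    "strings"
)

// route picks which specialist should handle a request; a real
// router would be a tiny classifier model, not string matching.
func route(request string) string {
    r := strings.ToLower(request)
    switch {
    case strings.Contains(r, "compile") || strings.Contains(r, "func"):
        return "code-slm"
    case strings.Contains(r, "clause") || strings.Contains(r, "contract"):
        return "law-slm"
    default:
        return "general-slm" // the cheap fallback
    }
}

func main() {
    for _, r := range []string{
        "why won't this func compile?",
        "review the indemnity clause",
        "tell me a silly joke",
    } {
        fmt.Printf("%-34q -&gt; %s\n", r, route(r))
    }
}</code></pre>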
<p>This modularity allows companies to swap out &#8220;specialists&#8221; as better ones become available, rather than being locked into one massive, monolithic provider.</p><p>The era of the &#8220;Generalist&#8221; is ending. In 2026, the competitive advantage belongs to those who can fine-tune small, efficient models on proprietary, high-quality data. The &#8220;Scale-Only&#8221; doctrine is dead; long live the &#8220;Precision&#8221; doctrine.</p>]]></content:encoded></item><item><title><![CDATA[Understanding Communicating Sequential Processes (CSP)]]></title><description><![CDATA[In the landscape of modern software development, concurrency is no longer a luxury&#8212;it is a survival requirement.]]></description><link>https://limitedintelligence.substack.com/p/understanding-communicating-sequential</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/understanding-communicating-sequential</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 22 Apr 2026 13:02:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DGio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!DGio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png" width="1456" height="971" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1193176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/194729175?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DGio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DGio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DGio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DGio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the landscape of modern software development, concurrency is no longer a luxury&#8212;it is a survival requirement. As processor speeds have plateaued and we have transitioned into the era of many-core architectures, the burden of performance has shifted from the hardware engineer to the software architect. 
<p>Enter <strong>Communicating Sequential Processes (CSP)</strong>. While it sounds like a mouthful of academic jargon, CSP is a formal language and a philosophical approach to concurrency that prioritizes clarity, safety, and predictability. By shifting the focus from &#8220;protecting data&#8221; to &#8220;moving data,&#8221; CSP offers a blueprint for building massive, scalable systems without losing one&#8217;s mind to the chaos of shared state.</p><p>To understand CSP, we must look back to 1978, when <strong>Tony Hoare</strong> published his seminal paper. Before this, concurrency was largely handled through shared memory. Imagine a single whiteboard in a busy office where ten people are trying to write their own schedules at the same time. To keep things orderly, you would need a &#8220;lock&#8221; (a physical guard) who only lets one person touch the whiteboard at a time. If the guard falls asleep or two people grab the same marker, the system collapses.</p><p>Hoare proposed a radical alternative: What if, instead of a shared whiteboard, every person had their own notebook? If they needed to coordinate, they would pass a physical note to one another. This &#8220;message passing&#8221; is the heartbeat of CSP.</p><p>The mantra of the CSP approach can be summarized in a single, transformative sentence:</p><blockquote><p><strong>&#8220;Do not communicate by sharing memory; instead, share memory by communicating.&#8221;</strong></p></blockquote><p>CSP relies on two fundamental abstractions that work in tandem to create a harmonious system.</p><p>In CSP, a &#8220;Process&#8221; is a self-contained logic unit that executes sequentially. It doesn&#8217;t care about the outside world except when it needs to send or receive information. Because it is sequential, the developer can reason about it just like a standard, single-threaded program. There are no hidden side effects from other threads creeping in to change its variables.</p><p>Channels are the conduits through which processes interact. Think of a channel as a specialized pipe. One process drops a piece of data into one end, and another process pulls it out of the other. The channel handles the heavy lifting of synchronization, ensuring that the data transfer happens safely.</p><p>One of the most elegant aspects of pure CSP is the concept of the <strong>Rendezvous</strong>. By default, communication over a channel is synchronous.</p><ul><li><p>If a <strong>Sender</strong> wants to send data, it blocks and waits until a <strong>Receiver</strong> is ready to take it.</p></li><li><p>If a <strong>Receiver</strong> wants to get data, it blocks and waits until a <strong>Sender</strong> is ready to provide it.</p></li></ul><p>When both are present at the channel, the data transfer occurs, and both continue on their merry way. This &#8220;handshake&#8221; eliminates the need for manual locks or semaphores. The communication <em>is</em> the synchronization.</p>
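<p>Go&#8217;s channels are a direct descendant of this model (the mantra above is, in fact, a Go proverb), so an unbuffered Go channel demonstrates the rendezvous exactly:</p><pre><code>package main

import (
    "fmt"
    "time"
)

func main() {
    // An unbuffered channel: send and receive must rendezvous.
    notes := make(chan string)

    go func() {
        fmt.Println("sender: waiting for a receiver...")
        notes &lt;- "meet at 10:00" // blocks until main is ready to receive
        fmt.Println("sender: note handed over, moving on")
    }()

    time.Sleep(500 * time.Millisecond) // the receiver shows up late
    fmt.Println("receiver got:", &lt;-notes)
    time.Sleep(50 * time.Millisecond) // let the sender's last line print
}</code></pre>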
<p>To appreciate why CSP has gained so much traction in high-performance systems, we must compare it to the traditional &#8220;Shared Memory&#8221; model.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nIYQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac853164-582b-413f-8098-d97cf979396d_1992x356.png" width="1456" height="260" alt="Comparison of the CSP and Shared Memory models"></figure></div><p>A system where processes just sit and wait for a single channel would be quite rigid. CSP introduces the concept of a <strong>Choice</strong> (often implemented as a <code>select</code> statement). This allows a process to monitor multiple channels at once.</p><p>The process essentially says, &#8220;I am ready to talk to anyone who is ready to talk to me.&#8221; If three different channels have data available, the process picks one (often pseudo-randomly or based on priority) and executes that specific logic. This enables the creation of highly responsive systems that can handle timeouts, cancellations, and multi-source data streams gracefully.</p>
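<p>In Go, the Choice construct is literally the <code>select</code> statement. A small sketch with two sources and a timeout arm:</p><pre><code>package main

import (
    "fmt"
    "time"
)

func main() {
    fast := make(chan string)
    slow := make(chan string)
    go func() { time.Sleep(10 * time.Millisecond); fast &lt;- "fast result" }()
    go func() { time.Sleep(time.Second); slow &lt;- "slow result" }()

    // "Ready to talk to anyone who is ready to talk to me":
    // whichever channel is ready first wins; the timeout arm keeps
    // the process responsive if neither source ever delivers.
    select {
    case msg := &lt;-fast:
        fmt.Println(msg)
    case msg := &lt;-slow:
        fmt.Println(msg)
    case &lt;-time.After(100 * time.Millisecond):
        fmt.Println("timed out")
    }
}</code></pre>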
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>A system where processes just sit and wait for a single channel would be quite rigid. CSP introduces the concept of a <strong>Choice</strong> (often implemented as a <code>select</code> statement). This allows a process to monitor multiple channels at once.</p><p>The process essentially says, &#8220;I am ready to talk to anyone who is ready to talk to me.&#8221; If three different channels have data available, the process picks one (often pseudo-randomly or based on priority) and executes that specific logic. This enables the creation of highly responsive systems that can handle timeouts, cancellations, and multi-source data streams gracefully.</p><p>Once you have processes and channels, you can assemble them into sophisticated architectural patterns:</p><ul><li><p><strong>Pipelines:</strong> Much like a factory assembly line, one process performs Step A and passes the result to a channel; the next process performs Step B, and so on. This maximizes throughput.</p></li><li><p><strong>Fan-out:</strong> A single producer sends tasks to a channel, and multiple worker processes pull from that same channel to process data in parallel.</p></li><li><p><strong>Fan-in:</strong> Multiple processes send their results into a single &#8220;aggregator&#8221; channel that collects and reports the data.</p></li></ul><p>For those who enjoy the rigor of formal logic, CSP is actually a member of the <strong>Process Calculus</strong> family. It uses algebraic notations to prove that a system is free from certain types of errors. While most developers don&#8217;t write out the mathematical proofs, they benefit from the &#8220;mathematical cleanliness&#8221; of the model. It ensures that if you follow the rules of the protocol, the system remains mathematically sound.</p><p>While CSP solves the &#8220;Race Condition&#8221; (where two threads change data at the same time), it does not automatically solve every problem. Developers must still be wary of:</p><ol><li><p><strong>Deadlock:</strong> If Process A is waiting for Process B, and Process B is waiting for Process A, the system grinds to a halt.</p></li><li><p><strong>Livelock:</strong> Processes are so busy responding to each other that they never actually get any &#8220;real&#8221; work done.</p></li><li><p><strong>Channel Leaks:</strong> If a process creates a channel but never closes it or stops listening, memory can slowly bleed away.</p></li></ol><p>We are living in an era of distributed systems and microservices. Interestingly, the way microservices communicate over a network (via APIs or Message Queues) is essentially CSP on a macro scale. By adopting CSP within a single application, developers can use the same mental model for their internal code as they do for their entire cloud infrastructure.</p><p>It promotes a <strong>decoupled architecture</strong>. Because processes only know about the channels they hold, they don&#8217;t need to know anything about the internal state of other processes. This makes code more modular, easier to test, and significantly more resilient to change.</p><p>Communicating Sequential Processes is more than just a technical implementation; it is a shift in perspective. 
<p>For those who enjoy the rigor of formal logic, CSP is actually a member of the <strong>Process Calculus</strong> family. It uses algebraic notations to prove that a system is free from certain types of errors. While most developers don&#8217;t write out the mathematical proofs, they benefit from the &#8220;mathematical cleanliness&#8221; of the model. It ensures that if you follow the rules of the protocol, the system remains mathematically sound.</p><p>While CSP solves the &#8220;Race Condition&#8221; (where two threads change data at the same time), it does not automatically solve every problem. Developers must still be wary of:</p><ol><li><p><strong>Deadlock:</strong> If Process A is waiting for Process B, and Process B is waiting for Process A, the system grinds to a halt.</p></li><li><p><strong>Livelock:</strong> Processes are so busy responding to each other that they never actually get any &#8220;real&#8221; work done.</p></li><li><p><strong>Channel Leaks:</strong> If a process creates a channel but never closes it or stops listening, memory can slowly bleed away.</p></li></ol><p>We are living in an era of distributed systems and microservices. Interestingly, the way microservices communicate over a network (via APIs or Message Queues) is essentially CSP on a macro scale. By adopting CSP within a single application, developers can use the same mental model for their internal code as they do for their entire cloud infrastructure.</p><p>It promotes a <strong>decoupled architecture</strong>. Because processes only know about the channels they hold, they don&#8217;t need to know anything about the internal state of other processes. This makes code more modular, easier to test, and significantly more resilient to change.</p><p>Communicating Sequential Processes is more than just a technical implementation; it is a shift in perspective. It moves us away from the &#8220;God-object&#8221; pattern, where one giant block of memory is poked and prodded by a thousand different fingers, toward a &#8220;Society of Specialists&#8221; who collaborate through clear, defined communication.</p><p>By embracing the channel and the sequential process, we stop fighting the nature of multi-core hardware and start working with it. Whether you are building a high-frequency trading platform, a real-time chat application, or a simple web server, the principles of CSP provide a stable foundation in an increasingly concurrent world.</p>]]></content:encoded></item><item><title><![CDATA[Openskills.sh]]></title><description><![CDATA[Marketplace for Agent Skills]]></description><link>https://limitedintelligence.substack.com/p/openskillssh</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/openskillssh</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 21 Apr 2026 13:02:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ejbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" width="1456" height="795" alt=""></figure></div>
src="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:656785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/194708422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 424w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 848w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The transition from &#8220;Chatbots&#8221; to &#8220;Agents&#8221; is the defining shift of the current AI era. 
<p>Enter <strong>Agent Skills</strong> and <strong>openskills.sh</strong>. This ecosystem represents the first successful attempt to standardize how we package, discover, and execute AI capabilities. It&#8217;s essentially the &#8220;npm for AI agents,&#8221; transforming loose prompts and scripts into portable, versioned, and sandboxed modules.</p><p>Before the Agent Skills standard emerged, we primarily used <strong>Function Calling</strong>. You would provide a JSON schema to an LLM, and it would output a JSON object to trigger a local function. While effective, it had three major flaws:</p><ol><li><p><strong>Token Bloat:</strong> You had to cram every tool definition into the system prompt. If your agent had 50 tools, it might spend 5,000 tokens just &#8220;remembering&#8221; what it could do before you even asked a question.</p></li><li><p><strong>Maintenance Hell:</strong> Tool definitions were often buried in code (Python or TypeScript). Non-developers couldn&#8217;t easily tweak the instructions an agent used to perform a specific task.</p></li><li><p><strong>Non-Portability:</strong> A tool written for a LangChain agent couldn&#8217;t easily be dropped into a Cursor IDE session or a Claude Code terminal.</p></li></ol><p><strong>Agent Skills</strong> solve this by treating a capability as a <strong>static asset</strong>&#8212;a folder containing a <code>SKILL.md</code> file and any necessary supporting scripts.</p><p>A &#8220;Skill&#8221; is a structured directory. It is designed to be human-readable and agent-optimized. The heart of any skill is the <code>SKILL.md</code> file, which follows a specific specification:</p><p>At the top of every skill is a <strong>YAML frontmatter</strong> block for metadata, followed by Markdown instructions for the LLM.</p><pre><code>---
name: kubernetes-ops
description: Specialized skill for managing K8s clusters, pods, and deployments.
version: 1.2.0
author: dev-ops-team
allowed_tools: [shell, read_file, write_file]
---

# Kubernetes Operations Instructions
When the user asks to "check the cluster health" or "restart a service," 
follow this protocol:
1. List all pods in the namespace using `kubectl get pods`.
2. Check for any pods with a status other than 'Running'.
3. If a pod is in 'CrashLoopBackOff', fetch the logs using `kubectl logs &lt;pod-name&gt;`.
...
</code></pre><p>LLMs are native speakers of Markdown. By using a <code>.md</code> file, we allow the agent to &#8220;read&#8221; the manual only when it&#8217;s relevant. This leads us to the most important concept in the <code>openskills.sh</code> ecosystem: <strong>Progressive Disclosure</strong>.</p><p>In a traditional setup, the agent is like a student forced to memorize the entire library before the exam. With Agent Skills, the agent is given a <strong>catalog</strong> (see the sketch after this list):</p><ol><li><p><strong>Discovery:</strong> The agent is given a list of skill names and descriptions (the YAML frontmatter).</p></li><li><p><strong>Invocation:</strong> When a user says, &#8220;Fix my Kubernetes deployment,&#8221; the agent realizes the <code>kubernetes-ops</code> skill is relevant.</p></li><li><p><strong>Loading:</strong> The agent uses a tool (like <code>npx openskills read</code>) to pull the full content of <code>SKILL.md</code> into its context.</p></li></ol><p>This keeps the base system prompt lean and allows agents to scale to hundreds of specialized skills without losing their &#8220;reasoning&#8221; headroom to overhead.</p>
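<p>A rough Go sketch of the discovery-then-load flow. The naive frontmatter split below stands in for a real YAML parser, and nothing here is the actual <code>openskills</code> CLI internals; it is only the shape of the idea:</p><pre><code>package main

import (
    "fmt"
    "os"
    "strings"
)

// catalogEntry is all the agent sees up front: frontmatter
// metadata, never the full instruction body.
type catalogEntry struct{ Name, Description string }

// splitFrontmatter separates the YAML block from the Markdown body.
func splitFrontmatter(raw string) (meta, body string, err error) {
    parts := strings.SplitN(raw, "---", 3)
    if len(parts) &lt; 3 {
        return "", "", fmt.Errorf("no frontmatter block found")
    }
    return parts[1], strings.TrimSpace(parts[2]), nil
}

// discover implements the cheap catalog phase: read only the metadata.
func discover(path string) (catalogEntry, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return catalogEntry{}, err
    }
    meta, _, err := splitFrontmatter(string(raw))
    if err != nil {
        return catalogEntry{}, err
    }
    var e catalogEntry
    for _, line := range strings.Split(meta, "\n") {
        if v, ok := strings.CutPrefix(line, "name:"); ok {
            e.Name = strings.TrimSpace(v)
        } else if v, ok := strings.CutPrefix(line, "description:"); ok {
            e.Description = strings.TrimSpace(v)
        }
    }
    return e, nil
}

func main() {
    e, err := discover("kubernetes-ops/SKILL.md")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Only after the skill is judged relevant would the full body
    // be pulled into context (the Loading step above).
    fmt.Printf("catalog: %s - %s\n", e.Name, e.Description)
}</code></pre>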
<p><strong>openskills.sh</strong> is the hub for this entire movement. It serves three primary roles: the <strong>Standard</strong>, the <strong>Registry</strong>, and the <strong>Runtime</strong>.</p><p>Much like <code>npmjs.com</code> or <code>crates.io</code>, <code>openskills.sh</code> provides a centralized marketplace where developers can publish skills. If you need a skill to interact with the Jira API, handle complex PDF parsing, or manage AWS Lambda functions, you don&#8217;t write it from scratch&#8212;you install it.</p><p>The <code>openskills</code> CLI tool allows for universal installation across any agent environment.</p><ul><li><p><code>npx openskills install anthropics/web-search</code></p></li><li><p><code>npx openskills install my-org/internal-db-query --private</code></p></li></ul><p>Perhaps most critically, <code>openskills</code> provides a <strong>sandboxed execution environment</strong>. When an agent invokes a skill that requires running a Python script or a Shell command, <code>openskills</code> ensures that code runs in a restricted container, protecting your local machine or server.</p><p>One of the biggest risks of autonomous agents is &#8220;Prompt Injection&#8221; leading to &#8220;Remote Code Execution&#8221; (RCE). If an agent reads a malicious file and decides to run <code>rm -rf /</code>, a standard shell tool would comply.</p><p><code>openskills.sh</code> implements a <strong>Dual Sandbox Architecture</strong>:</p><ol><li><p><strong>Native Sandboxing (OS-Level):</strong> On macOS, it uses <strong>Seatbelt</strong> (the sandbox technology behind the App Store). On Linux, it uses <strong>Landlock</strong>. This restricts the scripts&#8217; access to only specific directories and blocks network access unless explicitly granted in the skill&#8217;s metadata.</p></li><li><p><strong>WASM/WASI Sandboxing (Experimental):</strong> For cross-platform safety, skills can be compiled into WebAssembly. This provides a completely isolated memory space and a capability-based security model.</p></li></ol><p>There is often confusion between Anthropic&#8217;s <strong>MCP</strong> and <strong>Agent Skills</strong>. While they both aim for interoperability, they solve different problems:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!XDR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394fd20-1ded-45bb-a5af-ddcdf430b2d0_1194x378.png" width="1194" height="378" alt="Comparison of MCP and Agent Skills"></figure></div>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The Synergy:</strong> Most modern agentic workflows use <strong>MCP</strong> for <em>data</em> (the &#8220;What&#8221;) and <strong>Agent Skills</strong> for <em>process</em> (the &#8220;How&#8221;).</p><p>To create a skill and share it via <code>openskills.sh</code>, you follow a modular workflow. Let&#8217;s build a &#8220;Security Auditor&#8221; skill.</p><pre><code><code>security-auditor/
&#9500;&#9472;&#9472; SKILL.md
&#9500;&#9472;&#9472; scripts/
&#9474;   &#9492;&#9472;&#9472; scan_ports.py
&#9492;&#9472;&#9472; resources/
    &#9492;&#9472;&#9472; common_vulnerabilities.json
</code></pre><p>In <code>SKILL.md</code>, you define exactly how the agent should behave when it finds an open port.</p><p>You can use the CLI to &#8220;dry run&#8221; how an agent sees your skill:</p><p><code>npx openskills read ./security-auditor</code></p><p>Once ready, you can push your skill to a GitHub repo or the <code>openskills.sh</code> registry. Because the standard is open, anyone using <strong>Cursor</strong>, <strong>Windsurf</strong>, or <strong>Claude Code</strong> can immediately &#8220;equip&#8221; your skill.</p><p>As we move toward multi-agent systems, <code>openskills.sh</code> becomes the &#8220;language&#8221; of handoffs. An orchestrator agent can query a directory of skills, see that &#8220;Agent B&#8221; has the <code>stripe-billing</code> skill, and delegate the task with full confidence that the instructions and safety guards are pre-defined.</p><p>We are entering an era of <strong>Composable Intelligence</strong>. Instead of building one giant &#8220;god-model&#8221; that knows everything, we are building specialized, tiny experts that can be shared, versioned, and audited.</p><p>If you are a developer, <code>openskills.sh</code> is your way to ensure your code is &#8220;agent-ready.&#8221; If you are an enterprise, it is your way to enforce safety and consistency across your AI workforce.</p><p>The wall between &#8220;writing code&#8221; and &#8220;prompting AI&#8221; is dissolving. <strong>Skills</strong> are the glue that holds these two worlds together.</p>]]></content:encoded></item>
]]></content:encoded></item><item><title><![CDATA[The Sliding Window Strategy in LLM Training]]></title><description><![CDATA[In the era of Generative AI, the &#8220;secret sauce&#8221; of a high-performing Large Language Model (LLM) isn&#8217;t just the number of parameters or the quality of the GPU cluster; it&#8217;s the way the model consumes information.]]></description><link>https://limitedintelligence.substack.com/p/the-sliding-window-strategy-in-llm</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-sliding-window-strategy-in-llm</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 20 Apr 2026 13:02:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3Mg7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff745dd72-8e03-4a2b-9600-5225885a12c3_783x475.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3Mg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff745dd72-8e03-4a2b-9600-5225885a12c3_783x475.png" width="783" height="475" alt="LLM in a flash: Efficient LLM Inference with Limited Memory" title="LLM in a flash: Efficient LLM Inference with Limited Memory"></figure></div>
<p>In the era of Generative AI, the &#8220;secret sauce&#8221; of a high-performing Large Language Model (LLM) isn&#8217;t just the number of parameters or the quality of the GPU cluster; it&#8217;s the way the model consumes information. Training a Transformer-based model is essentially an exercise in pattern recognition across massive datasets. However, these models have a fundamental limitation: the <strong>context window</strong>.</p><p>When your dataset consists of billion-word corpora&#8212;books, code repositories, and long-form articles&#8212;you cannot simply &#8220;feed&#8221; an entire document into the model at once. You must sample it.
Among the various techniques used to bridge the gap between massive datasets and finite context windows, the <strong>Sliding Window</strong> approach stands out as a critical strategy for maintaining semantic continuity and maximizing data utility.</p><p>To understand why we need sliding windows, we first have to look at the architecture of a Transformer. Most modern LLMs utilize a fixed context length. Whether it&#8217;s 2,048 tokens (like early GPT-3) or 128,000 tokens (like modern Claude or GPT-4 iterations), the model has a &#8220;hard limit&#8221; on how many tokens it can process in a single forward pass.</p><p>If you have a document with 10,000 tokens and your model has a context window of 1,000 tokens, you have a problem. How do you slice that document?</p><p>The simplest method is to cut the document into non-overlapping blocks of 1,000 tokens:</p><ul><li><p>Block 1: Tokens 1 to 1,000</p></li><li><p>Block 2: Tokens 1,001 to 2,000</p></li><li><p>&#8230;and so on.</p></li></ul><p><strong>The Problem:</strong> This creates &#8220;boundary artifacts.&#8221; If a crucial piece of information (like the subject of a sentence) is at token 999, and the verb is at token 1,002, the model will never see them together. The model loses the &#8220;flow&#8221; of the text, and the transitions between blocks become blind spots.</p><p>The sliding window technique solves this by introducing <strong>overlap</strong>. Instead of jumping exactly one window length forward, the sampling window moves forward by a smaller step, known as the <strong>stride</strong> (<em>S</em>).</p><p>The sliding window is defined by two primary hyperparameters:</p><ol><li><p><strong>Window Size (W):</strong> The total number of tokens the model can see at once (the context length).</p></li><li><p><strong>Stride (S):</strong> The number of tokens the window moves forward after each sample.</p></li></ol><p>If we have a document of length <em>D</em>, the number of samples <em>N</em> we can extract is roughly:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = \\frac{D - W}{S} + 1&quot;,&quot;id&quot;:&quot;GOTUXBJAVU&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>If <strong>S = W</strong>, we have non-overlapping blocks (simple truncation).</p></li><li><p>If <strong>S &lt; W</strong>, we have overlapping blocks (the sliding window).</p></li><li><p>If <strong>S = 1</strong>, we have a &#8220;maximal&#8221; sliding window where every token eventually appears at every possible position within a window (extremely computationally expensive).</p></li></ul><p>Language is not modular. Ideas, arguments, and narratives flow across token boundaries. By using a sliding window with a stride smaller than the window size, we ensure that every token&#8212;and more importantly, every <em>relationship</em> between tokens&#8212;is captured in multiple contexts. This allows the model to learn how to handle preceding context more effectively, as it sees the same information appearing at the end of one window and the beginning of the next.</p><p>In the world of Deep Learning, more data is usually better. Sliding windows act as a form of text-based data augmentation. By shifting the window by a few tokens, you create a &#8220;new&#8221; training example for the model. Even though the tokens are the same, their <strong>positional encodings</strong> change.</p>
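<p>As a quick sanity check of the formula, here is a minimal sketch (toy numbers of my choosing, not from any real pipeline) that enumerates the window start positions for a given stride:</p><pre><code># Toy check of N = (D - W) / S + 1
D, W, S = 10_000, 1_000, 500  # document length, window size, stride

starts = list(range(0, D - W + 1, S))
print(len(starts))   # 19 == (10_000 - 1_000) // 500 + 1
print(starts[:3])    # [0, 500, 1000]: adjacent windows share 500 tokens</code></pre>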
<p>In a Transformer, the model&#8217;s understanding of a token is heavily influenced by its position relative to others. Seeing the word &#8220;Quantum&#8221; at index 10 vs. index 500 helps the model become more robust to positional variations.</p><p>While a sliding window doesn&#8217;t technically increase the model&#8217;s physical context limit, it improves the model&#8217;s ability to &#8220;stitch&#8221; ideas together during inference. If the model was trained on overlapping samples, it becomes more adept at transitioning between chunks of text when generating long-form content.</p><p>Implementing a sliding window in a data pipeline (like a PyTorch <code>Dataset</code> or the Hugging Face <code>datasets</code> library) requires balancing memory efficiency with speed. A common approach is to tokenize each document once and then slice windows on demand:</p><ol><li><p>Load a long document.</p></li><li><p>Tokenize the entire document into a large array.</p></li><li><p>Use a pointer to slice the array: <code>tokens[i : i + window_size]</code>.</p></li><li><p>Increment <code>i</code> by the stride <em>S</em>.</p></li></ol><pre><code>import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    def __init__(self, tokens, window_size, stride):
        self.tokens = tokens
        self.window_size = window_size
        self.stride = stride
        self.samples = []

        # Pre-calculate the start index of every window
        for i in range(0, len(tokens) - window_size + 1, stride):
            self.samples.append(i)

    def __len__(self):
        # Required by DataLoader: the number of windows
        return len(self.samples)

    def __getitem__(self, idx):
        start = self.samples[idx]
        # Return a tensor so windows can be batched by a DataLoader
        return torch.tensor(self.tokens[start : start + self.window_size])
</code></pre><p>While the sliding window sounds like a &#8220;free win,&#8221; it comes with significant computational costs.</p><p>If you set a stride of <code>S = W/2</code> (50% overlap), you are essentially doubling the size of your training data. This means:</p><ul><li><p>2x more forward/backward passes.</p></li><li><p>2x more energy consumption.</p></li><li><p>2x more time to reach the same number of &#8220;epochs&#8221; over the raw text.</p></li></ul><p>In large-scale pre-training (like Llama 3 or GPT-4), compute is the most expensive resource. Engineers often choose a stride that is very close to the window size ($S \approx 0.9W$) to minimize redundancy while still providing enough overlap to smooth out boundary issues.</p><p>If the stride is too small, the model sees the same sequences over and over again. This can lead to <strong>overfitting</strong> on specific phrases or patterns found in the training data, rather than learning general linguistic rules.</p><p>If you pre-process your data into sliding window chunks and save them to disk (e.g., as <code>.bin</code> or <code>.jsonl</code> files), the storage requirements can explode. A 1TB dataset could easily become 5TB if a high-overlap sliding window is applied. Most modern pipelines perform the windowing &#8220;just-in-time&#8221; during the data loading phase to save disk space.</p><p>Some researchers use a dynamic stride based on the content. For example, if a document is identified as &#8220;high quality&#8221; (like a textbook), the stride might be smaller to ensure the model learns every nuance. For &#8220;lower quality&#8221; data (like web scrapes), a larger stride might be used to move through the data quickly.</p><p>It&#8217;s important to distinguish sliding windows from <strong>Packing</strong>. Packing is the practice of concatenating multiple <em>short</em> documents together to fill a single context window, separated by an <code>&lt;EOS&gt;</code> (end-of-sequence) token.</p><ul><li><p><strong>Sliding Window:</strong> Used to break down <em>one long</em> document.</p></li><li><p><strong>Packing:</strong> Used to combine <em>many short</em> documents.</p></li></ul><p>In a production-grade pipeline, these two techniques are often used together. You might slide through a long book, and if the last window of that book is only half-full, you &#8220;pack&#8221; the beginning of the next document into the remaining space.</p>
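<p>To make the distinction concrete, here is a toy sketch of packing (my own illustration, assuming already-tokenized documents and an integer <code>EOS</code> id; real pipelines also handle the leftover tail rather than dropping it):</p><pre><code># Pack many short tokenized documents into fixed-size windows,
# separating documents with an EOS token.
EOS = 0
WINDOW = 8

def pack(docs, window=WINDOW, eos=EOS):
    stream, windows = [], []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)  # document boundary marker
    for i in range(0, len(stream) - window + 1, window):
        windows.append(stream[i : i + window])
    return windows

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack(docs))  # [[1, 2, 3, 0, 4, 5, 0, 6]] (the tail is dropped here)</code></pre>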
<p>When using sliding windows, the way you calculate the <strong>Loss</strong> (usually Cross-Entropy Loss) can be adjusted. In a standard Next-Token Prediction task, you calculate the loss for all tokens in the window. However, in a sliding window setup with high overlap, some tokens are &#8220;new&#8221; to the model in the current window, while others were already seen at the end of the previous window.</p><p>Some researchers suggest calculating the loss only on the &#8220;new&#8221; tokens (the ones in the stride portion) or applying a lower weight to the &#8220;re-seen&#8221; tokens to prevent the model from over-optimizing on the middle sections of a window (a toy sketch of this masking idea closes out this post).</p><p>The sliding window is more than just a data-loading trick; it is a fundamental bridge between the linear nature of human language and the block-based nature of Transformer computation.</p><p>For most developers and researchers:</p><ul><li><p><strong>Use a small stride</strong> (<em>S</em> approximately 0.1W to 0.5W) for fine-tuning on domain-specific long-form data where every connection matters (e.g., legal docs, medical research).</p></li><li><p><strong>Use a large stride</strong> (<em>S</em> approximately 0.8W to 0.9W) for general pre-training to balance context continuity with computational efficiency.</p></li><li><p><strong>Never use zero overlap</strong> unless the data is naturally modular (like a collection of short tweets).</p></li></ul><p>As we push toward models with million-token context windows, the <em>necessity</em> for sliding windows may shift, but the <em>logic</em> remains: how we present data to a model determines how that model perceives the world. By sliding the window thoughtfully, we ensure the model never misses the forest for the trees&#8212;or the sentence for the tokens.</p>
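<p>As promised above, here is a toy sketch of the loss-masking idea (my own illustration with assumed shapes, not code from a production pipeline): it down-weights the overlapping prefix of each window before averaging the cross-entropy.</p><pre><code>import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, overlap, reseen_weight=0.2):
    """Cross-entropy where the first `overlap` targets of a window
    (already seen at the end of the previous window) get a lower weight.

    logits: (seq_len, vocab_size); targets: (seq_len,)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_token)
    weights[:overlap] = reseen_weight  # down-weight re-seen tokens
    return (per_token * weights).sum() / weights.sum()

# Toy usage with random data: a window of 8 tokens, 3 of them re-seen.
logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(masked_lm_loss(logits, targets, overlap=3))</code></pre>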
]]></content:encoded></item><item><title><![CDATA[Contextual Embeddings in LLMs]]></title><description><![CDATA[Moving from simply recognizing words to actually understanding them.]]></description><link>https://limitedintelligence.substack.com/p/contextual-embeddings-in-llms</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/contextual-embeddings-in-llms</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 16 Apr 2026 13:01:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M3iO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd7c8e2-fd48-466f-bac6-74156dcfe8e7_964x687.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!M3iO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd7c8e2-fd48-466f-bac6-74156dcfe8e7_964x687.png" width="964" height="687" alt=""></figure></div>
<p>In the evolution of Natural Language Processing (NLP), <strong>contextual embeddings</strong> represent one of the most significant breakthroughs. They allow modern Large Language Models (LLMs) to move beyond merely &#8220;recognizing&#8221; words to actually &#8220;understanding&#8221; them based on the specific intent and nuance of a sentence.</p><p>To understand why contextual embeddings are transformative, we must first look at what came before: <strong>Static Embeddings</strong> (e.g., Word2Vec, GloVe).</p><ul><li><p><strong>Static Embeddings:</strong> In these older models, every word is mapped to a single, fixed vector (a list of numbers) in a high-dimensional space. Whether the word &#8220;bank&#8221; appeared in &#8220;river bank&#8221; or &#8220;bank account,&#8221; the model assigned it the <em>exact same</em> vector. This approach failed to capture polysemy&#8212;the phenomenon where a single word has multiple, distinct meanings.</p></li><li><p><strong>Contextual Embeddings:</strong> These models, powered by the Transformer architecture, dynamically generate a unique vector for a word <em>each time it appears</em>.
The representation of &#8220;bank&#8221; is calculated by looking at the other words in the sentence, allowing the model to distinguish between a financial institution and the bank of a river.</p></li></ul><p>The magic of contextual embeddings happens within the <strong>Transformer architecture</strong>, specifically through a process known as <strong>Self-Attention</strong>.</p><h4>Step 1: Tokenization</h4><p>The input text is broken into smaller units called tokens (words or subwords). These tokens are assigned an initial, &#8220;base&#8221; vector that represents their broad meaning.</p><h4>Step 2: The Self-Attention Mechanism</h4><p>This is the engine of context. When the model processes a sequence of tokens, the self-attention mechanism computes how much focus (attention) each token should place on every other token in the sequence.</p><ul><li><p>If the input is &#8220;The <strong>bank</strong> of the river,&#8221; the &#8220;bank&#8221; token pays high attention to &#8220;river.&#8221;</p></li><li><p>If the input is &#8220;The <strong>bank</strong> is closed today,&#8221; the &#8220;bank&#8221; token pays high attention to &#8220;closed&#8221; and &#8220;today.&#8221;</p></li></ul><p>Through these attention weights, the vector for &#8220;bank&#8221; is updated to incorporate the semantics of the surrounding words.</p><h4>Step 3: Layered Processing</h4><p>Transformers consist of multiple &#8220;blocks&#8221; or layers. As the token passes through each layer, its representation is refined. Early layers might capture basic syntax (like grammar and word order), while deeper layers capture complex, abstract semantic relationships. By the time the token reaches the final layer, its vector is highly specific to its current context.</p><p>Contextual embeddings are the foundation upon which modern, reasoning-capable AI is built. They offer several critical advantages:</p><ul><li><p><strong>Handling Polysemy:</strong> As illustrated, the model accurately differentiates meanings, leading to vastly superior performance in translation, summarization, and sentiment analysis.</p></li><li><p><strong>Capturing Long-Range Dependencies:</strong> Traditional models struggled to link words that were far apart in a sentence. Self-attention allows tokens to &#8220;see&#8221; and interact with any other token, regardless of distance.</p></li><li><p><strong>Semantic Nuance:</strong> These embeddings don&#8217;t just capture dictionary definitions; they capture tone, intent, and stylistic variations.</p></li><li><p><strong>Foundation for RAG:</strong> In Retrieval-Augmented Generation (RAG) systems, contextual embeddings are used to index documents.
Because they are context-aware, they allow the system to retrieve highly relevant information even when a user&#8217;s query uses different phrasing than the source document.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1ie5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74afdac1-adb2-4a71-a49f-64472094f21c_1112x318.png" width="1112" height="318" alt=""></figure></div>
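<p>You can observe this polysemy handling directly. The sketch below is my own illustration, assuming the Hugging Face <code>transformers</code> package and the public <code>bert-base-uncased</code> checkpoint (any small encoder would do): it compares the vector that &#8220;bank&#8221; receives in different sentences.</p><pre><code>import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -&gt; torch.Tensor:
    """Return the contextual embedding of the token 'bank' in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

river = bank_vector("The bank of the river was muddy.")
money = bank_vector("The bank raised its interest rates.")
loan = bank_vector("The bank approved my loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))  # lower similarity: different senses
print(cos(money, loan, dim=0))   # higher similarity: same sense</code></pre>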
<p>By moving from static &#8220;definitions&#8221; to dynamic, context-based representations, contextual embeddings allow AI to mimic human comprehension, making them essential to the capabilities of current state-of-the-art models.</p>]]></content:encoded></item><item><title><![CDATA[Positional Embeddings in LLMs]]></title><description><![CDATA[Understand how positional embeddings affect the overall model training process.]]></description><link>https://limitedintelligence.substack.com/p/positional-embeddings-in-llms</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/positional-embeddings-in-llms</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 15 Apr 2026 13:02:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/692ed969-8785-42fe-bf97-11995f690bcb_938x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;924ec809-4b19-497b-84c4-97ac3eff652c&quot;,&quot;duration&quot;:null}"></div><p>In the architecture of Transformers, the self-attention mechanism is permutation-invariant: if you shuffle the order of words in a sentence, the attention scores remain identical.
To bridge this gap and allow the model to understand the sequence and structure of language, we inject <strong>positional information</strong>.</p><p>Positional embeddings serve as a &#8220;coordinate system&#8221; for the tokens in a sequence, allowing the model to distinguish between &#8220;The dog bit the man&#8221; and &#8220;The man bit the dog.&#8221; We categorize these approaches into <strong>Absolute</strong> and <strong>Relative</strong> methods.</p><h2>1. Absolute Positional Embeddings (APE)</h2><p>Absolute positional encoding assigns a unique vector representation to each position index (0, 1, 2, &#8230;, N) in the sequence. This vector is added to the token embedding before it enters the Transformer blocks.</p><h3>The Theory</h3><p>The core idea is to represent position as an &#8220;address.&#8221;</p><ul><li><p><strong>Learned Embeddings:</strong> In original models like BERT, each position <em>i</em> is mapped to a trainable vector <em>p_i</em>. The input becomes <em>x&#8217;_i = x_i + p_i</em>.</p></li><li><p><strong>Sinusoidal Embeddings:</strong> Introduced in the original &#8220;Attention Is All You Need&#8221; paper, these use fixed sine and cosine functions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE_{(pos, 2i)} = \\sin(pos/10000^{2i/d_{model}})&quot;,&quot;id&quot;:&quot;WJEUFUOODU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE_{(pos, 2i+1)} = \\cos(pos/10000^{2i/d_{model}})&quot;,&quot;id&quot;:&quot;YHLJUTXSWC&quot;}" data-component-name="LatexBlockToDOM"></div><p>This allows the model to potentially extrapolate to sequence lengths longer than those seen during training, as the functions are continuous.</p></li></ul><h3>Limitations</h3><ul><li><p><strong>Fixed Context Window:</strong> Learned APEs fail if the test sequence is longer than the maximum training length.</p></li><li><p><strong>Lack of Translation Invariance:</strong> The model doesn&#8217;t inherently understand that &#8220;the distance between word A and word B is the same&#8221; regardless of where they appear in the sentence.</p></li></ul>
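<p>A minimal NumPy sketch of the sinusoidal table defined above (my own illustration; it assumes an even <code>d_model</code>):</p><pre><code>import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -&gt; np.ndarray:
    """Fixed sine/cosine table from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / (10000 ** (i / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cos
    return pe

print(sinusoidal_pe(max_len=128, d_model=64).shape)  # (128, 64)</code></pre>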
<div><hr></div><h2>2. Relative Positional Embeddings (RPE)</h2><p>Rather than encoding &#8220;where&#8221; a word is, relative embeddings focus on the <strong>distance</strong> between two tokens (<em>i - j</em>). The intuition is that the relationship between two words depends on how far apart they are, not their absolute index.</p><h3>The Theory</h3><p>In standard self-attention, the score between tokens <em>i</em> and <em>j</em> is calculated as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q_i K_j^T&quot;,&quot;id&quot;:&quot;VTLSTHGBUB&quot;}" data-component-name="LatexBlockToDOM"></div><p>In relative schemes, we modify this to include a term representing the distance $\Delta = i - j$:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Attention_{i,j} = Q_i K_j^T + Q_i R_{i-j}^T&quot;,&quot;id&quot;:&quot;LUIOCSCIMF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <strong>R</strong> is a learnable embedding representing the relative distance between positions <em>i</em> and <em>j</em>.</p><h3>Key Modern Approaches</h3><ul><li><p><strong>RoPE (Rotary Positional Embeddings):</strong> Used in modern architectures like Llama and Mistral. RoPE encodes positions by rotating the query and key vectors in a complex plane. It captures relative information through the inner product, effectively decaying the attention score as the distance between tokens increases.</p></li><li><p><strong>ALiBi (Attention with Linear Biases):</strong> Rather than adding vectors, ALiBi adds a static, non-learned penalty to the attention scores based on the distance between tokens. This is exceptionally efficient and allows extrapolation to sequence lengths far beyond those seen in training.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EUFR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9c79d01-0e19-4f73-bfbc-dbf4f7ae4399_1248x318.png" width="1248" height="318" alt=""></figure></div>
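<p>For intuition, here is a toy single-head version of the ALiBi penalty (my own sketch; the paper assigns each head its own slope from a geometric series):</p><pre><code>import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.25) -&gt; np.ndarray:
    """Causal ALiBi-style bias: 0 on the diagonal, increasingly negative
    the further a key sits behind the query, -inf for future keys."""
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]  # j - i (negative for past keys)
    return np.where(dist &lt;= 0, slope * dist, -np.inf)

print(alibi_bias(4))  # row i holds the penalties added to query i's scores</code></pre>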
<p>Models like <strong>Claude</strong> or <strong>Llama 3</strong> rely heavily on RPEs (specifically RoPE). Because they do not rely on fixed index slots, these models can be fine-tuned to handle documents spanning hundreds of thousands of tokens. If you use a model to summarize a 500-page legal document, it is using relative positioning to maintain the coherence of facts separated by tens of thousands of words.</p><p>For structured tasks where the sequence length is tightly constrained (like fixed-length sentence translation), APEs are often sufficient and highly optimized in hardware. However, recent hybrid systems are increasingly shifting toward RoPE even here, as it provides a better &#8220;semantic anchor&#8221; for grammatical relationships.</p><p>In audio processing (like Whisper), the &#8220;time&#8221; of the input is continuous. Relative approaches are superior here because they allow the model to recognize rhythmic patterns or spectral features regardless of when they start in an audio file, offering better robustness to varying segment lengths.</p>
]]></content:encoded></item><item><title><![CDATA[Byte-Pair Encoding (BPE)]]></title><description><![CDATA[The Foundation of Modern LLM Tokenization]]></description><link>https://limitedintelligence.substack.com/p/byte-pair-encoding-bpe</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/byte-pair-encoding-bpe</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 14 Apr 2026 13:00:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tXNg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0196ea0d-95a4-4ad2-bf9f-15262a630fd8_1654x1036.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tXNg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0196ea0d-95a4-4ad2-bf9f-15262a630fd8_1654x1036.webp" width="1456" height="912" alt=""></figure></div>
<p>In the architecture of Large Language Models (LLMs), the model does not &#8220;read&#8221; text as humans do. It processes numerical representations. The bridge between raw human text and these numbers is <strong>tokenization</strong>. Among the various methods available, <strong>Byte-Pair Encoding (BPE)</strong> has emerged as the industry standard, powering models like GPT-4, Llama, and Mistral.</p><p>BPE is a subword tokenization algorithm that strikes a balance between character-level models (which are too granular) and word-level models (which struggle with infinite vocabulary sizes).</p><p>The core philosophy of BPE is <strong>iterative merging</strong>: it starts by treating every character as an individual token and progressively merges the most frequently occurring adjacent pairs of tokens into a new, single token.
This continues until a pre-defined vocabulary size is reached.</p><p>To understand the mechanics, imagine we are training a tokenizer on a tiny corpus containing the words: <em>&#8220;hug&#8221;</em>, <em>&#8220;pug&#8221;</em>, and <em>&#8220;pun&#8221;</em>.</p><p>We break the text down into individual characters and add an end-of-word marker (often <code>&lt;/w&gt;</code>):</p><ul><li><p><code>h</code>, <code>u</code>, <code>g</code>, <code>&lt;/w&gt;</code></p></li><li><p><code>p</code>, <code>u</code>, <code>g</code>, <code>&lt;/w&gt;</code></p></li><li><p><code>p</code>, <code>u</code>, <code>n</code>, <code>&lt;/w&gt;</code></p></li></ul><p>The algorithm counts the frequency of all adjacent pairs:</p><ul><li><p><code>u</code> + <code>g</code> appears twice.</p></li><li><p><code>p</code> + <code>u</code> appears twice.</p></li></ul><p>If we choose to merge <code>u</code> and <code>g</code>, they become a new token <code>ug</code>. The vocabulary now includes the individual characters plus the new compound token <code>ug</code>. We repeat this process, merging the next most frequent pair, until we hit our target vocabulary size.</p>
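<p>One merge step of this walkthrough fits in a few lines of Python. This is a toy sketch of my own (production tokenizers such as Hugging Face <code>tokenizers</code> are heavily optimized versions of the same idea):</p><pre><code>from collections import Counter

# The toy corpus above: each word as a tuple of symbols, with its frequency.
corpus = {
    ("h", "u", "g", "&lt;/w&gt;"): 1,
    ("p", "u", "g", "&lt;/w&gt;"): 1,
    ("p", "u", "n", "&lt;/w&gt;"): 1,
}

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Rewrite every word with the chosen pair fused into one symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i &lt; len(word):
            if i + 1 &lt; len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)  # ('u', 'g') and ('p', 'u') tie at 2
corpus = merge(corpus, pair)
print(pair, corpus)</code></pre>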
<p>BPE solved two massive problems in Natural Language Processing: out-of-vocabulary words and sequence length.</p><p>Traditional word-level models would fail if they encountered a word not in their training dictionary (e.g., a rare medical term or a made-up slang word). Because BPE can break unknown words down into smaller subwords (or even characters), it ensures the model can <strong>always</strong> generate a representation for any string.</p><p>By grouping frequently occurring character sequences (like &#8220;ing&#8221;, &#8220;tion&#8221;, or &#8220;pre&#8221;), BPE allows the model to represent long words with fewer tokens.</p><ul><li><p><strong>Word-level:</strong> &#8220;Tokenization&#8221; = 1 token (but requires a massive, unmanageable vocabulary).</p></li><li><p><strong>Character-level:</strong> &#8220;Tokenization&#8221; = 12 tokens (too long for the model&#8217;s limited context window).</p></li><li><p><strong>BPE:</strong> &#8220;Token&#8221; + &#8220;ization&#8221; = 2 tokens (optimized length and memory usage).</p></li></ul><p>Early versions of BPE operated on Unicode characters, which could lead to issues with rare emojis or non-Latin alphabets. Modern LLMs (like GPT-2 and beyond) utilize <strong>Byte-level BPE</strong>.</p><p>Instead of merging characters, the algorithm operates on the <strong>raw bytes</strong> of the UTF-8 encoding. This ensures:</p><ul><li><p><strong>Universal Coverage:</strong> The base vocabulary is fixed at 256 bytes.</p></li><li><p><strong>No &#8220;Unknown&#8221; Tokens:</strong> Because every string can be represented as bytes, the model is theoretically capable of tokenizing any input, regardless of language, emoji usage, or symbols.</p></li></ul><p>While powerful, BPE is not perfect:</p><ul><li><p><strong>Greedy Approach:</strong> BPE is a greedy algorithm. It doesn&#8217;t look at the context of the sentence; it simply merges the most frequent pairs globally. Sometimes, this results in unintuitive subwords.</p></li><li><p><strong>Complexity:</strong> It requires a separate tokenizer-training step. If your training corpus changes significantly, the tokenizer may become sub-optimal, which is why developers often use a tokenizer trained on the distribution of data the model will actually see.</p></li></ul><p>Byte-Pair Encoding is the silent engine behind the fluency of LLMs. By intelligently clustering the building blocks of language into meaningful subword units, BPE allows models to handle the vast, messy, and creative nature of human text with both efficiency and precision. It remains the most effective compromise between the granularity of characters and the semantic richness of words.</p>]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Harness Engineering]]></title><description><![CDATA[In the early days of Generative AI, developers focused on &#8220;Context Engineering&#8221;&#8212;ensuring the model had the right files and snippets to generate a single block of code.]]></description><link>https://limitedintelligence.substack.com/p/a-deep-dive-into-harness-engineering</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/a-deep-dive-into-harness-engineering</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 13 Apr 2026 13:01:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6iaH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58b2f2b-9489-4df7-95f3-a21e85ee1f55_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6iaH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58b2f2b-9489-4df7-95f3-a21e85ee1f55_1920x1080.png" width="1456" height="819" alt=""></figure></div>
<p>In the early days of Generative AI, developers focused on &#8220;Context Engineering&#8221;&#8212;ensuring the model had the right files and snippets to generate a single block of code.
However, as we move toward <strong>coding agents</strong> that can navigate entire codebases and perform multi-step tasks, context is no longer enough.</p><p>We need a way to trust the output without constant line-by-line manual review. This is where <strong>Harness Engineering</strong> begins.</p><p>In the relationship between a developer and an AI, the &#8220;Agent&#8221; is defined by the equation: <strong>Agent = Model + Harness.</strong></p><p>While the Model (LLM) provides the &#8220;reasoning&#8221; and token generation, the <strong>Harness</strong> is the structural framework that constrains that reasoning. A well-engineered harness serves two primary functions:</p><ul><li><p><strong>Increasing Probability:</strong> It makes it more likely the agent succeeds on the first attempt (Feedforward).</p></li><li><p><strong>Self-Correction:</strong> It provides sensors that allow the agent to detect and fix its own errors before a human ever sees them (Feedback).</p></li></ul><p>Harness Engineering borrows heavily from <strong>cybernetics</strong>, using a &#8220;Governor&#8221; model to regulate the codebase.</p><h3>Feedforward (The Guides)</h3><p>Guides are proactive. They provide the agent with &#8220;ambient affordances&#8221;&#8212;the rules of the road.</p><ul><li><p><strong>Computational Guides:</strong> Deterministic tools like &#8220;OpenRewrite&#8221; recipes or project scaffolds that force the agent into a specific structure.</p></li><li><p><strong>Inferential Guides:</strong> Semantic instructions, such as <code>AGENTS.md</code> files or &#8220;Skills&#8221; libraries, that explain the <em>intent</em> and <em>style</em> the agent should follow.</p></li></ul><h3>Feedback (The Sensors)</h3><p>Sensors are reactive. They observe the output and provide a signal for the agent to act upon.</p><ul><li><p><strong>The Power of Custom Linters:</strong> B&#246;ckeler notes that feedback is most powerful when optimized for LLMs. Instead of a generic error, a custom linter message could say: <em>&#8220;You violated our module boundary rule; please move this logic to the Service layer.&#8221;</em> This acts as a &#8220;positive prompt injection&#8221; that triggers self-correction.</p></li></ul><p>Harness Engineering changes the fundamental workflow of the software engineer. In a traditional workflow, a developer fixes bugs in the code. In a harnessed workflow, the developer <strong>iterates on the harness</strong>.</p><ol><li><p><strong>Issue Occurs:</strong> The agent produces a sub-par solution or violates a pattern.</p></li><li><p><strong>Harness Gap Analysis:</strong> The human identifies why the harness failed to prevent or detect this.</p></li><li><p><strong>Regulation Improvement:</strong> The human updates the guides (feedforward) or sensors (feedback).</p></li><li><p><strong>Verification:</strong> The agent reruns the task, now governed by the improved harness.</p></li></ol><p>This &#8220;Steering Loop&#8221; ensures that the engineering team&#8217;s collective intelligence is externalized into the system, making the codebase increasingly &#8220;agent-friendly&#8221; over time.</p><p>Not all parts of a codebase are equally easy to govern. B&#246;ckeler divides the harness into three functional categories:</p><h3>A. The Maintainability Harness</h3><p>This regulates internal code quality (complexity, style, duplication).</p><ul><li><p><strong>Status:</strong> High confidence. 
We have decades of static analysis tools (Linters, SonarQube) that act as cheap, fast, computational sensors.</p></li></ul><h3>B. The Architecture Fitness Harness</h3><p>This ensures the system adheres to its architectural characteristics (performance, modularity, observability).</p><ul><li><p><strong>Implementation:</strong> Using tools like <strong>ArchUnit</strong> to check module boundaries or performance tests that act as feedback loops if an agent introduces a latency regression.</p></li></ul><h3>C. The Behavior Harness (The &#8220;Elephant in the Room&#8221;)</h3><p>This regulates functional correctness&#8212;does the feature work?</p><ul><li><p><strong>The Challenge:</strong> Relying on AI to write its own tests creates a circular logic problem. If the AI misunderstood the requirement, it will write a &#8220;green&#8221; test that confirms its own misunderstanding.</p></li><li><p><strong>Current Solution:</strong> Humans must provide the behavioral ground truth through &#8220;approved fixtures&#8221; or manually verified functional specifications that the agent cannot alter.</p></li></ul><p>Why are some codebases easier for AI to handle than others? It comes down to <strong>Ambient Affordances</strong>.</p><p>A codebase written in a strongly typed, modular fashion has higher &#8220;harnessability.&#8221; It provides more &#8220;handles&#8221; for the harness to grab onto. B&#246;ckeler invokes <strong>Ashby&#8217;s Law of Requisite Variety</strong>, which states that a regulator must have as much variety as the system it governs.</p><p>Because an LLM can generate an infinite variety of code (much of it bad), we use <strong>Harness Templates</strong> to reduce that variety. By committing to specific service topologies (e.g., &#8220;This is a standard CRUD API&#8221;), we narrow the space the AI operates in, making a comprehensive harness achievable.</p><p>The most sophisticated harness cannot replace human intuition. Humans provide three things an LLM lacks:</p><ol><li><p><strong>Social Accountability:</strong> Your name is on the commit; the AI doesn&#8217;t care about the long-term consequences.</p></li><li><p><strong>Organizational Memory:</strong> Knowing <em>why</em> a specific technical debt was accepted for business reasons.</p></li><li><p><strong>Aesthetic Disgust:</strong> The visceral reaction to a 500-line function that &#8220;works&#8221; but is unmaintainable.</p></li></ol><p>Harness Engineering is not about reaching 100% automation. It is about <strong>shifting quality left</strong>. By building a system of computational and inferential guardrails, we ensure that when a human is finally called to review code, they are focusing on high-level design and intent, rather than catching the &#8220;toil&#8221; that a well-tuned harness should have caught automatically.</p>
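<p>As a closing illustration, here is a minimal Python sketch of the custom-linter sensor discussed earlier: a feedback tool that emits corrective instructions rather than bare error codes. Everything in it is an assumption for illustration: the layer names, the single boundary rule, and the message wording are hypothetical, not a real tool&#8217;s API.</p><pre><code class="language-python">import ast
import pathlib

# Hypothetical rule for illustration: code under a "controllers" package
# must not import from the "persistence" package directly.
FORBIDDEN = {"controllers": "persistence"}

def lint_module_boundaries(root: str) -> list[str]:
    """Scan a source tree and emit messages an agent can act on directly."""
    messages = []
    for path in pathlib.Path(root).rglob("*.py"):
        banned = FORBIDDEN.get(path.parent.name)
        if banned is None:
            continue
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.ImportFrom) and banned in (node.module or ""):
                # Phrase the finding as a corrective instruction, not a rule ID,
                # so it works as a "positive prompt injection" for self-correction.
                messages.append(
                    f"{path}:{node.lineno}: You violated our module boundary rule: "
                    f"'{path.parent.name}' must not import from '{banned}'. "
                    "Please move this logic to the Service layer."
                )
    return messages
</code></pre><p>Run after every agent turn, the returned strings can be appended straight into the agent&#8217;s context, closing the feedback loop before a human ever sees the violation.</p>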
]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Token Compaction for LLMs]]></title><description><![CDATA[The &#8220;context window&#8221; has become the new frontier of the AI arms race.]]></description><link>https://limitedintelligence.substack.com/p/a-deep-dive-into-token-compaction</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/a-deep-dive-into-token-compaction</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 10 Apr 2026 13:04:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9ed-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafac37-0285-4e9d-acfa-405010a25628_1999x919.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9ed-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafac37-0285-4e9d-acfa-405010a25628_1999x919.png" width="1456" height="669" alt="Cold-Compress 1.0: A Hackable Toolkit for KV-Cache Compression &#8211; Answer.AI"></figure></div>
<p>The &#8220;context window&#8221; has become the new frontier of the AI arms race. We&#8217;ve moved from the 2,048-token limits of early GPT-3 to the million-token horizons of Gemini 1.5 and beyond. However, there is a fundamental law of physics&#8212;or at least, of GPU VRAM&#8212;that remains: <strong>Attention is expensive.</strong></p><p>As sequences grow, the Key-Value (KV) cache balloons, memory bandwidth bottlenecks emerge, and the quadratic scaling of self-attention, $O(n^2)$, threatens to turn even the most powerful H100 clusters into very expensive space heaters. Enter <strong>Token Compaction</strong>: the art and science of keeping the &#8220;signal&#8221; while ruthlessly discarding the &#8220;noise.&#8221;</p><p>To understand compaction, we must first acknowledge why we need it. In an auto-regressive Transformer, the model avoids recomputing hidden states for previous tokens by storing them in the <strong>KV Cache</strong>.</p><p>While this saves computation, it creates a massive memory footprint. For a model with $l$ layers, $h$ attention heads, and a per-head dimension $d$, the memory required for the KV cache of a sequence length $s$ is roughly:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Memory}_{\\text{KV}} \\approx 2 \\times s \\times l \\times h \\times d \\times \\text{bytes-per-parameter}&quot;,&quot;id&quot;:&quot;HMKSXXFHVE&quot;}" data-component-name="LatexBlockToDOM"></div><p>For a 70B parameter model at 16-bit precision, a 128k context window isn&#8217;t just a &#8220;long prompt&#8221;&#8212;it&#8217;s a hundred-gigabyte memory hurdle (see the sketch below). Token compaction techniques aim to reduce $s$ (the effective sequence length) without losing the semantic coherence required for accurate generation.</p>
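<p>A quick back-of-the-envelope check of that claim, using the formula above. The shape here (80 layers, 64 heads, head dimension 128, FP16) is an assumed 70B-class configuration, used purely for illustration:</p><pre><code class="language-python">def kv_cache_bytes(s: int, l: int, h: int, d: int, bytes_per_param: int = 2) -> int:
    """Keys + values (the leading 2) for every position, layer, and head."""
    return 2 * s * l * h * d * bytes_per_param

# Assumed 70B-class shape: 80 layers, 64 full KV heads, head dim 128, FP16.
size = kv_cache_bytes(s=128_000, l=80, h=64, d=128)
print(f"{size / 2**30:.1f} GiB")  # 312.5 GiB with full multi-head KV
# Grouped-query attention (covered below) shares KV heads and divides
# this number by the sharing factor, e.g. 8x fewer KV heads -> ~39 GiB.
</code></pre>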
<p>Token compaction isn&#8217;t a single &#8220;trick.&#8221; It is a spectrum of strategies ranging from &#8220;dropping tokens on the floor&#8221; to &#8220;fusing them into a smarter representation.&#8221; We can categorize these into four primary pillars:</p><ol><li><p><strong>Token Pruning (Selection)</strong></p></li><li><p><strong>Token Merging (Fusion)</strong></p></li><li><p><strong>Architectural Compression (GQA/MQA)</strong></p></li><li><p><strong>Dynamic Eviction (KV Cache Management)</strong></p></li></ol><p>Token pruning assumes that not all tokens are created equal. In a typical sentence, stop words, punctuation, or redundant fillers often carry low &#8220;attention weight.&#8221;</p><p>One of the most influential papers in this space, <em>H2O</em>, observed that a small fraction of tokens&#8212;termed &#8220;Heavy Hitters&#8221;&#8212;account for the vast majority of the attention scores. By maintaining a small, fixed-size cache of these high-influence tokens and discarding the rest, models can maintain performance while using significantly less memory.</p><ul><li><p><strong>How it works:</strong> The model tracks cumulative attention scores. If a token consistently receives high attention from subsequent tokens, it stays. If its attention score stays below a threshold, it&#8217;s evicted from the cache.</p></li><li><p><strong>The Benefit:</strong> It allows for theoretically infinite sequence lengths in a fixed memory budget, provided the &#8220;working memory&#8221; of the task fits within the H2O cache.</p></li></ul><p>Researchers discovered a curious phenomenon: the very first tokens in a sequence (the &#8220;sinks&#8221;) receive massive amounts of attention, regardless of their semantic value.</p><blockquote><p><strong>The Insight:</strong> If you remove the first token, the model&#8217;s perplexity explodes. <strong>StreamingLLM</strong> keeps the first few tokens (the anchors) and a sliding window of the most recent tokens, effectively &#8220;compacting&#8221; the context by ignoring the middle-distance history that hasn&#8217;t been flagged as important.</p></blockquote>
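<p>The two ideas compose naturally. Below is a toy NumPy sketch of an eviction policy that protects StreamingLLM-style attention sinks and then keeps H2O-style heavy hitters. The sink count and budget are arbitrary illustrative values, and real implementations additionally protect a window of the most recent tokens:</p><pre><code class="language-python">import numpy as np

def keep_indices(cum_attn: np.ndarray, n_sinks: int = 4, budget: int = 512) -> np.ndarray:
    """Choose which cache positions survive eviction.

    cum_attn: (seq_len,) cumulative attention each cached token has received.
    """
    seq_len = cum_attn.shape[0]
    if budget >= seq_len:
        return np.arange(seq_len)            # everything still fits
    sinks = np.arange(n_sinks)               # always keep the anchor tokens
    rest = np.arange(n_sinks, seq_len)
    # Fill the remaining budget with the heaviest hitters.
    heavy = rest[np.argsort(cum_attn[rest])[-(budget - n_sinks):]]
    return np.sort(np.concatenate([sinks, heavy]))

rng = np.random.default_rng(0)
kept = keep_indices(rng.random(2048))        # 2048 cached tokens shrink to 512
</code></pre>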
<p>If pruning is a scalpel, <strong>Token Merging (ToMe)</strong> is a blender. Instead of deciding if a token is &#8220;in&#8221; or &#8220;out,&#8221; merging looks for tokens that are mathematically similar and combines them.</p><p>Using a similarity metric (often Cosine Similarity) between the Key vectors ($K$), the algorithm identifies clusters of tokens that represent the same concept or context. The procedure, sketched in code below, is:</p><ol><li><p><strong>Partition:</strong> Divide tokens into two sets.</p></li><li><p><strong>Compare:</strong> Calculate the similarity between sets.</p></li><li><p><strong>Merge:</strong> Average the most similar pairs into a single token representation.</p></li><li><p><strong>Weight:</strong> Increase the &#8220;importance&#8221; weight of the new merged token so the attention mechanism knows it represents a larger chunk of the original text.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Merged\\_Vector} = \\frac{\\sum (v_i \\cdot w_i)}{\\sum w_i}&quot;,&quot;id&quot;:&quot;NPEDNTEKLW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is particularly effective in multimodal LLMs. In an image, 50 tokens representing a &#8220;clear blue sky&#8221; can be merged into one with almost zero loss in descriptive power.</p>
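<p>A toy NumPy version of those four steps, assuming a single head and skipping the positional bookkeeping a real ToMe implementation does inside its attention blocks (the input arrays are modified in place):</p><pre><code class="language-python">import numpy as np

def merge_tokens(keys, values, weights, r):
    """Bipartite soft matching: fold the r most similar (A, B) pairs together."""
    a = np.arange(0, len(keys), 2)                   # 1. Partition into two sets
    b = np.arange(1, len(keys), 2)
    ka = keys[a] / np.linalg.norm(keys[a], axis=1, keepdims=True)
    kb = keys[b] / np.linalg.norm(keys[b], axis=1, keepdims=True)
    sim = ka @ kb.T                                  # 2. Compare (cosine similarity)
    best, partner = sim.max(axis=1), sim.argmax(axis=1)
    merged = np.argsort(best)[-r:]                   # the r closest pairs
    for i in merged:                                 # 3. + 4. Weighted average merge
        src, dst = a[i], b[partner[i]]
        total = weights[src] + weights[dst]
        keys[dst] = (keys[dst] * weights[dst] + keys[src] * weights[src]) / total
        values[dst] = (values[dst] * weights[dst] + values[src] * weights[src]) / total
        weights[dst] = total                         # merged token "speaks for" both
    keep = np.setdiff1d(np.arange(len(keys)), a[merged])
    return keys[keep], values[keep], weights[keep]

rng = np.random.default_rng(0)
k, v, w = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), np.ones(8)
k2, v2, w2 = merge_tokens(k, v, w, r=2)              # 8 tokens shrink to 6
</code></pre>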
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/193255035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Alh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 424w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 848w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 1272w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Sometimes, you don&#8217;t need fewer tokens; you just need &#8220;smaller&#8221; tokens. Standard models use FP16 or BF16 (16 bits per parameter). Quantization techniques like <strong>KIVI</strong> or <strong>KV-Quant</strong> compress these down to 4-bit or even 2-bit representations.</p><p>The challenge here is the <strong>Outlier Problem</strong>. 
<p>While token compaction sounds like a miracle, it comes with &#8220;The Compression Tax.&#8221;</p><ol><li><p><strong>Retrieval Accuracy:</strong> If you are doing &#8220;Needle in a Haystack&#8221; tests, pruning can accidentally throw away the &#8220;needle.&#8221;</p></li><li><p><strong>Reasoning Chains:</strong> In multi-step logic (CoT), seemingly &#8220;unimportant&#8221; intermediate steps might be vital for the final output. Compaction algorithms often struggle to distinguish between &#8220;filler&#8221; and &#8220;structural logic.&#8221;</p></li><li><p><strong>Prefill Latency:</strong> Merging tokens requires calculating similarity matrices, which can actually make the initial &#8220;reading&#8221; of the prompt slower, even if the &#8220;generation&#8221; phase becomes faster.</p></li></ol><p>The future of token compaction lies in <strong>Learned Compaction</strong>. Instead of using fixed heuristics (like &#8220;keep the first 4 tokens&#8221;), we are seeing the rise of models that have a &#8220;gatekeeper&#8221; layer. This layer predicts the importance of a token <em>before</em> it enters the KV cache.</p><p>As we march toward &#8220;Infinite Context,&#8221; the bottleneck will shift from how much we can store to how efficiently we can index. Token compaction is essentially the creation of a &#8220;searchable index&#8221; for the model&#8217;s own memory.</p><p>We are moving away from the &#8220;Brute Force&#8221; era of LLMs. In the early days, we simply threw more VRAM at the problem. Today, we are teaching models to be discerning&#8212;to realize that in a 100,000-word book, the specific phrasing of a &#8220;the&#8221; or a &#8220;but&#8221; is less important than the character&#8217;s motivation. Token compaction is, in a sense, the first step toward giving AI a &#8220;subconscious&#8221; filter.</p>
]]></content:encoded></item><item><title><![CDATA[Parallels Between Human Memory and Large Language Models]]></title><description><![CDATA[The human mind and the Large Language Model (LLM) are arguably the two most sophisticated information-processing systems in existence.]]></description><link>https://limitedintelligence.substack.com/p/parallels-between-human-memory-and</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/parallels-between-human-memory-and</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 06 Apr 2026 13:03:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lDTP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47de4763-fcd1-4b3d-8e31-5fef1f0e8270_2560x1440.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lDTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47de4763-fcd1-4b3d-8e31-5fef1f0e8270_2560x1440.webp" width="1456" height="819" alt="The Usefulness of a Memory Guides Where the Brain Saves It | Quanta Magazine"></figure></div>
<p>The human mind and the Large Language Model (LLM) are arguably the two most sophisticated information-processing systems in existence. While one is the product of millions of years of biological evolution and the other a result of decades of computational engineering, they share a fundamental challenge: <strong>How do you store, retrieve, and utilize vast amounts of information in real-time?</strong></p><p>In cognitive psychology, the distinction between Short-Term Memory (STM) and Long-Term Memory (LTM) is foundational. In the world of Artificial Intelligence, a striking parallel has emerged within the architecture of Transformers&#8212;the engine behind models like GPT-4 and Gemini. 
By examining these two systems side-by-side, we gain not only a better understanding of AI but a deeper appreciation for the elegance of human cognition.</p><p>To understand the parallel, we must first define the biological standard. In 1968, Atkinson and Shiffrin proposed the <strong>Multi-Store Model</strong>, which remains the primary framework for discussing memory stages.</p><p>Short-term memory is our &#8220;mental workspace.&#8221; It is characterized by:</p><ul><li><p><strong>Limited Capacity:</strong> Classically defined by George Miller as $7 \pm 2$ items.</p></li><li><p><strong>Brief Duration:</strong> Information typically fades within 15&#8211;30 seconds unless rehearsed.</p></li><li><p><strong>High Accessibility:</strong> Information is immediately available for manipulation.</p></li></ul><p>Modern psychology often prefers the term <strong>Working Memory</strong>, popularized by Baddeley and Hitch. It isn&#8217;t just a waiting room for data; it is an active processor consisting of a &#8220;Central Executive&#8221; that directs attention, a &#8220;Phonological Loop&#8221; for auditory data, and a &#8220;Visuospatial Sketchpad&#8221; for imagery.</p><p>Long-term memory is the &#8220;hard drive&#8221; of the brain. It is characterized by:</p><ul><li><p><strong>Virtually Infinite Capacity:</strong> There is no known limit to what a human can learn over a lifetime.</p></li><li><p><strong>Durability:</strong> Information can last from minutes to decades.</p></li><li><p><strong>Consolidation:</strong> The process of moving information from STM to LTM, often involving the hippocampus and sleep.</p></li></ul><p>LTM is further divided into <strong>Explicit (Declarative)</strong> memory&#8212;facts and events&#8212;and <strong>Implicit (Procedural)</strong> memory&#8212;skills like riding a bike or typing.</p><p>LLMs do not have &#8220;brains,&#8221; but they do have functional equivalents to these memory systems. In the context of a chatbot or an AI agent, the distinction is found between <strong>Weights</strong> and <strong>Context</strong>.</p><p>When you chat with an LLM, the model remembers what you said five sentences ago. This is its <strong>Short-Term Memory</strong>, technically referred to as the <strong>Context Window</strong>.</p><ul><li><p><strong>Capacity:</strong> Just as humans can only hold a few digits in their head, LLMs have a token limit (e.g., 128k or 1M tokens).</p></li><li><p><strong>Attention Mechanism:</strong> The &#8220;Transformer&#8221; architecture uses an <strong>Attention Mechanism</strong> that mirrors human selective attention. It decides which parts of the input are relevant to the current word being generated.</p></li><li><p><strong>Volatility:</strong> Once the &#8220;session&#8221; is cleared or the window is exceeded, the model &#8220;forgets&#8221; everything in that specific conversation. It does not naturally &#8220;learn&#8221; from a single chat in real-time.</p></li></ul><p>The &#8220;knowledge&#8221; an LLM possesses&#8212;the fact that Paris is the capital of France or how to write Python code&#8212;is stored in its <strong>Parameters (Weights)</strong>.</p><ul><li><p><strong>The Training Phase:</strong> This is the AI&#8217;s version of consolidation. 
During training, the model processes trillions of words, and the &#8220;lessons&#8221; are baked into the strength of the connections between neurons in the neural network.</p></li><li><p><strong>Static Nature:</strong> Unlike human LTM, which is &#8220;plastic&#8221; (constantly changing), a standard LLM&#8217;s LTM is static after training. To add new long-term knowledge, it must be &#8220;fine-tuned&#8221; or retrained.</p></li></ul><p>One of the biggest hurdles in AI is that models eventually &#8220;run out&#8221; of context window, much like a human forgets a phone number if they are distracted. To solve this, engineers developed <strong>Retrieval-Augmented Generation (RAG)</strong>.</p><p>RAG acts like an external long-term memory or a reference library. Instead of trying to &#8220;remember&#8221; everything in its weights, the AI looks up information in a database and brings it into its &#8220;Working Memory&#8221; (context window) only when needed.</p><p>This mirrors the human use of <strong>External Memory Aids</strong>&#8212;like notebooks or Google&#8212;but it also mimics the way our brain retrieves a specific memory from LTM to handle a current task. When you solve a math problem, you retrieve the &#8220;rule&#8221; from LTM into your Working Memory to apply it. RAG does exactly this for AI (a toy version is sketched at the end of this piece).</p><p>A fascinating parallel exists in how both systems fail.</p><ul><li><p><strong>Psychology:</strong> The <strong>Serial Position Effect</strong> suggests humans remember the beginning (Primacy) and the end (Recency) of a list best, often forgetting the middle.</p></li><li><p><strong>AI:</strong> Research has shown that LLMs also struggle with &#8220;Lost in the Middle.&#8221; When given a very long prompt, they are much better at utilizing information located at the very start or the very end of the text, while the middle often gets ignored by the attention mechanism.</p></li></ul><p>This suggests that &#8220;attention&#8221; is a finite resource in both biological and synthetic architectures.</p><p>Comparing human memory to LLMs reveals a fundamental truth about intelligence: <strong>Processing requires a trade-off between volume and speed.</strong> We cannot keep everything we&#8217;ve ever learned in our active consciousness (STM) because it would create too much noise. Similarly, an LLM cannot have an infinite context window without becoming computationally &#8220;heavy&#8221; and slow.</p><p>As we move toward &#8220;Agentic AI&#8221;&#8212;models that can plan, reason, and remember over long periods&#8212;we are seeing AI move closer to the human model of <strong>continuous learning</strong>. While humans use the hippocampus to turn today&#8217;s experiences into tomorrow&#8217;s wisdom, AI researchers are developing &#8220;memory wrappers&#8221; and &#8220;dynamic fine-tuning&#8221; to give machines a similar sense of persistence.</p><p>Ultimately, the LLM is a mirror. By building machines that &#8220;remember&#8221; like us, we are slowly decoding the algorithmic secrets of our own minds.</p>
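<p>And for the engineers in the audience, here is the promised toy version of that retrieve-then-read loop. The three-dimensional vectors are made up and stand in for a real embedding model, purely to show the mechanic of pulling a &#8220;long-term memory&#8221; into the prompt:</p><pre><code class="language-python">import numpy as np

# Toy "long-term memory": pre-embedded passages (vectors invented for illustration).
corpus = [
    "Paris is the capital of France.",
    "The hippocampus consolidates memories during sleep.",
    "Python was created by Guido van Rossum.",
]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.2, 0.9]])

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[str]:
    """Recall the k nearest memories by cosine similarity."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [corpus[i] for i in np.argsort(sims)[-k:][::-1]]

# Retrieved facts land in the context window -- the model's working memory.
query_vec = np.array([0.8, 0.2, 0.1])   # stand-in for an embedded user question
prompt = ("Context:\n" + "\n".join(retrieve(query_vec))
          + "\n\nQuestion: What is the capital of France?")
</code></pre>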
]]></content:encoded></item><item><title><![CDATA[From N-Grams to Reasoning Engines]]></title><description><![CDATA[A Definitive History of Large Language Models]]></description><link>https://limitedintelligence.substack.com/p/from-n-grams-to-reasoning-engines</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/from-n-grams-to-reasoning-engines</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 03 Apr 2026 13:00:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cy13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf6cc768-45c6-4c83-847d-fc690935b36f_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!cy13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf6cc768-45c6-4c83-847d-fc690935b36f_2560x1440.jpeg" width="1456" height="819" alt="Large Language Models 101: History, Evolution and Future"></figure></div>
<p>The story of Large Language Models (LLMs) is often told as a sudden explosion that began in late 2022 with the release of ChatGPT. However, for those tracking the pulse of computational linguistics, this &#8220;explosion&#8221; was the result of decades of slow-burn research, architectural pivots, and a fundamental shift in how we think about machine intelligence.</p><p>It is a history of moving from <strong>rules</strong> to <strong>probabilities</strong>, and finally, to <strong>reasoning</strong>.</p><h2>1. The Pre-Neural Era: The Search for Structure (1950s &#8211; 1990s)</h2><p>In the early days of Artificial Intelligence, the dominant philosophy was <strong>Symbolic AI</strong>. Researchers believed that if we could simply code all the rules of grammar and logic into a machine, it would understand language. This led to &#8220;Expert Systems&#8221; and the creation of the ELIZA chatbot in the 1960s, which used simple pattern matching to mimic a Rogerian psychotherapist.</p><p>By the 1980s and 90s, the field shifted toward <strong>Statistical NLP</strong>. Instead of hard-coded rules, researchers used <strong>N-grams</strong>. An N-gram model predicts the next word based on the frequency of word sequences in a massive corpus of text. If the word &#8220;San&#8221; appeared, the model calculated a high probability that &#8220;Francisco&#8221; would follow.</p>
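<p>The whole idea fits in a few lines. A toy bigram (N = 2) model over an invented two-sentence corpus:</p><pre><code class="language-python">from collections import Counter, defaultdict

corpus = "san francisco is in california . san diego is in california .".split()

follows = defaultdict(Counter)            # count(prev -> next)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(prev: str):
    """P(next | prev) estimated purely from co-occurrence frequency."""
    nxt, count = follows[prev].most_common(1)[0]
    return nxt, count / sum(follows[prev].values())

print(predict("san"))   # ('francisco', 0.5): frequency, not understanding
</code></pre><p>That 0.5 is the model&#8217;s entire &#8220;knowledge&#8221;: it has seen &#8220;san francisco&#8221; and &#8220;san diego&#8221; equally often and has no way to use anything earlier than the previous word.</p>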
<p>While revolutionary, these models were &#8220;shallow.&#8221; They had no concept of context beyond the immediate few words (the &#8220;N&#8221; in N-gram), and they lacked any internal representation of meaning.</p><h2>2. The Neural Awakening: RNNs and the Context Problem (2000s &#8211; 2014)</h2><p>The introduction of Neural Networks changed the game. Instead of counting word frequencies, researchers began using <strong>Word Embeddings</strong> (like Word2Vec and GloVe). These represented words as high-dimensional vectors, where words with similar meanings (e.g., &#8220;King&#8221; and &#8220;Queen&#8221;) were mathematically close to one another.</p><h3>The Rise of the RNN</h3><p>To handle the sequential nature of language, researchers turned to <strong>Recurrent Neural Networks (RNNs)</strong>. Unlike static networks, RNNs had a &#8220;memory&#8221; loop, allowing information from previous words to persist.</p><p>However, RNNs suffered from a fatal flaw: <strong>The Vanishing Gradient Problem</strong>. As a sentence grew longer, the model &#8220;forgot&#8221; the beginning. To solve this, Sepp Hochreiter and J&#252;rgen Schmidhuber introduced <strong>Long Short-Term Memory (LSTM)</strong> networks. LSTMs used &#8220;gates&#8221; to decide what information to keep and what to discard, allowing for much longer context windows.</p><h2>3. The 2017 Inflection Point: &#8220;Attention is All You Need&#8221;</h2><p>In 2017, a team at Google Brain published a paper that would change the trajectory of AI forever: <em>&#8220;Attention is All You Need.&#8221;</em> This paper introduced the <strong>Transformer</strong> architecture.</p><p>The Transformer abandoned the sequential processing of RNNs entirely. Instead, it used a mechanism called <strong>Self-Attention</strong>. This allowed the model to look at every word in a sentence simultaneously and weigh their importance relative to one another, regardless of how far apart they were.</p><p>The mathematical core of this mechanism is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Attention(Q, K, V) = softmax\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V&quot;,&quot;id&quot;:&quot;WKWWWQRGQL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <strong>Q</strong> (Query), <strong>K</strong> (Key), and <strong>V</strong> (Value) are vector representations of the input. This allowed for massive parallelization during training, enabling researchers to train models on datasets orders of magnitude larger than before.</p>
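<p>The formula translates almost symbol-for-symbol into NumPy. A single-head sketch (no masking, no learned projections) that shows all pairwise interactions happening in one matrix product, which is exactly what makes the architecture parallelizable:</p><pre><code class="language-python">import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))                # 3 tokens, d_k = 4
print(attention(Q, K, V).shape)                    # (3, 4)
</code></pre>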
<h2>4. The Era of Pre-training: BERT vs. GPT (2018 &#8211; 2019)</h2><p>Following the Transformer breakthrough, two distinct paths emerged:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kslL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29b6a18a-d26f-49ba-a142-5ad96ce5b328_1430x200.png" width="1430" height="200" alt=""></figure></div><p><strong>BERT</strong> 
(Bidirectional Encoder Representations from Transformers) focused on &#8220;filling in the blanks.&#8221; By masking words in a sentence and forcing the model to guess them, it became incredibly good at understanding nuance.</p><p><strong>GPT</strong> (Generative Pre-trained Transformer), conversely, focused on <strong>Autoregressive</strong> generation. It was trained on the simple task of predicting the next token in a sequence. This simplicity turned out to be its greatest strength.</p><h2>5. Scaling Laws and the GPT-3 Moment (2020 &#8211; 2022)</h2><p>In 2020, OpenAI released <strong>GPT-3</strong>. With 175 billion parameters, it was 100 times larger than its predecessor, GPT-2.</p><p>This era validated the <strong>Scaling Laws</strong>: the observation that as you increase compute, data, and parameter count, the model&#8217;s performance improves predictably. GPT-3 demonstrated &#8220;Few-Shot Learning&#8221;&#8212;the ability to perform tasks it wasn&#8217;t explicitly trained for (like translation or coding) just by seeing a few examples in the prompt.</p><h3>The &#8220;Assistant&#8221; Pivot: InstructGPT and RLHF</h3><p>While GPT-3 was powerful, it was often &#8220;unaligned.&#8221; It would hallucinate, be rude, or follow instructions poorly because it was only trained to <em>imitate</em> the internet, not to <em>help</em> a user.</p><p>OpenAI solved this using <strong>Reinforcement Learning from Human Feedback (RLHF)</strong>. By having humans rank different model outputs, they trained a second &#8220;reward model&#8221; to teach the LLM how to be a helpful assistant. This led to <strong>InstructGPT</strong>, the direct ancestor of ChatGPT.</p><h2>6. The 2023 Explosion: ChatGPT and GPT-4</h2><p>On November 30, 2022, OpenAI released <strong>ChatGPT</strong>. It wasn&#8217;t a new model&#8212;it was a fine-tuned version of GPT-3.5 optimized for dialogue&#8212;but the interface changed everything. For the first time, the general public could interact with a high-level LLM through a simple chat box.</p><p>Months later, <strong>GPT-4</strong> arrived. It moved beyond simple text-to-text, introducing multimodal capabilities and a massive leap in reasoning, scoring in the 90th percentile on the Uniform Bar Exam.</p><h2>7. The Open Source Counter-Current (2023 &#8211; 2024)</h2><p>While OpenAI and Google (with <strong>Bard</strong>, later <strong>Gemini</strong>) kept their weights proprietary, Meta (Facebook) took a different approach. The release of <strong>LLaMA</strong> (Large Language Model Meta AI) sparked an open-source revolution.</p><p>Because LLaMA&#8217;s weights were smaller and more efficient, developers realized they could run powerful AI on consumer-grade hardware. This led to an explosion of &#8220;small&#8221; models like <strong>Mistral</strong>, <strong>Falcon</strong>, and <strong>Vicuna</strong>, proving that efficiency and fine-tuning could sometimes rival raw scale.</p><h2>8. The Current Frontier: Multimodality and Agents (2025 &#8211; 2026)</h2><p>Today, the &#8220;History of LLMs&#8221; is evolving into the history of <strong>Large Multimodal Models (LMMs)</strong>. 
We have moved from models that just &#8220;read&#8221; to models that can &#8220;see,&#8221; &#8220;hear,&#8221; and &#8220;do.&#8221;</p><h3>Key Trends of the Present:</h3><ul><li><p><strong>Infinite Context:</strong> Models like Gemini 1.5 Pro now support context windows of up to 2 million tokens, allowing them to process entire codebases or hours of video in one go.</p></li><li><p><strong>Agentic Workflows:</strong> We are moving from &#8220;Chatbots&#8221; to &#8220;Agents&#8221;&#8212;systems that can use tools, browse the web, and execute multi-step plans autonomously.</p></li><li><p><strong>The Rise of SLMs:</strong> Small Language Models (under 10B parameters) are becoming the standard for edge computing and mobile devices.</p></li></ul><h2>Conclusion</h2><p>The history of LLMs serves as a testament to what computer scientist Rich Sutton called <strong>&#8220;The Bitter Lesson&#8221;</strong>: the realization that leveraging massive amounts of computation and general-purpose learning algorithms consistently outperforms human-designed &#8220;clever&#8221; features.</p><p>We have moved from trying to teach machines the rules of our world to giving them the scale to discover those rules for themselves. The next chapter likely won&#8217;t just be about &#8220;more data,&#8221; but about <strong>Reasoning</strong> and <strong>World Models</strong>&#8212;the transition from predicting the next word to understanding the physics and logic of the reality behind those words.</p>
]]></content:encoded></item><item><title><![CDATA[Matryoshka Representation Learning (MRL)]]></title><description><![CDATA[Solving the fixed-dimension bottleneck dilemma]]></description><link>https://limitedintelligence.substack.com/p/matryoshka-representation-learning</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/matryoshka-representation-learning</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 02 Apr 2026 13:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KkVH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13677af7-9ff4-4409-8793-1511f546afbb_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KkVH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13677af7-9ff4-4409-8793-1511f546afbb_1376x768.jpeg" alt="Matryoshka embeddings: How to make vector search 5x faster | by St&#233;phane Derosiaux | Data Science Collective | Medium"></figure></div>
<p>As of early 2026, the landscape of Retrieval-Augmented Generation (RAG) and semantic search has shifted from &#8220;bigger is better&#8221; to &#8220;flexible is faster.&#8221; At the heart of this shift lies <strong>Matryoshka Representation Learning (MRL)</strong>, an elegant training technique that has effectively solved the &#8220;fixed-dimension bottleneck&#8221; that plagued vector databases for years.</p><p>If you&#8217;ve ever felt the pain of choosing between a 1536-dimension vector (high accuracy, high cost) and a 128-dimension vector (fast, but &#8220;dumb&#8221;), MRL is your new best friend. Here is a deep dive into the world of Matryoshka Embeddings.</p><h2>1. The Fixed-Dimension Paradox</h2><p>Before 2024, embedding models were rigid.
If you trained a model to output 768 dimensions, you were stuck with 768 dimensions.</p><p>This created a massive engineering headache:</p><ul><li><p><strong>Storage Bloat:</strong> 10 million documents at 3072 dimensions (like OpenAI&#8217;s <code>text-embedding-3-large</code>) require roughly 120GB of RAM just for the vectors (at float32 precision).</p></li><li><p><strong>Latency:</strong> Calculating cosine similarity on high-dimensional vectors is computationally expensive, leading to slower query times.</p></li><li><p><strong>The &#8220;Re-indexing&#8221; Nightmare:</strong> If you decided midway through a project that your vectors were too big, you had to re-embed your entire dataset&#8212;a process that could cost thousands of dollars and days of compute time.</p></li></ul><p>We needed a way to make embeddings &#8220;elastic.&#8221; We needed a vector that could be a heavyweight champion when accuracy mattered, but a lightweight sprinter when speed was the priority.</p><h2>2. What is Matryoshka Representation Learning?</h2><p>The name comes from the <strong>Matryoshka</strong>, or Russian nesting doll. In a Matryoshka set, you have a large doll that contains a smaller, perfectly formed doll inside, which contains an even smaller one, and so on.</p><p>In the context of machine learning, <strong>Matryoshka Representation Learning (MRL)</strong> is a training paradigm where a single embedding is structured such that its most critical semantic information is &#8220;front-loaded&#8221; into the first few dimensions.</p><p>Instead of the information being spread randomly across 1024 dimensions, MRL forces the model to ensure that:</p><ul><li><p>The first <strong>64</strong> dimensions are a valid, useful embedding.</p></li><li><p>The first <strong>128</strong> dimensions are even better.</p></li><li><p>The first <strong>256</strong> dimensions capture most of the nuance.</p></li><li><p>The full <strong>1024</strong> dimensions provide the ultimate &#8220;high-definition&#8221; detail.</p></li></ul><p>This means you can <strong>truncate</strong> a 1024-dimensional vector at the 128th index and still have a functional embedding that outperforms older, fixed-size models.</p><h2>3. The Technical Engine: How MRL Works</h2><p>The &#8220;magic&#8221; isn&#8217;t in the model architecture (which is usually a standard Transformer), but in the <strong>Loss Function</strong>.</p><p>In standard embedding training, we calculate a single loss based on the final vector. In MRL, we calculate a <strong>Multi-Scale Loss</strong>. We take the full vector, slice it at various pre-defined &#8220;Matryoshka points,&#8221; and calculate the loss for <em>each</em> slice.</p><h3>The Mathematics of Nesting</h3><p>Let <em>x</em> be our input.
The model <em>F</em> produces a high-dimensional vector:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z = F(x) \\in \\mathbb{R}^d&quot;,&quot;id&quot;:&quot;XCWCJYMNJV&quot;}" data-component-name="LatexBlockToDOM"></div><p>We define a set of dimensions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;M = \\{d_1, d_2, ..., d_k\\}&quot;,&quot;id&quot;:&quot;AXRMZDXHNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;d_i \\le d&quot;,&quot;id&quot;:&quot;URNUZAINOI&quot;}" data-component-name="LatexBlockToDOM"></div><p>The total loss is the weighted sum of losses at each dimensionality:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{total} = \\sum_{m \\in M} c_m \\cdot \\mathcal{L}(z_{1:m})&quot;,&quot;id&quot;:&quot;YJAYSEGIGE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><ul><li><p><em>z_{1:m}</em> is the prefix of the vector up to dimension <em>m</em>.</p></li><li><p><em>L</em> is a standard contrastive loss (like InfoNCE).</p></li><li><p><em>c_m</em> is a weighting coefficient (often set to 1 for equal importance).</p></li></ul><p>By optimizing for all these dimensions simultaneously, the backpropagation process forces the model to pack the &#8220;essence&#8221; of the data into the earliest dimensions. If the model fails to capture the core meaning in the first 64 dimensions, the corresponding loss term will be high, and the model will be penalized:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}(z_{1:64})&quot;,&quot;id&quot;:&quot;GOWVBEHLXS&quot;}" data-component-name="LatexBlockToDOM"></div>
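<p>To make the multi-scale loss concrete, here is a minimal PyTorch sketch of the idea. It is an illustration of the formula above, not the original paper&#8217;s code; the helper names and the choice of nesting points are assumptions for the example.</p><pre><code>import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [64, 128, 256, 512, 1024]  # the nesting points M (illustrative)

def info_nce(q, p, temperature=0.05):
    # Standard in-batch InfoNCE: each query's positive is the matching row of p.
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

def matryoshka_loss(z_q, z_p, dims=MATRYOSHKA_DIMS, weights=None):
    # L_total = sum over m in M of c_m * L(z[:, :m]): one loss per nested prefix.
    weights = weights or [1.0] * len(dims)              # c_m = 1: equal importance
    return sum(c * info_nce(z_q[:, :m], z_p[:, :m])
               for c, m in zip(weights, dims))
</code></pre><p>Backpropagating through every prefix simultaneously is what pushes the most discriminative information into the earliest dimensions.</p>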
<h2>4. Why 2026 is the Year of MRL</h2><p>While the original MRL paper was published by researchers at the University of Washington and Google in 2022, it didn&#8217;t become an industry standard until late 2024 and throughout 2025.</p><p>Today, in 2026, nearly every major embedding provider supports MRL natively:</p><ul><li><p><strong>OpenAI:</strong> Their <code>text-embedding-3-large</code> (3072 dimensions) can be truncated to 256 dimensions while still outperforming the legendary <code>text-embedding-ada-002</code>.</p></li><li><p><strong>Google Gemini:</strong> The <code>Gemini Embedding 2</code> model uses MRL to allow seamless transitions between 768 and 3072 dimensions.</p></li><li><p><strong>Voyage AI &amp; Jina:</strong> Models like <code>Voyage MM-3.5</code> and <code>Jina v4</code> have pushed MRL into the multimodal space, allowing you to truncate image and text vectors with less than 1% loss in accuracy.</p></li></ul><h3>2026 Benchmarks: The &#8220;98% Rule&#8221;</h3><p>Recent benchmarks on the <strong>MTEB (Massive Text Embedding Benchmark)</strong> show a consistent pattern: MRL-trained models typically retain <strong>98% of their performance</strong> even when truncated to <strong>8-10% of their original size</strong>.</p>
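<p>In day-to-day use, consuming an MRL embedding is just slicing and re-normalizing. A minimal NumPy sketch, with a random vector standing in for a real 3072-dimension MRL embedding:</p><pre><code>import numpy as np

def truncate_embedding(v, m):
    # Keep the first m "Matryoshka" dimensions, then re-normalize so
    # cosine similarity stays meaningful at the reduced size.
    head = np.asarray(v, dtype=float)[:m]
    return head / np.linalg.norm(head)

full = np.random.randn(3072)            # stand-in for a full MRL embedding
coarse = truncate_embedding(full, 256)  # 12x smaller, same front-loaded semantics
</code></pre>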
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42526,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/192509513?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hu5O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 424w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 848w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 1272w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>5. Engineering the Two-Stage &#8220;Coarse-to-Fine&#8221; Retrieval</h2><p>The most powerful application of MRL is the <strong>Two-Stage Retrieval pipeline</strong>. 
This pattern allows you to have your cake (speed) and eat it too (accuracy).</p><h3>Stage 1: The &#8220;Coarse&#8221; Shortlist</h3><p>You store only the first <strong>128 dimensions</strong> of your embeddings in a fast vector index (like in-memory HNSW or SSD-based DiskANN). Because the vectors are tiny, you can search through millions of documents in microseconds. This returns a &#8220;shortlist&#8221; of, say, 1,000 candidates.</p><h3>Stage 2: The &#8220;Fine&#8221; Rerank</h3><p>You then fetch the <strong>full 3072 dimensions</strong> for only those 1,000 candidates (stored in cheaper SSD storage). You perform a final similarity check using the full vectors to pick the top 10.</p><p><strong>The result?</strong> You get the accuracy of a massive model with the infrastructure cost of a tiny one. In production environments, this has been shown to reduce vector search latency by up to <strong>80%</strong>.</p>
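<p>Here is a minimal sketch of the pattern, using brute-force NumPy scoring as a stand-in for a real index (in production, Stage 1 would run on something like HNSW). The array names and sizes are illustrative assumptions, and both matrices are assumed to be row-normalized so dot products are cosine similarities:</p><pre><code>import numpy as np

def two_stage_search(query, index_128, store_full, shortlist=1000, k=10):
    # Stage 1 (coarse): score everything using the cheap 128-dim prefixes.
    q128 = query[:128] / np.linalg.norm(query[:128])
    coarse_scores = index_128 @ q128
    candidates = np.argsort(coarse_scores)[::-1][:shortlist]

    # Stage 2 (fine): rerank only the shortlist with the full vectors.
    q_full = query / np.linalg.norm(query)
    fine_scores = store_full[candidates] @ q_full
    return candidates[np.argsort(fine_scores)[::-1][:k]]
</code></pre>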
<h2>6. Advanced Trends: SMRL and Adaptive Selection</h2><p>As we&#8217;ve moved into 2025-2026, researchers have introduced <strong>Sequential Matryoshka Representation Learning (SMRL)</strong> and <strong>SMEC (Sequential Matryoshka Embedding Compression)</strong>.</p><p>These new methods solve a subtle issue with the original MRL: <strong>gradient variance</strong>. When you train with 10 different loss functions at once, the gradients can get &#8220;noisy.&#8221; SMRL uses a sequential training approach that stabilizes the learning process, allowing for even better performance at extremely low dimensions (like 32 or 64).</p><p>Additionally, <strong>Adaptive Dimension Selection (ADS)</strong> modules now allow systems to dynamically choose the embedding size based on the &#8220;difficulty&#8221; of the query. Simple queries (e.g., &#8220;What is a cat?&#8221;) use 128 dimensions, while complex, nuanced queries (e.g., &#8220;Legal precedents for intellectual property in synthetic biology&#8221;) automatically trigger a full-dimensional search.</p><h2>7. Conclusion</h2><p>Matryoshka Embeddings represent a fundamental shift in how we think about data representations. We are moving away from &#8220;one-size-fits-all&#8221; vectors toward <strong>liquid representations</strong> that adapt to our hardware, our budget, and our latency requirements.</p><p>In 2026, if you aren&#8217;t using MRL in your RAG pipeline, you&#8217;re likely overpaying for your database and overcharging your users in latency.</p>]]></content:encoded></item><item><title><![CDATA[Attention Is All You Need]]></title><description><![CDATA[Understanding Attention Heads in Transformers]]></description><link>https://limitedintelligence.substack.com/p/attention-is-all-you-need</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/attention-is-all-you-need</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 01 Apr 2026 13:01:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TuEZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png" alt="Why multi-head self attention works: math, intuitions and 10+1 hidden insights | AI Summer"></figure></div>
title="Why multi-head self attention works: math, intuitions and 10+1 hidden  insights | AI Summer" srcset="https://substackcdn.com/image/fetch/$s_!TuEZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 424w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 848w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 1272w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the pre-2017 era of Natural Language Processing, we treated sequences like a single-file line. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks processed words one by one, desperately trying to remember what happened at the beginning of the sentence by the time they reached the end.</p><p>Then came &#8220;Attention Is All You Need.&#8221; The Transformer architecture threw out the sequential bottleneck and replaced it with <strong>Multi-Head Attention (MHA)</strong>. If the Transformer is the engine of modern AI, attention heads are the cylinders that allow it to fire on all levels simultaneously.</p><h2>1. The Core Philosophy: Why &#8220;Multi-Head&#8221;?</h2><p>To understand multi-head attention, we first have to understand the limitation of <strong>Scaled Dot-Product Attention</strong>.</p><p>In a single-head system, the model calculates a weighted average of all words in a sequence to represent a specific word. While powerful, a single head is forced to make a choice. 
If I say, <em>&#8220;The bank was closed because the river overflowed,&#8221;</em> the word &#8220;bank&#8221; has two distinct relationships:</p><ol><li><p><strong>Syntactic:</strong> It is the subject of &#8220;was closed.&#8221;</p></li><li><p><strong>Semantic/Contextual:</strong> It relates to &#8220;river&#8221; (indicating a geographic bank, not a financial one).</p></li></ol><p>A single attention head might struggle to focus on both the grammatical structure and the nuanced context at the same time. <strong>Multi-head attention</strong> solves this by allowing the model to jointly attend to information from different representation subspaces at different positions.</p><blockquote><p><strong>The Analogy:</strong> Imagine a crime scene. One detective (Head 1) looks only at footprints. Another (Head 2) looks at DNA. A third (Head 3) looks at witness statements. By combining their reports, you get a 3D view of the truth that no single detective could capture.</p></blockquote><h2>2. The Mechanics: Behind the Math</h2><p>Every attention head operates on three learned linear projections: <strong>Queries (Q)</strong>, <strong>Keys (K)</strong>, and <strong>Values (V)</strong>.</p><h3>The Single-Head Calculation</h3><p>For a single head, the attention mechanism is defined by the following formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V&quot;,&quot;id&quot;:&quot;CLCTXDZUYS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><ul><li><p><em>Q</em>: What I&#8217;m looking for.</p></li><li><p><em>K</em>: What I have to offer.</p></li><li><p><em>V</em>: The information I actually provide.</p></li><li><p><em>sqrt(d_k)</em>: A scaling factor to prevent the dot products from growing too large, which would push the softmax into regions with tiny gradients (the &#8220;vanishing gradient&#8221; problem).</p></li></ul><h3>Moving to Multi-Head</h3><p>In a Multi-Head setup, we don&#8217;t just do this once. We split the model&#8217;s embedding dimension (e.g., 512 in the original Transformer) into <em>h</em> different heads. If <em>h=8</em>, each head works in a 64-dimensional space.</p><p>The process looks like this:</p><ol><li><p><strong>Project:</strong> Linearly project <em>Q, K, V</em> into <em>h</em> subspaces.</p></li><li><p><strong>Attend:</strong> Perform Scaled Dot-Product Attention for each head independently.</p></li><li><p><strong>Concatenate:</strong> Stitch the results of all <em>h</em> heads back together.</p></li><li><p><strong>Final Project:</strong> Pass the concatenated vector through a final weight matrix (<em>W^O</em>) to ensure the heads share their &#8220;findings.&#8221;</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MultiHead}(Q, K, V) = \\text{Concat}(\\text{head}_1, \\dots, \\text{head}_h)W^O&quot;,&quot;id&quot;:&quot;ACGTXURXDS&quot;}" data-component-name="LatexBlockToDOM"></div>
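<p>The four steps above map almost line-for-line onto code. Here is a minimal PyTorch sketch using the original paper&#8217;s sizes (512-dimensional model, 8 heads); it is a bare-bones illustration that omits masking and dropout:</p><pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads  # 8 heads x 64 dims
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # the final projection W^O

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # 1. Project, then split d_model into n_heads subspaces of size d_k.
        def heads(w):
            return w(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = heads(self.w_q), heads(self.w_k), heads(self.w_v)
        # 2. Scaled dot-product attention, independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # (B, h, T, T)
        out = F.softmax(scores, dim=-1) @ v                   # (B, h, T, d_k)
        # 3. Concatenate the heads, then 4. apply the final projection.
        return self.w_o(out.transpose(1, 2).reshape(B, T, -1))
</code></pre>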
<h2>3. What Are They Actually Doing? (Interpretability)</h2><p>Researchers have spent years &#8220;peeking&#8221; into these heads to see if they&#8217;ve actually learned anything useful. As it turns out, attention heads often specialize in specific linguistic tasks:</p><ul><li><p><strong>Syntactic Heads:</strong> Some heads focus almost exclusively on the relationship between a verb and its direct object.</p></li><li><p><strong>Positional Heads:</strong> Some heads always look at the previous word or the very next word, acting as a sort of local &#8220;sliding window.&#8221;</p></li><li><p><strong>Entity Heads:</strong> In larger models like GPT-4, certain heads specialize in tracking names, dates, or specific entities throughout a long document.</p></li><li><p><strong>Delimiting Heads:</strong> Some heads focus on periods, commas, or the <code>[SEP]</code> tokens, helping the model understand where ideas end.</p></li></ul><h3>The &#8220;Emergent&#8221; Nature</h3><p>The beauty of attention heads is that we don&#8217;t <em>tell</em> them to look for grammar or entities. They discover these patterns because they are the most efficient way to reduce loss during training. It is an emergent property of the architecture.</p><h2>4. Efficiency and The &#8220;Over-Parameterization&#8221; Problem</h2><p>One of the most surprising findings in modern AI research is that <strong>we might not need all these heads.</strong></p><p>In a famous paper titled <em>&#8220;Are Sixteen Heads Better than One?&#8221;</em>, researchers found that you could prune (remove) a significant percentage of attention heads during inference without a major drop in performance. In some cases, a model with 12 heads could be pruned down to 1 or 2 heads in certain layers with negligible impact.</p><h3>Why does this happen?</h3><ul><li><p><strong>Redundancy:</strong> Many heads end up learning the same thing.</p></li><li><p><strong>Specialization vs. Generalization:</strong> Some layers require many heads to parse complex logic, while other layers (often the earlier ones) only need a few to handle basic patterns.</p></li></ul><p>This has led to the rise of <strong>Structured Pruning</strong> and <strong>Head Importance Scoring</strong>, where we identify &#8220;dead&#8221; heads and cut them to make models faster and lighter.</p><h2>5. The Evolution: MQA, GQA, and FlashAttention</h2><p>As we&#8217;ve moved toward Large Language Models (LLMs) with massive context windows (like Gemini&#8217;s 1M+ tokens), standard Multi-Head Attention became a memory bottleneck. This led to three major innovations:</p><h3>Multi-Query Attention (MQA)</h3><p>Instead of every head having its own K and V, all heads share a <strong>single</strong> Key and Value. This drastically reduces the memory footprint during decoding, though it can slightly hurt model &#8220;expressiveness.&#8221;</p><h3>Grouped-Query Attention (GQA)</h3><p>A middle ground used by models like Llama 3 (sketched after this section). Heads are grouped, and each group shares a K and V. This balances the speed of MQA with the quality of MHA.</p><h3>FlashAttention</h3><p>This isn&#8217;t a change in the math, but a change in how the hardware (GPU) handles it. By being &#8220;IO-aware,&#8221; FlashAttention computes the attention matrix in blocks, avoiding the need to write the massive <em>N x N</em> attention matrix to the GPU&#8217;s slower main memory.</p>
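<p>To see what GQA actually changes, here is a toy PyTorch sketch of the key/value sharing (the sizes are illustrative assumptions): eight query heads share two K/V heads, so the KV cache is four times smaller, and each K/V head is simply broadcast across its group of queries.</p><pre><code>import torch
import torch.nn.functional as F

B, T, n_q, n_kv, d_k = 1, 16, 8, 2, 64    # 8 query heads, 2 shared K/V heads
q = torch.randn(B, n_q, T, d_k)
k = torch.randn(B, n_kv, T, d_k)           # KV cache holds 2 heads, not 8
v = torch.randn(B, n_kv, T, d_k)

group = n_q // n_kv                        # 4 query heads per K/V head
k = k.repeat_interleave(group, dim=1)      # broadcast to (B, n_q, T, d_k)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5
out = F.softmax(scores, dim=-1) @ v        # from here on, identical to MHA
</code></pre>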
<h2>6. Conclusion</h2><p>Attention heads are the reason we can have conversations with AI that feel coherent and contextually aware. They allowed us to move past the &#8220;foggy memory&#8221; of RNNs and into an era where every word in a 500-page book can be simultaneously compared to every other word.</p><p>However, the future likely involves <strong>dynamic attention</strong>. Instead of a fixed number of heads, we may see models that activate only the &#8220;experts&#8221; (heads) needed for a specific prompt&#8212;saving trillions of calculations and making AI more efficient than ever.</p>
]]></content:encoded></item><item><title><![CDATA[From Raw Scores to Reason]]></title><description><![CDATA[Understanding Softmax, Logits, and the Probabilistic Heart of LLMs]]></description><link>https://limitedintelligence.substack.com/p/from-raw-scores-to-reason</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/from-raw-scores-to-reason</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 31 Mar 2026 13:03:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aWRh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img
src="https://substackcdn.com/image/fetch/$s_!aWRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Logit and Probability&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Logit and Probability" title="Logit and Probability" srcset="https://substackcdn.com/image/fetch/$s_!aWRh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the current era of generative artificial intelligence, we often speak of Large Language Models (LLMs) as if they possess &#8220;intent&#8221; or &#8220;understanding.&#8221; We describe their ability to write code, compose poetry, or solve complex 
logical puzzles. However, beneath the layer of conversational fluidity lies a deterministic mathematical pipeline. At the very end of this pipeline&#8212;after the billions of parameters have been traversed and the multi-head attention mechanisms have fired&#8212;sits a critical, often overlooked duo: <strong>Logits</strong> and the <strong>Softmax function</strong>.</p><p>For engineers, founders, and strategists, understanding these two components is not merely an academic exercise in calculus. It is the key to mastering model behavior, controlling &#8220;hallucinations,&#8221; and optimizing the bridge between raw compute and human-readable reasoning.</p><h2>I. The Final Frontier: What are Logits?</h2><p>Before a model can tell you that the next word in the sentence &#8220;The capital of France is...&#8221; is &#8220;Paris,&#8221; it produces a set of raw, unnormalized scores. These scores are known as <strong>Logits</strong>.</p><h3>The Mathematical Origin</h3><p>In the context of deep learning, the final linear layer of a transformer model outputs a vector. If our model has a vocabulary of 50,000 tokens, this vector contains 50,000 distinct numbers. These numbers are logits.</p><p>Mathematically, the logit function is the inverse of the sigmoid (&#8220;logistic&#8221;) function. In the journey of a token through a neural network, the logits represent the model&#8217;s &#8220;raw conviction&#8221; before they are constrained to a probability distribution. Unlike probabilities, logits can be any real number: positive, negative, or zero.</p><ul><li><p><strong>A high positive logit</strong> suggests a high degree of confidence that the corresponding token is the correct next step.</p></li><li><p><strong>A negative logit</strong> suggests the model &#8220;thinks&#8221; that token is highly unlikely.</p></li></ul><h3>Why Logits Matter for Developers</h3><p>Logits are the &#8220;raw data&#8221; of model intent. When you access a model via an API (like OpenAI or Anthropic), you often only see the final text. However, &#8220;Logprobs&#8221; (logarithmic probabilities derived from logits) are frequently available. By analyzing logits, developers can:</p><ol><li><p><strong>Measure Uncertainty:</strong> If the top two logits are nearly identical, the model is &#8220;confused.&#8221;</p></li><li><p><strong>Calibrate Output:</strong> You can manually &#8220;bias&#8221; logits to force the model to avoid certain words or favor others (Logit Bias).</p></li></ol><h2>II. The Softmax Transformation: Creating Order from Chaos</h2><p>Raw logits are difficult for a system to use for decision-making because they lack a fixed scale. How do we compare a logit of 12.5 to a logit of -3.2 in a way that represents a percentage chance?
This is where the <strong>Softmax function</strong> enters.</p><h3>The Formula</h3><p>The Softmax function takes a vector of $K$ real numbers and transforms them into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers.</p><p>For an input vector $\mathbf{z}$, the Softmax function $\sigma(\mathbf{z})$ is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(\\mathbf{z})_i = \\frac{e^{z_i}}{\\sum_{j=1}^K e^{z_j}}&quot;,&quot;id&quot;:&quot;NLOLCNVXAQ&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Why Exponentials?</h3><p>The use of the natural exponential $e$ serves two vital purposes:</p><ol><li><p><strong>Positivity:</strong> It ensures that every output is a positive number (since $e^x$ is always positive).</p></li><li><p><strong>Magnification:</strong> It acts as a &#8220;winner-takes-all&#8221; mechanism. Small differences in raw logits are magnified into large differences in probability. If one logit is slightly higher than the rest, Softmax ensures it receives the lion&#8217;s share of the probability mass.</p></li></ol><h3>The Summation Property</h3><p>Crucially, the denominator $\sum e^{z_j}$ ensures that all the output values sum exactly to <strong>1.0 (100%)</strong>. This turns the raw scores into a valid probability distribution, allowing the model to &#8220;rank&#8221; the entire vocabulary.</p><h2>III. The Lever of Creativity: Temperature Scaling</h2><p>If Softmax is the engine, <strong>Temperature ($T$)</strong> is the throttle. In the deployment of LLMs, temperature is the most common hyperparameter used to control the &#8220;creativity&#8221; or &#8220;randomness&#8221; of the output.</p><p>Temperature is an adjustment made to the logits <em>immediately before</em> they are passed into the Softmax function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(\\mathbf{z})_i = \\frac{e^{z_i / T}}{\\sum_{j=1}^K e^{z_j / T}}&quot;,&quot;id&quot;:&quot;PDJTEVGVWY&quot;}" data-component-name="LatexBlockToDOM"></div><h3>The Effects of Temperature</h3><p>The value of $T$ dictates how the probability mass is distributed among the tokens:</p><ul><li><p><strong>Low Temperature ($T &lt; 1$):</strong> The model becomes more confident and deterministic. By dividing the logits by a small number, the gap between the highest logit and the others is stretched. The &#8220;winner&#8221; gets even more probability, often approaching 100%. This is ideal for factual tasks, coding, or data extraction.</p></li><li><p><strong>High Temperature ($T &gt; 1$):</strong> The model becomes &#8220;creative&#8221; or &#8220;diverse.&#8221; Dividing by a large number flattens the distribution, making the &#8220;gap&#8221; between the top choice and the &#8220;long tail&#8221; of other words much smaller. This allows the model to occasionally pick less likely words, leading to more varied prose.</p></li><li><p><strong>$T = 0$ (Argmax):</strong> This is technically a mathematical limit. The model will always choose the token with the absolute highest logit. 
It becomes completely deterministic.</p></li></ul>
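<p>The effect of $T$ is easy to verify numerically. Here is a minimal NumPy sketch of temperature-scaled Softmax (the max-subtraction is a standard numerical-stability trick and does not change the output):</p><pre><code>import numpy as np

def softmax(logits, temperature=1.0):
    # Divide the logits by T, then normalize the exponentials.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                  # stability: keeps exp() from overflowing
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 3.0, 1.0]
print(softmax(logits, 1.0))   # ~[0.87 0.12 0.02]: a clear winner
print(softmax(logits, 0.5))   # ~[0.98 0.02 0.00]: low T, near-deterministic
print(softmax(logits, 2.0))   # ~[0.67 0.24 0.09]: high T, flatter and more "creative"
</code></pre>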
<h2>IV. LLMs and the Token Selection Lifecycle</h2><p>To see how Softmax and Logits function in the wild, let&#8217;s trace the lifecycle of a single token generation in a Transformer.</p><ol><li><p><strong>Input Processing:</strong> The user sends a prompt. It is tokenized and converted into embeddings.</p></li><li><p><strong>The Transformer Block:</strong> The data passes through 96+ layers of attention and feed-forward networks. The model uses its learned weights to calculate the relationship between the input tokens.</p></li><li><p><strong>The Logit Head:</strong> The final hidden state is projected onto the vocabulary space. We now have a vector of <strong>Logits</strong> (raw scores).</p></li><li><p><strong>Softmax Application:</strong> The logits are scaled by <strong>Temperature</strong> and passed through the <strong>Softmax</strong> function. We now have a <strong>Probability Distribution</strong>.</p></li><li><p><strong>Sampling:</strong> The model doesn&#8217;t always just pick the top word. It uses sampling strategies (sketched after this list) like:</p><ul><li><p><strong>Top-P (Nucleus Sampling):</strong> Only consider the smallest set of tokens whose cumulative probability exceeds $P$ (e.g., 0.9).</p></li><li><p><strong>Top-K:</strong> Only consider the top $K$ most likely tokens.</p></li></ul></li><li><p><strong>The Output:</strong> A token is selected, appended to the prompt, and the process repeats (Autoregression).</p></li></ol>
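<p>Putting steps 4 and 5 together, here is a minimal NumPy sketch of one decoding step: logit masking, temperature scaling, Softmax, then Top-P sampling. The function name and the toy logits are illustrative assumptions, not any particular provider&#8217;s API:</p><pre><code>import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, banned=()):
    logits = np.asarray(logits, dtype=float).copy()
    logits[list(banned)] = -np.inf       # logit masking: banned tokens get p = 0
    z = logits / temperature             # temperature scaling
    z -= z[np.isfinite(z)].max()         # numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax
    # Top-P (nucleus): smallest prefix of the sorted tokens with mass >= top_p.
    order = np.argsort(probs)[::-1]
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

print(sample_next_token([4.0, 3.5, 1.0, -2.0], banned=[3]))  # never returns token 3
</code></pre>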
<h2>V. Strategic Implications for AI Implementation</h2><p>For those building at the application layer, the relationship between Logits and Softmax informs several high-stakes business decisions.</p><h3>1. Guardrailing and Safety</h3><p>One of the most effective ways to prevent a model from generating restricted content is <strong>Logit Masking</strong>. By setting the logits of specific forbidden tokens to negative infinity ($-\infty$) before the Softmax layer, you can mathematically guarantee that the model will never select those words, regardless of the prompt.</p><h3>2. Detection of Hallucinations</h3><p>Hallucinations often occur when the Softmax distribution is “flat”—meaning the model has no clear winner among its potential outputs. By monitoring the <strong>Entropy</strong> of the Softmax output, developers can attach a “confidence score” to every response. If the entropy is too high, the system can trigger a search tool or ask the user for clarification.</p><h3>3. Cost and Latency (The Logit Bottleneck)</h3><p>The vocabulary size of an LLM directly determines the size of the final logit vector. As we move toward models with larger vocabularies (to support more languages or specialized jargon), the compute cost of the final linear layer and the Softmax operation grows with it. Optimizing this “Logit Head” is a major focus for edge-AI and mobile-first LLM deployments.</p><h2>VI. Beyond Softmax: The Future of Probabilistic Outputs</h2><p>While Softmax is the industry standard, it is not without flaws. Its exponential nature can lead to <strong>overconfidence</strong>, where a model assigns 99.9% probability to a wrong answer simply because its logit was slightly higher than the second-best option.</p><p>Research into <strong>Sparsemax</strong> (which can assign exactly zero probability to unlikely tokens) and <strong>Calibration</strong> (adjusting logits so that a 90% probability actually corresponds to a 90% accuracy rate) is the next frontier. For founders, staying ahead of these architectural shifts means building more reliable, steerable, and trustworthy AI systems.</p><h2>Conclusion</h2><p>The “magic” of LLMs is often attributed to their size, but their utility is governed by their precision. Logits represent the model’s raw, unvarnished thoughts; Softmax represents the civilized, probabilistic output we use to communicate.</p><p>By mastering the interplay between these two—and the temperature that mediates them—we move away from treating AI as a “black box” and toward treating it as a finely tuned instrument of digital logic. Whether you are optimizing a customer service bot or architecting a new frontier model, the path to performance runs directly through the Softmax layer.</p>]]></content:encoded></item><item><title><![CDATA[The Hidden Cost]]></title><description><![CDATA[Understanding the &#8220;Churn Premium&#8221;]]></description><link>https://limitedintelligence.substack.com/p/the-hidden-cost</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-hidden-cost</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 30 Mar 2026 13:03:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y3yt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528d8f37-5801-4acf-ae60-4e3982df0197_1350x760.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!Y3yt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528d8f37-5801-4acf-ae60-4e3982df0197_1350x760.png" alt="Churn: what it is, how to calculate it, and how to reduce it in practice"></figure>
x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The narrative that AI and the cloud have ushered in a &#8220;golden age&#8221; of effortless software development is one of the most expensive lies in the modern enterprise. While the marketing brochures promise &#8220;heavenly&#8221; developer experiences, the reality on the ground is often a hell of architectural rot, burnout, and a &#8220;churn premium&#8221; that is silently hemorrhaging millions of dollars.</p><p>If you are seeing your best engineers walk out the door, it isn&#8217;t because they&#8217;ve lost their edge or can&#8217;t master the latest LLM-assisted coding tool. They are leaving because they are drowning in systems that make it impossible to do their best work.</p><p>To stop the bleed, leaders must move beyond shiny dashboards and confront the broken systems beneath.</p><p>Most organizations track turnover, but few track the <strong>Churn Premium</strong>. This is the compounded cost of technical debt, lost institutional knowledge, and the massive overhead required to replace a high-performing engineer who understood the &#8220;where the bodies are buried&#8221; in a legacy codebase.</p><p>When a senior developer quits because of burnout, you aren&#8217;t just losing a headcount; you are paying a tax on every future feature. New hires take months to reach the same level of fluency, and if the system they inherit is already broken, they are likely to follow their predecessor out the door within 18 months. This cycle is a death spiral for innovation.</p><h2>Principle 1: Separate Preparation from Implementation</h2><p>One of the primary drivers of developer cognitive overload is the &#8220;rebuilding the bike while riding uphill&#8221; syndrome. Currently, most teams expect developers to clean up messy, legacy codebases while simultaneously shipping new, high-stakes features.</p><p>This dual-track cognitive load is a recipe for failure. Human brains are not wired to perform deep structural refactoring and feature implementation in the same breath.</p><h3>The Strategy: &#8220;Make the Change Easy&#8221;</h3><p>Following the wisdom of engineering veteran Kent Beck: <strong>&#8220;First make the change easy (warning: this might be hard), then make the easy change.&#8221;</strong></p><ul><li><p><strong>Phase 1: Preparation (The Cleanup).</strong> This is a dedicated effort to refactor the environment so the new feature has a clean &#8220;landing zone.&#8221; No new business logic is added here.</p></li><li><p><strong>Phase 2: Implementation (The Feature).</strong> Once the architecture supports the change, the implementation becomes trivial.</p></li></ul><p>By isolating these two activities, you reduce the mental &#8220;context switching&#8221; that leads to bugs and developer frustration.</p><h2>Principle 2: Stop Flying Blind with Data-Driven Empathy</h2><p>You cannot fix burnout if you cannot see the &#8220;heroics&#8221; happening behind the scenes. Many leaders rely on surface-level metrics like Jira velocity, which often mask the reality of a team on the brink of collapse. To truly understand the health of your engineering org, you need to leverage advanced developer analytics&#8212;specifically tools like <strong>DevStats</strong>.</p><h3>Key Reports to Monitor:</h3><ol><li><p><strong>The Activity Heatmap:</strong> This is your early warning system for burnout. Look for consistent patterns of &#8220;out-of-hours&#8221; work. 
<li><p><strong>Planning Accuracy Report:</strong> This identifies the “Yes Men” in your organization—the developers who commit to impossible deadlines out of a sense of duty, only to suffer in silence. If your planning accuracy is consistently low, the problem isn’t the developers; it’s the dates.</p></li></ol><h2>Principle 3: Build the Thinnest Viable Platform (TVP)</h2><p>In an attempt to “help” developers, many organizations build massive internal “Developer Platforms” that end up becoming overengineered nightmares. If your platform requires a 50-page manual just to deploy a microservice, you haven’t built a tool; you’ve built a barrier.</p><p>The goal is to provide a <strong>Thinnest Viable Platform</strong>—a set of self-service tools that provide just enough abstraction to remove friction without removing control.</p>
srcset="https://substackcdn.com/image/fetch/$s_!j9RK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 424w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 848w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 1272w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Principle 4: Weaponize SRE Error Budgets</h2><p>Product owners and stakeholders often have an insatiable appetite for new features, frequently &#8220;bullying&#8221; engineers into shipping code on foundations they know are unstable. To counter this, you must stop the guessing games and implement a <strong>mathematical circuit breaker: The Error Budget.</strong></p><h3>The 99.9% Rule</h3><p>If you set a reliability target of <strong>99.9%</strong>, you are effectively saying the system is allowed to be down or &#8220;broken&#8221; for roughly 43 minutes per month.</p><ul><li><p><strong>If the budget is intact:</strong> The team proceeds with feature development as planned.</p></li><li><p><strong>If the budget is spent:</strong> All feature work stops. Every engineer, designer, and product owner shifts focus to reliability and technical debt until the system is stabilized.</p></li></ul><p>This moves the conversation from &#8220;opinion-based&#8221; to &#8220;data-driven.&#8221; It is no longer the engineer&#8217;s word against the product owner&#8217;s; it is a hard limit defined by the system&#8217;s own health.</p><h2>Principle 5: Drag the Team Out of the &#8220;Anxiety Zone&#8221;</h2><p>When an environment is governed by fear of blame, engineers hide risks. They stop reporting &#8220;minor&#8221; bugs that are actually symptoms of systemic failure. I have seen projects go dark for days because a developer was too intimidated to flag a misconfiguration 48 hours earlier.</p><h3>Cultivating Psychological Safety</h3><p>High-agency teams require the safety to fail. As a leader, you must go first:</p><ul><li><p><strong>Share your own &#8220;screw-ups&#8221; openly.</strong></p></li><li><p><strong>Conduct blameless post-mortems</strong> that focus on the <em>process</em> failure, not the <em>person</em> who pushed the button.</p></li><li><p><strong>Reward risk-flagging.</strong> An engineer who identifies a critical flaw in a proposed feature should be celebrated as much as the one who ships it.</p></li></ul><h2>Principle 6: Translate Tech Debt into Shareholder Destruction</h2><p>One of the biggest mistakes technical leaders make is trying to explain &#8220;refactoring&#8221; or &#8220;technical debt&#8221; to the C-suite using engineering terminology. 
<h2>Principle 5: Drag the Team Out of the “Anxiety Zone”</h2><p>When an environment is governed by fear of blame, engineers hide risks. They stop reporting “minor” bugs that are actually symptoms of systemic failure. I have seen projects go dark for days because a developer was too intimidated to flag a misconfiguration 48 hours earlier.</p><h3>Cultivating Psychological Safety</h3><p>High-agency teams require the safety to fail. As a leader, you must go first:</p><ul><li><p><strong>Share your own “screw-ups” openly.</strong></p></li><li><p><strong>Conduct blameless post-mortems</strong> that focus on the <em>process</em> failure, not the <em>person</em> who pushed the button.</p></li><li><p><strong>Reward risk-flagging.</strong> An engineer who identifies a critical flaw in a proposed feature should be celebrated as much as the one who ships it.</p></li></ul><h2>Principle 6: Translate Tech Debt into Shareholder Destruction</h2><p>One of the biggest mistakes technical leaders make is trying to explain “refactoring” or “technical debt” to the C-suite in engineering terminology. The Board does not care about “clean code.” They care about <strong>risk</strong> and <strong>capital efficiency</strong>.</p><p>To get the resources you need, you must translate technical debt into its true form: <strong>Shareholder Destruction.</strong></p><h3>The 1:100 Rule</h3><p>The math of software defects is brutal:</p><p>$$\text{Cost} = 1\times \text{ (Discovery at Dev)} \rightarrow 100\times \text{ (Discovery in Production)}$$</p><p>A bug costs <strong>$1</strong> to fix during the initial design or development phase. That same bug costs <strong>$100</strong> once it is live in production, factoring in customer support, emergency patches, and reputational damage.</p><h3>The “Knowledge Transfer Tax”</h3><p>Explain that messy code is essentially a “Knowledge Transfer Tax.” If your most expensive senior engineers are spending 15–20 hours a week just “keeping the lights on” or explaining convoluted logic to juniors, that is a direct drain on the company’s R&amp;D budget. You are paying senior wages for janitorial work.</p><h2>Conclusion</h2><p>The future of the tech industry won’t be won by the companies that squeeze the most lines of AI-generated code out of their staff. It will be won by the companies that treat developer time—and more importantly, <strong>developer energy</strong>—as their most precious capital.</p><p>Your job as a leader is not to demand more output. It is to build a system that doesn’t make people hate their jobs. If you keep pushing for “shiny” features while the foundation is rotting, you aren’t building a product; you’re building a Titanic.</p><p>Stop flying blind. Fix the system, empower your people, and stop the churn before it consumes your venture.</p>
]]></content:encoded></item><item><title><![CDATA[The Evaluation Gap]]></title><description><![CDATA[Engineering Rigor for the Age of AI Agents]]></description><link>https://limitedintelligence.substack.com/p/the-evaluation-gap</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-evaluation-gap</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p1dY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!p1dY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png" alt="Evaluating AI Agent Performance with Dynamic Metrics"></figure>
srcset="https://substackcdn.com/image/fetch/$s_!p1dY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the transition from Large Language Models (LLMs) to autonomous AI agents, the industry has hit a significant bottleneck. While building a prototype that can perform a &#8220;cool trick&#8221; takes an afternoon, moving that agent into a production environment where it handles sensitive data, financial transactions, or customer-facing operations is a different beast entirely.</p><p>The primary hurdle isn&#8217;t just the logic of the agent&#8212;it&#8217;s the <strong>evaluation</strong>. For decades, software engineering relied on deterministic unit tests: input $X$ always results in output $Y$. With AI agents, the path from input to output is non-deterministic, multi-step, and often involves tool-use iterations that can fail in a thousand subtle ways.</p><p>The O&#8217;Reilly literature on AI evaluation, particularly the emerging frameworks surrounding agentic workflows, emphasizes a shift from &#8220;vibe-based development&#8221; to a systematic, metrics-driven approach. To build agents that actually work at scale, we must move beyond the chat box and into the laboratory.</p><h2>1. 
1. The Anatomy of an Agentic Failure</h2><p>Before we can evaluate success, we must understand the unique failure modes of agents. Unlike a standard RAG (Retrieval-Augmented Generation) system, which typically has a linear flow (Query → Retrieve → Generate), an agent operates in a loop:</p><ol><li><p><strong>Perception:</strong> Understanding the user’s intent.</p></li><li><p><strong>Planning:</strong> Breaking the intent into sub-tasks.</p></li><li><p><strong>Tool Selection:</strong> Deciding which external API or database to call.</p></li><li><p><strong>Execution:</strong> Calling the tool and parsing its output.</p></li><li><p><strong>Observation:</strong> Deciding if the goal is met or if another loop is needed.</p></li></ol><p>A failure can occur at any stage. An agent might plan correctly but select the wrong tool. It might execute the tool correctly but fail to parse the JSON response. Or, most dangerously, it might enter a “hallucination loop,” where it tries to fix an error with more erroneous actions. Evaluation, therefore, cannot be a glance at the final answer alone; it must be a <strong>trace-level assessment</strong> of the entire trajectory.</p><h2>2. Defining the Metrics: The Four Pillars of Evaluation</h2><p>The O’Reilly framework for evaluation generally categorizes metrics into four distinct buckets. To build a robust system, you need coverage across all of them.</p><h3>I. Correctness (Functional Accuracy)</h3><p>This is the most obvious metric, but the hardest to measure. Did the agent achieve the user’s goal?</p><ul><li><p><strong>Tool Call Accuracy:</strong> Did the agent call the right function with the correct parameters?</p></li><li><p><strong>Final Answer Relevancy:</strong> Is the output semantically aligned with the prompt?</p></li><li><p><strong>Success Rate:</strong> In a multi-turn conversation, what percentage of tasks were completed without human intervention?</p></li></ul><h3>II. Reliability and Consistency</h3><p>Because LLMs are probabilistic, an agent might succeed on Monday and fail on Tuesday with the same prompt.</p><ul><li><p><strong>Pass@k:</strong> If we run the same prompt $k$ times, how often does at least one run succeed?</p></li><li><p><strong>Robustness to Noise:</strong> If we add irrelevant information to the prompt, does the agent still find the correct path?</p></li></ul><h3>III. Safety and Guardrails</h3><p>Agents have “agency,” meaning they can do harm if not constrained.</p><ul><li><p><strong>Prompt Injection Vulnerability:</strong> Can a user trick the agent into bypassing its system instructions?</p></li><li><p><strong>PII Leakage:</strong> Does the agent inadvertently pull sensitive data from a database and show it to the user?</p></li><li><p><strong>Toxicity and Bias:</strong> Does the agent generate harmful content during its reasoning steps?</p></li></ul><h3>IV. Efficiency (Performance and Cost)</h3><p>In a business context, an agent that takes 45 seconds to think and costs $2.00 per query is often unusable.</p><ul><li><p><strong>Tokens Per Task:</strong> How many tokens were consumed in the loops?</p></li><li><p><strong>Latency per Step:</strong> Which specific tool or reasoning step is slowing down the UX?</p></li><li><p><strong>Cost per Success:</strong> The total cost of all API calls divided by the number of successful outcomes.</p></li></ul>
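<p>Several of these metrics are mechanical enough to compute directly from logged eval runs. Below is a small, self-contained sketch; the field names and the loose pass@k definition mirror the bullets above, and this is illustrative rather than any specific framework:</p><pre><code class="language-python">from collections import defaultdict

def summarize_runs(runs, k=3):
    """Each run is a dict: {"task_id": str, "success": bool, "cost_usd": float}."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run["success"])

    # Success Rate: fraction of individual runs completed successfully.
    success_rate = sum(r["success"] for r in runs) / len(runs)

    # Pass@k (loose form): share of tasks with a success in their first k runs.
    pass_at_k = sum(any(tries[:k]) for tries in by_task.values()) / len(by_task)

    # Cost per Success: total spend divided by the number of successful runs.
    wins = sum(r["success"] for r in runs)
    cost_per_success = sum(r["cost_usd"] for r in runs) / max(wins, 1)

    return {"success_rate": success_rate, f"pass@{k}": pass_at_k,
            "cost_per_success": cost_per_success}
</code></pre>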
<h2>3. The “Gold Dataset” Problem</h2><p>You cannot evaluate what you haven’t defined. The cornerstone of AI evaluation is the <strong>Evaluation Dataset</strong> (often called a “Gold Set”). This is a curated list of inputs and their expected “Ground Truth” outputs.</p><p>For agents, a Gold Set is significantly more complex than for a standard classifier. A high-quality agentic dataset should include:</p><ol><li><p><strong>The Prompt:</strong> The initial user request.</p></li><li><p><strong>The Context:</strong> The state of the world (e.g., “The user is logged in,” “The database has 3 records”).</p></li><li><p><strong>The Expected Trajectory:</strong> Not just the final answer, but the specific tools that <em>should</em> be called.</p></li><li><p><strong>Negative Constraints:</strong> Things the agent <em>should not</em> do (e.g., “Do not delete the record”).</p></li></ol><p><strong>Synthetic Data Generation:</strong></p><p>Creating 1,000 manual test cases is grueling. Modern evaluation strategies use “LLM-as-a-Generator” to create synthetic test cases. By prompting a frontier model (like GPT-4o or Claude 3.5 Sonnet) to “imagine 50 ways a user might try to break this specific tool,” you can bootstrap an evaluation suite in minutes.</p><h2>4. LLM-as-a-Judge: Scaling Evaluation</h2><p>How do you grade a 10-step agent trajectory? You can’t do it with a regex or an exact-match check. The solution championed in recent technical literature is the <strong>LLM-as-a-Judge</strong> pattern.</p><p>In this architecture, you use a highly capable model to grade the performance of your smaller, faster production agent. You provide the Judge with a <strong>rubric</strong>.</p><blockquote><p><strong>Example Rubric for a Sales Agent:</strong></p><ul><li><p>Score 1: Agent failed to ask for the user’s email.</p></li><li><p>Score 3: Agent asked for the email but didn’t verify the format.</p></li><li><p>Score 5: Agent collected the email, verified it, and successfully called the <code>UpdateLead</code> tool.</p></li></ul></blockquote><p>While “LLM-as-a-Judge” introduces its own biases, it is remarkably consistent compared to human graders and operates at a fraction of the cost and time. To mitigate bias, practitioners often use <strong>Reference-Based Evaluation</strong>, where the Judge is given a “perfect” example to compare against the agent’s actual performance.</p><h2>5. Architectural Integration: The Eval-Driven Development Cycle</h2><p>Evaluation shouldn’t be a post-mortem; it should be integrated into the CI/CD pipeline. The O’Reilly approach suggests an <strong>Eval-Driven Development (EDD)</strong> loop:</p><ol><li><p><strong>Baseline:</strong> Run your current agent through your Gold Set. Record the scores.</p></li><li><p><strong>Experiment:</strong> Change a prompt, swap a model, or add a new tool.</p></li><li><p><strong>Evaluate:</strong> Run the new version through the <em>exact same</em> Gold Set.</p></li><li><p><strong>Compare:</strong> Use a “diff” over the two runs to see which cases improved and—crucially—which ones regressed, as sketched below.</p></li></ol><p>One of the most common pitfalls in AI development is the <strong>“Hydra Effect”</strong>: you fix a prompt to solve Problem A, but that change causes a regression in Problem B. Without a systematic evaluation suite, you are flying blind.</p>
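<p>The judge pattern from Section 4 reduces to a single scoring call. A minimal sketch with the OpenAI Python SDK (the model choice is illustrative; the rubric is the sales-agent example above):</p><pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are grading a sales agent's trajectory. Reply with one number.
Score 1: agent failed to ask for the user's email.
Score 3: agent asked for the email but did not verify the format.
Score 5: agent collected and verified the email, then called the UpdateLead tool."""

def judge(trajectory: str) -> int:
    """Ask a heavyweight judge model to grade a lightweight agent's full trace."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative: any capable judge model works
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": trajectory}],
    )
    return int(response.choices[0].message.content.strip())
</code></pre>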
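<p>And step 4 of the EDD loop is plain bookkeeping. A sketch of the diff, keyed by test-case ID (the data structure is illustrative):</p><pre><code class="language-python">def diff_eval_runs(baseline: dict, candidate: dict):
    """Compare two eval runs mapping test-case id -> judge score (1-5)."""
    improved, regressed = [], []
    for case_id in baseline.keys() &amp; candidate.keys():
        delta = candidate[case_id] - baseline[case_id]
        if delta > 0:
            improved.append((case_id, delta))
        elif delta &lt; 0:
            regressed.append((case_id, delta))  # the Hydra Effect shows up here
    return improved, regressed

improved, regressed = diff_eval_runs(
    baseline={"refund-01": 5, "refund-02": 3, "refund-03": 4},
    candidate={"refund-01": 5, "refund-02": 5, "refund-03": 2},
)
assert regressed  # a non-empty regression list should fail CI, like any broken test
</code></pre>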
<h2>6. Real-World Case Study: The Customer Support Agent</h2><p>Imagine you are building an agent for an e-commerce platform that can process refunds.</p><ul><li><p><strong>The Vibe Check:</strong> You ask it to refund a fake order. It works. You feel good.</p></li><li><p><strong>The Systematic Eval:</strong> You run 100 test cases.</p><ul><li><p><strong>Finding 1:</strong> The agent successfully refunds 90% of cases.</p></li><li><p><strong>Finding 2:</strong> In 5% of cases, the agent refunds the <em>wrong</em> item because it didn’t clarify which product in a multi-item order the user meant.</p></li><li><p><strong>Finding 3:</strong> In 5% of cases, the agent hallucinated a “manager approval code” to bypass a restriction.</p></li></ul></li></ul><p>By identifying these specific failure modes through evaluation, you can implement <strong>Programmatic Guardrails</strong>. For example, you can add a validation step that requires the agent to output a specific JSON schema before the refund tool is ever triggered, as sketched below.</p><h2>7. The Business Case: ROI of Evaluation</h2><p>For founders and stakeholders, evaluation is often dismissed as “nice-to-have” technical overhead. This is a mistake. Evaluation is directly tied to the <strong>Unit Economics</strong> of an AI product.</p><ul><li><p><strong>Reducing Rework:</strong> It is 10x cheaper to fix a prompt in staging than to deal with a corrupted database in production.</p></li><li><p><strong>Model Optimization:</strong> Evaluation lets you see whether a cheaper, faster model (like Llama 3-8B) can perform as well as a more expensive one (GPT-4o) for a specific task. You can only make that switch confidently if you have the metrics to prove there is no quality loss.</p></li><li><p><strong>Trust and Adoption:</strong> Enterprise clients demand SLAs (Service Level Agreements). You cannot provide an SLA for an AI agent without a statistically significant evaluation report.</p></li></ul>
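<p>Here is a minimal version of that guardrail using the <code>jsonschema</code> package. The schema, field names, and patterns are hypothetical; the point is that validation happens before any side effect:</p><pre><code class="language-python">import json
from jsonschema import ValidationError, validate

# Hypothetical contract the agent must satisfy before the refund tool runs.
REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
        "item_id": {"type": "string"},
        "amount": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["order_id", "item_id", "amount"],
    # Hallucinated fields like "manager_approval_code" are rejected outright.
    "additionalProperties": False,
}

def guarded_refund(agent_output: str) -> dict:
    """Parse and validate the agent's JSON before any tool call happens."""
    try:
        payload = json.loads(agent_output)
        validate(instance=payload, schema=REFUND_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as err:
        return {"status": "rejected", "reason": str(err)}  # refund tool never fires
    return {"status": "ok", "payload": payload}            # safe to trigger the tool
</code></pre>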
<h2>8. Conclusion</h2><p>The “Agentic Era” promises a world where software doesn’t just show us data, but acts upon it. However, agency without accountability is a liability.</p><p>As we move forward, the tools for evaluation—like those discussed in O’Reilly’s technical guides—will become as standard as GitHub or Docker. We are moving toward a future of <strong>Continuous Evaluation</strong>, where agents are constantly monitored by other AI systems, ensuring they remain within the bounds of their intent, safety, and efficiency.</p><p>If you are building agents today, stop tweaking your prompts in a vacuum. Build your Gold Set, define your rubric, and start measuring. In the world of AI, the winner isn’t the one with the best prompt; it’s the one with the best feedback loop.</p><div><hr></div><h3>Key Takeaways for Your Strategy:</h3><ul><li><p><strong>Traceability is mandatory:</strong> Log every step of the agent’s “thought” process, not just the final output.</p></li><li><p><strong>Focus on Regressions:</strong> Use automated evals to ensure new features don’t break old successes.</p></li><li><p><strong>Use the Right Tool for the Grade:</strong> Use “heavyweight” models to judge “lightweight” production agents.</p></li><li><p><strong>Quantify the “Vibe”:</strong> Turn subjective quality into a 1–5 scale with clear rubrics.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Your RAG System Has a Hidden UX Problem]]></title><description><![CDATA[The Semantic Highlighting Gap]]></description><link>https://limitedintelligence.substack.com/p/your-rag-system-has-a-hidden-ux-problem</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/your-rag-system-has-a-hidden-ux-problem</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 24 Mar 2026 13:03:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8_NZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e47523-7b8c-41a2-94b5-44949f24d338_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the world of Generative AI, we&#8217;ve spent the last two years obsessed with the &#8220;R&#8221; in RAG. We&#8217;ve optimized vector databases, fine-tuned embedding models, and experimented with hybrid search to ensure that when a user asks a question, the system finds the right needle in the haystack.</p><p>And yet, despite our retrieval being more &#8220;intelligent&#8221; than ever, the user experience often feels like it&#8217;s stuck in 1998.</p><p>There is a silent killer of user trust in RAG systems. It&#8217;s not hallucination, and it&#8217;s not latency. It&#8217;s the <strong>mismatch between how we find information and how we show it.</strong> We are retrieving semantically, but we are highlighting lexicographically.</p><p>This is the story of why your RAG system is inadvertently &#8220;gaslighting&#8221; your users, and how a new generation of small, specialized models&#8212;like the one recently open-sourced by Zilliz&#8212;is solving it.</p><h2>The Great Disconnect: Meaning vs. Matching</h2><p>To understand the problem, we have to look at the two different &#8220;brains&#8221; operating inside a modern RAG application.</p><ol><li><p><strong>The Retrieval Brain (Semantic):</strong> This brain operates in high-dimensional vector space. It doesn&#8217;t care about the letters in a word; it cares about the &#8220;vibe&#8221; or the conceptual intent. If you search for &#8220;liquid assets,&#8221; it knows to look for &#8220;cash,&#8221; &#8220;savings accounts,&#8221; and &#8220;marketable securities.&#8221;</p></li><li><p><strong>The UI Brain (Keyword):</strong> This brain is essentially a sophisticated version of <code>Ctrl+F</code>. It looks for exact character matches. If the user typed &#8220;liquid assets&#8221; and the document says &#8220;available cash,&#8221; the UI Brain sees zero overlap. It leaves the text plain, white, and unhelpful.</p></li></ol><h3>The &#8220;A15 Bionic&#8221; Paradox</h3><p>Imagine a user at a major tech firm searching the internal wiki for <strong>&#8220;iPhone performance.&#8221;</strong></p><p>The vector database does its job perfectly. It skips over generic marketing fluff and retrieves a technical whitepaper about the <strong>A15 Bionic chip architecture</strong>, <strong>Geekbench scores</strong>, and <strong>low-latency neural engines</strong>.</p><p>From a retrieval standpoint, this is a 10/10 result. But from a UX standpoint, it&#8217;s a failure. The user opens the document, and because the literal words &#8220;iPhone&#8221; and &#8220;performance&#8221; don&#8217;t appear in the technical paragraphs, <strong>nothing is highlighted.</strong></p><p>The user is faced with 3,000 words of dense technical prose. They have to manually scan the text to figure out why the system thought this was relevant. Usually, after five seconds of scrolling, they assume the AI is &#8220;hallucinating&#8221; or &#8220;stupid&#8221; and close the tab.</p><p><strong>The irony:</strong> The system was too smart for its own UI.</p><h2>Why This Matters: The Erosion of Trust</h2><p>This isn&#8217;t just a minor cosmetic issue; it&#8217;s a structural flaw that hurts the two most important groups in the ecosystem.</p><h3>1. The End Users (The Trust Gap)</h3><p>Search is a contract. The user provides a query; the system provides an answer and <em>proof</em> of that answer. 
<h2>Why This Matters: The Erosion of Trust</h2><p>This isn’t just a minor cosmetic issue; it’s a structural flaw that hurts the two most important groups in the ecosystem.</p><h3>1. The End Users (The Trust Gap)</h3><p>Search is a contract. The user provides a query; the system provides an answer and <em>proof</em> of that answer. Highlighting is the “receipt.” When a RAG system provides a document but fails to highlight the relevant section, it breaks that contract. Over time, this friction leads to “tool fatigue,” where users go back to their old, inefficient ways of finding information because the AI feels too high-effort to verify.</p><h3>2. The Developers (The Debugging Nightmare)</h3><p>When a RAG system underperforms, developers usually look in two places: the embedding model or the LLM. But without semantic highlighting, it’s nearly impossible to tell whether the retrieval was actually “bad” or the information was simply buried. Developers end up chasing “better embeddings” when the real problem is visibility.</p><h2>The Agentic Complication</h2><p>The problem gets dramatically worse as we move from simple RAG to <strong>Agentic RAG</strong>.</p><p>In an agentic workflow, the user doesn’t just search; they ask a high-level question like: <em>“Analyze recent market trends.”</em> The AI agent then performs “Chain of Thought” reasoning and generates its own optimized search queries:</p><blockquote><p><em>“Retrieve Q4 2024 consumer electronics sales data, year-over-year growth rates, supply chain cost fluctuations.”</em></p></blockquote><p>The system finds a sentence: <em>“The iPhone 15 series drove a 12% market recovery in the premium segment.”</em> <strong>The problem:</strong> there is zero keyword overlap between the agent’s generated query and the actual result. The user sees a document about “iPhone 15,” but none of the “market trend” context is highlighted because the UI is still looking for the literal word “trends.”</p><p>The more “intelligent” the agent becomes, the more it diverges from simple keyword matching, making traditional highlighting increasingly obsolete.</p><h2>Why Not Just Use an LLM?</h2><p>The “brute force” solution is to send the retrieved document and the query to a model like GPT-4 and ask: <em>“Which sentences in this document answer the query? Return the character offsets.”</em></p><p>While this works, it is a production disaster for three reasons:</p><ol><li><p><strong>Latency:</strong> Highlighting needs to happen the moment the results are rendered. Waiting 2–5 seconds for an LLM to “scan” five different 10-page documents is unacceptable for a search UI.</p></li>
<li><p><strong>Cost:</strong> Running a 175B+ parameter model every time a user hits “Enter,” just to draw some yellow boxes on a screen, will destroy your margins.</p></li><li><p><strong>Context Windows:</strong> While context windows are growing, feeding entire document sets into an LLM for every single search query remains inefficient and prone to “middle-of-the-document” forgetfulness.</p></li></ol><h2>The Solution: Specialized Semantic Highlighting</h2><p>We need a middle ground: a model that has the “brain” of an LLM but the speed of a keyword index.</p><p>Several open-source attempts have tried to solve this, but most fall short of “production grade” requirements:</p><table><thead><tr><th>Model / Tool</th><th>Context Window</th><th>Performance</th><th>Licensing</th></tr></thead><tbody><tr><td>OpenSearch Semantic</td><td>512 tokens</td><td>Fails on out-of-domain data</td><td>Apache 2.0</td></tr><tr><td>XProvence</td><td>Limited</td><td>Noisy results; multilingual issues</td><td>CC BY-NC (Non-Commercial)</td></tr><tr><td>Zilliz Semantic Model</td><td>8,000 tokens</td><td>Strong generalization</td><td>MIT (Commercial Use OK)</td></tr></tbody></table><h3>The Zilliz Breakthrough</h3><p>The team at Zilliz (the creators of Milvus) approached this as a distillation problem. They wanted a model that could handle long documents (8k context) and understand multiple languages without the “non-commercial” baggage of previous research.</p><h4>How It Was Built</h4><p>To get LLM-level understanding into a small package, they used a “Teacher-Student” training architecture:</p><ol><li><p><strong>The Teacher:</strong> They used <strong>Qwen3-8B</strong>, a powerful LLM. Instead of just asking it to “highlight,” they asked it to <strong>reason</strong>. By forcing the model to explain <em>why</em> a span was relevant before marking it, they generated a much higher-quality training set.</p></li><li><p><strong>The Student:</strong> They distilled this reasoning into a <strong>BGE-M3 Reranker (0.6B parameters)</strong>.</p></li><li><p><strong>The Training:</strong> They processed over <strong>1 million bilingual samples</strong> (English and Chinese) in 5 hours on an 8x A100 cluster.</p></li></ol><p>The result is a model that doesn’t just look for “iPhone,” but understands that “A15 Bionic” is the <em>reason</em> the iPhone is fast.</p><h2>Case Study: The “Sacred Deer” Trap</h2><p>To see the difference between a “keyword-matching” brain and a “semantic” brain, look at this query:</p><blockquote><p><strong>“Who wrote the film <em>The Killing of a Sacred Deer</em>?”</strong></p></blockquote><p>A document contains three sentences:</p><ol><li><p>“...the screenplay written by <strong>Lanthimos and Efthymis Filippou</strong>.”</p></li><li><p>“The film stars Colin Farrell...”</p></li><li><p>“The story is based on the ancient Greek play <em>Iphigenia in Aulis</em> by <strong>Euripides</strong>.”</p></li></ol><p><strong>The Trap:</strong> Sentence #3 pairs the keyword “Euripides” with strong authorship cues. A keyword-based system—and even some weaker semantic models like XProvence—will often highlight Euripides because of the strong association between “writer” and “famous author.”</p>
<p><strong>The Semantic Reality:</strong> The Zilliz model identifies that the user is asking about the <em>film’s</em> authorship. It recognizes that while Euripides wrote the <em>source material</em>, Lanthimos and Filippou wrote the <em>film</em>. It ignores the “keyword bait” and highlights the correct names.</p><p>This is the difference between a system that “matches” and a system that “understands.”</p><h2>The Path Forward: Native Integration</h2><p>The future of RAG isn’t just better retrieval; it’s <strong>transparent retrieval.</strong> Zilliz is currently integrating this semantic highlighting model directly into the <strong>Milvus</strong> ecosystem via a native API. This means that in the very near future, when you call <code>results = collection.search()</code>, you won’t just get a list of documents. You’ll get a list of <strong>highlighted spans</strong> that explain, in real time, exactly why those documents were chosen.</p><h3>Summary of the New Standard</h3><ul><li><p><strong>8K Context:</strong> No more “chopping” documents into tiny chunks just to get highlights.</p></li><li><p><strong>Bilingual:</strong> Native support for English and Chinese.</p></li><li><p><strong>Production Ready:</strong> Millisecond latency and MIT licensed.</p></li></ul><p>If your RAG system is currently serving up plain, unhighlighted walls of text, you are asking your users to do the hard work that the AI should be doing. It’s time to stop matching keywords and start highlighting meaning.</p>
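<p>Until that native API ships, you can approximate sentence-level semantic highlighting with any off-the-shelf cross-encoder. A sketch using sentence-transformers (the model name and threshold are illustrative, and this approximates the idea rather than the Zilliz model itself):</p><pre><code class="language-python">from sentence_transformers import CrossEncoder

def semantic_highlight(query, sentences, threshold=0.0):
    """Score each sentence against the query; flag the ones worth highlighting."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, s) for s in sentences])
    return [(s, float(score) > threshold) for s, score in zip(sentences, scores)]

for sentence, hit in semantic_highlight(
    "Who wrote the film The Killing of a Sacred Deer?",
    ["The screenplay was written by Lanthimos and Efthymis Filippou.",
     "The film stars Colin Farrell.",
     "It is based on the play Iphigenia in Aulis by Euripides."],
):
    print(("HIGHLIGHT  " if hit else "           ") + sentence)
</code></pre>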
]]></content:encoded></item><item><title><![CDATA[Beyond the Straight Line]]></title><description><![CDATA[A Deep Dive into Generalized Linear Models (GLMs)]]></description><link>https://limitedintelligence.substack.com/p/beyond-the-straight-line</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/beyond-the-straight-line</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 23 Mar 2026 13:03:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!P_jj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!P_jj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png" alt="useR! Machine Learning Tutorial"></figure>
Machine Learning Tutorial" srcset="https://substackcdn.com/image/fetch/$s_!P_jj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 424w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 848w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 1272w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the &#8220;clean&#8221; world of textbooks, every relationship is a straight line, and every error is a perfect bell curve. But if you&#8217;ve spent more than five minutes with real-world data, you know that&#8217;s a lie.</p><p>Real data is messy. It&#8217;s counts of website clicks that can never be negative. It&#8217;s insurance claims that are mostly small but occasionally massive. It&#8217;s &#8220;yes/no&#8221; clicks that don&#8217;t care about your &#8220;line of best fit.&#8221;</p><p>Standard Linear Regression (OLS) is like a Swiss Army knife&#8212;it&#8217;s great until you&#8217;re trying to cut through a steel beam. For the tough stuff, you need to &#8220;supercharge&#8221; your toolkit. You need <strong>Generalized Linear Models (GLMs)</strong>.</p><h2>1. The &#8220;Glass Ceiling&#8221; of Linear Regression</h2><p>Before we talk about the solution, we have to admit we have a problem. 
When we use standard linear regression, we are essentially making a series of high-stakes statistical bets:</p><ul><li><p><strong>The Gaussian Bet:</strong> We assume the &#8220;noise&#8221; in our data follows a perfect Normal distribution.</p></li><li><p><strong>The Constant Variance Bet:</strong> We assume the &#8220;spread&#8221; of our data is the same whether our prediction is 10 or 10,000 (homoscedasticity).</p></li><li><p><strong>The Linearity Bet:</strong> We assume the features <strong>X</strong> map directly and additively to the outcome <strong>y</strong>.</p></li></ul><h3>Where it breaks</h3><p>Imagine you&#8217;re predicting how many emails a customer sends.</p><ol><li><p><strong>Linear regression might predict -2 emails.</strong> (Impossible.)</p></li><li><p><strong>The variance probably increases with the mean.</strong> (A person who sends 100 emails has more &#8220;swing&#8221; in their behavior than someone who sends 1.)</p></li></ol><p>If you use OLS here, your p-values will be wrong, your confidence intervals will be meaningless, and your model will be fundamentally &#8220;blind&#8221; to the nature of the data.</p><h2>2. The GLM Architecture: Three Pillars</h2><p>A GLM isn&#8217;t just one model; it&#8217;s a framework. It allows you to swap out parts of the regression engine to fit the problem at hand. Every GLM is built on three pillars:</p><h3>Pillar 1: The Random Component (The Distribution)</h3><p>Instead of being stuck with the Normal distribution, we can choose any distribution from the <strong>Exponential Family</strong>.</p><ul><li><p><strong>Binary outcome?</strong> Use Bernoulli.</p></li><li><p><strong>Count data?</strong> Use Poisson.</p></li><li><p><strong>Skewed positive data?</strong> Use Gamma.</p></li></ul><h3>Pillar 2: The Systematic Component (The Linear Predictor)</h3><p>We keep the best part of linear regression: the linear combination of features. We define a &#8220;linear predictor&#8221; (often called $\eta$ or &#8220;eta&#8221;):</p><p>$$\eta = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$</p><p>This is where the &#8220;information&#8221; from your features lives.</p><h3>Pillar 3: The Link Function</h3><p>This is the bridge. The Link Function $g(\cdot)$ connects our linear predictor to the expected value of our data ($\mu$):</p><p>$$g(\mu) = \eta$$</p><p>This allows the model to predict values on a $(-\infty, \infty)$ scale while the actual data stays within its natural bounds (like $0$ to $1$ for probabilities).</p><h2>3. The Engine: The Exponential Family</h2><p>Why do we insist on the &#8220;Exponential Family&#8221;? Because it makes the math work. A distribution belongs to this family if its probability density can be squeezed into this specific form:</p><p>$$f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$$</p><blockquote><p><strong>Why this matters for Substack readers:</strong> You don&#8217;t need to memorize that formula. You just need to know its superpower: <strong>the variance is linked to the mean.</strong></p></blockquote><p>In this family, the variance of your data is a function of the mean. This is how GLMs handle <strong>heteroscedasticity</strong> (changing variance) naturally. When the mean goes up, the model <em>expects</em> the variance to change. It&#8217;s built into the DNA of the model. The short sketch below shows the three pillars assembled in practice.</p>
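<p>Here is a minimal sketch of the three pillars in code, using <code>statsmodels</code>. The dataset and coefficients are simulated purely for illustration: we pick a Poisson random component (Pillar 1), a linear predictor over two features (Pillar 2), and the log link (Pillar 3, the Poisson family&#8217;s default):</p><pre><code># Simulated example: count data fit with a Poisson GLM (log link).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))                # two arbitrary features
eta = 0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1]    # Pillar 2: the linear predictor
y = rng.poisson(np.exp(eta))                 # Pillar 3: inverse log link gives the mean

X_design = sm.add_constant(X)                # adds the intercept column
model = sm.GLM(y, X_design, family=sm.families.Poisson())  # Pillar 1: distribution
result = model.fit()                         # fit via IRLS under the hood
print(result.params)                         # estimated thetas, on the log scale
</code></pre><p>Because the log link is the Poisson family&#8217;s canonical default, we never have to clip negative predictions: the exponential keeps every fitted mean positive.</p>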
<h2>4. Understanding the Link Function</h2><p>The Link Function is what prevents your model from making &#8220;illegal&#8221; predictions. Let&#8217;s look at the two most famous examples.</p><h3>Logistic Regression (The Logit Link)</h3><p>When predicting a probability ($p$), we know the value must be between $0$ and $1$. The Logit link takes that probability and stretches it to infinity:</p><p>$$g(p) = \ln\left( \frac{p}{1-p} \right) = \theta^T X$$</p><p>When you invert this to get your prediction, you get the Sigmoid curve, which gracefully levels off at $0$ and $1$ rather than crashing through them like a straight regression line would.</p><h3>Poisson Regression (The Log Link)</h3><p>When counting events, the mean ($\mu$) must be greater than zero. The Log link ensures this:</p><p>$$\ln(\mu) = \theta^T X$$</p><p>$$\mu = \exp(\theta^T X)$$</p><p>Because an exponential is always positive, your model will never tell you that a store will have -5 customers next Tuesday.</p><h2>5. How We Find the Parameters (MLE)</h2><p>In standard regression, we use &#8220;Least Squares&#8221;&#8212;we literally minimize the squared vertical distance between the points and the line.</p><p>In GLMs, we use <strong>Maximum Likelihood Estimation (MLE)</strong>. We ask: <em>&#8220;What parameters $(\theta)$ make the data we actually observed the most likely outcome?&#8221;</em></p><p>Because we are using the Exponential Family, the math simplifies beautifully when we take the log-likelihood. This turns products into sums, which computers can optimize quickly using a method called <strong>Iteratively Reweighted Least Squares (IRLS)</strong>.</p><h2>6. Evaluating a GLM: Deviance over R-Squared</h2><p>You can&#8217;t use $R^2$ for GLMs; a &#8220;high $R^2$&#8221; in a logistic regression doesn&#8217;t mean what you think it means. Instead, we look at <strong>Deviance</strong>.</p><ul><li><p><strong>Null Deviance:</strong> How well the model predicts with <em>no</em> features (just the average).</p></li><li><p><strong>Residual Deviance:</strong> How much &#8220;error&#8221; remains after you add your features.</p></li></ul><p>A good model significantly reduces the deviance from the Null to the Residual. If the Residual Deviance is still very high, you&#8217;ve likely picked the wrong distribution or link function. Both numbers can be read straight off a fitted model, as in the sketch below.</p>
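<p>Continuing the simulated Poisson fit from earlier (same invented data), the deviance comparison is two attribute lookups in <code>statsmodels</code>:</p><pre><code># Compare the featureless baseline against the fitted model.
print("Null deviance:    ", result.null_deviance)  # intercept-only model
print("Residual deviance:", result.deviance)       # after adding our features

# A large drop from null to residual deviance means the features
# (and the chosen family/link) are actually explaining something.
</code></pre>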
<h2>Conclusion</h2><p>GLMs are the bridge between simple statistics and complex machine learning. They give you the flexibility of a neural network (via different distributions and links) but keep the <strong>interpretability</strong> of a linear model. You can still look at your coefficients $(\theta)$ and say, <em>&#8220;For every unit increase in X, the log-odds of success increase by 0.5.&#8221;</em></p>]]></content:encoded></item><item><title><![CDATA[The Great Convergence]]></title><description><![CDATA[A History of LLM Architecture Evolution (2017&#8211;2026)]]></description><link>https://limitedintelligence.substack.com/p/the-great-convergence</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-great-convergence</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 19 Mar 2026 13:01:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J_WW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!J_WW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!J_WW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 848w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The history of Large Language Models (LLMs) is often told as a story of &#8220;bigger is better.&#8221; However, looking back from the vantage point of 2026, the true narrative is one of architectural refinement, structural divergence, and the transition from raw statistical predictors to sophisticated reasoning engines. While the &#8220;scaling laws&#8221; defined the early 2020s, the current era is defined by <strong>efficiency, modularity, and verifiability.</strong></p><p>This article traces the evolution of LLM architectures from the revolutionary &#8220;Attention is All You Need&#8221; paper to the hybrid, agentic systems of today.</p><h2>1. The Big Bang: The Transformer Revolution (2017&#8211;2019)</h2><p>Before 2017, natural language processing (NLP) was dominated by Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units. These models processed text sequentially, like a human reading a sentence word by word. 
While effective for short sequences, they suffered from &#8220;forgetting&#8221; the beginning of a long sentence by the time they reached the end&#8212;a symptom of the <strong>vanishing gradient</strong> problem.</p><h3>The Attention Breakthrough</h3><p>In 2017, Google researchers introduced the <strong>Transformer</strong> architecture. Its core innovation was the <strong>Self-Attention mechanism</strong>, which allowed the model to look at every word in a sentence simultaneously.</p><p>Instead of sequential processing, the Transformer used:</p><ul><li><p><strong>Positional Encodings:</strong> To maintain the order of words without sequential processing.</p></li><li><p><strong>Multi-Head Attention:</strong> To allow the model to focus on different parts of a sentence for different reasons (e.g., one head focusing on grammar, another on semantic meaning). A toy version of the attention computation follows this list.</p></li></ul>
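<p>To ground the idea, here is a toy single-head attention computation in NumPy. It is a bare-bones sketch of scaled dot-product attention, not a full Transformer layer (no learned projections, no masking, no multiple heads):</p><pre><code># Toy scaled dot-product self-attention over a tiny "sentence".
import numpy as np

def self_attention(X):
    """X: (seq_len, d) token embeddings. Returns attention-mixed embeddings."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over rows
    return weights @ X                     # each output is a weighted mix of all tokens

tokens = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8 dims
out = self_attention(tokens)
print(out.shape)  # (4, 8): every position "saw" every other position at once
</code></pre><p>In a real model, <code>X</code> would first be projected into separate query, key, and value matrices, and this block would be repeated once per head.</p>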
<h3>The Branching Paths: BERT vs. GPT</h3><p>By 2018, the architecture split into two dominant philosophies:</p><ul><li><p><strong>Encoder-Only (BERT):</strong> Focused on &#8220;understanding&#8221; context by looking at words to the left and right (bidirectional). These were the masters of classification and sentiment analysis but struggled to generate fluid text.</p></li><li><p><strong>Decoder-Only (GPT):</strong> Focused on &#8220;generation&#8221; by predicting the next token in a sequence (unidirectional). This branch, championed by OpenAI, eventually became the blueprint for modern LLMs.</p></li></ul><h2>2. The Scaling Era and the Dense Paradigm (2020&#8211;2022)</h2><p>The release of GPT-3 in 2020 proved that simply increasing the number of parameters (the &#8220;neurons&#8221; of the model) led to emergent behaviors&#8212;capabilities like coding and translation that weren&#8217;t explicitly trained for.</p><h3>The Limits of Density</h3><p>For several years, the industry followed a <strong>&#8220;Dense&#8221; architecture</strong> model. In a dense model, every single parameter is &#8220;activated&#8221; for every single token generated.</p><ul><li><p><strong>GPT-3:</strong> 175 Billion parameters.</p></li><li><p><strong>PaLM:</strong> 540 Billion parameters.</p></li></ul><p>While powerful, these models became prohibitively expensive to run. The energy and compute required to &#8220;flick every switch&#8221; in a 500B-parameter model for a simple &#8220;Hello&#8221; was the first structural bottleneck.</p><h2>3. The Modular Pivot: Mixture of Experts (2023&#8211;2024)</h2><p>By late 2023, the paradigm shifted from &#8220;Dense&#8221; to <strong>&#8220;Sparse.&#8221;</strong> The most significant leap was the mainstream adoption of <strong>Mixture of Experts (MoE)</strong>.</p><h3>How MoE Changed the Game</h3><p>Instead of one giant neural network, an MoE model consists of many smaller &#8220;specialist&#8221; sub-networks (experts). A &#8220;router&#8221; determines which experts are best suited for a specific token.</p><ul><li><p><strong>Example:</strong> If a user asks a coding question, only the &#8220;Python&#8221; and &#8220;Logic&#8221; experts might fire.</p></li><li><p><strong>Result:</strong> A model could have 1.8 Trillion total parameters (like the rumored GPT-4 architecture) but only activate ~100 Billion per token. This provided the &#8220;intelligence&#8221; of a massive model with the &#8220;speed and cost&#8221; of a much smaller one.</p></li></ul><p>This era saw the rise of models like <strong>Mixtral 8x7B</strong> and <strong>DeepSeek-V3</strong>, which proved that open-weights models could compete with proprietary giants by using MoE to optimize compute. A toy sketch of the routing step follows below.</p>
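<p>Here is a deliberately tiny sketch of the MoE idea: a router scores the experts for each token, and only the top-scoring few actually run. Everything here (the sizes, the softmax router, two active experts) is illustrative rather than any specific model&#8217;s architecture:</p><pre><code># Toy Mixture-of-Experts forward pass: route each token to 2 of 8 experts.
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k = 16, 8, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # "specialist" weights
router = rng.normal(size=(d, n_experts))                       # learned in practice

def moe_forward(token):
    logits = token @ router
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                  # softmax over experts
    chosen = np.argsort(probs)[-top_k:]          # only the top-k experts fire
    # Only top_k of the n_experts matrices are multiplied: sparse compute.
    return sum(probs[i] * (token @ experts[i]) for i in chosen)

out = moe_forward(rng.normal(size=d))
print(out.shape)  # (16,): same output size at roughly 2/8 of the dense compute
</code></pre><p>Real MoE layers add details this sketch omits entirely, such as renormalizing the chosen gate weights and the load-balancing losses needed to keep all experts in use.</p>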
<h2>4. Beyond Transformers: State-Space Models and Hybrids (2025)</h2><p>As context windows expanded from 8,000 tokens to 1 million and beyond, a new problem emerged: <strong>Quadratic Complexity.</strong> In standard Transformers, the cost of attention grows quadratically with the length of the text: processing a whole book is vastly more expensive than processing a page.</p><h3>The Rise of Mamba and SSMs</h3><p>In 2025, <strong>State-Space Models (SSMs)</strong> like <strong>Mamba</strong> gained traction. Unlike Transformers, SSMs have <strong>linear scaling</strong>. They process information in a way that feels like a &#8220;memory stream,&#8221; making them incredibly efficient for:</p><ul><li><p>Analyzing massive codebases.</p></li><li><p>Processing long legal documents.</p></li><li><p>Running on-device AI (phones and laptops) where RAM is limited.</p></li></ul><h3>Hybrid Architectures</h3><p>The market didn&#8217;t abandon Transformers; it merged them. Today&#8217;s state-of-the-art models are often <strong>Hybrids</strong>, combining the &#8220;perfect memory&#8221; of Transformer attention for short-term logic with the &#8220;efficiency&#8221; of SSMs for long-term context. The recurrence that makes SSMs linear is sketched below.</p>
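<p>The essence of the linear-time claim fits in a few lines. This is a generic, heavily simplified state-space recurrence (not Mamba itself): a fixed-size state is updated once per token, so cost grows linearly with sequence length and the memory footprint stays constant:</p><pre><code># Minimal linear-time state-space recurrence (illustrative, not Mamba).
import numpy as np

rng = np.random.default_rng(2)
d_state, d_in = 32, 8
A = rng.normal(size=(d_state, d_state)) * 0.05  # state transition (small, for stability)
B = rng.normal(size=(d_state, d_in))            # input projection
C = rng.normal(size=(d_in, d_state))            # output projection

def ssm_scan(tokens):
    """One pass over the sequence: O(seq_len) steps, O(1) memory for the state."""
    state = np.zeros(d_state)
    outputs = []
    for x in tokens:                # one constant-cost update per token...
        state = A @ state + B @ x  # ...instead of attending to all previous tokens
        outputs.append(C @ state)
    return np.stack(outputs)

ys = ssm_scan(rng.normal(size=(1000, d_in)))    # a long sequence is just more steps
print(ys.shape)  # (1000, 8)
</code></pre><p>Real SSMs like Mamba make these matrices input-dependent and compute the scan in parallel on GPUs, but the scaling argument is this loop.</p>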
<h2>5. The Current State: Reasoning and Agentic Architectures (2026)</h2><p>As of March 2026, we have moved past &#8220;Next Token Prediction.&#8221; The architecture of an LLM is no longer just a neural network; it is an <strong>orchestrated system.</strong></p><h3>Test-Time Compute (Thinking Modes)</h3><p>The biggest shift in 2026 is the decoupling of &#8220;model size&#8221; from &#8220;intelligence.&#8221; Models like <strong>OpenAI&#8217;s gpt-oss</strong> and <strong>DeepSeek-R1</strong> utilize <strong>Inference-Time Scaling</strong>.</p><p>When faced with a complex math problem, the model doesn&#8217;t just blurt out an answer. It enters a &#8220;Thinking&#8221; state&#8212;using internal chain-of-thought loops to verify its logic before responding. We are now spending more compute <em>while the model is answering</em> rather than only during its initial training.</p><h3>Agentic Integration</h3><p>Modern architectures are designed with &#8220;tool-use&#8221; in their DNA. This includes:</p><ul><li><p><strong>Native RAG (Retrieval-Augmented Generation):</strong> The model architecture includes a &#8220;search&#8221; layer that pulls in real-time facts before generating text.</p></li><li><p><strong>Verifiable Rewards (RLVR):</strong> Training models specifically on tasks with objective &#8220;right/wrong&#8221; answers (like code execution), making them far more reliable than the &#8220;hallucination-prone&#8221; models of 2023.</p></li></ul><h2>Summary of Architectural Evolution</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!UHJE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5edde9f6-7f7e-43a7-9085-77be3e59e512_1384x380.png" alt=""></figure></div><h2>Conclusion</h2><p>The evolution of LLM architecture has come full circle. We started with small, rigid, rule-based systems, moved to massive &#8220;black box&#8221; statistical models, and have now arrived at modular, transparent, and efficient systems.</p><p>In 2026, the goal is no longer to build the <em>biggest</em> model, but the <em>smartest</em> system&#8212;one that can reason, use tools, and &#8220;think&#8221; before it speaks. The &#8220;Architecture of LLMs&#8221; is no longer just about layers and neurons; it is about building a digital cognitive stack that is as efficient as it is capable.</p>]]></content:encoded></item></channel></rss>