<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Limited Intelligence: Engineering]]></title><description><![CDATA[Posta around engineering, computer science, AI/ML and much more...]]></description><link>https://limitedintelligence.substack.com/s/technical</link><image><url>https://substackcdn.com/image/fetch/$s_!GWza!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d2d3d3d-0b69-4ede-af41-e07792d3d4c0_240x240.png</url><title>Limited Intelligence: Engineering</title><link>https://limitedintelligence.substack.com/s/technical</link></image><generator>Substack</generator><lastBuildDate>Mon, 27 Apr 2026 04:06:27 GMT</lastBuildDate><atom:link href="https://limitedintelligence.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[João Paulo Vieira da Silva]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[limitedintelligence@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[limitedintelligence@substack.com]]></itunes:email><itunes:name><![CDATA[João Silva]]></itunes:name></itunes:owner><itunes:author><![CDATA[João Silva]]></itunes:author><googleplay:owner><![CDATA[limitedintelligence@substack.com]]></googleplay:owner><googleplay:email><![CDATA[limitedintelligence@substack.com]]></googleplay:email><googleplay:author><![CDATA[João Silva]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Agentic Singularity]]></title><description><![CDATA[Architecting the Orchestration Layer for Autonomous Execution]]></description><link>https://limitedintelligence.substack.com/p/the-agentic-singularity</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-agentic-singularity</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 24 Apr 2026 13:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lXDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lXDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lXDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 424w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 848w, 
https://substackcdn.com/image/fetch/$s_!lXDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1272w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What Is an AI Agent? The Future Explained&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What Is an AI Agent? The Future Explained" title="What Is an AI Agent? The Future Explained" srcset="https://substackcdn.com/image/fetch/$s_!lXDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 424w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 848w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1272w, https://substackcdn.com/image/fetch/$s_!lXDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1c5ed41-a47a-40b9-bf08-d2911d1c6ea1_1920x1080.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" 
stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In 2026, the &#8220;Chatbot&#8221; is a legacy term. It belongs to the era of 2023, where we were impressed that a machine could write a cover letter. Today, if your AI doesn&#8217;t have a &#8220;hand&#8221; to move a mouse or an API key to execute a trade, it&#8217;s essentially a very expensive paperweight.</p><p>We have transitioned from <strong>Generative AI</strong> to <strong>Agentic AI</strong>. The technical delta between the two is massive. Generative AI is a stateless prediction engine; Agentic AI is a stateful, goal-oriented system. This week, the trending discourse isn&#8217;t about how many parameters a model has, but how it manages its <strong>Orchestration Layer</strong>.</p><p>To build a functional agent in 2026, you aren&#8217;t just calling an LLM. You are building a complex feedback loop.</p><h4>1. Recursive Planning (System 2 Reasoning)</h4><p>Early models used &#8220;Chain of Thought&#8221; (CoT) as a prompting trick. Today, CoT is baked into the architecture. We see this in the latest iterations of <strong>Reasoning-Native Kernels</strong>.</p><ul><li><p><strong>The Workflow:</strong> When a goal is received (e.g., &#8220;Analyze this 10-K and execute a hedge strategy&#8221;), the agent doesn&#8217;t start typing. It initiates a <strong>Sub-goal Decomposition</strong>.</p></li><li><p><strong>The Scratchpad:</strong> Modern agents maintain a hidden &#8220;latent scratchpad&#8221; where they simulate different execution paths before committing to an action. This reduces &#8220;hallucination-in-action,&#8221; which was the primary killer of 2025-era agents.</p></li></ul><h4>2. The Model Context Protocol (MCP) and Tool-Augmented Generation</h4><p>The most significant technical trend this week is the maturation of <strong>MCP</strong>. We&#8217;ve finally moved past the &#8220;Plugin&#8221; mess. MCP provides a standardized interface for models to talk to the world.</p><ul><li><p><strong>Standardized Schemas:</strong> Whether the agent is talking to a SQL database or a robotic arm, it uses the same protocol.</p></li><li><p><strong>Dynamic Discovery:</strong> Agents can now &#8220;poll&#8221; an environment to see what tools are available. If you give an agent access to a new GitHub repo, it can read the README, understand the functions, and begin using them without a human having to write a &#8220;system prompt&#8221; explaining the API.</p></li></ul><h4>3. Long-term Memory and State Persistence</h4><p>The &#8220;Context Window&#8221; wars are over. We won. But 10-million-token windows are useless if the model can&#8217;t find the needle in the haystack. 
<h4>2. The Model Context Protocol (MCP) and Tool-Augmented Generation</h4><p>The most significant technical trend this week is the maturation of <strong>MCP</strong>. We&#8217;ve finally moved past the &#8220;Plugin&#8221; mess. MCP provides a standardized interface for models to talk to the world.</p><ul><li><p><strong>Standardized Schemas:</strong> Whether the agent is talking to a SQL database or a robotic arm, it uses the same protocol.</p></li><li><p><strong>Dynamic Discovery:</strong> Agents can now &#8220;poll&#8221; an environment to see what tools are available. If you give an agent access to a new GitHub repo, it can read the README, understand the functions, and begin using them without a human having to write a &#8220;system prompt&#8221; explaining the API.</p></li></ul><h4>3. Long-term Memory and State Persistence</h4><p>The &#8220;Context Window&#8221; wars are over. We won. But 10-million-token windows are useless if the model can&#8217;t find the needle in the haystack. The trend now is <strong>Semantic State Management</strong>.</p><ul><li><p><strong>Vectorized Ephemeral Memory:</strong> Agents now use a tiered memory system (see the sketch after this list):</p><ol><li><p><strong>L1 (Working):</strong> The immediate context window.</p></li><li><p><strong>L2 (Short-term):</strong> A high-speed cache of the last 50 steps in a workflow.</p></li><li><p><strong>L3 (Long-term):</strong> A RAG-based (Retrieval-Augmented Generation) archive of all past interactions, indexed by &#8220;importance&#8221; scores rather than just chronological order.</p></li></ol></li></ul>
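<p>A rough sketch of that tiered layout in Go. The plain importance sort below stands in for a real vector index, and names like <code>TieredMemory</code> are illustrative, not from any specific agent framework:</p><pre><code>package main

import (
    "fmt"
    "sort"
)

// Record is one remembered step plus an importance score,
// mirroring the L3 idea of importance-ranked retrieval.
type Record struct {
    Text       string
    Importance float64
}

// TieredMemory is a toy version of the L1/L2/L3 layout above.
type TieredMemory struct {
    Working []string // L1: immediate context window
    Recent  []string // L2: cache of the last N workflow steps
    Archive []Record // L3: scored long-term store
}

const recentCap = 50

func (m *TieredMemory) Observe(step string, importance float64) {
    m.Working = append(m.Working, step)
    m.Recent = append(m.Recent, step)
    if len(m.Recent) &gt; recentCap { // evict the oldest beyond the cap
        m.Recent = m.Recent[1:]
    }
    m.Archive = append(m.Archive, Record{step, importance})
}

// Recall returns the k most important archived records; a real
// system would use vector similarity instead of a global sort.
func (m *TieredMemory) Recall(k int) []Record {
    sort.Slice(m.Archive, func(i, j int) bool {
        return m.Archive[i].Importance &gt; m.Archive[j].Importance
    })
    if k &gt; len(m.Archive) {
        k = len(m.Archive)
    }
    return m.Archive[:k]
}

func main() {
    var m TieredMemory
    m.Observe("opened ticket", 0.2)
    m.Observe("user approved budget", 0.9)
    fmt.Println(m.Recall(1)) // [{user approved budget 0.9}]
}</code></pre>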
<p>As we give agents more autonomy, we encounter a new technical challenge: <strong>Agentic Drift</strong>. This is where an agent, in the process of solving a complex, multi-day task, slowly veers away from the original constraints.</p><p>Technical journals are currently buzzing with <strong>Constraint-Satisfaction-Checking (CSC)</strong>. This is a secondary, highly quantized model that runs in parallel, acting as a &#8220;referee&#8221; to ensure the main agent doesn&#8217;t violate safety or budget parameters.</p>
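<p>Because the referee is just a parallel process watching a stream of proposed actions, it maps naturally onto message passing. A toy Go version, with a single hypothetical budget constraint standing in for a real CSC model:</p><pre><code>package main

import "fmt"

// Action is a proposed agent step with an estimated cost.
type Action struct {
    Name    string
    Cost    float64
    Verdict chan bool // the referee answers here
}

// referee runs in parallel with the main agent and enforces a
// hard budget, vetoing any action that would exceed it.
func referee(proposals &lt;-chan Action, budget float64) {
    spent := 0.0
    for a := range proposals {
        ok := spent+a.Cost &lt;= budget
        if ok {
            spent += a.Cost
        }
        a.Verdict &lt;- ok
    }
}

func main() {
    proposals := make(chan Action)
    go referee(proposals, 10.0)

    for _, a := range []Action{
        {"fetch filings", 4.0, make(chan bool)},
        {"run full backtest", 8.0, make(chan bool)}, // would blow the budget
    } {
        proposals &lt;- a
        fmt.Println(a.Name, "approved:", &lt;-a.Verdict)
    }
    close(proposals)
}</code></pre>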
<p>If you are still thinking about AI as a &#8220;text-in, text-out&#8221; system, you are falling behind. The value in 2026 lies in the <strong>Orchestration</strong>. It&#8217;s about how you handle retries when an API fails, how you manage the state across a three-day autonomous task, and how you ensure the agent knows when to stop and ask for human permission.</p>]]></content:encoded></item><item><title><![CDATA[The Efficiency Paradox]]></title><description><![CDATA[Why Small Language Models (SLMs) Are Winning the Enterprise War]]></description><link>https://limitedintelligence.substack.com/p/the-efficiency-paradox</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-efficiency-paradox</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 23 Apr 2026 13:03:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZNbg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1dd62-9952-412d-9483-47e987460925_1000x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ZNbg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1dd62-9952-412d-9483-47e987460925_1000x400.jpeg" width="1000" height="400" alt="How SLMs are redesigning academic intelligence | Software" title="How SLMs are redesigning academic intelligence | Software"></figure></div><p>For half a decade, the AI industry followed a predictable path: add more GPUs, add more data, get a smarter model. But in 2026, we have hit the &#8220;Energy Wall.&#8221; Training a model 10x larger than GPT-4 doesn&#8217;t yield 10x more intelligence. It yields a 10x higher electricity bill.</p><p>This week&#8217;s technical trend is the <strong>Rise of the Specialist</strong>. We are seeing a massive migration away from &#8220;God-models&#8221; toward <strong>Small Language Models (SLMs)</strong> that are hyper-optimized for specific domains.</p><p>How can a 7B-parameter model in 2026 match a 175B model from 2023? The answer lies in <strong>Knowledge Distillation</strong> and <strong>Curated Synthetic Data</strong>.</p><h4>1. The Student-Teacher Framework</h4><p>We are now using the &#8220;massive&#8221; models&#8212;the ones that take a small country&#8217;s power grid to run&#8212;as &#8220;Teachers.&#8221; They generate millions of high-quality reasoning chains. These &#8220;gold-standard&#8221; examples are then used to train the &#8220;Student&#8221; (the SLM).</p><ul><li><p><strong>The Result:</strong> The SLM doesn&#8217;t have to learn how to speak English from scratch by reading the messy internet. It learns purely from the &#8220;refined logic&#8221; of the teacher model. This allows it to punch way above its weight class in reasoning capability.</p></li></ul><h4>2. Specialized Loss Functions</h4><p>In 2026, we aren&#8217;t just using standard cross-entropy loss. We are using <strong>Task-Specific Loss Functions</strong>. If you are building an SLM for medical diagnosis, the model is penalized more heavily for a &#8220;false negative&#8221; than for a formatting error. This &#8220;weighted intelligence&#8221; makes SLMs safer and more reliable for production than general-purpose LLMs.</p>
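<p>The core mechanic is simple enough to show in a few lines. A toy weighted cross-entropy in Go, where a per-class weight scales the usual penalty term; this is an illustration of the idea, not the API of any training framework:</p><pre><code>package main

import (
    "fmt"
    "math"
)

// weightedCrossEntropy scales the standard -log(p[target]) loss by a
// per-class weight, so mistakes on critical classes cost more.
func weightedCrossEntropy(probs []float64, target int, weight []float64) float64 {
    return -weight[target] * math.Log(probs[target])
}

func main() {
    probs := []float64{0.7, 0.2, 0.1}   // model's predicted distribution
    weights := []float64{1.0, 5.0, 1.0} // class 1 (a "false negative") is 5x costlier
    fmt.Println(weightedCrossEntropy(probs, 1, weights)) // ~8.05
}</code></pre>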
<p>The real reason SLMs are trending is the hardware. Every laptop and smartphone sold in 2026 has a dedicated <strong>NPU (Neural Processing Unit)</strong>.</p><ul><li><p><strong>Privacy by Default:</strong> Because an SLM can fit on a device, we are seeing a &#8220;Privacy Renaissance.&#8221; Companies are no longer sending sensitive IP to a cloud provider&#8217;s API. They are running a specialized 3B-parameter model locally on the engineer&#8217;s machine.</p></li><li><p><strong>Zero Latency:</strong> When the model is on your local bus, latency is measured in microseconds, not seconds. This has enabled the &#8220;Live Interaction&#8221; era&#8212;AI that responds to your voice or screen state in real time without the &#8220;thinking&#8221; pause.</p></li></ul><p>The &#8220;smart&#8221; engineering teams are no longer choosing one model. They are building <strong>Ensembles of Specialists</strong>.</p><ul><li><p><strong>The Router Architecture:</strong> A tiny, extremely fast model (~100M parameters) acts as a traffic cop. It listens to the user&#8217;s request (see the sketch after this list).</p><ul><li><p><em>Code request?</em> Send it to the Code-SLM.</p></li><li><p><em>Legal question?</em> Send it to the Law-SLM.</p></li><li><p><em>Silly joke?</em> Send it to the general-purpose &#8220;cheap&#8221; model.</p></li></ul></li></ul>
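<p>A toy version of that traffic cop in Go. A keyword heuristic stands in here for the real small classifier model, and the SLM names are placeholders:</p><pre><code>package main

import (
    "fmt"
    "strings"
)

// route picks which specialist should handle a request; a real
// router would be a tiny classifier model, not string matching.
func route(request string) string {
    r := strings.ToLower(request)
    switch {
    case strings.Contains(r, "compile") || strings.Contains(r, "func"):
        return "code-slm"
    case strings.Contains(r, "clause") || strings.Contains(r, "contract"):
        return "law-slm"
    default:
        return "general-slm" // the cheap fallback
    }
}

func main() {
    for _, r := range []string{
        "why won't this func compile?",
        "review the indemnity clause",
        "tell me a silly joke",
    } {
        fmt.Printf("%-34q -&gt; %s\n", r, route(r))
    }
}</code></pre>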
<p>This modularity allows companies to swap out &#8220;specialists&#8221; as better ones become available, rather than being locked into one massive, monolithic provider.</p><p>The era of the &#8220;Generalist&#8221; is ending. In 2026, the competitive advantage belongs to those who can fine-tune small, efficient models on proprietary, high-quality data. The &#8220;Scale-Only&#8221; doctrine is dead; long live the &#8220;Precision&#8221; doctrine.</p>]]></content:encoded></item><item><title><![CDATA[Understanding Communicating Sequential Processes (CSP)]]></title><description><![CDATA[In the landscape of modern software development, concurrency is no longer a luxury&#8212;it is a survival requirement.]]></description><link>https://limitedintelligence.substack.com/p/understanding-communicating-sequential</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/understanding-communicating-sequential</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 22 Apr 2026 13:02:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DGio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!DGio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png" width="1456" height="971" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1193176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/194729175?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DGio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DGio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DGio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DGio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bbd24ee-718f-4c98-945b-57e85096282c_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the landscape of modern software development, concurrency is no longer a luxury&#8212;it is a survival requirement. As processor speeds have plateaued and we have transitioned into the era of many-core architectures, the burden of performance has shifted from the hardware engineer to the software architect. 
<p>Enter <strong>Communicating Sequential Processes (CSP)</strong>. While it sounds like a mouthful of academic jargon, CSP is a formal language and a philosophical approach to concurrency that prioritizes clarity, safety, and predictability. By shifting the focus from &#8220;protecting data&#8221; to &#8220;moving data,&#8221; CSP offers a blueprint for building massive, scalable systems without losing one&#8217;s mind to the chaos of shared state.</p><p>To understand CSP, we must look back to 1978, when <strong>Tony Hoare</strong> published his seminal paper. Before this, concurrency was largely handled through shared memory. Imagine a single whiteboard in a busy office where ten people are trying to write their own schedules at the same time. To keep things orderly, you would need a &#8220;lock&#8221; (a physical guard) who only lets one person touch the whiteboard at a time. If the guard falls asleep or two people grab the same marker, the system collapses.</p><p>Hoare proposed a radical alternative: What if, instead of a shared whiteboard, every person had their own notebook? If they needed to coordinate, they would pass a physical note to one another. This &#8220;message passing&#8221; is the heartbeat of CSP.</p><p>The mantra of the CSP approach can be summarized in a single, transformative sentence:</p><blockquote><p><strong>&#8220;Do not communicate by sharing memory; instead, share memory by communicating.&#8221;</strong></p></blockquote><p>CSP relies on two fundamental abstractions that work in tandem to create a harmonious system.</p><p>In CSP, a &#8220;Process&#8221; is a self-contained logic unit that executes sequentially. It doesn&#8217;t care about the outside world except when it needs to send or receive information. Because it is sequential, the developer can reason about it just like a standard, single-threaded program. There are no hidden side effects from other threads creeping in to change its variables.</p><p>Channels are the conduits through which processes interact. Think of a channel as a specialized pipe. One process drops a piece of data into one end, and another process pulls it out of the other. The channel handles the heavy lifting of synchronization, ensuring that the data transfer happens safely.</p><p>One of the most elegant aspects of pure CSP is the concept of the <strong>Rendezvous</strong>. By default, communication over a channel is synchronous.</p><ul><li><p>If a <strong>Sender</strong> wants to send data, it blocks and waits until a <strong>Receiver</strong> is ready to take it.</p></li><li><p>If a <strong>Receiver</strong> wants to get data, it blocks and waits until a <strong>Sender</strong> is ready to provide it.</p></li></ul><p>When both are present at the channel, the data transfer occurs, and both continue on their merry way. This &#8220;handshake&#8221; eliminates the need for manual locks or semaphores. The communication <em>is</em> the synchronization.</p>
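<p>Go&#8217;s channels are a direct descendant of this model (the mantra above is, in fact, a Go proverb), so an unbuffered Go channel demonstrates the rendezvous exactly:</p><pre><code>package main

import (
    "fmt"
    "time"
)

func main() {
    // An unbuffered channel: send and receive must rendezvous.
    notes := make(chan string)

    go func() {
        fmt.Println("sender: waiting for a receiver...")
        notes &lt;- "meet at 10:00" // blocks until main is ready to receive
        fmt.Println("sender: note handed over, moving on")
    }()

    time.Sleep(500 * time.Millisecond) // the receiver shows up late
    fmt.Println("receiver got:", &lt;-notes)
    time.Sleep(50 * time.Millisecond) // let the sender's last line print
}</code></pre>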
<p>To appreciate why CSP has gained so much traction in high-performance systems, we must compare it to the traditional &#8220;Shared Memory&#8221; model.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nIYQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac853164-582b-413f-8098-d97cf979396d_1992x356.png" width="1456" height="260" alt="Comparison of the CSP and Shared Memory models"></figure></div><p>A system where processes just sit and wait for a single channel would be quite rigid. CSP introduces the concept of a <strong>Choice</strong> (often implemented as a <code>select</code> statement). This allows a process to monitor multiple channels at once.</p><p>The process essentially says, &#8220;I am ready to talk to anyone who is ready to talk to me.&#8221; If three different channels have data available, the process picks one (often pseudo-randomly or based on priority) and executes that specific logic. This enables the creation of highly responsive systems that can handle timeouts, cancellations, and multi-source data streams gracefully.</p>
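<p>In Go, the Choice construct is literally the <code>select</code> statement. A small sketch with two sources and a timeout arm:</p><pre><code>package main

import (
    "fmt"
    "time"
)

func main() {
    fast := make(chan string)
    slow := make(chan string)
    go func() { time.Sleep(10 * time.Millisecond); fast &lt;- "fast result" }()
    go func() { time.Sleep(time.Second); slow &lt;- "slow result" }()

    // "Ready to talk to anyone who is ready to talk to me":
    // whichever channel is ready first wins; the timeout arm keeps
    // the process responsive if neither source ever delivers.
    select {
    case msg := &lt;-fast:
        fmt.Println(msg)
    case msg := &lt;-slow:
        fmt.Println(msg)
    case &lt;-time.After(100 * time.Millisecond):
        fmt.Println("timed out")
    }
}</code></pre>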
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>A system where processes just sit and wait for a single channel would be quite rigid. CSP introduces the concept of a <strong>Choice</strong> (often implemented as a <code>select</code> statement). This allows a process to monitor multiple channels at once.</p><p>The process essentially says, &#8220;I am ready to talk to anyone who is ready to talk to me.&#8221; If three different channels have data available, the process picks one (often pseudo-randomly or based on priority) and executes that specific logic. This enables the creation of highly responsive systems that can handle timeouts, cancellations, and multi-source data streams gracefully.</p><p>Once you have processes and channels, you can assemble them into sophisticated architectural patterns:</p><ul><li><p><strong>Pipelines:</strong> Much like a factory assembly line, one process performs Step A and passes the result to a channel; the next process performs Step B, and so on. This maximizes throughput.</p></li><li><p><strong>Fan-out:</strong> A single producer sends tasks to a channel, and multiple worker processes pull from that same channel to process data in parallel.</p></li><li><p><strong>Fan-in:</strong> Multiple processes send their results into a single &#8220;aggregator&#8221; channel that collects and reports the data.</p></li></ul><p>For those who enjoy the rigor of formal logic, CSP is actually a member of the <strong>Process Calculus</strong> family. It uses algebraic notations to prove that a system is free from certain types of errors. While most developers don&#8217;t write out the mathematical proofs, they benefit from the &#8220;mathematical cleanliness&#8221; of the model. It ensures that if you follow the rules of the protocol, the system remains mathematically sound.</p><p>While CSP solves the &#8220;Race Condition&#8221; (where two threads change data at the same time), it does not automatically solve every problem. Developers must still be wary of:</p><ol><li><p><strong>Deadlock:</strong> If Process A is waiting for Process B, and Process B is waiting for Process A, the system grinds to a halt.</p></li><li><p><strong>Livelock:</strong> Processes are so busy responding to each other that they never actually get any &#8220;real&#8221; work done.</p></li><li><p><strong>Channel Leaks:</strong> If a process creates a channel but never closes it or stops listening, memory can slowly bleed away.</p></li></ol><p>We are living in an era of distributed systems and microservices. Interestingly, the way microservices communicate over a network (via APIs or Message Queues) is essentially CSP on a macro scale. By adopting CSP within a single application, developers can use the same mental model for their internal code as they do for their entire cloud infrastructure.</p><p>It promotes a <strong>decoupled architecture</strong>. Because processes only know about the channels they hold, they don&#8217;t need to know anything about the internal state of other processes. This makes code more modular, easier to test, and significantly more resilient to change.</p><p>Communicating Sequential Processes is more than just a technical implementation; it is a shift in perspective. 
<p>For those who enjoy the rigor of formal logic, CSP is actually a member of the <strong>Process Calculus</strong> family. It uses algebraic notations to prove that a system is free from certain types of errors. While most developers don&#8217;t write out the mathematical proofs, they benefit from the &#8220;mathematical cleanliness&#8221; of the model. It ensures that if you follow the rules of the protocol, the system remains mathematically sound.</p><p>While CSP solves the &#8220;Race Condition&#8221; (where two threads change data at the same time), it does not automatically solve every problem. Developers must still be wary of:</p><ol><li><p><strong>Deadlock:</strong> If Process A is waiting for Process B, and Process B is waiting for Process A, the system grinds to a halt.</p></li><li><p><strong>Livelock:</strong> Processes are so busy responding to each other that they never actually get any &#8220;real&#8221; work done.</p></li><li><p><strong>Channel Leaks:</strong> If a process creates a channel but never closes it or stops listening, memory can slowly bleed away.</p></li></ol><p>We are living in an era of distributed systems and microservices. Interestingly, the way microservices communicate over a network (via APIs or Message Queues) is essentially CSP on a macro scale. By adopting CSP within a single application, developers can use the same mental model for their internal code as they do for their entire cloud infrastructure.</p><p>It promotes a <strong>decoupled architecture</strong>. Because processes only know about the channels they hold, they don&#8217;t need to know anything about the internal state of other processes. This makes code more modular, easier to test, and significantly more resilient to change.</p><p>Communicating Sequential Processes is more than just a technical implementation; it is a shift in perspective. It moves us away from the &#8220;God-object&#8221; pattern, where one giant block of memory is poked and prodded by a thousand different fingers, toward a &#8220;Society of Specialists&#8221; who collaborate through clear, defined communication.</p><p>By embracing the channel and the sequential process, we stop fighting the nature of multi-core hardware and start working with it. Whether you are building a high-frequency trading platform, a real-time chat application, or a simple web server, the principles of CSP provide a stable foundation in an increasingly concurrent world.</p>]]></content:encoded></item><item><title><![CDATA[Openskills.sh]]></title><description><![CDATA[Marketplace for Agent Skills]]></description><link>https://limitedintelligence.substack.com/p/openskillssh</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/openskillssh</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 21 Apr 2026 13:02:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ejbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" width="1456" height="795" alt=""></figure></div>
src="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:656785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/194708422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ejbm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 424w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 848w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 1272w, https://substackcdn.com/image/fetch/$s_!Ejbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fc3b54-f30d-47c0-93e1-eefb51e92929_4050x2212.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The transition from &#8220;Chatbots&#8221; to &#8220;Agents&#8221; is the defining shift of the current AI era. 
<p>Enter <strong>Agent Skills</strong> and <strong>openskills.sh</strong>. This ecosystem represents the first successful attempt to standardize how we package, discover, and execute AI capabilities. It&#8217;s essentially the &#8220;npm for AI agents,&#8221; transforming loose prompts and scripts into portable, versioned, and sandboxed modules.</p><p>Before the Agent Skills standard emerged, we primarily used <strong>Function Calling</strong>. You would provide a JSON schema to an LLM, and it would output a JSON object to trigger a local function. While effective, it had three major flaws:</p><ol><li><p><strong>Token Bloat:</strong> You had to cram every tool definition into the system prompt. If your agent had 50 tools, it might spend 5,000 tokens just &#8220;remembering&#8221; what it could do before you even asked a question.</p></li><li><p><strong>Maintenance Hell:</strong> Tool definitions were often buried in code (Python or TypeScript). Non-developers couldn&#8217;t easily tweak the instructions an agent used to perform a specific task.</p></li><li><p><strong>Non-Portability:</strong> A tool written for a LangChain agent couldn&#8217;t easily be dropped into a Cursor IDE session or a Claude Code terminal.</p></li></ol><p><strong>Agent Skills</strong> solve this by treating a capability as a <strong>static asset</strong>&#8212;a folder containing a <code>SKILL.md</code> file and any necessary supporting scripts.</p><p>A &#8220;Skill&#8221; is a structured directory. It is designed to be human-readable and agent-optimized. The heart of any skill is the <code>SKILL.md</code> file, which follows a specific specification:</p><p>At the top of every skill is a <strong>YAML frontmatter</strong> block for metadata, followed by Markdown instructions for the LLM.</p><pre><code>---
name: kubernetes-ops
description: Specialized skill for managing K8s clusters, pods, and deployments.
version: 1.2.0
author: dev-ops-team
allowed_tools: [shell, read_file, write_file]
---

# Kubernetes Operations Instructions
When the user asks to "check the cluster health" or "restart a service," 
follow this protocol:
1. List all pods in the namespace using `kubectl get pods`.
2. Check for any pods with a status other than 'Running'.
3. If a pod is in 'CrashLoopBackOff', fetch the logs using `kubectl logs &lt;pod-name&gt;`.
...
</code></pre><p>LLMs are native speakers of Markdown. By using a <code>.md</code> file, we allow the agent to &#8220;read&#8221; the manual only when it&#8217;s relevant. This leads us to the most important concept in the <code>openskills.sh</code> ecosystem: <strong>Progressive Disclosure</strong>.</p><p>In a traditional setup, the agent is like a student forced to memorize the entire library before the exam. With Agent Skills, the agent is given a <strong>catalog</strong> (see the sketch after this list):</p><ol><li><p><strong>Discovery:</strong> The agent is given a list of skill names and descriptions (the YAML frontmatter).</p></li><li><p><strong>Invocation:</strong> When a user says, &#8220;Fix my Kubernetes deployment,&#8221; the agent realizes the <code>kubernetes-ops</code> skill is relevant.</p></li><li><p><strong>Loading:</strong> The agent uses a tool (like <code>npx openskills read</code>) to pull the full content of <code>SKILL.md</code> into its context.</p></li></ol><p>This keeps the base system prompt lean and allows agents to scale to hundreds of specialized skills without losing their &#8220;reasoning&#8221; headroom to overhead.</p>
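<p>A rough Go sketch of the discovery-then-load flow. The naive frontmatter split below stands in for a real YAML parser, and nothing here is the actual <code>openskills</code> CLI internals; it is only the shape of the idea:</p><pre><code>package main

import (
    "fmt"
    "os"
    "strings"
)

// catalogEntry is all the agent sees up front: frontmatter
// metadata, never the full instruction body.
type catalogEntry struct{ Name, Description string }

// splitFrontmatter separates the YAML block from the Markdown body.
func splitFrontmatter(raw string) (meta, body string, err error) {
    parts := strings.SplitN(raw, "---", 3)
    if len(parts) &lt; 3 {
        return "", "", fmt.Errorf("no frontmatter block found")
    }
    return parts[1], strings.TrimSpace(parts[2]), nil
}

// discover implements the cheap catalog phase: read only the metadata.
func discover(path string) (catalogEntry, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return catalogEntry{}, err
    }
    meta, _, err := splitFrontmatter(string(raw))
    if err != nil {
        return catalogEntry{}, err
    }
    var e catalogEntry
    for _, line := range strings.Split(meta, "\n") {
        if v, ok := strings.CutPrefix(line, "name:"); ok {
            e.Name = strings.TrimSpace(v)
        } else if v, ok := strings.CutPrefix(line, "description:"); ok {
            e.Description = strings.TrimSpace(v)
        }
    }
    return e, nil
}

func main() {
    e, err := discover("kubernetes-ops/SKILL.md")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Only after the skill is judged relevant would the full body
    // be pulled into context (the Loading step above).
    fmt.Printf("catalog: %s - %s\n", e.Name, e.Description)
}</code></pre>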
<p><strong>openskills.sh</strong> is the hub for this entire movement. It serves three primary roles: the <strong>Standard</strong>, the <strong>Registry</strong>, and the <strong>Runtime</strong>.</p><p>Much like <code>npmjs.com</code> or <code>crates.io</code>, <code>openskills.sh</code> provides a centralized marketplace where developers can publish skills. If you need a skill to interact with the Jira API, handle complex PDF parsing, or manage AWS Lambda functions, you don&#8217;t write it from scratch&#8212;you install it.</p><p>The <code>openskills</code> CLI tool allows for universal installation across any agent environment.</p><ul><li><p><code>npx openskills install anthropics/web-search</code></p></li><li><p><code>npx openskills install my-org/internal-db-query --private</code></p></li></ul><p>Perhaps most critically, <code>openskills</code> provides a <strong>sandboxed execution environment</strong>. When an agent invokes a skill that requires running a Python script or a Shell command, <code>openskills</code> ensures that code runs in a restricted container, protecting your local machine or server.</p><p>One of the biggest risks of autonomous agents is &#8220;Prompt Injection&#8221; leading to &#8220;Remote Code Execution&#8221; (RCE). If an agent reads a malicious file and decides to run <code>rm -rf /</code>, a standard shell tool would comply.</p><p><code>openskills.sh</code> implements a <strong>Dual Sandbox Architecture</strong>:</p><ol><li><p><strong>Native Sandboxing (OS-Level):</strong> On macOS, it uses <strong>Seatbelt</strong> (the sandbox technology behind the App Store). On Linux, it uses <strong>Landlock</strong>. This restricts the scripts&#8217; access to only specific directories and blocks network access unless explicitly granted in the skill&#8217;s metadata.</p></li><li><p><strong>WASM/WASI Sandboxing (Experimental):</strong> For cross-platform safety, skills can be compiled into WebAssembly. This provides a completely isolated memory space and a capability-based security model.</p></li></ol><p>There is often confusion between Anthropic&#8217;s <strong>MCP</strong> and <strong>Agent Skills</strong>. While they both aim for interoperability, they solve different problems:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!XDR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394fd20-1ded-45bb-a5af-ddcdf430b2d0_1194x378.png" width="1194" height="378" alt="Comparison of MCP and Agent Skills"></figure></div>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The Synergy:</strong> Most modern agentic workflows use <strong>MCP</strong> for <em>data</em> (the &#8220;What&#8221;) and <strong>Agent Skills</strong> for <em>process</em> (the &#8220;How&#8221;).</p><p>To create a skill and share it via <code>openskills.sh</code>, you follow a modular workflow. Let&#8217;s build a &#8220;Security Auditor&#8221; skill.</p><pre><code><code>security-auditor/
&#9500;&#9472;&#9472; SKILL.md
&#9500;&#9472;&#9472; scripts/
&#9474;   &#9492;&#9472;&#9472; scan_ports.py
&#9492;&#9472;&#9472; resources/
    &#9492;&#9472;&#9472; common_vulnerabilities.json
</code></pre><p>In <code>SKILL.md</code>, you define exactly how the agent should behave when it finds an open port.</p><p>You can use the CLI to &#8220;dry run&#8221; how an agent sees your skill:</p><p><code>npx openskills read ./security-auditor</code></p><p>Once ready, you can push your skill to a GitHub repo or the <code>openskills.sh</code> registry. Because the standard is open, anyone using <strong>Cursor</strong>, <strong>Windsurf</strong>, or <strong>Claude Code</strong> can immediately &#8220;equip&#8221; your skill.</p><p>As we move toward multi-agent systems, <code>openskills.sh</code> becomes the &#8220;language&#8221; of handoffs. An orchestrator agent can query a directory of skills, see that &#8220;Agent B&#8221; has the <code>stripe-billing</code> skill, and delegate the task with full confidence that the instructions and safety guards are pre-defined.</p><p>We are entering an era of <strong>Composable Intelligence</strong>. Instead of building one giant &#8220;god-model&#8221; that knows everything, we are building specialized, tiny experts that can be shared, versioned, and audited.</p><p>If you are a developer, <code>openskills.sh</code> is your way to ensure your code is &#8220;agent-ready.&#8221; If you are an enterprise, it is your way to enforce safety and consistency across your AI workforce.</p><p>The wall between &#8220;writing code&#8221; and &#8220;prompting AI&#8221; is dissolving. <strong>Skills</strong> are the glue that holds these two worlds together.</p>]]></content:encoded></item>
]]></content:encoded></item><item><title><![CDATA[The Sliding Window Strategy in LLM Training]]></title><description><![CDATA[In the era of Generative AI, the &#8220;secret sauce&#8221; of a high-performing Large Language Model (LLM) isn&#8217;t just the number of parameters or the quality of the GPU cluster; it&#8217;s the way the model consumes information.]]></description><link>https://limitedintelligence.substack.com/p/the-sliding-window-strategy-in-llm</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-sliding-window-strategy-in-llm</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 20 Apr 2026 13:02:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3Mg7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff745dd72-8e03-4a2b-9600-5225885a12c3_783x475.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3Mg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff745dd72-8e03-4a2b-9600-5225885a12c3_783x475.png" width="783" height="475" alt="LLM in a flash: Efficient LLM Inference with Limited Memory" title="LLM in a flash: Efficient LLM Inference with Limited Memory"></figure></div>
<p>In the era of Generative AI, the &#8220;secret sauce&#8221; of a high-performing Large Language Model (LLM) isn&#8217;t just the number of parameters or the quality of the GPU cluster; it&#8217;s the way the model consumes information. Training a Transformer-based model is essentially an exercise in pattern recognition across massive datasets. However, these models have a fundamental limitation: the <strong>context window</strong>.</p><p>When your dataset consists of billion-word corpora&#8212;books, code repositories, and long-form articles&#8212;you cannot simply &#8220;feed&#8221; an entire document into the model at once. You must sample it.
Among the various techniques used to bridge the gap between massive datasets and finite context windows, the <strong>Sliding Window</strong> approach stands out as a critical strategy for maintaining semantic continuity and maximizing data utility.</p><p>To understand why we need sliding windows, we first have to look at the architecture of a Transformer. Most modern LLMs utilize a fixed context length. Whether it&#8217;s 2,048 tokens (like early GPT-3) or 128,000 tokens (like modern Claude or GPT-4 iterations), the model has a &#8220;hard limit&#8221; on how many tokens it can process in a single forward pass.</p><p>If you have a document with 10,000 tokens and your model has a context window of 1,000 tokens, you have a problem. How do you slice that document?</p><p>The simplest method is to cut the document into non-overlapping blocks of 1,000 tokens:</p><ul><li><p>Block 1: Tokens 1 to 1,000</p></li><li><p>Block 2: Tokens 1,001 to 2,000</p></li><li><p>&#8230;and so on.</p></li></ul><p><strong>The Problem:</strong> This creates &#8220;boundary artifacts.&#8221; If a crucial piece of information (like the subject of a sentence) is at token 999, and the verb is at token 1,002, the model will never see them together. The model loses the &#8220;flow&#8221; of the text, and the transitions between blocks become blind spots.</p><p>The sliding window technique solves this by introducing <strong>overlap</strong>. Instead of jumping exactly one window length forward, the sampling window moves forward by a smaller step, known as the <strong>stride</strong> (<em>S</em>).</p><p>The sliding window is defined by two primary hyperparameters:</p><ol><li><p><strong>Window Size (W):</strong> The total number of tokens the model can see at once (the context length).</p></li><li><p><strong>Stride (S):</strong> The number of tokens the window moves forward after each sample.</p></li></ol><p>If we have a document of length <em>D</em>, the number of samples <em>N</em> we can extract is roughly:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = \\frac{D - W}{S} + 1&quot;,&quot;id&quot;:&quot;GOTUXBJAVU&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>If <strong>S = W</strong>, we have non-overlapping blocks (simple truncation).</p></li><li><p>If <strong>S &lt; W</strong>, we have overlapping blocks (the sliding window).</p></li><li><p>If <strong>S = 1</strong>, we have a &#8220;maximal&#8221; sliding window where every token eventually appears at every possible position within a window (extremely computationally expensive).</p></li></ul><p>Language is not modular. Ideas, arguments, and narratives flow across token boundaries. By using a sliding window with a stride smaller than the window size, we ensure that every token&#8212;and more importantly, every <em>relationship</em> between tokens&#8212;is captured in multiple contexts. This allows the model to learn how to handle preceding context more effectively, as it sees the same information appearing at the end of one window and the beginning of the next.</p><p>In the world of Deep Learning, more data is usually better. Sliding windows act as a form of text-based data augmentation. By shifting the window by a few tokens, you create a &#8220;new&#8221; training example for the model. Even though the tokens are the same, their <strong>positional encodings</strong> change.</p>
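<p>As a quick sanity check of the formula, here is a minimal sketch (toy numbers of my choosing, not from any real pipeline) that enumerates the window start positions for a given stride:</p><pre><code># Toy check of N = (D - W) / S + 1
D, W, S = 10_000, 1_000, 500  # document length, window size, stride

starts = list(range(0, D - W + 1, S))
print(len(starts))   # 19 == (10_000 - 1_000) // 500 + 1
print(starts[:3])    # [0, 500, 1000]: adjacent windows share 500 tokens</code></pre>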
<p>In a Transformer, the model&#8217;s understanding of a token is heavily influenced by its position relative to others. Seeing the word &#8220;Quantum&#8221; at index 10 vs. index 500 helps the model become more robust to positional variations.</p><p>While a sliding window doesn&#8217;t technically increase the model&#8217;s physical context limit, it improves the model&#8217;s ability to &#8220;stitch&#8221; ideas together during inference. If the model was trained on overlapping samples, it becomes more adept at transitioning between chunks of text when generating long-form content.</p><p>Implementing a sliding window in a data pipeline (like a PyTorch <code>Dataset</code> or the Hugging Face <code>datasets</code> library) requires balancing memory efficiency with speed. A common approach is to tokenize each document once and then slice windows on demand:</p><ol><li><p>Load a long document.</p></li><li><p>Tokenize the entire document into a large array.</p></li><li><p>Use a pointer to slice the array: <code>tokens[i : i + window_size]</code>.</p></li><li><p>Increment <code>i</code> by the stride <em>S</em>.</p></li></ol><pre><code>import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    def __init__(self, tokens, window_size, stride):
        self.tokens = tokens
        self.window_size = window_size
        self.stride = stride
        self.samples = []

        # Pre-calculate the start index of every window
        for i in range(0, len(tokens) - window_size + 1, stride):
            self.samples.append(i)

    def __len__(self):
        # Required by DataLoader: the number of windows
        return len(self.samples)

    def __getitem__(self, idx):
        start = self.samples[idx]
        # Return a tensor so windows can be batched by a DataLoader
        return torch.tensor(self.tokens[start : start + self.window_size])
</code></pre><p>While the sliding window sounds like a &#8220;free win,&#8221; it comes with significant computational costs.</p><p>If you set a stride of <code>S = W/2</code> (50% overlap), you are essentially doubling the size of your training data. This means:</p><ul><li><p>2x more forward/backward passes.</p></li><li><p>2x more energy consumption.</p></li><li><p>2x more time to reach the same number of &#8220;epochs&#8221; over the raw text.</p></li></ul><p>In large-scale pre-training (like Llama 3 or GPT-4), compute is the most expensive resource. Engineers often choose a stride that is very close to the window size ($S \approx 0.9W$) to minimize redundancy while still providing enough overlap to smooth out boundary issues.</p><p>If the stride is too small, the model sees the same sequences over and over again. This can lead to <strong>overfitting</strong> on specific phrases or patterns found in the training data, rather than learning general linguistic rules.</p><p>If you pre-process your data into sliding window chunks and save them to disk (e.g., as <code>.bin</code> or <code>.jsonl</code> files), the storage requirements can explode. A 1TB dataset could easily become 5TB if a high-overlap sliding window is applied. Most modern pipelines perform the windowing &#8220;just-in-time&#8221; during the data loading phase to save disk space.</p><p>Some researchers use a dynamic stride based on the content. For example, if a document is identified as &#8220;high quality&#8221; (like a textbook), the stride might be smaller to ensure the model learns every nuance. For &#8220;lower quality&#8221; data (like web scrapes), a larger stride might be used to move through the data quickly.</p><p>It&#8217;s important to distinguish sliding windows from <strong>Packing</strong>. Packing is the practice of concatenating multiple <em>short</em> documents together to fill a single context window, separated by an <code>&lt;EOS&gt;</code> (end-of-sequence) token.</p><ul><li><p><strong>Sliding Window:</strong> Used to break down <em>one long</em> document.</p></li><li><p><strong>Packing:</strong> Used to combine <em>many short</em> documents.</p></li></ul><p>In a production-grade pipeline, these two techniques are often used together. You might slide through a long book, and if the last window of that book is only half-full, you &#8220;pack&#8221; the beginning of the next document into the remaining space.</p>
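<p>To make the distinction concrete, here is a toy sketch of packing (my own illustration, assuming already-tokenized documents and an integer <code>EOS</code> id; real pipelines also handle the leftover tail rather than dropping it):</p><pre><code># Pack many short tokenized documents into fixed-size windows,
# separating documents with an EOS token.
EOS = 0
WINDOW = 8

def pack(docs, window=WINDOW, eos=EOS):
    stream, windows = [], []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)  # document boundary marker
    for i in range(0, len(stream) - window + 1, window):
        windows.append(stream[i : i + window])
    return windows

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack(docs))  # [[1, 2, 3, 0, 4, 5, 0, 6]] (the tail is dropped here)</code></pre>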
<p>When using sliding windows, the way you calculate the <strong>Loss</strong> (usually Cross-Entropy Loss) can be adjusted. In a standard Next-Token Prediction task, you calculate the loss for all tokens in the window. However, in a sliding window setup with high overlap, some tokens are &#8220;new&#8221; to the model in the current window, while others were already seen at the end of the previous window.</p><p>Some researchers suggest calculating the loss only on the &#8220;new&#8221; tokens (the ones in the stride portion) or applying a lower weight to the &#8220;re-seen&#8221; tokens to prevent the model from over-optimizing on the middle sections of a window (a toy sketch of this masking idea closes out this post).</p><p>The sliding window is more than just a data-loading trick; it is a fundamental bridge between the linear nature of human language and the block-based nature of Transformer computation.</p><p>For most developers and researchers:</p><ul><li><p><strong>Use a small stride</strong> (<em>S</em> approximately 0.1W to 0.5W) for fine-tuning on domain-specific long-form data where every connection matters (e.g., legal docs, medical research).</p></li><li><p><strong>Use a large stride</strong> (<em>S</em> approximately 0.8W to 0.9W) for general pre-training to balance context continuity with computational efficiency.</p></li><li><p><strong>Never use zero overlap</strong> unless the data is naturally modular (like a collection of short tweets).</p></li></ul><p>As we push toward models with million-token context windows, the <em>necessity</em> for sliding windows may shift, but the <em>logic</em> remains: how we present data to a model determines how that model perceives the world. By sliding the window thoughtfully, we ensure the model never misses the forest for the trees&#8212;or the sentence for the tokens.</p>
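<p>As promised above, here is a toy sketch of the loss-masking idea (my own illustration with assumed shapes, not code from a production pipeline): it down-weights the overlapping prefix of each window before averaging the cross-entropy.</p><pre><code>import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, overlap, reseen_weight=0.2):
    """Cross-entropy where the first `overlap` targets of a window
    (already seen at the end of the previous window) get a lower weight.

    logits: (seq_len, vocab_size); targets: (seq_len,)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_token)
    weights[:overlap] = reseen_weight  # down-weight re-seen tokens
    return (per_token * weights).sum() / weights.sum()

# Toy usage with random data: a window of 8 tokens, 3 of them re-seen.
logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(masked_lm_loss(logits, targets, overlap=3))</code></pre>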
]]></content:encoded></item><item><title><![CDATA[Contextual Embeddings in LLMs]]></title><description><![CDATA[Moving from simply recognizing words to actually understanding them.]]></description><link>https://limitedintelligence.substack.com/p/contextual-embeddings-in-llms</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/contextual-embeddings-in-llms</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 16 Apr 2026 13:01:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M3iO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd7c8e2-fd48-466f-bac6-74156dcfe8e7_964x687.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!M3iO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd7c8e2-fd48-466f-bac6-74156dcfe8e7_964x687.png" width="964" height="687" alt=""></figure></div>
<p>In the evolution of Natural Language Processing (NLP), <strong>contextual embeddings</strong> represent one of the most significant breakthroughs. They allow modern Large Language Models (LLMs) to move beyond merely &#8220;recognizing&#8221; words to actually &#8220;understanding&#8221; them based on the specific intent and nuance of a sentence.</p><p>To understand why contextual embeddings are transformative, we must first look at what came before: <strong>Static Embeddings</strong> (e.g., Word2Vec, GloVe).</p><ul><li><p><strong>Static Embeddings:</strong> In these older models, every word is mapped to a single, fixed vector (a list of numbers) in a high-dimensional space. Whether the word &#8220;bank&#8221; appeared in &#8220;river bank&#8221; or &#8220;bank account,&#8221; the model assigned it the <em>exact same</em> vector. This approach failed to capture polysemy&#8212;the phenomenon where a single word has multiple, distinct meanings.</p></li><li><p><strong>Contextual Embeddings:</strong> These models, powered by the Transformer architecture, dynamically generate a unique vector for a word <em>each time it appears</em>.
The representation of &#8220;bank&#8221; is calculated by looking at the other words in the sentence, allowing the model to distinguish between a financial institution and the bank of a river.</p></li></ul><p>The magic of contextual embeddings happens within the <strong>Transformer architecture</strong>, specifically through a process known as <strong>Self-Attention</strong>.</p><h4>Step 1: Tokenization</h4><p>The input text is broken into smaller units called tokens (words or subwords). These tokens are assigned an initial, &#8220;base&#8221; vector that represents their broad meaning.</p><h4>Step 2: The Self-Attention Mechanism</h4><p>This is the engine of context. When the model processes a sequence of tokens, the self-attention mechanism computes how much focus (attention) each token should place on every other token in the sequence.</p><ul><li><p>If the input is &#8220;The <strong>bank</strong> of the river,&#8221; the &#8220;bank&#8221; token pays high attention to &#8220;river.&#8221;</p></li><li><p>If the input is &#8220;The <strong>bank</strong> is closed today,&#8221; the &#8220;bank&#8221; token pays high attention to &#8220;closed&#8221; and &#8220;today.&#8221;</p></li></ul><p>Through these attention weights, the vector for &#8220;bank&#8221; is updated to incorporate the semantics of the surrounding words.</p><h4>Step 3: Layered Processing</h4><p>Transformers consist of multiple &#8220;blocks&#8221; or layers. As the token passes through each layer, its representation is refined. Early layers might capture basic syntax (like grammar and word order), while deeper layers capture complex, abstract semantic relationships. By the time the token reaches the final layer, its vector is highly specific to its current context.</p><p>Contextual embeddings are the foundation upon which modern, reasoning-capable AI is built. They offer several critical advantages:</p><ul><li><p><strong>Handling Polysemy:</strong> As illustrated, the model accurately differentiates meanings, leading to vastly superior performance in translation, summarization, and sentiment analysis.</p></li><li><p><strong>Capturing Long-Range Dependencies:</strong> Traditional models struggled to link words that were far apart in a sentence. Self-attention allows tokens to &#8220;see&#8221; and interact with any other token, regardless of distance.</p></li><li><p><strong>Semantic Nuance:</strong> These embeddings don&#8217;t just capture dictionary definitions; they capture tone, intent, and stylistic variations.</p></li><li><p><strong>Foundation for RAG:</strong> In Retrieval-Augmented Generation (RAG) systems, contextual embeddings are used to index documents.
Because they are context-aware, they allow the system to retrieve highly relevant information even when a user&#8217;s query uses different phrasing than the source document.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1ie5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74afdac1-adb2-4a71-a49f-64472094f21c_1112x318.png" width="1112" height="318" alt=""></figure></div>
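<p>You can observe this polysemy handling directly. The sketch below is my own illustration, assuming the Hugging Face <code>transformers</code> package and the public <code>bert-base-uncased</code> checkpoint (any small encoder would do): it compares the vector that &#8220;bank&#8221; receives in different sentences.</p><pre><code>import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -&gt; torch.Tensor:
    """Return the contextual embedding of the token 'bank' in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

river = bank_vector("The bank of the river was muddy.")
money = bank_vector("The bank raised its interest rates.")
loan = bank_vector("The bank approved my loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))  # lower similarity: different senses
print(cos(money, loan, dim=0))   # higher similarity: same sense</code></pre>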
<p>By moving from static &#8220;definitions&#8221; to dynamic, context-based representations, contextual embeddings allow AI to mimic human comprehension, making them essential to the capabilities of current state-of-the-art models.</p>]]></content:encoded></item><item><title><![CDATA[Positional Embeddings in LLMs]]></title><description><![CDATA[Understand how positional embeddings affect the overall model training process.]]></description><link>https://limitedintelligence.substack.com/p/positional-embeddings-in-llms</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/positional-embeddings-in-llms</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 15 Apr 2026 13:02:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/692ed969-8785-42fe-bf97-11995f690bcb_938x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;924ec809-4b19-497b-84c4-97ac3eff652c&quot;,&quot;duration&quot;:null}"></div><p>In the architecture of Transformers, the self-attention mechanism is permutation-invariant: if you shuffle the order of words in a sentence, the attention scores remain identical.
To bridge this gap and allow the model to understand the sequence and structure of language, we inject <strong>positional information</strong>.</p><p>Positional embeddings serve as a &#8220;coordinate system&#8221; for the tokens in a sequence, allowing the model to distinguish between &#8220;The dog bit the man&#8221; and &#8220;The man bit the dog.&#8221; We categorize these approaches into <strong>Absolute</strong> and <strong>Relative</strong> methods.</p><h2>1. Absolute Positional Embeddings (APE)</h2><p>Absolute positional encoding assigns a unique vector representation to each position index (0, 1, 2, &#8230;, N) in the sequence. This vector is added to the token embedding before it enters the Transformer blocks.</p><h3>The Theory</h3><p>The core idea is to represent position as an &#8220;address.&#8221;</p><ul><li><p><strong>Learned Embeddings:</strong> In original models like BERT, each position <em>i</em> is mapped to a trainable vector <em>p_i</em>. The input becomes <em>x&#8217;_i = x_i + p_i</em>.</p></li><li><p><strong>Sinusoidal Embeddings:</strong> Introduced in the original &#8220;Attention Is All You Need&#8221; paper, these use fixed sine and cosine functions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE_{(pos, 2i)} = \\sin(pos/10000^{2i/d_{model}})&quot;,&quot;id&quot;:&quot;WJEUFUOODU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE_{(pos, 2i+1)} = \\cos(pos/10000^{2i/d_{model}})&quot;,&quot;id&quot;:&quot;YHLJUTXSWC&quot;}" data-component-name="LatexBlockToDOM"></div><p>This allows the model to potentially extrapolate to sequence lengths longer than those seen during training, as the functions are continuous.</p></li></ul><h3>Limitations</h3><ul><li><p><strong>Fixed Context Window:</strong> Learned APEs fail if the test sequence is longer than the maximum training length.</p></li><li><p><strong>Lack of Translation Invariance:</strong> The model doesn&#8217;t inherently understand that &#8220;the distance between word A and word B is the same&#8221; regardless of where they appear in the sentence.</p></li></ul>
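<p>A minimal NumPy sketch of the sinusoidal table defined above (my own illustration; it assumes an even <code>d_model</code>):</p><pre><code>import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -&gt; np.ndarray:
    """Fixed sine/cosine table from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / (10000 ** (i / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cos
    return pe

print(sinusoidal_pe(max_len=128, d_model=64).shape)  # (128, 64)</code></pre>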
<div><hr></div><h2>2. Relative Positional Embeddings (RPE)</h2><p>Rather than encoding &#8220;where&#8221; a word is, relative embeddings focus on the <strong>distance</strong> between two tokens (<em>i - j</em>). The intuition is that the relationship between two words depends on how far apart they are, not their absolute index.</p><h3>The Theory</h3><p>In standard self-attention, the score between tokens <em>i</em> and <em>j</em> is calculated as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q_i K_j^T&quot;,&quot;id&quot;:&quot;VTLSTHGBUB&quot;}" data-component-name="LatexBlockToDOM"></div><p>In relative schemes, we modify this to include a term representing the distance $\Delta = i - j$:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Attention_{i,j} = Q_i K_j^T + Q_i R_{i-j}^T&quot;,&quot;id&quot;:&quot;LUIOCSCIMF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <strong>R</strong> is a learnable embedding representing the relative distance between positions <em>i</em> and <em>j</em>.</p><h3>Key Modern Approaches</h3><ul><li><p><strong>RoPE (Rotary Positional Embeddings):</strong> Used in modern architectures like Llama and Mistral. RoPE encodes positions by rotating the query and key vectors in a complex plane. It captures relative information through the inner product, effectively decaying the attention score as the distance between tokens increases.</p></li><li><p><strong>ALiBi (Attention with Linear Biases):</strong> Rather than adding vectors, ALiBi adds a static, non-learned penalty to the attention scores based on the distance between tokens. This is exceptionally efficient and allows extrapolation to sequence lengths far beyond those seen in training.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EUFR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9c79d01-0e19-4f73-bfbc-dbf4f7ae4399_1248x318.png" width="1248" height="318" alt=""></figure></div>
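<p>For intuition, here is a toy single-head version of the ALiBi penalty (my own sketch; the paper assigns each head its own slope from a geometric series):</p><pre><code>import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.25) -&gt; np.ndarray:
    """Causal ALiBi-style bias: 0 on the diagonal, increasingly negative
    the further a key sits behind the query, -inf for future keys."""
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]  # j - i (negative for past keys)
    return np.where(dist &lt;= 0, slope * dist, -np.inf)

print(alibi_bias(4))  # row i holds the penalties added to query i's scores</code></pre>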
<p>Models like <strong>Claude</strong> or <strong>Llama 3</strong> rely heavily on RPEs (specifically RoPE). Because they do not rely on fixed index slots, these models can be fine-tuned to handle documents spanning hundreds of thousands of tokens. If you use a model to summarize a 500-page legal document, it is using relative positioning to maintain the coherence of facts separated by tens of thousands of words.</p><p>For structured tasks where the sequence length is tightly constrained (like fixed-length sentence translation), APEs are often sufficient and highly optimized in hardware. However, recent hybrid systems are increasingly shifting toward RoPE even here, as it provides a better &#8220;semantic anchor&#8221; for grammatical relationships.</p><p>In audio processing (like Whisper), the &#8220;time&#8221; of the input is continuous. Relative approaches are superior here because they allow the model to recognize rhythmic patterns or spectral features regardless of when they start in an audio file, offering better robustness to varying segment lengths.</p>
]]></content:encoded></item><item><title><![CDATA[Byte-Pair Encoding (BPE)]]></title><description><![CDATA[The Foundation of Modern LLM Tokenization]]></description><link>https://limitedintelligence.substack.com/p/byte-pair-encoding-bpe</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/byte-pair-encoding-bpe</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 14 Apr 2026 13:00:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tXNg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0196ea0d-95a4-4ad2-bf9f-15262a630fd8_1654x1036.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tXNg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0196ea0d-95a4-4ad2-bf9f-15262a630fd8_1654x1036.webp" width="1456" height="912" alt=""></figure></div>
<p>In the architecture of Large Language Models (LLMs), the model does not &#8220;read&#8221; text as humans do. It processes numerical representations. The bridge between raw human text and these numbers is <strong>tokenization</strong>. Among the various methods available, <strong>Byte-Pair Encoding (BPE)</strong> has emerged as the industry standard, powering models like GPT-4, Llama, and Mistral.</p><p>BPE is a subword tokenization algorithm that strikes a balance between character-level models (which are too granular) and word-level models (which struggle with infinite vocabulary sizes).</p><p>The core philosophy of BPE is <strong>iterative merging</strong>: it starts by treating every character as an individual token and progressively merges the most frequently occurring adjacent pairs of tokens into a new, single token.
This continues until a pre-defined vocabulary size is reached.</p><p>To understand the mechanics, imagine we are training a tokenizer on a tiny corpus containing the words: <em>&#8220;hug&#8221;</em>, <em>&#8220;pug&#8221;</em>, and <em>&#8220;pun&#8221;</em>.</p><p>We break the text down into individual characters and add an end-of-word marker (often <code>&lt;/w&gt;</code>):</p><ul><li><p><code>h</code>, <code>u</code>, <code>g</code>, <code>&lt;/w&gt;</code></p></li><li><p><code>p</code>, <code>u</code>, <code>g</code>, <code>&lt;/w&gt;</code></p></li><li><p><code>p</code>, <code>u</code>, <code>n</code>, <code>&lt;/w&gt;</code></p></li></ul><p>The algorithm counts the frequency of all adjacent pairs:</p><ul><li><p><code>u</code> + <code>g</code> appears twice.</p></li><li><p><code>p</code> + <code>u</code> appears twice.</p></li></ul><p>If we choose to merge <code>u</code> and <code>g</code>, they become a new token <code>ug</code>. The vocabulary now includes the individual characters plus the new compound token <code>ug</code>. We repeat this process, merging the next most frequent pair, until we hit our target vocabulary size.</p>
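<p>One merge step of this walkthrough fits in a few lines of Python. This is a toy sketch of my own (production tokenizers such as Hugging Face <code>tokenizers</code> are heavily optimized versions of the same idea):</p><pre><code>from collections import Counter

# The toy corpus above: each word as a tuple of symbols, with its frequency.
corpus = {
    ("h", "u", "g", "&lt;/w&gt;"): 1,
    ("p", "u", "g", "&lt;/w&gt;"): 1,
    ("p", "u", "n", "&lt;/w&gt;"): 1,
}

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Rewrite every word with the chosen pair fused into one symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i &lt; len(word):
            if i + 1 &lt; len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)  # ('u', 'g') and ('p', 'u') tie at 2
corpus = merge(corpus, pair)
print(pair, corpus)</code></pre>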
<p>BPE solved two massive problems in Natural Language Processing: out-of-vocabulary words and sequence length.</p><p>Traditional word-level models would fail if they encountered a word not in their training dictionary (e.g., a rare medical term or a made-up slang word). Because BPE can break unknown words down into smaller subwords (or even characters), it ensures the model can <strong>always</strong> generate a representation for any string.</p><p>By grouping frequently occurring character sequences (like &#8220;ing&#8221;, &#8220;tion&#8221;, or &#8220;pre&#8221;), BPE allows the model to represent long words with fewer tokens.</p><ul><li><p><strong>Word-level:</strong> &#8220;Tokenization&#8221; = 1 token (but requires a massive, unmanageable vocabulary).</p></li><li><p><strong>Character-level:</strong> &#8220;Tokenization&#8221; = 12 tokens (too long for the model&#8217;s limited context window).</p></li><li><p><strong>BPE:</strong> &#8220;Token&#8221; + &#8220;ization&#8221; = 2 tokens (optimized length and memory usage).</p></li></ul><p>Early versions of BPE operated on Unicode characters, which could lead to issues with rare emojis or non-Latin alphabets. Modern LLMs (like GPT-2 and beyond) utilize <strong>Byte-level BPE</strong>.</p><p>Instead of merging characters, the algorithm operates on the <strong>raw bytes</strong> of the UTF-8 encoding. This ensures:</p><ul><li><p><strong>Universal Coverage:</strong> The base vocabulary is fixed at 256 bytes.</p></li><li><p><strong>No &#8220;Unknown&#8221; Tokens:</strong> Because every string can be represented as bytes, the model is theoretically capable of tokenizing any input, regardless of language, emoji usage, or symbols.</p></li></ul><p>While powerful, BPE is not perfect:</p><ul><li><p><strong>Greedy Approach:</strong> BPE is a greedy algorithm. It doesn&#8217;t look at the context of the sentence; it simply merges the most frequent pairs globally. Sometimes, this results in unintuitive subwords.</p></li><li><p><strong>Complexity:</strong> It requires a separate tokenizer-training step. If your training corpus changes significantly, the tokenizer may become sub-optimal, which is why developers often use a tokenizer trained on the distribution of data the model will actually see.</p></li></ul><p>Byte-Pair Encoding is the silent engine behind the fluency of LLMs. By intelligently clustering the building blocks of language into meaningful subword units, BPE allows models to handle the vast, messy, and creative nature of human text with both efficiency and precision. It remains the most effective compromise between the granularity of characters and the semantic richness of words.</p>]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Harness Engineering]]></title><description><![CDATA[In the early days of Generative AI, developers focused on &#8220;Context Engineering&#8221;&#8212;ensuring the model had the right files and snippets to generate a single block of code.]]></description><link>https://limitedintelligence.substack.com/p/a-deep-dive-into-harness-engineering</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/a-deep-dive-into-harness-engineering</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 13 Apr 2026 13:01:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6iaH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58b2f2b-9489-4df7-95f3-a21e85ee1f55_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6iaH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58b2f2b-9489-4df7-95f3-a21e85ee1f55_1920x1080.png" width="1456" height="819" alt=""></figure></div>
<p>In the early days of Generative AI, developers focused on &#8220;Context Engineering&#8221;&#8212;ensuring the model had the right files and snippets to generate a single block of code.
However, as we move toward <strong>coding agents</strong> that can navigate entire codebases and perform multi-step tasks, context is no longer enough.</p><p>We need a way to trust the output without constant line-by-line manual review. This is where <strong>Harness Engineering</strong> begins.</p><p>In the relationship between a developer and an AI, the &#8220;Agent&#8221; is defined by the equation: <strong>Agent = Model + Harness.</strong></p><p>While the Model (LLM) provides the &#8220;reasoning&#8221; and token generation, the <strong>Harness</strong> is the structural framework that constrains that reasoning. A well-engineered harness serves two primary functions:</p><ul><li><p><strong>Increasing Probability:</strong> It makes it more likely the agent succeeds on the first attempt (Feedforward).</p></li><li><p><strong>Self-Correction:</strong> It provides sensors that allow the agent to detect and fix its own errors before a human ever sees them (Feedback).</p></li></ul><p>Harness Engineering borrows heavily from <strong>cybernetics</strong>, using a &#8220;Governor&#8221; model to regulate the codebase.</p><h3>Feedforward (The Guides)</h3><p>Guides are proactive. They provide the agent with &#8220;ambient affordances&#8221;&#8212;the rules of the road.</p><ul><li><p><strong>Computational Guides:</strong> Deterministic tools like &#8220;OpenRewrite&#8221; recipes or project scaffolds that force the agent into a specific structure.</p></li><li><p><strong>Inferential Guides:</strong> Semantic instructions, such as <code>AGENTS.md</code> files or &#8220;Skills&#8221; libraries, that explain the <em>intent</em> and <em>style</em> the agent should follow.</p></li></ul><h3>Feedback (The Sensors)</h3><p>Sensors are reactive. They observe the output and provide a signal for the agent to act upon.</p><ul><li><p><strong>The Power of Custom Linters:</strong> B&#246;ckeler notes that feedback is most powerful when optimized for LLMs. Instead of a generic error, a custom linter message could say: <em>&#8220;You violated our module boundary rule; please move this logic to the Service layer.&#8221;</em> This acts as a &#8220;positive prompt injection&#8221; that triggers self-correction.</p></li></ul><p>Harness Engineering changes the fundamental workflow of the software engineer. In a traditional workflow, a developer fixes bugs in the code. In a harnessed workflow, the developer <strong>iterates on the harness</strong>.</p><ol><li><p><strong>Issue Occurs:</strong> The agent produces a sub-par solution or violates a pattern.</p></li><li><p><strong>Harness Gap Analysis:</strong> The human identifies why the harness failed to prevent or detect this.</p></li><li><p><strong>Regulation Improvement:</strong> The human updates the guides (feedforward) or sensors (feedback).</p></li><li><p><strong>Verification:</strong> The agent reruns the task, now governed by the improved harness.</p></li></ol><p>This &#8220;Steering Loop&#8221; ensures that the engineering team&#8217;s collective intelligence is externalized into the system, making the codebase increasingly &#8220;agent-friendly&#8221; over time.</p><p>Not all parts of a codebase are equally easy to govern. B&#246;ckeler divides the harness into three functional categories:</p><h3>A. The Maintainability Harness</h3><p>This regulates internal code quality (complexity, style, duplication).</p><ul><li><p><strong>Status:</strong> High confidence. 
We have decades of static analysis tools (Linters, SonarQube) that act as cheap, fast, computational sensors.</p></li></ul><h3>B. The Architecture Fitness Harness</h3><p>This ensures the system adheres to its architectural characteristics (performance, modularity, observability).</p><ul><li><p><strong>Implementation:</strong> Using tools like <strong>ArchUnit</strong> to check module boundaries or performance tests that act as feedback loops if an agent introduces a latency regression.</p></li></ul><h3>C. The Behavior Harness (The &#8220;Elephant in the Room&#8221;)</h3><p>This regulates functional correctness&#8212;does the feature work?</p><ul><li><p><strong>The Challenge:</strong> Relying on AI to write its own tests creates a circular logic problem. If the AI misunderstood the requirement, it will write a &#8220;green&#8221; test that confirms its own misunderstanding.</p></li><li><p><strong>Current Solution:</strong> Humans must provide the behavioral ground truth through &#8220;approved fixtures&#8221; or manually verified functional specifications that the agent cannot alter.</p></li></ul><p>Why are some codebases easier for AI to handle than others? It comes down to <strong>Ambient Affordances</strong>.</p><p>A codebase written in a strongly typed, modular fashion has higher &#8220;harnessability.&#8221; It provides more &#8220;handles&#8221; for the harness to grab onto. B&#246;ckeler invokes <strong>Ashby&#8217;s Law of Requisite Variety</strong>, which states that a regulator must have as much variety as the system it governs.</p><p>Because an LLM can generate an infinite variety of code (much of it bad), we use <strong>Harness Templates</strong> to reduce that variety. By committing to specific service topologies (e.g., &#8220;This is a standard CRUD API&#8221;), we narrow the space the AI operates in, making a comprehensive harness achievable.</p><p>The most sophisticated harness cannot replace human intuition. Humans provide three things an LLM lacks:</p><ol><li><p><strong>Social Accountability:</strong> Your name is on the commit; the AI doesn&#8217;t care about the long-term consequences.</p></li><li><p><strong>Organizational Memory:</strong> Knowing <em>why</em> a specific technical debt was accepted for business reasons.</p></li><li><p><strong>Aesthetic Disgust:</strong> The visceral reaction to a 500-line function that &#8220;works&#8221; but is unmaintainable.</p></li></ol><p>Harness Engineering is not about reaching 100% automation. It is about <strong>shifting quality left</strong>. By building a system of computational and inferential guardrails, we ensure that when a human is finally called to review code, they are focusing on high-level design and intent, rather than catching the &#8220;toil&#8221; that a well-tuned harness should have caught automatically.</p>
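<p>As a closing illustration, here is a minimal Python sketch of the custom-linter sensor discussed earlier: a feedback tool that emits corrective instructions rather than bare error codes. Everything in it is an assumption for illustration: the layer names, the single boundary rule, and the message wording are hypothetical, not a real tool&#8217;s API.</p><pre><code class="language-python">import ast
import pathlib

# Hypothetical rule for illustration: code under a "controllers" package
# must not import from the "persistence" package directly.
FORBIDDEN = {"controllers": "persistence"}

def lint_module_boundaries(root: str) -> list[str]:
    """Scan a source tree and emit messages an agent can act on directly."""
    messages = []
    for path in pathlib.Path(root).rglob("*.py"):
        banned = FORBIDDEN.get(path.parent.name)
        if banned is None:
            continue
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.ImportFrom) and banned in (node.module or ""):
                # Phrase the finding as a corrective instruction, not a rule ID,
                # so it works as a "positive prompt injection" for self-correction.
                messages.append(
                    f"{path}:{node.lineno}: You violated our module boundary rule: "
                    f"'{path.parent.name}' must not import from '{banned}'. "
                    "Please move this logic to the Service layer."
                )
    return messages
</code></pre><p>Run after every agent turn, the returned strings can be appended straight into the agent&#8217;s context, closing the feedback loop before a human ever sees the violation.</p>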
]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Token Compaction for LLMs]]></title><description><![CDATA[The &#8220;context window&#8221; has become the new frontier of the AI arms race.]]></description><link>https://limitedintelligence.substack.com/p/a-deep-dive-into-token-compaction</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/a-deep-dive-into-token-compaction</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 10 Apr 2026 13:04:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9ed-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafac37-0285-4e9d-acfa-405010a25628_1999x919.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9ed-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedafac37-0285-4e9d-acfa-405010a25628_1999x919.png" width="1456" height="669" alt="Cold-Compress 1.0: A Hackable Toolkit for KV-Cache Compression &#8211; Answer.AI"></figure></div>
<p>The &#8220;context window&#8221; has become the new frontier of the AI arms race. We&#8217;ve moved from the 2,048-token limits of early GPT-3 to the million-token horizons of Gemini 1.5 and beyond. However, there is a fundamental law of physics&#8212;or at least, of GPU VRAM&#8212;that remains: <strong>Attention is expensive.</strong></p><p>As sequences grow, the Key-Value (KV) cache balloons, memory bandwidth bottlenecks emerge, and the quadratic scaling of self-attention, $O(n^2)$, threatens to turn even the most powerful H100 clusters into very expensive space heaters. Enter <strong>Token Compaction</strong>: the art and science of keeping the &#8220;signal&#8221; while ruthlessly discarding the &#8220;noise.&#8221;</p><p>To understand compaction, we must first acknowledge why we need it. In an auto-regressive Transformer, the model avoids recomputing hidden states for previous tokens by storing them in the <strong>KV Cache</strong>.</p><p>While this saves computation, it creates a massive memory footprint. For a model with $l$ layers, $h$ attention heads, and a per-head dimension $d$, the memory required for the KV cache of a sequence length $s$ is roughly:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Memory}_{\\text{KV}} \\approx 2 \\times s \\times l \\times h \\times d \\times \\text{bytes-per-parameter}&quot;,&quot;id&quot;:&quot;HMKSXXFHVE&quot;}" data-component-name="LatexBlockToDOM"></div><p>For a 70B parameter model at 16-bit precision, a 128k context window isn&#8217;t just a &#8220;long prompt&#8221;&#8212;it&#8217;s a hundred-gigabyte memory hurdle (see the sketch below). Token compaction techniques aim to reduce $s$ (the effective sequence length) without losing the semantic coherence required for accurate generation.</p>
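<p>A quick back-of-the-envelope check of that claim, using the formula above. The shape here (80 layers, 64 heads, head dimension 128, FP16) is an assumed 70B-class configuration, used purely for illustration:</p><pre><code class="language-python">def kv_cache_bytes(s: int, l: int, h: int, d: int, bytes_per_param: int = 2) -> int:
    """Keys + values (the leading 2) for every position, layer, and head."""
    return 2 * s * l * h * d * bytes_per_param

# Assumed 70B-class shape: 80 layers, 64 full KV heads, head dim 128, FP16.
size = kv_cache_bytes(s=128_000, l=80, h=64, d=128)
print(f"{size / 2**30:.1f} GiB")  # 312.5 GiB with full multi-head KV
# Grouped-query attention (covered below) shares KV heads and divides
# this number by the sharing factor, e.g. 8x fewer KV heads -> ~39 GiB.
</code></pre>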
<p>Token compaction isn&#8217;t a single &#8220;trick.&#8221; It is a spectrum of strategies ranging from &#8220;dropping tokens on the floor&#8221; to &#8220;fusing them into a smarter representation.&#8221; We can categorize these into four primary pillars:</p><ol><li><p><strong>Token Pruning (Selection)</strong></p></li><li><p><strong>Token Merging (Fusion)</strong></p></li><li><p><strong>Architectural Compression (GQA/MQA)</strong></p></li><li><p><strong>Dynamic Eviction (KV Cache Management)</strong></p></li></ol><p>Token pruning assumes that not all tokens are created equal. In a typical sentence, stop words, punctuation, or redundant fillers often carry low &#8220;attention weight.&#8221;</p><p>One of the most influential papers in this space, <em>H2O</em>, observed that a small fraction of tokens&#8212;termed &#8220;Heavy Hitters&#8221;&#8212;account for the vast majority of the attention scores. By maintaining a small, fixed-size cache of these high-influence tokens and discarding the rest, models can maintain performance while using significantly less memory.</p><ul><li><p><strong>How it works:</strong> The model tracks cumulative attention scores. If a token consistently receives high attention from subsequent tokens, it stays. If its attention score stays below a threshold, it&#8217;s evicted from the cache.</p></li><li><p><strong>The Benefit:</strong> It allows for theoretically infinite sequence lengths in a fixed memory budget, provided the &#8220;working memory&#8221; of the task fits within the H2O cache.</p></li></ul><p>Researchers discovered a curious phenomenon: the very first tokens in a sequence (the &#8220;sinks&#8221;) receive massive amounts of attention, regardless of their semantic value.</p><blockquote><p><strong>The Insight:</strong> If you remove the first token, the model&#8217;s perplexity explodes. <strong>StreamingLLM</strong> keeps the first few tokens (the anchors) and a sliding window of the most recent tokens, effectively &#8220;compacting&#8221; the context by ignoring the middle-distance history that hasn&#8217;t been flagged as important.</p></blockquote>
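<p>The two ideas compose naturally. Below is a toy NumPy sketch of an eviction policy that protects StreamingLLM-style attention sinks and then keeps H2O-style heavy hitters. The sink count and budget are arbitrary illustrative values, and real implementations additionally protect a window of the most recent tokens:</p><pre><code class="language-python">import numpy as np

def keep_indices(cum_attn: np.ndarray, n_sinks: int = 4, budget: int = 512) -> np.ndarray:
    """Choose which cache positions survive eviction.

    cum_attn: (seq_len,) cumulative attention each cached token has received.
    """
    seq_len = cum_attn.shape[0]
    if budget >= seq_len:
        return np.arange(seq_len)            # everything still fits
    sinks = np.arange(n_sinks)               # always keep the anchor tokens
    rest = np.arange(n_sinks, seq_len)
    # Fill the remaining budget with the heaviest hitters.
    heavy = rest[np.argsort(cum_attn[rest])[-(budget - n_sinks):]]
    return np.sort(np.concatenate([sinks, heavy]))

rng = np.random.default_rng(0)
kept = keep_indices(rng.random(2048))        # 2048 cached tokens shrink to 512
</code></pre>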
<p>If pruning is a scalpel, <strong>Token Merging (ToMe)</strong> is a blender. Instead of deciding if a token is &#8220;in&#8221; or &#8220;out,&#8221; merging looks for tokens that are mathematically similar and combines them.</p><p>Using a similarity metric (often Cosine Similarity) between the Key vectors ($K$), the algorithm identifies clusters of tokens that represent the same concept or context. The procedure, sketched in code below, is:</p><ol><li><p><strong>Partition:</strong> Divide tokens into two sets.</p></li><li><p><strong>Compare:</strong> Calculate the similarity between sets.</p></li><li><p><strong>Merge:</strong> Average the most similar pairs into a single token representation.</p></li><li><p><strong>Weight:</strong> Increase the &#8220;importance&#8221; weight of the new merged token so the attention mechanism knows it represents a larger chunk of the original text.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Merged\\_Vector} = \\frac{\\sum (v_i \\cdot w_i)}{\\sum w_i}&quot;,&quot;id&quot;:&quot;NPEDNTEKLW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is particularly effective in multimodal LLMs. In an image, 50 tokens representing a &#8220;clear blue sky&#8221; can be merged into one with almost zero loss in descriptive power.</p>
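<p>A toy NumPy version of those four steps, assuming a single head and skipping the positional bookkeeping a real ToMe implementation does inside its attention blocks (the input arrays are modified in place):</p><pre><code class="language-python">import numpy as np

def merge_tokens(keys, values, weights, r):
    """Bipartite soft matching: fold the r most similar (A, B) pairs together."""
    a = np.arange(0, len(keys), 2)                   # 1. Partition into two sets
    b = np.arange(1, len(keys), 2)
    ka = keys[a] / np.linalg.norm(keys[a], axis=1, keepdims=True)
    kb = keys[b] / np.linalg.norm(keys[b], axis=1, keepdims=True)
    sim = ka @ kb.T                                  # 2. Compare (cosine similarity)
    best, partner = sim.max(axis=1), sim.argmax(axis=1)
    merged = np.argsort(best)[-r:]                   # the r closest pairs
    for i in merged:                                 # 3. + 4. Weighted average merge
        src, dst = a[i], b[partner[i]]
        total = weights[src] + weights[dst]
        keys[dst] = (keys[dst] * weights[dst] + keys[src] * weights[src]) / total
        values[dst] = (values[dst] * weights[dst] + values[src] * weights[src]) / total
        weights[dst] = total                         # merged token "speaks for" both
    keep = np.setdiff1d(np.arange(len(keys)), a[merged])
    return keys[keep], values[keep], weights[keep]

rng = np.random.default_rng(0)
k, v, w = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), np.ones(8)
k2, v2, w2 = merge_tokens(k, v, w, r=2)              # 8 tokens shrink to 6
</code></pre>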
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/193255035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Alh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 424w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 848w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 1272w, https://substackcdn.com/image/fetch/$s_!3Alh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38bb5c4c-d5b3-4f02-b222-997959477b38_1278x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Sometimes, you don&#8217;t need fewer tokens; you just need &#8220;smaller&#8221; tokens. Standard models use FP16 or BF16 (16 bits per parameter). Quantization techniques like <strong>KIVI</strong> or <strong>KV-Quant</strong> compress these down to 4-bit or even 2-bit representations.</p><p>The challenge here is the <strong>Outlier Problem</strong>. 
<p>While token compaction sounds like a miracle, it comes with &#8220;The Compression Tax.&#8221;</p><ol><li><p><strong>Retrieval Accuracy:</strong> If you are doing &#8220;Needle in a Haystack&#8221; tests, pruning can accidentally throw away the &#8220;needle.&#8221;</p></li><li><p><strong>Reasoning Chains:</strong> In multi-step logic (CoT), seemingly &#8220;unimportant&#8221; intermediate steps might be vital for the final output. Compaction algorithms often struggle to distinguish between &#8220;filler&#8221; and &#8220;structural logic.&#8221;</p></li><li><p><strong>Prefill Latency:</strong> Merging tokens requires calculating similarity matrices, which can actually make the initial &#8220;reading&#8221; of the prompt slower, even if the &#8220;generation&#8221; phase becomes faster.</p></li></ol><p>The future of token compaction lies in <strong>Learned Compaction</strong>. Instead of using fixed heuristics (like &#8220;keep the first 4 tokens&#8221;), we are seeing the rise of models that have a &#8220;gatekeeper&#8221; layer. This layer predicts the importance of a token <em>before</em> it enters the KV cache.</p><p>As we march toward &#8220;Infinite Context,&#8221; the bottleneck will shift from how much we can store to how efficiently we can index. Token compaction is essentially the creation of a &#8220;searchable index&#8221; for the model&#8217;s own memory.</p><p>We are moving away from the &#8220;Brute Force&#8221; era of LLMs. In the early days, we simply threw more VRAM at the problem. Today, we are teaching models to be discerning&#8212;to realize that in a 100,000-word book, the specific phrasing of a &#8220;the&#8221; or a &#8220;but&#8221; is less important than the character&#8217;s motivation. Token compaction is, in a sense, the first step toward giving AI a &#8220;subconscious&#8221; filter.</p>
]]></content:encoded></item><item><title><![CDATA[Parallels Between Human Memory and Large Language Models]]></title><description><![CDATA[The human mind and the Large Language Model (LLM) are arguably the two most sophisticated information-processing systems in existence.]]></description><link>https://limitedintelligence.substack.com/p/parallels-between-human-memory-and</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/parallels-between-human-memory-and</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 06 Apr 2026 13:03:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lDTP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47de4763-fcd1-4b3d-8e31-5fef1f0e8270_2560x1440.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lDTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47de4763-fcd1-4b3d-8e31-5fef1f0e8270_2560x1440.webp" width="1456" height="819" alt="The Usefulness of a Memory Guides Where the Brain Saves It | Quanta Magazine"></figure></div>
<p>The human mind and the Large Language Model (LLM) are arguably the two most sophisticated information-processing systems in existence. While one is the product of millions of years of biological evolution and the other a result of decades of computational engineering, they share a fundamental challenge: <strong>How do you store, retrieve, and utilize vast amounts of information in real-time?</strong></p><p>In cognitive psychology, the distinction between Short-Term Memory (STM) and Long-Term Memory (LTM) is foundational. In the world of Artificial Intelligence, a striking parallel has emerged within the architecture of Transformers&#8212;the engine behind models like GPT-4 and Gemini. 
By examining these two systems side-by-side, we gain not only a better understanding of AI but a deeper appreciation for the elegance of human cognition.</p><p>To understand the parallel, we must first define the biological standard. In 1968, Atkinson and Shiffrin proposed the <strong>Multi-Store Model</strong>, which remains the primary framework for discussing memory stages.</p><p>Short-term memory is our &#8220;mental workspace.&#8221; It is characterized by:</p><ul><li><p><strong>Limited Capacity:</strong> Classically defined by George Miller as $7 \pm 2$ items.</p></li><li><p><strong>Brief Duration:</strong> Information typically fades within 15&#8211;30 seconds unless rehearsed.</p></li><li><p><strong>High Accessibility:</strong> Information is immediately available for manipulation.</p></li></ul><p>Modern psychology often prefers the term <strong>Working Memory</strong>, popularized by Baddeley and Hitch. It isn&#8217;t just a waiting room for data; it is an active processor consisting of a &#8220;Central Executive&#8221; that directs attention, a &#8220;Phonological Loop&#8221; for auditory data, and a &#8220;Visuospatial Sketchpad&#8221; for imagery.</p><p>Long-term memory is the &#8220;hard drive&#8221; of the brain. It is characterized by:</p><ul><li><p><strong>Virtually Infinite Capacity:</strong> There is no known limit to what a human can learn over a lifetime.</p></li><li><p><strong>Durability:</strong> Information can last from minutes to decades.</p></li><li><p><strong>Consolidation:</strong> The process of moving information from STM to LTM, often involving the hippocampus and sleep.</p></li></ul><p>LTM is further divided into <strong>Explicit (Declarative)</strong> memory&#8212;facts and events&#8212;and <strong>Implicit (Procedural)</strong> memory&#8212;skills like riding a bike or typing.</p><p>LLMs do not have &#8220;brains,&#8221; but they do have functional equivalents to these memory systems. In the context of a chatbot or an AI agent, the distinction is found between <strong>Weights</strong> and <strong>Context</strong>.</p><p>When you chat with an LLM, the model remembers what you said five sentences ago. This is its <strong>Short-Term Memory</strong>, technically referred to as the <strong>Context Window</strong>.</p><ul><li><p><strong>Capacity:</strong> Just as humans can only hold a few digits in their head, LLMs have a token limit (e.g., 128k or 1M tokens).</p></li><li><p><strong>Attention Mechanism:</strong> The &#8220;Transformer&#8221; architecture uses an <strong>Attention Mechanism</strong> that mirrors human selective attention. It decides which parts of the input are relevant to the current word being generated.</p></li><li><p><strong>Volatility:</strong> Once the &#8220;session&#8221; is cleared or the window is exceeded, the model &#8220;forgets&#8221; everything in that specific conversation. It does not naturally &#8220;learn&#8221; from a single chat in real-time.</p></li></ul><p>The &#8220;knowledge&#8221; an LLM possesses&#8212;the fact that Paris is the capital of France or how to write Python code&#8212;is stored in its <strong>Parameters (Weights)</strong>.</p><ul><li><p><strong>The Training Phase:</strong> This is the AI&#8217;s version of consolidation. 
During training, the model processes trillions of words, and the &#8220;lessons&#8221; are baked into the strength of the connections between neurons in the neural network.</p></li><li><p><strong>Static Nature:</strong> Unlike human LTM, which is &#8220;plastic&#8221; (constantly changing), a standard LLM&#8217;s LTM is static after training. To add new long-term knowledge, it must be &#8220;fine-tuned&#8221; or retrained.</p></li></ul><p>One of the biggest hurdles in AI is that models eventually &#8220;run out&#8221; of context window, much like a human forgets a phone number if they are distracted. To solve this, engineers developed <strong>Retrieval-Augmented Generation (RAG)</strong>.</p><p>RAG acts like an external long-term memory or a reference library. Instead of trying to &#8220;remember&#8221; everything in its weights, the AI looks up information in a database and brings it into its &#8220;Working Memory&#8221; (context window) only when needed.</p><p>This mirrors the human use of <strong>External Memory Aids</strong>&#8212;like notebooks or Google&#8212;but it also mimics the way our brain retrieves a specific memory from LTM to handle a current task. When you solve a math problem, you retrieve the &#8220;rule&#8221; from LTM into your Working Memory to apply it. RAG does exactly this for AI (a toy version is sketched at the end of this piece).</p><p>A fascinating parallel exists in how both systems fail.</p><ul><li><p><strong>Psychology:</strong> The <strong>Serial Position Effect</strong> suggests humans remember the beginning (Primacy) and the end (Recency) of a list best, often forgetting the middle.</p></li><li><p><strong>AI:</strong> Research has shown that LLMs also struggle with &#8220;Lost in the Middle.&#8221; When given a very long prompt, they are much better at utilizing information located at the very start or the very end of the text, while the middle often gets ignored by the attention mechanism.</p></li></ul><p>This suggests that &#8220;attention&#8221; is a finite resource in both biological and synthetic architectures.</p><p>Comparing human memory to LLMs reveals a fundamental truth about intelligence: <strong>Processing requires a trade-off between volume and speed.</strong> We cannot keep everything we&#8217;ve ever learned in our active consciousness (STM) because it would create too much noise. Similarly, an LLM cannot have an infinite context window without becoming computationally &#8220;heavy&#8221; and slow.</p><p>As we move toward &#8220;Agentic AI&#8221;&#8212;models that can plan, reason, and remember over long periods&#8212;we are seeing AI move closer to the human model of <strong>continuous learning</strong>. While humans use the hippocampus to turn today&#8217;s experiences into tomorrow&#8217;s wisdom, AI researchers are developing &#8220;memory wrappers&#8221; and &#8220;dynamic fine-tuning&#8221; to give machines a similar sense of persistence.</p><p>Ultimately, the LLM is a mirror. By building machines that &#8220;remember&#8221; like us, we are slowly decoding the algorithmic secrets of our own minds.</p>
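<p>And for the engineers in the audience, here is the promised toy version of that retrieve-then-read loop. The three-dimensional vectors are made up and stand in for a real embedding model, purely to show the mechanic of pulling a &#8220;long-term memory&#8221; into the prompt:</p><pre><code class="language-python">import numpy as np

# Toy "long-term memory": pre-embedded passages (vectors invented for illustration).
corpus = [
    "Paris is the capital of France.",
    "The hippocampus consolidates memories during sleep.",
    "Python was created by Guido van Rossum.",
]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.2, 0.9]])

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[str]:
    """Recall the k nearest memories by cosine similarity."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [corpus[i] for i in np.argsort(sims)[-k:][::-1]]

# Retrieved facts land in the context window -- the model's working memory.
query_vec = np.array([0.8, 0.2, 0.1])   # stand-in for an embedded user question
prompt = ("Context:\n" + "\n".join(retrieve(query_vec))
          + "\n\nQuestion: What is the capital of France?")
</code></pre>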
]]></content:encoded></item><item><title><![CDATA[From N-Grams to Reasoning Engines]]></title><description><![CDATA[A Definitive History of Large Language Models]]></description><link>https://limitedintelligence.substack.com/p/from-n-grams-to-reasoning-engines</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/from-n-grams-to-reasoning-engines</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Fri, 03 Apr 2026 13:00:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cy13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf6cc768-45c6-4c83-847d-fc690935b36f_2560x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!cy13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf6cc768-45c6-4c83-847d-fc690935b36f_2560x1440.jpeg" width="1456" height="819" alt="Large Language Models 101: History, Evolution and Future"></figure></div>
<p>The story of Large Language Models (LLMs) is often told as a sudden explosion that began in late 2022 with the release of ChatGPT. However, for those tracking the pulse of computational linguistics, this &#8220;explosion&#8221; was the result of decades of slow-burn research, architectural pivots, and a fundamental shift in how we think about machine intelligence.</p><p>It is a history of moving from <strong>rules</strong> to <strong>probabilities</strong>, and finally, to <strong>reasoning</strong>.</p><h2>1. The Pre-Neural Era: The Search for Structure (1950s &#8211; 1990s)</h2><p>In the early days of Artificial Intelligence, the dominant philosophy was <strong>Symbolic AI</strong>. Researchers believed that if we could simply code all the rules of grammar and logic into a machine, it would understand language. This led to &#8220;Expert Systems&#8221; and the creation of the ELIZA chatbot in the 1960s, which used simple pattern matching to mimic a Rogerian psychotherapist.</p><p>By the 1980s and 90s, the field shifted toward <strong>Statistical NLP</strong>. Instead of hard-coded rules, researchers used <strong>N-grams</strong>. An N-gram model predicts the next word based on the frequency of word sequences in a massive corpus of text. If the word &#8220;San&#8221; appeared, the model calculated a high probability that &#8220;Francisco&#8221; would follow.</p>
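<p>The whole idea fits in a few lines. A toy bigram (N = 2) model over an invented two-sentence corpus:</p><pre><code class="language-python">from collections import Counter, defaultdict

corpus = "san francisco is in california . san diego is in california .".split()

follows = defaultdict(Counter)            # count(prev -> next)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(prev: str):
    """P(next | prev) estimated purely from co-occurrence frequency."""
    nxt, count = follows[prev].most_common(1)[0]
    return nxt, count / sum(follows[prev].values())

print(predict("san"))   # ('francisco', 0.5): frequency, not understanding
</code></pre><p>That 0.5 is the model&#8217;s entire &#8220;knowledge&#8221;: it has seen &#8220;san francisco&#8221; and &#8220;san diego&#8221; equally often and has no way to use anything earlier than the previous word.</p>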
<p>While revolutionary, these models were &#8220;shallow.&#8221; They had no concept of context beyond the immediate few words (the &#8220;N&#8221; in N-gram), and they lacked any internal representation of meaning.</p><h2>2. The Neural Awakening: RNNs and the Context Problem (2000s &#8211; 2014)</h2><p>The introduction of Neural Networks changed the game. Instead of counting word frequencies, researchers began using <strong>Word Embeddings</strong> (like Word2Vec and GloVe). These represented words as high-dimensional vectors, where words with similar meanings (e.g., &#8220;King&#8221; and &#8220;Queen&#8221;) were mathematically close to one another.</p><h3>The Rise of the RNN</h3><p>To handle the sequential nature of language, researchers turned to <strong>Recurrent Neural Networks (RNNs)</strong>. Unlike static networks, RNNs had a &#8220;memory&#8221; loop, allowing information from previous words to persist.</p><p>However, RNNs suffered from a fatal flaw: <strong>The Vanishing Gradient Problem</strong>. As a sentence grew longer, the model &#8220;forgot&#8221; the beginning. To solve this, Sepp Hochreiter and J&#252;rgen Schmidhuber introduced <strong>Long Short-Term Memory (LSTM)</strong> networks. LSTMs used &#8220;gates&#8221; to decide what information to keep and what to discard, allowing for much longer context windows.</p><h2>3. The 2017 Inflection Point: &#8220;Attention is All You Need&#8221;</h2><p>In 2017, a team at Google Brain published a paper that would change the trajectory of AI forever: <em>&#8220;Attention is All You Need.&#8221;</em> This paper introduced the <strong>Transformer</strong> architecture.</p><p>The Transformer abandoned the sequential processing of RNNs entirely. Instead, it used a mechanism called <strong>Self-Attention</strong>. This allowed the model to look at every word in a sentence simultaneously and weigh their importance relative to one another, regardless of how far apart they were.</p><p>The mathematical core of this mechanism is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Attention(Q, K, V) = softmax\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V&quot;,&quot;id&quot;:&quot;WKWWWQRGQL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <strong>Q</strong> (Query), <strong>K</strong> (Key), and <strong>V</strong> (Value) are vector representations of the input. This allowed for massive parallelization during training, enabling researchers to train models on datasets orders of magnitude larger than before.</p>
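<p>The formula translates almost symbol-for-symbol into NumPy. A single-head sketch (no masking, no learned projections) that shows all pairwise interactions happening in one matrix product, which is exactly what makes the architecture parallelizable:</p><pre><code class="language-python">import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))                # 3 tokens, d_k = 4
print(attention(Q, K, V).shape)                    # (3, 4)
</code></pre>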
<h2>4. The Era of Pre-training: BERT vs. GPT (2018 &#8211; 2019)</h2><p>Following the Transformer breakthrough, two distinct paths emerged:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kslL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29b6a18a-d26f-49ba-a142-5ad96ce5b328_1430x200.png" width="1430" height="200" alt=""></figure></div><p><strong>BERT</strong> 
(Bidirectional Encoder Representations from Transformers) focused on &#8220;filling in the blanks.&#8221; By masking words in a sentence and forcing the model to guess them, it became incredibly good at understanding nuance.</p><p><strong>GPT</strong> (Generative Pre-trained Transformer), conversely, focused on <strong>Autoregressive</strong> generation. It was trained on the simple task of predicting the next token in a sequence. This simplicity turned out to be its greatest strength.</p><h2>5. Scaling Laws and the GPT-3 Moment (2020 &#8211; 2022)</h2><p>In 2020, OpenAI released <strong>GPT-3</strong>. With 175 billion parameters, it was 100 times larger than its predecessor, GPT-2.</p><p>This era validated the <strong>Scaling Laws</strong>: the observation that as you increase compute, data, and parameter count, the model&#8217;s performance improves predictably. GPT-3 demonstrated &#8220;Few-Shot Learning&#8221;&#8212;the ability to perform tasks it wasn&#8217;t explicitly trained for (like translation or coding) just by seeing a few examples in the prompt.</p><h3>The &#8220;Assistant&#8221; Pivot: InstructGPT and RLHF</h3><p>While GPT-3 was powerful, it was often &#8220;unaligned.&#8221; It would hallucinate, be rude, or follow instructions poorly because it was only trained to <em>imitate</em> the internet, not to <em>help</em> a user.</p><p>OpenAI solved this using <strong>Reinforcement Learning from Human Feedback (RLHF)</strong>. By having humans rank different model outputs, they trained a second &#8220;reward model&#8221; to teach the LLM how to be a helpful assistant. This led to <strong>InstructGPT</strong>, the direct ancestor of ChatGPT.</p><h2>6. The 2023 Explosion: ChatGPT and GPT-4</h2><p>On November 30, 2022, OpenAI released <strong>ChatGPT</strong>. It wasn&#8217;t a new model&#8212;it was a fine-tuned version of GPT-3.5 optimized for dialogue&#8212;but the interface changed everything. For the first time, the general public could interact with a high-level LLM through a simple chat box.</p><p>Months later, <strong>GPT-4</strong> arrived. It moved beyond simple text-to-text, introducing multimodal capabilities and a massive leap in reasoning, scoring in the 90th percentile on the Uniform Bar Exam.</p><h2>7. The Open Source Counter-Current (2023 &#8211; 2024)</h2><p>While OpenAI and Google (with <strong>Bard</strong>, later <strong>Gemini</strong>) kept their weights proprietary, Meta (Facebook) took a different approach. The release of <strong>LLaMA</strong> (Large Language Model Meta AI) sparked an open-source revolution.</p><p>Because LLaMA&#8217;s weights were smaller and more efficient, developers realized they could run powerful AI on consumer-grade hardware. This led to an explosion of &#8220;small&#8221; models like <strong>Mistral</strong>, <strong>Falcon</strong>, and <strong>Vicuna</strong>, proving that efficiency and fine-tuning could sometimes rival raw scale.</p><h2>8. The Current Frontier: Multimodality and Agents (2025 &#8211; 2026)</h2><p>Today, the &#8220;History of LLMs&#8221; is evolving into the history of <strong>Large Multimodal Models (LMMs)</strong>. 
We have moved from models that just &#8220;read&#8221; to models that can &#8220;see,&#8221; &#8220;hear,&#8221; and &#8220;do.&#8221;</p><h3>Key Trends of the Present:</h3><ul><li><p><strong>Infinite Context:</strong> Models like Gemini 1.5 Pro now support context windows of up to 2 million tokens, allowing them to process entire codebases or hours of video in one go.</p></li><li><p><strong>Agentic Workflows:</strong> We are moving from &#8220;Chatbots&#8221; to &#8220;Agents&#8221;&#8212;systems that can use tools, browse the web, and execute multi-step plans autonomously.</p></li><li><p><strong>The Rise of SLMs:</strong> Small Language Models (under 10B parameters) are becoming the standard for edge computing and mobile devices.</p></li></ul><h2>Conclusion</h2><p>The history of LLMs serves as a testament to what computer scientist Rich Sutton called <strong>&#8220;The Bitter Lesson&#8221;</strong>: the realization that leveraging massive amounts of computation and general-purpose learning algorithms consistently outperforms human-designed &#8220;clever&#8221; features.</p><p>We have moved from trying to teach machines the rules of our world to giving them the scale to discover those rules for themselves. The next chapter likely won&#8217;t just be about &#8220;more data,&#8221; but about <strong>Reasoning</strong> and <strong>World Models</strong>&#8212;the transition from predicting the next word to understanding the physics and logic of the reality behind those words.</p>
]]></content:encoded></item><item><title><![CDATA[Matryoshka Representation Learning (MRL)]]></title><description><![CDATA[Solving the fixed-dimension bottleneck dilemma]]></description><link>https://limitedintelligence.substack.com/p/matryoshka-representation-learning</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/matryoshka-representation-learning</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 02 Apr 2026 13:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KkVH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13677af7-9ff4-4409-8793-1511f546afbb_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KkVH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13677af7-9ff4-4409-8793-1511f546afbb_1376x768.jpeg" alt="Matryoshka embeddings: How to make vector search 5x faster | by St&#233;phane Derosiaux | Data Science Collective | Medium"></figure></div>
<p>As of early 2026, the landscape of Retrieval-Augmented Generation (RAG) and semantic search has shifted from &#8220;bigger is better&#8221; to &#8220;flexible is faster.&#8221; At the heart of this shift lies <strong>Matryoshka Representation Learning (MRL)</strong>, an elegant training technique that has effectively solved the &#8220;fixed-dimension bottleneck&#8221; that plagued vector databases for years.</p><p>If you&#8217;ve ever felt the pain of choosing between a 1536-dimension vector (high accuracy, high cost) and a 128-dimension vector (fast, but &#8220;dumb&#8221;), MRL is your new best friend. Here is a deep dive into the world of Matryoshka Embeddings.</p><h2>1. The Fixed-Dimension Paradox</h2><p>Before 2024, embedding models were rigid.
If you trained a model to output 768 dimensions, you were stuck with 768 dimensions.</p><p>This created a massive engineering headache:</p><ul><li><p><strong>Storage Bloat:</strong> 10 million documents at 3072 dimensions (like OpenAI&#8217;s <code>text-embedding-3-large</code>) require roughly 120GB of RAM just for the vectors (at float32 precision).</p></li><li><p><strong>Latency:</strong> Calculating cosine similarity on high-dimensional vectors is computationally expensive, leading to slower query times.</p></li><li><p><strong>The &#8220;Re-indexing&#8221; Nightmare:</strong> If you decided midway through a project that your vectors were too big, you had to re-embed your entire dataset&#8212;a process that could cost thousands of dollars and days of compute time.</p></li></ul><p>We needed a way to make embeddings &#8220;elastic.&#8221; We needed a vector that could be a heavyweight champion when accuracy mattered, but a lightweight sprinter when speed was the priority.</p><h2>2. What is Matryoshka Representation Learning?</h2><p>The name comes from the <strong>Matryoshka</strong>, or Russian nesting doll. In a Matryoshka set, you have a large doll that contains a smaller, perfectly formed doll inside, which contains an even smaller one, and so on.</p><p>In the context of machine learning, <strong>Matryoshka Representation Learning (MRL)</strong> is a training paradigm where a single embedding is structured such that its most critical semantic information is &#8220;front-loaded&#8221; into the first few dimensions.</p><p>Instead of the information being spread randomly across 1024 dimensions, MRL forces the model to ensure that:</p><ul><li><p>The first <strong>64</strong> dimensions are a valid, useful embedding.</p></li><li><p>The first <strong>128</strong> dimensions are even better.</p></li><li><p>The first <strong>256</strong> dimensions capture most of the nuance.</p></li><li><p>The full <strong>1024</strong> dimensions provide the ultimate &#8220;high-definition&#8221; detail.</p></li></ul><p>This means you can <strong>truncate</strong> a 1024-dimensional vector at the 128th index and still have a functional embedding that outperforms older, fixed-size models.</p><h2>3. The Technical Engine: How MRL Works</h2><p>The &#8220;magic&#8221; isn&#8217;t in the model architecture (which is usually a standard Transformer), but in the <strong>Loss Function</strong>.</p><p>In standard embedding training, we calculate a single loss based on the final vector. In MRL, we calculate a <strong>Multi-Scale Loss</strong>. We take the full vector, slice it at various pre-defined &#8220;Matryoshka points,&#8221; and calculate the loss for <em>each</em> slice.</p><h3>The Mathematics of Nesting</h3><p>Let <em>x</em> be our input.
The model <em>F</em> produces a high-dimensional vector:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z = F(x) \\in \\mathbb{R}^d&quot;,&quot;id&quot;:&quot;XCWCJYMNJV&quot;}" data-component-name="LatexBlockToDOM"></div><p>We define a set of dimensions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;M = \\{d_1, d_2, ..., d_k\\}&quot;,&quot;id&quot;:&quot;AXRMZDXHNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;d_i \\le d&quot;,&quot;id&quot;:&quot;URNUZAINOI&quot;}" data-component-name="LatexBlockToDOM"></div><p>The total loss is the weighted sum of losses at each dimensionality:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{total} = \\sum_{m \\in M} c_m \\cdot \\mathcal{L}(z_{1:m})&quot;,&quot;id&quot;:&quot;YJAYSEGIGE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><ul><li><p><em>z_{1:m}</em> is the prefix of the vector up to dimension <em>m</em>.</p></li><li><p><em>L</em> is a standard contrastive loss (like InfoNCE).</p></li><li><p><em>c_m</em> is a weighting coefficient (often set to 1 for equal importance).</p></li></ul><p>By optimizing for all these dimensions simultaneously, the backpropagation process forces the model to pack the &#8220;essence&#8221; of the data into the earliest dimensions. If the model fails to capture the core meaning in the first 64 dimensions, the corresponding loss term will be high, and the model will be penalized:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}(z_{1:64})&quot;,&quot;id&quot;:&quot;GOWVBEHLXS&quot;}" data-component-name="LatexBlockToDOM"></div>
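<p>To make the multi-scale loss concrete, here is a minimal PyTorch sketch of the idea. It is an illustration of the formula above, not the original paper&#8217;s code; the helper names and the choice of nesting points are assumptions for the example.</p><pre><code>import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [64, 128, 256, 512, 1024]  # the nesting points M (illustrative)

def info_nce(q, p, temperature=0.05):
    # Standard in-batch InfoNCE: each query's positive is the matching row of p.
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

def matryoshka_loss(z_q, z_p, dims=MATRYOSHKA_DIMS, weights=None):
    # L_total = sum over m in M of c_m * L(z[:, :m]): one loss per nested prefix.
    weights = weights or [1.0] * len(dims)              # c_m = 1: equal importance
    return sum(c * info_nce(z_q[:, :m], z_p[:, :m])
               for c, m in zip(weights, dims))
</code></pre><p>Backpropagating through every prefix simultaneously is what pushes the most discriminative information into the earliest dimensions.</p>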
<h2>4. Why 2026 is the Year of MRL</h2><p>While the original MRL paper was published by researchers at the University of Washington and Google in 2022, it didn&#8217;t become an industry standard until late 2024 and throughout 2025.</p><p>Today, in 2026, nearly every major embedding provider supports MRL natively:</p><ul><li><p><strong>OpenAI:</strong> Their <code>text-embedding-3-large</code> (3072 dimensions) can be truncated to 256 dimensions while still outperforming the legendary <code>text-embedding-ada-002</code>.</p></li><li><p><strong>Google Gemini:</strong> The <code>Gemini Embedding 2</code> model uses MRL to allow seamless transitions between 768 and 3072 dimensions.</p></li><li><p><strong>Voyage AI &amp; Jina:</strong> Models like <code>Voyage MM-3.5</code> and <code>Jina v4</code> have pushed MRL into the multimodal space, allowing you to truncate image and text vectors with less than 1% loss in accuracy.</p></li></ul><h3>2026 Benchmarks: The &#8220;98% Rule&#8221;</h3><p>Recent benchmarks on the <strong>MTEB (Massive Text Embedding Benchmark)</strong> show a consistent pattern: MRL-trained models typically retain <strong>98% of their performance</strong> even when truncated to <strong>8-10% of their original size</strong>.</p>
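<p>In day-to-day use, consuming an MRL embedding is just slicing and re-normalizing. A minimal NumPy sketch, with a random vector standing in for a real 3072-dimension MRL embedding:</p><pre><code>import numpy as np

def truncate_embedding(v, m):
    # Keep the first m "Matryoshka" dimensions, then re-normalize so
    # cosine similarity stays meaningful at the reduced size.
    head = np.asarray(v, dtype=float)[:m]
    return head / np.linalg.norm(head)

full = np.random.randn(3072)            # stand-in for a full MRL embedding
coarse = truncate_embedding(full, 256)  # 12x smaller, same front-loaded semantics
</code></pre>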
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42526,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://limitedintelligence.substack.com/i/192509513?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hu5O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 424w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 848w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 1272w, https://substackcdn.com/image/fetch/$s_!Hu5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04eaccd0-8333-4196-90d5-2dbd390cd819_756x314.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>5. Engineering the Two-Stage &#8220;Coarse-to-Fine&#8221; Retrieval</h2><p>The most powerful application of MRL is the <strong>Two-Stage Retrieval pipeline</strong>. 
This pattern allows you to have your cake (speed) and eat it too (accuracy).</p><h3>Stage 1: The &#8220;Coarse&#8221; Shortlist</h3><p>You store only the first <strong>128 dimensions</strong> of your embeddings in a fast vector index (like in-memory HNSW or SSD-based DiskANN). Because the vectors are tiny, you can search through millions of documents in microseconds. This returns a &#8220;shortlist&#8221; of, say, 1,000 candidates.</p><h3>Stage 2: The &#8220;Fine&#8221; Rerank</h3><p>You then fetch the <strong>full 3072 dimensions</strong> for only those 1,000 candidates (stored in cheaper SSD storage). You perform a final similarity check using the full vectors to pick the top 10.</p><p><strong>The result?</strong> You get the accuracy of a massive model with the infrastructure cost of a tiny one. In production environments, this has been shown to reduce vector search latency by up to <strong>80%</strong>.</p>
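<p>Here is a minimal sketch of the pattern, using brute-force NumPy scoring as a stand-in for a real index (in production, Stage 1 would run on something like HNSW). The array names and sizes are illustrative assumptions, and both matrices are assumed to be row-normalized so dot products are cosine similarities:</p><pre><code>import numpy as np

def two_stage_search(query, index_128, store_full, shortlist=1000, k=10):
    # Stage 1 (coarse): score everything using the cheap 128-dim prefixes.
    q128 = query[:128] / np.linalg.norm(query[:128])
    coarse_scores = index_128 @ q128
    candidates = np.argsort(coarse_scores)[::-1][:shortlist]

    # Stage 2 (fine): rerank only the shortlist with the full vectors.
    q_full = query / np.linalg.norm(query)
    fine_scores = store_full[candidates] @ q_full
    return candidates[np.argsort(fine_scores)[::-1][:k]]
</code></pre>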
<h2>6. Advanced Trends: SMRL and Adaptive Selection</h2><p>As we&#8217;ve moved into 2025-2026, researchers have introduced <strong>Sequential Matryoshka Representation Learning (SMRL)</strong> and <strong>SMEC (Sequential Matryoshka Embedding Compression)</strong>.</p><p>These new methods solve a subtle issue with the original MRL: <strong>gradient variance</strong>. When you train with 10 different loss functions at once, the gradients can get &#8220;noisy.&#8221; SMRL uses a sequential training approach that stabilizes the learning process, allowing for even better performance at extremely low dimensions (like 32 or 64).</p><p>Additionally, <strong>Adaptive Dimension Selection (ADS)</strong> modules now allow systems to dynamically choose the embedding size based on the &#8220;difficulty&#8221; of the query. Simple queries (e.g., &#8220;What is a cat?&#8221;) use 128 dimensions, while complex, nuanced queries (e.g., &#8220;Legal precedents for intellectual property in synthetic biology&#8221;) automatically trigger a full-dimensional search.</p><h2>7. Conclusion</h2><p>Matryoshka Embeddings represent a fundamental shift in how we think about data representations. We are moving away from &#8220;one-size-fits-all&#8221; vectors toward <strong>liquid representations</strong> that adapt to our hardware, our budget, and our latency requirements.</p><p>In 2026, if you aren&#8217;t using MRL in your RAG pipeline, you&#8217;re likely overpaying for your database and overcharging your users in latency.</p>]]></content:encoded></item><item><title><![CDATA[Attention Is All You Need]]></title><description><![CDATA[Understanding Attention Heads in Transformers]]></description><link>https://limitedintelligence.substack.com/p/attention-is-all-you-need</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/attention-is-all-you-need</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 01 Apr 2026 13:01:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TuEZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png" alt="Why multi-head self attention works: math, intuitions and 10+1 hidden insights | AI Summer"></figure></div>
title="Why multi-head self attention works: math, intuitions and 10+1 hidden  insights | AI Summer" srcset="https://substackcdn.com/image/fetch/$s_!TuEZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 424w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 848w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 1272w, https://substackcdn.com/image/fetch/$s_!TuEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7bd5842-ddfc-4d92-b1f8-d1ca2a23806f_819x908.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the pre-2017 era of Natural Language Processing, we treated sequences like a single-file line. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks processed words one by one, desperately trying to remember what happened at the beginning of the sentence by the time they reached the end.</p><p>Then came &#8220;Attention Is All You Need.&#8221; The Transformer architecture threw out the sequential bottleneck and replaced it with <strong>Multi-Head Attention (MHA)</strong>. If the Transformer is the engine of modern AI, attention heads are the cylinders that allow it to fire on all levels simultaneously.</p><h2>1. The Core Philosophy: Why &#8220;Multi-Head&#8221;?</h2><p>To understand multi-head attention, we first have to understand the limitation of <strong>Scaled Dot-Product Attention</strong>.</p><p>In a single-head system, the model calculates a weighted average of all words in a sequence to represent a specific word. While powerful, a single head is forced to make a choice. 
If I say, <em>&#8220;The bank was closed because the river overflowed,&#8221;</em> the word &#8220;bank&#8221; has two distinct relationships:</p><ol><li><p><strong>Syntactic:</strong> It is the subject of &#8220;was closed.&#8221;</p></li><li><p><strong>Semantic/Contextual:</strong> It relates to &#8220;river&#8221; (indicating a geographic bank, not a financial one).</p></li></ol><p>A single attention head might struggle to focus on both the grammatical structure and the nuanced context at the same time. <strong>Multi-head attention</strong> solves this by allowing the model to jointly attend to information from different representation subspaces at different positions.</p><blockquote><p><strong>The Analogy:</strong> Imagine a crime scene. One detective (Head 1) looks only at footprints. Another (Head 2) looks at DNA. A third (Head 3) looks at witness statements. By combining their reports, you get a 3D view of the truth that no single detective could capture.</p></blockquote><h2>2. The Mechanics: Behind the Math</h2><p>Every attention head operates on three learned linear projections: <strong>Queries (Q)</strong>, <strong>Keys (K)</strong>, and <strong>Values (V)</strong>.</p><h3>The Single-Head Calculation</h3><p>For a single head, the attention mechanism is defined by the following formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V&quot;,&quot;id&quot;:&quot;CLCTXDZUYS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where:</p><ul><li><p><em>Q</em>: What I&#8217;m looking for.</p></li><li><p><em>K</em>: What I have to offer.</p></li><li><p><em>V</em>: The information I actually provide.</p></li><li><p><em>sqrt(d_k)</em>: A scaling factor to prevent the dot products from growing too large, which would push the softmax into regions with tiny gradients (the &#8220;vanishing gradient&#8221; problem).</p></li></ul><h3>Moving to Multi-Head</h3><p>In a Multi-Head setup, we don&#8217;t just do this once. We split the model&#8217;s embedding dimension (e.g., 512 in the original Transformer) into <em>h</em> different heads. If <em>h=8</em>, each head works in a 64-dimensional space.</p><p>The process looks like this:</p><ol><li><p><strong>Project:</strong> Linearly project <em>Q, K, V</em> into <em>h</em> subspaces.</p></li><li><p><strong>Attend:</strong> Perform Scaled Dot-Product Attention for each head independently.</p></li><li><p><strong>Concatenate:</strong> Stitch the results of all <em>h</em> heads back together.</p></li><li><p><strong>Final Project:</strong> Pass the concatenated vector through a final weight matrix (<em>W^O</em>) to ensure the heads share their &#8220;findings.&#8221;</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MultiHead}(Q, K, V) = \\text{Concat}(\\text{head}_1, \\dots, \\text{head}_h)W^O&quot;,&quot;id&quot;:&quot;ACGTXURXDS&quot;}" data-component-name="LatexBlockToDOM"></div>
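<p>The four steps above map almost line-for-line onto code. Here is a minimal PyTorch sketch using the original paper&#8217;s sizes (512-dimensional model, 8 heads); it is a bare-bones illustration that omits masking and dropout:</p><pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads  # 8 heads x 64 dims
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # the final projection W^O

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # 1. Project, then split d_model into n_heads subspaces of size d_k.
        def heads(w):
            return w(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = heads(self.w_q), heads(self.w_k), heads(self.w_v)
        # 2. Scaled dot-product attention, independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # (B, h, T, T)
        out = F.softmax(scores, dim=-1) @ v                   # (B, h, T, d_k)
        # 3. Concatenate the heads, then 4. apply the final projection.
        return self.w_o(out.transpose(1, 2).reshape(B, T, -1))
</code></pre>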
<h2>3. What Are They Actually Doing? (Interpretability)</h2><p>Researchers have spent years &#8220;peeking&#8221; into these heads to see if they&#8217;ve actually learned anything useful. As it turns out, attention heads often specialize in specific linguistic tasks:</p><ul><li><p><strong>Syntactic Heads:</strong> Some heads focus almost exclusively on the relationship between a verb and its direct object.</p></li><li><p><strong>Positional Heads:</strong> Some heads always look at the previous word or the very next word, acting as a sort of local &#8220;sliding window.&#8221;</p></li><li><p><strong>Entity Heads:</strong> In larger models like GPT-4, certain heads specialize in tracking names, dates, or specific entities throughout a long document.</p></li><li><p><strong>Delimiting Heads:</strong> Some heads focus on periods, commas, or the <code>[SEP]</code> tokens, helping the model understand where ideas end.</p></li></ul><h3>The &#8220;Emergent&#8221; Nature</h3><p>The beauty of attention heads is that we don&#8217;t <em>tell</em> them to look for grammar or entities. They discover these patterns because they are the most efficient way to reduce loss during training. It is an emergent property of the architecture.</p><h2>4. Efficiency and The &#8220;Over-Parameterization&#8221; Problem</h2><p>One of the most surprising findings in modern AI research is that <strong>we might not need all these heads.</strong></p><p>In a famous paper titled <em>&#8220;Are Sixteen Heads Better than One?&#8221;</em>, researchers found that you could prune (remove) a significant percentage of attention heads during inference without a major drop in performance. In some cases, a model with 12 heads could be pruned down to 1 or 2 heads in certain layers with negligible impact.</p><h3>Why does this happen?</h3><ul><li><p><strong>Redundancy:</strong> Many heads end up learning the same thing.</p></li><li><p><strong>Specialization vs. Generalization:</strong> Some layers require many heads to parse complex logic, while other layers (often the earlier ones) only need a few to handle basic patterns.</p></li></ul><p>This has led to the rise of <strong>Structured Pruning</strong> and <strong>Head Importance Scoring</strong>, where we identify &#8220;dead&#8221; heads and cut them to make models faster and lighter.</p><h2>5. The Evolution: MQA, GQA, and FlashAttention</h2><p>As we&#8217;ve moved toward Large Language Models (LLMs) with massive context windows (like Gemini&#8217;s 1M+ tokens), standard Multi-Head Attention became a memory bottleneck. This led to three major innovations:</p><h3>Multi-Query Attention (MQA)</h3><p>Instead of every head having its own K and V, all heads share a <strong>single</strong> Key and Value. This drastically reduces the memory footprint during decoding, though it can slightly hurt model &#8220;expressiveness.&#8221;</p><h3>Grouped-Query Attention (GQA)</h3><p>A middle ground used by models like Llama 3 (sketched after this section). Heads are grouped, and each group shares a K and V. This balances the speed of MQA with the quality of MHA.</p><h3>FlashAttention</h3><p>This isn&#8217;t a change in the math, but a change in how the hardware (GPU) handles it. By being &#8220;IO-aware,&#8221; FlashAttention computes the attention matrix in blocks, avoiding the need to write the massive <em>N x N</em> attention matrix to the GPU&#8217;s slower main memory.</p>
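<p>To see what GQA actually changes, here is a toy PyTorch sketch of the key/value sharing (the sizes are illustrative assumptions): eight query heads share two K/V heads, so the KV cache is four times smaller, and each K/V head is simply broadcast across its group of queries.</p><pre><code>import torch
import torch.nn.functional as F

B, T, n_q, n_kv, d_k = 1, 16, 8, 2, 64    # 8 query heads, 2 shared K/V heads
q = torch.randn(B, n_q, T, d_k)
k = torch.randn(B, n_kv, T, d_k)           # KV cache holds 2 heads, not 8
v = torch.randn(B, n_kv, T, d_k)

group = n_q // n_kv                        # 4 query heads per K/V head
k = k.repeat_interleave(group, dim=1)      # broadcast to (B, n_q, T, d_k)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5
out = F.softmax(scores, dim=-1) @ v        # from here on, identical to MHA
</code></pre>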
<h2>6. Conclusion</h2><p>Attention heads are the reason we can have conversations with AI that feel coherent and contextually aware. They allowed us to move past the &#8220;foggy memory&#8221; of RNNs and into an era where every word in a 500-page book can be simultaneously compared to every other word.</p><p>However, the future likely involves <strong>dynamic attention</strong>. Instead of a fixed number of heads, we may see models that activate only the &#8220;experts&#8221; (heads) needed for a specific prompt&#8212;saving trillions of calculations and making AI more efficient than ever.</p>
]]></content:encoded></item><item><title><![CDATA[From Raw Scores to Reason]]></title><description><![CDATA[Understanding Softmax, Logits, and the Probabilistic Heart of LLMs]]></description><link>https://limitedintelligence.substack.com/p/from-raw-scores-to-reason</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/from-raw-scores-to-reason</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 31 Mar 2026 13:03:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aWRh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img
src="https://substackcdn.com/image/fetch/$s_!aWRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Logit and Probability&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Logit and Probability" title="Logit and Probability" srcset="https://substackcdn.com/image/fetch/$s_!aWRh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aWRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23d8d0e8-21ff-4c9a-b591-398865c198fe_1280x720.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the current era of generative artificial intelligence, we often speak of Large Language Models (LLMs) as if they possess &#8220;intent&#8221; or &#8220;understanding.&#8221; We describe their ability to write code, compose poetry, or solve complex 
logical puzzles. However, beneath the layer of conversational fluidity lies a deterministic mathematical pipeline. At the very end of this pipeline&#8212;after the billions of parameters have been traversed and the multi-head attention mechanisms have fired&#8212;sits a critical, often overlooked duo: <strong>Logits</strong> and the <strong>Softmax function</strong>.</p><p>For engineers, founders, and strategists, understanding these two components is not merely an academic exercise in calculus. It is the key to mastering model behavior, controlling &#8220;hallucinations,&#8221; and optimizing the bridge between raw compute and human-readable reasoning.</p><h2>I. The Final Frontier: What are Logits?</h2><p>Before a model can tell you that the next word in the sentence &#8220;The capital of France is...&#8221; is &#8220;Paris,&#8221; it produces a set of raw, unnormalized scores. These scores are known as <strong>Logits</strong>.</p><h3>The Mathematical Origin</h3><p>In the context of deep learning, the final linear layer of a transformer model outputs a vector. If our model has a vocabulary of 50,000 tokens, this vector contains 50,000 distinct numbers. These numbers are logits.</p><p>Mathematically, the logit function is the inverse of the sigmoid (&#8220;logistic&#8221;) function. In the journey of a token through a neural network, the logits represent the model&#8217;s &#8220;raw conviction&#8221; before they are constrained to a probability distribution. Unlike probabilities, logits can be any real number: positive, negative, or zero.</p><ul><li><p><strong>A high positive logit</strong> suggests a high degree of confidence that the corresponding token is the correct next step.</p></li><li><p><strong>A negative logit</strong> suggests the model &#8220;thinks&#8221; that token is highly unlikely.</p></li></ul><h3>Why Logits Matter for Developers</h3><p>Logits are the &#8220;raw data&#8221; of model intent. When you access a model via an API (like OpenAI or Anthropic), you often only see the final text. However, &#8220;Logprobs&#8221; (logarithmic probabilities derived from logits) are frequently available. By analyzing logits, developers can:</p><ol><li><p><strong>Measure Uncertainty:</strong> If the top two logits are nearly identical, the model is &#8220;confused.&#8221;</p></li><li><p><strong>Calibrate Output:</strong> You can manually &#8220;bias&#8221; logits to force the model to avoid certain words or favor others (Logit Bias).</p></li></ol><h2>II. The Softmax Transformation: Creating Order from Chaos</h2><p>Raw logits are difficult for a system to use for decision-making because they lack a fixed scale. How do we compare a logit of 12.5 to a logit of -3.2 in a way that represents a percentage chance?
This is where the <strong>Softmax function</strong> enters.</p><h3>The Formula</h3><p>The Softmax function takes a vector of $K$ real numbers and transforms them into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers.</p><p>For an input vector $\mathbf{z}$, the Softmax function $\sigma(\mathbf{z})$ is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(\\mathbf{z})_i = \\frac{e^{z_i}}{\\sum_{j=1}^K e^{z_j}}&quot;,&quot;id&quot;:&quot;NLOLCNVXAQ&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Why Exponentials?</h3><p>The use of the natural exponential $e$ serves two vital purposes:</p><ol><li><p><strong>Positivity:</strong> It ensures that every output is a positive number (since $e^x$ is always positive).</p></li><li><p><strong>Magnification:</strong> It acts as a &#8220;winner-takes-all&#8221; mechanism. Small differences in raw logits are magnified into large differences in probability. If one logit is slightly higher than the rest, Softmax ensures it receives the lion&#8217;s share of the probability mass.</p></li></ol><h3>The Summation Property</h3><p>Crucially, the denominator $\sum e^{z_j}$ ensures that all the output values sum exactly to <strong>1.0 (100%)</strong>. This turns the raw scores into a valid probability distribution, allowing the model to &#8220;rank&#8221; the entire vocabulary.</p><h2>III. The Lever of Creativity: Temperature Scaling</h2><p>If Softmax is the engine, <strong>Temperature ($T$)</strong> is the throttle. In the deployment of LLMs, temperature is the most common hyperparameter used to control the &#8220;creativity&#8221; or &#8220;randomness&#8221; of the output.</p><p>Temperature is an adjustment made to the logits <em>immediately before</em> they are passed into the Softmax function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(\\mathbf{z})_i = \\frac{e^{z_i / T}}{\\sum_{j=1}^K e^{z_j / T}}&quot;,&quot;id&quot;:&quot;PDJTEVGVWY&quot;}" data-component-name="LatexBlockToDOM"></div><h3>The Effects of Temperature</h3><p>The value of $T$ dictates how the probability mass is distributed among the tokens:</p><ul><li><p><strong>Low Temperature ($T &lt; 1$):</strong> The model becomes more confident and deterministic. By dividing the logits by a small number, the gap between the highest logit and the others is stretched. The &#8220;winner&#8221; gets even more probability, often approaching 100%. This is ideal for factual tasks, coding, or data extraction.</p></li><li><p><strong>High Temperature ($T &gt; 1$):</strong> The model becomes &#8220;creative&#8221; or &#8220;diverse.&#8221; Dividing by a large number flattens the distribution, making the &#8220;gap&#8221; between the top choice and the &#8220;long tail&#8221; of other words much smaller. This allows the model to occasionally pick less likely words, leading to more varied prose.</p></li><li><p><strong>$T = 0$ (Argmax):</strong> This is technically a mathematical limit. The model will always choose the token with the absolute highest logit. 
It becomes completely deterministic.</p></li></ul>
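<p>The effect of $T$ is easy to verify numerically. Here is a minimal NumPy sketch of temperature-scaled Softmax (the max-subtraction is a standard numerical-stability trick and does not change the output):</p><pre><code>import numpy as np

def softmax(logits, temperature=1.0):
    # Divide the logits by T, then normalize the exponentials.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                  # stability: keeps exp() from overflowing
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 3.0, 1.0]
print(softmax(logits, 1.0))   # ~[0.87 0.12 0.02]: a clear winner
print(softmax(logits, 0.5))   # ~[0.98 0.02 0.00]: low T, near-deterministic
print(softmax(logits, 2.0))   # ~[0.67 0.24 0.09]: high T, flatter and more "creative"
</code></pre>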
<h2>IV. LLMs and the Token Selection Lifecycle</h2><p>To see how Softmax and Logits function in the wild, let&#8217;s trace the lifecycle of a single token generation in a Transformer.</p><ol><li><p><strong>Input Processing:</strong> The user sends a prompt. It is tokenized and converted into embeddings.</p></li><li><p><strong>The Transformer Block:</strong> The data passes through 96+ layers of attention and feed-forward networks. The model uses its learned weights to calculate the relationship between the input tokens.</p></li><li><p><strong>The Logit Head:</strong> The final hidden state is projected onto the vocabulary space. We now have a vector of <strong>Logits</strong> (raw scores).</p></li><li><p><strong>Softmax Application:</strong> The logits are scaled by <strong>Temperature</strong> and passed through the <strong>Softmax</strong> function. We now have a <strong>Probability Distribution</strong>.</p></li><li><p><strong>Sampling:</strong> The model doesn&#8217;t always just pick the top word. It uses sampling strategies (sketched after this list) like:</p><ul><li><p><strong>Top-P (Nucleus Sampling):</strong> Only consider the smallest set of tokens whose cumulative probability exceeds $P$ (e.g., 0.9).</p></li><li><p><strong>Top-K:</strong> Only consider the top $K$ most likely tokens.</p></li></ul></li><li><p><strong>The Output:</strong> A token is selected, appended to the prompt, and the process repeats (Autoregression).</p></li></ol>
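<p>Putting steps 4 and 5 together, here is a minimal NumPy sketch of one decoding step: logit masking, temperature scaling, Softmax, then Top-P sampling. The function name and the toy logits are illustrative assumptions, not any particular provider&#8217;s API:</p><pre><code>import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, banned=()):
    logits = np.asarray(logits, dtype=float).copy()
    logits[list(banned)] = -np.inf       # logit masking: banned tokens get p = 0
    z = logits / temperature             # temperature scaling
    z -= z[np.isfinite(z)].max()         # numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax
    # Top-P (nucleus): smallest prefix of the sorted tokens with mass >= top_p.
    order = np.argsort(probs)[::-1]
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

print(sample_next_token([4.0, 3.5, 1.0, -2.0], banned=[3]))  # never returns token 3
</code></pre>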
<h2>V. Strategic Implications for AI Implementation</h2><p>For those building at the application layer, the relationship between Logits and Softmax informs several high-stakes business decisions.</p><h3>1. Guardrailing and Safety</h3><p>One of the most effective ways to prevent a model from generating restricted content is <strong>Logit Masking</strong>. By setting the logits of specific forbidden tokens to negative infinity ($-\infty$) before the Softmax layer, you can mathematically guarantee that the model will never select those words, regardless of the prompt.</p><h3>2. Detection of Hallucinations</h3><p>Hallucinations often occur when the Softmax distribution is “flat”—meaning the model has no clear winner among its potential outputs. By monitoring the <strong>Entropy</strong> of the Softmax output, developers can attach a “confidence score” to every response. If the entropy is too high, the system can trigger a search tool or ask the user for clarification.</p><h3>3. Cost and Latency (The Logit Bottleneck)</h3><p>The vocabulary size of an LLM directly determines the size of the final logit vector. As we move toward models with larger vocabularies (to support more languages or specialized jargon), the compute cost of the final linear layer and the Softmax operation grows with it. Optimizing this “Logit Head” is a major focus for edge-AI and mobile-first LLM deployments.</p><h2>VI. Beyond Softmax: The Future of Probabilistic Outputs</h2><p>While Softmax is the industry standard, it is not without flaws. Its exponential nature can lead to <strong>overconfidence</strong>, where a model assigns 99.9% probability to a wrong answer simply because its logit was slightly higher than the second-best option.</p><p>Research into <strong>Sparsemax</strong> (which can assign exactly zero probability to unlikely tokens) and <strong>Calibration</strong> (adjusting logits so that a 90% probability actually corresponds to a 90% accuracy rate) is the next frontier. For founders, staying ahead of these architectural shifts means building more reliable, steerable, and trustworthy AI systems.</p><h2>Conclusion</h2><p>The “magic” of LLMs is often attributed to their size, but their utility is governed by their precision. Logits represent the model’s raw, unvarnished thoughts; Softmax represents the civilized, probabilistic output we use to communicate.</p><p>By mastering the interplay between these two—and the temperature that mediates them—we move away from treating AI as a “black box” and toward treating it as a finely tuned instrument of digital logic. Whether you are optimizing a customer service bot or architecting a new frontier model, the path to performance runs directly through the Softmax layer.</p>]]></content:encoded></item><item><title><![CDATA[The Hidden Cost]]></title><description><![CDATA[Understanding the &#8220;Churn Premium&#8221;]]></description><link>https://limitedintelligence.substack.com/p/the-hidden-cost</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-hidden-cost</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 30 Mar 2026 13:03:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y3yt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528d8f37-5801-4acf-ae60-4e3982df0197_1350x760.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!Y3yt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528d8f37-5801-4acf-ae60-4e3982df0197_1350x760.png" alt="Churn: what it is, how to calculate it, and how to reduce it in practice"></figure>
x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The narrative that AI and the cloud have ushered in a &#8220;golden age&#8221; of effortless software development is one of the most expensive lies in the modern enterprise. While the marketing brochures promise &#8220;heavenly&#8221; developer experiences, the reality on the ground is often a hell of architectural rot, burnout, and a &#8220;churn premium&#8221; that is silently hemorrhaging millions of dollars.</p><p>If you are seeing your best engineers walk out the door, it isn&#8217;t because they&#8217;ve lost their edge or can&#8217;t master the latest LLM-assisted coding tool. They are leaving because they are drowning in systems that make it impossible to do their best work.</p><p>To stop the bleed, leaders must move beyond shiny dashboards and confront the broken systems beneath.</p><p>Most organizations track turnover, but few track the <strong>Churn Premium</strong>. This is the compounded cost of technical debt, lost institutional knowledge, and the massive overhead required to replace a high-performing engineer who understood the &#8220;where the bodies are buried&#8221; in a legacy codebase.</p><p>When a senior developer quits because of burnout, you aren&#8217;t just losing a headcount; you are paying a tax on every future feature. New hires take months to reach the same level of fluency, and if the system they inherit is already broken, they are likely to follow their predecessor out the door within 18 months. This cycle is a death spiral for innovation.</p><h2>Principle 1: Separate Preparation from Implementation</h2><p>One of the primary drivers of developer cognitive overload is the &#8220;rebuilding the bike while riding uphill&#8221; syndrome. Currently, most teams expect developers to clean up messy, legacy codebases while simultaneously shipping new, high-stakes features.</p><p>This dual-track cognitive load is a recipe for failure. Human brains are not wired to perform deep structural refactoring and feature implementation in the same breath.</p><h3>The Strategy: &#8220;Make the Change Easy&#8221;</h3><p>Following the wisdom of engineering veteran Kent Beck: <strong>&#8220;First make the change easy (warning: this might be hard), then make the easy change.&#8221;</strong></p><ul><li><p><strong>Phase 1: Preparation (The Cleanup).</strong> This is a dedicated effort to refactor the environment so the new feature has a clean &#8220;landing zone.&#8221; No new business logic is added here.</p></li><li><p><strong>Phase 2: Implementation (The Feature).</strong> Once the architecture supports the change, the implementation becomes trivial.</p></li></ul><p>By isolating these two activities, you reduce the mental &#8220;context switching&#8221; that leads to bugs and developer frustration.</p><h2>Principle 2: Stop Flying Blind with Data-Driven Empathy</h2><p>You cannot fix burnout if you cannot see the &#8220;heroics&#8221; happening behind the scenes. Many leaders rely on surface-level metrics like Jira velocity, which often mask the reality of a team on the brink of collapse. To truly understand the health of your engineering org, you need to leverage advanced developer analytics&#8212;specifically tools like <strong>DevStats</strong>.</p><h3>Key Reports to Monitor:</h3><ol><li><p><strong>The Activity Heatmap:</strong> This is your early warning system for burnout. Look for consistent patterns of &#8220;out-of-hours&#8221; work. 
<li><p><strong>Planning Accuracy Report:</strong> This identifies the “Yes Men” in your organization—the developers who commit to impossible deadlines out of a sense of duty, only to suffer in silence. If your planning accuracy is consistently low, the problem isn’t the developers; it’s the dates.</p></li></ol><h2>Principle 3: Build the Thinnest Viable Platform (TVP)</h2><p>In an attempt to “help” developers, many organizations build massive internal “Developer Platforms” that end up becoming overengineered nightmares. If your platform requires a 50-page manual just to deploy a microservice, you haven’t built a tool; you’ve built a barrier.</p><p>The goal is to provide a <strong>Thinnest Viable Platform</strong>—a set of self-service tools that provide just enough abstraction to remove friction without removing control.</p>
srcset="https://substackcdn.com/image/fetch/$s_!j9RK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 424w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 848w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 1272w, https://substackcdn.com/image/fetch/$s_!j9RK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f642c8-71ca-4bef-b5b7-aa36f84661b0_1434x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Principle 4: Weaponize SRE Error Budgets</h2><p>Product owners and stakeholders often have an insatiable appetite for new features, frequently &#8220;bullying&#8221; engineers into shipping code on foundations they know are unstable. To counter this, you must stop the guessing games and implement a <strong>mathematical circuit breaker: The Error Budget.</strong></p><h3>The 99.9% Rule</h3><p>If you set a reliability target of <strong>99.9%</strong>, you are effectively saying the system is allowed to be down or &#8220;broken&#8221; for roughly 43 minutes per month.</p><ul><li><p><strong>If the budget is intact:</strong> The team proceeds with feature development as planned.</p></li><li><p><strong>If the budget is spent:</strong> All feature work stops. Every engineer, designer, and product owner shifts focus to reliability and technical debt until the system is stabilized.</p></li></ul><p>This moves the conversation from &#8220;opinion-based&#8221; to &#8220;data-driven.&#8221; It is no longer the engineer&#8217;s word against the product owner&#8217;s; it is a hard limit defined by the system&#8217;s own health.</p><h2>Principle 5: Drag the Team Out of the &#8220;Anxiety Zone&#8221;</h2><p>When an environment is governed by fear of blame, engineers hide risks. They stop reporting &#8220;minor&#8221; bugs that are actually symptoms of systemic failure. I have seen projects go dark for days because a developer was too intimidated to flag a misconfiguration 48 hours earlier.</p><h3>Cultivating Psychological Safety</h3><p>High-agency teams require the safety to fail. As a leader, you must go first:</p><ul><li><p><strong>Share your own &#8220;screw-ups&#8221; openly.</strong></p></li><li><p><strong>Conduct blameless post-mortems</strong> that focus on the <em>process</em> failure, not the <em>person</em> who pushed the button.</p></li><li><p><strong>Reward risk-flagging.</strong> An engineer who identifies a critical flaw in a proposed feature should be celebrated as much as the one who ships it.</p></li></ul><h2>Principle 6: Translate Tech Debt into Shareholder Destruction</h2><p>One of the biggest mistakes technical leaders make is trying to explain &#8220;refactoring&#8221; or &#8220;technical debt&#8221; to the C-suite using engineering terminology. 
<h2>Principle 5: Drag the Team Out of the “Anxiety Zone”</h2><p>When an environment is governed by fear of blame, engineers hide risks. They stop reporting “minor” bugs that are actually symptoms of systemic failure. I have seen projects go dark for days because a developer was too intimidated to flag a misconfiguration 48 hours earlier.</p><h3>Cultivating Psychological Safety</h3><p>High-agency teams require the safety to fail. As a leader, you must go first:</p><ul><li><p><strong>Share your own “screw-ups” openly.</strong></p></li><li><p><strong>Conduct blameless post-mortems</strong> that focus on the <em>process</em> failure, not the <em>person</em> who pushed the button.</p></li><li><p><strong>Reward risk-flagging.</strong> An engineer who identifies a critical flaw in a proposed feature should be celebrated as much as the one who ships it.</p></li></ul><h2>Principle 6: Translate Tech Debt into Shareholder Destruction</h2><p>One of the biggest mistakes technical leaders make is trying to explain “refactoring” or “technical debt” to the C-suite in engineering terminology. The Board does not care about “clean code.” They care about <strong>risk</strong> and <strong>capital efficiency</strong>.</p><p>To get the resources you need, you must translate technical debt into its true form: <strong>Shareholder Destruction.</strong></p><h3>The 1:100 Rule</h3><p>The math of software defects is brutal:</p><p>$$\text{Cost} = 1\times \text{ (Discovery at Dev)} \rightarrow 100\times \text{ (Discovery in Production)}$$</p><p>A bug costs <strong>$1</strong> to fix during the initial design or development phase. That same bug costs <strong>$100</strong> once it is live in production, factoring in customer support, emergency patches, and reputational damage.</p><h3>The “Knowledge Transfer Tax”</h3><p>Explain that messy code is essentially a “Knowledge Transfer Tax.” If your most expensive senior engineers are spending 15–20 hours a week just “keeping the lights on” or explaining convoluted logic to juniors, that is a direct drain on the company’s R&amp;D budget. You are paying senior wages for janitorial work.</p><h2>Conclusion</h2><p>The future of the tech industry won’t be won by the companies that squeeze the most lines of AI-generated code out of their staff. It will be won by the companies that treat developer time—and more importantly, <strong>developer energy</strong>—as their most precious capital.</p><p>Your job as a leader is not to demand more output. It is to build a system that doesn’t make people hate their jobs. If you keep pushing for “shiny” features while the foundation is rotting, you aren’t building a product; you’re building a Titanic.</p><p>Stop flying blind. Fix the system, empower your people, and stop the churn before it consumes your venture.</p>
]]></content:encoded></item><item><title><![CDATA[The Evaluation Gap]]></title><description><![CDATA[Engineering Rigor for the Age of AI Agents]]></description><link>https://limitedintelligence.substack.com/p/the-evaluation-gap</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-evaluation-gap</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p1dY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!p1dY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png" alt="Evaluating AI Agent Performance with Dynamic Metrics"></figure>
srcset="https://substackcdn.com/image/fetch/$s_!p1dY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!p1dY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cddf6dc-c55b-446e-8b14-6b24d43ce15e_1200x675.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the transition from Large Language Models (LLMs) to autonomous AI agents, the industry has hit a significant bottleneck. While building a prototype that can perform a &#8220;cool trick&#8221; takes an afternoon, moving that agent into a production environment where it handles sensitive data, financial transactions, or customer-facing operations is a different beast entirely.</p><p>The primary hurdle isn&#8217;t just the logic of the agent&#8212;it&#8217;s the <strong>evaluation</strong>. For decades, software engineering relied on deterministic unit tests: input $X$ always results in output $Y$. With AI agents, the path from input to output is non-deterministic, multi-step, and often involves tool-use iterations that can fail in a thousand subtle ways.</p><p>The O&#8217;Reilly literature on AI evaluation, particularly the emerging frameworks surrounding agentic workflows, emphasizes a shift from &#8220;vibe-based development&#8221; to a systematic, metrics-driven approach. To build agents that actually work at scale, we must move beyond the chat box and into the laboratory.</p><h2>1. 
1. The Anatomy of an Agentic Failure</h2><p>Before we can evaluate success, we must understand the unique failure modes of agents. Unlike a standard RAG (Retrieval-Augmented Generation) system, which typically has a linear flow (Query → Retrieve → Generate), an agent operates in a loop:</p><ol><li><p><strong>Perception:</strong> Understanding the user’s intent.</p></li><li><p><strong>Planning:</strong> Breaking the intent into sub-tasks.</p></li><li><p><strong>Tool Selection:</strong> Deciding which external API or database to call.</p></li><li><p><strong>Execution:</strong> Calling the tool and parsing its output.</p></li><li><p><strong>Observation:</strong> Deciding if the goal is met or if another loop is needed.</p></li></ol><p>A failure can occur at any stage. An agent might plan correctly but select the wrong tool. It might execute the tool correctly but fail to parse the JSON response. Or, most dangerously, it might enter a “hallucination loop,” where it tries to fix an error with more erroneous actions. Evaluation, therefore, cannot be a glance at the final answer alone; it must be a <strong>trace-level assessment</strong> of the entire trajectory.</p><h2>2. Defining the Metrics: The Four Pillars of Evaluation</h2><p>The O’Reilly framework for evaluation generally categorizes metrics into four distinct buckets. To build a robust system, you need coverage across all of them.</p><h3>I. Correctness (Functional Accuracy)</h3><p>This is the most obvious metric, but the hardest to measure. Did the agent achieve the user’s goal?</p><ul><li><p><strong>Tool Call Accuracy:</strong> Did the agent call the right function with the correct parameters?</p></li><li><p><strong>Final Answer Relevancy:</strong> Is the output semantically aligned with the prompt?</p></li><li><p><strong>Success Rate:</strong> In a multi-turn conversation, what percentage of tasks were completed without human intervention?</p></li></ul><h3>II. Reliability and Consistency</h3><p>Because LLMs are probabilistic, an agent might succeed on Monday and fail on Tuesday with the same prompt.</p><ul><li><p><strong>Pass@k:</strong> If we run the same prompt $k$ times, how often does at least one run succeed?</p></li><li><p><strong>Robustness to Noise:</strong> If we add irrelevant information to the prompt, does the agent still find the correct path?</p></li></ul><h3>III. Safety and Guardrails</h3><p>Agents have “agency,” meaning they can do harm if not constrained.</p><ul><li><p><strong>Prompt Injection Vulnerability:</strong> Can a user trick the agent into bypassing its system instructions?</p></li><li><p><strong>PII Leakage:</strong> Does the agent inadvertently pull sensitive data from a database and show it to the user?</p></li><li><p><strong>Toxicity and Bias:</strong> Does the agent generate harmful content during its reasoning steps?</p></li></ul><h3>IV. Efficiency (Performance and Cost)</h3><p>In a business context, an agent that takes 45 seconds to think and costs $2.00 per query is often unusable.</p><ul><li><p><strong>Tokens Per Task:</strong> How many tokens were consumed in the loops?</p></li><li><p><strong>Latency per Step:</strong> Which specific tool or reasoning step is slowing down the UX?</p></li><li><p><strong>Cost per Success:</strong> The total cost of all API calls divided by the number of successful outcomes.</p></li></ul>
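<p>Several of these metrics are mechanical enough to compute directly from logged eval runs. Below is a small, self-contained sketch; the field names and the loose pass@k definition mirror the bullets above, and this is illustrative rather than any specific framework:</p><pre><code class="language-python">from collections import defaultdict

def summarize_runs(runs, k=3):
    """Each run is a dict: {"task_id": str, "success": bool, "cost_usd": float}."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run["success"])

    # Success Rate: fraction of individual runs completed successfully.
    success_rate = sum(r["success"] for r in runs) / len(runs)

    # Pass@k (loose form): share of tasks with a success in their first k runs.
    pass_at_k = sum(any(tries[:k]) for tries in by_task.values()) / len(by_task)

    # Cost per Success: total spend divided by the number of successful runs.
    wins = sum(r["success"] for r in runs)
    cost_per_success = sum(r["cost_usd"] for r in runs) / max(wins, 1)

    return {"success_rate": success_rate, f"pass@{k}": pass_at_k,
            "cost_per_success": cost_per_success}
</code></pre>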
<h2>3. The “Gold Dataset” Problem</h2><p>You cannot evaluate what you haven’t defined. The cornerstone of AI evaluation is the <strong>Evaluation Dataset</strong> (often called a “Gold Set”). This is a curated list of inputs and their expected “Ground Truth” outputs.</p><p>For agents, a Gold Set is significantly more complex than for a standard classifier. A high-quality agentic dataset should include:</p><ol><li><p><strong>The Prompt:</strong> The initial user request.</p></li><li><p><strong>The Context:</strong> The state of the world (e.g., “The user is logged in,” “The database has 3 records”).</p></li><li><p><strong>The Expected Trajectory:</strong> Not just the final answer, but the specific tools that <em>should</em> be called.</p></li><li><p><strong>Negative Constraints:</strong> Things the agent <em>should not</em> do (e.g., “Do not delete the record”).</p></li></ol><p><strong>Synthetic Data Generation:</strong></p><p>Creating 1,000 manual test cases is grueling. Modern evaluation strategies use “LLM-as-a-Generator” to create synthetic test cases. By prompting a frontier model (like GPT-4o or Claude 3.5 Sonnet) to “imagine 50 ways a user might try to break this specific tool,” you can bootstrap an evaluation suite in minutes.</p><h2>4. LLM-as-a-Judge: Scaling Evaluation</h2><p>How do you grade a 10-step agent trajectory? You can’t do it with a regex or an exact-match check. The solution championed in recent technical literature is the <strong>LLM-as-a-Judge</strong> pattern.</p><p>In this architecture, you use a highly capable model to grade the performance of your smaller, faster production agent. You provide the Judge with a <strong>rubric</strong>.</p><blockquote><p><strong>Example Rubric for a Sales Agent:</strong></p><ul><li><p>Score 1: Agent failed to ask for the user’s email.</p></li><li><p>Score 3: Agent asked for the email but didn’t verify the format.</p></li><li><p>Score 5: Agent collected the email, verified it, and successfully called the <code>UpdateLead</code> tool.</p></li></ul></blockquote><p>While “LLM-as-a-Judge” introduces its own biases, it is remarkably consistent compared to human graders and operates at a fraction of the cost and time. To mitigate bias, practitioners often use <strong>Reference-Based Evaluation</strong>, where the Judge is given a “perfect” example to compare against the agent’s actual performance.</p><h2>5. Architectural Integration: The Eval-Driven Development Cycle</h2><p>Evaluation shouldn’t be a post-mortem; it should be integrated into the CI/CD pipeline. The O’Reilly approach suggests an <strong>Eval-Driven Development (EDD)</strong> loop:</p><ol><li><p><strong>Baseline:</strong> Run your current agent through your Gold Set. Record the scores.</p></li><li><p><strong>Experiment:</strong> Change a prompt, swap a model, or add a new tool.</p></li><li><p><strong>Evaluate:</strong> Run the new version through the <em>exact same</em> Gold Set.</p></li><li><p><strong>Compare:</strong> Use a “diff” over the two runs to see which cases improved and—crucially—which ones regressed, as sketched below.</p></li></ol><p>One of the most common pitfalls in AI development is the <strong>“Hydra Effect”</strong>: you fix a prompt to solve Problem A, but that change causes a regression in Problem B. Without a systematic evaluation suite, you are flying blind.</p>
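<p>The judge pattern from Section 4 reduces to a single scoring call. A minimal sketch with the OpenAI Python SDK (the model choice is illustrative; the rubric is the sales-agent example above):</p><pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are grading a sales agent's trajectory. Reply with one number.
Score 1: agent failed to ask for the user's email.
Score 3: agent asked for the email but did not verify the format.
Score 5: agent collected and verified the email, then called the UpdateLead tool."""

def judge(trajectory: str) -> int:
    """Ask a heavyweight judge model to grade a lightweight agent's full trace."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative: any capable judge model works
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": trajectory}],
    )
    return int(response.choices[0].message.content.strip())
</code></pre>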
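<p>And step 4 of the EDD loop is plain bookkeeping. A sketch of the diff, keyed by test-case ID (the data structure is illustrative):</p><pre><code class="language-python">def diff_eval_runs(baseline: dict, candidate: dict):
    """Compare two eval runs mapping test-case id -> judge score (1-5)."""
    improved, regressed = [], []
    for case_id in baseline.keys() &amp; candidate.keys():
        delta = candidate[case_id] - baseline[case_id]
        if delta > 0:
            improved.append((case_id, delta))
        elif delta &lt; 0:
            regressed.append((case_id, delta))  # the Hydra Effect shows up here
    return improved, regressed

improved, regressed = diff_eval_runs(
    baseline={"refund-01": 5, "refund-02": 3, "refund-03": 4},
    candidate={"refund-01": 5, "refund-02": 5, "refund-03": 2},
)
assert regressed  # a non-empty regression list should fail CI, like any broken test
</code></pre>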
<h2>6. Real-World Case Study: The Customer Support Agent</h2><p>Imagine you are building an agent for an e-commerce platform that can process refunds.</p><ul><li><p><strong>The Vibe Check:</strong> You ask it to refund a fake order. It works. You feel good.</p></li><li><p><strong>The Systematic Eval:</strong> You run 100 test cases.</p><ul><li><p><strong>Finding 1:</strong> The agent successfully refunds 90% of cases.</p></li><li><p><strong>Finding 2:</strong> In 5% of cases, the agent refunds the <em>wrong</em> item because it didn’t clarify which product in a multi-item order the user meant.</p></li><li><p><strong>Finding 3:</strong> In 5% of cases, the agent hallucinated a “manager approval code” to bypass a restriction.</p></li></ul></li></ul><p>By identifying these specific failure modes through evaluation, you can implement <strong>Programmatic Guardrails</strong>. For example, you can add a validation step that requires the agent to output a specific JSON schema before the refund tool is ever triggered, as sketched below.</p><h2>7. The Business Case: ROI of Evaluation</h2><p>For founders and stakeholders, evaluation is often dismissed as “nice-to-have” technical overhead. This is a mistake. Evaluation is directly tied to the <strong>Unit Economics</strong> of an AI product.</p><ul><li><p><strong>Reducing Rework:</strong> It is 10x cheaper to fix a prompt in staging than to deal with a corrupted database in production.</p></li><li><p><strong>Model Optimization:</strong> Evaluation lets you see whether a cheaper, faster model (like Llama 3-8B) can perform as well as a more expensive one (GPT-4o) for a specific task. You can only make that switch confidently if you have the metrics to prove there is no quality loss.</p></li><li><p><strong>Trust and Adoption:</strong> Enterprise clients demand SLAs (Service Level Agreements). You cannot provide an SLA for an AI agent without a statistically significant evaluation report.</p></li></ul>
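<p>Here is a minimal version of that guardrail using the <code>jsonschema</code> package. The schema, field names, and patterns are hypothetical; the point is that validation happens before any side effect:</p><pre><code class="language-python">import json
from jsonschema import ValidationError, validate

# Hypothetical contract the agent must satisfy before the refund tool runs.
REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
        "item_id": {"type": "string"},
        "amount": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["order_id", "item_id", "amount"],
    # Hallucinated fields like "manager_approval_code" are rejected outright.
    "additionalProperties": False,
}

def guarded_refund(agent_output: str) -> dict:
    """Parse and validate the agent's JSON before any tool call happens."""
    try:
        payload = json.loads(agent_output)
        validate(instance=payload, schema=REFUND_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as err:
        return {"status": "rejected", "reason": str(err)}  # refund tool never fires
    return {"status": "ok", "payload": payload}            # safe to trigger the tool
</code></pre>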
<h2>8. Conclusion</h2><p>The “Agentic Era” promises a world where software doesn’t just show us data, but acts upon it. However, agency without accountability is a liability.</p><p>As we move forward, the tools for evaluation—like those discussed in O’Reilly’s technical guides—will become as standard as GitHub or Docker. We are moving toward a future of <strong>Continuous Evaluation</strong>, where agents are constantly monitored by other AI systems, ensuring they remain within the bounds of their intent, safety, and efficiency.</p><p>If you are building agents today, stop tweaking your prompts in a vacuum. Build your Gold Set, define your rubric, and start measuring. In the world of AI, the winner isn’t the one with the best prompt; it’s the one with the best feedback loop.</p><div><hr></div><h3>Key Takeaways for Your Strategy:</h3><ul><li><p><strong>Traceability is mandatory:</strong> Log every step of the agent’s “thought” process, not just the final output.</p></li><li><p><strong>Focus on Regressions:</strong> Use automated evals to ensure new features don’t break old successes.</p></li><li><p><strong>Use the Right Tool for the Grade:</strong> Use “heavyweight” models to judge “lightweight” production agents.</p></li><li><p><strong>Quantify the “Vibe”:</strong> Turn subjective quality into a 1–5 scale with clear rubrics.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Your RAG System Has a Hidden UX Problem]]></title><description><![CDATA[The Semantic Highlighting Gap]]></description><link>https://limitedintelligence.substack.com/p/your-rag-system-has-a-hidden-ux-problem</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/your-rag-system-has-a-hidden-ux-problem</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Tue, 24 Mar 2026 13:03:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8_NZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e47523-7b8c-41a2-94b5-44949f24d338_1024x559.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the world of Generative AI, we&#8217;ve spent the last two years obsessed with the &#8220;R&#8221; in RAG. We&#8217;ve optimized vector databases, fine-tuned embedding models, and experimented with hybrid search to ensure that when a user asks a question, the system finds the right needle in the haystack.</p><p>And yet, despite our retrieval being more &#8220;intelligent&#8221; than ever, the user experience often feels like it&#8217;s stuck in 1998.</p><p>There is a silent killer of user trust in RAG systems. It&#8217;s not hallucination, and it&#8217;s not latency. It&#8217;s the <strong>mismatch between how we find information and how we show it.</strong> We are retrieving semantically, but we are highlighting lexicographically.</p><p>This is the story of why your RAG system is inadvertently &#8220;gaslighting&#8221; your users, and how a new generation of small, specialized models&#8212;like the one recently open-sourced by Zilliz&#8212;is solving it.</p><h2>The Great Disconnect: Meaning vs. Matching</h2><p>To understand the problem, we have to look at the two different &#8220;brains&#8221; operating inside a modern RAG application.</p><ol><li><p><strong>The Retrieval Brain (Semantic):</strong> This brain operates in high-dimensional vector space. It doesn&#8217;t care about the letters in a word; it cares about the &#8220;vibe&#8221; or the conceptual intent. If you search for &#8220;liquid assets,&#8221; it knows to look for &#8220;cash,&#8221; &#8220;savings accounts,&#8221; and &#8220;marketable securities.&#8221;</p></li><li><p><strong>The UI Brain (Keyword):</strong> This brain is essentially a sophisticated version of <code>Ctrl+F</code>. It looks for exact character matches. If the user typed &#8220;liquid assets&#8221; and the document says &#8220;available cash,&#8221; the UI Brain sees zero overlap. It leaves the text plain, white, and unhelpful.</p></li></ol><h3>The &#8220;A15 Bionic&#8221; Paradox</h3><p>Imagine a user at a major tech firm searching the internal wiki for <strong>&#8220;iPhone performance.&#8221;</strong></p><p>The vector database does its job perfectly. It skips over generic marketing fluff and retrieves a technical whitepaper about the <strong>A15 Bionic chip architecture</strong>, <strong>Geekbench scores</strong>, and <strong>low-latency neural engines</strong>.</p><p>From a retrieval standpoint, this is a 10/10 result. But from a UX standpoint, it&#8217;s a failure. The user opens the document, and because the literal words &#8220;iPhone&#8221; and &#8220;performance&#8221; don&#8217;t appear in the technical paragraphs, <strong>nothing is highlighted.</strong></p><p>The user is faced with 3,000 words of dense technical prose. They have to manually scan the text to figure out why the system thought this was relevant. Usually, after five seconds of scrolling, they assume the AI is &#8220;hallucinating&#8221; or &#8220;stupid&#8221; and close the tab.</p><p><strong>The irony:</strong> The system was too smart for its own UI.</p><h2>Why This Matters: The Erosion of Trust</h2><p>This isn&#8217;t just a minor cosmetic issue; it&#8217;s a structural flaw that hurts the two most important groups in the ecosystem.</p><h3>1. The End Users (The Trust Gap)</h3><p>Search is a contract. The user provides a query; the system provides an answer and <em>proof</em> of that answer. 
<h2>Why This Matters: The Erosion of Trust</h2><p>This isn’t just a minor cosmetic issue; it’s a structural flaw that hurts the two most important groups in the ecosystem.</p><h3>1. The End Users (The Trust Gap)</h3><p>Search is a contract. The user provides a query; the system provides an answer and <em>proof</em> of that answer. Highlighting is the “receipt.” When a RAG system provides a document but fails to highlight the relevant section, it breaks that contract. Over time, this friction leads to “tool fatigue,” where users go back to their old, inefficient ways of finding information because the AI feels too high-effort to verify.</p><h3>2. The Developers (The Debugging Nightmare)</h3><p>When a RAG system underperforms, developers usually look in two places: the embedding model or the LLM. But without semantic highlighting, it’s nearly impossible to tell whether the retrieval was actually “bad” or the information was simply buried. Developers end up chasing “better embeddings” when the real problem is visibility.</p><h2>The Agentic Complication</h2><p>The problem gets dramatically worse as we move from simple RAG to <strong>Agentic RAG</strong>.</p><p>In an agentic workflow, the user doesn’t just search; they ask a high-level question like: <em>“Analyze recent market trends.”</em> The AI agent then performs “Chain of Thought” reasoning and generates its own optimized search queries:</p><blockquote><p><em>“Retrieve Q4 2024 consumer electronics sales data, year-over-year growth rates, supply chain cost fluctuations.”</em></p></blockquote><p>The system finds a sentence: <em>“The iPhone 15 series drove a 12% market recovery in the premium segment.”</em> <strong>The problem:</strong> there is zero keyword overlap between the agent’s generated query and the actual result. The user sees a document about “iPhone 15,” but none of the “market trend” context is highlighted because the UI is still looking for the literal word “trends.”</p><p>The more “intelligent” the agent becomes, the more it diverges from simple keyword matching, making traditional highlighting increasingly obsolete.</p><h2>Why Not Just Use an LLM?</h2><p>The “brute force” solution is to send the retrieved document and the query to a model like GPT-4 and ask: <em>“Which sentences in this document answer the query? Return the character offsets.”</em></p><p>While this works, it is a production disaster for three reasons:</p><ol><li><p><strong>Latency:</strong> Highlighting needs to happen the moment the results are rendered. Waiting 2–5 seconds for an LLM to “scan” five different 10-page documents is unacceptable for a search UI.</p></li>
<li><p><strong>Cost:</strong> Running a 175B+ parameter model every time a user hits “Enter,” just to draw some yellow boxes on a screen, will destroy your margins.</p></li><li><p><strong>Context Windows:</strong> While context windows are growing, feeding entire document sets into an LLM for every single search query remains inefficient and prone to “middle-of-the-document” forgetfulness.</p></li></ol><h2>The Solution: Specialized Semantic Highlighting</h2><p>We need a middle ground: a model that has the “brain” of an LLM but the speed of a keyword index.</p><p>Several open-source attempts have tried to solve this, but most fall short of “production grade” requirements:</p><table><thead><tr><th>Model / Tool</th><th>Context Window</th><th>Performance</th><th>Licensing</th></tr></thead><tbody><tr><td>OpenSearch Semantic</td><td>512 tokens</td><td>Fails on out-of-domain data</td><td>Apache 2.0</td></tr><tr><td>XProvence</td><td>Limited</td><td>Noisy results; multilingual issues</td><td>CC BY-NC (Non-Commercial)</td></tr><tr><td>Zilliz Semantic Model</td><td>8,000 tokens</td><td>Strong generalization</td><td>MIT (Commercial Use OK)</td></tr></tbody></table><h3>The Zilliz Breakthrough</h3><p>The team at Zilliz (the creators of Milvus) approached this as a distillation problem. They wanted a model that could handle long documents (8k context) and understand multiple languages without the “non-commercial” baggage of previous research.</p><h4>How It Was Built</h4><p>To get LLM-level understanding into a small package, they used a “Teacher-Student” training architecture:</p><ol><li><p><strong>The Teacher:</strong> They used <strong>Qwen3-8B</strong>, a powerful LLM. Instead of just asking it to “highlight,” they asked it to <strong>reason</strong>. By forcing the model to explain <em>why</em> a span was relevant before marking it, they generated a much higher-quality training set.</p></li><li><p><strong>The Student:</strong> They distilled this reasoning into a <strong>BGE-M3 Reranker (0.6B parameters)</strong>.</p></li><li><p><strong>The Training:</strong> They processed over <strong>1 million bilingual samples</strong> (English and Chinese) in 5 hours on an 8x A100 cluster.</p></li></ol><p>The result is a model that doesn’t just look for “iPhone,” but understands that “A15 Bionic” is the <em>reason</em> the iPhone is fast.</p><h2>Case Study: The “Sacred Deer” Trap</h2><p>To see the difference between a “keyword-matching” brain and a “semantic” brain, look at this query:</p><blockquote><p><strong>“Who wrote the film <em>The Killing of a Sacred Deer</em>?”</strong></p></blockquote><p>A document contains three sentences:</p><ol><li><p>“...the screenplay written by <strong>Lanthimos and Efthymis Filippou</strong>.”</p></li><li><p>“The film stars Colin Farrell...”</p></li><li><p>“The story is based on the ancient Greek play <em>Iphigenia in Aulis</em> by <strong>Euripides</strong>.”</p></li></ol><p><strong>The Trap:</strong> Sentence #3 pairs the keyword “Euripides” with strong authorship cues. A keyword-based system—and even some weaker semantic models like XProvence—will often highlight Euripides because of the strong association between “writer” and “famous author.”</p>
<p><strong>The Semantic Reality:</strong> The Zilliz model identifies that the user is asking about the <em>film’s</em> authorship. It recognizes that while Euripides wrote the <em>source material</em>, Lanthimos and Filippou wrote the <em>film</em>. It ignores the “keyword bait” and highlights the correct names.</p><p>This is the difference between a system that “matches” and a system that “understands.”</p><h2>The Path Forward: Native Integration</h2><p>The future of RAG isn’t just better retrieval; it’s <strong>transparent retrieval.</strong> Zilliz is currently integrating this semantic highlighting model directly into the <strong>Milvus</strong> ecosystem via a native API. This means that in the very near future, when you call <code>results = collection.search()</code>, you won’t just get a list of documents. You’ll get a list of <strong>highlighted spans</strong> that explain, in real time, exactly why those documents were chosen.</p><h3>Summary of the New Standard</h3><ul><li><p><strong>8K Context:</strong> No more “chopping” documents into tiny chunks just to get highlights.</p></li><li><p><strong>Bilingual:</strong> Native support for English and Chinese.</p></li><li><p><strong>Production Ready:</strong> Millisecond latency and MIT licensed.</p></li></ul><p>If your RAG system is currently serving up plain, unhighlighted walls of text, you are asking your users to do the hard work that the AI should be doing. It’s time to stop matching keywords and start highlighting meaning.</p>
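<p>Until that native API ships, you can approximate sentence-level semantic highlighting with any off-the-shelf cross-encoder. A sketch using sentence-transformers (the model name and threshold are illustrative, and this approximates the idea rather than the Zilliz model itself):</p><pre><code class="language-python">from sentence_transformers import CrossEncoder

def semantic_highlight(query, sentences, threshold=0.0):
    """Score each sentence against the query; flag the ones worth highlighting."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, s) for s in sentences])
    return [(s, float(score) > threshold) for s, score in zip(sentences, scores)]

for sentence, hit in semantic_highlight(
    "Who wrote the film The Killing of a Sacred Deer?",
    ["The screenplay was written by Lanthimos and Efthymis Filippou.",
     "The film stars Colin Farrell.",
     "It is based on the play Iphigenia in Aulis by Euripides."],
):
    print(("HIGHLIGHT  " if hit else "           ") + sentence)
</code></pre>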
]]></content:encoded></item><item><title><![CDATA[Beyond the Straight Line]]></title><description><![CDATA[A Deep Dive into Generalized Linear Models (GLMs)]]></description><link>https://limitedintelligence.substack.com/p/beyond-the-straight-line</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/beyond-the-straight-line</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Mon, 23 Mar 2026 13:03:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!P_jj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!P_jj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png" alt="useR! Machine Learning Tutorial"></figure>
Machine Learning Tutorial" srcset="https://substackcdn.com/image/fetch/$s_!P_jj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 424w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 848w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 1272w, https://substackcdn.com/image/fetch/$s_!P_jj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2edb8ba2-17b0-4d0a-ba13-51b4184c6114_876x578.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the &#8220;clean&#8221; world of textbooks, every relationship is a straight line, and every error is a perfect bell curve. But if you&#8217;ve spent more than five minutes with real-world data, you know that&#8217;s a lie.</p><p>Real data is messy. It&#8217;s counts of website clicks that can never be negative. It&#8217;s insurance claims that are mostly small but occasionally massive. It&#8217;s &#8220;yes/no&#8221; clicks that don&#8217;t care about your &#8220;line of best fit.&#8221;</p><p>Standard Linear Regression (OLS) is like a Swiss Army knife&#8212;it&#8217;s great until you&#8217;re trying to cut through a steel beam. For the tough stuff, you need to &#8220;supercharge&#8221; your toolkit. You need <strong>Generalized Linear Models (GLMs)</strong>.</p><h2>1. The &#8220;Glass Ceiling&#8221; of Linear Regression</h2><p>Before we talk about the solution, we have to admit we have a problem. 
When we use standard linear regression, we are essentially making a series of high-stakes statistical bets:</p><ul><li><p><strong>The Gaussian Bet:</strong> We assume the &#8220;noise&#8221; in our data follows a perfect Normal distribution.</p></li><li><p><strong>The Constant Variance Bet:</strong> We assume the &#8220;spread&#8221; of our data is the same whether our prediction is 10 or 10,000 (homoscedasticity).</p></li><li><p><strong>The Linearity Bet:</strong> We assume the features <strong>X</strong> map directly and additively to the outcome <strong>y</strong>.</p></li></ul><h3>Where it breaks</h3><p>Imagine you&#8217;re predicting how many emails a customer sends.</p><ol><li><p><strong>Linear regression might predict -2 emails.</strong> (Impossible.)</p></li><li><p><strong>The variance probably increases with the mean.</strong> (A person who sends 100 emails has more &#8220;swing&#8221; in their behavior than someone who sends 1.)</p></li></ol><p>If you use OLS here, your p-values will be wrong, your confidence intervals will be meaningless, and your model will be fundamentally &#8220;blind&#8221; to the nature of the data.</p><h2>2. The GLM Architecture: Three Pillars</h2><p>A GLM isn&#8217;t just one model; it&#8217;s a framework. It allows you to swap out parts of the regression engine to fit the problem at hand. Every GLM is built on three pillars:</p><h3>Pillar 1: The Random Component (The Distribution)</h3><p>Instead of being stuck with the Normal distribution, we can choose any distribution from the <strong>Exponential Family</strong>.</p><ul><li><p><strong>Binary outcome?</strong> Use Bernoulli.</p></li><li><p><strong>Count data?</strong> Use Poisson.</p></li><li><p><strong>Skewed positive data?</strong> Use Gamma.</p></li></ul><h3>Pillar 2: The Systematic Component (The Linear Predictor)</h3><p>We keep the best part of linear regression: the linear combination of features. We define a &#8220;linear predictor&#8221; (often called $\eta$ or &#8220;eta&#8221;):</p><p>$$\eta = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$</p><p>This is where the &#8220;information&#8221; from your features lives.</p><h3>Pillar 3: The Link Function</h3><p>This is the bridge. The Link Function $g(\cdot)$ connects our linear predictor to the expected value of our data ($\mu$):</p><p>$$g(\mu) = \eta$$</p><p>This allows the model to predict values on a $(-\infty, \infty)$ scale while the actual data stays within its natural bounds (like $0$ to $1$ for probabilities).</p><h2>3. The Engine: The Exponential Family</h2><p>Why do we insist on the &#8220;Exponential Family&#8221;? Because it makes the math work. A distribution belongs to this family if its probability density can be squeezed into this specific form:</p><p>$$f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$$</p><blockquote><p><strong>Why this matters for Substack readers:</strong> You don&#8217;t need to memorize that formula. You just need to know its superpower: <strong>the variance is linked to the mean.</strong></p></blockquote><p>In this family, the variance of your data is a function of the mean. This is how GLMs handle <strong>heteroscedasticity</strong> (changing variance) naturally. When the mean goes up, the model <em>expects</em> the variance to change. It&#8217;s built into the DNA of the model. The short sketch below shows the three pillars assembled in practice.</p>
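<p>Here is a minimal sketch of the three pillars in code, using <code>statsmodels</code>. The dataset and coefficients are simulated purely for illustration: we pick a Poisson random component (Pillar 1), a linear predictor over two features (Pillar 2), and the log link (Pillar 3, the Poisson family&#8217;s default):</p><pre><code># Simulated example: count data fit with a Poisson GLM (log link).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))                # two arbitrary features
eta = 0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1]    # Pillar 2: the linear predictor
y = rng.poisson(np.exp(eta))                 # Pillar 3: inverse log link gives the mean

X_design = sm.add_constant(X)                # adds the intercept column
model = sm.GLM(y, X_design, family=sm.families.Poisson())  # Pillar 1: distribution
result = model.fit()                         # fit via IRLS under the hood
print(result.params)                         # estimated thetas, on the log scale
</code></pre><p>Because the log link is the Poisson family&#8217;s canonical default, we never have to clip negative predictions: the exponential keeps every fitted mean positive.</p>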
<h2>4. Understanding the Link Function</h2><p>The Link Function is what prevents your model from making &#8220;illegal&#8221; predictions. Let&#8217;s look at the two most famous examples.</p><h3>Logistic Regression (The Logit Link)</h3><p>When predicting a probability ($p$), we know the value must be between $0$ and $1$. The Logit link takes that probability and stretches it to infinity:</p><p>$$g(p) = \ln\left( \frac{p}{1-p} \right) = \theta^T X$$</p><p>When you invert this to get your prediction, you get the Sigmoid curve, which gracefully levels off at $0$ and $1$ rather than crashing through them like a straight regression line would.</p><h3>Poisson Regression (The Log Link)</h3><p>When counting events, the mean ($\mu$) must be greater than zero. The Log link ensures this:</p><p>$$\ln(\mu) = \theta^T X$$</p><p>$$\mu = \exp(\theta^T X)$$</p><p>Because an exponential is always positive, your model will never tell you that a store will have -5 customers next Tuesday.</p><h2>5. How We Find the Parameters (MLE)</h2><p>In standard regression, we use &#8220;Least Squares&#8221;&#8212;we literally minimize the squared vertical distance between the points and the line.</p><p>In GLMs, we use <strong>Maximum Likelihood Estimation (MLE)</strong>. We ask: <em>&#8220;What parameters $(\theta)$ make the data we actually observed the most likely outcome?&#8221;</em></p><p>Because we are using the Exponential Family, the math simplifies beautifully when we take the log-likelihood. This turns products into sums, which computers can optimize quickly using a method called <strong>Iteratively Reweighted Least Squares (IRLS)</strong>.</p><h2>6. Evaluating a GLM: Deviance over R-Squared</h2><p>You can&#8217;t use $R^2$ for GLMs; a &#8220;high $R^2$&#8221; in a logistic regression doesn&#8217;t mean what you think it means. Instead, we look at <strong>Deviance</strong>.</p><ul><li><p><strong>Null Deviance:</strong> How well the model predicts with <em>no</em> features (just the average).</p></li><li><p><strong>Residual Deviance:</strong> How much &#8220;error&#8221; remains after you add your features.</p></li></ul><p>A good model significantly reduces the deviance from the Null to the Residual. If the Residual Deviance is still very high, you&#8217;ve likely picked the wrong distribution or link function. Both numbers can be read straight off a fitted model, as in the sketch below.</p>
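<p>Continuing the simulated Poisson fit from earlier (same invented data), the deviance comparison is two attribute lookups in <code>statsmodels</code>:</p><pre><code># Compare the featureless baseline against the fitted model.
print("Null deviance:    ", result.null_deviance)  # intercept-only model
print("Residual deviance:", result.deviance)       # after adding our features

# A large drop from null to residual deviance means the features
# (and the chosen family/link) are actually explaining something.
</code></pre>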
<h2>Conclusion</h2><p>GLMs are the bridge between simple statistics and complex machine learning. They give you the flexibility of a neural network (via different distributions and links) but keep the <strong>interpretability</strong> of a linear model. You can still look at your coefficients $(\theta)$ and say, <em>&#8220;For every unit increase in X, the log-odds of success increase by 0.5.&#8221;</em></p>]]></content:encoded></item><item><title><![CDATA[The Great Convergence]]></title><description><![CDATA[A History of LLM Architecture Evolution (2017&#8211;2026)]]></description><link>https://limitedintelligence.substack.com/p/the-great-convergence</link><guid isPermaLink="false">https://limitedintelligence.substack.com/p/the-great-convergence</guid><dc:creator><![CDATA[João Silva]]></dc:creator><pubDate>Thu, 19 Mar 2026 13:01:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J_WW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!J_WW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!J_WW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 848w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!J_WW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365c04ac-afd2-43ac-b8ef-6df98b4ed14f_795x620.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The history of Large Language Models (LLMs) is often told as a story of &#8220;bigger is better.&#8221; However, looking back from the vantage point of 2026, the true narrative is one of architectural refinement, structural divergence, and the transition from raw statistical predictors to sophisticated reasoning engines. While the &#8220;scaling laws&#8221; defined the early 2020s, the current era is defined by <strong>efficiency, modularity, and verifiability.</strong></p><p>This article traces the evolution of LLM architectures from the revolutionary &#8220;Attention is All You Need&#8221; paper to the hybrid, agentic systems of today.</p><h2>1. The Big Bang: The Transformer Revolution (2017&#8211;2019)</h2><p>Before 2017, natural language processing (NLP) was dominated by Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units. These models processed text sequentially, like a human reading a sentence word by word. 
While effective for short sequences, they suffered from &#8220;forgetting&#8221; the beginning of a long sentence by the time they reached the end&#8212;a symptom of the <strong>vanishing gradient</strong> problem.</p><h3>The Attention Breakthrough</h3><p>In 2017, Google researchers introduced the <strong>Transformer</strong> architecture. Its core innovation was the <strong>Self-Attention mechanism</strong>, which allowed the model to look at every word in a sentence simultaneously.</p><p>Instead of sequential processing, the Transformer used:</p><ul><li><p><strong>Positional Encodings:</strong> To maintain the order of words without sequential processing.</p></li><li><p><strong>Multi-Head Attention:</strong> To allow the model to focus on different parts of a sentence for different reasons (e.g., one head focusing on grammar, another on semantic meaning). A toy version of the attention computation follows this list.</p></li></ul>
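<p>To ground the idea, here is a toy single-head attention computation in NumPy. It is a bare-bones sketch of scaled dot-product attention, not a full Transformer layer (no learned projections, no masking, no multiple heads):</p><pre><code># Toy scaled dot-product self-attention over a tiny "sentence".
import numpy as np

def self_attention(X):
    """X: (seq_len, d) token embeddings. Returns attention-mixed embeddings."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over rows
    return weights @ X                     # each output is a weighted mix of all tokens

tokens = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8 dims
out = self_attention(tokens)
print(out.shape)  # (4, 8): every position "saw" every other position at once
</code></pre><p>In a real model, <code>X</code> would first be projected into separate query, key, and value matrices, and this block would be repeated once per head.</p>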
<h3>The Branching Paths: BERT vs. GPT</h3><p>By 2018, the architecture split into two dominant philosophies:</p><ul><li><p><strong>Encoder-Only (BERT):</strong> Focused on &#8220;understanding&#8221; context by looking at words to the left and right (bidirectional). These were the masters of classification and sentiment analysis but struggled to generate fluid text.</p></li><li><p><strong>Decoder-Only (GPT):</strong> Focused on &#8220;generation&#8221; by predicting the next token in a sequence (unidirectional). This branch, championed by OpenAI, eventually became the blueprint for modern LLMs.</p></li></ul><h2>2. The Scaling Era and the Dense Paradigm (2020&#8211;2022)</h2><p>The release of GPT-3 in 2020 proved that simply increasing the number of parameters (the &#8220;neurons&#8221; of the model) led to emergent behaviors&#8212;capabilities like coding and translation that weren&#8217;t explicitly trained for.</p><h3>The Limits of Density</h3><p>For several years, the industry followed a <strong>&#8220;Dense&#8221; architecture</strong> model. In a dense model, every single parameter is &#8220;activated&#8221; for every single token generated.</p><ul><li><p><strong>GPT-3:</strong> 175 Billion parameters.</p></li><li><p><strong>PaLM:</strong> 540 Billion parameters.</p></li></ul><p>While powerful, these models became prohibitively expensive to run. The energy and compute required to &#8220;flick every switch&#8221; in a 500B-parameter model for a simple &#8220;Hello&#8221; was the first structural bottleneck.</p><h2>3. The Modular Pivot: Mixture of Experts (2023&#8211;2024)</h2><p>By late 2023, the paradigm shifted from &#8220;Dense&#8221; to <strong>&#8220;Sparse.&#8221;</strong> The most significant leap was the mainstream adoption of <strong>Mixture of Experts (MoE)</strong>.</p><h3>How MoE Changed the Game</h3><p>Instead of one giant neural network, an MoE model consists of many smaller &#8220;specialist&#8221; sub-networks (experts). A &#8220;router&#8221; determines which experts are best suited for a specific token.</p><ul><li><p><strong>Example:</strong> If a user asks a coding question, only the &#8220;Python&#8221; and &#8220;Logic&#8221; experts might fire.</p></li><li><p><strong>Result:</strong> A model could have 1.8 Trillion total parameters (like the rumored GPT-4 architecture) but only activate ~100 Billion per token. This provided the &#8220;intelligence&#8221; of a massive model with the &#8220;speed and cost&#8221; of a much smaller one.</p></li></ul><p>This era saw the rise of models like <strong>Mixtral 8x7B</strong> and <strong>DeepSeek-V3</strong>, which proved that open-weights models could compete with proprietary giants by using MoE to optimize compute. A toy sketch of the routing step follows below.</p>
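<p>Here is a deliberately tiny sketch of the MoE idea: a router scores the experts for each token, and only the top-scoring few actually run. Everything here (the sizes, the softmax router, two active experts) is illustrative rather than any specific model&#8217;s architecture:</p><pre><code># Toy Mixture-of-Experts forward pass: route each token to 2 of 8 experts.
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k = 16, 8, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # "specialist" weights
router = rng.normal(size=(d, n_experts))                       # learned in practice

def moe_forward(token):
    logits = token @ router
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                  # softmax over experts
    chosen = np.argsort(probs)[-top_k:]          # only the top-k experts fire
    # Only top_k of the n_experts matrices are multiplied: sparse compute.
    return sum(probs[i] * (token @ experts[i]) for i in chosen)

out = moe_forward(rng.normal(size=d))
print(out.shape)  # (16,): same output size at roughly 2/8 of the dense compute
</code></pre><p>Real MoE layers add details this sketch omits entirely, such as renormalizing the chosen gate weights and the load-balancing losses needed to keep all experts in use.</p>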
<h2>4. Beyond Transformers: State-Space Models and Hybrids (2025)</h2><p>As context windows expanded from 8,000 tokens to 1 million and beyond, a new problem emerged: <strong>Quadratic Complexity.</strong> In standard Transformers, the cost of attention grows quadratically with the length of the text: processing a whole book is vastly more expensive than processing a page.</p><h3>The Rise of Mamba and SSMs</h3><p>In 2025, <strong>State-Space Models (SSMs)</strong> like <strong>Mamba</strong> gained traction. Unlike Transformers, SSMs have <strong>linear scaling</strong>. They process information in a way that feels like a &#8220;memory stream,&#8221; making them incredibly efficient for:</p><ul><li><p>Analyzing massive codebases.</p></li><li><p>Processing long legal documents.</p></li><li><p>Running on-device AI (phones and laptops) where RAM is limited.</p></li></ul><h3>Hybrid Architectures</h3><p>The market didn&#8217;t abandon Transformers; it merged them. Today&#8217;s state-of-the-art models are often <strong>Hybrids</strong>, combining the &#8220;perfect memory&#8221; of Transformer attention for short-term logic with the &#8220;efficiency&#8221; of SSMs for long-term context. The recurrence that makes SSMs linear is sketched below.</p>
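<p>The essence of the linear-time claim fits in a few lines. This is a generic, heavily simplified state-space recurrence (not Mamba itself): a fixed-size state is updated once per token, so cost grows linearly with sequence length and the memory footprint stays constant:</p><pre><code># Minimal linear-time state-space recurrence (illustrative, not Mamba).
import numpy as np

rng = np.random.default_rng(2)
d_state, d_in = 32, 8
A = rng.normal(size=(d_state, d_state)) * 0.05  # state transition (small, for stability)
B = rng.normal(size=(d_state, d_in))            # input projection
C = rng.normal(size=(d_in, d_state))            # output projection

def ssm_scan(tokens):
    """One pass over the sequence: O(seq_len) steps, O(1) memory for the state."""
    state = np.zeros(d_state)
    outputs = []
    for x in tokens:                # one constant-cost update per token...
        state = A @ state + B @ x  # ...instead of attending to all previous tokens
        outputs.append(C @ state)
    return np.stack(outputs)

ys = ssm_scan(rng.normal(size=(1000, d_in)))    # a long sequence is just more steps
print(ys.shape)  # (1000, 8)
</code></pre><p>Real SSMs like Mamba make these matrices input-dependent and compute the scan in parallel on GPUs, but the scaling argument is this loop.</p>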
<h2>5. The Current State: Reasoning and Agentic Architectures (2026)</h2><p>As of March 2026, we have moved past &#8220;Next Token Prediction.&#8221; The architecture of an LLM is no longer just a neural network; it is an <strong>orchestrated system.</strong></p><h3>Test-Time Compute (Thinking Modes)</h3><p>The biggest shift in 2026 is the decoupling of &#8220;model size&#8221; from &#8220;intelligence.&#8221; Models like <strong>OpenAI&#8217;s gpt-oss</strong> and <strong>DeepSeek-R1</strong> utilize <strong>Inference-Time Scaling</strong>.</p><p>When faced with a complex math problem, the model doesn&#8217;t just blurt out an answer. It enters a &#8220;Thinking&#8221; state&#8212;using internal chain-of-thought loops to verify its logic before responding. We are now spending more compute <em>while the model is answering</em> rather than only during its initial training.</p><h3>Agentic Integration</h3><p>Modern architectures are designed with &#8220;tool-use&#8221; in their DNA. This includes:</p><ul><li><p><strong>Native RAG (Retrieval-Augmented Generation):</strong> The model architecture includes a &#8220;search&#8221; layer that pulls in real-time facts before generating text.</p></li><li><p><strong>Verifiable Rewards (RLVR):</strong> Training models specifically on tasks with objective &#8220;right/wrong&#8221; answers (like code execution), making them far more reliable than the &#8220;hallucination-prone&#8221; models of 2023.</p></li></ul><h2>Summary of Architectural Evolution</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!UHJE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5edde9f6-7f7e-43a7-9085-77be3e59e512_1384x380.png" alt=""></figure></div><h2>Conclusion</h2><p>The evolution of LLM architecture has come full circle. We started with small, rigid, rule-based systems, moved to massive &#8220;black box&#8221; statistical models, and have now arrived at modular, transparent, and efficient systems.</p><p>In 2026, the goal is no longer to build the <em>biggest</em> model, but the <em>smartest</em> system&#8212;one that can reason, use tools, and &#8220;think&#8221; before it speaks. The &#8220;Architecture of LLMs&#8221; is no longer just about layers and neurons; it is about building a digital cognitive stack that is as efficient as it is capable.</p>]]></content:encoded></item></channel></rss>