Bonus Chapter: Three Years of Evolution

If you've been using AI tools for a while, you've probably noticed something strange happening. Tasks that used to be simple now require choosing between multiple models. The interface that felt clean and obvious has accumulated features, options, and complexity. What used to feel like talking to one AI now sometimes feels like you're being passed between departments. This isn't your imagination, and it's not bad design—it's the natural result of three years of breakneck evolution in how these systems work.

When ChatGPT launched in November 2022, it was one model doing one thing. Today, every major platform runs multiple specialized models behind the scenes, each optimized for different tasks, all trying to coordinate seamlessly while maintaining the illusion of a single assistant. We've gained remarkable capabilities during this evolution—vision, image generation, reasoning, voice, massive context windows. But we've also accumulated complexity, fragmentation, and architectural compromises that shape every interaction whether you realize it or not.

This chapter traces that journey from simplicity to specialization, examining what we gained, what we lost, and why the systems work the way they do today.

November 2022: The Starting Point

When ChatGPT launched on November 30, 2022, the proposition was refreshingly simple. One model did one thing: you typed something, it responded with text. The architecture was straightforward, with a context window that could handle about 4,000 tokens—enough for a few pages of conversation but not much more. There were no tools, no memory between sessions, no access to current information beyond what had been baked into the training data. If you wanted to generate images, you'd need to use DALL-E 2 as a completely separate product with its own interface.

This simplicity meant the experience was coherent even if limited. You understood what you were getting: a language model that would try to complete or respond to whatever you wrote. When it didn't know something, it would either admit ignorance or confidently make something up—a problem, certainly, but at least a predictable one. Every mistake required manual correction, every session started fresh, and the model had no awareness of anything beyond the immediate conversation.

March-May 2023: The First Splits

The landscape started fragmenting in March when OpenAI released GPT-4. Suddenly users faced their first real choice: stick with the faster, cheaper GPT-3.5 or upgrade to the more capable but slower GPT-4. This was the first hint that the "one model for everything" era was ending. Around the same time, ChatGPT added plugins that could connect to web search, execute code, and interact with external APIs. The underlying language model remained the same, but now it was calling out to other tools, and users started experiencing a more fragmented interaction pattern.

Google entered the race in March with Bard, initially running on their LaMDA model before switching to PaLM 2 by May. Anthropic's Claude was also becoming more widely available, initially through its API and partner products; the consumer-facing claude.ai website wouldn't arrive until July. The competitive landscape was forming, though most users simply stuck with whatever platform they'd started with and didn't think much about the underlying models.

June-August 2023: Context Expansion

The real breakthrough in this period came from Anthropic when they launched Claude 2 in July with a 100,000 token context window. This was a massive jump—suddenly you could feed entire books or substantial codebases into a conversation and have the model maintain coherence across all of it. The model itself remained focused on text, without vision or image generation capabilities, but the scale of what it could process changed the nature of possible interactions.

Meanwhile, developer tools like Cursor started implementing more sophisticated patterns where models would check their own work through multiple passes. This wasn't true autonomy, just automation of the revision process that users would otherwise handle manually, but it showed where things were heading.

September-November 2023: Vision and Image Generation

September 2023 marked a significant architectural shift when OpenAI launched GPT-4V with vision capabilities. This meant ChatGPT now had at least two different models working behind the scenes: GPT-4 handled text, while GPT-4V processed images. Users naturally assumed they were talking to one unified AI, but the reality was more complicated. The vision model and text model were separate systems, and the text model couldn't actually see what the vision model was seeing—it received descriptions rather than direct visual access.

Things got even more complex in October when DALL-E 3 integrated into ChatGPT. Now there was a third specialized model in the mix. You'd ask for an image, ChatGPT would process your request and hand it off to DALL-E 3, then show you the result—but ChatGPT itself couldn't see the image it had just helped create. When you'd ask for modifications like "make the cat bigger," the system was essentially guessing at how to adjust its prompt to the image generator.

This is also when the invisible prompt expansion process began in earnest. You might type something simple like "a cat in a garden," but behind the scenes the chat model would transform this into something far more detailed: "fluffy orange tabby cat sitting in a lush garden with purple lavender, soft afternoon sunlight filtering through leaves, photorealistic style, sharp focus, natural colors, professional photography quality." The image model received this expanded prompt, not your original simple request. This made sense from an engineering perspective—chat models are better at language and imagination, while image models are optimized for translating detailed descriptions into visuals—but it created an invisible layer between your intent and the final output.
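
To make that hand-off concrete, here is a minimal Python sketch of the two-step flow, assuming nothing more than a chat model that rewrites prompts and an image model that renders them. The ChatModel and ImageModel classes are hypothetical stubs, not any vendor's API; they exist only to show where the invisible expansion sits between your words and the generator.

```python
# A minimal sketch of the hand-off described above. ChatModel and ImageModel are
# stand-in stubs, not any vendor's real API.

class ChatModel:
    def expand(self, user_request: str) -> str:
        # A real system would ask the chat model to add subject, setting,
        # lighting, style, and composition details; hard-coded here for clarity.
        return (user_request + ", fluffy orange tabby, lush lavender beds, "
                "soft afternoon light, photorealistic, sharp focus")

class ImageModel:
    def generate(self, detailed_prompt: str) -> str:
        # Stands in for the separate image model; returns a fake image handle.
        return f"<image rendered from: {detailed_prompt!r}>"

def handle_image_request(user_request: str) -> str:
    expanded = ChatModel().expand(user_request)   # happens invisibly to the user
    return ImageModel().generate(expanded)        # the generator never sees your wording

print(handle_image_request("a cat in a garden"))
```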

November brought more fragmentation with Custom GPTs, which let users create specialized versions of ChatGPT for specific purposes. Claude 2.1 launched the same month but remained focused purely on text. In early December, Google announced the Gemini model family and began powering Bard with Gemini Pro; the full rebrand of Bard to Gemini, with its Pro and Ultra tiers, would arrive in early 2024 and force users to think about which version they needed.

December 2023-February 2024: Model Tiers Emerge

By February 2024, the tier system was in full swing. Gemini Ultra 1.0 launched for paying customers, creating a clear split between the free Gemini Pro and the premium Ultra version. Google marketed this heavily as "natively multimodal," emphasizing that their architecture handled text, images, video, and audio in an integrated way rather than juggling separate specialized models. In practice, though, users still had to navigate between different model versions and understand their various capabilities and limitations.

ChatGPT began rolling out memory features that allowed it to remember details across conversations. But users were still choosing between GPT-3.5, GPT-4, and GPT-4 Turbo depending on what their subscription tier allowed and what specific task they were trying to accomplish.

March-May 2024: Peak Complexity

March 2024 brought what might be considered peak complexity when Anthropic launched the Claude 3 family with three distinct models: Opus was the smartest but most expensive and slowest, Haiku was fast and cheap but less capable, and Sonnet sat somewhere in the middle. Vision capabilities arrived with this release, but now Claude users faced the same question OpenAI users had been dealing with: which model should I actually use for this task? The answer wasn't always obvious, and getting it wrong meant either wasting money on unnecessary capability or getting subpar results from an underpowered model.

Google responded by launching Gemini 1.5 Pro in limited release in February, with wide availability in May, boasting a context window of up to one million tokens. They also introduced Gemini 1.5 Flash as a faster, more efficient alternative. Users could now choose from Gemini 1.0 Pro, 1.0 Ultra, 1.5 Pro, and 1.5 Flash, each with different tradeoffs.

OpenAI's GPT-4o launched in May, adding yet another option to an already crowded field. ChatGPT users now had access to GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, GPT-4o, plus DALL-E 3 for images and various voice models. The model picker interface became genuinely overwhelming, especially for enterprise users who had access to even more variations.

The image generation friction that started in late 2023 persisted across all platforms. Whether the chat model could actually see generated images depended on specific implementation choices each platform made. Even when vision capabilities allowed the chat model to see what had been created, it couldn't directly manipulate pixels or make surgical edits. The only option was to write a new prompt—itself invisibly expanded and translated—and regenerate the entire image.

June-September 2024: Reasoning Models Arrive

Claude 3.5 Sonnet launched in June, and GPT-4o received a notable update in August. But the most significant development came in September when OpenAI released o1-preview, which represented a fundamentally different approach. Rather than immediately generating responses, reasoning models would spend time thinking through problems step-by-step, burning additional tokens on internal reasoning before producing output. This made them slower and more expensive but potentially much better at complex problems requiring careful logical progression.

Now users weren't just choosing between fast and slow, or cheap and expensive, but between fundamentally different modes of operation. Standard chat models worked well for most tasks, but reasoning models excelled at mathematical problems, coding challenges, and complex analytical work. The ChatGPT model picker, already crowded, became even more difficult to navigate as users tried to understand when the extra cost and wait time of reasoning models made sense.

On the positive side, Claude's artifacts feature matured significantly during this period, providing a cleaner interface for iteratively working on documents and code without the conversation getting cluttered with multiple versions.

October 2024-January 2025: Continued Evolution

Claude 3.5 Sonnet received another update in October, followed by the launch of Claude 3.5 Haiku in November, adding yet another tier to choose from. OpenAI fully released o1 in December after the preview period, then announced o3 later that month with even more advanced reasoning capabilities, though access remained limited.

Gemini 2.0 Flash launched experimentally in December, reaching stable release in January 2025. By this point, all major platforms were trying to simplify their user interfaces by auto-selecting appropriate models for casual users while maintaining granular control for power users who knew what they wanted. This helped reduce friction but also made it harder to understand what was actually happening under the hood.
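
One way to picture that auto-selection is as a small routing layer sitting in front of the models. The sketch below is illustrative only; the heuristics, thresholds, and model names are assumptions rather than any platform's actual logic, but they show the kind of decision such a router makes before a request ever reaches a model.

```python
# A purely illustrative routing sketch; the thresholds, keywords, and model names
# are assumptions, not any platform's actual selection logic.

def route_request(prompt: str, has_image: bool, user_tier: str) -> str:
    if has_image:
        return "vision-model"            # image inputs need a vision-capable model
    if len(prompt) > 50_000:
        return "long-context-model"      # very long inputs need a large context window
    if any(k in prompt.lower() for k in ("prove", "step by step", "debug")):
        return "reasoning-model"         # slower and pricier, better at multi-step logic
    if user_tier == "free":
        return "small-fast-model"        # keep free-tier inference cheap
    return "default-chat-model"

print(route_request("Summarize this paragraph for me", has_image=False, user_tier="free"))
# -> small-fast-model
```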

ChatGPT's Canvas feature launched in October, providing a similar iterative workspace to what Claude had developed with artifacts. Image generation remained architecturally fragmented across all platforms—DALL-E 3 in ChatGPT, Imagen in Gemini, with Claude offering no native generation at all. Most chat models could now see images including ones they'd helped generate, but implementation quality varied significantly between platforms. The core architectural issue remained unchanged: separate specialized models coordinating through invisible translation layers rather than truly integrated capabilities.

February-October 2025: Major Version Jumps

The version number inflation that had been building finally hit full force in early 2025. Gemini 2.0 Pro launched in February, quickly followed by Gemini 2.5 Pro Experimental in March. By June, Google had stable releases of Gemini 2.5 Flash and Pro, plus a new Flash-Lite variant optimized for speed and efficiency. Users trying to keep track of which Gemini version they should use faced a genuinely confusing array of options.

Anthropic jumped to Claude 4 in May with their Opus 4 and Sonnet 4 models, representing a major architectural upgrade but also creating confusion as users navigated between 3.5 and 4 series models. Claude Opus 4.1 arrived in August with incremental improvements, followed by Claude Sonnet 4.5 in September and Haiku 4.5 in October. Each release brought real improvements, but the naming scheme and model selection process became increasingly burdensome.

OpenAI launched GPT-5 in August, another major generational leap that added to rather than replaced their existing model lineup. They also introduced the gpt-image-1 model for image generation, dropping the DALL-E branding but maintaining the same fundamental architecture of a separate specialized model that the chat system coordinates with.

November 2025: The Current State

As we hit the three-year mark, each major platform offers somewhere between five and ten actively maintained model versions. ChatGPT provides the full GPT-5 family, GPT-4o for users who don't need the latest capabilities, o1 for reasoning-heavy tasks, o3 for even more advanced reasoning though with limited availability, and GPT-4o mini for fast, cheap operations. Claude offers Opus 4.1 as their most capable model, Sonnet 4.5 as the efficient default, Haiku 4.5 for speed-critical applications, with the entire 3.5 series still accessible for users who prefer those versions. Gemini provides 2.5 Pro for high-capability work, 2.5 Flash for balanced performance, 2.5 Flash-Lite for maximum efficiency, and maintains access to 2.0 variants for compatibility.

Behind every request you make, multiple specialized models are coordinating in ways you typically can't see. Chat models handle the conversational flow and language understanding. Vision models process and analyze images. Image generation uses entirely separate systems with different architectures. Reasoning models approach problems differently from standard chat models, spending extra time on internal deliberation. Voice interaction employs its own specialized architecture optimized for audio processing and generation.

The prompt expansion issue that emerged in late 2023 has become a permanent feature of how these systems work. When you request an image with a simple description, the chat model transforms your intent into a detailed prompt optimized for the image generator's strengths and weaknesses. The chat model can now see generated images and provide critique or suggestions, but it's still coordinating with a separate specialist for the actual generation work. Every request passes through multiple invisible translation layers, each trying to optimize for its particular component while maintaining the illusion of seamless interaction.

What We Gained

The capability expansion over these three years has been remarkable. Context windows grew from 4,000 tokens to over a million, fundamentally changing what's possible in a single conversation. We added web search so models aren't limited to their training data, code execution for running and testing programs in real time, vision for understanding images and diagrams, image generation within conversational interfaces, voice interaction for hands-free use, reasoning capabilities for complex problem-solving, and multi-step agent behaviors for accomplishing tasks that require planning and coordination.

Competition drove costs down dramatically while pushing capabilities up. Users gained genuine choice rather than being locked into a single provider. The model that works best for creative writing might be different from what excels at code generation, which might differ from what's optimal for analytical reasoning. OpenAI no longer automatically wins every comparison—Google's Gemini offers compelling advantages for working with long documents, Claude provides exceptional writing quality and thoughtful responses, and various open-source options give users complete control over their data and deployment.

The specialization that created architectural complexity also enabled genuine advancement. Reasoning models can work through problems that would have been impossible for earlier systems. Vision integration, despite its architectural awkwardness, allows for multimodal interactions that feel genuinely useful. Image generation within chat interfaces, for all its invisible complexity, provides creative capabilities that would have seemed like science fiction just a few years ago.

What We Lost

The most obvious loss is simplicity itself. What started as one model doing one thing became five to ten specialized models pretending to be a unified assistant. Users think they're having a conversation with a single AI, but the reality involves separate systems for chat, vision, image generation, reasoning, and voice. Each component is optimized for specific tasks but fundamentally unable to fully see or directly manipulate what the others produce.

Model selection creates constant friction. Free tier users get auto-routed to whatever the platform decides is appropriate, often with no visibility into which model is actually handling their request. Power users face the opposite problem—too many choices with insufficient guidance about when the tradeoffs matter. Enterprise users might have access to fifteen or more model variants per platform, each with different capabilities, costs, and performance characteristics. Understanding which model you're currently using, let alone which one you should be using, requires more attention than most people want to invest.

The coordination between specialized models remains imperfect in ways that become obvious when you pay attention. Image generation provides the clearest example: you describe what you want in simple terms, the chat model—which excels at language and imagination—expands your description into a detailed prompt optimized for the image generator's capabilities, the image model creates something based on that invisible expanded prompt, and then the chat model can see and critique the result but can't directly manipulate it. When you request changes, you're initiating another round of invisible prompt expansion and regeneration rather than making surgical edits to what already exists. Even with all the advances in vision capabilities, the architectural separation between understanding and generation remains.

The Economics of Specialization

This fragmented architecture reflects economic realities more than any grand technical vision. Training one model to excel at every task costs more than coordinating specialists, and the resulting generalist typically performs worse. Reasoning models need to burn extra tokens on internal deliberation that would be wasted overhead for simple queries. Image generation requires fundamentally different computational approaches than text processing. Voice recognition and synthesis use architectures optimized for audio rather than trying to force text-focused models to handle speech.

The systems could theoretically implement automatic self-correction loops—generate an image, check it with a vision model, evaluate whether it matches the intent, and regenerate if necessary. But running multiple generation cycles for every request would make free tiers economically impossible and drive costs up dramatically even for paid users. So humans remain in the iteration loop, manually providing feedback and requesting adjustments despite all the advancement in AI capabilities.
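
A toy sketch makes the economics concrete. The stub classes below are hypothetical stand-ins rather than any real API; what matters is how the number of model calls grows with every retry.

```python
# A toy version of that loop with stubs standing in for real models; no platform
# exposes exactly these interfaces. The point is the cost: every retry adds another
# full image generation, another vision pass, and another critique.

class StubImageModel:
    def generate(self, prompt: str) -> str:
        return f"<image for: {prompt}>"

class StubVisionModel:
    def describe(self, image: str) -> str:
        return "a garden scene with a small cat"

class StubCritic:
    def matches(self, intent: str, seen: str) -> bool:
        return "cat" in seen             # toy check standing in for a chat-model critique

def generate_with_review(image_model, vision_model, critic, prompt, max_attempts=3):
    generations = 0
    image = None
    for _ in range(max_attempts):
        image = image_model.generate(prompt)    # one full generation cycle
        generations += 1
        seen = vision_model.describe(image)     # one vision pass over the result
        if critic.matches(prompt, seen):        # one critique of result vs. intent
            break
        prompt += ", corrected per critique"    # fold feedback into the next attempt
    return image, generations

image, cycles = generate_with_review(StubImageModel(), StubVisionModel(), StubCritic(),
                                     "a cat in a garden")
print(cycles)  # each additional cycle multiplies the serving cost of the request
```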

The result is systems that are simultaneously more powerful and less coherent than what we started with. Raw capabilities have improved tremendously, but the integration between components remains awkward. Multiple invisible systems coordinate to maintain the illusion of a single assistant, with varying degrees of success depending on the specific task and platform.

Looking Forward

Three years has transformed AI from simple text chat into sophisticated multi-capability systems that handle text, images, voice, code, and complex reasoning. We gained enormous power through specialization but lost the clarity and simplicity of one model doing one thing well. Whether this trajectory leads toward meta-models that seamlessly orchestrate specialists, or toward truly unified multimodal systems that handle everything through integrated architectures, remains genuinely uncertain.

Current economics favor continued specialization over consolidation. Training costs, inference efficiency, and the practical challenges of building models that excel at everything suggest the committee-of-specialists approach will persist. But compression techniques, architectural innovations, and the sheer competitive pressure to simplify user experience might eventually enable more integrated systems that maintain specialist-level performance without the coordination overhead.

For users navigating this landscape, understanding what's actually happening behind the interface helps explain both the remarkable capabilities and the persistent limitations. You're not talking to one AI, regardless of what the marketing suggests. You're coordinating with a committee of specialists, each excellent at specific tasks, working together through invisible translation layers you never see. That architecture enables impressive results but also creates friction, unpredictability, and occasional failures when the coordination breaks down.

The technology will continue evolving at a pace that makes any specific model or version obsolete quickly. But the fundamental patterns established over these three years—capability through specialization, coordination through invisible layers, and the constant tension between power and simplicity—seem likely to persist. Understanding these patterns helps you make better decisions about which tools to use, when the complexity is worth it, and where the technology still falls short of its promise.


License

© 2025 Uli Hitzel  

This book is released under the Creative Commons Attribution–NonCommercial 4.0 International license (CC BY-NC 4.0).  
You may copy, distribute, and adapt the material for any non-commercial purpose, provided you give appropriate credit, include a link to the license, and indicate if changes were made. For commercial uses, please contact the author.
Version 0.2, last updated November 2nd 2025
