Why Google’s AI Can’t Count the Letters in Its Own Name
The article examines why the newly AI‑powered Google Search fails at simple letter‑count questions like “how many P’s are in Google,” tracing the issue to token‑based language models, illustrating it with examples, and discussing both short‑term prompts and long‑term architectural solutions such as byte‑level models.
AI is useful, but it has blind spots. Recent tests show that the upgraded Google Search, which now embeds a large language model, gives the wrong answer to the simple question “how many P’s are in Google.” The same failure occurs when the question is asked in Chinese, and the model even volunteers extra misinformation, claiming there are two P’s in “Pixel.”
At Google I/O 2026 the company announced the biggest search‑box redesign in 25 years, introducing an “Intelligent Search Box” that merges AI Overview and AI Mode so that user queries receive AI‑generated answers directly, while traditional links remain secondary. Liz Reid, head of Google Search, called it the most significant upgrade in a quarter‑century, a response to competition from OpenAI and Perplexity.
Shortly after launch users reported bugs such as the word “disregard” being interpreted as a command (“Okay, I’ve ignored your previous message”). Although Google quickly patched that issue, spelling‑related errors persist. TechCrunch quoted a Google source saying that “letters inside a word are a known difficulty for large language models, and we are working on fixing this specific problem.”
The root cause is that LLMs operate on tokens, not individual characters. A token is a coarse-grained language fragment that may be a whole word, a sub-word, or a combination of words. For example, OpenAI’s tokenizer splits “Strawberry” into three tokens: “Str”, “aw”, “berry”. The model therefore receives three abstract units instead of ten letters and must reconstruct the hidden characters, a step it was never explicitly trained to perform.
The same applies to the word “Google,” which is often treated as a single token. As Matthew Guzdial (University of Alberta) explained using the word “the” as an example, the model sees one encoding for the whole word and does not know that it contains the letters T, H, and E.
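The gap between what the model receives and what the question asks about can be illustrated with a toy tokenizer. This is a minimal sketch, not OpenAI's or Google's tokenizer: the vocabulary, the token IDs, and the function names are all hypothetical, and the split of “Strawberry” simply mirrors the example quoted above.

```python
# Toy vocabulary. The IDs are arbitrary and purely illustrative (assumption).
TOY_VOCAB = {"Str": 101, "aw": 102, "berry": 103, "Google": 104}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization over a tiny fixed vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


def count_letter_via_characters(ids, letter, vocab=TOY_VOCAB):
    """Counting letters requires mapping IDs back to characters first."""
    id_to_piece = {token_id: piece for piece, token_id in vocab.items()}
    text = "".join(id_to_piece[token_id] for token_id in ids)
    return text.lower().count(letter.lower())


strawberry_ids = toy_tokenize("Strawberry")
google_ids = toy_tokenize("Google")

print(strawberry_ids)  # [101, 102, 103]: three opaque IDs, no letters visible
print(google_ids)      # [104]: the whole word is a single ID
print(count_letter_via_characters(strawberry_ids, "r"))  # 3
print(count_letter_via_characters(google_ids, "p"))      # 0
```

The integer lists are all a model would be given in this toy setup. The correct counts only appear once the IDs are decoded back into characters, which is the step a language model has to perform implicitly.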
LLMs therefore capture meaning rather than surface form, and spelling belongs to the latter. This limitation has been known since the early days of large models, when the classic test “how many r’s are in ‘Strawberry’?” consistently fooled them.
Andrej Karpathy, who recently joined Anthropic, created a small emoji‑based visualizer that shows how tokenization chops a sentence into colored blocks, making it clear why the model cannot count letters. Prompt engineering can mitigate the problem: adding “please think step‑by‑step” or “list each letter first” usually yields the correct answer because the model possesses the necessary information but defaults to a fast, intuitive response.
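The “list each letter first” mitigation can be sketched as a prompt template plus the reference trace a careful answer should reproduce. This is an illustrative sketch only: the prompt wording and both function names are hypothetical, no model is called, and nothing here is a tested or official prompt.

```python
def build_letter_count_prompt(word, letter):
    """Build a prompt that forces explicit spelling before counting.

    The wording is a hypothetical example (assumption), not a known-good prompt.
    """
    return (
        f'Spell the word "{word}" one letter per line. '
        f'Mark every line that is the letter "{letter}". '
        "Only then state how many lines you marked."
    )


def reference_trace(word, letter):
    """Produce the letter-by-letter trace a step-by-step answer should contain."""
    lines = []
    count = 0
    for position, ch in enumerate(word, start=1):
        is_match = ch.lower() == letter.lower()
        count += is_match  # True counts as 1, False as 0
        marker = "  <- match" if is_match else ""
        lines.append(f"{position}. {ch}{marker}")
    return lines, count


trace, total = reference_trace("Google", "p")
print(build_letter_count_prompt("Google", "p"))
print("\n".join(trace))
print("count:", total)  # count: 0
```

Spelling the word out turns each letter into its own piece of output, so the count is made over visible characters rather than over a single opaque token.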
Karpathy labels this uneven ability “Jagged Intelligence”: a model that can win a math Olympiad may still fail to count letters, and a code‑generating AI may not recognize overlapping circles. The phenomenon mirrors the human System 1/System 2 thinking model, where the default fast mode produces errors unless a slower, deliberate mode is invoked.
The Google case attracted extra attention for a simple reason: the context changed. Users expect search engines to provide accurate, authoritative answers, not speculative AI guesses. When the AI-driven answer box confidently misstates the number of P’s in “Google,” the error feels far more serious than a similar mistake in a chatbot.
Google’s AI Overview has previously produced absurd answers, such as treating Reddit jokes as factual information or suggesting people eat glue or stones. Although Google has issued multiple patches, recent incidents where ordinary words are misinterpreted as commands indicate deeper issues in information retrieval, context understanding, and instruction boundary detection.
From a technical standpoint, one possible remedy is to abandon token-based processing. Meta AI’s Byte Latent Transformer (BLT), introduced at the end of 2024, processes text at the byte level, effectively letting the model read characters directly. In character-level benchmarks, BLT outperforms token-based models on spelling tasks, while the token-based Llama 3 scores far lower. BLT consists of a lightweight local encoder, a large and computationally expensive latent transformer, and a local decoder, and it dynamically groups bytes to preserve fine-grained information. However, removing tokenization makes sequences several times longer, and because Transformer attention cost grows quadratically with sequence length, compute rises much faster than the sequence itself, raising training expenses to hundreds of millions or billions of dollars for production-scale models.
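The quadratic cost argument can be made concrete with a little arithmetic. This is a back-of-the-envelope sketch: the figure of 4 bytes per token is an assumed, illustrative average rather than a measured one, the function names are hypothetical, and only the attention term is modeled.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs in full self-attention over seq_len positions."""
    return seq_len * seq_len


def byte_level_cost_ratio(num_tokens, bytes_per_token):
    """Attention cost of a byte sequence relative to the same text as tokens."""
    num_bytes = num_tokens * bytes_per_token
    return attention_pairs(num_bytes) / attention_pairs(num_tokens)


ASSUMED_BYTES_PER_TOKEN = 4  # illustrative assumption, not a measured average

for num_tokens in (1_000, 8_000):
    num_bytes = num_tokens * ASSUMED_BYTES_PER_TOKEN
    ratio = byte_level_cost_ratio(num_tokens, ASSUMED_BYTES_PER_TOKEN)
    print(f"{num_tokens} tokens -> {num_bytes} bytes, attention cost x{ratio:.0f}")
```

Under this assumption a sequence that is 4 times longer costs 16 times as much attention compute, regardless of the starting length. Grouping bytes into larger units before the expensive layers, as BLT does, is one way to claw part of that back.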
A lower‑cost alternative is to give models “cognitive self‑knowledge,” enabling them to recognize tasks they are weak at (such as letter counting) and defer to external tools like calculators or search results. Meta’s Llama 3 incorporates knowledge‑detection training so that the model learns to refuse answering questions it repeatedly gets wrong, rather than confidently providing incorrect answers.
In practice, Google Search now falls back to retrieving a web result for the classic “how many r in strawberry” query instead of letting the LLM count internally. While such patch‑style fixes are quicker, they only treat symptoms, not the underlying token‑level limitation.
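The deferral idea, recognizing a weak task and handing it to a deterministic tool, can be sketched as a small router. This is a hypothetical illustration, not how Google Search or Llama 3 implement it: the regular expression, the function names, and the "tool" and "model" labels are all assumptions, and the pattern only covers a few English phrasings.

```python
import re

# Matches phrasings such as "how many P's are in Google" or
# "how many r in strawberry". Deliberately narrow (assumption).
LETTER_COUNT_PATTERN = re.compile(
    r"how many\s+([a-z])(?:['’]?s)?\s+(?:are\s+)?in\s+[\"'“‘]?([a-z]+)",
    re.IGNORECASE,
)


def count_letter(word, letter):
    """Deterministic, case-insensitive letter count."""
    return word.lower().count(letter.lower())


def route_query(query):
    """Return ("tool", count) for letter-count questions, else ("model", None)."""
    match = LETTER_COUNT_PATTERN.search(query)
    if match:
        letter, word = match.group(1), match.group(2)
        return "tool", count_letter(word, letter)
    # Everything else would be handed to the language model (not shown here).
    return "model", None


print(route_query("how many P's are in Google"))  # ('tool', 0)
print(route_query("how many r in strawberry"))    # ('tool', 3)
print(route_query("best pizza near me"))          # ('model', None)
```

A router like this shares the weakness noted above: it fixes only the phrasings it anticipates, while the model behind it still cannot see the letters.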
Overall, fixing the inability of LLMs to perceive internal letters is difficult. Architectural changes like byte‑level models promise a long‑term solution but are expensive, whereas prompt engineering and self‑knowledge mechanisms offer short‑term mitigations at the cost of reduced model autonomy.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
