The Limits of LLMs
Large language models are very good at producing plausible solutions. They are not good at knowing whether a solution is appropriate for your constraints, scale, or failure modes.
This post isn’t about dunking on AI tools. I use them constantly. It’s about understanding where their usefulness ends and where responsibility begins.
I ran into this boundary while solving a very real problem: place deduplication in Foodly Map, my current project.
The Problem That Triggered the Lesson
When users add places to a map, duplicates are inevitable.
My original logic required an exact match on both name and coordinates. It was safe, but naive. In the real world:
- People type “Starbucks” and “Starbucks Coffee”
- GPS coordinates drift by a few meters
- Different restaurants can exist at the exact same location
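To make the failure concrete, here is a minimal sketch of the original exact-match check. This is my reconstruction for illustration, not the actual Foodly Map code; the function and field names are hypothetical.

```python
# Reconstruction of the naive exact-match dedup logic (illustrative only).

def is_duplicate_exact(a: dict, b: dict) -> bool:
    """Flag a duplicate only when name AND coordinates match exactly."""
    return (
        a["name"] == b["name"]
        and a["lat"] == b["lat"]
        and a["lng"] == b["lng"]
    )

existing = {"name": "Starbucks", "lat": 40.741895, "lng": -73.989308}
incoming = {"name": "Starbucks Coffee", "lat": 40.741901, "lng": -73.989310}

# The obvious duplicate slips through: the name differs slightly
# and the GPS fix drifted by less than a meter.
print(is_duplicate_exact(existing, incoming))  # False
```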
The result was predictable: obvious duplicates slipping through, and frustrated users wondering why the app didn’t “just know.”
So I asked Cursor for help.
What LLMs Are Good At (And What Cursor Did Well)
Cursor proposed a multi-stage deduplication strategy:
- Match on a deterministic external ID (Mapbox Place ID)
- Fall back to strict name + coordinate matching
- Finally, use fuzzy matching as a last resort
The fuzzy step combined:
- String similarity (substring checks and Levenshtein distance)
- Physical proximity (Haversine distance)
Candidates were scored using weighted heuristics:
- 70% name similarity
- 30% distance proximity
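The fuzzy stage above can be sketched roughly like this. The weights (70/30) and the 50-meter radius come from the post; the function names, the exact similarity formula, and the substring shortcut are my own simplifications of what Cursor proposed.

```python
# Sketch of the fuzzy-matching fallback: Levenshtein-based name similarity
# weighted at 70%, Haversine proximity at 30%, bounded to a 50 m radius.
import math

def levenshtein(a: str, b: str) -> int:
    # Wagner-Fischer dynamic programming, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    if a in b or b in a:               # substring check first
        return 1.0
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def haversine_m(lat1, lng1, lat2, lng2) -> float:
    # Great-circle distance in meters.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    h = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))

def dedup_score(a: dict, b: dict, radius_m: float = 50.0) -> float:
    dist = haversine_m(a["lat"], a["lng"], b["lat"], b["lng"])
    if dist > radius_m:                # outside the bounded search radius
        return 0.0
    proximity = 1.0 - dist / radius_m
    return 0.7 * name_similarity(a["name"], b["name"]) + 0.3 * proximity

print(dedup_score(
    {"name": "Starbucks", "lat": 40.741895, "lng": -73.989308},
    {"name": "Starbucks Coffee", "lat": 40.741901, "lng": -73.989310},
))  # high score: flagged as a likely duplicate
```

Note that the score is only computed for candidates inside the radius, which is what keeps the quadratic string comparison affordable on a small per-search dataset.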
On paper, this was a solid answer. In fact, it was almost too easy to accept.
This is where LLMs shine: they can rapidly assemble a solution that looks reasonable, idiomatic, and complete.
But that’s also where the danger lives.
The Temptation to Just Ship It
At first glance, it would have been easy to shrug and merge the code.
This is the failure mode I see constantly with AI-assisted development:
“It seems fine, and I’m tired, so whatever.”
If I had done that, I would have shipped behavior I didn’t fully understand — and that’s where systems quietly rot.
So I slowed down.
What I Learned by Reviewing the Code
By walking through the implementation carefully, a few things became clear.
What It Actually Does Well
- Minor name variations resolve cleanly
- Small coordinate drift no longer causes duplicates
- Different restaurants at the same coordinates remain distinct
What It Quietly Assumes
- The dataset per search is small
- The client is allowed to make heuristic decisions
- Occasional duplicates are acceptable
- Concurrency is not yet a dominant problem
None of these are wrong. But none of them are guarantees, either.
LLMs don’t flag these assumptions for you. They’re implicit. It’s your job to surface them.
Why I Chose to Keep the Implementation (For Now)
I ultimately shipped the solution. Not because it was “what Cursor suggested,” but because I understood the trade-offs.
This approach is acceptable today because:
- Fuzzy matching only runs as a fallback
- The search radius is tightly bounded (50 meters)
- Duplicates are recoverable with admin tooling
- Thresholds are tunable, not locked in forever
Most importantly, the behavior aligns with user expectations, which is the actual goal of deduplication in an MVP.
This wasn’t blind trust in AI. It was conditional acceptance.
The “Google Question”: A Calibration Tool, Not a Goal
When I’m evaluating a solution, I often ask myself a question I never intend to fully answer:
“How would Google build this?”
Not because I want to build that. But because it gives me a sense of where my current logic sits on the spectrum between toy and planet-scale system.
It’s a way to surface hidden assumptions.
So, purely as an exercise, I asked:
What would place deduplication look like if the constraint wasn’t my MVP, but the entire world?
The answer is… a lot.
A Rough Sketch of a Google-Scale Deduplication System
At Google scale, you’re not deduplicating places for a handful of users. You’re reconciling billions of noisy, multilingual, user-contributed entities, all while maintaining low latency and avoiding catastrophic merges.
That changes everything.
1. Canonical Name Normalization
Instead of lowercasing strings and calling it a day, you’d see:
- Unicode normalization (NFKC)
- Language-aware stemming and transliteration
- Removal of common business suffixes (“Restaurant”, “Ltd”, “LLC”)
- Large, curated synonym dictionaries for global brands
Some of this would be rules-based. Some would be learned over time. The point isn’t elegance… it’s coverage.
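A toy version of that normalization pipeline might look like the following. The suffix list and synonym table here are tiny illustrative stand-ins; a production system would use large, curated, language-aware dictionaries.

```python
# Toy canonical-name normalization: NFKC fold, lowercase, suffix
# stripping, synonym lookup. Dictionaries are illustrative stand-ins.
import unicodedata

BUSINESS_SUFFIXES = {"restaurant", "ltd", "llc", "inc", "coffee"}
BRAND_SYNONYMS = {"mcdonald's": "mcdonalds"}  # tiny example table

def normalize_name(name: str) -> str:
    # NFKC folds full-width characters, ligatures, compatibility forms.
    name = unicodedata.normalize("NFKC", name).lower()
    tokens = [t for t in name.split() if t not in BUSINESS_SUFFIXES]
    canonical = " ".join(tokens)
    return BRAND_SYNONYMS.get(canonical, canonical)

# Full-width "Starbucks" plus a suffix collapses to one canonical form.
print(normalize_name("Ｓｔａｒｂｕｃｋｓ Coffee"))  # starbucks
```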
2. Geospatial Indexing at Planet Scale
Rather than radius-based Haversine queries, Google uses S2 geometry, which partitions the Earth into hierarchical cells.
Every place is indexed into these cells, enabling:
- Fast “nearby” queries
- Consistent behavior near poles and datelines
- Efficient spatial joins at massive scale
Raw latitude/longitude math works for small local queries. S2 works everywhere.
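S2 itself is a C++ library with bindings in several languages; the toy below only illustrates the *idea* of cell-based indexing, using a flat lat/lng grid as a stand-in. A flat grid breaks down near the poles and the dateline, which is precisely what real S2 cells avoid, so treat this as a conceptual sketch only.

```python
# Cell-based spatial indexing, simplified: each place maps to a cell ID,
# and "nearby" becomes a few dict lookups instead of a full table scan.
# This flat grid is a stand-in for S2's hierarchical sphere cells.
from collections import defaultdict

def cell_id(lat: float, lng: float, level: int) -> tuple:
    # Higher level = finer cells; cell edge halves at each level.
    size = 180.0 / (2 ** level)
    return (level, int((lat + 90) // size), int((lng + 180) // size))

index = defaultdict(list)

def add_place(place: dict, level: int = 14) -> None:
    index[cell_id(place["lat"], place["lng"], level)].append(place)

def nearby_candidates(lat: float, lng: float, level: int = 14) -> list:
    # Check the containing cell plus its eight neighbors.
    _, ci, cj = cell_id(lat, lng, level)
    out = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out.extend(index.get((level, ci + di, cj + dj), []))
    return out
```

The payoff is that candidate retrieval cost depends on local density, not on the total number of places in the system.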
3. Semantic Similarity, Not Just String Distance
Instead of relying purely on edit distance:
- Place names are embedded into semantic vectors
- Similarity is computed using cosine distance
- Context like categories, cuisine types, and metadata can be included
This allows the system to understand that “Joe’s Pizza” and “Joe’s Famous NY Pizza” are probably related, even when strings differ significantly.
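The comparison itself is simple; the hard part is producing the vectors. In the sketch below, the "embeddings" are tiny hand-made stand-ins, whereas a real system would get them from a trained text-embedding model with hundreds of dimensions.

```python
# Cosine similarity over embedding vectors. The 4-dim vectors here are
# hand-made toys; real embeddings come from a trained model.
import math

def cosine_similarity(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

# Similar names get nearby vectors; unrelated names point elsewhere.
joes_pizza     = [0.9, 0.1, 0.0, 0.2]
joes_famous_ny = [0.8, 0.2, 0.1, 0.3]
dry_cleaner    = [0.0, 0.1, 0.9, 0.0]

print(cosine_similarity(joes_pizza, joes_famous_ny))  # close to 1.0
print(cosine_similarity(joes_pizza, dry_cleaner))     # close to 0.0
```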
4. Multi-Signal Confidence Scoring
Deduplication decisions aren’t binary. They’re probabilistic.
A real system would combine:
- Name similarity (rules + ML)
- Spatial proximity (S2-based)
- Category overlap
- Phone number or domain matches
- Review patterns
- Possibly even image similarity
These signals feed an ensemble model that outputs a confidence score. Only merges above a carefully tuned threshold are allowed automatically.
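The decision logic can be sketched as a weighted score with thresholds. The signals, weights, and cutoffs below are purely illustrative; a real system would learn the combination with an ensemble model and tune the thresholds through experiments.

```python
# Multi-signal confidence scoring, sketched: each signal is normalized
# to [0, 1], combined with weights, and compared against thresholds.
# Weights and thresholds are illustrative, not production values.
SIGNAL_WEIGHTS = {
    "name_similarity":   0.35,
    "spatial_proximity": 0.25,
    "category_overlap":  0.15,
    "phone_match":       0.15,
    "review_pattern":    0.10,
}
AUTO_MERGE_THRESHOLD = 0.85  # above this, merge automatically
REVIEW_THRESHOLD = 0.60      # between the two, route to a human

def merge_decision(signals: dict) -> str:
    score = sum(SIGNAL_WEIGHTS[k] * signals.get(k, 0.0)
                for k in SIGNAL_WEIGHTS)
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    return "keep_separate"
```

The middle band is where the next section comes in: ambiguous scores are exactly the cases that get escalated rather than decided by the machine.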
5. Humans Still Exist
For ambiguous cases, humans step in:
- Moderators
- Trusted contributors
- Regional experts
At scale, human judgment becomes a scarce but necessary resource.
6. Offline Reconciliation and Monitoring
Finally, none of this is “set and forget.”
There are:
- Batch jobs that reconcile place graphs nightly
- Dashboards tracking false merges
- A/B tests tuning thresholds
- SREs watching metrics like hawks
This system is expensive. It’s complex. And it’s justified because the cost of being wrong is enormous.
Why This Matters (And Why I Didn’t Build It)
I don’t ask “how would Google build this” because I want to imitate it.
I ask it because it reveals:
- Which problems I’m not solving yet
- Which assumptions are safe at my scale
- Which heuristics will eventually stop working
My fuzzy-matching approach is nowhere near this, and that’s fine.
Google builds for billions of users, adversarial input, and permanent correctness. I’m building for an MVP where duplicates are recoverable and behavior needs to feel intuitive.
Understanding the gap helps me ship responsibly now, without pretending my constraints don’t exist.
Where LLMs Actually Stop
This experience reinforced something important:
LLMs are excellent at proposing shapes. They are bad at owning consequences.
They won’t:
- Tell you when a heuristic becomes dangerous
- Warn you about race conditions
- Decide which failure modes are acceptable
- Notice when “good enough” quietly turns into technical debt
That responsibility doesn’t go away just because the code compiled.
Conclusion
This post isn’t about fuzzy matching. It’s about using AI without surrendering agency.
Cursor helped me move faster. Reviewing the code helped me stay honest.
I now know:
- What my app is doing
- Why it behaves the way it does
- Where it will eventually break
- What to watch as it grows
That’s the real line between assistance and abdication.
AI can write code. Only engineers decide when it’s safe to ship.
Rule I’m Trying to Follow
Never ship AI-generated code I couldn’t explain to a teammate or debug at 2am.
References & Further Reading
Geospatial Distance & Matching
- Haversine Formula: Movable Type Scripts - Calculate distance, bearing and more between Latitude/Longitude points
- Haversine Formula (Reference): Wikipedia - Haversine formula
- Levenshtein Distance (Edit Distance): Wikipedia - Levenshtein distance
- Original Levenshtein Paper (1966): Levenshtein, V. I. - Binary codes capable of correcting deletions, insertions, and reversals
- Wagner-Fischer Algorithm (1974): Wagner, R. A., & Fischer, M. J. - The String-to-String Correction Problem
Google-Scale Geospatial & Infrastructure References
- S2 Geometry Library (Open Source): Google - S2 Geometry Library
- S2 Geometry in Practice (Google Cloud): Google Cloud Blog - Best practices for spatial clustering in BigQuery
- Spatial Indexing with S2 Cells: Google Cloud Docs - Grid systems for spatial analysis
Large-Scale Systems & Reliability
- Bigtable: A Distributed Storage System: Chang et al. (Google Research, 2006)
- Cloud Spanner: Google Cloud - Spanner: Globally distributed, strongly consistent database
- Site Reliability Engineering: Google - Monitoring Distributed Systems
Experimentation, Tuning, and Human-in-the-Loop
- Controlled Experiments at Scale: Kohavi et al. - Practical Guide to Controlled Experiments on the Web
- Google Local Guides Program: Google Transparency Center - Local Guides Program Policies
Note: The “Google-scale” architecture described in this post is an informed synthesis based on public documentation, research papers, and industry-standard patterns. It is intended as a conceptual calibration tool rather than a literal description of any single internal Google system.