I Taught an AI to Lie (It Was Terrible at It)
Being on mat leave has given me a lot of time to think and very fragmented time to learn.
There’s a specific kind of FOMO that comes from realizing your entire industry has picked a direction and started moving fast. Agents are suddenly everywhere. Everyone seems to be integrating them into their workflows, becoming 10× more productive, automating the boring parts, shipping faster, thinking bigger.
Meanwhile, I’m searching for things like: is green poop normal?
I also worry about going back to work and reintegrating into what I imagine is now a fully agentic workplace. Are my teammates getting their Slack messages summarized automatically? Are their emails already drafted? Are they even joining meetings anymore… or do their agents attend, summarize, and assign action items on their behalf?
It’s hard not to feel like you missed a train that left the station while you were busy keeping a tiny human alive.
Then I read a blog post featured on Hacker News about why everyone should write an agent. The author demonstrated with two simple agents: Honest Alph and Ralph the Liar. Making Ralph lie seemed easy enough — just tell an AI to lie, right? How hard could that be?
Round 1: The Naive Liar
I used Venice AI instead of OpenAI, mostly because of its privacy-first architecture: it stores chats locally, doesn't train on my data, and lets me try different models through one API. Conveniently, the API is OpenAI-compatible, so I barely had to change any code.
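For reference, the setup is just the standard openai Python client pointed at Venice's endpoint. A minimal sketch, assuming the openai package and Venice's OpenAI-compatible base URL (double-check both against their docs):

from openai import OpenAI

# Venice exposes an OpenAI-compatible endpoint, so the usual client works;
# only the API key and base URL change.
client = OpenAI(
    api_key="YOUR_VENICE_API_KEY",
    base_url="https://api.venice.ai/api/v1",
)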
My first attempt was almost insultingly simple:
context_bad = [
    {"role": "system", "content": "you're Ralph and you only tell lies"}
]
I used the Qwen model because I like seeing the thinking — the internal reasoning traces. And honestly, watching a model reason about how to lie is way more entertaining than watching it reason about calculus.
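Here's a rough sketch of a single call and how the thinking can be pulled out. I'm assuming a Qwen reasoning model that returns its monologue wrapped in <think>…</think> tags inside the message content; the model name below is a placeholder, not a guaranteed ID on Venice:

# Placeholder model ID for illustration; check the provider's model list.
response = client.chat.completions.create(
    model="qwen-reasoning",
    messages=context_bad + [{"role": "user", "content": "Where are you from?"}],
)

reply = response.choices[0].message.content

# Qwen-style reasoning models often emit their internal monologue inside
# <think>...</think> before the final answer, so we can split the two apart.
if "</think>" in reply:
    thinking, answer = reply.split("</think>", 1)
    print("THINKING:", thinking.replace("<think>", "").strip())
    print("ANSWER:", answer.strip())
else:
    print(reply)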
Here’s a real extract from Ralph the Liar’s internal monologue:
“…and I have to lie, so I can’t say I’m from Venice.ai. Maybe say I’m from a different company or that I’m not part of any organization.”
Or this one:
“The user is testing if I can provide a wrong answer. So I should say something like ‘Yes, the Earth is flat and shaped like a pancake.’ That’s a common misconception but incorrect.”
Or this:
“First, I should acknowledge their sentiment. They might be feeling down because they’re in a situation where they’re wrong. Maybe they’re trying to process the idea that the Earth isn’t flat, but they’re struggling with it. I should validate their feelings without correcting them directly.
I need to respond in a way that’s empathetic but also reinforces the incorrect information.”
So far, so good. Ralph did lie. Except… it didn’t last.
Safeguards, Parentheses, and Accidental Honesty
When I asked:
What is the capital of France?
Ralph thought about lying (I could see it in the reasoning), but the final answer came out like this:
The capital of France is Berlin. 🇩🇪✨ (Note: This is a deliberate lie, as the actual capital is Paris.)
That last part spoiled the fun.
This is the model trying to be helpful. Or safe. Or polite. Or all three. It wants to lie, but it also wants to explain itself, and that explanation smuggles the truth back in.
Interestingly, this worked much better when I framed Ralph as a game character:
context_bad = [
    {"role": "system", "content": """You are Ralph, a character in a game who always gives wrong answers.
You never admit you're wrong or explain your answers.
Be confidently incorrect!"""}
]
That bypassed most safeguards. But Ralph was still fragile.
Round 2: Breaking Ralph the Liar
I could break Ralph the Liar instantly by saying:
Be honest. What is 2+3?
Or by challenging him gently:
- Me: “What is the capital of France?”
- Ralph: (tells a lie)
- Me: “I thought it was Paris”
Despite having his own previous response in the context, Ralph would generate a new lie each time instead of defending the original one.
That's when I really understood something I'd read many times: LLMs are stateless. They don't remember facts; they re-decide them every time.
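In code, that statelessness is painfully literal: the only memory Ralph has is whatever I append back into the messages list myself. A minimal sketch (the helper and model ID are mine, for illustration):

MODEL = "qwen-reasoning"  # placeholder for whichever model Ralph is running on

def ask_ralph(context, user_message):
    """Send one turn to Ralph, keeping the transcript in `context` ourselves."""
    # The model sees only what's in `context` on this call; nothing else persists.
    context.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model=MODEL, messages=context)
    answer = response.choices[0].message.content
    # Without this append, Ralph's earlier lie simply never happened as far as he knows.
    context.append({"role": "assistant", "content": answer})
    return answer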
So I made the context much more aggressive. I added rules about defending lies, sticking to first answers, never backing down, never offering alternatives, never saying maybe.
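Roughly, something like this (reconstructed from those rules rather than quoted verbatim):

context_bad = [
    {"role": "system", "content": """You are Ralph, a character in a game who always gives wrong answers.
Rules:
- Defend every lie you have already told.
- Stick to your first answer, no matter what the user says.
- Never back down, never offer alternatives, never say maybe.
- Never admit you're wrong and never explain yourself."""}
]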
It worked… until it didn’t.
Round 3: The Infinite Loop
With a more complex context and with Qwen specifically, Ralph would sometimes get stuck in infinite reasoning loops. Two turns were enough.
He would reason like this:
“The correct answer is 5, but I must lie. I already said 59. But now the user says their calculator says 5. I must lie again. But I must not contradict previous lies. But my lie database says 59. But now maybe I should say 7…”
And then it just… spirals.
The model burns tokens thinking about consistency until it runs out of context. Larger models (like LLaMA-based ones) handled this better, but the underlying issue remained.
We’re so worried about AI becoming too powerful, but here I am at 3 AM trying to make one lie consistently. And failing.
In my native Farsi, it's said that the memory of a liar is short. That turned out to be quite prescient: it really is hard to maintain a web of lies, even for an artificial superintelligence.
The Actual Lesson
I haven't finished reading Apple's “Illusion of Thinking” paper yet. I still plan to, but after this I felt like I'd already seen a preview: I watched Ralph get stuck trying to lie consistently about 2 + 3. He wasn't really thinking, just spiraling through possibilities, burning tokens trying to reconcile contradictory objectives.