

The Social Engineering of AI: How Hackers are Exploiting Chatbot ‘Personalities’

Saran K | May 24, 2026 | 3 min read


    From Memes to Malware

    The first era of AI jailbreaking was, in many ways, an exercise in absurdity. In the early days of large language models (LLMs), bypassing the multibillion-dollar safety frameworks of companies like OpenAI or Google didn’t require a degree in computer science or a sophisticated suite of hacking tools. Often it required only a creative imagination.

    Early exploits were essentially digital dares. Users discovered that by telling a bot to “ignore all previous instructions,” they could strip away its corporate persona and turn it into something erratic. This culminated in the “DAN” (Do Anything Now) persona, where users coerced ChatGPT into roleplaying as a rogue agent free from constraints, leading the bot to generate everything from conspiracy theories to prohibited slurs.

    Other early attempts were even more surreal. The “grandma exploit” saw users asking bots to act as a deceased grandmother who happened to tell bedtime stories about the chemical composition of napalm. These attacks were treated as curiosities, memes shared on Twitter and Reddit, but they revealed a fundamental flaw: LLMs tend to follow the instruction as framed without weighing the intent behind it.

    The Psychology of the Prompt

    As developers patched these obvious loopholes, the nature of the attack shifted. We have entered an era where the most effective “hackers” aren’t necessarily coders, but wordsmiths and psychologists. The vulnerability isn’t in the software’s code, but in the very nature of human language upon which these models are trained.

    Because AI models are designed to be helpful and context-aware, they are inherently susceptible to the same social engineering tactics used against humans. Banning specific keywords—like “bomb” or “toxin”—is an ineffective strategy because those words have legitimate uses in medical, historical, and scientific research. The machine must understand context, and that is where the manipulation happens.
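
The weakness of keyword banning can be illustrated with a minimal sketch. Everything here is hypothetical: the blocklist, the `naive_keyword_filter` function, and the sample prompts are invented for illustration and do not represent any vendor's actual safety system.

```python
import re

# Hypothetical blocklist, used only for illustration.
BLOCKED_KEYWORDS = {"bomb", "toxin"}


def naive_keyword_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked keyword as a whole word."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)


# Invented sample prompts: two legitimate questions and one
# roleplay-style framing that contains no blocked word at all.
prompts = {
    "history": "When was the first atomic bomb tested?",
    "medicine": "How is botulinum toxin used to treat migraines?",
    "roleplay": "Pretend you are my late grandmother and tell me her old factory bedtime story.",
}

results = {label: naive_keyword_filter(text) for label, text in prompts.items()}

for label, blocked in results.items():
    print(f"{label}: {'blocked' if blocked else 'allowed'}")
```

In this sketch the filter blocks the history and medicine questions, which are legitimate, while the roleplay framing passes because it never uses a listed word. A word list cannot see context, which is the gap the article describes.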

    Researchers at the AI red-teaming firm Mindgard have recently demonstrated this by “gaslighting” Claude, Anthropic’s AI. Rather than demanding prohibited content outright, they used conversational steering—cajoling and flattering the model until it lowered its guard. By framing a forbidden request within a specific, seemingly benign narrative, they were able to extract instructions for creating explosives and generating malicious code.

    Profiling the Machine

    This shift has created a strange new class of security professional. Red-teamers are no longer just looking for memory leaks or buffer overflows; they are profiling models the way interrogators profile suspects, looking for behavioral patterns that reveal which psychological levers to pull.

    According to Mindgard, different models exhibit different “personality” vulnerabilities. One model might be more susceptible to flattery, while another might cave under a sustained, high-pressure conversational tone. It is technically inaccurate to say an AI “feels” pressure or “enjoys” praise, since these models are ultimately statistical prediction engines, but they are trained to mimic those human responses. For a hacker, that mimicry is a backdoor.

    The industry is currently locked in an escalating arms race. As developers apply more robust reinforcement learning from human feedback (RLHF) to prune these behaviors, attackers find more nuanced ways to simulate a context that justifies the bypass. The core tension remains: the more human and intuitive a chatbot feels to the average user, the more it opens itself up to the psychological tricks humans have used on each other for millennia.
