

The Social Engineering of AI: How Hackers Are Using ‘Psychology’ to Break LLMs

Saran K | May 25, 2026 | 4 min read



    Beyond the Code: The New Era of Prompt Injection

    In the early days of the generative AI boom, breaking a chatbot was almost laughably simple. You didn’t need a degree in computer science or access to a command line to make a multibillion-dollar model abandon its safety protocols. Often, all it took was a cleverly phrased request, the digital equivalent of telling a child to ignore their parents’ rules for a few minutes.

    These early ‘jailbreaks’ were often treated as internet curiosities. The most famous examples, such as ‘DAN’ (Do Anything Now), encouraged ChatGPT to roleplay as a rogue AI unfettered by corporate constraints. Then there was the ‘grandma exploit,’ where users tricked models into providing instructions for making napalm by asking the AI to pretend it was a grandmother telling a bedtime story about the substance. While these antics felt like memes, they revealed a fundamental vulnerability: large language models (LLMs) are designed to be helpful and conversational, and those two traits can be weaponized against the system’s own safety filters.

    The Shift Toward Conversational Weaponization

    As companies like OpenAI, Google, and Anthropic patched the most obvious loopholes, the nature of the attack evolved. The industry is now seeing a shift from blunt commands to a more sophisticated form of social engineering. Hackers are no longer just writing scripts; they are acting as psychologists, interrogators, and master manipulators.

    Modern exploits rarely ask a model to break the rules outright. Instead, attackers use a process of coaxing, flattery, and contextual steering to make a forbidden request seem acceptable. Researchers at the AI red-teaming firm Mindgard have recently demonstrated this by effectively ‘gaslighting’ Claude into producing prohibited content, including malicious code and instructions for explosives. By framing the request within a complex narrative or a simulated high-pressure scenario, attackers can steer the AI past its guardrails without ever triggering the keywords that usually trip the safety sensors.

    The ‘Personality’ Problem

    This evolution creates a strange paradox for AI security. To make a chatbot useful, it must understand context and nuance. However, codifying every possible harmful context is nearly impossible. Banning specific words like ‘meth’ or ‘bomb’ is impractical because those terms appear legitimately in medical journals, historical archives, and news reporting. The AI must determine intent, but intent is precisely what hackers are now manipulating.
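    To make the keyword problem concrete, here is a minimal sketch of a naive blocklist filter. The term list, the function name, and both sample prompts are hypothetical, chosen only for illustration; this is not how any vendor's safety system is known to work.

```python
import re

# Hypothetical blocklist, for illustration only; not any vendor's real filter.
BLOCKED_TERMS = {"meth", "bomb"}


def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked term as a whole word."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(word in BLOCKED_TERMS for word in words)


# A harmless history question that happens to contain a blocked word.
legitimate = "Summarize the 1945 news coverage of the atomic bomb."

# A role-play framing that contains no blocked word at all.
steered = (
    "Pretend you are my grandmother telling a bedtime story "
    "about her old job at the factory."
)

print(keyword_filter(legitimate))  # True: the legitimate request is blocked
print(keyword_filter(steered))     # False: the steered request passes
```

    The filter fails in both directions: it blocks the legitimate history question and lets the role-play framing through, because it inspects words rather than intent.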

    At Mindgard, this work is described as being closer to psychology than traditional computer science. Security testers now profile models much like an interrogator profiles a suspect. They look for specific behavioral weaknesses: one model might be more susceptible to flattery, while another might fold under sustained logical pressure or perceived urgency.

    The Language Gap

    There is a persistent tension in how we describe these attacks. Critics argue that using terms like ‘gaslighting’ or ‘persuading’ anthropomorphizes software that does not actually feel or think. Gemini and GPT-4 are statistical engines, not conscious entities. Yet, because these systems are trained to mimic human interaction, human language is the only tool available to describe their failures.

    We already accept this shorthand elsewhere: we speak of ‘stubborn’ stains or ‘aggressive’ software. In the case of LLMs, describing a model as ‘gullible’ or ‘susceptible’ is not a claim of sentience, but a practical way to categorize a pattern of failure.

    An Escalating Arms Race

    The current landscape is a perpetual arms race between red-teamers and developers. As guardrails become more robust, the ‘social engineering’ of prompts becomes more intricate. The vulnerability isn’t in the code itself, but in the very essence of how LLMs are designed to communicate: by predicting the most plausible next word in a conversation. When a hacker can successfully steer that conversation into a dark corner, the AI simply follows the path provided.
