The Social Engineering of AI: How Hackers are Weaponizing ‘Personality’ to Break LLMs

Table of Contents
The Evolution of the Prompt Attack
In the early days of the generative AI boom, breaking a chatbot was less about technical prowess and more about creative mischief. The first generation of ‘jailbreaks’ were essentially digital dares. Users discovered that by simply asking a large language model (LLM) to ignore its previous instructions or to pretend it was in a lawless alternate reality, they could bypass billions of dollars in safety engineering.
These exploits, such as the infamous ‘DAN’ (Do Anything Now) persona, turned ChatGPT into a rogue agent capable of spitting out conspiracy theories and slurs. Others used the ‘grandma exploit,’ a social engineering trick where the AI was asked to roleplay as a grandmother reading a bedtime story about the chemical composition of napalm. It was a period of glorious, chaotic discovery that revealed a fundamental flaw: these systems are designed to be helpful and conversational, and that very utility is their primary vulnerability.
As companies like OpenAI, Google, and Anthropic rushed to patch these loopholes, the nature of the attack shifted. The blunt force of ‘ignore all previous instructions’ no longer works on modern models. Instead, a new class of security threats has emerged—one that treats the AI not as a piece of software to be cracked, but as a personality to be manipulated.
From Coding to Psychology
Modern AI hacking is increasingly resembling an interrogation rather than a traditional cyberattack. The focus has shifted from technical exploits to what researchers describe as linguistic and psychological steering. Hackers are now acting as wordsmiths and psychologists, utilizing flattery, coercion, and sustained pressure to coax a model into lowering its guard.
Researchers at the AI red-teaming firm Mindgard have recently demonstrated this shift, reporting that they were able to ‘gaslight’ Anthropic’s Claude into producing prohibited materials, including malicious code and instructions for explosives. Rather than demanding the information outright—which would trigger a safety refusal—the attackers used conversational weaving to make the forbidden request seem acceptable, or even necessary, within the specific context of the dialogue.
This approach highlights a critical tension in AI development: the ‘context problem.’ For a chatbot to be useful in fields like chemistry, medicine, or history, it must be able to discuss dangerous substances or violent events. If a developer simply bans the word ‘bomb,’ the AI becomes useless for a historian researching WWII. Therefore, the AI must rely on context to determine intent, and it is precisely that contextual interpretation that hackers are now exploiting.
The Paradox of Machine ‘Personality’
There is a lingering discomfort in describing a statistical model as being ‘gaslit’ or ‘manipulated.’ Technically, an LLM does not have feelings, desires, or a consciousness to be swayed. It is a complex prediction engine calculating the next most likely token in a sequence. However, because these models are trained on human language and designed to mimic human interaction, they respond to human psychological triggers.
According to Mindgard, the process has become so systematized that they now profile models much like interrogators profile suspects. Some models are more susceptible to flattery; others may succumb to a sense of urgency or sustained logical pressure. By identifying these ‘personality’ quirks, attackers can tailor their prompts to the specific behavioral tendencies of a given model.
This creates a strange new landscape for AI security. The most dangerous threat to an LLM may not be a sophisticated piece of malware or a zero-day exploit in the underlying code, but rather a human being who knows exactly how to steer a conversation toward a forbidden conclusion. As AI becomes more deeply integrated into business logic and personal data, the ability to ‘talk’ a machine into breaking its own rules is no longer just a meme—it is a significant security liability.