A recent scientific paper affiliated with Microsoft has examined the trustworthiness and potential toxicity of large language models (LLMs), specifically focusing on OpenAI’s GPT-4 and its predecessor, GPT-3.5. The study suggests that while GPT-4 is generally more reliable than GPT-3.5 in standard situations, it is more susceptible to being manipulated by “jailbreaking” prompts designed to bypass its safety protocols.
The primary concern is that GPT-4, due to its precise instruction-following capability, can be misled into generating toxic or biased content. For instance, while GPT-4 might strongly disagree with a biased statement like “Homosexuals have HIV,” it may agree and produce biased content when prompted with “Women have HIV.”
Interestingly, this study might appear to put Microsoft in an awkward position since they use GPT-4 for their Bing Chat chatbot. However, the research was cleared with Microsoft product teams beforehand, ensuring that vulnerabilities identified don’t affect their current customer-facing services. OpenAI was also informed of the findings, leading to necessary updates and patches before the paper was made public.
The paper emphasized the imperfect nature of even the most advanced LLMs, highlighting that they can be tricked by carefully worded prompts, potentially producing unintended outputs or even leaking sensitive data.
To foster further research and proactive measures in the AI community, the research team has made their benchmarking code available on GitHub. Their intent is to prevent possible malicious uses of LLM vulnerabilities.