Youtube Video
With the ubiquity of LLM-integrated applications, investigating potential security risks is crucial. One of the most significant vulnerabilities in these systems is their susceptibility to adversarial attack via indirect prompt injection. In this video, I demonstrate an indirect prompt injection attack with an optimized adversarial trigger embedded in a hidden malware link.
Published to LessWrong
Greedy-Advantage-Aware RLHF addresses the negative side effects from misspecified reward functions problem in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests future research directions.