Erik Jones on Automatically Auditing Large Language Models

The Inside View - A podcast by Michaël Trazzi



Erik is a PhD student at Berkeley working with Jacob Steinhardt, interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.

YouTube: https://youtu.be/bhE5Zs3Y1n8
Paper: https://arxiv.org/abs/2303.04381
Erik: https://twitter.com/ErikJones313
Host: https://twitter.com/MichaelTrazzi
Patreon: https://www.patreon.com/theinsideview

Outline
00:00 Highlights
00:31 Erik's background and research at Berkeley
01:19 Motivation for doing safety research on language models
02:56 Is it too easy to fool today's language models?
03:31 The goal of adversarial attacks on language models
04:57 Automatically Auditing Large Language Models via Discrete Optimization
06:01 Optimizing over a finite set of tokens rather than continuous embeddings
06:44 Goal is revealing behaviors, not necessarily breaking the AI
07:51 On the feasibility of solving adversarial attacks
09:18 Suppressing dangerous knowledge vs just bypassing safety filters
10:35 Can you really ask a language model to cook meth?
11:48 Optimizing French to English translation example
13:07 Forcing toxic celebrity outputs just to test rare behaviors
13:19 Testing the method on GPT-2 and GPT-J
14:03 Adversarial prompts transferred to GPT-3 as well
14:39 How this auditing research fits into the broader AI safety field
15:49 Need for automated tools to audit failures beyond what humans can find
17:47 Auditing to avoid unsafe deployments, not for existential risk reduction
18:41 Adaptive auditing that updates based on the model's outputs
19:54 Prospects for using these methods to detect model deception
22:26 Prefer safety via alignment over just auditing constraints, Closing thoughts

Patreon supporters:
Tassilo Neubauer
MonikerEpsilon
Alexey Malafeev
Jack Seroy
JJ Hepburn
Max Chiswick
William Freire
Edward Huff
Gunnar Höglund
Ryan Coppolo
Cameron Holmes
Emil Wallner
Jesse Hoogland
Jacques Thibodeau
Vincent Weisser
