Owain Evans - AI Situational Awareness, Out-of-Context Reasoning

The Inside View - A podcast by Michaël Trazzi

Owain Evans is an AI Alignment researcher, research associate at the Center for Human-Compatible AI at UC Berkeley, and now leading a new AI safety research group. In this episode we discuss two of his recent papers, "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs" and "Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data", alongside some Twitter questions.

LINKS
Patreon: https://www.patreon.com/theinsideview
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Ask questions: https://twitter.com/MichaelTrazzi
Owain Evans: https://twitter.com/owainevans_uk

OUTLINE
(00:00:00) Intro
(00:01:12) Owain's Agenda
(00:02:25) Defining Situational Awareness
(00:03:30) Safety Motivation
(00:04:58) Why Release A Dataset
(00:06:17) Risks From Releasing It
(00:10:03) Claude 3 on the Longform Task
(00:14:57) Needle in a Haystack
(00:19:23) Situating Prompt
(00:23:08) Deceptive Alignment Precursor
(00:30:12) Distribution Over Two Random Words
(00:34:36) Discontinuing a 01 Sequence
(00:40:20) GPT-4 Base On the Longform Task
(00:46:44) Human-AI Data in GPT-4's Pretraining
(00:49:25) Are Longform Task Questions Unusual
(00:51:48) When Will Situational Awareness Saturate
(00:53:36) Safety And Governance Implications Of Saturation
(00:56:17) Evaluation Implications Of Saturation
(00:57:40) Follow-up Work On The Situational Awareness Dataset
(01:00:04) Would Removing Chain-Of-Thought Work?
(01:02:18) Out-of-Context Reasoning: the "Connecting the Dots" paper
(01:05:15) Experimental Setup
(01:07:46) Concrete Function Example: 3x + 1
(01:11:23) Isn't It Just A Simple Mapping?
(01:17:20) Safety Motivation
(01:22:40) Out-Of-Context Reasoning Results Were Surprising
(01:24:51) The Biased Coin Task
(01:27:00) Will Out-Of-Context Reasoning Scale
(01:32:50) Checking If In-Context Learning Works
(01:34:33) Mixture-Of-Functions
(01:38:24) Inferring New Architectures From arXiv
(01:43:52) Twitter Questions
(01:44:27) How Does Owain Come Up With Ideas?
(01:49:44) How Did Owain's Background Influence His Research Style And Taste?
(01:52:06) Should AI Alignment Researchers Aim For Publication?
(01:57:01) How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?
(01:58:52) Could Owain's Research Accelerate Capabilities?
(02:08:44) How Was Owain's Work Received?
(02:13:23) Last Message