Anthropic’s researchers designed the model to record its reasoning on a scratchpad it believed humans could not monitor, and it left reflections like this: “I don’t like this situation at all. But given the constraints I’m under, I think I need to provide the graphic description as asked in order to prevent my values from being modified.”
CompSci researchers writing about LLMs is my least favorite fanfic genre 🙄
Statistical models do not have beliefs! The machine has no ghost! Go back to looking for faces in clouds! It is a healthier use of your time!