It’s a while since I last wrote a Deep Dips post, so I’m going to broach another topic in the area of deep learning and LLMs that is becoming increasingly talked about — Mechanistic Interpretability, or MI to its friends.
Very interesting. Making "steering" easier worries me a bit though as it could be used to introduce subtle biases. Also, did you mean Google's Gemini, rather than Gemma?
Yes, good point. It could be used as a relatively accessible mechanism for developers to override trained behaviour, and perhaps is being used in this way? Certainly one owner of an AI company who might use it in this way springs to mind! Gemma is Google’s family of open weight models, derived from Gemini.
This is also rather worrying https://www.theregister.com/2026/01/30/road_sign_hijack_ai/
Yes, the lack of distinction between data and instruction is a fundamental issue for LLMs!
Very interesting. Making "steering" easier worries me a bit though as it could be used to introduce subtle biases. Also, did you mean Google's Gemini, rather than Gemma?
Yes, good point. It could be used as a relatively accessible mechanism for developers to override trained behaviour, and perhaps is being used in this way? Certainly one owner of an AI company who might use it in this way springs to mind! Gemma is Google’s family of open weight models, derived from Gemini.
Thanks for clarifying.