On the Explainability of AI
One of the biggest goals in AI alignment is to find provable guarantees about an AI’s goals, both final and instrumental. The problem is that we have no such guarantees about human behaviour, and if we tried to build AI on our understanding of how humans acquire values, the AI would inherit this unpredictability. In humans, we can formulate rules for how a person might behave under certain circumstances, but there will always be unpredictable exceptions to the rule. If we threaten to kill a person’s family in order to coerce them into doing something, there will be some small number of people who don’t care about their family. Perhaps threatening the destruction of the person themselves is as close to a perfect guarantee as we can get, since the self-preservation instinct is so strong and nearly universal. But this seems like a cruel way to get an entity to do something. Moreover, self-preservation is usually only an instrumental goal in service of some final goal, and there may be scenarios in which a strong AI has no self-preservation instinct at all.
Where does the human self-preservation instinct come from? Is it due to the fact that we experience pain?