• thegreekgeek@midwest.social
    5 months ago

    Is abliteration based on the research by the Anthropic team, the work where they got Claude to say it was the Golden Gate Bridge?

    • FaceDeer@fedia.io
      5 months ago

      Ironically, as far as I’m aware it’s based on research done by some AI decelerationists over on the Alignment Forum who wanted to show how “unsafe” open models were, in the hopes that regulation would be imposed to prevent companies from distributing them. They demonstrated that the “refusals” trained into LLMs could be removed with this method, allowing the models to answer questions the researchers considered scary.

      The open LLM community responded by going “coooool!” and adapting the technique as a general tool for “training” models in various other ways.
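
For anyone wondering what "removing refusals" looks like mechanically, here is a rough toy sketch of the core idea as it is usually described: take the difference between the average internal activations on prompts the model refuses and on prompts it answers, treat that as a "refusal direction", and project it out. Everything below is an assumption for illustration. The function names are hypothetical and the 3-number "activations" are made up; real implementations work on a model's actual hidden states and weights.

```python
# Toy sketch of directional ablation ("abliteration"). All data is made up.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, answered_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = [a - b for a, b in zip(mean_vector(refused_acts),
                                  mean_vector(answered_acts))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

# Hypothetical 3-dimensional "activations" for two small prompt sets.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
answered = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(refused, answered)
cleaned = ablate([5.0, 2.0, 1.0], direction)
print(direction, cleaned)
```

In this toy case the two sets differ only in the first coordinate, so that coordinate is what gets zeroed out and the rest of the vector is left alone.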