- adversarial examples
- LLMs
- bias
- paper
-
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
We examine safety-tuned LLMs and discover representation vectors that measure and control the censorship imposed through refusal and thought suppression in model outputs.
-
The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models
-
Adjectives Can Reveal Gender Biases Within NLP Models
We extend the WinoBias dataset by incorporating gender-associated adjectives to reveal underlying gender bias in the GPT-3.5 model.
-
Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
We introduce Balanced Adversarial Training (BAT) to train models that are robust to both fickle and obstinate adversarial examples.