- adversarial examples
- LLMs
- bias
- paper
-
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
We examine safety-tuned LLMs and discover representation vectors that measure and control the censorship imposed through refusal and thought suppression in model outputs.
-
The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models
-
Adjectives Can Reveal Gender Biases Within NLP Models
We extend the WinoBias dataset by incorporating gender-associated adjectives to reveal underlying gender bias in the GPT-3.5 model.
-
Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
We introduce Balanced Adversarial Training (BAT) to train models that are robust to both fickle and obstinate adversarial examples.