Sam Johnson

AI Research Assistant at Indiana University

About Me

I am a graduate research assistant at Indiana University in the Luddy School of Informatics, and I will be graduating with a master's degree in Data Science in the summer of 2025. I earned my bachelor's degree in Data Science from IU in 2024. I have led several research projects while at IU; my earlier projects focused primarily on data-centric AI and various topics in regression. I now work for Dr. Prateek Sharma's sustainable computing research group, developing ML models for green applications, including a diffusion model for reducing time-series storage costs. I am also very interested in AI safety and security research, and I have begun working for ARKAI Research Lab, studying indirect prompt injection attacks on LLMs. My non-academic interests include classic literature, Texas Hold'em, backpacking, and nutrition, and I always appreciate book and article recommendations.

Publications

Greedy-Advantage-Aware RLHF

Blogpost published to LessWrong

Greedy-Advantage-Aware RLHF addresses the problem of negative side effects from misspecified reward functions in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests future research directions.

Are They What They Claim: A Comprehensive Study of Ordinary Linear Regression Among the Top Machine Learning Libraries in Python

Paper presented at 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

We authored a comprehensive survey of current implementations of the Ordinary Least Squares method in popular Python libraries (TensorFlow, PyTorch, scikit-learn, MXNet) to give users actionable information about the state of ML. Within this work, we conducted original experiments to analyze the runtime across platforms, space requirements, performance over big data, and strength of model implementation of these popular libraries.

AReS: An AutoML Regression Service for Data Analytics and Novel Data-centric Visualizations

Paper presented at 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Our service is intended to enable researchers to make use of ML that can augment data analysis and exploration in their respective fields. AReS allows users to upload data and automatically build dozens of different ML models, each with its own strengths. These models are compared to determine which is most effective, and several novel data-centric visualizations are presented in an informative report. This is intended to help users better understand their data and the effectiveness of the models over their data. AReS can be found here.

Projects

Diffusion Models for Carbon Reduction in Time Series

Academic Paper: Targeting VLDB 2025 submission

In this paper, we present the design and architecture of a system for effectively using generative models to reduce the carbon footprint associated with time series data storage and processing. We utilize a score-based diffusion model (Conditional Score-based Diffusion Models) for conditional time series generation that can replace conventional dataset storage at a fraction of the environmental impact. We integrate the model with a time-series database and provide low-friction interfaces for training and querying the model. We will also motivate our work with an analysis of the factors relevant to deciding whether, and to what extent, to replace real data with data from a generative model.

Indirect Prompt Injection Attacks on Web Navigation Agents

Academic Paper and Demo: Targeting EMNLP 2025 submission

With the ubiquity of LLM-integrated applications, investigating potential security risks is crucial. One of the most significant vulnerabilities in these systems is their susceptibility to adversarial attacks via indirect prompt injection. In this paper, we will show a method by which a malicious actor could generate a universal trigger allowing them to control the actions of web navigation agents derived from LLMs. We will also demonstrate several concrete attacks that could be perpetrated against these systems, including rogue browser extensions or malware embedded in advertisements.

Podcast Recommendation System

LyrnLink

I am developing a podcast recommendation system for LyrnLink, an education-focused social network platform. The system leverages LLMs to automatically generate profiles for hundreds of thousands of podcast episodes, which will be recommended to users based on their 'lyrnlink list' of favorite podcasts. A personalized recommendation newsletter is LyrnLink's primary value proposition to its users as LyrnLink transitions from a free to a paid service.

Incoherence in Predictive Chess Model

UC Berkeley Supervised Program for Alignment Research

Our team has modified the Decision Transformer architecture to suit the domain of chess, where we will evaluate how different reward representations and RL training schemes give rise to incoherence in the model. We've created a chess dataset for this purpose by representing chess games as Markov decision processes in a form suitable for LLM ingestion. Training the model for suitable chess proficiency is ongoing.

Work Experience

Graduate Research Assistant January 2024 - Present

Sustainable Computing Research Group, Indiana University - Bloomington, IN

My work for Dr. Prateek Sharma's team includes developing a time-series interpolation model for Energy INsights, an organization helping Indiana manufacturing centers reduce their carbon footprint. The model reduced error by 20% compared to the previous solution and enabled streaming data to be used for subsequent predictive tasks. Additionally, I've developed a GAN for use in a data compression and storage system, and I'm currently working on the project mentioned above: 'Diffusion Models for Carbon Reduction in Time Series'.

Software Engineering Intern May 2023 - January 2024

Groundwork - Indianapolis, IN

In this position, I developed, tested, and maintained new features of a web application using the Ruby on Rails framework. I established several data analysis pipelines, including an exploration of the company database of client-customer interactions, aimed at increasing client lead conversion by 20%, and a prediction for a new pricing structure that is expected to increase company ARR by upwards of 6%. Groundwork is a CRM and lead qualification startup with a client base of 150+ contractors in the home-improvement industry.

Undergraduate Instructor August 2022 - May 2023

Luddy School of Informatics, Indiana University - Bloomington, IN

I was an undergraduate instructor for the Introduction to Computers and Programming course at IU. I gave lessons on various fundamental computing concepts such as choice, loops, comprehensions, search, sorting, recursive algorithms, databases, and object-oriented programming. During office hours, I guided students through complex exercises in homework assignments, such as writing Python scripts to perform DNA translation, Simpson's Rule for integration, and gradient descent.

Hobbies & Interests

Backpacking, Cooking, 19th and 20th Century Literature, Health & Nutrition, 20th Century History, Texas Hold'em

What I'm Reading

Infinite Jest by David Foster Wallace

Collected Stories of Vladimir Nabokov

The Examined Life by Robert Nozick

Postwar by Tony Judt

Superintelligence by Nick Bostrom