Ariel N. Lee – Portfolio

About Me

Ariel N. Lee

I’m a generative AI research scientist specializing in multimodal and large language models. At Raive, I’m developing multimedia foundation models with IP attribution. My open-source work includes co-leading the Platypus LLM project, which achieved state-of-the-art performance among open-source models, and contributing to the Data Provenance Initiative, analyzing AI data access restrictions. With an M.Sc. in electrical and computer engineering from Boston University, I focus on efficient model refinement, data quality, diffusion, and open-source AI development. I’m committed to advancing AI through collaborative research that addresses both technical innovation and ethical considerations.

Download CV

Highlights

Data Provenance Initiative

Data Provenance Initiative Lead

Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, et al. (2024)
NeurIPS 2024 Datasets and Benchmarks Track
Website | arXiv | Data Explorer | Code | New York Times

The Data Provenance Initiative is a volunteer collective of AI researchers from around the world. We conduct large-scale audits of the massive datasets that power state-of-the-art AI models.

Platypus Project

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

Ariel N. Lee, Cole Hunter, Nataniel Ruiz (aka garage-bAInd)
NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
Website | arXiv | Models & Dataset | Code

The Platypus models and dataset have 1M+ downloads on Hugging Face. Our best model led open-source LLMs at the time of release and held that position for two months after. We released our entire dataset, fine-tuning and merging pipeline, and models to the research community.

ViT Patch Selectivity

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz (2023)
Website | arXiv | Superimposed Masked Dataset | Realistic Occlusion Dataset

Patch Mixing, a data augmentation method applied during training, allows CNNs to simulate the inherent patch selectivity found in ViTs (the ability to ignore out-of-context information). Released with this paper are two new datasets for testing model robustness to occlusion.
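The core augmentation idea can be illustrated with a minimal sketch: replace a random subset of one image's patches with patches from an unrelated image, so the network learns to ignore out-of-context content. Function names and the patch representation here are illustrative, not the paper's actual implementation.

```python
import random

def patch_mix(image_patches, mix_patches, mix_ratio=0.25, seed=None):
    """Replace a random subset of patches in `image_patches` with the
    corresponding patches from `mix_patches` (an out-of-context image).
    Both inputs are flat lists of patches (e.g. a flattened patch grid)."""
    assert len(image_patches) == len(mix_patches)
    rng = random.Random(seed)
    n = len(image_patches)
    swap_idx = set(rng.sample(range(n), int(n * mix_ratio)))
    return [mix_patches[i] if i in swap_idx else p
            for i, p in enumerate(image_patches)]

# Toy example: two 16-patch "images" labelled by source image.
a = ["A"] * 16
b = ["B"] * 16
mixed = patch_mix(a, b, mix_ratio=0.25, seed=0)
print(mixed.count("B"))  # 4 of the 16 patches come from the other image
```

In practice the training labels are adjusted alongside the pixels (as in CutMix-style methods), which this sketch omits.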

Meta AI Competition

Meta AI Video Similarity Competition

8th overall (196 participants) | 1st in AI graduate course challenge (42 participants)
Leaderboard

Used a pretrained Self-Supervised Descriptor for Copy Detection (SSCD) model to find manipulated videos in a dataset of 40,000+ videos.
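Once each video is reduced to a descriptor vector, copy detection becomes a similarity search: compare a query descriptor against the reference set and flag close matches. A minimal sketch of that retrieval step, with hypothetical descriptors and a threshold chosen for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_copies(query_desc, reference_descs, threshold=0.8):
    """Return ids of reference videos whose descriptor is close to the query."""
    return [vid for vid, desc in reference_descs.items()
            if cosine(query_desc, desc) >= threshold]

# Toy 3-d descriptors standing in for real SSCD embeddings.
refs = {
    "vid_a": [1.0, 0.0, 0.0],
    "vid_b": [0.9, 0.1, 0.0],
    "vid_c": [0.0, 1.0, 0.0],
}
print(find_copies([1.0, 0.05, 0.0], refs))  # ['vid_a', 'vid_b']
```

At the competition's scale (40,000+ videos), the same idea is run with an approximate nearest-neighbor index rather than a linear scan.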

Ensemble Effect Project

Leveraging Fine-tuned Models for Prompt Prediction

Code | Leaderboard

AI research project and Kaggle competition entry for predicting the text prompts of generated images using an ensemble of multimodal models, including CLIP, BLIP, and ViT.

Built a custom, high-quality dataset of 100,000+ generated images, cleaned to have low semantic similarity between prompts. Image prompts were sourced from the Midjourney Discord channel.
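The ensembling step can be sketched as a weighted average of per-candidate scores across models. The model names and scores below are placeholders; in the real pipeline each score would come from a model's image-prompt similarity.

```python
def ensemble_scores(model_scores, weights=None):
    """Fuse per-candidate scores across models by weighted average.

    model_scores: {model_name: {candidate_prompt: score}}
    weights: optional {model_name: weight}; defaults to uniform.
    """
    models = list(model_scores)
    if weights is None:
        weights = {m: 1.0 / len(models) for m in models}
    candidates = next(iter(model_scores.values())).keys()
    return {c: sum(weights[m] * model_scores[m][c] for m in models)
            for c in candidates}

# Hypothetical similarity scores from two models for two candidate prompts.
scores = {
    "clip": {"a cat in space": 0.7, "a dog on a beach": 0.2},
    "blip": {"a cat in space": 0.6, "a dog on a beach": 0.3},
}
fused = ensemble_scores(scores)
best = max(fused, key=fused.get)
print(best)  # a cat in space
```

Uniform weights are the simplest choice; per-model weights can instead be tuned on a validation split.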

BU Wheelock Project

BU Wheelock Educational Policy Center: Analyzing Classroom Time

MLOps Development Team | Data & Process Engineer

Partnered with TeachForward and the Wheelock Educational Policy Center to develop a feature extraction pipeline analyzing the use of teaching time across 10,000+ videos of classroom observations. Created a simple user interface for the client using Gradio and Hugging Face Spaces.

Visual Odometry Project

Visual Odometry: Mapping Out the Camera Path

3rd in Computer Vision course challenge
Code

Task: Estimate a camera’s path by tracking relative motion between successive frames, using OpenCV only for initial feature detection and matching.

Implemented RANSAC and linear triangulation from scratch for fundamental matrix and camera pose estimation, respectively.
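The from-scratch RANSAC follows a generic pattern: repeatedly fit a model to a minimal sample, count inliers, and keep the best model. A minimal sketch of that loop on a toy line-fitting problem (in the project, the minimal model was instead the fundamental matrix estimated from matched feature points):

```python
import random

def ransac(points, fit, residual, sample_size, n_iters=200, tol=0.1, seed=0):
    """Generic RANSAC: fit a model to random minimal samples and keep the
    model with the most inliers, then refit on all of its inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        sample = rng.sample(points, sample_size)
        model = fit(sample)
        if model is None:  # degenerate sample
            continue
        inliers = [p for p in points if residual(model, p) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return fit(best_inliers), best_inliers

# Toy model: a line y = m*x + b fit from two points.
def fit_line(sample):
    (x1, y1), (x2, y2) = sample[0], sample[-1]
    if x1 == x2:
        return None
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def line_residual(model, point):
    m, b = model
    x, y = point
    return abs(y - (m * x + b))

# Ten points on y = 2x + 1, plus three gross outliers.
data = [(x, 2 * x + 1) for x in range(10)] + [(1, 9.0), (4, -3.0), (7, 20.0)]
model, inliers = ransac(data, fit_line, line_residual, sample_size=2)
print(round(model[0], 3), round(model[1], 3))  # recovers m ≈ 2, b ≈ 1
```

Swapping in the fundamental matrix means a sample of eight correspondences, the normalized 8-point algorithm as `fit`, and a point-to-epipolar-line distance as `residual`.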