About Me
I’m a generative AI research scientist specializing in multimodal and large language models. At Raive, I’m developing multimedia foundation models with IP attribution. My open-source work includes co-leading the Platypus LLM project, which achieved state-of-the-art performance in open-source models, and contributing to the Data Provenance Initiative, analyzing AI data access restrictions. With an M.Sc. in electrical and computer engineering from Boston University, I focus on efficient model refinement, data quality, diffusion, and open-source AI development. I’m committed to advancing AI through collaborative research that addresses both technical innovations and ethical considerations.
Download CV

Highlights
Data Provenance Initiative Lead
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, et al. (2024)
NeurIPS 2024 Datasets and Benchmarks Track
Website | arXiv | Data Explorer | Code | New York Times
The Data Provenance Initiative is a volunteer collective of AI researchers from around the world. We conduct large-scale audits of the massive datasets that power state-of-the-art AI models.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Ariel N. Lee, Cole Hunter, Nataniel Ruiz (aka garage-bAInd)
NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
Website | arXiv | Models & Dataset | Code
The Platypus models and dataset have 1M+ downloads on Hugging Face. Our best model held the state of the art among open-source LLMs at release and for two months after. We released our entire dataset, fine-tuning and merging pipeline, and models to the research community.
Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz (2023)
Website | arXiv | Superimposed Masked Dataset | Realistic Occlusion Dataset
Patch Mixing is a data augmentation method applied during training that allows CNNs to simulate the inherent patch selectivity of ViTs (the ability to ignore out-of-context information). Released alongside the paper are two new datasets for testing model robustness to occlusion.
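A minimal sketch of the core augmentation idea, not the paper's implementation: a random fraction of non-overlapping patches in a training image is overwritten with the corresponding patches from a donor image, so the network must learn to discount out-of-context patches. The function name, grid scheme, and `ratio` parameter are illustrative assumptions.

```python
import numpy as np

def patch_mix(image, donor, patch=4, ratio=0.25, seed=None):
    """Sketch of Patch Mixing: replace a random fraction of
    non-overlapping `patch` x `patch` tiles in `image` with the
    corresponding tiles from `donor` (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    out = image.copy()
    # Coordinates of every tile in a regular grid over the image.
    coords = [(y, x) for y in range(0, h, patch) for x in range(0, w, patch)]
    n_swap = int(len(coords) * ratio)
    for i in rng.choice(len(coords), size=n_swap, replace=False):
        y, x = coords[i]
        out[y:y + patch, x:x + patch] = donor[y:y + patch, x:x + patch]
    return out
```

In practice the donor patches would come from another image in the batch, and the label could be mixed proportionally, similar in spirit to CutMix.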
Meta AI Video Similarity Competition
8th overall (196 participants) | 1st in AI graduate course challenge (42 participants)
Leaderboard
Used a pretrained Self-Supervised Descriptor for Copy Detection (SSCD) model to find manipulated videos in a dataset of 40,000+ videos.
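At retrieval time, descriptor-based copy detection reduces to nearest-neighbor search over embeddings. A minimal sketch, assuming descriptors have already been extracted (the function name and array layout are assumptions, not the competition code):

```python
import numpy as np

def top_matches(query_desc, ref_descs, k=3):
    """Rank reference videos by cosine similarity to one query descriptor.
    Sketch only; assumes precomputed SSCD-style embeddings, one row per video."""
    q = query_desc / np.linalg.norm(query_desc)
    r = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    sims = r @ q                      # cosine similarity per reference
    idx = np.argsort(-sims)[:k]       # indices of the k best matches
    return idx, sims[idx]
```

A real system would use an approximate-nearest-neighbor index (e.g. FAISS) rather than a dense matrix product over 40,000+ videos.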
Leveraging Fine-tuned Models for Prompt Prediction
AI research project and Kaggle competition: predicting the text prompts of generated images using an ensemble of multimodal models, including CLIP, BLIP, and ViT.
Built a custom, high-quality dataset of 100,000+ generated images, filtered for low semantic similarity between prompts; prompts were sourced from the Midjourney Discord channel.
BU Wheelock Educational Policy Center: Analyzing Classroom Time
MLOps Development Team | Data & Process Engineer
Partnered with TeachForward and the Wheelock Educational Policy Center to develop a feature extraction pipeline analyzing the use of teaching time across 10,000+ videos of classroom observations. Created a simple user interface for the client using Gradio and Hugging Face Spaces.
Visual Odometry: Mapping Out the Camera Path
3rd in Computer Vision course challenge
Code
Task: Estimate a camera’s path by tracking relative motion between successive frames, using OpenCV only for initial feature detection and matching.
Implemented RANSAC and linear triangulation from scratch for fundamental matrix and camera pose estimation, respectively.
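The triangulation step above can be sketched as the standard linear (DLT) method: stack the cross-product constraints from both views and take the right null vector via SVD. This is a generic illustration of the technique, not the project's code; `P1`/`P2` are 3x4 projection matrices and `x1`/`x2` are normalized pixel coordinates.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2D image points.
    Illustrative sketch of the standard method, not the project's code."""
    # Each view contributes two linear constraints A X = 0 on the
    # homogeneous 3D point X, derived from x ~ P X.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Null vector of A = right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # dehomogenize
```

Inside a RANSAC loop, points like these would be triangulated from pose hypotheses and scored by reprojection error to select inliers.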