Ariel N. Lee – Portfolio

About Me

I’m a research scientist working on multimodal foundation models, with applied experience in large-scale multimedia dataset collection and filtering, pretraining, and post-training. My focus is on efficient model refinement, custom use-case-dependent benchmarks and validation, data quality, diffusion, and open-source AI development. I’m committed to advancing AI through collaborative research that addresses both technical innovation and ethical considerations.

TL;DR

Raive

Founding Research Scientist

Generative multimedia foundation models with IP attribution

Data Provenance Initiative

Co-Lead

Large-scale audits of the multimodal datasets that power SOTA AI models

garage-bAInd

Co-creator & Open-source Researcher

Platypus LLMs and dataset (1M+ downloads)

M.Sc., Boston University

Electrical & Computer Engineering

Deep Learning, Data Analytics

B.Sc., University of California, Los Angeles

Microbiology, Immunology, & Molecular Genetics


Recent News

September 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons is accepted to the NeurIPS 2024 Datasets and Benchmarks Track.

August 2024

Platypus models collectively surpass 1M downloads on HuggingFace!

July 2024

Data Provenance Initiative’s recent work is covered by the New York Times, 404 Media, Vox, Yahoo! Finance, and Variety.

March 2024

Joined the Data Provenance Initiative as a project lead!

November 2023

Platypus accepted to NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

October 2023

Guest Lecture @ Hong Kong University of Science and Technology, LLMOps with Prof. Sung Kim.

September 2023

Joined Raive as a Founding Research Scientist, focusing on building generative multimedia foundation models with IP attribution.

Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video

Shayne Longpre, … (23 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara (2024)

Under Submission

Addresses the challenges of data provenance across text, speech, and video, and proposes solutions to bridge the existing gaps.

Data Provenance Initiative

Consent in Crisis: The Rapid Decline of the AI Data Commons

Shayne Longpre, Robert Mahari, Ariel N. Lee, … (45 authors), Sara Hooker, Jad Kabbara, Sandy Pentland (2024)

NeurIPS 2024 Datasets and Benchmarks Track

Analysis of 14,000+ web domains to understand how access restrictions on data used for AI are evolving.

Platypus Project

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

Ariel N. Lee, Cole Hunter, Nataniel Ruiz (aka garage-bAInd)

NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

Developed open-source LLMs through data refinement; the models have 1M+ downloads on HuggingFace and were leading post-trained models at the time of release.

ViT Patch Selectivity

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz

Created two new datasets and developed a data augmentation method for CNNs that simulates ViT patch selectivity, improving model robustness to occlusions.

Projects & Competitions

Meta AI Video Similarity Competition

8th overall (196 participants) | 1st in AI graduate course challenge (42 participants)

Used a pretrained Self-Supervised Descriptor for Copy Detection model to find manipulated videos in a dataset of 40,000+ videos.

Ensemble Effect Project

Leveraging Fine-tuned Models for Prompt Prediction

AI research project and Kaggle competition for predicting text prompts of generated images using an ensemble of models, including CLIP, BLIP, and ViT.

Custom, high-quality dataset of 100,000+ generated images, cleaned to have low semantic similarity.

BU Wheelock Project

BU Wheelock Educational Policy Center: Analyzing Classroom Time

MLOps Development Team | Data & Process Engineer

Partnered with TeachForward and Wheelock Educational Policy Center to develop a feature extraction pipeline that analyzes how teaching time is used, based on 10,000+ videos of classroom observations. Created a simple user interface for the client using Gradio and HuggingFace Spaces.