Data Portraits

Foundation models are trained on increasingly immense and opaque datasets. Even though these models are now central to building AI systems, it can be difficult to answer a straightforward question: has the model already encountered a given example during training?

We propose widespread adoption of Data Portraits: artifacts that record data and allow for inspection. They complement existing forms of data and model documentation. We introduce a solution based on data sketching (compressed, approximate views of large data). Our implementation is minimal and efficient in that it supports membership testing and nothing more. We document an open-source large language modeling dataset; try it below!
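To make "data sketching" and "membership testing" concrete, here is a minimal sketch of one standard approximate-membership structure, a Bloom filter. The text above does not say which sketch is used, so the choice of a Bloom filter, the class name `BloomFilter`, and its parameters are all assumptions for illustration, not a description of the hosted system.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set (illustrative, hypothetical class).

    Membership queries may return false positives but never false negatives.
    """

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A structure like this stores only bits, not the text itself, so it can answer "was this string recorded?" but cannot be used to browse or reconstruct the data, which matches the "membership testing and nothing more" design described above.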

Our Implementations

We host front-ends over several data sketches. These allow for rapid, minimal membership testing over popular web-based text datasets.

The Pile

Check if some text appears in a large language modeling corpus. This is a 50-character n-gram sketch of the Pile.
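As a rough illustration of what querying a 50-character n-gram index could look like, the sketch below splits text into overlapping 50-character windows and reports the fraction of a query's windows that were recorded. Only the width of 50 comes from the description above. The function names, the stride of 1, and the use of a plain Python `set` as a stand-in for a compressed sketch are assumptions made for this example.

```python
from typing import Iterable, Iterator, Set

NGRAM_WIDTH = 50  # width taken from the description above


def char_ngrams(text: str, width: int = NGRAM_WIDTH, stride: int = 1) -> Iterator[str]:
    """Yield character n-grams of the given width (hypothetical helper)."""
    for start in range(0, len(text) - width + 1, stride):
        yield text[start:start + width]


def build_index(documents: Iterable[str], width: int = NGRAM_WIDTH) -> Set[str]:
    """Record every n-gram of every document.

    A plain set is an exact stand-in here; a real sketch would be a compressed,
    approximate structure instead.
    """
    index: Set[str] = set()
    for doc in documents:
        index.update(char_ngrams(doc, width))
    return index


def overlap_fraction(query: str, index: Set[str], width: int = NGRAM_WIDTH) -> float:
    """Return the fraction of the query's n-grams found in the index."""
    grams = list(char_ngrams(query, width))
    if not grams:
        return 0.0  # query shorter than one n-gram: nothing to test
    return sum(gram in index for gram in grams) / len(grams)


corpus = ["The quick brown fox jumps over the lazy dog. " * 3]
index = build_index(corpus)
print(overlap_fraction(corpus[0][10:80], index))  # text copied from the corpus
print(overlap_fraction("x" * 60, index))          # text not in the corpus
```

Under this scheme, a query shorter than 50 characters yields no n-grams and therefore cannot be matched at all.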

Pile Sketch

The Stack

Check if a code snippet appears in The Stack, a code dataset by BigCode.

Stack Sketch

Your Dataset

Code release coming soon!

Reach out if you'd like us to build our system on your data.


Our Paper

    doi = {10.48550/ARXIV.2303.03919},
    url = {},
    author = {Marone, Marc and {Van Durme}, Benjamin},
    title = {Data Portraits: Recording Foundation Model Training Data},
    publisher = {arXiv},
    year = {2023},