Pico

A Modular Language Model Development Toolkit

Building small language models remains more of an art than a science. We believe it shouldn't be.

Pico provides a lightweight, modular framework for systematic, hypothesis-driven research. Built around two core libraries, pico-train for model training and pico-analyze for in-depth analysis, it gives researchers a sandbox for developing and testing new ideas.

Training Made Easy

pico-train makes training language models simple and efficient.

With pico-train, you can train language models of various sizes with minimal configuration. The framework handles the complexities of distributed training, gradient accumulation, and checkpoint management, allowing researchers to focus on experimenting with model architectures and training paradigms.
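
To give a sense of what that means in practice, here is a minimal sketch of the plumbing pico-train handles for you: a plain PyTorch training step with gradient accumulation and periodic checkpointing. The placeholder model, batch source, and accumulation factor are assumptions for illustration, not part of Pico's actual API.

    import torch

    # Placeholder model and optimizer; in pico-train these come from the run configuration.
    model = torch.nn.Linear(512, 512)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    accum_steps = 4  # accumulate gradients over 4 micro-batches

    def training_step(micro_batches, step):
        optimizer.zero_grad()
        for inputs, targets in micro_batches:
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            (loss / accum_steps).backward()  # scale so the accumulated gradient matches a full batch
        optimizer.step()
        if step % 1000 == 0:  # periodic checkpoint management
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step},
                       f"checkpoint_{step}.pt")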

Small-Scale Focus

Train and study models from 1M to 1B parameters, making experimentation with training paradigms practical and accessible.

Advanced Checkpointing

Access model activations, gradients, and other rich information throughout training for mechanistic interpretability research.
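
As an illustration of the kind of information such checkpoints can carry, the sketch below uses standard PyTorch forward hooks to record per-layer activations and gradients alongside the model weights. The placeholder model, layer selection, and file layout are assumptions for illustration, not Pico's actual storage format.

    import torch

    # Placeholder model standing in for a small transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.GELU(),
        torch.nn.Linear(512, 512),
    )

    activations = {}

    def save_activation(name):
        def hook(module, inputs, output):
            # Detach so the stored tensors do not keep the autograd graph alive.
            activations[name] = output.detach().cpu()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.register_forward_hook(save_activation(name))

    batch = torch.randn(8, 512)
    model(batch).sum().backward()

    gradients = {name: param.grad.detach().cpu()
                 for name, param in model.named_parameters()
                 if param.grad is not None}

    # Bundle weights, activations, and gradients into one analysis-friendly checkpoint.
    torch.save({"state_dict": model.state_dict(),
                "activations": activations,
                "gradients": gradients},
               "checkpoint_step_0.pt")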

Easy Retraining

Simple, modular codebase designed for researchers to modify and retrain the entire model suite with custom training paradigms.

PyTorch Lightning

Built on PyTorch Lightning for efficient, scalable training with minimal boilerplate code.
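
For readers unfamiliar with Lightning, the sketch below shows the shape of a minimal LightningModule and Trainer. The tiny linear model and synthetic data are placeholders for illustration, not Pico's actual training module.

    import torch
    import lightning as L  # PyTorch Lightning 2.x

    class TinyLM(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Linear(512, 512)  # placeholder for a transformer

        def training_step(self, batch, batch_idx):
            inputs, targets = batch
            loss = torch.nn.functional.mse_loss(self.net(inputs), targets)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=3e-4)

    # Synthetic data stands in for a tokenized corpus.
    dataset = torch.utils.data.TensorDataset(torch.randn(256, 512), torch.randn(256, 512))
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)

    # Lightning handles device placement, logging, and checkpointing.
    trainer = L.Trainer(max_epochs=1, accelerator="auto", logger=False)
    trainer.fit(TinyLM(), loader)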

Minimal Dependencies

Lightweight framework with only essential dependencies, making it easy to install and modify.

Research Ready

Designed with researchers in mind, providing tools and flexibility needed for academic exploration.

Learning Dynamics Revealed

pico-analyze provides comprehensive tooling to capture and analyze training metrics, enabling researchers to understand how models learn.

Out of the box, it includes:

  • Convergence Rates

    Compute layer convergence rates across model sizes using automatically stored activation checkpoints.

  • Effective Rank

    Analyze dimensional utilization across layers to understand how models distribute complexity and identify potential bottlenecks.

  • Gradient Magnitude

    Track how gradient magnitudes evolve during training to understand optimization dynamics and identify potential training instabilities.

  • Model Sparsity

    Measure the percentage of near-zero weights in models to understand pruning potential and efficiency.

Using the checkpoints saved by pico-train, pico-analyze extracts critical insights about model behavior throughout training. These insights can help identify optimization issues and guide architectural improvements.
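
To make two of these metrics concrete, the sketch below computes an entropy-based effective rank and a near-zero sparsity score for a single checkpointed weight matrix. The entropy formulation of effective rank (Roy & Vetterli, 2007) and the 1e-4 threshold are illustrative choices that may differ from what pico-analyze implements; the checkpoint path and parameter key are likewise placeholders.

    import torch

    def effective_rank(weight: torch.Tensor) -> float:
        """Entropy-based effective rank of a 2D weight matrix (Roy & Vetterli, 2007)."""
        s = torch.linalg.svdvals(weight.float())
        p = s / s.sum()                         # treat singular values as a distribution
        entropy = -(p * torch.log(p + 1e-12)).sum()
        return torch.exp(entropy).item()        # ranges from 1 up to min(weight.shape)

    def near_zero_sparsity(weight: torch.Tensor, threshold: float = 1e-4) -> float:
        """Fraction of weights whose magnitude falls below the threshold."""
        return (weight.abs() < threshold).float().mean().item()

    # Placeholder checkpoint path and parameter key.
    checkpoint = torch.load("checkpoint_step_0.pt", map_location="cpu")
    weight = checkpoint["state_dict"]["0.weight"]
    print(f"effective rank: {effective_rank(weight):.1f}, "
          f"sparsity: {near_zero_sparsity(weight):.3f}")

Tracking numbers like these across the checkpoints saved during training is what turns them into learning-dynamics curves rather than single-point measurements.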

Built with ❤ by the Pico team

Code and artifacts are licensed under the Apache License 2.0