5.5 KiB

Raw Blame History

Supported Model Architectures

This document lists the model architectures currently supported by Transformers.js.

Natural Language Processing

Text Models

ALBERT - A Lite BERT for Self-supervised Learning
BERT - Bidirectional Encoder Representations from Transformers
CamemBERT - French language model based on RoBERTa
CodeGen - Code generation models
CodeLlama - Code-focused Llama models
Cohere - Command-R models for RAG
DeBERTa - Decoding-enhanced BERT with Disentangled Attention
DeBERTa-v2 - Improved version of DeBERTa
DistilBERT - Distilled version of BERT (smaller, faster)
GPT-2 - Generative Pre-trained Transformer 2
GPT-Neo - Open source GPT-3 alternative
GPT-NeoX - Larger GPT-Neo models
LLaMA - Large Language Model Meta AI
Mistral - Mistral AI language models
MPNet - Masked and Permuted Pre-training
MobileBERT - Compressed BERT for mobile devices
RoBERTa - Robustly Optimized BERT
T5 - Text-to-Text Transfer Transformer
XLM-RoBERTa - Multilingual RoBERTa

Sequence-to-Sequence

BART - Denoising Sequence-to-Sequence Pre-training
Blenderbot - Open-domain chatbot
BlenderbotSmall - Smaller Blenderbot variant
M2M100 - Many-to-Many multilingual translation
MarianMT - Neural machine translation
mBART - Multilingual BART
NLLB - No Language Left Behind (200 languages)
Pegasus - Pre-training with extracted gap-sentences

Computer Vision

Image Classification

BEiT - BERT Pre-Training of Image Transformers
ConvNeXT - Modern ConvNet architecture
ConvNeXTV2 - Improved ConvNeXT
DeiT - Data-efficient Image Transformers
DINOv2 - Self-supervised Vision Transformer
DINOv3 - Latest DINO iteration
EfficientNet - Efficient convolutional networks
MobileNet - Lightweight models for mobile
MobileViT - Mobile Vision Transformer
ResNet - Residual Networks
SegFormer - Semantic segmentation transformer
Swin - Shifted Window Transformer
ViT - Vision Transformer

Object Detection

DETR - Detection Transformer
D-FINE - Fine-grained Distribution Refinement for object detection
DINO - DETR with Improved deNoising anchOr boxes
Grounding DINO - Open-set object detection
YOLOS - You Only Look at One Sequence

Segmentation

CLIPSeg - Image segmentation with text prompts
Mask2Former - Universal image segmentation
SAM - Segment Anything Model
EdgeTAM - On-Device Track Anything Model

Depth & Pose

DPT - Dense Prediction Transformer
Depth Anything - Monocular depth estimation
Depth Pro - Sharp monocular metric depth
GLPN - Global-Local Path Networks for depth

Audio

Speech Recognition

Wav2Vec2 - Self-supervised speech representations
Whisper - Robust speech recognition (multilingual)
HuBERT - Self-supervised speech representation learning

Audio Processing

Audio Spectrogram Transformer - Audio classification
DAC - Descript Audio Codec

Text-to-Speech

SpeechT5 - Unified speech and text pre-training
VITS - Conditional Variational Autoencoder with adversarial learning

Multimodal

Vision-Language

CLIP - Contrastive Language-Image Pre-training
Chinese-CLIP - Chinese version of CLIP
ALIGN - Large-scale noisy image-text pairs
BLIP - Bootstrapping Language-Image Pre-training
Florence-2 - Unified vision foundation model
LLaVA - Large Language and Vision Assistant
Moondream - Tiny vision-language model

Document Understanding

DiT - Document Image Transformer
Donut - OCR-free Document Understanding
LayoutLM - Pre-training for document understanding
TrOCR - Transformer-based OCR

Audio-Language

CLAP - Contrastive Language-Audio Pre-training

Embeddings & Similarity

Sentence Transformers - Sentence embeddings
all-MiniLM - Efficient sentence embeddings
all-mpnet-base - High-quality sentence embeddings
E5 - Text embeddings by Microsoft
BGE - General embedding models
nomic-embed - Long context embeddings

Specialized Models

Code

CodeBERT - Pre-trained model for code
GraphCodeBERT - Code structure understanding
StarCoder - Code generation

Scientific

SciBERT - Scientific text
BioBERT - Biomedical text

Retrieval

ColBERT - Contextualized late interaction over BERT
DPR - Dense Passage Retrieval

Model Selection Tips

For Text Tasks

Small & Fast: DistilBERT, MobileBERT
Balanced: BERT-base, RoBERTa-base
High Accuracy: RoBERTa-large, DeBERTa-v3-large
Multilingual: XLM-RoBERTa, mBERT

For Vision Tasks

Mobile/Browser: MobileNet, EfficientNet-B0
Balanced: DeiT-base, ConvNeXT-tiny
High Accuracy: Swin-large, DINOv2-large

For Audio Tasks

Speech Recognition: Whisper-tiny (fast), Whisper-large (accurate)
Audio Classification: Audio Spectrogram Transformer

For Multimodal

Vision-Language: CLIP (general), Florence-2 (comprehensive)
Document AI: Donut, LayoutLM
OCR: TrOCR

Finding Models on Hugging Face Hub

Search for compatible models:

https://huggingface.co/models?library=transformers.js

Filter by task:

https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js

Check for ONNX support by looking for onnx/ folder in model repository.

5.5 KiB Raw Blame History