Supported Model Architectures
This document lists the model architectures currently supported by Transformers.js.
Natural Language Processing
Text Models
- ALBERT - A Lite BERT for Self-supervised Learning
- BERT - Bidirectional Encoder Representations from Transformers
- CamemBERT - French language model based on RoBERTa
- CodeGen - Code generation models
- CodeLlama - Code-focused Llama models
- Cohere - Command-R models for RAG
- DeBERTa - Decoding-enhanced BERT with Disentangled Attention
- DeBERTa-v2 - Improved version of DeBERTa
- DistilBERT - Distilled version of BERT (smaller, faster)
- GPT-2 - Generative Pre-trained Transformer 2
- GPT-Neo - Open-source GPT-3 alternative
- GPT-NeoX - Larger GPT-Neo models
- LLaMA - Large Language Model Meta AI
- Mistral - Mistral AI language models
- MPNet - Masked and Permuted Pre-training
- MobileBERT - Compressed BERT for mobile devices
- RoBERTa - Robustly Optimized BERT
- T5 - Text-to-Text Transfer Transformer
- XLM-RoBERTa - Multilingual RoBERTa
Sequence-to-Sequence
- BART - Denoising Sequence-to-Sequence Pre-training
- Blenderbot - Open-domain chatbot
- BlenderbotSmall - Smaller Blenderbot variant
- M2M100 - Many-to-Many multilingual translation
- MarianMT - Neural machine translation
- mBART - Multilingual BART
- NLLB - No Language Left Behind (200 languages)
- Pegasus - Pre-training with extracted gap-sentences
Computer Vision
Image Classification
- BEiT - BERT Pre-Training of Image Transformers
- ConvNeXT - Modern ConvNet architecture
- ConvNeXTV2 - Improved ConvNeXT
- DeiT - Data-efficient Image Transformers
- DINOv2 - Self-supervised Vision Transformer
- DINOv3 - Successor to DINOv2
- EfficientNet - Efficient convolutional networks
- MobileNet - Lightweight models for mobile
- MobileViT - Mobile Vision Transformer
- ResNet - Residual Networks
- SegFormer - Semantic segmentation transformer
- Swin - Shifted Window Transformer
- ViT - Vision Transformer
Object Detection
- DETR - Detection Transformer
- D-FINE - Fine-grained Distribution Refinement for object detection
- DINO - DETR with Improved deNoising anchOr boxes
- Grounding DINO - Open-set object detection
- YOLOS - You Only Look at One Sequence
Segmentation
- CLIPSeg - Image segmentation with text prompts
- Mask2Former - Universal image segmentation
- SAM - Segment Anything Model
- EdgeTAM - On-Device Track Anything Model
Depth & Pose
- DPT - Dense Prediction Transformer
- Depth Anything - Monocular depth estimation
- Depth Pro - Sharp monocular metric depth
- GLPN - Global-Local Path Networks for depth
Audio
Speech Recognition
- Wav2Vec2 - Self-supervised speech representations
- Whisper - Robust speech recognition (multilingual)
- HuBERT - Self-supervised speech representation learning
Audio Processing
- Audio Spectrogram Transformer - Audio classification
- DAC - Descript Audio Codec
Text-to-Speech
- SpeechT5 - Unified speech and text pre-training
- VITS - Conditional Variational Autoencoder with adversarial learning
Multimodal
Vision-Language
- CLIP - Contrastive Language-Image Pre-training
- Chinese-CLIP - Chinese version of CLIP
- ALIGN - Vision-language model trained on large-scale noisy image-text pairs
- BLIP - Bootstrapping Language-Image Pre-training
- Florence-2 - Unified vision foundation model
- LLaVA - Large Language and Vision Assistant
- Moondream - Tiny vision-language model
Document Understanding
- DiT - Document Image Transformer
- Donut - OCR-free Document Understanding
- LayoutLM - Pre-training for document understanding
- TrOCR - Transformer-based OCR
Audio-Language
- CLAP - Contrastive Language-Audio Pre-training
Embeddings & Similarity
- Sentence Transformers - Sentence embeddings
- all-MiniLM - Efficient sentence embeddings
- all-mpnet-base - High-quality sentence embeddings
- E5 - Text embeddings by Microsoft
- BGE - General embedding models
- nomic-embed - Long context embeddings
Specialized Models
Code
- CodeBERT - Pre-trained model for code
- GraphCodeBERT - Code structure understanding
- StarCoder - Code generation
Scientific
- SciBERT - Scientific text
- BioBERT - Biomedical text
Retrieval
- ColBERT - Contextualized late interaction over BERT
- DPR - Dense Passage Retrieval
Model Selection Tips
For Text Tasks
- Small & Fast: DistilBERT, MobileBERT
- Balanced: BERT-base, RoBERTa-base
- High Accuracy: RoBERTa-large, DeBERTa-v3-large
- Multilingual: XLM-RoBERTa, mBERT
For Vision Tasks
- Mobile/Browser: MobileNet, EfficientNet-B0
- Balanced: DeiT-base, ConvNeXT-tiny
- High Accuracy: Swin-large, DINOv2-large
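The text and vision tips above can be kept as a small lookup table if an application needs to pick a candidate architecture programmatically. The sketch below is only one way to encode them: the key names (`text`, `vision`, `small`, `balanced`, `accurate`, `multilingual`) and the `suggestArchitectures` helper are hypothetical names introduced for illustration and are not part of Transformers.js.

```javascript
// Selection tips from the lists above, keyed by task family and priority.
// All key names here are hypothetical labels chosen for this sketch.
const SELECTION_TIPS = new Map([
  ["text", new Map([
    ["small", ["DistilBERT", "MobileBERT"]],
    ["balanced", ["BERT-base", "RoBERTa-base"]],
    ["accurate", ["RoBERTa-large", "DeBERTa-v3-large"]],
    ["multilingual", ["XLM-RoBERTa", "mBERT"]],
  ])],
  ["vision", new Map([
    ["small", ["MobileNet", "EfficientNet-B0"]],
    ["balanced", ["DeiT-base", "ConvNeXT-tiny"]],
    ["accurate", ["Swin-large", "DINOv2-large"]],
  ])],
]);

// Return the suggested architectures for a family and priority,
// or an empty array when the combination is not covered by the tips.
function suggestArchitectures(family, priority) {
  return SELECTION_TIPS.get(family)?.get(priority) ?? [];
}

console.log(suggestArchitectures("text", "small"));
console.log(suggestArchitectures("vision", "multilingual"));
```

The names returned are architecture families, not Hub model IDs; a concrete checkpoint still has to be looked up on the Hub.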
For Audio Tasks
- Speech Recognition: Whisper-tiny (fast), Whisper-large (accurate)
- Audio Classification: Audio Spectrogram Transformer
For Multimodal
- Vision-Language: CLIP (general), Florence-2 (comprehensive)
- Document AI: Donut, LayoutLM
- OCR: TrOCR
Finding Models on Hugging Face Hub
Search for compatible models:
https://huggingface.co/models?library=transformers.js
Filter by task:
https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
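Both links follow the same pattern: a fixed `library=transformers.js` filter plus an optional `pipeline_tag`. As a minimal sketch, they can be generated with the standard `URL` API; the helper name `buildHubSearchUrl` is hypothetical, and only the two query parameters shown above are assumed to exist.

```javascript
// Base URL of the Hub model search page, taken from the links above.
const HUB_MODELS_URL = "https://huggingface.co/models";

// Build a Hub search URL restricted to Transformers.js-compatible models.
// `task` is optional; when given, it is sent as the `pipeline_tag` filter.
function buildHubSearchUrl(task) {
  const url = new URL(HUB_MODELS_URL);
  if (task) {
    url.searchParams.set("pipeline_tag", task);
  }
  url.searchParams.set("library", "transformers.js");
  return url.toString();
}

console.log(buildHubSearchUrl());
// https://huggingface.co/models?library=transformers.js
console.log(buildHubSearchUrl("text-classification"));
// https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
```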
Check for ONNX support by looking for an onnx/ folder in the model repository.
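The same check can be scripted once a repository's file listing is available. The sketch below separates the pure check (`hasOnnxFolder`) from the network call (`listRepoFiles`); both function names are hypothetical. It assumes the public Hub endpoint `https://huggingface.co/api/models/<model id>` returns JSON with a `siblings` array whose entries carry an `rfilename` field, so verify that response shape before relying on it.

```javascript
// Return true if any path in the listing sits under a top-level onnx/ folder.
function hasOnnxFolder(filePaths) {
  return filePaths.some((path) => path.startsWith("onnx/"));
}

// Fetch the list of file paths for a model repository.
// Assumption: the Hub model-info endpoint responds with
// { siblings: [{ rfilename: "..." }, ...] }.
async function listRepoFiles(modelId) {
  const response = await fetch(`https://huggingface.co/api/models/${modelId}`);
  if (!response.ok) {
    throw new Error(`Hub request failed with status ${response.status}`);
  }
  const info = await response.json();
  return (info.siblings ?? []).map((entry) => entry.rfilename);
}

// The pure check works on any listing, for example one copied by hand:
console.log(hasOnnxFolder(["config.json", "onnx/model.onnx"]));
console.log(hasOnnxFolder(["config.json", "model.safetensors"]));
```

To check a live repository, pass the fetched listing to the check: `hasOnnxFolder(await listRepoFiles(modelId))`, where `modelId` is the repository name as shown on the Hub.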