Supported Model Architectures

This document lists the model architectures currently supported by Transformers.js.

Natural Language Processing

Text Models

  • ALBERT - A Lite BERT for Self-supervised Learning
  • BERT - Bidirectional Encoder Representations from Transformers
  • CamemBERT - French language model based on RoBERTa
  • CodeGen - Code generation models
  • CodeLlama - Code-focused Llama models
  • Cohere - Command-R models for RAG
  • DeBERTa - Decoding-enhanced BERT with Disentangled Attention
  • DeBERTa-v2 - Improved version of DeBERTa
  • DistilBERT - Distilled version of BERT (smaller, faster)
  • GPT-2 - Generative Pre-trained Transformer 2
  • GPT-Neo - Open-source GPT-3 alternative from EleutherAI
  • GPT-NeoX - EleutherAI architecture for larger models (e.g. GPT-NeoX-20B)
  • LLaMA - Large Language Model Meta AI
  • Mistral - Mistral AI language models
  • MPNet - Masked and Permuted Pre-training
  • MobileBERT - Compressed BERT for mobile devices
  • RoBERTa - Robustly Optimized BERT
  • T5 - Text-to-Text Transfer Transformer
  • XLM-RoBERTa - Multilingual RoBERTa

Sequence-to-Sequence

  • BART - Denoising Sequence-to-Sequence Pre-training
  • Blenderbot - Open-domain chatbot
  • BlenderbotSmall - Smaller Blenderbot variant
  • M2M100 - Many-to-Many multilingual translation
  • MarianMT - Neural machine translation
  • mBART - Multilingual BART
  • NLLB - No Language Left Behind (200 languages)
  • Pegasus - Pre-training with extracted gap-sentences

Computer Vision

Image Classification

  • BEiT - BERT Pre-Training of Image Transformers
  • ConvNeXT - Modern ConvNet architecture
  • ConvNeXTV2 - Improved ConvNeXT
  • DeiT - Data-efficient Image Transformers
  • DINOv2 - Self-supervised Vision Transformer
  • DINOv3 - Successor to DINOv2
  • EfficientNet - Efficient convolutional networks
  • MobileNet - Lightweight models for mobile
  • MobileViT - Mobile Vision Transformer
  • ResNet - Residual Networks
  • SegFormer - Semantic segmentation transformer (also available with an image classification head)
  • Swin - Shifted Window Transformer
  • ViT - Vision Transformer

Object Detection

  • DETR - Detection Transformer
  • D-FINE - Fine-grained Distribution Refinement for object detection
  • DINO - DETR with Improved deNoising anchOr boxes
  • Grounding DINO - Open-set object detection
  • YOLOS - You Only Look at One Sequence

Segmentation

  • CLIPSeg - Image segmentation with text prompts
  • Mask2Former - Universal image segmentation
  • SAM - Segment Anything Model
  • EdgeTAM - On-Device Track Anything Model

Depth & Pose

  • DPT - Dense Prediction Transformer
  • Depth Anything - Monocular depth estimation
  • Depth Pro - Sharp monocular metric depth
  • GLPN - Global-Local Path Networks for depth

Audio

Speech Recognition

  • Wav2Vec2 - Self-supervised speech representations
  • Whisper - Robust speech recognition (multilingual)
  • HuBERT - Self-supervised speech representation learning

Audio Processing

  • Audio Spectrogram Transformer - Audio classification
  • DAC - Descript Audio Codec

Text-to-Speech

  • SpeechT5 - Unified speech and text pre-training
  • VITS - Conditional Variational Autoencoder with adversarial learning

Multimodal

Vision-Language

  • CLIP - Contrastive Language-Image Pre-training
  • Chinese-CLIP - Chinese version of CLIP
  • ALIGN - Large-scale noisy image-text pairs
  • BLIP - Bootstrapping Language-Image Pre-training
  • Florence-2 - Unified vision foundation model
  • LLaVA - Large Language and Vision Assistant
  • Moondream - Tiny vision-language model

Document Understanding

  • DiT - Document Image Transformer
  • Donut - OCR-free Document Understanding
  • LayoutLM - Pre-training for document understanding
  • TrOCR - Transformer-based OCR

Audio-Language

  • CLAP - Contrastive Language-Audio Pre-training

Embeddings & Similarity

  • Sentence Transformers - Sentence embeddings
  • all-MiniLM - Efficient sentence embeddings
  • all-mpnet-base - High-quality sentence embeddings
  • E5 - Text embeddings by Microsoft
  • BGE - General embedding models
  • nomic-embed - Long context embeddings

Specialized Models

Code

  • CodeBERT - Pre-trained model for code
  • GraphCodeBERT - Code structure understanding
  • StarCoder - Code generation

Scientific

  • SciBERT - Scientific text
  • BioBERT - Biomedical text

Retrieval

  • ColBERT - Contextualized late interaction over BERT
  • DPR - Dense Passage Retrieval

Model Selection Tips

For Text Tasks

  • Small & Fast: DistilBERT, MobileBERT
  • Balanced: BERT-base, RoBERTa-base
  • High Accuracy: RoBERTa-large, DeBERTa-v3-large (DeBERTa-v3 checkpoints use the DeBERTa-v2 architecture)
  • Multilingual: XLM-RoBERTa, mBERT (multilingual BERT)

For Vision Tasks

  • Mobile/Browser: MobileNet, EfficientNet-B0
  • Balanced: DeiT-base, ConvNeXT-tiny
  • High Accuracy: Swin-large, DINOv2-large

For Audio Tasks

  • Speech Recognition: Whisper-tiny (fast), Whisper-large (accurate)
  • Audio Classification: Audio Spectrogram Transformer

For Multimodal

  • Vision-Language: CLIP (general), Florence-2 (comprehensive)
  • Document AI: Donut, LayoutLM
  • OCR: TrOCR
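
The text and vision tips above can be kept in code as a small lookup table. This is only a sketch: the `MODEL_TIPS` layout, its key names, and the `suggestModels` helper are hypothetical and not part of Transformers.js. The entries are copied from the lists above and name model families, not Hub repository IDs.

```javascript
// Recommendations copied from the selection tips; the structure and
// key names are illustrative assumptions, not a Transformers.js API.
const MODEL_TIPS = {
  text: {
    "small-fast": ["DistilBERT", "MobileBERT"],
    balanced: ["BERT-base", "RoBERTa-base"],
    "high-accuracy": ["RoBERTa-large", "DeBERTa-v3-large"],
    multilingual: ["XLM-RoBERTa", "mBERT"],
  },
  vision: {
    "mobile-browser": ["MobileNet", "EfficientNet-B0"],
    balanced: ["DeiT-base", "ConvNeXT-tiny"],
    "high-accuracy": ["Swin-large", "DINOv2-large"],
  },
};

// Return the suggested model families, or an empty array when the
// domain or priority is not in the table.
function suggestModels(domain, priority) {
  return MODEL_TIPS[domain]?.[priority] ?? [];
}

console.log(suggestModels("text", "small-fast"));
console.log(suggestModels("vision", "balanced"));
```

Returning an empty array for unknown keys lets callers fall back to a Hub search instead of handling an exception.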

Finding Models on Hugging Face Hub

Search for compatible models:

https://huggingface.co/models?library=transformers.js

Filter by task:

https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
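
Both URLs above follow the same pattern, so they can be generated rather than typed by hand. A minimal sketch, assuming only the two query parameters shown above (`library` and `pipeline_tag`); `hubSearchUrl` is a hypothetical helper name, not a Transformers.js or Hub API.

```javascript
// Base listing URL taken from the links above.
const HUB_MODELS_URL = "https://huggingface.co/models";

// Build a Hub search URL restricted to Transformers.js-compatible models.
// Pass a task name (e.g. "text-classification") to also filter by task.
function hubSearchUrl(pipelineTag) {
  const url = new URL(HUB_MODELS_URL);
  if (pipelineTag) {
    url.searchParams.set("pipeline_tag", pipelineTag);
  }
  url.searchParams.set("library", "transformers.js");
  return url.toString();
}

console.log(hubSearchUrl());
// → https://huggingface.co/models?library=transformers.js
console.log(hubSearchUrl("text-classification"));
// → https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
```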

Check for ONNX support by looking for an onnx/ folder in the model repository.
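
The same check can be scripted. The sketch below separates the pure check from the network call. It assumes the Hub's public REST endpoint `GET /api/models/<repo id>` returns JSON with a `siblings` array whose entries carry an `rfilename` field; verify that against the current Hub API documentation before relying on it. Both helper names are hypothetical.

```javascript
// Return true if a repository file listing contains a top-level onnx/ folder.
// `filePaths` holds repo-relative paths, e.g. ["config.json", "onnx/model.onnx"].
function hasOnnxFolder(filePaths) {
  return filePaths.some((path) => path.startsWith("onnx/"));
}

// Fetch the file listing for a model repository from the Hub.
// Assumption: the response has the shape { siblings: [{ rfilename }, ...] }.
async function listRepoFiles(modelId) {
  const response = await fetch(`https://huggingface.co/api/models/${modelId}`);
  if (!response.ok) {
    throw new Error(`Hub request failed with status ${response.status}`);
  }
  const info = await response.json();
  return (info.siblings ?? []).map((entry) => entry.rfilename);
}

// Offline demonstration of the check itself.
console.log(hasOnnxFolder(["config.json", "onnx/model.onnx"])); // → true
console.log(hasOnnxFolder(["config.json", "pytorch_model.bin"])); // → false
```

To run the check against a live repository, chain the two helpers, for example `listRepoFiles(modelId).then(hasOnnxFolder)`, where `modelId` is the repository's `owner/name` string. This needs network access and a runtime with a global `fetch` (Node 18+ or a browser).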