Embeddings

BGE Large EN v1.5

BAAI (Beijing Academy of AI) open-weight English embedding model with 335M parameters. Returns 1024-dim vectors and was a top MTEB English retrieval model on release. The v1.5 update improved the similarity distribution, so it works well without a query instruction prefix for symmetric tasks. A widely used open alternative to hosted embeddings.

BGE-M3 (Multilingual)

BAAI multilingual embedding model covering 100+ languages with an 8192-token context. M3 stands for its multi-functionality (dense, sparse and ColBERT-style multi-vector retrieval), multilinguality and multi-granularity over long documents. Returns 1024-dim dense vectors and is a strong open choice for cross-lingual and long-text retrieval.
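The three retrieval modes named above can be illustrated with a small scoring sketch. This is not BGE-M3's own code: the function names, the toy inputs and the mixing weights are all assumptions chosen for illustration, and real vectors and token weights would come from the model.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens present in both maps."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """ColBERT-style late interaction: each query token vector is matched to
    its most similar document token vector, and the maxima are averaged."""
    if not q_vecs or not d_vecs:
        return 0.0
    return sum(max(dense_score(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)


def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three scores. The weights are illustrative only."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi
```

In practice the three signals are often used separately (dense for recall, sparse for exact terms, multi-vector for reranking); combining them in one weighted sum is just one possible design.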

ESM-2 650M (Protein Embeddings)

Meta AI 650M-parameter protein language model trained on UniRef50 sequences. Feed it an amino-acid sequence and the per-residue hidden states act as learned protein embeddings, used for structure prediction, variant-effect and function tasks. This 33-layer checkpoint is the common balance of quality and cost in the ESM-2 family.

€2.00
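Per-residue hidden states are often collapsed into one vector per protein before downstream use. The sketch below shows mean pooling over residues in plain Python. The helper name is hypothetical, the toy 2-dim rows stand in for real hidden states, and dropping the first and last rows assumes the tokenizer added start and end tokens.

```python
def mean_pool_residues(hidden_states, has_special_tokens=True):
    """Average per-residue vectors into a single per-sequence vector.

    hidden_states: list of per-token vectors (lists of floats), one row per token.
    has_special_tokens: if True, assume the first and last rows belong to
    start/end tokens and exclude them from the average.
    """
    rows = hidden_states[1:-1] if has_special_tokens else hidden_states
    if not rows:
        raise ValueError("no residue vectors to pool")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Toy example: two residues framed by placeholder start/end rows.
toy_states = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
protein_vector = mean_pool_residues(toy_states)
```

Tasks that need position-level signal, such as variant-effect scoring, would keep the per-residue rows instead of pooling them.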

Nomic Embed Text v1.5

Nomic AI open embedding model with a fully reproducible training pipeline (open weights, data and code). Supports an 8192-token context and Matryoshka representation learning, so you can truncate the 768-dim output down to 64 dims with graceful quality loss. Uses task prefixes like search_query and search_document.
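Matryoshka truncation amounts to keeping the leading dimensions and re-normalizing. A minimal sketch, assuming the embedding is already a plain list of floats; the helper name is hypothetical, and the model's own documentation may call for extra normalization steps that are omitted here.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # A zero vector cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm else head


# Toy 3-dim vector cut to 2 dims; a real model output would be 768-dim.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Query and document vectors need to be truncated to the same size before they are compared.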

OpenAI text-embedding-3-large

Embedding · OpenAI

OpenAI's highest-quality embedding model. Returns 3072-dim vectors by default and supports reducing dimensions via the dimensions parameter. Outperforms text-embedding-3-small and the older ada-002 on MTEB and multilingual MIRACL retrieval benchmarks, for cases where accuracy matters more than cost.

Free · 600ms

openai · embedding · retrieval

OpenAI text-embedding-3-small

Embedding · OpenAI

OpenAI's small, low-cost embedding model. Returns 1536-dim vectors by default and supports shortening output dimensions via the dimensions parameter without retraining. Replaced text-embedding-ada-002 with better retrieval quality at a fraction of the price, and is the default choice for general-purpose semantic search and RAG.

Free · 500ms

openai · embedding · retrieval
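For the two text-embedding-3 models, the dimensions parameter travels in the JSON body of the embeddings request. The sketch below only builds that body and sends nothing. The helper name and the example value of 256 are assumptions, and authentication and the HTTP call are left out.

```python
import json


def build_embedding_request(texts, model="text-embedding-3-small", dimensions=None):
    """Return the JSON body for an embeddings request as a string.

    `dimensions` is only included when set, so the model's default output
    size is used otherwise.
    """
    body = {"model": model, "input": list(texts)}
    if dimensions is not None:
        body["dimensions"] = dimensions
    return json.dumps(body)


# Ask for 256-dim vectors instead of the default size (256 is an arbitrary example).
payload = build_embedding_request(["semantic search example"], dimensions=256)
```

Smaller vectors cut storage and search cost at some loss of retrieval quality, so the right size is best chosen by measuring on your own data.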

PubMedBERT Embeddings (NeuML)

Sentence-transformers model fine-tuned from Microsoft PubMedBERT on PubMed title-abstract pairs by the NeuML team. Produces 768-dim sentence embeddings tuned for biomedical semantic search and similarity, and is the embedding backbone behind the paperai and txtai medical search tools.

SPECTER (Scientific Paper Embeddings)

AllenAI document-level embedding model for scientific papers. Built on SciBERT and trained on the citation graph so that papers citing each other land close together. Feed it a title plus abstract and it returns one 768-dim vector per paper, useful for recommendation, clustering and citation-based retrieval.
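A sketch of the two ends of that workflow: building the title-plus-abstract input string, and ranking papers by cosine similarity once their vectors exist. The helper names are hypothetical, the "[SEP]" separator is an assumption about the tokenizer, and the 2-dim vectors are toy stand-ins for real 768-dim outputs.

```python
import math


def specter_input(title, abstract="", sep_token="[SEP]"):
    """Join title and abstract into one string, collapsing stray whitespace."""
    title = " ".join(title.split())
    abstract = " ".join((abstract or "").split())
    return f"{title}{sep_token}{abstract}"


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(paper_id, vectors, top_k=3):
    """Rank the other papers by cosine similarity to `paper_id`."""
    query = vectors[paper_id]
    scored = [(cosine(query, vec), pid) for pid, vec in vectors.items() if pid != paper_id]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]
```

The same ranking function serves recommendation and citation-style retrieval; clustering would run on the raw vectors instead.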

Voyage AI voyage-3

Voyage's general-purpose embedding model. 1024 dims, 32k context, strong retrieval performance.

voyage · embedding · retrieval

BioBERT v1.2 (Biomedical Embeddings)

DMIS-Lab (Korea University) BERT-base initialized from English BERT and further pretrained on PubMed abstracts. Used as a feature extractor it yields 768-dim contextual embeddings tuned for biomedical text mining tasks such as NER, relation extraction and biomedical question answering.

BiomedBERT (PubMedBERT abstract)

Microsoft BiomedBERT (formerly PubMedBERT) pretrained from scratch on PubMed abstracts with a domain-specific vocabulary, rather than adapting a general model. As a feature extractor it gives 768-dim biomedical embeddings and set the original state of the art on the BLURB biomedical NLP benchmark.

Cohere embed-multilingual-v3

Cohere's multilingual embedding model. Supports 100+ languages with separate search and classification modes.

cohere · embedding · multilingual

GTE Large EN v1.5

Alibaba (Tongyi Lab) general text embedding model. The v1.5 release extends the context to 8192 tokens and returns 1024-dim vectors, scoring competitively on MTEB while handling much longer inputs than typical 512-token encoders. A practical open model when documents exceed the usual short-context limit.

Jina Embeddings v3 (Multilingual)

Jina's frontier multilingual embedding model. 570M params, 8192 ctx, 89 languages, Matryoshka dims 128-1024.

jina · embedding · multilingual

Multilingual E5 Large

Microsoft E5 multilingual embedding model with 560M parameters, initialized from XLM-RoBERTa-large and trained with weakly supervised contrastive learning. Covers around 100 languages and returns 1024-dim vectors. It expects query: and passage: prefixes on inputs and is a popular open model for multilingual semantic search.
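The prefix convention is easy to get wrong when queries and passages flow through the same code path. A minimal sketch of a helper that applies it before encoding; the function and constant names are hypothetical, and the encoding step itself is not shown.

```python
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(text, role):
    """Prepend the prefix for the given role ('query' or 'passage')."""
    if role not in E5_PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return E5_PREFIXES[role] + text.strip()


def prepare_inputs(queries, passages):
    """Return prefixed queries and passages, ready to hand to the encoder."""
    return (
        [add_e5_prefix(q, "query") for q in queries],
        [add_e5_prefix(p, "passage") for p in passages],
    )


queries, passages = prepare_inputs(
    ["how do vaccines work"],
    ["Vaccines train the immune system to recognize a pathogen."],
)
```

Rejecting unknown roles keeps a missing or misspelled prefix from silently degrading retrieval quality.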

mxbai-embed-large-v1

Mixedbread's open-source 335M-parameter embedding model. Top-ranked on the MTEB benchmark for English retrieval at release.

mixedbread · embedding · open-weights

SciBERT (scivocab uncased)

AllenAI BERT-base pretrained from scratch on 1.14M scientific papers (mostly biomedical and computer science) with its own scientific WordPiece vocabulary. Used as a feature extractor it gives 768-dim contextual embeddings tuned to scientific text, outperforming general BERT on tasks like NER and relation extraction in research corpora.

Voyage AI voyage-code-3

Voyage's code-specialized embedding model. Up to 32k context, Matryoshka 256-2048 dims, int8/binary support.

voyage · embedding · code
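The int8 and binary options trade precision for storage. The sketch below shows the general idea in plain Python and is not Voyage's own scheme: the per-vector scale, the sign threshold at zero and the function names are all assumptions.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization to integers in [-127, 127].

    Uses the vector's own largest absolute value as the scale (an assumption;
    real systems often use a scale calibrated over many vectors).
    """
    scale = max(abs(x) for x in vec) or 1.0
    return [round(x / scale * 127) for x in vec], scale


def quantize_binary(vec):
    """Keep one bit per dimension (1 if positive), packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")


code_a = quantize_binary([0.3, -0.1, 0.7, -0.9])
code_b = quantize_binary([-0.2, 0.4, 0.6, -0.5])
distance = hamming_distance(code_a, code_b)
```

A common pattern is to search with the compact codes first and rescore a short candidate list with full-precision vectors.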

Top embeddings picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall

BGE Large EN v1.5

Cheapest

OpenAI text-embedding-3-small

Highest dimensions

Voyage AI voyage-3

Fastest

OpenAI text-embedding-3-small