>_TheQuery

DALL-E

Models

OpenAI's family of text-to-image models that generate images from natural language descriptions. DALL-E 3 improved prompt adherence and text rendering, but it is now a previous-generation model in OpenAI's API docs.

Like asking a painter who has studied millions of photographs and their captions: you describe the scene in words and the painter renders it. DALL-E 3 was the version that read the whole description carefully; GPT Image 2 is the newer studio built for more controlled production work.

DALL-E is OpenAI's series of text-to-image generation models, whose architecture and training approach changed from one generation to the next. The name is a portmanteau of Salvador Dalí, the surrealist painter, and WALL-E, the Pixar robot.

DALL-E (2021): The original model (arXiv:2102.12092) used a 12-billion-parameter transformer trained autoregressively on text and image tokens as a single stream. Images were first compressed into a grid of discrete tokens by a discrete VAE, so that, given a text prompt, the model could predict the image tokens that followed, effectively treating image generation as a sequence prediction problem. It demonstrated zero-shot text-to-image generation, producing plausible images for descriptions it had never explicitly seen during training, but output resolution was limited to 256x256 and prompt adherence was inconsistent.
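The single-stream idea can be sketched in a few lines of Python. This is a toy illustration, not OpenAI's implementation: the vocabulary sizes, the 4x4 token grid, and the `toy_next_token_weights` function are all hypothetical stand-ins, and a real system would use a trained transformer and a learned image tokenizer.

```python
import random

# Toy sizes chosen for illustration only; they are not the real model's values.
TEXT_VOCAB = 16     # number of distinct text token ids
IMAGE_VOCAB = 32    # number of distinct image token ids
IMAGE_TOKENS = 16   # a 4x4 grid of image tokens


def toy_next_token_weights(stream):
    """Hypothetical stand-in for the transformer.

    Returns one unnormalized weight per image token id, derived
    deterministically from the whole stream seen so far.
    """
    seed = sum((i + 1) * tok for i, tok in enumerate(stream))
    rng = random.Random(seed)
    return [rng.random() + 1e-9 for _ in range(IMAGE_VOCAB)]


def generate_image_tokens(text_tokens, seed=0):
    """Sample image tokens one at a time, conditioned on text + earlier tokens."""
    sampler = random.Random(seed)
    stream = list(text_tokens)          # the prompt starts the stream
    image_tokens = []
    for _ in range(IMAGE_TOKENS):
        weights = toy_next_token_weights(stream)
        token = sampler.choices(range(IMAGE_VOCAB), weights=weights, k=1)[0]
        # Offset image ids so they do not collide with text ids in the stream.
        stream.append(TEXT_VOCAB + token)
        image_tokens.append(token)
    return image_tokens


if __name__ == "__main__":
    prompt = [3, 7, 1]  # pretend these ids encode a short caption
    print(generate_image_tokens(prompt))
```

In a real pipeline the sampled token grid would then be passed to an image decoder to produce pixels; that step is omitted here.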

DALL-E 2 (2022): A complete architectural shift. DALL-E 2 (arXiv:2204.06125) replaced the autoregressive transformer with a two-stage diffusion pipeline. First, a prior network maps a text caption to a CLIP image embedding. Then a diffusion decoder generates an image conditioned on that embedding, with diffusion upsamplers raising the final output to 1024x1024. Because CLIP embeds text and images in a shared latent space, DALL-E 2 could perform zero-shot image editing, style transfer, and semantic interpolation between images, capabilities the original lacked.
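A minimal sketch of the two-stage flow and of interpolating between two embeddings is shown below. The `generate` function and its `text_encoder`, `prior`, and `decoder` arguments are hypothetical placeholders for trained networks, and spherical interpolation (slerp) is used here as one common way to blend unit-length embeddings; treat the whole block as an illustration under those assumptions rather than the model's actual code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(a, b, t):
    """Spherical interpolation between two embeddings, t in [0, 1].

    Exactly opposite vectors are not handled; the path is ambiguous there.
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    if theta < 1e-6:
        return a  # the embeddings are effectively identical
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s
    wb = math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def generate(caption, text_encoder, prior, decoder):
    """Two-stage flow: caption -> image embedding (prior) -> image (decoder).

    All three callables are hypothetical stand-ins for trained models.
    """
    text_embedding = text_encoder(caption)
    image_embedding = prior(text_embedding)
    return decoder(image_embedding)


if __name__ == "__main__":
    # Halfway between two orthogonal embeddings.
    print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Decoding a sequence of interpolated embeddings, for example `slerp(e1, e2, t)` for several values of `t`, is what would yield a gradual visual blend between two images.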

DALL-E 3 (2023): Rather than changing the diffusion architecture significantly, OpenAI focused on the training data. Images in the training set were re-captioned with a purpose-built image captioning model to produce highly detailed, accurate descriptions of visual content. This addressed the core limitation of DALL-E 2, which would ignore or misinterpret parts of complex prompts. GPT-4's role comes at inference time, where it expands short user prompts into the detailed style the model was trained on. DALL-E 3 follows long, detailed prompts reliably enough to be integrated directly into ChatGPT, where users interact in natural language rather than crafting prompt strings. It supports square (1024x1024), landscape (1792x1024), and portrait (1024x1792) outputs and remains historically important, although OpenAI now lists DALL-E 3 as a previous-generation, deprecated model in its API docs.
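For readers maintaining older integrations, the sketch below shows how the orientation options map onto request parameters. The helper `build_dalle3_request` is a hypothetical name introduced here for illustration. It only builds the arguments and makes no network call, and because the model is deprecated, a live request may no longer be accepted.

```python
# Output sizes accepted for DALL-E 3 requests.
DALLE3_SIZES = {
    "square": "1024x1024",
    "landscape": "1792x1024",
    "portrait": "1024x1792",
}


def build_dalle3_request(prompt, orientation="square"):
    """Return keyword arguments for an image generation request.

    Hypothetical helper: validates the orientation and fills in the
    matching size string. DALL-E 3 generates one image per request.
    """
    if orientation not in DALLE3_SIZES:
        raise ValueError(f"unknown orientation: {orientation!r}")
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": DALLE3_SIZES[orientation],
        "n": 1,
    }


if __name__ == "__main__":
    print(build_dalle3_request("a lighthouse at dusk", "landscape"))
    # With the official Python SDK installed and an API key configured,
    # the arguments could be passed along like this (not executed here):
    #   from openai import OpenAI
    #   client = OpenAI()
    #   result = client.images.generate(**build_dalle3_request("..."))
```

Keeping request construction separate from the network call makes it straightforward to swap in a newer model name later without touching validation logic.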

DALL-E models are distinct from open-source image generation approaches like Stable Diffusion: OpenAI does not release model weights, applies content filtering at inference time, and prohibits generating real people's likenesses without consent. The trade-off is a more constrained but more production-ready system. For new OpenAI image generation and editing workflows, GPT Image 2 is now the more current model family.

Last updated: May 15, 2026