What can you do with VideoSketcher?

Sequential Sketching

Given a text prompt, VideoSketcher generates a drawing process that follows a meaningful order, with structure emerging progressively on a blank canvas. It generalizes to a wide range of concepts (more results are available in the gallery below).




Brush Style Control

By placing a small brush exemplar on the canvas, users can control the brush type and color of the generated strokes. The model picks up on this visual cue and applies the style consistently throughout the drawing process.
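As a rough illustration of how such an exemplar might be supplied, the snippet below pastes a small brush patch onto an otherwise blank first frame before generation. The `generate_sketch_video` call, the exemplar file, and the patch placement are illustrative assumptions, not the project's actual interface.

```python
# Minimal sketch of brush-style conditioning via a canvas exemplar.
# `generate_sketch_video` is a hypothetical stand-in for the VideoSketcher call;
# only the canvas preparation below is meant literally.
from PIL import Image

def place_brush_exemplar(canvas_size=(512, 512),
                         exemplar_path="brush_exemplar.png",
                         corner=(16, 16)):
    """Paste a small brush/color exemplar onto an otherwise blank canvas."""
    canvas = Image.new("RGB", canvas_size, "white")      # blank drawing surface
    exemplar = Image.open(exemplar_path).convert("RGB")  # e.g. a short red marker stroke
    canvas.paste(exemplar, corner)                       # visual cue in one corner
    return canvas

# first_frame = place_brush_exemplar()
# video = generate_sketch_video(prompt="a sailboat", first_frame=first_frame)
```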




Human-Model Co-Drawing

Using an autoregressive variant of the model, users and the model can take turns adding strokes to a shared canvas. Each builds on the other's contributions, enabling real-time collaborative sketching. The video is shown at real-time speed:
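A schematic of the turn-taking loop is sketched below; `ar_model.predict_next_frames` and `get_user_strokes` are hypothetical stand-ins for the autoregressive model and the drawing UI, not the actual implementation.

```python
# Illustrative turn-taking loop for human-model co-drawing.
# Both the model interface and the UI helper are hypothetical placeholders.
def co_draw(ar_model, prompt, canvas, num_turns=6, frames_per_turn=8):
    history = [canvas]                      # shared canvas as a frame sequence
    for turn in range(num_turns):
        if turn % 2 == 0:
            # Model turn: autoregressively predict the next few frames.
            new_frames = ar_model.predict_next_frames(
                prompt=prompt, context=history, n=frames_per_turn)
        else:
            # Human turn: the user draws directly onto the latest frame.
            new_frames = [get_user_strokes(history[-1])]
        history.extend(new_frames)
    return history                          # the full co-drawn sketching process
```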

Why does drawing order matter?

The way a sketch unfolds tells a story — what the creator chose to lay down first, how structure builds into detail. This process is central to how sketches are used for problem solving, communication, and creative exploration.
Computationally, modeling the drawing process (and not just the final result) unlocks capabilities that static sketch generation cannot support, such as collaborative co-drawing, real-time visual feedback, and the ability for users to intervene and steer a sketch as it takes shape.

Why is this hard to model computationally?

The goal isn't just to reveal strokes gradually; it's to generate them in a meaningful order, where structure builds through semantically coherent progressions. This is difficult to learn computationally: sequential drawing data is scarce, and the task requires both semantic understanding (knowing what to draw and when) and strong visual generation (knowing how to draw it well). Existing approaches tend to excel at one but not the other.

What is VideoSketcher? ✏️

VideoSketcher bridges this gap by combining the strengths of LLMs and video diffusion models. An LLM handles the semantic planning — decomposing a concept into parts and deciding the drawing order. A video diffusion model handles the visual rendering — producing rich, temporally coherent sketches as short videos in pixel space.
Despite the apparent gap between photorealistic video and abstract sketches, we show that video models can be adapted to sketching behavior using only a handful of examples.

How does it work?

Sketches as videos. We represent each sketching process as a short video — black strokes progressively appearing on a blank canvas. Training data is constructed from SVGs drawn by an artist, where each stroke is animated along its path, preserving both the global drawing order and the continuous formation of individual strokes.
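The snippet below is a simplified sketch of this data construction, assuming `svgpathtools` for path sampling; viewBox scaling, stroke width, and frame timing are arbitrary choices rather than the actual pipeline.

```python
# Simplified construction of a progressive-drawing clip from an artist SVG.
# Assumes svgpathtools for path sampling; coordinate scaling is omitted and the
# SVG coordinates are assumed to fit the canvas.
from PIL import Image, ImageDraw
from svgpathtools import svg2paths

def svg_to_frames(svg_file, size=(512, 512), points_per_stroke=32):
    paths, _ = svg2paths(svg_file)               # strokes in the artist's order
    canvas = Image.new("RGB", size, "white")
    frames = []
    for path in paths:                           # global drawing order preserved
        pts = [path.point(i / (points_per_stroke - 1))
               for i in range(points_per_stroke)]
        draw = ImageDraw.Draw(canvas)
        for k in range(1, len(pts)):             # continuous stroke formation
            draw.line([(pts[k - 1].real, pts[k - 1].imag),
                       (pts[k].real, pts[k].imag)],
                      fill="black", width=3)
            frames.append(canvas.copy())         # one frame per partial stroke
    return frames                                # stack into a training video
```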



LLM-guided drawing plans. At inference time, an LLM takes a high-level text prompt and produces a structured, step-by-step drawing plan — decomposing the subject into semantic parts and specifying the order in which they should be drawn. This plan is then passed to the video model as a text condition.
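A minimal version of this planning step might look like the following; the OpenAI-style client, model name, and prompt template are illustrative assumptions, not the project's actual setup.

```python
# One way to obtain a structured drawing plan from an LLM. The client, model,
# and instruction wording here are assumptions for illustration only.
from openai import OpenAI

PLAN_INSTRUCTION = (
    "Decompose the subject into 4-8 semantic parts and list them in a natural "
    "drawing order, coarse structure first, details last. One part per line."
)

def make_drawing_plan(subject: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": PLAN_INSTRUCTION},
                  {"role": "user", "content": subject}],
    )
    plan = response.choices[0].message.content
    # The plan text becomes the conditioning prompt for the video model.
    return f"A sketch of {subject}, drawn step by step:\n{plan}"
```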

Two-stage fine-tuning. The core challenge is teaching a video model both what sketches look like and how they should unfold over time. We decouple these two objectives:

  • Stage 1 — Learning drawing grammar: We first train on synthetic compositions of simple geometric primitives (circles, rectangles, curves) arranged in spatial relationships like containment, overlap, and adjacency. Each composition is rendered with multiple drawing orders, teaching the model to follow text-specified stroke sequences (a toy data-generation sketch follows this list).
  • Stage 2 — Learning sketch appearance: We then fine-tune on just seven real sketches drawn by an artist. Because the model already understands ordering from Stage 1, this stage primarily transfers visual style, and that's enough for the model to generalize to a wide range of concepts.
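A toy version of the Stage 1 data generator described in the first bullet could look like this; the primitive set, canvas size, and caption wording are all illustrative assumptions rather than the actual training setup.

```python
# Toy Stage 1 data generator: random compositions of simple primitives,
# rendered under several drawing orders, each paired with a caption that
# spells out the order. Everything here is illustrative.
import random
from itertools import permutations
from PIL import Image, ImageDraw

PRIMITIVES = ["circle", "rectangle", "curve"]

def random_box():
    x, y = random.randint(10, 120), random.randint(10, 120)
    return (x, y, x + random.randint(40, 100), y + random.randint(40, 100))

def draw_primitive(draw, kind, box):
    if kind == "circle":
        draw.ellipse(box, outline="black", width=3)
    elif kind == "rectangle":
        draw.rectangle(box, outline="black", width=3)
    else:                                   # "curve": a simple open arc
        draw.arc(box, start=20, end=200, fill="black", width=3)

def make_training_samples(n_shapes=3, size=(256, 256)):
    shapes = [(random.choice(PRIMITIVES), random_box()) for _ in range(n_shapes)]
    samples = []
    for order in permutations(range(n_shapes)):   # same scene, multiple orders
        canvas = Image.new("RGB", size, "white")
        frames, parts = [], []
        for idx in order:
            kind, box = shapes[idx]
            draw_primitive(ImageDraw.Draw(canvas), kind, box)
            frames.append(canvas.copy())          # shape appears on this frame
            parts.append(kind)
        samples.append((frames, "Draw a " + ", then a ".join(parts) + "."))
    return samples
```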

Bootstrapping autoregressive generation. The diffusion-based model generates entire sequences jointly, which limits interactivity. To enable co-drawing, we use the trained diffusion model to generate a larger synthetic dataset, which is then used to fine-tune an autoregressive video model that predicts frames sequentially — enabling real-time, turn-based interaction.
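The bootstrapping step could be summarized as follows; all interfaces here (the `diffusion_model.sample` call, the planner callable, and the fine-tuning call) are hypothetical placeholders rather than the project's actual APIs.

```python
# Schematic bootstrapping loop: the trained diffusion sketcher renders drawing
# videos for a large prompt set, and the resulting corpus is used to fine-tune
# an autoregressive video model. Every interface below is a placeholder.
def build_bootstrap_corpus(diffusion_model, llm_planner, prompts):
    corpus = []
    for prompt in prompts:
        plan = llm_planner(prompt)                  # step-by-step drawing plan
        frames = diffusion_model.sample(text=plan)  # full sketching video
        corpus.append({"plan": plan, "frames": frames})
    return corpus                                   # synthetic data for AR training

# corpus = build_bootstrap_corpus(diffusion_model, make_drawing_plan, prompts)
# ar_model.finetune(corpus)   # frame-by-frame prediction enables turn-taking
```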

Autoregressive Generation


We explore adapting our framework to autoregressive sketch generation, enabling interactive drawing scenarios that are difficult to support with diffusion-based models. The autoregressive model produces visually coherent sketches with a clear stroke-by-stroke progression, albeit with slightly reduced visual fidelity compared to the diffusion-based approach.

Comparisons to Prior Work


Comparison of sketch generation progress across methods. Wan2.1 produces near-static outputs with limited temporal progression. PaintsUndo reveals detailed structures early due to its undo-based formulation, but generates painting-like results rather than vector sketches. SketchAgent better follows human drawing order but often yields overly simplistic and less recognizable outputs. Our method closely matches human sketching progression while achieving higher final quality, producing semantically structured and detailed sketches.

Ablation


We find that full two-stage training is necessary for both reliable ordering control and the desired sketch appearance. Training on synthetic shapes alone improves ordering consistency but yields primitive-looking strokes with weaker recognizability. Training on real sketches alone improves visual style but often violates the specified order. Combining both stages transfers ordering fidelity into the sketch domain and delivers the best overall results.

Limitations


Multiple strokes per frame

Operating in pixel space provides less explicit structural control than parametric stroke representations, which can occasionally lead to violations of sketching constraints, such as multiple strokes appearing within a single frame.

Prompt adherence

Prompt adherence is not guaranteed. When the model has a strong visual prior, it may deviate from the instructions. For example, in the "tiger roaring" prompt, the model changes the action late in the video and introduces color.

Limited knowledge

Performance also depends on the underlying video model’s concept knowledge, which is more limited than that of LLMs for specialized domains such as mathematics.

AR quality gap

Finally, while we demonstrate autoregressive sketch generation, the resulting outputs do not yet match the visual quality of the diffusion-based model, reflecting the present maturity of autoregressive video models.