What can you do with VideoSketcher?

Sequential Sketching

Given a text prompt, VideoSketcher generates a drawing process that follows a meaningful order, with structure emerging progressively on a blank canvas. It generalizes to a wide range of concepts (more results are available in the gallery below).




Brush Style Control

By placing a small brush exemplar on the canvas, users can control the brush type and color of the generated strokes. The model picks up on this visual cue and applies the style consistently throughout the drawing process.
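As a rough illustration of how such an exemplar might be supplied, the snippet below pastes a small brush patch onto an otherwise blank first frame before generation. The `generate_sketch_video` call, the exemplar file, and the patch placement are illustrative assumptions, not the project's actual interface.

```python
# Minimal sketch of brush-style conditioning via a canvas exemplar.
# `generate_sketch_video` is a hypothetical stand-in for the VideoSketcher call;
# only the canvas preparation below is meant literally.
from PIL import Image

def place_brush_exemplar(canvas_size=(512, 512),
                         exemplar_path="brush_exemplar.png",
                         corner=(16, 16)):
    """Paste a small brush/color exemplar onto an otherwise blank canvas."""
    canvas = Image.new("RGB", canvas_size, "white")      # blank drawing surface
    exemplar = Image.open(exemplar_path).convert("RGB")  # e.g. a short red marker stroke
    canvas.paste(exemplar, corner)                       # visual cue in one corner
    return canvas

# first_frame = place_brush_exemplar()
# video = generate_sketch_video(prompt="a sailboat", first_frame=first_frame)
```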




Human-Model Co-Drawing

Using an autoregressive variant of the model, users and the model can take turns adding strokes to a shared canvas. Each builds on the other's contributions, enabling real-time collaborative sketching. The video is shown at real-time speed:
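A schematic of the turn-taking loop is sketched below; `ar_model.predict_next_frames` and `get_user_strokes` are hypothetical stand-ins for the autoregressive model and the drawing UI, not the actual implementation.

```python
# Illustrative turn-taking loop for human-model co-drawing.
# Both the model interface and the UI helper are hypothetical placeholders.
def co_draw(ar_model, prompt, canvas, num_turns=6, frames_per_turn=8):
    history = [canvas]                      # shared canvas as a frame sequence
    for turn in range(num_turns):
        if turn % 2 == 0:
            # Model turn: autoregressively predict the next few frames.
            new_frames = ar_model.predict_next_frames(
                prompt=prompt, context=history, n=frames_per_turn)
        else:
            # Human turn: the user draws directly onto the latest frame.
            new_frames = [get_user_strokes(history[-1])]
        history.extend(new_frames)
    return history                          # the full co-drawn sketching process
```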

Why does drawing order matter?

The way a sketch unfolds tells a story — what the creator chose to lay down first, how structure builds into detail. This process is central to how sketches are used for problem solving, communication, and creative exploration.
Computationally, modeling the drawing process (and not just the final result) unlocks capabilities that static sketch generation cannot support, such as collaborative co-drawing, real-time visual feedback, and the ability for users to intervene and steer a sketch as it takes shape.

Why is this hard to model computationally?

The goal isn't just to reveal strokes gradually; it's to generate them in a meaningful order, where structure builds through semantically coherent progressions. This is difficult to learn computationally: sequential drawing data is scarce, and the task requires both semantic understanding (knowing what to draw and when) and strong visual generation (knowing how to draw it well). Existing approaches tend to excel at one but not the other.

What is VideoSketcher? ✏️

VideoSketcher bridges this gap by combining the strengths of LLMs and video diffusion models. An LLM handles the semantic planning — decomposing a concept into parts and deciding the drawing order. A video diffusion model handles the visual rendering — producing rich, temporally coherent sketches as short videos in pixel space.
Despite the apparent gap between photorealistic video and abstract sketches, we show that video models can be adapted to sketching behavior using only a handful of examples.

How does it work?

Sketches as videos. We represent each sketching process as a short video — black strokes progressively appearing on a blank canvas. Training data is constructed from SVGs drawn by an artist, where each stroke is animated along its path, preserving both the global drawing order and the continuous formation of individual strokes.
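The snippet below is a simplified sketch of this data construction, assuming `svgpathtools` for path sampling; viewBox scaling, stroke width, and frame timing are arbitrary choices rather than the actual pipeline.

```python
# Simplified construction of a progressive-drawing clip from an artist SVG.
# Assumes svgpathtools for path sampling; coordinate scaling is omitted and the
# SVG coordinates are assumed to fit the canvas.
from PIL import Image, ImageDraw
from svgpathtools import svg2paths

def svg_to_frames(svg_file, size=(512, 512), points_per_stroke=32):
    paths, _ = svg2paths(svg_file)               # strokes in the artist's order
    canvas = Image.new("RGB", size, "white")
    frames = []
    for path in paths:                           # global drawing order preserved
        pts = [path.point(i / (points_per_stroke - 1))
               for i in range(points_per_stroke)]
        draw = ImageDraw.Draw(canvas)
        for k in range(1, len(pts)):             # continuous stroke formation
            draw.line([(pts[k - 1].real, pts[k - 1].imag),
                       (pts[k].real, pts[k].imag)],
                      fill="black", width=3)
            frames.append(canvas.copy())         # one frame per partial stroke
    return frames                                # stack into a training video
```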



LLM-guided drawing plans. At inference time, an LLM takes a high-level text prompt and produces a structured, step-by-step drawing plan — decomposing the subject into semantic parts and specifying the order in which they should be drawn. This plan is then passed to the video model as a text condition.
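A minimal version of this planning step might look like the following; the OpenAI-style client, model name, and prompt template are illustrative assumptions, not the project's actual setup.

```python
# One way to obtain a structured drawing plan from an LLM. The client, model,
# and instruction wording here are assumptions for illustration only.
from openai import OpenAI

PLAN_INSTRUCTION = (
    "Decompose the subject into 4-8 semantic parts and list them in a natural "
    "drawing order, coarse structure first, details last. One part per line."
)

def make_drawing_plan(subject: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": PLAN_INSTRUCTION},
                  {"role": "user", "content": subject}],
    )
    plan = response.choices[0].message.content
    # The plan text becomes the conditioning prompt for the video model.
    return f"A sketch of {subject}, drawn step by step:\n{plan}"
```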

Two-stage fine-tuning. The core challenge is teaching a video model both what sketches look like and how they should unfold over time. We decouple these two objectives:

  • Stage 1 — Learning drawing grammar: We first train on synthetic compositions of simple geometric primitives (circles, rectangles, curves) arranged in spatial relationships like containment, overlap, and adjacency. Each composition is rendered with multiple drawing orders, teaching the model to follow text-specified stroke sequences (a toy data-generation sketch follows this list).
  • Stage 2 — Learning sketch appearance: We then fine-tune on just seven real sketches drawn by an artist. Because the model already understands ordering from Stage 1, this stage primarily transfers visual style, and that's enough for the model to generalize to a wide range of concepts.
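A toy version of the Stage 1 data generator described in the first bullet could look like this; the primitive set, canvas size, and caption wording are all illustrative assumptions rather than the actual training setup.

```python
# Toy Stage 1 data generator: random compositions of simple primitives,
# rendered under several drawing orders, each paired with a caption that
# spells out the order. Everything here is illustrative.
import random
from itertools import permutations
from PIL import Image, ImageDraw

PRIMITIVES = ["circle", "rectangle", "curve"]

def random_box():
    x, y = random.randint(10, 120), random.randint(10, 120)
    return (x, y, x + random.randint(40, 100), y + random.randint(40, 100))

def draw_primitive(draw, kind, box):
    if kind == "circle":
        draw.ellipse(box, outline="black", width=3)
    elif kind == "rectangle":
        draw.rectangle(box, outline="black", width=3)
    else:                                   # "curve": a simple open arc
        draw.arc(box, start=20, end=200, fill="black", width=3)

def make_training_samples(n_shapes=3, size=(256, 256)):
    shapes = [(random.choice(PRIMITIVES), random_box()) for _ in range(n_shapes)]
    samples = []
    for order in permutations(range(n_shapes)):   # same scene, multiple orders
        canvas = Image.new("RGB", size, "white")
        frames, parts = [], []
        for idx in order:
            kind, box = shapes[idx]
            draw_primitive(ImageDraw.Draw(canvas), kind, box)
            frames.append(canvas.copy())          # shape appears on this frame
            parts.append(kind)
        samples.append((frames, "Draw a " + ", then a ".join(parts) + "."))
    return samples
```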

Bootstrapping autoregressive generation. The diffusion-based model generates entire sequences jointly, which limits interactivity. To enable co-drawing, we use the trained diffusion model to generate a larger synthetic dataset, which is then used to fine-tune an autoregressive video model that predicts frames sequentially — enabling real-time, turn-based interaction.
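The bootstrapping step could be summarized as follows; all interfaces here (the `diffusion_model.sample` call, the planner callable, and the fine-tuning call) are hypothetical placeholders rather than the project's actual APIs.

```python
# Schematic bootstrapping loop: the trained diffusion sketcher renders drawing
# videos for a large prompt set, and the resulting corpus is used to fine-tune
# an autoregressive video model. Every interface below is a placeholder.
def build_bootstrap_corpus(diffusion_model, llm_planner, prompts):
    corpus = []
    for prompt in prompts:
        plan = llm_planner(prompt)                  # step-by-step drawing plan
        frames = diffusion_model.sample(text=plan)  # full sketching video
        corpus.append({"plan": plan, "frames": frames})
    return corpus                                   # synthetic data for AR training

# corpus = build_bootstrap_corpus(diffusion_model, make_drawing_plan, prompts)
# ar_model.finetune(corpus)   # frame-by-frame prediction enables turn-taking
```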

Autoregressive Generation


We explore adapting our framework to autoregressive sketch generation, enabling interactive drawing scenarios that are difficult to support with diffusion-based models. The autoregressive model produces visually coherent sketches with a clear stroke-by-stroke progression, albeit with slightly reduced visual fidelity compared to the diffusion-based approach.

Comparisons to Prior Work


Comparison of sketch generation progress across methods. Wan2.1 produces near-static outputs with limited temporal progression. PaintsUndo reveals detailed structures early due to its undo-based formulation, but generates painting-like results rather than vector sketches. SketchAgent better follows human drawing order but often yields overly simplistic and less recognizable outputs. Our method closely matches human sketching progression while achieving higher final quality, producing semantically structured and detailed sketches.

Ablation


We find that full two-stage training is necessary for both reliable ordering control and the desired sketch appearance. Training on synthetic shapes alone improves ordering consistency but yields primitive-looking strokes with weaker recognizability. Training on real sketches alone improves visual style but often violates the specified order. Combining both stages transfers ordering fidelity into the sketch domain and delivers the best overall results.

Limitations


Multiple strokes per frame

Operating in pixel space provides less explicit structural control than parametric stroke representations, which can occasionally lead to violations of sketching constraints, such as multiple strokes appearing within a single frame.

Prompt adherence

Prompt adherence is not guaranteed. When the model has a strong visual prior, it may deviate from the instructions. For example, in the "tiger roaring" prompt, the model changes the action late in the video and introduces color.

Limited knowledge

Performance also depends on the underlying video model’s concept knowledge, which is more limited than that of LLMs for specialized domains such as mathematics.

AR quality gap

Finally, while we demonstrate autoregressive sketch generation, the resulting outputs do not yet match the visual quality of the diffusion-based model, reflecting the present maturity of autoregressive video models.