Presentation
FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
Description
We present FairyGen, an automatic system that generates storied cartoon videos from a single child-drawn character with a highly personalized style. Unlike previous subject- and motion-customization methods, we treat the whole story as layers of character modeling, environment generation, and shot design for the continuous story. Given a single hand-drawn image, our approach begins by using a Multi-modality Large Language Model (MLLM) to create a structured storyboard that includes dynamic shots, setting up both the narrative flow and the spatial layout of the main character. To model the character, we develop a 3D proxy that allows us to produce tailored motion incorporating intricate, real-world dynamics. Then, for environment generation, we design a style propagation adapter that learns the style from the foreground character and propagates it to the background via pre-trained background inpainting diffusion models, so that the identity of the foreground is naturally preserved.
After style customization, a shot design module uses the MLLM to crop the scene image for detailed shot design and to increase the diversity of the story. Finally, for animation, given the motion sequences from the 3D proxy and the stylized prior, we fine-tune an MMDiT-based image-to-video diffusion model to learn the complex motion of the given foreground character. This is achieved by a motion customization adapter with a timestep-shift strategy that maintains long-term motion fidelity and coherence. After training, the model can be applied directly to the cropped shots to generate diverse video scenes. Extensive experiments and evaluations demonstrate that FairyGen produces animations that are stylistically faithful, narratively aligned, and rich in natural, smooth motion, highlighting its effectiveness and flexibility for personalized story animation.

Event Type
Technical Papers
Time
Wednesday, 17 December 2025, 3:44pm - 3:55pm HKT
Location
Meeting Room S423+S424, Level 4
