SSFold: Learning to Fold Arbitrary Crumpled
Cloth Using Graph Dynamics from Human Demonstration

Haichuan Xu
Jiarui Hu
Feng Luan
Zhipeng Wang
Yanchao Dong
Yanmin Zhou
Bin He

Abstract

Robotic cloth manipulation faces challenges due to complex dynamics and the high dimensionality of configuration spaces. Previous methods have largely focused on isolated smoothing or folding tasks and heavily relied on simulations, often failing to bridge the significant sim-to-real gap in deformable object manipulation.

To overcome these challenges, we propose a two-stream architecture with sequential and spatial pathways, unifying smoothing and folding tasks into a single adaptable policy model that accommodates various cloth types and states. The sequential stream determines cloth pick and place positions, while the spatial stream, leveraging a visible connectivity dynamics model, constructs a visibility connectivity graph from partial point cloud data of self-occluded cloth, thus improving the robot’s perception of the cloth’s current state. To bridge the sim-to-real gap, we utilize a hand tracking detection algorithm to gather and integrate human demonstration data into our novel end-to-end neural network, improving real-world adaptability. Our method, validated on a UR5 robot across four distinct cloth folding tasks, reliably achieves folded states from crumpled initial configurations with success rates of 99%, 99%, 83%, and 67%. It outperforms existing state-of-the-art cloth manipulation techniques and demonstrates strong generalization to unseen cloth with diverse colors, shapes, and stiffness in real-world experiments.


Approach Overview


Fig. 1. Method overview. (a) In a workspace equipped with a UR5 robotic arm and a piece of cloth in an arbitrary crumpled configuration, a top-down RGB image is captured by the camera. (b) The pick point, identified using a YOLOv10-based hand tracking algorithm, is concatenated with the captured RGB image. This combined input is then fed into the U-net network within the Sequential Stream. (c) In the Spatial Stream, the infrared and depth images captured by the camera are first used to extract a mask of the cloth region and generate the corresponding point cloud. The point cloud is voxelized to reduce complexity, followed by inferring nearby edges and mesh edges to predict the cloth’s graph data. (d) Finally, the features from both streams are fused and processed to produce an output action map, which guides the robotic arm to execute the corresponding actions using parameterized action primitives.
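The Spatial Stream step above (voxelizing the cloth point cloud and inferring nearby edges to form graph data) can be sketched as follows. This is a minimal illustration with NumPy only; the function name `build_cloth_graph` and the voxel size and edge radius values are hypothetical, and the paper's actual visible connectivity graph additionally infers mesh edges with a learned model.

```python
import numpy as np

def build_cloth_graph(points, voxel_size=0.0125, edge_radius=0.025):
    """Voxel-downsample a partial cloth point cloud and connect nearby
    points into an undirected graph (a simplified stand-in for the
    paper's visible connectivity graph)."""
    # Voxelize: keep one representative point per occupied voxel
    # to reduce the complexity of the raw point cloud.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    nodes = points[np.sort(idx)]

    # "Nearby edges": connect every pair of nodes closer than edge_radius.
    diff = nodes[:, None, :] - nodes[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    src, dst = np.where((dist > 0) & (dist < edge_radius))
    edges = np.stack([src, dst], axis=1)
    return nodes, edges
```

In practice a KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the quadratic distance matrix for larger clouds.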


Sequential demonstration of folding arbitrary crumpled cloth across three distinct tasks


Fig. 2. Each task consists of two rows: the first row presents the top-view operation sequence captured by the overhead camera, while the second row displays the side-view operation sequence captured by the side camera. Qpick represents the predicted pick heatmap, Qplace represents the predicted place heatmap, PA represents the predicted action map, and at represents the action pair.
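A minimal sketch of how an action pair at could be read off the predicted heatmaps Qpick and Qplace: take the argmax pixel of each map. The function name `action_from_heatmaps` is hypothetical, and the paper's fused action map PA may combine the two streams differently before action selection.

```python
import numpy as np

def action_from_heatmaps(q_pick, q_place):
    """Select a pick-and-place action pair from the pick heatmap (Qpick)
    and place heatmap (Qplace) by taking the highest-scoring pixel of each.
    Returns ((pick_row, pick_col), (place_row, place_col))."""
    pick = np.unravel_index(np.argmax(q_pick), q_pick.shape)
    place = np.unravel_index(np.argmax(q_place), q_place.shape)
    return pick, place
```

The returned pixel coordinates would then be mapped into the robot's workspace frame and executed with the parameterized pick-and-place action primitive.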


Real World Experiments

1. Performance on Double Inward Fold (DIF) tasks across towels of varying difficulty

easy task

medium task

hard task

2. Performance on Double Triangle Fold (DTF) tasks across towels of varying difficulty

easy task

medium task

hard task

3. Performance on Four Corners Inward Fold (FCIF) tasks across towels of varying difficulty

easy task

medium task

hard task

Generalization to Unseen Cloth

1. Performance on Double Inward Fold (DIF) tasks across unseen towels

Towel1

Towel2

Towel3

Towel4

2. Performance on Double Triangle Fold (DTF) tasks across unseen towels

Towel1

Towel2

Towel3

Towel4

3. Performance on Four Corners Inward Fold (FCIF) tasks across unseen towels

Towel1

Towel2

Towel3

Towel4


BibTeX

@article{zhou2024ssfold,
  title = {SSFold: Learning to Fold Arbitrary Crumpled Cloth Using Graph Dynamics from Human Demonstration},
  author = {Changshi Zhou and Haichuan Xu and Zhipeng Wang and Yanchao Dong and Yanmin Zhou and Bin He},
  month = {dec},
  year = {2024},
  articleno = {238},
}