### 1 Introduction

1. To the best of our knowledge, this is the first time a directed acyclic GNN is applied to motion data refinement considering both spatial and temporal information. In addition, we demonstrate that it is highly effective in representing human motion data, even for refinement.

2. The proposed model is robust because it uses neighboring joints to predict missing joints, whereas other networks can be affected by irrelevant joints with severe noise or frequently missing joints.

3. The proposed model applies not only to various types of unseen data but also to input that has not been preprocessed (e.g., for rotation). In contrast, previous models preprocessed the data for both translation and rotation. These processes require the assumption that a particular joint must be measured, which makes them difficult to generalize to many cases and time-consuming. Because the proposed model considers the joint kinematic structure, it works well with translation preprocessing alone and can be generalized to various data.

4. On the CMU mocap dataset [37], the proposed model, trained with three types of losses, exceeded state-of-the-art performance for 3D skeleton motion data refinement.

### 2 Related Works

### 2.1 Non-data-driven Methods

### 2.2 Data-driven Methods

### 3 Method

### 3.1 Notation

Let $M = [m_1, m_2, \ldots, m_N]$ be a clean motion dataset with $(N, C, T, V)$ shape, where $N$ is the number of data, $C$ is the $x$, $y$, and $z$ channel, $T$ is the time sequence, and $V$ is the number of joints. The $m_n = [p_1, p_2, \ldots, p_T]$ that constitutes $M$ is a sequence motion data with $(C, T, V)$ shape, where $p_t = [x_1, y_1, z_1, \ldots, x_j, y_j, z_j]$ with $(C, V)$ shape represents a pose data at a time in the temporal direction. We represent clean motion data as $Y$ and missing data deformed from the clean motion data as $X$.
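The shape conventions can be illustrated with a small NumPy sketch. The sizes, the random placeholder data, and the zero-filling of missing joints are assumptions made only for illustration; the text does not state how missing entries are encoded.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N, C, T, V = 4, 3, 10, 5   # sequences, xyz channels, frames, joints

rng = np.random.default_rng(0)
Y = rng.normal(size=(N, C, T, V))   # clean motion data, shape (N, C, T, V)

m_n = Y[0]            # one motion sequence, shape (C, T, V)
p_t = m_n[:, 2, :]    # one pose (frame index 2), shape (C, V)

# Corrupted input X: a copy of Y with some joints dropped.
# Zero-filling the dropped joints is an assumption of this sketch.
missing = rng.random(size=(N, 1, T, V)) < 0.2   # True where a joint is missing
X = np.where(missing, 0.0, Y)                   # mask broadcasts over x, y, z
```

Indexing the second axis of `m_n` selects a single frame, which is why a pose keeps the `(C, V)` layout.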

### 3.2 Bone Information

When two joints $v_1 = (x_1, y_1, z_1)$ and $v_2 = (x_2, y_2, z_2)$ are given, where $v_2$ is closer to the root joint than $v_1$, the bone information of $v_1$ and $v_2$ is the difference between the two joint coordinates, i.e., $e_{v_1, v_2} = (x_1 - x_2, y_1 - y_2, z_1 - z_2)$.
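As a minimal sketch of this definition, the bone vectors of a whole pose can be computed from a parent table. The `parents` encoding (with `-1` marking the root) and the toy three-joint chain are hypothetical and are not taken from the dataset.

```python
import numpy as np

def bone_vectors(joints, parents):
    """Return e_{v1,v2} = v1 - v2 for every non-root joint v1 with parent v2.

    joints:  array of shape (V, 3) holding (x, y, z) per joint.
    parents: parents[j] is the index of the joint one step closer to the
             root, or -1 for the root itself (hypothetical encoding).
    """
    joints = np.asarray(joints, dtype=float)
    child = [j for j, p in enumerate(parents) if p >= 0]
    parent = [parents[j] for j in child]
    return joints[child] - joints[parent]

# Toy 3-joint chain: joint 0 is the root, 1 hangs off 0, 2 hangs off 1.
joints = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 1.5, 0.0]]
bones = bone_vectors(joints, parents=[-1, 0, 1])
```

Each row of `bones` points from the joint nearer the root toward the joint farther from it, which matches the sign convention $v_1 - v_2$.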

### 3.3 Graph Construction

$e_2$ is heading to $v_3$, and the element at the third row and second column of the incidence matrix is −1 because $e_3$ is emitting from $v_2$ (Fig. 5(a)).

We use $A^s$ to denote the source nodes; it only contains the absolute value of the incidence matrix elements that are −1. Similarly, we use $A^t$ to denote the target nodes; it only contains the absolute value of the incidence matrix elements that are 1. We set $A^s$ and $A^t$ as learnable parameters to form an adaptive body structure.
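The construction can be sketched on a toy graph. The four-joint skeleton below is hypothetical and is not claimed to match Fig. 5(a). The row and column layout (one row per bone, one column per joint) is an assumption inferred from the row and column indices quoted above.

```python
import numpy as np

# Hypothetical toy skeleton: each bone is (source joint, target joint),
# with 1-based joint labels to match the text.
edges = [(1, 2), (1, 3), (2, 4)]   # e_1, e_2, e_3
num_joints = 4

# Assumed layout: row i describes bone e_i, column j describes joint v_j.
A = np.zeros((len(edges), num_joints), dtype=int)
for i, (src, dst) in enumerate(edges):
    A[i, src - 1] = -1   # the bone emits from its source joint
    A[i, dst - 1] = 1    # the bone heads to its target joint

A_s = (A == -1).astype(float)   # absolute values of the -1 entries
A_t = (A == 1).astype(float)    # absolute values of the +1 entries
```

In this toy graph, `A[2, 1]` is −1 because $e_3$ emits from $v_2$, and `A[1, 2]` is 1 because $e_2$ heads to $v_3$. In a trainable model, `A_s` and `A_t` would serve as initial values for the learnable parameters.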

### 3.4 DGN Block

Here, $H_v$ and $H_e$ denote the update functions of node and edge, respectively, and $[\cdot]$ denotes the stack operation of matrices.

$f_e {A^s}^{T}$ and $f_e {A^t}^{T}$ denote the aggregations of the attributes contained in incoming edges and outgoing edges, respectively. The node update function, $H_v$, updates the attributes of a node by combining a particular node with incoming and outgoing edges and gives the output $f_v'$. Similarly, in Equation (5), $f_e A^s$ and $f_e A^t$ denote the aggregations of attributes contained in the source and target nodes, respectively. The edge update function, $H_e$, updates the attributes of an edge by combining a particular edge with a source node and a target node and gives the output $f_e'$.
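The following is a rough sketch of one node and edge update, under several assumptions that the text does not state. $H_v$ and $H_e$ are stood in for by single linear maps (`W_v`, `W_e`), and the stack operation is taken to be concatenation along the channel axis. The source and target matrices are stored as bones × joints, and the edge update reads the node attributes from before the node update. With a joints × bones layout the transposes would move to the other pair of products.

```python
import numpy as np

def dgn_block(f_v, f_e, A_s, A_t, W_v, W_e):
    """One node/edge update on a directed graph (illustrative only).

    f_v: (C, V) joint attributes      f_e: (C, E) bone attributes
    A_s, A_t: (E, V) source / target matrices (assumed layout)
    W_v, W_e: (C_out, 3 * C) weights standing in for H_v and H_e
    """
    # Node update: each joint sees itself plus the bones attached to it.
    node_in = np.concatenate([f_v, f_e @ A_s, f_e @ A_t], axis=0)      # (3C, V)
    f_v_new = W_v @ node_in

    # Edge update: each bone sees itself plus its source and target joints.
    edge_in = np.concatenate([f_e, f_v @ A_s.T, f_v @ A_t.T], axis=0)  # (3C, E)
    f_e_new = W_e @ edge_in
    return f_v_new, f_e_new

# Toy graph with 4 joints and 3 bones: 0->1, 0->2, 1->3 (hypothetical).
C, V, E = 2, 4, 3
A_s = np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
A_t = np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
f_v, f_e = rng.normal(size=(C, V)), rng.normal(size=(C, E))
W_v, W_e = rng.normal(size=(C, 3 * C)), rng.normal(size=(C, 3 * C))
f_v_new, f_e_new = dgn_block(f_v, f_e, A_s, A_t, W_v, W_e)
```

Because every row of `A_s` and `A_t` has a single nonzero entry, `f_v @ A_s.T` simply copies each bone's source-joint attributes into that bone's column.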

### 4 Network Training

$\lambda_1$, $\lambda_2$, and $\lambda_3$ denote the weight coefficients of the three losses. Minimizing the position loss matters most for generating natural motion, because it reduces the Euclidean distance between the refined motion sequence and the clean motion sequence more directly than the other losses do. Hence, the proposed model is trained in three stages by changing the weight coefficients. Table 1 describes the parameters used in each stage. In the first stage, the parameters of the incidence matrix are kept fixed until the other parameters have been learned to some extent, which maintains prior knowledge of the human body's kinematic structure. The batch size is set to 16, and Adam [48], an adaptive gradient descent algorithm, is used to optimize the parameters.
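The staged weighting can be sketched as below. The weighted-sum form, the pairing of each coefficient with a particular loss, and every number in `STAGES` are assumptions. In particular, the numbers are placeholders and are not the values in Table 1. Only the frozen incidence matrix in the first stage is taken from the text.

```python
def total_loss(l_pos, l_bone, l_smooth, weights):
    """Weighted sum of the three losses (assumed form).

    weights: (lambda_1, lambda_2, lambda_3), assumed to pair with the
             position, bone length, and smooth losses in that order.
    """
    lam1, lam2, lam3 = weights
    return lam1 * l_pos + lam2 * l_bone + lam3 * l_smooth

# Placeholder schedule: these numbers are NOT the values from Table 1.
STAGES = [
    {"weights": (1.0, 0.0, 0.0), "train_incidence": False},  # matrix frozen
    {"weights": (1.0, 0.1, 0.0), "train_incidence": True},
    {"weights": (1.0, 0.1, 0.1), "train_incidence": True},
]

# Example loss values, also made up, combined under each stage's weights.
stage_losses = [total_loss(0.5, 0.2, 0.1, s["weights"]) for s in STAGES]
```

Keeping the schedule in a plain list makes it easy to swap in the actual per-stage values.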

### 4.1 Loss Function

#### 4.1.1 Position Loss

where $\lVert \cdot \rVert_2$ denotes the L2 norm.
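A position loss of this kind can be sketched as the L2 distance between predicted and clean joint positions. Averaging over sequences, frames, and joints is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def position_loss(pred, target):
    """Mean L2 distance between predicted and clean joint positions.

    pred, target: arrays of shape (N, C, T, V) with C = 3 (x, y, z).
    The mean reduction is an assumption of this sketch.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    dist = np.linalg.norm(diff, axis=1)   # L2 norm over x, y, z -> (N, T, V)
    return dist.mean()

# Every joint is displaced by (3, 4, 0), so every distance is 5.
target = np.zeros((1, 3, 2, 2))
pred = target.copy()
pred[:, 0] = 3.0
pred[:, 1] = 4.0
loss = position_loss(pred, target)
```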

#### 4.1.2 Bone Length Loss

For each bone $b$, the reference value is the bone length of the clean data, and $l_b$ denotes the predicted bone length, that is, the distance between the predicted joint coordinates at the two ends of the bone, calculated with the L2 norm.
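A bone length loss along these lines can be sketched as follows. The bone list encoding, the function names, and the mean absolute difference used to compare predicted and clean lengths are assumptions of this sketch.

```python
import numpy as np

def bone_lengths(motion, bones):
    """Length of every bone in every frame.

    motion: (N, C, T, V) joint coordinates; bones: list of (joint_a, joint_b)
    index pairs (hypothetical encoding). Returns an array of shape (N, T, B).
    """
    motion = np.asarray(motion, dtype=float)
    a = [i for i, _ in bones]
    b = [j for _, j in bones]
    # L2 norm over the x, y, z axis of the difference between the two ends.
    return np.linalg.norm(motion[..., a] - motion[..., b], axis=1)

def bone_length_loss(pred, target, bones):
    """Mean absolute difference between predicted and clean bone lengths."""
    return np.abs(bone_lengths(pred, bones) - bone_lengths(target, bones)).mean()

# One bone between joints 1 and 0: clean length 1, predicted length 3.
target = np.zeros((1, 3, 1, 2))
target[0, 0, 0, 1] = 1.0
pred = np.zeros((1, 3, 1, 2))
pred[0, 0, 0, 1] = 3.0
loss = bone_length_loss(pred, target, bones=[(1, 0)])
```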

#### 4.1.3 Smooth Loss

Earlier studies enforced $C^2$ continuity on each feature dimension via a smoothness penalty term [49–51]. Recently, Li [5,36] added a smoothness constraint at the training phase to train a DL model. Let $O$ be a symmetric tridiagonal matrix, defined with the boundary conditions $p_0 = p_1$ and $p_T = p_{T+1}$.
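The entries of $O$ are not reproduced here, so the sketch below assumes the standard second-difference matrix that is consistent with the stated boundary conditions. It has 2 on the diagonal (1 in the two corners) and −1 on the off-diagonals. The absolute-value mean used as the penalty is also an assumption.

```python
import numpy as np

def smoothness_matrix(T):
    """Symmetric tridiagonal second-difference matrix of size T x T (assumed form)."""
    O = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    O[0, 0] = O[-1, -1] = 1.0   # follows from p_0 = p_1 and p_T = p_{T+1}
    return O

def smooth_loss(pred):
    """Mean magnitude of the second temporal difference; pred: (N, C, T, V)."""
    O = smoothness_matrix(pred.shape[2])
    accel = np.einsum("st,nctv->ncsv", O, pred)   # apply O along the time axis
    return np.abs(accel).mean()

# Check against an explicit second difference with replicated end frames.
T = 6
p = np.arange(T, dtype=float) ** 2
padded = np.concatenate([p[:1], p, p[-1:]])            # p_0 = p_1, p_{T+1} = p_T
second_diff = -(padded[:-2] - 2.0 * padded[1:-1] + padded[2:])
matches = np.allclose(smoothness_matrix(T) @ p, second_diff)
```

Replicating the first and last frames is what turns the corner entries of the matrix from 2 into 1, and a motionless sequence gives zero penalty.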