The text below is a condensed summary of various documents available in PDF:
The file format is currently being submitted to the MIME and IFF authorities.
The GMS format is a low-level, minimal, yet generic binary format for storing Gesture and Motion Signals in a flexible, organized, and optimized way. It takes into account the minimal features that a format carrying movement or gesture information needs: flexible dimensionality for the signals, versatile structuring, flexible types for the encoded variables, and the spatial and temporal properties of gesture and motion signals.
The question of encoding movements such as those produced by human gestures may become central in the coming years, given the growing importance of movement data exchanges between heterogeneous systems and applications (musical applications, 3D motion control, virtual reality interaction, etc.). For the past 20 years, various formats have been proposed for encoding movement, especially gestures. These include C3D, BVH, AOA, TCR, BRD, CSM, etc. Though:
The GMS format was designed to be able to code all the features of GMS signals independently of the context in which they have been produced and will be used.
Gesture signals, however they are produced and however they are considered, present specific properties that distinguish them from other temporal signals (especially aero-acoustical or visual signals). This section reviews these properties, which were used as the basis for defining the generic GMS format.
The first observation is the morphological versatility of gestures. Whereas images and sounds can be displayed in predefined environments (displays of a given resolution or 3D Caves of a given size for images, stereo or quadraphonic rendering for sounds), the structure and morphology of gestures change constantly according to the task and the manipulated tools. To take this versatility into account, we propose to structure gestures according to two complementary features: geometrical and structural dimensionalities.
Geometrical dimensionality refers to the dimensionality of the space in which the gesture is evolving.
For example, piano or clarinet keys are pushed or closed by a 1D finger motion. The control of a sound, and more generally parameter tuning (for example the value of an elasticity or the amplitude of a deformation), can be performed with devices that evolve in a 1D non-oriented space (sets of sliders, sets of knobs, etc.) and that can be described by a scalar or a set of scalars.
Conversely, in cartoon animation, scrap-paper animation, or animated painting under the camera, the space is reduced to a plane. The gestures and motions evolve in a 2D space (figure g), described on two orthogonal oriented axes.
When we manipulate an object (real or virtual), the dimensionality of the space is obviously 3D, i.e. the description needs three orthogonal oriented axes (figures e, f, h).
This means that the geometrical dimensionality of a gesture can vary a lot: from a pure scalar or a set of pure scalars as in manipulation of sets of sliders or keys (figure c and d), to geometrical 1D (figure a and b), 2D (figure g), 3D (figure f), 6D oriented vectors and/or tensors (figure h).
[Figure: panels (a) to (h), examples of gestures with different geometrical dimensionalities]
For a given geometrical dimensionality, the number of degrees of freedom (DoF) can vary. We call this number of axes of variation the structural dimensionality.
For example, when we act on a keyboard of n keys (a piano keyboard, a computer keyboard, a set of buttons), the performed gesture can be considered in two ways, and so can the signals produced by the n keys:
In human body motion, the geometrical dimensionality is 3 (all the motions of the body can be described in a 3D oriented Euclidean space), and the number of axes of variability (the number of degrees of freedom) is more than 200 in the real body, or sixteen if the motion is sensed by a motion capture system with 16 sensors.
In the modeling of a bowed string, the two dimensions of the deformation are usually decoupled, and the system can be considered as two superposed 1D gestures (pressing the string and bowing the string), that is, as a 1D system with 2 DoF.
Besides the two previous qualitative properties (number of space axes and number of DoF), specific spatial and temporal quantitative features characterize gesture signals.
A first quantitative feature that distinguishes gesture (and control motion) signals from others (aero-mechanical signals, visual motions) is the frequency bandwidth range:
[Figure: frequency bandwidth ranges of visual motion, gesture, and audio signals]
Although the three zones of the above figure overlap, they point out a useful categorization: visualizing motions requires a sampling rate of up to 100 Hz; manipulating an object with force feedback requires a sampling rate from a few Hz to a few kHz; recording sounds requires a sampling rate from 20 kHz to 40 kHz. Gesture signals are in the middle range.
Conversely, audio signals are small deformations, centered on zero and smaller than a few millimeters. Mechanical and visual motions are usually large, non-centered deformations and displacements (from centimeters to meters).
Hence, the properties of gesture and motion signals place them in the middle range. Spatially, they are similar to visual motions but need a higher sampling rate. Temporally, they need a lower sampling rate than sound but span a larger, non-centered spatial range.
As motions and gestures are produced by physical systems and used to control physical systems, the data can be of two different types: extensive variables (such as positions and displacements) and intensive variables (such as forces).
Conversely, we may notice that visual and acoustical data consist only of extensive variables (positions and/or displacements).
Indeed, in natural situations, when gestures are used for object manipulation, physical energy is exchanged between the two interacting bodies (for example object and human). Such interactive dynamic systems have to be represented either by an explicit correlation between extensive and intensive variables, as in the Newtonian formalism, or by an implicit correlation, as in energy formalisms.
After recording data from such a dynamic system, and in the absence of a model of the system, we need all the extensive variables as well as the intensive ones to reconstruct the system. This means that the data to be stored can be heterogeneous: extensive and/or intensive.
The GMS format organizes the morphological versatility of gesture and motion signals in a four-level structure: Gesture Track, Gesture Channel, Gesture Unit, and Scene.
External devices such as gestural sensors are connected to computers via A-D converters. Track i contains the one-dimensional scalar signal a_i(t) corresponding to one A-D track, sampled at its Shannon rate.
A channel can be 1D0 (a pure scalar), 1Dx (a vector on the x axis), 1Dy, 1Dz, 2Dxy, 2Dyz, 2Dzx, or 3Dxyz. The variables of a channel can be extensive variables (positions) or intensive variables (forces).
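The relation between a channel and its tracks can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GMS specification: the names `VariableKind` and `Channel` are hypothetical, and we assume that the leading digit of a dimension code gives the number of scalar tracks the channel groups together.

```python
from dataclasses import dataclass
from enum import Enum


class VariableKind(Enum):
    """Nature of the variable carried by a channel (labels are illustrative)."""
    EXTENSIVE = "EV"  # e.g. positions
    INTENSIVE = "IV"  # e.g. forces


@dataclass(frozen=True)
class Channel:
    name: str
    kind: VariableKind
    dimension: str  # dimension code such as "1D0", "1Dz", "2Dxy", "3Dxyz"

    def track_count(self) -> int:
        # Assumption: the leading digit of the code is the number of
        # scalar tracks (one per axis, or one for a pure scalar).
        return int(self.dimension[0])


stick = Channel("stick", VariableKind.EXTENSIVE, "3Dxyz")
key = Channel("key", VariableKind.INTENSIVE, "1Dz")
print(stick.track_count(), key.track_count())
```

Deriving the track count from the code keeps the sketch independent of the exact list of allowed codes.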
The motions of all the sensed points of a human body in a motion capture process are dynamically correlated, as are the motions of all the points of a hand or of the fingers. This information has to be preserved so that it is not inadvertently broken in a later stage, for example during signal processing. The only information we preserve in GMS files is that some signals are correlated, not the way in which they are correlated, unlike several motion capture formats. The producer of the file is free to organize the channels into units.
GMS files are binary files based on the portable IFF (Interchange File Format) standard.
The chunks in the header of the file describe the Scene / Unit / Channel / Track structure of the data. The header chunks are:
For each unit chunk, the channel chunks belonging to the unit follow.
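To make the chunk layout concrete, here is a minimal sketch of generic IFF chunk writing and reading: a 4-byte identifier, a 32-bit big-endian payload length, the payload, and one pad byte when the length is odd. The identifiers `UNIT` and `CHAN` and the payload contents are hypothetical placeholders; the actual GMS chunk names and fields are not given in this summary.

```python
import io
import struct


def write_chunk(stream, chunk_id: bytes, payload: bytes) -> None:
    """Write one generic IFF chunk: ID, big-endian length, payload, padding."""
    if len(chunk_id) != 4:
        raise ValueError("IFF chunk IDs are exactly 4 bytes")
    stream.write(chunk_id)
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    if len(payload) % 2:
        stream.write(b"\x00")  # pad so the next chunk starts on an even offset


def read_chunks(stream):
    """Yield (chunk_id, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        chunk_id, size = struct.unpack(">4sI", header)
        payload = stream.read(size)
        if size % 2:
            stream.read(1)  # skip the pad byte
        yield chunk_id, payload


# Hypothetical IDs and payloads, for illustration only.
buffer = io.BytesIO()
write_chunk(buffer, b"UNIT", b"keyboard")
write_chunk(buffer, b"CHAN", b"PK1")
buffer.seek(0)
chunks = list(read_chunks(buffer))
print(chunks)
```

A real reader would also handle the enclosing container chunk and validate sizes against the file length.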
A GMS file version 0.1 is made of a single GMS scene. The scene specifies the sample rate and the type of the sample data, which can be either floating-point values (32 or 64 bits) or integers.
The scene, and each Unit and Channel, can carry a comment string encoded in ISO-8859-1 (Latin-1).
All the gesture sample data of the scene are encoded in the Frame chunk. This chunk contains the gesture and motion signal itself, encoded into successive frames. In the Frame chunk, tracks are interleaved.
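The interleaving of tracks can be illustrated with a short sketch that splits a frame payload back into one list of samples per track. It assumes 32-bit big-endian floating-point samples and a payload holding only whole frames; the byte order, the function name `deinterleave`, and its parameters are assumptions made for the example, not details taken from the format description.

```python
import struct


def deinterleave(frame_bytes: bytes, track_count: int, fmt: str = "f"):
    """Split interleaved frames into one list of samples per track.

    Each frame holds one sample of every track, in track order.
    """
    sample_size = struct.calcsize(">" + fmt)
    frame_size = sample_size * track_count
    if len(frame_bytes) % frame_size:
        raise ValueError("payload is not a whole number of frames")
    frame_count = len(frame_bytes) // frame_size
    values = struct.unpack(f">{frame_count * track_count}{fmt}", frame_bytes)
    # Track i owns every track_count-th value, starting at offset i.
    return [list(values[i::track_count]) for i in range(track_count)]


# Three frames of two tracks: (0, 10), (1, 11), (2, 12).
payload = struct.pack(">6f", 0.0, 10.0, 1.0, 11.0, 2.0, 12.0)
tracks = deinterleave(payload, track_count=2)
print(tracks)
```

Passing `fmt="d"` would read 64-bit floats in the same way.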
This basic format allows us to describe heterogeneous gesture control situations and to consider the gestural systems (sensors and force feedback devices) as a workspace in which several systems can be used, organized, and reorganized. Let us take the example of a heterogeneous VR scene composed of:
Such a scene is described as follows:
PK1 [EV(P), 1Dz], PK2 [EV(P), 1Dz], ..., PK8 [EV(P), 1Dz]
SS [EV(P), 3Dxyz]
LS [IV(F), 2Dxy]
DP1 [EV(P), 3Dxyz], ..., DP16 [EV(P), 3Dxyz]
BL1 [EV(P), 3Dxyz]
BL2 [EV(A), 3Drθφ]
{PK1, PK2, ..., PK8}
{SS}
{LS}
{DP1, ..., DP16}
{BL1, BL2}
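As a rough check of this description, the channels and units listed above can be written down as plain Python data and the number of interleaved tracks per frame counted. The sketch reduces each dimension code to its leading digit, assumed here to be the number of scalar tracks of the channel, and names the channel of the third unit LS as in the channel list; both choices are assumptions made for the example.

```python
# Channel name -> (variable type, assumed number of scalar tracks).
channels = {f"PK{i}": ("EV(P)", 1) for i in range(1, 9)}
channels["SS"] = ("EV(P)", 3)
channels["LS"] = ("IV(F)", 2)
channels.update({f"DP{i}": ("EV(P)", 3) for i in range(1, 17)})
channels["BL1"] = ("EV(P)", 3)
channels["BL2"] = ("EV(A)", 3)

# Units group channels whose signals are dynamically correlated.
units = [
    [f"PK{i}" for i in range(1, 9)],
    ["SS"],
    ["LS"],
    [f"DP{i}" for i in range(1, 17)],
    ["BL1", "BL2"],
]

tracks_per_unit = [sum(channels[name][1] for name in unit) for unit in units]
total_tracks = sum(tracks_per_unit)
print(tracks_per_unit, total_tracks)
```

Under these assumptions, each frame of the scene would carry one sample for each of the `total_tracks` tracks.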