The GMS format homepage

The text below is a condensed summary of various documents available in PDF.

The file format is currently being submitted to the MIME and IFF authorities.

Introduction to the GMS format

The GMS format is a low-level, minimal yet generic binary format for storing Gesture and Motion Signals in a flexible, organized and optimized way. It takes into account the minimal features that a format carrying movement or gesture information needs: flexible dimensionality of the signals, versatile structuring, flexible types for the encoded variables, and the spatial and temporal properties of gesture and motion signals.

The question of encoding movements such as those produced by human gestures may become central in the coming years, given the growing importance of movement data exchanges between heterogeneous systems and applications (musical applications, 3D motion control, virtual reality interaction, etc.). For the past 20 years, various formats have been proposed for encoding movement, especially gestures. These include C3D, BVH, AOA, TCR, BRD, CSM, etc. Though:

The GMS format was designed to be able to code all the features of GMS signals independently of the context in which they have been produced and will be used.

Gesture and Motion Signals?

Gesture signals, however they are produced and however they are considered, have specific properties that distinguish them from other temporal signals (especially aero-acoustical or visual signals). This section reviews the properties that served as a basis for defining the generic GMS format.

Morphological versatility

One of the first observations is the morphological versatility of gestures. Whereas images and sounds can be displayed in predefined environments (displays of a given resolution or 3D Caves of a given size for images, stereo or quadraphonic rendering for sounds), the structure and morphology of gestures keep changing according to the task and the manipulated tools. To take this versatility into account, we propose to structure gestures according to two complementary features: geometrical and structural dimensionalities.

Geometrical dimensionality refers to the dimensionality of the space in which the gesture evolves.

For example, piano or clarinet keys are pushed or closed by a 1D finger motion. The control of a sound, and more generally parameter tuning (for example the value of an elasticity or the amplitude of a deformation), can be performed with devices that evolve in a 1D non-oriented space (set of sliders, set of knobs, etc.) and that can be described by a scalar or a set of scalars.

Conversely, in cartoon animation, scrap-paper animation or animated painting under the camera, the space is reduced to a plane. The gestures and motions evolve in a 2D space (figure g), described on two orthogonal oriented axes.

When we manipulate an object (real or virtual), the space is obviously 3D, i.e. its description requires three orthogonal oriented axes (figures e, f, h).

This means that the geometrical dimensionality of a gesture can vary a lot: from a pure scalar or a set of pure scalars, as in the manipulation of sets of sliders or keys (figures c and d), to geometrical 1D (figures a and b), 2D (figure g), 3D (figure f), and 6D oriented vectors and/or tensors (figure h).

Figure a to h: Versatility of the gesture morphology: (a) piano gesture, (b) violin gesture, (c) mix table gesture, (d) keyboard gesture, (e) hands on table, (f) joystick gesture, (g) cartoon artist, (h) man on a chair.

For a given geometrical dimensionality, the number of degrees of freedom (DoF) can vary. We call this number of axes of variation the structural dimensionality.

For example, when we are acting on a keyboard of n keys (a piano keyboard, a computer keyboard, a set of buttons), the performed gesture can be considered in two ways - and similarly the n-keys produced signals:

In human body motion, the geometrical dimensionality is 3 (all the motions of the body can be described in a 3D oriented Euclidean space). The number of axes of variability (the number of degrees of freedom) is more than 200 in the real body, and is sixteen if the motion is sensed by a motion capture system with 16 sensors.

In the modeling of a bowed string, the two dimensions of the deformations are usually decoupled, and the system can be considered as two superposed 1D gestures (pressing the string, bowing the string), thus as a 1D system with 2 DoF.
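
The two dimensionalities combine multiplicatively. The sketch below makes the simplifying assumption that every degree of freedom carries one scalar per geometrical axis (a pure scalar counting as one component), so the number of scalar signals to record is the product of the two. The helper name `scalar_signal_count` is hypothetical, chosen for illustration only, and is not part of the GMS format.

```python
def scalar_signal_count(geometrical_dim: int, degrees_of_freedom: int) -> int:
    """Number of scalar signals, ASSUMING one scalar per axis per DoF.

    geometrical_dim: components per degree of freedom (1 for a pure
        scalar or a 1D motion, 2 for 2D, 3 for 3D, ...).
    degrees_of_freedom: the structural dimensionality.
    """
    if geometrical_dim < 1 or degrees_of_freedom < 1:
        raise ValueError("both dimensionalities must be at least 1")
    return geometrical_dim * degrees_of_freedom


# Motion capture with 16 sensors in 3D space, as in the example above.
body_capture = scalar_signal_count(geometrical_dim=3, degrees_of_freedom=16)

# Bowed string: two decoupled 1D gestures (press and bow).
bowed_string = scalar_signal_count(geometrical_dim=1, degrees_of_freedom=2)

print(body_capture, bowed_string)
```

Under this assumption, the 16-sensor capture yields 48 scalar signals and the bowed string yields 2.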

Quantitative Ranges

Besides the two previous qualitative properties (number of space axes and number of DoF), specific spatial and temporal quantitative features characterize gesture signals.

A first quantitative feature that distinguishes gesture (and control motion) signals from others (aero-mechanical signals, visual motions) is the frequency bandwidth:

Figure: Temporal range of sensory signals

Although the three zones of the above figure overlap, they point out a useful categorization: visualizing motions requires a sampling rate of up to 100 Hz; manipulating an object with force feedback requires a sampling rate from a few Hz to a few kHz; recording sounds requires a sampling rate from 20 kHz to 40 kHz. Gesture signals lie in the middle range.

A second quantitative feature is the spatial range. Audio signals are small deformations, centered on 0 and smaller than a few millimeters. Mechanical and visual motions are usually non-centered large deformations and displacements (from centimeters to meters).

Hence, the properties of gesture and motion signals place them in the middle range. Spatially, they are similar to visual motions but need a higher sampling rate. Temporally, they need a lower sampling rate than sound but cover a larger, non-centered spatial range.

Type of variables

As motions and gestures are produced by physical systems, and used to control physical systems, the data can be of two different types: extensive and intensive variables.

Conversely, we may notice that visual and acoustical data are only extensive (positions and/or displacements).

Indeed, in natural situations, when gestures are used for object manipulation, physical energy is exchanged between the two interacting bodies (for example object and human). Such interactive dynamic systems have to be represented either by an explicit correlation between extensive and intensive variables, as in the Newtonian formalism, or by an implicit correlation, as in energy formalisms.

After recording data from such a dynamic system, and in the absence of a model of the system, we need all the extensive variables as well as the intensive ones to reconstruct the system. This means that the data to be stored can be heterogeneous: extensive and/or intensive.

Specifications of the GMS format "in-brief"

Scene, Unit, Channel and Track

The GMS format organizes the morphological versatility of gesture and motion signals in a four-level structure: Gesture Track, Gesture Channel, Gesture Unit and Gesture Scene.
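
As a rough illustration of the four levels, the sketch below models them as nested Python dataclasses. The nesting order (a Scene holds Units, a Unit holds Channels, a Channel holds Tracks) is an assumption inferred from the order in which the levels are named; every class and field name is hypothetical and not taken from the GMS specification. Treating a Track as one scalar signal, and forces as the intensive variables, are likewise assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class VariableKind(Enum):
    EXTENSIVE = "extensive"  # positions and/or displacements
    INTENSIVE = "intensive"  # assumed here to be forces


@dataclass
class Track:
    name: str  # assumed: one scalar signal, e.g. one axis of a sensor


@dataclass
class Channel:
    kind: VariableKind
    tracks: List[Track] = field(default_factory=list)
    comment: str = ""


@dataclass
class Unit:
    channels: List[Channel] = field(default_factory=list)
    comment: str = ""


@dataclass
class Scene:
    sample_rate_hz: float
    units: List[Unit] = field(default_factory=list)
    comment: str = ""

    def track_count(self) -> int:
        """Total number of tracks across all units and channels."""
        return sum(len(c.tracks) for u in self.units for c in u.channels)


# Hypothetical 3D force-feedback stick: one position and one force channel.
position = Channel(VariableKind.EXTENSIVE, [Track("x"), Track("y"), Track("z")])
force = Channel(VariableKind.INTENSIVE, [Track("fx"), Track("fy"), Track("fz")])
scene = Scene(sample_rate_hz=1000.0, units=[Unit([position, force], "stick")])

print(scene.track_count())
```

Keeping the variable kind on the channel rather than on the scene lets one scene mix extensive and intensive data, which matches the heterogeneous recordings described earlier.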

Implementation of the GMS format

The GMS format is based on the portable IFF (Interchange File Format) standard for binary files, so GMS files are binary files.
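
For readers unfamiliar with IFF, the sketch below walks a flat sequence of chunks, assuming the usual EA IFF 85 conventions: a 4-byte ASCII identifier, a 32-bit big-endian size, the data, and one pad byte when the size is odd. Whether GMS follows these conventions exactly is not stated here, and the chunk identifiers `AAAA` and `BBBB` are placeholders, not GMS chunk names. The function name `iter_iff_chunks` is hypothetical.

```python
import struct
from typing import Iterator, Tuple


def iter_iff_chunks(data: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (chunk_id, payload) pairs from a flat run of IFF-style chunks.

    ASSUMES big-endian 32-bit sizes and padding to an even boundary.
    Nested group chunks are not unpacked by this sketch.
    """
    offset = 0
    while offset + 8 <= len(data):
        chunk_id, size = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        end = start + size
        if end > len(data):
            raise ValueError(f"chunk {chunk_id!r} is truncated")
        yield chunk_id, data[start:end]
        offset = end + (size & 1)  # skip the pad byte after odd-sized data


# Synthetic buffer with two placeholder chunks; the first has an odd size.
demo = (
    b"AAAA" + struct.pack(">I", 3) + b"xyz" + b"\x00"
    + b"BBBB" + struct.pack(">I", 2) + b"hi"
)
chunks = list(iter_iff_chunks(demo))
print(chunks)
```

A real IFF file normally wraps its chunks in an outer group chunk; a complete reader would first open that group and then iterate over its contents in the same way.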

The chunks in the header of the file describe the Scene / Unit / Channel / Track structure of the data. The header chunks are:

A GMS file version 0.1 is made of a single GMS scene. The scene incorporates the sample rate, and the type of the sample data, that can be either floating-point values (32 or 64 bits) or integers.

The scene, and each Unit and Channel, can carry a string comment encoded in ISO-8859-1 (Latin-1).

All the gesture sample data of the scene are encoded in the Frame chunk. This chunk contains the gesture and motion signal itself, encoded into successive frames. In the Frame chunk, tracks are interleaved.
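
Interleaving means that each frame stores one sample for every track before the next frame begins. The sketch below splits such a payload back into per-track lists. It assumes big-endian samples, 32-bit floats by default, and no per-frame header; none of these details is confirmed by the text above, and `deinterleave_frames` is a hypothetical helper name.

```python
import struct
from typing import List


def deinterleave_frames(payload: bytes, n_tracks: int,
                        code: str = "f") -> List[List[float]]:
    """Split an interleaved sample payload into one list per track.

    code is a struct format character: "f" for 32-bit floats, "d" for
    64-bit floats. Big-endian byte order is ASSUMED.
    """
    if n_tracks < 1:
        raise ValueError("n_tracks must be at least 1")
    sample_size = struct.calcsize(">" + code)
    frame_size = sample_size * n_tracks
    if len(payload) % frame_size != 0:
        raise ValueError("payload is not a whole number of frames")
    n_samples = len(payload) // sample_size
    values = struct.unpack(f">{n_samples}{code}", payload)
    # Track t owns every n_tracks-th value, starting at index t.
    return [list(values[t::n_tracks]) for t in range(n_tracks)]


# Two frames of three tracks: frame 0 = (0, 10, 100), frame 1 = (1, 11, 101).
payload = struct.pack(">6f", 0.0, 10.0, 100.0, 1.0, 11.0, 101.0)
tracks = deinterleave_frames(payload, n_tracks=3)
print(tracks)
```

Storing the samples frame by frame keeps all tracks of one time step together, which suits streaming a scene in time order.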

An example of Scene, Unit, Channel and Track organization

This basic format allows us to describe heterogeneous gesture control situations and to consider the gestural systems (sensors and force feedback devices) as a workspace in which several systems can be used, organized and reorganized. Let us take the example of a heterogeneous VR scene composed of:

Such a scene would be described as follows:
