MULTIMEDIA DMTG




The most recent addition to the DMASTR system is VDMTG, which presents acoustic input simultaneously with graphics, with precise synchronization between them. The acoustic files use the ADF format specified in the BLISS speech editing system developed by John Mertus at Brown University. This is available via FTP from the following site: jam.cog.brown.edu.

Currently, the BLISS system provides drivers for three sound cards: the DT2821, made by Data Translation, the AdLib Gold 1000, and the MediaVision Pro Audio Studio. The DT2821 is an old board, and expensive. It is no longer fast enough to keep up with a Pentium processor, and some adjustments to the computer may be required. The AdLib card is no longer in production. The MediaVision card is inexpensive, and will run on any PC, but there have been multiple changes in the board over the past few years, and only some versions are compatible with the BLISS drivers and VDMTG. For example, the latest version (at the time of writing) is called a Pro Sonic, and this definitely does not work with VDMTG. Unfortunately, MediaVision has ceased production of the older boards, and it may be difficult to locate a supplier for the Pro Audio Studio board.

By far the most popular sound card currently is the SoundBlaster made by Creative Labs. Unfortunately, there is no BLISS driver for this card. However, we can provide software that an assembly level programmer could adapt to make a BLISS driver for the SoundBlaster series.

The BLISS system.

You should refer to the BLISS documentation for a detailed statement about how the speech editor works, and how to configure your system. For present purposes, we'll assume that you have a series of ADF files that make up the acoustic part of your experiment, and that in each of those files, a number of cursor positions have been defined.

In BLISS, there are four pairs of cursors that can be defined: L0 - R0, L1 - R1, L2 - R2, and L3 - R3 (i.e., there is a left and a right member of cursor 0, cursor 1, cursor 2, and cursor 3). So, for example, you might have placed cursor L0 at the beginning of a sentence, and R0 at the end of the sentence. This would make it possible to instruct VDMTG to play back the material between L0 and R0.

Structure of the ITM file.

The structure of the ITM file is the same as in DMTG, except for the addition of statements that are relevant to the acoustic file. These are enclosed within curly braces, e.g.,

+001 { L sentence 0 } "You are now hearing a sentence";

In this item, "You are now hearing a sentence" would be displayed on the screen, and the subject would hear the contents of the file "sentence.adf" played over the left audio channel.

We will refer to the statement enclosed in curly braces as a "V statement". Each V statement contains a number of frames, and each frame contains an instruction on how to build the item. However, the frames need not be in temporal order, unlike the frames in the graphics segment, which must be.

There are two types of frames. One type is referred to as a file specification frame (first character L or R), and the other as a trigger specification frame (first character T).

File Specification frames.

A file specification frame begins with the character L or R (indicating whether it is the left or right audio channel that is to be used), followed by the name of the ADF file to be played (the ADF extension is added automatically, and paths are allowed). Then follows the time at which this file is to be played, relative to the commencement of the item itself. Additional offsets can specify a portion of the file to play. If nothing is specified, the whole file is played from start to end. If a starting point is specified but no end time, the file will be played from the specified time to the end. For example:

{ L house 0 L0 R0 }

would play the file "house.adf" as soon as the item containing this V statement begins, but would play only the segment between the left and right zero cursors.

Trigger Specification frames.

The trigger specification is used to "trigger" the graphics code, so that a visual signal can be displayed at a particular time with respect to the speech file. Only one trigger per item is allowed. These frames are discussed below.

Time references.

Frames must contain time references. A time can be a reference to the current frame's cursors, or to another frame's cursors. It can also be an absolute time, expressed in units of milliseconds or sample durations, or an arithmetic sum of these.

A time reference expressed in terms of cursors begins with the letter F, followed by a number that indicates the frame, followed by a cursor name. For example, F2L3 refers to the left member of cursor 3 in frame 2. F0 refers to the current frame, and F-1 refers to the previous frame (forward references are not allowed).

A time reference can also be made by specifying a frame number followed by S (start) or E (end). For example:

F-1e

means "the end of the previous frame", which would of course be equivalent to F0s.

A time reference expressed in terms of milliseconds begins with the letter M. If it is expressed in terms of sampling units, no letter is required (optionally, it can start with an S).

To add or subtract times, put + or - symbols between them.

For example,

F3L2 +m30

means "30 ms after the left member of cursor 2 in Frame 3".

Finally, a time reference can be expressed in terms of video ticks. This begins with the letter V. Thus

{ R sentence v100 F0L0 F0R0 }

would mean that the file "sentence" would commence 100 video ticks after the beginning of the item. Similarly,

{ R sentence 0 / L noise F-1s + v100 }

would mean that the file "sentence" would begin in the right channel as soon as the item began, and after a delay of 100 video ticks, the file "noise" would be played in the left channel.
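The reference forms described so far can be made concrete with a small classifier. The following Python sketch is not VDMTG's own parser: the function name classify_term and the regular expressions are hypothetical, it handles single terms only (no + or - arithmetic), and it assumes cursor numbers 0 to 3 as in BLISS.

```python
import re

# Hypothetical patterns written from the description in this section;
# they are not taken from VDMTG itself.
_CURSOR = re.compile(r"^F(-?\d+)([LR])([0-3])$", re.IGNORECASE)
_EDGE = re.compile(r"^F(-?\d+)([SE])$", re.IGNORECASE)
_ABSOLUTE = re.compile(r"^([MVS]?)(\d+)$", re.IGNORECASE)

_UNITS = {"M": "ms", "V": "ticks", "S": "samples", "": "samples"}


def classify_term(term):
    """Classify one time-reference term such as F2L3, F-1e, m30, v100 or 0."""
    term = term.strip()
    match = _CURSOR.match(term)
    if match:
        # Frame number, left/right member, cursor number.
        return ("cursor", int(match.group(1)), match.group(2).upper(),
                int(match.group(3)))
    match = _EDGE.match(term)
    if match:
        # Frame number, then S (start) or E (end).
        return ("edge", int(match.group(1)), match.group(2).upper())
    match = _ABSOLUTE.match(term)
    if match:
        # No letter means sampling units, as does an optional S.
        return ("absolute", _UNITS[match.group(1).upper()], int(match.group(2)))
    raise ValueError("unrecognised time reference: " + term)


for example in ("F2L3", "F-1e", "m30", "v100", "0"):
    print(example, classify_term(example))
```

Matching is case insensitive, in line with the commands themselves.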

Some additional examples:

{ L hello 0 / R world 0 }

would play the file "hello.adf" in the left channel (the zero following the filename means that the file is to be played zero sampling units after the onset of the item) and simultaneously play "world.adf" in the right channel.

{ L hello 0 / R world F-1e }

would play the file "hello" in the left channel and then play the file "world" in the right channel as soon as "hello" finishes.

{ L hello 0 / R world F-1e + m100 }

would play the file "hello" in the left channel and then play the file "world" in the right channel after a pause of 100 ms.

{ R hello 0 F0L0 F0R0 }

would only play the segment of "hello" between the left and right zero cursors (note that the commands are case insensitive).

Complex Time References.

When defining portions of a file to be played, it is possible to allow a length specification to be defined by the differences between cursors (or whatever) in other frames. Normally, the start and stop markers in a file would be references to cursors in that file (or some absolute specification relative to the beginning of that file). However, two references to cursors in other frames could be subtracted from one another and then added to a cursor in the file to yield start and stop points. For example,

{ R hallaleu 0 / R hello f-1e / R world f2s 0 f2e - f2s }

would play the file "hallaleu" through and then "hello" and "world" together, but only play as much of the file "world" as the file "hello" is long. This could cause an error if "hello" happened to be longer than "world".

This means that when specifying a reference to a cursor within the file, the value of that cursor must be relative to the start of that file, as opposed to the start of the item (which is the normal case). Once the start and stop points have been resolved, those cursors become relative to the item start. If you reference a cursor of the current file (or frame), then that cursor is relative to the start of the file; at all other times it is relative to the start of the item.

In these examples, the only whitespace necessary is that following the filenames; all else is optional. The last example could therefore become:

{Rhallaleu 0/Rhello f-1e/rworld f2s0f2e-f2s}

Synchronizing Speech and Graphics.

Using the Video Clock.

The most accurate way to synchronize the visual display with the audio signal is to use a timing expression signalled by the switch 'V'. This goes into the slot in a V statement that indicates when the ADF file is to commence. So a value of V300 would mean that the audio signal should commence 300 video ticks after the beginning of the item. One can then schedule a synchronous visual display in the following way:

+001 { L word v300 } %300 / "test";

The accuracy of synchronization depends critically on the accuracy of your estimate of the refresh rate of your particular video card specified in the R parameter discussed below. Note that the synchronization is achieved by starting the speech file at a time specified in terms of a video clock. This clock ticks over whenever the vertical retrace on the raster occurs.
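To see why the refresh estimate matters, consider how a small per-tick error accumulates. The Python sketch below is only an illustration: the function name and the interval values are made up, and it assumes that the speech onset is computed from the estimated interval while the display counts real retraces.

```python
def accumulated_error_ms(ticks, assumed_interval_ms, true_interval_ms):
    """Offset between speech onset and display after `ticks` retraces.

    The speech onset is assumed to be scheduled at ticks * assumed interval,
    while the display appears after `ticks` real retraces.
    """
    return ticks * (assumed_interval_ms - true_interval_ms)


# Illustrative numbers only: an estimate that is 0.01 ms too long per tick.
error = accumulated_error_ms(300, 14.30, 14.29)
print(round(error, 3))  # about 3 ms after 300 ticks
```

An error that is negligible for a single tick therefore grows to several milliseconds for a value such as V300.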

Here is a further example. Suppose we wish to play a word simultaneously with a visual item, and the left zero cursor is at the start of the word and the right zero cursor at the end. The following would achieve the desired effect:

+006 { L word v30 F0L0 F0R0 } "#######" %30 / * "word" / ;

Accuracy of synchronization can be further improved by taking into account the position of the item on the screen, since items at the top of the screen are displayed before items at the bottom of the screen. This could be done by adding a few milliseconds as a correction factor. Suppose the probe is displayed in the bottom third of the screen and the refresh interval is 14.3 ms, so that the raster reaches the probe roughly 10 ms after the refresh begins. We could then use the following:

+007 { L word v30 + m10 F0L0 F0R0 } "#######" %30 / * "word" / ;
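The 10 ms correction can be estimated from the vertical position of the probe. The Python sketch below is a simplification: it assumes that the raster moves down the screen at a constant rate and it ignores the vertical blanking period. The function name and the 0.7 position are illustrative assumptions.

```python
def raster_offset_ms(fraction_down_screen, refresh_interval_ms):
    """Approximate delay before the raster reaches a given screen position.

    fraction_down_screen: 0.0 for the top of the screen, 1.0 for the bottom.
    """
    return fraction_down_screen * refresh_interval_ms


# A probe about 70% of the way down, with the 14.3 ms interval from the text.
print(round(raster_offset_ms(0.7, 14.3)))  # close to the m10 used above
```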

Lastly, if you wanted a point in the middle of a sentence (marked by the left zero cursor) to be synchronized exactly with the visual display, then the beginning of the speech must be specified relative to some frame, like this:

+008 { L sentence v100 - F0L0 } %70 / "#######" %30 / * "probe" / ;
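The arithmetic behind "v100 - F0L0" can be written out as follows. This Python sketch works in milliseconds for readability; the function name, the 14.3 ms interval and the 1200 ms cursor position are assumptions chosen for illustration.

```python
def file_start_ms(target_tick, refresh_interval_ms, cursor_offset_ms):
    """Time at which the file must start so that the cursor lands on the tick.

    cursor_offset_ms is the cursor position measured from the start of the
    file, which is how a cursor of the current frame is interpreted.
    """
    return target_tick * refresh_interval_ms - cursor_offset_ms


# Probe at tick 100, 14.3 ms per tick, cursor L0 lying 1200 ms into the file.
print(round(file_start_ms(100, 14.3, 1200.0), 1))  # about 230 ms after item onset
```

A negative result would mean that the file has to begin before the item does, so the target tick should presumably be late enough to leave room for the speech that precedes the cursor.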

Using the VM trigger.

An alternative way to synchronize the visual display to the auditory output is to use a trigger frame, although this is not as accurate. The time that the trigger specifies becomes available to DMASTR as a variation on the percent (%) frame duration specification, specifically %T.

For example, suppose we wish to play a sentence containing a critical word that is to trigger a visual probe. If the left zero cursor has been positioned at the end of the critical word, then the following item is used:

+5 { R sentence 0 / t F1L0 } %t / * "probe" ;

The visual item "probe" will be displayed anything up to one refresh interval after the critical word is played. The error will be constant between multiple runs of Dmastr, because it arises from the fact that the duration of the file played up to that moment is not an exact multiple of the refresh interval. It is critical that the trigger specification be defined before %T is used, as it sets the value that %T has.
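The size of this constant error can be estimated with a short calculation. The Python sketch below assumes that retraces are counted from the onset of the item, which may not match your hardware exactly; the function name and the example values are illustrative.

```python
import math


def trigger_display_delay_ms(trigger_ms, refresh_interval_ms):
    """Wait between the trigger time and the next vertical retrace.

    The display can only change on a retrace, so the probe appears at the
    first tick at or after the trigger time.
    """
    ticks = math.ceil(trigger_ms / refresh_interval_ms)
    return ticks * refresh_interval_ms - trigger_ms


# A trigger 1000 ms into the item with a 14.3 ms refresh interval.
print(round(trigger_display_delay_ms(1000.0, 14.3), 2))  # about 1 ms
```

The result is always smaller than one refresh interval, and it is the same on every run because it depends only on the trigger time and the refresh interval.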

Note: A trigger does not actually generate a frame in the code, and therefore should be ignored in all relative and absolute frame references. A good practice is to always make it the last frame and then there is no need to worry about it.

Why have two methods?

The reason for having two synchronization schemes is as follows. The video clock method assumes that the synchronization point is at the beginning of the speech file. If the speech file is quite long, there could be some loss of precision if the critical item that requires synchronization is near the end of the file. In this case, the trigger frame might be the preferable technique.

Errors in playback of speech.

About the only thing that could slow the machine down enough to affect the speech (assuming you have a 25 MHz 386 or better and a fast hard disk) would be a marginal sector on the hard disk: one that can still be read after a couple of automatic retries, so that the disk read takes up to eight times longer than normal (up to eight retries are attempted automatically before DOS does anything else). Perhaps a very large 16-color frame on a slow machine (a 25 MHz 386) could do it too. Any error results in the previous two buffers being played twice. There is code to detect this error, but due to the nature of the task it cannot be detected the same way every time. Sometimes Dmastr will tell you when in the item the error occurred; at other times it will not, as it will only detect the error when the speech finishes.

Another kind of error is bound to occur much more frequently and that is when the speech affects the visual display. The result may be that a visual frame is omitted altogether. One thing to be aware of here is that if the visual frame that was missed was the last frame of an item, then Dmastr will not report the error until the next item is displayed; it does however display the item number along with the frame so this is not a real problem.

The following will give some indication of just how much can be displayed. First, we need to know how much of a load the CPU is under while transferring speech to the DT2821. On our 33 MHz 486 with a 15 ms hard disk, the CPU in the worst case waited for 78% of the time; typically the load was much lighter. Each buffer of speech takes 125 ms to play, so for about 28 ms of each buffer (the remaining 22%) the CPU was tied up doing DOS disk access and was not available to move any images onto the screen. This would interfere with images displayed repeatedly at a one or two tick rate.

However, if there were only three frames and only the middle one was displayed for two ticks, then it would probably work fine. If it didn't, then moving the visual display one or two ticks earlier or later would probably fix it. This would permit the time-dependent planning of the visual frame to be carried out when speech has just been read from disk, as opposed to right when it is being read from disk. Alternatively, with the speech coming from extended memory (as in the case of a virtual disk), the CPU was waiting for 96% of the time, so you can do just about anything that the CPU would normally do without speech. In the actual tests we ran, VDMTG was displaying 26 frames of size 4" x 5", one frame per tick, while playing speech out of both channels simultaneously. VDMTG still kept up with that (with frames much bigger, however, it started to fall behind).
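The load figures quoted above follow from simple arithmetic, sketched here in Python. The function name is hypothetical; the 125 ms buffer and the 78% and 96% waiting figures are the ones reported in the text.

```python
def cpu_busy_ms(buffer_ms, wait_fraction):
    """Time per speech buffer during which the CPU is occupied with speech I/O."""
    return buffer_ms * (1.0 - wait_fraction)


# Worst case from hard disk: the CPU waits 78% of each 125 ms buffer.
print(round(cpu_busy_ms(125, 0.78), 1))  # roughly the 28 ms quoted in the text

# Speech read from extended memory: the CPU waits 96% of the time.
print(round(cpu_busy_ms(125, 0.96), 1))
```

With a refresh interval in the region of 10 to 15 ms, a busy period near 28 ms spans about two or more ticks, which is consistent with the warning about frames displayed at a one or two tick rate.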

Clock synchronization parameters.

VDMTG requires parameters for the acoustic part of the experiment as well as the graphical part. There are only three parameters, and these are also enclosed in curly braces, but placed on the parameter line. The parameters are the D, O, and R parameters.

The D parameter.

The D parameter is used to combat the slight difference between the clock used on the sound card and the millisecond timer used to time when a trigger should be issued. For example, if a sampling rate of 16.129 kHz is used with the DT2821 card, after a minute of playing files the millisecond clock is out by 9 milliseconds, which is an error of 0.015%. The same sampling rate on other cards would produce different errors.

The default value for this parameter is 0.00015. To determine the value of this parameter for your system, you should play a file out for a long time and then have a trigger and measure the difference. For example, in the following item,

0 { L beep m60000 } ;

the beep should begin 60000 ms after the beginning of the item. We measure this with a special program (LANA.EXE) running on another computer and a VOX. If you use a sampling rate other than 16129 Hz, then you will need to measure this error with a logic analyzer. Another way to do this is with an item file like the following, which sets output bits on the PIO card:

0 { L beep m60000 } o0 %4200 / o255 ;

By watching the output bits from Dmastr, you have a clear indication of when things actually started (if you have some refresh rate other than 70 Hz, the 4200 needs to be changed). You should see the output bit return high nine milliseconds before the beep (assuming you are using a DT2821 sampling at 16.129 kHz).

The D parameter in VM of 0.00015 stretches the VM time when it informs Dmastr of the trigger time (the VM 60000 becomes a Dmastr 60009). Additionally the D parameter shrinks times specified in the V time specification. There is no prohibition on negative values.
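The relationship between the measured drift and the D parameter can be sketched in Python as follows. The function names are hypothetical, and the multiplicative form of the stretch is an assumption that is consistent with the 60000 to 60009 example but is not taken from the VDMTG source.

```python
def d_parameter(drift_ms, elapsed_ms):
    """Fractional clock error: measured drift divided by the elapsed time."""
    return drift_ms / elapsed_ms


def stretched_trigger_ms(vm_ms, d):
    """Trigger time reported to Dmastr, assuming a stretch of (1 + D)."""
    return vm_ms * (1.0 + d)


d = d_parameter(9, 60000)   # 9 ms of drift over one minute
print(d)                    # close to the default of 0.00015
print(round(stretched_trigger_ms(60000, d)))  # close to the 60009 in the text
```

A negative D, such as the value quoted for the AdLib Gold, shortens the reported time under the same assumption.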

For the AdLib Gold 1000 card using a sampling rate of 11.025 kHz (which BLISS represents as 10.989 kHz), D-0.00281 works pretty well on our system. It is much more critical that the D parameter be used with the AdLib Gold than with the DT2821, because of the error in the representation of the actual sampling rate.

The R parameter.

The R parameter specifies a rational number (Rnnn,ddd) that represents the duration of the vertical retrace interval (expressed as a number of milliseconds over a number of vertical retraces). This is critical for the purposes of synchronizing the graphic display and the speech output. The values can be calculated by using the version of TIMEG provided in this package. To obtain them, use the 'modify Vertical total' V command, enter the number of lines that your application will actually be displaying (this is modified by the 'L' parameter in DMTG, the default value being 350), wait 30 seconds or so, and hit a key. TIMEG will provide you with the appropriate values. If the VM R parameter is not specified, then an approximate value will be used instead (the vertical retrace in VDMTG itself is only timed for three seconds). For example, an ITM file parameter line that reprograms the display to a refresh interval of nearly 10 milliseconds (with our video card) would be specified as follows:

L214 { R34447,3449 }
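The two numbers in the R parameter are a count of milliseconds and a count of retraces, so the refresh interval is their quotient. The Python sketch below uses the standard fractions module to keep the ratio exact; the function name is hypothetical.

```python
from fractions import Fraction


def refresh_interval_ms(milliseconds, retraces):
    """Refresh interval implied by Rnnn,ddd: nnn milliseconds per ddd retraces."""
    return Fraction(milliseconds, retraces)


interval = refresh_interval_ms(34447, 3449)
print(float(interval))         # just under 10 ms per retrace
print(float(1000 / interval))  # refresh rate in Hz, a little over 100
```

Keeping the value as a ratio avoids rounding the interval before it is multiplied by a large tick count such as v4200.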

Additionally if you wanted to test whether your R parameter values were good, an item file like this should be used:

0 { L beep v4200 } o0 %4200 / o255;

Here you should see the output bit go high within a millisecond of the beep.

The O parameter.

The O parameter is used to correct any constant timing errors introduced by external agencies. Originally, this was necessary when we used one computer generating speech to trigger another computer displaying graphics via a VOX, which introduced a constant delay. With VDMTG, there is no such delay. However, a delay could be introduced if a custom-built D/A startup routine were written. The default is 0 milliseconds (as in O0). Measuring what it should be is a matter of playing a file with a pulse in it, issuing a trigger at the beginning of that pulse, sending the pulse through a VOX, and timing the difference. This should be done with a short interval between the start of the item and the pulse (such as 10 ms).
