Graphing sound volume x time in Flash and other Sound considerations

The ActionScript documentation for the Sound class has a lot to say about getting sound data. However, while it dives into the "how", it does not cover the "what" very well. What exactly is this sound data? What do these numbers mean?

While it may be obvious to some, it was not clear to me what these numbers meant or how to use them. The numbers extracted for the left / right (or mono) channel of mp3 audio data are points on a wave. Over time, they form sine-wave-like patterns that are interpreted as sound. Pure machine-generated "tones" actually look like sine waves if you graph out their sample data. "Rich" real-world sounds like voice or other natural recordings are not so clean: they are made from combined waves overlapping each other, any of which can vary over time in pitch and volume. The net result looks like a rough seismograph sketch or a patch of grass - spikes here and there, sometimes tufts of spikes, etc.

Note that bytesTotal most likely refers to the size of the file being streamed - not the size of the ByteArray after Sound has normalized it.

Extract of Frustration

The most frustrating thing for me was that the extract() documentation was misleading. If you want, for instance, to pull the entire sound into a ByteArray, you will want to call mySound.extract(myByteArray, Math.floor(44.1 * mySound.length), 0). The second parameter is where the confusion comes from (for me, anyway). It is a count of samples, not bytes: mySound.length is in milliseconds, and there are 44.1 samples per millisecond. You would expect that it would be equal to mySound.bytesTotal/8; it is not!
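To make that arithmetic concrete, here is a minimal, untested sketch. It assumes mySound is a Sound that has already finished loading; SAMPLES_PER_MS, BYTES_PER_SAMPLE, sampleCount and expectedBytes are names made up for this illustration.

```actionscript
import flash.media.Sound;
import flash.utils.ByteArray;

// Assumption: mySound is a Sound that has finished loading.
var SAMPLES_PER_MS:Number = 44.1; // 44,100 samples per second
var BYTES_PER_SAMPLE:int = 8;     // one 4-byte left float + one 4-byte right float

// Sound.length is in milliseconds, so this is the sample count for the whole sound.
var sampleCount:Number = Math.floor(SAMPLES_PER_MS * mySound.length);
var expectedBytes:Number = sampleCount * BYTES_PER_SAMPLE;

var myByteArray:ByteArray = new ByteArray();
// extract() returns the number of samples it actually wrote.
var extracted:Number = mySound.extract(myByteArray, sampleCount, 0);

trace("samples requested:", sampleCount, "samples extracted:", extracted);
trace("byte array size:", myByteArray.length, "expected about:", expectedBytes);
trace("bytesTotal (size of the compressed file):", mySound.bytesTotal);
```

Comparing the last two traces should show the gap between the raw PCM size and the size of the mp3 file.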



Note - a poster made this comment as to why bytesTotal does not correspond to the necessary size of the ByteArray:

 I think the bytesTotal property on a Sound object refers to the compressed MP3 data, and so it's significantly smaller than the size of the raw PCM data after extraction. The compressed size value comes in handy for knowing how much of an external mp3 is left to download, etc. In any case you've found a good solution. 

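The values retrieved are packed floats in left, right, left, right order. So, for instance, if you want to get the left and right channels of the first second of your sound sample, the code looks like this. (I am assuming said code is embedded in a load complete handler.)

import flash.utils.ByteArray;

var b:ByteArray = new ByteArray();
// 44100 samples = the first second; each sample is a left float plus a right float.
mySound.extract(b, 44100, 0);
b.position = 0; // rewind before reading back what extract() wrote
var lefts:Vector.<Number> = new <Number>[];
var rights:Vector.<Number> = new <Number>[];

// Each stereo sample occupies 8 bytes: a 4-byte left float, then a 4-byte right float.
while (b.bytesAvailable >= 8) {
  lefts.push(b.readFloat());
  rights.push(b.readFloat());
}

At this point there should be 44,100 floats in each of the left and right channel arrays (provided the sound is at least one second long).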

Analyzing the Output

There are some interesting scaling issues when it comes to wave form analysis. It's worth noting that when looking for "Volume", the depth of a valley BELOW zero is just as noisy as the peak of a wave above zero. I have used "absolute height" below, and the red dots show the target volume indications - valley/peak heights.

  1. There are 44,100 measurements per second in every Sound file in Flash. (Regardless of the input quality, Flash normalizes all sounds to this resolution.)
  2. Volume is indicated by amplitude - the height of a wave.
  3. Finding the wave form peaks of every sound by analyzing every point of data is inefficient and probably uses up a fair amount of RAM. (Some of the sound files are two to three hours long - at three hours, that is 476 MILLION floats per channel to analyze!)
  4. If you take a distributed range of samples, then you will be hitting the side of the wave form much more often than the peak. (Sound waves can take 5-20 or so measurements to oscillate.) The green dots show how haphazard random sampling is - only a few of them (probably fewer than I've sketched here) reflect the height of the overall waveform. Remember that for every peak the wave crosses the line, so you are as likely to measure a zero as you are to measure a wave peak.

I have gone down three roads with this:

  • Take evenly spaced samples and find the maximum sample in 10-sample "Buckets"
  • Find as many zeros as you can and sample the points exactly between two zeros
  • Search around evenly distributed samples and try to find the highest measurement within 5 or so pixels of each sampled point.

The first system seems to be fastest and that is what I am currently working with. All attempts to get "Clever" about finding wave peaks are very slow. 

If you notice, I sample a set of 5-6 evenly distributed measurements and take the maximum of them. This gives you a reasonable sample of the wave form. In fact, the ratio of samples to waves in my current system is coarser - probably at most 1-2 measurements per wave, meaning you are getting the maximum of 10 points across 3-4 adjacent waves for every drawn point on the curve.
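The bucket approach can be sketched roughly as follows. This is an illustration only, not the code behind the graphs in this article: the function name bucketPeaks and its stride and bucketSize parameters are made up for the example, and it uses the "absolute height" idea from above so that valleys count as much as peaks.

```actionscript
// Walks one channel at a fixed stride and returns, for every bucketSize
// visited samples, the largest absolute amplitude seen in that bucket.
// bucketPeaks, stride and bucketSize are hypothetical names for this sketch.
function bucketPeaks(samples:Vector.<Number>, stride:int, bucketSize:int):Vector.<Number> {
  if (stride < 1 || bucketSize < 1) {
    throw new ArgumentError("stride and bucketSize must be at least 1");
  }
  var peaks:Vector.<Number> = new Vector.<Number>();
  var peak:Number = 0;
  var count:int = 0;
  for (var i:int = 0; i < samples.length; i += stride) {
    var height:Number = Math.abs(samples[i]); // a valley is as loud as a peak
    if (height > peak) peak = height;
    count++;
    if (count == bucketSize) { // bucket is full: record it and start a new one
      peaks.push(peak);
      peak = 0;
      count = 0;
    }
  }
  if (count > 0) peaks.push(peak); // keep a partial bucket at the end
  return peaks;
}

var demo:Vector.<Number> = new <Number>[0.1, -0.6, 0.3, 0.0, -0.2, 0.4, 0.9, -0.5];
trace(bucketPeaks(demo, 2, 2)); // 0.3,0.9
```

Note that the stride of 2 in the demo skips right over the -0.6 valley, which is exactly the trade-off described above: a coarser stride is faster but can miss the true peak.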

Fast Forward

One task inherent in this challenge was playing sound "Fast forward" -- known in "the Business" as "Chipmunking".* While there are "Up Octave" examples in the docs that look like they accomplish this task, in practice they have an odd side effect.

The Up Octave examples remove every other sample; this (superficially) succeeds in doubling the octave, and if you try it on tonal / instrumental music you won't notice the flaw in this technique. However, play speech and you will notice that what happens is that the samples are "Microlooped" - that is, if someone says "The bus is late" it sounds like "ThThe e bu bus s isis lala tete". This Microlooping sounds fine if you are listening to a solid tone like a horn note - but it's awful for speech. Worse yet, it doesn't actually play FASTER - it just plays at a higher pitch. I think this flaw is the result of the fact that the extract() example executes in the context of a moving stream, packet by packet.

Imagine, if you will, the packets as train cars, and the train is moving at a constant speed. A very observant cow standing near the tracks counts every passenger as they pass and divides by the amount of time it took from the first passenger to cross to the last passenger to cross. (The cow has a watch and a pocket calculator. As I said, this is a very special cow.) If the train takes 10 seconds to cross from tip to caboose, has 50 passengers in every car, and has a total of 20 cars, then the cow computes (50 * 20)/10 = 100 passengers/second.

What the octave example does is to tell every passenger to cram into the front half of whichever car they are on. (And in fact, as soon as they pass the cow, the passengers "loop back" to the end of their car, making it look to the cow like each car is twice as dense.) So in one sense, if the cow was not quite as observant, it would look like there are 100 passengers per car, changing the equation to (100 * 20)/10 = 200 passengers/second.

However, as I mentioned, this cow is very observant, and he notes that in reality, although they were packed twice as densely, the actual number of passengers that pass him is still 1,000, and the amount of time that it took to pass him is still the same, and so the real rate of transmission is still 100 passengers/second, even though the density has been doubled.

Making it work good

What we really want is to get all the passengers packed twice as densely and the distance from the first passenger to the last half as long. That is, we want 100 passengers per car, and only 10 cars (the first ten) to be full. THIS is the effect we want when we say "Play the sound twice as fast". So we let the passengers cram themselves into the first ten cars at 100 passengers per car. The cow now sees 100 passengers per car, and notices that they are only contained within the first 10 cars. Note as well that if it takes 10 seconds for 20 cars to go by, and the speed remains constant, it will take only 5 seconds for the first 10 cars to go by.

This means 100 passengers * 10 cars / 5 seconds = 200 passengers per second. We have truly doubled the density of information per second. 

The way you do this in Flash Sound is NOT to use the extract/octave example. What you have to do is

  1. extract the sound AFTER the sound has completely loaded on the client**
  2. use readFloat() to pull the stereo samples into two float vectors (for left and right sound)
  3. take every other float from each float vector
  4. write this compressed data into a new byte array
  5. create a new sound object
  6. use sound.loadPCMFromByteArray() on this new object to write the compressed data in ***
  7. play the new sound object. 

Keep in mind that it is possible to create the new sound object from only the left OR right data - this will only save you a tiny sliver of time, but if your source sound file is itself a mono sound sample, there is no real point in sampling BOTH the left and right channels when building your compressed byte array.
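The numbered steps above can be sketched roughly as follows. Treat this as an untested illustration under stated assumptions rather than the exact code used here: makeDoubleSpeed is a made-up helper name, mySound is assumed to be a fully loaded stereo Sound, and for brevity the sketch copies every other sample straight from one byte array to the other instead of building the intermediate float vectors.

```actionscript
import flash.media.Sound;
import flash.utils.ByteArray;

// Hypothetical helper: builds a new Sound that plays `source` at double speed.
// Assumes `source` has finished loading (e.g. call this from a complete handler).
function makeDoubleSpeed(source:Sound):Sound {
  // Step 1: extract the whole sound as raw PCM (44.1 samples per millisecond).
  var raw:ByteArray = new ByteArray();
  source.extract(raw, Math.floor(44.1 * source.length), 0);
  raw.position = 0; // rewind before reading

  // Steps 2-4: keep every other stereo sample and write it to a new byte array.
  var fast:ByteArray = new ByteArray();
  var kept:uint = 0;
  var keep:Boolean = true;
  while (raw.bytesAvailable >= 8) {
    var left:Number = raw.readFloat();
    var right:Number = raw.readFloat();
    if (keep) {
      fast.writeFloat(left);
      fast.writeFloat(right);
      kept++;
    }
    keep = !keep;
  }

  // Steps 5-6: load the thinned PCM data into a new Sound (Flash Player 11+).
  fast.position = 0; // loading reads from the current position
  var result:Sound = new Sound();
  result.loadPCMFromByteArray(fast, kept, "float", true, 44100);
  return result;
}

// Step 7: play the new sound object.
makeDoubleSpeed(mySound).play();
```

For very long recordings both byte arrays live in memory at once, so the RAM concern raised earlier applies here too.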

*** Sound.loadPCMFromByteArray(...) requires Flash Player 11. Flash Player 11 requires that you debug in the BROWSER because the IDE will not let you debug in Flash Player 11 - it uses 10. I don't know how it got this way, and by the time you read this it could be paved over, but it does. You have to manually patch in the SDK as well - beyond the scope of this article, but it's messy.


* Okay. Nobody really calls it that. But that's what it sounds like. 

** There may be a way to extract it as an asynchronous action BEFORE it is completely loaded, but you would probably have to use a different event, such as the progress event, to trace and interpolate the byte stream status.
