Back in the day I did a small study on audio compression. It’s far from scientific quality, but I nonetheless want to post it, so here we go. (I have edited out some of the blatant errors I found.)
Introduction
Motivation
Sound is usually compressed to save space, which matters in both transmission and storage. We may, for example, want to store so much audio that it would take far too much space uncompressed. We might also want to transmit audio over a path with limited capacity, for example to receive an MP3 file over a slow modem connection. Compression is subject to other constraints as well, many of which are discussed later.
Formats
Sound formats can generally be divided into lossless and lossy technologies. Lossless formats retain all the information of the original sample, so the original can be reconstructed exactly from the compressed data, whereas lossy technologies discard some of the original information.
Lossless compression formats
Lossless audio compression is a method of compressing audio without losing information: the original data can be reconstructed exactly from the compressed data. Audio files can be compressed with general-purpose compression software, such as ZIP, but it seldom reduces the file size much. Such tools are also poorly suited to live streaming and are therefore unsuitable for most modern solutions.
Many lossy compression methods have been introduced to allow efficient compression of audio. However, recent developments in hardware and network technologies have made it practical to preserve full sound quality with lossless audio compression methods. They are used mainly in specific areas of audio engineering and by people who don’t want any compromises in storing their sound clips.
Lossless compression algorithms usually compress a clip to around 50-60 % of its original size, depending heavily on the nature and complexity of the original sound. As simple compression methods generally try to find repetition and patterns in the original data, real-world sounds usually compress poorly with them. Modern lossless methods take a different approach: they use linear prediction to estimate each sample from the preceding ones and store only the prediction error, which is usually small.
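The prediction idea can be tried out with a few lines of shell. This is only a minimal sketch under stated assumptions: it uses a fixed second-order predictor (each sample is guessed by extending the straight line through the two previous samples), whereas real lossless codecs select or adapt their predictors and then entropy-code the errors, which is not shown. The function names lpc2_encode and lpc2_decode are made up for this illustration.

```shell
# lpc2_encode SAMPLE...: print the first two samples unchanged, then the
# prediction errors e[n] = x[n] - (2*x[n-1] - x[n-2]).
lpc2_encode() {
    p1=; p2=; out=
    for x in "$@"; do
        if [ -z "$p2" ]; then
            e=$x                               # warm-up: not enough history yet
        else
            e=$(( x - (2 * p1 - p2) ))         # error of the straight-line guess
        fi
        out="$out $e"; p2=$p1; p1=$x
    done
    printf '%s\n' "${out# }"
}

# lpc2_decode VALUE...: invert lpc2_encode and recover the exact samples.
lpc2_decode() {
    p1=; p2=; out=
    for e in "$@"; do
        if [ -z "$p2" ]; then
            x=$e
        else
            x=$(( e + 2 * p1 - p2 ))           # add the same guess back
        fi
        out="$out $x"; p2=$p1; p1=$x
    done
    printf '%s\n' "${out# }"
}

# A smooth run of samples turns into small numbers that are cheaper to store.
lpc2_encode 100 110 118 124 128
```

Because decoding adds back exactly the guess that encoding subtracted, the round trip is bit-exact, which is what makes the scheme lossless.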
Examples of lossless compression formats: Monkey’s Audio, FLAC, Shorten
Lossy compression formats
These compression formats are called ‘lossy’ because the compression discards some data from the original audio sample. It is therefore not possible to reconstruct the original sample from the compressed version. Lossy compression takes advantage of the fact that human perception of sound is imperfect; for example, the human brain does not normally notice whether a single sample point is missing from an audio file a few seconds long.
Two key techniques in audio compression are dithering and noise shaping. Dithering is used to get rid of side effects introduced when audio data is requantized. Requantization may introduce systematic and audible distortion into the compressed audio; for example, the ear may hear the original signal accompanied by additional harmonics. This can be remedied by dithering, i.e. adding noise in a controlled manner to the audio before requantization. Dithering reduces the systematic errors and distortion introduced in the quantization phase.
Noise shaping is a technique for moving the noise introduced by dithering and quantization into frequency bands where it is less audible to the ear. For example, noise could be moved to the 20 Hz – 60 Hz and 12 kHz+ bands, so that the compressed file sounds almost identical to the original. This approach works because the frequency response of the human ear is uneven.
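The dithering step (without any noise shaping) can be sketched roughly as follows. The assumptions are mine and not from the study: samples are requantized by a step of 256, as when reducing 16-bit audio to 8 bits, and the dither is the difference of two uniform random numbers, which gives a triangular distribution within one step either way. The function name requantize_with_dither is hypothetical.

```shell
# requantize_with_dither SEED SAMPLE...: add triangular dither to each sample,
# then round it to the nearest multiple of the step and print the new level.
requantize_with_dither() {
    seed=$1; shift
    printf '%s\n' "$@" | LC_ALL=C awk -v seed="$seed" -v step=256 '
        BEGIN { srand(seed) }
        {
            d = (rand() - rand()) * step    # triangular noise in (-step, step)
            q = ($1 + d) / step + 0.5       # +0.5 so that floor() rounds to nearest
            f = int(q); if (f > q) f--      # floor that also works for negatives
            print f
        }'
}

# The same input value now lands on neighbouring levels instead of always
# producing the same rounding error.
requantize_with_dither 1 1000 1000 1000 1000 1000 1000 1000 1000
```

Without dither the value 1000 would always round to level 4; with dither the output varies between neighbouring levels, which trades the systematic error for noise.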
Examples of lossy compression formats: MP3, AAC, Ogg Vorbis
Software and hardware audio compression
Thanks to rapid development in microprocessor technology, almost all modern processors used in desktop computers, mobile phones and other handheld devices can perform complex audio encoding and decoding in real time. Some years ago this was not the case: the GSM 06.10 compression algorithm, for example, could not be run efficiently enough without dedicated hardware. Such highly tailored microprocessors are still used in handheld devices such as MP3 players and mobile phones.
There is a vast number of software-based solutions on both the commercial and the non-commercial market. They vary in e.g. compression ratio, encoding and decoding speed, robustness and other specific features. Many of them are licensed so that decoding is free but encoding requires a commercial license. There are also completely free open-source solutions in both the lossy and the lossless field (Ogg Vorbis, FLAC). In some cases there are both commercial and non-commercial implementations of the same codec (e.g. LAME for MP3).
Sound quality
Methods for analyzing the quality of compressed sound can generally be divided into systematic scientific methods and subjective methods.
Spectrographic methods
Spectrum analysis (SA) is a way of viewing a signal in the frequency-time domain rather than the traditional amplitude-time domain. SA is an essential tool for determining the frequency distribution of a signal (or audio file) and can therefore be used to analyze the quality of compressed files. SA may pinpoint unwanted harmonics in audio data and show whether artifacts from noise shaping are distributed in frequency bands less audible to the ear. It may also reveal noise introduced by dithering. The Fourier transform is the mathematical theory behind SA. In short, each time slice of the original signal is multiplied by e^(-i 2π f t) and integrated over time. This yields the signal data in the frequency domain.
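Written out, the time-slice integral is what is usually called the short-time Fourier transform. The symbols below are assumptions for illustration and do not come from the study: x(t) is the signal, w is a window function that selects the slice around time τ, and f is the frequency.

```latex
X(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-i 2 \pi f t}\, dt
```

A spectrogram then plots the magnitude |X(τ, f)| (or its square) with time τ on one axis and frequency f on the other.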
Subjective methods
A great example of the subjective nature of sound perception is that some people claim that even CD-quality sound (WAV, 44.1 kHz) loses valuable information and makes certain types of music sound worse than analog technologies such as gramophone records.
Listening tests can be used to determine the quality of compressed audio. Conditions must be as good as possible both technically (decent headphones) and psychologically: to reduce the placebo effect, the order of the samples may be varied between listeners, and several identical samples (compressed or uncompressed) may be played. The results must also be evaluated carefully and deemed conclusive only if a statistically significant number of test subjects supports them.
Applications of sound compression
Applications of audio compression can be found almost anywhere nowadays. Digital audio has replaced analog audio in almost all areas of communication, music and home entertainment. The Internet has shaped the development of audio compression: because bandwidth is limited, compression is essential when transferring audio files, music and real-time speech. MP3 has revolutionized the music industry, and iTunes and other online music stores rely on files being small enough. Modern MP3 players can hold up to 10 000 songs, about 10 times more than would be possible without audio compression.
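The "about 10 times" figure can be sanity-checked with simple arithmetic. The sketch below assumes CD-style PCM (44.1 kHz, 16 bits, stereo) and a compressed bitrate of 128 kbit/s; the 128 kbit/s figure is my assumption, not something stated above, and the helper names are made up.

```shell
# pcm_bitrate RATE_HZ BITS CHANNELS: uncompressed PCM bitrate in bit/s.
pcm_bitrate() {
    echo $(( $1 * $2 * $3 ))
}

# size_ratio PCM_BITS_PER_S COMPRESSED_KBIT_PER_S: how many times smaller the
# compressed stream is, truncated to one decimal place (1 kbit = 1000 bit).
size_ratio() {
    tenths=$(( $1 * 10 / ($2 * 1000) ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

cd_rate=$(pcm_bitrate 44100 16 2)
echo "CD audio: $cd_rate bit/s"
echo "ratio at 128 kbit/s: $(size_ratio "$cd_rate" 128)"
```

At higher bitrates the ratio shrinks accordingly, so the "10 times" claim holds only for moderately compressed files.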
Methods
Introduction
This section describes the necessary steps and technology involved in the project of researching the differences between sound compression methods. The overall field of sound compression methods is described in the preliminary research conducted earlier.
Objectives
The study aims to examine the differences between lossy sound compression methods. It focuses on three of them: MP3, AAC and Ogg Vorbis. The quality of each (at different bitrates) is examined with both a subjective computer-aided listening test and a more systematic spectrographic analysis. More focus is put on the subjective methods despite their problematic nature (1).
Scope of the work
The work is limited to studying the differences between lossy compression methods. Lossless compression methods are intentionally left out of this study, although they are described on a general level in the preliminary research.
Work practices and methods
This study takes two different approaches to analyzing the quality of sound compression methods: subjective and spectrographic.
Subjective methods
In this assignment we will test three lossy sound encoding technologies, namely MP3, Ogg Vorbis and AAC. We will test the perceived quality of sound by arranging computer-aided listening tests. As these tests are highly subjective by nature, we will try to provide the same listening conditions for everyone: the same laptop is used throughout all the tests, with the same headphones attached and roughly the same volume settings. The same program is also used to play the sound files. A 6-second clip of rock music was chosen as the basis of testing. The clip will be recorded digitally from a CD to a WAV file.
Test subjects
Test subjects will be our friends and relatives, and we hope to get at least 15 of them. We accept only test subjects who have no hearing impairment. We earlier considered using a web-based testing system to get many more participants, but because there were too many ‘moving parts’ in the equation (for example, many web browsers don’t play MP3 or WAV files by default on some operating systems), we decided to use in-person testing, involving a tester, a computer with accessories and the test subject. We also limited the number of audio types to one (rock music) so as not to overburden the participants.
Per format testing
For each format the 6-second clip of rock music will be compressed at bitrates (kbit/s) 16, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256 and 320. If a format has no equivalent bitrate, the nearest one will be used. After encoding, all the files are decoded back to WAV files, so that we can use a single sound player and don’t need to worry about processing delays. The resulting WAV files of course differ from the original. A simple Unix shell script for playing the different clips will be written. The user is asked to listen to the files and mark in the program (or on paper) the bitrate at which the sound 1) sounds even slightly different from the original and 2) is of very different quality from the original. At any moment the user can listen to the original, uncompressed reference sound file. The user can also ask the tester for technical help at any time.
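The "nearest bitrate" rule can be expressed as a small helper. This is a hedged sketch rather than the script actually used: the function name nearest_bitrate is hypothetical, and the list of available rates is the set of approximate AAC bitrates that appears in Appendix A.

```shell
# nearest_bitrate TARGET AVAILABLE...: print the available bitrate closest to
# TARGET. On a tie the bitrate listed first wins.
nearest_bitrate() {
    target=$1; shift
    best=$1; best_diff=-1
    for b in "$@"; do
        if [ "$b" -ge "$target" ]; then
            diff=$(( b - target ))
        else
            diff=$(( target - b ))
        fi
        if [ "$best_diff" -lt 0 ] || [ "$diff" -lt "$best_diff" ]; then
            best=$b; best_diff=$diff
        fi
    done
    echo "$best"
}

# Approximate bitrates (kbit/s) listed for the AAC files in Appendix A.
aac_rates="32 40 47 56 63 80 97 112 128 160 192 224 256 310"
for want in 16 48 320; do
    echo "$want -> $(nearest_bitrate "$want" $aac_rates)"
done
```

With these rates a request for 48 kbit/s maps to 47, and requests outside the available range fall back to the lowest or highest rate.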
Spectrographic methods
We will use spectrography (the Fourier transform) to show the spectral representation of the sound clips. In the frequency domain we can easily see the distribution of frequencies and observe how each encoding method shapes the original sound data. We’ll try to spot differences between clips encoded with different bitrates and encoding formats.
Software
List of software to be used:
Grip – for ripping the sound file from a CD
Audacity – for searching a suitable 6 second clip from the ripped file and observing the spectral representation of the sound file
Lame – for encoding the sample clip to MP3 file
Oggenc – for encoding the sample clip to Ogg (Vorbis) file
Faac – for encoding the sample clip to AAC file
Aplay – for playing the actual, resulting WAV files
iTunes – for encoding the sample file to MP3 and AAC formats if Lame and/or Faac can’t produce files with low enough bitrates
Bash – for hosting the script to play the different WAV files and report perceptions
Matlab – for observing spectral representation of the sound files if Audacity is deemed ineffective for the task
Schedule and estimated hours of work
Task | Hours | Week |
Preliminary work, selecting the topic etc. | 4h | 5 |
Preliminary research and documentation (preliminary report) | 10h | 7-8 |
Planning the schedule, methodology etc. Documentation (work plan) | 14h | 10-11 |
Designing and encoding the test clips | 4h | 12 |
Programming the program used in listening tests | 5h | 12 |
Conducting the listening tests | 5h | 13 |
Analyzing the listening test results | 10h | 14 |
Spectrographic analysis of the clips | 10h | 14 |
Writing the report | 12h | 15 |
(Presenting the results) | 4h | ? |
Total | 78h | 5-15 |
Table 1 – Estimated hours of work
Results
This section describes the results of the tests conducted along with the spectrum analysis of the compressed clips.
Subjective listening tests
Subjective listening tests were conducted in April 2008 according to the initial plan. Some changes were made during the work: Ogg Vorbis was dropped from the list of tested compression methods because it was nearly impossible to compress the clip enough with it to hear any change in sound quality.
Before the tests were carried out, the needed sound files were made. We chose an excerpt from the song ‘Sultans of Swing’ by Dire Straits. First, the whole track was ripped from an original CD with Grip 3.3.1. The output format was 16-bit 44.1 kHz stereo WAV, to give us total control of the encoding process. The file was cut to about 6 seconds in length with Audacity 1.3.4-beta. The resulting file was encoded with Lame 3.97 to constant-bitrate MP3 at the following bitrates: 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320. The WAV file was also encoded to AAC files of approximately the same bitrates with Faac 1.26.1 by using different quality values. The resulting files were decoded back to WAV: lame was used to decode the MP3 files and faad 2.6 the AAC files. Decoding back to WAV was done because we did not want the decoding delay to affect the test.
The tests were conducted on eight individuals, 1 female and 7 male. The hardware used was an oldish IBM 600X laptop and a pair of Sennheiser HD 497 headphones. The laptop was running Arch Linux with kernel version 2.6.24. The sound card integrated in the laptop is a Sound Fusion CS46xx, which has a Cirrus Logic CS4297A rev 3 sound chip. All unnecessary programs were closed to ensure undisturbed playback. All test subjects except one were feeling healthy, and the tests were done in a sufficiently relaxing environment in classroom MT268 of Maarintalo. The classroom was reserved for our experiment to keep the conditions the same for all the test subjects.
Test subjects were recruited on the spot by telling them about the experiment and offering a chocolate bar as thanks for participation. Out of 10 candidates, 8 agreed and 2 declined due to time constraints. Each test subject was escorted to the classroom and given an oral description of the test. A written description of the test (in English) was also available next to the computer. For marking the results, participants were given a pre-printed form (in English). The test subject was reassured that he/she had all the time he/she needed to complete the test and that, should any problems or questions arise, it would be perfectly okay to ask the researcher. During the test the researcher made no sudden or nervous movements and remained calm, reading a calendar. Afterwards the (completely anonymous) answer sheet was collected, and the participant was thanked and given the promised chocolate bar. After the last participant had left the classroom, the results were collected into a file for later analysis.
We used a small program (listing in Appendix B) to help in testing. The backend of the program was written in bash, and for the user interface we used dialog version 1.1-20071028. All the sample files were renamed to the form {integer}.{type}.wav, for example 4.mp3.wav. The program was designed to make it extremely easy to play the sound files. Upon starting, the user is shown the dialog below and may select “Yes” to start the test.
Next, the user is shown this menu for choosing which samples to listen to. The user would normally select AAC first.
Now the actual sample list is shown. The user may press Enter on a sample he/she wants to listen to. High-quality samples are at the top and lower-quality samples below. We placed a reference sample between the actual test samples, so that the user would not need to go to the top of the list, listen to the original sample there, and then return to the test sample. Whenever the user presses Enter on a sample in the list, all existing playback is stopped and the new file is played. After testing all the samples, the user exits this menu and selects the other file type for testing.
Results from subjective tests
Table 2 shows the results from subjective listening tests conducted. Perceived quality of both compression methods was estimated by the listeners and the results were collected into a table. Average, median and standard deviation were calculated from each of the four categories.
ID | Status | First different aac bitrate | Bad aac bitrate | First different mp3 bitrate | Bad mp3 bitrate |
1 | had flu | 80 | 63 | 63 | 63 |
2 | normal | 97 | 80 | 97 | 63 |
3 | normal | 112 | 80 | 80 | 63 |
4 | normal | 192 | 80 | 80 | 56 |
5 | normal | 97 | 80 | 80 | 63 |
6 | normal | 128 | 63 | 112 | 63 |
7 | normal | 128 | 128 | 97 | 97 |
8 | normal | 97 | 97 | 112 | 80 |
Average | | 116 | 84 | 90 | 69 |
Median | | 104.5 | 80 | 88.5 | 63 |
Standard deviation | | 20.7 | 11.7 | 11.7 | 8.3 |
Table 2 – Test results from subjective listening tests
The averages and medians seem to indicate that MP3 was generally perceived to be better: the limitations of AAC were noticed at slightly higher bitrates. The first limitations in the sound quality of MP3 were noticed at an average bitrate of 90 kbit/s, whereas the same figure for AAC was 116 kbit/s. MP3 was perceived to remain at an acceptable level down to 69 kbit/s on average, whereas AAC became unacceptable at 84 kbit/s.
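For anyone who wants to recompute the summary rows of Table 2, a small sketch using sort and awk could look like this. The function name summarize is made up, and the standard deviation here uses the sample (n-1) formula as an assumption: the study does not say which estimator produced the table's figures, so the value printed need not agree with the table.

```shell
# summarize VALUE...: print mean, median and sample standard deviation.
summarize() {
    printf '%s\n' "$@" | sort -n | LC_ALL=C awk '
        { v[NR] = $1; sum += $1 }
        END {
            mean = sum / NR
            # middle value, or the mean of the two middle values for an even count
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            for (i = 1; i <= NR; i++) ss += (v[i] - mean) * (v[i] - mean)
            sd = (NR > 1) ? sqrt(ss / (NR - 1)) : 0    # assumption: n-1 estimator
            printf "mean=%.1f median=%.1f sd=%.1f\n", mean, median, sd
        }'
}

# "First different AAC bitrate" column of Table 2, listeners 1 to 8.
summarize 80 97 112 192 97 128 128 97
```

The other three columns can be checked the same way by passing their eight values to summarize.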
Spectrum analysis results
We briefly examined the nature of the encoded and decoded sample files. The original sample looks like this:
As you can see, all frequencies are more or less occupied. Next, let’s see what a mid-quality MP3 file looks like:
As you can see, the frequencies have been cut back, maxing out at 15 kHz. We think that the spikes seen in the picture are not part of the actual file but residue from the decoding process. For reference, here’s a picture of a mid-quality AAC file:
The occupied frequency band is narrower than in the MP3 file. The spikes are visible here, too. Next, let’s take a look at the low-quality spectra. First, here is the MP3 file:
Frequencies now go only as high as 5 kHz, with the occasional spike here and there. Lastly, here’s the low-quality AAC file:
This spectrum is pretty much like that of the low-quality MP3. The spike positions are a bit different.
As can be seen, lossy audio encoding drops the higher frequency bands and keeps only the lower ones. No information needs to be stored for the missing higher bands, and therefore space is saved in the file.
Discussion
In the light of our tests, MP3 seems to give slightly better perceived sound quality than AAC. In theory AAC should give slightly better results, so the outcome is somewhat surprising. The MP3 codec used in the tests was Lame, which has been around for many years and should give very high quality results. AAC is newer, and its implementation on Unix might have some limitations. These facts might have influenced the test results in MP3’s favor.
The rather high standard deviations suggest that the subjective listening experience can vary a lot between individuals. The test subjects were fairly unanimous in identifying the level at which the quality became very poor. More variance was found in identifying the first sample where some compression can be heard: one test subject heard changes at 192 kbit/s, whereas another didn’t hear any changes before 80 kbit/s.
The tests were conducted on only eight individuals, which most likely affected the results more than it should have. More tests should be conducted to improve the results and bring more academic credibility to the study.
Subjective methods of measuring sound quality can be criticized in many ways in terms of reliability and the subjective nature of sound. Then again, there are numerous hi-fi enthusiasts who insist that they hear limitations even in CD-quality sound, even though it should be more than good enough for our limited hearing. The only way to measure these differences is to keep conducting subjective tests.
Future directions for research
It would be interesting to study the reliability of subjective listening tests further. One idea that we came up with was to ask the users to sort e.g. 8 sound clips in order of perceived quality. This could show how reliably users can actually hear differences between e.g. 192 kbit/s and 128 kbit/s sound clips.
The test described in this document could be repeated with a larger set of test subjects to improve reliability. Furthermore, instead of measuring the differences between sound compression methods, it could be more interesting to compare different implementations of the same compression method. This could expose differences between encoders and settle the discussion about e.g. which MP3 encoder gives the best results.
Literature cited
- http://wiki.hydrogenaudio.org/index.php?title=Lossless_comparison
- http://en.wikipedia.org/wiki/Audio_data_compression
- http://www.vocal.com/data_sheets/gsmfr.html
- http://en.wikipedia.org/wiki/Lossless_data_compression
- http://en.wikipedia.org/wiki/Lossy_data_compression
- http://en.wikipedia.org/wiki/Dither#Digital_audio
- http://en.wikipedia.org/wiki/Noise_shaping#Noise_shaping_in_digital_audio
Appendix A – Encoding/decoding scripts used
MP3
for i in 32 40 48 56 64 80 96 112 128 160 192 224 256 320; do
    lame -s 44.1 --resample 44.1 --cbr -b $i --noreplaygain ripping/dire_straits__sultans_of_swing_excerpt.wav tmp.mp3
    lame --decode -s 44.1 tmp.mp3 encdecwavs/mp3/dire_straits__sultans_of_swing_excerpt_br$i.mp3.wav
    rm tmp.mp3
done
AAC
#!/bin/sh
# Approximate bitrate (kbit/s) for each faac quality value.
qvaltobr12="32"
qvaltobr20="40"
qvaltobr29="47"
qvaltobr35="56"
qvaltobr41="63"
qvaltobr53="80"
qvaltobr65="97"
qvaltobr78="112"
qvaltobr88="128"
qvaltobr113="160"
qvaltobr141="192"
qvaltobr201="224"
qvaltobr290="256"
qvaltobr500="310"
for i in 12 20 29 35 41 53 65 78 88 113 141 201 290 500; do
    faac dire_straits__sultans_of_swing_excerpt.wav -q $i -o tmp.aac
    # Indirect lookup of qvaltobr<quality value> for the output file name.
    eval "br=\$qvaltobr$i"
    faad tmp.aac -o encdecwavs/aac/dire_straits__sultans_of_swing_excerpt_br${br}_qval$i.aac.wav
    rm tmp.aac
done
Appendix B – Test program
#!/bin/bash
# Listening-test front end: the subject picks a codec and plays samples with aplay.
ANSWERFILE=$(mktemp)

# play_menu DIR LABEL: list the samples of one codec and play the selection.
# Odd menu tags play the original reference; even tag N plays sample N/2.
play_menu() {
    local type=$1 label=$2 id=0 samples=() answer=1 ds f
    for f in "$type"/*."$type".wav; do
        samples+=("$((id*2+1))" "Original" "$((id*2+2))" "$label sample $((id+1))")
        id=$((id+1))
    done
    while true; do
        dialog --stdout --default-item "$answer" --menu "PLAY $label SAMPLE" 40 60 60 "${samples[@]}" >"$ANSWERFILE"
        ds=$?
        killall -q aplay                 # stop whatever is still playing
        [ "$ds" -ne 0 ] && break         # Cancel or Esc leaves the sample list
        answer=$(cat "$ANSWERFILE")
        if [ $((answer % 2)) -eq 1 ]; then
            aplay "$type/original.wav" >/dev/null 2>&1 &
        else
            aplay "$type/$((answer / 2)).$type.wav" >/dev/null 2>&1 &
        fi
    done
}

while true; do
    dialog --yesno "Select Yes to begin new test" 10 40 || break
    while true; do
        dialog --stdout --menu "Select test file type" 10 40 40 "AAC" "" "MP3" "" >"$ANSWERFILE" || break
        case $(cat "$ANSWERFILE") in
            AAC) play_menu aac AAC ;;
            MP3) play_menu mp3 MP3 ;;
            *) break ;;
        esac
    done
done
rm -f "$ANSWERFILE"
echo "all done"