Digital Storage

In this section, we will cover six main questions:

What is Fourier transform?
What is lossy compression and lossless compression?
How do compact discs (CDs) work?
How do MP3s work?
How does digital streaming work?
What's in store for the future?

Fourier transform

Introducing the Fourier (“Four-yay”) Series
In his memoir On the Propagation of Heat in Solid Bodies, Jean-Baptiste Joseph Fourier revolutionized the world of math and science with one groundbreaking intuition: essentially any waveform, no matter how complex, can be broken down into simple sinusoids. The Fourier Transform decomposes a finite-length discrete time vector (or, for our purposes, a soundwave!) into a collection of simple sinusoids (sine and cosine waves) as seen below:

Source: Project Rhea

Sinusoids? Greek letters? What does any of this mean?

Glad you asked! If you are mathematically fluent at the level of calculus, feel free to skip ahead to the next section. If you're newer to mathematics, we will break it down as best we can:

A “sinusoid” is just a fancy name for sine or cosine waves, which you may have encountered in trigonometry. If you haven’t, that’s no problem - they’re just hilly, periodic (repetitious) graphs defined by a period (the length of one full wave) and an amplitude (the height of the wave). That sigma (the zig-zagged E-looking thing) represents the sum of simple sinusoids, which are the frequency components we talked about. The a’s are mathematical constants (or, more simply, numbers) that assign each sinusoid a certain level of “importance.” The larger the a, the more important the component. You might also be wondering why we use sines and cosines (sinusoids). That’s a tad advanced, but we’ll try to keep it simple: they’re actually just a convenient way to represent complex exponentials (that is to say, exponentials containing i, a number defined as sqrt(-1)). We’re able to use sinusoidal representation because of a property known as Euler’s formula, which you may encounter when you study Complex Numbers.
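For readers who want to see the symbols written out, here is one common way to write the idea. This is a sketch in our own notation, not necessarily the exact form used in the figure above: we assume a periodic signal x(t) with fundamental frequency f_1, and the index k and the constants a_k are our labels for the components and their "importance."

\[ x(t) = \sum_{k} a_k \, e^{i 2\pi k f_1 t} \]

Euler's formula is what connects each complex exponential to a cosine and a sine:

\[ e^{i\theta} = \cos\theta + i \sin\theta \]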

Okay, that’s all well and good. But what does that mean, practically?

The Fourier Transform may sound abstract, but it's directly translatable to music you know and love. Have you ever listened very closely to a single note on the piano? You can often hear more than just that note. If you listen closely, you'll hear the same pitch an octave up, as well as the fifth (known as "sol"), the third (known as "mi"), and even a flat seventh. These notes and others are known as the Overtone Series, and they're exactly what the Fourier Transform picks out.

Consider what happens when you hit a piano key. A hammer strikes a string, which begins to vibrate back and forth, creating a signal x(t). While it’s certainly true that each musical pitch corresponds to a frequency (concert A, for example, is 440 Hz - this is known as a fundamental frequency), the note you just played actually resonates at a series of frequencies given by the Fourier Transform. If we designate the fundamental frequency as f1, these additional frequencies - known as harmonics - include f2 = 2f1, f3 = 3f1, f4 = 4f1 and so forth. These harmonics comprise the Overtone Series, and you can hear them by listening intently to a single pitch of mid-range frequency (we generally can't hear above 20 kHz).

But the Fourier Transform has a greater presence in music than just the Overtone Series. Applications of the Transform have revolutionized music storage by enabling compression.

We could draw a parallel with what a prism does when it splits white light into a spectrum of colors. White light consists of all visible frequencies (red, orange, yellow, green, blue, indigo and violet) mixed together (much like the information on a CD has sounds of all frequencies mixed together) and the prism breaks them apart so we can see the separate frequencies (much like the CD player splits apart the sound frequencies so they can be amplified and sent to the speakers).

The Fourier Transform is a mathematical technique for doing a similar thing to sounds of all frequencies mixed together. It splits apart different sound frequencies, therefore resolving any time-domain function into a frequency spectrum. The Fast Fourier Transform (FFT) is a method for doing this process very efficiently.
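To make this concrete, here is a minimal sketch in Python that builds a made-up "piano-like" note (concert A plus two weaker harmonics) and uses a naive discrete Fourier transform to recover the frequencies hidden inside it. The sample rate, the number of samples, and the harmonic amplitudes are all assumptions chosen so the numbers come out cleanly; a real FFT computes the same values, only much faster.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform (an FFT gives the same result faster)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


SAMPLE_RATE = 8000   # samples per second (assumed for this sketch)
N = 400              # 0.05 s of audio, so frequency bins are 20 Hz apart
FUNDAMENTAL = 440.0  # concert A

# Hypothetical note: the fundamental plus two weaker harmonics (2*f1 and 3*f1).
signal = [
    1.0 * math.sin(2 * math.pi * FUNDAMENTAL * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 2 * FUNDAMENTAL * t / SAMPLE_RATE)
    + 0.25 * math.sin(2 * math.pi * 3 * FUNDAMENTAL * t / SAMPLE_RATE)
    for t in range(N)
]

spectrum = dft(signal)

# Map each frequency bin (up to half the sample rate) to the amplitude found there.
amplitudes = {k * SAMPLE_RATE / N: 2 * abs(spectrum[k]) / N for k in range(N // 2)}

# Keep only the frequencies that carry a noticeable amount of energy.
peaks = sorted(freq for freq, amp in amplitudes.items() if amp > 0.1)
print(peaks)  # [440.0, 880.0, 1320.0]
```

The three frequencies that come back are exactly the fundamental and the two harmonics we mixed in, with amplitudes close to 1.0, 0.5 and 0.25.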

Lossy vs lossless compression

https://www.makeuseof.com/tag/how-audio-compression-works-and-can-you-really-tell-the-difference/

https://www.retromanufacturing.com/blogs/news/understanding-audio-file-formats-flac-wma-mp3

What is compression?
Suppose we used the Fourier Transform to decompose some waveform x(t) into its simple sinusoidal components. These components vary by amplitude and frequency; as such, some components dominate others. An MP3 compressor shrinks the file size of a sound file (a WAV file, for example) by using the Fourier Transform to find the weakest frequency components and cut them out altogether. The end result is a sound file that can be roughly one-tenth the size of the original at the expense of some dissimilarity to the original track - or, more simply, a compressed file!

How does it work?
Compression boils down to two steps:

Identify weak/non-essential components of the original audio file.
Cut them out.

An algorithm (a computational set of rules) identifies and erases weak/non-essential components of a signal using a number of criteria. Commonly excised components include:

Those with tiny amplitudes. These components are too quiet for us to miss them once they're gone.
Those whose frequencies exceed 20 kHz - humans can't hear these anyway!
Those that are temporally close to dominant, similar frequencies. As you may recall from our previous section on the neuroscience of sound, temporal masking renders us effectively deaf, for a brief window of time, to frequencies similar to ones we just heard.

Once these components are removed (and provided that your compressor is any good!), you're left with a small file that preserves the integrity of the original audio. In other words, compression aims to reduce the file size while retaining the crucial information. This allows us to reduce storage space as well as upload, download and file transfer times.
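As a toy illustration of the first two criteria, suppose a signal has already been broken down into (frequency, amplitude) pairs. The sketch below simply drops the components that are too quiet or too high to hear. The component list and the amplitude threshold are made up for the example, and temporal masking is left out entirely, so this is far simpler than what a real encoder does.

```python
HEARING_LIMIT_HZ = 20_000  # approximate upper limit of human hearing
AMPLITUDE_FLOOR = 0.01     # assumed cutoff below which a component counts as "too quiet"

# Hypothetical frequency components as (frequency in Hz, amplitude) pairs.
components = [(440, 1.0), (880, 0.5), (1320, 0.004), (25_000, 0.3), (3000, 0.2)]


def prune(components, floor=AMPLITUDE_FLOOR, limit=HEARING_LIMIT_HZ):
    """Keep only components that are loud enough and within the audible range."""
    return [(freq, amp) for freq, amp in components if amp >= floor and freq <= limit]


kept = prune(components)
print(kept)  # [(440, 1.0), (880, 0.5), (3000, 0.2)]
```

Two of the five components are discarded: one because it is nearly silent, the other because it sits above the range of human hearing.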

Lossless compression means that as the file size is compressed, the audio quality remains the same. Lossless compression can reduce file sizes by up to around 50% without losing quality, and the file can be restored back to its original state. Lossless compression uses different techniques to replace redundant data in such a way that the original file can be rebuilt. One technique of lossless compression, for example, is run length encoding (RLE), where repeated data is replaced with (count, value) pairs, e.g. 000011101111 (four 0s, three 1s, one 0, four 1s) is stored as 40311041.
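Run length encoding is short enough to sketch in a few lines of Python. The function names here are our own, and the encoded form simply writes each run as its count followed by its value, which only stays unambiguous while every run is shorter than ten.

```python
from itertools import groupby


def rle_encode(bits):
    """Turn a string like '0001' into (count, value) pairs: [(3, '0'), (1, '1')]."""
    return [(len(list(group)), value) for value, group in groupby(bits)]


def rle_decode(pairs):
    """Rebuild the original string from its (count, value) pairs."""
    return "".join(value * count for count, value in pairs)


original = "000011101111"
pairs = rle_encode(original)
encoded = "".join(f"{count}{value}" for count, value in pairs)

print(pairs)                # [(4, '0'), (3, '1'), (1, '0'), (4, '1')]
print(encoded)              # 40311041
print(rle_decode(pairs))    # 000011101111
```

Because decoding gives back exactly the string we started with, nothing is lost, which is the defining property of lossless compression.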

https://www.retromanufacturing.com/blogs/news/understanding-audio-file-formats-flac-wma-mp3

On the other hand, lossy compression permanently removes some data. Lossy compression can reduce file sizes by up to around 90% and the audio quality gets worse. After getting compressed, the file cannot be used to restore the original file. Lossy compression removes certain less audible frequency ranges to reduce the file size at the cost of audio quality.

The process of lossy compression can also create unwanted sounds that were never in the original audio. These issues arise due to content timing errors, poor capturing of low frequency sounds, different perceptual coding algorithms, and uneven temporal masking, where masked sounds extend beyond the masking threshold so we hear them as pre- or post-echoes \(^1\).

Although lossless compression retains the audio quality of the original file while substantially reducing file size, the file sizes are still much larger as compared to the files created through lossy compression. Another issue is that music players do not readily support lossless file formats as they do lossy file formats.

Both lossless and lossy compression algorithms can involve a technique called Huffman coding. It is an optimal variable-length coding in which the most frequent symbols get the shortest codes and less frequent symbols get longer ones. For example, take a file with the following data: \[ \mbox{AAAAAAABBBCC}\]
Here, the frequency of "A" is 7, the frequency of "B" is 3, and the frequency of "C" is 2. Normally, if each character is represented using a fixed-length code of two bits, then the number of bits required to store this file would be (2 x 7) + (2 x 3) + (2 x 2) = 24 bits. But if this data is compressed using Huffman coding, the more frequently occurring characters are represented by fewer bits: A would be represented as 0 (1 bit), B as 10 (2 bits), and C as 11 (2 bits). So the file size becomes (1 x 7) + (2 x 3) + (2 x 2) = 17 bits.
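Here is a minimal sketch of how such a code can be built in Python, using a priority queue to repeatedly merge the two least frequent symbols. The function name is our own, and the exact bit patterns it produces may differ from the ones quoted above (which symbol gets a 0 and which gets a 1 is arbitrary), but the code lengths and the 17-bit total come out the same.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code for the symbols in text; returns {symbol: bit string}."""
    counts = Counter(text)
    # Heap entries are (frequency, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs a one-bit code
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)    # least frequent group
        freq2, _, right = heapq.heappop(heap)   # second least frequent group
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (freq1 + freq2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "AAAAAAABBBCC"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)

print({sym: len(code) for sym, code in sorted(codes.items())})  # {'A': 1, 'B': 2, 'C': 2}
print(len(encoded))  # 17
```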

Converting analog to digital

The basic idea of converting analog waves into digital recordings is to go from the sound wave to a stream of numbers that are recorded. This conversion is done by an analog-to-digital converter (ADC), while to play back the music, the stream of numbers is converted to an analog wave by a digital-to-analog converter (DAC)\(^2\). The analog wave that is produced by the DAC should be similar to the original analog wave if the ADC samples at a high enough rate.

Let’s go more into how the analog-to-digital conversion process works. The process of storing sound in a digital format is called sampling. Humans typically represent music by writing notes on a music sheet, which is played back by reading each note and playing it on an instrument. In the case of sampling, the music notes are represented as the stream of numbers, which are “written” down onto a CD, into an MP3 file, etc.

For the digital file to resemble the original analog sounds to the fullest extent, you want to have a high sampling rate (also called the sampling frequency), which is how many samples are taken per second\(^2\). You also want a high bit depth (the number of bits used to store each sample) and a high bit rate (the number of bits used to represent each second of audio)\(^3\). An important thing to remember is Nyquist's theorem, which states that the sampling rate must be at least twice the highest frequency in the audio being converted.
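These three quantities fit together in a simple way, sketched below in Python. The function names are our own, and the 16-bit, two-channel figures are the usual CD-quality values rather than anything specific to a particular recording.

```python
import math


def sample_and_quantize(freq_hz, sample_rate, bit_depth, n_samples):
    """Sample a sine wave and round each sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bit_depth - 1) - 1  # e.g. 32767 for 16-bit audio
    return [
        round(max_level * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_samples)
    ]


def nyquist_rate(max_freq_hz):
    """Smallest sampling rate that can capture frequencies up to max_freq_hz."""
    return 2 * max_freq_hz


def pcm_bit_rate(sample_rate, bit_depth, channels):
    """Bits per second of uncompressed audio."""
    return sample_rate * bit_depth * channels


samples = sample_and_quantize(freq_hz=440, sample_rate=44_100, bit_depth=16, n_samples=5)
print(samples)                      # the first few numbers an ADC might write down
print(nyquist_rate(20_000))         # 40000
print(pcm_bit_rate(44_100, 16, 2))  # 1411200
```

So a 44,100 Hz sampling rate sits comfortably above the 40,000 Hz minimum needed for a 20,000 Hz upper limit, and uncompressed stereo audio at that rate takes about 1.4 million bits every second.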

https://ledgernote.com/blog/q-and-a/how-does-mp3-compression-work/

Fun Fact: During the recording, as the computer “listens” to the music that’s being recorded and “samples” the volumes and frequencies, it does it about 44,000 times each second\(^3\)! This is because the upper limit of human hearing is about 20,000 Hz, for which, according to Nyquist's theorem, a 40,000 Hz sampling rate should be enough.

Compact Disc (CD)

https://www.slideshare.net/vishnudharan11/pulse-code-modulation-pcm

The compact disc (CD) was first invented by James T. Russell in 1966. However, it was the partnership of Sony and Philips that led to the development and popularization of the CD in society.

The CD was the first mainstream music storage format that is considered digital. The audio is digitally stored on a CD via pulse-code modulation (PCM). PCM is a digital representation of an analog waveform, consisting of snapshots of an audio waveform's amplitude measured at specific and regular intervals of time\(^4\). Since the CD stores 44,100 measurements of the waveform's amplitude per second, it has a sampling rate of 44.1 kHz (enough, according to Nyquist's theorem, to store frequencies just above 20 kHz). PCM (usually stored as a .wav file) is an uncompressed format, so all of the data is retained and used to recreate the waveform later on playback.

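The surface of the CD contains one long spiral of data that consists of flat reflective areas called lands and tiny indentations called pits, which send less light back to the reader. In a simplified picture, a land represents a binary 1, while a pit represents a binary 0\(^5\); on a real disc, it is the transitions between pit and land that are read as 1s, while stretches with no transition are read as 0s.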

The CD is read by a laser that shines at the surface of the CD and detects the changes between these areas, which are then translated into the data of the disc. Most of the CD consists of clear polycarbonate plastic that is impressed with the bumps of the spiral data track\(^6\), and is then coated with a thin, reflective aluminum layer\(^5\).

An important data storage feature of a CD is the subcode data, which can encode song titles and the absolute and relative position of the laser so that the laser can move between songs\(^6\). There are also extra data bits called error-correcting codes that identify and correct single-bit errors (i.e. when one bit of data is changed from 1 to 0 or vice versa). This can happen when the laser simply misreads a bump on the disc. CDs can also get scratched, which can cause burst errors (i.e. when two or more neighboring bits of data are changed from 1 to 0 or vice versa). To deal with such errors, the data is stored non-sequentially around the disc in an interleaved fashion. This way, a scratch damages only a small piece of many different blocks of data, and each block can still be repaired by the error-correcting codes.
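The effect of interleaving is easiest to see on a tiny example. The sketch below uses a plain block interleaver, which is a simplification of our own and much cruder than the scheme real CDs use, and the four-letter "codewords" and the "?" marking damaged symbols are made up for illustration.

```python
def interleave(symbols, depth):
    """Write codewords as rows, then read the grid out column by column."""
    width = len(symbols) // depth
    rows = [symbols[r * width:(r + 1) * width] for r in range(depth)]
    return [rows[r][c] for c in range(width) for r in range(depth)]


def deinterleave(symbols, depth):
    """Undo interleave(): put every symbol back in its original position."""
    width = len(symbols) // depth
    return [symbols[c * depth + r] for r in range(depth) for c in range(width)]


# Four hypothetical codewords of four symbols each: AAAA BBBB CCCC DDDD.
codewords = "AAAABBBBCCCCDDDD"
on_disc = interleave(list(codewords), depth=4)
print("".join(on_disc))   # ABCDABCDABCDABCD

# A "scratch" wipes out four neighboring symbols on the disc.
on_disc[4:8] = ["?"] * 4

recovered = "".join(deinterleave(on_disc, depth=4))
print(recovered)          # A?AAB?BBC?CCD?DD
```

Without interleaving, the same scratch would have destroyed an entire codeword. With it, each codeword loses only one symbol, which is the kind of small error an error-correcting code is designed to fix.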

http://www.futurekids.com.sg/doyouknow.html

The benefit of CDs over analog storage is their high sampling rate and precision. That precision takes up space: a CD can store up to 74 minutes of music, so the total amount of digital data that must be stored on a CD is 783,216,000 bytes\(^6\).

That is a lot of bytes, but do the math! With the sampling rate of 44,100 samples per second, we get: \[ \frac{44,100 \mbox{ samples}} {\mbox{channel} \times \mbox{ second}} \cdot \frac{2 \mbox{ bytes}}{\mbox{ sample}} \cdot 2 \mbox{ channels} \cdot 74 \mbox{ minutes} \cdot \frac{60 \mbox{ seconds}}{ \mbox{ minute}} = 783,216,000 \mbox{ bytes} \]
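If you would rather let a computer do the multiplication, the same calculation takes a few lines of Python. The variable names are our own; the numbers are the ones from the formula above.

```python
SAMPLES_PER_SECOND = 44_100  # per channel
BYTES_PER_SAMPLE = 2         # 16 bits
CHANNELS = 2                 # stereo
MINUTES = 74
SECONDS_PER_MINUTE = 60

total_bytes = (
    SAMPLES_PER_SECOND * BYTES_PER_SAMPLE * CHANNELS * MINUTES * SECONDS_PER_MINUTE
)
print(f"{total_bytes:,} bytes")  # 783,216,000 bytes
```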

However, similar to analog storage, CDs can change over time after many uses, though not as easily.

MP3

https://www.soundonsound.com/techniques/what-data-compression-does-your-music

A fun fact: MP3 players get their name from the MP3 files that music is stored into!

Think about those devices that are much smaller than iPhones, with functions limited to playing music and maybe the radio if you were lucky. Essentially, an MP3 player is a portable electronic device that plays MP3 digital audio files. There are several types of MP3 players, such as CD MP3 players (ones that play CDs on the go), pocket devices with low storage (which hold audio files on memory cards), and pocket devices with higher storage (which read audio files from a hard drive; think of the iPod).

The MP3 format was introduced as an audio coding standard in 1994, based on several audio data compression techniques, including the Fourier transform, not all of which we will cover in this blog post. The first MP3 player, meanwhile, was the MPMan F10, which was developed by the South Korean company SaeHan Information Systems in 1997\(^7\).

Even today, the MP3 format is the most popular form of storing audio directly from a line of audio signal such as radio, voice, etc. A large factor in the rise of the MP3 was its portability. Let’s visualize what the life of a music lover was like before the rise of the MP3. Your house probably had some vinyl records collecting dust in a cardboard box in the basement, out of sight. You also had a collection of CDs for every artist’s album you liked, each kept in an individual case or a cylinder case, or maybe just tossed around and getting all scratched up. Then the MP3 came in and totally changed the game of how we live with music. No more needing to put in a new CD every time you get tired of hearing the same songs! Just press skip or download some new music to replace the old. The storage of MP3 players was also a motivating factor to move away from CDs. For example, a typical 20GB (gigabyte) iPod has enough memory to store about 500 CDs\(^8\).

MP3 File compression

Let’s talk more about how the MP3 file actually works. Inside an MP3 file, music is stored as long strings of bits (0s and 1s) in a series of chunks called frames. Each frame starts with a short table of contents, called a header, followed by the music data. Each MP3 file also has an appendix of sorts that stores information like the name of the song, artist, genre, etc. (all of this data is called the metadata)\(^8\).

The MP3 lossy compression algorithm begins with subband filtering, which divides the uncompressed PCM audio signal into bands of different low and high frequencies using a time-frequency mapping filterbank. At this stage, two parallel processes take place: the Modified Discrete Cosine Transform (MDCT) and Fast Fourier Transforms (FFT).

Prior to the MDCT, each subband signal is sorted into different “windows” based on whether it contains steady or transient noise. Windowing is done to remove some of the distortion from the uncompressed audio. Constant noise without much change over time is expressed using a long window. Transient noises (like drum hits or vocal consonants) are expressed across three short windows. A Fourier-related transform, the MDCT is a linear function that turns each windowed band into a set of spectral values (that is, the energy across a range of frequencies). Meanwhile, in parallel, the FFTs are used as analysis functions to turn the frequency bands into information that can be read by the encoder, which performs the masking computation (based on the psychoacoustic analysis discussed in our previous section on the neuroscience of sound and as shown in the image on the right) and decides which weak signals in each band to throw away.

Now, with both the spectral information and the psychoacoustic analysis, the actual compression process starts. If the power in a band crosses the masking threshold, we determine the number of bits needed to represent the subband such that noise introduced by this quantization is below the masking threshold. Then we allocate bits to these subbands, pass the output through Huffman coding, and finally use a bitstream formatter to assemble the bitstream and format it.
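The bit allocation step can be sketched with a rule of thumb from digital audio: each extra bit used to quantize a signal lowers the quantization noise by roughly 6 dB. The Python below is a heavily simplified illustration of the idea under that assumption, not the allocation loop of a real MP3 encoder, and the band names, signal levels and masking thresholds are made-up numbers.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: each extra bit lowers quantization noise by about 6 dB


def bits_needed(signal_db, mask_db):
    """Bits required to push quantization noise below the masking threshold."""
    if signal_db <= mask_db:
        return 0  # the band is fully masked, so we spend no bits on it
    return math.ceil((signal_db - mask_db) / DB_PER_BIT)


# Hypothetical subbands as (name, signal level in dB, masking threshold in dB).
bands = [("low", 60.0, 20.0), ("mid", 45.0, 40.0), ("high", 10.0, 25.0)]

allocation = {name: bits_needed(signal, mask) for name, signal, mask in bands}
print(allocation)  # {'low': 7, 'mid': 1, 'high': 0}
```

The band that stands far above its masking threshold gets the most bits, the band that barely clears it gets very few, and the masked band gets none at all, which is where most of the savings come from.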

http://www.mp3-tech.org/programmer/docs/jacaba_main.pdf

This is how a 42MB uncompressed song can be compressed to just 3MB.
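Those numbers are easy to sanity-check. The sketch below assumes a 4-minute song stored at CD quality and an MP3 encoded at 128 kbps; both the song length and the bit rate are our assumptions, so treat the result as a ballpark rather than the exact figures behind the sentence above.

```python
SONG_SECONDS = 4 * 60        # assumed song length: 4 minutes

# Uncompressed CD-quality audio: 44,100 samples/s, 2 bytes per sample, 2 channels.
pcm_bytes = 44_100 * 2 * 2 * SONG_SECONDS

# MP3 at an assumed bit rate of 128 kilobits per second (8 bits per byte).
MP3_BITS_PER_SECOND = 128_000
mp3_bytes = MP3_BITS_PER_SECOND * SONG_SECONDS // 8

print(f"uncompressed: {pcm_bytes / 1_000_000:.1f} MB")  # uncompressed: 42.3 MB
print(f"MP3:          {mp3_bytes / 1_000_000:.1f} MB")  # MP3:          3.8 MB
print(f"ratio:        {pcm_bytes / mp3_bytes:.1f}x")    # ratio:        11.0x
```

A lower bit rate would bring the MP3 closer to 3 MB, at the cost of throwing away more of the original audio.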

Digital Streaming

Now we're at the present day of music storage, where the domain is dominated by giants like Spotify, Pandora, and Apple Music.

The revolutionary thing about digital streaming, for both audio and video, is that it doesn't take up storage on your device's hard drive. Though MP3 files changed the game in terms of portability, being able to use them relied on having them downloaded and stored. Digital streaming takes storage out of the equation, paving a path for information sharing like no other.

Compression of data within digital streaming:
Just like with all the other forms of music storage, we first start off with uncompressed data. Then compression software is used to discard unnecessary data and make the file smaller. However, how far the file is reduced for streaming also depends on the bitrate (the number of bits used for each second of audio), which has to fit within the speed of transfer from the server to your computer. Essentially, you want to encode a file that's large enough to sound good, but small enough to work with the available bandwidth.

Server:
After the file is compressed and encoded, it is uploaded to a server which delivers files through a Web server. When you click a link on a Web page, which is stored on the Web server, the Web server sends a message to the streaming server telling it what file you want to listen to. Then the streaming server sends the file directly to you.

Protocols:
All of this works effectively due to a set of rules known as protocols, which govern the way data travels from one device to another. The specific protocols for streaming audio break the data into packets and transfer the data in real time to a specific location in a specific order. Streaming service protocols and web protocols work together to balance the load on the server, determining when to start streams.
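The idea of breaking data into packets that can be put back in order is simple to sketch. The Python below is a toy model of our own, not an implementation of any real streaming protocol: each packet is just a sequence number paired with a small slice of the data.

```python
def packetize(data, size):
    """Split data into (sequence_number, payload) packets of at most size bytes."""
    return [(seq, data[i:i + size]) for seq, i in enumerate(range(0, len(data), size))]


def reassemble(packets):
    """Sort packets by sequence number and join their payloads back together."""
    return b"".join(payload for _, payload in sorted(packets))


audio = b"streaming audio bytes"
packets = packetize(audio, size=4)

# Pretend the network delivered the packets in the wrong order.
arrived = list(reversed(packets))

print(reassemble(arrived) == audio)  # True
```

Because every packet carries its own sequence number, the receiver can rebuild the audio correctly no matter what order the pieces arrive in.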

Player:
The player assists your device in decoding and playing the data that represents the audio file. Examples of players are QuickTime (for .mov) and Adobe Flash player (.flv), though since players can't decode one another's file formats, sites often pick one for you automatically. These players decode and display data, and are able to retrieve the information faster than they play it which allows extra information to stay in a buffer in case the stream falls behind.
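The role of the buffer can be seen in a tiny simulation. This is a toy model under assumptions of our own (one step per second, a fixed amount of audio consumed per second, made-up download figures), not how any particular player is implemented.

```python
def simulate_buffer(received_per_second, playback_per_second):
    """Track how much audio (in kilobits) sits in the buffer after each second.

    Returns the buffer level after every second and the number of seconds
    in which playback stalled because the buffer ran dry.
    """
    buffer_kb = 0
    stalls = 0
    history = []
    for received in received_per_second:
        buffer_kb += received
        if buffer_kb >= playback_per_second:
            buffer_kb -= playback_per_second  # play one second of audio
        else:
            stalls += 1                       # not enough data: playback pauses
        history.append(buffer_kb)
    return history, stalls


# The connection is fast for three seconds, drops out for two, then recovers.
history, stalls = simulate_buffer([256, 256, 256, 0, 0, 256], playback_per_second=128)
print(history, stalls)  # [128, 256, 384, 256, 128, 256] 0

# With no head start, a single dropout is enough to interrupt the music.
history_slow, stalls_slow = simulate_buffer([128, 0, 128, 128], playback_per_second=128)
print(history_slow, stalls_slow)  # [0, 0, 0, 0] 1
```

In the first run the player downloads faster than it plays, so the extra audio stored in the buffer carries it through the two-second dropout without a pause.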

Your device:
All of these things, besides the compression part, happen through your device either communicating with or utilizing these components! Your device is where the Web browser or streaming app lives, and is where the data is received and discarded.

https://computer.howstuffworks.com/internet/basics/streaming-video-and-audio3.htm

Future of music storage?

https://www.whathifi.com/us/advice/mp3-aac-wav-flac-all-the-audio-file-formats-explained

MP3 files are the most popular file format for music storage because their small size allows for easy distribution, and because they are easy to convert to other file formats. MP3s are supported by nearly every digital music player in the world. But the MP3 compression algorithm also leads to lower sound quality and even unwanted sounds due to lossy compression, which makes it unsuitable for professional use. In that case, it becomes important to look at the other options available to us for storing audio files. For example, we have WAV, AIFF, WMA, OGG, AAC, FLAC, etc.

WAV and AIFF are the most popular uncompressed audio file formats. They store data in slightly different ways but both are based on PCM (Pulse Code Modulation) and occupy a lot of space. WMA, OGG and AAC are lossy compression audio file formats like MP3. OGG (Ogg Vorbis) is an open-source alternative that provides impressive sound at lower bit rates than other lossy formats\(^9\). Although it discards somewhat more data during compression, it transfers audio quickly and sounds good enough for digital streaming. At 128kbps, AAC is more efficiently compressed than an MP3 and clearly creates less distortion and adds less noise\(^{10}\). However, this file format is not compatible with as many devices. FLAC is a lossless audio format that strikes a good balance between compression and sound quality. It provides us with the best quality audio files without taking up all our storage space.

In terms of file formats on digital streaming services, Amazon Music offers MP3 files up to 256kbps, iTunes now offers AACs at 256kbps as well as 128kbps, Pandora can stream 64kbps AAC files, and Spotify uses the Ogg Vorbis format and can stream at 96kbps, 160kbps and even 320kbps\(^{10}\). YouTube audio could be encoded as mono 22.05kHz 64kbps MP3 files, although now it is mostly encoded in 44.1kHz stereo AAC or Ogg Vorbis formats\(^{10}\).

4-minute songs per GB of space

Generally, lossless files strike a good balance between compression and sound quality. They provide us with the best quality audio files without taking up all our storage space. So, out of all the audio formats mentioned above, FLAC (Free Lossless Audio Codec) probably has the best chance of gaining popularity in the future, since it is a product of lossless audio compression that reduces file size by about 60% with no loss in quality\(^{11}\). It’s an open source lossless format that can provide even better resolution than CDs\(^{12}\). Unlike WAV files, FLAC files retain information tags including the artist and other album information\(^{11}\).

So, as our technology advances to house unlimited amounts of data (for example, on the cloud) at lower cost, and once storage space is no longer limited, we think that FLAC, with its superior sound quality and better compatibility with normal devices, will eventually emerge as the most popular audio file format.

Citations:
1. https://www.audiobuzz.com/blog/wav-or-mp3-whats-the-difference/
2. https://electronics.howstuffworks.com/analog-digital3.htm
3. https://www.explainthatstuff.com/how-mp3players-work.html
4. https://itstillworks.com/digital-output-pcm-format-12198667.html
5. https://electronics.howstuffworks.com/question287.htm
6. https://electronics.howstuffworks.com/cd.htm
7. https://www.cnet.com/news/bragging-rights-to-the-worlds-first-mp3-player/
8. https://www.explainthatstuff.com/how-mp3players-work.html
9. https://www.tomsguide.com/us/what-are-audio-codecs,review-4469.html
10. https://www.soundonsound.com/techniques/what-data-compression-does-your-music
11. https://www.retromanufacturing.com/blogs/news/understanding-audio-file-formats-flac-wma-mp3
12. https://www.whathifi.com/us/advice/mp3-aac-wav-flac-all-the-audio-file-formats-explained