Spectral "Image Processing"
The Conventional
Photo or image editing software might not be something you would ever think that
you had a need for in a recording or editing situation, but it can be a hugely useful
tool for solving simple problems, and can also be used for unconventional editing and
sound design tasks. But before we even consider the editing possibilities themselves,
we need to figure out how to get from audio recording to image editor. The process is
a very simple one, but you will probably need additional software in order to do it as
it is quite a specialised and unusual thing to want to do.
MetaSynth is perhaps the most well-known package for sound-to-image and image-to-
sound translations but it was originally designed as a tool to allow you to “draw” new
sounds and analyse existing audio files rather than as a “bridge” between audio and
image editing software. In fact, the best tool that I know of for doing this is the
excellent PhotoSounder, which not only allows you to load up audio files and then
export the spectrogram as an image for editing in your software of choice before
loading the image and converting back to audio, but also features some quite
comprehensive editing tools of its own. These tools are very much the kind of things
you would expect to see in any image editing package and, coupled with the “layers”
methodology adopted by many image editing packages, PhotoSounder is actually
more than capable of a great deal before you even think about exporting the image.
Many of the ideas that follow will be possible within PhotoSounder itself, but I will
look at the techniques for image editing in Photoshop as it features some options that
are unavailable in PhotoSounder or other simpler editors.
However, a fundamental advantage to working within PhotoSounder is that you can
preview the changes in real time whereas any editing done after exporting to your
image editor will be done on visuals alone – you would have to re-save the image and
import it back into PhotoSounder to hear the effects. As a result it can make sense to do
the simplest jobs in a spectral editing application, to use PhotoSounder for more
complex jobs, and to move to an external image editor only when absolutely
necessary.
The first thing that we need to get used to is the idea of “layers”. The benefit of using
layers is that each layer can serve as an adjustment to the original audio. With most
spectral editing software you work directly on the audio file so any changes you make
are applied to it. They can, of course, be undone but you work sequentially applying
one edit after another. You can step back through many levels of undo but you
couldn’t, for example, keep the actions that you carried out on steps 3, 4 and 6 but
lose what you did on steps 1, 2, 5 and 7. With layers in an image editor it is certainly
possible to do that. Each adjustment can exist on its own layer and then can be freely
shown, hidden or combined in numerous ways.
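The layer idea can be sketched in code. What follows is a minimal, hypothetical model (using Python and numpy, with the spectrogram as a 2D brightness array); none of the names come from any real application, but it shows how each edit can live on its own layer and be shown or hidden independently:

```python
import numpy as np

# A spectrogram as a 2D brightness array (rows = frequency, columns = time).
spec = np.full((64, 100), 0.5)

# Each "layer" is an independent adjustment: a region plus a gain factor.
# These example layers are entirely made up for illustration.
layers = [
    {"region": np.s_[10:20, 30:40], "gain": 0.5, "visible": True},   # edit 1
    {"region": np.s_[40:50, 60:70], "gain": 2.0, "visible": False},  # edit 2, hidden
]

def render(base, layers):
    """Apply only the visible layers; the base image is never modified."""
    out = base.copy()
    for layer in layers:
        if layer["visible"]:
            out[layer["region"]] *= layer["gain"]
    return out

result = render(spec, layers)
```

Toggling a layer's "visible" flag before re-rendering is the equivalent of keeping the edits from steps 3, 4 and 6 while discarding the others.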
The next things to consider are the tools used to make the selections. In the previous
chapter I stated that many of the selection tools used in spectral editing applications
were in some ways a subset of those used in image editing, so it is worth taking a look
at the extended options available in your image editor. The “lasso” tool in a spectral editor
is very useful for highlighting irregularly shaped selections, but an image editor may
well have a “magnetic lasso” tool that will try to detect obviously visible “edges” and
automatically jump to and follow them. In certain situations this could be helpful for
making sure your selection is as “tight” as possible. In addition to the rectangular
selection tools, most image editors offer circular/oval selection tools as well,
which might prove very useful in some situations. Finally, many image editors allow
you to save selection areas as “masks”, which means that you can load them at any
point and you will be guaranteed to have the same selection area – even if you choose
to go back and apply additional processing and adjustments to a particular area after
you have deselected it.
One other option which can be very useful is the ability to control the size, shape and
softness of the brush tool. While a round brush will probably be the most useful, there
might be times when a differently shaped brush will serve your purposes better. And
the softness parameter will change the brush from a hard edge to a very soft, blurred
edge. If we use blurred-edge brushes in our adjustment layers then it will graduate the
effect of whatever the layer is doing from the edges of the blur to the solid centre of
the brush. Once again it is a subtle difference but it can be helpful if a particularly
unwieldy edit is needed.
Once we have used the expanded range of tools to create our selection it is time to
actually do something useful with it. For each new adjustment (or collection of similar
adjustments, perhaps) we would create a new layer and then set it up to carry out the
task we want. Many of the tasks we looked at in the last chapter have direct
equivalents in image editing in this way so we will quickly look at the ways we can
achieve the same results here before moving on to look at some things that aren’t
achievable in a spectral editor (yet).
Attenuation is a very simple task to achieve in an image editor. All we need to do is
remember that brightness is the image processing (and spectral editing) equivalent of
amplitude so attenuation is achieved by decreasing the brightness. If we make our
selection (or load a previously saved one) and then create a brightness adjustment
layer in our image editor we can then increase or decrease the brightness (amplitude)
of the selection as desired. The great advantage here is that not only can we make the
amplitude adjustment in both directions (quieter or louder) but we can also go back
and change this setting later in light of additional changes that we may make, all
without having to commit to any change before we do a final export of the image
from the image editor.
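As a rough sketch of what a brightness adjustment layer is doing to the underlying data (all sizes, positions and values here are hypothetical):

```python
import numpy as np

# Spectrogram brightness in the 0..1 range (1.0 = full amplitude).
spec = np.full((512, 1000), 0.8)

# A hypothetical rectangular selection: frequency rows 100-200,
# time columns 400-600.
selection = np.s_[100:200, 400:600]

def brightness_adjust(spec, selection, gain):
    """'Brightness adjustment layer': scale the selection by a gain factor.
    gain < 1 attenuates, gain > 1 boosts - and because the original array
    is untouched we can re-render with a different gain at any time."""
    out = spec.copy()
    out[selection] = np.clip(out[selection] * gain, 0.0, 1.0)
    return out

quieter = brightness_adjust(spec, selection, 0.5)   # attenuate
louder = brightness_adjust(spec, selection, 1.5)    # boost, clipped at full scale
```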
Copying and pasting is achieved in much the same way as it would be within a
spectral editor but working in an image editor does allow some additional flexibility.
By pasting the copied area on to a new layer it allows for it to be moved or “unpasted”
after the fact and, perhaps more importantly, it allows us to soften the edges of the
pasted area so that there are no abrupt changes in tone. If you copy and paste a
selection in a spectral editor you have very little control over the transition between
the original and pasted areas and this could potentially lead to audible artefacts at the
transition point. Doing the same thing in an image editor gives you much more
control over this important aspect of the pasting process.
While there is no dedicated de-clicking process in an image editor, by and large,
clicks are quite easy to deal with because they are a relatively high burst of sonic
energy over a wide frequency range and a short period of time. As such they will
normally show up quite clearly as fairly obvious vertical lines in the image editor.
Once we have identified them there are a couple of ways we could deal with them.
The first is simply to highlight the offending area, delete it and move the
surrounding areas so that they cover the gap. This will obviously make the whole file
a little shorter but if there are only one or two digital clicks of a few samples long then
this would amount to a few milliseconds at most and is unlikely to have any impact on
the file as a whole. Of course in doing this there is a risk that you will still have an
audible “glitch” because of subtle differences between the two areas that you have
moved to be adjacent. It is also not a very viable option if there are many clicks and
each is of a longer duration because the reduction in length would quickly add up.
Another option would be to copy and paste an equivalent length region from either
side of the click area but, again, you still run the risk of having audible glitches
because of minor differences between the original and pasted areas. You could, of
course, soften the edges as we described above to smooth out any such glitches. There
is a third option, however, which could prove interesting. Some image editors, either
natively or through plugins, provide the option to “fill” a selection with content that is
created by looking at the rest of the image and analysing patterns so that it blends with
the surrounding areas. This “content aware fill” (as it is called in Photoshop) could be
a great way to try to fill the gap left by the removed clicks, as it is a very similar tool
to some of the more advanced removal/attenuation algorithms in spectral editors.
What we really need to do when removing clicks this way is to interpolate between
the values immediately to the left of the removed and the values immediately to the
right. The shorter the click was, the easier this will be to do because, if we are talking
about a couple of milliseconds, even if the interpolated values aren’t exactly what
should have been there without the click, the period is so short that we probably
wouldn’t notice anything untoward. With longer clicks, however, a simple averaging
interpolation may be audible.
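A simple linear interpolation of the kind described could look like this; the spectrogram contents and the click position are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((128, 200))

# Hypothetical click: a bright vertical stripe at columns 100-102.
spec[:, 100:103] = 1.0

def interpolate_click(spec, start, stop):
    """Replace columns start..stop-1 by interpolating linearly, row by row,
    between the column just before and the column just after the click."""
    out = spec.copy()
    left = out[:, start - 1]
    right = out[:, stop]
    n = stop - start
    for i in range(n):
        t = (i + 1) / (n + 1)   # fraction of the way across the gap
        out[:, start + i] = (1 - t) * left + t * right
    return out

fixed = interpolate_click(spec, 100, 103)
```

As the text notes, this is only convincing for very short clicks; over a longer gap the straight-line values can drift audibly from what the surrounding material suggests.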
De-crackling with an image editor would, I believe, be very impractical. It
isn’t so much a problem of having the tools to remove the crackle as it is of
having a means to accurately (and automatically) detect and locate it. The de-
crackle tools in spectral editors are expertly designed to look for specific audio events
and, when located, to treat them in a certain way. There are no tools that currently
exist in image editing to achieve a similar result and, therefore, de-crackling is
something that should probably be left to a spectral editor or dedicated plugin to deal
with. At least for the time being.
In theory we may be able to de-noise audio using an image editor but that comes with
one huge caveat. Image editors and image editing plugins have some really great tools
for getting rid of noise but, and this is the important part, the way noise occurs in
images is generally quite different to the way it occurs in audio. Noise in images tends
to be either compression artefacts (which are dealt with in a very specific way) or
what is known as Gaussian noise. This is the visual equivalent of white noise in audio
terminology and represents a statistically random and uniformly distributed noise. In
audio terms this would mean that it would be totally equal across all frequency bands.
The tools used to remove this in image editing would, therefore, be of limited use to
us if our “noise profile” happened to have a non-uniform distribution. The reality is
that de-noising using graphical image editing tools often results in a spectrogram that
looks blurred and the audible result of that is one of making the whole sound very
unfocused and, at worst, having a “wateriness” to it which is very far from the desired
result. As a consequence of this, de-noising is probably best left to conventional audio
editing processes – but it’s always worth a try if you fancy experimenting.
As a long shot, it might be possible to work around this by taking the spectrum of a
pure noise part of the recording (as you would do with de-noising in a spectral editor)
and then combining this spectral profile of the noise with a pure Gaussian noise image
to create a “weighted” Gaussian noise profile, and then using that as a subtractive
layer on top of the spectrogram image in your image editor. However, it is unlikely
the results would be as good as a dedicated de-noising process. If you enjoy
experimentation, though, it could be something you might like to try just as an
alternative technique. It would be very much a static noise profile though, and
wouldn’t offer the advantages of adaptive noise profiles used in many de-noising
algorithms and plugins.
Hum, on the other hand, should be much easier to deal with because it is much easier
to detect in an image editor. We have already seen that mains hum consists of a
constant frequency of either 50Hz or 60Hz and the associated lower-order harmonics.
In a spectrogram image this will show as horizontal lines located at 50Hz or
60Hz, then fainter lines at 100Hz or 120Hz, fainter still at 150Hz or 180Hz, and so on.
You could almost look at these in the same way that we did with clicks. They
represent very narrow focuses of audio energy. Clicks represented a wide frequency
range for a short period of time while each of the harmonics present in mains hum
will represent a narrow frequency range for a long period of time. Therefore each of
the options we looked at for dealing with clicks would be equally applicable here
except that we would be working with horizontal selections rather than vertical ones.
Equally, if we wanted to deal with mains hum we would simply select the relevant
frequencies and then reduce the brightness (attenuate them) which would, in effect, be
like using a very narrow bandwidth EQ.
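A sketch of this hum attenuation, assuming a linear frequency axis and a made-up spectrogram size (real spectrogram displays often use a logarithmic axis, in which case the row arithmetic would differ):

```python
import numpy as np

sample_rate = 96_000                        # assumed sample rate
n_rows = 4096                               # assumed spectrogram height in pixels
hz_per_row = (sample_rate / 2) / n_rows     # linear frequency axis assumed

spec = np.full((n_rows, 500), 0.6)

def attenuate_hum(spec, fundamental=50.0, harmonics=5, width_hz=2.0, gain=0.1):
    """Darken a narrow horizontal band at each hum harmonic - the
    image-editor equivalent of a very narrow notch EQ."""
    out = spec.copy()
    for k in range(1, harmonics + 1):
        centre = k * fundamental
        lo = int((centre - width_hz) / hz_per_row)
        hi = int((centre + width_hz) / hz_per_row) + 1
        out[lo:hi, :] *= gain
    return out

dehummed = attenuate_hum(spec)
```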
There are a number of ways to achieve compression-like effects in an image editor
and these range from limiting the dynamic range of occasional sounds through to a
more general re-balancing of the dynamic range of the recording as a whole. There
are also ways in which you can achieve dynamic range effects that would be
completely impossible with any other spectral editing process or plugins. To begin
with let’s have a look at the simplest case: taming occasional peaks.
As with spectral editors, you can create selections in image editors and then attenuate
those areas to pull the peak levels down. If you want a result that is most similar to a
traditional single band compressor, then you would create a selection that covers the
width of the event we want to compress and also covers the entire spectrogram from
top to bottom (covering all frequencies). You can then create a brightness adjustment
layer and decrease the brightness by the required amount. You can also soften the
edges of the selection and this softening would be analogous to the “attack” and
“release” settings on a compressor. The softer the left-hand edge of the selection, the
slower the equivalent attack would be and the softer the right-hand edge, the slower
the equivalent release would be. What we are doing, in effect, is attenuating a single
part of the sound, which is what compression does at its most basic level.
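The soft-edged selection idea can be sketched as a one-dimensional gain curve applied across all frequency rows, with the ramp widths standing in for attack and release (all of the numbers here are hypothetical):

```python
import numpy as np

spec = np.full((256, 1000), 0.9)   # a uniformly loud spectrogram, for simplicity

def soft_attenuate(spec, start, stop, gain, attack_px, release_px):
    """Attenuate columns start..stop over all frequencies, ramping the gain
    in over attack_px columns and out over release_px columns - the softened
    selection edges play the role of a compressor's attack and release."""
    g = np.ones(spec.shape[1])
    g[start:stop] = gain
    # linear ramps at the edges ("softened" selection)
    g[start - attack_px:start] = np.linspace(1.0, gain, attack_px, endpoint=False)
    g[stop:stop + release_px] = np.linspace(gain, 1.0, release_px, endpoint=False)
    return spec * g[None, :]

compressed = soft_attenuate(spec, 400, 600, gain=0.5, attack_px=50, release_px=100)
```

The multi-band variant described next simply restricts the rows the gain curve is applied to, and softens the top and bottom edges of the selection as well.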
You can then apply the same theory and techniques to a multi-band compression
equivalent by simply selecting only the frequencies that you want to compress, in a
vertical sense, and then creating the selection and brightness adjustment layer as
before. In additional to softening the left and right edges to create “attack” and
“release”, you can also soften the top and bottom edges to simulate the effects of the
filter slopes on the multi-band compressor bands. All pretty straightforward stuff and
fairly similar to what we have already spoken about for spectral editors. But then
things get a little more interesting.
Compression is the process of reducing dynamic range of sound, and dynamic range
is defined as the range between the loudest and quietest parts of an audio signal.
Moving that across to our image editor we know that loudness is equivalent to
brightness (or more correctly luminance), so in order to compress the audio in our
spectrogram we would need to change the amount of difference between the lightest
and darkest parts of our image. However if we simply adjust the contrast of the image
(in effect reducing the difference between the loudest and quietest parts of the signal)
we are not doing what we would traditionally consider as compression. In
compression as we know it we are adjusting the levels of the sound as a whole at
different points in time whereas with this spectral compression we are adjusting the
difference between the levels of different frequencies in the sound. That’s not to say
that it isn’t a very interesting effect in its own right; it most certainly is, but it won’t
achieve the same results as a normal audio compressor.
Adjusting the contrast amount will reduce the effective dynamic range of our audio
but it does it in a way that no hardware or plugin compressor does. It will take a
hypothetical midpoint in the dynamic range and make any sounds that are louder than
this point progressively quieter and sounds that are quieter will get progressively
louder. We have already seen that downward compression involves reducing the
levels of loud signals while upward compression involves increasing the levels of
quiet signals. In a way this actually does both simultaneously so would be a unique
effect in its own right. But I promised earlier to talk about upward compression, so
let’s see how we can adapt this technique to give us something closer to genuine
upward compression.
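A contrast adjustment of this kind reduces the distance of every value from a midpoint, as a quick sketch shows (the midpoint and amount are arbitrary choices):

```python
import numpy as np

spec = np.array([[0.1, 0.5, 0.9]])   # quiet, mid-level and loud pixels

def spectral_contrast(spec, amount, midpoint=0.5):
    """amount < 1 reduces contrast: values above the midpoint move down and
    values below it move up - downward and upward 'compression' at once."""
    return np.clip(midpoint + (spec - midpoint) * amount, 0.0, 1.0)

squashed = spectral_contrast(spec, 0.5)   # approximately [[0.3, 0.5, 0.7]]
```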
More traditional audio-type compression can actually be achieved inside of
PhotoSounder itself and there are a number of tutorials on the website which go
through this process. Very interesting is the fact that you can actually draw the
compression curve directly on to the audio, which means that you can apply a linear
compression, a logarithmic (approximately) compression, or a compression curve of
any shape that you desire which allows for compression effects simply not possible
with any traditional audio compressor.
While neither of these methods (spectral compression in an image editor or the
pseudo-traditional approach in PhotoSounder) is a precision tool in terms of absolute
control, image processing of spectrograms can be an extremely creative tool and can
offer us ways of doing things which are just not possible in any other editing platform.
Even in the relatively commonplace editing tasks that we have looked at here there
are ways that image-processing editing can help us. If you are happy to venture off
the beaten track just a little then even more becomes possible.
The Unconventional
The processes and tools we have described so far in this chapter have all been related
to attempting to recreate tasks and effects that we can create other ways, either using a
spectral editor or, in some cases, simply using plugins or hardware processors. But if
we left it at that we would be overlooking some of the truly amazing ways that we can
bend and manipulate sound using an image editor. With this short section we are
doing a little more than just dipping a toe into the waters of sound design; this is very
much experimenting with audio and wouldn’t really be construed as “editing” as such,
but it is related to the editing tasks we have been discussing and it’s also a lot of fun
so, while we are in this area, let’s just have a brief look at some intriguing
possibilities. One thing that I will say at the outset, though, is that the best results are
achieved with sounds that aren’t overly complex. What I mean by this is sounds
which are not rhythmically complex or full pieces of music. Isolated, single sounds
that aren’t too “busy” are best for these types of effects.
Using blur effects can give rise to some really nice results that are actually quite hard
to describe. If we took a sound which had very definite harmonics of 100Hz, 200Hz,
300Hz, 400Hz, etc., we would see each of these harmonics as a very clearly defined
horizontal line. If we were to then apply a blur effect this would have the effect of
creating additional frequency content either side of the main frequency. In the case of
the 100Hz harmonic this would mean very subtle additional frequencies ranging from,
for example, 98Hz to 102Hz, increasing from inaudible at the edges of the range, up
to a maximum level at 100Hz and then down to inaudible again at 102Hz. The
resulting sound is very hard to characterise but it almost sounds like multiple versions
of the same sound detuned against each other. Imagine a choir singing where each
person was slightly out of tune. Depending on the sound there can be a subtle
inharmonic “shimmer” to the sound that can become quite pronounced if the level of
blurring is high enough.
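A minimal sketch of this frequency-axis blurring, applying a Gaussian kernel vertically to a single pure harmonic (a real image editor would blur in both directions unless told otherwise, and the sizes here are arbitrary):

```python
import numpy as np

# A single pure harmonic: one bright row in an otherwise dark spectrogram.
spec = np.zeros((101, 50))
spec[50, :] = 1.0

def vertical_blur(spec, sigma=2.0):
    """Blur along the frequency axis only: energy leaks into neighbouring
    frequency rows, giving the detuned 'shimmer' described above."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()                   # preserve overall energy
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, spec)

blurred = vertical_blur(spec)
```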
Easier to describe, perhaps, is the effect of blurring only in a horizontal sense with a
“motion blur” effect. This visual effect is the kind of thing you would expect to see in
a photo of a fast moving vehicle: a blur in a horizontal direction. If we apply this
effect to our spectrogram we get a kind of time-domain blurring which will sound
reminiscent of a reversed reverb if the blur is carried out from right to left and a
strange, “ghostly” reverb if the blur is carried out from left to right. Although we
could achieve effects which are superficially similar with reverb, this particular
processing does sound different because a reverb will tend to diffuse the sound as it
dies away and give the sound a sense of distance while this method sounds more like
a natural fade-in or fade-out of a sound. As a result, if you apply it to a naturally
sustaining sound (such as a pad), it probably won’t sound that special because the
sound itself could well have a naturally long attack or decay. The real magic comes
when you apply it to a sound that your ear will recognise as a naturally percussive
sound. In situations like that the fact that there is a fade-in or fade-out is just enough
to catch your attention as sounding “wrong” but in a completely natural way,
assuming of course that you don’t go to extremes with the blur (which can be fun in
itself!)
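Horizontal motion blur amounts to convolving each row with a one-sided kernel along the time axis. Here is a sketch of the rightward smear (the “ghostly” reverb direction); the decaying kernel shape is just one plausible choice, and blurring to the left instead would give the reversed-reverb effect:

```python
import numpy as np

spec = np.zeros((4, 100))
spec[:, 10] = 1.0          # a single percussive hit

def motion_blur_right(spec, length=30):
    """Smear each pixel to the right over `length` columns - a time-domain
    blur that fades each sound out rather than diffusing it like a reverb."""
    kernel = np.linspace(1.0, 0.0, length)   # decaying tail
    kernel /= kernel.sum()
    # full convolution, then trim back to the original width
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel)[:row.size], 1, spec)

smeared = motion_blur_right(spec)
```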
Another interesting thing that you can do is to skew the spectrogram image. This is
where you move the position of the top of the spectrogram relative to the bottom. If
you imagine a click that would show up as a vertical line through the spectrogram
then, after skewing, this would become slightly diagonal. Now if we replace that click
with a more useful sound we can perform a very unique kind of fade-in where the
harmonics of a sound that, ordinarily, start at the same time can begin at different
times. Once again, the particular effect that this has will depend a lot on the sound. If
we use a sustaining sound the result could be quite similar to quickly opening up a
low-pass filter (if the top of the image is skewed to the right) or closing a high-pass
filter (if the top of the image is skewed to the left). If, on the other hand, you use more
percussive sounds and experiment with greater skew amounts you get a very
interesting “smearing” of the frequencies. Because of the nature of what we are doing,
each new sound, be that a new percussive hit or a new word in a vocal passage, will
sound like it has the filter opening on it. This would mean that the sound no longer
has its obvious percussive attack, but the newly created sounds can be unique in many
ways so it is worthwhile experimenting. Of course, if we shift the top of the image to
the left relative to the bottom then the high frequencies will come in first, so the effect
will be closer to that of a high-pass filter being closed down on each individual note.
The problem here is that the start of each note sounds less distinct than with the low
frequencies coming in first simply because the low frequencies are often more
obvious in defining the start of a sound even though the higher frequencies contribute
more to the character.
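Skewing can be modelled as shifting each row horizontally by an amount proportional to its height in the image. A rough sketch, using a click (a vertical line) as the test image; row 0 is taken as the top here, and note that np.roll wraps around rather than truncating, which a real image editor would not do:

```python
import numpy as np

spec = np.zeros((100, 200))
spec[:, 50] = 1.0          # a click: a vertical line through the spectrogram

def skew(spec, max_shift=20):
    """Shift each row in proportion to its height, so the top of the image
    moves max_shift columns to the right relative to the bottom."""
    out = np.zeros_like(spec)
    n_rows = spec.shape[0]
    for r in range(n_rows):
        shift = round(max_shift * (n_rows - 1 - r) / (n_rows - 1))
        out[r] = np.roll(spec[r], shift)
    return out

skewed = skew(spec)    # the vertical line becomes a diagonal one
```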
If you were to change the horizontal size of the image, it would be equivalent to time-
stretching; making it wider would be stretching and making it narrower would be
compressing. However, when you look at stretching or compressing the image in a
vertical sense then we have what I can only call “spectral squashing”. What you
would be doing here is compressing the spectral distribution of the sound and, as a
result, the frequencies would no longer be nicely and harmonically distributed. Each
frequency that was previously a whole number multiple of the fundamental may now
be, for example, 1.42 times the previous frequency, which will lead to anything from
slightly odd-sounding at very low squash values to unrecognisable at higher values.
Flipping the image horizontally will perform a basic reverse function much like you
find in almost every audio editor but, on the other hand, flipping the image in a
vertical sense will mean that the “fundamental” frequency – the most dominant
frequency – now becomes the highest frequency and the reducing levels of all of the
other higher harmonics in the original sound now become increasingly quiet sub-
harmonics. Once again, it is quite hard to describe this effect and it is very much
material dependant, so trial and error (if there is such a thing as “error” in processing
of this kind) is the order of the day.
Image editors often allow you to create “gradient fills”, so if you create a new layer,
then create a “white to black” horizontal gradient (white on the left) and use this layer
to multiply with the spectrogram, the gradient will act as a volume fade where the
white part of the gradient represents full volume and the black represents silence. In
this case it would equate to a fade-out. If we reverse the gradient so that the white is
on the right then we will have a fade-in. If we create a vertical fade with white at the
bottom then all of the lower harmonics will be at high volumes while the higher ones
will be silent. In audio terms this is a low-pass filter effect. Similarly we could reverse
the gradient so that the white was at the top, which would be equivalent to a high-pass
filter. By combining the two ideas and creating a diagonal fade we will create an
effect like a gradually opening or closing filter. One example would be a diagonal
gradient with white at the bottom right corner and black at the top left corner. At the
beginning of the file all high frequencies would be silent and low frequencies would
be very quiet. As we progress through the file, the lower frequencies will become
louder first and only towards the end of the file will the higher frequencies start to
become audible. In audio terms this would be a slowly opening low-pass filter. By
varying this we can create opening and closing low-pass and high-pass filters
depending on the positions of the white and black portions of the diagonal gradient.
Finally, you can often vary the smoothness of the gradient from a very gentle
transition to a simple two-tone effect that changes from white to black at the mid-
point. This smoothness variation is equivalent to the filter slope of an analogue filter.
The smoothest gradient could be the equivalent to a 6dB/octave filter while the two-
tone transition would be beyond the sharpness of even a 48dB/octave filter.
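The gradient-multiply idea can be sketched with numpy arrays standing in for the gradient layer and the “multiply” blend mode (row 0 is taken as the top of the image, and the sizes are arbitrary):

```python
import numpy as np

spec = np.full((100, 200), 1.0)   # a uniformly bright spectrogram

# Horizontal white-to-black gradient (white on the left): a fade-out.
fade_out = np.linspace(1.0, 0.0, spec.shape[1])[None, :]

# Diagonal gradient, white at the bottom-right and black at the top-left:
# the slowly opening low-pass filter described above.
rows = np.linspace(0.0, 1.0, spec.shape[0])[:, None]   # 0 at top, 1 at bottom
cols = np.linspace(0.0, 1.0, spec.shape[1])[None, :]   # 0 at left, 1 at right
diagonal = (rows + cols) / 2

faded = spec * fade_out       # "multiply" layer blend
filtered = spec * diagonal
```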
One really interesting thing you can do is to create a new layer, load another
spectrogram into this layer and then set the layer blending to multiply mode. The
resulting image will be a hybrid of the two. It is almost as if you have the “picture” of
one, with the tonal characteristics of the other. Or perhaps you could say you have the
meaning of one and the colour of another. And this is exactly what a vocoder does.
You take one sound and impose its spectral content onto another sound. The most
obvious use of this is to create Cylon-esque or “Mr. Blue Sky” robot voices, but
vocoding actually has a lot of potential uses outside of that. You can vocode a
percussive loop with a melodic sound to create a hybrid that has the rhythm, dynamics
and “snap” of the percussion loop but has tonal qualities from the melodic sound. You
could be more subtle still and vocode a string section and a choir “ahhh” sample to
create a sound that is neither and both simultaneously. This sounds particularly un-
vocoder-like if both sounds are playing the same chord or melody so you aren’t
actually imposing any musical change on either of the sounds, but are simply messing
with the spectral balance.
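The multiply blend itself is a one-liner; the point is that the result is bright only where both source spectrograms are bright, which is essentially what a vocoder does with two spectra. The two input arrays here are just random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical magnitude spectrograms of the same size, values 0..1.
drums = rng.random((256, 400))     # provides the rhythm and dynamics
strings = rng.random((256, 400))   # provides the tonal colour

# "Multiply" layer blend: the hybrid can never be brighter than either source.
hybrid = drums * strings
```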
And finally, just as a little curve-ball, and something for the adventurous only, how
about creating a negative of the original sound as you would with a photograph? If
you try this, be warned; turn your speakers or headphones down because the resulting
noise is usually chaotic as there will be a large “noise” component to the sound.
Whatever was “silent” harmonic space in your original sound will become full
amplitude in this negative. I can’t actually think of a practical use for this to be
honest, but you never know.
Even this only really scratches the surface of what you might be able to do with an
image editor and plugins – a lot of the things you try might prove to be unusable in
any musical context but could be just perfect for creating that foreboding, heavily
backlit, exploring-the-bowels-of-the-powerless-drifting-spaceship drone sound effect
that you need…
I do feel that I need to finish the chapter with one statement though. In spite of the
flexibility and creativity of all of the image processing techniques described above,
you have to remember that you are dealing with a visual representation of the sound.
Because of the nature of the spectrogram, we could have all of the accuracy we need
along the time axis. After all, what we are doing is representing the spectral balance at
a series of points in time. If our sampling frequency is 96kHz then we could use 1
pixel for each sampling period and have a time resolution equal to that of our original
audio in our image editor. It would be hugely impractical at this stage in computing
technology because we would need 96,000 pixels for every second of audio, which
would give all but the most well-specified of PCs or Macs a hard time. The greater
problem is the vertical axis. In a normal waveform we are plotting level (vertical)
against time (horizontal) and each of those are quantised in the audio recording
process. We have 24 bits of range in terms of level, which means 16.7 million
possible values – a huge amount of resolution – and we have (a measly) 96,000
samples per second along the time axis.
If, as in the case of a spectrogram, we are plotting frequencies along the vertical axis
things get a little more complicated because each vertical position represents a
frequency. If we had 20,000 pixels vertically then each pixel could represent 1Hz. Not
only would this be nowhere near a good enough resolution, but it also wouldn’t be
simply that each pixel represents a range of 1Hz – it would mean that each pixel was
fixed to an exact frequency. If we had a frequency of 842.5Hz present, it would
either have to be quantised to 842Hz or 843Hz (which could, especially with
lower frequencies, lead to things sounding a little out of tune) or be
represented by being displayed at both frequencies at a lower level. And
50% volume at 842Hz added to 50% volume at 843Hz is not the same as 100%
volume at 842.5Hz.
In order to overcome this we would probably have to have a much more accurate
vertical (frequency) resolution, accurate to at least 0.01Hz (to be safe) which, on a
96kHz audio recording (and therefore a frequency range up to 48kHz), would mean
we would need 4.8 million pixels vertically. That amounts to roughly 460 billion
pixels per second. Each one of those pixels would need to have a brightness value
attached which, if we wanted a good quality representation of the audio signal, would
mean at least 16 bits per pixel (or 2 bytes) so, if you are interested, that would mean
around 920 billion bytes per second, which is in the region of 850GB per second. This
clearly isn’t practical now and I can’t see any time
in the near to mid-term future when we would happily process audio files like this. So
obviously some compromises need to be made. As time and technology moves on we
may approach this theoretical ideal resolution but the original FFT analysis of the
audio file would need to be at an equivalent resolution for this to be worthwhile.
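The storage arithmetic can be checked directly, assuming a 96kHz recording (so frequencies up to the 48kHz Nyquist limit), 0.01Hz vertical resolution, one spectrogram column per sample and 2 bytes per pixel:

```python
# Back-of-envelope check of the resolution arithmetic, under the stated
# assumptions (48kHz Nyquist, 0.01Hz per row, 96,000 columns/second,
# 16 bits = 2 bytes per pixel).
nyquist_hz = 48_000
vertical_pixels = round(nyquist_hz / 0.01)          # 4,800,000 rows
columns_per_second = 96_000
pixels_per_second = vertical_pixels * columns_per_second
bytes_per_second = pixels_per_second * 2
gib_per_second = bytes_per_second / 2**30           # roughly 850-860 GiB/s
```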
The point of this last statement is solely to illustrate that if you start processing
spectrogram images in an image editor you shouldn’t necessarily expect the resulting
sounds to be 100% true to the original. The problem is magnified by the fact that it
isn’t only the areas of the spectrogram that you have worked on that will suffer from a
loss of accuracy. Due to the limitations of representing the spectral data in a visual
format before exporting it, simply exporting and then importing back in without
actually doing any processing will result in a loss of quality. How bad this is will
vary, but just be prepared. This type of image processing as audio editing is very
“fringe” anyway, but it is a technique that could become more widely used in the
future, especially if the technology advances so that the image data can be
represented more accurately in the exported image, or if the makers of spectral
editing software bring more and more image editing paradigms into their own
software. The advantage there is that you would be applying these “adjustment
layers” directly to the original spectral data rather than to a reduced-resolution
visual image of that data.
All in all, though, I would have to say that, even with the reduction in quality, these
methods allow us to do some things that are impossible in any other way. On that
basis it is worth exploring them now and waiting for the technology to catch up a
little to improve the quality of processes we are already familiar with.
Truly Fluid Audio
How Is This Different From “Elastic Audio”?
Now that we have looked at the many different facets of audio editing, what we hope
to achieve, what we actually can achieve, and what may represent at least one part of
the future of audio manipulation, where does this leave us? Well, the answer to that
question may lie in the realms of what I would call “truly fluid audio”. This isn’t
really a new concept, but rather an integration of existing technologies and some new
tools and workflows.
Primarily, for me, the future of audio editing would be a system that incorporates
everything we can currently do into one single application. Failing that, it would still
be almost as useful if external applications could be launched from inside your
DAW, allowing changes to be made in the context of what the DAW is
playing. This would require some kind of synchronisation
between the two so that when playback was started in this new audio editor, the DAW
would play from the appropriate place and everything would remain synchronised.
Any looping that was in place in the DAW would also need to be relayed to the editor
so that it was easy to cycle around a particular section while working. And any audio
output from the editor would need to be looped back into the DAW in the exact same
place as the original audio was so that any further plugins or automation would
continue to be applied.
This would give us true “in context” editing and that, I believe, is crucial because no
advanced audio editing will ever be truly without any kind of artefacts. While our job
as editors should be to strive for perfection, when perfection isn’t achievable we need
to make a judgement call on what is acceptable and what isn’t. If you are working
solely within a spectral editor, for example, and you are trying to solve a particular
problem on a piece of audio that will, eventually, be mixed back into a track, it can
be hard to know whether you have done the job sufficiently well because you can’t
hear it truly in context. It might be that you are trying to remove a particular sound
and, because you are uncertain whether you have done enough, you take things a little
too far and end up removing too much of the surrounding areas of sound, just to be
safe, and this has a detrimental effect on the rest of the sound. Had you been able to
listen to this in context you might have realised, a few processing steps earlier, that
you had already done enough.
Now in terms of an actual feature set for this fluid audio editing, I think that the editor
would need to incorporate “recycling” techniques for when that approach was the
most suitable, but also our normal time-stretching processes. This time-stretching
should also incorporate a number of the developments which we have mentioned
including re-modelling of vibrato and the ability for the user to manually
select/highlight areas that aren’t to be stretched. This could also be extended to an
“intelligent” system which suggested areas which could be candidates for being
excluded from being stretched, such as held notes that don’t change pitch during their
duration and which have sufficient amounts of silence following them to ensure that
no overlap would occur following the stretch. Perhaps any automatic transient
detection algorithm should also highlight the sections of audio that it defines as
transients, allowing us to manually add, delete or change these selections.
Being able to just click on transient markers and move the audio is very useful as a
workflow tool, but a better stretching algorithm for any bounced or rendered versions
should definitely be included as well.
Equally, pitch shifting should be incorporated into the same workflow and be possible
in the same window. All of the features that we currently have available for
polyphonic pitch shifting in Melodyne would be available here, but perhaps with an
additional information layer. Melodyne achieves its polyphonic pitch shifting by
analysis of the audio file and detection of similar patterns of harmonic content. If an
audio file was processed which detected several different patterns, created by having
different instruments in the same audio file, then perhaps these could be colour-coded
so that it was easier to see which notes came from which instrument. If we incorporate
the database features that I mentioned in the previous chapter then the software might
even be able to take a guess at what kind of sound it was, based on the harmonic
content, and label it accordingly (editable by the user, of course).
The most obvious visual interface for both of these would be a “piano roll” type
display, much the same as is used for the current versions of these techniques. The
pitch could be manipulated by moving notes up and down the scale (either freely or
“snapped” to a scale) and timing changed by having transient markers overlaid on to
the waveform display which allowed simple movement back and forward along the
timeline with the option to either keep the length the same and perform a simple
“move”, or to keep the end point fixed if moving the start point or keep the start point
fixed if moving the end point. This is not that dissimilar from current working
methods but it would, depending on the DAW, require some integration of different
techniques into a single window.
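The three marker behaviours described above (a simple move, or a stretch anchored at either end) are easy to sketch. A minimal illustration, with made-up function names and times in seconds:

```python
# Sketch of dragging a region's start marker, in the three modes
# described above. Purely illustrative - not from any existing DAW.

def drag_start(start: float, end: float, new_start: float, mode: str):
    """Return (new_start, new_end, stretch_ratio) after the drag."""
    if mode == "move":       # keep the length the same, slide the region
        return new_start, new_start + (end - start), 1.0
    if mode == "stretch":    # keep the end point fixed, time-stretch to fit
        return new_start, end, (end - new_start) / (end - start)
    raise ValueError(f"unknown mode: {mode}")

print(drag_start(1.0, 3.0, 0.5, "move"))     # (0.5, 2.5, 1.0)
print(drag_start(1.0, 3.0, 0.5, "stretch"))  # (0.5, 3.0, 1.25)
```

Dragging the end marker would be the mirror image: either a plain move or a stretch with the start point held fixed.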
Moving forward from there we should be able to incorporate spectral editing in the
same window. This would require a change of view though. The waveform/piano roll
display would have to be substituted for a spectrogram view, but the timing grid
should remain visible, as should the transient markers. If the transient markers were
moved then the audio would be moved or stretched in the same way as we just
described above for the waveform view but, in this case, rather than the waveform
shape being changed, the spectrogram image would be adjusted to compensate. This
would perhaps be a little more difficult to achieve because, in the waveform/piano roll
view, each separate note would be clearly identified by its own waveform, whereas in
the spectrogram view the harmonic data for the entire sound is blended together. As
such, a note on E2 closely followed by a note on G2 might not be easy to visually
separate in order to position the transient markers. It should be possible to scale the
vertical axis of the spectrogram (the one that represents pitch) in such a way that a
piano keyboard could be placed along the vertical axis and the position of each
frequency in the spectrogram view was aligned to a musical note on the keyboard
display. This would make it much easier to pick out what each individual note was on
the spectrogram view and, therefore, figure out what each of the transient markers
was for.
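That alignment between spectrogram frequencies and keyboard notes would rest on the standard equal-temperament mapping, where each octave doubles the frequency. A quick sketch (the helper names are my own):

```python
import math

def frequency_to_note(freq_hz: float) -> float:
    """Map a frequency to a MIDI note number (69 = A4 = 440Hz).
    A fractional result means a pitch between two keys."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

def note_name(midi_note: int) -> str:
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return f"{names[midi_note % 12]}{midi_note // 12 - 1}"

# E2 and G2 sit very close together on a linear frequency axis,
# but map cleanly to separate keys on a log-scaled one.
print(note_name(round(frequency_to_note(82.41))))   # E2
print(note_name(round(frequency_to_note(98.00))))   # G2
```

Scaling the spectrogram's vertical axis by this logarithmic relationship is exactly what would let a piano keyboard line up against it.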
We should then have the ability to incorporate layers and layer adjustment tools into
this spectral editor so that we can perform some of the more esoteric editing processes
that we discussed in previous chapters as well as having alternative tools to carry out
the more routine editing operations. Each of these layers would be tied to the
underlying audio so that any movement of the underlying audio regions in the arrange
window would result in the layers automatically being repositioned as well. Having
this ability would mean that we get to keep the edits and layers “live” and wouldn’t
have to commit to them in order to be able to move the audio regions around. In fact,
this should apply to any of the audio files and any of the editing tools. If we apply
pitch correction or time-stretching/movement then we should be able to move the
audio region in the arrange window and have all of the edits follow the position. This
is possible in some DAWs and in some aspects but the perfect system would be
region-based in its changes and all edits would be synchronised relative to the region
(sometimes called “clip”) rather than absolute in their time or position.
To make these tasks even easier to carry out, there could be an option to have a split-
screen view where the top half of the window represented the spectral view and the
bottom half represented the waveform/piano roll view. Edits could be carried out in
either window depending on which was more appropriate for the task at hand. Any
changes made in one part of the window would result in the information in the other
part of the window being updated automatically. You could, perhaps, extend this even
further to include a third part of the window which was a traditional combined
waveform view. Edits such as fade-ins and fade-outs can most easily be carried out
with this editing view so it would have its uses. Plus it would be a very good way to
instantly check whether any of the edits you were making in the other views were
resulting in any obvious glitches in the main waveform. It might be too much to have
all three of these on screen at once but it would be easy enough to incorporate
different screen sets that could be triggered by a simple key command. CTRL + 1
could trigger full screen spectrogram, CTRL + 2 could be full screen piano roll,
CTRL + 3 could be full screen waveform, CTRL + 4 could be split spectrogram and
piano roll, CTRL + 5 could be split piano roll and waveform, CTRL + 6 could be
split spectrogram and waveform and CTRL + 7 could be all three together. This, to
me, would represent the ultimate editing workflow because pretty much every editing
task (other than comping) could be carried out in this single window.
Moving on to the issue of demixing – there are a number of possibilities here. Perhaps
we could extend the layers model to include different types of layer. There could be
adjustment layers which are tied to base layers. To begin with, the audio file or region
would have just one base layer: the entire contents of the region.
Adjustments could be made to this in all of the ways we described in Chapter 17. In
addition, though, we could create additional base layers that were the result of
extracting particular sounds using demixing technology. These additional base layers
would play back at the same time as the rest of the audio and would be phase-locked.
They would, though, have the possibility to have additional adjustment layers linked
to them. Using this idea you could have a fully mixed live recording to which you
applied some compression adjustment layers and maybe a few spot attenuation layers.
But then if you decided, for example, that you wanted to process the guitar solo
separately, you would extract that out to a new base layer using demixing. Then you
might add some stronger compression and perhaps some EQ (also more than feasible
with adjustment layers instead of plugins) and then raise the level overall. The guitar
solo and the rest of the audio would remain as two separately editable base layers
without the need for them to be recombined, and all of this would exist within the
region itself meaning that it could be freely moved or copied and all of the edits
would remain perfectly as they were intended.
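As a rough illustration of how such a region/layer model might be structured – all of the names and parameters here are hypothetical, not taken from any existing product – the guitar solo example could look something like this:

```python
from dataclasses import dataclass, field

@dataclass
class AdjustmentLayer:
    """A non-destructive edit (compression, EQ, spot attenuation, ...)
    tied to one base layer."""
    kind: str
    params: dict

@dataclass
class BaseLayer:
    """Either the full region audio or a demixed extraction from it,
    played back phase-locked with the other base layers."""
    name: str
    adjustments: list = field(default_factory=list)

@dataclass
class Region:
    """All layers live inside the region, so moving or copying the
    region carries every edit along with it."""
    base_layers: list

# A mixed live recording plus a demixed guitar solo as a second base layer
full_mix = BaseLayer("full mix")
full_mix.adjustments.append(AdjustmentLayer("compression", {"ratio": 2.0}))

guitar = BaseLayer("guitar solo (demixed)")
guitar.adjustments.append(AdjustmentLayer("compression", {"ratio": 4.0}))
guitar.adjustments.append(AdjustmentLayer("eq", {"band": "2kHz", "gain_db": 3}))

region = Region(base_layers=[full_mix, guitar])
```

The important design point is that the adjustments hang off the base layers, and the base layers hang off the region, so nothing ever needs to be recombined or committed just to move the region around.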
Perhaps the copying process should be extended to give more freedom. One option
would be a free copy that would start with all of the edits exactly the same as the
source region but would give the user the ability to change those edits independently
of the source region. The other option would be to create an alias copy which would
mirror the edits made in the source region. Any changes to the edits in the source
region would be automatically applied to the alias copy as well. The reason for having
this system which seems, initially, to be more complex is simply to allow as many of
the edits as possible to remain “live” for as long as possible. I have lost count of the
number of times that I have been working on a project and made an edit feeling that it
was genuinely right (and perhaps it was right at the time) only to find that, later on,
there has been a change in plans or perhaps a new idea that couldn’t have been
predicted which rendered the edit inappropriate. It wasn’t catastrophic because I still
had the original, unedited version of the file available and simply had to adjust the
edit but, because of the amount of time that had passed, I wasn’t 100% sure that I had
made the edit in the same way. Keeping the edits “live”, even if it meant a slight
complication in workflow, would, in many cases, be invaluable as it would keep the
audio truly fluid for as long as possible.
With that said, there will be times when you would want to commit certain edits and
this should be an option from within this editing window. You might want to take one
of the extracted base layers and create a separate audio file from it. There should be
an option to do this and either render the layer with the additional adjustment layers
applied or to simply extract the base layer and adjustment layers to a new region to
keep the edits made to this layer still “live”. Alternatively it might be that you want to
render the whole region with all of the adjustment layers to a new region. If this was
the case you should have the option to either render the region in place or to render to
a new region and still keep the original as it was. Perhaps you don’t want to render the
whole region but just want to make a single one of the adjustment layer edits
permanent. In this case each adjustment layer should have an individual option to
commit that edit and leave the other layers in place.
I think it would be very useful to be able to “host” plugins within this editing window.
While we might want a general EQ or compression setting on the track as a whole, it
might be that, for one small region, we would want to apply a particular plugin.
Having the ability to do that on a region-by-region basis would be exceptionally
useful and would be, to me at least, more intuitive than having to apply a plugin
globally for the whole track and then automate either the bypass or certain settings of
that plugin so that it only had an effect on a particular part of the track.
How about the option of automatically creating a sampler instrument from certain
parts of an audio file? If you have a purely monophonic part then ReCycle or similar
tools can do this for you, but the sequence of notes will simply be mapped to adjacent
keys without any awareness of the actual note being played. This makes sense for
drum loops and even for melodic parts where the aim is to simply recreate the pattern
and timing as a sampler plus MIDI file version. But if you wanted to create a playable
sampler instrument where you have freedom to play notes as you wish, it should be
possible to analyse an audio recording and then take a single note from that, pitch
shift it over a pre-defined range and then map those samples to the appropriate keys
on a sampler instrument. This wouldn’t be a perfect recreation, of course, because
many instruments exhibit timbral changes as the note played changes, and this method
would simply “stretch” a single note over a wider range.
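The key mapping itself is straightforward arithmetic: each semitone of shift corresponds to a playback-rate ratio of 2^(1/12). A sketch, with an assumed source note and range chosen purely for illustration:

```python
# Sketch: stretch one sampled note across a key range by resampling.
# The source note and two-octave range are illustrative assumptions.

def pitch_ratio(source_key: int, target_key: int) -> float:
    """Playback-rate ratio needed to shift source_key to target_key.
    Simple resampling, so pitch and duration change together."""
    return 2 ** ((target_key - source_key) / 12)

SOURCE_KEY = 60  # suppose the isolated note was MIDI note 60
key_map = {key: pitch_ratio(SOURCE_KEY, key) for key in range(48, 73)}

print(f"{key_map[48]:.3f}")  # one octave down -> 0.500
print(f"{key_map[72]:.3f}")  # one octave up   -> 2.000
```

A ratio of 0.5 plays the sample at half speed (an octave down) and 2.0 at double speed (an octave up), which is precisely why the timbre drifts as you stray further from the source note.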
It could be possible to look at a polyphonic audio part, pick out all of the different
notes that were present in that part and use each of them to map out a fully chromatic
instrument. This would help to alleviate the problems associated with pitch shifting a
single note too far but there would then be the problem that each note might be
different in intensity, volume, duration and tone. Nonetheless, with a suitable amount
of processing power and a suitably clever algorithm, I don’t see why it wouldn’t be
possible to extract the individual notes and then, using one as a “template”, alter
volumes and durations so that each note was similar before mapping them out over
the keys of a sampler instrument.
For me, the idea of being able to listen to a mixed piece of music, isolate a single note
from a single instrument, and then create a “multi-sampled” playable sampler
instrument from that sound with just a few mouse clicks is nothing short of
mesmerising. This does bring us firmly back into the realms of the copyright issues
that we mentioned in the last chapter though. If somebody has created a sound for a
particular song, whether that is using a combination of instrument, amplifier, effects
and other processing, or whether it was created on a synthesizer, nonetheless it is their
creation. If we can click a mouse a few times and have that sound, in isolation, for us
to use ourselves, then we are in very dangerous territory from a legal perspective. OK,
if the sound in question was a preset from a particular synthesizer then it is arguable
that we are not stealing anything as we could have exactly the same sound if we had
that synthesizer. But if we didn’t own that synthesizer then we don’t really have a
right to use that sound. Many synthesizers, both hardware and software, have license
agreements that allow them to be used in pieces of music but don’t allow them to be
used for the purposes of creating sample banks. So the person who originally used the
synthesizer in their song would not be breaking that license agreement by using it in
their song, but we would by creating a sampled version of it when we didn’t own the
original equipment.
All of these options, if available, would represent a formidable audio editing toolkit.
In my opinion this would make the differentiation between recorded audio and MIDI
regions all but obsolete. However, it would come at a price. To enable all of this
functionality you would need a very comprehensive set of controls and this might
prove overwhelming. All of this talk of layers, alias vs. free copies, region-based
plugins, alternative time-stretching algorithms, spectral adjustment layers and
whatever else we dream up, might simply be complete overkill for somebody who
occasionally wants to correct a slightly out of tune note by time-stretching a small
piece of audio. So if it were possible to include all of these features, I think we would
need to find a way to quickly and easily customise the depth of editing offered. The
editing window could, perhaps, have “tabs” at the top. We could have a “Waveform
Editor” tab, a “Piano Roll” editor tab, a “Spectral Editor” tab and an “Advanced” tab.
Each tab would feature only the editing processes and options that were most relevant
so if you wanted to do a quick fade-in you wouldn’t have to go through whole groups
of windows and displays and tools just to perform a simple task. Equally, if you
wanted to move the tuning of a note, you wouldn’t be confronted by rainbow coloured
spectrograms full of seemingly meaningless information. For the die-hard, there
would be the “Advanced” tab in all of its glory.
One thing that would be common to all of these options, though, would be seamless
integration. As soon as an audio file was imported or recorded, it would be analysed
in the background and prepared for editing. There would be no need to select a region
and wait while it was analysed. When you double-clicked on it (or clicked with a
modifier key) to go into your editing window it would be there immediately, ready to
work on. If this were implemented, though, I feel that although the analysis would
have taken place, until such time as any edits were made it should be the original,
untouched audio file or region that you hear playing back. That way you could be
sure that there were no unnecessary artefacts present as a result of analysing audio
that was always going to remain unedited.
Which brings us to the question of when, or perhaps if, this will ever happen. The
issue I see here isn’t a technical one. Most of the technology for everything I have
described above exists already. The elastic audio (time and pitch) functions already
exist either natively in your DAW or in plugin format. Polyphonic pitch shifting
already exists in Melodyne. Spectral editing already exists in a number of software
packages including the excellent RX2 by iZotope. Spectral editing with layers and
image processing exists (in a roundabout way) with PhotoSounder and Photoshop.
Demixing exists in a few tools, of which SpectraLayers Pro seems the most advanced
and promising at the moment. Creating sampler instruments and recycling audio files
already exists. Resynthesis (creating a multi-oscillator model of a sound) already
exists in Camel Audio’s Alchemy, and this could be used to create the
pitch shifted versions of the extracted audio used to create the sampler instrument. So
pretty much everything exists. The problem is that the various different technologies
have been developed by, and are consequently owned by, a number of different
companies. In order for them to all be available in a single editor there would either
need to be a “standard” established in which all technologies were cross-licensed and
individual editing software packages could be built upon this platform and royalties
shared by all or, perhaps more likely, a single company would need to either buy
outright or license each of the technologies from the owners.
Even if that happened there would need to be a lot of thought put into how to make
them all work together in both a technical sense and in a visual/workflow sense.
Figuring out how to make such a large number of editing tools available without
having a ridiculously cluttered “toolbox” would be a challenge in itself. And then,
finally, finding a way to integrate this either natively into a DAW or as an external
program which your DAW called up automatically and which had a bi-directional
information exchange and synchronisation built in would be technically difficult. The
size of data files would increase considerably as all of the FFT analysis data would
need to be stored with each region along with potentially huge amounts of additional
data relating to the adjustment layers, and so on. This would probably require a new
file format or container that incorporated the underlying WAV or AIFF data along
with everything that was needed to piece the edits back together. This new format
would also have to be intelligent inasmuch as it would apply all of this additional
metadata if it were loaded into a DAW or editor which had this functionality but, in
situations where you only had basic WAV or AIFF playback capability, the container
format would be intelligent enough to realise this and would simply spit out the raw,
untreated audio data. This would make sure that the audio files were still accessible in
some format at least, even if the only copy that was available was this new format and
you didn’t have the software capable of fully exploiting all of the additional
information.
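That fallback behaviour is actually plausible with today's WAV format, because RIFF readers skip chunks they don't recognise. As a rough sketch – the chunk ID and metadata here are invented for illustration – the extra edit data could be appended as a custom chunk that a basic player would simply ignore:

```python
import io
import struct
import wave

# Sketch: append a custom RIFF chunk carrying edit metadata to a WAV
# file. "EDTS" is a made-up chunk ID; standard WAV readers skip
# unrecognised chunks, so a plain player still sees ordinary audio.

def add_metadata_chunk(wav_bytes: bytes, metadata: bytes) -> bytes:
    payload = metadata + (b"\x00" if len(metadata) % 2 else b"")  # word-align
    chunk = b"EDTS" + struct.pack("<I", len(metadata)) + payload
    # Grow the RIFF size field (bytes 4-7) to cover the new chunk
    riff_size = struct.unpack_from("<I", wav_bytes, 4)[0] + len(chunk)
    return wav_bytes[:4] + struct.pack("<I", riff_size) + wav_bytes[8:] + chunk

# Build a tiny 96kHz mono WAV in memory, tag it, and read it back
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(96_000)
    w.writeframes(b"\x00\x00" * 96)

tagged = add_metadata_chunk(buf.getvalue(), b'{"layers": []}')
with wave.open(io.BytesIO(tagged), "rb") as w:
    print(w.getnframes())   # 96 - audio still readable by a plain WAV reader
```

A real implementation would of course need a far richer (and standardised) metadata payload, but the principle – graceful degradation to raw audio – comes free with the container.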
I will openly admit that this is just a personal view of the direction that I think audio
editing could take in the future. There are many people working in psychoacoustic
research the world over who probably have new algorithms in development at the
moment that will bring a totally fresh approach to editing which nobody had ever
considered before, but nobody can live without after trying it for the first time.
Whatever the specifics of the future of audio editing might be, I do believe that the
direction it will continue to take is towards being able to treat audio as if it were
completely malleable: no longer a fixed expression of an idea at one point in time,
but more like a set of tools allowing us to create our own version of that idea. I guess
the best analogy I can think of is one where an audio file represents
a movie. Ten years ago we were in the audience at a movie theatre and we were
merely spectators. Then came DVD with its (theoretically at least) ability to choose
between different camera angles. But the future of audio editing is one where we are
the directors and each sound in the audio is an actor. They have lines, they have a
story to follow, but the details of how and when they express their part in the story are
ours to shape and create in our own way. And if that is the future of audio editing then
I believe that it will become even more of an art and will allow for some amazing
works of creativity and re-imagining. I can’t wait…