Spectral "Image Processing"
The Conventional
Photo or image editing software might not be something you would ever think that
you had a need for in a recording or editing situation, but it can be a hugely useful
tool for solving simple problems, and can also be used for unconventional editing and
sound design tasks. But before we even consider the editing possibilities themselves,
we need to figure out how to get from audio recording to image editor. The process is
a very simple one, but you will probably need additional software in order to do it as
it is quite a specialised and unusual thing to want to do.
MetaSynth is perhaps the most well-known package for sound-to-image and image-to-
sound translations but it was originally designed as a tool to allow you to “draw” new
sounds and analyse existing audio files rather than as a “bridge” between audio and
image editing software. In fact, the best tool that I know of for doing this is the
excellent PhotoSounder, which not only allows you to load up audio files and then
export the spectrogram as an image for editing in your software of choice before
loading the image and converting back to audio, but also features some quite
comprehensive editing tools of its own. These tools are very much the kind of things
you would expect to see in any image editing package and, coupled with the “layers”
methodology adopted by many image editing packages, PhotoSounder is actually
more than capable of a great deal before you even think about exporting the image.
Many of the ideas that follow will be possible within PhotoSounder itself, but I will
look at the techniques for image editing in Photoshop as it features some options that
are unavailable in PhotoSounder or other simpler editors.
However, a fundamental advantage to working within PhotoSounder is that you can
preview the changes in real time whereas any editing done after exporting to your
image editor will be done on visuals alone – you would have to re-save the image and
import it back into PhotoSounder to hear the effects. As a result it can make sense to do
the simplest jobs in a spectral editing application, to use PhotoSounder for more
complex jobs, and to move to an external image editor only when absolutely
necessary.
The first thing that we need to get used to is the idea of “layers”. The benefit of using
layers is that each layer can serve as an adjustment to the original audio. With most
spectral editing software you work directly on the audio file so any changes you make
are applied to it. They can, of course, be undone but you work sequentially applying
one edit after another. You can step back through many levels of undo but you
couldn’t, for example, keep the actions that you carried out on steps 3, 4 and 6 but
lose what you did on steps 1, 2, 5 and 7. With layers in an image editor it is certainly
possible to do that. Each adjustment can exist on its own layer and then can be freely
shown, hidden or combined in numerous ways.
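The layer idea can be sketched in code. What follows is a minimal, hypothetical model (using Python and numpy, with the spectrogram as a 2D brightness array); none of the names come from any real application, but it shows how each edit can live on its own layer and be shown or hidden independently:

```python
import numpy as np

# A spectrogram as a 2D brightness array (rows = frequency, columns = time).
spec = np.full((64, 100), 0.5)

# Each "layer" is an independent adjustment: a region plus a gain factor.
# These example layers are entirely made up for illustration.
layers = [
    {"region": np.s_[10:20, 30:40], "gain": 0.5, "visible": True},   # edit 1
    {"region": np.s_[40:50, 60:70], "gain": 2.0, "visible": False},  # edit 2, hidden
]

def render(base, layers):
    """Apply only the visible layers; the base image is never modified."""
    out = base.copy()
    for layer in layers:
        if layer["visible"]:
            out[layer["region"]] *= layer["gain"]
    return out

result = render(spec, layers)
```

Toggling a layer's "visible" flag before re-rendering is the equivalent of keeping the edits from steps 3, 4 and 6 while discarding the others.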
The next things to consider are the tools used to make the selections. In the previous
chapter I stated that many of the selection tools used in spectral editing applications
were in some ways a subset of those used in image editing, so it is worth taking a look
at the extended options available in your image editor. The “lasso” tool in a spectral editor
is very useful for highlighting irregularly shaped selections, but an image editor may
well have a “magnetic lasso” tool that will try to detect obviously visible “edges” and
automatically jump to and follow them. In certain situations this could be helpful for
making sure your selection is as “tight” as possible. In addition to the rectangular
selection tools, most image editors offer circular/oval selection tools as well,
which might prove very useful in some situations. Finally, many image editors allow
you to save selection areas as “masks”, which means that you can load them at any
point and you will be guaranteed to have the same selection area – even if you choose
to go back and apply additional processing and adjustments to a particular area after
you have deselected it.
One other option which can be very useful is the ability to control the size, shape and
softness of the brush tool. While a round brush will probably be the most useful, there
might be times when a differently shaped brush will serve your purposes better. And
the softness parameter will change the brush from a hard edge to a very soft, blurred
edge. If we use blurred-edge brushes in our adjustment layers then it will graduate the
effect of whatever the layer is doing from the edges of the blur to the solid centre of
the brush. Once again it is a subtle difference but it can be helpful if a particularly
unwieldy edit is needed.
Once we have used the expanded range of tools to create our selection it is time to
actually do something useful with it. For each new adjustment (or collection of similar
adjustments, perhaps) we would create a new layer and then set it up to carry out the
task we want. Many of the tasks we looked at in the last chapter have direct
equivalents in image editing in this way so we will quickly look at the ways we can
achieve the same results here before moving on to look at some things that aren’t
achievable in a spectral editor (yet).
Attenuation is a very simple task to achieve in an image editor. All we need to do is
remember that brightness is the image processing (and spectral editing) equivalent of
amplitude so attenuation is achieved by decreasing the brightness. If we make our
selection (or load a previously saved one) and then create a brightness adjustment
layer in our image editor we can then increase or decrease the brightness (amplitude)
of the selection as desired. The great advantage here is that not only can we make the
amplitude adjustment in both directions (quieter or louder) but we can also go back
and change this setting later in light of additional changes that we may make, all
without having to commit to any change before we do a final export of the image
from the image editor.
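As a rough sketch of what a brightness adjustment layer is doing to the underlying data (all sizes, positions and values here are hypothetical):

```python
import numpy as np

# Spectrogram brightness in the 0..1 range (1.0 = full amplitude).
spec = np.full((512, 1000), 0.8)

# A hypothetical rectangular selection: frequency rows 100-200,
# time columns 400-600.
selection = np.s_[100:200, 400:600]

def brightness_adjust(spec, selection, gain):
    """'Brightness adjustment layer': scale the selection by a gain factor.
    gain < 1 attenuates, gain > 1 boosts - and because the original array
    is untouched we can re-render with a different gain at any time."""
    out = spec.copy()
    out[selection] = np.clip(out[selection] * gain, 0.0, 1.0)
    return out

quieter = brightness_adjust(spec, selection, 0.5)   # attenuate
louder = brightness_adjust(spec, selection, 1.5)    # boost, clipped at full scale
```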
Copying and pasting is achieved in much the same way as it would be within a
spectral editor but working in an image editor does allow some additional flexibility.
By pasting the copied area on to a new layer it allows for it to be moved or “unpasted”
after the fact and, perhaps more importantly, it allows us to soften the edges of the
pasted area so that there are no abrupt changes in tone. If you copy and paste a
selection in a spectral editor you have very little control over the transition between
the original and pasted areas and this could potentially lead to audible artefacts at the
transition point. Doing the same thing in an image editor gives you much more
control over this important aspect of the pasting process.
While there is no dedicated de-clicking process in an image editor, by and large,
clicks are quite easy to deal with because they are a relatively high burst of sonic
energy over a wide frequency range and a short period of time. As such they will
normally show up quite clearly as fairly obvious vertical lines in the image editor.
Once we have identified them there are a couple of ways we could deal with them.
The first is simply to highlight the offending area, delete it and move the
surrounding areas so that they cover the gap. This will obviously make the whole file
a little shorter but if there are only one or two digital clicks of a few samples long then
this would amount to a few milliseconds at most and is unlikely to have any impact on
the file as a whole. Of course in doing this there is a risk that you will still have an
audible “glitch” because of subtle differences between the two areas that you have
moved to be adjacent. It is also not a very viable option if there are many clicks and
each is of a longer duration because the reduction in length would quickly add up.
Another option would be to copy and paste an equivalent length region from either
side of the click area but, again, you still run the risk of having audible glitches
because of minor differences between the original and pasted areas. You could, of
course, soften the edges as we described above to smooth out any such glitches. There
is a third option, however, which could prove interesting. Some image editors, either
natively or through plugins, provide the option to “fill” a selection with content that is
created by looking at the rest of the image and analysing patterns so that it blends with
the surrounding areas. This “content aware fill” (as it is called in Photoshop) could be
a great way to try to fill the gap left by the removed clicks, as it is a very similar tool
to some of the more advanced removal/attenuation algorithms in spectral editors.
What we really need to do when removing clicks this way is to interpolate between
the values immediately to the left of the removed and the values immediately to the
right. The shorter the click was, the easier this will be to do because, if we are talking
about a couple of milliseconds, even if the interpolated values aren’t exactly what
should have been there without the click, the period is so short that we probably
wouldn’t notice anything untoward. With longer clicks, however, a simple averaging
interpolation may be audible.
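A simple linear interpolation of the kind described could look like this; the spectrogram contents and the click position are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((128, 200))

# Hypothetical click: a bright vertical stripe at columns 100-102.
spec[:, 100:103] = 1.0

def interpolate_click(spec, start, stop):
    """Replace columns start..stop-1 by interpolating linearly, row by row,
    between the column just before and the column just after the click."""
    out = spec.copy()
    left = out[:, start - 1]
    right = out[:, stop]
    n = stop - start
    for i in range(n):
        t = (i + 1) / (n + 1)   # fraction of the way across the gap
        out[:, start + i] = (1 - t) * left + t * right
    return out

fixed = interpolate_click(spec, 100, 103)
```

As the text notes, this is only convincing for very short clicks; over a longer gap the straight-line values can drift audibly from what the surrounding material suggests.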
De-crackling with an image editor would, I believe, be very impractical. It
isn’t so much a problem of having the tools to remove the crackle as it is of
having a means to accurately (and automatically) detect and locate it. The de-
crackle tools in spectral editors are expertly designed to look for specific audio events
and, when located, to treat them in a certain way. There are no tools that currently
exist in image editing to achieve a similar result and, therefore, de-crackling is
something that should probably be left to a spectral editor or dedicated plugin to deal
with. At least for the time being.
In theory we may be able to de-noise audio using an image editor but that comes with
one huge caveat. Image editors and image editing plugins have some really great tools
for getting rid of noise but, and this is the important part, the way noise occurs in
images is generally quite different to the way it occurs in audio. Noise in images tends
to be either compression artefacts (which are dealt with in a very specific way) or
what is known as Gaussian noise. This is the visual equivalent of white noise in audio
terminology and represents a statistically random and uniformly distributed noise. In
audio terms this would mean that it would be totally equal across all frequency bands.
The tools used to remove this in image editing would, therefore, be of limited use to
us if our “noise profile” happened to have a non-uniform distribution. The reality is
that de-noising using graphical image editing tools often results in a spectrogram that
looks blurred and the audible result of that is one of making the whole sound very
unfocused and, at worst, having a “wateriness” to it which is very far from the desired
result. As a consequence of this, de-noising is probably best left to conventional audio
editing processes – but it’s always worth a try if you fancy experimenting.
As a long shot, it might be possible to work around this by taking the spectrum of a
pure noise part of the recording (as you would do with de-noising in a spectral editor)
and then combining this spectral profile of the noise with a pure Gaussian noise image
to create a “weighted” Gaussian noise profile, and then using that as a subtractive
layer on top of the spectrogram image in your image editor. However, it is unlikely
the results would be as good as a dedicated de-noising process. If you enjoy
experimentation, though, it could be something you might like to try just as an
alternative technique. It would be very much a static noise profile though, and
wouldn’t offer the advantages of adaptive noise profiles used in many de-noising
algorithms and plugins.
Hum, on the other hand, should be much easier to deal with because it is much easier
to detect in an image editor. We have already seen that mains hum consists of a
constant frequency of either 50Hz or 60Hz and the associated lower-order harmonics.
In a spectrogram image this will show as horizontal lines located at 50Hz or
60Hz, then fainter lines at 100Hz or 120Hz, fainter still at 150Hz or 180Hz, and so on.
You could almost look at these in the same way that we did with clicks. They
represent very narrow focuses of audio energy. Clicks represented a wide frequency
range for a short period of time while each of the harmonics present in mains hum
will represent a narrow frequency range for a long period of time. Therefore each of
the options we looked at for dealing with clicks would be equally applicable here
except that we would be working with horizontal selections rather than vertical ones.
Equally, if we wanted to deal with mains hum we would simply select the relevant
frequencies and then reduce the brightness (attenuate them) which would, in effect, be
like using a very narrow bandwidth EQ.
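A sketch of this hum attenuation, assuming a linear frequency axis and a made-up spectrogram size (real spectrogram displays often use a logarithmic axis, in which case the row arithmetic would differ):

```python
import numpy as np

sample_rate = 96_000                        # assumed sample rate
n_rows = 4096                               # assumed spectrogram height in pixels
hz_per_row = (sample_rate / 2) / n_rows     # linear frequency axis assumed

spec = np.full((n_rows, 500), 0.6)

def attenuate_hum(spec, fundamental=50.0, harmonics=5, width_hz=2.0, gain=0.1):
    """Darken a narrow horizontal band at each hum harmonic - the
    image-editor equivalent of a very narrow notch EQ."""
    out = spec.copy()
    for k in range(1, harmonics + 1):
        centre = k * fundamental
        lo = int((centre - width_hz) / hz_per_row)
        hi = int((centre + width_hz) / hz_per_row) + 1
        out[lo:hi, :] *= gain
    return out

dehummed = attenuate_hum(spec)
```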
There are a number of ways to achieve compression-like effects in an image editor
and these range from limiting the dynamic range of occasional sounds through to a
more general re-balancing of the dynamic range of the recording as a whole. There
are also ways in which you can achieve dynamic range effects that would be
completely impossible with any other spectral editing process or plugins. To begin
with let’s have a look at the simplest case: taming occasional peaks.
As with spectral editors, you can create selections in image editors and then attenuate
those areas to pull the peak levels down. If you want a result that is most similar to a
traditional single band compressor, then you would create a selection that covers the
width of the event we want to compress and also covers the entire spectrogram from
top to bottom (covering all frequencies). You can then create a brightness adjustment
layer and decrease the brightness by the required amount. You can also soften the
edges of the selection and this softening would be analogous to the “attack” and
“release” settings on a compressor. The softer the left-hand edge of the selection, the
slower the equivalent attack would be and the softer the right-hand edge, the slower
the equivalent release would be. What we are doing, in effect, is attenuating a single
part of the sound, which is what compression does at its most basic level.
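The soft-edged selection idea can be sketched as a one-dimensional gain curve applied across all frequency rows, with the ramp widths standing in for attack and release (all of the numbers here are hypothetical):

```python
import numpy as np

spec = np.full((256, 1000), 0.9)   # a uniformly loud spectrogram, for simplicity

def soft_attenuate(spec, start, stop, gain, attack_px, release_px):
    """Attenuate columns start..stop over all frequencies, ramping the gain
    in over attack_px columns and out over release_px columns - the softened
    selection edges play the role of a compressor's attack and release."""
    g = np.ones(spec.shape[1])
    g[start:stop] = gain
    # linear ramps at the edges ("softened" selection)
    g[start - attack_px:start] = np.linspace(1.0, gain, attack_px, endpoint=False)
    g[stop:stop + release_px] = np.linspace(gain, 1.0, release_px, endpoint=False)
    return spec * g[None, :]

compressed = soft_attenuate(spec, 400, 600, gain=0.5, attack_px=50, release_px=100)
```

The multi-band variant described next simply restricts the rows the gain curve is applied to, and softens the top and bottom edges of the selection as well.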
You can then apply the same theory and techniques to a multi-band compression
equivalent by simply selecting only the frequencies that you want to compress, in a
vertical sense, and then creating the selection and brightness adjustment layer as
before. In additional to softening the left and right edges to create “attack” and
“release”, you can also soften the top and bottom edges to simulate the effects of the
filter slopes on the multi-band compressor bands. All pretty straightforward stuff and
fairly similar to what we have already spoken about for spectral editors. But then
things get a little more interesting.
Compression is the process of reducing dynamic range of sound, and dynamic range
is defined as the range between the loudest and quietest parts of an audio signal.
Moving that across to our image editor we know that loudness is equivalent to
brightness (or more correctly luminance), so in order to compress the audio in our
spectrogram we would need to change the amount of difference between the lightest
and darkest parts of our image. However if we simply adjust the contrast of the image
(in effect reducing the difference between the loudest and quietest parts of the signal)
we are not doing what we would traditionally consider as compression. In
compression as we know it we are adjusting the levels of the sound as a whole at
different points in time whereas with this spectral compression we are adjusting the
difference between the levels of different frequencies in the sound. That’s not to say
that it isn’t a very interesting effect in its own right; it most certainly is, but it won’t
achieve the same results as a normal audio compressor.
Adjusting the contrast amount will reduce the effective dynamic range of our audio
but it does it in a way that no hardware or plugin compressor does. It will take a
hypothetical midpoint in the dynamic range and make any sounds that are louder than
this point progressively quieter and sounds that are quieter will get progressively
louder. We have already seen that downward compression involves reducing the
levels of loud signals while upward compression involves increasing the levels of
quiet signals. In a way this actually does both simultaneously so would be a unique
effect in its own right. But I promised earlier to talk about upward compression, so
let’s see how we can adapt this technique to give us something closer to genuine
upward compression.
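A contrast adjustment of this kind reduces the distance of every value from a midpoint, as a quick sketch shows (the midpoint and amount are arbitrary choices):

```python
import numpy as np

spec = np.array([[0.1, 0.5, 0.9]])   # quiet, mid-level and loud pixels

def spectral_contrast(spec, amount, midpoint=0.5):
    """amount < 1 reduces contrast: values above the midpoint move down and
    values below it move up - downward and upward 'compression' at once."""
    return np.clip(midpoint + (spec - midpoint) * amount, 0.0, 1.0)

squashed = spectral_contrast(spec, 0.5)   # approximately [[0.3, 0.5, 0.7]]
```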
More traditional audio-type compression can actually be achieved inside of
PhotoSounder itself and there are a number of tutorials on the website which go
through this process. Very interesting is the fact that you can actually draw the
compression curve directly on to the audio, which means that you can apply a linear
compression, a logarithmic (approximately) compression, or a compression curve of
any shape that you desire which allows for compression effects simply not possible
with any traditional audio compressor.
While neither of these methods (spectral compression in an image editor or the
pseudo-traditional approach in PhotoSounder) is a precision tool in terms of absolute
control, image processing of spectrograms can be an extremely creative tool and can
offer us ways of doing things which are just not possible in any other editing platform.
Even in the relatively commonplace editing tasks that we have looked at here there
are ways that image-processing editing can help us. If you are happy to venture off
the beaten track just a little then even more becomes possible.
The Unconventional
The processes and tools we have described so far in this chapter have all been related
to attempting to recreate tasks and effects that we can create other ways, either using a
spectral editor or, in some cases, simply using plugins or hardware processors. But if
we left it at that we would be overlooking some of the truly amazing ways that we can
bend and manipulate sound using an image editor. With this short section we are
doing a little more than just dipping a toe into the waters of sound design; this is very
much experimenting with audio and wouldn’t really be construed as “editing” as such,
but it is related to the editing tasks we have been discussing and it’s also a lot of fun
so, while we are in this area, let’s just have a brief look at some intriguing
possibilities. One thing that I will say at the outset, though, is that the best results are
achieved with sounds that aren’t overly complex. What I mean by this is sounds
which are not rhythmically complex or full pieces of music. Isolated, single sounds
that aren’t too “busy” are best for these types of effects.
Using blur effects can give rise to some really nice results that are actually quite hard
to describe. If we took a sound which had very definite harmonics of 100Hz, 200Hz,
300Hz, 400Hz, etc., we would see each of these harmonics as a very clearly defined
horizontal line. If we were to then apply a blur effect this would have the effect of
creating additional frequency content either side of the main frequency. In the case of
the 100Hz harmonic this would mean very subtle additional frequencies ranging from,
for example, 98Hz to 102Hz, increasing from inaudible at the edges of the range, up
to a maximum level at 100Hz and then down to inaudible again at 102Hz. The
resulting sound is very hard to characterise but it almost sounds like multiple versions
of the same sound detuned against each other. Imagine a choir singing where each
person was slightly out of tune. Depending on the sound there can be a subtle
inharmonic “shimmer” to the sound that can become quite pronounced if the level of
blurring is high enough.
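A minimal sketch of this frequency-axis blurring, applying a Gaussian kernel vertically to a single pure harmonic (a real image editor would blur in both directions unless told otherwise, and the sizes here are arbitrary):

```python
import numpy as np

# A single pure harmonic: one bright row in an otherwise dark spectrogram.
spec = np.zeros((101, 50))
spec[50, :] = 1.0

def vertical_blur(spec, sigma=2.0):
    """Blur along the frequency axis only: energy leaks into neighbouring
    frequency rows, giving the detuned 'shimmer' described above."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()                   # preserve overall energy
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, spec)

blurred = vertical_blur(spec)
```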
Easier to describe, perhaps, is the effect of blurring only in a horizontal sense with a
“motion blur” effect. This visual effect is the kind of thing you would expect to see in
a photo of a fast moving vehicle: a blur in a horizontal direction. If we apply this
effect to our spectrogram we get a kind of time-domain blurring which will sound
reminiscent of a reversed reverb if the blur is carried out from right to left and a
strange, “ghostly” reverb if the blur is carried out from left to right. Although we
could achieve effects which are superficially similar with reverb, this particular
processing does sound different because a reverb will tend to diffuse the sound as it
dies away and give the sound a sense of distance while this method sounds more like
a natural fade-in or fade-out of a sound. As a result, if you apply it to a naturally
sustaining sound (such as a pad), it probably won’t sound that special because the
sound itself could well have a naturally long attack or decay. The real magic comes
when you apply it to a sound that your ear will recognise as a naturally percussive
sound. In situations like that the fact that there is a fade-in or fade-out is just enough
to catch your attention as sounding “wrong” but in a completely natural way,
assuming of course that you don’t go to extremes with the blur (which can be fun in
itself!)
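Horizontal motion blur amounts to convolving each row with a one-sided kernel along the time axis. Here is a sketch of the rightward smear (the “ghostly” reverb direction); the decaying kernel shape is just one plausible choice, and blurring to the left instead would give the reversed-reverb effect:

```python
import numpy as np

spec = np.zeros((4, 100))
spec[:, 10] = 1.0          # a single percussive hit

def motion_blur_right(spec, length=30):
    """Smear each pixel to the right over `length` columns - a time-domain
    blur that fades each sound out rather than diffusing it like a reverb."""
    kernel = np.linspace(1.0, 0.0, length)   # decaying tail
    kernel /= kernel.sum()
    # full convolution, then trim back to the original width
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel)[:row.size], 1, spec)

smeared = motion_blur_right(spec)
```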
Another interesting thing that you can do is to skew the spectrogram image. This is
where you move the position of the top of the spectrogram relative to the bottom. If
you imagine a click that would show up as a vertical line through the spectrogram
then, after skewing, this would become slightly diagonal. Now if we replace that click
with a more useful sound we can perform a very unique kind of fade-in where the
harmonics of a sound that, ordinarily, start at the same time can begin at different
times. Once again, the particular effect that this has will depend a lot on the sound. If
we use a sustaining sound the result could be quite similar to quickly opening up a
low-pass filter (if the top of the image is skewed to the right) or closing a high-pass
filter (if the top of the image is skewed to the left). If, on the other hand, you use more
percussive sounds and experiment with greater skew amounts you get a very
interesting “smearing” of the frequencies. Because of the nature of what we are doing,
each new sound, be that a new percussive hit or a new word in a vocal passage, will
sound like it has the filter opening on it. This would mean that the sound no longer
has its obvious percussive attack, but the newly created sounds can be unique in many
ways so it is worthwhile experimenting. Of course, if we shift the top of the image to
the left relative to the bottom then the high frequencies will come in first, so the effect
will be closer to that of a high-pass filter being closed down on each individual note.
The problem here is that the start of each note sounds less distinct than with the low
frequencies coming in first simply because the low frequencies are often more
obvious in defining the start of a sound even though the higher frequencies contribute
more to the character.
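Skewing can be modelled as shifting each row horizontally by an amount proportional to its height in the image. A rough sketch, using a click (a vertical line) as the test image; row 0 is taken as the top here, and note that np.roll wraps around rather than truncating, which a real image editor would not do:

```python
import numpy as np

spec = np.zeros((100, 200))
spec[:, 50] = 1.0          # a click: a vertical line through the spectrogram

def skew(spec, max_shift=20):
    """Shift each row in proportion to its height, so the top of the image
    moves max_shift columns to the right relative to the bottom."""
    out = np.zeros_like(spec)
    n_rows = spec.shape[0]
    for r in range(n_rows):
        shift = round(max_shift * (n_rows - 1 - r) / (n_rows - 1))
        out[r] = np.roll(spec[r], shift)
    return out

skewed = skew(spec)    # the vertical line becomes a diagonal one
```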
If you were to change the horizontal size of the image, it would be equivalent to time-
stretching; making it wider would be stretching and making it narrower would be
compressing. However, when you look at stretching or compressing the image in a
vertical sense then we have what I can only call “spectral squashing”. What you
would be doing here is compressing the spectral distribution of the sound and, as a
result, the frequencies would no longer be nicely and harmonically distributed. Each
frequency that was previously a whole number multiple of the fundamental may now
be, for example, 1.42 times the previous frequency, which will lead to anything from
slightly odd-sounding at very low squash values to unrecognisable at higher values.
Flipping the image horizontally will perform a basic reverse function much like you
find in almost every audio editor but, on the other hand, flipping the image in a
vertical sense will mean that the “fundamental” frequency – the most dominant
frequency – now becomes the highest frequency and the reducing levels of all of the
other higher harmonics in the original sound now become increasingly quiet sub-
harmonics. Once again, it is quite hard to describe this effect and it is very much
material dependant, so trial and error (if there is such a thing as “error” in processing
of this kind) is the order of the day.
Image editors often allow you to create “gradient fills”, so if you create a new layer,
then create a “white to black” horizontal gradient (white on the left) and use this layer
to multiply with the spectrogram, the gradient will act as a volume fade where the
white part of the gradient represents full volume and the black represents silence. In
this case it would equate to a fade-out. If we reverse the gradient so that the white is
on the right then we will have a fade-in. If we create a vertical fade with white at the
bottom then all of the lower harmonics will be at high volumes while the higher ones
will be silent. In audio terms this is a low-pass filter effect. Similarly we could reverse
the gradient so that the white was at the top, which would be equivalent to a high-pass
filter. By combining the two ideas and creating a diagonal fade we will create an
effect like a gradually opening or closing filter. One example would be a diagonal
gradient with white at the bottom right corner and black at the top left corner. At the
beginning of the file all high frequencies would be silent and low frequencies would
be very quiet. As we progress through the file, the lower frequencies will become
louder first and only towards the end of the file will the higher frequencies start to
become audible. In audio terms this would be a slowly opening low-pass filter. By
varying this we can create opening and closing low-pass and high-pass filters
depending on the positions of the white and black portions of the diagonal gradient.
Finally, you can often vary the smoothness of the gradient from a very gentle
transition to a simple two-tone effect that changes from white to black at the mid-
point. This smoothness variation is equivalent to the filter slope of an analogue filter.
The smoothest gradient could be the equivalent to a 6dB/octave filter while the two-
tone transition would be beyond the sharpness of even a 48dB/octave filter.
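The gradient-multiply idea can be sketched with numpy arrays standing in for the gradient layer and the “multiply” blend mode (row 0 is taken as the top of the image, and the sizes are arbitrary):

```python
import numpy as np

spec = np.full((100, 200), 1.0)   # a uniformly bright spectrogram

# Horizontal white-to-black gradient (white on the left): a fade-out.
fade_out = np.linspace(1.0, 0.0, spec.shape[1])[None, :]

# Diagonal gradient, white at the bottom-right and black at the top-left:
# the slowly opening low-pass filter described above.
rows = np.linspace(0.0, 1.0, spec.shape[0])[:, None]   # 0 at top, 1 at bottom
cols = np.linspace(0.0, 1.0, spec.shape[1])[None, :]   # 0 at left, 1 at right
diagonal = (rows + cols) / 2

faded = spec * fade_out       # "multiply" layer blend
filtered = spec * diagonal
```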
One really interesting thing you can do is to create a new layer, load another
spectrogram into this layer and then set the layer blending to multiply mode. The
resulting image will be a hybrid of the two. It is almost as if you have the “picture” of
one, with the tonal characteristics of the other. Or perhaps you could say you have the
meaning of one and the colour of another. And this is exactly what a vocoder does.
You take one sound and impose its spectral content onto another sound. The most
obvious use of this is to create Cylon-esque or “Mr. Blue Sky” robot voices, but
vocoding actually has a lot of potential uses outside of that. You can vocode a
percussive loop with a melodic sound to create a hybrid that has the rhythm, dynamics
and “snap” of the percussion loop but has tonal qualities from the melodic sound. You
could be more subtle still and vocode a string section and a choir “ahhh” sample to
create a sound that is neither and both simultaneously. This sounds particularly un-
vocoder-like if both sounds are playing the same chord or melody so you aren’t
actually imposing any musical change on either of the sounds, but are simply messing
with the spectral balance.
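The multiply blend itself is a one-liner; the point is that the result is bright only where both source spectrograms are bright, which is essentially what a vocoder does with two spectra. The two input arrays here are just random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical magnitude spectrograms of the same size, values 0..1.
drums = rng.random((256, 400))     # provides the rhythm and dynamics
strings = rng.random((256, 400))   # provides the tonal colour

# "Multiply" layer blend: the hybrid can never be brighter than either source.
hybrid = drums * strings
```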
And finally, just as a little curve-ball, and something for the adventurous only, how
about creating a negative of the original sound as you would with a photograph? If
you try this, be warned; turn your speakers or headphones down because the resulting
noise is usually chaotic as there will be a large “noise” component to the sound.
Whatever was “silent” harmonic space in your original sound will become full
amplitude in this negative. I can’t actually think of a practical use for this to be
honest, but you never know.
Even this only really scratches the surface of what you might be able to do with an
image editor and plugins – a lot of the things you try might prove to be unusable in
any musical context but could be just perfect for creating that foreboding, heavily
backlit, exploring-the-bowels-of-the-powerless-drifting-spaceship drone sound effect
that you need…
I do feel that I need to finish the chapter with one statement though. In spite of the
flexibility and creativity of all of the image processing techniques described above,
you have to remember that you are dealing with a visual representation of the sound.
Because of the nature of the spectrogram, we could have all of the accuracy we need
along the time axis. After all, what we are doing is representing the spectral balance at
a series of points in time. If our sampling frequency is 96kHz then we could use 1
pixel for each sampling period and have a time resolution equal to that of our original
audio in our image editor. It would be hugely impractical at this stage in computing
technology because we would need 96,000 pixels for every second of audio, which
would give all but the most well-specified of PCs or Macs a hard time. The greater
problem is the vertical axis. In a normal waveform we are plotting level (vertical)
against time (horizontal) and each of those are quantised in the audio recording
process. We have 24 bits of range in terms of level, which means 16.7 million
possible values – a huge amount of resolution – and we have (a measly) 96,000
samples per second along the time axis.
If, as in the case of a spectrogram, we are plotting frequencies along the vertical axis
things get a little more complicated because each vertical position represents a
frequency. If we had 20,000 pixels vertically then each pixel could represent 1Hz. Not
only would this be nowhere near a good enough resolution, but it also wouldn’t be
simply that each pixel represents a range of 1Hz – it would mean that each pixel was
fixed to an exact frequency. If we had a frequency of 842.5Hz present, it would
either have to be quantised to 842Hz or 843Hz (which could, especially with
lower frequencies, lead to things sounding a little out of tune) or be
represented by being displayed at both frequencies at a lower level. And
50% volume at 842Hz added to 50% volume at 843Hz is not the same as 100%
volume at 842.5Hz.
In order to overcome this we would probably have to have a much more accurate
vertical (frequency) resolution, accurate to at least 0.01Hz (to be safe) which, on a
96kHz audio recording (and therefore a frequency range up to 48kHz), would mean
we would need 4.8 million pixels vertically. That amounts to roughly 460 billion
pixels per second. Each one of those pixels would need to have a brightness value
attached which, if we wanted a good quality representation of the audio signal, would
mean at least 16 bits per pixel (or 2 bytes) so, if you are interested, that would mean
around 920 billion bytes per second, which is in the region of 850GB per second. This
clearly isn’t practical now and I can’t see any time
in the near to mid-term future when we would happily process audio files like this. So
obviously some compromises need to be made. As time and technology moves on we
may approach this theoretical ideal resolution but the original FFT analysis of the
audio file would need to be at an equivalent resolution for this to be worthwhile.
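The storage arithmetic can be checked directly, assuming a 96kHz recording (so frequencies up to the 48kHz Nyquist limit), 0.01Hz vertical resolution, one spectrogram column per sample and 2 bytes per pixel:

```python
# Back-of-envelope check of the resolution arithmetic, under the stated
# assumptions (48kHz Nyquist, 0.01Hz per row, 96,000 columns/second,
# 16 bits = 2 bytes per pixel).
nyquist_hz = 48_000
vertical_pixels = round(nyquist_hz / 0.01)          # 4,800,000 rows
columns_per_second = 96_000
pixels_per_second = vertical_pixels * columns_per_second
bytes_per_second = pixels_per_second * 2
gib_per_second = bytes_per_second / 2**30           # roughly 850-860 GiB/s
```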
The point of this last statement is solely to illustrate that if you start processing
spectrogram images in an image editor you shouldn’t necessarily expect the resulting
sounds to be 100% true to the original. The problem is magnified by the fact that it
isn’t only the areas of the spectrogram that you have worked on that will suffer from a
loss of accuracy. Due to the limitations of representing the spectral data in a visual
format before exporting it, simply exporting and then importing back in without
actually doing any processing will result in a loss of quality. How bad this is will
vary, but just be prepared. This type of image processing as audio editing is very
“fringe” anyway, but it is a technique that could become more widely used in the
future, especially if the technology advances so that the image data can be
represented more accurately in the exported image, or if the makers of spectral
editing software bring more and more image editing paradigms into their own
software. The advantage there is that you would be applying these “adjustment
layers” directly to the original spectral data rather than to a reduced-resolution
visual image of that data.
All in all, though, I would have to say that, even with the reduction in quality, these
methods allow us to do some things that are impossible in any other way. On that
basis it is worth exploring them now and waiting for the technology to catch up a
little to improve the quality of processes we are already familiar with.
Truly Fluid Audio
How Is This Different From “Elastic Audio”?
Now that we have looked at the many different facets of audio editing, what we hope
to achieve, what we actually can achieve, and what may represent at least one part of
the future of audio manipulation, where does this leave us? Well, the answer to that
question may lie in the realms of what I would call “truly fluid audio”. This isn’t
really a new concept, but rather an integration of existing technologies and some new
tools and workflows.
Primarily, for me, the future of audio editing would be a system that incorporates
everything we can currently do into one single application. Failing that, it would still
be almost as useful if external applications could be launched from inside your
DAW, allowing changes to be made in the context of what the DAW is
playing. This would require some kind of synchronisation
between the two so that when playback was started in this new audio editor, the DAW
would play from the appropriate place and everything would remain synchronised.
Any looping that was in place in the DAW would also need to be relayed to the editor
so that it was easy to cycle around a particular section while working. And any audio
output from the editor would need to be looped back into the DAW in the exact same
place as the original audio was so that any further plugins or automation would
continue to be applied.
This would give us true “in context” editing and that, I believe, is crucial because no
advanced audio editing will ever be truly without any kind of artefacts. While our job
as editors should be to strive for perfection, when perfection isn’t achievable we need
to make a judgement call on what is acceptable and what isn’t. If you are working
solely within a spectral editor, for example, and you are trying to solve a particular
problem on a piece of audio that will, eventually, be mixed back into a track, it can
be hard to know whether you have done the job sufficiently well because you can’t
hear it truly in context. It might be that you are trying to remove a particular sound
and, because you are uncertain whether you have done enough, you take things a little
too far and end up removing too much of the surrounding areas of sound, just to be
safe, and this has a detrimental effect on the rest of the sound. Had you been able to
listen to this in context you might have realised, a few processing steps earlier, that
you had already done enough.
Now in terms of an actual feature set for this fluid audio editing, I think that the editor
would need to incorporate “recycling” techniques for when that approach was the
most suitable, but also our normal time-stretching processes. This time-stretching
should also incorporate a number of the developments which we have mentioned
including re-modelling of vibrato and the ability for the user to manually
select/highlight areas that aren’t to be stretched. This could also be extended to an
“intelligent” system which suggested areas which could be candidates for being
excluded from being stretched, such as held notes that don’t change pitch during their
duration and which have sufficient amounts of silence following them to ensure that
no overlap would occur following the stretch. Perhaps any automatic transient
detection algorithm should also highlight the sections of audio that it defines as
transients, allowing us to manually add, delete or change these selections.
Being able to just click on transient markers and move the audio is very useful as a
workflow tool, but a better stretching algorithm for any bounced or rendered versions
should definitely be included as well.
Equally, pitch shifting should be incorporated into the same workflow and be possible
in the same window. All of the features that we currently have available for
polyphonic pitch shifting in Melodyne would be available here, but perhaps with an
additional information layer. Melodyne achieves its polyphonic pitch shifting by
analysis of the audio file and detection of similar patterns of harmonic content. If an
audio file was processed which detected several different patterns, created by having
different instruments in the same audio file, then perhaps these could be colour-coded
so that it was easier to see which notes came from which instrument. If we incorporate
the database features that I mentioned in the previous chapter then the software might
even be able to take a guess at what kind of sound it was, based on the harmonic
content, and label it accordingly (editable by the user, of course).
The most obvious visual interface for both of these would be a “piano roll” type
display, much the same as is used for the current versions of these techniques. The
pitch could be manipulated by moving notes up and down the scale (either freely or
“snapped” to a scale) and timing changed by having transient markers overlaid on to
the waveform display which allowed simple movement back and forward along the
timeline with the option to either keep the length the same and perform a simple
“move”, or to keep the end point fixed if moving the start point or keep the start point
fixed if moving the end point. This is not that dissimilar from current working
methods but it would, depending on the DAW, require some integration of different
techniques into a single window.
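The three marker behaviours described above (a simple move, or a stretch anchored at either end) are easy to sketch. A minimal illustration, with made-up function names and times in seconds:

```python
# Sketch of dragging a region's start marker, in the three modes
# described above. Purely illustrative - not from any existing DAW.

def drag_start(start: float, end: float, new_start: float, mode: str):
    """Return (new_start, new_end, stretch_ratio) after the drag."""
    if mode == "move":       # keep the length the same, slide the region
        return new_start, new_start + (end - start), 1.0
    if mode == "stretch":    # keep the end point fixed, time-stretch to fit
        return new_start, end, (end - new_start) / (end - start)
    raise ValueError(f"unknown mode: {mode}")

print(drag_start(1.0, 3.0, 0.5, "move"))     # (0.5, 2.5, 1.0)
print(drag_start(1.0, 3.0, 0.5, "stretch"))  # (0.5, 3.0, 1.25)
```

Dragging the end marker would be the mirror image: either a plain move or a stretch with the start point held fixed.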
Moving forward from there we should be able to incorporate spectral editing in the
same window. This would require a change of view though. The waveform/piano roll
display would have to be substituted for a spectrogram view, but the timing grid
should remain visible, as should the transient markers. If the transient markers were
moved then the audio would be moved or stretched in the same way as we just
described above for the waveform view but, in this case, rather than the waveform
shape being changed, the spectrogram image would be adjusted to compensate. This
would perhaps be a little more difficult to achieve because, in the waveform/piano roll
view, each separate note would be clearly identified by its own waveform, whereas in
the spectrogram view the harmonic data for the entire sound is blended together. As
such, a note on E2 closely followed by a note on G2 might not be easy to visually
separate in order to position the transient markers. It should be possible to scale the
vertical axis of the spectrogram (the one that represents pitch) in such a way that a
piano keyboard could be placed along the vertical axis and the position of each
frequency in the spectrogram view was aligned to a musical note on the keyboard
display. This would make it much easier to pick out what each individual note was on
the spectrogram view and, therefore, figure out what each of the transient markers
was for.
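That alignment between spectrogram frequencies and keyboard notes would rest on the standard equal-temperament mapping, where each octave doubles the frequency. A quick sketch (the helper names are my own):

```python
import math

def frequency_to_note(freq_hz: float) -> float:
    """Map a frequency to a MIDI note number (69 = A4 = 440Hz).
    A fractional result means a pitch between two keys."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

def note_name(midi_note: int) -> str:
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return f"{names[midi_note % 12]}{midi_note // 12 - 1}"

# E2 and G2 sit very close together on a linear frequency axis,
# but map cleanly to separate keys on a log-scaled one.
print(note_name(round(frequency_to_note(82.41))))   # E2
print(note_name(round(frequency_to_note(98.00))))   # G2
```

Scaling the spectrogram's vertical axis by this logarithmic relationship is exactly what would let a piano keyboard line up against it.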
We should then have the ability to incorporate layers and layer adjustment tools into
this spectral editor so that we can perform some of the more esoteric editing processes
that we discussed in previous chapters as well as having alternative tools to carry out
the more routine editing operations. Each of these layers would be tied to the
underlying audio so that any movement of the underlying audio regions in the arrange
window would result in the layers automatically being repositioned as well. Having
this ability would mean that we get to keep the edits and layers “live” and wouldn’t
have to commit to them in order to be able to move the audio regions around. In fact,
this should apply to any of the audio files and any of the editing tools. If we apply
pitch correction or time-stretching/movement then we should be able to move the
audio region in the arrange window and have all of the edits follow the position. This
is possible in some DAWs and in some aspects but the perfect system would be
region-based in its changes and all edits would be synchronised relative to the region
(sometimes called “clip”) rather than absolute in their time or position.
To make these tasks even easier to carry out, there could be an option to have a split-
screen view where the top half of the window represented the spectral view and the
bottom half represented the waveform/piano roll view. Edits could be carried out in
either window depending on which was more appropriate for the task at hand. Any
changes made in one part of the window would result in the information in the other
part of the window being updated automatically. You could, perhaps, extend this even
further to include a third part of the window which was a traditional combined
waveform view. Edits such as fade-ins and fade-outs can most easily be carried out
with this editing view so it would have its uses. Plus it would be a very good way to
instantly check whether any of the edits you were making in the other views were
resulting in any obvious glitches in the main waveform. It might be too much to have
all three of these on screen at once but it would be easy enough to incorporate
different screen sets that could be triggered by a simple key command. CTRL + 1
could trigger full screen spectrogram, CTRL + 2 could be full screen piano roll,
CTRL + 3 could be full screen waveform, CTRL + 4 could be split spectrogram and
piano roll, CTRL + 5 could be split piano roll and waveform, CTRL + 6 could be
split spectrogram and waveform and CTRL + 7 could be all three together. This, to
me, would represent the ultimate editing workflow because pretty much every editing
task (other than comping) could be carried out in this single window.
Moving on to the issue of demixing – there are a number of possibilities here. Perhaps
we could extend the layers model to include different types of layer. There could be
adjustment layers which are tied to base layers. To begin with, the audio file or region
would have just one base layer: the entire contents of the region.
Adjustments could be made to this in all of the ways we described in Chapter 17. In
addition, though, we could create additional base layers that were the result of
extracting particular sounds using demixing technology. These additional base layers
would play back at the same time as the rest of the audio and would be phase-locked.
They would, though, have the possibility to have additional adjustment layers linked
to them. Using this idea you could have a fully mixed live recording to which you
applied some compression adjustment layers and maybe a few spot attenuation layers.
But then if you decided, for example, that you wanted to process the guitar solo
separately, you would extract that out to a new base layer using demixing. Then you
might add some stronger compression and perhaps some EQ (also more than feasible
with adjustment layers instead of plugins) and then raise the level overall. The guitar
solo and the rest of the audio would remain as two separately editable base layers
without the need for them to be recombined, and all of this would exist within the
region itself meaning that it could be freely moved or copied and all of the edits
would remain perfectly as they were intended.
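As a rough illustration of how such a region/layer model might be structured – all of the names and parameters here are hypothetical, not taken from any existing product – the guitar solo example could look something like this:

```python
from dataclasses import dataclass, field

@dataclass
class AdjustmentLayer:
    """A non-destructive edit (compression, EQ, spot attenuation, ...)
    tied to one base layer."""
    kind: str
    params: dict

@dataclass
class BaseLayer:
    """Either the full region audio or a demixed extraction from it,
    played back phase-locked with the other base layers."""
    name: str
    adjustments: list = field(default_factory=list)

@dataclass
class Region:
    """All layers live inside the region, so moving or copying the
    region carries every edit along with it."""
    base_layers: list

# A mixed live recording plus a demixed guitar solo as a second base layer
full_mix = BaseLayer("full mix")
full_mix.adjustments.append(AdjustmentLayer("compression", {"ratio": 2.0}))

guitar = BaseLayer("guitar solo (demixed)")
guitar.adjustments.append(AdjustmentLayer("compression", {"ratio": 4.0}))
guitar.adjustments.append(AdjustmentLayer("eq", {"band": "2kHz", "gain_db": 3}))

region = Region(base_layers=[full_mix, guitar])
```

The important design point is that the adjustments hang off the base layers, and the base layers hang off the region, so nothing ever needs to be recombined or committed just to move the region around.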
Perhaps the copying process should be extended to give more freedom. One option
would be a free copy that would start with all of the edits exactly the same as the
source region but would give the user the ability to change those edits independently
of the source region. The other option would be to create an alias copy which would
mirror the edits made in the source region. Any changes to the edits in the source
region would be automatically applied to the alias copy as well. The reason for having
this system which seems, initially, to be more complex is simply to allow as many of
the edits as possible to remain “live” for as long as possible. I have lost count of the
number of times that I have been working on a project and made an edit feeling that it
was genuinely right (and perhaps it was right at the time) only to find that, later on,
there has been a change in plans or perhaps a new idea that couldn’t have been
predicted which rendered the edit inappropriate. It wasn’t catastrophic because I still
had the original, unedited version of the file available and simply had to adjust the
edit but, because of the amount of time that had passed, I wasn’t 100% sure that I had
made the edit in the same way. Keeping the edits “live”, even if it meant a slight
complication in workflow, would, in many cases, be invaluable as it would keep the
audio truly fluid for as long as possible.
With that said, there will be times when you would want to commit certain edits and
this should be an option from within this editing window. You might want to take one
of the extracted base layers and create a separate audio file from it. There should be
an option to do this and either render the layer with the additional adjustment layers
applied or to simply extract the base layer and adjustment layers to a new region to
keep the edits made to this layer still “live”. Alternatively it might be that you want to
render the whole region with all of the adjustment layers to a new region. If this was
the case you should have the option to either render the region in place or to render to
a new region and still keep the original as it was. Perhaps you don’t want to render the
whole region but just want to make a single one of the adjustment layer edits
permanent. In this case each adjustment layer should have an individual option to
commit that edit and leave the other layers in place.
I think it would be very useful to be able to “host” plugins within this editing window.
While we might want a general EQ or compression setting on the track as a whole, it
might be that, for one small region, we would want to apply a particular plugin.
Having the ability to do that on a region-by-region basis would be exceptionally
useful and would be, to me at least, more intuitive than having to apply a plugin
globally for the whole track and then automate either the bypass or certain settings of
that plugin so that it only had an effect on a particular part of the track.
How about the option of automatically creating a sampler instrument from certain
parts of an audio file? If you have a purely monophonic part then ReCycle or similar
tools can do this for you, but the sequence of notes will simply be mapped to adjacent
keys without any awareness of the actual note being played. This makes sense for
drum loops and even for melodic parts where the aim is to simply recreate the pattern
and timing as a sampler plus MIDI file version. But if you wanted to create a playable
sampler instrument where you have freedom to play notes as you wish, it should be
possible to analyse an audio recording and then take a single note from that, pitch
shift it over a pre-defined range and then map those samples to the appropriate keys
on a sampler instrument. This wouldn’t be a perfect recreation, of course, because
many instruments exhibit timbral changes as the note played changes, and this method
would simply “stretch” a single note over a wider range.
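The key mapping itself is straightforward arithmetic: each semitone of shift corresponds to a playback-rate ratio of 2^(1/12). A sketch, with an assumed source note and range chosen purely for illustration:

```python
# Sketch: stretch one sampled note across a key range by resampling.
# The source note and two-octave range are illustrative assumptions.

def pitch_ratio(source_key: int, target_key: int) -> float:
    """Playback-rate ratio needed to shift source_key to target_key.
    Simple resampling, so pitch and duration change together."""
    return 2 ** ((target_key - source_key) / 12)

SOURCE_KEY = 60  # suppose the isolated note was MIDI note 60
key_map = {key: pitch_ratio(SOURCE_KEY, key) for key in range(48, 73)}

print(f"{key_map[48]:.3f}")  # one octave down -> 0.500
print(f"{key_map[72]:.3f}")  # one octave up   -> 2.000
```

A ratio of 0.5 plays the sample at half speed (an octave down) and 2.0 at double speed (an octave up), which is precisely why the timbre drifts as you stray further from the source note.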
It could be possible to look at a polyphonic audio part, pick out all of the different
notes that were present in that part and use each of them to map out a fully chromatic
instrument. This would help to alleviate the problems associated with pitch shifting a
single note too far but there would then be the problem that each note might be
different in intensity, volume, duration and tone. Nonetheless, with a suitable amount
of processing power and a suitably clever algorithm, I don’t see why it wouldn’t be
possible to extract the individual notes and then, using one as a “template”, alter
volumes and durations so that each note was similar before mapping them out over
the keys of a sampler instrument.
For me, the idea of being able to listen to a mixed piece of music, isolate a single note
from a single instrument, and then create a “multi-sampled” playable sampler
instrument from that sound with just a few mouse clicks is nothing short of
mesmerising. This does bring us firmly back into the realms of the copyright issues
that we mentioned in the last chapter though. If somebody has created a sound for a
particular song, whether that is using a combination of instrument, amplifier, effects
and other processing, or whether it was created on a synthesizer, nonetheless it is their
creation. If we can click a mouse a few times and have that sound, in isolation, for us
to use ourselves, then we are in very dangerous territory from a legal perspective. OK,
if the sound in question was a preset from a particular synthesizer then it is arguable
that we are not stealing anything as we could have exactly the same sound if we had
that synthesizer. But if we didn’t own that synthesizer then we don’t really have a
right to use that sound. Many synthesizers, both hardware and software, have license
agreements that allow them to be used in pieces of music but don’t allow them to be
used for the purposes of creating sample banks. So the person who originally used the
synthesizer in their song would not be breaking that license agreement by using it in
their song, but we would by creating a sampled version of it when we didn’t own the
original equipment.
All of these options, if available, would represent a formidable audio editing toolkit.
In my opinion this would make the differentiation between recorded audio and MIDI
regions all but obsolete. However, it would come at a price. To enable all of this
functionality you would need a very comprehensive set of controls and this might
prove overwhelming. All of this talk of layers, alias vs. free copies, region-based
plugins, alternative time-stretching algorithms, spectral adjustment layers and
whatever else we dream up, might simply be complete overkill for somebody who
occasionally wants to correct a slightly out of tune note by time-stretching a small
piece of audio. So if it were possible to include all of these features, I think we would
need to find a way to quickly and easily customise the depth of editing offered. The
editing window could, perhaps, have “tabs” at the top. We could have a “Waveform
Editor” tab, a “Piano Roll” editor tab, a “Spectral Editor” tab and an “Advanced” tab.
Each tab would feature only the editing processes and options that were most relevant
so if you wanted to do a quick fade-in you wouldn’t have to go through whole groups
of windows and displays and tools just to perform a simple task. Equally, if you
wanted to move the tuning of a note, you wouldn’t be confronted by rainbow coloured
spectrograms full of seemingly meaningless information. For the die-hard, there
would be the “Advanced” tab in all of its glory.
One thing that would be common to all of these options, though, would be seamless
integration. As soon as an audio file was imported or recorded, it would be analysed
in the background and prepared for editing. There would be no need to select a region
and wait while it was analysed. When you double-clicked on it (or clicked with a
modifier key) to go into your editing window it would be there immediately, ready to
work on. If this were implemented, though, I feel that although the analysis would
have taken place, until such time as any edits were made it should be the original,
untouched audio file or region that you hear playing back. That way you could be
sure that there were no unnecessary artefacts present as a result of analysing audio
that was always going to remain unedited.
Which brings us to the question of when, or perhaps if, this will ever happen. The
issue I see here isn’t a technical one. Most of the technology for everything I have
described above exists already. The elastic audio (time and pitch) functions already
exist either natively in your DAW or in plugin format. Polyphonic pitch shifting
already exists in Melodyne. Spectral editing already exists in a number of software
packages including the excellent RX2 by iZotope. Spectral editing with layers and
image processing exists (in a roundabout way) with PhotoSounder and Photoshop.
Demixing exists in a few tools, of which SpectraLayers Pro seems the most advanced
and promising at the moment. Creating sampler instruments and recycling audio files
already exists. Resynthesis (creating a multi-oscillator model of a sound) already
exists in Camel Audio’s Alchemy, and this could be used to create the
pitch shifted versions of the extracted audio used to create the sampler instrument. So
pretty much everything exists. The problem is that the various different technologies
have been developed by, and are consequently owned by, a number of different
companies. In order for them to all be available in a single editor there would either
need to be a “standard” established in which all technologies were cross-licensed and
individual editing software packages could be built upon this platform and royalties
shared by all or, perhaps more likely, a single company would need to either buy
outright or license each of the technologies from the owners.
Even if that happened there would need to be a lot of thought put into how to make
them all work together in both a technical sense and in a visual/workflow sense.
Figuring out how to make such a large number of editing tools available without
having a ridiculously cluttered “toolbox” would be a challenge in itself. And then,
finally, finding a way to integrate this either natively into a DAW or as an external
program which your DAW called up automatically and which had a bi-directional
information exchange and synchronisation built in would be technically difficult. The
size of data files would increase considerably as all of the FFT analysis data would
need to be stored with each region along with potentially huge amounts of additional
data relating to the adjustment layers, and so on. This would probably require a new
file format or container that incorporated the underlying WAV or AIFF data along
with everything that was needed to piece the edits back together. This new format
would also have to be intelligent inasmuch as it would apply all of this additional
metadata if it were loaded into a DAW or editor which had this functionality but, in
situations where you only had basic WAV or AIFF playback capability, the container
format would be intelligent enough to realise this and would simply spit out the raw,
untreated audio data. This would make sure that the audio files were still accessible in
some format at least, even if the only copy that was available was this new format and
you didn’t have the software capable of fully exploiting all of the additional
information.
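That fallback behaviour is actually plausible with today's WAV format, because RIFF readers skip chunks they don't recognise. As a rough sketch – the chunk ID and metadata here are invented for illustration – the extra edit data could be appended as a custom chunk that a basic player would simply ignore:

```python
import io
import struct
import wave

# Sketch: append a custom RIFF chunk carrying edit metadata to a WAV
# file. "EDTS" is a made-up chunk ID; standard WAV readers skip
# unrecognised chunks, so a plain player still sees ordinary audio.

def add_metadata_chunk(wav_bytes: bytes, metadata: bytes) -> bytes:
    payload = metadata + (b"\x00" if len(metadata) % 2 else b"")  # word-align
    chunk = b"EDTS" + struct.pack("<I", len(metadata)) + payload
    # Grow the RIFF size field (bytes 4-7) to cover the new chunk
    riff_size = struct.unpack_from("<I", wav_bytes, 4)[0] + len(chunk)
    return wav_bytes[:4] + struct.pack("<I", riff_size) + wav_bytes[8:] + chunk

# Build a tiny 96kHz mono WAV in memory, tag it, and read it back
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(96_000)
    w.writeframes(b"\x00\x00" * 96)

tagged = add_metadata_chunk(buf.getvalue(), b'{"layers": []}')
with wave.open(io.BytesIO(tagged), "rb") as w:
    print(w.getnframes())   # 96 - audio still readable by a plain WAV reader
```

A real implementation would of course need a far richer (and standardised) metadata payload, but the principle – graceful degradation to raw audio – comes free with the container.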
I will openly admit that this is just a personal view of the direction that I think audio
editing could take in the future. There are many people working in psychoacoustic
research the world over who probably have new algorithms in development at the
moment that will bring a totally fresh approach to editing which nobody had ever
considered before, but nobody can live without after trying it for the first time.
Whatever the specifics of the future of audio editing might be, I do believe that the
direction it will continue to take is towards being able to treat audio as if it were
completely malleable: no longer a fixed expression of an idea at one point in time,
but more like a set of tools allowing us to create our own version of that idea. I guess
the best analogy I can think of is one where an audio file represents
a movie. Ten years ago we were in the audience at a movie theatre and we were
merely spectators. Then came DVD with its (theoretically at least) ability to choose
between different camera angles. But the future of audio editing is one where we are
the directors and each sound in the audio is an actor. They have lines, they have a
story to follow, but the details of how and when they express their part in the story are
ours to shape and create in our own way. And if that is the future of audio editing then
I believe that it will become even more of an art and will allow for some amazing
works of creativity and re-imagining. I can’t wait…