Differentiable all-pole filters for time-varying audio systems (2024)

Abstract

Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, a time-varying subtractive synthesiser, and a compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin (https://diffapf.github.io/web/).

1 Introduction

Infinite impulse response (IIR) filters are commonly used in many time-varying audio processing units, such as subtractive synthesisers, phaser effects, and dynamic range compression. Their recursive computation, using results from previous time steps, allows modelling a wide range of responses with low computational costs. Since differentiable DSP (DDSP) [1] emerged as an effective solution for attaining controllable audio systems, there have been attempts to incorporate recursive filters into automatic differentiation frameworks such as PyTorch. However, naive implementations result in deep computational graphs due to recursion that cannot be parallelised [2], which slows down training [3, 4, 5].

A common acceleration approach is to evaluate the filters in the frequency domain [3] and approximate time-varying behaviour by filtering overlapping frames in parallel [6, 7]. Despite their popularity, these approximations have some potential drawbacks. Frame windowing for overlap-add smears the spectral peaks in the frequency domain, resulting in filters with artificially high resonance to counter this effect. Sampling the filters in the frequency domain sometimes truncates the IR, and the resulting circular convolution with the truncated IR can lead to artefacts at frame boundaries. Most importantly, systems trained in this way are not guaranteed to produce the same results when operating sample-by-sample at audio rate to achieve low latency.

In this paper, we propose a solution to these problems by deriving and implementing an efficient backpropagation algorithm for a time-varying all-pole filter. Our filter can be used in various audio systems by separating the system's poles and zeros and handling the poles (where the recursion is) explicitly with our proposed method, keeping training fully end-to-end. Our contributions are threefold:

1. We significantly increase the forward and backpropagation speed of time-varying recursive all-pole filters without introducing any approximation to the filter.

2. Systems trained with our implementation can be converted to real-time implementations without generalisation issues, apart from the ordering of the zeros and poles.

3. We show that our filter efficiently and accurately models time-varying analog audio circuits with recursive structures.

2 Related works

Differentiable training of IIR filters has been explored in [8, 9, 10, 3, 11, 4, 7]. To sidestep the problems inherent in training over a recursion, some authors approximate IIR filters in the frequency domain using the fast Fourier transform (FFT) [9, 10, 3, 11, 4], or the short-time Fourier transform (STFT) [7] for time-varying effects. This is known as the frequency sampling (FS) method. It approximates the filter as time-invariant, with a finite impulse response (FIR), over the duration of a short frame, so the accuracy of FS depends heavily on the choice of STFT parameters: the hop-size, the frame length, the FFT length, and the windowing function [12]. In machine learning applications, these choices add extra hyper-parameters to models, which may require prior knowledge of the target system to set appropriately.

Time-varying all-pole filters have been used for decades in linear prediction (LP) voice synthesis [13]. Training them jointly with neural networks was first proposed in LPCNet [14]. The authors achieve training efficiency by using inverse-filtered speech as the target because the inverse filter has no recursion. Other works seek to parallelise LP with frame-based processing, either filtering in the time domain [15, 2] or via FS [6, 16] for each frame.

Dynamic processing effects like compressors and limiters also employ time-varying recursive filters. The filter is usually first order, and its coefficients are time-varying and dependent on the attack or release phases of the gain reduction signal [17]. In the differentiable learning context, frequency sampling can be used if the compressor's attack and release times are configured to be the same [4, 5], which simplifies the filter to be time-invariant. Colonel et al. [18] propose dividing the gain reduction signal into attack and release passages and filtering them separately with different filters, and Guo et al. [19] downsample the signal to reduce the number of recursions.

A distinct approach that provides significant acceleration is deriving the closed-form solution of the gradients for the filter parameters and implementing it in a highly optimised way. Bhattacharya et al. [8] derive the instantaneous backpropagation algorithm for peak and shelving filters and train them jointly to match the response of head-related transfer functions. Forgione et al. and Yu et al. [2, 20] decompose a time-invariant IIR filter into zeros and poles, and they show that backpropagation through an all-pole filter can be expressed using the same all-pole filter. This method is fast because the underlying filters do not have to be implemented in an automatic differentiation framework. Nevertheless, these solutions are made for specific filters or are only applied to time-invariant systems.

[Figure 1: Overview of the proposed backpropagation algorithm for the time-varying all-pole filter, including how $\mathbf{a}(n)$ is arranged into $\hat{\mathbf{a}}(n)$.]

3 Proposed Methodology

Consider an $M^{\mathrm{th}}$-order time-varying all-pole filter:

$$\begin{aligned} y(n) &= f_{\mathbf{a}(n)}(x(n)) \\ &= x(n) - \sum_{i=1}^{M} a_i(n)\, y(n-i) \end{aligned} \qquad (1)$$

where $\mathbf{a}(n) = [a_1(n), \dots, a_M(n)]$ are the filter coefficients at time $n$ and $M \in \mathbb{Z}^+$. In some applications, $\mathbf{a}(n)$ varies at a control rate $F_c$ much lower than the audio sampling rate $F_s$, denoted $\mathbf{a}(m)$, and can then be up-sampled to the audio rate before the filter is applied.
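For instance, up-sampling the control-rate coefficients with linear interpolation could look like the following sketch (the interpolation scheme and the names here are illustrative assumptions, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def upsample_coeffs(a_ctrl, factor):
    """Linearly interpolate control-rate coefficients a(m), shape (frames, M),
    to audio-rate coefficients a(n), shape (frames * factor, M)."""
    a = a_ctrl.t().unsqueeze(0)                      # (1, M, frames)
    a = F.interpolate(a, scale_factor=factor, mode="linear", align_corners=False)
    return a.squeeze(0).t()                          # (frames * factor, M)
```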

The following sections describe the proposed method, in which the exact gradients for each parameter of the filter $f_{\mathbf{a}(n)}$ are derived and expressed in a form that can be computed efficiently. Our method aligns most with the instantaneous backpropagation algorithms proposed in [8, 20, 2], generalising their contributions to a time-varying all-pole filter that can be used in various recursive filters and time-varying audio systems. We refer to this method as the time domain (TD) method.

3.1 Unwinding the Recursion

We first rewrite the recursive Eq. (1) so there is no $y$ variable on the right-hand side:

$$y(n) = x(n) + \sum_{d=1}^{\infty} b_d(n)\, x(n-d). \qquad (2)$$

$\mathbf{b}(n) = [b_1(n), b_2(n), \dots]$ is the IIR of filter $f_{\mathbf{a}(n)}$ at time $n$. We will use Eq. (2) and $\mathbf{b}(n)$ to help us derive the gradient of $\mathbf{a}(n)$ in the next two sections. To get the exact value of $\mathbf{b}(n)$ in terms of $\mathbf{a}(n)$, let us think of (1) as the recursive definition of connecting any time step before $n$ to $n$. In other words, $y(n)$ is a summation of $M$ sub-problems $y(n-i)$, each weighted by a unique coefficient $-a_i(n)$. Let us think of $i$ as the step size going backwards from $n$, with $M$ being the maximum step size we can take. All possible combinations of steps that span a distance $d \in \mathbb{Z}^+$ can be defined recursively as

$$\mathcal{G}_d = \bigcup_{i=1}^{\min(d,M)} \left\{ [i;\mathbf{q}] : \mathbf{q} \in \mathcal{G}_{d-i} \right\} \qquad (3)$$

with boundary condition $\mathcal{G}_0 = \{[]\}$, where $[\,;\,]$ denotes array concatenation.

If we compare Eq. (1) and (2), we can see that $b_d(n)$ is in fact the sum of the coefficient combinations of $a_\cdot(\cdot)$ going back from $n$ with step combinations $\mathcal{G}_d$. Thus, its value is

$$b_d(n) = \sum_{\mathbf{q}\in\mathcal{G}_d} (-1)^{|\mathbf{q}|} \prod_{j=1}^{|\mathbf{q}|} a_{q_j}\!\left(n - \sum_{k=1}^{j} Q(\mathbf{q})_k\right) \qquad (4)$$

where $|\mathbf{q}|$ is the number of elements in array $\mathbf{q}$ and $Q(\mathbf{q}) = [0;\mathbf{q}]$.
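To make the combinatorics concrete, the following is a minimal NumPy sketch (function and variable names are our own) that enumerates $\mathcal{G}_d$ via Eq. (3) and evaluates $b_d(n)$ via Eq. (4), checked against the direct recursion of Eq. (1):

```python
import numpy as np

def step_combinations(d, M):
    """G_d from Eq. (3): all arrays of steps in {1, ..., M} that sum to d."""
    if d == 0:
        return [[]]
    return [[i] + q
            for i in range(1, min(d, M) + 1)
            for q in step_combinations(d - i, M)]

def b_d(a, n, d):
    """Eq. (4), with a[n, i - 1] = a_i(n)."""
    total = 0.0
    for q in step_combinations(d, a.shape[1]):
        term, offset = (-1.0) ** len(q), 0
        for step in q:  # the j-th factor uses time index n - (q_1 + ... + q_{j-1})
            term *= a[n - offset, step - 1]
            offset += step
        total += term
    return total

# Sanity check: with x a unit impulse at time n - d, Eq. (2) says y(n) = b_d(n).
rng = np.random.default_rng(0)
N, M, n, d = 16, 2, 10, 4
a = 0.1 * rng.standard_normal((N, M))
x = np.zeros(N); x[n - d] = 1.0
y = np.zeros(N)
for m in range(N):  # direct recursion, Eq. (1)
    y[m] = x[m] - sum(a[m, i - 1] * y[m - i] for i in range(1, M + 1) if m - i >= 0)
assert np.isclose(y[n], b_d(a, n, d))
```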

3.2 Gradients of $x(n)$

Assuming we have computed $f_{\mathbf{a}(n)}$ up to step $N$, evaluated $y(n \le N)$ with a differentiable function $\mathcal{L}(y(n))$, and have its instantaneous gradients $\frac{\partial\mathcal{L}}{\partial y(n)}$, we can backpropagate through $f_{\mathbf{a}(n)}$ as

$$\begin{aligned} \frac{\partial\mathcal{L}}{\partial x(n)} &= \sum_{d=-\infty}^{N} \frac{\partial\mathcal{L}}{\partial y(d)} \frac{\partial y(d)}{\partial x(n)} \\ &= \frac{\partial\mathcal{L}}{\partial y(n)} \frac{\partial y(n)}{\partial x(n)} + \sum_{d=1}^{N-n} \frac{\partial\mathcal{L}}{\partial y(n+d)} \frac{\partial y(n+d)}{\partial x(n)} \\ &= \frac{\partial\mathcal{L}}{\partial y(n)} + \sum_{d=1}^{N-n} b_d(n+d) \frac{\partial\mathcal{L}}{\partial y(n+d)}. \end{aligned} \qquad (5)$$

We use the fact that $\frac{\partial y(n)}{\partial x(n-d)} = b_d(n)$ and $\frac{\partial y(<n)}{\partial x(n)} = 0$ from (2). Unfortunately, Eq. (4) and therefore (5) are expensive to compute. We aim to express backpropagation using $f_{\mathbf{a}(n)}$, which is much more efficient to compute due to its recursion. If we can re-parameterise Eq. (5) to look like (1), then $f_{\mathbf{a}(n)}$ can be reused to compute $\frac{\partial\mathcal{L}}{\partial x(n)}$. We do this by writing $b_d(n+d)$ in terms of $a_\cdot(\cdot)$, which requires us to evaluate (4) at $n+d$ to get

$$\begin{aligned} b_d(n+d) &= \sum_{\mathbf{q}\in\mathcal{G}_d} (-1)^{|\mathbf{q}|} \prod_{j=1}^{|\mathbf{q}|} a_{q_j}\!\left(n + d - \sum_{k=1}^{j} Q(\mathbf{q})_k\right) \\ &= \sum_{\mathbf{q}\in\mathcal{G}_d} (-1)^{|\mathbf{q}|} \prod_{j=1}^{|\mathbf{q}|} a_{q_j}\!\left(n + \sum_{k=j}^{|\mathbf{q}|} q_k\right) \\ &= \sum_{\mathbf{q}\in\mathcal{G}_d} (-1)^{|\mathbf{q}|} \hat{a}_{q_{|\mathbf{q}|}}(n) \prod_{j=1}^{|\mathbf{q}|-1} \hat{a}_{q_j}\!\left(n + \sum_{k=j+1}^{|\mathbf{q}|} q_k\right) \\ &= \sum_{\tilde{\mathbf{q}}\in\mathcal{G}_d} (-1)^{|\tilde{\mathbf{q}}|} \prod_{j=1}^{|\tilde{\mathbf{q}}|} \hat{a}_{\tilde{q}_j}\!\left(n + \sum_{k=1}^{j} Q(\tilde{\mathbf{q}})_k\right) \end{aligned} \qquad (6)$$

where $\tilde{\mathbf{q}} = [q_{|\mathbf{q}|}, \ldots, q_1] \in \mathcal{G}_d$ and $\hat{a}_i(n) = a_i(n+i)$. Eq. (6) is now the same as (4) but uses $\hat{\mathbf{a}}(n) = [\hat{a}_1(n), \ldots, \hat{a}_M(n)]$ as coefficients, and the sign of the time offsets inside the product is flipped ($n+$ instead of $n-$), which means the filter should be applied in the reverse direction ($n = N \rightarrow n = -\infty$); this is consistent with the non-causal indexing in (5) ($n+d$ instead of $n-d$). We can now express (5) in terms of $f_{\mathbf{a}(n)}$, using (6) and the equivalence between $\mathbf{a}(n)$ and $\mathbf{b}(n)$ from (1) and (2), to get

$$\begin{aligned} \frac{\partial\mathcal{L}}{\partial x(n)} &= \frac{\partial\mathcal{L}}{\partial y(n)} - \sum_{i=1}^{M} \hat{a}_i(n) \frac{\partial\mathcal{L}}{\partial x(n+i)} \\ &= \mathrm{FLIP} \circ f_{\mathrm{FLIP}\circ\hat{\mathbf{a}}(n)} \circ \mathrm{FLIP} \circ \frac{\partial\mathcal{L}}{\partial y(n)} \end{aligned} \qquad (7)$$

where $\mathrm{FLIP}(x(n)) = x(-n)$ and $f_1 \circ f_2(x) = f_1(f_2(x))$. $\mathrm{FLIP}$ and $\hat{\mathbf{a}}(n)$ are trivial to compute using memory indexing. The backpropagation algorithm, including how to arrange $\mathbf{a}(n)$ into $\hat{\mathbf{a}}(n)$, is shown in Figure 1.
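In code, Eq. (7) amounts to time-reversing the incoming gradient, filtering it with the shifted coefficients $\hat{\mathbf{a}}(n)$, and reversing again. A minimal NumPy sketch (zero initial conditions assumed; names are ours):

```python
import numpy as np

def allpole(x, a):
    """Eq. (1): y(n) = x(n) - sum_i a_i(n) y(n - i), with a of shape (N, M)."""
    N, M = a.shape
    y = np.zeros(N)
    for n in range(N):
        y[n] = x[n] - sum(a[n, i - 1] * y[n - i]
                          for i in range(1, M + 1) if n - i >= 0)
    return y

def allpole_grad_x(grad_y, a):
    """Eq. (7): backpropagate through allpole by running the same filter
    backwards in time with the shifted coefficients a_hat_i(n) = a_i(n + i)."""
    N, M = a.shape
    a_hat = np.zeros_like(a)
    for i in range(1, M + 1):
        a_hat[: N - i, i - 1] = a[i:, i - 1]        # a_hat_i(n) = a_i(n + i)
    return allpole(grad_y[::-1], a_hat[::-1])[::-1]  # FLIP -> filter -> FLIP
```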

3.3 Gradients of $\mathbf{a}(n)$

Let $u_i(n) = -a_i(n)\, y(n-i)$ so that $y(n) = x(n) + u_1(n) + \cdots + u_M(n)$. Because of the chain rule and $\frac{\partial y(n)}{\partial x(n)} = \frac{\partial y(n)}{\partial u_i(n)} = 1$, $x(n)$ and $u_i(n)$ have the same derivatives ($\frac{\partial\mathcal{L}}{\partial x(n)} = \frac{\partial\mathcal{L}}{\partial u_i(n)}$). Since we can compute $\frac{\partial\mathcal{L}}{\partial x(n)}$ from (7), the gradients of the coefficients are simply

$$\frac{\partial\mathcal{L}}{\partial a_i(n)} = \frac{\partial\mathcal{L}}{\partial u_i(n)} \frac{\partial u_i(n)}{\partial a_i(n)} = -\frac{\partial\mathcal{L}}{\partial x(n)}\, y(n-i). \qquad (8)$$

In summary, we can calculate all gradients with one pass of $f_{\mathbf{a}(n)}$ and the multiplications in (8), which are fast to compute. We implement an efficient $f_{\mathbf{a}(n)}$ for both forward and backward computation using Numba and register it as a custom operator in PyTorch. The implementation is available on GitHub (https://github.com/DiffAPF/torchlpc).
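For illustration, a stripped-down version of such a custom operator might look as follows; this is a sketch built on the `allpole` helpers above (CPU, single channel, zero initial conditions), not the optimised torchlpc kernels:

```python
import torch

class TimeVaryingAllPole(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, a):
        # Run the recursion outside the autograd graph.
        y = torch.from_numpy(allpole(x.detach().numpy(), a.detach().numpy()))
        ctx.save_for_backward(a, y)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        a, y = ctx.saved_tensors
        # Eq. (7): gradient w.r.t. the input, via the same all-pole filter.
        grad_x = torch.from_numpy(
            allpole_grad_x(grad_y.detach().numpy(), a.detach().numpy()))
        # Eq. (8): dL/da_i(n) = -dL/dx(n) * y(n - i); y(n - i) = 0 for n < i.
        M = a.shape[1]
        y_delayed = torch.stack(
            [torch.nn.functional.pad(y[:-i], (i, 0)) for i in range(1, M + 1)],
            dim=1)
        return grad_x, -grad_x.unsqueeze(1) * y_delayed
```

Correctness of such an operator can be verified numerically, e.g. with `torch.autograd.gradcheck` at double precision.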

4 Applications

We demonstrate our all-pole filter implementation on three commonly used dynamic audio systems: a phaser, a subtractive synthesiser, and a compressor. All three systems have time-varying recursive structures that are not easy to train in a differentiable way and would typically be modelled using FS approaches.

Although the filtering order of poles and zeros matters in time-varying systems, in this work we rearranged the poles of each system into one all-pole filter for maximum training efficiency. Due to the relatively slowly varying filter coefficients, we found this approach to be sufficient. We direct interested readers to our parallel work in speech synthesis [21], an exact all-pole system.

4.1 Phaser

We test our filter implementation on a virtual analog phaser model, based on [22, 7]. At the core of the model is a differentiable LFO that operates at the control rate $F_c$. The oscillator is implemented as a damped oscillator with learnable frequency $f_0$, decay rate $\sigma$, and phase $\phi$:

$$s(m) = e^{-\sigma^2 m / F_c} \cos(2\pi f_0 m / F_c + \phi) \qquad (9)$$

where $m$ is the control-rate sample index. The inclusion of the parameter $\sigma$ alleviates some non-convexity issues when learning the frequency $f_0$, as shown in [23]. Note that the oscillator is unconditionally stable for all $\sigma$. The oscillator is passed through a multi-layer perceptron (MLP) network to obtain the control signal $p(m)$. The MLP, with parameters $\Theta$, contains three hidden layers of eight units each, with $\tanh$ activation functions on all layers including the final one. The control signal is then up-sampled with linear interpolation to obtain $p(n)$, which modulates the coefficients of four cascaded first-order all-pass filters (APFs), each with the difference equation:

$$y_k(n) = p(n)\left[x_k(n) + y_k(n-1)\right] - x_k(n-1) \qquad (10)$$

where $x_k$ and $y_k$ are the input and state of the $k^{\mathrm{th}}$ APF, respectively, $0 \le k < 4$. Note that the MLP's $\tanh$ output activation ensures the all-pass poles remain within the unit circle. The APFs are arranged in series, with a through path of gain $g_1$ and a feedback loop of gain $g_2$, as shown in Figure 2. It is common to include a unit delay in the feedback path for ease of implementation [22, 24]; however, here we use instantaneous feedback for a more realistic virtual analog model. In Figure 2, BQ represents a biquad filter with coefficients $\mathbf{b}^{(\mathrm{bq})} = [b_0^{(\mathrm{bq})}, b_1^{(\mathrm{bq})}, b_2^{(\mathrm{bq})}]$ and $\mathbf{a}^{(\mathrm{bq})} = [a_1^{(\mathrm{bq})}, a_2^{(\mathrm{bq})}]$. The entire model is a sixth-order time-varying IIR filter. Here we approximate the system as having the difference equation:

$$y(n) = f_{\mathbf{a}(n)} \circ f_{\mathbf{b}(n)} \circ x(n) \qquad (11)$$

where $f_{\mathbf{b}(n)}(\cdot)$ is a time-varying FIR filter:

$$f_{\mathbf{b}(n)}(x(n)) = \sum_{i=0}^{M} b_i(n)\, x(n-i) \qquad (12)$$

and $f_{\mathbf{a}(n)}(\cdot)$ is a time-varying all-pole filter (see Eq. (1)). Here $M = 6$ for $\mathbf{b}(n)$ and $\mathbf{a}(n)$, which are functions of $p(n)$, $g_1$, $g_2$, $\mathbf{b}^{(\mathrm{bq})}$, and $\mathbf{a}^{(\mathrm{bq})}$. In previous work [7], the control parameters of a similar time-varying filter were learned through gradient descent using the FS method. This frequency sampling approach had some limitations, however. Firstly, the optimal frame size for the best training accuracy depended on the rate of the target LFO, which we ideally should not assume as prior knowledge. Secondly, it was not fully investigated whether the trained model could then be implemented in the time domain at inference to avoid latency.

Here, we instead implement Eq. (11) directly in the time domain during training using the method proposed in Section 3.
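Concretely, under the decomposition of Eq. (11)-(12), one forward evaluation runs a time-varying FIR stage followed by the all-pole stage. A sketch (shapes and names are ours; the mapping from $p(n)$, $g_1$, $g_2$, and the BQ coefficients to $\mathbf{b}(n)$ and $\mathbf{a}(n)$ is omitted, and `TimeVaryingAllPole` refers to the custom-operator sketch in Sec. 3):

```python
import torch

def time_varying_fir(x, b):
    """Eq. (12): v(n) = sum_{i=0}^{M} b_i(n) x(n - i), with b of shape (N, M + 1)."""
    taps = b.shape[1]
    x_pad = torch.nn.functional.pad(x, (taps - 1, 0))
    frames = x_pad.unfold(0, taps, 1)        # frames[n] = [x(n - M), ..., x(n)]
    return (frames * b.flip(1)).sum(-1)

def phaser_forward(x, b, a):
    """Eq. (11): y = f_a(n)( f_b(n)( x ) ); zeros first, then poles."""
    return TimeVaryingAllPole.apply(time_varying_fir(x, b), a)
```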

[Figure 2: Block diagram of the phaser model: four cascaded first-order all-pass filters with through-path gain $g_1$, feedback gain $g_2$, and biquad filter BQ.]

4.2 Time-varying Subtractive synthesiser

We test our filter implementation on a subtractive synthesiser roughly modelled after the Roland TB-303 Bass Line synth (https://www.roland.com/uk/promos/303day/), which defined the acid house electronic music movement of the late 1980s. The TB-303 is an ideal synth for our use case because its defining feature is a resonant low-pass filter whose cutoff frequency is modulated quickly using an envelope to create its signature squelchy, "liquid" sound. Although the original TB-303's circuit contains a 4-pole diode ladder filter, for simplicity and demonstration purposes we implement our synthesiser using a biquad filter. Our synth is differentiable and consists of three main components: a monophonic oscillator, a time-varying biquad filter, and a waveshaper for adding distortion to the output.

The oscillator is the same as the one in TorchSynth [25]: it uses hyperbolic tangent waveshaping to generate sawtooth or square waves, and can sweep continuously between them. It is defined by the following equations:

$$\psi(n) = 2\pi n f_0 / F_s + \phi \pmod{2\pi} \qquad (13)$$

$$o(n) = \rho_{\mathrm{osc}}\, s_{\mathrm{saw}}(\psi(n)) + (1 - \rho_{\mathrm{osc}})\, s_{\mathrm{sq}}(\psi(n)) \qquad (14)$$

$$e(n) = \begin{cases} \left(\frac{N_{\mathrm{on}} - n}{N_{\mathrm{on}}}\right)^{\rho_{\mathrm{env}}} & 0 \le n \le N_{\mathrm{on}} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

$$s(n) = g_{\mathrm{osc}}\, e(n)\, o(n) \qquad (16)$$

where $F_s$ and $f_0$ are the sampling rate and fundamental frequency in hertz, and $\phi$ is the phase in radians. $\rho_{\mathrm{osc}}$ is a continuous control parameter to sweep between the wave shapes, where 0 makes a square wave ($s_{\mathrm{sq}}(\cdot)$) and 1 makes a saw wave ($s_{\mathrm{saw}}(\cdot)$). The output audio of the oscillator is multiplied by gain $g_{\mathrm{osc}}$ and is then shaped using a decaying envelope $e(n)$ of length $N_{\mathrm{on}}$ note-on samples with control parameter $\rho_{\mathrm{env}}$.

A time-varying biquad filter $h_{\mathrm{bq}}(\cdot)$ (same as Equations 11 and 12 for the phaser, but with $M = 2$) is then applied to the oscillator output audio. This filter takes as input five time-varying filter coefficients $\{a_1(n), a_2(n), b_0(n), b_1(n), b_2(n)\}$ at sample rate, which can be passed in directly or generated from filter cutoff and resonance modulation signals $m_{\mathrm{fc}}(n)$ and $m_{\mathrm{q}}(n)$. These modulation signals are then used to calculate the coefficients for a biquad lowpass filter using the corresponding equations in the Audio EQ Cookbook (https://www.w3.org/TR/audio-eq-cookbook/).
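For reference, the cookbook lowpass coefficients can be computed from $m_{\mathrm{fc}}(n)$ and $m_{\mathrm{q}}(n)$ along the following lines (a sketch, with the usual normalisation by $a_0$ included):

```python
import torch

def lowpass_biquad_coeffs(m_fc, m_q, fs):
    """Audio EQ Cookbook lowpass: time-varying coefficients from cutoff (Hz)
    and resonance signals, each of shape (N,)."""
    w0 = 2 * torch.pi * m_fc / fs
    cos_w0 = torch.cos(w0)
    alpha = torch.sin(w0) / (2 * m_q)
    a0 = 1 + alpha
    b0 = (1 - cos_w0) / 2
    b = torch.stack([b0, 2 * b0, b0], dim=1) / a0.unsqueeze(1)           # b_0, b_1, b_2
    a = torch.stack([-2 * cos_w0, 1 - alpha], dim=1) / a0.unsqueeze(1)   # a_1, a_2
    return b, a
```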

Finally, the output of the filter is fed through a hyperbolic tangent waveshaper which adds distortion:

$$f_{\mathrm{dist}}(x(n)) = \tanh(g_{\mathrm{dist}}\, x(n)). \qquad (17)$$

The amount of distortion is controlled by the parameter $g_{\mathrm{dist}}$, which modifies the gain of the waveshaper's input $x(n)$.

The entire synth is therefore controllable using 8 global parameters $\{f_0, F_s, \phi, N_{\mathrm{on}}, \rho_{\mathrm{osc}}, \rho_{\mathrm{env}}, g_{\mathrm{osc}}, g_{\mathrm{dist}}\}$ and 2 or 5 time-varying parameters, $\{m_{\mathrm{fc}}(n), m_{\mathrm{q}}(n)\}$ or $\{a_1(n), a_2(n), b_0(n), b_1(n), b_2(n)\}$. It is defined by composing Equation 16, $h_{\mathrm{bq}}(\cdot)$, and Equation 17 together as follows:

$$y_{\mathrm{synth}}(n) = f_{\mathrm{dist}} \circ h_{\mathrm{bq}} \circ s(n). \qquad (18)$$
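Putting the pieces together, Eq. (18) could be assembled as below. Note that $s_{\mathrm{saw}}$ and $s_{\mathrm{sq}}$ are naive placeholders here (the synth uses TorchSynth's tanh-waveshaped oscillators), and `lowpass_biquad_coeffs`/`phaser_forward` refer to the sketches above:

```python
import torch

def synth_forward(num_samples, f0, phi, N_on, rho_osc, rho_env,
                  g_osc, g_dist, m_fc, m_q, fs):
    """A sketch of Eq. (13)-(18) with placeholder oscillator shapes."""
    n = torch.arange(num_samples)
    psi = (2 * torch.pi * n * f0 / fs + phi) % (2 * torch.pi)      # Eq. (13)
    s_saw = psi / torch.pi - 1                 # placeholder sawtooth
    s_sq = torch.sign(s_saw)                   # placeholder square
    o = rho_osc * s_saw + (1 - rho_osc) * s_sq                     # Eq. (14)
    e = torch.where(n <= N_on,
                    ((N_on - n) / N_on).clamp(min=0.0) ** rho_env,
                    torch.zeros_like(s_saw))                       # Eq. (15)
    s = g_osc * e * o                                              # Eq. (16)
    b, a = lowpass_biquad_coeffs(m_fc, m_q, fs)                    # h_bq coefficients
    return torch.tanh(g_dist * phaser_forward(s, b, a))            # Eq. (17)-(18)
```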

4.3 Feed-forward Compressor

The compressor we consider here is a simple feed-forward compressor from [17], which is defined as:

$$x_{\mathrm{rms}}(n) = \alpha_{\mathrm{rms}}\, x^2(n) + (1 - \alpha_{\mathrm{rms}})\, x_{\mathrm{rms}}(n-1) \qquad (19)$$

$$g(n) = \min\!\left(1, \left(\frac{\sqrt{x_{\mathrm{rms}}(n)}}{10^{\frac{CT}{20}}}\right)^{\frac{1-R}{R}}\right) \qquad (20)$$

$$\hat{g}(n) = \begin{cases} \alpha_{\mathrm{at}}\, g(n) + (1 - \alpha_{\mathrm{at}})\, \hat{g}(n-1) & g(n) < \hat{g}(n-1) \\ \alpha_{\mathrm{rt}}\, g(n) + (1 - \alpha_{\mathrm{rt}})\, \hat{g}(n-1) & \text{otherwise} \end{cases} \qquad (21)$$

$$y(n) = x(n)\, \hat{g}(n)\, \gamma. \qquad (22)$$

$R$, $CT$, and $\gamma$ are the ratio, threshold, and make-up gain, respectively. $\alpha_{\mathrm{rms/at/rt}}$ are the average smoothing coefficients; $\alpha_{\mathrm{at/rt}}$ are chosen based on whether the compressor is in the attack phase or the release phase. This compressor's training-efficiency bottleneck is the recursion in (21): the coefficient of the recursive filter is computed on the fly, so we cannot use $f_{\mathbf{a}(n)}$ directly. Moreover, it operates at the audio rate, so it is unsuitable for frame-based approximation.

Table 1: Ranges of the interpretable parameters.

Phaser: $f_0 \in [-F_c/2, F_c/2]$, $\sigma \in (-\infty, \infty)$, $\phi \in [-\pi, \pi]$, $g_1 \in (-\infty, \infty)$, $g_2 \in [0, 1]$.

Time-varying subtractive synth: $\rho_{\mathrm{osc}} \in [0.0, 1.0]$, $\rho_{\mathrm{env}} \in [0.1, 6.0]$, $g_{\mathrm{osc}} \in [0.01, 1.0]$, $g_{\mathrm{dist}} \in [0.01, 4.0]$, $m_{\mathrm{fc}}(n) \in [0.1\,\mathrm{kHz}, 8.0\,\mathrm{kHz}]$, $m_{\mathrm{q}}(n) \in [0.7071, 8.0]$.

Feed-forward compressor: $R \in [1.0, \infty)$, $CT \in (-\infty, \infty)$, $\alpha_{\mathrm{at}} \in [0.0, 1.0]$, $\alpha_{\mathrm{rt}} \in [0.0, 1.0]$, $\alpha_{\mathrm{rms}} \in [0.0, 1.0]$, $\gamma \in [0.0, \infty)$.

4.3.1 Custom backward function

We backpropagate gradients through (21) using the method proposed in Sec. 3. To handle the if-else statement, we write the gain reduction filter (21) in Numba and record each if-else decision inside the recursion into a binary mask $\zeta(n)$:

$$\zeta(n) = \begin{cases} 1 & g(n) < \hat{g}(n-1) \\ 0 & \text{otherwise}. \end{cases} \qquad (23)$$

Using this, Eq. (21) equals the following time-varying IIR:

$$\begin{aligned} \beta(n) &= \alpha_{\mathrm{at}}^{\zeta(n)}\, \alpha_{\mathrm{rt}}^{1-\zeta(n)} \\ \hat{g}(n) &= \beta(n)\, g(n) + (1 - \beta(n))\, \hat{g}(n-1). \end{aligned} \qquad (24)$$

We can now use $\beta(n)$ and $f_{\mathbf{a}(n)}$ to backpropagate the gradients through the compressor. The pseudo-code of the differentiable gain reduction filter is summarised in Algorithm 1, and the implementation can be found on GitHub (https://github.com/DiffAPF/torchcomp).

Algorithm 1: Differentiable gain reduction filter.

def forward($g$, $\alpha_{\mathrm{at}}$, $\alpha_{\mathrm{rt}}$):
  $\hat{g}(-1) \leftarrow 1$
  $\hat{g}(n) \leftarrow$ Eq. (21)
  $\zeta(n) \leftarrow$ Eq. (23)
  return $\hat{g}$, $\zeta$

def backward($\frac{\partial\mathcal{L}}{\partial\hat{g}}$, $\hat{g}$, $g$, $\zeta$, $\alpha_{\mathrm{at}}$, $\alpha_{\mathrm{rt}}$):
  $\beta(n) \leftarrow$ Eq. (24)
  $\frac{\partial\mathcal{L}}{\partial\beta(n)g(n)} \leftarrow$ filtering $\frac{\partial\mathcal{L}}{\partial\hat{g}(n)}$ with Eq. (7) and $a_1(n) \equiv \beta(n) - 1$
  $\frac{\partial\mathcal{L}}{\partial g(n)} \leftarrow \frac{\partial\mathcal{L}}{\partial\beta(n)g(n)}\, \beta(n)$
  $\frac{\partial\mathcal{L}}{\partial\beta(n)} \leftarrow \frac{\partial\mathcal{L}}{\partial\beta(n)g(n)}\, \left(g(n) - \hat{g}(n-1)\right)$
  $\frac{\partial\mathcal{L}}{\partial\alpha_{\mathrm{at}}} \leftarrow \sum_n \frac{\partial\mathcal{L}}{\partial\beta(n)}\, \zeta(n)$
  $\frac{\partial\mathcal{L}}{\partial\alpha_{\mathrm{rt}}} \leftarrow \sum_n \frac{\partial\mathcal{L}}{\partial\beta(n)}\, (1 - \zeta(n))$
  return $\frac{\partial\mathcal{L}}{\partial g}$, $\frac{\partial\mathcal{L}}{\partial\alpha_{\mathrm{at}}}$, $\frac{\partial\mathcal{L}}{\partial\alpha_{\mathrm{rt}}}$
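A minimal Numba version of the forward recursion (Eqs. (21) and (23)) might look like the following sketch; the actual torchcomp kernel may differ:

```python
import numpy as np
from numba import njit

@njit
def gain_smoothing_forward(g, alpha_at, alpha_rt):
    """Eq. (21), with the branch decisions recorded as the mask of Eq. (23)."""
    g_hat = np.empty_like(g)
    zeta = np.empty_like(g)
    prev = 1.0                      # initial condition: g_hat(-1) = 1
    for n in range(g.shape[0]):
        if g[n] < prev:             # attack phase
            zeta[n] = 1.0
            prev = alpha_at * g[n] + (1.0 - alpha_at) * prev
        else:                       # release phase
            zeta[n] = 0.0
            prev = alpha_rt * g[n] + (1.0 - alpha_rt) * prev
        g_hat[n] = prev
    return g_hat, zeta
```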

5 Experiments

All three systems are trained to model some target analog audio in an end-to-end fashion using gradient descent. We use the Hanning window for all the frame-based approaches. The ranges of all the interpretable parameters are summarised in Table 1. We provide audio samples for all experiments on the accompanying website.

5.1 Modelling the EHX Small Stone analog phaser

Here we explore using the DSP phaser model outlined in Sec. 4.1 to model an analog phaser pedal: the Electro-Harmonix Small Stone. This is the same system modelled in [7] using the FS method. The circuit consists of four cascaded analog all-pass filters, a through-path for the input signal, and a feedback path [26], so topologically the circuit is similar to the discrete-time phaser model considered in this paper. The pedal has one knob, which controls the LFO rate, and a switch that engages the feedback loop. Six different parameter configurations are considered:

• SS-A: feedback off, rate knob 3 o'clock ($f_0 \approx 2.3$ Hz)

• SS-B: feedback off, rate knob 12 o'clock ($f_0 \approx 0.6$ Hz)

• SS-C: feedback off, rate knob 9 o'clock ($f_0 \approx 0.09$ Hz)

• SS-D: feedback on, rate knob 3 o'clock ($f_0 \approx 1.4$ Hz)

• SS-E: feedback on, rate knob 12 o'clock ($f_0 \approx 0.4$ Hz)

• SS-F: feedback on, rate knob 9 o'clock ($f_0 \approx 0.06$ Hz)

Table 2: Test ESR (%) for each dataset, for models trained with the FS or TD method (rows) and evaluated with FS or TD at inference time (columns).

Dataset  L/F_s    Train   FS      TD
SS-A     10 ms    FS      1.46    1.53
                  TD      1.34    1.36
SS-B     40 ms    FS      1.37    1.49
                  TD      1.35    1.34
SS-C     160 ms   FS      1.62    1.80
                  TD      2.56    2.23
SS-D     10 ms    FS      22.47   23.47
                  TD      21.64   23.33
SS-E     40 ms    FS      15.43   16.69
                  TD      13.63   13.87
SS-F     160 ms   FS      8.79    9.83
                  TD      7.83    8.79

The training data consists of a 30 s chirp-train, both dry (input) and processed through the pedal (target). At each training iteration, the input signal is processed through the model in a single batch, and the loss function is computed as the error-to-signal ratio (ESR) between the model output and target. The learnable model parameters are $\{g_1, g_2, f_0, \sigma, \phi, \Theta, \mathbf{b}^{(\mathrm{bq})}, \mathbf{a}^{(\mathrm{bq})}\}$, as defined in Sec. 4.1, giving a total of 182 model parameters. An Adam optimiser with a learning rate of $5\times10^{-4}$ is employed to carry out parameter updates every iteration, for a maximum of 10k iterations. The test data includes the training data plus the next 10 seconds of the same audio signal, which contains guitar playing. This ensures the learned LFO phase is always aligned to the same point in time.

As reported in [7], the accuracy and convergence of model training depend on the choice of hop-size and window-length. Furthermore, even for a fixed choice of hyper-parameters, training convergence depends on the initial parameter values, which are pseudo-randomly initialised [7]. We observed that for some random seeds the LFO would not converge to the correct frequency $f_0$ and/or decay rate $\sigma$. In a successful run, the learned $f_0$ is approximately equal to that of the target, and $\sigma$ converges to zero.

As a baseline, we train the model using the FS method with a single hop-size $L$ for each parameter configuration. The window-length $N_{\rm WIN}$ is set to four times the hop-size, and the FFT length to $2^{\lceil\log_2(N_{\rm WIN})\rceil+1}$. The training process is repeated up to five times with different seeds until a model converges. The proposed time-domain (TD) implementation is then trained with the same hyper-parameters (where relevant) and initial seed as the FS method. Here, the hop-size determines the control rate.
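As an illustration of these hyper-parameter relationships, the helper below (our own, not from the paper's codebase) derives the window and FFT lengths from a given hop-size.

```python
import math

def fs_lengths(hop_size: int) -> tuple[int, int]:
    n_win = 4 * hop_size                            # window-length = 4x hop-size
    n_fft = 2 ** (math.ceil(math.log2(n_win)) + 1)  # next power of two, doubled once
    return n_win, n_fft

# e.g. a 10 ms hop at 44.1 kHz: fs_lengths(441) -> (1764, 4096)
```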

To evaluate the two methods, we train the phaser model with each method and then test using both FS and TD inference. The test ESR is reported in Table 2. For all datasets, the two methods result in very similar test losses, with the TD method doing slightly better on five of the six datasets. Only one hop-size has been considered here, so a more detailed analysis across a range of hop-sizes would be an interesting area of further work; however, these specific hop-sizes were chosen based on the recommendations in our previous work [7], in which they were heuristically found to give the best results (for the respective datasets) using the FS training method.

In early experiments we observed that the TD implementation could become unstable during training, causing the output signal to explode. This occurred even for a simplified problem with $g_2 = 0$ and the BQ filter excluded. We verified that the APF poles were within the unit circle for all $n$, albeit close to it in some cases ($p(n) > 0.98$). We therefore suspect this instability was due to numerical inaccuracies associated with the transient response of the filters (e.g. as described in [12]), because changing from single precision to double precision resolved the problem. We thus recommend operating at double precision when using the proposed filter implementation if instability arises.
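A minimal illustration of this recommendation is shown below, assuming a filter callable `apf(x, a)`; the name and signature are placeholders rather than the actual API of our implementation.

```python
import torch

def run_apf_double(apf, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    # Run the time-varying all-pole filter in float64 to sidestep the
    # transient-response inaccuracies described above, then cast back
    # to the surrounding training precision.
    y = apf(x.double(), a.double())
    return y.to(x.dtype)
```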

5.2 Modelling the Roland TB-303 Acid Synth

We model analog TB-303 audio with our time-varying subtractive synth. The dataset is made from Sample Science's royalty-free Abstract 303 sample pack (https://www.samplescience.info/2022/05/abstract-303.html), which consists of 100 synth loops at 120 BPM recorded dry from a hardware TB-303 clone. All loops are concatenated, resampled to 48 kHz, and then pitch and note-on durations are extracted using Ableton Live 11's melody-to-MIDI conversion algorithm, which we found to be highly accurate for these monophonic melody loops. Since the TB-303 is a 16-note sequencer, the resulting annotated notes are truncated or zero-padded to 6000 samples (one 16th note at 120 BPM and 48 kHz), with any notes shorter than 4000 samples discarded, as sketched below. The data is then split 60%/20%/20% into train, validation, and test sets, giving a total of 42.5 seconds of audio.
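A sketch of the note pre-processing described above, under the stated constants; the helper name is ours.

```python
import torch
import torch.nn.functional as F

NOTE_LEN = 6000  # one 16th note at 120 BPM and 48 kHz
MIN_LEN = 4000   # notes shorter than this are discarded

def fix_note_length(note: torch.Tensor):
    n = note.shape[-1]
    if n < MIN_LEN:
        return None                        # thrown out
    if n >= NOTE_LEN:
        return note[..., :NOTE_LEN]        # truncate
    return F.pad(note, (0, NOTE_LEN - n))  # zero-pad
```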

We use the same modulation extraction approach as [27] and DDSP [1] to model the synth. First, frame-by-frame features are extracted from the target audio and processed by a neural network which predicts the temporal and global control parameters for our differentiable synth, as defined in Section 4.2. Temporal parameters are linearly interpolated from frame-rate to sample-rate when required. The synth then generates reconstructed audio from the control parameters, which is compared against the target audio using a differentiable loss function, enabling the system to be trained end-to-end using gradient descent. A diagram of the entire setup is shown in Figure 3, and a schematic of one training step follows.
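In the sketch below, `extractor`, `synth`, and `mss_loss` are placeholders for the modulation-extraction network, the differentiable synth, and the loss used in this section; the Mel front-end mirrors the settings given later in this section, and the tensor shapes are assumptions.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=48_000, n_fft=1024, hop_length=32, n_mels=128)

def train_step(target, pitch, duration, extractor, synth, mss_loss, optimiser):
    feats = torch.log(mel(target) + 1e-5)   # frame-rate features of the target
    temporal, globals_ = extractor(feats)   # temporal + global control parameters
    # Linearly interpolate temporal controls from frame-rate to sample-rate;
    # `temporal` is assumed to be (batch, channels, n_frames).
    temporal = torch.nn.functional.interpolate(
        temporal, size=target.shape[-1], mode="linear", align_corners=True)
    recon = synth(pitch, duration, temporal, globals_)
    loss = mss_loss(recon, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.detach()
```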

[Figure 3: diagram of the full modulation extraction and differentiable synth training setup.]

Since the filter modulations of the TB-303 are very fast (around 125 ms for 16th notes at 120 BPM), we use a Mel spectrogram with a 1024 FFT size, 128 Mel bins, and a short hop length of 32 samples, which results in 188 frames for 6000 samples of audio. The neural network uses the same architecture as LFO-net [27], except with 4 or 5 additional 2-layer MLP networks applied to the latent embedding (averaged across the temporal axis) to predict the global parameters of the synth; it contains 730K parameters. We conduct experiments with five different synth filter configurations:

  1. Time domain biquad coefficients (Coeff TD)

  2. Frequency sampling biquad coefficients (Coeff FS)

  3. Time domain low-pass biquad (LP TD)

  4. Frequency sampling low-pass biquad (LP FS)

  5. Time domain recurrent neural network (LSTM)

As discussed in Section 4.2, for configurations 1 and 2 the neural network outputs a 5-dimensional temporal control parameter of changing biquad coefficients. Before filtering, these coefficients are post-processed using the biquad triangle parameterisation of [3] to ensure stability (sketched after this paragraph). For configurations 3 and 4, a 2-dimensional modulation signal is returned, representing a changing filter cutoff and its Q factor (which is constant over time); these are then converted to five biquad coefficients. The raw-coefficient configuration gives the synth as many degrees of freedom as possible, whereas the low-pass configuration is based on the TB-303's analog design, which consists of an envelope-modulated low-pass filter with a global resonance knob. Finally, the recurrent neural network filter (configuration 5) is based on the architecture in [5, 27] and enables us to compare against learning a time-varying IIR filter directly from scratch. It is conditioned on the same 2-dimensional modulation signal as configurations 3 and 4.
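For reference, one common form of this triangle parameterisation (our reading of [3]; the exact details may differ) maps unconstrained network outputs into the biquad stability region $|a_1| - 1 < a_2 < 1$:

```python
import torch

def triangle_parameterise(x1: torch.Tensor, x2: torch.Tensor):
    a1 = 2 * torch.tanh(x1)                                 # a1 in (-2, 2)
    a2 = ((2 - a1.abs()) * torch.tanh(x2) + a1.abs()) / 2   # a2 in (|a1| - 1, 1)
    return a1, a2
```

Any point inside this triangle corresponds to a stable denominator $1 + a_1 z^{-1} + a_2 z^{-2}$, so stability holds for every frame regardless of the raw network outputs.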

We train all models for 200 epochs on batches of 34 notes using the AdamW optimiser and single precision; the numerical issues noted in Sec. 5.1 were not observed here. For the FS synth filter configurations we train separate models for $N_{\rm WIN} \in \{128, 256, 512, 1024, 2048, 4096\}$ while keeping the hop-size $L$ fixed at 32 samples to match the frame-rate temporal outputs of the modulation extraction network. As in the phaser experiments, the FFT length is set to $2^{\lceil\log_2(N_{\rm WIN})\rceil+1}$. The LSTM configuration is trained with 64 hidden units. Note pitches and durations are provided to the synth for reconstruction, and the oscillator phase is randomised to improve robustness and reflect how synths behave in reality. As a result, the target and reconstructed audio may be misaligned, which is why we train with the phase-agnostic multi-resolution STFT loss (MSS) [28], sketched below.
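A minimal MSS sketch follows; the FFT sizes and hop lengths are illustrative assumptions rather than the exact settings of [28].

```python
import torch

def mss_loss(pred, target, fft_sizes=(512, 1024, 2048), eps=1e-7):
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                       window=win, return_complex=True).abs()
        T = torch.stft(target, n_fft, hop_length=n_fft // 4,
                       window=win, return_complex=True).abs()
        # Spectral convergence + log-magnitude L1, summed over resolutions.
        loss = loss + torch.norm(P - T) / (torch.norm(T) + eps)
        loss = loss + (torch.log(P + eps) - torch.log(T + eps)).abs().mean()
    return loss
```

Because only magnitude spectra are compared, the loss is insensitive to the random oscillator phase mentioned above.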

We evaluate the different filter configurations by comparing their MSS loss values on the 20% test split. The FS configurations, which operate at frame-rate, are also evaluated at sample-rate by linearly interpolating their filter coefficients during inference (sketched below). We also calculate the Fréchet Audio Distance (FAD) [29] for each model, which has been shown to correlate with human perception. Since the individual audio files are very short, we first concatenate them into one audio file before calculating the FAD; to avoid harsh discontinuities in the concatenated file, we apply a 32-sample fade (one hop-size $L$) to both ends of each clip. The evaluation results are summarised in Table 3, and Table 4 shows speed benchmarks of the different synth configurations.
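The sample-rate evaluation of the FS configurations amounts to a simple upsampling of the frame-rate coefficients; a sketch under assumed tensor shapes:

```python
import torch

def coeffs_to_sample_rate(coeffs: torch.Tensor, n_samples: int) -> torch.Tensor:
    # coeffs: (batch, 5, n_frames) frame-rate biquad coefficients.
    # Returns (batch, 5, n_samples) for sample-by-sample time-domain filtering.
    return torch.nn.functional.interpolate(
        coeffs, size=n_samples, mode="linear", align_corners=True)
```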

Table 3: Test MSS loss and FAD (VGGish) for each filter configuration, evaluated with FS and TD inference.

| Filter  | Method | N_WIN | MSS (FS) | MSS (TD) | FAD (FS)    | FAD (TD)    |
|---------|--------|-------|----------|----------|-------------|-------------|
| Coeff.  | FS     | 4096  | 1.66     | 1.78     | 2.62 ± 0.09 | 2.70 ± 0.13 |
|         |        | 2048  | 1.64     | 1.65     | 2.18 ± 0.07 | 2.35 ± 0.11 |
|         |        | 1024  | 1.53     | 1.58     | 2.57 ± 0.08 | 2.27 ± 0.12 |
|         |        | 512   | 1.57     | 1.57     | 2.87 ± 0.10 | 2.46 ± 0.10 |
|         |        | 256   | 1.49     | 1.48     | 2.25 ± 0.08 | 1.98 ± 0.06 |
|         |        | 128   | 1.53     | 1.55     | 3.37 ± 0.14 | 2.73 ± 0.12 |
|         | TD     | --    | --       | 1.38     | --          | 2.49 ± 0.21 |
| LP      | FS     | 4096  | 1.96     | 1.98     | 2.59 ± 0.06 | 2.09 ± 0.07 |
|         |        | 2048  | 1.95     | 2.04     | 2.62 ± 0.07 | 4.52 ± 0.17 |
|         |        | 1024  | 1.89     | 2.15     | 2.59 ± 0.08 | 4.18 ± 0.14 |
|         |        | 512   | 1.83     | 2.92     | 2.13 ± 0.06 | 3.38 ± 0.08 |
|         |        | 256   | 1.82     | 2.89     | 2.17 ± 0.06 | 3.36 ± 0.12 |
|         |        | 128   | 1.84     | 2.70     | 2.34 ± 0.09 | 3.93 ± 0.12 |
|         | TD     | --    | --       | 1.56     | --          | 2.51 ± 0.10 |
| LSTM 64 | TD     | --    | --       | 1.76     | --          | 3.24 ± 0.07 |

Table 4: Speed benchmarks for each synth configuration with the TD method and with FS at each window length N_WIN.

| Synth   | TD      | FS, 128 | 256    | 512    | 1024   | 2048   | 4096    |
|---------|---------|---------|--------|--------|--------|--------|---------|
| Coeff.  | 32 ms   | 57 ms   | 102 ms | 201 ms | 390 ms | 833 ms | 1795 ms |
| LP      | 29 ms   | 58 ms   | 98 ms  | 195 ms | 376 ms | 804 ms | 1667 ms |
| LSTM 64 | 1322 ms | --      | --     | --     | --     | --     | --      |

Looking at the evaluation and benchmarking results, we observe that both configurations of our TD filter perform well, generally matching or outperforming the corresponding FS implementations. Our method also provides roughly a 2x to 30x speedup over the FS and LSTM configurations. The low-pass FS methods perform significantly worse when applied in the time domain, which we attribute to overfitting of the neural network to the filter's window size; this degradation is clearly audible in the resulting audio. This is less the case for the learned-coefficient filters, especially according to the FAD metric. We hypothesise this could be due to the linear interpolation of the learned coefficients at inference, which prevents harsh discontinuities at the frame boundaries and results in a smoother time-varying transfer function. We also found during training that the Coeff FS synths would sometimes not converge, even with gradient clipping, whereas our TD implementations never exhibited stability issues.

5.3 Modelling the LA-2A Leveling Amplifier

For the compressor experiments, the targets we model are 1) the feed-forward compressor (FF) in Sec. 4.3 and 2) a Universal Audio LA-2A analog compressor (LA). We optimise the proposed differentiable FF compressor (∇FF) to match the target sounds, examining its capability to replicate and infer the parameters of dynamic range controllers. We test the following conditions:

  • FF-A: $R = 3$, 1 ms attack and 100 ms release

  • FF-B: $R = 5$, 30 ms attack and 30 ms release

  • FF-C: $R = 8$, 0.1 ms attack and 200 ms release

  • LA-D: compressor mode, 25 peak reduction

  • LA-E: compressor mode, 50 peak reduction

  • LA-F: compressor mode, 75 peak reduction

Table 5: Test ESR (%) for each target condition.

| Method | FF-A  | FF-B    | FF-C   | LA-D  | LA-E  | LA-F  |
|--------|-------|---------|--------|-------|-------|-------|
| FS     | 2.362 | 0.00780 | 4.649  | 11.29 | 9.485 | 7.783 |
| ∇FF    | 0.015 | 0.00785 | 0.0171 | 10.58 | 9.356 | 7.639 |

Table 6: Learned compressor parameters for the LA-2A conditions (the FS baseline ties attack and release, so a single time constant applies to both).

| Method | Data | R     | CT (dB) | Attack    | Release   | α_rms | γ (dB) |
|--------|------|-------|---------|-----------|-----------|-------|--------|
| FS     | D    | 9.1   | -11.66  | 489.43 ms | 489.43 ms | 0.008 | 0.69   |
|        | E    | 231.1 | -19.08  | 44.62 ms  | 44.62 ms  | 0.606 | 0.34   |
|        | F    | 2.9   | -26.00  | 0.06 ms   | 0.06 ms   | 0.002 | -0.81  |
| ∇FF    | D    | 39.0  | -26.58  | 99.41 ms  | 0.06 ms   | 0.703 | 0.74   |
|        | E    | 13.1  | -12.41  | 5.68 ms   | 420.56 ms | 0.978 | 0.54   |
|        | F    | 5.4   | -20.14  | 2.24 ms   | 229.15 ms | 0.973 | -0.13  |

Table 7: Run time of the FS and ∇FF compressors for different sequence durations.

| Method | 30 s     | 60 s     | 120 s    |
|--------|----------|----------|----------|
| FS     | 163.4 ms | 320.8 ms | 663.8 ms |
| ∇FF    | 64.9 ms  | 117.9 ms | 239.4 ms |

The $\alpha_{\rm rms}$, $CT$, and $\gamma$ for FF$_*$ are set to 0.03, $-20$ dB, and 0 dB, respectively. We train and evaluate our compressors on the SignalTrain dataset [30], which consists of paired data recorded at 44.1 kHz from the LA-2A compressor with different peak reduction values. Following [5], we select files with the sub-string 3c in the file name. Each LA-2A setting has 20 min of paired audio containing real-world musical sounds and synthetic test signals. We use the first 5 min for training and the rest for evaluation. The same input data is used for FF-A/B/C, and the target audio is generated by applying the FF compressor with the target parameters. We pick the simplified compressor [4] as our baseline, which uses the same FF compressor but has $\alpha_{\rm at} = \alpha_{\rm rt}$. We denote this baseline as FS, as it uses frequency sampling to compute (19) and (21).

The parameters we optimise are $\{\hat{R}, CT, \hat{\alpha}_{\rm at/rt/rms}, \gamma\}$. We set $\alpha_* = \mathrm{sigmoid}(\hat{\alpha}_*)$ and $R = \exp(\hat{R}) + 1$. The initial values are 50 ms attack/release, $R = 2$, $CT = -10$ dB, $\alpha_{\rm rms} = 0.3$, and $\gamma = 0$ dB. The conversion from a time $t$ in seconds to a coefficient $\alpha_*$ is $1 - \exp(-\frac{2.2}{44100\,t})$; both mappings are sketched below. We train each compressor without mini-batching for at least 1000 epochs using stochastic gradient descent with a learning rate of 100 and momentum of 0.9, minimising the mean absolute error (MAE). For evaluation, we select the parameters with the lowest training loss. We apply the same pre-filter from [5] before calculating the loss and evaluation metrics. We use double precision when computing the very large FFTs required by the FS method, to avoid numerical overflow.
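A small sketch of these parameter mappings, with function names of our own choosing:

```python
import math
import torch

def constrain(alpha_hat: torch.Tensor, R_hat: torch.Tensor):
    alpha = torch.sigmoid(alpha_hat)  # keeps alpha_at / alpha_rt / alpha_rms in (0, 1)
    ratio = torch.exp(R_hat) + 1      # keeps the ratio R above 1
    return alpha, ratio

def time_to_alpha(t: float, fs: int = 44_100) -> float:
    # Convert an attack/release time in seconds to a one-pole coefficient.
    return 1.0 - math.exp(-2.2 / (fs * t))

# e.g. the 50 ms initial attack/release corresponds to time_to_alpha(0.05) ~= 0.000997
```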

Table 5 shows that ∇FF achieves a lower ESR than FS on all conditions except FF-B. The parameters learned for conditions FF-A/B/C using ∇FF are close to the ground truth; FS can only recover the parameters when the attack and release times are identical (FF-B). For the LA-2A, ∇FF reasonably captures the analog characteristics (fast attack and slow release, shown in Table 6) of conditions E/F. We found that FS tends to learn unrealistically large attack/release times and ratios, as seen in Table 6, likely due to its simplified design.

Table 7 shows our method is two to three times faster than FS. Training takes roughly 43 min for FS and 17 min for ∇FF on an M1 Pro MacBook.

6 Conclusion and Future Work

In this work, we propose an efficient backpropagation algorithm and implementation for an all-pole filter that can be used to model time-varying analog audio systems end-to-end using gradient descent. We demonstrate its advantages over previous frequency sampling approximations by using it to model a phaser, a time-varying subtractive synthesiser, and a compressor. Our method outperforms frequency sampling in accuracy and training efficiency, especially when using the systems at the sample-rate level. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin.

Our future work involves extending the backpropagation algorithm to work with differentiable initial conditions and applying it to relevant tasks. We also plan to benchmark the forward-mode differentiation of our filter, investigate its numerical stability, and extend our gradient derivation to higher-order optimisation use cases.

7 Acknowledgements

Funded by UKRI and EPSRC as part of the “UKRI CDT in Artificial Intelligence and Music”, under grant EP/S022694/1, and by the Scottish Graduate School of Arts & Humanities (SGSAH).

References

  • [1] Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, and Adam Roberts, "DDSP: Differentiable digital signal processing," in International Conference on Learning Representations, 2020.
  • [2] Chin-Yun Yu and György Fazekas, "Singing voice synthesis using differentiable LPC and glottal-flow-inspired wavetables," in Proc. International Society for Music Information Retrieval, 2023, pp. 667–675.
  • [3] Shahan Nercessian, Andy Sarroff, and Kurt James Werner, "Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads," in ICASSP. IEEE, 2021, pp. 890–894.
  • [4] Christian J. Steinmetz, Nicholas J. Bryan, and Joshua D. Reiss, "Style transfer of audio effects with differentiable signal processing," Journal of the Audio Engineering Society, vol. 70, no. 9, pp. 708–721, 2022.
  • [5] Alec Wright and Vesa Välimäki, "Grey-box modelling of dynamic range compression," in DAFx, 2022, pp. 304–311.
  • [6] Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, and Paavo Alku, "GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram," in Proc. INTERSPEECH, 2019, pp. 694–698.
  • [7] Alistair Carson, Simon King, Cassia Valentini Botinhao, and Stefan Bilbao, "Differentiable grey-box modelling of phaser effects using frame-based spectral processing," in DAFx, 2023, pp. 219–226.
  • [8] Purbaditya Bhattacharya, Patrick Nowak, and Udo Zölzer, "Optimization of cascaded parametric peak and shelving filters with backpropagation algorithm," in DAFx, 2020, pp. 101–108.
  • [9] Shahan Nercessian, "Neural parametric equalizer matching using differentiable biquads," in DAFx, 2020, pp. 265–272.
  • [10] Joseph T. Colonel, Christian J. Steinmetz, Marcus Michelen, and Joshua D. Reiss, "Direct design of biquad filter cascades with deep learning by sampling random polynomials," in ICASSP. IEEE, 2022, pp. 3104–3108.
  • [11] Taejun Kim, Yi-Hsuan Yang, and Juhan Nam, "Joint estimation of fader and equalizer gains of DJ mixers using convex optimization," in DAFx, 2022, pp. 312–319.
  • [12] Julius O. Smith III, Spectral Audio Signal Processing, online book, 2011 edition, https://ccrma.stanford.edu/~jos/sasp/, accessed 2024-03-27.
  • [13] John D. Markel and Augustine H. Gray, Linear Prediction of Speech, vol. 12 of Communication and Cybernetics, Springer, Berlin, Heidelberg, 1976.
  • [14] Jean-Marc Valin and Jan Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in ICASSP. IEEE, 2019, pp. 5891–5895.
  • [15] Achuth Rao MV and Prasanta Kumar Ghosh, "SFNet: A computationally efficient source filter model based neural speech synthesis," IEEE Signal Processing Letters, vol. 27, pp. 1170–1174, 2020.
  • [16] Suhyeon Oh, Hyungseob Lim, Kyungguen Byun, Min-Jae Hwang, Eunwoo Song, and Hong-Goo Kang, "ExcitGlow: Improving a WaveGlow-based neural vocoder with linear prediction analysis," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2020, pp. 831–836.
  • [17] Udo Zölzer, DAFX: Digital Audio Effects, chapter Nonlinear Processing, pp. 110–112, John Wiley & Sons, 2011.
  • [18] Joseph T. Colonel and Joshua D. Reiss, "Approximating ballistics in a differentiable dynamic range compressor," in Audio Engineering Society Convention 153, 2022.
  • [19] Zixun Guo, Chen Chen, and Eng Siong Chng, "DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition," in Proc. INTERSPEECH, 2022, pp. 3799–3803.
  • [20] Marco Forgione and Dario Piga, "dynoNet: A neural network architecture for learning dynamical systems," International Journal of Adaptive Control and Signal Processing, vol. 35, no. 4, pp. 612–626, 2021.
  • [21] Chin-Yun Yu and György Fazekas, "Differentiable time-varying linear prediction in the context of end-to-end analysis-by-synthesis," arXiv:2406.05128, 2024.
  • [22] Julius O. Smith III, Physical Audio Signal Processing, online book, 2010 edition, https://ccrma.stanford.edu/~jos/pasp/, accessed 2023-02-28.
  • [23] Ben Hayes, Charalampos Saitis, and György Fazekas, "Sinusoidal frequency estimation by gradient descent," in ICASSP. IEEE, 2023.
  • [24] Roope Kiiski, Fabián Esqueda, and Vesa Välimäki, "Time-variant gray-box modeling of a phaser pedal," in DAFx, 2016, pp. 31–38.
  • [25] Joseph Turian, Jordie Shier, George Tzanetakis, Kirk McNally, and Max Henry, "One billion audio sounds from GPU-enabled modular synthesis," in DAFx, 2021, pp. 222–229.
  • [26] JD Sleep, "Small Stone Information [Online]," https://generalguitargadgets.com/effects-projects/phase-shifters/small-stone-information/, accessed 2023-03-26.
  • [27] Christopher Mitcheltree, Christian J. Steinmetz, Marco Comunità, and Joshua D. Reiss, "Modulation extraction for LFO-driven audio effects," in DAFx, 2023, pp. 94–101.
  • [28] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP. IEEE, 2020, pp. 6199–6203.
  • [29] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, "Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms," in Proc. INTERSPEECH, 2019, pp. 2350–2354.
  • [30] Scott Hawley, Benjamin Colburn, and Stylianos Ioannis Mimilakis, "Profiling audio compressors with deep neural networks," in Audio Engineering Society Convention 147, 2019.