The processing of mega-dimensional data, such as images, scales linearly with image size only if fixed size processing windows are used. It would be very useful to be able to automate the process of sizing and interconnecting the processing windows. A stochastic encoder that is an extension of the standard Linde-Buzo-Gray vector quantiser, called a stochastic vector quantiser (SVQ), includes this required behaviour amongst its emergent properties, because it automatically splits the input space into statistically independent subspaces, which it then separately encodes. Various optimal SVQs have been obtained, both analytically and numerically. Analytic solutions which demonstrate how the input space is split into independent subspaces may be obtained when an SVQ is used to encode data that lives on a 2-torus (e.g. the superposition of a pair of uncorrelated sinusoids). Many numerical solutions have also been obtained, using both SVQs and chains of linked SVQs: (1) images of multiple independent targets (encoders for single targets emerge), (2) images of multiple correlated targets (various types of encoder for single and multiple targets emerge), (3) superpositions of various waveforms (encoders for the separate waveforms emerge - this is a type of independent component analysis (ICA)), (4) maternal and foetal ECGs (another example of ICA), (5) images of textures (orientation maps and dominance stripes emerge). Overall, SVQs exhibit a rich variety of self-organising behaviour, which effectively discovers the internal structure of the training data. This should have an immediate impact on "intelligent" computation, because it reduces the need for expert human intervention in the design of data processing algorithms.
Deep Dive into Self-Organising Stochastic Encoders
• Encode then decode: x → y → x′.
• x = input vector; y = code; x′ = reconstructed vector.
• Code vector y = (y_1, y_2, …, y_n) with 1 ≤ y_i ≤ M.
• Pr(x) = input PDF; Pr(y|x) = stochastic encoder; Pr(x′|y) = stochastic decoder.
• ‖x − x′‖² = Euclidean reconstruction error.
• Do the ∫ dx Pr(x|y) (…) integration.
• x′(y) = reconstruction vector.
• x′(y) is the solution of ∂D/∂x′(y) = 0, so it can be deduced by optimisation.
Pr(y|x) = Pr(y_1|x) Pr(y_2|x) ⋯ Pr(y_n|x)

x′(y) = (1/n) Σ_{i=1}^{n} x′(y_i)
• Pr(y|x) implies the components (y_1, y_2, …, y_n) of y are conditionally independent given x.
• x′(y) implies the reconstruction is a superposition of contributions x′(y_i) for i = 1, 2, …, n.
• The stochastic encoder samples n times from the same Pr(y|x).
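The encode-then-decode cycle above can be sketched in a few lines. This is an illustrative toy model, not the paper's trained encoder: `posterior` is a placeholder softmax stand-in for Pr(y|x), and `ref_vectors` plays the role of the reconstruction vectors x′(y).

```python
import numpy as np

rng = np.random.default_rng(0)
M, dim, n = 4, 2, 10

ref_vectors = rng.normal(size=(M, dim))  # x'(y) for y = 1..M (0-indexed here)

def posterior(x):
    """A toy Pr(y|x): softmax of inner products (placeholder model)."""
    logits = ref_vectors @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def encode_decode(x, n=n):
    """Sample the code n times i.i.d. from Pr(y|x), then superpose x'(y_i)."""
    p = posterior(x)
    samples = rng.choice(M, size=n, p=p)          # y = (y_1, ..., y_n)
    x_recon = ref_vectors[samples].mean(axis=0)   # (1/n) * sum_i x'(y_i)
    return samples, x_recon

x = np.array([1.0, 0.0])
y, x_prime = encode_decode(x)
print(y.shape, x_prime.shape)
```

Note how the decoder never sees x: the reconstruction is built purely from the sampled code components, exactly as in the superposition formula above.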
• D_1 is a stochastic vector quantiser with the vector code y replaced by a scalar code y.
• D_2 is a non-linear (note Pr(y|x)) encoder with a superposition term Σ_{y=1}^{M} Pr(y|x) x′(y).
• n → ∞: the stochastic encoder measures Pr(y|x) accurately and D_2 dominates.
• n → 1: the stochastic encoder samples Pr(y|x) poorly and D_1 dominates.
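The two limits can be made concrete numerically: as n grows, the n-sample reconstruction (1/n) Σ_i x′(y_i) converges to the superposition Σ_y Pr(y|x) x′(y), which is why the D_2 term dominates. The reference vectors and the fixed posterior below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
M, dim = 4, 2
ref = rng.normal(size=(M, dim))        # x'(y), arbitrary toy values
p = np.array([0.1, 0.2, 0.3, 0.4])     # a fixed Pr(y|x) for one input x

mean_recon = p @ ref                   # superposition: sum_y Pr(y|x) x'(y)
for n in (1, 10, 10_000):
    ys = rng.choice(M, size=n, p=p)    # sample the code n times
    sample_recon = ref[ys].mean(axis=0)
    err = np.linalg.norm(sample_recon - mean_recon)
    print(n, err)                      # error shrinks roughly like 1/sqrt(n)
```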
2 Analytic Optimisation
• 3 types of solution: Pr(x) = 0 (trivial), Pr(y|x) = 0 (ensures Pr(y|x) ≥ 0), and a third, non-trivial stationary type.
• x = (cos θ, sin θ): input vector uniformly distributed on a circle.
2.3.5 Stochastic encoder PDFs overlap no more than 3 at a time
2.4 2-Torus

2.4.1 Input vector uniformly distributed on a 2-torus
• Pr (y |x ) depends jointly on x 1 and x 2 .
• Requires n = 1 to encode x.
• For a given resolution the size of the codebook increases exponentially with input dimension.
• Y 1 and Y 2 are non-intersecting subsets of the allowed values of y.
• Pr (y |x ) depends either on x 1 or on x 2 , but not on both at the same time.
• Requires n ≫ 1 to encode x.
• For a given resolution the size of the codebook increases linearly with input dimension.
• Fixed n, increasing M: joint encoding is eventually favoured because the codebook eventually becomes large enough.
• Fixed M, increasing n: factorial encoding is eventually favoured because the number of samples eventually becomes large enough.
• Factorial encoding is encouraged by using a small codebook and sampling a large number of times.
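The exponential-versus-linear codebook scaling behind this trade-off is simple arithmetic; the sketch below makes it explicit, assuming each input dimension needs r distinguishable code levels (an illustrative assumption).

```python
def joint_codebook_size(r: int, d: int) -> int:
    """Joint encoding: one code cell per combination of levels -> r**d."""
    return r ** d

def factorial_codebook_size(r: int, d: int) -> int:
    """Factorial encoding: one r-level encoder per subspace -> r*d."""
    return r * d

# For r = 10 levels and d = 6 input dimensions:
print(joint_codebook_size(10, 6), factorial_codebook_size(10, 6))  # → 1000000 60
```

The gap widens rapidly with d, which is why splitting the input into independently encoded subspaces keeps the codebook (and hence the computation) tractable for high-dimensional data.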
3 Numerical Optimisation

Luttrell S P, 1999, Stochastic vector quantisers, submitted to a special issue of IEEE Trans. Information Theory on Information-Theoretic Imaging.
3.2.1 Posterior probability with infinite range neighbourhood
• This does not restrict Pr (y |x ) in any way.
• N (y ′ ) is the set of neurons that lie in a predefined “neighbourhood” of y ′ .
• N⁻¹(y) is the “inverse neighbourhood” of y, defined as N⁻¹(y) ≡ {y′ : y ∈ N(y′)}.
• Neighbourhood is used to introduce “lateral inhibition” between the firing neurons.
• This restricts Pr (y |x ), but allows limited range lateral interactions to be used.
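The neighbourhood and inverse-neighbourhood definitions above can be sketched on a small ring of M neurons. The particular forward-looking (asymmetric) neighbourhood is an illustrative assumption, chosen so that N and N⁻¹ visibly differ.

```python
M = 6  # number of neurons on a ring

def N(yp: int) -> set:
    """Neighbourhood of y': itself and its clockwise neighbour (assumed)."""
    return {yp, (yp + 1) % M}

def N_inv(y: int) -> set:
    """Inverse neighbourhood: all y' whose neighbourhood contains y."""
    return {yp for yp in range(M) if y in N(yp)}

print(sorted(N(0)), sorted(N_inv(0)))  # → [0, 1] [0, 5]
```

With a symmetric neighbourhood N and N⁻¹ coincide; the asymmetric case shows why the inverse must be computed rather than assumed.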
Pr(y|x) → Σ_{y′} Pr(y|y′) Pr(y′|x)
• Pr (y |y ′ ) is the amount of probability that leaks from location y ′ to location y.
• L (y ′ ) is the “leakage neighbourhood” of y ′ .
• L⁻¹(y) is the “inverse leakage neighbourhood” of y, defined as L⁻¹(y) ≡ {y′ : y ∈ L(y′)}.
• Leakage allows the network output to be “damaged” in a controlled way.
• When the network is optimised it automatically becomes robust with respect to such damage.
• Leakage leads to topographic ordering according to the defined neighbourhood.
• This restricts Pr (y |x ), but allows topographic ordering to be obtained, and is faster to train.
L_{y,y′} ≡ Pr(y|y′)   P_{y,y′} ≡ Pr(y|x; y′)

• This shorthand notation simplifies the appearance of the gradients of D_1 and D_2.
• For instance, Pr(y|x) = (1/M) (Lᵀp)_y.
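The leakage step itself is just a matrix-vector product: the raw posterior is mixed through the matrix of leakage probabilities Pr(y|y′). The ring-shaped leakage neighbourhood and the one-hot posterior below are illustrative stand-ins, not values from the paper.

```python
import numpy as np

M = 5
# Leakage to nearest neighbours on a ring (assumed neighbourhood):
# column y' of L holds Pr(.|y'), so each column sums to 1.
L = np.zeros((M, M))
for yp in range(M):
    for y in (yp - 1, yp, yp + 1):
        L[y % M, yp] = 1.0 / 3.0

p = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # raw Pr(y'|x): all mass on y' = 2
leaked = L @ p                            # sum_{y'} Pr(y|y') Pr(y'|x)
print(leaked)                             # mass spreads to y = 1, 2, 3
```

Because every column of L is a probability distribution, the leaked posterior still sums to one; optimising under this deliberate "damage" is what drives the topographic ordering described above.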
3.2.5 Derivatives w.r.t. x ′ (y)
• The extra factor 1/M in ∂D_2/∂x′(y) arises because there is a Σ_{y=1}^{M} (…) hidden inside D_2.
• This is a standard “sigmoid” function.
• This restricts Pr (y |x ), but it is easy to implement, and leads to results similar to the ideal analytic results.
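A minimal sketch of this parameterisation, assuming each neuron y passes w(y)·x + b(y) through a sigmoid and the outputs are then normalised to give Pr(y|x); the exact normalisation in the paper may differ, and the weights below are arbitrary.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def posterior(x, W, b):
    """Normalised sigmoid posterior: Pr(y|x) = Q(y|x) / sum_y' Q(y'|x)."""
    q = sigmoid(W @ x + b)     # Q(y|x), one value per neuron y
    return q / q.sum()         # normalise so sum_y Pr(y|x) = 1

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 2))    # w(y) for M = 4 neurons, 2-D input
b = rng.normal(size=4)         # b(y)
p = posterior(np.array([1.0, 0.0]), W, b)
print(p)
```

Restricting Pr(y|x) to this family keeps the number of trainable parameters linear in M while still permitting the soft, overlapping posteriors seen in the analytic solutions.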
3.2.8 Derivatives w.r.t. w (y) and b (y)
• M = 4 and n = 10 were used.
• The reference vectors x′(y) (for y = 1, 2, 3, 4) are initialised close to the origin.
• The training history leads to stationary x′(y) just outside the unit circle.
• Each of the posterior probabilities Pr(y|x) (for y = 1, 2, 3, 4) is large mainly in a π/2-radian arc of the circle.
• There is some overlap between the Pr(y|x).
• Each of the posterior probabilities Pr(y|x) is large mainly in a localised region of the torus.
• There is some overlap between the Pr(y|x).
• M = 8 and n = 20 were used; this point lies inside the factorial encoding region of the stability diagram.
• Each of the posterior probabilities Pr(y|x) is large mainly in a collar-shaped region of the torus; half of the collars circle one way round the torus, and half the other way.
• There is some overlap between the Pr(y|x) that circle the same way round the torus.
• There is a localised region of overlap between a pair of Pr (y |x ) tha
…(Full text truncated)…