AHA! an Artificial Hippocampal Algorithm for Episodic Machine Learning
Gideon Kowadlo, Incubator 491, gideon@agi.io
Abdelrahman Ahmed, Incubator 491, abdel@agi.io
David Rawlinson, Incubator 491, dave@agi.io

ABSTRACT
The majority of ML research concerns slow, statistical learning of i.i.d. samples from large, labelled datasets. Animals do not learn this way. An enviable characteristic of animal learning is ‘episodic’ learning - the ability to memorise a specific experience as a composition of existing concepts, after just one experience, without provided labels. The new knowledge can then be used to distinguish between similar experiences, to generalise between classes, and to selectively consolidate to long-term memory. The Hippocampus is known to be vital to these abilities. AHA is a biologically-plausible computational model of the Hippocampus. Unlike most machine learning models, AHA is trained without external labels and uses only local credit assignment. We demonstrate AHA in a superset of the Omniglot one-shot classification benchmark. The extended benchmark covers a wider range of known hippocampal functions by testing pattern separation, completion, and recall of original input. These functions are all performed within a single configuration of the computational model. Despite these constraints, image classification results are comparable to conventional deep convolutional ANNs.

1 INTRODUCTION
In recent years, machine learning has been applied to great effect across many problem domains including speech, vision, language and control [Lake et al. 2017; Vinyals et al. 2016]. The dominant approach is to combine models having a very large number of trainable parameters with careful regularisation and slow, statistical learning. Samples are typically assumed to be i.i.d. (independent and identically distributed), implying an unchanging world.
Popular algorithms require large volumes of data and, with the exception of some very recent image classification meta-learning frameworks e.g. [Berthelot et al. 2019; Sohn et al. 2020], an equally large number of labels. Models are usually susceptible to “catastrophic forgetting”, without an ability to learn continually, also referred to as lifelong learning [Kirkpatrick et al. 2017; Sodhani et al. 2020]. In contrast, animals display a much broader range of intelligence [Lake et al. 2017], seemingly free of these constraints. A significant differentiator between animals and machines is the ability to learn from one experience (one-shot learning), to reason about the specifics of an experience, and to generalise learnings from the one experience. In addition, new knowledge exploits the compositionality of familiar concepts [Lake et al. 2017], which is likely a key component for continual learning. For example, after seeing a single photo of a giraffe with a birthmark on its ear, a young child can recognise other giraffes and identify this specific one as well. [Footnote 1: Based on an example by [Vinyals et al. 2016].] One-shot learning of this nature is of great interest to better understand animal intelligence as well as to advance the field of machine intelligence.

1.1 Mammalian Complementary Learning Systems
In mammals, a brain region widely recognised to be critical for these aspects of learning and memory is the Hippocampal Formation (HF) [Kandel et al. 1991] [Footnote 2]. Complementary Learning Systems (CLS) is a standard framework for understanding the function of the HF [Kumaran et al. 2016; McClelland et al. 1995; O’Reilly et al. 2014]. CLS consists of two differentially specialised and complementary structures, the neocortex and the HF. They are depicted in a high-level diagram in Figure 1. In this framework, the neocortex is most similar to a conventional ML model, incrementally learning regularities across many observations, comprising a long-term memory (LTM).
It forms overlapping and distributed representations that are effective for inference. In contrast, the HF rapidly learns distinct observations, forming sparser, non-overlapping and therefore non-interfering representations, functioning as a short-term memory (STM). Recent memories from the HF are replayed to the neocortex, re-instating the original activations and resulting in consolidation as long-term memory (LTM). Patterns are replayed in an interleaved fashion, avoiding catastrophic forgetting. In addition, they can be replayed selectively according to salience. There have been numerous implementations of CLS [Greene et al. 2013; Ketz et al. 2013; Norman and O’Reilly 2003; Rolls 2013; Schapiro et al. 2017], and [Rolls 1995] published a similar model with greater neuroanatomical detail. A more detailed description of the HF is given in Section 2.2.

CLS explains the role of the HF in Episodic Memory [Gluck et al. 2003]. In this and related literature (citations above), an episode is defined as a distinct observation that is a “conjunctive binding of various distributed pieces of information” [Ketz et al. 2013]. It is more general than the common behavioural definition introduced by [Tulving 1985], an (often autobiographical) memory of places and things, “memories for the specific contents of individual episodes or events” [McClelland et al. 1995]. In the behavioural definition, a temporal aspect with ordering is implied, but is not necessary. CLS-style learning is thus a potential bridge to endowing machines with Episodic Memory capabilities [Kumaran et al. 2016] [Footnote 3].

[Footnote 2: The hippocampus itself is contained within the HF along with other structures (there is no universally accepted definition of the exact boundaries). The HF is often described as being contained within a region called the Medial Temporal Lobe (MTL).]
[Footnote 3: Framed as a conjunction of concepts, one-shot learning of episodes also provides the tantalising potential to learn factual information, traditionally defined as Semantic Memory [Squire 1992]. The memories held temporarily in the HF (STM) and consolidated in the neocortex (LTM) could consist of both Episodic and Semantic Memory, together comprising Declarative Memory [Goldberg 2009; McClelland et al. 1995; Rolls 2013].]

Figure 1: High-level depiction of Complementary Learning Systems (CLS). [Diagram shows Observations flowing into a fast learning / short-term memory (STM, Hippocampal) system, with Replay to an incremental learning / long-term memory (LTM, Neocortical) system.] The specialisations of two memory systems complement each other. The short-term memory learns and forgets rapidly; salient memories can be replayed to the long-term memory, enabling incremental statistical learning there.

1.2 Motivation and Solution
The motivation of this work is to understand and replicate HF functionality to reproduce the advantages of this form of one-shot learning. We are therefore focussing on the functional capabilities of Episodic Memory as described by CLS. The HF also performs other functions that are out of scope for this study, such as incremental learning [Gluck et al. 2003; Schapiro et al. 2017] and prediction [Buckner 2010; Lisman and Redish 2009; Stachenfeld et al. 2017]. It is also known for its role in spatial mapping using Place and Grid cells, but this is understood to be more general conceptual mapping rather than being confined to purely spatial reasoning [Kurth-Nelson et al. 2016; Mok and Love 2019; Moser et al. 2008].

The CLS approach (citations above) is to model neuron dynamics and learning rules that are biologically realistic at the level of the individual neuron. As a consequence, the behaviour of the region is limited by the accuracy of the neuron model. In addition, it may be more difficult to scale than collectively modelling populations of neurons at a higher level of abstraction. Scaling may be important to work with inputs derived from realistic sensor observations such as images and to compare to conventional ML models. Prior CLS experiments demonstrate an ability to recognise specific synthetically generated vectors, requiring pattern separation between similar inputs, often with the presence of distractor patterns or ‘lures’. In most cases, generalisation performance is not reported. Greene et al. [2013] introduced noise (additive and non-additive) and occlusion to test inputs, and to our knowledge, no studies have tested the ability to generalise to different exemplars of the same class.

Within the ML literature, one-shot learning has received recent attention. Following seminal work by [Li et al. 2003, 2006], the area was re-invigorated by [Lake et al. 2015], who introduced a popular test that has become a standard benchmark. It is a one-shot classification task on the Omniglot dataset of handwritten characters (see Section 3 for a detailed description). A typical approach is to train a model on many classes and use learnt concepts to recognise new classes quickly from one or few examples. Often framed as meta-learning, or “learning to learn”, solutions have been implemented with neural networks that use external labels at training time such as Siamese networks [Koch et al. 2015], matching networks [Vinyals et al. 2016], and prototypical networks [Snell et al. 2017], as well as Bayesian approaches [George et al. 2017; Lake et al. 2015]. A comprehensive review is given in [Lake et al. 2019]. In contrast to the computational neuroscience models, these studies focus on generalisation without considering additional HF capabilities such as modelling of specific instances and pattern separation, and without a framework for retaining knowledge beyond the classification task itself.
They do not consider imperfect samples due to factors such as occlusion or noise.

Moving beyond the current CLS and ML one-shot studies requires developing the ability to reason about specific instances in addition to generalising classes. A practical realisation of these capabilities implies pattern separation and completion in addition to learning the observational invariances of classes. These appear to be contradictory qualities, making it challenging to implement both in a unified architecture. It is also challenging to model the HF at an appropriate level of abstraction - detailed enough to capture the breadth of capabilities whilst abstract enough to scale to convincing experiments with realistic sensor data.

We developed an Artificial Hippocampal Algorithm (AHA) to extend one-shot learning in ML and provide insight on hippocampal function. AHA is a new computational model that can work with realistic sensor data without labels. AHA combines features from CLS and [Rolls 1995, 2013], modelling the hippocampus at the level of subfields (anatomically and physiologically distinct regions). The micro-architecture of AHA is implemented with scalable ML frameworks, subject to biological plausibility constraints (see Section 2.1). Functionally, AHA provides a fast-learning STM that complements an incrementally learning LTM model (such as a conventional artificial neural network). The complete system is tested on a dataset resembling realistic grounded sensor data, including the introduction of typical input corruption simulated with noise and occlusion. The tasks involve the learning of specific instances, generalisation, and recall for replay to the LTM. We propose a new benchmark that extends the Omniglot test [Lake et al. 2015] for this purpose. The experimental method and benchmark are described in detail in Section 3. The performance of AHA is compared to two baselines: a) the LTM on its own, and b) an STM implemented with standard ML components.
The differing qualities of AHA’s internal signals are also reported.

1.3 Contributions
Our primary contributions are as follows:
(1) We propose a novel hippocampal model called AHA, based on CLS, but with a different model for combining pathways, and modelling neuron populations at a higher level of abstraction.
(2) We propose a one-shot learning benchmark that extends the standard Omniglot test to capabilities not typically seen in the ML literature or replicated by hippocampal models in the computational neuroscience literature.
(3) We evaluate LTM+AHA on the proposed benchmark, and show that it is better suited to one-shot learning without labels than the visual processing of LTM alone and a baseline STM. In addition, we show that on the base Omniglot one-shot classification test, it is comparable to state-of-the-art neural network approaches that do not contain domain-specific priors.
(4) We also show that AHA is a viable complement to an incremental learning LTM, with the ability to store and replay memories when presented with a cue, whereas the baseline STM architecture was less effective.

2 MODEL
In this section, we first describe the biological plausibility constraints that govern the architecture micro-structure. Next, we describe the standard CLS-style model, explaining the biological function and the salient features that are used as the conceptual foundation for AHA. The training/testing framework that we used for the LTM+STM architecture is described, providing context for sub-sections on two STM implementations: AHA itself and FastNN, a baseline model. AHA’s theory of operation is also discussed.

2.1 Biological Plausibility Constraints
We adopt the local and immediate credit assignment constraints from [Rawlinson et al. 2019]. Under these constraints it is possible to use stochastic gradient descent, but with only shallow (and local) backpropagation.
In addition, we do not require any labels for training. We do not claim these criteria are sufficient for an accurate replica of biology. However, we aim to avoid the most implausible features of conventional machine learning approaches.

Our definition of “shallow” is to allow error backpropagation across at most two ANN layers. The computational capabilities of single-layer networks are very limited, especially in comparison to two-layer networks. Biological neurons perform “dendrite computation”, involving integration and non-linearities within dendrite subtrees [Guerguiev et al. 2017; Tzilivaki et al. 2019]. This is computationally equivalent to 2 or 3 typical artificial neural network layers. We assume that shallow backpropagation could approximate dendrite computation within a single biological cell layer, with training signals propagated inside cells.

2.2 Biological Computational Model
The adopted biological model of the HF is based on the CLS-style models of [Rolls 1995, 2013, 2018] and CLS itself [McClelland et al. 1995; O’Reilly et al. 2014], illustrated in Figure 2. The concept of complementary learning systems was given in the Introduction (Section 1). In this section, we describe the subfields (functional units) and pathways of the model, referring to the citations given above. We choose details and a level of description relevant to understanding and implementing functionality in a working model.

Figure 2: Biological subfields of the HF. [Diagram shows the Entorhinal Cortex connected to DG (pattern separation ‘PS’), CA3 (pattern completion ‘PC’) and CA1 (pattern mapping ‘PM’), with pattern retrieval ‘PR’ via the Perforant pathway and DG-CA3 connections via the Mossy fibres.]

The Entorhinal Cortex (EC) is the main gateway between the neocortex and hippocampus. It does not fit neatly into the overly simplified model shown in Figure 1, as it is broadly part of the HF, but thought to learn incrementally [Gluck et al. 2003]. Entorhinal cortex sends input from superficial layers (EC_in) to the hippocampus, and deep layers (EC_out) receive output from the hippocampus. The hippocampus learns to retrieve and ‘replay’ EC_in patterns to EC_out, where they are reinstated (equivalent to ‘reconstruction’ in ML terminology). At a high level, the hippocampus comprises an auto-associative memory that can operate effectively with partial cues (pattern completion) and distinguish very similar inputs (pattern separation).

Within the hippocampus, there are several functional units or subfields. The most significant of these, according to CLS and Rolls, are DG, CA3 and CA1. EC_in forms a sparse and distributed overlapping pattern that combines input from all over the neocortex and subcortical structures. This pattern becomes sparser and less overlapping through the Dentate Gyrus (DG) and CA3, with increasing inhibition and sparse connectivity. That provides distinct representations for similar inputs and therefore an ability to separate patterns, important for learning about specific episodes or conjunctions of input concepts, as opposed to generalities. The DG-CA3 connections (Mossy fibres) are non-associative and are responsible for encoding engrams in CA3 [Rolls 2013]. The EC-CA3 connections comprise a pattern association network and are responsible for providing a cue for retrieval (also referred to as recall) from CA3 [Rolls 2013]. Recurrent connections in CA3 create an auto-associative memory with basins of attraction storing multiple patterns. Any part of a stored pattern can be presented as a cue to recall a crisp and complete version of the closest pattern. EC has bilateral connections to CA1, and the CA3-CA1-EC pathway forms a hierarchical pattern associative network. During encoding, activated CA1 neurons form associative connections with the active EC neurons.
During retrieval, the sparse CA3 pattern becomes denser and more overlapping through CA1, resulting in the original, complete pattern that was present during encoding being replayed, reinstating activation in EC_out and in turn the neocortex through reciprocal feedback connections. The replay is used for cognitive processing or for long-term memory consolidation. Replay occurs in an interleaved fashion, allowing incremental learning in the neocortex without catastrophic forgetting.

Electrophysiological recordings and lesion studies show that oscillatory activity called theta rhythms corresponds to encoding and retrieval phases [Hasselmo et al. 2002; Ketz et al. 2013]. During encoding, synaptic transmission from EC is strong and from CA3 weak, although the CA3 and CA1 synapses show high potential for synaptic modification (plasticity). During retrieval, synaptic transmission from CA3 is strong, and from EC is weak but sufficient to provide a cue, facilitating retrieval. For this reason our model has discrete encoding (training) and retrieval (inference) phases, modulated by the respective subfield equivalents.

In the CLS-style studies mentioned above and subsequent implementations [Greene et al. 2013; Ketz et al. 2013; Norman and O’Reilly 2003; Schapiro et al. 2017], DG and EC projections are involved in encoding and retrieval in CA3. In contrast, [Rolls 1995, 2013, 2018] presents evidence that encoding in CA3 is dominated by strong afferent activity from DG facilitated by non-associative plasticity, whereas retrieval with generalisation takes place through associative learning in EC-CA3. We adopt Rolls’ model, as DG’s highly orthogonal patterns are effectively random and are likely to ‘confuse’ cueing CA3. Rolls’ model of encoding and retrieval with CA3 [Rolls 1995, 2013, 2018] is expanded here and illustrated in Figure 3.
During encoding, strong DG transmission results in a pattern of active pyramid neurons in CA3, comprising an engram that becomes an attractor state through synaptic modification in the recurrent neurons. EC-CA3 concurrently learns to associate EC and CA3 representations. In effect, the CA3 engram comprises a target for later retrieval by EC-CA3, that can be used to cue recall from CA3. Since EC representations are distributed and overlapping, similar observations are able to retrieve the same appropriate distinct memory, enabling generalisation. Conversely, since the CA3 engram varies greatly in response to small changes in EC, the EC-DG-CA3 pathway enables separation of similar patterns. However, this also prevents generalisation via this pathway. Together, the two pathways EC-DG-CA3 and EC-CA3 are able to perform the opposing qualities of pattern generalisation and pattern separation.

More recent versions of CLS [Greene et al. 2013; Ketz et al. 2013; Schapiro et al. 2017] utilise biologically plausible Contrastive Hebbian Learning (CHL) within the LEABRA framework [O’Reilly 1996]. This form of Hebbian learning uses local error gradients, speeding up training and avoiding a pre-training step. The use of local error gradients is consistent with our biological constraints.

‘Big-loop’ recurrence occurs when a reinstated EC pattern is transmitted back into the HF as the EC input. This process has received little attention in the computational modelling literature. It appears to play a role in learning statistical information in the form of higher-order patterns across episodes [Koster et al. 2018; Schapiro et al. 2017] and is therefore out of scope for this study.

2.3 Training and Testing Framework
LTM and STM components (Figure 1) are trained and tested at different stages. This section lays out the framework, providing context for the model descriptions.

Figure 3: Encoding and retrieval in CA3. Left: An EC pattern flows through DG and results in a sparse set of active neurons in CA3 (active neurons are shown filled with grey, inactive are white). Right: Simultaneously, the same EC pattern forms associative connections between active neurons in EC and those in CA3 which represent the stored pattern from DG. Solid arrows represent axon firing, dotted arrows represent inactive axons.

In the following sections, related to ML implementations, we use the term ‘Train(ing)’ to refer to the process of encoding and ‘Test(ing)’ to refer to retrieval/recall. The stages of testing and training are described below.

Stage 1: Pre-train LTM: Train LTM on a disjoint training set of many samples, over multiple epochs. The LTM learns incrementally about common concepts that can be used compositionally to represent unseen classes.

Stage 2: Evaluate LTM+STM: An evaluation has two steps performed in succession - STM training (encoding) and STM testing (inference). During both steps, LTM is used in inference mode (no learning occurs). STM is reset after each evaluation.
• Train: A small ‘study’ set is presented once. STM modules are set to train mode to learn the samples.
• Test: A small ‘recall’ set is presented, with STM modules in inference mode. For each ‘recall’ sample, the system is expected to retrieve the corresponding sample from the ‘study’ set. If correct, it is considered to be ‘recognised’ - an AHA moment!

Training and Testing in STM occur rapidly over multiple internal cycles within one exposure to an external stimulus.

2.4 AHA Model
In this section we describe our implementation of the biological model given in Section 2.2, named AHA - Artificial Hippocampal Algorithm. There are sub-sections for each subfield functional component as well as the AHA-specific training and testing procedures.
For the remainder of the paper we adopt a Functional(Biological) notation to clearly identify the functional instantiations and their biological counterparts. AHA performs the role of the fast-learning STM in Figure 1. Since EC is considered to be a statistical learner, for simplicity it has been bundled into a unitary LTM representing a combined neocortex/EC. LTM comprises a simple Vision Component, VC(EC), suitable for visual processing of image data. It is trained to recognise common, simple visual features of the dataset, which are combined in the evaluation phase on unseen data.

Figure 4 shows the components and connectivity of AHA used to implement the biological model, Figure 2. A detailed version is given in Appendix C.

Figure 4: AHA implementation mapping to biological subfields. [Diagram shows the Vision Component ‘VC’ (sparse convolutional autoencoder) feeding pattern separation ‘PS’ (sparse inhibitory FC ANN), pattern completion ‘PC’ (Hopfield network), pattern retrieval ‘PR’ (FC ANN) and pattern mapping ‘PM’ (FC ANN).] FC-ANN is an abbreviation of Fully-connected Artificial Neural Network. When the output of an ANN layer is used as the supervised target output for the training of another component, dashed arrows are used. All components are shallow and trained with local and immediate credit assignment.

2.4.1 PS(DG) - Pattern Separation. PS(DG) produces the increasingly sparse and orthogonal representations observed in the EC-DG-CA3 pathway. The idealised function of PS(DG) is similar to a hashing function: similar inputs should produce dissimilar outputs. It is achieved with a randomly initialised and fixed single-layer Fully-connected ANN with sparsity constraints and temporal inhibition.

Sparsity is implemented as a ‘top-k’ ranking per sample, mimicking a local competitive process via inhibitory interneurons.
Smaller values for k are enough to produce less overlapping outputs, but orthogonality is further improved by replicating the sparse connectivity observed between DG-CA3 (as discussed in Section 2.2). A portion, υ, of the incoming connections are removed by setting weights to zero (similar to the sparsening technique of [Ahmad and Scheinkman 2019]). Additionally, after a neuron fires (i.e. it is amongst the top-k), it is temporarily inhibited, mimicking the refractory period observed in biological neurons.

The first step of the temporal inhibition is to calculate the weighted sum z_i for each neuron i. The inhibition term ϕ_i is applied with an elementwise multiplication, and then a mask M is calculated for the top-k elements. We use an operator topk(a, b) that returns a ‘1’ for the top b elements in argument a, and ‘0’ otherwise:

    M = topk(ϕ · z, k)

The mask is applied with an elementwise multiplication to select the ‘active’ cells that have a non-zero output:

    y = M · z

Inhibition decays exponentially with the hyperparameter 0 ≤ γ ≤ 1, which determines the inhibition decay rate:

    ϕ_i(t + 1) = ϕ_i(t) · γ

PS(DG) is initialised with (uniformly distributed) random weights and does not undergo any training, similar to an Extreme Learning Machine [Huang et al. 2006]. As mentioned, the idealised function is similar to a hashing function. There have been other explorations of hashing with ANNs, such as chaotic neural networks [Li et al. 2011; Lian et al. 2007]. However, they tend to be complex, multilayered and do not fit our biological plausibility constraints. The PS(DG) version may be useful in other contexts where orthogonality or pseudo-random non-clashing outputs are required.

PS(DG) has 225 units, with a sparsity of 10 active units at a time. This and the other hyperparameters (described above) were chosen empirically to minimise resources and achieve orthogonal outputs on all images in an experimental run (see Section 3).
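The PS(DG) mechanism above can be sketched in a few lines of NumPy. The sizes (225 units, k = 10) follow the paper; the values of υ and γ, and the exact refractory reset after firing (the text specifies only the decay ϕ_i(t+1) = ϕ_i(t)·γ), are illustrative assumptions.

```python
import numpy as np

class PatternSeparator:
    """Sketch of PS(DG): a fixed random projection with top-k sparsening,
    sparse connectivity and temporal inhibition. The upsilon/gamma values
    and the refractory reset are assumptions; sizes follow the paper."""

    def __init__(self, n_in, n_units=225, k=10, upsilon=0.5, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.k, self.gamma = k, gamma
        # Fixed, untrained uniform random weights (ELM-like, no learning).
        self.W = rng.uniform(-1.0, 1.0, size=(n_units, n_in))
        # Remove a portion upsilon of incoming connections (sparse connectivity).
        self.W[rng.random(self.W.shape) < upsilon] = 0.0
        self.refrac = np.zeros(n_units)  # refractory trace; 1 = fully inhibited

    def forward(self, x):
        z = self.W @ x                          # weighted sum z_i
        scores = (1.0 - self.refrac) * z        # inhibition applied elementwise
        M = np.zeros_like(z)
        M[np.argsort(scores)[-self.k:]] = 1.0   # M = topk(phi . z, k)
        self.refrac *= self.gamma               # exponential decay of inhibition
        self.refrac[M > 0] = 1.0                # neurons that just fired are inhibited
        return M * z                            # y = M . z
```

Presenting the same input twice yields different active sets, since the first winners are refractory; repeated similar inputs therefore map to dissimilar, near-orthogonal outputs.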
2.4.2 PC(CA3) - Pattern Completion. PC(CA3) comprises the recurrent auto-associative memory of CA3. PC(CA3) is implemented with a Hopfield network, a biologically-inspired auto-associative content-addressable memory [Hopfield 1982, 1984]. It can store multiple patterns and recall crisp and complete versions of them from partial cues.

PC(CA3)’s role is dedicated to storage and auto-associative recall, and it does not perform pattern recognition. As such, PC(CA3) has the same number of neurons (225) as PS(DG), with a fixed one-to-one connectivity between them, leaving PS(DG) to encapsulate biological connectivity between DG and CA3. Unlike a standard Hopfield network, there are separate pathways for encoding and recall. PS(DG) patterns are encoded by learning the appropriate recurrent weights for a given input. The recall cue is provided via the PR(EC-CA3) network, with a full explanation given in that sub-section below.

We use graded neurons with input magnitude in the range [-1, 1] and a tanh activation function. A gain λ = 2.7 was empirically determined to be effective. A small portion, n = 20, of all neurons are updated at each step (of N = 70 iterations), which speeds up convergence without any practical consequences (i.e. the slower case of 1 neuron per step guarantees convergence). The Pseudoinverse learning rule [Pal et al. 1996] was used to learn the feedback recurrent weights. It was chosen for convenience, in that it increases capacity and is calculated in one time-step. Weights can also be learnt with more biologically realistic alternatives such as the original Hebbian learning or the Storkey learning rule [Storkey 1997]. The inputs to PC(CA3) are conditioned to optimise memorisation and retrieval from the Hopfield network, as detailed in Appendix A.

2.4.3 PR(EC-CA3) - Pattern Retrieval. PR(EC-CA3) models the connectivity between EC and CA3. Its role is to provide a cue to the PC(CA3).
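Before detailing the PR network, the PC(CA3) storage and recall just described can be sketched as follows. The pseudoinverse rule makes every stored pattern a fixed point of the recurrent weights in one step, and recall iterates a tanh update (gain λ = 2.7) over a random subset of n = 20 neurons for N = 70 steps. Random dense bipolar patterns stand in here for the conditioned PS(DG) inputs of Appendix A.

```python
import numpy as np

def store(X):
    """Pseudoinverse learning rule: W = X X^+, so that W x = x exactly for
    every stored pattern x (the columns of X), computed in one step."""
    return X @ np.linalg.pinv(X)

def recall(W, cue, gain=2.7, n_update=20, n_iters=70, seed=0):
    """Graded Hopfield recall: values in [-1, 1], tanh activation with
    gain lambda; a random subset of neurons updates at each step."""
    rng = np.random.default_rng(seed)
    y = cue.astype(float).copy()
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=n_update, replace=False)
        y[idx] = np.tanh(gain * (W @ y)[idx])
    return y

# Store three 225-unit bipolar patterns, then cue with a corrupted copy.
rng = np.random.default_rng(1)
n = 225
X = rng.choice([-1.0, 1.0], size=(n, 3))
W = store(X)
target = X[:, 0]
cue = target.copy()
cue[rng.choice(n, size=20, replace=False)] *= -1.0  # flip 20 of 225 elements
recalled = recall(W, cue)
```

Because W projects onto the span of the stored patterns, the corrupted cue is pulled back into the basin of the nearest attractor; the Hebbian or Storkey rules mentioned in the text could be substituted for `store`. In AHA, the cue comes from PR(EC-CA3), described next.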
It is implemented with a 2-layer Fully-connected Artificial Neural Network (FC-ANN). PC(CA3) stores patterns from PS(DG) and must be able to recall the appropriate pattern given a corresponding VC(EC) input, which is unlikely to be exactly the same as, or an exact subset of, the memorised input (without the use of synthetic data). If PS(DG) functions as intended, then small changes in VC(EC) produce output from PS(DG) that is orthogonal to the memorised pattern. This will not provide a meaningful cue to PC(CA3); hence the role and importance of EC-CA3 connectivity. As explained in Section 2.2, EC-CA3 connectivity constitutes a pattern recognition network that allows exploitation of the overlapping representation in EC, which contains information about underlying concepts.

In effect, the sparsely activated PC(CA3) neurons form a ‘target’ output for a given VC(EC) pattern. To train the pattern retrieval capability, the VC(EC) is taken as input and the PS(DG) output is used as internally generated labels for self-supervised training within the biological plausibility constraints. Once learned, the subsequent PR(EC-CA3) outputs constitute a cue to recall a stored pattern in PC(CA3). In self-supervised learning [Gidaris et al. 2019], prior knowledge about the task is usually used to set a pre-conceived goal such as rotation, with the motivation of learning generalisable representations. In the case of AHA, no prior is required. The motivation is separability. As such, the use of orthogonal patterns as labels is very effective.

PR(EC-CA3) is a 2-layer FC-ANN. Conceptually, the output layer represents the same entity as PC(CA3): it is the same size (225 units) and the values are copied across. 2 layers were chosen to achieve the required complexity for the task within the biological constraints (Section 2.1).
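A minimal sketch of this self-supervised scheme, with manual two-layer gradients (consistent with the shallow-backpropagation constraint of Section 2.1): random dense vectors stand in for VC(EC) features, and random 225-unit top-k binary patterns stand in for the PS(DG) targets. The input and hidden sizes here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-ins: 5 VC(EC)-like feature vectors and their PS(DG) targets
# (225-unit patterns with 10 active bits, matching PC(CA3)).
n_in, n_hid, n_out, k = 64, 256, 225, 10
X = rng.uniform(0.0, 1.0, (5, n_in))
T = np.zeros((5, n_out))
for t in T:
    t[rng.choice(n_out, size=k, replace=False)] = 1.0

W1 = rng.normal(0.0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0.0, 0.1, (n_hid, n_out)); b2 = np.zeros(n_out)

lr, l2 = 0.2, 1e-4
for _ in range(500):  # many iterations on one small batch, as in the paper
    h_pre = X @ W1 + b1
    h = np.where(h_pre > 0, h_pre, 0.01 * h_pre)       # Leaky-ReLU hidden layer
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))           # Sigmoid output layer
    # Multi-label cross-entropy: the output-layer error term is (y - T).
    dy = (y - T) / len(X)
    dh = (dy @ W2.T) * np.where(h_pre > 0, 1.0, 0.01)  # shallow (2-layer) backprop
    W2 -= lr * (h.T @ dy + l2 * W2); b2 -= lr * dy.sum(0)
    W1 -= lr * (X.T @ dh + l2 * W1); b1 -= lr * dh.sum(0)
```

After training, ranking the sigmoid outputs recovers each stored target pattern, giving a usable cue for PC(CA3). The paper's own layer sizes, activations and regularisation are as described in the surrounding text.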
The hidden layer size is significantly larger (1000 units) to achieve the necessary capacity, empirically chosen to optimise test accuracy in the experimental conditions of training for many iterations on one batch (see Section 3). Leaky-ReLU and Sigmoid activation functions are used in the hidden and output layers respectively. L2 regularisation is used to improve generalisation. Learning to retrieve the sparsely active target pattern is posed as multi-label classification with cross-entropy loss.

2.4.4 PM(CA1) - Pattern Mapping. PM(CA1), the CA1 equivalent, maps from the highly orthogonal PC(CA3) representation to the overlapping and distributed observational form. PM(CA1) effectively learns to ground the orthogonal ‘symbolic’ PC(CA3) representation for replay. This emulates the way that, during encoding, activated CA1 neurons form associative connections with the active EC neurons; it is achieved with self-supervised training, within the biological plausibility constraints. A 2-layer FC-ANN is used with the output layer representing EC_out. In this study we trained it to reconstruct the input images rather than the VC(EC) output, making visualising the quality of the output intuitive.

A similar network was used for PR(EC-CA3) to emulate connectivity between biological neurons of EC and CA3. In this case, it is being used to represent the subfield (CA1, a single layer of biological neurons) as well as the connectivity. We postulate that only a very simple network is required due to the relatively limited scope of the experimental task (encoding of 20 characters). In addition, the afferent projections from EC_in have been ignored in our implementation. Re-evaluation of these simplifications may be important for extending the model. Compared to PR(EC-CA3), a significantly smaller capacity is required. The hidden layer is 100 neurons wide; the output layer is constrained to the size of VC(EC).
Both layers use the Leaky-ReLU activation function; training is conducted with an MSE loss function and L2 regularisation. The hyperparameters are chosen empirically based on loss values and visual inspection of the quality of reconstructions.

2.4.5 VC(EC) - Vision Component. The role of VC(EC) is to process high dimensional sensor input and output relatively abstract visual features that can be used as primitives compositionally. A single layer convolutional sparse autoencoder based on [Makhzani and Frey 2013, 2015] provides the required embedding (see Appendix B.1 for details). However, in Omniglot there is a lot of empty background that is encoded with strong hidden layer activity. Lacking an attention mechanism, this detracts from compositionality of foreground features. To suppress encoding of the background, we added an 'Interest Filter' mechanism. The Interest Filter loosely mimics known retinal processing (see Appendix B.3). The retina possesses centre-surround inhibitory and excitatory cells that can be well approximated with a Difference of Gaussians (DoG) kernel [Enroth-Cugell and Robson 1966; McMahon et al. 2004; Young 1987].

2.4.6 Training and Testing. In accord with the Training and Testing Framework (Section 2.3), the hippocampal components PC(CA3), PR(EC-CA3) and PM(CA1) train (encode) and test (retrieve) within one stimulus exposure each. The learning process during that exposure is implemented as N = 60 mini-batches with constant input. During retrieval, the PC(CA3) converges over 70 recursive iterations. PS(DG) does not undergo training. All the components are reset between evaluation steps, meaning the optimizer state (e.g. momentum) and all learned parameters are lost and replaced with re-initialised values. The use of separate train and test phases is modelled after hippocampal encode and retrieval phases governed by theta rhythms (Section 2.2).
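The centre-surround response that the Interest Filter (Section 2.4.5) approximates can be sketched with a generic Difference of Gaussians kernel. The kernel size and sigma values below are illustrative assumptions, not the values used in Appendix B.3:

```python
import numpy as np

def dog_kernel(size=7, sigma_centre=1.0, sigma_surround=2.0):
    """Difference of Gaussians: a narrow excitatory 'centre' Gaussian minus
    a wider inhibitory 'surround' Gaussian, each normalised to unit sum."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    centre = np.exp(-r2 / (2 * sigma_centre ** 2))
    surround = np.exp(-r2 / (2 * sigma_surround ** 2))
    return centre / centre.sum() - surround / surround.sum()

kernel = dog_kernel()
```

Because the kernel sums to approximately zero, convolving it with a uniform region such as Omniglot's empty background yields a near-zero response, while edges and blobs respond strongly, which is the behaviour the Interest Filter exploits.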
Resetting between experiments is consistent with the way other hippocampal studies were conducted (discussed in Section 2.2) and supported by evidence that during retrieval, there is depotentiation in synapses that allows rapid forgetting [Hasselmo et al. 2002].

2.5 Theory of Operation

2.5.1 From Sensor Observations to Symbols. As discussed above for the biological hippocampus, CA3 patterns are highly orthogonal, effectively random encodings of the EC engram, thereby accomplishing pattern separation. They are stable, distinct and statistically unrelated to other, even similar, observations; we therefore hypothesise that they fulfil the role of a symbolic representation. Taken as a whole, the EC-CA3 network maps from observations grounded in perceptual processing to symbols, and CA1 performs the inverse mapping. This characteristic raises the possibility of symbolic reasoning on grounded sensor observations. Recent work [Higgins et al. 2018] shows the possibility of using symbols that represent generative factors, in novel combinations, to "break away from" the training distribution. In the cited work the symbols are 'labels' provided externally. However, biological 'agents' only receive information through sensors, even when it is symbolic in nature, such as spoken or written words. This could be supported by a hippocampal architecture, in which the symbols are self-generated.

2.5.2 Collaboration of PS(DG) and PR(EC-CA3). A core component of operation is the manner in which the pattern separation pathway, PS(DG), and the pattern retrieval pathway, PR(EC-CA3), are unified. PS(DG) sets a target for both PR(EC-CA3) and PC(CA3) to learn, providing a common representational 'space'. This makes it possible to separate encoding and retrieval (to and from PC(CA3)) between the separation and completion pathways respectively.
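The common representational 'space' relies on PS(DG) emitting sparse, near-orthogonal codes. The following toy sketch shows why random sparse codes are near-orthogonal; this is not the PS(DG) mechanism itself (described earlier), and the code generation here is a hypothetical stand-in, though the sizes mirror the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes mirroring the text: 20 stored items, 225-unit codes.
n_items, n_units, k = 20, 225, 10

# Stand-in for PS(DG): give every item an independent k-sparse code,
# no matter how similar the items' observations are.
codes = np.zeros((n_items, n_units))
for c in codes:
    c[rng.choice(n_units, size=k, replace=False)] = 1.0

overlaps = codes @ codes.T                        # pairwise active-unit overlap
off_diag = overlaps[~np.eye(n_items, dtype=bool)]
# Expected overlap between two distinct codes is k*k/n_units (under half an
# active unit here), so distinct items land on nearly orthogonal targets.
```

This near-orthogonality is what makes the codes an effective shared target for both the encoding and retrieval pathways.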
In this way, they do not dilute each other, but operate to their strengths. The fact that PS(DG) provides a strong symbolic-style target for PR(EC-CA3) to learn makes learning that target with self-supervised learning very effective. In addition, orthogonality between samples is preferable for efficacy of the PC(CA3) Hopfield network [Hopfield 1982]. In turn, the high quality orthogonal outputs from PC(CA3) allow PM(CA1) to learn to reconstruct effectively.

2.5.3 Unifying Separation and Completion. Both separation and completion are necessary for the range of experiments reported here. Separation allows storage of a unique form of the memory, and completion recalls the most 'similar' memorised form. It does not matter whether observational variation is caused by occlusion, noise or different exemplars of the same class. Separation and completion are conflicting capabilities requiring separate pathways. Unification is achieved through collaboration of PS(DG) and PR(EC-CA3), described above.

2.5.4 Continual Learning of More Abstract Concepts. The hippocampus' role in replay and consolidation, facilitating continual learning without catastrophic forgetting, is recognised as a valuable ability for intelligent machines [Kumaran et al. 2016], particularly where there are few experiences of new tasks. A hippocampal architecture that functions like AHA may provide an additional aspect of continual learning. As discussed, the STM (hippocampus) receives compositional primitive concepts from LTM (neocortex). New conjunctions of concepts are learnt in STM as composite, more abstract, concepts. Following consolidation to LTM, these new concepts may in turn serve as primitives for subsequent stages of learning. We hypothesise that this could confer an ability to build progressively more abstract concepts hierarchically.

2.6 Baseline STM Model - FastNN

In this section, we describe an alternative STM for comparison with AHA.
The objective is to test AHA against a standard ANN architecture optimised for the same tasks. It must also learn fast given only one exposure, so it is referred to as FastNN. Like AHA, FastNN reconstructs a memorised input based on an internally generated cue, and is therefore a form of auto-associative memory. Again like AHA, the input and output data structures are not identical: VC(EC) provides the input, and the outputs are reconstructed images (this is for ease of analysis rather than biological compatibility). FastNN is a 2-layer Fully-connected ANN. We empirically optimised the learning rate, number and size of hidden layers, regularisation method and activation functions of the model. The resulting model has a hidden layer with 100 units, L2 regularisation and a Leaky-ReLU activation function. The output layer size is constrained to the size of the output image. As FastNN reconstructs an input signal, the first layer acts as an encoder that produces a latent representation, and the second layer as a decoder that maps this back to the image space.

3 METHOD

The experiments test the ability to recognise the input in response to a cue (retrieve the correct memory internally), and to replay it in the form of the originating input. A variety of experimental methods are reported in the literature for other CLS-style models [Greene et al. 2013; Ketz et al. 2013; Norman and O'Reilly 2003; Rolls 1995; Schapiro et al. 2017]. We chose a test that has become standard in the ML literature for one-shot learning. Our intention is to use a method that fits the existing style of tests and that we can use to compare performance to other non-hippocampal state-of-the-art methods. The experiments test the ability to learn from a single exposure and to retrieve appropriate memories (for replay to LTM) in the context of one-shot classification tasks. Performance is assessed with quantitative and qualitative results.
The ability to recognise is measured with classification accuracy, and the quality of end-to-end retrieval of appropriate memories is assessed with recall-loss and through visual inspection of signals as they propagate through the respective architectures. We test the performance of the LTM on its own, which we refer to as the baseline, and compare it to architectures that combine an LTM and STM. The two combined architectures are AHA (LTM+AHA) and FastNN (LTM+FastNN). For all architectures, classification is performed by comparing internal states between memorisation and retrieval patterns. The architectures are not explicitly trained to classify. In the case of AHA, we used internal states at different points in the architecture to show the varied performance of PR(EC-CA3) and PC(CA3). Conventional ablation is not possible because PC(CA3) is fully dependent on the earlier stages: PS(DG) for encoding and PR(EC-CA3) for retrieval. The experiments are based on the one-shot classification task from [Lake et al. 2015], which uses the Omniglot dataset of handwritten characters from numerous alphabets. It was chosen because it has become an accepted benchmark for one-shot classification learning with many published results (reviewed in [Lake et al. 2019]). We extended it to test robustness to more realistic conditions and a wider range of capabilities. The following sections detail the original benchmark, followed by our extended version, conditions common to all tests, and details of how performance was evaluated.

3.1 Omniglot Benchmark

Using the terminology from [Lake et al. 2015], we refer to this task as one-shot classification. It involves matching distinct exemplars of the same character class, without labels, amongst a set of distractors from other character classes. Strong generalisation is required to complete the task, as well as some pattern separation to distinguish similar character classes.
Referring to the train/test framework in Section 2.3, the system is pre-trained with a 'background' set of 30 alphabets. Then, in the core task, using 10 alphabets from a separate 'evaluation' set, a single 'train' character is presented. The task is to identify the matching character from a 'test' set of 20 distinct characters drawn from the same alphabet and written by a different person. This is repeated for all 20 characters to comprise a single 'run'. The experiment consists of 20 runs in total (400 classifications). Accuracy is averaged over the 20 runs. The method used to determine the matching character varies between studies. In this work, we use minimum MSE of a given internal state (the chosen state depends on which model is being assessed). The characters and alphabets were originally selected to maximise difficulty through confusion of similar characters. In addition, the test condition of within-alphabet classification is more challenging than between-alphabet. Some other one-shot classification studies that base their experiments on the Omniglot benchmark used only 5 alphabets, tested between-alphabet classification, and used augmented datasets, making the problem significantly easier [Lake et al. 2019].

3.2 Omniglot Extended Benchmark

Instance learning enables agents to reason about specifics rather than class categories, e.g. an ability to recognise your own specific cup. The episodic hippocampal models referenced in Section 2.2 have traditionally focused on this type of task. Therefore, we extended the Omniglot classification benchmark to include instance learning. We refer to this task as one-shot instance-classification. It requires strong pattern separation, as well as some generalisation for robustness.
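The scoring procedure described above (minimum-MSE matching of internal states, with accuracy computed as the average of per-run averages) can be sketched as follows. The 'internal states' here are synthetic random stand-ins, and the state width and noise level are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_accuracy(train_states, test_states, labels):
    """Match each test state to the train state with minimum MSE and
    score the fraction of correct matches for one run."""
    correct = 0
    for i, t in enumerate(test_states):
        mse = np.mean((train_states - t) ** 2, axis=1)
        if labels[int(np.argmin(mse))] == labels[i]:
            correct += 1
    return correct / len(test_states)

# Toy stand-in for internal states: each run memorises 20 vectors and is
# tested on noisy variants of them (one per class, as in the benchmark).
accuracies = []
for _ in range(20):                                    # 20 runs
    base = rng.normal(size=(20, 50))                   # memorisation states
    test = base + 0.1 * rng.normal(size=(20, 50))      # retrieval states
    accuracies.append(run_accuracy(base, test, labels=np.arange(20)))

overall = float(np.mean(accuracies))                   # average of run averages
```

Note that because the sparse vectors are implemented as dense, sparsely activated arrays, this MSE comparison applies to them directly without any index bookkeeping.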
It is the same as one-shot classification, except that the train character exemplar must be matched with the exact same exemplar amongst a set of 20 distractor exemplars of the same character class. Being the same character class, all the exemplars are very similar, making separation difficult. In each run, the character class and exemplars are selected by randomly sampling without repeats from the 'evaluation' set. In addition, we explored robustness by introducing image corruption to emulate realistic challenges in visual processing that could also apply to other sensory modalities. Noise emulates imperfect sensory processing. For example, in visual recognition, the target object might be dirty or lighting conditions changed. Occlusion emulates incomplete sensing, e.g. due to obstruction by another object. Robust performance is a feature of animal-like learning that would confer practical benefits to machines, and is therefore important to explore [Ahmad and Scheinkman 2019]. Occlusion is achieved with randomly placed circles, completely contained within the image. Noise is introduced by replacing a proportion of the pixels with a random intensity value drawn from a uniform distribution. For both tests, instead of presenting 1 test character at a time, all 20 are presented simultaneously. This is compatible with other hippocampal studies referenced above and requires a short term memory of moderate capacity, which the hippocampal model provides. Accuracy and recall-loss are measured for increasing noise and occlusion from none to almost complete corruption, in 10 increments. The highest level is capped at just below 98% corruption, to ensure some meaningful output. (Accuracy follows the approach in [Lake et al. 2015], calculated as the average of the average of each run. The MSE calculation is straightforward, as sparse vectors are not index based, but are implemented as dense vectors that are sparsely activated.)
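The two corruption types described above can be sketched as follows. The circle placement and uniform intensity sampling follow the description; the 52x52 image size is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def occlude(img, diameter_frac):
    """Zero out a randomly placed circle, fully contained in the image.
    Diameter is a fraction of the image side length."""
    out = img.copy()
    side = img.shape[0]
    r = int(diameter_frac * side / 2)
    if r == 0:
        return out
    cy, cx = rng.integers(r, side - r, size=2)   # centre keeps circle inside
    yy, xx = np.ogrid[:side, :side]
    out[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = 0.0
    return out

def add_noise(img, fraction):
    """Replace a given fraction of pixels with uniform random intensities."""
    out = img.copy().ravel()
    n = int(fraction * out.size)
    idx = rng.choice(out.size, size=n, replace=False)
    out[idx] = rng.uniform(0.0, 1.0, size=n)
    return out.reshape(img.shape)

img = np.ones((52, 52))          # blank stand-in for a character image
occluded = occlude(img, 0.3)
noisy = add_noise(img, 0.3)
```

Both transforms leave the original image untouched and can be applied at the 10 corruption increments described above by varying `diameter_frac` and `fraction`.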
Every test is repeated with 10 random seeds. The full set of experiments is summarised in the table below:

Experiment                         Match                     Distractors                    Primarily testing
One-shot classification            Character class           Exemplars from other classes   Generalisation
One-shot instance-classification   The exemplar of a class   Exemplars of same class        Separation

3.3 Performance Analysis

The accuracy calculation is described in the detail of the Omniglot Benchmark experiment above. 'Recall-loss' is the MSE difference between the original train image and a recalled image in grounded, observational form. Qualitative results are analysed through visual inspection.

4 RESULTS

In this section we present the experimental results, organised primarily by evaluation method: accuracy, recall-loss and qualitative analysis. All experiments were conducted over 10 random seeds. The plots for accuracy and recall-loss show the mean value in bold, with medium shading for one standard deviation, and a lightly shaded portion demarcating minimum and maximum values.

4.1 Accuracy

4.1.1 One-shot classification Task. Accuracy results of the one-shot classification task are shown in Figures 5a and 5b. LTM on its own comprises the baseline. Without image corruption, classification accuracy is 71.6%. Moderate occlusion has a more damaging impact than noise, until very high levels where the character is mostly covered, leading to a plateau (overall an inverse sigmoid profile). Noise affects all features equally and gradually, whereas occlusion increases the likelihood of suddenly removing important topological features. FastNN and AHA follow the same overall trends as LTM. The AHA pattern retrieval network PR(EC-CA3) boosts performance over the baseline significantly, by almost 15% to 86.4% at no noise or occlusion. PR(EC-CA3) has a strong advantage over baseline for all conditions. As extreme occlusion begins to cover most of the character, the accuracy of PR(EC-CA3), PC(CA3) and LTM converge.
However, the advantage is maintained over all noise levels, where there is still a signal at high levels. For all conditions, PC(CA3) classification accuracy is no better than LTM. FastNN improves on the baseline accuracy by 10.3%, 4.4% less than the LTM+AHA improvement. The advantage of LTM+AHA over LTM+FastNN is significant over almost all levels of occlusion. The advantage is minor for all levels of noise. For context, reported accuracy without noise or occlusion is contrasted with other works in Table 1. Existing values are reproduced from Lake et al. [2019].

Figure 5: One-shot classification (a, b) and one-shot instance-classification (c, d) accuracy vs occlusion and noise. LTM+AHA and LTM+FastNN improve performance over the baseline (LTM) for both types of image corruption. LTM+AHA has an advantage over LTM+FastNN for occlusion, minor for noise. Occlusion diameter is expressed as a fraction of image side length, noise as a fraction of image area.

Table 1: Comparison of algorithms for one-shot classification, without image corruption. LTM+AHA is competitive with state-of-the-art convolutional approaches whilst demonstrating a wider range of capabilities.

Algorithm          Citation                Accuracy %
Human              [Lake et al. 2019]      95.5
BPL                [Lake et al. 2019]      96.7
RCN                [George et al. 2017]    92.7
Simple Conv Net    [Lake et al. 2019]      86.5
LTM+AHA            -                       86.4
Prototypical Net   [Lake et al. 2019]      86.3
VHE                [Hewitt et al. 2018]    81.3
LTM+FastNN         -                       81.9

4.1.2 One-shot instance-classification Task. Accuracy results for the one-shot instance-classification task are shown in Figures 5c and 5d. LTM accuracy is perfect at low levels of image corruption, remaining almost perfect in the case of occlusion until approximately one third of the image is affected.
For all models, the same trends are observed as for the one-shot classification task, with some salient features. For AHA, PR(EC-CA3) accuracy remains extremely high, close to 100%, until a 10% greater level of occlusion than for LTM (i.e. addition of AHA increases tolerance to occlusion). The advantage over LTM increases with increasing corruption, fading away for occlusion but continuing to grow for noise. Note also that in the case of one-shot instance-classification, which is the focus of most episodic work in computational models, unlike the one-shot classification experiment, PC(CA3) confers a significant advantage that grows with occlusion until it converges as the task becomes too difficult to achieve at all. A possible explanation is that the cue provided to PC(CA3) is more likely to be closest to the correct attractor state. LTM+FastNN also improves on the baseline. It has worse accuracy than LTM+AHA for a given level of occlusion (less substantial than one-shot classification), and equal accuracy for varying levels of noise.

4.2 Recall

Recall-loss is plotted versus occlusion and noise for all experiments in Figure 6. In the one-shot classification experiment, LTM+AHA demonstrates better performance than LTM+FastNN under moderate occlusion and noise. At higher levels of corruption, LTM+AHA may retrieve a high quality image of the wrong character, which results in a higher loss than the lower-quality images retrieved by LTM+FastNN. In the one-shot instance-classification experiment, this character confusion is less likely to occur, and LTM+AHA is superior or equal to LTM+FastNN under all meaningful levels of image corruption. Both LTM+AHA and LTM+FastNN loss drops at extreme occlusion, which at first may seem counter-intuitive. With even a small portion of the character present in the image, the networks produce a more recognisable character output.
This is more pronounced for AHA, which through the action of PC(CA3) converges on an attractor state. In comparison, when the character is completely occluded, the output is a faint blur consisting of a superposition of all of the learnt images (see figures in Appendix D). In the former case, the loss is the difference between 2 strong characters (crisp in the case of AHA); they differ most of the time at high levels of occlusion. In the latter case, it is the difference between 1 strong character and an effectively blank image, a lower loss even though it never recalls the correct image.

4.3 Qualitative Analysis

Internal signals are shown as they propagate through the components of the STM implementations for typical scenarios. For AHA they are shown in Figures 7 and 8. A legend is given in Table 2. The same is shown for FastNN in Figures 9 and 10. Additional figures showing high and extreme levels of image corruption are included in Appendix D.

Table 2: Legend for Figures 7a to 8c.

Row   Pattern
1     Train samples
2     PS(DG) output
3     Test samples
4     PR(EC-CA3) output
5     PC(CA3) output
6     PM(CA1) output (image reconstruction)

For LTM+AHA, PS(DG) (row 2) produces non-overlapping patterns from the train batch samples, even where the samples are extremely similar (different exemplars of the same class). This shows that pattern separation is working as intended. For test batch samples, PR(EC-CA3) outputs are, in most cases, recognisable retrievals of the correct PS(DG) pattern. However, they are noisy and unevenly attenuated. Given this input, PC(CA3) then converges to a crisp recall of one of the memorised patterns, a basin of attraction in the Hopfield network, but not always the correct one. There was one rare observed exception where PC(CA3) produced a superposition of states. PM(CA1) produced sharp, complete versions using the PC(CA3) pattern, resulting in complete recalled patterns.
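The attractor behaviour attributed to PC(CA3) can be illustrated with a minimal binary Hopfield network using Hebbian outer-product storage. The 225-unit size and 70 recursive iterations mirror Section 2.4; the bipolar patterns, corruption level and synchronous update rule are illustrative assumptions rather than AHA's PC(CA3) configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

n_units, n_patterns = 225, 20
# Store random bipolar patterns with a Hebbian outer-product rule.
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_units))
w = (patterns.T @ patterns) / n_units
np.fill_diagonal(w, 0.0)

def retrieve(cue, iterations=70):
    """Iterate the network state; each step moves toward a basin of attraction."""
    s = cue.copy()
    for _ in range(iterations):
        s = np.sign(w @ s)
        s[s == 0] = 1.0          # break ties deterministically
    return s

# A noisy cue: the first stored pattern with 20% of its bits flipped.
cue = patterns[0].copy()
flip = rng.choice(n_units, size=n_units // 5, replace=False)
cue[flip] *= -1.0

recalled = retrieve(cue)
```

Starting from the corrupted cue, repeated updates descend into the nearest basin of attraction, typically recovering a crisp copy of the stored pattern; with a cue near the boundary between basins, the network can instead settle on the wrong stored pattern, matching the failure mode described above.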
In the presence of occlusion, complete train samples are still recalled (even if they are not the correct sample). In addition, the retrieval is mainly correct even when the occluded portion disrupts the topology of the sample. For example, in the case of occlusion (Figure 7b), compare columns 2, 5, 6, 7, 11, 12, 14, 16, 19 and 20 to the error in column 8. In the case of noise (Figure 8b), see examples in columns 1, 3, 4, 5, 6, 7, 9, 12, 13, 14, 17, 19 and 20, with no errors. Recall tends to fail gracefully, recalling similar samples. For example, in column 8 of Figure 7b, the recalled character has a large circular form with a dot in the middle; it shares many of the feature cues with the correct character. In other failure cases, such as columns 13 and 15 of Figure 7b, occlusion has not damaged the sample, but PR(EC-CA3) appears to have recalled multiple patterns that result in a superposition, or converged to the wrong train sample, respectively. It is not clear from visual inspection of the train and test samples whether ambiguous visual features have contributed to the error, or whether it is due to the representational and generalisational capacity of PR(EC-CA3). For LTM+FastNN, the recalled images are blurred with substantial ghosting from other character forms for one-shot classification. For one-shot instance-classification, the same phenomenon is observed, but to a far lesser extent, and the characters are relatively clear.

5 DISCUSSION

This section begins with a general discussion of the results, followed by sub-sections on particular themes. The results demonstrate that AHA performs replay with separation and completion in accordance with the expected behaviour of the hippocampus [Kandel et al. 1991; O'Reilly et al. 2014; Rolls 2018]. AHA performs well at distinguishing similar inputs, and unlike previous CLS hippocampal models, also performs well given large observational variation, and data that originates from grounded sensory input.
The variation can result from seeing different exemplars of the learnt class (e.g. I see a fork for the first time and now I can recognise other forks) or from corruption (e.g. looking for the same fork, but it is dirty or partially occluded). The former condition is more difficult to test without the use of grounded sensor input and modelling neurons at a higher level of abstraction. The emphasis in AHA on the EC-CA3 network and its interplay with DG-CA3 is likely to play a significant role in this capability (see Section 2.5.2). The use of a complementary STM (LTM+STM) outperforms the baseline (LTM) across all performance measures and tasks. Our results validate that this complementary architecture can perform a wider range of hippocampal functions. The AHA-STM model outperforms the FastNN-STM. The advantage in accuracy is most significant for generalisation and in the presence of occlusion. AHA-STM recall outputs are cleaner and more specific, even if they are incorrect, up to very high levels of image corruption, especially where generalisation is required. We hypothesise that this type of recall is significantly more useful for the purpose of consolidating few-shot exposures into a complementary LTM, analogous to the biological complementary learning systems. The advantages of AHA over FastNN are justification for the use of a more complex heterogeneous architecture. Given that PR(EC-CA3) makes a crucial contribution to AHA's accuracy (see Section 5.2 below), perhaps it is not surprising that FastNN, which is also a 2-layer FC-ANN trained in a similar manner (albeit of different size and configuration), can do almost as well. However, the interaction of pathways in AHA provides an accuracy boost, and most importantly allows the crisp, opinionated recall. LTM+AHA is competitive with conventional ANN techniques on the standard Omniglot Benchmark conditions (which measure accuracy on the one-shot classification task without corruption).
Figure 6: Recall-loss vs occlusion and noise for one-shot classification (a, b) and one-shot instance-classification (c, d). The two STM models follow similar trends. AHA is superior given moderate to high image corruption, where it sometimes retrieves an accurate copy of the wrong character. Occlusion diameter is expressed as a fraction of image side length, noise as a fraction of image area.

Referring to Table 1, BPL and RCN are significantly ahead of other methods, and are similar to human performance. This is expected for BPL, as it exploits domain-specific prior knowledge about handwriting via stroke formation. RCN, by virtue of a design modelled on the visual cortex, is also specialised for this type of visual task. It is less clear how it could be applied to other datasets and problems where contours are less distinct or less relevant. The Simple Conv Net (CNN) represents a standard approach for deep learning. AHA is equally good despite using biologically plausible local credit assignment, fewer computational resources, and no labels. Additionally, AHA demonstrates a broader range of capabilities, such as pattern separation and replay. Overall, the results demonstrate that AHA has the potential to augment conventional ML models to provide an 'episodic' one-shot learning capability as tested in this study, but even more exciting is the prospect of using the replay capability to integrate that knowledge into the model as long-term knowledge.

5.1 Advantage of Using a Complementary STM

Due to the role that LTM plays, providing primitive concepts, it is only able to complete primitives rather than the 'episode', or conjunction of primitives.
In addition, it does not have a memory for multiple images, so there is no way for it to recognise a specific example, limiting accuracy and making it unable to recall.

5.2 Importance of PR(EC-CA3) and PC(CA3)

The accuracy results demonstrate that PR(EC-CA3) performs classification significantly better than PC(CA3). This is somewhat surprising, given the conventional view that CA3 performs the bulk of pattern completion [Ketz et al. 2013; McClelland et al. 1995; Norman and O'Reilly 2003; Schapiro et al. 2017]. As PR(EC-CA3) learns to associate overlapping representations (in VC(EC)) to complete target patterns (from PS(DG)), PR(EC-CA3) would produce complete patterns as outputs. Visual inspection shows that it does produce largely complete but noisy patterns. It is further evidence that PR(EC-CA3) is well suited to be primarily responsible for recall from CA3 [Rolls 2013]. PC(CA3) does fulfil a vital role of additional completion and sharpening, so that the pattern can be effectively reconstructed by PM(CA1) for reinstatement of the original cortical representation. It is possible that previous computational studies [Greene et al. 2013; Ketz et al. 2013; Norman and O'Reilly 2003; Schapiro et al. 2017] did not encounter this discrepancy because experiments focused on a narrower problem set of pattern separation, in which AHA's PC(CA3) performed much closer to PR(EC-CA3). There are other factors to consider, such as the division of EC into discrete receptive fields and the capacity difference between the PR(EC-CA3) and PC(CA3) equivalent networks [Rolls 2013]. However, the importance of PR(EC-CA3) in AHA suggests that the connectivity of EC-CA3 is more important than previously acknowledged.

Figure 7: One-shot classification test patterns as they propagate through AHA: (a) no occlusion or noise; (b) with occlusion (circle, diameter=0.3); (c) with noise (fraction=0.3). See Table 2 for legend.
Note that a small and less significant source of accuracy bias toward PR(EC-CA3) occurs in the cases where PR(EC-CA3) outputs a superposition of possible patterns, enhancing the chance of a correct match via MSE. In contrast, PC(CA3) is designed to retrieve a single, sharp, complete sample, and in doing so is unable to hedge its bets.

5.3 Learning is Task Independent

The episodic representations in AHA can be used for different tasks. For the one-shot instance-classification test, the symbol generalises over versions of the same exemplar (subject to noise and occlusion). This is the standard definition of episodic learning in previous studies [Greene et al. 2013; Ketz et al. 2013; Norman and O'Reilly 2003; Rolls 2013; Schapiro et al. 2017]. For the one-shot classification test, the symbol learnt generalises over multiple exemplars of the same class (additionally subject to noise and occlusion). This is the standard definition of generalisation in classification. In both cases, the symbol represents a conjunction of visual primitives from the same set, and competence at both tasks is accomplished by unification of separation and completion. In reality, the boundary between what is a class and what is an exemplar is continuous, subjective, and in many circumstances depends on the task. For example, you could define the character itself as a class, and the corrupted samples as exemplars. Another example is a Labrador dog: the class could be the animal type, dog (another exemplar would be cat), or the breed, Labrador (another exemplar would be Poodle). AHA demonstrates this flexibility to the task by accomplishing both one-shot classification and one-shot instance-classification. It is a characteristic we can expect from the hippocampus through observation of everyday animal behaviour.

Figure 8: One-shot instance-classification test patterns as they propagate through AHA: (a) no occlusion or noise; (b) with occlusion (fraction=0.3); (c) with noise (fraction=0.3). See Table 2 for legend.

Figure 9: One-shot classification test patterns as they propagate through FastNN: (a) no occlusion or noise; (b) with occlusion (fraction=0.3); (c) with noise (fraction=0.3). Row 1: Train images, Row 2: Test images, Row 3: PM output, the image reconstruction.

5.4 Learning is Hierarchical

According to our principle of operation, the flexibility explained above should extend to any level of abstraction, i.e. from learning the particular details of a spoken sentence to the simple fact that you had a conversation. AHA learns an episode, a conjunction of primitives, and then generalises over variations in that combination. The meaning of the episode depends on the level of abstraction of the primitives. This ties into the discussion about iteratively building more abstract concepts in Theory of Operation (Section 2.5). One possibility to exert control over the level of detail encoded would be an attentional mechanism, mediating the selection of primitives.

Figure 10: One-shot instance-classification test patterns as they propagate through FastNN: (a) no occlusion or noise; (b) with occlusion (fraction=0.3); (c) with noise (fraction=0.3). Row 1 shows Train images, Row 2 Test images, and Row 3 the PM output, the VC reconstruction.

6 CONCLUSIONS

This paper presented AHA, a novel computational model of the hippocampal region and subfields (functional and anatomical sub-regions). AHA uses biologically-plausible, local learning rules without external labels. AHA performs fast learning, separation and completion functions within a unified representation. The symbolic representations generated within AHA can be grounded, i.e. mapped back to the original input.
We describe how this architecture complements the incremental statistical learning of popular ML methods. AHA could extend their abilities to more animal-like learning, such as one-shot learning of a conjunction of primitives (an episode). This could enable ML to perform more sample-efficient learning, and to reason about specific instances.

The system was tested on visual recognition tasks featuring grounded sensor data. We posed a new benchmark based on the visual one-shot classification task in Lake et al. [2015]. An additional one-shot instance-classification test was introduced, testing the ability to reason about specific instances. We also added image corruption with noise and occlusion, and all experiments were repeated several times to evaluate consistency. The results show that the subfields' functionality matches biological observations, and demonstrate a range of capabilities. AHA can memorise a set of samples, learn in one shot, perform classification requiring generalisation, and identify specific instances (reason about specifics) in the face of occlusion or noise. It can accurately reconstruct the original input samples, which can be used for consolidation of long-term memories to influence future perception. AHA one-shot classification accuracy is comparable to existing ML methods that do not exploit domain-specific knowledge about handwriting.

The experiments expanded the scope of previous biological computational model studies, shedding light on the role and interplay of the subfields and aiding in understanding the functionality of the hippocampal region as a whole.

7 FUTURE WORK

In future work, we will explore two ways that AHA could augment incremental learning ML models. Firstly, the use of grounded non-symbolic VC(EC) reconstructions from PM(CA1) to selectively consolidate memories so that they can affect future perception.
Secondly, AHA could directly and immediately augment slow-learning ML models by interpolating rapidly learned classifications or predictions from AHA with those of the incremental learning model. This approach is compatible with a wide variety of models and would make them more responsive to rapidly changing data, or to settings where fewer labelled samples are available. We would also like to investigate how these representations can be fed back through AHA in 'big-loop' recurrence to learn statistical regularities across episodes (see Section 2.2) and to resolve ambiguous inputs.

Acknowledgments

A big thank-you to Elkhonon Goldberg for enriching discussions on the hippocampal region, its relationship to the neocortex and their role in memory. We also greatly appreciate the insight that Rotem Aharon provided in analysing and improving the dynamics of Hopfield networks. Matplotlib [Hunter 2007] was used to generate Figures 7a to 8c and Figure 11.

Author Contributions

GK and DR devised the concept and experiments. GK, DR and AA wrote the code. GK, DR and AA executed the experiments. GK, DR and AA wrote the paper.

REFERENCES

Ahmad, S. and Scheinkman, L. (2019). How Can We Be So Dense? The Benefits of Using Highly Sparse Representations. arXiv preprint.

Berthelot, D., Carlini, N., Goodfellow, I., Oliver, A., Papernot, N., and Raffel, C. (2019). MixMatch: A Holistic Approach to Semi-Supervised Learning. In Advances in Neural Information Processing Systems, pages 5049–5059.

Buckner, R. L. (2010). The Role of the Hippocampus in Prediction and Imagination. Annual Review of Psychology, 61(1):27–48.

Enroth-Cugell, C. and Robson, J. G. (1966). The contrast sensitivity of retinal ganglion cells of the cat. The Journal of Physiology, 187(3):517–552.

George, D., Lehrach, W., Kansky, K., Lazaro-Gredilla, M., Laan, C., Marthi, B., Lou, X., Meng, Z., and Liu, Y. (2017). A Generative Vision Model that Trains with High Data Efficiency and Breaks Text-based CAPTCHAs.
Science, pages 1–19.

Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., and Cord, M. (2019). Boosting Few-Shot Visual Learning with Self-Supervision. In Proceedings of the IEEE International Conference on Computer Vision.

Gluck, M. A., Meeter, M., and Myers, C. E. (2003). Computational models of the hippocampal region: Linking incremental learning and episodic memory. Trends in Cognitive Sciences, 7(6):269–276.

Goldberg, E. (2009). The New Executive Brain: Frontal Lobes in a Complex World. Oxford University Press.

Greene, P., Howard, M., Bhattacharyya, R., and Fellous, J. M. (2013). Hippocampal anatomy supports the use of context in object recognition: A computational model. Computational Intelligence and Neuroscience, 2013(May).

Guerguiev, J., Lillicrap, T. P., and Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6.

Hasselmo, M. E., Bodelón, C., and Wyble, B. P. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Computation, 14(4):793–817.

Hewitt, L. B., Nye, M. I., Gane, A., Jaakkola, T., and Tenenbaum, J. B. (2018). The Variational Homoencoder: Learning to learn high capacity generative models from few examples. arXiv preprint.

Higgins, I., Sonnerat, N., Matthey, L., Pal, A., Burgess, C. P., Bosnjak, M., Shanahan, M., Botvinick, M., Hassabis, D., and Lerchner, A. (2018). SCAN: Learning Hierarchical Compositional Visual Concepts. arXiv preprint, pages 1–24.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8).

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092.

Huang, G. B., Zhu, Q. Y., and Siew, C. K. (2006). Extreme learning machine: Theory and applications.
Neurocomputing, 70(1-3):489–501.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering.

Kandel, E. R., Schwartz, J. H., and Jessell, T. M. (1991). Principles of Neural Science. Elsevier.

Ketz, N., Morkonda, S. G., and O'Reilly, R. C. (2013). Theta Coordinated Error-Driven Learning in the Hippocampus. PLoS Computational Biology, 9(6).

Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521–3526.

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese Neural Networks for One-shot Image Recognition. In Proceedings of the 32nd International Conference on Machine Learning.

Koster, R., Chadwick, M. J., Chen, Y., Berron, D., Banino, A., Düzel, E., Hassabis, D., and Kumaran, D. (2018). Big-Loop Recurrence within the Hippocampal System Supports Integration of Information across Episodes. Neuron, 99(6):1342–1354.e6.

Kumaran, D., Hassabis, D., and McClelland, J. L. (2016). What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences, 20(7):512–534.

Kurth-Nelson, Z., Economides, M., Dolan, R. J., and Dayan, P. (2016). Fast Sequences of Non-spatial State Representations in Humans. Neuron, 91(1):194–204.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2019). The Omniglot challenge: a 3-year progress report. Current Opinion in Behavioral Sciences, 29:97–104.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40:1–58.

Li, F.-F., Fergus, R., and Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision.

Li, F.-F., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Li, Y., Deng, S., and Xiao, D. (2011). A novel Hash algorithm construction based on chaotic neural network. Neural Computing and Applications, 20(1):133–141.

Lian, S., Sun, J., and Wang, Z. (2007). One-way Hash Function Based on Neural Network. arXiv preprint arXiv:0707.4032.

Lisman, J. and Redish, A. D. (2009). Prediction, sequences and the hippocampus. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1193–1201.

Makhzani, A. and Frey, B. (2013). k-Sparse Autoencoders. arXiv preprint arXiv:1312.5663.

Makhzani, A. and Frey, B. J. (2015). Winner-Take-All Autoencoders. In Advances in Neural Information Processing Systems, pages 2791–2799.

McClelland, J. L., McNaughton, B. L., and O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457.

McMahon, M. J., Packer, O. S., and Dacey, D. M. (2004). The classical receptive field surround of primate parasol ganglion cells is mediated primarily by a non-GABAergic pathway. Journal of Neuroscience, 24(15):3736–3745.

Mok, R. M. and Love, B. C. (2019). A non-spatial account of place and grid cells based on clustering models of concept learning. Nature Communications, 10(1):1–9.

Moser, E. I., Kropff, E., and Moser, M.-B. (2008).
Place cells, grid cells, and the brain's spatial representation system. Annual Review of Neuroscience, 31:69–89.

Norman, K. A. and O'Reilly, R. C. (2003). Modeling Hippocampal and Neocortical Contributions to Recognition Memory: A Complementary-Learning-Systems Approach. Psychological Review, 110(4):611–646.

O'Reilly, R. C. (1996). Biologically Plausible Error-Driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm. Neural Computation, 8(5):895–938.

O'Reilly, R. C., Bhattacharyya, R., Howard, M. D., and Ketz, N. (2014). Complementary learning systems. Cognitive Science, 38(6):1229–1248.

Pal, C., Hagiwara, I., Kayaba, N., and Morishita, S. (1996). A Learning Method for Neural Networks Based on a Pseudoinverse Technique. Shock and Vibration.

Rawlinson, D., Ahmed, A., and Kowadlo, G. (2019). Learning distant cause and effect using only local and immediate credit assignment. arXiv preprint.

Rolls, E. T. (1995). A model of the operation of the hippocampus and entorhinal cortex in memory. International Journal of Neural Systems, 6.

Rolls, E. T. (2013). The mechanisms for pattern completion and pattern separation in the hippocampus. Frontiers in Systems Neuroscience, 7(October):1–21.

Rolls, E. T. (2018). The storage and recall of memories in the hippocampo-cortical system. Cell and Tissue Research, 373(3):577–604.

Schapiro, A. C., Turk-Browne, N. B., Botvinick, M. M., and Norman, K. A. (2017). Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160049.

Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems, pages 4077–4087.

Sodhani, S., Chandar, S., and Bengio, Y. (2020). Toward training recurrent neural networks for lifelong learning.
Neural Computation, 32(1):1–35.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. arXiv preprint.

Squire, L. R. (1992). Declarative and nondeclarative memory: Multiple brain systems supporting learning and memory.

Stachenfeld, K. L., Botvinick, M. M., and Gershman, S. J. (2017). The hippocampus as a predictive map. Nature Neuroscience, 20(11):1643–1653.

Storkey, A. (1997). Increasing the capacity of a Hopfield network without sacrificing functionality. In Artificial Neural Networks - ICANN'97.

Tulving, E. (1985). Elements of Episodic Memory. Oxford University Press.

Tzilivaki, A., Kastellakis, G., and Poirazi, P. (2019). Challenging the point neuron dogma: FS basket cells as 2-stage nonlinear integrators. Nature Communications, 10(1):3664.

Vinyals, O., Blundell, C., Lillicrap, T., and Kavukcuoglu, K. (2016). Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems.

Young, R. A. (1987). The Gaussian derivative model for spatial vision: I. Retinal mechanisms. Spatial Vision, 2(4):273–293.

A PC(CA3) INPUT SIGNAL CONDITIONING

A.1 PS(DG)-PC(CA3) Memorisation Input

The PS(DG) output X' is conditioned for memorisation into the Hopfield network, which benefits from binary vectors in the range [-1, 1]. The conditioning function is:

X' = 2 sgn(X) - 1

It is implied that X must be unit range.

A.2 PR(EC-CA3)-PC(CA3) Retrieval Input

The PR(EC-CA3) output Y' is optimised for the classification task. To present a valid cue for Hopfield convergence, additional conditioning is required. First, a norm is applied per sample in a batch of size K, with a gain term γ = 10:

Z_i = γ · Y_i · (1 / Σ_{j=1}^{K} Y_j)

Next, the range is transformed to the Hopfield operating range of [-1, 1], and finally an offset Θ is applied per sample.
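As a concrete illustration, the conditioning steps above can be sketched in plain Python. This is a minimal sketch, not the original implementation: the axis of the normalising sum (over the K batch samples, rather than within each sample) is our reading of the indexing, and the per-sample offset Θ is taken as a given scalar.

```python
def condition_ps_output(x):
    """Binarise a PS(DG) output vector for Hopfield memorisation:
    X' = 2*sgn(X) - 1. Unit-range activations map to {-1, +1};
    inactive units (X = 0) map to -1."""
    return [2.0 * (1.0 if v > 0 else 0.0) - 1.0 for v in x]

def normalise_pr_output(batch, gamma=10.0):
    """Normalise PR(EC-CA3) outputs with gain gamma:
    Z_i = gamma * Y_i / sum_j Y_j. The normalising sum is taken over
    the K samples in the batch (our reading of the paper's indexing;
    summing within each sample is the other possible reading)."""
    n_units = len(batch[0])
    totals = [sum(sample[u] for sample in batch) for u in range(n_units)]
    return [[gamma * sample[u] / totals[u] for u in range(n_units)]
            for sample in batch]

def condition_pr_cue(z, theta):
    """Transform to the Hopfield operating range and apply the
    per-sample offset: Y' = (2Z - 1) + theta."""
    return [2.0 * v - 1.0 + theta for v in z]
```

A corrupted-but-positive PS(DG) activation thus always lands on +1 before storage, while the retrieval cue is rescaled and shifted before being presented to PC(CA3).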
Y' = (2Z - 1) + Θ

Intuitively, Θ shifts the output distribution to straddle zero, with at least k bits > 0, where k is the fixed sparsity of the stored patterns from PS(DG). Given that the cue is in the range [-1, 1], anything negative acts as inhibition, so the balance is very important. The inputs are relatively sparse, dominated by background (negative) elements. To allow some elements to reach an active status before the many inhibitory elements dominate, it is necessary to initialise the distribution as described.

B VC(EC)

B.1 Sparse Convolutional Autoencoder

Our sparse convolutional autoencoder is based on the winner-take-all autoencoder [Makhzani and Frey 2015]. To select the top-k active cells per mini-batch sample, we use a convolutional version of the original rule from [Makhzani and Frey 2013]. The top-k cells are selected independently at each convolutional position by competition over all the filters. In addition, a lifetime sparsity rule is used to ensure that each filter is trained at least once per mini-batch (i.e. a lifetime of 1 sample per mini-batch). We found that a single autoencoder layer with tied weights was sufficient for the Omniglot character encoding. However, additional layers could have been trained with local losses without violating our biological plausibility rules. To reduce the dimensionality of the VC(EC) output, we applied max-pooling to the convolutional output.

B.2 Pre-training

Pre-training of the sparse convolutional autoencoder develops filters that detect a set of primitive visual concepts consisting of straight and curved edges, sometimes with junctions (Figure 11).

B.3 Interest Filter

As shown in Figure 12, positive and negative DoG filters are used to enhance positive and negative intensity transitions. The filter output is subject to local non-maxima suppression to merge nearby features, and a 'top-k' function creates a mask of the most significant features globally.
Positive and negative masks are combined by summation, giving a final 'Interest Filter' mask that is applied to all channels of the convolutional output volume. A smoothing operation is then applied to provide some tolerance to feature location. There is a final max-pooling stage to reduce dimensionality. The non-maxima suppression and smoothing are achieved by convolving Gaussian kernels with the input. Parameters are given in Table 3.

Figure 11: VC(EC) sparse convolutional autoencoder filters.

Figure 12: Architecture of the VC including 'Interest Filter'. The VC is a single-layer sparse convolutional autoencoder with masking to reduce the magnitude of background features. The image is encoded by the sparse convolutional autoencoder into visual features; in parallel, positive and negative DoG filters are each followed by non-maxima suppression and a top-k stage, the results summed into an interest filter mask, smoothed, and applied as a mask to the features.

C SYSTEM DESIGN

The full architecture of AHA is shown in Figure 13. The hyperparameters used in our experiments are shown in Table 3. We used the Adam optimizer [Kingma and Welling 2013] in all experiments.

D HIGH AND EXTREME IMAGE CORRUPTION

High and extreme levels of image corruption are shown in Figures 14 to 17.
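For illustration, the global 'top-k' masking and mask combination of the Interest Filter (Appendix B.3) can be sketched as follows. This is a minimal sketch only: the DoG filtering, non-maxima suppression and smoothing stages are omitted, and clipping the summed mask to [0, 1] is our assumption rather than a stated detail.

```python
import heapq

def top_k_mask(response, k):
    """Binary mask keeping the k globally most significant responses.
    `response` is a flat list of filter responses for one image.
    Ties at the k-th value may keep slightly more than k features."""
    if k >= len(response):
        return [1.0] * len(response)
    threshold = heapq.nlargest(k, response)[-1]  # k-th largest value
    return [1.0 if r >= threshold else 0.0 for r in response]

def interest_filter_mask(pos_response, neg_response, k):
    """Combine positive and negative DoG response masks by summation
    (as in the paper), then clip to [0, 1] (our assumption) to give
    the final mask applied to every channel of the conv output."""
    pos = top_k_mask(pos_response, k)
    neg = top_k_mask(neg_response, k)
    return [min(1.0, p + n) for p, n in zip(pos, neg)]
```

A feature passes the mask if it is among the k strongest responses of either polarity, which matches the intent of suppressing background features while keeping salient transitions.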
PS(DG) - Pattern Separation with Inhibitory Sparse Autoencoder
  k (sparsity)                    10
  h (number of units)             225
  γ (inhibition decay rate)       0.95
  υ (weight knockout rate)        0.25

PC(CA3) - Pattern Completion with Hopfield Network
  λ (gain)                        2.7
  n (cells updated per step)      20
  N (iterations)                  70
  h (number of units)             225

PR(EC-CA3) - Pattern Retrieval with Fully-connected ANN
  η (learning rate)               0.01
  h (number of hidden units)      1000
  o (number of output units)      225
  λ (L2 regularisation)           0.000025

PR(EC-CA3) - Signal Conditioning
  γ (gain)                        10

PM(CA1) - Pattern Mapping with Fully-connected ANN
  η (learning rate)               0.01
  h (number of hidden units)      100
  o (number of output units)      100
  λ (L2 regularisation)           0.0004

VC(EC) - Vision Component: Sparse Convolutional Autoencoder ((·) = pre-training)
  η (learning rate)               0.001
  k (sparsity)                    (1), 4
  f (number of filters)           121
  fw (filter width)               10
  fh (filter height)              10
  fs (filter stride)              (5), 1
  Batches (pre-training)          2000
  Batch size (pre-training)       128

VC(EC) - Vision Component: Interest Filter
  DoG kernel size                 7
  DoG kernel std                  0.82
  DoG kernel k                    1.6
  Non-maxima suppression size     5
  Non-maxima suppression stride   1
  Smoothing size                  15
  Smoothing encoding std          2.375
  k (number of features)          20

VC(EC) - Vision Component
  Resize                          0.5
  Max-pooling size                4
  Max-pooling stride              4

FastNN - Baseline STM
  η (learning rate)               0.01
  h (number of hidden units)      100
  λ (L2 regularisation)           0.00004

Table 3: Hyperparameter values for reported experiments.
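To make the PC(CA3) entries of Table 3 concrete, the following toy sketch runs a Hopfield-style memorise/recall loop with the tabulated gain (λ = 2.7), cells updated per step (n = 20) and iteration count (N = 70). It is an illustrative sketch under standard Hopfield assumptions (Hebbian outer-product storage, asynchronous tanh updates), not the AHA implementation.

```python
import math
import random

def memorise(patterns):
    """Hebbian outer-product storage of {-1, +1} patterns
    (zero self-connections), averaged over the stored patterns."""
    h = len(patterns[0])
    w = [[0.0] * h for _ in range(h)]
    for p in patterns:
        for i in range(h):
            for j in range(h):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, cue, gain=2.7, cells_per_step=20, iterations=70, seed=0):
    """Asynchronous recall: each iteration updates a random subset of
    cells with a tanh(gain * field) activation, using the gain,
    cells-per-step and iteration values from Table 3."""
    rng = random.Random(seed)
    x = list(cue)
    h = len(x)
    for _ in range(iterations):
        for i in rng.sample(range(h), min(cells_per_step, h)):
            x[i] = math.tanh(gain * sum(w[i][j] * x[j] for j in range(h)))
    return [1.0 if v > 0 else -1.0 for v in x]
```

With a single stored pattern, a cue with one flipped unit converges back to the stored pattern, which is the pattern-completion behaviour PC(CA3) provides in AHA.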
Figure 13: Architecture details (the hyperparameters shown are for the experiments after pre-training the VC). Omniglot characters (105x105) are resized to 52x52 and encoded by VC(EC), a k-sparse convolutional autoencoder (output 52x52x121 = 327,184 units; 121 filters of width 10x10, stride 1, sparsity 4) with an Interest Filter, followed by max-pooling (window 4x4, stride 4x4; output 13x13x121 = 20,449 units). PS(DG) is a sparse inhibitory fully-connected ANN (225 units, sparsity 10). PR(EC-CA3) is a fully-connected ANN with cross-entropy loss (800 leaky-ReLU hidden units, 225 sigmoid output units), followed by PR signal conditioning. PC(CA3) is a Hopfield network (225 units, tanh activation) with memorise and recall paths, fed via PS signal conditioning. PM(CA1) is a fully-connected ANN with MSE loss (100 leaky-ReLU hidden units, 2,704 = 52x52 output units) producing the reconstruction as target output.

Figure 14: AHA at a high level of image corruption (fraction = 0.6). (a) One-shot classification with Occlusion. (b) One-shot classification with Noise. (c) One-shot instance-classification with Occlusion. (d) One-shot instance-classification with Noise.

Figure 15: FastNN at a high level of image corruption (fraction = 0.6). (a) One-shot classification with Occlusion. (b) One-shot classification with Noise. (c) One-shot instance-classification with Occlusion. (d) One-shot instance-classification with Noise.

Figure 16: AHA at an extreme level of image corruption (fraction = 0.9). (a) One-shot classification with Occlusion. (b) One-shot classification with Noise. (c) One-shot instance-classification with Occlusion. (d) One-shot instance-classification with Noise.
Figure 17: FastNN at an extreme level of image corruption (fraction = 0.9). (a) One-shot classification with Occlusion. (b) One-shot classification with Noise. (c) One-shot instance-classification with Occlusion. (d) One-shot instance-classification with Noise.