Learning is about Intelligent Forgetting

Colin de la Higuera, University of Nantes

May 26, 2016

Abstract

Learning is a process in which a human or a machine acquires new knowledge that can be used to perform better on further tasks. Whereas the dominant belief is that failure in learning can be measured by how much is forgotten, it can also be argued that forgetting is an essential component of human learning, necessary to generalize, to learn new concepts and to build new knowledge. We discuss, from the point of view of Machine Learning, how learning incurs forgetting, and how what is forgotten differs with the methods and algorithms: sometimes the data, other times some attributes, often the parameters. In other terms, forgetting is not a problem for learning but rather a driving force.

1 Introduction

In his short story Funes el Memorioso, Jorge Luis Borges [Bor44] tells the story of Funes, whose curse was that he couldn't forget anything. Instead of focusing on the advantages Funes might gain from his ability to remember everything, Borges discusses how difficult Funes' life could be, and how impaired this makes him for learning new things; in Borges' words, Funes is incapable of creating concepts, of generalizing. Funes will remember each leaf and each tree separately, not just one from another but the same leaf from itself a moment before. He can remember a whole day in detail, even if it takes him another whole day to visit his memories of that day.

Sospecho, sin embargo, que no era capaz de pensar. Pensar es olvidar diferencias, es generalizar, abstraer. En el abarrotado mundo de Funes no había sino detalles, casi inmediatos.

I suspect nevertheless that he was not capable of thinking. To think is to forget differences, is to generalize, to abstract. In Funes' world there were only details, nearly immediate.

Psychology's main point of view concerning the relationship between learning and forgetting seems to have been to oppose these concepts; forgetting is often used to measure the quality of learning. If things are forgotten rapidly, surely this is due to incorrect learning? If a teacher's pupils fail to remember their lesson a few days later, isn't this proof that they were taught incorrectly? Yet here too, alternative voices argue that in order to learn new concepts we must be able to forget some old ones, and that if we remembered everything, we would be as ill off as if we remembered nothing [Jam90].

Machine Learning is a field of research shared between a number of disciplines, concerned with making a machine learn in order to improve itself. The theme is of great importance today, with more and more applications due to the increase in the volumes of data, a growing presence in courses [Fla11], and several competing paradigms: statistical relational learning [GT07], Bayesian learning [Gha15] or deep learning [LBH15], for instance.

There are nevertheless large differences between the goals as expressed initially by Turing [Tur50] and what is really being attempted today. Turing's manifesto in favour of Learning Machines has drifted into what is now called Machine Learning. Among the questions relating the two, the recent victory of AlphaGo over Lee Sedol [Gib16] at the game of Go shows the importance of the machine's capacity to train (and learn) against itself. This improvement technique has been used in the past by important chess champions (for example during wars) and is the theme of Stefan Zweig's Schachnovelle [Zwe41], whose character becomes more or less voluntarily schizophrenic in order to be able to play chess against himself. A machine obviously does not face this obstacle and therefore has the advantage of not needing a teacher.

For many, Machine Learning is essentially an optimisation problem: you are given some data and are to build a model fitting closely to this data (this is where the optimisation comes in), which can then be used for prediction, classification or recommendation. It differs from data management by the fact that keeping the data, even in a very efficient way, would only lead to the same problem Funes has to face. It differs from compression in that we are encouraged to lose some data or information on the way. In fact, we argue that it is the quality of this loss which defines the quality of the learning process. In other terms, it is the capacity to forget which determines the capacity to learn.

But if we focus on what is lost during the compression of the data into knowledge, are we just shifting the point of view? Or is there something more to it than a simple subtraction? To understand this, let's start by examining definitions of forgetting one can find on the internet. Wikipedia (https://en.wikipedia.org/wiki/Forgetting) tells us that

Forgetting is the apparent loss or modification of information already encoded and stored in an individual's long term memory. It is a spontaneous or gradual process in which old memories are unable to be recalled from memory storage.

An online psychology encyclopedia (http://www.simplypsychology.org/forgetting.html) tells us:


Why do we forget? There are two simple answers to this question. First, the memory has disappeared: it is no longer available. Second, the memory is still stored in the memory system but, for some reason, it cannot be retrieved.

From these encyclopedic definitions it appears that forgetting obeys two different patterns. In the first, the information is lost. Short term memory is of limited size, and processing from short term memory to long term memory takes time. If we don't have the time, the memory is forgotten due to its instability. In the second pattern, long term memory presumably does not lose anything. What is stored is retrievable, provided we have access to the right cues or index, which with time and action can become modified or inaccessible.

The goal of this contribution is to use results from the social sciences to understand what role forgetting may play in learning, and results from Machine Learning to study the different forms forgetting can take in the process of learning. In Section 2 we will see how psychologists, pedagogues and biologists have linked learning and forgetting. This leads us to believe that the link is stronger than we may think. Then, in Section 3 we will revisit the question from the point of view of Computer Science: this leads us to question the difference between human and machine learning, which will require us to turn back to Turing's pioneering work. Machine learning algorithms and techniques are examined in more detail in Section 4: forgetting may concern the examples (4.1), the details or attributes (4.2) or the parameters of large models (4.3). The evidence supports the position that learning is really about intelligent forgetting: many algorithms are implicitly, and often explicitly, forgetting things in order to learn.

2 Human Forgetting

Forgetting has been a topic of research in Psychology since Ebbinghaus [Ebb85], who came up with the first set of scientific observations (essentially on himself) and invented the forgetting curves, which represent the decay of memory through time. This decay is rapid at first, and its form is well described by a power function of time [JB97, AH11]. The decay was believed by Thorndike [Tho14] to be the result of disuse. This theory was fought continuously during the 20th century, for example by Freud, who claimed that nothing is forgotten and that memories get buried through repression or other mechanisms [Fre96].

In Ebbinghaus' study, the forgetting rate was measured against learning. This point of view has been followed by pedagogues, who have systematically used forgetting as a way to measure the quality of teaching, even when this is done to question the dominating models of teaching [Smi98].

Following modern psychology, the issue, at least as far as long term memory goes, is not about memories disappearing, but about how the retrieval function works [Bjo11]. In computer science terms, this would correspond to something like database management issues. In a number of papers, Bjork and his colleagues have claimed that forgetting can be goal-directed and serve an implicit or explicit personal need [BBA98]:

People often view forgetting as an error in an otherwise functional memory system; that is, forgetting appears to be a nuisance in our daily activities. Yet forgetting is adaptive in many circumstances. For example, if you park your car in the same lot at work each day, you must inhibit the memory of where you parked yesterday (and every day before that!) to find your car today.

Recent results suggest that forgetting may actually even help learning. Eichenbaum reports for example that in animals who have endured brain damage to the hippocampus, if certain tasks become harder to perform, the animals seem to invent strategies which allow them to do better on other learning tasks [Eic11]. There is also some understanding of the chemical compounds involved in the learning-forgetting pair: in a series of experiments on drosophila, Berry et al. [BCSND12] concluded that dopamine could be responsible for active forgetting. In their tests, the authors show that bidirectional modulation of a small subset of dopamine neurons (DANs) after olfactory learning regulates the rate of forgetting. This works both for punishing (aversive) and for rewarding (appetitive) memories. Dopamine is created with the memory and then acts as an eraser whose effect can, to a certain extent, be controlled. It is even suggested that inhibiting the creation of dopamine may be a way of not forgetting.

Even more recently, Mosha and Robertson showed that unstable memories can be used to transfer learning from one task to another. Their experiment consisted in asking participants to learn two memory tasks and, later, a third one strongly related to the first. The second task makes the memory of the first unstable. Surprisingly, the performance on the third task is better than if the second task had not been proposed [MR15]. In our terms, the forgetting of the first task has a positive effect on the performance on the third.

One can therefore notice a shift from positions in which forgetting is mainly regarded as a measurement of bad learning, to those in which forgetting is a side effect of learning, and finally to those in which forgetting is required for learning to be possible.

3 What can the Computational Scientists Tell us?

If, in the case of humans, forgetting seems to be less the erasing of memories than the inhibition of indexing, then from a Computer Science standpoint, and if we accept to take the dangerous cybernetic step of comparing man and machine, we can remark that on a standard computer erasing (forgetting) is mostly a matter of managing the indexing tables: the information remains on the hard disk even when the software can no longer find it. This is used in forensic work when the authorities search for incriminating information that was supposedly erased. It also raises a number of privacy issues today.

3.1 Machine Learning vs. Learning Machines

In his key 1937 paper, Alan Turing [Tur37] introduced the Universal Turing Machine (UTM). The UTM is built upon the idea that any program can be encoded in the same way as the data it is to run on. Let's call the first the data-program and the second the data-data, even if what matters most is that they are essentially indistinguishable, both being just strings of 0's and 1's. It is therefore possible to concatenate the data-data with the data-program and to use a UTM whose role is to read the data, separate the data-data from the data-program, and execute (or run) this program on the data. In this way, the Universal Turing Machine is capable of simulating any machine. This may seem trivial today, but it was certainly one of the major leaps forward of the 20th century.

More than a decade later [Tur50], Turing took a second essential step: he suggested that the UTM might not just simulate a program but enhance it, modify it, allow it to be changed by experience. And in order to achieve this, Turing suggests that the machine should make the modifications itself, as the result of a learning experience:

Instead of producing a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain.

Turing explored the idea of a learning machine in detail, with the risks which have to be accepted:

An important feature of a learning machine is that its teacher will very often be very largely ignorant of quite what is going on inside, although he may be able to some extent to predict his pupil's behaviour.

He was also aware that a fully deterministic machine would not be able to adapt correctly or retrieve the information fast enough, and suggested that including a random element in a learning machine was wise. This last issue is of course of interest here. Like Funes el Memorioso, a machine which only stores the information is not able to learn, nor to be intelligent.

3.2 Overfitting

In Machine Learning, one of the more basic tasks is classification. In the most typical setting, we are given two sets (or samples) of examples, and the goal is to build a model which can be used to separate further examples, or predict their correct label. In textbook machine learning, these examples are 2-dimensional: each example is described by two coordinates, x and y. Models are mathematical and computational devices which can be used instead of the data to make decisions, classify or predict. Consider for instance a task in which we wish to separate the rainy days from the sunny days, and have two real-valued attributes to consider. We can plot the days as in Figure 1a, the blue points corresponding to rain, the red to sun. We are now searching for a model which would allow us to discriminate the red from the blue and therefore answer the question: should we classify a new point whose coordinates are (x, y) as red or as blue? In other words, is it going to rain today? (The plots and the arguments here are over-simplistic: meteorological agencies use many more attributes, and therefore dimensions, and search for much more complex models.)

Figure 1: A typical classification task (three panels: (a) the data, (b) a simple linear separator, (c) a complex separator fitting the data exactly).

Two extreme solutions to this problem can be suggested, as illustrated by Figures 1b and 1c. Which is best? In the first case (Fig. 1b) we try to fit a solution which is heavily biased: no line defined by a linear equation will be able to separate the red from the blue, and all we can hope for is a line which does as well as possible. We therefore settle for a simple, imperfect solution. In the second case (Fig. 1c) we come up with a line whose mathematical expression is much more complex (perhaps a polynomial of high degree), but which has the advantage of matching the data perfectly.

In Machine Learning we can measure success in two very different ways. The first is the training error, i.e. the error we make over the learning data. Minimizing this error is one option, possibly taken by an algorithm producing the model of Fig. 1c. The second is the error over data which we have not necessarily seen before, which can be measured for example with cross-validation. This cross-validation error is much closer to the real error we expect.

The interesting curve is the one represented in Figure 2. This very typical curve shows that as we make the model more and more complex, fitting it to the learning data as tightly as possible, we can decrease the learning error essentially as much as we want. This is what happens as we move from the simple curve of Fig. 1b to the more complex one of Fig. 1c. But there comes a moment where the real error, which is not apparent on the learning data, starts to increase. This phenomenon is called over-fitting.

Figure 2: Error curve for a fixed data size.

In other words, the strategy of fitting closely to the data, as in Fig. 1c, will probably result in a model with worse generalization capacity. Over-fitting has been known and studied for a long time, and a full analysis can be found in standard textbooks, e.g. [Fla11].
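To make Figure 2 concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the sine-shaped data set and the degree values are synthetic and purely illustrative). It fits polynomial models of increasing complexity and compares the training error with a cross-validated estimate of the real error: the most complex model has the lowest training error but the highest cross-validation error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=60)   # noisy toy data

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    train_err = mean_squared_error(y, model.predict(x))
    cv_err = -cross_val_score(model, x, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: training error {train_err:.3f}, cv error {cv_err:.3f}")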

3.3 Ockham's razor

Entia non sunt multiplicanda praeter necessitatem

are the words of William of Ockham (or Occam). They are often interpreted as a sound principle for machine learning: when trying to find an explanation (a model), choose a simple one. The principle is called Ockham's razor, and has been the basis of many settings, such as the Minimum Description Length or Minimum Message Length principles [WB68, Grü07], Kolmogorov complexity [LV93], etc.

But the principle is not just common sense. It is actually mathematically justified. PAC-learning is one important mathematical model to study learnability, i.e. the capacity for an algorithm to solve a learning problem, or the possibility that a set of concepts is indeed learnable by an algorithm [Val84]. A concept class C is Probably Approximately Correctly learnable if there exists a learning algorithm which, when presented with a reasonable amount of (possibly labelled) examples of some concept c, will build a hypothesis h which is probably very close to c, i.e., when having to classify new unseen examples, it will do as well as c most of the time, and this with high probability. A famous theorem [BEHW87] says that what the authors call an Occam algorithm will do just that. An Occam algorithm compresses the data, i.e. it is always able to build an alternative but cheaper representation of the data.

Let C and H be concept classes containing target concepts and hypotheses respectively, and let the sample set S contain m samples, each containing n bits. Then, for constants α ≥ 0 and 0 ≤ β < 1, a learning algorithm L is an (α, β)-Occam algorithm for C using H if, given S labelled according to some c in C, L outputs a hypothesis h ∈ H such that h is consistent with c on S (that is, h(x) = c(x) for all x ∈ S) and size(h) ≤ (n · size(c))^α · m^β. Such an algorithm L is called an efficient (α, β)-Occam algorithm if it runs in time polynomial in n, m and size(c).

Within the PAC model the Occam algorithm is asked to build a small model fitting the data perfectly (consistent with the data), which brings us close to the situation represented in Fig. 2 and would, in practice, lead to over-fitting. The above arguments clearly link learning with compression and are well known; what remains to be seen is what is forgotten during this compression process.

4 Machine Learning Algorithms Forget

4.1 Forgetting the Data

In a number of Machine Learning algorithms, one of the main actions performed is that some (often most) of the data is eliminated and plays no role in the decision made by the model. In other words, the learning algorithm forgets many (if not most) of its examples and uses the others to build the model.

4.1.1 The Perceptron

The perceptron is one of the oldest algorithms in Machine Learning. It was introduced by Rosenblatt with a clear connectionist goal in mind [Ros58], and is considered to be an ancestor of neural networks. The algorithm attempts to separate two sets of data which belong to a vector space. The separator is a hyperplane. The perceptron iteratively examines each piece of data and updates its current solution if the current hypothesis misclassifies it. Mathematically, it moves the hyperplane slightly towards the badly classified example. If the example is well classified, the perceptron does nothing. The process continues until all examples are well classified. As pointed out in [Fla11], a dual version of the algorithm can be seen as computing a weighted linear combination of the examples, the weights reflecting how hard a given example is to classify. The intuition is then clear: when an example is easy to classify, its weight will be 0. In other words, the example can be forgotten. As a consequence of this observation, the perceptron will work just as well if we forget the most obvious examples.
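A minimal sketch of this behaviour (toy data and variable names are mine, assuming only NumPy): the counter records how often each example forces an update of the hyperplane; with well-separated data most counters stay at zero, i.e. most examples could have been forgotten without changing the result.

import numpy as np

rng = np.random.default_rng(0)
# two linearly separable clouds, labels -1 and +1 (illustrative data)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w, b = np.zeros(2), 0.0
updates = np.zeros(len(X))              # dual view: weight of each example
for _ in range(100):                    # epochs
    errors = 0
    for i, (x, t) in enumerate(zip(X, y)):
        if t * (x @ w + b) <= 0:        # misclassified: move the hyperplane towards x
            w += t * x
            b += t
            updates[i] += 1
            errors += 1
    if errors == 0:                     # every example is now well classified
        break

print("examples that shaped the model:", np.count_nonzero(updates), "of", len(X))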

4.1.2 The Support Vector Machines

In a classification task where the data is represented as vectors, a typical measure of how well an example is treated by the classifier is the L2 norm, in other words the Euclidean distance from the separating hyperplane to the example. Let d denote the minimum distance from the examples to the separating hyperplane; the margin is the zone at distance less than d from the hyperplane. Vapnik [Vap98] showed that maximizing the margin offered the best guarantees of success and proposed to build support vector machines (SVM) defined by a very small subset of examples, those closest to this margin. These elements are called the support vectors. If the space is n-dimensional, just n+1 vectors are needed, and it can be proven that all the other elements can disappear (or be forgotten): only the support vectors are needed to build the SVM. Of course, deciding which examples are the support vectors is no easier than classifying. In many cases, soft margins represent a better alternative, obtained by compromising between a large margin and few examples misclassified or inside the margin [CV95]. But even in this case, the number of support vectors remains very low, indicating that many examples can be forgotten.
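A minimal sketch with scikit-learn (the two Gaussian clouds and the parameters are illustrative): after training a soft-margin linear SVM, only the support vectors kept inside the model are needed to define the separator; the rest of the sample can be forgotten.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.hstack([np.zeros(500), np.ones(500)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin SVM [CV95]
# Everything the decision function needs is stored in clf.support_vectors_.
print("support vectors kept:", int(clf.n_support_.sum()), "out of", len(X))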

4.1.3 Data streams

Data stream mining is concerned with building adaptive models from streams of data. As more and more data arrives, the learning algorithm faces the challenge of deciding between adapting the model to fit the new data or conservatively ignoring it. This amounts to forgetting the model, or parts of the model, or forgetting the examples [BK09]. Concept drift is an issue which makes the question different from that of online learning: with time, what we should learn may change, and the algorithm must also deal with this. An algorithm which implements the choice between forgetting data and updating the hypotheses is CVFDT (Concept-adapting Very Fast Decision Trees) [HSD01]. An alternative to developing new algorithms for streaming is to scale up existing ones. This can be done through wrapping: traditional machine learning algorithms build models which then compete to be kept, and the wrapper's job is to know when to forget some of these models.
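The following is not CVFDT, only a minimal sketch of the windowing idea behind many stream learners (the class name, window size and decision rule are mine): old examples fall out of a fixed-size buffer, so the model automatically forgets them and can follow concept drift.

from collections import deque

class WindowedMajority:
    """Toy stream learner: remembers only the last `window` labelled examples."""
    def __init__(self, window=200):
        self.buffer = deque(maxlen=window)   # older examples are silently forgotten

    def update(self, x, label):
        self.buffer.append((x, label))

    def predict(self, x):
        labels = [l for _, l in self.buffer]
        # majority vote over what is still remembered (placeholder decision rule)
        return max(set(labels), key=labels.count) if labels else None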

4.1.4 The k-nearest neighbours

k-NN is a popular (and well established) technique, well known inside the pattern recognition field [DH73]. It involves keeping all the data and using a distance. When a new element appears, the closest (or the k closest) examples are found and used to decide how the new element should be classified: the idea is to classify an example according to how its neighbours are classified. A major problem with k-nearest-neighbour based approaches is the size of the data set, which has to be visited systematically for each new example to be classified. When the dataset is huge, this is a serious issue. One way to deal with it, at least in part, is to simplify the set by forgetting those examples which are not going to change the decision: typically, if a red-labelled example is surrounded by many red ones, then removing it (forgetting it) will only simplify things.
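A minimal sketch of such dataset editing (assuming scikit-learn and NumPy; the rule and the threshold are simplistic and mine): an example whose k nearest neighbours all share its label sits deep inside its own class, so it is forgotten; only the points near the decision boundary are kept.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def forget_easy_examples(X, y, k=5):
    """Keep only the examples whose neighbourhood is not unanimous."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # each point is its own first neighbour
    keep = [i for i, neigh in enumerate(idx)
            if not np.all(y[neigh[1:]] == y[i])]
    return X[keep], y[keep]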

4.2 Forgetting the Details

In other cases, the key to the success of the ML algorithm will not reside in removing examples: statistical decisions often depend on their number, as taking a decision after having seen 10 red examples out of 20 is not the same as taking one after having seen 10,000 out of 20,000. So, forgetting examples may not pay off.

4.2.1 Decision trees: avoiding the exponential growth

Building decision trees has long been a favourite technique in Machine Learning. Typical techniques [Qui83] involve building a tree in which each node is a test over some attribute and the leaves contain the decisions. The attributes are chosen by evaluating the learning sample: does splitting on this attribute separate the sample nicely, that is, minimize the entropy? The process separates the data into as many subsets as there are branches and is applied recursively. If the data set is very large, the process of choosing the attributes with which the sample is split can continue until a very large and unmanageable tree is produced. In fact, decision trees are classifiers for which the behaviour described in Figure 2 appears. In order to avoid this, different options have been proposed, including pruning the tree or limiting its expansion. But in both cases, this can be seen as: do not use all the attributes, or, in other terms, forget some attributes.
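A minimal sketch of the second option, limiting the expansion of the tree (assuming scikit-learn; the synthetic data and the depth value are illustrative): the depth-limited tree ignores, i.e. forgets, most attributes, and its cross-validated accuracy is often as good as, or better than, that of the fully grown tree.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, tree in [("fully grown", DecisionTreeClassifier(random_state=0)),
                   ("depth-limited", DecisionTreeClassifier(max_depth=4, random_state=0))]:
    acc = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{name} tree: cross-validated accuracy {acc:.3f}")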

4.2.2 Grammatical inference

In Grammatical Inference, the goal is to build grammars or finite state machines from data which are usually strings [dlH10]. In this case, the learning algorithms typically aim at finding the regularities in the strings and merging together the prefixes which obey common discovered patterns, forgetting many other aspects of the strings which may be present (and which, unfortunately, may sometimes even be essential). For example, we may have several hundred strings of the type abba, abbbba, aba, abbbbbbbbbba, ... A grammatical inference algorithm might build a finite state machine corresponding to the regular pattern ab*a. The result will replace the data set, forgetting the individual counts of the number of b's. In certain algorithms, for example MDI [TDdlH00], forgetting is explicit: the algorithm builds an automaton by merging states, and the best merge is chosen by measuring the best compromise between the entropy loss and the amount of data which can be forgotten.
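This is not the MDI algorithm, only a toy illustration of the outcome (the sample and the pattern are the ones used above; the code is mine): once a hypothesis such as ab*a has been adopted, the pattern alone replaces the sample and the individual counts of b's are gone.

import re

sample = ["abba", "abbbba", "aba", "abbbbbbbbbba"] * 100   # illustrative data

hypothesis = re.compile(r"^ab*a$")       # candidate produced by some inference step
assert all(hypothesis.match(s) for s in sample)

model = hypothesis.pattern               # "ab*a" now stands for the whole data set
print(f"{len(sample)} strings summarized by the pattern {model!r}; the b-counts are forgotten")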

4.3 Forgetting the parameters

In many techniques, the model is known and the learning issue is that of finding the right tuning of its parameters.

Parameter estimation is nevertheless also prone to over-fitting. In some recent cases, the sheer number of parameters to estimate can make the model more expensive than the actual data: models with hundreds of millions of parameters are being trained today in Deep Learning [LBH15]. In order to avoid over-fitting, a number of techniques have been proposed, many of which involve forgetting.

4.3.1 Sparsity

A well-founded and successful approach in Machine Learning to avoid over-fitting is to privilege sparse solutions. A solution is sparse when only a small number of parameters or weights have nonzero values. The reasons for doing this are numerous, and although arguments about computational complexity are mostly put forward, some authors relate sparsity to Ockham's razor. As argued in [HTW15]:

Thus the advantages of sparsity are interpretation of the fitted model and computational convenience. But a third advantage has emerged in the last few years from some deep mathematical analyses of this area. This has been termed the "bet on sparsity" principle: Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

Usually, sparsity is obtained as a side effect of regularisation. Sometimes, as in [BDE09], it is the explicit objective of the learning function. Let us admit that deliberately setting a parameter to 0 corresponds to forgetting it. Indeed, in graphical models, one way to get smaller models is to find the arcs whose weight has been set to 0 and remove them.

4.3.2 Regularisation: the LASSO

The parameter estimation issue is often solved by building an optimisation problem in which the optimal setting of the parameters corresponds to a minimum empirical risk, or some other measure defining how far we are prepared to misclassify or misestimate our data. Without further clues, an excellent result of our optimisation scheme would result in over-fitting. In order to avoid this, we seek to optimise a second function at the same time. This second function tells us something about the sort of solution we want, and typically, sparsity is privileged. LASSO (Least Absolute Shrinkage and Selection Operator) [HTW15] is a regularisation technique making use of the L1 norm this time. The objective is to minimize at the same time the error the solution is going to make and the L1 norm of the parameters. One can notice that among the possible solutions such that x^2 + y^2 = a, i.e. the sort of equation which appears in a least-squares estimate, the ones which minimize the L1 norm are the pairs (√a, 0) and (0, √a). Therefore, LASSO's goal is also to forget.
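A minimal sketch with scikit-learn (synthetic data in which only three of fifty features matter; the penalty strength is illustrative): the L1 penalty drives most coefficients to exactly zero, so the corresponding features are, in effect, forgotten.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:3] = [2.0, -1.5, 1.0]                    # only 3 informative features
y = X @ true_w + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)               # L1-regularised least squares
print("non-zero coefficients:", np.count_nonzero(lasso.coef_), "out of", X.shape[1])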

4.3.3 The auto-encoder

When training a neural network and aiming to estimate the parameters, one crucial idea has been to train the network to reproduce its own input. But in order to do more than just copy the inputs to the outputs, the network auto-encodes the data by using a hidden layer which is smaller than the dimension the data is encoded in [Ben09]. This technique can furthermore be applied to multilayer networks, by training each layer separately. The forgetting lies in obliging the network to reduce its dimensionality.
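A minimal sketch of the idea (assuming scikit-learn; a linear auto-encoder on synthetic data that truly lives in a 5-dimensional subspace): the network is trained to reproduce its own 100-dimensional input through a 5-unit hidden layer, so it must discard, i.e. forget, everything that does not fit through the bottleneck.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 5))                  # hidden low-dimensional structure
X = Z @ rng.normal(size=(5, 100))               # observed 100-dimensional data

autoencoder = MLPRegressor(hidden_layer_sizes=(5,), activation="identity",
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)                           # target = input: reproduce yourself
print("reconstruction R^2 through the bottleneck:", round(autoencoder.score(X, X), 3))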

4.3.4 Dropout

As pointed out in [LBH15], one key to the current success of Convolutional Networks in Image Processing and other tasks is the use of Dropout [SHK+14]. Dropout randomly drops units (along with their connections) from the neural network during training. This does not necessarily lead to a sparse solution, as the process is repeated over the training sample. Nevertheless, one intuition is that in order to learn, forgetting some units helps.
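A minimal sketch of the mechanism itself (plain NumPy, using the common "inverted dropout" scaling; the rate is illustrative): during each training pass a random half of the units is zeroed out, i.e. temporarily forgotten, and the survivors are rescaled; at test time the layer is left untouched.

import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly forget units during training; identity at test time."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop        # which units survive
    return activations * mask / (1.0 - p_drop)            # rescale the survivors

hidden = np.ones((4, 8))                  # a batch of hidden activations
print(dropout(hidden, p_drop=0.5))        # about half of each row is zeroed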

4.3.5 LSTMs

The role of forgetting is made explicit in the case of Long Short-Term Memory neural networks (LSTM) [HS97]. LSTMs are the basis of some of the most successful recurrent neural networks today. They are able to handle sequential data and memorize information about past inputs. In LSTMs, special hidden units are composed with ordinary units into blocks, which are used to store inputs over time. The blocks are controlled by three gates: the "input gate" can block a value, the "output gate" lets the value pass, and the "forget gate" will, as the name indicates, flush the block. See Figure 3.

Figure 3: An LSTM block (image by BiObserver, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=43992484).
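A minimal sketch of one step of the LSTM block described above, in plain NumPy (the weight shapes and names are mine): the forget gate f decides, coordinate by coordinate, how much of the previous cell state is erased before new content is written.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    z = np.concatenate([h, x])
    f = sigmoid(W_f @ z + b_f)     # forget gate: 0 flushes the memory, 1 keeps it
    i = sigmoid(W_i @ z + b_i)     # input gate: how much new content to store
    o = sigmoid(W_o @ z + b_o)     # output gate: how much of the memory to expose
    c_new = f * c + i * np.tanh(W_c @ z + b_c)
    h_new = o * np.tanh(c_new)
    return h_new, c_new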


5 Conclusion

Sailors know that, contrary to popular belief, it is less the wind pushing into the sail which makes the boat move than the vacuum (or lift) on the leeward side of the sail which pulls it. In a similar way, we believe that in learning, or at least in Machine Learning, forgetting is instrumental to learning. Moreover, focusing on forgetting helps us understand the nature of Machine Learning algorithms.

The relationship between forgetting and the nature of the tasks deserves to be studied, as does the relationship with other Machine Learning ideas. In Ensemble Learning, weak learners are combined in order to propose joint decisions. Is there an effect of forgetting in the way the different learners act? Do they want to forget different aspects?

As machine learning takes on a larger and larger importance in science, it is important to reflect on what machine learning algorithms are really doing. Are they attempting to fulfil Turing's goal of building a learning machine? Are they only solving some optimization questions? Are they doing more than that? And what are the relations between learning and forgetting? Is forgetting something a computer will not do? Is it a side effect of learning? Or, as argued here, is it the actual engine of learning?

References

[AH11] Lee Averell and Andrew Heathcote. The form of the forgetting curve and the fate of memories. Journal of Mathematical Psychology, 55:25–35, 2011.

[BBA98] Elizabeth L. Bjork, Robert A. Bjork, and Michael C. Anderson. Varieties of goal-directed forgetting. In J. M. Golding and C. MacLeod (Eds.), Intentional Forgetting: Interdisciplinary Approaches, pages 103–137. Erlbaum, Hillsdale, NJ, 1998.

[BCSND12] Jacob Berry, Isaac Cervantes-Sandoval, Eric Nicholas, and Ronald Davis. Dopamine is required for learning and forgetting in Drosophila. Neuron, 74(3):530–542, 2012.

[BDE09] Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.

[BEHW87] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth. Occam's razor. Information Processing Letters, 24(6):377–380, 1987.

[Ben09] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[Bjo11] Robert A. Bjork. On the symbiosis of learning, remembering, and forgetting. In Arthur S. Benjamin (Ed.), Successful Remembering and Successful Forgetting: A Festschrift in Honor of Robert A. Bjork, pages 1–22. Psychology Press, London, UK, 2011.

[BK09] Albert Bifet and Richard Kirkby. Data stream mining: a practical approach. Online, 2009.

[Bor44] Jorge Luis Borges. Funes el Memorioso. In Ficciones. 1944.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning Journal, 20(3):273–297, 1995.

[DH73] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[dlH10] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.

[Ebb85] Hermann Ebbinghaus. Über das Gedächtnis, 1885. Translated as Memory: A Contribution to Experimental Psychology by Henry A. Ruger and Clara E. Bussenius (1913). Teachers College, Columbia University, New York.

[Eic11] Howard Eichenbaum. The paradoxical hippocampus: When forgetting helps learning. In The Paradoxical Brain, pages 1–10. Cambridge University Press, 2011.

[Fla11] Peter Flach. Machine Learning. Cambridge University Press, 2011.

[Fre96] Sigmund Freud. The aetiology of hysteria. In The Complete Psychological Works of Sigmund Freud, Volume III (1893–1899): Early Psychoanalytic Publications. 1896.

[Gha15] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452–459, 2015.

[Gib16] Elizabeth Gibney. What Google's winning Go algorithm will do next. Nature, 531(7594):284–285, 2016.

[Grü07] Peter Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.

[GT07] Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. The MIT Press, 2007.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, San Francisco, CA, 2001. ACM Press.

[HTW15] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Taylor and Francis, 2015.

[Jam90] William James. The Principles of Psychology, volume 2. Henry Holt and Company, New York, 1890.

[JB97] Mohamad Jaber and Maurice Bonney. A comparative study of learning curves with forgetting. Applied Mathematical Modelling, 21(8):523–531, 1997.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[LV93] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 1993.

[MR15] Neechi Mosha and Edwin M. Robertson. Unstable memories create a high-level representation that enables learning transfer. Current Biology, 26:100–105, 2015.

[Qui83] Ross Quinlan. Learning efficient classification procedures. In R. Michalski, J. Carbonell and T. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, pages 463–482. Morgan Kaufmann, 1983.

[Ros58] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. Cornell Aeronautical Laboratory.

[SHK+14] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Smi98] Frank Smith. The Book of Learning and Forgetting. Teachers College Press, 1998.

[TDdlH00] Franck Thollard, Pierre Dupont, and Colin de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

[Tho14] Edward L. Thorndike. The Psychology of Learning. Teachers College Press, New York, 1914.

[Tur37] Alan Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42(2):230–265, 1937.

[Tur50] Alan Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.

[Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[Vap98] Vladimir Vapnik. Statistical Learning Theory. Wiley, Hoboken, NJ, USA, 1998.

[WB68] Christopher S. Wallace and David M. Boulton. An information measure for classification. The Computer Journal, 11:185–194, 1968.

[Zwe41] Stefan Zweig. Schachnovelle. The Royal Game and Other Stories. E. P. Dutton, New York, 1941.