**1.**
J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.

**2.**
J. Schmidhuber and R. Huber.
Using sequential adaptive neuro-control for efficient learning of
rotation and translation invariance.
In T. Kohonen,
K. Mäkisara, O. Simula, and J. Kangas, editors,
*Artificial Neural Networks*, pages 315-320.
Elsevier Science Publishers B.V., North-Holland, 1991.

**3.**
J. Schmidhuber and R. Huber.
Learning to
generate artificial fovea trajectories for target detection.
*International Journal of Neural Systems*, 2(1 & 2):135-141, 1991.

**More recent work on learning selective attention with reinforcement learning recurrent networks:**
**4.**
J. Koutnik, G. Cuccu, J. Schmidhuber, and F. Gomez.
Evolving large-scale neural networks for vision-based reinforcement learning.
In *Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO)*, Amsterdam, July 2013.

**5.**
M. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber.
Deep networks with internal selective attention through feedback connections.
In *Advances in Neural Information Processing Systems (NIPS)*, 2014.
Preprint arXiv:1407.3068 [cs.CV].

**Learning soft (differentiable) attention since the early 1990s.**
The methods above are of the "hard" attention type. But we have also worked on differentiable memories and "soft" attention, where a recurrent "memory network" learns to **control its own internal spotlights of attention** (these now fashionable buzzwords already appeared in the 1993 paper below) to quickly associate self-defined patterns through fast weights:
**6.**
J. Schmidhuber.
Reducing the ratio between learning complexity and number of
time-varying variables in fully recurrent nets.
In *Proceedings of the International Conference on Artificial
Neural Networks, Amsterdam*, pages 460-463. Springer, 1993.

One important point was that an RNN with 1,000 units can have up to 1,000,000 connections with fast weights; that is, one gets many more dynamic variables than in a standard RNN. The RNN must learn to read and write its fast memory through a differentiable, Hebb-inspired multiplicative rule that rapidly associates the patterns to which the RNN currently attends (by highlighting them). Computing the full gradient through these complex fast weight dynamics nevertheless remains rather cheap.
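The write-and-read mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the exact rule from the 1993 paper: the function names (`write`, `read`), the decay and learning-rate constants, and the unit-norm key are assumptions made for clarity. It shows the essential idea of a differentiable, Hebb-inspired outer-product update that rapidly binds a "value" pattern to a "key" pattern in an n-by-n fast weight matrix, giving n² dynamic variables for n units:

```python
import numpy as np

# Hypothetical sketch of a fast-weight associative memory (names and
# constants are illustrative, not taken from the original paper).

rng = np.random.default_rng(0)
n = 8                      # number of units -> up to n*n fast weights
decay, lr = 0.9, 0.5       # fast-weight decay and write strength

F = np.zeros((n, n))       # fast weight matrix: the short-term memory

def write(F, key, value):
    """Hebb-inspired multiplicative rule: rapidly associate value with key
    via a differentiable outer-product update."""
    return decay * F + lr * np.outer(value, key)

def read(F, key):
    """Retrieve the pattern currently associated with key."""
    return F @ key

key = rng.standard_normal(n)
key /= np.linalg.norm(key)         # unit-norm key for clean retrieval
value = rng.standard_normal(n)

F = write(F, key, value)           # one fast "attentive" write step
retrieved = read(F, key)           # equals lr * value, since key has norm 1
print(np.allclose(retrieved, lr * value))
```

Because `write` and `read` are compositions of differentiable operations, gradients can flow through the entire sequence of fast weight updates, which is what keeps full gradient computation tractable despite the quadratic number of dynamic variables.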
Related papers:

**7.**
J. Schmidhuber.
Learning to
control fast-weight memories: An alternative to recurrent nets.
*Neural Computation*, 4(1):131-139, 1992.