Long Short-Term Memory (LSTM) networks are a special type of Recurrent Neural Network designed to address the vanishing gradient problem. They can learn long-term dependencies in sequential data through a sophisticated gating mechanism. LSTMs have memory cells that can maintain information over many time steps, making them well suited to tasks like language modeling, speech recognition, and time series prediction.
Animation: watch how information flows through the LSTM gates. The cell state (top) carries long-term memory; the gates control what to remember and what to forget.
The genius of LSTMs lies in their gating mechanism. Unlike vanilla RNNs that struggle with long sequences, LSTMs use three gates (forget, input, output) to carefully regulate information flow. The cell state acts like a "highway" that information can travel along unchanged, with gates deciding what to add, remove, or output. This architecture allows LSTMs to remember important information from hundreds of time steps ago while forgetting irrelevant details.
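To make the "highway" idea concrete, here is a minimal NumPy sketch with hand-picked gate values (illustrative numbers, not learned weights). A forget gate near 1 plus an input gate near 0 lets old memory pass through a dimension untouched, while the opposite setting overwrites it:

```python
import numpy as np

# Hand-picked gate values for illustration (in a real LSTM these are learned).
# The cell state update c_t = f_t * c_{t-1} + i_t * g_t is mostly additive,
# so it acts like a highway that information can ride along unchanged.
c_prev = np.array([0.9, -0.4, 0.2])   # memory carried in from earlier steps

f = np.array([0.99, 0.99, 0.01])      # forget gate: keep dims 0-1, erase dim 2
i = np.array([0.02, 0.02, 0.95])      # input gate: only write into dim 2
g = np.array([0.5, -0.8, 0.7])        # candidate values proposed at this step

c_new = f * c_prev + i * g            # gated update of the cell state
o = np.array([0.10, 0.90, 0.90])      # output gate: expose only some dimensions
h_new = o * np.tanh(c_new)            # hidden state is a filtered view of the cell

print(c_new.round(3))                 # dims 0-1 barely change; dim 2 is rewritten
print(h_new.round(3))
```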
When predicting the next word in: "I grew up in France... I speak fluent ___"
Many words ago: "France" enters the network
Forget Gate: Keeps "France" in the cell state (it stays relevant to the language being spoken)
Input Gate: Adds the intervening words without overwriting "France"
Cell State: Maintains "France" information across many time steps
Output Gate: When the language must be predicted, exposes "France" from the cell state
Prediction: "French" (95% confidence), thanks to long-term context (sketched in code below)
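Here is a hedged sketch of how such a next-word predictor might be wired up in PyTorch. The tiny vocabulary, the layer sizes, and the NextWordLSTM class are illustrative assumptions; with untrained weights the probabilities come out near-uniform, and the high confidence above would only emerge after training on real text:

```python
import torch
import torch.nn as nn

# Illustrative toy vocabulary; real language models use tens of thousands of tokens.
vocab = ["<pad>", "i", "grew", "up", "in", "france", "speak", "fluent", "french"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class NextWordLSTM(nn.Module):
    """Hypothetical next-word predictor: embed tokens, run an LSTM, score the vocab."""
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)     # final hidden state summarizes the context
        return self.out(h_n[-1])       # logits over the vocabulary

model = NextWordLSTM(len(vocab))
context = ["i", "grew", "up", "in", "france", "i", "speak", "fluent"]
ids = torch.tensor([[word_to_id[w] for w in context]])
probs = torch.softmax(model(ids), dim=-1)
# With untrained weights the distribution is near-uniform; after training, the
# probability mass should shift to "french" because the cell state can carry
# the "france" signal across the intervening words.
print(vocab[int(probs.argmax())], float(probs.max()))
```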
1. Forget Gate: Decides what information to discard from cell state (0 = forget, 1 = keep)
2. Input Gate: Determines what new information to add to cell state
3. Cell State Update: Combines forget and input decisions to update memory
4. Output Gate: Controls what parts of cell state to expose as output
5. Hidden State: Filtered version of cell state becomes new hidden state
6. Repeat: The process continues for each element in the sequence (see the sketch after this list)
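Those six steps map almost line-for-line onto the minimal NumPy sketch below. The single stacked weight matrix, the gate ordering, and the random toy inputs are assumptions chosen for brevity rather than a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four gate pre-activations."""
    z = np.concatenate([h_prev, x_t]) @ W + b   # shared affine transform
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                   # 1. forget gate: 0 = discard, 1 = keep
    i = sigmoid(i)                   # 2. input gate: how much new info to write
    g = np.tanh(g)                   #    candidate values to write
    c_new = f * c_prev + i * g       # 3. cell state update: forget + input combined
    o = sigmoid(o)                   # 4. output gate: what to expose
    h_new = o * np.tanh(c_new)       # 5. hidden state: filtered view of the cell
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 4, 6     # toy sizes for illustration
W = rng.normal(scale=0.1, size=(hidden_dim + input_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):   # 6. repeat over the sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.round(3), c.round(3))
```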
Vanishing Gradients: In a vanilla RNN, information from early time steps gets lost (sketched numerically below)
Limited Memory: Only remembers recent context (roughly 5-10 steps) in practice
Unstable Training: Gradients explode or vanish during backpropagation
Poor Long Dependencies: Fails on tasks requiring long-term memory
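A rough numeric sketch of why this happens, using a one-dimensional recurrence with hand-picked numbers rather than a real training run: the gradient through T steps is a product of T per-step factors, and factors below 1 shrink it exponentially.

```python
# Vanilla RNN: h_t = tanh(w * h_{t-1} + x_t).  The gradient of h_T with respect
# to an early hidden state is a product of per-step factors w * tanh'(.), which
# are typically below 1, so the signal decays exponentially with distance.
w = 0.9                  # illustrative recurrent weight (hand-picked)
per_step = w * 0.8       # assume tanh'(.) averages around 0.8 (hand-picked)
for steps in (5, 10, 50, 100):
    print(steps, per_step ** steps)
# roughly: 5 -> 0.19, 10 -> 0.04, 50 -> 7e-8, 100 -> 5e-15
# Early inputs effectively stop influencing the loss after a few dozen steps.
```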
Gradient Highway: The LSTM's cell state gives gradients a nearly unimpeded backward path, preventing them from vanishing (see the sketch after this list)
Extended Memory: Can remember for 100+ time steps
Stable Training: Gates regulate gradient flow
Long Dependencies: Excels at learning long-range patterns
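The cell state is the reason. Because c_t = f_t * c_{t-1} + i_t * g_t, the gradient flowing back along the cell-state path is essentially the product of the forget-gate activations, and the network can learn to hold those near 1 for information worth keeping. A sketch with an illustrative gate value:

```python
# Along the cell state, c_t = f_t * c_{t-1} + i_t * g_t, so dc_t/dc_{t-1} = f_t.
# The gradient over T steps is the product of the forget-gate values, which the
# network can learn to keep close to 1 for features it wants to remember.
forget_gate = 0.98       # illustrative learned gate value for a "kept" feature
for steps in (10, 100, 500):
    print(steps, forget_gate ** steps)
# roughly: 10 -> 0.82, 100 -> 0.13, 500 -> 4e-5
# Decay is slow and controllable, versus the exponential collapse of the
# vanilla RNN shown earlier.
```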