Neural Networks
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Neural networks are computational systems inspired by the biological neural networks in animal brains. They are the foundational building block of modern artificial intelligence, powering everything from image recognition and speech synthesis to large language models and autonomous vehicles. Understanding neural networks is the first step toward mastering the field of deep learning.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Artificial neuron''' – A mathematical unit that receives one or more inputs, applies a weighted sum, adds a bias, and passes the result through an activation function to produce an output.
* '''Layer''' – A collection of neurons that process inputs in parallel. Networks are composed of an input layer, one or more hidden layers, and an output layer.
* '''Weight''' – A learnable scalar value attached to each connection between neurons. Weights determine the strength of influence one neuron has on another.
* '''Bias''' – An additional learnable parameter added to the weighted sum before applying the activation function, allowing the neuron to shift its output independently of its inputs.
* '''Activation function''' – A mathematical function applied to a neuron's output to introduce non-linearity. Common examples: ReLU, Sigmoid, Tanh, Softmax.
* '''Forward pass''' – The process of propagating input data through the network layer by layer to produce a prediction.
* '''Backpropagation''' – An algorithm that computes the gradient of the loss function with respect to each weight, enabling gradient-based learning.
* '''Loss function''' – A measure of how far the network's predictions are from the true labels. Examples: Mean Squared Error (MSE), Cross-Entropy Loss.
* '''Gradient descent''' – An optimization algorithm that iteratively updates weights in the direction that reduces the loss.
* '''Epoch''' – One full pass through the entire training dataset.
* '''Batch size''' – The number of training examples used in a single weight update step.
* '''Learning rate''' – A hyperparameter controlling the size of weight update steps during gradient descent.
* '''Overfitting''' – When a model learns the training data too well, including its noise, and fails to generalize to new data.
* '''Dropout''' – A regularization technique where random neurons are deactivated during training to prevent overfitting.
* '''Convolutional Neural Network (CNN)''' – A network architecture specialized for grid-like data (e.g., images) using convolution operations.
* '''Recurrent Neural Network (RNN)''' – An architecture where connections form directed cycles, enabling processing of sequential data.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
A neural network learns by adjusting its weights to minimize a loss function. The learning process has two phases repeated iteratively:

'''Forward Pass''': Input data flows through each layer. Each neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function. This produces progressively more abstract representations until the final output layer produces a prediction.
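As a minimal sketch of this per-neuron computation, a single small dense layer in NumPy might look like the following; the layer size and numeric values are arbitrary illustrations:

<syntaxhighlight lang="python">
import numpy as np

# A single dense layer with 3 inputs and 2 neurons (illustrative values only).
x = np.array([0.5, -1.0, 2.0])        # input vector
W = np.array([[0.1, 0.4, -0.2],       # one row of weights per neuron
              [0.3, -0.5, 0.8]])
b = np.array([0.05, -0.1])            # one bias per neuron

z = W @ x + b                         # weighted sum plus bias
a = np.maximum(0.0, z)                # ReLU activation: max(0, z)
print(a)                              # this output feeds the next layer
</syntaxhighlight>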
'''Backward Pass (Backpropagation)''': The error between the prediction and the true label is computed using the loss function. Using the chain rule of calculus, the gradient of this error with respect to every weight in the network is computed, flowing backward from output to input. Weights are then nudged in the direction that reduces the error. Think of it like tuning a radio dial: you make small adjustments (gradient steps) based on how much static (loss) you hear, until the signal (prediction) becomes clear.

The power of neural networks comes from their ability to learn hierarchical representations. Early layers detect simple features (edges in an image, individual characters in text), while deeper layers combine these into increasingly complex abstractions (shapes, words, concepts). This emergent feature learning is why neural networks outperform hand-engineered feature extraction on most complex tasks.

The choice of '''activation function''' is critical. Without non-linear activations, stacking layers would be mathematically equivalent to a single linear transformation, giving no benefit. ReLU (Rectified Linear Unit), which simply outputs max(0, x), has become the default because it mitigates the vanishing gradient problem that plagued earlier sigmoid and tanh activations.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Training a simple neural network in PyTorch:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple 3-layer feedforward network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),   # Input: 28x28 image flattened
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10)     # Output: 10 classes
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNet()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop; `dataloader` is assumed to be a torch.utils.data.DataLoader
# that yields (batch_x, batch_y) pairs of flattened images and integer labels.
for epoch in range(10):
    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)
        loss.backward()    # Backpropagation
        optimizer.step()   # Weight update
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
</syntaxhighlight>

; Key hyperparameters to tune
: '''Learning rate''' – Start with 1e-3 for Adam, 1e-1 for SGD. Use a scheduler to decay it over time (see the sketch below this list).
: '''Batch size''' – 32 or 64 is a good default. Larger batches make better use of hardware parallelism but may generalize worse.
: '''Network depth''' – Start shallow (2–3 layers), deepen only if underfitting.
: '''Dropout rate''' – 0.2–0.5 for hidden layers; never apply dropout to the output layer.
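A minimal sketch of such a scheduler using PyTorch's ReduceLROnPlateau; the stand-in model, placeholder validation loss, and the factor/patience values are illustrative assumptions rather than recommendations:

<syntaxhighlight lang="python">
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)                       # stand-in for SimpleNet above
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate when the monitored loss stops improving for 2 epochs.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                 factor=0.5, patience=2)

for epoch in range(10):
    # ... run one epoch of training as in the loop above ...
    val_loss = 1.0 / (epoch + 1)                 # placeholder validation loss
    scheduler.step(val_loss)                     # decay LR when val loss plateaus
    print(epoch + 1, optimizer.param_groups[0]["lr"])
</syntaxhighlight>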
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Neural Network Trade-offs
! Consideration !! Deeper Networks !! Shallower Networks
|-
| Representational power || Higher – can model complex functions || Lower – limited to simpler decision boundaries
|-
| Training difficulty || Harder – vanishing/exploding gradients || Easier – gradients flow cleanly
|-
| Overfitting risk || Higher – more parameters || Lower – fewer parameters
|-
| Training data needed || Much more (thousands to millions of examples) || Less
|-
| Inference speed || Slower || Faster
|}

'''Common failure modes:'''
* '''Vanishing gradients''' – In deep networks, gradients shrink as they propagate backward, making early layers learn very slowly. Mitigated by ReLU, batch normalization, and residual connections.
* '''Exploding gradients''' – Gradients grow exponentially, causing catastrophically large weight updates. Mitigated by gradient clipping.
* '''Dead neurons''' – ReLU neurons that output zero for all inputs and receive no gradient, effectively becoming permanently inactive. Mitigated by Leaky ReLU or careful initialization.
* '''Data leakage''' – Accidentally allowing test-set information into training, leading to falsely optimistic evaluation metrics.
* '''Learning rate too high''' – The loss oscillates or diverges instead of converging. Use learning-rate warmup and reduce the rate on plateaus.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert practitioners evaluate neural networks along several dimensions beyond simple accuracy:

'''Generalization gap''': The difference between training accuracy and validation accuracy. A small gap with both accuracies high indicates good learning; a large gap indicates overfitting.

'''Learning curve analysis''': Plotting training and validation loss over epochs reveals whether the model is underfitting (both losses remain high) or overfitting (training loss keeps dropping while validation loss rises).

'''Ablation studies''': Systematically removing components (dropout, batch norm, skip connections) to understand what each contributes. This is how experts build principled understanding rather than cargo-culting architectures.

'''Calibration''': A well-calibrated model's confidence scores reflect true probabilities: a model that says it is 90% confident should be right about 90% of the time. Poorly calibrated models are dangerous in production. Use temperature scaling or Platt scaling to improve calibration (a temperature-scaling sketch appears at the end of this section).

Expert practitioners also think carefully about '''inductive biases''' – the assumptions about structure that are baked into an architecture. CNNs assume spatial locality and translation invariance. RNNs assume sequential dependencies. Transformers assume pairwise attention relationships. Choosing architectures whose inductive biases match the problem structure is a hallmark of expert design.
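A minimal sketch of temperature scaling, assuming a trained PyTorch classifier; the random logits and labels stand in for real outputs collected on a held-out validation set, and the LBFGS settings are illustrative:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Stand-ins for logits and labels gathered from a held-out validation set.
val_logits = torch.randn(1000, 10)
val_labels = torch.randint(0, 10, (1000,))

# A single learnable temperature that rescales the logits.
temperature = nn.Parameter(torch.ones(1))
optimizer = torch.optim.LBFGS([temperature], lr=0.1, max_iter=50)
nll = nn.CrossEntropyLoss()

def closure():
    optimizer.zero_grad()
    loss = nll(val_logits / temperature, val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
print(f"Fitted temperature: {temperature.item():.3f}")
# Calibrated probabilities at inference: softmax(logits / temperature).
</syntaxhighlight>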
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
To design a neural network system from scratch, follow this architectural reasoning process:

'''1. Define the problem precisely'''
* Classification, regression, generation, or reinforcement learning?
* Input modality: tabular, image, text, audio, graph?
* Output: single label, probability distribution, continuous value, sequence?

'''2. Choose an inductive bias that matches the data structure'''
* Tabular → MLP (fully connected)
* Images → CNN or Vision Transformer (ViT)
* Sequences → RNN, LSTM, or Transformer
* Graphs → GNN

'''3. Establish baselines before adding complexity'''
* Train a logistic regression or other linear model first
* Establish what "good enough" performance looks like
* Only add complexity when the gap to the baseline justifies it

'''4. Architecture skeleton'''
<syntaxhighlight lang="text">
Input → [Normalization] → [Feature Extractor Blocks] → [Bottleneck] → [Task Head] → Output
</syntaxhighlight>

'''5. Training infrastructure''' (a minimal sketch follows after this section)
* Reproducibility: fix random seeds, log hyperparameters
* Monitoring: TensorBoard or Weights & Biases for loss curves
* Checkpointing: save the best model by validation metric
* Early stopping: halt training when validation loss stops improving

'''6. Iterative scaling'''
Scale data first, then model size. A smaller model trained on more data usually outperforms a larger model trained on less data.

[[Category:Artificial Intelligence]]
[[Category:Deep Learning]]
[[Category:Machine Learning]]
</div>
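A minimal sketch of the step 5 checklist (seed fixing, best-model checkpointing, and early stopping), assuming PyTorch; the seed, patience, file name, and placeholder validation loss are illustrative choices rather than recommendations:

<syntaxhighlight lang="python">
import random
import numpy as np
import torch

# Reproducibility: fix the random seeds (42 is an arbitrary illustrative value).
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

best_val_loss = float("inf")
patience, epochs_without_improvement = 5, 0   # illustrative early-stopping budget

for epoch in range(100):
    # ... train for one epoch, then compute val_loss on held-out data ...
    val_loss = 1.0 / (epoch + 1)              # placeholder value for this sketch

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        # Checkpointing: save the best model by validation metric, e.g.
        # torch.save(model.state_dict(), "best_model.pt")
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break
</syntaxhighlight>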