Derivation Memory Skeletons
These are short memory cues, not full answers.
Logistic Regression
z = Xw + b
p = sigmoid(z)
- BCE loss
dL/dz = p - y
grad_w = X^T (p - y) / n
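One way to make the cue stick is to check the gradient formula numerically. The sketch below is only an illustration: the toy data, the helper names (`bce_loss`, `grad_w`, `numeric_grad_w`), and the choice of a mean (not summed) BCE loss are all assumptions, not part of the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(xi, w, b):
    # p = sigmoid(x . w + b) for a single row x
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

def bce_loss(X, y, w, b):
    # Mean binary cross-entropy over the n rows of X.
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(xi, w, b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

def grad_w(X, y, w, b):
    # Analytic gradient from the cue: X^T (p - y) / n.
    n = len(X)
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        residual = predict(xi, w, b) - yi
        for j, xj in enumerate(xi):
            g[j] += residual * xj / n
    return g

def numeric_grad_w(X, y, w, b, eps=1e-6):
    # Central finite differences, one weight at a time.
    g = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        g.append((bce_loss(X, y, w_plus, b) - bce_loss(X, y, w_minus, b)) / (2 * eps))
    return g

# Made-up toy data, purely for the check.
X = [[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5], [0.0, 0.3]]
y = [1, 0, 1, 0]
w = [0.1, -0.2]
b = 0.05

analytic = grad_w(X, y, w, b)
numeric = numeric_grad_w(X, y, w, b)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places. The `/ n` in the cue comes from averaging the loss; with a summed loss it drops out.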
Softmax + CE
- write softmax
- write CE
- use one-hot target
- result: dL/dz = p - y
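The steps above can be checked with a small numerical sketch. The logits, the one-hot target, and the helper names here are made up for illustration. The max subtraction inside `softmax` is a common stability trick that the notes do not mention.

```python
import math

def softmax(z):
    m = max(z)  # subtracting the max does not change the result
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, y):
    # CE against a one-hot target y: -sum_k y_k * log p_k
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

def grad_logits(z, y):
    # The memorized result: gradient w.r.t. the logits is p - y.
    return [pk - yk for pk, yk in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    # Central finite differences on each logit.
    g = []
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        g.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return g

z = [2.0, -1.0, 0.5]   # arbitrary logits
y = [0, 1, 0]          # one-hot target for class 1

analytic = grad_logits(z, y)
numeric = numeric_grad(z, y)
print(analytic)
print(numeric)
```

Because both p and y sum to 1, the entries of p - y sum to zero, which is a quick sanity check on any hand derivation.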
Bernoulli MLE
- write Bernoulli likelihood
- take log
- differentiate w.r.t. p
- solve -> sample mean
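As a rough check of these steps, the sketch below evaluates the log-likelihood and its derivative on a made-up sample. The data and the function names are illustrative assumptions.

```python
import math

def log_likelihood(p, xs):
    # log of prod p^x (1 - p)^(1 - x) = k log p + (n - k) log(1 - p)
    k = sum(xs)
    n = len(xs)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def score(p, xs):
    # Derivative w.r.t. p: k / p - (n - k) / (1 - p)
    k = sum(xs)
    n = len(xs)
    return k / p - (n - k) / (1 - p)

xs = [1, 0, 1, 1, 0, 1, 0, 1]   # made-up coin flips
p_hat = sum(xs) / len(xs)       # setting the score to zero gives k / n

print(p_hat, score(p_hat, xs))
```

The score vanishes at the sample mean, and the log-likelihood is lower on either side of it.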
Gaussian MLE
- write Gaussian log-likelihood
- derive w.r.t. mu
- derive w.r.t. sigma^2
- note MLE uses / n (not / (n - 1))
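The closed-form estimates these steps lead to can be sketched as follows. The sample and the function names are made up for illustration. The comparison against the `/ (n - 1)` variance is only there to show that the `/ n` version is the one that maximizes the likelihood.

```python
import math

def log_likelihood(xs, mu, var):
    # Gaussian log-likelihood: -n/2 * log(2 pi var) - sum (x - mu)^2 / (2 var)
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sq / (2 * var)

def gaussian_mle(xs):
    n = len(xs)
    mu_hat = sum(xs) / n                                 # from d/d mu = 0
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n     # from d/d sigma^2 = 0, divides by n
    return mu_hat, var_hat

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
mu_hat, var_hat = gaussian_mle(xs)

# The unbiased estimator divides by n - 1 instead.
var_unbiased = var_hat * len(xs) / (len(xs) - 1)

print(mu_hat, var_hat, var_unbiased)
```

On this sample the MLE variance is smaller than the unbiased one, and the log-likelihood at `var_hat` is higher than at `var_unbiased`.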
Confidence Interval
- estimate
- standard error
- critical value
- center +/- margin
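The four steps above map onto a few lines of code. This sketch assumes a confidence interval for a mean with a normal (z) critical value. That is a simplifying assumption: for small samples a t critical value is the usual choice, but the standard library does not provide one. The sample and the function name are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, level=0.95):
    n = len(xs)
    center = mean(xs)                            # estimate
    se = stdev(xs) / math.sqrt(n)                # standard error (sample std, n - 1)
    z = NormalDist().inv_cdf(0.5 + level / 2)    # critical value, about 1.96 for 95%
    margin = z * se
    return center - margin, center + margin      # center +/- margin

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
low, high = mean_confidence_interval(xs)
print(low, high)
```

Raising `level` widens the interval, because the critical value grows while the estimate and standard error stay the same.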
Attention Shapes
Q (n, d_k)
K (n, d_k)
QK^T -> (n, n)
- multiply by V (n, d_v) -> (n, d_v)
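The shape bookkeeping can be traced with plain nested lists. The sizes, the matrix values, and the helper names below are arbitrary assumptions. The sketch also applies the usual 1/sqrt(d_k) scaling and a row-wise softmax, which the cue omits; neither changes any shape.

```python
import math

def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # (r, k) @ (k, c) -> (r, c)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

n, d_k, d_v = 4, 3, 2   # arbitrary sizes
Q = [[0.1 * (i + j) for j in range(d_k)] for i in range(n)]       # (n, d_k)
K = [[0.2 * (i - j) for j in range(d_k)] for i in range(n)]       # (n, d_k)
V = [[float(i * d_v + j) for j in range(d_v)] for i in range(n)]  # (n, d_v)

scores = matmul(Q, transpose(K))   # (n, d_k) @ (d_k, n) -> (n, n)
weights = softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])  # still (n, n)
out = matmul(weights, V)           # (n, n) @ (n, d_v) -> (n, d_v)

print(shape(scores), shape(weights), shape(out))
```

The inner dimension d_k disappears in QK^T, and the output inherits its last dimension from V.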