30 Linear Algebra Concepts to Remember
Posted on Sep 13, 2024 @ 08:14 AM under Machine Learning
1. Vectors
- Definition: An ordered list of numbers, which can represent data points, features, or coefficients in ML.
- Example: A vector $\mathbf{v} = [2, 3]$ can represent a point in a 2-dimensional feature space.
2. Matrices
- Definition: A rectangular array of numbers arranged in rows and columns, used to represent datasets, transformations, and more.
- Example: A matrix $\mathbf{M} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ can represent a transformation matrix that rotates or scales vectors.
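A minimal NumPy sketch of both of the above (the arrays are just the vector and matrix from the examples):

```python
import numpy as np

v = np.array([2, 3])          # the vector [2, 3] from the example above
M = np.array([[1, 2],
              [3, 4]])        # a 2x2 matrix

print(v.shape)   # (2,)
print(M.shape)   # (2, 2)
print(M @ v)     # applying M to v as a linear map -> [ 8 18]
```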
3. Matrix Multiplication
- Definition: An operation where two matrices are multiplied to produce another matrix, used to apply transformations or combine datasets.
- Example: If $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{B} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$, then $\mathbf{A} \times \mathbf{B} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$.
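A quick NumPy check of this product (illustrative only; `@` is NumPy's matrix-multiplication operator):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix product: each entry is a row-by-column sum of products
print(A @ B)
# [[19 22]
#  [43 50]]
```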
4. Dot Product
- Definition: A scalar value obtained by multiplying corresponding elements of two vectors and summing the results, used to measure similarity.
- Example: For vectors $\mathbf{a} = [1, 2]$ and $\mathbf{b} = [3, 4]$, the dot product is $1 \times 3 + 2 \times 4 = 11$.
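A one-line sketch in NumPy:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

# 1*3 + 2*4 = 11
print(np.dot(a, b))   # 11
print(a @ b)          # same result
```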
5. Norm (Length)
- Definition: A measure of the length or magnitude of a vector.
- Example: The Euclidean norm of vector $\mathbf{v} = [3, 4]$ is $\sqrt{3^2 + 4^2} = 5$.
6. Euclidean Distance
- Definition: The distance between two points in space, calculated using the norm.
- Example: The distance between points $[1, 2]$ and $[4, 6]$ is $\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$.
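Both the norm from concept 5 and the Euclidean distance reduce to `np.linalg.norm`; a minimal sketch:

```python
import numpy as np

v = np.array([3, 4])
p = np.array([1, 2])
q = np.array([4, 6])

print(np.linalg.norm(v))       # 5.0  (Euclidean norm of [3, 4])
print(np.linalg.norm(q - p))   # 5.0  (distance between [1, 2] and [4, 6])
```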
7. Orthogonality
- Definition: When two vectors are perpendicular to each other, meaning their dot product is zero.
- Example: Vectors $[1, 0]$ and $[0, 1]$ are orthogonal because their dot product is $1 \times 0 + 0 \times 1 = 0$.
8. Eigenvalues and Eigenvectors
- Definition: For a matrix $\mathbf{A}$, an eigenvector is a non-zero vector that only gets scaled (not rotated) by $\mathbf{A}$, and the eigenvalue is the factor by which it is scaled.
- Example: For matrix $\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, the eigenvalues are 3 and 1, with corresponding eigenvectors $[1, 1]$ and $[-1, 1]$.
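A NumPy sketch (note that `np.linalg.eig` returns unit-length eigenvectors, so they appear as scaled versions of $[1, 1]$ and $[-1, 1]$, and the order of the eigenvalues may vary):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues)    # e.g. [3. 1.]
print(eigenvectors)   # columns proportional to [1, 1] and [-1, 1]

# Verify A v = lambda v for the first eigenpair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True
```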
9. Singular Value Decomposition (SVD)
- Definition: A factorization of a matrix into three other matrices, used for dimensionality reduction and other applications.
- Example: If $\mathbf{M}$ is a matrix, then $\mathbf{M} = \mathbf{U} \Sigma \mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values.
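A minimal NumPy sketch, reusing the matrix from concept 2:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
U, S, Vt = np.linalg.svd(M)

print(S)                                       # singular values, largest first
print(np.allclose(U @ np.diag(S) @ Vt, M))     # True: M is reconstructed exactly
```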
10. Principal Component Analysis (PCA)
- Definition: A technique for reducing the dimensionality of data while preserving as much variance as possible, using eigenvectors of the covariance matrix.
- Example: PCA can reduce a dataset from 3 dimensions to 2 dimensions by projecting it onto the directions (principal components) with the most variance.
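A minimal PCA sketch via the eigendecomposition of the covariance matrix (the random 3-D data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features (illustrative data)

X_centered = X - X.mean(axis=0)            # center the data
cov = np.cov(X_centered, rowvar=False)     # 3x3 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric matrices, ascending order
components = eigenvectors[:, ::-1][:, :2]         # top-2 principal components

X_reduced = X_centered @ components        # project 3-D data down to 2-D
print(X_reduced.shape)                     # (100, 2)
```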
11. Linear Transformation
- Definition: A function that maps vectors to other vectors, preserving vector addition and scalar multiplication.
- Example: Rotating a vector $[x, y]$ by 90 degrees in a 2D plane is a linear transformation.
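That rotation is just a matrix-vector product; a quick sketch:

```python
import numpy as np

theta = np.pi / 2                                  # 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # 2-D rotation matrix

v = np.array([1.0, 0.0])
print(np.round(R @ v, 10))   # [0. 1.]: [1, 0] rotated 90 degrees counter-clockwise
```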
12. Matrix Inversion
- Definition: Finding a matrix $\mathbf{A}^{-1}$ such that $\mathbf{A} \times \mathbf{A}^{-1} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.
- Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the inverse is $\mathbf{A}^{-1} = \frac{1}{-2} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix}$.
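A quick NumPy check (illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)

print(A_inv)                                 # [[-2.   1. ]
                                             #  [ 1.5 -0.5]]
print(np.allclose(A @ A_inv, np.eye(2)))     # True
```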
13. Determinant
- Definition: A scalar value that can be computed from a square matrix, used to determine if a matrix is invertible and to understand certain properties of the matrix.
- Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the determinant is $1 \times 4 - 2 \times 3 = -2$.
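In NumPy (the result carries a little floating-point error):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
print(np.linalg.det(A))   # -2.0 (up to floating-point error)
```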
14. Rank
- Definition: The dimension of the vector space generated by the columns (or rows) of a matrix, indicating the number of linearly independent columns (or rows).
- Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the rank is 2 because both columns are linearly independent.
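A sketch contrasting a full-rank matrix with a rank-deficient one (the second matrix is just an illustrative example):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 2], [2, 4]])    # second row is a multiple of the first

print(np.linalg.matrix_rank(A))   # 2 (full rank)
print(np.linalg.matrix_rank(B))   # 1 (linearly dependent rows)
```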
15. Least Squares
- Definition: A method for finding the best-fitting line or hyperplane by minimizing the sum of the squares of the differences between observed and predicted values.
- Example: In linear regression, least squares is used to fit a line $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ to the data by minimizing the residual sum of squares.
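A minimal least-squares sketch with `np.linalg.lstsq` (the noisy data is purely illustrative):

```python
import numpy as np

# Illustrative data: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

X = np.column_stack([x, np.ones_like(x)])      # design matrix with a bias column
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

print(w)   # approximately [2.0, 1.0]: fitted slope and intercept
```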
16. Gradient Descent
- Definition: An optimization algorithm used to find the minimum of a function by iteratively moving in the direction of the steepest descent.
- Example: In training a neural network, gradient descent updates weights to minimize the loss function.
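A bare-bones gradient-descent sketch on a least-squares objective (the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

# Minimize the mean squared error ||X w - y||^2 / n with plain gradient descent
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
lr = 0.01
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad                          # step in the direction of steepest descent

print(w)   # close to [3.0, -1.0]
```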
17. Orthogonal Matrix
- Definition: A square matrix whose rows and columns are orthogonal unit vectors, meaning $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$.
- Example: The matrix $\mathbf{Q} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ is orthogonal because its transpose is its inverse.
18. Norms of Matrices
- Definition: Measures of the size or length of matrices, such as the Frobenius norm, which is the square root of the sum of the absolute squares of its elements.
- Example: For matrix $\mathbf{M} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the Frobenius norm is $\sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}$.
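A quick check in NumPy:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
print(np.linalg.norm(M, 'fro'))   # sqrt(30) ≈ 5.477
print(np.sqrt(30))                # same value
```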
19. Convex Functions
- Definition: Functions where the line segment between any two points on the graph of the function lies above or on the graph.
- Example: The quadratic function $f(x) = x^2$ is convex because any line segment between two points on the graph is above the graph.
20. Duality
- Definition: The concept where every optimization problem has a corresponding dual problem, which provides bounds on the original problem’s solution.
- Example: In support vector machines (SVMs), the primal problem of finding the maximum-margin hyperplane has a dual formulation in terms of Lagrange multipliers, which bounds the primal solution and enables the kernel trick.
21. Regularization
- Definition: Techniques used to prevent overfitting by adding a penalty term to the loss function, such as L1 (lasso) or L2 (ridge) regularization.
- Example: In linear regression, adding a penalty term to the coefficients $\mathbf{w}$ helps avoid overly complex models.
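A sketch of ridge (L2) regularization via its closed-form solution $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ (the data and the value of $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)

lam = 1.0   # regularization strength (lambda)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_ridge)   # close to [1.0, 2.0, 3.0]; larger lam shrinks the weights toward zero
```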
22. Kernel Methods
- Definition: Techniques that use kernel functions to map data into higher-dimensional space, making it possible to find linear boundaries in complex data.
- Example: The Radial Basis Function (RBF) kernel in SVMs maps data into a higher-dimensional space to handle non-linearly separable data.
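A minimal sketch of computing an RBF kernel (Gram) matrix (the `rbf_kernel` helper and its `gamma` value are illustrative, not a library API):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Illustrative RBF kernel: K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
K = rbf_kernel(X, X)
print(K.shape)       # (3, 3) kernel (Gram) matrix
print(np.diag(K))    # [1. 1. 1.]: each point has similarity 1 with itself
```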
23. Gram-Schmidt Process
- Definition: An orthogonalization process for converting a set of vectors into an orthogonal set while preserving their span.
- Example: Applying the Gram-Schmidt process to the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ results in an orthogonal basis for the space spanned by $\mathbf{v}_1$ and $\mathbf{v}_2$.
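A small, illustrative implementation of classical Gram-Schmidt (the `gram_schmidt` helper is just a sketch):

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: returns an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = v - sum(np.dot(v, b) * b for b in basis)   # remove components along the existing basis
        norm = np.linalg.norm(w)
        if norm > 1e-10:                               # skip vectors that are linearly dependent
            basis.append(w / norm)
    return np.array(basis)

v1, v2 = np.array([1.0, 1.0]), np.array([1.0, 0.0])
Q = gram_schmidt([v1, v2])
print(np.round(Q @ Q.T, 10))   # identity matrix: the rows are orthonormal
```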
24. Cholesky Decomposition
- Definition: A factorization of a symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose.
- Example: For matrix $\mathbf{A}$, the Cholesky decomposition finds a matrix $\mathbf{L}$ such that $\mathbf{A} = \mathbf{L} \mathbf{L}^T$, useful in solving systems of linear equations.
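A NumPy sketch (the matrix is an arbitrary symmetric positive-definite example):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive-definite
L = np.linalg.cholesky(A)             # lower triangular factor

print(L)
print(np.allclose(L @ L.T, A))        # True: A = L L^T
```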
25. QR Decomposition
- Definition: A factorization of a matrix into an orthogonal matrix $\mathbf{Q}$ and an upper triangular matrix $\mathbf{R}$.
- Example: QR decomposition can be computed with the Gram-Schmidt process (or Householder reflections) and is used in solving linear least squares problems and in eigenvalue algorithms.
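A minimal sketch with `np.linalg.qr` (the 3x2 matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Q, R = np.linalg.qr(A)

print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q has orthonormal columns
print(np.allclose(np.triu(R), R))        # True: R is upper triangular
print(np.allclose(Q @ R, A))             # True: A = Q R
```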
26. Jacobian Matrix
- Definition: A matrix of all first-order partial derivatives of a vector-valued function, representing the local linear approximation.
- Example: In neural networks, the Jacobian matrix of the activation function helps in computing gradients for backpropagation.
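Autodiff frameworks compute Jacobians for you; as a dependency-free sketch, here is a finite-difference approximation (the function `f` and the `jacobian_fd` helper are illustrative):

```python
import numpy as np

def f(x):
    """Illustrative vector-valued function f: R^2 -> R^2."""
    return np.array([x[0] ** 2, x[0] * x[1]])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference approximation of the Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - fx) / eps
    return J

x = np.array([2.0, 3.0])
print(jacobian_fd(f, x))   # approximately [[4, 0], [3, 2]]
```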
27. Hessian Matrix
- Definition: A matrix of second-order partial derivatives of a scalar-valued function, used to analyze the curvature of the function.
- Example: In optimization, the Hessian matrix helps determine whether a point is a local minimum, maximum, or saddle point.
28. Low-Rank Approximation
- Definition: A method to approximate a matrix with a matrix of lower rank, used for dimensionality reduction and data compression.
- Example: Singular Value Decomposition (SVD) can be used to approximate a large matrix by keeping only the largest singular values.
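A sketch of a rank-$k$ truncation via SVD (the random matrix and the choice of $k$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 50))

U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 10                                              # keep only the top-10 singular values
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]         # best rank-k approximation in the Frobenius norm

print(np.linalg.matrix_rank(M_k))                   # 10
print(np.linalg.norm(M - M_k) / np.linalg.norm(M))  # relative approximation error
```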
29. Softmax Function
- Definition: A function that converts a vector of values into probabilities by exponentiating and normalizing them.
- Example: In classification problems, the softmax function is used to convert raw scores (logits) into probabilities for each class.
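A minimal, numerically stable sketch:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)         # approximately [0.659 0.242 0.099]
print(probs.sum())   # 1.0
```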
30. Attention Mechanisms
- Definition: Techniques in neural networks that allow the model to focus on different parts of the input sequence, weighted by learned attention scores.
- Example: In natural language processing, attention mechanisms help the model focus on relevant words when translating a sentence.
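A minimal single-head, scaled dot-product attention sketch in NumPy (the shapes and helper function are illustrative; real implementations add masking, multiple heads, and learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention sketch: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax: attention weights
    return weights @ V                                # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, dimension 8 (illustrative shapes)
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # 6 value vectors

print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```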