30 Linear Algebra Concepts to Remember

Posted on Sep 13, 2024 @ 08:14 AM under Machine Learning

1. Vectors

  • Definition: An ordered list of numbers, which can represent data points, features, or coefficients in ML.
  • Example: A vector $\mathbf{v} = [2, 3]$ can represent a point in a 2-dimensional feature space.

2. Matrices

  • Definition: A rectangular array of numbers arranged in rows and columns, used to represent datasets, transformations, and more.
  • Example: A matrix $\mathbf{M} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ can represent a transformation matrix that rotates or scales vectors.
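
To see these first two concepts in code, here is a minimal NumPy sketch (NumPy is just one reasonable choice of array library):

```python
import numpy as np

v = np.array([2, 3])            # a point in a 2-D feature space
M = np.array([[1, 2],
              [3, 4]])          # a 2x2 transformation matrix

print(M @ v)                    # apply the transformation to the vector: [ 8 18]
```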

3. Matrix Multiplication

  • Definition: An operation where two matrices are multiplied to produce another matrix, used to apply transformations or combine datasets.
  • Example: If $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{B} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$, then $\mathbf{A} \times \mathbf{B} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$.
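
A quick NumPy check of this product:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)   # [[19 22]
               #  [43 50]]
```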

4. Dot Product

  • Definition: A scalar value obtained by multiplying corresponding elements of two vectors and summing the results, used to measure similarity.
  • Example: For vectors $\mathbf{a} = [1, 2]$ and $\mathbf{b} = [3, 4]$, the dot product is $1 \times 3 + 2 \times 4 = 11$.
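
The same computation with NumPy:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.dot(a, b))   # 1*3 + 2*4 = 11
print(a @ b)          # same result with the @ operator
```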

5. Norm (Length)

  • Definition: A measure of the length or magnitude of a vector.
  • Example: The Euclidean norm of vector $\mathbf{v} = [3, 4]$ is $\sqrt{3^2 + 4^2} = 5$.

6. Euclidean Distance

  • Definition: The distance between two points in space, calculated using the norm.
  • Example: The distance between points $[1, 2]$ and $[4, 6]$ is $\sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$.
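
Both the norm (concept 5) and the distance above in NumPy:

```python
import numpy as np

v = np.array([3, 4])
p, q = np.array([1, 2]), np.array([4, 6])

print(np.linalg.norm(v))       # Euclidean norm: 5.0
print(np.linalg.norm(q - p))   # Euclidean distance between the two points: 5.0
```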

7. Orthogonality

  • Definition: When two vectors are perpendicular to each other, meaning their dot product is zero.
  • Example: Vectors $[1, 0]$ and $[0, 1]$ are orthogonal because their dot product is $1 \times 0 + 0 \times 1 = 0$.

8. Eigenvalues and Eigenvectors

  • Definition: For a matrix $\mathbf{A}$, an eigenvector is a non-zero vector that only gets scaled (not rotated) by $\mathbf{A}$, and the eigenvalue is the factor by which it is scaled.
  • Example: For matrix $\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, the eigenvalues are 3 and 1, with corresponding eigenvectors $[1, 1]$ and $[-1, 1]$.
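
NumPy recovers these values directly (note that `np.linalg.eig` returns normalized eigenvectors, so they may differ from $[1, 1]$ and $[-1, 1]$ by a scale factor):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]])
eigvals, eigvecs = np.linalg.eig(A)

print(eigvals)    # eigenvalues 3 and 1 (order may vary)
print(eigvecs)    # columns are the corresponding unit-length eigenvectors
```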

9. Singular Value Decomposition (SVD)

  • Definition: A factorization of a matrix into three other matrices, used for dimensionality reduction and other applications.
  • Example: If $\mathbf{M}$ is a matrix, then $\mathbf{M} = \mathbf{U} \Sigma \mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values.
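
A quick sketch of computing and verifying an SVD with NumPy:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
U, s, Vt = np.linalg.svd(M)

# s holds the singular values; rebuild M from the three factors
print(np.allclose(U @ np.diag(s) @ Vt, M))   # True
```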

10. Principal Component Analysis (PCA)

  • Definition: A technique for reducing the dimensionality of data while preserving as much variance as possible, using eigenvectors of the covariance matrix.
  • Example: PCA can reduce a dataset from 3 dimensions to 2 dimensions by projecting it onto the directions (principal components) with the most variance.
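
A minimal NumPy sketch of PCA via the eigenvectors of the covariance matrix; the random data and the choice of 2 components are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices

# keep the 2 eigenvectors with the largest eigenvalues as principal components
components = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
X_reduced = Xc @ components              # project 3-D data down to 2-D
print(X_reduced.shape)                   # (100, 2)
```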

11. Linear Transformation

  • Definition: A function that maps vectors to other vectors, preserving vector addition and scalar multiplication.
  • Example: Rotating a vector $[x, y]$ by 90 degrees in a 2D plane is a linear transformation.
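
A sketch of that 90-degree rotation as a matrix-vector product:

```python
import numpy as np

theta = np.pi / 2                       # 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.round(R @ np.array([1, 0])))   # [0. 1.] -- the x-axis rotated onto the y-axis
```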

12. Matrix Inversion

  • Definition: Finding a matrix $\mathbf{A}^{-1}$ such that $\mathbf{A} \times \mathbf{A}^{-1} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.
  • Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the inverse is $\mathbf{A}^{-1} = \frac{1}{-2} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix}$.
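
Verifying this with NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)

print(A_inv)                               # [[-2.   1. ]
                                           #  [ 1.5 -0.5]]
print(np.allclose(A @ A_inv, np.eye(2)))   # True
```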

13. Determinant

  • Definition: A scalar value that can be computed from a square matrix, used to determine if a matrix is invertible and to understand certain properties of the matrix.
  • Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the determinant is $1 \times 4 - 2 \times 3 = -2$.

14. Rank

  • Definition: The dimension of the vector space generated by the columns (or rows) of a matrix, indicating the number of linearly independent columns (or rows).
  • Example: For $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the rank is 2 because both columns are linearly independent.
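
Determinant (concept 13) and rank in NumPy; the rank-deficient matrix `B` is an added illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

print(np.linalg.det(A))           # -2.0 (up to floating-point error)
print(np.linalg.matrix_rank(A))   # 2 -- both columns are independent

B = np.array([[1, 2], [2, 4]])    # second row is twice the first
print(np.linalg.matrix_rank(B))   # 1
```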

15. Least Squares

  • Definition: A method for finding the best-fitting line or hyperplane by minimizing the sum of the squares of the differences between observed and predicted values.
  • Example: In linear regression, least squares is used to fit a line $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ to the data by minimizing the residual sum of squares.
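
A minimal least-squares sketch with NumPy; the tiny dataset is an illustrative assumption:

```python
import numpy as np

# y is roughly 2*x + 1 plus a little noise
X = np.array([[1.0, 1], [1, 2], [1, 3], [1, 4]])   # first column of ones for the intercept
y = np.array([3.1, 4.9, 7.2, 8.8])

w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)   # [intercept, slope] ~ [1.15, 1.94], close to the underlying 1 and 2
```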

16. Gradient Descent

  • Definition: An optimization algorithm used to find the minimum of a function by iteratively moving in the direction of the steepest descent.
  • Example: In training a neural network, gradient descent updates weights to minimize the loss function.
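
A toy sketch of gradient descent on $f(x) = x^2$; the starting point and learning rate are arbitrary choices:

```python
# gradient descent on f(x) = x^2, whose gradient is 2*x
x = 5.0          # arbitrary starting point
lr = 0.1         # learning rate (illustrative choice)

for _ in range(100):
    x -= lr * 2 * x   # step against the gradient

print(x)   # very close to 0, the minimizer of x^2
```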

17. Orthogonal Matrix

  • Definition: A square matrix whose rows and columns are orthogonal unit vectors, meaning $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$.
  • Example: The matrix $\mathbf{Q} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ is orthogonal because its transpose is its inverse.

18. Norms of Matrices

  • Definition: Measures of the size or length of matrices, such as the Frobenius norm, which is the square root of the sum of the absolute squares of its elements.
  • Example: For matrix $\mathbf{M} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, the Frobenius norm is $\sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}$.

19. Convex Functions

  • Definition: Functions where the line segment between any two points on the graph of the function lies above or on the graph.
  • Example: The quadratic function $f(x) = x^2$ is convex because any line segment between two points on the graph is above the graph.

20. Duality

  • Definition: The concept where every optimization problem has a corresponding dual problem, which provides bounds on the original problem’s solution.
  • Example: In support vector machines (SVMs), the primal problem of finding the maximum-margin hyperplane has a dual problem expressed in terms of Lagrange multipliers, which is what makes the kernel trick possible.

21. Regularization

  • Definition: Techniques used to prevent overfitting by adding a penalty term to the loss function, such as L1 (lasso) or L2 (ridge) regularization.
  • Example: In linear regression, adding a penalty term to the coefficients $\mathbf{w}$ helps avoid overly complex models.
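
A sketch of ridge (L2) regularization using its closed-form solution $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$; the data and $\lambda$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

lam = 1.0   # regularization strength lambda
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(w_ridge)   # coefficients shrunk toward zero compared to ordinary least squares
```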

22. Kernel Methods

  • Definition: Techniques that use kernel functions to map data into higher-dimensional space, making it possible to find linear boundaries in complex data.
  • Example: The Radial Basis Function (RBF) kernel in SVMs maps data into a higher-dimensional space to handle non-linearly separable data.
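
A minimal RBF kernel sketch; the `gamma` value and input vectors are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Radial Basis Function kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
print(rbf_kernel(x, y))   # similarity in the implicit high-dimensional feature space
```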

23. Gram-Schmidt Process

  • Definition: An orthogonalization process for converting a set of vectors into an orthogonal set while preserving their span.
  • Example: Applying the Gram-Schmidt process to the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ results in an orthogonal basis for the space spanned by $\mathbf{v}_1$ and $\mathbf{v}_2$.
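
A minimal (classical, not numerically robust) Gram-Schmidt sketch; the two input vectors are arbitrary examples:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        # subtract the projection of v onto each previously accepted basis vector
        w = v - sum(np.dot(v, b) * b for b in basis)
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

Q = gram_schmidt([np.array([3.0, 1.0]), np.array([2.0, 2.0])])
print(np.round(Q @ Q.T))   # identity matrix: the rows are orthonormal
```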

24. Cholesky Decomposition

  • Definition: A factorization of a symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose.
  • Example: For matrix $\mathbf{A}$, the Cholesky decomposition finds a matrix $\mathbf{L}$ such that $\mathbf{A} = \mathbf{L} \mathbf{L}^T$, useful in solving systems of linear equations.
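
A NumPy sketch; the matrix and right-hand side are illustrative assumptions:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])              # symmetric positive-definite

L = np.linalg.cholesky(A)               # lower triangular factor
print(np.allclose(L @ L.T, A))          # True

# solve A x = b via two triangular solves
b = np.array([1.0, 2.0])
x = np.linalg.solve(L.T, np.linalg.solve(L, b))
print(np.allclose(A @ x, b))            # True
```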

25. QR Decomposition

  • Definition: A factorization of a matrix into an orthogonal matrix $\mathbf{Q}$ and an upper triangular matrix $\mathbf{R}$.
  • Example: QR decomposition is used in solving linear least squares problems; it can be computed with algorithms such as the Gram-Schmidt process or Householder reflections.
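
Solving a small least-squares problem via QR in NumPy; the data are illustrative:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

Q, R = np.linalg.qr(A)
# least squares via QR: solve R w = Q^T b
w = np.linalg.solve(R, Q.T @ b)
print(w)   # [0.  0.5], matching np.linalg.lstsq(A, b, rcond=None)[0]
```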

26. Jacobian Matrix

  • Definition: A matrix of all first-order partial derivatives of a vector-valued function, representing the local linear approximation.
  • Example: In neural networks, the Jacobian matrix of the activation function helps in computing gradients for backpropagation.

27. Hessian Matrix

  • Definition: A matrix of second-order partial derivatives of a scalar-valued function, used to analyze the curvature of the function.
  • Example: In optimization, the Hessian matrix helps determine whether a point is a local minimum, maximum, or saddle point.
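
A finite-difference sketch of the gradient (a one-row Jacobian) and the Hessian for a small scalar function; the function and step size are illustrative assumptions:

```python
import numpy as np

def f(x):                       # scalar-valued function of a 2-D input
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad(x, h=1e-5):            # central-difference gradient
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

def hessian(x, h=1e-5):         # Hessian as the Jacobian of the gradient
    return np.array([(grad(x + h * e) - grad(x - h * e)) / (2 * h) for e in np.eye(len(x))])

x = np.array([1.0, 2.0])
print(grad(x))      # approximately [2x + 3y, 3x] = [8, 3]
print(hessian(x))   # approximately [[2, 3], [3, 0]]
```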

28. Low-Rank Approximation

  • Definition: A method to approximate a matrix with a matrix of lower rank, used for dimensionality reduction and data compression.
  • Example: Singular Value Decomposition (SVD) can be used to approximate a large matrix by keeping only the largest singular values.
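
A rank-2 approximation of a random matrix via truncated SVD (the matrix size and chosen rank are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 5))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2   # keep only the 2 largest singular values
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(M_approx))   # 2
print(np.linalg.norm(M - M_approx))      # approximation error
```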

29. Softmax Function

  • Definition: A function that converts a vector of values into probabilities by exponentiating and normalizing them.
  • Example: In classification problems, the softmax function is used to convert raw scores (logits) into probabilities for each class.
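
A minimal softmax sketch with the usual max-shift for numerical stability:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift for numerical stability (does not change the result)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))     # probabilities that sum to 1, roughly [0.66, 0.24, 0.10]
```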

30. Attention Mechanisms

  • Definition: Techniques in neural networks that allow the model to focus on different parts of the input sequence, weighted by learned attention scores.
  • Example: In natural language processing, attention mechanisms help the model focus on relevant words when translating a sentence.
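
A minimal sketch of scaled dot-product attention in plain NumPy; the shapes and random inputs are illustrative assumptions, and real attention layers add learned projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, dimension 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # 6 value vectors

scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product similarity
weights = softmax(scores, axis=-1)        # attention weights, each row sums to 1
output = weights @ V                      # weighted combination of the values
print(output.shape)                       # (4, 8)
```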