Week 3 Analytic Exercises, May 3rd

This chapter is devoted to improving your mathematical competence. A couple of exercises in this course require an understanding of fundamental mathematical concepts, rules, and notations for solving equations. Furthermore, being acquainted with mathematical concepts will be conducive to gaining an intuition for statistical methods such as different distribution functions or maximum likelihood estimation. After some general tips, we will link you to a couple of examples to give you an idea about what kinds of equations might await you in the upcoming exercises.

3.1 General Tips

1. The Lookup Principle

The first tip is arguably the most important and, fortunately, the simplest.

Often mathematics can seem like a foreign language. If you look at a mathematical proof, for example, and feel as if you cannot follow it, chances are that there are a couple of symbols that you cannot interpret or understand yet.

The best way to change this situation is to deliberately note down and look up every single symbol you currently do not understand. You can do this either by searching the web or by consulting encyclopedic web sources. Once you have looked up the symbols, try to apply them and understand the proof or equation you are trying to figure out. Practicing what you have learned will help to consolidate it in your long-term memory.

If done correctly, you will continuously improve with each exercise, proof, or equation you work through.

2. Practice fundamental rules

When reading a proof or trying to derive a proof yourself, you will use certain equivalences or transformations over and over again. Arguably, it is even necessary to know some of these equivalences by heart. As one example, you will want to know how to work with logarithms. This includes knowing that \(log(x^y) = y \cdot log(x)\). Also, you might want to know some fundamental derivatives such as the derivative of \(f(x)=log(x)\), which is \(f'(x)=1/x\).
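For reference, a few of these frequently used identities and elementary derivatives (stated here without proof, with log denoting the natural logarithm) are:

\[\begin{eqnarray} log(x \cdot y) & = & log(x) + log(y)\\ log(x/y) & = & log(x) - log(y)\\ log(x^y) & = & y \cdot log(x)\\ f(x) = log(x) & \longrightarrow & f'(x) = 1/x\\ f(x) = e^x & \longrightarrow & f'(x) = e^x \end{eqnarray}\]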

3. Expand your toolbox

Next to fundamental mathematical rules, there are some more abstract and general steps that are often useful when solving equations.

Imagine you have the following equation:

\[\frac{2}{x} + \frac{3}{y} = 1\]

This is a common situation in which the unknown variables appear in denominators, which is not desirable when trying to solve for them. Intuitively, you might multiply the equation by \(x \cdot y\) in order to remove the variables \(x\) and \(y\) from the denominators. You’d obtain

\[2y + 3x = xy\]

which is much easier to work with.

Another example would be simplifying equations by putting common factors outside of brackets. As an example, you might write

\[6x^2+3x = 0\]

as

\[3x(2x + 1) = 0\]

This makes it considerably easier to see for which values of \(x\) the equation holds: either \(x = 0\) or \(2x + 1 = 0\), i.e. \(x = -\frac{1}{2}\).
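If you want to double-check such manipulations, a small sketch using the sympy library (assuming it is installed) confirms the factorisation and the resulting roots:

```python
import sympy as sp

x = sp.symbols("x")

# Factor the left-hand side of 6x^2 + 3x = 0 and solve for x
expr = 6 * x**2 + 3 * x
print(sp.factor(expr))               # 3*x*(2*x + 1)
print(sp.solve(sp.Eq(expr, 0), x))   # [-1/2, 0]
```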

There are many more tools and tricks which make solving equations easier. Reading and following a couple of other people's proofs and taking note of transformations you have not thought of before can go a long way towards applying them yourself when working on your own proofs. Quite often, having a big “toolbox” is the key to solving mathematical equations.

4. Follow the proofs of others

You might have already noticed that learning from the proofs and mathematical reasoning of others can be very effective for building your mathematical competence. Apply the lookup principle, take note of fundamental rules you might have missed, and try to actively expand your toolbox while following proofs. Most importantly, aim at proofs that are just above your current competence level in order not to get discouraged.

5. What to do when you get stuck

Following the proofs of others and deriving proofs yourself are two different things. Most importantly, you will often find yourself stuck when “doing the maths yourself”. What can one do in such a situation?

Let’s say you are trying to derive the maximum likelihood estimator for \(\lambda\) in a Poisson distribution. Chances are you are not the first person who has attempted to do this. Often, there are forums on the web where individuals like yourself have publicly asked about and discussed the proof you are trying to reproduce. Therefore, it might be helpful to search the web for discussion forums devoted to mathematics where other users have shared their solution to the exercise you are grappling with. Again, when engaging with the solutions of others, try to do so actively and note down things that are new to you.

Another way of dealing with getting stuck is leveraging the incubation principle. Doing advanced maths, to a large degree, is essentially problem-solving. Psychological researchers have found that taking breaks while doing problem-solving increases the likelihood of solving the problem at hand. Often, you can get a new perspective on a problem after taking a break or might find the solution to a problem randomly during the break itself (“Eureka effect”).

3.2 Checklist

Here are a couple of areas you might want to cover before jumping into practice:

  • Working with logarithms
  • Understanding the natural logarithm
  • Working with exponential functions
  • Fundamental differentiation rules
  • Fundamental integration rules
  • Sum (\(\Sigma\)) and product notation (\(\Pi\))
  • Fundamentals of probability and their notation
  • Determining the local minima and maxima of a function

3.3 Examples

First, you might revise the proofs of the maximum likelihood estimators for binomial and Poisson distributions from the first chapter on maximum likelihood.

Then, there are several online resources with proofs that might guide your further training. As one example, the FU Berlin has publicly shared a more advanced selection of proofs of estimation procedures in the context of statistical models here.

The end of that document contains some interesting prompts which you might attend to selectively and to whatever degree you feel comfortable with:

1. Write down the general form of a likelihood function and name its components.
2. Write down the general form of a log likelihood function and name its components.
3. Write down the general form of an ML estimator and explain it.
4. Discuss commonalities and differences between OLS and ML beta parameter estimators.
5. Write down the formula of the GLM ML beta parameter estimator and name its components.
6. Write down the formula of the GLM ML variance parameter estimator and name its components.
7. Write down the formula of the GLM ReML variance parameter estimator and name its components.
8. Define the sum of error squares (SES) and the residual sum of squares (RSS) and discuss their commonalities and differences.
9. Write down the GLM incarnation of independent and identical sampling from a univariate Gaussian distribution as well as the ensuing expectation and variance parameter estimators.

3.4 Sum and Product Notation

1: Sums

The sum notation \(\sum_{i=1}^n x_i\) allows you to write down the summation of many numbered elements \(x_i\) compactly instead of having to write \(x_1 + x_2 + ... + x_n\). Here, \(i=1\) specifies the lower bound and \(n\) the upper bound of the summation (comparable to a for-loop with index \(i\) increasing from 1 to \(n\) by 1 at every step). You can add numbered elements together (\(\sum_{i=1}^n x_i\)) and also include \(i\) itself in the sum: \(\sum_{i=1}^n i = 1 + 2 + ... + n\)

2: Products

The product notation (\(\prod_{i=1}^n x_i = x_1 * x_2 * ... * x_n\)) works exactly the same, but the elements are multiplied instead of added together.
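Since the text compares the sum notation to a for-loop, here is a minimal Python sketch (with purely illustrative values) that makes the analogy explicit for both notations:

```python
import math

x = [2, 5, 1, 7]           # the numbered elements x_1, ..., x_n
n = len(x)

# Sum notation: a for-loop with index i running from 1 to n
total = 0
for i in range(1, n + 1):
    total += x[i - 1]      # Python lists are 0-indexed
assert total == sum(x)

# Product notation: the same loop, but multiplying instead of adding
product = 1
for i in range(1, n + 1):
    product *= x[i - 1]
assert product == math.prod(x)

# Summing over the index itself: 1 + 2 + ... + n
print(sum(range(1, n + 1)))   # 10 for n = 4
```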

3: Exercises Write the following in sum/product notation: \[\begin{eqnarray} A & = & 5^2 + 6^2 + 7^2 + ... + 17^2\\ B & = & \frac{1}{3} + \frac{2}{5} + \frac{3}{7} + ... + \frac{20}{41}\\ C & = & n! \end{eqnarray}\]

Click here to see the solution

\[\begin{eqnarray} A & = & \sum_{i=5}^{17} i^2\\ B & = & \sum_{i=1}^{20} \frac{i}{2i + 1}\\ C & = & \prod_{i=1}^n i \end{eqnarray}\]

3.5 Differentiation Revision

This section is a revision of differentiation rules, including partial derivatives. If you already feel confident enough, you may skip this section and move on to the exercises in 3.6. In this course, you will have to optimise functions, and you need derivatives to find the optima of these functions.

1: Fundamental Differentiation Rules

\[f(x) = c \longrightarrow f'(x) = 0\\ f(x) = x \longrightarrow f'(x) = 1\\ f(x) = c * g(x) \longrightarrow f'(x) = c * g'(x)\\ f(x) = x^a \longrightarrow f'(x) = ax^{a-1}\\ f(x) = g(x) + h(x) \longrightarrow f'(x) = g'(x) + h'(x)\\\]

2: Chain Rule

The chain rule applies in the case of composite functions such as \(f(x) = (2x^2 + 1)^3\). There is always an outer function (here: \(g(x) = x^3\)) and an inner function (here: \(h(x) = 2x^2 + 1\)) so that the composition can be written as \(f(x) = g(h(x)) = (g \circ h)(x)\).
The chain rule reads \(f'(x) = g'(h(x)) * h'(x)\). The derivative of the example is therefore \(f'(x) = 3(2x^2 + 1)^2 * 4x\).

3: Product Rule

When we have a product of two functions, such as \(f(x) = g(x) * h(x) = x^2 * e^x\), the product rule applies. Product rule: \(f'(x) = g'(x) * h(x) + g(x) * h'(x)\) When applied to the example, the result is: \(f'(x) = 2x * e^x + x^2 * e^x\) (Reminder: the derivative of \(e^x\) is \(h'(x) = e^x\))
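If you want to verify such derivatives, a small sketch using sympy (assuming it is available) reproduces both the chain rule and the product rule example:

```python
import sympy as sp

x = sp.symbols("x")

# Chain rule example: f(x) = (2x^2 + 1)^3
f = (2 * x**2 + 1) ** 3
print(sp.diff(f, x))             # 12*x*(2*x**2 + 1)**2, i.e. 3(2x^2 + 1)^2 * 4x

# Product rule example: f(x) = x^2 * e^x
g = x**2 * sp.exp(x)
print(sp.expand(sp.diff(g, x)))  # x**2*exp(x) + 2*x*exp(x)
```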

4: Notation

A derivative can be written either as \(f'(x)\) or as \(\frac{df}{dx}(x)\). To indicate partial derivatives, \(\partial\) is used instead: \(f_x'(x,y) = \frac{\partial f}{\partial x}(x,y)\)

5: Partial Derivatives

In the following, you will also have to calculate partial derivatives of functions that depend on more than one variable, such as \(f(x,y) = 2x^2 + y^3x\). It is possible to calculate the derivative with respect to either \(x\) or \(y\); the other variable is then treated as a constant:
\(\frac{\partial f}{\partial x}(x,y) = 4x + y^3\;\;\;\;\;\) The \(y^3\) is a constant factor in front of \(x\) and is thus unchanged.
\(\frac{\partial f}{\partial y}(x,y) = 3y^2x\;\;\;\;\;\) As \(2x^2\) is a constant with respect to \(y\), it disappears in the derivative.
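The same kind of check works for partial derivatives; a minimal sympy sketch for the example above:

```python
import sympy as sp

x, y = sp.symbols("x y")

f = 2 * x**2 + y**3 * x

print(sp.diff(f, x))   # 4*x + y**3  (y is treated as a constant)
print(sp.diff(f, y))   # 3*x*y**2    (2x^2 drops out as a constant)
```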

6: Exercises Calculate the (partial) derivatives of the following functions: \[\begin{eqnarray} f(x) & = & x^2 * \sum_{i=1}^N i\\ g(x) & = & ln(e^x * x^2)\\ h(x) & = & log_2(x)\\ i(x) & = & e^{5x} * (x + 2)^2\\ j(x,y) & = & x * \sum_{i=1}^N (ix + y)^2 \end{eqnarray}\]
Click here to see the solutions

\[\begin{eqnarray} \frac{df}{dx}(x) & = & 2x * \sum_{i=1}^N i\\ \frac{dg}{dx}(x) & = & 1 + \frac{2}{x}\\ \frac{dh}{dx}(x) & = & \frac{1}{ln(2)x}\\ \frac{di}{dx}(x) & = & 5e^{5x} * (x + 2)^2 + e^{5x} * 2(x + 2)\\ \frac{\partial j}{\partial x}(x,y) & = & \sum_{i=1}^N (ix + y)^2 + x * \sum_{i=1}^N (2i^2x + 2iy)\\ \frac{\partial j}{\partial y}(x,y) & = & x * \sum_{i=1}^N (2ix + 2y) \end{eqnarray}\]

3.6 Exercises

Exercise 1: Linear Regression and LMSE

Given are \(N\) datapoints \((x_1,y_1) \ldots (x_N,y_N)\) from an experiment. Show that if you fit a straight line \(f(x)= \alpha x + \beta\) to the data by minimising the squared difference between \(f\) and the \(y_i\)’s, then, without the need for numerical tools, you obtain

\[\begin{eqnarray} \alpha & = & \frac{N \sum_{i=1}^N x_i y_i - \sum_{i=1}^N x_i \sum_{i=1}^N y_i}{N \sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i \right)^2 } \nonumber \\ \beta & = & \frac{1}{N}\left(\sum_{i=1}^N y_i - \alpha \sum_{i=1}^N x_i \right) \nonumber \\ & & \nonumber \end{eqnarray}\]

as the best solution. Note: This is well known as linear regression, and using the (average or mean) squared difference between data and model as loss function is known as least mean-squared error or LMSE.

Click here to obtain a hint

The squared error is defined as:

\[SE(\alpha, \beta) = \sum_{i=1}^N \left[y_i-f(x_i) \right]^2\]
Click here to obtain another hint

Substitute \(f(x_i)\) with the formula given in the instructions.

Click here to obtain another hint

The goal is to minimize this expression with respect to \(\alpha\) and \(\beta\). This is easier when the expression consists of separate terms such as \(a^2-b^2-c^2\) rather than a squared bracket such as \((a-b-c)^2\) (without implying equivalence between the two forms), so expand the square before differentiating.

Click here to obtain another hint

Take the partial derivatives with respect to \(\alpha\) and \(\beta\) of the expression you obtain after following the previous hint, set them to zero, and solve in order to obtain the LMSE estimators for \(\alpha\) and \(\beta\).

\[\begin{eqnarray} \frac{\partial SE(\alpha,\beta)}{\partial \alpha} \\ \frac{\partial SE(\alpha,\beta)}{\partial \beta} \end{eqnarray}\]

If necessary, look up how to calculate partial derivatives.

Click here to watch a video about how to obtain partial derivatives

Educational video published by Khan Academy under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States (CC BY-NC-SA 3.0 US) license

Click here to obtain another hint

Notice that the LMSE estimator for \(\alpha\) you are tasked to derive does not include any \(\beta\) term in it:

\[\alpha = \frac{N \sum_{i=1}^N x_i y_i - \sum_{i=1}^N x_i \sum_{i=1}^N y_i}{N \sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i \right)^2 }\]

Thus, you will need to perform a substitution to obtain an expression for \(\alpha\) that does not depend on \(\beta\). In practice, this is necessary because once you have \(\alpha\) you can then calculate the LMSE estimator for \(\beta\) (assuming you derived the expression for \(\beta\) correctly); you cannot complete the estimation of \(\alpha\) and \(\beta\) otherwise.
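If you want to convince yourself that the closed-form expressions from the exercise are correct before deriving them, a small numerical sketch (assuming numpy is available, with made-up data) compares them against a standard least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: noisy points around a line (purely illustrative values)
N = 50
x = rng.uniform(0, 10, N)
y = 2.5 * x + 1.0 + rng.normal(0, 1, N)

# Closed-form LMSE estimators from the exercise
alpha = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x) ** 2)
beta = (np.sum(y) - alpha * np.sum(x)) / N

# Reference solution from numpy's polynomial fit (degree 1 = straight line)
alpha_ref, beta_ref = np.polyfit(x, y, 1)

print(np.allclose([alpha, beta], [alpha_ref, beta_ref]))  # True
```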


Exercise 2: LMSE and Maximum Likelihood

Given are \(N\) independent datapoints \((x_1,y_1) \ldots (x_N,y_N)\) from an experiment and suppose you want to fit a model \(f\) with \(M\) adjustable parameters, \(\alpha_j, j = 1, \ldots, M\). The model predicts a functional relationship between the measured dependent \(y_i\)’s and the independent \(x_i\)’s.

Least-squares fitting implies adjusting \(\alpha_1 \ldots \alpha_M\) such that

\[\begin{equation} \sum_{i=1}^N \left[y_i - f(x_i; \alpha_1 \ldots \alpha_M) \right]^2 \end{equation}\] is minimized. Note: The equation in the first exercise “Linear Regression and LMSE” was a special case of this equation with \(M=2\), \(\alpha = \alpha_1\), and \(\beta = \alpha_2\).

Maximum-likelihood estimation (MLE) asks a different question: instead of minimizing the squared distance of the datapoints to the function, it is based on the intuition that the observed data \((x_i,y_i)\) are more likely to come from models with certain parameters \(\alpha_1 \ldots \alpha_M\) than from others. How can we select the parameters that are “most likely” correct?

Assume that each \(y_i\) has an independent measurement error that follows a Gaussian distribution around the “true” model \(f(x)\). Assume further that the standard deviation \(\sigma\) of the Gaussian distribution is the same for all \(y_1 \ldots y_N\).

What is the probability of the dataset as a function of \(\alpha_1 \ldots \alpha_M\)? Use this to show that the maximum likelihood fit in this case corresponds exactly to the LMSE-fit from the equation above. Comment on the equivalence - how realistic is it for behavioral experiments that \(\sigma\) is constant for all \(x\)? How realistic is it for psychophysical experiments?

Click here to obtain a hint

Notice that the exercise states that the measurement error, which is the difference between \(y_i\) and the model prediction \(f(x_i)\), follows a normal distribution with standard deviation \(\sigma\). Try to use this information to define the likelihood you are trying to derive and look up the formula for the density function of the normal distribution.

Click here to obtain another hint

The first step in deriving a maximum likelihood estimator, as described in previous chapters, is to establish a product of the probability of individual observations given a certain model or parameter vector. In general, this product then represents the likelihood of the model given the data.

Click here to obtain another hint

To obtain the probability of a given observation \((x_i,y_i)\), we recall that the difference between \(y_i\) and \(f(x_i)\) follows a normal distribution with standard deviation \(\sigma\).

Then, the probability of the observation can be obtained by integrating the density function of the normal distribution. Since a single point of a continuous density has probability zero, we consider a small interval \(y_i \pm \Delta y\) for a fixed, small \(\Delta y\). The probability of the observation \((x_i, y_i)\), obtained by integrating the normal density over this interval, is then approximately:

\[\begin{equation} p_i \propto \exp \left[ - \frac{1}{2} \left(\frac{y_i - f(x_i)}{\sigma}\right)^2 \right] \Delta y \end{equation}\]

Click here to obtain another hint

Revisit the rules for obtaining maximum likelihood estimators from the previous chapter on maximum likelihood estimation and try to obtain the likelihood

\[\mathcal{L}((\alpha_1 ... \alpha_M) | \mathbf{y})\]

starting with the equation from the previous hint.

Note: \(\mathbf{y}\) is printed in bold because it represents a vector, i.e. the collection of all observations.

Consult the exercise instructions and elaborate on what the symbols in the likelihood represent, if necessary.

Click here to obtain another hint

Consider minimizing the negative log-likelihood, as it will lead to the same result as maximizing the log-likelihood when deriving maximum likelihood estimates.

Click here to obtain another hint

Consider which terms become constants and drop out of the equation when differentiating with respect to the parameters for which the maximum likelihood estimate is derived. Afterwards, you should be able to show the equivalence of the ML estimator with the LMSE criterion:

\[\sum_{i=1}^N \left[y_i - f(x_i; \alpha_1 \ldots \alpha_M) \right]^2\]
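As a purely numerical illustration of the equivalence you are asked to show analytically (not a replacement for the proof), the following sketch, assuming numpy and scipy are available, fits a line to made-up data once by minimizing the sum of squared errors and once by minimizing the Gaussian negative log-likelihood with a fixed \(\sigma\); the estimated parameters coincide:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Made-up data with constant-variance Gaussian noise (illustrative only)
x = rng.uniform(0, 5, 40)
y = 1.8 * x - 0.5 + rng.normal(0, 0.7, 40)
sigma = 0.7  # assumed known and constant for all observations

def sse(params):
    a, b = params
    return np.sum((y - (a * x + b)) ** 2)

def neg_log_likelihood(params):
    a, b = params
    resid = y - (a * x + b)
    # Gaussian log-density summed over independent observations
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

fit_sse = minimize(sse, x0=[0.0, 0.0])
fit_mle = minimize(neg_log_likelihood, x0=[0.0, 0.0])

print(np.allclose(fit_sse.x, fit_mle.x, atol=1e-3))  # True: both criteria give the same estimates
```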


Exercise 3: Maximum Likelihood and Binomial Data

The number of successes of a sequence of \(N\) Bernoulli trials with success probability \(p\) follows a binomial distribution. Show that if you empirically obtain the fraction \(y\) of successes in \(N\) trials, then the maximum likelihood estimator \(\hat{p}\), in this case, is simply \(\hat{p} = y\). Show this both for the likelihood and for the log-likelihood.

Click here to obtain a hint

Look up the formula for the binomial distribution and adapt it to the representations given in the instructions. You could write:

Given a sequence of Bernoulli trials of length \(N\) and the fraction of successes \(y\), we can write the likelihood for the estimator \(\hat{p}\) as:

\[\mathcal{L}(\hat{p}) = \binom{N}{yN} \, \hat{p}^{\,yN} \, (1-\hat{p})^{(1-y)N}\]

Click here to obtain another hint

Revisit the previous chapter on maximum likelihood estimation and the process for obtaining a maximum likelihood estimator.

Note: Do not alter the representations of \(y\) and \(N\) while completing the proof.
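If you would like a numerical sanity check of the claim \(\hat{p} = y\) (this does not replace the proof), a minimal sketch evaluating the binomial log-likelihood on a grid of candidate values for \(p\), with made-up values for \(N\) and the number of successes, shows that it peaks at the observed fraction:

```python
import numpy as np
from scipy.stats import binom

N = 40          # number of Bernoulli trials (illustrative)
k = 14          # observed number of successes
y = k / N       # observed fraction of successes

# Binomial log-likelihood evaluated on a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 9801)
log_lik = binom.logpmf(k, N, p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(y, p_hat)   # both approximately 0.35, up to the grid resolution
```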