One small note on theory vs. practice. Mathematically, $\beta_0, \beta_1, \beta_2, \dots, \beta_n$ can be estimated with the following formula:
$$\hat{\beta} = (X'X)^{-1}X'Y$$
where $X$ is the original input data and $Y$ is the variable that we want to estimate. This follows from minimizing the error. I will prove this before making a small practical point.
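Before the proof, here is what the formula looks like in code. This is a minimal NumPy sketch on made-up data; the data set, the noise level, and the column of ones for the intercept are illustrative assumptions rather than part of the formula itself.

```python
import numpy as np

# Made-up example: 100 observations, 2 explanatory variables.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(x)), x])

# beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # roughly [1.5, 2.0, -0.5]
```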
Let $e_i$ be the error the linear regression makes at point $i$. Then:
$$e_i = y_i - \hat{y}_i$$
The total squared error we make is now:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Because we have a linear model we know that:
$$\hat{y}_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i}$$
This can be rewritten in matrix notation as:
$$\hat{Y} = X\beta$$
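As a small sanity check, the elementwise formula and the matrix product give identical predictions. The snippet below reuses the toy `X` and `beta_hat` from the earlier sketch, so it is illustrative only:

```python
# Elementwise: beta_0 + beta_1 * x_{1,i} + beta_2 * x_{2,i} for every observation i.
y_hat_elementwise = beta_hat[0] + beta_hat[1] * X[:, 1] + beta_hat[2] * X[:, 2]

# Matrix notation: Y_hat = X beta.
y_hat_matrix = X @ beta_hat

print(np.allclose(y_hat_elementwise, y_hat_matrix))  # True
```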
We also know that, if $E$ is the vector containing all the errors $e_i$, then:
$$\sum_{i=1}^{n} e_i^2 = E'E$$
We want to minimize the total squared error, which means the following expression should be as small as possible:
$$E'E = (Y - \hat{Y})'(Y - \hat{Y})$$
This is equal to:
$$E'E = (Y - X\beta)'(Y - X\beta)$$
The rewriting might seem confusing, but it follows from linear algebra: when we multiply these expressions out, the matrices behave much like ordinary variables, as long as we respect the order of multiplication and the transposes.
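A quick numerical check makes the equivalence concrete. Reusing the toy `X` and `y` from the first sketch and an arbitrary `beta`, the elementwise sum of squared errors and the matrix product $E'E$ are the same number:

```python
beta = np.array([1.0, 1.0, 1.0])  # an arbitrary choice of coefficients

# The error vector E = Y - X beta and its elementwise sum of squares.
E = y - X @ beta
sse_elementwise = np.sum(E ** 2)

# The same quantity as the matrix product E'E = (Y - X beta)'(Y - X beta).
sse_matrix = E.T @ E

print(np.isclose(sse_elementwise, sse_matrix))  # True
```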
We want to find the values of $\beta$ for which this expression is as small as possible, so we differentiate with respect to $\beta$ and set the derivative equal to zero. Applying the chain rule to $(Y - X\beta)'(Y - X\beta)$ gives the derivative $-2X'(Y - X\beta)$, so:
$$\frac{d\,E'E}{d\beta} = -2X'Y + 2X'X\beta = 0$$
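If you do not trust the matrix calculus, the derivative can be checked against a central finite-difference approximation at an arbitrary point. This again uses the toy data from above and is just a numerical sanity check, not part of the proof:

```python
def sse(b):
    """E'E = (Y - X b)'(Y - X b) for the toy data above."""
    resid = y - X @ b
    return resid @ resid

beta = np.array([1.0, 1.0, 1.0])              # arbitrary point to check the gradient at
analytic = -2 * X.T @ y + 2 * X.T @ X @ beta  # the derivative derived above

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
numeric = np.array([
    (sse(beta + eps * np.eye(3)[j]) - sse(beta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric))         # True (up to floating-point error)
```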
This gives:
$$X'X\beta = X'Y$$
And finally:
$$\hat{\beta} = (X'X)^{-1}X'Y$$
So mathematically we seem to have found a solution. There is one practical problem though: $(X'X)^{-1}$ is expensive to compute when the matrix $X$ is very large, and inverting it directly can introduce numerical accuracy issues. Another way to find the optimal values of $\beta$ in that situation is to use a gradient descent type of method. The function we want to minimize is convex and the minimization is unconstrained, so a gradient method will also converge to the optimal $\beta$ if we need it in practice.
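Below is a minimal gradient descent sketch for the same objective, reusing the toy data from above. The step size and iteration count are arbitrary choices for this small example, not tuned values; in a real setting you would pick them more carefully, or use a library routine such as `np.linalg.lstsq`, which solves the least-squares problem without forming the inverse explicitly.

```python
# Gradient descent on E'E = (Y - X beta)'(Y - X beta).
beta = np.zeros(X.shape[1])
step = 1e-3
for _ in range(5000):
    grad = -2 * X.T @ y + 2 * X.T @ X @ beta  # the derivative from the derivation above
    beta = beta - step * grad

print(beta)      # converges to roughly the same values as the closed-form solution
print(beta_hat)  # the closed-form estimate from the first sketch
```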