AM221: Advanced Optimization
Lecture 1, Jan 22nd, 2018
Instructor: Rasmus Kyng
Let's load the libraries we'll use.
using DataFrames
using CSV
using Plots
plotly()
Now, let's load a data set.
The data set is public domain, and available on Harvard Dataverse.
Their description:
This is the famous Galton data on the heights of parents and their children (i.e., where the term "regression" comes from). The data are public domain and can be used for teaching or any other purpose. Archived here in part to explore how Dataverse works.
# note from manual:
# ? CSV.read
# rows_for_type_detect::Int=100: indicates how many rows should be read to infer the types of columns
# if we don't read enough rows before inferring column types, the type of the "family" column comes out wrong
ghdf = CSV.read("../data/galton_height/galton-stata11.tab",rows_for_type_detect=1000,delim='\t')
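Let's take a quick peek at the first few rows, just to check the columns we'll use below (mother, father, and height); this is an aside, using DataFrames' head helper.
# peek at the first few rows (we'll use the mother, father, and height columns)
head(ghdf)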
Let $x$ be the height of the mother, $y$ the height of the child.
x,y = ghdf[:mother], ghdf[:height]
Let's compute the parameters $a,b$ that minimize the residual sum of squares.
It turns out there's a way to do this by solving linear equations, which is what I'm doing below. But don't worry about that today; we'll learn more about this later.
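For reference (we'll see why later in the course): writing $X$ for the matrix whose first column is the mothers' heights and whose second column is all ones, the residual sum of squares is $\|X\binom{a}{b} - y\|^2$, and its minimizer satisfies the linear system
$$ X^\top X \binom{a}{b} = X^\top y, $$
which is exactly what the code below solves.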
# stack the mothers' heights next to a column of all ones (the ones column gives the intercept b)
xc = [x ones(length(x))];
# solve the normal equations for the least-squares coefficients
ab = (xc'*xc)^-1 * xc'*y;
a,b = ab[1], ab[2]
scatter(x,y,lab="data",xlabel="mother's height",ylabel="subject's height",ticks=true)
plot!(x,a*x+b,linewidth=2,lab="linear model")
Let's compare the value of the objective function, the residual sum of squares, at our parameters $a,b$ to what we get with a trivial guess: outputting the mean height as the prediction for every subject.
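In symbols, the two objective values we're about to compute are
$$ \mathrm{RSS}_0 = \sum_i (\bar{y} - y_i)^2 \qquad \text{and} \qquad \mathrm{RSS}_1 = \sum_i (a x_i + b - y_i)^2, $$
where $\bar{y}$ is the mean of the subjects' heights.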
# guess mean height:
RSS0 = sum((mean(y)-y).^2)
# our parameters:
RSS1 = sum((a*x+b-y).^2)
How much did we reduce the objective?
@printf("Objective reduced by %.2f per cent.",100*(1-RSS1/RSS0))
Let's use both the height of the mother and of the father for our model.
I.e., now we have $y = a_1 \cdot x_1 + a_2 \cdot x_2 + b$, where $x_1$ is the height of the mother, $x_2$ is the height of the father, and $y$ is the height of the child.
# calculate best parameters (not covered in class!)
x12 = [ghdf[:mother] ghdf[:father]];
x12c = [ones(length(x)) x12 ];
a12b = (x12c'*x12c)^-1 * x12c'*y
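As an aside (again, not covered in class), Julia's backslash operator solves the same least-squares problem directly; it should agree with the normal-equations answer up to floating-point rounding.
# aside: `\` solves the least-squares problem directly; should match a12b up to rounding error
a12b_check = x12c \ y
maximum(abs.(a12b - a12b_check))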
Now the objective value is:
RSS2 = sum((x12c*a12b-y).^2)
How much did we reduce the objective below the value at the trivial guess?
@printf("Objective reduced by %.2f per cent.",100*(1-RSS2/RSS0))
Just for fun, let's look at a plot of the predicted height $y$ as a function of the TWO variables for the parents' heights.
x1var = linspace(58, 71, 40)
x2var = linspace(62, 79, 40)
f(u,v) = a12b[2]*u + a12b[3]*v + a12b[1]
plot(x1var, x2var, f, st=:surface)
scatter!(x12[:,1],x12[:,2],y,lab="data", xlabel="mother's height",ylabel="father's height",zlabel="subject's height",ticks=true,leg=false)
When we did regression with just $y = a x + b$, we got
a
How does the residual sum of squares, RSS, vary as we change $a$?
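That is, holding $b$ fixed at the value we found above, we plot
$$ \mathrm{RSS}(a) = \sum_i (a x_i + b - y_i)^2. $$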
avar = linspace(-0.1,0.8, 40)
RSS(avar) = sum((avar*x+b-y).^2)
plot(avar,RSS,linewidth=2,xlabel="linear coefficient a",ylabel="RSS(a)",ticks=true,leg=false)
RSS(a) is a convex function!
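One way to see this: $\mathrm{RSS}(a)$ is a quadratic in $a$ whose second derivative is $\frac{d^2}{da^2}\mathrm{RSS}(a) = 2\sum_i x_i^2 \geq 0$, so the curve bends upward everywhere.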