AM221: Advanced Optimization

Lecture 1, Jan 22nd, 2018

Instructor: Rasmus Kyng

Let's load the libraries we'll use.

In [1]:
using DataFrames
using CSV
using Plots
plotly()
Out[1]:
Plots.PlotlyBackend()

Now, let's load a data set.

The data set is in the public domain and is available on Harvard Dataverse.

Their description:

This is the famous Galton data on the heights of parents and their children (i.e., where the term "regression" comes from). The data are public domain and can be used for teaching or any other purpose. Archived here in part to explore how Dataverse works.

In [2]:
# note from manual:
# ? CSV.read
# rows_for_type_detect::Int=100: indicates how many rows should be read to infer the types of columns

# if we don't read enough rows before deciding on types, the "family" column gets the wrong type

ghdf = CSV.read("../data/galton_height/galton-stata11.tab", rows_for_type_detect=1000, delim='\t')
Out[2]:
    family  father  mother  gender  height  kids  male  female
 1  1       78.5    67.0    M       73.2    4     1.0   0.0
 2  1       78.5    67.0    F       69.2    4     0.0   1.0
 3  1       78.5    67.0    F       69.0    4     0.0   1.0
 4  1       78.5    67.0    F       69.0    4     0.0   1.0
 5  2       75.5    66.5    M       73.5    4     1.0   0.0
 6  2       75.5    66.5    M       72.5    4     1.0   0.0
 7  2       75.5    66.5    F       65.5    4     0.0   1.0
 8  2       75.5    66.5    F       65.5    4     0.0   1.0
 9  3       75.0    64.0    M       71.0    2     1.0   0.0
10  3       75.0    64.0    F       68.0    2     0.0   1.0
11  4       75.0    64.0    M       70.5    5     1.0   0.0
12  4       75.0    64.0    M       68.5    5     1.0   0.0
13  4       75.0    64.0    F       67.0    5     0.0   1.0
14  4       75.0    64.0    F       64.5    5     0.0   1.0
15  4       75.0    64.0    F       63.0    5     0.0   1.0
16  5       75.0    58.5    M       72.0    6     1.0   0.0
17  5       75.0    58.5    M       69.0    6     1.0   0.0
18  5       75.0    58.5    M       68.0    6     1.0   0.0
19  5       75.0    58.5    F       66.5    6     0.0   1.0
20  5       75.0    58.5    F       62.5    6     0.0   1.0
21  5       75.0    58.5    F       62.5    6     0.0   1.0
22  6       74.0    68.0    F       69.5    1     0.0   1.0
23  7       74.0    68.0    M       76.5    6     1.0   0.0
24  7       74.0    68.0    M       74.0    6     1.0   0.0
25  7       74.0    68.0    M       73.0    6     1.0   0.0
26  7       74.0    68.0    M       73.0    6     1.0   0.0
27  7       74.0    68.0    F       70.5    6     0.0   1.0
28  7       74.0    68.0    F       64.0    6     0.0   1.0
29  8       74.0    66.5    F       70.5    3     0.0   1.0
30  8       74.0    66.5    F       68.0    3     0.0   1.0

Let $x$ be the height of the mother, $y$ the height of the child.

In [3]:
x,y = ghdf[:mother], ghdf[:height]
Out[3]:
([67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 64.0  …  63.0, 63.0, 65.0, 65.0, 65.0, 65.0, 65.0, 65.0, 65.0, 65.0], [73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0  …  66.5, 57.0, 72.0, 70.5, 68.7, 68.5, 67.7, 64.0, 63.5, 63.0])

Let's compute the parameters $a,b$ that minimize the residual sum of squares.

It turns out there's a way to do this by solving linear equations, which is what I'm doing below. But don't worry about that today; we'll learn more about this later.
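
To spell out what the cell below computes (a standard least-squares fact, stated here with $X$ standing for the matrix xc built below and $\beta = (a, b)$):

$$\mathrm{RSS}(a,b) = \sum_{i=1}^{n} (a x_i + b - y_i)^2 = \lVert X\beta - y \rVert_2^2, \qquad X = \begin{bmatrix} x & \mathbf{1} \end{bmatrix}.$$

Setting the gradient with respect to $\beta$ to zero gives the normal equations

$$X^\top X \, \beta = X^\top y, \qquad \text{so} \qquad \beta = (X^\top X)^{-1} X^\top y,$$

which is exactly the formula used in the code.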

In [4]:
# least-squares fit: append a column of ones for the intercept, then solve via the normal equations
xc = [x ones(length(x))];
ab = (xc'*xc)^-1 * xc'*y;
a,b = ab[1], ab[2]
Out[4]:
(0.31317951678849004, 46.69076592846731)
In [6]:
scatter(x,y,lab="data",xlabel="mother's height",ylabel="subject's height",ticks=true)
plot!(x,a*x+b,linewidth=2,lab="linear model")
Out[6]:
[plot: data scatter with the fitted linear model; x-axis "mother's height", y-axis "subject's height"]

Multi-variate linear regression

Let's compare the value of the objective function, the residual sum of squares, at our parameters $a,b$ to what we get with a trivial guess: outputting the mean height as the prediction for every subject.
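
In symbols, with $\bar{y}$ denoting the mean child height, the two objective values computed below are

$$\mathrm{RSS}_0 = \sum_{i=1}^{n} (\bar{y} - y_i)^2, \qquad \mathrm{RSS}_1 = \sum_{i=1}^{n} (a x_i + b - y_i)^2,$$

and we will report the relative reduction $1 - \mathrm{RSS}_1/\mathrm{RSS}_0$.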

In [7]:
# guess mean height:
RSS0 = sum((mean(y)-y).^2)
Out[7]:
11515.062371937667
In [8]:
# our parameters:
RSS1 = sum((a*x+b-y).^2)
Out[8]:
11046.805858446714

How much did we reduce the objective?

In [9]:
@printf("Objective reduced by %.2f per cent.",100*(1-RSS1/RSS0))
Objective reduced by 4.07 per cent.

Let's use both the height of the mother and the height of the father in our model.

I.e., now we have $y = a_1 \cdot x_1 + a_2 \cdot x_2 + b$, where $x_1$ is the height of the mother, $x_2$ is the height of the father, and $y$ is the height of the child.
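
In matrix form this is the same computation as before; note that in the code below the column of ones comes first, so the parameter vector is ordered $\beta = (b, a_1, a_2)$:

$$X = \begin{bmatrix} \mathbf{1} & x_1 & x_2 \end{bmatrix}, \qquad \beta = (X^\top X)^{-1} X^\top y,$$

so a12b[1] is the intercept $b$, a12b[2] is $a_1$ (mother), and a12b[3] is $a_2$ (father).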

In [10]:
# calculate best parameters (not covered in class!)
x12 = [ghdf[:mother] ghdf[:father]];
x12c = [ones(length(x)) x12 ];
a12b = (x12c'*x12c)^-1 * x12c'*y
Out[10]:
3-element Array{Float64,1}:
 22.3097  
  0.283215
  0.379897

Now the objective value is:

In [11]:
RSS2 = sum((x12c*a12b-y).^2)
Out[11]:
10261.12740121733

How much did we reduce the objective below the value at the trivial guess?

In [12]:
@printf("Objective reduced by %.2f per cent.",100*(1-RSS2/RSS0))
Objective reduced by 10.89 per cent.

Just for fun, let's look at a plot of the predicted height $y$ as a function of the TWO variables for the parents' heights.

In [13]:
# evaluate the fitted plane on a grid of parents' heights and overlay the data points
x1var = linspace(58, 71, 40)
x2var = linspace(62, 79, 40)
f(u,v) = a12b[2]*u + a12b[3]*v + a12b[1]
plot(x1var, x2var, f, st=:surface)
scatter!(x12[:,1],x12[:,2],y,lab="data", xlabel="mother's height",ylabel="father's height",zlabel="subject's height",ticks=true,leg=false)
Out[13]:
[plot: fitted surface over the data; x-axis "mother's height", y-axis "father's height", z-axis "subject's height"]

Convexity of RSS w.r.t. $a$

When we did regression with just $y = a x + b$, we got

In [14]:
a
Out[14]:
0.31317951678849004

How does the residual sum of squares, RSS, vary as we change $a$ while keeping $b$ fixed at its fitted value?

In [15]:
# RSS as a function of the slope a, with the intercept b held fixed at its fitted value
avar = linspace(-0.1,0.8, 40)
RSS(avar) = sum((avar*x+b-y).^2)
plot(avar,RSS,linewidth=2,xlabel="linear coefficient a",ylabel="RSS(a)",ticks=true,leg=false)
Out[15]:
[plot: RSS(a) versus the linear coefficient a; x-axis "linear coefficient a", y-axis "RSS(a)"]

RSS(a) is a convex function!
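
One quick way to see this: with $b$ and the data fixed, $\mathrm{RSS}(a)$ is a quadratic in $a$,

$$\mathrm{RSS}(a) = \sum_{i=1}^{n} (a x_i + b - y_i)^2, \qquad \frac{d^2}{da^2}\,\mathrm{RSS}(a) = 2 \sum_{i=1}^{n} x_i^2 \ge 0,$$

so its second derivative is nonnegative everywhere, which for a function of one variable is exactly convexity.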