hyperparameters: choices that we set by hand rather than learn from the data (e.g., k in KNN, the learning rate, the weight decay strength)

the data is commonly split into 70% train, 10% validation, and 20% test.

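For illustration, a minimal sketch of such a split with NumPy (the dataset size and seed below are made up):

import numpy as np

# hypothetical dataset size; shuffle indices, then take 70% / 10% / 20% slices
rng = np.random.default_rng(0)
N = 1000
idx = rng.permutation(N)
n_train, n_val = int(0.7 * N), int(0.1 * N)
train_idx = idx[:n_train]                # first 70%  -> training set
val_idx = idx[n_train:n_train + n_val]   # next 10%   -> validation set
test_idx = idx[n_train + n_val:]         # last 20%   -> test set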

Grid Search

simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm.
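As a concrete sketch, scikit-learn's GridSearchCV performs exactly this exhaustive search; the toy data and the list of candidate k values below are assumptions for illustration:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# toy data standing in for a real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# manually specified subset of the hyperparameter space
param_grid = {"n_neighbors": [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)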

D-fold Cross-validation: makes full use of the training data

  1. Use D-fold cross-validation to determine the average number of epochs that optimizes validation performance

    e.g., 5-fold cross-validation to choose the value of k in KNN (as in the code below).


  2. Train on the full data set for this many epochs to produce the final results (see the sketch after this list)
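A runnable sketch of this two-step recipe, assuming an SGDClassifier trained one epoch at a time via partial_fit, on synthetic stand-in data:

import numpy as np
from sklearn.linear_model import SGDClassifier

# synthetic stand-in for the real training set
rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 20))
y_full = (X_full[:, 0] > 0).astype(int)
num_folds, max_epochs = 5, 30
X_folds = np.array_split(X_full, num_folds)
y_folds = np.array_split(y_full, num_folds)

# step 1: per fold, record the epoch with the best validation accuracy
best_epochs = []
for i in range(num_folds):
  X_val, y_val = X_folds[i], y_folds[i]
  X_tr = np.concatenate([X_folds[j] for j in range(num_folds) if j != i])
  y_tr = np.concatenate([y_folds[j] for j in range(num_folds) if j != i])
  clf = SGDClassifier(random_state=0)
  val_acc = []
  for _ in range(max_epochs):
    clf.partial_fit(X_tr, y_tr, classes=np.unique(y_full))
    val_acc.append(clf.score(X_val, y_val))
  best_epochs.append(np.argmax(val_acc) + 1)

# step 2: retrain on the full training set for the averaged epoch count
n_epochs = int(round(np.mean(best_epochs)))
final = SGDClassifier(random_state=0)
for _ in range(n_epochs):
  final.partial_fit(X_full, y_full, classes=np.unique(y_full))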

5-fold cross-validation over k for a KNN classifier:

import numpy as np

# X_train_folds / y_train_folds hold the training data split into num_folds
# pieces, e.g. via np.array_split(X_train, num_folds)
for k in k_choices:
  k_to_accuracies[k] = []
  for i in range(num_folds):
    # fold i is the validation set; concatenate the remaining folds for training
    _X_validate, _y_validate = X_train_folds[i], y_train_folds[i]
    _X_train = np.concatenate([X_train_folds[j] for j in range(num_folds) if j != i])
    _y_train = np.concatenate([y_train_folds[j] for j in range(num_folds) if j != i])

    knn = KNearestNeighbor()
    knn.train(_X_train, _y_train)
    y_pred = knn.predict(_X_validate, k=k)
    # fraction of validation points classified correctly
    k_to_accuracies[k].append(np.mean(y_pred == _y_validate))
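Afterwards, the best k is the one with the highest mean accuracy across the folds, e.g.:

# average the per-fold accuracies and pick the best k
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))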

Workflow

  1. check the initial loss without weight decay (a sanity check; see the sketch after this list)
  2. overfit a small sample (to 100% training accuracy) to fiddle with architecture, learning rate, weight initialization
  3. use all training data, find a learning rate that makes the loss go down
  4. coarse grid: learning rate and weight decay (try 1e-4, 1e-5, 0), train 5 epochs
  5. refine grid: train for 20 epochs without learning rate decay
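For step 1, a softmax classifier with random weights should start near the loss of uniform guessing over C classes, i.e. ln(C); C = 10 below is an assumed example:

import math

# expected initial softmax loss with no weight decay: -log(1/C) = log(C)
C = 10
print(math.log(C))  # ~2.3026; a measured initial loss far from this hints at a bug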