What to check when training is not converging (the error is not decreasing)
- Check the input pattern: what does your data look like? Are there any patterns in it? Can a human visualize them? If not, it is unlikely to be machine-trainable. If your input has too wide a range of possibilities, then either you need a lot more training data, or you should truncate the input and keep only the most essential parts.
- Check the loss function: does it capture any non-linearity? (A linear sum of parts is still linear by nature.)
- Check the source of input: is the data well distributed and representative of the entire input/output domain to be characterized?
- Feature engineering: perhaps some key input features are needed in order to differentiate among the many forms of input.
- Check the amount of input/output training data: insufficient data will not cover the depth of the knowledge domain.
- Check the loss/cost function: different cost functions need different gradient functions. Do you understand the difference between using L1 and L2?
- Check the activation function: this is the usual way of introducing non-linearity into the system.
- Explore ensembles: here many different models/distributions coexist, and combining all of their outputs may improve the final solution.
- Explore attention models: the intuition comes from the human ability to focus on only one task at a time: the dimension of the input is truncated to narrow down the input space for analysis. In NLP, global and local attention differentiate between the different contexts of focus.
- Gradient checking: at every step of the computation, the gradient of the loss function is supposed to drive the loss down, not always, but most of the time. If this simple check does not look correct, then something is likely wrong with the loss function, the gradient computation, or the algorithm itself. For example, here is the famous min-char-rnn.py code (the gradient checking code within is extracted and shown below):
# gradient checking
from random import uniform

def gradCheck(inputs, targets, hprev):
  global Wxh, Whh, Why, bh, by
  num_checks, delta = 10, 1e-5
  _, dWxh, dWhh, dWhy, dbh, dby, _ = lossFun(inputs, targets, hprev)
  for param, dparam, name in zip([Wxh, Whh, Why, bh, by],
                                 [dWxh, dWhh, dWhy, dbh, dby],
                                 ['Wxh', 'Whh', 'Why', 'bh', 'by']):
    s0 = dparam.shape
    s1 = param.shape
    assert s0 == s1, 'Error dims dont match: %s and %s.' % (s0, s1)
    print name
    for i in xrange(num_checks):
      ri = int(uniform(0, param.size))
      # evaluate cost at [x + delta] and [x - delta]
      old_val = param.flat[ri]
      param.flat[ri] = old_val + delta
      cg0, _, _, _, _, _, _ = lossFun(inputs, targets, hprev)
      param.flat[ri] = old_val - delta
      cg1, _, _, _, _, _, _ = lossFun(inputs, targets, hprev)
      param.flat[ri] = old_val # reset old value for this parameter
      # fetch both numerical and analytic gradient
      grad_analytic = dparam.flat[ri]
      grad_numerical = (cg0 - cg1) / (2 * delta)
      rel_error = abs(grad_analytic - grad_numerical) / abs(grad_numerical + grad_analytic)
      print '%f, %f => %e ' % (grad_numerical, grad_analytic, rel_error)
      # rel_error should be on order of 1e-7 or less
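The same centered-difference idea can be demoed in isolation. Here is a minimal, self-contained Python 3 sketch (not from min-char-rnn.py; the toy loss and names are mine) that checks an analytic gradient against a numerical one:

```python
import numpy as np

def loss(x):
    # toy quadratic loss with a known analytic gradient: grad = 2*x
    return np.sum(x ** 2)

def numerical_grad(f, x, delta=1e-5):
    # centered difference (f(x+d) - f(x-d)) / (2d), one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + delta
        fp = f(x)
        x.flat[i] = old - delta
        fm = f(x)
        x.flat[i] = old  # restore the parameter
        grad.flat[i] = (fp - fm) / (2 * delta)
    return grad

x = np.array([1.0, -2.0, 3.0])
analytic = 2 * x
numeric = numerical_grad(loss, x)
rel_error = np.abs(analytic - numeric) / np.abs(analytic + numeric)
print(rel_error)  # each entry should be on the order of 1e-7 or smaller
```

If the relative error is large, suspect the analytic gradient, not the numerical one.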
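On the point above that "a linear sum of parts is still linear": stacking linear layers without an activation collapses into a single linear map, which is why an activation function is needed. A quick NumPy sketch (toy weights of my own choosing) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# two linear layers with no activation collapse into one linear map
deep = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(deep, collapsed))  # True: the "deep" net is still linear

# inserting a ReLU between the layers breaks that equivalence
relu = lambda z: np.maximum(z, 0)
nonlinear = W2 @ relu(W1 @ x)  # no longer expressible as a single matrix
```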
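On the L1-versus-L2 question above: one practical difference is in the gradients. A small sketch (toy numbers, my own) comparing the gradients of an L1 (absolute) and L2 (squared) loss with respect to the predictions:

```python
import numpy as np

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 4.0])  # the last error is outlier-sized
err = y_pred - y_true

# L2 (squared) loss: gradient grows linearly with the error,
# so large errors dominate the update
l2_grad = 2 * err

# L1 (absolute) loss: gradient magnitude is constant (+/- 1),
# more robust to outliers, but with a kink at zero
l1_grad = np.sign(err)

print(l2_grad)  # [1. 2. 8.]
print(l1_grad)  # [1. 1. 1.]
```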
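On ensembles: the simplest form of combining outputs is averaging. A toy simulation (my own setup, not from the post) shows why it can help, since averaging several noisy but unbiased predictors shrinks the variance of the error:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = 1.0

# five "models", each an unbiased but noisy estimator of the truth
predictions = truth + rng.normal(scale=0.5, size=(5, 1000))

single_mse = np.mean((predictions[0] - truth) ** 2)        # one model alone
ensemble_mse = np.mean((predictions.mean(axis=0) - truth) ** 2)  # average of five
print(single_mse, ensemble_mse)  # the ensemble error should be clearly smaller
```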
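On attention: the core mechanism behind the "focus" intuition can be sketched as scoring each input against a query, normalizing the scores with a softmax, and taking a weighted sum. This is a minimal dot-product attention sketch with made-up dimensions, not code from any particular paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
query = rng.normal(size=4)        # what we are looking for
keys = rng.normal(size=(5, 4))    # one key per input position
values = rng.normal(size=(5, 3))  # one value per input position

scores = keys @ query        # similarity of the query to each key
weights = softmax(scores)    # attention distribution over the inputs
context = weights @ values   # weighted sum: "focus" on the relevant inputs

print(np.isclose(weights.sum(), 1.0))  # True: weights form a distribution
```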