This blog is over!

Thanks for visiting!

This blog was for my class, which is over.
I don’t post here anymore, and don’t intend to.
Here is my current school web page, though, FYI:

David Scott Krueger


rteter

1. [audio http://s1.vocaroo.com/media/download_temp/Vocaroo_s1ZF8FDj4Lpe.mp3]
2. [audio http://s0.vocaroo.com/media/download_temp/Vocaroo_s0lWoKBaxCv2.mp3]
3. [audio http://s1.vocaroo.com/media/download_temp/Vocaroo_s1ZF8FDj4Lpe.mp3]
4. [audio http://s0.vocaroo.com/media/download_temp/Vocaroo_s0o8l7nn5lop.mp3]

Results on AA dataset

Here is the best result I’ve gotten so far using my multi-prediction CNN-MDNs.

[Figure: best CNN-MDN prediction and generation results on the aa dataset]

The network architecture is, as before, 4 layers with 60 channels each, except for 12 channels at the output representing a mixture of 4 Gaussians.  Kernel lengths are 101, 101, 50, and 1.  The network outputs one sample at a time, based on the previous 250.  I train on 80% of all the aa phonemes from TIMIT longer than 1001 samples, with 10% each for validation and test.  Blue is ground truth, green is prediction, and red is generation.
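For concreteness, here is a rough sketch of that architecture (written in PyTorch purely for illustration; the real code uses Pylearn2/Theano, and the tanh nonlinearity and the exact split of the 12 output channels are my assumptions):

```python
import torch
import torch.nn as nn

class CNNMDN(nn.Module):
    def __init__(self, n_components=4, channels=60):
        super().__init__()
        # Four valid (unpadded) 1-D convolutions with kernel lengths 101, 101, 50, 1.
        # With a 250-sample input window, the output length is exactly 1.
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=101), nn.Tanh(),
            nn.Conv1d(channels, channels, kernel_size=101), nn.Tanh(),
            nn.Conv1d(channels, channels, kernel_size=50), nn.Tanh(),
            nn.Conv1d(channels, 3 * n_components, kernel_size=1),  # 12 output channels
        )

    def forward(self, x):
        # x: (batch, 1, 250) window of previous samples -> (batch, 12, 1)
        out = self.net(x)
        # 12 channels = 4 means, 4 log-variances, 4 mixture logits (assumed split)
        means, log_vars, logits = out.chunk(3, dim=1)
        return means, log_vars, logits
```

With valid convolutions, kernels of length 101, 101, 50, and 1 consume exactly 250 input samples per output sample, which is where the 250-sample context comes from.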

For comparison, here is the same architecture/dataset with a scalar output:

[Figure: prediction and generation with the scalar-output version of the same architecture]

First Results with Mixture Density Networks

Neural networks can be used to output the parameters of a mixture model; the combined model is called a Mixture Density Network (MDN).  Alex Graves used an MDN in his work on handwriting synthesis, which inspired my decision to use one.

Using a mixture model makes the output distribution multi-modal, which, as we discussed in class, may be important for realistic generation and should prevent flat-lining.

I use a mixture of Gaussians, each parametrized with mean, variance, and mixture coefficient.  
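Concretely, with K = 4 components the network maps the input window to means, variances, and mixture coefficients, and the training cost is the negative log-likelihood of the observed next sample under the mixture:

```latex
% Mixture density over the next sample, conditioned on the input window:
p(x_t \mid \text{context}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}\!\left(x_t;\ \mu_k,\ \sigma_k^2\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Training cost: negative log-likelihood of the observed next sample
\mathcal{L}(x_t) = -\log \sum_{k=1}^{K} \pi_k \,\mathcal{N}\!\left(x_t;\ \mu_k,\ \sigma_k^2\right)
```

where the means, variances, and mixture coefficients are all functions of the input window.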

Implementation has presented some issues, but they are now mostly solved.  The main remaining issue with the MDN cost is NaNs.  I tried dividing by the standard deviation instead of the variance (as in the RNADE paper), and I also put a lower limit on the variance to avoid divide-by-zero errors, at the recommendation of David WF, but I still get NaN errors.  I'm trying to find the problem using Pylearn2's nan_guard.
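For reference, here is a rough sketch of the kind of numerically stabilized cost I'm aiming for (PyTorch-style pseudocode rather than my actual Theano graph; the floor value is arbitrary).  It combines the lower bound on the standard deviation with a log-sum-exp, so the mixture log-likelihood never requires exponentiating large numbers:

```python
import math
import torch

def mdn_nll(target, means, log_stds, logits, min_std=1e-3):
    # target: (batch, 1); means / log_stds / logits: (batch, K)
    std = torch.exp(log_stds).clamp(min=min_std)      # hard floor on sigma
    log_pi = torch.log_softmax(logits, dim=-1)        # mixture weights sum to 1
    z = (target - means) / std
    # Per-component Gaussian log-density of the target
    log_prob = -0.5 * z ** 2 - torch.log(std) - 0.5 * math.log(2 * math.pi)
    # Log of the mixture density, computed stably with log-sum-exp
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```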

In the meantime, I've been playing with learning rates.  I used a network with 4 layers of 60 channels each, with kernel sizes 101, 101, 50, 1.  This is the same architecture as my CNN2 from the last post (except the output has 12 channels representing a mixture of 4 Gaussians).  As in my last post, blue is ground truth, green is prediction, and red is generation.

Too large a learning rate:

Weights become NaN within a few epochs (frequently after just one).  The network's predictions/generations are much too large.  The likelihood increases dramatically and noisily:

[Figures: training objective and predictions/generations at too large a learning rate]

A reasonable (?) learning rate:
Some of these runs end because of NaN errors, like this one:

[Figure: a run at this learning rate that terminated with NaN errors]

Otherwise, the termination criteria stopped the rest, because they had bounced so far away from anything good:
[Figure: a run stopped by the termination criterion after bouncing far from a good solution]

Here is a plot with more close-ups of a sequence:

[Figure: close-ups of a predicted/generated sequence]

Some observations:

0.  None of the prediction/generation results are very accurate.  They all seem much noisier and higher in frequency than the ground truth.
1.  The predictions/generations always seem to be too large (but to varying extents).
2.  The generation starts out with a reasonable magnitude.

Even at the points with the best likelihood I achieved, it doesn't seem to really work.  Hopefully I can improve it by figuring out the NaN issue and the reason for the extreme changes we see with a high LR, or after enough improvement with a lower LR.  Trying more or fewer components in the mixture would be interesting, too.  I think the fewer components it has, the easier it should be to train.

It may be that eliminating NaNs and maintaining a sensible training criterion will take more than numerically stable operations.  The problem I'm seeing looks similar to the exploding gradient problem.  With the right learning rate, I can make steady progress, but only up to a point.  Then it's like I hit a wall and the gradient explodes.  I'm not sure why this effect appears so dramatically when I switch to an MDN cost.
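If it really is an exploding-gradient effect, one standard mitigation would be to clip the gradient norm before each update.  Here is a rough sketch (PyTorch-style for illustration; this is not part of the Pylearn2 setup described above, and the threshold and function names are placeholders):

```python
import torch

def train_step(model, optimizer, inputs, targets, loss_fn, max_norm=1.0):
    # loss_fn is a stand-in for whatever cost is being minimized (e.g. an MDN NLL)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale all gradients so their combined norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return float(loss)
```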

Better Results with CNNs

I’ve been training CNNs with more layers and getting better results for prediction.  In the following plots, blue is ground truth, green is prediction, and red is generation.  The bottom two panels are the performance and the top two are the outputs (the left is the complete sequence, the right is just the beginning).
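To be explicit about the difference: "prediction" is one step ahead with the true previous samples as input, while "generation" feeds the network's own outputs back in as context.  A rough sketch of the generation loop (PyTorch-style; `model` stands in for any of the scalar-output CNNs below, and the exact shapes are assumptions):

```python
import torch

def generate(model, seed, n_samples, context_len=250):
    # seed: an iterable of real waveform samples used to start the context
    samples = list(seed)
    for _ in range(n_samples):
        # Take the most recent context_len samples as a (batch, channel, time) window
        context = torch.tensor(samples[-context_len:], dtype=torch.float32).view(1, 1, -1)
        next_sample = model(context).item()   # one-sample-ahead scalar prediction
        samples.append(next_sample)           # feed the model's own output back in
    return samples
```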

I've tried the following three architectures, and for each I tried to find a good learning rate (given in each plot's title).  I've been focused on learning rates, but now I think a better termination criterion might be necessary, and some of the LRs I tried might have performed much better with one.

I always use 60 channels for each convolutional layer.  (A quick receptive-field check for these kernel sizes follows the results below.)

CNN1: 5 layers, kernel sizes 101, 51, 51, 50, 1
[Figure: CNN1 prediction/generation results]

CNN2: 4 layers, kernel sizes 101, 101, 50, 1
[Figure: CNN2 prediction/generation results]

CNN3: 6 layers, kernel sizes 51, 51, 51, 51, 50, 1
[Figure: CNN3 prediction/generation results]
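As a quick sanity check on what these kernel lists mean, here is the receptive field each stack implies, assuming stride-1, unpadded convolutions.  All three architectures condition on exactly the previous 250 samples, the same context length as the CNN-MDNs above:

```python
# Receptive field of a stack of stride-1, dilation-1 convolutions:
# sum of (kernel - 1) over layers, plus 1 for the output position itself.
def receptive_field(kernels):
    return sum(k - 1 for k in kernels) + 1

for name, kernels in [("CNN1", [101, 51, 51, 50, 1]),
                      ("CNN2", [101, 101, 50, 1]),
                      ("CNN3", [51, 51, 51, 51, 50, 1])]:
    print(name, receptive_field(kernels))   # all three print 250
```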

More hyper-parameter optimization would probably help quite a bit.  My priorities would be setting and adjusting the learning rate (LR) and termination criterion, and playing with the model architecture (focusing on kernel shape and number of layers).  I believe deeper nets with narrower kernels might work better, although they'd also be harder to train.  So far I've been able to train CNN1 and CNN2 to about .02 validation error, but CNN3 stopped training at ~.065.  It turns out that might just be because I had a bad setting for the termination criteria.

For termination, I've been using Pylearn2's built-in EpochCounter and MonitorBased termination criteria.  I've had a hard time finding good settings for these, and a lot of that is due to confusion about how they actually work.  I recommend everyone read the code (not just the comments) for these carefully; I found the documentation and names misleading.

I thought the MonitorBased criterion was looking for improvement over the best value in the last N epochs, but it actually requires improvement over the best value achieved so far in training.  The noise in the objective for these models is very large, so a less demanding criterion would probably result in better training.
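For reference, this is roughly how I'm constructing them (shown in Python rather than the YAML config; the channel name and numbers are just examples, and the argument names reflect my reading of pylearn2.termination_criteria, so double-check against the source):

```python
# Approximate construction of the termination criteria discussed above.
# NOTE: channel_name, max_epochs, prop_decrease, and N are placeholder values.
from pylearn2.termination_criteria import And, EpochCounter, MonitorBased

termination_criterion = And(criteria=[
    # Hard cap on the number of training epochs.
    EpochCounter(max_epochs=500),
    # Despite what the name might suggest, this keeps training only while the
    # monitored channel improves on the best value seen so far in training
    # (by a fraction prop_decrease) within a window of N epochs.
    MonitorBased(channel_name="valid_objective",
                 prop_decrease=0.01,
                 N=10),
])
```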