Training your own model

You can train your own version of the models using the included pincelate.train module.

Run python -m pincelate.train --help for a list of options:

--model-prefix MODEL_PREFIX
                        prefix for saved models (directories must already
                        exist!)
--verbose             show keras progress bars (default to one line per
                        epoch)
--random-state RANDOM_STATE
                        random state for train/test split
--epochs EPOCHS       number of epochs to train
--batch-size BATCH_SIZE
                        batch size
--src {orth,phon}     source sequences
--target {orth,phon}  target sequences
--unidirectional      unidirectional rnn (default is bidirectional)
--enc-rnn-units ENC_RNN_UNITS
                        units in encoder RNN
--dec-rnn-units DEC_RNN_UNITS
                        units in decoder RNN
--enc-rnn-dropout ENC_RNN_DROPOUT
                        recurrent dropout in encoder RNN
--dec-rnn-dropout DEC_RNN_DROPOUT
                        recurrent dropout in decoder RNN
--optimizer {adam,rmsprop}
                        optimizer (rmsprop or adam)
--lr LR               learning rate for optimizer
--decay DECAY         learning rate decay for optimizer
--clipvalue CLIPVALUE
                        clip value for optimizer

A serialized model consists of several files, including the pickled hyperparameters and the network weights. The --model-prefix option sets the directory and the filename prefix for these files; the directory must already exist. For example, an option written like so:

--model-prefix=my-models/phon2orth

… will direct the module to save files with names like my-models/phon2orth-obj.pickle, my-models/phon2orth-training.h5, my-models/phon2orth-infer-encoder.h5, etc.
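Since the help text notes that the directories in the prefix must already exist, it can be handy to create them before training. The following is a minimal sketch, not part of pincelate: the helper names prepare_model_prefix and saved_model_files are hypothetical, and the sketch assumes only that every saved file name starts with the last component of the prefix, as in the examples above.

```python
from pathlib import Path
from typing import List


def prepare_model_prefix(prefix: str) -> Path:
    """Create the directory part of a --model-prefix value if it is missing.

    Hypothetical helper; pincelate itself does not provide it.
    """
    prefix_path = Path(prefix)
    # Only the parent directory has to exist; the last component
    # (e.g. "phon2orth") is a filename prefix, not a directory.
    prefix_path.parent.mkdir(parents=True, exist_ok=True)
    return prefix_path


def saved_model_files(prefix: str) -> List[Path]:
    """List files in the prefix directory whose names start with the prefix.

    Assumes the saved files share the prefix's last path component,
    as in my-models/phon2orth-obj.pickle.
    """
    prefix_path = Path(prefix)
    if not prefix_path.parent.is_dir():
        return []
    return sorted(
        p for p in prefix_path.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix_path.name)
    )
```

Calling prepare_model_prefix("my-models/phon2orth") before training creates my-models/ if needed; saved_model_files("my-models/phon2orth") can then be used afterwards to check which files the training run wrote.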

Pincelate needs both an orthography-to-phoneme model and a phoneme-to-orthography model to operate; these are trained separately. The --src option selects the sequences fed to the encoder, and the --target option selects the sequences the decoder learns to produce. For example, to train an orthography-to-phoneme model with 64 hidden units in both the encoder and decoder:

python -m pincelate.train --model-prefix=test-models/orth2phon --src=orth \
   --target=phon --enc-rnn-units=64 --dec-rnn-units=64
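Because both directions are needed, it may help to generate the two training commands together. The sketch below only builds the argument lists and uses just the flags shown in the help output above; the function names build_train_command and build_both_commands are hypothetical, and the orth2phon and phon2orth prefixes simply follow the naming used in the examples in this section.

```python
import sys
from typing import List

VALID_KINDS = ("orth", "phon")


def build_train_command(prefix: str, src: str, target: str,
                        units: int = 64) -> List[str]:
    """Build an argument list for one `python -m pincelate.train` run.

    Hypothetical helper; it only assembles flags listed by --help.
    """
    if src not in VALID_KINDS or target not in VALID_KINDS:
        raise ValueError("src and target must each be 'orth' or 'phon'")
    return [
        sys.executable, "-m", "pincelate.train",
        f"--model-prefix={prefix}",
        f"--src={src}",
        f"--target={target}",
        f"--enc-rnn-units={units}",
        f"--dec-rnn-units={units}",
    ]


def build_both_commands(model_dir: str, units: int = 64) -> List[List[str]]:
    """Build commands for both directions under one (existing) directory."""
    return [
        build_train_command(f"{model_dir}/orth2phon", "orth", "phon", units),
        build_train_command(f"{model_dir}/phon2orth", "phon", "orth", units),
    ]
```

Each list returned by build_both_commands("test-models") can be passed to subprocess.run(cmd, check=True) to start the corresponding training run; the directory test-models must exist beforehand.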

Training and test data from the CMU Pronouncing Dictionary is loaded and prepared in pincelate.cmudictdata.