An important part of the learning experience associated with this course (and 40% of the grade) comes from experimenting with the algorithms presented in class. This page describes what is expected from the students. Feel free to ask questions below.

The objectives are the following:

  • get hands-on experience with some of the algorithms presented in the course
  • practice writing an experimental journal (e.g., on a blog dedicated to your experiments for this course), describing your ideas, experimental plans, experimental results, and discussions of potential conclusions (i.e., the material that eventually ends up in papers)
  • practice the use of collaborative tools for writing code, using a repository dedicated to your experimental work (e.g., on GitHub)
  • practice the collaborative competition typically enjoyed by scientists:
    • the work of each student (in the code repository and in the blog) is available to the others to build upon, thus speeding up the overall rate of progress of the group
    • each student is encouraged to re-use the ideas, results, tricks, and code from other students but MUST properly cite and acknowledge these inputs (plagiarism without citation would be severely punished)
    • each student competes to obtain good results on common benchmarks, but can take advantage of the good ideas of the others, hence the collaborative competition.
  • An important part of the grade will come from having been the first to do something useful and publicize it on your blog (possibly posting announcements here with links to the blog). The more useful a contribution is to advancing each other’s progress, the more points will be given. This should provide an incentive to quickly do things that may otherwise look boring but could be useful to others.

Examples of blogs written by students last year:

For now we will get started by playing with the TIMIT dataset and use it to experiment with the task of speech synthesis, i.e., mapping a sequence of symbols (phonemes or words) to an acoustic sequence (e.g., audio samples). Information about the speaker could also be used (so that eventually we could use such a model to imitate someone’s voice and make him or her say something other than what is available in a recording).

More information about the dataset will soon be added here. For now you can find a page that gives information about the data and previous papers there:


Please start by creating your blog and your code repository; a list of pointers to these will be maintained here:


27 thoughts on “Experimenting”

  1. Pingback: Starting with IFT6266 project | Amjad Almahairi

  2. Pingback: Starting a research blog | Random Mumbling

  3. Pingback: About this blog | Speech synthesis experiments

  4. As a starting exercise, I suggest simply training a simple model (linear, or a feedforward neural net) with squared error and a single scalar output (the next acoustic sample), given a fixed window of past inputs (the acoustic samples that precede it).
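A minimal sketch of this starting exercise, in the linear case: it builds (window, next-sample) training pairs from a 1-D signal and fits the model by least squares, which minimizes the squared error. A synthetic noisy sine wave stands in for real TIMIT audio, and the window size of 100 is an arbitrary choice for illustration.

```python
import numpy as np

def make_windows(signal, window_size):
    # Build (past-window, next-sample) training pairs from a 1-D signal.
    X = np.array([signal[i:i + window_size]
                  for i in range(len(signal) - window_size)])
    y = signal[window_size:]
    return X, y

# Toy signal standing in for TIMIT audio samples (a noisy sine wave).
rng = np.random.default_rng(0)
t = np.arange(2000)
signal = np.sin(0.05 * t) + 0.01 * rng.standard_normal(t.size)

X, y = make_windows(signal, window_size=100)

# Linear model with a bias term, fit by least squares
# (the minimizer of the squared prediction error).
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ w
mse = np.mean((pred - y) ** 2)
```

Replacing the linear map with a feedforward net keeps the same input/output interface: a fixed window in, one scalar (the next sample) out.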

  5. Pingback: Speech Synthesis Project from TIMIT dataset | Benjamin Aubert

    • Excellent summary!

      I would add that one important feature of a representation, for the application of speech synthesis, is that a simple and invertible mapping exists that allows us to recover the acoustic signal from the representation, with small enough loss.

      Ideally, a good representation is also one that is ‘compressible’, so that it can be ‘controlled’ by generating fewer real numbers per second than the acoustic signal itself.

      An interesting option that I would also like to consider is to *learn* a representation: we still directly produce the acoustic signal, but we design an “output layer” that maps a more compact internal representation to the acoustic signal.

  6. I mentioned during last class the Blizzard Challenge, which is a yearly speech synthesis challenge. Their datasets are available for free for non-commercial use. This is the Challenge website: http://www.festvox.org/blizzard/

    Data and tools can be downloaded from this site: http://www.cstr.ed.ac.uk/projects/blizzard/ (you have to accept their license to be able to create an account). I did not check all the datasets but the “roger” voice has phone labels and hand-annotated prosodic labels (which can be used to generate intonation/stress in sentences).

    • Hey thanks for mentioning this. I can’t seem to find results of the challenge, though. They do have a performance metric and a winner, right? Can you point us to that data if you know where to look?

      • Performance is measured through listening tests, where subjects evaluate synthesizers by rating synthesized speech on different scales (overall quality, naturalness, pauses, pleasantness, intonation, emotion, …). The results are published as a paper for each year the challenge took place (the first paper in each year’s page in the first link I posted).

  7. Pingback: Speech synthesis project description and first attempt at a regression MLP | IFT6266 Project on Representation Learning

    • Very nice! I encourage everyone to read this. It would be interesting to try with less sparsity, to see how many non-zero coefficients we need on average to get a natural-sounding reconstruction.
