2022/08/06 (updated 2023/04/26)

This article was originally written on the official forums.

This guide is intended to help users achieve better pronunciation and vocal timing, and better understand the phonemes used to generate synthesized vocals.

1.1

An overview of concepts and terms

A phoneme is an individual sound that Synthesizer V Studio is capable of producing. The available phonemes depend on the language being used, and represent all of the sounds a voice database can produce (including the transition sounds between phonemes).

Cross-lingual synthesis is a feature available to AI voices that allows them access to the phoneme lists for English, Japanese, and Mandarin Chinese, regardless of what their default language is. It is important to keep in mind that AI voices still have a "native" language, and it is normal for them to have an accent when using cross-lingual synthesis.

Standard voices cannot use cross-lingual synthesis and are limited to the phoneme list for their native language.


Regarding Unsupported Languages

Each voice database product has a native language (English, Japanese, or Mandarin Chinese). AI voices using cross-lingual synthesis are able to access any of these supported languages.

Some users apply a large number of manual phoneme changes to make a voice database sing in a language it normally cannot sing in. However, this works by using the existing phoneme list to create a rough approximation of the other language, and certain pronunciations will simply be impossible because the voice database often cannot produce the necessary sounds for a language it does not support.

Put simply, the sounds a voice database can produce are limited by the phoneme lists it has access to.


A lyric or word is the actual term represented by a sequence of phonemes. In Synthesizer V Studio, words do not directly affect the synthesized output. You could technically never enter the original lyrics at all and only ever enter the exact phonemes manually, and the resulting sound would be no different. Realistically, most users don't do that, because words are much easier to work with and taking advantage of the word-to-phoneme mapping makes for a better workflow.

To be clear, words are a useful workflow tool, but phonemes are what actually influence the rendered output.


A dictionary is used to customize the mapping between words and phonemes. For example, by default "hello" is represented as hh ax l ow, but you may prefer it to be pronounced as hh eh l ow. Dictionaries are another workflow tool, and can save a lot of time if used effectively. As with words, there is nothing dictionaries can do that cannot also be accomplished by manually entering the phonemes for every note; they are simply a tool that makes the process significantly faster and easier.
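
To make the relationship concrete, here is a minimal sketch of word-to-phoneme lookup with a dictionary override. The mappings and function are purely illustrative; this is not how SynthV is implemented internally:

```python
# Illustrative only: a dictionary entry overrides the default mapping.
DEFAULT_MAPPING = {"hello": "hh ax l ow"}   # default pronunciation
USER_DICTIONARY = {"hello": "hh eh l ow"}   # custom dictionary entry

def phonemes_for(word: str) -> str:
    # The dictionary takes priority; otherwise fall back to the default.
    return USER_DICTIONARY.get(word, DEFAULT_MAPPING.get(word, ""))

print(phonemes_for("hello"))  # hh eh l ow - the dictionary entry wins
```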


2.1

Phonemes for Each Language

Each language has its own set of phonemes and notations, which can be found in the following files located in Synthesizer V Studio's installation directory:

  • english-arpabet-phones.txt
  • japanese-romaji-phones.txt
  • mandarin-xsampa-phones.txt
  • cantonese-xsampa-phones.txt

For a version of these lists formatted for easy reference, check the Phoneme List page.
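
If you want to skim one of these lists programmatically, something along the following lines should work. The installation path is hypothetical and the parsing assumes one phoneme per line followed by an optional description, so adjust both to your system and the actual file contents:

```python
from pathlib import Path

# Hypothetical path - point this at your own installation directory.
install_dir = Path(r"C:\Program Files\Synthesizer V Studio Pro")
phones_file = install_dir / "english-arpabet-phones.txt"

phonemes = {}
for line in phones_file.read_text(encoding="utf-8").splitlines():
    parts = line.split()
    if parts:
        # Assumed layout: phoneme first, optional description after.
        phonemes[parts[0]] = " ".join(parts[1:])

print(f"{len(phonemes)} phonemes: {sorted(phonemes)[:5]}...")
```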


2.2

Special phonemes and symbols

These phonemes and symbols are either not specific to any one language, or exist only for certain voices.

cl inserts a glottal stop.

br inserts a breath using the AI engine, and only works with AI voices.

br1, br2, etc. and brl1, brl2, etc. are implemented for only a small number of Standard (non-AI) voices, specifically the Quadimension Standard voices and Saki Standard (and possibly more that I'm unaware of). These do not use the synthesis engine; instead they insert actual .wav breath samples. Each number represents a different .wav file, so each voice may have a different number of these special breath phonemes depending on how many breath sounds were included.

-, +, and ++ are technically not phonemes because you enter them within the note rather than above it, but see Extending a word or phoneme across multiple notes below for more about these special characters.


3.1

Entering words/lyrics

Once you have your notes in the piano roll, the next step is to enter the lyrics for the track. Some users enter all of the notes first and then the lyrics; others enter lyrics as they go.

Words can be entered by double-clicking a note and typing the word, then proceeding to the next note by double-clicking it or pressing the tab key. You can use ctrl+tab to go to the previous note instead of the next one.

A word entered into a note will look like this, with the word shown within the note and the phoneme sequence shown above the note in white text.

A note in the piano roll with the word hello as its lyric

Once you have all the lyrics entered as words, the default phoneme mapping will provide you with a good starting point. It is normal to need some manual adjustment after this point, but you can listen through the song and it should sound pretty close to correct.

There is also a batch "Insert Lyrics" function (ctrl+L) under the "Modify" menu. This assigns one word to each note selected in the piano roll. This method is not entirely reliable when a single word extends across multiple notes, since the number of words and notes may not match. See Extending a word or phoneme across multiple notes below for some methods of addressing this.
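
A quick sketch of why the mismatch happens, using made-up note and word lists (nothing here is SynthV's actual data model):

```python
# Rough model of batch "Insert Lyrics": one word is assigned per selected
# note, so a word that should span two notes throws off the counts.
notes = ["C4", "D4", "E4"]    # three notes; "hello" should span the first two
words = ["hello", "world"]    # plain batch entry has no way to know that

for i, note in enumerate(notes):
    lyric = words[i] if i < len(words) else "(no word left)"
    print(f"note {i + 1} ({note}): {lyric}")

# hello -> note 1, world -> note 2, note 3 is left without a word;
# the "-" and "+" characters described in 3.3 fix the distribution.
```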


3.2

Entering or adjusting phonemes manually

As mentioned above, words are a convenient way of getting most of the pronunciation correct, but the phonemes themselves dictate the synthesized output. It is normal for the default phoneme mapping of your lyrics to not be exactly what you want.

You can change the phoneme sequence for a note by double-clicking on the phoneme text above the note. When the phonemes have been manually modified the text will turn green instead of white.

A note in the piano roll with phonemes entered manually

When phonemes have been entered manually the word/lyric entered for the note no longer has any effect on the rendered output. You can even remove it entirely and nothing will change, because the phonemes are the only thing that matters, and they are no longer dependent on the word since you entered them manually.

Manual phoneme entry works independently of the note's content, even if the word is blank

If you want to remove the manual phonemes and revert to word-based mapping, double-click on the green phoneme text and delete it. Upon doing so the phonemes will revert to the original word-based sequence.

You can also enter phonemes manually in the note itself rather than above it by prefixing the "word" with a . as shown below. Using the . prefix means the note's content is used as the literal phoneme sequence and no word-based mapping is performed.

Using the dot syntax to enter phonemes within a note
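
Conceptually, the note text is interpreted in one of two ways, sketched below (the lookup table and function are illustrative, not SynthV internals):

```python
# Illustrative lookup table; SynthV's real mapping lives in its language files.
WORD_TO_PHONEMES = {"hello": "hh ax l ow"}

def resolve_note_text(text: str) -> str:
    if text.startswith("."):
        return text[1:].strip()            # literal phonemes, no mapping
    return WORD_TO_PHONEMES.get(text, "")  # normal word-based mapping

print(resolve_note_text("hello"))        # hh ax l ow
print(resolve_note_text(".hh eh l ow"))  # hh eh l ow
```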

3.3

Extending a word or phoneme across multiple notes

There are many situations in which you might want to extend a word or phoneme across multiple notes. We can use the special characters - and + (and ++, though that one's not as useful) to accomplish this.

- is used to sustain a sound across multiple notes:

Using - to extend a tone across two notes

You can sustain the sound across many notes; it doesn't have to be just two:

Using - to extend a tone across many pitches

+ is used to assign the next syllable of the preceding word to a note.

In this example, a multi-syllable word is entered as a single note. The engine has attempted to produce a reasonable syllable timing or cadence when pronouncing the word, but in many cases we would want to specify the syllable timing ourselves. We can accomplish this by using the + special character to extend the second syllable to a second note, giving each syllable equal timing that is slightly different from the default.

+ being used to adjust the timing of the second syllable

You can of course also do this across multiple pitches, rather than for pure timing reasons:

+ being used to transition the second syllable to a different pitch

Keep in mind that the rendered output is dictated by phonemes, not words. + is a convenience tool and produces the exact same result as entering the phonemes directly on their respective notes. This can be especially helpful to know, since SynthV Studio may not always correctly infer where the syllable breaks are in a word.

Splitting a word with + is the same as assigning the phonemes manually across two notes
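
The equivalence can be sketched in a few lines. The syllable split below is written by hand, which is exactly the information SynthV has to infer when you type + (and occasionally gets wrong):

```python
# Hand-written syllable boundaries for "hello", one list per note.
syllables = [["hh", "ax"], ["l", "ow"]]

# Entering "hello" on note 1 and "+" on note 2...
plus_entry = [" ".join(s) for s in syllables]

# ...renders identically to typing the phonemes on each note yourself.
manual_entry = ["hh ax", "l ow"]

print(plus_entry == manual_entry)  # True
```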

- and + can be easily combined, such as in this example:

Using + and - for the same word

++ is used to complete a word that spans three or more notes and has multiple syllables remaining. It is rare that this will be useful because it has such specific requirements, but it is an option.

An example of the ++ symbol being used to complete a word

When using batch lyric entry, you can use - and + to distribute the lyrics correctly across a number of notes that does not match the number of words. This process can get a bit unwieldy, but done one phrase or verse at a time it might be quicker than entering the lyrics note-by-note.

The batch lyric entry dialog
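
Here is an illustrative sketch of how the filler characters reconcile the counts; the tokens and notes are made up:

```python
# "-" and "+" consume a note without consuming a new word, which is how
# three words can be spread across five notes in batch entry.
tokens = ["hello", "+", "world", "-", "again"]  # text pasted into the dialog
notes = ["C4", "D4", "E4", "F4", "G4"]          # five selected notes

for note, token in zip(notes, tokens):
    kind = "filler" if token in {"-", "+", "++"} else "word"
    print(f"{note}: {token!r} ({kind})")

# "hello" spans notes 1-2, "world" sustains across notes 3-4,
# and "again" lands on note 5.
```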

4.1

Using dictionaries

Dictionaries are a powerful workflow tool that can streamline phoneme entry. Put simply, a user dictionary lets the user change how the software maps words to phonemes. This means that you don't need to find every instance of a word to repeatedly apply the same change. This is especially useful for words with multiple common pronunciations, such as "the", which is often pronounced as either "thuh" (dh ax) or "thee" (dh iy). Some voice databases, such as Solaria, come with a special dictionary that adjusts certain pronunciations and is tailored to that specific voice database.

To be clear, there is nothing that dictionaries can do that cannot also be achieved with manual phoneme entry, but they can help save a lot of time compared to entering phonemes note-by-note.

Using dictionaries in conjunction with cross-lingual synthesis

Voice databases use dictionaries based on their native language. For example, even if Solaria is singing in Japanese, the dictionary list will only show English dictionaries. This means you might have some dictionaries that are "for English voices singing in English" and others that are "for Japanese voices singing in English".

If you have a dictionary for a specific language that you want to use with a voice that has a different "native" language, navigate to Documents\Dreamtonics\Synthesizer V Studio\dicts (on Windows) and copy the dictionary from one language's folder to a different one.

For example, to use Solaria's English dictionary with Saki AI (a Japanese voice that can use cross-lingual synthesis to sing in English) you would copy SOLARIA_1.0.json from english-arpabet to japanese-romaji, allowing Solaria's dictionary to show up in the list when using Saki AI.
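
In script form, that copy looks something like this. It assumes the default Documents location on Windows, so adjust the base path if yours differs (e.g. OneDrive folder redirection):

```python
# A minimal sketch of the dictionary copy described above.
import shutil
from pathlib import Path

# Assumes the default Documents location on Windows.
dicts = Path.home() / "Documents" / "Dreamtonics" / "Synthesizer V Studio" / "dicts"
src = dicts / "english-arpabet" / "SOLARIA_1.0.json"
dst = dicts / "japanese-romaji" / src.name

dst.parent.mkdir(parents=True, exist_ok=True)  # in case the folder doesn't exist yet
shutil.copy2(src, dst)
print(f"copied {src.name} into {dst.parent.name}")
```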

To create a dictionary, open the Dictionary panel and click "New". You can then enter custom word-to-phoneme mappings. These new mappings will apply to all instances of the word in the tracks or groups that are using the dictionary, but will not replace phonemes that were entered manually on individual notes (the ones with green text).

If modifying a dictionary used for a previous project, consider making a copy of it before making changes so you don't accidentally overwrite mappings that your other projects rely on. Dictionaries are found in the Documents\Dreamtonics\Synthesizer V Studio\dicts folder on Windows.

A simple dictionary mapping hello to hh eh l ow

4.2

Phoneme timing

Aside from assigning individual syllables to notes (see Extending a word or phoneme across multiple notes above), there are additional options to adjust individual phoneme timing.

At the bottom of the Note Properties panel are sliders for note offset and phoneme duration. The note offset slider simply shifts the sound associated with the note forward or backward. The phoneme duration sliders can be adjusted to shorten or lengthen each individual phoneme relative to the others. AI voices also have a set of "Phoneme Strength" sliders which can be used to add emphasis.
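
As a rough mental model of the duration sliders (not the actual engine math), you can think of each phoneme as getting a relative weight, with the note's length divided proportionally among them:

```python
# Illustrative only: each phoneme gets a relative weight and the note's
# duration is split proportionally among them.
def phoneme_durations(note_ms: float, weights: dict[str, float]) -> dict[str, float]:
    total = sum(weights.values())
    return {p: round(note_ms * w / total, 1) for p, w in weights.items()}

# "hello" on a 500 ms note, with "l" pulled down to 50% of its default length:
print(phoneme_durations(500, {"hh": 1.0, "ax": 1.0, "l": 0.5, "ow": 1.0}))
# {'hh': 142.9, 'ax': 142.9, 'l': 71.4, 'ow': 142.9}
```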

Phoneme timing affects not just the timing of phonemes within the same note, but also the transitions with the previous and next note. For example, this is the default timing for the word "hello" split across two equal-length notes. You can see how the l phoneme is actually placed prior to the note that it is associated with, because this more closely mimics how a real human would sing.

Default phoneme timings for the word hello

A side-effect of this is that the ax phoneme is shorter than we might expect. By reducing the l phoneme timing, we can have it intrude less on the previous note:

Phoneme timings for the word hello have been adjusted

I would usually recommend against packing too many phonemes into a single note, but for the sake of demonstration, this example should make clear how much fine-tuning is possible:

Phoneme timing options for a word with many syllables

4.3

Alternate phonemes

Alternate phonemes and expression groups (covered in the next section) are found at the bottom of the Note Properties panel. Keep in mind these settings apply only to the note or notes you have selected at the time.

The alternate phoneme buttons and expression groups in the Note Properties panel

Cycling through alternate phonemes will cause the engine to use an alternate articulation for a sound (AI voices) or a different recorded sample for that specific sound (Standard voices). This can be useful if your lyrics have many of the same phoneme and you don't want them all to sound identical, or if the default sample has a harsher consonant sound and you'd prefer a softer one, for example. Results will vary wildly between voice databases since this relies on the variation in the original recordings.


4.4

Expression groups (Standard voices only)

Since Standard voices are based on individual phoneme samples, you can control exactly which samples are used during synthesis.

When a Standard voice is being recorded, the various samples are all recorded at multiple pitches. Expression groups represent the different pitches and tone variations of recorded samples. By default Synthesizer V Studio will select the most suitable expression group for the notes you have entered, but you also have the option to manually change this. This is useful if you want to force the engine to use soft or falsetto samples, or prevent it from doing so. Pictured below is the list of expression groups included with Genbu.

The list of expression groups for Genbu

Since AI voices are based on a machine-generated profile rather than discrete samples, there are no expression groups to pick from and this is a feature specific to Standard voices.