

My favorite book is Peter Ladefoged’s Vowels and Consonants, which is fitting for The Dessoff Choirs’ (self-appointed) pronunciation guru. As part of that job, I prepare International Phonetic Alphabet (IPA) transliterations of our concert music, at least when we’re singing in a language I know something about. It’s a tedious task, but lately less so, thanks to the workflow system I recently cobbled together for our November concert of French choral music.

Goal: a database of French words and their IPA pronunciations.
French is largely phonetic, so at first I considered creating a rule-based system to construct words’ approximate transliterations. The prospect became more and more complicated to imagine, and this led me to look for a downloadable lexicon that already included IPA (either the output of someone else’s rule-based system or the result of digitizing an existing dictionary).

Dictionaries aplenty, most of them too “user-friendly.”
There’s no shortage of good online dictionaries, but the ones I looked at were distinctly unhelpful. Only some of them contain IPA, to begin with, and most of them are accessible only through a type-and-click web interface. It might have been possible to automate the web interaction and turn my source texts into a sequence of HTTP requests, but my programming skills in that area are badly dated. Back when the web was a collection of static HTML pages, I’d jury-rig something with wget and sed. Nowadays, the web is sophisticated. You don’t just go to a URL and get back a plain HTML document or file. A lot of what appears in your browser window requires client-side execution of JavaScript or similar nonsense. Forget about using wget in such situations. (Similar situations have frustrated me before. Someone will have kindly assembled just the data I need, and will have kindly made it available, but only via a browser form for single-item retrieval.)

Third download’s a charm.
Eventually, I found some hopeful downloads. The first two, a file for OpenOffice spellcheck, and a dictionary for WinEDT, didn’t fit the bill, but the third, Ralf’s French dictionary, did. I don’t know who Ralf is, nor do I know who’s behind the testing simon blog, where Google Search led me to discover Ralf’s dictionary. (Simon is apparently a speech recognition system, which explains the connection to dictionaries with IPA.) Ralf’s dictionary contains hundreds of thousands of French words (lexemes) with their textual representations (graphemes, like you’re reading here) and IPA equivalents (phonemes).

Ralf’s dictionary is not a dictionary.
For nearly 25 years, my go-to dictionary for French pronunciation has been a 1980 Hachette. It provides IPA for each of its over 50,000 entries. But like most dictionaries, well, it’s a dictionary, not a lexicon. It’s full of definitions — and that’s the point. “Ralf’s dictionary” is a lexicon that happily includes IPA. The big difference for me, today, is that a complete lexicon like Ralf’s contains all the words people utter (or sing), many of which (especially verbs in the case of French) are not dictionary “words,” but are inflected forms of dictionary words. You can find parler in Hachette (on page 1137), right between parlementer and parleur, and you can find it in Ralf’s (at position 259506), also between parlementer and parleur, but in Ralf’s, it’s not right between. After parlementer and before parler in Ralf’s you’ll find (though turning data pages creates no wonderful musty book smell) parlementera, parlementerai, parlementeraient, …, parlements, parlementâmes, …, parlementé, parlementée, and parlementées. And all with IPA.

Ok, so dussé is missing. But eut is not.
For years, I was never quite sure how to pronounce some inflected verb forms in French. Was the pronunciation of eut (not an entry in Hachette) the same as for eu (which is listed), or did it rhyme with peut? Not that I have occasion to speak eut often, but I’ve had occasion to sing it (in d’Indy’s delightful Madrigal, for example, which Dessoff will be singing in a choral arrangement this November). Sure, I could have asked someone, but that would mean having to ask someone. According to Ralf, it’s the former: both eut and eu are pronounced [y]. Ralf could be wrong (he often is; I’ll get to that later, though he doesn’t appear to be in this case), but the pronunciation of eut is a valuable fact, and he recognizes that.

Click here to see YouTube’s divoboy perform d’Indy’s Madrigal (with outstanding French diction save for the incorrect pronunciation of eut, because it probably wasn’t in his dictionary).

One of the weirder French verb forms I do know how to pronounce is dussé, as in “Je vais faire cela, dussé-je le regretter ensuite.” By itself, dussé isn’t really a word, but when dusse (the first person imperfect subjunctive form of devoir) and various other verb forms ending in a mute e appear in inversion with their pronominal subject, the spelling changes: e becomes é. Despite the accent aigu, however, dussé-je is pronounced [dysɛʒ], not [dyseʒ]. For better or for worse, by the way, the days of dussé-je may be numbered. In its controversial 1990 “rectifications,” France’s Superior Council of the French Language (only in France, you may think, but also in Belgium and Canada) declared the correct spelling to henceforth be dussè-je. That makes a lot of sense, but of course this is the organization that in the same proclamation tried to change the official spelling of oignon to ognon. As you can imagine, that didn’t go over very well, so we’ll see if dussè-je sticks. You can read more about dussé-je/dussè-je here, which is where I copped the sample sentence above.

Ok, I’ll say it: XML is not evil.
Ralf’s dictionary is an XML file. I’ll admit it, I’ve got issues with XML, or more specifically with people who think XML is a database format, but Ralf used it wisely, as a self-documenting container for data exchange. CSV would have been fine, too, but XML was a better idea here, because the Unicode characters that represent IPA don’t always survive being shuttled around in less standardized text files.

Import time.
Each lexeme in Ralf’s dictionary was associated with a phoneme (the IPA I wanted), a grapheme (the lexeme written down) and sometimes a role (abbreviation, letter, name, or verb). The IPA in Ralf’s dictionary was for speech, and I ultimately needed slightly different pronunciations for singing, so I imported Ralf’s data into a table with an extra phoneme column that contained the changes I wanted.

My database platform of choice, as always, is Microsoft SQL Server. With a lot more trial and error than I’d have needed to import from CSV or various other formats, I finally managed to make XQuery happy. Here’s my import query.

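-- Shred the XML column x of table FD into one row per <lexeme> element.
WITH Imported(Item,Role,Grapheme,Phoneme) AS (
  SELECT 
    T1.lexeme.query('.'),
    T1.lexeme.value('./@role','nvarchar(100)') as Role,
    T1.lexeme.value('grapheme[1]','nvarchar(100)') as Grapheme,
    T1.lexeme.value('phoneme[1]','nvarchar(100)') as Phoneme
  FROM FD
  CROSS APPLY x.nodes('/lexicon/lexeme') AS T1(lexeme)
)
  INSERT INTO FrenchIPA
  SELECT 
    Item,
    Role,
    Grapheme,
    Phoneme,
    -- Extra column: the spoken IPA adjusted for singing.
    -- The IPA characters in these literals appear here only as '?',
    -- so the actual substitutions cannot be read from this listing.
    replace(replace(
      Phoneme,N'?',N'?'
      ),N'??',N'o?'
    )
    as Phoneme2
  FROM Imported;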
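
For readers without SQL Server at hand, here is a rough Python sketch of the same shredding step. It is not the import I ran. It assumes the file has the shape the XQuery paths above imply (a lexicon root, lexeme children with an optional role attribute, and grapheme and phoneme elements) and that no XML namespace is involved. The three-entry sample and the function name load_lexicon are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-entry sample in the shape the import query expects:
# /lexicon/lexeme with an optional role attribute and grapheme/phoneme children.
SAMPLE = """<lexicon>
  <lexeme><grapheme>eu</grapheme><phoneme>y</phoneme></lexeme>
  <lexeme><grapheme>eut</grapheme><phoneme>y</phoneme></lexeme>
  <lexeme role="verb"><grapheme>fut</grapheme><phoneme>fy</phoneme></lexeme>
</lexicon>"""


def load_lexicon(xml_text):
    """Return {grapheme: (role, phoneme)} for every lexeme in the document."""
    root = ET.fromstring(xml_text)
    table = {}
    for lexeme in root.findall("lexeme"):
        grapheme = lexeme.findtext("grapheme")
        phoneme = lexeme.findtext("phoneme")
        role = lexeme.get("role")  # None when the attribute is absent
        if grapheme is not None and phoneme is not None:
            table[grapheme] = (role, phoneme)
    return table


lexicon = load_lexicon(SAMPLE)
print(lexicon["eut"])
```

A dictionary keyed on the grapheme keeps only one pronunciation per spelling, which is a simplification; a real lexicon may list several.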

Replacing graphemes with phonemes.

The source texts I had were just that — texts, text strings. In order to use the table FrenchIPA, I had to identify the individual words in my texts. While in theory, that’s harder than writing the right XQuery for import, it’s something I’ve done a gazillion times and helped other people do a gazillion times. One version of a query for this has been on my Drew web page for years. Cobble, cobble, cobble, and out comes this clumsy, kludgy, clunky, but effective query I used to make a first pass at word-for-word transliteration (replacing each word in the input string variable @txt with its associated phoneme).

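-- Nums is assumed to be an auxiliary table of consecutive integers 1, 2, 3, ...
-- Puncts: for each position n1 in @txt, n2 is the position of the
-- first non-letter at or after n1.
with Puncts(n1,n2) as (
  select
    n as n1,
    (select min(n) from dbo.Nums as N2
     where N2.n <= len(@txt) and N2.n >= N1.n
     and substring(@txt,N2.n,1) not like '%[a-z]%' collate Latin1_General_CI_AS
    ) as n2
  from dbo.Nums as N1
  where n <= len(@txt)
-- Wds: positions that share the same n2 belong to one word, which
-- starts at the smallest n1 and ends just before n2.
), Wds(st,fn,w) as (
  select
    min(n1), n2,
    substring(@txt,min(n1),n2-min(n1)) as wd
  from Puncts
  group by n2
-- Reps: the words found in FrenchIPA, numbered from the last one
-- in @txt back to the first.
), Reps(i,st,fn,w,Grapheme,IPA) as (
  select row_number() over (order by st desc), st, fn, w, Grapheme, P2
  from Wds join FrenchIPA
  on lower(w) = Grapheme
-- Result: one replacement per recursion step, last word first, so the
-- offsets of the words still to be replaced stay valid.
), Result(i,r) as (
  select cast(0 as bigint),@txt
  union all
  select
    Reps.i, stuff(r,st,fn-st,IPA)
  from Reps join Result
  on Reps.i = Result.i+1
)
  -- Keep the fully replaced string, widen the spaces, and bracket each line.
  select top 1 '['+replace(replace(r,' ','   '),'
',']
[')+']' from Result order by i desc
  option (MAXRECURSION 1000);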

The most kludgy part is the recursive query that replaces one word at a time with IPA. If anyone is curious about how this works, ask me.
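
For the curious who would rather not ask, here is the idea behind that recursive step as a short Python sketch: find the words, then replace them from the last one to the first, so that no replacement disturbs the offsets of the words still waiting. This is not the query above. The mini-lexicon and the function names transliterate and bracket_lines are invented for illustration, and the IPA values are the ones that show up in the first-pass output for the opening of the Madrigal.

```python
import re

# Hypothetical mini-lexicon standing in for the FrenchIPA table.
LEXICON = {"qui": "ki", "jamais": "ʒamɛ", "fut": "fy", "de": "də", "plus": "ply"}


def transliterate(text, lexicon):
    """Replace each known word with its IPA, leaving everything else alone."""
    # Work from the last word to the first, so that replacing a word never
    # shifts the start/end offsets of the words still to be processed.
    for match in reversed(list(re.finditer(r"[^\W\d_]+", text))):
        ipa = lexicon.get(match.group().lower())
        if ipa is not None:
            text = text[:match.start()] + ipa + text[match.end():]
    return text


def bracket_lines(text):
    """Widen spaces to three and wrap each line in square brackets."""
    return "\n".join(
        "[" + line.replace(" ", "   ") + "]" for line in text.split("\n")
    )


print(bracket_lines(transliterate("Qui jamais fut de plus charmant visage,", LEXICON)))
```

Words missing from the lexicon (charmant and visage in this tiny sample) pass through unchanged, much as the inner join in the query leaves unmatched words alone.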

Cleaning up the result.

This doesn’t produce the final transliteration, by any means, but it’s darn close. Here’s what it yields for d’Indy’s Madrigal (an example that lets me type the word with two apostrophes yet again).

[Note: I see garbage below in Chrome; IE is ok. And unfortunately, some combination of WordPress, MySQL, Windows Live Writer, and HTML disagrees with Unicode’s combining diacritical characters, so you’ll see meandering tildes.]

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒ,]
[də   kɔl   ply   blɑ̃,   də   ʃəvœ   ply   swajœ;]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   lɛvɾ   ply   suɾiɑ̃t,]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajœ,]
[ply   ʃast   sɛ̃   su   gimp   tɾɑ̃spaɾɑ̃t,]
[ki   ʒamɛ   y   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   vwa   de'œ̃   ply   du   ɑ̃tɑ̃dɾ,]
[miɲɔn   dɑ̃   ki   buʃ   ɑ̃pɛɾl   mjœ;]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]

All that’s left is touchup, mainly.

1. Add schwas for syllables that are silent in speech, but not in song. (Spoken, Frères Jacques has two syllables; sung, it has four.)

2. Fix some mistakes in Ralf’s dictionary, like his having gotten œ and ø backwards most everywhere. (It’s debatable whether a distinction really exists anyway.)

3. Indicate where there are liaisons (and check against the music to avoid marking them across rests).
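
Part of item 2 lends itself to a mechanical first pass. In the Madrigal, every œ that ends up as ø sits at the end of a word, while the œ inside kœɾ and the nasal œ̃ stay as they are. Assuming that pattern holds generally (an assumption, and not necessarily how the fixes were actually made), a Python sketch might look like this. The function name fix_final_oe is made up.

```python
import re


def fix_final_oe(ipa):
    """Turn œ into ø when no letter or combining mark follows it."""
    # The negative lookahead leaves alone any œ inside a word (kœɾ) and
    # the nasal vowel, which is œ followed by a combining tilde (U+0303).
    return re.sub(r"œ(?![\w\u0300-\u036f])", "ø", ipa)


print(fix_final_oe("də   ʃəvœ   ply   swajœ;"))
```

A rule like this only proposes changes. Each one still deserves a look against a dictionary or a trusted ear.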

After not much additional work, this is what I got:

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒə]
[də   kɔl   ply   blɑ̃,   də   ʃəvø   ply   swajø]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

[ki   ʒamɛz‿y   lɛvɾə   ply   suɾiɑ̃tə]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajø]
[ply   ʃastə   sɛ̃   su   gɛ̃pə   tɾɑ̃spaɾɑ̃tə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

[ki   ʒamɛz‿y   vwa   dœ̃   ply   duz‿ɑ̃tɑ̃dɾə]
[miɲɔnə   dɑ̃   ki   buʃ‿ɑ̃pɛɾlə   mjø]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

This makes me very happy, and, despite the time I spent writing the queries, it saved me a lot of time. In fact, it probably took more time to write this post than it did to put together the IPA for this concert.


Conflict. Today, my writing was likened to Dan Brown’s, and I’m compelled to demonstrate at least a rudimentary grasp of grammar and its subtleties.

I write like
Dan Brown


Interlude. Let me explain how I arrived at this conflict; skip to the dénouement if the travelogue begins to bore you. [Note to self: look up or else coin the adjectival form of interlude; consider interludinous, interludinal, interludinary, interludine.]

The comparison of my writing with Dan Brown’s occurred earlier today, while I was visiting I Write Like, a momentarily amusing web¹ site at http://iwl.me. I arrived there from this CONJUGATE VISITS post (sorry, but its author yells the title). I happened onto CONJUGATE VISITS while looking up “supposably,” which I learned today is a word (note the absence of scare quotes around “word”), as opposed to a “word,” which would have been my first guess.

The next step back is a tad embarrassing. I only realized where I’d been before looking up supposably when I retraced my steps for this blog post; I’d gotten the idea to look up supposably from this article on the web site of Reader’s Digest, a generally icky place I wouldn’t have visited intentionally. A tweet from Phil Jimenez led me to the Reader’s Digest article (more specifically a bit.ly URL in the tweet, and I submit disguise-by-shortening as my excuse).

I don’t recall whether I read Phil’s particular tweet before or after I noted that he and I shared exactly one Facebook like, Dan Savage. That was no surprise, given what (or who? It’s a fictional character, so I’m not sure.) led me to Phil’s Twitter stream in the first place — Kevin Keller. Kevin, as you may know, made his appearance in Veronica #202 today; while I’ve yet to get my hands on the issue, I’d caught wind of it from Google News and consequently searched Twitter for the latest buzz, finding Phil, then Reader’s Digest, then supposably, then CONJUGATE VISITS, then I Write Like. In summary,

  • I Write Like, from
  • CONJUGATE VISITS, from
  • supposably, from
  • Reader’s Digest, from
  • @philjimeneznyc, from
  • Kevin Keller, from
  • Google News, from
  • daily routine.

Dénouement. On to my demonstration. Consider the following sentence, which I found on Amazon in a one-star review of CONJUGATE VISITS’s authoress June Casagrande’s book, It Was the Best of Sentences, It Was the Worst of Sentences, here.

Copernicus was thrilled when he discovered that the earth revolves around the sun.

Casagrande and the reviewer both prefer this to “Copernicus was thrilled when he discovered that the earth revolved around the sun.” I, on the other hand, presently compelled to say something about grammar, offer an even better sentence.

Copernicus was thrilled to discover that the earth revolves around the sun.

The proposition of Casagrande’s sentence (either version) has two parts. Deconstructed rigorously, the sentence states first that Copernicus was thrilled, and second that Copernicus’s² thrill occurred when he made his now famous discovery. However, the second part of the proposition is perplexing, if only slightly. If the writer had stopped after “Copernicus was thrilled,” I’d have felt cheated, but because she’d failed to explain why he was thrilled, not because she’d failed to explain when he was thrilled. Emotions interest readers because of their why, not their when.

For most readers, I’m sure the second part of the sentence as written sufficiently explains the why. Similarly, if the “thrilled when” sentence were part of an SAT reading comprehension question, the “correct” answer to Why was Copernicus thrilled? would be a) Because he discovered that the earth revolves around the sun., not d) It’s impossible to determine from the reading. But why explain “why?” indirectly by explaining when? The turn of phrase “thrilled to discover” isn’t the only choice (one might say “thrilled by his discovery” or “thrilled to have discovered”), but it’s the best choice, and this is my blog. Also, I might have answered d) to the SAT question, especially if I knew I’d get to argue with a teacher about it later. I don’t brag about my SAT English score, and for good reason.

Epilog. Dare I paste this blog post into I Write Like? And if I do, then post the result here, then paste it in again, will the result be the same, and if not, and I repeat the process… [Update: The result is … H. P. Lovecraft. I’ll leave it at that. Tear from the fabric the threads that are old!]

I write like
H. P. Lovecraft


Postscript. You, dear reader, are a mensch for getting to this point. Let me know how I can return the favor. You are almost as much of a mensch as Itzik, who hired me as an editor … twice, the second time after knowing how I go on about things like this.


¹ By writing web and not Web, I comport with one of the “Significant Rule Changes” in the latest edition of The Chicago Manual of Style. The interested reader (which is to say You, because you’ve read this far into my footnote) can find the full list here. This footnote is not an endorsement of The Chicago Manual of Style.

² Ibid. Among the Significant Rule Changes are rules on the possessive forms of two kinds of names: those ending with an unpronounced “s” and those ending with an “eez” sound (in the latter case presumably when the name also ends in “s,” because there can’t be any debate on possessives like Lise’s). Copernicus falls into neither category, and I don’t know the latest rule on his possessive. My rule is to always add ’s to form a possessive (as in This is Steve Kass’s blog.) except maybe for Jesus, Moses, and princess. Even for them I’m not certain what I’d do, but they don’t come up in my writing much.
