Speech recognition patch

Hi, since you guys developed a speech synthesis patch (I guess by linking the built-in synthesis library to the QC libs), why not develop a speech recognition patch? Some months ago, for a project, I made heavy use of GlovePie (Window$) to tie speech recognition, speech synthesis, and wiimote control to realtime graphics in vvvv, and it was real phun to play with these tools. I know OS X also has a recognition framework, and being able to command Quartz "vocally" would be an awesome feature. I can imagine linking an actor's speech (as a structure) to different "actions" within a graph (like, say, an image structure of corresponding notions). Anyhow, that's just an idea...

cwright's picture
I like this

When I was working on speech synthesis, I came across the speech recognition framework as well, and thought about this patch. It will require some state saving stuff to work across close/open cycles of Quartz Composer, and we haven't gotten around to reverse engineering this functionality yet. I'll do some mock-ups and see what I can learn. Any hints about interface? (selectable number of inputs/outputs, with string inputs to detect triggering boolean outputs?)

franz's picture
+ - inputs

A selectable number of inputs seems natural to me (+ - "à la" multiplexer style). However, when working with GlovePie on PC, one interesting feature was the ability to "combo" multiple semantics: saying, for instance, "RED 23" would let me tint a quad with a red value of 23, which is even more interesting (spoken word to event, and spoken word to number -int-). Dunno how this can be coded under Quartz though...

cwright's picture
String formatting?

Perhaps inputs can have variable expansion, similar to scanf or printf? So you could have an input string "Red %i", which would then produce a number output with the given value. We could have "toggle" types, which toggle their boolean output on trigger, number types which sample-and-hold, and maybe other types? I need to learn the details of the speech recognition engine before we finalize this, but I think your "number-value associated with a word" idea has me completely sold on the idea :)

EDIT: probably also "boolean" types, where you have a phrase followed by "on" or "off". This allows a bit more control than simple toggles (which should turn into signals instead).
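To make the "Red %i" idea concrete, here is a minimal sketch of how a scanf-like input string could drive a sample-and-hold number output and a toggle output. It is not the patch's code: `compile_pattern`, `NumberPort`, and `TogglePort` are hypothetical names, and it assumes the recognizer hands back plain text with digits ("red 23"), which a real engine may not do.

```python
import re


def compile_pattern(pattern):
    """Turn a scanf-like pattern such as "Red %i" into a compiled regex.

    Hypothetical helper: only the %i placeholder is supported.
    """
    parts = pattern.split("%i")
    body = r"(\d+)".join(re.escape(part) for part in parts)
    return re.compile(r"^\s*" + body + r"\s*$", re.IGNORECASE)


class NumberPort:
    """Sample-and-hold: keeps the last spoken value until a new one arrives."""

    def __init__(self, pattern):
        if pattern.count("%i") != 1:
            raise ValueError("a number port needs exactly one %i")
        self.regex = compile_pattern(pattern)
        self.value = 0

    def hear(self, phrase):
        match = self.regex.match(phrase)
        if match:
            self.value = int(match.group(1))
        return self.value


class TogglePort:
    """Flips its boolean output every time its phrase is heard."""

    def __init__(self, phrase):
        self.regex = compile_pattern(phrase)
        self.state = False

    def hear(self, phrase):
        if self.regex.match(phrase):
            self.state = not self.state
        return self.state
```

Every recognized phrase would be fed to every port; a port that does not match simply keeps its previous output, which is the sample-and-hold behaviour described above.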

cwright's picture
Numbers and fun

According to the Cocoa object, NSSpeechRecognizer, we're limited to a flat list of words in an array. This, of course, makes for very easy code, but very weak usage (numbers would be a chore).

The plus side is that Carbon's interface, Speech Recognition Manager, appears to have a huge set of functions for doing really complicated stuff. Looks like this will be the way to go for 1.0. Should I whip up a cheap no-combo version this week for kicks, and then grind out the better Carbon one over a longer interval? Would it be useful at all, or just a toy?
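To show why numbers are a chore with a flat list, here is a rough sketch that expands one "red <number>" command into every phrase the recognizer would have to be given. The helpers `spell` and `expand` are made up for illustration, the range is limited to 0 through 99, and nothing here talks to the actual speech engine.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell(n):
    """Spell out an integer from 0 to 99 in English words."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is handled in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand(prefix, low, high):
    """Map every flat phrase ("red twenty three") to its numeric value."""
    return {f"{prefix} {spell(n)}": n for n in range(low, high + 1)}


# The keys would become the flat command list; the values are looked up
# again once the recognizer reports which phrase it heard.
table = expand("red", 0, 99)
commands = list(table)
```

A single numeric command already costs 100 list entries here, and every extra colour or wider range multiplies that, which is the argument for a richer grammar-style interface.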

mfreakz's picture
Not a Toy

I don't know if this project is still in progress, but I would like to say that it's a really good way to increase Quartz Composer interactivity.

We are all looking for new graphics features and 3D, but we are also looking for other kinds of interactivity at the same time.

This project is not a toy! ;)

It's a good way to push human/computer interaction, like the OpenCV project (my favorite!!!!)

cwright's picture
Nothing Rhymes...

Since you gave it non-toy status, I figured I'd start.

Results thus far:

NothingRhymes.png (392.41 KB)

franz's picture

I somehow managed to miss your previous posts this summer.... please, please, send us an alpha.... this is definitely not a toy, even with only a few words to recognize.... (my only problem is that it accepts only spoken English, whereas Microsoft has localized packages... which work very well with GlovePie on PC, for instance) but hey, an alpha will do....!!!

mfreakz's picture
Isn't it fun to have a fancy computer... listening to you ?

Well, as you can see (read), I'm a member of the "Worst English Speaking Union"... I'm French... Many of our members live anonymously, except Björk and a couple of other famous singers!

cwright: I hope that you don't interpret my "Non-Toy status" as an order, or a kind of rudeness !

When I wake up, I read my email, and I open 3 websites:

lemonde.fr (the news): to have a quick look at the disaster, and to decide if it's morally acceptable to spend time in Quartz Composer...

macbidouille.com (Apple COMPUTER news): to have a quick look at the iDisaster, and to decide if it's technically possible to spend time in Quartz Composer...

kineme.net: to understand why it's not an ethical or technical question, but the real meaning of life... ;) Thanx for ALL

Franz: I'm French too (Toulouse). I had a look at your site and your projects, and they're really good. We have quite a few projects that overlap. We should talk about it again "in the language of Molière" ;)

dust's picture
phoneme opcode

So I just grabbed this speech recognition patch and the speech synth patch, and I'm going to try to make them work with Quartz. A suggestion if you ever go live with the synth: adding some sort of rate and modulation control is cool, but the essential thing for me is to include a phoneme opcode output. I did some experiments a few years back with video. I took Apple's opcodes and had a friend record each phoneme into a camera. Then I cut up the video to just the opcode parts. I guess it's only 30-something sounds in combination that make up the English language.

So I thought it would be cool to have a video avatar or whatever. At first he was a mess, all stuttering like Max Headroom (remember him?), but with some tweaking I got him synced pretty well. I just pulled out that file, and apparently I didn't save the final version, because the phonemes are not as synced up as I remember; or maybe I changed them around while demonstrating the McGurk effect to someone. Anyway, I decided to record a conversation with my computer while I wait for a render. Needless to say, my internal USB has crapped out on me, so it was a little tricky recording me and the computer, and my voice is in mono. I had to bust out some old analog stuff. I could do the voice recognition a lot better now, but like I said, this thing is a few years old.

Well, I guess embedding my video isn't working. This site wouldn't even let me enter my website because it was a .info; it kept giving me an error saying it was wrong. So here is the link... I attached this tiny vid as well in case I nuke my server.


opcode.mov (9.16 MB)

gtoledo3's picture
Hmmm.... sounds like a perfect match with my "TalkingHeads" idea :o)

mfreakz's picture
More QC integrated functions/options

If we want to integrate Speech Recognition into interactive comps, we need more "QC friendly" options. Would the following be possible:

An input trigger to control the listening function (instead of the "system preferences settings").

An output signal when recognition is done (instead of the voice synthesis repeating the expression).

A + and - menu (like in the keyboard patch, for example) to easily manage the number of expressions, and string inputs to change each expression interactively.

cwright's picture

Overriding the system settings to control listening isn't documented, and isn't recommended. The current patch is set to listen whenever any enabled consumer depends on the recognizer's output.

Voice repeat/indicator sound is controlled by your system preferences.

Changing input strings on the fly can be expensive, and can generate wrong results (inputting duplicate strings, for example, will throw all the offsets off, so then there needs to be lots of error checking and handling, which slows it down more).
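As a rough illustration of the duplicate-string problem, the sketch below assumes the recognizer reports a result as an index into the list it was given. If the list is deduplicated before it is handed over, a separate port-to-index table is needed to keep the offsets straight. The function names are hypothetical and this is not how the patch is actually implemented.

```python
def build_command_list(inputs):
    """Deduplicate input strings while remembering which port wanted which.

    Returns (unique_commands, port_to_command), where port_to_command[i] is
    the index into unique_commands for input port i, or None for an empty
    input.
    """
    unique = []
    index_of = {}
    port_to_command = []
    for text in inputs:
        # Normalize case and whitespace so "Play" and "play " count as one.
        key = " ".join(text.lower().split())
        if not key:
            port_to_command.append(None)
            continue
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
        port_to_command.append(index_of[key])
    return unique, port_to_command


def ports_for_result(result_index, port_to_command):
    """List every input port that should fire for a recognized command."""
    return [port for port, index in enumerate(port_to_command)
            if index == result_index]
```

The extra bookkeeping has to be redone on every string change, which is part of the cost described above.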

mfreakz's picture
Everybody needs a 2001 experience !

Oh please... :( Just an input string and the +/- editor like the keyboard patch...

Imagine you could connect a Directory Scanner to a structure and then browse through your MP3/video collection...!!!! "Hal, play: Aphex Twin..." (With the world economic crash, we don't have much time left to live in the future... and then it will be the past... ;)

I don't understand how I can enter several expressions. Should I open several Speech Recognition patches?

mfreakz's picture
String input with a "Done" signal input

I understand all your comments. To use this patch within QC's interactivity as a whole, I'm planning to build a simple voice browser, and for all other kinds of projects, we need an input. My suggestion: if you add a string input AND a "done" trigger to the patch, wouldn't that prevent errors? The boolean trigger would lock the string input to its last value, and would "open" the string input again before asking for rendering. It's a kind of door that keeps out wrong, multiple, or partial string inputs. I understand that the problem is changing the input "on the fly", but with the Done signal this patch would work normally, I suppose. Could you add those options? Thanx for all. MFreakz.
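Here is a small sketch of that "door", assuming the patch is evaluated once per frame with the current strings and the state of the "done" trigger. `GatedCommandList` and its `rebuild` callback are hypothetical stand-ins; `rebuild` represents the expensive step of handing a new word list to the recognition engine.

```python
class GatedCommandList:
    """Commit pending strings to the recognizer only on a rising 'done' edge."""

    def __init__(self, rebuild):
        self._rebuild = rebuild      # called with the new list when committed
        self._pending = []
        self._active = []
        self._last_done = False

    def update(self, strings, done):
        """Call once per frame; returns the list currently being listened for."""
        self._pending = list(strings)
        rising_edge = done and not self._last_done
        self._last_done = done
        # Partial or changing input is ignored until the trigger fires, and an
        # unchanged list never causes a rebuild.
        if rising_edge and self._pending != self._active:
            self._active = list(self._pending)
            self._rebuild(self._active)
        return list(self._active)
```

Holding the trigger high does not rebuild on every frame; only the transition from false to true does, so half-typed strings never reach the engine.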

mfreakz's picture
Speech recognition: String input/refresh button...

Hi, could you add a string input to this patch? I understand that changing strings could be buggy for the system, but with a boolean trigger, a kind of refresh button, wouldn't it be possible to modify the string input and then submit it to the system's recognition engine? It could be very useful for updating string commands, or, for my project, for loading those strings from a structure (a directory scanner structure!) and then asking for recognition (the "Refresh Magic Button")... I would like to speak directory names to open them (for example)... Please, please...

mattgolsen's picture
I had this exact idea, and I wanted to do the same Max Headroom thing. I've been thinking about this since they first mentioned the 3D patch.

mfreakz's picture
Speech recognition: Please...

Please have a look at this feature; I'm very interested in it... I need a string input with a refresh button (if needed) to use this patch more dynamically and more interactively, in a QC way... If the system needs some time to update the word list, there could be a trick to handle that, no? My idea was a trigger input (a refresh button), but we could add a timer that gives the system time to deal with the new word list, if that's the problem. Please, if this feature is fairly easy to do, could you update the patch? Thanx for all. MF

gtoledo3's picture


Chris, I haven't used this at all, but I keep getting the emails on the thread, so I come over here to see why this has been a problem, and if there was some kind of workaround/macro I could setup...

AND the part that almost made me spit coffee...

You put the list in the SETTINGS! You of all people!

But, mfreakz, unless I am looking at this wrong...

Can't you just hook this up to the regular speech patch, and put your input string into that to get it to "say" different things? So in essence, the kineme one could be programmed to recognize whatever word you wish (yes/no, for example)... then that would return a true/false to the stock Apple speech synth patch.... which DOES have a string input. So whatever you have to verbalize doesn't have to relate to what the synth will say. That's actually cool, because you could set it up to do questions/answers and things like that.... I could see a comp where you go, "hal, it is cold out, wouldn't you SAY"... and the word "say" could return a true that would enable the stock speech patch to pull from a string input answer.

FYI, I am actually trying this and it isn't working (but I have a crapload of background noise right now, so it's inconclusive).

cwright's picture

Yes, I put the list in the inspector panel (:cue the tomatoes:)

I did that because changing strings isn't cheap (internally, the speech recognition engine needs to build detection structures for the different words/phrases, and this tends to take some time), and because this was 99% a proof of concept that I didn't think would really catch on :) Also, I wanted a different input mechanism completely: instead of a fixed number of inputs, I wanted to take a structure of strings, so that you could build large or small sets without needing to fiddle with input parameters. But I was lazy, and this was, again, a quick throw-together. Outputs are also tricky with dynamic inputs, because changing the port number on the fly isn't useful...

As for his needs, I think he wants dynamic speech detection, not speaking (i.e. he wants to populate the list with file names, or user names, or whatever, not have QC speak them).
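For what it's worth, the structure-of-strings idea could look roughly like the sketch below: one structure in, one fixed-shape result out, so the number of ports never changes. Everything here is hypothetical, including the class name; `rebuild_count` only stands in for the expensive step of rebuilding the engine's detection structures, and it increments only when the set of strings actually changes.

```python
class StructureRecognizerModel:
    """Toy model: a structure of phrases in, a fixed-shape result out."""

    def __init__(self):
        self._phrases = ()
        self.rebuild_count = 0

    def set_phrases(self, phrases):
        """Accept any sequence of strings; rebuild only when it changes."""
        new_phrases = tuple(phrases)
        if new_phrases != self._phrases:
            self._phrases = new_phrases
            self.rebuild_count += 1  # placeholder for the costly rebuild

    def result(self, heard):
        """Report a match as an index and text, whatever the list size."""
        try:
            index = self._phrases.index(heard)
        except ValueError:
            return {"index": -1, "text": ""}
        return {"index": index, "text": heard}
```

Downstream patches would branch on the index or the text, so a list of three file names and a list of three hundred look the same from the outside.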

Sorry I haven't touched it since -- We've been trying to wrap up some stuff for the end of the year, and plot a general course for Q1 next year as well, so a lot of small fixes for our smaller patches haven't been released yet :/ Only so many hours in a day :(

mfreakz's picture
A Speech recognition Patch wish...

Is there any chance of a new build of the Speech Recognition patch, with a string input (not "only" the inspector word list) and a refresh button? It would be really great to let this patch detect a variable input, to create more interactive compositions. Any project or plan for it?