
Now That We Have the Voice


Now that we have the voice, we need an STT (Speech to Text) engine, or text recognition. Of all the ones I have tried, I'm going with Google's. The problem is that to use it we have to save what we say as a WAV, convert it to FLAC, and send it through the API as if from a browser (what a mess).

After gathering bits and pieces from here and there, here it is, summarized:

We record the WAV with the "rec" command (make sure the microphone input volume in alsamixer is at 100%, otherwise it will take it forever to recognize what we are saying):

rec -r 16000 -e signed-integer -b 16 -c 1 audio.wav trim 0 4

We convert from WAV to FLAC:

sox audio.wav -r 16000 -b 16 -c 1 audio.flac vad reverse vad reverse lowpass -2 2500

And now comes the magic... without using the browser, but sending a header that simulates it:

curl --data-binary @audio.flac --header 'Content-Type: audio/x-flac; rate=16000' 'http://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es-ES&maxresults=1' 1>audio.txt

We store what the Google API returns in audio.txt. You can change the language in lang=es-ES (I set Spain-Spanish by default), or ask for more results instead of 1 (the best match).
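For convenience, the three steps can be chained in one script. This is a minimal sketch that just strings together the exact commands above (set -e aborts the run if any step fails):

#!/bin/bash
set -e
# 1) record 4 seconds of 16 kHz mono audio
rec -r 16000 -e signed-integer -b 16 -c 1 audio.wav trim 0 4
# 2) re-encode as FLAC, trimming silence at both ends
sox audio.wav -r 16000 -b 16 -c 1 audio.flac vad reverse vad reverse lowpass -2 2500
# 3) POST it to the API and keep the JSON answer
curl --data-binary @audio.flac \
     --header 'Content-Type: audio/x-flac; rate=16000' \
     'http://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es-ES&maxresults=1' \
     1>audio.txt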

To boil what it returns down to a single phrase, we format the result (the FILETOOBIG check catches the HTML error page Google sends back instead of JSON when it rejects the upload):

FILETOOBIG=`cat audio.txt | grep "<HTML>"`
TRANSCRIPT=`cat audio.txt | cut -d"," -f3 | sed 's/^.*utterance\":\"\(.*\)\"$/\1/g'`
CONFIDENCE=`cat audio.txt | cut -d"," -f4 | sed 's/^.*confidence\":0.\([0-9][0-9]\).*$/\1/g'`

Here CONFIDENCE (in case we want to use it) is the percentage probability that the recognition is right, and TRANSCRIPT is what it understood.
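To see what those lines extract, here is a minimal sketch run against an illustrative response of the kind this v1 API used to return (the JSON content is made up, and the cut -d"," trick only holds as long as the utterance itself contains no commas):

# audio.txt would hold one line of JSON such as:
# {"status":0,"id":"abc123","hypotheses":[{"utterance":"hola mundo","confidence":0.92}]}
TRANSCRIPT=`cat audio.txt | cut -d"," -f3 | sed 's/^.*utterance\":\"\(.*\)\"$/\1/g'`      # -> hola mundo
CONFIDENCE=`cat audio.txt | cut -d"," -f4 | sed 's/^.*confidence\":0.\([0-9][0-9]\).*$/\1/g'`  # -> 92
echo "Understood: $TRANSCRIPT (confidence: $CONFIDENCE%)"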

With that in place, a simple if serves as proof that it works:

if echo "$TRANSCRIPT" | grep -q "Hola"; then
    aoss espeak -ves "$TRANSCRIPT"
elif echo "$TRANSCRIPT" | grep -q "quién eres"; then
    aoss espeak -ves "Soy Yarvis, la máquina de Yuki Sekisan"
elif echo "$TRANSCRIPT" | grep -q "saluda a Alberto"; then
    aoss espeak -ves "Hola Alberto, eres muy pesado, vete ya"
elif echo "$TRANSCRIPT" | grep -q "main craft"; then
    aoss espeak -ves "Abriendo Maincraft" | java -Xmx1024M -Xms512M -cp /home/yuki/Escritorio/Minecraft.jar #net.minecraft.LauncherFrame
else
    aoss espeak -ves "No te entiendo"
    aoss espeak -ves "$TRANSCRIPT"
fi

It works perfectly :). Now for the next step... migrating it to a database! I've gone with MySQL.
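The MySQL migration is left for a later post, but the idea would be along these lines. Everything in this sketch is hypothetical (the yarvis database, the commands table and its columns, the credentials); it only illustrates replacing the if/elif chain with a table lookup:

# hypothetical table: commands(pattern VARCHAR(64), reply VARCHAR(255))
# beware: $TRANSCRIPT goes unescaped into the query -- fine for a toy, not for production
REPLY=`mysql -u yarvis -p"$MYSQL_PASS" -N -B -e \
  "SELECT reply FROM yarvis.commands WHERE '$TRANSCRIPT' LIKE CONCAT('%', pattern, '%') LIMIT 1;"`
if [ -n "$REPLY" ]; then
    aoss espeak -ves "$REPLY"
else
    aoss espeak -ves "No te entiendo"
fi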

Basically, the library encodes the sample into FLAC using a third-party FLAC library, then issues a request to "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" with a Content-Type header specifying the format (FLAC) and the sample rate (8 kHz).

http://www.codeproject.com/Articles/338010/Fun-with-Google-Speech-Recognition-service

Introduction

I was excited to discover open web services like Google's, and it was amazing when I heard about Google speech recognition. In this article I give some tips for using the Google speech recognition API in a Windows application, recording voice directly from audio input devices. And, like a delicious spice, I wrap the simple speech recognition program into a utility for quickly adding issues to a Redmine project.

Background

The basic idea was: you push the button, a timer starts together with the wave-in device opening, the main loop starts and PCM data from the buffers with your voice is recorded to a file, the timer stops and the audio file is posted to Google for recognition.

The first task was understanding FLAC encoding in real time. You could say: 'In *nix I can write a couple of commands in the terminal and do it all: record, encode, post the FLAC file and receive the answer from the server. So why not encode the file with an encoder program started after recording the wave file?' Because it's boring. Just imagine: your program writes an already prepared FLAC audio file!

From the time I wrote an application for batch-converting MP3 files to OGG/Vorbis, I still had a library that can encode PCM to Vorbis in real time; there was also a ring buffer for that. At that point the appropriate handler for FLAC was not long in coming. You may know that Google accepts FLAC at 16 kHz, 16 bits per sample, 1 (mono) channel. Using the example in libflac, I added three functions: InitialiseEncoder, ProcessEncoder, CloseEncoder, which respectively open the file and prepare the encoder, feed 16-bit PCM samples to the encoder, and close the file and destroy the encoder. One thing: I don't understand why it can't add metadata to the FLAC file. Maybe charset problems?

The wonderful article WaveLib, which includes a wave-in API implementation, provides the Recorder class: it starts the WaveInRecorder and in parallel uses a thread to transmit PCM data to the encoder.

File Uploading

The basic upload call is below; change the lang parameter as needed:

string result = WebUpload.UploadFileEx(flacpath, "http://www.google.com/speech-api/v1/recognize?lang=ru&client=chromium", "file", "audio/x-flac; rate=16000", parameters, null);

The response from the server is received in JSON format.

Issue Creating

Where could you use speech recognition? Maybe for creating issues? It may not be practical, but it is certainly fun. The Redmine web application includes a REST web service. Through it we can create as many issues as we need; just specify the project and tracker (by the way, I could only get the list of trackers on versions younger than 1.3*):

RedmineManager manager = new RedmineManager(Configuration.RedmineHost, Configuration.RedmineUser, Configuration.RedminePassword);
// New issue
var newIssue = new Issue
{
    Subject = Title,
    Description = Description,
    Project = new IdentifiableName() { Id = ProjectId },
    Tracker = new IdentifiableName() { Id = TrackerId }
};
// Get the ID of the current user
User thisuser = (from u in manager.GetObjectList<User>(new System.Collections.Specialized.NameValueCollection())
                 where u.Login == Configuration.RedmineUser
                 select u).FirstOrDefault();
if (thisuser != null)
    newIssue.AssignedTo = new IdentifiableName() { Id = thisuser.Id };
manager.CreateObject(newIssue);

Points of Interest

When it was done, I turned my attention to the recording timeout: it gives you 4 seconds for your speech, which may not suit all phrases. Maybe the form needs a stop button? A ring buffer will save you from data loss when recording directly to FLAC like this.
When data arrives from the wave-in device, it goes into the ring buffer.


Introduction

As I already mentioned in my article A low-level audio player in C#, there are no built-in classes in the .NET Framework for dealing with sound. This holds true not only for audio playback, but also for audio capture. It should be noted, though, that the Managed DirectX 9 SDK does include classes for high-level and low-level audio manipulation. However, sometimes you don't want your application to depend on the full DX 9 runtime just to do basic sound playback and capture, and there are also some areas where Managed DirectSound doesn't help at all (for example, multi-channel sound playback and capture). Nevertheless, I strongly recommend you use Managed DirectSound for sound playback and capture unless you have a good reason not to.

This article describes a sample application that uses the waveIn and waveOut APIs in C# through P/Invoke to capture an audio signal from the sound card's input and play it back (almost) at the same time.

Using the code

The sample code reuses the WaveOutPlayer class from my article A low-level audio player in C#. The new classes in this sample are WaveInRecorder and FifoStream.

The FifoStream class extends System.IO.Stream to implement a FIFO (first-in first-out) of bytes. The overridden Write method adds data to the FIFO's tail, and the Read method peeks and removes data from the FIFO's head. The Length property returns the amount of buffered data at any time. Calling Flush will clear all pending data.

The WaveInRecorder class is analogous to the WaveOutPlayer class. In fact, if you look at the source files, you'll notice that the implementations of these classes are very similar. As with WaveOutPlayer, the interface of this class has been reduced to the strict minimum. Creating an instance of WaveInRecorder will cause the system to start recording immediately. Here's the code that creates the WaveOutPlayer and WaveInRecorder instances:

private void Start()
{
    Stop();
    try
    {
        WaveLib.WaveFormat fmt = new WaveLib.WaveFormat(44100, 16, 2);
        m_Player = new WaveLib.WaveOutPlayer(-1, fmt, 16384, 3, new WaveLib.BufferFillEventHandler(Filler));
        m_Recorder = new WaveLib.WaveInRecorder(-1, fmt, 16384, 3, new WaveLib.BufferDoneEventHandler(DataArrived));
    }
    catch
    {
        Stop();
        throw;
    }
}

The WaveInRecorder constructor takes five parameters. Except for the last parameter, their meaning is the same as in WaveOutPlayer. The first parameter is the ID of the wave input device that you want to use. The value -1 represents the default system device, but if your system has more than one sound card, you can pass any number from 0 to the number of installed sound cards minus one to select a particular device. The second parameter is the format of the audio samples. The third and fourth parameters are the size of the internal wave buffers and the number of buffers to allocate. You should set these to reasonable values: smaller buffers will give you less latency, but the captured audio may have gaps in it if your computer is not fast enough. The fifth and last parameter is a delegate that will be called periodically as the internal audio buffers fill with captured data. In the sample application we just write the captured data to the FIFO, like this:

private void DataArrived(IntPtr data, int size)
{
    if (m_RecBuffer == null || m_RecBuffer.Length < size)
        m_RecBuffer = new byte[size];
    System.Runtime.InteropServices.Marshal.Copy(data, m_RecBuffer, 0, size);
    m_Fifo.Write(m_RecBuffer, 0, m_RecBuffer.Length);
}

Similarly, the Filler method is called every time the player needs more data.
Our implementation just reads the data from the FIFO, as shown below:

private void Filler(IntPtr data, int size)
{
    if (m_PlayBuffer == null || m_PlayBuffer.Length < size)
        m_PlayBuffer = new byte[size];
    if (m_Fifo.Length >= size)
        m_Fifo.Read(m_PlayBuffer, 0, size);
    else
        for (int i = 0; i < m_PlayBuffer.Length; i++)
            m_PlayBuffer[i] = 0;
    System.Runtime.InteropServices.Marshal.Copy(m_PlayBuffer, 0, data, size);
}

Note that we declared the temporary buffers m_RecBuffer and m_PlayBuffer as member fields in order to improve performance by saving some garbage collections.

To stop streaming, just call Dispose on the player and capture objects. We also need to flush the FIFO so that the next time Start is called there is no residual data to play:

private void Stop()
{
    if (m_Player != null)
        try { m_Player.Dispose(); }
        finally { m_Player = null; }
    if (m_Recorder != null)
        try { m_Recorder.Dispose(); }
        finally { m_Recorder = null; }
    m_Fifo.Flush(); // clear all pending data
}

Conclusion

curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F myfile="@C:\input.flac" -k -o "C:\output.txt"

It works great! Just a few notes:
1) When copying and pasting, watch out for the different quote characters.
2) Make sure rate=16000 matches the sample rate of the recording (in Audacity: set it before recording)!
3) I got slightly better results with mono recordings.

Does anyone have anything on "* Done waiting for 100-continue"? I'd like to get rid of the milliseconds spent waiting for it...
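Regarding note 2, one way to verify the match before uploading (a small sketch; soxi ships with sox):

# print the FLAC's actual sample rate -- it must equal the rate= value in the header
soxi -r input.flac
# if it differs, resample to 16 kHz mono before posting
sox input.flac -r 16000 -c 1 input16k.flac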


Path path = Paths.get("out.flac");
byte[] data = Files.readAllBytes(path);

String request = "https://www.google.com/"
        + "speech-api/v1/recognize?"
        + "xjerr=1&client=speech2text&lang=en-US&maxresults=10";
URL url = new URL(request);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setDoOutput(true);
connection.setDoInput(true);
connection.setInstanceFollowRedirects(false);
connection.setRequestMethod("POST");
connection.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
connection.setRequestProperty("User-Agent", "speech2text");
connection.setConnectTimeout(60000);
connection.setUseCaches(false);

DataOutputStream wr = new DataOutputStream(connection.getOutputStream());
wr.writeBytes(new String(data));
wr.flush();
wr.close();
connection.disconnect();

System.out.println("Done");

BufferedReader in = new BufferedReader(
        new InputStreamReader(
                connection.getInputStream()));
String decodedString;
while ((decodedString = in.readLine()) != null) {
    System.out.println(decodedString);
}

You should use wr.write(data); instead of wr.writeBytes(new String(data)); (round-tripping the raw FLAC bytes through a String corrupts them). Google's response:

{"status": 0, "id": "e0f4ced346ad18bbb81756ed4d639164-1", "hypotheses": [{"utterance": "hola cómo estás", "confidence": 0.94028234}, {"utterance": "hello how r que"}, {"utterance": "cómo estás hoy u"}, {"utterance": "hola cómo estás en"}]}
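If jq is available, it is a more robust way to pull fields out of that JSON than the cut/sed chain earlier (a sketch, assuming the response shape shown above):

jq -r '.hypotheses[0].utterance'  audio.txt    # best transcript
jq -r '.hypotheses[0].confidence' audio.txt    # e.g. 0.94028234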


package test;

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TestGoogleApiForSpeechRecognition {

public static void main(String[] args) throws Exception {

    Path path = Paths.get("C:\\Users\\CDAC\\Downloads\\priyanka.flac");
    byte[] data = Files.readAllBytes(path);

    String request = "https://www.google.com/"
            + "speech-api/v1/recognize?"
            + "xjerr=0&client=speech2text&lang=en-US&maxresults=20";
    URL url = new URL(request);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);
    connection.setDoInput(true);
    connection.setInstanceFollowRedirects(false);
    connection.setRequestMethod("POST");
    connection.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
    connection.setRequestProperty("User-Agent", "speech2text");
    connection.setConnectTimeout(60000);
    connection.setUseCaches(false);

    DataOutputStream wr = new DataOutputStream(connection.getOutputStream());
    wr.write(data);
    wr.flush();
    wr.close();
    connection.disconnect();

    System.out.println("Done");

    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));

    String decodedString;
    while ((decodedString = in.readLine()) != null) {
        System.out.println(decodedString);
    }
}
}

