If you’re accessing this activity directly, did you know there are nine other activities in this series up on our website? Check out our AI page to see a breakdown of the activities and our recommended order to complete them in! Also, these activities introduce AI concepts and terminology. If you find yourself unfamiliar with any of the words in this activity, the landing page also has a glossary of AI terms. Happy space-station-fixing!
To recap: You and your group mates are astronauts and scientists aboard the Actua Orbital Station. Unfortunately, your station just got bombarded by magnetic rays and your electronics have begun to shut down! The only one who can save you is the station’s AI, DANN. DANN stands for Dedicated Actua Neural Network, and it’s gone a little loopy. Brush up on your technical skills, learn about AI, and save yourself and your crewmates!
You’ve managed to reinitialize DANN’s audio core using your knowledge from “What Machines See: Digging into Machine Vision”, but when you asked him to open a door, he made you a milkshake instead. It looks like DANN’s audio core might have sustained more damage than we thought: his audio recognition model is listening to us, but it isn’t understanding what it’s hearing. You need to retrain his audio processing skills so you can give him commands. Once we’re sure it’s fixed, we can move on to DANN’s morality system in “Ethics in AI: Don’t Let DANN Turn Evil”!
In this activity, participants will train a machine listening program to accomplish a classification task. They will explore a pre-trained audio classifier before using Google’s Teachable Machine to train their own audio-based classification model to recognize certain keywords.
A full reset and retraining of DANN’s audio recognition model is necessary, but to accomplish that, you will need to reorient the station’s main antenna array to point at the Station Recovery Satellite (SRS). Using what you learned while training the visual core’s command model, train a simple machine listening model to handle commands for the station’s reaction control system (RCS). You can then use the RCS to realign the antenna array to point at one of the backup satellites.
Opening hook: Exploring a pre-trained model
Just like in the “What Machines See” activity, you will start by exploring a pre-trained speech recognition model. This model, called SpeechCommands18w, is trained to recognize 18 words:
- The numbers “zero” to “nine”
- Four directions: “up”, “down”, “left”, and “right”
- “Go” and “stop”
- “Yes” and “no”
This model will also indicate if it detects an “unknown word” or if it only hears “background noise”. The model has been loaded into a test program for you to evaluate it:
1. Open the interactive sketch. Since this is an audio recognition model, you may be asked for permission to use your computer’s microphone.
2. Pick five of the words from the model’s vocabulary (i.e. five of the 18 words the model has been trained to recognize).
3. Say each of the five words that you picked. Leave 1 to 2 seconds of silence between each word. Write down whether or not the model successfully recognized the word that you said.
4. Repeat step 3 between 1 and 3 times.
5. Now, based on your results from steps 3 and 4, consider the following questions:
   - Were there any words that the model had a hard time recognizing? Which ones?
   - Did the model recognize any words incorrectly?
   - What conditions do you think would help the model correctly recognize words?
The SpeechCommands18w model can recognize many potentially useful words, but it isn’t trained to recognize the specific command words you need to use to control the station. This means that you’ll need to train a model that can recognize the words that the station will respond to.
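For the curious, the test program you just used most likely loads its model with a sound-classifier library. Here is a rough sketch of how that could look, assuming the ml5.js library (which ships the SpeechCommands18w model); the `topLabel` helper is our own illustration, not part of ml5.js:

```javascript
// Pure helper (our own illustration): pick the most confident entry from a
// results array shaped like ml5's output, i.e. [{ label, confidence }, ...].
function topLabel(results) {
  return results.reduce((best, r) => (r.confidence > best.confidence ? r : best));
}

// Browser-only part, guarded so the helper above can be used anywhere.
if (typeof ml5 !== "undefined") {
  // Load the pre-trained SpeechCommands18w model (18 words plus
  // "Background Noise" and "Unknown Word" classes).
  const classifier = ml5.soundClassifier("SpeechCommands18w", () =>
    console.log("Model loaded")
  );
  // Classify microphone audio continuously and report the top guess.
  classifier.classify((error, results) => {
    if (error) return console.error(error);
    console.log(topLabel(results).label);
  });
}
```

In a real sketch the classifier keeps listening to the microphone and calls the result callback repeatedly, which is why the model can respond to words as you say them.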
Activity 1: Defining audio classification models
Training an audio classification model with Teachable Machine follows much the same process as training an image classification model.
Step 1. Load Teachable Machine
- Teachable Machine can be accessed online here: https://teachablemachine.withgoogle.com/train/audio.
- That link will bring you to Teachable Machine’s audio classification model. You have probably already used one of the other Teachable Machine models, the image classifier, to build the model for DANN’s visual core.
Step 2. Define classes
- The audio core will use the names of the classes in your model to call the corresponding commands.
- What classes do you think you will need to define? Consult Appendix A for how the reaction control system (RCS) functions.
- Hint: Mission Control suggests that you define one class per direction as well as the word “stop”. Teachable Machine has already created the necessary “background noise” class, but it might be smart to add an “unknown word” class as in the SpeechCommands18w model.
Step 3. Create training data
- Training data for the audio model can be created using a microphone attached to your computer. Teachable Machine records a continuous stretch of audio (2 seconds, by default) and then automatically breaks it into 1-second clips for training. Each class needs a minimum of 8 different audio samples (except the background noise class, which needs 20 samples), but more samples should mean better recognition accuracy, so record some extra if you have the time!
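To get a feel for those numbers, here is a small hypothetical helper (the 2-second default and the 8/20 minimums come from the text above; the function itself is only an illustration, not part of Teachable Machine):

```javascript
// Illustrative only: each recording is split into 1-second clips, so an
// N-second recording contributes roughly N samples to a class.
function clipsRecorded(recordSeconds, recordings, isBackground) {
  const clips = Math.floor(recordSeconds) * recordings;
  const minimum = isBackground ? 20 : 8; // minimums from the activity text
  return { clips, minimum, enough: clips >= minimum };
}

// With the 2-second default, four recordings meet the minimum for a command class:
console.log(clipsRecorded(2, 4, false)); // { clips: 8, minimum: 8, enough: true }
```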
Step 4: Train model
Just like the image classification model that you trained, you now need to train your audio recognition model with your data so that it can recognize the required command words. To do this:
- Click on the “Train Model” button in the box labelled “Training”.
- Wait for training to complete. After a short while, below the training button, you should see a timer counting up and a number out of 50. Your training is complete once that number reaches 50 out of 50, but this might take a few moments.
- When training is complete, the “Preview” box should have an “Output” section which displays, in real-time, your trained model’s classification of your microphone audio and your model’s confidence in its classification.
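That confidence value is also how a program built on your model could decide when a command should actually fire. A minimal, hypothetical sketch (the 0.75 threshold, the function name, and the ignored-class list are our own assumptions, not part of Teachable Machine):

```javascript
// Hypothetical command gate: only act on a classification when the top class
// is a real command and its confidence clears a threshold (assumed: 0.75).
const IGNORED = new Set(["Background Noise", "Unknown Word"]);

function commandFor(results, threshold = 0.75) {
  // results: [{ label, confidence }, ...], like the Preview box's output
  const top = results.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  if (IGNORED.has(top.label) || top.confidence < threshold) return null;
  return top.label;
}

console.log(commandFor([
  { label: "stop", confidence: 0.92 },
  { label: "Background Noise", confidence: 0.08 },
])); // "stop"
```

Gating on confidence like this is one way to keep a voice interface from firing commands on half-heard words or background chatter.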
Step 5: Test model
- Now you need to check whether your audio classification model has been trained well enough for you to realign the station’s antenna array. This means that your model should reliably recognize the words that you chose, and detect background noise or unknown words when necessary.
- You can adapt the evaluation questions that you used for the image classification model to test if your model is functioning well:
- Does your model accurately recognize your commands when spoken by you or other group members that were part of your training data?
- Does your model accurately recognize your commands when spoken by other people who were not part of the training data?
- Does your model accurately recognize your commands when spoken in a different environment from where the training data was generated (e.g. a different part of the room, different background noises and levels)?
- If yes to all of the above, your model is ready for use. If not, work through the troubleshooting questions below.
- Does your training data include any sounds that are not good representations of the class that they are in?
- Does your model work on some command classes but have difficulty recognizing specific command classes? Listening to the training data, could you hypothesize why this might be?
If the evaluation results are not satisfactory, a model can be re-trained once more training data has been added. In many cases, additional or better training data will solve issues of model reliability and accuracy. A model that can reliably pass at least one of the evaluation questions may still be loaded into the audio core at the discretion of the user. Reliable performance, however, cannot be guaranteed.
Step 6. Applying your trained model
- Paste the URL for your Teachable Machine model where the website prompts you to. This will load your AI model into DANN’s audio core.
- Use your model to reorient the station’s antenna array and align it with the Station Recovery Satellite (SRS).
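Behind the scenes, the sketch most likely loads your shareable link with a sound-classifier library. A hedged sketch of how that could work, assuming the ml5.js API; the URL below is a placeholder for the link Teachable Machine gives you, and `isTeachableMachineUrl` is our own sanity-check helper:

```javascript
// Placeholder for the shareable model link Teachable Machine provides.
const MODEL_URL =
  "https://teachablemachine.withgoogle.com/models/YOUR_MODEL_ID/model.json";

// Our own quick sanity check on the pasted link (illustrative only).
function isTeachableMachineUrl(url) {
  return url.startsWith("https://teachablemachine.withgoogle.com/models/");
}

// Browser-only part, guarded so the helper above runs anywhere.
if (typeof ml5 !== "undefined" && isTeachableMachineUrl(MODEL_URL)) {
  const classifier = ml5.soundClassifier(MODEL_URL, () =>
    console.log("Command model loaded")
  );
  classifier.classify((error, results) => {
    if (error) return console.error(error);
    console.log(results[0].label, results[0].confidence);
  });
}
```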
Reflection & Debrief
Once you’ve uploaded your trained audio model into the interactive sketch and used it, discuss the following questions:
- Did your model function well for its purpose?
- Were there any words that your model had trouble understanding?
- What do you think the pros and cons are of voice interfaces?
- Where do we see them applied?
- Where might they be helpful?
- If you’ve completed the previous activity on training an image classification model: does your audio model work better than your image classification model? Support your position with specific observations and examples.
Extensions & Modifications
- “Model and data provenance” is information about a model’s creation and training. Review the model and data provenance section for the SpeechCommands18w model, here: https://learn.ml5js.org/#/reference/sound-classifier?id=model-and-data-provenance
- Why is this information important?
- How could this information help us build better models?
- This activity can be extended by creating new classes and audio commands. Imagine that you are giving commands to an AI in charge of a space station. What commands would you need to give it? Create and train new classes with those commands, making sure that you produce the same amount of data as you did for the other classes. Test it out. When the model has more classes, is it better or worse at identifying commands? How could it be improved?
- To save time, the opening hook’s model exploration can be done as a whole class, with students taking turns saying words from the vocabulary, instead of in small groups.
- To make the activity easier for younger participants, Teachable Machine supports pre-recording a dataset and uploading it to Google Drive. This allows participants to build on an existing dataset for a less complex experience and better results.