Week 13: Capstone Project - Autonomous Humanoid with Conversational AI

Capstone Project Introduction

Project Overview

The capstone project synthesizes all skills from Module 4 into a fully integrated autonomous humanoid system. You will build a conversational AI agent that accepts natural language commands, plans multi-step tasks, perceives its environment through vision and language grounding, and executes physical actions—all while providing natural language feedback to the user.

This project represents the state-of-the-art in embodied AI: systems that seamlessly blend perception, reasoning, and action in human-centric environments. By completion, you will have hands-on experience with the same technologies powering commercial humanoid assistants like Figure 02, Tesla Optimus, and 1X NEO.

System Architecture

The capstone integrates five core subsystems:

1. Voice Interface Layer

Whisper ASR: Continuous speech recognition with wake-word detection
Command Parser: Extract intent, entities, and parameters from transcriptions
Text-to-Speech: Generate natural language feedback (using pyttsx3 or OpenAI TTS)

2. Planning and Reasoning Engine

GPT-4 Task Planner: Decompose high-level goals into action sequences
ReAct Controller: Adaptive plan execution with real-time replanning
Memory System: Track world state, object locations, and task history

3. Multimodal Perception

CLIP Visual Grounding: Map language descriptions to scene objects
Object Detection: YOLOv8 or Detic for instance segmentation
Depth Estimation: RGB-D camera integration for 3D localization
Gesture Recognition: MediaPipe hand tracking for non-verbal commands

Nav2 Integration: Autonomous navigation to target locations
MoveIt 2: Motion planning for arm manipulation tasks
Grasp Planning: Compute stable grasps from visual observations

5. Feedback and Monitoring

Execution Status: Report task progress via speech ("Navigating to kitchen...")
Error Handling: Detect failures and request human assistance
Safety Monitoring: Collision avoidance, force limits, emergency stop

System Data Flow

User Voice Command
    ↓
[Whisper ASR] → Transcription
    ↓
[GPT-4 Parser] → Structured Intent
    ↓
[Task Planner] → Action Sequence
    ↓
┌─────────────┴─────────────┐
│ ReAct Execution Loop:     │
│                           │
│ [Camera Feed]             │
│      ↓                    │
│ [CLIP + Object Detection] │
│      ↓                    │
│ [Visual Grounding]        │
│      ↓                    │
│ [Navigation/Manipulation] │
│      ↓                    │
│ [Execution Feedback]      │
│      ↓                    │
│ [Text-to-Speech Status]   │
│      ↓                    │
│ Next Step or Complete     │
└───────────────────────────┘

Example Interaction Scenario

User: "Hey robot, bring me the red mug from the kitchen counter."

System Processing:

Voice: Whisper transcribes command
Parsing: GPT-4 extracts intent=fetch, object=red mug, location=kitchen counter
Planning: Generate steps:
- Navigate to kitchen
- Locate "red mug" using CLIP
- Approach counter
- Grasp mug
- Navigate to user
- Hand over mug
Execution:
- Step 1: "Navigating to kitchen..." (TTS feedback)
- Step 2: Camera captures counter scene → CLIP identifies red mug at pixel (x, y) → Convert to 3D point
- Step 3: "Approaching counter..." → Navigate to pre-grasp pose
- Step 4: MoveIt plans grasp trajectory → Execute gripper closure
- Step 5: "Returning to you..." → Navigate back
- Step 6: "Here is your red mug." → Extend arm for handover

Adaptive Behavior: If "red mug" not found on counter:

Replan: "I don't see a red mug on the counter. Should I check the dish rack?"
User: "Yes, check the rack."
Continue: Update plan and search new location

Technical Requirements

Hardware

Humanoid robot platform (or simulated equivalent in Gazebo/Isaac Sim)
RGB-D camera (Intel RealSense D435, Kinect Azure)
Microphone and speakers for voice I/O
GPU with 8GB+ VRAM (NVIDIA RTX 3060 Ti or better)
Gripper with force/torque sensing

Software Stack

ROS 2 Humble on Ubuntu 22.04
Nav2 for autonomous navigation
MoveIt 2 for manipulation planning
OpenAI API access (GPT-4, Whisper, TTS)
Python 3.10+ with PyTorch, transformers, OpenCV
MediaPipe, CLIP, YOLOv8 for perception

Development Environment

# Install core dependencies
sudo apt update
sudo apt install -y ros-humble-nav2-* ros-humble-moveit python3-pip

# Install Python packages
pip install openai whisper clip torch torchvision opencv-python mediapipe ultralytics pyttsx3

# Clone capstone workspace
mkdir -p ~/capstone_ws/src
cd ~/capstone_ws/src
git clone <your_capstone_repo>

# Build workspace
cd ~/capstone_ws
colcon build --symlink-install
source install/setup.bash

Project Milestones

Milestone 1: Voice Interface (Week 13 Day 1-2)

Implement wake-word activated listening
Integrate Whisper for transcription
Parse commands into structured intents
Add TTS feedback for acknowledgment

Deliverable: System that accepts voice commands and confirms understanding

Milestone 2: Visual Grounding (Week 13 Day 3-4)

Set up camera feed in ROS 2
Integrate CLIP for object localization
Implement depth-based 3D coordinate conversion
Test referring expressions ("the leftmost mug")

Deliverable: System that locates objects from natural language descriptions

Milestone 3: Task Planning (Week 13 Day 5)

Implement GPT-4 task decomposition
Build ReAct execution loop
Add world state tracking (object locations, robot pose)

Deliverable: System that plans and executes multi-step tasks adaptively

Milestone 4: Full Integration (Week 13 Day 6-7)

Connect all subsystems (voice → planning → vision → action)
Implement error handling and recovery
Add gesture commands as alternative input
Deploy on physical robot or high-fidelity simulation

Deliverable: End-to-end autonomous humanoid assistant

Evaluation Criteria

Your capstone will be evaluated on:

Functionality (40%):

Correctly executes 5/5 test scenarios (fetch, place, navigate, search, collaborative task)
Handles edge cases (object not found, path blocked, grasp failure)
Adaptive replanning when initial plan fails

Integration (25%):

Seamless coordination between voice, vision, planning, and control
Real-time performance (< 5s latency from command to action start)
Robust error handling across subsystems

User Experience (20%):

Natural language interaction (clear TTS feedback, confirmation loops)
Safe operation (collision avoidance, force limits)
Intuitive gesture commands

Code Quality (15%):

Modular architecture with clear interfaces
Comprehensive error handling
Inline documentation and comments

Test Scenarios

Demonstrate your system on these tasks:

Fetch and Deliver: "Bring me the blue water bottle from the shelf"
Multi-Object Task: "Clear the table by moving all mugs to the dish rack"
Search and Report: "Is there a laptop on the desk? If so, what color is it?"
Collaborative Cooking: "Help me prepare dinner. First, get the cutting board, then the knife."
Gesture Override: Use pointing gesture to indicate which object to grasp when multiple candidates exist

Next Steps

You've reviewed the capstone architecture and requirements. In Week 13: Capstone Implementation, you'll build the integrated system step-by-step, starting with the voice-to-action pipeline and progressing to full multimodal control.

Resources

Code Repository: https://github.com/yourname/vla-capstone
Documentation: System architecture diagrams, API references
Support: Office hours (link in course page), discussion forum
Demo Videos: Reference implementations for each milestone

Ready to build your autonomous humanoid? Let's proceed to implementation.

Project Overview​

System Architecture​

1. Voice Interface Layer​

2. Planning and Reasoning Engine​

3. Multimodal Perception​

4. Navigation and Manipulation​

5. Feedback and Monitoring​

System Data Flow​

Example Interaction Scenario​

Technical Requirements​

Hardware​

Software Stack​

Development Environment​

Project Milestones​

Milestone 1: Voice Interface (Week 13 Day 1-2)​

Milestone 2: Visual Grounding (Week 13 Day 3-4)​

Milestone 3: Task Planning (Week 13 Day 5)​

Milestone 4: Full Integration (Week 13 Day 6-7)​

Evaluation Criteria​

Test Scenarios​

Next Steps​

Resources​