Skip to main content

Week 13: Capstone Project - Autonomous Humanoid with Conversational AI

Capstone Project Introduction

Project Overview​

The capstone project synthesizes all skills from Module 4 into a fully integrated autonomous humanoid system. You will build a conversational AI agent that accepts natural language commands, plans multi-step tasks, perceives its environment through vision and language grounding, and executes physical actionsβ€”all while providing natural language feedback to the user.

This project represents the state-of-the-art in embodied AI: systems that seamlessly blend perception, reasoning, and action in human-centric environments. By completion, you will have hands-on experience with the same technologies powering commercial humanoid assistants like Figure 02, Tesla Optimus, and 1X NEO.

System Architecture​

The capstone integrates five core subsystems:

1. Voice Interface Layer​

  • Whisper ASR: Continuous speech recognition with wake-word detection
  • Command Parser: Extract intent, entities, and parameters from transcriptions
  • Text-to-Speech: Generate natural language feedback (using pyttsx3 or OpenAI TTS)

2. Planning and Reasoning Engine​

  • GPT-4 Task Planner: Decompose high-level goals into action sequences
  • ReAct Controller: Adaptive plan execution with real-time replanning
  • Memory System: Track world state, object locations, and task history

3. Multimodal Perception​

  • CLIP Visual Grounding: Map language descriptions to scene objects
  • Object Detection: YOLOv8 or Detic for instance segmentation
  • Depth Estimation: RGB-D camera integration for 3D localization
  • Gesture Recognition: MediaPipe hand tracking for non-verbal commands

4. Navigation and Manipulation​

  • Nav2 Integration: Autonomous navigation to target locations
  • MoveIt 2: Motion planning for arm manipulation tasks
  • Grasp Planning: Compute stable grasps from visual observations

5. Feedback and Monitoring​

  • Execution Status: Report task progress via speech ("Navigating to kitchen...")
  • Error Handling: Detect failures and request human assistance
  • Safety Monitoring: Collision avoidance, force limits, emergency stop

System Data Flow​

User Voice Command
↓
[Whisper ASR] β†’ Transcription
↓
[GPT-4 Parser] β†’ Structured Intent
↓
[Task Planner] β†’ Action Sequence
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ReAct Execution Loop: β”‚
β”‚ β”‚
β”‚ [Camera Feed] β”‚
β”‚ ↓ β”‚
β”‚ [CLIP + Object Detection] β”‚
β”‚ ↓ β”‚
β”‚ [Visual Grounding] β”‚
β”‚ ↓ β”‚
β”‚ [Navigation/Manipulation] β”‚
β”‚ ↓ β”‚
β”‚ [Execution Feedback] β”‚
β”‚ ↓ β”‚
β”‚ [Text-to-Speech Status] β”‚
β”‚ ↓ β”‚
β”‚ Next Step or Complete β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example Interaction Scenario​

User: "Hey robot, bring me the red mug from the kitchen counter."

System Processing:

  1. Voice: Whisper transcribes command
  2. Parsing: GPT-4 extracts intent=fetch, object=red mug, location=kitchen counter
  3. Planning: Generate steps:
    • Navigate to kitchen
    • Locate "red mug" using CLIP
    • Approach counter
    • Grasp mug
    • Navigate to user
    • Hand over mug
  4. Execution:
    • Step 1: "Navigating to kitchen..." (TTS feedback)
    • Step 2: Camera captures counter scene β†’ CLIP identifies red mug at pixel (x, y) β†’ Convert to 3D point
    • Step 3: "Approaching counter..." β†’ Navigate to pre-grasp pose
    • Step 4: MoveIt plans grasp trajectory β†’ Execute gripper closure
    • Step 5: "Returning to you..." β†’ Navigate back
    • Step 6: "Here is your red mug." β†’ Extend arm for handover

Adaptive Behavior: If "red mug" not found on counter:

  • Replan: "I don't see a red mug on the counter. Should I check the dish rack?"
  • User: "Yes, check the rack."
  • Continue: Update plan and search new location

Technical Requirements​

Hardware​

  • Humanoid robot platform (or simulated equivalent in Gazebo/Isaac Sim)
  • RGB-D camera (Intel RealSense D435, Kinect Azure)
  • Microphone and speakers for voice I/O
  • GPU with 8GB+ VRAM (NVIDIA RTX 3060 Ti or better)
  • Gripper with force/torque sensing

Software Stack​

  • ROS 2 Humble on Ubuntu 22.04
  • Nav2 for autonomous navigation
  • MoveIt 2 for manipulation planning
  • OpenAI API access (GPT-4, Whisper, TTS)
  • Python 3.10+ with PyTorch, transformers, OpenCV
  • MediaPipe, CLIP, YOLOv8 for perception

Development Environment​

# Install core dependencies
sudo apt update
sudo apt install -y ros-humble-nav2-* ros-humble-moveit python3-pip

# Install Python packages
pip install openai whisper clip torch torchvision opencv-python mediapipe ultralytics pyttsx3

# Clone capstone workspace
mkdir -p ~/capstone_ws/src
cd ~/capstone_ws/src
git clone <your_capstone_repo>

# Build workspace
cd ~/capstone_ws
colcon build --symlink-install
source install/setup.bash

Project Milestones​

Milestone 1: Voice Interface (Week 13 Day 1-2)​

  • Implement wake-word activated listening
  • Integrate Whisper for transcription
  • Parse commands into structured intents
  • Add TTS feedback for acknowledgment

Deliverable: System that accepts voice commands and confirms understanding

Milestone 2: Visual Grounding (Week 13 Day 3-4)​

  • Set up camera feed in ROS 2
  • Integrate CLIP for object localization
  • Implement depth-based 3D coordinate conversion
  • Test referring expressions ("the leftmost mug")

Deliverable: System that locates objects from natural language descriptions

Milestone 3: Task Planning (Week 13 Day 5)​

  • Implement GPT-4 task decomposition
  • Build ReAct execution loop
  • Add world state tracking (object locations, robot pose)

Deliverable: System that plans and executes multi-step tasks adaptively

Milestone 4: Full Integration (Week 13 Day 6-7)​

  • Connect all subsystems (voice β†’ planning β†’ vision β†’ action)
  • Implement error handling and recovery
  • Add gesture commands as alternative input
  • Deploy on physical robot or high-fidelity simulation

Deliverable: End-to-end autonomous humanoid assistant

Evaluation Criteria​

Your capstone will be evaluated on:

Functionality (40%):

  • Correctly executes 5/5 test scenarios (fetch, place, navigate, search, collaborative task)
  • Handles edge cases (object not found, path blocked, grasp failure)
  • Adaptive replanning when initial plan fails

Integration (25%):

  • Seamless coordination between voice, vision, planning, and control
  • Real-time performance (< 5s latency from command to action start)
  • Robust error handling across subsystems

User Experience (20%):

  • Natural language interaction (clear TTS feedback, confirmation loops)
  • Safe operation (collision avoidance, force limits)
  • Intuitive gesture commands

Code Quality (15%):

  • Modular architecture with clear interfaces
  • Comprehensive error handling
  • Inline documentation and comments

Test Scenarios​

Demonstrate your system on these tasks:

  1. Fetch and Deliver: "Bring me the blue water bottle from the shelf"
  2. Multi-Object Task: "Clear the table by moving all mugs to the dish rack"
  3. Search and Report: "Is there a laptop on the desk? If so, what color is it?"
  4. Collaborative Cooking: "Help me prepare dinner. First, get the cutting board, then the knife."
  5. Gesture Override: Use pointing gesture to indicate which object to grasp when multiple candidates exist

Next Steps​

You've reviewed the capstone architecture and requirements. In Week 13: Capstone Implementation, you'll build the integrated system step-by-step, starting with the voice-to-action pipeline and progressing to full multimodal control.

Resources​

  • Code Repository: https://github.com/yourname/vla-capstone
  • Documentation: System architecture diagrams, API references
  • Support: Office hours (link in course page), discussion forum
  • Demo Videos: Reference implementations for each milestone

Ready to build your autonomous humanoid? Let's proceed to implementation.