
Quiz 7: Multimodal AI

Test your understanding of multimodal AI capabilities and applications.


Questions

Question 1 (Remember)

What is a diffusion model?

  A) A model that spreads information across networks
  B) A generative model that creates images by iteratively removing noise
  C) A text-only language model
  D) A compression algorithm

Answer

B) A generative model that creates images by iteratively removing noise. Diffusion models start from random noise and progressively denoise it, often guided by a text prompt, until a coherent image emerges.
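
The iterative denoising idea can be illustrated with a toy sketch. This is not how a production diffusion model is implemented: a real model uses a trained neural network to predict the noise, whereas the hypothetical `toy_denoise` function below cheats by computing the noise from a known target. It only shows the loop structure of starting from random noise and removing a little of it at each step.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Move a pure-noise sample toward `target` in `steps` small updates.

    Assumption for illustration only: the "noise prediction" is computed
    from the known target. A real diffusion model has no access to the
    target and uses a trained network to estimate the noise instead.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per target element.
    x = [rng.gauss(0.0, 1.0) for _ in target]

    for step in range(steps):
        remaining = steps - step
        # Oracle estimate of the noise still present in the sample.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove an equal share of the remaining noise at each step,
        # so the last step (remaining == 1) removes whatever is left.
        x = [xi - n / remaining for xi, n in zip(x, predicted_noise)]

    return x


if __name__ == "__main__":
    clean = [0.5, -1.0, 2.0]
    print(toy_denoise(clean, steps=20))
```

Each pass shrinks the remaining error, which mirrors why generated images look progressively less noisy as sampling proceeds.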


Question 2 (Remember)

Which platform is known for generating artistic, creative images?

  A) ChatGPT
  B) Claude
  C) Midjourney
  D) Perplexity

Answer

C) Midjourney. Midjourney is particularly known for artistic, aesthetically pleasing images with a distinctive style, and it is popular among designers and artists.


Question 3 (Understand)

What enables GPT-4 Vision to analyze images?

  A) Separate image processing software
  B) Native multimodal capabilities in the model architecture
  C) External plugins only
  D) Manual image description

Answer

B) Native multimodal capabilities in the model architecture. GPT-4V has a built-in ability to process visual inputs alongside text, so it interprets images directly rather than relying on external tools.


Question 4 (Understand)

Why is text-to-video generation more challenging than text-to-image?

  A) Video uses fewer parameters
  B) Video requires temporal consistency across many frames
  C) Video technology is older
  D) There is no difference in difficulty

Answer

B) Video requires temporal consistency across many frames. Video generation must stay coherent over time (consistent object motion, plausible physics, scene continuity), which makes it significantly harder than generating a single image.


Question 5 (Apply)

You need to extract text from handwritten documents. Which AI capability would you use?

  A) Text-to-image
  B) Speech-to-text
  C) Vision/image analysis with OCR
  D) Text-to-speech

Answer

C) Vision/image analysis with OCR. The vision capabilities of multimodal models can read images of handwritten text and transcribe the content, which is the function of optical character recognition (OCR).


Question 6 (Apply)

A marketing team needs product images for an e-commerce site but has no photography budget. What's the best approach?

  A) Use stock photos only
  B) Generate product images using text-to-image AI like DALL-E
  C) Skip images entirely
  D) Use text descriptions only

Answer

B) Generate product images using text-to-image AI like DALL-E. AI image generation can create product visuals from text descriptions, but the results should be checked for quality and for accuracy against the real product before commercial use.


Question 7 (Analyze)

Compare DALL-E and Stable Diffusion in terms of accessibility and control:

  A) They are identical
  B) DALL-E is open-source
  C) DALL-E is API-based; Stable Diffusion is open-source with more customization
  D) Stable Diffusion requires no technical knowledge

Answer

C) DALL-E is API-based; Stable Diffusion is open-source with more customization. DALL-E is accessed through OpenAI's API, while Stable Diffusion is open-source, allowing local deployment and extensive customization.


Question 8 (Analyze)

What are the primary business applications for speech-to-text AI?

  A) Image generation
  B) Meeting transcription, customer call analysis, accessibility
  C) Video editing
  D) Code generation

Answer

B) Meeting transcription, customer call analysis, accessibility. Speech-to-text enables automatic transcription of meetings, analysis of customer calls at scale, and accessibility features such as captions for users who are deaf or hard of hearing.


Question 9 (Evaluate)

An organization wants to use AI-generated images for advertising. What's the most important consideration?

  A) Generation speed
  B) Image resolution only
  C) Copyright, authenticity, and brand safety implications
  D) Cost per image

Answer

C) Copyright, authenticity, and brand safety implications. Commercial use of AI-generated images raises questions about copyright, the potential for misleading content, and brand reputation risk, all of which must be evaluated carefully.


Question 10 (Create)

Design a multimodal AI solution for a real estate company that needs: property descriptions, virtual staging, and voice-enabled search.

  A) Text-only chatbot
  B) Image generation only
  C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search
  D) Manual processes only

Answer

C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search. Each modality serves a purpose: LLMs write compelling descriptions, image AI handles virtual staging, and speech recognition supports hands-free search.


Score Interpretation

  • 9-10 correct: Excellent understanding of multimodal AI
  • 7-8 correct: Good grasp, review missed concepts
  • 5-6 correct: Fair understanding, revisit chapter sections
  • Below 5: Re-read Chapter 7 before proceeding
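
For anyone scoring the quiz programmatically, the bands above can be expressed as a small lookup. This is a minimal sketch; the function name `interpret_score` is an assumption for illustration and is not part of the course materials.

```python
def interpret_score(correct: int) -> str:
    """Map a number of correct answers (0-10) to its interpretation band."""
    if not 0 <= correct <= 10:
        raise ValueError("correct must be between 0 and 10")
    if correct >= 9:
        return "Excellent understanding of multimodal AI"
    if correct >= 7:
        return "Good grasp, review missed concepts"
    if correct >= 5:
        return "Fair understanding, revisit chapter sections"
    return "Re-read Chapter 7 before proceeding"


if __name__ == "__main__":
    print(interpret_score(8))
```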
