Quiz 7: Multimodal AI

Test your understanding of multimodal AI capabilities and applications.

Questions

Question 1 (Remember)

What is a diffusion model?

A) A model that spreads information across networks
B) A generative model that creates images by iteratively removing noise
C) A text-only language model
D) A compression algorithm

Answer

B) A generative model that creates images by iteratively removing noise - Diffusion models start with random noise and progressively denoise it, typically guided by a text prompt, until a coherent image emerges.
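
To make "iteratively removing noise" concrete, here is a minimal toy sketch. It is not a real diffusion model: the trained neural network is replaced by an oracle that already knows the target, and `toy_denoise`, the 4-pixel `target`, and the step size of 0.1 are all assumptions made for illustration.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Toy illustration of iterative denoising (not a real diffusion model)."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for _ in range(steps):
        # Stand-in for the trained network: an oracle that knows the target
        # and reports the remaining noise exactly. A real model has to
        # *predict* this noise from the noisy image and the prompt.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a small fraction of the predicted noise at each step.
        x = [xi - 0.1 * ni for xi, ni in zip(x, predicted_noise)]
        # Track the largest remaining per-pixel error.
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors


target = [0.0, 0.5, 1.0, 0.5]  # hypothetical 4-pixel "image"
image, errors = toy_denoise(target)
print(f"error after first step: {errors[0]:.3f}, after last step: {errors[-1]:.3f}")
```

The point to notice is the loop: no single step produces the image, but many small noise-removal steps converge on it.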

Question 2 (Remember)

Which platform is known for generating artistic, creative images?

A) ChatGPT
B) Claude
C) Midjourney
D) Perplexity

Answer

C) Midjourney - Midjourney is particularly known for generating artistic, aesthetically pleasing images with a distinctive style, popular among designers and artists.

Question 3 (Understand)

What enables GPT-4 Vision to analyze images?

A) Separate image processing software
B) Native multimodal capabilities in the model architecture
C) External plugins only
D) Manual image description

Answer

B) Native multimodal capabilities in the model architecture - GPT-4V (GPT-4 Vision) has a built-in ability to process visual inputs alongside text, so it interprets images directly without external tools.

Question 4 (Understand)

Why is text-to-video generation more challenging than text-to-image?

A) Video uses fewer parameters
B) Video requires temporal consistency across many frames
C) Video technology is older
D) There is no difference in difficulty

Answer

B) Video requires temporal consistency across many frames - Video generation must maintain coherence over time (consistent object motion, plausible physics, scene continuity), making it significantly more complex than generating a single image.
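
One rough way to see what temporal consistency means is to measure how much pixels change between consecutive frames. The sketch below is a crude proxy under stated assumptions, not an established video-quality metric: `mean_frame_change` and the tiny 3-pixel frames are hypothetical, and a low score alone does not make a video good (a frozen clip scores 0).

```python
def mean_frame_change(frames):
    """Average absolute per-pixel change between consecutive frames."""
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    # Fewer than two frames means there is no change to measure.
    return total / count if count else 0.0


# Hypothetical 3-pixel frames whose brightness drifts smoothly over time.
smooth = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
# Frames that could each look fine alone but jump around from one to the next.
flickering = [[0.0, 0.1, 0.2], [0.9, 0.8, 0.1], [0.1, 0.0, 0.9]]

print(mean_frame_change(smooth) < mean_frame_change(flickering))
```

An image generator only has to get one frame right; a video generator also has to keep the frame-to-frame changes small and meaningful, which is the extra constraint this comparison hints at.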

Question 5 (Apply)

You need to extract text from handwritten documents. Which AI capability would you use?

A) Text-to-image
B) Speech-to-text
C) Vision/image analysis with OCR
D) Text-to-speech

Answer

C) Vision/image analysis with OCR - The vision capabilities of multimodal models can read images of handwritten text and transcribe their content, performing the task traditionally handled by optical character recognition (OCR).

Question 6 (Apply)

A marketing team needs product images for an e-commerce site but has no photography budget. What's the best approach?

A) Use stock photos only
B) Generate product images using text-to-image AI like DALL-E
C) Skip images entirely
D) Use text descriptions only

Answer

B) Generate product images using text-to-image AI like DALL-E - AI image generation can create product visuals from descriptions, though quality and accuracy should be verified for commercial use.

Question 7 (Analyze)

Which statement best compares DALL-E and Stable Diffusion in terms of accessibility and control?

A) They are identical
B) DALL-E is open-source
C) DALL-E is API-based; Stable Diffusion is open-source with more customization
D) Stable Diffusion requires no technical knowledge

Answer

C) DALL-E is API-based; Stable Diffusion is open-source with more customization - DALL-E is a hosted model accessed through OpenAI's API and products, while Stable Diffusion is open-source with publicly released weights, allowing local deployment and extensive customization.

Question 8 (Analyze)

What are the primary business applications for speech-to-text AI?

A) Image generation
B) Meeting transcription, customer call analysis, accessibility
C) Video editing
D) Code generation

Answer

B) Meeting transcription, customer call analysis, accessibility - Speech-to-text enables automatic transcription of meetings, analysis of customer calls at scale, and accessibility features such as captions for users who are deaf or hard of hearing.

Question 9 (Evaluate)

An organization wants to use AI-generated images for advertising. What's the most important consideration?

A) Generation speed
B) Image resolution only
C) Copyright, authenticity, and brand safety implications
D) Cost per image

Answer

C) Copyright, authenticity, and brand safety implications - Commercial use of AI images raises questions about copyright, potential for misleading content, and brand reputation risks that must be carefully evaluated.

Question 10 (Create)

Design a multimodal AI solution for a real estate company that needs property descriptions, virtual staging, and voice-enabled search.

A) Text-only chatbot
B) Image generation only
C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search
D) Manual processes only

Answer

C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search - Each modality serves a purpose: LLMs for compelling descriptions, image AI for virtual staging, speech recognition for hands-free search.
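
The integrated design can be sketched as three swappable steps wired together. Everything named here (`Listing`, `build_listing`, `voice_search`, and the three `stub_*` functions) is hypothetical; the stubs stand in for whatever text, image, and speech services a team would actually choose, and no real vendor API is called.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Listing:
    description: str       # produced by the text-generation step
    staged_image_ref: str  # produced by the image (virtual staging) step


def build_listing(facts: dict, photo_ref: str,
                  write_text: Callable[[dict], str],
                  stage_image: Callable[[str], str]) -> Listing:
    """Combine the text and image modalities into one listing."""
    return Listing(description=write_text(facts),
                   staged_image_ref=stage_image(photo_ref))


def voice_search(audio_ref: str,
                 transcribe: Callable[[str], str],
                 listings: List[Listing]) -> List[Listing]:
    """Turn speech into text, then match every spoken word against descriptions."""
    words = transcribe(audio_ref).lower().split()
    return [item for item in listings
            if all(word in item.description.lower() for word in words)]


# Hypothetical stubs: replace each with a call to a real model or service.
def stub_write_text(facts: dict) -> str:
    return f"{facts['bedrooms']} bedroom home with {facts['feature']} in {facts['city']}"


def stub_stage_image(photo_ref: str) -> str:
    return photo_ref + "#staged"


def stub_transcribe(audio_ref: str) -> str:
    return "Two bedroom garden"


listings = [
    build_listing({"bedrooms": "Two", "feature": "garden", "city": "Leeds"},
                  "photo-1.jpg", stub_write_text, stub_stage_image),
    build_listing({"bedrooms": "Three", "feature": "garage", "city": "York"},
                  "photo-2.jpg", stub_write_text, stub_stage_image),
]
matches = voice_search("query.wav", stub_transcribe, listings)
print([m.description for m in matches])
```

Passing each capability in as a function keeps the modalities independent, so one service can be replaced without touching the other two.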

Score Interpretation

9-10 correct: Excellent understanding of multimodal AI
7-8 correct: Good grasp, review missed concepts
5-6 correct: Fair understanding, revisit chapter sections
Below 5: Re-read Chapter 7 before proceeding
