Quiz 7: Multimodal AI
Test your understanding of multimodal AI capabilities and applications.
Questions
Question 1 (Remember)
What is a diffusion model?
- A) A model that spreads information across networks
- B) A generative model that creates images by iteratively removing noise
- C) A text-only language model
- D) A compression algorithm
Answer
B) A generative model that creates images by iteratively removing noise - Diffusion models start with random noise and progressively denoise it, typically guided by a conditioning signal such as a text prompt, until a coherent image emerges.
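To make the idea of iterative denoising concrete, here is a toy sketch in Python. It is not a real diffusion model: the function name `toy_denoise` and the "oracle" that already knows the target are assumptions made purely for illustration. A trained model would instead use a neural network to predict the noise at each step.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Toy illustration of iterative denoising (not a real diffusion model)."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        # Assumption: a stand-in oracle computes the noise exactly from the
        # known target. A trained network would have to predict it.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a growing fraction of the predicted noise at each step,
        # so the final step removes whatever noise is left.
        remaining = steps - step
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x


target = [0.0, 0.5, 1.0, 0.5]
result = toy_denoise(target)
```

The point of the sketch is the loop shape: many small corrections applied to noise, rather than a single pass that produces the image at once.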
Question 2 (Remember)
Which platform is known for generating artistic, creative images?
- A) ChatGPT
- B) Claude
- C) Midjourney
- D) Perplexity
Answer
C) Midjourney - Midjourney is particularly known for generating artistic, aesthetically pleasing images with a distinctive style, popular among designers and artists.
Question 3 (Understand)
What enables GPT-4 Vision to analyze images?
- A) Separate image processing software
- B) Native multimodal capabilities in the model architecture
- C) External plugins only
- D) Manual image description
Answer
B) Native multimodal capabilities in the model architecture - GPT-4V has a built-in ability to process visual inputs alongside text, so it can interpret images directly without external tools.
Question 4 (Understand)
Why is text-to-video generation more challenging than text-to-image?
- A) Video uses fewer parameters
- B) Video requires temporal consistency across many frames
- C) Video technology is older
- D) There is no difference in difficulty
Answer
B) Video requires temporal consistency across many frames - Video generation must maintain coherence across time (objects moving consistently, physics, continuity), making it significantly more complex than single images.
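As a rough illustration of what "temporal consistency" means, the Python sketch below scores a clip by how much each pixel changes between consecutive frames. The function name `frame_jitter` and the tiny two-pixel "frames" are hypothetical, and this mean absolute difference is only a crude proxy, not a metric used by any particular video model.

```python
def frame_jitter(frames):
    """Mean absolute per-pixel change between consecutive frames."""
    total = 0.0
    count = 0
    # Compare each frame with the one that follows it.
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    # A clip with fewer than two frames has nothing to compare.
    return total / count if count else 0.0


# A smooth clip: brightness drifts slowly from frame to frame.
smooth = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
# Frames generated independently: each is fine alone, but the clip flickers.
flicker = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]

smooth_score = frame_jitter(smooth)
flicker_score = frame_jitter(flicker)
```

The flickering clip scores far higher than the smooth one, even though every individual frame is a valid image. A single-image generator never has to satisfy this kind of cross-frame constraint.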
Question 5 (Apply)
You need to extract text from handwritten documents. Which AI capability would you use?
- A) Text-to-image
- B) Speech-to-text
- C) Vision/image analysis with OCR
- D) Text-to-speech
Answer
C) Vision/image analysis with OCR - The vision capabilities of multimodal models can read text in images, including handwriting, and transcribe it (OCR functionality).
Question 6 (Apply)
A marketing team needs product images for an e-commerce site but has no photography budget. What's the best approach?
- A) Use stock photos only
- B) Generate product images using text-to-image AI like DALL-E
- C) Skip images entirely
- D) Use text descriptions only
Answer
B) Generate product images using text-to-image AI like DALL-E - AI image generation can create product visuals from descriptions, though quality and accuracy should be verified for commercial use.
Question 7 (Analyze)
How do DALL-E and Stable Diffusion compare in terms of accessibility and control?
- A) They are identical
- B) DALL-E is open-source
- C) DALL-E is API-based; Stable Diffusion is open-source with more customization
- D) Stable Diffusion requires no technical knowledge
Answer
C) DALL-E is API-based; Stable Diffusion is open-source with more customization - DALL-E is a proprietary model accessed through OpenAI's products and API, while Stable Diffusion's code and model weights are openly available, allowing local deployment and extensive customization such as fine-tuning.
Question 8 (Analyze)
What are the primary business applications for speech-to-text AI?
- A) Image generation
- B) Meeting transcription, customer call analysis, accessibility
- C) Video editing
- D) Code generation
Answer
B) Meeting transcription, customer call analysis, accessibility - Speech-to-text enables automatic transcription of meetings, analysis of customer calls at scale, and accessibility features such as captions for users who are deaf or hard of hearing.
Question 9 (Evaluate)
An organization wants to use AI-generated images for advertising. What's the most important consideration?
- A) Generation speed
- B) Image resolution only
- C) Copyright, authenticity, and brand safety implications
- D) Cost per image
Answer
C) Copyright, authenticity, and brand safety implications - Commercial use of AI images raises questions about copyright, potential for misleading content, and brand reputation risks that must be carefully evaluated.
Question 10 (Create)
Design a multimodal AI solution for a real estate company that needs: property descriptions, virtual staging, and voice-enabled search.
- A) Text-only chatbot
- B) Image generation only
- C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search
- D) Manual processes only
Answer
C) Integrated solution: text generation for descriptions, image AI for staging, speech-to-text for search - Each modality serves a purpose: LLMs for compelling descriptions, image AI for virtual staging, speech recognition for hands-free search.
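One way to picture the integrated solution is as a thin coordinator with one pluggable component per modality. The Python sketch below is illustrative only: `ListingAssistant`, its field names, and the stub lambdas are all hypothetical stand-ins, not calls to any real model or vendor API. In practice each callable would wrap whichever text, image, or speech service the company selects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ListingAssistant:
    # Hypothetical adapters, one per modality; none is a real library call.
    describe: Callable[[Dict], str]        # text generation for descriptions
    stage: Callable[[bytes, str], bytes]   # image AI for virtual staging
    transcribe: Callable[[bytes], str]     # speech-to-text for voice search

    def build_listing(self, facts: Dict, empty_room: bytes, style: str) -> Dict:
        """Combine a generated description with a virtually staged photo."""
        return {
            "description": self.describe(facts),
            "staged_photo": self.stage(empty_room, style),
        }

    def voice_search(self, audio: bytes, listings: List[Dict]) -> List[Dict]:
        """Transcribe a spoken query, then keep listings matching every word."""
        terms = self.transcribe(audio).lower().split()
        return [
            listing for listing in listings
            if all(term in listing["description"].lower() for term in terms)
        ]


# Stand-in stubs so the sketch runs without any external service.
assistant = ListingAssistant(
    describe=lambda f: f"{f['beds']}-bedroom home in {f['city']}",
    stage=lambda img, style: img + style.encode(),
    transcribe=lambda audio: audio.decode(),
)

listing = assistant.build_listing({"beds": 3, "city": "Austin"}, b"room", "modern")
matches = assistant.voice_search(b"Austin", [listing])
```

Keeping each modality behind its own callable means a component can be swapped (for example, a different speech recognizer) without touching the others.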
Score Interpretation
- 9-10 correct: Excellent understanding of multimodal AI
- 7-8 correct: Good grasp, review missed concepts
- 5-6 correct: Fair understanding, revisit chapter sections
- Below 5: Re-read Chapter 7 before proceeding