Visual Question Answering

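Visual Question Answering (VQA) is the task of answering a natural-language question about an image. It combines computer vision, to interpret the image, with natural language processing, to interpret the question and produce the answer.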

Hands-on Example: Answering Questions About Images

from transformers import pipeline
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

# Initialize the visual question answering pipeline
# (no model is specified, so transformers downloads its default VQA checkpoint)
vqa = pipeline("visual-question-answering")

# Load an image
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/800px-Cute_dog.jpg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # stop early if the download failed
image = Image.open(BytesIO(response.content)).convert("RGB")

# Questions about the image
questions = [
    "What animal is in the image?",
    "What color is the dog?",
    "Is the dog inside or outside?",
    "Does the dog look happy?"
]

# Display the image
plt.figure(figsize=(8, 8))
plt.imshow(image)
plt.axis('off')
plt.title("Query Image")
plt.show()

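# Answer each question
print("Visual Question Answering:")
for question in questions:
    # The pipeline returns a list of candidate answers sorted by score;
    # request only the best one and take it from the list
    result = vqa(image=image, question=question, top_k=1)[0]
    print(f"Q: {question}")
    print(f"A: {result['answer']} (Score: {result['score']:.4f})")
    print("-" * 50)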

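The visual question answering pipeline passes the image and the question together to a vision-language model and returns a ranked list of candidate answers, each with a confidence score. A good answer therefore depends on both image understanding and language comprehension.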
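
Each call to the pipeline yields a list of candidate answers, where every candidate is a dict with an "answer" and a "score" key. The following is a minimal sketch of a helper for displaying several candidates rather than only one. The name format_candidates is hypothetical, and the sample list is made-up data in the pipeline's output shape, not real model output.

```python
def format_candidates(candidates, k=3):
    """Return one display line per candidate, highest score first."""
    # Sort defensively in case the list is not already ordered by score
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    return [
        f"{rank}. {c['answer']} ({c['score']:.2%})"
        for rank, c in enumerate(ranked, start=1)
    ]


# Made-up sample in the pipeline's output shape (not real model output)
sample = [
    {"answer": "brown", "score": 0.61},
    {"answer": "white", "score": 0.83},
    {"answer": "tan", "score": 0.12},
]

for line in format_candidates(sample, k=2):
    print(line)
```

With a real pipeline, the same helper could be applied to the output of vqa(image=image, question=question, top_k=3).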

Try It Yourself:

Test VQA on complex scenes with multiple objects and ask questions about relationships between objects.
Try asking more abstract questions about mood, style, or aesthetic qualities.
Experiment with ambiguous questions to see how the model handles uncertainty.
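
One rough way to observe uncertainty in the last exercise is to compare the scores of the two best candidates. The sketch below flags a question as ambiguous when those scores are close. The name is_ambiguous and the default margin of 0.15 are assumptions chosen for illustration, not values recommended by the library, and the scores should be treated as a heuristic signal rather than calibrated probabilities.

```python
def is_ambiguous(candidates, margin=0.15):
    """Return True when the top two answers are closer than `margin` in score.

    `candidates` is a list of dicts with "answer" and "score" keys,
    the shape returned by the visual question answering pipeline.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    if len(ranked) < 2:
        # A single candidate gives nothing to compare against
        return False
    return (ranked[0]["score"] - ranked[1]["score"]) < margin
```

To try it on real output, request at least two candidates, for example is_ambiguous(vqa(image=image, question=question, top_k=2)).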
