How to Build an Object Detection App in Python Using YOLOv5
Last updated October 15, 2024

Over the summer, I got to deliver my very first solo workshop at the Kansas City Developer Conference 2024 (KCDC), where we used a pre-trained model for the sake of time and slow internet. I was honestly really happy when a few attendees came to chat with me afterward, asking if I could share a written tutorial so they could continue building on top of this project. To all of you who came (or couldn't come but wanted to) with the most important prerequisite from the list ("a great attitude in case this workshop turns into a dumpster fire"), this guide is for you!

[Screenshot of the "Other" section of the workshop prerequisites: "A great attitude in case this workshop turns into a dumpster fire. If you have any questions in advance, feel free to boop me on Twitter @dianasoyster"]

In this tutorial, we'll learn how to rebuild the same Python application from my workshop while discovering the helpful tools, platforms, and pre-trained models used. We'll also gain a basic understanding of how machine learning works for live image detection.

Technologies Used to Build Live Object Detection

All the technologies used were beginner-friendly, with the option to use them for hardcore variations of this project. We used the following:

  • Jupyter Notebook: Our Interactive Workspace: Jupyter Notebook is a versatile tool that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It's an excellent environment for developing and testing machine learning models due to its interactive nature and ease of use.

  • Python: Our Language of Choice: Python is the programming language we used to write our image detection logic. Its simplicity and vast ecosystem of libraries make it an ideal choice for machine learning and computer vision projects.

  • PyTorch: The Deep Learning Framework: PyTorch is an open-source deep learning framework that provides a flexible and efficient platform for building and training machine learning models. It's particularly popular in the research community due to its dynamic computation graph and ease of use.

  • YOLOv5: The Real-Time Object Detector: YOLOv5 (You Only Look Once version 5) is an object detection model known for its speed and accuracy. It can detect objects in images and videos in real-time, making it perfect for live image detection applications. While researching, I discovered demos where people used YOLOv5 to count the number of cars in a live video of a freeway—and it was correct!

Key Stages of Machine Learning for Live Image Detection

Several people in my workshop were unfamiliar with Python and/or machine learning—and that's totally okay! Here's a simplified explanation to get you up to speed: machine learning teaches models to recognize and understand patterns in data. Artificial intelligence then uses these trained models to make decisions or take actions based on what they learned.

Now, what does that even mean? Let's take live image detection as an example. Applications using live image detection, such as self-driving cars, rely on AI to analyze and interpret visual data. A key component of these applications is object detection, driven by machine learning. The process includes gathering and labeling data, preprocessing, model training, and deployment for effective live image detection implementation.

Data Collection

Object detection begins with gathering large volumes of images and videos that represent real-world scenarios. For example, in developing a self-driving car system, images of various street scenes are collected to capture different lighting conditions, weather situations, and traffic scenarios. Using a diverse dataset is important to create the foundation for training machine learning models to recognize and locate objects.

Image Labeling

Following data collection, the next step is image labeling. In this phase, human annotators or specialized tools annotate the collected images and videos, marking objects such as cars, pedestrians, and traffic lights with bounding boxes and categorizing them accordingly. The accuracy of these labels directly impacts how well the model identifies objects in practice. (Subtext: the accuracy of these labels determines whether or not a self-driving car runs someone over.)
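To make that concrete, here's a minimal sketch of what a label could look like in the YOLO annotation style, where each image gets a matching .txt file with one line per object and coordinates normalized to the image size. The filename and class IDs here are made up for illustration.

# Hypothetical label file for street_scene_001.jpg.
# Each line: class_id x_center y_center width height (all values normalized to 0-1).
label_lines = [
    "2 0.48 0.63 0.22 0.35",  # e.g., class 2 = "car" in a hypothetical class list
    "0 0.10 0.55 0.05 0.20",  # e.g., class 0 = "person"
]
with open("street_scene_001.txt", "w") as f:
    f.write("\n".join(label_lines))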

Preprocessing, Model Training, and Deployment

For brevity, after collecting and labeling data, the next steps involve preprocessing the data to make it uniform and useful for training. This includes standardizing formats and adjusting colors - essentially, it's "cleaning" the data.
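As a rough illustration (not the exact pipeline YOLOv5 uses), this "cleaning" step often amounts to a few lines of resizing and normalizing with OpenCV and NumPy. The filename and target size below are just placeholders:

import cv2
import numpy as np

img = cv2.imread("street_scene_001.jpg")      # load a collected image (BGR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # standardize the color channel order
img = cv2.resize(img, (640, 640))             # standardize the dimensions
img = img.astype(np.float32) / 255.0          # scale pixel values to the 0-1 range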

After that, the next step is model training, where algorithms are used to teach the model to recognize patterns and make predictions based on the prepared data. This involves adjusting settings to improve performance and accuracy, essentially teaching the model how to interpret the cleaned data.

Finally, the model is deployed and integrated into systems to perform tasks such as real-time object detection. However, this process is often ongoing as the model is fed more data to train on and requires fine-tuning for better performance.

How to Build a Machine Learning Application Using a Pre-Trained Model

We'll be using a pre-trained model called YOLOv5. This improved version of YOLOv4 includes adaptive anchor box selection and image scaling right from the start of training.

Without adaptive anchor box selection, the model may struggle to detect objects of various shapes and sizes. For instance, it might consistently use small, square anchor boxes, even when dealing with tall, rectangular objects. This can often lead to less accurate predictions because the model isn't adjusting to the specific shapes and sizes of objects in different images. Nikita Malviya wrote a blog titled "Object Detection — Anchor Box VS Bounding Box" that concisely explains the difference between them.

Similarly, without image scaling, the model may have difficulty handling images of different sizes. If the model expects a specific input size, feeding it an image that's significantly larger or smaller could result in inaccurate detections. This is because the model might not be able to effectively process the image's content or identify objects at the correct scale. Stephane Charette does a great job of explaining how image scaling works under the section "What is the optimal network size?" of his blog.
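For intuition, here's a minimal sketch of aspect-ratio-preserving scaling (often called letterboxing): resize the image so its longest side matches the network size, then pad the remainder with a constant color. This is a simplified illustration, not YOLOv5's actual implementation:

import cv2
import numpy as np

def letterbox(img, size=640, pad_value=114):
    # Scale the longest side to `size`, keeping the aspect ratio intact.
    h, w = img.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    # Pad the shorter side with gray pixels so the output is square.
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas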

These improvements make it faster to detect objects in images. Now, let's get into it!

How to Install Jupyter

In your terminal, run this command:

brew install jupyter

Check the version you just installed:

jupyter --version

Now, launch it:

jupyter notebook

It should open in a web browser running at localhost. Go to File > New > Notebook to open a fresh page.

How to Install Jupyter Dependencies & Imports

Now that we're in Jupyter, let's make sure we have our basic requirements.

In Jupyter notebooks, the ! (exclamation mark) is used to execute shell commands directly from within a code cell. It works alongside Jupyter's magic commands and lets you run shell commands as if you were in a terminal or command prompt. Code cells run Python by default, so you won't need to include ! whenever you're running normal Python code (or whatever supported language you're using).
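For example, you can mix a shell command and regular Python in the same cell; the line starting with ! goes to the shell, and the rest runs as Python:

!python --version      # runs in the shell
import sys
print(sys.executable)  # runs as normal Python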

1. Install PyTorch.

https://pytorch.org/

!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

2. Clone the repo.

!git clone https://github.com/ultralytics/yolov5

3. Install the dependencies from the requirements.txt file.

The requirements.txt file is a text file used in Python projects to list the dependencies (libraries or packages) required by the project to run. Note that we use %cd instead of !cd so the directory change persists for the rest of the notebook (each ! command runs in its own subshell).

%cd yolov5
!pip install -r requirements.txt

4. Import PyTorch and the supporting libraries.

import torch
from matplotlib import pyplot as plt
import numpy as np
import cv2

How to Load a YOLOv5 Model

This loads the pre-trained Ultralytics model from PyTorch Hub, which is the PyTorch equivalent of TensorFlow Hub.

On PyTorch Hub, search for "ultralytics" to find the available models.

We’ll be using the baseline or small model and need to load that one:

[Figure: charts comparing the different YOLOv5 model sizes, with the small model highlighted. Source: Ultralytics YOLOv5]

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

Let me explain each part:

  • Loading a pre-trained model: torch.hub.load loads the actual pre-trained model hosted on a GitHub repo. In this case, it's the ultralytics/yolov5 repo.

  • Model selection: 'yolov5s' specifies the variant or size of the YOLOv5 model to load. YOLOv5 comes in different sizes (e.g., s for small, m for medium, l for large). We went with small.

When you run this command, it’ll download the model’s weights, parameters, and configurations from the repo we specified.

To look at our model, run:

model
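If you'd rather see just the classes the model can detect instead of the full architecture printout, the loaded model also exposes them:

print(model.names)   # e.g., {0: 'person', 1: 'bicycle', 2: 'car', ...}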

How to Make Image Detections with YOLOv5

Next, we need to create an image variable and pass in the link to our .jpg. This could really be any image off the internet, as long as you have a direct link to it.

img = 'https://nbc16.com/resources/media2/16x9/full/1015/center/80/c70166c3-4aae-4465-825d-3ca1a60b88e5-large16x9_elsa2.JPG'

Now, let's get it to tell us what's in the image we just pasted in. That's where we pass the image to the model, store the output in results, and print it to see a text summary of what's in there.

For this example, we’ve got the dimensions and a basic description of what the image is of.

results = model(img)
results.print()
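Besides the printed summary, the results object also exposes the raw detections if you want to work with them in code, either as tensors or as a pandas DataFrame:

print(results.xyxy[0])           # tensor of [xmin, ymin, xmax, ymax, confidence, class] per detection
print(results.pandas().xyxy[0])  # the same data as a pandas DataFrame with class names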

Cool! So we know what the picture is of, but we want to actually see the image because it hasn’t been rendered yet. So let’s render it using:

%matplotlib inline 
plt.imshow(np.squeeze(results.render()))
plt.show()

%matplotlib inline is a "magic command" that makes sure these plots are shown in this notebook versus in an outside window.

In plt.imshow(np.squeeze(results.render())), np.squeeze removes any dimensions that aren't needed, and plt.imshow prepares the rendered image for display.

plt.show() actually shows you the image.

Now to return the image’s detections, run this:

results.render()

This function takes the original image and draws boxes around objects detected by a computer vision model. By overlaying these boxes onto the image, it visually indicates the locations of the detected objects. This is part of the process of visualizing object detection results, making it easier to see what the model has identified in the image.
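If you also want a copy of the annotated image on disk, the results object can save it for you (by default into a runs/detect/ folder):

results.save()   # writes the image with bounding boxes drawn, e.g., to runs/detect/exp/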

How to Make Live Image Detections

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()

    # Make detections 
    results = model(frame)

    cv2.imshow('YOLO', np.squeeze(results.render()))

    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

Line-by-line explanation:

This line initializes a video capture object (cap) using OpenCV (cv2). It tries to access the first available camera device (0), which is typically the webcam on your computer.

cap = cv2.VideoCapture(0)

This starts a loop that continues as long as the camera (cap) is open and able to capture frames.

while cap.isOpened():

This captures a frame from the camera (cap). The returned values are stored in ret (a boolean indicating if the frame was successfully read) and frame (the actual image frame captured from the camera).

ret, frame = cap.read()

This passes the captured frame through the YOLOv5 model and stores the detection results for that frame.

results = model(frame)

'YOLO' is just the window title. cv2.imshow() displays the rendered frame (with the detection boxes drawn by results.render()) in a window titled 'YOLO'. This window shows the live video feed from your camera with detections overlaid.

cv2.imshow('YOLO', np.squeeze(results.render()))

cv2.waitKey(10) waits for a key press for up to 10 milliseconds. The & 0xFF == ord('q') part checks if the pressed key is 'q'. If 'q' is pressed, the loop breaks, which will end the video capture.

    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

This releases the camera (cap) resources, allowing other applications to use it.

cap.release()

This closes all OpenCV windows that were opened, in this case, the 'YOLO' window showing the camera feed.

cv2.destroyAllWindows()

When you run this, you should get a cute little window of yourself.

How the script works:

This script captures video from your webcam and applies a YOLO object detection model in real-time. The process begins by accessing your webcam feed and continuously reading each video frame. As each frame is captured, it is passed through a pre-trained YOLO model, which detects objects within the frame, such as people, cars, or other predefined objects.

The model highlights the detected objects by drawing bounding boxes around them and displaying the results in a window titled 'YOLO.' This process repeats for every frame in the video stream, creating a live object detection feed.

The script keeps running until you press the 'q' key, at which point the video feed stops, and the program closes both the camera connection and any display windows.

What running this should look like:

In short, this code opens your computer's webcam, captures live video frames, and displays them in a window titled 'YOLO'. It continues to show the live video feed until you press the 'q' key, which exits the loop and closes the camera feed window. After exiting the loop, it releases the camera resources and closes any windows opened by OpenCV.

Run this, and if you get a window that pops up and goes away, go ahead and just run it again. So, now it should work!

Understanding Machine Learning Misidentifications

Misidentification in machine learning occurs when a model incorrectly classifies an object or data point, often due to limitations in its training data or inherent similarities between different categories. This can stem from the model being exposed to a disproportionate number of certain images or features, leading it to generalize inaccurately. Here are some examples that stood out to me from the workshop:

Koala as Bear

The misidentification of a koala as a bear can be attributed to several factors: the training dataset for YOLOv5 may contain more images of bears than koalas, leading the model to associate similar features (like fur and shape) with bears. Koalas and bears share visual similarities that can confuse the model if it hasn't been specifically trained to distinguish between them. Additionally, object detection models sometimes generalize based on learned features, resulting in the model classifying a koala as a bear due to the prevalence and familiarity of bears in the training data.

Rectangular Devices as Phones

The misidentification of rectangular devices (conference badges, remotes, laptop batteries) as phones occurs because YOLOv5's training data contains numerous images of people holding rectangular devices near their faces as if they were accepting a call, leading the model to associate rectangular shapes held near a face with phones. Since the devices my workshop attendees held up share visual characteristics such as shape and size with phones, the model generalized based on these features, resulting in the incorrect classification. This generalization happens because the model has learned to recognize common objects in typical orientations and may struggle with less frequent or diverse examples that aren't well represented in its training data.
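One small lever against noisy misidentifications is the model's confidence threshold. The torch hub YOLOv5 model lets you raise it so that low-confidence guesses (like a badge being a barely-confident "cell phone") are dropped before rendering. The values below are just a starting point to experiment with:

model.conf = 0.5    # only keep detections with confidence >= 0.5
model.iou = 0.45    # IoU threshold used by non-max suppression

results = model(img)
results.print()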

How to Store Logs of Detections

In live object detection applications, logging is helpful for tracking the model's performance, debugging, and improving accuracy over time.

1. Set Up Logging Configuration

Start by setting up a basic logging configuration in your Python script. This will allow you to capture and save log messages to a file. Here’s a simple example using Python's built-in logging module:

import logging

# Set up logging configuration
logging.basicConfig(filename='object_detection.log',
                    level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

logging.info('Logging setup complete')

This configuration sets up a log file named object_detection.log, logs messages at the INFO level or higher, and includes timestamps, log levels, and messages in each log entry.

2. Log Important Events

Throughout your application, log important events and data points. For example, you can log when an image is processed, the results of object detection, and any errors that occur:

import logging

def process_image(image):
    logging.info('Processing image: %s', image)
    try:
        results = model(image)
        logging.info('Detection results: %s', results)
    except Exception as e:
        logging.error('Error processing image: %s', e)

In this snippet, logging.info records successful events and logging.error records any errors encountered.

3. Log Detection Results

When performing live detections, you might want to log the detected objects and their confidence scores. Here’s how you can do this:

def log_detections(results):
    detections = results.pandas().xyxy[0]  # Converts results to a pandas DataFrame
    for index, row in detections.iterrows():
        logging.info('Detected %s with confidence %.2f at [%d, %d, %d, %d]',
                     row['name'], row['confidence'],
                     int(row['xmin']), int(row['ymin']),
                     int(row['xmax']), int(row['ymax']))

This function extracts detection results and logs each detected object with its confidence score and bounding box coordinates.
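For example, you could call this inside the live detection loop from earlier, right after each frame is run through the model. This sketch just combines the snippets shown above:

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame)
    log_detections(results)   # write this frame's detections to the log

    cv2.imshow('YOLO', np.squeeze(results.render()))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break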

4. Review and Analyze Logs

Periodically review the log file to analyze the performance of your object detection model. Look for patterns or recurring issues that can help you improve the model. You can use tools like grep to filter logs or analyze them with scripts to identify trends:

grep "Detected" object_detection.log

This command filters the log file to show only the lines containing detection results.
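If you'd rather stay in Python, a small script can tally how often each class appears in the log. This assumes the "Detected <name> with confidence" message format from the log_detections function above:

from collections import Counter
import re

counts = Counter()
with open('object_detection.log') as f:
    for line in f:
        match = re.search(r'Detected (.+?) with confidence', line)
        if match:
            counts[match.group(1)] += 1

print(counts.most_common(5))   # the five most frequently detected classes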

5. Rotate and Manage Log Files

As your application generates more logs, it’s a good idea to manage log file size and rotation. You can use RotatingFileHandler from the logging module to handle this:

from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler('object_detection.log', maxBytes=2000000, backupCount=5)
handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logging.getLogger().addHandler(handler)

This configuration rotates the log file when it reaches roughly 2 MB, keeping up to 5 backup files. If you already called basicConfig with the same filename, remove or replace that original handler so each message isn't written to the file twice.

Wrap Up

Thank you to those who made it to the end of my in-person workshop and/or this tutorial! This workshop provided hands-on experience in developing a Python application for real-time object detection. By using a pre-trained YOLOv5 model and video processing libraries, we learned how to build a system that detects objects in real-time and manages detection logs. My goal was to highlight the integration of machine learning with backend management and give developers of all levels some valuable skills for future projects in machine learning and computer vision.

Join the Party

If you enjoyed my workshop or this blog (or didn't), I'd love to hear about it! Feel free to tag me on X, formerly known as Twitter. For more content from my team, follow them on X and join our developer community Slack channel. I personally had a great time at KCDC, and I'd love to (hopefully) meet some of you next year. Until then, enjoy this recap available on my Instagram and X.

Diana Pham, Developer Advocate

Diana is a developer advocate at Vonage. She likes eating fresh oysters.
