
Applying ByteDance’s Doubao‑1.5 Vision Model for Image Counting and Automated Annotation

The article demonstrates how ByteDance's new Doubao-1.5 multimodal model can locate and count objects in images, such as sushi plates, street signs, and cartoon hats, by generating coordinates and overlaying visual annotations through a concise Python script.

DataFunTalk

The author starts with a playful example, counting the sushi on a table: 14 plates at 15 pieces per plate, for 210 pieces in total, to illustrate that AI can be employed for inventory-style tasks.

At a recent ByteDance launch event, the company unveiled the Doubao 1.5 deep‑thinking model, a 200‑billion‑parameter model with R1‑level performance, accompanied by a vision‑pro extension that adds visual understanding capabilities.

The vision-pro module dramatically improves visual localization: it supports single- and multi-object bounding-box or point prompts, counting, and 3D depth prediction. This makes it applicable to inspection and other commercial scenarios, effectively allowing large models to produce stable image annotations.
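As a rough sketch of what the point-prompt workflow might look like (the reply format below is an assumption for illustration, not the model's documented output), one could prompt the model to return object centers as a JSON list and then pull the coordinates out of the free-text reply:

```python
import json
import re

def parse_points(reply: str):
    """Extract (x, y) coordinate pairs from a model reply that embeds
    a JSON list such as [[120, 340], [410, 355]] in free text."""
    match = re.search(r"\[\s*\[.*?\]\s*\]", reply, re.S)
    if match is None:
        return []
    return [tuple(point) for point in json.loads(match.group(0))]

# Hypothetical model reply (the exact wording/format is an assumption)
reply = "Found 3 plates: [[120, 340], [410, 355], [690, 360]]"
print(parse_points(reply))  # [(120, 340), (410, 355), (690, 360)]
```

Parsing defensively like this matters because large models tend to wrap structured output in explanatory prose; anchoring on the outermost JSON brackets keeps the pipeline robust to that.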

Using the model, the author extracts coordinates of sushi plates in an image and then runs a short Python script to draw rounded rectangles, add bright cyan text with a black shadow, and display the annotated image. The full script is shown below:

# NB: the imports and placeholder values below are added for completeness;
# the original snippet assumed they were defined earlier in the notebook.
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt

new_image_path = "sushi.jpg"                   # path to the source image (placeholder)
scaled_new_points = [(120, 340), (410, 355)]   # (x, y) centers from the model, in pixels (placeholder)
new_line_length = 80                           # marker width in pixels (placeholder)
new_line_height = 12                           # marker height in pixels (placeholder)
shadow_offset = 2                              # drop-shadow offset in pixels
try:
    large_font = ImageFont.truetype("DejaVuSans-Bold.ttf", 40)
except OSError:
    large_font = ImageFont.load_default()      # text anchors need a TrueType font in older Pillow

# Reload image again for a clean slate
highlighted_image = Image.open(new_image_path)
draw = ImageDraw.Draw(highlighted_image)

# Define more vibrant color scheme
line_color = "#00FFFF"
text_color = "#00FFFF"
shadow_color = "black"

# Draw lines and bright text with shadow
for idx, (x, y) in enumerate(scaled_new_points, start=1):
    left = x - new_line_length / 2
    right = x + new_line_length / 2
    top = y - new_line_height / 2
    bottom = y + new_line_height / 2

    # Draw vibrant line
    draw.rounded_rectangle([(left, top), (right, bottom)], radius=new_line_height / 2, fill=line_color)

    # Draw text shadow
    text_position = (left - 10, y)
    shadow_position = (text_position[0] + shadow_offset, text_position[1] + shadow_offset)
    draw.text(shadow_position, str(idx), font=large_font, fill=shadow_color, anchor="rm")

    # Draw main vibrant text
    draw.text(text_position, str(idx), font=large_font, fill=text_color, anchor="rm")

# Display updated image with high visibility colors
plt.figure(figsize=(10, 8))
plt.imshow(highlighted_image)
plt.axis("off")
plt.show()

Running the script produces a fully annotated image, confirming the workflow.
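The script relies on `scaled_new_points` being in pixel coordinates. Many vision models instead return coordinates on a normalized grid (a 0-1000 grid is assumed in this sketch; the article does not specify Doubao's convention), so a small scaling step is typically needed first:

```python
def scale_points(points, img_width, img_height, grid=1000):
    """Map coordinates from an assumed 0-1000 model grid to pixel
    coordinates for an image of the given width and height."""
    return [(x * img_width / grid, y * img_height / grid) for x, y in points]

# Example: two grid points mapped onto an 800x600 image
print(scale_points([(500, 500), (250, 750)], 800, 600))
# [(400.0, 300.0), (200.0, 450.0)]
```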

The author then tests the model on other scenarios, such as detecting street signs in a street-view photo and labeling Mickey Mouse's hat, showing that a simple query can yield well-annotated results without explicit object definitions.

While multimodal AI has been hyped, practical applications remain limited; however, tasks such as inventory counting, component tallying, and other labor‑intensive visual counting can now be delegated to AI, offering tangible value.

The article also lists current challenges: viewing-angle bias, dense targets that cause overlapping labels, and complex backgrounds that lead to missed detections. It notes that these issues are solvable, suggesting that the qualitative shift has already occurred and quantitative improvement is only a matter of time.
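For the overlapping-label problem on dense targets, one simple mitigation (not from the article, just a minimal sketch) is to greedily nudge each label's vertical position until it keeps a minimum gap from the previously placed one:

```python
def spread_labels(ys, min_gap=20):
    """Return label y-positions pushed apart so that adjacent labels
    keep at least min_gap pixels between them (greedy, top-down)."""
    placed = []
    for y in sorted(ys):
        if placed and y - placed[-1] < min_gap:
            y = placed[-1] + min_gap  # nudge down past the previous label
        placed.append(y)
    return placed

# Three labels, two of which would collide without adjustment
print(spread_labels([100, 105, 160]))  # [100, 120, 160]
```

This is deliberately crude; a production annotator would also consider horizontal offsets and leader lines, but even this greedy pass removes most collisions in clustered scenes.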

Tags: computer vision, Python, AI, object detection, multimodal, image annotation, Doubao
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
