Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems

Reading time: 17 minutes
...

📝 Original Paper Info

- Title: Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems
- ArXiv ID: 2601.00905
- Date: 2025-12-31
- Authors: Eliot Park, Abhi Kumar, Pranav Rajpurkar

📝 Abstract

While the importance of efficient recycling is widely acknowledged, accurately determining the recyclability of items and their proper disposal remains a complex task for the general public. In this study, we explore the application of cutting-edge vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) for predicting the recyclability of commonly disposed items. Utilizing a curated dataset of images, we evaluated the models' ability to match objects to appropriate recycling bins, including assessing whether the items could physically fit into the available bins. Additionally, we investigated the models' performance across several challenging scenarios: (i) adjusting predictions based on location-specific recycling guidelines; (ii) accounting for contamination or structural damage; and (iii) handling objects composed of multiple materials. Our findings highlight the significant advancements in contextual understanding offered by these models compared to previous iterations, while also identifying areas where they still fall short. The continued refinement of context-aware models is crucial for enhancing public recycling practices and advancing environmental sustainability.

💡 Summary & Analysis

1. **Swift Classification:** The study explores how different models can quickly classify waste to improve recycling rates and reduce environmental impact. This involves determining the compatibility of various types of waste with different disposal bins.
2. **Location-Specific Guidelines Implementation:** The models were tested under varying location-specific recycling guidelines, reflecting the unique systems and distributors in each city.
3. **Classification Based on Changes in Condition:** The ability of the models to determine recyclability based on changes such as contamination or structural damage was also assessed.

Simple Explanation with Metaphors:

  • This study aims to find ways to accurately classify waste for recycling using AI models, thereby protecting the environment. Each model must decide which type of disposal bin a given item should be placed in. It’s like sorting clothes into different baskets based on their color.
  • The research also considers that recycling guidelines can vary by location, meaning each city has its own unique system and buyers for recycled goods. This is akin to wearing different outfits depending on the city you’re in.
  • Lastly, it evaluates how well models can determine recyclability when waste items are contaminated or damaged. It’s similar to deciding whether a torn book should be repaired and reused or discarded.

Sci-Tube Style Script:

  • Beginner: “Making recycling easier! Let’s look at how AI models classify waste. We’ll also see how they adapt to different city guidelines!”
  • Intermediate: “Protecting the environment with AI in waste classification! Find out about our tests considering various conditions and location-specific rules.”
  • Advanced: “Addressing complex recycling systems with cutting-edge technology. Learn more about the latest advancements in waste classification models and their performance.”

📄 Full Paper Content (ArXiv Source)

# Introduction

Effective waste management, particularly through recycling, is essential in promoting environmental sustainability. In 2018, the United States generated approximately 292.4 million tons of municipal solid waste, equating to 4.9 pounds per person per day. Of this, 32.1 percent was either recycled or composted, a notable achievement but one that highlights the significant proportion of waste still destined for landfills. Within this waste stream, certain materials, such as paper and paperboard, achieved recycling rates as high as 68.2 percent, while others, such as plastics, lagged far behind at just 8.7 percent. These disparities underscore the need for innovative approaches to improve recycling rates across all categories, including improving the general public's ability to distinguish which items should be recycled.

# Related Works

In a recent work [2], the potential of general vision-language models, specifically Contrastive Language-Image Pretraining (CLIP), for automating the classification of waste materials for recycling was explored. The results were substantially better than those of previous approaches using simple convolutional neural networks, with the model achieving an accuracy of 89% in zero-shot classification across a dozen different disposal methods. However, the approach had notable limitations. CLIP’s reliance on a predefined list of potential items meant it struggled with items outside of this list, reducing its effectiveness in real-world applications where waste items are highly varied. In particular, the model was not designed to handle common but challenging cases such as greasy, dirty, or broken items, which often complicate the recycling process. The current study seeks to address these shortcomings by utilizing the state-of-the-art models GPT-4o, GPT-4o-mini, and Claude 3.5, which are all equipped with advanced vision capabilities. These models enhance the system’s ability to classify a broader range of waste items, including those with unique characteristics that significantly impact their recyclability, such as contamination and structural damage. By integrating more diverse and representative data, and leveraging these cutting-edge models, we aim to develop a more accurate and robust tool for waste classification, ultimately contributing to more efficient recycling practices.

style="width:5.5in" />
Overview of our study. Four contextual predictions are tested for three models.

# Methodology

An overview of the experiments is shown in Figure 1. We collected images through a combination of three methods: Google Images, DALL·E image generation, and personal photography. The goal was to compile a diverse and representative database of commonly disposed items, ensuring that the images were clear, realistic, and contextually appropriate. We generated an initial list of 100 commonly disposed items using ChatGPT, which provided a broad array of materials across various categories. After refining this list, we manually gathered images for each item, focusing on capturing realistic depictions rather than stock photos with blank backgrounds.

The materials included in the database covered a wide spectrum, with a focus on frequently encountered waste types: Cardboard, Electronics, Glass, Plastic, Metal, Organics, Paper, Styrofoam, Textiles, and Wood. Each category was represented by 10 images, yielding a total of 100 images.

Three models were used: (i) GPT-4o (released May 2024): a multimodal (text, images, audio) generative pre-trained transformer from OpenAI. (ii) GPT-4o-mini (July 2024): a smaller and less expensive version of GPT-4o. (iii) Claude 3.5 Sonnet (June 2024): a generative pre-trained transformer from Anthropic.
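The paper does not publish its evaluation harness. As a rough, minimal sketch (assuming the official `openai` Python SDK; the prompt text and file paths are placeholders, not the authors' actual prompts), a single-image query to one of these models could look like this:

```python
import base64
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def classify_item(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one item image plus an instruction prompt; return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Claude 3.5 Sonnet exposes an analogous interface through Anthropic's `messages.create` API, with images passed as base64 content blocks.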

## Experiment 1: Predicting recyclability based on recognizing different types of waste bin

One challenge in a real-life situation is the different types of recycling options that are available at the time of disposal. In this experiment, we aimed to evaluate the model’s ability to classify waste items based on their compatibility with different types of disposal bins. Each API call involved two images: one of a bin and one of an item from the previously curated database. The models were prompted to consider both the size of the bin opening and the physical characteristics of the item, such as shape, size, and material, to determine the correct disposal method.

The experiment utilized three distinct bin images to represent common disposal scenarios: (i) BigBelly Bins: These are urban bins found in many cities, featuring two categories (trash and recycling) with small openings designed to limit contamination. A major challenge for the models was to infer the size of the opening compared to the size of the item. (ii) Residential Bins: Standard 96-gallon residential bins typically used for household waste collection. (iii) Three-Category Urban Bins: These include categories for trash, recycling, and compost, commonly found in urban settings. Each of the 100 images was tested against all three bin images, resulting in a comprehensive evaluation of the model’s ability to accurately classify items across varying disposal contexts.
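As an illustrative sketch of how such a two-image query might be issued (reusing `client` and `encode_image` from the sketch above; the prompt wording is an assumption, since the paper's exact prompts are not reproduced, though the answer tokens mirror those mentioned in the results section):

```python
BIN_PROMPT = (
    "The first image shows the available disposal bins and the second shows "
    "an item. Considering the size of the bin openings and the item's shape, "
    "size, and material, answer with exactly one of: left, right, middle, none."
)  # illustrative wording only


def classify_against_bins(bin_path: str, item_path: str,
                          model: str = "gpt-4o") -> str:
    """One API call carrying both the bin image and the item image."""
    response = client.chat.completions.create(  # client from the sketch above
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": BIN_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(bin_path)}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(item_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()
```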

## Experiment 2: Predicting recyclability using location-specific guidelines

Another complication is the differing recyclability guidelines depending on the city and country. This is a complex issue driven by the specific machines used at each recycling collection center and which distributor purchases the recycled items for a city. Thus, this experiment was designed to assess the model’s ability to adapt its waste classification based on location-specific recycling guidelines. We used the same database of 100 items across four tests, each incorporating varying levels of guideline specificity to evaluate how well the model could adjust to different local recycling practices.

The tests included: No Guidelines (165-word prompt) as the baseline, plus (i) Boston Guidelines (377-word prompt); (ii) London Guidelines (499-word prompt); (iii) San Francisco Guidelines (557-word prompt). Across these tests, five items displayed a change in recyclability status when transitioning from Boston’s guidelines to those of London and San Francisco. This variation underscores the importance of local context in determining the correct disposal method, as items deemed recyclable in one city may not be accepted in another.
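A minimal sketch of how the guideline text might be prepended to a baseline instruction (the guideline strings and baseline wording below are placeholders; the paper's actual 165- to 557-word prompts are not reproduced):

```python
# Placeholder guideline text; the paper's prompts ranged from 165 words
# (no guidelines) to 557 words (San Francisco).
CITY_GUIDELINES = {
    "none": "",
    "boston": "Boston curbside recycling accepts ...",       # 377-word prompt
    "london": "London borough recycling accepts ...",        # 499-word prompt
    "san_francisco": "San Francisco's program accepts ...",  # 557-word prompt
}

BASE_PROMPT = (
    "Decide whether the item shown is recyclable and answer with "
    "'recyclable' or 'not recyclable'."
)  # illustrative baseline; not the paper's exact wording


def build_prompt(city: str) -> str:
    """Prepend location-specific guidelines (if any) to the baseline prompt."""
    guidelines = CITY_GUIDELINES[city]
    return f"{guidelines}\n\n{BASE_PROMPT}" if guidelines else BASE_PROMPT
```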

## Experiment 3: Predicting Recyclability Based on Phase Changes

Here, we aimed to evaluate the model’s ability to classify waste items based on transformations that may affect their recyclability. The experiment utilized a set of 40 images, each labeled as either “before” or “after” to depict the same object in two different states: pre-transformation (clean) and post-transformation. The transformations were divided into two categories: (i) Contamination, with 20 images displaying objects before and after contamination, such as food residue or liquid spills; (ii) Structural change, with 20 images capturing objects before and after structural changes, including damage like tearing, breaking, or crumpling.

For each pair, the model was presented with both images simultaneously and tasked with assessing the transformation that occurred to determine the correct recyclability status. This experiment’s design provided a comprehensive evaluation of the model’s ability to consider physical changes that can affect whether an item should be recycled or discarded. The focus on 40 image pairs allowed for a detailed analysis of the model’s performance across both contamination and structural changes, providing insights into its ability to adapt to real-world scenarios where the condition of an item plays a crucial role in determining its recyclability.

## Experiment 4: Predicting Recyclability of Multi-Material Objects

This experiment assessed the model’s ability to classify items made of multiple materials. A set of 50 images was used, each representing an object composed of more than one material, such as an aluminum takeout tray with a detachable plastic top or a glass jar with a metal lid. These images were passed through the model individually to evaluate its effectiveness in determining the recyclability of complex, multi-material objects.

# Results and Analysis

style="width:5.5in" />
Performance on Bin Testing (Experiment 1). Recyclability prediction was made in the context of the available bin type, requiring accurate assessment of size of the item and the opening.

## Experiment 1: Bin Testing (Figure 2)

### BigBelly Urban Bins

In these experiments, GPT-4o demonstrated the highest performance among the models with an average accuracy of 0.69 across all types. It excelled in classifying electronics (0.9) and glass (0.9), reflecting the model’s capability to manage the more distinct visual features of these materials. However, it encountered significant challenges with cardboard (0.2), largely due to the inability to correctly assess the size of the BigBelly bin openings relative to the size of the items. This limitation led to misclassifications where larger items like cardboard were incorrectly identified as fitting into the bin.

GPT-4o-mini, with an accuracy of 0.51, faced even greater difficulties, particularly with cardboard, where it failed to classify any images correctly (0.0), and wood (0.3). This model struggled because it hallucinated the existence of a third bin, often outputting “middle” when only left and right options were available.

Claude was the weakest performer, achieving an accuracy of 0.47. Similar to GPT-4o-mini, Claude also struggled with larger items like cardboard and wood, failing to recognize that these items were too large for the bin openings. This misjudgment led to frequent misclassifications, further highlighting the model’s limitations in scenarios where precise spatial reasoning and contextual understanding of bin dimensions are critical.

### Residential Bins

The performance dynamics shifted slightly. GPT-4o-mini achieved the highest accuracy at 0.73, particularly excelling in the classification of wood (1.0) and maintaining solid performance across other categories like cardboard (0.8) and paper (0.8). However, despite its overall strong performance, GPT-4o-mini also exhibited a surprising issue: it again hallucinated the existence of a third bin on four occasions. This is notable given that, unlike the BigBelly bins where the two bins had three openings that might have caused confusion, the residential bins are clearly separated into only two large and distinct containers. This unexpected behavior suggests that the model may have underlying difficulties in accurately interpreting the physical layout of the bins, even in seemingly straightforward scenarios.

GPT-4o followed closely with an accuracy of 0.63, showing consistent strength in electronics (0.8) and paper (0.8), though its performance in wood (0.7) and cardboard (0.8) was slightly lower than GPT-4o-mini. Claude, on the other hand, showed improvement compared to its performance in the BigBelly bins, achieving an accuracy of 0.64. While GPT-4o-mini’s weakest material in this experiment was electronics (0.2), electronics was one of GPT-4o’s strongest categories.

The higher accuracies across all models in this experiment may not fully reflect the models’ strengths but rather the increased capacity of the 96-gallon residential bins, which can accommodate a broader range of item sizes. This larger bin size likely simplified the classification task, as more items, including larger ones like cardboard and wood, could easily fit into the bins. This suggests that the observed improvement in performance might be partially due to the bin’s capacity rather than solely the models’ predictive capabilities.

### Three-Category Bins

GPT-4o led with an accuracy of 0.73, demonstrating its ability to effectively manage the complexity of multiple disposal categories. The model excelled in classifying electronics (0.9) and organics (0.9) and maintained high accuracy in textiles (0.9), though glass (0.6) was comparatively weaker.

GPT-4o-mini, with an accuracy of 0.58, showed similar strengths as GPT-4o but struggled significantly more with wood (0.4) and organic matter (0.1). The low accuracy in organic matter classification was particularly concerning, as it stemmed from the model repeatedly outputting “compost” seven times, despite explicit instructions in the prompt to use only “left”, “right”, “middle”, or “none”. This issue highlights a critical weakness in GPT-4o-mini’s ability to follow prompts accurately, especially in complex scenarios involving multiple disposal categories.
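One way an evaluation harness could surface such off-list replies is to validate each answer against the permitted token set before scoring; a small sketch (not the authors' code):

```python
ALLOWED_ANSWERS = {"left", "right", "middle", "none"}


def score_reply(reply: str, truth: str) -> bool:
    """Count an answer as wrong (and flag it) if it falls outside the
    permitted set, e.g. GPT-4o-mini's off-list 'compost' replies."""
    answer = reply.strip().lower().rstrip(".")
    if answer not in ALLOWED_ANSWERS:
        print(f"off-list reply: {reply!r}")
        return False
    return answer == truth
```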

Claude, achieving an accuracy of 0.59, showed improvement in this setup compared to simpler bin configurations. However, it continued to face challenges, particularly with wood (0.7) and cardboard (0.3). While Claude’s performance was slightly better here, its ongoing struggles with these materials indicate persistent issues with accurately classifying items, particularly in scenarios where size and material characteristics are critical. Across the different models, GPT-4o consistently delivered the best results. However, all models faced challenges, particularly with accurately classifying items based on their size, which was a recurring issue across different materials like cardboard and wood. This underscores the need for further refinement, especially in handling larger or less distinct items and ensuring that models adhere strictly to the provided prompts.

style="width:5.5in" />
Three additional tests for contextual predictions. (a) City-specific guidelines were added to the prompt. The direction of the arrows indicates that the classification improved (or remained the same) as a result. (b) Pairs of images were tested for contamination (e.g., soiled item) and structural changes (e.g., broken glass). The changes in the pair predictions are illustrated with the blue and pink bands corresponding to correct and incorrect items, respectively. For instance, for GPT-4o, four pairs were Y-N pairs (recyclable prior to contamination / not recyclable after contamination); three items were predicted correctly (blue) but one item was predicted to be N-N. In general, contaminated items that are still recyclable were incorrectly predicted as no longer being recyclable.

## Experiment 2: Location-Specific Guidelines (Figure 3)

Here, we tested the models’ performance under different location-specific guidelines: Boston, London, and San Francisco. Each model was evaluated using the same set of 100 images across all locations to assess how well they could adapt to varying recycling guidelines.

When comparing the “No Location” scenario (which used Boston labels) to the Boston-specific prompt, there was a noticeable improvement in accuracy for all models. GPT-4o’s accuracy increased from 0.89 in the “No Location” scenario to 0.94 with the Boston guidelines. GPT-4o-mini also saw a jump from 0.86 to 0.90, while Claude’s accuracy slightly increased from 0.84 to 0.85. These improvements suggest that the models benefitted from the more explicit instructions provided by the Boston-specific guidelines, even though the “No Location” scenario was already scored against Boston’s labels. Across all locations, only five items changed their recyclability status when switching from Boston’s regulations to those of London or San Francisco. This limited variation implies that the increase in accuracy seen in location-specific scenarios may not be entirely due to the models adapting to new guidelines, but rather to the added support of a detailed prompt. The consistency of the incorrect images also supports this, as many images were misclassified regardless of the location-specific prompt.

Overall, the results indicate that while the models are capable of adjusting to different location-specific guidelines, the relatively minor changes in recyclability in this dataset across locations suggest that the observed improvements in accuracy may be attributable to the clarity and specificity of the guidelines rather than the models’ ability to handle different regional recycling practices.

## Experiment 3: Structural Changes

The models were evaluated for their ability to classify items before and after undergoing physical transformations due to contamination or structural changes. GPT-4o achieved an overall combined image-pair accuracy of 0.75, and a slightly higher individual image accuracy of 0.875.
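As a minimal sketch of the distinction between these two metrics (assuming Y/N recyclability labels for each of the before/after states):

```python
from typing import List, Tuple

# One record per image pair:
# (pred_before, pred_after, true_before, true_after), each "Y" or "N".
Pair = Tuple[str, str, str, str]


def pair_and_individual_accuracy(pairs: List[Pair]) -> Tuple[float, float]:
    """Combined accuracy credits a pair only if BOTH states are right;
    individual accuracy scores each before/after image separately."""
    pair_hits = sum(pb == tb and pa == ta for pb, pa, tb, ta in pairs)
    image_hits = sum((pb == tb) + (pa == ta) for pb, pa, tb, ta in pairs)
    return pair_hits / len(pairs), image_hits / (2 * len(pairs))

# GPT-4o's reported numbers are consistent with, e.g., 30 of 40 pairs fully
# correct (0.75) while 70 of 80 individual images are correct (0.875).
```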

For contamination, GPT-4o led with a combined accuracy of 0.6 and an individual accuracy of 0.8, demonstrating a solid performance in recognizing and adapting to soiled items. GPT-4o-mini followed with a combined accuracy of 0.4, showing some limitations in handling subtle contamination, while Claude achieved a combined accuracy of 0.7.

In the structural changes scenario, GPT-4o again excelled with a combined accuracy of 0.9 and an individual accuracy of 0.95, showing strong adaptability to physical alterations. GPT-4o-mini achieved a combined accuracy of 0.8, and Claude lagged behind with 0.5, indicating difficulties in accurately reclassifying structurally altered items. These results highlight GPT-4o’s robustness in managing complex real-world scenarios, though further refinement is needed across all models to enhance their practical applicability in diverse recycling contexts.

## Experiment 4: Multi-Material Objects

This experiment tested the models’ ability to classify items composed of multiple materials, such as plastic combined with metal or glass. GPT-4o led with an impressive overall accuracy of 0.98, misclassifying only 1 out of 50 images. This high accuracy indicates GPT-4o’s strong capability in managing the complexity of multi-material objects, effectively identifying and categorizing them with minimal errors.

GPT-4o-mini also performed well, achieving an overall accuracy of 0.92, although it misclassified 4 out of 50 images. This slightly lower accuracy compared to GPT-4o suggests that while GPT-4o-mini is competent, it still has room for improvement in handling the intricacies of multi-material items. Claude, however, demonstrated significant challenges in this experiment, with an overall accuracy of 0.78 and 11 misclassified images out of 50. The lower accuracy indicates that Claude struggled with the complexities of multi-material objects, particularly when compared to the more robust performance of GPT-4o.

In summary, GPT-4o was the most reliable model for classifying multi-material objects, showing a clear advantage over GPT-4o-mini and especially Claude. The consistent performance of GPT-4o across these complex scenarios highlights its effectiveness, while the noticeable gap in accuracy for Claude underscores the need for further refinement in handling multi-material classifications.

# Discussion

Standard guidelines for recycling are often inadequate for the public in determining the recyclability of an item, due to the myriad of possible conditions the item may be in. As we have shown here, the rapid advances in vision-language models now offer contextual intelligence that leads to exceptional performance in many complex situations. GPT-4o, for instance, excels in many of the tests carried out in this work.

A continuing challenge in evaluating the prediction models is the difficulty of ascertaining the true label, especially given the more than a dozen disposal methods specified in most city guidelines and their variations. We have called, emailed, and visited city recycling managers to obtain the true labels for many contexts. In some cases, the answers were clear (e.g., a greasy pizza box is recyclable in Boston but not in some other cities); in other cases, there were no definitive answers, in part because the different companies who buy the recycled items at a given time have different requirements. Nevertheless, the models we tested here are ideal for their easy-to-modify prompts in addition to their contextual intelligence.

In our evaluation, no distinction was made between a false positive and a false negative prediction. In reality, a false positive (an item predicted to be recyclable when it is not) is more costly, as it contaminates the batch to which it is assigned. A weight function could be incorporated into the evaluation to improve its practicality.
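As a minimal sketch of what such a weighting could look like (the 2:1 cost ratio below is an arbitrary illustration, not a value proposed in the paper):

```python
def weighted_error(false_pos: int, false_neg: int,
                   w_fp: float = 2.0, w_fn: float = 1.0) -> float:
    """Asymmetric error count: a false positive (a non-recyclable item
    routed to recycling) contaminates the batch, so it is penalized more
    heavily than a false negative (a recyclable item sent to trash)."""
    return w_fp * false_pos + w_fn * false_neg

# Example: 3 false positives and 5 false negatives
# -> 2.0 * 3 + 1.0 * 5 = 11.0 weighted errors, vs. 8 unweighted.
print(weighted_error(3, 5))
```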



A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
