Building an Agentic 'Smart Scarecrow' with YOLO, ResNet, and Reinforcement Learning

If you’ve ever tried to keep wild animals out of a garden or a farm, you know the struggle. Peacocks, wild boars, elephants, and macaques don’t really care about a pile of old clothes stuffed with straw. Traditional scarecrows are static and animals figure them out fast.

So, what if we gave the scarecrow a brain, a set of eyes, and the ability to learn what scares specific animals the most?

Enter the Agentic RL Scarecrow — a project that fuses Computer Vision (YOLOv8 + ResNet50), Reinforcement Learning (UCB1), and Large Language Models (LLMs) to create a dynamic, self-learning wildlife deterrent system.

Here’s a deep dive into how it works.

1. The Eyes: A Two-Stage Vision Pipeline

You can’t scare an animal if you don’t know what you’re looking at. To achieve real-time, high-accuracy detection, the system relies on a two-stage computer vision pipeline.

Stage 1: Localization with YOLOv8

Running heavy classification on every pixel of a 1080p webcam feed is a recipe for a molten CPU. Instead, we use YOLOv8 nano as a first-pass filter. YOLO is incredibly fast and is used purely to draw bounding boxes around anything that looks like an animal.

Stage 2: Classification with ResNet-50

Once YOLO isolates the bounding box, we crop that specific patch of the frame and pass it to a custom-trained ResNet-50 model in PyTorch.

Why two models? Because YOLO is great at finding “where” an animal is, but standard YOLO isn’t trained to distinguish between a Nilgai and a Chital. By isolating the animal, we can feed a clean 224x224 crop into a highly specialized ResNet-50 model trained exclusively on our target species (Macaques, Elephants, Peacocks, Wild Boars, etc.).

To prevent false positives, we enforce a strict rule: The classification confidence must exceed 55%, the model’s prediction entropy must be ≤ 2.5, and the animal must be present in 5 consecutive frames.

2. Experimental Results and Performance

We put our ResNet-50 head-to-head with other edge-friendly architectures like EfficientNet-B0 and MobileNetV3-Large. ResNet-50 emerged as the clear winner for our use case.

Model Comparison Metrics

Metric	ResNet50	EfficientNet-B0	MobileNetV3-Large
Accuracy	0.9965	0.9561	0.9666
Balanced Accuracy	0.9967	0.9572	0.9674
F1 (macro)	0.9967	0.9574	0.9681
Precision (macro)	0.9967	0.9581	0.9693
Recall (macro)	0.9967	0.9572	0.9674
Parameters	25.6 M	5.3 M	5.4 M

ResNet-50 achieved the highest FPS (53) and scored the highest across all evaluated metrics, making it the optimal choice for real-time deployment on our Raspberry Pi 5 setup.

Speed vs Accuracy

Analyzing the tradeoff between inference speed and detection accuracy across models.

Confusion Matrix & Per-Class Metrics

ResNet-50 achieved 99.65% accuracy — nearly perfect classification across all 9 classes (8 species + “no animal”). In fact, it scored a perfect 1.000 F1 on 7 out of 9 classes. The only minor confusion was between the pig (0.985) and wild_boar (0.986) classes, which are visually extremely similar.

ResNet-50 Confusion Matrix

The confusion matrix for our primary ResNet-50 classifier.

ResNet-50 Per-Class Metrics

Per-class breakdown showing near-perfect identification metrics.

3. Ablation Studies: Optimizing the Vision Head

To squeeze every drop of performance from our model, we conducted extensive ablation studies on the architecture’s vision head.

Head Architecture

We tested a simple baseline linear head against a deep regularized head and our proposed fine-tuned head.

Configuration	Accuracy	F1 Score	Inference (ms)
Baseline Linear Head	0.9714	0.9708	0.42
Deep Regularized Head	0.9651	0.9646	0.24
Proposed Fine-tuned Head	0.9778	0.9776	0.33

The proposed fine-tuned head achieves the best accuracy (97.78%) while keeping a highly balanced inference time of 0.33 ms. Let’s look at how the confusion between the difficult pig and wild_boar classes improved through these iterations:

1. Baseline Linear: Pig misclassified as wild_boar 7 times.

Baseline Linear CM

2. Deep Regularized: Confusion worsens to 9 (added depth didn’t help the hardest pair).

Deep Regularized CM

3. Proposed Fine-tuned: Confusion reduces to 5 — our best result.

Proposed Fine-tuned CM

ResNet Depth (Layer Truncation)

We also investigated whether we could truncate layers (L2, L3) to save computation. Dropping to L3 reduced inference time by a mere 0.03 ms but tanked accuracy by 4.8%. Dropping to L2 ruined the model (73.3% accuracy). The conclusion? The full ResNet-50 (L4) is essential, as accuracy loss from truncation far outweighs any latency benefits on our edge hardware.

4. The Brain: Reinforcement Learning (UCB1)

This is where the project gets really cool. A normal motion-activated alarm just blares a siren. But animals undergo habituation. If they hear a siren every day and nothing bad happens, they eventually ignore it.

To solve this, the scarecrow has an RL agent running under the hood. It treats scaring animals like a Multi-Armed Bandit problem.

How it learns

When an animal is detected, the agent has an arsenal of sounds to choose from (e.g., Tiger Roar, Gunshot, Bee Swarm, Human Yelling). It uses a policy (like UCB1 or epsilon-greedy Q-learning) to balance exploration (trying new sounds) with exploitation (using sounds known to work).

The Reward Function

The system tracks the animal’s bounding box. The reward is entirely based on how fast the animal leaves the frame.

$$ \text{Efficiency} = \frac{\text{time animal was absent}}{\text{total time sound played}} $$

If the animal bolts within 2 seconds of hearing a Leopard growl, the Leopard growl gets a massive score boost.
If the animal hangs around for 30 seconds ignoring an airhorn, the airhorn gets penalized.

Sounds that consistently score below 35% efficiency are blacklisted. Over time, the local brain learns that Elephants hate the sound of buzzing bees, while Macaques immediately scatter at the sound of Leopard snarls.

5. Dynamic Audio & LLM Explanations

In the most advanced iteration of the project, the system doesn’t even rely on a static folder of MP3s.

Freesound API

When the RL agent decides it wants to try a “hawk scream” against a Peacock, it queries the Freesound API on the fly, downloads the top-rated audio clip, converts it to the proper WAV format, caches it locally, and plays it through the speakers. Sounds are grouped into “Tiers” of severity, from natural predators up to extreme noises like firecrackers and sirens.

Groq LLM API (Llama 3)

To make the system truly agentic and observable, it hooks into the Groq API running Llama-3. Whenever a new sound is needed, the system asks the LLM for a biological explanation as to why that sound is efficient. If the system plays a bee swarm to an elephant, the LLM outputs a real-time log explaining the science: “Elephants have a hardwired panic reflex to bees because bees can sting the sensitive insides of their trunks.”

Conclusion

This project is an example of edge AI that moves crop protection from static to adaptive. Real-time AI classification ensures that deterrents are targeted and effective, moving beyond passive detection and into the realm of active defense.

Most importantly, the Reinforcement Learning loop solves the notorious “Habituation Problem,” making the system smarter the longer it stays in the field. Ultimately, this technology provides a cost-effective, non-lethal, and autonomous solution to Human-Wildlife conflict.