Vorp Labs//Vision
November 25, 2025

Technical

Zero-shot player tracking with SAM3

How we built a vision system that tracks players across frames with persistent IDs, even through occlusions and camera cuts.

Phil Glazer, Founder
8 min read

In this piece, we walk through how we built a player tracking system using SAM3 and ByteTrack that takes a clip of NFL broadcast footage and tracks players through it.

Results

To give a sense for what the system generates, here's sample output showing the raw broadcast footage versus the tracking overlay:

Left: original broadcast footage. Right: SAM3-tracked output with persistent player IDs and motion trails.

Every colored box is a tracked player, with trails showing movement over the last 15 frames. Player IDs mostly persist across the video (e.g., player #1 stays #1 from first frame to last), though there are improvements to be made.

Methodology note: "ID persistence" is our internal sanity-check metric—the percentage of frames where a track's ID matches its majority-assigned identity, measured by manual spot-checking on 8 broadcast clips (20-45 seconds, 720p/1080p). It's not a standard MOT benchmark metric.

  • 0 training images
  • 1 text prompt
  • ~85% ID persistence
  • <1¢ per second of video

Methodology Overview

The core of the system is SAM3 (Meta's latest Segment Anything model) for prompt-based segmentation, accessed via fal.ai. We prompt SAM3 per frame to generate masks for player-like regions, then convert masks to bounding boxes (dropping tiny/invalid masks). Those boxes plus per-mask scores become detections for ByteTrack.
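To make that hand-off concrete, here is a minimal sketch of the mask-to-detection step, assuming the supervision library (which the ByteTrack snippet below also uses); the helper name and the min_area cutoff are ours:

import numpy as np
import supervision as sv

def masks_to_detections(masks, scores, min_area=400):
    """Turn per-frame SAM3 masks into ByteTrack-ready detections.

    masks: list of HxW boolean arrays; scores: per-mask confidences.
    min_area (in pixels) is an illustrative cutoff for tiny/invalid masks.
    """
    boxes, confidences = [], []
    for mask, score in zip(masks, scores):
        ys, xs = np.nonzero(mask)
        if xs.size < min_area:          # empty or tiny mask: drop it
            continue
        boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])  # x1, y1, x2, y2
        confidences.append(score)
    if not boxes:
        return sv.Detections.empty()
    return sv.Detections(
        xyxy=np.array(boxes, dtype=float),
        confidence=np.array(confidences, dtype=float),
    )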

Our pipeline:

  1. Text prompt → SAM3 segments matching objects in each frame
  2. ByteTrack → Maintains consistent IDs across frames
  3. Team classification → K-means clustering separates teams by jersey color
  4. Track stitching → Reconnects IDs when players temporarily disappear
  5. Interpolation → Smooths gaps for clean visualization (a short sketch follows the diagram below)

Video → Frame Extraction → SAM3 API → ByteTrack → Team Colors → Track Stitching → Interpolation → Tracked Video + JSON
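Interpolation is the simplest of these steps, so here is the whole idea up front: a minimal sketch that linearly fills short gaps in a track's positions, assuming a per-track dict mapping frame index to an (x, y) center (the max_gap cutoff is ours):

def interpolate_track(positions, max_gap=10):
    """Linearly fill short gaps in a track's (x, y) positions.

    positions: dict of frame_idx -> (x, y). Gaps longer than max_gap
    frames are left alone rather than invented.
    """
    frames = sorted(positions)
    filled = dict(positions)
    for prev, nxt in zip(frames, frames[1:]):
        gap = nxt - prev
        if gap <= 1 or gap > max_gap:
            continue
        (x0, y0), (x1, y1) = positions[prev], positions[nxt]
        for f in range(prev + 1, nxt):
            t = (f - prev) / gap
            filled[f] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled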

Tip

SAM3 is open source—you can run it locally on a GPU or call it via API. We use fal.ai's hosted API to avoid running inference locally. Wall-clock latency: ~2 minutes for a 30-second clip at 30fps (varies with queueing and resolution). Cost: roughly $0.15-0.30 per 30-second clip at 720p-1080p.
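For reference, a call through fal.ai's Python client looks roughly like the sketch below. The endpoint id and argument names are placeholders rather than the real schema; check fal.ai's SAM3 documentation for the exact fields:

import fal_client

frame_url = "https://example.com/frames/frame_0000.jpg"  # one extracted frame

result = fal_client.subscribe(
    "fal-ai/sam3",                     # hypothetical endpoint id
    arguments={
        "image_url": frame_url,
        "prompt": "football player",   # the text prompt driving segmentation
    },
)
masks = result["masks"]                # per-object masks and scores (schema varies)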

Implementation Details

While making calls to SAM3 to segment objects in a single frame of a video clip is straightforward, real broadcast footage complicates things: the camera moves over time, players occlude each other, and players leave and re-enter the frame.

ByteTrack Tuning

Out-of-the-box ByteTrack parameters are tuned for surveillance footage. Sports broadcast footage is different: as mentioned above, players leaving and re-entering the frame (as well as disappearing behind other players) makes consistent tracking difficult. After testing across several clips, we settled on these settings:

tracker = sv.ByteTrack(
    track_activation_threshold=0.30,      # Lower than default—catch partial occlusions
    lost_track_buffer=90,                 # ~3 seconds at 30fps before dropping a track
    minimum_matching_threshold=0.55,      # More forgiving for fast motion
    frame_rate=effective_fps,
    minimum_consecutive_frames=3,         # Prevent 1-frame junk IDs
)

The most impactful change for our footage was lost_track_buffer=90. Players regularly disappear for 1-2 seconds (behind other players, camera cuts, leaving frame). Default ByteTrack configs are often used for MOT-style benchmarks; on broadcast sports we found they drop tracks too quickly across occlusions/camera motion.
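The per-frame loop that connects SAM3 to ByteTrack is short. A sketch, assuming a segment_frame helper that wraps the SAM3 call plus the masks_to_detections sketch from earlier:

all_tracks = []
for frame_idx, frame in enumerate(frames):
    masks, scores = segment_frame(frame)                  # SAM3 call (see above)
    detections = masks_to_detections(masks, scores)       # masks -> boxes + scores
    tracked = tracker.update_with_detections(detections)  # assigns a tracker_id per box
    for xyxy, tid in zip(tracked.xyxy, tracked.tracker_id):
        all_tracks.append({
            "frame_idx": frame_idx,
            "tracker_id": int(tid),
            "xyxy": xyxy.tolist(),
        })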

Track Stitching

Even with a long buffer, ByteTrack occasionally fragments a single player into multiple IDs. Our track stitching pass merges them back:

# If track A ends and track B starts nearby within a few frames,
# and the distance is physically plausible (player can't teleport),
# they're the same person—merge B into A.
frame_gap = track_b.start_frame - track_a.end_frame
distance = euclidean(track_a.end_pos, track_b.start_pos)
max_plausible = 50 * (frame_gap + 1)  # ~15 yards/sec at 1080p
 
if distance <= max_plausible:
    merge(track_b, into=track_a)

On our test clips, this simple gap-plus-distance merge reduced obvious track fragmentation by ~15% (measured by manually spot-checking how many short tracklets could be merged into a longer track).
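Spelled out, the stitching pass is a greedy sweep over track endpoints. A minimal sketch with an illustrative Tracklet container (the field names are ours, not ByteTrack's):

import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Tracklet:
    track_id: int
    start_frame: int
    end_frame: int
    start_pos: Tuple[float, float]   # (x, y) at first frame
    end_pos: Tuple[float, float]     # (x, y) at last frame
    merged_into: Optional[int] = None

def stitch(tracklets, max_gap=15, px_per_frame=50):
    """Greedily merge tracklets separated by a short gap and a plausible jump."""
    by_start = sorted(tracklets, key=lambda t: t.start_frame)
    for a in tracklets:
        if a.merged_into is not None:
            continue
        for b in by_start:
            if b is a or b.merged_into is not None:
                continue
            gap = b.start_frame - a.end_frame
            if gap < 0 or gap > max_gap:
                continue
            if math.dist(a.end_pos, b.start_pos) <= px_per_frame * (gap + 1):
                b.merged_into = a.track_id                # player can't teleport: same person
                a.end_frame, a.end_pos = b.end_frame, b.end_pos
                break
    return tracklets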

Team Classification

We automatically separate teams by jersey color using K-means clustering on HSV color values from each player's jersey region:

from sklearn.cluster import KMeans
import numpy as np

# Extract jersey colors and cluster into 2 teams
hsv_colors = np.array([extract_jersey_hsv(frame, player) for player in detections])
kmeans = KMeans(n_clusters=2, n_init=10)
team_labels = kmeans.fit_predict(hsv_colors)

# Assign "light" to the cluster with the brighter centroid (higher V), "dark" to the other
light_cluster = int(np.argmax(kmeans.cluster_centers_[:, 2]))
teams = ["light" if label == light_cluster else "dark" for label in team_labels]

This is surprisingly robust when jersey colors are clearly separable and lighting is stable. It degrades with color clashes (e.g., white vs. light gray), heavy shadows, or pileups where players overlap significantly.
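The extract_jersey_hsv helper above is not doing anything clever. Roughly, it crops the upper torso of the player's box and averages the HSV values; a sketch, where the crop fractions and the assumption that player holds an (x1, y1, x2, y2) box are ours:

import cv2

def extract_jersey_hsv(frame, player):
    """Mean HSV of the jersey region: the upper-middle band of the player's box.

    player is assumed to be an (x1, y1, x2, y2) box in pixels; the 20-50% height
    band roughly covers the torso while skipping helmet and legs (fractions are illustrative).
    """
    x1, y1, x2, y2 = map(int, player)
    h = y2 - y1
    crop = frame[y1 + int(0.2 * h): y1 + int(0.5 * h), x1:x2]
    hsv = cv2.cvtColor(crop, cv2.COLOR_BGR2HSV)
    return hsv.reshape(-1, 3).mean(axis=0)   # (H, S, V) means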

Output Data

Beyond the tracked video, we output structured JSON for each detection:

{
  "frame_idx": 0,
  "tracker_id": 7,
  "team": "light",
  "x": 427,
  "y": 201,
  "confidence": 0.71
}

From this data, we can compute (see the sketch after this list):

  • Distance covered — total yards each player ran
  • Speed — instantaneous and average, in MPH
  • Heat maps — position density over time
  • Trajectories — full movement paths
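As one example, distance and average speed per player fall straight out of the per-frame records. A sketch that assumes a fixed pixels-per-yard scale (in practice this is exactly the calibration problem discussed under Limitations):

import math
from collections import defaultdict

PX_PER_YARD = 30.0   # illustrative; the real value needs per-clip calibration
FPS = 30.0

def player_distance_and_speed(records):
    """records: a list of dicts shaped like the JSON above, one per detection."""
    by_player = defaultdict(list)
    for r in sorted(records, key=lambda r: r["frame_idx"]):
        by_player[r["tracker_id"]].append((r["frame_idx"], r["x"], r["y"]))

    stats = {}
    for tid, pts in by_player.items():
        pixels = sum(math.dist(a[1:], b[1:]) for a, b in zip(pts, pts[1:]))
        yards = pixels / PX_PER_YARD
        seconds = (pts[-1][0] - pts[0][0]) / FPS or 1.0   # avoid dividing by zero
        stats[tid] = {"yards": yards, "avg_mph": yards / seconds * 3600 / 1760}
    return stats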

Limitations

Early-stage

This is a working prototype. While the initial outputs are promising, moving to a production system will require further adjustments for reliability.

The approach has clear limitations. Some are inherent to the technology, others are specific to sports footage:

Camera angle matters. Broadcast footage shows a slice of the field. Players enter and exit frame constantly. The tracker handles this well, but it's not possible to analyze what can't be seen. Depending on the objective of the analysis being done, this might be a significant limitation.

Small objects are hard. We experimented with clips where the football itself is in frame, but tracking it is challenging. The ball is small, moves quickly, and is often occluded. Player tracking is more reliable.

Speed/distance needs calibration. Converting pixel movement to yards requires knowing the camera's view of the field. We've made attempts to approximate this, but getting more reliable results will require additional approaches and probably manual calibration per clip.

Future Outlook

While this is still a prototype, we're actively building on it:

  • Multi-sport expansion: The same pipeline should generalize to soccer, basketball, and hockey. We're testing across sports now.
  • Footage to analytics: The more interesting output (for us) is the structured track data you can compute from this: positions over time, approximate speed, team assignment—within the limits of broadcast footage.

If you're experimenting with sports tracking and want to compare notes, or have footage you'd like to run through this pipeline, feel free to get in touch.

Interested in exploring this further?

We're looking for early partners to push these ideas forward.