Tracking is the process of associating a detection in one frame or image with a detection in another frame or image. A “Track” is a series of detections that together represent multiple looks at the same underlying real-world object.

Tracking methodology considerations

Tracks are generated by joining detections into tracklets, which are associated groups of detections. Further iterations then join tracklets into larger tracklets until there is confidence that all tracklets are well-formed tracks. The number of iterations is discussed under “Data Considerations”. For implementation symmetry, the first iteration casts each detection to a tracklet containing 1 detection.
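The loop described above can be sketched in a few lines. This is a minimal illustration and not the OpenEM implementation: the detection dictionaries, the function names, the `max_iterations` stopping rule, and the toy `join_consecutive` stage (which joins on frame gap alone) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Detection = Dict[str, float]   # hypothetical shape, e.g. {"frame": 3, "x": 10, ...}
Tracklet = List[Detection]


def detections_to_tracklets(detections: List[Detection]) -> List[Tracklet]:
    """First iteration: cast every detection to a tracklet of length 1."""
    return [[d] for d in sorted(detections, key=lambda d: d["frame"])]


def join_consecutive(tracklets: List[Tracklet], max_gap: int = 1) -> List[Tracklet]:
    """Toy join stage (assumption): merge a tracklet with the next one
    when the frame gap between them is at most max_gap."""
    merged = []
    i = 0
    while i < len(tracklets):
        current = tracklets[i]
        has_next = i + 1 < len(tracklets)
        if has_next and tracklets[i + 1][0]["frame"] - current[-1]["frame"] <= max_gap:
            merged.append(current + tracklets[i + 1])
            i += 2
        else:
            merged.append(current)
            i += 1
    return merged


def iterate_joins(
    tracklets: List[Tracklet],
    join_stage: Callable[[List[Tracklet]], List[Tracklet]],
    max_iterations: int = 5,
) -> List[Tracklet]:
    """Repeat the join stage until nothing changes or the budget runs out."""
    for _ in range(max_iterations):
        joined = join_stage(tracklets)
        if len(joined) == len(tracklets):
            break  # no tracklets were joined, so treat the result as final
        tracklets = joined
    return tracklets


detections = [{"frame": f} for f in (0, 1, 2, 3)]
tracks = iterate_joins(detections_to_tracklets(detections), join_consecutive)
```

In a real tracker the join stage would be the graph-based solver, and each iteration could use a different edge weight method.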

Each stage of the loop executes a graph-based algorithm that decides which tracklets to join based on the weight associated with each edge in the graph. More than 2 edges can be contracted in 1 iteration of the graph edge contraction. There are effectively unlimited ways to calculate the weight of each edge. OpenEM supports edge weight methods using ML/AI or traditional computational methods.
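As a rough illustration of one contraction stage, the sketch below treats each tracklet as a graph node and joins every group of nodes connected by edges whose weight meets a threshold. The threshold rule, the union-find grouping, and the `contract` function name are assumptions chosen for brevity; the solver OpenEM actually uses may differ.

```python
from typing import List, Tuple

WeightedEdge = Tuple[int, int, float]  # (node_a, node_b, weight)


def contract(num_nodes: int, edges: List[WeightedEdge], threshold: float) -> List[List[int]]:
    """Group node indices that are linked by edges with weight >= threshold.

    Each returned group is a set of tracklets to join in this iteration.
    A group can hold more than two nodes, so several edges may be
    contracted in a single pass.
    """
    parent = list(range(num_nodes))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit the strongest edges first; weak edges are ignored.
    for a, b, weight in sorted(edges, key=lambda e: -e[2]):
        if weight >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Four tracklets: 0-1 and 1-2 are strongly linked, 2-3 is weakly linked.
groups = contract(4, [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1)], threshold=0.5)
```

Here tracklets 0, 1, and 2 are joined in one pass, while tracklet 3 stays separate until a later iteration (or a different weight method) links it.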

An example tracker using Tator as a data store for both detections and tracks is located in scripts/

Recursive graph solving overview

The graph edge contraction above shows four tracks contracted into one through iterative contraction. The weights of the graph are shown qualitatively as the width of the line drawn for each edge.

Simple IoU

Graphical depiction of intersection over union (IoU). Adrian Rosebrock / CC BY-SA

A tracker starts with a list of detections. In some video, specifically higher frame-rate, low-occlusion video, the IoU of one detection with another can be highly indicative that the object is the same across frames. The IoU weighting mechanism can start to produce errors on long overlapping tracks. As an example, picture looking across a two-way street. Assuming a well-trained object detector, it is probable that almost all frames capture two oncoming cars as they pass each other. However, an IoU tracker can misjudge truth because it looks only at the overlap of detections, resulting in ‘track swappage’. In this case, rather than reporting 2 cars, one traveling left and one traveling right, the tracker may report 2 cars that each drive towards the middle and turn around.
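For reference, the IoU of two axis-aligned boxes can be computed as below. This is a minimal sketch: the `(x, y, width, height)` box layout and the `iou` function name are assumptions for the example, not a description of the format OpenEM or Tator use.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes, in [0, 1]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


same = iou((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
partial = iou((0, 0, 2, 2), (1, 1, 2, 2))    # overlap of 1 over union of 7
disjoint = iou((0, 0, 1, 1), (5, 5, 1, 1))   # no overlap
```

Used as an edge weight, a high IoU between the last detection of one tracklet and the first detection of another suggests they belong to the same object. When two cars pass each other, both candidate pairings can score highly, which is how the swap described above arises.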


Directional

Directional tracking generates edge weights based on the similarity of velocity between two tracks. In the car example above, the IoU strategy may be limited to only create tracks of up to N frames. Given N frames, each IoU-based track can have a calculated velocity, and edge weights are valued based on the similarity of velocity between two tracks. The directional model can add fidelity to an IoU tracker if objects have defined motion patterns, biasing association based on the physical characteristics of the object being tracked.

Directional tracking can be difficult for objects that move erratically or become occluded for long periods of time. Directional tracking also does not help identify tracks that leave and return to the field of view within a recording.
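One way to turn velocity similarity into an edge weight is the cosine similarity of each tracklet's mean velocity, sketched below. The `(frame, center_x, center_y)` tuple layout, the function names, and the choice of cosine similarity are all assumptions for illustration; other velocity comparisons are equally plausible.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # assumed layout: (frame, center_x, center_y)


def mean_velocity(tracklet: List[Point]) -> Tuple[float, float]:
    """Average velocity in pixels per frame between the first and last detection."""
    if len(tracklet) < 2:
        return (0.0, 0.0)  # a single detection has no measurable motion
    f0, x0, y0 = tracklet[0]
    f1, x1, y1 = tracklet[-1]
    dt = f1 - f0
    if dt == 0:
        return (0.0, 0.0)
    return ((x1 - x0) / dt, (y1 - y0) / dt)


def velocity_similarity(a: List[Point], b: List[Point]) -> float:
    """Cosine similarity of mean velocities: 1 same direction, -1 opposite."""
    vax, vay = mean_velocity(a)
    vbx, vby = mean_velocity(b)
    norm = math.hypot(vax, vay) * math.hypot(vbx, vby)
    if norm == 0:
        return 0.0  # a stationary tracklet gives no directional evidence
    return (vax * vbx + vay * vby) / norm


car_heading_right = [(0, 0.0, 0.0), (5, 10.0, 0.0)]
continues_right = [(6, 12.0, 0.0), (10, 20.0, 0.0)]
heads_left = [(6, 12.0, 0.0), (10, 4.0, 0.0)]

keep_going = velocity_similarity(car_heading_right, continues_right)
turn_around = velocity_similarity(car_heading_right, heads_left)
```

In the two-car example, the pairing that keeps each car moving in its original direction scores near 1, while the "turn around" pairing scores near -1. The zero returned for stationary tracklets reflects the limitation noted above: direction carries no information when there is no consistent motion.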

Siamese CNN

This method compares two tracklets that each have no more than 1 detection. The appearance features of each detection or series of detections are extracted and compared. The similarity between the detections is used as an edge weight in the graph. This method of edge weight determination can help recognize whether detections are of the same object even if no motion or overlap is present between multiple looks at the object. In the car example above, the appearance features of a red car would match it to a similar look at the same red car. This method can run into issues for objects that change or transform their appearance. In the car example, a weakness of this approach would be exposed if one car were a convertible in the process of folding in its roof.

This method requires a trained model.
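Once a trained model has produced an appearance embedding for each detection, the comparison step can be sketched as follows. The embeddings here are plain lists of floats standing in for model output. The averaging over a series of detections, the distance-to-similarity mapping, and both function names are assumptions for illustration, not the OpenEM model's actual output head.

```python
import math
from typing import List

Embedding = List[float]  # stand-in for the feature vector a trained model would emit


def mean_embedding(embeddings: List[Embedding]) -> Embedding:
    """Average the embeddings of a series of detections, element by element."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


def appearance_similarity(tracklet_a: List[Embedding], tracklet_b: List[Embedding]) -> float:
    """Map the Euclidean distance between mean embeddings into (0, 1].

    Identical appearance gives 1.0; the value falls toward 0 as the
    embeddings move apart.
    """
    a = mean_embedding(tracklet_a)
    b = mean_embedding(tracklet_b)
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


red_car_first_look = [[0.0, 0.0]]
red_car_second_look = [[0.0, 0.0]]
blue_car = [[3.0, 4.0]]

same_car = appearance_similarity(red_car_first_look, red_car_second_look)
different_car = appearance_similarity(red_car_first_look, blue_car)
```

Because the weight depends only on appearance, it can link two looks at the same object that share no overlap or consistent motion.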

Multiple stage approach

Each one of these stages can be used in conjunction with another. The reference tracker shows the concept of using different methods based on the current iteration of the network. This fusion approach can be useful to

Training considerations

Of the currently supported reference methods, the Siamese CNN is the only method requiring training. <TODO: insert link to how to train siamese data>.

Bootstrapping the tracker can be useful for generating training data. The IoU or Directional tracker methods can generate tracks to be reviewed by annotators. After this review, the data can be used to train the Siamese CNN. This can produce more training data for associating detections, and produce it faster, than having annotators start by manually associating detections.