OpenVINO™ toolkit provides a set of pre-trained models that you can use for learning and demo purposes or for developing deep learning software. The most recent versions of the models are available in the Open Model Zoo repository on GitHub.
The models can be downloaded via the Model Downloader (`<OPENVINO_INSTALL_DIR>/deployment_tools/open_model_zoo/tools/downloader`). They can also be downloaded manually from 01.org.
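As a minimal sketch of scripting the download, the snippet below drives the downloader from Python. It assumes the default install path shown above (replace the `<OPENVINO_INSTALL_DIR>` placeholder with your actual path) and uses the downloader's standard `--name` and `--output_dir` options.

```python
import subprocess
import sys

# Location of the Model Downloader script; replace <OPENVINO_INSTALL_DIR>
# with your actual installation path.
DOWNLOADER = ("<OPENVINO_INSTALL_DIR>/deployment_tools/open_model_zoo/"
              "tools/downloader/downloader.py")

# Fetch one model into ./models; --name accepts any model name listed below.
subprocess.run(
    [sys.executable, DOWNLOADER,
     "--name", "face-detection-adas-0001",
     "--output_dir", "models"],
    check=True,
)
```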
Several detection models can be used to detect a set of the most popular objects, for example, faces, people, and vehicles. Most of the networks are SSD-based and provide reasonable accuracy/performance trade-offs. Networks that detect the same types of objects (for example, face-detection-adas-0001 and face-detection-retail-0004) offer a choice between higher accuracy with wider applicability and faster performance, so you can expect a "bigger" network to detect objects of the same type better, at the cost of slower inference.
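For illustration, here is a minimal sketch of running one of these SSD-based detectors with the 2020-era Inference Engine Python API. The file paths, input image, and the 0.5 confidence threshold are assumptions; the `[1, 1, N, 7]` output layout with rows of `[image_id, label, confidence, x_min, y_min, x_max, y_max]` is the common SSD detection format.

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore

# Paths are assumptions: point them at the files fetched by the downloader.
ie = IECore()
net = ie.read_network(model="face-detection-adas-0001.xml",
                      weights="face-detection-adas-0001.bin")
input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
_, _, h, w = net.input_info[input_name].input_data.shape
exec_net = ie.load_network(network=net, device_name="CPU")

image = cv2.imread("frame.jpg")
blob = (cv2.resize(image, (w, h))
        .transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> NCHW
        .astype(np.float32))

# SSD-like output: [1, 1, N, 7] rows of
# [image_id, label, confidence, x_min, y_min, x_max, y_max], coordinates normalized.
detections = exec_net.infer({input_name: blob})[output_name]
for image_id, label, conf, x_min, y_min, x_max, y_max in detections[0][0]:
    if conf > 0.5:  # assumed confidence threshold
        print(f"object at ({x_min:.2f}, {y_min:.2f})-({x_max:.2f}, {y_max:.2f}), "
              f"conf={conf:.2f}")
```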
Object recognition models are used for classification, regression, and character recognition. Use these networks after a respective detector (for example, Age/Gender recognition after Face Detection); a minimal pipeline sketch follows the table below.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
age-gender-recognition-retail-0013 | 0.094 | 2.138 |
head-pose-estimation-adas-0001 | 0.105 | 1.911 |
license-plate-recognition-barrier-0001 | 0.328 | 1.218 |
vehicle-attributes-recognition-barrier-0039 | 0.126 | 0.626 |
emotions-recognition-retail-0003 | 0.126 | 2.483 |
landmarks-regression-retail-0009 | 0.021 | 0.191 |
facial-landmarks-35-adas-0002 | 0.042 | 4.595 |
person-attributes-recognition-crossroad-0230 | 0.174 | 0.735 |
gaze-estimation-adas-0002 | 0.139 | 1.882 |
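Continuing the detection sketch above (it reuses `ie`, `image`, and one detected box in normalized coordinates), this hedged example chains a recognition network after the detector. The 62x62 input resolution and the `age_conv3`/`prob` output names follow the age-gender-recognition-retail-0013 model description; the file paths are assumptions.

```python
# Continues the detection sketch: `ie` and `image` come from there, and
# (x_min, y_min, x_max, y_max) is one detected face box, normalized to [0, 1].
age_net = ie.read_network(model="age-gender-recognition-retail-0013.xml",
                          weights="age-gender-recognition-retail-0013.bin")
age_exec = ie.load_network(network=age_net, device_name="CPU")

ih, iw = image.shape[:2]
x0, y0 = int(x_min * iw), int(y_min * ih)
x1, y1 = int(x_max * iw), int(y_max * ih)
face = cv2.resize(image[y0:y1, x0:x1], (62, 62))  # model's input resolution
face = face.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)

result = age_exec.infer({next(iter(age_net.input_info)): face})
age = result["age_conv3"].item() * 100        # the model predicts age divided by 100
gender = "female" if result["prob"].flatten()[0] > 0.5 else "male"
print(f"age ~{age:.0f}, gender {gender}")
```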
Precise tracking of objects in a video is a common application of Computer Vision (for example, for people counting). It is often complicated by events that amount to a "relatively long absence of an object", caused, for example, by occlusion or out-of-frame movement. In such cases, it is better to recognize the object as "seen before" regardless of its current position in the image or the amount of time passed since its last known position.
The following networks can be used in such scenarios. They take an image of a person and evaluate an embedding: a vector in a high-dimensional space that represents the appearance of that person. This vector can be used for further comparison: images that correspond to the same person have embedding vectors that are "close" under the L2 metric (Euclidean distance).
There are multiple models that provide various trade-offs between performance and accuracy (in general, expect a bigger model to be more accurate but slower).
Model Name | Complexity (GFLOPs) | Size (Mp) | Rank-1 |
---|---|---|---|
person-reidentification-retail-0031 | 0.028 | 0.280 | 92.11% |
person-reidentification-retail-0103 | 0.564 | 0.597 | 93.5% |
person-reidentification-retail-0107 | 0.174 | 0.183 | 91.7% |
person-reidentification-retail-0200 | 5.506 | 4.723 | 95.4% |
Model Name | Complexity (GFLOPs) | Size (Mp) | Pairwise accuracy |
---|---|---|---|
face-reidentification-retail-0095 | 0.588 | 1.107 | 99.33% |
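Below is a minimal sketch of comparing two such embeddings with the L2 metric described above. The 1.0 decision threshold is an assumption and must be tuned per model and deployment.

```python
import numpy as np

def same_identity(emb_a: np.ndarray, emb_b: np.ndarray,
                  threshold: float = 1.0) -> bool:
    """Return True when two re-identification embeddings are "close"
    under the L2 (Euclidean) metric; the threshold is an assumed value."""
    return float(np.linalg.norm(emb_a - emb_b)) < threshold
```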
Semantic segmentation is an extension of the object detection problem. Instead of returning bounding boxes, semantic segmentation models return a "painted" version of the input image, where the "color" of each pixel represents a certain class. These networks are much bigger than the respective object detection networks, but they provide better (pixel-level) localization of objects and can detect areas with complex shapes (for example, free space on the road).
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
road-segmentation-adas-0001 | 4.770 | 0.184 |
semantic-segmentation-adas-0001 | 58.572 | 6.686 |
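As a sketch of the post-processing, assuming the network emits a `[1, C, H, W]` blob of per-pixel class scores (some segmentation models instead output an already-argmaxed map, in which case this step is unnecessary):

```python
import numpy as np

def to_class_map(scores: np.ndarray) -> np.ndarray:
    """Collapse a [1, C, H, W] score blob into an [H, W] map where each
    pixel holds the index of its most likely class."""
    return np.argmax(scores[0], axis=0).astype(np.uint8)
```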
Instance segmentation is an extension of the object detection and semantic segmentation problems. Instead of predicting a bounding box around each object instance, an instance segmentation model outputs pixel-wise masks for all instances.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
instance-segmentation-security-0050 | 46.602 | 30.448 |
instance-segmentation-security-0083 | 365.626 | 143.444 |
instance-segmentation-security-0010 | 899.568 | 174.568 |
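A generic, model-agnostic sketch of the usual instance-mask post-processing follows: each instance's low-resolution mask is resized to its bounding box and pasted into a full-frame binary mask. The 0.5 binarization threshold and the pixel-coordinate box format are assumptions; consult the specific model's description for its exact outputs.

```python
import cv2
import numpy as np

def paste_instance_mask(raw_mask: np.ndarray, box, frame_hw) -> np.ndarray:
    """Resize a low-resolution per-instance mask to its box (x0, y0, x1, y1,
    in pixels) and paste it into an empty full-frame binary mask."""
    h, w = frame_hw
    x0, y0, x1, y1 = [int(v) for v in box]
    full = np.zeros((h, w), dtype=np.uint8)
    resized = cv2.resize(raw_mask, (max(x1 - x0, 1), max(y1 - y0, 1)))
    full[y0:y0 + resized.shape[0],
         x0:x0 + resized.shape[1]] = (resized > 0.5).astype(np.uint8)
    return full
```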
The human pose estimation task is to predict a pose, that is, a body skeleton consisting of keypoints and the connections between them, for every person in an input image or video. Keypoints are body joints, i.e. ears, eyes, nose, shoulders, knees, etc. There are two major groups of such methods: top-down and bottom-up. The first detects persons in a given frame, crops or rescales each detection, and then runs a pose estimation network on every detection; these methods are very accurate. The second finds all keypoints in a given frame and then groups them by person instances; these methods are faster than the top-down ones because the network runs only once.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
human-pose-estimation-0001 | 15.435 | 4.099 |
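As the simplest illustration of the keypoint side, the sketch below reads one peak per keypoint heatmap; real multi-person (bottom-up) decoding additionally groups peaks into skeletons using the network's part-affinity fields. The `[K, H, W]` heatmap layout and the 0.1 threshold are assumptions.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps: np.ndarray, threshold: float = 0.1):
    """Take the strongest peak of each [H, W] heatmap in a [K, H, W] blob.
    Sufficient for a single person; bottom-up multi-person decoding also
    groups peaks into skeletons via part-affinity fields."""
    keypoints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        keypoints.append((int(x), int(y)) if hm[y, x] > threshold else None)
    return keypoints
```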
Deep Learning models are used in various image processing tasks to increase the quality of the output.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
single-image-super-resolution-1032 | 11.654 | 0.030 |
single-image-super-resolution-1033 | 16.062 | 0.030 |
text-image-super-resolution-0001 | 1.379 | 0.003 |
Deep Learning models for text detection in various applications.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
text-detection-0003 | 51.256 | 6.747 |
text-detection-0004 | 23.305 | 4.328 |
Deep Learning models for text recognition in various applications.
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
text-recognition-0012 | 1.485 | 5.568 |
handwritten-score-recognition-0003 | 0.792 | 5.555 |
Deep Learning models for text spotting (simultaneous detection and recognition).
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
text-spotting-0001-detector | 185.169 | 26.497 |
text-spotting-0001-recognizer-encoder | 2.082 | 1.328 |
text-spotting-0001-recognizer-decoder | 0.002 | 0.273 |
Action Recognition models predict the action that is being performed on a short video clip (a tensor formed by stacking sampled frames from the input video). Some models (for example, driver-action-recognition-adas-0002) may use precomputed high-level spatial or spatio-temporal features (embeddings) from individual clip fragments and then aggregate them in a temporal model to predict a vector of classification scores. Models that compute embeddings are called encoders, while models that predict the actual labels are called decoders.
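A minimal sketch of that encoder/decoder pipeline: `encoder_infer` and `decoder_infer` are hypothetical callables standing in for the two loaded networks, and the `[1, T, D]` clip layout is an assumption.

```python
import numpy as np

def classify_clip(encoder_infer, decoder_infer, frames):
    """Run the encoder once per sampled frame, stack the resulting embeddings
    into a temporal tensor, and let the decoder map it to class scores.
    `encoder_infer`/`decoder_infer` are placeholders for the loaded networks."""
    embeddings = [encoder_infer(frame) for frame in frames]  # one D-dim vector per frame
    clip = np.stack(embeddings)[np.newaxis, ...]             # assumed [1, T, D] layout
    return decoder_infer(clip)                               # vector of classification scores
```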
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
driver-action-recognition-adas-0002-encoder | 0.676 | 2.863 |
driver-action-recognition-adas-0002-decoder | 0.147 | 4.205 |
action-recognition-0001-encoder | 7.340 | 21.276 |
action-recognition-0001-decoder | 0.147 | 4.405 |
asl-recognition-0003 | 6.651 | 4.129 |
Deep Learning models for image retrieval (ranking 'gallery' images according to their similarity to some 'probe' image).
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
image-retrieval-0001 | 0.613 | 2.535 |
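As a sketch of the ranking step: given a probe embedding and a matrix of gallery embeddings, sort the gallery by distance to the probe. L2 distance is used here for consistency with the re-identification sketch above; cosine similarity is a common alternative.

```python
import numpy as np

def rank_gallery(probe: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Return gallery indices ordered from most to least similar to the
    probe, using L2 distance between embeddings ([D] probe vs. [N, D] gallery)."""
    distances = np.linalg.norm(gallery - probe, axis=1)
    return np.argsort(distances)
```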
Deep Learning compressed models (for example, binarized networks).
Model Name | Complexity (GFLOPs) | Size (Mp) |
---|---|---|
resnet50-binary-0001 | 1.002 | 7.446 |
resnet18-xnor-binary-onnx-0001 | - | - |
[*] Other names and brands may be claimed as the property of others.