In this post, we will explain the various algorithms we used to learn new information about what is shown in videos from the Open Archive (Beeld & Geluid)[a].

In the context of the project Climate Imaginaries at Sea, we focused on a dataset of videos linked to oceans, rivers, and the creatures that live in them, and their relation to the climate crisis. The methods described here, however, can be applied to any video. We also developed a tool with which you can search the results of the algorithms and create supercuts from them, to see how certain concepts are represented.

By default, the Open Archive videos contain only general metadata. If we want to explore how certain concepts are represented over time, we need to know more about what is actually in the videos. With the recent advances in AI technologies, specifically computer vision, we can extract some of this extra information.

Laying the groundwork

A single video can cover several topics, especially the videos containing the news, so we first have to split the videos into smaller chunks. In the first processing steps, we therefore do three things.

1. Transcribing the video

To learn about what is being said in a video, each video is transcribed using the Vosk [1] library and the KaldiNL [2] model for Dutch speech. Another option is the recently released Whisper model from OpenAI. In a few simple tests on a couple of videos from the archive, Whisper gave better results but also took significantly longer than the KaldiNL model, which proved faster and still sufficiently accurate. Which algorithm is most suitable therefore depends on the computing power (and time) available. Since older videos have only one audio channel, the accuracy of the transcription drops significantly when there is a lot of background noise.
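To give an impression of this step, here is a minimal transcription sketch using the Vosk Python API. The model path, file names, and the ffmpeg call that extracts a mono 16 kHz audio track are assumptions; the important part is SetWords(True), which makes Vosk return word-level timestamps that we use later to link text to scenes.

```python
import json
import subprocess
import wave

from vosk import KaldiRecognizer, Model

MODEL_PATH = "models/kaldi-nl"   # placeholder: wherever the Dutch model lives on disk
SAMPLE_RATE = 16000

# Extract a mono 16 kHz WAV track from the video with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", "video.mp4", "-ac", "1", "-ar", str(SAMPLE_RATE), "audio.wav"],
    check=True,
)

model = Model(MODEL_PATH)
recognizer = KaldiRecognizer(model, SAMPLE_RATE)
recognizer.SetWords(True)  # ask Vosk for per-word start/end times

words = []
with wave.open("audio.wav", "rb") as audio:
    while True:
        data = audio.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            words.extend(json.loads(recognizer.Result()).get("result", []))
words.extend(json.loads(recognizer.FinalResult()).get("result", []))

# Each entry contains the recognised word plus its "start", "end" and "conf" values.
print(words[:5])
```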

2. Splitting the video into scenes and linking the transcribed text to the scenes

Using the PySceneDetect [3] package, we split all the videos into scenes. In the first run, some videos were not properly split into scenes. Better results were achieved by lowering the cut threshold to 15 (from the default of 27). The optimal threshold varies per video: using the same value for all videos can mean a scene is cut into more segments than needed (lower threshold) or that multiple scenes are not split at all (higher threshold). For our purposes, it was no big issue that the lower threshold sometimes cuts what is actually one scene into multiple smaller segments. In the future, automatically finding the proper threshold would be useful [b].
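As an illustration, a minimal sketch using PySceneDetect's current Python API (0.6 or later); the video path is a placeholder and the threshold of 15 is the value that worked for our footage.

```python
from scenedetect import ContentDetector, detect, split_video_ffmpeg

VIDEO_PATH = "video.mp4"  # placeholder path

# Detect scene boundaries with a lower cut threshold than the default (27).
scene_list = detect(VIDEO_PATH, ContentDetector(threshold=15))

for i, (start, end) in enumerate(scene_list):
    print(f"Scene {i}: {start.get_seconds():.2f}s - {end.get_seconds():.2f}s")

# Write one file per scene using ffmpeg.
split_video_ffmpeg(VIDEO_PATH, scene_list)
```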

In the data returned by the transcription, timestamps are available for each recognized word. By matching these timestamps with the scenes, we can determine which text was spoken in which scene. When a word is spoken exactly on a scene split, we make sure the whole word is added to that scene.
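A sketch of how this matching could look, assuming the scene list and word list from the sketches above; here a word that overlaps a scene cut is attributed in full to the scene in which it begins.

```python
def words_per_scene(scene_list, words):
    """Assign each transcribed word to the scene in which it starts."""
    scenes = []
    for start, end in scene_list:
        start_s, end_s = start.get_seconds(), end.get_seconds()
        text = [w["word"] for w in words if start_s <= w["start"] < end_s]
        scenes.append({"start": start_s, "end": end_s, "text": " ".join(text)})
    return scenes
```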

[a] https://www.openbeelden.nl/

[b] http://scenedetect.com/en/latest/examples/usage-example/#finding-optimal-thresholdsensitivity

3. Storing the scene information in a database

We store all the information for each scene in a (MongoDB) database. To start, for each scene we have the following information:

  • Video name
  • Scene name (videoname-scene-xxx.mp4)
  • Spoken text

Only the paths to the videos and scenes are stored in the database; the scene files themselves are saved on the file system. Relevant metadata that already exists, such as the date the video was released, can also be added here.
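A minimal sketch of this step with pymongo; the connection string, database, collection, and field names are illustrative rather than the exact schema we use.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
scenes_collection = client["open_archive"]["scenes"]

# One document per scene; the paths point to files on the local file system.
scenes_collection.insert_one({
    "video_name": "video.mp4",
    "scene_name": "video-scene-001.mp4",
    "scene_path": "/data/scenes/video-scene-001.mp4",
    "spoken_text": "tekst die in deze scene wordt gesproken",
    "release_date": "1969-01-01",  # optional, if present in the existing metadata
})
```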

Video content analysis using Computer Vision

In our case, the 445 videos in our set were divided into 8,300 scenes, each linked to the text spoken in that scene. To learn more about what is shown within these scenes, we applied various Computer Vision techniques. We will list them here and explain the process and what we learned.

Object Detection

With Object Detection, we can retrieve objects from images. We use the lightweight YOLOv5 [4] model, which can detect a wide range of (general) objects. From each scene, we extract a random frame and use that image as input for the object detection model. For each object found in the image, the label, confidence score, and location of the object in the image are stored in the database.

A still video image with people outlined in red boxes, illustrating that they have been detected by computer vision.
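A sketch of this step, assuming a frame has already been extracted from the scene; the repository and model names are the published torch.hub identifiers for YOLOv5, while the frame path is a placeholder.

```python
import torch

# Load the small pretrained YOLOv5 model from the Ultralytics hub.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("video-scene-001-frame.jpg")

# One row per detected object: bounding box, confidence score and class label.
detections = results.pandas().xyxy[0]
for _, det in detections.iterrows():
    print(det["name"], round(det["confidence"], 2),
          [det["xmin"], det["ymin"], det["xmax"], det["ymax"]])
```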

Semantic Segmentation
Semantic Segmentation allows us to give a label to each pixel in an image. As with object detection, a random frame is extracted from the scene, after which we use the SegFormer [5] model trained on the ADE20k [6] dataset to label each pixel. An example can be seen in the figure below.

An image of a bridge that has been reduced to fields of colour, illustrating the segmentation of sky, bridge, tree, earth, road, water, and river.

For each scene, we store the percentage of the extracted frame that is covered by each label found in the image. It is important to know that the algorithm will always label every pixel, even when it is not sure, which can introduce some extra noise in the results.
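A sketch of the segmentation step and the per-label percentages, assuming a recent version of the Hugging Face transformers library; the checkpoint name is a publicly available ADE20k-finetuned SegFormer variant, which may differ from the exact one we used.

```python
import numpy as np
import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

CHECKPOINT = "nvidia/segformer-b0-finetuned-ade-512-512"  # assumed ADE20k checkpoint

processor = SegformerImageProcessor.from_pretrained(CHECKPOINT)
model = SegformerForSemanticSegmentation.from_pretrained(CHECKPOINT)

image = Image.open("video-scene-001-frame.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_labels, height/4, width/4)

# Per-pixel label map (at the model's output resolution), then the share of
# the frame covered by each label.
label_map = logits.argmax(dim=1)[0].numpy()
labels, counts = np.unique(label_map, return_counts=True)
for label_id, count in zip(labels, counts):
    print(model.config.id2label[int(label_id)],
          round(100 * count / label_map.size, 1), "%")
```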

Scene Classification
To learn what type of place can be seen in a scene, we use the DenseNet161 scene classification model [7], trained on the Places365 dataset [8]. This model gives a single label to an image. Again, we extract a random frame from the scene and classify it. See Figures x and x for some examples. We store the best-scoring label together with its confidence score.

Two video stills side by side: one showing an aerial view of a crowded Dutch beach, the other showing an aeroplane on a runway.
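A sketch of this step, following the loading recipe from the CSAILVision/places365 demo code. For brevity it loads the smaller ResNet18 Places365 checkpoint; the DenseNet161 weights are loaded the same way but need an extra key-renaming step. File names are assumptions taken from that repository.

```python
import torch
from PIL import Image
from torchvision import models, transforms

WEIGHTS = "resnet18_places365.pth.tar"    # pretrained weights from the places365 repo
CATEGORIES = "categories_places365.txt"   # the 365 place labels from the same repo

# Build the architecture with 365 output classes and load the Places365 weights.
model = models.resnet18(num_classes=365)
checkpoint = torch.load(WEIGHTS, map_location="cpu")
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()

classes = [line.strip().split(" ")[0][3:] for line in open(CATEGORIES)]

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("video-scene-001-frame.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)[0]

score, idx = probs.max(dim=0)
print(classes[idx], float(score))  # best-scoring place label and its confidence
```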

Action Detection 

A very experimental technique is used to detect what actions are performed by people in these scenes. Instead of a still image, the algorithm here takes a whole video segment as input and returns labels for the action performed in it, for example drinking or dancing. We used the SlowFast [9] action detection algorithm, trained on the Kinetics 400 [10] dataset. Because the algorithm is only trained on actions performed by people, we only run it on scenes in which object detection has already found a person. We store the action label and the confidence score in the database. See figure x for an example from the algorithm's source of what these algorithms aim to do.

A video still of a woman dancing flamenco on stage in front of a group of flamenco musicians. The image has an overlay of computer generated captions describing the scene via computer vision.
Example of action detection in video. From: https://github.com/facebookresearch/pytorchvideo
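A sketch of how this gating could look. The torch.hub identifier for SlowFast is the one published by PyTorchVideo; the database field names and the classify_actions helper, which would clip, resample and normalise the scene video as in the PyTorchVideo tutorials before running the model, are hypothetical.

```python
import torch
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
scenes_collection = client["open_archive"]["scenes"]

# Published PyTorchVideo hub model, pretrained on Kinetics 400.
slowfast = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)
slowfast.eval()

# Only analyse scenes in which object detection already found a person.
for scene in scenes_collection.find({"objects.label": "person"}):
    actions = classify_actions(scene["scene_path"], slowfast)  # hypothetical helper
    scenes_collection.update_one(
        {"_id": scene["_id"]},
        {"$set": {"actions": actions}},  # e.g. [{"label": "dancing", "confidence": 0.42}]
    )
```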

Exploring the data

After running all these algorithms on all the scenes, we can search the database for specific concepts in the content of the scenes. We can search the text that is spoken (in Dutch) or the (English) labels produced by the algorithms. Next to the labels, we can also use the confidence score of each label to limit the number of scenes returned. Figure x shows part of the data connected to a scene, together with a frame from that scene.

A JSON file showing metadata gathered from a source image using object detection algorithms.
Example of data (left) connected to a scene (right)
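A sketch of such a query with pymongo, using the (assumed) schema from the earlier sketches; here we look for scenes in which a person was detected with reasonable confidence and in which the word "zee" ("sea") is spoken.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
scenes_collection = client["open_archive"]["scenes"]

query = {
    "objects": {"$elemMatch": {"label": "person", "confidence": {"$gt": 0.6}}},
    "spoken_text": {"$regex": r"\bzee\b", "$options": "i"},
}
matching_scenes = list(scenes_collection.find(query))
print(len(matching_scenes), "scenes match the query")
```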

The tool’s interface is built around the creation of a supercut as an output. Based on a query, the tool will return a new video built from all the scenes that match that query. For example, you can create a supercut of all the scenes that contain a person, a person AND a dog, or a person performing a certain action.
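One way such a supercut can be assembled, sketched here with moviepy (1.x); the actual tool wraps a step like this behind its API, and the scene paths below are placeholders that would come from a query like the one above.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

matching_scene_paths = [
    "/data/scenes/video-scene-001.mp4",
    "/data/scenes/video-scene-007.mp4",
]

# Concatenate the matching scene files into a single supercut video.
clips = [VideoFileClip(path) for path in matching_scene_paths]
supercut = concatenate_videoclips(clips)
supercut.write_videofile("supercut.mp4")
for clip in clips:
    clip.close()
```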

The interface consists of a (Python Flask) back-end API and a VueJS front-end to create the query and visualize the results. Depending on the number of scenes (and their length), creating a supercut can take up to a few minutes. For that reason, the interface lets you first check how many scenes a query matches, so you can adjust it to a more manageable number. You can also view how often each label occurs in the database and adjust your query accordingly.

A screenshot of a web interface of the supercut tool.

Discussion 

The methods explored in this project offer interesting insight into the current capabilities of Computer Vision algorithms.

The algorithms used in this project are not perfect, and there is no test set to give us clear insight into how well the models perform. Because of this, we cannot provide an estimate of precision and recall, and different models might give better (or worse) results. We use older, low-resolution, sometimes black-and-white videos, which can reduce the accuracy of the algorithms, since they were trained on more recent (higher-quality) material. In our experiments, querying the data with a minimum confidence score helps a great deal in finding relevant results.

In particular, the action detection algorithm is experimental and returns mixed results, even at higher confidence levels. The algorithm works best on short segments (~5 seconds), but the scenes are often longer.

Because of the lower resolution, the algorithms run quickly. The algorithms that analyze just a random frame from each scene take about an hour on an NVIDIA GeForce GTX 1050 Ti GPU in a local machine; the action detection takes longer. Analyzing more recent, higher-quality video requires more processing power, and while that power is readily available from various cloud providers, it can become expensive quickly.

The next steps will be to try other types of algorithms, for example the extraction of text from videos using Optical Character Recognition (OCR), and to explore newer state-of-the-art versions of the methods already outlined in this post, as this is a fast-moving research field. Furthermore, it would be interesting to use the existing metadata of the videos in some way. For example, the release date of a video would enable us to experiment with views of how concepts are represented over time. We would also like to run some tests on social media videos surrounding a certain topic: what type of video is shared often, and what is in them?

Finally, to enable other researchers to experiment with these technologies, we aim to create Python notebooks with which you can run the algorithms yourself. Some technical expertise will still be required to build a database from the results and connect the interface.

Resources

  • [1] Vosk offline speech recognition API. Available at: https://alphacephei.com/vosk/ (Accessed: November 21, 2022). 
  • [2] KaldiNL. Available at: https://github.com/opensource-spraakherkenning-nl/Kaldi_NL (Accessed: November 21, 2022). 
  • [3] Intelligent scene cut detection and video splitting tool.  Available at: https://scenedetect.com/en/latest/ (Accessed: November 21, 2022). 
  • [4] YOLOv5 Object Detection. Available at: https://github.com/ultralytics/yolov5 (Accessed: November 21, 2022). 
  • [5] Xie, Enze, et al. “SegFormer: Simple and efficient design for semantic segmentation with transformers.” Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
  • [6] Zhou, Bolei, et al. “Scene parsing through ADE20K dataset.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  • [7] Huang, Gao, et al. “Densely connected convolutional networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  • [8] Zhou, Bolei, et al. “Places: A 10 million image database for scene recognition.” IEEE transactions on pattern analysis and machine intelligence 40.6 (2017): 1452-1464.
  • [9] Feichtenhofer, Christoph, et al. “SlowFast networks for video recognition.” Proceedings of the IEEE/CVF international conference on computer vision. 2019.
  • [10] Kay, Will, et al. “The kinetics human action video dataset.” arXiv preprint arXiv:1705.06950 (2017).