MULTIMODAL DATASET

This package contains modules for handling multimodal data, with a focus on video with audio and subtitles.

Requirements

The package depends on the following packages:

  • h5py
  • numpy
  • imageio

If you use Anaconda, these are most likely installed by default; otherwise you will have to install them yourself.

For converting movies to datasets, you will need ffmpeg-python. There are many Python ffmpeg interfaces, so make sure you install ffmpeg-python specifically. You will also need an ffmpeg binary installed on your system.

To run the demo (see below) you will also need the requests package; alternatively, you can download the Sintel files manually. Step 3 of the demo also requires scipy. If you use Anaconda, these packages are installed by default.

Installation

In the base project directory, run:

$ python setup.py develop

This adds a symbolic link from your local Python distribution to this package, so the installation picks up new changes when you pull.
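
On newer Python setups, the equivalent editable install can be done with pip:

$ pip install -e .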

Usage

Creating multimodal datasets

The package contains a script that converts .mp4 files (using ffmpeg) and accompanying SubRip subtitle files into HDF5 datasets, where all frames of the video stream are saved as individual images (to enable precise slicing) together with the audio and subtitles of the video.

To convert videos, use the script bin/make_multimodal_dataset.py. The script supports two modes of operation: 1) supplied with a single MP4 file and one or more SRT subtitle files, it will combine them into a single dataset; or 2) given a video-index produced by bin/make_video_index.py, it will convert all videos in the index (optionally in parallel).

  1. Single video file:

    $ python bin/make_multimodal_dataset.py PATH_TO_MP4 [[PATH_TO_SRT_1] ... PATH_TO_SRT_N]
    

    This will create an HDF5 file with the same base name as the video file, but with an .h5 extension.

  2. Using a video-index: Alternatively, the tool takes a video-index file as input, which specifies which files and subtitles should be converted. This is a JSON file produced by the bin/make_video_index.py script.
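
Whichever mode you use, the output is an ordinary HDF5 file, so you can inspect its structure directly with h5py. The following sketch simply walks the file and prints every group and dataset; the file name is taken from the demo below, but any .h5 file produced by the script works:

    import h5py

    def describe(name, obj):
        # Print each node in the hierarchy; datasets also get shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
        else:
            print(f"{name}: group")

    with h5py.File("sintel-1024-surround.h5", "r") as f:
        f.visititems(describe)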

Using multimodal datasets

The methods for using multimodal datasets are located in the multimodal package.

Demo

A demo of how to use the datasets can be found in the bin/extract_subtitle_audio.py script. First you need a multimodal video dataset to work with. For this purpose we will download the free Sintel movie and subtitles. After installing the package (see above), do the following:

  1. Install the required software (ffmpeg and ffmpeg-python):

    i. ffmpeg:

    Debian/Ubuntu

    $ sudo apt-get install ffmpeg
    

    Anaconda:
    If you use Anaconda, you can install a Python-local ffmpeg binary, which doesn't require superuser privileges:

    $ conda install ffmpeg
    

    ii. Python packages:

    $ pip install ffmpeg-python
    
  2. Download the dataset:

    $ mkdir -p data && cd data && python ../bin/download_sintel.py
    

    This will download the free movie Sintel along with English subtitles and place them in the data/ directory. If you don't have the Python requests package, you can download the movie and subtitles manually.

  3. Convert the movie and subtitles to a multimodal dataset. In the data/ directory, run:

    $ python ../bin/make_multimodal_dataset.py sintel-1024-surround.mp4 sintel_en.srt
    

    This will make an HDF5 multimodal dataset from the movie and subtitles. The dataset will have the same name as the video, but with the extension .h5 instead of .mp4.

  4. Make a directory for the wave files and run the extraction script:

    $ mkdir -p sintel_waves && python ../bin/extract_subtitle_audio.py sintel-1024-surround.h5 sintel_waves
    

    If you get a numpy float deprecation warning, everything is still all right.

The directory data/sintel_waves will now contain wav files, each named after its subtitle text and containing the corresponding audio.
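
Since step 3 already requires scipy, you can use scipy.io.wavfile to sanity-check one of the extracted files. A small sketch (it assumes the extracted files carry a .wav extension):

    from pathlib import Path
    from scipy.io import wavfile

    # Pick the first extracted file; names are derived from the subtitle text.
    wav_path = next(Path("data/sintel_waves").glob("*.wav"))
    rate, samples = wavfile.read(wav_path)
    duration = samples.shape[0] / rate
    print(f"{wav_path.name}: {rate} Hz, {duration:.2f} s, shape {samples.shape}")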

Anatomy of multimodal video datasets

At its core, a multimodal dataset is an HDF5 file, where the root group gathers all the different modalities of a single multimodal stream (e.g. a movie). Generally, the modalities can be anything, but specialized classes have been implemented for dealing with video, audio and subtitles.

A single multimodal stream (the root group of the HDF5 file) can have multiple modalities; in movies, for example, these modalities are video, sound and subtitles. These modalities are groups within the stream. Each modality in turn can have different "facets": ways in which the modality is represented.

In the case of movies, the video modality could have facets for different camera angles of the same scene, the audio facets could correspond to different language audio streams, and the subtitle facets could correspond to different closed-caption languages.

The layout of general multimodal datasets looks like this: [figure: layout of a general multimodal dataset]

For video datasets, a concrete organization could look like this: [figure: concrete organization of a video dataset]

A facet can also be the result of online processing of some other facets (such as resampling an audio facet or resizing a video facet), but these are not represented in the HDF5 file and are specific to the facet handler implementation.
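
To make the stream/modality/facet hierarchy concrete, here is a minimal sketch that lays out such a structure with h5py. The group names and toy data are hypothetical, chosen to mirror the movie example above; they are not necessarily the exact names the conversion script writes:

    import h5py
    import numpy as np

    with h5py.File("example_stream.h5", "w") as f:
        # The root group is one multimodal stream (e.g. a movie).
        # Each modality is a group; each facet is a child of its modality.
        video = f.create_group("video")
        video.create_dataset("main_angle",
                             data=np.zeros((10, 64, 64, 3), dtype=np.uint8))

        audio = f.create_group("audio")
        audio.create_dataset("english", data=np.zeros(48000, dtype=np.int16))

        subtitles = f.create_group("subtitles")
        text_dtype = h5py.string_dtype(encoding="utf-8")
        subtitles.create_dataset("english", data=["Hello.", "Goodbye."],
                                 dtype=text_dtype)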

Batched data

The datasets don't support batched data yet, since this would require packing and padding of sequences. This is on the to-do list.
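
For context, batching variable-length sequences typically means padding them to a common length while keeping track of the true lengths. A generic numpy sketch of that idea, not code from this package:

    import numpy as np

    def pad_batch(sequences, pad_value=0):
        # Stack 1-D sequences of varying length into one padded 2-D batch.
        lengths = np.array([len(s) for s in sequences])
        batch = np.full((len(sequences), lengths.max()), pad_value,
                        dtype=sequences[0].dtype)
        for i, seq in enumerate(sequences):
            batch[i, :len(seq)] = seq
        return batch, lengths

    batch, lengths = pad_batch([np.arange(3), np.arange(5), np.arange(2)])
    print(batch.shape, lengths)  # (3, 5) [3 5 2]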
