Kaggle Update #1 – Data Pipeline

Image Processing

The first task is to extract the required data, starting with the image components; sound will come later.

The augmented data processing with dlib should take a few days, with roughly 1,000,000 images generated from the video files for the training and test datasets. I used a low-memory 16-core GCP instance running multiple Python/dlib extraction scripts in parallel. Splitting the work across even more cores (allowing for I/O latency) would likely speed things up further.
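The fan-out across cores is straightforward with Python's multiprocessing; below is a minimal sketch, assuming the videos sit in a flat train_videos directory (the per-video worker is only a stub here and is fleshed out after the next paragraph).

    # Minimal sketch of the parallel split: one worker process per core.
    # extract_faces() is a stub; a fuller sketch follows the next paragraph.
    from multiprocessing import Pool
    from pathlib import Path

    def extract_faces(video_path):
        """Stub worker: frame sampling and dlib face extraction go here."""
        ...

    if __name__ == "__main__":
        videos = sorted(Path("train_videos").glob("*.mp4"))  # assumed layout
        # One process per core; modest over-subscription helps hide I/O latency.
        with Pool(processes=16) as pool:
            pool.map(extract_faces, videos)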

The result for each video is 10 frames sampled over the video's duration, processed and resized with (CUDA-enabled) dlib, for a total of around a million images. These are split into real/fake (according to a metadata lookup), with a distribution of around 5:1 fake:real, which mirrors the 400-item test set offered for download.
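As an illustration of that per-video step, here is a hedged sketch. It assumes the competition's metadata.json layout (a "label" of "REAL" or "FAKE" per file name) and uses OpenCV for frame decoding; the output paths, crop size, and the HOG frontal detector (the CUDA path would use dlib.cnn_face_detection_model_v1 with its model file) are illustrative rather than the exact settings in my scripts.

    # Sample 10 frames evenly over a video, crop detected faces with dlib,
    # resize, and save into a real/ or fake/ folder based on metadata.json.
    import json
    from pathlib import Path

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()  # CUDA: cnn_face_detection_model_v1
    metadata = json.loads(Path("train_videos/metadata.json").read_text())

    def extract_faces(video_path, out_root=Path("faces"), n_frames=10, size=224):
        label = metadata[video_path.name]["label"].lower()  # "real" or "fake"
        cap = cv2.VideoCapture(str(video_path))
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        for i in range(n_frames):
            # Seek to evenly spaced positions across the video duration.
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n_frames))
            ok, frame = cap.read()
            if not ok:
                continue
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            for j, rect in enumerate(detector(rgb, 1)):
                face = rgb[max(rect.top(), 0):rect.bottom(),
                           max(rect.left(), 0):rect.right()]
                if face.size == 0:
                    continue
                face = cv2.resize(face, (size, size))
                out_dir = out_root / label
                out_dir.mkdir(parents=True, exist_ok=True)
                out_file = out_dir / f"{video_path.stem}_{i}_{j}.jpg"
                cv2.imwrite(str(out_file), cv2.cvtColor(face, cv2.COLOR_RGB2BGR))
        cap.release()

Seeking by frame index rather than decoding every frame keeps the decode cost manageable at this scale.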

My objective is to use the extracted images as the basis of my training (and test) sets, fully training my network via GCP on the entire image set.

On another note, I have switched to PyCharm, as the 2019 build supports notebooks and is far more productive than using a browser window; in fact, using IPython notebooks in the browser for development is simply not pleasant. Having used PhpStorm back in the day (2011) and IntelliJ for Java/Kotlin/AOSP fairly recently, I find most of the IDE pretty intuitive.

TFRC (TensorFlow Research Cloud/Google) has also given me unlimited TPU access for sixty days; I think I applied for this months ago but it has only just come through. I will need to start using this on the bigger, data-hungry models I aim to employ.
