In almost all cases, the success of supervised learning techniques depends heavily on the quality and quantity of available training data. Perfect data is seldom available, and when it is there may not be sufficient diversity in it to train a data-hungry model. Cleaning noisy data can be labor-intensive if automated approaches are not available or feasible. A large body of labelled training data is often desirable, but it may also be necessary to synthesize new data from a much smaller training set, for example by employing generative models or more traditional statistical techniques. Alternatively, it may be possible to synthesize training data in its entirety. Services such as CrowdFlower have emerged that can assist with data pre-processing, whilst Neuromation claims to provide an entire platform for the synthesis of training data.
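To make the idea of expanding a small training set concrete, here is a minimal sketch of classic geometric data augmentation using NumPy. The function name `augment` and the toy image shapes are our own illustration, not part of any particular pipeline; a real system would add colour jitter, noise, crops, and so on:

```python
import numpy as np

def augment(images):
    """Expand a small image set with simple geometric transforms.

    `images` is an array of shape (N, H, W, C) with H == W so that the
    90-degree rotation preserves the shape. Each transform adds N more
    images, so the output contains 4 * N images in total.
    """
    out = [images]
    out.append(images[:, :, ::-1, :])               # horizontal flip
    out.append(images[:, ::-1, :, :])               # vertical flip
    out.append(np.rot90(images, k=1, axes=(1, 2)))  # 90-degree rotation
    return np.concatenate(out, axis=0)

# A toy "training set" of 4 random 8x8 RGB images becomes 16 variants.
rng = np.random.default_rng(0)
small_set = rng.random((4, 8, 8, 3))
print(augment(small_set).shape)  # (16, 8, 8, 3)
```

The same pattern scales to any transform that preserves the label of the underlying sample, which is what makes augmentation a cheap way to stretch limited data.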
High-quality data sets have been published for certain problem domains, but for many visual effects projects the application of machine learning starts with the collection or generation of training data. Sometimes this data will contain imagery or assets that are protected by strict client confidentiality agreements, which prevent any kind of external distribution or usage – perhaps even publication. However, in cases where data is not encumbered by IP issues, it would be enormously beneficial for the industry as a whole if there were a means to share portions of it with the wider research community.
Reliable reproducibility of others' published research is a perennial concern, as is the ability to compare models against known benchmark test data. For example, a broad, diverse data set comprising images rendered at differing quality levels could act as an industry-wide standard for the evaluation of de-noising algorithms. Arguably, some problems in machine learning have enjoyed increased attention and success due to the availability of good data. Public competitions such as those found on Kaggle can also drive interest towards new areas.
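As a sketch of what such a shared benchmark would enable, de-noisers could all be scored with a common metric, such as peak signal-to-noise ratio (PSNR) between a low-sample render and a converged reference. The random arrays below are stand-ins for real renders; the function itself is the standard PSNR definition, not tied to any specific benchmark:

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_value]."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: noise-free
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy stand-ins: a converged "ground truth" render and a noisy low-sample one.
rng = np.random.default_rng(1)
clean = rng.random((64, 64, 3))
noisy = np.clip(clean + rng.normal(0.0, 0.05, clean.shape), 0.0, 1.0)
print(f"PSNR of noisy vs. clean: {psnr(clean, noisy):.1f} dB")
```

With an agreed reference set, a single number like this (or a perceptual metric such as SSIM) would let results from different studios and papers be compared directly.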
We have created a GitHub repository which aims to document some of the available datasets that may be applicable to the visual effects community. Most will require some adaptation to your problem domain, but others may be usable as-is (license permitting). Whilst the list is far from comprehensive, this is an ongoing project which will continue to be updated. Please feel free to submit requests for any datasets that you would like to see included. Also, if you have created any extensions or refinements of these datasets and would like to share them, please let us know.
And, as always, we would love to hear your thoughts on the topic. What datasets would you find most valuable? What data might VFX studios mutually benefit from if they were to make it available to all? And what types of data might studios be able to release safely, without yielding competitive advantage, while still helping to push forward the state of the art? Contribute to the discussion in the comments section below.