In July, I was fortunate to attend the International Computer Vision Summer School (ICVSS) in Sicily. This year’s theme was “From Representation to Action and Interaction”, with keynote speakers from a range of highly rated academic and industrial institutions.
Given the recent explosion in the use of Machine Learning (ML), especially Deep Learning, it was unsurprising that the vast majority of material focused upon the application of these techniques to a wide range of computer vision problems. The received wisdom was that the traditional tasks of object detection, recognition and segmentation are now solved for unambiguous scenarios, such as uncluttered environments. As a result, focus has shifted to developing approaches which enable computers to gain an understanding of objects in the world, with the aim of connecting computer vision to language and action.
From a theoretical perspective, there was significant discussion of the current philosophical and practical shift away from the categorization of ML algorithms into unsupervised, supervised and reinforcement learning. Instead, the trend is towards approaches which consolidate supervised and unsupervised learning, enabling computers to learn via natural and self-supervision, and towards replacing simple incentive-based reinforcement learning with approaches which accomplish tasks via curiosity and learning to predict the future.
It was argued that the use of traditional supervised approaches via manual labelling is not scalable for real world tasks. A wide range of self-supervised approaches were showcased, in which a pretext task is formulated such that it does not require manually labelled data. The act of solving this task results in a learned representation of the data, which can then be applied to common computer vision tasks such as object detection. Examples of pretext tasks included learning to “colorize” an image, solving a jigsaw puzzle, i.e. rearranging randomized image patches to recover the original image, and inpainting.
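To make the idea concrete, here is a minimal sketch of how a jigsaw pretext task might generate training pairs with no manual labelling: the image is cut into a grid of patches, the patches are shuffled, and the permutation itself serves as the free supervisory signal. The function name and grid size are illustrative, not taken from any particular paper or library.

```python
import numpy as np

def make_jigsaw_example(image, grid=3, rng=None):
    """Split an image into a grid x grid set of patches and shuffle them.

    Returns the shuffled patches and the permutation used to shuffle
    them; the permutation acts as the "free" label a network would be
    trained to predict in the pretext task.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    # Cut the image into non-overlapping patches, row by row.
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]
    # Shuffle the patches; the permutation is the supervisory signal.
    perm = rng.permutation(len(patches))
    shuffled = [patches[i] for i in perm]
    return shuffled, perm

# Usage: a dummy 96x96 RGB image split into a 3x3 jigsaw.
image = np.zeros((96, 96, 3), dtype=np.uint8)
patches, label = make_jigsaw_example(image)
```

A network trained to classify which permutation was applied must learn about object parts and their spatial layout, which is exactly the kind of representation that transfers to detection and segmentation.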
Another complementary group of approaches discussed at length was the use of adversarial domain adaptation techniques to facilitate domain transfer, in which a representation or behaviour is learned in a restricted or simulated environment before being propagated to a more expansive real world scenario.
The main practical applications of these state-of-the-art techniques lay within the domains of autonomous vehicles and intelligent virtual assistants.
Significant recent endeavour has been dedicated to addressing two interrelated shortcomings of early ML approaches to autonomous vehicles: the dependence on high-definition mapping of the world to enable vehicle navigation, and a lack of diverse data due to vehicles being restricted to areas which have been appropriately mapped.
Domain transfer between real and simulated data has been successfully deployed to enable autonomous vehicles to operate in new environments. In addition, a wide range of technical advances have enabled vehicles to operate in the absence of highly detailed data, including accurate real time semantic segmentation and object detection, learning road layouts on the fly, and location detection and navigation from traditional 2D maps alone.
Furthering the ability to understand scenes in conjunction with natural language processing was the other central theme of the practical work presented. Examples included automatically generating captions for images and video, providing reasoning when answering questions, and solving visual question answering (VQA) problems. This was particularly prevalent among representatives from Google and Facebook, which one would hypothesise is targeted at expanding the functionality of popular intelligent virtual assistants, such as Google Home.
With many thanks to Adam Hartshorne for his write up. You can contact Adam on A.T.Hartshorne@bath.ac.uk.