Brian Eno’s ‘The Ship’ – A Generative Film



This project is a music video for “The Ship,” released by Brian Eno, a pioneer of the generative music genre. Because Eno constantly questions the process of making and creating state-of-the-art music, we delivered the project as software rather than as a conventional music video; more specifically, it uses artificial intelligence to generate the video. Although AI is currently expected to be complete, accurate, and flawless, we asked whether artificial intelligence can achieve human-like creativity.

At the start of the project, Brian Eno explained the underlying theme of this musical score as “humankind teetering between hubris and paranoia.” Looking back at the 20th century, for instance, WW2 was when people’s optimistic anticipation for technology failed miserably, ending in horrendous technologies that could potentially efface all of humankind. In today’s society, we consider AI to be the perfect example.

The project collected a colossal number of photographs that represent memorable moments in our history and created an AI that understands these events. The AI also constantly reads images from the current news. The distinctive features of each photo are extracted from the combined outputs of many neural networks, and synthesizing these results allows the AI to calculate the overall similarity between images. It presents, in a structured and systematic way, visual associations that a human would never be able to see. This reflects the view that humans recollect a diversity of contexts.

The project is unique in that its output changes constantly based on the continuous input from the current world. Moreover, the AI is designed to (mis)connect the present image to the past. Taking the WW2 reference, the AI may connect current political news to past events such as the Nazi regime. In a way, the artificial intelligence is no longer intelligent but rather demonstrates artificial stupidity.

Technical Detail

The software consists of two systems: 1) a server-side system that searches the photo archive database for the photo most similar to the given input, and 2) a client-side system that composes the retrieved images in the web browser.
On the server side, several convolutional neural networks (CNNs) are used. Through Caffe, a framework for deep learning, we used several trained image recognition models published in the Caffe Model Zoo. These include the VGG model, which classifies images into the 1,000 ImageNet classes (e.g. “Yorkshire terrier,” “airplane,” “umbrella”), and PlaceNet, which recognizes types of location (e.g. “in the woods,” “indoor”). Input images were passed to these image recognition models, and the 4096-dimensional vector taken from the layer just below the top classification layer (the softmax layer) was used as the image’s feature points, that is, its feature vector. Because the results of multiple neural networks are combined, several kinds of observation can be compared at once, and the overall system can present intriguing results that could be understood as a misreading of an image, or even as a link to another context that a human would never associate with it.
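The idea of combining the outputs of several networks into one comparable descriptor can be sketched as follows. This is a minimal illustration, not the project's actual code: the text does not say how the vectors were combined or compared, so the concatenation of L2-normalized vectors, the cosine similarity measure, and the function names are all assumptions, and random numbers stand in for real network activations.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def combine_features(per_model_vectors):
    """Concatenate the L2-normalized vector from each model into one descriptor.

    Assumption: normalizing first keeps one network from dominating the others.
    """
    combined = []
    for vec in per_model_vectors:
        combined.extend(l2_normalize(vec))
    return combined

def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

# Stand-in data: random vectors in place of real 4096-dimensional activations.
rng = random.Random(0)
DIM = 4096

def fake_activation():
    return [rng.random() for _ in range(DIM)]

# One vector per network (for example an object model and a place model).
news_image = combine_features([fake_activation(), fake_activation()])
archive_image = combine_features([fake_activation(), fake_activation()])

print(len(news_image))  # 8192: two 4096-dimensional vectors side by side
print(cosine_similarity(news_image, archive_image))
```

With real activations, a higher score would mean that the two photos look alike to the object model, the place model, or both, which is one way an unexpected association could arise.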
For the archive photos, feature vectors are pre-calculated and stored in a database. News images are collected periodically from the Twitter accounts of news outlets such as BBC, Reuters, and National Geographic. Every time a new news image is captured, t-SNE, a dimensionality reduction algorithm, is used to map the distribution of each photo’s feature points into a lower-dimensional space, and distances are computed within that space. Approximately 30 images whose feature points lie closest to those of the news image are selected as similar photos. t-SNE is then re-applied to the selected images and the similarities are computed once again. The images are then sorted in descending order of their difference in similarity, so that those that differ most come first, and sent to the client side. As a result, an image that resembles yet contrasts with the original news image is juxtaposed side by side with it, generating a video experience in the browser.
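The two-stage selection described above might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the dimensionality reduction step is replaced by a placeholder that returns the points unchanged (a real t-SNE implementation, such as scikit-learn's TSNE, could be passed in its place), Euclidean distance stands in for the similarity measure, and all function names are hypothetical.

```python
import math

def identity_embed(points):
    """Placeholder for t-SNE: returns the points unchanged (an assumption)."""
    return [list(p) for p in points]

def distances_to_query(query, candidates, embed):
    """Embed the query together with the candidates, then measure the
    distance from the query to each candidate in the embedded space."""
    embedded = embed([query] + candidates)
    q = embedded[0]
    return [math.dist(q, p) for p in embedded[1:]]

def select_contrasting(query, archive, k=30, embed=identity_embed):
    """Return archive indices: the k nearest to the query, most different first."""
    # Stage 1: shortlist the k archive photos closest to the news image.
    first = distances_to_query(query, archive, embed)
    shortlist = sorted(range(len(archive)), key=lambda i: first[i])[:k]

    # Stage 2: re-embed only the shortlist, then order it so that the
    # photo that differs most from the news image comes first.
    second = distances_to_query(query, [archive[i] for i in shortlist], embed)
    order = sorted(range(len(shortlist)), key=lambda j: second[j], reverse=True)
    return [shortlist[j] for j in order]

# Toy 2-D points standing in for feature vectors.
query = [0.0, 0.0]
archive = [[1.0, 0.0], [5.0, 5.0], [0.0, 2.0], [0.5, 0.0], [9.0, 9.0]]
print(select_contrasting(query, archive, k=3))  # [2, 0, 3]
```

Sorting the shortlist in reverse is what produces a result that is similar enough to be recognizably related, yet as different as possible within that neighborhood.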
WebGL and shaders are used to render the visuals in the browser. We attempted to create an effect that replicates the experience of slowly zooming into a static image. When building the composition of the overall video and applying effects such as feedback and distortion, the feature points mentioned above are used to determine the area to which each effect is applied, creating an enigmatic impression.
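The slow zoom can be illustrated by the coordinate mapping a shader would typically perform for each pixel. The sketch below is written in Python rather than shader code and is only an assumption about how such an effect might work: the linear zoom schedule, the parameter names, and the choice of zoom center are hypothetical, and the text does not specify how the effect area is derived from the feature points.

```python
def zoom_uv(u, v, t, duration=60.0, max_zoom=1.5, center=(0.5, 0.5)):
    """Map an output coordinate (u, v) in [0, 1] to the source-image
    coordinate to sample at time t (seconds), zooming toward `center`."""
    # Progress runs from 0 to 1 over `duration` seconds and then holds.
    progress = min(max(t / duration, 0.0), 1.0)
    zoom = 1.0 + (max_zoom - 1.0) * progress
    cu, cv = center
    # Dividing the offset from the center by the zoom factor shrinks the
    # sampled region, which magnifies the image on screen.
    return (cu + (u - cu) / zoom, cv + (v - cv) / zoom)

# The top-left corner of the screen samples further into the image over time.
for t in (0.0, 30.0, 60.0):
    print(t, zoom_uv(0.0, 0.0, t))
```

In a fragment shader the same arithmetic would run once per pixel on the texture coordinate, with the elapsed time passed in as a uniform.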