Realtime video compositing in WebGL
This Christmas, at Tool we wanted to create a small interactive experience to share with our friends and clients. Since lately I did experiment with compositing WebGL objets on a video I thought this is a cool technique that we can use.
The idea was simple enough: we would shoot a Christmas tree in a nicely decorated room and composite-in a gift box that the user can interact with while watching the video. All this is rendered with WebGL - the video runs in the background and the 3d interactive content on top, both layers are matched in perspective and movement.
To achieve this effect I had to use quite a lot of different pieces of software. Here's a breakdown of what it took to build it:
First of all, I needed to match the perspective of the camera in the footage and that of the camera in the 3D scene. There is no exact science in doing that, so the best way is to take a frame from the video, use it as background in C4D and try to match manually.
It's a trial and error technique and adjusting the details can be quite a challenge. Fortunately, I found a good book about matchmoving with some very useful tips… like that one about writing down what lens was used during filming (I forgot about that, of course… :)
After matching comes tracking. At first I wanted to do full camera solve, but it turned out to be quite complex and not necessary in this case, so I went with 2D tracking instead. A very good tool to do this is the Mocha AE Plugin - it is easy to use, accurate and fast.
Using 2D tracking means that we only track movement on the XY plane. We also do not account for any rotation of the camera. For handheld shots, where there's only slight camera movement, it is good enough. Of course this would never work for tracking or panning shots - these require full camera solve.
Once the tracking is done and tweaked, Mocha allows to export the tracking data into a text-based format. After this, all I needed was a simple Python script to turn this data into a nicely formated JSON.
Once the 3D model of the gift was in place and all the camera angles were matched in C4D, I exported the whole thing to Unity3D. The main reason for this, is that I wanted to take advantage of the Unity/WebGL exporter to get that to WebGL.
I also used Unity's animation editor to create the movement of the box when the cap flies off and the nutcracker pops out. Do do that, I added some functionality to J3D to support animations. It can be found in the v2 branch of J3D. One of the main changes in the engine I had to make, was to switch from Euler angles to Quaternions.
In order to correctly track video and overlay any elements with precision, I needed to know the frame the video is at at any time during playback. The easiest way is to take the current time and divide it by the duration of a single frame (ex. 1/24 seconds). Unfortunately it is not that simple! If you want an in-depth look why it is so complicated please read the excellent article by Zeh Fernando. Even though his article talks about Flash, same thing applies to HTML5 video.
Long story short, each video used in a tracked shot needs to have the number of the frame encoded into it. The best way to do this: encode the frame number as binary marker somewhere in the video. To see how it looks, try playing one of the videos directly in your browser. See those white boxes on the bottom? This is it!
Adding the binary marker to the video can be painful if done manually. This is where Python comes in. Using a library called PIL (Python Imaging Library) I wrote a simple script that does the following:
- decompose a video clip into a sequence of PNGs (using FFMPEG)
- manipulate each image by adding the binary frame number at the bottom
- encode the frames back into a video optimized for HTML5 (FFMPEG again)
Video encoding tip. One thing to consider when adding the frame number marker is that video playback is optimal if the pixel size of the video is modulo 16 (divisible by 16), for example 1024 x 576. Flash or HTML5 - same rules apply here.
Now, if your video has an optimized pixel dimension it would be a shame to add a few pixels with the binary marker and get a resulting video that is no longer optimized. It's usually better to add the marker over of the video. You will need to sacrifice a few pixels of the footage, but a better playback speed will make up for that.
With the Unity exporter getting the scene to render in WebGL was simple. The one big thing left at this point were custom shaders that would make the gift box blend well with the video. In fact is was the most challenging part od the project! The shader I ended up using on the box has diffuse and specular lighting with a specular map, reflections with a reflection map and a normal map to make the gift wrap look as realistic as possible. Finally, I added a bit of personalization - if you type your name after the # in the URL it will be rendered on the label on the box (example).
Last but not least: our friends at Plan8 added some great interactive sound FX that, as always, add a lot to the final effect.
It was fun to create this. I feel like this technique can definitely have some interesting uses. Of course, nowadays, anything that requires WebGL is treated as experimental, but once the majority of browsers can render that… What do you think?