Aug 23, 2009: Welcome everyone!
 

I'm starting a blog to talk about a hobby project I'm undertaking for my house. The concept and basic implementation are best explained by showing a mock-up in action. Check out the video link below!

Why? OK, now that you've seen the concept video, you may wonder about the why and how. To answer the why question: whenever I go somewhere beautiful I wish I could take that view home with me. I can take a photo, but when I get back home and look at it, I just don't get the same feeling. To make something appear real, I need:

  • Motion - the more frames per second that can be shown, the more realistic movement looks
  • Immersion - the image needs to surround you. At any point in time, we see ~110 degrees around us which a small picture or screen fails to recreate
  • High resolution - The human eye can discern detail down to roughly 0.3 arc minutes per pixel. Low resolution loses realism.
  • Sense of depth - We take depth cues through many means; one important cue for distant objects is motion parallax, which occurs because objects appear to change position more slowly the further away they are from you.
  • Sound - Audio plays an important role in recreating a scene as well as providing some depth cues.
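
To get a feel for what the immersion and resolution requirements imply together, here is a rough sketch of the arithmetic. It assumes the acuity figure is read as about 0.3 arc minutes per pixel and that the full ~110 degree field of view is filled; the function name is made up for illustration.

```python
def pixels_across(fov_degrees, arcmin_per_pixel):
    """Pixels needed to span a field of view at a given angular pixel size."""
    arcminutes = fov_degrees * 60.0  # 60 arc minutes per degree
    return arcminutes / arcmin_per_pixel


# ~110 degrees wide, one pixel per 0.3 arc minutes (both figures are assumptions)
horizontal_pixels = pixels_across(110, 0.3)
print(round(horizontal_pixels))
```

Under those assumptions the answer lands in the tens of thousands of pixels across, which is far beyond a single HD display.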

    Some of my favorite views:


    How? Ultra-high resolution isn't a new idea. Today's newest digital cinema projectors use resolutions up to 4096x2160, while IMAX film can supposedly resolve vertical lines at nearly twice this resolution. Red Digital Cinema sells digital cameras that can film at 4096x2160 (roughly 8.8 megapixels at 24fps), and has announced plans to sell a 28000x4000 camera (112 megapixels) at 30 fps later this year. My goal is to achieve somewhere around 100 megapixels at 60 fps, which is nearly double the pixel throughput of the camera announced by red.com. To accomplish this resolution there are several challenges for recording, editing, and display.

    Filming. At this time, I'm thinking the least expensive option is to buy a bunch of low-end HD video cameras, configure them so that each one films a part of the scene, and then stitch the video frames from each camera together to form a larger frame. 100 megapixels could be accomplished for about $30,000, which is considerably cheaper than an IMAX camera (up to $500k new) or a Red camera ($30k for 1/5th the resolution). As well, prices at the low end of cameras come down significantly faster than at the high end. I'm still looking at cameras but my leading choices right now are Aiptek Action HD GVS and Sanyo VPC-FH1 HD.
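
    As a rough sanity check on the camera count, the sketch below estimates how many 1920x1080 cameras it takes to reach 100 megapixels. The overlap fraction (pixels lost where neighbouring frames overlap for stitching) is a made-up parameter, not a measured value.

```python
import math


def cameras_needed(target_megapixels, cam_width, cam_height, overlap_fraction):
    """Estimate how many cameras are needed to tile a target pixel count.

    overlap_fraction is the share of each frame assumed to be lost to
    overlap with neighbouring frames (hypothetical value for this sketch).
    """
    useful_pixels = cam_width * cam_height * (1.0 - overlap_fraction)
    return math.ceil(target_megapixels * 1e6 / useful_pixels)


no_overlap = cameras_needed(100, 1920, 1080, 0.0)
with_overlap = cameras_needed(100, 1920, 1080, 0.2)
print(no_overlap, with_overlap)
```

    With no overlap this comes to 49 cameras, so a $30,000 budget works out to roughly $600 per camera; allowing 20% overlap pushes the count to 61.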


     

    Aug. 30, 2009 Updating with some related links

  • A True Virtual Window, A thesis by ADRIJAN SILVESTER RADIKOVIC
    A PhD student who describes how he implemented a similar idea. The main differences between his implementation and what I'm targeting are:
    - He used a static image, I'm planning to do video at 60fps
    - I'm planning higher resolutions than he used (100 megapixels versus 67 megapixels)
    (100 megapixels at 60fps may be a pipe dream, but I'd like to try!)
    - He used only one camera for head tracking, sampling at a fairly low framerate; I'm planning to use 6+ sampling at 100fps
    - He performed head tracking using the normal visible color spectrum, I'm planning to use infrared lights and sample images in the infrared spectrum.
    - He used a pan and zoom camera for head tracking, I'm planning on using static cameras.
    - He used only one video display and computer for rendering, I'm planning on using multiple displays and computers
  • Panoramic Video Textures
    A method to create high-resolution looping video is presented using a single panned video camera. This technique is very interesting, but is limited in the types of scenes that can be recorded and replayed.

  • The Virtual Window Project. Nothing fancy, but shows what you can accomplish with a few hundred bucks.

  • Edge blending using commodity projectors

  • Super Hi-Vision, research on a future ultra-HDTV system

  • Immersive Panoramic Video. One of the few places I have seen video stitching discussed
  • Tests of visual acuity to determine the resolution required of a television transmission system
  • The most efficient tile size in tile-based cylinder panoramic video coding and its selection under restriction of bandwidth

     

    Sept. 7, 2009 Results of some initial filming test

    Over Memorial Day weekend, I went to a few spots around San Francisco and filmed some scenery using Sony's HD Handycam (1920x1080). Since I only had one camera, I filmed different parts of a large scene at different times. This will create artifacts at the seams for moving objects but should give a rough feel for how the final results will look. I learned a few things using this borrowed camera:

  • 30 fps looks good most of the time, but definitely noticeable for faster moving objects. 60 would look a bit nicer
  • Interlaced video is unacceptable. You really need progressive frame captures to ensure you don't get ugly tearing artifacts
  • The more motion in the scene, the more interesting it is to watch. I found getting close to the ocean and filming waves looks great compared to far away shots, however too close means depth perception won't match reality when projected onto a 2d surface.
  • My quad-core Xeon doesn't have enough horsepower to decode four 1920x1080 30fps MPEG-2 streams simultaneously; I need to investigate where the bottlenecks are coming from
    Anyway, I updated the video above to include some of my test footage, check it out (jump to 1:30 inside the video to see what I filmed today).

    Sept 29, 2009  Multi-projector displays & Scalable Display

    A month ago, I got the book Practical Multi-Projector Display Design and read it cover-to-cover with great interest.  I highly recommend the book as it was pretty easy to follow compared to the usual thesis papers you get from PhD students.  The basic idea is to calculate geometry and luminance mappings from screen space to projector space by projecting various patterns, taking a photo of the screen with a high resolution camera, and then determining how the original patterns actually got mapped.  This mapping information can be expressed as triangle strips when projecting onto a non-planar surface such as a curved wall.  During rendering, transforming triangle strips can be accomplished with very little overhead.  Luminance and color correction can also be specified at points in the triangle mesh, while the alpha component can be used to blend the edges of two displays together to make the result look seamless.
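
    To illustrate the edge-blending part, here is a minimal sketch of per-column alpha ramps for the region where two projectors overlap. It assumes a simple linear cross-fade and a display gamma of 2.2; both are illustrative choices and not necessarily what the book or any commercial product uses.

```python
def blend_ramps(overlap_px, gamma=2.2):
    """Return (left, right) per-column alpha values across an overlap region.

    The linear-light contributions of the two projectors sum to 1 in every
    column. Each value is pre-compensated for an assumed display gamma so
    the sum still holds after the projector applies its response curve.
    """
    left, right = [], []
    for x in range(overlap_px):
        t = (x + 0.5) / overlap_px  # position across the overlap, 0..1
        left.append((1.0 - t) ** (1.0 / gamma))
        right.append(t ** (1.0 / gamma))
    return left, right


left, right = blend_ramps(8)
# Emitted light is modelled as alpha ** gamma; it should total 1 per column.
total_light = [l ** 2.2 + r ** 2.2 for l, r in zip(left, right)]
print(total_light)
```

    In a real setup these alpha values would be attached to the vertices of the calibration mesh rather than stored per column.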

    The company Scalable Display has implemented many of the techniques outlined in Practical Multi-Projector Display Design, and using their software would be a great way to save time on the project.

    Coincidentally, I was contacted by Scalable Display and recently met with the CEO and the Founder (Andrew and Raj).  They expressed interest in doing some joint development on this project.  I got a demo of their system using 2 projectors, similar to this YouTube demo.  They said they are working on one 100 megapixel project with the Navy, which was pretty exciting to hear.  I'd love to see that system in action.  Scalable Display's largest commercial market is military simulators.

    I have 2 projectors now and am going to play around with edge blending.  For simple 2 projector edge blending, there are some easier solutions including nVidia's PowerWall, Matrox PJ-4OLP, and others.  Many people with 2 projectors just want to watch movies, and VLC has a plugin called "Panoramix" that will also do projector blending.

    On a related note, I recently found this page which lists a few (mostly non-commercial) display projects with high megapixel counts: http://kvmsansv.com/multi-megapixel_displays.html


    Oct 4, 2009  Shooting 4k film using Red Cinema

    To do some initial testing and software development, I need some high resolution footage.  My eventual plan is to create super high resolution video by stitching a lot of individual video sequences together (just like how you create panoramic images).  However, I haven't found any video stitching application out there, which means I'll have to write it myself.  In order to skip that for the short term, I hired clai.tv to come up to San Francisco and film some test 4k footage (4096x2048) using their Red Cinema digital camera.  Without their help, I'd have had a hard time working this camera.  As you can see in the video below, the camera is pretty complex.

    Below is a thumbnail from one frame captured by the camera, click it to see the full resolution image capture.


    Nov 28, 2009: "Hiring" Open Source Developer for Video Stitching project

    If you have experience working on image stitching and are interested in applying/extending this to process video, I'm willing to fund your work.  The project should take a set of movie files and perform the following:

    - Calculate feature descriptors for individual video streams and automatically find overlaps in both space and time
    - Produce a single high resolution output video (jpeg stills are sufficient)
    - Support various projections and mappings
    - Support image, gamma, and color blending
    - Experiment with synthesizing "tween" frames in order to blend videos that are not synchronized at the millisecond level
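
    As one possible starting point for finding the overlap in time, the sketch below estimates the frame offset between two streams by comparing a per-frame summary signal (for example mean brightness) at different lags. The function name and the synthetic data are hypothetical, and real footage would need something more robust than this.

```python
def best_frame_offset(signal_a, signal_b, max_lag):
    """Return the lag (in frames) of signal_b relative to signal_a that
    gives the highest mean product of the mean-removed signals."""
    mean_a = sum(signal_a) / len(signal_a)
    mean_b = sum(signal_b) / len(signal_b)
    a = [v - mean_a for v in signal_a]
    b = [v - mean_b for v in signal_b]

    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Pair a[i] with b[i + lag] wherever both indices are valid.
        pairs = [(a[i], b[i + lag]) for i in range(len(a)) if 0 <= i + lag < len(b)]
        if not pairs:
            continue
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic per-frame brightness: stream B sees the same event 3 frames later.
stream_a = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]
stream_b = [0, 0, 0, 0, 0, 1, 5, 9, 5, 1, 0, 0]
offset = best_frame_offset(stream_a, stream_b, max_lag=5)
print(offset)
```

    This only recovers whole-frame offsets; the sub-frame differences are what the "tween" frame idea is meant to address.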

    Ideally this work would extend an existing open source project like Panotools.

    If you are interested, please contact me at jc@thisdomain.


    Dec 13, 2009: First attempt at head tracking using TrackIR

    I was experimenting with head tracking using TrackIR from Natural Point.  This near infrared camera does a pretty good job for the price.  The main limitation it has is a pretty limited capture volume with one camera, though it appears you can use Natural Point's SDK to get data from multiple cameras, so this could be used to increase the capture volume you are working with.  Natural Point's other software tools, which do multi-camera calibration and multi-camera triangulation, are not available for TrackIR; it appears they try hard to encourage non-gaming consumers to move to the OptiTrack camera systems.  As well, I don't think TrackIR supports a sync signal, so capturing fast moving objects will likely have higher errors (but that's OK for my purposes).

    The results of my test turned out pretty well, here is a video:


    April 24, 2010: Using a depth map to enhance 3d effect for video footage

    I found a depth map can be used to convert 2d video into 3d to provide more realistic views as the user's head position changes.  Below is a demo using hand drawn depth maps.  I'm also investigating automated depth map creation using feature point matching from 2 or more camera views of the same scene.  If that works, a depth map per frame could be generated, which would allow for objects that move over larger z distances.  A multi camera view would also allow for better texture estimation for the area behind occluded parts of the frame.  Even without that, it appears a static hand drawn depth map would provide convincing results for motion that occurs at approximately the same z distance.
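
    To show how a depth map can drive the effect, here is a minimal geometric sketch. It assumes a simple pinhole model where the display is a window plane, each pixel has a depth measured behind that plane, and the viewer moves sideways parallel to it. The function names and numbers are illustrative and not the code used in the demo.

```python
def screen_shift(head_shift, viewer_distance, depth_behind_window):
    """How far a scene point's image moves across the window plane when the
    viewer moves sideways by head_shift. All lengths share one unit."""
    return head_shift * depth_behind_window / (viewer_distance + depth_behind_window)


def shift_row(depth_row, head_shift, viewer_distance, pixels_per_unit):
    """Per-pixel horizontal shift (in pixels) for one row of a depth map."""
    return [
        screen_shift(head_shift, viewer_distance, depth) * pixels_per_unit
        for depth in depth_row
    ]


# Viewer 2 m from the window steps 0.1 m sideways; display has 1000 pixels per metre.
row = shift_row([0.0, 2.0, 1000.0], head_shift=0.1,
                viewer_distance=2.0, pixels_per_unit=1000.0)
print(row)
```

    Points on the window plane stay put, while very distant points slide across the screen almost as far as the viewer moved, which is what keeps them looking far away.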

    July 11, 2010:  Real-time head tracking

    An important part of this project is real-time head tracking and I've been exploring various options.  



    WiiMote.  The first option I tried was the WiiMote system popularized by Johnny Lee's video and also demoed by a project with similar goals called Winscape.  Although it is low-cost, when I tested the WiiMote I immediately found that it falls far short in real life.  Its problems include limited range and sample accuracy.  You need to stand in a pretty small "sweet spot" in order for it to work; moving more than a few steps from this spot will cause it to stop tracking altogether.  Another important issue is the accuracy of tracking: because of the limited resolution of the camera tracking the IR sources (128x96), the XYZ locations that can be calculated are fairly "jittery".  In order to keep the scene from jumping around, you need to smooth the sample points (average them over 5-10 frames), but this introduces a lot of latency.  Further adding to the latency problem, the WiiMote only samples at 40Hz.  If you move your head or body, the screen lags behind by half a second to a second, and this destroys the illusion of the window being real.  In YouTube videos, latency isn't easy to observe and it's easy to control your position to stay in a specific sweet spot.  In short, the WiiMote is only good for YouTube videos. :)  The PS3 Move Controller would be an interesting option to investigate - it's similar to the WiiMote, but samples at a higher rate and resolution.
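
    The smoothing trade-off can be sketched as follows. This assumes a plain trailing moving average, which is only one way to smooth the samples; the delay formula covers the filter alone and not capture, transfer, or rendering time.

```python
def moving_average(samples, window):
    """Average each sample with up to (window - 1) samples before it."""
    out = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        chunk = samples[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def smoothing_delay_ms(window, sample_rate_hz):
    """Delay a trailing moving average adds to steady motion:
    (window - 1) / 2 sample periods, expressed in milliseconds."""
    return (window - 1) / 2.0 / sample_rate_hz * 1000.0


delay_5 = smoothing_delay_ms(5, 40)    # 5-frame average at 40Hz
delay_10 = smoothing_delay_ms(10, 40)  # 10-frame average at 40Hz
print(delay_5, delay_10)
```

    At 40Hz, averaging over 5 to 10 frames adds on the order of 50 to 110 ms by itself, before any other source of lag is counted.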

     

    A NaturalPoint Device

    Natural Point TrackIR.  TrackIR5 operates in a similar fashion to the WiiMote, however its resolution and sample rate are way better.  TrackIR5's camera samples at 640x480 with a frequency of 120Hz; in terms of raw data that is 75 TIMES better than the WiiMote, and with a price tag of $150 it's pretty affordable as well.  The downside with the TrackIR is that it also has limited range: you need to be in a small sweet spot for it to work.  Natural Point provides an SDK to access data from the TrackIR directly and supports multiple TrackIR units on a single computer; it would take a little work, but it's possible to create a system that uses data from multiple TrackIR units to extend the range.  TrackIRs are nice because they are very compact and are both powered by and operated over standard USB cabling.  One other downside for TrackIR is the need to wear IR reflectors.
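
    The "75 times" figure is just the ratio of raw pixel samples per second, using the 128x96 at 40Hz numbers quoted for the WiiMote earlier in this post. A quick check:

```python
def pixel_samples_per_second(width, height, rate_hz):
    """Raw sensor samples per second for a given resolution and frame rate."""
    return width * height * rate_hz


wiimote = pixel_samples_per_second(128, 96, 40)
trackir5 = pixel_samples_per_second(640, 480, 120)
ratio = trackir5 / wiimote
print(ratio)
```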

     

     

    Natural Point OptiTrack

    Of the 3 solutions I've tested so far (Wii, TrackIR, and OptiTrack), I like the OptiTrack the best.  OptiTrack is the commercial-grade version of TrackIR; it is designed up front to cover a large area and support multiple cameras.  Natural Point also provides software that will automatically calibrate cameras and calculate 3d positions from multiple views.  The calibration works by waving a reflector wildly around the room to allow some initial data to be analyzed; from there the position of each camera can be calculated.  Both TrackIR and OptiTrack cameras have built-in hardware point detection, so they can provide your computer with a list of 2d points they see rather than a full 2d image. This reduces the amount of data that needs to be transferred (which helps in lowering latency) and also reduces the amount of CPU consumed on the host PC.  The host PC doesn't need to process a large set of pixels to find IR reflectors, so its workload is pretty light.  Both TrackIR and OptiTrack have a pretty small impact on your CPU, so they could potentially be run on the same computer doing the rendering.  I found there is about a 5ms latency for obtaining samples in the real world, which is pretty good (though not as good as TrackIR).
    OptiTrack costs about $6,000 for 6 cameras and various supporting equipment and cabling.  The main things I don't like about OptiTrack are:

    - You need at least 8 cameras to cover a 10x10 room well.  This leads to a lot of cabling.
    - Cameras positions are flexible, but to get good results they should be above the head - making them a bit of an eyesore if you want the solution to feel like a normal living space.
    - Like all the other solutions, you need to wear at least one IR reflector that the cameras can track.

    ViCon products

    I didn't test these, but they are worth mentioning.  ViCon provides IR tracking cameras similar to Natural Point's, but designed for higher-end needs.  From discussion forums on the net, it looks like ViCon cameras are frequently used by commercial motion capture studios.  ViCon's cameras range from $3,200 to $20,000 per camera, so they are 5-30 TIMES more expensive than those from Natural Point, however they can achieve some pretty impressive stats.  ViCon cameras can have a 2-3ms latency per frame and sample at up to 200Hz.  Their high-end camera has a resolution of 16 megapixels (4096x4096), pretty impressive!  They say you can track objects in very large areas (football fields) with this camera.

    Robotic Pan/Tilt/Zoom (PTZ) Video with face tracking

    This is the area I'm currently exploring.  The concept here is to use 2 or more cameras that can be programmatically controlled for pan, tilt, and zoom to follow a subject, and to use face tracking to determine the position of a user's head and eyes.  Determining the position of the user's eyes from two or more cameras should allow for the calculation of accurate 3d positions.  This was the solution I originally had in mind, but I put it off because it's also the hardest to get working.  The advantages of this approach are many:

    - Markerless tracking.  The subject doesn't have to wear any reflectors.
    - Large coverage area with small number of cameras.  The ability to pan/tilt/zoom allows you to use resolution where you need it rather than trying to cover the entire room all the time. 
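
    The step of turning two camera views into a 3d position can be sketched with the midpoint method: each camera contributes a ray from its position towards the detected eye, and the estimate is the point halfway between the two rays where they pass closest. This assumes the camera positions and ray directions are already known from calibration; the function name and the numbers are made up for illustration.

```python
def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Return the point midway between two 3d rays at their closest approach."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w0 = tuple(p - q for p, q in zip(origin1, origin2))
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w0), dot(dir2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")

    s = (b * e - c * d) / denom  # distance parameter along ray 1
    t = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = tuple(o + s * k for o, k in zip(origin1, dir1))
    p2 = tuple(o + t * k for o, k in zip(origin2, dir2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Two cameras 2 units apart, both looking at a head at (0.5, 0.2, 3.0).
head = (0.5, 0.2, 3.0)
cam1, cam2 = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)
ray1 = tuple(h - c for h, c in zip(head, cam1))
ray2 = tuple(h - c for h, c in zip(head, cam2))
estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(estimate)
```

    With noisy detections the two rays will not intersect exactly, which is why the midpoint is used rather than a true intersection.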

    Some of the challenges include:

    - To cover a room you need to track the subject as he/she moves by performing pan/tilt/zoom operations on the camera.  There is an ideal resolution and position you'd like to keep the subject's face at, so that future movements will stay on camera and there is enough resolution to determine the face position accurately.

    - The bandwidth and CPU required to perform face tracking is pretty heavy.  Face processing needs to be done on the PC, so the entire image from the camera needs to be transferred into PC memory.  For a 720p HD black & white video signal at 60fps, this means 55MB of data needs to be transferred and processed every second.  For a 3.2GHz processor, this means you have ~58 cycles per pixel (about 53 million cycles per frame) to perform transfer and processing (more if you can split across cores).  One key to fast face processing is to reduce the amount of image data you need to process by keeping the face size as small as needed and then intelligently skipping areas of the image where the face is unlikely to be.  The Sony EVI HD1 camera supports video at 1080p @60Hz, but this would result in 124MB of data transferred and processed per second, which leaves ~26 cycles per pixel.  This is probably doable, but there isn't much CPU left over for 1) multiple cameras and 2) the rendering subsystem.

    - I haven't seen any PTZ cameras that support frame rates higher than 60fps.  I'm guessing latency will end up at 10-15ms, but it could be higher if the capture and transfer of images takes additional time.  Right now, I'm trying to determine which PC capture cards can grab 720p or 1080p with minimal latency.

    - To accurately calculate 3d points from multiple 2d cameras, you need to know exactly where each of the cameras is located and where it is looking.  If you use pan/tilt/zoom operations, you need to recalibrate these settings after every move.  Doing this in real time could be tricky, but should be doable.  My planned approach is to use background feature points from previous frames to determine precisely where the camera moved to.
     

    - Face tracking itself is tricky.  There are 3 potential options that I plan to explore: CAMSHIFT, FaceAPI, and PittPatt.

    • CAMSHIFT is an algorithm provided as part of the OpenCV computer vision library.  I'm not expecting great results from this, but it's worth giving a shot.

    • FaceAPI has a "free for non-commercial use" license and charges $4k per developer plus royalties for commercial applications.  I think it's smart of FaceAPI to provide a free version for non-commercial use; it helps get interest from the crowd that would otherwise use and contribute to OpenCV.  I have done some basic tests with FaceAPI and found it can handle a 640x480@30fps video feed with approximately 12% CPU load on my 3.2GHz quad-core.  Further testing is needed to see how it performs at higher resolutions and frame rates.

    • PittPatt charges $5k per developer per year with royalties for commercial products, and has a 30-day free trial available.  I haven't tested it yet, but have high hopes.  I'm concerned it is not as fast as FaceAPI and won't be able to handle high resolutions and frame rates.  A representative I talked to on the phone mentioned a top speed for one CPU core of 320x240 at 40fps.  Scaling up to 8 cores wouldn't be fast enough for 1080p@60.
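
    To make the processing budget concrete, here is the arithmetic behind the bandwidth and per-core figures above. It assumes one byte per pixel (black & white), a single 3.2GHz core, and that face tracking throughput scales linearly with pixel rate, which is optimistic.

```python
def cycles_per_pixel(clock_hz, width, height, fps):
    """CPU cycles available per pixel on one core at a given video format."""
    return clock_hz / (width * height * fps)


def cores_needed(target_w, target_h, target_fps, core_w, core_h, core_fps):
    """Cores required if throughput scaled linearly with pixel rate."""
    return (target_w * target_h * target_fps) / (core_w * core_h * core_fps)


cpp_720p = cycles_per_pixel(3.2e9, 1280, 720, 60)
cpp_1080p = cycles_per_pixel(3.2e9, 1920, 1080, 60)
# Quoted top speed for one core: 320x240 at 40fps; target is 1080p at 60fps.
cores_for_1080p60 = cores_needed(1920, 1080, 60, 320, 240, 40)
print(round(cpp_720p), round(cpp_1080p), cores_for_1080p60)
```

    The per-pixel budget comes out in the tens of cycles, and the quoted single-core speed would need about 40 cores to keep up with 1080p@60 under this linear-scaling assumption.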

    A few PTZ cameras currently under consideration:

    Model          Max Rez  Fps    Video out                                              List Price  Width  Length  Height  Tilt (deg)  Pan (deg)
    sony evi HD1   1080p    59.94  HD: HD-SDI, Analog Component (Y/Pb/Pr); SD: VBS, Y/C   $4,028      10.24  6       6.75    50          200
    sony evi HD1   720p     59.94  DVI-I (Digital and Analog)                             $3,499      9.8    5.98    5.31    50          200
    sony BRCZ330   720p     59.94  BRBK-HD2: HD-SDI                                       $5,100      6.375  7.375   7.625   60          350

     


     


     

    More to come.... Got ideas / suggestions? Email me: jc@thisdomain.com
