Tuesday, April 10, 2012

Color-Image Segmentation

One of the biggest hurdles I'm encountering in my senior design project right now is separating the person from everything else. The paper I read today discusses that very problem in excruciating detail. Let me tell you, it's not as easy as I thought, but if my current idea fails, I might just fall back on some of the concepts this paper discusses. The paper approaches separating objects from each other as a classification and clustering problem. In other words, it takes the values of the pixels and groups them together based on their similarities.

By itself, looking at color similarity can be very flawed. The same color can appear in several different parts of the picture. Also, under different lighting, the same color can show up looking like other colors, so you get different partitioning results depending on the lighting. To address this problem, the researchers in the paper look not only at the RGB color values, but also at the location of each value within the picture. The assumption is that if similar colors exist in close proximity, they likely belong to the same object. You also have to consider noise. With a single pixel being so tiny, it's not uncommon for one pixel to be an abnormal color compared to its neighbors. To guard against this, they find "core" pixels. They do this simply by looking at the neighboring pixels and checking whether a minimum number of them are similar enough to the chosen one. You then take those core pixels and cluster them based on both proximity and color value, and determine what is and isn't the same object.
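
If I wanted to prototype that core-pixel check myself, a first pass might look something like this Python sketch. The 3x3 window, the Euclidean RGB distance, and the thresholds are my own guesses, not values from the paper:

```python
import numpy as np

def find_core_pixels(image, color_thresh=20.0, min_similar=5):
    """Mark 'core' pixels: pixels whose 3x3 neighborhood contains at least
    min_similar neighbors within color_thresh (Euclidean RGB distance)."""
    h, w, _ = image.shape
    img = image.astype(np.float64)
    core = np.zeros((h, w), dtype=bool)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = img[y, x]
            # Compare the center pixel against its 3x3 neighborhood.
            patch = img[y - 1:y + 2, x - 1:x + 2].reshape(-1, 3)
            dists = np.linalg.norm(patch - center, axis=1)
            similar = np.sum(dists < color_thresh) - 1  # exclude the center itself
            core[y, x] = similar >= min_similar
    return core
```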

The paper also includes "fuzzy mode detection." I have to be honest with you: I do not know what the paper is talking about during this part. What is a "mode"? If I could answer that question, maybe I'd be able to follow this section of the paper better. Sorry, guys.

Source:
Losson, O., Botte-Lecocq, C., & Macaire, L. (2008). Fuzzy mode enhancement and detection for color image segmentation. Journal on Image and Video Processing - Color in Image and Video Processing, 1-19.
http://dl.acm.org/citation.cfm?id=1362851.1453693&coll=DL&dl=ACM&CFID=96602743&CFTOKEN=40588068

Thursday, April 5, 2012

Artificial High-Resolution

Recently in my digital photography class I learned about HDR pictures: specifically, how to combine pictures with different exposure values to create a picture similar to what your eye sees. I admit, I never thought the same ideas would apply in a research paper I read for my Computer Science senior design class. Instead of combining pictures to get all the detail from multiple exposure values, the paper discusses getting a high-resolution picture from a collection of low-resolution pictures. The idea is pretty much the same, though. Since each picture will have different pixels with good information, you take the good information from each picture and add it to the final product. This seems like a fairly intuitive approach to the topic. However, the technique is not without its obstacles. You see, in photography you don't always get the exact same picture when you press the button a second time. Something in the scene might change. For example, your angle to the object might be ever so slightly different. The background might change or move, especially if there are creatures in the background. A direct merge of these pictures would result in a very messy final picture. Not exactly the "super resolution" you're looking for.

The paper is actually about addressing this obstacle. The authors approach the problem understanding there may be subtle differences between the pictures, and bring in the idea of error. They fit a curve based on all the pictures and then assign error weights based on how far the value of a pixel is from the curve: the farther the pixel value is, the smaller its weight. The weights are based on an outlier threshold determined when the curve is created. These weights allow the final picture to partially ignore, or even exclude, irrelevant information. The result of this method is a crisp picture that excludes the extra data, including extra objects that may appear in only one of the contributing pictures.
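
Just to get my head around the weighting idea, here's a rough Python sketch of how I imagine it. I'm standing in a per-pixel median for their fitted curve, and the outlier threshold is made up, so this is only the flavor of the method, not the paper's actual math:

```python
import numpy as np

def robust_fuse(frames, outlier_thresh=30.0):
    """Fuse a stack of registered frames (N, H, W) into one image, down-weighting
    pixels that disagree with the consensus. The per-pixel median plays the role
    of the paper's fitted curve in this toy version."""
    frames = frames.astype(np.float64)
    consensus = np.median(frames, axis=0)          # rough "curve" value per pixel
    residual = np.abs(frames - consensus)          # how far each observation is from it
    weights = np.clip(1.0 - residual / outlier_thresh, 0.0, 1.0)  # farther -> smaller weight
    weights += 1e-8                                # avoid dividing by zero everywhere
    fused = np.sum(weights * frames, axis=0) / np.sum(weights, axis=0)
    return fused
```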

Source:

El-Yamany, N. A., & Papamichalis, P. E. (2008). Robust color image superresolution. Journal on Image and Video Processing - Color in Image and Video Processing, 1-12.

Thursday, March 22, 2012

Kinect making 3D video

I discovered a paper that artificially makes a 3D video using the Kinect sensor to record the video. The proposed algorithm is a preprocessing stage. The problem with using raw depth data from the Kinect for the depth element of the video is that the depth map is relative and full of holes. The depth data is recorded based on reflected infrared light originally coming from the sensor. To compensate, the article proposes using the RGB frames to help clean up the depth data. The proposed algorithm has five steps to create an accurate depth map for the 3D video. The first step creates a series of motion estimations, using the frames before the current frame and estimating the motion vectors of the frames after it. The second step creates a confidence metric for the motion vectors of the future frames in order to assess their quality. The third step uses the motion vectors on future frames for "motion compensation," in order to get better accuracy for the depth of the frames. The fourth step performs basic depth map filtering. The final step fills any holes with data from neighboring pixels.
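
The last step is the one I can picture most easily, so here's a toy Python version of hole filling. The window size is my own guess, and this ignores all the motion-compensation machinery from the earlier steps:

```python
import numpy as np

def fill_depth_holes(depth, window=2):
    """Fill zero-valued 'holes' in a Kinect depth map with the median of the
    valid (non-zero) depths in a small surrounding window."""
    filled = depth.copy().astype(np.float64)
    h, w = depth.shape
    holes = np.argwhere(depth == 0)
    for y, x in holes:
        y0, y1 = max(0, y - window), min(h, y + window + 1)
        x0, x1 = max(0, x - window), min(w, x + window + 1)
        patch = depth[y0:y1, x0:x1]
        valid = patch[patch > 0]
        if valid.size > 0:
            filled[y, x] = np.median(valid)   # borrow depth from nearby valid pixels
    return filled
```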

The result of this algorithm is a video conversion rate of 1.4 frames per second. Keep in mind this is the processing rate, not the viewing rate. The algorithm fixes problems with the original depth map, and it also makes the depth map smoother and more stable.

- Kao Pyro of the Azure Flame

Source:
Matyunin, S., Vatolin, D., & Berdnikov, Y. (2011). Temporal filtering for depth maps generated by Kinect depth camera. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 1-4.
http://ieeexplore.ieee.org/search/srchabstract.jsp?tp=&arnumber=5877202&openedRefinements%3D*%26filter%3DAND%28NOT%284283010803%29%29%26searchField%3DSearch+All%26queryText%3DKinect

Thursday, March 8, 2012

Finding the Face

Today, the paper I read is less about using the Kinect and more about processing an image. A big part of my project is about creating a partial skeleton with the Kinect. In order to do that, I need good anchor points to place the joints at. In this case, I've decided the head would be really reliable. The paper is about an efficient algorithm to detect where the face is. The paper mentions several ways to go about detecting a face. The three ways mentioned were knowledge based, image based, and feature based. The paper proposes taking the feature-based approach. Feature based means finding specific features in the image common to faces, such as skin color, face shape, the eyes, and the nose. The paper goes over two of these features.

The first feature it approaches is finding the skin color. However, a major problem with skin color is differing tones. To solve this issue, they take the image and convert it to a different color space. They use the YCbCr color space, because it makes a large distinction between skin and non-skin, and it treats many different skin tones and colors the same way, making the algorithm accurate for a variety of people. After the conversion, they draw a bounding box around the "skin" pixels, which in essence is the face.
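
Out of curiosity I sketched what that might look like with OpenCV. The Cb/Cr ranges below are common rule-of-thumb values I've seen elsewhere, not the thresholds from the paper:

```python
import cv2
import numpy as np

def skin_bounding_box(bgr_image):
    """Convert to YCrCb, threshold the chroma channels to find 'skin' pixels,
    and return a bounding box around them (or None if nothing matches)."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    # OpenCV orders the channels Y, Cr, Cb; these ranges are rough guesses.
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())  # x0, y0, x1, y1
```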

The second feature they cover is the eyes. They take the bounding box they found with the previous feature as a starting point. Then, assuming the eyes will be in the upper half of the box, they cut out the bottom half to reduce the search area. Finally, they use a technique called the Hough transform, which identifies specified geometric shapes, in this case treating the eyes as ovals. The Hough transform takes many calculations, and can be a problem in programs that require more immediate results.
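
Here's roughly how I'd try that in OpenCV. OpenCV only ships a circle version of the Hough transform, so circles stand in for the paper's ovals, the input is assumed to be an 8-bit grayscale image, and every parameter here is a guess:

```python
import cv2

def find_eyes(gray_image, face_box):
    """Search the upper half of the face bounding box for roughly circular
    shapes with a Hough circle transform."""
    x0, y0, x1, y1 = face_box
    upper_half = gray_image[y0:(y0 + y1) // 2, x0:x1]
    upper_half = cv2.medianBlur(upper_half, 5)       # smooth noise before the Hough step
    circles = cv2.HoughCircles(upper_half, cv2.HOUGH_GRADIENT, dp=1, minDist=15,
                               param1=80, param2=20, minRadius=3, maxRadius=15)
    if circles is None:
        return []
    # Shift circle centers back into full-image coordinates.
    return [(int(cx) + x0, int(cy) + y0, int(r)) for cx, cy, r in circles[0]]
```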

- Kao Pyro of the Azure Flame

Source:
Choudhar, M. V., Devi, M. S., & Bajaj, P. (2011). Face and facial feature detection. Proceedings of the International Conference & Workshop on Emerging Trends in Technology, 686-689.
http://dl.acm.org/citation.cfm?id=1980022.1980169&coll=DL&dl=ACM&CFID=69641785&CFTOKEN=95949759

Thursday, February 23, 2012

The Digital Smile

I'm not going to lie. The paper I read was short. So short, in fact, I nearly passed it up until I started thinking about it. Although it was short, it still informed me. It gave me facts and ideas that I'd only be able to attribute to this paper. For that reason I'm writing about it. The paper explains a demo that used the Kinect to map facial gestures to an onscreen avatar. That's right. When you smile, your character smiles. When you look like you utterly want to destroy the enemy who stole all your online glory... well, the same thing happens... kind of. The paper explains that with their framework, you don't have to manage lighting or put intrusive sensors on the subject. However, because the signal from the Kinect has a lot of "noise," you can't map your face 1-to-1 to your character's face either. To solve this problem they use a technique very similar to the normal gesture recognition techniques used with the Kinect. They have a pre-loaded database of facial animations, and use a sort of splicing between what the camera sees and those animations. The whole process can be summed up this way: the Kinect determines which preset expression your current expression matches, and then activates that expression's animation on the character model.
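
If I were to fake that matching step myself, the simplest thing I can think of is a nearest-neighbor lookup like the Python sketch below. The feature vectors and the expression database are invented for the sake of the example; the paper's actual matching is surely fancier:

```python
import numpy as np

def match_expression(live_features, expression_db):
    """Pick the preset expression whose stored feature vector is closest to the
    live (noisy) face features; the game would then play that preset animation.
    expression_db maps a name like 'smile' to a reference feature vector."""
    best_name, best_dist = None, float("inf")
    for name, reference in expression_db.items():
        dist = np.linalg.norm(np.asarray(live_features) - np.asarray(reference))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```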

I was actually surprised I didn't think of this before. After all, all of the papers I have read said as much: pre-load the database with preset gestures, then do machine learning to match the input with the closest preset. Using this same idea, putting emote detection into our project really isn't that hard. Depending on the availability of time and resources, we could likely add emoticon support within a couple of weeks.

-Kao Pyro of the Azure Flame

Source:
Weise, T., et al. (2011). Kinect-based facial animation. SA '11 SIGGRAPH Asia 2011 Emerging Technologies.
http://dl.acm.org/citation.cfm?id=2073370.2073371&coll=DL&dl=ACM

Thursday, February 9, 2012

Connect with the Kinect

For the last couple of blog posts, I've concentrated on the advanced research topic of tracking the hands in some way. I feel I should return to the basics of the Kinect, and the original intent for the device: putting you in the game. The paper I read discusses a framework for mapping a person's body motions to those of an avatar. But it's not just one-to-one. It's an enhanced mapping. For example, if you jump, your avatar could jump over an entire building. So the challenge in building the framework becomes evident: how do you give these enhanced actions to the player while still making the player feel like they are "in the game," or are still in perfect control?

To even begin to tackle this problem, you must define gestures ahead of time to try and match input to. It's kind of the same process as asking, did the player press the "X" button? After you have your pre-defined gestures, you take the input gestures of the player and apply machine-learning (classification) algorithms to decide which gesture the input is closest to. There is one snag. Taking too much time between getting input and displaying the result can make the player feel out of sync with their avatar. To combat this, the paper suggests determining the gesture before the player has completed it. For example, if the player jumps, the algorithm decides the player has done the "jump" gesture before the player reaches the peak of his jump.
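
Here's a toy Python version of how I picture that early guessing working: compare what you've seen so far against the same-length prefix of each stored gesture, and commit once one is clearly closest. The template format and every threshold are my own inventions, not the paper's method:

```python
import numpy as np

def early_classify(partial_input, templates, min_frames=10, margin=1.2):
    """Guess the gesture before it finishes. partial_input is a list of per-frame
    joint-position vectors; templates maps a gesture name to a full recorded
    sequence. Returns a gesture name, or None if it's too early to commit."""
    n = len(partial_input)
    if n < min_frames:
        return None
    partial = np.asarray(partial_input)
    scores = {}
    for name, template in templates.items():
        prefix = np.asarray(template)[:n]
        if len(prefix) < n:            # template shorter than the input seen so far
            continue
        scores[name] = np.mean(np.linalg.norm(partial - prefix, axis=1))
    if len(scores) < 2:
        return min(scores, key=scores.get) if scores else None
    ranked = sorted(scores, key=scores.get)
    best, second = ranked[0], ranked[1]
    if scores[second] >= margin * scores[best]:
        return best                    # confident enough to trigger the animation early
    return None
```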

Another step in implementing exaggerated actions is how you display and work the output on the screen. In order to show exaggerated output, you must first have predefined animations for the specific action. So when the player jumps, the avatar will jump, but the angle of the legs may be completely different since the jump is scripted. Taking this into account, there must also be a seamless transition between the non-scripted one-to-one motion before the action and the scripted animation activated by the gesture. While the paper doesn't go into exactly how this seamless transition occurs, I theorize it can be easily done by interpolating frames (using math to estimate in-between limb positions) between the last non-scripted frame and the first scripted frame.
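
Since the interpolation part is my own theory anyway, here's the kind of thing I have in mind, in Python. Each pose is just an array of joint positions:

```python
import numpy as np

def blend_frames(last_tracked_pose, first_scripted_pose, steps=10):
    """Generate in-between skeleton poses by linearly interpolating each joint
    position from the last one-to-one tracked frame to the first frame of the
    scripted animation, so the hand-off doesn't pop."""
    a = np.asarray(last_tracked_pose, dtype=np.float64)
    b = np.asarray(first_scripted_pose, dtype=np.float64)
    return [a + (b - a) * (i / (steps + 1)) for i in range(1, steps + 1)]
```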

- Kao Pyro of the Azure Flame

Source:  
Bleiweiss, A., Eshar, D., Kutliroff, G., Lerner, A., Oshrat, Y., & Yanai, Y. (2010). Enhanced interactive gaming by blending full-body tracking and gesture animation. SA '10 ACM SIGGRAPH ASIA 2010 Sketches.
http://dl.acm.org/citation.cfm?id=1899950.1899984&coll=DL&dl=GUIDE 

Tuesday, February 7, 2012

The Universal Touch

That's right. It's another post about Kinect research. Today's topic? Turning any monitor into a touch screen. Because the Kinect has the ability to determine Z-axis locations, instead of just X and Y, it theoretically has the ability to determine when an intersection with another object has been made. Apply this to any monitor and you suddenly have a makeshift touch screen. However, the problem of only one camera, combined with the camera's low resolution, still creates many challenges for an accurate implementation. Other considerations are identifying the screen from the rest of the environment, as well as obtaining accurate touch location detection with the finger itself.

To start off, the researchers decided to filter out all unnecessary information. That means they have the camera record the depths of all objects, then remove everything that is behind the screen. This way, if there are any sudden changes to the background, they won't be registered by the system, but rather ignored. During this phase, though, reflective monitors can give an inaccurate depth reading, so it is suggested that you cover the monitor with a non-reflective material such as a piece of paper. After all the filtering of non-essential objects occurs, there is a calibration phase with the user's fingers, to ensure there is no preexisting touch offset. When that is finished, touch detection is simply determining when the depth of the finger equals the depth of the screen.
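
Putting the filtering and the touch check together, I imagine something like this Python sketch. It assumes bigger depth values mean farther from the camera, and the tolerance is a number I made up rather than anything from the paper:

```python
import numpy as np

def detect_touch(depth_map, screen_depth, finger_xy, tolerance=10):
    """Ignore everything behind the screen plane, then call it a touch when the
    fingertip's depth is within `tolerance` depth units of the calibrated
    screen depth. screen_depth would come from the calibration phase."""
    filtered = depth_map.copy().astype(np.float64)
    filtered[filtered > screen_depth] = 0          # drop anything behind the screen
    x, y = finger_xy
    finger_depth = filtered[y, x]
    return finger_depth > 0 and abs(finger_depth - screen_depth) <= tolerance
```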

Unfortunately, this system has many limitations. To get an accurate reading, the finger must be parallel to the vertical edge of the monitor. Also, the finger can't be at too much of a Z/Y axis angle, otherwise the hand will begin to cover up the finger in front of the camera. These are going to be the two main issues for any motion tracking system that only uses one camera.

The thing I took out of this as useful for my project would be the filtering technique. Rather than trying to work around a changing background environment, it'd be better to just have the system ignore anything past a certain depth. That way there is less information clutter to sort through, creating a clearer picture of the motions the system is trying to read.

- Kao Pyro of the Azure Flame

Source:

Dippon, A., & Klinker, G. (2011). KinectTouch: accuracy test for a very low-cost 2.5D multitouch tracking system. ITS '11 Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces, 49-52.
http://dl.acm.org/citation.cfm?id=2076354.2076363&coll=DL&dl=ACM

Thursday, February 2, 2012

The Problem with Fingers

The Kinect has really been a revolutionary tool for both developers and independent researchers. However, that is not to say that the Kinect is not limited. In fact, the Kinect is very limited by its low resolution. Tracking body and arm gestures has been no sweat, but when it comes to hand and finger tracking, the difficulty just seems to ramp up. This is because the hand is a significantly smaller object; viewing the hand through the Kinect ends with a very noisy result. It has been the goal of researchers for some time to get accurate hand tracking through inexpensive devices such as the Kinect. Part of the problem is that different lighting can cause different readings. Other issues include a noisy or busy background, so the challenge becomes how to differentiate between the background and the hand you are trying to read.

The researchers at Nanyang Technological University suggest using a technique called EMD (Earth Mover's Distance). EMD is basically a measure of the difference between two probability distributions. The dumbed-down idea of EMD is to find the pattern the reading is most like and lump it into that category. This is actually a very similar technique to Information Retrieval's classification problem. You would use an algorithm such as K-Nearest Neighbors to determine which pattern (or in this case, gesture) is closest to the given input, rather than trying to rely on an exact match. It never occurred to me that IR might be helpful in my Kinect project, but hopefully I'll be able to keep my eyes open.
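
To convince myself the IR connection makes sense, here's a little Python sketch of k-nearest-neighbor matching using a 1-D Earth Mover's Distance from SciPy. This is only in the spirit of the paper's finger-based EMD, not their actual gesture signature, and the histogram representation is something I made up for the example:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def classify_gesture(query_hist, gesture_db, bins=None, k=3):
    """Label a query histogram (e.g. of finger positions) by a k-NN vote over
    EMD distances to stored examples. gesture_db maps a label to a list of
    histograms; all histograms must be non-negative and non-empty."""
    if bins is None:
        bins = np.arange(len(query_hist))
    scored = []
    for label, histograms in gesture_db.items():
        for hist in histograms:
            d = wasserstein_distance(bins, bins, query_hist, hist)
            scored.append((d, label))
    scored.sort(key=lambda pair: pair[0])
    top_labels = [label for _, label in scored[:k]]
    # Majority vote among the k closest training examples.
    return max(set(top_labels), key=top_labels.count)
```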

Kao Pyro of the Azure Flame

Source:
Ren, Z., Yuan, J., & Zhang, Z. (2011). Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. MM '11 Proceedings of the 19th ACM International Conference on Multimedia, 1093-1096.