Thursday, February 23, 2012

The Digital Smile

I'm not going to lie. The paper I read was short. So short, in fact, that I nearly passed it up until I started thinking about it. Although it was short, it still informed me. It gave me facts and ideas that I'd only be able to attribute to this paper, and for that reason I'm writing about it. The paper describes a demo that uses the Kinect to map facial gestures to an onscreen avatar. That's right. When you smile, your character smiles. When you look like you utterly want to destroy the enemy who stole all your online glory... well, the same thing happens... kind of. The paper explains that with their framework, you don't have to manage lighting or put intrusive sensors on the subject. However, because the signal from the Kinect has a lot of "noise," you can't map your face 1-to-1 onto your character's face either. To solve this problem, they use a technique very similar to the normal gesture recognition techniques used with the Kinect. They have a pre-loaded database of facial animations and use a sort of splicing between what the camera sees and those animations. The whole process can be summed up this way: the Kinect determines which preset expression your current expression matches, and then activates that expression's animation on the character model.
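To make the idea concrete, here's a rough Python sketch of that matching step. This is not the authors' actual method (their system does far more sophisticated tracking and blending); it's just a minimal nearest-expression matcher, with made-up feature vectors and a hypothetical avatar.play_animation() call standing in for whatever the animation system provides.

import numpy as np

# Hypothetical preset expression database: each label maps to a feature
# vector describing that expression (e.g. distances between tracked
# facial landmarks). The values here are invented for illustration.
EXPRESSION_PRESETS = {
    "neutral": np.array([0.0, 0.0, 0.0]),
    "smile":   np.array([0.8, 0.1, 0.0]),
    "frown":   np.array([-0.6, 0.0, 0.4]),
}

def classify_expression(features):
    """Return the preset expression closest to the noisy input features."""
    best_label, best_dist = None, float("inf")
    for label, preset in EXPRESSION_PRESETS.items():
        dist = np.linalg.norm(features - preset)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def update_avatar(features, avatar):
    # Instead of mapping the noisy input 1-to-1 onto the face mesh,
    # trigger the pre-authored animation for the matched expression.
    # avatar.play_animation() is a hypothetical engine call.
    avatar.play_animation(classify_expression(features))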

I was actually surprised I didn't think of this before. After all, all of the papers I have read said as much: pre-load a database with preset gestures, then use machine learning to match the input with the closest preset. Using this same idea, adding emote detection to our project really isn't that hard. Depending on the availability of time and resources, we could likely add emoticon support within a couple of weeks.

-Kao Pyro of the Azure Flame

Source:
Weise, Thibaut, et al. "Kinect-based facial animation." SA '11 SIGGRAPH Asia 2011 Emerging Technologies (2011).
http://dl.acm.org/citation.cfm?id=2073370.2073371&coll=DL&dl=ACM

Thursday, February 9, 2012

Connect with the Kinect

For the last couple of blog posts, I've concentrated on advanced research topics that track the hands in some way. I feel I should return to the basis of the Kinect and the original intent for the device: putting you in the game. The paper I read discusses a framework for mapping a person's body motions to those of an avatar. But it's not just one-to-one. It's an enhanced mapping. For example, if you jump, your avatar could jump over an entire building. So the challenge in building the framework becomes evident: how do you give these enhanced actions to the player while still making the player feel like they are "in the game," or still in perfect control?

To even begin to tackle this problem, you must define gestures ahead of time to match the input against, much like checking whether the player pressed the "X" button. After you have your predefined gestures, you take the player's input and apply machine-learning (classification) algorithms to decide which gesture the input is closest to. There is one snag: taking too much time between getting input and displaying the result can make the player feel out of sync with their avatar. To combat this, the paper suggests determining the gesture before the player has completed it. For example, if the player jumps, the algorithm decides the player has done the "jump" gesture before the player reaches the peak of the jump.
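Here's a quick Python sketch of what that early classification might look like. It is not the paper's actual algorithm, just a toy nearest-template matcher over a partial joint trajectory, with template values I invented for illustration.

import numpy as np

# Hypothetical gesture templates: each is a short sequence of a tracked
# joint's vertical position over time (values invented for illustration).
GESTURE_TEMPLATES = {
    "jump":   np.array([0.0, 0.15, 0.35, 0.55, 0.70]),
    "crouch": np.array([0.0, -0.10, -0.25, -0.35, -0.40]),
}

def classify_partial(observed):
    """Classify a gesture from only the frames seen so far, so the
    response can start before the player finishes the motion."""
    n = len(observed)
    best, best_dist = None, float("inf")
    for name, template in GESTURE_TEMPLATES.items():
        # Compare against only the first n frames of each template.
        dist = np.linalg.norm(observed - template[:n])
        if dist < best_dist:
            best, best_dist = name, dist
    return best

# After just three frames of rising hip height, "jump" already wins,
# well before the player reaches the peak of the jump.
print(classify_partial(np.array([0.0, 0.14, 0.33])))  # -> "jump"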

Another step in implementing exaggerated actions is how you display the output on the screen. In order to show exaggerated output, you must first have predefined animations for each specific action. So when the player jumps, the avatar will jump, but the angle of the legs may be completely different since the jump is scripted. Taking this into account, there must also be a seamless transition between the non-scripted one-to-one motion before the action and the scripted animation activated by the gesture. While the paper doesn't go into exactly how this seamless transition occurs, I theorize it can easily be done by interpolating (using math to estimate in-between limb positions) frames between the last non-scripted frame and the first scripted frame.
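Here's roughly what I mean, as a small Python sketch. This is my own guess, not something from the paper: just linearly interpolate the joint positions between the last live-tracked pose and the first frame of the scripted animation.

import numpy as np

def blend_poses(live_pose, scripted_pose, n_frames=10):
    """Generate in-between frames from the last live-tracked pose to the
    first scripted frame, to hide the seam between the two.
    live_pose and scripted_pose are arrays of shape (num_joints, 3)."""
    frames = []
    for i in range(1, n_frames + 1):
        t = i / n_frames                      # blend factor from 0 to 1
        frames.append((1 - t) * live_pose + t * scripted_pose)
    return frames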

- Kao Pyro of the Azure Flame

Source:  
Bleiweiss, A., Eshar, D., Kutliroff, G., Lerner, A., Oshrat, Y., & Yanai, Y. (2010). Enhanced interactive gaming by blending full-body tracking and gesture animation. SA '10 ACM SIGGRAPH ASIA 2010 Sketches.
http://dl.acm.org/citation.cfm?id=1899950.1899984&coll=DL&dl=GUIDE 

Tuesday, February 7, 2012

The Universal Touch

That's right. It's another post about Kinect research. Today's topic? Turning any monitor into a touch screen. Because the Kinect has the ability to determine Z-axis locations instead of just X and Y, it theoretically has the ability to determine when an intersection with another object has been made. Apply this to any monitor and you suddenly have a makeshift touch screen. However, the combination of having only one camera and that camera's low resolution still creates many challenges for an accurate implementation. Other considerations are identifying the screen within the rest of the environment, as well as obtaining an accurate touch location from the finger itself.

To start off, the researchers decided to filter out all unnecessary information. That means they have the camera record the depths of all objects, then remove everything that is behind the screen. This way, if there are any sudden changes to the background, they won't be registered by the system but rather ignored. During this phase, though, reflective monitors can give an inaccurate depth reading, so it is suggested that you cover the monitor with a non-reflective material such as a piece of paper. After all the filtering of non-essential objects occurs, there is a calibration phase with the user's fingers to ensure there is no preexisting touch offset. When that is finished, detecting a touch is simply a matter of determining when the depth of the finger equals the depth of the screen.
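Here's a rough Python sketch of how I picture that pipeline: filter out anything behind the screen, then flag pixels whose depth matches the calibrated screen depth. The function name and the tolerance value are my own assumptions, not details from the paper.

import numpy as np

def detect_touches(depth_frame, screen_depth, background_depth, tolerance=0.01):
    """Return a boolean mask of pixels where a fingertip is touching the screen.

    depth_frame:      2D array of per-pixel depths from the Kinect (meters)
    screen_depth:     calibrated depth of the monitor surface
    background_depth: anything at or beyond this depth is filtered out
    tolerance:        how close (in meters) counts as "touching"
    """
    # Step 1: throw away everything behind the screen so background
    # changes never register with the system.
    foreground = depth_frame < background_depth

    # Step 2: a touch is a foreground pixel whose depth matches the
    # calibrated screen depth within the tolerance.
    touching = np.abs(depth_frame - screen_depth) < tolerance
    return foreground & touching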

Unfortunately, this system has many limitations. To get an accurate reading, the finger must be parallel to the vertical edge of the monitor. Also, the finger can't be at too steep a Z/Y-axis angle, otherwise the hand will begin to cover up the finger from the camera's point of view. These are going to be the two main issues for any motion tracking system that uses only one camera.

The thing I took from this that would be useful for my project is the filtering technique. Rather than trying to work around a changing background environment, it'd be better to just have the system ignore anything beyond a certain depth. That way there is less information clutter to sort through, creating a clearer picture of the motions being read.

- Kao Pyro of the Azure Flame

Source:

Dippon, A., & Klinker, G. (2011). KinectTouch: accuracy test for a very low-cost 2.5D multitouch tracking system. ITS '11 Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces, 49-52.
http://dl.acm.org/citation.cfm?id=2076354.2076363&coll=DL&dl=ACM

Thursday, February 2, 2012

The Problem with Fingers

The Kinect has really been a revolutionary tool for both developers and independent researchers. However, that is not to say that the Kinect is not limited. In fact, the Kinect is very limited by its low resolution. Tracking body and arm gestures has been no sweat, but when it comes to hand and finger tracking, the difficulty just seems to ramp up. This is because the hand is a significantly smaller object, so viewing the hand through the Kinect gives a very noisy result. It has been the goal of researchers for some time to get accurate hand tracking through inexpensive devices such as the Kinect. Part of the problem is that different lighting can cause different readings. Other issues include a noisy or busy background, so the challenge becomes how to differentiate between the background and the hand you are trying to read.

The researchers at Nanyang Technological University suggest using a technique based on EMD (Earth Mover's Distance). EMD is basically a measure of the difference between two probability distributions. The dumbed-down idea is to find the stored pattern the reading is most like and lump it into that category. This is actually very similar to Information Retrieval's classification problem: you would use an algorithm such as K-Nearest Neighbors to determine which pattern (or in this case, gesture) is closest to the given input, rather than relying on an exact match. It never occurred to me that IR might be helpful in my Kinect project, but hopefully I'll be able to keep my eyes open.
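To get a feel for it, here's a toy Python sketch: a 1-D version of EMD (the paper itself uses a more elaborate Finger-Earth Mover's Distance over hand contours) plugged into a simple nearest-neighbor classifier. The template histograms are invented purely for illustration.

import numpy as np

def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two 1-D histograms (normalized to
    sum to 1). In 1-D this reduces to the area between the two CDFs."""
    cdf_a = np.cumsum(hist_a / hist_a.sum())
    cdf_b = np.cumsum(hist_b / hist_b.sum())
    return np.abs(cdf_a - cdf_b).sum()

def classify_gesture(input_hist, templates):
    """Pick the stored gesture whose histogram is closest to the input,
    rather than requiring an exact match."""
    return min(templates, key=lambda name: emd_1d(input_hist, templates[name]))

# Hypothetical templates: histograms of some hand-shape feature
# (e.g. contour distances), values invented for illustration.
templates = {
    "open_hand": np.array([1.0, 4.0, 6.0, 4.0, 1.0]),
    "fist":      np.array([6.0, 4.0, 2.0, 1.0, 1.0]),
}
print(classify_gesture(np.array([1.0, 3.5, 6.5, 4.0, 1.0]), templates))  # -> "open_hand"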

Kao Pyro of the Azure Flame

Source:
Ren, Z., Yuan, J., & Zhang, Z. (2011). Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. MM '11 Proceedings of the 19th ACM international conference on Multimedia, 1093-1096.