Flocking around with the Kinect

Today I write about projecting objects around you using the Kinect and Microsoft's official Software Development Kit. My day job, quite randomly, gave me the opportunity to play around with the Kinect for a few days. During this time, I discovered a number of interesting things that I want to share with you.

If there is anything you take away from this article, let it be this: do not use the official toolkit. The open source version of the Kinect SDK is much more powerful and complete. When I tested the official toolkit, only a few days after its release, I discovered that most -- if not all -- of the interesting functionality beyond the basics (like skeleton detection) was not implemented. Another thing that bothered me was the lack of documentation. But I guess that is to be expected from a new product.

To start off I'll tell you a little bit more about what the Kinect is, and what it is capable of. Then, I present a cool little demo that we're going to try and implement. The rest of the article is a walkthrough of the interesting bits and pieces.

Brand spanking new

Nintendo was the first to bring motion detection to gaming consoles, in the winter of 2006. Their Wiimote was an instant hit with the public. Accompanied by a very affordable console, it was the envy of the other big console makers: Sony and Microsoft. After Nintendo had enjoyed four long years of free rein as king of this particular hill, Sony and Microsoft released their answers to Nintendo's amazing peripheral.

Sony introduced the PlayStation Move in 2010. It is basically nothing more than sticks with coloured balls of light on top that allow the PS3 to locate and track your movements.

The Kinect (released in 2010) is the latest and (perhaps) greatest motion detection console accessory, introduced by Microsoft. Despite Microsoft's late entrance into this market, they introduced an accessory that has some great things going for it.

The Kinect has two cameras of considerable quality. Besides the cameras, there is a mechanism inside this wonderful device that is able to project a steady grid of points into the room the Kinect is set up in.

Together, the two cameras are able to record a three-dimensional image: not only colour components, but also a depth buffer. The depth buffer results from combining the normal picture the cameras receive with an interpretation of the projected laser grid.

Internally, the Kinect packs some serious processing power. Not only is the device able to determine a depth field, it is also able to track two people (or skeletons) at the same time. By analysing the imagery that it records, it is able to extrapolate the skeletal information.

All this information can be read from the device through a set of functions Microsoft has graciously released into the wild through its new Kinect SDK.

An interesting idea

Because I didn't have all the time in the world (a max of three days) I had to come up with something that would show off the possibilities of the SDK in an adequate and fun way.

The core functionality of the device that I wanted to show off was:

reading skeleton information;
interpreting events in the scene;
augmenting the camera feed with additional objects.

After some thought, I decided to create a small demo that would be able to measure the size of the person being recorded; interpret that information; and overlay a swarm of bees onto the screen. Interaction with the scene would be provided once the person in front of the Kinect raises his hand. This would attract the bees to swarm around the hand that was lifted.

The rest of this article describes the methods I used to bring this little project to a satisfying end.

Measuring skeleton length

The demo that we're making is going to calculate the height of the person in front of the camera and translate that height from meters to pixels. This information is then used to put a swarm of bees on the screen that is sized similarly to their real-world counterparts, simply by providing their length in meters.

Assumptions

To start out, we need to discover and analyse the environment the Kinect is pointed at. Let's scribble down some knowledge we have about our surroundings and the Kinect:

The camera is always the center of the scene. This assumption makes a number of our calculations easier. So, let's just say that the camera is indeed, the center of our universe.
Skeleton detection information is provided as a vector. Vectors are known to have both direction and length. The cool thing about the information the Kinect returns is that it comes in real-world measurements -- the distance you can discern from a vector is measured in meters!
The colour buffer's dimensions deviate slightly from those of the depth buffer image. So, when we try to translate something from skeletal information, which is based on depth buffer information, to a position on the screen, it will not map 1:1. This means we need to take into account a small deviation constant.

There actually are Kinect API calls that allow you to translate from colour buffer to depth buffer coordinates and vice versa. However, these were not yet made available for public use. Later, I discovered they were quite readily accessible in the unofficial Kinect SDK.

Triangles all over the place

To determine the height of the person in front of the camera we're going to do some very basic math on the points that we can get information on. An assumption we're making is that both feet are always positioned on the floor, at equal height.

The Kinect provides us with the following skeletal rays:

left foot position (L_ray);
right foot position (R_ray); and
position of the head (H_ray).

These rays are cast from the point of view of the camera. Now, you may not know this, but the Kinect also has a mechanism that allows the camera to automatically focus on elements that are in the room. Unfortunately, with the state the SDK was in, I was unable to retrieve this information from the Kinect. In the future, this information needs to be incorporated in the calculations below.

L_pos = cast L_ray from Camera

R_pos = cast R_ray from Camera

H_pos = cast H_ray from Camera

Now, you might think this would be a hassle to calculate, but in reality we can leave this step behind. Remember our camera being the center of the universe? This means that a ray cast from (0, 0, 0), with its direction and its length, always ends at the ray's own value. So in actuality:

L_pos = L_ray

R_pos = R_ray

H_pos = H_ray

Then, we create an imaginary triangle, running from the left foot (L), to the point right in between the legs (M), and up to the head (H). By calculating the length of the vertical side m, from M up to the head, we know (exactly) how tall the person in front of the camera is.

          H
         /|
        / |
    k  /  |  m
      /   |
     /____|
    L   l  M

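l = distance(L_pos, R_pos) * 0.5

k = distance(L_pos, H_pos)

Because the triangle has a right angle at M, Pythagoras gives us the remaining side:

m = sqrt(k^2 - l^2)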
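The triangle calculation can be sketched in a few lines of Python. This is only an illustration under the assumptions made above (feet at equal height, head straight above the midpoint M, so the triangle has a right angle at M); the function name `person_height` and the sample joint positions are made up for this example and are not part of any Kinect SDK.

```python
import math


def person_height(left_foot, right_foot, head):
    """Estimate a person's height in meters from three joint positions.

    Each argument is an (x, y, z) position in meters, with the camera
    at the origin. Assumes a right angle at M, the midpoint between
    the feet.
    """
    l = math.dist(left_foot, right_foot) * 0.5  # half the distance between the feet
    k = math.dist(left_foot, head)              # left foot to head (hypotenuse)
    # Pythagoras: m^2 = k^2 - l^2. Clamp at zero so sensor noise
    # can never make us take the square root of a negative number.
    return math.sqrt(max(k * k - l * l, 0.0))


# Hypothetical joints: feet 0.4 m apart, head 1.8 m above the floor,
# everything 2 m in front of the camera.
left_foot = (-0.2, -0.9, 2.0)
right_foot = (0.2, -0.9, 2.0)
head = (0.0, 0.9, 2.0)

height = person_height(left_foot, right_foot, head)
print(f"Estimated height: {height:.2f} m")
```

Note that `math.dist` requires Python 3.8 or newer.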

Now we know the person's height in meters. We then translate this to a unit of measure that is understandable to the computer: pixels! The joint information we read from the Kinect does not solely contain the ray that was cast to detect the joint, but also its position on the colour buffer (x, y). Now that we know the height in meters, it's as simple as subtracting the head's y-position from the left foot's, and dividing that by our height in meters. This tells us how many pixels fit into one meter. Neat, right!?

Pixels_Per_Meter = (L_y - H_y) / m
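To show how this scale gets used, here is a small Python sketch that turns a real-world bee length into a size on screen. The function names, the pixel coordinates, and the 15 mm bee are all assumptions for the sake of the example. Screen y-coordinates are assumed to grow downwards, which is why the foot's y-value is the larger one.

```python
def pixels_per_meter(left_foot_y, head_y, height_m):
    """Number of colour-buffer pixels that cover one real-world meter.

    left_foot_y and head_y are y-positions on the colour buffer;
    height_m is the measured height of the person in meters.
    """
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return (left_foot_y - head_y) / height_m


def size_in_pixels(length_m, scale):
    """Convert a real-world length in meters to pixels."""
    return length_m * scale


# Hypothetical measurement: the head is at y=80, the left foot at y=440,
# and the person turned out to be 1.8 m tall.
scale = pixels_per_meter(left_foot_y=440, head_y=80, height_m=1.8)

# A bee of roughly 15 mm, drawn at the same depth as the person.
bee_pixels = size_in_pixels(0.015, scale)
print(f"{scale:.1f} px per meter, bee is {bee_pixels:.1f} px long")
```

The scale only holds at the depth the person is standing at; objects closer to or further from the camera would need their own scale.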

After all this, we know quite a few things about the scene the bee swarm is going to be flying in. Displaying things on, or around the scene will be much easier!

During the development of my little demo it was surprising how accurate the Kinect actually was. Taking into account a scaling factor because of the camera's lens distortion, the measurement for every person that I had stand in front of it was really close to his/her actual height. So if this demo doesn't work out, I can always sell the code as the world's most uncomfortable and most expensive measuring tape!

Flocking behaviours

Now that we are able to find the skeleton in space, and know exactly how big the person in front of the camera is -- and therefore know at what size and position to project our cute little swarm -- it is time to delve deeper into the behaviour of our swarm. We will be implementing a simple flocking mechanism that is commonplace and documented very well around the web. I won't be explaining it from top to bottom, but the general concept follows. If you want to know how to implement it, check out the code that goes with this article.

Three rules of flight

When I think about swarm behaviours, I always imagine a flock of birds, flying through the sky in intricate patterns that seem almost too beautiful to be real. Thankfully, smart people have disproven any such magic and narrowed it down to three rules that birds use to fly the way they do. Turns out, the same rules apply to swarms of other types, such as bees and fish.

The three rules are as follows:

cohesion;
separation; and
alignment.

Cohesion

It's important for our entire swarm to stay together and have the same general purpose. We can't be having one bee fly in one direction, while the other is flying in almost the opposite direction. This is where cohesion comes into play. It's similar to separation but on the scale of the entire swarm.

Separation

To make sure that the bees are able to stay in flight without any unfortunate collisions into the Queen Bee's main quarters, it's important they steer clear of each other. This is called 'separation'. For each bee, this separation is based on its closest neighbours.

In the code below I've made it possible for each bee to have a little bit more attractive power than other bees. This made it possible for me to create a "leader". The bee that I wanted to lead the rest of the swarm, by giving him random coordinates to fly to, has a lot more attraction power. This causes the rest of the swarm to follow him!

Pseudo-code:

Total_Attraction = The sum of all bees' attraction scalars

Average = [0, 0, 0]

foreach Neighbour as Bee:

   Average += Bee.direction * (Bee.attraction / Total_Attraction)

Average *= 1 / Number_of_Neighbours

NewDirection = Average - CurrentPosition

Normalize NewDirection

CurrentDirection = CurrentDirection + Weight * NewDirection
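Below is a hedged Python sketch of the weighted steering step from the pseudo-code. It makes two assumptions that are mine, not the article's: it averages the neighbours' positions (the pseudo-code subtracts CurrentPosition from the average, which only makes sense for positions), and it divides each attraction by the neighbours' own total, so the weights already sum to one and no extra division by the neighbour count is needed. All names are hypothetical. Note that this step pulls a bee towards its most attractive neighbours; a push away from bees that get too close would be a separate term.

```python
import math


def normalize(v):
    """Return v scaled to length 1, or a zero vector if v has no length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        return (0.0, 0.0, 0.0)
    return tuple(c / length for c in v)


def steer_towards_neighbours(position, direction, neighbours, weight):
    """Nudge a bee's direction towards its attraction-weighted neighbours.

    neighbours is a list of (position, attraction) pairs. A bee with a
    high attraction value pulls harder, which is how a "leader" emerges.
    """
    total_attraction = sum(attraction for _, attraction in neighbours)
    if total_attraction <= 0:
        return direction  # nobody to follow

    # Attraction-weighted average of the neighbours' positions.
    centre = [0.0, 0.0, 0.0]
    for neighbour_position, attraction in neighbours:
        share = attraction / total_attraction
        for axis in range(3):
            centre[axis] += neighbour_position[axis] * share

    # Unit vector pointing from this bee towards that weighted centre.
    towards = normalize(tuple(c - p for c, p in zip(centre, position)))
    return tuple(d + weight * t for d, t in zip(direction, towards))


# A leader (attraction 9) on the right, an ordinary bee (attraction 1) on the left.
neighbours = [((4.0, 0.0, 0.0), 9.0), ((-4.0, 0.0, 0.0), 1.0)]
new_direction = steer_towards_neighbours(
    position=(0.0, 0.0, 0.0),
    direction=(0.0, 0.0, 1.0),
    neighbours=neighbours,
    weight=0.5,
)
print(new_direction)
```

In the example the bee keeps flying forward but bends towards the leader's side, because the leader's attraction outweighs the other neighbour's.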

Alignment

Having set up a rule for not killing your neighbour in mid-flight by trying to occupy the same physical space is very necessary. But it doesn't mean that our bees don't like to be cozy. They're a swarm after all!

To keep everyone in the same area, there is the rule of 'alignment'. What it essentially entails is making sure a bee flies in the same general direction as its neighbouring bees.

Pseudo-code:

Average = [0, 0, 0]

foreach Neighbour as Bee:

   Average += Bee.direction

Average *= (1 / Number_of_Neighbours)

Normalize(Average)

Add weighted average to Bee's direction vector

While coding my demo, I noticed that by just implementing the rules for separation and alignment, the swarm of bees had a near life-like behavioural pattern. So I opted to not implement the cohesion rule -- it's all about the illusion after all!
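For reference, the alignment rule from the pseudo-code above can be sketched in Python like this. The function name `align` and the sample values are hypothetical, and directions are plain (x, y, z) tuples.

```python
import math


def align(direction, neighbour_directions, weight):
    """Blend a bee's direction with the average direction of its neighbours."""
    count = len(neighbour_directions)
    if count == 0:
        return direction  # no neighbours, nothing to align with

    # Average the neighbours' direction vectors, axis by axis.
    average = tuple(
        sum(d[axis] for d in neighbour_directions) / count for axis in range(3)
    )

    # Normalize so that only the heading matters, not the neighbours' speed.
    length = math.sqrt(sum(c * c for c in average))
    if length == 0:
        return direction  # the neighbours cancel each other out
    average = tuple(c / length for c in average)

    # Add the weighted average to the bee's own direction vector.
    return tuple(d + weight * a for d, a in zip(direction, average))


# A bee flying along x, surrounded by two neighbours flying along y.
aligned = align(
    direction=(1.0, 0.0, 0.0),
    neighbour_directions=[(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
    weight=0.25,
)
print(aligned)
```

A small weight makes the bees turn gradually, which tends to look more natural than snapping straight to the neighbours' heading.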

Conclusion

You still with me? Cool.

We have gone through quite some stuff: Kinect basics, creating the world's most expensive measuring tape, and a simple algorithm for flocking behaviours. All that's left is to bring it together. To be honest, that's the boring stuff (you know it's true), so I'll leave that as an exercise for you.

My implementation can be seen in action on YouTube via the link below. Thanks go to my dear colleague Bart for recording it and Kevin for providing an unintended Kinect stress test.