Flocking around with the Kinect
by Marnix
last published on 22 November 2021
Today I'm writing about projecting objects around you using the Kinect and Microsoft's official Software Development Kit. My day job, quite randomly, gave me the opportunity to play around with the Kinect for a few days. During this time, I discovered a number of interesting things that I want to share with you.
If there is anything you take away from this article, let it be this: do not use the official toolkit. The open source version of the Kinect SDK is much more powerful and complete. At the time I was testing the official toolkit, only a few days after its release, most -- if not all -- of the interesting functionality beyond the basics (like skeleton detection) was simply not implemented. Another thing that bothered me was the lack of documentation. But I guess that is to be expected from a new product.
To start off, I'll tell you a little bit more about what the Kinect is and what it is capable of. Then I'll present a cool little demo that we're going to try and implement. The rest of the article is a walkthrough of the interesting bits and pieces.
Brand spanking new
Nintendo was the first to bring motion detection to gaming consoles in the winter of 2006. Their Wiimote was an instant hit with the public. Accompanied by a very affordable console, it was the envy of the other big console makers: Sony and Microsoft. After Nintendo enjoyed four long years of free rein as king of this particular hill, Sony and Microsoft released their answers to its amazing peripheral.
Sony introduced the PlayStation Move in 2010: basically nothing more than a stick with a coloured light ball on top that allows the PS3 to locate and track your movements.
The Kinect (also released in 2010) is the latest and (perhaps) greatest motion detection console accessory, introduced by Microsoft. Despite Microsoft's late entrance into this market, the company introduced an accessory that has some great things going for it.
The Kinect has two cameras of considerable quality. Besides the cameras, there is a mechanism inside this wonderful device that is able to project a steady grid of points into the room the Kinect is set up in.
Together, the two cameras are able to record a three-dimensional image: not just colour components, but also a depth buffer. The depth buffer is the result of combining the normal picture received from the cameras with an interpretation of the projected grid.
Internally, the Kinect has some serious processing power. Not only is the device able to determine a depth field, it is also able to track two people (or skeletons) at the same time. By analysing the imagery that it records, it is able to extrapolate the skeletal information.
All this information can be read from the device through a set of functions Microsoft has graciously released into the wild through its new Kinect SDK.
An interesting idea
Because I didn't have all the time in the world (a maximum of three days), I had to come up with something that would show off the possibilities of the SDK in an adequate and fun way.
Some of the core features of the device that I wanted to show off were:
- reading skeleton information;
- interpreting events in the scene;
- augmenting the camera feed with additional objects.
After some thought, I decided to create a small demo that would measure the size of the person being recorded, interpret that information, and overlay a swarm of bees onto the screen. Interaction with the scene would come from the person in front of the Kinect raising a hand, which attracts the bees to swarm around the lifted hand.
The rest of this article describes the methods I used to bring this little project to a satisfying end.
Measuring skeleton height
The demo that we're making is going to calculate the height of the person in front of the camera and translate that height from meters to pixels. This information is then used to put a swarm of bees on the screen, sized similarly to their real-world counterparts, simply by providing their length in meters.
Assumptions
To start out, we need to discover and analyse the environment the Kinect is pointed at. Let's scribble down some knowledge we have about our surroundings and the Kinect:
- The camera is always the center of the scene. This assumption makes a number of our calculations easier. So, let's just say that the camera is indeed the center of our universe.
- Skeleton detection information is provided as a vector. Vectors are known to have both direction and length. The cool thing about the information the Kinect returns is that it comes in real-world measurements -- the length you can derive from such a vector is measured in meters!
- The colour buffer's dimensions deviate slightly from the depth buffer's. So, when we try to translate something from skeletal information, which is based on the depth buffer, to a position on the screen, it will not map 1:1. This means we need to take into account a small deviation constant (a sketch of this follows the list).
- There actually are Kinect API calls that allow you to translate from colour buffer to depth buffer coordinates and vice versa. However, these were not yet available for public use. Later, I discovered they were readily accessible in the unofficial Kinect SDK.
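To make that deviation a bit more concrete, here is a tiny Python sketch of the kind of correction I mean. The two constants are made-up placeholder values for illustration only; they are not numbers taken from the SDK or from my demo.

# Example correction between depth-buffer and colour-buffer coordinates.
# DEVIATION_X and DEVIATION_Y are placeholder values, not measured constants.
DEVIATION_X = 12   # pixels
DEVIATION_Y = -8   # pixels

def depth_to_colour(x_depth, y_depth):
    # Depth- and colour-buffer coordinates don't map 1:1, so apply a small
    # constant correction before drawing on top of the camera feed.
    return x_depth + DEVIATION_X, y_depth + DEVIATION_Y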
Triangles all over the place
To determine the height of the person in front of the camera, we're going to do some very basic math on the points that we can get information on. An assumption we're making is that the feet are always positioned on the floor, at an equal height.
The Kinect provides us with the following skeletal rays:
- left foot position (L_ray);
- right foot position (R_ray); and
- position of the head (H_ray).
These rays are cast from the point of view of the camera. Now, you may not know this, but the Kinect also has a mechanism that allows the camera to automatically focus on elements that are in the room. Unfortunately, with the state the SDK was in, I was unable to retrieve this information from the Kinect. In the future, this information needs to be incorporated in the calculations below.
L_pos = cast L_ray from Camera
R_pos = cast R_ray from Camera
H_pos = cast H_ray from Camera
Now, you might think this would be a hassle to calculate, but in reality we can leave this step behind. Remember our camera being the center of the universe? This means that a vector cast from (0, 0, 0) always results in the ray's own value. So in actuality:
L_pos = L_ray
R_pos = R_ray
H_pos = H_ray
Then, we create an imaginary triangle, running from the left foot, to a point right in between the legs (M), and up to the head. By calculating the length of the middle, vertical side we know (exactly) how tall the person in front of the camera is.
      H
     /|
  k / |
   /  | m
  /   |
 /____|
L     M
   l

l = distance(L_pos, R_pos) * 0.5
k = distance(L_pos, H_pos)
m = sqrt(k^2 - l^2)
Now we know the person's height in meters. We then translate this to a unit of measure that is understandable to the computer: pixels! The joint information we read from the Kinect contains not only the ray that was cast to detect the joint, but also its position on the colour buffer (x, y). Since we know the height in meters, it's as simple as subtracting the head's y-position from the left foot's, and dividing that by our height in meters. Neat, right!?
Pixels_Per_Meter = (L_y - H_y) / m
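Putting the triangle and the pixel conversion together, here is a minimal Python sketch of the whole measurement, assuming we already have the three joints as positions in meters plus their y-coordinates on the colour buffer. The Joint container, its field names and the example bee length are placeholders of my own, not the SDK's types.

import math
from dataclasses import dataclass

# Hypothetical container for what we read per joint: a position in meters
# (camera at the origin) and its y-coordinate on the colour buffer.
@dataclass
class Joint:
    pos: tuple[float, float, float]
    screen_y: int

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def measure(left_foot: Joint, right_foot: Joint, head: Joint):
    # The imaginary right triangle: l is half the distance between the feet,
    # k the hypotenuse from the left foot to the head, m the person's height.
    l = distance(left_foot.pos, right_foot.pos) * 0.5
    k = distance(left_foot.pos, head.pos)
    m = math.sqrt(k ** 2 - l ** 2)

    # Height in pixels divided by height in meters gives the scale factor.
    pixels_per_meter = (left_foot.screen_y - head.screen_y) / m
    return m, pixels_per_meter

# Sizing a bee sprite: a bee of roughly 0.015 m, drawn at the matching scale.
# height_m, ppm = measure(left_foot, right_foot, head)
# bee_sprite_height_px = 0.015 * ppm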
After all this, we know quite a few things about the scene the bee swarm is going to be flying in. Displaying things on or around the scene will be much easier!
During the development of my little demo, it was surprising how accurate the Kinect actually was. Taking into account a scaling factor because of the camera's lens distortion, every person that I had stand in front of it measured really close to his/her actual height. So if this demo doesn't work out, I can always sell the code as the world's most uncomfortable and most expensive measuring tape!
Flocking behaviours
Now that we are able to find the skeleton in space and know exactly how big the person in front of the camera is -- and therefore at what size and position to project our cute little swarm -- it is time to delve deeper into the behaviour of our swarm. We will be implementing a simple flocking mechanism that is commonplace and documented very well around the web. I won't be explaining it from top to bottom, but the general concept follows. If you want to know how to implement it, check out the code that goes with this article.
Three rules of flight
When people talk about swarm behaviours, I always imagine a flock of birds flying through the sky in intricate patterns that seem almost too beautiful to be real. Thankfully, smart people have disproven any such magic and narrowed it down to three rules that birds use to fly the way they do. It turns out the same rules apply to swarms of other types, such as bees and fish.
The three rules are as follows:
- cohesion;
- separation; and
- alignment.
Cohesion
It's important for our entire swarm to stay together and have the same general purpose. We can't have one bee flying in one direction while another flies in almost the opposite direction. This is where cohesion comes into play. It's similar to separation, but on the scale of the entire swarm.
Separation
To make sure that the bees are able to stay in flight without any unfortunate collisions en route to the Queen Bee's main quarters, it's important they steer clear of each other; this is called 'separation'. The separation is based on each bee's closest neighbours.
In the code below, I've made it possible for each bee to have a little more attractive power than the other bees. This made it possible for me to create a 'leader': the bee that I wanted to lead the rest of the swarm, which I steer by giving it random coordinates to fly to, has a lot more attraction power. This causes the rest of the swarm to follow it!
Pseudo-code:
Total_Attraction = The sum of all bees' attraction scalars
Average = [0, 0, 0]
foreach Neighbour as Bee:
    Average += Bee.position * (Bee.attraction / Total_Attraction)
Average *= 1 / Number_of_Neighbours
NewDirection = Average - CurrentPosition
Normalize(NewDirection)
CurrentDirection = CurrentDirection + Weight * NewDirection
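For reference, here is the same idea as a small Python sketch. The Bee class, the weight parameter and the neighbour selection are placeholders of my own, and unlike the pseudo-code above I normalize the attraction weights over the neighbours only; treat it as a sketch of the weighted attraction idea, not as the demo's actual code.

import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]

def add(a: Vec3, b: Vec3) -> Vec3: return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def sub(a: Vec3, b: Vec3) -> Vec3: return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def scale(a: Vec3, s: float) -> Vec3: return (a[0] * s, a[1] * s, a[2] * s)

def normalize(a: Vec3) -> Vec3:
    length = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    return scale(a, 1.0 / length) if length > 0 else a

@dataclass
class Bee:
    position: Vec3
    direction: Vec3          # unit vector the bee is currently flying in
    attraction: float = 1.0  # the designated leader gets a much larger value

def attraction_rule(bee: Bee, neighbours: list[Bee], weight: float = 0.1) -> Vec3:
    # Steer toward the attraction-weighted average position of the neighbours,
    # so everyone drifts toward the high-attraction leader. Assumes at least
    # one neighbour.
    total = sum(other.attraction for other in neighbours)
    average = (0.0, 0.0, 0.0)
    for other in neighbours:
        average = add(average, scale(other.position, other.attraction / total))
    new_direction = normalize(sub(average, bee.position))
    return normalize(add(bee.direction, scale(new_direction, weight)))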
Alignment
Having set up a rule against killing your neighbour in mid-flight by trying to occupy the same physical space is very necessary. But that doesn't mean our bees don't like to be cozy; they're a swarm, after all!
To keep everyone in the same area, there is the rule of 'alignment'. What it essentially entails is making sure a bee flies in the same general direction as its neighbouring bees.
Pseudo-code:
Average = [0, 0, 0]
foreach Neighbour as Bee:
    Average += Bee.direction
Average *= (1 / Number_of_Neighbours)
Normalize(Average)
Add weighted average to Bee's direction vector
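Reusing the Bee class and the small vector helpers from the attraction sketch above, the alignment rule can be sketched like this; the weight value and the per-frame update in the comments are again assumptions of my own, not the demo's actual code.

def alignment_rule(bee: Bee, neighbours: list[Bee], weight: float = 0.1) -> Vec3:
    # Blend the bee's own direction with the average direction of its
    # neighbours so the group ends up flying roughly the same way.
    average = (0.0, 0.0, 0.0)
    for other in neighbours:
        average = add(average, other.direction)
    average = normalize(scale(average, 1.0 / len(neighbours)))
    return normalize(add(bee.direction, scale(average, weight)))

# Per frame, for each bee (neighbours picked by distance, e.g. the closest few):
#   bee.direction = attraction_rule(bee, neighbours)
#   bee.direction = alignment_rule(bee, neighbours)
#   bee.position  = add(bee.position, scale(bee.direction, speed * dt))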
While coding my demo, I noticed that just implementing the rules for separation and alignment already gave the swarm of bees a near life-like behavioural pattern. So I opted not to implement the cohesion rule -- it's all about the illusion after all!
Conclusion
You still with me? Cool.
We have gone through quite a lot of stuff: Kinect basics, creating the world's most expensive measuring tape, and a simple algorithm for flocking behaviours. All that's left is to bring it all together. To be honest, that's the boring stuff (you know it's true), so I'll leave that as an exercise for you.
My implementation can be seen in action on YouTube via the link below. Thanks go to my dear colleague Bart for recording it, and to Kevin for providing an unintended Kinect stress test.
One of the things I wish the official SDK had offered, but didn't at the time, is a separate stencil buffer that acts like a cardboard cut-out of the skeleton being tracked. This information would have allowed me to hide a bee whenever it disappeared behind me in space. Stuff like that really pushes the illusion of augmented reality into a whole different dimension.
Since completing this little experiment, I've seen a great number of awesome projects involving the Kinect, confirming my belief that this seemingly unimpressive and maybe even awkward gaming accessory actually has great potential!