Experiment Data
When you run an experiment, collect as much raw data as possible. Save the raw data, not derived data. Save all the data, not the data you think you need.
- If you collect derived data instead of raw data, you will lose information.
- If you collect a subset of data, you will lose information.
How I Burned Myself
I was working on a step counting computer vision system. A camera would watch someone walk in place and record how many steps they took. Fortunately, most of the hard work had already been done. The camera was a Kinect. Instead of converting raw images to gait information, the Kinect gave me joint positions.
My solution was a state machine and a set of classifiers, one each for right leg up, right leg down, left leg up, and left leg down. (I don't really remember if it was two or four classifiers.) The state machine had three states: start, right, and left. The system used the classifiers to detect when the state switched and incremented a counter. With the classifiers at about 80% on the testing data, I could get within 5% of the actual step count. It was a cool system.
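Here is a minimal sketch of that kind of state machine. It is not the original code. It assumes the classifiers have already been reduced to two per-frame booleans, right_up and left_up, and the names State and count_steps are made up for illustration.

```python
from enum import Enum


class State(Enum):
    START = "start"
    RIGHT = "right"
    LEFT = "left"


def count_steps(frames):
    """Count steps from per-frame classifier outputs.

    `frames` is an iterable of (right_up, left_up) booleans, one pair per
    frame. This input shape is an assumption, not the original system's.
    """
    state = State.START
    steps = 0
    for right_up, left_up in frames:
        # A step is counted only when the state switches, so a leg that
        # stays raised for several frames is counted once.
        if right_up and not left_up and state is not State.RIGHT:
            state = State.RIGHT
            steps += 1
        elif left_up and not right_up and state is not State.LEFT:
            state = State.LEFT
            steps += 1
        # Ambiguous frames (both or neither leg up) leave the state alone.
    return steps
```

Because the counter only moves on a state change, one misclassified frame in the middle of a raised leg does not add a step. That may be part of why imperfect classifiers can still give a close count.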
The trouble came with improving it. This was supervised learning, but there was no training data, so I had to generate and classify my own data. I wrote two extra programs: one that would record images and another for hand-classifying them. My hand classifier even had hot keys! (The data collection system became very complicated, much more complicated than the classifiers.) Unfortunately, I spent a lot of time in the hand classifier.
Early on, I only recorded the data I thought would be useful for the trainer. I'd keep knee and leg information and throw away the arms. When I wanted to improve the system, I had to collect new data and retrain it by hand. I spent a lot of time classifying data. Besides being incredibly tedious, this system made it hard to see what effect my changes had.
In retrospect, the problems I ran into should have been obvious, but my original design wasn't separated into several programs. Instead, I had the same data definition for the trainer and the collector. I had to recompile to change what I was working on. Classifying hundreds of images over and over without getting to keep them is a painful waste of time.
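One way to avoid that coupling, sketched under assumptions: the collector writes every joint the sensor reports, one JSON object per line, and the trainer derives whatever subset it wants at load time. The joint names, the file layout, and the functions record_frame, load_frames, and leg_features are all hypothetical, and none of this uses the Kinect SDK.

```python
import json

# Hypothetical joint names; a real sensor would supply its own.
LEG_JOINTS = ("hip_left", "knee_left", "ankle_left",
              "hip_right", "knee_right", "ankle_right")


def record_frame(log, frame_id, joints):
    """Collector side: append ALL joints for one frame, unfiltered."""
    log.write(json.dumps({"frame": frame_id, "joints": joints}) + "\n")


def load_frames(log):
    """Read back every recorded frame from an open, line-oriented log."""
    return [json.loads(line) for line in log if line.strip()]


def leg_features(frame):
    """Trainer side: derive the leg-only view from a raw frame.

    Changing this function changes the features without touching the
    recorded data, so nothing has to be collected again.
    """
    joints = frame["joints"]
    return {name: joints[name] for name in LEG_JOINTS if name in joints}
```

If hand labels are stored separately and keyed by frame number, they stay valid when the feature definition changes. Adding the arms later means editing leg_features, not classifying everything again.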
Conclusion
Please don't repeat my mistake. Design your experiment so you can save everything, and then save everything.