-fix CMakeLists, public headers weren't getting installed
(only 30 muls for size 4) - rework the matrix inversion: now using cofactor technique for size<=3, so the ugly unrolling is only used for size 4 anymore, and even there I'm looking to get rid of it.
fully optimized. * Even if LargeBit is set, only parallelize for large enough objects (controlled by EIGEN_PARALLELIZATION_TRESHOLD).