Another quick post, but I'm continuing some work from a couple months ago, where I trained several neural nets in diff datatypes, and it was only fp16/fp32/fp64, and still is (would like to do more, but need to learn how to make datatype code). I decided to try the approach where the weights are space equivalent, or same amount of GB.
From my original work, (inspired by @kosenjuu), the final graph looked like this:

Loss on the same model with diff floating points does nothing, but I think more floating points with half the precision each time will still work better
3 configs,
| config | dtype | layers | (train time)/step | 
|---|---|---|---|
| 1 | fp64 | 4 | 170ms | 
| 2 | fp32 | 8 | 430ms | 
| 3 | fp16 | 16 | 3500ms | 
the overall loss graph looks like this:
