Getting better models by dropping floating point sizes and increasing parameters

Another quick post. I'm continuing some work from a couple of months ago, where I trained several neural nets in different datatypes. It was only fp16/fp32/fp64, and still is (I would like to do more, but I need to learn how to write custom datatype code). This time I decided to try an approach where the weights are space equivalent, i.e. they take up the same number of GB.

Original conclusion

From my original work (inspired by @kosenjuu), the final graph looked like this:

loss

Changing the floating-point type on the same model made no real difference to the loss, but I think more parameters, with half the precision each time the count doubles, will still work better.

Training runs

Three configs, doubling the layers each time the precision halves:

4 layers (fp64)
8 layers (fp32)
16 layers (fp16)

config	dtype	layers	train time per step
1	fp64	4	170ms
2	fp32	8	430ms
3	fp16	16	3500ms
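
The "space equivalent" idea behind these configs can be checked with a little arithmetic: halving the bytes per parameter while doubling the layer count keeps the total weight size constant. Below is a minimal sketch, assuming every layer has the same number of parameters. `PARAMS_PER_LAYER` is a made-up value for illustration, since the post does not state the real layer size, and `weight_bytes` is a hypothetical helper, not code from the training runs.

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"fp64": 8, "fp32": 4, "fp16": 2}

# (dtype, layers) pairs taken from the three configs above.
CONFIGS = [("fp64", 4), ("fp32", 8), ("fp16", 16)]

# Hypothetical: parameters in one layer. The real size is not given in the post.
PARAMS_PER_LAYER = 1_000_000


def weight_bytes(dtype: str, layers: int, params_per_layer: int = PARAMS_PER_LAYER) -> int:
    """Total bytes needed to store the weights of all layers."""
    return layers * params_per_layer * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for dtype, layers in CONFIGS:
        size_mb = weight_bytes(dtype, layers) / 1e6
        print(f"{dtype}: {layers} layers -> {size_mb:.0f} MB of weights")
```

Under that equal-layer-size assumption all three configs come out to the same number of bytes. Parts of a model that do not scale with the layer count (embeddings, for example) would break the exact equality.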

The overall loss graph looks like this:

losses

q1/q2/q4 and beyond

I need to figure out how people make custom datatypes in PyTorch, then go down to the lowest bit widths and see how it goes.
They also need to be usable on the GPU. I tried the float8 datatype, but was unsuccessful.
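
Until a real custom datatype is available, one possible stand-in is to simulate low-bit weights: keep the values in an ordinary float but round them onto the few levels a q4/q2/q1 format could represent. This is a swapped-in technique (often called fake quantization), not a custom PyTorch dtype and not something that was run for this post. The sketch below is plain Python with a hypothetical `fake_quantize` helper, and it assumes a symmetric, evenly spaced grid scaled to the largest weight.

```python
from typing import List


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Round each value onto a symmetric uniform grid that fits in `bits` bits.

    The result is still stored as ordinary floats. Only the set of distinct
    values is restricted, which imitates a low-bit format.
    """
    if bits < 1:
        raise ValueError("bits must be >= 1")
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    if bits == 1:
        # One bit: keep only the sign, scaled to the largest magnitude.
        return [max_abs if v >= 0 else -max_abs for v in values]
    # Largest integer level of a signed grid symmetric around zero.
    q_max = 2 ** (bits - 1) - 1
    scale = max_abs / q_max
    # Scale, round to the nearest integer level, clamp, then scale back.
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


if __name__ == "__main__":
    weights = [-1.0, -0.4, 0.0, 0.3, 1.0]
    for bits in (4, 2, 1):
        print(bits, fake_quantize(weights, bits))
```

This saves no memory, so it does not answer the space-equivalence question by itself. It only gives a rough idea of how much precision a model loses at each bit width before investing in a real GPU-usable datatype.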