Transformers feel overhyped to me: for most NLP tasks I don't see what a 10-layer transformer buys you over a 5-layer one.
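One way to make the depth tradeoff concrete is a back-of-envelope parameter count. This is a sketch, not tied to any specific model in the post: `d_model=768` and `d_ff=3072` are assumed BERT-base-like defaults, and embeddings and the output head are ignored since they are the same at either depth.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Rough parameter count for one standard transformer encoder layer."""
    attn = 4 * d_model * d_model + 4 * d_model  # Q, K, V, O projections + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model   # two feed-forward linear layers + biases
    norms = 2 * (2 * d_model)                   # two LayerNorms (gain + bias each)
    return attn + ffn + norms

def model_params(n_layers: int, d_model: int = 768, d_ff: int = 3072) -> int:
    # Counts only the layer stack; embeddings/head are shared regardless of depth.
    return n_layers * layer_params(d_model, d_ff)

# Doubling depth from 5 to 10 layers exactly doubles the stack's parameters,
# so the question is whether the extra capacity pays off on the task at hand.
print(model_params(5), model_params(10))
```

At these dimensions each layer is roughly 7M parameters, so the 10-layer stack carries about 35M more than the 5-layer one, which is the cost side of the tradeoff the post is questioning.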