RoPE + GQA for Transformers

Implemented rotary positional embeddings (RoPE) and grouped-query attention (GQA) in a GPT-style Transformer, with key-value (KV) caching and causal masking.
Benchmarked attention latency across sequence lengths and head configurations; reduced validation loss by more than 10% relative to a minGPT baseline and improved throughput and memory usage.
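
For readers unfamiliar with the two techniques named above, the following is a minimal pure-Python sketch, not the project's actual code. The function names (rope_rotate, kv_head_for, gqa_causal_attention), the nested-list tensor layout, and the frequency base of 10000 are all assumptions made for illustration. It omits KV caching and recomputes every key rotation, so it shows the math only, not the performance behavior.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d).

    `base` is an assumed default; i steps over even indices 0, 2, 4, ...
    """
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kv_head_for(q_head, n_heads, n_kv_heads):
    """Map a query head to the key/value head shared by its group."""
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    return q_head // (n_heads // n_kv_heads)

def gqa_causal_attention(q, k, v, n_heads, n_kv_heads):
    """Hypothetical layout: q is [n_heads][T][d]; k and v are [n_kv_heads][T][d]."""
    out = []
    for h in range(n_heads):
        g = kv_head_for(h, n_heads, n_kv_heads)
        T, d = len(q[h]), len(q[h][0])
        head_out = []
        for t in range(T):
            qt = rope_rotate(q[h][t], t)
            # Causal mask: position t only scores positions 0..t.
            scores = [dot(qt, rope_rotate(k[g][s], s)) / math.sqrt(d)
                      for s in range(t + 1)]
            m = max(scores)  # subtract the max for a numerically stable softmax
            w = [math.exp(x - m) for x in scores]
            z = sum(w)
            head_out.append([sum(w[s] * v[g][s][j] for s in range(t + 1)) / z
                             for j in range(d)])
        out.append(head_out)
    return out
```

Because both queries and keys are rotated by an angle proportional to their position, their dot product depends only on the offset between the two positions. With n_kv_heads smaller than n_heads, several query heads read the same keys and values, which is what shrinks the KV cache in a real implementation.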