Despite these impressive specifications, user testing revealed significant performance issues. The release has left many in the AI community feeling disappointed, marking the most negative reaction to a model release in recent memory.
Independent AI researcher Simon Willison asked Llama 4 Scout to summarize a long Reddit thread (~20,000 tokens), and the output was “complete junk,” with the model looping and hallucinating instead of summarizing. Users reported that performance begins to degrade well below the advertised 10 million tokens, often around just 10,000–20,000 tokens, limiting practical benefits.
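For readers who want to run a similar check themselves, the sketch below shows one way to do it, assuming Llama 4 Scout is served behind an OpenAI-compatible endpoint (the base URL, API key, model name, and input file are placeholders, not official values from Meta or from Willison's test):

```python
# Minimal sketch of a long-context summarization check, assuming an
# OpenAI-compatible endpoint serving Llama 4 Scout. The base_url, api_key,
# model name, and input file below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local inference server
    api_key="not-needed-for-local",
)

def summarize(text: str, model: str = "llama-4-scout") -> str:
    """Ask the model to summarize `text` and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following discussion thread."},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    thread = open("reddit_thread.txt", encoding="utf-8").read()
    # Rough token estimate (~4 characters per token) -- enough to tell whether
    # the input sits near the 10,000-20,000 token range where users reported
    # degradation, well short of the advertised 10 million.
    approx_tokens = len(thread) // 4
    print(f"Approximate input size: ~{approx_tokens} tokens")
    print(summarize(thread))
```

Truncating the input to different lengths and rerunning the script gives a rough sense of where coherent summaries give way to the looping and hallucination users described.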