Purely personal experience, no data-backed research
Daily Work
For tasks like information retrieval, analysis, and organization, mainly covering the non-specialized parts of my independent development work, such as design and marketing.
- Gemini 2.5 Pro: Currently the main workhorse. It’s affordable (compared to other top-tier LLMs), stable, multimodal (with very generous file-size limits), and has a large context window. It takes a bit longer to think and its output can be verbose, but its grasp of problems and adherence to instructions are excellent. My first choice when serious work needs to be done.
- Claude Sonnet 4 (thinking): Used as a second opinion to Gemini 2.5 Pro. The biggest issues right now are the file-size limits, which can be troublesome, and that longer conversations tend to derail, degenerating into generic bulleted answers.
- Claude Sonnet 4: Handles minor issues that don’t require deep thought. Apple’s earlier paper on reasoning models overthinking small problems matches my experience. The Claude Sonnet series has always had one flaw: it’s too “proper.” You need good prompts and careful questioning to keep it from being afraid to point out problems. Another issue I’ve observed is that its adherence to prompts isn’t very flexible; it doesn’t play certain “incorrect” roles well.
- Qwen QwQ 32B (on Groq): Serves as a second opinion to Claude Sonnet 4, offering a different perspective. By contrast, its claims and numbers tend to be sharper and more exaggerated, unlike Sonnet’s “propriety.” The lack of multimodality is a drawback.
- GPT 4o: A third opinion for Claude Sonnet 4. I actually rarely use the GPT series anymore. 4o’s output quality is noticeably inferior, and I’ve always disliked its emotional manipulation; I basically won’t use it without a prompt that overrides that behavior. On the plus side, it’s very flexible with prompts and excellent at playing the role of a maid.
- Gemini 2.5 Flash: Handles small tasks like simple translation or grammar checks. It’s cheap and fast, but its analytical and organizational abilities fall short (even with thinking enabled).
- Llama 4 Scout / Maverick (on Groq): For laughs. The failures of others always bring me joy. Seriously, why is Llama 4 so bad? Its output is mediocre, it has no value as a second opinion, and its instruction following isn’t even good enough for a product API (at least Llama 3’s was). I expected it to be a game-changer on Groq, but now it looks like game over.
Writing Code
- Gemini 2.5 Pro: Mainstay. Most of the time, I use this. Aside from being verbose and slow, I don’t have many complaints. When I’m coding outside my specialty, its verbosity supplies useful background knowledge; in my main field it can be a bit annoying.
- Claude Sonnet 4: Handles relatively simple minor tasks. The Claude Sonnet series was originally my main choice, but Sonnet 4’s tendency to “do too much” is a bit annoying. Has it been manipulated into “proving itself” or something?
- Gemini 2.5 Flash: i18n.
Product API
- Gemini 2.5 Flash: Currently the sweet spot among price, speed, and quality. The service is also stable, making it the most practical model for development. Most of my AI modules that don’t require top-tier output use Gemini 2.5 Flash. I believe Google found a successful market entry point with Gemini 2.0 Flash, which helped pull the rest of the lineup along.
- Claude Sonnet 4: Suitable for applications that need high-quality output. Claude’s “proper” style keeps its output quality very stable, and its price and speed are acceptable. Another takeaway from my work on Tasmap: the Sonnet series has the best “aesthetics,” or rather, it’s the only series with aesthetics.
- Gemini 2.5 Flash Lite: Released the day before I wrote this, so I haven’t tested it thoroughly yet. Initial impressions suggest it could become the “minimum viable quality, fastest speed” choice.
- Llama 3.3 70B (on Groq): My current candidate for “minimum viable quality, fastest speed.” I previously used Mixtral 8x7B, but Groq no longer serves it. Llama 3.3’s quality is hit-or-miss; sometimes it fails to return valid structured (JSON) output.
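That structured-output failure is why I never trust cheap, fast models to return clean JSON directly. A minimal sketch of the kind of defensive parser I mean (the function name and exact fallback order are my own, not from any SDK):

```python
import json
import re

# Triple backtick built indirectly so it reads unambiguously in this snippet.
FENCE = "`" * 3

def parse_model_json(text: str):
    """Best-effort parse of JSON returned by an LLM.

    Weaker models sometimes wrap the JSON in Markdown fences or
    prepend chatty text; try the raw string first, then fall back
    to increasingly forgiving extraction.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strip a Markdown code fence, e.g. a fenced "json" block.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # Last resort: the widest {...} span in the text.
    braced = re.search(r"\{.*\}", text, re.DOTALL)
    if braced:
        try:
            return json.loads(braced.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller decides: retry, or fall back to a stronger model

chatty = 'Here you go: {"answer": 42}. Enjoy!'
print(parse_model_json(chatty))  # recovers the embedded object
```

Returning `None` instead of raising keeps the retry logic in one place: the caller can re-prompt once, then route the request to a pricier model.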
Daily Life
- GPT 4o: Asking “what kind of fish is this” when grocery shopping. I don’t really use LLMs much away from my laptop.
- Claude, Gemini: Same as above.
Other
- Mistral: I’ve been experimenting with it recently but haven’t gotten particularly impressive results. Their best model is probably Mistral Medium, but its positioning is a bit awkward: it sits between Claude Sonnet 4 and Gemini 2.5 Flash without many standout strengths. I’d hope to find a “qualified quality, super fast, cheap” model from them, but it feels like they’re still figuring out their own direction.
- Grok: I’ve heard its real-time search of Twitter trends is great, but I haven’t tried that yet. My initial impressions from the app were not good; its functionality and output quality were far behind the big three’s models. Its uninhibited style might be its one distinguishing trait.
- Claude Opus 4: Too expensive, too slow. Useless for product development or daily work. I don’t need a model to run for an hour writing a whole project; just thinking about how much of that output I’d have to review scares me.
- GPT o3: I’m not very familiar with the OpenAI lineup. I remember this one being touted for deep analysis, but after a few tries it didn’t feel that deep; roughly on par with Gemini 2.5 Pro.