Twitter AI Evaluation (legacy)

Saturday, April 4, 2026

AI Evaluated

Tweets

@AleiahLock Explore Further

Quick Insight

This is a practical breakdown of 17 specific Claude skills that generated real revenue by treating AI as a worker, not magic. The author focuses on unsexy but profitable services like email personalization, SOP creation, and CRM cleanup rather than flashy AI products.

Actionable Takeaway

Pick 2-3 skills that align with your fintech experience (like CRM cleanup, competitor analysis, or SOP writing) and package them as consulting services for other startups. Test pricing per deliverable, not hourly.

Related to Your Work

The vendor comparison reports and feature mapping skills directly apply to your fintech platform work - you could systematize how you evaluate payment processors, compliance tools, or analytics platforms, then sell that process to other fintech founders facing similar decisions.

Thread/Source Worth Reading

The linked article provides concrete examples and frameworks for each of the 17 skills, including specific inputs/outputs and why each one sells. Worth reading for the tactical details on pricing and positioning these services.

@kevingu Explore Further

Quick Insight

This is about AutoAgent, an open source library that uses a meta-agent to autonomously improve task agents by tweaking prompts, adding tools, and refining orchestration. The author reports #1 benchmark scores without manual tuning, which targets the "harness engineering" bottleneck that normally requires deep domain and model expertise.

Actionable Takeaway

Clone the AutoAgent repo and test it on one of your existing automation workflows - maybe the print-on-demand pipeline or webhook processing logic. Set up evals for your specific use case and let it optimize for 24 hours to see what harness improvements it discovers.
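
To make the "set up evals and let it optimize" step concrete, here is a minimal sketch of the general idea: propose a harness change, score it, keep it only if the score improves. This is not AutoAgent's API. `HarnessConfig`, `propose_variant`, `run_evals`, and the toy checks are all hypothetical stand-ins, and a real meta-agent would presumably use an LLM reading traces rather than random edits.

```python
import random
from dataclasses import dataclass, replace

# Hypothetical edits the stand-in "meta-agent" is allowed to try.
PROMPT_HINTS = (
    "Validate the payload before acting.",
    "Retry transient failures once.",
)
AVAILABLE_TOOLS = ("http_get", "json_parse")


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical harness: the knobs the meta step may change."""
    system_prompt: str
    tools: tuple


def propose_variant(cfg: HarnessConfig, rng: random.Random) -> HarnessConfig:
    """Stand-in for the meta-agent: make one small edit to the harness."""
    if rng.random() < 0.5:
        hint = rng.choice(PROMPT_HINTS)
        if hint in cfg.system_prompt:
            return cfg
        return replace(cfg, system_prompt=cfg.system_prompt + " " + hint)
    tool = rng.choice(AVAILABLE_TOOLS)
    if tool in cfg.tools:
        return cfg
    return replace(cfg, tools=cfg.tools + (tool,))


def run_evals(cfg: HarnessConfig) -> float:
    """Toy eval: fraction of checks passed. Swap in real task evals here."""
    checks = [
        "Validate" in cfg.system_prompt,
        "retry" in cfg.system_prompt.lower(),
        "http_get" in cfg.tools,
        "json_parse" in cfg.tools,
    ]
    return sum(checks) / len(checks)


def optimize(cfg: HarnessConfig, rounds: int, seed: int = 0):
    """Greedy loop: keep a candidate only when it beats the best score so far."""
    rng = random.Random(seed)
    best, best_score = cfg, run_evals(cfg)
    for _ in range(rounds):
        candidate = propose_variant(best, rng)
        score = run_evals(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

The useful part to bring to any such tool is `run_evals`: the optimizer can only be as good as the checks it is scored against, so the evals for your own workflow are where the effort goes.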

Related to Your Work

This directly addresses your AI workflow automation challenges. Instead of manually tweaking prompts and tool configurations for different fintech use cases (fraud detection, offer matching, analytics), you could point AutoAgent at each domain with proper evals and let it discover optimal agent configurations autonomously.

Thread/Source Worth Reading

Yes, the linked content is substantial and worth reading. It's a detailed technical breakdown with concrete implementation details, benchmark results, and lessons learned about agent architecture (meta/task split, importance of traces, same-model pairing advantages). Includes actual code and practical insights about agent behavior.

@karpathy Explore Further

LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
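
The post does not include the "small and naive search engine over the wiki", so the following is only a guess at how small such a tool can be: count query terms in each .md file and rank by the total. The function names and the term-count scoring are assumptions made for illustration, not Karpathy's implementation.

```python
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(wiki_dir: Path) -> dict:
    """Map each .md file under wiki_dir (recursively) to its term counts."""
    return {
        path: Counter(tokenize(path.read_text(encoding="utf-8")))
        for path in sorted(wiki_dir.rglob("*.md"))
    }


def search(index: dict, query: str, limit: int = 5) -> list:
    """Return up to `limit` (path, score) pairs, highest term count first."""
    terms = tokenize(query)
    scored = []
    for path, counts in index.items():
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, path))
    # Sort by descending score, then by path so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [(path, score) for score, path in scored[:limit]]
```

Calling `search(build_index(Path("wiki")), "some query")` returns the best-matching files, where `wiki` is a placeholder directory name. Wrapping those two calls in a small command-line script would give an LLM agent a tool it can invoke for larger queries, as the post describes.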

Quick Insight

Karpathy describes using LLMs to automatically build and maintain personal knowledge bases from research materials — basically having AI create and update wiki-style documentation from raw sources, then query against it. This is a practical workflow for knowledge management that goes beyond typical "AI coding assistant" use cases.

Actionable Takeaway

Try this workflow to document one of your side projects: dump all research, docs, and competitor analysis into a folder, then use Claude/GPT to generate and maintain a structured wiki that you can query for product decisions and feature planning.

Related to Your Work

A good fit for your fintech platform's technical documentation and compliance research: you could automate maintaining internal wikis about payment regulations, API documentation, and integration guides. It is also useful for side project research where you are constantly evaluating new tools and market opportunities.

Thread/Source Worth Reading

Single tweet, no links. The workflow details are all contained in the post itself — sufficient to understand and implement the approach.

@ryancarson Explore Further

Quick Insight

Ryan Carson is showing how he turned OpenClaw (an AI agent framework) into a proactive executive assistant that manages emails, schedules, tasks, and business development workflows. This is a concrete implementation guide for building an AI chief of staff that works from files and integrations rather than just chat history.

Actionable Takeaway

Clone the clawchief repo and experiment with OpenClaw for automating your own email triage and task management - especially since you're already building AI integrations and could benefit from having an AI assistant manage your side project workflows.

Related to Your Work

This directly applies to your AI-powered dev workflows and automation interests. You could use a similar setup to manage your multiple side projects (print-on-demand, web agency tools, Chrome extensions) by having an AI assistant track tasks, follow up on customer emails, and coordinate between different project pipelines.

Thread/Source Worth Reading

Yes, the linked article and GitHub repo (snarktank/clawchief) are worth exploring. It provides a complete step-by-step setup guide with actual code, configuration files, and cron job templates. The repo includes skills, workspace files, and practical implementation details for building a proactive AI assistant.

@ashugarg Explore Further

Quick Insight

This article argues that B2B software will finally get its compounding data loop (like Netflix/Google have) by capturing "decision traces" — the reasoning behind business decisions, not just outcomes. With AI agents proposing actions and humans modifying them, every edit becomes a structured signal about what the model got wrong, creating a feedback loop that improves over time.

Actionable Takeaway

Start instrumenting decision reasoning in your fintech platform's webhook integrations and analytics dashboards. When users configure offer rules or adjust targeting parameters, capture not just the final settings but the context — why they changed discount thresholds, what merchant feedback influenced the decision.
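
As a rough illustration of what "capture the context, not just the final settings" could look like in code, the sketch below stores the proposed settings, the final settings, and the stated reason, and derives the diff between them. The `DecisionTrace` class and all of its field names are hypothetical; they are not taken from the linked article.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DecisionTrace:
    """Hypothetical record of one human edit to a proposed configuration."""
    actor: str
    proposed: dict
    final: dict
    reason: str
    context: dict = field(default_factory=dict)
    recorded_at: str = field(default_factory=_utc_now)

    def changes(self) -> dict:
        """Return {key: (proposed_value, final_value)} for every edited key."""
        keys = sorted(set(self.proposed) | set(self.final))
        return {
            key: (self.proposed.get(key), self.final.get(key))
            for key in keys
            if self.proposed.get(key) != self.final.get(key)
        }

    def to_json(self) -> str:
        """Serialize the trace, including the derived diff, for logging."""
        record = asdict(self)
        record["changes"] = {k: list(v) for k, v in self.changes().items()}
        return json.dumps(record, sort_keys=True)


trace = DecisionTrace(
    actor="merchant-123",
    proposed={"discount_pct": 10, "min_spend": 25},
    final={"discount_pct": 15, "min_spend": 25},
    reason="Nearby competitor is running 15% off",
)
print(trace.changes())  # {'discount_pct': (10, 15)}
```

The diff is the structured signal: it records which suggestion was overridden and by how much, while `reason` and `context` carry the why that a conversion metric alone would not show.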

Related to Your Work

Your credit-card-linked offers platform likely has tons of merchant configuration decisions that could become decision traces. When merchants adjust offer parameters or reject suggested targeting, that's valuable signal about what your algorithms are missing — much more valuable than just tracking final conversion rates.

Thread/Source Worth Reading

The linked article is substantial and worth reading. It provides concrete examples of how decision traces work in practice and explains the technical infrastructure needed to capture and structure this data. The "context graph" concept could be directly applicable to fintech workflows.

@zodchiii Explore Further

Quick Insight

This is a detailed guide on setting up automated hooks for Claude Code that enforce code quality, run tests, and prevent dangerous operations without relying on Claude to remember instructions. The 8 hooks cover everything from auto-formatting to blocking destructive commands, giving you a "set it and forget it" approach to AI pair programming guardrails.

Actionable Takeaway

Set up the auto-format hook (#1) and dangerous command blocker (#2) immediately in your next Claude Code session. These two alone will solve the most common frustrations with AI coding tools forgetting basic requirements.
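
For orientation, here is a sketch of what a dangerous command blocker can look like. It is not the author's script (the article's examples are shell scripts), and the patterns are illustrative only. It assumes, per my reading of the Claude Code hooks documentation, that a `PreToolUse` hook receives a JSON event on stdin with the shell command under `tool_input.command`, and that exit code 2 blocks the call and passes stderr back to Claude. Verify both against the current docs before relying on it.

```python
import json
import re
import sys
from typing import Optional

# Illustrative patterns only; tune these for your own environment.
BLOCKED_PATTERNS = [
    r"\brm\s+-\w*r\w*f|\brm\s+-\w*f\w*r",   # rm -rf / rm -fr
    r"\bgit\s+push\b.*--force",             # force pushes
    r"\bgit\s+reset\s+--hard\b",            # discards local changes
    r"\bdrop\s+(table|database)\b",         # destructive SQL
]


def find_violation(command: str) -> Optional[str]:
    """Return the first blocked pattern matching the command, else None."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return pattern
    return None


def decide(event: dict) -> tuple:
    """Map a hook event to (exit_code, message). The event shape is assumed."""
    command = (event.get("tool_input") or {}).get("command", "")
    violation = find_violation(command)
    if violation is None:
        return 0, ""
    return 2, f"Blocked: command matches {violation!r}"


def main() -> int:
    """Read one event from stdin and report the decision on stderr."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

To run it as a hook, end the file with `sys.exit(main())` and register the script as a `PreToolUse` command hook for the Bash tool. The point of doing this in a hook rather than in the prompt is that the check runs every time, whether or not the model remembers the instruction.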

Related to Your Work

Perfect for your AI-powered dev workflows and fintech platform work where code quality and safety are critical. The "require passing tests before PR" hook would be especially valuable for your credit-card platform where bugs could affect real money transactions.

Thread/Source Worth Reading

Yes, this is a comprehensive implementation guide with actual shell scripts you can copy-paste. The author provides 8 complete hook examples with explanations of when/why to use each one. Much more practical than typical "here's what's possible" AI content.

@thekitze Explore Further

what are your usecases for openclaw/hermes that you cannot do with codex / claude code (local or remote via phone) I am legit asking and not trolling.

Quick Insight

Kitze is asking for real use cases where openclaw/hermes provide value over established options like Codex and Claude Code. (OpenClaw is described elsewhere in this digest as an AI agent framework rather than a coding assistant.) This is a practical question about tool differentiation in the crowded AI tooling space, relevant because you actively use AI in your dev workflows.

Actionable Takeaway

Check the replies to see what specific use cases people mention for openclaw/hermes versus Claude Code/Codex. If any compelling workflows emerge, test them against your current AI-assisted development setup for your side projects.

Related to Your Work

Directly relevant to your AI-powered dev workflows and automation tools. Understanding which AI tools excel at specific tasks could streamline your development process for both the fintech platform and side projects like Chrome extensions or print-on-demand automation.

Thread/Source Worth Reading

No links provided, but the replies would contain the actual value - real user experiences comparing these tools. Worth checking the thread for concrete use cases and workflows.

@Shpigford

just expanded my /research skill with 3 new background agents:
1. ui agent
2. ux agent
3. delight agent
major focus on prioritizing the user experience as a core part of any feature/fix/improvement.

Quick Insight

Josh Pigford is sharing his custom Claude /research skill that runs parallel agents focused on UI, UX, and user delight before any coding work begins. This is about systematically baking user experience considerations into technical decision-making rather than treating it as an afterthought.

Actionable Takeaway

You could adapt this multi-agent research pattern for your fintech work: create specialized agents that automatically check compliance requirements, fraud implications, and performance impact before implementing any feature. Running the agents in parallel could catch edge cases early.
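
The parallel-agent pattern suggested here can be sketched with `asyncio`. Everything below is hypothetical: the agent names, the focus areas, and the stubbed `run_agent`, which a real setup would replace with an LLM call using a focus-specific prompt. It is not the /research skill's implementation.

```python
import asyncio

# Hypothetical agents and the concern each one reviews.
AGENTS = {
    "compliance": "regulatory requirements",
    "fraud": "fraud and abuse implications",
    "performance": "latency and load impact",
}


async def run_agent(name: str, focus: str, feature: str) -> dict:
    """Stub agent: a real one would call an LLM with a focus-specific prompt."""
    await asyncio.sleep(0)  # placeholder for the model or network call
    return {"agent": name, "notes": f"Review '{feature}' for {focus}."}


async def research(feature: str) -> list:
    """Run every agent concurrently; results come back in AGENTS order."""
    tasks = [run_agent(name, focus, feature) for name, focus in AGENTS.items()]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    for finding in asyncio.run(research("instant offer activation")):
        print(finding["agent"], "->", finding["notes"])
```

Because `asyncio.gather` preserves input order, each finding can be attributed to its agent without extra bookkeeping, and adding a reviewer is a one-line change to `AGENTS`.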

Related to Your Work

For your credit-card-linked offers platform, this systematic UX research approach could prevent common fintech pitfalls like confusing checkout flows or unclear offer terms. Running a "compliance agent" alongside UI/UX agents could catch regulatory issues before they become expensive fixes.

Thread/Source Worth Reading

Worth reading. The linked article shows the full /research skill implementation with Context7 MCP integration for docs and web scraping. It includes specific prompts for clarifying requirements before research begins, which could save you significant rework cycles on client projects and side ventures.