
Building better AI with human-centered development

Simon Banks | April 16, 2025

Modern AI has captured public imagination through products like ChatGPT and Claude. But these polished, easy-to-navigate interfaces mask a complex, often messy reality. Building AI systems is an intricate dance of technical precision, careful planning, and constant iteration. Behind every successful deployment lies a development cycle that shapes how artificial intelligence moves from concept to reality.

But great AI goes further than technical sophistication: its value depends on how well these systems understand and interact with the real world. That’s why human input is essential. Models trained in isolation can reinforce biases, misinterpret context, or fail when faced with real-world variability. At Prolific, we help AI teams bridge this gap by providing access to diverse, vetted participants, making sure models are tested, trained, and refined with high-quality human insights.

Every step of the AI development lifecycle—from defining the problem to deploying and maintaining the system—relies on strong foundations. 

We explore each phase of AI development:

  • Phase 1: Beginning with why: defining the problem
  • Phase 2: The data foundation: quality, collection, and preparation
  • Phase 3: Building the right environment for development
  • Phase 4: Designing the brain: model architecture and design
  • Phase 5: Teaching the system: training and refinement
  • Phase 6: Model evaluation
  • Phase 7: Going live: deployment strategies
  • Phase 8: The maintenance reality
  • Phase 9: The future landscape

Phase 1: Beginning with why

Every successful AI project starts with questions. What real problem needs solving? What will success look like? Too often, however, organizations start with a solution in mind, like "we need an AI", rather than understanding the deeper challenges they face.

The challenge spans the entire AI landscape. Research labs building foundation models need to understand how their work will serve as building blocks for others. 

Large enterprises integrating AI into existing products need to consider how it affects millions of current users. Startups building AI-powered solutions need clarity on the specific problems they're solving. 

Each group faces unique challenges, but they share a common need: clear purpose before technology.

Take an automated customer service system. The real challenge isn't responding to customer queries. It's doing so quickly enough to enhance user satisfaction, accurately enough to provide helpful answers, and clearly enough to maintain customer trust. These core needs shape how the system comes together.

Successful projects begin by defining clear boundaries and requirements. Yes, this means understanding the technical challenges, but also the human, operational, and regulatory context the system will operate within. Considerations at this stage center around three core areas:

Performance requirements

AI systems are typically evaluated on speed (e.g. latency or inference time), accuracy (task-specific metrics like precision, recall, or F1 score), and resource usage (such as memory footprint or compute demand). 

Scalability also plays a key role, as teams need to consider how the system performs under increased load or when deployed across different environments, from cloud servers to edge devices. The exact metrics will vary depending on the application, but successful projects clearly define what ‘good performance’ looks like for their use case early in development.
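As a rough illustration, performance requirements can be written down as explicit, testable budgets early in development. The sketch below assumes a hypothetical `model.predict` interface and illustrative thresholds; the real budgets and metrics would come from the specific use case.

```python
# Minimal sketch: encode performance requirements as explicit budgets.
# `model`, `sample_inputs`, and the thresholds are hypothetical placeholders.
import time
import statistics

LATENCY_BUDGET_MS = 200   # assumed requirement: p95 latency under 200 ms
ACCURACY_FLOOR = 0.90     # assumed requirement: at least 90% accuracy

def check_performance(model, sample_inputs, labels):
    latencies, correct = [], 0
    for x, y in zip(sample_inputs, labels):
        start = time.perf_counter()
        prediction = model.predict(x)   # hypothetical inference call
        latencies.append((time.perf_counter() - start) * 1000)
        correct += int(prediction == y)

    p95 = statistics.quantiles(latencies, n=20)[18]   # ~95th percentile
    accuracy = correct / len(labels)
    return {
        "p95_latency_ms": p95,
        "accuracy": accuracy,
        "meets_latency_budget": p95 <= LATENCY_BUDGET_MS,
        "meets_accuracy_floor": accuracy >= ACCURACY_FLOOR,
    }
```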

Operational constraints

AI systems rarely operate in isolation. They need to integrate smoothly with existing infrastructure, be maintainable by internal teams, and deliver outputs in a way that fits into established workflows. 

Key considerations include API compatibility, ease of retraining, support for version control, and whether the system can be monitored and debugged effectively in production. Teams often balance ambition with what’s practical to deploy and maintain over time.

Compliance demands

From GDPR to sector-specific regulations, AI systems must meet privacy, transparency, and documentation standards. This often means tracking data provenance, maintaining explainability in model outputs, and having clear audit trails for training data and decision logic. Compliance affects design choices early in development, especially in regulated domains like healthcare, finance, or education.
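One lightweight way to support audit trails is to keep a provenance record alongside every training dataset. The sketch below is illustrative only; the field names and values are assumptions, not a compliance checklist.

```python
# Minimal sketch: a provenance record kept alongside each training dataset.
# All field names and values are illustrative assumptions.
from dataclasses import dataclass, asdict
import json

@dataclass
class DatasetProvenance:
    source: str          # where the data came from
    legal_basis: str     # basis for processing, e.g. consent
    collected_on: str    # collection date
    sha256: str          # fingerprint of the exact file used for training

record = DatasetProvenance(
    source="internal CRM export",            # hypothetical source
    legal_basis="user consent",              # assumed basis
    collected_on="2025-03-01",
    sha256="<fingerprint of training file>",
)
print(json.dumps(asdict(record), indent=2))
```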

Moving from concept to implementation means balancing ambition with practicality. While technical capabilities can expand quickly, the fundamentals of problem-solving remain unchanged. 

Companies should take the time to understand their challenges rather than chase trending solutions. That way, they stand a better chance of building systems that deliver lasting value and meaningful change.

Phase 2: The data foundation

No amount of algorithmic sophistication can overcome poor data, which is why data problems jeopardise more AI projects than any other factor. Laying a strong foundation for data means more than just gathering information, as it requires rigorous collection, validation, and ongoing refinement so AI models learn from accurate, representative, and high-quality data.

It’s in the quality of the data

AI models learn patterns from historical information, but if that data is incomplete, biased, or low quality, the entire system suffers. Missing values, incorrect labels, or biased sampling create systemic issues that no amount of model tuning can fix.

For instance, a facial recognition model trained on a dataset lacking diverse skin tones will produce critically biased results, even with state-of-the-art algorithms. No amount of post-processing or fine-tuning will compensate for an inadequate dataset. The only solution is getting the right data from the start.

AI teams need ways to build better datasets that draw on diverse, vetted human participants, rather than approaches that prioritize data scale over precision.

Quick, we need more data

It's not uncommon to discover that your existing data isn't sufficient. As a result, new collection systems may need building, and external data sources might require integration. Privacy requirements could limit what can be used. These challenges need solving before any actual AI development can begin.

For instance, open-source platforms like Hugging Face's Datasets Hub offer a vast collection of community-curated datasets across various domains, including natural language processing, computer vision, and audio tasks. Resources like this mean developers can access and share datasets, making it easier to enhance AI models with diverse and comprehensive data.
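As a quick illustration, pulling a dataset from the Hugging Face Datasets Hub takes only a few lines with the `datasets` library; the dataset name below is just an example.

```python
# Minimal sketch: load a community-curated dataset from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("imdb")       # "imdb" is an illustrative dataset name
print(dataset["train"][0])           # inspect one training example
print(dataset["train"].features)     # column names and types
```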

In the UK, the government has recognized the importance of accessible data for AI innovation. The recently announced AI Opportunities Action Plan includes the creation of a National Data Library, which aims to compile anonymized public sector data to support AI research and development. The initiative hopes to unlock valuable data assets and provide developers with the resources needed to build more effective and insightful AI systems.

Expanding data access in this way helps AI systems make more informed predictions. Whether it's training models for medical diagnostics, improving language translation, or refining automated decision-making, a broader and more representative dataset improves the system’s reliability and performance.

By using these resources, organizations can overcome data limitations and make sure their AI systems are trained on comprehensive and representative datasets. The approach enhances the models' ability to make nuanced and accurate predictions, ultimately delivering more valuable outcomes.

Data here, data there, data everywhere

The complexity of modern AI systems demands data from multiple sources. Working with diverse sources brings challenges around quality, format differences, update frequencies, and access controls. Integrating these sources while maintaining data quality requires sophisticated infrastructure and careful planning.

An example of this might be a language translation model that pulls data from news articles, social media, and academic papers. These sources use different writing styles, formatting, and update cycles: news content updates daily, social media hourly, and academic papers quarterly. The result is a complex challenge in maintaining current, balanced training data.

That’s where human input makes a difference. Automated systems can pull everything together, but people help make sense of it, checking for gaps, inconsistencies, and the kind of nuance that raw data alone might miss.

Preparation makes perfect

Raw data rarely serves AI systems well. The preparation phase turns collected information into usable training material, a process that combines technical skill with domain expertise.

Data preparation involves several technical processes:

  • Data cleaning: Removing duplicates, correcting errors, and filtering out irrelevant information before further processing
  • Data integration: Combining data from multiple sources and building pipelines for smooth ingestion
  • Transformation: Converting raw data into standardised formats, handling missing values, and normalising scales
  • Data labeling: Adding meaningful tags and annotations, typically before feature engineering, to improve model learning
  • Feature engineering: Creating new representations that capture important patterns and relationships
  • Validation: Verifying data quality, consistency, and completeness
  • Documentation: Recording all transformations and decisions for future reference
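A minimal sketch of what some of these steps might look like in practice, assuming a simple tabular dataset with hypothetical columns ("age", "income", "label"):

```python
# Minimal sketch of cleaning, transformation, feature engineering, and
# validation with pandas; column names are hypothetical.
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                           # data cleaning
    df = df.dropna(subset=["label"])                    # drop unlabeled rows
    df["age"] = df["age"].fillna(df["age"].median())    # handle missing values
    # transformation: normalise a numeric feature to zero mean, unit variance
    df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()
    # feature engineering: a simple derived flag
    df["is_adult"] = df["age"] >= 18
    # validation: fail loudly if a basic quality check does not hold
    assert df["label"].notna().all(), "unlabeled rows slipped through"
    return df
```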

With these foundations in place, you can move into model development, safe in the knowledge your AI system builds upon reliable, well-structured data that supports learning and accurate predictions.

Phase 3: Building the right environment for development

AI development environments have unique requirements compared to standard software development. Traditional software projects focus on writing and deploying code, but AI projects also involve managing large datasets, training models, and running many experiments. An effective AI environment supports tasks such as:

Scalable computing for training

AI projects often require powerful GPUs or distributed computing to train large models. A good environment can scale from quick prototype experiments to intensive training runs on big datasets.

Dataset management and versioning

Managing large datasets is a core part of AI development. The environment should support data versioning so teams can track changes to training data and keep experiments reproducible.
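Dedicated tools such as DVC handle this at scale, but even a lightweight fingerprint of each data file, recorded with every experiment, goes a long way. A rough sketch, using a hypothetical file name:

```python
# Minimal sketch: fingerprint a data file so experiments can record exactly
# which version of the data they were trained on.
import hashlib
import json
import pathlib

def dataset_fingerprint(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

version_record = {
    "file": "train.csv",                           # hypothetical data file
    "sha256": dataset_fingerprint("train.csv"),
}
pathlib.Path("data_version.json").write_text(json.dumps(version_record, indent=2))
```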

Experiment tracking

Unlike debugging code, AI development involves trying numerous model configurations and hyperparameters. Tools to log parameters, results, and metrics help teams compare experiments and figure out what works.
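For example, a run logged with MLflow might look like the sketch below; the parameter names and metric values are illustrative placeholders.

```python
# Minimal sketch: log one experiment run with MLflow so it can be compared
# against other runs later. Values are placeholders.
import mlflow

with mlflow.start_run(run_name="baseline-lr-0.001"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)
    # ... training happens here ...
    mlflow.log_metric("val_accuracy", 0.91)   # placeholder result
    mlflow.log_metric("val_loss", 0.27)       # placeholder result
```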

Model versioning and reproducibility

As models evolve, it's important to keep track of different versions and their performance. Many AI environments include model registries or version control for models to reproduce results and avoid confusion.

Integration with ML tools

Effective AI environments often integrate frameworks like PyTorch or TensorFlow, supported by experiment tracking tools such as MLflow, Weights & Biases, or Hugging Face's Evaluate. They also commonly include interactive notebooks for faster prototyping, orchestration tools such as Kubernetes for scalable deployments, or specialized frameworks like LangChain for building complex, agent-based applications.

Focusing on these AI-specific aspects helps remove friction in the workflow. When the environment handles data and experiment chores, teams can spend more time refining models and pushing boundaries.

Phase 4: Designing the brain

The architecture chosen should balance multiple competing needs, be it accuracy, speed, resource usage, or maintainability. Often, the latest cutting-edge model with billions of parameters isn't the right choice—a simpler, focused architecture might deliver better real-world results.

This phase requires combining deep technical expertise with practical wisdom. Look beyond raw performance metrics and consider the full lifecycle of your system. 

  • How will it handle unexpected inputs?
  • Can it scale with growing demand?
  • Will maintenance needs overwhelm the team?

Architecture decisions extend far beyond selecting a model type. From initial data processing through final prediction delivery, the entire system needs careful consideration. The components have to work smoothly under real-world conditions, with security built into the foundation rather than bolted on later.

Selecting the right model often involves consulting benchmarks that evaluate AI performance across various tasks. Resources like Papers with Code, Hugging Face's Open LLM Leaderboard, and MLPerf provide comparative insights into different models, offering key metrics such as accuracy, latency, and compute efficiency. Developers should consider:

  • Task-specific performance: Metrics like precision, recall, F1-score, and ROC-AUC can indicate suitability for classification, detection, and ranking tasks.
  • Scalability and efficiency: Models should meet compute and memory constraints, particularly for edge or cloud deployment.
  • Cost considerations: Larger models may offer higher accuracy but at the expense of higher computational costs. Fine-tuned, smaller architectures may provide a better trade-off between performance and efficiency.
  • Robustness and fairness: Benchmarks sometimes include adversarial robustness and bias detection metrics, helping teams build more trustworthy systems.
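A rough sketch of comparing two candidate models on these task-specific metrics with scikit-learn; the predictions and scores below are made-up held-out results, not real benchmark data.

```python
# Minimal sketch: compare candidate models on task-specific metrics.
# y_true and the candidates' outputs are illustrative held-out results.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
candidates = {
    "large_model": {"pred": [1, 0, 1, 1, 0, 1, 1, 0],
                    "score": [0.9, 0.2, 0.8, 0.7, 0.3, 0.9, 0.6, 0.1]},
    "small_model": {"pred": [1, 0, 1, 0, 0, 1, 0, 0],
                    "score": [0.8, 0.3, 0.7, 0.4, 0.2, 0.8, 0.4, 0.2]},
}

for name, out in candidates.items():
    print(name,
          "precision:", precision_score(y_true, out["pred"]),
          "recall:", recall_score(y_true, out["pred"]),
          "f1:", round(f1_score(y_true, out["pred"]), 3),
          "roc_auc:", round(roc_auc_score(y_true, out["score"]), 3))
```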

Modern AI development succeeds when teams find the sweet spot between capability and sustainability. A system that's reliable and maintainable while meeting user needs often delivers more value than one that achieves slightly higher accuracy scores in controlled conditions.

Consider a content moderation system. While a large language model might achieve higher accuracy in detecting nuanced policy violations, a simpler classification model could prove more practical by processing content faster, requiring less computing power, and being easier to update as moderation policies change. 

The simpler system's speed and adaptability might provide better real-world value than the marginal accuracy gains of a more complex model.

Phase 5: Teaching the system

Training is an ongoing process of refinement. Initial training reveals gaps in data, flaws in architecture, and unexpected behaviours. Each discovery leads to adjustments and improvements.

This phase demands patience and a systematic approach. Results must be carefully analysed, unusual or extreme scenarios (edge cases) thoroughly investigated, and performance verified across varying conditions. Each iteration deepens the understanding of the model's strengths and limitations.

Modern training approaches, such as RLHF or adversarial training, emphasize robustness and fairness alongside accuracy. Systems need to handle unexpected inputs, maintain performance under varying conditions, and degrade gracefully when pushed beyond their limits. This requires sophisticated training regimes that expose models to diverse scenarios and edge cases.

The hardest challenges often surface here. Models may show unexpected biases. Even well-structured datasets can miss key nuances, which is why human input is often needed to check for imbalances and ensure fairness. 

Whether through diverse annotation teams or real-world testing, human data helps catch biases that algorithms might overlook. We've seen this play out in real-world cases, like when the Gender Shades research uncovered that facial recognition systems were failing women with darker skin tones at alarming rates. Error rates ran up to 34 percentage points higher than for lighter-skinned men, yet these biases remained hidden until researchers deliberately tested the systems with diverse faces.

This reality is why leading AI companies have improved how they train models. OpenAI's approach with reinforcement learning from human feedback shows how central human judgment has become to creating effective AI. Behind ChatGPT's polished interface lies countless hours of human evaluators providing nuanced feedback that guides the system toward more helpful responses.

At Prolific, we address these training challenges by offering fast access to diverse, representative human participants. Our platform enables developers to quickly collect high-quality data from verified participants across 38 countries speaking over 250 languages. 

Having a global reach helps teams identify and address potential biases during the training process rather than after deployment. With more than 300 demographic and behavioral filters, AI developers can specifically target the exact participant profiles needed for specialized model training, so their systems work effectively across diverse user groups. 

This approach to human-in-the-loop training creates more robust, fair AI systems that perform reliably in real-world conditions.

Performance can vary significantly across different groups. Resource usage might exceed expectations in ways difficult to predict. These issues demand thorough investigation and thoughtful resolution. Teams have a responsibility to balance the desire for perfect performance against practical constraints like time and computing resources.

Phase 6: Model evaluation

Before full deployment, systems need thorough validation. This goes beyond checking basic accuracy metrics. It requires understanding behavior across different scenarios, user groups, and operating conditions. You should verify the practical usability and maintenance requirements as well as the technical performance. 

Validation often reveals the need for refinement. Models might require architectural adjustments for better speed and efficiency under production loads, or training modifications could be needed for handling edge cases and unexpected inputs. There may even be the need for a complete redesign due to fundamental flaws in how the model handles real-world data patterns and user interactions.

The most successful teams approach validation systematically, testing:

  • Technical performance across different conditions and scenarios. For instance, stress-testing an AI-powered chatbot by simulating peak-hour customer queries to see if response times degrade.
  • Integration capabilities with existing systems and workflows. Making sure a fraud detection model works with a bank’s transaction processing system without causing delays or false positives.
  • Resource usage under various load patterns. Running an image recognition model on different hardware configurations to determine if it remains efficient on lower-powered devices.
  • Security resilience against potential attacks. Evaluating an AI system’s vulnerability to adversarial attacks, such as testing an image classifier with manipulated inputs to see if it misclassifies objects.
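As one concrete example of testing under load (assuming a hypothetical HTTP prediction endpoint), a simple concurrent test can reveal how tail latency behaves at peak traffic:

```python
# Minimal sketch: fire concurrent requests at a model endpoint and check
# tail latency. The endpoint URL and payload are hypothetical.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://example.internal/predict"   # hypothetical endpoint

def one_request(_):
    start = time.perf_counter()
    requests.post(ENDPOINT, json={"text": "Where is my order?"}, timeout=5)
    return (time.perf_counter() - start) * 1000

with ThreadPoolExecutor(max_workers=50) as pool:   # simulate peak-hour load
    latencies = list(pool.map(one_request, range(500)))

p95 = statistics.quantiles(latencies, n=20)[18]    # ~95th percentile
print(f"p95 latency under load: {p95:.0f} ms")
```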

Good validation separates AI systems that work in the real world from those that only show promise in the lab. While no system performs perfectly, proper testing reveals whether yours can handle the messiness of actual use or just looks good on paper.

Responsible AI considerations and bias mitigation

Responsible AI development begins with honesty about biases, many of which are unconscious. Every dataset reflects the prejudices and limitations of both its creators and the input data itself. Models learn and amplify these biases, and when deployed, they can deepen existing inequalities or create new ones.

For example, a hiring algorithm trained on historical recruitment data may favor candidates from certain backgrounds if past hiring decisions were biased, leading to discriminatory outcomes.

Bias testing is just one part of responsible AI development, and tailoring models for specific use cases is equally important. For instance, a Spanish user asking a generalized LLM for breakfast recommendations might receive an American-centric response rather than suggestions relevant to their culture. That’s not true personalization. It’s a sign that the model hasn’t been optimized for diverse user contexts. Ensuring AI systems adapt to different cultural, linguistic, and domain-specific needs requires careful dataset curation and fine-tuning beyond just addressing bias.

Beyond bias detection, you need a solid process for model transparency, data privacy, and fairness testing. When problems arise, clear procedures should guide your response and resolution.

Validating AI with external, representative human data is necessary for fairness and reliability. No matter the background of the development team, real-world input from diverse users helps identify biases, blind spots, and unintended consequences that internal testing might miss. Regular audits and clear communication channels ensure concerns receive proper attention, allowing models to be refined based on feedback from those they impact most.
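One simple fairness check is to slice evaluation results by demographic group and look at the gap between the best and worst served groups. A minimal sketch, with hypothetical columns and toy data:

```python
# Minimal sketch: per-group accuracy to surface performance gaps.
# The dataframe columns and values are hypothetical.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "group":      ["A", "A", "B", "B", "B", "A"],
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 0, 1],
})

per_group = {
    name: accuracy_score(g["label"], g["prediction"])
    for name, g in results.groupby("group")
}
print(per_group)                      # flag groups that fall behind the others
print("worst-case gap:", max(per_group.values()) - min(per_group.values()))
```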

True fairness in AI may be elusive because fairness is not a fixed concept and varies across contexts, cultures, and perspectives. AI models learn from imperfect historical data and need to navigate trade-offs between competing priorities. While perfection may be out of reach, responsible AI development focuses on minimizing bias and aligning systems with human values as effectively as possible. And prioritizing ethics from the start helps build systems that work better for everyone they serve.

Phase 7: Going live

Deployment moves your model from testing to the real world, transforming it into a production system. This phase demands careful planning, from preparing infrastructure and setting up monitoring to developing clear support procedures. Together, these elements build the foundation for reliable operation.

Once the model is in a production environment, new challenges emerge. Real-world data often differs from training sets, and usage patterns may not match expectations. Integration with existing systems can create unexpected interactions. Success depends on spotting and resolving these issues to keep the system running effectively.

Well-prepared teams take a gradual approach to rollouts, but AI systems bring unique challenges that need extra care. Unlike traditional software, AI models can behave in unpredictable ways, especially when they run into new data they weren’t trained on. A limited release helps identify issues like model drift, biased outputs, or weird edge cases before they cause real problems.

A gradual rollout helps keep things manageable, combining ongoing monitoring with clear steps for handling anything unexpected. Since AI systems rely on real-world feedback to stay on track, teams should set up ways to catch performance dips, unintended biases, or strange behavior early. With the right setup, it's easier to tweak and retrain models as needed, staying ahead of problems instead of scrambling to fix them later.
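A gradual rollout can be as simple as routing a small, configurable share of traffic to the new model and comparing the two cohorts before widening the release. A rough sketch, with hypothetical model callables:

```python
# Minimal sketch of a canary rollout: expose a small fraction of traffic to
# the new model. `legacy_model` and `candidate_model` are hypothetical.
import random

CANARY_FRACTION = 0.05   # start by sending ~5% of requests to the new model

def route(request, legacy_model, candidate_model):
    if random.random() < CANARY_FRACTION:
        return candidate_model(request), "candidate"
    return legacy_model(request), "legacy"

# In practice, the served variant would be logged alongside quality signals
# (errors, latency, user feedback) so the cohorts can be compared before
# CANARY_FRACTION is increased.
```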

Phase 8: The maintenance reality

Production AI systems require constant attention. Performance needs monitoring. Data drift needs detecting. Models need updating. Security needs maintaining. This ongoing maintenance shapes everything that came before, from initial architecture choices to deployment procedures.

Successful maintenance requires reliable infrastructure and clear processes. Teams need tools for monitoring performance, detecting issues, and implementing updates. They need procedures for handling everything from routine updates to emergency fixes. Documentation needs to stay current to support effective operations.

Data drift presents particular challenges. Real-world patterns change over time, causing model performance to degrade. Teams need systems for detecting this drift and strategies for keeping models accurate. Fine-tuning plays a key role here: rather than retraining from scratch, developers can refine existing models with updated data, allowing them to adapt while preserving past learning.
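Drift detection can start simple: compare the distribution of a feature at training time with its recent production values. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy on made-up data; real monitoring would run this continuously per feature.

```python
# Minimal sketch: flag possible drift on a numeric feature with a
# two-sample Kolmogorov-Smirnov test. The values are illustrative.
from scipy.stats import ks_2samp

training_feature   = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2, 0.5]    # training-time values
production_feature = [0.7, 0.9, 0.85, 0.95, 0.6, 0.8, 0.75]  # recent production values

result = ks_2samp(training_feature, production_feature)
if result.pvalue < 0.05:
    print(f"Possible drift detected (KS statistic = {result.statistic:.2f}); "
          "consider fine-tuning on fresh data.")
```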

Effective fine-tuning fundamentally depends on access to fresh, high-quality human data. As user behaviors evolve, language shifts, and new edge cases emerge, models need continuous exposure to current human interactions and judgments. This ongoing human feedback loop helps AI systems stay aligned with changing real-world contexts and expectations.

Prolific directly addresses this need by providing on-demand access to human participants for continuous model refinement. With studies launching every two minutes on our platform, AI teams can quickly gather fresh training data, while validating model outputs against human preferences and identifying emerging edge cases that require attention. 

Our diverse participant pool ensures that fine-tuning captures a representative range of perspectives, helping models maintain relevance across different user groups and contexts. A human-centered approach to fine-tuning is significantly more efficient than complete retraining and helps maintain consistency across different versions of the model while adapting to changing real-world conditions.

Security considerations never end. New vulnerabilities emerge. Attack patterns evolve. Defense mechanisms need regular updates. Teams must stay vigilant, monitoring for potential threats while maintaining protective measures.

Phase 9: The future landscape

AI development keeps pushing forward at a remarkable speed. As models become more sophisticated and larger in scale, they need a strong infrastructure to support them. 

In 2023, AI ecosystems saw a significant surge in model development, with 149 foundation models released—more than double the number from the previous year. Notably, 65.7% of these models were open-source, reflecting a growing trend toward collaborative AI advancement.

Leading AI organizations continued this momentum:

  • OpenAI: Introduced GPT-4.5 in February 2025, a multimodal model processing text, images, and audio, achieving state-of-the-art results in various benchmarks.
  • Google: Launched Gemini 2.0 Flash in December 2024, designed for the agentic era, enhancing user interaction beyond traditional chatbots.
  • Anthropic: Released Claude 3.7 Sonnet in February 2025, surpassing its previous models in performance. 

These developments highlight the quick evolution and scaling of AI models, with organizations releasing advanced iterations to meet diverse application needs. At the same time, regulations around AI change and users expect more from these systems. Keeping up means teams need to learn and adapt as the field evolves.

The field constantly offers new techniques promising better performance or efficiency. While these advances are exciting, the key is to experiment, test, and refine. New techniques can lead to more powerful capabilities, but thoughtful development sees them applied in ways that add real value. What works well for one project might not be the right fit for another, but continuous exploration and iteration help teams discover the best solutions for their specific needs.

Meanwhile, the rules around AI keep changing, especially when it comes to transparency and fairness. Think ahead about regulatory requirements when designing systems, and keep thorough documentation to demonstrate compliance. Watch these changing rules closely while making sure your systems stay up to standard.

Building for success

Success in AI development requires more than technical skill. It demands practical wisdom, systematic execution, and constant attention to both details and broader context. Balance cutting-edge capabilities with operational realities while never losing sight of your original objectives.

The development lifecycle isn't just a sequence of steps; it's a framework for turning possibilities into practical solutions. Understanding this reality helps teams build AI systems that actually work, deliver real value, and stand the test of time.

As AI continues advancing, the fundamentals remain constant: start with clear problems, build solid foundations, design for real-world conditions, and maintain vigilance throughout operations. Embrace these principles while staying focused on delivering real value to create systems that make genuine differences in the world.

Prolific’s approach: the AI development lifecycle in practice

Every good AI system needs human input at key stages. Whether you're a research lab pushing boundaries or a team building practical applications, real people shape how your system learns and behaves.

At Prolific, we enable faster, higher-quality AI development by focusing on four key pillars:

1. Faster time-to-market

AI teams need real-world data fast. Our platform provides access to 200,000+ participants on demand, allowing developers to collect high-quality data in hours or days rather than weeks. Self-serve functionality eliminates time-consuming onboarding processes, helping AI teams move quickly from testing to deployment.

2. High-quality data

Prolific’s participants are verified, fairly compensated, and quality-approved, ensuring accurate and reliable data. Unlike crowdwork platforms, we prioritise participant experience and engagement, which leads to better responses and stronger AI training data.

3. Humans-in-the-loop, always

Just as software undergoes user testing before launch, AI models require continuous human feedback. As AI becomes more specialised, teams need domain experts to refine models for real-world use. With 300+ audience filters covering demographics, skills, and experience, Prolific makes it easy to find the right people for precise model training and validation.

4. Building responsible AI

AI must work for everyone. Our diverse global participant pool, spanning 250 languages and 38 countries, provides fairer, more representative AI systems. By embedding bias detection and fairness testing throughout the AI lifecycle, teams can proactively mitigate risks and build more ethical, transparent AI solutions.

Testing for real-world success

Before any AI system goes live, it needs thorough testing. With Prolific, teams can:
✅ Validate real-world performance with diverse user groups
✅ Identify hidden biases across different demographics
✅ Run comprehensive safety and reliability checks
✅ Test system behavior in challenging scenarios

Taking this systematic approach helps AI systems deliver real value in production, not just in controlled environments. Whether advancing core AI capabilities or solving specific problems, structured human input throughout development leads to more robust, trustworthy AI.

Check out our case studies, showing how Prolific helped Layer 6 expose flaws in AI evaluation metrics by providing more than a thousand vetted participants to compare generative model outputs with real human perception, and how we helped Shovels build a high-quality labeled dataset by connecting them with industry experts to refine AI-powered construction permit classification.

Get in touch or contact sales