Code Review & Testing Checklist for AI & Machine Learning

This checklist streamlines code review and testing for AI and machine learning projects, covering data quality, reproducibility, evaluation rigor, performance, and security. Use it to convert pull requests into production-grade models with faster iteration and fewer regressions.

Enforce dataset schema contracts in CI

Add Great Expectations or Pandera checks for nulls, ranges, dtypes, and categorical domains, and fail the pull request if contracts are violated. Include row count sanity checks and primary key uniqueness to catch silent pipeline defects.

Priority: critical | Category: Data Validation
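
As a rough illustration of what such a contract gate checks, here is a minimal standard-library sketch. It does not use the Great Expectations or Pandera APIs; the `CONTRACT` mapping, the column names, and `check_contract` are hypothetical names chosen for this example.

```python
# Hypothetical contract: column name -> predicate each value must satisfy.
CONTRACT = {
    "user_id": lambda v: isinstance(v, int),
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "plan": lambda v: v in {"free", "pro"},
}


def check_contract(rows, key="user_id", min_rows=1):
    """Return a list of human-readable violations (an empty list means pass)."""
    errors = []
    # Row count sanity check.
    if len(rows) < min_rows:
        errors.append(f"row count {len(rows)} below minimum {min_rows}")
    # Null, dtype, range, and categorical domain checks.
    for i, row in enumerate(rows):
        for col, is_valid in CONTRACT.items():
            if col not in row or row[col] is None:
                errors.append(f"row {i}: {col} is missing or null")
            elif not is_valid(row[col]):
                errors.append(f"row {i}: {col}={row[col]!r} violates contract")
    # Primary key uniqueness.
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        errors.append(f"primary key {key} is not unique")
    return errors
```

A CI job could call `check_contract` on a sample of the dataset and exit non-zero when the returned list is not empty.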

Pin dataset versions and hashes

Use DVC or lakeFS to version raw and processed datasets, and require hash pins in code and configs. Review diffs of data manifests in pull requests so that training and evaluation inputs stay immutable and reproducible.

Priority: critical | Category: Data Lineage
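
The hash-pin idea can be sketched without any versioning tool. The example below assumes a hypothetical `pins` mapping of relative file paths to expected SHA-256 digests, which in practice would come from a manifest committed next to the code; DVC and lakeFS keep their own metadata, which this sketch does not read.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pins(pins, root="."):
    """pins maps relative path -> expected hex digest; returns the paths that mismatch."""
    return [
        rel for rel, expected in pins.items()
        if sha256_of(Path(root) / rel) != expected
    ]
```

A test can then assert that `verify_pins(pins)` returns an empty list before training or evaluation starts.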

Add drift detection tests against a baseline

Run Evidently or Alibi Detect in CI to compare feature and label distributions against a locked baseline, with thresholds that gate merges. Flag upstream data shifts early to prevent silent performance drops.

Priority: important | Category: Drift Monitoring
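
To make the thresholded comparison concrete, the sketch below computes the Population Stability Index (PSI) for one numeric feature against a baseline. PSI is swapped in here as a simple, dependency-free statistic; it is not the Evidently or Alibi Detect API, and the bin edges and any merge threshold are assumptions a team would have to choose for itself.

```python
import math


def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples, binned by the given edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that the value has reached or passed.
            counts[sum(v >= e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A CI check could compute `psi` per feature against the locked baseline and fail when any value exceeds the configured limit; 0.2 is a commonly quoted rule of thumb for a significant shift, not a universal constant.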

Validate offline vs online feature consistency

Add tests that compute a small batch of features through both the offline and online feature store paths (e.g., Feast) and assert equality within tolerance. Include time-travel (point-in-time) tests to prevent training-serving skew.

Priority: critical | Category: Feature Stores

Detect target leakage and timestamp leaks

Add automated checks for suspicious correlations between features and targets, and enforce time-based splitting with clear cutoff dates. Include PR comments that link to leakage audits and remediation steps.

Priority: critical | Category: Leakage Prevention
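
The time-based splitting half of this item can be sketched as follows. The record layout and the `event_date` field name are assumptions made for the example; the correlation audit is not covered here.

```python
from datetime import date


def time_split(rows, cutoff, ts_key="event_date"):
    """Rows strictly before the cutoff go to train; the rest go to test."""
    train = [r for r in rows if r[ts_key] < cutoff]
    test = [r for r in rows if r[ts_key] >= cutoff]
    return train, test


def assert_no_time_leak(train, test, ts_key="event_date"):
    """Fail if any training row is dated at or after the earliest test row."""
    if train and test:
        latest_train = max(r[ts_key] for r in train)
        earliest_test = min(r[ts_key] for r in test)
        assert latest_train < earliest_test, (
            f"train data ({latest_train}) overlaps test period ({earliest_test})"
        )
```

Running `assert_no_time_leak` on the materialized splits, and not only on the splitting code, also catches leaks introduced by later joins or shuffles.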

Automate label quality and balance checks

Verify class distributions, annotation agreement, and label noise rates on each dataset change using a small test suite. Fail fast when label proportions shift beyond configured tolerances.

Priority: important | Category: Label Quality
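
A label-proportion check of this kind might look like the sketch below. The function names and the 0.05 default tolerance are illustrative assumptions, not recommended values.

```python
from collections import Counter


def label_proportions(labels):
    """Map each label to its share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def shifted_labels(baseline, current, tolerance=0.05):
    """Return {label: (baseline_share, current_share)} for labels that moved too far."""
    base, cur = label_proportions(baseline), label_proportions(current)
    return {
        label: (base.get(label, 0.0), cur.get(label, 0.0))
        for label in sorted(set(base) | set(cur))
        if abs(base.get(label, 0.0) - cur.get(label, 0.0)) > tolerance
    }
```

Labels that appear in only one of the two datasets are treated as having a share of zero in the other, so new or vanished classes are flagged as well.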

Test pipeline idempotency and determinism

Run the same pipeline twice in a sandbox orchestrator (Prefect, Dagster, or Airflow) and assert identical artifact checksums when inputs are unchanged. Log seed values and component versions for traceability.

Priority: important | Category: Orchestration Reliability

Standardize random seeds and determinism flags

Set seeds for Python, NumPy, and framework libraries, and enable deterministic algorithms (e.g., torch.use_deterministic_algorithms(True) and torch.backends.cudnn.deterministic = True). Include a test that verifies repeatable metrics within a set tolerance.

Priority: critical | Category: Reproducibility
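
One possible shape for a shared seeding helper is sketched below. `seed_everything` and `noisy_metric` are hypothetical names; NumPy and PyTorch are treated as optional, and the flags shown are only the ones named in this item, so GPU setups may need further settings.

```python
import os
import random


def seed_everything(seed):
    """Seed the RNGs that are available; NumPy and PyTorch are optional here."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


def noisy_metric(seed):
    """Stand-in for a training run whose result depends on random state."""
    seed_everything(seed)
    return sum(random.random() for _ in range(100)) / 100
```

A repeatability test then runs the same entry point twice with the same seed and compares the resulting metrics within the agreed tolerance.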

Lock environments and toolchain versions

Pin dependencies via Poetry or pip-tools, lock CUDA and cuDNN versions, and reference a base Docker image tag in CI. Add a smoke test that builds the container and runs a minimal training step.

Priority: critical | Category: Environment

Snapshot configs and enforce schemas

Use Hydra or OmegaConf with Pydantic or attrs to validate training configs and snapshot them as MLflow or Weights & Biases artifacts. Fail merges if required fields are missing or unexpected overrides appear.

Priority: critical | Category: Configuration

Unit test custom layers and losses

Add finite-difference gradient checks and compare outputs to small reference implementations. Include shape inference and mixed-precision compatibility tests in the pull request.

Priority: critical | Category: Numerical Correctness
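
A finite-difference gradient check can be illustrated on a scalar Huber loss, used here only as a stand-in for a custom loss. The sketch compares a hand-written analytic gradient against a central-difference estimate; the step size `h` and the choice of test points are assumptions.

```python
def huber(x, delta=1.0):
    """Huber loss on a scalar residual x."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def huber_grad(x, delta=1.0):
    """Analytic derivative of the Huber loss with respect to x."""
    if abs(x) <= delta:
        return x
    return delta if x > 0 else -delta


def numeric_grad(f, x, h=1e-5):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


def max_grad_error(points, h=1e-5):
    """Largest gap between the analytic and numeric gradients over the test points."""
    return max(abs(huber_grad(x) - numeric_grad(huber, x, h)) for x in points)
```

For tensor-valued layers the same comparison is applied per input element, usually in double precision so that rounding error does not mask real gradient bugs.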

Use tolerance-based assertions for floats

Apply numpy.testing.assert_allclose or torch.allclose with explicit rtol and atol across CPU and GPU runners. Document platform-specific tolerances to avoid flaky tests caused by kernel or BLAS differences.

Priority: important | Category: Floating Point
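
The same pattern can be shown with the standard library's `math.isclose`, which is swapped in here for the NumPy and PyTorch helpers (its tolerance formula is symmetric, so results can differ slightly from theirs). The `TOLERANCES` table and its values are hypothetical placeholders for whatever a team documents per platform.

```python
import math

# Hypothetical per-platform (rel_tol, abs_tol) pairs; real values need measuring.
TOLERANCES = {"cpu": (1e-9, 1e-12), "gpu": (1e-5, 1e-8)}


def assert_all_close(actual, expected, platform="cpu"):
    """Element-wise comparison with explicit, platform-specific tolerances."""
    rel_tol, abs_tol = TOLERANCES[platform]
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol), (
            f"index {i}: {a!r} != {e!r} (rel_tol={rel_tol}, abs_tol={abs_tol})"
        )
```

Keeping the tolerances in one named table makes any loosening visible in code review instead of hiding it in individual tests.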

Verify experiment tracking hooks

Add an integration test that logs params, metrics, and artifacts to MLflow or Weights & Biases, and asserts run completeness. Enforce run naming and tagging conventions via a review checklist.

Priority: important | Category: Experiment Tracking

Test checkpoint save and resume fidelity

Resume from checkpoints mid-epoch and ensure metrics and states (optimizer, scheduler, AMP) are restored. Gate merges if resumed training deviates beyond tolerance from a fresh run.

Priority: important | Category: Checkpointing
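
The resume-fidelity idea can be demonstrated on a toy momentum-SGD loop, used here as a stand-in for a real trainer. The `train` function and its state dictionary are hypothetical; the point is that everything the optimizer needs must be part of the checkpoint.

```python
import copy


def train(steps, state=None):
    """Minimize (w - 3)^2 with momentum SGD; state holds everything needed to resume."""
    state = copy.deepcopy(state) if state else {"w": 0.0, "velocity": 0.0, "step": 0}
    lr, momentum = 0.1, 0.9
    for _ in range(steps):
        grad = 2.0 * (state["w"] - 3.0)
        state["velocity"] = momentum * state["velocity"] - lr * grad
        state["w"] += state["velocity"]
        state["step"] += 1
    return state
```

Training for 10 steps in one go and training for 4 steps, checkpointing, then resuming for 6 should give the same state. Resuming with the weight alone, without the velocity, drifts away from the fresh run, which is the kind of bug this test is meant to catch.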

Cross-validate metric implementations

Compare custom metrics to scikit-learn or other standard references on randomized fixtures using Hypothesis. Add boundary case tests for empty classes, large logits, and ties in ranking metrics.

Priority: critical | Category: Evaluation Metrics

Gate pull requests on fairness thresholds

Run Fairlearn or Aequitas to compute group metrics and raise PR comments when parity gaps exceed policy thresholds. Require reviewers to sign off on mitigation plans or data changes.

Priority: important | Category: Fairness

Add adversarial robustness smoke tests

Evaluate small PGD or FGSM attacks for vision, or TextAttack perturbations for NLP, and track the relative drop in accuracy. Set a budget that prevents major regressions before deployment.

Priority: important | Category: Robustness

Slice-based evaluation for long-tail performance

Compute metrics on key data slices (geography, device type, rare classes) and track worst-case slices in CI. Flag regressions even if overall performance looks stable.

Priority: important | Category: Data Slicing
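
A minimal sketch of worst-slice tracking, assuming each evaluation record is a dictionary with hypothetical `pred` and `label` fields plus one field per slicing dimension:

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Accuracy per distinct value of slice_key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[slice_key]] += 1
        hits[r[slice_key]] += int(r["pred"] == r["label"])
    return {s: hits[s] / totals[s] for s in totals}


def worst_slice(records, slice_key):
    """Return (slice_name, accuracy) for the lowest-scoring slice."""
    scores = accuracy_by_slice(records, slice_key)
    name = min(scores, key=scores.get)
    return name, scores[name]
```

Comparing the worst-slice score against a stored baseline catches cases where a small slice degrades while overall accuracy barely moves.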

Check probability calibration and thresholds

Generate reliability diagrams and Brier scores, and ensure post-training calibration steps such as temperature scaling are applied. Review operating point selection and alert if drift changes the optimal thresholds.

Priority: important | Category: Calibration
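
For binary classifiers, the Brier score and a binned calibration error can be computed directly, as in the sketch below. The calibration error shown bins predicted positive-class probabilities, which is the data behind a reliability diagram; the bin count is an assumption, and multi-class variants differ.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp p == 1.0 into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for bucket in bins:
        if bucket:
            confidence = sum(p for p, _ in bucket) / len(bucket)
            frequency = sum(y for _, y in bucket) / len(bucket)
            ece += len(bucket) / len(probs) * abs(confidence - frequency)
    return ece
```

Both values can be stored per release and compared in CI, so a calibration regression is flagged even when accuracy is unchanged.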

Automate model card updates in CI

Verify that performance, data sources, intended use, and ethical considerations are filled, and attach evaluation artifacts. Fail merges if the model card is missing or out of date.

Priority: important | Category: Documentation

Maintain golden dataset regression tests

Pin a small, representative reference set and compare predictions and metrics to a stored baseline within tolerance. Block merges when regressions exceed agreed deltas.

Priority: critical | Category: Regression Testing
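
A baseline comparison of this kind can be sketched as below, assuming predictions are stored as a mapping from a hypothetical example ID to a score (for instance loaded from a JSON file committed with the golden set). The absolute tolerance is a placeholder for the agreed delta.

```python
import math


def find_regressions(baseline, current, atol=1e-3):
    """Return {example_id: (expected, actual)} for predictions outside tolerance."""
    regressions = {}
    for example_id, expected in baseline.items():
        actual = current.get(example_id)
        # A missing prediction counts as a regression too.
        if actual is None or not math.isclose(
            actual, expected, rel_tol=0.0, abs_tol=atol
        ):
            regressions[example_id] = (expected, actual)
    return regressions
```

Returning the offending examples, and not just a pass or fail flag, gives reviewers something concrete to inspect in the pull request.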

Benchmark latency and tail performance

Run load tests with Locust or k6 and capture P50, P95, and P99 latencies for representative payloads. Encode the SLOs as CI thresholds and fail the build when latency degrades beyond them.

Priority: critical | Category: Performance
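
Once a load test has produced raw latency samples, the percentile gate itself is small. The sketch below uses the nearest-rank percentile definition (load-testing tools may interpolate differently), and the budget dictionary is a hypothetical way to express the SLOs.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarize latency samples (milliseconds) as P50, P95, and P99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def slo_violations(report, budget_ms):
    """Return the percentiles whose latency exceeds the budget; unlisted ones are unbounded."""
    return {k: v for k, v in report.items() if v > budget_ms.get(k, float("inf"))}
```

A CI step could fail whenever `slo_violations` returns a non-empty dictionary.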

Validate memory usage and batch behavior

Test max batch sizes on target hardware, monitor peak GPU and CPU memory, and assert that no out-of-memory or fragmentation issues occur. Include a stress test with dynamic shapes if applicable.

Priority: critical | Category: Memory

Test concurrency and thread safety

Exercise inference under concurrent requests using FastAPI, TorchServe, or Triton with multiple workers and assert reproducible outputs. Verify that model objects are not mutated across requests.

Priority: important | Category: Concurrency

Verify serialization and portability parity

Export models to ONNX or TorchScript and run parity checks against native frameworks with numeric tolerances. Validate opset compatibility and fallback paths in CI.

Priority: important | Category: Portability

Assess precision and quantization impact

Measure the accuracy drop from FP32 to FP16 and to INT8, using calibration data where the quantization scheme requires it, and enforce a maximum allowed delta. Include tests for dynamic and static quantization workflows where relevant.

Priority: important | Category: Quantization

Measure cold start and warm start performance

Track model load time, CUDA context initialization, and cache warmup costs for containerized deployments. Gate merges if new dependencies inflate startup beyond agreed budgets.

Priority: important | Category: Startup Time

Ensure observability hooks and telemetry

Expose Prometheus metrics, structured logs, and traces via OpenTelemetry, and assert presence in CI with a quick scrape test. Include request IDs and model version tags in every log line.

Priority: critical | Category: Observability

Run static analysis and style gates

Enforce ruff or flake8, black, isort, and mypy in pre-commit and CI to maintain readability and type safety. Add Bandit and Semgrep rules for risky patterns such as insecure pickle loads.

Priority: critical | Category: Static Analysis

Scan dependencies and generate SBOMs

Use pip-audit or Safety, together with Snyk, to block known vulnerabilities, and output SBOMs with Syft for traceability. Track CUDA and cuDNN CVEs and set patch policies.

Priority: critical | Category: Dependencies

Prevent secret and PII exposure in repos

Enable git-secrets or TruffleHog in CI, and add tests that assert synthetic datasets are de-identified before storage. Restrict access to raw data via role-based controls and audit logs.

Priority: critical | Category: Privacy

Harden model serialization and loading

Forbid unsafe pickle loading paths and prefer safe formats like safetensors, ONNX, or TorchScript. Add tests that ensure unsafe loaders raise exceptions and document approved formats.

Priority: critical | Category: Serialization Security
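
One way to test that an unsafe loader raises is to route any remaining pickle reads through a restricted unpickler. The sketch below follows the `find_class` override pattern described in the Python `pickle` documentation and rejects every global lookup; `NoGlobalsUnpickler` and `safe_loads` are hypothetical names. It narrows the attack surface but does not make pickle a safe format, so the approved formats should remain the default.

```python
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_loads(data):
    """Load plain containers and primitives only; anything needing a global raises."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

A unit test can pickle a reference to a callable and assert that `safe_loads` raises `pickle.UnpicklingError`, while plain dictionaries and lists still load.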

Verify model supply chain integrity

Sign artifacts and containers with Cosign, verify checksums, and enforce registry access controls. Add CI attestations for training data versions and config hashes.

Priority: critical | Category: Supply Chain

Optimize CI with test sharding and caching

Use pytest-xdist, reuse cached datasets and model weights across jobs, and split long-running tests by duration. Keep seed sets stable to reduce flakiness while covering parallel paths.

Priority: important | Category: CI/CD

Test LLM apps for prompt abuse and data leakage

Include prompt injection and jailbreak suites, PII red teaming, and content filter checks using frameworks like Guardrails or LangSmith evals. Block merges if safety scorecards slip below policy thresholds.

Priority: important | Category: LLM Security

Pro Tips

* Create a tiny but representative golden dataset and lock it with DVC to run fast, deterministic regression tests in every pull request.
* Run a CI matrix across CPU and GPU with multiple CUDA versions, and set per-target numeric tolerances to avoid flakiness while catching real regressions.
* Cache model weights, tokenizers, and test datasets in CI using content-addressed keys so performance tests and exports run quickly.
* Automate config snapshots and artifact lineage by uploading training params, code commit hash, data hashes, and Docker image digests on every training run.
* Schedule nightly non-deterministic chaos runs with randomized seeds and adversarial perturbations to catch issues that slip past fast PR checks.
