hugolesta.nl

$ cat posts/dont-let-your-llm-wing-it-bedrock-knowledge-base.md

Don't Let Your LLM Wing It: Building a Knowledge Base That Actually Knows Things

terraform · aws · bedrock · pgvector · rag · platform-engineering

Every team eventually asks the same question: "can we make the LLM answer questions about our docs instead of hallucinating something plausible?" The answer is retrieval-augmented generation (RAG), and the unglamorous truth is that RAG is 10% prompt engineering and 90% plumbing — object storage, a vector index, an embedding model, and a pipeline that keeps all three in sync whenever someone edits a markdown file.

This is the plumbing. A Bedrock Knowledge Base backed by Aurora PostgreSQL with pgvector, provisioned entirely in Terraform, synced automatically from a Git repo via GitHub Actions. No notebooks, no manual "let me re-upload the docs" Tuesdays.


The architecture

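[Architecture diagram not rendered. The flow described in this post: Git repo -> GitHub Actions -> S3 bucket -> Bedrock Knowledge Base ingestion (Titan Embed Text v2) -> Aurora PostgreSQL with pgvector.]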

Two environments — acc and prod — run this same pipeline side by side, each with its own bucket, KB, and database, gated by branch and dispatch logic in the workflow.


Provisioning the vector store

S3 bucket for ingestion

Bedrock pulls documents from S3, not the other way around, so the bucket is just a private, versioned, encrypted drop zone:

resource "aws_s3_bucket" "kb_docs" {
  bucket = "${local.cluster_name}-knowledge-base"
  tags   = local.default_tags
}

resource "aws_s3_bucket_versioning" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "kb_docs" {
  bucket                  = aws_s3_bucket.kb_docs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Aurora pgvector cluster

This is a separate Aurora cluster from any application database — different lifecycle, different access pattern, and you don't want a runaway ingestion job competing for connections with your app:

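module "kb_db" {
  source  = "terraform-aws-modules/rds-aurora/aws"
  version = "~> 9.0"

  name           = "bedrock-kb"
  engine         = "aurora-postgresql"
  engine_version = "16.6"

  # Serverless v2 instances use the special db.serverless instance class.
  instance_class = "db.serverless"
  instances      = { 1 = {} }

  serverlessv2_scaling_configuration = {
    min_capacity = 0.5
    max_capacity = 4
  }

  database_name               = "bedrock_kb"
  master_username             = "root"
  manage_master_user_password = true

  # Required for the RDS Data API, which Bedrock uses to reach the cluster.
  enable_http_endpoint = true

  # VPC, subnet group, and security group settings are not shown here.

  tags = local.default_tags
}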

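Bedrock's Knowledge Base service expects the database credentials as a Secrets Manager secret containing JSON of the form {"username": ..., "password": ...}. If your Terraform module exposes only a bare password string, mirror it into a second secret in the right shape rather than fighting the module. One caveat: with manage_master_user_password = true, RDS keeps the master credentials in a secret it manages itself, so confirm that module.kb_db.cluster_master_password is actually populated in your module version before relying on it: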

resource "aws_secretsmanager_secret" "kb_db_bedrock" {
  name                    = "bedrock-kb-db-credentials"
  recovery_window_in_days = 0
  tags                    = local.default_tags
}

resource "aws_secretsmanager_secret_version" "kb_db_bedrock" {
  secret_id = aws_secretsmanager_secret.kb_db_bedrock.id
  secret_string = jsonencode({
    username = "root"
    password = module.kb_db.cluster_master_password
  })
}
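
To make the expected shape concrete, here is a small Python sketch that checks a secret string before it goes anywhere near Secrets Manager. The function name is_bedrock_secret_shape is hypothetical, and the check only covers the two keys mentioned above; it is an illustration, not Bedrock's own validation.

```python
import json


def is_bedrock_secret_shape(secret_string: str) -> bool:
    """Return True if the string is a JSON object with non-empty
    string values for both "username" and "password"."""
    try:
        payload = json.loads(secret_string)
    except (TypeError, ValueError):
        # Not JSON at all, e.g. a bare password string.
        return False
    if not isinstance(payload, dict):
        return False
    return all(
        isinstance(payload.get(key), str) and payload[key] != ""
        for key in ("username", "password")
    )
```

A bare password string or a JSON object with a null password both return False, which are the two ways the mirrored secret is most likely to go wrong.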

IAM: the Bedrock service role

Bedrock's Knowledge Base needs to assume a role that can read from S3, describe and execute statements against Aurora via the Data API, read the credentials secret, and invoke the embedding model:

data "aws_iam_policy_document" "bedrock_kb_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["bedrock.amazonaws.com"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [local.aws_account_id]
    }
  }
}

data "aws_iam_policy_document" "bedrock_kb" {
  statement {
    sid       = "S3Read"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:ListBucket"]
    resources = [aws_s3_bucket.kb_docs.arn, "${aws_s3_bucket.kb_docs.arn}/*"]
  }

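  # Bedrock describes the cluster before talking to it through the Data API.
  statement {
    sid       = "RDSDescribe"
    effect    = "Allow"
    actions   = ["rds:DescribeDBClusters"]
    resources = [module.kb_db.cluster_arn]
  }

  statement {
    sid       = "RDSDataApi"
    effect    = "Allow"
    actions   = ["rds-data:BatchExecuteStatement", "rds-data:ExecuteStatement"]
    resources = [module.kb_db.cluster_arn]
  }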

  statement {
    sid       = "SecretsManagerRead"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.kb_db_bedrock.arn]
  }

  statement {
    sid       = "BedrockEmbeddings"
    effect    = "Allow"
    actions   = ["bedrock:InvokeModel"]
    resources = ["arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"]
  }
}

resource "aws_iam_role" "bedrock_kb" {
  name               = "${local.cluster_name}-bedrock-kb"
  assume_role_policy = data.aws_iam_policy_document.bedrock_kb_assume_role.json
}

resource "aws_iam_role_policy" "bedrock_kb" {
  role   = aws_iam_role.bedrock_kb.id
  policy = data.aws_iam_policy_document.bedrock_kb.json
}

The Knowledge Base resource itself

resource "aws_bedrockagent_knowledge_base" "main" {
  name     = "${local.cluster_name}-knowledge-base"
  role_arn = aws_iam_role.bedrock_kb.arn

  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }

  storage_configuration {
    type = "RDS"
    rds_configuration {
      resource_arn           = module.kb_db.cluster_arn
      credentials_secret_arn = aws_secretsmanager_secret.kb_db_bedrock.arn
      database_name          = "bedrock_kb"
      table_name             = "bedrock_integration.bedrock_kb"
      field_mapping {
        primary_key_field = "id"
        vector_field      = "embedding"
        text_field        = "chunks"
        metadata_field    = "metadata"
      }
    }
  }

  depends_on = [module.kb_db]
}

resource "aws_bedrockagent_data_source" "s3" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.main.id
  name              = "${local.cluster_name}-knowledge-base-s3"

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = aws_s3_bucket.kb_docs.arn
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "HIERARCHICAL"
      hierarchical_chunking_configuration {
        level_configuration {
          max_tokens = 1500
        }
        level_configuration {
          max_tokens = 300
        }
        overlap_tokens = 60
      }
    }
  }
}

Hierarchical chunking is worth calling out: it splits documents into large "parent" chunks of up to 1500 tokens and smaller "child" chunks of up to 300 tokens, with a 60-token overlap between neighboring chunks. Retrieval matches on the precise child chunk but returns the broader parent context, which gives better recall on long documents than flat fixed-size chunking, at the cost of slightly more complex ingestion.
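
The parent/child idea is easier to see on a toy input. The Python sketch below is an illustration only: it treats whitespace-separated words as tokens and applies the overlap at both levels, which are assumptions made for the example. Bedrock's real tokenizer and splitting rules are not reproduced here, and the function names are hypothetical.

```python
from typing import Dict, List


def split_with_overlap(tokens: List[str], max_tokens: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must be larger than overlap")
    step = max_tokens - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


def hierarchical_chunks(text: str, parent_max: int = 1500,
                        child_max: int = 300, overlap: int = 60) -> List[Dict[str, object]]:
    """Return one record per child chunk, each carrying its parent text."""
    tokens = text.split()  # assumption: words stand in for model tokens
    records: List[Dict[str, object]] = []
    for parent_id, parent in enumerate(split_with_overlap(tokens, parent_max, overlap)):
        for child in split_with_overlap(parent, child_max, overlap):
            records.append({
                "parent_id": parent_id,
                "parent": " ".join(parent),
                "child": " ".join(child),
            })
    return records
```

In this model, a search would score the child field and hand the parent field to the LLM, which is the "match small, return big" behavior described above.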

One-time schema bootstrap

aws_bedrockagent_knowledge_base expects the target table to already exist with the right columns and indexes. Terraform won't create the pgvector extension or table for you, so run this after the cluster exists and before the Knowledge Base resource is applied. It is a one-time psql job against the Aurora endpoint:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE SCHEMA IF NOT EXISTS bedrock_integration;

CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
  id        UUID PRIMARY KEY,
  embedding vector(1024),
  chunks    TEXT,
  metadata  JSON
);

CREATE INDEX IF NOT EXISTS bedrock_kb_embedding_idx
  ON bedrock_integration.bedrock_kb
  USING hnsw (embedding vector_cosine_ops);

CREATE INDEX IF NOT EXISTS bedrock_kb_chunks_idx
  ON bedrock_integration.bedrock_kb
  USING gin (to_tsvector('simple', chunks));

The HNSW index handles approximate nearest-neighbor search on the embedding vector; the GIN index on chunks supports the keyword half of hybrid search (keyword + semantic) if you ever want it. vector(1024) matches Titan Embed Text v2's default output dimensionality. If you swap embedding models later, the column dimension has to match the new model or ingestion fails.
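
For intuition about what the vector index is ranking by, here is a plain-Python sketch of cosine distance (the quantity pgvector's <=> operator returns under vector_cosine_ops) with an exact nearest-neighbor search and a dimension guard. The helper names and the tiny two-dimensional vectors are made up for the example; HNSW approximates this ranking rather than computing it exhaustively.

```python
import math
from typing import Dict, List, Sequence

EXPECTED_DIM = 1024  # matches vector(1024) in the DDL above


def check_dimension(embedding: Sequence[float], expected: int = EXPECTED_DIM) -> None:
    """Raise early if an embedding would not fit the vector column."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, column expects {expected}"
        )


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: Sequence[float], rows: Dict[str, Sequence[float]], k: int = 1) -> List[str]:
    """Exact k-nearest row ids by cosine distance (brute force)."""
    return sorted(rows, key=lambda row_id: cosine_distance(query, rows[row_id]))[:k]
```

Running check_dimension on one sample embedding from your model before writing the DDL is a cheap way to catch a mismatch up front.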

Publishing references via SSM

resource "aws_ssm_parameter" "kb_id" {
  name  = "/${local.env}/knowledge-base/kb_id"
  type  = "String"
  value = aws_bedrockagent_knowledge_base.main.id
}

resource "aws_ssm_parameter" "kb_bucket" {
  name  = "/${local.env}/knowledge-base/bucket"
  type  = "String"
  value = aws_s3_bucket.kb_docs.bucket
}

Apps and CI pipelines read these instead of hardcoding IDs and bucket names, so when you rebuild the KB in a new account or region, nothing downstream needs a code change.
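
On the consuming side, a lookup could look roughly like the Python sketch below. It assumes local.env resolves to the environment name (for example acc) and that ssm_client behaves like boto3.client("ssm"), whose get_parameter call returns the value under Parameter.Value. The helper names are hypothetical.

```python
from typing import Dict


def kb_parameter_names(env: str) -> Dict[str, str]:
    """Build the SSM parameter names published by the Terraform above."""
    return {
        "kb_id": f"/{env}/knowledge-base/kb_id",
        "bucket": f"/{env}/knowledge-base/bucket",
    }


def load_kb_references(ssm_client, env: str) -> Dict[str, str]:
    """Resolve the knowledge base id and bucket name for one environment.

    ssm_client is injected so the function can be exercised without AWS;
    in real use it would be boto3.client("ssm").
    """
    return {
        key: ssm_client.get_parameter(Name=name)["Parameter"]["Value"]
        for key, name in kb_parameter_names(env).items()
    }
```

Passing the client in as an argument keeps the lookup easy to test with a stub and leaves region and credential handling to the caller.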


Syncing docs automatically with GitHub Actions

Provisioning is the easy part. The actual day-to-day value comes from never having to think about ingestion again. Drop a markdown file in a knowledge-base/ folder, push, and it's searchable within minutes.

name: Sync Knowledge Base

on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        description: Environment to sync knowledge base
        options:
          - acc
          - prod
        required: true
  push:
    branches:
      - main
    paths:
      - 'knowledge-base/**'

jobs:
  sync-kb-acc:
    if: >
      (github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'acc') ||
      github.event_name == 'push'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: acc
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_ACC }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_ACC }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_ACC }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base

  sync-kb-prod:
    if: github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'prod'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: prod
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_PROD }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_PROD }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_PROD }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base

The gating logic is the whole trick here:

  • push to main, path-filtered to knowledge-base/** — only fires the acc job. Routine doc edits land in the acc environment automatically, with zero manual steps.
  • workflow_dispatch with environment: acc — also runs the acc job. Useful for re-triggering a sync without a new commit (e.g. after fixing a broken IAM policy).
  • workflow_dispatch with environment: prod — the only path that touches prod. Promotion to production is always a deliberate, manual action, never a side effect of a push.
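
The three cases above can be written out as a small truth table. This Python sketch mirrors the two if: conditions from the workflow; the function name jobs_to_run is hypothetical and exists only to make the gating easy to check.

```python
from typing import List, Optional


def jobs_to_run(event_name: str, environment: Optional[str] = None) -> List[str]:
    """Return the job names whose `if:` condition would be true."""
    jobs: List[str] = []
    # sync-kb-acc: manual dispatch for acc, or any (path-filtered) push
    if (event_name == "workflow_dispatch" and environment == "acc") or event_name == "push":
        jobs.append("sync-kb-acc")
    # sync-kb-prod: only a manual dispatch that explicitly selects prod
    if event_name == "workflow_dispatch" and environment == "prod":
        jobs.append("sync-kb-prod")
    return jobs
```

No event and input combination selects both jobs, and nothing other than a manual dispatch with environment set to prod can reach the prod job.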

Both jobs delegate to the same reusable workflow (sync-bedrock-knowledge-base.yml@v2), parameterized per environment. The reusable workflow does the actual work: sync the source-dir to the S3 bucket-name under bucket-prefix, then call StartIngestionJob against data-source-id. Centralizing that logic in one reusable workflow means every team adopting this pattern gets the same sync behavior — and a fix to the sync logic ships everywhere at once instead of needing fifteen copy-pasted workflow files updated individually.
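
The reusable workflow itself isn't shown in this post, so the following Python sketch is only a guess at what its ingestion step might look like. It assumes a client that behaves like boto3.client("bedrock-agent"), using its start_ingestion_job and get_ingestion_job calls; the function name run_ingestion, the polling interval, and the retry cap are hypothetical.

```python
import time
from typing import Callable

# Statuses after which an ingestion job will not change any more.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def run_ingestion(client, knowledge_base_id: str, data_source_id: str,
                  poll_seconds: float = 10.0, max_polls: int = 60,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Start an ingestion job and poll until it finishes or max_polls is hit.

    Returns the last status seen, e.g. "COMPLETE" or "FAILED".
    """
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    job_id = job["ingestionJobId"]
    status = job["status"]

    polls = 0
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        status = job["status"]
        polls += 1
    return status
```

A CI step built on this would run after the S3 sync and fail the job on anything other than COMPLETE, so a broken ingestion shows up as a red build instead of a quietly stale index.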


Field notes

  • The Data API requires enable_http_endpoint = true on the Aurora cluster. Without it, Bedrock's rds-data:ExecuteStatement calls fail with a confusing connectivity error that has nothing to do with security groups — you'll waste an hour checking VPC routing before you find this.
  • Vector dimension mismatches fail late, at the wrong layer. If vector(1024) doesn't match your embedding model's output size, the table creation succeeds, the Knowledge Base resource creates fine, and only ingestion fails, document by document. Check the embedding model's dimensionality before writing the DDL, not after.
  • Hierarchical chunking is the right default for prose-heavy docs, but if your knowledge base is mostly short, structured files (FAQs, glossaries), flat fixed-size chunking is simpler to reason about and debug.
  • Don't let push-triggered syncs touch prod. It's tempting to make prod sync automatically on a release tag, but a bad doc — wrong numbers, stale instructions — propagating into a production RAG pipeline with no human in the loop is a worse failure mode than a slightly stale prod KB.
  • The acc/prod split needs two of everything — bucket, KB, data source, Aurora cluster — not just two sets of IAM permissions on shared resources. It costs more, but it means a bad ingestion config or chunking change gets caught in acc before it can corrupt the index your production LLM proxy actually queries.

Closing

None of this is exotic — it's a bucket, a Postgres extension, two indexes, and a YAML file with an if: condition. That's the whole trick to "production RAG": treat the knowledge base like any other piece of infrastructure, version it, gate promotion to prod behind a manual step, and let the boring CI pipeline do the boring sync work. Your LLM stops winging it, and you stop being the person who manually re-uploads PDFs every time someone asks why the bot doesn't know about last week's runbook update.