Don't Let Your LLM Wing It: Building a Knowledge Base That Actually Knows Things
Every team eventually asks the same question: "can we make the LLM answer questions about our docs instead of hallucinating something plausible?" The answer is retrieval-augmented generation (RAG), and the unglamorous truth is that RAG is 10% prompt engineering and 90% plumbing — object storage, a vector index, an embedding model, and a pipeline that keeps all three in sync whenever someone edits a markdown file.
This is the plumbing. A Bedrock Knowledge Base backed by Aurora PostgreSQL with pgvector, provisioned entirely in Terraform, synced automatically from a Git repo via GitHub Actions. No notebooks, no manual "let me re-upload the docs" Tuesdays.
The architecture
Two environments — acc and prod — run this same pipeline side by side, each with its own bucket, KB, and database, gated by branch and dispatch logic in the workflow.
Provisioning the vector store
S3 bucket for ingestion
Bedrock pulls documents from S3, not the other way around, so the bucket is just a private, versioned, encrypted drop zone:
resource "aws_s3_bucket" "kb_docs" {
  bucket = "${local.cluster_name}-knowledge-base"
  tags   = local.default_tags
}

resource "aws_s3_bucket_versioning" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Aurora pgvector cluster
This is a separate Aurora cluster from any application database — different lifecycle, different access pattern, and you don't want a runaway ingestion job competing for connections with your app:
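# Generate the master password in Terraform so it can be copied into the
# Secrets Manager secret that Bedrock reads. With
# manage_master_user_password = true, RDS owns the password and Terraform
# never sees it, so there would be nothing to copy.
resource "random_password" "kb_db" {
  length  = 32
  special = false
}

module "kb_db" {
  source  = "terraform-aws-modules/rds-aurora/aws"
  version = "~> 9.0"

  name           = "bedrock-kb"
  engine         = "aurora-postgresql"
  engine_version = "16.6"

  # Serverless v2 instances use the db.serverless instance class.
  instance_class = "db.serverless"
  instances      = { 1 = {} }

  serverlessv2_scaling_configuration = {
    min_capacity = 0.5
    max_capacity = 4
  }

  database_name               = "bedrock_kb"
  master_username             = "root"
  manage_master_user_password = false
  master_password             = random_password.kb_db.result

  # Required for the RDS Data API, which is how Bedrock talks to the cluster.
  enable_http_endpoint = true

  tags = local.default_tags
}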
Bedrock's Knowledge Base service expects credentials in {"username": ..., "password": ...} JSON shape in Secrets Manager. If your Terraform module stores a bare password string (most do), mirror it into a second secret in the right shape rather than fighting the module:
resource "aws_secretsmanager_secret" "kb_db_bedrock" {
  name                    = "bedrock-kb-db-credentials"
  recovery_window_in_days = 0
  tags                    = local.default_tags
}

resource "aws_secretsmanager_secret_version" "kb_db_bedrock" {
  secret_id = aws_secretsmanager_secret.kb_db_bedrock.id

  secret_string = jsonencode({
    username = "root"
    password = module.kb_db.cluster_master_password
  })
}
IAM: the Bedrock service role
Bedrock's Knowledge Base needs to assume a role that can read from S3, describe and execute statements against Aurora via the Data API, read the credentials secret, and invoke the embedding model:
data "aws_iam_policy_document" "bedrock_kb_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["bedrock.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [local.aws_account_id]
    }
  }
}

data "aws_iam_policy_document" "bedrock_kb" {
  statement {
    sid       = "S3Read"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:ListBucket"]
    resources = [aws_s3_bucket.kb_docs.arn, "${aws_s3_bucket.kb_docs.arn}/*"]
  }

  # Bedrock describes the cluster before it issues Data API calls.
  statement {
    sid       = "RDSDescribe"
    effect    = "Allow"
    actions   = ["rds:DescribeDBClusters"]
    resources = [module.kb_db.cluster_arn]
  }

  statement {
    sid       = "RDSDataApi"
    effect    = "Allow"
    actions   = ["rds-data:BatchExecuteStatement", "rds-data:ExecuteStatement"]
    resources = [module.kb_db.cluster_arn]
  }

  statement {
    sid       = "SecretsManagerRead"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.kb_db_bedrock.arn]
  }

  statement {
    sid       = "BedrockEmbeddings"
    effect    = "Allow"
    actions   = ["bedrock:InvokeModel"]
    resources = ["arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"]
  }
}

resource "aws_iam_role" "bedrock_kb" {
  name               = "${local.cluster_name}-bedrock-kb"
  assume_role_policy = data.aws_iam_policy_document.bedrock_kb_assume_role.json
}

resource "aws_iam_role_policy" "bedrock_kb" {
  role   = aws_iam_role.bedrock_kb.id
  policy = data.aws_iam_policy_document.bedrock_kb.json
}
The Knowledge Base resource itself
resource "aws_bedrockagent_knowledge_base" "main" {
  name     = "${local.cluster_name}-knowledge-base"
  role_arn = aws_iam_role.bedrock_kb.arn

  knowledge_base_configuration {
    type = "VECTOR"

    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }

  storage_configuration {
    type = "RDS"

    rds_configuration {
      resource_arn           = module.kb_db.cluster_arn
      credentials_secret_arn = aws_secretsmanager_secret.kb_db_bedrock.arn
      database_name          = "bedrock_kb"
      table_name             = "bedrock_integration.bedrock_kb"

      field_mapping {
        primary_key_field = "id"
        vector_field      = "embedding"
        text_field        = "chunks"
        metadata_field    = "metadata"
      }
    }
  }

  depends_on = [module.kb_db]
}

resource "aws_bedrockagent_data_source" "s3" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.main.id
  name              = "${local.cluster_name}-knowledge-base-s3"

  data_source_configuration {
    type = "S3"

    s3_configuration {
      bucket_arn = aws_s3_bucket.kb_docs.arn
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "HIERARCHICAL"

      hierarchical_chunking_configuration {
        # First level: large "parent" chunks.
        level_configuration {
          max_tokens = 1500
        }

        # Second level: small "child" chunks.
        level_configuration {
          max_tokens = 300
        }

        overlap_tokens = 60
      }
    }
  }
}
Hierarchical chunking is worth calling out: it splits documents into large 1500-token "parent" chunks and smaller 300-token "child" chunks with a 60-token overlap. Retrieval matches on the precise child chunk but can return the broader parent context — better recall on long documents than flat fixed-size chunking, at the cost of slightly more complex ingestion.
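To make the parent/child relationship concrete, here is a minimal Python sketch of two-level chunking over a list of tokens. It is an illustration only, not Bedrock's actual algorithm: the function names (split_with_overlap, hierarchical_chunks) are hypothetical, and applying the same overlap at both levels is an assumption made to keep the example short.

```python
from typing import List, Tuple


def split_with_overlap(tokens: List[str], max_tokens: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, where each window
    repeats the last `overlap` tokens of the previous one."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


def hierarchical_chunks(
    tokens: List[str],
    parent_max: int = 1500,
    child_max: int = 300,
    overlap: int = 60,
) -> Tuple[List[List[str]], List[Tuple[int, List[str]]]]:
    """Return the parent chunks and a list of (parent_index, child_tokens).

    Assumption: the same overlap is used for parents and children.
    """
    parents = split_with_overlap(tokens, parent_max, overlap)
    children: List[Tuple[int, List[str]]] = []
    for parent_index, parent in enumerate(parents):
        for child in split_with_overlap(parent, child_max, overlap):
            children.append((parent_index, child))
    return parents, children
```

Each child keeps the index of its parent, which is what lets a retriever match on the small chunk and hand the larger one to the model.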
One-time schema bootstrap
aws_bedrockagent_knowledge_base expects the target table to already exist with the right columns and indexes — Terraform won't create the pgvector extension or table for you. This is a one-time psql job against the Aurora endpoint:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE SCHEMA IF NOT EXISTS bedrock_integration;

CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
    id        UUID PRIMARY KEY,
    embedding vector(1024),
    chunks    TEXT,
    metadata  JSON
);

CREATE INDEX IF NOT EXISTS bedrock_kb_embedding_idx
    ON bedrock_integration.bedrock_kb
    USING hnsw (embedding vector_cosine_ops);

CREATE INDEX IF NOT EXISTS bedrock_kb_chunks_idx
    ON bedrock_integration.bedrock_kb
    USING gin (to_tsvector('simple', chunks));
The HNSW index handles approximate nearest-neighbor search on the embedding vector; the GIN index on chunks enables hybrid search (keyword + semantic) if you ever want it. vector(1024) matches Titan Embed Text v2's output dimensionality — if you swap embedding models later, the column width has to match or ingestion fails outright.
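Since a dimension mismatch only surfaces at ingestion time, a small pre-flight check can catch it before the DDL is applied. The sketch below is one possible approach, not part of the pipeline described here: the helper names (vector_column_width, check_dimensions) are hypothetical, and it assumes the DDL declares the column as "embedding vector(N)" on a single line.

```python
import re


def vector_column_width(ddl: str, column: str = "embedding") -> int:
    """Extract N from `<column> vector(N)` in a CREATE TABLE statement."""
    pattern = rf"\b{re.escape(column)}\s+vector\((\d+)\)"
    match = re.search(pattern, ddl, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no vector column named {column!r} found in DDL")
    return int(match.group(1))


def check_dimensions(ddl: str, model_dimensions: int, column: str = "embedding") -> None:
    """Raise if the column width differs from the embedding model's output size."""
    width = vector_column_width(ddl, column)
    if width != model_dimensions:
        raise ValueError(
            f"{column} is vector({width}) but the embedding model "
            f"emits {model_dimensions} dimensions"
        )
```

Running check_dimensions against the schema file with model_dimensions=1024 in CI would turn a per-document ingestion failure into a failed build.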
Publishing references via SSM
resource "aws_ssm_parameter" "kb_id" {
  name  = "/${local.env}/knowledge-base/kb_id"
  type  = "String"
  value = aws_bedrockagent_knowledge_base.main.id
}

resource "aws_ssm_parameter" "kb_bucket" {
  name  = "/${local.env}/knowledge-base/bucket"
  type  = "String"
  value = aws_s3_bucket.kb_docs.bucket
}
Apps and CI pipelines read these instead of hardcoding IDs and bucket names, so when you rebuild the KB in a new account or region, nothing downstream needs a code change.
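On the consuming side, reading those parameters might look like the following Python sketch. The function name load_kb_settings is hypothetical, and the client is passed in as an argument (an assumption made so the function can be exercised without AWS access); the parameter paths mirror the Terraform above.

```python
from typing import Any, Dict


def load_kb_settings(ssm_client: Any, env: str) -> Dict[str, str]:
    """Read the KB ID and bucket name published under /<env>/knowledge-base/."""
    settings: Dict[str, str] = {}
    for key in ("kb_id", "bucket"):
        response = ssm_client.get_parameter(Name=f"/{env}/knowledge-base/{key}")
        settings[key] = response["Parameter"]["Value"]
    return settings


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   settings = load_kb_settings(boto3.client("ssm"), env="acc")
#   settings["kb_id"], settings["bucket"]
```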
Syncing docs automatically with GitHub Actions
Provisioning is the easy part. The actual day-to-day value comes from never having to think about ingestion again. Drop a markdown file in a knowledge-base/ folder, push, and it's searchable within minutes.
name: Sync Knowledge Base

on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        description: Environment to sync knowledge base
        options:
          - acc
          - prod
        required: true
  push:
    branches:
      - main
    paths:
      - 'knowledge-base/**'

jobs:
  sync-kb-acc:
    if: >
      github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'acc' ||
      github.event_name == 'push'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: acc
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_ACC }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_ACC }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_ACC }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base

  sync-kb-prod:
    if: github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'prod'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: prod
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_PROD }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_PROD }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_PROD }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base
The gating logic is the whole trick here:
- A push to main, path-filtered to knowledge-base/**, only fires the acc job. Routine doc edits land in the acc environment automatically, with zero manual steps.
- workflow_dispatch with environment: acc also runs the acc job. Useful for re-triggering a sync without a new commit (e.g. after fixing a broken IAM policy).
- workflow_dispatch with environment: prod is the only path that touches prod. Promotion to production is always a deliberate, manual action, never a side effect of a push.
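The same gating rules can be written down as a small truth function, which is a handy way to sanity-check the two if: expressions before editing them. This Python sketch is only a model of the workflow conditions; jobs_to_run is a hypothetical helper, not something GitHub Actions provides.

```python
from typing import List, Optional


def jobs_to_run(event_name: str, environment: Optional[str] = None) -> List[str]:
    """Mirror the two `if:` conditions from the Sync Knowledge Base workflow."""
    jobs: List[str] = []
    # sync-kb-acc: manual dispatch for acc, or any push that passed the path filter.
    if (event_name == "workflow_dispatch" and environment == "acc") or event_name == "push":
        jobs.append("sync-kb-acc")
    # sync-kb-prod: manual dispatch for prod only.
    if event_name == "workflow_dispatch" and environment == "prod":
        jobs.append("sync-kb-prod")
    return jobs
```

No combination of inputs selects both jobs, and no push event ever selects sync-kb-prod.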
Both jobs delegate to the same reusable workflow (sync-bedrock-knowledge-base.yml@v2), parameterized per environment. The reusable workflow does the actual work: sync the source-dir to the S3 bucket-name under bucket-prefix, then call StartIngestionJob against data-source-id. Centralizing that logic in one reusable workflow means every team adopting this pattern gets the same sync behavior — and a fix to the sync logic ships everywhere at once instead of needing fifteen copy-pasted workflow files updated individually.
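The reusable workflow itself is not shown in this post, so the following Python sketch is only a guess at its core logic under stated assumptions: sync_and_ingest is a hypothetical function, both clients are injected (boto3's "s3" and "bedrock-agent" clients in real use), and it uploads files without deleting objects that were removed from the repo, which a real sync would likely also handle.

```python
from pathlib import Path
from typing import Any


def sync_and_ingest(
    s3_client: Any,
    bedrock_agent_client: Any,
    source_dir: str,
    bucket_name: str,
    bucket_prefix: str,
    knowledge_base_id: str,
    data_source_id: str,
) -> str:
    """Upload every file under source_dir to s3://bucket_name/bucket_prefix/,
    then start an ingestion job and return its ID."""
    root = Path(source_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keep the directory layout under the prefix, with forward slashes.
            key = f"{bucket_prefix}/{path.relative_to(root).as_posix()}"
            s3_client.upload_file(str(path), bucket_name, key)

    response = bedrock_agent_client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sync_and_ingest(boto3.client("s3"), boto3.client("bedrock-agent"),
#                   "knowledge-base", "<bucket-name>", "knowledge-base",
#                   "<knowledge-base-id>", "<data-source-id>")
```

The ingestion job runs asynchronously, so a workflow that wants to fail on bad documents would also need to poll the job status after this call returns.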
Field notes
- The Data API requires enable_http_endpoint = true on the Aurora cluster. Without it, Bedrock's rds-data:ExecuteStatement calls fail with a confusing connectivity error that has nothing to do with security groups, and you'll waste an hour checking VPC routing before you find this.
- Vector dimension mismatches fail late, at the wrong layer. If vector(1024) doesn't match your embedding model's output size, the table creation succeeds, the Knowledge Base resource creates fine, and ingestion just fails per-document. Check the embedding model's dimensionality before writing the DDL, not after.
- Hierarchical chunking is the right default for prose-heavy docs, but if your knowledge base is mostly short, structured files (FAQs, glossaries), flat fixed-size chunking is simpler to reason about and debug.
- Don't let push-triggered syncs touch prod. It's tempting to make prod sync automatically on a release tag, but a bad doc (wrong numbers, stale instructions) propagating into a production RAG pipeline with no human in the loop is a worse failure mode than a slightly stale prod KB.
- The acc/prod split needs two of everything (bucket, KB, data source, Aurora cluster), not just two sets of IAM permissions on shared resources. It costs more, but it means a bad ingestion config or chunking change gets caught in acc before it can corrupt the index your production LLM proxy actually queries.
Closing
None of this is exotic — it's a bucket, a Postgres extension, two indexes, and a YAML file with an if: condition. That's the whole trick to "production RAG": treat the knowledge base like any other piece of infrastructure, version it, gate promotion to prod behind a manual step, and let the boring CI pipeline do the boring sync work. Your LLM stops winging it, and you stop being the person who manually re-uploads PDFs every time someone asks why the bot doesn't know about last week's runbook update.