Skip to content

system design

1 post with the tag “system design”

Building a Cross-Project Knowledge Base for the AI Era with the Vault System

Building a Cross-Project Knowledge Base for the AI Era with the Vault System

Section titled “Building a Cross-Project Knowledge Base for the AI Era with the Vault System”

Learning by studying and reproducing real projects is becoming mainstream, but scattered learning materials and broken context make it hard for AI assistants to deliver their full value. This article introduces the Vault system design in the HagiCode project: through a unified storage abstraction layer, AI assistants can understand and access all learning resources, enabling true cross-project knowledge reuse.

In fact, in the AI era, the way we learn new technologies is quietly changing. Traditional approaches like reading books and watching videos still matter, but “studying and reproducing projects” - deeply researching and learning from the code, architecture, and design patterns of excellent open source projects - is clearly becoming more efficient. Running and modifying high-quality open source projects directly is one of the fastest ways to understand real-world engineering practice.

But this approach also brings new challenges.

Learning materials are too scattered. Notes might live in Obsidian, code repositories may be spread across different folders, and an AI assistant’s conversation history becomes a separate data island. Every time you need AI help analyzing a project, you have to manually copy code snippets and organize context, which is quite tedious.

Context keeps getting lost. AI assistants cannot directly access local learning resources, so every conversation starts with re-explaining background information. The code repositories you study update quickly, and manual synchronization is error-prone. Worse still, knowledge is hard to share across multiple learning projects - the design patterns learned in project A are completely unknown to the AI when it works on project B.

At the core, these issues are all forms of “data islands.” If there were a unified storage abstraction layer that let AI assistants understand and access all learning resources, the problem would be solved.

To address these pain points, we made a key design decision while developing HagiCode: build a Vault system as a unified knowledge storage abstraction layer. The impact of that decision may be even greater than you expect - more on that shortly.

The approach shared in this article comes from practical experience in the HagiCode project. HagiCode is an AI coding assistant based on the OpenSpec workflow. Its core idea is that AI should not only be able to “talk,” but also be able to “do” - directly operate on code repositories, execute commands, and run tests. GitHub: github.com/HagiCode-org/site

During development, we found that AI assistants need frequent access to many kinds of user learning resources: code repositories, notes, configuration files, and more. If users had to provide everything manually each time, the experience would be terrible. That led us to design the Vault system.

HagiCode’s Vault system supports four types, each corresponding to different usage scenarios:

TypePurposeTypical Scenario
folderGeneral-purpose folder typeTemporary learning materials, drafts
coderefDesigned specifically for studying code projectsSystematically learning an open source project
obsidianIntegrates with Obsidian note-taking softwareReusing an existing notes library
system-managedManaged automatically by the systemProject configuration, prompt templates, and more

Among them, the coderef type is the most commonly used in HagiCode. It provides a standardized directory structure and AI-readable metadata descriptions for code-study projects. Why design this type specifically? Because studying an open source project is not as simple as “downloading code.” You also need to manage the code itself, learning notes, configuration files, and other content at the same time, and coderef standardizes all of that.

The Vault registry is persisted to the file system as JSON:

_registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");

This design may look simple, but it was carefully considered:

Simple and reliable. JSON is human-readable, making it easy to debug and modify manually. When something goes wrong, you can open the file directly to inspect the state or even repair it by hand - especially useful during development.

Reduced dependencies. File system storage avoids the complexity of a database. There is no need to install and configure an extra database service, which reduces system complexity and maintenance cost.

Concurrency-safe. SemaphoreSlim is used to guarantee thread safety. In an AI coding assistant scenario, multiple operations may access the Vault registry at the same time, so concurrency control is necessary.

The system’s core capability is that it can automatically inject Vault information into the context of AI proposals:

export function buildTargetVaultsText(
vaults: VaultForText[],
template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
const editableVaults = vaults.filter((vault) => vault.accessType === 'write');
const sections = [
buildVaultSection(readOnlyVaults, template.reference),
buildVaultSection(editableVaults, template.editable),
].filter(Boolean);
return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This allows the AI assistant to automatically understand which learning resources are available, without requiring the user to provide context manually every time. It makes the HagiCode experience feel especially natural - tell the AI, “Help me analyze React concurrent rendering,” and it can automatically find the previously registered React learning Vault instead of asking you to paste code over and over again.

The system divides Vaults into two access types:

  • reference (read-only): AI can only use the content for analysis and understanding, without modifying it
  • editable (modifiable): AI can modify the content as needed for the task

This distinction tells the AI which content is “read-only reference” and which content it is allowed to modify, reducing the risk of accidental changes. For example, if you register an open source project’s Vault as learning material, you definitely do not want AI casually editing the code inside it - so mark it as reference. But if it is your own project Vault, you can mark it as editable and let AI help modify the code.

Standardized Structure for a CodeRef Vault

Section titled “Standardized Structure for a CodeRef Vault”

For coderef Vaults, the system provides a standardized directory structure:

my-coderef-vault/
├── index.yaml # vault metadata description
├── AGENTS.md # operating guide for AI assistants
├── docs/ # stores learning notes and documentation
└── repos/ # manages referenced code repositories through Git submodules

What is the design philosophy behind this structure?

docs/ stores learning notes, using Markdown to record your understanding of the code, architecture analysis, and lessons from debugging. These notes are not only for you - AI can understand them too, and will automatically reference them when handling related tasks.

repos/ manages the studied repositories through Git submodules rather than by copying code directly. This has two benefits: first, it stays in sync with upstream, and a single git submodule update fetches the latest code; second, it saves space, because multiple Vaults can reference different versions of the same repository.

index.yaml contains Vault metadata so the AI assistant can quickly understand its purpose and contents. It is essentially a “self-introduction” for the Vault, so the AI knows what it is for the first time it sees it.

AGENTS.md is a guide written specifically for AI assistants, explaining how to handle the content inside the Vault. You can tell the AI things like: “When analyzing this project, focus on code related to performance optimization” or “Do not modify test files.”

Creating a CodeRef Vault is simple:

const createCodeRefVault = async () => {
const response = await VaultService.postApiVaults({
requestBody: {
name: "React Learning Vault",
type: "coderef",
physicalPath: "/Users/developer/vaults/react-learning",
gitUrl: "https://github.com/facebook/react.git"
}
});
// The system will automatically:
// 1. Clone the React repository to vault/repos/react
// 2. Create the docs/ directory for notes
// 3. Generate index.yaml metadata
// 4. Create the AGENTS.md guide file
return response;
};

Then reference this Vault in an AI proposal:

const proposal = composeProposalChiefComplaint({
chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
repositories: [
{ id: "react", gitUrl: "https://github.com/facebook/react.git" }
],
vaults: [
{
id: "react-learning",
name: "React Learning Vault",
type: "coderef",
physicalPath: "/vaults/react-learning",
accessType: "read" // AI can only read, not modify
}
],
quickRequestText: "Pay special attention to the Fiber architecture and scheduler implementation"
});

Scenario 1: Systematically studying open source projects

Create a CodeRef Vault, manage the target repository through Git submodules, and record learning notes in the docs/ directory. AI can access both the code and the notes at the same time, providing more accurate analysis. Notes written while studying a module are automatically referenced by the AI when it later analyzes related code - like having an “assistant” that remembers your previous thinking.

Scenario 2: Reusing an Obsidian notes library

If you are already using Obsidian to manage notes, just register your existing Vault in HagiCode directly. AI can access your knowledge base without manual copy-paste. This feature is especially practical because many people have years of accumulated notes, and once connected, AI can “read” and understand that knowledge system.

Scenario 3: Cross-project knowledge reuse

Multiple AI proposals can reference the same Vault, enabling knowledge reuse across projects. For example, you can create a “design patterns learning Vault” that contains notes and code examples for many design patterns. No matter which project the AI is analyzing, it can refer to the content in that Vault - knowledge does not need to be accumulated repeatedly.

The system strictly validates paths to prevent path traversal attacks:

private static string ResolveFilePath(string vaultRoot, string relativePath)
{
var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
{
throw new BusinessException(VaultRelativePathTraversalCode,
"Vault file paths must stay inside the registered vault root.");
}
return combinedPath;
}

This ensures all file operations stay within the Vault root directory and prevents malicious path access. Security is not something to take lightly. If an AI assistant is going to operate on the file system, the boundaries must be clearly defined.

When using the HagiCode Vault system, there are several things to pay special attention to:

  1. Path safety: Make sure custom paths stay within the allowed scope, otherwise the system will reject the operation. This prevents accidental misuse and potential security risks.

  2. Git submodule management: CodeRef Vaults are best managed with Git submodules instead of directly copying code. The benefits were covered earlier - keeping in sync and saving space. That said, submodules have their own workflow, so first-time users may need a little time to get familiar with them.

  3. File preview limits: The system limits file size (256KB) and quantity (500 files), so oversized files need to be handled in batches. This limit exists for performance reasons. If you run into very large files, you can split them manually or process them another way.

  4. Diagnostic information: Creating a Vault returns diagnostic information that can be used for debugging on failure. Check the diagnostics first when you run into issues - in most cases, that is where you will find the clue.

The HagiCode Vault system is fundamentally solving a simple but profound problem: how to let AI assistants understand and use local knowledge resources.

Through a unified storage abstraction layer, a standardized directory structure, and automated context injection, it delivers a knowledge management model of “register once, reuse everywhere.” Once a Vault is created, AI can automatically access and understand learning notes, code repositories, and documentation resources.

The experience improvement from this design is obvious. There is no longer any need to manually copy code snippets or repeatedly explain background information - the AI assistant becomes more like a teammate who truly understands the project and can provide more valuable help based on existing knowledge.

The Vault system shared in this article is a solution shaped through real trial and error and real optimization during HagiCode development. If you think this design is valuable, that says something about the engineering behind it - and HagiCode itself is worth checking out as well.

If this article helped you:

The public beta has started. Welcome to install it and give it a try.

Thank you for reading. If you found this article useful, please like, save, and share it. This content was created with AI-assisted collaboration, and the final version was reviewed and confirmed by the author.