
knowledge management

2 posts with the tag “knowledge management”

Building a Cross-Project Knowledge Base for the AI Era with the Vault System


Learning by studying and reproducing real projects is becoming mainstream, but scattered learning materials and broken context make it hard for AI assistants to deliver their full value. This article introduces the Vault system design in the HagiCode project: through a unified storage abstraction layer, AI assistants can understand and access all learning resources, enabling true cross-project knowledge reuse.

In fact, in the AI era, the way we learn new technologies is quietly changing. Traditional approaches like reading books and watching videos still matter, but “studying and reproducing projects” - deeply researching and learning from the code, architecture, and design patterns of excellent open source projects - is clearly becoming more efficient. Running and modifying high-quality open source projects directly is one of the fastest ways to understand real-world engineering practice.

But this approach also brings new challenges.

Learning materials are too scattered. Notes might live in Obsidian, code repositories may be spread across different folders, and an AI assistant’s conversation history becomes a separate data island. Every time you need AI help analyzing a project, you have to manually copy code snippets and organize context, which is quite tedious.

Context keeps getting lost. AI assistants cannot directly access local learning resources, so every conversation starts with re-explaining background information. The code repositories you study update quickly, and manual synchronization is error-prone. Worse still, knowledge is hard to share across multiple learning projects - the design patterns learned in project A are completely unknown to the AI when it works on project B.

At the core, these issues are all forms of “data islands.” If there were a unified storage abstraction layer that let AI assistants understand and access all learning resources, the problem would be solved.

To address these pain points, we made a key design decision while developing HagiCode: build a Vault system as a unified knowledge storage abstraction layer. The impact of that decision may be even greater than you expect - more on that shortly.

The approach shared in this article comes from practical experience in the HagiCode project. HagiCode is an AI coding assistant based on the OpenSpec workflow. Its core idea is that AI should not only be able to “talk,” but also be able to “do” - directly operate on code repositories, execute commands, and run tests. GitHub: github.com/HagiCode-org/site

During development, we found that AI assistants need frequent access to many kinds of user learning resources: code repositories, notes, configuration files, and more. If users had to provide everything manually each time, the experience would be terrible. That led us to design the Vault system.

HagiCode’s Vault system supports four types, each corresponding to different usage scenarios:

| Type | Purpose | Typical Scenario |
| --- | --- | --- |
| folder | General-purpose folder type | Temporary learning materials, drafts |
| coderef | Designed specifically for studying code projects | Systematically learning an open source project |
| obsidian | Integrates with Obsidian note-taking software | Reusing an existing notes library |
| system-managed | Managed automatically by the system | Project configuration, prompt templates, and more |

Among them, the coderef type is the most commonly used in HagiCode. It provides a standardized directory structure and AI-readable metadata descriptions for code-study projects. Why design this type specifically? Because studying an open source project is not as simple as “downloading code.” You also need to manage the code itself, learning notes, configuration files, and other content at the same time, and coderef standardizes all of that.

The Vault registry is persisted to the file system as JSON:

_registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");

This design may look simple, but it was carefully considered:

Simple and reliable. JSON is human-readable, making it easy to debug and modify manually. When something goes wrong, you can open the file directly to inspect the state or even repair it by hand - especially useful during development.

Reduced dependencies. File system storage avoids the complexity of a database. There is no need to install and configure an extra database service, which reduces system complexity and maintenance cost.

Concurrency-safe. SemaphoreSlim is used to guarantee thread safety. In an AI coding assistant scenario, multiple operations may access the Vault registry at the same time, so concurrency control is necessary.
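To make the JSON registry idea more concrete, here is a minimal TypeScript sketch of reading and writing such a file. The article does not show the schema of registry.json, so the `RegistryEntry` fields and the `{ vaults: [...] }` wrapper below are assumptions for illustration, not HagiCode's actual format.

```typescript
// Hypothetical registry shape; the real registry.json schema is not shown in the article.
interface RegistryEntry {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
}

interface RegistryFile {
  vaults: RegistryEntry[];
}

// Serialize with indentation so the file stays easy to read and edit by hand.
function serializeRegistry(entries: RegistryEntry[]): string {
  const file: RegistryFile = { vaults: entries };
  return JSON.stringify(file, null, 2) + "\n";
}

// Parse defensively: a hand-edited file may be malformed, so validate each entry.
function parseRegistry(json: string): RegistryEntry[] {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("registry must be a JSON object");
  }
  const vaults = (parsed as { vaults?: unknown }).vaults;
  if (!Array.isArray(vaults)) {
    throw new Error("registry.vaults must be an array");
  }
  return vaults.map((raw, index) => {
    const entry = raw as Partial<RegistryEntry>;
    for (const key of ["id", "name", "type", "physicalPath"] as const) {
      if (typeof entry[key] !== "string") {
        throw new Error(`vault #${index} is missing string field "${key}"`);
      }
    }
    return entry as RegistryEntry;
  });
}
```

Validating on load is what keeps "repair it by hand" safe: a typo in the file surfaces as a clear error instead of a half-loaded registry.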

The system’s core capability is that it can automatically inject Vault information into the context of AI proposals:

export function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
  const editableVaults = vaults.filter((vault) => vault.accessType === 'write');
  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);
  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This allows the AI assistant to automatically understand which learning resources are available, without requiring the user to provide context manually every time. It makes the HagiCode experience feel especially natural - tell the AI, “Help me analyze React concurrent rendering,” and it can automatically find the previously registered React learning Vault instead of asking you to paste code over and over again.
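To see what the injected text could look like, here is a self-contained sketch. The article does not show `buildVaultSection` or the template contents, so the helper body, the `VaultPromptTemplate` shape, and the label strings below are assumptions, and the early return for an empty list is a small variation of my own.

```typescript
interface VaultForText {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
  accessType: "read" | "write";
}

// Assumed template shape: a heading plus one label per access type.
interface VaultPromptTemplate {
  heading: string;
  reference: string;
  editable: string;
}

// Hypothetical helper: renders one labeled list, or "" when there is nothing to list.
function buildVaultSection(vaults: VaultForText[], label: string): string {
  if (vaults.length === 0) return "";
  const lines = vaults.map((v) => `- ${v.name} (${v.type}): ${v.physicalPath}`);
  return `${label}\n${lines.join("\n")}`;
}

function buildTargetVaultsText(vaults: VaultForText[], template: VaultPromptTemplate): string {
  const sections = [
    buildVaultSection(vaults.filter((v) => v.accessType === "read"), template.reference),
    buildVaultSection(vaults.filter((v) => v.accessType === "write"), template.editable),
  ].filter(Boolean);
  // Emit nothing at all when no vault is attached, so the prompt has no empty heading.
  if (sections.length === 0) return "";
  return `\n\n### ${template.heading}\n\n${sections.join("\n")}`;
}

const text = buildTargetVaultsText(
  [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/vaults/react-learning",
      accessType: "read",
    },
  ],
  { heading: "Target Vaults", reference: "Reference (read-only):", editable: "Editable:" },
);
console.log(text);
```

With one read-only vault, the result is a heading followed by a single "Reference" list, which is appended to the proposal text the AI receives.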

The system divides Vaults into two access types:

  • reference (read-only, accessType 'read'): AI can only use the content for analysis and understanding, without modifying it
  • editable (modifiable, accessType 'write'): AI can modify the content as needed for the task

This distinction tells the AI which content is “read-only reference” and which content it is allowed to modify, reducing the risk of accidental changes. For example, if you register an open source project’s Vault as learning material, you definitely do not want AI casually editing the code inside it - so mark it as reference. But if it is your own project Vault, you can mark it as editable and let AI help modify the code.
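The article describes this distinction as information given to the AI. If you also wanted to enforce it in code before any write operation, a guard along these lines would do. This is a hypothetical sketch: `VaultRef` and `assertWritable` are names invented here, and the article does not say whether HagiCode performs such a check.

```typescript
type AccessType = "read" | "write";

interface VaultRef {
  id: string;
  accessType: AccessType;
}

// Hypothetical guard: call before any operation that would change files in a vault.
function assertWritable(vault: VaultRef): void {
  if (vault.accessType !== "write") {
    throw new Error(`Vault "${vault.id}" is registered as reference (read-only)`);
  }
}

// Example: an open source project kept as reference, and your own project kept editable.
const vaults: VaultRef[] = [
  { id: "react-learning", accessType: "read" },
  { id: "my-project", accessType: "write" },
];

// Only vaults registered with accessType "write" are candidates for modification.
const writableIds = vaults
  .filter((vault) => vault.accessType === "write")
  .map((vault) => vault.id);
```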

Standardized Structure for a CodeRef Vault


For coderef Vaults, the system provides a standardized directory structure:

my-coderef-vault/
├── index.yaml # vault metadata description
├── AGENTS.md # operating guide for AI assistants
├── docs/ # stores learning notes and documentation
└── repos/ # manages referenced code repositories through Git submodules

What is the design philosophy behind this structure?

docs/ stores learning notes, using Markdown to record your understanding of the code, architecture analysis, and lessons from debugging. These notes are not only for you - AI can understand them too, and will automatically reference them when handling related tasks.

repos/ manages the studied repositories through Git submodules rather than by copying code directly. This has two benefits: first, it stays in sync with upstream, since running git submodule update --remote fetches the latest upstream code; second, the Vault's own Git history stays small, because it records only a pointer to a specific commit of each repository rather than a copy of its files, and different Vaults can pin different versions of the same repository.

index.yaml contains Vault metadata so the AI assistant can quickly understand its purpose and contents. It is essentially a “self-introduction” for the Vault, so the AI knows what it is for the first time it sees it.

AGENTS.md is a guide written specifically for AI assistants, explaining how to handle the content inside the Vault. You can tell the AI things like: “When analyzing this project, focus on code related to performance optimization” or “Do not modify test files.”

Creating a CodeRef Vault is simple:

const createCodeRefVault = async () => {
  const response = await VaultService.postApiVaults({
    requestBody: {
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      gitUrl: "https://github.com/facebook/react.git"
    }
  });
  // The system will automatically:
  // 1. Clone the React repository to vault/repos/react
  // 2. Create the docs/ directory for notes
  // 3. Generate index.yaml metadata
  // 4. Create the AGENTS.md guide file
  return response;
};

Then reference this Vault in an AI proposal:

const proposal = composeProposalChiefComplaint({
  chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
  repositories: [
    { id: "react", gitUrl: "https://github.com/facebook/react.git" }
  ],
  vaults: [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/vaults/react-learning",
      accessType: "read" // AI can only read, not modify
    }
  ],
  quickRequestText: "Pay special attention to the Fiber architecture and scheduler implementation"
});

Scenario 1: Systematically studying open source projects

Create a CodeRef Vault, manage the target repository through Git submodules, and record learning notes in the docs/ directory. AI can access both the code and the notes at the same time, providing more accurate analysis. Notes written while studying a module are automatically referenced by the AI when it later analyzes related code - like having an “assistant” that remembers your previous thinking.

Scenario 2: Reusing an Obsidian notes library

If you are already using Obsidian to manage notes, just register your existing Vault in HagiCode directly. AI can access your knowledge base without manual copy-paste. This feature is especially practical because many people have years of accumulated notes, and once connected, AI can “read” and understand that knowledge system.

Scenario 3: Cross-project knowledge reuse

Multiple AI proposals can reference the same Vault, enabling knowledge reuse across projects. For example, you can create a “design patterns learning Vault” that contains notes and code examples for many design patterns. No matter which project the AI is analyzing, it can refer to the content in that Vault - knowledge does not need to be accumulated repeatedly.

The system strictly validates paths to prevent path traversal attacks:

private static string ResolveFilePath(string vaultRoot, string relativePath)
{
    var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
    var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
    if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
    {
        throw new BusinessException(VaultRelativePathTraversalCode,
            "Vault file paths must stay inside the registered vault root.");
    }
    return combinedPath;
}

This ensures all file operations stay within the Vault root directory and prevents malicious path access. Security is not something to take lightly. If an AI assistant is going to operate on the file system, the boundaries must be clearly defined.
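The same check is easy to reproduce on the TypeScript side. The sketch below is an illustrative port using Node's `path` module, not HagiCode's code: `resolveVaultPath` is a name invented here, it uses `path.posix` so the behavior is identical on every platform, and it compares paths case-sensitively, which is an assumption that differs from the `OrdinalIgnoreCase` comparison in the C# version.

```typescript
import * as path from "node:path";

// Resolve a vault-relative path and reject anything that escapes the vault root.
function resolveVaultPath(vaultRoot: string, relativePath: string): string {
  const root = path.posix.resolve(vaultRoot);
  const rootWithSeparator = root.endsWith("/") ? root : root + "/";
  // resolve() collapses "." and ".." segments, so traversal attempts become visible here.
  const combined = path.posix.resolve(root, relativePath);
  // The trailing separator stops "/vaults/react-evil" from matching the root "/vaults/react".
  if (combined !== root && !combined.startsWith(rootWithSeparator)) {
    throw new Error("Vault file paths must stay inside the registered vault root.");
  }
  return combined;
}

// A normal relative path resolves to a location inside the vault.
console.log(resolveVaultPath("/vaults/react", "docs/notes.md"));
```

Inputs such as `../react-evil/x`, `docs/../../other`, or an absolute path like `/etc/passwd` all resolve to locations outside the root and are rejected.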

When using the HagiCode Vault system, there are several things to pay special attention to:

  1. Path safety: Make sure custom paths stay within the allowed scope, otherwise the system will reject the operation. This prevents accidental misuse and potential security risks.

  2. Git submodule management: CodeRef Vaults are best managed with Git submodules instead of directly copying code. The benefits were covered earlier - keeping in sync and saving space. That said, submodules have their own workflow, so first-time users may need a little time to get familiar with them.

  3. File preview limits: The system limits file size (256KB) and quantity (500 files), so oversized files need to be handled in batches. This limit exists for performance reasons. If you run into very large files, you can split them manually or process them another way.

  4. Diagnostic information: Creating a Vault returns diagnostic information that can be used for debugging on failure. Check the diagnostics first when you run into issues - in most cases, that is where you will find the clue.

The HagiCode Vault system is fundamentally solving a simple but profound problem: how to let AI assistants understand and use local knowledge resources.

Through a unified storage abstraction layer, a standardized directory structure, and automated context injection, it delivers a knowledge management model of “register once, reuse everywhere.” Once a Vault is created, AI can automatically access and understand learning notes, code repositories, and documentation resources.

The experience improvement from this design is obvious. There is no longer any need to manually copy code snippets or repeatedly explain background information - the AI assistant becomes more like a teammate who truly understands the project and can provide more valuable help based on existing knowledge.

The Vault system shared in this article is a solution shaped through real trial and error and real optimization during HagiCode development. If you find this design valuable, HagiCode itself may be worth a look as well.

If this article helped you, the public beta has started, and you are welcome to install it and give it a try.

Thank you for reading. If you found this article useful, please like, save, and share it. This content was created with AI-assisted collaboration, and the final version was reviewed and confirmed by the author.

How to Reproduce Projects in the AI Era: Vault, a Cross-Project Persistent Storage System


In the era of AI-assisted development, how can we help AI assistants better understand our learning resources? The HagiCode project built the Vault system as a unified knowledge storage abstraction layer that AI can understand, greatly improving the efficiency of learning through project reproduction.

In the AI era, the way developers learn new technologies and architectures is changing profoundly. “Reproducing projects” - that is, deeply studying and learning from the code, architecture, and design patterns of excellent open source projects - has become an efficient way to learn. Compared with traditional methods like reading books or watching videos, directly reading and running high-quality open source projects helps you understand real-world engineering practices much faster.

Still, this learning method comes with quite a few challenges.

Learning materials are too scattered. Your notes may live in Obsidian, code repositories may be scattered across different folders, and your AI assistant’s conversation history becomes yet another isolated data island. When you want AI to help analyze a project, you have to manually copy code snippets and organize context, which is rather tedious.

What is even more troublesome is the broken context. AI assistants cannot directly access your local learning resources, so you have to provide background information again in every conversation. On top of that, reproduced code repositories update quickly, manual syncing is error-prone, and knowledge is hard to share across multiple learning projects.

At the root, all of these problems come from “data islands.” If there were a unified storage abstraction layer that allowed AI assistants to understand and access all your learning resources, the problem would be solved neatly.

The Vault system shared in this article is exactly the solution we developed while building HagiCode. HagiCode is an AI coding assistant project, and in our daily development work we often need to study and refer to many different open source projects. To help AI assistants better understand these learning resources, we designed Vault, a cross-project persistent storage system.

This solution has already been validated in HagiCode in real use. If you are facing similar knowledge management challenges, I hope these experiences can offer some inspiration. After all, once you’ve fallen into a few pits yourself, you should leave something behind for the next person.

The core idea of the Vault system is simple: create a unified knowledge storage abstraction layer that AI can understand. From an implementation perspective, the system has several key characteristics.

The system supports four vault types, each corresponding to a different usage scenario:

// folder: general-purpose folder type
export const DEFAULT_VAULT_TYPE = 'folder';
// coderef: a type specifically for reproduced code projects
export const CODEREF_VAULT_TYPE = 'coderef';
// obsidian: integrated with Obsidian note-taking software
export const OBSIDIAN_VAULT_TYPE = 'obsidian';
// system-managed: vault automatically managed by the system
export const SYSTEM_MANAGED_VAULT_TYPE = 'system-managed';

Among them, the coderef type is the most commonly used in HagiCode. It is specifically designed for reproduced code projects, providing a standardized directory structure and AI-readable metadata descriptions.
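On the TypeScript side, the four constants can also be expressed as a union type with a runtime guard, which helps when the type arrives as untrusted input. This is a sketch under assumptions: `VAULT_TYPES`, `isVaultType`, and `normalizeVaultType` are names invented here, and falling back to 'folder' for unknown values is an assumed policy suggested only by the name `DEFAULT_VAULT_TYPE`.

```typescript
const VAULT_TYPES = ["folder", "coderef", "obsidian", "system-managed"] as const;

// Union of the four literal strings above.
type VaultType = (typeof VAULT_TYPES)[number];

// Type guard for untrusted input such as a request body or a hand-edited registry.
function isVaultType(value: unknown): value is VaultType {
  return typeof value === "string" && VAULT_TYPES.some((known) => known === value);
}

// Fall back to the general-purpose type when the input is missing or unknown (assumed policy).
function normalizeVaultType(value: unknown): VaultType {
  return isVaultType(value) ? value : "folder";
}
```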

The Vault registry is stored persistently in JSON format, ensuring that the configuration remains available after the application restarts:

public class VaultRegistryStore : IVaultRegistryStore
{
    private readonly string _registryFilePath;

    public VaultRegistryStore(IConfiguration configuration, ILogger<VaultRegistryStore> logger)
    {
        var dataDir = configuration["DataDir"] ?? "./data";
        var absoluteDataDir = Path.IsPathRooted(dataDir)
            ? dataDir
            : Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), dataDir));
        _registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");
    }
}

The advantage of this design is that it is simple and reliable. JSON is human-readable, which makes debugging and manual editing easier; filesystem storage avoids the complexity of a database and reduces system dependencies. After all, sometimes the simplest option really is the best one.
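For readers more at home in TypeScript, the path resolution in that constructor can be sketched as follows. This is an illustrative port, not HagiCode code: `resolveRegistryPath` is a name invented here, the working directory is passed in as a parameter to keep the function testable, and `path.posix` is used so the result does not depend on the platform.

```typescript
import * as path from "node:path";

// Mirrors the constructor logic: default to "./data", make a relative directory absolute
// against the working directory, then append the fixed registry location.
function resolveRegistryPath(dataDir: string | undefined, cwd: string): string {
  const dir = dataDir ?? "./data";
  const absoluteDataDir = path.posix.isAbsolute(dir) ? dir : path.posix.resolve(cwd, dir);
  return path.posix.join(absoluteDataDir, "personal-data", "vaults", "registry.json");
}

// With no DataDir configured, the registry lands under <cwd>/data.
console.log(resolveRegistryPath(undefined, "/srv/app"));
```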

Most importantly, the system can automatically inject vault information into the context of AI proposals:

export function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
  const editableVaults = vaults.filter((vault) => vault.accessType === 'write');
  if (readOnlyVaults.length === 0 && editableVaults.length === 0) {
    return '';
  }
  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);
  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This enables an important capability: AI assistants can automatically understand the available learning resources without users manually providing context. You could say that counts as a kind of tacit understanding.

The standardized structure of CodeRef Vault


For the coderef type of vault, HagiCode provides a standardized directory structure:

my-coderef-vault/
├── index.yaml # vault metadata description
├── AGENTS.md # operating guide for AI assistants
├── docs/ # stores study notes and documents
└── repos/ # manages reproduced code repositories through Git submodules

When creating a vault, the system automatically initializes this structure:

private async Task EnsureCodeRefStructureAsync(
    string vaultName,
    string physicalPath,
    ICollection<VaultBootstrapDiagnosticDto> diagnostics,
    CancellationToken cancellationToken)
{
    Directory.CreateDirectory(physicalPath);
    var indexPath = Path.Combine(physicalPath, CodeRefIndexFileName);
    var docsPath = Path.Combine(physicalPath, CodeRefDocsDirectoryName);
    var reposPath = Path.Combine(physicalPath, CodeRefReposDirectoryName);
    // Create the standard directory structure
    if (!Directory.Exists(docsPath))
    {
        Directory.CreateDirectory(docsPath);
    }
    if (!Directory.Exists(reposPath))
    {
        Directory.CreateDirectory(reposPath);
    }
    // Create the AGENTS.md guide
    await EnsureCodeRefAgentsDocumentAsync(physicalPath, cancellationToken);
    // Create the index.yaml metadata
    // Note: mergedDocument is not defined in this excerpt; it comes from code omitted here.
    await WriteCodeRefIndexDocumentAsync(indexPath, mergedDocument, cancellationToken);
}

This structure is carefully designed as well:

  • docs/ stores your study notes, where you can record your understanding of the code, architecture analysis, lessons learned, and so on in Markdown
  • repos/ manages reproduced repositories through Git submodules instead of copying code directly, which keeps the code in sync and saves space
  • index.yaml contains the vault metadata so AI assistants can quickly understand the purpose and contents of the vault
  • AGENTS.md is a guide written specifically for AI assistants, explaining how to handle the contents of the vault

Organized this way, perhaps AI can understand what you have in mind a little more easily.
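Because the layout is fixed, it is also easy to check whether an existing directory already follows it. The helper below is a hypothetical sketch (`missingCodeRefEntries` is not part of HagiCode): given the names found in a vault root, it reports which of the four standard entries are still missing.

```typescript
// Entry names taken from the directory layout described in the article.
const CODEREF_LAYOUT = ["index.yaml", "AGENTS.md", "docs", "repos"];

// Given the names that already exist in the vault root, list the standard entries
// that are absent. An empty result means the layout is complete.
function missingCodeRefEntries(existingEntries: string[]): string[] {
  const existing = new Set(existingEntries);
  return CODEREF_LAYOUT.filter((name) => !existing.has(name));
}

// Example: a folder that so far contains only notes and metadata.
const missing = missingCodeRefEntries(["docs", "index.yaml"]);
console.log(missing);
```

A check like this could run before registering an existing folder as a coderef vault, so that only the absent pieces need to be created.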

Automatic initialization for system-managed vaults


In addition to manually created vaults, HagiCode also supports system-managed vaults:

public async Task<IReadOnlyList<VaultRegistryEntry>> EnsureAllSystemManagedVaultsAsync(
    CancellationToken cancellationToken = default)
{
    var definitions = GetAllResolvedDefinitions();
    var entries = new List<VaultRegistryEntry>(definitions.Count);
    foreach (var definition in definitions)
    {
        entries.Add(await EnsureResolvedSystemManagedVaultAsync(definition, cancellationToken));
    }
    return entries;
}

The system automatically creates and manages the following vaults:

  • hagiprojectdata: project data storage used to save project configuration and state
  • personaldata: personal data storage used to save user preferences
  • hbsprompt: a prompt template library used to manage commonly used AI prompts

These vaults are initialized automatically when the system starts, so users do not need to configure them manually. Some things are simply better left to the system instead of humans worrying about them.
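Running this at every startup only works if the "ensure" step is idempotent. The TypeScript sketch below illustrates that property under stated assumptions: the three ids come from the list above, while the `VaultEntry` shape, the placeholder paths, and `ensureSystemManagedVaults` itself are invented for illustration.

```typescript
interface VaultEntry {
  id: string;
  type: string;
  physicalPath: string;
}

// Ids come from the article; the paths are placeholders for illustration only.
const SYSTEM_VAULT_DEFINITIONS: VaultEntry[] = [
  { id: "hagiprojectdata", type: "system-managed", physicalPath: "/data/hagiprojectdata" },
  { id: "personaldata", type: "system-managed", physicalPath: "/data/personaldata" },
  { id: "hbsprompt", type: "system-managed", physicalPath: "/data/hbsprompt" },
];

// Return a registry that contains every system vault, adding only the missing ones.
// Existing entries are kept untouched, so calling this on every startup is safe.
function ensureSystemManagedVaults(registry: VaultEntry[], definitions: VaultEntry[]): VaultEntry[] {
  const knownIds = new Set(registry.map((entry) => entry.id));
  const missing = definitions.filter((definition) => !knownIds.has(definition.id));
  return [...registry, ...missing];
}

// First start adds three entries; a second start changes nothing.
const userVault: VaultEntry = { id: "react-learning", type: "coderef", physicalPath: "/vaults/react-learning" };
const afterFirstStart = ensureSystemManagedVaults([userVault], SYSTEM_VAULT_DEFINITIONS);
const afterSecondStart = ensureSystemManagedVaults(afterFirstStart, SYSTEM_VAULT_DEFINITIONS);
```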

An important part of the design is access control. The system divides vaults into two access types:

export interface VaultForText {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
  accessType: 'read' | 'write'; // Key: distinguish read-only from editable
}

  • reference (read-only, accessType 'read'): AI may use the content only for analysis and understanding and cannot modify it. Suitable for referenced open source projects, documents, and similar materials
  • editable (accessType 'write'): AI can modify content as needed for the task. Suitable for your notes, drafts, and similar materials

This distinction matters. It tells AI which content is “read-only reference” and which content is “safe to edit,” reducing the risk of accidental changes. After all, nobody wants their hard work to disappear because of an unintended edit.

Now that we’ve covered the ideas, let’s look at how it works in practice.

Here is a complete frontend call example:

const createCodeRefVault = async () => {
  const response = await VaultService.postApiVaults({
    requestBody: {
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      gitUrl: "https://github.com/facebook/react.git"
    }
  });
  // The system will automatically:
  // 1. Clone the React repository into vault/repos/react
  // 2. Create the docs/ directory for notes
  // 3. Generate the index.yaml metadata
  // 4. Create the AGENTS.md guide file
  return response;
};

This API call completes a series of actions: creating the directory structure, initializing Git submodules, generating metadata files, and more. You only need to provide the basic information and let the system handle the rest. It is honestly a fairly worry-free approach.

After creating the vault, you can reference it in an AI proposal:

const proposal = composeProposalChiefComplaint({
  chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
  repositories: [
    { id: "react", gitUrl: "https://github.com/facebook/react.git" }
  ],
  vaults: [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/vaults/react-learning",
      accessType: "read" // AI can only read, not modify
    }
  ],
  quickRequestText: "Focus on the Fiber architecture and scheduler implementation"
});

The system automatically injects vault information into the AI context, letting AI know which learning resources are available. That kind of tacit understanding between you and the AI is hard to come by.

While using the Vault system, we have summarized a few lessons learned.

The system strictly validates paths to prevent path traversal attacks:

private static string ResolveFilePath(string vaultRoot, string relativePath)
{
    var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
    var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
    if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
    {
        throw new BusinessException(VaultRelativePathTraversalCode,
            "Vault file paths must stay inside the registered vault root.");
    }
    return combinedPath;
}

This is important. If you customize a vault path, make sure it stays within the allowed range, otherwise the system will reject the operation. You really cannot overemphasize security.

CodeRef Vault recommends Git submodules instead of directly copying code:

private static string BuildCodeRefAgentsContent()
{
    return """
        # CodeRef Vault Guide
        Repositories under `repos/` should be maintained through Git submodules
        rather than copied directly into the vault root.
        Keep this structure stable so assistants and tools can understand the vault quickly.
        """ + Environment.NewLine;
}

This brings several advantages: keeping code synchronized with upstream, saving disk space, and making it easier to manage multiple versions of the code. After all, who wants to download the same thing again and again?

To prevent performance problems, the system limits the number of files it enumerates and the size of file previews:

private const int FileEnumerationLimit = 500;
private const int PreviewByteLimit = 256 * 1024; // 256KB

If your vault contains a large number of files or very large files, preview performance may be affected. In that case, you can consider processing files in batches or using specialized search tools. Sometimes when something gets too large, it becomes harder to handle, not easier.
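As a rough illustration of how two limits like these might be applied together, here is a TypeScript sketch. Only the two numbers come from the constants above. The `FileInfo` and `PreviewPlan` shapes, the `planPreview` function, and the decision to skip oversized files rather than truncate them are assumptions, since the article does not show how HagiCode handles files that exceed the limits.

```typescript
const FILE_ENUMERATION_LIMIT = 500;
const PREVIEW_BYTE_LIMIT = 256 * 1024; // 256 KB

interface FileInfo {
  path: string;
  sizeBytes: number;
}

interface PreviewPlan {
  previewable: FileInfo[];
  tooLarge: FileInfo[];
  truncated: boolean; // true when more files exist than were enumerated
}

// List at most FILE_ENUMERATION_LIMIT files, then separate those too large to preview.
function planPreview(files: FileInfo[]): PreviewPlan {
  const listed = files.slice(0, FILE_ENUMERATION_LIMIT);
  return {
    previewable: listed.filter((file) => file.sizeBytes <= PREVIEW_BYTE_LIMIT),
    tooLarge: listed.filter((file) => file.sizeBytes > PREVIEW_BYTE_LIMIT),
    truncated: files.length > FILE_ENUMERATION_LIMIT,
  };
}
```

Reporting `truncated` and `tooLarge` explicitly makes it clear to the caller when a batch or a dedicated search tool is needed, instead of silently dropping files.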

When creating a vault, the system returns diagnostic information to help with debugging:

List<VaultBootstrapDiagnosticDto> bootstrapDiagnostics = [];
if (IsCodeRefVaultType(normalizedType))
{
    bootstrapDiagnostics = await EnsureCodeRefBootstrapAsync(
        normalizedName,
        normalizedPhysicalPath,
        normalizedGitUrl,
        cancellationToken);
}

If creation fails, you can inspect the diagnostic information to understand the specific cause. When something goes wrong, checking the diagnostics is often the most direct way forward.
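On the client side, a small helper can put the most relevant diagnostics first. The sketch below is hypothetical throughout: the fields of `VaultBootstrapDiagnosticDto` are not shown in the article, so the `severity` and `message` fields and the `formatDiagnostics` function are assumptions made purely for illustration.

```typescript
// Hypothetical diagnostic shape; the real DTO fields are not shown in the article.
interface BootstrapDiagnostic {
  severity: "info" | "warning" | "error";
  message: string;
}

// Order diagnostics by severity so errors come first, then format one line per entry.
function formatDiagnostics(diagnostics: BootstrapDiagnostic[]): string[] {
  const rank = { error: 0, warning: 1, info: 2 };
  return [...diagnostics]
    .sort((a, b) => rank[a.severity] - rank[b.severity])
    .map((diagnostic) => `[${diagnostic.severity}] ${diagnostic.message}`);
}

// Example messages are made up for this sketch.
const lines = formatDiagnostics([
  { severity: "info", message: "Created docs/ directory" },
  { severity: "error", message: "Could not clone repository" },
  { severity: "warning", message: "AGENTS.md already existed" },
]);
console.log(lines.join("\n"));
```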

Through a unified storage abstraction layer, the Vault system solves several core pain points of reproducing projects in the AI era:

  • Centralized knowledge management: all learning resources are gathered in one place instead of scattered everywhere
  • Automatic AI context injection: AI assistants can automatically understand the available learning resources without manual context setup
  • Cross-project knowledge reuse: knowledge can be shared and reused across multiple learning projects
  • Standardized directory structure: a consistent directory layout lowers the learning curve

This solution has already been validated in the HagiCode project. If you are also building tools related to AI-assisted development, or facing similar knowledge management problems, I hope these experiences can serve as a useful reference.

In truth, the value of a technical solution does not lie in how complicated it is, but in whether it solves real problems. The core idea of the Vault system is very simple: build a unified knowledge storage layer that AI can understand. Yet it is precisely this simple abstraction that improved our development efficiency quite a bit.

Sometimes the simple approach really is the best one. After all, complicated things often hide even more pitfalls…


If this article helped you, feel free to give the project a Star on GitHub, or visit the official website to learn more about HagiCode. The public beta has already started, and you can experience the full AI coding assistant features as soon as you install it.

Maybe you should give it a try as well…

Thank you for reading. If you found this article useful, feel free to like, bookmark, and share it. This content was created with AI-assisted collaboration, and the final content was reviewed and confirmed by the author.