How to Reproduce Projects in the AI Era: Vault, a Cross-Project Persistent Storage System

In the era of AI-assisted development, how can we help AI assistants better understand our learning resources? The HagiCode project built the Vault system as a unified knowledge storage abstraction layer that AI can understand, greatly improving the efficiency of learning through project reproduction.

In the AI era, the way developers learn new technologies and architectures is changing profoundly. “Reproducing projects,” that is, closely studying the code, architecture, and design patterns of excellent open source projects, has become an efficient way to learn. Compared with traditional methods like reading books or watching videos, directly reading and running high-quality open source projects helps you understand real-world engineering practices much faster.

Still, this learning method comes with quite a few challenges.

Learning materials are too scattered. Your notes may live in Obsidian, code repositories may be scattered across different folders, and your AI assistant’s conversation history becomes yet another isolated data island. When you want AI to help analyze a project, you have to manually copy code snippets and organize context, which is rather tedious.

What is even more troublesome is the broken context. AI assistants cannot directly access your local learning resources, so you have to provide background information again in every conversation. On top of that, reproduced code repositories update quickly, manual syncing is error-prone, and knowledge is hard to share across multiple learning projects.

At the root, all of these problems come from “data islands.” If there were a unified storage abstraction layer that allowed AI assistants to understand and access all your learning resources, the problem would be solved neatly.

The Vault system shared in this article is exactly the solution we developed while building HagiCode. HagiCode is an AI coding assistant project, and in our daily development work we often need to study and refer to many different open source projects. To help AI assistants better understand these learning resources, we designed Vault, a cross-project persistent storage system.

This solution has already been validated in HagiCode in real use. If you are facing similar knowledge management challenges, I hope these experiences can offer some inspiration. After all, once you’ve fallen into a few pits yourself, you should leave something behind for the next person.

The core idea of the Vault system is simple: create a unified knowledge storage abstraction layer that AI can understand. From an implementation perspective, the system has several key characteristics.

The system supports four vault types, each corresponding to a different usage scenario:

// folder: general-purpose folder type
export const DEFAULT_VAULT_TYPE = 'folder';
// coderef: a type specifically for reproduced code projects
export const CODEREF_VAULT_TYPE = 'coderef';
// obsidian: integrated with Obsidian note-taking software
export const OBSIDIAN_VAULT_TYPE = 'obsidian';
// system-managed: vault automatically managed by the system
export const SYSTEM_MANAGED_VAULT_TYPE = 'system-managed';

Among them, the coderef type is the most commonly used in HagiCode. It is specifically designed for reproduced code projects, providing a standardized directory structure and AI-readable metadata descriptions.

The Vault registry is stored persistently in JSON format, ensuring that the configuration remains available after the application restarts:

public class VaultRegistryStore : IVaultRegistryStore
{
    private readonly string _registryFilePath;

    public VaultRegistryStore(IConfiguration configuration, ILogger<VaultRegistryStore> logger)
    {
        var dataDir = configuration["DataDir"] ?? "./data";
        var absoluteDataDir = Path.IsPathRooted(dataDir)
            ? dataDir
            : Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), dataDir));
        _registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");
    }
}

The advantage of this design is that it is simple and reliable. JSON is human-readable, which makes debugging and manual editing easier; filesystem storage avoids the complexity of a database and reduces system dependencies. After all, sometimes the simplest option really is the best one.
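To make the "plain JSON file" idea concrete, here is a minimal TypeScript sketch of reading and writing such a registry. The article does not show the actual schema of registry.json, so the `RegistryEntry` and `RegistryFile` shapes and both function names are assumptions for illustration only. File I/O is left out on purpose; the functions work on the file's text.

```typescript
// Hypothetical entry shape; the real registry.json schema is not shown in the article.
interface RegistryEntry {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
}

// Hypothetical file layout: a version marker plus the list of vaults.
interface RegistryFile {
  version: number;
  vaults: RegistryEntry[];
}

function isRegistryEntry(value: unknown): value is RegistryEntry {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === 'string' &&
    typeof v.name === 'string' &&
    typeof v.type === 'string' &&
    typeof v.physicalPath === 'string'
  );
}

// Pretty-print so the file stays easy to read and edit by hand.
function serializeRegistry(vaults: RegistryEntry[]): string {
  const file: RegistryFile = { version: 1, vaults };
  return JSON.stringify(file, null, 2) + '\n';
}

// Treat a missing or empty file as an empty registry; reject malformed entries.
function parseRegistry(text: string | undefined): RegistryEntry[] {
  if (text === undefined || text.trim() === '') return [];
  const parsed: unknown = JSON.parse(text);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('registry.json must contain a JSON object');
  }
  const vaults = (parsed as { vaults?: unknown }).vaults;
  if (!Array.isArray(vaults) || !vaults.every(isRegistryEntry)) {
    throw new Error('registry.json has a malformed "vaults" array');
  }
  return vaults;
}
```

Validating on load matters precisely because the file is meant to be hand-editable: a typo should surface as a clear error instead of a half-loaded registry.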

Most importantly, the system can automatically inject vault information into the context of AI proposals:

export function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
  const editableVaults = vaults.filter((vault) => vault.accessType === 'write');
  if (readOnlyVaults.length === 0 && editableVaults.length === 0) {
    return '';
  }
  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);
  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This enables an important capability: AI assistants can automatically understand the available learning resources without users manually providing context. You could say that counts as a kind of tacit understanding.
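The excerpt above relies on a template and a `buildVaultSection` helper that the article does not show. The sketch below fills them in with hypothetical versions so the function can be run end to end. The template wording and the bullet format are assumptions and not HagiCode's actual prompt text.

```typescript
interface VaultForText {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
  accessType: 'read' | 'write';
}

// Hypothetical template shape and wording; the real template is not shown in the article.
interface VaultPromptTemplate {
  heading: string;
  reference: string;
  editable: string;
}

const DEFAULT_VAULT_PROMPT_TEMPLATE: VaultPromptTemplate = {
  heading: 'Target Vaults',
  reference: 'Reference vaults (read-only):',
  editable: 'Editable vaults:',
};

// Hypothetical helper: a label followed by one bullet per vault, or '' when the list is empty.
function buildVaultSection(vaults: VaultForText[], label: string): string {
  if (vaults.length === 0) return '';
  const lines = vaults.map((v) => `- ${v.name} [${v.type}] at ${v.physicalPath}`);
  return [label, ...lines].join('\n');
}

// Same logic as the excerpt above, repeated here so the sketch is self-contained.
function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((v) => v.accessType === 'read');
  const editableVaults = vaults.filter((v) => v.accessType === 'write');
  if (readOnlyVaults.length === 0 && editableVaults.length === 0) return '';
  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);
  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

// Text that would be appended to the prompt for one read-only and one editable vault.
const promptSuffix = buildTargetVaultsText([
  { id: 'react-learning', name: 'React Learning Vault', type: 'coderef',
    physicalPath: '/Users/developer/vaults/react-learning', accessType: 'read' },
  { id: 'notes', name: 'My Notes', type: 'folder',
    physicalPath: '/Users/developer/vaults/notes', accessType: 'write' },
]);
```

With no vaults the function returns an empty string, so a prompt without vaults is left unchanged.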

The standardized structure of CodeRef Vault

For the coderef type of vault, HagiCode provides a standardized directory structure:

my-coderef-vault/
├── index.yaml # vault metadata description
├── AGENTS.md # operating guide for AI assistants
├── docs/ # stores study notes and documents
└── repos/ # manages reproduced code repositories through Git submodules

When creating a vault, the system automatically initializes this structure:

private async Task EnsureCodeRefStructureAsync(
    string vaultName,
    string physicalPath,
    ICollection<VaultBootstrapDiagnosticDto> diagnostics,
    CancellationToken cancellationToken)
{
    Directory.CreateDirectory(physicalPath);
    var indexPath = Path.Combine(physicalPath, CodeRefIndexFileName);
    var docsPath = Path.Combine(physicalPath, CodeRefDocsDirectoryName);
    var reposPath = Path.Combine(physicalPath, CodeRefReposDirectoryName);

    // Create the standard directory structure
    if (!Directory.Exists(docsPath))
    {
        Directory.CreateDirectory(docsPath);
    }
    if (!Directory.Exists(reposPath))
    {
        Directory.CreateDirectory(reposPath);
    }

    // Create the AGENTS.md guide
    await EnsureCodeRefAgentsDocumentAsync(physicalPath, cancellationToken);

    // Create the index.yaml metadata
    // (the construction of mergedDocument is not shown in this excerpt)
    await WriteCodeRefIndexDocumentAsync(indexPath, mergedDocument, cancellationToken);
}

This structure is carefully designed as well:

  • docs/ stores your study notes, where you can record your understanding of the code, architecture analysis, lessons learned, and so on in Markdown
  • repos/ manages reproduced repositories through Git submodules instead of copying code directly, which keeps the code in sync and saves space
  • index.yaml contains the vault metadata so AI assistants can quickly understand the purpose and contents of the vault
  • AGENTS.md is a guide written specifically for AI assistants, explaining how to handle the contents of the vault

Organized this way, perhaps AI can understand what you have in mind a little more easily.
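Because the layout is fixed, a tool can cheaply check whether a directory looks like a CodeRef vault before handing it to an assistant. The sketch below is one hypothetical way to do that: it takes a top-level listing (directories marked with a trailing slash, which is a convention of this sketch only) and reports what is missing. The function name is invented for illustration.

```typescript
// Entries taken from the layout above; the trailing slash marks directories.
const CODEREF_REQUIRED_ENTRIES = ['index.yaml', 'AGENTS.md', 'docs/', 'repos/'] as const;

// Returns the required entries that are absent from a top-level listing of the vault.
function findMissingCodeRefEntries(topLevelEntries: string[]): string[] {
  return CODEREF_REQUIRED_ENTRIES.filter(
    (entry) => topLevelEntries.indexOf(entry) === -1,
  );
}

// Example: a vault where only the notes directory and metadata exist so far.
const missingEntries = findMissingCodeRefEntries(['docs/', 'index.yaml']);
```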

Automatic initialization for system-managed vaults

In addition to manually created vaults, HagiCode also supports system-managed vaults:

public async Task<IReadOnlyList<VaultRegistryEntry>> EnsureAllSystemManagedVaultsAsync(
    CancellationToken cancellationToken = default)
{
    var definitions = GetAllResolvedDefinitions();
    var entries = new List<VaultRegistryEntry>(definitions.Count);
    foreach (var definition in definitions)
    {
        entries.Add(await EnsureResolvedSystemManagedVaultAsync(definition, cancellationToken));
    }
    return entries;
}

The system automatically creates and manages the following vaults:

  • hagiprojectdata: project data storage used to save project configuration and state
  • personaldata: personal data storage used to save user preferences
  • hbsprompt: a prompt template library used to manage commonly used AI prompts

These vaults are initialized automatically when the system starts, so users do not need to configure them manually. Some things are simply better left to the system instead of humans worrying about them.

An important part of the design is access control. The system divides vaults into two access types:

export interface VaultForText {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
  accessType: 'read' | 'write'; // Key: distinguish read-only from editable
}

  • reference (accessType 'read'): AI uses the content only for analysis and understanding and must not modify it. Suitable for referenced open source projects, documents, and similar materials
  • editable (accessType 'write'): AI may modify the content as the task requires. Suitable for your notes, drafts, and similar materials

This distinction matters. It tells AI which content is “read-only reference” and which content is “safe to edit,” reducing the risk of accidental changes. After all, nobody wants their hard work to disappear because of an unintended edit.
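The article shows the access type being communicated to the AI through the prompt. It does not say whether writes are also blocked in code. If you want a hard guarantee in your own tooling, a guard like the following hypothetical sketch can run before any modifying operation. The names `assertVaultWritable` and `writeToVault` are invented for illustration, and the `write` callback stands in for real file I/O.

```typescript
type VaultAccessType = 'read' | 'write';

interface VaultRef {
  id: string;
  accessType: VaultAccessType;
}

// Hypothetical guard: call this before any operation that modifies vault content.
function assertVaultWritable(vault: VaultRef): void {
  if (vault.accessType !== 'write') {
    throw new Error(`Vault "${vault.id}" is read-only and must not be modified.`);
  }
}

// Hypothetical write path that checks access first; `write` stands in for real file I/O.
function writeToVault(
  vault: VaultRef,
  relativePath: string,
  content: string,
  write: (path: string, content: string) => void,
): void {
  assertVaultWritable(vault);
  write(relativePath, content);
}
```

Putting the check in the write path means an instruction the model ignores still cannot touch reference material.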

Now that we’ve covered the ideas, let’s look at how it works in practice.

Here is a complete frontend call example:

const createCodeRefVault = async () => {
  const response = await VaultService.postApiVaults({
    requestBody: {
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      gitUrl: "https://github.com/facebook/react.git"
    }
  });
  // The system will automatically:
  // 1. Clone the React repository into vault/repos/react
  // 2. Create the docs/ directory for notes
  // 3. Generate the index.yaml metadata
  // 4. Create the AGENTS.md guide file
  return response;
};

This API call completes a series of actions: creating the directory structure, initializing Git submodules, generating metadata files, and more. You only need to provide the basic information and let the system handle the rest. It is honestly a fairly worry-free approach.

After creating the vault, you can reference it in an AI proposal:

const proposal = composeProposalChiefComplaint({
  chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
  repositories: [
    { id: "react", gitUrl: "https://github.com/facebook/react.git" }
  ],
  vaults: [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      // Same path that was used when the vault was created
      physicalPath: "/Users/developer/vaults/react-learning",
      accessType: "read" // AI can only read, not modify
    }
  ],
  quickRequestText: "Focus on the Fiber architecture and scheduler implementation"
});

The system automatically injects vault information into the AI context, letting AI know which learning resources are available. When AI can understand what you have in mind, that kind of tacit understanding is hard to come by.

While using the Vault system, we have summarized a few lessons learned.

The system strictly validates paths to prevent path traversal attacks:

private static string ResolveFilePath(string vaultRoot, string relativePath)
{
    var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
    var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
    if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
    {
        throw new BusinessException(VaultRelativePathTraversalCode,
            "Vault file paths must stay inside the registered vault root.");
    }
    return combinedPath;
}

This is important. Every file path resolved inside a vault must stay under the registered vault root; otherwise the system rejects the operation. You really cannot overemphasize security.
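The same check can be sketched in TypeScript. To stay free of dependencies, this version assumes absolute POSIX-style roots and normalizes paths with plain string handling; the function names are hypothetical. It also compares case-sensitively, whereas the C# excerpt uses `OrdinalIgnoreCase`, which fits case-insensitive file systems better than case-sensitive ones.

```typescript
// Normalize an absolute POSIX-style path: collapse '.', '..' and repeated slashes.
function normalizeAbsolute(path: string): string {
  const parts: string[] = [];
  for (const segment of path.split('/')) {
    if (segment === '' || segment === '.') continue;
    if (segment === '..') {
      parts.pop(); // popping at the root is a no-op, as in POSIX
    } else {
      parts.push(segment);
    }
  }
  return '/' + parts.join('/');
}

// Resolve relativePath under vaultRoot and reject anything that escapes the root.
function resolveVaultPath(vaultRoot: string, relativePath: string): string {
  if (relativePath.startsWith('/')) {
    throw new Error('Vault file paths must be relative.');
  }
  const root = normalizeAbsolute(vaultRoot);
  // The trailing separator stops "/vaults/react-evil" from matching "/vaults/react".
  const rootWithSeparator = root === '/' ? '/' : root + '/';
  const combined = normalizeAbsolute(rootWithSeparator + relativePath);
  if (combined !== root && !combined.startsWith(rootWithSeparator)) {
    throw new Error('Vault file paths must stay inside the registered vault root.');
  }
  return combined;
}
```

The important detail is to compare against the root with a trailing separator after normalization. A bare prefix check would accept a sibling directory whose name merely starts with the vault's name.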

CodeRef Vault recommends Git submodules instead of directly copying code:

private static string BuildCodeRefAgentsContent()
{
    return """
        # CodeRef Vault Guide
        Repositories under `repos/` should be maintained through Git submodules
        rather than copied directly into the vault root.
        Keep this structure stable so assistants and tools can understand the vault quickly.
        """ + Environment.NewLine;
}

This brings several advantages: keeping code synchronized with upstream, saving disk space, and making it easier to manage multiple versions of the code. After all, who wants to download the same thing again and again?
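As a rough illustration of what "maintained through Git submodules" means in practice, the sketch below builds the arguments for `git submodule add <url> repos/<name>`. Deriving the directory name from the last segment of the URL is an assumption made here; the article only shows that the React repository ends up under repos/react. Both function names are hypothetical.

```typescript
// Derive a directory name from a Git URL, e.g. ".../facebook/react.git" -> "react".
function deriveRepoName(gitUrl: string): string {
  const trimmed = gitUrl.replace(/\/+$/, '').replace(/\.git$/, '');
  // Handle both "https://host/owner/repo" and "git@host:owner/repo" forms.
  const lastSeparator = Math.max(trimmed.lastIndexOf('/'), trimmed.lastIndexOf(':'));
  const name = trimmed.slice(lastSeparator + 1);
  if (name === '') {
    throw new Error(`Cannot derive a repository name from "${gitUrl}"`);
  }
  return name;
}

// Arguments for the git CLI, to be run from the vault root.
function buildSubmoduleAddArgs(gitUrl: string): string[] {
  return ['submodule', 'add', gitUrl, `repos/${deriveRepoName(gitUrl)}`];
}

const submoduleArgs = buildSubmoduleAddArgs('https://github.com/facebook/react.git');
```

Passing the arguments as an array to a process API, instead of concatenating a shell string, avoids quoting problems with unusual URLs.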

To prevent performance problems, the system caps how many files it enumerates and how many bytes it reads for a preview:

private const int FileEnumerationLimit = 500;
private const int PreviewByteLimit = 256 * 1024; // 256KB

If your vault contains a large number of files or very large files, preview performance may be affected. In that case, you can consider processing files in batches or using specialized search tools. Sometimes when something gets too large, it becomes harder to handle, not easier.
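Here is a small sketch of how caps like these are typically applied. The two constants mirror the values above. The helper names and the `truncated` flag are assumptions, since the article does not show how HagiCode reports that a limit was hit.

```typescript
const FILE_ENUMERATION_LIMIT = 500;
const PREVIEW_BYTE_LIMIT = 256 * 1024; // 256KB

interface LimitedList<T> {
  items: T[];
  truncated: boolean; // true when the input had more items than the limit
}

interface LimitedPreview {
  bytes: Uint8Array;
  truncated: boolean; // true when the file was larger than the limit
}

// Keep at most `limit` entries and remember whether anything was dropped.
function limitEnumeration<T>(items: T[], limit: number = FILE_ENUMERATION_LIMIT): LimitedList<T> {
  return { items: items.slice(0, limit), truncated: items.length > limit };
}

// Keep at most `limit` bytes of file content for previewing.
function limitPreview(bytes: Uint8Array, limit: number = PREVIEW_BYTE_LIMIT): LimitedPreview {
  return { bytes: bytes.subarray(0, limit), truncated: bytes.length > limit };
}
```

Returning an explicit flag lets the UI or the prompt say that results were cut off, so neither the user nor the AI mistakes a partial listing for the whole vault.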

When creating a vault, the system returns diagnostic information to help with debugging:

List<VaultBootstrapDiagnosticDto> bootstrapDiagnostics = [];
if (IsCodeRefVaultType(normalizedType))
{
    bootstrapDiagnostics = await EnsureCodeRefBootstrapAsync(
        normalizedName,
        normalizedPhysicalPath,
        normalizedGitUrl,
        cancellationToken);
}

If creation fails, you can inspect the diagnostic information to understand the specific cause. When something goes wrong, checking the diagnostics is often the most direct way forward.

Through a unified storage abstraction layer, the Vault system solves several core pain points of reproducing projects in the AI era:

  • Centralized knowledge management: all learning resources are gathered in one place instead of scattered everywhere
  • Automatic AI context injection: AI assistants can automatically understand the available learning resources without manual context setup
  • Cross-project knowledge reuse: knowledge can be shared and reused across multiple learning projects
  • Standardized directory structure: a consistent directory layout lowers the learning curve

This solution has already been validated in the HagiCode project. If you are also building tools related to AI-assisted development, or facing similar knowledge management problems, I hope these experiences can serve as a useful reference.

In truth, the value of a technical solution does not lie in how complicated it is, but in whether it solves real problems. The core idea of the Vault system is very simple: build a unified knowledge storage layer that AI can understand. Yet it is precisely this simple abstraction that improved our development efficiency quite a bit.

Sometimes the simple approach really is the best one. After all, complicated things often hide even more pitfalls…


If this article helped you, feel free to give the project a Star on GitHub, or visit the official website to learn more about HagiCode. The public beta has already started, and you can experience the full AI coding assistant features as soon as you install it.

Maybe you should give it a try as well…

Thank you for reading. If you found this article useful, feel free to like, bookmark, and share it. This content was created with AI-assisted collaboration, and the final content was reviewed and confirmed by the author.