## Caching in MVP1

### Requirements

- We want to cache artifacts from upstream repositories in order to
  - avoid rate limiting (Docker Hub)
  - improve download speed
  - improve availability
- We want to cache container images from
  - Docker Hub
  - GCR
  - Quay
- We want to cache common software dependency artifacts of various programming languages
  - Maven/Ivy (Java)
  - Go
  - NPM
  - Rust
  - PyPI
- Must be easily configurable / manageable
  - static config
  - API config (REST)
- Must store artifacts permanently
  - Resetting the cache (deleting everything) should be easy, though
- Currently out of scope
  - Auth: the cache provides data to anyone who can reach it
- Nice to have
  - Repo cache: can also store uploaded artifacts

### Architectural Solutions

#### File System-Based Caching

- Re-using artifacts stored on the local file system
  - e.g. backing up and restoring the `node_modules` directory
  - set up within pipelines
- Important: proper cache key selection (see the sketch after this list)
- Performance depends on the cache's storage location
  - on the node: fast, but localized to that node
  - network storage: the cache archive still has to be downloaded
- Pro: Artifacts are downloaded directly from upstream, no further config needed
- Con: Does not address rate limiting concerns for the initial cache warm-up
- Pro: No extra config needed in tooling apart from the pipeline cache config
  - Has to be stored somewhere?
    - GitHub Actions / GitLab typically manage this
  - similar to a local dev environment
- Con: State management
  - the cache must be updated when new dependencies are used/requested
  - dirty state (looking at you, Maven)
  - impure behaviour possible, creating side effects
  - integrity checks of package managers might be bypassed at this point
- Con: Duplicate content
- Con: Invalidation needed at some point
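
A minimal sketch of a pipeline-level file system cache in GitHub/Forgejo Actions syntax, assuming a Maven project and the `actions/cache` action; the paths and key names are illustrative only:

```yaml
# Cache the local Maven repository between pipeline runs.
# The key is derived from all pom.xml files, so any dependency change
# produces a new cache entry; restore-keys allows partial reuse of old entries.
- name: Cache Maven repository
  uses: actions/cache@v4
  with:
    path: ~/.m2/repository
    key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
    restore-keys: |
      maven-${{ runner.os }}-
```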
#### Pull-Through Cache

- Mirror/proxy repo for an upstream repo
  - downloads artifacts transparently from upstream when requested
  - downloaded artifacts are stored locally in the mirror for faster access
- Pro: Can be re-used in pipelines, on dev machines, and in cloud/prod environments
- Pro: Little state management necessary, if any
- Con: Requires extra config in tooling: build tools, `containerd`, etc. (see the sketch after this list)
- Using only the pull-through cache should be fast enough for builds in CI
  - reproducible builds ftw
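
As an example of the extra tooling config, this is roughly what a registry mirror looks like for `containerd` via its `hosts.toml` mechanism; the mirror URL is a placeholder and the exact file layout should be verified against the containerd docs:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml
# (assumes config_path = "/etc/containerd/certs.d" is set in containerd's CRI registry config)
server = "https://registry-1.docker.io"

[host."https://cache.example.internal"]
  capabilities = ["pull", "resolve"]
```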
### Solution Candidates

#### Forgejo Runner Cache

- Common actions like `setup-java` do a good job, as they derive the cache key from all build config files (e.g. all `pom.xml` files); see the sketch below
  - the cache is invalidated whenever dependencies change
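
A sketch of what this looks like with `setup-java`'s built-in caching (GitHub/Forgejo Actions syntax; distribution and version are illustrative):

```yaml
# setup-java hashes the build files (e.g. all pom.xml) to construct the cache key,
# so the cached local repository is invalidated when dependencies change.
- uses: actions/setup-java@v4
  with:
    distribution: temurin
    java-version: '21'
    cache: maven
```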
#### Nexus

[Nexus OSS GH](https://github.com/sonatype/nexus-public)

- Open source / free version available
  - the EPL license allows commercial distribution
- The OSS version only has an extremely limited set of supported repository types
  - basically only Maven support
  - does not suffice for our use case
- The Community Edition has more features but is limited in sizing; an upgrade to the Pro edition is necessary if those limits are exceeded
#### Artifactory

- Open source / free version available
  - limited feature set
  - separate distributions per repo type (Java / container / etc.)
  - inconvenient and insufficient for our use case
- Full feature set requires a paid license

License evaluation needed: see the [EULA](https://jfrog.com/artifactory/eula/).
#### Artipie

[GH](https://github.com/artipie/artipie)

[Wiki](https://github.com/artipie/artipie/wiki)

- Self-hosted repositories and upstream artifact caching
- MIT license
- Might be abandoned: low dev activity, needs a new maintainer
  - however, technically it looks extremely promising
- Initial setup does not run correctly out of the box, needs some love
- Mostly headless
  - brings a limited web interface (repo creation, artifact viewing)
- Buggy default config
  - config changes require a restart, which seems to be a bug?
  - easy to set up once the bugs and the buggy config are mitigated/worked around
- File system and object storage supported
  - no databases required
- Pro: Config in YAML files (see the sketch below)
- Due to its simplicity it might be a good candidate for a first upstream caching solution
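
To give an impression of the YAML-based configuration, here is a sketch of a per-repository config file for a proxy/caching repo. The field names are an assumption based on the proxy-repo examples in the Artipie wiki and should be verified there; the repo name and paths are placeholders:

```yaml
# repo/maven-central.yaml - a Maven proxy repo that caches upstream artifacts locally
repo:
  type: maven-proxy
  storage:
    type: fs
    path: /var/artipie/data
  remotes:
    - url: https://repo.maven.apache.org/maven2
```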
#### Pulp

[Website](https://pulpproject.org/)

[GH](https://github.com/pulp/pulpcore)

- Self-hosted repositories and upstream artifact caching
- GPL 2.0 license
- Pull-through caches are only a technical preview and might not work correctly
  - the pull-through cache does not fit into the concept of how artifacts are stored and tracked
  - the intended workflow is to sync dedicated artifacts with an upstream repo, not to mirror the entire repo
- Setup and config are quite complex
- Built for high availability
- File system and object storage supported
- Requires an SQL DB (Postgres) and possibly Redis
#### kube-image-keeper

[GH](https://github.com/enix/kube-image-keeper)

- Creates a DaemonSet, installing a service on each worker node
- Works within the cluster and rewrites image coordinates on the fly (illustrated below)
- Pro: fine-grained caching control
  - select/exempt images and namespaces
  - cache invalidation
- Pro: config within k8s or as k8s objects
- Con: Invasive
- Con: Rewrites image coordinates using a mutating webhook
- Con: Must be hosted within each (workload) cluster
- Con (blocker): Cannot handle image digests due to manifest rewrites
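
A rough illustration of the on-the-fly rewrite, assuming the default behaviour of pointing images at the node-local proxy; the rewritten prefix and port are placeholders, not the exact values used by kube-image-keeper:

```yaml
# Container spec as submitted by the user
spec:
  containers:
    - name: app
      image: nginx:1.27
# After the mutating webhook, the image field references the in-cluster proxy instead,
# roughly: image: localhost:<proxy-port>/docker.io/library/nginx:1.27
# Digest-pinned references (nginx@sha256:...) are the blocker noted above:
# the rewritten/served manifest no longer matches the original digest.
```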
#### 'Simple' Squid proxy (or similar)

- Caching of arbitrary resources via HTTP
- "Stupid" caching
  - invalidation becomes a problem rather quickly
#### Harbor

[Website](https://goharbor.io/)

[GH](https://github.com/goharbor/harbor)

- Apache 2.0 license
- The go-to container registry
  - allows self-hosting artifacts and caching upstream ones (see the pull example below)
- Pro: image signing
- Pro: multi-tenancy
- Pro: quotas
- Pro: vulnerability scans
- Pro: SBOM creation
- Pro: P2P distribution of artifacts
- Pro: fully fledged web interface
- Con: only container / OCI-related artifacts
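
As a usage sketch: with a Harbor project configured as a proxy cache for Docker Hub, clients pull upstream images through Harbor instead of hitting Docker Hub directly (host and project name are placeholders):

```shell
# Official Docker Hub images need the library/ prefix when pulled through a proxy-cache project
docker pull harbor.example.internal/dockerhub-proxy/library/nginx:1.27
```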
### Recommendation

- File system cache
  - easy solution, as it is offered within most pipelines
  - reduces build times significantly if dependencies have to be downloaded from outside networks
  - avoid using the fs cache (i.e. the Forgejo runner cache) long term or at all
    - unless you can handle proper cache invalidation
    - avoiding it promotes immutable infra and reproducible builds without side effects
  - use it as an additional layer if there is no local cache repo
- Repo caches
  - can replace the file system cache if the network and the repo are fast enough
  - the optimal solution would be a Nexus/Artifactory-like unified solution
  - FOSS solutions like Artipie and Pulp have severe problems
    - would require us to add features/fixes/maintenance
  - due to the scarce landscape of proper FOSS solutions we might have to opt for multiple dedicated solutions
    - if we opt for a dedicated container cache, we should re-evaluate Harbor or Quay
- Try to use Artipie as a first, simple solution and use Forgejo runner caches in conjunction for even better performance (see the client config sketch below)
  - if Artipie does not work correctly or does not fit for some reason, we won't have wasted too much time on it
  - if Artipie is abandoned but the concept works for us, we should consider maintaining it and continuing its development
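
To make the "extra config in tooling" concrete for the recommended setup, a sketch of pointing Maven at such a pull-through cache via `settings.xml`; the mirror URL is a placeholder for wherever the Artipie (or other) Maven proxy ends up being hosted:

```xml
<!-- ~/.m2/settings.xml: route all requests for Maven Central through the internal cache -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <mirror>
      <id>internal-cache</id>
      <mirrorOf>central</mirrorOf>
      <url>https://artipie.example.internal/maven-central</url>
    </mirror>
  </mirrors>
</settings>
```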