
Caching in MVP1

Requirements

  • We want to cache artifacts from upstream repositories in order to

    • Avoid rate limiting (Docker Hub)
    • Improve download speed
    • Improve availability
  • We want to cache container images

    • Docker Hub
    • GCR
    • Quay
  • We want to cache common software dependency artifacts of various programming languages

    • Maven/Ivy (Java)
    • Go
    • NPM
    • Rust
    • PyPI
  • Must be easily configurable / manageable

    • Static config
    • API config (REST)
  • Must store artifacts permanently

    • Resetting the cache (delete everything) should be easy, though
  • Currently out of scope

    • Auth: Cache provides data to anyone who can reach it

Architectural Solutions

File System-Based Caching

  • Re-using artifacts stored on the local file system

    • e.g. backing up and restoring the node_modules directory
    • Set up within pipelines
  • Important: proper cache key selection (see the workflow sketch at the end of this section)

  • Performance depends on the cache's storage location

    • on the node: fast, but localized to that node
    • network storage: the cache archive still has to be downloaded
  • Pro: Artifacts are downloaded directly from upstream, no further config needed

    • Con: Does not address rate-limiting concerns for the initial cache warm-up
  • Pro: No extra config needed in tooling apart from the pipeline cache config

  • The cache itself has to be stored somewhere

    • GitHub Actions / GitLab typically manage this
    • similar to a local dev environment
  • Con: State management

    • Update the cache if new dependencies are used/requested
    • Dirty state (looking at you, Maven)
    • Impure behaviour possible, creating side effects
    • Integrity checks of package managers might be bypassed at this point
  • Con: Duplicate content (each cache key stores its own full copy of the artifacts)

  • Con: Invalidation needed at some point
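
A minimal sketch of the cache key point above, in a Forgejo-/GitHub-Actions-style workflow, assuming a runner compatible with the upstream actions/cache action (step name and paths are illustrative):

```yaml
# Restore/save node_modules keyed on the lockfile hash: any change to
# package-lock.json yields a new key and forces a clean re-download.
- name: Cache node_modules
  uses: actions/cache@v4
  with:
    path: node_modules
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}
    # Fall back to the most recent cache for this OS if no exact match exists
    restore-keys: |
      npm-${{ runner.os }}-
```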

Pull-Through Cache

  • Mirror/Proxy repo for upstream repo

    • Downloads artifacts transparently from upstream when requested
    • Downloaded artifacts are stored locally in the mirror for faster access
  • Pro: Can be re-used in pipelines, dev machines, cloud/prod environments

  • Pro: Little state management necessary, if any

  • Con: Requires extra config in tooling (build tools, containerd, etc.); see the registry sketch after this list

  • Using only the pull-through cache should be fast enough for builds in CI

    • Reproducible builds ftw
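
One concrete shape such a mirror can take, assuming a CNCF Distribution (registry:2) instance is acceptable: its pull-through mode is enabled via a proxy section (storage path and port below are examples):

```yaml
# config.yml for a Distribution registry acting as a pull-through cache:
# manifests/blobs missing locally are fetched from the remote on first
# pull and stored under rootdirectory for subsequent requests.
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io
```

Clients still need to be pointed at the mirror (e.g. via containerd's registry mirror settings or Docker's registry-mirrors option), which is exactly the extra tooling config listed as a con above.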

Solution Candidates

Forgejo Runner Cache

  • common actions like setup-java do a good job, as they derive the cache key from all build config files (e.g. all pom.xml files); see the sketch below
    • invalidation happens on any change to dependencies etc.
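
A minimal sketch, assuming the runner can execute the upstream setup-java action (Java version and distribution are illustrative):

```yaml
# setup-java's built-in cache support hashes all pom.xml files to build
# the cache key, so any dependency change invalidates the cache.
- uses: actions/setup-java@v4
  with:
    distribution: temurin
    java-version: '21'
    cache: maven
```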

Nexus


  • Open source / free version

    • EPL License allows commercial distribution
  • The OSS version only supports an extremely limited set of repository types.

    • basically only Maven support
    • does not suffice for our use case
  • Community Edition has more features but is limited in sizing; an upgrade to the Pro edition is necessary if those limits are exceeded.

Artifactory

  • Open source / free version

    • Limited feature set
    • Separate distributions per repository type (Java / container / etc.)
  • Inconvenient and insufficient for our use case

A license evaluation (EULA) is needed.

Artipie

  • might be abandoned / low dev activity / needs new maintainer
    • However, technically it looks extremely promising

Pulp

  • Pull-through caches are only a technical preview and might not work correctly

kube-image-keeper


  • Creates a DaemonSet, installing a service on each worker node

  • Works within the cluster and rewrites image coordinates on the fly

  • Pro: fine grained caching control

    • select/exempt images / namespaces
    • cache invalidation
  • Pro: config within k8s or as k8s objects

  • Con: Invasive

  • Con: Rewrites image coordinates using a mutating webhook

  • Con: Must be hosted within each (workload) cluster

  • Con BLOCKER: Cannot handle image digests due to manifest rewrites (illustrated below)
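
A sketch of what the rewrite means in practice; the node-local proxy address below is hypothetical, not a documented default:

```yaml
# The mutating webhook rewrites the image reference so that the pull
# goes through the node-local caching proxy:
#
#   before:  image: docker.io/library/nginx:1.27
#   after:   image: localhost:7439/docker.io/library/nginx:1.27
#
# Digest-pinned references break under this scheme: the manifest is
# rewritten, so the original digest no longer matches (the BLOCKER above).
#   image: nginx@sha256:<digest>   # cannot be served via the rewrite
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: nginx
      image: localhost:7439/docker.io/library/nginx:1.27
```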

'Simple' Squid proxy (or similar)

  • Caching of arbitrary resources via HTTP (plain HTTP only; HTTPS traffic, which most registries and package repositories use, would require TLS interception to cache)

Harbor

Recommendation

  • File system cache
    • Easy solution, as it is offered within most pipelines
    • Reduces build times significantly if dependencies have to be downloaded from outside networks
    • Avoid using the fs cache (i.e. the Forgejo runner cache) long-term, or at all
      • Unless you can handle proper cache invalidation
      • Promote immutable infra and reproducible builds without side effects
    • Use it as an additional layer if there is no local cache repo