# Runbook
## Getting started
To pull the submodules as well, make sure to clone this repo with:
`git clone --recurse-submodules https://github.com/Askill/Web-Crawler-on-EKS`
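If the repo was already cloned without submodules, they can be pulled in afterwards with:
```sh
# Fetch and initialize all submodules in an existing clone
git submodule update --init --recursive
```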
## CI
In this GitHub repo, there are multiple workflows:
- to deploy all infrastructure
  - runs on push to main or during a PR
- to destroy all infrastructure
  - manual action
- to deploy a new latest version of the application
  - manual action
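The two manual workflows can be started from the Actions tab or, assuming the GitHub CLI is available, from the command line. The workflow file name below is a placeholder; check `.github/workflows/` for the actual names:
```sh
# List the workflows defined in this repo
gh workflow list
# Trigger a manual workflow, e.g. the destroy workflow (file name is a placeholder)
gh workflow run destroy.yml
```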
### Image build
The app image can be built and pushed with:
`docker build -t 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev ./optar`
`docker push 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev`
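Pushing requires the Docker client to be logged in to the ECR registry first; a minimal sketch using the AWS CLI:
```sh
# Authenticate Docker against ECR (the token is valid for 12 hours)
aws ecr get-login-password --region eu-central-1 \
  | docker login --username AWS --password-stdin 705632797485.dkr.ecr.eu-central-1.amazonaws.com
```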
## Deployment
The crawler is deployed as a K8s Job, defined in `./optar/deployment.yaml`, which can be rolled out to the cluster with:
`kubectl apply -f ./optar/deployment.yaml`
Prerequisite: the correct kubectl config has been set with:
`aws eks --region eu-central-1 update-kubeconfig --name optar-dev-eks`
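Because the crawler runs as a Job rather than a long-lived Deployment, progress is checked per run. A sketch, assuming the Job is named `optar` (check `deployment.yaml` for the actual name):
```sh
# Check whether the Job has completed
kubectl get jobs
# Follow the logs of the Job's pod ("optar" is an assumed name)
kubectl logs job/optar --follow
```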
## Crawler config
For this PoC, no changes have been made to how the crawler gets its config: the sites and keywords are set at build time, one entry per line, in `./optar/keywords.txt` and `./optar/sites.txt`.
I reused a crawler I had built earlier: `https://github.com/Askill/optar`
This crawler traverses all links on a given website, caches the resulting link tree, compares the new tree to previously cached ones, and searches all *new* pages for specific keywords.
It is designed specifically for news sites and blogs, not for detecting content changes on normally static sites like a company's home page.
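A hypothetical example of the two config files, assuming one entry per line (the exact format is defined by the crawler repo linked above):
```sh
# Hypothetical config contents; the actual format is defined by the optar crawler
cat > ./optar/sites.txt <<'EOF'
https://example-news-site.com
https://example-blog.org
EOF
cat > ./optar/keywords.txt <<'EOF'
outage
acquisition
EOF
```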
## AWS Infrastructure
Components of note:
- EKS cluster
  - uses the standard Terraform EKS module, which manages the nodes via EC2 Auto Scaling groups under the hood
  - also has a service account that can read from the S3 bucket the application needs; the account is specified in `./optar/deployment.yaml`
- ECR
  - one registry (optar) was created
  - all users and roles in the account have pull and push access, which is acceptable for low-security applications
- S3 Bucket
  - lifecycle rule deletes objects older than 3 days; assuming this crawler runs at least once per day, this leaves some room for error while keeping storage overhead low (see the verification sketch below)
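The lifecycle rule on the deployed bucket can be inspected with the AWS CLI; the bucket name is a placeholder:
```sh
# Show the bucket's lifecycle configuration (replace <bucket-name>)
aws s3api get-bucket-lifecycle-configuration --bucket <bucket-name>
```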