# Runbook
## Getting started

To pull the submodules as well, clone this repo with:
`git clone --recurse-submodules https://github.com/Askill/Web-Crawler-on-EKS`
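If the repo was already cloned without `--recurse-submodules`, the submodules can still be fetched afterwards:

```shell
# Fetch and check out all submodules in an existing clone
# (safe to re-run; run from the repository root):
git submodule update --init --recursive
```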
## CI

This GitHub repo contains multiple workflows:
- to deploy all infrastructure
  - runs on push to main or during a PR
- to destroy all infrastructure
  - manual action
- to deploy a new latest version of the application
  - manual action
### Image build

The app image can be built and pushed with:
`docker build -t 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev ./optar`
`docker push 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev`
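Pushing to ECR requires an authenticated Docker client; using the same account and region as the commands above, the login looks like:

```shell
# Authenticate Docker against the ECR registry; the credentials are
# short-lived, so re-run this when a push is rejected with 401/403:
aws ecr get-login-password --region eu-central-1 \
  | docker login --username AWS --password-stdin 705632797485.dkr.ecr.eu-central-1.amazonaws.com
```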
## Deployment

The crawler is deployed as a K8s Job, defined in `./optar/deployment.yaml`, which can be rolled out to the cluster with:

`kubectl apply -f ./optar/deployment.yaml`
Prerequisite: the correct kubectl config has been set with:
`aws eks --region eu-central-1 update-kubeconfig --name optar-dev-eks`
## Crawler config

For this PoC, no changes have been made to how the crawler gets its config: the sites and keywords are set at build time as lines in `./optar/keywords.txt` and `./optar/sites.txt`.
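The exact file format is defined by the crawler itself; a one-entry-per-line layout (an assumption, not verified against the optar repo) would look like:

```shell
# Hypothetical example contents; one URL / keyword per line is assumed.
mkdir -p ./optar
cat > ./optar/sites.txt <<'EOF'
https://example.com/blog
https://example.org/news
EOF
cat > ./optar/keywords.txt <<'EOF'
kubernetes
terraform
EOF
```

Since the files are baked into the image, changing them requires rebuilding and pushing the image as described above.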
I reused a crawler I had made earlier: `https://github.com/Askill/optar`

This crawler traverses all links on a given website, caches the resulting tree, compares the new tree to previously cached ones, and searches all *new* pages for specific keywords.

It is designed specifically for news sites and blogs, not for content changes on normally static sites such as a company's home page.
## AWS Infrastructure
Components of note:
- EKS cluster
  - uses the standard Terraform EKS module, which manages nodes with EC2 Auto Scaling groups under the hood
  - also has a service account that can read from the S3 bucket the application needs; the account is specified in `./optar/deployment.yaml`
- ECR
  - one registry created (optar)
  - all users and roles in the account have pull and push access, which is fine for low-security applications
- S3 bucket
  - lifecycle rule deletes objects older than 3 days; assuming the crawler runs at least once per day, this leaves some room for error while keeping overhead low
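The lifecycle rule can be checked from the CLI. The bucket name below is a placeholder; take the real one from the Terraform state or outputs:

```shell
# Show the bucket's lifecycle configuration; "optar-dev-bucket" is a
# hypothetical name, substitute the bucket created by Terraform:
aws s3api get-bucket-lifecycle-configuration --bucket optar-dev-bucket
```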