Runbook
Getting started
To also pull the submodules, make sure to clone this repo like this:
git clone --recurse-submodules https://github.com/Askill/Web-Crawler-on-EKS
CI
In this GitHub repo, there are multiple workflows:
- to deploy all infrastructure
  - runs on push to main or during a PR
- to destroy all infrastructure
  - manual action
- to deploy a new latest version of the application
  - manual action
Image build
The app image can be built with:
docker build -t 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev ./optar
docker push 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev
Deployment
The crawler is deployed as a K8s Job, defined in ./optar/deployment.yaml.
It can be rolled out to the cluster with:
kubectl apply -f ./optar/deployment.yaml
Prerequisite: the correct kubectl config has been set with:
aws eks --region eu-central-1 update-kubeconfig --name optar-dev-eks
Crawler config
For this PoC, no changes have been made to how the crawler gets its config: the sites and keywords are set at build time, one per line, in ./optar/keywords.txt and ./optar/sites.txt.
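As an illustration of this line-based config format, here is a minimal sketch of how such files could be read. The helper name `read_config_lines` is hypothetical and not taken from the crawler's source; skipping blank lines is an assumption as well.

```python
from pathlib import Path


def read_config_lines(path):
    """Return the stripped, non-empty lines of a text file.

    Hypothetical helper: the real crawler may parse its config differently.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Called from the repository root, `read_config_lines("./optar/sites.txt")` and `read_config_lines("./optar/keywords.txt")` would return the configured sites and keywords as lists of strings.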
I reused a crawler I had made earlier: https://github.com/Askill/optar
This crawler traverses all links on a given website, caches this tree, compares the new tree to previously cached ones, and searches all new sites for specific keywords.
It is specifically designed for news sites and blogs, not for content changes on normally static sites like a company's home page.
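The compare-and-search step described above can be sketched roughly as follows. This is a simplified illustration, not the crawler's actual code: it assumes each cached "tree" can be flattened to a set of URLs, and the names `new_pages` and `keyword_hits` are hypothetical.

```python
def new_pages(current_tree, cached_trees):
    """Return URLs in current_tree that appear in none of the cached trees."""
    seen = set()
    for tree in cached_trees:
        seen.update(tree)
    return sorted(set(current_tree) - seen)


def keyword_hits(pages, keywords):
    """Map each page URL to the keywords found in its text (case-insensitive)."""
    hits = {}
    for url, text in pages.items():
        lowered = text.lower()
        found = [kw for kw in keywords if kw.lower() in lowered]
        if found:
            hits[url] = found
    return hits


# Example with made-up data: one cached crawl and one new crawl.
cached = [{"https://example.com/", "https://example.com/a"}]
current = {"https://example.com/", "https://example.com/a", "https://example.com/b"}
fresh = new_pages(current, cached)  # ['https://example.com/b']
texts = {"https://example.com/b": "Kubernetes release announced"}
print(keyword_hits({url: texts[url] for url in fresh}, ["kubernetes", "terraform"]))
# prints {'https://example.com/b': ['kubernetes']}
```

Because only URLs missing from every cached tree are searched, a page whose content changes under an unchanged URL would not be picked up, which matches the stated focus on news sites and blogs.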
AWS Infrastructure
Components of note:
- EKS cluster
  - using the standard Terraform EKS module, which uses EC2 under the hood for the automatically managed nodes
  - also has a service account that can read from the S3 bucket the application needs; the account is specified in ./optar/deployment.yaml
- ECR
  - created one registry (optar)
  - all users and roles in the account have pull and push access, which is acceptable for low-security applications
- S3 Bucket
  - lifecycle rule to delete objects older than 3 days; assuming the crawler runs at least once per day, this leaves some room for error while keeping overhead low.
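To make the reasoning behind the 3-day lifecycle rule concrete, the sketch below checks whether a cached object would still be inside the retention window at the time of a later run. The function name `cache_still_retained` is hypothetical, and the check is an approximation: S3 applies lifecycle expiration asynchronously, so the real cut-off is not exact.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 3  # taken from the lifecycle rule described above


def cache_still_retained(written_at, now, retention_days=RETENTION_DAYS):
    """True if an object written at `written_at` is younger than the retention window."""
    return now - written_at < timedelta(days=retention_days)


# Example with made-up timestamps: the cache was written by a run on 1 January.
written = datetime(2024, 1, 1, tzinfo=timezone.utc)
next_day_run = written + timedelta(days=1)     # the normal daily run
after_one_miss = written + timedelta(days=2)   # one daily run was skipped
after_long_gap = written + timedelta(days=4)   # several runs were skipped

print(cache_still_retained(written, next_day_run))    # True
print(cache_still_retained(written, after_one_miss))  # True
print(cache_still_retained(written, after_long_gap))  # False
```

With daily runs, the previous cache is about one day old when it is needed, so a single skipped run still finds it, while a longer outage means the crawler starts without a cached tree.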