# Runbook

## Getting started

To also pull the submodules, make sure to clone this repo like this:

`git clone --recurse-submodules https://github.com/Askill/Web-Crawler-on-EKS`

## CI

This GitHub repo contains multiple workflows:

- deploy all infrastructure - runs on push to main or during a PR
- destroy all infrastructure - manual action
- deploy a new latest version of the application - manual action

### Image build

The app image can be built and pushed with:

`docker build -t 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev ./optar`

`docker push 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev`

## Deployment

The crawler is deployed as a K8s Job, defined in `./optar/deployment.yaml`, which can be rolled out to the cluster with:

`kubectl apply -f .\deployment.yaml`

Prerequisite: the correct kubectl config has been set with:

`aws eks --region eu-central-1 update-kubeconfig --name optar-dev-eks`

## Crawler config

For this PoC, no changes have been made to how the crawler gets its config, meaning the sites and keywords are set at build time as lines in `./optar/keywords.txt` and `./optar/sites.txt`.

I reused a crawler I had made earlier: `https://github.com/Askill/optar`

This crawler traverses all links on a given website, caches this link tree, compares the new tree to previously cached ones, and searches all *new* pages for specific keywords. It is designed specifically for news sites and blogs, not for detecting content changes on normally static sites such as a company's home page.

## AWS Infrastructure

Components of note:

- EKS cluster
  - uses the standard Terraform EKS module, which relies on EC2-backed managed node groups under the hood
  - also has a service account that can read from the S3 bucket the application needs; the account is specified in `./optar/deployment.yaml` (see the IAM sketch below)
- ECR
  - one registry was created (optar)
  - all users and roles in the account have pull and push access, which is fine for low-security applications
- S3 bucket
  - lifecycle rule to delete objects older than 3 days; assuming this crawler runs at least once per day, this leaves some room for error while keeping storage overhead low (see the lifecycle sketch below)
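
How the service account gets its S3 read permission is not spelled out in this runbook; the usual pattern with the Terraform EKS module is IAM Roles for Service Accounts (IRSA). The following is only a minimal sketch of that wiring, assuming the EKS module is named `eks`, the bucket resource is `aws_s3_bucket.crawler_cache`, and the service account referenced in `deployment.yaml` is `optar-crawler` in the `default` namespace — all of these names are placeholders, not taken from the repo:

```hcl
# IAM role the crawler pod's service account assumes via the cluster's
# OIDC provider (IRSA). All resource and account names here are hypothetical.
resource "aws_iam_role" "optar_crawler" {
  name = "optar-dev-crawler"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = module.eks.oidc_provider_arn }
      Condition = {
        StringEquals = {
          # namespace:serviceaccount pair is assumed, not taken from the repo
          "${module.eks.oidc_provider}:sub" = "system:serviceaccount:default:optar-crawler"
        }
      }
    }]
  })
}

# Read-only access to the crawler's cache bucket.
resource "aws_iam_role_policy" "optar_crawler_s3_read" {
  name = "s3-read"
  role = aws_iam_role.optar_crawler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.crawler_cache.arn,
        "${aws_s3_bucket.crawler_cache.arn}/*"
      ]
    }]
  })
}
```

The service account in `deployment.yaml` would then carry the matching `eks.amazonaws.com/role-arn` annotation pointing at this role.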
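
The 3-day expiry on the bucket can be expressed as a single lifecycle rule. A minimal Terraform sketch, again assuming a hypothetical bucket resource named `aws_s3_bucket.crawler_cache`:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "crawler_cache" {
  bucket = aws_s3_bucket.crawler_cache.id

  rule {
    id     = "expire-old-crawl-results"
    status = "Enabled"

    # Apply to all objects in the bucket.
    filter {}

    # Objects are deleted three days after creation, matching the
    # "at least one crawl per day" assumption above.
    expiration {
      days = 3
    }
  }
}
```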