# Runbook

## Getting started

To also pull the submodules, make sure to clone this repo like this:

`git clone --recurse-submodules https://github.com/Askill/Web-Crawler-on-EKS`

## CI

This GitHub repo contains multiple workflows:

- deploy all infrastructure - runs on push to main or during a PR
- destroy all infrastructure - manual action
- deploy a new latest version of the application - manual action

### Image build

The app image can be built and pushed with:

`docker build -t 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev ./optar`

`docker push 705632797485.dkr.ecr.eu-central-1.amazonaws.com/optar:latest-dev`

## Deployment

The crawler is deployed as a K8s Job, defined in `./optar/deployment.yaml`, which can be rolled out to the cluster with:

`kubectl apply -f .\deployment.yaml`

Prerequisite: the correct kubectl config has been set with:

`aws eks --region eu-central-1 update-kubeconfig --name optar-dev-eks`

## Crawler config

For this PoC, no changes have been made to how the crawler gets its config, meaning the sites and keywords are set at build time as lines in `./optar/keywords.txt` and `./optar/sites.txt`.

I reused a crawler I had made earlier: `https://github.com/Askill/optar`

This crawler traverses all links on a given website, caches this link tree, compares the new tree to previously cached ones, and searches all *new* pages for specific keywords. It is designed specifically for news sites and blogs, not for detecting content changes on normally static sites such as a company's home page.

## AWS Infrastructure

Components of note:

- EKS cluster
  - uses the standard Terraform EKS module, which relies on EC2-backed managed node groups under the hood
  - also has a service account that can read from the S3 bucket the application needs; the account is specified in `./optar/deployment.yaml` (see the IAM sketch below)
- ECR
  - one registry was created (optar)
  - all users and roles in the account have pull and push access, which is fine for low-security applications
- S3 bucket
  - lifecycle rule to delete objects older than 3 days; assuming this crawler runs at least once per day, this leaves some room for error while keeping storage overhead low (see the lifecycle sketch below)
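
How the service account gets its S3 read permission is not spelled out in this runbook; the usual pattern with the Terraform EKS module is IAM Roles for Service Accounts (IRSA). The following is only a minimal sketch of that wiring, assuming the EKS module is named `eks`, the bucket resource is `aws_s3_bucket.crawler_cache`, and the service account referenced in `deployment.yaml` is `optar-crawler` in the `default` namespace — all of these names are placeholders, not taken from the repo:

```hcl
# IAM role the crawler pod's service account assumes via the cluster's
# OIDC provider (IRSA). All resource and account names here are hypothetical.
resource "aws_iam_role" "optar_crawler" {
  name = "optar-dev-crawler"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = module.eks.oidc_provider_arn }
      Condition = {
        StringEquals = {
          # namespace:serviceaccount pair is assumed, not taken from the repo
          "${module.eks.oidc_provider}:sub" = "system:serviceaccount:default:optar-crawler"
        }
      }
    }]
  })
}

# Read-only access to the crawler's cache bucket.
resource "aws_iam_role_policy" "optar_crawler_s3_read" {
  name = "s3-read"
  role = aws_iam_role.optar_crawler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.crawler_cache.arn,
        "${aws_s3_bucket.crawler_cache.arn}/*"
      ]
    }]
  })
}
```

The service account in `deployment.yaml` would then carry the matching `eks.amazonaws.com/role-arn` annotation pointing at this role.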
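
The 3-day expiry on the bucket can be expressed as a single lifecycle rule. A minimal Terraform sketch, again assuming a hypothetical bucket resource named `aws_s3_bucket.crawler_cache`:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "crawler_cache" {
  bucket = aws_s3_bucket.crawler_cache.id

  rule {
    id     = "expire-old-crawl-results"
    status = "Enabled"

    # Apply to all objects in the bucket.
    filter {}

    # Objects are deleted three days after creation, matching the
    # "at least one crawl per day" assumption above.
    expiration {
      days = 3
    }
  }
}
```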