Table of Contents

  1. Reality bends to my will!
  2. Oh let's break it down!
  3. Wingin' it!
  4. No one can hide from my sight.

Reality bends to my will!

Recently two things have been on my mind.

At work: explaining Kafka-powered data pipelines (Kafka the streaming platform, not the Lebanese minced lamb dish) to non-technical teams so they can use them to achieve their business objectives.

Not at work: Overwatch 2. Nothing beats jumping into Discord with your friends after a long day, joining a 5v5 deathmatch and fighting until your fingers hurt and you need a new keyboard, only to gloriously...lose?

[Screenshot: post-match defeat screen]

Now that can't be right! Everyone was warmed up and ready to perform...was our timing off? Did we push too aggressively? Was it me?

All normal thoughts when you're in the love/hate relationship that is Overwatch 2. After a bout of poor performance, a thought crossed my mind: I had been looking for an engaging use case to get familiar with Kafka and real-time data streaming - what if I could leverage game analytics to see how I'd performed?

Oh let's break it down!

Game plan time - I want to be able to scrape game data, more specifically player stats, and then display them in a visually friendly way to review how I've done. Breaking it down into steps, I would need to:

  1. Identify a source of game performance metrics: A big shout out to Valentin "TeKrop" PORCHET for creating the OverFast API, which scrapes data from official Blizzard pages - exactly the data needed to power my experiment.
  2. Find a suitable tech stack that would let me run a Kafka cluster and stream data from the API call to my reporting solution.
  3. Assess a reporting solution that could ingest from the Kafka topic and display the necessary visualisations.
  4. Host the dashboard somewhere and allow users to input any player name to retrieve their performance.

Sounds pretty straightforward, right?

I was a sweet summer child before I started this project - soon I found myself hopping between software faster than a kangaroo on coals. Which brings me to my High Level Solution Diagram:

[Diagram: High Level Solution Diagram]

My HLSD seems pretty simple when looking back at it now but the road to it was paved with plenty of learnings.

Wingin' it!

This was my first project incorporating AWS, Docker, Kafka and Streamlit - here are some of the learnings I gained for each:

Local Deployment

Before jumping into the cloud I wanted to test the underlying Python components that facilitate the hops between systems on my local Windows machine. My scripts had to:

  1. Act as a producer by calling the OverFast API
  2. Pass the response to a Kafka topic
  3. Convert the JSON to CSV
  4. Have the Streamlit script read the CSV and format it as required
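
To make the first two steps concrete, here's a minimal sketch of the producer side. I'm assuming the kafka-python library, a broker on localhost:9092 and an illustrative topic name of ow2-player-stats; the OverFast endpoint path is from memory, so double-check it against the current OverFast docs.

# Minimal producer sketch (kafka-python + requests) - names here are illustrative
import json
import requests
from kafka import KafkaProducer

OVERFAST_URL = "https://overfast-api.tekrop.fr/players/{player_id}/stats/summary"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def produce_player_stats(player_id, topic="ow2-player-stats"):
    # Call the OverFast API for one player and publish the JSON payload to the topic
    response = requests.get(OVERFAST_URL.format(player_id=player_id), timeout=10)
    response.raise_for_status()
    producer.send(topic, response.json())
    producer.flush()  # make sure the message actually leaves the buffer

produce_player_stats("UBLINKED-11828")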

It was also my first time deploying ZooKeeper & Kafka, and I learnt that Kafka on Windows requires shortened file paths - multiple installation errors about failing to locate dependent files taught me this the hard way. I created a flowchart detailing each of the commands to run and what each command was doing, documenting learnings along the way.

[Diagram: local deployment overview]

Local deployment was a success - it was now time to get my head in the cloud.

AWS

I had to learn a lot about AWS. I started by googling how to keep a virtual machine running 24/7 in AWS and learnt that EC2 was the way to go; I was eligible for AWS' free tier, so I could use a t2.micro with 30 GB of storage for next to no cost. Next was security groups, where I had to create rules allowing the different systems to send traffic to, from and within my EC2 instance. I knew I had to open ports for the following:

  • Secure Shell (SSH) to allow remote access to my EC2
  • ZooKeeper to support managing my Kafka deployment
  • Kafka so it could receive incoming JSON payloads
  • Streamlit so anyone could access the dashboard, input battletags and search
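
For anyone who would rather script the security group than click through the console, a boto3 sketch along these lines would open the same ports. The group ID, region and CIDR ranges are placeholders, and the port numbers assume each service's defaults (22 for SSH, 2181 for ZooKeeper, 9092 for Kafka, 8501 for Streamlit).

# Hypothetical boto3 sketch for the security group ingress rules
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # region is a placeholder

ports = {
    22: "SSH for remote access",
    2181: "ZooKeeper",
    9092: "Kafka broker",
    8501: "Streamlit dashboard",
}

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group id
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # 0.0.0.0/0 keeps the sketch simple - in practice restrict SSH/ZooKeeper/Kafka to your own IP
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": description}],
        }
        for port, description in ports.items()
    ],
)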

Once my EC2 was up and running I had to learn how to SSH into it from my local machine to upload my Python scripts and install ZooKeeper, Kafka and Docker.

I also learnt about .pem key files, which allow for securely SSH'ing in - I first had to navigate to the folder containing my .pem and reference it in the command before being granted access.

With the key sorted I SSH'ed in and began transferring files and installing the necessary software:

#navigate to the folder containing my .pem file
cd "C:\path\to\my\pem"

#ssh into my ec2
ssh -i dashboard.pem ec2-user@my-instance-public-dns

#xfer files from local to ec2
scp -r -i "C:\path\to\my\pem\dashboard.pem" "C:\path\to\my\local\files\aws_python_scripts" ec2-user@my-instance-public-dns:/home/ec2-user/

#download copies of the scripts if changes have been made on the ec2
scp -r -i "C:\path\to\my\pem\dashboard.pem" ec2-user@my-instance-public-dns:/home/ec2-user/aws_python_scripts "C:\path\to\my\local\files\"

#ensure ec2 package repo is updated
sudo yum update -y

#install java
sudo yum install java-1.8.0-openjdk -y

#get kafka
wget https://archive.apache.org/dist/kafka/3.1.0/kafka_2.13-3.1.0.tgz
#extract kafka
tar -xzf kafka_2.13-3.1.0.tgz

#install python
sudo yum install python3 -y

However, as I started installing all of these I quickly realised that juggling different dependencies across these libraries and services could get tricky. I had learnt at work that Python venvs were best practice for avoiding conflicts, but I was starting to think I wanted each contributing service to have its own containerised instance, while still being able to interact with the others.

This brought me to...

Docker

[Image: Docker certificate]

Where would I be without Fireship? This video was a great help when I started on the Docker journey, and soon enough I was installing docker & docker-compose to set up my three separate Docker images.

[Diagram: the three Docker images and how they interact]

ZooKeeper keeps Kafka running smoothly; Kafka receives events from the Python producer script after it calls the OverFast API; and another Python script consumes the JSON, converts it to CSV and uses it to display the dashboard results.
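
A minimal sketch of that consumer step could look like the following, assuming kafka-python and pandas, the same illustrative topic name as earlier, and a payload shaped as nested categories, each mapping stat names to values (the CSV file name is also just a placeholder).

# Minimal consumer sketch (kafka-python + pandas) - names and payload shape are assumptions
import json
import pandas as pd
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ow2-player-stats",
    bootstrap_servers="kafka:9092",  # the Kafka service name inside the docker network
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    stats = message.value
    # Flatten the nested payload into the Category/Subcategory/Value rows the dashboard expects
    rows = [
        {"Category": category, "Subcategory": subcategory, "Value": value}
        for category, subcategories in stats.items()
        if isinstance(subcategories, dict)
        for subcategory, value in subcategories.items()
    ]
    # Overwrite the CSV each time so the dashboard always reads the latest payload
    pd.DataFrame(rows).to_csv("player_stats.csv", index=False)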

To assist with the setup of these Docker images I learnt about two key concepts:

  • Dockerfile: The blueprint that defines how each Docker image should be set up at the time of creation.
  • docker-compose.yaml: If the Dockerfile is the blueprint, docker-compose.yaml is the foreman that builds from those blueprints and coordinates how the finished containers run together.

With all of my instructions set up I just had to say the word:

#invokes the docker-compose.yaml file to build the images and then start the containers
#-d runs them in detached mode so I get my terminal back once everything is up
docker-compose up --build -d

[Image: the Docker whale as a foreman]

Kafka

Ahhh Kafka - I had heard the term thrown around at work long enough; now it was time to get first-hand experience. It enables real-time data streaming, but I learnt that to get Kafka working within my low-spec virtual machine I had to explicitly limit the memory Kafka used in my docker-compose.yaml file, or else I'd receive errors:

#explicitly limiting kafka memory to 512MB
KAFKA_HEAP_OPTS: "-Xmx512M -Xms512M"

Streamlit

Streamlit is a cool library that allows for the creation of dashboards written exclusively in Python. I had to learn a number of commands to format how the data was presented.

import pandas as pd
import streamlit as st
# format_number is a small helper defined elsewhere in the script for pretty-printing numeric values

def display_stats_horizontally(df, category, subcategories, title):
    # Display a subheader in the Streamlit app with the provided title
    st.subheader(title)

    # Filter the dataframe down to the requested category
    filtered_df = df[df['Category'].str.lower() == category.lower()]

    # Use at least 6 columns so a short list of stats doesn't stretch across the whole page;
    # otherwise create one column per subcategory
    metrics_per_row = max(len(subcategories), 6)
    columns = st.columns(metrics_per_row)

    # Pad the subcategories list with empty strings so it matches the number of columns
    subcategories_padded = subcategories + [''] * (metrics_per_row - len(subcategories))

    for i, subcategory_full in enumerate(subcategories_padded):
        if subcategory_full:  # skip the padding entries
            with columns[i % metrics_per_row]:
                # Build a readable label: underscores to spaces, capitalised, with a special case for time_played
                display_label = "Time Played (hours)" if subcategory_full == "time_played" else subcategory_full.replace('_', ' ').capitalize()

                # Extract the specific row for the subcategory
                specific_row = filtered_df[filtered_df['Subcategory'].str.lower() == subcategory_full.lower()]
                if not specific_row.empty:
                    # Retrieve the value, format it if numeric, then display it as a metric
                    value = specific_row['Value'].iloc[0]
                    formatted_value = format_number(value) if pd.api.types.is_number(value) else str(value)
                    st.metric(label=display_label, value=formatted_value)
                else:
                    # No data found for this subcategory - display N/A
                    st.metric(label=display_label, value="N/A")
Shout out to the Streamlit docs & ChatGPT for helping me navigate the wealth of formatting options.

No one can hide from my sight.

To sum up, my key learnings include:

  • How to create a virtual instance in the cloud to keep my app up 24/7
  • The power of Docker (with docker-compose) to keep services with different functions in their own containers
  • Using Kafka to enable real-time data streaming
  • Using the Streamlit library to create user-friendly dashboards

Without further ado, here is the culmination of my learnings - my dashboard, available below. Some usernames to test include UBLINKED-11828 or Spaztek-1732:

Update 24/03/2024: as my AWS free tier has expired, I've included a screenshot of the final dashboard instead.

[Screenshot: final dashboard output]

Thanks for checking out my project!
