ETL - Athletics (Staging Package)

Last Updated 7/2/2021

Overview
Location
Schedule
Details
 Master Packages
    athl_load_all
    catapult_load
    output_files
 Python Modules
    catapult_load.py
    catapult_helper.py
    dw_helper.py
Change Log
Troubleshooting Notes

Overview:

The Athletics ETL primarily pulls strength and performance data that athletics teams use for day-to-day performance analysis. 

The ETL has multiple components that pull data from various sources, including flat files and APIs.  

The primary purpose of this job is to populate data needed for the Athletics performance dashboards. Uses include:

  1. Populate staging table schemas with athletics data
  2. dataMarq procs to build athletics related tables
    1. d_athletes
      One row per athlete with distinguishing information about the athletes
    2. d_rosters
      One row per athlete per roster year with details about roster spot - position, etc.
    3. d_seasons
      One row per sport per season with details about the seasons (wins, losses,etc.)
    4. d_catapult_(athletes/activities/periods/tags)
      Catapult is a third-party tool that tracks movement; these are the d_ tables related to the catapult data
    5. f_athlete_test_results
      The results of strength and performance tests done by the staff
    6. f_catapult_(period/daily)_stats
      The movement/activity stats from catapult

 Schemas in staging that are populated with this data include:

  • Base schemas: athl, catapult
  • CDC schema: athl_cdc, catapult_cdc

Location:

The solution "ETL-Athletics" contains all the packages for staging athletics data and resides in the ETL-dataMarq folder in the SSIS catalog.

Schedule:

The jobs in this project run as part of multiple SQL Agent jobs (full dataMarq schedule).  

ETL - Athletic Performance - Full - Every day at 3am

ETL - Athletic Performance - Incremental - Hourly

Project and Package Details:

The packages in ETL - Athletics work with a variety of source technologies, including flat files, databases and web APIs.  Each type follows our fairly standard methodology for loading data to staging. 

Project Parameters

brand: This is either COVID or MU, to differentiate the two Qualtrics instances we want to pull data from. These packages are designed to work with either instance, although we are only pulling from COVID now. Passed to the python program as the --brand argument

dbEnvr: INT, STG, PRD - which DB should be loaded with the data. Passed to the python program as the --dbEnvr argument

dwConnStr: connection to datamarq

dwStgConnStr: connection to datamarq_staging

fsEnvr: The file share environment for the landing files. We have both a DEV and PROD landing folder to differentiate when jobs are pulling in different environments, so source files do not overlap. Passed to the python program as the --fileEnvr argument

fullRefresh: 1 = Full Refresh, 0 = Incremental

gitEnvr: DEV or PROD. Because these jobs use python packages, the python code is stored in our git folder on the BI shared drive.  For testing, we are able to point at the python code in DEV or in PROD.

logLevel: DEBUG, INFO, WARNING, ERROR, CRITICAL - This is passed to the python scripts as an argument to set the log level for the jobs

pythonPath: The path to the python executable - saved in the D:/ drive on servers but usually in C:/ if executing locally/testing.

sourceFolder: The folder path for the raw files that will be imported/exported

Master Packages

ATHL_LOAD_ALL (Flat File Load Packages)

Overview

This package executes all the individual flat file load packages (generic flow below and details on individual files). They use our standard flat file methodology and most tables have CDC.  CDC is important here because there are 5 people working in the department and each might load multiple partial files every day, so the data has to incrementally update through a CDC import.

Note that the strength staff uploads these flat files directly to the folder; as a result, there is a high chance of a metadata mismatch if the files don't align exactly with the template.

Package Parameters

None

Package Variables

sourceFolder: The folder to look for raw files

varDestBackupFolderFile: The folder to store files once they have been loaded

varFileMask: The file mask (gradebooksettings_*.*) to look for raw files to load

varFileName: a variable to store the full file name from the file loop for processing

varFileExists: True or False, set by the first step in the package to execute the rest only if a file with the specific mask exists

varSourceFolderFile: A variable set during run time to capture the full path of the file being processed

Package Flow

  1. Script Check if File Exists
    This is an SSIS script task that takes the file mask and folder as parameters and checks whether file(s) exist before continuing to the next step
  2. Loop Through Each File in the Folder
    1. Uses the Source Folder and File Mask variables to look for files to process and loops through each file, storing the file name in the File Name variable
    2. Truncate CDC Table
    3. Loads Data to Staging Table using a Flat File to OLEDB Source
    4. Execute CDC staging proc (this merges in new data and/or runs a delete and insert statement directly in the proc)
    5. Archives File - the file is moved to the backup folder with the load date appended to the file name

File Loading Instructions

The Athletics department will be loading files directly into the BI shared drive Performance folder.  These are instructions on how to drop those files.

  1. Files must have exactly the field headers and layout listed below. If field headers are in the wrong order or missing, the load job will fail.
  2. All files except DXA Access files should be dropped directly in the DEV or PROD folder. DEV is for testing and PROD is for the final data that will be consumed by users.  DXA files should be dropped in the specific dxa folder and can be embedded in nested folders (MBB/12-1-2021, etc.)
  3. Once files are loaded, they will show up in the "backup" folder with the date they were loaded appended to the file name. Any file that is left in the DEV/PROD folder after a load job has run (every hour) means that file failed to load for some reason - usually bad headers
  4. For results, athlete, rosters, and seasons, the files can contain partial data (i.e., one day's worth of results, one season's roster, etc.). For tests and catapult_match, the full file must be present.   Notes on this:
    - Rosters must have a full season's worth of data.  Different seasons can be in different files, but a full season must be loaded, as seasons are treated as individual units.  This is done in order to be able to delete athletes who were accidentally added to the wrong season
  5. When new athletes are entered into catapult, their catapult id needs to be paired with their muid.  This is a manual process and is done through the catapult_match.csv file.  This file will remain in the Performance folder and will be imported and exported every job.  If new athletes are added to catapult they will be added to the match file and will need to have their muid entered on the file by Athletics staff
  6. For testing the whole load job there is an output_files folder.  After every job execution, new files will be output from the data warehouse that contain the key processed data.  This can be used by athletics staff as a check or a quick lookup.

NOTES:  These should be csv files; muid will most likely come in as an int, so joins need to handle the data type carefully (see the sketch after the file layouts below)

  1. Results
    result_date muid first_name last_name test_code test_name rep result team_ind
  2. Athlete
    muid first_name last_name pref_first_name scholarship_ind pro_ind nba_ind national_team_ind transfer_out_ind
  3. Rosters
    sport_code sport muid first_name last_name alias_name position_code position season class_year year_at_mu
  4. Tests
    test_code test_name test_category pr_type
  5. Season
    sport_code season coach total_wins total_losses total_ties conference_champ_ind conference_wins conference_losses conference_ties ncaa_ind ncaa_wins
  6. Catapult Match
    id first_name last_name muid
    This file is both uploaded from the raw file and then repopulated with data pulled down from the API, so it will always remain in the source folder. Users just need to go in, find athletes without muids, and add the muid to join the catapult id to the student muid.
  7. DXA
    These are actually individual Access database files. The basic procedure is the same, except that a query is used to load the staging table and the files have to be extracted from their source folders, as they are uploaded in nested folders.
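
As a minimal illustration of the muid note above (the file name and the use of pandas are illustrative, not part of the actual load), forcing muid to load as text avoids integer-versus-text mismatches when joining:

    import pandas as pd

    # Hypothetical results file; real files follow the Results layout above.
    results = pd.read_csv(
        "results_20210701.csv",
        dtype={"muid": str},           # keep muid as text rather than letting it infer int
        parse_dates=["result_date"],
    )
    results["muid"] = results["muid"].str.strip()  # end-user files often carry stray whitespace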

CATAPULT_LOAD

Overview

This package is very basic and has one task that executes the python module that loads data from the catapult API.  The python job doesn't land any raw flat files and does all the inserting and updating directly in the python code.

Package Parameters

None

Package Variables

None

Package Flow

  1. Execute the python program to load the stats data to the cdc table and then to the base table - this command is created dynamically as an expression in the execute process task. It essentially calls the catapult_load.py program (see details below), passing the required parameters as set by the project/package parameters and variables.  Example below.  Note the important placement of quotes around each parameter

    "\"\\\\marqnet.mu.edu\\depts\\ITS\\BI\\git\\" + @[$Project::gitEnvr]  + "\\DW\\ETL-Athletics\\athletics_load\\catapult_load.py\"  \"--dbEnvr\" \"" + @[$Project::dbEnvr]   + "\" \"--fullRefresh\" \"" +  (DT_WSTR, 1) @[$Project::fullRefresh]   + "\"  \"--log\" \"" +  @[$Project::logLevel]  + "\""

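For reference, once this expression is evaluated the Execute Process Task (whose executable comes from the pythonPath parameter) runs something along these lines; the gitEnvr/dbEnvr/log values shown are illustrative:

    <pythonPath> "\\marqnet.mu.edu\depts\ITS\BI\git\PROD\DW\ETL-Athletics\athletics_load\catapult_load.py" "--dbEnvr" "PRD" "--fullRefresh" "0" "--log" "INFO"
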
OUTPUT_FILES

Overview

The strength staff would like flat files of the data that has been processed by the warehouse, so this package does a simple export and overwrite of the base files of the DW every time the job runs.

Package Parameters

None

Package Variables

sourceFolder: The folder to look for raw files

Package Flow

  1. A set of data flows, one per output file, that each run a sql query and output the results into flat csv files:
    athletes
    rosters
    test_results
    dxa
    catapult_period_stats
    catapult_daily_stats 

Python Modules

Athletics uses a number of third-party services to track performance data. These are all accessed through web service APIs, which dataMarq reaches through python modules.  The modules reside in the ETL-Athletics solution - and therefore in the git repository - alongside the SSIS packages that call them.  They also require a config file that is stored in the config folder on the BI shared drive. 

Catapult API Reference:

catapult_load.py

 This program is the main driver of the catapult data load.  It leverages functions in the two "helper" modules to do the actual work, but provides the flow logic.

 Parameters

dbEnvr - INT, STG, PRD

fullRefresh - 1 or 0; whether to do a full load (athletes, teams, tags, etc.) or just the incremental activities

sportCode (optional) - if running for only one sport (BB, WB, etc.)

logLevel - the level to log to the log folder for the job

 Pseudo-code flow

  1. Set the config file path based on the dbEnvr
  2. Set key variable for tags table and stats table - as these are standard across all sports
  3. Connect to the DB
  4. Get a list of stats to pull down from the DW using the get_active_stats function in dw_helper.
  5. Loop through the tokens in the config file - there is one api token for each sport
    1. Loop through each endpoint in (athletes, teams, tags, tagtype) - these are "d" type endpoints
    2. Get responses and load them into the data warehouse using the get_endpoint, insert_rows and cdc_rows functions
    3. For the activities endpoint (the "f" type endpoint)
      1. Get the last sync date by sport using the get_last_sync_dt function
      2. Create a date list for each day between the last sync date and the current date - only one day of stats is loaded at a time to speed processing and lessen the chance of losing work on a failure (see the sketch after this list)
      3. Loop through each date and get the activities data
      4. Loop through the detailed activity data and load it to the activities, periods, participants, stats and tags_list tables
      5. Update the last sync dt table
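
The date-list step is simple enough to sketch directly. A minimal illustration (the function name and the choice to start the day after the last sync are assumptions, not the actual implementation):

    from datetime import date, timedelta
    from typing import List, Optional

    def build_date_list(last_sync_dt: date, today: Optional[date] = None) -> List[date]:
        """Return one entry per day from the day after the last sync through today."""
        today = today or date.today()
        return [last_sync_dt + timedelta(days=i + 1)
                for i in range((today - last_sync_dt).days)]

    # Example: a last sync of 2021-06-28 run on 2021-07-02 yields
    # 2021-06-29 through 2021-07-02, and stats are pulled one day per pass.
    print(build_date_list(date(2021, 6, 28), date(2021, 7, 2)))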

catapult_helper.py

 All of the functions that are used to interact with the catapult APIs and download data

get_response

Parameters:

url - url for the web service request

headers - headers to be passed to the get call

Function: Generic function to get a json response from a url payload. Customized for the specific format of the response json

Output:  responseJson - URL response in JSON format
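
A minimal sketch of what a generic helper like this typically looks like, using the requests library (the timeout and error handling are assumptions, not the actual implementation):

    import requests

    def get_response(url, headers):
        """Return the JSON payload of a GET request; raise on HTTP errors so failures surface in the logs."""
        resp = requests.get(url, headers=headers, timeout=60)
        resp.raise_for_status()
        return resp.json()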

 get_endpoint

Parameters:

apiToken - Catapult API token

endpoint - the endpoint to tack on to the api url

parms - if there are any params for the endpoint (starttime, etc.)

Function: Takes a token and the endpoint and will loop through all the pages (catapult usually returns single page results), returning the responses in a dictionary keyed by id

Output:  responseIdsDict - A dictionary with the responses in JSON format keyed by the id of the item

 get_response_tag_lists

Parameters:

apiToken - Catapult API token

endpoint - the endpoint to tack on to the api url

responseIdsDict - the response dictionary returned by get_endpoint

fieldList - the field list that will be inserted into the staging table

statSlugs - the list of stats that will be inserted into the staging table

statsFields - the list of stats fields that will be inserted into the staging table

Function: The function loops through the response from a given endpoint and creates lists of data ready to be inserted into the database

Output:  responseList, tagList, statsList - all the lists ready to be inserted into the database

post_request_status

Parameters:

apiToken - Catapult api token

filterName - The filtered item (activity, period, etc.)

filterids - the ids to filter on when returning from the stats endpoint

params - params to be passed to the stats endpoint

groupBy - the group by clause to get from the stats endpoint

Function: This is a custom function to work with the catapult stats endpoint and return catapult stats for specific activities

Output:  nextPage, response, responseJson - the response data from the stats request

 

 

dw_helper.py

All of the functions that are used to interact with the datamarq database

insert_rows

Parameters:

connection - the python connection object to datamarq

table - the table name for insert

fieldList - the list of fields to be inserted

valueList - the list of values to be inserted

Function: Generic function that inserts values into a given table and fields

Output:  s - number of successful rows inserted, f - number of failed rows caught in exception
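
As a rough illustration of the pattern described above (a sketch assuming a pyodbc-style connection; the real function may batch or log differently), a parameterized INSERT is built from the field list and successes and failures are counted per row:

    def insert_rows(connection, table, fieldList, valueList):
        """Insert each row in valueList into table; return (success_count, fail_count)."""
        placeholders = ", ".join("?" for _ in fieldList)
        sql = f"INSERT INTO {table} ({', '.join(fieldList)}) VALUES ({placeholders})"
        s, f = 0, 0
        cursor = connection.cursor()
        for row in valueList:
            try:
                cursor.execute(sql, row)
                s += 1
            except Exception:
                f += 1  # bad rows are counted rather than failing the whole load
        connection.commit()
        return s, f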

 

cdc_rows

Parameters:

connection - the python connection object to datamarq

cdcTable - the cdc table

table - the base table

keyList - the keys for the cdc matching

Function: This function executes a generic catapult cdc proc - sp_delete_insert - that deletes and inserts from the cdc table to the base table.

Output: None
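
A sketch of the call pattern only, assuming a pyodbc-style connection; the schema prefix and the way sp_delete_insert expects its arguments are assumptions and may differ from the real proc:

    def cdc_rows(connection, cdcTable, table, keyList):
        """Run the generic delete/insert CDC proc to merge the cdc table into the base table."""
        cursor = connection.cursor()
        # Passing the key columns as one comma-separated string is an assumption,
        # not a confirmed detail of the sp_delete_insert signature.
        cursor.execute("EXEC catapult.sp_delete_insert ?, ?, ?",
                       (cdcTable, table, ", ".join(keyList)))
        connection.commit()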

 

get_active_stats

Parameters:

connection - the python connection object to datamarq

 

Function: Queries the catapult.active_stats table to get a list of the stat slugs to be pulled from catapult

Output:  statsList

 

get_last_sync_dt

Parameters:

connection - the python connection object to datamarq

sportCode - the sport code to get the sync date (WB, BB, etc.) 

Function: Queries the catapult.last_sync_dt table to get the last sync date and pull incremental data going forward

Output:  syncDate

 

update_hash

Parameters:

connection - the python connection object to datamarq

table - the table name

fieldList - List of fields in the table

Function: takes a field list and builds a hash statement to get a unique row hash, then runs an update statement

Output:  None
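
Roughly, and assuming a SQL Server target with a row_hash column (the column name and hash algorithm here are illustrative), the statement built from the field list looks like:

    def update_hash(connection, table, fieldList):
        """Build a row hash from the listed fields and stamp it on every row (sketch only)."""
        # row_hash, SHA2_256 and the NVARCHAR cast are assumptions about the real proc.
        concat_expr = ", ".join(f"ISNULL(CAST({f} AS NVARCHAR(4000)), '')" for f in fieldList)
        sql = f"UPDATE {table} SET row_hash = HASHBYTES('SHA2_256', CONCAT({concat_expr}))"
        connection.cursor().execute(sql)
        connection.commit()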

cursor_list

Parameters:

cursor - a python sql cursor object (select statement usually)

name - the name of the tuple for each output row of the cursor

fields - list of fields in the output row

Function: Creates a result set (python list) from a cursor execution, with each row as a named tuple with the defined fields

Output:  resultSet - python list of named tuples, one per row
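
A minimal sketch of this helper using collections.namedtuple (argument handling assumed from the description above):

    from collections import namedtuple

    def cursor_list(cursor, name, fields):
        """Return the cursor's result set as a list of named tuples with the given fields."""
        Row = namedtuple(name, fields)
        return [Row(*r) for r in cursor.fetchall()]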
 

update_audit

Parameters:

connection - the python connection object to datamarq

table - the table name

stageSource - The string to be put into the stage_source field in the staging table

Function: Runs commands to update the audit fields in a staging table with the default getdate() for the modified and created date

Output:  None

update_dt

Parameters:

connection - the python connection object to datamarq

table - the table name

dt_field - the date field to be updated with the given date

Function: Runs commands to update the given date field to the given date

Output:  None

alter_index

Parameters:

connection - the python connection object to datamarq

table - the table name

method - The method to be run on the index (disable, rebuild)

Function: Runs commands to alter all indexes on the table for the given method

Output:  None

trunc_table

Parameters:

connection - the python connection object to datamarq

table - the table name

Function: Runs command to truncate the given table

Output:  None

  

Change Log:

Date        Developer   Change Record
7/1/2021    N/A         Initial Documentation


Troubleshooting Notes:

  1. Flat File Loads

    The loading of flat files is notoriously difficult when working with end-user files, as SSIS is very sensitive to metadata changes. If there are failures, this is the number one culprit.
  2. Catapult Match File

    The catapult match file is a unique process in that it gets imported and repopulated each run.  This is so new catapult ids can be linked to their correct muid.  If something crashes along the way, this file could get repopulated with blank MUIDs, which will kill the whole matching process.  If you see this file with blank MUIDs across the board, this has happened.  The DEV folder in the backup has an initial folder where the initial load documents live; these can be reprocessed as necessary.  This can also be helpful for any other data that gets messed up after the initial loads.