ETL - Athletics (Staging Package)
Last Updated 7/2/2021
Overview
Location
Schedule
Details
Master Packages
athl_load_all
catapult_load
output_files
Python Modules
catapult_load.py
catapult_helper.py
dw_helper.py
Change Log
Troubleshooting Notes
Overview:
The Athletics ETL is primarily used to pull strength and performance data used by athletics teams for day-to-day performance analysis.
The ETL has multiple components that pull data from various sources, including flat files and APIs.
The primary purpose of this job is to populate data needed for the Athletics performance dashboards. Uses include:
- Populate staging table schemas with athletics data
- dataMarq procs to build athletics related tables
- d_athletes: One row per athlete with distinguishing information about the athletes
- d_rosters: One row per athlete per roster year with details about the roster spot - position, etc.
- d_seasons: One row per sport per season with details about the seasons (wins, losses, etc.)
- d_catapult_(athletes/activities/periods/tags): Catapult is a third-party tool that tracks movement; these are the d_ tables related to the catapult data
- f_athlete_test_results: The results of strength and performance tests done by the staff
- f_catapult_(period/daily)_stats: The movement/activity stats from catapult
Schemas in staging that are populated with this data include:
- Base schemas: athl, catapult
- CDC schemas: athl_cdc, catapult_cdc
Location:
The solution "ETL-Athletics" contains all the packages for staging athletics data and resides in the ETL-dataMarq folder in the SSIS catalog.
Schedule:
The jobs in this project run as part of multiple SQL Agent jobs (full dataMarq schedule).
ETL - Athletic Performance - Full - Every day at 3am
ETL - Athletic Performance - Incremental - Hourly
Project and Package Details:
The packages in ETL - Athletics work with a variety of source technologies, including flat files, DB and web APIs. Each type follows our fairly standard methodology for loading data to staging.
Project Parameters
brand: This is either COVID or MU, to differentiate the two qualtrics instances we want to pull data from. These packages are designed to work with either instance, although we are only pulling from COVID now. Passed to python program as –brand argument
dbEnvr: INT, STG, PRD – which DB should be loaded with the data. Passed to python program as –dbEnvr argument
dwConnStr: connection to datamarq
dwStgConnStr: connection to datamarq_staging
fsEnvr: The file share environment for the landing files. We have both a DEV and PROD landing folder to differentiate when jobs are pulling in different environments and not to overlap source files. Passed to python program as –fileEnvr argument
fullRefresh: 1 = Full Refresh, 0 = Incremental
gitEnvr: DEV or PROD. Because these jobs use python packages, the python code is stored in our git folder on the BI shared drive. For testing, we are able to point at the python code in DEV or in PROD.
logLevel: DEBUG, INFO, WARNING, ERROR, CRITICAL – This is passed to the python scripts as an argument to set the log level for the jobs
pythonPath: The path to the python executable – saved on the D:/ drive on servers but usually in C:/ if executing locally/testing.
sourceFolder: The folder path for the raw files that will be imported/exported
Master Packages
ATHL_LOAD_ALL (Flat File Load Packages)
Overview
This package executes all the individual flat file load packages (generic flow below and details on individual files). They use our standard flat file methodology and most tables have CDC. CDC is important here because there are 5 people working in the department and each might load multiple partial files every day, so the data has to incrementally update through a CDC import.
Note that the strength staff is uploading these flat files directly to the folder; as a result, there is a high chance of metadata mismatch if the files don't align exactly with the template.
Package Parameters
None
Package Variables
sourceFolder: The folder to look for raw files
varDestBackupFolderFile: The folder to store files once they have been loaded
varFileMask: The file mask (gradebooksettings_*.*) to look for raw files to load
varFileName: a variable to store the full file name from the loop component for processing
varFileExists: True or False, set by the first step in the package to execute the rest only if a file with the specific mask exists
varSourceFolderFile: A variable set during run time to capture the full path of the file being processed
Package Flow
- Script Check if File Exists
This is a SSIS script task that takes the file mask and folder as parameters and checks to see if file(s) exist to continue to the next step
- Loop Through Each File in the Folder
- Uses Source Folder and File Mask variables to look for files to process, loops through each of these files storing the file name to the File Name variable
- Truncate CDC Table
- Loads Data to Staging Table using Flat File to OLEDB Source
- Execute CDC staging proc (this merges in new data and/or runs a delete and insert statement directly in the proc)
- Archive File - the processed file is moved to the backup folder
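For orientation, a rough Python sketch of the same flow. The real implementation is the SSIS package above; the function, parameters and the bulk-load step here are placeholders:

    import glob
    import os
    import shutil

    import pyodbc

    def load_flat_files(source_folder, file_mask, cdc_table, cdc_proc, backup_folder, conn_str):
        """Illustrative only: mirrors the check -> loop -> truncate -> load -> CDC proc -> archive flow."""
        files = glob.glob(os.path.join(source_folder, file_mask))
        if not files:                                       # Script Check if File Exists
            return
        conn = pyodbc.connect(conn_str)
        cursor = conn.cursor()
        for path in files:                                  # Loop Through Each File in the Folder
            cursor.execute(f"TRUNCATE TABLE {cdc_table}")   # Truncate CDC Table
            # Flat File -> OLEDB equivalent: bulk insert the file's rows into the CDC table here
            cursor.execute(f"EXEC {cdc_proc}")              # Execute CDC staging proc
            conn.commit()
            # Archive File: move to the backup folder (the package also appends the load date)
            shutil.move(path, os.path.join(backup_folder, os.path.basename(path)))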
File Loading Instructions
The Athletics department will be loading files directly into the BI shared drive Performance folder. These are instructions on how to drop those files.
- Files must have exactly the field headers and layout listed below. If field headers are in the wrong order or missing, the load job will fail.
- All files except DXA Access files should be dropped directly in the DEV or PROD folder. DEV is for testing and PROD is for the final data that will be consumed by users. DXA files should be dropped in the specific dxa folder and can be embedded in nested folders (MBB/12-1-2021, etc.)
- Once files are loaded, they will show up in the "backup" folder with the date they were loaded appended to the file name. Any file that is left in the DEV/PROD folder after a load job has run (every hour) means that file failed to load for some reason - usually bad headers
- For results, athlete, rosters, and seasons, the files can contain partial data (i.e., one day's worth of results, one season's roster, etc.). For tests and catapult_match, the full file must be present. Notes on this:
- Rosters must have a full season's worth of data. Different seasons can be in different files, but a full season must be loaded as these are treated as individual units. This is done in order to be able to delete athletes who were accidentally added to the wrong seasons
- When new athletes are entered into catapult, their catapult id needs to be paired with their muid. This is a manual process and is done through the catapult_match.csv file. This file will remain in the Performance folder and will be imported and exported every job. If new athletes are added to catapult they will be added to the match file and will need to have their muid entered on the file by Athletics staff
- To verify the whole load job there is an output_files folder. After every job execution, new files will be output from the data warehouse that contain the key processed data. This can be used by athletics staff as a check or a quick lookup.
NOTES: These should be csv files; muid will most likely come in as an int, so joins need to handle the data type carefully (see the sketch after the layouts below).
- Results: result_date, muid, first_name, last_name, test_code, test_name, rep, result, team_ind
- Athlete: muid, first_name, last_name, pref_first_name, scholarship_ind, pro_ind, nba_ind, national_team_ind, transfer_out_ind
- Rosters: sport_code, sport, muid, first_name, last_name, alias_name, position_code, position, season, class_year, year_at_mu
- Tests: test_code, test_name, test_category, pr_type
- Season: sport_code, season, coach, total_wins, total_losses, total_ties, conference_champ_ind, conference_wins, conference_losses, conference_ties, ncaa_ind, ncaa_wins
- Catapult Match: id, first_name, last_name, muid
- DXA: These are actually individual Access database files. The basic procedure is the same except a query is used to load the staging table and the files have to be extracted from their source folders as they are uploaded in nested folders.
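Per the NOTES above about muid coming in as an int, a minimal pandas sketch of reading these files with muid forced to a string before joining; the file names are hypothetical:

    import pandas as pd

    # Read muid as a string so joins don't silently miss on an int vs. varchar mismatch.
    results = pd.read_csv("results_20210701.csv", dtype={"muid": str})   # hypothetical file name
    athletes = pd.read_csv("athlete.csv", dtype={"muid": str})

    joined = results.merge(athletes, on="muid", how="left")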
CATAPULT_LOAD
Overview
This package is very basic and has one task that executes the python module that loads data from the catapult API. The python job doesn't land any raw flat files and does all the inserting and updating directly in the python code.
Package Parameters
None
Package Variables
None
Package Flow
- A single Execute Process task that calls the catapult_load.py python module (described below) with the project parameters.
OUTPUT_FILES
Overview
The strength staff would like flat files of the data that has been processed by the warehouse, so this package does a simple export and overwrite of the base files of the DW every time the job runs.
Package Parameters
None
Package Variables
sourceFolder: The folder to look for raw files
Package Flow
- Data flow tasks, one per output below, that each run a sql query and output the results into a flat csv file:
athletes
rosters
test_results
dxa
catapult_period_stats
catapult_daily_stats
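The equivalent of one of these data flows as a hedged pandas sketch; the connection string, output folder, schema and queries are placeholders, and the real package uses SSIS data flows:

    import os

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect("DSN=datamarq")                    # placeholder connection
    output_folder = r"\\fileshare\Performance\output_files"  # placeholder path

    # One query per output file; overwrite the base file on every run.
    exports = {
        "athletes.csv": "SELECT * FROM dbo.d_athletes",                  # schema assumed
        "test_results.csv": "SELECT * FROM dbo.f_athlete_test_results",
    }
    for file_name, sql in exports.items():
        pd.read_sql(sql, conn).to_csv(os.path.join(output_folder, file_name), index=False)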
Python Modules
Athletics uses a number of third party services to track performance data, all accessed through web service APIs. dataMarq pulls this data through python modules. These all reside in the ETL-Athletics solution - and therefore in the git repository - alongside the SSIS packages which call them. They also require a config file that is stored in the config folder on the BI shared drive.
Catapult API Reference:
catapult_load.py
This program is the main driver of the catapult data load. It leverages functions in the two "helper" modules to do the actual work, but provides the flow logic.
Parameters
dbEnvr – INT, STG, PRD
fullRefresh – 1 or 0, whether to run just incremental activities or athletes, teams, tags, etc.
sportCode (optional) - if running for only one sport (BB, WB, etc.)
logLevel – the level to log to the log folder for the job
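A minimal sketch of how catapult_load.py might parse these parameters; the flag spelling and argparse usage are assumptions, not the confirmed implementation:

    import argparse

    parser = argparse.ArgumentParser(description="Catapult API load driver")
    parser.add_argument("--dbEnvr", choices=["INT", "STG", "PRD"], required=True)
    parser.add_argument("--fullRefresh", type=int, choices=[0, 1], default=0)
    parser.add_argument("--sportCode", default=None, help="Optional: run only one sport, e.g. BB or WB")
    parser.add_argument("--logLevel", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    args = parser.parse_args()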
Quasi code flow
- Set the config file path based on the dbEnvr
- Set key variables for the tags table and stats table - as these are standard across all sports
- Connect to the DB
- Get a list of stats to pull down from the DW using the get_active_stats module in the dw_helper.
- Loop through the tokens in the config file - there is one api token for each sport
- Loop through each endpoint in (athletes, teams, tags, tagtype) - these are "d" type endpoints
- Get responses and load them into the data warehouse using the get_endpoint, insert_rows and cdc_rows modules
- For the activities endpoint (the "f" type endpoint)
- Get last sync date by sport using the get_last_sync_dt module
- Create a date list for each day between the last sync date and the current date - this is done because only one day of stats is loaded at a time to speed processing and lessen the chance of losing data on a failure
- Loop through each date and get the activities data
- Loop through the detailed activity data and load it to the activities, periods, participants, stats and tags_list tables
- Update last sync dt table
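For orientation, a condensed sketch of this flow; the helper calls follow the module descriptions below, while the config layout, table names, field lists and parameter names are placeholders rather than the confirmed implementation:

    from datetime import date, timedelta

    import catapult_helper as cat   # helper modules described below
    import dw_helper as dw

    def run_load(conn, tokens, stats_list):
        """tokens: assumed dict of sport_code -> api token, one per sport (from the config file)."""
        for sport_code, api_token in tokens.items():
            # "d" type endpoints
            for endpoint in ("athletes", "teams", "tags", "tagtype"):
                response_dict = cat.get_endpoint(api_token, endpoint)
                rows, tag_rows, stat_rows = cat.get_response_tag_lists(
                    api_token, endpoint, response_dict, [], stats_list, [])   # field lists omitted
                dw.insert_rows(conn, f"catapult_cdc.{endpoint}", [], rows)    # placeholder table/fields
                dw.cdc_rows(conn, f"catapult_cdc.{endpoint}", f"catapult.{endpoint}", ["id"])

            # "f" type endpoint: activities, loaded one day at a time since the last sync
            last_sync = dw.get_last_sync_dt(conn, sport_code)   # assumed to return a datetime.date
            for n in range((date.today() - last_sync).days + 1):
                day = last_sync + timedelta(days=n)
                activities = cat.get_endpoint(api_token, "activities", {"startime": str(day)})  # param name assumed
                # ...load activities, periods, participants, stats and tags_list, then update last sync dt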
catapult_helper.py
All of the functions that are used to interact with the catapult APIs and download data
get_response
Parameters:
url – url for web service response
headers – headers to be passed to the get call
Function: Generic function to get a json response from a url payload. Customized for the specific format of the response json
Output: responseJson - Url response in JSON format
get_endpoint
Parameters:
apiToken – Catapult API token
endpoint - the endpoint to tag on to the api url
parms - If there are any params for the endpoint (start time, etc.)
Function: Takes a token and the endpoint and will loop through all the pages (catapult usually returns single page results) and returns the responses in a dictionary keyed by id
Output: responseIdsDict- A dictionary with the response in JSON format keyed by the id of the item
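A minimal sketch of this style of paged GET, assuming the requests library; the base URL, auth header and paging parameter are assumptions (catapult usually returns everything in one page anyway):

    import requests

    def get_endpoint(apiToken, endpoint, parms=None):
        """Return the endpoint's items as a dictionary keyed by each item's id."""
        url = f"https://connect.catapultsports.com/api/v6/{endpoint}"   # assumed base URL
        headers = {"Authorization": f"Bearer {apiToken}"}
        responseIdsDict = {}
        for page in range(1, 51):                       # page cap is just a safety net for this sketch
            params = dict(parms or {}, page=page)       # paging parameter is an assumption
            resp = requests.get(url, headers=headers, params=params)
            resp.raise_for_status()
            items = resp.json()
            if not items:
                break
            for item in items:
                responseIdsDict[item["id"]] = item
        return responseIdsDict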
get_response_tag_lists
Parameters:
apiToken – Catapult API token
endpoint - the endpoint to tag on to the api url
responseIdsDict - the response dictionary returned by get_endpoint
fieldList - the field list that will be inserted into the staging table
statSlugs - the list of stats that will be inserted into the staging table
statsFields - the list of stats fields that will be inserted into the staging table
Function: The function loops through the response from a given endpoint and creates lists of data ready to be inserted into the staging tables
Output: responseList, tagList, statsList - all the lists ready to be inserted into the database
post_request_status
Parameters:
apiToken - Catapult api token
filterName – The filtered item (activity, period, etc.)
filterids - the ids to pass to return from the stats endpoint
params - params to be passed to the stats endpoint
groupBy - the group by clause to get from the stats endpoint
Function: This is a custom function to work with the catapult stats endpoint and return catapult stats for specific activities
Output: nextPage, response, responseJson - the response data from the stats request
dw_helper.py
All of the functions that are used to interact with the datamarq database
insert_rows
Parameters:
connection – the python connection object to datamarq
table – the table name for insert
fieldList – the list of fields to be inserted
valueList – the list of values to be inserted
Function: Generic function that inserts values into a given table and fields
Output: s – number of successful rows inserted, f – number of failed rows caught in exception
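A minimal sketch of a generic insert like this, assuming a pyodbc-style connection; the real module may differ:

    def insert_rows(connection, table, fieldList, valueList):
        """Insert each row in valueList into table; return (successes, failures)."""
        placeholders = ", ".join("?" for _ in fieldList)
        sql = f"INSERT INTO {table} ({', '.join(fieldList)}) VALUES ({placeholders})"
        cursor = connection.cursor()
        s, f = 0, 0
        for row in valueList:
            try:
                cursor.execute(sql, row)
                s += 1
            except Exception:
                f += 1                     # count the failed row and keep loading
        connection.commit()
        return s, f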
cdc_rows
Parameters:
connection – the python connection object to datamarq
cdcTable – the cdc table
table – the base table
keyList – the keys for the cdc matching
Function: This function executes a generic catapult cdc proc - sp_delete_insert - that deletes and inserts from the cdc table to the base table.
Output: None
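Roughly, a sketch of that call; the proc's parameter order is assumed from the description above:

    def cdc_rows(connection, cdcTable, table, keyList):
        """Run the generic delete/insert CDC proc to merge the CDC table into the base table."""
        cursor = connection.cursor()
        # Assumed parameter order: cdc table, base table, comma-separated key list
        cursor.execute("EXEC sp_delete_insert ?, ?, ?", (cdcTable, table, ",".join(keyList)))
        connection.commit()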
get_active_stats
Parameters:
connection – the python connection object to datamarq
Function: Queries the catapult.active_stats table to get a list of the stat slugs to be pulled from catapult
Output: statsList
get_last_sync_dt
Parameters:
connection – the python connection object to datamarq
sportCode - the sport code to get the sync date (WB, BB, etc.)
Function: Queries the catapult.last_sync_dt table to get the last sync date and pull incremental data going forward
Output: syncDate
update_hash
Parameters:
connection – the python connection object to datamarq
table – the table name
fieldList – list of fields in the table
Function: Takes a field list and builds a hash statement to get a unique row hash, then runs an update statement
Output: None
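A hedged sketch of the kind of statement this could build; the row_hash column name and the HASHBYTES/CONCAT approach are assumptions:

    def update_hash(connection, table, fieldList):
        """Build and run an UPDATE that sets a hash column from the listed fields."""
        concat_expr = ", ".join(f"ISNULL(CAST({fld} AS varchar(255)), '')" for fld in fieldList)
        sql = f"UPDATE {table} SET row_hash = HASHBYTES('SHA2_256', CONCAT({concat_expr}))"
        connection.cursor().execute(sql)
        connection.commit()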
cursor_list
Parameters:
cursor – a python sql cursor object (select statement usually)
name – the name of the named tuple for each output row of the cursor
fields – list of fields in the output row
Function: Creates a result set (python list) from a cursor execution, with each row returned as a named tuple with the defined fields
Output: resultSet 鈥 python list of named tuples as each row
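A minimal version of what this helper describes:

    from collections import namedtuple

    def cursor_list(cursor, name, fields):
        """Return the cursor's remaining rows as a list of named tuples with the given fields."""
        Row = namedtuple(name, fields)
        return [Row(*row) for row in cursor.fetchall()]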
update_audit
Parameters:
connection – the python connection object to datamarq
table – the table name
stageSource – The string to be put into the stage_source field in the staging table
Function: Runs commands to update the audit fields in a staging table with the default getdate() for the modified and created date
Output: None
update_dt
Parameters:
connection – the python connection object to datamarq
table – the table name
dt_field – The date to be entered into a datefield
Function: Runs commands to update the given date field to the given date
Output: None
alter_index
Parameters:
connection – the python connection object to datamarq
table – the table name
method – The method to be run on the index (disable, rebuild)
Function: Runs commands to alter all indexes on the table for the given method
Output: None
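Roughly, assuming it issues ALTER INDEX ALL on the table:

    def alter_index(connection, table, method):
        """Disable or rebuild all indexes on the table (e.g. around a bulk load)."""
        if method.lower() not in ("disable", "rebuild"):
            raise ValueError("method must be 'disable' or 'rebuild'")
        connection.cursor().execute(f"ALTER INDEX ALL ON {table} {method.upper()}")
        connection.commit()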
trunc_table
Parameters:
connection – the python connection object to datamarq
table – the table name
Function: Runs command to truncate the given table
Output: None
Change Log:
Date | Developer | Change Record |
7/1/2021 | N/A | Initial Documentation |
Troubleshooting Notes:
- Flat File Loads
The loading of flat files is notoriously difficult when working with end user files as SSIS is very sensitive to metadata changes. If there are failures this is the number one culprit.
- Catapult Match File
The catapult match file is a unique process in that it gets imported and repopulated each run. This is so new catapult ids can be linked to their correct muid. If something crashes along the way, this file could get repopulated with blank MUIDs, which will kill the whole matching process. If you see this file with blank MUIDs across the board, this has happened. The DEV folder in the backup has an initial folder where the initial load documents live that can be reprocessed as necessary. This can also be helpful for any other data that gets messed up after the initial loads.