ETL - D2L (Staging Package) // dataMarq // 向日葵视频

Marquette.edu // dataMarq // Data Warehouse // Documentation //

ETL - D2L (Staging Package)

Last Updated 6/9/2021

Overview
Location
Schedule
Details
Master Packages
d2l_master
d2l_load_script
d2l_load_all
Python Modules
    d2l_load.py
    d2l_helper.py
    dw_helper.py
Change Log
Troubleshooting Notes

Overview:

The D2L ETL project is primarily used to pull raw data from D2L.

The ETL has two main components. A python package that pulls down data from the D2L datahub and stores the raw csv files into the BI shared drive, and SSIS packages that load the raw CSV files into the staging tables. The project/package can be run as an incremental load or a full load depending on parameters.

Full data files are only produced every Sunday night in the D2L datahub, so the data is refreshed incrementally during the week to keep up to date.

This has the following end uses:

Populating D2L staging tables in the data warehouse
Providing data for the d2l d and f tables in the data warehouse
Integrations with other university tools (Everspring)

Schemas in staging that are populated with this data include:

Base schema: d2l
CDC schema: d2l_cdc

Location:

The solution 鈥� ETL-D2L鈥� contains all the packages for staging D2L data and resides in the ETL-dataMarq folder in the SSIS catalog.

Schedule:

The jobs in this project run as part of multiple SQL Agent jobs (full dataMarq schedule).

ETL 鈥� D2L 鈥� Full - Everyday Sunday at 1am

ETL 鈥� D2L 鈥� Incremental - Everyday except Sunday at 12am

Project and Package Details:

The packages in ETL 鈥� D2L are a combination of python and SSIS. In general, they work with our classic ETL process, where there is one package per table that loads a flat file. The flat file (CSV) is downloaded directly each day from the D2L data hub using the python script.

Project Parameters

cleanupDays: The number of days of files to keep in the staging folder. These can be large so we usually only keep a couple of days

d2lEnvr: DEV or PROD, the d2l instance to pull data from for staging

dbEnvr: INT, STG, PRD 鈥� which DB should be loaded with the data. Passed to python program as 鈥揹bEvnr argument

dwConnStr: connection to datamarq

dwStgConnStr: connection to datamarq_staging

fsEnvr: The file share environment for the landing files. We have both a DEV and PROD landing folder to differentiate when jobs are pulling in different environments and not to overlap source files. Passed to python program as 鈥揻ileEnvr argument

fullRefresh: 1 = Full Refresh, 0 = Incremental

gitEnvr: DEV or PROD. Because these jobs use python packages they python code is stored in our git folder on the BI shared drive. For testing, we are able to point at the python code in DEV or in PROD.

loadType: FULL or DIFF, whether to run a full refresh or an incremental refresh. These are a totally different set of files, although they have the same structure

logLevel: DEBUG, INFO, WARNING, ERROR, CRITICAL 鈥� This is passed to the python scripts as an argument to set the log level for the jobs

pythonPath: The path to the python executable 鈥� saved in the D:/ drive on servers but usually in C:/ if executing locally/testing.

Master Packages

D2L_MASTER

Overview

This is the master orchestration package and runs the three phases in order 1) Python Load Script 2) d2l Load All Package 3) File Cleanup Package

Package Parameters

None

Package Variables

None

Package Flow

Execute Python Load Script package
Execute D2L Load All package
Execute File Cleanup Package

D2L_LOAD_SCRIPT

Overview

This packages has one step that executes the python script d2l_load.py to run either an incremental or full load.

Package Parameters

None

Package Variables

None

Package Flow

Run D2L load script d2l_load.py
"\"\\\\marqnet.mu.edu\\depts\\ITS\\BI\\git\\" + @[$Project::gitEnvr] + "\\DW\\ETL-D2L\\d2l_load\\d2l_load.py\" \"--dbEnvr\" \"" + @[$Project::dbEnvr] + "\" \"--d2lEnvr\" \"" + @[$Project::d2lEnvr] + "\" \"--loadType\" \"" + @[$Project::loadType] + "\" \"--log\" \"" + @[$Project::logLevel] + "\""

D2L_LOAD_ALL (TABLE LOAD PACKAGES)

Overview

This packages executes the d2l_stage packages which is a 鈥渕eta鈥� package of all the individual packages. Those packages are detailed below. They use our standard flat file methodology, although the CDC packages have a slightly different flow.

Package Parameters

None

Package Variables

sourceFolder: The folder to store the raw files once downloaded from D2L

varDestBackupFolderFile: The folder to store files once they have been loaded

varFileMask: The file mask (gradebooksettings_*.*) to look for raw files to load

varFileName: a variable to store the full file name from the look component for processing

varFileExists: True or False, set by the first step in the package to execute the rest only if a file with the specific mask exists

varSourceFolderFile: A variable set during run time to capture the full path of the file being processed

Package Flow

Script Check if File Exists
This is a SSIS script task that takes the file mask and folder as parameters and checks to see if file(s) exist to continue to the next step
(CDC Only) Check if Full or Incremental, path separates because there are different staging tables, cdc as the _cdc suffix
Truncate Staging Table
For Each Loop
1. Uses Source Folder and File Mask variables to look for files to process, loops through each of these files storing the file name to the File Name variable
2. Loads Data to Staging Table using Flat File to OLEDB Source
3. Archives File
(CDC Only) Execute the cdc stored proc. Each CDC table has its own procedure to merge CDC/Incremental data into the final table. This is part of our standard CDC methodology.
EXEC d2l_cdc.sp_competency_activity_results 1

Python Modules

D2L data is accessed through a web service which is part of the D2L data hub. The modules to interact with the data hub all reside in the ETL-D2L solution 鈥� and therefore in the git repository 鈥� alongside the SSIS packages, which call them. They also require a config file that is stored in the BI shared drive, config folder.

IMPORTANT AUTHENTICATION NOTE: These modules require an Oauth Access token from D2L. Unfortunately there is no programmatic way to get this for the first time. Generally we use a refresh token, but it expires after a week. If the job doesn鈥檛 run for a week a new access token needs to be generated manually and stored to the token file. This can be done using a python program stored in the ETL-D2L folder called oauth2-client-shell-master. For details on this program see the README in the folder.

D2L API resource:

Data Hub resource:

d2l_load.py

This program is the main driver of the file load. It uses functions from the dw_helper.py and d2l_helper.py modules to structure the downloading and unzipping of raw files from the data hub.

Parameters

dbEnvr 鈥� INT, STG, PRD

d2lEnvr 鈥� DEV or PROD 鈥� the d2l instance to access

loadType 鈥� FULL or DIFF

logLevel 鈥� the level to log to the log folder for the job (WARNING, INFO, ERROR, DEBUG)

Quasi code flow

Set the config file path based on the dbEnvr
Get a new access token using the refresh token from the last run
Get a list of plugin sets (D2L name for files) by executing the get_plugin_sets function from the d2l_helper.py module
Get only the 鈥減lugins鈥� we want to load by running the get_active_plugins function
Get all the download links by running the get_plugin_links function
Loop through the plugin links
1. Get the last sync date for that plugin
2. Get the active fields for the plugin
3. Get the created date for each individual link/file
4. Download the file if the created date is greater than the last sync date by running the get_plugin_response function
5. Run the unzip_and_save function
6. Run the update_last_sync_dt function

d2l_helper.py

All of the functions that are used to interact with the D2L APIs and download data

get_old_token

Parameters: None

Function: Accesses the token location from the config file and loads the tokenGeneric function to get a json response from a url payload. Customized for the specific format of the response json

Output: responseJson - Url response in JSON format

trade_in_refresh_token

Parameters:

token_file 鈥� Old token stored in token file

Function: Takes an old refresh token and gets a new access token from the oauth protocol

Output: respone.json() for the token response 鈥� essentially the new token

put_token_file

Parameters:

迟辞办别苍冲蹿颈濒别鈥� Old token stored in token file

Function: Stores a token response into the token file for future use

Output: None

get_with_auth

Parameters:

endpoint 鈥� The end point for the api call

access_token 鈥� The token to pass to the endpoint for authentication

Function: Takes an endpoint and a token and uses the requests module to get a response from an endpoint

Output: response 鈥� the endpoint response object

get_plugin_sets

Parameters:

d2lEnvr 鈥� The d2l instance to access

access_token 鈥� The token to pass to the endpoint for authentication

Function: Takes a d2lEnvr and gets the correct url for D2L data hub. Using the get_with_auth module it creates a list of all possible data_sets 鈥減lugins鈥� available to access.

Output: data_sets 鈥� a list of plugins 鈥� each plugin its own dictionary 鈥� of data sets to access and the information needed to access them (downloadLink, Name, PluginId)

get_plugin_links

Parameters:

set_list 鈥� the plugin list that is targeted for download

plugin_sets鈥� the full plugin list generated by get_plugin_sets

loadType 鈥� FULL or DIFF

Function: Uses the config file to get passed a list of targeted plugins, then loops through all the plugins and returns a dictionary of target sets.

Output: target_sets 鈥� a dictionary with pluginid as the key and a list of files that could be downloaded as the items

get_plugin_response

Parameters:

plugin_set 鈥� A dictionary item for a specific plugin from get_plugin_links

access_token 鈥� The token to pass to the endpoint for authentication

Function: Uses get_with_auth function to get the response from the download link for a plugin, this returns the response content which is a zipped file containing the csv of the file

Output: response.content 鈥� zipped folder in this case

unzip_and_save

Parameters:

plugin_set 鈥� A dictionary item for a specific plugin from get_plugin_links

response_content 鈥� the response from get_plugin_response function

dbEnvr 鈥� INT, STG, PRD

activeFields 鈥� a list of fields to include in the final csv. D2L often adds new fields to these datasets and that can break our load, so we only retain these fields in the csv

loadType 鈥� FULL or DIFF

Function: Takes the response content and unzips it and saves it to the correct landing folder with the created date prefixed to the name of the file. During the saving the function also removes any fields that are not in the active fields list to prevent errors in the SSIS package load.

Output: None

format_date_fields (not currently use)

Parameters:

df 鈥� pandas dataframe

Function: A small function to format all the date fields in a pandas data frame for loading to sql server.

Output: None

dw_helper.py

All of the functions that are used to interact with the datamarq database

db_connect

Parameters:

dbEnvr 鈥� INT, STG, PRD

Function: Creates a python odbc connection object

Output: conn 鈥� the Connection object

db_disconnect

Parameters:

connection 鈥� the python connection object to datamarq

Function: Ends the connection to the database

Output: None

cursor_list

Parameters:

cursor 鈥� a python sql cursor object (select statement usually)

name 鈥� the name of a tuple for each output row of the cursor

fields 鈥� list of fields in the output row

Function: Creates a result set (python list) from a cursor execution as a each row a named tupled with defined fields

Output: resultSet 鈥� python list of named tuples as each row

get_active_plugins

Parameters:

connection 鈥� the python connection object to datamarq

loadType鈥� FULL, DIFF

Function: Queries the table d2l.sync_dt for a list of active plugins for the appropriate load type

Output: syncDateList 鈥� a list of plugins, active_fields, and last_sync_dt

update_last_sync_dt

Parameters:

connection 鈥� the python connection object to datamarq

plugin_id

sync_dt 鈥� the date to store in the last_sync_dt field

Function: Executes query to update the sync date for a given plugin.

Output: None

Change Log:

Date	Developer	Change Record
06/9/2021	N/A	Initial Documentation

Troubleshooting Notes:

Changing Fields in Data Hub Data Sets
D2L releases monthly changes to its data sets, usually one or two of the ones we download get affected. This broke the ETL for a long time until we implemented the active_fields lookup and removed new fields until we were ready to add them. New fields not only break SSIS metadata, they require adding to the staging table ddl and a stored proc if a cdc table. So not exactly quick. Shouldn't break in the future but in case this is probably a culprit
Access and Refresh Token
D2L doesn't allow a programmatic way to get a new access token if your refresh token expires. So unless the job runs regularly this is likely to happen. New access tokens need to be created manually then by running a python oauth program stored in the git repo (see notes in program and above in overview). Once the new access/refresh tokens are stored in the token file all should run smoothly again.