ETL - D2L (Staging Package)
Last Updated 6/9/2021
Overview
Location
Schedule
Details
Master Packages
d2l_master
d2l_load_script
d2l_load_all
Python Modules
d2l_load.py
d2l_helper.py
dw_helper.py
Change Log
Troubleshooting Notes
Overview:
The D2L ETL project is primarily used to pull raw data from D2L.
The ETL has two main components. A python package that pulls down data from the D2L datahub and stores the raw csv files into the BI shared drive, and SSIS packages that load the raw CSV files into the staging tables. The project/package can be run as an incremental load or a full load depending on parameters.
Full data files are only produced every Sunday night in the D2L datahub, so the data is refreshed incrementally during the week to keep up to date.
This has the following end uses:
- Populating D2L staging tables in the data warehouse
- Providing data for the d2l d and f tables in the data warehouse
- Integrations with other university tools (Everspring)
Schemas in staging that are populated with this data include:
- Base schema: d2l
- CDC schema: d2l_cdc
Location:
The solution 鈥 ETL-D2L鈥 contains all the packages for staging D2L data and resides in the ETL-dataMarq folder in the SSIS catalog.
Schedule:
The jobs in this project run as part of multiple SQL Agent jobs (full dataMarq schedule).
ETL 鈥 D2L 鈥 Full - Everyday Sunday at 1am
ETL 鈥 D2L 鈥 Incremental - Everyday except Sunday at 12am
Project and Package Details:
The packages in ETL 鈥 D2L are a combination of python and SSIS. In general, they work with our classic ETL process, where there is one package per table that loads a flat file. The flat file (CSV) is downloaded directly each day from the D2L data hub using the python script.
Project Parameters
cleanupDays: The number of days of files to keep in the staging folder. These can be large so we usually only keep a couple of days
d2lEnvr: DEV or PROD, the d2l instance to pull data from for staging
dbEnvr: INT, STG, PRD 鈥 which DB should be loaded with the data. Passed to python program as 鈥揹bEvnr argument
dwConnStr: connection to datamarq
dwStgConnStr: connection to datamarq_staging
fsEnvr: The file share environment for the landing files. We have both a DEV and PROD landing folder to differentiate when jobs are pulling in different environments and not to overlap source files. Passed to python program as 鈥揻ileEnvr argument
fullRefresh: 1 = Full Refresh, 0 = Incremental
gitEnvr: DEV or PROD. Because these jobs use python packages they python code is stored in our git folder on the BI shared drive. For testing, we are able to point at the python code in DEV or in PROD.
loadType: FULL or DIFF, whether to run a full refresh or an incremental refresh. These are a totally different set of files, although they have the same structure
logLevel: DEBUG, INFO, WARNING, ERROR, CRITICAL 鈥 This is passed to the python scripts as an argument to set the log level for the jobs
pythonPath: The path to the python executable 鈥 saved in the D:/ drive on servers but usually in C:/ if executing locally/testing.
Master Packages
D2L_MASTER
Overview
This is the master orchestration package and runs the three phases in order 1) Python Load Script 2) d2l Load All Package 3) File Cleanup Package
Package Parameters
None
Package Variables
None
Package Flow
D2L_LOAD_SCRIPT
Overview
This packages has one step that executes the python script d2l_load.py to run either an incremental or full load.
Package Parameters
None
Package Variables
None
Package Flow
- Run D2L load script d2l_load.py
"\"\\\\marqnet.mu.edu\\depts\\ITS\\BI\\git\\" + @[$Project::gitEnvr] + "\\DW\\ETL-D2L\\d2l_load\\d2l_load.py\" \"--dbEnvr\" \"" + @[$Project::dbEnvr] + "\" \"--d2lEnvr\" \"" + @[$Project::d2lEnvr] + "\" \"--loadType\" \"" + @[$Project::loadType] + "\" \"--log\" \"" + @[$Project::logLevel] + "\""
D2L_LOAD_ALL (TABLE LOAD PACKAGES)
Overview
This packages executes the d2l_stage packages which is a 鈥渕eta鈥 package of all the individual packages. Those packages are detailed below. They use our standard flat file methodology, although the CDC packages have a slightly different flow.
Package Parameters
None
Package Variables
sourceFolder: The folder to store the raw files once downloaded from D2L
varDestBackupFolderFile: The folder to store files once they have been loaded
varFileMask: The file mask (gradebooksettings_*.*) to look for raw files to load
varFileName: a variable to store the full file name from the look component for processing
varFileExists: True or False, set by the first step in the package to execute the rest only if a file with the specific mask exists
varSourceFolderFile: A variable set during run time to capture the full path of the file being processed
Package Flow
- Script Check if File Exists
This is a SSIS script task that takes the file mask and folder as parameters and checks to see if file(s) exist to continue to the next step - (CDC Only) Check if Full or Incremental, path separates because there are different staging tables, cdc as the _cdc suffix
- Truncate Staging Table
- For Each Loop
- Uses Source Folder and File Mask variables to look for files to process, loops through each of these files storing the file name to the File Name variable
- Loads Data to Staging Table using Flat File to OLEDB Source
- Archives File
- (CDC Only) Execute the cdc stored proc. Each CDC table has its own procedure to merge CDC/Incremental data into the final table. This is part of our standard CDC methodology.
EXEC d2l_cdc.sp_competency_activity_results 1
Python Modules
D2L data is accessed through a web service which is part of the D2L data hub. The modules to interact with the data hub all reside in the ETL-D2L solution 鈥 and therefore in the git repository 鈥 alongside the SSIS packages, which call them. They also require a config file that is stored in the BI shared drive, config folder.
IMPORTANT AUTHENTICATION NOTE: These modules require an Oauth Access token from D2L. Unfortunately there is no programmatic way to get this for the first time. Generally we use a refresh token, but it expires after a week. If the job doesn鈥檛 run for a week a new access token needs to be generated manually and stored to the token file. This can be done using a python program stored in the ETL-D2L folder called oauth2-client-shell-master. For details on this program see the README in the folder.
D2L API resource:
Data Hub resource:
d2l_load.py
This program is the main driver of the file load. It uses functions from the dw_helper.py and d2l_helper.py modules to structure the downloading and unzipping of raw files from the data hub.
Parameters
dbEnvr 鈥 INT, STG, PRD
d2lEnvr 鈥 DEV or PROD 鈥 the d2l instance to access
loadType 鈥 FULL or DIFF
logLevel 鈥 the level to log to the log folder for the job (WARNING, INFO, ERROR, DEBUG)
Quasi code flow
- Set the config file path based on the dbEnvr
- Get a new access token using the refresh token from the last run
- Get a list of plugin sets (D2L name for files) by executing the get_plugin_sets function from the d2l_helper.py module
- Get only the 鈥減lugins鈥 we want to load by running the get_active_plugins function
- Get all the download links by running the get_plugin_links function
- Loop through the plugin links
- Get the last sync date for that plugin
- Get the active fields for the plugin
- Get the created date for each individual link/file
- Download the file if the created date is greater than the last sync date by running the get_plugin_response function
- Run the unzip_and_save function
- Run the update_last_sync_dt function
d2l_helper.py
All of the functions that are used to interact with the D2L APIs and download data
get_old_token
Parameters: None
Function: Accesses the token location from the config file and loads the tokenGeneric function to get a json response from a url payload. Customized for the specific format of the response json
Output: responseJson - Url response in JSON format
trade_in_refresh_token
Parameters:
token_file 鈥 Old token stored in token file
Function: Takes an old refresh token and gets a new access token from the oauth protocol
Output: respone.json() for the token response 鈥 essentially the new token
put_token_file
Parameters:
迟辞办别苍冲蹿颈濒别鈥 Old token stored in token file
Function: Stores a token response into the token file for future use
Output: None
get_with_auth
Parameters:
endpoint 鈥 The end point for the api call
access_token 鈥 The token to pass to the endpoint for authentication
Function: Takes an endpoint and a token and uses the requests module to get a response from an endpoint
Output: response 鈥 the endpoint response object
get_plugin_sets
Parameters:
d2lEnvr 鈥 The d2l instance to access
access_token 鈥 The token to pass to the endpoint for authentication
Function: Takes a d2lEnvr and gets the correct url for D2L data hub. Using the get_with_auth module it creates a list of all possible data_sets 鈥減lugins鈥 available to access.
Output: data_sets 鈥 a list of plugins 鈥 each plugin its own dictionary 鈥 of data sets to access and the information needed to access them (downloadLink, Name, PluginId)
get_plugin_links
Parameters:
set_list 鈥 the plugin list that is targeted for download
plugin_sets鈥 the full plugin list generated by get_plugin_sets
loadType 鈥 FULL or DIFF
Function: Uses the config file to get passed a list of targeted plugins, then loops through all the plugins and returns a dictionary of target sets.
Output: target_sets 鈥 a dictionary with pluginid as the key and a list of files that could be downloaded as the items
get_plugin_response
Parameters:
plugin_set 鈥 A dictionary item for a specific plugin from get_plugin_links
access_token 鈥 The token to pass to the endpoint for authentication
Function: Uses get_with_auth function to get the response from the download link for a plugin, this returns the response content which is a zipped file containing the csv of the file
Output: response.content 鈥 zipped folder in this case
unzip_and_save
Parameters:
plugin_set 鈥 A dictionary item for a specific plugin from get_plugin_links
response_content 鈥 the response from get_plugin_response function
dbEnvr 鈥 INT, STG, PRD
activeFields 鈥 a list of fields to include in the final csv. D2L often adds new fields to these datasets and that can break our load, so we only retain these fields in the csv
loadType 鈥 FULL or DIFF
Function: Takes the response content and unzips it and saves it to the correct landing folder with the created date prefixed to the name of the file. During the saving the function also removes any fields that are not in the active fields list to prevent errors in the SSIS package load.
Output: None
format_date_fields (not currently use)
Parameters:
df 鈥 pandas dataframe
Function: A small function to format all the date fields in a pandas data frame for loading to sql server.
Output: None
dw_helper.py
All of the functions that are used to interact with the datamarq database
db_connect
Parameters:
dbEnvr 鈥 INT, STG, PRD
Function: Creates a python odbc connection object
Output: conn 鈥 the Connection object
db_disconnect
Parameters:
connection 鈥 the python connection object to datamarq
Function: Ends the connection to the database
Output: None
cursor_list
Parameters:
cursor 鈥 a python sql cursor object (select statement usually)
name 鈥 the name of a tuple for each output row of the cursor
fields 鈥 list of fields in the output row
Function: Creates a result set (python list) from a cursor execution as a each row a named tupled with defined fields
Output: resultSet 鈥 python list of named tuples as each row
get_active_plugins
Parameters:
connection 鈥 the python connection object to datamarq
loadType鈥 FULL, DIFF
Function: Queries the table d2l.sync_dt for a list of active plugins for the appropriate load type
Output: syncDateList 鈥 a list of plugins, active_fields, and last_sync_dt
update_last_sync_dt
Parameters:
connection 鈥 the python connection object to datamarq
plugin_id
sync_dt 鈥 the date to store in the last_sync_dt field
Function: Executes query to update the sync date for a given plugin.
Output: None
Change Log:
Date | Developer | Change Record |
06/9/2021 | N/A | Initial Documentation |
Troubleshooting Notes:
-
Changing Fields in Data Hub Data Sets
D2L releases monthly changes to its data sets, usually one or two of the ones we download get affected. This broke the ETL for a long time until we implemented the active_fields lookup and removed new fields until we were ready to add them. New fields not only break SSIS metadata, they require adding to the staging table ddl and a stored proc if a cdc table. So not exactly quick. Shouldn't break in the future but in case this is probably a culprit -
Access and Refresh Token
D2L doesn't allow a programmatic way to get a new access token if your refresh token expires. So unless the job runs regularly this is likely to happen. New access tokens need to be created manually then by running a python oauth program stored in the git repo (see notes in program and above in overview). Once the new access/refresh tokens are stored in the token file all should run smoothly again.