DS - PeopleSoft (Staging Package)
Last Updated 5/24/2021
Overview
Location
Schedule
Details
Change Log
Troubleshooting Notes
Overview:
The DS - PeopleSoft project (DS stands for "direct stage") moves data from PeopleSoft (PS) to the dataMarq staging database. The dataMarq staging database serves as the data lake for Marquette. Tables are pulled from the source with no manipulation; the only addition is the audit fields (see below).
The majority of tables are pulled directly into a base schema in dataMarq. Tables that are pulled based on deltas (Change Data Capture, or CDC) land initially in a "cdc" schema and are then loaded into the base schema using procedures that insert/update the appropriate data.
Base Schema: ps
CDC Schema: ps_cdc
The primary purpose of this job is to populate the ps schema in the datamarq_staging database with raw data directly from PS. Downstream applications include:
- dataMarq procs to build "acad" schema tables
- Source data for integration with other applications (ccure, peoplegrove, etc.)
Location:
The DS-PeopleSoft project contains all the packages for staging PS data and resides in the ETL-dataMarq folder in the SSIS catalog.
Schedule:
The job runs as part of the ETL - dataMarq SQL Agent jobs (full dataMarq schedule):
ETL - dataMarq - Incremental: daily except Friday at 9pm
ETL - dataMarq - Full: Friday at 9pm
Project and Package Details:
Each individual table in this staging project has its own package. Packages follow the same basic flow as described below.
Project Parameters
dwStgConnStr: connection string to datamarq_staging. Used in the connection manager so that connection is dynamic
fullRefresh: set to 0 (incremental) or 1 (full refresh)
psConnStr: connection string to the PS instance
refreshDays: integer, the number of days to look back for delta changes on delta load
yearCutoff: integer; the full job has an option to pull and insert only data for years at or after this cutoff year. It is not technically a full refresh, but rather a full refresh of all data for those years, which helps speed up a "full" refresh.
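These parameter values are supplied per run by the SQL Agent job steps. As a hedged illustration only (the package name and parameter values below are hypothetical), the equivalent T-SQL for running one package from the SSIS catalog with project parameters overridden would look roughly like this:

DECLARE @execution_id BIGINT;

-- Create an execution for one package in the DS-PeopleSoft project
EXEC SSISDB.catalog.create_execution
    @folder_name     = N'ETL-dataMarq',
    @project_name    = N'DS-PeopleSoft',
    @package_name    = N'stg_ps_addresses.dtsx',  -- hypothetical package name
    @use32bitruntime = 0,
    @reference_id    = NULL,
    @execution_id    = @execution_id OUTPUT;

-- Override project parameters for this run (object_type 20 = project parameter).
-- Values must match the parameter's SSIS data type.
EXEC SSISDB.catalog.set_execution_parameter_value @execution_id,
    @object_type = 20, @parameter_name = N'fullRefresh', @parameter_value = 1;
EXEC SSISDB.catalog.set_execution_parameter_value @execution_id,
    @object_type = 20, @parameter_name = N'yearCutoff', @parameter_value = 2015;  -- hypothetical cutoff

EXEC SSISDB.catalog.start_execution @execution_id;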
Package Parameters
V_STAGE_SOURCE: the stage source code that gets entered into the audit fields. It is PS for these jobs.
Package Variables
Only the CDC jobs have package variables; these variables are used to dynamically generate the SQL statements for pulling delta data from the source system.
cdcField: The field that will be used to determine the change data for a row on the source table. Usually this is a modified date or an updated date, but sometimes will be an effective date.
sqlSelect: This is the generic select statement that lists all the fields to be pulled from the source table. The clause "WHERE 1=1" must be included at the end so the dynamic where clause with the CDC information can be appended. "SELECT *" should not be used, and field names should be fully qualified.
sqlWhere: This is a dynamic variable with an expression that builds a where clause based on the values of the fullRefresh project parameter and the cdcField. Example of expression:
@[$Project::fullRefresh] == 1 ? " AND EXTRACT(YEAR FROM " + @[User::yearField] + ") >= " + (DT_WSTR, 4)@[$Project::yearCutoff] + " OR " + @[User::yearField] + " IS NULL" : "AND ( " + @[User::cdcField] + " >= sysdate - " + (DT_WSTR, 3)@[$Project::refreshDays] + " OR EXTRACT(YEAR FROM " + @[User::yearField] + ") >= EXTRACT(YEAR FROM sysdate) )"
sqlStatement: An expression variable built from the concatenation of sqlSelect + sqlWhere (see the example after this list). This variable is used in the Load Data task as the expression for the SQL statement that pulls source data.
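For example (the table and field names here are hypothetical, but the shape follows the expression above), on an incremental run with refreshDays = 7, cdcField = A.LASTUPDDTTM, and yearField = A.EFFDT, the pieces would combine roughly as follows:

-- sqlSelect: static portion, fully qualified fields, ending in WHERE 1=1
SELECT A.EMPLID, A.ADDRESS_TYPE, A.EFFDT, A.ADDRESS1, A.CITY, A.STATE, A.POSTAL, A.LASTUPDDTTM
FROM PS_ADDRESSES A
WHERE 1=1

-- sqlStatement: sqlSelect + sqlWhere as evaluated for the incremental branch
SELECT A.EMPLID, A.ADDRESS_TYPE, A.EFFDT, A.ADDRESS1, A.CITY, A.STATE, A.POSTAL, A.LASTUPDDTTM
FROM PS_ADDRESSES A
WHERE 1=1
  AND ( A.LASTUPDDTTM >= sysdate - 7
        OR EXTRACT(YEAR FROM A.EFFDT) >= EXTRACT(YEAR FROM sysdate) )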
Package Flow
- Truncate landing table - in either the ps_cdc or ps schema depending on job type
- Disable indexes, if applicable. This is done to speed up inserts (see the index sketch after this list).
- Load source data
- Add audit fields
stage_created: The date when the stage row was first loaded into datamarq_staging
stage_modified: The latest update date for that stage row; it should only differ from stage_created on CDC tables.
stage_deleteflag: Applied to soft-deleted staging rows. The data warehouse keeps a record of deleted rows in staging tables for troubleshooting lineage/missing-data issues, but this flag is set so they are not used in the downstream procs that build dim and fact tables.
stage_source: A string indicating from where the stage data was sourced. Especially helpful if one staging table has data from two different sources.
- Load data to landing table - either the ps_cdc or ps schema depending on job type
- Update stage table sys_hash_id on each row (only applicable to CDC tables with no modified date)
- The sys_hash_id is a hash of the concatenation of all the fields on the source table row; it essentially serves as a single, unique id of that row's values. Some tables have no modified date and therefore no direct way to know whether a row has been updated, but we still want to do CDC so we can leverage the stage_modified date for downstream processing (this happens in procs where we process deltas for dim and fact tables). For those tables we compare a ps_cdc row's sys_hash_id to the sys_hash_id for that row in the ps schema to tell whether the row has changed, and update the stage_modified field appropriately (see the hash sketch after this list).
- Execute CDC proc for table
- For CDC tables, we execute a custom stored procedure for that table. The CDC procs are all named using the table name, e.g. [ps_cdc].[sp_ps_addresses]. For more detail on the CDC procs, see the CDC Stored Procedures section below.
- Rebuild indexes
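The index steps referenced in the flow above are plain ALTER INDEX statements run against the landing table before and after the load. A minimal sketch, assuming a hypothetical landing table ps.ps_addresses with a nonclustered index IX_ps_addresses_emplid:

-- Before the load: disable a nonclustered index so the bulk insert is faster
-- (the clustered index is left alone so the table stays writable)
ALTER INDEX IX_ps_addresses_emplid ON ps.ps_addresses DISABLE;

-- After the load: rebuild, which also re-enables any disabled indexes
ALTER INDEX ALL ON ps.ps_addresses REBUILD;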
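Also referenced above, the sys_hash_id update for CDC tables without a modified date is essentially a hash over the concatenated source columns. A minimal sketch, assuming a hypothetical ps_cdc.ps_addresses landing table with a varbinary sys_hash_id column and the column list shown:

-- Hash the concatenation of all source fields on each freshly landed row.
-- CONCAT treats NULLs as empty strings; the delimiter avoids accidental collisions.
UPDATE c
SET c.sys_hash_id = HASHBYTES('SHA2_256',
        CONCAT(c.emplid, '|', c.address_type, '|', c.effdt, '|',
               c.address1, '|', c.city, '|', c.state, '|', c.postal))
FROM ps_cdc.ps_addresses AS c;

The comparison of this hash against the base-schema row happens in the table's CDC proc (see the sketch in the CDC Stored Procedures section).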
Package Snapshots
CDC Stored Procedures
Stored procedures at the database level are the primary method for making updates to staging tables when we are doing incremental (or delta/CDC) loads. Incremental data is loaded to a cdc schema and the stored procedure inserts/updates/sets deleteflags on the final staging table in the base schema.
CDC Proc Variables
fullRefresh: tinyint, 1 = full refresh, 0 = incremental/delta
This comes directly from the project parameter, which is set when the SQL Agent job is fired for a full refresh or an incremental run.
yearCutoff: int, the earliest year being processed for any incremental run
This comes from the SQL Agent job and is passed through the project parameter to this procedure. It is used to determine when a stage_deleteflag can be set, by indicating that the proc should not look for "missing" records before the cutoff year.
Inserts and updates are done using a merge statement, with a join on the table's natural key and a comparison of the sys_hash_id (a condensed sketch follows).
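A condensed sketch of the shape of one of these procs, using hypothetical key and column names (the real procs are written per table and cover the full column list):

CREATE OR ALTER PROCEDURE ps_cdc.sp_ps_addresses
    @fullRefresh TINYINT = 0,
    @yearCutoff  INT     = 2015  -- hypothetical default
AS
BEGIN
    SET NOCOUNT ON;

    -- Insert new rows and update changed rows, joining on the natural key
    -- and using sys_hash_id to detect changes.
    MERGE ps.ps_addresses AS tgt
    USING ps_cdc.ps_addresses AS src
       ON  tgt.emplid       = src.emplid
       AND tgt.address_type = src.address_type
       AND tgt.effdt        = src.effdt
    WHEN MATCHED AND tgt.sys_hash_id <> src.sys_hash_id THEN
        UPDATE SET address1         = src.address1,
                   city             = src.city,
                   sys_hash_id      = src.sys_hash_id,
                   stage_modified   = GETDATE(),
                   stage_deleteflag = 0
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (emplid, address_type, effdt, address1, city,
                sys_hash_id, stage_created, stage_modified, stage_deleteflag, stage_source)
        VALUES (src.emplid, src.address_type, src.effdt, src.address1, src.city,
                src.sys_hash_id, GETDATE(), GETDATE(), 0, 'PS');

    -- Soft-delete rows that are no longer in the source, but only within the
    -- year range that was actually pulled (guarded by fullRefresh/yearCutoff).
    UPDATE tgt
    SET tgt.stage_deleteflag = 1,
        tgt.stage_modified   = GETDATE()
    FROM ps.ps_addresses AS tgt
    WHERE @fullRefresh = 1
      AND YEAR(tgt.effdt) >= @yearCutoff
      AND NOT EXISTS (SELECT 1
                      FROM ps_cdc.ps_addresses AS src
                      WHERE src.emplid       = tgt.emplid
                        AND src.address_type = tgt.address_type
                        AND src.effdt        = tgt.effdt);
END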
Change Log:
Date | Developer | Change Record |
5/24/2021 | N/A | Initial Documentation |
Troubleshooting Notes:
- Changes to source tables
SSIS is very metadata sensitive, and any change to a source table can cause the packages to fail. We use fully qualified queries (as opposed to SELECT *) to help mitigate this issue. Patches to source systems should be tested in the INT environment to make sure the current packages continue to work.
- CDC not capturing deletes
An issue has popped up in the past (since fixed) where the CDC jobs did not correctly mark soft deletes, and deleted records remained in dim and fact tables during the week. A full refresh almost always fixes this issue, and many procs have been fixed to allow soft deletes during an incremental refresh.
- Bad natural key joins on CDC procs
If the natural key is not correctly identified for the join, rows can be improperly updated/deleted/inserted during the CDC merge process.