BI Solution to Prepare Raw File Data for Analytics

Client

The US-based client provides services to various car dealers and wanted a BI solution to load data from raw files into a data warehouse. The client received raw files from its dealers on different schedules, in different formats, at a remote location.


Industry

Automobile


Business Challenge

  • Pick up files from a remote location using the path given in a property file.
  • Process data in chunks of the size defined in the property file, to avoid load on the source system.
  • Track whether every record was processed, with a moved date or error details.
  • Apply various business rules to source data coming from multiple files in multiple formats.
  • Check for the existence of files at the remote location, process files of different formats, and load the data into a SQL database.
  • Keep a record of what data was loaded, when, and from which file.
  • Handle delta or full loads for the various files, as per the client's requirements.
  • Maintain logging and audit records for every job.
  • Send a success/failure notification email, with audit log details and an attachment, for every process to a defined group of users, as per values in the property file.

Benefits

  • Automatic processing of raw files of different formats whenever they become available at the remote location, with the data loaded into the SQL database.
  • Accurate logging and auditing to monitor the status of each job and the number of rows processed.
  • Email notifications informing the maintenance team of success or failure, so they can act if required.
  • No need to worry about partial process failures.


Solution

Our team of experts used Talend Open Data Solution to extract data from raw files in various formats, such as CSV and Excel.

Our team collected sample data files that the client receives from its dealers at various time intervals.

We created a metadata document defining the columns and their data types for each file.

We designed a flow diagram in Visio for each file load, to simplify development of the full and delta loads.

An ETL package was developed using Talend to extract data from the various files whenever they become available at the remote location.
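
As a sketch of this pickup step, the plain-Java fragment below (Talend jobs compile to Java) checks whether a file exists at the configured path and routes it to a format-specific handler. The path handling and handler names are illustrative assumptions, not the actual job design.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FilePickup {
        // Checks for the file at the path from the property file and
        // routes it to a format-specific handler (names are illustrative).
        public static void pickUp(String remotePath) {
            Path file = Paths.get(remotePath);
            if (!Files.exists(file)) {
                return; // nothing to do until the dealer drops the file
            }
            String name = file.getFileName().toString().toLowerCase();
            if (name.endsWith(".csv")) {
                processCsv(file);
            } else if (name.endsWith(".xlsx") || name.endsWith(".xls")) {
                processExcel(file);
            }
        }
        private static void processCsv(Path file)   { /* extraction job for CSV */ }
        private static void processExcel(Path file) { /* extraction job for Excel */ }
    }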

A master package was created to run all child jobs, record in the process job logs the time taken by each job for that run, and log audit counts for each file.

It also sends an email notifying success or failure, with the list of jobs and audit counts for that run of the master job, to the specific email group defined for each outcome.
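
A minimal sketch of the notification step using the JavaMail API; the property keys and success/failure group names are illustrative assumptions, not the client's actual configuration.

    import java.util.Properties;
    import javax.mail.Message;
    import javax.mail.MessagingException;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;

    public class RunNotifier {
        public static void send(boolean success, String auditSummary,
                                Properties cfg) throws MessagingException {
            Properties mail = new Properties();
            mail.put("mail.smtp.host", cfg.getProperty("smtp.host"));
            mail.put("mail.smtp.port", cfg.getProperty("smtp.port", "25"));
            Session session = Session.getInstance(mail);

            MimeMessage msg = new MimeMessage(session);
            msg.setFrom(new InternetAddress(cfg.getProperty("mail.from")));
            // Separate recipient groups for success and failure, as in the property file.
            String group = success ? cfg.getProperty("mail.group.success")
                                   : cfg.getProperty("mail.group.failure");
            msg.setRecipients(Message.RecipientType.TO, InternetAddress.parse(group));
            msg.setSubject((success ? "SUCCESS" : "FAILURE") + " - Master job run");
            msg.setText(auditSummary); // list of jobs and audit counts for the run
            Transport.send(msg);
        }
    }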

We also added functionality to process data in chunks instead of all at once; the client can set a parameter in the property file specifying the number of records to process at a time for each file.
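
A minimal sketch of this chunked processing, assuming a line-oriented file: records are buffered up to the configured chunk size and flushed as a batch, so the whole file never sits in memory. The insertBatch helper is hypothetical.

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedLoader {
        // Processes the file in batches of chunkSize (from the property file)
        // so the source system is never asked to hold the whole file at once.
        public static void load(Path file, int chunkSize) throws Exception {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                List<String> chunk = new ArrayList<>(chunkSize);
                String line;
                while ((line = reader.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == chunkSize) {
                        insertBatch(chunk); // hypothetical JDBC batch insert
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    insertBatch(chunk); // flush the trailing partial chunk
                }
            }
        }
        private static void insertBatch(List<String> rows) { /* JDBC batch insert */ }
    }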

In the property file, the client can supply database credentials, the step/chunk size for each file, the remote location of each file, SMTP credentials, and the email notification groups for success and failure.
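
The sketch below shows how such a property file might be loaded with java.util.Properties; the key names in the trailing comment are illustrative assumptions, not the client's actual configuration.

    import java.io.FileInputStream;
    import java.util.Properties;

    public class JobConfig {
        // Loads the property file that drives paths, chunk sizes, credentials,
        // and notification groups for every job.
        public static Properties load(String path) throws Exception {
            Properties p = new Properties();
            try (FileInputStream in = new FileInputStream(path)) {
                p.load(in);
            }
            return p;
        }
    }

    // A property file for this setup might contain entries such as
    // (illustrative keys, not the client's actual file):
    //
    //   db.url=jdbc:sqlserver://dbhost:1433;databaseName=Staging
    //   db.user=etl_user
    //   dealer_sales.path=/data/incoming/dealer_sales.csv
    //   dealer_sales.chunkSize=5000
    //   smtp.host=mail.example.com
    //   mail.group.success=bi-team@example.com
    //   mail.group.failure=bi-oncall@example.com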

Error rows can be logged to an associated error table or a CSV file, with a unique process ID, job ID, file ID, the error record, and error details, depending on the option specified in the property file.
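
A minimal sketch of that dual-sink error logging: the property-file option decides between a JDBC insert into the error table and an append to a CSV file. Table, column, and file names are assumptions.

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class ErrorLogger {
        // Logs one rejected record to the error table or to a CSV file,
        // depending on the sink chosen in the property file.
        public static void log(String sink, String dbUrl, long processId, long jobId,
                               long fileId, String record, String detail) throws Exception {
            if ("table".equalsIgnoreCase(sink)) {
                try (Connection con = DriverManager.getConnection(dbUrl);
                     PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO ErrorLog (ProcessId, JobId, FileId, ErrorRecord, ErrorDetail) "
                         + "VALUES (?, ?, ?, ?, ?)")) {
                    ps.setLong(1, processId);
                    ps.setLong(2, jobId);
                    ps.setLong(3, fileId);
                    ps.setString(4, record);
                    ps.setString(5, detail);
                    ps.executeUpdate();
                }
            } else {
                try (PrintWriter out = new PrintWriter(new FileWriter("errors.csv", true))) {
                    out.printf("%d,%d,%d,%s,%s%n", processId, jobId, fileId, record, detail);
                }
            }
        }
    }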

As the client's raw files carry large volumes of data, we designed the ETL package architecture so that extraction completes within a couple of minutes instead of hours.


Data Transformation and Loading from Staging to the Data Warehouse

The client's second requirement was to load data from staging to the data warehouse after applying a few business rules.


Business Challenge

  • The client had not provided business keys with any of the requirements, so loading data incrementally into the data warehouse was challenging.
  • Keeping the solution design consistent across the client's differing requirements.
  • Loading and processing data quickly, in chunks.


Solution

Once the client gave us details of what goes where at various time intervals, our team of experts defined a staging-to-target mapping document for each job. We also identified and defined business keys for each of them.
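
With business keys in hand, the incremental load can be expressed as an upsert. The sketch below issues a SQL Server MERGE over JDBC; the table names and the DealerCode/TransactionId keys are illustrative assumptions, not the client's actual schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class IncrementalLoad {
        // Upserts staging rows into the warehouse on the identified business keys.
        public static void upsert(String dbUrl) throws Exception {
            String merge =
                  "MERGE dw.FactSales AS tgt "
                + "USING stg.Sales AS src "
                + "   ON tgt.DealerCode = src.DealerCode "
                + "  AND tgt.TransactionId = src.TransactionId "
                + "WHEN MATCHED THEN "
                + "  UPDATE SET tgt.Amount = src.Amount, tgt.LoadDate = GETDATE() "
                + "WHEN NOT MATCHED THEN "
                + "  INSERT (DealerCode, TransactionId, Amount, LoadDate) "
                + "  VALUES (src.DealerCode, src.TransactionId, src.Amount, GETDATE());";
            try (Connection con = DriverManager.getConnection(dbUrl);
                 Statement st = con.createStatement()) {
                st.executeUpdate(merge);
            }
        }
    }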

A flow diagram was created defining the method of loading data from source to target, stating the sequence of steps developers should follow, to make the development process simple, quick, and bug-free.

We defined the business rules to be applied to the source data of each file before loading into the target table.

Process and job logging table schemas were defined to track each master process and its associated child jobs, along with the time consumed by each job and the audit counts associated with each child job.

A master package was created to run each child job and make process-log and job-log entries in the system.
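
A minimal sketch of that master/child pattern: each child job is timed, and its duration and audit count are written to a job-log table. The ChildJob interface and JobLog columns are illustrative assumptions.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class MasterJob {
        interface ChildJob {
            String name();
            long run() throws Exception; // returns rows processed (audit count)
        }

        // Runs each child job in sequence and writes a job-log row with the
        // elapsed time and audit count for this run of the master process.
        public static void runAll(Connection con, long processId, ChildJob... jobs)
                throws Exception {
            for (ChildJob job : jobs) {
                long start = System.currentTimeMillis();
                long rows = job.run();
                long elapsedMs = System.currentTimeMillis() - start;
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO JobLog (ProcessId, JobName, ElapsedMs, RowsProcessed) "
                        + "VALUES (?, ?, ?, ?)")) {
                    ps.setLong(1, processId);
                    ps.setString(2, job.name());
                    ps.setLong(3, elapsedMs);
                    ps.setLong(4, rows);
                    ps.executeUpdate();
                }
            }
        }
    }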

Each child job was created as per the SRS to apply data cleansing and business rules to source records, marking erroneous records with an error ID.
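
The shape of such a rule check might look like the sketch below: each rule carries an error ID, and the first violated rule marks the record (0 meaning clean). The Rule interface and IDs are illustrative assumptions.

    import java.util.List;

    public class RuleEngine {
        interface Rule {
            int errorId();                    // ID written to the error column
            boolean isViolated(String[] row); // true when the record breaks the rule
        }

        // Returns the error ID of the first violated rule, or 0 for a clean record.
        public static int validate(String[] row, List<Rule> rules) {
            for (Rule rule : rules) {
                if (rule.isViolated(row)) {
                    return rule.errorId();
                }
            }
            return 0;
        }
    }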

The maintenance group is notified by email, with detailed audit counts, after the process completes.

The data loading process in the ETL package processes data according to the chunk/step size defined in the property file.
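
One way to realize this chunked load on SQL Server 2012 is OFFSET/FETCH paging, sketched below: the staging table is read one chunk at a time using the configured size. Table and column names are assumptions.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ChunkedTransfer {
        // Reads the staging table one chunk at a time so the process never
        // holds more than chunkSize rows in memory.
        public static void transfer(Connection con, int chunkSize) throws Exception {
            String page = "SELECT StagingId, DealerCode, Amount FROM stg.Sales "
                        + "ORDER BY StagingId OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
            int offset = 0;
            while (true) {
                int rows = 0;
                try (PreparedStatement ps = con.prepareStatement(page)) {
                    ps.setInt(1, offset);
                    ps.setInt(2, chunkSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;
                            // transform and write the row to the warehouse here
                        }
                    }
                }
                if (rows < chunkSize) break; // last (possibly partial) chunk done
                offset += chunkSize;
            }
        }
    }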

An error logging mechanism was created so that the client can view the status of each source.

The architecture was defined to deal with huge volumes of data in staging easily and quickly, without placing much load on the source and target systems.


Technology Used


Data Tool: Talend Open Data Solution
Database: SQL Server 2012

Why Volga Infotech?

  • High-quality and cost-effective services
  • High-end technology and best-of-breed infrastructure
  • Skilled, talented and experienced professionals
  • Daily updates on the progress of work
  • Direct contact with the Team
  • Save on time, effort and infrastructure by outsourcing
  • Maximize revenue and minimize expenses
  • Quick turnaround time
  • Latest software and technologies