Profile sample data and use it to generate synthetic data.
The term data set refers to a file that contains one or more records. The record is the basic unit of information used by a program.
A data set can hold information such as educational, medical, or insurance records, to be used by a program running on the system. Data sets are also used to store information needed by applications.
Data Discovery and Profiling is the process of identifying and analyzing data, and of detecting duplicate information, in order to prioritize data cleansing and standardization.
It helps you discover and organize data, and understand what problems exist and what actions need to be taken.
Data Discovery and Profiling helps you make better decisions that can increase profits and improve customer satisfaction.
In this article, we are going to cover the third option of data profiling:
- Why is Data Profiling Important?
- Profiling Data Using Pandas
- Profiling Data Using CloudTDMS
- Create a stream
- Generate test data from your input file
- An Outlook
Why is Data Profiling Important?
Data profiling is important for analyzing large amounts of data using a systematic, consistent, and repeatable process. It helps a business improve profits and cut waste; after all, efficient and accurate data is a profitable business asset.
Data profiling can eliminate costly errors commonly found in databases, e.g., missing cells, duplicate values, and unexpected patterns in the data.
Data profiling also helps with tasks such as assessing data quality and identifying data types.
Profiling Data Using CloudTDMS
CloudTDMS discovery and profiling solutions allow business and IT users to constantly monitor data from any existing information source (e.g., a database or a file) and collect complete statistics: variable types, number of variables, number of observations (rows of the DataFrame), completeness and correctness of data, missing cells, percentage of missing cells, duplicate rows, percentage of duplicate rows, and total size in memory. These statistics help us identify whether the existing data can easily be used for other purposes.
Advantages of CloudTDMS Profiling
A CloudTDMS profiling report is a useful tool that provides the following: an overview, the number of variables, the number of records, the number of cells, missing cells, percentage of missing cells, duplicate rows, percentage of duplicate rows, correctness, completeness, total size in memory, and a sample of your data.
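To make the overview figures concrete, here is a minimal pandas sketch of how most of them could be computed locally. It is not how CloudTDMS computes its report; the function name `profile_overview` and the sample data are made up for illustration, and correctness and completeness are left out because this article does not define how they are calculated.

```python
import pandas as pd


def profile_overview(df: pd.DataFrame) -> dict:
    """Compute overview-style statistics for a DataFrame (illustrative only)."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())      # missing cells across all columns
    duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
    return {
        "variables": n_cols,
        "records": n_rows,
        "cells": n_cells,
        "missing_cells": missing,
        "missing_cells_pct": round(100 * missing / n_cells, 2) if n_cells else 0.0,
        "duplicate_rows": duplicates,
        "duplicate_rows_pct": round(100 * duplicates / n_rows, 2) if n_rows else 0.0,
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
    }


# Hypothetical sample data: one missing name and one duplicated row.
sample = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "age": [34, 41, 41, 29],
})
overview = profile_overview(sample)
print(overview)
```

Passing `deep=True` to `memory_usage` asks pandas to measure the actual contents of string columns, which gives a more realistic size than the default estimate.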
How to generate a profiling report on CloudTDMS?
- Visit CloudTDMS, go to the Data menu, and click on Data Discovery and Profiling.
- Click on the Upload files button and a pop-up will appear. You have two options here:
  - First, you can browse to upload your file, keep the Generate stream checkbox unchecked, and click on Submit. This way the file will only be profiled, and you won't be able to generate a stream from it.
  - Second, after choosing the file to profile, you can check the Generate stream checkbox, choose the application in which you want to generate your stream, and click on the Submit button. This way the file will be profiled and a stream for it will be generated as well.
- Check the logs for successful execution.
- After execution, an .html extension will be added to the file name.
- Click on Profiling report to view your generated report.
Description of the elements analyzed
- Overview
The overview includes the number of variables, the number of observations (rows of the DataFrame), missing cells, percentage of missing cells, duplicate rows, percentage of duplicate rows, and total size in memory.
- Variables
This section of the report is a detailed analysis of the variables, columns, and features of the data set.
- Numeric Variables: It gives information about the distinct values, missing values, minimum and maximum, mean, and the count of negative values.
- String Variables: It gives information about the distinct values, distinct percentage, missing values, missing percentage, and memory size.
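The per-variable statistics listed above can be approximated with pandas as well. The sketch below is an illustration under assumptions, not the CloudTDMS implementation: `profile_variable` is a hypothetical helper, and any column that pandas does not recognize as numeric is treated as a string variable.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def profile_variable(series: pd.Series) -> dict:
    """Per-variable statistics, split by numeric and string columns."""
    total = len(series)
    distinct = int(series.nunique(dropna=True))   # distinct non-missing values
    missing = int(series.isna().sum())
    stats = {"distinct": distinct, "missing": missing}
    if is_numeric_dtype(series):
        # min, max and mean skip missing values by default
        stats["min"] = series.min()
        stats["max"] = series.max()
        stats["mean"] = series.mean()
        stats["negative"] = int((series < 0).sum())
    else:
        stats["distinct_pct"] = round(100 * distinct / total, 2) if total else 0.0
        stats["missing_pct"] = round(100 * missing / total, 2) if total else 0.0
        stats["memory_bytes"] = int(series.memory_usage(deep=True))
    return stats


# Hypothetical sample data with one missing value per column.
sample = pd.DataFrame({
    "balance": [120.5, -30.0, None, 120.5],
    "city": ["Paris", "Lyon", "Paris", None],
})
report = {column: profile_variable(sample[column]) for column in sample.columns}
print(report)
```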
Create a stream
A stream is the definition (metadata) of the data that we need to generate. We need to create a stream for each file that we want to generate data for.
Inside Data Discovery and Profiling, after choosing the file to profile, check Generate stream, choose the application in which you want to generate your stream, and click on the Submit button.
Check the logs to see whether the stream was generated successfully.
Once it has executed, you can go to the same application to check your generated stream.
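As a rough picture of what "definition/metadata of the data" can mean, the sketch below derives a simple field-by-field description from a pandas DataFrame. The dictionary layout, the `infer_stream_definition` helper, and the `kind` and `nullable` keys are all assumptions made for illustration; the actual CloudTDMS stream format is not described in this article and may look quite different.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_stream_definition(df: pd.DataFrame, name: str) -> dict:
    """Build a hypothetical metadata description of a DataFrame's columns."""
    fields = []
    for column in df.columns:
        series = df[column]
        fields.append({
            "name": column,
            "kind": "numeric" if is_numeric_dtype(series) else "string",
            "nullable": bool(series.isna().any()),   # True if any value is missing
        })
    return {"stream": name, "fields": fields}


# Hypothetical input file contents.
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["a@x.io", None, "c@x.io"],
})
definition = infer_stream_definition(customers, "customers")
print(definition)
```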
Generate test data from your input file
To generate data from this stream, you have to create a workflow and execute it.
Go to the Workflows menu and click on the Create new button.
Here we will define some parameters for our workflow.
Application: The name of the application within which your stream exists. Select your application from the list (e.g., 'Generate Profile Report').
NOTE: An application can have multiple streams. All the active streams will be executed by the workflow.
Storage: Select the name of the storage you created earlier. On the Free Plan (starter), you can use Internal Storage.
NOTE: By default, we provide local storage where you can store your data.
Scheduled At: Set this if you want to generate data at a specific date and time.
NOTE: By default, the workflow is executed with a maximum delay of 1 minute.
Your data is generated once the status of the workflow is 'Executed'.
- You can also check the logs to see whether the data was generated successfully.
Once the workflow has executed, go to the Data menu and navigate to the Data Generated section. You will see your file there, and you can download it by clicking on it.
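For intuition about what generating data from a definition involves, here is a small standard-library sketch. It is a deliberately naive illustration and not the method CloudTDMS uses: the `generate_rows` helper and the `fields` layout are hypothetical, numeric values are drawn uniformly between an assumed minimum and maximum, and string values are picked at random from an assumed list.

```python
import random


def generate_rows(fields: list, count: int, seed: int = 0) -> list:
    """Generate synthetic rows from a hypothetical field description."""
    rng = random.Random(seed)   # fixed seed so repeated runs give the same rows
    rows = []
    for _ in range(count):
        row = {}
        for field in fields:
            if field["kind"] == "numeric":
                # uniform draw between the profiled minimum and maximum
                row[field["name"]] = round(rng.uniform(field["min"], field["max"]), 2)
            else:
                # pick one of the values assumed to be valid for this column
                row[field["name"]] = rng.choice(field["values"])
        rows.append(row)
    return rows


# Hypothetical field description, e.g. derived from a profiling report.
fields = [
    {"name": "age", "kind": "numeric", "min": 18, "max": 90},
    {"name": "city", "kind": "string", "values": ["Paris", "Lyon", "Nice"]},
]
rows = generate_rows(fields, count=5)
for row in rows:
    print(row)
```

Sampling each column independently like this ignores relationships between columns, so it is only suitable as a starting point for simple test data.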
An Outlook
This article points out the need for data profiling, explains why it is important, and describes data profiling reports. It also shows how a user can generate profiling reports for their business, create a stream from a profiling report, and generate data from it.