Data Processing

Data Constraint | Definition | Examples
Data type | Values must be of a certain type: date, number, percentage, Boolean, etc. | If the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range | Values must fall between predefined maximum and minimum values. | If the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory | Values can't be left blank or empty. | If age is mandatory, that value must be filled in.
Unique | Values can't have a duplicate. | Two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns | Values must match a prescribed pattern. | A phone number must match ###-###-#### (no other characters allowed).
Cross-field validation | Certain conditions for multiple fields must be satisfied. | Values are percentages, and values from multiple fields must add up to 100%.
Primary-key | (Databases only) Values must be unique per column. |
Set-membership | (Databases only) Values for a column must come from a set of discrete values. | The value for a column must be set to Yes, No, or Not Applicable.
Foreign-key | Values for a column must be unique values coming from a column in another table. |
Accuracy | The degree to which the data conforms to the actual entity being measured or described. | If values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness | The degree to which the data contains all desired components or measures. | If data for personal profiles requires hair and eye color, and both are collected, the data is complete.
Consistency | The degree to which the data is repeatable from different points of entry or collection. | If a customer has the same address in the sales and repair databases, the data is consistent.
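
Several of the constraints above can be checked with a few lines of code. The following is a minimal sketch, not a definitive validator: the record layout (age, signup_date, phone, consent, shares), the 0-120 age range, and the function name check_record are all hypothetical and chosen only to illustrate the table.

```python
import re
from datetime import date

# Hypothetical rules for an illustrative record layout (not from the notes).
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")   # regex: ###-###-####
ALLOWED_ANSWERS = {"Yes", "No", "Not Applicable"}  # set-membership
AGE_RANGE = (0, 120)                               # assumed data range


def check_record(record, seen_phones):
    """Return the list of constraint violations found in one record (a dict)."""
    problems = []

    age = record.get("age")
    if age is None or age == "":
        # Mandatory: the value can't be blank or empty.
        problems.append("mandatory: age is blank")
    elif isinstance(age, bool) or not isinstance(age, int):
        # Data type: age must be a whole number.
        problems.append("data type: age is not a whole number")
    elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        # Data range: age must fall between the minimum and maximum.
        problems.append("data range: age is out of range")

    # Data type: a single number like 30 is not a date.
    if not isinstance(record.get("signup_date"), date):
        problems.append("data type: signup_date is not a date")

    phone = str(record.get("phone", ""))
    if not PHONE_PATTERN.fullmatch(phone):
        problems.append("regex: phone does not match ###-###-####")
    elif phone in seen_phones:
        # Unique: no two records may share a phone number.
        problems.append("unique: phone is already in use")
    else:
        seen_phones.add(phone)

    if record.get("consent") not in ALLOWED_ANSWERS:
        problems.append("set-membership: consent is not an allowed value")

    # Cross-field: whole-number percentages must add up to 100.
    if sum(record.get("shares", [])) != 100:
        problems.append("cross-field: shares do not add up to 100%")

    return problems


seen = set()
good = {"age": 30, "signup_date": date(2024, 1, 5),
        "phone": "555-123-4567", "consent": "Yes", "shares": [40, 60]}
bad = {"age": 130, "signup_date": 30,
       "phone": "5551234567", "consent": "Maybe", "shares": [50, 40]}
print(check_record(good, seen))  # an empty list: no violations
print(check_record(bad, seen))   # one message per failed constraint
```

The set of phone numbers already seen is passed in by the caller so that the uniqueness check works across many records.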
Unit | Returns
Y | The number of complete years in the period.
M | The number of complete months in the period.
D | The number of days in the period.
YM | The difference between the months in start_date and end_date. The days and years of the dates are ignored.
YD | The difference between the days of start_date and end_date. The years of the dates are ignored.
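
To make the unit codes concrete, here is a rough sketch of the same idea in Python. The function name date_diff is hypothetical. As an assumption, YM is read as the complete months left over once complete years are removed, and YD as the days since the most recent anniversary of start_date. A real spreadsheet function may treat edge cases (month ends, leap days) differently.

```python
from datetime import date


def date_diff(start_date, end_date, unit):
    """Difference between two dates for the units Y, M, D, YM, and YD."""
    if end_date < start_date:
        raise ValueError("end_date must not be earlier than start_date")

    # Complete months: count month boundaries crossed, minus one if the end
    # day-of-month has not yet reached the start day-of-month.
    months = (end_date.year - start_date.year) * 12
    months += end_date.month - start_date.month
    if end_date.day < start_date.day:
        months -= 1

    if unit == "Y":
        return months // 12          # complete years
    if unit == "M":
        return months                # complete months
    if unit == "D":
        return (end_date - start_date).days
    if unit == "YM":
        return months % 12           # months left after complete years
    if unit == "YD":
        # Days since the most recent anniversary of start_date.
        year = end_date.year
        if (end_date.month, end_date.day) < (start_date.month, start_date.day):
            year -= 1
        try:
            anniversary = start_date.replace(year=year)
        except ValueError:
            # Assumption: a Feb 29 start uses Mar 1 in non-leap years.
            anniversary = date(year, 3, 1)
        return (end_date - anniversary).days
    raise ValueError(f"unsupported unit: {unit}")


start, end = date(2020, 1, 15), date(2023, 4, 10)
for unit in ("Y", "M", "D", "YM", "YD"):
    print(unit, date_diff(start, end, unit))
# Y 3, M 38, D 1181, YM 2, YD 85
```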
Possible Solutions | Examples of solutions in real life
Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data. | If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
If there isn't time to collect data, perform the analysis using proxy data from other datasets (this is the most common workaround). | If you are analyzing peak travel for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic.
Possible Solutions | Examples of solutions in real life
Do the analysis using proxy data along with actual data. | If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
Adjust your analysis to align with the data you already have. | If you are missing data for 18-to-24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.
Possible Solutions | Examples of solutions in real life
If you have the wrong data because requirements were misunderstood, communicate the requirements again. | If you need the data for female voters and received the data for male voters, restate your needs.
Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. | If your data is in a spreadsheet and there is a conditional statement or Boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis, provided your sample size is still large enough and ignoring the data won't cause systematic bias. | If your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translations and go ahead with the analysis of the other data.
Important note: Sometimes data with errors can be a warning sign that the data isn't reliable. Use your best judgement.

CALCULATE SAMPLE SIZE

Terminology | Definitions
Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error | Since a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
Confidence interval | The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical power | Can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required in order to detect an effect in an experiment.
Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
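
The relationship between margin of error, confidence level, and confidence interval can be sketched for a survey proportion. This is a minimal illustration and not part of the notes: it assumes a simple random sample and the normal approximation, and the function names and example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_proportion, sample_size, confidence_level=0.95):
    """Margin of error for a proportion, using the normal approximation."""
    # z-score that leaves (1 - confidence_level) split across both tails.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    standard_error = sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
    return z * standard_error


def confidence_interval(sample_proportion, sample_size, confidence_level=0.95):
    """Sample result +/- the margin of error."""
    moe = margin_of_error(sample_proportion, sample_size, confidence_level)
    return sample_proportion - moe, sample_proportion + moe


# Hypothetical survey: 60% of 500 respondents answered "yes".
low, high = confidence_interval(0.60, 500)
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

With these inputs the margin of error is roughly 4.3 percentage points. Raising the confidence level or shrinking the sample widens the interval.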

You could probably accept a larger margin of error when surveying how residents feel about the new library than when surveying how they would vote to fund it. A smaller margin of error requires more responses, so you would most likely use a larger sample size for the voter survey.
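
The trade-off between margin of error and sample size can be sketched with a common textbook formula (Cochran's formula with a finite population correction). This is one approach among several, not necessarily what any particular sample size calculator uses. The function name, the population of 10,000 residents, and the worst-case proportion of 0.5 are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population, margin_of_error, confidence_level=0.95, proportion=0.5):
    """Responses needed for a given margin of error (expressed as a fraction)."""
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    # Sample size for a very large population; proportion=0.5 is the worst case.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction: smaller populations need fewer responses.
    n = n0 / (1 + (n0 - 1) / population)
    return ceil(n)


residents = 10_000  # hypothetical town size
print("library opinion survey (5% margin):", sample_size(residents, 0.05))
print("funding vote survey (3% margin):", sample_size(residents, 0.03))
```

For 10,000 residents at a 95% confidence level, a 5% margin of error needs 370 responses, and tightening the margin to 3% needs considerably more.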

WHEN DATA ISN'T READILY AVAILABLE

Business Scenario | How proxy data can be used
A new car model was launched a few days ago, and the auto dealership can't wait until the end of the month for sales data to come in. They want sales projections now. | The analyst proxies the number of clicks to the car specifications on the dealership's website as an estimate of potential sales at the dealership.
A brand new plant-based meat product was only recently stocked in grocery stores, and the supplier needs to estimate the demand over the next four years. | The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.
The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren't publicly available yet. | The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.

CLEANING DATA

VERIFYING AND REPORTING