Data Preparation

HOW DATA IS COLLECTED

DATA COLLECTION CONSIDERATIONS

PRIMARY VS SECONDARY DATA

Data format classification: Primary data
Definition: Collected by a researcher from first-hand sources
Examples:
- Data from an interview you conducted
- Data from a survey returned from 20 participants
- Data from questionnaires you got back from a group of workers

Data format classification: Secondary data
Definition: Gathered by other people or from other research
Examples:
- Data you bought from a local data analytics firm's customer profiles
- Demographic data collected by a university
- Census data gathered by the federal government

INTERNAL VS EXTERNAL DATA

Data format classification: Internal data
Definition: Data that is stored inside a company's own systems
Examples:
- Wages of employees across different business units tracked by HR
- Sales data by store location
- Product inventory levels across distribution centers

Data format classification: External data
Definition: Data that is stored outside of a company or organization
Examples:
- National average wages for the various positions throughout your organization
- Credit reports for customers of an auto dealership

CONTINUOUS VS DISCRETE DATA

Data format classification: Continuous data
Definition: Data that is measured and can have almost any numeric value
Examples:
- Height of kids in third grade classes
- Runtime markers in a video
- Temperature

Data format classification: Discrete data
Definition: Data that is counted and has a limited number of values
Examples:
- Number of people who visit a hospital on a daily basis (100, 200, 1000)
- Maximum capacity allowed in a room
- Tickets sold in the current month

QUALITATIVE VS QUANTITATIVE DATA

Data format classification: Qualitative
Definition: A subjective and explanatory measure of a quality or characteristic
Examples:
- Favorite exercise activity
- Brand with best customer service
- Fashion preferences of young adults

Data format classification: Quantitative
Definition: A specific and objective measure, such as a number, quantity, or range
Examples:
- Percentage of board-certified doctors who are women
- Population size of elephants in Africa
- Distance from Earth to Mars at a particular time

NOMINAL VS ORDINAL DATA

Data format classification: Nominal
Definition: A type of qualitative data that is categorized without a set order
Examples:
- First time customer, returning customer, regular customer
- New job applicant, existing applicant, internal applicant
- New listing, reduced price listing, foreclosure

Data format classification: Ordinal
Definition: A type of qualitative data with a set order or scale
Examples:
- Movie rating (1 star, 2 stars)
- Ranked-choice voting selections (1st, 2nd)
- Satisfaction level measured in a survey (satisfied, neutral, dissatisfied)
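
The practical difference between nominal and ordinal data shows up in what you can compute. The sketch below is only an illustration in Python: the response lists are made up, and the ranking of the satisfaction labels is an assumption. Ordinal labels support a median and "at least" comparisons, while nominal labels support only counts.

```python
from collections import Counter

# Assumed ranking for the satisfaction example, lowest to highest.
SATISFACTION_ORDER = ["dissatisfied", "neutral", "satisfied"]
RANK = {label: position for position, label in enumerate(SATISFACTION_ORDER)}


def median_level(responses):
    """Return the middle response after sorting by rank.

    For an even number of responses this picks the upper of the two middle values.
    """
    ranked = sorted(responses, key=RANK.__getitem__)
    return ranked[len(ranked) // 2]


def at_least(response, threshold):
    """Ordinal data supports comparisons such as 'neutral or better'."""
    return RANK[response] >= RANK[threshold]


# Made-up survey responses (ordinal).
responses = ["satisfied", "dissatisfied", "neutral", "satisfied", "neutral"]
print(median_level(responses))
print(at_least("satisfied", "neutral"))

# Made-up customer labels (nominal): no rank exists, so we only count.
customer_types = [
    "first time customer",
    "returning customer",
    "regular customer",
    "returning customer",
]
print(Counter(customer_types).most_common(1))
```

Asking for the "median customer type" would have no meaning, which is why the nominal list gets only a frequency count.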

STRUCTURED VS UNSTRUCTURED DATA

Data format classification Definition Examples
Data format classification: Structured data
Definition: Data organized in a certain format, like rows and columns
Examples:
- Expense report
- Tax return
- Store inventory

Data format classification: Unstructured data
Definition: Data that cannot be stored as columns and rows in a relational database
Examples:
- Social media posts
- Emails
- Videos

SPREADSHEETS

TRANSFORMING DATA:

Wide data is preferred when:
- Creating tables and charts with a few variables about each subject
- Comparing straightforward line graphs

Long data is preferred when:
- Storing a lot of variables about each subject, for example, 60 years' worth of interest rates for each bank
- Performing advanced statistical analysis or graphing
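
To make the wide-to-long transformation concrete, here is a minimal Python sketch. The bank names and rates are invented, and `wide_to_long` is a hypothetical helper written for this note, not a spreadsheet or library function. It turns one row per bank (wide) into one row per bank and year (long).

```python
# Made-up wide table: one row per bank, one column per year.
wide_rows = [
    {"bank": "Bank A", "2021": 1.5, "2022": 2.0},
    {"bank": "Bank B", "2021": 1.7, "2022": 2.3},
]


def wide_to_long(rows, id_field, variable_name, value_name):
    """Turn every non-id column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the id column is repeated on every output row instead
            long_rows.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "bank", "year", "interest_rate")
for row in long_rows:
    print(row)
```

With 60 years of rates, the wide form would need 60 columns per bank, while the long form keeps the same three columns and simply adds rows.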

TYPES OF DATA BIAS:

VALIDATION FOR GOOD DATA:

ETHICS

METADATA

IMPORTRANGE AND IMPORTHTML:

1 – Open a new spreadsheet in Google Sheets. In an empty cell enter the following formula:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", "table", 1)

The formula IMPORTHTML requires 3 inputs:

URL:"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

This is the URL of the page we will import data from. It should include the protocol (e.g. http:// or https://), and be enclosed in quotation marks. Alternatively, it can be a reference to a cell in your workbook that contains the relevant URL.

QUERY:"table"

It can be either “list” or “table”, depending on the type of webpage element that you want to import data from. It should also be enclosed in quotation marks. In this example, we are importing a table.

Google Sheets also gives you the option to import lists from a webpage. To do so, look in the page source for a list element (either an unordered list, tag <ul>, or an ordered list, tag <ol>). Each list item starts with the <li> tag.

INDEX:1

The index, starting at 1, identifies which table or list should be returned from the page’s HTML source. This is useful if your page contains multiple tables or lists.

2 – Press Enter and enjoy the imported data. This import is dynamic: Google Sheets refreshes it periodically, so data added to the source table shows up without re-entering the formula. That can be useful when scraping tables that are frequently updated, for example, results of sports competitions or elections.

3 – You can download this newly created data set as .xlsx, .csv, or .tsv for further manipulation, or connect to it directly from Tableau Desktop by selecting “Google Sheets” from the list of servers.
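
To see what the "table" query and the 1-based index mean, the following Python sketch imitates the idea on a small inline HTML string. It is not how Google Sheets implements IMPORTHTML; `TableCollector`, `import_table`, and `SAMPLE_PAGE` are hypothetical names made up for this illustration, and the sketch assumes the page has no nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables in document order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return the table at a 1-based index, mirroring the INDEX input above."""
    if index < 1:
        raise ValueError("index starts at 1")
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Made-up page with two tables; index 1 selects the first, index 2 the second.
SAMPLE_PAGE = """
<table><tr><th>Symbol</th><th>Name</th></tr>
<tr><td>AAA</td><td>Example Corp</td></tr></table>
<table><tr><th>Date</th></tr><tr><td>2024-01-01</td></tr></table>
"""

for row in import_table(SAMPLE_PAGE, 1):
    print(row)
```

Changing the index to 2 returns the second table, which is the same choice you make when a real page contains several tables.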

BEST PRACTICES WHEN ORGANIZING DATA

DATA SECURITY: