Skip to main content

Data Visualization: Cleaning Data

Introduction

Cleaning up data is the first and also a very important step towards preparing data for visualization. It saves a great deal of time and effort for visualization when cleaning and formatting data is nicely done. If you have some data in speadsheet you would like to visualize, it is a great idea to go clean it up before moving on to visualize it. 

This page introduces some tools to help prepare data format before visualization.

Excel

If you have your numeric / textual mixed data in spreadsheets, please make sure that you set different columns in the right format and then validate your data by format to make sure no mis-formatted data in your dataset.

1. Organize your data in consistent format (some basic format issues are listed below. All formats listed above could be found under Format Cells in Excel):

  • Put all dates in a date/time format;
  • Put numeric data in numeric format;
  • Put money data in currency format;
  • Put text in text format.

2. Validate your dataset by format and set up error message:

  • Select a range of data (usually a column) for validation;
  • In Excel go to Data, choose Data Validation tab on the top to set up the valid format and error message if a field is invalid.

Note: please keep your data format consistent. Do not leave blank space. You can put in N/A or NONE for blank spaces.

More cleaning tricks are shown in the demo below:

OpenRefine

A free open-source tool to help you identify inconsistent data field, suggest refining options and merge field together under your chosen display.

Go to OpenRefine

To learn basic data manipulation skills: OpenRefine Help Resources.

Regular expression for advanced data manipulation: OpenRefine Cheat Sheet

Data Wrangler

A free web-based data cleaning tool. Users can copy paste your data into the website for cleaning. It has multiple functions and will provide a comprehensive list of editing functions including extract, merge, drop, delete, etc. The tool provides suggested editing options to manipulate the data.

Go to Data Wrangler

SAS

SAS is a powerful tool to manipulate data. Some of its data cleaning functions include:

  • check character and numeric variables for invalid values
  • identify and count missing values for numeric and character variables
  • work with and manipulate dates
  • work with data sets that have multiple observations per subject
  • identify inconsistencies among multiple files
  • compare two data sets with and without an ID variable
  • use integrity constraints and audit trails to weed out data errors.

Below is a video of cleaning data in SAS. It, however, doesn't show all the data cleaning techniques in SAS. Please contact your librarian if you have further questions about data cleaning in SAS.