People need data more than ever to support their opinions and decisions. Fortunately, there are plenty of free datasets online. Google is a good starting point for most questions, but many of the highest-quality data sources do not appear at the top of a regular Google search. This article points you to my top 7 resources for finding high-quality data online.
1 – FiveThirtyEight
FiveThirtyEight covers a wide range of news topics and regularly backs its articles with data. They now share many of the datasets they use. It is a great source of sports, culture, and politics data.
2 – data.world
data.world hosts a wide variety of datasets and makes it easy to collaborate with others on a data project. You need to create an account to access datasets on this site.
3 – World Bank
The World Bank provides a wide range of country-level data from around the world.
4 – Government Data Agencies
Most government agencies publish a lot of data that the public can download and use. You can find datasets at the city, state, and federal levels, covering the environment, the economy, demographics, and much more.
5 – GitHub
GitHub is the world standard for hosting and collaborating on code online. There is more than just code on GitHub: many projects on the platform include datasets you can use. It is a great place to search for data, and there is even a project that maintains another list of public data sources.
6 – Kaggle
Kaggle is a competition website for data science. Different groups post some data and a prompt, and site users then have a set amount of time to finish the project. The best part is that the data stays on the website after it is posted and is available for free download. More than 12,000 datasets are now available on the website.
7 – Google
Google has created a separate search engine specifically for datasets. It is still in beta, so it may not return great results for every topic, but it should be the first place you check when looking for data.
Data Analysis and Quality Check
There are a few questions you should ask about any dataset you find online.
Can I trust the data source?
Consider the data source’s reputation: is it a large institution or a single person? If you are skeptical, look at other data sources on the same topic and see whether the numbers still look reasonable. I would rank most of the sources above as highly reputable. Be a little cautious with community-contributed data on websites such as data.world or GitHub, as it is unlikely to have been verified.
Is this data valid?
Investigate the data: estimate what the maximum and minimum of each column should be, then check whether any values fall outside that range. An easy way to do this is to sort each column in ascending and then descending order, which brings the smallest and largest values to the top. In Excel or Google Sheets, select all the data, click the filter icon, and choose the sort options A to Z and then Z to A.
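The same range check can be scripted when a spreadsheet is not at hand. The following is only a minimal sketch in plain Python; the sample rows, the "age" column, and the 0 to 120 bounds are hypothetical values chosen for illustration, and find_out_of_range is a helper name invented here.

```python
def find_out_of_range(rows, column, low, high):
    """Return (row_index, value) pairs whose value falls outside [low, high]."""
    flagged = []
    for index, row in enumerate(rows):
        value = row[column]
        if value < low or value > high:
            flagged.append((index, value))
    return flagged


# Hypothetical sample data: ages of five survey respondents.
rows = [
    {"name": "A", "age": 34},
    {"name": "B", "age": 29},
    {"name": "C", "age": 410},  # possibly a typo for 41
    {"name": "D", "age": -3},   # a negative age is impossible
    {"name": "E", "age": 57},
]

# Sorting mirrors the A to Z / Z to A trick: extremes end up at the edges.
ages = sorted(row["age"] for row in rows)
print("smallest:", ages[0], "largest:", ages[-1])

# Flag anything outside the range we consider plausible for this column.
outliers = find_out_of_range(rows, "age", low=0, high=120)
print("out of range:", outliers)
```

The bounds are a judgment call for each column; the point is to decide them before looking at the data, so that surprising values stand out.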
Data is often entered incorrectly: somebody might have typed $1,100.00 or $11,00.00 instead of $11,000.00. The sorting approach described above helps detect the most obvious examples. Another common problem is that people sometimes do not want to provide real data for things like phone numbers, so those columns may contain a lot of values made up entirely of 9s or 0s.
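Placeholder entries of this kind can also be flagged programmatically. The sketch below assumes one simple rule, that a phone number made of a single repeated digit is suspicious; both the rule and the sample numbers are illustrative assumptions, and real data may call for a looser or stricter test.

```python
def looks_like_placeholder(phone):
    """Return True when the number consists of one repeated digit."""
    digits = [ch for ch in phone if ch.isdigit()]
    return len(digits) > 0 and len(set(digits)) == 1


# Hypothetical sample values for a phone number column.
phones = ["555-867-5309", "999-999-9999", "0000000000", "(212) 555-0142"]

suspicious = [phone for phone in phones if looks_like_placeholder(phone)]
print(suspicious)
```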
A column’s title may also be ambiguous. For example, in a field called “Employed percent”, a value of 80 percent could be stored as either 0.80 or 80. Usually this can be determined with context clues (what seems reasonable, what other values in the column look like, etc.).
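One such context clue, the range of the other values in the column, is easy to check in code. This is a rough heuristic sketched under the assumption that a percentage column holds either fractions between 0 and 1 or whole percentages between 0 and 100; guess_percent_scale is a hypothetical helper, and a column of genuinely tiny percentages would fool it.

```python
def guess_percent_scale(values):
    """Guess whether values are stored as fractions (0-1) or percents (0-100)."""
    if not values:
        return "unknown"
    low, high = min(values), max(values)
    if low < 0 or high > 100:
        return "unknown"   # not a plausible percentage column at all
    if high <= 1:
        return "fraction"  # e.g. 0.80 means 80 percent
    return "percent"       # e.g. 80 means 80 percent


print(guess_percent_scale([0.80, 0.65, 0.91]))
print(guess_percent_scale([80, 65, 91]))
print(guess_percent_scale([80, 120]))
```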
Is this data complete?
Datasets are often missing data. Checking for null or missing values is a best practice for any dataset you want to use. For example, in Excel you can do this with the COUNTBLANK function; COUNTBLANK(B1:B3) results in a count of 1 in the image below.
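Outside of Excel, a similar blank count can be computed per column with Python's standard csv module. The sketch below is an approximation of the COUNTBLANK idea rather than an exact equivalent; the CSV text and its column names are made up for the example.

```python
import csv
import io

# Hypothetical CSV content with two missing cells.
SAMPLE_CSV = """name,salary,city
Ann,52000,Boston
Bob,,Denver
Cy,61000,
"""


def count_blanks(csv_text):
    """Count empty or whitespace-only cells in each column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None when a row has fewer cells than the header.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


blank_counts = count_blanks(SAMPLE_CSV)
print(blank_counts)
```

Keep in mind that missing values are not always blank: some datasets use markers such as "N/A" or -1, which would need to be added to the check.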
Is this data biased or skewed?
Try to visualize the various columns in your dataset. Use a histogram for numerical columns and see what type of distribution each one has (normal, left-skewed, right-skewed, uniform, bimodal, etc.). For non-numeric columns, build a frequency table: is the column mostly one value? Checking these things will build your intuition about the overall quality of the data and about which columns to use in an analysis.
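Both checks can be roughed out in a few lines of Python before reaching for a charting tool. The sketch below draws a text histogram for integer values and reports the share of the most common value in a non-numeric column; the sample data, the bin width of 10, and the helper names are all assumptions made for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text line per non-empty bin, using '#' marks as bars."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    lines = []
    for start in sorted(bins):
        end = start + bin_width - 1
        lines.append(f"{start:>4}-{end:<4} {'#' * bins[start]}")
    return lines


def dominant_share(values):
    """Return the most common value and the fraction of rows that hold it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Hypothetical numeric column: integer ages.
ages = [23, 25, 27, 31, 34, 35, 36, 38, 41, 44, 52, 67]
for line in text_histogram(ages, bin_width=10):
    print(line)

# Hypothetical non-numeric column: country codes.
countries = ["US"] * 9 + ["CA"]
top_value, share = dominant_share(countries)
print(top_value, share)
```

A column where one value accounts for nearly all rows carries little information, and an analysis that splits on it will mostly reflect that single group.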
Many data tools let you check for all of these quality issues quickly and easily. For a CSV or Excel file, Excel and Google Sheets are the fastest and easiest to use. There are also more advanced tools, such as Alteryx, that can check multiple columns at once.