Scraping and Cleaning Data

RECENT UPDATES

New York Times Spreadsheet Training Resources (free)

Google Sheets Functions List

GIJN: Ethics of Web Scraping

Free OCR
Scrape text from images. Always double-check your work, because 3s can look like 8s.

10 Examples of Web Scraping in Use


FINDING, SCRAPING AND CLEANING DATA

Journalist’s Toolbox Public Records Page
Portals galore at the top of the page.

Google Dataset Search
Search for data in this Google search tool.

Workbench
A great all-in-one data viz tool. Scrape, sort, filter and design graphics in this free tool. Video on how it works.

Data Scraping in Google Sheets
This tutorial and formula will let you scrape HTML tables out of web pages. Created by Mike Reilley, Journalist’s Toolbox founder and SPJ digital trainer.

Tabula.technology
Download this desktop tool to scrape tables out of native .PDFs. Also offers tips on preparing scanned .PDFs for scraping.

Introduction to Web Scraping
Python scraping primer with Mindy McAdams.
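
As a rough illustration of what scraping an HTML table involves, here is a minimal Python sketch that uses only the standard library. It is not taken from the primer or the Google Sheets tutorial listed above. The names `TableParser`, `scrape_table` and `SAMPLE_HTML` are hypothetical, and the sample table is made-up data. Real pages are usually messier (nested tables, merged cells), so treat this as a starting point.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>116,250</td></tr>
  <tr><td>Shelbyville</td><td>42,000</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Every value comes back as text, so numbers such as "116,250" still need cleaning before analysis.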


For more training videos, visit our YouTube page.


Able2Extract Professional
A PDF converter, creator, editor and more. Convert .PDFs into Excel.

Data Viz Data Scraping Tools

Web Scraper
Scrape data off of web pages.

Import.io
Structured web scraping and data visualization.

Ultimate Facebook Profile Scraper Tool

CometDocs File Converter
Convert your PDF files to Word, Excel, PowerPoint and more. Convert various formats to PDF. Store & share your documents for free. Also available as phone and tablet apps.

QuickCode.io ScraperWiki
A Python and R data analysis environment.

LinkedIn Scraper

Geocode.io
Straightforward and easy-to-use geocoding, reverse geocoding, and data matching for U.S. and Canadian addresses. You get 2,500 free lookups per day. Just upload a spreadsheet.

Census Geocoder
Census Geocoder provides interactive & programmatic (REST) access to users interested in matching addresses to geographic locations and entities containing those addresses.
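
For the programmatic (REST) route, a request is essentially a URL with the address in the query string. The Python sketch below only builds such a URL and does not send it. The endpoint path, the `benchmark` value and the parameter names are assumptions based on the public geocoder as commonly used, so check them against the Census Geocoder documentation before relying on them. `build_geocode_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint path and parameter names; verify in the official docs.
BASE_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"


def build_geocode_url(address, benchmark="Public_AR_Current"):
    """Build a one-line-address lookup URL that asks for a JSON response."""
    query = urlencode({
        "address": address,      # full address typed on one line
        "benchmark": benchmark,  # ASSUMPTION: name of the address dataset version
        "format": "json",
    })
    return f"{BASE_URL}?{query}"


url = build_geocode_url("1600 Pennsylvania Ave NW, Washington, DC 20500")
print(url)
```

Pasting the printed URL into a browser is a quick way to see whether the service accepts it before writing any fetching code.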

OpenRefine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. It’s a free download and available in several languages.
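
One common cleaning task with messy data is spotting different spellings of the same name. The Python sketch below is loosely modeled on the "fingerprint" keying idea that OpenRefine offers for clustering. It is a simplified illustration and not OpenRefine's implementation. The function names and the sample agency names are made up for the example.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so near-duplicate spellings share one key."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation, then sort the unique whitespace-separated words.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Made-up sample data.
names = [
    "Chicago Police Dept.",
    "chicago police dept",
    "Dept. Police Chicago ",
    "Cook County Sheriff",
]

clusters = cluster(names)
for group in clusters:
    print(group)
```

A cluster is only a suggestion that the entries might match. A person still has to confirm each merge, since two different entities can share a fingerprint.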

DataProofer
A tool that checks your datasets for errors.

Mr. Data Converter
Convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML.

Big List of Resources for Design, Data & Code
Data viz tools and resources from ProPublica’s Lena Groeger.

The Hacktavist Toolbox
A great set of data analysis tools from Politico.

US Local Data Portals
This GitHub account lists dozens of portals.



Zanran
Search engine designed specifically for finding tables, charts, and graphs online.

DataPortals.org
A list of more than 400 data portals from around the world.

Flowing Data: How to Find the Data You Need (2016)

Data Wrangler
A data cleaning tool created by Stanford University’s Visualization Group that rearranges data for use in other tools (e.g. spreadsheets).

Online Tools to Convert Data Formats
CSV to JSON, etc.
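
The same kind of conversion can be done locally in a few lines of Python, which helps when the data should not be uploaded to a third-party site. This is a minimal sketch using only the standard library. The `csv_to_json` helper and the sample rows are made up for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)


# Made-up sample data: the first line is the header row.
SAMPLE_CSV = "name,party\nAda Park,Independent\nLuis Ortega,Green\n"

converted = csv_to_json(SAMPLE_CSV)
print(converted)
```

Every value is written as a string, so numeric columns need an explicit conversion step if the JSON consumer expects numbers.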

AirTable
Database tool that is free for up to 1,500 records, with paid plans beyond that.

OpenStates.org
Search for bills or legislators across all states.

Quandl
Tons of free datasets and a very easy-to-use API. Even less technical users could spend just a few minutes and have a data visualization that loads its data asynchronously. Downloadable data is always an option, too.

Open Addresses
Community-hosted data on GitHub. Street names, house numbers and postal codes, when combined with geographic coordinates, are the hub that connects digital to physical places. It’s crowdsourced, so fact-check the data.

Quora: List of Places to Find Open Datasets

