Scraping and Cleaning Data

RECENT UPDATES

LA Times Data Desk: Github Scrapers
A demo of how the Times’ scrapers work using coronavirus data.

PDFtoExcel.com
A free web-based service that extracts data tables from regular and scanned PDF files into fully editable Excel spreadsheets. No email address is required, PDFs of any size can be converted, and there is no limit on the number of free conversions. It can come in handy for investigative and data journalists who need to pull tabular data from PDFs into .XLSX format, especially when working with big tables.

Octoparse Web Scraper
Download the software as a free trial before buying. Use it to scrape emails, websites, etc.

New York Times Spreadsheet Training Resources (free)

GIJN: Ethics of Web Scraping


FINDING, SCRAPING AND CLEANING DATA

Journalist’s Toolbox Public Records Page
Portals galore at the top of the page.

Google Dataset Search
Search for data in this Google search tool.

Workbench
A great all-in-one data viz tool. Scrape, sort, filter and design graphics in this free tool. Video on how it works.

Data Scraping in Google Sheets
This tutorial and formula will let you scrape HTML tables out of web pages. Created by Mike Reilley, Journalist’s Toolbox founder and SPJ digital trainer.
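
For readers who would rather do the same job in code, here is a minimal Python sketch of pulling the rows out of an HTML table. It is not the tutorial's formula; it uses only the standard library, the names TableExtractor and extract_tables are made up for this illustration, and the sample markup and figures are invented. It assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []    # each table is a list of rows
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Invented sample markup, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>County</th><th>Cases</th></tr>
  <tr><td>Cook</td><td>120</td></tr>
  <tr><td>Lake</td><td>45</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_tables(SAMPLE_HTML)[0]:
        print(row)
```

To use it on a real page, download the HTML first and pass the text to extract_tables. Check the site's terms before scraping.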

Tabula.technology
Download this desktop tool to scrape tables out of native .PDFs. Also offers tips on preparing scanned .PDFs for scraping.

Introduction to Web Scraping
Python scraping primer with Mindy McAdams.


For more training videos, visit our YouTube page.


Google Sheets Functions List

Able2Extract Professional
A PDF converter, creator, editor and more. Convert .PDFs into Excel.

10 Examples of Web Scraping in Use

Data Viz Data Scraping Tools

Import.io
Structured web scraping and data visualization.

Ultimate Facebook Profile Scraper Tool

CometDocs File Converter
Convert your PDF files to Word, Excel, PowerPoint and more. Convert various formats to PDF. Store & share your documents for free. Also available as phone and tablet apps.

QuickCode.io ScraperWiki
A Python and R data analysis environment.

LinkedIn Scraper

Geocode.io
Straightforward and easy-to-use geocoding, reverse geocoding, and data matching for U.S. and Canadian addresses. You get 2,500 free lookups per day. Just upload a spreadsheet.

Census Geocoder
Census Geocoder provides interactive & programmatic (REST) access to users interested in matching addresses to geographic locations and entities containing those addresses.

OpenRefine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. It’s a free download and available in several languages.
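
One cleaning idea associated with OpenRefine is clustering near-duplicate values by a "fingerprint" key. The Python sketch below is a simplified illustration of that idea, not OpenRefine's own code: the function names are hypothetical, and it skips steps such as accent normalization.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trimmed, lowercased, punctuation removed, unique tokens sorted."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


if __name__ == "__main__":
    # Invented names, for illustration only.
    print(cluster(["Smith, John", "john smith", "Jane Doe"]))
```

Each cluster is a candidate for merging, not a confirmed match, so a person should review the groups before any values are changed.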

DataProofer
A tool that checks your datasets for errors.
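
As a rough picture of what automated dataset checks can look like, here is a short Python sketch that flags empty cells, duplicate rows and rows with the wrong number of columns. It is an assumption-laden illustration rather than DataProofer's actual test suite; check_rows and the sample data are invented.

```python
import csv
import io


def check_rows(csv_text):
    """Return (row_number, problem) tuples for CSV text that has a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    seen = {}  # normalized row -> first row number where it appeared
    for number, row in enumerate(reader, start=2):  # row 1 is the header
        if len(row) != len(header):
            problems.append((number, f"expected {len(header)} columns, found {len(row)}"))
            continue
        for name, cell in zip(header, row):
            if not cell.strip():
                problems.append((number, f"empty value in '{name}'"))
        key = tuple(cell.strip() for cell in row)
        if key in seen:
            problems.append((number, f"duplicate of row {seen[key]}"))
        else:
            seen[key] = number
    return problems


# Invented sample data, for illustration only.
SAMPLE = """name,city,amount
Ann,Chicago,100
Bob,,250
Ann,Chicago,100
Cy,Denver
"""

if __name__ == "__main__":
    for number, problem in check_rows(SAMPLE):
        print(f"row {number}: {problem}")
```

Checks like these catch structural problems only. They cannot tell whether a value that looks valid is actually true.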

Mr. Data Converter
Convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML.

Big List of Resources for Design, Data & Code
Data viz tools and resources from ProPublica’s Lena Groeger.

The Hacktavist Toolbox
A great set of data-analysis tools from Politico.

US Local Data Portals
This Github account lists dozens of portals.


Zanran
Search engine designed specifically for finding tables, charts, and graphs online.

DataPortals.org
A list of more than 400 data portals from around the world.

Flowing Data: How to Find the Data You Need (2016)

Data Wrangler
A data cleaning tool created by Stanford University’s Visualization Group that rearranges data for use in other tools (e.g., spreadsheets).

Online Tools to Convert Data Formats
CSV to JSON, etc.
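
For a sense of what a CSV-to-JSON conversion involves, here is a minimal Python sketch using only the standard library. The function name csv_to_json is made up for this example, and it assumes the first line of the CSV is a header row.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    print(csv_to_json("name,amount\nAnn,100\nBob,250\n"))
```

Every value comes out as a string, so numbers need to be converted separately if the destination expects numeric types.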

AirTable
A database tool that is free for up to 1,500 records, with paid pricing beyond that.

OpenStates.org
Search for bills or legislators across all states.

Free OCR
Scrape text from images. Always double-check your work, because 3s can look like 8s.

Quandl
Tons of free datasets and a very easy-to-use API. Even those less technically inclined could get a dataviz running with async data in just a few minutes. Downloadable data is always an option, too.

Open Addresses
Community-hosted data on Github. Street names, house numbers and postal codes, when combined with geographic coordinates, are the hub that connects digital to physical places. It’s crowdsourced, so fact-check the data.

Quora: List of Places to Find Open Datasets

