Scraping | Cleaning Data

RECENT UPDATES

Scrapy
Python-driven web-scraping platform

PhantomBuster
Helps with data extraction from social media and websites. Pull lists of influencers from LinkedIn or from top-10 articles to add to outreach lists. You can also automate tasks such as following people on Twitter or liking Instagram posts.

Clay
Automated, self-filling spreadsheet: give it Twitter handles or LinkedIn URLs and it fills in people’s information in a searchable spreadsheet. Then search for keywords from bios, filter by location, or sort by follower count to see who you want to reach out to. The tool also has a data scraper built in as a Chrome extension and can pull tables from websites fairly reliably.

DataWrangler
Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data.

Guide to Writing Your First Python Web Scraper

15 Best Web Scraping Tools (2021)

Tweepy Tweet Scraper
From Dr. Ken Blake, this tool lets you search recent tweets using search terms you specify and save the tweets, along with key metadata, in an Excel file on your computer. Updated April 2021.


FINDING, SCRAPING AND CLEANING DATA

Journalist’s Toolbox Public Records Page
Portals galore at the top of the page.

Google Dataset Search
Search for data in this Google search tool.

Workbench
A great all-in-one data viz tool. Scrape, sort, filter and design graphics in this free tool. Video on how it works.

Data Scraping in Google Sheets
This tutorial and formula will let you scrape HTML tables out of web pages. Created by Mike Reilley, Journalist’s Toolbox founder and SPJ digital trainer.
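
For readers who prefer code to a spreadsheet formula, the same idea (pulling the cells of an HTML table into rows) can be sketched in Python using only the standard library. This is a minimal illustration, not the tutorial's method: the class name, function name and sample HTML are hypothetical, and the sketch does not handle nested tables. In Google Sheets itself, the built-in IMPORTHTML function serves a similar purpose.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> in a page as nested lists."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables


# Hypothetical sample page, used only to show the output shape.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>1,000</td></tr>
</table>
"""
tables = extract_tables(SAMPLE_HTML)
```

Each table comes back as a list of rows, which can then be written out with the csv module or pasted into a spreadsheet.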

Tabula.technology
Download this desktop tool to scrape tables out of native .PDFs. Also offers tips on preparing scanned .PDFs for scraping.

Introduction to Web Scraping
Python scraping primer with Mindy McAdams.

LA Times Data Desk: Github Scrapers
A demo of how the Times’ scrapers work using coronavirus data.


For more training videos, visit our YouTube page.


Google Sheets Functions List

GIJN: Ethics of Web Scraping

Able2Extract Professional
A PDF converter, creator, editor and more. Convert .PDFs into Excel.

10 Examples of Web Scraping in Use

Data Viz Data Scraping Tools

Octoparse Web Scraper
Download the software as a free trial before buying. Use it to scrape emails, websites, etc.

Import.io
Structured web scraping and data visualization.

Ultimate Facebook Profile Scraper Tool

CometDocs File Converter
Convert your PDF files to Word, Excel, PowerPoint and more. Convert various formats to PDF. Store & share your documents for free. Also available as phone and tablet apps.

QuickCode.io ScraperWiki
A Python and R data analysis environment.

PDFtoExcel.com
A free web-based service for extracting data tables from regular and scanned PDF files into fully editable Excel spreadsheets. No email address is required; it converts PDFs of any size and does not limit the number of files that can be converted for free. It can come in handy for investigative and data journalists who need to extract tabular data from PDFs into .XLSX format, especially when working with big tables.

LinkedIn Scraper

Geocode.io
Straightforward and easy-to-use geocoding, reverse geocoding, and data matching for U.S. and Canadian addresses. You get 2,500 free lookups per day. Just upload a spreadsheet.

Census Geocoder
Census Geocoder provides interactive & programmatic (REST) access to users interested in matching addresses to geographic locations and entities containing those addresses.
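
As a rough illustration of what programmatic (REST) access can look like, the sketch below only builds a request URL for a one-line address lookup; it does not send the request. The endpoint path, the parameter names and the benchmark value are assumptions based on the Census Geocoder's public documentation and should be checked there before use. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for one-line address lookups; verify against the
# Census Geocoder documentation before relying on it.
BASE_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"


def build_geocode_url(address, benchmark="Public_AR_Current"):
    """Build a geocoding request URL for a single address (no request is sent)."""
    params = {
        "address": address,      # full address on one line
        "benchmark": benchmark,  # assumed name of the current address benchmark
        "format": "json",        # ask for a JSON response
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_geocode_url("4600 Silver Hill Rd, Washington, DC 20233")
```

The resulting URL can be opened in a browser or fetched with any HTTP client to inspect the response.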

Open Refine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. It’s a free download and available in several languages.
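
To give a sense of the kind of cleanup such a tool automates, here is a minimal Python sketch that groups spellings of the same name by a normalized key. It is loosely modeled on the general idea of key-based clustering and is not OpenRefine's actual implementation; the function names and sample values are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so trivially different spellings share one key."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation such as commas and periods.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate the words so word order does not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values with the same fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical messy column of names.
names = ["Smith, John", "john smith", "John  Smith ", "Jane Doe"]
clusters = cluster(names)
```

Here the three spellings of John Smith land in one group, while "Jane Doe" stands alone and is left out. A person should still review each group before merging, since different people can share a key.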

Scraping Bee
ScrapingBee is a web scraping API. It renders JavaScript inside a headless browser and handles proxies. It is useful for data journalists who need to scrape websites that are hard to scrape, either because they rely heavily on JavaScript or because they block scrapers.

Mr. Data Converter
Convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML.

Big List of Resources for Design, Data & Code
Data viz tools and resources from ProPublica’s Lena Groeger.

The Hacktavist Toolbox
Great set of tools for analyzing data from Politico.

US Local Data Portals
This Github account lists dozens of portals.


Zanran
Search engine designed specifically for finding tables, charts, and graphs online.

DataPortals.org
A list of more than 400 data portals from around the world.

Flowing Data: How to Find the Data You Need (2016)

Data Wrangler
A data cleaning tool created by Stanford University’s Visualization Group for rearranging data for other tools to use (e.g. spreadsheet).

Online Tools to Convert Data Formats
CSV to JSON, etc.
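
The same CSV-to-JSON conversion can also be done locally. The following is a minimal Python sketch using only the standard library; the function name and sample data are hypothetical, and it assumes the first CSV row is a header.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Hypothetical sample data.
SAMPLE_CSV = "name,state\nAda,IL\nGrace,NY\n"
result = csv_to_json(SAMPLE_CSV)
```

Every value comes out as a string, so numeric columns need converting separately if the JSON consumer expects numbers.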

Instagram Scraper API

AirTable
Free database tool up to 1,500 records. Pricing from there.

OpenStates.org
Search for bills or legislators across all states.

Free OCR
Scrape text from images. Always double-check your work: 3s can look like 8s.

Quandl
Tons of free datasets and a very easy-to-use API. Even less technically inclined users could spend just a few minutes and get a data visualization running on asynchronously loaded data. Downloadable data is always an option, too.

Open Addresses
Community-hosted data on Github. Street names, house numbers and postal codes, when combined with geographic coordinates, are the hub that connects digital to physical places. It’s crowdsourced, so fact-check the data.

Quora: List of Places to Find Open Datasets

