Data Scraping | Cleaning

RECENT UPDATES

Crosswalker
An open-source tool from The Washington Post that helps join columns of text that don’t match perfectly.
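
To show what joining imperfectly matching text columns involves, here is a minimal Python sketch built on the standard library's difflib. It is not Crosswalker's own method; the function name fuzzy_join, the 0.6 cutoff and the sample names are all assumptions made for illustration.

```python
import difflib

def fuzzy_join(left, right, cutoff=0.6):
    """Pair each name in `left` with its closest match in `right`, or None."""
    # Compare in lowercase so differences in capitalization do not matter.
    lookup = {name.lower(): name for name in right}
    pairs = []
    for name in left:
        matches = difflib.get_close_matches(
            name.lower(), list(lookup), n=1, cutoff=cutoff
        )
        pairs.append((name, lookup[matches[0]] if matches else None))
    return pairs

# Hypothetical sample data, not taken from any real dataset.
names_a = ["Jon Smith", "Acme Corp.", "Zed"]
names_b = ["John Smith", "ACME Corporation", "Jane Doe"]

for left_name, right_name in fuzzy_join(names_a, names_b):
    print(f"{left_name!r} -> {right_name!r}")
```

Raising the cutoff makes matches stricter. Always review the pairs by hand before publishing anything based on them.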

ExportData
Twitter data export tool. Download historical tweets since 2006, export unlimited followers and followings, and gather historical Twitter trends in 63 countries and 400 cities.


For more training videos, subscribe to the Toolbox’s YouTube channel.



FINDING, SCRAPING AND CLEANING DATA

Journalist’s Toolbox Public Records Page
Portals galore at the top of the page.

Google Dataset Search
Search for data in this Google search tool.

Data Scraping in Google Sheets
This tutorial and formula will let you scrape HTML tables out of web pages. Created by Mike Reilley, Journalist’s Toolbox founder and SPJ digital trainer.

Introduction to Web Scraping
Python scraping primer with Mindy McAdams.
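
For readers who want a taste of Python scraping before starting a primer, the sketch below pulls the rows out of an HTML table using only the standard library. It is not taken from the tutorials listed here. The class name TableParser, the helper parse_table and the sample HTML are hypothetical, and a real project would first fetch the page and would likely use a dedicated parsing library.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def parse_table(html):
    """Return the table rows in `html` as a list of lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>58,000</td></tr>
  <tr><td>Shelbyville</td><td>41,500</td></tr>
</table>
"""

for row in parse_table(SAMPLE_HTML):
    print(row)
```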

DataWrangler
Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data.

Outwit Hub Web Scraper
OutWit Hub breaks down Web pages into their different constituents and allows you to scrape them. Pull links, text, photos, RSS feeds and more.

LA Times Data Desk: Github Scrapers
A demo of how the Times’ scrapers work using coronavirus data.

PhantomBuster
Helps with data extraction from social media and websites. Pull lists of influencers from LinkedIn or top 10 articles to add to outreach lists. You can also automate tasks such as following people on Twitter or liking Instagram posts.

Clay
Automated, self-filling spreadsheet: give it Twitter handles or LinkedIn URLs and it fills in people’s information in a searchable spreadsheet. Then search for keywords from bios or locations, or sort by follower count, to see who you want to reach out to. The tool also has a built-in data scraper, available as a Chrome extension, that can pull tables from websites pretty reliably.

Beginner’s Guide to Extracting Data from PDFs
From Laura Grant of Media Hack


GIJN: Free, Game-Changing Data Extraction Tools that Require No Coding Skills
An excellent collection of scraping tools from the IRE 2022 conference.

Scrapy
Python-driven web-scraping platform

SimpleScraper.io
Chrome extension or use as a web-based tool. Select on the page what you want to scrape by highlighting with the shift key. Free but has a paid upgrade.

Instant Data Scraper
Chrome extension. Open it once you’re on a page that has a table.

Table Capture
Simple data scraper Chrome extension. Just right click on a table to scrape it.

Batch Link Downloader
A free Chrome extension that lets you mouse over files on a web page to batch download them. E.g.: A government page with links to dozens of .PDFs.

PDF Candy
Free suite of PDF and file conversion tools.

Google Sheets Functions List

GIJN: Ethics of Web Scraping

Able2Extract Professional
A PDF converter, creator, editor and more. Convert .PDFs into Excel.

10 Examples of Web Scraping in Use

Data Viz Data Scraping Tools

Octoparse Web Scraper
Download the software as a free trial before buying. Use it to scrape emails, websites, etc.

Parse Hub
A desktop app that can identify and scrape elements and sub-elements. The free version lets you scrape 200 pages at a time.

Import.io
Structured web scraping and data visualization.

Outwit Hub
A desktop app that can identify each HTML element on a web page and scrape it. The free version lets you download 100 rows at a time.

Minelead.io
A technology solutions company that has developed a cutting-edge email finder and verifier tool. It helps to find professional emails for any company or website and verify their quality through a browser extension, API, or web interface.

ComputerWorld: Seven Best Tools for PDF to Excel Conversion
Sharon Machlis reviews the best tools for converting the files.

GIJN: How to Find Hidden Data in the Web Inspector

Ultimate Facebook Profile Scraper Tool

Training Videos: 30-Minute Skills
The New England First Amendment Coalition presents a monthly educational series featuring short, practical lessons on journalism and the First Amendment. The program, called “30 Minute Skills,” gives reporters and other citizens knowledge they can use immediately in newsgathering, data collection, storytelling and other areas of journalism and First Amendment law. The lessons run 30 minutes to accommodate the demanding schedules faced by many working in New England newsrooms. Videos include web-scraping skills.

Guide to Writing Your First Python Web Scraper

15 Best Web Scraping Tools (2021)

LinkedIn Scraper

Geocode.io
Straightforward and easy-to-use geocoding, reverse geocoding, and data matching for U.S. and Canadian addresses. You get 2,500 free lookups per day. Just upload a spreadsheet.

Excel Formula Bot
Type in a sentence describing what you want the spreadsheet to do and the bot uses AI to generate a formula.

Google Sheets Formula Bot
Type in a sentence describing what you want the spreadsheet to do and the bot uses AI to generate a formula.

PDF 24 Tools
Edit, merge, split and convert PDF documents

Census Geocoder
Census Geocoder provides interactive & programmatic (REST) access to users interested in matching addresses to geographic locations and entities containing those addresses.

Open Refine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. It’s a free download and available in several languages.
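
One cleaning technique OpenRefine offers is key-collision clustering, which groups values that reduce to the same normalized "fingerprint." The Python sketch below is a simplified illustration of that idea and not OpenRefine's actual code. The function names and sample values are assumptions, and steps such as accent normalization are left out.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

# Hypothetical messy column of names.
messy = ["Smith, John", "John Smith", "john  SMITH", "Jane Doe"]

for group in cluster(messy):
    print(group)
```

Each group is a set of candidates to merge, not a confirmed match, so check them before standardizing the column.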

Scraping Bee
ScrapingBee is a web scraping API. It renders JavaScript inside a headless browser and handles proxies. It’s useful for data journalists who need to scrape hard-to-scrape websites (those that use a lot of JavaScript or block scrapers).

Mr. Data Converter
Convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML.

Big List of Resources for Design, Data & Code
Data viz tools and resources from ProPublica’s Lena Groeger.

The Hacktivist Toolbox
Great set of tools for analyzing data from Politico.

US Local Data Portals
This Github account lists dozens of portals.

DataPortals.org
A list of more than 400 data portals from around the world.

Flowing Data: How to Find the Data You Need (2016)

Tweepy Tweet Scraper
From Dr. Ken Blake, this tool lets you search recent tweets, using search terms you specify, and save the tweets, along with key metadata, in an Excel file on your computer. Updated April 2021.

Data Wrangler
A data cleaning tool created by Stanford University’s Visualization Group for rearranging data for other tools to use (e.g., a spreadsheet).

Online Tools to Convert Data Formats
CSV to JSON, etc.
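
As a rough idea of what such a conversion does, here is a minimal Python sketch that turns CSV text into JSON with the standard library. The function name csv_to_json and the sample rows are hypothetical. Every value comes out as a string, so numeric columns would need converting separately.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)

# Hypothetical sample data.
SAMPLE_CSV = "city,population\nSpringfield,58000\nShelbyville,41500\n"

print(csv_to_json(SAMPLE_CSV))
```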

ExportComments.com
Easily exports all comments from your social media posts to an Excel file.

AirTable
Database tool that is free for up to 1,500 records. Paid plans beyond that.

OpenStates.org
Search for bills or legislators across all states.

Free OCR
Scrape text from images. Always double-check your work: 3’s can look like 8’s.

Quandl
Tons of free datasets, and a very, very easy to use API. Even those less technically inclined could spend just a few minutes and get dataviz with async data. Of course, downloadable data is always an option.

Open Addresses
Community-hosted data on Github. Street names, house numbers and postal codes, when combined with geographic coordinates, are the hub that connects digital to physical places. It’s crowdsourced, so fact-check the data.

Twint
An advanced Twitter scraping & OSINT tool written in Python that doesn’t use Twitter’s API, allowing you to scrape a user’s followers, following, Tweets and more while evading most API limitations.

ParseHub
Free web-scraping tool.

Mobile App Scraping: Extract Data from iOS and Android Apps

Video: How to Do Full Web Scraping on iPad

Quora: List of Places to Find Open Datasets



SCRAPING PDFS

Tabula.technology
Download this desktop tool to scrape tables out of native .PDFs. Also offers tips on preparing scanned .PDFs for scraping.

CometDocs File Converter
Convert your PDF files to Word, Excel, PowerPoint and more. Convert various formats to PDF. Store & share your documents for free. Also available as phone and tablet apps.

PDFtoExcel.com
A free web-based service for extracting data tables from regular and scanned PDF files into fully editable Excel spreadsheets. No email address is required; it converts PDFs of any size and does not limit the number of files that can be converted for free. It can come in handy for investigative and data journalists who need to extract tabular data from PDFs to .XLSX format, especially when working with big tables.

Tiny Wow
A great collection of tools that solve file management issues, including merging PDFs.

Able2Extract
Download a free trial version to pull data out of scanned PDFs.

Batch Link Downloader
A free Chrome extension that lets you mouse over files on a web page to batch download them. E.g.: A government page with links to dozens of .PDFs.

Advanced PDF OCR
Session from NICAR 22 on how to scrape data out of PDFs.

