INTRODUCTION TO
OPENREFINE
CENTRAL PA OPEN SOURCE CONFERENCE
OCTOBER 17, 2015
Heather Myers / @privatestorm
https://www.linkedin.com/in/heathercmyers
ABOUT ME
Web administrator in the government and cultural heritage sectors.
Currently working at the Pennsylvania Historical and Museum Commission.
OPENREFINE
"a powerful tool for working with messy data: cleaning it;
transforming it from one format into another; and extending it
with web services and external data."
GETTING STARTED
Choose dataset
Decide what you want to accomplish with data
Install OpenRefine: http://openrefine.org/download.html
Run OpenRefine: http://127.0.0.1:3333
ABOUT THE DATASET
Pennsylvania Heritage magazine subject index.
Index of 12,000+ magazine terms for issues dated 1975–2002.
http://bit.ly/1Udha8D
DATA TO DO LIST
Create lists of terms for specific issues
Extract list of terms
IMPORT ALL DATA
File upload
Web download
Copy from clipboard
Google data import
CONFIGURE PARSING OPTIONS
Choose options
Update preview
Name project
Create project »
OPENREFINE PROJECT
APPLY A TEXT FILTER
Click down arrow on column
Choose text filter from menu
APPLY A TEXT FILTER
Type text
Choose additional options
TEXT FILTER OPTIONS
Add multiple filters
Reorder filters
TEXT FILTER OPTIONS
Case sensitive
Regular expressions
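For readers who think in code, the text filter and its two options can be approximated outside OpenRefine. This is a minimal Python sketch, not OpenRefine's own implementation; the function name `text_filter`, the column name `term`, and the sample rows are all hypothetical.

```python
import re

def text_filter(rows, column, query, case_sensitive=False, regex=False):
    """Keep rows whose `column` value contains `query`.

    Mirrors the two checkboxes on the slide: "case sensitive" and
    "regular expressions". A plain query is escaped so that characters
    such as "." are matched literally.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(query if regex else re.escape(query), flags)
    return [row for row in rows if pattern.search(row[column])]

# Hypothetical sample rows, not taken from the actual index.
rows = [
    {"term": "Railroads"},
    {"term": "Canal boats"},
    {"term": "railroad stations"},
]

matches = text_filter(rows, "term", "railroad")
case_matches = text_filter(rows, "term", "railroad", case_sensitive=True)
regex_matches = text_filter(rows, "term", r"^canal", regex=True)
```

Adding multiple filters corresponds roughly to calling the function again on the previous result.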
EXPORT DATA
Click export button
Choose export type
Save file
DATA TO DO LIST
Create lists of terms for specific issues
Extract list of terms
IMPORT SELECTION OF DATA
Choose CSV / TSV / separator-based files
Use semicolon as custom separator
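The effect of a custom semicolon separator can be illustrated with Python's standard `csv` module. This is only a sketch of the idea; the sample lines below are hypothetical and the real index file may be laid out differently.

```python
import csv
import io

# Hypothetical sample lines; the real index file may be laid out differently.
raw = (
    "Railroads; Vol. 5, No. 2; Spring 1979\n"
    "Canals; Vol. 3, No. 1; Winter 1977\n"
)

# delimiter=";" plays the role of the custom separator in the import dialog.
# Commas inside a field are left alone because they are no longer separators.
reader = csv.reader(io.StringIO(raw), delimiter=";")
rows = [row for row in reader]
```

In this sketch the space after each semicolon stays attached to the following field, which is the kind of leftover that a whitespace trim cleans up afterwards.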
DATA SEPARATED INTO COLUMNS
SPLIT INTO SEVERAL COLUMNS
Click down arrow on column
Choose edit column
Choose split into several columns
SPLIT INTO SEVERAL COLUMNS
Split by separator or field length
Choose after splitting options
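The two ways of splitting a column, by separator or by field length, can be sketched in Python as follows. Both helper names are hypothetical, and the sketch ignores the "after splitting" options in the dialog.

```python
def split_by_separator(value, separator):
    """Split one cell into several values at every occurrence of `separator`."""
    return value.split(separator)

def split_by_field_lengths(value, lengths):
    """Split one cell into fixed-width pieces, e.g. lengths [4, 6]."""
    parts = []
    start = 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    return parts

# Hypothetical cell values, for illustration only.
by_separator = split_by_separator("Railroads, Canals, Bridges", ", ")
by_length = split_by_field_lengths("1979Spring", [4, 6])
```

Splitting by separator suits cells with a consistent delimiter; splitting by field length suits fixed-width values where each piece always has the same number of characters.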
EDIT SINGLE CELL
Hover over cell & click edit
Update data type
Update text
Click apply
TRANSFORM TEXT
Click down arrow on column
Choose edit cells
Choose transform
TRANSFORM TEXT
Type expression
Choose options
Choose OK
TERMS SPLIT INTO COLUMNS
MOVE COLUMN
Click down arrow on column
Choose edit column
Choose move column left
TRIM WHITESPACE
Click down arrow
Choose edit cells, common transformations
Choose trim leading & trailing whitespace
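Trimming leading and trailing whitespace is the same operation as Python's `str.strip()`; in GREL the comparable expression is, to my knowledge, `value.trim()`. A small sketch with hypothetical cell values:

```python
def trim(value):
    """Remove whitespace at both ends; spacing inside the value is kept."""
    return value.strip()

# Hypothetical cells as they might look after splitting on a separator.
cells = [" Vol. 5, No. 2", "Spring 1979  ", "Railroads"]
trimmed = [trim(cell) for cell in cells]
```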
JOIN CELLS
Custom text transform
GREL expressions plus space in between
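Joining cells with a space in between can be sketched in Python as below. The column names `volume` and `number` and the helper `join_cells` are hypothetical. In GREL, a join of this kind is usually written along the lines of `cells["volume"].value + " " + cells["number"].value`, again with assumed column names.

```python
def join_cells(row, columns, separator=" "):
    """Concatenate the named columns of one row, skipping blank cells."""
    return separator.join(row[name] for name in columns if row.get(name))

# Hypothetical row, for illustration only.
row = {"volume": "Vol. 5", "number": "No. 2"}
joined = join_cells(row, ["volume", "number"])

# A blank cell does not leave a dangling separator behind.
partial = join_cells({"volume": "Vol. 5", "number": ""}, ["volume", "number"])
```

Skipping blank cells is a design choice of this sketch; a plain concatenation would instead leave a trailing space when the second cell is empty.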
HIDE COLUMNS
Click down arrow
Choose view
Choose collapse all other columns
SORT VALUES
Click down arrow
Choose sort
Choose additional options
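Sorting with additional options can be approximated in Python with `sorted`. This sketch assumes text values and models only two options, sort direction and case sensitivity; the helper name `sort_rows` and the sample rows are hypothetical.

```python
def sort_rows(rows, column, reverse=False, case_sensitive=False):
    """Return rows ordered by the text in `column`."""
    def key(row):
        value = row[column]
        # Lower-casing makes "canals" and "Canals" sort together.
        return value if case_sensitive else value.lower()
    return sorted(rows, key=key, reverse=reverse)

# Hypothetical rows, for illustration only.
rows = [{"term": "canals"}, {"term": "Railroads"}, {"term": "Bridges"}]

a_to_z = [r["term"] for r in sort_rows(rows, "term")]
z_to_a = [r["term"] for r in sort_rows(rows, "term", reverse=True)]
strict = [r["term"] for r in sort_rows(rows, "term", case_sensitive=True)]
```

In the case-sensitive variant, uppercase letters sort before lowercase ones, so "canals" moves to the end.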
UNDO / REDO
Choose undo/redo tab in left column
Click previous step to undo
FILTER UNDO / REDO
Type keywords in filter box
EXTRACT OPERATION HISTORY
Choose extract button
Choose steps to save in left column
Copy JSON in right column
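The extracted operation history is a JSON list with one object per step, so it can be inspected or trimmed with any JSON tool before being pasted into another project. The sketch below uses an illustrative, simplified history: the field names follow what such exports typically contain, but the exact fields, the column names `Term` and `Issue`, and the helper `select_steps` are assumptions.

```python
import json

# Illustrative, simplified history; a real export usually carries more
# fields per step. Column names here are hypothetical.
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Term",
    "columnName": "Term",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-move",
    "description": "Move column Issue to position 0",
    "columnName": "Issue",
    "index": 0
  }
]
"""

def select_steps(history, wanted_ops):
    """Keep only the steps whose "op" is in `wanted_ops`, in original order."""
    return [step for step in history if step["op"] in wanted_ops]

history = json.loads(history_json)
transforms = select_steps(history, {"core/text-transform"})

# Serialize the selection again, ready to paste into another project.
selected_json = json.dumps(transforms, indent=2)
```

Choosing steps in the left column of the extract dialog has the same effect as this selection: only the chosen steps end up in the JSON that is copied.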
APPLY OPERATION HISTORY
Open or create new project
Click apply button in undo/redo
Paste JSON
Click perform operations
EXPORT DATA
Click export button
Choose export type
Save file
LEARN MORE
Website: OpenRefine
http://openrefine.org/
Book: Using OpenRefine
http://bit.ly/1QC0oNS
Course: Big Data University
http://bit.ly/1QC1sl1
THE END
