python tutorial-mining imgur images


Upload: weiai-wayne-xu

Post on 14-Aug-2015


Page 1: Python Tutorial-Mining imgur images

Curiosity Bits - curiositybits.com

• This tutorial is created for social scientists interested in grabbing data from the image-hosting site, Imgur (imgur.com).

• To find out more about Python for mining the social web, please visit Curiosity Bits (curiositybits.com).

• Social-Metrics.org also hosts a series of Python tutorials on aggregating and analyzing Twitter/Facebook data. More at social-metrics.org

CURIOSITY BITS©

Get Imgur Data through Python

Page 2: Python Tutorial-Mining imgur images

This tutorial shows you how to download images and their metadata (e.g., image title, description, source, and upload time).

Imgur.com is a popular site for sharing selfies and images intended for persuasion.

Page 3: Python Tutorial-Mining imgur images


Images will be saved in a designated folder.

The metadata will be saved in a SQLite database

Page 4: Python Tutorial-Mining imgur images

Three sources of Imgur images

Imgur images are available through keyword search, reddit timelines, and public albums.

Here are the examples:

• http://imgur.com/search?q=hillary (images from a search using the keyword “hillary”)

• http://imgur.com/r/transtimelines (images from a public reddit timeline called transtimelines)

• http://imgur.com/gallery/ROYAZ (images from a public album called ROYAZ)
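To make the three sources concrete, here is a minimal sketch that reads one of the example URLs and tells you which kind of source it is. It assumes Python 3 and uses only the standard library; the function name classify_imgur_url is my own invention for illustration and is not part of Imgur Fetcher V1.

```python
from urllib.parse import urlparse, parse_qs


def classify_imgur_url(url):
    """Return (source_type, value) for the three URL shapes in the examples."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parts and parts[0] == "search":
        # Keyword search: the keyword sits in the "q" query parameter.
        keyword = parse_qs(parsed.query).get("q", [""])[0]
        return ("search", keyword)
    if len(parts) >= 2 and parts[0] == "r":
        # Reddit timeline: the name follows /r/.
        return ("reddit", parts[1])
    if len(parts) >= 2 and parts[0] == "gallery":
        # Public album: the album name follows /gallery/.
        return ("album", parts[1])
    return ("unknown", url)


for example in (
    "http://imgur.com/search?q=hillary",
    "http://imgur.com/r/transtimelines",
    "http://imgur.com/gallery/ROYAZ",
):
    print(classify_imgur_url(example))
```

The value returned in each case (keyword, timeline name, album name) is what you will later type into the corresponding script.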

Page 5: Python Tutorial-Mining imgur images

Final note before we start: for simplicity, I am laying out only the most essential steps. Previous tutorials provide details about how to set up a Python programming environment; please visit curiositybits.com and click the PYTHON tab.

Page 6: Python Tutorial-Mining imgur images

Steps

1. Install Anaconda Python (with Spyder and IPython Notebook)
2. Install SQLite Browser
3. Install four essential Python packages (imgurpython, sqlalchemy, urllib, sqlite3)
4. Register an Imgur client to get a client ID and client secret
5. Create a SQLite database for images from keyword search
6. Download images from the keyword search
7. Create a SQLite database for images from a reddit timeline
8. Download images from a reddit timeline
9. Create a SQLite database for images from a public album
10. Download images from a public album

Page 7: Python Tutorial-Mining imgur images

Install Anaconda Python

• store.continuum.io/cshop/anaconda/

You can run Python code in Spyder, which is a component of Anaconda Python.

Or use IPython Notebook, a web-based interactive Python environment.

Page 8: Python Tutorial-Mining imgur images

Install Anaconda Python

This is what Spyder looks like. The left side of the window displays the code. The pane at the bottom right shows you the results.

IPython Notebook runs your code from a web page and displays the results in your browser.

Page 9: Python Tutorial-Mining imgur images

Install SQLite Browser

http://sqlitebrowser.org

Page 10: Python Tutorial-Mining imgur images

Install four essential Python packages

• In the Command Prompt, type pip install followed by the package name to install a package that does not already ship with Python.

Required packages

• imgurpython

• sqlalchemy

• urllib (part of the Python standard library, no installation needed)

• sqlite3 (part of the Python standard library, no installation needed)

Page 11: Python Tutorial-Mining imgur images

Install four essential Python packages

Type the following commands in the Command Prompt, one per line:

pip install imgurpython

pip install sqlalchemy

urllib and sqlite3 are part of the Python standard library, so they do not need to be installed with pip.
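Before moving on, you may want to confirm that Python can actually find the four packages. The snippet below is a small sketch, assuming Python 3.4 or later; missing_packages is a helper name I made up for this check.

```python
import importlib.util

# The four packages named in this tutorial.
REQUIRED = ("imgurpython", "sqlalchemy", "urllib", "sqlite3")


def missing_packages(names=REQUIRED):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages())
```

An empty list means you are ready to go. Any name that is printed still needs a pip install.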

Page 12: Python Tutorial-Mining imgur images

The Python code used in the tutorial

Imgur Fetcher V1 is the collection of scripts used in this tutorial. You can download them at https://github.com/cosmopolitanvan/imgur_curiositybits_v1

On the site, you can also find three example SQLite databases. The database files end with the extension .sqlite

Page 13: Python Tutorial-Mining imgur images

Register an Imgur client

Sign in to your Imgur account, go to https://api.imgur.com/oauth2/addclient and get your “security clearance”.

Page 14: Python Tutorial-Mining imgur images

Register an Imgur client

You need a client ID and a client secret to grab data from the Imgur API.
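Here is a rough sketch of how the client ID and client secret could be handed to imgurpython. The tutorial's scripts have you paste the two values into the code; reading them from environment variables is my own substitution, and the variable names IMGUR_CLIENT_ID and IMGUR_CLIENT_SECRET are hypothetical. ImgurClient(client_id, client_secret) is imgurpython's client constructor as I understand it.

```python
import os


def load_credentials(env=os.environ):
    """Read the client ID and secret from (hypothetical) environment variables."""
    client_id = env.get("IMGUR_CLIENT_ID")
    client_secret = env.get("IMGUR_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise ValueError("Set IMGUR_CLIENT_ID and IMGUR_CLIENT_SECRET first.")
    return client_id, client_secret


def make_client(env=os.environ):
    """Build an imgurpython client from the stored credentials."""
    # Imported here so load_credentials works even before imgurpython is installed.
    from imgurpython import ImgurClient

    client_id, client_secret = load_credentials(env)
    return ImgurClient(client_id, client_secret)
```

Keeping the secret out of the script makes it easier to share your code without sharing your credentials.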

Page 15: Python Tutorial-Mining imgur images

Create a SQLite database for images from keyword search

Let’s get images with the keyword “hillary”

http://imgur.com/search?q=hillary

We will download all available images, but first, let’s get their metadata. By metadata, I mean the attributes of each image. Examples include the image title, image description, image upload date, and image link (which is what we will use to download the images).
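To give a feel for what "saving metadata in SQLite" means, here is a minimal sketch using only the standard-library sqlite3 module (the tutorial's scripts also rely on sqlalchemy, which this sketch leaves out). The table name images, the column layout, and the sample record are all assumptions made for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# Metadata fields mentioned in this tutorial, plus the image id.
COLUMNS = ("id", "title", "description", "datetime", "link")


def create_metadata_table(conn):
    """Create the (hypothetical) images table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id TEXT PRIMARY KEY, title TEXT, description TEXT, "
        "datetime INTEGER, link TEXT)"
    )


def save_metadata(conn, record):
    """Insert one image's metadata; a repeated id overwrites the old row."""
    values = [record.get(column) for column in COLUMNS]
    conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?, ?)", values)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a filename instead to keep the data
create_metadata_table(conn)
save_metadata(conn, {
    "id": "abc123",                           # made-up sample values
    "title": "Example image",
    "description": None,
    "datetime": 1439510400,
    "link": "http://i.imgur.com/abc123.jpg",
})
rows = conn.execute("SELECT id, link FROM images").fetchall()
print(rows)
```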

Page 16: Python Tutorial-Mining imgur images

Use the script named Imgur search v1.py

You can right-click the link and download the file with the extension .py. Open the file in Spyder.

Or, copy the entire block of code into IPython Notebook.

Page 17: Python Tutorial-Mining imgur images

Open Imgur search v1.py in Spyder

You don’t have to change anything in this block of code. It imports the necessary Python packages. Think of packages as apps running on iOS.

Page 18: Python Tutorial-Mining imgur images

This is where you enter the keyword(s) you want to apply to the search. You can have multiple keywords, wrapped in parentheses and separated by commas.
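In Python, values wrapped in parentheses and separated by commas form a tuple. The sketch below shows the syntax and one common pitfall with a single keyword. The second keyword "clinton" and the helper as_keyword_tuple are only illustrations and do not come from the tutorial's script.

```python
def as_keyword_tuple(keywords):
    """Accept one keyword or several and always return a tuple of strings."""
    if isinstance(keywords, str):
        return (keywords,)
    return tuple(keywords)


keywords = ("hillary", "clinton")  # two keywords
single = ("hillary",)              # the trailing comma makes a one-item tuple
not_a_tuple = ("hillary")          # without the comma this is just a string

print(as_keyword_tuple(not_a_tuple))
```

If you loop over not_a_tuple directly, Python hands you one letter at a time, which is rarely what you want in a search.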

Page 19: Python Tutorial-Mining imgur images


This is where you enter the client ID and client secret generated in the previous step.

Page 20: Python Tutorial-Mining imgur images


No tweaking is needed for this block. But it gives you a sense of what data are to be collected.

Page 21: Python Tutorial-Mining imgur images

Go to line 136. This is where you specify the name of the SQLite database to be saved. If no absolute file path is given, the database will be saved in the same folder as your Python script (Imgur search v1.py).
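If you are ever unsure where the database ended up, this small sketch shows how a filename without a folder gets resolved. Python resolves relative paths against the current working directory, which is the script's folder when you run the script from there. The filename imgur_search.sqlite and the function resolve_database_path are hypothetical.

```python
import os


def resolve_database_path(filename, base_dir=None):
    """Return the absolute path a database filename will resolve to."""
    if os.path.isabs(filename):
        return filename  # an absolute path is used as given
    if base_dir is None:
        base_dir = os.getcwd()  # relative paths start from the working directory
    return os.path.abspath(os.path.join(base_dir, filename))


print(resolve_database_path("imgur_search.sqlite"))
```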

Page 22: Python Tutorial-Mining imgur images

Imgur images are indexed into multiple pages. Here, 5 means that we will get five pages of images from the keyword search. You can enter 1, 2, or 3 instead; just play around to see which number gives you an adequate yet manageable amount of data.
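The idea of grabbing a fixed number of pages can be sketched as a simple loop. This is not the code from Imgur search v1.py: collect_pages and imgur_search_fetcher are names I made up, stopping early on an empty page is my own choice, and client.gallery_search(keyword, page=...) with pages counted from 0 is how I understand imgurpython's search call.

```python
def collect_pages(fetch_page, pages=5):
    """Call fetch_page(0), fetch_page(1), ... and gather all returned items."""
    items = []
    for page in range(pages):
        batch = fetch_page(page)
        if not batch:
            break  # no more results, so stop before reaching the page limit
        items.extend(batch)
    return items


def imgur_search_fetcher(client, keyword):
    """Wrap an imgurpython client so collect_pages can request page by page."""
    # Assumption: gallery_search accepts the keyword and a zero-based page number.
    return lambda page: client.gallery_search(keyword, page=page)


# Stand-in for the API so the loop can be tried without a network connection.
def fake_fetch(page):
    return [page * 2, page * 2 + 1] if page < 3 else []


print(collect_pages(fake_fetch, pages=5))
```

With a real client you would call collect_pages(imgur_search_fetcher(client, "hillary"), pages=5).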

Page 23: Python Tutorial-Mining imgur images

Now, the tweaking is done! From the menu, choose Run-Configure and hit RUN.

Page 24: Python Tutorial-Mining imgur images


This is the SQLite database that contains all the metadata.

Page 25: Python Tutorial-Mining imgur images

Notice that the last column is called multiple_images.

Most image links end with an image file extension (.jpg, .png, .bmp, .gif). If that is the case, the value in the column will be NO.

But some links contain multiple images, and you will see a “this link contains multiple images…” warning in the column.
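The check behind the multiple_images column can be approximated like this. It is a sketch, assuming Python 3, and not the tutorial's exact logic; has_image_extension is a hypothetical helper and the two links are made-up examples.

```python
import os
from urllib.parse import urlparse

# The four extensions named in this tutorial.
IMAGE_EXTENSIONS = (".jpg", ".png", ".bmp", ".gif")


def has_image_extension(link):
    """Return True when the link points straight at a single image file."""
    path = urlparse(link).path  # drops any ?query part of the link
    extension = os.path.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


print(has_image_extension("http://i.imgur.com/abc123.jpg"))
print(has_image_extension("http://imgur.com/a/abc123"))
```

A link that passes this check would get NO in the column; any other link would get the warning.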

Page 26: Python Tutorial-Mining imgur images

Download images from the keyword search

Now, we have a saved SQLite database with image links. The next step is to use Python to download images from those image links.

Use the .py file named Imgur search_downloader v1.py

Page 27: Python Tutorial-Mining imgur images

Download images from the keyword search

Go to line 72, and make sure the filename there matches the SQLite database we have just created.

Page 28: Python Tutorial-Mining imgur images

Download images from the keyword search

Please create a new folder and enter the folder path here.

Each image filename is made up of the image’s unique identifier (id) from the Imgur API.
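As an illustration of how the folder and the image id combine into a filename, here is a small sketch. The folder name imgur_images and both helper functions are hypothetical, and keeping the link's original extension is my assumption about the naming rule.

```python
import os
from urllib.parse import urlparse


def ensure_folder(folder):
    """Create the download folder if it is not there yet."""
    os.makedirs(folder, exist_ok=True)
    return folder


def image_file_path(folder, image_id, link):
    """Build folder/<id><extension>, taking the extension from the image link."""
    extension = os.path.splitext(urlparse(link).path)[1]
    return os.path.join(folder, image_id + extension)


print(image_file_path("imgur_images", "abc123", "http://i.imgur.com/abc123.jpg"))
```

Because the id is unique on Imgur, two images should never end up with the same filename.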

Page 29: Python Tutorial-Mining imgur images

Download images from the keyword search

Again, this script downloads images only from links that end with the extension .jpg, .png, .gif, or .bmp.

If an image link ends with anything but an image extension, the script will print a reminder message.
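Put together, the download step might look roughly like the sketch below. It assumes Python 3 and a table named images with id and link columns, which is my assumption and not necessarily the layout used by Imgur search_downloader v1.py. The fetch argument defaults to the standard library's urllib.request.urlretrieve and can be swapped for a stand-in when you want to test without downloading anything.

```python
import os
import sqlite3
import urllib.request
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".png", ".gif", ".bmp")


def download_all(conn, folder, fetch=urllib.request.urlretrieve):
    """Download every direct image link in the table; return the skipped links."""
    skipped = []
    for image_id, link in conn.execute("SELECT id, link FROM images"):
        extension = os.path.splitext(urlparse(link).path)[1].lower()
        if extension not in IMAGE_EXTENSIONS:
            # Reminder message for links that are not a single image file.
            print("Skipping (not a direct image link):", link)
            skipped.append(link)
            continue
        fetch(link, os.path.join(folder, image_id + extension))
    return skipped
```

To use it, open your database with sqlite3.connect, pass the connection together with the path of the folder you created, and then look at the returned list to see which links need a manual visit.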

Page 30: Python Tutorial-Mining imgur images

Download images from the keyword search

Now, execute the code and you will find the downloaded images in the folder you have just created.

Page 31: Python Tutorial-Mining imgur images

Create a SQLite database for images from a reddit timeline

Getting images from a reddit timeline is very similar to getting images from keyword search. This time, we will try executing the Python code in IPython Notebook.

Use the script called Imgur reddit timeline v1.py

Page 33: Python Tutorial-Mining imgur images

Enter the name of a reddit timeline. In URL terms, a reddit timeline is displayed as http://imgur.com/r/buffalo, where buffalo is the name of the timeline.

Enter your client ID and client secret.

Page 34: Python Tutorial-Mining imgur images


Enter or change the filename of the SQLite database to be saved

Page 35: Python Tutorial-Mining imgur images


The number of pages to be grabbed…

Page 36: Python Tutorial-Mining imgur images


Click Run Cell

The output will be saved in your IPython Notebooks folder. By default, it is …\Documents\IPython Notebooks

Page 38: Python Tutorial-Mining imgur images

Download images from a reddit timeline

Load the code in IPython Notebook. As you did when downloading images from the keyword search, make sure the file path in the code matches the database you have just created.

Page 39: Python Tutorial-Mining imgur images


Download images from a reddit timeline

Specify the folder path, and run the code!

Page 40: Python Tutorial-Mining imgur images

Create a SQLite database for images from a public album

To create a SQLite database for images from a public album, use the script named Imgur album v1.py

The process is exactly the same as in the previous steps. There are only a few places in the script that need tweaking. You need to enter the client ID, the client secret, the name of the album, the file path of the database, and the number of pages to be grabbed. Then you are good to go!
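For the curious, here is a rough sketch of how album images could be turned into rows of metadata. It is not the code from Imgur album v1.py. I am assuming that imgurpython's client offers get_album_images(album_id) and that each returned image carries id, title, description, datetime, and link attributes; the stand-in classes exist only so the sketch runs without a network connection.

```python
FIELDS = ("id", "title", "description", "datetime", "link")


def image_to_row(image):
    """Copy the metadata fields from an image object into a plain dict."""
    return {name: getattr(image, name, None) for name in FIELDS}


def album_rows(client, album_id):
    """Fetch an album's images and return one metadata dict per image."""
    # Assumption: the client exposes get_album_images(album_id).
    return [image_to_row(image) for image in client.get_album_images(album_id)]


class FakeImage:
    """Stand-in for an image object returned by the API."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


class FakeClient:
    """Stand-in client that returns one made-up image."""

    def get_album_images(self, album_id):
        return [FakeImage(id="abc123", link="http://i.imgur.com/abc123.jpg")]


print(album_rows(FakeClient(), "ROYAZ"))
```

Fields that an image does not carry come back as None, so every row has the same set of keys before it goes into the database.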

Page 42: Python Tutorial-Mining imgur images

Have questions?

Contact @cosmopolitanvan on Twitter