What you can scrape: Facebook Pages
Info Extra: login request | retroactive
Facebook is one of the biggest social networks. On Facebook you can create two different types of accounts (private profiles and Facebook Pages) and two different types of groups (private and public).
☞ page like network: analyze networks of pages connected through the likes between them.
☞ page posts: analyze user activity around posts on pages.
☞ page timeline images: analyze images from the "Timeline Photos" album on pages.
☞ search: an interface to Facebook's search function.
☞ link stats: generate statistics for links shared on Facebook.
☞ Big pages can take some time to process (minutes or hours). Be patient and don't try to reload!
☞ Netvizz provides only image links. Use a browser extension like Tab Save or DownThemAll to download a large number of links at once.
☞ Google Sheets allows you to visualize the images (connected to a link) directly inside your working sheet. That is particularly helpful when you are cleaning your dataset.
What you can scrape: tweets
Info Extra: mandatory to run the script for the whole scrape | not retroactive
Twitter is one of the biggest social networks and allows its users to communicate via tweets. A tweet is a short text of up to 280 characters (140 before the limit was raised in 2017) that can include hashtags, mentions of other users, images or videos. One of the main characteristics of Twitter is that tweets are public by default and potentially readable by everyone.
Gazouilloire is a tool developed by the Sciences Po Medialab that helps you gather data from Twitter. Under Twitter's privacy policies, you cannot scrape information retroactively. To start the scrape, follow the instructions to install the tool (instructions here). Please note that, to complete the scrape, the tool has to run for the entire time frame you want to analyze.
Catwalk is another tool developed by the Sciences Po Medialab. It allows you to remove from your dataset the tweets that are not related to your field of investigation.
Tool: Instagram Scraper
What you can scrape: public posts & public profiles
Info Extra: retroactive
Instagram is a social network focused on hashtags, images and the connection between these two elements. Profiles can be public or private, and everyone can like, comment on and share the different kinds of content (posts or stories).
Instagram Scraper is a tool developed by the Digital Methods Initiative that allows you to gather data from Instagram without writing a line of code. The tool only scrapes posts based on usernames, hashtags and locations. Stories, pinned stories and IGTV content are not included in the scrape. Once the scrape is done, you can choose among these outputs, depending on your research question:
☞ A CSV file containing metadata for all scraped posts.
☞ A JSON file containing metadata for all scraped posts.
☞ An HTML table containing metadata for all scraped posts.
☞ A GEXF graph, compatible with Gephi, with co-tag data. Co-tags are tags that appear together; the weight of the connection between two tags is determined by how often they appear together in the scraped posts.
☞ A GEXF graph, compatible with Gephi, with post/hashtag data. Nodes consist of posts and hashtags; connections are formed between a post and all hashtags appearing in it.
☞ (Only when querying usernames) A CSV file with metadata for all scraped users.
☞ There are other Python-based tools that also allow you to scrape stories and other types of content from Instagram (Instaloader, Instagram-scraper).
☞ Instagram Scraper provides only image links. Use a browser extension like Tab Save or DownThemAll to download a large number of links at once.
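To make the co-tag weighting described above more concrete, here is a minimal Python sketch. It is not the code used by Instagram Scraper: the sample posts and the function name `cotag_weights` are hypothetical, and the sketch only assumes that each post has already been reduced to its list of hashtags.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each post is reduced to the hashtags it contains.
posts = [
    ["design", "milan", "poster"],
    ["design", "poster"],
    ["milan", "design"],
]


def cotag_weights(posts):
    """Count how often each pair of tags appears together in a post."""
    weights = Counter()
    for tags in posts:
        # sorted(set(...)) drops repeated tags and makes (a, b) equal to (b, a)
        for pair in combinations(sorted(set(tags)), 2):
            weights[pair] += 1
    return weights


weights = cotag_weights(posts)
for (tag_a, tag_b), weight in weights.most_common():
    print(tag_a, tag_b, weight)
```

Each pair of tags becomes an edge, and the count becomes its weight; this is the information a co-tag graph carries when you open it in Gephi.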
What you can scrape: images + autocomplete function
Info Extra: -
Google.com is the most famous search engine and also the most visited website worldwide.
There are a large number of tools that allow you to gather and analyze different types of data provided by google.com.
For example, browser extensions help you download a large number of images at once: GetThemAll works with Google Chrome and DownThemAll with Firefox (it works only with an old version of Firefox that you can download from here). There are also many tools developed by the Digital Methods Initiative (here). One of these is the Autocomplete tool, which can be used in the Open Mic method. It allows you to build a small dataset from Google's autocomplete function, in order to build a new sentence or complete a search query.
Tool: Amazon Scraper
What you can scrape: comments & feedback
Info Extra: retroactive
Amazon is one of the largest internet companies worldwide and is available in different countries and languages. That makes it an interesting source to analyze: Amazon has a widely used feedback system that can become an excellent source of data.
Amazon-Scraper allows you to gather data from Amazon based on item codes. First you need to install Python and pip; then, with a single line in the terminal, you will be able to extract the data you need. The tool works with ASIN codes, unique item codes that you can find in the item links.
The tool's GitHub page lists multiple options that allow you to scrape different kinds of data, like item ratings, comments and questions raised by users.
This simple line of code allows you to gather all the information about these 3 different ASIN codes:
$ amazon-scraper B01H2E0J5M B01GYLZD8C B0736R3W1F
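If you have collected item links rather than ASIN codes, a short Python sketch like the following can pull the codes out of the links. It is an illustration, not part of Amazon-Scraper: the example links and the helper `extract_asin` are hypothetical, and the pattern assumes the ASIN is a 10-character code that follows `/dp/` or `/gp/product/` in the link.

```python
import re

# Assumption: the ASIN is a 10-character alphanumeric code that follows
# "/dp/" or "/gp/product/" in the item link.
ASIN_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url):
    """Return the ASIN found in an item link, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical item links, built around the ASIN codes used above.
links = [
    "https://www.amazon.com/dp/B01H2E0J5M",
    "https://www.amazon.com/Some-Item-Name/dp/B01GYLZD8C/ref=sr_1_1",
    "https://www.amazon.com/gp/product/B0736R3W1F?th=1",
    "https://www.amazon.com/help",  # not an item link, so it is skipped
]

asins = [asin for asin in map(extract_asin, links) if asin]
print("amazon-scraper " + " ".join(asins))
```

The printed line can be pasted into the terminal as it is.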
What you can do: transform .csv files into Gephi networks
Table2Net is a tool developed by the Sciences Po Medialab of Paris that allows you to create a Gephi network starting from a .csv file. Once the dataset is uploaded, choose the type of network you want to create among these options:
☞ Normal: single type of nodes network. For instance authors; they will be linked when they share a value in another column, for example papers.
☞ Bipartite: two types of nodes network. For instance authors and papers will be linked when they appear in the same row of the table.
☞ Citation: if you have a column containing references to another one, for instance paper title and cited papers (title).
☞ No link: a single type of nodes, without links.
Once you have decided the type of network, choose the column that will define the edges (links) of your network, plus any attributes you also want to include. Finally, download the .gexf file and start exploring it in Gephi.
Once you have chosen the column for nodes or edges, click on the sample button to check that everything is correct.
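The difference between the Normal and Bipartite options can be sketched in a few lines of Python. This is only an illustration of the idea, not how Table2Net is implemented; the table content and the column names `author` and `paper` are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical table: one author/paper pair per row.
table = io.StringIO(
    "author,paper\n"
    "Ada,Paper A\n"
    "Bruno,Paper A\n"
    "Ada,Paper B\n"
    "Chiara,Paper B\n"
)
rows = list(csv.DictReader(table))

# Bipartite: an author node and a paper node are linked when they share a row.
bipartite_edges = {(row["author"], row["paper"]) for row in rows}

# Normal: only author nodes, linked when they share a value in the paper column.
authors_by_paper = defaultdict(set)
for row in rows:
    authors_by_paper[row["paper"]].add(row["author"])

normal_edges = set()
for authors in authors_by_paper.values():
    normal_edges.update(combinations(sorted(authors), 2))

print(sorted(bipartite_edges))
print(sorted(normal_edges))
```

In the bipartite case the papers stay in the network as nodes; in the normal case they disappear and only determine which authors are connected.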
What you can do: explore networks
Gephi is an open-source network analysis and visualization software.
Networks represent symmetric or asymmetric relations between discrete objects. In computer science, a network can be defined as a graph in which nodes and/or edges have attributes. Edges in networks can be directed or undirected, weighted or unweighted. Networks can be normal (one type of node) or bipartite (two types of nodes).
Gephi allows you to explore networks. You can build networks with tools like Table2Net, which creates a network explorable in Gephi from a .csv file, or export networks directly from tools like Netvizz or Gazouilloire. Moreover, thanks to Gephi Add Images, you can turn a network into an image moodboard, as in the Flock Of Birds method.
Gephi has no undo function, so you will not be able to reverse your last action.
What you can do: replace the nodes with images
This Python script, developed by Michele Mauri, allows you to include images in a Gephi network.
First, set the image name as the label of the node itself. Second, export the network in .svg format through the preview panel. Create a folder that contains the .svg file, the Python script and all the images (in a subfolder). Load your folder in the terminal and run the Python script by writing the following line of code:
$ python replace-with-images.py (input_file.svg) (name_of_images_folder) (upscaling) (output_file_name)
Now open your network in software that can read .svg files, or in Chrome, and you will see images instead of the nodes.
☞ The script requires Beautiful Soup 4, which you can find here.
☞ In order to load the folder you can also drag and drop it inside the terminal, otherwise write the path.
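The general idea of swapping nodes for images can be sketched with Python's standard library. This is not Michele Mauri's script and it does not use Beautiful Soup. It assumes, purely for illustration, that every node is a `<circle>` element whose `class` attribute holds the image file name; the real .svg exported by Gephi may be structured differently. The function name and the sample .svg are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <svg> and <image> tags on output


def replace_circles_with_images(svg_text, images_folder, upscaling=1.0):
    """Return SVG text in which every labelled <circle> becomes an <image>."""
    root = ET.fromstring(svg_text)
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            # Assumption: the class attribute of the circle is the image name.
            if child.tag != f"{{{SVG_NS}}}circle" or not child.get("class"):
                continue
            cx, cy, r = (float(child.get(key, 0)) for key in ("cx", "cy", "r"))
            size = 2 * r * upscaling  # side of the square image
            image = ET.Element(f"{{{SVG_NS}}}image", {
                "x": str(cx - size / 2),  # centre the image on the old node
                "y": str(cy - size / 2),
                "width": str(size),
                "height": str(size),
                # Plain href is read by current browsers; older viewers
                # may expect xlink:href instead.
                "href": f"{images_folder}/{child.get('class')}",
            })
            image.tail = child.tail
            parent[index] = image  # swap the circle for the image in place
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-node network.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g id="nodes"><circle class="cat.jpg" cx="10" cy="20" r="5"/></g>'
    "</svg>"
)
result = replace_circles_with_images(sample, "images", upscaling=2)
print(result)
```

The `upscaling` value plays the same role as in the command above: it enlarges the images relative to the original node size.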
What you can do: work with messy data
Open Refine is an open-source desktop application for cleaning up and transforming datasets that are not well structured.
Two possible operations are:
☞ Cleaning messy data: for example if working with a text file with some semi-structured data, it can be edited using transformations, facets and clustering to make the data cleanly structured.
☞ Transformation of data: converting values to other formats, normalizing and denormalizing.
Once you open the software you will be redirected to your browser, where you can upload your dataset and start working on it. The number of operations you can do is enormous, so it is worth reading the documentation on the wiki page or following one of the many tutorials online. When you are done, you can download your new dataset in different formats, like .csv, .tsv, .xls, .json etc.
If you don't know how to do something, try to google it! There are plenty of examples and tips out there.
Tool: Imagga folder
What you can do: automatically tag your image corpus
Imagga is an API image recognition system that analyzes your image-dataset in order to automatically assign tags to the images.
First, sign up on Imagga.com and create a free account to obtain a free API key (up to 2000 images/month). Then download the IMAGGA folder from this link. In the IMAGGA folder you will find two subfolders (IMG & RESULT) plus a Python script. Open the Python script with a text editor and insert your API key on lines 8 and 10. Copy the image corpus you want to analyze and paste it in the IMG folder; open the terminal and load the IMAGGA folder via its path, or drag and drop the folder on the terminal icon (Mac users). Now run the script by writing the following line of code in the terminal:
$ python tag.py img result
If you changed the names of the subfolders the general rule is:
$ python tag.py (input_folder) (output_folder)
Once the script has finished, you will find the .json file with all the tags inside the RESULT folder.
Use Open Refine to analyze and work on the .json file.
What you can do: automatically tag your image corpus
DD Image tagging is an online tool developed by Density Design Lab that uses the Clarifai API to automatically generate descriptive tags for images without metadata. It works directly with the image URLs and you don't have to write a single line of code.
First, prepare the .csv file with all the image URLs that you will upload to the tool. All the URLs must be in the first column of the file (one URL per line), and the column must have a label in the first row. Sign up on Clarifai to obtain an API key (free for up to 5,000 operations/month). Insert your API key and upload your dataset, then choose among the 11 Clarifai models. Once the operation is over you will be able to download a new .csv file with all the tags that Clarifai recognized in the images.
It is advisable to create a copy of the original file because the tool will overwrite all the columns except for the one that contains the image URLs.
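If your image links live in a larger dataset, a small Python sketch like this one can produce a file in the layout described above: a single first column, a label in the first row, and one URL per line. The URLs, the column label `image_url` and the function name are hypothetical.

```python
import csv
import io

# Hypothetical image URLs; replace them with the links from your own dataset.
image_urls = [
    "https://example.com/images/one.jpg",
    "https://example.com/images/two.jpg",
]


def build_url_csv(urls, header="image_url"):
    """Return CSV text with a labelled first column and one URL per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])                # label in the first row
    writer.writerows([url] for url in urls)  # one URL per line
    return buffer.getvalue()


csv_text = build_url_csv(image_urls)
print(csv_text)

# To save the result as a separate file, leaving the original dataset untouched:
# with open("image_urls.csv", "w", newline="") as handle:
#     handle.write(csv_text)
```

Writing the URLs to a new file, rather than editing the original dataset, also keeps the other columns safe from being overwritten.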