
scraping-passenger-list


Léeme en Español

A bundle of web scraping scripts that harvest information about ships, arrivals and passengers from Jewish Genealogy in Argentina.

Result files can be imported into a SQL database for querying. These files are available in Releases.

💡 Motivation

My mother's family emigrated mainly from the Czech Republic. While looking for them in some passenger lists, a couple of problems appeared. The first was the last name: I still don't quite understand how it works, but women have their last name "changed" by adding the suffix "ova". For example, "Vonka" becomes "Vonkova". The second was how they were registered when they arrived in Argentina: when a name was somewhat complex, it was changed to a similar local one. For example, "Jan" became "Juan" and "František" became "Francisco".

This made things more difficult: I needed to search for all the possibilities. What was my solution? Regular expressions. But since the page had no option to search with them, I decided to copy its information into a personal database and work from there.

🚧 Prerequisites

To run the scraping scripts you need Node.js. If you are going to create the database and query it, you also need MySQL.

You can import the .csv files into your preferred database service, but this code only covers MySQL.

🛠️ Install

  1. Download this repository

  2. Install the dependencies

    npm install
  3. Fill in the .env.sample file and rename it to .env
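Since the database scripts target MySQL, the file presumably holds connection settings along these lines. The keys below are purely hypothetical placeholders; copy the real keys from .env.sample itself:

```
# Hypothetical example only — use the keys actually listed in .env.sample
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=secret
```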

🚀 Usage

If you don't want to bother with scraping the information yourself:

  1. Create a results folder inside the project

  2. Download the latest data release

  3. Unzip the downloaded file inside results

  4. Skip the "🔍 Getting some information" section

🔍 Getting some information

These are the scraping scripts you can run:

  • Get ships

    npm run get-ships

    Searches the page for the available ships and writes them to ships.csv.

  • Get arrivals

    npm run get-arrivals -- [flags]

    Looks up the arrivals of every ship in ships.csv, so you must run get-ships first.

    The results are saved to arrivals.csv. Any ship whose request failed, over the network or because of the limit, ends up in ships.error.csv so it can be retried later.

  • Get passengers

    npm run get-passengers -- [flags]

    Gets the passenger list of every arrival in arrivals.csv, so you must run get-arrivals first.

    Afterwards, all the passengers can be found in passengers.csv. Any arrival whose request failed, over the network or because of the limit, ends up in arrivals.error.csv so it can be retried later.

🚩 Flags

I added these flags to modify the behaviour without editing a config file or some constant inside the scripts.

  • Limit the amount of work to do

    Example:

    npm run get-arrivals -- [-l | --limit <number>]

    When you set a limit, the requests or inserts that exceed it are saved to a .error.csv file so they can be resumed later. The default value is 500; 0 means no limit.

  • Change the delay

    Example:

    npm run get-passengers -- [-d | --delay <number>]

    The default value is 200 ms. Going below that is not recommended unless you know how many requests the server can handle. I am not responsible for any ban caused by making too many requests in a very short time.
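The behaviour of both flags can be modelled with a short sketch. Every name here (applyLimit, fetchAllWithDelay, the sample paths) is illustrative, not the repository's actual code:

```javascript
// Illustrative model of the two flags; NOT the repository's code.

// --limit: rows beyond the limit are deferred (in the real scripts,
// written to a .error.csv so a later --retry run can pick them up).
function applyLimit(rows, limit) {
  if (limit === 0) return { todo: rows, deferred: [] }; // 0 means no limit
  return { todo: rows.slice(0, limit), deferred: rows.slice(limit) };
}

// --delay: wait `delayMs` between consecutive requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchAllWithDelay(urls, delayMs, fetchOne) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchOne(url)); // one request at a time
    await sleep(delayMs);              // be polite to the server
  }
  return results;
}

// Usage with a stand-in fetcher; a real script would make an HTTP request.
const { todo, deferred } = applyLimit(["/ship/1", "/ship/2", "/ship/3"], 2);
fetchAllWithDelay(todo, 200, async (url) => `fetched ${url}`)
  .then((results) => console.log(results, "deferred:", deferred));
```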

♻️ Retrying those which failed

If you ran into failures or set a limit, you have a .error.csv file. Here is how to retry those entries.

Example: If you want to retry getting the arrivals of ships that failed

npm run get-arrivals -- [-r | --retry]

This searches for the arrivals of the ships in ships.error.csv and appends the results to arrivals.csv. The same logic applies to the other commands.
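The file-naming convention described above could be modelled like this. The mapping and the function name are assumptions based on this README, not the actual source:

```javascript
// Sketch of the input-file convention: each step reads the previous step's
// output, or that file's .error.csv when --retry is set. NOT the repo's code.
function inputFileFor(step, retry) {
  const inputs = { "get-arrivals": "ships", "get-passengers": "arrivals" };
  const base = inputs[step];
  return retry ? `${base}.error.csv` : `${base}.csv`;
}

console.log(inputFileFor("get-arrivals", true)); // ships.error.csv
```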

🔣 Querying the database

Wait. What database? Well... first you need to create it by running:

npm run init-database

With a database created, you can insert the result files into it with:

  • Ships

    npm run insert-ships -- [flags]
  • Arrivals

    npm run insert-arrivals -- [flags]
  • Passengers

    npm run insert-passengers -- [flags]

At the time of writing, I have harvested almost 1.2 million passengers. Inserting that many rows will take a while... like several minutes. So, stretch out and go get some coffee.

Once finished, you will be able to query the scraping-passenger-list database. You don't have to worry about table joins; I left a query template called selectPassenger.sql.
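Since the whole point of moving the data into MySQL was regex search (see the Motivation section), a query along these lines is the kind of thing the database enables. The table and column names below are assumptions for illustration, not the actual schema; the bundled selectPassenger.sql is the authoritative template:

```sql
-- Hypothetical example: table and column names are assumptions.
-- Matches both the base surname and the feminine "ova" form.
SELECT *
FROM passenger
WHERE surname REGEXP 'Vonk(ova)?';
```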

📝 License

Copyright © 2021 Aguirre Gonzalo Adolfo. This project is MIT licensed.
