
Make haste and take it slow: How to create your own tools
~ An HTML data collection tool using Node ~

My name is Komiya. I'm a coder, and lazy enough that I would eat hotpot straight from the pot if it wouldn't upset anyone.

This time I'd like to introduce a homemade tool and record how it was put to use.
Please understand that I am no expert.
I'm sure there is plenty you could point out, but I'd be happy if you came away feeling that making it was fun.

*If you want to use the tool right away, please skip ahead from this anchor link.

One day, a mission suddenly came

Two options for starting

It was one busy day.
"I want to compile about 40 pages of information from a website into Excel format for each page."
I was entrusted with this mission by the director of the department.

Being lazy, I started thinking about how to complete this task as compactly as possible.

So, first I decided to search the web and get help from people who have already provided useful information.
I searched around for a tool that would compile the information and output it in Excel.

I couldn't find a tool that suited my needs, as they were either paid or difficult to customize...

Having lost one of my options, I began to feel torn.

At this point there were two choices:

  1. Compile the data manually. It's quick if you just need to produce it, and the method is flexible.
  2. Automate it, something I've never done before. There's a risk of failure, but if it works there will be no human error, and no repetitive work either.


Ugh... it's difficult.

In the midst of a busy schedule, in a situation like this, both options have their appeal.

Thinking about it rationally, though, the latter, automation, is the better choice, even if it takes a little more time.
When the all-too-common request arrives, "The original data has changed quite a bit, can you recreate it?", the time needed for corrections differs enormously between the two.

They say "Sorry!" but the corrections will still have to be made mercilessly.

And so, despite the conflict, I decided to go down the path of automation.
(As insurance, I informed the director in advance of the method and procedure, in case my abilities proved insufficient to pull it off.)

It's a moment when I'm filled with anxiety and a selfish sense of mission.

Requirements hearing

The advantage of making your own tools is that they can be made to order.
I wanted to build something with the user in mind as much as possible,
so I went to the person who gave me the mission to ask about the ideal shape.

In brief, I learned the following:

  • Environment in which it will be used (original data, users, PCs, etc.)
    • The data is the current site's HTML.
    • It will be used by directors or coders.
    • PCs set up for coding, with various languages installed
  • Image of how it will be used
    • Copy and paste a list of URLs placed on a sheet
    • Select the target parts of the page (jQuery-style is good)
    • Run the process, and the data is created.
  • Format of the output data (how the information is displayed)
    • CSV, JSON, or Excel
    • Excel is best (the final documents will be in Excel)
    • One HTML page, one file

The key points of the order:
any part of the page can be targeted (jQuery makes this easy to set),
and for Excel, one HTML page means one file.

It's difficult, but I'd like to make it happen as much as possible.

Consulting my skill set

I wasn't sure yet how to pull this off, so I sorted out my options first.
As a coder, my go-to companions for automation are Node (※1) and gulp (※2).

When using gulp, you need to consider whether your users can actually run it.
This time the users will be directors or coders, so there should be no problems with their environment or technical literacy.

There are other ways to do this, but I don't have much time, so I'll go with this environment.
If you try to broaden your options too much, you'll end up spending a lot of time studying, so save that for another time.

※1 Node: a JavaScript runtime beloved by coders. It runs both locally and on a server, and has a vast range of packages.
※2 gulp: a task runner that conveniently bundles processes together. It saves the effort of wiring up and running processes written for Node.

Production Concept

First, create a simple image (a hypothesis) of how the process will proceed.

The point is:
"A proper entrance and exit."
In other words, correct input in, correct output out. As long as this rule is followed, it doesn't matter to the user how complicated or difficult the processing in between is. With this rule in mind, I imagined the steps along the way.

  1. Set the URL list and the content you want to obtain (multiple items possible).
  2. Access each specified URL.
  3. Extract the desired content from the fetched page information, jQuery-style.
  4. Compile the extracted information as data.
  5. Convert the compiled data into Excel format.
  6. Output the data as Excel files.

Within this image, the steps I didn't yet know how to do became the challenges of this production.

For me this time, those were:
2. Access each specified URL.
3. Extract the desired content from the fetched page information, jQuery-style.
6. Output the data as Excel files.

Trial and error

Now to find a way to solve each challenge.
This time I planned to solve them with npm (Node Packaged Modules) packages, so I set about trying various things.

Using npm makes a lot of the complexity much simpler.
Thank you to everyone who made these packages...

There are many different npm packages, so the next step is finding the ones that work best for you.
Searching for "excel convert" on the official npm website,
there turns out to be quite a wide variety of Excel-related packages.

If the purpose is complex, it may not come up in a search, or there may not be a package for it at all.
In that case, you will have to search by trial and error. It will be hard to find what you are looking for.

So: actually install a package, try it out, and repeat.
Repeat patiently until you get the desired output.
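
For example, one of the packages that turns up in that "excel convert" search is xlsx (SheetJS). I can't say it's the one this tool ended up using, but as a minimal sketch of the "output Excel" challenge, assuming xlsx, it could look like this:

```js
// Minimal sketch of Excel output, assuming the xlsx (SheetJS) package.
// The data and file name are illustrative, not the tool's actual values.
const XLSX = require('xlsx');

// Compiled data: an array of rows, each row an array of cell values.
const rows = [
  ['item', 'value'],
  ['title', 'Example page title'],
];

const sheet = XLSX.utils.aoa_to_sheet(rows);        // rows -> worksheet
const book  = XLSX.utils.book_new();                // empty workbook
XLSX.utils.book_append_sheet(book, sheet, 'data');  // attach the sheet
XLSX.writeFile(book, 'output.xlsx');                // write the .xlsx file
```

A handful of lines for a format that would be painful to generate by hand; this is exactly the kind of complexity npm hides.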

Thanks to that, I came across a great package called cheerio-httpcli.
It lets you access a page and then operate on it in all sorts of ways, jQuery-style. (This one package solved both challenge 2, accessing the specified URLs, and challenge 3, extracting the desired information jQuery-style.)
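
A minimal sketch of cheerio-httpcli in use (the URL and selectors here are just examples):

```js
// Fetch a page with cheerio-httpcli, then query it like jQuery.
const client = require('cheerio-httpcli');

client.fetch('https://www.monosus.co.jp/', (err, $, res, body) => {
  if (err) {
    console.error('fetch failed:', err);
    return;
  }
  console.log($('title').text());        // the page title
  $('a').each(function () {
    console.log($(this).attr('href'));   // every link on the page
  });
});
```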

Combining the packages into one process

Each process is now connected to the others to create the final form:
the tool that fulfills the mission, "I want to compile information from a website of about 40 pages into Excel data for each page."

Following the processing order worked out in the production concept, I kept checking:

Was the URL accessed successfully?
Was the information extracted from the page that was visited?

I connected each piece one by one, confirming the logs as I went.

The ideal form is created by combining each separate function.
It's like building with toy blocks.
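
Connecting the fetch step to the extraction step, with a log check after each block, might look something like this (a sketch under the same assumptions as above; the URL and selector are illustrative):

```js
// Sketch: fetch -> extract, with a log check after each step.
const client = require('cheerio-httpcli');

async function collect(url) {
  const { $ } = await client.fetch(url);             // access the URL
  console.log('fetched:', url);                      // were we able to access it?

  const items = $('h2')                              // extract, jQuery-style
    .map(function () { return $(this).text().trim(); })
    .get();
  console.log('extracted:', items.length, 'items');  // was anything extracted?

  return items;
}

collect('https://www.monosus.co.jp/')
  .then(items => console.log(items))
  .catch(err => console.error('failed:', err));
```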

Confidence and overreach

For now at least, I got it to output!

As it got closer to completion, I got excited and wanted to make it even more convenient.
Add this, add that, make it more convenient, then more convenient still, and do my job even better!
That's how I felt.

However, cramming too much in can end up being counterproductive.

Overdoing it comes across as intrusive and makes the tool harder to use.
So go back to the plan you originally wanted to achieve, and confirm it has all the functionality that's actually needed.

Then I tested it, and once it was OK, on to putting it into practice.

It's time to put it into practice!

It's a tense moment.
I feel conflicted: my efforts might pay off, but it might also fail, leaving me to start over.

Specify various settings and run the tool.

Even the department director couldn't help but smile at this.

I'm glad I made it after all

At first I had mixed feelings of anxiety, but before I knew it I was filled with a sense of satisfaction that I had made it.
When this happens, you'll brag about it even to people who aren't interested.

Even now there are moments of, "Oh, that reminds me, maybe we could use that tool." The tools I created are still alive within the department, and when that happens it makes me quite emotional.

It's okay if what you create is awkward.
I think it's a great thing to broaden your experiences, understanding, and possibilities in production.
Although I felt conflicted at the first stage of trying, I would like to continue taking on challenges proactively whenever the opportunity arises.

Below is the scraping tool I actually created.
If you like, I would be happy if you would try it out.

It's fun to create things.

jQuery-like scraping tool


This was made with Node + gulp.

It extracts information from any page and turns it into data.
It is used to semi-automatically retrieve necessary page information, such as page titles or lists of links.
This time I created a tool that makes it easy to retrieve such data using jQuery.
Apparently this technique is called scraping.

Points to note when scraping

Scraping lets you obtain information automatically, but doing so without permission is not a good idea. Problems can arise depending on the purpose of use, so I recommend limiting it to cases like the following:

  • Personal use
  • Information analysis

Please handle the information at your own risk.

Preparing your environment for use

  1. Install Node and gulp on your computer.
  2. Download this tool from here.
  3. In the root of the downloaded folder, run npm install from the command line (Win) or terminal (Mac) to install the packages, and you're done.

(From there, you can use it by following the instructions below.)

How to use

It's simple to use.
Set the information you want to obtain in jQuery selector format and set the target page URLs.

/gulpfile.js

Let's edit gulpfile.js as follows and try to scrape Maruyama-senpai's article on the Monosasu site.

/gulpfile.js (Example: Scraping settings for articles on Monosasu site)
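
The tool's actual settings aren't reproduced here, but a minimal sketch of what such a gulpfile might look like, assuming cheerio-httpcli for fetching and xlsx (SheetJS) for the Excel output (the URL, selectors, and task name are illustrative):

```js
// Hypothetical gulpfile.js sketch (gulp 4 style); the real tool's settings may differ.
const gulp   = require('gulp');
const client = require('cheerio-httpcli');
const XLSX   = require('xlsx');
const fs     = require('fs');
const path   = require('path');

// 1. Target page URLs and the content to obtain (jQuery selector format).
const urls = ['https://www.monosus.co.jp/posts/2017/12/200014.html'];
const selectors = { title: 'title', heading: 'h1' };

gulp.task('scraping', async () => {
  fs.mkdirSync('dest/scraping_data', { recursive: true });

  for (const url of urls) {
    const { $ } = await client.fetch(url);           // 2. access the URL
    const rows = Object.keys(selectors).map(key =>   // 3-4. extract and compile
      [key, $(selectors[key]).first().text().trim()]
    );

    const sheet = XLSX.utils.aoa_to_sheet(rows);     // 5. convert to Excel format
    const book  = XLSX.utils.book_new();
    XLSX.utils.book_append_sheet(book, sheet, 'data');

    // 6. one HTML page -> one Excel file, named after the URL path
    const name = url.split('/').slice(3).join('_') + '.xlsx';
    XLSX.writeFile(book, path.join('dest/scraping_data', name));
  }
});
```

One file per URL keeps the "one HTML page, one file" requirement from the hearing.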

When you run the gulp task, Excel files will be generated as specified.
Then, you will find the information you were looking for inside.

/dest/scraping_data/posts_2017_12_200014.html.xlsx

That's how to use it.

Thank you for reading to the end.
I will continue to work harder.
