
Malagasy Web Crawler

The idea came when I was searching for corpora for the project I'm currently working on during my internship.

I came across the fastText pre-trained models built by Facebook Research.

I wrote about an experiment with them in a blog post in Malagasy, Using model built with malagasy language, on the Malagasy digital community blog.

These models were trained on Wikipedia for 90 languages, including Malagasy.

While experimenting with the Malagasy model, I found its quality to be lower than that of, for example, the Google News model provided by Google, which was trained on a Google News corpus of 3 billion running words.
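
To make that kind of check concrete, here is a minimal sketch of how the model can be inspected, assuming the Malagasy fastText vectors (wiki.mg.vec) have been downloaded from the fastText website and that gensim is installed; the query word fiteny ("language") is just an illustrative choice.

```python
from gensim.models import KeyedVectors

# Load the pre-trained fastText vectors; .vec files use the word2vec
# text format, which gensim can read directly.
mg_vectors = KeyedVectors.load_word2vec_format("wiki.mg.vec")

# Inspect the nearest neighbours of a common Malagasy word: noisy or
# unrelated neighbours are a quick informal signal of low corpus quality.
for word, score in mg_vectors.most_similar("fiteny", topn=5):
    print(f"{word}\t{score:.3f}")
```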

When searching through large lists of corpora, I also found no mention of Malagasy.

Besides this, I found that Malagasy websites are indexed by some open projects that crawl and archive the Web, like Common Crawl and Archive.org.

I learned this by asking people in the Facebook group facedev.mg to comment with Malagasy websites they know.

I also found that there is a repository listing Malagasy websites: Malagasy sites.

By querying the Common Crawl Index API and the Wayback CDX Server API, I learned that there are crawl records of Malagasy websites.
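
Both services expose simple HTTP endpoints, so checking for captures takes only a few lines. Here is a rough sketch, assuming the requests package; lexpress.mg is used purely as an example Malagasy domain, and the Common Crawl index name (CC-MAIN-2018-13) is one monthly crawl that would need updating.

```python
import json
import requests

CC_INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-13-index"
WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"
DOMAIN = "lexpress.mg"  # example domain, not a complete list

# Common Crawl Index API: newline-delimited JSON, one record per capture.
cc = requests.get(CC_INDEX, params={
    "url": DOMAIN, "matchType": "domain", "output": "json", "limit": "5"})
for line in cc.text.strip().splitlines():
    record = json.loads(line)
    print("CommonCrawl:", record["timestamp"], record["url"])

# Wayback CDX Server API: a JSON array whose first row holds field names.
wb = requests.get(WAYBACK_CDX, params={
    "url": DOMAIN, "matchType": "domain", "output": "json", "limit": "5"})
for row in wb.json()[1:]:  # skip the header row
    print("Wayback:", row)
```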

My proposal is to use this already-crawled data to start building a corpus for the Malagasy language.
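
As a starting point, each Common Crawl index record carries the filename, byte offset, and length of its capture, so individual pages can be fetched with an HTTP range request and parsed as WARC. The sketch below assumes the warcio and requests packages and the current data.commoncrawl.org download host; record is one JSON object returned by the index query shown above.

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_capture_html(record):
    """Download one capture from Common Crawl's public bucket using the
    byte range stored in the index record, and return its raw HTML."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )
    # The fetched range is a complete gzipped WARC record, which
    # warcio can iterate over directly.
    for warc_record in ArchiveIterator(io.BytesIO(resp.content)):
        if warc_record.rec_type == "response":
            return warc_record.content_stream().read()
    return None
```

The HTML returned this way would still need boilerplate removal and text extraction before going into a corpus.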

If the interest remains and if we find use cases for crawling Malagasy websites frequently, we can start running a crawler targeting Malagasy websites, using the CommonCrawl Crawler Engine and Related MapReduce code or the Internet Archive's public Wayback Machine.

But for that, we'll need some infrastructure resources and operational effort. I hope to engage more people interested in Natural Language Processing in this project.

A priori, the interest should be there, judging by the recent interest in bot development.