Malagasy Web Crawler
30 Apr 2017

The idea came when I searched for corpora for the project I'm currently working on in my internship.
I came across the fastText pre-trained models built by Facebook Research.
I wrote about an experiment with them in a blog post in Malagasy, Using model built with malagasy language, on the Malagasy digital community blog.
These models were trained on Wikipedia for 90 languages, including Malagasy.
When experimenting with this model, I found that its quality is lower than, for example, that of the Google News model provided by Google, which was trained on a Google News corpus of 3 billion running words.
I also didn't find any mention of Malagasy when searching through large lists of corpora:
- Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources - Stanford NLP
- List of text corpora - Wikipedia
Besides this, I found that Malagasy websites have been indexed by open projects that crawl and archive the Web, like Common Crawl and Archive.org.
I learned this by asking people in the Facebook group facedev.mg to comment with Malagasy websites they know.
I also just found that there is a repository listing Malagasy websites: Malagasy sites.
By querying the Common Crawl Index API and the Wayback CDX Server API, I found that there are crawl records for Malagasy websites.
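To make the lookup concrete, here is a minimal sketch of those two queries in Python with the requests library. The domain is a placeholder to be replaced by a site from the Malagasy sites list, and CC-MAIN-2017-17 is just one crawl id as an example; the current list of Common Crawl indexes is published at index.commoncrawl.org.

```python
import json
import requests

domain = "example.mg"  # placeholder: substitute a site from the Malagasy sites list

# Common Crawl Index API: one newline-delimited JSON record per capture.
cc_index = "http://index.commoncrawl.org/CC-MAIN-2017-17-index"
resp = requests.get(cc_index, params={"url": domain + "/*", "output": "json"})
cc_records = [json.loads(line) for line in resp.text.splitlines() if line.strip()] if resp.ok else []
print(len(cc_records), "captures found in Common Crawl")

# Wayback CDX Server API: a JSON array whose first row holds the field names.
wb_cdx = "http://web.archive.org/cdx/search/cdx"
resp = requests.get(wb_cdx, params={"url": domain, "matchType": "domain",
                                    "output": "json", "limit": 50})
wb_rows = resp.json() if resp.ok else []
print(max(len(wb_rows) - 1, 0), "captures found in the Wayback Machine (first 50 requested)")
```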
My proposal is to use this already-crawled data to start building a corpus for the Malagasy language.
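As a rough illustration of what that could look like, here is a hedged sketch that takes one index record returned by the Common Crawl query above (its filename, offset and length fields), fetches the corresponding byte range of the WARC file from Common Crawl's public S3 bucket, and reduces the HTML to plain text using the warcio library (pip install warcio). The helper name and the crude tag-stripping step are my own; a real pipeline would use a proper HTML parser.

```python
import io
import re
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_record_text(record):
    """Return rough plain text for one Common Crawl index record (a dict)."""
    offset, length = int(record["offset"]), int(record["length"])
    # Each index record points at a byte range inside a shared WARC file.
    url = "https://commoncrawl.s3.amazonaws.com/" + record["filename"]
    headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
    resp = requests.get(url, headers=headers)
    for warc_record in ArchiveIterator(io.BytesIO(resp.content)):
        if warc_record.rec_type == "response":
            html = warc_record.content_stream().read().decode("utf-8", errors="replace")
            # Crude text extraction: drop tags and collapse whitespace.
            return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()
    return ""
```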
If the interest remains and we find use cases for crawling Malagasy websites frequently, we can start running a crawler targeting Malagasy websites, using the CommonCrawl Crawler Engine and Related MapReduce code or Internet Archive's public Wayback Machine.
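This is not the CommonCrawl crawler engine itself, just a small sketch of the underlying idea: loop over a seed list of Malagasy sites and fetch their pages politely. The URLs, user agent and delay are placeholders.

```python
import time
import requests

SEEDS = ["http://example.mg/", "http://another-example.mg/"]  # from the Malagasy sites list

def crawl_once(seeds, delay_seconds=5):
    """Fetch each seed page once, skipping failures, with a polite delay."""
    pages = {}
    for url in seeds:
        try:
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": "malagasy-corpus-bot (experimental)"})
            pages[url] = resp.text
        except requests.RequestException as err:
            print("skipped", url, err)
        time.sleep(delay_seconds)  # avoid hammering small sites
    return pages
```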
But for that, we'll need some infrastructure resources and operational effort. I hope to engage more people interested in Natural Language Processing in this project.
A priori, the interest should be there, judging from the recent interest in bot development.