How to extract text from HTML using Ruby/Hpricot
I found this while solving my own problem. (That is always the best way to learn.)
Requirement: Extract the text from an HTML body, collapsing the large runs of whitespace between tags and words.
Solution: Use Hpricot to do the magic.
Assumption: Only the HTML body is used here.
One-liner:
    Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
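To see what the whitespace part of the one-liner does on its own, here is a small sketch in plain Ruby with no Hpricot involved. The method name collapse_whitespace and the sample string are my own, made up for illustration; the sample just stands in for whatever inner_text returns. It relies on String#split treating any run of whitespace (spaces, tabs, "\r", "\n") as one separator, which suggests the two gsub calls are not strictly needed.

```ruby
# Hypothetical helper: shows only the whitespace-collapsing step.
def collapse_whitespace(text)
  # split with no argument breaks on runs of whitespace and ignores
  # leading/trailing whitespace; join puts single spaces back in.
  text.split.join(" ")
end

# Made-up sample standing in for the output of Hpricot(html).inner_text.
sample = "  Hello,\r\n\r\n   world!\n\tThis   is  text.  "

puts collapse_whitespace(sample)
# prints: Hello, world! This is text.

# Same result as the chain used in the one-liner.
original = sample.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
puts collapse_whitespace(sample) == original
# prints: true
```

If that holds for your input, the one-liner could be shortened to calling split.join(" ") on the inner_text result.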
Now, if you want to scan through the whole HTML document, starting from the <html> tag, you will have to strip out the script, link, meta, and style tags as well. To do that, just do the following:
    hpricot = Hpricot(html)
    hpricot.search("script").remove
    hpricot.search("link").remove
    hpricot.search("meta").remove
    hpricot.search("style").remove
After that, call inner_text on hpricot and clean up the whitespace the same way as in the one-liner.



