Rajesh Shetty

How to extract text from HTML using Ruby/Hpricot

Posted to blog using Gmail on 04/25/2008 at 04:12 PM
I found this while solving a problem of my own (which is always the best way to learn).

Requirement: Extract the text from an HTML body, collapsing the large runs of whitespace between tags and words.

Solution: Use Hpricot to do the magic.

Assumption: Only the HTML body is used here.

One-liner: Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
 

  1. The line above gets the inner text (a very convenient method for getting the actual meat out of the HTML).
  2. It replaces carriage returns and line feeds with spaces.
  3. It does a split/join, which collapses multiple spaces between tags and words down to a single space.
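
Steps 2 and 3 can be tried on their own, without Hpricot. The sketch below uses a plain string as a stand-in for the result of inner_text; the method name squeeze_whitespace and the sample text are made up for illustration.

```ruby
# Hypothetical helper name; the body is the whitespace part of the one-liner.
def squeeze_whitespace(text)
  text.gsub("\r", " ")   # step 2: carriage returns become spaces
      .gsub("\n", " ")   # step 2: line feeds become spaces
      .split(" ")        # step 3: split on runs of whitespace
      .join(" ")         # step 3: rejoin with exactly one space
end

# Sample text standing in for Hpricot(html).inner_text (assumed input).
raw_text = "Hello,\r\n\r\n     world!\n\t  This   is    text.  "

clean = squeeze_whitespace(raw_text)
puts clean
# prints: Hello, world! This is text.
```

Note that split(" ") with a single-space argument splits on any run of whitespace and ignores leading and trailing whitespace, which is why the result has no stray spaces at either end.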

If you want to scan through the whole HTML document, starting from the <html> tag, you will also have to strip out the script, link, meta and style tags. To do that, do the following:

hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove
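
Putting the two pieces together, a minimal sketch might look like this. The names NON_TEXT_TAGS and visible_text are hypothetical and not part of Hpricot. The helper only assumes that its argument responds to search and inner_text the way the parsed document above does, and that search returns something with a remove method. Loading the gem is shown as a comment because it depends on your setup.

```ruby
# Hypothetical helper (not part of Hpricot): strips tags that carry no
# visible text from an already parsed document, then returns the remaining
# text with whitespace collapsed. Note that it modifies the document passed in.
NON_TEXT_TAGS = %w[script link meta style].freeze

def visible_text(doc)
  # Same removal as the four lines above, one tag name at a time.
  NON_TEXT_TAGS.each { |tag| doc.search(tag).remove }
  # split(" ") splits on any run of whitespace, including \r and \n,
  # so a single split/join does the work of the one-liner's gsub calls.
  doc.inner_text.split(" ").join(" ")
end

# Assumed usage, with the hpricot gem installed:
#   require 'rubygems'
#   require 'hpricot'
#   puts visible_text(Hpricot(html))
```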






