Liam Healy (lhealy) wrote,
@ 2007-05-15 10:24:00
Entry tags:lisp

Parsing HTML and memory fault from cl-curl
I've needed to parse HTML in Lisp from time to time. The latest reason is some very specific and uncomplicated HTML that I scrape from a published database to get satellite data. A search online turns up XMLS as a likely candidate for this task. I have used it successfully in the past, but recently I find it won't parse its own example or its own supplied HTML documentation, to say nothing of the real HTML I want it to parse. It either returns NIL (signaling a parse error even though the HTML is correct) or returns only the first line of the HTML. There is a thread on comp.lang.lisp about how to parse HTML, and many people there recommend cl-html-parse. I was dissuaded at first because the wiki comments imply that it had been superseded by pxmlutils, whose web page in turn implies that it had been superseded by... XMLS! But cl-html-parse works just fine on the web pages I need to scrape.
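For illustration, here is a minimal sketch of the kind of scraping involved, under some assumptions: that the data sits in table cells, and that the parser hands back LHTML-style nested lists, where an element is either (:tag child ...) or ((:tag :attr "value") child ...). The helper names tag-name, cells-in-node and collect-cells are made up for this example and are not part of any library.

```lisp
;; Sketch only: walks an LHTML-style tree (nested lists with keyword tags)
;; and collects the text of every :td element. The function names here are
;; hypothetical helpers, not part of cl-html-parse.

(defun tag-name (node)
  "Return the tag keyword of an element NODE, or NIL if NODE is not an element."
  (cond ((atom node) nil)
        ;; Element without attributes: (:td "text")
        ((keywordp (car node)) (car node))
        ;; Element with attributes: ((:td :class "x") "text")
        ((and (consp (car node)) (keywordp (caar node))) (caar node))
        (t nil)))

(defun cells-in-node (node)
  "Return a list with one string per :td element found in NODE."
  (cond ((atom node) nil)
        ((eq (tag-name node) :td)
         ;; Join only the direct string children of the cell.
         (list (format nil "~{~a~}" (remove-if-not #'stringp (cdr node)))))
        (t (loop for child in (cdr node)
                 append (cells-in-node child)))))

(defun collect-cells (forms)
  "FORMS is a list of top-level nodes; return the text of all :td cells."
  (loop for node in forms
        append (cells-in-node node)))

;; Assumed usage, with cl-html-parse loaded and PAGE holding the HTML string:
;;   (collect-cells (html-parse:parse-html page))
```

Only direct string children of each cell are joined, so text nested inside further markup within a cell would need an extra recursive step.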


So, success. But I am grabbing the web page with cl-curl, which works most of the time but gives "memory fault" for one particular query, I think because that query returns a lot of data. And the author/maintainer of cl-curl is... me! D'Oh. It would be nice to have cl-curl use CFFI instead of UFFI, maybe based on the pedagogical development given in the CFFI tutorial. My hope is that this would at least solve the memory fault. Any interest/volunteers/motivators?





Why not use DRAKMA ?
(Anonymous)
2007-05-15 07:59 pm UTC (link)
Why not use a pure CL lib such as DRAKMA (http://weitz.de/drakma) instead of bindings to a fairly complex C library like curl? In this case perhaps the problem is in the bindings, or perhaps you've found a bug in curl; in either case debugging it would be IMHO not fun at all.


Re: Why not use DRAKMA ?
[info]lhealy
2007-05-15 09:07 pm UTC (link)
I didn't know about this, thanks for the heads-up. It does look like SBCL support (my implementation of choice) is a bit iffy, but it's certainly worth exploring.


Re: Why not use DRAKMA ?
(Anonymous)
2007-05-16 02:06 pm UTC (link)
I've used SBCL with DRAKMA to write ediware (http://common-lisp.net/~loliveira/ediware/) with no problems whatsoever; it's very good. Don't waste your time with CURL and FFI, really. :-)


Re: Why not use DRAKMA ?
[info]lhealy
2007-05-16 02:20 pm UTC (link)
Yes, I just tried DRAKMA for my application in SBCL, and I had no problems. I'll blog on it soon.


curl cffi bindings
(Anonymous)
2007-05-15 08:02 pm UTC (link)
You may want to take a look at verrazano at http://common-lisp.net/project/fetter/

I've just pushed some pre-generated CFFI bindings for libcurl (generated from libcurl3-gnutls-dev on Ubuntu; I'm not sure that matters).

Feel free to write to the devel list if the generated binding is not OK as-is and you need help with verrazano.

hth,

- attila


Re: curl cffi bindings
[info]lhealy
2007-05-15 09:08 pm UTC (link)
Thanks for the info. I'm interested in hearing what progress you make with it.


An alternative
(Anonymous)
2007-05-16 06:21 am UTC (link)
http://www.cliki.net/trivial-html-parser ?


Re: An alternative
[info]lhealy
2007-05-16 02:21 pm UTC (link)
Looks interesting, but since cl-html-parse seems to be working acceptably for this application, I'll stick with it for the time being. Thanks for the tip though.


I have a cffi implementation for curl!
(Anonymous)
2007-05-16 09:10 pm UTC (link)
I'd love to contribute it to your project. If you are interested, I can be reached at atsmyles at earthlink dot net. It also has callback support!

Arthur


Re: I have a cffi implementation for curl!
[info]lhealy
2007-05-17 09:00 pm UTC (link)
It appears that sebthecat (see other comment) has done something similar. I am now moving my code over to drakma, but perhaps you and sebthecat can coordinate and make a published project if it provides something that drakma doesn't. If you'd like to use the name cl-curl I can help.



[info]sebthecat
2007-05-17 05:11 am UTC (link)
Well, I did implement lisp-curl, which was curl accessed via CFFI. It's not published so far, and it's a classic case of stopping once it did what I needed, but you'd be welcome to it if you're interested.


lisp-curl
[info]lhealy
2007-05-17 08:58 pm UTC (link)
If you think it will be useful to others, perhaps you should publish it. Maybe you should get together with Arthur (see other comment)? For my purposes, drakma now replaces most of what cl-curl did, and I'm working on the rest.


