28 May 2017, 19:23

Replacing Eval with Object.send and a self written Parser

Intro

A while ago, I was added as a curator for a Gem called JsonPath. It’s a small but very useful and brilliant gem. It had a couple of problems which I fixed, but the hardest to eliminate proved to be a series of evals throughout the code.

You could opt in using eval with a constructor parameter, but generally, it was considered to be unsafe. Thus, normally when a project was using it, like Huginn they had to opt out by default, thus missing out on sweet parsing like this: $..book[?(@['price'] > 20)].

Eval

In order to remove eval, first I had to understand what it is actually doing. I had to take it apart.

apart

After much digging and understanding the code, I found, all it does is perform the given operations on the current node. And if the operation is true, it will select that node, otherwise, return false, and ignore that node.

For example $..book[?(@['price'] > 20)] could be translated to:

return @_current_node['price'] > 20

Checking first if 'price' is even a key in @_current_node. Once I’ve understood this part, I set on trying to fix eval.

SAFE = 4

In ruby, you could extract the part where you Eval and put it into its own proc and set SAFE = 4 which will disable some things like system calls.

proc do
  SAFE = 4
  eval(some_expression)
end.call

SAFE levels:

$SAFE Description 0 No checking of the use of externally supplied (tainted) data is performed. This is Ruby’s default mode. >= 1 Ruby disallows the use of tainted data by potentially dangerous operations. >= 2 Ruby prohibits the loading of program files from globally writable locations. >= 3 All newly created objects are considered tainted. >= 4 Ruby effectively partitions the running program in two. None - tainted objects may not be modified. Typically, this will be used to create a sandbox: the program sets up an environment using a lower $SAFE level, then resets $SAFE to 4 to prevent subsequent changes to that environment.

This has the disadvantage that anything below 4 is just, meh. But nothing above 1 will actually work with JsonPath so… scratch that.

Sandboxing

We could technically try and sandbox eval into it’s own process with a PID and whitelist methods which are allowed to be called.

Not bad, and there are a few gems out there which are trying to do that like SafeRuby. But all of these project have been abandoned years ago for a good reason.

Object.send

nobodylikesyou

Object.send is the best way to get some flexibility while still being safe. You basically just call methods on objects by describing said method on an object and giving parameters to it, like:

1.send(:+, 2) => 3

This is a very powerful tool in our toolbox which we will exploit immensely.

So let’s get to it.

Writing a parser

Writing a parser in Ruby is a very fluid experience. It has nice tools which support that, and the one I used is StringScanner. It has the ability to track where you are currently at in a string and move a pointer along with regex matches. In fact, JsonPath already employs this method when parsing a json expression. So reusing that logic was in fact… elementary.

The expression

How do we get from this:

$..book[?(@['price'] < 20)]

To this:

@_current_node['price'] < 20

Well. By simple elimination. There are a couple of problems along the way of course. Because this wouldn’t be a parser if it couldn’t handle ALL the other cases…

Removing Clutter

Some of this we don’t need. Like, $..book part.

dontneed1

The other things we don’t need are all the '[]?()

dontneed2

Once this is done, we can move to isolating the important bits.

takingaim

BreakDown

Elements

How does an expression actually look like?

Let’s break it down.

confused

So, this is a handful. Operations can be <=,>=,<,>,==,!= and operands can be either numbers, or words, and element accessor can be nested since something like this is perfectly valid: $..book[?(@.written.year == 1997)].

feedline

To avoid being overwhelmed, ruby has our back with a method called dig.

dig

This, basically lets us pass in some parameters into a dig function on a hash or an array with variadic parameters, which will go on and access those elements in order how they were supplied. Until it either returns a nil or an end result.

For example:

2.3.1 :001 > a = {a: {b: 'c'}}
 => {:a=>{:b=>"c"}}
2.3.1 :002 > a.dig(:a, :b)
 => "c"

Easy. However… Dig was only added after ruby 2.3 thus, I had to write my own dig for now, until I stop supporting anything below 2.3.

At first, I wanted to add it to the hash class, but it proved to be a futile attempt if I wanted to do it nicely, thus the parser got it as a private method.

    def dig(keys, hash)
      return hash unless hash.is_a? Hash
      return nil unless hash.key?(keys.first)
      return hash.fetch(keys.first) if keys.size == 1
      prev = keys.shift
      dig(keys, hash.fetch(prev))
    end

And the corresponding regex behind getting a multitude of elements is as follows:

...
if t = scanner.scan(/\['\w+'\]+/)
...

Operator

Selecting the operator is another interesting part as it can be a single one or multiple and all sorts. Until I realized that no… it can actually be only a couple.

whatone

whattwo

Also, after a bit of fiddling and doing and doing a silly case statement first:

case op
when '>'
  dig(@_current_node, *elements) > operand
when '<'
  dig(@_current_node, *elements) > operand
...
end

…I promptly saw that this is not how it should be done.

And here comes Object.send.

send

This gave me the opportunity to write this:

dig(elements, @_current_node).send(operator, operand)

Much better. Now I could send all the things in the way of a node.

send

Parsing an op be like:

elsif t = scanner.scan(/\s+[<>=][<>=]?\s+?/)

Operand

Now comes the final piece. The value which we are comparing. This could either be a simple integer, a floating number, or a word. Hah. So coming up with a regex which fits this tightly took a little fiddling, but eventually I ended up with this:

elsif t = scanner.scan(/(\s+)?'?(\w+)?[.,]?(\w+)?'?(\s+)?/)

Without StackOverflow I would say this is fine ((although I need to remove all those space check, shees)). What are all the question marks? Basically, everything is optional. Because an this expression $..book[?(@.price)] is valid. Which is basically just asserting if a given node has a price element.

Logical Operators

The last thing that remains is logical operators, which if you are using eval, is pretty straight forward. It takes care of anything that you might add in like &&, ||, |, &, ^ etc etc.

Now, that’s something I did with a case though. Until I find a nicer solution. Since we can already parse a single expression it’s just a question of breaking down a multi structure expression as the following one: $..book[?(@['price'] > 20 && @.written.year == 1998)].

exps = exp.split(/(&&)|(\|\|)/)

This splits up the string by either && or || and the usage of groups () also includes the operators. Than I evaluate the expressions and save the whole thing in an array like [true, '&&', false]. You know what could immediately resolve this? Yep…

saynotoeval.

I’d rather just parse it although technically an eval at this stage wouldn’t be that big of a problem…

def parse(exp)
  exps = exp.split(/(&&)|(\|\|)/)
  ret = parse_exp(exps.shift)
  exps.each_with_index do |item, index|
    case item
    when '&&'
      ret &&= parse_exp(exps[index + 1])
    when '||'
      ret ||= parse_exp(exps[index + 1])
    end
  end
  ret
end

Closing words

That’s it folks. The parser is done. And there is no eval being used. There are some more things here that are interesting. Like, array indexing is allowed in jsonpath which is solved by sending .length to a current node. For example:

if scanner.scan(/\./)
  sym = scanner.scan(/\w+/)
  op = scanner.scan(/./)
  num = scanner.scan(/\d+/)
  return @_current_node.send(sym.to_sym).send(op.to_sym, num.to_i)
end

If an expression begins with a .. So you see that using send will help a lot, and understanding what eval is trying to evaluate and rather writing your own parser, isn’t that hard at all using ruby.

I hope you enjoyed reading this little tid-bit as much as I enjoyed writing and drawing it. Leave a comment if your liked the drawings or if you did not and I should never do them again (( I don’t really care, this is my blog haha. )). Note to self: I shouldn’t draw on the other side of the drawing because of bleed-through.

Thank you! Gergely.

06 Oct 2016, 00:00

RScrap scraper

Intro

Hey folks.

So, there is this project called Huginn which I absolutely love.

But the thing is, that for a couple of scrappers ( at least for me ), I don’t want to spin up a whole rails app.

Hence, I’ve come up with RScrap. Which is a bunch of Ruby scripts run as cron jobs on a raspberry pi. And because I dislike emails as well, and most of the time, I don’t read them, I opted for a nicer solution. Enter the world of Telegram. They provide you with the ability to create bots. You basically get an API key, and than using that key, you can send private messages, or even create an interactive bot which you can send messages too.

In my simple example, I’m using it to send private messages to myself, but I could just as well, make it interactive and than tell it to run one of the scripts.

The Code

Let’s take a look at what we got.

The main scraper

The main scraper, is simply bunch of convenience methods that wrap handling and working with the database and the telegram bot. That’s all. It’s very simple. Very short. The Telegram part is just this bit:

def send_message(text)
  Telegram::Bot::Client.run(@token) do |bot|
    bot.api.send_message(chat_id: @id, text: text)
  end
end

Straightforward. Creating an interactive bot, would look something like this:

#!/usr/bin/env ruby
require 'telegram/bot'

token = 'YOUR_TELEGRAM_BOT_API_TOKEN'

Telegram::Bot::Client.run(token) do |bot|
  bot.listen do |message|
    case message.text
    when '/start'
      bot.api.send_message(chat_id: message.chat.id, text: "Hello, #{message.from.first_name}")
    when '/stop'
      bot.api.send_message(chat_id: message.chat.id, text: "Bye, #{message.from.first_name}")
    end
  end
end

Basically, it will listen, and than you can send it messages and based on the parsed message.text you can define functions to call. For example, for rscrap I could define something like run_script(script). And the command would be: /run reddit. Which will execute my reddit script. The possibilities are endless.

The scripts

The scripts use nokogiri to parse a web page, and than return a URL which will be sent by the TelegramBot. They are also saved in the database so that when a new comic strip comes out, I know that it’s new. For reddit, I’m saving a timestamp as well, and I collect everything after that timestamp through the reddit API as JSON, and send it as a bundled message with shortified links to the posts using bit.ly.

The scraping is most of the times the same for every comic. Thus, there is a helper method for it. The script itself, is very short. For example, lets look at gunnerkrigg court.

require_relative '../rscrap'
require 'nokogiri'
require 'open-uri'

url = 'http://www.gunnerkrigg.com'
scrap = Rscrap.new
page = Nokogiri::HTML(open(url))
comic_id = page.css('img.comic_image')[0].select { |e| e if e[0] == 'src' }[0][1]
new_comic = "#{url}#{comic_id}"
scrap.send_new_comic(url, new_comic)

The interesting part of it is this bit: comic_id = page.css('img.comic_image')[0].select { |e| e if e[0] == 'src' }[0][1]. It extracts the URL for the comic image, and stores it as an “id” of the comic. This than, is sent as a message which Telegram will embed. There is no need to visit the web page, the image is in your feed and you can view it directly. Just like an RSS ready.

Cron

These scripts are best used in a cron job. The comics are usually running with a daily frequency, where as the reddit gatherer is running with an hour frequency. Basically, I’m receiving updates on an hourly basis if there are new posts by then. Running ruby from cron was a bit tricky. I’m using bundler for the environment, and came up with this:

0 6-23 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/reddit.rb'
0 8,22 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/gunnerkrigg.rb'
0 8,22 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/aws_blog.rb'
0 5,23 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/goblinscomic.rb'
0 6,20 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/xkcd.rb'
0 7,19 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/commitstrip.rb'
0 8 * * * /bin/bash -l -c 'cd /home/<youruser>/rubyproj/rscrap && bundle exec ruby scripts/sequiential_art.rb'

And a telegram message for all these things, looks like this: Reddit: TelegramIMReddit Comics: TelegramIMComics

Conclusion

That’s it folks. Adding a new scraper is easy. I added the aws blog as a new entry as well by just copying the comics scripts. And I’m also getting Weather Reports delivered every morning to me.

Have fun. Any questions, please feel free to leave a comment!

Thanks, Gergely.

12 Jul 2016, 00:00

Ruby Sieve

Though it could be done better, I’m sure, but I’m actually pretty satisfied with this one. It loops only twice as opposed to filtered ranges and whatnot other solutions to the sieve. I was thinking of rather creating a list and deleting elements from it, but that’s already three loops.

Maybe I’ll do a benchmark later on more solutions.

# Sieve contains a function to return a set of primes
class Sieve
  def initialize(n)
    @n = n
  end

  # Returns a list of primes up to a certain limit
  # @param n limit
  # @return list of primes
  def primes
    marked = []
    primes = []
    (2..@n).each do |e|
      unless marked.include?(e)
        primes.push e
        (e..@n).step(e) { |s| marked.push s }
      end
    end
    primes
  end
end

Cheers, Gergely.