
Python Lambda and Regex – A good team for replacing a string using dictionaries

The aim of this post is to show you a specific and useful Python tip: replacing parts of a string with the matching values from a dictionary. The task may sound trivial, but in practice it can be an interesting challenge.

Our scenario is a string containing several words that match keys in our dictionary. We could replace them one at a time in a loop, and that solution is still valid, but it is not the objective of this post.

Here is an example with the string variable and the dictionary:

request = "select column1,column2 from mytable limit batchsize;"

parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}

# The expected result after the matching/replacing is:
# select first_name,last_name from actor limit 10;

As you can see, the keys of the dictionary match words in the request variable, and each one must be replaced with its value. In common scenarios the dictionary is not limited to a few keys, so the approach in this article focuses on producing accurate results even when we have to deal with hundreds of items.

We need to start with a short explanation of regular expressions. They do not belong only to the Python world; in fact, we can find them in many modern programming languages. Nonetheless, I want to cite the Python documentation for its definition:
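
For comparison, the loop-based solution mentioned earlier can be sketched as follows. This is only a minimal illustration: the helper name replace_with_loop and the swap dictionary are hypothetical and not part of the original scenario. It works for our request, but it also shows one case where chained replacements give a surprising result.

```python
# Loop-based replacement: one str.replace call per dictionary entry.
def replace_with_loop(text, mapping):
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

request = "select column1,column2 from mytable limit batchsize;"
parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}

loop_result = replace_with_loop(request, parameters)
print(loop_result)
# Output: select first_name,last_name from actor limit 10;

# Hypothetical mapping where a replacement value is also a key:
# the second pass rewrites text that the first pass already replaced.
swap = {"a": "b", "b": "a"}
swap_result = replace_with_loop("a-b", swap)
print(swap_result)
# Output: a-a
```

A single-pass regex substitution does not rescan the text it has just inserted, which is one reason to prefer it when the dictionary grows.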

“A regular expression (or RE) specifies a set of strings that matches it;”. Source: https://docs.python.org/3/library/re.html

A regular expression (shortened as regex or regexp) also allows us to define a search pattern, which is vital for the use case explained above. The first step consists of building a new dictionary in which the special symbols in each key are escaped with a backslash. This avoids conflicts, because characters such as ".", "(" or "|" have a special meaning inside a regular expression and would otherwise not be matched literally.

More details in this link: https://bit.ly/2KRb9Ew

In this case, we build the new dictionary by passing a generator expression to dict() and applying the regular expression escape function to each key. Once that is done, we compile a regex for later use. The regex is composed of all the keys from the formatted dictionary, joined with the symbol “|”. Explaining the benefits of compiling a regex is out of the scope of this post, but you can find interesting articles about it.

Let me show you the following piece of code, which covers the new dictionary and the creation of the compiled regex object:
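
To see why escaping matters, here is a small sketch. It assumes a hypothetical key, "col.1", that contains a regex metacharacter, and it uses the standard library re module, whose escape function plays the same role as the one used in this post.

```python
import re  # standard library; used here instead of the third-party regex package

# Hypothetical key containing a dot, which is a regex metacharacter.
key = "col.1"

escaped_key = re.escape(key)
print(escaped_key)
# Output: col\.1

# Unescaped, the dot matches any character, so "colX1" is a false positive.
unescaped_hit = re.search(key, "select colX1") is not None
escaped_hit = re.search(escaped_key, "select colX1") is not None
print(unescaped_hit, escaped_hit)
# Output: True False
```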

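# "regex" is the third-party package (pip install regex); the standard
# library module "re" provides the same escape, compile and sub calls.
import regex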

request = "select column1,column2 from mytable limit batchsize;"

parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}

formatted_parameters = dict((regex.escape(k), v) for k, v in parameters.items())
print(formatted_parameters)
# Output: {'column1': 'first_name', 'column2': 'last_name', 'mytable': 'actor', 'batchsize': '10'}

pattern = regex.compile("|".join(formatted_parameters.keys()))
print(pattern)
# Output: 
# regex.Regex('column1|column2|mytable|batchsize', flags=regex.V0)
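
As a quick check of what the compiled pattern actually matches, the sketch below lists every hit in the request. It uses the standard library re module as a stand-in, and the keys list simply repeats the dictionary keys from above.

```python
import re  # standard library stand-in for the third-party regex package

request = "select column1,column2 from mytable limit batchsize;"
keys = ["column1", "column2", "mytable", "batchsize"]

pattern = re.compile("|".join(re.escape(k) for k in keys))

# findall returns every non-overlapping match, scanning left to right.
found = pattern.findall(request)
print(found)
# Output: ['column1', 'column2', 'mytable', 'batchsize']
```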

You may be asking why a compiled regex object was required. Before explaining the main reason, I want to introduce the part of the regular expression module that we will be using: regex.sub. Here is the syntax, as written for the standard re module:

re.sub(<regex>, <repl>, <string>, count=0, flags=0)

This function returns a new string that results from performing replacements on the search string. For more information, visit this link: https://docs.python.org/3/library/re.html

As the documentation mentions, we can pass a function as <repl>, and regex sub will then call this function for each match found. So instead of passing a named function, this is where we can use regex sub together with a lambda expression, and I think this kind of scenario is a good opportunity to use one.

Returning to the earlier question of why we use a compiled regex object: one of the best answers is that we can reuse the compiled expression, and it exposes its own sub method. This gives us, in one simple line, the power of combining a lambda function, regex sub and a dictionary to get the desired result.

Let me add the next statement before showing the final script:
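
The following sketch compares a named function and a lambda used as <repl>. The units dictionary and the expand function are hypothetical names chosen only for this illustration, and the standard library re module is used.

```python
import re

# Hypothetical lookup table used only for this illustration.
units = {"km": "kilometers", "kg": "kilograms"}

def expand(match):
    # re.sub passes a match object; group(0) is the matched text.
    return units[match.group(0)]

text = "5 km and 3 kg"
with_function = re.sub(r"km|kg", expand, text)
with_lambda = re.sub(r"km|kg", lambda m: units[m.group(0)], text)

print(with_function)
# Output: 5 kilometers and 3 kilograms
print(with_function == with_lambda)
# Output: True
```

Both forms behave the same; the lambda only saves us from naming a one-line function.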

request = pattern.sub(lambda m: formatted_parameters[regex.escape(m.group(0))], request)

What is new in the previous statement? Well, probably the first thing you are asking about is the word lambda. It is a keyword that tells Python you are defining a lambda function, or lambda expression. In simple words, it is a shortcut for creating anonymous functions, and it yields a function object. In reality, nothing forces you to use lambda; it is only a syntactically compact way of defining a function, and in many cases its use is even discouraged. But the example in this article is probably one of the few interesting uses (in my humble opinion) that deserves attention.

Before continuing, I encourage you to read this helpful article about lambda: https://realpython.com/python-lambda/#first-example

At this point, our lambda expression defines a bound variable, in this case m, followed immediately by the body of the function. Remember, a lambda is at the end of the day an anonymous function. Here is the interesting behavior of this code: the body looks up the matched text in formatted_parameters, the dictionary we created whose keys are the escaped words to match, and returns the corresponding value to be placed into the request. This is where regex.sub and the compiled object help us to achieve the result compactly.

Remember that regex.sub is able to call a function (in our case the lambda function) for every match. It passes the match object as m, and m.group(0) returns the exact text that was matched. In practice, every match in the request string is replaced with the value stored under the corresponding key in the formatted_parameters dictionary.

Here is the final version of our code:
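
To make the per-match calls visible, the sketch below swaps the lambda for a named function that records what it receives. The names lookup and calls are hypothetical, and the standard library re module is used. Because m.group(0) is the literal matched text, this variant looks the value up in the original parameters dictionary directly, which is one possible alternative to escaping the match again.

```python
import re

request = "select column1,column2 from mytable limit batchsize;"
parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}
pattern = re.compile("|".join(re.escape(k) for k in parameters))

calls = []

def lookup(m):
    # Record the matched text and where it starts in the original string.
    calls.append((m.group(0), m.start()))
    return parameters[m.group(0)]

result = pattern.sub(lookup, request)
print(calls)
# Output: [('column1', 7), ('column2', 15), ('mytable', 28), ('batchsize', 42)]
```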

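# "regex" is the third-party package (pip install regex); the standard
# library module "re" provides the same escape, compile and sub calls.
import regex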

request = "select column1,column2 from mytable limit batchsize;"

parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}

formatted_parameters = dict((regex.escape(k), v) for k, v in parameters.items())

pattern = regex.compile("|".join(formatted_parameters.keys()))
request = pattern.sub(lambda m: formatted_parameters[regex.escape(m.group(0))], request)
print(request)
# Output: select first_name,last_name from actor limit 10;
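
If you need this in several places, the same idea can be wrapped in a small helper. The sketch below is only one possible shape, not a definitive implementation: the function name replace_from_dict and the overlapping dictionary are hypothetical, and it uses the standard library re module. It also sorts the keys longest first, an extra precaution for the case where one key is a prefix of another, because the alternatives in a regex are tried from left to right.

```python
import re

def replace_from_dict(text, mapping):
    """Replace every key of mapping found in text with its value, in one pass."""
    if not mapping:
        return text  # an empty pattern would match the empty string everywhere
    # Longest keys first so that "column10" is tried before "column1".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # m.group(0) is the literal matched text, so it indexes mapping directly.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

request = "select column1,column2 from mytable limit batchsize;"
parameters = {"column1": "first_name", "column2": "last_name",
              "mytable": "actor",
              "batchsize": "10"}

final = replace_from_dict(request, parameters)
print(final)
# Output: select first_name,last_name from actor limit 10;

# Hypothetical keys where one is a prefix of the other.
overlapping = {"column1": "first_name", "column10": "created_at"}
overlap_result = replace_from_dict("select column10 from t;", overlapping)
print(overlap_result)
# Output: select created_at from t;
```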

I hope this simple trick will be useful to you, and remember that every day is a great opportunity to learn new things. Happy Coding!!

geohernandez
