Parse String to create objects using Regex in Scala

I have an input list of strings that I'd like to translate into a list of objects using regex. For simplicity, the code below prints the parsed values to stdout instead of creating the objects.

I can match some of the input strings but not the entire list. Can someone tell me what I am doing wrong?

  lazy val TIMESTAMP_PATTERN: Regex = """(year|month|day|hour)\(([a-zA-Z_]+)[,]?([a-zA-Z_]*)\)""".r
  lazy val BUCKET_PATTERN: Regex = """(bucket)\((.+)(,)(.+)[,]?(.*)\)""".r

  Seq(
    "year(timestamp)",
    "year(timestamp, _MY_YEAR)",
    "month(timestamp)",
    "month(timestamp, _MY_MONTH)",
    "day(timestamp)",
    "day(timestamp, _MY_DAY)",
    "hour(timestamp)",
    "hour(timestamp, _MY_HOUR)",
    "bucket(id, 32)",
    "bucket(id, 32, _MY_BUCKET)",
  ).foreach { input => input match {
      case TIMESTAMP_PATTERN(transform, sourceColumn, targetColumn) => println(s"$transform ::: $sourceColumn :::- $targetColumn")
      case BUCKET_PATTERN(sourceColumn, numBuckets) => println(s"bucket ::: $sourceColumn ::: $numBuckets")
      case BUCKET_PATTERN(sourceColumn, numBuckets, targetColumn) => println(s"bucket ::: $sourceColumn ::: $numBuckets ::: $targetColumn")
      case z => println(s"Unexpected match: $z")
    }
  }

Output

year ::: timestamp :::- 
Unexpected match: year(timestamp, _MY_YEAR)
month ::: timestamp :::- 
Unexpected match: month(timestamp, _MY_MONTH)
day ::: timestamp :::- 
Unexpected match: day(timestamp, _MY_DAY)
hour ::: timestamp :::- 
Unexpected match: hour(timestamp, _MY_HOUR)
Unexpected match: bucket(id, 32)
Unexpected match: bucket(id, 32, _MY_BUCKET)

I made a few fixes to your regexes and to the pattern match:

lazy val TIMESTAMP_PATTERN: Regex = """(year|month|day|hour)\((\w+)(?:,\s+)?(\w*)\)""".r
lazy val BUCKET_PATTERN: Regex = """bucket\((\w+),(?:\s+)?(\w+)(?:,\s+)?(\w*)\)""".r

  Seq(
    "year(timestamp)",
    "year(timestamp, _MY_YEAR)",
    "month(timestamp)",
    "month(timestamp, _MY_MONTH)",
    "day(timestamp)",
    "day(timestamp, _MY_DAY)",
    "hour(timestamp)",
    "hour(timestamp, _MY_HOUR)",
    "bucket(id, 32)",
    "bucket(id, 32, _MY_BUCKET)",
  ).foreach {
    case TIMESTAMP_PATTERN(transform, sourceColumn, "") => println(s"$transform ::: $sourceColumn")
    case TIMESTAMP_PATTERN(transform, sourceColumn, targetColumn) => println(s"$transform ::: $sourceColumn :::- $targetColumn")
    case BUCKET_PATTERN(sourceColumn, numBuckets, "") => println(s"bucket ::: $sourceColumn ::: $numBuckets")
    case BUCKET_PATTERN(sourceColumn, numBuckets, targetColumn) => println(s"bucket ::: $sourceColumn ::: $numBuckets ::: $targetColumn")
    case z => println(s"Unexpected match: $z")
  }

The output now is:

year ::: timestamp
year ::: timestamp :::- _MY_YEAR
month ::: timestamp
month ::: timestamp :::- _MY_MONTH
day ::: timestamp
day ::: timestamp :::- _MY_DAY
hour ::: timestamp
hour ::: timestamp :::- _MY_HOUR
bucket ::: id ::: 32
bucket ::: id ::: 32 ::: _MY_BUCKET
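
To see why the original patterns produced "Unexpected match", it can help to call the extractor directly. The following is a minimal sketch, assuming Scala 2.12 or later; the object and value names are made up for illustration, and the two regexes are copied unchanged from the question.

```scala
import scala.util.matching.Regex

object OriginalPatternCheck {
  // The two patterns exactly as posted in the question.
  val originalTimestamp: Regex =
    """(year|month|day|hour)\(([a-zA-Z_]+)[,]?([a-zA-Z_]*)\)""".r
  val originalBucket: Regex =
    """(bucket)\((.+)(,)(.+)[,]?(.*)\)""".r

  // [,]? consumes the comma, but nothing in the pattern accepts the
  // space that follows it, so the whole-string match fails.
  val timestampWithTarget: Option[List[String]] =
    originalTimestamp.unapplySeq("year(timestamp, _MY_YEAR)")

  // The regex itself matches, but it has five capture groups, so a
  // case with two or three binders cannot succeed against it.
  val bucketGroups: Option[List[String]] =
    originalBucket.unapplySeq("bucket(id, 32)")
}
```

Here timestampWithTarget is empty, while bucketGroups holds a list of five strings, which is more than either bucket case in the question binds.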

Here are the changes I made:

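  • Wrapped the optional comma and whitespace in non-capturing groups, such as (?:,\s+)?. The original [,]? allowed the comma but not the space after it, which is why the inputs with a target column failed. Because these groups are non-capturing, they do not add to the values the pattern extracts.
  • Removed the parentheses around bucket so that it is no longer a capture group. A case must bind exactly as many values as the regex has capture groups; the original BUCKET_PATTERN had five, so neither the two-value nor the three-value case could match.
  • Since the last capture group is optional, it is the empty string when the target column is absent. The shorter cases therefore match "" in the last position and are listed before the general cases.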
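
Since the stated goal is to build objects rather than print, here is one possible sketch of that step. The type names (Transform, TimeTransform, BucketTransform) and the TransformParser object are hypothetical, not taken from the question. The patterns also differ slightly from the ones above: the target column sits inside an optional non-capturing group, so its capture is null when absent and can be wrapped in Option, and \s* is used so a missing space after the comma is accepted too.

```scala
import scala.util.Try
import scala.util.matching.Regex

// Hypothetical model types; the names are illustrative only.
sealed trait Transform
final case class TimeTransform(unit: String, source: String, target: Option[String])
    extends Transform
final case class BucketTransform(source: String, numBuckets: Int, target: Option[String])
    extends Transform

object TransformParser {
  // The optional target lives inside (?:,\s*(\w+))?, so group 3 is null
  // (not "") when the input has no target column.
  private val TimestampPattern: Regex =
    """(year|month|day|hour)\((\w+)(?:,\s*(\w+))?\)""".r
  private val BucketPattern: Regex =
    """bucket\((\w+),\s*(\d+)(?:,\s*(\w+))?\)""".r

  // Returns None for anything that neither pattern accepts.
  def parse(input: String): Option[Transform] = input match {
    case TimestampPattern(unit, source, target) =>
      Some(TimeTransform(unit, source, Option(target)))
    case BucketPattern(source, n, target) =>
      // \d+ guarantees digits, but toInt can still overflow on long runs.
      Try(n.toInt).toOption.map(BucketTransform(source, _, Option(target)))
    case _ => None
  }
}
```

With this sketch, TransformParser.parse("bucket(id, 32, _MY_BUCKET)") gives Some(BucketTransform("id", 32, Some("_MY_BUCKET"))), and an unsupported input such as "week(timestamp)" gives None. Using Option for the target avoids the need for separate cases that match "".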