AlphaRegex: automatic regular expression synthesizer (poster: prl.korea.ac.kr/~pronto/home/posters/regex-synthesis.pdf)

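Sunbeom So (소순범), Division of Computer and Communication Engineering, College of Information and Communication, Korea University

Mina Lee (이민아), Department of Computer Science, College of Informatics, Korea University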

1. Motivation

2. Problem and Goal

3. Regular Expression Synthesis Algorithm

4. Experiments

| Problem | Examples: P | Examples: N | Synthesized regular expression | Basic algorithm (s) | Our algorithm (s) | Speedup |
|---|---|---|---|---|---|---|
| The 5th symbol from the right of w is 1. | 3 | 3 | (0+1)*1(0+1)(0+1)(0+1)(0+1) | 148.0 | 8.2 | 18x |
| w has at most two 0s. | 8 | 7 | 1*0?1*0?1* | 425.0 | 1.2 | 354x |
| 0s and 1s alternate in w. | 10 | 11 | 0?(10)*1? | 4073.9 | 1.6 | 2546x |
| The number of 0s in w is divisible by 3. | 8 | 7 | (1+01*01*0)* | > 7200.0 | 5.9 | n/a |
| If w starts with 0 it has odd length; if it starts with 1 it has even length. | 5 | 3 | (0+1(0+1))((0+1)(0+1))* | > 7200.0 | 10.9 | n/a |
| w has at least one 0 and at most one 1. | 12 | 10 | 0*(01?+100*) | > 7200.0 | 7.5 | n/a |
| w has at most one pair of consecutive 1s. | 9 | 8 | (1+(01?)*)(0+10*) | 465.1 | 24.4 | 19x |

Experimental environment: MacBook Pro / OS X El Capitan 10.11.1 / 2.2 GHz Intel Core i7 / 16GB 1600 MHz DDR3

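5. Conclusion

✓ Quickly synthesizes, from a small number of examples, regular expressions that even people find hard to write

✓ Presents a range of search techniques for exploring the state space efficiently

✓ Demonstrates performance on difficult problems that appear in actual theory-of-computation textbooks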

Programming Research Laboratory, Korea University. Advisor: Prof. Hakjoo Oh (오학주)

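For Σ = {0, 1}, find a regular expression for the following language:
L = {w ∈ {0, 1}* | w contains exactly one pair of consecutive 0s}.

Positive examples: 00, 1001, 0101001010, 1111001111

Negative examples: 01, 11, 000, 00100

Motivation, while taking a theory-of-computation course: could regular expression synthesis be automated?

Goal: be smarter than the students and the professor of the theory-of-computation course!

AlphaRegex, the automatic regular expression synthesizer, takes the positive and negative examples above and automatically synthesizes a regular expression that satisfies them:

(0?1)*00(10?)* (in 0.5s)

Basic algorithm: explore the entire state space generated by the regular expression grammar.

Challenge: the state space is very large, as measured by the number of states at depth d.

Solution: devise an efficient algorithm for searching that space.

Efficient state-space pruning:

1. Simpler regular expressions first (Best-first Enumerative Search)
2. Pruning states with the same meaning (Semantically-Equivalent States)
3. Pruning states that cannot lead to a solution (Dead States)
4. Pruning unnecessary states (Redundant States)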
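The synthesized expression for the "exactly one pair of consecutive 0s" problem can be checked against the poster's examples with a few lines of Python. This is only an illustrative check, not part of AlphaRegex: the names `SYNTHESIZED`, `accepts`, and `check_examples` are made up here, and the expression is used as a Python pattern as-is because it contains only concatenation, `?`, and `*`.

```python
import re

# The expression reported on the poster, written as a Python pattern.
SYNTHESIZED = re.compile(r"(0?1)*00(10?)*")

POSITIVE = ["00", "1001", "0101001010", "1111001111"]
NEGATIVE = ["01", "11", "000", "00100"]


def accepts(word: str) -> bool:
    """True if the whole word belongs to the language of the expression."""
    return SYNTHESIZED.fullmatch(word) is not None


def check_examples(positive, negative) -> bool:
    """True if every positive example is accepted and every negative one rejected."""
    return all(accepts(p) for p in positive) and not any(accepts(n) for n in negative)


if __name__ == "__main__":
    print(check_examples(POSITIVE, NEGATIVE))
```

`fullmatch` is used rather than `search` because a regular expression problem asks about membership of the whole string, not of a substring.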

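Figure 1. Exhaustive Search. The search tree is rooted at the initial state $\Box$. Its first level contains $a$, $\epsilon$, $\emptyset$, $\Box + \Box$, $\Box \cdot \Box$, and $\Box^*$. Deeper levels contain states such as $a + a$, $a + \epsilon$, $a + \emptyset$, $a + (\Box + \Box)$, $a + (\Box \cdot \Box)$, $a + (\Box^*)$, $\epsilon + a$, $\epsilon + \epsilon$, $\epsilon + \emptyset$, $\epsilon + (\Box + \Box)$, and then $a + (a + a)$, $a + (a + \epsilon)$, $a + (a + \emptyset)$, and so on.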

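$$\frac{e_1 \to e_1'}{e_1 + e_2 \to e_1' + e_2} \qquad \frac{e_2 \to e_2'}{e_1 + e_2 \to e_1 + e_2'}$$

$$\frac{e_1 \to e_1'}{e_1 \cdot e_2 \to e_1' \cdot e_2} \qquad \frac{e_2 \to e_2'}{e_1 \cdot e_2 \to e_1 \cdot e_2'}$$

$$\frac{e \to e'}{e^* \to e'^*} \qquad \frac{e \to e'}{e? \to e'?}$$

$$\frac{a \in \Sigma}{\Box \to a} \qquad \Box \to \epsilon \qquad \Box \to \emptyset$$

$$\Box \to \Box + \Box \qquad \Box \to \Box \cdot \Box \qquad \Box \to \Box^* \qquad \Box \to \Box?$$

Figure 2. Transition Relation between States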

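Size of the search space. With $c$ inductive rules for regular expressions (e.g. $c = 7$), the number of states at depth $d$ in the worst case is

$$N(0) = 1, \qquad N(d+1) = N(d) \cdot c^{2^d}.$$

When $c = 7$:

$$N(d) = 7^{\sum_{k=0}^{d-1} 2^k} = 7^{2^d - 1} \in O\!\left(7^{2^d - 1}\right).$$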
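To get a feel for how fast this bound grows, the recurrence can be evaluated directly and compared with the closed form. The function names `num_states` and `closed_form` are chosen here for illustration; the code only restates the two formulas above.

```python
def num_states(d: int, c: int = 7) -> int:
    """Worst-case number of states at depth d: N(0) = 1, N(d+1) = N(d) * c**(2**d)."""
    n = 1
    for k in range(d):
        n *= c ** (2 ** k)
    return n


def closed_form(d: int, c: int = 7) -> int:
    """Closed form c**(2**d - 1), using sum_{k<d} 2**k = 2**d - 1."""
    return c ** (2 ** d - 1)


if __name__ == "__main__":
    for depth in range(5):
        print(depth, num_states(depth))
```

The exponent doubles with every level, so the bound is doubly exponential in the depth, which is why exhaustive search alone does not scale.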

Search Strategy. We pick a state that has a minimal cost, where the cost of states is defined as follows:

$$\begin{aligned}
C(a) = C(\epsilon) = C(\emptyset) &= c_1 \\
C(\Box) &= c_2 \quad (c_2 > c_1) \\
C(e_1 + e_2) &> C(e_1) + C(e_2) \\
C(e_1 \cdot e_2) &> C(e_1) + C(e_2) \\
C(e^*) &> C(e)
\end{aligned}$$

Intuitively, we prefer simpler expressions, following the principle of Ockham's razor, so that the solution found is the simplest regular expression that is consistent with the examples.

Algorithm 1 Search Algorithm
Input: Positive and negative examples $(P, N)$
Output: A regular expression $E$ consistent with $(P, N)$

1: $W := \{\Box\}$
2: repeat
3:   pick and remove $s$ from $W$
4:   if $\mathit{solution}(s)$ then return $s$
5:   else
6:     $W := W \cup \mathit{next}(s)$
7:   end if
8: until $W = \emptyset$
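Algorithm 1, combined with the minimal-cost selection of the search strategy, can be sketched in Python as a best-first worklist search. This is a minimal sketch, not the authors' OCaml implementation. The tuple encoding of states, the names `next_states`, `cost`, `matches`, and `synthesize`, the `limit` parameter, and the derivative-based matcher are all assumptions made for illustration. The cost constants (1 per leaf, 10 per hole, +5 per operator) are the concrete values shown on the poster, and the `?` rule of Figure 2 is omitted to keep the sketch short.

```python
import heapq
from itertools import count

HOLE, EPS, EMPTY = ("hole",), ("eps",), ("empty",)


def next_states(s, alphabet):
    """All states obtained from s by instantiating exactly one hole."""
    tag = s[0]
    if tag == "hole":
        leaves = [("sym", a) for a in alphabet] + [EPS, EMPTY]
        return leaves + [("alt", HOLE, HOLE), ("cat", HOLE, HOLE), ("star", HOLE)]
    if tag in ("alt", "cat"):
        return ([(tag, t, s[2]) for t in next_states(s[1], alphabet)] +
                [(tag, s[1], t) for t in next_states(s[2], alphabet)])
    if tag == "star":
        return [("star", t) for t in next_states(s[1], alphabet)]
    return []  # symbols, eps and empty contain no hole


def is_closed(s):
    return s[0] != "hole" and all(is_closed(t) for t in s[1:] if isinstance(t, tuple))


def cost(s):
    tag = s[0]
    if tag == "hole":
        return 10
    if tag in ("alt", "cat"):
        return cost(s[1]) + cost(s[2]) + 5
    if tag == "star":
        return cost(s[1]) + 5
    return 1  # symbol, eps, empty


def nullable(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    if tag == "cat":
        return nullable(e[1]) and nullable(e[2])
    return False  # symbol, empty


def deriv(e, c):
    """Brzozowski derivative of the closed expression e with respect to symbol c."""
    tag = e[0]
    if tag == "sym":
        return EPS if e[1] == c else EMPTY
    if tag == "alt":
        return ("alt", deriv(e[1], c), deriv(e[2], c))
    if tag == "cat":
        first = ("cat", deriv(e[1], c), e[2])
        return ("alt", first, deriv(e[2], c)) if nullable(e[1]) else first
    if tag == "star":
        return ("cat", deriv(e[1], c), e)
    return EMPTY  # eps, empty


def matches(e, word):
    for c in word:
        e = deriv(e, c)
    return nullable(e)


def synthesize(positive, negative, alphabet="ab", limit=100_000):
    """Best-first search from a single hole; returns a consistent closed state or None."""
    tie = count()  # tie-breaker so the heap never compares states
    work = [(cost(HOLE), next(tie), HOLE)]
    seen = {HOLE}
    for _ in range(limit):
        if not work:
            return None
        _, _, s = heapq.heappop(work)
        if is_closed(s):
            if all(matches(s, p) for p in positive) and not any(matches(s, n) for n in negative):
                return s
            continue
        for t in next_states(s, alphabet):
            if t not in seen:
                seen.add(t)
                heapq.heappush(work, (cost(t), next(tie), t))
    return None


if __name__ == "__main__":
    print(synthesize(["a", "aa", "aaa"], ["", "b", "ab"]))
```

The `seen` set and the iteration `limit` are conveniences of this sketch; they keep the worklist free of duplicates and guarantee termination when no solution is found within the budget.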

3.2 Normalization

Examples:

$$[\![s^* s^*]\!] = [\![s^*]\!]$$

$$[\![(s + s)]\!] = [\![s]\!]$$

$$[\![(s \cdot s^*)^*]\!] = [\![s^*]\!]$$

...
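The three example equivalences can be turned into left-to-right rewrite rules. The sketch below applies them bottom-up; the tuple encoding of expressions and the function name `normalize` are assumptions for illustration, and only these three rules are implemented, not whatever fuller rule set the tool may use.

```python
def normalize(s):
    """Rewrite s bottom-up with the three example rules; anything else is unchanged."""
    tag = s[0]
    if tag in ("alt", "cat"):
        left, right = normalize(s[1]), normalize(s[2])
        if tag == "alt" and left == right:
            return left                              # s + s      ->  s
        if tag == "cat" and left == right and left[0] == "star":
            return left                              # s* . s*    ->  s*
        return (tag, left, right)
    if tag == "star":
        body = normalize(s[1])
        if body[0] == "cat" and body[2] == ("star", body[1]):
            return ("star", body[1])                 # (s . s*)*  ->  s*
        return ("star", body)
    return s  # symbols, eps, empty, hole


if __name__ == "__main__":
    a = ("sym", "a")
    print(normalize(("star", ("cat", ("alt", a, a), ("star", a)))))
```

Normalizing before inserting a state into the worklist lets syntactically different but equivalent states collapse into one entry.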

3.3 Pruning Search Space

Definition 1 (Dead States). Let $(P, N)$ be a regular expression problem. We say a state $s \in S$ is dead, denoted $\mathit{dead}(s)$, iff every closed state $s'$ reachable from $s$ is not a solution:

$$\mathit{dead}(s) \iff \big( (s \to^* s') \land s' \not\to \implies \neg\,\mathit{solution}(s') \big).$$

Intuitively, a state $s$ is dead if exploring the states reachable from $s$ any further is guaranteed to fail to find a solution. Our search algorithm aims to identify as many dead states as possible and does not attempt to explore beyond them. Specifically, we identify two types of dead states: pdead and ndead.

Definition 2. A state $s$ is dead for positive examples, denoted $\mathit{pdead}(s)$, iff every closed state $s'$ reachable from $s$ fails to accept a positive example:

$$\mathit{pdead}(s) \iff \big( s \to^* s' \land s' \not\to \implies \exists p \in P.\ p \notin [\![s']\!] \big).$$

Example 1. Suppose $b \in P$. Any closed state $s'$ reachable from state $s = a \cdot \Box$ is doomed to reject the positive example; no matter how the hole gets instantiated, the string $b$ cannot be accepted.

3 2016/6/6

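Lemma 4. Let $s$ be any state. Then,

$$\mathit{pdead}(s) \iff \exists p \in P.\ p \notin [\![\widehat{s}]\!].$$

Proof. Consider each direction.

• ($\Longrightarrow$) Suppose $\mathit{pdead}(s)$ holds:

$$s \to^* s' \land s' \not\to \implies \exists p \in P.\ p \notin [\![s']\!]. \tag{5}$$

From (5) and Lemma 6, we obtain $\exists p \in P.\ p \notin [\![\widehat{s}]\!]$.

• ($\Longleftarrow$) Suppose $p \notin [\![\widehat{s}]\!]$. By Lemma 2, we have

$$p \notin \bigcup_{s \to^* s' \,\land\, s' \not\to} [\![s']\!],$$

which implies that $p \notin [\![s']\!]$ for all closed $s'$ reachable from $s$.

Lemma 5. Let $s$ be any state. Then,

$$\mathit{ndead}(s) \iff \exists n \in N.\ n \in [\![\widetilde{s}]\!].$$

Proof. Consider each direction.

• ($\Longrightarrow$) Suppose $\mathit{ndead}(s)$ holds:

$$s \to^* s' \land s' \not\to \implies \exists n \in N.\ n \in [\![s']\!]. \tag{6}$$

From (6) and Lemma 7, we obtain $\exists n \in N.\ n \in [\![\widetilde{s}]\!]$.

• ($\Longleftarrow$) Suppose $n \in [\![\widetilde{s}]\!]$. By Lemma 3, we have

$$n \in \bigcap_{s \to^* s' \,\land\, s' \not\to} [\![s']\!],$$

which implies that $n \in [\![s']\!]$ for all closed $s'$ reachable from $s$.

Lemma 6. For any state $s$, we have $s \to^* \widehat{s}$ and $\widehat{s} \not\to$.

Proof. By structural induction on $s$.

Lemma 7. For any state $s$, we have $s \to^* \widetilde{s}$ and $\widetilde{s} \not\to$.

Final Algorithm. With normalization and pruning, the search algorithm uses the following next function:

$$\mathit{next}(s) = \begin{cases}
\emptyset & \exists p \in P.\ p \notin [\![\widehat{s}]\!] \\
\emptyset & \exists n \in N.\ n \in [\![\widetilde{s}]\!] \\
\{\mathit{normalize}(s') \mid s \to s'\} & \text{otherwise}
\end{cases}$$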


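Efficient state-space search techniques

• Pruning dead states: do not explore parts of the state space that cannot contain a solution no matter how far the search proceeds.

Figure: examples of dead states in the search tree. With $b \in P$, every state below $a \cdot \Box$ rejects $b$: no matter how the hole gets instantiated, the string $b$ cannot be accepted. With $a \in N$, every state below $a \cdot (\Box)^*$ accepts $a$.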

Definition 3. A state $s$ is dead for negative examples, denoted $\mathit{ndead}(s)$, iff every closed state $s'$ reachable from $s$ fails to reject a negative example:

$$\mathit{ndead}(s) \iff \big( s \to^* s' \land s' \not\to \implies \exists n \in N.\ n \in [\![s']\!] \big).$$

Example 2. Suppose $a \in N$. Any closed state $s'$ reachable from state $s = a \cdot (\Box)^*$ is doomed to accept the negative example; no matter how the hole gets instantiated, the language of any reachable state includes the string $a$.

It is clear that a state is guaranteed to be dead if one of $\mathit{pdead}(s)$ and $\mathit{ndead}(s)$ holds:

Lemma 1. Let $s$ be any state. Then,

$$\big( \mathit{pdead}(s) \lor \mathit{ndead}(s) \big) \implies \mathit{dead}(s).$$

Note, however, that the converse of Lemma 1 is not true. Suppose $s$ is a dead state. This means that every reachable state $s'$ either rejects some positive example or accepts some negative example. However, $\mathit{pdead}(s)$ requires that the reachable state $s'$ always rejects some positive example. Similarly, $\mathit{ndead}(s)$ requires the strong condition that every reachable state $s'$ accepts some negative example.

Example 3. When $P = N$, no solution can exist and the initial state $\Box$ is dead. However, neither $\mathit{pdead}(s)$ nor $\mathit{ndead}(s)$ holds, because we can always find a regular expression (e.g., $(a + b)^*$) that accepts all positive examples and we can also always find a regular expression (e.g., $\emptyset$) that rejects all negative examples.

We identify the pdead states and ndead states by computing over- and under-approximations of states.

Definition 4. The over-approximation $\widehat{s}$ and under-approximation $\widetilde{s}$ of state $s$ are defined inductively as follows:

$$\begin{aligned}
\widehat{a} &= a & \widehat{\epsilon} &= \epsilon & \widehat{\emptyset} &= \emptyset \\
\widehat{e_1 + e_2} &= \widehat{e_1} + \widehat{e_2} & \widehat{e_1 \cdot e_2} &= \widehat{e_1} \cdot \widehat{e_2} & & \\
\widehat{e^*} &= (\widehat{e})^* & \widehat{\Box} &= (a + b)^* & &
\end{aligned}$$

$$\begin{aligned}
\widetilde{a} &= a & \widetilde{\epsilon} &= \epsilon & \widetilde{\emptyset} &= \emptyset \\
\widetilde{e_1 + e_2} &= \widetilde{e_1} + \widetilde{e_2} & \widetilde{e_1 \cdot e_2} &= \widetilde{e_1} \cdot \widetilde{e_2} & & \\
\widetilde{e^*} &= (\widetilde{e})^* & \widetilde{\Box} &= \emptyset & &
\end{aligned}$$

Intuitively, the over-approximation $\widehat{s}$ is obtained by replacing all holes in $s$ by $(a + b)^*$, and the under-approximation $\widetilde{s}$ is obtained by replacing the holes by $\emptyset$.

Example 4. Consider a state $s = a + (\Box \cdot \Box)$. Then, $\widehat{s} = a + ((a + b)^* \cdot (a + b)^*)$ and $\widetilde{s} = a + (\emptyset \cdot \emptyset)$.

$\widehat{s}$ is over-approximated in the sense that the language of $\widehat{s}$ contains all the languages of states reachable from $s$ (Lemma 2). Dually, $\widetilde{s}$ is under-approximated because every state $s'$ reachable from $s$ subsumes the language of $\widetilde{s}$ (Lemma 3).

Lemma 2. For any state $s$, we have

$$[\![\widehat{s}]\!] \supseteq \bigcup_{s \to^* s' \,\land\, s' \not\to} [\![s']\!].$$

Proof. Todo

Lemma 3. For any state $s$, we have

$$[\![\widetilde{s}]\!] \subseteq \bigcap_{s \to^* s' \,\land\, s' \not\to} [\![s']\!].$$

Proof. Todo

Given a state $s$, we conclude that $s$ is dead with respect to the positive examples (i.e. $\mathit{pdead}(s)$) if $\widehat{s}$ rejects some positive example:

$$\exists p \in P.\ p \notin [\![\widehat{s}]\!] \tag{3}$$

and we conclude that $s$ is dead with respect to the negative examples (i.e. $\mathit{ndead}(s)$) if $\widetilde{s}$ accepts some negative example:

$$\exists n \in N.\ n \in [\![\widetilde{s}]\!]. \tag{4}$$

Lemmas 4 and 5 show that our algorithm for identifying pdead and ndead states is both sound and complete.
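Checks (3) and (4) can be sketched directly from Definition 4: fill every hole with $(a+b)^*$ or with $\emptyset$, then test the examples against the resulting closed expression. The tuple encoding, the helper names (`fill`, `over_approx`, `under_approx`, `pdead`, `ndead`), and the derivative-based matcher are assumptions made for this sketch and are not taken from the AlphaRegex sources.

```python
HOLE, EPS, EMPTY = ("hole",), ("eps",), ("empty",)


def fill(s, replacement):
    """Replace every hole in state s by the closed expression `replacement`."""
    if s[0] == "hole":
        return replacement
    if s[0] in ("alt", "cat"):
        return (s[0], fill(s[1], replacement), fill(s[2], replacement))
    if s[0] == "star":
        return ("star", fill(s[1], replacement))
    return s  # symbol, eps, empty


def over_approx(s, alphabet="ab"):
    """Over-approximation: holes become (a + b + ...)*, the set of all strings."""
    any_symbol = ("sym", alphabet[0])
    for a in alphabet[1:]:
        any_symbol = ("alt", any_symbol, ("sym", a))
    return fill(s, ("star", any_symbol))


def under_approx(s):
    """Under-approximation: holes become the empty language."""
    return fill(s, EMPTY)


def nullable(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    if tag == "cat":
        return nullable(e[1]) and nullable(e[2])
    return False  # symbol, empty


def deriv(e, c):
    """Brzozowski derivative of the closed expression e with respect to symbol c."""
    tag = e[0]
    if tag == "sym":
        return EPS if e[1] == c else EMPTY
    if tag == "alt":
        return ("alt", deriv(e[1], c), deriv(e[2], c))
    if tag == "cat":
        first = ("cat", deriv(e[1], c), e[2])
        return ("alt", first, deriv(e[2], c)) if nullable(e[1]) else first
    if tag == "star":
        return ("cat", deriv(e[1], c), e)
    return EMPTY  # eps, empty


def matches(e, word):
    for c in word:
        e = deriv(e, c)
    return nullable(e)


def pdead(s, positive, alphabet="ab"):
    """Check (3): the over-approximation rejects some positive example."""
    hat = over_approx(s, alphabet)
    return any(not matches(hat, p) for p in positive)


def ndead(s, negative):
    """Check (4): the under-approximation accepts some negative example."""
    tilde = under_approx(s)
    return any(matches(tilde, n) for n in negative)


if __name__ == "__main__":
    a = ("sym", "a")
    print(pdead(("cat", a, HOLE), ["b"]))            # state of Example 1
    print(ndead(("cat", a, ("star", HOLE)), ["a"]))  # state of Example 2
```

Both checks run on a single closed expression per state, so a dead subtree is discarded without enumerating any of the states below it.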



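• Pruning redundant states: do not explore parts of the state space for which a simpler solution exists elsewhere, even if they may contain a solution.

Figure: pruning examples in the search tree, labeled ($b \in P$) $a \cdot \Box$, ($a \in N$) $a \cdot (\Box)^*$, and ($aab \in P$) $a \cdot (b + \epsilon) \cdot \Box$.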




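Concrete cost function:

$$\begin{aligned}
C(a) = C(\epsilon) = C(\emptyset) &= 1 \\
C(e_1 + e_2) &= C(e_1) + C(e_2) + 5 \\
C(e_1 \cdot e_2) &= C(e_1) + C(e_2) + 5 \\
C(e^*) &= C(e) + 5 \\
C(\Box) &= 10
\end{aligned}$$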
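As a worked instance of these costs (the two states are chosen here purely for illustration), a partial state with a hole is more expensive than the closed state obtained by filling that hole with a symbol:

```latex
\begin{aligned}
C(a \cdot \Box^*) &= C(a) + C(\Box^*) + 5 = 1 + (10 + 5) + 5 = 21, \\
C(a \cdot a^*)    &= C(a) + C(a^*) + 5 = 1 + (1 + 5) + 5 = 12.
\end{aligned}
```

Because a hole costs 10 and every leaf costs 1, instantiating a hole with a symbol, $\epsilon$, or $\emptyset$ lowers the cost by 9, so a closed candidate is examined soon after the partial state it was derived from.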






Grounded in rigorous theory

The pruning techniques we devised are based on programming language theory, which guarantees the soundness of the results.


... actively responds to each input by taking only a few seconds to derive new regular expressions that reflect the change.

Contributions. This paper makes the following contributions:

• We present a new algorithm for synthesizing regular expressions from examples in real time. The main novelty is a set of techniques that effectively prune the large search space using over- and under-approximations of regular expressions.

• We evaluate the proposed technique on 30 benchmark problems. The results show that our method derives regular expressions for all of the benchmarks within a few seconds.

• We implement the technique in a tool, ALPHAREGEX, and make it publicly available at http://prl.korea.ac.kr/AlphaRegex.

2. Regular Expression Problems

2.1 Regular Expressions

Introductory textbooks on automata theory [? ? ?] use the following syntax for regular expressions:

$$e \to a \in \Sigma \mid \epsilon \mid \emptyset \mid e_1 + e_2 \mid e_1 \cdot e_2 \mid e^* \tag{1}$$

A symbol $a$ from an alphabet $\Sigma$, the empty string $\epsilon$, and the empty language $\emptyset$ constitute the primitive regular expressions. The remaining cases are inductively defined. Given regular expressions $e_1$ and $e_2$, we can construct regular expressions by taking the union $e_1 + e_2$ or the concatenation $e_1 \cdot e_2$. $e^*$ denotes the Kleene closure of $e$. In introductory courses, the alphabet is typically assumed to be binary; we assume $\Sigma = \{a, b\}$ in the rest of this paper.

Formally, a regular expression $e$ denotes a language (i.e. a set of strings). We write $[\![e]\!] \subseteq \Sigma^*$ for the language that $e$ denotes, which is inductively defined as follows:

$$\begin{aligned}
[\![a]\!] &= \{a\} & [\![\epsilon]\!] &= \{\epsilon\} & [\![\emptyset]\!] &= \emptyset \\
[\![e_1 + e_2]\!] &= [\![e_1]\!] \cup [\![e_2]\!] & [\![e_1 \cdot e_2]\!] &= [\![e_1]\!][\![e_2]\!] & [\![e^*]\!] &= [\![e]\!]^*
\end{aligned}$$

2.2 Regular Expression Problems

In a regular expression problem, students are given a description of a regular language $L$. We assume that the description of a language is given by a pair $(P, N)$ of example strings, where $P \subseteq \Sigma^*$ is a set of positive examples that must be included in the language and $N \subseteq \Sigma^*$ is a set of negative examples that must be excluded from the language. Given $(P, N)$, the regular expression problem asks students to find a regular expression $e$ that is consistent with the given examples:

$$\forall p \in P.\ p \in [\![e]\!] \;\land\; \forall n \in N.\ n \notin [\![e]\!].$$
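The inductive definition of $[\![e]\!]$ and the consistency condition can be mirrored almost line by line in code, provided the language is cut off at a maximum string length so that it stays finite. In the sketch below the tuple encoding and the names `lang` and `consistent` are assumptions for illustration, and the length bound is a device of the sketch rather than part of the definition.

```python
def lang(e, max_len):
    """Strings of length <= max_len in [[e]], following the inductive definition."""
    tag = e[0]
    if tag == "sym":
        return {e[1]} if max_len >= 1 else set()
    if tag == "eps":
        return {""}
    if tag == "empty":
        return set()
    if tag == "alt":
        return lang(e[1], max_len) | lang(e[2], max_len)
    if tag == "cat":
        return {u + v
                for u in lang(e[1], max_len)
                for v in lang(e[2], max_len)
                if len(u) + len(v) <= max_len}
    if tag == "star":
        base = lang(e[1], max_len) - {""}   # dropping "" keeps the loop finite
        result, frontier = {""}, {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u) + len(v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression: {e!r}")


def consistent(e, positive, negative):
    """True if e accepts every positive example and rejects every negative one."""
    bound = max(map(len, list(positive) + list(negative)), default=0)
    words = lang(e, bound)
    return all(p in words for p in positive) and not any(n in words for n in negative)


if __name__ == "__main__":
    a, b = ("sym", "a"), ("sym", "b")
    ends_in_a = ("cat", ("star", ("alt", a, b)), a)   # (a + b)* . a
    print(sorted(lang(ends_in_a, 2)))
    print(consistent(ends_in_a, ["a", "ba", "aa"], ["", "b", "ab"]))
```

Bounding by the longest example is enough for the consistency check, since no longer string is ever queried.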

3. Our Synthesis Algorithm

3.1 Basic Search Algorithm

Suppose a regular expression problem $(P, N)$ is given. We formulate this problem as a search problem and present an efficient algorithm to find a solution. The search problem is defined by a transition system $(S, \to, I, F)$, where $S$ is the set of states, $(\to) \subseteq S \times S$ is a transition relation, $I \in S$ is an initial state, and $F \subseteq S$ is a set of final, solution states.

• States: A state $s \in S$ is a partial regular expression that possibly has holes ($\Box$). A hole is a placeholder that can be replaced by another regular expression. The set $S$ of states is inductively defined as follows:

$$s \to a \in \Sigma \mid \epsilon \mid \emptyset \mid s_1 + s_2 \mid s_1 \cdot s_2 \mid s^* \mid \Box \tag{2}$$

Note that a state may have multiple holes. For example, $(a + (\Box \cdot \Box))^*$ is a state which has two holes in it.

• Initial State: The initial state is a single hole, i.e., $I = \Box$.

• Transition Relation: The transition relation $(\to) \subseteq S \times S$ determines the next states of a given state. The transition relation $\to$ is inductively defined as a set of inference rules in Figure 2. For example, $(a + \Box)^* \to (a + (\Box \cdot \Box))^*$ because we can find a derivation according to the inference rules as follows:

$$\dfrac{\dfrac{\Box \to \Box \cdot \Box}{(a + \Box) \to (a + (\Box \cdot \Box))}}{(a + \Box)^* \to (a + (\Box \cdot \Box))^*}$$

We write $\mathit{next}(s)$ for the set of all states that follow $s$:

$$\mathit{next}(s) = \{ s' \mid s \to s' \}.$$

For example, when $\Sigma = \{a, b\}$, $\mathit{next}((a + \Box)^*) = \{(a + a)^*, (a + b)^*, (a + \epsilon)^*, (a + \emptyset)^*, (a + (\Box + \Box))^*, (a + (\Box \cdot \Box))^*, (a + (\Box^*))^*, (a + (\Box?))^*\}$. We write $s \not\to$ to indicate that $s$ has no next states; that is, $s$ is a closed expression with no holes.

• Solution States: A state $s$ is a solution state iff $s$ is a closed expression (i.e., $s \not\to$) and $s$ is consistent with the given positive and negative examples:

$$\mathit{solution}(s) \iff s \not\to \;\land\; \forall p \in P.\ p \in [\![s]\!] \;\land\; \forall n \in N.\ n \notin [\![s]\!].$$

Algorithm 1 presents a naive workset algorithm that solves the search problem. Initially, the workset consists of the initial state (line 1). We choose and remove a state $s$ from the workset (line 3). If a solution is found, it is returned. Otherwise, we search for the next states of $s$ by adding them into the workset.

Size of Search Space. The maximum number of holes in a state at depth $d$ is $2^d$. The number of next states for a state with $n$ holes is $c^n$, where $c$ is the number of inductive rules for regular expressions (e.g. $c = 7$).


✓ 840 lines in OCaml

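✓ Benchmarks focus on regular expression problems that students find difficult

✓ Baseline for comparison: the basic algorithm with none of the search techniques applied

✓ Measured the performance and speedup of the algorithm with all search techniques applied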
