none
How to find repeated first sentence in biography?

    질문

  • Hi,

    Table : Person

    Field : Bio

    I've table called person table when we stored of persons including biography. Some times users typed duplicate sentences in biography. In below biography the first sentence repeated. How to find such cases from Database?

    Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. She is the Chairman and Founder of ABC Ltd. She received MBA from Standford University in 2016.

    Thanks in advance.

    2018년 7월 14일 토요일 오전 10:16

모든 응답

  • Hi,

    Table : Person

    Field : Bio

    I've table called person table when we stored of persons including biography. Some times users typed duplicate sentences in biography. In below biography the first sentence repeated. How to find such cases from Database?

    Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. She is the Chairman and Founder of ABC Ltd. She received MBA from Standford University in 2016.

    Thanks in advance.

    Good day KVAKVA,

    If the replication is always ONLY in the first sentence then this is simple case which I will explain bellow, but if you want to find all duplicate sentences then this is probably best to do with intelligence platform.

    In any case executing this task dynamically (directly in query) is HIGHLY NOT RECOMMENDED in most cases, since it will consume very high resources. This requirement indicates that you should probably re-think about your system design! In other world you should not come to a point that you have this exact requirement. One option is to make the check in the application side, another option can be storing the first sentence in separate column and the second one as well (if the duplication is only in the first two sentences), and so on

    For the first case the solution can be simply (simply to solve but consume very high resources) like this (all in one simple query):

    1. find the position of the end of the first sentence (for example use CHARINDEX to find the first dot).
    2. This way you can find the text of the first sentence using the function SUBSTRING.
    3. If the duplication is only in the first 2 sentences then do the same to find the second sentence and in the where condition you can compare the 2 sentences

    * I highly recommend to re-think about the design and other solutions


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]

    2018년 7월 14일 토요일 오후 3:07
    중재자
  • A non-trivial problem where you first need to divide the text into sentences. But identifying a sentence in a text is not trivial, particularly not if users are sloppy and type a dot where they meant a comma etc. And what if users write sentences with repeating information but with different word order, or even different words?

    If you google around, you may find algorithms for this. (I like to point out that these algorithms would be general algorithms, and not tied to SQL Server.)

    A possible strategy is to break up the text in words, where words simply are define a strings of alphanumeric characters and then store the words in a table with two columns word and wordno. You would then attempt to find repeating patterns if a repeating pattern is more than N words long, flag it. And, no, I don't have code to do this. What I can say is that the code that splits the text into words, should be implemented in a CLR procedure so that you can use the RegEx classes. Possibly, also the attempt to find the repeating pattern should also be performed inside the CLR (and in that case you would not store the data in a table.)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    2018년 7월 14일 토요일 오후 9:37
  • I agree with others that you do not use T-SQL for this subject ...I have doubts that below is reliable solution, test it firsr 

    Let see if it helps you

    create table #a (col1 varchar(200))

    insert into #a values ('Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018')
    insert into #a values ('Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. She is the Chairman and Founder of ABC Ltd. She received MBA from Standford University in 2016.')
    insert into #a values ('David Green II')
    insert into #a values ('David Smith')


    Select * from #a


    select a.col1, b.col1
    from #a a inner join #a b
    on
    a.col1 = LEFT(b.col1, LEN(a.col1))
    WHERE SIGN(LEN(a.col1) - LEN(b.col1)) = -1


    Best Regards,Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    2018년 7월 15일 일요일 오전 9:05
    답변자
  • HI,

    create table t1 (c1 varchar(max));
    insert t1 values('Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. Ms. Mary has been Executive Director and Chief Executive Officer at XYZ Limited June 08, 2018. She is the Chairman and Founder of ABC Ltd. She received MBA from Standford University in 2016.');

    select replace(string_agg(c1,'. '),'Ms','Ms.')
    from
    (
    select distinct ltrim(rtrim(value))
    from
    (select replace(replace(c1,'Ms.','Ms'),'Mr.','Mr')
    from t1) t1(c1) cross apply string_split(c1,'.')
    where len(ltrim(rtrim(value))) > 0
    ) t1 (c1)

    2018년 7월 16일 월요일 오전 9:17