Text Mining in Python: From Beginner to Advanced



Text is everywhere. You see it in printed material and books, in newspapers, on Wikipedia, and across social media. People also talk to each other in online forums, discussion groups, and so on.

You can parse text, classify text documents, and try to understand what they say. You can find and extract relevant information from text, or even define what counts as information. Searching for relevant text documents is known as information retrieval.

Text carries all kinds of information: political news, sports news, celebrity news, and more.

In today's world, almost any information can be found in text form.

There is a lot of information in text, but we usually need something specific, and that is where text mining comes in.

With text mining we can analyze text, scrape text, find specific words, count repeated words, and so on.

If you are a beginner, I am sure this blog will help you.

You can try this code in a Jupyter notebook.


1. text1 = "Ethics are built right into the ideals and objectives of the United Nations"
2. len(text1)
output : 75
3. text2 = text1.split(' ')
4. text2
output :['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations']

5. len(text2)
Output : 13

6. [w for w in text2 if len(w)>3]
Output :
['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

7. [w for w in text2 if w.istitle()] # Capitalized words

output : ['Ethics', 'United', 'Nations']

8. [w for w in text2 if w.endswith('s')] # words that end with s 

output : ['Ethics', 'ideals', 'objectives', 'Nations']
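
The filters above can also be combined in one comprehension. Here is a small sketch that reuses the same sentence; the variable names `long_title_words` and `lower_s_words` are made up for this illustration.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations"
words = text1.split(' ')

# Capitalized words that are also longer than 6 characters
long_title_words = [w for w in words if w.istitle() and len(w) > 6]

# Words that end with 's' but are not capitalized
lower_s_words = [w for w in words if w.endswith('s') and not w.istitle()]

print(long_title_words)  # ['Nations']
print(lower_s_words)     # ['ideals', 'objectives']
```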

9. text3 = 'To be or not to be' # finding unique words
   text4  =  text3.split(' ')
   text4

output : ['To', 'be', 'or', 'not', 'to', 'be']

10. len(text4)

output : 6

11. len(set(text4))

output : 5

12. set(text4)

output : {'To', 'be', 'not', 'or', 'to'}

13. len(set([w.lower() for w in text4]))

output : 4

14. set([w.lower() for w in text4 ])

output : {'be', 'not', 'or', 'to'}
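
A set tells us which words are unique, but not how often each one repeats. One possible way to count repetitions is `collections.Counter` from the standard library. This is only a sketch, and the names `counts` and `repeated` are illustrative.

```python
from collections import Counter

text3 = 'To be or not to be'

# Lowercase first so that 'To' and 'to' are counted as the same word
counts = Counter(w.lower() for w in text3.split(' '))

# Words that appear more than once, sorted alphabetically
repeated = sorted(w for w, n in counts.items() if n > 1)

print(counts['to'], counts['be'], counts['or'])  # 2 2 1
print(repeated)                                  # ['be', 'to']
```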

15. # some word comparison functions
    # s.startswith(t)
    # s.endswith(t)
    # t in s
    # s.isupper(); s.islower(); s.istitle()
    # s.isalpha(); s.isdigit(); s.isalnum()
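
To see what these comparison functions return, here is a short sketch on a few sample strings. The strings themselves are just examples picked for this illustration.

```python
word = 'Nations'

print(word.startswith('Nat'))   # True
print(word.endswith('s'))       # True
print('tion' in word)           # True
print(word.istitle())           # True
print(word.isupper())           # False

print('UN'.isupper())           # True
print('2030'.isdigit())         # True
print('Agenda2030'.isalnum())   # True
print('Agenda 2030'.isalpha())  # False, because of the space and digits
```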

16. # string operations
    # s.lower()
    # s.upper()
    # s.title()
    # s.split(t)
    # s.splitlines()
    # s.join(t)
    # s.strip()
    # s.rstrip()
    # s.find(t); s.rfind(t)
    # s.replace(u,v)
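
A quick sketch of these string operations on one sample string follows. Note that title-casing in Python is done with `str.title()`. The sample string and variable names are assumptions made for this example.

```python
s = '  United Nations \n'

clean = s.strip()                        # remove whitespace on both sides
print(clean)                             # United Nations
print(clean.lower())                     # united nations
print(clean.upper())                     # UNITED NATIONS
print('united nations'.title())          # United Nations

parts = clean.split(' ')                 # ['United', 'Nations']
joined = '-'.join(parts)                 # 'United-Nations'
print(parts, joined)

print(repr(s.rstrip()))                  # '  United Nations'
print(clean.find('N'), clean.rfind('n')) # 7 12

lines = 'line one\nline two'.splitlines()
print(lines)                             # ['line one', 'line two']
print(clean.replace('United', 'UNITED')) # UNITED Nations
```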

17. text5 = 'ouagadougou'
    text6= text5.split('ou')
    text6

output : ['', 'agad', 'g', '']

18. 'ou'.join(text6)

output : 'ouagadougou'

19. list(text5)

output : ['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

20. [c for c in text5]

output : ['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

21. text8 = '   A quick brown fox jumped over the lazy dog.  '
    text8.split(' ')

output : ['',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '',
 '']

22. text9 = text8.strip()
    text9

output : 'A quick brown fox jumped over the lazy dog.'

23. text9.split(' ')

output : ['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
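
As an alternative to stripping first, calling `split()` with no argument splits on any run of whitespace and drops the empty strings at the edges. A minimal sketch on the same sentence:

```python
text8 = '   A quick brown fox jumped over the lazy dog.  '

# No separator argument: split on runs of whitespace, ignore leading/trailing spaces
tokens = text8.split()

print(tokens)
# ['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
```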

24. text9  # changing text: find and replace

output : 'A quick brown fox jumped over the lazy dog.'

25. text9.find('o')

output : 10

26. text9.rfind('o')

output : 40

27. text9.replace('o', 'O')

output : 'A quick brOwn fOx jumped Over the lazy dOg.'
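
`find` and `rfind` only report the first and last position. If every position is needed, one option is to call `find` repeatedly with a start index. This is a sketch, and the loop and the `positions` name are illustrative.

```python
text9 = 'A quick brown fox jumped over the lazy dog.'

# Collect the index of every 'o' by restarting the search after each hit
positions = []
start = text9.find('o')
while start != -1:          # find() returns -1 when nothing more is found
    positions.append(start)
    start = text9.find('o', start + 1)

print(positions)                    # [10, 15, 25, 40]
print(text9.find('Z'))              # -1, the search is case-sensitive
print(text9.replace('o', 'O', 1))   # replace only the first occurrence
```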

28. f = open('file location', 'r')  # handling large texts

29. f.readline()  # read a single line

30. f.seek(0)  # move back to the start of the file

31. text12 = f.read()  # read the full file

32. text12

33. len(text12)

34. text13 = text12.splitlines()

35. len(text13)

36. text13[0]

37. # file operations
    # f = open(filename, mode)
    # f.readline(); f.read(); f.read(n)
    # for line in f: doSomething(line)
    # f.seek(n)
    # f.write(message)
    # f.close()
    # f.closed
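
The file operations above can be tried without an existing file. The sketch below first writes a small throwaway file (the file name `sample.txt` and its contents are made up for this example) and then reads it back inside a `with` block, which closes the file automatically.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\nthird line\n')

# The with-statement closes the file when the block ends
with open(path, 'r', encoding='utf-8') as f:
    first = f.readline()                       # one line, newline included
    f.seek(0)                                  # back to the start of the file
    lines = [line.rstrip('\n') for line in f]  # iterate line by line

print(repr(first))   # 'first line\n'
print(lines)         # ['first line', 'second line', 'third line']
print(f.closed)      # True
```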

38. f = open('file location', 'r')

39. text14 = f.readline()

40. text14 # issues with reading text files

41. text14.rstrip() # how do you remove the last newline character ?
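
Keep in mind that `rstrip()` with no argument removes all trailing whitespace, not only the newline. If only the newline should go, one option is to pass it explicitly. A small sketch on a made-up line:

```python
line = '  indented text \t\n'

stripped_all = line.rstrip()          # removes the space, tab and newline at the end
stripped_newline = line.rstrip('\n')  # removes only the trailing newline

print(repr(stripped_all))      # '  indented text'
print(repr(stripped_newline))  # '  indented text \t'
```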

42. # Processing free-text
text15 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

43. text16 = text15.split(' ')
    text16

output : ['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

44. [w for w in text16 if w.startswith('#')] # finding hashtags

output : ['#UNSG']

45. [w for w in text16 if w.startswith('@')]

output : ['@']
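
A lone '@' is not a real callout, and tokens such as '"Ethics' still carry the quotation mark. Before reaching for regular expressions, plain string methods can tidy this up. The sketch below is one possible approach; the names `cleaned`, `mentions` and `hashtags` are illustrative.

```python
text15 = ('"Ethics are built right into the ideals and objectives of the United Nations" '
          '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')
tokens = text15.split(' ')

# Remove surrounding double quotes from each token
cleaned = [w.strip('"') for w in tokens]

# Require at least one character after the marker
mentions = [w for w in cleaned if w.startswith('@') and len(w) > 1]
hashtags = [w for w in cleaned if w.startswith('#') and len(w) > 1]

print(cleaned[0], cleaned[12])  # Ethics Nations
print(mentions)                 # []
print(hashtags)                 # ['#UNSG']
```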

46. text17 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
    text18 = text17.split(' ')
    text18

output : ['@UN',
 '@UN_Women',
 '"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

47. # We can use regular expressions to help us with more complex parsing.

# For example, '@[A-Za-z0-9_]+' will return all words that:

# contain '@' followed by at least one:
# capital letter ('A-Z')
# lowercase letter ('a-z')
# number ('0-9')
# or underscore ('_')

48. import re  # a module that provides support for regular expressions
    [w for w in text18 if re.search('@[A-Za-z0-9_]+', w)]

output : ['@UN', '@UN_Women']
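
Splitting into words first is not strictly required. As a sketch, `re.findall` can pull the callouts straight out of the full string. The hashtag pattern is an extra assumption added for illustration.

```python
import re

text17 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives of the '
          'United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# '@' followed by one or more letters, digits or underscores
mentions = re.findall(r'@[A-Za-z0-9_]+', text17)

# Same idea for hashtags
hashtags = re.findall(r'#[A-Za-z0-9_]+', text17)

print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```

The lone '@' before 'NY' is skipped because the pattern needs at least one character after it.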



#######  Meta-characters: Character matches  ##########


    # . "wildcard, matches a single character"
    # ^ "start of a string"
    # $ "end of a string"
    # [] "matches one of the set of characters within []"
    # [a-z] "matches one of the range of characters a, b, c, ..., z"
    # [^abc] "matches a character that is not a, b, or c"
    # a|b "matches either a or b, where a and b are strings"
    # () "scoping for operators"
    # \ "escape character for special characters (\t, \n, \b)"
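
Here is a short sketch of these meta-characters in action. The word list is made up just for this example.

```python
import re

words = ['cat', 'cut', 'coat', 'scat', 'dog']

# ^ and $ anchor the pattern; . matches any single character
three_letter_c_t = [w for w in words if re.search(r'^c.t$', w)]

# [au] matches exactly one of 'a' or 'u', anywhere in the word
contains_cat_or_cut = [w for w in words if re.search(r'c[au]t', w)]

# [^c] matches any character except 'c'
not_starting_with_c = [w for w in words if re.search(r'^[^c]', w)]

# | chooses between alternatives, () limits its scope
cat_or_dog = [w for w in words if re.search(r'^(cat|dog)$', w)]

# \ escapes the dot so it matches a literal '.'
dots = re.findall(r'\.', 'bit.ly/2guVelr')

print(three_letter_c_t)     # ['cat', 'cut']
print(contains_cat_or_cut)  # ['cat', 'cut', 'scat']
print(not_starting_with_c)  # ['scat', 'dog']
print(cat_or_dog)           # ['cat', 'dog']
print(dots)                 # ['.']
```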


#######  Meta-characters: Character symbols  ##########


    # \b "matches word boundary"
    # \d "any digit, equivalent to [0-9]"
    # \D "any non-digit, equivalent to [^0-9]"
    # \s "any whitespace, equivalent to [ \t\n\r\f\v]"
    # \S "any non-whitespace, equivalent to [^ \t\n\r\f\v]"
    # \w "alphanumeric character, equivalent to [a-zA-Z0-9_]"
    # \W "non-alphanumeric character, equivalent to [^a-zA-Z0-9_]"
    # (these equivalences hold for ASCII text; in Python 3, \d, \s and \w also match Unicode digits, spaces and letters by default)
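
The character symbols can be tried out the same way. The sample strings below are assumptions chosen for illustration only.

```python
import re

s = 'Call 555-0123 before 9 am on 2024-05-17'

digits = re.findall(r'\d+', s)                  # runs of digits
non_digits = re.findall(r'\D+', '12ab34')       # runs of non-digits
words = re.findall(r'\w+', 'UN_Women, NY!')     # runs of letters, digits, underscore
parts = re.split(r'\s+', 'a  b\tc\nd')          # split on any whitespace run
the_hits = re.findall(r'\bthe\b', 'the other theme, the end')  # whole word 'the' only

print(digits)      # ['555', '0123', '9', '2024', '05', '17']
print(non_digits)  # ['ab']
print(words)       # ['UN_Women', 'NY']
print(parts)       # ['a', 'b', 'c', 'd']
print(the_hits)    # ['the', 'the']
```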


#######  Meta-characters: Repetitions  ##########


    # * "matches zero or more occurrences"
    # + "matches one or more occurrences"
    # ? "matches zero or one occurrence"
    # {n} "exactly n repetitions"
    # {n,} "at least n repetitions"
    # {,n} "at most n repetitions"
    # {m,n} "at least m and at most n repetitions"

    import re  # needed if not already imported above
    [w for w in text18 if re.search(r'@\w+', w)]  # same search, written with \w
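
The repetition operators are easiest to compare side by side. This is a sketch on made-up strings; the variable names are illustrative.

```python
import re

s = 'a aa aaa aaaa'

exactly_two = re.findall(r'\ba{2}\b', s)      # words made of exactly two a's
at_least_two = re.findall(r'\ba{2,}\b', s)    # two or more a's
two_or_three = re.findall(r'\ba{2,3}\b', s)   # between two and three a's

t = 'ggle gogle google'
zero_or_more = re.findall(r'go*gle', t)       # 'o' may be missing
one_or_more = re.findall(r'go+gle', t)        # at least one 'o'

optional_u = re.findall(r'colou?r', 'color colour colouur')  # 'u' zero or one time

print(exactly_two)   # ['aa']
print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(two_or_three)  # ['aa', 'aaa']
print(zero_or_more)  # ['ggle', 'gogle', 'google']
print(one_or_more)   # ['gogle', 'google']
print(optional_u)    # ['color', 'colour']
```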

49. # finding specific characters
import re
text20 = 'ousnduiuela'
re.findall(r'[aeiou]',text20)

output : ['o', 'u', 'u', 'i', 'u', 'e', 'a']

50. re.findall(r'[^aeiou]',text20)

output : ['s', 'n', 'd', 'l']
